One Way to Join Data Sets With No Common ID Number

Garoux LLC
Aug 21, 2022


It is common to find data sets that ought to be linked but share no identifier that directly connects them. As an example, consider geolocation data stored as latitude/longitude pairs, or as addresses. In two different data sets, latitudes and longitudes may differ slightly depending on how they were obtained (e.g. using Google Maps vs. OpenStreetMap). Addresses may vary in abbreviations and format. Even so, there should still be enough information to match the data sets, at least up to a small number of reasonable errors.

One way to approach the problem is graph matching. Let’s call our data sets A and B. We can think of the two data sets together as a graph, where each row is a node and there is an edge connecting each node from A with each node from B (but no edges within A or within B). This is known as a bipartite graph, because the vertices split into two sets (A and B) and every edge connects a node from A with a node from B. We can also assign a weight to each edge based on how similar the two rows are. This could be the distance between latitude/longitude pairs, or the edit distance between two addresses.

An example of a bipartite graph. Each red node represents a row in data set A. Each blue node represents a row in data set B.

Once we have assigned a weight to each edge, the best way to join the data sets based on those weights is called a minimum-weight matching. A minimum-weight matching pairs each row in A with a distinct row in B so that the sum of the weights of the chosen edges is minimized. In the case of latitude and longitude, this matching ensures that the total geographic distance between matched rows is as small as possible.

Fortunately, scipy has a handy function, scipy.optimize.linear_sum_assignment, which calculates the minimum-weight matching for a bipartite graph. You provide it with the matrix of weights and it returns a matching between the rows of tables A and B. Here’s a quick example using randomly generated lat/long values. All code for this example is available on GitHub.

First we generate some random lat/long values. Then we create a second table which matches the first, except that a small amount of random noise is added to the values. So now there’s no direct way to match based on lat/long, but the values are close enough that we should be able to get a good approximate match.
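The full code is on GitHub; here is a minimal sketch of this step with illustrative choices of point count, coordinate range, and noise level (names such as n_points, noise_scale, and permutation are my own, not necessarily those in the repository).

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 1000  # illustrative size

# table_A: random latitude/longitude pairs.
table_A = np.column_stack([
    rng.uniform(-90, 90, n_points),    # latitude
    rng.uniform(-180, 180, n_points),  # longitude
])

# table_B: the same points with a little Gaussian noise added, then shuffled,
# so there is no longer any direct way to join the two tables on lat/long.
noise_scale = 0.001  # degrees; illustrative value
permutation = rng.permutation(n_points)
table_B = (table_A + rng.normal(scale=noise_scale, size=table_A.shape))[permutation]
```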

We now want to compute the weight matrix, which in this case holds the distances between the lat/long pairs in table_A and table_B. For real applications it’s important to note that one degree of longitude does not span the same physical distance as one degree of latitude (it shrinks toward the poles), so the coordinates need to be scaled appropriately. We use the cdist function from scipy.spatial.distance to calculate the matrix of distances to use as the weights.
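Continuing the sketch (and assuming the table_A and table_B arrays from the previous step), one simple way to handle the scaling is to multiply longitude by the cosine of the latitude before computing Euclidean distances; a real application might prefer a proper geodesic distance. The scale_coords helper is my own illustrative name.

```python
import numpy as np
from scipy.spatial.distance import cdist

def scale_coords(table):
    """Scale longitude by cos(latitude) so a unit of distance is roughly
    comparable in the north-south and east-west directions."""
    lat, lon = table[:, 0], table[:, 1]
    return np.column_stack([lat, lon * np.cos(np.radians(lat))])

# weights[i, j] is the approximate distance between row i of table_A
# and row j of table_B.
weights = cdist(scale_coords(table_A), scale_coords(table_B))
```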

Now we can use the linear_sum_assignment function to match the rows in table_A and table_B.
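A sketch of the matching step, using the weights matrix built above:

```python
from scipy.optimize import linear_sum_assignment

# row_ind[k] in table_A is matched with col_ind[k] in table_B,
# minimizing the total distance over all matched pairs.
row_ind, col_ind = linear_sum_assignment(weights)
```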

And finally we can check how many rows were actually matched correctly.
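Because table_B was built here by shuffling table_A with a known permutation, the true match is the inverse of that permutation, so the check might look like this (permutation comes from the first snippet; this mirrors the idea rather than the exact code in the repository):

```python
import numpy as np

# The true partner of row i in table_A is the row j of table_B with
# permutation[j] == i, i.e. the inverse permutation.
true_match = np.argsort(permutation)

accuracy = np.mean(col_ind == true_match[row_ind])
print(f"Fraction of rows matched correctly: {accuracy:.3f}")
```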

The number will vary depending on how many points you use and the amount of noise added. For the settings used here the match tends to be over 99% accurate.

Written by Garoux LLC

Garoux LLC provides consulting in data science, machine learning, and AI. Check out https://garoux.com to learn more.
