When you need to do ‘fuzzy’ string matching between two sets of strings with different formatting/spelling conventions, I recommend using TF-IDF vectorization.
While competing in the Enigma Datathon at NYU, my team and I were analyzing a dataset of Metrocard usage at New York City subway stations. Our goal was to obtain the precise location of each station, and ultimately do some spatial analysis of the relationship between ridership and other spatial data, like traffic conditions around the city. We found a database of station names and coordinates, but there was no ‘station ID’ field that could be used to match stations from the two datasets. Our only option was to use the station names, and here’s what the names from each dataset looked like:
First Dataset (Metrocard Usage) | Second Dataset (Station Coordinates) |
---|---|
8TH AVENUE-14TH STREET | 14 St |
LAFAYETTE AVE FULTON ST | Lafayette Av |
NOSTRAND AVE-FULTON ST | Fulton St |
81ST STREET-MUSEUM | 81 St Museum of Natural History |
103RD ST-CENTRAL PARK WEST | 103 St |
VAN SICLEN AVE-PITKIN AVE | Van Siclen Av |
86TH ST-CENTRAL PARK WEST | 86 St |
155TH STREET-ST NICHOLAS AV | 155 St |
34TH STREET & 8TH AVENUE | 34 St Penn Station |
JAY ST-METROTECH | Jay St MetroTech |
The differences in formatting and spelling (‘Av’ vs ‘AVE’ vs ‘AVENUE’, etc.) make exact string matching impossible, but an even bigger problem is multiplicity. Of the hundreds of stations in New York, several are located at 34th street in some borough, and it takes some intelligence to figure out that ‘34TH STREET & 8TH AVENUE’ corresponds to ‘34 St Penn Station’. Luckily, both datasets have access to the trains that stop at each station, so we appended them to the station names, adding ‘_TRAIN’ to each one so that a matching algorithm wouldn’t mistake a train number for a street number:
First Dataset (Metrocard Usage) | Second Dataset (Station Coordinates) |
---|---|
8 TH Av 14 TH STREET A_TRAIN C_TRAIN E_TRAIN L_TRAIN | 14 St A_TRAIN C_TRAIN E_TRAIN |
LAFAYETTE AVE FULTON ST C_TRAIN | Lafayette Av C_TRAIN |
NOSTRAND AVE FULTON ST A_TRAIN C_TRAIN | Fulton St A_TRAIN C_TRAIN |
81 ST STREET MUSEUM B_TRAIN C_TRAIN | 81 St Museum of Natural History B_TRAIN C_TRAIN |
103 RD ST CENTRAL PARK WEST B_TRAIN C_TRAIN | 103 St B_TRAIN C_TRAIN |
VAN SICLEN AVE PITKIN AVE C_TRAIN | Van Siclen Av C_TRAIN |
86 TH ST CENTRAL PARK WEST B_TRAIN C_TRAIN | 86 St B_TRAIN C_TRAIN |
155 TH STREET ST NICHOLAS AV C_TRAIN | 155 St C_TRAIN |
34 TH STREET & 8 TH Av A_TRAIN C_TRAIN E_TRAIN | 34 St Penn Station A_TRAIN C_TRAIN E_TRAIN |
JAY ST METROTECH A_TRAIN C_TRAIN F_TRAIN | Jay St MetroTech A_TRAIN C_TRAIN F_TRAIN |
This improves things, because each dataset will contain only one station that is both on 34th street and serves the C train, but also leads to other problems. We tried using a few ad-hoc methods, as well as SeatGeek’s fuzzywuzzy package, that tokenized the strings and searched for matches based on the number of matching tokens. But if you look at the name ‘LAFAYETTE AVE FULTON ST C_TRAIN’, it has more tokens in common with ‘Fulton St A_TRAIN C_TRAIN’ than it does with its correct match, ‘Lafayette Av C_TRAIN’ (3 to 2).
What we needed was a method that could incorporate all of the information, using train lines when necessary but ignoring them when more relevant information was available. In other words, we needed to weight the tokens according to their importance, and the best way to do this was TF-IDF vectorization. TF-IDF analyzes the corpus of words as a whole, and weights each token as more important to the string if it is less common in the corpus. This means that if two strings have a relatively rare term (‘Lafayette’) in common, this outweighs the importance of two more common terms (‘Fulton’ and ‘A_TRAIN’). Below, the same strings but with the important (rare) terms highlighted:
First Dataset (Metrocard Usage) | Second Dataset (Station Coordinates) |
---|---|
8 TH Av 14 TH STREET A_TRAIN C_TRAIN E_TRAIN L_TRAIN | 14 St A_TRAIN C_TRAIN E_TRAIN |
LAFAYETTE AVE FULTON ST C_TRAIN | Lafayette Av C_TRAIN |
NOSTRAND AVE FULTON ST A_TRAIN C_TRAIN | Fulton St A_TRAIN C_TRAIN |
81 ST STREET MUSEUM B_TRAIN C_TRAIN | 81 St Museum of Natural History B_TRAIN C_TRAIN |
103 RD ST CENTRAL PARK WEST B_TRAIN C_TRAIN | 103 St B_TRAIN C_TRAIN |
VAN SICLEN AVE PITKIN AVE C_TRAIN | Van Siclen Av C_TRAIN |
86 TH ST CENTRAL PARK WEST B_TRAIN C_TRAIN | 86 St B_TRAIN C_TRAIN |
155 TH STREET ST NICHOLAS AV C_TRAIN | 155 St C_TRAIN |
34 TH STREET & 8 TH Av A_TRAIN C_TRAIN E_TRAIN | 34 St Penn Station A_TRAIN C_TRAIN E_TRAIN |
JAY ST METROTECH A_TRAIN C_TRAIN F_TRAIN | Jay St MetroTech A_TRAIN C_TRAIN F_TRAIN |
To sum up, I had never heard of TF-IDF being used in this way, but it ended up working way better than anything else we tried. If you are ever in a similar situation, think about using it!