Engineering Machine Learning Data Pipeline Series: Finding and Matching Duplicates at Scale

Making the jump from designing a machine learning model to putting it into production is the key to getting value back – and the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust data pipelines has its own set of tough problems to solve. Luckily, Syncsort helps the data engineer every step of the way.

When you pull in data from different sources across the enterprise, chances are that you have information about the same person, company, product, or other entity in multiple records. To get a full view of the data regarding that entity, you must find all the related records and combine them.

This may sound simple, but names are frequently misspelled or entered incorrectly. Is Bob Smith the same as Robert Smith? How about Dr. Robert L. Smith, is he the same person? Is Syncsort, Inc. the same company as Sinksort Corp.? You may have to compare each individual record to every other record in the dataset with some very sophisticated matching algorithms to determine who is who, and you may have to compare the data multiple times in multiple ways to resolve each entity.
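
To make the matching problem concrete, here is a minimal Python sketch of fuzzy name comparison using only the standard library. It is an illustration under stated assumptions, not how Syncsort's products work: the nickname table, the list of titles and suffixes to ignore, and the 0.7 threshold are all hypothetical, and real entity resolution typically combines several such comparisons across multiple fields.

```python
from difflib import SequenceMatcher

# Hypothetical lookup tables, for illustration only. A production system
# would need far larger, curated dictionaries.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}
NOISE = {"dr", "mr", "mrs", "ms", "inc", "corp", "co", "ltd", "llc"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, drop titles/suffixes/initials, expand nicknames."""
    tokens = []
    for raw in name.lower().replace(",", " ").replace(".", " ").split():
        if raw in NOISE or len(raw) == 1:
            continue  # skip "Dr", "Inc", and single-letter initials such as "L"
        tokens.append(NICKNAMES.get(raw, raw))
    return " ".join(tokens)


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 and 1.0 for two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_probable_match(a: str, b: str, threshold: float = 0.7) -> bool:
    """Treat two names as the same entity if they score at or above the threshold."""
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    pairs = [
        ("Bob Smith", "Robert Smith"),
        ("Bob Smith", "Dr. Robert L. Smith"),
        ("Syncsort, Inc.", "Sinksort Corp."),
        ("Alice Jones", "Robert Smith"),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

Even this toy version shows why a single pass is rarely enough: the nickname rule resolves Bob and Robert, but the Syncsort and Sinksort case depends entirely on where the threshold is set, and a threshold loose enough to catch misspellings will also produce false matches on other data.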

To add to the difficulty, let’s say you have very large volumes of records in your data lake. You are no longer comparing a thousand records to a thousand other records multiple times; you may have to compare a million to a million, or 100 million to 100 million. This kind of compute-intensive comparison can bring even a powerful cluster to its knees. Trying to code this from scratch is a nearly impossible task, and some have resorted to starting a whole new machine learning project just to get the entities resolved for their other machine learning projects.
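
The arithmetic behind that scaling problem is easy to sketch. The Python snippet below counts the unique pairs an all-against-all comparison needs within a single dataset, then shows blocking, a common general technique for cutting that number by only comparing records that share a cheap key. Blocking is brought in here purely as an illustration and is not described in this article as Syncsort's approach; the key used (the first letter of the last word in the name) is a hypothetical choice.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of unique record pairs when every record is compared to every other."""
    return n * (n - 1) // 2


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: first letter of the last word in the name."""
    tokens = name.lower().split()
    return tokens[-1][:1] if tokens else ""


def candidate_pairs(records):
    """Yield only the pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


if __name__ == "__main__":
    for n in (1_000, 1_000_000, 100_000_000):
        print(f"{n:>11,} records -> {all_pairs_count(n):,} pairs")

    records = ["Bob Smith", "Robert Smith", "Alice Jones", "Al Jones", "Carol White"]
    pairs = list(candidate_pairs(records))
    print(f"{all_pairs_count(len(records))} pairs without blocking, {len(pairs)} with")
```

The trade-off is that a blocking key can hide true matches: a misspelled last name that changes the first letter would land in a different block and never be compared, which is one reason data often has to be compared multiple times in multiple ways.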

Don’t go down the recursive ML rabbit hole. View this 15-minute on-demand webcast to learn how you can tackle these challenges successfully in large-scale production implementations.