

Engineering Machine Learning Data Pipeline Series: Data Matching and Deduplication at Scale

Trusted data pipelines require big data entity resolution – performing data matching and deduplication at scale

Making the jump from designing a machine learning model to putting it into production is the key to getting value back from your big data investment. However, it's also often the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust data pipelines poses its own set of challenges to tackle. Luckily, Syncsort helps the data engineer every step of the way.

When you pull in data from different sources across the enterprise, chances are that you have information about the same person, company, product, or other entity in multiple records. To get a full view of the data regarding that entity, you must find all the records that relate to it and combine them.

The above may sound simple, but names are frequently misspelled or simply entered incorrectly. Is Bob Smith the same as Robert Smith? How about Dr. Robert L. Smith, is he the same person? Is Syncsort, Inc. the same company as Sinksort Corp.? You may have to compare each individual record to every other record in the dataset with some very sophisticated matching algorithms to determine who is who, and you may have to compare the data multiple times in multiple ways to resolve each entity.
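To make the matching problem concrete, here is a minimal Python sketch of one simple approach: normalize each name, then score the pair with a string similarity ratio. It is an illustration only, not Syncsort's matching logic. The `NICKNAMES` and `NOISE` tables and the `normalize` and `similarity` helpers are hypothetical names invented for this example, and production matchers rely on far richer rules and reference data.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup tables for illustration; real systems use much larger ones.
NICKNAMES = {"bob": "robert", "rob": "robert", "bill": "william"}
NOISE = {"dr", "mr", "mrs", "ms", "inc", "corp", "co", "ltd", "llc"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation, drop titles, company suffixes and
    single-letter initials, and expand known nicknames."""
    tokens = re.findall(r"[a-z]+", name.lower())
    kept = [NICKNAMES.get(t, t) for t in tokens if t not in NOISE and len(t) > 1]
    return " ".join(kept)


def similarity(a: str, b: str) -> float:
    """Return a score between 0.0 and 1.0 for two names after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


if __name__ == "__main__":
    pairs = [
        ("Bob Smith", "Robert Smith"),
        ("Bob Smith", "Dr. Robert L. Smith"),
        ("Syncsort, Inc.", "Sinksort Corp."),
    ]
    for left, right in pairs:
        print(f"{left!r} vs {right!r}: {similarity(left, right):.2f}")
```

Even this toy version shows why a single comparison is rarely enough: the nickname table resolves Bob and Robert, but the Syncsort/Sinksort misspelling only yields a partial score, so a real pipeline would need a threshold or a second pass with a different technique (phonetic matching, for instance) to decide.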

To add to the difficulty, suppose you have very large volumes of records in your data lake. You are no longer comparing a thousand records to a thousand other records multiple times; you may have to compare a million to a million, or 100 million to 100 million. This kind of compute-intensive comparison can bring even a powerful cluster to its knees. Coding this from scratch is a nearly impossible task, and some teams have resorted to starting a whole new machine learning project just to get the entities resolved for their other machine learning projects.
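A quick back-of-the-envelope sketch shows why the comparison count explodes. Comparing every record with every other record in one dataset of n records takes n(n - 1)/2 comparisons. One widely used mitigation, not necessarily the approach covered in the webcast, is blocking: only records that share a cheap key are compared with each other. The `pair_count` and `blocked_pairs` helpers and the surname-initial key below are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered record pairs in a dataset of n records."""
    return n * (n - 1) // 2


def blocked_pairs(records, key):
    """Yield candidate pairs only among records that share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    for group in blocks.values():
        yield from combinations(group, 2)


if __name__ == "__main__":
    for n in (1_000, 1_000_000, 100_000_000):
        print(f"{n:>11,} records -> {pair_count(n):,} pairwise comparisons")

    records = ["Smith, Bob", "Smith, Robert", "Smyth, Rob", "Jones, Al"]
    # Hypothetical blocking key: first letter of the surname.
    candidates = list(blocked_pairs(records, key=lambda r: r[0].lower()))
    print(f"{pair_count(len(records))} pairs unblocked, {len(candidates)} after blocking")
```

The trade-off is that blocking can miss true matches whose keys differ. Under a first-letter key, for example, a surname mistyped at its first character would never be compared with the correct spelling, which is one reason the data often has to be compared multiple times in multiple ways.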

Don’t go down the recursive machine learning rabbit hole. View this 15-minute webcast on-demand to learn how you can tackle these challenges successfully in large scale production implementations.