Engineering Machine Learning Data Pipelines Series: Streaming New Data as It Changes

How to set up a continuous data stream – to automatically push data as it changes through the same transformation and cleansing flow – into your machine learning model

Moving a machine learning model into production is the key to driving value from big data, but also the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering a robust production data pipeline poses its own set of tough problems to solve. Syncsort’s solutions help the data engineer every step of the way.

Building on the process of finding and matching duplicates to resolve entities, the next step is to set up a continuous streaming flow of data from data sources so that as the sources change, new data automatically gets pushed through the same transformation and cleansing data flow – into the arms of machine learning models.
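One way to picture "the same transformation and cleansing data flow" is a single cleansing function shared by the batch path and the streaming path. The following is a minimal sketch under that assumption, not a description of any particular product; the names cleanse, run_batch, and run_stream and the record fields are hypothetical.

```python
def cleanse(record):
    """Hypothetical cleansing step: trim whitespace and normalize name casing."""
    return {
        "id": record["id"],
        "name": record["name"].strip().title(),
    }


def run_batch(records):
    # Historical data: cleanse the whole list at once.
    return [cleanse(r) for r in records]


def run_stream(changes):
    # New data: cleanse each change as it arrives, using the same function.
    for change in changes:
        yield cleanse(change)


historical = [{"id": 1, "name": "  ada LOVELACE "}]
incoming = iter([{"id": 2, "name": "grace hopper  "}])

batch_rows = run_batch(historical)
stream_rows = list(run_stream(incoming))
```

Because both paths call the same function, records that arrive later are cleansed exactly as the data the model was originally built on.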

Some of your sources may already be streaming, but the rest are sitting in transactional databases that change hundreds or thousands of times a day. The challenge is that you can’t degrade the performance of data sources that run key applications, so putting something like database triggers in place is not the best idea. Using Apache Kafka or similar technologies as the backbone for moving data around doesn’t solve the problem of capturing changes at the source, pushing them into Kafka, and consuming the data from Kafka for processing. And if something unexpected happens, such as connectivity being lost on either the source or the target side, you don’t want to have to fix things by hand or start over because the data is out of sync.
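The out-of-sync concern can be illustrated with a checkpoint that advances only after a change has been delivered, so a lost connection means resuming rather than starting over. This is a rough sketch using in-memory stand-ins: ChangeLog, FlakyTopic, and replicate are hypothetical names, not a real database or Kafka API, and a production setup would keep the checkpoint in durable storage.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeLog:
    """Hypothetical stand-in for an append-only database change log."""
    entries: list = field(default_factory=list)

    def append(self, change):
        self.entries.append(change)

    def read_from(self, position):
        # Yield (position, change) pairs starting at the given position.
        for i in range(position, len(self.entries)):
            yield i, self.entries[i]


class FlakyTopic:
    """Hypothetical stand-in for a target topic whose connection drops once."""

    def __init__(self, fail_on_send):
        self.messages = []
        self.sends = 0
        self.fail_on_send = fail_on_send

    def send(self, message):
        self.sends += 1
        if self.sends == self.fail_on_send:
            raise ConnectionError("target unreachable")
        self.messages.append(message)


def replicate(log, topic, checkpoint):
    """Push changes from the checkpoint onward; advance it only after a successful send."""
    for position, change in log.read_from(checkpoint["next"]):
        topic.send(change)
        checkpoint["next"] = position + 1


log = ChangeLog()
for i in range(5):
    log.append({"op": "update", "id": i})

topic = FlakyTopic(fail_on_send=3)
checkpoint = {"next": 0}  # assumption: kept in memory here for brevity

try:
    replicate(log, topic, checkpoint)
except ConnectionError:
    pass  # connection lost mid-stream; checkpoint still points at the undelivered change

replicate(log, topic, checkpoint)  # resume from the checkpoint, not from the beginning
delivered_ids = [m["id"] for m in topic.messages]
```

The source is only read, never modified, which is the property that makes log-based capture gentler on key applications than triggers. Note that this sketch gives at-least-once behavior in general: if a send succeeded but the checkpoint update was lost, the change would be sent again on resume.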

View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.