

Engineering Machine Learning Data Pipelines Series: Tracking Data Lineage from the Source

Designing a machine learning model and putting it into production is the key to getting value back, and it is also the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering robust production data pipelines has its own set of challenges. Syncsort software helps the data engineer every step of the way.

Once you’ve found and matched duplicates to resolve entities, the next step is to track the lineage of data as it moves from source to final analysis – which is required by several regulations and by the need to verify the source of final decisions or recommendations.

Lineage inside the cluster can be tracked with Apache Atlas or Cloudera Navigator, and data outside the cluster can be tracked with enterprise governance tools like Collibra or ASG Data Intelligence. However, for auditing purposes you don’t need part of the data provenance – you need ALL of it. Getting a coherent source-to-consumption view of where specific data came from and how it was changed along the way is much harder than it should be in modern data architectures.
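
To make the idea of a source-to-consumption view concrete, here is a minimal Python sketch of stitching lineage steps recorded by different systems into one backward trace. It does not use the API of Atlas, Navigator, Collibra, ASG, or any Syncsort product. The `LineageStep` record, the `trace_to_origin` helper, and all dataset and system names are hypothetical, and the sketch assumes each dataset has exactly one upstream source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageStep:
    """One hop in a pipeline (hypothetical record, not a real tool's schema)."""
    source: str          # dataset that was read
    target: str          # dataset that was written
    transformation: str  # what was done to the data
    system: str          # where the step ran (on or off the cluster)


def trace_to_origin(dataset, steps):
    """Walk backwards from `dataset` and return its steps, origin first.

    Assumes every dataset has at most one upstream step.
    """
    by_target = {step.target: step for step in steps}
    path = []
    seen = set()  # guards against accidental cycles in the recorded steps
    current = dataset
    while current in by_target and current not in seen:
        seen.add(current)
        step = by_target[current]
        path.append(step)
        current = step.source
    path.reverse()
    return path


# Hypothetical steps collected from an off-cluster tool and from cluster jobs.
steps = [
    LineageStep("crm_export.csv", "staging.customers", "load", "off-cluster ETL"),
    LineageStep("staging.customers", "clean.customers", "deduplicate", "cluster job"),
    LineageStep("clean.customers", "ml.features", "feature engineering", "cluster job"),
]

for step in trace_to_origin("ml.features", steps):
    print(f"{step.source} -> {step.target} [{step.transformation}, {step.system}]")
```

If the off-cluster step were missing from `steps`, the trace would stop at `staging.customers`. That is the partial provenance an audit cannot accept.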

View this 15-minute webcast on-demand to learn how to tackle these challenges in large scale production implementations.