Engineering Machine Learning Data Pipeline Series: Pulling in Data from Multiple Sources

Clearing the hurdle between designing a machine learning model and putting it into production is the key to getting value back, and it is the roadblock that stops many promising machine learning projects. After the data scientists have done their part, engineering a robust production data pipeline has its own set of tough problems to solve. Syncsort’s solutions help the data engineer every step of the way.

The first step is consolidating data from sources all over the enterprise. The data that machine learning models need comes from a wide variety of physical locations, technical platforms and storage formats. The first challenge is getting all that data onto the cluster, which requires parallel onboarding capability and connectivity to sources ranging from mainframe to streaming to cloud. The next challenge is transforming the data from its source storage format to the target, whether that target is Hive, Impala, HDFS, ORC, Parquet, Kudu or something else entirely. The final challenge is getting the data normalized, aggregated or otherwise changed, and the features filtered down.
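To make those three challenges concrete, here is a minimal hand-coded sketch in Python using only the standard library. It is not Syncsort's product or method; the sample data, field names and function names are all hypothetical, and two in-memory strings stand in for real source systems. It reads records from two differently shaped sources, maps them onto one common schema, aggregates per customer, and filters down to the features a model would use.

```python
import csv
import io
import json
from collections import defaultdict

# Hypothetical sample data standing in for two different source systems.
CSV_SOURCE = """customer_id,amount,region
1,10.50,east
2,20.00,west
1,4.50,east
"""

JSON_SOURCE = (
    '[{"customerId": 2, "amt": "5.00", "Region": "WEST"},'
    ' {"customerId": 3, "amt": "7.25", "Region": "North"}]'
)


def read_csv_source(text):
    """Parse CSV rows and map them onto the common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            "customer_id": int(row["customer_id"]),
            "amount": float(row["amount"]),
            "region": row["region"].strip().lower(),
        }


def read_json_source(text):
    """Parse JSON objects (different field names) onto the same schema."""
    for item in json.loads(text):
        yield {
            "customer_id": int(item["customerId"]),
            "amount": float(item["amt"]),
            "region": item["Region"].strip().lower(),
        }


def consolidate(*sources):
    """Step 1: pull records from every source into one collection."""
    records = []
    for source in sources:
        records.extend(source)
    return records


def aggregate_by_customer(records):
    """Step 3a: aggregate the normalized records per customer."""
    totals = defaultdict(
        lambda: {"total_amount": 0.0, "order_count": 0, "region": None}
    )
    for rec in records:
        entry = totals[rec["customer_id"]]
        entry["total_amount"] += rec["amount"]
        entry["order_count"] += 1
        entry["region"] = rec["region"]
    return dict(totals)


def select_features(aggregated, keep):
    """Step 3b: filter each customer's row down to the chosen features."""
    return {
        cid: {name: value for name, value in feats.items() if name in keep}
        for cid, feats in aggregated.items()
    }


records = consolidate(read_csv_source(CSV_SOURCE), read_json_source(JSON_SOURCE))
features = select_features(
    aggregate_by_customer(records), keep={"total_amount", "order_count"}
)
```

Even at this toy scale, each new source needs its own reader and schema mapping, which hints at how quickly hand-written onboarding code grows once real mainframe, streaming and cloud sources are involved.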

This is only the first part of creating a robust production data pipeline, yet if you’re not careful, it can take weeks or even months of Sqoop scripts, shell scripting, and Scala or Java code to complete. Syncsort has helped data engineers solve this problem for years.

View this 15-minute webcast on demand for a deeper look at a better way to get high-performance data access and integration on your production cluster, without spending a lot of time coding or tuning. These 15 minutes could save you weeks!