What is ETL?
Extract, Transform and Load (ETL) is a standard information management term used to describe a process for the movement and transformation of data. ETL is commonly used to populate data warehouses and datamarts, and for data migration, data integration and business intelligence initiatives.
ETL processes can be built by manually writing custom scripts or code, with the trade-off that as the complexity of the ETL operations increases, scripts become harder to maintain and update.
Alternatively, purpose-built ETL software may offer a graphical user interface to build and run ETL processes, which typically reduces development costs and improves maintainability.
Extract data from one or more source systems containing customer, financial, or product data.
This first phase refers to the task of pulling in data from a variety of sources. Most organizations will have data coming in from more than one source, meaning it will be necessary to automate the task of collecting that data and formatting it correctly for the data warehouse.
Modern organizations have data stored in many disparate systems, such as customer relationship management (CRM), sales, accounting, and stock tracking, to name just a few. Each system will typically store data in a different, mutually incompatible format. To obtain business value from all this data, the ETL tool you choose should be able to extract data from many different sources.
Connectors: Flat files, XML, Oracle, IBM Db2, SQL Server, Teradata, Sybase, Vertica, Netezza, Greenplum, IBM Websphere MQ, ODBC, JDBC, Hadoop Distributed File System (HDFS), Hive/HCatalog, JSON, Mainframe (IBM z/OS), Salesforce.com, SAP/R3
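As a minimal sketch of the extract phase, the snippet below pulls records from two hypothetical sources, a CSV export from a CRM and a JSON feed from an order-entry system, into a common in-memory representation using only Python's standard library. The file contents, field names, and source systems are all illustrative; a production tool would use the connectors listed above.

```python
import csv
import io
import json

# Illustrative source data: a CSV export from a CRM and a JSON feed
# from an order-entry system (names and fields are hypothetical).
crm_csv = io.StringIO("customer_id,name\n1,Acme Corp\n2,Globex\n")
orders_json = '[{"order_id": 10, "customer_id": 1, "qty": 5}]'

def extract_csv(fh):
    """Pull rows from a CSV source into plain dicts."""
    return list(csv.DictReader(fh))

def extract_json(text):
    """Pull records from a JSON source into plain dicts."""
    return json.loads(text)

customers = extract_csv(crm_csv)
orders = extract_json(orders_json)
```

Once every source lands in the same representation, the transform phase can treat the data uniformly regardless of where it came from.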
Transform the data by applying business rules, cleansing, and validating the data.
Largely, transformation in this context has to do with manipulating the data in a manner which serves the needs of the business. In other words, the raw data flowing into the organization may not be suitable for any kind of use, so it needs to be cleaned, filtered, etc. This is the phase where generic data is turned into something that can be a valuable asset for the business.
To combine and report on the data extracted in Stage 1, for example comparing orders from the order entry system with stock levels in the warehouse management system, may require multiple steps and many different operations. To meet the current and future requirements of the business, your ETL tool should be able to perform all of the following types of operations:
- Transforms: Aggregation, Copy, Join, Sort, Merge, Partition, Filter, Reformat, Lookup
- Mathematical: +, -, x, /, Abs, IsValidNumber, Mod, Pow, Rand, Round, Sqrt, ToNumber, Truncate, Average, Min, Max
- Logical: And, Or, Not, IfThenElse, RegEx, Variables
- Text: Concatenate, CharacterLengthOf, LengthOf, Pad, Replace, ToLower, ToText, ToUpper, Translate, Trim, Hash
- Date: DateAdd, DateDiff, DateLastDay, DatePart, IsValidDate
- Format: ASCII, EBCDIC, Unicode
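The order-versus-stock example above can be sketched with a few of the listed operations: an Aggregation over orders, then a Lookup/Join against stock levels with a Filter on the result. The records and SKUs here are hypothetical stand-ins for the two source systems.

```python
# Hypothetical Stage 1 output from the two source systems.
orders = [
    {"sku": "A1", "qty": 5},
    {"sku": "A1", "qty": 3},
    {"sku": "B2", "qty": 10},
]
stock = {"A1": 6, "B2": 20}  # sku -> units on hand in the warehouse

# Aggregation: total ordered quantity per SKU.
ordered = {}
for order in orders:
    ordered[order["sku"]] = ordered.get(order["sku"], 0) + order["qty"]

# Lookup + Filter: keep only SKUs where demand exceeds stock,
# reporting the size of the shortfall.
shortfalls = {
    sku: qty - stock.get(sku, 0)
    for sku, qty in ordered.items()
    if qty > stock.get(sku, 0)
}
```

Even this simple comparison chains three operations, which is why a transform library covering the full list above matters as requirements grow.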
Load the results into one or more target systems such as a data warehouse, datamart, or business intelligence reporting system.
With all of the data successfully collected and transformed as necessary, the last phase is to load that data into the warehouse for storage and access. A quality data warehouse will provide those within the organization with easy access to the information they need, when they need it. Once the load phase is finished, the data resides in a location known to everyone who has permission to access it.
Connectors: Flat files, XML, Oracle, IBM Db2, SQL Server, Teradata, Sybase, Vertica, Netezza, Greenplum, ODBC, JDBC, Hadoop Distributed File System (HDFS), Hive/HCatalog, Mainframe (IBM z/OS), Salesforce.com, Tableau, QlikView
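To close the loop, the transformed rows can be written to a target table. In this sketch an in-memory SQLite database stands in for a real warehouse target such as Teradata or SQL Server reached through ODBC/JDBC; the table name and schema are illustrative.

```python
import sqlite3

# Transformed rows ready for the warehouse (illustrative schema:
# SKU and total ordered quantity from the transform phase).
rows = [("A1", 8), ("B2", 10)]

# SQLite stands in for a production warehouse target here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demand (sku TEXT PRIMARY KEY, total_qty INTEGER)")
conn.executemany("INSERT INTO demand VALUES (?, ?)", rows)
conn.commit()

loaded = conn.execute("SELECT COUNT(*) FROM demand").fetchone()[0]
```

In practice the load step also handles batching, upserts, and error recovery, which is where the target-specific connectors listed above earn their keep.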
When ETL is used properly, it lets you get more out of your data.
ETL is an important step in the Data Integration process
The ETL Value Equation
A complete end-to-end ETL process may take a few seconds or many hours to complete depending on the amount of data and the capabilities of the hardware and software.
The cost-time-value equation for ETL is defined by three characteristics:
- Volume – How much data needs to be processed? It may be hundreds of megabytes, terabytes, or even petabytes of data
- Velocity – How fast and how frequently does the data need to be processed? Is there a service level agreement (SLA) that needs to be met?
- Variety – What is the layout or format of the source data? How many different data sources need to be processed and combined? What is the format of the final output?
The selection of appropriate software, hardware and developer resources to meet these criteria will directly affect the overall cost and timeline of your ETL project.
See how Syncsort Integrate products can help you seamlessly integrate mission-critical data from traditional data systems into next-generation analytical platforms, and apply market-leading data quality capabilities to deliver business insights you can trust.