Extract, Transform and Load (ETL)
In this post I will talk about what ETL is, a few approaches to building an ETL-based system, and finally a comparison between batch and real-time ETL.
ETL refers to a process where data is extracted from data sources, transformed into the proper format or structure for querying and analysis, and loaded into the final target database, typically a data warehouse (DWH).
A properly designed ETL system extracts data from the source systems, enforces data quality and consistency standards, conforms data so that separate sources can be used together, and finally delivers data in a presentation-ready format so that application developers can build applications and end users can make decisions.
The ETL process is executed periodically to capture source changes. The period can range from a few minutes to daily, depending on the needs of the target DWH.
Modern systems increasingly require that as soon as an event occurs in the source system, it is available in the target system as well. Such systems are categorised as real time and are event driven.
There are two main batch methods to extract a table out of an RDBMS source system into the data warehouse: get the whole table every time (snapshot), or get it incrementally (only the changes) on a set frequency.
Snapshot
If a table holds up to around 10k records, this approach is reasonable for table data extraction, as a plain select starts extracting immediately, as opposed to a query with a where clause.
Incremental
Incremental changes are pulled from the source system. These changes can be identified using:
Timestamp columns: deletions cannot be tracked this way if data is hard deleted (a JDBC sketch of this approach follows this list).
Identity column: deletions are identified by finding identities that exist in the DWH but no longer exist in the source system; updates are identified the same way.
Triggers: these may slow down the source application, as they require changes in the application database.
Transaction dates: we rely on transaction dates (e.g. order date, system order date and so on) to identify the incremental changes.
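As a rough sketch of the timestamp-column approach, the snippet below pulls only the rows modified since the last run over plain JDBC. The connection URL, the orders table, and the last_modified column are assumptions for illustration, and the watermark would normally be persisted by the ETL job rather than hard-coded.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class IncrementalExtract {
    public static void main(String[] args) throws Exception {
        // Watermark from the previous successful run; in a real job this is
        // persisted (e.g. in a control table), not hard-coded.
        Timestamp lastRun = Timestamp.valueOf("2024-01-01 00:00:00");

        // Hypothetical source table with a last_modified timestamp column.
        String sql = "SELECT order_id, status, last_modified FROM orders WHERE last_modified > ?";

        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://source-host/sales", "etl_user", "secret");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setTimestamp(1, lastRun);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Each row is a record inserted or updated since the last run.
                    // Hard-deleted rows never show up here, which is the limitation
                    // noted above for timestamp-based extraction.
                    System.out.printf("%d %s %s%n",
                            rs.getLong("order_id"),
                            rs.getString("status"),
                            rs.getTimestamp("last_modified"));
                }
            }
        }
    }
}
```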
ETL Approaches
There are different tools and frameworks available for batch ETL, e.g. SSIS (a tool built into MS SQL Server), Talend (an open source tool), and Spring Batch (a Java-based framework specially designed for batch ETL).
Using a tool does not require programming, but one has to depend on the functionality provided by the tool, and the limited scope for customisation creates a lot of problems. Beyond this, such tools generate programs at run time, which are often pretty inefficient.
Developing your own ETL logic using a framework like Spring Batch gives a lot of control to the developer, not to mention the efficiency gain over ETL tools.
Below we explore three solutions: two of them are batch based and the third is real time. At the end there is a comparison between batch and real-time systems.
ETL with Python
Python is a good fit for data science, and a large part of data science work is ETL (Extract-Transform-Load), which makes Python a natural match for these activities. Python data science libraries like TensorFlow, Pandas, and Scikit-Learn also make it a reliable option for modern machine learning applications.
Being a scripting language, Python is dynamically typed, so it has to be used carefully: bugs, errors and type mismatches can surface at run time. Java, in contrast, is statically typed.
Below are a few key comparisons between Java and Python.
ETL with Spring Batch (Java)
A typical batch program generally reads a large number of records from a database, file, or queue, processes the data in some fashion, and then writes back data in a modified form. Spring Batch automates this basic batch iteration, providing the capability to process similar transactions as a set, typically in an offline environment without any user interaction.
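As a minimal sketch of that read-process-write loop, assuming Spring Batch 4.x Java configuration, a chunk-oriented step and job could look like the following. Order and OrderRecord are hypothetical domain types, and the reader, processor, and writer beans are assumed to be defined elsewhere.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.support.RunIdIncrementer;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableBatchProcessing
public class EtlJobConfig {

    // Chunk-oriented step: read items one by one, process them, and write
    // them out in chunks of 100, with one transaction commit per chunk.
    // Order and OrderRecord are hypothetical domain types defined elsewhere.
    @Bean
    public Step etlStep(StepBuilderFactory steps,
                        ItemReader<Order> reader,
                        ItemProcessor<Order, OrderRecord> processor,
                        ItemWriter<OrderRecord> writer) {
        return steps.get("etlStep")
                .<Order, OrderRecord>chunk(100)
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }

    @Bean
    public Job etlJob(JobBuilderFactory jobs, Step etlStep) {
        return jobs.get("etlJob")
                .incrementer(new RunIdIncrementer()) // allows re-running with fresh parameters
                .start(etlStep)
                .build();
    }
}
```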
Business Scenarios
- Commit batch process periodically
- Concurrent batch processing: parallel processing of a job
- Staged, enterprise message-driven processing
- Massively parallel batch processing
- Manual or scheduled restart after failure
- Sequential processing of dependent steps (with extensions to workflow-driven batches)
- Partial processing: skip records (e.g. on rollback); see the fault-tolerant step sketch after this list
- Whole-batch transaction: for cases with a small batch size or existing stored procedures/scripts
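A sketch of how a couple of these scenarios map onto Spring Batch configuration, again assuming 4.x: the chunk size sets the periodic commit interval, and the fault-tolerant settings skip or retry individual records. The exception types and the Order type are illustrative.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileParseException;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.dao.DeadlockLoserDataAccessException;

@Configuration
public class FaultTolerantStepConfig {

    // Order is a hypothetical domain type; reader and writer beans are defined elsewhere.
    @Bean
    public Step importStep(StepBuilderFactory steps,
                           ItemReader<Order> reader,
                           ItemWriter<Order> writer) {
        return steps.get("importStep")
                .<Order, Order>chunk(500)           // periodic commit: one transaction per 500 items
                .reader(reader)
                .writer(writer)
                .faultTolerant()
                .skip(FlatFileParseException.class) // partial processing: skip malformed records...
                .skipLimit(10)                      // ...but fail the job after 10 of them
                .retry(DeadlockLoserDataAccessException.class) // retry transient DB errors
                .retryLimit(3)
                .build();
    }
}
```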
Technical Objectives
- Batch developers use the Spring programming model: concentrate on business logic; let the framework take care of infrastructure.
- Clear separation of concerns between the infrastructure, the batch execution environment, and the batch application.
- Provide common, core execution services as interfaces that all projects can implement.
- Provide simple and default implementations of the core execution interfaces that can be used ‘out of the box’.
- Easy to configure, customize, and extend services, by leveraging the Spring Framework in all layers.
- Provide a simple deployment model, with the architecture JARs completely separate from the application, built using Maven.
This layered architecture highlights three major high level components: Application, Core, and Infrastructure. The application contains all batch jobs and custom code written by developers using Spring Batch. The Batch Core contains the core runtime classes necessary to launch and control a batch job. It includes things such as a JobLauncher, Job, and Step implementations. Both Application and Core are built on top of a common infrastructure. This infrastructure contains common readers and writers, and services such as the RetryTemplate, which are used both by application developers (ItemReader and ItemWriter) and the core framework itself.
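The launch side of that picture is small. Below is a sketch of running a job through the JobLauncher; the jobLauncher and etlJob beans are assumed to be wired by Spring elsewhere.

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class EtlJobRunner {

    private final JobLauncher jobLauncher;
    private final Job etlJob;

    public EtlJobRunner(JobLauncher jobLauncher, Job etlJob) {
        this.jobLauncher = jobLauncher;
        this.etlJob = etlJob;
    }

    public void runDaily(String runDate) throws Exception {
        // Identifying parameters make each run a distinct JobInstance,
        // which is what enables restart-after-failure semantics.
        JobParameters params = new JobParametersBuilder()
                .addString("runDate", runDate)
                .toJobParameters();
        JobExecution execution = jobLauncher.run(etlJob, params);
        System.out.println("Job finished with status: " + execution.getStatus());
    }
}
```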
Spring Batch provides a collection of utilities and implementations that have been battle tested in all enterprise verticals. Some of the benefits and characteristics are listed below.
- Over 17 ItemReader and 15 ItemWriter implementations covering vast options for input and output (File, JDBC, NoSQL, JMS, etc.). All of these provide declarative I/O options so that you don’t have to write and test code for stateful readers and writers (a reader/writer sketch follows this list).
- A collection of Tasklet (Spring Batch’s equivalent to JSR-352’s Batchlet) implementations including ones for executing shell commands and interfacing with Hadoop.
- The ability to stop/start/restart jobs and maintain state between executions.
- The ability to skip and retry records as they are being processed.
- Transaction management. Spring Batch handles transactions for you.
- The ability to notify other systems when errors occur via messaging by integrating Spring Integration.
- Java or XML based configuration.
- All the Spring features like DI, AOP, testability, etc.
- Vendor independence — By using Spring Batch, you get to use a framework that is open source and not tied to any one vendor.
- Scalability options — Spring Batch provides a number of scalability options that range from within a single JVM via threads (multithreaded step, local partitioning, and splits) to multi-JVM scalability (remote partitioning and remote chunking).
- Integration with Spring Integration — Spring Integration provides a number of useful elements that allow you to build robust batch applications to handle things like error messages, polling directories for files, automatically FTPing files, etc.
- Big data support — Through the Spring for Apache Hadoop project, there are a number of extensions to Spring Batch that allow it to work well with Hadoop. You can run Spring Batch jobs on YARN, and you can execute Pig, Hive, MapReduce, etc. jobs.
- Integration with Spring XD — Spring XD provides a distributed runtime for the deployment, management, and execution of batch jobs.
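As an example of the declarative I/O mentioned in the first bullet above, the sketch below wires a CSV reader and a JDBC writer purely through builders (Spring Batch 4.x). The Customer type, the file name, and the target table are assumptions.

```java
import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.builder.FlatFileItemReaderBuilder;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class CustomerIoConfig {

    // Reads customers.csv and maps each delimited line onto a hypothetical Customer bean.
    @Bean
    public FlatFileItemReader<Customer> customerReader() {
        return new FlatFileItemReaderBuilder<Customer>()
                .name("customerReader")
                .resource(new FileSystemResource("customers.csv"))
                .delimited()
                .names("id", "name", "email")
                .targetType(Customer.class)
                .build();
    }

    // Writes each chunk of customers to an assumed warehouse table in one batched insert.
    @Bean
    public JdbcBatchItemWriter<Customer> customerWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<Customer>()
                .dataSource(dataSource)
                .sql("INSERT INTO dim_customer (id, name, email) VALUES (:id, :name, :email)")
                .beanMapped()
                .build();
    }
}
```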
ETL using Kafka (Real Time)
What is Apache Kafka?
Kafka is a distributed streaming platform. A streaming platform has three key capabilities.
- Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
- Store streams of records in a fault-tolerant durable way.
- Process streams of records as they occur.
Kafka is generally used for two broad classes of applications:
- Building real-time streaming data pipelines that reliably get data between systems or applications
- Building real-time streaming applications that transform or react to the streams of data
To understand how Kafka does these things, let’s dive in and explore Kafka’s capabilities from the bottom up.
First a few concepts:
- Kafka is run as a cluster on one or more servers that can span multiple datacenters.
- The Kafka cluster stores streams of records in categories called topics.
- Each record consists of a key, a value, and a timestamp.
Kafka has four core APIs:
- The Producer API allows an application to publish a stream of records to one or more Kafka topics (a minimal producer sketch follows this list).
- The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
- The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
- The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
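A minimal Producer API sketch, publishing one change event keyed by order id; the broker address and the orders topic are assumptions.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = order id, value = JSON payload of the change event.
            // Kafka attaches the record timestamp mentioned above automatically.
            producer.send(new ProducerRecord<>("orders", "order-42",
                    "{\"orderId\":\"order-42\",\"status\":\"SHIPPED\"}"));
        }
    }
}
```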
Applications publish events to Kafka, and those events are delivered to all subscribers in real time. The subscribers can then process the events further; one use case is to transform the data and load it into the persistent store of choice. One thing to note here is that batch ETL moves data from source to target with a time lag that might range from a few minutes to a few days, whereas a real-time system moves data as it happens.
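To make that flow concrete, here is a sketch of a consumer that subscribes to the same hypothetical orders topic and, for each event, stands in for the transform-and-load step with a print statement; the broker address and consumer group are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderEventLoader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "dwh-loader");              // assumed consumer group
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // In a real pipeline this is where the event would be transformed
                    // to the warehouse schema and written to the persistent store.
                    System.out.printf("loading key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```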