Data Pipeline Design Patterns
ETL

Extract-Transform-Load (ETL), as shown in figure 2, is the most widely used data pipeline pattern. From the early 1990s it was the de facto standard for integrating data into a data warehouse, and it continues to be a common pattern for data warehousing, data lakes, operational data stores, and master data hubs. Data is extracted from a data store such as an operational database, then transformed to cleanse, standardize, and integrate before loading into a target database. ETL is executed as scheduled batch processing, and data latency is inherent in batching. Mini-batch and micro-batch processing help to reduce that latency, but zero-latency ETL is not practical. ETL works well when complex data transformations are required. It is especially well-suited for data integration when data sources are not all ready at the same time. As each individual source becomes ready, it is extracted independently of the others. When all source extracts are complete, processing continues with the transformation and loading of the entire set of data.
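The flow above can be sketched in a few lines of Python. This is a minimal, illustrative example only (the source names, record fields, and in-memory "warehouse" are invented for the sketch): each source is extracted independently, and transformation and loading run once on the complete staged set.

```python
def extract(source):
    # In practice this would query an operational database;
    # here each "source" is just an in-memory list of records.
    return list(source)

def transform(records):
    # Cleanse and standardize before loading.
    cleaned = []
    for rec in records:
        if rec.get("customer_id") is None:
            continue  # cleanse: skip rows missing the integration key
        rec = dict(rec, name=rec["name"].strip().title())  # standardize names
        cleaned.append(rec)
    return cleaned

def load(records, target):
    # Load the transformed batch into the target store (a dict keyed by id).
    for rec in records:
        target[rec["customer_id"]] = rec
    return target

# Sources become ready at different times; each is extracted independently.
crm = [{"customer_id": 1, "name": "  ada lovelace "}]
erp = [{"customer_id": 2, "name": "ALAN TURING"},
       {"customer_id": None, "name": "?"}]

staged = extract(crm) + extract(erp)     # E: all extracts complete
warehouse = load(transform(staged), {})  # T then L on the full set
```

Note that transformation waits for every extract: that is what makes the pattern batch-oriented and what introduces the latency discussed above.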
ETLT

The Extract-Transform-Load-Transform (ETLT) pattern shown in figure 5 is a hybrid of ETL and ELT. Each source is extracted when ready. A first stage of "light" transformations is performed before the data is loaded. These first-stage transformations are limited to a single data source and are independent of all other sources. Data cleansing, format standardization, and masking of sensitive data are typical first-stage transformations. Each data source becomes available for use quickly, but without the quality and privacy risks of ELT. Once all sources have been loaded, a second transformation stage performs integration and other multi-source work in place in the data warehouse.
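The two-stage split can be sketched as follows. This is a hypothetical example (the email masking rule, source names, and record fields are invented): stage one runs per source before loading, while stage two runs across sources after everything is loaded.

```python
def stage_one(records):
    # "Light" single-source transforms: cleanse and mask sensitive data
    # before the records are loaded into the warehouse.
    out = []
    for rec in records:
        if not rec.get("email"):
            continue                                       # cleanse
        user, _, domain = rec["email"].partition("@")
        rec = dict(rec, email=user[0] + "***@" + domain)   # mask PII
        out.append(rec)
    return out

def stage_two(warehouse):
    # Multi-source transform performed "in place" in the warehouse:
    # integrate records from all loaded sources by (masked) email.
    merged = {}
    for name, table in warehouse.items():
        for rec in table:
            merged.setdefault(rec["email"], {}).update(rec)
    warehouse["integrated"] = list(merged.values())

warehouse = {}
# Each source is loaded as soon as its own first-stage transform finishes.
warehouse["web"] = stage_one([{"email": "ada@example.com", "plan": "pro"}])
warehouse["billing"] = stage_one([{"email": "ada@example.com", "mrr": 49}])
stage_two(warehouse)  # runs only once all sources have been loaded
```

Because masking happens in stage one, sensitive values never land in the warehouse unprotected, which is the privacy advantage over plain ELT.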
Data Virtualization

Data virtualization, illustrated in figure 6, serves data pipelines differently than the other patterns. Most pipelines create physical copies of data in a data warehouse, data lake, data feed for analytics, etc. Virtualization delivers data as views without physically storing a separate version for the use case. Virtualization works with layers of data abstraction. The source layer is the least abstract, providing connection views through which the pipeline sees the content of data sources. The integration layer combines and connects data from disparate sources providing views similar to the transformation results of ETL processing. The business layer presents data with semantic context, and the application layer structures the semantic view for specific use cases.
Unlike ETL processing, which is initiated by a schedule, virtualization processes are initiated by a query. The query is issued in the application layer's semantic context and is translated through the integration and source layers to connect with the right data sources. The response traverses the path in the opposite direction: acquiring source data, transforming and integrating it, presenting a semantic view, and delivering an application view of the data. Virtualization can work well when people want the freshest data possible. The data sources determine the degree of data freshness, and the amount of historical data available is likewise determined by the sources.
Virtualization has the distinct advantage of integrating and transforming only the data that is actually requested, not everything that might someday be requested. It works well with relatively simple transformations and modest data volumes, but may struggle to perform with complex transformations or when large amounts of data are needed to respond to a query.
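A toy sketch of the query-initiated flow may help. The layer names and data are invented for illustration: nothing is copied into a warehouse; each layer is just a function over the layer below it, evaluated at the moment a query arrives, so results always reflect the current state of the sources.

```python
# Source layer: connection views over live stores (contents are made up).
orders_db = [{"cust": 1, "total": 20}, {"cust": 2, "total": 30}]
crm_db = {1: "Ada", 2: "Alan"}

def source_orders():
    # Least abstract layer: a view straight onto the source content.
    return iter(orders_db)

def integration_customer_spend():
    # Integration layer: combines the two disparate sources,
    # similar to the transformation results of ETL processing.
    spend = {}
    for row in source_orders():
        spend[row["cust"]] = spend.get(row["cust"], 0) + row["total"]
    return {crm_db[c]: total for c, total in spend.items()}

def app_top_spender():
    # Application layer: shapes the semantic view for one use case.
    spend = integration_customer_spend()
    return max(spend, key=spend.get)

first = app_top_spender()                     # "Alan" (30 vs 20)
orders_db.append({"cust": 1, "total": 50})    # the source changes...
second = app_top_spender()                    # ..."Ada" (70 vs 30)
```

Each query walks the full stack, which is why the pattern delivers maximum freshness but can struggle when the per-query transformation work is heavy.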
Stream Processing

Stream processing, as shown in figure 7, has two similar but slightly different patterns. In both patterns, the data origin is a stream with a continuous flow of event data in chronological sequence. Processing begins by parsing the events to isolate each unique event as a distinct record. Individual events can then be evaluated to select only those appropriate to the use case. In many cases, and especially with large data volumes or a high percentage of unneeded events, it is desirable to push parsing and selection to the edge of the network—close to the sensors where event data is captured—and avoid moving unneeded data across the network.
At the destination end of the data flow, the two patterns diverge slightly. For some use cases, the goal is to post events to a message queue, an event log, or an events archive. When events are posted, they become the origin or input to a downstream data pipeline where data consumption is somewhat latent. For other use cases, the goal is to push events to a monitoring or alerting application where information about the state of a machine or other entity is delivered in real time. Stream processing pipelines typically work with sensors and Internet of Things (IoT) data using technology that is optimized for high-volume, fast-moving data.
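Both variants share the same front end, as this sketch shows. The event format, sensor names, and the temperature threshold are invented for illustration: events are parsed into distinct records, filtered at the "edge", then either posted to a queue for downstream pipelines or pushed to real-time alerting.

```python
import json

# A raw feed of newline-delimited event data in chronological sequence.
raw_feed = (
    '{"sensor": "s1", "temp": 72}\n'
    '{"sensor": "s1", "temp": 95}\n'
    '{"sensor": "s2", "temp": 70}\n'
)

def parse(feed):
    # Isolate each unique event as a distinct record.
    for line in feed.strip().splitlines():
        yield json.loads(line)

def select(events, sensor="s1"):
    # Edge filtering: drop events this use case does not need,
    # before they are moved across the network.
    return (e for e in events if e["sensor"] == sensor)

queue, alerts = [], []
for event in select(parse(raw_feed)):
    queue.append(event)           # pattern 1: post to a message queue
    if event["temp"] > 90:        # hypothetical alert condition
        alerts.append(event)      # pattern 2: push to real-time alerting
```

In production the queue would be something like a Kafka topic and the alert path a push to a monitoring application, but the divergence at the destination end is the same.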