Sidra helps you quickly set up data migration processes from on-premises or cloud sources to the Data Lake. In my previous post I enumerated the most important components and vocabulary; it’s time for an overview of how data is “lifted” into the Data Lake.
I am focusing on tabular data here, since the ingestion of binary or unstructured documents may follow a different and more complex path.
When setting up a Data Intake Process, two main pipelines are created, plus a third one that simply invokes the first two. These materialize as Azure Data Factory pipelines, which in turn invoke Databricks notebooks.
- Metadata Extraction: Essentially creates the target Databricks tables corresponding to the source tables being “lifted” to the Data Lake.
- Data Ingestion: Essentially copies records from the source tables into their corresponding target Databricks tables.
Metadata Extraction
The Metadata Extraction pipeline reads the table structure metadata from the source system, for all the tables or views in scope for the Data Intake Process. This information is recorded in Sidra’s metadata: the source database becomes a Provider, source tables become Entities, and table columns become Entity Attributes. Finally, the Metadata Extraction pipeline invokes Databricks notebooks that create or update the target tables, based on the structures recorded in Sidra’s metadata DB.
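To make this more concrete, here is a minimal sketch, not Sidra’s actual notebook, of what “create the target tables from source metadata” can boil down to. It assumes a SQL Server source and the Spark session a Databricks notebook provides; the connection string, table names and type mapping are placeholders, and the detour through Sidra’s metadata DB (Provider, Entities, Attributes) is skipped.

```python
# Minimal sketch: read the column metadata of one source table and turn it into a
# Databricks Delta table definition. All names and the JDBC URL are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook

jdbc_url = "jdbc:sqlserver://source-host;databaseName=Sales"  # hypothetical source DB

# Read the column metadata of the (hypothetical) Customers table.
columns = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", """
        SELECT COLUMN_NAME, DATA_TYPE, ORDINAL_POSITION
        FROM INFORMATION_SCHEMA.COLUMNS
        WHERE TABLE_NAME = 'Customers'
    """)
    .load()
    .orderBy("ORDINAL_POSITION")
    .collect()
)

# Naive source-to-Spark type mapping; the real pipeline drives this from the
# structures recorded in Sidra's metadata DB.
type_map = {"int": "INT", "bigint": "BIGINT", "nvarchar": "STRING",
            "datetime2": "TIMESTAMP", "decimal": "DECIMAL(18, 4)"}

ddl_columns = ", ".join(
    f"`{row.COLUMN_NAME}` {type_map.get(row.DATA_TYPE, 'STRING')}" for row in columns
)

# Create the target table in the Data Lake if it does not exist yet.
spark.sql(f"CREATE TABLE IF NOT EXISTS sales.customers ({ddl_columns}) USING DELTA")
```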
This Metadata Extraction process is what makes Sidra such a fast solution for data migration into Data Lake.
It’s worth noting that column additions in the source tables are picked up automatically by Sidra and applied to the target Databricks tables, a feature called Schema Evolution.
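A rough illustration of the idea, reusing the `spark` session and the hypothetical table from the sketch above; the column names are made up:

```python
# Schema evolution, roughly: compare the source columns against the target table
# and append whatever is missing. "LoyaltyTier" plays the newly added source column.
source_cols = {"CustomerId": "INT", "Name": "STRING", "LoyaltyTier": "STRING"}

target_cols = {field.name for field in spark.table("sales.customers").schema.fields}
new_cols = {name: dtype for name, dtype in source_cols.items() if name not in target_cols}

if new_cols:
    add_clause = ", ".join(f"`{name}` {dtype}" for name, dtype in new_cols.items())
    spark.sql(f"ALTER TABLE sales.customers ADD COLUMNS ({add_clause})")
```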
Data Ingestion
Here, too, we have a process with two main stages:
- Read the table data from the source and stage it as a Parquet file; then…
- Execute a Databricks notebook that saves the records from the Parquet file into the target Data Lake table.
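Compressed into a single, naive PySpark sketch, the two stages look roughly like this. In Sidra the first stage is an Azure Data Factory copy activity rather than a Spark read, and the paths, connection string and table names below are only placeholders.

```python
# The two ingestion stages, naively compressed into one script.

# Stage 1: copy the source table into a Parquet file in the landing zone.
# (In Sidra this is an Azure Data Factory copy activity, not a Spark read.)
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://source-host;databaseName=Sales")
    .option("dbtable", "dbo.Customers")
    .load()
)
source_df.write.mode("overwrite").parquet("/mnt/landing/sales/customers/2024-05-01")

# Stage 2: load the staged Parquet file and save its records into the target table.
staged = spark.read.parquet("/mnt/landing/sales/customers/2024-05-01")
staged.write.format("delta").mode("append").saveAsTable("sales.customers")
```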
In the Data Intake Process, Data Ingestion may run periodically, for example every day, to keep the Databricks tables in sync with the source data. When executing Data Ingestion for a source table, here is what can happen in the two stages:
1. When reading data from the source table into a Parquet file, Sidra can:
- Read the entire table each time, so the Parquet file reflects the full table contents; or
- Read the source table in deltas, for an Incremental Load: a first full-table load produces a large Parquet file, followed by smaller Parquet files that contain only the record changes.
2. When saving the Parquet records into the target Databricks table, depending on the Consolidation Mode:
- Sidra may store everything as Snapshots, which preserves a history of the table records and how they evolved (sketched right after this list); or
- The records in each Parquet file may be merged with the existing ones in the target Databricks table.
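The Snapshot option amounts, roughly, to appending each extract together with a load date, so that earlier states of the source table stay queryable. The `load_date` column name below is my own, not Sidra’s, and the path is again a placeholder.

```python
# Snapshot consolidation, roughly: append each extract with a load date so earlier
# states of the source table remain queryable. "load_date" is a made-up column name.
from pyspark.sql import functions as F

snapshot = (
    spark.read.parquet("/mnt/landing/sales/customers/2024-05-01")
    .withColumn("load_date", F.to_date(F.lit("2024-05-01")))
)
snapshot.write.format("delta").mode("append").saveAsTable("sales.customers")
```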
The Incremental Load in the first stage may be automatic: with Microsoft SQL Server, for example, Sidra can detect whether the source table has Change Tracking enabled, and for such tables the Incremental Load reduces ingestion time and resource consumption. The Incremental Load may also be configured manually, for other mechanisms that help track record changes from one ingestion to the next.
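A simplified take on what a Change Tracking based read can look like, assuming `CustomerId` is the primary key of the hypothetical Customers table. In Sidra the last synchronized version is tracked in the metadata DB and the copy itself runs in Data Factory, so the query, version number and paths here are only placeholders.

```python
# Ask SQL Server only for the rows changed since the last synchronized version.
last_version = 42  # placeholder; in Sidra this is tracked per table in the metadata DB

changes_query = f"""
    SELECT ct.CustomerId, ct.SYS_CHANGE_OPERATION, c.Name, c.Email
    FROM CHANGETABLE(CHANGES dbo.Customers, {last_version}) AS ct
    LEFT JOIN dbo.Customers AS c ON c.CustomerId = ct.CustomerId
"""

changes = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://source-host;databaseName=Sales")
    .option("query", changes_query)
    .load()
)

# The resulting delta file is much smaller than a full extract of the table.
changes.write.mode("overwrite").parquet("/mnt/landing/sales/customers/delta-00042")
```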
The Consolidation Mode for a table Entity is Snapshot by default, meaning that the target Databricks table can show the current state of the source table as well as its history. With a Consolidation Mode of Merge, table storage is optimized, but the history is lost and the final stage of the ingestion may take slightly longer. The Merge consolidation mode also requires, of course, a Primary Key in the source table.
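Sketched with the Delta Lake API, and reusing the change-tracking extract from the previous sketch, the Merge consolidation boils down to an upsert on the primary key plus a pass for the deleted records; table, path and column names remain illustrative.

```python
# Merge consolidation, sketched with the Delta Lake API: upsert changed records on
# the primary key, then drop the ones the source reported as deleted.
from delta.tables import DeltaTable

staged = spark.read.parquet("/mnt/landing/sales/customers/delta-00042")
deletes = staged.filter("SYS_CHANGE_OPERATION = 'D'")
upserts = staged.filter("SYS_CHANGE_OPERATION <> 'D'").drop("SYS_CHANGE_OPERATION")

target = DeltaTable.forName(spark, "sales.customers")

# Insert new records and update the existing ones.
(
    target.alias("t")
    .merge(upserts.alias("s"), "t.CustomerId = s.CustomerId")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Remove the records deleted at the source.
(
    target.alias("t")
    .merge(deletes.alias("s"), "t.CustomerId = s.CustomerId")
    .whenMatchedDelete()
    .execute()
)
```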