We at Sidra Data Platform know the problems we want to solve, and we know them very well indeed. Most of the technical team that builds Sidra has more than 10 years of experience building bespoke enterprise data platforms for countless customers, so in essence we are building the tools we wish we had had in our collective ‘previous life’.
From this vantage point, we have seen many technologies come and go (do Hadoop and MapReduce ring a bell?) and we have also seen shifts in architectural paradigms (ETL to ELT, lambda architectures for real-time data processing, etc.), proving the old saying right: the only constant is change.
One of the architectural paradigms that seems to be ubiquitous today is the Medallion architecture, on which the Databricks data lakehouse is based. While it is pervasive across the vast majority of enterprise data platforms built over the last couple of years, you might be surprised to learn that Sidra Data Platform does not completely follow this approach.
Let’s stop here for a minute and do a quick refresher on what this architecture is and how it works.
The Medallion Architecture
The Medallion Architecture is a strategic approach to structuring data within a data lake or lakehouse, with the goal of progressively enhancing the organization and quality of data as it moves through each layer: Bronze, Silver, and Gold.
Bronze Layer (Raw Data): This is the initial landing point for all data from various source systems. The table structures in this layer mirror those of the source systems, supplemented with additional metadata columns that capture details like load date/time, process ID, and more. The primary focus here is on rapid Change Data Capture and maintaining a historical archive of source data, which aids in data lineage, auditability, and reprocessing if necessary.
Silver Layer (Cleansed and Conformed Data): Data in the Silver layer is refined from the Bronze layer. It undergoes matching, merging, conforming, and cleansing processes to provide a unified “Enterprise view” of all key business entities, concepts, and transactions. This layer serves as a valuable resource for Departmental Analysts, Data Engineers, and Data Scientists to further create projects and analyses to solve business problems.
Gold Layer (Curated Business-Level Tables): The Gold layer contains highly refined and aggregated data, organized in consumption-ready “project-specific” databases. This data is often used to power analytics, machine learning, and production applications.
The Medallion Architecture aims to ensure data integrity as it passes through multiple layers of validations and transformations before being stored in a layout optimized for efficient analytics.
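To make the three layers a bit more tangible, here is a minimal sketch in plain Python. It is only an illustration under stated assumptions: the sample rows, the column names (`_loaded_at`, `_process_id`) and the functions `to_bronze`, `to_silver` and `to_gold` are all hypothetical, and a real lakehouse would do this with tables and a processing engine rather than lists of dictionaries.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical rows as they might arrive from a source system (all strings).
source_rows = [
    {"order_id": "1", "customer": " Alice ", "amount": "10.50"},
    {"order_id": "2", "customer": "bob", "amount": "4.00"},
    {"order_id": "2", "customer": "bob", "amount": "4.00"},   # duplicate
    {"order_id": "3", "customer": "Alice", "amount": "n/a"},  # invalid amount
]


def to_bronze(rows, process_id):
    """Bronze: keep the source structure as-is, only add load metadata."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    return [{**r, "_loaded_at": loaded_at, "_process_id": process_id} for r in rows]


def to_silver(bronze_rows):
    """Silver: cleanse, type and de-duplicate the Bronze rows."""
    seen, silver_rows = set(), []
    for r in bronze_rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # drop rows whose amount cannot be parsed
        if r["order_id"] in seen:
            continue  # drop duplicated orders
        seen.add(r["order_id"])
        silver_rows.append({
            "order_id": int(r["order_id"]),
            "customer": r["customer"].strip().title(),
            "amount": amount,
        })
    return silver_rows


def to_gold(silver_rows):
    """Gold: aggregate into a consumption-ready view (total per customer)."""
    totals = defaultdict(float)
    for r in silver_rows:
        totals[r["customer"]] += r["amount"]
    return dict(totals)


bronze = to_bronze(source_rows, process_id="run-001")
silver = to_silver(bronze)
gold = to_gold(silver)
```

Note how Bronze keeps every row, including the duplicate and the invalid one, which is what makes later auditing and reprocessing possible.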
Is your Data Platform really following the Medallion architecture?
Many of the enterprise companies we have worked with show a very clear and distinct medallion approach on their architecture diagrams, but once the diagrams are set aside and the focus shifts to the implementation, there are obvious discrepancies.
One very common situation is having multiple sub-layers within each layer: it is extremely unlikely to find a single Silver layer. Instead, there are usually several layers with different stages of cleaning and validation processes, basic business transformations, etc. The same applies to the other layers: there is no clear-cut separation of the activities involved in each of them, because this varies from company to company, and sometimes even between teams inside the same company that need different processing activities on their data.
Another good example of this is the semantic layer for OLAP models, where critical elements of the business definition reside. This normally lives in the Power BI semantic model, well away from the data lake storage, and it would be a stretch to consider the gold layer as extending into the Power BI realm.
And… what about the domains in a Data Mesh implementation? Almost by definition, they imply ‘multiple gold layers’.
Is any of this a problem? Absolutely not. This is just a way to highlight that these architectures are tools to help you model and structure your data estate and your processes, not a single architecture that works for all scenarios and companies. Actually, more than an architecture, the medallion approach is a pattern that can be extended to suit the needs of each specific company.
Sidra’s Way
We have some considerations that are exclusive to our product:
• We need a structure that allows us to achieve our vision of a fully automated data intake process.
• We have to provide a structure that works for hundreds of different customers across different verticals.
For those two reasons, we do things a little bit differently at Sidra. We have two different storage layers: the Data Storage Unit (DSU) and the Data Domains. In a nutshell, the DSU storage is ‘the data lake’, whereas the Data Domains are the ‘data products’.
Our intake process is fully automated: you point to the data source, and Sidra inspects it and automatically creates the pipelines to ingest the data, as well as the target tables in the DSU, the validation queries, etc. But there is a catch to this approach: since it is fully automated, only technical transformations can be done at this stage. Here we perform actions such as converting tables to Delta, adding metadata and lineage columns, running validations, applying specific partitioning modes, optimizing data types, etc., but no business transformations at all.
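As a rough illustration of what "technical transformations only" can mean, here is a minimal sketch in plain Python. It is not Sidra's implementation or API: the functions `infer_type`, `technical_intake` and `validate`, the `_source`, `_loaded_at` and `_source_row` columns and the sample data are all hypothetical. The point is that everything it does (type inference, metadata and lineage columns, a basic validation) can be derived from the source alone, without any business knowledge.

```python
from datetime import datetime, timezone


def infer_type(values):
    """Pick the narrowest type (int, float, str) that parses every non-empty value."""
    for caster in (int, float):
        try:
            for v in values:
                if v != "":
                    caster(v)
            return caster
        except ValueError:
            continue
    return str


def technical_intake(rows, source_name):
    """Type the columns and add metadata/lineage columns; no business logic."""
    columns = list(rows[0].keys())
    schema = {c: infer_type([r[c] for r in rows]) for c in columns}
    loaded_at = datetime.now(timezone.utc).isoformat()
    table = []
    for i, r in enumerate(rows):
        typed = {c: (schema[c](r[c]) if r[c] != "" else None) for c in columns}
        typed["_source"] = source_name   # lineage: where the row came from
        typed["_loaded_at"] = loaded_at  # metadata: when it was loaded
        typed["_source_row"] = i         # lineage: position in the source extract
        table.append(typed)
    return schema, table


def validate(rows, table):
    """Technical validation: same row count, and metadata present on every row."""
    same_count = len(rows) == len(table)
    has_metadata = all("_source" in t and "_loaded_at" in t for t in table)
    return same_count and has_metadata


# Hypothetical extract from a source system, with every value as text.
raw = [
    {"id": "1", "price": "9.99", "name": "pen"},
    {"id": "2", "price": "", "name": "ink"},
]
schema, table = technical_intake(raw, source_name="erp.products")
is_valid = validate(raw, table)
```

Nothing here decides what a "product" or a "price" means for the business; that is deliberately left to the next layer.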
The business transformations happen at the domain layer, in our Data Products. Our platform provides an automated mechanism to access or move the data from the DSU to the data product based on a security model, and it provides some tooling to help the developers of the data product accelerate their work. It is the outcome of that work that embeds the business transformations.
This has the benefit that every domain can be treated as a different product, with its own owners, sponsors and lifecycle. It avoids scenarios such as the all-too-common clash when two analytics teams working on the same set of data have differing opinions on how a specific measure or calculation should be defined.
Does this mean that Sidra works as a two-tier architecture? We are making the same simplification here that is made with the Medallion approach: there are multiple sub-layers in the DSU layer, and similarly in the Domains, so it is a multi-layer system that ends up being tailored to the customer’s use case.
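The idea can be sketched in a few lines of Python. This is an illustration only, not Sidra's API: the `DSU` dictionary, the `DataDomain` class, the table names and both definitions of `revenue` are hypothetical. Two domains read the same technically prepared data, each one only from the tables it has been granted, and each one owns its own definition of the measure.

```python
# Hypothetical DSU content: technically prepared tables, no business logic applied.
DSU = {
    "sales": [
        {"amount": 100.0, "tax": 20.0, "returned": False},
        {"amount": 50.0, "tax": 10.0, "returned": True},
    ],
    "payroll": [{"employee": "e-1", "salary": 3000.0}],
}


class DataDomain:
    """A data product with its own access rights and its own measure definitions."""

    def __init__(self, name, allowed_tables, measures):
        self.name = name
        self.allowed_tables = set(allowed_tables)
        self.measures = measures

    def read(self, table):
        # Simplified stand-in for a security model: only granted tables are readable.
        if table not in self.allowed_tables:
            raise PermissionError(f"{self.name} has no access to {table}")
        return [dict(row) for row in DSU[table]]  # copy, so the DSU stays untouched

    def measure(self, measure_name, table):
        return self.measures[measure_name](self.read(table))


# Finance counts net amounts and excludes returned orders.
finance = DataDomain(
    "finance",
    allowed_tables=["sales", "payroll"],
    measures={"revenue": lambda rows: sum(r["amount"] for r in rows if not r["returned"])},
)

# Marketing counts everything that was sold, tax included.
marketing = DataDomain(
    "marketing",
    allowed_tables=["sales"],
    measures={"revenue": lambda rows: sum(r["amount"] + r["tax"] for r in rows)},
)

finance_revenue = finance.measure("revenue", "sales")
marketing_revenue = marketing.measure("revenue", "sales")
```

Both numbers are "revenue", both are legitimate, and neither team has to win the argument, because the definition lives in the domain and not in the shared storage.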
Final Thoughts
This was not a post about the benefits of a two-layer architecture as opposed to a three-layer architecture, nor a rant against the Medallion architecture, and definitely not a post to highlight the great work we do at Sidra with our Data Platform. Instead, this is a call for rational judgement.
When building your next data platform, don’t force-fit all your scenarios and processing stages into the three nominal layers: just build something that works for the business in terms of processes. The key aspect here is not the technical details of the implementation, but the shared knowledge and expectations of all the people working with data across the company. All employees should be able to tell the level of maturity, quality and availability of a data set just by knowing which layer it is stored in, whatever you choose to call those layers.