Data Lakehouse

The emergence of the data warehouse

For decades, data analytics platforms (data warehouses, data lakes, data marts, …) have supported enterprises in making major decisions. By centrally storing company data from various sources, comprehensive analyses can be performed, providing deeper insight into operational processes and supporting your business intelligence initiatives. Savings potential and problems can also be identified that peripheral analyses would not reveal and that could therefore not be solved otherwise.
To dig even deeper into the structures and relationships within the data, machine learning can be incorporated. The intelligence gained provides an essential competitive edge.

To this end, the concept of the data warehouse emerged in the 1980s. Data is extracted via ETL processes, consolidated, and reformatted into a schema suitable for the desired analytics. During this process, much of the information contained in the raw data is discarded, so analyses that would need this data at a later point in time can no longer be realized.

Very high costs due to cold data

Another disadvantage of a legacy DWH is that the required amount of RAM and CPU scales linearly with the amount of stored data. This scaling behavior is typical of classic database systems and means that rarely used data (cold data) creates excessive cost. Large data platforms can hold several hundred terabytes of such data, which leads many organizations to discard it, even though it could prove useful in the future. Furthermore, classic database systems are specialized for a single data structure; data that is structured differently must first be persisted in other, external systems.

THE NEXT EVOLUTIONARY STEP: THE INTRODUCTION OF CLASSIC DATA LAKES

The outlined problems led to the development of data lakes in the late 2000s. The concept of a data lake is to split the basic functions of a database across several highly scalable systems. Data lakes consist of scalable file storage (e.g., Amazon S3) to persist data, at least one database to manage metadata, at least one query engine, and other components. This segregation allows structured, semi-structured, and unstructured data to be stored on a consolidated platform. Furthermore, the query engine requires only the computing power needed to process the queried data, so cold data and raw data can be stored cost-effectively in a data lake. Classic data lakes, however, offer only limited data management functionality, which is clearly suboptimal compared to classic databases or data warehouses. Therefore, data lakes are used as storage for raw data, which is then transformed, enriched, and loaded into classic DWHs and data marts.
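As a rough illustration of this raw-data storage, a record could be landed in the object storage of a data lake as follows. The bucket name, key, and payload are hypothetical, and boto3 credentials are assumed to be configured in the environment; structure is imposed later by the query engine, not by the storage layer.

```python
# Minimal sketch: persisting a raw record in the scalable file storage of a
# data lake. Bucket, key, and record are hypothetical examples.
import json
import boto3

s3 = boto3.client("s3")

raw_event = {"order_id": 42, "status": "created", "ts": "2024-05-01T10:15:00Z"}

# Raw data is stored as-is (here as JSON); no schema is enforced at this stage.
s3.put_object(
    Bucket="example-raw-zone",
    Key="orders/2024/05/01/order-42.json",
    Body=json.dumps(raw_event).encode("utf-8"),
)
```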

DISADVANTAGES OF A CONVENTIONAL DATA LAKE

Using both a data lake and a data warehouse, however, comes with some disadvantages. Before data is available for analytics in a DWH, the data analysis and data engineering teams must coordinate closely to decide which data is needed in which schema, which creates significant communication overhead. Furthermore, once a schema is defined in a data warehouse, it is complicated to change. Finally, since the data analysis team does not have a clear view of all the raw data in the data lake, use cases are easily overlooked.

Moreover, a data lake can quickly become unusable if it is not maintained properly, turning into a data swamp. Causes include inappropriate data structures, broken relationships between data, a growing volume of data, and relevant data that can no longer be found efficiently. Furthermore, data quality is often poor due to missing ACID transactions.
In this state, the data lake is no longer beneficial to your organization.

Other disadvantages are the maintenance effort for multiple systems, analytics on outdated data due to lengthy ETL processes (82% of all analysts work with outdated data), and limited support for machine learning analytics, because these can only be run against the raw data in the data lake.

Due to these issues, a new architecture was introduced in 2020 that solves the problems mentioned above: the Lakehouse.

Get a head start by centralizing your data today!

At evoila, we help you design, build, and operationalize your Data Lakehouse architecture so you can adopt a data-driven decision-making approach and get the most value out of your data!

The Lakehouse architecture

The Lakehouse combines a data lake’s economy with a data warehouse’s performance and management features.

Like before, a data lake is used as storage for raw data. It is, however, complemented with a data management and governance layer: Delta Lake.

In this case, a Spark library from Databricks is used that adds DBMS management features to the data lake, as in a DWH. Available features include ACID transactions, versioning, schema enforcement, metadata management, and rollback functionality.
Beyond these features, the raw data must be conditioned for complex analytics. Delta Lake uses a multi-layer approach that structures raw data through incremental transformations in tables (delta tables).
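As a hedged sketch of what this management layer looks like in practice, the following assumes the open-source delta-spark package and a hypothetical storage path; it is an illustration rather than a reference setup.

```python
# Sketch: a Spark session configured for Delta Lake writes an ACID,
# schema-enforced table directly onto object storage. Path is hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

orders = spark.createDataFrame(
    [(42, "created", "2024-05-01T10:15:00Z")],
    ["order_id", "status", "ts"],
)

# ACID append: a later write with an incompatible schema is rejected
# (schema enforcement), and every commit is versioned in the transaction log.
(orders.write.format("delta")
 .mode("append")
 .save("s3://example-bucket/lakehouse/bronze/orders"))
```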

Raw and historical data

Incoming data from stream or batch processing is initially captured in raw format (bronze table).

Filtered, cleaned and enriched data

In the next step, the bronze table is filtered, cleaned, and enriched. The raw data normalized in this way is stored in the silver table and can be queried from there.

Data prepared for analysis

In a final step, silver tables can be aggregated. These business-level aggregations are stored in the gold table. They form the foundation for analytics and reports and allow for efficient queries.
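Taken together, the three layers described above might be wired up roughly as in the following sketch. The paths, column names, and transformation rules are illustrative assumptions, and the Delta-enabled Spark session from the earlier sketch is reused.

```python
# Illustrative bronze -> silver -> gold flow with PySpark and Delta Lake.
# Assumes the `spark` session configured for Delta Lake in the sketch above.
from pyspark.sql import functions as F

base = "s3://example-bucket/lakehouse"

# Bronze: raw events exactly as ingested.
bronze = spark.read.format("delta").load(f"{base}/bronze/orders")

# Silver: filtered, cleaned, and enriched records.
silver = (
    bronze.dropDuplicates(["order_id"])
    .filter(F.col("status").isNotNull())
    .withColumn("order_ts", F.to_timestamp("ts"))
)
silver.write.format("delta").mode("overwrite").save(f"{base}/silver/orders")

# Gold: business-level aggregation that reports can query efficiently.
gold = (
    silver.groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.count("order_id").alias("orders_per_day"))
)
gold.write.format("delta").mode("overwrite").save(f"{base}/gold/orders_per_day")
```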

Delta tables and MPP query engines

The delta tables, including their metadata, are stored in the object storage as immutable parquet files. In addition, a transaction log is created per delta table, which records all performed transactions in immutable JSON files. Besides the type of transaction, further metadata is stored, such as a timestamp and the responsible user (required for auditing).
If data is changed or deleted by a transaction, the affected parquet files are not deleted but only marked as removed. This makes it easy to roll back changes.
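Under the same assumptions as the earlier sketches (delta-spark, hypothetical paths), the transaction log and rollback could be exercised roughly like this:

```python
# Sketch: inspecting the transaction log and rolling back a delta table.
from delta.tables import DeltaTable

table_path = "s3://example-bucket/lakehouse/silver/orders"
table = DeltaTable.forPath(spark, table_path)

# The transaction log exposed as a DataFrame: one row per commit, including
# timestamp, user, and type of operation (useful for auditing).
table.history().select("version", "timestamp", "userName", "operation").show()

# Time travel: read the table as it looked at an earlier version. The old
# parquet files are only marked as removed, so they are still readable.
previous = (
    spark.read.format("delta")
    .option("versionAsOf", 1)
    .load(table_path)
)

# Roll back the table itself to that version.
table.restoreToVersion(1)
```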

An update of these delta tables is triggered when new data arrives (real-time processing) or on a fixed schedule (e.g., once per day).
The various transformations are automatically repeated whenever something changes, so the data in the gold table is always up to date and of high quality. The update pipeline can stay online permanently, which would not be possible with a DWH.
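As a rough sketch (the paths, checkpoint location, and trigger interval are assumptions), such an always-on update of the silver table could be expressed with Spark Structured Streaming:

```python
# Sketch of an always-on update pipeline: new bronze records are picked up
# incrementally and appended to the silver table.
cleaned = (
    spark.readStream.format("delta")
    .load("s3://example-bucket/lakehouse/bronze/orders")
    .filter("status IS NOT NULL")
)

(cleaned.writeStream
 .format("delta")
 .outputMode("append")
 .option("checkpointLocation", "s3://example-bucket/checkpoints/silver_orders")
 .trigger(processingTime="1 minute")  # or trigger(availableNow=True) for a scheduled batch run
 .start("s3://example-bucket/lakehouse/silver/orders"))
```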

Another part of the Lakehouse architecture are modern MPP query engines (massively parallel processing), such as Trino. These are external, distributed query engines that process data in memory and therefore enable extremely fast queries. Combined with a metadata store and a good data structure, the query performance of a DWH can be achieved.

These MPP query engines can be connected directly to the data lake. Thus, the advantages of data lakes and data warehouses can be combined in the Lakehouse.
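As an illustration, a gold table could be queried through Trino's Python client roughly as follows; the host, catalog, schema, and table names are hypothetical.

```python
# Sketch: querying a Lakehouse table with Trino's Python client
# (pip install trino). Trino reads the delta/parquet files from the lake.
from trino.dbapi import connect

conn = connect(
    host="trino.example.internal",
    port=8080,
    user="analyst",
    catalog="delta",   # catalog backed by the Lakehouse metadata store
    schema="gold",
)

cur = conn.cursor()
cur.execute(
    "SELECT order_date, orders_per_day "
    "FROM orders_per_day "
    "ORDER BY order_date DESC "
    "LIMIT 7"
)
for row in cur.fetchall():
    print(row)
```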

The advantages of Data Lakes and Data Warehouses in the Lakehouse

Supports any desired data structure

Data, whether structured, semi-structured, or unstructured, is persistently stored in highly scalable file storage (e.g., Amazon S3) and therefore remains available for later analyses.

Decoupling of compute and storage resources

Besides cost-efficient storage of data, this decoupling means that only as many compute resources are used as are needed to actually process the data at any point in time (resources scale with load).

Valid transactions

Any number of pipelines can read and write data concurrently without creating consistency issues (ACID compliance).

Supports various workloads

Data science, machine learning, SQL queries, and complex analytics: all these workloads can be connected to a Lakehouse with the proper tools, so all your applications work on the same data basis.

End-to-end streaming

Real-time applications do not require additional systems. They can be connected directly to the Lakehouse.

Transparency/Openness

The file formats (e.g., parquet) are open and standardized. They provide an interface that enables a variety of applications, including machine learning applications and Python/R libraries, to use the data directly.
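As a small illustration with a hypothetical path, a standard Python library can open such a table without going through a query engine at all; here the open-source deltalake package (delta-rs) is used as one example of such a library.

```python
# Sketch: the table is just open parquet files plus a JSON log, so standard
# Python tooling can read it directly. S3 credentials are assumed to be
# available in the environment.
from deltalake import DeltaTable

table = DeltaTable("s3://example-bucket/lakehouse/gold/orders_per_day")
df = table.to_pandas()  # a plain pandas DataFrame, ready for ML or plotting

print(df.head())
```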

BI support

BI applications can work directly on the files without intermediate steps. This in turn reduces the latency and cost that arise when managing copies in a hybrid architecture.

Our consultants get the best out of your data

Has the Lakehouse architecture convinced you? Do you want to deploy it in your organization, or do you have further questions about this topic? Contact us and we will advise you on the next steps.