For a long time, I didn't understand the concepts of Data Lake and Data Warehouse. I assumed they were the same thing: data storage where I could find the data and process it for my purposes.
I wasn't wrong, but there is a difference.
A Data Warehouse supports the flow of data from operational systems to analytical/decision systems by creating a single repository of data from different sources using various ETL processes.
Data sources are often very diverse and have different data representations, which may lead to divergent information (accounting, billing, and banking systems, for example). In addition, the wide variety of data models makes it difficult to get consolidated reports when a complete picture across all application systems is required. This is generally the main reason why Data Warehouse solutions appeared in the first place.
A Data Warehouse is often described as an electronic database where processed business data is stored, but this is not entirely true; things are a bit more complicated. A Data Warehouse has a complex multi-level architecture called LSA (Layered Scalable Architecture). In essence, LSA implements a logical division of data structures into several functional levels. The data is copied from level to level and transformed so that it eventually becomes consistent information suitable for analysis.
Primary Data Layer, or Staging
Here the data is loaded from the source systems in its original quality, and the full history of changes is saved. This layer also abstracts the subsequent storage layers from the physical representation of the data sources, how the data is collected, and how modifications are extracted. On this layer, ETL pipelines are usually used to transfer data from the source systems to the data warehouse.
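As a minimal sketch of what such a staging load might look like (the function and field names here are made up for illustration), each incoming record is appended with load metadata rather than overwriting anything, so the full history of changes survives:

```python
import datetime

# Hypothetical staging loader: rows from a source system are appended
# as-is, stamped with load metadata, so every historical version is kept.
def load_to_staging(staging: list, source_rows: list, source_name: str) -> list:
    load_ts = datetime.datetime.now(datetime.timezone.utc).isoformat()
    for row in source_rows:
        staging.append({
            "source_system": source_name,  # which system the record came from
            "loaded_at": load_ts,          # when this version was captured
            "payload": dict(row),          # the record in its original shape
        })
    return staging

staging = []
load_to_staging(staging, [{"id": 1, "amount": "10,50"}], "billing")
load_to_staging(staging, [{"id": 1, "amount": "10,75"}], "billing")  # later change
# Both versions of record 1 are retained; nothing is overwritten.
```

Note that the payload is stored untouched (even the locale-specific `"10,50"` amount); cleansing is deliberately deferred to the next layer.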
Core Data Layer
This is a kind of operational component that performs consolidation, normalization, deduplication, and cleansing of data from different sources, producing common structures and keys. This is where most of the data quality work and the general transformations happen, abstracting consumers from the peculiarities of the logical structure of the data sources and from the need to compare them. This is how data integrity and quality are ensured. Transformations and direct inserts of new data follow the data model.
The data model is a specification of all entities and objects in the corporate data warehouse database. It defines the entities and the relationships between them, the business area, and the entire database structure, from tables and the fields within them to partitions and indexes.
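To make the core-layer work concrete, here is a toy sketch (names and cleansing rules are invented for illustration) of consolidating records from two source systems into common keys, with cleansing and deduplication on the way:

```python
# Hypothetical core-layer step: customer records from different source
# systems are normalized to a common key format, cleansed, and deduplicated.
def consolidate(records):
    seen = {}
    for rec in records:
        customer_id = str(rec["id"]).strip().upper()          # common key format
        email = rec.get("email", "").strip().lower() or None  # cleanse
        # Deduplicate on the common key; last write wins in this sketch.
        seen[customer_id] = {"customer_id": customer_id, "email": email}
    return list(seen.values())

billing = [{"id": "a17 ", "email": "USER@EXAMPLE.COM"}]
accounting = [{"id": "A17", "email": " user@example.com "}]
core = consolidate(billing + accounting)
# The two divergent representations collapse into one consistent record.
```

In a real warehouse the survivorship rule ("last write wins" here) would itself be part of the data model, not an accident of load order.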
Data Mart Layer
On this layer, processed, cleansed, and consolidated data is converted into structures that are easy to analyze and use in BI dashboards or other consumer systems. Data marts provide different domain-specific views of the data and may take information from any of the previous layers.
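A data mart build can be as simple as aggregating already-clean rows into the exact shape a dashboard reads. A minimal sketch, with an invented revenue-per-region mart:

```python
from collections import defaultdict

# Hypothetical mart build: consolidated order rows are aggregated into a
# revenue-per-region structure that a BI dashboard can consume directly.
def build_sales_mart(orders):
    revenue = defaultdict(float)
    for order in orders:
        revenue[order["region"]] += order["amount"]
    return dict(revenue)

orders = [
    {"region": "EU", "amount": 120.0},
    {"region": "EU", "amount": 80.0},
    {"region": "US", "amount": 50.0},
]
mart = build_sales_mart(orders)  # {"EU": 200.0, "US": 50.0}
```

The point is that all the hard work (cleansing, key matching) happened in the core layer; the mart only reshapes trusted data for one audience.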
Service Layer
This layer controls all the layers described above. It doesn't contain business data, but it does operate on metadata and other data quality structures, enabling end-to-end data auditing, MDM, data governance, security, and load management. Monitoring and error diagnostics tools are also available here, which accelerates problem-solving.
To sum up, such systems can store reliable facts as well as statistics. Decision-makers in your company can obtain this information at any time it's needed to satisfy personal and business needs. It is particularly useful for financial management, strategic decisions, and sales. But storing data in a Data Warehouse is more expensive and time-consuming. The Data Warehouse is ideal for operational users because it is well structured and easy to use.
Although data warehouses can handle unstructured data, they can't do so efficiently. When you have a large amount of data, storing all of it in a database or data warehouse can be expensive. In addition, the data that comes into a data warehouse must be processed before it can be stored in some shape or structure. In other words, it must have a data model.
In response, businesses began to adopt Data Lakes, which store all structured and unstructured enterprise data at large scale in the most cost-effective way. A Data Lake stores data and can operate without the structure and layout of the data being determined beforehand. In the case of the Data Lake, the data is structured at the output, when you need to extract and analyze it. At the same time, the analysis process doesn't affect the data in the lake itself; it remains unstructured, so it can be conveniently stored and reused for other purposes. This way we get flexibility that the Data Warehouse doesn't have.
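This "structure at the output" idea is usually called schema-on-read. A minimal sketch (the event format and field names are illustrative): raw events land unchanged, and a schema is applied only when a consumer reads them:

```python
import json

# Schema-on-read: raw events are stored verbatim; structure is applied
# only at extraction time, and the raw data is never modified.
raw_events = [
    '{"user": "u1", "page": "/home", "ts": 1}',
    '{"user": "u2", "page": "/pricing", "ts": 2, "referrer": "ad"}',  # extra field is fine
]

def read_with_schema(raw, fields):
    # Each consumer picks the fields it cares about at read time.
    return [{f: json.loads(line).get(f) for f in fields} for line in raw]

pages = read_with_schema(raw_events, ["user", "page"])
```

Another consumer could later read the same `raw_events` with a different field list; no upfront data model was needed to store them.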
Thus, the Data Lake differs significantly from the Data Warehouse. However, the LSA architectural approach can also be used in the construction of a Data Lake (my representation).
- Raw level stores data in various formats (TSV, CSV, Parquet, JSON, etc.)
- On the operational level (core layer), data can be transformed into any required form. While in Raw the data is stored in its native format, here we select the format that fits best for cleansing. The structure is the same as on the previous layer, but it may be partitioned to a lower grain if needed.
- Data Mart. Data is transformed into consumable data sets and may be stored in files or tables. The purpose of the data, as well as its structure, is already known at this stage. Cleansing and transformations should be done before this layer. This layer is optional.
Data Lakes are often used to store information that is not yet used by analysts but is likely to be useful for the company in the future. However, if data lakes are poorly managed, they quickly accumulate huge amounts of uncontrolled, and most often useless, data. It's not clear where it came from and when, how relevant it is, and whether it can be used for analysis at all. This is how data swamps appear: useless and devouring company resources. To prevent the lake from becoming a swamp, the company must establish a data management process, i.e. data governance. The main part of this process is to determine the correctness and quality of the data even before loading it into the data lake. Therefore, when designing any data lake, first of all, it's necessary to decide what it is being built for.
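The pass from the Raw level to the operational level described above can be sketched like this (file content and field names are invented; a real lake would use HDFS or S3 paths rather than strings in memory). Raw keeps the file verbatim; the operational layer parses it into typed rows and quarantines what doesn't cleanse:

```python
import csv
import io

# Illustrative flow through the lake layers: the Raw level keeps the
# file exactly as delivered; the operational level parses and cleanses it.
raw_file = "id,amount\n1,10.5\n2,not-a-number\n"   # stored as-is in Raw

def to_operational(raw_text):
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_text)):
        try:
            rows.append({"id": int(rec["id"]), "amount": float(rec["amount"])})
        except ValueError:
            pass  # bad records are dropped (or quarantined) during cleansing
    return rows

clean = to_operational(raw_file)  # only the parseable row survives
```

Because `raw_file` is untouched, a later consumer can reprocess it with different rules, which is exactly the flexibility the lake is for.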
Check out the quick overview video with Adam Kocoloski, in which he walks through a data lake architecture and explains how data lakes help build ML systems for businesses.
- Different purposes. Data Warehouses are used by managers, analysts, and other business end-users, while Data Lakes are mainly used by Data Scientists and Data Engineers. Recall that a Data Lake stores mostly raw unstructured and semi-structured data: telemetry, graphics, logs of user behavior, site metrics, information-system data, and other data in various storage formats. These are not yet suitable for daily analytics in BI systems but can be used by data scientists to test new business hypotheses using statistical algorithms and machine learning methods.
- Different processing methods. ETL is the dominant processing paradigm in data warehousing: we extract data from a source or sources, clean it up, transform it into the structured information we need, and load it. With Data Lakes we use another paradigm, ELT (Extract, Load, Transform), because the transformation takes place at later stages and only when needed, not upfront.
- Different levels of understanding of the data. In a Data Lake, data is never rejected, because it is stored in an unprocessed format. This is especially useful in a big-data environment when you do not know in advance what insights the data analysis will yield. At the same time, the central database(s) is the foundation of the data warehousing environment. Usually, such databases are implemented on RDBMS technology, and therefore an in-depth, upfront design of the data model is required.
- Different approaches to design. Data Warehouse design is based on relational data handling logic: the third normal form for normalized storage, star or snowflake schemas for data marts. When designing a data lake, the Big Data architect and Data Engineer pay more attention to the ETL processes, taking into account the variety of sources and consumers of data. And the question of storage is solved quite simply: you only need a scalable, fault-tolerant, and relatively cheap file system, such as HDFS or AWS S3.
- Different price. Usually, a Data Lake is built on cheap servers with Apache Hadoop, without expensive licenses or powerful hardware, in contrast to the high maintenance costs and the large costs of designing and purchasing specialized Data Warehouse platforms such as SAP, Oracle, Teradata, etc.
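The ETL-versus-ELT difference from the list above can be shown on the same toy data (illustrative only; in practice the "load" targets would be warehouse tables and lake files, not Python lists). The operations are identical; only their order changes:

```python
# The same toy source data processed under both paradigms.
source = [{"name": " Alice "}, {"name": "Bob"}]

def transform(rows):
    # Example cleanup step: trim whitespace and standardize case.
    return [{"name": r["name"].strip().upper()} for r in rows]

# ETL: transform BEFORE loading; the warehouse only ever sees clean data.
warehouse = transform(source)

# ELT: load raw data first; transform later, and only when actually needed.
lake = list(source)            # loaded as-is, messy whitespace and all
clean_view = transform(lake)   # transformation deferred to consumption time
```

With ELT, if the cleanup rules change next year, `lake` can simply be re-transformed; with ETL, data discarded or reshaped at load time is gone.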
Source: luminousmen