For Storing Structured, Unstructured and Semi-structured Data
Many organizations today are opting for an
ELT (Extract – Load – Transform) approach, which highlights the importance of
Data Lakes. Unlike data warehouses, data lakes can store a vast amount
of raw data (structured, unstructured, and semi-structured) in their native
format. The data are ‘loaded’ with transformation.
The data structure is defined after loading, based on the use cases. Thus,
the data are transformed only after they have been loaded. For example, data from
multiple source systems can be stored in a single HDFS (Hadoop Distributed File
System) cluster. These are raw data which are not harmonized, indexed, or even
searchable at load time. A major benefit of the data lake is that one
need not connect to the live operational systems to access the data.
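
The load-then-transform sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a local directory stands in for the lake (a real deployment might use HDFS), and the source names, file names, and field names (`crm`, `webshop`, `name`, `total`, `user`, `price`) are hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_raw(lake_dir, source_name, payload, extension):
    """'Load' step: store the payload exactly as received, untransformed."""
    target = Path(lake_dir) / f"{source_name}.{extension}"
    target.write_text(payload, encoding="utf-8")
    return target


def read_orders(lake_dir):
    """'Transform' step, applied after loading: impose one structure
    on two differently shaped raw files for a single use case."""
    lake_dir = Path(lake_dir)
    rows = []
    # Hypothetical source 1 delivered CSV with columns "name" and "total".
    with (lake_dir / "crm.csv").open(newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            rows.append({"customer": record["name"],
                         "amount": float(record["total"])})
    # Hypothetical source 2 delivered JSON lines with keys "user" and "price".
    text = (lake_dir / "webshop.jsonl").read_text(encoding="utf-8")
    for line in text.splitlines():
        record = json.loads(line)
        rows.append({"customer": record["user"],
                     "amount": float(record["price"])})
    return rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as lake:
        load_raw(lake, "crm", "name,total\nAda,10.5\n", "csv")
        load_raw(lake, "webshop", '{"user": "Bob", "price": 4}\n', "jsonl")
        print(read_orders(lake))
```

Note that `load_raw` never inspects the payload; the structure is decided only in `read_orders`, and reading it touches the stored copies rather than the operational systems that produced them.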
In an enterprise, the analytics use cases often deal with multiple data
sources. A data lake with all the data loaded into it can become an
efficient data store on which these use cases can be built. Because the
data lake holds all the data in raw format, without transformation,
it is quite flexible when new use cases arise. Unlike data
marts, where one needs to construct an ETL process to prepare a data mart
suitable for a particular use case, the data lake can be reused any
number of times for any number of use cases, because the data load takes place
before the transformation.
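
To make the reuse argument concrete, the following minimal sketch derives two unrelated use cases from the same untransformed records. The event list and its fields (`user`, `page`, `ms`) are hypothetical and merely stand in for raw data already sitting in a lake.

```python
import json
from collections import Counter

# Hypothetical raw events, kept as the JSON strings the sources produced.
RAW_EVENTS = [
    '{"user": "ada", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/cart", "ms": 340}',
    '{"user": "ada", "page": "/cart", "ms": 200}',
]


def page_views(raw_events):
    """Use case 1: count views per page."""
    return Counter(json.loads(event)["page"] for event in raw_events)


def average_latency_ms(raw_events):
    """Use case 2: average response time, from the same raw events."""
    timings = [json.loads(event)["ms"] for event in raw_events]
    return sum(timings) / len(timings)


if __name__ == "__main__":
    print(page_views(RAW_EVENTS))
    print(average_latency_ms(RAW_EVENTS))
```

Neither function required a dedicated load; each applies its own transformation at read time, which is why a further use case could be added without touching the stored data.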
