Data Lake

For Storing Structured, Unstructured and Semi-structured Data

Many organizations are now opting for an ELT (Extract – Load – Transform)
approach, which is where data lakes become important. Unlike data
warehouses, data lakes can store vast amounts of raw data (structured,
unstructured, and semi-structured) in its native format. The data are
‘loaded’ without transformation.


The data structure is defined after loading, based on the use cases. Thus,
the transformation of the data occurs after the data are loaded. For
example, data from multiple source systems can be stored in a single HDFS
(Hadoop Distributed File System). These are raw data which are not
harmonized, indexed, or even searchable at load time. A major benefit of
the data lake is that one need not connect to the live operational systems
to access the data.
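
The load-first, transform-later idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a temporary local directory stands in for the data lake (it is not HDFS), and the file names, field names, and the helpers load_raw and read_orders are hypothetical. The raw payloads are written unchanged, and a structure is applied only when a use case reads them.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_raw(lake_dir: Path, name: str, payload: str) -> Path:
    """Load step: write the source payload unchanged, in its native format."""
    target = lake_dir / name
    target.write_text(payload, encoding="utf-8")
    return target


def read_orders(lake_dir: Path) -> list[dict]:
    """Transform step: apply a structure only when a use case reads the data."""
    orders = []
    # CSV export from one hypothetical source system
    with (lake_dir / "crm_orders.csv").open(encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            orders.append({"order_id": int(row["id"]), "amount": float(row["total"])})
    # JSON-lines export from another hypothetical source system
    text = (lake_dir / "web_orders.jsonl").read_text(encoding="utf-8")
    for line in text.splitlines():
        rec = json.loads(line)
        orders.append({"order_id": rec["orderId"], "amount": float(rec["amount"])})
    return orders


def demo() -> list[dict]:
    # A temporary directory is an assumed stand-in for the lake's storage.
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        load_raw(lake, "crm_orders.csv", "id,total\n1,19.99\n2,5.00\n")
        load_raw(lake, "web_orders.jsonl", '{"orderId": 3, "amount": "42.5"}\n')
        return read_orders(lake)
```

Calling demo() returns one harmonized list of orders, even though the two sources were stored with different formats and field names. The harmonization lives entirely in the read path, so the load step never has to change when a new use case needs a different structure.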


In an enterprise, analytics use cases often deal with multiple data
sources. A data lake with all the data loaded into it can become an
efficient data store on which these use cases can be built. Because the
data lake holds all the data in raw format, without transformation, it is
quite flexible when new use cases arise. Unlike data marts, where one needs
to construct an ETL process to prepare a data mart suited to a particular
use case, the data lake can be reused any number of times for any number of
use cases, because the data load takes place before the transformation.
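
As a rough illustration of this reuse, the sketch below runs two unrelated use cases against the same raw records. The records, field names, and function names are hypothetical, and an in-memory list stands in for the lake's storage. Each use case applies its own transformation at read time, so neither requires the data to be reloaded or reshaped in advance.

```python
import json
from collections import Counter, defaultdict

# Raw records as they were loaded: no harmonization, fields vary by record.
RAW_EVENTS = [
    '{"customer": "a", "amount": 10.0, "channel": "web"}',
    '{"customer": "b", "amount": 4.5, "channel": "store"}',
    '{"customer": "a", "amount": 2.5, "channel": "web", "coupon": "X1"}',
]


def revenue_per_customer(raw):
    """Use case 1: total order amount per customer."""
    totals = defaultdict(float)
    for line in raw:
        rec = json.loads(line)
        totals[rec["customer"]] += rec["amount"]
    return dict(totals)


def orders_per_channel(raw):
    """Use case 2: number of orders per sales channel."""
    return dict(Counter(json.loads(line)["channel"] for line in raw))
```

Both functions read the same untouched RAW_EVENTS. A third use case, for example one that looks at the optional coupon field, could be added later without changing how the data was loaded.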