The Data Lakes: A Leap Forward Future of Data Warehousing
Abstract:- With the rise of data and technological advancements, organizations are more interested than ever in exploring ever-growing volumes of data. As data grows, so does the range of what can be analyzed and derived from it. An organization needs a central data repository that serves as a single trustworthy source. A data lake benefits a company by helping it make data-driven decisions and identify the right business strategy. Unlike data warehouses built for specific use cases, a data lake can be built for broader use cases addressing current or future business needs. Data lakes are a stepping stone in the data exploration journey, and they have come a long way from traditional databases and data warehouses. This research paper describes the data lake architecture, functionality, and ways to build it. To build a lake, this paper examines Amazon Web Services (AWS) and the various tools it provides for this purpose. Every organization today should strongly consider data lakes and their advantages.

Keywords:- Data Lakes, Data Warehouse, Database, Analytics.

I. INTRODUCTION

The data lake is a centralized store of raw data in structured, semi-structured, and unstructured formats [1]. Structured data comes from relational databases, semi-structured data includes CSV and JSON files, and unstructured data includes images, video, and PDF files. An organization can bring such data from various systems into a single place to perform rigorous analytics and derive valuable insights for data reporting, visualization, and machine learning. Data lakes have proven beneficial to organizations, which gain a competitive advantage by learning from data insights and acting on growth opportunities. A data lake can be hosted on-premises on the company's infrastructure or in the cloud by AWS, Google, or Microsoft. These cloud systems store terabytes or petabytes of data in object storage, where objects are cheap to store, uniquely identified, and accessible across multiple regions.

II. RELATED RESEARCH

The data lake has been an intriguing topic for data practitioners as the use cases and how we understand data have evolved. The research below has been conducted on the data lake.

Data Lake: A New Ideology in the Big Data Era
This research [2] is focused on the overall concept of a data lake and its architecture approach. The concept refers to including all the source data from various source systems in different formats. The architecture elaborates on new technology such as Apache Hadoop, which is divided into two main components: HDFS (Hadoop Distributed File System) and MapReduce. HDFS provides a single point of access and scalable storage, keeping data in blocks, and MapReduce processes the data as key-value pairs.

Data lakes in business intelligence: reporting from the trenches
This research [3] is an exploratory study on understanding the data lake, based on 12 interviews with data practitioners. It concluded that the data lake is not a replacement for the data warehouse and should be considered an extension. It further adds that the data lake could be used as a staging area for the data warehouse. The inherent business uses could differ for data warehouses and data lakes, where a data warehouse is more specific to the business needs, and a data lake could be open-ended with evolving requirements.

An Overview of Data Warehouse and Data Lake in Modern Enterprise Data Management
This study [4] discusses various architectural factors such as governance, metadata, stewardship, orchestration, and the ETL layer. The design aspects also consider data modeling, on-premises vs. cloud models, and ETL vs. ELT design. It lists several platforms that can be used to build the data lake, such as AWS, Google, and Azure.

III. DATA WAREHOUSE VS DATA LAKE

Data warehouses (DWH) have been running in industry for decades and have satisfied many different business cases. So, we need to understand what makes the data lake different and how it can be used, as described by AWS [5].

Data: A data warehouse consists of a relational database that stores structured data, whereas a data lake is built for structured, semi-structured, and unstructured data.
Schema: The schema is pre-defined in the data warehouse due to the table and database structure (schema-on-write); in the data lake, the schema is flexible and dynamic and is applied when the data is read (schema-on-read).
Performance: Query response is faster in the data warehouse due to its pre-defined table structure; in the data lake, storage and compute are separated, so query response takes longer.
Business Case: A DWH is built for a pre-identified use case that supports business intelligence and visualization; data lakes, however, are used for analytics, machine learning, identification of data patterns and trends, and historical reporting.
Data Quality: A DWH holds high-quality, clean data, whereas a data lake holds raw data.

IV. DATA LAKE ON AWS

There are various ways to build a data lake, either on-premises or in the cloud. For an on-premises build, all the technical development falls to the company's internal software development team, which requires many hours of planning, development, testing, and execution and can increase the time to market. On the other hand, cloud providers offer many fully managed services that users can use from day 1, resulting in quick delivery with expert support from the provider. Amazon Web Services (AWS) is one of the cloud providers that offers a variety of services and tools for building data lakes [6], such as the following:

AWS Lake Formation
Lake Formation is a fully managed AWS service that ingests data from multiple sources, catalogs it, manages permissions, and makes it available for further processing in EMR, Athena, or Redshift. It also secures and governs data through access control and permission management. Lake Formation provides fine-grained security, allowing column-, row-, and cell-level access control. The architecture is shown below.
Data Ingestion: Lake Formation can ingest data from various sources. As shown, it gets data from S3, an object store holding various file types and formats, and from relational databases or NoSQL data sets. It first fetches the metadata to understand the schema and then brings in the data, supporting both bulk and incremental loading. For external databases such as PostgreSQL, MySQL, or MS SQL Server, a Java Database Connectivity (JDBC) connection is used.
Data Catalog: AWS Glue pulls the tables' metadata. The catalog lists all the databases with their tables underneath, with all columns and data types identified. We can label and share this catalog and, above all, use it to control column- or row-level data access.
Security Management: It grants permission to access row-, column-, and cell-level data and hides sensitive data from broad access. These permissions and policies are part of IAM (Identity and Access Management) systems and flow through all the tools that use Lake Formation data. For instance, users who access data in Redshift or Athena will only see the data assigned to them in Lake Formation security management. Audit trails are also present: access, changes to access, and history are logged in the CloudTrail service. In addition, a tagging mechanism is present to tag specific changes or policies for text-based search and recording of the event.
Data Sharing: This feature does not require data transfer. Instead, it sets up permissions on other data stores, such as S3, Redshift, or AWS Data Exchange, allowing resources to be managed and shared across organizations.
Permission Management: Whenever users need to query S3 files, they can use the Athena tool, which follows the steps below.
Get Metadata: When a user runs a query, the analytical engine identifies the requested table and sends the metadata request to the Data Catalog.
Check Permissions: The Data Catalog checks the user's permissions and returns the metadata if the permissions are granted.
Get Credentials: Temporary access is granted if the requested table is registered in Lake Formation.
Get Data: The analytical engine fetches the data from S3 with a view filtered per the permissions and presents it to the user.
If the table is not part of the Lake Formation catalog, the data is retrieved based on the user's S3 IAM permissions.
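
The four-step query flow described here (Get Metadata, Check Permissions, Get Credentials, Get Data) can be illustrated with a small in-memory sketch. This is a toy model only: it makes no AWS calls, and the names ToyCatalog, Table, run_query, and the sample table are hypothetical and are not part of Lake Formation or any AWS SDK.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    columns: list
    rows: list                # each row is a dict keyed by column name
    registered: bool = True   # True if the table is governed by the toy catalog


@dataclass
class ToyCatalog:
    # Hypothetical stand-in for a data catalog with column-level grants.
    tables: dict = field(default_factory=dict)   # table name -> Table
    grants: dict = field(default_factory=dict)   # (user, table name) -> set of columns

    def get_metadata(self, user, table_name):
        """Get Metadata + Check Permissions: return only the granted columns."""
        allowed = self.grants.get((user, table_name), set())
        if not allowed:
            raise PermissionError(f"{user} has no access to {table_name}")
        return [c for c in self.tables[table_name].columns if c in allowed]

    def get_credentials(self, user, table_name):
        """Get Credentials: issue a short-lived token only for registered tables."""
        if not self.tables[table_name].registered:
            return None
        return f"temp-token:{user}:{table_name}"


def run_query(catalog, user, table_name):
    visible = catalog.get_metadata(user, table_name)
    token = catalog.get_credentials(user, table_name)
    if token is None:
        # Unregistered table: storage-level permissions would apply instead.
        raise LookupError(f"{table_name} is not registered in the catalog")
    rows = catalog.tables[table_name].rows
    # Get Data: return a view filtered down to the permitted columns.
    return [{c: row[c] for c in visible} for row in rows]


catalog = ToyCatalog(
    tables={"customers": Table(["id", "region", "ssn"],
                               [{"id": 1, "region": "EU", "ssn": "xxx"}])},
    grants={("analyst", "customers"): {"id", "region"}},
)
result = run_query(catalog, "analyst", "customers")
print(result)  # the ssn column is filtered out for this user
```

In this sketch the sensitive column never reaches the caller because filtering happens before data is returned, which mirrors the idea that the query engine presents only the view permitted by the catalog.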
AWS S3 (Simple Storage Service)
S3 is highly scalable object storage that offers data availability, security, and performance. It is low-cost and technically unlimited in storage capacity. S3 is the main storage for AWS data lakes. It provides storage monitoring using CloudWatch and CloudTrail, and S3 Storage Lens monitors S3 usage and access and recommends cost-saving and optimization policies. S3 comes in different storage classes based on requirements: S3 Intelligent-Tiering, S3 Standard, S3 Express One Zone, S3 Standard-Infrequent Access (S3 Standard-IA), S3 One Zone-Infrequent Access (S3 One Zone-IA), S3 Glacier Instant Retrieval, S3 Glacier Flexible Retrieval, S3 Glacier Deep Archive, and S3 Outposts. For high performance, S3 supports at least 3,500 requests per second to add data and 5,500 requests per second to retrieve data per prefix. For consistency, S3 provides strong read-after-write consistency by default on all objects. Even though it stores files, the S3 Select feature allows in-place queries on S3 files instead of loading them into other AWS tools or databases. S3 supports data transfer using Storage Gateway and DataSync, and data exchange using the AWS Data Exchange service for S3.

AWS Redshift
Redshift is a data warehouse tool that can load data from S3 using ETL jobs. It is a serverless, fully managed, highly available, and scalable service that uses SQL to analyze structured and semi-structured data. It is based on massively parallel processing to handle large data sets, and it separates storage and compute, so they are decoupled. Each can scale on demand based on the data and request load. To achieve high availability, it is deployed in multiple regions, and data is replicated periodically to serve backup and disaster recovery purposes. Redshift supports end-to-end data encryption, using AES-256 for data at rest. Advanced data masking is supported to protect sensitive data from wider exposure. To achieve quicker query retrieval, Redshift uses a caching mechanism that stores the results of frequently run queries. The data is stored in a columnar format, which is more effective than row-level storage for performing aggregate queries for analytics.

AWS Glue
Glue is a fully managed, serverless Extract, Transform, and Load (ETL) tool that quickly discovers, prepares, moves, and integrates data from multiple sources and loads it into S3 or Redshift for further analytical processing. AWS Glue has a crawler that connects to a database using a JDBC connection to fetch the schema and prepare a data catalog. Lake Formation uses this catalog to grant users permission and access.
Glue is a no-code ETL tool that connects to the source data and loads the data into S3 files based on the preferences set. However, it also allows customizing the code in Python or Spark for additional transformations to add business logic.
Glue jobs can be scheduled or event-driven, for example triggered by a new file landing in S3. Glue writes all job logs to CloudWatch, and jobs can be orchestrated with Glue workflows. Based on preference, jobs can be set up to load the data into S3 and Redshift by using the AWS Boto3 library to connect with the resources. Glue supports Python, Spark, and Scala, along with external libraries.

AWS Athena
Athena is another serverless, highly scalable, interactive analytics platform that supports querying multiple file formats, S3 files, or open table formats. It is built on the Trino and Presto open-source technologies and needs no setup, as no provisioning or configuration is required. Athena allows users to query S3 files and analyze the data; it is also used internally when S3 data needs to be fetched into another tool, such as a visualization tool. The customer doesn't need to manage any servers or infrastructure for Athena. No cluster tuning or hardware optimization is needed, as it already runs queries in parallel to achieve quick results. It just needs to be activated at the account level, and the cost is based on the amount of data scanned and the number of seconds queries run internally in AWS. Athena supports a federated query approach, which connects to more than 30 data sources to join tables and generate output.

File Formats
Choosing the correct S3 file format is important and depends on the data use case. Below are a few popular file formats that can be used.

JSON is a text-based, easy-to-read format built on key-value pairs.
Avro is suitable for real-time streaming and serialization, with data and schema stored together.
Parquet is a columnar format that is highly effective for aggregating and querying vast amounts of data.

Data Compression
In addition to file format, compression also plays a key role in file processing; it reduces the size of bigger files, saves space, and shortens data retrieval time. The most common types of compression are as follows:

Bzip2: Uses the Burrows-Wheeler transform and Huffman coding; it achieves a good compression ratio but requires more time and resources.
Gzip: Uses the DEFLATE algorithm; it compresses files with a well-balanced speed and compression ratio and is compatible with multiple systems.
Xz: Uses the LZMA2 algorithm; it achieves the highest compression but takes the most time.

File Partition
File partitioning is used for quicker data retrieval, as it acts similarly to a table partition in an RDBMS. The partition key should be the field regularly used to filter this data in SQL queries. For instance, if the user queries based on the account number and the account number is a limited set of numbers, then it's a good idea to partition by account number. If the user wants to perform daily analysis and see day-over-day patterns, partitioning by day makes the most sense. Once the file partition is finalized, the AWS Glue script must be updated so that every file-writing operation follows the partitioning pattern.

V. DATA SECURITY, GOVERNANCE, PII

The data in the lake should be secured and governed, and should avoid exposing PII (Personally Identifiable Information). Securing data should be the highest priority in today's world to stop data breaches. Lake Formation policy controls hide the PII-related columns in the lake, so users querying the data will not find any PII data. Additionally, all the data in S3 is encrypted with AWS or customer-provided keys, so anyone gaining unauthorized access cannot read the data without the proper keys. All global-level permissions in AWS are managed through the Identity and Access Management (IAM) service, which works on the least-privilege approach. IAM allows users, groups, and service users to be created and tied to policies for reading, writing, updating, etc. The policies are very granular and should vary from use case to use case. Apart from that, access controls at the Lake Formation level further restrict data based on different criteria for security, PII, and masking purposes. If Lake Formation policies are not set up, then by default, users get access to data based on their original IAM policy setup.

Challenges
Though data lakes are effective, they also have several challenges, some noted below.

Expensive to build for the organization due to the plethora of services required, both on-premises and in the cloud.
Long development process, as this will be a big initiative for any company and requires significant development and testing time from multiple teams.
The many new technical tools and deployments could be overwhelming for software development teams.

VI. CONCLUSION

Data lakes have many business and technical advantages if deployed correctly within strict timelines and budgets. Every data-driven organization should consider building a lake and running its analytics, data reporting, and machine learning algorithms on it. Unlike with data warehouses, the business doesn't need a use case ready when building a data lake; it can build the lake and use the data for analytical use cases that come up in the future.

Author
Bhushan Fadnis received an MS in Information Science from San Diego State University, USA, in 2017. He has more than 12 years of technology experience working in various MNCs and is now a Business Intelligence Engineer in a leading software company in the USA.
REFERENCES