Mastering Snowflake Platform: Generate, fetch, and automate Snowflake data as a skilled data practitioner (English Edition)
About this ebook
This book helps you grasp Snowflake, guiding you to create complete solutions from start to finish. The first part covers Snowflake architecture, key features, native loading and unloading capabilities, ANSI SQL support, and the processing of diverse data types and objects. The next part builds on this knowledge to implement data security, governance, and collaboration, using Snowflake features like data sharing and cloning.
The final part explores advanced topics, including streams, tasks, performance optimizations, cost efficiencies, and operationalization with automated monitoring. Real-time use cases and reference architectures are provided to assist readers in implementing data warehouse, data lake, and data mesh solutions with Snowflake.
Mastering Snowflake Platform - Pooja Kelgaonkar
CHAPTER 1
Getting Started with Snowflake
Introduction
The first chapter is the foundation of the book and covers the history of Snowflake, the need to implement it, and how to get started. This chapter guides you through setting up a trial or demo account to be used for the various hands-on activities covered in subsequent chapters. It also provides an overview of Snowflake certifications and various community events.
You will learn more about the Snowflake data platform's capabilities in the upcoming chapters of this book. Every chapter includes real-time use cases and references to explain each concept. We have also provided lab questions with every chapter to help you learn by example, along with a set of questions at the end of each chapter to check your knowledge.
This book is designed to help you understand Snowflake's features easily, so you can learn Snowflake seamlessly as you read through it. With every concept and practical lab, you will be better prepared to plan for certification.
Structure
This chapter consists of the following topics:
Why Snowflake?
History of Snowflake
Snowflake certifications
Snowflake community
Setting up a trial account with Snowflake
Connecting to Snowflake
Objectives
By the end of this chapter, you will be able to understand the need for a Snowflake data platform over traditional or enterprise platforms. You will also be able to set up a trial account for yourself. This trial account will be used to perform various exercises throughout this book.
Why Snowflake?
In an earlier era, setting up a data platform required tedious effort to design the system, estimate its capacity, and purchase the hardware, software, or appliances needed to support the data workloads. These ecosystems had their own limitations in terms of scaling (horizontal versus vertical), cost efficiency, performance efficiency, and operational cost, along with the heavy maintenance burden of support teams, upgrades, patches, end-of-life activities, and so on.
The cloud has broken down almost all of these limitations, and the majority of legacy ecosystems have moved to it. The cloud offers scalability, efficiency, low operational overhead, and low or no maintenance cost. Cloud-native and SaaS services replaced the legacy ecosystems; however, they have their own limitations, dependencies, and vendor lock-in. For example, if you are using Google Cloud BigQuery, you are tied to Google Cloud, though it can be integrated with other platforms. There are also limitations in the types of workload most cloud-native or managed services support: data lake, data warehouse, data analytics, and data science are often treated as separate workloads, designed separately, and then integrated with each other. There are various data platforms available in the market to support data workloads as per business requirements.
Snowflake eliminated the need to design and define data workloads separately. With Snowflake, you can use the same data platform to cater to all types of workloads: data lake, data warehouse, data analytics, and data science. It also caters to Unistore, a workload that combines analytical and transactional processing.
You need to learn about Snowflake and its unique offerings, as it is one of the leading data platforms in the market. You can learn Snowflake to get started on your data career journey or to change your career path. Learning Snowflake is worthwhile, considering the ample opportunities available in the market and Snowflake's adaptability.
History of Snowflake
Snowflake is a one-stop solution for a variety of workloads. Its all-in-one platform enables organizations to quickly set up a centralized data platform. This platform can be used to generate value from the data stored within it, extending implementations to various applications with data protection, security, and compliance.
The journey started in 2012, when the founders met for the first time with a vision of building a data warehouse for the cloud from scratch to unlock the potential of unlimited analytics over heterogeneous data. They aimed to build a solution that is not only secure and impactful but also cost-effective and easy to manage.
Within three years, Snowflake's data warehouse, built from scratch on the cloud, became available in 2015. Snowflake's unique, cloud-agnostic architecture disrupted the data warehousing market. With Snowflake, data engineering also shifted from technical to business-oriented implementations. This made data analytics simpler, helping users generate statistics from their data, which in turn helped organizations make data-driven decisions.
Another three years later, in 2018, Snowflake introduced data sharing. This is a critical feature used to share data with internal or external stakeholders under appropriate access controls and security. Interestingly, you can share data with Snowflake as well as non-Snowflake users. You will learn more about this in Chapter 12: Data Sharing.
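As a quick preview, the provider-side flow looks roughly like the following minimal sketch, using hypothetical database, table, and account names; the full workflow is covered in Chapter 12:
USE ROLE ACCOUNTADMIN;
-- Create a share and grant it access to a database, schema, and table
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE retail_poc_db TO SHARE sales_share;
GRANT USAGE ON SCHEMA retail_poc_db.public TO SHARE sales_share;
GRANT SELECT ON TABLE retail_poc_db.public.orders TO SHARE sales_share;
-- Make the share visible to a consumer account (hypothetical identifier)
ALTER SHARE sales_share ADD ACCOUNTS = myorg.consumer_account;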
In 2021, Snowflake announced its expansion to support the wider category of data engineering with Snowpark, a development framework that makes it simpler to design and develop data engineering workloads on Snowflake. It can also be extended to data science capabilities. You will learn more about this in Chapter 6: Understanding Snowpark.
Also in 2021, Snowflake announced Snowflake organizations. With organizations, it is easy to manage multiple accounts for the same customer: you can group the active accounts you need under one organization and track utilization and usage at the account as well as the organization level.
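For instance, once the ORGADMIN role is enabled for your account, listing the accounts in your organization is a single command. A minimal sketch, assuming your user has been granted ORGADMIN:
USE ROLE ORGADMIN;
-- Lists all accounts in the organization, with their regions and editions
SHOW ORGANIZATION ACCOUNTS;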
In 2022, Snowflake added the security data lake, a type of workload that enables full visibility into security logs. Snowflake's newest workload, Unistore, is a unique platform where you can combine the power of transactional and analytical operations.
Snowflake certifications
Snowflake offers basic and advanced-level certifications. Earning a certification badge definitely adds more weight to your profile.
Snowflake SnowPro Core is the foundation-level certification exam. Snowflake also offers five advanced certifications based on your role. You can refer to Figure 1.1 for the basic and advanced certifications available:
Figure 1.1: Snowflake certifications
You will need to complete the foundational SnowPro Core Certification before you appear for the advanced certifications. Once you pass, the certification is valid for two years, and you can take a re-certification exam to renew it for another two years. Passing an advanced certification automatically renews your SnowPro Core Certification for the next two years. Advanced certifications are available for various roles: Data Engineer, Architect, Administrator, Analyst, and Data Scientist. You can appear for the one matching your role and experience. You can find more details about certifications here: https://www.snowflake.com/certifications/.
Snowflake community
Snowflake runs various community initiatives. It is one of the most active communities, where many users contribute and connect with other members.
There are Snowflake user groups that you can join if you are interested in connecting with other users. User groups hold events in person as well as virtually across various regions. You can find more details here: https://usergroups.snowflake.com/.
Snowflake runs the Data Superheroes community program every year. This program is for Snowflake experts who are highly active in the community. Active contributors are recognized as Data Superheroes, and Snowflake announces the members of this elite group at the beginning of the year. You can learn more about this program here: https://medium.com/snowflake/all-you-need-to-know-about-snowflake-data-superheroes-a36914e2e614. Refer to the following figure:
Figure 1.2: Snowflake Data Superheroes
You can start contributing to the community as an individual user or through your partner account. Various trainings are available on the community portal as well as the partner network portal. This book is one guide to learning and getting started with Snowflake; once you understand the concepts and complete the practical labs, you can refer to the quickstarts to explore data engineering, data lake, and other workload-specific use cases.
You can also use Snowflake's documentation to get started. There are quickstart labs that enable users to perform hands-on exercises based on workload type and learn Snowflake. You can find use case-specific hands-on labs here: https://quickstarts.snowflake.com/.
Setting up a trial account with Snowflake
Snowflake offers a 30-day trial account with $400 of credits to practice, perform hands-on labs, and learn easily. Snowflake offers three editions on all three public cloud platforms, which users choose between while setting up the trial.
You can follow these steps to set up a trial account:
Open the link: https://signup.snowflake.com/.
Fill in the following details in the sign-up form, as shown in Figure 1.3:
First name: Your first name.
Last name: Your last name.
Email: Specify your office or personal email to set up the account.
Company: Provide your company name.
Role: Provide your role within your organization.
Country: Specify your country of residence.
Figure 1.3: Snowflake trial account sign-up form, page 1
Click CONTINUE, then select the Snowflake edition, cloud provider, and region, and tick the checkbox to agree to the terms and conditions before you click GET STARTED, as shown in the following screenshot:
Figure 1.4: Snowflake sign-up form, page 2
Once you click GET STARTED, an account link is created for connecting to the trial account. A Snowflake account is accessible via a URL; save this URL and your credentials so you can connect to the trial account every time you want to use it.
Your Snowflake account link will look like this: https://<account_identifier>.snowflakecomputing.com
Connecting to Snowflake
Once the trial account is set up, you can use the Snowflake account URL to connect to the instance. You can connect to Snowflake using a CLI as well as a web UI. Snowflake offers two versions of the web UI: the Classic Console and Snowsight. SnowSQL is the CLI used to connect to Snowflake from a command prompt. You will learn more about installing, configuring, and connecting in the upcoming sections.
Snowflake Web User Interface
There are two different web user interfaces (UIs) available: the Classic Console and Snowsight. Currently, users can use both and choose either one as the default.
Classic console
This is the web UI that users worked with until Snowsight was launched. The Classic Console offers a wide set of features, attributes, and functionalities, and it looks like Figure 1.5 once you log in:
Figure 1.5: Classic Console
The top portion with options and buttons is referred to as the ribbon, as shown in the following screenshot:
Figure 1.6: Classic Console – Ribbon
The ribbon options vary based on the role selected for the session. Each worksheet corresponds to a separate session in Snowflake, and users can set the context independently for each one. Context can be set using console options as well as SQL commands. You will learn more about this in the upcoming chapters.
Context setting
You can set the context from the console using the drop-down on the left-hand side, as shown in Figure 1.7:
Figure 1.7: Context setting on the Classic Console
You can also use SQL commands:
USE ROLE DATA_ENG;
USE WAREHOUSE DEV_POC_WH;
USE DATABASE RETAIL_POC_DB;
USE SCHEMA RETAIL_BATCH_POC;
Run them in the worksheet to set your session context. They apply only to the currently active session, that is, the worksheet open in the Classic Console, as shown in Figure 1.8:
Figure 1.8: Context setting using SQL commands
Snowsight
Snowsight is the new web UI, introduced in 2021, and is the default UI for newly created accounts. It is set to fully replace the Classic Console, which will be deprecated.
Snowsight includes all the features and functionalities offered by the Classic Console, with many new features added. It has a separate navigation menu on the left-hand side, as shown in Figure 1.9:
Figure 1.9: Snowsight web UI
The navigation menu has the following options:
Worksheets: This panel shows all the worksheets created by you or shared with you by other users. You can also organize them into folders. The Classic Console does not have this capability to manage and share worksheets with other developers or users.
Dashboards: This panel shows all the dashboards created and shared by you and other users. This is a new feature of Snowsight with which you can create dashboards leveraging data in Snowflake tables.
Data: This panel lists all the databases, tables, and objects created within your account that your current role is allowed to access.
Marketplace: This is the data marketplace, the same as in the Classic Console.
Activity: This panel provides easy access to the query history, copy history, and task history; the same history can also be queried in SQL, as sketched below.
Admin: This panel is accessible to the account administrator or to users with the required privileges. It covers warehouses, accounts, users, roles, usage, billing, and Partner Connect.
You can use the Classic Console option on the left panel to switch back to the Classic Console.
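For example, the query history shown in the Activity panel can also be retrieved with the QUERY_HISTORY table function. This is a minimal sketch; it assumes a database is set in the session context and a warehouse is running:
-- List the ten most recent queries visible to the current role
SELECT query_id, query_text, start_time, total_elapsed_time
FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY())
ORDER BY start_time DESC
LIMIT 10;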
Command line interface
Snowflake supports access over a Command Line Interface (CLI). You can use the web UI as well as the CLI to connect, perform required operations, and develop and test your data pipelines.
The web UI is configured by default and becomes accessible as soon as the account is set up. However, users need to download and set up the CLI before connecting to Snowflake with it.
Download and set up SnowSQL
You can download and install SnowSQL on your local machine. Follow these steps to set up the CLI:
Download the CLI from the web UI via Help | Download, as shown:
Figure 1.10: Download window
You can also download it from a Windows or Linux command line: use the curl command from the prompt to download the package and then install it.
Connecting to Snowflake
You can use the snowsql command to connect to Snowflake. Refer to the following details for the command and the parameters to set while configuring the connection:
snowsql [options]
The commonly used connection parameters and their descriptions are shown in the following table:
Table 1.1: SnowSQL connection parameters
-a, --accountname: Name of the Snowflake account to connect to
-u, --username: Username to log in with
-d, --dbname: Database to use for the session context
-s, --schemaname: Schema to use for the session context
-r, --rolename: Role to use for the session
-w, --warehousename: Warehouse to use for the session
-c, --connection: Named connection profile from the config file
Command to connect to Snowflake:
snowsql -a <account_name> -u <user_name>
Once you run this command, it prompts for a password and connects you to Snowflake for an interactive session. Specifying and storing passwords on the command line is a security concern, so you can instead define a connection profile in the SnowSQL config file and reference it while connecting to Snowflake. You can use the following connection config template:
[connections]
#accountname =
#username =
#password =
#dbname =
#schemaname =
#warehousename =
#rolename =
#authenticator =
This is a sample connection profile:
[connections.sample_test_connection]
accountname = testorg-testaccount
username = testuser
password = xxxxxxxxxxxxxxxxxxxx
rolename = DATA_ENG
dbname = snowdbname
schemaname = public
warehousename = test_wh
You can save this connection profile and use this command to connect to Snowflake:
$ snowsql -c sample_test_connection
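After connecting, you can confirm that the profile's context was applied by running the standard context functions; a minimal sketch:
-- Returns the role, warehouse, database, and schema of the current session
SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA();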
Conclusion
This chapter provided a summary of Snowflake's unique offerings, along with a brief history of its origin and the fundamental concepts used to build it from scratch on the cloud. It also introduced Snowflake certifications and community events. Most importantly, you learned how to set up a trial account and configure connections so you can use this account for the labs in this book. You will learn more about Snowflake as you go through the next chapters.
In the next chapter, you will learn about traditional platforms and their challenges, and how Snowflake helps overcome them with its unique three-layer architecture. It is the foundational chapter for getting started with the technical aspects of Snowflake by understanding the three layers of architecture and services.
Points to remember
Following are the key takeaways from this chapter:
Snowflake is built from scratch on the cloud.
Snowflake is a data platform that supports designing various workloads, such as data warehouse, data lake, data analytics, and many more.
Snowflake offers a variety of certifications depending on your role.
Snowflake offers a trial account with $400 of credits to get started for free.
Users can set up one or multiple accounts with their choice of cloud provider: AWS, GCP, or Microsoft Azure.
You can connect to a Snowflake instance via a web URL and use the Classic Console or Snowsight web UI.
You can also connect to Snowflake through a programming interface or over the CLI using SnowSQL.
Practical
Set up a Snowflake trial account with these options:
Cloud: AWS
Region: US/Canada/Asia
Edition: Enterprise
Multiple choice questions
What kind of workloads are supported by Snowflake?
Data Lake
Data Warehouse
Both A and B
Only analytical workloads
Select what best describes Snowflake:
Cloud-agnostic
Built from Scratch on the cloud
MPP on Cloud
Extending Cloud capabilities to the data platform
When was Snowflake founded?
2018
2015
2021
2012
How many advanced certifications are available with Snowflake?
2
5
4
3
Select TRUE or FALSE:
Snowflake SnowPro Core is the first-level certification required before enrolling for an advanced certification.
TRUE
FALSE
Answers
Join our book’s Discord space
Join the book's Discord workspace for the latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:
https://discord.bpbonline.com
CHAPTER 2
Three Layered Architecture
Introduction
This chapter helps the reader get started with Snowflake through a detailed view of its architecture. It covers various data platform architecture challenges and how Snowflake helps overcome most of them. The chapter defines the three layers of the architecture, their distinguishing features, and their setup.
Structure
This chapter consists of the following topics:
Traditional data warehouse challenges
Legacy data warehouse challenges
Typical data lake challenges
Snowflake architecture
Cloud services layer
Virtual warehouse layer
Storage layer
Cache in Snowflake
Objectives
By the end of this chapter, you will be able to understand the Snowflake architecture and its three layers: storage, compute, and cloud services. Snowflake's unique architecture distinguishes this data platform from other cloud-based offerings. You will also be able to relate common data challenges to how Snowflake is designed to overcome them.
Traditional data warehouse challenges
A data warehouse is a repository of data used for analysis, reporting, and dashboards, and it is designed to support decision-making. Data residing in the warehouse can be used to generate statistics that help drive the business and inform decisions based on the available data. Various types of analysis and reporting are possible, such as prescriptive and predictive analytics.
This centralized data warehouse is developed over a period of time by sourcing, transforming, and loading data from various sources. It serves as a Unified Data Platform (UDP) or Enterprise Data Warehouse (EDW) for various internal as well as external stakeholders. Data sourcing and consumption depend on the types of sources, frequency, accessibility, types of consumers, and so on. The data warehouse can be designed to store and organize data into various subject areas per data domain, such as sales, finance, and marketing.
Various vendors offer data warehouse solutions, such as Teradata, IBM Db2, and Netezza. These platforms support setting up warehouses and handle both structured and semi-structured data. They support ANSI SQL standards, so users can query data using SQL. Queries accessing and processing data on these platforms need constant optimization to ensure the required performance and to meet SLAs.
Data warehouse solutions follow Symmetric Multiprocessing (SMP) or Massively Parallel Processing (MPP) implementations, with Shared Disk, Shared Nothing, or Shared Memory architectures. These traditional legacy data warehouse systems differ from the Snowflake implementation. You can refer to the traditional architectures in Figure 2.1:
Figure 2.1: Shared Disk and Shared Nothing Architecture
Shared-disk architecture is an OLTP-style implementation where a single disk is shared across the various nodes in the system. On the other hand, shared-nothing architecture is widely used for data warehouse (OLAP) implementations, like Teradata and Netezza. Storage and compute are tightly coupled in these implementations: in case of data growth, you must purchase additional nodes to increase the size, capacity, and performance of the warehouse. You cannot increase compute (CPU) and storage separately.
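By contrast, Snowflake decouples storage and compute, so compute can be resized on its own. As a preview, a minimal sketch with a hypothetical warehouse name (virtual warehouses are covered in detail later in this chapter):
-- Scale compute up without touching storage; storage grows independently
ALTER WAREHOUSE dev_poc_wh SET WAREHOUSE_SIZE = 'LARGE';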
Legacy data warehouse challenges
The legacy data warehouse systems have typical challenges that include:
Infrastructure: Compute and storage capacity is limited to what is provisioned or pre-purchased.
Data sharing: Sharing data with consumer applications requires setting up processes to deliver it as file feeds or data feeds.
Consistency: Most data warehouse systems offer only eventual consistency for consumer application data availability or data replication.
Data security: Data classification, data masking, encryption, and security must be handled separately.
Operations and maintenance: High maintenance cost and operational dependencies.
Technology limitations: Developing products and features requires special skills and knowledge of specific tools and technologies, creating skilled-resource dependencies.
Typical data lake challenges
Like a data warehouse, you can also implement a data platform as a data lake. Data lake implementations became common after Hadoop and big data offerings arrived. Many legacy warehouse systems migrated their workloads to data lakes or used data lakes to store archived data as well as monthly roll data.
With data lake implementations, processing, storage, and analysis became efficient because commodity hardware could be used to set up the Hadoop ecosystem or clusters. If you compare warehouse appliance offerings with a Hadoop ecosystem built on commodity systems, the cost and scalability differences are easy to see. A Hadoop-based data lake follows an MPP architecture where you can add and remove processing nodes, and procurement is neither as costly nor as time-consuming as with appliance offerings. Refer to the following figure for a better understanding of the setup:
Figure 2.2: Data Lake – Hadoop Setup
In a legacy data lake implementation using commodity hardware, data nodes are connected to all edge nodes, and data can be accessed and shared between nodes.
Hadoop ecosystem
Hadoop is one implementation of a data lake. Like data warehouse systems, a data lake also ingests and processes data from various sources. A typical data lake architecture may not look very different from a warehouse architecture. It has the following components:
Source layer: Data sourced from various sources.
HDFS layer: Files or storage layer.
Processing layer: Transformations, processing and engineering pipelines.
Data marts: Tables and data serving as the target or consumer layer, shared with consumer applications.
Consumer layer: Applications, end users, and business users who access data from the marts and the lake using APIs or SQL queries for reporting, dashboarding, or end-user applications.
The following figure illustrates the Hadoop ecosystem:
Figure 2.3: Hadoop Ecosystem
Though the overall cost of a data lake allows users to add nodes to the cluster cheaply, it still carries maintenance, operationalization, upgrade, and patching burdens, technology dependencies, and the need for special skills to develop data engineering workloads: Spark, MapReduce, Hive, Sqoop, and so on.
Data lake systems follow the shared-nothing architecture, with data nodes and edge nodes defined to manage workloads. These systems also enabled users to bring in unstructured data: columnar data, documents, images, videos, and so on. The platform also has the capability to implement predictive analytics using machine learning models.
Extended support to NoSQL
The Hadoop ecosystem also supports NoSQL solutions, which let users store nonrelational data without needing a specific schema. NoSQL is read as Not Only SQL, and it is a great choice for storing unstructured data like web or log data, social media data, images, videos, or geospatial data.
Based on the type of data to be stored, NoSQL databases are categorized into four types: document stores, columnar stores, key-value stores, and graph databases. A document store can be implemented using MongoDB, where documents are stored as JSON. Key-value stores are databases where information is stored as key-value pairs; DynamoDB can be used to implement them. A columnar store is a database where data is stored in columns and column families instead of rows; there is no schema enforcement on column families, and each column family can have its own separate columns. HBase and Cassandra are examples of columnar implementations. Graph databases store data in one-to-many and many-to-many relations that can be represented as graphs; Neo4j is used for graph database implementations.
NoSQL can be chosen based on the type of data and the formats required for processing. However, these stores may face performance issues when analytical processing is required on top of the datasets. Users may not be able to rely on ANSI SQL standards or SQL queries to query data in NoSQL, and specific skills are required to access the data.
Following are the benefits of a data lake over a data warehouse:
Reduced infrastructure cost: Cost-effective cluster implementation.
Broader data support: Unstructured data is supported.
Extended support for predictive analytics: Support for machine learning implementations.
Despite these additional benefits over legacy data warehouse systems, data lake implementations using legacy ecosystem components have their own limitations and challenges. The typical challenges of legacy data lake systems are similar to the data warehouse limitations and include:
Infrastructure: Compute and storage capacity is limited to what is provisioned or pre-purchased, with scalability fixed at the available nodes.
Data sharing: Sharing data with consumer applications requires setting up processes to deliver it as file feeds or data feeds.
ANSI SQL: Hive supports most DDL and DML statements; however, there are limitations based on the ecosystem setup. Supporting OLTP and OLAP may not be as simple as it is with