
Mastering Snowflake Platform: Generate, fetch, and automate Snowflake data as a skilled data practitioner (English Edition)
Ebook · 1,059 pages · 5 hours


About this ebook

Handling ever-evolving data for business needs can get complex. Traditional methods create bulky, costly-to-maintain data systems. Here, Snowflake emerges as a cost-effective solution, catering to both traditional and modern data needs with zero or minimal maintenance costs.

This book helps you grasp Snowflake, guiding you to create complete solutions from start to finish. The first part covers Snowflake architecture, key features, native loading and unloading capabilities, ANSI SQL support, and the processing of diverse data types and objects. The next part builds on that knowledge to implement data security, governance, and collaboration using Snowflake features such as data sharing and cloning.

The final part explores advanced topics, including streams, tasks, performance optimizations, cost efficiencies, and operationalization with automated monitoring. Real-time use cases and reference architectures are provided to assist readers in implementing data warehouse, data lake, and data mesh solutions with Snowflake.
Language: English
Release date: January 12, 2024
ISBN: 9789355517470


    Book preview

    Mastering Snowflake Platform - Pooja Kelgaonkar

    Chapter 1: Getting Started with Snowflake

    Introduction

    The first chapter is the foundation of the book and covers the history of Snowflake, the need to implement Snowflake, and how to get started with it. It guides you through setting up a trial or demo account that will be used for the various hands-on activities covered in subsequent chapters. It also provides an overview of Snowflake certifications and community events.

    You will learn more about the Snowflake data platform's capabilities in the upcoming chapters of this book. Every chapter includes real-time use cases and references to explain each concept. We have also provided lab exercises with every chapter to help you learn by example, and a set of questions at the end of each chapter to check your knowledge.

    This book is designed to help you understand Snowflake's features easily, and you will learn Snowflake seamlessly as you read through it. With every concept and practical lab, you will be well prepared to plan for certification.

    Structure

    This chapter consists of the following topics:

    Why Snowflake?

    History of Snowflake

    Snowflake certifications

    Snowflake community

    Setting up a trial account with Snowflake

    Connecting to Snowflake

    Objectives

    By the end of this chapter, you will be able to understand the need for a Snowflake data platform over traditional or enterprise platforms. You will also be able to set up a trial account for yourself. This trial account will be used to perform various exercises throughout this book.

    Why Snowflake?

    In the earlier era, setting up a data platform required tedious effort to design the system, estimate its capacity, and purchase hardware, software, or appliances to build an ecosystem that supported the data needs. These ecosystems had their own limitations in terms of scaling (horizontal versus vertical), cost efficiency, performance efficiency, operational cost, and the huge maintenance costs of support teams, upgrades, patches, end-of-life (EOL) activities, and so on.

    The cloud has broken down almost all of these limitations, and the majority of legacy ecosystems have moved to the cloud. The cloud offers scalability, efficiency, low operational overhead, and little or no maintenance cost. Cloud-native and SaaS services replaced legacy ecosystems; however, they have their own limitations, dependencies, and vendor lock-in. For example, Google Cloud BigQuery runs only on Google Cloud, even though it can be integrated with other platforms. Most cloud-native or managed services are also limited in the type of workload they support. Data lake, data warehouse, data analytics, and data science are often treated as separate workloads, designed separately, and then integrated with each other. There are various data platforms available in the market to support data workloads as per business requirements.

    Snowflake eliminated the need to design and define data workloads separately. With Snowflake, you can use the same data platform to cater to all types of workload needs: data lake, data warehouse, data analytics, and data science. It also supports Unistore, a workload that combines analytical and transactional processing.
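
    As a brief preview of the Unistore workload (covered later in the book), hybrid tables let the same table serve transactional lookups and analytical queries. The following is a minimal sketch using a hypothetical ORDERS table; note that hybrid tables require a primary key:

    -- Hypothetical hybrid table: row-oriented storage for transactional access,
    -- still queryable alongside analytical tables.
    CREATE HYBRID TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT,
        order_ts    TIMESTAMP_NTZ,
        amount      NUMBER(10,2)
    );
    -- Point lookup (transactional) and aggregate (analytical) against the same table.
    SELECT * FROM orders WHERE order_id = 1001;
    SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id;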

    You need to learn about Snowflake and its unique offerings, as it is one of the leading data platforms in the market. You can learn Snowflake to get started on your data career journey or to change your career path. Given the ample opportunities available in the market and Snowflake's adaptability, learning it will be beneficial.

    History of Snowflake

    Snowflake is a one-stop solution for a variety of workloads. Snowflake's all-in-one platform enables organizations to quickly set up a centralized data platform. This platform can be used to generate value from the data stored within it, extending implementations to various applications with data protection, security, and compliance.

    The journey started in 2012, when the founders met for the first time with a vision of building a data warehouse for the cloud from scratch to unlock the potential of unlimited analytics over heterogeneous data. They aimed to build a solution that is not only secure and impactful but also cost-effective and easy to manage.

    Within three years, in 2015, Snowflake's data warehouse, built from scratch on the cloud, became available. Snowflake's unique, cloud-agnostic architecture disrupted the data warehousing market. With Snowflake, data engineering also shifted from technical to business-oriented implementations. This made data analytics simpler, helping users generate insights from data, which in turn helped organizations make data-driven decisions.

    Three years later, in 2018, Snowflake introduced data sharing. This is a critical feature used to share data with internal or external stakeholders with appropriate access controls and security. Interestingly, you can share data with Snowflake as well as non-Snowflake users. You will learn more about this in Chapter 12: Data Sharing.
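
    As a brief, hedged preview of what Chapter 12 covers in detail, a data provider creates a share, grants access to specific objects, and then adds consumer accounts. The object and account names below are hypothetical:

    -- Create a share and expose a database, schema, and table to it.
    CREATE SHARE sales_share;
    GRANT USAGE ON DATABASE retail_poc_db TO SHARE sales_share;
    GRANT USAGE ON SCHEMA retail_poc_db.public TO SHARE sales_share;
    GRANT SELECT ON TABLE retail_poc_db.public.daily_sales TO SHARE sales_share;
    -- Add a consumer account (organization_name.account_name) to the share.
    ALTER SHARE sales_share ADD ACCOUNTS = myorg.consumer_account;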

    In 2021, Snowflake announced its expansion into the wider data engineering space with Snowpark. Snowpark is a development framework that makes it simpler to design and develop data engineering workloads on Snowflake, and it can also be extended to support data science capabilities. You will learn more about this in Chapter 6: Understanding Snowpark.

    Also in 2021, Snowflake announced Snowflake organizations, which make it easy to manage multiple accounts belonging to the same customer. You can group the active accounts you need under an organization and track utilization and usage at both the account and the organization level.
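
    A minimal sketch of how this looks in SQL, assuming you hold the ORGADMIN role and have access to the shared SNOWFLAKE database (view and column availability may differ slightly in your account):

    -- List all accounts in the organization (requires the ORGADMIN role).
    SHOW ORGANIZATION ACCOUNTS;
    -- Review organization-wide usage from the ORGANIZATION_USAGE schema.
    SELECT *
    FROM SNOWFLAKE.ORGANIZATION_USAGE.USAGE_IN_CURRENCY_DAILY
    ORDER BY usage_date DESC;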

    In 2022, Snowflake added the security data lake, a workload that provides full visibility into security logs. Snowflake's newest workload, Unistore, is a unique offering that combines the power of transactional and analytical operations.

    Snowflake certifications

    Snowflake offers basic and advanced-level certifications. Earning a certification badge definitely adds more weight to your profile.

    Snowflake SnowPro Core is the foundational, first-level certification exam. Snowflake offers five advanced certifications based on your role. You can refer to Figure 1.1 for the basic and advanced certifications available:


    Figure 1.1: Snowflake certifications

    You will need to complete the foundational SnowPro Core Certification before you appear for the advanced certifications. Once you pass, the certification is valid for two years, and you can take a re-certification exam to renew it for another two years. Passing an advanced certification automatically renews your SnowPro Core Certification for the next two years. Advanced certifications are available for various roles: Data Engineer, Architect, Administrator, Analyst, and Data Scientist. You can appear for an advanced certification based on your role and experience. You can find more details about certifications here: https://www.snowflake.com/certifications/.

    Snowflake community

    Snowflake runs various community initiatives. This is one of the most active communities where many users can contribute and connect with other community members.

    There are Snowflake user groups that you can join if you are interested in connecting with other users. User groups hold events in person as well as virtually across various regions. You can find more details here: https://usergroups.snowflake.com/.

    Snowflake runs the Data Superheroes community program every year. This program is for Snowflake experts who are highly active in the community. The active contributors are recognized as Data Superheroes, and Snowflake announces members of this elite group at the beginning of the year. You can learn more about this program here: https://medium.com/snowflake/all-you-need-to-know-about-snowflake-data-superheroes-a36914e2e614. Refer to the following figure:


    Figure 1.2: Snowflake data superheroes

    You can start contributing to the community as an individual user or through your partner account. There are various trainings available on the community portal as well as the partner network portal. This book is one guide to learning and getting started with Snowflake. Once you understand the concepts and complete the practical labs from the book, you can refer to the quickstarts to navigate data engineering, data lake, and other workload-specific use cases.

    You can also use Snowflake's documentation to get started. There are also quickstart labs that let users work hands-on with workload-specific scenarios and learn Snowflake. You can find use case-specific hands-on labs here: https://quickstarts.snowflake.com/.

    Setting up a trial account with Snowflake

    Snowflake offers a 30-day trial account with $400 worth of credits to practice, perform hands-on labs, and learn easily. Users can choose from three Snowflake editions and any of the three public cloud platforms when setting up the trial.

    You can follow these steps to set up a trial account:

    Open the link: https://signup.snowflake.com/.

    Fill in the following details in the sign-up form, as shown in Figure 1.3:

    First name: Your first name

    Last name: Your last name

    Email: You can specify your office email or personal email to set up an account.

    Company: Provide your company name

    Role: Provide your role within your organization

    Country: Specify your country of residence.


    Figure 1.3: Snowflake trial account sign-up form, page 1

    Click CONTINUE to select the Snowflake edition, cloud provider, and region, and tick the checkbox to agree to the terms and conditions before you click GET STARTED, as shown in the following screenshot:


    Figure 1.4: Snowflake sign-up form, page 2

    Once you click GET STARTED, an account link is created for connecting to the trial account. The Snowflake account is accessible via a URL. Save this URL and your credentials; you will need them every time you want to connect to the trial account.

    Your Snowflake account link will look like this: https://<account_identifier>.snowflakecomputing.com/console/login#/ (this is for AWS-based accounts). For accounts on other clouds, GCP or Azure appears in the hostname alongside snowflakecomputing to distinguish the provider.
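
    Once you are logged in, you can confirm the account and region behind that URL by querying Snowflake's standard context functions:

    -- Confirm the account, region, and user for the current session.
    SELECT CURRENT_ACCOUNT(), CURRENT_REGION(), CURRENT_USER();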

    Connecting to Snowflake

    Once the trial account is set up, you can use the Snowflake account URL to connect to the instance. You can connect to Snowflake using a CLI as well as the web UI. Snowflake offers two versions of the web UI: the Classic Console and Snowsight. SnowSQL is the CLI used to connect to Snowflake from the command prompt. You will learn more about installing, configuring, and connecting to Snowflake in the upcoming sections.

    Snowflake Web User Interface

    There are two different web user interfaces (UIs) available: the Classic Console and Snowsight. Currently, users can use both, or choose one of them as the default.

    Classic console

    This is the web UI users worked with until Snowsight was launched. The Classic Console offers a broad set of features, attributes, and functionalities, and it looks like Figure 1.5 once you log in:


    Figure 1.5: Classic Console

    The top portion with options and buttons is referred to as the ribbon, as shown in the following screenshot:

    Figure 1.6: Classic Console – Ribbon

    The ribbon options vary based on the role selected for the session. Each worksheet corresponds to a separate session in Snowflake, and users can set a different context for each one. Context can be set using the console options as well as SQL commands. You will learn more about this in the upcoming chapters.

    Context setting

    You can set up the context from the console using the left-hand side drop-down, as shown in Figure 1.7:


    Figure 1.7: Context setting on Classic Console

    You can also use SQL commands:

    USE ROLE DATA_ENG;
    USE WAREHOUSE DEV_POC_WH;
    USE DATABASE RETAIL_POC_DB;
    USE SCHEMA RETAIL_BATCH_POC;

    Run them in the worksheet to set your session context. This applies only to the current active session, that is, the worksheet open in the Classic Console, as shown in Figure 1.8:


    Figure 1.8: Context setting using SQL commands
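
    A quick way to verify the context your session picked up is to query the standard context functions in the same worksheet:

    -- Verify the context set for the current worksheet session.
    SELECT CURRENT_ROLE(), CURRENT_WAREHOUSE(), CURRENT_DATABASE(), CURRENT_SCHEMA();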

    Snowsight

    Snowsight is the new web UI, introduced in 2021. It is the default UI for newly created accounts, and the Classic Console is planned for deprecation.

    Snowsight offers all the features and functionality of the Classic Console, with many new features added. It has a separate navigation menu on the left-hand side, as shown in Figure 1.9:


    Figure 1.9: Snowsight Web UI

    The navigation menu has the following options:

    Worksheets: This panel shows all the worksheets you have created or that other users have shared with you. You can also organize them into folders. The Classic Console does not offer this capability to manage and share worksheets with other developers or users.

    Dashboards: This panel shows all the dashboards created and shared by you and other users. This is a new feature of Snowsight with which you can create dashboards leveraging data within Snowflake tables.

    Data: This panel lists all the databases, tables, and objects created within your account, based on your access role.

    Marketplace: This is the data marketplace, the same as in the Classic Console.

    Activity: This panel consists of Query history, Copy history, and Task history details, all easily accessible from here; the same query history can also be retrieved with SQL, as sketched after this list.

    Admin: This panel is accessible to the account administrator or to users with the required privileges. It covers warehouses, accounts, users, roles, usage, billing, and Partner Connect.
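
    As mentioned under Activity above, recent query history can also be retrieved with SQL. The following is a minimal sketch using the INFORMATION_SCHEMA table function; the column list and row limit are illustrative:

    -- Fetch the ten most recent queries visible to the current role.
    SELECT query_id, query_text, user_name, total_elapsed_time
    FROM TABLE(INFORMATION_SCHEMA.QUERY_HISTORY(RESULT_LIMIT => 10))
    ORDER BY start_time DESC;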

    You can use the Classic Console option in the left panel to switch back to the Classic Console.

    Command line interface

    Snowflake supports accessing and connecting using a Command Line Interface (CLI). You can use Web UI as well as CLI to connect and perform required operations, develop, and test your data pipelines.

    The web UI is configured by default and made accessible to users as soon as they set up an account. However, users need to download and set up the CLI to connect to Snowflake.

    Download and set up SnowSQL

    You can download and install SnowSQL on your local machine. Follow these steps to set up the CLI:

    Download the CLI from the web UI via Help | Download, as shown:


    Figure 1.10: Download window

    You can also download it from your Windows or Linux command line. Use the curl command from the prompt to download the packages and install them.

    Connecting to Snowflake

    You can use the snowsql command to connect to Snowflake. Refer to the following details to learn more about the command and the parameters required when configuring the connection:

    snowsql

    Parameters and their descriptions are shown in the following table:

    Table 1.1: SnowSQL connection parameters

    Command to connect to Snowflake:

    snowsql -a <account_identifier> -u <username> -d <database_name> -s <schema_name> -P

    Once you run this command, it will prompt for a password and allow you to connect to Snowflake for an interactive session. As you know, specifying and storing passwords securely is critical. You can define a connection profile as part of the config file and reference it while connecting to Snowflake. You can use the following connection config template for the setup:

    [connections]

    #accountname =   # Account identifier to connect to Snowflake.

    #username =       # Username in the account. Optional.

    #password =       # User password. Optional.

    #dbname =         # Default database. Optional.

    #schemaname =     # Default schema. Optional.

    #warehousename = # Default warehouse. Optional.

    #rolename =       # Default role. Optional.

    #authenticator = # Authenticator: ‘snowflake’, ‘externalbrowser’ (to use any IdP and a web browser), https://<okta_account_name>.okta.com (to use Okta natively), ‘oauth’ to authenticate using OAuth.

    The following is a sample connection profile:

    [connections.sample_test_connection]

    accountname = testorg-testaccount

    username = testuser

    password = xxxxxxxxxxxxxxxxxxxx

    rolename = DATA_ENG

    dbname = snowdbname

    schemaname = public

    warehousename = test_wh

    You can save this connection profile and use this command to connect to Snowflake:

    $ snowsql -c sample_test_connection

    Conclusion

    This chapter provided a summary of Snowflake's unique offerings, along with a brief history of its origin and the fundamental concepts used to build it from scratch on the cloud. This introductory chapter also introduced Snowflake certifications and community events. Most importantly, you learned how to set up a trial account and configure connections so you can use this account for the labs in this book. You will learn more about Snowflake as you go through the next chapters.

    In the next chapter, you will learn about traditional platforms and their challenges. You will also learn how Snowflake helps overcome those challenges with its unique three-layer architecture. It is the foundational chapter for getting started with the technical aspects of Snowflake by understanding its three layers of architecture and services.

    Points to remember

    Following are the key takeaways from this chapter:

    Snowflake is built from scratch on the cloud.

    Snowflake is a data platform that supports designing various workloads, such as data warehouse, data lake, data analytics, and many more.

    Snowflake offers a variety of certifications depending on your role.

    Snowflake offers a trial account with $400 credits to get started for free.

    Users can set up one or multiple accounts with their choice of cloud provider – AWS, GCP, or Microsoft Azure.

    You can connect to a Snowflake instance via a web URL and use the Classic Console or the Snowsight web UI.

    You can also connect to Snowflake from the command line using the SnowSQL CLI.

    Practical

    Set up a Snowflake trial account with these options:

    Cloud – AWS

    Region – US/Canada/ASIA

    Version: Enterprise

    Multiple choice questions

    What kind of workloads are supported by Snowflake?

    Data Lake

    Data Warehouse

    Both A and B

    Only analytical workloads

    Select what best describes Snowflake:

    Cloud-agnostic

    Built from Scratch on the cloud

    MPP on Cloud

    Extending Cloud capabilities to the data platform

    When was Snowflake founded?

    2018

    2015

    2021

    2012

    How many advanced certifications are available with Snowflake?

    2

    5

    4

    3

    Select TRUE | FALSE

    Snowflake SnowPro Core is the first-level certification required to enroll for the advanced certifications.

    TRUE

    FALSE

    Answers

    Join our book’s Discord space

    Join the book's Discord workspace for the latest updates, offers, tech happenings around the world, new releases, and sessions with the authors:

    https://discord.bpbonline.com

    Chapter 2: Three Layered Architecture

    Introduction

    This chapter helps the reader get started with Snowflake through a detailed view of its architecture. It covers various data platform architecture challenges and how Snowflake helps overcome most of them. It then defines the three layers of the architecture, their distinguishing features, and their setup.

    Structure

    This chapter consists of the following topics:

    Traditional data warehouse challenges

    Legacy data warehouse challenges

    Typical data lake challenges

    Snowflake architecture

    Cloud services layer

    Virtual warehouse layer

    Storage layer

    Cache in Snowflake

    Objectives

    By the end of this chapter, you will be able to understand the Snowflake architecture and its three layers: storage, compute, and cloud services. Snowflake's unique architecture distinguishes this data platform from other cloud-based offerings. You will also be able to relate to common data challenges and understand how Snowflake is designed to overcome them.

    Traditional data warehouse challenges

    A data warehouse is a repository of data used for analysis, reporting, and dashboards. It is designed to support decision-making. Data residing in the warehouse can be used to generate statistics that help drive the business and make decisions based on the available data. Various types of analysis and reporting are possible, such as prescriptive and predictive analytics.

    This centralized data warehouse is developed over a period by sourcing, transforming, and loading data from various sources. This serves as a Unified Data Platform (UDP) or Enterprise Data Warehouse (EDW) for various stakeholders, internal as well as external. The data sourcing and consumption are dependent on the type of sources, frequency, accessibility, type of consumers, and so on. The data warehouse can be designed to store and organize data into various subject areas as per the data domain, such as sales, finance, marketing, and so on.

    There are various vendors that offer data warehouse solutions, such as Teradata, IBM Db2, and Netezza. These platforms support setting up warehouses for structured and semi-structured data, follow ANSI SQL standards, and let users query data with SQL. Queries that access and process data on these platforms need continuous optimization to ensure the required performance and to meet SLAs.

    Data warehouse solutions follow Symmetric Multiprocessing (SMP) or Massively Parallel Processing (MPP) implementations with shared-disk, shared-nothing, or shared-memory architectures. These traditional legacy data warehouse systems differ from the Snowflake implementation. You can see the traditional architectures in Figure 2.1:

    Figure 2.1: Shared Disk and Shared Nothing Architecture

    Shared-disk is an OLTP-style implementation where a single disk is shared across the various nodes in the system. Shared-nothing, on the other hand, is the architecture widely used for data warehouse (OLAP) implementations such as Teradata and Netezza. Storage and compute are tightly coupled in these implementations: in case of data growth, you must purchase additional nodes to increase the size, capacity, and performance of the warehouse, and you cannot scale compute (CPU) and storage separately.
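
    By contrast, Snowflake decouples compute from storage, so a virtual warehouse can be resized on its own without touching the data. A minimal sketch, reusing the hypothetical warehouse name from Chapter 1:

    -- Scale compute up for a heavy load; storage is unaffected.
    ALTER WAREHOUSE DEV_POC_WH SET WAREHOUSE_SIZE = 'LARGE';
    -- Scale back down (or rely on AUTO_SUSPEND) when the load subsides.
    ALTER WAREHOUSE DEV_POC_WH SET WAREHOUSE_SIZE = 'XSMALL';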

    Legacy data warehouse challenges

    The legacy data warehouse systems have typical challenges that include:

    Infrastructure: Compute and storage capacity is limited, pre-provisioned, or pre-purchased.

    Data sharing: Sharing data with consumer applications and setting up processes to share data in the form of file feed or data feed.

    Consistency: Most of the data warehouse systems have eventual consistency in case of consumer application data availability or data replication.

    Data security: Data classification, data masking, encryption, and security.

    Operations and maintenance: Greater maintenance cost and operational dependencies.

    Technology limitations: Requires special skills and knowledge of tools and technologies to develop products and features. Skilled resource dependencies.

    Typical data lake challenges

    Like a data warehouse, you can also implement a data platform as a data lake. Data lake implementations became common after the Hadoop and big data offerings arrived. Many legacy warehouse systems migrated their workloads to data lakes, or used data lakes to store archived data as well as monthly roll data.

    With a data lake implementation, processing, storage, and analysis became efficient, using commodity hardware to set up the Hadoop ecosystem or clusters. If you compare warehouse appliance offerings with a Hadoop ecosystem built on commodity systems, the difference in cost and scalability is easy to see. A data lake implemented with Hadoop follows an MPP architecture where you can add and remove processing nodes, and procurement is neither as costly nor as time-consuming as it is with appliance offerings. Refer to the following figure for a better understanding of the setup:

    Figure 2.2: Data Lake – Hadoop Setup

    In the case of legacy data lake implementation using commodity hardware, data nodes are connected to all edge nodes. Data can be accessed and shared between nodes.

    Hadoop ecosystem

    Hadoop is one implementation of a data lake. Like data warehouse systems, a data lake also ingests data from various sources and processes it. A typical data lake architecture does not look very different from a warehouse architecture. It has the following components:

    Source layer: Data sourced from various sources.

    HDFS layer: Files or storage layer.

    Processing layer: Transformations, processing and engineering pipelines.

    Data marts: Tables, data as target or consumer layer to be shared with consumer applications.

    Consumer layer: Applications, end users, and business users who can access data from data marts and lakes using APIs or SQL queries for reporting, dashboarding, or end-user applications.

    The following figure illustrates the Hadoop ecosystem:


    Figure 2.3: Hadoop Ecosystem

    Though the overall cost of a data lake is lower and users can add nodes to the cluster, it still has challenges: maintenance, operationalization, upgrade and patching activities, technology dependencies, and the special skills needed to develop data engineering workloads using Spark, MapReduce, Hive, Sqoop, and so on.

    Data lake systems follow the shared-nothing architecture. Data nodes and edge nodes are defined to manage workloads. These systems also enabled users to bring in unstructured data – columnar, documents, images, videos, and so on. This platform has the capability to implement predictive analytics using machine learning models.

    Extended support to NoSQL

    The Hadoop ecosystem also supports NoSQL solutions. NoSQL lets users store non-relational data without requiring a specific schema. NoSQL, which stands for Not Only SQL, is a great choice for storing unstructured data such as web or log data, social media data, images, videos, or geospatial data.

    Based on the type of data to be stored, NoSQL is categorized into four types: document store, columnar store, key-value store, and graph databases. A document store can be implemented using MongoDB, where documents are stored as JSON. Key-value stores are databases where information is stored as key-value pairs; DynamoDB can be used to implement them. A columnar store is a database where data is stored as columns and column families instead of rows and columns, with no schema enforcement on column families; each column family can define its own columns and store data accordingly. HBase and Cassandra are examples of columnar implementations. Graph databases store data in one-to-many and many-to-many relationships that can be represented as graphs; Neo4j is used for graph database implementations.

    NoSQL can be used based on the type of data and formats required for processing. However, they may face performance issues in case of any analytical processing required on top of these datasets. Users may not be able to use ANSI SQL standards or SQL queries to query data in NoSQL. There are specific skills required to access data.

    Following are the benefits of data lake over data warehouse:

    Reduced infrastructure cost: Cost-effective cluster implementation.

    Additional support to the data: Unstructured data supported.

    Extended support to predictive analytics: Support to machine learning implementations.

    Despite these additional benefits over legacy data warehouse systems, data lake implementations using legacy ecosystem components have their own limitations and challenges. The typical challenges of legacy data lake systems are similar to the data warehouse limitations and include:

    Infrastructure: There is limited or provisioned or pre-purchased Compute and Storage capacity – fixed scalability up to available nodes.

    Data sharing: Sharing data with consumer applications and setting up processes to share data in the form of file feed or data feed.

    ANSI SQL: Hive supports most DDLs and DMLs. However, there are limitations based on the ecosystem setup. Supporting OLTP and OLAP may not be as simple as it is with
