"Structured Streaming was a new streaming API introduced to Spark over 2 years ago in Spark 2.0, and was announced GA as of Spark 2.2. Databricks customers have processed over a hundred trillion rows in production using Structured Streaming. We received dozens of questions on how to best develop, monitor, test, deploy and upgrade these jobs. In this talk, we aim to share best practices around what has worked and what hasn't across our customer base.
We will tackle questions around how to plan ahead, what kind of code changes are safe for structured streaming jobs, how to architect streaming pipelines which can give you the most flexibility without sacrificing performance by using tools like Databricks Delta, how to best monitor your streaming jobs and alert if your streams are falling behind or are actually failing, as well as how to best test your code."
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark Summit
Apache Spark 2.1.0 boosted the performance of Apache Spark SQL due to Project Tungsten software improvements. Another 16x times faster has been achieved by using Oracle’s innovations for Apache Spark SQL. This 16x improvement is made possible by using Oracle’s Software in Silicon accelerator offload technologies.
Apache Spark SQL In-memory performance is becoming more important due to many factors. Users are now performing more advanced SQL processing on multi-terabyte workloads. In addition on-prem and cloud servers are getting larger physical memory to enable storing these huge workloads be stored in memory. In this talk we will look at using Spark SQL in feature creation, feature generation within pipelines for Spark ML.
This presentation will explore workloads at scale and with complex interactions. We also provide best practices and tuning suggestion to support these kinds of workloads on real applications in cloud deployments. In addition ideas for next generation Tungsten project will also be discussed.
Radical Speed for SQL Queries on Databricks: Photon Under the HoodDatabricks
Join this session to hear from the Photon product and engineering team talk about the latest developments with the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
Structured Streaming provides stateful stream processing capabilities in Spark SQL through built-in operations like aggregations and joins as well as user-defined stateful transformations. It handles state automatically through watermarking to limit state size by dropping old data. For arbitrary stateful logic, MapGroupsWithState requires explicit state management by the user.
With Lakehouse as the future of data architecture, Delta becomes the de facto data storage format for all the data pipelines. By using delta, to build the curated data lakes, users achieve efficiency and reliability end-to-end. Curated data lakes involve multiple hops in the end-to-end data pipeline, which are executed regularly (mostly daily) depending on the need. As data travels through each hop, its quality improves and becomes suitable for end-user consumption. On the other hand real-time capabilities are key for any business and an added advantage, luckily Delta has seamless integration with structured streaming which makes it easy for users to achieve real-time capability using Delta. Overall, Delta Lake as a streaming source is a marriage made in heaven for various reasons and we are already seeing the rise in adoption among our users.
In this talk, we will discuss various functional components of structured streaming with Delta as a streaming source. Deep dive into Query Progress Logs(QPL) and their significance for operating streams in production. How to track the progress of any streaming job and map it with the source Delta table using QPL. What exactly gets persisted in the checkpoint directory and its details. Mapping the contents of the checkpoint directory with the QPL metrics and understanding the significance of contents in the checkpoint directory with respect to Delta streams.
Tomer Shiran est le fondateur et chef de produit (CPO) de Dremio. Tomer était le 4e employé et vice-président produit de MapR, un pionnier de l'analyse du Big Data. Il a également occupé de nombreux postes de gestion de produits et d'ingénierie chez IBM Research et Microsoft, et a fondé plusieurs sites Web qui ont servi des millions d'utilisateurs. Il est titulaire d'un Master en génie informatique de l'Université Carnegie Mellon et d'un Bachelor of Science en informatique du Technion - Israel Institute of Technology.
Le Modern Data Stack meetup est ravi d'accueillir Tomer Shiran. Depuis Apache Drill, Apache Arrow maintenant Apache Iceberg, il ancre avec ses équipes des choix pour Dremio avec une vision de la plateforme de données “ouverte” basée sur des technologies open source. En plus, de ces valeurs qui évitent le verrouillage de clients dans des formats propriétaires, il a aussi le souci des coûts qu’engendrent de telles plateformes. Il sait aussi proposer un certain nombre de fonctionnalités qui transforment la gestion de données grâce à des initiatives telles Nessie qui ouvre la route du Data As Code et du transactionnel multi-processus.
Le Modern Data Stack Meetup laisse “carte blanche” à Tomer Shiran afin qu’il nous partage son expérience et sa vision quant à l’Open Data Lakehouse.
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training**
This Edureka tutorial on PySpark Tutorial will provide you with a detailed and comprehensive knowledge of Pyspark, how it works, the reason why python works best with Apache Spark. You will also learn about RDDs, data frames and mllib.
How We Optimize Spark SQL Jobs With parallel and sync IODatabricks
Although NVMe has been more and more popular these years, a large amount of HDD are still widely used in super-large scale big data clusters. In a EB-level data platform, IO(including decompression and decode) cost contributes a large proportion of Spark jobs’ cost. In another word, IO operation is worth optimizing.
In ByteDancen, we do a series of IO optimization to improve performance, including parallel read and asynchronized shuffle. Firstly we implement file level parallel read to improve performance when there are a lot of small files. Secondly, we design row group level parallel read to accelerate queries for big-file scenario. Thirdly, implement asynchronized spill to improve job peformance. Besides, we design parquet column family, which will split a table into a few column families and different column family will be in different Parquets files. Different column family can be read in parallel, so the read performance is much higher than the existing approach. In our practice, the end to end performance is improved by 5% to 30%
In this talk, I will illustrate how we implement these features and how they accelerate Apache Spark jobs.
Properly shaping partitions and your jobs to enable powerful optimizations, eliminate skew and maximize cluster utilization. We will explore various Spark Partition shaping methods along with several optimization strategies including join optimizations, aggregate optimizations, salting and multi-dimensional parallelism.
Making Apache Spark Better with Delta LakeDatabricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake offers ACID transactions, scalable metadata handling, and unifies the streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
You’ve heard the marketing buzz, maybe you have been to a workshop and worked with some Spark, Delta, SQL, Python, or R, but you still need some help putting all the pieces together? Join us as we review some common techniques to build a lakehouse using Delta Lake, use SQL Analytics to perform exploratory analysis, and build connectivity for BI applications.
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to iceberg tables can suffer by two problems (1) small files problem that can hurt read performance (2) poor data clustering that can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in details and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark’s built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is a open-source storage layer that brings ACID transactions to Apache Spark and big data workloads Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiples ways to solve the same problem. So understanding the requirements carefully helps you to architect your pipeline that solves your business needs in the most resource efficient manner.
In this talk, I am going examine a number common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the ‘what and why’ of any problem can automatically much clarity on the ‘how’ to architect it using Structured Streaming and, in many cases, Delta Lake.
Deep Dive: Memory Management in Apache SparkDatabricks
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Slides for Data Syndrome one hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
Parquet performance tuning: the missing guideRyan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
Common Strategies for Improving Performance on Your Delta LakehouseDatabricks
The Delta Architecture pattern has made the lives of data engineers much simpler, but what about improving query performance for data analysts? What are some common places to look at for tuning query performance? In this session we will cover some common techniques to apply to our delta tables to make them perform better for data analysts queries. We will look at a few examples of how you can analyze a query, and determine what to focus on to deliver better performance results.
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
"We're amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this tutorial we'll explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark™ enable writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through presentation, code examples, and notebooks, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark is a step forward in developing new kinds of streaming applications.
This tutorial will be both instructor-led and hands-on interactive session. Instructions in how to get tutorial materials will be covered in class.
WHAT YOU’LL LEARN:
– Understand the concepts and motivations behind Structured Streaming
– How to use DataFrame APIs
– How to use Spark SQL and create tables on streaming data
– How to write a simple end-to-end continuous application
PREREQUISITES
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
–Pre-register for Databricks Community Edition"
Speaker: Jules Damji
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
Description:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application, which we will discuss.
Abstract:
We are amidst the Big Data Zeitgeist era in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that’s continuous, reacts and interacts with data in real-time. We call this continuous application.
In this talk we will explore the concepts and motivations behind the continuous application, how Structured Streaming Python APIs in Apache Spark 2.x enables writing continuous applications, examine the programming model behind Structured Streaming, and look at the APIs that support them.
Through a short demo and code examples, I will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames and Datasets APIs.
You’ll walk away with an understanding of what’s a continuous application, appreciate the easy-to-use Structured Streaming APIs, and why Structured Streaming in Apache Spark 2.x is a step forward in developing new kinds of streaming applications.
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
We are in the midst of a Big Data Zeitgeist in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that reacts and interacts with data in real-time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how Structured Streaming Python APIs in Apache Spark 2.x enables writing them. We also will examine the programming model behind Structured Streaming and the APIs that support them. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames, and Datasets APIs.
Real-Time Spark: From Interactive Queries to StreamingDatabricks
This document summarizes Michael Armbrust's presentation on real-time Spark. It discusses:
1. The goals of real-time analytics including having the freshest answers as fast as possible while keeping the answers up to date.
2. Spark 2.0 introduces unified APIs for SQL, DataFrames and Datasets to make developing real-time analytics simpler with powerful yet simple APIs.
3. Structured streaming allows running the same SQL queries on streaming data to continuously aggregate data and update outputs, unifying batch, interactive, and streaming queries into a single API.
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing application. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Dataset and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
GoPro’s camera, drone, mobile devices as well as web, desktop applications are generating billions of event logs. The analytics metrics and insights that inform product, engineering, and marketing team decisions need to be distributed quickly and efficiently. We need to visualize the metrics to find the trends or anomalies.
While trying to building up the features store for machine learning, we need to visualize the features, Google Facets is an excellent project for visualizing features. But can we visualize larger feature dataset?
These are issues we encounter at GoPro as part of the data platform evolution. In this talk, we will discuss few of the progress we made at GoPro. We will talk about how to use Slack + Plot.ly to delivery analytics metrics and visualization. And we will also discuss our work to visualize large feature set using Google Facets with Apache Spark.
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing application. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Dataset and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Spark Streaming with Kafka allows processing streaming data from Kafka in real-time. There are two main approaches - receiver-based and direct. The receiver-based approach uses Spark receivers to read data from Kafka and write to write-ahead logs for fault tolerance. The direct approach reads Kafka offsets directly without a receiver for better performance but less fault tolerance. The document discusses using Spark Streaming to aggregate streaming data from Kafka in real-time, persisting aggregates to Cassandra and raw data to S3 for analysis. It also covers using stateful transformations to update Cassandra in real-time.
Spark Streaming can be used to process streaming data from Kafka in real-time. There are two main approaches - the receiver-based approach where Spark receives data from Kafka receivers, and the direct approach where Spark directly reads data from Kafka. The document discusses using Spark Streaming to process tens of millions of transactions per minute from Kafka for an ad exchange system. It describes architectures where Spark Streaming is used to perform real-time aggregations and update databases, as well as save raw data to object storage for analytics and recovery. Stateful processing with mapWithState transformations is also demonstrated to update Cassandra in real-time.
Performance Troubleshooting Using Apache Spark MetricsDatabricks
Luca Canali, a data engineer at CERN, presented on performance troubleshooting using Apache Spark metrics at the UnifiedDataAnalytics #SparkAISummit. CERN runs large Hadoop and Spark clusters to process over 300 PB of data from the Large Hadron Collider experiments. Luca discussed how to gather, analyze, and visualize Spark metrics to identify bottlenecks and improve performance.
Apache Spark Performance Troubleshooting at Scale, Challenges, Tools, and Met...Databricks
This talk is about methods and tools for troubleshooting Spark workloads at scale and is aimed at developers, administrators and performance practitioners. You will find examples illustrating the importance of using the right tools and right methodologies for measuring and understanding performance, in particular highlighting the importance of using data and root cause analysis to understand and improve the performance of Spark applications. The talk has a strong focus on practical examples and on tools for collecting data relevant for performance analysis. This includes tools for collecting Spark metrics and tools for collecting OS metrics. Among others, the talk will cover sparkMeasure, a tool developed by the author to collect Spark task metric and SQL metrics data, tools for analysing I/O and network workloads, tools for analysing CPU usage and memory bandwidth, tools for profiling CPU usage and for Flame Graph visualization.
Continuous Application with Structured Streaming 2.0Anyscale
Introduction to Continuous Application with Apache Spark 2.0 Structured Streaming. This presentation is a culmination and curation from talks and meetups presented by Databricks engineers.
The notebooks on Structured Streaming demonstrates aspects of the Structured Streaming APIs
ETL with SPARK - First Spark London meetupRafal Kwasny
The document discusses how Spark can be used to supercharge ETL workflows by running them faster and with less code compared to traditional Hadoop approaches. It provides examples of using Spark for tasks like sessionization of user clickstream data. Best practices are covered like optimizing for JVM issues, avoiding full GC pauses, and tips for deployment on EC2. Future improvements to Spark like SQL support and Java 8 are also mentioned.
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
Making Structured Streaming Ready for ProductionDatabricks
In mid-2016, we introduced Structured Steaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing application without having to reason about having to reason about streaming. It allows the user to express their streaming computations the same way you would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, Tathagata Das will cover in more detail about the major features we have added, the recipes for using them in production, and the exciting new features we have plans for in future releases. Some of these features are as follows:
- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes
Speaker: Tathagata Das
This talk was originally presented at Spark Summit East 2017.
- Spark Streaming allows processing of live data streams using Spark's batch processing engine by dividing streams into micro-batches.
- A Spark Streaming application consists of input streams, transformations on those streams such as maps and filters, and output operations. The application runs continuously processing each micro-batch.
- Key aspects of operationalizing Spark Streaming jobs include checkpointing to ensure fault tolerance, optimizing throughput by increasing parallelism, and debugging using Spark UI.
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
In this presentation, we are going to talk about the state of the art infrastructure we have established at Walmart Labs for the Search product using Spark Streaming and DataFrames. First, we have been able to successfully use multiple micro batch spark streaming pipelines to update and process information like product availability, pick up today etc. along with updating our product catalog information in our search index to up to 10,000 kafka events per sec in near real-time. Earlier, all the product catalog changes in the index had a 24 hour delay, using Spark Streaming we have made it possible to see these changes in near real-time. This addition has provided a great boost to the business by giving the end-costumers instant access to features likes availability of a product, store pick up, etc.
Second, we have built a scalable anomaly detection framework purely using Spark Data Frames that is being used by our data pipelines to detect abnormality in search data. Anomaly detection is an important problem not only in the search domain but also many domains such as performance monitoring, fraud detection, etc. During this, we realized that not only are Spark DataFrames able to process information faster but also are more flexible to work with. One could write hive like queries, pig like code, UDFs, UDAFs, python like code etc. all at the same place very easily and can build DataFrame template which can be used and reused by multiple teams effectively. We believe that if implemented correctly Spark Data Frames can potentially replace hive/pig in big data space and have the potential of becoming unified data language.
We conclude that Spark Streaming and Data Frames are the key to processing extremely large streams of data in real-time with ease of use.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2Databricks
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized PlatformDatabricks
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). Following topics will be covered: – Understanding key traits of Apache Spark on Kubernetes- Things to know when running Apache Spark on Kubernetes such as autoscaling- Demonstrate running analytics pipelines on Apache Spark orchestrated with Apache Airflow on Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.
We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top a table; We load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
Airline Satisfaction Project using Azure
This presentation is created as a foundation of understanding and comparing data science/machine learning solutions made in Python notebooks locally and on Azure cloud, as a part of Course DP-100 - Designing and Implementing a Data Science Solution on Azure.
How We Added Replication to QuestDB - JonTheBeachjavier ramirez
Building a database that can beat industry benchmarks is hard work, and we had to use every trick in the book to keep as close to the hardware as possible. In doing so, we initially decided QuestDB would scale only vertically, on a single instance.
A few years later, data replication —for horizontally scaling reads and for high availability— became one of the most demanded features, especially for enterprise and cloud environments. So, we rolled up our sleeves and made it happen.
Today, QuestDB supports an unbounded number of geographically distributed read-replicas without slowing down reads on the primary node, which can ingest data at over 4 million rows per second.
In this talk, I will tell you about the technical decisions we made, and their trade offs. You'll learn how we had to revamp the whole ingestion layer, and how we actually made the primary faster than before when we added multi-threaded Write Ahead Logs to deal with data replication. I'll also discuss how we are leveraging object storage as a central part of the process. And of course, I'll show you a live demo of high-performance multi-region replication in action.
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...javier ramirez
Los sistemas distribuidos son difíciles. Los sistemas distribuidos de alto rendimiento, más. Latencias de red, mensajes sin confirmación de recibo, reinicios de servidores, fallos de hardware, bugs en el software, releases problemáticas, timeouts... hay un montón de motivos por los que es muy difícil saber si un mensaje que has enviado se ha recibido y procesado correctamente en destino. Así que para asegurar mandas el mensaje otra vez.. y otra... y cruzas los dedos para que el sistema del otro lado tenga tolerancia a los duplicados.
QuestDB es una base de datos open source diseñada para alto rendimiento. Nos queríamos asegurar de poder ofrecer garantías de "exactly once", deduplicando mensajes en tiempo de ingestión. En esta charla, te cuento cómo diseñamos e implementamos la palabra clave DEDUP en QuestDB, permitiendo deduplicar y además permitiendo Upserts en datos en tiempo real, añadiendo solo un 8% de tiempo de proceso, incluso en flujos con millones de inserciones por segundo.
Además, explicaré nuestra arquitectura de log de escrituras (WAL) paralelo y multithread. Por supuesto, todo esto te lo cuento con demos, para que veas cómo funciona en la práctica.
[D2T2S04] SageMaker를 활용한 Generative AI Foundation Model Training and TuningDonghwan Lee
이 세션에서는 SageMaker Training Jobs / SageMaker Jumpstart를 사용하여 Foundation Model 을 Pre-Triaining 하거나 Fine Tuing 하는 방안을 제시합니다. 이 세션을 통해 아래 3가지가 소개됩니다.
1. 파운데이션 모델을 처음부터 Training
2. 오픈 소스 모델을 사용하여 파운데이션 모델을 Pre-Training
3. 도메인에 맞게 모델을 Fine Tuning하는 방안
발표자:
Miron Perel, Principal ML GTM Specialist, AWS
Kristine Pearce, Principal ML BD, AWS
Amazon DocumentDB(MongoDB와 호환됨)는 빠르고 안정적이며 완전 관리형 데이터베이스 서비스입니다. Amazon DocumentDB를 사용하면 클라우드에서 MongoDB 호환 데이터베이스를 쉽게 설치, 운영 및 규모를 조정할 수 있습니다. Amazon DocumentDB를 사용하면 MongoDB에서 사용하는 것과 동일한 애플리케이션 코드를 실행하고 동일한 드라이버와 도구를 사용하는 것을 실습합니다.
2. Who am I
● Software Engineer – Databricks
- “We make your streams come true”
● Apache Spark Committer
● MS in Management Science & Engineering -
Stanford University
● BS in Mechanical Engineering - Bogazici
University, Istanbul
7. Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems
8. Structured Streaming @
1000s of customer streaming apps
in production on Databricks
1000+ trillions of rows processed
in production
10. Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy($"value".cast("string"))
.count()
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Source
• Specify one or more locations
to read data from
• Built in support for
Files/Kafka/Socket,
pluggable.
• Can include multiple sources
of different types using
union()
11. Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Transformation
• Using DataFrames,
Datasets and/or SQL.
• Catalyst figures out how to
execute the transformation
incrementally.
• Internal processing always
exactly-once.
12. Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode(OutputMode.Complete())
.option("checkpointLocation", "…")
.start()
Sink
• Accepts the output of each
batch.
• When supported sinks are
transactional and exactly
once (Files).
• Use foreach to execute
arbitrary code.
13. Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode("update")
.option("checkpointLocation", "…")
.start()
Output mode – What's output
• Complete – Output the whole answer
every time
• Update – Output changed rows
• Append – Output new rows only
Trigger – When to output
• Specified as a time, eventually
supports data size
• No trigger means as fast as possible
14. Anatomy of a Streaming Query
spark.readStream
.format("kafka")
.option("subscribe", "input")
.load()
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
.writeStream
.format("kafka")
.option("topic", "output")
.trigger("1 minute")
.outputMode("update")
.option("checkpointLocation", "…")
.start()
Checkpoint
• Tracks the progress of a
query in persistent storage
• Can be used to restart the
query if there is a failure.
16. Data Pipelines @ Databricks
Event Based
Reporting
Streaming
Analytics
Bronze Tables Silver Tables Gold Tables
17. Event Based File Sources
• Launched Structured Streaming connectors:
• s3-sqs on AWS (DBR 3.5)
• abs-aqs on Azure (DBR 5.0)
• As blobs are generated:
• Events are published to SQS/AQS
• Spark reads these events
• Then reads original files from
blob storage system Azure
Blob
Storage Event Grid
Queue Storage
AWS SQS
AWS S3
18. Properties of Bronze/Silver/Gold
• Bronze tables
• No data processing
• Deduplication + JSON => Parquet conversion
• Data kept around for a couple weeks in order to fix mistakes just in case
• Silver tables
• Tens/Hundreds of tables
• Directly queryable tables
• PII masking/redaction
• Gold tables
• Materialized views of silver tables
• Curated tables by the Data Science team
23. Testing
Strategy 1: Don’t care about sources and sinks. Just test your
business logic, using batch DataFrames
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
Pros:
- Easy to do in
Scala/Python
Cons:
- Not all batch operations
are supported in Streaming
24. Testing
Strategy 2: Leverage the StreamTest test harness available in Apache
Spark
val inputData = MemoryStream[Array[Byte]]
val stream = inputData.toDS().toDF("value")
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
testStream(stream, OutputMode.Update)(
AddData(inputData, "a".getBytes(), "b".getBytes()),
CheckAnswer(("a" -> 1), ("b" -> 1))
)
25. Testing
Strategy 2: Leverage the StreamTest test harness
available in Apache Spark
val inputData = MemoryStream[Array[Byte]]
val stream = inputData.toDS().toDF("value")
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
testStream(stream, OutputMode.Update)(
AddData(inputData, "a".getBytes(), "b".getBytes()),
CheckAnswer(("a" -> 1), ("b" -> 1))
)
Source is in
memory
Schema can be set
arbitrarily to mimic
real source
26. Testing
Strategy 2: Leverage the StreamTest test harness
available in Apache Spark
val inputData = MemoryStream[Array[Byte]]
val stream = inputData.toDS().toDF("value")
.groupBy('value.cast("string") as 'key)
.agg(count("*") as 'value)
testStream(stream, OutputMode.Update)(
AddData(inputData, "a".getBytes(), "b".getBytes()),
CheckAnswer(("a" -> 1), ("b" -> 1))
)
Transformation
unchanged.
27. Testing
Strategy 2: Leverage the StreamTest test harness
available in Apache Spark
testStream(stream, OutputMode.Update)(
AddData(inputData, ...),
CheckAnswer(("a" -> 1), ("b" -> 1))
)
Starts a stream outputting
data to a memory sink
28. Testing
Strategy 2: Leverage the StreamTest test harness
available in Apache Spark
testStream(stream, OutputMode.Update)(
AddData(inputData, "a".getBytes(), "b".getBytes()),
CheckAnswer(("a" -> 1), ("b" -> 1))
)
Add
data to
the
source
29. Testing
Strategy 2: Leverage the StreamTest test harness
available in Apache Spark
testStream(stream, OutputMode.Update)(
AddData(inputData, "a".getBytes(), "b".getBytes()),
CheckAnswer(("a" -> 1), ("b" -> 1))
)
Process all data and
check result
30. Testing
Available actions in StreamTest:
- StartStream: Allows you to provide a trigger, checkpoint location, or SQL
configurations for your stream
- AddData: Adds data to your source
- CheckAnswer: Check the current data available in your sink
- CheckLastBatch: Check data that was written to your sink in the last
epoch/micro-batch
- StopStream: Stop your stream to mimic failures/upgrades
- ExpectFailure: Allows you to test failure scenarios on the last batch based
on input data
34. Testing
How to use StreamTest?
a) Copy the code from the Spark repository to your project (recommended)
- Isolates you from changes in open source that may break your build
35. Testing
How to use StreamTest?
b) Import the spark-sql test jars
Maven:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.1</version>
<scope>test</scope>
<type>test-jar</type>
</dependency>
SBT:
"org.apache.spark" %% "spark-sql” % "2.4.0" % "test" classifier "tests"
36. Testing
Strategy 2: Leverage the StreamTest test harness available in Apache
Spark
Pros:
- A great test harness for free!
- Quick and cheap way to test
business logic
Cons:
- Only available in Scala
37. Testing
Strategy 3: Integration testing using Databricks Jobs
1. Have a replica of production in a staging account
2. Use Databricks REST APIs/Airflow/Azure Data Factory to kick off a
single-run job
3. Verify data output, data latency, job duration
Pros:
- Closest option to mirror
production
Cons:
- Hard to set up
- Expensive
39. Testing
What else to watch out for?
- Table schemas: Changing the schema/logic of one stream upstream
can break cascading jobs
Stay tuned for Spark Summit Europe!
- Dependency hell: The environment your local machine or Continuous
Integration service may differ from Production
Check out Databricks Container Services!
40. Testing
What else to watch out for?
- Stress Testing: Most times Spark isn’t the bottleneck. In fact,
throwing more money at your Spark clusters make the problem worse!
a) Don’t forget to tune your Kafka brokers (num.io.threads,
num.network.threads)
b) Most cloud services have rate limits, make sure you avoid them as
much as you can
41. Testing Best Practices
1. Leverage the StreamTest harness
for unit tests
- Use MemorySource and
MemorySink to test business logic
42. Testing Best Practices
2. Maintain a staging environment to
integration test before pushing to
production
- You can use Databricks Jobs and
Databricks Container Services to
ensure you have a replica of your
production environment
43. Testing Best Practices
3. Don’t forget to test data dependencies, schema changes upstream
can break downstream jobs
44. Testing Best Practices
4. Perform stress tests in staging
environment to have a runbook for
production. Not all problems lie in
your Spark cluster.
46. Monitoring
Get last progress of the
streaming query
Current input and processing rates
Current processed offsets
Current state metrics
Get progress asynchronously
through by registering your own
StreamingQueryListener
new StreamingQueryListener {
def onQueryStart(...)
def onQueryProgress(...)
def onQueryTermination(...)
}
streamingQuery.lastProgress()
{ ...
"inputRowsPerSecond" : 10024.225210926405,
"processedRowsPerSecond" : 10063.737001006373,
"durationMs" : { ... },
"sources" : [ ... ],
"sink" : { ... }
...
}
48. Monitoring
Even if you are running a map-only job, you can add a watermark
- This allows you to collect event time min, max, average in metrics
You can add current_timestamp() to keep track of ingress timestamps
- udf(() => new java.sql.Timestamp(System.currentTimeMillis)) to get
accurate processing timestamp
49. Monitoring
Start streams on your tables for monitoring and build streaming
dashboards in Databricks!
• Use display(streaming_df) to get live updating displays in Databricks
• Use foreach/foreachBatch to trigger alerts
51. Data Pipelines @ Databricks
Event Based
Reporting
Streaming
Analytics
Bronze Tables Silver Tables Gold Tables
52. Deploying
Where to deploy this many (hundreds of) streams?
a) Each stream gets a cluster
Pros: Cons:
+ Better isolation - Costly
- More moving parts
b) Multiplex many streams on a single cluster
Pros: Cons:
+ Better cluster utilization - Driver becomes a bottleneck
+ Potential Delta Cache - Determining how many is difficult
re-use - Load balancing streams across
clusters also difficult
53. Deploying
What causes bottlenecks in the Driver?
1. Locks!
- JSON Serialization of offsets in streaming (Jackson)
- Scala compiler (Encoder creation)
- Hadoop Configurations (java.util.Properties)
- Whole Stage Codegen (ClassLoader.loadClass)
2. Garbage Collection
54. Deploying
How many streams can you run on a single driver?
- Depends on your streaming sources and sinks
Sources: Sink:
1. Delta Lake 1. Kafka
2. Event Based File Sources 2. Delta Lake
3. Kafka / Azure EventHub / Kinesis 3. Other File Formats
4. Other File Sources (JSON/CSV)
Efficiency
55. Deploying
How many streams can you run on a single driver?
- ~80 S3-SQS => Delta Streams at modest data rates
- ~40 Delta => Delta Streams at high data rates
After removing most locks, we got to 200 Delta => Delta streams at
modest data rates, with 40 streams per SparkSession
58. Updating
The Checkpoint:
- The checkpoint location is the unique identity of your stream
- Contains:
a) The id of the stream (json file named metadata)
b) Source offsets (folder named sources, contains json files)
c) Aggregation state (folder named state, contains binary files)
d) Commit files (folder named commits, contains json files)
e) Source Metadata (folder named sources)
59. Updating
Based on files stored in a checkpoint, what can you change?
1. Sinks
2. Input/Output schema (in the absence of stateful operations)
3. Triggers
4. Transformations
5. Spark Versions
60. Updating
Based on files stored in a checkpoint, what can’t you change?
1. Stateful operations: agg, flatMapGroupsWithState, dropDuplicates,
join
- Schema: key, value
- Parallelism: spark.sql.shuffle.partitions
- Can’t add or remove stateful operators
2. Output Mode (will work, but semantics of stream has changed)
3. Sources
61. Updating
How to workaround limitations?
• Restart stream from scratch
• Use new checkpoint location – avoid
eventual consistency on S3
• Partition source tables by date,
restart stream from a given date