Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs cannot be avoided with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state allows you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It starts all the way back when Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we'll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We'll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we'll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we'll discuss the Kafka sink. After browsing the available options, we'll then dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
Change Data Capture (CDC) is a typical use case in Real-Time Data Warehousing. It tracks the data change log (binlog) of a relational database (OLTP) and replays these change logs in a timely manner to external storage, such as Delta or Kudu, to enable Real-Time OLAP. To implement a robust CDC streaming pipeline, many factors must be considered, such as how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether it is easy to support a variety of databases with little code.
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
This talk shares experiences from deploying and tuning Flink stream processing applications at very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large-scale Flink job, and outline what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into analyzing and tuning checkpointing, selecting and configuring state backends, understanding common bottlenecks, and understanding and configuring network parameters.
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples.
by
Thomas Weise
Ever tried to get clarity on what kinds of memory there are and how to tune each of them? If not, very likely your jobs are configured incorrectly. As we found out, it is not straightforward, and it is not well documented either. This session will provide information on the types of memory to be aware of, the calculations involved in determining how much is allocated to each type of memory, and how to tune it depending on the use case.
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Kafka's basic terminologies, its architecture, its protocol and how it works.
Kafka at scale, its caveats, guarantees and use cases offered by it.
How we use it @ZaprMediaLabs.
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Presto on Apache Spark: A Tale of Two Computation EnginesDatabricks
The architectural tradeoffs between the map/reduce paradigm and parallel databases has been a long and open discussion since the dawn of MapReduce over more than a decade ago. At Facebook, we have spent the past several years in independently building and scaling both Presto and Spark to Facebook scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-art low-latency evaluation with Spark’s robust and fault tolerant execution engine.
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we have adopted Flink as the unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarded critical use cases to the platform. Pinterest has supported 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects.
This talk was given by Julian Hyde at the Apache Big Data conference in Vancouver on 2016/05/09.
Distributed stream processing is evolving from a technology in the sidelines of Big Data to a key enabler for businesses to provide more scalable, real-time services to their customers. We at Ververica, the company founded by the original creators of Apache Flink, and other prominent players in the Flink community have witnessed this development from the driver's seat. Working with our customers and the wider community, we have seen great success stories and we have seen things going wrong. In this talk, I would like to share anecdotes and hard-learned lessons of adopting distributed stream processing, Apache Flink-specific as well as across frameworks. Afterwards, you will know how not to model your use cases as a stream processing application, which data structures not to use, how not to deal with failure, how not to approach the topic of monitoring, and much more.
Video: https://www.youtube.com/watch?v=F7HQd3KX2TQ&list=PLDX4T_cnKjD207Aa8b5CsZjc7Z_KRezGz&index=48&t=6s
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
Streaming Event Time Partitioning with Apache Flink and Apache Iceberg - Juli...Flink Forward
Netflix’s playback data records every user interaction with video on the service, from trailers on the home page to full-length movies. This is a critical dataset with high volume that is used broadly across Netflix, powering product experiences, AB test metrics, and offline insights. In processing playback data, we depend heavily on event-time partitioning to handle a long tail of late arriving events. In this talk, I’ll provide an overview of our recent implementation of generic event-time partitioning on high volume streams using Apache Flink and Apache Iceberg (Incubating). Built as configurable Flink components that leverage Iceberg as a new output table format, we are now able to write playback data and other large scale datasets directly from a stream into a table partitioned on event time, replacing the common pattern of relying on a post-processing batch job that “puts the data in the right place”. We’ll talk through what it took to apply this to our playback data in practice, as well as challenges we hit along the way and tradeoffs with a streaming approach to event-time partitioning.
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits
- An introduction to the five levels of the operator maturity model
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container
- Enhancements we're making in versioning/upgradeability/stability and security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator
- Lessons learned
- Q&A
by
James Busche & Ted Chang
Redis is an in-memory key-value store that is often used as a database, cache, and message broker. It supports various data structures like strings, hashes, lists, sets, and sorted sets. While data is stored in memory for fast access, Redis can also persist data to disk. It is widely used by companies like GitHub, Craigslist, and Engine Yard to power applications with high performance needs.
This document provides an overview of Weather.com's analytics architecture using Apache Cassandra and Spark. It summarizes Weather.com's initial attempts using Cassandra, lessons learned, and its improved architecture. The improved architecture uses Cassandra for streaming event data with time-window compaction, stores all other data in Amazon S3 for batch processing in Spark, and replaces Kafka with Amazon SQS for event ingestion. It discusses best practices for data modeling in Cassandra including partitioning, secondary indexes, and avoiding wide rows and nulls. The document also highlights how Weather.com uses Apache Zeppelin notebooks for data exploration and visualization.
- Spark Streaming allows processing of live data streams using Spark's batch processing engine by dividing streams into micro-batches.
- A Spark Streaming application consists of input streams, transformations on those streams such as maps and filters, and output operations. The application runs continuously processing each micro-batch.
- Key aspects of operationalizing Spark Streaming jobs include checkpointing to ensure fault tolerance, optimizing throughput by increasing parallelism, and debugging using Spark UI.
Speakers: Chris Larsen (Limelight Networks) and Benoit Sigoure (Arista Networks)
The OpenTSDB community continues to grow, with users looking to store massive amounts of time-series data in a scalable manner. In this talk, we will discuss a number of use cases and best practices around naming schemas and HBase configuration. We will also review OpenTSDB 2.0's new features, including the HTTP API, plugins, annotations, millisecond support, and metadata, as well as what's next in the roadmap.
This document summarizes a presentation about Norikra, an open source SQL stream processing engine written in Ruby. Some key points:
- Norikra allows for schema-less stream processing using SQL queries without needing to restart for new queries. It supports windows, joins, UDFs.
- Example queries demonstrate counting events by field values over time windows. Nested fields can be queried directly.
- Norikra is used in production at LINE for analytics like API error reporting and a Lambda architecture with real-time and batch processing.
- It is built on JRuby to leverage Java libraries and Ruby gems, and can handle 10k-100k events/sec on typical hardware.
Application Monitoring using Open Source: VictoriaMetrics - ClickHouseVictoriaMetrics
Monitoring is the key to successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast growing, high performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHo...Altinity Ltd
Application Monitoring using Open Source - VictoriaMetrics & Altinity ClickHouse Webinar Slides
Monitoring is the key to the successful operation of any software service, but commercial solutions are complex, expensive, and slow. Let us show you how to build monitoring that is simple, cost-effective, and fast using open-source stacks easily accessible to any developer.
We’ll start with the elements of monitoring systems: data ingest, query engine, visualization, and alerting. We’ll then explain and contrast two implementation approaches. The first uses VictoriaMetrics, a fast-growing, high-performance time series database that uses PromQL for queries. The second is based on ClickHouse, a popular real-time analytics database that speaks SQL. Fast, affordable monitoring is within reach. This webinar provides designs and working code to get you there.
Presented by:
Roman Khavronenko, Co-Founder at VictoriaMetrics
Robert Hodges, CEO at Altinity
BigQuery case study in Groovenauts & Dive into the DataflowJavaSDKnagachika t
This document summarizes a presentation about using BigQuery and the Dataflow Java SDK. It discusses how Groovenauts uses BigQuery to analyze data from their MAGELLAN container hosting service, including resource monitoring, developer activity logs, application logs, and end-user access logs. It then provides an overview of the Dataflow Java SDK, including the key concepts of PCollections, coders, PTransforms, composite transforms, ParDo and DoFn, and windowing.
WSO2Con USA 2015: WSO2 Analytics Platform - The One Stop Shop for All Your Da...WSO2
The WSO2 Analytics Platform uniquely combines real-time and batch analytics to derive insights from IoT, mobile, and web application data. It includes the WSO2 Data Analytics Server for collecting, analyzing, and communicating real-time and persisted data. The platform can collect data from various sources, analyze it using Spark SQL for batch or Siddhi for real-time, and communicate results through alerts, dashboards, and APIs. It also features predictive analytics capabilities via machine learning algorithms.
Deep Dive with Spark Streaming - Tathagata Das - Spark Meetup 2013-06-17spark-project
Slides from Tathagata Das's talk at the Spark Meetup entitled "Deep Dive with Spark Streaming" on June 17, 2013 in Sunnyvale California at Plug and Play. Tathagata Das is the lead developer on Spark Streaming and a PhD student in computer science in the UC Berkeley AMPLab.
Spark Streaming allows processing of live data streams using Spark. It works by receiving data streams, chopping them into batches, and processing the batches using Spark. This presentation covered Spark Streaming concepts like the lifecycle of a streaming application, best practices for aggregations, operationalization through checkpointing, and achieving high throughput. It also discussed debugging streaming jobs and the benefits of combining streaming with batch, machine learning, and SQL processing.
The document summarizes Spring Data Neo4j 4.0, a new version of the Spring Data project that provides integration with the Neo4j graph database. It describes Neo4j and Spring Data briefly, then outlines the key features and architecture of SDN 4.0, including a standalone object-graph mapping layer, variable depth persistence, and integration with Spring and repositories. It demonstrates a sample conference application built with SDN 4.0 and provides information on getting started and support resources.
SparkSQL: A Compiler from Queries to RDDsDatabricks
SparkSQL, a module for processing structured data in Spark, is one of the fastest SQL on Hadoop systems in the world. This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will walk away with a deeper understanding of how Spark analyzes, optimizes, plans and executes a user’s query.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
Java/Scala Lab: Anatolii Kmetiuk - Scala SubScript: An Algebra for Reactive Pro...GeeksLab Odessa
SubScript is an extension of the Scala language that adds support for the constructs and syntax of the Algebra of Communicating Processes (ACP). SubScript is a promising extension, applicable both to the development of highly loaded parallel systems and to simple personal applications.
Introducing TiDB [Delivered: 09/27/18 at NYC SQL Meetup]Kevin Xu
This presentation was delivered at the NYC SQL meetup on September 27, 2018. It provided a technical overview of the TiDB Platform, a deep dive into TiDB's MySQL compatible layer and MySQL ecosystem tools, use case of Mobike, and appendix with detail materials on coprocessor and transaction model.
WSO2Con ASIA 2016: WSO2 Analytics Platform: The One Stop Shop for All Your Da...WSO2
Today’s highly connected world is flooding businesses with big and fast-moving data. The ability to trawl this data ocean and identify actionable insights can deliver a competitive advantage to any organization. The WSO2 Analytics Platform enables businesses to do just that by providing batch, real-time, interactive and predictive analysis capabilities all in one place.
In this tutorial we will
Plug in the WSO2 Analytics Platform to some common business use cases
Showcase the numerous capabilities of the platform
Demonstrate how to collect data, analyze, predict and communicate effectively
WSO2 Analytics Platform uniquely combines real-time and batch analytics to derive insights from IoT, mobile, and web application data. It includes WSO2 Data Analytics Server for collecting, analyzing, and communicating real-time and stored data. The platform can collect data, perform batch analytics using Spark SQL, real-time analytics using Siddhi, and predictive analytics using machine learning models. It also supports dashboards, APIs, alerts, and other methods for communicating results.
BDA403 How Netflix Monitors Applications in Real-time with Amazon KinesisAmazon Web Services
Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this talk, we’ll first discuss why Netflix chose Amazon Kinesis Streams over other streaming data solutions like Kafka to address these challenges at scale. We’ll then dive deep into how Netflix uses Amazon Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we will cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this talk, you’ll take away techniques and processes that you can apply to your large-scale networks and derive real-time, actionable insights.
Real-Time ETL in Practice with WSO2 Enterprise IntegratorWSO2
The availability of timely information and data is critical for modern enterprises. Delays of even minutes are not acceptable in many cases, and data needs to be available in realtime. However, legacy systems that can't generate data streams that can be consumed in real-time still exist. These legacy systems emit their output as static data stores such as files or DB tables. Integrating these static data sources in realtime is crucial, and this is where real-time ETL comes to the rescue.
This deck explores how WSO2 Streaming Integrator can be used for real-time ETL with techniques such as change data capture and file streaming.
Watch the webinar on-demand here - https://wso2.com/library/webinars/2020/03/real-time-etl-in-practice-with-wso2-enterprise-integrator/
Founding committer of Spark, Patrick Wendell, gave this talk at 2015 Strata London about Apache Spark.
These slides provide an introduction to Spark and delve into future developments, including DataFrames, the Datasource API, the Catalyst logical optimizer, and Project Tungsten.
Similar to Introducing BinarySortedMultiMap - A new Flink state primitive to boost your application performance (20)
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by
Aansh Shah
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's Table API and Catalog to help users interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Flink Forward San Francisco 2022.
At Flink Forward, we get to hear creative, unique use cases, often on the bleeding edge of some of the most exciting current technologies. This talk will give you a chance to open up the hood on our driven and innovative open source community. I will cover what our community has been working on this past year, and how this work relates to our (Ververica's) exciting new Flink engineering roadmap! I will also go through some best practices and upcoming opportunities for getting involved in this community!
by
Caito Scherr
Practical learnings from running thousands of Flink jobsFlink Forward
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn't seem to be processing any records? We share practical learnings from running thousands of Flink jobs for different use-cases and take a look at common challenges we have experienced, such as out-of-memory errors, timeouts, and job stability. We will cover memory tuning, S3 and Akka configurations to address common pitfalls, and the approaches that we take on automating health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Extending Flink SQL for stream processing use casesFlink Forward
1. For streaming data, Flink SQL uses STREAMs for append-only queries and CHANGELOGs for upsert queries instead of tables.
2. Stateless queries on streaming data, such as projections and filters, result in new STREAMs or CHANGELOGs.
3. Stateful queries, such as aggregations, produce STREAMs or CHANGELOGs depending on whether they are windowed or not. Join queries between streaming sources also result in STREAM outputs.
Using Queryable State for Fun and ProfitFlink Forward
Flink Forward San Francisco 2022.
A particular feature in our system relies on a streaming 90-minute trailing window of 1-minute samples - implemented as a lookaside cache - to speed up a particular query, allowing our customers to rapidly see an overview of their estate. Across our entire customer base, there is a substantial amount of data flowing into this cache - ~1,000,000 entries/second, with the entire cache requiring ~600GB of RAM. The current implementation is simplistic but expensive. In this talk I describe a replacement implementation as a stateful streaming Flink application leveraging Queryable State. This Flink application reduces the net cost by ~90%. In this session, the implementation is described in detail, including windowing considerations, a sliding-window state buffer that avoids the sliding window replication penalty, and a comparison of queryable state and Redis queries. The talk concludes with a frank discussion of when this distinctive approach is, and is not, appropriate.
by
Ron Crocker
Changelog Stream Processing with Apache FlinkFlink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
Large Scale Real Time Fraudulent Web Behavior DetectionFlink Forward
Flink Forward San Francisco 2022.
Neuro-ID analyzes web behavior at a large scale to determine visitors' intent on web pages, specifically in the online lending industry. When users interact with an online loan application, our software analyzes their behavior to determine if the applicant may be potentially fraudulent. Lenders can then request various scores describing the applicant's intentions in real-time to use to make decisions during the application flow. Flink gives our product the ability to observe behavior in a stateful manner. As an applicant interacts with an online loan application, a Flink application is used to compare earlier actions to later actions. This processing in Flink can determine the applicant's intent throughout the process of the application.
by
Jeff Niemann & Randy Hanak
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
Flink Forward San Francisco 2022.
Apache Flink and Delta Lake together allow you to build the foundation for your data lakehouses by ensuring the reliability of your concurrent streams from processing to the underlying cloud object-store. Together, the Flink/Delta Connector enables you to store data in Delta tables such that you harness Delta’s reliability by providing ACID transactions and scalability while maintaining Flink’s end-to-end exactly-once processing. This ensures that the data from Flink is written to Delta Tables in an idempotent manner such that even if the Flink pipeline is restarted from its checkpoint information, the pipeline will guarantee no data is lost or duplicated thus preserving the exactly-once semantics of Flink.
by
Scott Sandre & Denny Lee
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet; the most important aspect of our platform is to detect outages and anomalies that have the potential to cause serious impact to customer applications and user experience. Automatic detection of such events at the lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient and low-latency data pipelines in production using Flink, we decided to take it up a notch; we leveraged Flink to build statistical models in near real-time and apply them to the incoming stream of events to detect anomalies! In this session we will deep dive into the design as well as discuss pitfalls and learnings while developing our real-time platform that leverages Debezium, Kafka, Flink, ElastiCache and DynamoDB to process events at scale!
by
Kunal Umrigar & Balint Kurnasz
Airports, banks, stock exchanges, and countless other critical operations got thrown into chaos!
In an unprecedented event, a recent CrowdStrike update caused a global IT meltdown, leading to widespread Blue Screen of Death (BSOD) errors and crippling 8.5 million Microsoft Windows systems.
What triggered this massive disruption? How did Microsoft step in to provide a lifeline? And what are the next steps for recovery?
Swipe to uncover the full story, including expert insights and recovery steps for those affected.
TrustArc Webinar - Innovating with TRUSTe Responsible AI CertificationTrustArc
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team with 10,000+ privacy certifications issued, this framework integrated industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Getting Ready for Copilot for Microsoft 365 with Governance Features in Share...Juan Carlos Gonzalez
Session delivered at the Microsoft 365 Chicago Community Days where I introduce how governance controls within SharePoint Premium are a key asset in a successful rollout of Copilot for Microsoft 365. The session was mostly a hands-on session with multiple demos, as you can see in the session recording available on YouTube: https://www.youtube.com/watch?v=MavcP6k5nU8&t=199s. For more information about governance controls available in SharePoint Premium, visit the official documentation available at Microsoft Learn: https://learn.microsoft.com/en-us/sharepoint/advanced-management
Webinar: Transforming Substation Automation with Open Source SolutionsDanBrown980551
This webinar will provide an overview of open source software and tooling for digital substation automation in energy systems. The speakers will provide a brief overview of how open source collaborative development works in general, then delve into how it is driving innovation and accelerating the pace of substation automation. Examples of specific open source solutions and real-world implementations by utilities will be discussed. Participants will walk away with a better understanding of the challenges of automating substations, the ecosystem of solutions available to help, and best practices for implementing them.
IVE 2024 Short Course Lecture 9 - Empathic Computing in VRMark Billinghurst
IVE 2024 Short Course Lecture 9 on Empathic Computing in VR.
This lecture was given by Kunal Gupta on July 17th 2024 at the University of South Australia.
Discover practical tips and tricks for streamlining your Marketo programs from end to end. Whether you're new to Marketo or looking to enhance your existing processes, our expert speakers will provide insights and strategies you can implement right away.
Leading Bigcommerce Development Services for Online RetailersSynapseIndia
As a leading provider of Bigcommerce development services, we specialize in creating powerful, user-friendly e-commerce solutions. Our services help online retailers increase sales and improve customer satisfaction.
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
IT market in Israel, economic background, forecasts of 160 categories and the infrastructure and software products in those categories, professional services also. 710 vendors are ranked in 160 categories.
Connecting Attitudes and Social Influences with Designs for Usable Security a...Cori Faklaris
Many system designs for cybersecurity and privacy have failed to account for individual and social circumstances, leading people to use workarounds such as password reuse or account sharing that can lead to vulnerabilities. To address the problem, researchers are building new understandings of how individuals’ attitudes and behaviors are influenced by the people around them and by their relationship needs, so that designers can take these into account. In this talk, I will first share my research to connect people’s security attitudes and social influences with their security and privacy behaviors. As part of this, I will present the Security and Privacy Acceptance Framework (SPAF), which identifies Awareness, Motivation, and Ability as necessary for strengthening people’s acceptance of security and privacy practices. I then will present results from my project to trace where social influences can help overcome obstacles to adoption such as negative attitudes or inability to troubleshoot a password manager. I will conclude by discussing my current work to apply these insights to mitigating phishing in SMS text messages (“smishing”).
Lecture 8 of the IVE 2024 short course on the Pscyhology of XR.
This lecture introduced the basics of Electroencephalography (EEG).
It was taught by Ina and Matthias Schlesewsky on July 16th 2024 at the University of South Australia.
Planetek Italia is an Italian Benefit Company established in 1994, which employs 120+ women and men, passionate and skilled in Geoinformatics, Space solutions, and Earth science.
We provide solutions to exploit the value of geospatial data through all phases of data life cycle. We operate in many application areas ranging from environmental and land monitoring to open-government and smart cities, and including defence and security, as well as Space exploration and EO satellite missions.
Jacquard Fabric Explained: Origins, Characteristics, and Usesldtexsolbl
In this presentation, we’ll dive into the fascinating world of Jacquard fabric. We start by exploring what makes Jacquard fabric so special. It’s known for its beautiful, complex patterns that are woven into the fabric thanks to a clever machine called the Jacquard loom, invented by Joseph Marie Jacquard back in 1804. This loom uses either punched cards or modern digital controls to handle each thread separately, allowing for intricate designs that were once impossible to create by hand.
Next, we’ll look at the unique characteristics of Jacquard fabric and the different types you might encounter. From the luxurious brocade, often used in fancy clothing and home décor, to the elegant damask with its reversible patterns, and the artistic tapestry, each type of Jacquard fabric has its own special qualities. We’ll show you how these fabrics are used in everyday items like curtains, cushions, and even artworks, making them both functional and stylish.
Moving on, we’ll discuss how technology has changed Jacquard fabric production. Here, LD Texsol takes center stage. As a leading manufacturer and exporter of electronic Jacquard looms, LD Texsol is helping to modernize the weaving process. Their advanced technology makes it easier to create even more precise and complex patterns, and also helps make the production process more efficient and environmentally friendly.
Finally, we’ll wrap up by summarizing the key points and highlighting the exciting future of Jacquard fabric. Thanks to innovations from companies like LD Texsol, Jacquard fabric continues to evolve and impress, blending traditional techniques with cutting-edge technology. We hope this presentation gives you a clear picture of how Jacquard fabric has developed and where it’s headed in the future.
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your application performance
1. Introducing BinarySortedState: A new Flink state primitive to boost your application performance
Nico Kruber – Software Engineer
David Anderson – Community Engineering
Flink Forward 22
2. About me
Open source
● Apache Flink contributor/committer since 2016
○ Focus on network stack, usability, and performance
Career
● PhD in Computer Science at HU Berlin / Zuse Institute Berlin
● Software engineer -> Solutions architect -> Head of Solutions Architecture @ DataArtisans/Ververica (acquired by Alibaba)
● Engineering @ Immerok
About Immerok
● Building a fully managed Apache Flink cloud service for powering real-time systems at any scale
○ immerok.com
7. Use Case: Stream Sort - What's Happening Underneath

List<Event> listAtTs = events.get(ts);
if (listAtTs == null) {
    listAtTs = new ArrayList<>();
}
listAtTs.add(event);
events.put(ts, listAtTs);

[Diagram: the lookup goes to the state (RocksDB), searching the memtable + SST files for one entry; the full list, stored as byte[], passes through the de-/serializer to come back as a deserialized (Java) list.]
8. Use Case: Stream Sort - What's Happening Underneath

List<Event> listAtTs = events.get(ts);
if (listAtTs == null) {
    listAtTs = new ArrayList<>();
}
listAtTs.add(event);
events.put(ts, listAtTs);

[Diagram: on the put, the full Java list passes through the de-/serializer, and the serialized list is written as byte[] into the state (RocksDB), adding a new entry to the memtable and leaving the old one for compaction.]
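For context, here is a minimal sketch of the kind of KeyedProcessFunction the two snippets above are excerpted from. The class and field names are assumptions, and Event stands in for a POJO with a public long timestamp field; only the get/put pattern itself comes from the slides.

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only: buffer events per timestamp, emit them once the watermark passes.
// Assumes: public class Event { public long timestamp; }
public class StreamSorter extends KeyedProcessFunction<String, Event, Event> {

    private transient MapState<Long, List<Event>> events;

    @Override
    public void open(Configuration parameters) {
        events = getRuntimeContext().getMapState(new MapStateDescriptor<>(
                "events", Types.LONG, Types.LIST(Types.GENERIC(Event.class))));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        long ts = event.timestamp;
        // The pattern from slides 7-8: full deserialize, modify, full re-serialize.
        List<Event> listAtTs = events.get(ts);
        if (listAtTs == null) {
            listAtTs = new ArrayList<>();
        }
        listAtTs.add(event);
        events.put(ts, listAtTs);
        ctx.timerService().registerEventTimeTimer(ts);
    }

    @Override
    public void onTimer(long ts, OnTimerContext ctx, Collector<Event> out) throws Exception {
        // The watermark has passed ts, so these events can now be emitted in order.
        for (Event e : events.get(ts)) {
            out.collect(e);
        }
        events.remove(ts);
    }
}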
9. Use Case: Stream Sort - Alternative Solutions
● Using MapState<Long, Event> instead of MapState<Long, List<Event>>?
○ Cannot handle multiple events per timestamp
● Using Window API?
○ Efficient event storage per timestamp
○ No really well-matching window types: sliding, tumbling, and session windows
● Using HashMapStateBackend?
○ No de-/serialization overhead
○ state limited by available memory
○ no incremental checkpoints
● Using ListState<Event> and filtering in onTimer()? (see the sketch below)
○ Reduced overhead in processElement() vs. more to do in onTimer()
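To make that last alternative concrete, a minimal sketch under the same assumptions as above (an Event POJO with a long timestamp field; buffer would be a ListState<Event> initialized in open()):

// Appends to ListState are cheap (a RocksDB merge op), but onTimer() has to
// scan and rewrite the whole buffer.
public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
    buffer.add(event); // no read-modify-write of the full list
    ctx.timerService().registerEventTimeTimer(event.timestamp);
}

public void onTimer(long ts, OnTimerContext ctx, Collector<Event> out) throws Exception {
    List<Event> due = new ArrayList<>();
    List<Event> keep = new ArrayList<>();
    for (Event e : buffer.get()) {      // deserializes the entire buffer
        (e.timestamp <= ts ? due : keep).add(e);
    }
    due.sort(Comparator.comparingLong((Event e) -> e.timestamp));
    due.forEach(out::collect);
    buffer.update(keep);                // re-serializes what remains
}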
10. Use Case: Event-Time Stream Join

rates:
{ts: 5, code: GBP, rate: 1.20}
{ts: 10, code: USD, rate: 1.00}
{ts: 19, code: USD, rate: 1.02}

transactions:
{ts: 15, code: USD, amount: 1.00}
{ts: 10, code: GBP, amount: 2.00}
{ts: 25, code: USD, amount: 3.00}

joined result:
{ts1: 15, ts2: 10, amount: 1.00}
{ts1: 10, ts2: 5, amount: 2.40}
{ts1: 25, ts2: 19, amount: 3.06}

SELECT
  t.ts AS ts1, r.ts AS ts2,
  t.amount * r.rate AS amount
FROM transactions AS t
LEFT JOIN rates
  FOR SYSTEM_TIME AS OF t.ts AS r
  ON t.code = r.code;
14. BinarySortedState - History

2020: Started as a Hackathon project (David + Nico) on custom windowing with process functions

Nov 2021: “Temporal state” Hackathon project (David + Nico + Seth)
● Main primitives: getAt(), getAtOrBefore(), getAtOrAfter(), add(), addAll(), update()

April 2022: Created FLIP-220 and discussed it on dev@flink.apache.org
● Extended scope further to allow arbitrary user keys (not just timestamps)
○ Identified further use cases in SQL operators, e.g. min/max with retractions
● Clarified serializer requirements
● Extended the proposed API to offer range read and clear operations
● …
15. BinarySortedState - Goals
● A new keyed-state primitive, built on top of ListState
● Efficiently add to a list of values for a user-provided key
● Efficiently iterate user keys in a well-defined sort order, with native state-backend support, especially RocksDB
● Efficient operations for time-based functions (windowing, sorting, event-time joins, custom, ...)
● Operations on a subset of the state, based on user-key ranges
● Portable between state backends (RocksDB, HashMap)
(A usage sketch of the proposed API follows below.)
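As a taste of the proposed API, a minimal usage sketch. The add(), valuesAt(), readRangeUntil(), and clearEntryAt() calls appear later in this deck; the descriptor and serializer names are illustrative assumptions, since FLIP-220 was still under discussion when this talk was given:

// Hypothetical setup; the final FLIP-220 API may differ.
BinarySortedStateDescriptor<Long, Event> desc = new BinarySortedStateDescriptor<>(
        "events", lexicographicLongSerializer, eventSerializer);
BinarySortedState<Long, Event> events = getRuntimeContext().getBinarySortedState(desc);

events.add(ts, event);                      // append one value under user key ts
events.valuesAt(ts);                        // all values stored for exactly ts
events.readRangeUntil(watermark, true);     // entries with key <= watermark, in sort order
events.clearEntryAt(ts);                    // drop all values stored for ts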
17. BinarySortedState - How does it work with RocksDB?!
● RocksDB is a key-value store writing into MemTables → flushing into SST files
● SST files are sorted by the key in lexicographical binary order (byte-wise unsigned comparison)
● RocksDB offers Prefix Seek and SeekForPrev
● RocksDBMapState.RocksDBMapIterator provides efficient iteration via:
○ Fetching up to 128 RocksDB entries at once
○ RocksDBMapEntry with lazy key/value deserialization
18. BinarySortedState - LexicographicalTypeSerializer
● “Just” need to provide serializers that are compatible with RocksDB’s sort order
● Based on lexicographical binary order as defined by byte-wise unsigned comparison
● Compatible serializers extend LexicographicalTypeSerializer (encoding sketch below)

public abstract class LexicographicalTypeSerializer<T> extends TypeSerializer<T> {
    public Optional<Comparator<T>> findComparator() { return Optional.empty(); }
}
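To see what "compatible with RocksDB's sort order" means in practice, here is a self-contained sketch (illustrative only, not a Flink class) of an order-preserving byte encoding for signed longs:

import java.nio.ByteBuffer;
import java.util.Arrays;

// The kind of encoding a LexicographicalTypeSerializer for Long needs so that
// RocksDB's byte-wise unsigned comparison matches numeric order.
public class LexLongEncoding {

    // Flip the sign bit and write big-endian: negative values then sort below
    // positive ones under unsigned, byte-wise comparison.
    static byte[] encode(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value ^ Long.MIN_VALUE).array();
    }

    public static void main(String[] args) {
        long[] values = {Long.MIN_VALUE, -42L, -1L, 0L, 1L, 42L, Long.MAX_VALUE};
        for (int i = 1; i < values.length; i++) {
            // Prints true for each pair: byte order agrees with numeric order.
            System.out.println(values[i - 1] + " < " + values[i] + " : "
                    + (Arrays.compareUnsigned(encode(values[i - 1]), encode(values[i])) < 0));
        }
    }
}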
21. Stream Sort with BinarySortedState - What's Happening Underneath

events.add(ts, event);

[Diagram: the new list entry passes through the de-/serializer as a serialized entry (byte[]) and is added as a new merge op to the memtable in the state (RocksDB).]
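The cheap append relies on RocksDB's merge operator. Below is a standalone sketch in plain RocksJava (not Flink code; the path and values are arbitrary) of how appends become merge ops instead of read-modify-writes; Flink's list-style state uses the same mechanism internally:

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.StringAppendOperator;

// With a merge operator configured, appending to a list is a single merge op
// in the memtable instead of a read-modify-write of the whole list.
public class MergeOpSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options()
                        .setCreateIfMissing(true)
                        .setMergeOperator(new StringAppendOperator());
             RocksDB db = RocksDB.open(options, "/tmp/merge-op-sketch")) {
            byte[] key = "ts=42".getBytes();
            db.merge(key, "event-1".getBytes()); // no read of the existing list
            db.merge(key, "event-2".getBytes()); // another cheap append
            // Entries are only combined on read/compaction:
            System.out.println(new String(db.get(key))); // event-1,event-2
        }
    }
}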
22. Stream Sort with BinarySortedState - What's Happening Underneath

events.valuesAt(ts).forEach(out::collect);

[Diagram: the lookup searches the memtable + SST files for all entries; the full list as byte[] passes through the de-/serializer to come back as a deserialized (Java) list.]
23. Stream Sort with BinarySortedState - What's Happening Underneath

events.clearEntryAt(ts);

[Diagram: the delete marks the k/v as deleted in the state (RocksDB); actual removal happens during compaction.]
24. Event-Time Stream Join w/ and w/out BinarySortedState (1)

With BinarySortedState:
private BinarySortedState<Long, Transaction> transactions;
private BinarySortedState<Long, Double> rate;

public void processElement1(/*...*/) {
    // ...
    if (!isLate(ts, timerSvc)) {
        // append to BinarySortedState:
        transactions.add(ts, value);
        timerSvc.registerEventTimeTimer(ts);
    }
}
// similar for processElement2()

With MapState:
private MapState<Long, List<Transaction>> transactions;
private MapState<Long, Double> rate;

public void processElement1(/*...*/) {
    // ...
    if (!isLate(ts, timerSvc)) {
        // replace list in MapState:
        addTransaction(ts, value);
        timerSvc.registerEventTimeTimer(ts);
    }
}
// similar for processElement2()
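The MapState variant calls an addTransaction() helper that the slide leaves out; presumably it is the same read-modify-write pattern as on slide 7, sketched here:

// Hypothetical helper for the MapState variant (not shown on the slide):
// the read-modify-write that BinarySortedState.add() avoids.
private void addTransaction(long ts, Transaction value) throws Exception {
    List<Transaction> listAtTs = transactions.get(ts);
    if (listAtTs == null) {
        listAtTs = new ArrayList<>();
    }
    listAtTs.add(value);
    transactions.put(ts, listAtTs); // re-serializes the whole list
}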
26. Stream Sort with BinarySortedState - Optimized
● Idea:
○ Increase efficiency by processing all events between watermarks
● Challenge:
○ Registering a timer for the next watermark will fire too often
➔ Solution:
○ Register a timer for the first unprocessed event
○ When the timer fires:
■ Process all events until the current watermark (not the timer timestamp!)
● events.readRangeUntil(currentWatermark, true)
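Putting slide 26 together, a sketch of what the optimized onTimer() could look like. readRangeUntil() comes from this deck; the returned entry type and the bulk clear are assumptions, since the FLIP-220 API was still being finalized:

public void onTimer(long timer, OnTimerContext ctx, Collector<Event> out) throws Exception {
    long watermark = ctx.timerService().currentWatermark();
    // Process all events up to the current watermark, not just the timer's timestamp.
    for (Map.Entry<Long, List<Event>> entry : events.readRangeUntil(watermark, true)) {
        entry.getValue().forEach(out::collect); // keys iterate in sort order
    }
    events.clearRangeUntil(watermark, true);    // hypothetical bulk delete
}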
27. Event-Time Stream Join with BinarySortedState - Optimized
● Idea:
○ Increase efficiency by processing all events between watermarks
● Challenge:
○ Registering a timer for the next watermark will fire too often
➔ Solution:
○ Same as Stream Sort: Timer for first unprocessed event, processing until watermark, but:
○ When the timer fires:
■ Iterate both transactions and rate (in the appropriate time range) in event-time order
32. BinarySortedState - Who will benefit?
● (Custom) stream sorting
● Time-based (custom) joins
● Code with range-reads or bulk-deletes
● Custom window implementations
● Min/Max with retractions
● …
● Basically everything maintaining a MapState<?, List<?>> or requiring range operations
33. What’s left to do?
● Iron out last bits and pieces + tests
○ Start voting thread for FLIP-220 on dev@flink.apache.org
○ Create a PR and get it merged
● Expected to land in Flink 1.17 (as experimental feature)
● Port Table/SQL/DataStream operators to improve efficiency:
○ TemporalRowTimeJoinOperator (PoC already done for validating the API ✓)
○ RowTimeSortOperator
○ IntervalJoinOperator
○ CepOperator
○ …
● Provide more LexicographicalTypeSerializers
34. Get ready to ROK!!!
Nico Kruber
linkedin.com/in/nico-kruber
nico@immerok.com