Slides for Data Syndrome one hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
Introduction to PySpark
1. A One Hour Introduction to Analytics with PySpark
Introduction to PySpark
http://www.slideshare.net/rjurney/introduction-to-pyspark
or
http://bit.ly/intro_to_pyspark
2. Agile Data Science 2.0
Russell Jurney
Principal Consultant, Data Syndrome, LLC
Email: russell.jurney@gmail.com
Web: datasyndrome.com
Russell Jurney is a veteran data scientist and thought leader. He coined the term Agile Data Science in the book of that name from O'Reilly in 2012, which outlines the first agile development methodology for data science. Russell has constructed numerous full-stack analytics products over the past ten years and now works with clients helping them extract value from their data assets.
Skills: Data Engineer 85%, Data Scientist 85%, Visualization Software Engineer 85%, Writer 85%, Teacher 50%
3. Building Full-Stack Data Analytics Applications with Spark
http://bit.ly/agile_data_science
Agile Data Science 2.0
4. Agile Data Science 2.0 4
Realtime Predictive Analytics
Rapidly learn to build entire predictive systems driven by Kafka, PySpark, Spark Streaming and Spark MLlib, with a web front-end using Python/Flask and jQuery.
Available for purchase at http://datasyndrome.com/video
5. Product Consulting
We build analytics products and systems consisting of big data viz, predictions, recommendations, reports and search.
Corporate Training
We offer training courses for data scientists, engineers, and data science teams.
Video Training
We offer video training courses that rapidly acclimate you to a technology and technique.
14. Data Syndrome: Agile Data Science 2.0 14
Python 3 > 2.7
While the break in compatibility between Python 2.x and 3.x was unfortunate and unnecessary, Python 3 has increasingly become the platform of choice for analytics work. With a few alterations, all code in this course will execute in a Python 2.7 environment.
15. Data Syndrome: Agile Data Science 2.0 15
VirtualBox
VirtualBox is a Free and Open Source Software (FOSS) virtualization product for AMD64/Intel64 processors. It supports many operating systems and is under active development.
https://www.virtualbox.org/wiki/Downloads
16. Data Syndrome: Agile Data Science 2.0 16
Vagrant
Vagrant sits on top of VirtualBox and provides easy-to-use, reproducible development environments.
https://www.vagrantup.com/downloads.html
17. Data Syndrome: Agile Data Science 2.0
Vagrant Setup
17
Initializing our Vagrant Environment
# Get the project
git clone https://github.com/rjurney/Agile_Data_Code_2/
# Setup and connect to our virtual machine
vagrant up; vagrant ssh
# Now, from within Vagrant
cd Agile_Data_Code_2
./intro_download.sh
# See Appendix A and install.sh for manual install
18. Data Syndrome: Agile Data Science 2.0 18
Amazon EC2
Alternatively, Amazon Web Services provides a simple way to launch a prepared image for use in this exercise.
19. Data Syndrome: Agile Data Science 2.0
EC2 Setup for Ubuntu Linux
19
Initializing our EC2 Environment
# See ec2.sh, which uses aws/ec2_bootstrap.sh
# To use, add: --user-data file://aws/ec2_bootstrap.sh
# Get the project
git clone git@github.com:rjurney/Agile_Data_Code_2.git
# Setup AWS CLI tools
pip install awscli
# Edit and run r3.xlarge instance with your key
./ec2.sh
# ssh to the machine
20. Data Syndrome: Agile Data Science 2.0
EC2 Setup for Ubuntu Linux
Initializing our EC2 Environment
20
# Contents of ec2.sh
# Launch our instance, which ec2_bootstrap.sh will initialize
aws ec2 run-instances \
  --image-id ami-4ae1fb5d \
  --key-name agile_data_science \
  --user-data file://aws/ec2_bootstrap.sh \
  --instance-type r3.xlarge \
  --ebs-optimized \
  --placement "AvailabilityZone=us-east-1d" \
  --block-device-mappings '{"DeviceName":"/dev/sda1","Ebs":{"DeleteOnTermination":false,"VolumeSize":1024}}' \
  --count 1
21. Data Syndrome: Agile Data Science 2.0
EC2 Setup for Ubuntu Linux
Initializing our EC2 Environment
21
# Download the data
cd Agile_Data_Code_2
./intro_download.sh
23. Data Syndrome: Agile Data Science 2.0
Documentation Setup
Opening the right web pages to answer your questions
23
http://spark.apache.org/docs/latest/api/python/pyspark.html
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html
http://spark.apache.org/docs/latest/api/python/pyspark.ml.html
25. Data Syndrome: Agile Data Science 2.0
Hello, World!
How to load data and perform an operation on it in Spark
25
# See ch02/spark.py
# Load the text file using the SparkContext
csv_lines = sc.textFile("data/example.csv")
# Map the data to split the lines into a list
data = csv_lines.map(lambda line: line.split(","))
# Collect the dataset into local RAM
data.collect()
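Note: the snippet above assumes the pyspark shell, where sc and spark already exist. If you run it as a standalone Python script, a minimal sketch like this (our addition, with an arbitrary app name) creates them first:
# Only needed outside the pyspark shell: build a SparkSession, then grab its SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("intro_to_pyspark").getOrCreate()
sc = spark.sparkContext
csv_lines = sc.textFile("data/example.csv")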
26. Data Syndrome: Agile Data Science 2.0
Creating Objects from CSV using a function
How to create objects from CSV using a function instead of a lambda
26
# See ch02/groupby.py
csv_lines = sc.textFile("data/example.csv")
# Turn the CSV lines into objects
def csv_to_record(line):
  parts = line.split(",")
  record = {
    "name": parts[0],
    "company": parts[1],
    "title": parts[2]
  }
  return record
# Apply the function to every record
records = csv_lines.map(csv_to_record)
# Inspect the first item in the dataset
records.first()
27. Data Syndrome: Agile Data Science 2.0
Using a GroupBy to Count Jobs
Count things using the groupBy API
27
# Group the records by the name of the person
grouped_records = records.groupBy(lambda x: x["name"])
# Show the first group
grouped_records.first()
# Count the groups
job_counts = grouped_records.map(
lambda x: {
"name": x[0],
"job_count": len(x[1])
}
)
job_counts.first()
job_counts.collect()
28. Data Syndrome: Agile Data Science 2.0
Map vs FlatMap
Understanding the difference between these two operators
28
# See ch02/flatmap.py
csv_lines = sc.textFile("data/example.csv")
# Compute a relation of words by line
words_by_line = csv_lines\
  .map(lambda line: line.split(","))
words_by_line.collect()
# Compute a relation of words
flattened_words = csv_lines\
  .map(lambda line: line.split(","))\
  .flatMap(lambda x: x)
flattened_words.collect()
29. Data Syndrome: Agile Data Science 2.0
Map vs FlatMap
Understanding the difference between these two operators
29
words_by_line.collect()
[['Russell Jurney', 'Relato', 'CEO'],
['Florian Liebert', 'Mesosphere', 'CEO'],
['Don Brown', 'Rocana', 'CIO'],
['Steve Jobs', 'Apple', 'CEO'],
['Donald Trump', 'The Trump Organization', 'CEO'],
['Russell Jurney', 'Data Syndrome', 'Principal Consultant']]
flattened_words.collect()
['Russell Jurney',
'Relato',
'CEO',
'Florian Liebert',
'Mesosphere',
'CEO',
'Don Brown',
'Rocana',
'CIO',
'Steve Jobs',
'Apple',
'CEO',
'Donald Trump',
'The Trump Organization',
'CEO',
'Russell Jurney',
'Data Syndrome',
'Principal Consultant']
30. Data Syndrome: Agile Data Science 2.0
Using DataFrames and Spark SQL to Count Jobs
Converting an RDD to a DataFrame to use Spark SQL
30
# See ch02/sql.py
csv_lines = sc.textFile("data/example.csv")
from pyspark.sql import Row
# Convert the CSV into a pyspark.sql.Row
def csv_to_row(line):
  parts = line.split(",")
  row = Row(
    name=parts[0],
    company=parts[1],
    title=parts[2]
  )
  return row
# Apply the function to get rows in an RDD
rows = csv_lines.map(csv_to_row)
31. Data Syndrome: Agile Data Science 2.0
Using DataFrames and Spark SQL to Count Jobs
Converting an RDD to a DataFrame to use Spark SQL
31
# Convert to a pyspark.sql.DataFrame
rows_df = rows.toDF()
# Register the DataFrame for Spark SQL
rows_df.registerTempTable("executives")
# Generate a new DataFrame with SQL using the SparkSession
job_counts = spark.sql("""
SELECT
name,
COUNT(*) AS total
FROM executives
GROUP BY name
""")
job_counts.show()
# Go back to an RDD
job_counts.rdd.collect()
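Note: registerTempTable is the Spark 1.x spelling and is deprecated in Spark 2.x in favor of createOrReplaceTempView. A small sketch of the same registration step with the newer call, everything else unchanged:
# Spark 2.x+ equivalent of registerTempTable
rows_df.createOrReplaceTempView("executives")
job_counts = spark.sql("""
  SELECT name, COUNT(*) AS total
  FROM executives
  GROUP BY name
""")
job_counts.show()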
32. Agile Data Science 2.0 32
Working with a more complex dataset
Exploratory Data Analysis
with Airline Data
33. Data Syndrome: Agile Data Science 2.0
Loading a Parquet Columnar File
Using the Apache Parquet format to load columnar data
33
# See ch02/load_on_time_performance.py
# Load the parquet file containing flight delay records
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
# Register the data for Spark SQL
on_time_dataframe.registerTempTable("on_time_performance")
# Check out the columns
on_time_dataframe.columns
# Check out some data
on_time_dataframe\
  .select("FlightDate", "TailNum", "Origin", "Dest", "Carrier", "DepDelay", "ArrDelay")\
  .show()
34. Data Syndrome: Agile Data Science 2.0
Sampling a DataFrame
Sampling a DataFrame to get a better view of its data
34
# Trim the fields and keep the result
trimmed_on_time = on_time_dataframe\
  .select(
    "FlightDate",
    "TailNum",
    "Origin",
    "Dest",
    "Carrier",
    "DepDelay",
    "ArrDelay"
  )
# Sample 0.01% of the data and show
trimmed_on_time.sample(False, 0.0001).show()
35. Data Syndrome: Agile Data Science 2.0
Calculating a Histogram
Computing the distribution of a column in a dataset
35
# See ch02/histogram.py
# Load the parquet file containing flight delay records
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
# Register the data for Spark SQL
on_time_dataframe.registerTempTable("on_time_performance")
# Compute a histogram of departure delays
on_time_dataframe\
  .select("DepDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram(10)
36. Data Syndrome: Agile Data Science 2.0
Displaying a Histogram
Using pyplot to display a histogram
36
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
# Function to plot a histogram using pyplot
def create_hist(rdd_histogram_data):
  """Given an RDD.histogram, plot a pyplot histogram"""
  heights = np.array(rdd_histogram_data[1])
  full_bins = rdd_histogram_data[0]
  mid_point_bins = full_bins[:-1]
  widths = [abs(i - j) for i, j in zip(full_bins[:-1], full_bins[1:])]
  bar = plt.bar(mid_point_bins, heights, width=widths, color='b')
  return bar
# Compute a histogram of departure delays using explicit bucket boundaries
departure_delay_histogram = on_time_dataframe\
  .select("DepDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram([-60, -30, -15, -10, -5, 0, 5, 10, 15, 30, 60, 90, 120, 180])
create_hist(departure_delay_histogram)
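In an interactive pylab or Jupyter session the chart renders inline; in a plain script it will not appear on its own. A short follow-up sketch (our addition, assuming matplotlib's default backend):
import matplotlib.pyplot as plt
create_hist(departure_delay_histogram)
plt.title("Departure delay distribution (minutes)")
plt.show()  # or plt.savefig("departure_delays.png") to write the chart to disk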
37. Data Syndrome: Agile Data Science 2.0
Displaying a Histogram
Using pyplot to display a histogram
37
38. Data Syndrome: Agile Data Science 2.0
Counting Airplanes
How many airplanes are in the US fleet in total?
38
# See ch05/assess_airplanes.py
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
on_time_dataframe.registerTempTable("on_time_performance")
# Dump the unneeded fields
tail_numbers = on_time_dataframe.rdd.map(lambda x: x.TailNum)
tail_numbers = tail_numbers.filter(lambda x: x != '')
# distinct() gets us unique tail numbers
unique_tail_numbers = tail_numbers.distinct()
# now we need a count() of unique tail numbers
airplane_count = unique_tail_numbers.count()
print("Total airplanes: {}".format(airplane_count))
39. Data Syndrome: Agile Data Science 2.0
Counting Total Flights by Month
Preparing data for a chart
39
# See ch05/total_flights.py
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
# Use SQL to look at the total flights by month across 2015
on_time_dataframe.registerTempTable("on_time_dataframe")
total_flights_by_month = spark.sql(
"""SELECT Month, Year, COUNT(*) AS total_flights
FROM on_time_dataframe
GROUP BY Year, Month
ORDER BY Year, Month"""
)
# This map/asDict trick makes the rows print a little prettier. It is optional.
flights_chart_data = total_flights_by_month.rdd.map(lambda row: row.asDict())
flights_chart_data.collect()
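Once collected, the rows are plain Python dicts, so any charting library will do. A hedged sketch (our own, assuming matplotlib is available and that Year and Month format cleanly as labels):
import matplotlib.pyplot as plt
chart_data = flights_chart_data.collect()
labels = ["{}-{}".format(row["Year"], row["Month"]) for row in chart_data]
totals = [row["total_flights"] for row in chart_data]
# Simple bar chart of total flights per month
plt.bar(range(len(totals)), totals)
plt.xticks(range(len(labels)), labels, rotation=90)
plt.ylabel("Total flights")
plt.tight_layout()
plt.show()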
40. Data Syndrome: Agile Data Science 2.0
Preparing Complex Records for Storage
Getting data ready for storage in a document or key/value store
40
# See ch05/extract_airplanes.py
# Load the parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
on_time_dataframe.registerTempTable("on_time_performance")
# Filter down to the fields we need to identify and link to a flight
flights = on_time_dataframe.rdd.map(lambda x:
(x.Carrier, x.FlightDate, x.FlightNum, x.Origin, x.Dest, x.TailNum)
)
# Group flights by tail number, sorted by date, then flight number, then origin/dest
flights_per_airplane = flights\
  .map(lambda nameTuple: (nameTuple[5], [nameTuple[0:5]]))\
  .reduceByKey(lambda a, b: a + b)\
  .map(lambda tuple:
    {
      'TailNum': tuple[0],
      'Flights': sorted(tuple[1], key=lambda x: (x[1], x[2], x[3], x[4]))
    }
  )
flights_per_airplane.first()
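flights_per_airplane is now an RDD of plain dictionaries, which is the shape document stores and search engines like to bulk-load. A minimal sketch (ours; the output path is arbitrary) that writes the records out as JSON Lines:
import json
# Serialize each record to one JSON string per line; default=str guards against date fields
flights_per_airplane\
  .map(lambda record: json.dumps(record, default=str))\
  .saveAsTextFile("data/flights_per_airplane.jsonl")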
41. Data Syndrome: Agile Data Science 2.0
Counting Flight Delays
Analyzing and understanding why flights are late
41
# See ch07/explore_delays.py
# Load the on-time parquet file
on_time_dataframe = spark.read.parquet('data/on_time_performance.parquet')
on_time_dataframe.registerTempTable("on_time_performance")
total_flights = on_time_dataframe.count()
# Flights that were late leaving...
late_departures = on_time_dataframe.filter(on_time_dataframe.DepDelayMinutes > 0)
total_late_departures = late_departures.count()
# Flights that were late arriving...
late_arrivals = on_time_dataframe.filter(on_time_dataframe.ArrDelayMinutes > 0)
total_late_arrivals = late_arrivals.count()
# Get the percentage of flights that are late, rounded to 1 decimal place
pct_late = round((total_late_arrivals / (total_flights * 1.0)) * 100, 1)
42. Data Syndrome: Agile Data Science 2.0
Hero Flights
How many flights made up time in the air, departing late but arriving on time?
42
# See ch07/explore_delays.py
# Flights that left late but made up time to arrive on time...
on_time_heros = on_time_dataframe.filter(
(on_time_dataframe.DepDelayMinutes > 0)
&
(on_time_dataframe.ArrDelayMinutes <= 0)
)
total_on_time_heros = on_time_heros.count()
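A percentage in the same style as pct_late may read better next to the raw counts; a one-line sketch of ours (the name pct_on_time_heros is hypothetical):
# Share of all flights that departed late yet still arrived on time
pct_on_time_heros = round((total_on_time_heros / (total_flights * 1.0)) * 100, 1)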
43. Data Syndrome: Agile Data Science 2.0
Presenting Results
Displaying in plain text the answers we've just calculated
43
# See ch07/explore_delays.py
print("Total flights: {:,}".format(total_flights))
print("Late departures: {:,}".format(total_late_departures))
print("Late arrivals: {:,}".format(total_late_arrivals))
print("Recoveries: {:,}".format(total_on_time_heros))
print("Percentage Late: {}%".format(pct_late))
44. Data Syndrome: Agile Data Science 2.0
44
# See ch07/explore_delays.py
# Get the average minutes late departing and arriving
spark.sql("""
SELECT
ROUND(AVG(DepDelay),1) AS AvgDepDelay,
ROUND(AVG(ArrDelay),1) AS AvgArrDelay
FROM on_time_performance
"""
).show()
Average Lateness Departing and Arriving
Drilling down into flights and how late they are…
45. Data Syndrome: Agile Data Science 2.0
Sampling Late Flights
Getting to know our data by sampling records of interest
45
# Why are flights late? Let's look at some delayed flights and the delay causes
late_flights = spark.sql("""
SELECT
ArrDelayMinutes,
WeatherDelay,
CarrierDelay,
NASDelay,
SecurityDelay,
LateAircraftDelay
FROM
on_time_performance
WHERE
WeatherDelay IS NOT NULL
OR
CarrierDelay IS NOT NULL
OR
NASDelay IS NOT NULL
OR
SecurityDelay IS NOT NULL
OR
LateAircraftDelay IS NOT NULL
ORDER BY
FlightDate
""")
late_flights.sample(False, 0.01).show()
46. Data Syndrome: Agile Data Science 2.0
Why are Flights Late?
Analyzing and understanding why flights are late
46
# Calculate the percentage contribution to delay for each source
total_delays = spark.sql("""
SELECT
ROUND(SUM(WeatherDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_weather_delay,
ROUND(SUM(CarrierDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_carrier_delay,
ROUND(SUM(NASDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_nas_delay,
ROUND(SUM(SecurityDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_security_delay,
ROUND(SUM(LateAircraftDelay)/SUM(ArrDelayMinutes) * 100, 1) AS pct_late_aircraft_delay
FROM on_time_performance
""")
total_delays.show()
47. Data Syndrome: Agile Data Science 2.0
How Often are Weather Delayed Flights Late?
Analyzing and understanding why flights are late
47
# Eyeball the first to define our buckets
weather_delay_histogram = on_time_dataframe\
  .select("WeatherDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram([1, 5, 10, 15, 30, 60, 120, 240, 480, 720, 24*60.0])
print(weather_delay_histogram)
create_hist(weather_delay_histogram)
48. Data Syndrome: Agile Data Science 2.0
How Often are Weather Delayed Flights Late?
Analyzing and understanding why flights are late
48
49. Data Syndrome: Agile Data Science 2.0
Preparing Histogram Data for d3.js
Analyzing and understanding why flights are late
49
# Transform the data into something easily consumed by d3
def histogram_to_publishable(histogram):
  record = {'key': 1, 'data': []}
  for label, value in zip(histogram[0], histogram[1]):
    record['data'].append(
      {
        'label': label,
        'value': value
      }
    )
  return record
# Recompute the weather histogram, filtering to flights that actually had a weather delay
weather_delay_histogram = on_time_dataframe\
  .filter(
    (on_time_dataframe.WeatherDelay.isNotNull())
    &
    (on_time_dataframe.WeatherDelay > 0)
  )\
  .select("WeatherDelay")\
  .rdd\
  .flatMap(lambda x: x)\
  .histogram([0, 15, 30, 60, 120, 240, 480, 720, 24*60.0])
print(weather_delay_histogram)
record = histogram_to_publishable(weather_delay_histogram)
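The record still has to reach the browser before d3.js can draw it. A small sketch (ours; the file name is arbitrary) that writes it to disk as JSON, which d3.json() or a Flask route could then serve:
import json
# Persist the d3-friendly record for the web tier to load
with open("weather_delay_histogram.json", "w") as f:
  json.dump(record, f)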
50. Agile Data Science 2.0 50
Building a classifier model
Predictive Analytics
Machine Learning
51. Data Syndrome: Agile Data Science 2.0
Download Prepared Training Data
Saving time by using a prepared dataset
51
# Be in the root directory of the project
cd Agile_Data_Code_2
# Run the download script
ch08/download_data.sh
52. Data Syndrome: Agile Data Science 2.0
String Vectorization
From properties of items to vector format
52
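Slide 52 shows string vectorization as a diagram; in code the same idea is a StringIndexer (string category to numeric index) followed by a VectorAssembler (columns to one feature vector). A toy sketch with made-up data, separate from the full model script that follows:
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Hypothetical toy data: one string property and one numeric property per item
df = spark.createDataFrame(
  [("WN", 14.0), ("AA", 5.0), ("WN", -3.0)],
  ["Carrier", "DepDelay"]
)
# Encode the string column as a numeric index
indexed = StringIndexer(inputCol="Carrier", outputCol="Carrier_index")\
  .fit(df)\
  .transform(df)
# Assemble the numeric column plus the index into a single feature vector
vectorized = VectorAssembler(
  inputCols=["DepDelay", "Carrier_index"],
  outputCol="Features_vec"
).transform(indexed)
vectorized.show()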
53. Data Syndrome: Agile Data Science 2.0 53
190 Line Model: ch08/train_spark_mllib_model.py
The scikit-learn version of this model was 166 lines. Spark MLlib is very powerful!
#!/usr/bin/env python

import sys, os, re

# Pass date and base path to main() from airflow
def main(base_path):

  # Default to "."
  try: base_path
  except NameError: base_path = "."
  if not base_path:
    base_path = "."

  APP_NAME = "train_spark_mllib_model.py"

  # If there is no SparkSession, create the environment
  try:
    sc and spark
  except NameError as e:
    import findspark
    findspark.init()
    import pyspark
    import pyspark.sql

    sc = pyspark.SparkContext()
    spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
  #
  # {
  #   "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
  #   "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
  #   "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
  # }
  #
  from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
  from pyspark.sql.types import StructType, StructField
  from pyspark.sql.functions import udf

  schema = StructType([
    StructField("ArrDelay", DoubleType(), True),       # "ArrDelay":5.0
    StructField("CRSArrTime", TimestampType(), True),  # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
    StructField("CRSDepTime", TimestampType(), True),  # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
    StructField("Carrier", StringType(), True),        # "Carrier":"WN"
    StructField("DayOfMonth", IntegerType(), True),    # "DayOfMonth":31
    StructField("DayOfWeek", IntegerType(), True),     # "DayOfWeek":4
    StructField("DayOfYear", IntegerType(), True),     # "DayOfYear":365
    StructField("DepDelay", DoubleType(), True),       # "DepDelay":14.0
    StructField("Dest", StringType(), True),           # "Dest":"SAN"
    StructField("Distance", DoubleType(), True),       # "Distance":368.0
    StructField("FlightDate", DateType(), True),       # "FlightDate":"2015-12-30T16:00:00.000-08:00"
    StructField("FlightNum", StringType(), True),      # "FlightNum":"6109"
    StructField("Origin", StringType(), True),         # "Origin":"TUS"
  ])

  input_path = "{}/data/simple_flight_delay_features.jsonl.bz2".format(
    base_path
  )
  features = spark.read.json(input_path, schema=schema)
  features.first()
#
# Check for nulls in features before using Spark ML
#
null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
print(list(cols_with_nulls))
#
# Add a Route variable to replace FlightNum
#
from pyspark.sql.functions import lit, concat
features_with_route = features.withColumn(
'Route',
concat(
features.Origin,
lit('-'),
features.Dest
)
)
features_with_route.show(6)
#
# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into early, on-time, slightly late, very late (0, 1, 2, 3)
#
from pyspark.ml.feature import Bucketizer
# Setup the Bucketizer
splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
arrival_bucketizer = Bucketizer(
splits=splits,
inputCol="ArrDelay",
outputCol="ArrDelayBucket"
)
# Save the bucketizer
arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
# Apply the bucketizer
ml_bucketized_features = arrival_bucketizer.transform(features_with_route)
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
#
# Feature extraction tools live in pyspark.ml.feature
#
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Turn category fields into indexes
for column in ["Carrier", "Origin", "Dest", "Route"]:
string_indexer = StringIndexer(
inputCol=column,
outputCol=column + "_index"
)
string_indexer_model = string_indexer.fit(ml_bucketized_features)
ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
# Drop the original column
ml_bucketized_features = ml_bucketized_features.drop(column)
# Save the pipeline model
string_indexer_output_path = "{}/models/string_indexer_model_{}.bin".format(
base_path,
column
)
string_indexer_model.write().overwrite().save(string_indexer_output_path)
# Combine continuous, numeric fields with indexes of nominal ones
# ...into one feature vector
numeric_columns = [
"DepDelay", "Distance",
"DayOfMonth", "DayOfWeek",
"DayOfYear"]
index_columns = ["Carrier_index", "Origin_index",
"Dest_index", "Route_index"]
vector_assembler = VectorAssembler(
inputCols=numeric_columns + index_columns,
outputCol="Features_vec"
)
final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
# Save the numeric vector assembler
vector_assembler_path = "{}/models/numeric_vector_assembler.bin".format(base_path)
vector_assembler.write().overwrite().save(vector_assembler_path)
# Drop the index columns
for column in index_columns:
final_vectorized_features = final_vectorized_features.drop(column)
# Inspect the finalized features
final_vectorized_features.show()
# Instantiate and fit random forest classifier on all the data
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(
featuresCol="Features_vec",
labelCol="ArrDelayBucket",
predictionCol="Prediction",
maxBins=4657,
maxMemoryInMB=1024
)
model = rfc.fit(final_vectorized_features)
# Save the new model over the old one
model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.5.0.bin".format(
base_path
)
model.write().overwrite().save(model_output_path)
# Evaluate the model (here on the same data it was fit on; a proper test/train split comes later)
predictions = model.transform(final_vectorized_features)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(
predictionCol="Prediction",
labelCol="ArrDelayBucket",
metricName="accuracy"
)
accuracy = evaluator.evaluate(predictions)
print("Accuracy = {}".format(accuracy))
# Check the distribution of predictions
predictions.groupBy("Prediction").count().show()
# Check a sample
predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
if __name__ == "__main__":
main(sys.argv[1])
54. Data Syndrome: Agile Data Science 2.0
Loading Our Training Data
Loading our data as a DataFrame to use the Spark ML APIs
54
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
from pyspark.sql.types import StructType, StructField
from pyspark.sql.functions import udf
schema = StructType([
StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
StructField("Carrier", StringType(), True), # "Carrier":"WN"
StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
StructField("Dest", StringType(), True), # "Dest":"SAN"
StructField("Distance", DoubleType(), True), # "Distance":368.0
StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
StructField("Origin", StringType(), True), # "Origin":"TUS"
])
features = spark.read.json(
"data/simple_flight_delay_features.jsonl.bz2",
schema=schema
)
features.first()
55. Data Syndrome: Agile Data Science 2.0
Checking the Data for Nulls
Nulls will cause problems hereafter, so detect and address them first
55
#
# Check for nulls in features before using Spark ML
#
null_counts = [(column, features.where(features[column].isNull()).count()) for column in features.columns]
cols_with_nulls = filter(lambda x: x[1] > 0, null_counts)
print(list(cols_with_nulls))
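Detection is only half the job; if any columns do turn up nulls, one way to address them is with the DataFrame's na helpers. A minimal sketch; the subset columns are illustrative.
# Drop rows that are null in the label or a key feature...
features = features.na.drop(subset=["ArrDelay", "DepDelay"])
# ...or fill numeric nulls with a default value instead
# features = features.na.fill(0.0)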
56. Data Syndrome: Agile Data Science 2.0
Adding a Feature - The Route
Route is defined as origin airport code + "-" + destination airport code
56
#
# Add a Route variable to replace FlightNum
#
from pyspark.sql.functions import lit, concat
features_with_route = features.withColumn(
'Route',
concat(
features.Origin,
lit('-'),
features.Dest
)
)
features_with_route.select("Origin", "Dest", "Route").show(5)
57. Data Syndrome: Agile Data Science 2.0
Bucketizing ArrDelay into ArrDelayBucket
We can’t classify a continuous variable, so we must bucketize it to make it nominal/categorical
57
#
# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay
#
from pyspark.ml.feature import Bucketizer
splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
bucketizer = Bucketizer(
splits=splits,
inputCol="ArrDelay",
outputCol="ArrDelayBucket"
)
ml_bucketized_features = bucketizer.transform(features_with_route)
# Check the buckets out
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
58. Data Syndrome: Agile Data Science 2.0
Indexing String Columns into Numeric Columns
Nominal/categorical/string columns need to be made numeric before we can vectorize them
58
#
# Feature extraction tools live in pyspark.ml.feature
#
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Turn the categorical fields into indexed numeric columns
for column in ["Carrier", "DayOfMonth", "DayOfWeek", "DayOfYear",
               "Origin", "Dest", "Route"]:
  string_indexer = StringIndexer(
    inputCol=column,
    outputCol=column + "_index"
  )
  ml_bucketized_features = string_indexer.fit(ml_bucketized_features) \
    .transform(ml_bucketized_features)
# Check out the indexes
ml_bucketized_features.show(6)
59. Data Syndrome: Agile Data Science 2.0
Combining Numeric and Indexed Fields into One Vector
Our classifier needs a single features field, so we combine all our numeric and indexed fields into one feature vector
59
# Combine the continuous, numeric fields with the indexes of the nominal ones into one feature vector
numeric_columns = ["DepDelay", "Distance"]
index_columns = ["Carrier_index", "DayOfMonth_index",
                 "DayOfWeek_index", "DayOfYear_index",
                 "Origin_index", "Dest_index", "Route_index"]
vector_assembler = VectorAssembler(
  inputCols=numeric_columns + index_columns,
  outputCol="Features_vec"
)
final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
# Drop the now-redundant index columns
for column in index_columns:
  final_vectorized_features = final_vectorized_features.drop(column)
# Check out the features
final_vectorized_features.show()
60. Data Syndrome: Agile Data Science 2.0
Splitting our Data into a Test/Train Split
We need to split our data to evaluate the performance of our classifier
60
#
# Cross validate, train and evaluate classifier
#
# Test/train split
training_data, test_data = final_vectorized_features.randomSplit([0.7, 0.3])
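If you want the same split between runs, randomSplit also takes a seed (the value below is arbitrary), and counting both sides is a quick sanity check.
# Reproducible split plus a sanity check on the resulting row counts
training_data, test_data = final_vectorized_features.randomSplit([0.7, 0.3], seed=1337)
print("Train: {} rows, Test: {} rows".format(training_data.count(), test_data.count()))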
61. Data Syndrome: Agile Data Science 2.0
Training Our Model
This is the magic in machine learning, and it is only a couple of lines of code
61
# Instantiate and fit random forest classifier
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(
featuresCol="Features_vec",
labelCol="ArrDelayBucket",
maxBins=4657,
maxMemoryInMB=1024
)
model = rfc.fit(training_data)
62. Data Syndrome: Agile Data Science 2.0
Evaluating Our Model
Using the test/train split to evaluate our model for accuracy
62
# Evaluate model using test data
predictions = model.transform(test_data)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(labelCol="ArrDelayBucket", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = {}".format(accuracy))
63. Data Syndrome: Agile Data Science 2.0
Sampling Our Predictions
Making sure they pass the sniff check
63
# Check a sample
predictions.sample(False, 0.001, 18).orderBy("CRSDepTime").show(6)
64. Data Syndrome: Agile Data Science 2.0
Experiment Setup
Necessary to improve the model
64
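The shape of the experiment, before the full script on the next slide: repeat the test/train split several times, score each run on several metrics, and average them so a single lucky split doesn't mislead you. A minimal sketch, reusing the rfc classifier and final_vectorized_features from the previous slides.
from collections import defaultdict
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
scores = defaultdict(list)
for i in range(3):  # several random test/train splits
  train, test = final_vectorized_features.randomSplit([0.8, 0.2])
  model = rfc.fit(train)  # the RandomForestClassifier defined on the Training Our Model slide
  predictions = model.transform(test)
  for metric in ["accuracy", "f1"]:
    evaluator = MulticlassClassificationEvaluator(
      labelCol="ArrDelayBucket",
      metricName=metric
    )
    scores[metric].append(evaluator.evaluate(predictions))
# Average each metric across the runs
for metric, values in scores.items():
  print(metric, sum(values) / len(values))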
65. Data Syndrome: Agile Data Science 2.0 65
155 additional lines to set up an experiment
and add 3 new features to improve the model
ch09/improve_spark_mllib_model.py
345 L.O.C.
#!/usr/bin/env python
import sys, os, re
import json
import datetime, iso8601
from tabulate import tabulate
# Pass the base path to main() from Airflow
def main(base_path):
APP_NAME = "train_spark_mllib_model.py"
# If there is no SparkSession, create the environment
try:
sc and spark
except NameError as e:
import findspark
findspark.init()
import pyspark
import pyspark.sql
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc).builder.appName(APP_NAME).getOrCreate()
#
# {
# "ArrDelay":5.0,"CRSArrTime":"2015-12-31T03:20:00.000-08:00","CRSDepTime":"2015-12-31T03:05:00.000-08:00",
# "Carrier":"WN","DayOfMonth":31,"DayOfWeek":4,"DayOfYear":365,"DepDelay":14.0,"Dest":"SAN","Distance":368.0,
# "FlightDate":"2015-12-30T16:00:00.000-08:00","FlightNum":"6109","Origin":"TUS"
# }
#
from pyspark.sql.types import StringType, IntegerType, FloatType, DoubleType, DateType, TimestampType
from pyspark.sql.types import StructType, StructField
from pyspark.sql.functions import udf
schema = StructType([
StructField("ArrDelay", DoubleType(), True), # "ArrDelay":5.0
StructField("CRSArrTime", TimestampType(), True), # "CRSArrTime":"2015-12-31T03:20:00.000-08:00"
StructField("CRSDepTime", TimestampType(), True), # "CRSDepTime":"2015-12-31T03:05:00.000-08:00"
StructField("Carrier", StringType(), True), # "Carrier":"WN"
StructField("DayOfMonth", IntegerType(), True), # "DayOfMonth":31
StructField("DayOfWeek", IntegerType(), True), # "DayOfWeek":4
StructField("DayOfYear", IntegerType(), True), # "DayOfYear":365
StructField("DepDelay", DoubleType(), True), # "DepDelay":14.0
StructField("Dest", StringType(), True), # "Dest":"SAN"
StructField("Distance", DoubleType(), True), # "Distance":368.0
StructField("FlightDate", DateType(), True), # "FlightDate":"2015-12-30T16:00:00.000-08:00"
StructField("FlightNum", StringType(), True), # "FlightNum":"6109"
StructField("Origin", StringType(), True), # "Origin":"TUS"
])
input_path = "{}/data/simple_flight_delay_features.json".format(
base_path
)
features = spark.read.json(input_path, schema=schema)
features.first()
#
# Add a Route variable to replace FlightNum
#
from pyspark.sql.functions import lit, concat
features_with_route = features.withColumn(
'Route',
concat(
features.Origin,
lit('-'),
features.Dest
)
)
features_with_route.show(6)
#
# Add the hour of day of scheduled arrival/departure
#
from pyspark.sql.functions import hour
features_with_hour = features_with_route.withColumn(
"CRSDepHourOfDay",
hour(features.CRSDepTime)
)
features_with_hour = features_with_hour.withColumn(
"CRSArrHourOfDay",
hour(features.CRSArrTime)
)
features_with_hour.select("CRSDepTime", "CRSDepHourOfDay", "CRSArrTime", "CRSArrHourOfDay").show()
#
# Use pyspark.ml.feature.Bucketizer to bucketize ArrDelay into early, on-time, slightly late, very late (0, 1, 2, 3)
#
from pyspark.ml.feature import Bucketizer
# Setup the Bucketizer
splits = [-float("inf"), -15.0, 0, 30.0, float("inf")]
arrival_bucketizer = Bucketizer(
splits=splits,
inputCol="ArrDelay",
outputCol="ArrDelayBucket"
)
# Save the model
arrival_bucketizer_path = "{}/models/arrival_bucketizer_2.0.bin".format(base_path)
arrival_bucketizer.write().overwrite().save(arrival_bucketizer_path)
# Apply the model
ml_bucketized_features = arrival_bucketizer.transform(features_with_hour)
ml_bucketized_features.select("ArrDelay", "ArrDelayBucket").show()
#
# Feature extraction tools live in pyspark.ml.feature
#
from pyspark.ml.feature import StringIndexer, VectorAssembler
# Turn category fields into indexes
for column in ["Carrier", "Origin", "Dest", "Route"]:
string_indexer = StringIndexer(
inputCol=column,
outputCol=column + "_index"
)
string_indexer_model = string_indexer.fit(ml_bucketized_features)
ml_bucketized_features = string_indexer_model.transform(ml_bucketized_features)
# Save the pipeline model
string_indexer_output_path = "{}/models/string_indexer_model_3.0.{}.bin".format(
base_path,
column
)
string_indexer_model.write().overwrite().save(string_indexer_output_path)
# Combine continuous, numeric fields with indexes of nominal ones
# ...into one feature vector
numeric_columns = [
"DepDelay", "Distance",
"DayOfMonth", "DayOfWeek",
"DayOfYear", "CRSDepHourOfDay",
"CRSArrHourOfDay"]
index_columns = ["Carrier_index", "Origin_index",
"Dest_index", "Route_index"]
vector_assembler = VectorAssembler(
inputCols=numeric_columns + index_columns,
outputCol="Features_vec"
)
final_vectorized_features = vector_assembler.transform(ml_bucketized_features)
# Save the numeric vector assembler
vector_assembler_path = "{}/models/numeric_vector_assembler_3.0.bin".format(base_path)
vector_assembler.write().overwrite().save(vector_assembler_path)
# Drop the index columns
for column in index_columns:
final_vectorized_features = final_vectorized_features.drop(column)
# Inspect the finalized features
final_vectorized_features.show()
#
# Cross validate, train and evaluate classifier: loop split_count (3) times over 4 metrics
#
from collections import defaultdict
scores = defaultdict(list)
feature_importances = defaultdict(list)
metric_names = ["accuracy", "weightedPrecision", "weightedRecall", "f1"]
split_count = 3
for i in range(1, split_count + 1):
print("nRun {} out of {} of test/train splits in cross validation...".format(
i,
split_count,
)
)
# Test/train split
training_data, test_data = final_vectorized_features.randomSplit([0.8, 0.2])
# Instantiate and fit random forest classifier on all the data
from pyspark.ml.classification import RandomForestClassifier
rfc = RandomForestClassifier(
featuresCol="Features_vec",
labelCol="ArrDelayBucket",
predictionCol="Prediction",
maxBins=4657,
)
model = rfc.fit(training_data)
# Save the new model over the old one
model_output_path = "{}/models/spark_random_forest_classifier.flight_delays.baseline.bin".format(
base_path
)
model.write().overwrite().save(model_output_path)
# Evaluate model using test data
predictions = model.transform(test_data)
# Evaluate this split's results for each metric
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
for metric_name in metric_names:
evaluator = MulticlassClassificationEvaluator(
labelCol="ArrDelayBucket",
predictionCol="Prediction",
metricName=metric_name
)
score = evaluator.evaluate(predictions)
scores[metric_name].append(score)
print("{} = {}".format(metric_name, score))
#
# Collect feature importances
#
feature_names = vector_assembler.getInputCols()
feature_importance_list = model.featureImportances
for feature_name, feature_importance in zip(feature_names, feature_importance_list):
feature_importances[feature_name].append(feature_importance)
#
# Evaluate average and STD of each metric and print a table
#
import numpy as np
score_averages = defaultdict(float)
# Compute the table data
average_stds = [] # ha
for metric_name in metric_names:
metric_scores = scores[metric_name]
average_accuracy = sum(metric_scores) / len(metric_scores)
score_averages[metric_name] = average_accuracy
std_accuracy = np.std(metric_scores)
average_stds.append((metric_name, average_accuracy, std_accuracy))
# Print the table
print("nExperiment Log")
print("--------------")
print(tabulate(average_stds, headers=["Metric", "Average", "STD"]))
#
# Persist the scores to a score log that persists between runs
#
import pickle
# Load the score log or initialize an empty one
try:
score_log_filename = "{}/models/score_log.pickle".format(base_path)
score_log = pickle.load(open(score_log_filename, "rb"))
if not isinstance(score_log, list):
score_log = []
except IOError:
score_log = []
# Compute the existing score log entry
score_log_entry = {metric_name: score_averages[metric_name] for metric_name in metric_names}
# Compute and display the change in score for each metric
try:
last_log = score_log[-1]
except (IndexError, TypeError, AttributeError):
last_log = score_log_entry
experiment_report = []
for metric_name in metric_names:
run_delta = score_log_entry[metric_name] - last_log[metric_name]
experiment_report.append((metric_name, run_delta))
print("nExperiment Report")
print("-----------------")
print(tabulate(experiment_report, headers=["Metric", "Score"]))
# Append the existing average scores to the log
score_log.append(score_log_entry)
# Persist the log for next run
pickle.dump(score_log, open(score_log_filename, "wb"))
#
# Analyze and report feature importance changes
#
# Compute averages for each feature
feature_importance_entry = defaultdict(float)
for feature_name, value_list in feature_importances.items():
average_importance = sum(value_list) / len(value_list)
feature_importance_entry[feature_name] = average_importance
# Sort the feature importances in descending order and print
import operator
sorted_feature_importances = sorted(
feature_importance_entry.items(),
key=operator.itemgetter(1),
reverse=True
)
print("nFeature Importances")
print("-------------------")
print(tabulate(sorted_feature_importances, headers=['Name', 'Importance']))
#
# Compare this run's feature importances with the previous run's
#
# Load the feature importance log or initialize an empty one
try:
feature_log_filename = "{}/models/feature_log.pickle".format(base_path)
feature_log = pickle.load(open(feature_log_filename, "rb"))
if not isinstance(feature_log, list):
feature_log = []
except IOError:
feature_log = []
# Compute and display the change in score for each feature
try:
last_feature_log = feature_log[-1]
except (IndexError, TypeError, AttributeError):
last_feature_log = defaultdict(float)
for feature_name, importance in feature_importance_entry.items():
last_feature_log[feature_name] = importance
# Compute the deltas
feature_deltas = {}
for feature_name in feature_importances.keys():
run_delta = feature_importance_entry[feature_name] - last_feature_log[feature_name]
feature_deltas[feature_name] = run_delta
# Sort feature deltas, biggest change first
import operator
sorted_feature_deltas = sorted(
feature_deltas.items(),
key=operator.itemgetter(1),
reverse=True
)
# Display sorted feature deltas
print("nFeature Importance Delta Report")
print("-------------------------------")
print(tabulate(sorted_feature_deltas, headers=["Feature", "Delta"]))
# Append the existing average deltas to the log
feature_log.append(feature_importance_entry)
# Persist the log for next run
pickle.dump(feature_log, open(feature_log_filename, "wb"))
if __name__ == "__main__":
main(sys.argv[1])
66. Data Syndrome Russell Jurney
Principal Consultant
Email : rjurney@datasyndrome.com
Web : datasyndrome.com
Data Syndrome, LLC