Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) for curating and processing massive amounts of data, developing, training, and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and offers performance improvements over traditional Apache Spark.
Azure Databricks - An Introduction (by Kris Bock), by Daniel Toomey
Azure Databricks is a fast, easy to use, and collaborative Apache Spark-based analytics platform optimized for Azure. It allows for interactive collaboration through a unified workspace, enables sharing of insights through integration with Power BI, and provides native integration with other Azure services. It also offers enterprise-grade security through integration with Azure Active Directory and compliance features.
Learn to Use Databricks for Data Science, by Databricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, the reports and dashboards built against it, reproducibility, and the insights uncovered within it. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
Data Lakehouse, Data Mesh, and Data Fabric (r2), by James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I’ll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I’ll discuss Microsoft’s version of the data mesh.
Data Lakehouse, Data Mesh, and Data Fabric (r1), by James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. I’ll include use cases so you can see what approach will work best for your big data needs.
Using Databricks as an Analysis Platform, by Databricks
Over the past year, YipitData spearheaded a full migration of its data pipelines to Apache Spark via the Databricks platform. Databricks now empowers its 40+ data analysts to independently create data ingestion systems, manage ETL workflows, and produce meaningful financial research for our clients.
Introducing Snowflake, an elastic data warehouse delivered as a service in the cloud. It aims to simplify data warehousing by removing the need for customers to manage infrastructure, scaling, and tuning. Snowflake uses a multi-cluster architecture to provide elastic scaling of storage, compute, and concurrency. It can bring together structured and semi-structured data for analysis without requiring data transformation. Customers have seen significant improvements in performance, cost savings, and the ability to add new workloads compared to traditional on-premises data warehousing solutions.
Architect’s Open-Source Guide for a Data Mesh Architecture, by Databricks
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with implementation of Data Mesh systems and focus on the role of open-source projects for it. Projects like Apache Spark can play a key part in standardized infrastructure platform implementation of Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Organizations are grappling with manually classifying and inventorying distributed, heterogeneous data assets in order to deliver value. However, Azure Synapse Analytics, the new Azure service for enterprises, is poised to help organizations fill the gap between data warehouses and data lakes.
Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure. Designed in collaboration with the founders of Apache Spark, Azure Databricks combines the best of Databricks and Azure to help customers accelerate innovation with one-click set up, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. As an Azure service, customers automatically benefit from the native integration with other Azure services such as Power BI, SQL Data Warehouse, and Cosmos DB, as well as from enterprise-grade Azure security, including Active Directory integration, compliance, and enterprise-grade SLAs.
1) Databricks provides a machine learning platform for MLOps that includes tools for data ingestion, model training, runtime environments, and monitoring.
2) It offers a collaborative data science workspace for data engineers, data scientists, and ML engineers to work together on projects using notebooks.
3) The platform provides end-to-end governance for machine learning including experiment tracking, reproducibility, and model governance.
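The experiment-tracking and reproducibility idea behind this kind of ML governance can be sketched with a toy, pure-Python tracker. This is a conceptual illustration of what tools like MLflow provide, not the Databricks API; all names here are invented for the example.

```python
import time

# Toy experiment tracker: each run records its parameters and metrics,
# which is the minimum needed for reproducibility and model governance.

class ExperimentTracker:
    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        """Record one training run with its inputs and results."""
        run = {"run_id": len(self.runs), "params": dict(params),
               "metrics": dict(metrics), "timestamp": time.time()}
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric, maximize=True):
        """Pick the run with the best value for a given metric."""
        pick = max if maximize else min
        return pick(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"lr": 0.1, "depth": 3}, {"accuracy": 0.81})
tracker.log_run({"lr": 0.01, "depth": 5}, {"accuracy": 0.86})

best = tracker.best_run("accuracy")
print(best["params"])   # parameters of the most accurate run
```

Because every run's parameters are logged alongside its metrics, the winning configuration can always be recovered and re-trained, which is the essence of experiment governance.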
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
This document provides an overview and summary of the author's background and expertise. It states that the author has over 30 years of experience in IT working on many BI and data warehouse projects. It also lists that the author has experience as a developer, DBA, architect, and consultant. It provides certifications held and publications authored as well as noting previous recognition as an SQL Server MVP.
In this session, Sergio covered the Lakehouse concept and how companies implement it, from data ingestion to insight. He showed how you could use Azure Data Services to speed up your Analytics project from ingesting, modelling and delivering insights to end users.
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020, by Timothy McAliley
Jim Boriotti presents an overview and demo of Azure Synapse Analytics, an integrated data platform for business intelligence, artificial intelligence, and continuous intelligence. Azure Synapse Analytics includes Synapse SQL for querying with T-SQL, Synapse Spark for notebooks in Python, Scala, and .NET, and Synapse Pipelines for data workflows. The demo shows how Azure Synapse Analytics provides a unified environment for all data tasks through the Synapse Studio interface.
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
- Delta Lake is an open source project that provides ACID transactions, schema enforcement, and time travel capabilities to data stored in data lakes such as S3 and ADLS.
- It allows building a "Lakehouse" architecture where the same data can be used for both batch and streaming analytics.
- Key features include ACID transactions, scalable metadata handling, time travel to view past data states, schema enforcement, schema evolution, and change data capture for streaming inserts, updates and deletes.
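The time-travel and schema-enforcement ideas above can be illustrated with a toy, pure-Python transaction log. This is a simplified sketch of the concept only, not Delta Lake's actual implementation (real Delta tables store a JSON transaction log alongside Parquet files); the class and method names are invented for the example.

```python
# Toy versioned table: each commit appends one action to a transaction
# log. Reading "as of" a version replays the log up to that point --
# the same idea that gives Delta Lake time travel and atomic commits.

class ToyDeltaTable:
    def __init__(self, schema):
        self.schema = set(schema)   # column set, enforced on every write
        self.log = []               # ordered list of committed actions

    def append(self, rows):
        for row in rows:
            if set(row) != self.schema:          # schema enforcement
                raise ValueError(f"schema mismatch: {sorted(row)}")
        self.log.append(("add", rows))           # atomic commit = one log entry

    def read(self, version=None):
        """Replay the log up to `version` (latest if None) -- time travel."""
        end = len(self.log) if version is None else version + 1
        data = []
        for action, rows in self.log[:end]:
            if action == "add":
                data.extend(rows)
        return data

table = ToyDeltaTable(schema={"id", "value"})
table.append([{"id": 1, "value": "a"}])          # commits version 0
table.append([{"id": 2, "value": "b"}])          # commits version 1

print(len(table.read()))            # latest state sees both rows
print(len(table.read(version=0)))   # time travel back to version 0
```

Because the log is append-only, every past state remains reconstructable, and a write that violates the schema is rejected before it ever reaches the log.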
This presentation focuses on the value proposition for Azure Databricks for Data Science. First, the talk includes an overview of the merits of Azure Databricks and Spark. Second, the talk includes demos of data science on Azure Databricks. Finally, the presentation includes some ideas for data science production.
Big Data Advanced Analytics on Microsoft Azure, by Mark Tabladillo
This presentation provides a survey of the advanced analytics strengths of Microsoft Azure from an enterprise perspective (with these organizations being the bulk of big data users) based on the Team Data Science Process. The talk also covers the range of analytics and advanced analytics solutions available for developers using data science and artificial intelligence from Microsoft Azure.
In this session we will delve into the world of Azure Databricks and analyze why it is becoming a fundamental tool for data scientists and data engineers in conjunction with Azure services.
Differentiate Big Data vs Data Warehouse use cases for a cloud solution, by James Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Spark is an open-source framework for large-scale data processing. Azure Databricks provides Spark as a managed service on Microsoft Azure, allowing users to deploy production Spark jobs and workflows without having to manage infrastructure. It offers an optimized Databricks runtime, collaborative workspace, and integrations with other Azure services to enhance productivity and scale workloads without limits.
The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases.
1 Introduction to Microsoft data platform analytics for release, by Jen Stirrup
Part 1 of a conference workshop. This forms the morning session, which looks at moving from Business Intelligence to Analytics.
Topics Covered: Azure Data Explorer, Azure Data Factory, Azure Synapse Analytics, Event Hubs, HDInsight, Big Data
Ready for take-off - How to get your databases into the cloud, by Andre Essing
This document discusses options for migrating databases to the cloud using Azure services. It introduces Azure Database Migration Service (DMS) which provides a seamless way to migrate databases from on-premises SQL Server, Oracle and other platforms to Azure SQL Database. It offers pricing information for DMS, noting a limited-time free preview promotion. Additional resources and steps for planning a modern data estate journey to the cloud are outlined.
Microsoft Fabric is the next version of Azure Data Factory, Azure Data Explorer, Azure Synapse Analytics, and Power BI. It brings all of these capabilities together into a single unified analytics platform that goes from the data lake to the business user in a SaaS-like environment. The vision of Fabric is therefore to be a one-stop shop for all the analytical needs of every enterprise, and one platform for everyone from a citizen developer to a data engineer. Fabric will cover the complete spectrum of services including data movement, data lake, data engineering, data integration and data science, observational analytics, and business intelligence. With Fabric, there is no need to stitch together different services from multiple vendors. Instead, customers enjoy an end-to-end, highly integrated single offering that is easy to understand, onboard, create, and operate.
This is a hugely important new product from Microsoft and I will simplify your understanding of it via a presentation and demo.
Agenda:
What is Microsoft Fabric?
Workspaces and capacities
OneLake
Lakehouse
Data Warehouse
ADF
Power BI / DirectLake
Resources
AI & Data Analytics 2018 - Azure Databricks for data scientists, by Alberto Diaz Martin
This document summarizes a presentation given by Alberto Diaz Martin on Azure Databricks for data scientists. The presentation covered how Databricks can be used for infrastructure management, data exploration and visualization at scale, reducing time to value through model iterations and integrating various ML tools. It also discussed challenges for data scientists and how Databricks addresses them through features like notebooks, frameworks, and optimized infrastructure for deep learning. Demo sections showed EDA, ML pipelines, model export, and deep learning modeling capabilities in Databricks.
This document provides an overview of a course on implementing a modern data platform architecture using Azure services. The course objectives are to understand cloud and big data concepts, the role of Azure data services in a modern data platform, and how to implement a reference architecture using Azure data services. The course will provide an ARM template for a data platform solution that can address most data challenges.
This document summarizes how businesses can transform through business intelligence (BI) and advanced analytics using Microsoft's modern BI platform. It outlines the Power BI and Azure Analysis Services tools for visualization, data modeling, and analytics. It also discusses how Collective Intelligence and Microsoft can help customers accelerate their move to a data-driven culture and realize benefits like increased productivity and cost savings by implementing BI and advanced analytics solutions in the cloud. The presentation includes demonstrations of Power BI and Azure Analysis Services.
Introduction to Azure Data Lake and U-SQL presented at Seattle Scalability Meetup, January 2016. Demo code available at https://github.com/Azure/usql/tree/master/Examples/TweetAnalysis
Please sign up for the preview at http://www.azure.com/datalake. Install Visual Studio Community Edition and the Azure Data Lake Tools (http://aka.ms/adltoolvs) to use U-SQL locally for free.
This document discusses the future of data and the Azure data ecosystem. It highlights that by 2025 there will be 175 zettabytes of data in the world and the average person will have over 5,000 digital interactions per day. It promotes Azure services like Power BI, Azure Synapse Analytics, Azure Data Factory and Azure Machine Learning for extracting value from data through analytics, visualization and machine learning. The document provides overviews of key Azure data and analytics services and how they fit together in an end-to-end data platform for business intelligence, artificial intelligence and continuous intelligence applications.
Azure Synapse Analytics is a limitless analytics service that brings together data integration, enterprise data warehousing, and big data analytics. It provides the freedom to query data at scale using either serverless or dedicated options. Azure HDInsight allows the use of open source frameworks like Hadoop, Spark, Hive, and Kafka for processing large volumes of data. Azure Databricks offers environments for SQL, data science/engineering, and machine learning. The Azure IoT Hub enables scalable IoT solutions by allowing bidirectional communication between IoT applications and connected devices.
Comparing Microsoft Big Data Platform Technologies, by Jen Stirrup
In this segment, we look at technologies such as HDInsight, Azure Databricks, Azure Data Lake Analytics and Apache Spark. We compare the technologies to help you to decide the best technology for your situation.
The document discusses Azure Data Factory and its capabilities for cloud-first data integration and transformation. ADF allows orchestrating data movement and transforming data at scale across hybrid and multi-cloud environments using a visual, code-free interface. It provides serverless scalability without infrastructure to manage along with capabilities for lifting and running SQL Server Integration Services packages in Azure.
Data Warehousing Trends, Best Practices, and Future Outlook, by James Serra
Over the last decade, the 3Vs of data - Volume, Velocity & Variety - have grown massively. The Big Data revolution has completely changed the way companies collect, analyze, and store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments in terms of both time and resources. But that doesn’t mean building and managing a cloud data warehouse isn’t accompanied by challenges. From deciding on a service provider to the design architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure, or still on the fence? In this presentation you will gain insights into current data warehousing trends, best practices, and the future outlook. Learn how to build your data warehouse with the help of real-life use cases and discussion of commonly faced challenges. In this session you will learn:
- Choosing the best solution - Data Lake vs. Data Warehouse vs. Data Mart
- Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
- Step by step approach to building an effective data warehouse architecture
- Common reasons for the failure of data warehouse implementations and how to avoid them
The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse.
Power BI Overview, Deployment and Governance, by James Serra
This document provides an overview of external sharing in Power BI using Azure Active Directory Business-to-Business (Azure B2B) collaboration. Azure B2B allows Power BI content to be securely distributed to guest users outside the organization while maintaining control over internal data. There are three main approaches for sharing - assigning Pro licenses manually, using guest's own licenses, or sharing to guests via Power BI Premium capacity. Azure B2B handles invitations, authentication, and governance policies to control external sharing. All guest actions are audited. Conditional access policies can also be enforced for guests.
Power BI has become a product with a ton of exciting features. This presentation will give an overview of some of them, including Power BI Desktop, Power BI service, what’s new, integration with other services, Power BI premium, and administration.
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag..., by James Serra
Discover, manage, deploy, monitor – rinse and repeat. In this session we show how Azure Machine Learning can be used to create the right AI model for your challenge and then easily customize it using your development tools while relying on Azure ML to optimize them to run in hardware accelerated environments for the cloud and the edge using FPGAs and Neural Network accelerators. We then show you how to deploy the model to highly scalable web services and nimble edge applications that Azure can manage and monitor for you. Finally, we illustrate how you can leverage the model telemetry to retrain and improve your content.
Power BI for Big Data and the New Look of Big Data Solutions, by James Serra
New features in Power BI give it enterprise tools, but that does not mean it automatically creates an enterprise solution. In this talk we will cover these new features (composite models, aggregations tables, dataflow) as well as Azure Data Lake Store Gen2, and describe the use cases and products of an individual, departmental, and enterprise big data solution. We will also talk about why a data warehouse and cubes still should be part of an enterprise solution, and how a data lake should be organized.
In three years I went from a complete unknown to a popular blogger, speaker at PASS Summit, a SQL Server MVP, and then joined Microsoft. Along the way I saw my yearly income triple. Is it because I know some secret? Is it because I am a genius? No! It is just about laying out your career path, setting goals, and doing the work.
I'll cover tips I learned over my career on everything from interviewing to building your personal brand. I'll discuss perm positions, consulting, contracting, working for Microsoft or partners, hot fields, in-demand skills, social media, networking, presenting, blogging, salary negotiating, dealing with recruiters, certifications, speaking at major conferences, resume tips, and keys to a high-paying career.
Your first step to enhancing your career will be to attend this session! Let me be your career coach!
Is the traditional data warehouse dead? By James Serra
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
Azure SQL Database Managed Instance is a new flavor of Azure SQL Database that is a game changer. It offers near-complete SQL Server compatibility and network isolation to easily lift and shift databases to Azure (you can literally back up an on-premises database and restore it into an Azure SQL Database Managed Instance). Think of it as an enhancement to Azure SQL Database that is built on the same PaaS infrastructure and maintains all its features (i.e. active geo-replication, high availability, automatic backups, database advisor, threat detection, intelligent insights, vulnerability assessment, etc.) but adds support for databases up to 35TB, VNET, SQL Agent, cross-database querying, replication, etc. So you can migrate your databases from on-prem to Azure with very little migration effort, which is a big improvement over the current Singleton or Elastic Pool flavors, which can require substantial changes.
Microsoft Data Platform - What's included, by James Serra
This document provides an overview of a speaker and their upcoming presentation on Microsoft's data platform. The speaker is a 30-year IT veteran who has worked in various roles including BI architect, developer, and consultant. Their presentation will cover collecting and managing data, transforming and analyzing data, and visualizing and making decisions from data. It will also discuss Microsoft's various product offerings for data warehousing and big data solutions.
Learning to present and becoming good at it, by James Serra
Have you been thinking about presenting at a user group? Are you being asked to present at your work? Is learning to present one of the keys to advancing your career? Or do you just think it would be fun to present but you are too nervous to try it? Well, take the first step to becoming a presenter by attending this session and I will guide you through the process of learning to present and becoming good at it. It’s easier than you think! I am an introvert and was deathly afraid to speak in public. Now I love to present and it’s actually my main function in my job at Microsoft. I’ll share with you the journey that led me to speak at major conferences and the skills I learned along the way to become a good presenter and to get rid of the fear. You can do it!
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products from collecting data, transforming it, storing it, to visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your companies big data solution.
Choosing technologies for a big data solution in the cloud, by James Serra
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we’ll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
The document summarizes new features in SQL Server 2016 SP1, organized into three categories: performance enhancements, security improvements, and hybrid data capabilities. It highlights key features such as in-memory technologies for faster queries, always encrypted for data security, and PolyBase for querying relational and non-relational data. New editions like Express and Standard provide more built-in capabilities. The document also reviews SQL Server 2016 SP1 features by edition, showing advanced features are now more accessible across more editions.
DocumentDB is a powerful NoSQL solution. It provides elastic scale, high performance, global distribution, a flexible data model, and is fully managed. If you are looking for a scaled OLTP solution that is too much for SQL Server to handle (i.e. millions of transactions per second) and/or will be using JSON documents, DocumentDB is the answer.
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
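The "query data as it sits" idea can be sketched in plain Python: a toy federated join between an in-memory "warehouse" table and rows parsed on the fly from CSV text standing in for blob storage. This is an illustration of the concept only, not PolyBase's T-SQL interface, and all the data and names are invented for the example.

```python
import csv
import io

# "Warehouse" side: relational rows already inside the engine.
warehouse_orders = [
    {"customer_id": 1, "amount": 250},
    {"customer_id": 2, "amount": 400},
]

# "External" side: raw CSV sitting in blob storage, parsed in place
# at query time instead of being loaded into the warehouse first.
blob_csv = "customer_id,name\n1,Contoso\n2,Fabrikam\n"
external_customers = list(csv.DictReader(io.StringIO(blob_csv)))

# Federated join: combine both sources in one query -- the idea behind
# PolyBase external tables over Hadoop and Azure blob storage.
result = [
    {"name": c["name"], "amount": o["amount"]}
    for o in warehouse_orders
    for c in external_customers
    if int(c["customer_id"]) == o["customer_id"]
]
print(result)
```

The point of the sketch is that the CSV is only parsed when the query runs; nothing is copied into the "warehouse" unless you explicitly choose to persist it, which mirrors the ad-hoc-query vs. import distinction described above.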
Machine learning allows us to build predictive analytics solutions of tomorrow - these solutions allow us to better diagnose and treat patients, correctly recommend interesting books or movies, and even make the self-driving car a reality. Microsoft Azure Machine Learning (Azure ML) is a fully-managed Platform-as-a-Service (PaaS) for building these predictive analytics solutions. It is very easy to build solutions with it, helping to overcome the challenges most businesses have in deploying and using machine learning. In this presentation, we will take a look at how to create ML models with Azure ML Studio and deploy those models to production in minutes.
Big data architectures and the data lake, by James Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
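The shared-nothing MPP idea from the list above, where each node owns its own slice of the data and aggregates locally before a final merge, can be sketched with a toy partitioned aggregation. This is a conceptual illustration, not any vendor's engine; the function names and data are invented for the example.

```python
# Shared-nothing MPP in miniature: rows are hash-distributed across
# "nodes", each node aggregates only its own partition (no shared
# memory), and a final step merges the per-node partial results.

NODES = 3

def distribute(rows, key):
    """Hash-distribute rows so each node owns a disjoint slice."""
    partitions = [[] for _ in range(NODES)]
    for row in rows:
        partitions[hash(row[key]) % NODES].append(row)
    return partitions

def local_aggregate(partition, key, value):
    """Runs independently on each node, against local data only."""
    totals = {}
    for row in partition:
        totals[row[key]] = totals.get(row[key], 0) + row[value]
    return totals

def merge(partials):
    """Final merge of the per-node partial aggregates."""
    result = {}
    for part in partials:
        for k, v in part.items():
            result[k] = result.get(k, 0) + v
    return result

sales = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 20},
    {"region": "east", "amount": 5},
]
partitions = distribute(sales, "region")
totals = merge([local_aggregate(p, "region", "amount") for p in partitions])
print(totals)   # east totals 15, west totals 20, regardless of partitioning
```

Note that the final answer is identical no matter how the hash spreads rows across nodes; only the work is divided, which is what lets MPP systems scale out by adding nodes.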
Introduction to Microsoft’s Hadoop solution (HDInsight), by James Serra
Did you know Microsoft provides a Hadoop Platform-as-a-Service (PaaS)? It’s called Azure HDInsight and it deploys and provisions managed Apache Hadoop clusters in the cloud, providing a software framework designed to process, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution that includes many Hadoop components such as HBase, Spark, Storm, Pig, Hive, and Mahout. Join me in this presentation as I talk about what Hadoop is, why deploy to the cloud, and Microsoft’s solution.
2. About Me
Microsoft, Big Data Evangelist
In IT for 30 years; worked on many BI and DW projects
Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM architect, PDW/APS developer
Been a perm employee, contractor, consultant, and business owner
Presenter at PASS Business Analytics Conference, PASS Summit, and the Enterprise Data World conference
Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data Platform Solutions
Blog at JamesSerra.com
Former SQL Server MVP
Author of the book “Reporting with Microsoft SQL Server 2012”
3. Agenda
Big Data Architectures
Why data lakes?
Top-down vs Bottom-up
Data lake defined
Hadoop as the data lake
Modern Data Warehouse
Federated Querying
Solution in the cloud
SMP vs MPP
4. T H E M O D E R N D A T A E S T A T E
Reason over any data, anywhere: security and performance, flexibility of choice
Data warehouses | Data lakes | Operational databases (on-premises and hybrid)
Data sources: Social, LOB, Graph, IoT, Image, CRM
5. T H E M I C R O S O F T O F F E R I N G
Reason over any data, anywhere: security and performance, flexibility of choice
SQL Server and Azure Data Services, across data warehouses, data lakes and operational databases (hybrid)
AI built-in | Most secure | Lowest TCO
• Industry leader 2 years in a row
• #1 TPC-H performance
• T-SQL query over any data
• 70% faster than Aurora
• 2x the global reach of Redshift
• No-limits analytics with a 99.9% SLA
• Easiest lift and shift with no code changes
Data sources: Social, LOB, Graph, IoT, Image, CRM
7. K N O W I N G T H E V A R I O U S B I G D A T A S O L U T I O N S
A spectrum from control to ease of use, with administration reduced as you move toward ease of use:
• IaaS clusters (Azure Marketplace: HDP | CDH | MapR): any Hadoop technology, any distribution
• Managed clusters (Azure HDInsight): workload-optimized, managed clusters
• Big data as-a-service: Azure Databricks (frictionless & optimized Spark clusters) and Azure Data Lake Analytics (data engineering in a job-as-a-service model)
• Big data storage: Azure Data Lake Store, Azure Storage
8. B I G D A T A & A D V A N C E D A N A L Y T I C S A T A G L A N C E
Sources: custom apps, business apps, sensors and devices, SQL, Kafka
Ingest: Data Factory (data movement, pipelines & orchestration), Event Hub, IoT Hub
Store: Blobs, Data Lake
Prep & Train: Databricks, HDInsight, Data Lake Analytics, Machine Learning
Model & Serve: Cosmos DB, SQL Data Warehouse, SQL Database, Analysis Services
Intelligence: analytical dashboards, predictive apps, operational reports
11. What is Azure Databricks?
A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure
Best of Databricks Best of Microsoft
Designed in collaboration with the founders of Apache Spark
One-click set up; streamlined workflows
Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.
Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)
Enterprise-grade Azure security (Active Directory integration, compliance, enterprise-grade SLAs)
12. A P A C H E S P A R K
A unified, open-source, parallel data processing framework for big data analytics
Spark Core Engine, with libraries on top:
• Spark SQL (interactive queries)
• Spark Structured Streaming / Spark Streaming (stream processing)
• Spark MLlib (machine learning)
• GraphX (graph computation)
Cluster managers: YARN, Mesos, standalone scheduler
13. D A T A B R I C K S S P A R K I S F A S T
Benchmarks have shown that Databricks often has better performance than alternatives
SOURCE: Benchmarking Big Data SQL Platforms in the Cloud
14. A D V A N T A G E S O F A U N I F I E D P L A T F O R M
Spark SQL | Spark Streaming | Spark Machine Learning
15. Differentiated experience on Azure
ENHANCE PRODUCTIVITY:
• Get started quickly by launching your new Spark environment with one click.
• Share your insights in powerful ways through rich integration with Power BI.
• Improve collaboration amongst your analytics team through a unified workspace.
• Innovate faster with native integration with the rest of the Azure platform.
BUILD ON THE MOST COMPLIANT CLOUD:
• Simplify security and identity control with built-in integration with Active Directory.
• Regulate access with fine-grained user permissions to Azure Databricks’ notebooks, clusters, jobs and data.
• Build with confidence on the trusted cloud backed by unmatched support, compliance and SLAs.
SCALE WITHOUT LIMITS:
• Operate at massive scale without limits globally.
• Accelerate data processing with the fastest Spark engine.
16. Azure Databricks
Collaborative Workspace: data engineers, data scientists and business analysts working together (enhance productivity)
Deploy Production Jobs & Workflows: multi-stage pipelines, job scheduler, notifications & logs
Optimized Databricks Runtime Engine: Apache Spark, Databricks I/O, Serverless (build on a secure & trusted cloud; scale without limits)
Inputs: cloud storage, data warehouses, Hadoop storage, IoT / streaming data, REST APIs
Outputs: machine learning models, BI tools, data exports, data warehouses
17. Collaborative Workspace
18. Deploy Production Jobs & Workflows
19. Optimized Databricks Runtime Engine
20. A Z U R E D A T A B R I C K S C O R E A R T I F A C T S
Azure Databricks
21. G E N E R A L S P A R K C L U S T E R A R C H I T E C T U R E
Driver Program (SparkContext) → Cluster Manager → Worker Nodes
Data Sources (HDFS, SQL, NoSQL, …)
22. A Z U R E D A T A B R I C K S I N T E G R A T I O N W I T H A A D
Azure Databricks is integrated with AAD, so Azure Databricks users are just regular AAD users.
There is no need to define users, and their access control, separately in Databricks: AAD users can be used directly in Azure Databricks for all user-based access control (clusters, jobs, notebooks, etc.).
Databricks has delegated user authentication to AAD, enabling single sign-on (SSO) and unified authentication.
Notebooks, and their outputs, are stored in the Databricks account; however, AAD-based access control ensures that only authorized users can access them.
23. C L U S T E R S : A U T O S C A L I N G A N D A U T O T E R M I N A T I O N
Simplifies cluster management and reduces costs by eliminating waste.
When creating Azure Databricks clusters you can choose Autoscaling and Auto Termination options.
Autoscaling: just specify the min and max number of workers, and Azure Databricks automatically scales up or down based on load.
Auto Termination: after the specified minutes of inactivity, the cluster is automatically terminated.
Benefits:
• You do not have to guess, or determine by trial and error, the correct number of nodes for the cluster.
• As the workload changes, you do not have to manually tweak the number of nodes.
• You do not have to worry about wasting resources when the cluster is idle: you only pay for resources when they are actually being used.
• You do not have to wait and watch for jobs to complete just so you can shut down the clusters.
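The options above correspond to fields of the Databricks Clusters REST API. A minimal sketch of a cluster-creation request body (field names follow the Clusters API 2.0; the cluster name, Spark version and node type are illustrative placeholders, not values from this deck):

```python
import json

# Sketch of a request body for the Databricks Clusters REST API
# (POST /api/2.0/clusters/create). Name, version and node type are placeholders.
payload = {
    "cluster_name": "demo-autoscaling",
    "spark_version": "latest-stable",       # illustrative value
    "node_type_id": "Standard_DS3_v2",      # an Azure VM size
    "autoscale": {                          # Autoscaling: min/max worker counts
        "min_workers": 2,
        "max_workers": 8,
    },
    "autotermination_minutes": 30,          # Auto Termination after 30 idle minutes
}

body = json.dumps(payload, indent=2)
print(body)
```

With this definition, Databricks keeps the cluster between 2 and 8 workers depending on load and tears it down after 30 minutes of inactivity.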
24. J O B S
Jobs are the mechanism to submit Spark application code for execution on Databricks clusters.
• Spark application code is submitted as a ‘Job’ for execution on Azure Databricks clusters.
• Jobs execute either ‘Notebooks’ or ‘Jars’.
• Azure Databricks provides a comprehensive set of graphical tools to create, manage and monitor Jobs.
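Besides the graphical tools, a job is ultimately just a JSON definition submitted to the Jobs REST API. A hedged sketch of a notebook job (field names follow the Jobs API 2.0; the job name, notebook path and schedule are made up for illustration):

```python
import json

# Sketch of a job definition for the Databricks Jobs REST API
# (POST /api/2.0/jobs/create). Notebook path and cron schedule are illustrative.
job = {
    "name": "nightly-etl",
    "new_cluster": {                       # run on a fresh job cluster each time
        "spark_version": "latest-stable",  # illustrative value
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
    },
    "notebook_task": {                     # jobs execute either Notebooks or Jars
        "notebook_path": "/Shared/etl/nightly",
    },
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 02:00 every day
        "timezone_id": "UTC",
    },
}

print(json.dumps(job, indent=2))
```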
25. W O R K S P A C E S
Workspaces enable users to organize, and share, their Notebooks, Libraries and Dashboards.
• Icons indicate the type of the object contained in a folder.
• By default, the workspace and all its contents are available to users.
26. A Z U R E D A T A B R I C K S N O T E B O O K S O V E R V I E W
Notebooks are a popular way to develop and run Spark applications.
Notebooks are not only for authoring Spark applications; they can be run/executed directly on clusters (for example, with Shift+Enter).
Notebooks support fine-grained permissions, so they can be securely shared with colleagues for collaboration (see the following slide for details on permissions and abilities).
Notebooks are well suited for prototyping, rapid development, exploration, discovery and iterative development.
Notebooks typically consist of code, data, visualizations, comments and notes.
27. L I B R A R I E S O V E R V I E W
Libraries enable external code to be imported and stored in a Workspace.
28. V I S U A L I Z A T I O N
Azure Databricks supports a number of visualization plots out of the box.
All notebooks, regardless of their language, support Databricks visualizations.
When you run the notebook, the visualizations are rendered in-place inside the notebook.
The visualizations are written in HTML:
• You can save the HTML of the entire notebook by exporting to HTML.
• If you use Matplotlib, the plots are rendered as images, so you can just right-click and download the image.
You can change the plot type just by picking from the selection.
29. D A T A B R I C K S F I L E S Y S T E M ( D B F S )
DBFS is a distributed file system that is a layer over Azure Blob Storage.
It is accessible from Python, Scala, the CLI, and dbutils.
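The same DBFS object is addressable in two ways: via a `dbfs:/` URI from Spark and dbutils, and via the local `/dbfs/` FUSE mount from ordinary file APIs on the driver. A small sketch of that mapping (the prefix rule is the documented one; the helper function itself is hypothetical, not a Databricks API):

```python
def dbfs_to_local(uri: str) -> str:
    """Map a dbfs:/ URI to the /dbfs FUSE path usable with ordinary
    Python file APIs on a Databricks driver node. Illustrative helper."""
    prefix = "dbfs:/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a DBFS URI: {uri}")
    return "/dbfs/" + uri[len(prefix):]

# Spark/dbutils side: spark.read.parquet("dbfs:/mnt/data/events")
# Driver-local side:  open("/dbfs/mnt/data/events/_SUCCESS")
print(dbfs_to_local("dbfs:/mnt/data/events"))  # → /dbfs/mnt/data/events
```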
30. S P A R K S Q L O V E R V I E W
Spark SQL is a distributed SQL query engine for processing structured data
31. D A T A B A S E S A N D T A B L E S O V E R V I E W
Tables enable data to be structured and queried using Spark SQL or any of Spark’s language APIs
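The queries themselves are ordinary SQL. For portability this sketch runs one against an in-memory SQLite database (a stand-in chosen only so the snippet runs anywhere); on Databricks the identical statement would be issued with `spark.sql(...)` against a registered table. The table and data are invented for illustration:

```python
import sqlite3

# The kind of aggregate query you would run with spark.sql() against a table;
# executed here on in-memory SQLite purely so the SQL is runnable stand-alone.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [("ann", "click"), ("ann", "view"), ("bob", "click")])

query = "SELECT user, COUNT(*) AS n FROM events GROUP BY user ORDER BY user"
rows = conn.execute(query).fetchall()
print(rows)  # → [('ann', 2), ('bob', 1)]
```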
32. S P A R K M A C H I N E L E A R N I N G ( M L ) O V E R V I E W
Enables parallel, distributed ML for large datasets on Spark clusters.
Offers a set of parallelized machine learning algorithms (MMLSpark, Spark ML, Deep Learning, SparkR).
Supports model selection (hyperparameter tuning) using Cross Validation and Train-Validation Split.
Supports Java, Scala or Python apps using the DataFrame-based API (as of Spark 2.0). Benefits include:
• A uniform API across ML algorithms and across multiple languages
• Facilitates ML pipelines (enables combining multiple algorithms into a single pipeline)
• Optimizations through Tungsten and Catalyst
• Spark MLlib comes pre-installed on Azure Databricks
• 3rd-party libraries supported include H2O Sparkling Water, SciKit-learn and XGBoost
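Cross validation, as used for the model selection mentioned above, partitions the training data into k folds, training on k−1 folds and evaluating on the held-out one. A plain-Python sketch of the splitting idea only (Spark ML’s `CrossValidator` does this in a distributed fashion over a parameter grid; this helper is illustrative, not Spark code):

```python
def kfold_indices(n: int, k: int):
    """Yield (train_indices, validation_indices) pairs for k-fold cross
    validation -- the splitting scheme behind hyperparameter tuning."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))               # held-out fold
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

for train, val in kfold_indices(6, 3):
    print("validate on", val, "train on", train)
```

Each of the 6 examples appears in exactly one validation fold, so every candidate model is scored on data it was not trained on.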
33. S P A R K S T R U C T U R E D S T R E A M I N G O V E R V I E W
A unified system for end-to-end, fault-tolerant, exactly-once stateful stream processing.
Unifies streaming, interactive and batch queries: a single API for both static bounded data and streaming unbounded data.
Runs on Spark SQL, using the same Spark SQL Dataset/DataFrame API used for batch processing of static data.
Runs incrementally and continuously, updating the results as data streams in.
Supports app development in Scala, Java, Python and R.
Supports streaming aggregations, event-time windows, windowed grouped aggregation, and stream-to-batch joins.
Features streaming deduplication, multiple output modes, and APIs for managing/monitoring streaming queries.
Built-in sources: Kafka, file sources (JSON, CSV, text, Parquet).
34. A P A C H E K A F K A F O R H D I N S I G H T I N T E G R A T I O N
Azure Databricks Structured Streaming integrates with Apache Kafka for HDInsight.
Apache Kafka for Azure HDInsight is an enterprise-grade streaming ingestion service running in Azure. Azure Databricks Structured Streaming applications can use it as a data source or sink, and no additional software (gateways or connectors) is required.
Setup: Apache Kafka on HDInsight does not provide access to the Kafka brokers over the public internet, so the Kafka clusters and the Azure Databricks cluster must be located in the same Azure Virtual Network.
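In code, pointing Structured Streaming at a Kafka-on-HDInsight cluster amounts to a handful of source options. A hedged sketch of just the options (the keys are the standard Spark Kafka source options; the broker hosts and topic name are placeholders for addresses reachable inside the shared Virtual Network):

```python
# Options for Spark Structured Streaming's Kafka source. On Databricks this
# dict would be applied as:
#   spark.readStream.format("kafka").options(**kafka_options).load()
# Broker hosts and topic below are illustrative placeholders.
kafka_options = {
    "kafka.bootstrap.servers": "wn0-kafka:9092,wn1-kafka:9092",  # placeholder hosts
    "subscribe": "clickstream",        # topic(s) to read
    "startingOffsets": "earliest",     # where to begin on the first run
}

for key, value in sorted(kafka_options.items()):
    print(f"{key} = {value}")
```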
35. S P A R K G R A P H X O V E R V I E W
A set of APIs for graph and graph-parallel computation.
Unifies ETL, exploratory analysis, and iterative graph computation within a single system.
Developers can:
• view the same data as both graphs and collections,
• transform and join graphs with RDDs, and
• write custom iterative graph algorithms using the Pregel API.
Currently GraphX only supports the Scala and RDD APIs.
Built-in algorithms: PageRank, connected components, label propagation, SVD++, strongly connected components, triangle count.
AMPLab PageRank benchmark (chart)
36. D A T A B R I C K S C L I
An easy-to-use command-line interface built on top of the Databricks REST API.
Currently, the CLI fully implements the DBFS API and the Workspace API.
37. D A T A B R I C K S R E S T A P I
• Cluster API: create/edit/delete clusters
• DBFS API: interact with the Databricks File System
• Groups API: manage groups of users
• Instance Profile API: allows admins to add, list, and remove instance profiles that users can launch clusters with
• Job API: create/edit/delete jobs
• Library API: create/edit/delete libraries
• Workspace API: list/import/export/delete notebooks/folders
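All of these endpoints are called the same way: an HTTPS request against the workspace URL with a bearer token. A standard-library sketch (the workspace host and token are placeholders; the request is constructed but deliberately not sent):

```python
import json
import urllib.request

# Sketch of an authenticated Databricks REST API call (Cluster API shown).
# Host and token are placeholders; the request is built but not sent here.
host = "https://example-workspace.azuredatabricks.net"   # placeholder workspace URL
token = "dapi-XXXXXXXX"                                  # placeholder access token

req = urllib.request.Request(
    url=f"{host}/api/2.0/clusters/list",
    headers={"Authorization": f"Bearer {token}"},
    method="GET",
)
# response = urllib.request.urlopen(req)   # would perform the call
# clusters = json.load(response)           # ...and parse the JSON reply
print(req.get_method(), req.full_url)
```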
39. Modern Big Data Warehouse
Ingest: Data Factory moves data from business / custom apps (structured) and logs, files and media (unstructured) into Azure Storage
Store: Azure Storage
Prep & Train: Azure Databricks (Spark)
Model & Serve: Azure SQL Data Warehouse, loaded via PolyBase
Intelligence: analytical dashboards
40. Advanced Analytics on Big Data
Ingest: Data Factory moves data from business / custom apps (structured) and logs, files and media (unstructured) into Azure Storage
Store: Azure Storage
Prep & Train: Azure Databricks (Spark MLlib, SparkR, sparklyr)
Model & Serve: Azure SQL Data Warehouse (via PolyBase) and Azure Cosmos DB
Intelligence: analytical dashboards, web & mobile apps
41. Real-time analytics on Big Data
Ingest: Azure HDInsight (Kafka) streams in unstructured data
Prep & Train: Azure Databricks (Spark)
Store: Azure Storage
Model & Serve: Azure SQL Data Warehouse (via PolyBase)
Intelligence: analytical dashboards
43. Big Data OSS Comparison
Azure HDInsight (1st party + support)
What it is:
• Hadoop (Hortonworks’ distribution) as a managed service supporting a variety of open-source analytics engines such as Apache Spark, Hive LLAP, Storm, Kafka and HBase
• Security via Ranger (Kerberos-based)
Pricing:
• Priced to compete with AWS EMR. Standard offering.
Use when:
• Customer prefers a PaaS-like experience for addressing big data use cases by working with different OSS analytics engines. Cost sensitive.
Azure Databricks (1st party + support)
What it is:
• Databricks Spark, the most popular open-source analytics engine, as a managed service providing an easy and fast way to unlock big data use cases. Offers a best-in-class notebooks experience for productivity and collaboration, as well as integration with Azure SQL Data Warehouse, Power BI, etc.
• Security via native Azure AD integration
Pricing:
• Priced to match Databricks on AWS. Premium offering.
Use when:
• Customer prefers a SaaS-like experience for addressing big data use cases and values Databricks’ ease of use, productivity and collaboration features.
3rd Party Offerings
What it is:
• Hadoop distributions from Cloudera, MapR and Hortonworks, available on the Azure Marketplace as IaaS VMs
Pricing:
• N/A. Vendors price their products.
Use when:
• Customer wants to move their on-premises Hadoop distribution to Azure IaaS using their existing licenses.
44. L O O K I N G A C R O S S T H E O F F E R I N G S
Azure HDInsight
What it is:
• Hortonworks distribution as a first-party service on Azure
• Big data engine support: Hadoop projects, Hive on Tez, Hive LLAP, Spark, HBase, Storm, Kafka, R Server
• Best-in-class developer tooling and monitoring capabilities
• Enterprise features: VNET support (join existing VNETs), Ranger support (Kerberos-based security), log analytics via OMS, orchestration via Azure Data Factory
• Available in most Azure regions (27), including Gov Cloud and Federal Clouds
Guidance:
• Customer needs Hadoop technologies other than, or in addition to, Spark
• Customer prefers the Hortonworks Spark distribution to stay closer to the OSS codebase and/or to ‘lift and shift’ from on-premises deployments
• Customer has specific project requirements that are only available on HDInsight
Azure Databricks
What it is:
• Databricks’ Spark service as a first-party service on Azure
• Single engine for batch, streaming, ML and graph
• Best-in-class notebooks experience for optimal productivity and collaboration
• Enterprise features: native integration with Azure for security via AAD (OAuth), an optimized engine for better performance and scalability, RBAC for notebooks and APIs, auto-scaling and cluster-termination capabilities, native integration with SQL DW and other Azure services, serverless pools for easier management of resources
Guidance:
• Customer needs the best option for Spark on Azure
• Customer teams are comfortable with notebooks and Spark
• Customer needs auto-scaling and cluster-termination capabilities
• Customer needs to build integrated and performant data pipelines
• Customer is comfortable with limited regional availability (3 regions in preview, 8 by GA)
Azure ML
What it is:
• Azure first-party service for machine learning
• Leverage existing ML libraries or extend with Python and R
• Targets emerging data scientists with a drag & drop offering
• Targets professional data scientists with an experimentation service, a model management service, and support for the customer’s IDE of choice
Guidance:
• Azure Machine Learning Studio is a GUI-based ML tool for emerging data scientists to experiment and operationalize with least friction
• Azure Machine Learning Workbench is not a compute engine; it uses external engines for compute, including SQL Server and Spark
• AML currently deploys models to HDI Spark
• AML should be able to deploy to Azure Databricks in the near future
52. How to get started
• Engage Microsoft experts for a workshop to help identify high-impact scenarios
• Sign up for the preview at http://databricks.azurewebsites.net
• Learn more about Azure Databricks at www.azure.com/databricks
53. Q & A ?
James Serra, Big Data Evangelist
Email me at: JamesSerra3@gmail.com
Follow me at: @JamesSerra
Link to me at: www.linkedin.com/in/JamesSerra
Visit my blog at: JamesSerra.com (where this slide deck is posted under the “Presentations” tab)
Editor's Notes
Azure analysis services
Databricks
Cosmos DB
Azure time series
ADF v2
Fluff, but point is I bring real work experience to the session
All kinds of data being generated
Stored on-premises and in the cloud – but vast majority in hybrid
Reason over all this data without requiring to move data
They want a choice of platform and languages, privacy and security
<Transition> Microsoft’s offering
The most complete and compelling offering
SQL Server, Azure Database Services (+ Open Source), Hybrid
AI Built in, Most Secure, Lowest TCO – 1/10th cost of Oracle
<Transition> DEMO
Azure build-out of DMSA. Supports hybrid/on-premises too through ADF / Blob / Stretch.
When it comes to ease of use, Spark again happens to be a lot better than Hadoop. Spark has APIs for several languages such as Scala, Java and Python, besides having the likes of Spark SQL. It is relatively simple to write user-defined functions. It also happens to boast an interactive mode for running commands. Hadoop, on the other hand, is written in Java and has earned the reputation of being pretty difficult to program, although it does have tools that assist in the process. (To learn more about Spark, see How Apache Spark Helps Rapid Application Development.)
In-Memory Technology
One of the unique aspects of Apache Spark is its "in-memory" technology, which makes it an extremely good data processing system. With this approach, Spark loads data into the system's internal memory for processing and unloads it to disk later. This way, a user can keep part of the processed data in memory and leave the rest on disk.
Spark also has an innate ability to load necessary information to its core with the help of its machine learning algorithms. This allows it to be extremely fast.
Spark’s Core
Spark’s core manages several important functions, such as task scheduling and interactions, as well as input/output operations. Its central abstraction is the RDD, or resilient distributed dataset: a collection of data spread across several machines connected via a network. This data is transformed through a four-step method: mapping the data, sorting it, reducing it, and finally joining it.
The RDD is then exposed with support from an API, which is available in three languages: Scala, Java and Python.
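As an illustration only, the four-step pipeline above can be mimicked with plain Python collections (the word list and labels dataset are hypothetical examples; in PySpark these steps would correspond to map, sortByKey, reduceByKey, and join on an RDD):

```python
# Conceptual sketch of the map -> sort -> reduce -> join pipeline
# using plain Python, since a live Spark cluster may not be available.

words = ["spark", "core", "spark", "rdd", "core", "spark"]

# 1. Map: turn each record into a (key, value) pair
pairs = [(w, 1) for w in words]

# 2. Sort: order the pairs by key
pairs.sort(key=lambda kv: kv[0])

# 3. Reduce: combine values that share a key
word_counts = {}
for key, value in pairs:
    word_counts[key] = word_counts.get(key, 0) + value

# 4. Join: combine two keyed datasets on their shared keys
labels = {"spark": "engine", "core": "component"}
joined = {k: (word_counts[k], labels[k]) for k in word_counts if k in labels}

print(word_counts)  # {'core': 2, 'rdd': 1, 'spark': 3}
print(joined)       # {'core': (2, 'component'), 'spark': (3, 'engine')}
```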
Spark’s SQL
Apache Spark’s SQL has a relatively new data management solution called SchemaRDD. This allows the arrangement of data into many levels and can also query data via a specific language.
Graphx Service
Apache Spark comes with the ability to process graphs, or even information that is graphical in nature, enabling easy analysis with a great deal of precision.
Streaming
This is a prime part of Spark that allows it to stream large chunks of data with help from the core. It does so by breaking the large data into smaller packets and then transforming them, thereby accelerating the creation of the RDD.
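The micro-batching idea described above can be sketched in plain Python (the batch size, the square-the-values transform, and the micro_batches helper are all hypothetical; Spark Streaming batches by time intervals rather than element counts):

```python
# Conceptual sketch of micro-batching: a stream is split into small
# batches, each batch is transformed, and the results are collected.

def micro_batches(stream, batch_size):
    """Yield the stream in fixed-size chunks."""
    for i in range(0, len(stream), batch_size):
        yield stream[i:i + batch_size]

stream = list(range(10))          # stand-in for an incoming data stream
results = []
for batch in micro_batches(stream, batch_size=3):
    results.append([x * x for x in batch])  # per-batch transformation

print(results)  # [[0, 1, 4], [9, 16, 25], [36, 49, 64], [81]]
```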
MLlib – Machine Learning Library
Apache Spark has MLlib, a framework meant for structured machine learning. It is also predominantly faster in implementation than Hadoop. MLlib is also capable of solving several problems, such as statistical reading, data sampling and premise testing, to name a few.
Azure Databricks features –
Enhance your teams’ productivity
Get started quickly by launching your new Spark environment with one click.
Share your insights in powerful ways through rich integration with Power BI.
Improve collaboration amongst your analytics team through a unified workspace.
Innovate faster with native integration with the rest of the Azure platform.
Build on the most compliant and trusted cloud
Simplify security and identity control with built-in integration with Active Directory.
Regulate access with fine-grained user permissions to Azure Databricks’ notebooks, clusters, jobs and data.
Build with confidence on the trusted cloud backed by unmatched support, compliance and SLAs.
Scale without limits
Operate at massive scale without limits globally.
Accelerate data processing with the fastest Spark engine.
Databricks, founded by the team that created Apache Spark, is a unified analytics platform that accelerates innovation by unifying data science, engineering & business.
75% of the code committed to Apache Spark comes from Databricks
Unified Runtime
Create clusters in seconds, dynamically scale them up and down.
They’ve made enhancements to the Spark engine to make it 10x faster than open source Spark
Serverless- Auto-configured multi-user cluster, Reliable sharing with fault isolation
Unified Collaboration
Overall – a simple & collaborative environment that enables your entire team to use Spark & interact with your data simultaneously
DE – Improve ETL performance, zero management clusters. Execute production code from within notebooks
DS - For data scientists, easy data exploration in notebooks
Business SME – interactive dashboards empower teams to create dynamic reports
Enterprise Security
Encryption
Fine grained Role-based access control (files, clusters, code, application, dashboard)
Compliance
REST APIs
DE – DBIO, Spark, APIs, Jobs
DS – Spark and Serverless, Interactive Data Science
Data Products - Everything
Creators of Spark
Training People
Number of Customers
Ingest
Workflow
Schedule / Run / Monitor
Execute
Troubleshoot
Debug
Production Jobs
---------
Ingest, ETL, Scheduling, Monitoring
Additionally, all Azure Databricks programming language notebooks (Python, Scala, R) support interactive HTML graphics using JavaScript libraries like D3.
To use this, you can pass any HTML, CSS, or JavaScript code to the displayHTML() function to render its results.
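For instance, a small HTML table can be assembled in plain Python and handed to displayHTML() (the displayHTML built-in exists only inside a Databricks notebook, so the call is shown commented out; the resource data here is a hypothetical example):

```python
# Build an HTML fragment to render in a Databricks notebook.
rows = [("clusters", 4), ("jobs", 12)]
cells = "".join(
    f"<tr><td>{name}</td><td>{count}</td></tr>" for name, count in rows
)
html = f"<table><tr><th>Resource</th><th>Count</th></tr>{cells}</table>"

# In a Databricks notebook cell you would render it with:
# displayHTML(html)
print(html)
```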
You can display Matplotlib and ggplot objects in Python notebooks
You can use Plotly, an interactive graphing library
Azure Databricks supports htmlwidgets. With R htmlwidgets you can generate interactive plots using R’s flexible syntax and environment.
Diagram from Databricks
Managed and Unmanaged Tables
Every Spark SQL table has metadata information that stores the schema, along with the data itself.
Managed tables are Spark SQL tables where Spark manages both the data and the metadata. Since Spark SQL manages the tables, doing a DROP TABLE example_data will delete both the metadata and data automatically.
Unmanaged tables: Here Spark SQL manages the metadata and you control the data’s location. Spark SQL will just manage the relevant metadata, so when you perform DROP TABLE example_data, Spark will only remove the metadata and not the data itself. The data will still be present in the path you provided. Note you can also create an unmanaged table with your data in other data sources like Cassandra, JDBC table, etc.
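The distinction above can be modeled with a toy "metastore" and "storage" in plain Python (this is an illustration only; real Spark SQL uses a Hive-compatible metastore and files on disk or object storage, and all names here are hypothetical):

```python
# Toy model of managed vs. unmanaged tables.
metastore = {}   # table name -> metadata (location, managed flag)
storage = {}     # path -> data

def create_table(name, path, data, managed):
    metastore[name] = {"path": path, "managed": managed}
    storage[path] = data

def drop_table(name):
    meta = metastore.pop(name)          # DROP always removes metadata
    if meta["managed"]:
        storage.pop(meta["path"])       # managed: Spark deletes the data too
    # unmanaged: the data stays put at the path you provided

create_table("managed_tbl", "/warehouse/m", [1, 2, 3], managed=True)
create_table("external_tbl", "/data/e", [4, 5, 6], managed=False)

drop_table("managed_tbl")
drop_table("external_tbl")

print(sorted(storage))  # ['/data/e']  (unmanaged data survives the DROP)
```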
Addresses many of the pain points with DStreams.
Enables the development of “Continuous Applications” that need to interact with batch data, interactive analysis, ML, etc.
This article claims that Facebook found in their benchmarks that GraphX was not as performant as Giraph. See https://code.facebook.com/posts/319004238457019/a-comparison-of-state-of-the-art-graph-processing-systems/
Setting Up Authentication
There are two ways to authenticate to Databricks. The first way is to use your username and password pair. To do this, run databricks configure and follow the prompts. The second and recommended way is to use an access token generated from Databricks. To configure the CLI to use the access token, run databricks configure --token. After following the prompts, your access credentials will be stored in the file ~/.databrickscfg.
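At the time of writing, a successful databricks configure --token run leaves a profile file shaped roughly like this (the host and token values below are placeholders, not real credentials):

```ini
; ~/.databrickscfg
[DEFAULT]
host = https://<your-workspace>.azuredatabricks.net
token = <your-personal-access-token>
```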