Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-of-business operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store any and all data, in anticipation of potential future strategic value. These differences in data heterogeneity, scale and usage are leading to a new generation of data management and analytic systems, where the emphasis is on supporting a wide range of very large datasets that are stored uniformly and analyzed seamlessly using whatever techniques are most appropriate, including traditional tools like SQL and BI and newer tools, e.g., for machine learning and stream analytics. These new systems are necessarily based on scale-out architectures for both storage and computation.
Hadoop has become a key building block in the new generation of scale-out systems. On the storage side, HDFS has provided a cost-effective and scalable substrate for storing large heterogeneous datasets. However, as key customer and systems touch points are instrumented to log data, and Internet of Things applications become common, data in the enterprise is growing at a staggering pace, and the need to leverage different storage tiers (ranging from tape to main memory) is posing new challenges, leading to caching technologies, such as Spark. On the analytics side, the emergence of resource managers such as YARN has opened the door for analytics tools to bypass the Map-Reduce layer and directly exploit shared system resources while computing close to data copies. This trend is especially significant for iterative computations such as graph analytics and machine learning, for which Map-Reduce is widely recognized to be a poor fit.
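The Map-Reduce model referred to above can be sketched in a few lines of Python; the word count below is the canonical (purely illustrative) example. An iterative algorithm such as PageRank would have to chain many such rounds, materializing intermediate results between each one, which is why Map-Reduce is widely seen as a poor fit for that class of workload.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (word, 1) pairs from each input record.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data", "big compute", "data lake"]  # illustrative input
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'compute': 1, 'lake': 1}
```

A real framework distributes the map and reduce phases across machines and persists intermediate output to disk between jobs; it is that per-round materialization that iterative workloads pay for repeatedly.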
While Hadoop is widely recognized and used externally, Microsoft has long been at the forefront of Big Data analytics, with Cosmos and Scope supporting all internal customers. These internal services are a key part of our strategy going forward, and are enabling new state-of-the-art external-facing services such as Azure Data Lake. I will examine these trends, and ground the talk by discussing the Microsoft Big Data stack.
Power BI for Big Data and the New Look of Big Data Solutions - James Serra
New features in Power BI give it enterprise capabilities, but that does not mean it automatically produces an enterprise solution. In this talk we will cover these new features (composite models, aggregation tables, dataflows) as well as Azure Data Lake Store Gen2, and describe the use cases and products for individual, departmental, and enterprise big data solutions. We will also talk about why a data warehouse and cubes should still be part of an enterprise solution, and how a data lake should be organized.
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is that a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products, from collecting data, transforming it, and storing it to visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company’s big data solution.
Introduction to Microsoft’s Hadoop solution (HDInsight) - James Serra
Did you know Microsoft provides a Hadoop Platform-as-a-Service (PaaS)? It’s called Azure HDInsight and it deploys and provisions managed Apache Hadoop clusters in the cloud, providing a software framework designed to process, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution that includes many Hadoop components such as HBase, Spark, Storm, Pig, Hive, and Mahout. Join me in this presentation as I talk about what Hadoop is, why deploy to the cloud, and Microsoft’s solution.
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag... - James Serra
Discover, manage, deploy, monitor – rinse and repeat. In this session we show how Azure Machine Learning can be used to create the right AI model for your challenge and then easily customize it using your development tools, while relying on Azure ML to optimize the models to run in hardware-accelerated environments for the cloud and the edge using FPGAs and neural network accelerators. We then show you how to deploy the model to highly scalable web services and nimble edge applications that Azure can manage and monitor for you. Finally, we illustrate how you can leverage the model telemetry to retrain and improve your models.
Big data is driving transformative changes in traditional data warehousing. Traditional ETL processes and highly structured data schemas are being replaced with schema flexibility to handle all types of data from diverse sources. This allows for real-time experimentation and analysis beyond just operational reporting. Microsoft is applying lessons from its own big data journey to help customers by providing a comprehensive set of Apache big data tools in Azure along with intelligence and analytics services to gain insights from diverse data sources.
In this session we will delve into the world of Azure Databricks and examine why it is becoming a fundamental tool for data scientists and data engineers when used in conjunction with Azure services.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data, developing, training, and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
What is an Open Data Lake? - Data Sheets | Whitepaper - Vasu S
An Open Data Lake is a data lake in which data is stored in an open format and accessed through open, standards-based interfaces.
https://www.qubole.com/resources/data-sheets/what-is-an-open-data-lake
Data Analytics Meetup: Introduction to Azure Data Lake Storage - CCG
Microsoft Azure Data Lake Storage is designed to enable operational and exploratory analytics through a hyper-scale repository. Journey through Azure Data Lake Storage Gen1 with Microsoft Data Platform Specialist Audrey Hammonds. In this video she explains the fundamentals of Gen1 and Gen2, walks us through how to provision a Data Lake, and gives tips to avoid turning your Data Lake into a swamp.
Learn more about Data Lakes with our blog - Data Lakes: Data Agility is Here Now https://bit.ly/2NUX1H6
Is the traditional data warehouse dead? - James Serra
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
The new Microsoft Azure SQL Data Warehouse (SQL DW) is an elastic data-warehouse-as-a-service and a Massively Parallel Processing (MPP) solution for "big data" with true enterprise-class features. The SQL DW service is built for data warehouse workloads from a few hundred gigabytes to petabytes of data, with truly unique features like disaggregated compute and storage that allow customers to scale each independently to match their needs. In this presentation, we take an in-depth look at implementing a SQL DW, elastic scale (grow, shrink, and pause), and hybrid data clouds with Hadoop integration via PolyBase, allowing for a true SQL experience across structured and unstructured data.
This document discusses big data and analytics solutions from Microsoft. It introduces Azure Data Lake Store as a hyper-scale repository for big data analytics workloads that allows storing any data in its native format. It also describes Azure Data Lake Analytics as a service for big data analytics that offers distributed, parallel processing with U-SQL and integration with Visual Studio. The document provides examples of using Azure Data Lake Analytics to extract, transform, and analyze big data from various sources like call log files and customer tables.
Cortana Analytics Workshop: Operationalizing Your End-to-End Analytics Solution - MSAdvAnalytics
Wee Hyong Tok. With Azure Data Factory (ADF), existing data movement and analytics processing services can be composed into data pipelines that are highly available and managed in the cloud. In this demo-driven session, you learn by example how to build, operationalize, and manage scalable analytics pipelines. Go to https://channel9.msdn.com/ to find the recording of this session.
The document discusses Ido Friedman and his background working with various data technologies. It then discusses the concept of a data lake and how it serves as a single store for raw and transformed data used for reporting, analytics, and machine learning. The rest of the document discusses how traditional tools like SQL have changed with the rise of Hadoop and cloud storage. It provides examples of performance and cost differences between running data workloads on Hadoop clusters versus cloud-based data processing services like BigQuery and Dataproc. The document concludes that a large data lake is now possible in the cloud and discusses various deployment options to consider.
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
Microsoft Data Platform - What's included - James Serra
This document provides an overview of a speaker and their upcoming presentation on Microsoft's data platform. The speaker is a 30-year IT veteran who has worked in various roles including BI architect, developer, and consultant. Their presentation will cover collecting and managing data, transforming and analyzing data, and visualizing and making decisions from data. It will also discuss Microsoft's various product offerings for data warehousing and big data solutions.
Building Modern Data Platform with Microsoft Azure - Dmitry Anoshin
This document provides an overview of building a modern cloud analytics solution using Microsoft Azure. It discusses the role of analytics, a history of cloud computing, and a data warehouse modernization project. Key challenges covered include lack of notifications, logging, self-service BI, and integrating streaming data. The document proposes solutions to these challenges using Azure services like Data Factory, Kafka, Databricks, and SQL Data Warehouse. It also discusses alternative implementations using tools like Matillion ETL and Snowflake.
The document discusses Big Data on Azure and provides an overview of HDInsight, Microsoft's Apache Hadoop-based data platform on Azure. It describes HDInsight cluster types for Hadoop, HBase, Storm and Spark and how clusters can be automatically provisioned on Azure. Example applications and demos of Storm, HBase, Hive and Spark are also presented. The document highlights key aspects of using HDInsight including storage integration and tools for interactive analysis.
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat... - Hortonworks
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
The Hive Think Tank - Design Thinking by Bernie Roth, Professor at Stanford U... - The Hive
Bernie Roth is a co-founder of Stanford's d.school and author of The Achievement Habit: Stop Wishing, Start Doing, and Take Command of Your Life.
Bernie brings to the d.school a wealth of experience in teaching design, an intimate knowledge of the functioning of Stanford University, and a worldwide reputation as a researcher in kinematics and robotics. Together with Doug Wilde and the late Rolf Faste, Bernie developed the concept of a Creativity Workshop, which has been offered to students, faculty, and professionals around the world. These same techniques have been made available to d.school students and are described in his book The Achievement Habit. He has found that these types of learning experiences enhance students’ ability to make a meaningful positive difference in their own lives. He is especially pleased that his activities at the d.school have contributed to creating an environment where students and coworkers get the tools and values for realizing the enduring satisfactions that come from assisting others in the human community.
The Hive Think Tank: Machine Learning at Pinterest by Jure Leskovec - The Hive
Machine learning is at the core of Pinterest. Pinterest personalizes and ranks more than 1 billion pins and 700 million boards for over 100 million users worldwide, using data gathered from collaborative filtering, user curation, web crawling, and more. At Pinterest we model relationships between pins, handle cold-start problems, and deal with real-time recommendations.
In this presentation Jure gave an overview of the problems and effective solutions developed at Pinterest. He focused on systems and the engineering choices made to enable productive machine learning development and to let multiple engineers effectively develop, test, and deploy machine-learned models.
This document provides an overview and hands-on demonstration of Twitter's Heron stream processing framework. The agenda includes a Heron overview, hands-on experience launching topologies and using Heron tools, and exploring the UI. Instructions are given on installing Heron client and tools binaries. Example topologies are launched using the 'heron submit' command. The Heron tracker and UI are launched to view logical/physical plans, metrics, logs, and exceptions. Additional resources mentioned include the Heron starters repository and user forum.
The Hive Think Tank: Translating IoT into Innovation at Every Level by Prith ... - The Hive
In this presentation Prith Banerjee discusses how a sustainable future must become radically more efficient with the way we use energy. He shared how the Internet of Things (IoT) and the convergence of Operational Technology (OT) and Information Technology (IT) are enabling Schneider Electric's innovation at every level, redefining power and automation for a new world of energy which is more electric, decarbonized, decentralized and digitized. Prith shared how, in this new world of energy, Schneider ensures that Life Is On everywhere, for everyone and at every moment. He also shared a set of IoT predictions for the future, based on findings of the company’s recent IoT Survey of 2,500 top business executives.
The Hive Think Tank: Machine Learning Applications in Genomics by Prof. Jian ... - The Hive
In this The Hive Think Tank talk, Professor Jian Ma introduces machine learning methods that can help tackle some of the most intriguing questions in genomics and biomedicine. He discusses his group's research projects on genome structure and function, including algorithms to unravel complex genomic aberrations in cancer genomes and the gene regulatory principles encoded in our genome, utilizing probabilistic graphical models and deep neural network techniques. The knowledge obtained from such computational methods can greatly enhance our ability to understand disease genomes.
The Hive Think Tank: The Future Of Customer Support - AI Driven Automation - The Hive
The Hive Think Tank Panel Discussion moderated by Kate Leggett (Forrester) with panelists: Allan Leinwand (ServiceNow), Nitin Narkhede (Wipro), Jason Smale (Zendesk), Dan Turchin (Neva). The future of customer support is AI-driven virtual agents. Soon, we’ll interact conversationally with bots that know who we are, how we’re impacted, and what we need. Soon, the capabilities of virtual agents will far exceed those of today’s best human agents. We’ll receive support that is more reliable than friends, more accurate than social media, and less frustrating than waiting on hold.
The Hive Think Tank: AI in The Enterprise by Venkat Srinivasan - The Hive
This The Hive Think Tank talk by Venkat Srinivasan, CEO of RAGE Frameworks, focuses on successful applications of AI in the Enterprise. We start with a broad and more inclusive definition of AI in the context of enterprise business processes.
We introduce a taxonomy of AI solution methods that broadens the scope beyond deep learning based on neural nets. In line with the taxonomy, we present several successful AI applications in use today at major corporations across industries, including financial services, manufacturing/retail, professional services, and logistics. These applications range from commercial lending, contract review, customer service intelligence, and market and competitive intelligence to signals for capital markets, regulatory compliance, and more.
This document discusses the new features in SQL Server 2016 related to business intelligence (BI). Key highlights include:
- Power BI integration, allowing paginated report items to be pinned to Power BI dashboards.
- Enhancements to SQL Server Reporting Services, including modern paginated reports with updated tools, mobile reports optimized for mobile devices, and a new web portal to consume both report types.
- The ability to export paginated reports to PowerPoint, pin report items to Power BI dashboards, and create interactive mobile reports accessed through a single mobile app.
The Hive Think Tank: Unpacking AI for Healthcare - The Hive
In this The Hive Think Tank talk, Ash Damle, CEO of Lumiata takes a deep dive into Lumiata’s core technological engine - the Lumiata Medical Graph, which applies graph-based machine learning to compute the complex relationships between health data in the same way that a physician would, and how this medical AI engine powers personalization and automation within risk and care management.
SQL Server Integration Services (SSIS) 2016 includes new features for manageability, connectivity, and usability. Key additions include support for Always On availability groups, custom logging levels, package templates, and expanded data sources like Azure Storage, HDFS, and HDInsight. It also features faster package development and management through improvements to SSDT, the SSIS Catalog, and multi-version support.
Hive + Amazon EMR + S3 = Elastic big data SQL analytics processing in the cloud - Jaipaul Agonus
This presentation is a real-world case study about moving a large portfolio of batch analytical programs that process 30 billion or more transactions every day, from a proprietary MPP database appliance architecture to the Hadoop ecosystem in the cloud, leveraging Hive, Amazon EMR, and S3.
U-SQL is a language for big data processing that unifies SQL and C#/custom code. It allows for processing of both structured and unstructured data at scale. Some key benefits of U-SQL include its ability to natively support both declarative queries and imperative extensions, scale to large data volumes efficiently, and query data in place across different data sources. U-SQL scripts can be used for tasks like complex analytics, machine learning, and ETL workflows on big data.
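No U-SQL appears in this summary, but the central idea it describes, a declarative query that calls into imperative custom code, can be illustrated in plain Python. The rowset, field names, and the `is_browser` helper below are all invented for illustration; in real U-SQL the predicate would be a C# function used inside a SELECT.

```python
# Hypothetical rowset: parsed web-log records (illustrative data only).
rows = [
    {"user": "alice", "ua": "Mozilla/5.0 (Windows NT 10.0)"},
    {"user": "bob",   "ua": "curl/7.64.1"},
]

def is_browser(ua):
    # Imperative "custom code": logic a pure declarative
    # query language could not express as easily.
    return ua.startswith("Mozilla")

# Declarative-style projection + filter over the rowset,
# mirroring a SELECT ... WHERE with a user-defined function
# in the predicate.
browsers = [r["user"] for r in rows if is_browser(r["ua"])]
print(browsers)  # ['alice']
```

The point of the unified design is that the custom function runs inside the scaled-out query plan, rather than forcing data to be exported to a separate program.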
The document discusses big data and Hadoop. It provides an overview of key components in Hadoop including HDFS for storage, MapReduce for distributed processing, Hive for SQL-like queries, Pig for data flows, HBase for column-oriented storage, and Storm for real-time processing. It also discusses building a layered data system with batch, speed, and serving layers to process streaming data at scale.
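The batch/speed/serving layering mentioned above (often called the lambda architecture) can be sketched minimally: a batch view aggregates historical events, a speed view aggregates the recent tail the batch job has not yet seen, and the serving layer merges the two at query time. The event names and counts below are illustrative assumptions.

```python
# Illustrative event streams: (metric, count) pairs.
historical = [("clicks", 100), ("views", 400)]   # processed by batch layer
recent     = [("clicks", 5),   ("views", 12)]    # still only in speed layer

def build_view(events):
    # Aggregate events into a metric -> total mapping.
    out = {}
    for key, n in events:
        out[key] = out.get(key, 0) + n
    return out

def serving_layer(batch, speed):
    # Merge: batch totals plus the delta the batch job has not seen yet.
    return {k: batch.get(k, 0) + speed.get(k, 0)
            for k in set(batch) | set(speed)}

totals = serving_layer(build_view(historical), build_view(recent))
print(totals == {"clicks": 105, "views": 412})  # True
```

In production the batch view would be recomputed periodically over all history, and the speed view reset as the batch layer catches up; the merge logic stays the same.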
Use cases and examples using Apache Spark, presented at the Hadoop User Group (UK) November 2014 Hadoop Meetup
http://www.meetup.com/hadoop-users-group-uk/events/217791892/
This document provides an introduction to Apache Hive, including:
- What Apache Hive is and its key features like SQL support and rich data types
- An overview of Hive's architecture and how it works within the Hadoop ecosystem
- Where Hive is useful, such as for log processing, and not useful, like for online transactions
- Examples of companies that use Hive
- An introduction to the Hive Query Language (HQL) with examples of creating tables, loading data, queries, and more.
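HQL is deliberately close to standard SQL, so the create/load/query pattern listed above can be approximated with Python's built-in sqlite3 module. This is an analogy, not Hive itself: the table and rows are invented for illustration, and real Hive would run equivalent statements over files in HDFS (with LOAD DATA INPATH instead of INSERTs).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# CREATE TABLE: in Hive this declares a table over HDFS files.
conn.execute("CREATE TABLE logs (level TEXT, msg TEXT)")
# Loading data: Hive's LOAD DATA INPATH, approximated with INSERTs.
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("ERROR", "disk full"),
                  ("INFO", "started"),
                  ("ERROR", "timeout")])
# Query: identical in spirit to HQL's SELECT ... GROUP BY,
# which Hive compiles into distributed jobs.
result = conn.execute(
    "SELECT level, COUNT(*) FROM logs GROUP BY level ORDER BY level"
).fetchall()
print(result)  # [('ERROR', 2), ('INFO', 1)]
```

The trade-off the slide list alludes to falls out of this analogy: the query surface is familiar SQL, but because each query becomes a batch job, Hive suits log processing far better than online transactions.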
SQL Server 2016 New Features and Enhancements - John Martin
SQL Server 2016 new features session that I delivered at SQL Relay 2015 in Reading, London, Cardiff, and Birmingham, looking at some of the new features slated for inclusion in Microsoft SQL Server 2016.
Demo Code can be found at: http://1drv.ms/1PC5smY
The document discusses controlled experimentation (A/B testing) as a method to study the effects of treatments on users. It notes that experiments randomly divide users into a control and treatment group, with the only difference being the treatment evaluated. Performance metrics are collected and statistically analyzed to determine if any differences are due to the treatment or random chance. Examples of experiments include variations to website design, mobile calls to action, and personalization algorithms. Key aspects of experimentation platforms include hashing to randomly assign users, detailed logging, metrics dashboards, and ensuring control and treatment groups are identical. The document emphasizes measuring overall impact beyond just segments under treatment.
The document discusses optimizing mobile apps and the challenges of mobile testing. It introduces LeanPlum, a mobile A/B testing service that allows users to implement their SDK, run tests from a dashboard, and view results. Common challenges of mobile optimization include limited screen space, platform fragmentation, connectivity issues, long app store approval times, different metrics than web, and high user acquisition costs. LeanPlum aims to help users overcome these challenges through easy integration, flexible APIs, and A/B testing capabilities.
Differentiate Big Data vs Data Warehouse use cases for a cloud solution - James Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan... - Jürgen Ambrosi
In this session we will see, with the usual practical hands-on demo approach, how to use the R language to perform value-added analyses. We will get hands-on experience with the parallelization performance of the algorithms, a fundamental aspect in helping researchers reach their goals. We will be joined in this session by Lorenzo Casucci, Data Platform Solution Architect at Microsoft.
Build Big Data Enterprise solutions faster on Azure HDInsight - DataWorks Summit
Hadoop and Spark are big data frameworks used to extract useful insights from data, spanning a variety of scenarios from ingestion, data prep, data management, and processing to analyzing and visualizing data. Each step requires specialized toolsets to be productive. In this talk I will share solution examples from the Big Data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku, running on Microsoft’s Azure HDInsight, a cloud Spark and Hadoop service for the enterprise. These tools simplify your Big Data solutions while taking advantage of all the benefits of HDInsight, giving you the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
So you have a handle on what Big Data is and how you can use it to find business value in your data. Now you need an understanding of the Microsoft products that can be used to create a Big Data solution. Microsoft has many pieces of the puzzle, and in this presentation I will show how they fit together. How does Microsoft enhance and add value to Big Data? From collecting data, transforming it, and storing it to visualizing it, I will show you Microsoft’s solutions for every step of the way.
Prague data management meetup 2018-03-27 - Martin Bém
This document discusses different data types and data models. It begins by describing unstructured, semi-structured, and structured data. It then discusses relational and non-relational data models. The document notes that big data can include any of these data types and models. It provides an overview of Microsoft's data management and analytics platform and tools for working with structured, semi-structured, and unstructured data at varying scales. These include offerings like SQL Server, Azure SQL Database, Azure Data Lake Store, Azure Data Lake Analytics, HDInsight and Azure Data Warehouse.
Understanding AWS Managed Database and Analytics Services | AWS Public Sector... - Amazon Web Services
The world is creating more data in more ways than ever before. The average internet user in 2017 generates 1.5GB of data per day, with the rate doubling every 18 months. A single autonomous vehicle can generate 4TB per day. Each smart manufacturing plant generates 1PB per day. Storing, managing, and analyzing this data requires integrated database and analytic services that provide reliability and security at scale. AWS offers a range of managed data services that let customers focus on making data useful, including Amazon Aurora, RDS, DynamoDB, Redshift, Spectrum, ElastiCache, Kinesis, EMR, Elasticsearch Service, and Glue. In this session, we discuss these services, share our vision for innovation, and show how our customers use these services today. Learn More: https://aws.amazon.com/government-education/
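The growth figures quoted above compound quickly. Taking the stated 1.5 GB per user per day and a doubling every 18 months at face value, the projection is a one-line exponential; the 36-month horizon below is an illustrative assumption.

```python
def projected_gb_per_day(base_gb=1.5, months=36, doubling_months=18):
    # Exponential growth: one doubling every `doubling_months`.
    return base_gb * 2 ** (months / doubling_months)

# After 3 years (two doublings), 1.5 GB/day becomes 6 GB/day.
print(projected_gb_per_day())            # 6.0
print(projected_gb_per_day(months=18))   # 3.0
```

The same arithmetic explains why storage tiering matters: per-user volume quadruples roughly every three years, while most of that data is accessed rarely after ingestion.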
This document discusses the future of data and the Azure data ecosystem. It highlights that by 2025 there will be 175 zettabytes of data in the world and the average person will have over 5,000 digital interactions per day. It promotes Azure services like Power BI, Azure Synapse Analytics, Azure Data Factory and Azure Machine Learning for extracting value from data through analytics, visualization and machine learning. The document provides overviews of key Azure data and analytics services and how they fit together in an end-to-end data platform for business intelligence, artificial intelligence and continuous intelligence applications.
Cloudera, Azure and Big Data at Cloudera Meetup '17 - Nathan Bijnens
The document discusses Microsoft's Azure cloud platform and how it provides a suite of AI, machine learning, and data analytics services to help organizations collect and analyze data to gain insights and make decisions. It highlights several Azure services like Data Lake, Event Hubs, Stream Analytics, and Cognitive Services that allow customers to store and process vast amounts of data and build intelligent applications. Examples are also given of companies using Azure services to modernize their data infrastructure and build predictive models.
1 Introduction to Microsoft data platform analytics for release - Jen Stirrup
Part 1 of a conference workshop. This forms the morning session, which looks at moving from Business Intelligence to Analytics.
Topics Covered: Azure Data Explorer, Azure Data Factory, Azure Synapse Analytics, Event Hubs, HDInsight, Big Data
Azure Data Explorer deep dive - review 04.2020 - Riccardo Zamana
Modern Data Science Lifecycle with ADX & Azure
This document discusses using Azure Data Explorer (ADX) for data science workflows. ADX is a fully managed analytics service for real-time analysis of streaming data. It allows for ad-hoc querying of data using Kusto Query Language (KQL) and integrates with various Azure data ingestion sources. The document provides an overview of the ADX architecture and compares it to other time series databases. It also covers best practices for ingesting data, visualizing results, and automating workflows using tools like Azure Data Factory.
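The time-series style of KQL query that ADX is built around, for example `summarize avg(value) by bin(ts, 1m)`, can be mimicked in plain Python to show the idea. The timestamps and readings below are invented for illustration; real ADX runs this over ingested streams at much larger scale.

```python
from collections import defaultdict

def summarize_by_bin(points, bin_seconds=60):
    # Group (timestamp, value) points into fixed-width time bins and
    # average each bin, roughly what KQL's
    # `summarize avg(value) by bin(ts, 1m)` computes.
    bins = defaultdict(list)
    for ts, value in points:
        bins[ts - ts % bin_seconds].append(value)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}

points = [(0, 10.0), (30, 20.0), (65, 30.0)]   # illustrative readings
print(summarize_by_bin(points))  # {0: 15.0, 60: 30.0}
```

Bucketing by `ts - ts % bin_seconds` snaps every timestamp to the start of its bin, which is the same floor operation KQL's `bin()` performs.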
Azure Days 2019: Business Intelligence auf Azure (Marco Amhof & Yves Mauron) - Trivadis
In dieser Session stellen wir ein Projekt vor, in welchem wir ein umfassendes BI-System mit Hilfe von Azure Blob Storage, Azure SQL, Azure Logic Apps und Azure Analysis Services für und in der Azure Cloud aufgebaut haben. Wir berichten über die Herausforderungen, wie wir diese gelöst haben und welche Learnings und Best Practices wir mitgenommen haben.
The document discusses Microsoft's data platform and cloud services. It highlights:
1) Microsoft's data platform provides intelligence over all data with SQL and Apache Spark, enabling AI and machine learning over any data.
2) Microsoft offers data modernization solutions for migrating to the cloud or managing data on-premises and in hybrid environments.
3) Migrating databases to Azure provides cost savings, security, high performance, and intelligent capabilities through services like Azure SQL Database and Azure Cosmos DB.
Big Data Expo 2015 - Microsoft Transform you data into intelligent actionBigDataExpo
Er zijn veel beloftes rondom Big Data. Iedereen praat erover maar hoe begin je zonder meteen een grote business case op te moeten stellen. Cortana Analytics Suite is laagdrempelig en een makkelijk toegankelijk Advanced Analytics platform om je ideeën op haalbaarheid te testen maar daarna ook door te groeien naar (grote) productie implementaties. In deze sessie krijg je een overzicht van de scenario’s die Cortana Analytics biedt. Denk daar bij aan IOT, Machine Learning maar ook Churn Analysis, Forecasting en Predictive Maintenance.
This document provides an overview of 6 modules related to SQL Server workshops:
- Module 1 covers database design and architecture sessions
- Module 2 focuses on intelligent query processing, data classification/auditing, database recovery, data virtualization, and replication capabilities
- Module 3 discusses the big data landscape, including data growth drivers, common use cases, and scale-out processing approaches like Hadoop and Spark
Similar to The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft (20)
Translating a Trillion Points of Data into Therapies, Diagnostics, and New In...The Hive
This document outlines Atul Butte's extensive conflicts of interest and corporate relationships in the biomedical data and technology industry. It then provides brief summaries of several companies started by Butte's students using public data to develop diagnostics, predict disease, and design new drugs. The document concludes by listing Butte's collaborators and supporters in establishing a large biomedical data institute at UCSF.
Quantum Computing (IBM Q) - Hive Think Tank Event w/ Dr. Bob Sutor - 02.22.18The Hive
The document introduces quantum computing and IBM's efforts in the field, including the IBM Q Experience launched in 2016 which allows users to run algorithms and experiments on quantum computers via the cloud. It discusses IBM's goals of building universal fault-tolerant quantum computers and the IBM Q Network, a global community to advance quantum computing.
The Hive Think Tank: Rendezvous Architecture Makes Machine Learning Logistics...The Hive
Think Tank Event 10/23/2017, hosted by The Hive and presented by Ted Dunning, Chief Application Architect of MapR Technologies and Ellen Friedman of MapR Technologies.
“ High Precision Analytics for Healthcare: Promises and Challenges” by Sriram...The Hive
1) Predictive analytics in healthcare often provides risk scores and predictions but lacks actionable insights on how to prevent outcomes.
2) The right methodology is needed to transform raw data like claims, prescriptions and medical records into meaningful predictions using machine learning algorithms.
3) Accurate predictions require measuring precision down to the individual level while accounting for both patient and provider factors that influence health outcomes.
The Hive Think Tank: Talk by Mohandas Pai - India at 2030, How Tech Entrepren...The Hive
This document discusses how India can become a $10 trillion economy by 2030 through technology entrepreneurship and the growth of its startup ecosystem. It notes that India currently has the 3rd largest startup ecosystem in the world with 19,400 startups. If the ecosystem continues growing at 270% over 6 years, it could create $500 billion in market value and employ over 3.5 million people by 2030. This growth will be accelerated by initiatives like Digital India that are building digital infrastructure and opening government data through APIs, fueling innovation and problem solving across sectors to help propel India to its economic goals.
The Hive Think Tank: The Content Trap - Strategist's Guide to Digital ChangeThe Hive
In this The Hive Think Tank talk Harvard Business School Professor of Strategy Prof. Bharat Anand shares his insights on the Digital innovation trends that are shaping the way organizations will act in the future.
In this talk, Professor Anand presents the findings from his forthcoming book. To answer these questions, Anand examines a range of businesses around the world, from Chinese internet giant Tencent to Scandinavian digital trailblazer Schibsted, from The New York Times to The Economist, and from talent management to the future of education.
The Hive Think Tank: Sidechains by Adam Back, President of BlockstreamThe Hive
Adam Back discusses sidechains, which allow assets like bitcoin to move between blockchains while maintaining the same properties. Sidechains extend the functionality of blockchains to support new applications through interoperability. This helps address challenges like scalability and fragmentation in the bitcoin network. Examples are given like using sidechains for software upgrades, experimental features, and exchange settlements. Sidechains are secured through bitcoin mining incentives and can provide confidential transactions through techniques like zero-knowledge proofs.
The Hive Think Tank: Ceph + RocksDB by Sage Weil, Red Hat.The Hive
Rocking the Database World with RocksDB
Sage Weil, Ceph Principal Architect, Red Hat
Sage helped design Ceph as part of his graduate research at the University of California, Santa Cruz. Since then, he has continued to refine the system with the goal of providing a stable next generation distributed storage system for Linux.
Specialties: Distributed system design, storage and file systems, management, software development.
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive
RocksDB is a new storage engine for MySQL that provides better storage efficiency than InnoDB. It achieves lower space amplification and write amplification than InnoDB through its use of compression and log-structured merge trees. While MyRocks (RocksDB integrated with MySQL) currently has some limitations like a lack of support for online DDL and spatial indexes, work is ongoing to address these limitations and integrate additional RocksDB features to fully support MySQL workloads. Testing at Facebook showed MyRocks uses less disk space and performs comparably to InnoDB for their queries.
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive
Dhruba Borthakur, Facebook
Dhruba Borthakur is an engineer at Facebook. He has been one of the founding engineer of RocksDB, an open-source key-value store optimized for storing data in flash and main-memory storage. He has been one of the founding architects of the Apache Hadoop Distributed File System and has been instrumental in scaling Facebook's Hadoop cluster to multiples of petabytes. Dhruba has contributed code to the Apache HBase project. Earlier, he contributed to the development of the Andrew File System (AFS). He has an M.S. in Computer Science from the University of Wisconsin, Madison and a B.S. in Computer Science BITS, Pilani, India.
The Hive Think Tank: Rocking the Database World with RocksDBThe Hive
Igor Canadi, Facebook
Igor is a software engineer at Facebook where his job is making databases more awesome. He recently graduated from University of Wisconsin-Madison with Masters degree in Computer Science. During his time at UW-M, he worked with prof. Paul Barford in the area of internet measurement and analysis. Igor got his undergraduate degree from University of Zagreb in Croatia. During his undergraduate years, he founded and developed a local non-profit organization that focuses on educating talented high-school students.
The Hive Think Tank: Stream Processing Systems by Nikita Shamgunov of MemSQLThe Hive
Nikita Shamgunov's presentation was part of a panel discussion on Stream Processing Systems on January 20th, 2016 led by Ben Lorica (O'Reilly Media) with panelists: Jay Kreps (Confluent), Karthik Ramasamy (Twitter), M.C. Srivas (MapR), Ram Sriharsha (Hortonworks).
The Hive Think Tank: "Stream Processing Systems" by Karthik Ramasamy of TwitterThe Hive
Karthik Ramasamy's presentation was part of a panel discussion on Stream Processing Systems on January 20th, 2016 led by Ben Lorica (O'Reilly Media) with panelists: Jay Kreps (Confluent), M.C. Srivas (MapR), Nikita Shamgunov (MemSQL), Ram Sriharsha (Hortonworks)
The Hive Think Tank: "Stream Processing Systems" by M.C. Srivas of MapRThe Hive
M.C. Shivas's presentation was part of a panel discussion on Stream Processing Systems on January 20th, 2016 led by Ben Lorica (O'Reilly Media) with panelists: Jay Kreps (Confluent), Karthik Ramasamy (Twitter), Nikita Shamgunov (MemSQL), Ram Sriharsha (Hortonworks)
MYIR Product Brochure - A Global Provider of Embedded SOMs & SolutionsLinda Zhang
This brochure gives introduction of MYIR Electronics company and MYIR's products and services.
MYIR Electronics Limited (MYIR for short), established in 2011, is a global provider of embedded System-On-Modules (SOMs) and
comprehensive solutions based on various architectures such as ARM, FPGA, RISC-V, and AI. We cater to customers' needs for large-scale production, offering customized design, industry-specific application solutions, and one-stop OEM services.
MYIR, recognized as a national high-tech enterprise, is also listed among the "Specialized
and Special new" Enterprises in Shenzhen, China. Our core belief is that "Our success stems from our customers' success" and embraces the philosophy
of "Make Your Idea Real, then My Idea Realizing!"
What Not to Document and Why_ (North Bay Python 2024)Margaret Fero
We’re hopefully all on board with writing documentation for our projects. However, especially with the rise of supply-chain attacks, there are some aspects of our projects that we really shouldn’t document, and should instead remediate as vulnerabilities. If we do document these aspects of a project, it may help someone compromise the project itself or our users. In this talk, you will learn why some aspects of documentation may help attackers more than users, how to recognize those aspects in your own projects, and what to do when you encounter such an issue.
These are slides as presented at North Bay Python 2024, with one minor modification to add the URL of a tweet screenshotted in the presentation.
The Rise of Supernetwork Data Intensive ComputingLarry Smarr
Invited Remote Lecture to SC21
The International Conference for High Performance Computing, Networking, Storage, and Analysis
St. Louis, Missouri
November 18, 2021
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called openTelemetry, but before diving into the specifics, we’ll start with de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, percentile to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we’ll explore the openTelemetry community; its Special Interest Groups (SIGs), repositories, and how to become not only an end-user, but possibly a contributor.We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of openTelemetry, and know how to take their first steps to an open-source contribution!
Key Takeaways: Open source, vendor neutral instrumentation is an exciting new reality as the industry standardizes on openTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve and in order to achieve ubiquity, the project would benefit from growing our contributor community.
this resume for sadika shaikh bca studentSadikaShaikh7
I am a dedicated BCA student with a strong foundation in web technologies, including PHP and MySQL. I have hands-on experience in Java and Python, and a solid understanding of data structures. My technical skills are complemented by my ability to learn quickly and adapt to new challenges in the ever-evolving field of computer science.
Sustainability requires ingenuity and stewardship. Did you know Pigging Solutions pigging systems help you achieve your sustainable manufacturing goals AND provide rapid return on investment.
How? Our systems recover over 99% of product in transfer piping. Recovering trapped product from transfer lines that would otherwise become flush-waste, means you can increase batch yields and eliminate flush waste. From raw materials to finished product, if you can pump it, we can pig it.
Quantum Communications Q&A with Gemini LLM. These are based on Shannon's Noisy channel Theorem and offers how the classical theory applies to the quantum world.
Performance Budgets for the Real World by Tammy EvertsScyllaDB
Performance budgets have been around for more than ten years. Over those years, we’ve learned a lot about what works, what doesn’t, and what we need to improve. In this session, Tammy revisits old assumptions about performance budgets and offers some new best practices. Topics include:
• Understanding performance budgets vs. performance goals
• Aligning budgets with user experience
• Pros and cons of Core Web Vitals
• How to stay on top of your budgets to fight regressions
AC Atlassian Coimbatore Session Slides( 22/06/2024)apoorva2579
This is the combined Sessions of ACE Atlassian Coimbatore event happened on 22nd June 2024
The session order is as follows:
1.AI and future of help desk by Rajesh Shanmugam
2. Harnessing the power of GenAI for your business by Siddharth
3. Fallacies of GenAI by Raju Kandaswamy
Transcript: Details of description part II: Describing images in practice - T...BookNet Canada
This presentation explores the practical application of image description techniques. Familiar guidelines will be demonstrated in practice, and descriptions will be developed “live”! If you have learned a lot about the theory of image description techniques but want to feel more confident putting them into practice, this is the presentation for you. There will be useful, actionable information for everyone, whether you are working with authors, colleagues, alone, or leveraging AI as a collaborator.
Link to presentation recording and slides: https://bnctechforum.ca/sessions/details-of-description-part-ii-describing-images-in-practice/
Presented by BookNet Canada on June 25, 2024, with support from the Department of Canadian Heritage.
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Erasmo Purificato
Slide of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends" held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/07/intels-approach-to-operationalizing-ai-in-the-manufacturing-sector-a-presentation-from-intel/
Tara Thimmanaik, AI Systems and Solutions Architect at Intel, presents the “Intel’s Approach to Operationalizing AI in the Manufacturing Sector,” tutorial at the May 2024 Embedded Vision Summit.
AI at the edge is powering a revolution in industrial IoT, from real-time processing and analytics that drive greater efficiency and learning to predictive maintenance. Intel is focused on developing tools and assets to help domain experts operationalize AI-based solutions in their fields of expertise.
In this talk, Thimmanaik explains how Intel’s software platforms simplify labor-intensive data upload, labeling, training, model optimization and retraining tasks. She shows how domain experts can quickly build vision models for a wide range of processes—detecting defective parts on a production line, reducing downtime on the factory floor, automating inventory management and other digitization and automation projects. And she introduces Intel-provided edge computing assets that empower faster localized insights and decisions, improving labor productivity through easy-to-use AI tools that democratize AI.
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo...Chris Swan
Have you noticed the OpenSSF Scorecard badges on the official Dart and Flutter repos? It's Google's way of showing that they care about security. Practices such as pinning dependencies, branch protection, required reviews, continuous integration tests etc. are measured to provide a score and accompanying badge.
You can do the same for your projects, and this presentation will show you how, with an emphasis on the unique challenges that come up when working with Dart and Flutter.
The session will provide a walkthrough of the steps involved in securing a first repository, and then what it takes to repeat that process across an organization with multiple repos. It will also look at the ongoing maintenance involved once scorecards have been implemented, and how aspects of that maintenance can be better automated to minimize toil.
Video traffic on the Internet is constantly growing; networked multimedia applications consume a predominant share of the available Internet bandwidth. A major technical breakthrough and enabler in multimedia systems research and of industrial networked multimedia services certainly was the HTTP Adaptive Streaming (HAS) technique. This resulted in the standardization of MPEG Dynamic Adaptive Streaming over HTTP (MPEG-DASH) which, together with HTTP Live Streaming (HLS), is widely used for multimedia delivery in today’s networks. Existing challenges in multimedia systems research deal with the trade-off between (i) the ever-increasing content complexity, (ii) various requirements with respect to time (most importantly, latency), and (iii) quality of experience (QoE). Optimizing towards one aspect usually negatively impacts at least one of the other two aspects if not both. This situation sets the stage for our research work in the ATHENA Christian Doppler (CD) Laboratory (Adaptive Streaming over HTTP and Emerging Networked Multimedia Services; https://athena.itec.aau.at/), jointly funded by public sources and industry. In this talk, we will present selected novel approaches and research results of the first year of the ATHENA CD Lab’s operation. We will highlight HAS-related research on (i) multimedia content provisioning (machine learning for video encoding); (ii) multimedia content delivery (support of edge processing and virtualized network functions for video networking); (iii) multimedia content consumption and end-to-end aspects (player-triggered segment retransmissions to improve video playout quality); and (iv) novel QoE investigations (adaptive point cloud streaming). We will also put the work into the context of international multimedia systems research.
The DealBook is our annual overview of the Ukrainian tech investment industry. This edition comprehensively covers the full year 2023 and the first deals of 2024.
In this follow-up session on knowledge and prompt engineering, we will explore structured prompting, chain of thought prompting, iterative prompting, prompt optimization, emotional language prompts, and the inclusion of user signals and industry-specific data to enhance LLM performance.
Join EIS Founder & CEO Seth Earley and special guest Nick Usborne, Copywriter, Trainer, and Speaker, as they delve into these methodologies to improve AI-driven knowledge processes for employees and customers alike.
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
The Hive Think Tank - The Microsoft Big Data Stack by Raghu Ramakrishnan, CTO for Data, Microsoft
1. Big Data @ Microsoft
Raghu Ramakrishnan
CTO for Data, Technical Fellow
Microsoft
2. Data and Analytics – 3 Pillars
• SQL Server 2016, Azure SQL DB, Azure SQL DW, SQL Server R Services – on-prem and cloud (Windows, Linux)
• Cortana Intelligence Suite – Hadoop, Data Lake, Machine Learning, PowerBI, Data Factory, Streaming, Perceptual Intelligence; on-prem connectivity
• Microsoft R Server – Hadoop, Teradata; on-prem and cloud (Windows, Linux)
3. SQL Server 2016: Everything Built-In
The above graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Microsoft. Gartner does not endorse any
vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research
organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.
Consistent experience from on-premises to cloud
Self-service BI per user: Microsoft $120, Tableau $480, Oracle $2,230
In-memory across all workloads, built in
TPC-H non-clustered 10TB: SQL Server holds the #1, #2 and #3 results; Oracle is #4. TPC-H non-clustered results as of 04/06/15, 5/04/15, 4/15/14 and 11/25/13, respectively. http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster
[Chart: database vulnerabilities reported 2010–2015 for SQL Server, Oracle, MySQL and SAP HANA, per the National Institute of Standards and Technology Comprehensive Vulnerability Database, update 5/4/2015; SQL Server shows the fewest.]
4. In-Database Advanced Analytics
R built in to SQL Server – no need to move the data
• Open source R with in-memory and massive scale: multi-threading and massively parallel processing
• Data Scientist – interact directly with the data
• Data Developer/DBA – manage data and analytics together
• Extensibility and R integration: T-SQL interface over relational data, analytic library, new R scripts; Microsoft Azure Marketplace
Example solutions: sales forecasting, warehouse efficiency, predictive maintenance, credit risk protection
NEW: real-time operational analytics without moving the data
Mission-critical OLTP, end-to-end mobile BI, advanced analytics
5. High-performance open source R plus:
• Enterprise scale and performance – scales from workstations to large clusters and to large data sizes; growing portfolio of parallelized algorithms
• Secure, scalable R deployment and operationalization
• Write once, deploy anywhere across multiple platforms
• IDE for data scientists and developers
• Enterprise-class support
Components: DistributedR, ScaleR, ConnectR, DeployR, DevelopR
6. Cloud – SQL Server/SQL Azure
• Shifting how you purchase and manage machines
• Increased focus on Total Cost of Ownership and continuous improvements
• Built from the same code base; surface-area compatibility increased with V12 Azure SQL Database
• We’re learning from running our own code – the good and the bad – and using that to improve both product and service
• Microsoft is the only provider both on-premises and in the cloud
7. Stretch SQL Server into Azure
Stretch warm and cold tables to Azure with remote query processing: the app queries SQL Server 2016 as usual, and rows that have been stretched to Microsoft Azure are retrieved transparently. Customer data and product data stay on-premises; order history stretches to the cloud.
Order history (Name, SSN, Date):
Jane Doe    cm61ba906fd    2/28/2005
Jim Gray    ox7ff654ae6d   3/18/2005
John Smith  i2y36cg776rg   4/10/2005
Bill Brown  nx290pldo90l   4/27/2005
8. Azure SQL DW
Fully managed relational data warehouse-as-a-service – the first elastic cloud data warehouse with proven SQL Server capabilities
• Supports your smallest to your largest data storage needs; scales to petabytes of data
• Massively Parallel Processing; instant-on compute scales in seconds
• Query relational and non-relational data
• Get started in minutes; integrated with Azure ML, PowerBI and ADF
• Simple billing for compute and storage – pay for what you need, when you need it, with dynamic pause
11. Data to Intelligent Action
• Store any data – including relations
• Do any analysis – SQL queries, Hive, …
• At any speed – batch (Hive), …
• At any scale – elastic!
• Anywhere
12. Example pipeline: DATA SOURCES → INGEST → PREPARE → ANALYZE → PUBLISH → CONSUME
• Data sources: web logs and Omniture logs; on-premises SQL Server (customer and product data); in-store activity with Kinect sensors; social data; diagnostic streaming
• Ingest and prepare: Event Hubs, Stream Analytics, Azure Data Lake, HDInsight
• Analyze and publish: HDInsight, Machine Learning, Stream Analytics, Azure SQL Data Warehouse
• Consume: Power BI, Cortana, web/LOB dashboards
• Data Factory: move data, orchestrate, schedule, and monitor across the pipeline
16. Azure Data Analytics Stack
• Storage: HDFS/WebHDFS API and Cosmos Store API; compute-tier cache clusters (local ENs + CSM) over RAM/SSD/HDD; WAS-based remote storage; shared micro-services for all metadata (extent map, logical name space, secure store) based on Hekaton/RSL rings
• Cluster-wide resource management (YARN++): YARN + Federation, YARN + Rayon (capacity reservation), YARN + Mercury
• Compute tier: application engines with per-job RM and runtime – M/R, U-SQL Batch, Spark, Tez, Spark Runtime; U-SQL, Spark, Hive, Azure ML, Azure SA; SQL-DW, HDInsight, IaaS services; REEF library
17. Internal customers: Windows, SMSG, Live, Ads, CRM/Dynamics, Windows Phone, Xbox Live, Office365, STB Malware Protection, Microsoft Stores, STB Commerce Risk, Messenger, LCA, Exchange, Yammer, Skype, Bing
• Data managed: EBs
• Cluster sizes: 10s of Ks; # machines: 100s of Ks
• Daily I/O: >100 PBs
• # internal developers: 1000s; # daily jobs: 100s of Ks
19. The traditional data warehouse lifecycle
• Understand corporate strategy; gather requirements (business requirements, technical requirements)
• Design: dimension modelling, ETL design, reporting and analytics design; set up infrastructure
• Implement: data warehouse physical design, ETL development, reporting and analytics development; install and tune
• Flow: data sources → ETL → data warehouse → BI and analytics
20. Ingest all data regardless of requirements
Store all data in native format without schema definition
Do analysis using analytic engines like Hadoop: interactive queries, batch queries, machine learning, data warehouse and real-time analytics over data from devices
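The “store in native format without schema definition” point is the schema-on-read idea: structure is imposed per analysis, at read time. A minimal Python sketch (the event fields and values are made up for illustration): raw events of different shapes are all accepted at ingest, and a query projects only the fields it needs.

```python
import json

# Raw events land in native format - no schema is enforced at ingest time.
raw_store = [
    '{"device": "sensor-1", "temp_c": 21.5, "ts": "2016-01-01T00:00:00"}',
    '{"device": "sensor-2", "humidity": 40}',   # different shape - still accepted
    '{"device": "sensor-1", "temp_c": 22.1, "ts": "2016-01-01T00:05:00"}',
]

def query_avg_temp(store):
    """Schema-on-read: project the fields this analysis needs,
    skipping records that don't fit the query's schema."""
    temps = []
    for line in store:
        rec = json.loads(line)
        if "temp_c" in rec:          # apply the schema at read time
            temps.append(float(rec["temp_c"]))
    return sum(temps) / len(temps)

print(query_avg_temp(raw_store))     # averages only the records with temp_c
```

A different analysis over the same store (say, average humidity) would simply project different fields, with no up-front schema migration.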
21. Questions to ask of your data sources:
• What happened? What is happening?
• Why did it happen? What are key relationships?
• What will happen? What if? How risky is it?
• What should happen? What is the best option? How can I optimize?
24. Tiered Storage
• Interactive and real-time analytics require low-latency access; massive data volumes require scale-out stores using commodity servers, even archival storage
• Seamlessly move data across tiers, mirroring life-cycle and usage patterns
• Schedule compute near low-latency copies of data
How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
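The life-cycle-driven placement described on this slide can be sketched as a toy policy in Python. The tier names and age thresholds below are illustrative, not how Cosmos or ADLS actually decides: data migrates from memory to SSD to HDD to archive as its last access ages, and compute would then be scheduled against the lowest-latency copy.

```python
from datetime import datetime, timedelta

# Illustrative tiers, fastest first, each with an age threshold.
TIERS = [("ram", timedelta(hours=1)),
         ("ssd", timedelta(days=1)),
         ("hdd", timedelta(days=30)),
         ("archive", timedelta.max)]

def place(last_access, now):
    """Pick the fastest tier whose age threshold covers this dataset."""
    age = now - last_access
    for tier, limit in TIERS:
        if age <= limit:
            return tier
    return "archive"

now = datetime(2016, 1, 2)
print(place(datetime(2016, 1, 1, 23, 30), now))  # recently touched -> "ram"
print(place(datetime(2015, 11, 1), now))         # cold data -> "archive"
```

In a real system the policy would also weigh replica placement and governance boundaries; the point here is only that tiering is a policy over usage patterns, not a manual copy between separate storage systems.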
25. Resource Management with Multitenancy and SLAs
• Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming)
• Many users’ jobs (across these job types) run on the same machines (where the data lives)
• Policy-driven management of vast compute pools co-located with data; schedule computation “near” data
How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
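One ingredient the stack slide names for this problem is capacity reservation (YARN + Rayon). A toy admission-control sketch of the idea, with invented job sizes and cluster numbers: SLA jobs reserve capacity in a time slot ahead of their deadlines, and best-effort jobs are admitted only against the remaining headroom. The real systems reason over time-varying demand curves rather than fixed slots.

```python
# Toy capacity-reservation admission control (illustrative numbers only).
CLUSTER_CAPACITY = 100  # abstract "containers"

class Reservations:
    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = {}  # time slot -> containers promised to SLA jobs

    def reserve(self, slot, containers):
        """Admit an SLA job only if its slot still has room."""
        used = self.reserved.get(slot, 0)
        if used + containers > self.capacity:
            return False            # would overcommit the slot
        self.reserved[slot] = used + containers
        return True

    def headroom(self, slot):
        """Capacity left over for best-effort jobs in a slot."""
        return self.capacity - self.reserved.get(slot, 0)

r = Reservations(CLUSTER_CAPACITY)
print(r.reserve("09:00", 70))   # SLA job admitted
print(r.reserve("09:00", 40))   # would overcommit -> rejected
print(r.headroom("09:00"))      # best-effort jobs share the remaining 30
```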
26. Azure Data Lake Store
Fully managed cloud data store designed for analytics
• Supports HDFS-compliant analytics applications and tools
• Petabyte files, unlimited account size
• High throughput for analytics performance; low-latency ingestion with read-as-you-write
• AAD-based authentication, access auditing
• File- and folder-level ACLs, encryption at rest
27. ADLS Security: Encryption-at-Rest
• Transparently encrypts data flowing to and from public networks as well as at rest
• Transparent server-side encryption
• Users can manage their own encryption keys or let Azure Data Lake Store manage the key using Azure Key Vault
28. ADLS Security: Role-Based Access Control
• Each file and directory is associated with an owner and a group
• Files and directories have separate permissions – read (r), write (w), execute (x) – for the owner, members of the group, and all other users
• Fine-grained access control lists (ACLs) can be specified for specific named users or named groups
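The owner/group/other model plus named-user ACL entries described above can be sketched as follows. This is a simplification of HDFS-style permission evaluation (details such as permission masks and default ACLs are omitted), not ADLS’s actual implementation; all names are illustrative.

```python
# Toy POSIX/HDFS-style permission check with named-user ACL entries.
class Node:
    def __init__(self, owner, group, perms, acl=None):
        self.owner, self.group = owner, group
        self.perms = perms      # {"owner": "rwx", "group": "r--", "other": "---"}
        self.acl = acl or {}    # named users -> permission strings like "rw-"

def allowed(node, user, user_groups, want):
    """Return True if `user` may perform access `want` ('r', 'w' or 'x')."""
    if user == node.owner:
        bits = node.perms["owner"]
    elif user in node.acl:
        bits = node.acl[user]           # named-user ACL entry applies
    elif node.group in user_groups:
        bits = node.perms["group"]
    else:
        bits = node.perms["other"]
    return want in bits

f = Node("alice", "analysts",
         {"owner": "rwx", "group": "r--", "other": "---"},
         acl={"bob": "rw-"})
print(allowed(f, "alice", [], "w"))            # owner may write
print(allowed(f, "bob", [], "w"))              # ACL grants bob write
print(allowed(f, "carol", ["analysts"], "w"))  # group is read-only
```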
29. ADL Store: Ingress
Data can be ingested into Azure Data Lake Store from a variety of sources: server logs (Azure Event Hub, Apache Flume), Azure Storage Blobs, custom programs (.NET SDK, JavaScript CLI, Azure Portal, Azure PowerShell), Azure SQL DB, Azure SQL DW, Azure Tables (Table Storage), and on-premises SQL databases – via Azure Data Factory, Apache Sqoop and ADL Store’s built-in copy service.
30. ADL Store: Egress
Data can be exported from Azure Data Lake Store into numerous targets/sinks: Azure SQL DB, Azure SQL DW, Azure Tables (Table Storage), on-premises SQL databases, Azure Storage Blobs, and custom programs (.NET SDK, JavaScript CLI, Azure Portal, Azure PowerShell) – via Azure Data Factory, Apache Sqoop and ADL Store’s built-in copy service.
31. ADLS remote storage tier
Builds securely on WAS and works with YARN; intelligent ingest, massively parallel. A request from the compute tier proceeds as:
1) Filename translation (Naming Service)
2) Azure access keys (Secure Store Service)
3) Find extents (extent metadata)
4) Data access (remote storage)
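The four-step read path on this slide can be sketched in Python. The dictionaries below are illustrative stand-ins for the real metadata micro-services (Naming Service, Secure Store, extent metadata), and the paths and keys are made up.

```python
# Toy stand-ins for the metadata micro-services on the ADLS read path.
naming_service = {"/logs/2016/01.log": "store-guid-42"}        # 1) filename -> object
secret_store = {"store-guid-42": "access-key-abc"}             # 2) object -> access key
extent_metadata = {"store-guid-42": ["extent-0", "extent-1"]}  # 3) object -> extents
remote_storage = {("extent-0", "access-key-abc"): b"hello ",
                  ("extent-1", "access-key-abc"): b"world"}    # 4) keyed data access

def read(path):
    obj = naming_service[path]       # 1) filename translation
    key = secret_store[obj]          # 2) fetch Azure access keys
    extents = extent_metadata[obj]   # 3) find extents
    # 4) data access: fetch each extent from remote storage and reassemble
    return b"".join(remote_storage[(e, key)] for e in extents)

print(read("/logs/2016/01.log"))     # b'hello world'
```

Separating naming, secrets and extent maps into shared services is what lets many compute engines share one secure store without each holding storage credentials.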
32. Tiered Storage and Data Lifecycle Management
• Interactive and real-time analytics require low-latency access; massive data volumes require scale-out stores using commodity servers, even archival storage
• Scale storage independently of compute
• Seamlessly move data across tiers, mirroring life-cycle and usage patterns
• Schedule compute near low-latency copies of data
How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?
33. ADLS with local and remote storage tiers
Same read path as the remote tier – 1) filename translation (Naming Service), 2) Azure access keys (Secure Store Service), 3) find extents (extent metadata), 4) data access – but with data held both in local storage and in the WAS-based remote storage tier, which builds securely on WAS and works with YARN. Intelligent ingest, massively parallel.
34. Azure HDInsight – Linux and Windows
Managed, monitored, supported
• Cluster customization – install your favorite project
• Harness existing .NET and Java skills to write custom extensions
• Supports a broad ecosystem of ISVs (Hadoop and traditional)
Full Apache Hadoop:
• Batch – MapReduce, Pig, Hive, Spark
• Stream processing and analytics – Storm, Spark Streaming
• Interactive SQL – Hive (Tez) and Spark SQL
• Table serving – HBase
• Machine learning – Spark ML, Mahout
38. Azure Data Lake Analytics Service
A new distributed analytics service:
• Built on Apache YARN; scales dynamically with a dial
• Pay by the query
• Supports Azure AD for access control, roles, and integration with on-prem identity systems
• U-SQL language unifies the benefits of SQL with the power of C#; Hive etc. will be added over time
• Processes data across Azure
39. Get started (about 30 seconds)
1. Log in to Azure
2. Create an ADLA account
3. Write and submit an ADLA job with U-SQL (or Hive/Pig)
4. The job reads and writes data from storage (ADLS, Azure Blobs, Azure DB, …)
40. ADLA Complements HDInsight
HDInsight: dedicated managed clusters for developers familiar with open source: Java, Eclipse, Hive, etc. Clusters offer customization, control, and flexibility in a managed Hadoop cluster.
ADLA: enables customers to leverage existing experience with C#, SQL & PowerShell. Offers convenience, efficiency, and automatic scale in a "job service" form factor over a system-managed shared resource pool.
41. U-SQL
A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data:
• Allows users to focus on the what, not the how, of business problems
• Built on familiar languages (SQL and C#) and supported by a fully integrated development environment
• Built for data developers & scientists
42. U-SQL Language Philosophy
Declarative query and transformation language:
• Uses SQL's SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL analytics functions
• Optimizable, scalable
Operates on unstructured & structured data:
• Schema on read over files
• Relational metadata objects (e.g. database, table)
Extensible from the ground up:
• Type system is based on C#
• Expression language is C#
• User-defined functions (U-SQL and C#)
• User-defined types (U-SQL/C#) (future)
• User-defined aggregators (C#)
• User-defined operators (UDOs) (C#)
• U-SQL provides the parallelization and scale-out framework for user code: EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINER
Expression-flow programming style:
• Easy-to-use functional lambda composition
• Composable, globally optimizable
Federated query across distributed data sources (soon)
REFERENCE ASSEMBLY MyDB.MyAssembly;
CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int
              , order_amount float );
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();
@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , SUM(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
        && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;
OUTPUT @j TO "/output/result.txt"
USING new MyData.Write();
INSERT INTO T SELECT * FROM @j;
43. Federated Queries: Query Data Where It Lives
Easily query data in multiple Azure data stores without moving it to a single store.
Benefits:
• Avoid moving large amounts of data across the network between stores
• Single view of data irrespective of physical location
• Minimize data proliferation issues caused by maintaining multiple copies
• Single query language for all data
• Each data store maintains its own sovereignty
• Design choices based on the need
[Diagram: a U-SQL query in Azure Data Lake Analytics runs against Azure Storage Blobs, Azure SQL in VMs, and Azure SQL DB in place, returning a single result]
44. Join Local (ADLS) and External Data
Goal: find the sum of all purchases by users in the 'en-us' region.
1. Create two tables:
   – An external table 'PurchaseOrders' that refers to the PurchaseOrders table in the external Azure SQL DB.
   – A 'local' table 'UserIdsTable' created by 'extracting' user ID and region fields from the WebLogRecords.txt file stored in Azure Data Lake.
2. Join the PurchaseOrders table with the UserIds table on the common UserId column.
[Diagram: external purchase orders table (Azure SQL DB) joined, on user IDs, with the local user IDs table (from WebLogRecords.txt) inside Azure Data Lake Analytics]
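A hedged U-SQL sketch of the two steps above follows. The data-source name, credential, provider string, file paths, and column types are illustrative placeholders, and the exact federated-query DDL may differ by ADLA release:

```sql
// Register the external Azure SQL DB (names and options are illustrative).
CREATE DATA SOURCE IF NOT EXISTS MySqlDb
FROM AZURESQLDB
WITH (PROVIDER_STRING = "Database=SalesDb;",
      CREDENTIAL = MyDb.MyCredential,
      REMOTABLE_TYPES = (int, double, string));

// Step 1a: external table referring to the remote PurchaseOrders table.
CREATE EXTERNAL TABLE IF NOT EXISTS PurchaseOrders (
    UserId int, Amount double
) FROM MySqlDb LOCATION "dbo.PurchaseOrders";

// Step 1b: 'local' rowset extracted from the web log file in ADLS.
@users = EXTRACT UserId int, Region string
         FROM "/input/WebLogRecords.txt"
         USING Extractors.Csv();

// Step 2: join on UserId and sum purchases for the en-us region.
@total = SELECT SUM(p.Amount) AS TotalPurchases
         FROM PurchaseOrders AS p
              INNER JOIN @users AS u ON p.UserId == u.UserId
         WHERE u.Region == "en-us";

OUTPUT @total TO "/output/total.csv" USING Outputters.Csv();
```

Note that only the rows needed for the join cross the network; the purchase orders otherwise stay in the Azure SQL DB.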
45. Concepts: Jobs, Stages and Vertexes
• Each job is broken into a number of vertexes
• Each vertex is some work that needs to be done
• Vertexes are organized into stages:
  – Vertexes in each stage do the same work on the same data
  – A vertex in one stage may depend on a vertex in an earlier stage
• Stages themselves are organized into an acyclic graph
[Diagram: an example job graph with one input, two outputs, 6 stages and 8 vertexes]
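The stage graph described above can be modeled directly. This is a small Python sketch (the stage names and the 6-stage/8-vertex example shape are invented for illustration), using a topological order to respect stage dependencies:

```python
# Toy model of an ADLA job as stages of vertexes: vertexes in a stage
# do the same work, a stage may depend on earlier stages, and the
# stage graph is acyclic, so stages can run in topological order.
from graphlib import TopologicalSorter

def schedule(stages):
    """stages: {stage: (num_vertexes, {dependency_stages})}.
    Returns the stages in a valid execution order."""
    ts = TopologicalSorter({s: deps for s, (_, deps) in stages.items()})
    return list(ts.static_order())

def total_vertexes(stages):
    """Total work items in the job."""
    return sum(n for n, _ in stages.values())

# An invented job shaped like the slide's example: 6 stages, 8 vertexes.
job = {
    "extract1":  (2, set()),
    "extract2":  (2, set()),
    "partition": (1, {"extract1"}),
    "join":      (1, {"partition", "extract2"}),
    "aggregate": (1, {"join"}),
    "output":    (1, {"aggregate"}),
}
```

Within a stage, all vertexes are independent and can run in parallel; the topological order only constrains when a stage may start.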
46. Resource Management with Multitenancy and SLAs
• Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming)
• Many users' jobs (across these job types) run on the same machines (where the data lives)
• Policy-driven management of vast compute pools co-located with data
• Schedule computation "near" data
How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?
47. Resource Managers for Big Data
• Allocate compute containers to competing jobs: multiple job engines share a pool of containers
• YARN: resource manager for Hadoop 2.x
• Others: Corona, Mesos, Omega
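The core job of such a resource manager, handing containers from a shared pool to competing jobs, can be sketched with a simple fair-share policy. This is a toy illustration, not YARN's actual scheduler or API:

```python
# Toy sketch of a resource manager allocating identical compute
# containers to competing jobs from one shared pool, round-robin
# so each job gets a roughly fair share. Illustrative only; real
# schedulers (e.g. YARN's) weigh queues, memory, cores, locality.

def allocate(pool_size, requests):
    """requests: {job: containers_wanted}.
    Returns {job: containers_granted} under a round-robin fair share."""
    granted = {job: 0 for job in requests}
    remaining = dict(requests)
    while pool_size > 0 and any(remaining.values()):
        for job in list(remaining):
            if remaining[job] > 0 and pool_size > 0:
                granted[job] += 1          # hand out one container
                remaining[job] -= 1
                pool_size -= 1
    return granted
```

With a pool of 5 containers and two jobs each asking for 4, the round-robin pass grants 3 to the first and 2 to the second rather than starving either job.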
48. Shared Data and Compute
[Diagram] Multiple analytic engines (relational query engine, machine learning) run over a common compute fabric (resource management) and tiered storage:
• Multiple analytic engines sharing the same resource pool
• Compute and store/cache on the same machines
50. YARN Gaps
• Resource allocation SLOs
• Scalability limitations
• High allocation latency
• Support for specialized execution frameworks
• Interactive environments, long-running services
51. Microsoft Contributions to OSS Apache YARN
• Amoeba / Rayon
  – Status: shipping in Apache Hadoop 2.6
• Mercury and Yaq
  – Status: now in Apache Hadoop trunk!
• Federation
  – Status: prototype and JIRA
• Framework-level pooling
  – Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times
  – Status: spec