The Microsoft Analytics Platform System (APS) is a turnkey appliance that provides a modern data warehouse with the ability to handle both relational and non-relational data. It uses a massively parallel processing (MPP) architecture with multiple CPUs running queries in parallel. The APS includes an integrated Hadoop distribution called HDInsight that allows users to query Hadoop data using T-SQL with PolyBase. This provides a single query interface and allows users to leverage existing SQL skills. The APS appliance is pre-configured with software and hardware optimized to deliver high performance at scale for data warehousing workloads.
Modernizing to a Cloud Data Architecture - Databricks
Organizations with on-premises Hadoop infrastructure are bogged down by system complexity, unscalable infrastructure, and the increasing burden on DevOps to manage legacy architectures. Costs and resource utilization continue to go up while innovation has flatlined. In this session, you will learn why, now more than ever, enterprises are looking for cloud alternatives to Hadoop and are migrating off of the architecture in large numbers. You will also learn how the benefits of elastic compute models helped one customer scale their analytics and AI workloads, along with best practices from their successful migration of data and workloads to the cloud.
Big data architectures and the data lake - James Serra
The document provides an overview of big data architectures and the data lake concept. It discusses why organizations are adopting data lakes to handle increasing data volumes and varieties. The key aspects covered include:
- Defining top-down and bottom-up approaches to data management
- Explaining what a data lake is and how Hadoop can function as the data lake
- Describing how a modern data warehouse combines features of a traditional data warehouse and data lake
- Discussing how federated querying allows data to be accessed across multiple sources
- Highlighting benefits of implementing big data solutions in the cloud
- Comparing shared-nothing, massively parallel processing (MPP) architectures to symmetric multi-processing (SMP) architectures
Data Architecture Best Practices for Advanced Analytics - DATAVERSITY
Many organizations are immature when it comes to data and analytics use. The answer lies in delivering a greater level of insight from data, straight to the point of need.
There are so many Data Architecture best practices today, accumulated from years of practice. In this webinar, William will look at some Data Architecture best practices that he believes have emerged in the past two years and have not yet been worked into many enterprise data programs. These are keepers that organizations will need to adopt by one means or another, so it's best to mindfully work them into the environment.
Data Architecture Strategies: Data Architecture for Digital Transformation - DATAVERSITY
Digital transformation rests on foundational data management approaches: MDM, data quality, data architecture, and more. At the same time, combining these foundational approaches with other innovative techniques can help drive organizational change as well as technological transformation. This webinar will provide practical steps for creating a data foundation for effective digital transformation.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
Data Con LA 2020
Description
In this session, I introduce the Amazon Redshift lake house architecture which enables you to query data across your data warehouse, data lake, and operational databases to gain faster and deeper insights. With a lake house architecture, you can store data in open file formats in your Amazon S3 data lake.
Speaker
Antje Barth, Amazon Web Services, Sr. Developer Advocate, AI and Machine Learning
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service) that is a tool for curating and processing massive amounts of data and developing, training and deploying models on that data, and managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming and Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
Data Catalog for Better Data Discovery and Governance - Denodo
Watch full webinar here: https://buff.ly/2Vq9FR0
Data catalogs are in vogue, answering critical data governance questions like "Where does all my data reside?" "What other entities are associated with my data?" "What are the definitions of the data fields?" and "Who accesses the data?" Data catalogs maintain the necessary business metadata to answer these questions and many more. But that's not enough. To be useful, data catalogs need to deliver these answers to business users right within the applications they use.
In this session, you will learn:
*How data catalogs enable enterprise-wide data governance regimes
*What key capability requirements should you expect in data catalogs
*How data virtualization combines dynamic data catalogs with data delivery
This document outlines an agenda for a 90-minute workshop on Snowflake. The agenda includes introductions, an overview of Snowflake and data warehousing, demonstrations of how users utilize Snowflake, hands-on exercises loading sample data and running queries, and discussions of Snowflake architecture and capabilities. Real-world customer examples are also presented, such as a pharmacy building new applications on Snowflake and an education company using it to unify their data sources and achieve a 16x performance improvement.
Master the Multi-Clustered Data Warehouse - Snowflake - Matillion
Snowflake is one of the most powerful, efficient data warehouses on the market today—and we joined forces with the Snowflake team to show you how it works!
In this webinar:
- Learn how to optimize Snowflake
- Hear insider tips and tricks on how to improve performance
- Get expert insights from Craig Collier, Technical Architect from Snowflake, and Kalyan Arangam, Solution Architect from Matillion
- Find out how leading brands like Converse, Duo Security, and Pets at Home use Snowflake and Matillion ETL to make data-driven decisions
- Discover how Matillion ETL and Snowflake work together to modernize your data world
- Learn how to utilize the impressive scalability of Snowflake and Matillion
Data Lakehouse, Data Mesh, and Data Fabric (r2) - James Serra
So many buzzwords of late: Data Lakehouse, Data Mesh, and Data Fabric. What do all these terms mean and how do they compare to a modern data warehouse? In this session I’ll cover all of them in detail and compare the pros and cons of each. They all may sound great in theory, but I'll dig into the concerns you need to be aware of before taking the plunge. I’ll also include use cases so you can see what approach will work best for your big data needs. And I'll discuss Microsoft's version of the data mesh.
Product-thinking is making a big impact in the data world with the rise of Data Products, Data Product Managers, data mesh, and treating “Data as a Product.” But Honest, No-BS: What is a Data Product? And what key questions should we ask ourselves while developing them? Tim Gasper (VP of Product, data.world), will walk through the Data Product ABCs as a way to make treating data as a product way simpler: Accountability, Boundaries, Contracts and Expectations, Downstream Consumers, and Explicit Knowledge.
Presentation on Data Mesh: the paradigm shift to a new type of ecosystem architecture, a shift left toward a modern distributed architecture that allows domain-specific ownership of data, views "data-as-a-product," and enables each domain to handle its own data pipelines.
Snowflake's Kent Graziano talks about what makes a data warehouse as a service and some of the key features of Snowflake's data warehouse as a service.
Intuit's Data Mesh - Data Mesh Learning Community meetup 5.13.2021 - Tristan Baker
Past, present, and future of data mesh at Intuit. This deck describes a vision and strategy for improving data worker productivity through a Data Mesh approach to organizing data and holding data producers accountable. Delivered at the inaugural Data Mesh Learning meetup on 5/13/2021.
This is Part 4 of the GoldenGate series on Data Mesh - a series of webinars helping customers understand how to move off of old-fashioned monolithic data integration architecture and get ready for more agile, cost-effective, event-driven solutions. The Data Mesh is a kind of Data Fabric that emphasizes business-led data products running on event-driven streaming architectures, serverless, and microservices based platforms. These emerging solutions are essential for enterprises that run data-driven services on multi-cloud, multi-vendor ecosystems.
Join this session to get a fresh look at Data Mesh; we'll start with core architecture principles (vendor agnostic) and transition into detailed examples of how Oracle's GoldenGate platform is providing capabilities today. We will discuss essential technical characteristics of a Data Mesh solution, and the benefits that business owners can expect by moving IT in this direction. For more background on Data Mesh, Part 1, 2, and 3 are on the GoldenGate YouTube channel: https://www.youtube.com/playlist?list=PLbqmhpwYrlZJ-583p3KQGDAd6038i1ywe
Webinar Speaker: Jeff Pollock, VP Product (https://www.linkedin.com/in/jtpollock/)
Mr. Pollock is an expert technology leader for data platforms, big data, data integration and governance. Jeff has been CTO at California startups and a senior exec at Fortune 100 tech vendors. He is currently Oracle VP of Products and Cloud Services for Data Replication, Streaming Data and Database Migrations. While at IBM, he was head of all Information Integration, Replication and Governance products, and previously Jeff was an independent architect for US Defense Department, VP of Technology at Cerebra and CTO of Modulant – he has been engineering artificial intelligence based data platforms since 2001. As a business consultant, Mr. Pollock was a Head Architect at Ernst & Young’s Center for Technology Enablement. Jeff is also the author of “Semantic Web for Dummies” and "Adaptive Information,” a frequent keynote at industry conferences, author for books and industry journals, formerly a contributing member of W3C and OASIS, and an engineering instructor with UC Berkeley’s Extension for object-oriented systems, software development process and enterprise architecture.
The data lake has become extremely popular, but there is still confusion on how it should be used. In this presentation I will cover common big data architectures that use the data lake, the characteristics and benefits of a data lake, and how it works in conjunction with a relational data warehouse. Then I’ll go into details on using Azure Data Lake Store Gen2 as your data lake, and various typical use cases of the data lake. As a bonus I’ll talk about how to organize a data lake and discuss the various products that can be used in a modern data warehouse.
Introducing Snowflake, an elastic data warehouse delivered as a service in the cloud. It aims to simplify data warehousing by removing the need for customers to manage infrastructure, scaling, and tuning. Snowflake uses a multi-cluster architecture to provide elastic scaling of storage, compute, and concurrency. It can bring together structured and semi-structured data for analysis without requiring data transformation. Customers have seen significant improvements in performance, cost savings, and the ability to add new workloads compared to traditional on-premises data warehousing solutions.
Cortana Analytics Suite is a fully managed big data and advanced analytics suite that transforms your data into intelligent action. It is comprised of data storage, information management, machine learning, and business intelligence software in a single convenient monthly subscription. This presentation will cover all the products involved, how they work together, and use cases.
Azure Stream Analytics (ASA) is an Azure Service that enables real-time insights over streaming data from devices, sensors, infrastructure, and applications. In this presentation, we provide introduction to the service, common use cases, example customer scenarios, business benefits, and demo how to get started. We will quickly build a simple real time analytic application that uses an IoT device to ingest data (Event Hubs), process and analyze data (Stream Analytics) and visualize data (PowerBI).
Cortana Analytics Workshop: The "Big Data" of the Cortana Analytics Suite, Pa... - MSAdvAnalytics
Lance Olson. Cortana Analytics is a fully managed big data and advanced analytics suite that helps you transform your data into intelligent action. Come to this two-part session to learn how you can do "big data" processing and storage in Cortana Analytics. In the first part, we will provide an overview of the processing and storage services. We will then talk about the patterns and use cases which make up most big data solutions. In the second part, we will go hands-on, showing you how to get started today with writing batch/interactive queries, real-time stream processing, or NoSQL transactions all over the same repository of data. Crunch petabytes of data by scaling out your computation power to any sized cluster. Store any amount of unstructured data in its native format with no limits to file or account size. All of this can be done with no hardware to acquire or maintain and minimal time to setup giving you the value of "big data" within minutes. Go to https://channel9.msdn.com/ to find the recording of this session.
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft’s strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products from collecting data, transforming it, storing it, to visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company's big data solution.
Building an Effective Data Warehouse Architecture - James Serra
Why use a data warehouse? What is the best methodology to use when creating a data warehouse? Should I use a normalized or dimensional approach? What is the difference between the Kimball and Inmon methodologies? Does the new Tabular model in SQL Server 2012 change things? What is the difference between a data warehouse and a data mart? Is there hardware that is optimized for a data warehouse? What if I have a ton of data? During this session James will help you to answer these questions.
So you got a handle on what Big Data is and how you can use it to find business value in your data. Now you need an understanding of the Microsoft products that can be used to create a Big Data solution. Microsoft has many pieces of the puzzle and in this presentation I will show how they fit together. How does Microsoft enhance and add value to Big Data? From collecting data, transforming it, storing it, to visualizing it, I will show you Microsoft’s solutions for every step of the way.
As a follow-on to the presentation "Building an Effective Data Warehouse Architecture", this presentation will explain exactly what Big Data is and its benefits, including use cases. We will discuss how Hadoop, the cloud and massively parallel processing (MPP) are changing the way data warehouses are being built. We will talk about hybrid architectures that combine on-premises data with data in the cloud as well as relational data and non-relational (unstructured) data. We will look at the benefits of MPP over SMP and how to integrate data from Internet of Things (IoT) devices. You will learn what a modern data warehouse should look like and how the role of a Data Lake and Hadoop fit in. In the end you will have guidance on the best solution for your data warehouse going forward.
Should I move my database to the cloud? - James Serra
So you have been running on-prem SQL Server for a while now. Maybe you have taken the step to move it from bare metal to a VM, and have seen some nice benefits. Ready to see a TON more benefits? If you said “YES!”, then this is the session for you as I will go over the many benefits gained by moving your on-prem SQL Server to an Azure VM (IaaS). Then I will really blow your mind by showing you even more benefits by moving to Azure SQL Database (PaaS/DBaaS). And for those of you with a large data warehouse, I've got you covered with Azure SQL Data Warehouse. Along the way I will talk about the many hybrid approaches so you can take a gradual approach to moving to the cloud. If you are interested in cost savings, additional features, ease of use, quick scaling, improved reliability and ending the days of upgrading hardware, this is the session for you!
Machine learning allows us to build predictive analytics solutions of tomorrow - these solutions allow us to better diagnose and treat patients, correctly recommend interesting books or movies, and even make the self-driving car a reality. Microsoft Azure Machine Learning (Azure ML) is a fully-managed Platform-as-a-Service (PaaS) for building these predictive analytics solutions. It is very easy to build solutions with it, helping to overcome the challenges most businesses have in deploying and using machine learning. In this presentation, we will take a look at how to create ML models with Azure ML Studio and deploy those models to production in minutes.
I often hear from clients: “We don’t know much about Big Data – can you tell us what it is and how it can help our business?” Yes! The first step is this vendor-free presentation, where I start with a business level discussion, not a technical one. Big Data is an opportunity to re-imagine our world, to track new signals that were once impossible, to change the way we experience our communities, our places of work and our personal lives. I will help you to identify the business value opportunity from Big Data and how to operationalize it. Yes, we will cover the buzz words: modern data warehouse, Hadoop, cloud, MPP, Internet of Things, and Data Lake, but I will show use cases to better understand them. In the end, I will give you the ammo to go to your manager and say “We need Big Data and here is why!” Because if you are not utilizing Big Data to help you make better business decisions, you can bet your competitors are.
The new Microsoft Azure SQL Data Warehouse (SQL DW) is an elastic data warehouse-as-a-service and is a Massively Parallel Processing (MPP) solution for "big data" with true enterprise class features. The SQL DW service is built for data warehouse workloads from a few hundred gigabytes to petabytes of data, with truly unique features like disaggregated compute and storage allowing customers to utilize the service to match their needs. In this presentation, we take an in-depth look at implementing a SQL DW, elastic scale (grow, shrink, and pause), and hybrid data clouds with Hadoop integration via PolyBase allowing for a true SQL experience across structured and unstructured data.
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
The document summarizes new features in SQL Server 2016 SP1, organized into three categories: performance enhancements, security improvements, and hybrid data capabilities. It highlights key features such as in-memory technologies for faster queries, always encrypted for data security, and PolyBase for querying relational and non-relational data. New editions like Express and Standard provide more built-in capabilities. The document also reviews SQL Server 2016 SP1 features by edition, showing advanced features are now more accessible across more editions.
The cloud is all the rage. Does it live up to its hype? What are the benefits of the cloud? Join me as I discuss the reasons so many companies are moving to the cloud and demo how to get up and running with a VM (IaaS) and a database (PaaS) in Azure. See why the ability to scale easily, the quickness that you can create a VM, and the built-in redundancy are just some of the reasons that make moving to the cloud a “no brainer”. And if you have an on-prem datacenter, learn how to get out of the air-conditioning business!
Relational databases vs Non-relational databases - James Serra
There is a lot of confusion about the place and purpose of the many recent non-relational database solutions ("NoSQL databases") compared to the relational database solutions that have been around for so many years. In this presentation I will first clarify what exactly these database solutions are, compare them, and discuss the best use cases for each. I'll discuss topics involving OLTP, scaling, data warehousing, polyglot persistence, and the CAP theorem. We will even touch on a new type of database solution called NewSQL. If you are building a new solution it is important to understand all your options so you take the right path to success.
Choosing technologies for a big data solution in the cloud - James Serra
Has your company been building data warehouses for years using SQL Server? And are you now tasked with creating or moving your data warehouse to the cloud and modernizing it to support “Big Data”? What technologies and tools should you use? That is what this presentation will help you answer. First we will cover what questions to ask concerning data (type, size, frequency), reporting, performance needs, on-prem vs cloud, staff technology skills, OSS requirements, cost, and MDM needs. Then we will show you common big data architecture solutions and help you to answer questions such as: Where do I store the data? Should I use a data lake? Do I still need a cube? What about Hadoop/NoSQL? Do I need the power of MPP? Should I build a "logical data warehouse"? What is this lambda architecture? Can I use Hadoop for my DW? Finally, we’ll show some architectures of real-world customer big data solutions. Come to this session to get started down the path to making the proper technology choices in moving to the cloud.
James has worked at Microsoft for the past year. Before that, he was an independent consultant as well as having worked as a permanent employee and contractor at numerous companies. What is different about Microsoft? What is it like to see how things work “behind the curtain”? How does it compare to what he anticipated it to be like? Come join this session to find out more about working for Microsoft: benefits, compensation, training, career advancement, work-life balance, travel, types of jobs, etc. We will leave plenty of time to ask questions!
This presentation is for those of you who are interested in moving your on-prem SQL Server databases and servers to Azure virtual machines (VM’s) in the cloud so you can take advantage of all the benefits of being in the cloud. This is commonly referred to as a “lift and shift” as part of an Infrastructure-as-a-service (IaaS) solution. I will discuss the various Azure VM sizes and options, migration strategies, storage options, high availability (HA) and disaster recovery (DR) solutions, and best practices.
Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find use cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I’ll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems and hopefully set off light bulbs in your head on how big data can help your organization make better business decisions.
Prague data management meetup 2018-03-27 - Martin Bém
This document discusses different data types and data models. It begins by describing unstructured, semi-structured, and structured data. It then discusses relational and non-relational data models. The document notes that big data can include any of these data types and models. It provides an overview of Microsoft's data management and analytics platform and tools for working with structured, semi-structured, and unstructured data at varying scales. These include offerings like SQL Server, Azure SQL Database, Azure Data Lake Store, Azure Data Lake Analytics, HDInsight and Azure Data Warehouse.
Data Warehouse Modernization: Accelerating Time-To-Action - MapR Technologies
Data warehouses have been the standard tool for analyzing data created by business operations. In recent years, increasing data volumes, new types of data formats, and emerging analytics technologies such as machine learning have given rise to modern data lakes. Connecting application databases, data warehouses, and data lakes using real-time data pipelines can significantly improve the time to action for business decisions. More: http://info.mapr.com/WB_MapR-StreamSets-Data-Warehouse-Modernization_Global_DG_17.08.16_RegistrationPage.html
Microsoft Data Platform - What's included - James Serra
This document provides an overview of a speaker and their upcoming presentation on Microsoft's data platform. The speaker is a 30-year IT veteran who has worked in various roles including BI architect, developer, and consultant. Their presentation will cover collecting and managing data, transforming and analyzing data, and visualizing and making decisions from data. It will also discuss Microsoft's various product offerings for data warehousing and big data solutions.
The document discusses modernizing a data warehouse using the Microsoft Analytics Platform System (APS). APS is described as a turnkey appliance that allows organizations to integrate relational and non-relational data in a single system for enterprise-ready querying and business intelligence. It provides a scalable solution for growing data volumes and types that removes limitations of traditional data warehousing approaches.
Data Analytics Meetup: Introduction to Azure Data Lake Storage - CCG
Microsoft Azure Data Lake Storage is designed to enable operational and exploratory analytics through a hyper-scale repository. Journey through Azure Data Lake Storage Gen1 with Microsoft Data Platform Specialist, Audrey Hammonds. In this video she explains the fundamentals of Gen 1 and Gen 2, walks us through how to provision a Data Lake, and gives tips to avoid turning your Data Lake into a swamp.
Learn more about Data Lakes with our blog - Data Lakes: Data Agility is Here Now https://bit.ly/2NUX1H6
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the Cloud - DataWorks Summit
This document discusses how organizations can leverage data and analytics to power their business models. It provides examples of Fortune 100 companies that are using Attunity products to build data lakes and ingest data from SAP and other sources into Hadoop, Apache Kafka, and the cloud in order to perform real-time analytics. The document outlines the benefits of Attunity's data replication tools for extracting, transforming, and loading SAP and other enterprise data into data lakes and data warehouses.
Overview of Apache Trafodion (incubating), Enterprise Class Transactional SQL-on-Hadoop DBMS, with operational use cases, what it takes to be a world class RDBMS, some performance information, and the new company Esgyn which will leverage Apache Trafodion for operational solutions.
Streaming Real-time Data to Azure Data Lake Storage Gen 2 - Carole Gunst
Check out this presentation to learn the basics of using Attunity Replicate to stream real-time data to Azure Data Lake Storage Gen2 for analytics projects.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture - DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Innovation Track AWS Cloud Experience Argentina - Data Lakes & Analytics on AWS - Amazon Web Services LATAM
Data lakes allow organizations to store all types of data in a centralized repository at scale. AWS Lake Formation makes it easy to build secure data lakes by automatically registering and cleaning data, enforcing access permissions, and enabling analytics. Data stored in data lakes can be analyzed using services like Amazon Athena, Redshift, and EMR depending on the type of analysis and latency required.
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
Neel Mitra - Solutions Architect, AWS
Roger Dahlstrom - Solutions Architect, AWS
The document discusses using Attunity Replicate to accelerate loading and integrating big data into Microsoft's Analytics Platform System (APS). Attunity Replicate provides real-time change data capture and high-performance data loading from various sources into APS. It offers a simplified and automated process for getting data into APS to enable analytics and business intelligence. Case studies are presented showing how major companies have used APS and Attunity Replicate to improve analytics and gain business insights from their data.
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ... - Pentaho
This document discusses approaches to implementing Hadoop, NoSQL, and analytical databases. It describes:
1) The current landscape of big data databases including Hadoop, NoSQL, and analytical databases that are often used together but come from different vendors with different interfaces.
2) Common uses of transactional databases, Hadoop, NoSQL databases, and analytical databases.
3) The complexity of current implementation approaches that involve multiple coding steps across various tools.
4) How Pentaho provides a unified platform and visual tools to reduce the time and effort needed for implementation by eliminating disjointed steps and enabling non-coders to develop workflows and analytics for big data.
This document provides an overview of a course on implementing a modern data platform architecture using Azure services. The course objectives are to understand cloud and big data concepts, the role of Azure data services in a modern data platform, and how to implement a reference architecture using Azure data services. The course will provide an ARM template for a data platform solution that can address most data challenges.
Data Analytics Week at the San Francisco Loft
Using Data Lakes
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Speakers:
John Mallory - Principal Business Development Manager Storage (Object), AWS
Hemant Borole - Sr. Big Data Consultant, AWS
A data lake can be used as a source for both structured and unstructured data - but how? We'll look at using open standards including Spark and Presto with Amazon EMR, Amazon Redshift Spectrum and Amazon Athena to process and understand data.
Level: Intermediate
Speakers:
Tony Nguyen - Senior Consultant, ProServe, AWS
Hannah Marlowe - Consultant - Federal, AWS
Testing Big Data: Automated Testing of Hadoop with QuerySurge - RTTS
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
Cardinality provides innovative analytics using big data techniques. It analyzes data from multiple sources, including at the customer, service, and network-wide levels. Cardinality's analytics platform includes components for data collection, storage, analysis, and visualization. It utilizes open-source technologies like Hadoop, Spark, and open APIs to provide customizable and scalable solutions to customers.
Similar to Modern Data Warehousing with the Microsoft Analytics Platform System
Microsoft Fabric is the next version of Azure Data Factory, Azure Data Explorer, Azure Synapse Analytics, and Power BI. It brings all of these capabilities together into a single unified analytics platform that goes from the data lake to the business user in a SaaS-like environment. Therefore, the vision of Fabric is to be a one-stop shop for all the analytical needs for every enterprise and one platform for everyone from a citizen developer to a data engineer. Fabric will cover the complete spectrum of services including data movement, data lake, data engineering, data integration and data science, observational analytics, and business intelligence. With Fabric, there is no need to stitch together different services from multiple vendors. Instead, the customer enjoys an end-to-end, highly integrated, single offering that is easy to understand, onboard, create and operate.
This is a hugely important new product from Microsoft and I will simplify your understanding of it via a presentation and demo.
Agenda:
What is Microsoft Fabric?
Workspaces and capacities
OneLake
Lakehouse
Data Warehouse
ADF
Power BI / DirectLake
Resources
Data Warehousing Trends, Best Practices, and Future Outlook - James Serra
Over the last decade, the 3Vs of data - Volume, Velocity & Variety - have grown massively. The Big Data revolution has completely changed the way companies collect, analyze & store data. Advancements in cloud-based data warehousing technologies have empowered companies to fully leverage big data without heavy investments both in terms of time and resources. But, that doesn’t mean building and managing a cloud data warehouse isn’t accompanied by any challenges. From deciding on a service provider to the design architecture, deploying a data warehouse tailored to your business needs is a strenuous undertaking. Looking to deploy a data warehouse to scale your company’s data infrastructure or still on the fence? In this presentation you will gain insights into the current Data Warehousing trends, best practices, and future outlook. Learn how to build your data warehouse with the help of real-life use-cases and discussion on commonly faced challenges. In this session you will learn:
- Choosing the best solution - Data Lake vs. Data Warehouse vs. Data Mart
- Choosing the best Data Warehouse design methodologies: Data Vault vs. Kimball vs. Inmon
- Step by step approach to building an effective data warehouse architecture
- Common reasons for the failure of data warehouse implementations and how to avoid them
Azure Synapse Analytics is Azure SQL Data Warehouse evolved: a limitless analytics service, that brings together enterprise data warehousing and Big Data analytics into a single service. It gives you the freedom to query data on your terms, using either serverless on-demand or provisioned resources, at scale. Azure Synapse brings these two worlds together with a unified experience to ingest, prepare, manage, and serve data for immediate business intelligence and machine learning needs. This is a huge deck with lots of screenshots so you can see exactly how it works.
Power BI Overview, Deployment and Governance - James Serra
This document provides an overview of external sharing in Power BI using Azure Active Directory Business-to-Business (Azure B2B) collaboration. Azure B2B allows Power BI content to be securely distributed to guest users outside the organization while maintaining control over internal data. There are three main approaches for sharing - assigning Pro licenses manually, using guest's own licenses, or sharing to guests via Power BI Premium capacity. Azure B2B handles invitations, authentication, and governance policies to control external sharing. All guest actions are audited. Conditional access policies can also be enforced for guests.
Power BI has become a product with a ton of exciting features. This presentation will give an overview of some of them, including Power BI Desktop, Power BI service, what’s new, integration with other services, Power BI premium, and administration.
The breadth and depth of Azure products that fall under the AI and ML umbrella can be difficult to follow. In this presentation I’ll first define exactly what AI, ML, and deep learning are, and then go over the various Microsoft AI and ML products and their use cases.
This document provides an overview and summary of the author's background and expertise. It states that the author has over 30 years of experience in IT working on many BI and data warehouse projects. It also lists that the author has experience as a developer, DBA, architect, and consultant. It provides certifications held and publications authored as well as noting previous recognition as an SQL Server MVP.
AI for an intelligent cloud and intelligent edge: Discover, deploy, and manag... - James Serra
Discover, manage, deploy, monitor – rinse and repeat. In this session we show how Azure Machine Learning can be used to create the right AI model for your challenge and then easily customize it using your development tools while relying on Azure ML to optimize them to run in hardware accelerated environments for the cloud and the edge using FPGAs and Neural Network accelerators. We then show you how to deploy the model to highly scalable web services and nimble edge applications that Azure can manage and monitor for you. Finally, we illustrate how you can leverage the model telemetry to retrain and improve your content.
Power BI for Big Data and the New Look of Big Data Solutions - James Serra
New features in Power BI give it enterprise tools, but that does not mean it automatically creates an enterprise solution. In this talk we will cover these new features (composite models, aggregations tables, dataflow) as well as Azure Data Lake Store Gen2, and describe the use cases and products of an individual, departmental, and enterprise big data solution. We will also talk about why a data warehouse and cubes still should be part of an enterprise solution, and how a data lake should be organized.
In three years I went from a complete unknown to a popular blogger, speaker at PASS Summit, a SQL Server MVP, and then joined Microsoft. Along the way I saw my yearly income triple. Is it because I know some secret? Is it because I am a genius? No! It is just about laying out your career path, setting goals, and doing the work.
I'll cover tips I learned over my career on everything from interviewing to building your personal brand. I'll discuss perm positions, consulting, contracting, working for Microsoft or partners, hot fields, in-demand skills, social media, networking, presenting, blogging, salary negotiating, dealing with recruiters, certifications, speaking at major conferences, resume tips, and keys to a high-paying career.
Your first step to enhancing your career will be to attend this session! Let me be your career coach!
Is the traditional data warehouse dead? - James Serra
With new technologies such as Hive LLAP or Spark SQL, do I still need a data warehouse or can I just put everything in a data lake and report off of that? No! In the presentation I’ll discuss why you still need a relational data warehouse and how to use a data lake and a RDBMS data warehouse to get the best of both worlds. I will go into detail on the characteristics of a data lake and its benefits and why you still need data governance tasks in a data lake. I’ll also discuss using Hadoop as the data lake, data virtualization, and the need for OLAP in a big data solution. And I’ll put it all together by showing common big data architectures.
Differentiate Big Data vs Data Warehouse use cases for a cloud solution - James Serra
It can be quite challenging keeping up with the frequent updates to the Microsoft products and understanding all their use cases and how all the products fit together. In this session we will differentiate the use cases for each of the Microsoft services, explaining and demonstrating what is good and what isn't, in order for you to position, design and deliver the proper adoption use cases for each with your customers. We will cover a wide range of products such as Databricks, SQL Data Warehouse, HDInsight, Azure Data Lake Analytics, Azure Data Lake Store, Blob storage, and AAS as well as high-level concepts such as when to use a data lake. We will also review the most common reference architectures (“patterns”) witnessed in customer adoption.
Azure SQL Database Managed Instance is a new flavor of Azure SQL Database that is a game changer. It offers near-complete SQL Server compatibility and network isolation to easily lift and shift databases to Azure (you can literally back up an on-premises database and restore it into an Azure SQL Database Managed Instance). Think of it as an enhancement to Azure SQL Database that is built on the same PaaS infrastructure and maintains all its features (i.e. active geo-replication, high availability, automatic backups, database advisor, threat detection, intelligent insights, vulnerability assessment, etc) but adds support for databases up to 35TB, VNET, SQL Agent, cross-database querying, replication, etc. So, you can migrate your databases from on-prem to Azure with very little migration effort, which is a big improvement from the current Singleton or Elastic Pool flavors, which can require substantial changes.
Learning to present and becoming good at it - James Serra
Have you been thinking about presenting at a user group? Are you being asked to present at your work? Is learning to present one of the keys to advancing your career? Or do you just think it would be fun to present but you are too nervous to try it? Well take the first step to becoming a presenter by attending this session and I will guide you through the process of learning to present and becoming good at it. It’s easier than you think! I am an introvert and was deathly afraid to speak in public. Now I love to present and it’s actually my main function in my job at Microsoft. I’ll share with you the journey that led me to speak at major conferences and the skills I learned along the way to become a good presenter and to get rid of the fear. You can do it!
DocumentDB is a powerful NoSQL solution. It provides elastic scale, high performance, global distribution, a flexible data model, and is fully managed. If you are looking for a scaled OLTP solution that is too much for SQL Server to handle (i.e. millions of transactions per second) and/or will be using JSON documents, DocumentDB is the answer.
Introduction to Microsoft’s Hadoop solution (HDInsight) - James Serra
Did you know Microsoft provides a Hadoop Platform-as-a-Service (PaaS)? It’s called Azure HDInsight and it deploys and provisions managed Apache Hadoop clusters in the cloud, providing a software framework designed to process, analyze, and report on big data with high reliability and availability. HDInsight uses the Hortonworks Data Platform (HDP) Hadoop distribution that includes many Hadoop components such as HBase, Spark, Storm, Pig, Hive, and Mahout. Join me in this presentation as I talk about what Hadoop is, why deploy to the cloud, and Microsoft’s solution.
How we implemented "Exactly Once" semantics in our database ... - javier ramirez
Distributed systems are hard. High-performance distributed systems, even more so. Network latencies, unacknowledged messages, server restarts, hardware failures, software bugs, problematic releases, timeouts... there are plenty of reasons why it is very hard to know whether a message you sent was received and processed correctly at its destination. So, to be safe, you send the message again... and again... and cross your fingers that the system on the other side tolerates duplicates.
QuestDB is an open source database designed for high performance. We wanted to make sure we could offer "exactly once" guarantees by deduplicating messages at ingestion time. In this talk, I explain how we designed and implemented the DEDUP keyword in QuestDB, enabling deduplication as well as upserts on real-time data while adding only 8% processing overhead, even on streams with millions of inserts per second.
I will also explain our parallel, multithreaded write-ahead log (WAL) architecture. Of course, all of this comes with demos, so you can see how it works in practice.
Airline Satisfaction Project using Azure
This presentation is created as a foundation of understanding and comparing data science/machine learning solutions made in Python notebooks locally and on Azure cloud, as a part of Course DP-100 - Designing and Implementing a Data Science Solution on Azure.
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second and manage petabytes of data, extending relational database workloads in Aurora beyond the limits of a single Aurora writer instance without having to create custom application logic or manage multiple databases.
An LLM-powered contract compliance application that uses the advanced RAG method Self-RAG together with a Knowledge Graph for the first time.
It provides the highest accuracy for contract compliance recorded so far in the oil and gas industry.
Amazon DocumentDB (with MongoDB compatibility) is a fast, reliable, fully managed database service. With Amazon DocumentDB, you can easily set up, operate, and scale MongoDB-compatible databases in the cloud. In this hands-on session, you will run the same application code used with MongoDB and use the same drivers and tools with Amazon DocumentDB.
6. Are you using or going to use "Big Data" and/or "Hadoop"?
• No or limited access to detailed data; can only surface reports and cannot ask ad-hoc questions.
• Slow data loading performance cannot keep up with the need for data from transactional systems for intraday reporting.
• MOLAP cube processing and data refresh take too long.
• Slow query performance with need for constant tuning, especially with SAN storage.
• High cost of SAN storage chargeback.
7. Roadblocks to evolving to a modern data warehouse
Each solution and the issue with that solution:
• Keep legacy investment: limited scalability & ability to handle new data types
• Buy a new tier-one hardware appliance: high acquisition/migration costs & no Hadoop
• Acquire a big data solution (Hadoop): significant training & still siloed
• Acquire a business intelligence solution: complex with low adoption
8. Introducing the Microsoft Analytics Platform System
Your turnkey modern data warehouse appliance
• Relational and non-relational data in a single appliance
• Or, integrate relational data with non-relational data in an external Hadoop cluster on premises or data stored in the cloud (hot, warm, cold)
• Enterprise-ready Hadoop
• Integrated querying across Hadoop and APS using T-SQL (PolyBase)
• Direct integration with Microsoft BI tools such as Power BI
• Near real-time performance with In-Memory
• Scale out to accommodate your growing data or to increase performance (2 nodes to 56 nodes)
• Remove SMP DW bottlenecks with MPP SQL Server
• No rip and replace when more performance is needed
• No performance tuning required
• Concurrency that fuels rapid adoption
• Industry's lowest DW price/TB
• Value through a single-appliance solution
• Value with flexible hardware options using commodity hardware
• Free up space on SAN (cost averages 10k per TB)
10. Hardware and software engineered together
The ease of an appliance:
• Co-engineered with HP, Dell, and Quanta best practices
• Leading performance with commodity hardware
• Pre-configured, built, and tuned software and hardware
• Integrated support plan with a single Microsoft contact
Components: PDW, HDInsight, PolyBase
11. APS History
• DATAllegro started in 2003
• Microsoft acquired DATAllegro in September 2008
• PDW released in December 2010 (version 1)
• Version 2 made available in March 2013 (PolyBase introduced)
• AU1 released in April 2014; renamed from Parallel Data Warehouse (PDW) to Analytics Platform System (APS). It still includes the PDW region as well as a new HDInsight/Hadoop region
• AU2 released in July 2014
• AU3 released in October 2014
There will be AU updates every 3-4 months.
NOTE: This is a data warehouse solution and not an OLTP (online transaction processing) solution.
Case studies: go to https://customers.microsoft.com and enter "parallel data warehouse" (old name) in the keyword box and search the results, then enter "analytics platform system" (new name).
12. Parallelism
MPP - Massively Parallel Processing:
• Uses many separate CPUs running in parallel to execute a single program
• Shared nothing: each CPU has its own memory and disk (scale-out)
• Segments communicate using a high-speed network between nodes
SMP - Symmetric Multiprocessing:
• Multiple CPUs used to complete individual processes simultaneously
• All CPUs share the same memory, disks, and network controllers (scale-up)
• All SQL Server implementations up until now have been SMP
• Mostly, the solution is housed on a shared SAN
13. APS Logical Architecture (overview)
[Diagram: a "Control" node (SQL + DMS) in front of multiple "Compute" nodes, each running SQL with DMS and balanced storage]
Compute node - the "worker bee" of APS:
• Runs SQL Server 2014 APS
• Contains a "slice" of each database
• CPU is saturated by storage
Control node - the "brains" of the APS:
• Also runs SQL Server 2014 APS
• Holds a "shell" copy of each database (metadata, statistics, etc.)
• The "public face" of the appliance
Data Movement Services (DMS):
• Part of the "secret sauce" of APS
• Moves data around as needed
• Enables parallel operations among the compute nodes (queries, loads, etc.)
14. APS Logical Architecture (overview)
[Diagram: the control node fans a query out to the compute nodes, each with SQL, DMS, and balanced storage]
1) User connects to the appliance (control node) and submits a query
2) The control node query processor determines the best *parallel* query plan
3) DMS distributes sub-queries to each compute node
4) Each compute node executes its sub-query on its subset of data
5) Each compute node returns a subset of the response to the control node
6) If necessary, the control node does any final aggregation/computation
7) The control node returns results to the user
Queries run in parallel on subsets of the data, using separate pipes, effectively making the pipe larger.
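To make the flow concrete, here is a minimal sketch of step 1 as the user experiences it: one ordinary T-SQL query submitted to the control node. The table and column names are adapted (spaces removed) from the star schema on the next slide; everything in steps 2-7 happens transparently.

-- Submitted to the control node like any SQL Server query (step 1).
-- The engine builds a parallel plan (step 2), DMS fans sub-queries out
-- to the compute nodes (step 3), each node scans its slice of the fact
-- table (step 4), and the partial results are merged on the control
-- node before being returned (steps 5-7).
SELECT st.StoreName,
       SUM(f.DollarsSold) AS TotalDollarsSold
FROM   SalesFact AS f
JOIN   StoreDim  AS st ON f.StoreDimID = st.StoreDimID
GROUP BY st.StoreName;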
15. APS Data Layout Options
[Diagram: a star schema with Time Dim (Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day), Store Dim (Store Dim ID, Store Name, Store Mgr, Store Size), Product Dim (Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc), and Customer Dim (Cust Dim ID, Cust Name, Cust Addr, Cust Phone, Cust Email) surrounding a Sales Fact table (Date Dim ID, Store Dim ID, Prod Dim ID, Cust Dim ID, Qty Sold, Dollars Sold); every compute node holds a full copy of each dimension and a slice of the fact table]
• Replicated: table copied to each compute node
• Distributed: table spread across compute nodes based on a "hash"
• Star schema: replicate the dimensions, distribute the fact table
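A hedged sketch of the two layouts in APS/PDW-style DDL, using simplified names from the star schema above. DISTRIBUTION = REPLICATE and DISTRIBUTION = HASH are the PDW table options; the choice of hash column here is illustrative, not a recommendation from the deck.

-- Small dimension: a full copy lives on every compute node, so joins
-- against it never require data movement.
CREATE TABLE StoreDim
( StoreDimID  int NOT NULL
, StoreName   varchar(50)
, StoreMgr    varchar(50)
, StoreSize   int
)
WITH (DISTRIBUTION = REPLICATE);

-- Large fact table: rows are spread across the compute nodes by a hash
-- of the chosen column.
CREATE TABLE SalesFact
( DateDimID   int NOT NULL
, StoreDimID  int NOT NULL
, ProdDimID   int NOT NULL
, CustDimID   int NOT NULL
, QtySold     int
, DollarsSold money
)
WITH (DISTRIBUTION = HASH(StoreDimID));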
17. APS - Balanced across servers and within
• Largest table: 600,000,000,000 rows
• Randomly distributed across 40 compute nodes (5 racks): 15,000,000,000 rows per node
• In each server, randomly distributed across 8 tables (so 320 total tables): 1,875,000,000 rows per table
• Each table holds 2 years of data partitioned by week, roughly 104 partitions (benefiting queries by date): 18,028,846 rows per partition
As an end user or DBA you think about 1 table: LineItem.
“Select * from LineItem” is split into 320 queries running in parallel against 320 (1.875b row) tables.
“Select * from LineItem where OrderDate = ‘1/1/2014’” is 320 queries against 320 (18m row) tables.
You don’t care or need to know that there are actually 320 tables representing your 1 logical table.
CCI (clustered columnstore indexes) can add further performance via segment elimination.
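A hedged sketch of the DDL behind such a layout: a hash-distributed clustered columnstore table partitioned by date, so a WHERE clause on OrderDate touches only the matching weekly partitions and CCI segment elimination can skip the rest. The columns and boundary values are illustrative, not from the deck.

-- Illustrative only: a weekly-partitioned, hash-distributed CCI table.
CREATE TABLE LineItem
( OrderDate   date   NOT NULL
, OrderID     bigint NOT NULL
, ExtendedAmt money
)
WITH ( DISTRIBUTION = HASH(OrderID)
     , CLUSTERED COLUMNSTORE INDEX
     , PARTITION ( OrderDate RANGE RIGHT FOR VALUES
         ('2013-01-07', '2013-01-14' /* ...one boundary per week... */ ) )
     );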
22. What is Hadoop?
• Distributed, scalable system on commodity hardware
• Composed of a few parts:
  - HDFS: distributed file system
  - MapReduce: programming model
  - Other tools: Hive, Pig, Sqoop, HCatalog, HBase, Flume, Mahout, YARN, Tez, Spark, Stinger, Oozie, ZooKeeper, Storm
• Main players are Hortonworks, Cloudera, and MapR
• WARNING: Hadoop, while ideal for processing huge volumes of data, is inadequate for analyzing that data in real time (companies do batch analytics instead)
[Diagram: Hadoop core services split into operational services (Ambari, Oozie, Falcon) and data services (HDFS, WebHDFS, NFS, Sqoop, and Flume for load & extract; YARN, MapReduce, Hive & HCatalog, Pig, HBase); Hadoop clusters provide scale-out storage and distributed data processing on commodity hardware]
23. Complex query and analysis with big data today: steep learning curve, slow and inefficient
• Either move HDFS data into the warehouse before analysis (ETL), so existing T-SQL skills still apply
• Or learn new skills and build, integrate, manage, maintain, and support a separate Hadoop ecosystem for the “new” data sources
24. APS delivers enterprise-ready Hadoop with HDInsight
Manageable, secured, and highly available Hadoop integrated into the appliance:
• 100% Apache Hadoop
• High performance, tuned within the appliance
• End-user authentication with Active Directory
• Accessible insights for everyone with Microsoft BI tools
• Managed and monitored using System Center
• PolyBase connects SQL Server Parallel Data Warehouse and Microsoft HDInsight, so you can leverage your existing T-SQL skills
Additional features over a separate Hadoop cluster, plus still one support contact!
25. APS appliance overview
[Diagram: a Parallel Data Warehouse region and an HDInsight region sitting on the shared fabric and hardware of the appliance]
A region is a logical container within an appliance. Each workload region has the following boundaries:
• Security
• Metering
• Servicing
26. Query Hadoop data with T-SQL using PolyBase
Bringing the worlds of big data and the data warehouse together for users and IT:
• Provides a single T-SQL query model (“semantic layer”) for APS and Hadoop, with the rich features of T-SQL, including joins without ETL
• Uses the power of MPP to enhance query execution performance
• Supports Windows Azure HDInsight to enable new hybrid cloud scenarios
• Provides the ability to query non-Microsoft Hadoop distributions, such as Hortonworks and Cloudera
• Uses your existing SQL skill set, with no IT intervention
[Diagram: SQL Server Parallel Data Warehouse and Microsoft HDInsight (HDP 2.0) connected through PolyBase to Cloudera CDH 5.1 (Linux), Hortonworks HDP 2.2 (Windows, Linux), and Windows Azure HDInsight (HDP 2.2, WASB); others (SQL Server, DB2, Oracle)? A true federated query engine]
A hedged sketch of the PolyBase DDL follows.
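The sketch below shows the PolyBase objects involved, using DDL of the form found in later APS releases and SQL Server PolyBase; the Hadoop address, paths, and table names are all illustrative assumptions:

-- Point the appliance at a Hadoop cluster.
CREATE EXTERNAL DATA SOURCE HadoopCluster
WITH (TYPE = HADOOP, LOCATION = 'hdfs://10.0.0.1:8020');

-- Describe the file layout of the data in HDFS.
CREATE EXTERNAL FILE FORMAT PipeDelimited
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = '|'));

-- Expose an HDFS directory as a queryable external table.
CREATE EXTERNAL TABLE dbo.WebClicks
( ClickDate date, CustDimID int, Url varchar(500) )
WITH (LOCATION = '/logs/clicks/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = PipeDelimited);

-- Join Hadoop data to warehouse data in plain T-SQL, no ETL.
SELECT c.CustName, COUNT(*) AS Clicks
FROM dbo.WebClicks AS w
JOIN dbo.CustomerDim AS c ON w.CustDimID = c.CustDimID
GROUP BY c.CustName;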
27. Use cases where PolyBase simplifies using Hadoop data
• Bringing islands of Hadoop data together
• High-performance queries against Hadoop data (predicate pushdown)
• Archiving data warehouse data to Hadoop (move; Hadoop as cold storage)
• Exporting relational data to Hadoop (copy; Hadoop as backup/DR, analysis, cloud use)
• Importing Hadoop data into the data warehouse (copy; Hadoop as staging area, sandbox, data lake)
Sketches of the archive and import cases follow this list.
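Hedged sketches of the archive and import cases, reusing the hypothetical external data source and file format defined above:

-- Archive (move): write warehouse rows out to Hadoop with CETAS
-- (CREATE EXTERNAL TABLE AS SELECT), then drop them from the warehouse.
CREATE EXTERNAL TABLE dbo.SalesFact_2012
WITH (LOCATION = '/archive/sales/2012/',
      DATA_SOURCE = HadoopCluster,
      FILE_FORMAT = PipeDelimited)
AS SELECT * FROM dbo.SalesFact WHERE DateDimID < 20130101;

-- Import (copy): pull Hadoop data into a distributed warehouse table
-- with CTAS (CREATE TABLE AS SELECT).
CREATE TABLE dbo.WebClicksStaged
WITH (DISTRIBUTION = HASH(CustDimID))
AS SELECT * FROM dbo.WebClicks;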
28. Big data insights for anyone
Native Microsoft BI integration to create new insights with familiar tools:
• Power users: tools like Power BI minimize IT intervention for discovering data
• DBAs and power users: T-SQL to join relational and Hadoop data
• Data scientists: Hadoop tools like MapReduce, Hive, and Pig
• Everyone else: Microsoft BI tools, leveraging the high adoption of Excel, Power View, PowerPivot, and SSAS
30. Scale-out technologies in the Analytics Platform System
• Massively Parallel Processing (MPP) parallelizes queries (speed-driven, not just capacity-driven)
• Multiple nodes with dedicated CPU, memory, and storage (“shared nothing”)
• Incrementally add hardware for near-linear scale to multiple petabytes (no need to delete older data or stage it)
• Handles query complexity and concurrency at scale
• No “forklift” replacement of the prior warehouse to increase capacity; start small with a few-terabyte warehouse
• Mixed workload support: query while you load (250 GB/hour per node), with no need for a maintenance window
[Diagram: capacity grows from 0 TB toward 6 PB by adding PDW or HDInsight scale units]
31. Updatable clustered columnstore vs. table with customary indexing
• Stores data in columnar format for massive compression (up to 15x more compression)
• Loads data into or out of memory for next-generation performance (up to 100x faster queries)
• Updatable and clustered, for real-time trickle loading
• No secondary indexes required
[Diagram: columnstore index representation and parallel query execution]
A sketch of the DDL follows.
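A minimal sketch of both ways to get a clustered columnstore in APS / SQL Server 2014 (the table and index names are illustrative):

-- Declare the clustered columnstore when the table is created...
CREATE TABLE dbo.SalesFact_CCI
( DateDimID int NOT NULL, ProdDimID int NOT NULL, DollarsSold money )
WITH (DISTRIBUTION = HASH(ProdDimID), CLUSTERED COLUMNSTORE INDEX);

-- ...or convert an existing rowstore table in place.
CREATE CLUSTERED COLUMNSTORE INDEX cci_SalesFact ON dbo.SalesFact;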
32. Investment firm before/after results: HP SMP vs. APS
• 21x improvement loading data (7:30 minutes vs. 21 seconds)
• 62x improvement staging to landing (30 minutes vs. 29 seconds)
• 17x, 166x, and 169x query performance improvements (e.g., 1:05 hours vs. 23 seconds)
• 46x improvement creating a datamart (70 minutes vs. 1:31 minutes)
• 1.1 TB/hr loading rate and 8.8x compression on 2 billion rows (472 GB to 53 GB)
• Microsoft BI tools work unchanged
33. Example hub-and-spoke architecture
[Diagram: sources flowing into the Analytics Platform System hub, with SQL Server SMP spokes feeding BI tools]
• Sources: ERP, CRM, and LOB apps arrive via ETL/ELT with SSIS, DQS, and MDS, or via ETL/ELT with DWLoader; Hadoop/big data reaches PDW through PolyBase and HDInsight
• The APS hub (PDW) serves ad hoc queries: intra-day, near real-time, fast ad hoc via columnstore and PolyBase
• CRTAS (“link table”) copies data to SQL Server SMP spokes for reporting and cubes: real-time ROLAP/MOLAP, DirectQuery, SNAC
• The spokes add concurrency that fuels rapid adoption and great performance with mixed workloads
A sketch of CRTAS follows.
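CRTAS (CREATE REMOTE TABLE AS SELECT) materializes a query result from the appliance onto an SMP SQL Server spoke. A hedged sketch; the spoke address, credentials, and names are placeholders:

-- Push an aggregate out to a spoke SQL Server for reporting.
CREATE REMOTE TABLE ReportDB.dbo.DailySales
AT ('Data Source = spoke-sql01, 1433; User ID = loader; Password = <placeholder>;')
AS
SELECT DateDimID, SUM(DollarsSold) AS DollarsSold
FROM dbo.SalesFact
GROUP BY DateDimID;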
34. Example overall data flow and architecture
• Event and data producers: web logs; IoT, mobile devices, etc.; social data
• Ingest: Event Hubs, Azure Blob Storage
• Transform: Stream Analytics, HDInsight, Azure Data Factory
• DW / long-term storage: Analytics Platform System, Azure SQL DB
• Predictive analytics: Azure Machine Learning (fraud detection, etc.)
• Present & decide: Power BI, web dashboards, mobile devices
36. APS provides the industry’s lowest DW appliance price/TB
Reshaped hardware specs through software innovation:
• Significantly lower price per TB than the closest competitor
• Lower storage costs with Windows Server 2012 Storage Spaces
• Small cost gap between multiple clustered HP DL980s with SAN vs. an APS 1/4 rack
[Chart: TCO per TB (uncompressed), in thousands of dollars, for leading vendors (Sept 2014): Oracle, Pivotal, IBM, Teradata, and Microsoft, on a scale of $0 to $140K]
37. Virtualized architecture overview
[Diagram: a base unit of four hosts linked by InfiniBand and Ethernet to economical, direct-attached SAS disk storage; fabric VMs (CTL, MAD, AD, VMM) run alongside the Compute 1 and Compute 2 workload VMs]
Software details:
• The control VM runs the APS engine, the DMS Manager, and SQL Server 2012 Enterprise Edition (APS build; AU3: SQL Server 2014)
• All hosts run Windows Server 2012 Standard (AU3: 2012 R2) and Windows Azure Virtual Machines
• Fabric and workload run in Hyper-V virtual machines
• The fabric virtual machine, management server (MAD01), and control server (CTL) share one physical server
• An APS agent runs on all hosts and all virtual machines
• DWConfig and the Admin Console provide configuration and monitoring
• Windows Storage Spaces and Azure Storage blobs
• Does not require expertise in Hyper-V or Windows
38. APS High Availability
[Diagram: when a compute host or the control host fails, its virtual machines (FAB, AD, VMM, MAD, CTL, Compute 1, Compute 2) restart on a dedicated failover host; the InfiniBand and Ethernet networks are each duplicated]
• No single point of failure
• Redundant InfiniBand and Ethernet networks
• No need for SQL Server clustering
39. Less DBA Maintenance/Monitoring
• No index creation
• No deleting/archiving data to save space
• Management simplicity (System Center, Admin console, DMVs)
• No blocking
• No logs
• No query hints
• No wait states
• No IO tuning
• No query optimization/tuning
• No index reorgs/rebuilds
• No partitioning
• No managing filegroups
• No shrinking/expanding databases
• No managing physical servers
• No patching servers and software
RESULT: DBAs spend more of their time as architects, not babysitters!
40. Analytics Platform System: the no-compromise modern data warehouse solution
Microsoft’s turnkey modern data warehouse appliance. Summary of benefits:
• Improved query performance
• Faster data loading
• Improved concurrency
• Less DBA maintenance
• Limited training needed
• Use familiar BI tools
• Ease of appliance deployment
• Mixed workload support
• Improved data compression
• Scalability
• High availability
• PolyBase
• Integration with cloud-born data
• HDInsight/Hadoop integration
• Data warehouse consolidation
• Easy support model
(Bold items on the original slide mark benefits of APS over merely upgrading to SQL Server 2014, with no worry about future hardware roadblocks.)
42. Appliance update highlights
Enterprise-ready big data, cloud enabled:
• Improved PolyBase support: Cloudera 5.1 support, partial aggregate pushdowns
• Expanding big data capacity: grow an HDInsight region on an appliance with an existing region
Next-gen performance, engineered for optimal value:
• 1.5x data return rate for SELECT * queries
• Streaming large data sets to external apps (e.g., SSAS, SAS, R, etc.)
T-SQL compatibility (a scalar UDF sketch follows this list):
• Scalar UDFs (CREATE FUNCTION)
• SQL Server SMP to APS (SQL Server MPP) migration utility
• Bulk load / BCP through SQL Server command-line tools
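A minimal sketch of a scalar UDF of the kind this update enables; the function name and logic are illustrative:

-- Scalar UDFs arrive with CREATE FUNCTION support in this update.
CREATE FUNCTION dbo.ufn_Margin (@Revenue money, @Cost money)
RETURNS money
AS
BEGIN
    RETURN CASE WHEN @Revenue = 0 THEN 0
                ELSE (@Revenue - @Cost) / @Revenue END;
END;

-- Then usable inline in distributed queries, e.g.:
-- SELECT dbo.ufn_Margin(DollarsSold, DollarsCost) FROM dbo.SalesFact;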
Appliance hardware:
• OEM hardware refresh (HP Gen9): HP ProLiant DL360 Gen9 server with 2x Intel Haswell processors and 256 GB (16x16 GB) 2133 MHz memory
• HP 5900 series switches (HA improvements)
Symmetry between DW on-premises and Azure
Editor's Notes
Key goal of slide: To convey what every IT person knows: the data warehouse and what it’s for. Then we set up the Gartner quote to say that there is a tipping point. End the slide with a question: Why is it at a tipping point?
Slide talk track:
What is the “traditional” data warehouse?
IT professionals know this well. A data warehouse, or an enterprise data warehouse, is a database designed specifically for data analysis. It is the single source of truth, the central repository for all data in the company. This means disparate data coming from your transactional systems and your ERP, CRM, or line-of-business applications is extracted, transformed, cleansed, and put into the warehouse. It was built so that people accessing the warehouse with BI tools see data that has been provisioned by IT and represents accurate, company-sanctioned data.
However, the traditional data warehouse is reaching an inflection point. Gartner, in its analysis of the state of data warehousing, noted that it is reaching the most significant tipping point since its inception. The question is why? What is going on?
Key goal of slide: To convey that the traditional data warehouse is going to break in one of four different ways. These ways should also not be a surprise to the IT professionals. At the end of the slide, IT should be asking, what can I do to prevent my warehouse from breaking?
Slide talk track:
There are many reasons why data warehouses are at a tipping point where something needs to change.
The first trend that will break my traditional data warehouse is data growth. Data volumes are expected to grow 10X over the next five years and traditional data warehouses cannot keep up with this explosion of data.
In addition to growing data, end users expect to get query results back in near real time. End users are no longer apt to wait minutes to hours for their results, which is something traditional data warehouses cannot deliver. They also want real-time data, not dated data pulled in during a maintenance window each night.
The third trend is new types of data captured that are “non-relational.” 85% of data growth is coming from “non-relational” data in the form of things like web logs, sensor data, social sentiment and devices. You’ve probably heard the term “Big Data” and “Hadoop” quite a bit. This is where these technologies come into play. More on that later….
The final trend is cloud-born data. This is data that might be coming from infrastructure that IT is starting to host in the cloud (e.g., CRM, ERP) or that is not stored by any corporate-owned system at all. How do you incorporate both on-premises and cloud data into your data warehouse? This is the last trend that is breaking the traditional data warehouse.
Key goal of slide: To convey that the modern data warehouse is something that the traditional data warehouse must evolve to. To have IT agree that their warehouses need to take advantage of these new technologies (specifically focusing on the middle and bottom layer).
Slide talk track:
To encompass these four trends, we need to evolve our traditional data warehouse to ensure that it does not break. It needs to become the “modern data warehouse.” What is the “modern data warehouse?” This is the new warehouse that is able to excel with these new trends and can be your warehouse now and into the future.
The modern data warehouse has the ability to:
Handle all types of data. Whether it be your structured, relational data sources or your non-relational data sources, the Modern data warehouse will incorporate Hadoop. It can handle real-time data by using complex event processor technologies.
Provide a way to enrich your data with Extract, Transform, Load (ETL) capabilities as well as Master Data Management (MDM) and data quality
Provide a way for any BI tool or query mechanism to interface with all these different types of data with a single query model that leverages a single query language that users already know (example: SQL).
Questions drive BI, Analytics drive questions
Top: solution choice, Bottom: problem if do
Key goal of slide: To convey the limitations of current modern data warehouse options in the market.
Slide talk track:
Organizations now face the challenge of turning to two platforms for managing their data: relational database management systems (RDBMS) for traditional data, and Apache Hadoop, the most widely used open-source Big Data platform, for large, non-relational data.
Many brand-new tier-one appliances are expensive. Major vendors offer tier-one RDBMS appliances. However, many of these come with a high price tag, averaging millions of dollars, and in-company politics may result in long struggles to approve and implement them. Further, most of these appliances focus on point solutions rather than general-purpose ones and do not include a Hadoop solution, requiring a separate, additional appliance and ecosystem.
Hadoop solutions are complex. Vendors can provide a Hadoop solution as their own distribution of Hadoop or as an appliance that comes pre-installed with Hadoop. The problem is that the Hadoop ecosystem requires a significant training investment, and a major effort is needed to integrate it. There is a steep learning curve and ongoing operational cost when your IT department needs to re-orient itself around HDFS, MapReduce, Hive, and HBase rather than T-SQL and a standard RDBMS design. The result is often increased cost at a time when IT is expected to streamline.
BI tools are unfamiliar. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%) due to complaints of the complexity of the tools and the cost of the solution. Users want tools they already know and can consume, but no vendor can deliver on all the solutions you need at a reasonable cost or in a natively-integrated manner.
Troubleshooting, support, and maintenance. Keeping up with configuration changes, support, maintenance, and troubleshooting is not trivial.
Today’s world of data is changing rapidly, and organizations need a modern data warehouse to adapt successfully to these changes. However, companies want the smoothest path to this transformation- a path where costs, downtime, and training are minimal, and where performance and accessibility to data insights are vastly improved.
Key goal of slide: To convey that the major pillars of the Analytics Platform System with key points.
To help organizations make a simple, smooth, and seamless transition to this new world of data, Microsoft introduces the Microsoft Analytics Platform System (APS): the only no-compromise modern data warehouse solution that brings both Hadoop and the RDBMS into a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry.
Enterprise-ready Big Data: Microsoft APS combines Microsoft’s industry-leading RDBMS platform, the Parallel Data Warehouse appliance (PDW), with Microsoft’s Hadoop distribution, HDInsight, for non-relational data, to offer an all-in-one Big Data analytics appliance.
Tying together and integrating the worlds of relational and Hadoop data is PolyBase, Microsoft’s integrated query tool available only in APS.
Your Modern Data Warehouse in One Turnkey Appliance
APS integrates PDW and HDInsight to operate seamlessly together in a single appliance
Integrated Querying across All Data Types Using T-SQL
PolyBase allows Hadoop data to be queried using rich, full-featured T-SQL, while taking advantage of Hadoop processing, without additional Hadoop-based skills or training.
Enterprise-Ready Hadoop
HDInsight is Microsoft’s Hadoop-based distribution with end-user authentication via Active Directory and managed by IT using System Center
Big Data Insights to Any User
Native Microsoft BI integration within PolyBase allows everyone access to insights through familiar tools such as SSAS and Excel
Next-generation performance at scale: APS was built to scale into multi-petabytes, handling both the RDBMS and the data stored in Hadoop, to deliver the performance that meets today’s near real-time and rapid insights requirements.
Scale-Out to accommodate your Growing Data
APS contains PDW and HDInsight that both have linear scale-out architecture. Start small with a few terabytes and dynamically add capacity for seamless, linear scale-out
Remove DW bottlenecks with MPP SQL Server
Get the dynamic performance and scale that your modern data warehouse requires while retaining your skills and investment in SQL Server.
Real-Time Performance with In-Memory
Provides up to 100x improvement in query performance and 15x compression via updateable in-memory columnstore
Concurrency that Supports High Adoption
Scales with simultaneous user access. APS has high concurrency, allowing for multiple workloads.
Optimal architecture: More than just a converged system, APS has reshaped the very hardware specifications required through software innovations to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
APS Provides the Industry’s Lowest DW Price/TB
Lower cost while maintaining performance using WS2012 Storage Spaces that replace SAN with economical Windows Storage Spaces
Save up to 70% of APS storage with up to 15x compression via updateable in-memory columnstore
Value through Single Appliance Solution
Reduce hardware footprint by having PDW and HDInsight within a single appliance
Remove the need for costly integration efforts
Value through Flexible Hardware Options
Avoid hardware lock-in through flexible hardware options from HP, Dell, and Quanta
The Analytics Platform System is a pre-built appliance that ships to your door. As an appliance, all of the hardware has been pre-built: servers, storage arrays, switches, power, racks, and more. Also, all the software has been installed, configured, and tuned.
Customers are delivered a fully packaged appliance solution that just works. All they have to do is plug the appliance in and start integrating their specific data into the solution.
KEY POINT
Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
TALK TRACK
We have a flexible choice of hardware vendors – there’s no lock-in to hardware that may not fit your exact needs and may also require unnecessarily expensive hardware due to lack of choice.
Operating a Big Data analytics platform can be as simple as this. Avoid the proprietary hardware lock-ins others try to sell you, and rely on basic industry-standard components instead. The Microsoft Analytics Platform System gives customers the flexibility to choose their preferred hardware from Dell, HP, or Quanta, and each hardware choice has been designed, engineered, and tuned to perform optimally.
8 tables (8 filegroups, since there is 1 filegroup per table). Each filegroup is made up of 2 physical files. Each scale unit has two compute nodes, so 16 filegroups and therefore 32 files. Each unit has 32 CPU cores, so there is 1 core for each file.
You want high cardinality for the distribution key.
PDW distributes a single large logical table across 8 physical tables on each server.
The distribution is performed by selecting a column in each table and applying a hash function to it.
Partitioning 2 years by day instead = 2,568,493 rows per partition.
40 servers * 8 tables = 320 tables.
This horizontal partitioning breaks the table up into 8 distributions per compute node. Each of these distributions (essentially a table in and of itself) has dedicated CPU and disk, which is the essence of Massively Parallel Processing in APS. There are 8 internal disks per compute node.
1 TB drives: 15 TB uncompressed per unit (2 nodes), 60 TB uncompressed per rack (4 units, 8 nodes), 420 TB uncompressed for 7 racks (28 units, 56 nodes).
3 TB drives: 45 TB uncompressed per unit (2 nodes), 180 TB uncompressed per rack (4 units, 8 nodes), 1,260 TB uncompressed for 7 racks (28 units, 56 nodes). [See slide 125.]
tempdb, log, and the overhead of formatting the drives, Storage Spaces, etc. have already been subtracted (about 47%) from the 70 1 TB drives (4 hot spares, 2 for fabric storage, 32 for RAID 1, so 32 drives with unique data, giving 32 TB per scale unit). That leaves 15 TB of usable space on a 1/4 rack; apply a 5:1 compression ratio and you get 75 TB.
HP ProLiant DL360p Gen8 server, 256 GB RAM, 1U. Each server has 2 processors (E5-2690 “Sandy Bridge”, 2.90 GHz, 20 MB cache) with 8 cores each, so 16 cores per server. Sixteen (16) HP 16 GB (2R x 4) PC3-12800R (DDR3-1333) memory modules. Two (2) internal HP 600 GB 6G SAS 10K 2.5in drives. Paired with 1 HP D6000 high-density storage enclosure (70 HDDs (7.2K) of either 1, 2, or 3 TB capacity) connected to each server through an H221 SAS HBA, 5U, 6 Gb/s.
I usually use the word “conservative” when I’m talking about a 5:1 ratio. I also generally mention that most others in the industry use the same number.
Key goal of slide: Communicate what Big Data is.
Slide talk track:
ERP, SCM, CRM, and transactional web applications are classic examples of systems processing transactions. Highly structured data in these systems is typically stored in SQL databases. Web 2.0 is about how people and things interact with each other or with your business. Web logs, user clickstreams, social interactions and feeds, and user-generated content are classic places to find interaction data. Big Data is the explosion of data volume and types, inside and outside the business, too large for traditional systems to manage. There are multiple types of data, including personal, organizational, public, and private. More important, Big Data is changing how the business uses data, from historical analysis to predictive analytics. Enterprises are using data in more progressive and higher-value applications. These uses and applications are changing how data must be stored, managed, analyzed, and accessed in order to provide not just the historical and insight analysis of the current data warehouse, but the predictive analytics and forecasting needed to stay competitive in the current marketplace.
Key goal of slide: Communicate what Hadoop is.
Slide talk track:
Everyone has heard of Hadoop. But what is it? And do I need it? Apache Hadoop is an open-source solution framework that supports data-intensive distributed applications on large clusters of commodity hardware. Hadoop is composed of a few parts:
HDFS: the Hadoop Distributed File System is Hadoop’s file system, which stores large files (from gigabytes to terabytes) across multiple machines.
MapReduce: a programming model that performs filtering, sorting, and other data retrieval commands across a parallel, distributed algorithm.
Other parts of Hadoop include HBase, R, Pig, Hive, Flume, Mahout, Avro, and ZooKeeper, all parts of the Hadoop ecosystem that perform other, supplementary functions.
Key goal of slide: Communicate conceptually how companies are managing Big Data in current data warehouse environments. This shows both setting up a side-by-side Hadoop cluster and ETL-ing data into an existing data warehouse.
Slide talk track:
Many companies have responded to the explosion of Big Data by setting up side-by-side Hadoop ecosystems. However, these companies are learning the limitations of this approach, including the steep learning curve of MapReduce and other Hadoop ecosystem tools, and the cost of installing, maintaining, and tooling side-by-side ecosystems to support two separate query models. Many Hadoop solutions do not integrate into enterprise or other data warehouse systems, creating complexity and cost and slowing time to insights. Some Hadoop solutions feature vendor lock-in, creating long-term obligations. Other companies set up costly extract, transform, and load (ETL) operations to move non-relational data directly into the data warehouse. This requires IT to modify or create new data schemas for all new data, which is also time consuming and costly. As a result, performance is degraded, and it is often more expensive to integrate new data, build new applications, or access key BI insights.
Key goal of slide: Communicate what HDInsight is.
Slide talk track:
HDInsight is an enterprise-ready, Hadoop-based distribution from Microsoft that brings a 100% Apache Hadoop solution to the data warehouse. APS gives customers Hadoop with the simplicity of a single appliance, and Microsoft integrates Hadoop data processing directly into the architecture of the appliance for optimum performance. Each HDInsight node has “shared nothing” access to CPU, memory, and storage. HDInsight for APS is the most enterprise-ready Hadoop distribution in the market, offering enterprise-class security, scalability, and manageability. Thanks to a dedicated secure node, HDInsight helps you secure your Hadoop cluster. HDInsight also simplifies management through System Center, and organizations can give multiple users simultaneous access to HDInsight within the appliance via Active Directory.
This diagram illustrates the basic layout of the direct-to-fabric Hadoop region alongside a data warehouse region, designed for the APS appliance and Windows Azure. Each region provides a boundary for workload, security, metering, and servicing.
HDInsight is a Hadoop region that sits over the fabric of the appliance alongside the PDW region for processing. Both regions take advantage of PolyBase as a shared query and processing model, which results in exceptional performance improvements across every node. Based on the Hortonworks 1.0 HDFS, the new HDI (HDInsight) region within APS is a dedicated Hadoop region that sits directly on top of the fabric layer of the appliance to share metered resources with the APS engine and process Hadoop cluster data. In some respects this transforms APS into a concurrent relational and Hadoop engine, resulting in much better performance. An appliance can be configured to support relational queries only (excluding the HDI region), to provide a Hadoop-only node, or to support both relational and Hadoop workloads from a single appliance. In addition, HDInsight enables the processing of Hadoop data in place, without the need for expensive ETL (extract, transform, and load). By taking advantage of Azure Storage Vault blobs, HDInsight can even extend the storage of the traditional data warehouse into the cloud.
Technically, adding one or more scale units of HDI to an all-APS rack is “add region,” which is supported. Adding one or more scale units of HDI to a rack that already contains HDI is “add capacity/unit” and is not supported for AU1.
Key goal of slide: PolyBase is available only within the Microsoft Analytics Platform System.
Slide talk track:
PolyBase simplifies this by allowing Hadoop data to be queried with the standard Transact-SQL (T-SQL) query language, without the need to learn MapReduce and without the need to move the data into the data warehouse. PolyBase unifies relational and non-relational data at the query level.
Integrated query: PolyBase accepts a standard T-SQL query that joins tables containing a relational source with tables in a Hadoop cluster referencing a non-relational source, then seamlessly returns the results to the user. PolyBase can query Hadoop data in other Hadoop distributions such as Hortonworks or Cloudera.
No difficult learning curve: standard T-SQL can be used to query Hadoop data. Users are not required to learn MapReduce to execute the query.
Cloud-hybrid scenario options: PolyBase can also query across Windows Azure HDInsight, providing a hybrid cloud solution for the data warehouse.
The ability to query all of your company’s data, independent of where it resides and what format it is stored in, and in a performant way, is crucial in today’s data-centric world of massive, increasing data volume. Today, with AU1, one can query various Hadoop distributions plus data stored in Azure. For example, with one single T-SQL statement a user can query over data stored in multiple HDP 2.0 clusters, combine it with data in PDW, and combine it with data stored in Azure. No one in the industry (as far as I’m aware) can do this in such a simple fashion. Bringing all Microsoft assets together, on-premises and specifically through our Azure play, including various services that will be brought online in the future, we can clearly differentiate through our unique and complete end-to-end data management story. No doubt there are several pieces missing in our ‘Poly’ vision, including supporting other data stores, enabling push-down computation for our cloud story, more user-definable options language-wise, better automation/policies, and many more ideas we’d like to go after in the weeks and months ahead.
HDInsight benefits: cheap, quick to procure.
Key goal of slide: Highlight the four main use cases for PolyBase.
Slide talk track:
There are four key scenarios for using PolyBase with the data lake of data normally locked up in Hadoop.
PolyBase leverages the APS MPP architecture, along with optimizations like push-down computation, to query data using Transact-SQL faster than other Hadoop technologies like Hive. More importantly, you can use the Transact-SQL join syntax between Hadoop data and PDW data without having to import the data into PDW first.
PolyBase is a great tool for archiving older or unused data in APS to less expensive storage on a Hadoop cluster. When you do need to access the data for historical purposes, you can easily join it back up with your PDW data using Transact-SQL.
There are times when you need to share your PDW data with Hadoop users, and PolyBase makes it easy to copy data to a Hadoop cluster.
Using a simple SELECT INTO statement, PolyBase makes it easy to import valuable Hadoop data into PDW without having to use external ETL processes.
Big Data adds value to the business when it is accessible to BI users with tools that are easy to use and consume for IT and business users alike. While some Hadoop solutions provide BI tools, or require customers to find third-party BI solutions, these often result in a low adoption rate due to learning curves. Surveys from Gartner, The BI Survey, and Intelligent Enterprise have found abysmal BI adoption of current solutions (~8%) due to complaints about the complexity of the tools and the cost of the solution. The BI solution must be provided to users in tools they already know and can consume.
APS is the only data warehouse and Hadoop solution that has native end-to-end Microsoft BI integration with PolyBase, allowing users to create new insights themselves using tools they already know; every Microsoft BI client (SSAS, SSRS, PowerPivot, and Power View) has native integration with APS and ubiquitous connectivity across the entire SQL Server ecosystem. With native BI integration, Microsoft is unique in offering an end-to-end Big Data solution where there are no barriers in the journey from acquiring raw data of all types to displaying high-value insights to all users. By providing the customer with an HDInsight region in APS, with PolyBase for querying and joining any type of data in T-SQL, and by democratizing access to data insight through familiar BI tools, Microsoft is prepared to provide Big Data insights to any user.
Today, if you are not using an MPP scale-out appliance, your data warehouse is most likely built on the traditional scale-up SMP architecture and organized as row stores. A scale-up solution runs queries sequentially on a shared-everything architecture. This essentially means that everything is processed on a single box that shares memory, disk, I/O operations, and more. To get more scale in a scale-up solution, you need to acquire a more powerful hardware box every time; you cannot simply add more hardware to the existing rack. A scale-up solution also has diminishing returns after a certain scale.
A rowstore stores data in traditional tables as rows. The values comprising one row are stored contiguously on a page. Rowstores are sometimes not optimal for many queries issued to the data warehouse, because a query returns the entire row of data, including fields that might not be needed as part of the query.
The combination of scale-up SMP and rowstores is a common limitation of existing warehouses that affects performance.
Key goal of slide: Communicate that the Microsoft modern data warehouse can scale out to petabytes of relational data.
Slide talk track:
SQL Server 2012 APS is a scale-out, Massively Parallel Processing (MPP) architecture that represents the most powerful distributed computing and scale. This type of technology powers supercomputers to achieve raw computing horsepower. As more scale is needed, more resources can be added to scale out to the largest data warehousing projects. APS uses a shared-nothing architecture where there are multiple physical nodes, each running its own instance of SQL Server with dedicated CPU, memory, and storage. As queries go through the system, they are broken up to run simultaneously over each physical node. The benefit is the highest performance at scale through parallel execution. You need only add new resources to continually scale out this implementation.
This means that if you also have high concurrency and complex queries at scale, APS can handle these queries with ease. It also means that APS can be optimized for “mixed workload” and “near real-time” data analysis. Enjoy faster data loading at more than two terabytes per hour.
Other benefits of scale-out technologies:
• Start small and scale out to petabytes of data
• Optimized for “mixed workload” and “near real-time” data analysis
• Support for high concurrency
• Query while you load
• No hardware bottlenecks
• No “forklifting” when you want to scale your system
• Scale not only for data size but for faster queries
Key goal of slide: Use an interesting story to show how the new modern data warehouse can handle real-time performance with in-memory technologies.
TODO: for parallel query execution, explain the difference from SMP.
Slide talk track:
The biggest issue with traditional data warehouses is that data is stored in rows. The values comprising one row are stored contiguously on a page. Rowstores are not optimal for many queries issued to the data warehouse, because a query returns the entire row of data, including fields that might not be needed as part of the query.
By changing the primary storage engine to a new, updateable version of the in-memory columnstore, data is grouped and stored one column at a time. The benefits are as follows:
• Only the columns needed must be read. Therefore, less data is read from disk to memory and later moved from memory to processor cache.
• Columns are heavily compressed, which reduces the number of bytes that must be read and moved.
• Most queries do not touch all columns of the table, so many columns will never be brought into memory. This, combined with excellent compression, improves buffer pool usage, which reduces total I/O.
The result is massive compression (sometimes as much as 10x), as well as massive performance gains (as much as 100x). Use of columnstore also leverages your existing hardware instead of requiring you to purchase a new appliance.
New in SQL Server 2012 PDW/APS and SQL Server 2014: the updateable and clustered columnstore. Updates and direct bulk load are fully supported, which simplifies and accelerates data loading and enables real-time data warehousing and trickle loading. Using columnstore can also save roughly 70% of overall storage space if you choose to eliminate the rowstore copy of the data entirely.
Key goal of slide: Explain the limitations of the serial-processing SMP architecture compared to high-concurrency MPP.
• High-performance ad hoc analytic queries
• Pull insights simultaneously throughout the day
• Run multiple types of queries simultaneously
• Run multiple types of workloads together with no tuning required
• High concurrency means high availability, which means higher adoption
Slide talk track:
With the explosion of data and the growth of end users demanding real-time insights, data warehouses are growing not only in resources but also in the number of users frequently accessing the data warehouse. A modern data warehouse needs to be able both to scale out to return query results quickly and to run mixed workloads all at the same time.
Mixed workloads refer to concurrency. Under concurrency, multiple types of queries are submitted, along with data loads and ELT processing. Under mixed workload scenarios, which organizations are certain to face, APS runs concurrent queries with little or no tuning. Organizations no longer have to worry about the types of workloads being run at any given time, and Microsoft APS can handle many users pulling insights simultaneously throughout the day.
EMC Greenplum, Teradata, Oracle Exadata, HP Vertica, and IBM Netezza.
Key goal of slide: Use an interesting story to show how the new modern data warehouse delivers optimal value.
Slide talk track:
Value through software innovation / hardware commoditization: More than just a converged system, APS has reshaped the very hardware specifications required, through software innovations, to deliver optimal value. Through features delivered in Windows Server 2012, customers get exceptional value:
Through Storage Spaces, APS has the performance, reliability, and scale for storage built into the software, allowing it to replace the SAN with a more economical high-density disk option. This results in large capacity at low cost with no reduction in performance. Hyper-V virtualization and hardware design minimize the hardware footprint and cost of the appliance, enabling high availability as simply as possible. Microsoft lowers the cost by reducing the hardware footprint through virtualization, providing Storage Spaces to replace expensive SAN storage, and compressing up to 15x to lower storage usage. These features give APS the lowest relational data warehouse price/terabyte of any company by a significant margin (~2x lower than market). The overall market’s comparable price/terabyte ranges from $8-13K/TB. For example, Oracle announced Exadata in a 1/8th-rack form factor that costs $200K. However, this is only the hardware cost and does not include software prices, which can cost significantly more: hundreds of thousands to a million dollars. Even accounting for Oracle’s 10x compression, APS has a price/terabyte that is about half Oracle’s list price for their normal drive sizes (non-high-capacity).
IBM PureData pricing: < $500,000 for a quarter rack (8 TB uncompressed), http://www.theregister.co.uk/2012/10/10/ibm_puredata_database_appliances/, at 4x compression (= $12-15K/TB).
Oracle Exadata pricing: HW pricing ($1.1M), http://www.oracle.com/us/corporate/pricing/exadata-pricelist-070598.pdf; SW pricing ($7.2M), http://www.oracle.com/us/corporate/pricing/technology-price-list-070617.pdf, at 100 TB uncompressed and 10x compression (= $8K/TB).
EMC Greenplum pricing: $1,000,000 for a half rack (18 TB uncompressed), http://www.informationweek.com/software/information-management/emc-intros-backup-savvy-greenplum-applia/227701321, at 4x compression (= $13.8K/TB).
Pricing analysis was done on the last-known publicly accessible information available, and represents the current view of Microsoft Corporation as of the date of this presentation. Because companies respond to changing market conditions, it should not be interpreted as a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided outside of the sources cited or after the date of this presentation. Source: Value Prism Consulting, “Microsoft’s SQL Server Parallel Data Warehouse Provides High Performance and Great Value”; website: http://www.valueprism.com/resources/resources/ResourceDetails.aspx?ID=100
Windows Server 2012 and Windows Azure Virtual Machines offer full virtualization services for both on-premises and on-demand installations.
General details:
• All hosts run Windows Server 2012 Standard and Windows Azure Virtual Machines
• Fabric or workload runs in Hyper-V virtual machines
• The fabric virtual machine, MAD01, and CTL share one server, giving lower overhead costs, especially for small topologies
• The APS agent runs on all hosts and all virtual machines and collects appliance health data on fabric and workload
• DWConfig and the Admin Console continue to exist, with minor extensions to expose host-level information
• Windows Storage Spaces and Azure Storage blobs enable use of lower-cost DAS (JBODs)
APS workload details:
• SQL Server 2012 Enterprise Edition (APS build)
• Control node and compute nodes for the APS workload
Storage details:
• More files per filegroup
• Uses a larger number of spindles in parallel
Key goal of slide: APS was built to scale to handle the highest data requirements and the newest data types stored in Hadoop, and to deliver the performance that meets today’s near real-time requirements.
Slide talk track:
A modern data warehouse is progressive, meeting broad needs and requirements:
• Hadoop integrates and operates seamlessly with your relational data warehouses
• Data is easily queried by SQL users without additional skills or training
• Enterprise-ready, meaning it is secure and easily managed by IT
• Insights accessible to everyone
The Microsoft Analytics Platform System (APS) is the only no-compromise modern data warehouse solution that brings both Hadoop and the RDBMS into a single, pre-built appliance with tier-one performance, the lowest TCO in the industry, and accessibility for all users through some of the most widely used BI tools in the industry. Microsoft APS combines Microsoft’s industry-leading RDBMS platform with Microsoft’s Hadoop distribution, HDInsight, for non-relational data to offer an all-in-one Big Data analytics appliance. Tying together and integrating the worlds of relational and non-relational data is PolyBase, Microsoft’s integrated query tool available only in APS.
• Data capacity: variable from smallest (15 terabytes) to largest (6 petabytes) at 5:1 compression (1.2 petabytes uncompressed); from a 1/4 rack up to 7 racks
• Data loading speed: ideally 175 GB/hour per node (8 nodes would give over 1 TB/hour); 250 GB/hr has been seen; 10-20x faster
• Data compression: 3x-15x, but 5x is a conservative number; unique compression because of distribution across compute nodes
• Query performance: 10x-100x, with a reasonably linear increase with more racks