This document discusses using Azure HDInsight for big data applications. It provides an overview of HDInsight and describes how it can be used for various big data scenarios like modern data warehousing, advanced analytics, and IoT. It also discusses the architecture and components of HDInsight, how to create and manage HDInsight clusters, and how HDInsight integrates with other Azure services for big data and analytics workloads.
Designing Big Data Analytics Solutions on Azure (Mohamed Tawfik)
This document discusses designing big data analytics solutions on Azure. It provides an overview of Azure's data landscape and common architectural patterns and scenarios for building analytics solutions using various Azure data and analytics services. These include Azure SQL Data Warehouse, Azure Data Lake Store, Azure Data Factory, Azure Machine Learning, and Power BI for reporting and visualization. The document also discusses using these services to build solutions for scenarios like data warehousing, data lakes, ETL/ELT, machine learning, streaming analytics and more.
This document provides an overview of big data and how Azure HDInsight can be used to work with big data. It discusses the evolution of data from gigabytes to exabytes and the big data utility gap where most data is stored but not analyzed. It then discusses how to store everything, analyze anything, and build the right thing using big data. Examples are provided of companies generating large amounts of data. An overview of the Hadoop ecosystem is given along with examples of using Hive and Pig on HDInsight to query and analyze large datasets. A case study of Klout is also summarized.
The new Microsoft Azure SQL Data Warehouse (SQL DW) is an elastic data-warehouse-as-a-service: a Massively Parallel Processing (MPP) solution for "big data" with true enterprise-class features. The SQL DW service is built for data warehouse workloads from a few hundred gigabytes to petabytes of data, with features like disaggregated compute and storage that let customers size the service to match their needs. In this presentation, we take an in-depth look at implementing a SQL DW, elastic scale (grow, shrink, and pause), and hybrid data clouds with Hadoop integration via PolyBase, allowing for a true SQL experience across structured and unstructured data.
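The MPP idea above can be illustrated with a small sketch. The distribution count, key, and rows below are hypothetical, but the mechanism, hashing a distribution key so each compute node owns a slice of the table, is the one an MPP warehouse like SQL DW relies on (the real service uses 60 distributions):

```python
from zlib import crc32

# Hypothetical sketch: an MPP warehouse spreads rows across compute
# "distributions" by hashing a distribution key, so each node scans
# only its own slice of the data in parallel.
NUM_DISTRIBUTIONS = 4  # the real SQL DW service uses 60

def distribution_for(key: str) -> int:
    """Map a distribution-key value to a compute distribution."""
    return crc32(key.encode()) % NUM_DISTRIBUTIONS

rows = [("cust-1", 120.0), ("cust-2", 75.5), ("cust-3", 10.0), ("cust-1", 30.0)]
slices = {d: [] for d in range(NUM_DISTRIBUTIONS)}
for cust_id, amount in rows:
    slices[distribution_for(cust_id)].append((cust_id, amount))

# All rows for the same key land on the same distribution, so a
# GROUP BY on that key needs no data movement between nodes.
per_node_totals = {}
for d, part in slices.items():
    for cust_id, amount in part:
        per_node_totals[cust_id] = per_node_totals.get(cust_id, 0.0) + amount
```

Because rows sharing a key are co-located, aggregations on the distribution key run entirely node-local; that is the main lever when choosing a distribution column.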
Big Data Analytics in the Cloud with Microsoft Azure (Mark Kromer)
This session covered big data analytics in the cloud using Microsoft Azure services. Key points included:
1) Azure provides tools for collecting, processing, analyzing and visualizing big data including Azure Data Lake, HDInsight, Data Factory, Machine Learning, and Power BI. These services can be used to build solutions for common big data use cases and architectures.
2) U-SQL is a language for preparing, transforming and analyzing data that allows users to focus on the what rather than the how of problems. It uses SQL and C# and can operate on structured and unstructured data.
3) Visual Studio provides an integrated environment for authoring, debugging, and monitoring U-SQL scripts and jobs.
Building a Modern Data Platform with Microsoft Azure (Dmitry Anoshin)
This document provides an overview of building a modern cloud analytics solution using Microsoft Azure. It discusses the role of analytics, a history of cloud computing, and a data warehouse modernization project. Key challenges covered include lack of notifications, logging, self-service BI, and integrating streaming data. The document proposes solutions to these challenges using Azure services like Data Factory, Kafka, Databricks, and SQL Data Warehouse. It also discusses alternative implementations using tools like Matillion ETL and Snowflake.
First introduced with the Analytics Platform System (APS), PolyBase simplifies management and querying of both relational and non-relational data using T-SQL. It is now available in both Azure SQL Data Warehouse and SQL Server 2016. The major features of PolyBase include the ability to do ad-hoc queries on Hadoop data and the ability to import data from Hadoop and Azure blob storage to SQL Server for persistent storage. A major part of the presentation will be a demo on querying and creating data on HDFS (using Azure Blobs). Come see why PolyBase is the “glue” to creating federated data warehouse solutions where you can query data as it sits instead of having to move it all to one data platform.
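The external-table idea can be mimicked in miniature. The sketch below is not PolyBase; it uses Python's sqlite3 with a CSV string standing in for a blob/HDFS file, but it shows the same federated pattern: one SQL query spanning relational rows and file-based data:

```python
import csv
import io
import sqlite3

# Hedged analog of PolyBase's external tables: a CSV "file" (standing in
# for data sitting in Hadoop or Azure Blob Storage) is surfaced as a table
# and joined with native relational data in a single SQL statement.
external_csv = "cust_id,region\n1,US\n2,EU\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (cust_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# The "external table": rows parsed straight from the file format.
conn.execute("CREATE TABLE ext_customers (cust_id INTEGER, region TEXT)")
reader = csv.DictReader(io.StringIO(external_csv))
conn.executemany("INSERT INTO ext_customers VALUES (?, ?)",
                 [(int(r["cust_id"]), r["region"]) for r in reader])

# One query spans both the relational and the "external" data.
totals = conn.execute(
    "SELECT e.region, SUM(s.amount) FROM sales s "
    "JOIN ext_customers e ON s.cust_id = e.cust_id "
    "GROUP BY e.region ORDER BY e.region"
).fetchall()
```

In real PolyBase the file never moves into the engine up front; the point of the sketch is only the query shape, a join that treats file-resident data as just another table.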
Data Con LA 2020
Description
Data warehouses are not enough. Data lakes are the backbone of a modern data environment. Data Lakes are best built leveraging unique services of the cloud provider to reduce operations complexity. This session will explain why everyone's talking about data lakes, break down the best services in Azure to build a Data Lake, and walk through code for querying and loading with Azure Databricks and Event Hubs for Kafka. Attendees will leave the session with a firm grasp of why we build data lakes and how Azure Databricks fits in for ETL and querying.
Speaker
Dustin Vannoy, Dustin Vannoy Consulting, Principal Data Engineer
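The lake loading pattern behind this talk can be sketched without any cloud services. The directory names, event shape, and file names below are invented for illustration; the point is the year=/month=/day= partition layout that engines such as Databricks prune on:

```python
import json
import os
import tempfile
from datetime import date

# Hypothetical sketch: data lakes are commonly laid out as
# table/year=YYYY/month=MM/day=DD/ so a query engine can skip
# ("prune") every partition except the days it actually needs.
root = tempfile.mkdtemp()

def write_event(event: dict, event_date: date) -> str:
    """Append an event to its date partition, creating it on demand."""
    part = os.path.join(root, "events",
                        f"year={event_date.year}",
                        f"month={event_date.month:02d}",
                        f"day={event_date.day:02d}")
    os.makedirs(part, exist_ok=True)
    path = os.path.join(part, "part-0001.json")
    with open(path, "a") as f:
        f.write(json.dumps(event) + "\n")
    return path

def read_day(event_date: date) -> list:
    """Partition pruning: open only the one day's directory."""
    part = os.path.join(root, "events",
                        f"year={event_date.year}",
                        f"month={event_date.month:02d}",
                        f"day={event_date.day:02d}",
                        "part-0001.json")
    with open(part) as f:
        return [json.loads(line) for line in f]

write_event({"user": "a", "action": "click"}, date(2020, 3, 1))
write_event({"user": "b", "action": "view"}, date(2020, 3, 1))
write_event({"user": "a", "action": "view"}, date(2020, 3, 2))
march_first = read_day(date(2020, 3, 1))
```

A real lake would write columnar files (Parquet) rather than JSON lines, but the partition-by-date layout and the prune-at-read pattern are the same.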
These slides provide highlights of my book HDInsight Essentials. Book link is here: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
Embarking on building a modern data warehouse in the cloud can be an overwhelming experience due to the sheer number of products that can be used, especially when the use cases for many products overlap others. In this talk I will cover the use cases of many of the Microsoft products that you can use when building a modern data warehouse, broken down into four areas: ingest, store, prep, and model & serve. It’s a complicated story that I will try to simplify, giving blunt opinions of when to use what products and the pros/cons of each.
Databricks is a Software-as-a-Service-like experience (or Spark-as-a-service): a tool for curating and processing massive amounts of data, for developing, training, and deploying models on that data, and for managing the whole workflow process throughout the project. It is for those who are comfortable with Apache Spark, as it is 100% based on Spark and is extensible with support for Scala, Java, R, and Python alongside Spark SQL, GraphX, Streaming, and the Machine Learning Library (MLlib). It has built-in integration with many data sources, has a workflow scheduler, allows for real-time workspace collaboration, and has performance improvements over traditional Apache Spark.
Running cost effective big data workloads with Azure Synapse and Azure Data Lake Storage (Michael Rys)
The presentation discusses how to migrate expensive open source big data workloads to Azure and leverage the latest compute and storage innovations within Azure Synapse and Azure Data Lake Storage to develop powerful and cost-effective analytics solutions. It shows how you can bring your .NET expertise to bear with .NET for Apache Spark, and how the shared metadata experience in Synapse makes it easy to create a table in Spark and query it from T-SQL.
DataOps for the Modern Data Warehouse on Microsoft Azure @ NDC Oslo 2020 (Lace Lofranco)
Talk Description:
The Modern Data Warehouse architecture is a response to the emergence of Big Data, Machine Learning and Advanced Analytics. DevOps is a key aspect of successfully operationalising a multi-source Modern Data Warehouse.
While there are many examples of how to build CI/CD pipelines for traditional applications, applying these concepts to Big Data Analytical Pipelines is a relatively new and emerging area. In this demo heavy session, we will see how to apply DevOps principles to an end-to-end Data Pipeline built on the Microsoft Azure Data Platform with technologies such as Data Factory, Databricks, Data Lake Gen2, Azure Synapse, and AzureDevOps.
Resources: https://aka.ms/mdw-dataops
The document discusses Ido Friedman and his background working with various data technologies. It then discusses the concept of a data lake and how it serves as a single store for raw and transformed data used for reporting, analytics, and machine learning. The rest of the document discusses how traditional tools like SQL have changed with the rise of Hadoop and cloud storage. It provides examples of performance and cost differences between running data workloads on Hadoop clusters versus cloud-based data processing services like BigQuery and Dataproc. The document concludes that a large data lake is now possible in the cloud and discusses various deployment options to consider.
In this session we will delve into the world of Azure Databricks and analyze why it is becoming a fundamental tool for data scientists and data engineers, in conjunction with Azure services.
Apache Hadoop is a platform that has emerged to help extract insight from all that data. In this session, you will learn the basics of Hadoop, how to get up and running with Hadoop in the cloud using Microsoft Azure HDInsight, and how you can leverage the deeper integration of Visual Studio to integrate Big Data with your existing applications. No previous experience with Hadoop is required.
Presented at MSDEVMTL in February 2015.
Think of big data as all data, no matter what the volume, velocity, or variety. The simple truth is a traditional on-prem data warehouse will not handle big data. So what is Microsoft's strategy for building a big data solution? And why is it best to have this solution in the cloud? That is what this presentation will cover. Be prepared to discover all the various Microsoft technologies and products, from collecting data, to transforming it, storing it, and visualizing it. My goal is to help you not only understand each product but understand how they all fit together, so you can be the hero who builds your company's big data solution.
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle Big Data Discovery (Mark Rittman)
Mark Rittman from Rittman Mead presented on Oracle Big Data Discovery. He discussed how many organizations are running big data initiatives involving loading large amounts of raw data into data lakes for analysis. Oracle Big Data Discovery provides a visual interface for exploring, analyzing, and transforming this raw data. It allows users to understand relationships in the data, perform enrichments, and prepare the data for use in tools like Oracle Business Intelligence.
Securing your Big Data Environments in the Cloud (DataWorks Summit)
Big Data tools are becoming a critical part of enterprise architectures and as such securing the data, at rest, and in motion is a necessity. More so, when you’re implementing these solutions in the cloud and the data doesn't reside within the confines of your trusted data center. Also, there is a fine balance between implementing enterprise-grade security and negotiating utmost performance given the overheads of encryption and/or identity management.
This session is designed to tackle these challenges head on and explain the various options available in the cloud. The focal points are the implementation of tools like Ranger and Knox for cloud deployments, but we also pay attention to the security features offered in the cloud that complement this process and secure the data in unprecedented ways.
Cloud Security + OSS Security tools are a deadly combination, when it comes to securing your Data Lake.
Spark is fast becoming a critical part of Customer Solutions on Azure. Databricks on Microsoft Azure provides a first-class experience for building and running Spark applications. The Microsoft Azure CAT team engaged with many early adopter customers helping them build their solutions on Azure Databricks.
In this session, we begin by reviewing typical workload patterns, integration with other Azure services like Azure Storage, Azure Data Lake, IoT / Event Hubs, SQL DW, PowerBI etc. Most importantly, we will share real-world tips and learnings that you can take and apply in your Data Engineering / Data Science workloads
Big Data, IoT, data lake, unstructured data, Hadoop, cloud, and massively parallel processing (MPP) are all just fancy words unless you can find use cases for all this technology. Join me as I talk about the many use cases I have seen, from streaming data to advanced analytics, broken down by industry. I'll show you how all this technology fits together by discussing various architectures and the most common approaches to solving data problems, and hopefully set off light bulbs in your head on how big data can help your organization make better business decisions.
Enabling Next Gen Analytics with Azure Data Lake and StreamSets (StreamSets Inc.)
This document discusses enabling next generation analytics with Azure Data Lake. It provides definitions of big data and discusses how big data is a cornerstone of Cortana Intelligence. It also discusses challenges with big data like obtaining skills and determining value. The document then discusses Azure HDInsight and how it provides a cloud Spark and Hadoop service. It also discusses StreamSets and how it can be used for data movement and deployment on Azure VM or local machine. Finally, it discusses a use case of StreamSets at a major bank to move data from on-premise to Azure Data Lake and consolidate migration tools.
This slide deck was presented at #DataOnCloud event New York. DataOnCloud is an invite-only event for CIOs and top IT innovators. DataOnCloud enables key decision makers to discuss about real life adoption scenarios, challenges and best practices for leveraging Big, Small and Line Of Business Data on Cloud.
Aditi Technologies, a 'cloud first' technology services company organized #DataOnCloud, an event series focused on orchestrating data on cloud and navigating the complexity around integration, security, platform selection and technology solutions.
Aditi Technologies partnered with Microsoft for this 2-hour, CXO roundtable event in global technology hubs - London, New York, Seattle and San Diego
Introduces Microsoft's data platform for on-premises and cloud, the challenges businesses face with data and data sources, and the evolution of database systems in the modern world: what businesses are doing with their data and what their new needs are as industry landscapes change.
Dives into the opportunities available for businesses and industry verticals: the ones that have already been identified and the ones not yet explored.
Explains Microsoft's cloud vision and what the Microsoft Azure platform offers, as Infrastructure as a Service or Platform as a Service, for building your own offerings.
Introduces and demos real-world scenarios and case studies where businesses have used the cloud/Azure to create new and innovative solutions that unlock this potential.
Build Big Data Enterprise solutions faster on Azure HDInsight (DataWorks Summit)
Hadoop and Spark are big data frameworks used to extract useful insights from data. Big data solutions span a variety of scenarios, from ingestion and data prep to data management, processing, analysis, and visualization, and each step requires specialized toolsets to be productive. In this talk I will share solution examples from the big data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku, running on Microsoft's Azure HDInsight, that simplify your big data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise; combining it with these tools gives you the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
This document discusses strategies for migrating existing enterprise IT solutions to the cloud. It begins by outlining the typical adoption stages companies go through with new technologies like virtualization and cloud computing. It then provides examples of how companies like Shell, GE, Dole Foods, and the New York Times have benefited from migrating applications and workloads to AWS. Finally, it discusses additional AWS services and solutions that can help companies at various stages of their cloud migration journey.
This document provides an overview of a course on implementing a modern data platform architecture using Azure services. The course objectives are to understand cloud and big data concepts, the role of Azure data services in a modern data platform, and how to implement a reference architecture using Azure data services. The course will provide an ARM template for a data platform solution that can address most data challenges.
The document discusses challenges facing today's enterprises such as cutting costs, driving value with tight budgets, maintaining security while increasing access, and finding the right transformative capabilities. It then discusses challenges in building applications related to scaling, availability, and costs. The remainder summarizes Microsoft's Windows Azure cloud computing platform, how it addresses these challenges, example use cases, and pricing models.
SendGrid Improves Email Delivery with Hybrid Data Warehousing (Amazon Web Services)
When you received your Uber ‘Tuesday Evening Ride Receipt’ or Spotify’s ‘This Week’s New Music’ email, did you think about how they got there?
SendGrid's reliable email platform delivers over 20 billion transactional and marketing emails each month on behalf of many of your favorite brands, including Uber, Airbnb, Spotify, Foursquare, and NextDoor.
SendGrid was looking to evolve its data warehouse architecture in order to improve decision making and optimize customer experience. They needed a scalable and reliable architecture that would allow them to move nimbly and efficiently with a relatively small IT organization, while supporting the needs of both business and technical users at SendGrid.
SendGrid’s Director of Enterprise Data Operations will be joining architects from Amazon Web Services (AWS) and Informatica to discuss SendGrid’s journey to a hybrid cloud architecture and how a hybrid data warehousing solution is optimized to support SendGrid’s analytics initiative. Speakers will also review common technologies and use cases being deployed in hybrid cloud today, common data management challenges in hybrid cloud and best practices for addressing these challenges.
Join us to learn:
• How to evolve to a hybrid data warehouse with Amazon Redshift for scalability, agility and cost efficiency with minimal IT resources
• Hybrid cloud data management use cases
• Best practices for addressing hybrid cloud data management challenges
The document provides an overview of leading big data companies in 2021 and the Apache Hadoop stack, including related Apache software and the NIST big data reference architecture. It lists over 50 big data companies, including Accenture, Actian, Aerospike, Alluxio, Amazon Web Services, Cambridge Semantics, Cloudera, Cloudian, Cockroach Labs, Collibra, Couchbase, Databricks, DataKitchen, DataStax, Denodo, Dremio, Franz, Gigaspaces, Google Cloud, GridGain, HPE, HVR, IBM, Immuta, InfluxData, Informatica, IRI, MariaDB, Matillion, Melissa Data
The document discusses challenges facing today's enterprises including cutting costs, driving value with tight budgets, maintaining security while increasing access, and finding the right transformative capabilities. It then discusses challenges in building applications such as scaling, availability, and costs. The document introduces the Windows Azure platform as a solution, highlighting its fundamentals of scale, automation, high availability, and multi-tenancy. It provides considerations for using cloud computing on or off premises and discusses ownership models.
Caserta Concepts, Datameer, and Microsoft shared their combined knowledge and a use case on big data, the cloud, and deep analytics. Attendees learned how a global leader in the test, measurement, and control systems market reduced their big data implementations from 18 months to just a few.
Speakers shared how to provide a business user-friendly, self-service environment for data discovery and analytics, and focus on how to extend and optimize Hadoop based analytics, highlighting the advantages and practical applications of deploying on the cloud for enhanced performance, scalability and lower TCO.
Agenda included:
- Pizza and Networking
- Joe Caserta, President, Caserta Concepts - Why are we here?
- Nikhil Kumar, Sr. Solutions Engineer, Datameer - Solution use cases and technical demonstration
- Stefan Groschupf, CEO & Chairman, Datameer - The evolving Hadoop-based analytics trends and the role of cloud computing
- James Serra, Data Platform Solution Architect, Microsoft - Benefits of the Azure Cloud Service
- Q&A, Networking
For more information on Caserta Concepts, visit our website: http://casertaconcepts.com/
The cloud is all the rage. Does it live up to its hype? What are the benefits of the cloud? Join me as I discuss the reasons so many companies are moving to the cloud and demo how to get up and running with a VM (IaaS) and a database (PaaS) in Azure. See why the ability to scale easily, the quickness with which you can create a VM, and the built-in redundancy are just some of the reasons that make moving to the cloud a "no-brainer". And if you have an on-prem datacenter, learn how to get out of the air-conditioning business!
How to Architect a Serverless Cloud Data Lake for Enhanced Data Analytics (Informatica)
This presentation is geared toward enterprise architects and senior IT leaders looking to drive more value from their data by learning about cloud data lake management.
As businesses focus on leveraging big data to drive digital transformation, technology leaders are struggling to keep pace with the high volume of data coming in at high speed and rapidly evolving technologies. What's needed is an approach that helps you turn petabytes into profit.
Cloud data lakes and cloud data warehouses have emerged as a popular architectural pattern to support next-generation analytics. Informatica's comprehensive AI-driven cloud data lake management solution natively ingests, streams, integrates, cleanses, governs, protects and processes big data workloads in multi-cloud environments.
Please leave any questions or comments below.
Azure Data Explorer deep dive - review 04.2020 (Riccardo Zamana)
Modern Data Science Lifecycle with ADX & Azure
This document discusses using Azure Data Explorer (ADX) for data science workflows. ADX is a fully managed analytics service for real-time analysis of streaming data. It allows for ad-hoc querying of data using Kusto Query Language (KQL) and integrates with various Azure data ingestion sources. The document provides an overview of the ADX architecture and compares it to other time series databases. It also covers best practices for ingesting data, visualizing results, and automating workflows using tools like Azure Data Factory.
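ADX's summarize-by-time-bin style of query can be approximated in plain Python. This is a conceptual analog of KQL's `bin()` function, not ADX itself, and the timestamps below are made up:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hedged analog of KQL's `summarize count() by bin(timestamp, 5m)`:
# streaming records are grouped into fixed-width time bins, which is
# the backbone of real-time charts over event data.
def bin_of(ts: datetime, width: timedelta) -> datetime:
    """Round a timestamp down to the start of its bin."""
    seconds = int((ts - datetime.min) / timedelta(seconds=1))
    width_s = int(width / timedelta(seconds=1))
    return datetime.min + timedelta(seconds=seconds - seconds % width_s)

events = [datetime(2020, 4, 1, 10, 1), datetime(2020, 4, 1, 10, 4),
          datetime(2020, 4, 1, 10, 7)]
counts = defaultdict(int)
for ts in events:
    counts[bin_of(ts, timedelta(minutes=5))] += 1
```

Two events fall in the 10:00 bin and one in the 10:05 bin; a time-series engine does the same grouping, just incrementally and over millions of rows.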
Cloud computing adoption in SAP technologies (sveldanda)
Cloud computing is emerging as an exciting trend in ICT, and with this presentation we explore opportunities for adopting cloud computing in SAP technologies.
Melbourne: Certus Data 2.0 Vault Meetup with Snowflake - Data Vault In The Cloud (Certus Solutions)
Snowflake is a cloud data warehouse that provides elasticity, scalability, and simplicity. It allows organizations to consolidate their diverse data sources in one place and instantly scale up or down their compute capacity as needed. Aptus Health, a digital marketing company, used Snowflake to break down data silos, integrate disparate data sources, enable broad data sharing, and provide a scalable and cost-effective solution to meet their analytics needs. Snowflake addressed both business needs for timely access to centralized data and IT needs for flexibility, extensibility, and reducing ETL work.
1) Cloud computing allows you to pay for infrastructure as needed rather than upfront, which can lower costs. AWS passes these savings to customers in the form of low prices.
2) AWS provides a variety of compute, storage, database, analytics and other services that can be used to build applications. Popular services include EC2, S3, DynamoDB, and EMR.
3) There are a number of strategies for using AWS, such as using it for development/testing, building new apps, augmenting existing apps, hybrid apps, and full migration. Existing tools can often be used to manage AWS resources.
ASP.NET Identity is a membership system that provides authentication, user management, and claims-based authorization to ASP.NET applications. It allows for non-SQL persistence, extensible authentication separate from membership, and full support for asynchronous programming. Key features include login, roles, profiles, claims, user and role management, external logins, and security features like two-factor authentication and account lockout. It has a modular architecture with classes like UserManager and UserStore that separate concerns.
This document discusses ASP.NET identity and security. It provides an overview of ASP.NET identity, which provides a unified authentication experience for ASP.NET applications both on-premises and in the cloud. It also discusses ASP.NET security using OWIN and integrating with Windows Azure Active Directory for authentication with organizational accounts. The presentation includes demos of ASP.NET identity features, social login, and using Azure AD for single sign-on and multi-tenant applications.
This document discusses ASP.NET SignalR, a framework for adding real-time web functionality to applications. SignalR can be used to enable continuously updated data and real-time features. It works across different browsers and devices by utilizing multiple transport mechanisms like websockets, server-sent events, and long polling. SignalR provides a simple programming model and supports high throughput, large-scale applications through its ability to scale out to multiple servers.
The document discusses a presentation given at the DevIntersection Conference about new features in ASP.NET Web Forms 4.5. The presentation covered modern standards, customizability, extensibility, patterns, cleaner code, mobile support, and a mystery feature. Additional resources were provided, including links to blogs and websites about ASP.NET Web Forms and a video about advanced ASP.NET.
This document discusses modern web forms, highlighting that they use modern standards, are customizable and extensible, follow best practice patterns, result in cleaner code, support mobile devices, and may include surprises. It also requests feedback by asking readers to fill out a session evaluation form at the conference registration desk.
This document summarizes new features and enhancements in ASP.NET Web Forms 4.5 and Visual Studio 11. It highlights improvements to cleaner markup, validation, routing, and model binding. It encourages developers to download betas of Visual Studio 11 and new templates to start using these features and get involved by sharing ideas, reporting bugs, or following blogs and forums.
This document discusses optimizing ASP.NET web applications for standards and performance. It covers developing and debugging efficiently using tools like Visual Studio and the page inspector. Optimization topics include minimizing file sizes using techniques like CSS/JS minification and image optimization. Real-time web capabilities like polling, long polling and web sockets are presented. ASP.NET 4.5 improvements for application start up time and memory usage are highlighted. Ensuring high performance web servers requires considering factors like IIS configuration and hosting environment.
How We Implemented "Exactly Once" Semantics in Our Database (javier ramirez)
Distributed systems are hard. High-performance distributed systems, even more so. Network latency, unacknowledged messages, server restarts, hardware failures, software bugs, problematic releases, timeouts... there are plenty of reasons why it is very hard to know whether a message you sent was received and processed correctly at its destination. So, to be safe, you send the message again... and again... and cross your fingers hoping the system on the other side tolerates duplicates.
QuestDB is an open source database designed for high performance. We wanted to make sure we could offer "exactly once" guarantees by deduplicating messages at ingestion time. In this talk, I explain how we designed and implemented the DEDUP keyword in QuestDB, deduplicating and also allowing upserts on real-time data, while adding only 8% of processing time, even on streams with millions of inserts per second.
I will also explain our parallel, multithreaded write-ahead log (WAL) architecture. And of course, all of this comes with demos, so you can see how it works in practice.
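The ingest-time deduplication being described can be reduced to a tiny sketch. This is a conceptual analog in Python, not QuestDB's implementation: rows are keyed by (timestamp, series key), and a duplicate arrival upserts rather than appends:

```python
# Hedged sketch of ingest-time deduplication: when a row arrives with
# the same (timestamp, key) as an existing row, its values overwrite
# the old ones, so an "at least once" producer that retries whole
# batches still yields "exactly once" results in the table.
def ingest(table: dict, rows: list) -> None:
    for ts, sensor, value in rows:
        table[(ts, sensor)] = value  # duplicate keys overwrite: an upsert

table = {}
batch = [(1, "s1", 20.0), (2, "s1", 21.0)]
ingest(table, batch)
ingest(table, batch)               # the producer times out and retries
ingest(table, [(2, "s1", 21.5)])   # a late correction upserts in place
```

The retried batch leaves the table unchanged, and the correction replaces one row; that is exactly the duplicate-tolerant behavior the talk describes sender-side retries relying on.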
Introducing Amazon Aurora Limitless Database, which lets you scale an Amazon Aurora cluster to millions of write transactions per second, manage petabytes of data, and scale relational database workloads on Aurora beyond the limits of a single Aurora writer instance, without creating custom application logic or managing multiple databases.
Applications of Data Science in Various Industries (IABAC)
This presentation covers the wide-ranging applications of data science across industries. From healthcare to finance, data science drives innovation and efficiency by transforming raw data into actionable insights. Learn how data science enhances decision-making, boosts productivity, and fosters new advancements in technology and business, with real-world examples of data science applications today.
[D2T2S04] Generative AI Foundation Model Training and Tuning using SageMaker (Donghwan Lee)
This session presents how to pre-train or fine-tune a foundation model using SageMaker Training Jobs / SageMaker JumpStart. Three topics are covered:
1. Training a foundation model from scratch
2. Pre-training a foundation model using open source models
3. Fine-tuning a model for a specific domain
Speakers:
Miron Perel, Principal ML GTM Specialist, AWS
Kristine Pearce, Principal ML BD, AWS
How We Added Replication to QuestDB - JonTheBeach (javier ramirez)
Building a database that can beat industry benchmarks is hard work, and we had to use every trick in the book to keep as close to the hardware as possible. In doing so, we initially decided QuestDB would scale only vertically, on a single instance.
A few years later, data replication —for horizontally scaling reads and for high availability— became one of the most demanded features, especially for enterprise and cloud environments. So, we rolled up our sleeves and made it happen.
Today, QuestDB supports an unbounded number of geographically distributed read-replicas without slowing down reads on the primary node, which can ingest data at over 4 million rows per second.
In this talk, I will tell you about the technical decisions we made, and their trade offs. You'll learn how we had to revamp the whole ingestion layer, and how we actually made the primary faster than before when we added multi-threaded Write Ahead Logs to deal with data replication. I'll also discuss how we are leveraging object storage as a central part of the process. And of course, I'll show you a live demo of high-performance multi-region replication in action.
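The replication scheme described here can be sketched as a log shipped from primary to replica. The classes below are hypothetical Python, not QuestDB code; they show only the core invariant, that a replica applying WAL entries strictly in sequence converges to the primary's state:

```python
import threading

# Hypothetical sketch of WAL-based replication: the primary appends every
# write to a log; a replica replays entries strictly in sequence order,
# so it converges to the primary's state without coordinating on each write.
class Primary:
    def __init__(self):
        self.wal = []            # (sequence_number, key, value)
        self.state = {}
        self._lock = threading.Lock()

    def write(self, key, value):
        with self._lock:         # one writer appends to the log at a time
            self.wal.append((len(self.wal), key, value))
            self.state[key] = value

class Replica:
    def __init__(self):
        self.state = {}
        self.applied = 0         # how far into the WAL we have replayed

    def catch_up(self, wal):
        for seq, key, value in wal[self.applied:]:
            assert seq == self.applied  # entries apply strictly in order
            self.state[key] = value
            self.applied += 1

primary, replica = Primary(), Replica()
primary.write("temp", 20)
primary.write("temp", 21)
replica.catch_up(primary.wal)
```

Because replay is append-only and ordered, replicas never slow down the primary's ingest path; they simply pull and apply the log at their own pace, which is what makes an unbounded number of read-replicas feasible.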
1. Building big data applications on Azure
Pranav Rastogi / Bharath Sreenivas, Microsoft
pranav.rastogi@microsoft.com
@rustd / @bharathbs
3. Reason over any data, anywhere, with security and privacy and flexibility of choice.
Data estates: data warehouses, data lakes, operational databases, hybrid.
Data sources: LOB, CRM, graph, social, image, IoT.
6. Solution scenarios
Three scenarios that take optimal advantage of big data:
- Modern DW: "We want to incorporate all of our data, including 'big data', with our data warehouse."
- Advanced Analytics: "We are trying to predict when our customers churn."
- Internet of Things (IoT): "We are trying to get insights from our devices in real time."
7. Governance and
Master Data Management
Azure SQL Data Warehouse
Data Quality and
Lineage
ERP, CRM,
and other
LOB Data
OLTP and
other
RDBMS
Clickstream
Logs and
Events
Sensors, Social, Weather, and other unstructured data
ETL
Azure Data Lake
Analytics (U-SQL)
Azure Storage / Azure Data Lake
Azure HDInsight
(Hadoop / Spark)
Azure Analysis
Services
BI Models
Power BI
Reports and
Dashboards
Polybase
Analyst
Power User
Data Engineer
Data Scientist
Big Data Warehouse
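The Polybase arrow in this diagram is what lets an analyst query warehouse tables and lake-resident data as one. As a toy illustration only (plain Python dictionaries standing in for external tables; none of these names come from the deck), the idea reduces to a join across the two stores:

```python
# Structured rows already in the warehouse (stand-in for a SQL DW table)
warehouse_sales = [
    {"customer_id": 1, "total": 250.0},
    {"customer_id": 2, "total": 90.0},
]
# Semi-structured rows sitting in the lake (stand-in for clickstream files)
lake_clicks = [
    {"customer_id": 1, "clicks": 40},
    {"customer_id": 2, "clicks": 7},
]

# One "query surface" over both stores: join on customer_id
clicks_by_customer = {r["customer_id"]: r["clicks"] for r in lake_clicks}
report = [
    {**row, "clicks": clicks_by_customer.get(row["customer_id"], 0)}
    for row in warehouse_sales
]
print(report)  # each warehouse row enriched with its click count
```

In the real pipeline this join is expressed in T-SQL over a Polybase external table, with SQL Data Warehouse scanning the files in Blob Storage or Data Lake Store directly.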
8. OLTP and
other
RDBMS
Clickstream
Logs and
Events
Sensors, Social, Weather, and other unstructured data
REPL and
Machine
Learning Tools
Data
Wrangling
Tools
Data Engineer Data Scientist
Deep Learning
& Cognitive
Services
Azure
Cosmos DB
Apps
Automated
Systems
People
Web
Mobile
Bots
ML Models
and Scoring
APIs
Advanced Analytics and AI
Azure Data Lake
Analytics (U-SQL)
Azure Storage / Azure Data Lake
Azure HDInsight
(Hadoop / Spark)
9. Azure Stream Analytics / Spark Streaming
Clean,
Curate,
Aggregate
Combine
reference
data
Perform
Scoring from
ML models
IoT Sensors
and/or
User
activity
streams
Social,
Trends,
Weather
etc.
Clickstream,
Batch Files,
server logs,
Images,
videos, and
other
unstructured
data
Azure Event Hubs,
Apache Kafka
Event
Broker/Buffer
Queue
Event
Broker
Power BI
Realtime
Dashboards
Analyst
Data Engineer
Data Scientist
Azure ML / R
Trained Machine
Learning Models
Azure SQL DB /
Cosmos DB
Reference Data
Automated
Systems
Realtime Processing with Lambda Architecture
Azure Data Lake
Analytics (U-SQL)
Azure Storage / Azure Data Lake
Azure HDInsight
(Hadoop / Spark)
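The "Clean, Curate, Aggregate" step in the streaming path is typically a windowed computation. A minimal sketch, assuming a tumbling window and a made-up (timestamp, sensor, value) event shape, neither of which comes from the slide:

```python
from collections import defaultdict

def tumbling_window_avg(events, window_seconds):
    """Group (epoch_seconds, sensor_id, value) events into fixed,
    non-overlapping windows and average each sensor per window."""
    windows = defaultdict(list)
    for ts, sensor, value in events:
        window_start = ts - (ts % window_seconds)
        windows[(window_start, sensor)].append(value)
    return {k: sum(v) / len(v) for k, v in windows.items()}

# Simulated IoT sensor stream (as it might arrive via Event Hubs/Kafka)
events = [
    (0, "temp-1", 20.0), (5, "temp-1", 22.0),    # window [0, 10)
    (12, "temp-1", 30.0), (17, "temp-1", 34.0),  # window [10, 20)
]
averages = tumbling_window_avg(events, window_seconds=10)
print(averages[(0, "temp-1")])   # 21.0
print(averages[(10, "temp-1")])  # 32.0
```

Stream Analytics and Spark Streaming express the same idea declaratively (e.g. a tumbling-window GROUP BY) instead of by hand.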
10. Advanced analytics and big data impact all verticals
Heartland Bank prevents fraud and boosts profits
The UK NHS transforms healthcare with faster access to information.
City of Barcelona boosts citizen engagement with an intelligent app
Jet.com transforms customer engagement with a truly personalized experience
Rolls Royce decreases costs with predictive maintenance
Rolls Royce decreases costs with
Predictive Maintenance
Manufacturing
Eliminate downtime and
increase efficiency by enabling
better predictive maintenance
for your capital assets.
Banking
Minimize losses with more
accurate fraud detection and
assess exposure to asset,
credit and market risk using a
holistic approach
Boost operational efficiency and improve the patient care experience with intelligent detection and in-time service.
Healthcare Government
Empower citizens and
improve their engagement
with relevant information and
personalized citizen services.
Retail
Turn individual customer
interactions into contextual
engagements and increase
customer satisfaction with highly
personalized offers and content
12. Managed Open Source Analytics for the
cloud with a 99.9% SLA.
100% Open Source
Clusters up and running in minutes
63% lower TCO than deploying your own Hadoop on-premises
Separation of compute and storage allows you to scale clusters independently and dramatically reduce costs
Open Source Analytics for the Enterprise
13. Big data is hard
Buy
Servers
Install
OSS
Secure
Configure
Optimize
Debug
Success
Scale up
14. HDInsight makes it easy
Provide
Cluster
details
HDInsight
Cluster
100% open source
Optimized
Highly available
Secure
Scalable
Dedicated
Managed
Certified ISVs
Customizable
Browse to
Azure Portal
15. Multi Region Availability
Available in >25 regions world-wide
Launched most recently in the US West 2 and UK regions
Available in China, Europe and US Government clouds
Deploy Globally Within Minutes
16. Perimeter Level Security
Virtual Networks
Network Security Groups (firewalls)
Authentication
Azure Active Directory
Kerberos authentication
Authorization
Apache Ranger
RBAC for Admin
POSIX ACLs for Data Plane Data Security
Server-Side encryption at rest
HTTPS/TLS In-transit
Security and Compliance to Enable OSS for Enterprises
17. Plugins for HDInsight are available for the most popular IDEs, enabling agile development and debugging
Rich support for the powerful notebooks used by data scientists
Develop in C# and deploy on Linux (Java) via the SCP.NET technology developed for HDInsight
Remote debugging for Spark jobs
Rich Developer Ecosystem
18. Recognized by
Top Analysts
Forrester Wave for Big Data
Hadoop Cloud
• Named industry leader by
Forrester with the most
comprehensive, scalable, and
integrated platforms*
• Recognized for its cloud-first
strategy that is paying off*
*The Forrester Wave™: Big Data Hadoop Cloud Solutions, Q2 2016.
19. Products and Services Organization Size Industry Country Business Need
Simplified pricing process
now takes minutes instead
of days
Competitive pricing, product demand, the costs of materials, gas and
labor, and the thousands of other market variables affect product cost
and customer demand for products or services around the world. It’s
why accurate and profitable pricing represents one of the most
difficult business challenges for many companies. Manufacturing,
distribution, services, and airline companies look to the science and
technology provided by PROS to keep their pricing accurate,
competitive, and profitable. The PROS Guidance product runs
enormously complex pricing calculations based on variables that
comprise multiple terabytes of data. To handle this calculation
complexity and data volume, and then deliver specific results to its
clients quickly, PROS built its services on top of Azure HDInsight.
Products and Services: Microsoft Azure, Azure HDInsight, Apache Spark for Azure HDInsight
Organization Size: 1,000
Industry: Other (unsegmented)
Country: United States
Business Need: Pricing Software-as-a-Service
20. HDInsight architecture
Hive meta store
Azure SQL database
Azure Storage or
Data Lake Store
Client
machines
HDInsight cluster
Gateway
nodes
Head
nodes
Worker
nodes
Edge
nodes
Zookeeper nodes
21. Scale compute & storage independently
Gateway
nodes
Head
nodes
Worker
nodes
Edge
nodes
Zookeeper nodes
Azure Blob Storage
or
Azure Data Lake
Store
22. Persist & reuse your data
Your data lives outside the HDInsight cluster, so it is persisted even if you drop and recreate the cluster.
You can create multiple clusters that point to the same storage.
Azure Blob Storage
or
Azure Data Lake
Store
HDInsight
cluster
HDInsight
cluster
HDInsight
cluster
HDInsight
cluster
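The persistence claim above can be pictured with two throwaway Python classes (hypothetical stand-ins, not an Azure SDK): the storage object outlives any cluster attached to it.

```python
class Storage:
    """Stand-in for Azure Blob Storage / Data Lake Store: it exists
    independently of any compute cluster."""
    def __init__(self):
        self.blobs = {}

class Cluster:
    """Stand-in for an HDInsight cluster: compute only, attached to
    external storage it does not own."""
    def __init__(self, storage):
        self.storage = storage
    def write(self, name, data):
        self.storage.blobs[name] = data
    def read(self, name):
        return self.storage.blobs[name]

store = Storage()
etl_cluster = Cluster(store)
etl_cluster.write("events/2017-01-01.csv", "id,value\n1,42")
del etl_cluster                      # drop the cluster...
analytics_cluster = Cluster(store)   # ...recreate, pointing at the same storage
print(analytics_cluster.read("events/2017-01-01.csv"))  # data persisted
```

The same separation is what lets several clusters (e.g. an ETL cluster and an interactive cluster) share one storage account concurrently.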
27. Azure
Blob
Storage
HDInsight Spark cluster
Azure SQL
Data Warehouse
Azure SQL
Database
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
Azure
Blob
Storage
Azure SQL
Data Warehouse
Azure Data Lake
Store
Azure Cosmos
DB
jobs
35. HDInsight Spark cluster
streaming jobs
Web app
Mobile
Azure
Blob
Storage
Kafka
Event Hub
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
HBase
push pull
Azure Redis Cache
Bot
41. MapReduce: each step reads from and writes to HDFS
Step 1 ("mapper"): reads from HDFS, writes to HDFS
Step 2 ("reducer"): reads from HDFS, writes to HDFS
Latency to read 1 MB sequentially:
From disk: 20,000,000 ns
From SSD: 1,000,000 ns
From memory: 250,000 ns
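Those three numbers explain most of Spark's advantage over disk-based MapReduce; a quick check of the ratios, using the figures from the slide:

```python
# Latency (ns) to read 1 MB sequentially, as quoted on the slide
DISK_NS = 20_000_000
SSD_NS = 1_000_000
MEMORY_NS = 250_000

print(DISK_NS // MEMORY_NS)  # 80: memory is 80x faster than disk
print(SSD_NS // MEMORY_NS)   # 4: memory is 4x faster than SSD
print(DISK_NS // SSD_NS)     # 20: SSD is 20x faster than disk
```

So a job that shares intermediate data through memory rather than HDFS-on-disk avoids a roughly 80x penalty on every read of that data.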
44. val file = spark.textFile("wasb://...")
val errors = file.filter(line => line.contains("ERROR"))
// Cache errors
errors.cache()
// Count all the errors
errors.count()
// Count errors mentioning MySQL
errors.filter(line => line.contains("MySQL")).count()
// Fetch the MySQL errors as an array of strings
errors.filter(line => line.contains("MySQL")).collect()
55. Azure
Blob
Storage
HDInsight Spark cluster
Azure SQL
Data Warehouse
Azure SQL
Database
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
Azure
Blob
Storage
Azure SQL
Data Warehouse
Azure Data Lake
Store
Azure Cosmos
DB
jobs
57. HDInsight Spark cluster
streaming jobs
Web app
Mobile
Azure
Blob
Storage
Azure Data Lake
Store
Azure Cosmos
DB
Azure SQL
Database
HBase
push pull
Azure Redis Cache
Bot
Power BI
real-time
dashboard
Kafka
Event Hub
68. Phone Tracking Across Cell Sites
Connected Car - Remote
Management & Diagnostics
Asset Tracking
Fleet Management
Facilities Management
Personnel Tracking & Crowd
Control
Ride Sharing
Geofencing
Racecar Telemetry
Connected Manufacturing
and many more…
69. Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption, BI/visualization)
Consume
(Alerts, Operational Stats,
Insights)
Big Data Architecture
Data Consumption
(Ingestion)
Data Processing
Presentation/Serving
Layer
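The Ingest → Prepare → Analyze → Publish flow above is, at its core, a composition of stages. A minimal sketch with hypothetical stage functions (the "sensor,value" record format is invented for illustration):

```python
def ingest(raw_lines):
    # Parse raw "sensor,value" records into fields
    return [line.split(",") for line in raw_lines]

def prepare(records):
    # Normalize/clean: drop malformed rows, cast values to float
    return [(s, float(v)) for s, v in records if v.strip()]

def analyze(records):
    # Simple stat analysis: max reading per sensor
    result = {}
    for sensor, value in records:
        result[sensor] = max(result.get(sensor, float("-inf")), value)
    return result

def publish(insights):
    # Shape results for BI/visualization consumption
    return [{"sensor": s, "max": m} for s, m in sorted(insights.items())]

raw = ["temp-1,20.5", "temp-2,18.0", "temp-1,25.0"]
print(publish(analyze(prepare(ingest(raw)))))
# [{'sensor': 'temp-1', 'max': 25.0}, {'sensor': 'temp-2', 'max': 18.0}]
```

In the Azure mapping that follows, each stage is a service boundary (Event Hubs → HDInsight/Data Lake Analytics → Cosmos DB/SQL DW → Power BI) rather than a function call.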
70. Data Sources Ingest Prepare
(normalize, clean, etc.)
Analyze
(stat analysis, ML, etc.)
Publish
(for programmatic
consumption, BI/visualization)
Consume
(Alerts, Operational Stats,
Insights)
Big Data Architecture
Data Processing
REALTIME ANALYTICS
INTERACTIVE ANALYTICS
BATCH ANALYTICS
Machine Learning
(Spark + Azure ML)
(Failure and RCA
Predictions)
HDI + ISVs
OLAP for Data
Warehousing
HDI Custom ETL
Aggregate /Partition
PowerBI
dashboard
(Shared with field
Ops, customers,
MIS, and Engineers)
Realtime Machine Learning
(Anomaly Detection)
CosmosDB
Interactive HDInsight clusters
BIG DATA STORAGE ANALYTICS
Big Data Storage
Azure Data
Lake Store
CosmosDB Azure Blob
Storage
Data Scientists,
BI Analysts
Big Data Applications
80. Microsoft Databus (Siphon) Usage
8 million events per second peak ingress
800 TB (10 GB per second) ingress per day
1,800 production Kafka brokers; 450 topics
15 sec 99th percentile latency
KEY CUSTOMER SCENARIOS
Ads Monetization (Fast BI)
O365 Customer Fabric NRT – Tenant & User insights
BingNRT Operational Intelligence
Presto (Fast SML) interactive analysis
Delve Analytics
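The headline numbers are mutually consistent; a quick sanity check that 800 TB per day matches the quoted 10 GB per second (using decimal units):

```python
TB_PER_DAY = 800
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
GB_PER_DAY = TB_PER_DAY * 1000   # decimal TB -> GB

gb_per_second = GB_PER_DAY / SECONDS_PER_DAY
print(round(gb_per_second, 2))   # ~9.26 GB/s, i.e. roughly the quoted 10 GB/s
```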
[Chart: Siphon Data Volume (Ingress and Egress), throughput in GBps, Jan 2015 through Dec 2016; volume published (GBps) vs. volume subscribed (GBps)]
[Chart: Siphon Events per second (Ingress and Egress), throughput in millions of events per second, Jan 2015 through Dec 2016; EPS in vs. EPS out]
81. Asia DC
Zookeeper Canary
Kafka
Collector
Agent
Services Data Pull (Agent)
Services Data Push
Device Proxy Services
Consumer
API (Push/
Pull)
Europe DC
Zookeeper Canary
Kafka
US DC
Zookeeper Canary
Kafka
Streaming
Batch
Audit Trail
Open Source
Microsoft Internal
Siphon
86. Tool / Purpose
Ambari: dashboard for monitoring the health and status of the Hadoop cluster
Yarn UI: monitor Yarn applications and logs
Tez View: track and debug the execution of jobs
Grafana: workload-specific JMX metrics
Spark History Server: displays both completed and incomplete Spark jobs
HMaster UI: web-based user interface for monitoring your HBase cluster
Visual Studio / VS Code: monitor job status with Data Lake tools; remote debugging for Spark jobs
89. OMS Agent for
Linux
HDInsight nodes (Head, Worker, Zookeeper)
FluentD
HDInsight
plugin
1. Plugin for ‘in_tail’ for all Logs, allows
regexp to create JSON object
2. Filter for WARN and above for each
Log Type. `grep` filter plugin
3. Output to out_oms_api Type
4. Exec plugin for Metrics
omsconfig: per-workload config for HBase, Spark, Hive, Storm, and Kafka
Log Analytics(OMS) Service
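Steps 1 and 2 above (tail logs through a regexp into a JSON object, then keep WARN and above) can be sketched as follows; the log line format and severity ordering are assumptions for illustration, not the actual FluentD plugin code:

```python
import json
import re

# Assumed log4j-style line: "2017-06-01 12:00:00 WARN  Component: message"
LOG_PATTERN = re.compile(
    r"(?P<time>\S+ \S+)\s+(?P<level>\w+)\s+(?P<source>\w+): (?P<message>.*)")
SEVERITY = {"DEBUG": 0, "INFO": 1, "WARN": 2, "ERROR": 3, "FATAL": 4}

def parse_and_filter(lines, min_level="WARN"):
    """Mimic in_tail + grep filter: regexp -> JSON object, keep WARN and up."""
    out = []
    for line in lines:
        m = LOG_PATTERN.match(line)
        if m and SEVERITY[m.group("level")] >= SEVERITY[min_level]:
            out.append(json.dumps(m.groupdict()))
    return out

logs = [
    "2017-06-01 12:00:00 INFO  NodeManager: heartbeat ok",
    "2017-06-01 12:00:05 WARN  DataNode: slow block transfer",
    "2017-06-01 12:00:09 ERROR NameNode: replication below target",
]
for record in parse_and_filter(logs):
    print(record)  # two JSON records: the WARN and the ERROR lines
```

The real pipeline then ships these JSON records to the Log Analytics (OMS) ingestion API (step 3).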
112. Transparent Server Side Encryption
Azure Data Lake Storage
ALWAYS ON transparent encryption
All reads/writes are encrypted/decrypted
Service managed keys as well as Customer
managed keys
Encryption @ Rest and Encryption in Transit
Microsoft Azure Storage Blob
ALWAYS ON transparent encryption
All reads/writes are encrypted/decrypted
Service managed keys as well as Customer managed keys
Encryption @ Rest and Encryption in Transit
All kinds of data are being generated
Stored on-premises and in the cloud, but the vast majority in hybrid environments
Customers want to reason over all this data without having to move it
They want a choice of platform and languages, and privacy and security
<Transition> Microsoft's offering
Objective: This slide describes the architecture of how Apache Spark is different, allowing it to offer better performance for data sharing.
Table Source: https://gist.github.com/jboner/2841832
Talking points:
Spark provides primitives for in-memory cluster computing. A Spark job can load and cache data into memory and query it repeatedly, much more quickly than disk-based systems.
Spark integrates into the Scala programming language to let you manipulate distributed data sets like local collections. No need to structure everything as map and reduce operations.
Data sharing between operations is faster, since data is in-memory.
Hadoop shares data through HDFS, an expensive option. It also maintains three replicas.
Spark stores data in-memory without any replication.
Objective: This slide explains the two types of operations that RDDs support: transformation and actions.
Talking points:
Transformations create a new data set from an existing data set.
Transformations do not compute their results right away. They are only computed when an action requires a result to be returned to the driver program. Does not apply to persistent RDDs.
Examples include: map, filter, sample, union, and more.
Actions return a value to the driver program after running a computation on the data set.
Examples include: reduce, collect, count, first, foreach, and more.
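The lazy-transformation / eager-action split can be mimicked with Python generators (this is an analogy, not the Spark API): nothing executes until a terminal operation plays the role of an action.

```python
evaluated = []

def trace(line):
    # Record when a line is actually processed
    evaluated.append(line)
    return line

lines = ["INFO ok", "ERROR disk full", "ERROR timeout"]

# "Transformation": build the pipeline lazily; no work happens yet
errors = (trace(l) for l in lines if "ERROR" in l)
print(evaluated)               # [] -- nothing evaluated so far

# "Action": consuming the generator forces evaluation
print(sum(1 for _ in errors))  # 2
print(evaluated)               # ['ERROR disk full', 'ERROR timeout']
```

Spark's `count()` and `collect()` play the role of the consuming `sum(...)` here, pulling data through the chain of deferred `filter`/`map` transformations.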
Objective: This slide shows an example of how transformations and actions are enabled to search through error messages.
Talking points:
Cache errors – Implementing this action will collect all the errors present
Count all errors – Implementing this action counts all the errors in the data
Count errors mentioning MySQL – When implementing this code, MySQL errors are counted
Fetch the MySQL errors as an array of strings – When implementing this code, MySQL errors are extracted as an array of strings
Event Detection in Realtime
FINANCIAL ENGINES
CONNECTED CAR – SENSORS FIRE
Data Landing for Learning
Use cases: connected car; insurance companies for connected driving
What are the three big components that you need to stand up when you build one?
ASK:
Who knows what Lambda architecture is?
Who has helped implement one?
Walk through:
VERTICALS: Ingest, Prep + Analyze, Serve, Consume
HORIZONTALS: driven by speed, realtime vs. batch
Let’s Walk through an example of this
We will demo this soon
TODO – add logos for Bing Ads, Office365, Delve Analytics
How do we monitor all of our resources across subscriptions with a single pane of glass?
How do we analyze Hadoop logs and metrics easily?
How do we set up alerting?