The business case for Hadoop can be made on the tremendous operational cost savings it affords. But why stop there? The integration of R-powered analytics in Hadoop presents an entirely new value proposition. Organizations can write R code and deploy it natively in Hadoop, without moving data and without writing their own MapReduce. Bringing R-powered predictive analytics into Hadoop will accelerate Hadoop’s value to organizations by allowing them to break through performance and scalability challenges and solve new analytic problems. Use all the data in Hadoop to discover more, grow more quickly, and operate more efficiently. Ask bigger questions. Ask new questions. Get better, faster results and share them.
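The page itself contains no code, but a minimal sketch of the pattern it describes, using Revolution R Enterprise's RevoScaleR functions (the HDFS path, dataset, and formula are illustrative assumptions), might look like this:

    # Sketch: run an R model natively in Hadoop with RevoScaleR
    # (Revolution R Enterprise). Paths, columns, and formula are illustrative.
    library(RevoScaleR)

    # Point the compute context at the Hadoop cluster; the rx* calls below
    # then execute as MapReduce jobs on the cluster rather than locally.
    rxSetComputeContext(RxHadoopMR())

    # Reference a CSV already sitting in HDFS -- no data movement to the client.
    hdfs    <- RxHdfsFileSystem()
    flights <- RxTextData("/data/flights.csv", fileSystem = hdfs)

    # Fit a linear model across the cluster without writing any MapReduce.
    fit <- rxLinMod(ArrDelay ~ DayOfWeek + Distance, data = flights)
    summary(fit)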
Citizens Bank: Data Lake Implementation – Selecting BigInsights ViON Spark/Ha... (Seeling Cheung)
Citizens Bank was implementing a BigInsights Hadoop Data Lake with PureData System for Analytics to support all internal data initiatives and improve the customer experience. Testing BigInsights on the ViON Hadoop Appliance yielded the productivity, maintenance, and performance Citizens was looking for. Citizens Bank moved some analytics processing from Teradata to Netezza for better cost and performance, implemented BigInsights Hadoop for a data lake, and avoided large capital expenditures for additional Teradata capacity.
The document discusses big data testing and provides examples of big data projects. It defines big data as large volumes of data that are analyzed to make better decisions. Big data has three characteristics - volume, velocity, and variety. Traditional testing approaches are not suitable for big data, which requires new testing strategies and tools to handle the scale and complexity. Automating testing and understanding the data and processes are important for big data testing. The document outlines challenges and provides examples of batch and real-time systems as well as automation tools like Talend Open Studio.
Empower Splunk and other SIEMs with the Databricks Lakehouse for Cybersecurity (Databricks)
Cloud, Cost, Complexity, and threat Coverage are top of mind for every security leader. The Lakehouse architecture has emerged in recent years to help address these concerns with a single unified architecture for all your threat data, analytics and AI in the cloud. In this talk, we will show how Lakehouse is essential for effective Cybersecurity and popular security use-cases. We will also share how Databricks empowers the security data scientist and analyst of the future and how this technology allows cyber data sets to be used to solve business problems.
MongoDB San Francisco 2013: Geo Searches for Healthcare Pricing Data presented... (MongoDB)
This talk covers the MongoDB deployment architecture used at Castlight Health to support very low latency spatial searches against our database of hundreds of millions of healthcare prices. The Geo haystack index in MongoDB and SSDs turned out to be the perfect solution for our problem. A strategy of replica set flipping also enables Castlight to swap in very large changes to the pricing data with no impact to the running application.
Hortonworks Hybrid Cloud - Putting you back in control of your data (Scott Clinton)
The document discusses Hortonworks' solutions for managing data across hybrid cloud environments. It proposes getting all data under management, combating growing cloud data silos, and consistently securing and governing data across locations. Hortonworks offers the Hortonworks Data Platform, Hortonworks Dataflow, and Hortonworks DataPlane to provide a modern hybrid data architecture with cloud-native capabilities, security and governance, and the ability to extend to edge locations. The document also highlights Hortonworks' professional services and open source community initiatives around hybrid cloud data.
RCG proposes a Big Data Proof of Concept (PoC) to demonstrate the business value of analyzing a client's data using Big Data technologies. The PoC involves:
1) Defining a business problem and objectives in a workshop with client.
2) The client collecting and anonymizing relevant data.
3) RCG loading the data into their Big Data lab and analyzing it using Big Data technologies.
4) RCG producing results, insights, and recommendations for applying Big Data and taking business actions.
The PoC requires no investment from the client and provides an opportunity to explore Big Data analytics without committing resources.
Optimized Data Management with Cloudera 5.7: Understanding data value with Cl... (Cloudera, Inc.)
Across all industries, organizations are embracing the promise of Apache Hadoop to store and analyze data of all types, at larger volumes than ever before possible. But to tap into the true value of this data, organizations need to manage this data and its subsequent metadata to understand its context, see how it’s changing, and take actions on it.
Cloudera Navigator is the only integrated data management and governance solution for Hadoop and is designed to do exactly this. With Cloudera 5.7, we have further expanded the capabilities of Cloudera Navigator to make it even easier to understand your data and maintain metadata consistency as it moves through Hadoop.
Necessity of Data Lakes in the Financial Services Sector (DataWorks Summit)
With the emergence of regulations such as the European Union's General Data Protection Regulation (effective May 2018), with fines of up to 20m Euro, data lakes are emerging as the data architecture of choice among financial institutions. Banks are embarking on a journey to enable data scientists to unlock the value of the data siloed in many disparate data systems. By enabling self-service data access and merging multiple streams of data using data clustering, entity extraction, identity resolution, and other techniques, we will show how banks have used analytics to uncover business value without falling into the abyss of data swamps. The build-out of the data lake requires the ingestion of data from multiple operational systems. By leveraging an automated data cataloging service delivered on the FICO Analytics Cloud, organizations are able to search, profile, discover, tag, track lineage, and capture tribal knowledge, enabling data scientists to build innovative models, make automated decisions, track fraudulent usage, run intelligent marketing campaigns, and improve the top and bottom lines for the financial institution.
Speaker:
Rohit Valia, Product Management and Strategy, Fico
How can a quality engineering and assurance consultancy keep you ahead of others (greyaudrina)
This document discusses how a quality engineering and assurance consultancy can help organizations stay ahead. It recommends leveraging technologies like AI, automation, and DevOps to improve software quality, testing, and speed, and suggests using AI and customer feedback to enhance the customer experience. Adopting business processes that provide transparency and actionable information can streamline operations and support efficient decision-making. With the help of a consultancy, organizations can optimize costs, improve returns on investment, and ensure business goals are achieved through customized solutions and a holistic approach.
Seagate: Sensor Overload! Taming The Raging Manufacturing Big Data Torrent (Seeling Cheung)
Nicholas Berg presented on Seagate's use of big data analytics to manage the large amount of manufacturing data generated from its hard drive production. Seagate collects terabytes of data per day from testing its drives, which it analyzes using Hadoop to improve quality, predict failures, and gain other insights. It faces challenges in integrating this emerging platform due to the rapid evolution of Hadoop and lack of tools to fully leverage large datasets. Seagate is developing its data lake and data science capabilities on Hadoop to better optimize manufacturing and drive design.
Continuously improving factory operations is of critical importance to manufacturers. Consider the facts: the total cost of poor quality amounts to a staggering 20% of sales (American Society of Quality), and unplanned downtime costs plants approximately $50 billion per year (Deloitte).
The most pressing questions are: which process variables affect quality and yield, and which process variables predict equipment failure? Getting to those answers gives forward-thinking manufacturers a leg up over competitors.
The speakers address the data management challenges facing today's manufacturers, including proprietary systems and siloed data sources, as well as an inability to make sensor-based data usable.
Integrating enterprise data from ERP, MES, maintenance systems, and other sources with real-time operations data from sensors, PLCs, SCADA systems, and historians represents a major first step. But how to get started? What is the value of a data lake? How are AI/ML being applied to enable real time action?
Join us for this educational session, which includes a view into a roadmap for an open source industrial IoT data management platform.
Key Takeaways:
• Understand key use cases commonly undertaken by manufacturing enterprises
• Understand the value of using multivariate manufacturing data sources, as opposed to a single sensor on a piece of equipment
• Understand advances in big data management and streaming analytics that are paving the way to next-generation factory performance
Threat Detection and Response at Scale with Dominique Brezinski (Databricks)
Security monitoring and threat response has diverse processing demands on large volumes of log and telemetry data. Processing requirements span from low-latency stream processing to interactive queries over months of data. To make things more challenging, we must keep the data accessible for a retention window measured in years. Having tackled this problem before in a massive-scale environment using Apache Spark, when it came time to do it again, there were a few things I knew worked and a few wrongs I wanted to right.
We approached Databricks with a set of challenges to collaborate on: provide a stable and optimized platform for Unified Analytics that allows our team to focus on value delivery using streaming, SQL, graph, and ML; leverage decoupled storage and compute while delivering high performance over a broad set of workloads; use S3 notifications instead of list operations; remove Hive Metastore from the write path; and approach indexed response times for our more common search cases, without hard-to-scale index maintenance, over our entire retention window. This is about the fruit of that collaboration.
Data summit connect fall 2020 - rise of data ops (Ryan Gross)
Data governance teams attempt to apply manual control at various points for consistency and quality of the data. By thinking of our machine learning data pipelines as compilers that convert data into executable functions and leveraging data version control, data governance and engineering teams can engineer the data together, filing bugs against data versions, applying quality control checks to the data compilers, and other activities. This talk illustrates how innovations are poised to drive process and cultural changes to data governance, leading to order-of-magnitude improvements.
Data Analytics in your IoT Solution – Fukiat Julnual, Technical Evangelist, Mic... (BAINIDA)
Data Analytics in your IoT Solution, by Fukiat Julnual, Technical Evangelist, Microsoft (Thailand) Limited, presented at THE FIRST NIDA BUSINESS ANALYTICS AND DATA SCIENCES CONTEST/CONFERENCE, organized by the School of Applied Statistics and DATA SCIENCES THAILAND
The document discusses Southwest Power Pool's initial steps towards creating a data lake. It describes:
- Storing historical and real-time data that exceeded initial expectations, with around 50% being less frequently used
- Conducting a proof-of-concept evaluation of three vendors to offload less frequently used data and allow SQL query access with minimal changes to existing queries
- Choosing BigInsights based on its ability to do this along with supporting existing Netezza functions and allowing federated queries between Netezza and BigInsights
- The multi-phase vision to eventually incorporate more data types and workloads while improving performance, security, and governance
Freddie Mac makes homeownership and rental housing more accessible and affordable. Operating in the secondary mortgage market, we keep mortgage capital flowing by purchasing mortgage loans from lenders so they in turn can provide more loans to qualified borrowers. Our mission to provide liquidity, stability, and affordability to the U.S. housing market in all economic conditions extends to all communities from coast to coast.
We're using big data and advanced analytics to create powerful enhancements to better meet our customers’ needs: automated collateral evaluation, automated assessments for borrowers without credit scores, immediate certainty for collateral rep and warranty relief, and, coming soon, automated asset and income validation.
We’re building tools to help our customers cut costs and give them rep and warranty relief sooner in the loan manufacturing process.
We’ve designed Loan Advisor Suite with lenders to give our customers greater certainty, usability, reliability and efficiency. It's a simpler, better way to do business.
More Tools - Access powerful solutions for every stage of the loan production process.
More Loans - Increase output with automated data management and user-friendly controls.
Less Risk - Get alerted to loan issues and take action the moment they occur.
Hear the story of how ACE helped Freddie Mac reimagine the mortgage process and how HDP helped make it possible.
Speaker
Dennis Tally, Freddie Mac, Director
Modern Data Management for Federal Modernization (Denodo)
Watch full webinar here: https://bit.ly/2QaVfE7
Faster, more agile data management is at the heart of government modernization. However, traditional data delivery systems are limited in their ability to realize a modernized, future-proof data architecture.
This webinar will address how data virtualization can modernize existing systems and enable new data strategies. Join this session to learn how government agencies can use data virtualization to:
- Enable governed, inter-agency data sharing
- Simplify data acquisition, search and tagging
- Streamline data delivery for transition to cloud, data science initiatives, and more
Testing the Data Warehouse—Big Data, Big Problems (TechWell)
Data warehouses are critical systems for collecting, organizing, and making information readily available for strategic decision making. The ability to review historical trends and monitor near real-time operational data is a key competitive advantage for many organizations. Yet the methods for assuring the quality of these valuable assets are quite different from those of transactional systems. Ensuring that appropriate testing is performed is a major challenge for many enterprises. Geoff Horne has led numerous data warehouse testing projects in both the telecommunications and ERP sectors. Join Geoff as he shares his approaches and experiences, focusing on the key “uniques” of data warehouse testing: methods for assuring data completeness, monitoring data transformations, measuring quality, and more. Geoff explores the opportunities for test automation as part of the data warehouse process, describing how you can harness automation tools to streamline the work and minimize overhead.
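Several of the checks Geoff describes, data completeness in particular, lend themselves to scripted reconciliation. Here is a hedged R sketch (connections, table, and column names are invented; RSQLite stands in for the real source and warehouse databases) comparing row counts and a column checksum between source and target:

    # Sketch: automated completeness check between a source system and the
    # warehouse. Connection details and names are illustrative only.
    library(DBI)

    src <- dbConnect(RSQLite::SQLite(), "source.db")     # stand-in for the OLTP source
    dwh <- dbConnect(RSQLite::SQLite(), "warehouse.db")  # stand-in for the warehouse

    reconcile <- function(con_a, con_b, table, measure) {
      q <- sprintf("SELECT COUNT(*) AS n, SUM(%s) AS total FROM %s", measure, table)
      a <- dbGetQuery(con_a, q)
      b <- dbGetQuery(con_b, q)
      data.frame(table       = table,
                 rows_match  = a$n == b$n,
                 total_match = isTRUE(all.equal(a$total, b$total)))
    }

    # Flag any table where counts or checksums drifted during loading.
    reconcile(src, dwh, "orders", "order_amount")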
When Agilysys, a leading provider of software solutions for the hospitality industry, decided to architect the next-generation platform for their software to unify lines of business, it chose MongoDB as the data store at the heart of the system. The Agilysys suite of hospitality products spans lodging and food & beverage, providing solutions for point of sale, property management, workforce management, and inventory and procurement, a perfect fit for the flexible JSON document model in MongoDB. Agilysys solutions now use a single store with related data from otherwise disconnected applications, allowing each application and line of business to operate independently. Simultaneously, the Agilysys rGuest suite of hospitality products has the capability to correlate and analyze patterns and behavior of guests across systems to improve their experience when staying on property.
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing (Yahoo Developer Network)
Apache Flink (incubating) is one of the latest additions to the Apache family of data processing engines. In short, Flink’s design aims to be as fast as in-memory engines, while providing the reliability of Hadoop. Flink contains (1) APIs in Java and Scala for both batch-processing and data streaming applications, (2) a translation stack for transforming these programs to parallel data flows and (3) a runtime that supports both proper streaming and batch processing for executing these data flows in large compute clusters.
Flink’s batch APIs build on functional primitives (map, reduce, join, cogroup, etc.), and augment those with dedicated operators for iterative algorithms and support for logical, SQL-like key attribute referencing (e.g., groupBy(“WordCount.word”)). The Flink streaming API extends the primitives from the batch API with flexible window semantics.
Internally, Flink transforms the user programs into distributed data stream programs. In the course of the transformation, Flink analyzes functions and data types (using Scala macros and reflection), and picks physical execution strategies using a cost-based optimizer. Flink’s runtime is a true streaming engine, supporting both batching and streaming. Flink operates on a serialized data representation with memory-adaptive out-of-core algorithms for sorting and hashing. This makes Flink match the performance of in-memory engines on memory-resident datasets, while scaling robustly to larger disk-resident datasets.
Finally, Flink is compatible with the Hadoop ecosystem. Flink runs on YARN, reads data from HDFS and HBase, and supports mixing existing Hadoop Map and Reduce functions into Flink programs. Ongoing work is adding Apache Tez as an additional runtime backend.
This talk presents Flink from a user perspective. We introduce the APIs and highlight the most interesting design points behind Flink, discussing how they contribute to the goals of performance, robustness, and flexibility. We finally give an outlook on Flink’s development roadmap.
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark’s powerful yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API, and its integration with Flume, Kafka, and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application’s execution will be presented, to help the audience understand good practices for writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
This talk uses a case study to demonstrate core data science capabilities in Big Data, infrastructure requirements, and talent profiles that translate to early success. Using the challenge of classifying events in a consumer-oriented website, the discussion is for a wide audience:
- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data sciences
- Hiring managers will expand their knowledge of the skills required to bring business value with data
From the Predictive Analytics Innovation Summit
Video here: https://www.youtube.com/watch?v=PdKUt0zK0UY
With the avalanche of data about operations, customers, and products, leading companies are utilizing Big Analytics to better understand historical patterns and predict what may come next to create sustained competitive advantage. Dan Mallinger, who leads Think Big Analytics' data science team, will focus on practical examples of where companies are implementing new analytics approaches over big data. Dan will discuss how these efforts differ from traditional analytic approaches, the organizational and business impact, and how our clients are creating new value in areas such as marketing, services, sales and product development.
More Than Websites: PHP And The Firehose @DataSift (2013) (Stuart Herbert)
PHP is the world's #1 programming language for creating websites. But it's capable of so much more. How about processing the social firehose in real time? :)
Apachecon Europe 2012: Operating HBase - Things you need to know (Christian Gügi)
This document provides an overview of important concepts for operating HBase, including:
- HBase stores data in column families stored as files on disk, and writes to memory before flushing to disk.
- Manual and automatic splitting of regions is covered, as well as challenges of improper splitting.
- Tools for monitoring, debugging, and visualizing HBase operations are discussed.
- Key lessons focus on proper data modeling, extensive monitoring, and understanding the whole Hadoop ecosystem.
On June 11, Thomas Dinsmore gave a nice outline of the tools and technologies out there for handling analytics in Hadoop. It is a must-watch for anyone wondering what advanced analytics Hadoop can deliver.
Please find video and slides below.
Synopsis
What is the state of play for advanced analytics in Hadoop? A year ago, options included "roll your own" and little else; today there are a number of serious open source and commercial options available, with new capabilities announced daily.
In this presentation, we begin with a brief overview of use cases for advanced analytics and a discussion of what types of analytics must run in Hadoop. We continue with an overview of available architectures. The presentation concludes with a hype-free survey of available open source and commercial software for advanced analytics in Hadoop.
Bio
Thomas W. Dinsmore is Director of Product Management for Revolution Analytics, a company that provides commercial support and services for open source R. In this role, Mr. Dinsmore closely tracks the market for commercial and open source software on all platforms, including Hadoop. Prior to joining Revolution Analytics, Mr. Dinsmore served as an Analytics Solution Architect for IBM Big Data, and as a Principal Consultant for Razorfish and SAS.
Mr. Dinsmore has hands-on experience with leading commercial and open source tools for advanced analytics, including SAS, SPSS, R, and Oracle Data Mining, across a range of platforms including Hadoop, Netezza, Teradata and Oracle. He is certified in SAS 9.
In his career, Mr. Dinsmore has worked with more than 500 enterprises in the United States, Canada, Mexico, Venezuela, Chile, Brazil, the United Kingdom, Belgium, Italy, Turkey, Israel, Malaysia and Singapore.
The document discusses how to use the R programming language and Amazon's Elastic MapReduce service to quickly create a Hadoop cluster on Amazon Web Services in only 15 minutes. It demonstrates running a stochastic simulation to estimate pi by distributing 1,000 simulations across the Hadoop cluster and combining the results. The total cost of running the 15 minute cluster was only $0.15, showing how inexpensive it can be to leverage Hadoop's capabilities.
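The cluster setup itself is not reproduced here, but the core simulation pattern, 1,000 independent runs combined at the end, can be sketched in plain R. In this sketch, parallel::mclapply (Unix-alikes) stands in for the Hadoop cluster used in the document:

    # Sketch: estimate pi by distributing 1,000 stochastic simulations and
    # combining the results; on EMR the lapply would fan out over the cluster.
    library(parallel)

    one_sim <- function(i, n = 1e5) {
      x <- runif(n); y <- runif(n)
      mean(x^2 + y^2 <= 1)   # fraction of points inside the quarter circle
    }

    fractions <- mclapply(1:1000, one_sim, mc.cores = detectCores())
    4 * mean(unlist(fractions))   # ~3.1416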
27 Aug 2013 Webinar: High Performance Predictive Analytics in Hadoop and R, presented by Mario E. Inchiosa, PhD, US Data Scientist, and Kathleen Rohrecker, Director of Product Marketing
This is a reduced PDF version of the hardcover book available at http://www.lulu.com/shop/jeffrey-strickland/predictive-analytics-using-r/hardcover/product-22000910.html, at a 40% discount. It will soon be available on Amazon.
12Nov13 Webinar: Big Data Analysis with Teradata and Revolution Analytics (Revolution Analytics)
Revolution R Enterprise is a big data analytics platform based on the open source statistical programming language R. It allows for high performance, scalable analytics on large datasets across enterprise platforms. The presentation discusses Revolution R Enterprise and how it addresses challenges with big data and accelerating analytics, including data volume, complex computation, enterprise readiness, and production efficiency. It also highlights how Revolution R Enterprise integrates with Teradata to enable in-database analytics for further performance improvements.
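As a rough illustration of the in-database pattern the presentation describes, here is a hedged sketch using RevoScaleR's Teradata constructs; the connection string, paths, table, and columns are all invented placeholders:

    # Sketch: push model fitting into Teradata so the data never leaves the
    # warehouse. All connection details and names are illustrative.
    library(RevoScaleR)

    tdConn <- "DRIVER=Teradata;DBCNAME=tdhost;UID=user;PWD=pass"

    # Compute context: rx* functions now execute inside the Teradata appliance.
    rxSetComputeContext(RxInTeradata(
      connectionString = tdConn,
      shareDir         = "/tmp/revoShare",   # local scratch dir (placeholder)
      remoteShareDir   = "/tmp/revoShare",   # scratch dir on the appliance
      revoPath         = "/usr/lib64/Revo/R/lib64/R"  # R location on the nodes
    ))

    loans <- RxTeradata(table = "loan_history", connectionString = tdConn)

    # Logistic regression fitted in-database, in parallel across the AMPs.
    model <- rxLogit(default ~ credit_score + ltv + dti, data = loans)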
Introducing Trillium DQ for Big Data: Powerful Profiling and Data Quality for... (Precisely)
The advanced analytics and AI that run today’s businesses rely on a larger volume, and greater variety, of data. This data needs to be of the highest quality to ensure the best possible outcomes, but traditional data quality tools weren’t designed for today’s modern data environments.
That’s why we’ve developed Trillium DQ for Big Data -- an integrated product that delivers industry-leading data profiling and data quality at scale, in the cloud or on premises.
In this on-demand webcast, you will learn how Trillium DQ:
• Empowers data analysts to easily profile large, diverse data sources to discover new insights, uncover issues, and report on their findings – all without involving IT.
• Delivers best-in-class entity resolution to support mission-critical applications such as Customer 360, fraud detection, AML, and predictive analytics.
• Supports Cloud and hybrid architectures by providing consistent high-performance processing within critical time windows on all platforms.
• Keeps enterprise data lakes validated, clean, and trusted with the highest quality data – without technical expertise in big data or distributed architectures.
• Enables data quality monitoring based on targeted business rules for data governance and business insight.
In this slidedeck, Infochimps Director of Product, Tim Gasper, discusses how Infochimps tackles business problems for customers by deploying a comprehensive Big Data infrastructure in days; sometimes in just hours. Tim unlocks how Infochimps is now taking that same aggressive approach to deliver faster time to value by helping customers develop analytic applications with impeccable speed.
Challenges of Operationalising Data Science in Production (iguazio)
The presentation topic for this meet-up was covered in two sections, without any breaks in between.
Section 1: Business Aspects (20 mins)
Speaker: Rasmi Mohapatra, Product Owner, Experian
https://www.linkedin.com/in/rasmi-m-428b3a46/
Once your data science application is in production, there are many typical data science operational challenges experienced today, across business domains; we will cover a few of these challenges with example scenarios.
Section 2: Tech Aspects (40 mins, slides & demo, Q&A )
Speaker: Santanu Dey, Solution Architect, Iguazio
https://www.linkedin.com/in/santanu/
In this part of the talk, we will cover how these operational challenges can be overcome, e.g. automating data collection & preparation, making ML models portable and deploying them in production, monitoring and scaling, etc., with relevant demos.
M Chambers and RapidMiner Overview for Babson class (mcAnalytics99)
RapidMiner is a modern analytics platform that enables anyone to leverage big data and accelerate time-to-value. Unlike traditional analytics providers, RapidMiner allows users of any skill level to make the most of all data in all environments. It provides a code-free interface that is built by data scientists for data scientists, business analysts, and developers to simplify analytics. RapidMiner also utilizes a knowledge base of analytic best practices and machine learning to empower users to become data science heroes.
The document discusses optimizing a data warehouse by offloading some workloads and data to Hadoop. It identifies common challenges with data warehouses like slow transformations and queries. Hadoop can help by handling large-scale data processing, analytics, and long-term storage more cost effectively. The document provides examples of how customers benefited from offloading workloads to Hadoop. It then outlines a process for assessing an organization's data warehouse ecosystem, prioritizing workloads for migration, and developing an optimization plan.
Architect’s Open-Source Guide for a Data Mesh Architecture (Databricks)
Data Mesh is an innovative concept addressing many data challenges from an architectural, cultural, and organizational perspective. But is the world ready to implement Data Mesh?
In this session, we will review the importance of core Data Mesh principles, what they can offer, and when it is a good idea to try a Data Mesh architecture. We will discuss common challenges with the implementation of Data Mesh systems and focus on the role of open-source projects in it. Projects like Apache Spark can play a key part in implementing a standardized infrastructure platform for Data Mesh. We will examine the landscape of useful data engineering open-source projects to utilize in several areas of a Data Mesh system in practice, along with an architectural example. We will touch on what work (culture, tools, mindset) needs to be done to ensure Data Mesh is more accessible for engineers in the industry.
The audience will leave with a good understanding of the benefits of Data Mesh architecture, common challenges, and the role of Apache Spark and other open-source projects for its implementation in real systems.
This session is targeted for architects, decision-makers, data-engineers, and system designers.
Building a Modern Analytic Database with Cloudera 5.8 (Cloudera, Inc.)
This document discusses building a modern analytic database with Cloudera. It outlines Marketing Associates' evaluation of solutions to address challenges around managing massive and diverse data volumes. They selected Cloudera Enterprise to enable self-service BI and real-time analytics at lower costs than traditional databases. The solution has provided scalability, cost savings of over 90%, and improved security and compliance. Future roadmaps for Cloudera's analytic database include faster SQL, improved multitenancy, and deeper BI tool integration.
Watch this webinar in full here: https://buff.ly/2MVTKqL
Self-Service BI promises to remove the bottleneck that exists between IT and business users. The truth is, if data is handed over to a wide range of data consumers without proper guardrails in place, it can result in data anarchy.
Attend this session to learn why data virtualization:
• Is a must for implementing the right self-service BI
• Makes self-service BI useful for every business user
• Accelerates any self-service BI initiative
Productionizing Hadoop: 7 Architectural Best Practices (MapR Technologies)
The document discusses 7 architectural best practices for productionizing Hadoop: experience, availability, performance, scalability, adaptability, security, and economy. It defines each quality and provides examples to illustrate how to achieve each one. The key message is that while big data is about innovation, productionizing analytics is critical to realize business value from all the data. Architectural best practices can help systems meet expectations for usefulness, uptime, speed, flexibility, security, and cost-efficiency as big data implementations scale up.
Customer value analysis of big data products (Vikas Sardana)
Business value analysis through a Customer Value Model for software technology choices, with a case study from the mobile advertising industry for a Big Data use case.
The document provides an overview of IBM's BigInsights product. It discusses how BigInsights can help businesses gain insights from large, complex datasets through features like built-in text analytics, SQL support, spreadsheet-style analysis, and accelerators for domain-specific analytics like social media. The document also summarizes capabilities of BigInsights like Big SQL, Big Sheets, Big R, and its text analytics engine that allow businesses to explore, analyze, and model large datasets.
It is a fascinating, explosive time for enterprise analytics.
It is from the position of analytics leadership that the mission will be executed and company leadership will emerge. The data professional is absolutely sitting on the performance of the company in this information economy and has an obligation to demonstrate the possibilities and originate the architecture, data, and projects that will deliver analytics. After all, no matter what business you’re in, you’re in the business of analytics.
The coming years will be full of big changes in enterprise analytics and Data Architecture. William will kick off the fourth year of the Advanced Analytics series with a discussion of the trends winning organizations should build into their plans, expectations, vision, and awareness now.
Bridging the Gap: from Data Science to Production (Florian Wilhelm)
A recent but quite common observation in industry is that although there is overall high adoption of data science, many companies struggle to get it into production. Huge teams of well-paid data scientists often present one fancy model after another to their managers, but their proofs of concept never manifest into something business-relevant. The frustration grows on both sides, managers and data scientists alike.
In my talk I elaborate on the many reasons why data science to production is such a hard nut to crack. I start with a taxonomy of data use cases in order to more easily assess technical requirements. Based thereon, my focus lies on overcoming the two-language problem: Python/R, loved by data scientists, vs. the enterprise-established Java/Scala. From my project experience I present three different solutions, namely 1) migrating to a single language, 2) reimplementation and 3) usage of a framework. The advantages and disadvantages of each approach are presented, and general advice based on the introduced taxonomy is given.
Additionally, my talk addresses organisational problems as well as problems in quality assurance and deployment. Best practices and further references are presented at a high level in order to cover all facets of data science to production.
With my talk I hope to convey the message that breakdowns on the road from data science to production are the rule rather than the exception, so you are not alone. At the end of my talk, you will have a better understanding of why your team and you are struggling and what to do about it.
Why Your Data Science Architecture Should Include a Data Virtualization Tool ... (Denodo)
Watch full webinar here: https://bit.ly/35FUn32
Presented at CDAO New Zealand
Advanced data science techniques, like machine learning, have proven to be extremely useful tools for deriving valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala, put advanced techniques at the fingertips of data scientists.
However, most architectures laid out to enable data scientists miss two key challenges:
- Data scientists spend most of their time looking for the right data and massaging it into a usable format
- Results and algorithms created by data scientists often stay out of the reach of regular data analysts and business users
Watch this session on-demand to understand how data virtualization offers an alternative that addresses these issues and can accelerate data acquisition and massaging, along with a customer story on the use of Machine Learning with data virtualization.
Presented to eRum (Budapest), May 2018
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe the doAzureParallel package, a backend to the "foreach" package that automates the process of spawning a cluster of virtual machines in the Azure cloud to process iterations in parallel. This will include an example of optimizing hyperparameters for a predictive model using the "caret" package.
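Based on the package's documented workflow, the pattern looks roughly like this (file names and the iteration body are illustrative; the JSON files are generated templates you edit with your Azure account details):

    # Sketch of the doAzureParallel workflow: a foreach backend that runs
    # %dopar% iterations on a pool of Azure VMs.
    library(foreach)
    library(doAzureParallel)

    setCredentials("credentials.json")      # Azure Batch + storage keys
    generateClusterConfig("cluster.json")   # template: VM size, node count, ...
    cluster <- makeCluster("cluster.json")  # provisions the VM pool in Azure
    registerDoAzureParallel(cluster)        # %dopar% now fans out to Azure

    # 100 independent iterations are distributed across the cluster's workers.
    results <- foreach(i = 1:100, .combine = c) %dopar% {
      mean(rnorm(1e6))   # toy stand-in for a simulation or CV fold
    }

    stopCluster(cluster)

Because it is just a foreach backend, existing %dopar% code (including caret's internal parallelism) can switch from a local cluster to Azure by changing only the registration step.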
By David Smith. Presented at Microsoft Build (Seattle), May 7 2018.
Your data scientists have created predictive models using open-source tools, proprietary software, or some combination of both, and now you are interested in lifting and shifting those models to the cloud. In this talk, I'll describe how data scientists can transition their existing workflows — while using mostly the same tools and processes — to train and deploy machine learning models based on open source frameworks to Azure. I'll provide guidance on keeping connections to data sources up-to-date, evaluating and monitoring models, and deploying applications that make use of those models.
Presentation delivered by David Smith to NY R Conference https://www.rstats.nyc/, April 2018:
Minecraft is an open-world creativity game, and a hit with kids. To get kids interested in learning to program with R, we created the "miner" package. This package is a collection of simple functions that allow you to connect with a Minecraft instance, manipulate the world within by creating blocks and controlling the player, and to detect events within the world and react accordingly.
The miner package is intended mainly for kids, to inspire them to learn R while playing Minecraft. But the development of the package also provides some useful insights into how to build an R package to interface with a persistent API, and how to instruct others on its use. In this talk I'll describe how to set up your own Minecraft server, and how to use and extend the package. I'll also provide a few examples of the package in action in a live Minecraft session.
While Python is a widely-used tool for AI development, in this talk I'll make the case for considering R as a platform for developing models for intelligent applications. Firstly, R provides a first-class experience for working with deep learning frameworks through its keras integration. Equally importantly, it provides the most comprehensive suite of statistical data analysis tools, which are extremely useful for many intelligent applications such as transfer learning. I'll give a few high-level examples in this talk, and we'll go into further detail in the accompanying interactive code lab.
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
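A minimal sketch of that sparklyr pattern (the master URL and data are illustrative; a real workload would read from cluster storage rather than copying a local data frame):

    # Sketch: distributed dplyr via sparklyr on a provisioned cluster.
    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "yarn-client")   # or "local" for testing

    # Ship a small local data frame to the cluster for demonstration.
    mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

    # Ordinary dplyr syntax, translated to Spark SQL and run on the cluster;
    # collect() brings only the small result back to the R session.
    mtcars_tbl %>%
      group_by(cyl) %>%
      summarise(n = n(), avg_mpg = mean(mpg)) %>%
      collect()

    spark_disconnect(sc)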
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
The R Ecosystem consists of the R Foundation, which oversees the R programming language and its core development; the R Core Group, which maintains the R software; and CRAN, which distributes R packages. A large contributor and user community provides documentation, blogs, user groups, and additional software and services that support the widespread use of R.
A look at the changing perceptions of R, from the early days of the R project to today. Microsoft sponsor talk, presented by David Smith to the useR!2017 conference in Brussels, July 5 2017.
Predicting Loan Delinquency at One Million Transactions per Second (Revolution Analytics)
Real-time applications of predictive models must be able to generate predictions at the rate that transactions are generated. Previously, such applications of models trained using R needed to be converted to other languages like C++ or Java to achieve the required throughput. In this talk, I’ll describe how to use the in-database R processing capabilities of Microsoft R Server to detect fraud in a SQL Server database of loan records at a rate exceeding one million transactions per second. I will also show the process of training the underlying gradient-boosted tree model on a large training set using the out-of-memory algorithms of Microsoft R.
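A hedged sketch of that pattern with RevoScaleR (table, column, and connection names are invented, and this is not the actual benchmark code): train a gradient-boosted tree model out-of-memory, then move the compute context into SQL Server so scoring runs next to the data.

    # Sketch: train with RevoScaleR's out-of-memory boosted trees, then
    # score in-database. All names and connection details are illustrative.
    library(RevoScaleR)

    connStr  <- "Driver=SQL Server;Server=.;Database=loans;Trusted_Connection=yes"
    train    <- RxSqlServerData(table = "loan_train",  connectionString = connStr)
    scoreIn  <- RxSqlServerData(table = "loan_stream", connectionString = connStr)
    scoreOut <- RxSqlServerData(table = "loan_scores", connectionString = connStr)

    # Gradient-boosted trees over the full training table.
    model <- rxBTrees(delinquent ~ fico + ltv + dti + rate, data = train,
                      lossFunction = "bernoulli")

    # Switch the compute context so predictions execute inside SQL Server.
    rxSetComputeContext(RxInSqlServer(connectionString = connStr))
    rxPredict(model, data = scoreIn, outData = scoreOut)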
Presented by David Smith at The Data Science Summit, Chicago, April 20 2017.
The ability to independently reproduce results is a critical issue within the scientific community today, and is equally important for collaboration and compliance in business. In this talk, I'll introduce several features available in R that help you make reproducibility a standard part of your data science workflow. The talk will include tips on working with data and files, combining code and output, and managing R's changing package ecosystem.
Presented by David Smith, R Community Lead (Microsoft), at Monktoberfest October 2016.
The value of open source isn’t just in the software itself. The communities that form around open source software provide just as much value and sometimes even more: in ongoing development, in documentation, in support, in marketing, and as a supply of ready-trained employees. Companies who build on open source tend to focus on the software, but neglect communities at their peril.
In this talk, I share some of my experiences in building community for an open-source software company, Revolution Analytics, and perspectives since the acquisition by Microsoft in 2015.
R is more than just a language. Many of the reasons why R has become such a popular tool for data science come from the ecosystem surrounding the R project. R users benefit from the many resources and packages created by the community, while commercial companies (including Microsoft) provide tools to extend and support R, and services to help people use R.
In this talk, I will give an overview of the R Ecosystem and describe how it has been a critical component of R’s success, and include several examples of Microsoft’s contributions to the ecosystem.
(Presented to EARL London, September 2016)
(Presented by David Smith at useR!2016, June 2016. Recording: https://channel9.msdn.com/Events/useR-international-R-User-conference/useR2016/R-at-Microsoft )
Since the acquisition of Revolution Analytics in April 2015, Microsoft has embarked upon a project to build R technology into many Microsoft products, so that developers and data scientists can use the R language and R packages to analyze data in their data centers and in cloud environments.
In this talk I will give an overview (and a demo or two) of how R has been integrated into various Microsoft products. Microsoft data scientists are also big users of R, and I'll describe a couple of examples of R being used to analyze operational data at Microsoft. I'll also share some of my experiences in working with open source projects at Microsoft, and my thoughts on how Microsoft works with open source communities including the R Project.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
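One way the "simple R function" step works in Microsoft R Server is the mrsdeploy package; a hedged sketch follows (endpoint, credentials, and the toy model are placeholders, not the talk's actual demo):

    # Sketch: operationalize an R model as a web service with mrsdeploy.
    library(mrsdeploy)

    remoteLogin("http://localhost:12800", username = "admin", password = "***")

    model <- lm(dist ~ speed, data = cars)   # toy model standing in for yours

    scoreFn <- function(speed) {
      predict(model, data.frame(speed = speed))
    }

    # One function call turns the model plus scoring function into a
    # versioned, callable web service.
    api <- publishService(
      name    = "stopping_distance",
      code    = scoreFn,
      model   = model,
      inputs  = list(speed = "numeric"),
      outputs = list(answer = "numeric"),
      v       = "v1.0.0"
    )

    api$scoreFn(21)   # invoke the deployed service from R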
The document discusses Revolution Analytics, a company that provides analytics software and services based on the open source R language. It was acquired by Microsoft to help customers use advanced analytics within Microsoft data platforms. The document provides overviews of R, data science in the cloud using Azure, connecting R to SQL, solving scalability issues with Revolution R Enterprise (RRE), using R in SQL Server, and moving analytics workflows to the cloud.
The document compares the CRAN and BioConductor package dependency networks in terms of their topological properties. It finds that CRAN has more nodes and edges than BioConductor, but BioConductor is more densely connected. A statistical test shows the networks have significantly different degree distributions, though both approximately follow power laws.
This document discusses using graph analysis and PageRank algorithms to analyze the network structure and popularity of R packages over time. It contains the steps to create a dependency graph from package metadata, plot and export the graph, and shows the top 10 most popular packages according to PageRank in 2012 and 2015. Further analysis of the network structure is suggested.
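A compact sketch of those steps with igraph (run against today's CRAN, so the rankings will differ from the document's 2012 and 2015 snapshots):

    # Sketch: build a package dependency graph from CRAN metadata and rank
    # packages with PageRank. Requires a network connection to a CRAN mirror.
    library(igraph)

    pkgs <- available.packages()
    deps <- tools::package_dependencies(rownames(pkgs), db = pkgs,
                                        which = c("Depends", "Imports"))

    # Edge "A -> B" means package A depends on package B, so heavily
    # depended-upon packages accumulate PageRank.
    edges <- do.call(rbind, lapply(names(deps), function(p) {
      if (length(deps[[p]])) cbind(from = p, to = deps[[p]])
    }))

    g  <- graph_from_data_frame(as.data.frame(edges))
    pr <- page_rank(g)$vector

    head(sort(pr, decreasing = TRUE), 10)   # top 10 packages by PageRank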
Checkpoint provides a simple way to ensure reproducible results in R. It works by adding two lines to an R script that install the checkpoint package and specify a date. Checkpoint will then install all packages used in the script to the versions that were available on that date. This allows sharing of R code and ensuring others can reproduce results even if package versions change later. Checkpoint manages dependencies and installation of correct package versions for reproducibility across systems and time.
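The two-line pattern the summary describes, with an illustrative script body:

    # Sketch: the date pins every package used in this script to its CRAN
    # snapshot from that day, installing matching versions if needed.
    library(checkpoint)
    checkpoint("2015-04-26")

    # From here on, library() calls resolve to the 2015-04-26 versions, so
    # the script reproduces the same results on other machines and dates.
    library(ggplot2)
    ggplot(cars, aes(speed, dist)) + geom_point()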
This document summarizes R at Microsoft, including its acquisition of Revolution Analytics. Key points include:
- R is a widely used open-source statistical programming language. Microsoft acquired Revolution Analytics to help customers use advanced analytics within its data platforms.
- Microsoft products like SQL Server will integrate Revolution R Open, an enhanced open-source R distribution, to allow running R scripts directly from SQL queries.
- Microsoft aims to make R and advanced analytics more accessible and scalable through products like the Machine Learning marketplace and by running R on servers to handle large datasets within SQL Server and Azure.
How Netflix Builds High Performance Applications at Global Scale (ScyllaDB)
We all want to build applications that are blazingly fast. We also want to scale them to users all over the world. Can the two happen together? Can users in the slowest of environments also get a fast experience? Learn how we do this at Netflix: how we understand every user's needs and preferences and build high performance applications that work for every user, every time.
How to Avoid Learning the Linux-Kernel Memory Model (ScyllaDB)
The Linux-kernel memory model (LKMM) is a powerful tool for developing highly concurrent Linux-kernel code, but it also has a steep learning curve. Wouldn't it be great to get most of LKMM's benefits without the learning curve?
This talk will describe how to do exactly that by using the standard Linux-kernel APIs (locking, reference counting, RCU) along with simple rules of thumb, thus gaining most of LKMM's power with less learning. And the full LKMM is always there when you need it!
Coordinate Systems in FME 101 - Webinar Slides (Safe Software)
If you’ve ever had to analyze a map or GPS data, chances are you’ve encountered and even worked with coordinate systems. As historical data continually updates through GPS, understanding coordinate systems is increasingly crucial. However, not everyone knows why they exist or how to effectively use them for data-driven insights.
During this webinar, you’ll learn exactly what coordinate systems are and how you can use FME to maintain and transform your data’s coordinate systems in an easy-to-digest way, accurately representing the geographical space that it exists within. You will have the chance to:
- Enhance Your Understanding: Gain a clear overview of what coordinate systems are and their value
- Learn Practical Applications: Why we need datums and projections, plus how units differ between coordinate systems
- Maximize with FME: Understand how FME handles coordinate systems, including a brief summary of the 3 main reprojectors
- Custom Coordinate Systems: Learn how to work with FME and coordinate systems beyond what is natively supported
- Look Ahead: Gain insights into where FME is headed with coordinate systems in the future
Don’t miss the opportunity to improve the value you receive from your coordinate system data, ultimately allowing you to streamline your data analysis and maximize your time. See you there!
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em... (Erasmo Purificato)
Slides of the tutorial entitled "Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Emerging Trends", held at UMAP'24: 32nd ACM Conference on User Modeling, Adaptation and Personalization (July 1, 2024 | Cagliari, Italy)
Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats (anupriti)
In the rapidly evolving landscape of blockchain technology, the advent of quantum computing poses unprecedented challenges to traditional cryptographic methods. As quantum computing capabilities advance, the vulnerabilities of current cryptographic standards become increasingly apparent.
This presentation, "Navigating Post-Quantum Blockchain: Resilient Cryptography in Quantum Threats," explores the intersection of blockchain technology and quantum computing. It delves into the urgent need for resilient cryptographic solutions that can withstand the computational power of quantum adversaries.
Key topics covered include:
An overview of quantum computing and its implications for blockchain security.
Current cryptographic standards and their vulnerabilities in the face of quantum threats.
Emerging post-quantum cryptographic algorithms and their applicability to blockchain systems.
Case studies and real-world implications of quantum-resistant blockchain implementations.
Strategies for integrating post-quantum cryptography into existing blockchain frameworks.
Join us as we navigate the complexities of securing blockchain networks in a quantum-enabled future. Gain insights into the latest advancements and best practices for safeguarding data integrity and privacy in the era of quantum threats.
Scaling Connections in PostgreSQL - Postgres Bangalore (PGBLR) Meetup-2 (Mydbops)
This presentation, delivered at the Postgres Bangalore (PGBLR) Meetup-2 on June 29th, 2024, dives deep into connection pooling for PostgreSQL databases. Aakash M, a PostgreSQL Tech Lead at Mydbops, explores the challenges of managing numerous connections and explains how connection pooling optimizes performance and resource utilization.
Key Takeaways:
* Understand why connection pooling is essential for high-traffic applications
* Explore various connection poolers available for PostgreSQL, including pgbouncer
* Learn the configuration options and functionalities of pgbouncer
* Discover best practices for monitoring and troubleshooting connection pooling setups
* Gain insights into real-world use cases and considerations for production environments
This presentation is ideal for:
* Database administrators (DBAs)
* Developers working with PostgreSQL
* DevOps engineers
* Anyone interested in optimizing PostgreSQL performance
Contact info@mydbops.com for PostgreSQL Managed, Consulting and Remote DBA Services
Fluttercon 2024: Showing that you care about security - OpenSSF Scorecards fo... (Chris Swan)
Have you noticed the OpenSSF Scorecard badges on the official Dart and Flutter repos? It's Google's way of showing that they care about security. Practices such as pinning dependencies, branch protection, required reviews, continuous integration tests etc. are measured to provide a score and accompanying badge.
You can do the same for your projects, and this presentation will show you how, with an emphasis on the unique challenges that come up when working with Dart and Flutter.
The session will provide a walkthrough of the steps involved in securing a first repository, and then what it takes to repeat that process across an organization with multiple repos. It will also look at the ongoing maintenance involved once scorecards have been implemented, and how aspects of that maintenance can be better automated to minimize toil.
Video traffic on the Internet is constantly growing; networked multimedia applications consume a predominant share of the available Internet bandwidth. A major technical breakthrough and enabler in multimedia systems research and of industrial networked multimedia services certainly was the HTTP Adaptive Streaming (HAS) technique. This resulted in the standardization of MPEG Dynamic Adaptive Streaming over HTTP (MPEG-DASH) which, together with HTTP Live Streaming (HLS), is widely used for multimedia delivery in today’s networks. Existing challenges in multimedia systems research deal with the trade-off between (i) the ever-increasing content complexity, (ii) various requirements with respect to time (most importantly, latency), and (iii) quality of experience (QoE). Optimizing towards one aspect usually negatively impacts at least one of the other two aspects if not both. This situation sets the stage for our research work in the ATHENA Christian Doppler (CD) Laboratory (Adaptive Streaming over HTTP and Emerging Networked Multimedia Services; https://athena.itec.aau.at/), jointly funded by public sources and industry. In this talk, we will present selected novel approaches and research results of the first year of the ATHENA CD Lab’s operation. We will highlight HAS-related research on (i) multimedia content provisioning (machine learning for video encoding); (ii) multimedia content delivery (support of edge processing and virtualized network functions for video networking); (iii) multimedia content consumption and end-to-end aspects (player-triggered segment retransmissions to improve video playout quality); and (iv) novel QoE investigations (adaptive point cloud streaming). We will also put the work into the context of international multimedia systems research.
Performance Budgets for the Real World by Tammy Everts (ScyllaDB)
Performance budgets have been around for more than ten years. Over those years, we’ve learned a lot about what works, what doesn’t, and what we need to improve. In this session, Tammy revisits old assumptions about performance budgets and offers some new best practices. Topics include:
• Understanding performance budgets vs. performance goals
• Aligning budgets with user experience
• Pros and cons of Core Web Vitals
• How to stay on top of your budgets to fight regressions
Sustainability requires ingenuity and stewardship. Did you know Pigging Solutions pigging systems help you achieve your sustainable manufacturing goals AND provide rapid return on investment?
How? Our systems recover over 99% of product in transfer piping. Recovering trapped product from transfer lines that would otherwise become flush waste means you can increase batch yields and eliminate flush waste. From raw materials to finished product, if you can pump it, we can pig it.
AC Atlassian Coimbatore Session Slides (22/06/2024) (apoorva2579)
These are the combined sessions from the ACE Atlassian Coimbatore event held on 22nd June 2024.
The session order is as follows:
1. AI and future of help desk by Rajesh Shanmugam
2. Harnessing the power of GenAI for your business by Siddharth
3. Fallacies of GenAI by Raju Kandaswamy
How RPA Help in the Transportation and Logistics Industry.pptx (SynapseIndia)
Revolutionize your transportation processes with our cutting-edge RPA software. Automate repetitive tasks, reduce costs, and enhance efficiency in the logistics sector with our advanced solutions.
Are you interested in dipping your toes in the cloud native observability waters, but as an engineer you are not sure where to get started with tracing problems through your microservices and application landscapes on Kubernetes? Then this is the session for you, where we take you on your first steps in an active open-source project that offers a buffet of languages, challenges, and opportunities for getting started with telemetry data.
The project is called OpenTelemetry, but before diving into the specifics, we'll start by de-mystifying key concepts and terms such as observability, telemetry, instrumentation, cardinality, and percentiles to lay a foundation. After understanding the nuts and bolts of observability and distributed traces, we'll explore the OpenTelemetry community: its Special Interest Groups (SIGs), its repositories, and how to become not only an end user but possibly a contributor. We will wrap up with an overview of the components in this project, such as the Collector, the OpenTelemetry protocol (OTLP), its APIs, and its SDKs.
Attendees will leave with an understanding of key observability concepts, become grounded in distributed tracing terminology, be aware of the components of OpenTelemetry, and know how to take their first steps toward an open-source contribution!
Key Takeaways: Open source, vendor-neutral instrumentation is an exciting new reality as the industry standardizes on OpenTelemetry for observability. OpenTelemetry is on a mission to enable effective observability by making high-quality, portable telemetry ubiquitous. The world of observability and monitoring today has a steep learning curve, and in order to achieve ubiquity, the project would benefit from growing its contributor community.
An invited talk given by Mark Billinghurst on Research Directions for Cross Reality Interfaces. This was given on July 2nd 2024 as part of the 2024 Summer School on Cross Reality in Hagenberg, Austria (July 1st - 7th)
3. Today's Challenge: Accelerating Business Cadence
Changing Business Environment
• Fact-Based Decisions Require More Data
• Need to Understand Tradeoffs and Best Course of Action
• Predictive Models Need to Continually Deliver Lift
• Reduced Shelf Life for Predictive Models
Faster Time to Value
• Reduce Analytic Cycle Time
• Build & Deploy Models Faster
• Eliminate Time-Consuming Data Movements
Rapid Customer-Facing Decisions
• Score More Frequently
• Need to Make the Best Decision in Real Time
5. Typical Technology Challenges Our Customers Face
Big Data
• New Data Sources
• Data Variety & Velocity
• Fine-Grain Control
• Data Movement, Memory Limits
Complex Computation
• Experimentation
• Many Small Models
• Ensemble Models
• Simulation
Enterprise Readiness
• Heterogeneous Landscape
• Write Once, Deploy Anywhere
• Skill Shortage
• Production Support
Production Efficiency
• Shorter Model Shelf Life
• Volume of Models
• Long End-to-End Cycle Time
• Accelerated Pace of Decisions
12. Big Data Big Analytics Use Cases
One Big Model
• Build predictive models with (very) large datasets
• More rows/observations and/or more columns/features
• Tend to use dimension reduction, machine learning, and/or ensemble techniques
Big Data Scoring
• Score and predict with (very) large datasets using a previously built model
• Score in batch or on individual transactions
• The previously built model may be exported from the model-build to the model-deployment environment
Many Small Models (see the sketch after this list)
• Model factories build predictive models in quantity
• Automated building of individualized models and/or parallel individualized model execution
Scoring Many Models
• Score and predict with many individualized models
• Production model factories require model management
Computationally Intensive Analytics
• Analytic models that are mathematically intense
• May not use large data sets but generate a lot of interim calculations
• May include vectorization, simulation, optimization
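Since "many small models" is essentially split-apply-combine over segments, a toy sketch in plain R illustrates the pattern (the mtcars example and all names here are mine, not from the deck; a production model factory would add the model management noted above):

```r
# Toy "model factory": one model per segment, then per-segment scoring.
# Illustrative only; a real factory adds management, versioning, parallelism.
segments <- split(mtcars, mtcars$cyl)     # one training set per segment
models <- lapply(segments, function(d) lm(mpg ~ wt + hp, data = d))
scores <- Map(predict, models, segments)  # score each segment with its own model
str(scores, max.level = 1)
```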
13. Big Data Big Analytics: Specialized Use Cases
Time Series Analytics
• Build forecasts with time-sequenced data
• For Big Data, these tend to be many small models, especially with machine data
• Due to typical Big Data volumes, requires model management
Text and Document Analytics
• Use of unstructured, free text
• For Big Data, typically used to enhance structured predictive analytics
• Minimally requires text-processing tools and may also require natural language processing
Mining Data Streams
• Analyzing continuous, high-speed data flows for patterns and acting upon the patterns in real time
• Requires specialized sampling and filtering techniques
• Uses distinct discovery analytics methods such as frequent itemsets or clustering
Zero Latency
• No separation of model building and model scoring
• As real-time data becomes more widely available, this emerging category reduces time-to-insight
14. Analytic Reference Architecture
[Diagram: a four-layer stack: Decision (Analytic Applications); Integration (Middleware); Analytics (Analytics Development Tools & Platforms); Data (Hadoop, Data Warehouse, Other Data Sources).]
15. Architectural Approaches to Analytics
[Diagram: two side-by-side stacks. In the Beside Architecture, the Decision (Analytic Applications), Integration (Middleware), and Analytics (Analytics Development Tools & Platforms with a Local Data Mart) layers sit apart from the Data layer (Data Sources). In the Inside Architecture, the Analytics Development Tools & Platforms run within a combined Data+Analytics layer on the data sources themselves.]
16. Pros & Cons of Architectural Approaches
Beside Architecture
• Analytic workflow tasks are performed in a separate analytics environment outside of the source database
• Pros: Segregates the analytic workload
• Cons: Doesn't leverage the production platform for transformations; introduces scoring latencies
Inside Architecture
• Analytics workflow tasks are performed inside the source database with embedded analytics
• Pros: Eliminates data movement, reduces model latency, allows exploration of all data
• Cons: IT governance on production; potentially new skills required
Hybrid Architecture
• Some analytic workflow tasks are performed inside the source database and others in a separate analytics environment
• Pros: Leverages the strengths of each architecture
• Cons: Multiple environments to maintain
17. Building & Deploying Analytic Models
[Diagram: the Beside, Inside, and Hybrid architectures annotated with numbered model-lifecycle steps. Legend: Data Prep / Marshaling; Model Build; Model Deploy; Model Recode / PMML; Update Data. The original figure maps these stages onto each architecture's components (Analytics Development Tools & Platforms, Local Data Mart, Data+Analytics, Data Sources).]
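To make the "Model Recode / PMML" step concrete: a model built in a beside environment can be exported as PMML for deployment elsewhere. A minimal sketch using the open source pmml package (my example, not from the deck; assumes the pmml and XML packages are installed):

```r
# Sketch: recode an R model as PMML so a separate scoring environment
# can deploy it without re-implementing the model. Illustrative only.
library(pmml)                                # also attaches the XML package it builds on
fit <- lm(mpg ~ wt + hp, data = mtcars)      # model built "beside"
saveXML(pmml(fit), file = "mpg_model.pmml")  # hand off to the deployment environment
```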
24. What is the R Language?
A Platform…
• A Procedural Language for Stats, Math and Data Science
• A Complete Data Visualization Framework
• Provided as Open Source
A Community…
• 2M+ Users with the Skill to Tackle Big Data Statistical and Numerical Analysis and Machine Learning Projects
• Active User Groups Across the World
An Ecosystem…
• CRAN: 4500+ Freely Available Algorithms, Test Data and Evaluations
25. Revolution R Enterprise
Revolution R Enterprise is the only enterprise big data big analytics platform based on the open source R statistical computing language.
• Portable Across Enterprise Platforms
• High Performance, Scalable Analytics
• Easier to Build & Deploy
26. R is open source and drives analytic innovation, but it has some limitations for Enterprises
• Big Data: open source R is memory-bound; Revolution R Enterprise adds disk-based scalability
• Speed of Analysis: open source R is single-threaded; Revolution R Enterprise adds parallel threading
• Enterprise Readiness: open source R relies on community support; Revolution R Enterprise adds commercial support
• Analytic Breadth & Depth: open source R offers 4500+ innovative analytic packages; Revolution R Enterprise leverages those open source packages plus Big Data-ready packages
• Commercial Viability: deploying open source carries risk; Revolution R Enterprise is available under a commercial license
27. Introducing Revolution R Enterprise: The Big Data Big Analytics Platform
[Diagram: the Revolution R Enterprise component stack. R+CRAN and RevoR form the language interpreter and standard R algorithm suites; DevelopR and DeployR provide the development & deployment tooling; DistributedR, ConnectR, and ScaleR form the Big Data distributed execution platform.]
28. Big Data Speed @ Scale with Revolution R Enterprise
• Fast Math Libraries
• Parallelized Algorithms
• In-Database Execution
• Multi-Threaded Execution
• Multi-Core Processing
• In-Hadoop Execution
• Memory Management
• Parallelized User Code
First, we enhance and accelerate the open source R interpreter.
29. Open Source R Performance: Multi-Threaded Math
Computation (4-core laptop)       | Open Source R | Revolution R Enterprise | Speedup
Linear Algebra¹                   |               |                         |
  Matrix Multiply                 | 176 sec       | 9.3 sec                 | 18x
  Cholesky Factorization          | 25.5 sec      | 1.3 sec                 | 19x
  Linear Discriminant Analysis    | 189 sec       | 74 sec                  | 3x
General R Benchmarks²             |               |                         |
  R Benchmarks (Matrix Functions) | 22 sec        | 3.5 sec                 | 5x
  R Benchmarks (Program Control)  | 5.6 sec       | 5.4 sec                 | Not appreciable
1. http://www.revolutionanalytics.com/why-revolution-r/benchmarks.php
2. http://r.research.att.com/benchmarks/
Customers report 3-50x performance improvements compared to open source R, without changing any code.
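The linear-algebra gains come largely from RevoR linking R against multi-threaded math libraries. For context, a matrix-multiply timing of this kind can be reproduced on any R build (the matrix size below is my choice, not the one used in the cited benchmark):

```r
# Time a dense matrix multiply; elapsed time depends heavily on whether
# R is linked against a multi-threaded BLAS (as RevoR is) or reference BLAS.
set.seed(42)
n <- 2000
a <- matrix(rnorm(n * n), n, n)
b <- matrix(rnorm(n * n), n, n)
system.time(a %*% b)["elapsed"]
```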
30. Big Data Speed @ Scale with Revolution R Enterprise (continued)
[The same capability list as slide 28.]
Second, we built a platform for hosting R with Big Data on a variety of massively parallel platforms.
31. Revolution R Enterprise DistributedR
Innovative memory management, multi-threaded execution, and multi-core processing:
• A Revolution R Enterprise ScaleR analytic is given a data source as input.
• The analytic loops over the data, reading one block at a time.
• Blocks of data are read by a separate worker thread (Thread 0).
• Worker threads (Threads 1..n) process the data block from the previous iteration of the loop and update intermediate results objects in memory.
• When all of the data has been processed, the intermediate results objects are combined into a master results object.
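The same update-and-combine idea can be sketched in plain open source R. This is illustrative only, not the actual ScaleR or DistributedR internals (all names here are mine); it computes a mean over a file too large to load at once:

```r
# Block-wise mean: read a block, update intermediate results, repeat,
# then combine into the final (master) result. Single-threaded sketch;
# DistributedR additionally spreads block processing across threads/cores.
block_mean <- function(path, block_size = 1e6) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- 0; n <- 0                 # intermediate results object
  repeat {
    block <- scan(con, what = numeric(), n = block_size, quiet = TRUE)
    if (length(block) == 0) break    # end of data
    total <- total + sum(block)      # update step
    n <- n + length(block)
  }
  total / n                          # combine into master result
}
```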
33. SAS HPA Benchmarking Comparison*
Logistic Regression | SAS HPA      | Revolution R Enterprise
Rows of data        | 1 billion    | 1 billion
Parameters          | "just a few" | 7
Time                | 80 seconds   | 44 seconds
Data location       | In memory    | On disk
Nodes               | 32           | 5
Cores               | 384          | 20
RAM                 | 1,536 GB     | 80 GB
Revolution R is faster on the same amount of data, despite using approximately a 20th as many cores, a 20th as much RAM, a 6th as many nodes, and not pre-loading data into RAM. Revolution R Enterprise delivers this performance at roughly 2% of the cost.
*As published by SAS in HPC Wire, April 21, 2011
34. Revolution R Enterprise ScaleR: High Performance Big Data Analytics
Data Prep, Distillation & Descriptive Analytics
R Data Step
• Data import: delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation, Variance
• Correlation, Covariance
• Sum of Squares (cross-product matrix for set variables)
• Pairwise Cross Tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Statistical Tests
• Chi-Square Test
• Kendall Rank Correlation
• Fisher's Exact Test
• Student's t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
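As a flavor of how these data-step and descriptive functions are used, here is a hedged sketch based on RevoScaleR's documented rx* pattern (file paths and variable names are placeholders; treat the exact signatures as assumptions):

```r
# Sketch: import a delimited file into the XDF format, then compute
# descriptive statistics, all block-wise rather than fully in memory.
library(RevoScaleR)                      # ships with Revolution R Enterprise
rxImport(inData = "transactions.csv",    # placeholder path
         outFile = "transactions.xdf",
         overwrite = TRUE)
rxSummary(~ amount + region,             # placeholder variables
          data = "transactions.xdf")     # means, std dev, min/max, counts
```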
35. Revolution R Enterprise ScaleR: High Performance Big Data Analytics (continued)
Statistical Modeling
• Sum of Squares (cross-product matrix for set variables)
• Multiple Linear Regression
• Logistic Regression
• Generalized Linear Models (GLM): all exponential-family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions including cauchit, identity, log, logit, probit; user-defined distributions & link functions
• Covariance & Correlation Matrices
Predictive Models
• Predictions/scoring for models
• Residuals for all models
Decision Trees (Classification & Machine Learning)
• Classification & Regression Trees
Cluster Analysis
• K-Means
Data Visualization
• Histogram, Line Plot, Scatter Plot
• Lorenz Curve
• ROC Curves (actual data and predicted values)
Simulation
• Monte Carlo
Variable Selection
• Stepwise Regression (for linear regression)
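A hedged sketch of the modeling-and-scoring flow with these functions, again following RevoScaleR's documented rx* naming (the data sets and variables are placeholders of my own):

```r
# Sketch: fit a logistic regression on an XDF file, then score new data.
library(RevoScaleR)
model <- rxLogit(default ~ income + age,   # placeholder formula
                 data = "loans.xdf")       # block-wise, out-of-core fit
rxPredict(model,
          data = "new_loans.xdf",          # placeholder scoring set
          outData = "scored_loans.xdf")    # writes predicted probabilities
```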
36. Unparalleled Big Data Big Analytics Scale, Performance & Innovation
[Diagram: "1 + 1 = 1000's". Performance Enhanced R (the R language plus open source R analytic packages) combined with Big Data distributed & parallel processing and analytic packages yields Revolution R Enterprise, with value rising alongside performance.]
37. Leveraging CRAN with DistributedR & ScaleR
Big Data Distillation
• An R programmer uses RRE ScaleR to reduce dimensionality first, then feeds the reduced data set into open source packages: the computationally intensive portion is accelerated by ScaleR while any of the plethora of open source packages can still be leveraged.
Big Data Threading
• An R programmer uses RRE ScaleR to execute algorithms designed for SMP environments in parallel via DistributedR (e.g., Monte Carlo simulation); see the sketch after this list.
Supercharge an Open Source Package with RRE
• An R programmer re-engineers a CRAN routine by replacing an open source function inside an R-based algorithm with the equivalent ScaleR function(s).
High Performance Custom Algorithms
• An R programmer uses the RRE high-throughput extreme data format (XDF) to apply any combination of open source functions and logic while chunking through an XDF file, overcoming open source R's memory limitations.
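DistributedR's own API is not shown in the deck, so as a stand-in, here is the same "thread an SMP algorithm" idea expressed with base R's parallel package (a sketch under that assumption, not the DistributedR interface):

```r
# Sketch: parallel Monte Carlo estimation of pi across local cores.
library(parallel)
n_draws <- 1e6
cl <- makeCluster(detectCores())
hits <- parSapply(cl, rep(n_draws / 4, 4), function(n) {
  x <- runif(n); y <- runif(n)
  sum(x^2 + y^2 <= 1)          # draws landing inside the quarter circle
})
stopCluster(cl)
4 * sum(hits) / n_draws        # Monte Carlo estimate of pi
```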
39. Big Analytics on Big Data in Hadoop
100% R on Hadoop
• Full skill transfer: no Java needed
• Use 4500+ CRAN packages
• Blend and combine R with other tools and methods
100% Portability
• Build once, deploy many
• Track the evolution of Hadoop
• Protect against platform uncertainty; avoid platform lock-in
Hadoop Performance & Scale
• Leverage Hadoop parallelism easily
• Analyze data without moving it
[Diagram: Data, Analytics, and Applications layered on Hadoop, combining scalable compute with parallel storage (HDFS, HBase, Hive).]
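In practice, "leverage Hadoop parallelism easily" meant switching the ScaleR compute context to Hadoop. A hedged sketch following RevoScaleR's documented compute-context pattern (host names, paths, and variables are placeholders):

```r
# Sketch: point ScaleR at a Hadoop cluster; subsequent rx* calls run there
# as MapReduce jobs, with no Java or hand-written MapReduce required.
library(RevoScaleR)
rxSetComputeContext(RxHadoopMR(sshUsername = "analyst",      # placeholder
                               sshHostname = "hadoop-edge")) # placeholder
hdfs <- RxHdfsFileSystem()
flights <- RxTextData("/data/flights.csv", fileSystem = hdfs) # placeholder path
model <- rxLinMod(ArrDelay ~ DayOfWeek, data = flights)       # runs in Hadoop
```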
40. Revolution R Enterprise + Cloudera Propels Enterprises into the Future
[Diagram: the analytic reference architecture with the layers filled in: Decision (Analytic Applications); Integration (Middleware); Analytics (Revolution R Enterprise, the Big Data Big Analytics platform); Data (Cloudera, the data management platform).]
41. Revolution R Enterprise Powers Write Once, Deploy Anywhere
[Diagram: the Beside, Inside, and Hybrid architectures from slide 17, now with Revolution R Enterprise as the analytics layer and Cloudera as the data platform, linked by direct connectors; the same model build/deploy legend applies.]
Bottom Line: Save Time, Save Money, Get Insights Faster
• Direct connectors access data without data movement
• Push-down execution analyzes data without movement
• Use the same R script on any platform without recoding
• Use the right architecture for the job!
42. Revolution R Enterprise Inside Cloudera
[Diagram: Revolution R Enterprise (R+CRAN, RevoR, DistributedR, ConnectR, ScaleR, DeployR) running inside Cloudera for Big Data Big Analytics: data transformation, model building & scoring. Data sources feeding in: machine data, new data sources, data suppliers, traditional sources, and IBM mainframe. Consumption tiers: business analysts (Alteryx, Tableau, QlikView, Cognos, MicroStrategy, Datameer, etc.), power analysts (RStudio, DevelopR, etc.), and line-of-business users (analytic apps, rules engines, etc.).]
43. QuickStart Programs Deliver Value Quickly
• Offered by both Cloudera and Revolution Analytics
• Combine software, services, and training
• Cloudera can help you get started with Hadoop in a few ways
• Revolution Analytics helps you realize value from R + Hadoop
44. Summary
Revolution R Enterprise and Cloudera Hadoop bring together best-of-breed technologies to deliver:
• Highly scalable, high performance machine learning on data residing in Hadoop
• A familiar R programming environment that makes analytics at scale accessible and easy for R users
• Full-lifecycle analytics, from ad-hoc analysis to production analytics, in one managed environment, with the ability to integrate disparate data sources in one repository
• Deep integration between Revolution R Enterprise and Cloudera for a seamless operational experience managing both products