Google Cloud Guide To DA ML
DATA ANALYTICS & MACHINE LEARNING
CONTENTS
Introduction
Conclusion
Works Cited
INTRODUCTION
Two factors make the current landscape different from past evolutions.
The first is an exponential increase in the volume and diversity of data
being generated by billions of users and devices. The second is a demand
for immediate access to high-quality data and insights. Each has brought
new urgency to how companies manage data. In addition, the cost
and performance of many cloud capabilities have reached a tipping
point, helping make machine learning (ML) and artificial intelligence (AI)
accessible to every business.
Figure: Interactions per connected person per day (axis: number of interactions/capita/day)
By 2025, the average connected person will interact with connected devices nearly 4,800 times per day—equivalent to
one interaction every 18 seconds.2
OUR ROOTS
Google Cloud’s Guide to Data Analytics & Machine Learning draws upon Google’s twenty years
of tackling some of the industry’s toughest data problems. Along the way, we’ve contributed
original research that has helped to shape the Big Data landscape: from two research papers
in late 2003 and 2004, which together spawned the Hadoop movement; to the Dremel paper,
which forms the basis for the cloud data warehouse capability you’ll read about in this guide.
We designed, built, and deployed Spanner, the first system to distribute data at global scale
and support externally consistent distributed transactions—and, in 2017, made it generally
available to our customers.3 More recently, Google Brain has helped fuel the industry’s
renewed interest in AI, leading up to the release of our TensorFlow Project into open source.4
With this Guide, we look forward to sharing our experience with leaders looking for ways to
unlock the promise of machine learning and AI for their organizations.
CHAPTER 1
THE NEW DATA LANDSCAPE
IT teams are stuck in the middle. They must find ways to deliver a real-
time view of the business while also managing a larger and more complex
data landscape. As with many software initiatives, reducing complexity is
an important determinant of success.
This Guide explores how managed cloud services help new and
established companies meet today’s data challenges. It presents a path
that begins with capturing raw business data into cloud storage. As
business questions arise, cloud-based tools can prepare and structure raw
SERVERLESS: THE PATH TO IT PRODUCTIVITY
Modern serverless architectures are the culmination of a series of efforts to shrink the surface area
of responsibility that developers and IT teams must manage. Fundamentally, the goal of serverless
computing is to eliminate commodified work—managing server clusters, sharding databases, load
balancing, capacity planning, ensuring availability—so IT teams can focus on what matters to the
business. Serverless draws a sharp distinction between commodified IT (the mundane maintenance
work that looks roughly the same at every company) and differentiated work that elevates
IT to a direct provider of business value.
CHAPTER 1 RECAP
2 Cloud computing helps companies to meet these challenges by enabling data management at
scale and speed—without having to worry about infrastructure.
3 Specifically, businesses can start to modernize their data strategies by focusing on cloud storage
and data warehousing as a first step in building a foundation for machine learning and AI.
FIS
INDUSTRY
Financial Services
FIS analyzes market events and disruptors with fully managed cloud services
ABOUT
FIS is a global leader in financial services technology, with a focus on retail and
institutional banking, payments, asset and wealth management, risk and compliance,
consulting, and outsourcing solutions.
FIS has built a market reconstruction tool that could help determine the potential cause
for events that disrupt the securities market, such as the “flash crash” of 2010. FIS’s
system not only can store billions of transactions, but also allows compliance and risk
executives to conduct surveillance and on-demand querying, including market
reconstruction.
According to FIS, the system can process and link up to 15 terabytes of data daily in four
hours and can store it for six years, as mandated by law. “That amounts to about
30 petabytes of data,” says Neil Palmer, chief technology officer of Advanced Technology
business at FIS. “There’s not much out there on that scale, and certainly not in
financial services. It’s a huge undertaking.”
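As a rough sanity check on the figures quoted above, the retained volume can be estimated from the daily intake and the retention period. This is only a back-of-the-envelope sketch: it assumes the full 15 terabytes arrives on every counted day, uses decimal units (1,000 TB per PB), and the 252-trading-day year is an illustrative assumption rather than a figure from FIS.

```python
TB_PER_PB = 1000  # decimal units; an assumption made for this estimate


def retained_petabytes(tb_per_day: float, days_per_year: int, years: int) -> float:
    """Total data retained, in petabytes, at a constant daily intake."""
    return tb_per_day * days_per_year * years / TB_PER_PB


# Upper bound: 15 TB on every calendar day for six years.
calendar_estimate = retained_petabytes(15, 365, 6)

# Lower bound: 15 TB only on trading days (252 per year is an assumption).
trading_day_estimate = retained_petabytes(15, 252, 6)

print(f"{trading_day_estimate:.1f} to {calendar_estimate:.1f} PB")
```

Both bounds land in the same range as the "about 30 petabytes" quoted by FIS.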
CHAPTER 2
CLOUD STORAGE & DATA WAREHOUSING
Centralizing raw data from key business processes into cloud storage is
one of the first steps organizations can take to modernize. In doing so,
they position themselves to tap analytics capabilities in the cloud.
Data silos scattered across the enterprise continue to vex business and
IT teams alike, with new silos (whether for organizational or technical
reasons, or both) created daily.5 Harvard Business Review has written
about the need for a single source of truth for data, as well as distinct
lenses through which different lines of business can view the data.6
IDC estimates that less than 1% of all files gets analyzed.7 The other
99%—depending on the timing of business needs—contains insights
material to decision-making. Since organizations cannot predict the
business questions that will arise, they need frictionless ways to store
large volumes of data cheaply and flexibly. This is especially true for
unstructured files, which make up the majority of data generated.8
With cloud, businesses can store enormous volumes of files at low cost
(below one penny per gigabyte at time of writing).9 Data that’s currently
needed can be kept “warm”—available globally to serve applications
or to run analytics—while data with still-untapped value remains in
cheaper cold storage. The most compelling online storage allows even
cold archival data to be retrieved quickly with extremely low latency.
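The warm/cold split described above can be sketched as a simple tiering rule plus a cost estimate. Everything here is illustrative: the tier names, the 30-day access window, and the per-gigabyte prices are hypothetical placeholders, not published rates.

```python
from dataclasses import dataclass

# Hypothetical monthly prices per gigabyte, for illustration only.
PRICE_PER_GB_MONTH = {"warm": 0.026, "cold": 0.007}


@dataclass
class StoredObject:
    name: str
    size_gb: float
    days_since_last_access: int


def choose_tier(obj: StoredObject, warm_window_days: int = 30) -> str:
    """Keep recently used data warm; move everything else to cold storage."""
    return "warm" if obj.days_since_last_access <= warm_window_days else "cold"


def monthly_cost(objects: list[StoredObject], warm_window_days: int = 30) -> float:
    """Estimated monthly storage bill under the tiering rule above."""
    return sum(
        obj.size_gb * PRICE_PER_GB_MONTH[choose_tier(obj, warm_window_days)]
        for obj in objects
    )
```

A rule like this would normally be expressed as a lifecycle policy on the storage bucket rather than in application code; the sketch only shows the decision being made.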
Besides saving money, cloud storage serves as the basis for powerful analytics. Businesses
can capture structured and unstructured files seamlessly in their native formats. Because
storage is intentionally separated from processing and analysis, teams can defer structuring
raw data for analytics until business questions arise. Crucially, raw data from the same
foundation can be restructured easily to answer new questions on the fly. What sets cloud
storage apart is how efficiently these data-capture and repurposing steps can happen. To
position an organization to benefit from analytics, teams need to ensure that raw data from
their business processes is captured and centralized.
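To illustrate deferring structure until a question arises, the sketch below keeps events as raw JSON lines and derives two different views from the same records. The event fields and sample values are invented for the example.

```python
import json
from collections import Counter, defaultdict

# Raw events kept exactly as captured (hypothetical sample data).
RAW_EVENTS = [
    '{"user": "u1", "action": "view", "product": "shoes", "ts": "2017-03-01T10:00:00"}',
    '{"user": "u2", "action": "purchase", "product": "shoes", "amount": 59.0, "ts": "2017-03-01T10:05:00"}',
    '{"user": "u1", "action": "purchase", "product": "hat", "amount": 15.5, "ts": "2017-03-02T09:30:00"}',
]


def revenue_by_product(raw_lines):
    """Question 1: how much revenue did each product generate?"""
    totals = defaultdict(float)
    for line in raw_lines:
        event = json.loads(line)
        if event.get("action") == "purchase":
            totals[event["product"]] += event.get("amount", 0.0)
    return dict(totals)


def actions_per_user(raw_lines):
    """Question 2, asked later: how active is each user?"""
    counts = Counter(json.loads(line)["user"] for line in raw_lines)
    return dict(counts)
```

Neither question had to be known when the events were captured; each function imposes only the structure it needs at read time.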
According to a survey of more than 500 global IT leaders conducted by MIT Sloan Management
Review on behalf of Google Cloud, cloud adoption continues to accelerate, with a majority (65%) of
applications, data, and/or infrastructure expected to be cloud-based by 2019.
The Internet of Things (IoT) is an important driver of this move to the cloud, with 91% of
respondents with IoT initiatives either currently deploying (59%) or planning to deploy (32%) data
from IoT-connected devices in the cloud. Respondents cited the ability to integrate with new tools
and platforms (33%), faster app deployment and iteration (31%), increased flexibility in business
processes and vendor choices (29%), and increased security (28%) as the top reasons to deploy
IoT data in the cloud.
To make meaningful use of IoT data, companies must be able to understand it in context. A cloud
data warehouse that allows for both batch and streaming inputs, paired with a powerful analytics
platform, helps ensure that your IoT data can deliver real-time insights.
Armed with the ability to capture data of any kind economically, organizations can turn their
attention to enabling a disciplined view of their most important business processes. While
cloud storage centralizes data in its raw native format, a cloud data warehouse enables
businesses to pull together data from disparate silos for analytics—just as a traditional data
warehouse would. With cloud, companies can manage large volumes of data with minimal
capital investment, scale practically indefinitely, and pay only for what they use. Managed
cloud services take it a step further, freeing IT from worrying about any of the underlying
infrastructure. Companies must consider which business questions need answering, and the
data required to answer them.
For example:
• What are the primary business goals for my data? To understand how users
interact with my systems, identify trends, increase sales, build consumer loyalty,
or something else?
• Where will my most important data come from (transactions, server logs, cloud
services, devices/IoT, social media)? Are these imported into cloud storage already?
• How fast must my system incorporate new data in reports and visualizations?
Cloud storage
Data from cloud storage can be imported into a cloud data warehouse for analytics.12
At this stage, a schema can be formalized based on the business questions that need
answering—bringing structure to raw data for analysis.
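As a minimal sketch of formalizing a schema at import time, the example below loads raw JSON records into a typed table and answers one business question. An in-memory SQLite database stands in for the cloud data warehouse, and the table, column names, and sample orders are all hypothetical.

```python
import json
import sqlite3

# Raw records as they might sit in cloud storage (hypothetical sample data).
RAW_ORDER_LINES = [
    '{"order_id": 1, "region": "EMEA", "amount": "120.50", "ts": "2017-03-01"}',
    '{"order_id": 2, "region": "APAC", "amount": "80.00", "ts": "2017-03-01"}',
    '{"order_id": 3, "region": "EMEA", "amount": "42.25", "ts": "2017-03-02"}',
]


def load_orders(conn: sqlite3.Connection, raw_lines) -> None:
    """Create a typed table and load the raw records into it."""
    conn.execute(
        "CREATE TABLE orders ("
        "order_id INTEGER PRIMARY KEY, region TEXT NOT NULL, "
        "amount REAL NOT NULL, order_date TEXT NOT NULL)"
    )
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Cast loosely typed raw fields into the formal schema.
        rows.append(
            (int(record["order_id"]), record["region"], float(record["amount"]), record["ts"])
        )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def sales_by_region(conn: sqlite3.Connection):
    """Business question: total sales per region."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


conn = sqlite3.connect(":memory:")
load_orders(conn, RAW_ORDER_LINES)
print(sales_by_region(conn))
```

A managed warehouse would handle storage and scaling behind the same kind of SQL interface; only the connection and load mechanics would differ.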
Streaming data
Data from web, mobile, and IoT applications can bypass cloud storage and be streamed
directly into a cloud data warehouse (see Chapter 3: Real-Time Data Integration).
Data Governance
Exponential growth in the global volume of data is not the only obstacle businesses face.
According to Forrester, fast-changing requirements around analytics and reporting, as
well as misalignment between business and IT, are among the top challenges impeding
organizations’ business intelligence efforts.13 In addition, the well-documented data science
talent gap (see “Rise of the Citizen Data Scientist”) requires businesses to consider new
approaches to developing analytical expertise.
With role-based access, any individual or application developer can query data stored in a
cloud data warehouse, generate reports, or access visualizations. Cloud data warehousing
supports individualized, need-to-know access management. Tailored access controls
and complete auditability help democratize data science, while still maintaining security
safeguards. Indeed, over one-half of firms across the U.S., Europe, and Asia-Pacific report
they either are implementing, have implemented, or are expanding their use of self-service
business intelligence tools across the enterprise.14
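Role-based access with an audit trail can be sketched in a few lines. The role names, permission names, and users below are hypothetical; managed warehouses provide this through their own access-management service rather than application code.

```python
# Hypothetical roles and the permissions each one grants.
ROLE_PERMISSIONS = {
    "viewer": {"read_reports"},
    "analyst": {"read_reports", "run_queries"},
    "admin": {"read_reports", "run_queries", "manage_access"},
}


class AccessDenied(Exception):
    """Raised when a user lacks the permission they requested."""


def authorize(user_roles: dict, user: str, permission: str, audit_log: list) -> None:
    """Check a permission against the user's role and record the decision."""
    role = user_roles.get(user)
    allowed = permission in ROLE_PERMISSIONS.get(role, set())
    # Every decision is logged, whether granted or denied.
    audit_log.append((user, permission, allowed))
    if not allowed:
        raise AccessDenied(f"{user} may not {permission}")
```

Logging denials as well as grants is what makes the access history fully auditable.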
The responsibility for drawing statistically accurate conclusions based on data was once the
exclusive purview of professional data scientists. But by 2018, according to McKinsey, “the U.S.
alone could face a shortage of 140,000 to 190,000 people with deep analytical skills as well as
1.5 million managers and analysts with the know-how to use the analysis of Big Data to make
effective decisions.”15
As competition intensifies, most companies will need a diversified talent strategy. Citizen data
scientists—who, as defined in InformationWeek, are people who leverage data analytics, but whose
main job functions aren’t statistics or analytics—can be a powerful complement to in-house data
scientists, especially for companies that invest in building a culture of data science.16
• Access to data
• Curiosity
• Facility with SQL
• Domain expertise
• Collaboration
CHAPTER 2 RECAP
1 Cloud storage allows organizations to capture both structured and unstructured data of any kind
in its native format. Centralizing data into cloud storage creates a foundation for analytics, with
details deferred until organizations have concrete business questions to ask of their data.
2 A cloud data warehouse enables organizations to pull together data from disparate silos for
analytics, including from cloud storage, analytic and transactional databases on-prem or in
the cloud, or data stored with other cloud services. Organizations can run queries, generate
reports, and create visualizations—without managing the underlying infrastructure.
3 Role-based access democratizes analytics across an organization. A cloud data warehouse can
be scoped enterprise-wide, or organized flexibly based on the structure of the organization.
COLORADO CENTER FOR PERSONALIZED MEDICINE
INDUSTRY
Healthcare
ABOUT
The Colorado Center for Personalized Medicine (CCPM) is a partnership among the
University of Colorado Denver, UCHealth, Children’s Hospital Colorado, and CU Medicine,
and is located in the Denver, Colorado area.
The Colorado Center for Personalized Medicine (CCPM) is conducting breakthrough
research through the analysis of patient DNA to predict disease risk and develop
targeted treatments based on an individual’s genetics. CCPM relies on Health Data
Compass, CCPM’s enterprise health data warehouse. Health Data Compass integrates
patient genomic data from CCPM and electronic health records from UCHealth,
Children’s Hospital Colorado, and CU Medicine, including external records such as
insurance claims, public health records, and environmental data.
Health Data Compass previously used a traditional on-premises system to store and
analyze data. But that approach proved costly to maintain and didn’t scale for the
center’s existing analytics needs, let alone its projected growth. Following a
comprehensive six-month pilot project, Health Data Compass migrated to GCP and
Tableau, which together can handle massive data sets and powerful visual data
analyses, while costing less and allowing for easy scalability as CCPM grows.
Significant to CCPM’s decision was the ability of GCP, including Google Cloud’s data
warehouse BigQuery, to support HIPAA compliance per CCPM’s requirements.
CHAPTER 3
REAL-TIME DATA INTEGRATION
and data integration. Consider a smart thermostat seeking to learn and adjust to the
preferences of different teams in an office building. While the thermostat is in use,
the cloud ingests raw usage data, such as temperature settings and energy consumption
levels throughout the day. As data comes in, a processing pipeline can be spun up on
demand to prepare the raw data: ensuring inputs fall within a valid range, converting
temperature and energy use into the desired units, formatting time data. The data
pipeline formally structures this data, then loads the transformed results into a cloud
data warehouse. Queries, visualizations, and reports are available instantly.
90% of companies indicate interest in deploying self-service data preparation to support
Big Data initiatives.20
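The thermostat preparation steps described above (range validation, unit conversion, time formatting) might look like the following sketch. The field names, the valid temperature range, and the choice of Celsius, kilowatt-hours, and ISO 8601 timestamps as target formats are assumptions made for illustration.

```python
from datetime import datetime, timezone

# Hypothetical valid range for a thermostat setting, in degrees Fahrenheit.
VALID_TEMP_F = (40.0, 100.0)


def prepare_reading(raw: dict):
    """Validate and normalize one raw reading; return None if it is invalid."""
    temp_f = float(raw["temp_f"])
    if not VALID_TEMP_F[0] <= temp_f <= VALID_TEMP_F[1]:
        return None  # out-of-range input is dropped
    return {
        "device_id": raw["device_id"],
        # Convert Fahrenheit to Celsius.
        "temp_c": round((temp_f - 32.0) * 5.0 / 9.0, 2),
        # Convert watt-hours to kilowatt-hours.
        "energy_kwh": float(raw["energy_wh"]) / 1000.0,
        # Format epoch seconds as an ISO 8601 UTC timestamp.
        "recorded_at": datetime.fromtimestamp(raw["epoch_s"], tz=timezone.utc).isoformat(),
    }


def run_pipeline(raw_readings):
    """Prepare every reading and keep only the valid, structured rows."""
    prepared = (prepare_reading(raw) for raw in raw_readings)
    return [row for row in prepared if row is not None]
```

The rows returned by `run_pipeline` have a fixed shape, so they could be loaded directly into a warehouse table.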
While traditional systems focused on analyzing data offline “in batch,” the demand for real-time
insights calls for a new approach. Cloud-based streaming analytics systems are built to handle
data streaming in from web applications, smartphones, or millions of IoT sensors in real time.
Hundreds of thousands of sensors can be installed on field equipment to report their raw status
to the cloud continuously for processing and monitoring. Visual feeds can be parsed in real time
for applications like anomaly detection and facial/object recognition. With widely tested and
deployed cloud services being tapped for use cases like these, streaming data analytics can be
implemented in a matter of days.
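One simple way to monitor a sensor stream for anomalies is a rolling z-score, which flags readings that sit far from the recent average. This is a basic statistical technique chosen for illustration, not the method any particular cloud service uses; the window size and threshold are arbitrary assumptions.

```python
from collections import deque
from statistics import mean, pstdev


def detect_anomalies(values, window: int = 5, threshold: float = 3.0):
    """Return indexes of readings more than `threshold` standard deviations
    away from the mean of the previous `window` readings."""
    history = deque(maxlen=window)
    flagged = []
    for index, value in enumerate(values):
        if len(history) == window:
            mu = mean(history)
            sigma = pstdev(history)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                flagged.append(index)
        # The deque drops the oldest reading automatically once full.
        history.append(value)
    return flagged
```

Because only the last few readings are held in memory, the same logic can run continuously over an unbounded stream.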
With real-time streaming data analytics, data streams directly into processing pipelines.
The transformed data can then be integrated into a cloud data warehouse—allowing
for queries, visualization, and reporting within seconds. In this way, the processing
pipeline serves as a kind of middleware that can be spun up on demand, able to
join data streaming in real time with batch data pulled in from storage. Data can be
structured flexibly to answer an organization’s business questions as they arise.
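The idea of joining streaming data with batch data can be sketched with a generator that enriches each incoming event from a reference table. The device directory, field names, and the choice to skip unknown devices are hypothetical.

```python
from collections import defaultdict

# Batch reference data, as it might be pulled from storage (hypothetical values).
DEVICE_DIRECTORY = {
    "t1": {"site": "HQ", "floor": 3},
    "t2": {"site": "Lab", "floor": 1},
}


def enrich_stream(events, directory):
    """Join each streaming event with its batch reference record."""
    for event in events:
        reference = directory.get(event["device_id"])
        if reference is None:
            continue  # unknown device: skipped here; could be routed elsewhere
        yield {**event, **reference}


def average_temp_by_site(enriched_events):
    """Aggregate the enriched stream into a per-site average temperature."""
    totals = defaultdict(lambda: [0.0, 0])
    for row in enriched_events:
        totals[row["site"]][0] += row["temp_c"]
        totals[row["site"]][1] += 1
    return {site: total / count for site, (total, count) in totals.items()}
```

Because `enrich_stream` is a generator, events are joined one at a time as they arrive rather than after the whole stream has been collected.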
MAKING THE MOST OF YOUR EXISTING BIG DATA INVESTMENTS
Many forward-looking enterprises are already using Big Data, often based on open-source tools
like Apache Hadoop and Apache Spark. For these businesses, it’s possible to protect existing
investments in talent and tools—while still achieving the cloud’s productivity advantages.
Adoption of open-source Big Data tools is widespread—and growing. Globally, many firms are
storing an increasing amount of unstructured data in public cloud file systems (including Hadoop).
Over one-third of respondents in the U.S. and Europe—and well over half in Asia-Pacific—report
that they’re implementing, have implemented, or are expanding their implementation of Hadoop
(including HBase, Accumulo, MapR, Cloudera, Hortonworks). Similarly, around one-third of
respondents in the U.S. and Europe (and a whopping 60% in Asia-Pacific) are implementing, have
implemented, or are expanding their implementation of in-memory data platforms (including
Apache Spark, SAP HANA, Kognitio, Terracotta, GigaSpaces).
• Continue to manage Big Data projects using familiar open-source tools—but migrate to virtual
machines in the cloud. The usual cloud benefits apply: retire expensive CapEx; move to an
OpEx model of billing, where organizations pay according to data stored and processed; scale
seamlessly. Note that in this forklift model, developers and IT teams are still required to manage
their own storage and data processing pipelines. However, it is the most straightforward route
for leveraging talent, tools, and vendor relationships already in place.
• The cloud offers fully managed versions of many of the most popular open-source tools in Big
Data. For example, running Apache Hadoop, Apache Spark, Apache Pig, and Apache Hive in the
cloud offloads basic data-management tasks like deployment, logging, and monitoring.21 This is
an excellent option for teams looking for the best of both on-prem and cloud-native worlds.
Either of these options lets organizations protect their existing investments in deploying Big Data—
but smartly use cloud economics to control costs and gain flexibility.
CHAPTER 3 RECAP
2 Serverless approaches to data preparation fully manage the underlying infrastructure; resources
are allocated automatically based on the needs of each data processing pipeline.
3 Cloud streaming analytics allow data from web, mobile, and IoT applications to stream into data
processing pipelines in real time. From here, data can be prepared and integrated into a cloud data
warehouse to support a real-time view of the business.
CITIBANK UK
INDUSTRY
Financial Services
In this proof of concept, the team’s task was to show how
easy it would be for Citibank to use Google BigQuery and
Google Cloud Pub/Sub to analyze and consume roughly
CHAPTER 4
MACHINE LEARNING & AI
Recent breakthroughs in machine learning (ML) and artificial
intelligence (AI) frequently make headlines. Computers have bested
human world champions in Go, a board game with more positions
than there are atoms in the universe.22 They’ve mastered popular
video games and, critically, learned to recognize cats.23 More recently,
an AI effort achieved massive savings in energy costs, highlighting
machine learning as “a general-purpose framework to understand
complex dynamics.”24 This framework is starting to find diverse
applications—and deliver results—across many industries.
predictive, enabling the retailer to surface the right product for the
right person at the right time. This level of personalization, once
typified by the small-town shopkeeper who knew the names and
The most straightforward way to get started with AI is to use pre-trained machine learning models, available instantly
through the cloud. No prior knowledge of ML is required. These capabilities may be familiar to those who use popular
consumer applications, where some of the models have reached levels of predictive accuracy that exceed human ability:
• Image analysis
• Video analysis
These services are general (not tied to consumer applications), and can be integrated into any application easily via
simple API calls. Developers don’t need to know any of the underlying details. Without having to develop any of these
services in-house, companies can tap the latest capabilities instantly, as a service.
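To give a sense of what a simple API call to a pre-trained model involves, the sketch below builds a label-detection request body. It assumes the request shape of Google Cloud Vision's REST `images:annotate` method; the endpoint and field names should be checked against the current API reference, and the request is only constructed here, not sent.

```python
import base64
import json

# Assumed endpoint for the Cloud Vision REST API; verify before use.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_label_request(image_bytes: bytes, max_results: int = 5) -> dict:
    """Build the JSON body asking the service to label one image."""
    return {
        "requests": [
            {
                # Image content is sent as base64-encoded text.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                "features": [{"type": "LABEL_DETECTION", "maxResults": max_results}],
            }
        ]
    }


# Placeholder bytes stand in for a real image file.
payload = build_label_request(b"placeholder image bytes")
body = json.dumps(payload)
```

Sending `body` to the endpoint with valid credentials would return the labels the pre-trained model assigns to the image; no model training or ML expertise is involved on the caller's side.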
Use cases span many industries and reveal some of AI’s most promising applications.
Fraud detection in financial services and preventive maintenance in manufacturing
highlight the ability to surface anomalies from a sea of transactions and messy logs,
a common need in many domains. Diagnosis and treatment suggestions in healthcare and
judgments on creditworthiness highlight machine learning’s ability to assist with
categorization, which is also generally useful.
• Time savings
• Cost savings
• Better risk management
• Improved quality of analytics
• Increased revenues
Others listed automation, improved service, and improved inventory planning.27
The capabilities introduced in chapters 2 and 3 serve as a foundation for training machine
learning models using first-party data. With raw data already centralized in cloud storage and a cloud
data warehouse, serverless data pipelines can continuously extract this data and prepare it to train
bespoke ML models. Since ML models can themselves be housed in the cloud, they are immediately
available to applications to make predictions. This loop forms a virtuous cycle, in which ML models
housed in the cloud keep improving from new training data, which in turn keeps the models fresh
and relevant.
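The virtuous cycle described above, where a model keeps improving as new training data arrives, can be illustrated with a tiny online learner. This is a toy one-feature linear model updated by stochastic gradient descent, chosen for brevity; the class name, learning rate, and synthetic data are all assumptions, and production systems would use a full ML framework.

```python
class OnlineLinearModel:
    """One-feature linear model that is updated incrementally."""

    def __init__(self, learning_rate: float = 0.1):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x: float) -> float:
        return self.weight * x + self.bias

    def update(self, batch) -> None:
        """Take one gradient step per (x, y) example in the new batch."""
        for x, y in batch:
            error = y - self.predict(x)
            self.weight += self.learning_rate * error * x
            self.bias += self.learning_rate * error


def mean_abs_error(model: OnlineLinearModel, batch) -> float:
    """Average absolute prediction error over a batch."""
    return sum(abs(y - model.predict(x)) for x, y in batch) / len(batch)


# Synthetic data following y = 2x + 1 stands in for fresh business data.
fresh_batch = [(x, 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]

model = OnlineLinearModel()
error_before = mean_abs_error(model, fresh_batch)
for _ in range(500):  # each iteration represents a newly delivered batch
    model.update(fresh_batch)
error_after = mean_abs_error(model, fresh_batch)
```

Each call to `update` plays the role of a pipeline delivering fresh data; the prediction error shrinks as batches accumulate, and the serving application always reads from the latest model state.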
• HEALTHCARE
• FINANCIAL SERVICES
• MANUFACTURING
• RETAIL
• MEDIA/GAMING
(Figure labels: reducing reliance on manual intervention; increasing automation)
The age of machine learning has finally arrived—and it’s already in full swing within smaller, tech-forward
companies, according to a new survey of business and technology leaders by MIT Technology Review
Custom. Some key findings:29
• 50% of early-stage ML implementers are already seeing ROI.
• 45% have achieved more extensive data analysis and insights.
CHAPTER 4 RECAP
1 AI, and its subset of machine learning, are simple in concept: the ability for software to improve
without needing to be explicitly programmed.
2 AI relies on large volumes of training data, giving established businesses a unique advantage to
pull from the wealth of business data generated over long operating histories.
3 Cloud storage, data warehousing, data integration, and analytics provide a natural foundation for
AI and ML by making data available for real-time training and optimization, powering a virtuous
cycle of continuous improvement.
CONCLUSION
The first step is to rethink data strategy from the ground up. Today’s cloud
tools allow companies to manage enormous volumes of diverse data
types more efficiently, at lower cost, than previously possible. Businesses
that take a modern approach to capturing, storing, preparing, and
analyzing their data will have the foundation to take advantage of machine
learning and AI. Ultimately, these new capabilities will translate into
closer relationships between companies and their customers, enabling
businesses to be more predictive in every interaction.
LEARN MORE ABOUT WHAT GOOGLE CLOUD CAN DO FOR YOUR BUSINESS.
WORKS CITED
1. Eighty-one percent of senior executives surveyed by Ernst & Young agreed that data should be at the heart
of decision-making, only 31% had significantly restructured operations to incorporate big data, and a mere
23% had implemented organization-wide data strategies. Ernst & Young, Becoming an Analytics-Driven
Organization (2015) (link).
2. David Reinsel et al., Data Age 2025: The Evolution of Data to Life-Critical (IDC, 2017) (link).
3. Cade Metz, “Exclusive: Inside Google Spanner, the Largest Single Database on Earth,” Wired
(26 November 2012) (link).
Cade Metz, “Spanner, the Google Database that Measured Time, Is Now Open to Everyone,” Wired
(14 February 2017) (link).
4. Robert McMillan, “Inside the Artificial Brain that’s Remaking the Google Empire,” Wired (16 July 2014) (link).
TensorFlow (link).
5. Forrester, Forrester’s Global Business Technographics Data and Analytics Survey (2016) (link).
6. Leandro DalleMule and Thomas H. Davenport, “What’s Your Data Strategy?” Harvard Business Review
(May 2017) (link).
7. John Gantz and David Reinsel, The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest
Growth in the Far East (IDC, 2012) (link).
8. Tracie Kambies et al., Tech Trends 2017: Dark Analytics: Illuminating Opportunities Hidden within Unstructured
Data (Deloitte University Press, 2017) (link).
9. Google Cloud Storage Pricing, Google Cloud Platform (link).
10. Forrester, Forrester’s Global Business Technographics Data and Analytics Survey (2016) (link).
11. “Three Ways Marketing Organizations Can Make Data More Actionable,” Harvard Business Review
(9 August 2016) (link).
12. Modern cloud data warehouses support importing (and even ad-hoc querying) many semistructured
formats automatically. For unstructured data that needs to be transformed first (e.g., ETL), see Chapter 3:
Data Preparation.
13. Forrester, Forrester’s Global Business Technographics Data and Analytics Survey (2016) (link).
14. Forrester, Forrester’s Global Business Technographics Data and Analytics Survey (2016) (link).
15. James Manyika et al., Big Data: The Next Frontier for Innovation, Competition, and Productivity (McKinsey
Global Institute, 2011) (link).
16. Lisa Morgan, “Citizen Data Scientists: 7 Ways to Harness Talent,” InformationWeek (24 July 2015) (link).
17. Colorado Center for Personalized Medicine: Improving Healthcare by Integrating Patient Records and Genetic
Data Using Google Cloud Platform and Tableau (Google Cloud Platform, 2017) (link).
18. Steve Lohr, “For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights,” New York Times
(17 August 2014) (link).
19. Forrester, Forrester’s Global Business Technographics Data and Analytics Survey (2016) (link).
20. Forrester, Forrester’s Global Business Technographics Data and Analytics Survey (2016) (link).
21. Apache Hadoop, The Apache Software Foundation (link).
Apache Spark, The Apache Software Foundation (link).
Apache Pig, The Apache Software Foundation (link).
Apache Hive, The Apache Software Foundation (link).
22. Paul Mozur, “Google’s A.I. Program Rattles Chinese Go-Master As It Wins Match,” New York Times
(25 May 2017) (link).
23. Nicola Twilley, “Artificial Intelligence Goes to the Arcade,” The New Yorker (25 February 2015) (link).
John Markoff, “How Many Computers to Identify A Cat? 16,000,” New York Times (25 June 2012) (link).
24. James Vincent, “Google Uses DeepMind AI to Cut Data Center Energy Bills,” The Verge
(21 July 2016) (link).
25. Harvard Business Review Analytic Services Global Data and Analytics Survey, sponsored by Google (2017).
26. A survey by MIT Technology Review shows smaller organizations well on their way to machine learning
adoption and returns: 60% of a pool of 375 respondents in which nearly two-thirds were companies with
fewer than 1,000 employees drawn largely from the technology, business, and financial services industries.
MIT Technology Review Custom and Google Cloud, Machine Learning: The New Proving Ground for
Competitive Advantage (2017) (link).
27. Anna Rader, Machine Learning Initiatives Across Industries: Practical Lessons from IT Executives (M-Brain,
sponsored by Google, 2017) (link).
28. Anna Rader and Irida Jano, Machine Learning Market Research: How Leading Industries Are Adopting AI
(M-Brain 2017) (link).
29. MIT Technology Review Custom and Google Cloud, Machine Learning: The New Proving Ground for
Competitive Advantage (2017) (link).
© 2017 Google Inc.
1600 Amphitheatre Parkway, Mountain View, CA 94043