Analyzing Data in the Internet of Things
A Collection of Talks from Strata + Hadoop World 2015
Alice LaPlante
First Edition
978-1-491-95901-5
Introduction
Alice LaPlante
The Internet of Things (IoT) is growing quickly. More than 28 billion things will be connected to the Internet by 2020, according to the International Data Corporation (IDC). Consider that over the last 10 years:

The cost of sensors has gone from $1.30 to $0.60 per unit.
The cost of bandwidth has declined by 40 times.
The cost of processing has declined by 60 times.

Interest as well as revenue has grown in everything from smart watches and other wearables, to smart cities, smart homes, and smart cars. Let's take a closer look:
Smart wearables
According to IDC, vendors shipped 45.6 million units of wearables in 2015, up more than 133% from 2014. By 2019, IDC forecasts annual shipment volumes of 126.1 million units, a five-year compound annual growth rate (CAGR) of 45.1%. This is fueling streams of big data for healthcare research and development, both in academia and in commercial markets.
Smart cities
With more than 60% of the world's population expected to live in urban cities by 2025, we will be seeing rapid expansion of city borders, driven by population increases and infrastructure development. By 2023, there will be 30 mega cities globally. This in turn will require an emphasis on smart cities: sustainable, connected, low-carbon cities putting initiatives in place to be more livable, competitive, and attractive to investors. The market will continue growing to $1.5 trillion by 2020, through such diverse areas as transportation, buildings, infrastructure, energy, and security.
Smart homes
Connected home devices will ship at a compound annual rate of more than 67% over the next five years, reaching 1.8 billion units by 2019, according to BI Intelligence. Such devices include smart refrigerators, washers, and dryers; security systems; and energy equipment like smart meters and smart lighting. By 2019, smart home devices will represent approximately 27% of total IoT product shipments.
Smart cars
Self-driving cars, also known as autonomous vehicles (AVs), have the potential to disrupt a number of industries. Although the exact timing of technology maturity and sales is unclear, AVs could eventually play a profound role in the global economy, according to McKinsey & Co. Among other advantages, AVs could reduce the incidence of car accidents by up to 90%, saving billions of dollars annually.
In this O'Reilly report, we explore the IoT industry through a variety of lenses, presenting highlights from the 2015 Strata + Hadoop World conferences that took place in both the United States and Singapore, and the IoT-related topics they explored.
PART I
CHAPTER 1
Danielle Dean
Editor's Note: At Strata + Hadoop World in Singapore, in December 2015, Danielle Dean (Senior Data Scientist Lead at Microsoft) presented a talk focused on the landscape and challenges of predictive maintenance applications. In her talk, she concentrated on the importance of data acquisition in creating effective predictive maintenance applications. She also discussed how to formulate a predictive maintenance problem as three different machine-learning models.
data available today, you can predict not just when you need an oil change, but when your brakes or transmission will fail.

you're not going to learn very well; having enough raw examples is essential.
CHAPTER 2
Bruno Fernandez-Ruiz
Editor's Note: At Strata + Hadoop World in San Jose, in February 2015, Bruno Fernandez-Ruiz (Senior Fellow at Yahoo!) presented a talk that explores two issues that arise from the computational resource gap between CPUs, storage, and network on IoT sensor devices: (a) undefined prediction quality, and (b) latency in generating predictions.
Let's begin by defining the resource gap we face in the IoT by talking about wearables and the data they provide. Take, for example, an optical heart rate monitor in the form of a GPS watch. These watches measure the conductivity of the photocurrent through the skin, and infer your actual heart rate from that data.

Essentially, it's an input and output device that goes through some black box inside the device. Other devices are more complicated. One example is Mobileye, a combination of radar/lidar cameras embedded in a car that, in theory, detects pedestrians in your path and then initiates a braking maneuver. Tesla is going to start shipping vehicles with this device.

Likewise, Mercedes has an on-board device called Sonic Cruise, which is essentially a lidar (similar to a Google self-driving car). It sends a beam of light and measures the reflection that comes back. It will tell you the distance between your car and the next vehicle.
For each of these functions, we calculate the error, and one of these functions will minimize the error.
There are two key techniques for this process; the first is gradient descent. Using gradient descent, you look at the gradient at one point, walk the curve, calculating over all of the points that you have, and then you keep descending toward the minimum. This is a slow technique, but it is more accurate than the second option we'll describe.

Stochastic jumping is a technique by which you look at one sample at a time, calculate the gradient for that sample, then jump, and jump again; it keeps approximating. This technique moves faster than gradient descent, but is less accurate.
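The two techniques can be sketched on a toy problem. This is an illustration, not code from the talk: we fit y = w * x by least squares, comparing a full-batch gradient step against per-sample "stochastic jumping" updates.

```python
# Sketch (not from the talk): batch gradient descent vs. per-sample
# stochastic updates on a one-parameter least-squares problem.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated with the true w = 2

def batch_gradient_descent(steps=200, lr=0.01):
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared error over ALL samples:
        # accurate, but each step touches the whole dataset.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

def stochastic_descent(epochs=50, lr=0.01):
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            # Gradient from ONE sample: a cheap, noisy "jump."
            w -= lr * 2 * (w * x - y) * x
    return w

print(batch_gradient_descent())  # both approach the true w = 2
print(stochastic_descent())
```

On this noiseless toy data both converge to the same answer; with noisy, billion-sample data, the per-sample updates are far cheaper per step but bounce around the minimum.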
Constrained Throughput
In computational advertising, which is what we do at Yahoo!, we know that we need two billion samples to achieve a good level of accuracy for a click prediction. If you want to detect a pedestrian, for example, you probably need billions of samples of situations where you have encountered a pedestrian. Or, if you're managing electronic border control, and you want to distinguish between a coyote and a human being, again, you need billions of samples.

That's a lot of samples. In order to process all of this data, normally what happens is we bring all of the data somewhere and process it through a GPU, which gives you your optimal learning speed, because the memory and processing activities are in the same place. Another option is to use a CPU, where you move data between the CPU and the memory. The slowest option is to use a network.
Can we do something in between, though, and if so, what would that look like? What we can do is create something like a true distributed hash table, which says to every computational node, "I'm going to spin off the storage node, and you start routing requests."
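The routing idea can be sketched with a simple hash-based partitioner. Everything here (node names, key format) is invented for illustration; a real distributed hash table would also handle replication and nodes joining and leaving.

```python
# Illustrative sketch: route each sample key to the node that stores its
# shard, so computation happens next to the data it needs.

import hashlib

NODES = ["node-0", "node-1", "node-2"]

def owner(sample_key: str) -> str:
    # Stable hash -> the node that both stores and processes this key.
    digest = hashlib.md5(sample_key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every request for the same key lands on the same node, so that node
# can keep the relevant samples in local memory.
assert owner("sample-42") == owner("sample-42")

shards = {}
for i in range(1000):
    shards.setdefault(owner(f"sample-{i}"), []).append(i)

# Rough balance across nodes (hashing spreads keys, not perfectly evenly).
print({node: len(keys) for node, keys in shards.items()})
```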
CHAPTER 3
Eric Frenkiel
Editor's Note: At Strata + Hadoop World in Singapore, in December 2015, Eric Frenkiel (CEO and cofounder at MemSQL) presented a talk that explores modeling the smart and connected city of the future with Kafka and Spark.
Hadoop has solved the volume aspect of big data, but velocity and variety are two aspects that still need to be tackled. In-memory technology is important for addressing velocity and variety, and here we'll discuss the challenges, design choices, and architecture required to enable smarter energy systems and efficient energy consumption through a real-time data pipeline that combines Apache Kafka, Apache Spark, and an in-memory database.
What does a smart city look like? Here's a familiar-looking vision: it's definitely something futuristic and ultra-clean, and for some reason there are always highways that loop around buildings. But here's the reality: we have a population of almost four billion people living in cities, and unfortunately, very few cities can actually enact the type of advances that are necessary to support them.

A full 3.9 billion people live in cities today; by 2050, we're expected to add another 2.5 billion people. It's critical that we get our vision of a smart city right, because in the next few decades we'll be adding billions of people to our urban centers. We need to think about how we can design cities and use technology to help people, and deliver real value to billions of people worldwide.
The good news is that the technology of today can build smart cities. Our current ecosystem of data technologies, including Hadoop, data warehouses, streaming, and in-memory, can deliver phenomenal technology at a city-level scale.
In effect, you're moving away from the concept of analyzing data via reports, and toward real-time applications where you can interact with live data.
CHAPTER 4
will come back down, and that's our cue for damage on a railway track.
Architectural Considerations
In our example, all of these sensor readings have to go from the locomotive to the data center. The first thing we do when the data arrives is write it to a reliable, high-throughput streaming channel, or streaming transportation layer; in this case, we use Kafka. With the data in Kafka, we can read it in Spark Streaming, using the direct Kafka connector.
Once these events arrive in the data center, the first step is to enrich them with relevant metadata, to help determine if there is potential damage. For example, based on the locomotive ID, we want to fetch information about the locomotive, such as its type: we would want to know if it's a freight train, if it's carrying human passengers, how heavy it is, and so on. And if it is a freight train, is it carrying hazardous chemicals? If that's the case, we would probably need to take action at any hint of damage. If it's a freight train that's just coming back empty, with no cargo, then it's likely to be less critical. For these reasons, information about the locomotive is critical.

Similarly, information about each sensor is critical. You want to know where the sensor is on the train (i.e., is it on the left wheel or the right wheel?). GPS information is also important, because if the train happens to be traveling on a steep incline, you might expect temperature readings to go up. The Spark HBase module, which is now a part of the HBase code base, is what we recommend for pulling in this data.
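The enrichment step might look like the following plain-Python sketch. The talk does this in Spark Streaming with metadata pulled from HBase; the field names and metadata tables here are invented for illustration.

```python
# Sketch of the enrichment step: merge locomotive and sensor metadata
# into a raw sensor event. Field names and values are made up.

locomotives = {
    "loco-7": {"type": "freight", "cargo": "hazardous-chemicals"},
    "loco-9": {"type": "passenger", "cargo": None},
}
sensors = {
    "s-14": {"position": "left-wheel", "gps": (1.352, 103.819)},
}

def enrich(event: dict) -> dict:
    # Look up metadata by ID and attach it to a copy of the event.
    enriched = dict(event)
    enriched["locomotive"] = locomotives[event["locomotive_id"]]
    enriched["sensor"] = sensors[event["sensor_id"]]
    return enriched

event = {"locomotive_id": "loco-7", "sensor_id": "s-14", "temp_c": 91.0}
print(enrich(event)["locomotive"]["cargo"])
```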
After you've enriched these events with all the relevant metadata, the next task in our example is to determine whether a signal indicates damage, either through a simple rule-based model or a predictive model. Once you've identified a potential problem, you write an event to a Kafka queue. You'll have an application that's continuously listening for alerts in the queue, and when it sees an event, the application will send out a physical alert (i.e., a pager alert, an email alert, or a phone call) notifying a technician that something's wrong.
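A minimal rule-based damage check could look like this sketch. The threshold and severity logic are invented; in the pipeline described above, the resulting alert events would be written to a Kafka topic for the listening application to act on.

```python
# Invented rule: flag a sensor whose temperature exceeds a limit, and
# escalate severity when the train carries hazardous cargo.

TEMP_LIMIT_C = 85.0

def check(event: dict) -> list:
    alerts = []
    if event["temp_c"] > TEMP_LIMIT_C:
        hazardous = event.get("cargo") == "hazardous-chemicals"
        alerts.append({
            "sensor_id": event["sensor_id"],
            "severity": "high" if hazardous else "normal",
        })
    return alerts

print(check({"sensor_id": "s-14", "temp_c": 91.0,
             "cargo": "hazardous-chemicals"}))
print(check({"sensor_id": "s-14", "temp_c": 60.0, "cargo": None}))
```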
One practical concern here is with regard to data storage: it's helpful to dump all of the raw data into HDFS, for two reasons. First, keeping the raw data allows data scientists to play with the data, and possibly uncover new insights. Second, there will likely be bugs in your application and in your code, and you'll want to do an audit when things go wrong. Having the raw data in HDFS lets you write simple batch jobs to figure out what went wrong, whether in your application logic or, in certain cases, in the sensors themselves.
PART II
CHAPTER 5
Thomas Holleczek
Editor's Note: At Strata + Hadoop World in Singapore, in December 2015, Thomas Holleczek (Data Scientist at Singtel) outlined this case study to illustrate how telecommunications companies are using location data to develop a system for subway and expressway traffic monitoring.
People take a lot of factors into consideration when they travel on mass transit or a highway. They don't always take the shortest route. Perhaps they want a less-crowded bus or subway, they want to take a scenic route, or they don't want to have to change buses or subways more than once.

At Singtel, we found that we could use telco location data to understand how people travel on transportation networks. We studied the Singapore transportation system using data from Singtel.
But your phone still produces updates, because the phone periodically updates its location with the cell towers.

The result: an accurate understanding of how crowded individual stations and trains are at any given moment in the Singapore MRT system. Based on this, we are currently developing an app that recommends routes to subway riders. When we release it, you'll be able to tell the app where you are and where you want to travel, and the app will provide you with options based on real-time data, including the current available capacity on a train, whether there are seats available, and estimated travel time.
Expressway Data
When we do this kind of research project, we usually start off with experimentation: we take phones, head out, and run experiments; then we look at the data that we record.

For this part of the project, we drove around on the expressways. Our fear was that not enough data would be generated, because most people don't use their phones when they drive (or they shouldn't); they don't text and they don't use data. This meant that we would be completely dependent on passive updates.
The terrific thing we found out was that when you start driving your car, your phone produces a lot of location updates. In our experiment, we found handovers happening between cell towers along the expressways every three to five minutes. We were also able to detect people who travel on buses, as most buses in Singapore are equipped with machine-to-machine SIM cards, which allow the operators to know bus locations. Most of the buses also have a GPS device, and they transmit their location through the Singtel 3G network.
Getting a full view of what's happening with transportation in Singapore allows us to address several challenges: it can help commuters choose more efficient routes, serve as an aid in city planning, and allow the subway system to improve operations, maintenance, and planning of the network.
CHAPTER 6
Susanna Pirttikangas
Editor's Note: At Strata + Hadoop World in New York, in September 2015, Susanna Pirttikangas (Project Researcher at the University of Oulu) outlined the fully integrated use of IoT data in one of the top seven smart cities in the world, the city of Oulu, in Finland. Oulu continuously collects data from transportation, infrastructure, and people, and develops services on top of the ecosystem that benefit the city, the ecology, the economy, and the people. This talk presents selected examples from a four-year project called Data to Intelligence, based on a smart traffic pilot, as a testing platform.
The Intelligence Community Forum (ICF), a New York-based think tank, has twice in a row voted Oulu one of the top seven smartest cities in the world. According to the ICF's definition, smart cities are cities and regions that use technology not just to save money or make things work better, but also to create high-quality employment, increase citizen participation, and in general be great places to live and work. We're a very small city, with only 200,000 inhabitants, but this is a good thing, as it allows us to pilot new services in an easy and agile manner.
CHAPTER 7
Ian Eslick
Editor's Note: At Strata + Hadoop World in New York, in September 2015, Ian Eslick (CEO and cofounder of VitalLabs) presented a case study that uses an open source technology framework for capturing and routing device-based health data. This data is used by healthcare providers and researchers, focusing particularly on the Health eHeart initiative at the University of California, San Francisco.
This project started by looking at the ecosystems that are emerging around the IoT, at data being collected by companies like Validic and Fitbit. Think about it: one company has sold a billion dollars' worth of pedometers, and every smartphone now collects your step count. What can this mean for healthcare? Can we transform clinical care, and the ways in which research is accomplished?

The Robert Wood Johnson Foundation (RWJ) decided to do an experiment. It funded a deep dive into one problem surrounding research in healthcare. Here, we give an overview of what we learned, and some of our suggestions for how the open source community, as well as commercial vendors, can play a role in transforming the future of healthcare.
You have two colliding worlds here in the Health eHeart context. Clinical researchers understand population data, and they understand the physiology. They understand what is meaningful at the time, but they don't understand it from the standpoint of doing time-series analysis. It's a qualitatively different kind of analysis that you have to do to make sense of a big longitudinal dataset, both at an individual level and at a population level. For example, is there a correlation between your activity patterns as measured by a Fitbit and your A-fib events, as measured by your ECG with the AliveCor ECG device? That's not a question that has ever been asked before. It couldn't possibly be asked until this dataset existed.

What you immediately start to realize is that the data is multi-modal. How do you take a 300-Hertz signal and relate it to an every-few-minutes summary of your pedometer data, and then measure that back against the clinical record of that patient to try to make sense of it?
This data is also multi-scale. There is daily data, and sometimes the time of day matters. Sometimes the time of month matters. Data has a certain inherent periodicity. You're also dealing with three-month physical follow-ups with doctors, so you want to try to take detailed, deep, longitudinal data and mash it up against these clinical records.

Registration is a surprisingly interesting challenge, particularly when considering: what is my baseline? If time of day is important, and you're trying to look at the correlations of activity to an event that happened within the next hour, you might want to align all the data points by hour. But then the number of such aggregated points that you get is small. The more you try to aggregate your individual data, the more general your dataset becomes, and then it's harder to ask specific questions where you're dealing with time and latency.
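The hour-alignment trade-off described above can be sketched with the standard library; the sample data below is invented.

```python
# Sketch of "registration": binning high-frequency samples into hourly
# buckets so they can be compared against clinical events.

from collections import defaultdict
from datetime import datetime

samples = [  # (timestamp, steps from a pedometer summary) -- invented
    (datetime(2015, 9, 1, 9, 5), 120),
    (datetime(2015, 9, 1, 9, 40), 300),
    (datetime(2015, 9, 1, 10, 15), 80),
]

def hourly(samples):
    # Truncate each timestamp to the hour and sum the values in the bin.
    bins = defaultdict(int)
    for ts, value in samples:
        bins[ts.replace(minute=0, second=0, microsecond=0)] += value
    return dict(bins)

print(hourly(samples))
# Three raw samples collapse into two aligned points: the coarser the
# bin, the fewer points remain to correlate against events.
```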
Registration problems require a deep understanding of the question you're trying to answer, which is not something the data scientists usually know, because it's a deep sort of physiological question about what is likely to be meaningful. And obviously, you've got lots of messy data, missing data, and data that's incorrect. One of the things you realize as you dig into this is the scale you need to get enough good data (and this is ignoring issues of selection bias) that you can really sink your teeth into and start to identify interesting phenomena.
The big takeaway is that naive analysis breaks down pretty quickly. All assumptions about physiology are approximations. For any given patient, they're almost always wrong. And none of us, it turns out, is the average patient. We have different responses to drugs, different side effects, different patterns. And if you build a model based on these assumptions, when you try to apply it back to an individual case, it turns out to be something that only opens up more questions.
CHAPTER 8
Toward the end of the flight, upon entering the atmosphere, the vehicle was going 20,000 miles per hour and sustained heat in excess of 4,000 degrees Fahrenheit. As the parachute deployed, it slowed down to 20 miles per hour before it splashed into the ocean, about 640 miles south-southwest of San Diego. A ship gathered it and brought it back home.
Microsecond Timestamps
These telemetry measurements are timestamped to the microsecond, so this is not your typical time-series data. There are also different time sources. The average spacecraft has a vehicle time on board and a ground time. With Orion, there are 12 different sources of time, and they're all computed differently, based on different measurements. It's a highly complex time series, because there's correlation across all of the different sensors at different times. And of course, because it's human flight, it requires a very high degree of fault tolerance.
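One way to reconcile multiple time sources is to map every microsecond timestamp onto a common ground timeline using a per-source offset. This is a hypothetical sketch; the offsets and readings are invented, and real clock reconciliation also has to handle drift, not just fixed offsets.

```python
# Hypothetical sketch: put microsecond timestamps from different clocks
# onto one common ground timeline via known per-source offsets.

OFFSETS_US = {          # offset of each source clock from ground time
    "vehicle": 1_500,   # invented: vehicle clock runs 1.5 ms ahead
    "ground": 0,
}

def to_ground_time(source: str, timestamp_us: int) -> int:
    # Subtract the source's offset to recover ground time.
    return timestamp_us - OFFSETS_US[source]

readings = [
    ("vehicle", 10_001_500, "temp"),
    ("ground", 10_000_000, "radio"),
]

# After correction, both readings land on the same microsecond, so the
# two sensors can be correlated on one timeline.
aligned = sorted(to_ground_time(src, ts) for src, ts, _ in readings)
print(aligned)
```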
In the EFT-1, there were about 350,000 possible measurements. On EM-1, the next mission, there are three million different types of measurements. That's a lot of information for the spacecraft engineers to understand and try to consume. They have subsystem engineers who know specific sensor measurements, and they focus on those measurements. Out of the three million measurements, subsystem engineers are only going to be able to focus on a handful when they do their analyses; that is where data analytics is needed. We need algorithms that can parse through all of the different sensor measurements.
PART III
CHAPTER 9
huge truck, or from the side? You would choose sideways, because you think that will give you the biggest opportunity to survive, right? But what if your child is in the car, sitting next to you? How do you tell an algorithm to change the choice because of your values? We might be able to figure that out.
One variation in algorithms already being taken into account is that cars will obey the laws of the country in which they're driving. For example, if you buy a self-driving car and bring it to the United Kingdom, it will obey the laws in the United Kingdom, but that same car should adhere to different laws when driving in Germany. That sounds fairly easy to put in an algorithm, but what about differences in culture and style: how do we put those in the algorithms? How aggressively would you expect a car to merge into the flow of traffic? Well, that's very different from one country to the next. In fact, it could even be different from the northern part of a country to the southern, so how would you map that?