
BIG DATA & ANALYTICS

CHAPTER 1: DATA AND THE INTERNET OF THINGS

UNIQUE ROLE OF DATA

The Internet of Things (IoT) not only attaches sensors to existing things, it creates a
market for new connected things. All of these connected things are generating data.
This adds up to an almost unimaginable quantity of data, called Big Data. This
chapter discusses how all of this data can be analyzed and put to use to improve our
lives. It explains the different types of data, where it comes from, and how it can all
be managed.

Our connected world is complex. This complexity generates an ever-increasing
amount of data, which is available at our fingertips. The volume of data that needs to
be stored and analyzed continues to expand. The velocity of data generation shows no
signs of slowing. The variety of data will continue to reach into new areas that have
never before been available for analysis. Interactions between people using media
platforms, the automation of processes, and the aggregation of data coming from
different sources create the Internet of Things (IoT). This digital transformation will
reveal new insights that promise to change the way we live, work, and play.

The digital transformation has a profound impact on three main elements of our lives:
business, social, and environmental, as shown in the figure. Interactions in these areas
will create more data to fuel new ideas, products, and solutions. This will produce
even more new data, resulting in a repeating cycle of exponential innovation that
helps us make better decisions and have better ideas.

This course focuses on the pervasive and unique role of data. In this first chapter, you
will learn fundamental terminology related to data and the technologies used to
process it, and you will explore the concept of Big Data.

WHAT IS DATA?

Data can be the words in a book, article, or blog. Data can be the contents of a
spreadsheet or a database. Data can be pictures or video. Data can be a constant
stream of measurements sent from a monitoring device. By itself, data can be rather
meaningless. As we interpret data, by correlating or comparing, it becomes more
useful. This useful data is now information. As this information is applied or
understood, it becomes knowledge.

When collecting data, determine the amount of data you will need. It is not always
necessary, or possible, to collect all available data within a project or solution. The
amount of data that can be collected is determined by the ability of the sensors,
network, computers, and other hardware involved. It is also determined by necessity.
For example, in a high-speed bottling line, each bottle must be checked for proper
alignment of the label and kicked out of the line if there is a problem. In this case, the
data from every bottle is important. With a different sensor, such as a humidity sensor
in a corn field, it is not necessary to report the humidity every tenth of a second; every
five or ten minutes may be sufficient. The rate at which data is reported is known as
the data sampling rate.
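
The following is a minimal sketch in Python of a configurable sampling rate; the
read_humidity() function is a hypothetical placeholder for a real sensor driver.

import time

SAMPLE_INTERVAL_SECONDS = 2   # two seconds for a quick demo; a field sensor might use 300 (five minutes)

def read_humidity():
    # Hypothetical placeholder that stands in for reading a real humidity sensor
    return 42.0

for _ in range(3):                          # a real device would loop indefinitely
    print("humidity:", read_humidity())     # in practice, sent to a controller or logged
    time.sleep(SAMPLE_INTERVAL_SECONDS)     # the sampling rate: one reading per interval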

Not all data collected can be used as is. Extraneous data might have been collected.
Incorrect or false data might also have been collected. In order to make this data
usable, it must be cleaned. Cleaning data consists of removing unwanted data,
changing incorrect data, and filling in missing data. It is common to use code to clean
data. This is accomplished by searching for criteria, or lack thereof, and operating on
the data until there are no more anomalies. After the data has been cleaned, it can
more easily be searched, analyzed, and visualized.
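
As a minimal sketch of cleaning data with code, the following uses the pandas library
on a small, made-up data set; the column names and replacement values are assumptions
chosen only for illustration.

import pandas as pd

# A small, made-up data set with an incorrect value and a missing value
readings = pd.DataFrame({
    "sensor_id": [1, 2, 2, 3],
    "humidity": [41.0, -999.0, 43.5, None],   # -999.0 is an incorrect reading
})

readings = readings.drop_duplicates()                        # remove unwanted (duplicate) rows
readings.loc[readings["humidity"] < 0, "humidity"] = None    # treat incorrect data as missing
readings["humidity"] = readings["humidity"].fillna(readings["humidity"].mean())  # fill in missing data

print(readings)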

Through data analysis, interesting insights can be learned and trends can be
uncovered. This often leads to new queries that had not yet been realized. When you
discover that you might be able to discern additional value from some data set, you
can begin to experiment with how the data is organized and presented. For example, a
security camera monitoring a parking lot for crimes could also be used to report the
number and location of free spaces to drivers.

ESTIMATING EXPONENTIAL GROWTH

There are two types of growth: linear growth and exponential growth. It is not
difficult to understand linear growth. For example, if a person gains ¼ kilogram each
month, then in one year he would have gained three kilograms. In two years, he would
have gained six kilograms.

Exponential growth is much more dramatic than linear growth. For example, if a
person saves $1 one month, $2 in the next month, $4 the next month, $8 the next
month, continuing to double the amount saved every month, how long would it take
for this person to become a millionaire? If you try this on your calculator, you will
notice that by the 12th month the person saves $2,048, and $4,096 the following
month. Eight months later, in the 20th month, the total savings would exceed $1
million, assuming the person can find a little over half a million dollars to save during
that month (Figure 1).

An old legend also demonstrates the concept of exponential growth. The inventor of
chess showed the game to his king, who was so pleased that he told the inventor to
name his prize for the invention. The inventor asked that for the first square of the
chess board, he would receive one grain of wheat, two grains for the second one, four
grains for the third one, doubling the amount each time. The king quickly granted this
modest request. However, the treasurer explained that it would take more than all of
the assets of the kingdom to give the inventor the reward. The story ends with the
inventor becoming the new king.

Imagine that each grain of wheat is equivalent to one byte of data. If so, then the
number of bytes would reach over nine exabytes in the last square of the chess board,
as shown in Figure 2. One exabyte is roughly 1.07 billion gigabytes. Nine exabytes is
roughly equivalent to the amount of Internet traffic for the year 2014. It is also
equivalent to the amount of global IP traffic that occurred in three days in 2015. In
2016, 88.7 exabytes crossed the global Internet every month.
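
A quick way to check these figures is to compute the doubling directly. The short
Python sketch below assumes one grain of wheat equals one byte and uses 10**18 bytes
per exabyte.

# Grains of wheat on the last (64th) square, starting from 1 grain and doubling each square
last_square = 2 ** 63
print(last_square)                        # 9,223,372,036,854,775,808 grains

# Treat each grain as one byte and convert to exabytes
print(last_square / 10**18, "exabytes")   # roughly 9.2 exabytes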

GROWTH OF DATA
Why do we care about exponential growth in a course about data? Because today, the
growth of data is exponential! The following are just a few statistics that are part of
Cisco’s Visual Networking Index (VNI) forecast for data growth between 2015 and
2020:

- Consumer mobile data traffic will reach 26.1 exabytes per month in 2020, the
equivalent of 7 billion DVDs per month, or 9 million DVDs per hour.

- Globally, IP traffic will reach 194.4 exabytes per month in 2020, up from 72.5
exabytes per month in 2015, as shown in the figure.

- Globally, 64% of all Internet traffic will cross content delivery networks in 2020, up
from 45% in 2015.

- Global mobile data traffic will grow 3 times faster than global fixed IP traffic from
2015 to 2020.

- Global Internet traffic in 2020 will be equivalent to 95 times the volume of the entire
global Internet in 2005.

- Globally, the average fixed broadband speed will grow 1.9-fold from 2015 to 2020,
from 24.7 Mbps to 47.7 Mbps.

- In 2020, the gigabyte equivalent of all movies ever made will cross global IP
networks every 2 minutes.

- Globally, consumer IP VOD traffic will reach 28.8 exabytes per month in 2020, the
equivalent of 7 billion DVDs per month, or 10 million DVDs per hour.

DATA GROWTH IMPACT

The proliferation of devices in the IoT is one of the primary reasons for the
exponential growth in data generation. While the number of sensors and other end
devices grows exponentially, mobile routers are increasingly used to better manage
Internet traffic for systems that are literally on the move. Mobile routers are deployed
in airplanes, commercial vehicles, and even in personal automobiles. Not only is the
IoT growing, but its boundaries are actually moving! Just as the advent of wireless
roaming improved Internet access, the implementation of mobile networks is
changing the psychology and behavior of consumers and businesses by expanding
anytime, anywhere, on-demand access.

Here are three examples of how data growth is affecting society in healthcare, retail,
and education:

- Robotics, mobile devices, integrated software systems, and collaboration are
changing the way healthcare is delivered. Many of these technologies enable or
expand upon efficiencies in healthcare delivery. These technologies use data, and in
turn, create more data. Click Play in Figure 1 to see how these technologies are being
implemented at the Palomar Medical Center.

- Retailers increasingly depend on the data generated by digital technologies to
improve their bottom line. Cisco’s Connected Mobile Experiences (CMX) allow
retailers to provide consumers with highly personalized content while simultaneously
gaining visibility into their behavior in the store. Click Play in Figure 2 to see how
intu shopping centers in the UK are engaging with their customers to create a new
retail experience with Cisco CMX.

- Education is changing with digital technologies. It is now standard practice to
incorporate tablets in elementary school education in many parts of the world. Virtual
schools give students access to textbooks, content, and assistance using learning
management systems. Students and teachers want to be able to bring their own
devices and connect to learning resources. Click Play in Figure 3 to see how the
McAllen Independent School District uses Cisco Identity Services Engine (ISE) and
Secure Access to implement a bring your own device (BYOD) initiative.

BUSINESS EXAMPLE: KAGGLE

To stay competitive in the business world, every organization must become more
efficient. Innovation allows an organization to stay relevant. More organizations are
putting sensors in their operations and products. Their goal is to collect and analyze
the data to gain valuable insights. To take advantage of the power of IoT,
organizations require skilled and creative people. Online platforms, such as Kaggle,
allow companies to connect with talented people from different parts of the world.

Kaggle is a platform that connects businesses and other organizations that have
questions about their data to the people who know how to find the answers. They run
online competitions to create the world’s best predictive data models. Players in the
competitions generate many models using a variety of techniques. Players come from
all around the world with different educational backgrounds and specializations. They
can connect to form teams or simply help each other. The winner, or winning team, of
each competition wins a prize. Usually this prize is money, but occasionally it will be
employment, or something equally desirable.

In each competition, there are continuous improvements, as each winner beats the
previous score. Eventually the scores plateau. This means that the players have found
the threshold of what is possible to predict with the data provided. These new
predictive data models consistently outperform existing best-of-breed models.

The Mayo Clinic, NASA, GE, and Deloitte are just a few of the businesses and
organizations that have hosted competitions on Kaggle. Go here to read more about
Kaggle, its mission, the competitions, and its growing community of players.

SOCIAL EXAMPLE: DRIVENDATA

The same emerging technologies used in the IoT and data analytics can also be used
to solve social problems. Social entrepreneurs can use the collected data to predict
trends. For example, using the available data, social entrepreneurs in a developing
country can predict which water pumps are functional, and which need repair or do
not work at all. With reliable prediction of pump operations, maintenance is more
efficient. Clean, potable water is made available to all in the communities across the
country. Competitions with social impact can be found at websites such as
DrivenData.
At DrivenData, their mission is to bring cutting-edge practices in data science and
crowdsourcing to some of the world's biggest social challenges and to the
organizations that are addressing those challenges. Like Kaggle, they host online
challenges where a global community of data scientists competes to create the best
statistical model for difficult predictive problems. These models can then help to
make a positive difference in the world.

DrivenData starts by framing a good predictive question, one that can be solved by
the data at hand and has measurable, real-world impact. They work with nonprofit
organizations to understand their needs and identify productive partnerships. Next,
DrivenData hosts an online, open-innovation competition where freelance developers
and data scientists submit statistical models. Using their competition platform and
evaluation engine, the models are ranked based on how well they predict data that has
been withheld from the competitors. And finally, they work with the organization to
leverage the top model as insight, a fresh statistical approach, or a tool for analyzing
new data. This enables the nonprofit organization to more effectively and sustainably
carry out its mission.

ENVIRONMENTAL EXAMPLE: PLANETARY SKIN INSTITUTE

NASA and Cisco partnered to develop an online collaborative global monitoring
platform called the Planetary Skin. This platform captures, collects, analyzes, and
reports data on environmental conditions around the world.

Planetary Skin Institute (PSI) is a global, nonprofit organization. It collaborates with
research and development partners to incubate scalable Big Data innovations that are
designed to increase food, water, and energy security, and to protect critical
ecosystems and biodiversity.

“Mitigating the impacts of climate change is critical to the world's economic and
social stability. This unique partnership taps the power and innovation of the market
and harnesses it for the public good. Cisco is proud to work with NASA on this
initiative and hopes others from the public and private sectors will join us in this
exciting endeavor.”

John Chambers, Former Cisco CEO

DEFINING BIG DATA

The exponential growth of data has created a new area of interest in technology and
business called “Big Data”. In general, a data set or a business problem belongs to the
Big Data classification when its data is so vast, fast, or complex that it becomes
impossible to store, process, and analyze using traditional data storage and analytics
applications.

How much data does it take to become Big Data? Are 100 terabytes enough? Are
1000 petabytes enough? The baby in the figure already has gigabytes of online data
associated with her name. Is that Big Data? Volume is only one of the criteria; the
need for real-time processing of the data (also called data in motion), or the need to
integrate structured and unstructured data, may also qualify a problem as a Big Data
problem.

For example, International Data Corporation (IDC) uses 100 terabytes as the size of a
data set that qualifies as Big Data. If the data is streaming, the size of the data set can
be smaller than 100 terabytes but still be considered Big Data, as long as the data
being generated is increasing at a rate of more than 60% a year.

For IBM’s Big Data perspective, click here to view a video in which the presenter,
Paul Zikopoulos, says that 200 to 600 terabytes are a minimum qualification for data
to be called Big Data.

Many of the awe-inspiring quantifications of data sizes in our near future are
documented in the Cisco white paper The Zettabyte Era: Trends and Analysis.

In response to this need, a completely new class of software platforms, called Big Data
Platforms, has emerged. They are discussed in Chapter 6 of this course.

According to NIST’s Big Data Interoperability Framework: "The Big Data paradigm
consists of the distribution of data systems across horizontally coupled, independent
resources to achieve the scalability needed for the efficient processing of extensive
data sets."

BIG DATA CHARACTERISTICS

To help distinguish data from Big Data, consider the Four Vs of Big Data:

- Volume - This describes the amount of data being transported and stored. The
current challenge is to discover ways to most efficiently process the increasing
amounts of data, which is predicted to grow 50 times by 2020, to 35 zettabytes.

- Velocity - This describes the rate at which this data is generated. For example, the
data generated by a billion shares sold on the New York Stock Exchange cannot just
be stored for later analysis. The data infrastructure must be able to immediately
respond to the demands of applications accessing and streaming the data.

- Variety - This describes the type of data, which is rarely in a state that is perfectly
ready for processing and analysis. A large contributor to Big Data is unstructured
data, which is estimated to represent anywhere from 70 to 90% of the world’s data.

- Veracity - This refers to the accuracy and trustworthiness of the data, and to the
process of preventing inaccurate data from spoiling your data sets. For example, when
people sign up for an online account, they often provide false contact information.
Increased veracity in the collection of data reduces the amount of data cleaning that is
required.

SOURCES OF BIG DATA

To businesses, data is the new oil. Like crude oil, it is valuable, but if it is unrefined it
cannot be easily used. Crude oil has to be changed to gasoline, plastic, chemicals, and
other substances to create a valuable product. It is the same with data. Data must be
broken down and analyzed for it to have value.

Having the right data that can be turned into information and then into business
intelligence is critical to success. The data sources available to businesses are growing
exponentially. The proliferation of sensors guarantees that they will continue to be a
primary source of Big Data. Sensors are found in a variety of applications:

- Telemetry for vehicle monitoring

- Smart metering

- Inventory management and asset tracking

- Fleet management and logistics

Businesses need information and information is everywhere in an organization. The
two primary types are transactional information and analytical information.
Transactional information is captured and stored as events happen. Transactional
information is used to analyze daily sales reports and production schedules to
determine how much inventory to carry. Analytical information supports managerial
analysis tasks, like determining whether the organization should build a new
manufacturing plant or hire additional sales personnel.

REAL-WORLD EXAMPLE OF BIG DATA SOURCES

To gain some perspective on Big Data, here are some specific real-world examples of
Big Data generators, as shown in the figure. An Airbus A380 Engine generates 1
petabyte of data on a flight from London to Singapore. The Large Hadron Collider
(LHC) generates 1 gigabyte of data every second. The Square Kilometer Array
(SKA), when it becomes operational in 2020, will be the largest radio telescope in the
world. It will generate 20 exabytes of data per day. That is equivalent to 20 billion
gigabytes per day.

The Human Genome Project was an effort to sequence and map all the human genes.
It began in 1990 and was completed in 2003 for approximately $3 billion. Now an
individual can order a complete sequencing of his or her genes for about $1,000 to
$2,000.

WHAT IS OPEN DATA?

With the rise in importance of data to businesses and people, many questions arise
regarding the privacy and the availability of large public and private data repositories.
For a data professional, it is fundamentally important to understand the continuum
between open data and private data. Making decisions about what and how various
types of data will be used in an organization is as important as knowing how to
implement a distributed storage and processing solution for Big Data.

The Open Knowledge Foundation defines open knowledge as “any content,
information or data that people are free to use, reuse, and redistribute without any
legal, technological, or social restriction.” They then go on to explain that open data
comprises the building blocks of open knowledge. Open knowledge is what open data
becomes when it is useful, usable, and used.

The value of open data can immediately be seen by viewing sites like New York
City’s Open Data Portal, NYC Open Data, where a resident or visitor can quickly find
ratings for restaurants based on annual inspections by the Department of Health and
Mental Hygiene. A visualization of the portal is shown in the figure and can be
accessed at the NYC Open Data portal. The portal is a clearinghouse of over 1300
data sets from city agencies to facilitate government transparency and civic
engagement. A data set is a collection of related and discrete records that may be
accessed for management individually or as a whole entity.

Gapminder is a non-profit venture promoting sustainable global development. The
site presents engaging analyses of open data sets with clarifying statistics on such
topics as:

- Health and wealth of nations

- CO2 emissions since 1820

- Child mortality

- HIV infection

Statistician Hans Rosling gave a must-see TED talk, The Best Stats You’ve Ever
Seen, bringing to life the facts and myths about the developing world in a quick tour
through human history.

WHAT IS PRIVATE DATA?

The expectation of privacy and what an individual or society considers private data
continues to evolve. As new apps are developed, more and more data is requested
from the end user to give companies and advertisers more information to make
business decisions.

The current state of data protection regulations across the globe is shown in the figure.

What is the right approach to maximize the benefit of new sources of data while, at
the same time, empowering individuals with the ability to control access to their
personal data? Some intriguing work is being done in this area including efforts by
openPDS and Privacy by Design.

Instead of looking to strip personal data of all its identifying characteristics, called
data anonymization, openPDS takes a slightly different approach. Using what they
call the SafeAnswers framework, openPDS provides only answers to specific queries.
No raw data is sent. The calculation for the answer is done within the user’s personal
data store (PDS):

“Only the answers, summarized data, necessary to the app leaves the boundaries of
the user’s PDS. Rather than exporting raw accelerometer or GPS data, it could be
sufficient for an app to know if you’re active or which general geographic zone you
are currently in....computation can be done inside the user’s PDS by the
corresponding Q&A module.”

Privacy by Design began in the 1990s to address the growing concern of large-scale,
networked data systems. With its “7 Foundational Principles”, Privacy by Design
“advances the view that the future of privacy cannot be assured solely by compliance
with legislation and regulatory frameworks; rather, privacy assurance must ideally
become an organization’s default mode of operation.”

In Europe, for example, the official texts of a new regulation, the General Data
Protection Regulation (GDPR), have been published in the EU Official Journal. The
regulation will apply starting in May of 2018.

STRUCTURED DATA

Previously, we have classified data in terms of its accessibility; data is either open or
private. Data can also be classified by the way it is arranged, either structured or
unstructured.

Structured data refers to data that is entered and maintained in fixed fields within a
file or record. Structured data is easily entered, classified, queried, and analyzed by a
computer. This includes data found in relational databases and spreadsheets. For
example, when you submit your name, address, and billing information to a website,
you are creating structured data. The structure will force a certain format for entering
the data to minimize errors and make it easier for a computer to interpret it.

If the data set is small enough, structured data is often managed with Structured
Query Language (SQL), a programming language created for querying data in
relational databases. SQL only works on structured data sets. However, with Big
Data, structured data may be part of a data set, but Big Data tools do not depend on
that structure. It is not uncommon for Big Data to have data sets that consist of
unstructured data.

UNSTRUCTURED DATA

Unstructured data lacks the organization found in structured data. It is raw data, not
organized in a predefined way, and it does not possess a fixed schema that identifies
the type of the data. There is no set way of entering, grouping, or analyzing
unstructured data. Examples of unstructured
data include the content of photos, audio, video, web pages, blogs, books, journals,
white papers, PowerPoint presentations, articles, email, wikis, word processing
documents, and text in general. Figure 1 shows Dostoevsky’s notes for Chapter 5 of
The Brothers Karamazov. The contents of the notes are not searchable because they
have no structure. Figure 2 shows Chapter 5 of the same novel after publication. Even
a PDF version of this chapter is unstructured. The text is searchable, but it is not
organized in a predefined form, for example, using fields and records.

Both structured and unstructured data are valuable to individuals, organizations,
industries, and governments. It is important for organizations to take all forms of data
and determine ways to format that data so it can be managed and analyzed.

NATURE OF DATA

In the past, data sets were mostly static, residing on a single server or a collection of
servers within the organization, and processed using a database programming
language like SQL. Although this model still exists, storage of large data sets has
migrated to data centers. Today, with the rise of cloud computing, Big Data, and the
need to analyze data in real time, data continues to be stored in data centers. Data
must also be available for analysis closer to where it is created, where the knowledge
gained from that data can have the greatest impact. This is called fog computing.
As shown in the figure, fog is a cloud close to the ground, close to the source of data
generation. Fog computing is not a replacement for cloud computing; rather, fog
computing enables the development of new tools. In the fog computing model, there
is interplay between the cloud and the fog, particularly when it comes to data
management and analytics. Fog computing provides compute, storage, and
networking services between end devices and traditional data centers.

The various sensors and controllers in an IoT deployment produce an enormous
amount of data. When dealing with data in the IoT, three very important factors must
be taken into consideration:

- Energy or Battery – The amount of energy used by an IoT sensor, for example,
depends on the sample rate of the sensor. Range between devices can also affect how
much energy must be used to report sensor data to controllers. The farther away the
sensor, the more energy that must be used by the transmitting radio.

- Bandwidth – When many sensors are transmitting data, there may be delay in
communications if there is not enough bandwidth to support all of the devices.
Additional analysis in the fog can help to alleviate some communications bandwidth
requirements.

- Delay – Real-time data analysis is affected when there is too much delay in the
network. It is very important that only the necessary communications to the cloud are
performed and that computation happens as close to the data source as possible.

DATA AT REST AND DATA IN MOTION

Data can be at rest and data can be in motion. Data at rest is static data that is stored in
a physical location, for example, on a hard drive in a server or data center. Data at rest
follows the traditional analysis flow of Store > Analyze > Notify > Act. Data is stored
in a database and then analyzed and interpreted. Decision makers are notified and
determine whether action is required.

Data in motion is dynamic data that requires real-time processing before that data
becomes irrelevant or obsolete. It represents the continuous interactions between
people, process, and things. Analysis and action happen sooner rather than later.
Devices at the edge of the network work together to act immediately on knowledge
gained from dynamic data analysis. The flow of analysis for data in motion is often
Analyze > Act > Notify > Store. The order of analyze, act, and notify can be different.
The important distinction between data at rest and data in motion is that, with data in
motion, acting on the data happens before the data is stored.

Data in motion is used by a variety of industries that rely on extracting value from
data before it is stored. Sensors in a farmer’s field continuously send data for
temperature, soil moisture, and sunlight to a local controller. The controller analyzes
the data. If the conditions are not right, the controller acts immediately by sending
signals to actuators in the field to begin watering. The controller then sends a
notification to the owner of the field that watering has begun. Then the controller
sends the data to be stored for historical records.
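
A minimal sketch of this Analyze > Act > Notify > Store flow is shown below; the
threshold value and the helper functions are hypothetical placeholders, not part of any
particular controller product.

SOIL_MOISTURE_THRESHOLD = 30.0   # hypothetical threshold, in percent

def start_watering():
    print("actuators: watering on")          # Act: signal the actuators in the field

def notify_owner(message):
    print("notify:", message)                # Notify: alert the owner of the field

def store_reading(moisture):
    print("stored reading:", moisture)       # Store: keep the data for historical records

def handle_reading(moisture):
    if moisture < SOIL_MOISTURE_THRESHOLD:   # Analyze: are the conditions right?
        start_watering()
        notify_owner("Watering has begun")
    store_reading(moisture)

handle_reading(22.5)
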
Due to the characteristics of Big Data, it is no longer feasible to duplicate and store all
that data in a centralized data warehouse. Emerging device implementations include a
large number of sensors capturing and processing data. Decisions and actions need to
take place at the edge, where and when the data is created. With sensors gaining more
processing power and becoming more context-aware, it is now possible to bring
intelligence and analytic algorithms close to the source of the data. In this case, data
in motion stays where it is created and presents insights in real time, prompting better,
faster decisions.

HOW DID WE GET HERE?

To be competitive and relevant, an organization’s leaders must make decisions related
to technology adoption. Every organization will have to decide whether to migrate
from on-premise technology solutions to a cloud environment, a fog model, or a
hybrid approach. This decision must take into consideration the types of Big Data
they need to run their business successfully in the future. Click Play in the figure to
view a video on the challenges of Big Data.

BIG DATA INFRASTRUCTURE

Many companies realize that it makes sense to invest in some of the Big Data
technologies to remain competitive in their market. Currently, their data
infrastructures may look something like Figure 1, with database servers and
traditional data processing tools. Typically, data access is limited to a few
knowledgeable individuals within the organization. Companies are rapidly moving
towards leveraging Big Data technologies to drive business intelligence. According to
NIST, the Big Data paradigm consists of the distribution of data systems across
horizontally coupled, independent resources to achieve the scalability needed for the
efficient processing of extensive data sets. This is horizontal scalability. It is different
from vertical scalability in that it does not attempt to add more processing power,
storage or memory to existing machines. These infrastructures can allow many users
to seamlessly and securely access the data simultaneously. One such example is
thousands of online shoppers or mobile gamers. In this course, we will briefly explore
the technologies that are now common with Big Data implementations, as shown in
Figure 2. The icons in Figure 2 represent devices in an organization’s Big Data
infrastructure.

FLAT FILE DATABASES

Before SQL and other database programming languages were commonplace,
professionals worked with flat file databases. A flat file database stores records in a
single file with no hierarchical structure. As shown in the figure, these databases
consist of columns and rows. Columns are also called fields, and rows are also called
records. A spreadsheet file is an example of a flat file database.
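
A comma-separated values (CSV) file is a common flat file format. The sketch below
reads such a file with Python's built-in csv module; the file name and column names
are assumed only for illustration.

import csv

# Assumes a flat file named inventory.csv whose header row (the fields) is: id,name,price
with open("inventory.csv", newline="") as flat_file:
    for record in csv.DictReader(flat_file):   # each row of the file is one record
        print(record["id"], record["name"], record["price"])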

RELATIONAL DATABASES

The next generation of data management emerged with the relational database
management system (RDBMS). For 30 years, this was the standard approach to data
management. Relational databases capture the relationships between different sets of
data, creating more useful information. In Figure 1, these relationships are shown with
lines. For instance, more detail about subcontractors can be accessed from both
Product and Material database queries.

IBM's Structured Query Language/Data Store (SQL/DS) and Relational Software
Corporation's Oracle were the first two commercial RDBMS solutions. Most
commercial RDBMS solutions use SQL as their query language to this day. An
example of SQL is: SELECT id, name, price FROM inventory WHERE price < 20.
Examples of products that use structured query language to access data include
MySQL, SQLite, MS SQL, Oracle, and IBM DB2.

Another characteristic of relational databases is the distinction between the database
and the management system used to query the database. Typically, with an RDBMS
and the underlying database, many users can be querying the relational database at the
same time. The user normally does not know all the relationships that exist inside the
database. Rather, the user abstracts a view of the database that is suited to that user’s
needs.

Figure 2 shows a simplified view of data abstraction in a relational database. The
lowest level of abstraction describes how the data is physically stored. The next level
describes what data is stored and the relationships between the data. This is the level
where a database administrator operates. The user level is the highest level and
describes what part of the database a particular user or group of users can access.
There can be many different views defined and many simultaneous connections to the
database at any given time.

In contrast to traditional SQL relational database management systems that can be
challenging to scale, non-relational SQL (NoSQL) databases scale very well as
distributed databases. Because NoSQL can handle Big Data and real-time web
applications better than an RDBMS, NoSQL database queries are focused on the
collection of documents, such as information gathered from websites. NoSQL also
allows clusters of machines to process the data and provide better control over
availability.

NoSQL databases are being adopted widely to solve business problems.

DISTRIBUTED DATA AND PROCESSING

From a data management perspective, analytics were simple when only humans
created data. The amount of data was manageable and relatively easy to sift through.
Relational databases served the needs of data analysts. However, with the
pervasiveness of business automation systems and the explosive growth of web
applications and machine-generated data, analytics is becoming increasingly
difficult to manage with just an RDBMS solution. In fact, 90% of data that exists
today has been generated in just the last two years. This increased volume within a
short period of time is a property of exponential growth. This high volume of data is
difficult to process and analyze within a reasonable amount of time.

Rather than large databases being processed by big and powerful mainframe
computers and stored in giant disk arrays (vertical scaling), distributed data
processing takes the large volume of data and breaks it into smaller pieces. These
smaller data volumes are distributed in many locations to be processed by many
computers with smaller processors. Each computer in the distributed architecture
analyzes its part of the Big Data picture (horizontal scaling).

Most distributed file systems are designed to be invisible to client programs. The
distributed file system locates files and moves data, but the users have no way of
knowing that the files are distributed among many different servers or nodes. The
users access these files as if they were local to their own computers. All users see the
same view of the file system and are able to access data concurrently with other users.

Hadoop was created to deal with these Big Data volumes. The Hadoop project started
with two facets: the Hadoop Distributed File System (HDFS), a distributed, fault-
tolerant file system, and MapReduce, a distributed way to process data. Hadoop has
now evolved into a very comprehensive ecosystem of software for Big Data
management. There are many other distributed file system (DFS) programs. Here are
just a few: Ceph, GlusterFS, and Google’s File System.
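
To illustrate the idea behind MapReduce, here is a minimal, single-machine sketch in
Python of the classic word-count pattern; a real Hadoop job would distribute the map
and reduce steps across many nodes in the cluster.

from collections import defaultdict

documents = ["big data tools", "big data platforms", "data in motion"]

# Map step: emit a (word, 1) pair for every word in every document
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle and reduce step: group the pairs by word and sum the counts
counts = defaultdict(int)
for word, value in mapped:
    counts[word] += value

print(dict(counts))   # {'big': 2, 'data': 3, 'tools': 1, 'platforms': 1, 'in': 1, 'motion': 1}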

A NoSQL database stores and accesses data differently than relational databases.
NoSQL is sometimes called “Not only SQL”, “non SQL”, or “non-relational”.
NoSQL systems may support SQL-like query languages. NoSQL databases use data
structures such as key-value, wide column, graph, or document. Many NoSQL
databases provide "eventual consistency". With eventual consistency, database
changes eventually appear in all nodes. This means that queries for data might not
provide the most recent information available. This problem is known as ‘stale reads’.
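
As a minimal illustration, a document-style record can be pictured as a nested
key-value structure; the field names below are hypothetical.

# One "document" as it might appear in a document-oriented NoSQL database
page_visit = {
    "url": "http://example.com/products/42",
    "visitor": {"country": "US", "device": "mobile"},
    "clicks": ["add-to-cart", "checkout"],
}

# Unlike a relational row, documents in the same collection need not share a fixed schema
print(page_visit["visitor"]["country"])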

The reason for creating NoSQL was to make database design simpler. It is easier to
scale clusters of nodes using NoSQL than it is in standard relational databases.

The most popular NoSQL databases in 2015 were MongoDB, Apache Cassandra, and
Redis.

WHAT IS SQLite?

A structured query language (SQL) is designed to manage, search, and process data,
including Big Data. SQLite is a simple and easy to use SQL database engine that you
will use in labs later in this course.

SQLite is an in-process library that implements a self-contained, transactional SQL
database engine. The code for SQLite is in the public domain, which means it is free to
use for commercial or private purposes. SQLite is the most widely deployed database
in the world. Go here to learn more about the many high-profile organizations that use
SQLite.

SQLite is also an embedded SQL database engine. Unlike most other SQL databases,
SQLite does not have a separate server process. SQLite reads and writes directly to
ordinary disk files.

SQLite is a popular choice for the database engine in mobile phones, MP3 players,
set-top boxes, and other electronic gadgets. SQLite has a small code footprint, makes
efficient use of memory, disk space, and disk bandwidth, is highly reliable, and
requires no maintenance from a Database Administrator. SQLite is often used instead
of an enterprise RDBMS for testing. SQLite requires no setup, which makes testing
much easier. This also makes it a good choice for databases that are behind small to
medium-sized websites.
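
As a small preview of the later labs, the sketch below uses Python's built-in sqlite3
module; the table name and columns are chosen here only for illustration.

import sqlite3

# An in-memory database: no server process and no file on disk
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

cursor.execute("CREATE TABLE inventory (id INTEGER, name TEXT, price REAL)")
cursor.executemany("INSERT INTO inventory VALUES (?, ?, ?)",
                   [(1, "widget", 9.99), (2, "gadget", 24.50), (3, "gizmo", 15.00)])

# The same style of query shown earlier in this chapter
for row in cursor.execute("SELECT id, name, price FROM inventory WHERE price < 20"):
    print(row)

connection.close()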

SQLite FEATURES

SQLite has several useful features. A few are listed here:

- No setup or administration is required. It has an easy to use API.

- A complete database is stored in a single, cross-platform disk file. It can be used as
an application file format.

- It has a small code footprint.

- It is a cross-platform SQL database engine. It supports Android, iOS, Linux, Mac,
Windows, and several other operating systems.

- Sources for SQLite are in the public domain.

- It has a stand-alone command-line interface (CLI).

- All changes within a single transaction occur completely or not at all. This is true
even in the event of a program or operating system crash, or a power failure.

ABOUT PROGRAMMING

Using SQL and database technologies is very effective to extract a subset of data from
an existing data set stored in the database. The SQL expression that performs this
action is called a SQL query. In business, many important problems cannot be solved
with just a simple SQL query and need a more complex analytical process. This is
where a more powerful data analysis programming language, like R or Python,
comes in. R and Python have very large communities of developers. Their users are
known for developing data analysis modules and making them available to the
community free of charge. Because of that, any user can download and use pre-
programmed modules and tools.

While third party tools and program modules are very useful, it is very important to
know how to create your own data analysis tools. The ability to create data analysis
tools from scratch allows for highly customized applications. The process of creating
a data analysis tool from scratch can be divided in two main parts: the model and the
code.

Modeling consists of deciding what to do with the data to achieve the desired results
and conclusions. Suppose for example, you want to create a personal fitness tracker.
Suppose there is no pre-programmed module in existence to do exactly what you want
to do. This is why learning a programming language is so important. You do not have
to change your idea of what you want, instead, you can become part of the developer
community and create exactly what you want.

In your model, the tracker, which is built into a chest band, contains an
accelerometer, which is a sensor capable of measuring the device’s acceleration. The
accelerometer can be used to determine the speed and direction of movement. The
speed and direction of the device’s movement always matches the speed and direction
of its user when attached to the user’s chest. But what if the device is attached to a
dumbbell weight? How about a tennis racket? The device will still yield the same
data, speed, and movement direction, but because of the different applications, the
interpretation of this data must be adjusted for the new usage. In this context,
modeling can be seen as a way to interpret and process data. If the fitness tracker is
attached to the user’s chest, two consecutive points of no movement (speed equals
zero), likely represent the beginning and end of a sprint. When attached to a dumbbell
weight, the same data points likely represent the moment that the dumbbell was
picked up off the floor and the highest point the user was able to lift it before putting
it back on the floor.
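
A minimal sketch of this kind of model is shown below; the speed values and the
interpretation rules are invented purely to show how the same data can be read
differently depending on the model.

# Speeds (in m/s) derived from hypothetical accelerometer data, sampled once per second
speeds = [0.0, 2.1, 3.4, 3.2, 0.0, 0.0, 1.0, 0.0]

def still_points(samples):
    # Indexes of consecutive points of no movement; the model decides what they mean
    return [i for i, s in enumerate(samples) if s == 0.0]

print("points of no movement at seconds:", still_points(speeds))

# Chest-band model: the first and last still points likely bracket a sprint.
# Dumbbell model: the same still points likely bracket a single lift off the floor and back.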

The code (or the program) is the second part of creating data analysis tools from
scratch. The code is the program that processes the data and must be written
according to the model created. While the model and the code are two separate
entities, they are related because the code is built based on the model. In this course,
we focus on the programming language known as Python.

CHAPTER 1: DATA AND THE INTERNET OF THINGS

Data can be the words in a book, article, or blog. Data can be the contents of a
spreadsheet or a database. Data can be pictures or video. Data can be a constant
stream of measurements sent from a monitoring device.

There are two types of data growth: linear growth and exponential growth.
Exponential growth is much more dramatic than linear growth. The digital
transformation and its generation of Big Data has a profound impact on three main
elements of our lives: business, social, and environmental.

The four Vs of Big Data are: volume, velocity, variety, and veracity.

Open data (or knowledge) is “any content, information or data that people are free to
use, reuse, and redistribute without any legal, technological, or social restriction.” The
expectation of privacy and what an individual or society considers private data
continues to evolve.

Structured data refers to data that is entered and maintained in fixed fields within a
file or record. Unstructured data does not possess a fixed schema that identifies the
type of the data.

Data at rest is static data that is stored in a physical location, for example, on a hard
drive in a server or data center. Data at rest follows the traditional analysis flow of
Store > Analyze > Notify > Act. Data in motion is used by a variety of industries that
rely on extracting value from data before it is stored. The flow of analysis for data in
motion is often Analyze > Act > Notify > Store.

A flat file database stores records in a single file with no hierarchical structure. These
databases consist of columns and rows. Relational databases capture the relationships
between different sets of data, creating more useful information.

Rather than large databases being processed by big and powerful mainframe
computers and stored in giant disk arrays (vertical scaling), distributed data
processing takes the large volume of data and breaks it into smaller pieces. These
smaller data volumes are distributed in many locations to be processed by many
computers with smaller processors. Each computer in the distributed architecture
analyzes its part of the Big Data picture (horizontal scaling).

A structured query language (SQL) is designed to manage, search, and process data,
particularly Big Data. SQLite is a simple and easy to use SQL database engine.

CHAPTER 2: FUNDAMENTALS OF DATA ANALYSIS

Data is being generated at an unprecedented rate by machines, by people, and by
things. The Cisco Visual Networking Index (VNI) forecasts global Internet traffic
growth and broadband trends for mobile and fixed networks. According to the Cisco
VNI, IP traffic will triple over the next 3 years. By 2020, there will be more than 26
billion global IP networked devices/connections (up from 16.3 billion in 2015).
Globally, IP traffic will reach 194.4 exabytes per month in 2020. Internet video will
account for 79 percent of global Internet traffic by 2020. That is up from 63 percent in
2015. The world will reach three trillion Internet video minutes per month by 2020,
which is five million years of video per month, or about one million video minutes
every second.

We are more connected than ever. In our homes, schools, work and even the areas in
which we play, advancements in IoT technologies are generating large quantities of
data. Everywhere you go and everything you do in this digital world becomes a new
source of data. Data is being generated from sensors, devices, video, audio, networks,
log files, transactional applications, the web and social media. It is more commonly
streaming over the networks and comes in a variety of sizes and formats. The high
volume, high velocity, and high variety of these data sets are the key features that
distinguish data from Big Data.

The emergence of these large data sets requires more advanced methods, technologies,
and infrastructure to process the data and convert it to actionable information. Data
can no longer be stored on a few machines nor processed with one tool.

Companies are actively creating profiles and processing data on their systems, their
users, and their processes to spur growth and innovation. Researchers and analysts are
looking for ways to access and analyze data that was once considered unusable.
Advanced analytics techniques, such as text analytics, machine learning, predictive
analytics, data mining, statistics, and natural language processing, can be used on
large data sets. Businesses can analyze previously untapped data sources independently,
or together with their existing enterprise data, to gain new insights. These insights
result in significantly better and faster decisions.

This chapter explains what we mean by data analysis, what it can do for businesses
and other organizations, and the tools and methodologies available. In the past, data
analysts had access to historical, static data. This data was from the past and it did not
change. Now, with sensors, social media and other sources, data is dynamic. It needs
to be analyzed as soon as it is created. There is also so much more of this dynamic
data that we need new tools and methodologies, new storage solutions, and new ways
of thinking about this Big Data.

CRISP-DM

What does data analysis mean? Is it a standardized process or more of an art? With
data analysis, we start with a business question and the availability of some data. We
end with the creation of new information relevant to solving the business question.
This chapter presents the concept of data analysis and applies basic elements of data
analysis to a particular situation. In this case, we measure the Internet speed of a
connected node.

There are many methodologies for conducting data analysis, including the popular
Cross Industry Standard Process for Data Mining (CRISP-DM) used by more than
40% of data analysts. About 27% of data analysts use their own methodology. The
rest use a variety of other methodologies. (Source: KDnuggets)

To keep it simple in this introductory course, we will use the six-step Data Analysis
Lifecycle shown in the figure. Closely resembling the scientific method, the Data
Analysis Lifecycle is designed for use in a business environment. Notice that arrows
are pointing in both directions between some steps. This highlights the fact that the
lifecycle may require many iterations before decision makers are confident enough to
move forward.

DATA ANALYTICS TOOL CAPABILITIES

Even before computers were invented, the information gathered while doing business
was reviewed with the goal of making processes more efficient and profitable. With
the limited amount of data and the painstaking process of manual analysis, the task
was still worthwhile. Today, with the massive growth in the volumes of data,
computers and software are required to gain insight into business patterns and make
sense of all this data.

What tools you use depends on your needs and the solutions you have already
implemented. Because “best” is a relative term, the tools you use will depend on your
specific objectives or the questions you are trying to answer. The tool to use depends
on the type of analysis to be performed. Some tools are designed to handle
manipulation and visualization of large data sets. Other tools are designed with
complex mathematical modeling and simulation capabilities for prediction and
forecasting. No matter which tools are used, they should provide the following
capabilities:

- Ease of use – The tool that is easy to learn and to use is often more effective than a
tool that is difficult to use. Also, a tool that is easy to use requires less training and
less support.

- Data manipulation – The software should allow users to clean and modify the data to
make it more usable. This leads to data being more reliable because anomalies can be
detected, adjusted or removed.

- Sharing – Everyone must be looking at the same data sets to be able to collaborate
effectively. This helps people to interpret data the same way.

- Interactive visualization – To fully understand how data changes over time, it is
important to visualize trends. Basic charts and graphs cannot fully represent how
information evolves the way a heat map or time motion view can.

THE ROLE OF PYTHON IN DATA ANALYSIS


There are a variety of programs that are used to format data, clean it, analyze it, and
visualize it. Many companies and organizations are turning to open source tools to
process, aggregate and summarize their data. The Python programming language has
become a commonly used tool for handling and manipulating data. Python will be
used in this course to perform all of these functions.

Python was created in 1991 as an easy to learn language. It now has many libraries for
data manipulation, machine learning, and data visualization. Through the use of these
libraries, programmers do not have to learn multiple programming languages or spend
time learning how to use many different programs to perform the functions of these
libraries. Python is a flexible language that is growing and becoming more integral to
data science because of this flexibility and ease of learning.

This course will use Jupyter Notebooks, shown in the figure. Jupyter Notebooks allow
instruction and programming to exist within the same file. It is easy to alter code in
the notebooks and experiment with how different code can be used to manipulate,
analyze, and visualize your data.

These are some of the libraries that will be used in this course:

- NumPy – This library adds support for arrays and matrices. It also has many built-in
mathematical functions for use on data sets.

- Pandas – This library adds support for tables and time series. Pandas is used to
manipulate and clean data, among other uses.

- Matplotlib – This library adds support for data visualization. Matplotlib is a plotting
library capable of creating everything from simple line plots to complicated 3D and
contour plots.
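
The short sketch below gives a first taste of these libraries working together inside a
notebook; the sample values are made up for illustration.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: an array with built-in mathematical functions
speeds = np.array([10.2, 11.5, 9.8, 12.1])
print(speeds.mean())

# Pandas: a small table (DataFrame) built from the array
df = pd.DataFrame({"test": [1, 2, 3, 4], "Mbps": speeds})
print(df.describe())

# Matplotlib: a simple line plot of the measurements
plt.plot(df["test"], df["Mbps"])
plt.xlabel("test number")
plt.ylabel("download speed (Mbps)")
plt.show()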

BIG DATA AND DECISION MAKING

The scalable technologies made possible by distributed computing and virtualization
are enabling data center administrators to manage the top three of the four aspects of
Big Data: volume, velocity, and variety. Statistical methodologies embedded in
applications are empowering data analysts to interpret and use Big Data to make
better decisions. Modern data analysis tools make it possible to extract and transform
the raw data to produce a much smaller set of quality data. However, data alone is not
meaningful information; the data must be analyzed and then presented in a form that
can be interpreted. This is what decision makers need to take the right action.

Decision makers will increasingly rely on data analytics to extract the required
information at the right time, in the right place, to make the right decision. This
information can tell many different stories, depending on how the data is analyzed.
For example, in politics, it is common practice for data analysts to extract information
that is relevant to their candidate. In business, a data analyst may uncover market
trends that enable a company to move ahead of its competition.

DATA, INFORMATION, KNOWLEDGE, AND WISDOM

The Data, Information, Knowledge, and Wisdom (DIKW) model shown in the figure
is used to illustrate the transitions that data undergoes until it gains enough value to
inform wise decisions. This structure provides a means of communicating the value of
data in various states of incarnation.

The following is an example of each level of the pyramid, from the bottom up:

- Data – Collect temperature readings from multiple, geo-localized sensors.

- Information - Extract temporal and localization insights. These insights show that
temperatures are constantly rising globally.

- Knowledge - Compare multiple hypotheses; it becomes apparent that the rise
appears to be caused by human activities, including greenhouse gas emissions.

- Wisdom – Work to reduce greenhouse gas emissions.

Wise decisions rely on a well-established base of knowledge. A common phrase used
in data analytics is business intelligence. Business intelligence encompasses the entire
process from data to information to knowledge to wisdom.

DESCRIPTIVE ANALYTICS

There are multiple types of analytics that can provide businesses, organizations and
people with information that can drive innovation, improve efficiency and mitigate
risk. The type of data analytics to implement will depend on the problem that needs to
be solved or questions that need to be answered.

Three types of data analytics will be covered in this course:

- Descriptive

- Predictive

- Prescriptive

Descriptive analytics primarily uses observed data. It is used to identify key
characteristics of a data set. Summarized data from descriptive analytics provides
information on prior events and trends in performance. Descriptive analytics relies
solely on historical data to provide regular reports on events that have already
happened. This type of analysis is also used to generate ad hoc reports that summarize
large amounts of data to answer simple questions like “how much...” or “how
many...” or “what happened.” It can also be used to drill down into the data, asking
deeper questions about a specific problem. The scope of descriptive analytics is to
summarize your data into more compact and useful information.

An example of a descriptive analysis is an hourly traffic report.

PREDICTIVE ANALYTICS

Predictive analytics attempts to predict what may happen next with a certain degree of
confidence, based on data and statistics. Predictive analytics can be used to infer
missing data and establish a future trend line based on past data. It uses simulation
models and forecasting to suggest what could happen.

An example of a predictive analysis is a computer model that uses Big Data to
forecast the weather.

Another way to look at predictive analytics is to produce new data by starting with
existing data. A common example is the price of a house. Imagine you want to sell
your house and you do not know what price to set for it. You can take the prices of
recent sales of houses in the neighborhood and the characteristics of those houses (e.g.,
number of bedrooms, bathrooms, status, etc.) as an indication of the price. But your
house is probably not identical to any of the other houses. Here is where predictive
analytics can help. A predictive model for the price is based on the data that you have
of previous sales. It “predicts” the appropriate price for your house. Another example is classification: given a tweet or a post, classify it as positive or negative based on the text it contains.
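
The house price example can be sketched in Python. The following uses the scikit-learn library to fit a simple linear model to invented neighborhood sales data; the feature values and prices are hypothetical and purely for illustration.

# A minimal sketch of a predictive model for house prices.
# The neighborhood sales data below is invented for illustration.
from sklearn.linear_model import LinearRegression

# Features: [bedrooms, bathrooms]; target: sale price
X = [[2, 1], [3, 2], [4, 2], [3, 1]]
y = [180000, 240000, 300000, 210000]

model = LinearRegression()
model.fit(X, y)

# "Predict" an appropriate price for your own house (3 bedrooms, 2 bathrooms)
predicted_price = model.predict([[3, 2]])
print(round(predicted_price[0]))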

In 2014, Jameson Toole won the MIT Big Data Challenge: “What can you learn from
data about 2.3 million taxi rides?” Based on information garnered by writing pattern-
matching machine-learning algorithms for very large data sets, Toole was able to
predict the number of taxi pickups that would likely occur in 700 two-hour time
intervals at 36 locations in the Boston area. Knowledge of where and when taxis are
needed most could be used to reduce traffic congestion by eliminating the need for taxis
to drive around the city looking for people to pick up.

PRESCRIPTIVE ANALYTICS

Prescriptive analytics predicts outcomes and suggests courses of action that will hold
the greatest benefit for the enterprise or organization. Prescriptive analytics
recommends actions or decisions based on a complex set of targets, constraints, and
choices. It can be used to suggest how to mitigate or even avoid risks. Prescriptive
analytic implementations may require a feedback system to track the outcome of the
actions taken.

An example of a prescriptive analysis is a computer model that uses Big Data to make
stock market recommendations to buy or sell a stock.

All three types of analytics are used in data analysis.

THE ROLE OF TIME IN DATA ANALYTICS

Before the era of Big Data, the role of time in data analytics was restricted to how
long it took to compile a data set from disparate sources, or how long it took to run a
data set through some calculation. With Big Data, time becomes important in other
ways because much of the value of data is derived from creating opportunities to take
action immediately.

Data is being generated constantly by sensors, consumers, social media users, jet
engines, the stock market, and almost anything else that is connected to a network.
This data is not just growing in quantity; it is also changing in real time. Data analysis
must also be carried out in real-time while the data is being collected.
When discussing Big Data and business, making decisions based on analytics can
improve the return on investment (ROI) for businesses as a function of time. Data-
driven decisions can have the following benefits:

- Increased time for research and development of products and services

- Increased efficiency and faster manufacturing

- Faster time to market

- More effective marketing and advertising

TRADITIONAL ANALYTICS TO BIG DATA ANALYTICS

In the past, when most data sets were relatively small and manageable, analysts could
use traditional tools such as Excel or a statistical program, such as SPSS, to create
meaningful information out of the data. Typically, the data set contained historical
data and the processing of that data was not always time dependent. Traditional
databases had to be designed before the data could be entered. Then the data, if not
too large, could be cleaned, filtered, processed, summarized and visualized using
charts, graphs and dashboards.

As the data sets grow in volume, velocity and variety, the complexity of data storage,
processing, and aggregation becomes a challenge for traditional analytic tools. Large
data sets may be distributed and processed across multiple, geographically-dispersed
physical devices as well as in the cloud. Big Data tools, such as Hadoop and Apache
Spark, are needed for these large data sets to enable real-time analysis and predictive
modeling.

NEXT GENERATION ANALYTICS

For businesses to make optimal decisions, it is no longer enough to gather data from the previous fiscal year and run descriptive analytics queries.
necessary to use predictive and prescriptive analysis tools to remain competitive in a
world in which the rate of change is accelerating. Next generation analytics do not
have to solely rely on performing statistical analytics on an entire data set, as was
done with traditional analytics tools. Because of the vast amount of data points and
attributes collected per record or per “thing”, new behaviors and insights can be
gained from advanced analyses that improve prediction and prescription accuracy.

For example, the following questions can be answered to make real-time adjustments
to decisions:

• Which stocks will most likely have the highest daily gain based on trading in the last
hour?

• What is the best way to route delivery trucks this afternoon, based on morning sales,
existing inventory, and current traffic reports?

• What maintenance is required for this airplane, based on performance data generated
during the last flight?
The handling of this machine-generated data, combined with the geographical scope of very large-scale systems, the number of data-generating devices, the diversity of device manufacturers, the frequency of data generation, and the overall volume of data requires new infrastructure software. This infrastructure software must be able to distribute the computing and the storage of the data among edge, fog, and cloud, wherever it best fits the needs of the business.

THE SCIENTIFIC METHOD

The process that a data analyst uses to make conclusions is very similar to the
scientific method shown in the figure. A data analyst may ask the question, “What
district in San Francisco had the most incidents of reported crimes between June 1 and
August 1 of 2014?” A scientist may want to solve the problem, “Why does blood
from a young mouse reverse the effects of aging when put into an older mouse?”
Regardless of the exact method or steps used, both data analysts and scientists will
complete a process that includes asking questions, gathering data, analyzing the data,
and making conclusions or presenting the results.

BUSINESS VALUE

Data analytics allows businesses to better understand the impact of their products and
services, adjust their methods and goals, and provide their customers with better
products faster. The ability to gain new insights from their data brings business value.

Michael Porter from Harvard describes how IT, for the third time in 50 years, has
reshaped business:

“The first wave of IT, during the 1960s and 1970s, automated individual activities
like paying employee stipends or supporting the design and manufacturing of
products. The second wave of business transformation was the rise of the Internet, the
1980s and 1990s, that enabled the coordination and integration of outside suppliers,
distribution channels, and customers across geography.

With IoT we are now in the third wave, IT is becoming an integral part of the product
itself. Embedded sensors, processors, software, and connectivity in products (in
effect, computers are being put inside products), coupled with a cloud where product
data is stored and analyzed and some applications are run, are driving dramatic
improvements in product functionality and performance. Massive amounts of new
product-usage data enable many of those improvements.”

DATA ANALYSIS LIFECYCLE EXAMPLE

Like in the scientific method, the Data Analysis Lifecycle begins with a question. For
example, we could ask the question, “What was the most prevalent crime committed
in San Francisco, California on July 4, 2014?” Each step in the Data Analysis
Lifecycle includes many tasks that must be completed before moving on to the next
step. Only one example task is shown in the figure.

The following is a brief description of each step:

- Gathering the data - The process of locating data and then determining if there is
enough data to complete the analysis. In this case, we would search for an open data
set of crime statistics for San Francisco during July of 2014.

- Preparing the data - This step can involve many tasks to transform the data into a
format appropriate for the tool that will be used. The crime data set may already be
prepared for analysis. However, there are usually some adjustments to make to help
answer the question.

- Choosing a model - This step includes choosing an analysis technique that will best
answer the question with the data available. After a model is chosen, a tool (or tools)
for data analysis is selected. In this chapter, you will learn to use Python and Python
libraries to prepare, analyze, and present data.

- Analyzing the data - The process of testing the model against the data and
determining if the model and the analyzed data are reliable. Were you able to answer
the question with the selected tool?

- Presenting the results - This is usually the last step for data analysts. It is the process
of communicating the results to decision-makers. Sometimes, the data analyst is asked
to recommend actions. For the July 4th crime data, a bar graph, a pie chart, or some
other representation could be used to communicate which crime was most prevalent.
An analyst might suggest increasing police presence in certain areas to deter crime on
a specific holiday like July 4th.

- Making decisions - The final step in the data analysis lifecycle. Organizational
leaders incorporate the new knowledge as part of the overall strategy. The process
begins anew with gathering data.

FILES

There are many different sources of data. A vast amount of historical data can be
found in files such as MS Word documents, emails, spreadsheets, MS PowerPoints,
PDFs, HTML, and plaintext files. These are just a few of the types of files that
contain data.

Big Data can also be found in public and private archives. Scanned paper archives
containing historical data from a variety of sources are certainly Big Data. For example, there is an enormous amount of data in medical insurance forms and invoices, business statements and customer interactions, and tax documents. This list is
just a small portion of archived data.

Internal to organizations, raw data is created through customer relationship management systems, learning management systems, human resource systems and records, intranets, and other processes.

Different applications create files in different formats that are not necessarily
compatible with one another. For this reason, a universal file format is needed.
Comma-separated values (CSV) files are a type of plaintext file outlined in RFC
4180. CSV files use commas to separate columns in a table of data, and the newline
character to separate rows, with each row forming a record. Although CSV files are commonly used for importing and exporting data in traditional databases and spreadsheets, there is no single strict standard for their format. JSON and XML are also plaintext file types that use a standard way
of representing data records. These file formats are compatible with a wide range of
applications. Converting data into a common format is a valuable way to combine
data from different sources.
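
The following is a minimal Python sketch of converting CSV records into JSON using the standard csv and json modules; the file names and fields are hypothetical.

# A minimal sketch of converting CSV records to JSON.
# The file names and field names are hypothetical.
import csv
import json

# Create a small CSV file to work with
with open("sensors.csv", "w") as f:
    f.write("sensor_id,temperature\n1,21.5\n2,22.1\n")

# Read the CSV rows as a list of dictionaries (one per record)
with open("sensors.csv") as f:
    records = list(csv.DictReader(f))

# Write the same records out in JSON, a common interchange format
with open("sensors.json", "w") as f:
    json.dump(records, f, indent=2)

print(records)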

INTERNET

The Internet is a good place to look for Big Data. There you can find images, videos,
and audio. Public web forums also create data. Social media such as YouTube,
Facebook, instant messaging, RSS, and Twitter all add to the data found on the
Internet. Most of this data is unstructured, which means it is not easy to categorize
into a database without some type of processing.

Web pages are created to provide data to humans, not machines. “Web scraping” tools
automatically extract data from HTML pages. This is similar to the web crawler or spider of a search engine, which explores the web to extract data and build the database used to respond to search queries. Web scraping software may use the Hypertext Transfer
Protocol or a web browser to access the World Wide Web. Typically, web scraping is
an automated process which uses a bot or web crawler. Specific data is gathered and
copied from the web to a database or spreadsheet. The data can then be easily
analyzed.
To implement web scraping, the process must first download the web page and then
extract the desired data from it. Web scrapers typically take something out of a page,
to make use of it for another purpose somewhere else. Perhaps the web scraper is
being used to find and copy names, phone numbers, and addresses. This is known as
contact scraping.

In addition to contact scraping, web scraping is used for other types of data mining
such as real estate listings, weather data, research, and price comparisons. Many large
web service providers such as Facebook provide standardized interfaces to collect the
data automatically using APIs. The most common approach is to use RESTful
application program interfaces (APIs). RESTful APIs use HTTP as the communication protocol and commonly encode the data as JSON. Internet websites like Google and Twitter gather large amounts of static and time series data. Knowledge of the APIs for these sites allows data analysts and engineers to access the large amounts of data that are constantly being generated on the Internet.
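
The following is a minimal sketch of collecting data from a RESTful API with Python's requests library. The endpoint and parameters are hypothetical; real services such as the Twitter API also require authentication keys.

# A minimal sketch of collecting data from a RESTful API.
# The URL and parameters are hypothetical placeholders.
import requests

try:
    response = requests.get(
        "https://api.example.com/v1/tweets",              # hypothetical endpoint
        params={"query": "big data", "count": 10}
    )
    response.raise_for_status()
    tweets = response.json()       # RESTful APIs commonly encode data as JSON
    print(len(tweets), "records retrieved")
except requests.exceptions.RequestException as err:
    print("Request failed:", err)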

SENSORS

The Internet of Things (IoT) uses sensors to create data. This data can come from
temperature and humidity sensors found in agriculture. Sensors are now in everything
from smart phones to cars, and jet engines to home appliances. These, along with
many other types of sensors (the list of things with sensors grows every year)
contribute to the exponential growth of Big Data.

We need new tools, new technologies, and a new way to approach how we store,
process and compute, so that raw data can become meaningful information.

DATABASES

Databases contain data that has been extracted, transformed and loaded (ETL). ETL is
the process of ‘cleaning’ raw data so that it can be placed into a database. Often data
is stored in multiple databases and must be merged into a single dataset for analysis.

Most databases contain data that is owned by an organization and is private. As mentioned in the previous chapter, there are many public databases that can be
searched by anyone. For example, the Internet has several public databases with
ancestral records available for free or low cost.

DATA TYPES AND FORMATS

After data has been accessed from different sources, it requires preparation for
analysis. In fact, experts in the field of Data Science estimate that data preparation can
take up 50 to 80 percent of the time required to complete an analysis.

Because the data that will comprise the data set to be analyzed can come from very
diverse sources, it is not necessarily compatible when combined. Another issue is that
data that may be presented as text will need to be converted to a numeric type if it is
to be used for statistical analysis. Data types are important when computer languages,
such as Python or R, are used to operate on data. Some different data types, and their
descriptions, are shown in Figure 1.
In addition to different data types, a single type of data can be formatted differently,
depending on its source. For example, different languages may use different symbols
to represent the same word. British English may use different spellings than American
English. An analysis of English text for mentions of modes of travel would need to
look for both airplane and aeroplane in order to be accurate.

Time and date formats present challenges. Although times and dates are very specific, they are represented in a wide variety of formats. Time and date are essential to the analysis of time series observations. Therefore, they must be converted to a standard format in order for an analysis to have any value. For example, dates may be formatted with the year first, followed by the month and the day, in some countries, while other countries may present dates with the month first, followed by the day and year. Similarly, time may be represented in 12-hour format with the AM and PM designation, or in 24-hour format. Different time and date formats are shown in Figure 2.
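
The following is a minimal Python sketch showing how two differently formatted date and time strings can be normalized to a single standard format with the datetime module; the example values are invented.

# A minimal sketch of normalizing two date/time formats with datetime.
from datetime import datetime

us_style = "07/04/2014 02:30 PM"     # month first, 12-hour clock
iso_style = "2014-07-04 14:30"       # year first, 24-hour clock

a = datetime.strptime(us_style, "%m/%d/%Y %I:%M %p")
b = datetime.strptime(iso_style, "%Y-%m-%d %H:%M")

# Convert both to one standard format before analysis
print(a.strftime("%Y-%m-%d %H:%M"))
print(b.strftime("%Y-%m-%d %H:%M"))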

DATA STRUCTURES

Data science is a rapidly evolving field. Like many new disciplines, the language used
is diverse and not widely standardized. This means the same thing may have several
names depending on the context and background of the speaker. This is true of data
structures as well.

When discussing data, we can think of a hierarchy of structures. For example, a data
warehouse or data lake is a place that stores many diverse databases in such a way
that the databases can be accessed using the same system. A database is a collection
of data tables that are related to one another in one or more ways. Data tables consist
of fields, rows, and values that are similar to the columns, rows, and cells in a
spreadsheet. Each data table can be considered as a file, and a database as a collection
of files. Figure 1 illustrates the relationship of these structures and associated
terminology. For this course, we will use fields, rows, and values as our standard
vocabulary for the structure of data tables.

Other data structures, or objects, are used by Python. For example, Python uses
strings, lists, dictionaries, tuples, and sets as its primary data structures. Each data
structure has its own group of functions, or methods, which can be used to work with
the object. Figure 2 shows the common Python data structures. In addition, a popular
Python data analysis library called ‘pandas’ uses other data structures such as series
and data frames.
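
The following short Python sketch shows each of these core data structures, plus a pandas series and data frame; the values are invented.

# A minimal sketch of the core Python data structures mentioned above.
reading_str = "21.5C"                          # string
readings_list = [21.5, 22.1, 20.8]             # list
sensor_dict = {"id": 7, "type": "humidity"}    # dictionary
location_tup = (37.77, -122.42)                # tuple
units_set = {"C", "F", "K"}                    # set

# The pandas library adds series and data frames on top of these structures
import pandas as pd
series = pd.Series(readings_list, name="temperature")
frame = pd.DataFrame({"temperature": readings_list})
print(frame.head())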

EXTRACT, TRANSFORM, AND LOAD DATA

As mentioned earlier in this topic, much of the data that is going to be placed in a
database so that it can then be queried comes from a variety of sources and in a wide
range of formats. Extract, Transform and Load (ETL) is a process for collecting data
from this variety of sources, transforming the data, and then loading the data into a
database. One company’s data could be found in Word documents, spreadsheets, plain
text, PowerPoints, emails and pdf files. This data might be stored in a variety of
servers which use different formats.

There are three steps to the ETL process:


Step 1. Extract – Data is culled from several sources.

Step 2. Transform – After the data has been culled, it must be transformed. Data
transformation may include aggregating, sorting, cleaning and joining data.

Step 3. Load – The transformed data is then loaded into the database for querying.

The above descriptions of the three steps of the ETL process are simplified. In fact,
there is quite a lot of work to do before data can be loaded into a database and then
queried.
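
The following is a minimal Python sketch of the three ETL steps using pandas and SQLite; the data, file name, and table name are hypothetical.

# A minimal ETL sketch; the data, database name, and table name are hypothetical.
import sqlite3
import pandas as pd

# Extract - gather raw data (created in place here; normally read from files or servers)
raw = pd.DataFrame({"name": [" Ana ", "Bo"], "sales": ["100", "250"]})

# Transform - clean text, convert types, and aggregate
raw["name"] = raw["name"].str.strip()
raw["sales"] = raw["sales"].astype(int)
total = raw["sales"].sum()

# Load - write the transformed data into a database table for querying
conn = sqlite3.connect("company.db")
raw.to_sql("sales", conn, if_exists="replace", index=False)
conn.close()
print("Loaded", len(raw), "rows; total sales:", total)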

EXTRACTING DATA

The extract step gathers the desired data from the source and makes it available to be
processed. Extraction converts the data into a single format that is ready to be
transformed. For example, combining data from a NoSQL server and an Oracle DB
will give you data in different formats. This data must be converted into a single
format. Also, the data must be checked to ensure it has the desired type of information
(value). This is done using validation rules. If data does not meet the validation
rule(s), it may be rejected. Sometimes, this rejected data is rectified and then
validated.

Ideally, during extraction, all of the required data from the source(s) is retrieved using
minimal compute resources, so as not to affect network or computer performance.

TRANSFORMING DATA

The transform step uses rules to transform the source data to the type of data needed
for the target database. This includes converting any measured data to the same
dimension (e.g. Imperial to Metric). The transformation step also requires several
other tasks. Some of these tasks are joining data from several sources, aggregating
data, sorting, determining new values that are calculated from aggregated data, and
then applying validation rules (Figure 1).

While it may seem as though this data is completely ready to load, there is usually
still work to be done to prepare it. Data (possibly including some rejected data) may
go through another part of the transform step known as ‘cleaning’ or ‘scrubbing’ data.
The cleaning part of the transform step further ensures the consistency of the source
data. Some examples of data cleaning are shown in Figure 2.

LOADING DATA

The load step is when the transformed data is loaded into the target database. This
may be a simple flat file or a relational database. The actual load process varies
widely. It depends on the type of source data, the type of target database, and the type
of querying that is to be done. Some organizations may overwrite existing data with
cumulative data. Loading new transformed data may be done on an hourly, daily,
weekly, or monthly basis. It may only happen when there has been a specific amount
of change to the transformed data.

During the load step, rules that have been defined in the database schema are applied.
Some of these rules check for uniqueness and consistency of the data, that mandatory fields contain the required values, and so on. These rules help to ensure that the
load and any subsequent querying of the data is successful.

CURRENT AND FUTURE REGULATIONS

A quick search on the Internet will most likely reveal that the ethical use of data
continues to cause concern for many people. However, the response from
governments on data protection regulations varies from country to country, as shown
in the figure. The European Union (EU) has enacted the strictest regulations, defining
personal data as “any data that can be attributed to an identifiable person either
directly or indirectly.” For more information, click here for the World Economic
Forum’s Global Information Technology Report 2014.

The General Data Protection Regulation (GDPR) was approved by the EU Parliament
on April 14, 2016. It goes into effect on May 25, 2018 at which time any
organizations in non-compliance will face heavy fines. The EU GDPR was designed
to make data privacy laws consistent across Europe, to protect data privacy of all EU
citizens, and to reshape the way organizations across the region approach data
privacy. Click here for more information about the GDPR.

On October 6, 2015, the European Court of Justice struck down the previous Safe Harbour Framework, prompting a set of data privacy requirements that are now included in the EU-US Privacy Shield, which takes its place. This new
arrangement requires that companies in the U.S. protect the personal data of
Europeans. It requires stronger monitoring and enforcement by the U.S. Department
of Commerce and Federal Trade Commission (FTC), including increased cooperation
with European Data Protection Authorities. Click here to learn more about the EU-US
Privacy Shield.

What about the people who work with Big Data on a daily basis, the data scientists?
What do they think about the ethical issues around its use? In August 2013,
Revolution Analytics surveyed 865 data scientists.

These are some results of that survey:

- 88% of those surveyed believed that consumers should worry about privacy issues.

- 80% of those surveyed agreed that there should be an ethical framework in place for
collecting and using data.

- More than half of those surveyed agreed that ethics already play a big part in their
research.

BIG DATA ETHICS SCENARIOS

Consider the following scenarios and how your own personal data might be used:

Scenario 1: You are unconscious after an accident and you are taken to the hospital
for treatment. A vast amount of data is generated over the next couple of hours as the
medical professionals work to save your life. Do you own this data even though it
could be used in the future to save other lives? Click here to read about actual medical
information data protection case studies. This site also contains several other types of
data protection case studies.

Scenario 2: A city installs surveillance cameras to reduce crime. Later, the city
performs Big Data analytics on citywide video data from the last year and finds that pedestrian traffic patterns show that sidewalk utilization in your neighborhood is lower than in other neighborhoods. The city then uses this data analysis to justify a
street widening, resulting in a significant increase in traffic noise in your home. Does
the overall reduction in crime in the city outweigh your rights as a homeowner?

Scenario 3: An online retailer uses Big Data predictive modeling to make suggestions
to you about future purchases based on your previous purchasing data. You save
dozens of hours over a period of a few years by spending less time researching
product pricing and availability. What if the retailer sells your purchasing habits to a
third party? Should you be responsible for reading and understanding a lengthy End-
User License Agreement (EULA) that is used as legal cover by a corporation reselling
your information?

These scenarios illustrate the complex ethical issues currently facing organizations,
governments, and individuals. Ethics will continue to be a major concern as the
amount of data we generate grows.

DATA SECURITY

Confidentiality, integrity and availability, known as the CIA triad (Figure 1), is a
guideline for data security for an organization. Confidentiality ensures the privacy of
data by restricting access through authentication and encryption. Integrity assures that the
information is accurate and trustworthy. Availability ensures that the information is
accessible to authorized people.

Confidentiality

Another term for confidentiality would be privacy. Company policies should restrict
access to the information to authorized personnel and ensure that only those
authorized individuals view this data. The data may be compartmentalized according
to the security or sensitivity level of the information. For example, a Java program
developer should not have access to the personal information of all employees.
Furthermore, employees should receive training to understand the best practices in
safeguarding sensitive information to protect themselves and the company from
attacks. Methods to ensure confidentiality include data encryption, username and password authentication, two-factor authentication, and minimizing exposure of sensitive
information.

Integrity

Integrity is the accuracy, consistency, and trustworthiness of the data during its entire life cycle. Data must remain unaltered during transit and must not be changed by unauthorized
entities. File permissions and user access control can prevent unauthorized access.
Version control can be used to prevent accidental changes by authorized users.
Backups must be available to restore any corrupted data, and checksum hashing can
be used to verify integrity of the data during transfer.
A checksum is used to verify the integrity of files, or strings of characters, after they
have been transferred from one device to another across your local network or the
Internet. Checksums are calculated with hash functions. Some of the common hash
functions are MD5, SHA-1, SHA-256, and SHA-512. A hash function uses a
mathematical algorithm to transform the data into a fixed-length value that represents
the data, as represented in Figure 2. The hashed value is simply there for comparison.
From the hashed value, the original data cannot be retrieved directly. For example, if
you forgot your password, your password cannot be recovered from the hashed value.
The password must be reset.

After a file is downloaded, you can verify its integrity by comparing the hash value provided by the source with the one you generated using any hash calculator. By comparing
the hash values, you can ensure that the file has not been tampered with or corrupted
during the transfer.
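
The following is a minimal Python sketch of this process using the standard hashlib module; the file name and contents are invented.

# A minimal sketch of computing a SHA-256 checksum to verify file integrity.
import hashlib

with open("download.bin", "wb") as f:      # create a sample file for the example
    f.write(b"example data")

sha256 = hashlib.sha256()
with open("download.bin", "rb") as f:
    sha256.update(f.read())

print("SHA-256:", sha256.hexdigest())
# Compare this value with the checksum published by the source;
# a mismatch means the file was corrupted or tampered with in transfer.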

Availability

Maintaining equipment, performing hardware repairs, keeping operating systems and software up to date, and creating backups ensure the availability of the network and
data to the authorized users. Plans should be in place to recover quickly from natural
or man-made disasters. Security equipment and software, such as firewalls, guard against downtime caused by attacks such as denial of service (DoS). Denial of
service occurs when an attack attempts to overwhelm resources so the services are not
available to the users.

Note: It is not always easy to differentiate between many of the terms you see when
reading about the IoT. Figure 3 explains the difference between data protection, data
privacy, data security, and data confidentiality.

DATA SECURITY IN THE CLOUD

Many people now trust cloud services to store their data. User account credentials,
financial data, personal photos, and contacts are all stored in the cloud. The
responsibility of data security now extends beyond the user to the cloud service
providers. It is important for the user to make sure that they are following security
procedures by using a strong password policy and proper authentication methods. The
cloud service provider must also implement cloud security controls. There are four
general categories of cloud security controls:

- Deterrent – This type of control is meant to discourage attacks and reduce their likelihood. Often, the attacker is warned that the attack will have dire consequences if it is carried out.

- Preventive – Up-to-date software and hardware are examples of preventive controls, designed to reduce vulnerabilities and make it less likely for an attack to occur.

- Detective – An active security control such as intrusion detection is designed to discover an attack and react to it by informing the next security control, the corrective control.

- Corrective – This control is designed to reduce the impact of a successful attack.

An attack on Yahoo in 2013 compromised more than 1 billion user accounts. This is
one of the largest security breaches ever discovered. The information that was stolen
included user data and account data including not only passwords, but the security
questions Yahoo uses to reset passwords for its users. What is troubling about this
attack is that it was not disclosed until late 2016! Another attack on Yahoo occurred
in 2014 where more than 500 million account credentials were stolen.

Anthem, the second largest health insurer in the United States and the largest for-
profit managed health care company in the Blue Cross and Blue Shield Association,
was hacked in 2015. Sensitive information including the member ID, social security
numbers and employment information of more than 80 million people was stolen.

Cisco Cloudlock

Cisco Cloudlock is a cloud cybersecurity platform. Most cloud security products focus on securing several Software-as-a-Service (SaaS) applications. Cisco Cloudlock’s platform extends security to protect apps on Infrastructure- and Platform-as-a-Service (IaaS and PaaS). Cloudlock dynamically manages security across an organization's cybersecurity architecture, including Identity-as-a-Service (IDaaS), Security Incident and Event Management (SIEM), Next-Generation Firewalls (NGFW), Secure Web Gateways (SWG), Threat Emulation, and more.

As a platform, Cisco Cloudlock takes a programmatic approach to cloud security. In addition to leveraging APIs from cloud services, Cisco Cloudlock is itself a collection of RESTful, API-based microservices. Using this approach, Cisco Cloudlock dynamically applies individual security services (such as Cloud DLP) to customer apps on any platform, whether those apps are running on a well-known service such as Amazon Web Services (AWS) or an on-premises server.

FORMATTING TIME AND DATE DATA

In this last section, you will prepare to complete three labs. These are the first in a
series of labs that will be expanded upon through Chapter 5 of this course. These labs
are known as the Internet Meter labs. In the first lab, you will use a function, called
Speedtest, that returns the upload and download speeds of your Internet connection.
After acquiring the measurements, you will save the collected data. You will also
import a larger, previously collected Internet speed data set. This is for learning data
manipulation so you can present the data concisely.

The second lab is not an Internet Meter lab. It is a lab where you will work with
Python and SQLite, to prepare you for the third lab.

The third lab is the next Internet Meter lab. In it, you will use a relational database,
SQLite, and perform some basic SQL queries using Python. You will also calculate
the average and plot the data and the calculated averages of Internet speed data. You
will merge tables that contain average speed information and geographical
information into a single database.

As previously mentioned, IoT data that has been combined from many sources may
be formatted in ways that are incompatible. For example, there are many ways that
time and date data may be presented. However, for the purpose of analytics, it is best
that times and dates be formatted consistently. One way to deal with this problem is to
simply use only data that already has a single time and date format. However, this would cause an
analyst to discard relevant, but incompatibly formatted data, which would create bias
and lead to flawed conclusions.

One of the reasons that Python is so popular with data analysts is that the core
functionalities of the language have been extended with many different libraries, or
modules. One Python module, which will be used in the upcoming lab, is dedicated to
handling time and date data. This module is called datetime. Click here to read
detailed documentation about the datetime module.

The datetime module is included in most Python distributions as a standard library; however, it must be imported to be used in your code. Features of the datetime
module are represented by the object-oriented programming paradigm. The module
consists of the date, time, and datetime classes. Each class has its own methods that
can be called to work with instances of the classes called objects. Figure 1 offers
definitions of some basic object-oriented concepts. A detailed introduction to object-
oriented programming is beyond the scope of this course. Click here for an
introductory lesson in Python classes and objects. Figure 2 illustrates the use of some of the basic objects and methods included with the datetime module. In the first Internet Meter lab in this chapter, you will change date and time data from one format
to another. This is done with the strftime (string from time) method that is available to
datetime objects. The strftime method uses a series of formatting codes, or directives,
as its parameters. The list of formatting codes is shown in Figure 3.

Figure 4 shows Python code that uses the datetime module to represent the date and
time in a format commonly used in the United States. This code can be recreated in a
new Python notebook for practice and exploration.
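
The exact code in Figure 4 is not reproduced here, but a minimal sketch of the same idea looks like this:

# Format the current date and time in a style commonly used in the United States.
from datetime import datetime

now = datetime.now()
print(now.strftime("%m/%d/%Y %I:%M %p"))   # e.g. 07/04/2014 02:30 PM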

READING AND WRITING FILES

In addition to the datetime module, the lab uses a Comma Separated Value module
called csv. The csv module is also a core module that is part of the Python standard
library. The csv module allows reading and writing to .csv files. Python also has basic
methods for creating, opening, and closing external files. Later in the course you will
learn how to extensively modify data by manipulating it within data tables. These data
tables will only exist in RAM until they are saved to files. You will learn a number of
ways to do this.

In this lab, you will use the Python open() and close() methods. The open() method is
used to create a new file or to open an existing file that will contain the data to be
saved. The close() method flushes any unwritten data from the buffers to the file and ends the file writing functionality for the specified file. It is important to explicitly close files when you are finished writing to them. This preserves system resources and protects the file from
corruption. Figure 1 shows the syntax for the open() function and explains some
important values that can be supplied to the method. It also illustrates the use of the
close() method.

Note: Opening a nonexistent file in "a" mode will create that file, just as "w" mode does. The difference is where the file pointer is placed: "w" points to the beginning of the file, while "a" points to the end.

Figure 2 explains some important values that can be supplied to the open() method.
These parameters can be combined or the “+” symbol can be added to specify that
both read and write or read and append modes are to be used.

Data can be written to the file using the write() file method. If the file was opened in
“a” mode, data will be added to the end of the file. Escape characters (sometimes
called escape sequences) may need to be added to the file for formatting. For example,
the \n or \r\n escape characters will add line breaks to the end of a line of data that has
been written. The read() file method reads the contents of an open file object. This is
shown in Figure 3. In the figure, in input cell one, a new file is created. The file is
then closed. In cell two, the file is reopened in append mode. Three lines of text are
written to the file. In cell three, the file is closed in order to ensure that the text has
been written to the file. The file is reopened in read mode, and the read() method is
used to view the file. The file is also shown as it appears in a text editor that
formats the text using the escape characters to create three separate lines.
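
The following is a minimal sketch of the sequence described for Figure 3; the file name and the lines of text are invented.

# A sketch of the create / append / read sequence; the file name is assumed.
f = open("internet_speeds.txt", "w")   # create (or overwrite) the file
f.close()

f = open("internet_speeds.txt", "a")   # reopen in append mode
f.write("download: 45.2\n")
f.write("upload: 11.7\n")
f.write("ping: 18\n")
f.close()                              # close to flush the buffer to disk

f = open("internet_speeds.txt", "r")   # reopen in read mode
print(f.read())                        # the escape characters create three lines
f.close()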

INTERACTING WITH EXTERNAL APPLICATIONS

Python allows interaction with external applications and the operating system. In the
first lab, you will install and run an external application and gather data from it for
analysis.

In Jupyter Notebooks, the "!" symbol allows direct interaction with the operating system. For example, Figure 1 shows two Linux commands that have been executed in a Jupyter notebook. Note that the commands begin with the "!" symbol.

Figure 2 illustrates the use of the subprocess module to communicate with an external
application and store the output of a command issued to that application in a Python
object. First, a string object is created to hold the command to be sent to the program. In
this case, we intend to send a command to the ping utility that is available from the
Linux shell. We then send that command, after splitting it into individual words, to
the program using a subprocess method. Finally, we store the output of the command in a variable and split it into a list of strings. Then we can view the contents of the object with print, and address individual elements of it using indexing.
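
The following is a minimal sketch of this subprocess flow, assuming a Linux-style ping that accepts the -c option; the host name is only an example.

# A sketch of the subprocess flow described above (Linux-style ping).
import subprocess

command = "ping -c 2 www.cisco.com"               # -c 2 sends two probes
output = subprocess.check_output(command.split())  # run the external program

lines = output.decode().split("\n")                # store the output as strings
print(lines[0])                                    # address one element by index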

SQL
SQL has been discussed previously in Chapter 1. In the second lab in this chapter we
work with SQLite to create and modify external databases.

There are many ways to work with external files in Python. SQLite is an SQL
implementation that works well with Python. Instead of using a client-server method
of operation, it uses connections established between Python and an SQL database by
creating an SQL connection object. This object will have methods associated with it.
After creating the connection object, a method is used to create a cursor object. The
cursor object has SQLite methods available for executing SQL operations on the
database. Many SQL operations can be executed in this way.

BASIC SQL OPERATIONS

SQL is a language for interacting with databases and data tables. There are a number
of dialects of SQL, however some core operations are standard, and should work
similarly whether in SQLite, MySQL, or other SQL implementations. SQLite can run
in interactive mode, from a command line. Alternatively, a computer language, like
Python, can interact with SQLite through imported modules. The focus of this course
will mostly be on using Python to interact with SQLite.

In general, SQL can be said to be a language composed of three special-purpose languages. The first is the data definition language. It is used to create and manipulate
the structure of SQL databases and tables. Figure 1 shows some common commands
from the SQL data definition language. The second is the data manipulation language.
It is used to add, remove, or transform data that is present in data tables. Finally, there
is the data query language. It is used to access data in data tables in order to generate
information. Figure 2 shows common data manipulation and data query language
commands.

WORKING WITH PYTHON AND SQLite

The figure shows a sequence of commands that illustrate the basics of the SQLite
operations that are done in the lab. First, an external tool called the csvkit needs to be
installed in the operating system, so that a csv file can be imported into a SQLite
database. The figure illustrates the steps in the process of creating a SQLite database,
importing csv data into the database, executing a query on the data table, and viewing
the results of the query.

Note: In the figure, the "!csvsql --db ..." can be executed as the first command. This is
an external tool that needs to be installed in the OS. A command prompt can be used
(Linux CLI) to execute this, but to simplify things, this external command can be
executed directly from a notebook by prefixing the command with "!".
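
The following is a minimal sketch of that notebook workflow; the file, database, and table names are assumptions, and the table name follows csvsql's convention of naming the table after the CSV file.

# Notebook-style sketch: install csvkit, import a csv into SQLite, then query it.
# File and database names below are hypothetical.
!pip install csvkit
!csvsql --db sqlite:///speeds.db --insert speed_data.csv

import sqlite3
conn = sqlite3.connect("speeds.db")
cur = conn.cursor()
cur.execute("SELECT * FROM speed_data LIMIT 5")   # table named after the csv file
print(cur.fetchall())
conn.close()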

CHAPTER 2: FUNDAMENTALS OF DATA ANALYSIS

This chapter began by asking the question “What are analytics?” According to the
Cisco VNI, IP traffic will triple over the next 3 years. The emergence of these large
data sets requires more advanced methods, technologies, and infrastructure to process
the data and convert it to actionable information. Data can no longer be stored on a
few machines nor be processed with one tool. There are many methodologies for
conducting data analysis, including the popular Cross Industry Standard Process for
Data Mining (CRISP-DM) used by more than 40% of data analysts. The Python
programming language has become a commonly used tool for handling and
manipulating data.

The next section detailed the issues that surround using Big Data. Decision makers
will increasingly rely on data analytics to extract the required information at the right
time, in the right place, to make the right decision. Wise decisions rely on a well-
established base of knowledge. Business intelligence encompasses the entire process
from data to information to knowledge to wisdom.

Descriptive analytics relies solely on historical data to provide regular reports on events that have already happened. The scope of descriptive analytics is to summarize
your data into more compact and useful information. Predictive analytics attempts to
predict what may happen next with a certain degree of confidence, based on data and
statistics. Prescriptive analytics predicts outcomes and suggests courses of action that
will hold the greatest benefit for the enterprise or organization.

Some data analysis must also be carried out in real-time while the data is being
collected. As the datasets grow in volume, velocity and variety, the complexity of data
storage, processing, and aggregation becomes a challenge for traditional analytic
tools. Large data sets may be distributed and processed across multiple,
geographically-dispersed physical devices as well as in the cloud. Data analytics
allows businesses to better understand the impact of their products and services,
adjust their methods and goals, and provide their customers with better products
faster. The section closes with an explanation of the Data Analysis Lifecycle.

The next section of this chapter covered data acquisition and preparation. Files, the
Internet, sensors, and databases are all good sources of data. Because the data that will
comprise the data set to be analyzed can come from very diverse sources, it is not
necessarily compatible when combined. In addition to different data types, a single
type of data can be formatted differently, depending on its source. Data tables consist
of fields, rows, and values that are similar to the columns, rows, and cells in a
spreadsheet. Each data table can be considered as a file, and a database as a collection
of files. As mentioned earlier in this topic, much of the data that is going to be placed
in a database so that it can then be queried comes from a variety of sources and in a
wide range of formats. Extract, Transform and Load (ETL) is a process for collecting
data from this variety of sources, transforming the data, and then loading the data into
a database.

The next section discussed Big Data ethics. A quick search on the Internet will most
likely reveal that the ethical use of data continues to cause concern for many people.
Several governments have regulations for the appropriate use of personal data.
Confidentiality, integrity and availability, known as the CIA triad, is a guideline for
data security for an organization. Confidentiality ensures the privacy of data by
restricting access through authentication and encryption. Integrity assures that the
information is accurate and trustworthy. Availability ensures that the information is
accessible to authorized people. The responsibility of data security now extends
beyond the user to the cloud service providers. It is important for the user to make
sure that they are following security procedures by using a strong password policy and
proper authentication methods. The cloud service provider must also implement cloud
security controls.

The final section of this chapter covered preparation for the Internet Meter labs and
working with Python and SQL.

CHAPTER 3: DATA ANALYSIS

EXPLORATORY DATA ANALYSIS

We add sensors and capture data from our networks, systems, and lives so that we can
make data-driven decisions that ultimately impact performance, the situation or the
environment. The data from the sensors and things are a critical element in providing
the opportunities for change. As shown in the data analysis lifecycle in the figure, data
is changed from its raw format into information after it has been gathered, prepared,
analyzed, and presented in a usable format. A first step in creating the needed
information is to perform an exploratory data analysis.

Exploratory data analysis is a set of procedures designed to produce descriptive and graphical summaries of data with the notion that the results may reveal interesting patterns. It is a process of discovery that sometimes enables us to create a hypothesis about the data. It also allows for the discovery of new questions to be answered. Sometimes
the purpose of an analysis is to answer specific questions. Other times, someone may
have a “hunch” or intuition about some phenomenon in relation to a set of data. An
analyst may be called upon to investigate the cause or effect of that phenomenon.
Exploratory data analysis provides a useful way to examine the data to determine if
any relationships exist between the observed or collected data or if there are
problems in the data.

For example, an analyst for a chain of fast food restaurants is asked to examine
negative Twitter comments about the restaurants. These comments have been flagged
as negative by a real time semantic analysis process. The analyst performs some
descriptive analyses on the tweets to see what is happening in the data. The analyst
decides to investigate the time of day when the negative tweets are occurring. By
plotting the tweets versus the time of day, the analysis revealed that the number of negative tweets that came in during breakfast time was disproportionately higher than the number generated during the rest of the day. This basic exploratory analysis reveals that
something regarding the breakfast offerings could be a problem but it does not allow
the analyst to make conclusions as to why this is occurring. Further analysis is needed
to understand the specific cause for this result. It could be a specific item on the
breakfast menu that is mentioned in the tweets, or other variables such as customer
satisfaction with food quality, service, or cleanliness.
ANALYZING IoT DATA

Gathering data is one of the first steps in performing an exploratory data analysis. No
matter the type of analysis to be done, IoT data provides special challenges. First, IoT
data may come in large volumes and in varying formats. Some data may be structured
so that the nature and meaning of the data can be quickly processed and understood.
Other data may be unstructured and require considerable processing to be made
meaningful. Because considerable value can be derived from combining structured
and unstructured data for analysis, IoT data may require more advanced analytic tools.
New technologies are constantly being invented for the acquisition, storage, and
computational analysis of Big Data.

In addition to the volume, another important aspect of IoT data is time as a variable. IoT
data is frequently transmitted in real time or near real time. The data generated from
observations of how a variable changes over time is called time series. Examples of
time series data are the air temperature measured in a weather station every minute or
the electric power consumption of a home reported by the smart meter to the power
grid every 15 minutes. Time series data is different from cross-sectional data, where many different subjects or variables are observed at one specific point in time.
Typically, data will be formatted in a table, as shown in the figure. When the data
points have timestamps, the order of the data in the tables does not matter. This is
because the data points can be sorted by their timestamps.
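
The following is a minimal pandas sketch showing that timestamped data points can always be re-ordered; the readings are invented.

# A minimal sketch showing that timestamped rows can be re-ordered.
import pandas as pd

data = pd.DataFrame({
    "timestamp": ["2017-06-01 10:15", "2017-06-01 10:00", "2017-06-01 10:30"],
    "temp_c": [21.9, 21.5, 22.3]
})
data["timestamp"] = pd.to_datetime(data["timestamp"])

# Because each data point carries a timestamp, the order of arrival does not
# matter; the series can always be sorted before analysis.
print(data.sort_values("timestamp"))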

OBSERVATIONS, VARIABLES, AND VALUES

When performing any kind of experiment or analysis, it is critical to define the key
characteristics that need to be measured or observed to answer the questions posed or
to create the hypothesis needed. These characteristics to be studied are called
variables. A variable is anything that varies from one instance to another. Not only is
a variable something that can be measured, but its value can also be manipulated or
controlled.

During an experiment or analysis, different variables and their associated values may
be observed. The recording of the values, patterns, and occurrences for a set of variables is an observation. The set of values for that specific observation is called a data point. Each observation can be considered as a record in a database or a row in an Excel spreadsheet. The collection of observations makes up the data
set for your analysis.

Because observations usually have a purpose, only some characteristics are relevant to
that purpose. For example, if you have lost your pet and have asked other people to
help you search for it, only a small set of characteristics are relevant to the
observations. These characteristics might be:

- What type of animal is your pet? It is a dog.

- What type of dog? It is a Schnauzer.

- What color is your Schnauzer? It is gray.


- What size is the Schnauzer? It is a medium sized Schnauzer.

- How much does the Schnauzer weigh? It weighs 15 kg.

As shown in the figure, the variables are the characteristics, such as breed, color, size
and weight. All of these characteristics are variables, because each can have multiple
values. As people search for your dog, data points are added for each observation.
Because the purpose of your observations is to find your dog, observations that do not meet the required criteria are discarded.

TYPES OF VARIABLES

When looking for meaningful patterns in data, we frequently look for correlational
relationships between variables. All variables can be classified by the characteristic
that is being studied. The variables will either be categorical or numerical.

Categorical variables indicate membership in a particular group and have a discrete or specific qualitative value. They are further classified into two types:

- Nominal – These are variables that consist of two or more categories whose value is
assigned based on the identity of the object. Examples are gender, eye color or type of
animal.

- Ordinal – These are variables that consist of two or more categories in which order
matters in the value. Examples are student class rank or satisfaction survey scales
(dissatisfied, neutral, satisfied).

Numerical variables are quantitative values:

- Continuous – These are variables that are quantitative and can be measured along a continuum or range of values. There are two types of continuous variables: interval and ratio. Interval variables can have any value within the range of values. Examples are temperature or time. Ratio variables are a special case of interval variables in which a value of zero (0) means that there is none of that variable. Examples include income or sales volume.

- Discrete – These variables are quantitative but take a specific value from a finite set of values. Examples include the number of sensors activated in a network, or the number of cars in a lot.

Why is it important to know what type of variables are in your data set? Some
statistical methods and data visualizations are designed to work better with certain
types of data than others. How the results of the analysis are best displayed will
depend on the type of variables used in the data. Some variables lend themselves
better to bar graphs while others may allow for more examination and discovery using
a scatter plot. Examples of some of the suggested types of graphs that represent the
different types of variables can be seen in the figure.

WHAT IS STATISTICS?

Now that the purpose for the analysis is defined and the variables and observations are
gathered and recorded, it is time to perform some statistical analysis. Statistics is the
collection and analysis of data using mathematical techniques. It also includes the
interpretation of data and the presentation of findings. Another use of statistics is to
discover patterns or relationships between variables and to evaluate these patterns to
see how often they occur. Statistical findings are frequently judged by their
relationship with chance effects. In other words, what is the chance of something
happening repeatedly under the same conditions? For example, a hypothesis might be
that variable x is related to a change in variable y. An analysis reveals that a
relationship does exist. However, variable y also changes when variable x does not
change. A question to be answered is, “How much of the change in variable y is in
response to changes in variable x and how much is due to other factors?” Statistics
seeks to answer this question in order to estimate effects in relation to chance or
events that are not included in an analysis. If the results of the analyses show high
probabilities of recurrences, the findings of a study on one representative group can be
generalized to a much larger group.

The terms statistics and analytics are often interchanged, but are somewhat different.
In general, analytics embraces a larger domain of tools than statistics. Analytics uses
the mathematical modeling tools in statistics in addition to other forms of analysis,
such as machine learning. It can also involve working with very large data sets that
include unstructured data.

POPULATIONS AND SAMPLES

Statistics focus on aspects of reality that are studied for a specific purpose. Those
aspects of reality could be aspects of people, or the content of tweets or Facebook
posts. Statistics have been used extensively in the social and life sciences. Some terms
commonly used in statistics derive from this usage.

One such term is population. A population is a group of similar entities, such as people, objects, or events, that share some common set of characteristics which can be
used for statistical or investigative purposes. It may be strange to think of tweets or
Facebook posts as members of populations, but this is how they are thought of for
statistical analyses. The definition or structure of a given population varies. A
population could be “all living people” or “all tweets since August 1, 2015”. It is a
large group of things that we are interested in knowing more about.

It may not always be practical to study all living people or even “all tweets since
August 1, 2015”. The practicalities of obtaining the required data from the population
make data gathering nearly impossible. Instead, a representative group from the
population can be used for analysis. This group is called a sample. Samples are often
chosen to represent the larger population in some way. If this is the case, special care
needs to be taken in selecting the sample in order to ensure that all the necessary
characteristics of the population are represented. A number of techniques are used for
deriving samples from populations.

DESCRIPTIVE STATISTICS

After the problem statement (or the questions to be asked) is determined, and a
population has been defined, some form of analysis or statistics is needed. There are
two key branches of statistics that we will discuss in this course:

- Descriptive Statistics

- Inferential Statistics
Descriptive statistics are used to describe or summarize the values and observations of
a data set. For example, a fitness tracker logged a person’s daily steps and heart rate
for a 10-day period. If the person met their fitness goals 6 out of the 10 days, then
they were successful 60% of the time. Over that 10-day period, the person’s heart rate
may have been a maximum of 140 beats per minute (bpm) but an average of 72 bpm.
Information about counts, averages, and maximums are some of the ways to describe
and simplify the data set that was observed.
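The following sketch shows how these summary values could be computed with pandas, which is introduced later in this chapter. The ten days of heart rate and goal data are hypothetical values chosen to mirror the example.

import pandas as pd

# Hypothetical 10-day fitness tracker log.
tracker = pd.DataFrame({
    "avg_heart_rate": [70, 68, 75, 80, 72, 69, 71, 74, 73, 70],
    "max_heart_rate": [120, 115, 135, 140, 125, 118, 122, 130, 128, 121],
    "goal_met": [True, False, True, True, False, True, False, True, True, False]
})

print("Days logged:       ", len(tracker))                      # 10
print("Goal success rate: ", tracker["goal_met"].mean() * 100)  # 60.0 percent
print("Maximum heart rate:", tracker["max_heart_rate"].max())   # 140 bpm
print("Average heart rate:", tracker["avg_heart_rate"].mean())  # about 72 bpm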

Basic descriptive statistics may include the number of data points in a data set, the
range of values that exist for numeric data points, or the number of times various
values appear in a data set, among others. Additionally, descriptive statistics include
values that summarize the data set in various ways. These values may answer questions such as:

- How widely dispersed is the data?

- Are there values that occur more often than others?

- What is the smallest or largest value?

- Are there particular trends occurring?

The answers to these questions can be provided in numerical and graphical formats.
Results of descriptive statistics are often represented in pie charts, bar charts or
histograms. This helps to visualize the data better.

One important point to note is that while descriptive statistics describe the current or historical state of the observed population, they do not allow for comparison of groups, conclusions to be drawn, or predictions to be made about other data sets that are not in
the population. In the fitness tracker example, we cannot infer that the person has poor
health because they were only successful in meeting their goal 60% of the time. We
also cannot use the data set for this one person to predict the fitness performance for
others with similar characteristics. This is where inferential statistics becomes
important.

INFERENTIAL STATISTICS

Descriptive statistics allow you to summarize findings based on data that you already
have or have observed about a population. But there are situations in which gathering
data for a very large population may not always be practical or even possible. For
example, it may not be possible to study every person in the world in order to
discover the effects of a new drug that is under development. However, it is possible
to study a smaller, representative sample of a population and use inferential statistics
to test hypotheses and draw conclusions about the larger population.

Inferential statistics is the process of collecting, analyzing, and interpreting data gathered from a sample to make generalizations or predictions about a population.
Because a representative sample is used instead of actual data from the entire population, there is a concern that the particular groups chosen for the study, or the environment in which the study is carried out, may not accurately reflect the characteristics of the larger group. This concern must be addressed. When using inferential statistics, questions of
how close the inferred data is to the actual data and how confident we can be in the
findings must be answered. Typically, these types of analyses will include different
sampling techniques to reduce error and increase confidence in the generalizations
about the findings. The type of sampling technique used will depend on the type of
data.
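As a hedged sketch of this idea, the code below uses the SciPy library (an assumption, since it is not part of this course) to test a hypothesis about a population mean from a small sample. The sample values are simulated.

import numpy as np
from scipy import stats

# Simulate a hypothetical sample of 30 measurements drawn from a population.
rng = np.random.default_rng(1)
sample = rng.normal(loc=9.2, scale=2.0, size=30)

# Test the hypothesis that the population mean is 10.
result = stats.ttest_1samp(sample, popmean=10)
print(result.statistic, result.pvalue)

# A small p-value (commonly below 0.05) suggests that the difference between
# the sample mean and 10 is unlikely to be due to chance alone.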

STATISTICS AND BIG DATA

Different statistical approaches are used in Big Data analytics. As we know, descriptive statistics describe a sample. This is useful for understanding the sample
data and for determining the quality of the data. When dealing with large amounts of
data that come from multiple sources, many problems can occur. Sometimes data
points can be corrupted, incomplete, or missing entirely. Descriptive statistics can
help determine how much of the data in the sample is good for the analysis and
identify criteria for removing data that is inappropriate or problematic. Graphs of
descriptive statistics are a helpful way to make quick judgements about a sample.

For example, a sample of tweets may be selected for analysis. Some tweets in the
sample contain only characters, while other tweets contain characters and images.
Deciding whether to analyze tweets that contain images or tweets without images establishes a very simple criterion for identifying invalid tweets.
Data points that do not meet this basic criterion would be removed from the sample
before the analysis continues.
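A minimal sketch of applying such a criterion with pandas is shown below; the tweet contents and the has_image column are hypothetical.

import pandas as pd

# Hypothetical sample of tweets; has_image flags tweets that contain an image.
tweets = pd.DataFrame({
    "text": ["sunrise today", "new sensor installed", "traffic report", "lunch!"],
    "has_image": [True, False, False, True]
})

# Keep only text-only tweets; rows that do not meet the criterion are removed.
valid_tweets = tweets[~tweets["has_image"]]
print(len(valid_tweets), "tweets remain for analysis")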

A number of types of inferential and machine learning analysis are very commonly
used in Big Data analytics:

- Cluster – Used to find groups of observations that are similar to each other

- Association – Used to find co-occurrences of values for different variables

- Regression – Used to quantify the relationship, if any, between the variations of one or more variables

In machine learning, computer software is either provided with a set of rules or derives its own rules to use when performing an analysis. Machine learning techniques can require a lot of processing power and have only become viable with the availability of parallel processing.
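The sketch below illustrates two of these analysis types, clustering and regression, using the scikit-learn library. This library is an assumption made for the example and is not required by the course; the data is randomly generated.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Cluster analysis: group 150 two-dimensional observations into three clusters.
points = rng.normal(size=(150, 2))
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)

# Regression analysis: quantify the relationship between x and y.
x = rng.uniform(0, 10, size=100).reshape(-1, 1)
y = 3.0 * x.ravel() + rng.normal(scale=1.5, size=100)
model = LinearRegression().fit(x, y)

print(cluster_labels[:10])
print(model.coef_, model.intercept_)   # the slope should be close to 3.0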

The figure illustrates an architecture for analysis of IoT data. We will learn more
about systems that support Big Data analytics later in the course.

DISTRIBUTIONS

There are multiple ways to summarize the data using descriptive statistics. You can
look for the actual distribution of the data, measures of central tendency or measures
of ranges. At a basic level, distribution is a simple association between a value and the
number or percentage of times it appears in a data sample. Distributions are useful for
understanding the characteristics of a data sample. The figure shows a table consisting
of two fields. One field contains a variable, and the other consists of a statistic that
describes the value of that variable. In this example, ten students have taken a ten-
point quiz. The score for each student is shown in the Raw Score by Student table.
When the teacher analyzes the scores, a distribution of scores is created as shown in
the second table. This expresses the number of times that a score occurred in the class.
The probability of the score occurring is expressed as a ratio of the frequency of the
score to the total number of scores.

Frequency distributions consist of all the unique values for a variable and the number of times each value occurs in the data set. In probability distributions, the proportion of times each value occurs in the data is used instead of the frequency.
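The sketch below builds frequency and probability distributions with pandas for a quiz like the one in the figure; the individual scores are hypothetical.

import pandas as pd

# Hypothetical scores for ten students on a ten-point quiz.
scores = pd.Series([7, 8, 8, 9, 6, 8, 10, 7, 9, 8])

# Frequency distribution: the number of times each score occurs.
print(scores.value_counts().sort_index())

# Probability distribution: the proportion of times each score occurs.
print(scores.value_counts(normalize=True).sort_index())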

A histogram provides an immediate visual representation of the distribution of a data set. In the case of a discrete variable, each bin of the histogram is assigned to a specific value. In the case of a continuous variable, each bin is associated with a range of values. In both cases, the height of the bin represents the number of times the variable takes the given value or falls into the given range, respectively.

The histogram representation of the data distribution can take any shape. In the case
of a continuous variable, the shape will also depend on the width of the bins, i.e.
their range. Some shapes can be modelled using well-defined functions, which are
called probability distribution functions.

Probability distribution functions allow for representing the shape of the whole
dataset distribution using only a small set of parameters, such as the mean and the
variance, which will be explained later in the chapter. A probability distribution function that is particularly well suited to representing many events occurring in nature is the Gaussian, or normal, distribution, which is symmetrical and bell-shaped.
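As a sketch of these ideas, the code below draws values from a normal distribution with NumPy and plots a histogram with Matplotlib. Both libraries are assumptions for this example, and the mean and standard deviation values are arbitrary.

import numpy as np
import matplotlib.pyplot as plt

# Draw 10,000 values from a normal (Gaussian) distribution
# with mean 100 and standard deviation 15.
rng = np.random.default_rng(0)
values = rng.normal(loc=100, scale=15, size=10_000)

# The histogram is symmetrical and bell-shaped; its exact appearance
# depends on the number of bins, that is, the width of each value range.
plt.hist(values, bins=40)
plt.xlabel("value")
plt.ylabel("frequency")
plt.show()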

Other distributions are not symmetrical. The peak of the graph could be either to the
left or to the right of center. This property of a distribution is called skew. Some
distributions will have two peaks and are known as bimodal. The right and left ends of
the distribution graph are known as the tails.

CENTRALITY

One very commonly used characteristic of distributions is the measure of central tendency. These measures express the values of a variable that are closest to the central position in a distribution of data. The common measures of centrality are the mean, median, and mode. The mode of a data sample is the value that occurs most often. These measures are illustrated in Figure 1. Essentially, values that are closer to the center of the distribution occur with greater frequency.

The mean, also known as the average, is the most well-known measure of central
tendency. It takes into account all of the values in a data set and is equal to the sum of
all the data values divided by the number of values in the data set. Although the mean
is very commonly used in everyday life, it is typically not the best measure of the
most representative value for a distribution. For example, if there are unusually high
or low values in the distribution, the mean can be highly influenced by those extreme
values, also called outliers. Depending on the number of outliers in the data set, the
mean, or average, is “skewed” or changed in one direction or another.

The median is the middle value in the data set after the list of values has been ordered. As shown in Figure 2, the median is not sensitive to extreme values. Even when outliers are present, the midpoint of the ordered list, the median, remains the same. This is different for the mean, which is “skewed” or pulled in one direction or another depending on the number and size of the outliers in the data set.

In addition to outliers, the type of variable used in the data set will also impact which
measure of central tendency is best used to represent the data. As shown in Figure 3,
the mean or average is best used when the data is interval data that is not skewed.
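The sketch below computes the three measures of centrality with pandas for a small hypothetical data set that contains one outlier.

import pandas as pd

# Hypothetical values in thousands; the final value (250) is an outlier.
incomes = pd.Series([32, 35, 36, 38, 40, 40, 43, 45, 47, 250])

print("Mean:  ", incomes.mean())            # pulled upward by the outlier
print("Median:", incomes.median())          # middle of the ordered list
print("Mode:  ", incomes.mode().tolist())   # most frequent value(s)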

DISPERSION

While the mean, or average, is commonly used to describe many distributions, it leaves
out an important part of the picture, which is the variability in the distribution. For
example, we know that outlier values can distort the mean. The median gets us closer
to what is central in the distribution, however we still do not know how spread out the
values in the sample are.

The most basic way of describing variability in a sample is by calculating the difference between the highest and lowest values for a variable. This statistic is
known as the range. It is always useful to have an idea of what the highest and lowest
values are for a variable as a basic way to know if the data makes sense.

The variance (σ²) of a distribution is a measure of how far each value in a data set is from the mean. Related to the variance is the standard deviation (σ). The standard deviation is used to standardize distributions as part of the normal curve, as shown in Figure 1. Figure 2 shows how standard deviation values relate to centrality. The
more data points that are centered around the mean, the lower the standard deviation.
The standard deviation values are higher as the distribution becomes more spread out.

Comparing standard deviations between two samples on the same measure can help to
tell the story of what is occurring. For example, if the mean test score in one school is higher than that for the same test in another school, it would be natural to assume that all the students in the first school are higher achievers than those in the second school. However, standard deviations can add an extra layer of interpretation to the story. If the standard deviation for the first school is higher, it indicates that the distribution is more spread out and that more students are scoring at the
extremes of the distribution. It is possible that a small group of very high scorers has
influenced the mean. Further investigation shows that a special program for gifted
students at the school has elevated the mean by pushing it away from the median.
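The sketch below computes the range, variance, and standard deviation with pandas for two hypothetical schools whose mean scores are the same but whose spreads differ.

import pandas as pd

# Hypothetical test scores; both schools have the same mean (77.5).
school_a = pd.Series([72, 74, 75, 76, 77, 78, 79, 80, 81, 83])
school_b = pd.Series([55, 60, 68, 73, 78, 80, 84, 90, 92, 95])

for name, scores in [("School A", school_a), ("School B", school_b)]:
    print(name,
          "range:", scores.max() - scores.min(),
          "variance:", round(scores.var(), 1),
          "std dev:", round(scores.std(), 1))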

USING PANDAS

Pandas is an open source library for Python that adds high-performance data
structures and tools for analysis of large data sets. Pandas is easy to use and is very
popular for adding extra capabilities to Python for data analysis. A link to the pandas
project is shown in Figure 1.

Pandas data structures include the series and dataframe structures. Dataframes are the
primary pandas structure and are the most commonly used. We will use pandas
dataframes often within this course. A dataframe is like a spreadsheet with rows and
columns. In addition, dataframes can have optional index and column labels that identify the rows and columns.

Dataframes are easily built from a range of other data structures and external files,
such as csv. A wide range of methods are available to dataframe objects. Rows and
columns can be manipulated in various ways and operators are available to perform
mathematical, string, and logical transformations to dataframe contents. Figure 2
shows the components of a dataframe.

Pandas is imported into a Python program using import, like other modules. It is
conventional to use import pandas as pd to make reference to pandas components
easier to type. Figure 3 shows the code required to create the dataframe that is shown
in Figure 2.
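Figure 3 is not reproduced here, but a small telephone directory dataframe like the one it describes might be created as follows. The names and numbers are invented for this sketch.

import pandas as pd

# Each dictionary key becomes a column label; the default index is 0, 1, 2, ...
data = {
    "first_name": ["Ana", "Ben", "Chen"],
    "last_name": ["Lopez", "Okafor", "Wei"],
    "phone": ["555-0101", "555-0102", "555-0103"]
}

directory_df = pd.DataFrame(data)
print(directory_df)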

IMPORTING DATA FROM FILES

Large data sets are compiled from various sources and may exist as different kinds of
files. Creating a pandas dataframe by coding the data values individually is not very
useful for analyzing Big Data.

Pandas includes some very easy to use functions for importing data from external
files, such as csv, into dataframes. We will recreate the telephone directory dataframe,
this time from a larger csv file. Pandas includes a dataframe function called
read_csv() for this purpose.

The figure illustrates the process of importing data from an external csv file into pandas. The procedure is as follows (a code sketch of these steps appears after the list):

- Step 1. Import the pandas module.

- Step 2. Verify that the file is available from the current working directory. In this
case the head Linux command is used to verify the file and preview its contents.

- Step 3. To import the file into a dataframe object use the pandas read_csv() method.
In this case the dataframe object is called directory_df.

- Step 4. Use the pandas info() dataframe method to view a summary of the file
contents.

- Step 5. Display the dataframe. In this case the head() method was used to display the
headings, index, and values for the first five rows.
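A hedged sketch of the Python portion of these steps is shown below; the file name directory.csv is hypothetical, and the head Linux command in Step 2 is run in the shell rather than in Python.

import pandas as pd                          # Step 1: import the pandas module

# Step 2 is done in the shell, for example:  head directory.csv

directory_df = pd.read_csv("directory.csv")  # Step 3: import the csv into a dataframe

directory_df.info()                          # Step 4: summary of columns and data types
print(directory_df.head())                   # Step 5: headings, index, and first five rows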

IMPORTING DATA FROM THE WEB

It is very easy to import data from the web with pandas. While there are many
application program interfaces (APIs) available for accessing web data, including
streaming data, static data sets can also be accessed from the Internet based on the
URL of the file. In the example shown in the figure, a data set is imported into a dataframe from the extensive collection at the Humanitarian Data Exchange. This website is an excellent resource for people interested in exploring data related to international humanitarian concerns. In this case, we import a data set containing information regarding the percentage of women serving in national parliaments for a series of nations over a period of years. Information about this data set, and a download link, can be found on the Humanitarian Data Exchange website.

The process is simple, and a code sketch follows the steps:

- Step 1. Import pandas.

- Step 2. Create a string object to contain the URL of the file.

- Step 3. Import the file into a dataframe object using the pandas read_table() method.
read_table() is essentially the same as the read_csv() method, but it allows use of
different delimiters. In this case, we specify the comma as the separator to illustrate
how a separator is specified for this method.

- Step 4. Verify import with head() and info(). Note that the info() output indicates a
number of missing values (null entries), which is the difference between the total
number of entries and the number of non-null entries for each year.
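The sketch below follows these steps with a placeholder URL, since the actual link is not reproduced in this text; substitute the address of the csv file from the Humanitarian Data Exchange.

import pandas as pd

# Placeholder URL for the data set (hypothetical).
url = "https://example.org/women_in_parliament.csv"

# read_table() with a comma separator behaves like read_csv().
parliament_df = pd.read_table(url, sep=",")

parliament_df.info()            # non-null counts reveal missing values per column
print(parliament_df.head())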

There are many sources of data on the Internet. For example, sites like Google and
Twitter have APIs that allow for the connection of Python programs to live streaming
data. Numerous other databases exist online which can be directly addressed and
inserted into pandas dataframes using a range of pandas methods and the associated
parameters.

DESCRIPTIVE STATISTICS IN PANDAS

Pandas provides a very simple way of viewing basic descriptive statistics for a
dataframe. The describe() method for dataframe objects displays the following for
numeric data types:

- count – This is the number of values included in the statistics.

- mean - This is the average of values.

- std - This is the standard deviation of the distribution.

- min - This is the lowest value in the distribution.

- 25% - This is the value of the first quartile. 25% of the values are at or below this
value.

- 50% - This is the value for the second quartile. 50% of the values are at or below
this value. This is also the median value.

- 75% - This is the value for the third quartile. 75% of the values are at or below
this value.

- max - This is the highest value in the distribution.

In the example shown in the figure, the same data set has been used as in the previous
page. However, this time only the first, second, and seventh columns have been
imported into the dataframe. This shows the country name and value for the years
2015 and 2010. The describe() method is called on the resulting dataframe, and
descriptive statistics are shown for the two years. This allows a quick comparison of the data over a five-year period.
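A sketch of this comparison is shown below. The URL is a placeholder and the column positions passed to usecols are assumptions about the file's layout; usecols limits the import to the country name and the 2015 and 2010 values.

import pandas as pd

url = "https://example.org/women_in_parliament.csv"   # placeholder URL

# Import only the first, second, and seventh columns (positions 0, 1, and 6).
df = pd.read_csv(url, usecols=[0, 1, 6])

# count, mean, std, min, 25%, 50%, 75%, and max for each numeric column.
print(df.describe())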
