Module4-Data Analytics-Ppt-Dlb-Chapter5

Methods, Models, and Decisions,” James R. Evans, Pearson Publications, Second edition

Data Acquiring, Organizing, Processing & Analyzing for IoT

For 4-1 B.Tech (CSE) - 2024 Batch
By Prof. D. Lalitha Bhaskari
Dept. of CS&SE, AUCE(A)
Andhra University
WHAT IF “THINGS” START TO THINK…????
An Introduction to Data Analytics for IoT
⚫ As more and more devices are added to IoT networks, the
data generated by these systems becomes overwhelming
⚫ Traditional data management systems are simply unprepared
for the demands of what has come to be known as “big
data.”
⚫ The real value of IoT is not just in connecting things but
rather in the data produced by those things, the new services you can
enable via those connected things, and the business insights that
the data can reveal.
⚫ However, to be useful, the data needs to be handled in a
way that is organized and controlled.
⚫ Thus, a new approach to data analytics is needed for the
Internet of Things
⚫ In the world of IoT, the creation of massive amounts
of data from sensors is common and one of the
biggest challenges— not only from a transport
perspective but also from a data management
standpoint
⚫ Modern jet engines, weather monitoring systems, smart
home appliances, satellites, and so on are fitted
with thousands of sensors that can generate a
whopping 10 GB of data per second.
⚫ Analyzing this amount of data in the most efficient
manner possible falls under the umbrella of data
analytics
⚫ Not all data is the same; it can be categorized and
thus analyzed in different ways.
⚫ Depending on how data is categorized, various data
analytics tools and processing methods can be
applied.
⚫ Two important categorizations from an IoT perspective
are whether the data is structured or unstructured and
whether it is in motion or at rest.
A few terms used in IoT
1) Application refers to application software or a collection of
software components. An application enables a user to perform a
group of coordinated activities, functions and tasks.
2) Service denotes a mechanism, which enables the provisioning of
access to one or more capabilities. A service consists of a
collection of self-contained, distinct and reusable components. It
provides logically grouped and encapsulated functionalities.
3) Service Oriented Architecture (SOA) is a software architecture
model, which consists of services, messages, operations and
processes. SOA components are distributed over a network or the
Internet in a high-level business entity. New business applications
and applications integration architecture in an enterprise can be
developed using an SOA.
4) Message means a communicating entity or object.
5) Operation means action or set of actions.
6) Transaction refers to two inter-related sets of operations or
actions or instructions.
A few terms used in IoT
(Contd..)
7) Query is a command for getting selected values from a database, which
in turn returns the answer to the query after its processing.
8) Query Processing is a group of structured activities undertaken to
get the results from a data store as per the query.
9) Key Value Pair (KVP) refers to a set of two linked entities, one is
the key, which is a unique identifier for a linked entity and the other is
the value, which is either the entity that is identified or a pointer to the
location of that entity. Ex: birthday: 29 July 2020
10) Hash Table (also called hash map) refers to a data structure which
maps the KVPs and is used to implement an associative array. A hash
table may use an index (key) which is computed using a hash function
and key maps to the value.
11) Bigtable maps two arbitrary string values into an associated
arbitrary byte array. One is used as row key and the other as column
key.
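The key-value pair, hash-table and Bigtable ideas above map directly onto a Python dict, which is itself implemented as a hash table. A minimal sketch (the sensor names and values are invented for illustration):

```python
# A key-value pair: the key uniquely identifies the linked value.
record = {"birthday": "29 July 2020"}

# A hash table (Python dict) maps keys to values via a hash function.
sensor_readings = {}
sensor_readings["sensor-01"] = 22.5   # key -> value
sensor_readings["sensor-02"] = 23.1

# Lookup internally computes hash("sensor-01") to locate the value.
print(sensor_readings["sensor-01"])   # 22.5

# A Bigtable-style cell is addressed by two keys (row key, column key)
# and holds an arbitrary byte array:
bigtable_like = {("row-A", "temp"): b"22.5"}
print(bigtable_like[("row-A", "temp")])
```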
A few terms used in IoT
(Contd..)
12) Business Transaction (BT), in database theory, refers to a
(business) process that requests information from or that
changes the data in a database.
13) Process means a composition of a group of structured
activities or tasks that lead to a particular goal. It specifies
activities with relevance rules based on data in the process.
14) Process Matrix is a multi-element entity, each element of
which relates a set of data or inputs to an activity (or subset of
activities).
15) Business Process (BP) is an activity or series of activities
or a collection of inter-related structured activities, tasks or
processes.
16) Business Intelligence (BI) is a process which enables a
business service to extract new facts and knowledge and then
make better decisions.
Structured Versus Unstructured Data
⚫ Structured data and unstructured data are important
classifications, as they typically require different toolsets from a
data analytics perspective.
⚫ Structured data means that the data follows a model or
schema that defines how the data is represented or organized,
meaning it fits well with a traditional relational database
management system (RDBMS).
⚫ In many cases you will find structured data in a simple tabular
form—for example, a spreadsheet where data occupies a
specific cell and can be explicitly defined and referenced
⚫ Structured data can be found in most computing systems and
includes everything from banking transactions and invoices to
computer log files and router configurations.
⚫ IoT sensor data often uses structured values, such as
temperature, pressure, humidity, and so on, which are all sent
in a known format.
⚫ Structured data is easily formatted, stored, queried, and processed.
⚫ Because of the highly organized format of structured data, a
wide array of data analytics tools are readily available for
processing this type of data, from custom scripts to commercial
software like Microsoft Excel and Tableau.
⚫ Unstructured data lacks a logical schema for understanding and
decoding the data through traditional programming means.
⚫ Examples - text, speech, images, and video.
⚫ As a general rule, any data that does not fit neatly into a predefined
data model is classified as unstructured data
⚫ According to some estimates, around 80% of a business’s data is
unstructured.
⚫ Because of this fact, data analytics methods that can be applied to
unstructured data, such as cognitive computing and machine
learning, are deservedly garnering a lot of attention.
⚫ With machine learning applications, such as natural language
processing (NLP), you can decode speech.
⚫ With image/facial recognition applications, you can extract critical
information from still images and video
⚫ Smart objects in IoT networks generate
both structured and unstructured data.
⚫ Structured data is more easily managed and processed
due to its well-defined organization.
⚫ On the other hand, unstructured data can be harder to
deal with and typically requires very different analytics
tools for processing the data.
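To make the distinction concrete, here is a small sketch (the readings and report text are invented): a structured sensor record fits a fixed schema and can be queried field by field, while an unstructured text report must first be parsed or mined to yield the same fact.

```python
import csv
import io

# Structured: fixed schema, every value sits in a known field.
structured = "device,temperature,humidity\nsensor-01,22.5,41\nsensor-02,23.1,39\n"
rows = list(csv.DictReader(io.StringIO(structured)))
print(rows[0]["temperature"])   # direct field access: '22.5'

# Unstructured: free text; the same fact has no predefined location.
unstructured = "Night shift report: sensor-01 near the boiler read about 22.5 degrees."
found = "22.5" in unstructured   # fall back to searching the raw text
print(found)
```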
Data in Motion Versus Data at Rest
⚫ Data in IoT networks is either in transit (“data in motion”)
or being held or stored (“data at rest”).
⚫ Examples of data in motion include traditional client/server
exchanges, such as web browsing, file transfers, and email.
⚫ Data saved to a hard drive, storage array, or USB drive is
data at rest.
⚫ From an IoT perspective, the data from smart objects is considered
data in motion as it passes through the network en route to its final
destination.
⚫ This is often processed at the edge, using fog computing.
⚫ When data is processed at the edge, it may be filtered and deleted or
forwarded on for further processing and possible storage at a fog node
or in the data center.
⚫ When data arrives at the data center, it is possible to process it in
real-time, just like at the edge, while it is still in motion.
⚫ Tools with this sort of capability include Spark, Storm, and Flink.
⚫ Data at rest in IoT networks can be typically found in IoT
brokers or in some sort of storage array at the data center
⚫ Hadoop not only helps with data processing but also with data
storage.
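The edge/fog filtering step described above can be sketched as a simple generator pipeline (the thresholds, device names, and readings are invented): data in motion is inspected at the edge, in-range readings are dropped, and only interesting ones are forwarded toward the fog node or data center.

```python
def edge_filter(readings, low=15.0, high=90.0):
    """Drop in-range readings at the edge; forward only anomalies."""
    for device, temp in readings:
        if temp < low or temp > high:   # only unusual data travels on
            yield (device, temp)

# A small stream of (device, temperature) readings in motion.
stream = [("engine-1", 72.0), ("engine-1", 95.5), ("engine-2", 14.2)]
forwarded = list(edge_filter(stream))
print(forwarded)   # [('engine-1', 95.5), ('engine-2', 14.2)]
```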
IoT Data Analytics Overview
⚫ The true importance of IoT data from smart
objects is realized only when the analysis of the data
leads to actionable business intelligence and
insights.
⚫ Data analysis is typically broken down by the types
of results that are produced.
BIG DATA
Four types of data analysis
results
1) Descriptive:
⚫ Descriptive data analysis tells you what is happening, either
now or in the past.
⚫ For example, a thermometer in a truck engine reports
temperature values every second.
⚫ From a descriptive analysis perspective, you can pull this data at
any moment to gain insight into the current operating condition
of the truck engine.
⚫ If the temperature value is too high, then there may be a
cooling problem or the engine may be experiencing too much
load.
2) Diagnostic:
⚫ When you are interested in the “why,” diagnostic data analysis
can provide the answer.
⚫ For example, continuing with the temperature sensor in the truck
engine, you might wonder why the truck engine failed.
⚫ Diagnostic analysis might show that the temperature of
the engine was too high, and the engine overheated.
⚫ Applying diagnostic analysis across the data generated by a
wide range of smart objects can provide a clear picture of
why a problem or an event occurred.
3)Predictive:
⚫ Predictive analysis aims to foretell problems or issues
before they occur.
⚫ For example, with historical values of temperatures for the truck
engine, predictive analysis could provide an estimate on the
remaining life of certain components in the engine.
⚫ These components could then be proactively replaced before
failure occurs.
⚫ Or perhaps if temperature values of the truck engine start to rise
slowly over time, this could indicate the need for an oil change or
some other sort of engine cooling maintenance.
4) Prescriptive:
⚫ Prescriptive analysis goes a step beyond predictive and
recommends solutions for upcoming problems.
⚫ A prescriptive analysis of the temperature data from a truck
engine might calculate various alternatives to cost-effectively
maintain our truck.
⚫ These calculations could range from the cost necessary for more
frequent oil changes and cooling maintenance to installing new
cooling equipment on the engine or upgrading to a lease on a model
with a more powerful engine.
⚫ Prescriptive analysis looks at a variety of factors and makes the
appropriate recommendation
⚫ Both predictive and prescriptive analyses are more
resource intensive and increase complexity, but the value
they provide is much greater than the value from
descriptive and diagnostic analysis
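The truck-engine example running through these slides can be sketched for the first three analysis types (the readings, limit, and the simple linear trend are invented for illustration; a prescriptive step would additionally rank repair options by cost):

```python
temps = [88, 90, 93, 97, 102]   # hourly engine temperatures (deg C)
LIMIT = 100.0

# Descriptive: what is happening now? Report the latest reading.
current = temps[-1]
print("current temperature:", current)

# Diagnostic: why did the engine fail? Check whether it overheated.
overheated = current > LIMIT
print("overheated:", overheated)

# Predictive: fit the recent trend and estimate the next reading,
# so the component can be serviced before failure occurs.
slope = (temps[-1] - temps[0]) / (len(temps) - 1)   # average rise per hour
predicted_next = temps[-1] + slope
print("predicted next:", predicted_next)
```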
Types of Data Analysis
Results
Data
Generation
1) Passive Devices Data: Data generated at the device or system
• Data follows as the result of interactions
• A passive device does not have its own power source
• An external source makes the device send data
• Examples: an RFID tag or an ATM debit card
2) Active Devices Data: Data generated at the device or system
• Data follows as the result of interactions
• An active device has its own power source
• Examples: active RFIDs, streetlight sensors, wireless sensor nodes
• An active device has an associated microcontroller, memory and transceiver
3) Event Data from Device: Data generated on an event, once only
• Detection of traffic or of dark ambient conditions signals an event; the event
then communicates a need for lighting up a group of streetlights
• A system consisting of security cameras generating data on an event of a
security breach or on intrusion detection
• A waste container with an associated circuit generating data in the event of
getting 90% or more filled up
4) Device Real-Time Data:
• An ATM generates data and communicates it to a server
instantaneously through the Internet,
• thereby initiating and enabling Online Transaction Processing
(OLTP) in real time
IoT Data Analytics Challenges
Problems with using an RDBMS in IoT:
1. Scaling problems (performance issues, costly to resolve,
requires more hardware, architecture changes)
2. Volatility of data (changes in schema)
Big Data Analytics Tools and
Technology
⚫ Big data analytics can consist of many different software
pieces that together collect, store, manipulate, and analyze
all different data types.
⚫ Generally, the industry looks to the “three Vs” to categorize
big data:
1) Velocity
⚫ Refers to how quickly data is being collected and
analyzed.
⚫ Hadoop Distributed File System is designed to ingest and
process data very quickly.
⚫ Smart objects can generate machine and sensor data at a very fast
rate and require database or file systems capable of equally fast
ingest functions.
2) Variety
⚫ refers to different types of data.
⚫ Often you see data categorized as structured, semi-structured, or
unstructured.
⚫ Different database technologies may only be capable of
accepting one of these types.
⚫ Hadoop is able to collect and store all three types
3) Volume
⚫ refers to the scale of the data.
⚫ Typically, this is measured from gigabytes on the very low end to
petabytes or even exabytes of data on the other extreme.
The characteristics of big data can be defined by the
sources and types of data:
⚫ First is machine data, which is generated by IoT devices and
is typically unstructured data.
⚫ Second is transactional data, which comes from sources that produce
data from transactions on these systems; it is high in volume
and structured.
⚫ Third is social data sources, which are typically high
volume and structured.
⚫ Fourth is enterprise data, which is data that is lower
in volume and very structured
⚫ Hence big data consists of data from all these separate
sources.
Massively Parallel Processing
Databases
⚫ Enterprises have used relational databases for storing
structured, row and column style data types for decades.
⚫ Relational databases are often grouped into a broad data
storage category called data warehouses.
⚫ Though they are the centerpiece of most data architectures,
they are often used for longer-term archiving and data queries
that can often take minutes or hours
⚫ Massively parallel processing (MPP) databases were
built on the concept of the relational data warehouses
but are designed to be much faster, to be efficient, and
to support reduced query times.
⚫ To accomplish this, MPP databases take advantage of multiple
nodes (computers) designed in a scale-out architecture such
that both data and processing are distributed across
multiple systems

⚫ MPPs are sometimes referred to as analytic databases because
they are designed to allow for fast query processing and often have
built-in analytic functions.
⚫ As the name implies, these database types process massive data
sets in parallel across many processors and nodes.
⚫ An MPP architecture typically contains a single master node that is
responsible for the coordination of all the data storage and processing
across the cluster.
⚫ It operates in a “shared-nothing” fashion, with each node containing
local processing, memory, and storage and operating independently.
⚫ Data storage is optimized across the nodes in a
structured SQL-like format that allows data analysts to
work with the data using common SQL tools and
applications
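The shared-nothing, scale-out idea can be sketched with Python's multiprocessing module: a coordinator (the master node) fans a query out to worker processes that each scan only their own local shard, then combines the partial results. The shards and the `amount` field are invented for the example.

```python
from multiprocessing import Pool

def local_scan(shard):
    """Each 'node' aggregates only its own local shard of rows."""
    return sum(row["amount"] for row in shard)

if __name__ == "__main__":
    # Data is pre-distributed across nodes, MPP-style ("shared nothing").
    shards = [
        [{"amount": 10}, {"amount": 20}],   # node 1's local storage
        [{"amount": 5}],                    # node 2's local storage
        [{"amount": 7}, {"amount": 8}],     # node 3's local storage
    ]
    with Pool(processes=3) as pool:
        partials = pool.map(local_scan, shards)   # parallel local work
    # The master combines the partial results into the final answer.
    print(sum(partials))   # 50
```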
NoSQL Databases
⚫ NoSQL (“not only SQL”) is a class of databases that support
semi-structured and unstructured data, in addition to the
structured data handled by data warehouses and MPPs
⚫ NoSQL is not a specific database technology; rather, it is
an umbrella term that encompasses several different types
of databases and document stores such as XML or JSON.
⚫ Document stores generally have query engines and indexing
features that allow for many optimized queries
⚫ Key-value stores
This type of database stores associative arrays where a
key is paired with an associated value. These databases
are easy to build and easy to scale

⚫ Wide-column stores
This type of database is similar to a key-value store,
but the formatting of the values can vary from row to
row, even in the same table
⚫ Graph stores
This type of database is organized based on the relationships
between elements.
Graph stores are commonly used for social media or
natural language processing, where the connections
between data are very relevant.
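The three store types can be contrasted with plain Python structures (all records invented): a key-value store maps one key to one opaque value, a wide-column row may carry a different set of columns per row, and a graph store keeps the relationships themselves as first-class data.

```python
# Key-value store: one key, one value; trivially easy to scale.
kv = {"user:42": "alice"}

# Wide-column store: rows in the same table can hold different columns.
wide = {
    "row1": {"name": "alice", "city": "Delhi"},
    "row2": {"name": "bob", "age": 30, "device": "sensor-7"},  # extra columns
}

# Graph store: the connections between elements are what is stored
# (here as adjacency lists).
graph = {"alice": ["bob"], "bob": ["alice", "carol"]}
print(graph["bob"])   # whom bob is connected to
```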
Hadoop
⚫ Hadoop is one of the most recent entrants into the data
management market, but it is arguably the most popular choice as
a data repository and processing engine.
⚫ Hadoop was originally developed as a result of projects at
Google and Yahoo!
⚫ The original intent for Hadoop was to index millions of
websites and quickly return search results for open source
search engines.
⚫ Initially, the project had two key elements:
⚫ Hadoop Distributed File System (HDFS):
A system for storing data across multiple nodes
⚫ MapReduce:
⚫ A distributed processing engine that splits a large task into
smaller ones that can be run in parallel.
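The split-into-smaller-parallel-tasks idea behind MapReduce can be sketched with the classic word count (a toy in-process version, not the Hadoop API): map emits (word, 1) pairs from each chunk, the pairs are shuffled by key, and reduce sums each group.

```python
from collections import defaultdict

def map_phase(chunk):
    # Emit one (key, value) pair per word, as a MapReduce mapper would.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(grouped):
    # Sum the values for each key, as a MapReduce reducer would.
    return {word: sum(counts) for word, counts in grouped.items()}

chunks = ["big data big", "data analytics"]   # one chunk per "node"
# Map: each chunk could be processed on a different node in parallel.
pairs = [kv for chunk in chunks for kv in map_phase(chunk)]

# Shuffle: group all values by key before reducing.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)

print(reduce_phase(grouped))   # {'big': 2, 'data': 2, 'analytics': 1}
```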
⚫ Hadoop relies on a scale-out architecture that leverages local
processing, memory, and storage to distribute tasks and provide a
scalable storage system for data.

⚫ Both MapReduce and HDFS take advantage of this distributed
architecture to store and process massive amounts of data and are
thus able to leverage resources from all nodes in the cluster.
⚫ For HDFS, this capability is handled by specialized nodes in the
cluster, including NameNodes and DataNodes.
Fig: Distributed Hadoop Cluster
NameNodes
⚫ These are a critical piece in data adds, moves, deletes, and reads
on HDFS.
⚫ They coordinate where the data is stored, and maintain a map of
where each block of data is stored and where it is replicated.
⚫ All interaction with HDFS is coordinated through the primary
(active) NameNode, with a secondary (standby) NameNode notified
of the changes in the event of a failure of the primary.
⚫ The NameNode takes write requests from clients and distributes
those files across the available nodes in configurable block sizes,
usually 64 MB or 128 MB blocks.
⚫ The NameNode is also responsible for instructing the
DataNodes where replication should occur.
⚫ DataNodes
⚫ These are the servers where the data is stored at the direction of
the NameNode.
⚫ It is common to have many DataNodes in a Hadoop cluster to
store the data.
⚫ Data blocks are distributed across several nodes and often are
replicated three, four, or more times across nodes for
redundancy.
⚫ Once data is written to one of the DataNodes, the DataNode
selects two (or more) additional nodes, based on replication
policies, to ensure data redundancy across the cluster
YARN
⚫ Introduced with version 2.0 of Hadoop, YARN (Yet Another
Resource Negotiator) was designed to enhance the
functionality of MapReduce.
⚫ With the initial release, MapReduce was responsible for
batch data processing and job tracking and resource
management across the cluster.
⚫ YARN was developed to take over the resource negotiation and
job/task tracking, allowing MapReduce to be responsible only
for data processing.
The Hadoop
Ecosystem
⚫ Since the initial release of Hadoop in 2011, many
projects have been developed to add incremental
functionality to Hadoop and have collectively become
known as the Hadoop ecosystem.
⚫ Apache Kafka
⚫ Apache Spark
⚫ Apache Storm and Apache Flink
⚫ Lambda Architecture
Comparing Big Data and Edge
Analytics
⚫ When you hear the term big data, it is usually in reference to
unstructured data that has been collected and stored in the
cloud
⚫ Tools like Hadoop and MapReduce are great at tackling
problems that require deep analytics on a large and complex
quantity of unstructured data.
⚫ However, due to their distance from the IoT endpoints and
the bandwidth required to bring all the data back to the cloud,
they are generally not well suited to real-time analysis of data
as it is generated.
⚫ In applying data analytics - car racing example, big data
analytics is used to examine all the statistics of the racing team
and players based on their performance in the data center or
cloud
⚫ Streaming analytics involves analyzing a race while it is happening
and trying to figure out who is going to win based on the actual
performance in real-time—and this analysis is typically performed
as close to the edge as possible.
⚫ Streaming analytics allows you to continually monitor and assess
data in real-time so that you can adjust or fine-tune your
predictions as the race progresses.
⚫ In the context of IoT, with streaming analytics performed at the
edge (either at the sensors themselves or very close to them), it
is possible to process and act on the data in real time without
waiting for the results from a future batch-processing job in the
cloud.
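Streaming analytics of the kind described above can be sketched as a sliding-window monitor (the window size, threshold, and readings are invented): each arriving reading immediately updates the running view, instead of waiting for a batch job over stored data.

```python
from collections import deque

class StreamMonitor:
    """Keep a sliding window over a live stream and flag anomalies."""
    def __init__(self, window=3, threshold=100.0):
        self.window = deque(maxlen=window)   # only the most recent readings
        self.threshold = threshold

    def ingest(self, value):
        self.window.append(value)
        avg = sum(self.window) / len(self.window)   # updated per reading
        return avg, avg > self.threshold            # act on data in motion

monitor = StreamMonitor(window=3, threshold=100.0)
for reading in [90.0, 95.0, 110.0, 120.0]:
    avg, alert = monitor.ingest(reading)
print(round(avg, 1), alert)   # 108.3 True
```

Because the window is bounded, memory stays constant no matter how long the stream runs, which is what makes this practical at the edge.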
