Unit I-KCS-061
UNIT - I
Introduction to Big Data: Types of digital data, history of Big Data innovation,
introduction to Big Data platform, drivers for Big Data, Big Data architecture and
characteristics, 5 Vs of Big Data, Big Data technology components, Big Data importance and
applications, Big Data features – security, compliance, auditing and protection, Big Data
privacy and ethics, Big Data Analytics, Challenges of conventional systems, intelligent data
analysis, nature of data, analytic processes and tools, analysis vs reporting, modern data
analytic tools
INTRODUCTION TO BIG DATA
What is Data?
The quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media.
3 Actions on Data
– Capture
– Transform
– Store
Big Data
Every day we create 2.5 quintillion bytes of data—in fact, 90 percent of the data in the world
today has been created in the last two years alone.
• This data comes from a wide variety of sources: sensors used to gather climate information,
posts to social media sites, digital pictures and videos, purchase transaction records and cell
phone GPS signals, to name a few.
The New York Stock Exchange is an example of Big Data that generates about one terabyte of
new trade data per day.
Twitter generates 7 TB of data daily.
IBM claims 90% of today’s stored data was generated in just the last two years.
Walmart handles more than 1 million customer transactions every hour.
Facebook handles 40 billion photos from its user base.
Decoding the human genome originally took 10 years to process; now it can be achieved in one
week.
According to Gartner, the definition of Big Data –
“Big data is high-volume, high-velocity, and high-variety information assets that demand cost-
effective, innovative forms of information processing for enhanced insight and decision
making.”
Big Data refers to complex and large data sets that have to be processed and analyzed to
uncover valuable information that can benefit businesses and organizations. However, there are
certain basic tenets of Big Data that will make it even simpler to answer what
is Big Data:
•It refers to a massive amount of data that keeps on growing exponentially with time.
•Big data is a term applied to data sets whose size or type is beyond the ability of traditional
relational databases to capture, manage and process the data with low
latency.
•It includes data mining, data storage, data analysis, data sharing, and data visualization.
•The term is an all-encompassing one, including data, data frameworks, and the tools
and techniques used to process and analyze the data.
DIGITAL DATA
Digital data is information stored on a computer system as a series of 0’s and 1’s in a binary
language. Digital data jumps from one value to the next in a step by step sequence.
Example: Whenever we send an email, read a social media post, or take pictures with our
digital camera, we are working with digital data.
Digital data can be classified into three main forms (structured, unstructured, and
semi-structured), with quasi-structured data sometimes treated as a fourth:
a) Structured
Structured data is one of the types of big data; by structured data, we mean data that can be
processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored in and accessed from a database by simple search engine
algorithms. Relationships exist between entities of data, such as classes and their
objects. Structured data is usually stored in well-defined columns and databases.
– Structured Schema
– Tables with rows and columns of data
– Example: DBMS, RDBMS
For instance, the employee table in a company database will be structured as the
employee details, their job positions, their salaries, etc., will be present in an organized
manner.
b) Unstructured
Unstructured data refers to data that lacks any specific form or structure and does not follow
the formal structural rules of data models. It does not even have a consistent format, and it
varies all the time. This makes it very difficult and time-consuming to process and
analyze unstructured data. About 80-90% of an organization's data is in this format.
Example: Memos, chat rooms, PowerPoint presentations, images, videos, letters, research
papers, white papers, the body of an email, etc.
c) Semi-structured
It is considered another form of structured data. It inherits a few properties of structured data,
but major parts of this kind of data lack a definitive structure and do not obey
the formal structure of data models such as RDBMS. To be precise, it refers to data that,
although it has not been classified under a particular repository (database), still contains vital
information or tags that segregate individual elements within the data. However, it is not in a
form that can be used easily by a computer program.
Example : Emails, XML, markup languages like HTML, etc.
Metadata for this data is available but is not sufficient.
– Schema is not defined properly
– JSON, XML, CSV, RSS
– Ex: Transactional history file, Log file
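To make this concrete, here is a small illustrative sketch in Python (the record and its field
names are invented): a semi-structured JSON document carries tags that label each element,
but there is no fixed schema, so nested and optional fields must be handled explicitly.

import json

# A hypothetical semi-structured record: the keys tag each element, but there
# is no fixed schema; "address" is nested and "phone" may be absent entirely.
record = '{"name": "Asha", "email": "asha@example.com", "address": {"city": "Kanpur"}}'

doc = json.loads(record)           # parse the JSON text into a Python dict
print(doc["name"])                 # individual elements are reachable via their tags
print(doc.get("phone", "absent"))  # a missing field must be handled explicitly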
d) Quasi-structured Data: This format contains textual data with inconsistent formats that
can be structured only with time, effort, and specialized tools. Example: web server
logs, i.e., log files created and maintained by a server that contain a list
of activities.
Introduction to Big Data platform
A big data platform is a type of IT solution that combines the features and capabilities of
several big data applications and utilities within a single solution, which is then used for
managing as well as analyzing Big Data.
It focuses on providing its users with efficient analytics tools for massive datasets.
The users of such platforms can custom-build applications according to their use case, for
example to calculate customer loyalty (an e-commerce use case), and so on.
Goal: The main goal of a Big Data Platform is to achieve: Scalability, Availability,
Performance, and Security.
Basics of Big Data Platform
• A Big Data platform is an integrated IT solution which combines several Big Data tools and
utilities into one packaged solution for managing and analyzing Big Data.
• It is an enterprise-class IT platform that enables organizations to develop, deploy,
operate, and manage a big data infrastructure/environment.
• There are several open-source and commercial Big Data platforms on the market with varied
features which can be used in a Big Data environment.
• A big data platform generally consists of big data storage, servers, databases, big data
management, business intelligence, and other big data management utilities.
• It also supports custom development, querying, and integration with other systems.
• The primary benefit of a big data platform is to reduce the complexity of multiple
vendors/solutions into one cohesive solution.
• Big data platforms are also delivered through the cloud, where the provider offers an all-
inclusive set of big data solutions and services.
Big data architecture refers to the logical and physical structure that dictates how high volumes
of data are ingested, processed, stored, managed, and accessed. Big data architecture is designed
to handle the ingestion, processing, and analysis of data that is too large or complex for
traditional database systems.
Layers in BIG DATA Architecture
Data sources: All big data solutions start with one or more data sources.
Example,
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
VOLUME
Volume is one of the characteristics of big data. Volume refers to the unimaginable amounts
of information generated every second. Data storage has grown exponentially, because data is
now much more than text data: it can be found in the form of videos, music, and large
images on our social media channels. It is now common for enterprises to have terabytes and
even petabytes of storage. As the database grows, the applications and architecture built
to support the data need to be re-evaluated quite often.
Data has grown more in the past few years
than in the previous few decades. Social media, web portals, and real-time sensor data have
increased the amount of data.
For example, Facebook alone generates billions of messages, records the “like” button being
pressed about 4.5 billion times, and receives over 350 million new posts each day. Such a huge
amount of data can only be handled by Big Data technologies.
Sometimes the same data is re-evaluated from multiple angles, and even though the original
data is the same, the newfound intelligence creates an explosion of the data. This big volume
indeed represents Big Data.
Distributed systems are now used to store data in several locations, brought
together by a software framework like Hadoop.
VELOCITY
Velocity essentially refers to the speed at which data is being created in real time. In a
broader perspective, it comprises the rate of change, the linking of incoming data sets at varying
speeds, and activity bursts. The data growth and social media explosion have changed how
we look at data. There was a time when we used to believe that yesterday's data was
recent. As a matter of fact, newspapers still follow that logic. However, news
channels and radio have changed how fast we receive the news.
Today, people rely on social media to update them with the latest happenings. On social
media, a message that is only a few seconds old (a tweet, a status update, etc.) sometimes no
longer interests users.
They often discard old messages and pay attention to recent updates. Data movement is
now almost real time, and the update window has reduced to fractions of a second. This
high-velocity data represents Big Data.
Example of data that is generated with high velocity - Twitter messages or Facebook posts.
VARIETY
Data can be stored in multiple formats, for example in a database, Excel, CSV, Access, or a
simple text file. Sometimes the data is not even in a traditional format; it may be in the form
of video, SMS, PDF, or something different. It is the organization's need to arrange it and
make it meaningful. This would be easy if all the data were in the same format, but that is
not the case most of the time. The real world has data in many different formats, and that is the
challenge we need to overcome with Big Data. This variety of data represents Big Data.
Variety of Big Data refers to structured, unstructured, and semi-structured data that is
gathered from multiple sources. While in the past, data could only be collected from
spreadsheets and databases, today data comes in an array of forms such as emails, PDFs,
photos, videos, audio, social media posts, and so much more. Variety is one of the important
characteristics of big data.
VERACITY
Data veracity, in general, is how accurate or truthful a data set may be. More specifically, when
it comes to the accuracy of big data, it’s not just the quality of the data itself but how
trustworthy the data source, type, and processing of it is.
The quality of captured data can vary greatly, affecting accurate analysis.
Example: Facebook posts with hashtags.
1. Ingestion :
The ingestion layer is the very first step of pulling in raw data.
It comes from internal sources, relational databases, non-relational databases, social media,
emails, phone calls etc.
Most high-performance platforms are created by connecting multiple nodes together via a
variety of network topologies.
The general architecture distinguishes between the management of computing resources (and
the corresponding allocation of tasks) and the management of the data across the network of
storage nodes. This is generally called a master/slave
architecture.
In this configuration, a master job manager oversees the pool of processing nodes, assigns
tasks, and monitors the activity. At the same time, a storage manager oversees the data storage
pool and distributes datasets across the collection of storage resources. While there is no a priori
requirement that data and processing tasks be colocated, it is beneficial from a
performance perspective to ensure that threads process data that is local, or close by, to
minimize the costs of data access latency.
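The following toy Python sketch illustrates this master/worker division of labour (it is a local
simulation, not the code of any real platform): the master partitions a dataset, a pool of
workers processes the parts in parallel, and the master aggregates the results.

from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker counts the words in its own local chunk of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    data = ["big data needs parallelism"] * 1000
    chunks = [data[i::4] for i in range(4)]   # the master partitions the dataset
    with Pool(processes=4) as pool:           # the master assigns tasks to workers
        totals = pool.map(process_chunk, chunks)
    print(sum(totals))                        # the master aggregates the results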
1. APACHE HADOOP
Apache Hadoop is one of the main supporting elements in Big Data technologies. It simplifies
the processing of large amounts of structured or unstructured data in a cheap manner. Hadoop
is an open-source project from Apache that has been improving continuously over the years.
Hadoop is basically a set of software libraries and frameworks for managing and processing big
amounts of data, from a single server up to thousands of machines. It provides an efficient and
powerful error-detection mechanism based on the application layer rather than relying upon
hardware.
2. MAPREDUCE
MapReduce was introduced by Google to create large web search indexes. It is
basically a framework for writing applications that process large amounts of structured or
unstructured data over the web. MapReduce takes the query and breaks it into parts to run on
multiple nodes. Through distributed query processing it makes it easy to maintain large amounts
of data by dividing the data across several different machines. Hadoop MapReduce is a software
framework for easily writing applications that manage large data sets in a highly
fault-tolerant manner.
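To make the map and reduce phases concrete, here is a minimal, self-contained word-count
sketch in Python. It runs locally and only imitates the two phases; a real Hadoop job would
distribute the same two functions across many nodes.

from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key; sum the counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data big insight", "data drives insight"]
    for word, count in reducer(mapper(text)):
        print(word, count)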
3. HDFS
HDFS is a Java-based file system that is used to store structured or unstructured data over large
clusters of distributed servers. The data stored in HDFS has no restriction or rule to be applied;
the data can be either fully unstructured or purely structured. In HDFS, the work of making data
meaningful is done by the developer's code only. The Hadoop distributed file system provides a
highly fault-tolerant environment with deployment on low-cost hardware machines. HDFS is now
a part of the Apache Hadoop project.
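As a hedged illustration of day-to-day HDFS use, the standard hdfs dfs shell commands can be
driven from Python as below. The file and directory names are invented, and a running Hadoop
installation with the hdfs command on the PATH is assumed.

import subprocess

def hdfs(*args):
    # Run an `hdfs dfs` subcommand and return its textual output.
    return subprocess.run(["hdfs", "dfs", *args],
                          capture_output=True, text=True, check=True).stdout

hdfs("-mkdir", "-p", "/user/demo")             # create a directory in HDFS
hdfs("-put", "local_logs.txt", "/user/demo")   # copy a local file into HDFS
print(hdfs("-ls", "/user/demo"))               # list the directory contents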
4. HIVE
Hive was originally developed by Facebook and has been open source for some time. Hive
works something like a bridge between SQL and Hadoop; it is basically used to run SQL
queries on Hadoop clusters. Apache Hive is basically a data warehouse that provides ad-hoc
querying, data summarization, and analysis of huge data sets stored in Hadoop-compatible file
systems. Hive provides an SQL-like query language called HiveQL for working with huge
amounts of data stored in Hadoop clusters. In January 2013 Apache released Hive 0.10.0; more
information and an installation guide can be found in the Apache Hive documentation.
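As an illustrative sketch, a HiveQL query can be issued from Python roughly as below. This
assumes the third-party PyHive package and a running HiveServer2, and the page_views table is
invented; the point is only that HiveQL reads like SQL while Hive turns it into jobs on the
Hadoop cluster.

from pyhive import hive  # third-party package; HiveServer2 must be reachable

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()
cursor.execute(
    "SELECT page, COUNT(*) AS hits "
    "FROM page_views GROUP BY page ORDER BY hits DESC LIMIT 10"
)
for page, hits in cursor.fetchall():
    print(page, hits)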
5. PIG
Pig was introduced by Yahoo and later made fully open source. It also provides a
bridge to query data over Hadoop clusters, but unlike Hive it offers a scripting approach
to make Hadoop data accessible to developers and business users. Apache Pig provides a
high-level programming platform for developers to process and analyze Big Data using user-
defined functions and programming effort. In January 2013 Apache released Pig 0.10.1, which
is defined for use with Hadoop 0.10.1 or later releases. More information and an installation
guide can be found in the Apache Pig Getting Started documentation.
BIG DATA USE CASES
Big data techniques can be used to deliver business benefits by increasing the value of
an organization. Big data is beneficial in many applications, and in general the following are
the common categories. The list is derived from The Apache Software Foundation's "Powered By
Hadoop" web site.
•Business intelligence, querying, reporting, and searching, including many implementations of
searching, filtering, indexing, speeding up aggregation for reporting and report generation,
trend analysis, search optimization, and general information retrieval.
•Improved performance for common data management operations, with the majority focusing
on log storage, data storage and archiving, followed by sorting, running joins,
extraction/transformation/ loading (ETL) processing, other types of data conversions, as well as
duplicate analysis and elimination.
•Non-database applications, such as image processing, text processing in preparation for
publishing, genome sequencing, protein sequencing and structure prediction, web crawling, and
monitoring workflow processes.
•Data mining and analytical applications, including social network analysis, facial
recognition, profile matching, other types of text analytics, web mining, machine learning,
information extraction, personalization and recommendation analysis, ad optimization, and
behavior analysis.
In turn, the core capabilities that are implemented using the big data application can be further
abstracted into more fundamental categories:
•Counting functions applied to large bodies of data that can be segmented and distributed
among a pool of computing and storage resources, such as document indexing, concept
filtering, and aggregation (counts and sums).
•Scanning functions that can be broken up into parallel threads, such as sorting, data
transformations, semantic text analysis, pattern recognition, and searching.
•Modeling capabilities for analysis and prediction.
•Storing large datasets while providing relatively rapid access.
Generally, processing applications can combine these core capabilities in different ways. In
today's world big data has several applications, some of which are listed below:
a) Hadoop
• Hadoop is an open-source, Java-based programming framework and server software which is
used to store and analyze data with the help of hundreds or even thousands of commodity servers
in a clustered environment.
• Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant
way.
• Hadoop uses HDFS (Hadoop Distributed File System) for storing data on clusters of commodity
computers. If any server goes down, it knows how to replicate the data, and there is no loss of
data even on hardware failure.
• Hadoop is an Apache-sponsored project, and it consists of many software packages which run
on top of the Apache Hadoop system.
• Hadoop provides a set of tools and software that forms the backbone of a Big Data analytics
system.
• The Hadoop ecosystem provides the necessary tools and software for handling and analyzing
Big Data.
• On top of the Hadoop system many applications can be developed and plugged in to
provide an ideal solution for Big Data needs.
Cloudera
• Cloudera is one of the first commercial Hadoop-based Big Data analytics platforms
offering Big Data solutions.
• Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera
Data Science & Engineering, and Cloudera Essentials.
• All these products are based on Apache Hadoop and provide real-time processing
and analytics of massive data sets.
Website: https://www.cloudera.com
Apache Hadoop
• Apache Hadoop is a Big Data platform and software package which is an Apache-sponsored
project.
• Under the Apache Hadoop project, various other software packages are being developed which
run on top of the Hadoop system to provide enterprise-grade data management and analytics
solutions to enterprises.
• Apache Hadoop provides an open-source, distributed file system and a data processing and
analysis engine for analyzing large sets of data.
• Hadoop can run on Windows, Linux, and OS X operating systems, but it is mostly used on
Ubuntu and other Linux variants.
MapReduce
• The MapReduce engine was originally written by Google, and it is the system which enables
developers to write programs which can run in parallel on hundreds or even thousands of
computer nodes to process vast data sets.
• After processing the job on the different nodes, it combines the results and returns them to the
program which executed the MapReduce job.
• This software is platform independent and runs on top of the Hadoop ecosystem. It can
process tremendous data at very high speed in a Big Data environment.
Apache Storm
• Apache Storm is software for real-time computation and distributed processing.
• It is free and open-source software developed at the Apache Software Foundation. It is a
real-time, parallel processing engine.
• Apache Storm is highly scalable and fault-tolerant, and it supports almost all
programming languages.
Apache Storm can be used in:
• Realtime analytics
• Online machine learning
• Continuous computation
• Distributed RPC
• ETL
• And all other places where real-time processing is required.
Apache Storm is used by Yahoo, Twitter, Spotify, Yelp, Flipboard, and many other data giants.
Website: http://storm.apache.org/
Apache Spark
• Apache Spark is software that runs on top of Hadoop and provides an API for real-time, in-
memory processing and analysis of large sets of data stored in HDFS.
• It stores data in memory for faster processing.
• Apache Spark runs programs up to 100 times faster in memory and 10 times faster on disk as
compared to MapReduce.
• Apache Spark exists to speed up the processing and analysis of big data sets in a Big Data
environment.
• Apache Spark is being adopted very fast by businesses to analyze their data sets and get real
value from their data (a minimal word-count sketch follows this list).
• Website: http://spark.apache.org/
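The sketch below is a minimal PySpark word count. The input path is illustrative, and a local
installation of the pyspark package is assumed; the same code scales from a laptop to a cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///user/demo/local_logs.txt")
          .flatMap(lambda line: line.split())   # map: one record per word
          .map(lambda word: (word, 1))          # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))     # reduce: sum the counts per word

print(counts.take(10))  # transformations are lazy; take() triggers the computation
spark.stop()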
SAMOA
• SAMOA stands for Scalable Advanced Massive Online Analysis.
• It is a system for mining Big Data streams.
• SAMOA is open-source software distributed on GitHub, which can also be used as a
distributed machine learning framework.
• Website: https://github.com/yahoo/samoa
Thus, the Big Data industry has been growing very fast since 2017, and companies are rapidly
moving their data to Big Data platforms. There is a huge requirement for Big Data skills in the
job market.
CHALLENGES OF CONVENTIONAL SYSTEMS
Conventional Systems
Conventional systems are traditional data management systems, typically built around relational
databases, that store and process structured data on a single server or a small cluster.
• Big data is a huge amount of data which is beyond the processing capacity of conventional
database systems to manage and analyze within a specific time interval.
Difference between conventional computing and intelligent computing
• Conventional computing functions logically with a set of rules and calculations, while
neural computing can function via images, pictures, and concepts.
• Conventional computing is often unable to manage the variability of data obtained in the real
world. On the other hand, neural computing, like our own brains, is well suited to situations that
have no clear algorithmic solution and is able to manage noisy, imprecise data. This allows
it to excel in those areas that conventional computing often finds difficult.
Comparison of Big Data with Conventional Data

Big Data | Conventional Data
Huge data sets. | Data set size in control.
Unstructured data such as text, video, and audio. | Normally structured data such as numbers and categories, but it can take other forms as well.
Hard-to-perform queries and analysis. | Relatively easy-to-perform queries and analysis.
Needs a new methodology for analysis. | Data analysis can be achieved by using conventional methods.
Needs tools such as Hadoop, Hive, HBase, Pig, Sqoop, and so on. | Tools such as SQL, SAS, R, and Excel alone may be sufficient.
Raw transactional data. | The aggregated, sampled, or filtered data.
Used for reporting, advanced analysis, and predictive modeling. | Used for reporting, basic analysis, and text mining; advanced analytics is only in a starting stage for big data.
Needs both programming skills (such as Java) and analytical skills to perform analysis. | Analytical skills are sufficient; advanced analysis tools don't require expert programming skills.
Petabytes/exabytes of data; millions/billions of accounts; billions/trillions of transactions. | Megabytes/gigabytes of data; thousands/millions of accounts; millions of transactions.
Generated by big financial institutions, Facebook, Google, Amazon, eBay, Walmart, and more. | Generated by small enterprises and small banks.
The following challenges dominate in the case of conventional systems in real-time
scenarios:
1) Uncertainty of the data management landscape
2) The Big Data talent gap that exists in the industry
3) Getting data into the big data platform
4) Need for synchronization across data sources
5) Getting important insights through the use of Big Data analytics
We have been using table- and row-based relational databases for years; these databases are just
fine with online transactions and quick updates. When unstructured and large amounts of data
come into the picture, we need databases without a hard-coded schema
attachment. There are a number of databases that fit into this category; these databases can store
unstructured, semi-structured, or even fully structured data.
Apart from other benefits, the finest thing about schema-less databases is that they make data
migration very easy. MongoDB is a very popular and widely used NoSQL database these days.
NoSQL and schema-less databases are used when the primary concern is to store a huge amount
of data and not to maintain relationships between elements. "NoSQL (Not Only SQL) is a type of
database that does not primarily rely upon a schema-based structure and does not use SQL for
data processing."
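A minimal pymongo sketch (assuming a MongoDB server on localhost; the database, collection,
and documents are invented) shows what schema-less storage looks like in practice: two
documents in the same collection can have different fields, with no table definition required.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo_db"]["events"]   # database and collection names are invented

events.insert_one({"user": "u1", "action": "login"})
events.insert_one({"user": "u2", "action": "purchase", "amount": 499.0})

for doc in events.find({"user": "u2"}):
    print(doc)   # no schema declaration or ALTER TABLE was ever needed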
The traditional approach works on structured data that has a basic layout and a defined
structure.
The structured approach designs the database as per the requirements, in tuples and columns.
Live incoming data, which can be input from an ever-changing scenario,
cannot be dealt with in the traditional approach. The Big Data approach is iterative.
INTELLIGENT DATA ANALYSIS
Intelligent Data Analysis (IDA) is one of the hot issues in the field of
artificial intelligence and information science. It is also one of the most
important approaches in the field of data mining.
Based on the basic principles of IDA and the features of the datasets that IDA handles, the
development of IDA can be briefly summarized from three aspects:
• Algorithm principle
• The scale
• Type of the dataset
IDA is
… an interdisciplinary study concerned with the effective analysis of data;
… used for extracting useful information from large quantities of online data, and for extracting
desirable knowledge or interesting patterns from existing databases.
The goal of intelligent data analysis is to extract useful knowledge; the process demands a
combination of extraction, analysis, conversion, classification, organization, reasoning, and so
on.
Why IDA?
➢ Decision making asks for information and knowledge
➢ Data processing can provide them
➢ The multidimensionality of problems calls for methods of adequate and deep data
processing and analysis
Based on machine learning, artificial intelligence, pattern recognition, and record and
visualization technology, IDA helps to obtain useful information, necessary data, and interesting
models from the large amounts of data available online in order to make the right choices.
IDA includes three stages:
(1) Preparation of data
(2) Data mining
(3) Data validation and Explanation
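A compact Python sketch of the three stages, using scikit-learn's bundled iris sample data (the
model choice is arbitrary and serves only to illustrate the pipeline):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# (1) Preparation of data: load it, then split into training and validation sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (2) Data mining: fit a model that extracts patterns (here, decision rules).
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# (3) Data validation and explanation: check the patterns on held-out data.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))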
Knowledge Acquisition
The process of eliciting, analyzing, transforming, classifying, organizing and integrating
knowledge and representing that knowledge in a form that can be used in a computer system.
Knowledge in a domain can be expressed as a number of rules
A Rule :
A formal way of specifying a recommendation, directive, or strategy, expressed as "IF premise
THEN conclusion" or "IF condition THEN action".
How to discover rules hidden in the data?
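A hand-written rule of this form might look as follows in Python; the attribute names and
thresholds are invented for illustration. Rule-mining algorithms (for example, decision tree
learners) discover comparable rules automatically from data rather than having an expert write
them.

def credit_rule(record):
    # IF income is high AND debt is low THEN approve, ELSE send for review.
    if record["income"] > 50000 and record["debt"] < 10000:
        return "approve"
    return "review"

print(credit_rule({"income": 72000, "debt": 4000}))    # -> approve
print(credit_rule({"income": 30000, "debt": 12000}))   # -> review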
NATURE OF DATA
Nominal
These are the set of values that don’t possess a natural ordering. For example: The color of a
smartphone can be considered as a nominal data type as we can’t compare one color with others.
It is not possible to state that 'Red' is greater than 'Blue'. The gender of a person is another one,
where we can't differentiate between male, female, or others. Mobile phone categories, whether
midrange, budget segment, or premium, are also a nominal data type.
Ordinal
These types of values have a natural ordering while maintaining their class of values. If we
consider the size of a clothing brand then we can easily sort them according to their name tag in
the order of small < medium < large. The grading system while marking candidates in a test can
also be considered as an ordinal data type where A+ is definitely better than B grade.
These categories help us decide which encoding strategy can be applied to which type of data.
Data encoding for qualitative data is important because machine learning models can't handle
these values directly; they need to be converted to numerical types, since the models are
mathematical in nature.
For the nominal data type, where there is no comparison among the categories, one-hot encoding
can be applied, which is similar to binary coding provided the categories are few in number; for
the ordinal data type, label encoding can be applied, which is a form of integer encoding.
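A minimal pandas sketch of both encodings, on an invented toy dataset:

import pandas as pd

df = pd.DataFrame({
    "color":   ["red", "blue", "red"],            # nominal: no natural order
    "segment": ["budget", "premium", "midrange"]  # ordinal: budget < midrange < premium
})

one_hot = pd.get_dummies(df["color"], prefix="color")   # one column per category
order = {"budget": 0, "midrange": 1, "premium": 2}
df["segment_encoded"] = df["segment"].map(order)        # order-preserving integers

print(pd.concat([df, one_hot], axis=1))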
Discrete
Numerical values which are integers or whole numbers are placed under this
category. The number of speakers in the phone, the number of cameras, the number of cores in
the processor, and the number of SIMs supported are some examples of the discrete data type.
Continuous
Fractional numbers are considered continuous values. These can take the form of the
operating frequency of the processor, the Android version of the phone, the Wi-Fi frequency,
the temperature of the cores, and so on.
Different categorizations of data are possible. The first such categorization may be on the basis
of disciplines, e.g., Sciences, Social Sciences, etc. in which they are generated. Within each of
these fields, there may be several ways in which data can be categorized into types.
There are four types of data:
•Nominal
•Ordinal
•Interval
•Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.
Nominal Scales
Nominal scales measure categories and have the following characteristics:
•Order: The order of the responses or observations does not matter.
•Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not the
same as a 2 and 3.
•True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.
Appropriate statistics for nominal scales: mode, count, frequencies
Displays: histograms or bar charts
Ordinal Scales
Characteristics for ordinal scales are:
•Order: The order of the responses or observations matters.
•Distance: Ordinal scales do not hold distance. The distance between first and second is
unknown as is the distance between first and third along with all observations.
•True Zero: There is no true or real zero. An item, observation, or category cannot be ranked
zero.
Appropriate statistics for ordinal scales: count, frequencies, mode
Displays: histograms or bar charts
Interval Scales
Interval scales provide insight into the variability of the observations or data. Classic interval
scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light). In an interval scale, users could respond
to “I enjoy opening links to the website from a company email” with a response ranging on a
scale of values.
The characteristics of interval scales are:
•Order: The order of the responses or observations does matter.
•Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears the
same as from 4 to 5, so we can meaningfully add and subtract values on the
scale.
•True Zero: There is no true zero with interval scales. However, data can be rescaled in a
manner that contains zero. An interval scale measure from 1 to 9 remains the same as
11 to 19, because we added 10 to all values. Similarly, a 1 to 9 interval scale is the same
as a -4 to 4 scale, because we subtracted 5 from all values. Although the new scale contains
zero, zero remains uninterpretable, because it only appears in the scale from the
transformation. For the same reason, ratio statements such as "six is twice as much as
three" are not meaningful on an interval scale.
Appropriate statistics for interval scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
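For instance, these statistics can be computed with NumPy and SciPy on an invented set of
1-to-9 Likert-style responses:

import numpy as np
from scipy.stats import skew, kurtosis

responses = np.array([2, 4, 4, 5, 5, 5, 6, 7, 8, 9])

print("mean:", responses.mean())
print("median:", np.median(responses))
print("std dev:", responses.std(ddof=1))   # sample standard deviation
print("skewness:", skew(responses))
print("kurtosis:", kurtosis(responses))    # excess kurtosis by default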
Ratio Scales
Ratio scales appear as interval scales with a true zero.
They have the following characteristics:
•Order: The order of the responses or observations matters.
•Distance: Ratio scales do have an interpretable distance.
•True Zero: There is a true zero.
Income is a classic example of a ratio scale:
•Order is established. We would all prefer $100 to $1!
•Zero dollars means we have no income (or, in accounting terms, our revenue exactly
equals our expenses!)
•Distance is interpretable, in that $20 appears as twice $10 and $50 is half of $100.
For the web analyst, the statistics for ratio scales are the same as for interval scales.
Appropriate statistics for ratio scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
The table below summarizes the characteristics of all four types of scales.

Scale | Order matters | Distance interpretable | True zero
Nominal | No | No | No
Ordinal | Yes | No | No
Interval | Yes | Yes | No
Ratio | Yes | Yes | Yes
Big Data Analytics
Big data analytics is a complex process of examining big data to uncover information, such as
hidden patterns, correlations, market trends, and customer preferences. This can help
organizations make informed business decisions. Data analytics technologies and techniques
give organizations a way to analyze data sets and gather new information.
Big Data Analytics enables enterprises to analyze their data in full context quickly and some also
offer real-time analysis.
* Apache Spark is a powerful open source big data analytics tool. It offers over 80 high-level
operators that make it easy to build parallel apps. It is used at a wide range of organizations to
process large datasets.
Features:
•It helps run an application in a Hadoop cluster, up to 100 times faster in memory and ten times
faster on disk
•It offers lightning-fast processing
•Support for Sophisticated Analytics
•Ability to Integrate with Hadoop and Existing Hadoop Data
* Lumify is a big data fusion, analysis, and visualization platform. It helps users to discover
connections and explore relationships in their data via a suite of analytic options.
Features:
•It provides both 2D and 3D graph visualizations with a variety of automatic layouts
•It provides a variety of options for analyzing the links between entities on the graph
•It comes with specific ingest processing and interface elements for textual content, images, and
videos
•Its spaces feature allows you to organize work into a set of projects, or workspaces
•It is built on proven, scalable big data technologies
* IBM SPSS Modeler is a predictive big data analytics platform. It offers predictive models and
delivers to individuals, groups, systems and the enterprise. It has a range of advanced algorithms
and analysis techniques.
Features:
•Discover insights and solve problems faster by analyzing structured and unstructured data
•Use an intuitive interface for everyone to learn
•You can select from on-premises, cloud and hybrid deployment options
•Quickly choose the best performing algorithm based on model performance
ANALYSIS AND REPORTING
What is Analysis?
Analysis is the process of taking the organized data and analyzing it.
• This helps users to gain valuable insights into how businesses can improve their
performance.
• Analysis transforms data and information into insights.
• The goal of the analysis is to answer questions by interpreting the data at a deeper level
and providing actionable recommendations.
What is Reporting?
• Reporting is the process of organizing data into informational summaries in order to
monitor how different areas of a business are performing. Reporting shows what has
happened, while analysis explains why it happened and what can be done about it.
BIG DATA PRIVACY AND ETHICS
Big data analytics raises a number of ethical issues, especially as companies begin monetizing
their data externally for purposes different from those for which the data was initially collected.
The scale and ease with which analytics can be conducted today completely changes the ethical
framework. The following principles can be applied to avoid any unethical practices.
1. Ownership: The first principle of data ethics is that an individual has ownership over their
personal information so it’s unlawful and unethical to collect someone’s personal data
without their consent. Some common ways you can obtain consent are through signed
written agreements, digital privacy policies that ask users to agree to a company’s terms and
conditions, and pop-ups with checkboxes that permit websites to track users’ online
behavior with cookies.
2. Private customer data and identity should remain private: Another ethical responsibility
that comes with handling data is ensuring data subjects’ privacy. Even if a customer gives
your company consent to collect, store, and analyze their personally identifiable information
(PII), that doesn’t mean they want it publicly available. Privacy does not mean secrecy, as
private data might need to be audited based on legal requirements, but that private data
obtained from a person with their consent should not be exposed for use by other businesses
or individuals with any traces to their identity. PII is any information linked to an
individual’s identity. Some examples of PII include:
▪ Full name
▪ Birthdate
▪ Street address
▪ Phone number
▪ Social Security card
▪ Credit card information
▪ Bank account number
▪ Passport number
3. Shared private information should be treated confidentially: Third party companies share
sensitive data — medical, financial or locational — and need to have restrictions on whether
and how that information can be shared further.
4. Customers should have a transparent view of how their data is being collected, stored, used,
or sold, and the ability to manage the flow of their private information across massive, third-
party analytical systems.
5. Big Data should not interfere with human will: Big data analytics can moderate and even
determine who we are before we make up our own minds. Companies need to begin to think
about the kind of predictions and inferences that should be allowed and the ones that should
not.
6. Big data should not institutionalize unfair biases like racism or sexism. Machine learning
algorithms can absorb unconscious biases in a population and amplify them via training
samples.