Unit I-KCS-061
UNIT - I
Introduction to Big Data: Types of digital data, history of Big Data innovation,
introduction to Big Data platform, drivers for Big Data, Big Data architecture and
characteristics, 5 Vs of Big Data, Big Data technology components, Big Data importance and
applications, Big Data features – security, compliance, auditing and protection, Big Data
privacy and ethics, Big Data Analytics, Challenges of conventional systems, intelligent data
analysis, nature of data, analytic processes and tools, analysis vs reporting, modern data
analytic tools
INTRODUCTION TO BIG DATA
What is Data?
The quantities, characters, or symbols on which operations are performed by a
computer, which may be stored and transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media.
3 Actions on Data
– Capture
– Transform
– Store
Big Data
Every day we create 2.5 quintillion bytes of data—in fact, 90 percent of the data in the world
today has been created in the last two years alone.
• This data comes from a wide variety of sources: sensors used to gather climate information,
posts to social media sites, digital pictures and videos, purchase transaction records and cell
phone GPS signals, to name a few.
The New York Stock Exchange is an example of Big Data that generates about one terabyte of
new trade data per day.
Twitter generates 7 TB of data daily.
IBM claims 90% of today’s stored data was generated in just the last two years.
Walmart handles more than 1 million customer transactions every hour.
Facebook handles 40 billion photos from its user base.
Decoding the human genome originally took 10 years to process; now it can be achieved in one
week.
According to Gartner, the definition of Big Data –
“Big data is high-volume, high-velocity, and high-variety information assets that demand cost-
effective, innovative forms of information processing for enhanced insight and decision
making.”
Big Data refers to complex and large data sets that have to be processed and analyzed to
uncover valuable information that can benefit businesses and organizations. However, there are
certain basic tenets of Big Data that will make it even simpler to answer what
is Big Data:
•It refers to a massive amount of data that keeps on growing exponentially with time.
•Big data is a term applied to data sets whose size or type is beyond the ability of traditional
relational databases to capture, manage and process the data with low
latency.
•It includes data mining, data storage, data analysis, data sharing, and data visualization.
•The term is an all-encompassing one, including data, data frameworks, and the tools
and techniques used to process and analyze the data.
DIGITAL DATA
Digital data is information stored on a computer system as a series of 0’s and 1’s in a binary
language. Digital data jumps from one value to the next in a step by step sequence.
Example: Whenever we send an email, read a social media post, or take pictures with our
digital camera, we are working with digital data.
Digital data can be classified into three main forms (structured, unstructured, and
semi-structured), with quasi-structured data sometimes treated as a fourth:
a) Structured
Structured data is one of the types of big data; by structured data, we mean data that can be
processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored in and accessed from a database by simple search engine
algorithms. Relationships exist between entities of data, such as classes and their
objects. Structured data is usually stored in well-defined columns and databases.
– Structured Schema
– Tables with rows and columns of data
– Example: DBMS, RDBMS
For instance, the employee table in a company database will be structured as the
employee details, their job positions, their salaries, etc., will be present in an organized
manner.
b) Unstructured
Unstructured data refers to data that lacks any specific form or structure and does not follow
the formal structural rules of data models. It does not even have a consistent format, and it
varies all the time. This makes it very difficult and time-consuming to process and
analyze unstructured data. About 80-90% of an organization's data is in this format.
Example: Memos, chat rooms, PowerPoint presentations, images, videos, letters, research
papers, white papers, the body of an email, etc.
c) Semi-structured
It is considered another form of structured data. It inherits a few properties of structured data,
but major parts of this kind of data lack a definitive structure and do not obey
the formal structure of data models such as RDBMS. To be precise, it refers to data that,
although it has not been classified under a particular repository (database), still contains vital
information or tags that segregate individual elements within the data. However, it is not in a
form that can be used easily by a computer program.
Example : Emails, XML, markup languages like HTML, etc.
Metadata for this data is available but is not sufficient.
– Schema is not defined properly
– JSON, XML, CSV, RSS
– Ex: Transactional history file, Log file
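To make this concrete, here is a small illustrative sketch in Python (the record and its field
names are invented): a semi-structured JSON document carries tags that label each element,
but there is no fixed schema, so nested and optional fields must be handled explicitly.

import json

# A hypothetical semi-structured record: the keys tag each element, but there
# is no fixed schema; "address" is nested and "phone" may be absent entirely.
record = '{"name": "Asha", "email": "asha@example.com", "address": {"city": "Kanpur"}}'

doc = json.loads(record)           # parse the JSON text into a Python dict
print(doc["name"])                 # individual elements are reachable via their tags
print(doc.get("phone", "absent"))  # a missing field must be handled explicitly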
d) Quasi-structured Data: This format contains textual data with inconsistent formats that
can be structured only with time, effort, and specialized tools. Example: web server
logs, i.e., log files created and maintained by a server that contain a list
of activities.
Introduction to Big Data platform
A big data platform is a type of IT solution that combines the features and capabilities of
several big data applications and utilities within a single solution, which is then used for
managing as well as analyzing Big Data.
It focuses on providing its users with efficient analytics tools for massive datasets.
The users of such platforms can custom-build applications according to their use case, for
example to calculate customer loyalty (an e-commerce use case), and so on.
Goal: The main goal of a Big Data Platform is to achieve: Scalability, Availability,
Performance, and Security.
Basics of Big Data Platform
• A Big Data platform is an integrated IT solution which combines several Big Data tools and
utilities into one packaged solution for managing and analyzing Big Data.
• It is an enterprise-class IT platform that enables organizations to develop, deploy,
operate, and manage a big data infrastructure/environment.
• There are several open-source and commercial Big Data platforms on the market with varied
features which can be used in a Big Data environment.
• A big data platform generally consists of big data storage, servers, databases, big data
management, business intelligence, and other big data management utilities.
• It also supports custom development, querying, and integration with other systems.
• The primary benefit of a big data platform is to reduce the complexity of multiple
vendors/solutions into one cohesive solution.
• Big data platforms are also delivered through the cloud, where the provider offers an all-
inclusive set of big data solutions and services.
Big data architecture refers to the logical and physical structure that dictates how high volumes
of data are ingested, processed, stored, managed, and accessed. Big data architecture is designed
to handle the ingestion, processing, and analysis of data that is too large or complex for
traditional database systems.
Layers in BIG DATA Architecture
Data sources: All big data solutions start with one or more data sources.
Example,
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.
VOLUME
Volume is one of the characteristics of big data. Volume refers to the unimaginable amounts
of information generated every second. Data storage has grown exponentially, because data is
now much more than text data: it can be found in the form of videos, music, and large
images on our social media channels. It is now common for enterprises to have terabytes and
even petabytes of storage. As the database grows, the applications and architecture built
to support the data need to be re-evaluated quite often.
Data has grown more in the past few years
than in the previous few decades. Social media, web portals, and real-time sensor data have
increased the amount of data.
For example, Facebook alone generates billions of messages, records the “like” button being
pressed about 4.5 billion times, and receives over 350 million new posts each day. Such a huge
amount of data can only be handled by Big Data technologies.
Sometimes the same data is re-evaluated from multiple angles, and even though the original
data is the same, the newfound intelligence creates an explosion of the data. This big volume
indeed represents Big Data.
Distributed systems are now used to store data in several locations, brought
together by a software framework like Hadoop.
VELOCITY
Velocity essentially refers to the speed at which data is being created in real time. In a
broader perspective, it comprises the rate of change, the linking of incoming data sets at varying
speeds, and activity bursts. The data growth and social media explosion have changed how
we look at data. There was a time when we used to believe that yesterday's data was
recent. As a matter of fact, newspapers still follow that logic. However, news
channels and radio have changed how fast we receive the news.
Today, people rely on social media to update them with the latest happenings. On social
media, a message that is only a few seconds old (a tweet, a status update, etc.) sometimes no
longer interests users.
They often discard old messages and pay attention to recent updates. Data movement is
now almost real time, and the update window has reduced to fractions of a second. This
high-velocity data represents Big Data.
Example of data that is generated with high velocity - Twitter messages or Facebook posts.
VARIETY
Data can be stored in multiple formats, for example in a database, Excel, CSV, Access, or a
simple text file. Sometimes the data is not even in a traditional format; it may be in the form
of video, SMS, PDF, or something different. It is the organization's need to arrange it and
make it meaningful. This would be easy if all the data were in the same format, but that is
not the case most of the time. The real world has data in many different formats, and that is the
challenge we need to overcome with Big Data. This variety of data represents Big Data.
Variety of Big Data refers to structured, unstructured, and semi-structured data that is
gathered from multiple sources. While in the past, data could only be collected from
spreadsheets and databases, today data comes in an array of forms such as emails, PDFs,
photos, videos, audio, social media posts, and so much more. Variety is one of the important
characteristics of big data.
VERACITY
Data veracity, in general, is how accurate or truthful a data set may be. More specifically, when
it comes to the accuracy of big data, it’s not just the quality of the data itself but how
trustworthy the data source, type, and processing of it is.
The quality of captured data can vary greatly, affecting accurate analysis.
Example: Facebook posts with hashtags.
1. Ingestion :
The ingestion layer is the very first step of pulling in raw data.
It comes from internal sources, relational databases, non-relational databases, social media,
emails, phone calls etc.
Most high-performance platforms are created by connecting multiple nodes together via a
variety of network topologies.
The general architecture distinguishes between the management of computing resources (and
the corresponding allocation of tasks) and the management of the data across the network of
storage nodes. This is generally called a master/slave
architecture.
In this configuration, a master job manager oversees the pool of processing nodes, assigns
tasks, and monitors the activity. At the same time, a storage manager oversees the data storage
pool and distributes datasets across the collection of storage resources. While there is no a priori
requirement that data and processing tasks be colocated, it is beneficial from a
performance perspective to ensure that threads process data that is local, or close by, to
minimize the costs of data access latency.
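The following toy Python sketch illustrates this master/worker division of labour (it is a local
simulation, not the code of any real platform): the master partitions a dataset, a pool of
workers processes the parts in parallel, and the master aggregates the results.

from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker counts the words in its own local chunk of the data.
    return sum(len(line.split()) for line in chunk)

if __name__ == "__main__":
    data = ["big data needs parallelism"] * 1000
    chunks = [data[i::4] for i in range(4)]   # the master partitions the dataset
    with Pool(processes=4) as pool:           # the master assigns tasks to workers
        totals = pool.map(process_chunk, chunks)
    print(sum(totals))                        # the master aggregates the results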
1. APACHE HADOOP
Apache Hadoop is one of the main supporting elements in Big Data technologies. It simplifies
the processing of large amounts of structured or unstructured data in a cheap manner. Hadoop
is an open-source project from Apache that has been improving continuously over the years.
Hadoop is basically a set of software libraries and frameworks for managing and processing big
amounts of data, from a single server up to thousands of machines. It provides an efficient and
powerful error-detection mechanism based on the application layer rather than relying upon
hardware.
2. MAPREDUCE
MapReduce was introduced by Google to create large web search indexes. It is
basically a framework for writing applications that process large amounts of structured or
unstructured data over the web. MapReduce takes the query and breaks it into parts to run on
multiple nodes. Through distributed query processing it makes it easy to maintain large amounts
of data by dividing the data across several different machines. Hadoop MapReduce is a software
framework for easily writing applications that manage large data sets in a highly
fault-tolerant manner.
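To make the map and reduce phases concrete, here is a minimal, self-contained word-count
sketch in Python. It runs locally and only imitates the two phases; a real Hadoop job would
distribute the same two functions across many nodes.

from itertools import groupby

def mapper(lines):
    # Map phase: emit a (word, 1) pair for every word in the input split.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped by key; sum the counts per word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    text = ["big data big insight", "data drives insight"]
    for word, count in reducer(mapper(text)):
        print(word, count)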
3. HDFS
HDFS is a Java-based file system that is used to store structured or unstructured data over large
clusters of distributed servers. The data stored in HDFS has no restriction or rule to be applied;
the data can be either fully unstructured or purely structured. In HDFS, the work of making data
meaningful is done by the developer's code only. The Hadoop distributed file system provides a
highly fault-tolerant environment with deployment on low-cost hardware machines. HDFS is now
a part of the Apache Hadoop project.
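As a hedged illustration of day-to-day HDFS use, the standard hdfs dfs shell commands can be
driven from Python as below. The file and directory names are invented, and a running Hadoop
installation with the hdfs command on the PATH is assumed.

import subprocess

def hdfs(*args):
    # Run an `hdfs dfs` subcommand and return its textual output.
    return subprocess.run(["hdfs", "dfs", *args],
                          capture_output=True, text=True, check=True).stdout

hdfs("-mkdir", "-p", "/user/demo")             # create a directory in HDFS
hdfs("-put", "local_logs.txt", "/user/demo")   # copy a local file into HDFS
print(hdfs("-ls", "/user/demo"))               # list the directory contents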
4. HIVE
Hive was originally developed by Facebook and has been open source for some time. Hive
works something like a bridge between SQL and Hadoop; it is basically used to run SQL
queries on Hadoop clusters. Apache Hive is basically a data warehouse that provides ad-hoc
querying, data summarization, and analysis of huge data sets stored in Hadoop-compatible file
systems. Hive provides an SQL-like query language called HiveQL for working with huge
amounts of data stored in Hadoop clusters. In January 2013 Apache released Hive 0.10.0; more
information and an installation guide can be found in the Apache Hive documentation.
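As an illustrative sketch, a HiveQL query can be issued from Python roughly as below. This
assumes the third-party PyHive package and a running HiveServer2, and the page_views table is
invented; the point is only that HiveQL reads like SQL while Hive turns it into jobs on the
Hadoop cluster.

from pyhive import hive  # third-party package; HiveServer2 must be reachable

conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()
cursor.execute(
    "SELECT page, COUNT(*) AS hits "
    "FROM page_views GROUP BY page ORDER BY hits DESC LIMIT 10"
)
for page, hits in cursor.fetchall():
    print(page, hits)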
5. PIG
Pig was introduced by Yahoo and later made fully open source. It also provides a
bridge to query data over Hadoop clusters, but unlike Hive it offers a scripting approach
to make Hadoop data accessible to developers and business users. Apache Pig provides a
high-level programming platform for developers to process and analyze Big Data using user-
defined functions and programming effort. In January 2013 Apache released Pig 0.10.1, which
is defined for use with Hadoop 0.10.1 or later releases. More information and an installation
guide can be found in the Apache Pig Getting Started documentation.
BIG DATA USE CASES
Big data techniques can be used to deliver business benefits by increasing the value of
an organization. Big data is beneficial in many applications, and in general the following are
the common categories. The list is derived from The Apache Software Foundation's "Powered By
Hadoop" web site.
•Business intelligence, querying, reporting, and searching, including many implementations of
searching, filtering, indexing, speeding up aggregation for reporting and report generation,
trend analysis, search optimization, and general information retrieval.
•Improved performance for common data management operations, with the majority focusing
on log storage, data storage and archiving, followed by sorting, running joins,
extraction/transformation/ loading (ETL) processing, other types of data conversions, as well as
duplicate analysis and elimination.
•Non-database applications, such as image processing, text processing in preparation for
publishing, genome sequencing, protein sequencing and structure prediction, web crawling, and
monitoring workflow processes.
•Data mining and analytical applications, including social network analysis, facial
recognition, profile matching, other types of text analytics, web mining, machine learning,
information extraction, personalization and recommendation analysis, ad optimization, and
behavior analysis.
In turn, the core capabilities that are implemented using the big data application can be further
abstracted into more fundamental categories:
•Counting functions applied to large bodies of data that can be segmented and distributed
among a pool of computing and storage resources, such as document indexing, concept
filtering, and aggregation (counts and sums).
•Scanning functions that can be broken up into parallel threads, such as sorting, data
transformations, semantic text analysis, pattern recognition, and searching.
•Modeling capabilities for analysis and prediction.
•Storing large datasets while providing relatively rapid access.
Generally, processing applications can combine these core capabilities in different ways. In
today's world big data has several applications, some of which are listed below:
a) Hadoop
• Hadoop is an open-source, Java-based programming framework and server software which is
used to store and analyze data with the help of hundreds or even thousands of commodity servers
in a clustered environment.
• Hadoop is designed to store and process large datasets extremely fast and in a fault-tolerant
way.
• Hadoop uses HDFS (Hadoop Distributed File System) for storing data on clusters of commodity
computers. If any server goes down, it knows how to replicate the data, and there is no loss of
data even on hardware failure.
• Hadoop is an Apache-sponsored project, and it consists of many software packages which run
on top of the Apache Hadoop system.
• Hadoop provides a set of tools and software that forms the backbone of a Big Data analytics
system.
• The Hadoop ecosystem provides the necessary tools and software for handling and analyzing
Big Data.
• On top of the Hadoop system many applications can be developed and plugged in to
provide an ideal solution for Big Data needs.
Cloudera
• Cloudera is one of the first commercial Hadoop-based Big Data analytics platforms
offering Big Data solutions.
• Its product range includes Cloudera Analytic DB, Cloudera Operational DB, Cloudera
Data Science & Engineering, and Cloudera Essentials.
• All these products are based on Apache Hadoop and provide real-time processing
and analytics of massive data sets.
Website: https://www.cloudera.com
Apache Hadoop
• Apache Hadoop is a Big Data platform and software package which is an Apache-sponsored
project.
• Under the Apache Hadoop project, various other software packages are being developed which
run on top of the Hadoop system to provide enterprise-grade data management and analytics
solutions to enterprises.
• Apache Hadoop provides an open-source, distributed file system and a data processing and
analysis engine for analyzing large sets of data.
• Hadoop can run on Windows, Linux, and OS X operating systems, but it is mostly used on
Ubuntu and other Linux variants.
MapReduce
• The MapReduce engine was originally written by Google, and it is the system which enables
developers to write programs which can run in parallel on hundreds or even thousands of
computer nodes to process vast data sets.
• After processing the job on the different nodes, it combines the results and returns them to the
program which executed the MapReduce job.
• This software is platform independent and runs on top of the Hadoop ecosystem. It can
process tremendous data at very high speed in a Big Data environment.
Apache Storm
• Apache Storm is software for real-time computation and distributed processing.
• It is free and open-source software developed at the Apache Software Foundation. It is a
real-time, parallel processing engine.
• Apache Storm is highly scalable and fault-tolerant, and it supports almost all
programming languages.
Apache Storm can be used in:
• Realtime analytics
• Online machine learning
• Continuous computation
• Distributed RPC
• ETL
• And all other places where real-time processing is required.
Apache Storm is used by Yahoo, Twitter, Spotify, Yelp, Flipboard, and many other data giants.
Website: http://storm.apache.org/
Apache Spark
• Apache Spark is software that runs on top of Hadoop and provides an API for real-time, in-
memory processing and analysis of large sets of data stored in HDFS.
• It stores data in memory for faster processing.
• Apache Spark runs programs up to 100 times faster in memory and 10 times faster on disk as
compared to MapReduce.
• Apache Spark exists to speed up the processing and analysis of big data sets in a Big Data
environment.
• Apache Spark is being adopted very fast by businesses to analyze their data sets and get real
value from their data (a minimal word-count sketch follows this list).
• Website: http://spark.apache.org/
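The sketch below is a minimal PySpark word count. The input path is illustrative, and a local
installation of the pyspark package is assumed; the same code scales from a laptop to a cluster.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("hdfs:///user/demo/local_logs.txt")
          .flatMap(lambda line: line.split())   # map: one record per word
          .map(lambda word: (word, 1))          # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))     # reduce: sum the counts per word

print(counts.take(10))  # transformations are lazy; take() triggers the computation
spark.stop()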
SAMOA
• SAMOA stands for Scalable Advanced Massive Online Analysis.
• It is a system for mining Big Data streams.
• SAMOA is open-source software distributed on GitHub, which can also be used as a
distributed machine learning framework.
• Website: https://github.com/yahoo/samoa
Thus, the Big Data industry has been growing very fast since 2017, and companies are rapidly
moving their data to Big Data platforms. There is a huge requirement for Big Data skills in the
job market.
CHALLENGES OF CONVENTIONAL SYSTEMS
Conventional Systems
Conventional systems are traditional data management systems, typically built around relational
databases, that store and process structured data on a single server or a small cluster.
• Big data is a huge amount of data which is beyond the processing capacity of conventional
database systems to manage and analyze within a specific time interval.
Difference between conventional computing and intelligent computing
• Conventional computing functions logically with a set of rules and calculations, while
neural computing can function via images, pictures, and concepts.
• Conventional computing is often unable to manage the variability of data obtained in the real
world. On the other hand, neural computing, like our own brains, is well suited to situations that
have no clear algorithmic solution and is able to manage noisy, imprecise data. This allows
it to excel in those areas that conventional computing often finds difficult.
Comparison of Big Data with Conventional Data

Big Data | Conventional Data
Huge data sets. | Data set size in control.
Unstructured data such as text, video, and audio. | Normally structured data such as numbers and categories, but it can take other forms as well.
Hard-to-perform queries and analysis. | Relatively easy-to-perform queries and analysis.
Needs a new methodology for analysis. | Data analysis can be achieved by using conventional methods.
Needs tools such as Hadoop, Hive, HBase, Pig, Sqoop, and so on. | Tools such as SQL, SAS, R, and Excel alone may be sufficient.
Raw transactional data. | The aggregated, sampled, or filtered data.
Used for reporting, advanced analysis, and predictive modeling. | Used for reporting, basic analysis, and text mining; advanced analytics is only in a starting stage for big data.
Needs both programming skills (such as Java) and analytical skills to perform analysis. | Analytical skills are sufficient; advanced analysis tools don't require expert programming skills.
Petabytes/exabytes of data; millions/billions of accounts; billions/trillions of transactions. | Megabytes/gigabytes of data; thousands/millions of accounts; millions of transactions.
Generated by big financial institutions, Facebook, Google, Amazon, eBay, Walmart, and more. | Generated by small enterprises and small banks.
The following challenges dominate in the case of conventional systems in real-time
scenarios:
1) Uncertainty of the data management landscape
2) The Big Data talent gap that exists in the industry
3) Getting data into the big data platform
4) Need for synchronization across data sources
5) Getting important insights through the use of Big Data analytics
We have been using table- and row-based relational databases for years; these databases are just
fine with online transactions and quick updates. When unstructured and large amounts of data
come into the picture, we need databases without a hard-coded schema
attachment. There are a number of databases that fit into this category; these databases can store
unstructured, semi-structured, or even fully structured data.
Apart from other benefits, the finest thing about schema-less databases is that they make data
migration very easy. MongoDB is a very popular and widely used NoSQL database these days.
NoSQL and schema-less databases are used when the primary concern is to store a huge amount
of data and not to maintain relationships between elements. "NoSQL (Not Only SQL) is a type of
database that does not primarily rely upon a schema-based structure and does not use SQL for
data processing."
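A minimal pymongo sketch (assuming a MongoDB server on localhost; the database, collection,
and documents are invented) shows what schema-less storage looks like in practice: two
documents in the same collection can have different fields, with no table definition required.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["demo_db"]["events"]   # database and collection names are invented

events.insert_one({"user": "u1", "action": "login"})
events.insert_one({"user": "u2", "action": "purchase", "amount": 499.0})

for doc in events.find({"user": "u2"}):
    print(doc)   # no schema declaration or ALTER TABLE was ever needed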
The traditional approach works on structured data that has a basic layout and a defined
structure.
The structured approach designs the database as per the requirements, in tuples and columns.
Live incoming data, which can be input from an ever-changing scenario,
cannot be dealt with in the traditional approach. The Big Data approach is iterative.
INTELLIGENT DATA ANALYSIS
Intelligent Data Analysis (IDA) is one of the hot issues in the field of
artificial intelligence and information science. It is also one of the most
important approaches in the field of data mining.
Based on the basic principles of IDA and the features of the datasets that IDA handles, the
development of IDA can be briefly summarized from three aspects:
• Algorithm principle
• The scale
• Type of the dataset
IDA is
… an interdisciplinary study concerned with the effective analysis of data;
… used for extracting useful information from large quantities of online data, and for extracting
desirable knowledge or interesting patterns from existing databases.
The goal of intelligent data analysis is to extract useful knowledge; the process demands a
combination of extraction, analysis, conversion, classification, organization, reasoning, and so
on.
Why IDA?
➢ Decision making asks for information and knowledge
➢ Data processing can provide them
➢ The multidimensionality of problems calls for methods of adequate and deep data
processing and analysis
Based on machine learning, artificial intelligence, pattern recognition, and record and
visualization technology, IDA helps to obtain useful information, necessary data, and interesting
models from the large amounts of data available online in order to make the right choices.
IDA includes three stages:
(1) Preparation of data
(2) Data mining
(3) Data validation and Explanation
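A compact Python sketch of the three stages, using scikit-learn's bundled iris sample data (the
model choice is arbitrary and serves only to illustrate the pipeline):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# (1) Preparation of data: load it, then split into training and validation sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# (2) Data mining: fit a model that extracts patterns (here, decision rules).
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# (3) Data validation and explanation: check the patterns on held-out data.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))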
Knowledge Acquisition
The process of eliciting, analyzing, transforming, classifying, organizing and integrating
knowledge and representing that knowledge in a form that can be used in a computer system.
Knowledge in a domain can be expressed as a number of rules
A Rule :
A formal way of specifying a recommendation, directive, or strategy, expressed as "IF premise
THEN conclusion" or "IF condition THEN action".
How to discover rules hidden in the data?
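A hand-written rule of this form might look as follows in Python; the attribute names and
thresholds are invented for illustration. Rule-mining algorithms (for example, decision tree
learners) discover comparable rules automatically from data rather than having an expert write
them.

def credit_rule(record):
    # IF income is high AND debt is low THEN approve, ELSE send for review.
    if record["income"] > 50000 and record["debt"] < 10000:
        return "approve"
    return "review"

print(credit_rule({"income": 72000, "debt": 4000}))    # -> approve
print(credit_rule({"income": 30000, "debt": 12000}))   # -> review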
NATURE OF DATA
Nominal
These are the set of values that don’t possess a natural ordering. For example: The color of a
smartphone can be considered as a nominal data type as we can’t compare one color with others.
It is not possible to state that 'Red' is greater than 'Blue'. The gender of a person is another one,
where we can't differentiate between male, female, or others. Mobile phone categories, whether
midrange, budget segment, or premium, are also a nominal data type.
Ordinal
These types of values have a natural ordering while maintaining their class of values. If we
consider the size of a clothing brand then we can easily sort them according to their name tag in
the order of small < medium < large. The grading system while marking candidates in a test can
also be considered as an ordinal data type where A+ is definitely better than B grade.
These categories help us decide which encoding strategy can be applied to which type of data.
Data encoding for qualitative data is important because machine learning models can't handle
these values directly; they need to be converted to numerical types, since the models are
mathematical in nature.
For the nominal data type, where there is no comparison among the categories, one-hot encoding
can be applied, which is similar to binary coding provided the categories are few in number; for
the ordinal data type, label encoding can be applied, which is a form of integer encoding.
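A minimal pandas sketch of both encodings, on an invented toy dataset:

import pandas as pd

df = pd.DataFrame({
    "color":   ["red", "blue", "red"],            # nominal: no natural order
    "segment": ["budget", "premium", "midrange"]  # ordinal: budget < midrange < premium
})

one_hot = pd.get_dummies(df["color"], prefix="color")   # one column per category
order = {"budget": 0, "midrange": 1, "premium": 2}
df["segment_encoded"] = df["segment"].map(order)        # order-preserving integers

print(pd.concat([df, one_hot], axis=1))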
Discrete
Numerical values which are integers or whole numbers are placed under this
category. The number of speakers in the phone, the number of cameras, the number of cores in
the processor, and the number of SIMs supported are some examples of the discrete data type.
Continuous
Fractional numbers are considered continuous values. These can take the form of the
operating frequency of the processor, the Android version of the phone, the Wi-Fi frequency,
the temperature of the cores, and so on.
Different categorizations of data are possible. The first such categorization may be on the basis
of disciplines, e.g., Sciences, Social Sciences, etc. in which they are generated. Within each of
these fields, there may be several ways in which data can be categorized into types.
There are four types of data:
•Nominal
•Ordinal
•Interval
•Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.
Nominal Scales
Nominal scales measure categories and have the following characteristics:
•Order: The order of the responses or observations does not matter.
•Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not the
same as a 2 and 3.
•True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.
Appropriate statistics for nominal scales: mode, count, frequencies
Displays: histograms or bar charts
Ordinal Scales
Characteristics for ordinal scales are:
•Order: The order of the responses or observations matters.
•Distance: Ordinal scales do not hold distance. The distance between first and second is
unknown as is the distance between first and third along with all observations.
•True Zero: There is no true or real zero. An item, observation, or category cannot be ranked
zero.
Appropriate statistics for ordinal scales: count, frequencies, mode
Displays: histograms or bar charts
Interval Scales
Interval scales provide insight into the variability of the observations or data. Classic interval
scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light). In an interval scale, users could respond
to “I enjoy opening links to the website from a company email” with a response ranging on a
scale of values.
The characteristics of interval scales are:
•Order: The order of the responses or observations does matter.
•Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears the
same as from 4 to 5, so we can meaningfully add and subtract values on the
scale.
•True Zero: There is no true zero with interval scales. However, data can be rescaled in a
manner that contains zero. An interval scale measure from 1 to 9 remains the same as
11 to 19, because we added 10 to all values. Similarly, a 1 to 9 interval scale is the same
as a -4 to 4 scale, because we subtracted 5 from all values. Although the new scale contains
zero, zero remains uninterpretable, because it only appears in the scale from the
transformation. For the same reason, ratio statements such as "six is twice as much as
three" are not meaningful on an interval scale.
Appropriate statistics for interval scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
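For instance, these statistics can be computed with NumPy and SciPy on an invented set of
1-to-9 Likert-style responses:

import numpy as np
from scipy.stats import skew, kurtosis

responses = np.array([2, 4, 4, 5, 5, 5, 6, 7, 8, 9])

print("mean:", responses.mean())
print("median:", np.median(responses))
print("std dev:", responses.std(ddof=1))   # sample standard deviation
print("skewness:", skew(responses))
print("kurtosis:", kurtosis(responses))    # excess kurtosis by default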
Ratio Scales
Ratio scales appear as interval scales with a true zero.
They have the following characteristics:
•Order: The order of the responses or observations matters.
•Distance: Ratio scales do have an interpretable distance.
•True Zero: There is a true zero.
Income is a classic example of a ratio scale:
•Order is established. We would all prefer $100 to $1!
•Zero dollars means we have no income (or, in accounting terms, our revenue exactly
equals our expenses!)
•Distance is interpretable, in that $20 appears as twice $10 and $50 is half of $100.
For the web analyst, the statistics for ratio scales are the same as for interval scales.
Appropriate statistics for ratio scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
The table below summarizes the characteristics of all four types of scales.

Scale | Order matters | Distance interpretable | True zero
Nominal | No | No | No
Ordinal | Yes | No | No
Interval | Yes | Yes | No
Ratio | Yes | Yes | Yes
Big Data Analytics
Big data analytics is a complex process of examining big data to uncover information, such as
hidden patterns, correlations, market trends, and customer preferences. This can help
organizations make informed business decisions. Data analytics technologies and techniques
give organizations a way to analyze data sets and gather new information.
Big Data Analytics enables enterprises to analyze their data in full context quickly and some also
offer real-time analysis.
* Apache Spark is a powerful open source big data analytics tool. It offers over 80 high-level
operators that make it easy to build parallel apps. It is used at a wide range of organizations to
process large datasets.
Features:
•It helps run an application in a Hadoop cluster, up to 100 times faster in memory and ten times
faster on disk
•It offers lightning-fast processing
•Support for Sophisticated Analytics
•Ability to Integrate with Hadoop and Existing Hadoop Data
* Lumify is a big data fusion, analysis, and visualization platform. It helps users to discover
connections and explore relationships in their data via a suite of analytic options.
Features:
•It provides both 2D and 3D graph visualizations with a variety of automatic layouts
•It provides a variety of options for analyzing the links between entities on the graph
•It comes with specific ingest processing and interface elements for textual content, images, and
videos
•Its spaces feature allows you to organize work into a set of projects, or workspaces
•It is built on proven, scalable big data technologies
* IBM SPSS Modeler is a predictive big data analytics platform. It offers predictive models and
delivers to individuals, groups, systems and the enterprise. It has a range of advanced algorithms
and analysis techniques.
Features:
•Discover insights and solve problems faster by analyzing structured and unstructured data
•Use an intuitive interface for everyone to learn
•You can select from on-premises, cloud and hybrid deployment options
•Quickly choose the best performing algorithm based on model performance
ANALYSIS AND REPORTING
What is Analysis?
Analysis is the process of taking the organized data and analyzing it.
• This helps users to gain valuable insights into how businesses can improve their
performance.
• Analysis transforms data and information into insights.
• The goal of the analysis is to answer questions by interpreting the data at a deeper level
and providing actionable recommendations.
What is Reporting?
• Reporting is the process of organizing data into informational summaries in order to
monitor how different areas of a business are performing. Reporting shows what has
happened, while analysis explains why it happened and what can be done about it.
BIG DATA PRIVACY AND ETHICS
Big data analytics raises a number of ethical issues, especially as companies begin monetizing
their data externally for purposes different from those for which the data was initially collected.
The scale and ease with which analytics can be conducted today completely changes the ethical
framework. The following principles can be applied to avoid any unethical practices.
1. Ownership: The first principle of data ethics is that an individual has ownership over their
personal information so it’s unlawful and unethical to collect someone’s personal data
without their consent. Some common ways you can obtain consent are through signed
written agreements, digital privacy policies that ask users to agree to a company’s terms and
conditions, and pop-ups with checkboxes that permit websites to track users’ online
behavior with cookies.
2. Private customer data and identity should remain private: Another ethical responsibility
that comes with handling data is ensuring data subjects’ privacy. Even if a customer gives
your company consent to collect, store, and analyze their personally identifiable information
(PII), that doesn’t mean they want it publicly available. Privacy does not mean secrecy, as
private data might need to be audited based on legal requirements, but that private data
obtained from a person with their consent should not be exposed for use by other businesses
or individuals with any traces to their identity. PII is any information linked to an
individual’s identity. Some examples of PII include:
▪ Full name
▪ Birthdate
▪ Street address
▪ Phone number
▪ Social Security card
▪ Credit card information
▪ Bank account number
▪ Passport number
3. Shared private information should be treated confidentially: Third party companies share
sensitive data — medical, financial or locational — and need to have restrictions on whether
and how that information can be shared further.
4. Customers should have a transparent view of how their data is being collected, stored, used,
or sold, and the ability to manage the flow of their private information across massive, third-
party analytical systems.
5. Big Data should not interfere with human will: Big data analytics can moderate and even
determine who we are before we make up our own minds. Companies need to begin to think
about the kind of predictions and inferences that should be allowed and the ones that should
not.
6. Big data should not institutionalize unfair biases like racism or sexism. Machine learning
algorithms can absorb unconscious biases in a population and amplify them via training
samples.