
CONCEPT OF BIG DATA
Contents

1. Introduction
2. Batch processing
3. Cluster computing
4. Data warehouse
5. Data lake
6. Data mining
7. Hadoop
8. In-Memory computing
9. Machine learning
Introduction:

What is big data?


Big data refers to massive, complex, structured and unstructured data sets that are rapidly generated and transmitted from a wide variety of sources.
These attributes make up the three Vs of big data:
1. Volume: The huge amounts of data being stored.
2. Velocity: The lightning speed at which data streams must be processed and analysed.
3. Variety: The different sources and forms from which data is collected, such as numbers, text, video, images and audio.
Big data is classified in three ways:
• Structured Data
• Unstructured Data
• Semi-Structured Data

Structured Data
Structured data is the easiest to work with. It is highly organized with dimensions defined
by set parameters.
Think spreadsheets; every piece of information is grouped into rows and columns. Specific
elements defined by certain variables are easily discoverable.
It’s all your quantitative data.
• Age
• Billing
• Contact
• Address
• Expenses
• Debit/credit card numbers
Because structured data is already expressed as tangible numbers and labelled fields, it is much easier for a program to sort through and collect.
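As a rough, hypothetical illustration of how easily a program can work with structured data, the short Python sketch below uses pandas; the table, column names and values are all invented for the example.

```python
import pandas as pd

# Hypothetical structured billing data: every value sits in a named column.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age": [34, 52, 28],
    "city": ["Pune", "Delhi", "Mumbai"],
    "monthly_expenses": [220.0, 410.5, 150.75],
})

# Because the structure is known in advance, queries are simple to express.
over_200 = customers[customers["monthly_expenses"] > 200]
print(over_200[["customer_id", "city"]])
print("Average age:", customers["age"].mean())
```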
Unstructured:
Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it comes to processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos and so on. Organizations today have a wealth of data available to them, but unfortunately they often do not know how to derive value from it, since this data is in its raw, unstructured format.
The hardest part of analysing unstructured data is teaching an application to understand the information it is extracting. This means translating it into some form of structured data. Almost universally, this involves a complex algorithm blending scanning, interpreting and contextualizing functions.
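To make the idea of turning raw text into structured data concrete, here is a minimal, hypothetical Python sketch; the sample text and regular expressions are invented, and real systems use far more sophisticated scanning and contextualizing logic.

```python
import re

# Hypothetical raw, unstructured feedback text (free-form, no schema).
raw_text = """
Ordered on 2024-03-12, arrived late. Contact me at jane.doe@example.com.
Great product overall! Second order placed 2024-04-02 by mark@example.org.
"""

# "Scanning and interpreting": pull out recognizable fragments with patterns.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", raw_text)
dates = re.findall(r"\d{4}-\d{2}-\d{2}", raw_text)

# "Contextualizing": place the fragments into structured records.
structured = [{"email": e, "order_date": d} for e, d in zip(emails, dates)]
print(structured)
```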

Semi-structured:
Semi-structured data straddles the line between structured and unstructured. Most of the time, this translates to unstructured data with metadata attached to it. This can be inherent data collected, such as a time, location, device ID stamp or email address, or it can be a semantic tag attached to the data later.
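A small, hypothetical example in Python shows what this looks like in practice: a free-form message body wrapped in structured metadata fields (all field names are invented for illustration).

```python
import json

# Hypothetical semi-structured record: free-form content plus metadata tags.
record = {
    "device_id": "sensor-42",             # inherent metadata
    "timestamp": "2024-05-01T10:15:00Z",  # inherent metadata
    "location": "warehouse-3",
    "tags": ["maintenance", "urgent"],    # semantic tags added later
    "body": "Compressor making a rattling noise since this morning.",  # unstructured
}

encoded = json.dumps(record, indent=2)
print(encoded)

# The metadata fields can be queried like structured data,
# while the "body" still needs text-mining techniques.
print(json.loads(encoded)["device_id"])
```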
Batch processing:

What is batch processing?


In batch processing, we wait for a certain amount of raw data to “pile up” before running an ETL job. Typically this means the data is between an hour and a few days old before it is made available for analysis. Batch ETL jobs will typically run on a set schedule (e.g. every 24 hours), or in some cases once the amount of data reaches a certain threshold.
When to use batch processing?
By definition, batch processing entails latencies between the time data appears
in the storage layer and the time it is available in analytics or reporting tools.
However, this is not necessarily a major issue, and we might choose to accept
these latencies because we prefer working with batch processing frameworks.
For example, if we’re trying to analyze the correlation between SaaS license
renewals and customer support tickets, we might want to join a table from our
CRM with one from our ticketing system. If that join happens once a day rather
than the second a ticket is resolved, it probably won’t make much of a difference.
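A minimal sketch of such a daily batch join, assuming the CRM and ticketing extracts land as CSV files sharing an account_id column; the file names, columns and schedule are assumptions for illustration, not from the original text.

```python
import pandas as pd

# Hypothetical daily batch job: join CRM renewals with support-ticket counts.
def daily_license_ticket_report(crm_path: str, tickets_path: str) -> pd.DataFrame:
    renewals = pd.read_csv(crm_path)     # e.g. columns: account_id, renewed
    tickets = pd.read_csv(tickets_path)  # e.g. columns: account_id, ticket_id

    ticket_counts = (
        tickets.groupby("account_id")
        .size()
        .rename("ticket_count")
        .reset_index()
    )
    # The join runs once per batch window, not the second a ticket is resolved.
    return renewals.merge(ticket_counts, on="account_id", how="left").fillna(0)

# A scheduler (cron, Airflow, etc.) would invoke this roughly every 24 hours:
# report = daily_license_ticket_report("crm_renewals.csv", "support_tickets.csv")
```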

To generalize, you should lean towards batch processing when:


• Data freshness is not a mission-critical issue.
• You are working with large datasets and are running a complex algorithm that
requires access to the entire batch – e.g., sorting the entire dataset.
• You get access to the data in batches rather than in streams.
• You are joining tables in relational databases.
Cluster computing:
Cluster computing provides a number of advantages over conventional custom-made parallel computers for achieving performance greater than that typical of uniprocessors. As a consequence, the emergence of clusters has greatly extended the availability of high-performance processing to a much broader community and advanced its impact through new opportunities in science, technology, industry, medicine, commerce, finance, defence, and education, among other sectors of computational application. Included among the most significant advantages exhibited by cluster computing are the following:
• Performance scalability. Clustering of computer nodes provides the means of assembling larger systems than is practical for custom parallel systems, as these themselves can become nodes of clusters. Many of the entries on the Top500 list of the world's most powerful computers are clusters, and the most powerful general-purpose computer then under construction in the United States (DOE ASCI), planned for completion in 2003, was a cluster.
• Performance to cost. Clustering of mass-produced computer systems yields the cost advantage of a market much wider than that limited to the high-performance computing community. An order-of-magnitude price-performance advantage with respect to custom-designed parallel computers is achieved for many applications.
• Flexibility of configuration. The organization of cluster systems is determined by the
topology of their interconnection networks, which can be determined at time of
installation and easily modified. Depending on the requirements of the user
applications, various system configurations can be implemented to optimize for data
flow bandwidth and latency.

• Ease of upgrade. Old components may be replaced or new elements added to an original cluster to incrementally improve system operation while retaining much of the initial investment in hardware and software.

• Architecture convergence. Cluster computing offers a single general strategy for the implementation and application of parallel high-performance systems independent of specific hardware vendors and their product decisions. Users of clusters can build software application systems with confidence that such systems will be available to support them in the long term.

• Technology tracking. Clusters provide the most rapid path to integrating the latest
technology for high-performance computing, because advances in device technology
are usually first incorporated in mass market computers suitable for clustering.

• High availability. Clusters provide multiple redundant identical resources that, if managed correctly, can provide continued system operation through graceful degradation even as individual components fail.
Data warehouse:

A data warehouse (DW) is a digital storage system that connects and harmonizes large amounts of data from many different sources. Its purpose is to feed business intelligence (BI), reporting, and analytics, and support regulatory requirements – so companies can turn their data into insight and make smart, data-driven decisions. Data warehouses store current and historical data in one place and act as the single source of truth for an organization.
Data flows into a data warehouse from operational systems (like ERP and CRM
), databases, and external sources such as partner systems, Internet of
Things (IoT) devices, weather apps, and social media – usually on a regular
cadence. The emergence of cloud computing has caused a shift in the landscape. In recent years, data storage locations have moved away from traditional on-premise infrastructure to multiple locations, including on-premise, private cloud, and public cloud.
Modern data warehouses are designed to handle both structured and
unstructured data, like videos, image files, and sensor data. Some leverage
integrated analytics and in-memory database technology (which holds the data
set in computer memory rather than in disk storage) to provide real-time access
to trusted data and drive confident decision-making. Without data warehousing,
it’s very difficult to combine data from heterogeneous sources, ensure it’s in the
right format for analytics, and get both a current and long-range view of data
over time.
What can a data warehouse store?
When data warehouses first became popular in the late 1980s, they were
designed to store information about people, products, and transactions. This
data – called structured data – was neatly organized and formatted for easy
access. However, businesses soon wanted to store, retrieve, and
analyse unstructured data – such as documents, images, videos, emails,
social media posts, and raw data from machine sensors.

A modern data warehouse can accommodate both structured and unstructured data. By merging these data types and breaking down silos between the two, businesses can get a complete, comprehensive picture for the most valuable insights.

Data warehouse vs. Database:


Databases and data warehouses are both data storage systems; however, they serve different purposes. A database usually stores data for a particular business area. A data warehouse stores current and historical data for the entire business and feeds BI and analytics. Data warehouses use a database server to pull in data from an organization’s databases and add functionality for data modelling, data lifecycle management, data source integration, and more.

Benefits of data warehousing:
A well-designed data warehouse is the foundation for any successful BI or
analytics program. Its main job is to power the reports, dashboards, and
analytical tools that have become indispensable to businesses today. A data
warehouse provides the information for your data-driven decisions – and helps
you make the right call on everything from new product development to
inventory levels. There are many benefits of a data warehouse. Here are just a
few:
Better business analytics: With data warehousing, decision-makers have
access to data from multiple sources and no longer have to make decisions
based on incomplete information.
Faster queries: Data warehouses are built specifically for fast data retrieval and analysis. With a DW, you can very rapidly query large amounts of consolidated data with little to no support from IT (a minimal query sketch follows this list).
Improved data quality: Before data is loaded into the DW, the system creates data cleansing cases and enters them into a worklist for further processing, ensuring that data is transformed into a consistent format to support analytics – and decisions – based on high-quality, accurate data.
Historical insight: By storing rich historical data, a data warehouse lets
decision-makers learn from past trends and challenges, make predictions, and
drive continuous business improvement.
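As a toy illustration of the kind of consolidated query a data warehouse makes fast, the sketch below builds a tiny, hypothetical star schema in an in-process SQLite database; the table and column names are invented, and a real warehouse would use a dedicated DW platform rather than SQLite.

```python
import sqlite3

# A tiny, hypothetical star schema: one fact table plus one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (product_id INTEGER, sale_date TEXT, amount REAL);
    INSERT INTO dim_product VALUES (1, 'Appliances'), (2, 'Electronics');
    INSERT INTO fact_sales VALUES (1, '2024-01-05', 120.0),
                                  (2, '2024-01-06', 340.0),
                                  (1, '2024-02-10', 99.0);
""")

# Analysts can answer questions across consolidated data with a single query.
rows = conn.execute("""
    SELECT p.category, SUM(f.amount) AS total_sales
    FROM fact_sales f JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
""").fetchall()
print(rows)
```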
Data Lake:

The biggest challenge of Big Data is not volume, but data complexity or data variety. Volume is not the problem because storage is manageable. The real challenge is bringing together all the diverse and distributed data sources that organizations have. Data silos inhibit data teams from integrating multiple data sets that (when combined) could yield deep, actionable insights and create business value. That’s what a data lake can do.
Why do you need a Data Lake for Big Data?
• A data lake includes all types of data – structured, semi-structured, and unstructured – from a wide variety of sources, which makes it much more flexible in its potential use cases. Data lakes are usually built on low-cost commodity hardware, making it economically viable to store terabytes and even petabytes of data.
• Moreover, a data lake provides end-to-end services that reduce the time, effort, and cost required to run data pipelines, streaming analytics, and machine learning workloads on any cloud.
Ad-hoc and Streaming Analytics
For ad hoc and streaming analytics, the Qubole cloud data lake platform lets you
author, save, collaborate, and share reports and queries. You can develop and
deliver ad-hoc SQL analytics through optimized ANSI/ISO-SQL (Presto, Hive,
SparkSQL) and third-party tools such as Tableau, Looker, and Git native
integrations. The data lake platform helps you build streaming data pipelines, combining multiple streaming and batch datasets to gain real-time insights.
Machine Learning
For machine learning, the data lake provides capabilities to build, visualize, and
collaborate on machine learning models. Qubole’s machine learning specific
capabilities such as offline editing, multi-language interpreters, and version
control deliver faster results. You can leverage Jupyter or Qubole notebooks to
monitor application status and job progress, and use the integrated package
manager to update the libraries at scale.
Data Engineering
For data engineering, the data lake automates pipeline creation, scale, and
monitoring. You can easily create, schedule, and manage workloads for
continuous data engineering. Use the processing engine and language of your choice, such as Apache Spark, Hive, or Presto, with SQL, Python, R, or Scala.
Data lake architecture:
Three main architectural principles distinguish data lakes from conventional data
repositories:
• No data needs to be turned away. Everything collected from source systems can be
loaded and retained in a data lake if desired.
• Data can be stored in an untransformed or nearly untransformed state, as it was
received from the source system.
• That data is later transformed and fitted into a schema as needed, based on specific analytics requirements – an approach known as schema-on-read (a minimal sketch follows this list).
Whatever technology is used in a data lake deployment, some other elements should also be included to ensure that the data lake is functional and that the data it contains doesn't go to waste. That includes the following:
• A common folder structure with naming conventions.
• A searchable data catalogue to help users find and understand data.
• A data classification taxonomy to identify sensitive data, with information such as
data type, content, usage scenarios and groups of possible users.
• Data profiling tools to provide insights for classifying data and identifying data
quality issues.
• A standardized data access process to help control and keep track of who is
accessing data.
• Data protections, such as data masking, data encryption and automated usage monitoring.
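A minimal schema-on-read sketch, assuming a running PySpark environment and an invented lake path: structure is applied only when the raw data is read, not when it is loaded into the lake.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Hypothetical schema-on-read: raw JSON events were landed in the lake untouched;
# the path and field names below are invented for illustration.
spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Read the raw zone as-is; Spark infers the fields from the JSON at read time.
events = spark.read.json("s3://example-lake/raw/clickstream/2024/05/")

# Fit the data to the shape this particular analysis needs.
daily_clicks = (
    events.where(F.col("event_type") == "click")
          .groupBy("page", F.to_date("timestamp").alias("day"))
          .count()
)
daily_clicks.show()
```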
Data mining:

Data mining is a process of extracting hidden, unknown, but potentially useful information from massive data. Big Data has great impacts on scientific discoveries and value creation. This section introduces methods in data mining and technologies in Big Data, discusses the challenges of data mining and of data mining with big data, and presents some of the technological progress in both areas.
Keywords: Big Data, Data Mining, Big Data Analytics, Networks, Grid, Distributed Computing, Stream Mining, Web Mining, Text Mining, Information Security.
Big data mining is the capability of extracting useful information from these large datasets or streams of data, which was not possible before due to the data’s volume, variability, and velocity [7]. Big data is a massive volume of both structured and unstructured data that is so large that it is difficult to process using traditional database and software techniques.
METHODS OF DATA MINING AND BIG DATA
Data mining is a set of techniques for extracting valuable information (patterns) from data. It includes clustering analysis, classification, regression, and association rule learning, among others. For example, cluster analysis is used to differentiate objects with particular features and divide them into categories (clusters) according to those features. It is an unsupervised method that works without labelled training data, and clustering can be considered the most important unsupervised learning problem. Classification consists of examining the features of a newly presented object and assigning it to a predefined class. Major classification algorithms in data mining include decision trees, the k-nearest neighbour (KNN) classifier, Naive Bayes, and AdaBoost, while Apriori is the classic algorithm for association rule learning. Regression analysis identifies dependence relationships among variables hidden by randomness.
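A short scikit-learn sketch (not from the original text) contrasts unsupervised clustering with supervised classification on synthetic data; scikit-learn is simply an assumed, convenient packaging of these algorithms.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Toy data: 300 points that fall into 3 natural groups.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Unsupervised: KMeans groups the points without ever seeing the labels y.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised: KNN learns from labelled examples, then classifies new objects.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

print("Cluster assignment of first point:", clusters[0])
print("KNN test accuracy:", knn.score(X_test, y_test))
```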
A way of speeding up the mining of streaming learners is to distribute the
training process onto several machines. Hadoop is such a programming model
and software framework. Apache S4 is a platform for processing continuous data
streams. S4 applications are designed for combining streams and processing
elements in real time.
CHALLENGES OF DATA MINING:
Challenges include protecting privacy and confidentiality, stream pre-processing, the timing and availability of information, and relational stream mining. The challenges of data stream processing and mining lie in the changing nature of streaming data, so identifying trends, patterns, and changes in the underlying processes generating the data is important.
Data streams pose challenges for data mining. First, algorithms must make use of
limited resources (time and memory). Second, they must deal with data whose
nature or distribution changes over time. Unique challenges associated with
designing distributed mining systems are: online adaptation to incoming data
characteristics, online processing of large amounts of heterogeneous data, limited
data access and communication capabilities between distributed learners, etc.
The general MapReduce model is not ideally suited to data mining. First, MapReduce lacks a global view: there is no mechanism for data sharing, such as shared memory, between task nodes in Hadoop. Second, the Hadoop Distributed File System (HDFS) does not allow random writes; once massive data has been written into HDFS it can only be appended to or deleted. Third, tasks have a short life cycle. Finally, MapReduce may not be well suited to complex algorithms that have an iterative nature.
Hadoop in Big Data:

What is Hadoop?
Apache Hadoop is a 100 percent open source framework that pioneered a new way of distributed processing for large, enterprise data sets. Instead of relying on expensive, disparate systems to store and process data, Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data. With Hadoop, no data is too big.
Hadoop Architecture
A small Hadoop cluster includes a single master and multiple worker nodes. The
master node consists of a Job Tracker, Task Tracker, Name Node and Data Node.
Though it is possible to have data-only worker nodes and compute-only worker
nodes, a slave or worker node acts as both a Data Node and Task Tracker. In a
larger cluster, the Hadoop Distributed File System (HDFS) is managed through a
dedicated Name Node server to host the file system index, and a secondary
Name Node that can generate snapshots of the Name Node's memory structures,
thus preventing file-system corruption and reducing loss of data.
The Apache Hadoop framework comprises:

• Hadoop Common – contains the libraries and utilities needed by other Hadoop modules.
• Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
• Hadoop YARN – a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
• Hadoop MapReduce – a programming model for large-scale data processing (sketched below).
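The MapReduce programming model is easiest to see in the classic word-count example. The sketch below mimics the map, shuffle/sort and reduce phases in plain Python so it can run locally by piping a text file into it; in a real Hadoop Streaming job the mapper and reducer would be separate scripts and the framework would handle the sorting and distribution.

```python
# A word-count sketch in the MapReduce style used by Hadoop Streaming.
import sys
from itertools import groupby

def mapper(lines):
    # Map phase: emit (word, 1) for every word in the input.
    for line in lines:
        for word in line.strip().split():
            yield word, 1

def reducer(pairs):
    # Reduce phase: pairs arrive grouped (sorted) by key; sum the counts.
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    mapped = sorted(mapper(sys.stdin))   # the framework normally performs this sort
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```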
Why Big Data Hadoop?
In a fast-paced and hyper-connected world where more and more data is being
created, Hadoop’s breakthrough advantages mean that businesses and organizations
can now find value in data that was considered useless.
Organizations are realizing that categorizing and analysing Big Data can help them make major business predictions. Hadoop allows enterprises to store as much data as they need, in whatever form, simply by adding more servers to a Hadoop cluster. Each new server adds more storage and processing power to the cluster. This makes data storage with Hadoop less expensive than earlier data storage methods.

Hadoop and Big Data

With 90 percent of data being unstructured and growing rapidly, Hadoop is required to put the right Big Data workloads in the right systems and to optimize the data management structure in an organization. The cost-effectiveness, scalability and systematic architecture of Hadoop make it all the more compelling for organizations that need to process and manage Big Data.
In-Memory computing:

In-memory computing (IMC) stores data in RAM so that calculations run entirely in computer memory. With the rise of the big data era, faster data processing capabilities are required. Computer memory and storage space are also growing exponentially to accommodate large-capacity data collection and complex data analysis, which promotes the development of AI (artificial intelligence) and has given rise to an emerging approach: in-memory computing.
Ⅰ Memory Wall: Processor / Memory Performance Gap
The von Neumann architecture has dominated computer system design since the computer was invented. In this model, data is first stored in main memory, and instructions are then fetched from main memory and executed in order at run time. If the speed of the memory cannot keep up with the performance of the CPU, computation is limited; this is the "memory wall". At the same time, the von Neumann architecture has an obvious efficiency shortcoming: reading and writing data consumes more energy than performing a single calculation.
The performance of computer processors has developed rapidly in line with Moore's Law and has improved further with advances in transistor technology. The main memory of a computer uses DRAM, a high-density storage solution based on capacitor charging and discharging. Its performance (speed) depends on two aspects: the read/write speed of the capacitors in the memory, and the interface bandwidth between the devices. The read/write speed of the capacitors has increased with Moore's Law, but not as fast as the processor. In addition, the interface between DRAM and the processor is a mixed-signal circuit, and the growth of its bandwidth is mainly restricted by the signal integrity of the traces on the PCB. This has caused the performance improvement of DRAM to be much slower than that of the processor. At present, the performance of DRAM has become a huge bottleneck for overall computer performance – the so-called "memory wall" – and it holds back further gains in computing performance.
Ⅱ Development Requirements:
In current AI technology, with the increasing amount of data and computation, the original von Neumann architecture faces more and more challenges. Simply scaling up the CPU does not give the hardware architecture enough computing capacity, and piling ever larger storage capacity onto the old architecture is also poorly suited to AI. When memory capacity grows beyond a certain point, it only shows that the underlying technology needs innovation. To solve the "memory wall" problem, future computers will not keep shuttling data between memory and the processor; instead they will compute in memory (in-memory computing), thereby reducing the cost of data access during calculation.
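A rough, self-contained Python sketch of the in-memory idea (the file name, data and loop count are invented): the expensive storage read happens once, and repeated analysis then runs entirely against data held in RAM rather than going back to storage for every query.

```python
import time

def load_from_disk(path="measurements.txt"):
    # Stand-in for a slow storage read (file name is made up for the example).
    with open(path) as f:
        return [float(line) for line in f]

def analyses(values):
    return {
        "count": len(values),
        "mean": sum(values) / len(values),
        "max": max(values),
    }

# Write a small sample file so the sketch is self-contained.
with open("measurements.txt", "w") as f:
    f.write("\n".join(str(i * 0.5) for i in range(100_000)))

start = time.perf_counter()
in_memory = load_from_disk()        # one disk read...
for _ in range(50):
    results = analyses(in_memory)   # ...then 50 computations purely in RAM
print(results, f"({time.perf_counter() - start:.3f}s with data held in memory)")
```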
Machine learning:
What is Machine Learning?
The core of machine learning consists of self-learning algorithms that evolve
by continuously improving at their assigned task. When structured correctly
and fed proper data, these algorithms eventually produce results in the
contexts of pattern recognition and predictive modelling.
For machine-learning algorithms, data is like exercise: the more the better.
Algorithms fine-tune themselves with the data they train on in the same way
Olympic athletes hone their bodies and skills by training every day.
Many programming languages work with machine learning, including Python,
R, Java, JavaScript and Scala. Python is the preferred choice for many
developers because of its TensorFlow library, which offers a comprehensive
ecosystem of machine-learning tools. If you’d like to practice coding on an
actual algorithm, check out our article on machine learning with Python.
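As a tiny, hypothetical illustration of the "more data is better" point, the sketch below trains the same algorithm on progressively larger slices of a synthetic dataset and reports how test accuracy changes; it uses scikit-learn rather than TensorFlow purely to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification task with a held-out test set.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Train on larger and larger slices of the training data.
for n in (50, 500, 3500):
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(f"trained on {n:>4} examples -> test accuracy "
          f"{model.score(X_test, y_test):.3f}")
```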
Machine Learning Applications for Big Data:
Let’s look at some real-life examples that demonstrate how big data and machine learning
can work together.
Cloud Networks
A research firm has a large amount of medical data it wants to study, but in order to do so
on-premises it needs servers, online storage, networking and security assets, all of which
adds up to an unreasonable expense. Instead, the firm decides to invest in Amazon EMR, a
cloud service that offers data-analysis models within a managed framework.
Machine-learning models of this sort include GPU-accelerated image recognition and text classification. These algorithms don't continue to learn once they are deployed, so they can be distributed and supported by a content-delivery network (CDN). Check out LiveRamp's detailed outline describing the migration of a big-data environment to the cloud.
Web Scraping
Let’s imagine that a manufacturer of kitchen appliances learns about market tendencies
and customer-satisfaction trends from a retailer’s quarterly reports. In their desire to find
out what the reports might have left out, the manufacturer decides to web-scrape the
enormous amount of existing data that pertains to online customer feedback and product
reviews. By aggregating this data and feeding it to a deep-learning model, the
manufacturer learns how to improve and better describe its products, resulting in
increased sales.
While web scraping generates a huge amount of data, it's worth noting that choosing the sources for this data is the most important part of the process. Check out this IT Svit guide for some best data-mining practices.
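A minimal, hypothetical web-scraping sketch in Python using requests and BeautifulSoup; the URL, CSS class and attribute names are invented, and any real scraper must respect the target site's terms of service and robots.txt.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical review page; the URL and selectors below are made up.
URL = "https://example.com/products/blender-123/reviews"

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
reviews = [
    {
        "rating": tag.get("data-rating"),  # assumed attribute name
        "text": tag.get_text(strip=True),
    }
    for tag in soup.select(".review")      # assumed CSS class
]

# Aggregated review text like this could then be fed to a text-mining
# or deep-learning model, as described above.
print(f"Collected {len(reviews)} reviews")
```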
Others:
Image recognition:
Image recognition is one of the most common applications of machine learning. It is used to identify objects, persons, places, digital images, etc. A popular use case of image recognition and face detection is automatic friend-tagging suggestions.

Speech recognition:
Speech recognition is the process of converting voice instructions into text; it is also known as "speech to text" or "computer speech recognition". At present, machine learning algorithms are widely used in speech recognition applications. Virtual personal assistants such as Google Assistant, Alexa, Cortana and Siri rely on it and, as the name suggests, help us find information using voice instructions. These assistants can help us in various ways through voice commands alone, such as playing music, calling someone, opening an email, or scheduling an appointment.
Conclusion :
The availability of Big Data, low-cost commodity hardware, and new information
management and analytic software have produced a unique moment in the history
of data analysis. The convergence of these trends means that we have the
capabilities required to analyse astonishing data sets quickly and cost-effectively
for the first time in history. These capabilities are neither theoretical nor trivial.
They represent a genuine leap forward and a clear opportunity to realize
enormous gains in terms of efficiency, productivity, revenue, and profitability.
Big Data is a game-changer. Many organizations are using more analytics to drive strategic actions and offer a better customer experience. A slight improvement in efficiency or the smallest saving can lead to a huge profit, which is why most organizations are moving towards big data.
