9. Machine learning
Introduction:
Structured Data
Structured data is the easiest to work with. It is highly organized with dimensions defined
by set parameters.
Think spreadsheets; every piece of information is grouped into rows and columns. Specific
elements defined by certain variables are easily discoverable.
It’s all your quantitative data.
• Age
• Billing
• Contact
• Address
• Expenses
• Debit/credit card numbers
Because structured data already consists of tangible numbers and clearly defined fields, it is
much easier for a program to sort through and analyse.
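As a small illustration (a sketch assuming the pandas library; the customer records below are invented), structured data can be loaded straight into rows and columns and queried by its defined fields:

```python
# A minimal sketch: structured data behaves like a spreadsheet, so specific
# records are easy to filter by well-defined columns. All values are made up.
import pandas as pd

customers = pd.DataFrame({
    "age":      [34, 51, 27],
    "billing":  [120.50, 89.99, 240.00],
    "contact":  ["a@example.com", "b@example.com", "c@example.com"],
    "expenses": [45.10, 300.25, 99.00],
})

# Because every value sits in a known row and column, queries are straightforward.
high_spenders = customers[customers["expenses"] > 100]
print(high_spenders)
print("Average billing:", customers["billing"].mean())
```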
Unstructured:
Any data with an unknown form or structure is classified as unstructured data. In
addition to its sheer size, unstructured data poses multiple challenges when it comes
to processing it to derive value. A typical example of unstructured data is a
heterogeneous data source containing a combination of simple text files, images,
videos and so on. Organizations today have a wealth of data available to them, but
unfortunately they often do not know how to derive value from it, because the data is
in its raw, unstructured form.
The hardest part of analysing unstructured data is teaching an application to
understand the information it is extracting. This means translating it into some form
of structured data. Almost universally, this involves a complex algorithm blending
scanning, interpreting and contextualizing functions.
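As a toy sketch of that translation step (real systems rely on far more sophisticated NLP, OCR or computer-vision pipelines; the support-ticket text below is invented), a program can scan free text and pull a few structured fields out of it:

```python
# A minimal sketch of turning a scrap of unstructured text into structured fields.
# The regexes below only illustrate the idea; production pipelines go much further.
import re

raw = """Ticket from Jane Doe <jane.doe@example.com> on 2023-04-12:
The billing page crashes whenever I update my card."""

record = {
    "email": re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", raw).group(0),
    "date":  re.search(r"\d{4}-\d{2}-\d{2}", raw).group(0),
    "body":  raw.splitlines()[-1].strip(),
}
print(record)  # unstructured text mapped into named, queryable fields
```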
Semi-structured:
Semi-structured data sits between structured and unstructured. Most of the time,
this translates to unstructured data with metadata attached to it. The metadata can
be inherent to the data collected, such as a time, location, device ID stamp or email
address, or it can be a semantic tag attached to the data later.
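A small sketch of what this looks like in practice (the email-like record below is invented): the metadata keys can be queried like structured data, while the body still needs the unstructured-data techniques described above.

```python
# A minimal sketch: semi-structured data is often free-form content plus metadata.
import json

message = {
    # Metadata: inherent, machine-readable tags attached to the content.
    "timestamp": "2023-04-12T09:30:00Z",
    "device_id": "android-7f3a",
    "location":  "48.85,2.35",
    "from":      "jane.doe@example.com",
    # Payload: unstructured free text.
    "body": "Hi team, the quarterly numbers look great. Details attached.",
}

print(json.dumps(message, indent=2))
```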
Batch processing:
Batch processing works through large volumes of stored data in scheduled jobs rather than in
real time, and in Big Data systems these jobs typically run on clusters of commodity machines,
which brings an additional benefit:
• Technology tracking. Clusters provide the most rapid path to integrating the latest
technology for high-performance computing, because advances in device technology
are usually first incorporated in mass-market computers suitable for clustering.
The biggest challenge of Big Data is not volume but data complexity, or
data variety. Volume is not the problem, because storage is manageable.
The real task is bringing together all the diverse and distributed data
sources that organizations have. Data silos inhibit data teams from
integrating multiple data sets that, when combined, could yield deep,
actionable insights and create business value. That is what a data lake
can do.
Why do you need a Data Lake for Big Data?
A data lake holds structured, semi-structured and unstructured data from a wide
variety of sources, which makes it much more flexible in its potential use
cases. Data lakes are usually built on low-cost commodity hardware, making it
economically viable to store terabytes and even petabytes of data.
Moreover, a data lake provides end-to-end services that reduce the time, effort,
and cost required to run data pipelines, streaming analytics, and machine
learning workloads on any cloud.
Ad-hoc and Streaming Analytics
For ad-hoc and streaming analytics, the Qubole cloud data lake platform lets you
author, save, collaborate on, and share reports and queries. You can develop and
deliver ad-hoc SQL analytics through optimized ANSI/ISO SQL engines (Presto, Hive,
SparkSQL) and third-party tools such as Tableau and Looker, along with native Git
integration. The data lake platform also helps you build streaming data pipelines
that combine multiple streaming and batch datasets to gain real-time insights.
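As a generic sketch of ad-hoc SQL over a data lake (using open-source SparkSQL, one of the engines mentioned above, rather than any Qubole-specific API; the S3 path and column names are hypothetical):

```python
# A minimal ad-hoc query over raw files in a data lake using SparkSQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("adhoc-analytics").getOrCreate()

# Read raw JSON events straight from the lake and expose them to SQL.
events = spark.read.json("s3://my-data-lake/raw/clickstream/")  # hypothetical path
events.createOrReplaceTempView("clickstream")

daily_visits = spark.sql("""
    SELECT to_date(event_time) AS day, COUNT(DISTINCT user_id) AS visitors
    FROM clickstream
    GROUP BY to_date(event_time)
    ORDER BY day
""")
daily_visits.show()
```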
Machine Learning
For machine learning, the data lake provides capabilities to build, visualize, and
collaborate on machine learning models. Qubole's machine-learning-specific
capabilities, such as offline editing, multi-language interpreters, and version
control, deliver faster results. You can use Jupyter or Qubole notebooks to
monitor application status and job progress, and use the integrated package
manager to update libraries at scale.
Data Engineering
For data engineering, the data lake automates pipeline creation, scale, and
monitoring. You can easily create, schedule, and manage workloads for
continuous data engineering, using the processing engine and language of your choice,
such as Apache Spark, Hive, or Presto with SQL, Python, R, or Scala.
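A minimal sketch of such a batch pipeline step with Apache Spark (one of the engines listed above); the lake paths, column names, and partitioning scheme are hypothetical, and a real deployment would hand scheduling and monitoring to the platform or a workflow tool:

```python
# A small extract-transform-load step over data-lake files with PySpark.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-orders-pipeline").getOrCreate()

# Extract: raw CSV dropped into the lake by an upstream system (hypothetical path).
orders = spark.read.csv("s3://my-data-lake/raw/orders/", header=True, inferSchema=True)

# Transform: clean and enrich the records.
cleaned = (orders
           .dropna(subset=["order_id", "amount"])
           .withColumn("order_date", F.to_date("order_ts"))
           .withColumn("amount", F.col("amount").cast("double")))

# Load: write a partitioned, analytics-friendly copy back to the lake.
(cleaned.write.mode("overwrite")
        .partitionBy("order_date")
        .parquet("s3://my-data-lake/curated/orders/"))
```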
Data lake architecture:
Three main architectural principles distinguish data lakes from conventional data
repositories:
• No data needs to be turned away. Everything collected from source systems can be
loaded and retained in a data lake if desired.
• Data can be stored in an untransformed or nearly untransformed state, as it was
received from the source system.
• That data is later transformed and fitted into a schema as needed, based on specific
analytics requirements, an approach known as schema-on-read (see the sketch after this list).
Whatever technology is used in a data lake deployment, some other elements
should also be included to ensure that the data lake is functional and that the data
it contains doesn't go to waste. These include the following:
• A common folder structure with naming conventions.
• A searchable data catalogue to help users find and understand data.
• A data classification taxonomy to identify sensitive data, with information such as
data type, content, usage scenarios and groups of possible users.
• Data profiling tools to provide insights for classifying data and identifying data
quality issues.
• A standardized data access process to help control and keep track of who is
accessing data.
• Data protections, such as data masking, data encryption and automated usage monitoring.
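A minimal sketch of schema-on-read with Spark (the path and fields are hypothetical): the raw JSON stays in the lake untransformed, and a schema is applied only at read time for this particular analysis; another analysis could read the same files with a different schema.

```python
# Schema-on-read: impose a schema when reading raw, untransformed lake files.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The schema this particular analysis needs (hypothetical fields).
schema = StructType([
    StructField("user_id", StringType()),
    StructField("event", StringType()),
    StructField("value", DoubleType()),
])

events = spark.read.schema(schema).json("s3://my-data-lake/raw/events/")
events.groupBy("event").count().show()
```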
Data mining:
What is Hadoop?
Apache Hadoop is a 100 percent open source framework that pioneered a new way
of doing distributed processing of large enterprise data sets. Instead of relying on
expensive, disparate systems to store and process data, Hadoop enables
distributed parallel processing of huge amounts of data across inexpensive,
industry-standard servers that both store and process the data. With Hadoop, no
data is too big.
Hadoop Architecture
A small Hadoop cluster includes a single master and multiple worker nodes. The
master node consists of a Job Tracker, Task Tracker, Name Node and Data Node.
Though it is possible to have data-only worker nodes and compute-only worker
nodes, a slave or worker node typically acts as both a Data Node and a Task Tracker. In a
larger cluster, the Hadoop Distributed File System (HDFS) is managed through a
dedicated Name Node server that hosts the file-system index, along with a secondary
Name Node that can generate snapshots of the Name Node's memory structures,
thus preventing file-system corruption and reducing data loss.
The Apache Hadoop framework comprises four main modules: Hadoop Common, the
Hadoop Distributed File System (HDFS), Hadoop YARN for resource management, and
Hadoop MapReduce for parallel batch processing.
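A minimal sketch of the MapReduce word-count pattern in the Hadoop Streaming style (a generic example, not tied to any particular cluster): the framework pipes input lines to a mapper, sorts the mapper output by key, and pipes the sorted stream to a reducer. It can be tried locally with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`.

```python
# wordcount.py: a single script that plays either the mapper or reducer role.
import sys
from itertools import groupby

def mapper():
    # Emit "word<TAB>1" for every word on every input line.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    # Input arrives sorted by word, so consecutive lines share a key.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```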
In-memory computing:
In-memory computing (IMC) stores data in RAM so that calculations run entirely in main
memory. With the rise of the big data era, faster data-processing capabilities are
required. Memory and storage capacities are also growing rapidly to accommodate
large-scale data collection and complex data analysis, which drives the development
of AI (artificial intelligence) and, in turn, the emerging technique of in-memory computing.
Ⅰ Memory Wall: Processor / Memory Performance Gap
The von Neumann architecture has dominated computer systems since the
computer was invented. In this model, data is first stored in main memory, and at run
time instructions are fetched from main memory and executed in order. If the speed
of the memory cannot keep up with the performance of the CPU, computation is
limited; this is the memory wall. In terms of efficiency, the von Neumann architecture
also has an obvious shortcoming: moving data to and from memory consumes more
energy than performing the calculation itself.
The performance of computer processors has developed rapidly along Moore's Law ever since
the invention of the transistor. The main memory of the computer uses DRAM, a high-density
storage solution based on capacitor charging and discharging. Its performance (speed) depends
on two aspects: the read/write speed of the capacitors in the memory and the interface
bandwidth between the devices. The read/write speed of capacitor charging and discharging
has increased with Moore's Law, but not as fast as the processor. In addition, the interface
between DRAM and the processor is a mixed-signal circuit, and the rate at which its bandwidth
can grow is mainly restricted by the signal integrity of the traces on the PCB. This has also
caused the performance improvement of DRAM to be much slower than that of the processor.
At present, the performance of DRAM has become a huge bottleneck for overall computer
performance, the so-called "memory wall", which caps the computing performance the rest of
the system can deliver.
Ⅱ Developing Requirement:
In current AI technology, with ever-increasing amounts of data and computation,
the original von Neumann architecture faces more and more challenges. Simply
scaling up the CPU does not give the hardware architecture the amount of
computation required, and relying on the old architecture merely to add storage
capacity is equally unsuitable for AI. When memory capacity has to grow beyond a
certain point, it is a sign that the underlying technology needs innovation. To
solve the "memory wall" problem, future computers will be built not around
shuttling data between compute and memory but around in-memory computing,
thereby reducing the cost of data access during calculation.
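A toy sketch of the idea in plain Python (illustrative only; on a machine whose page cache can hold the whole file, the operating system will hide much of the gap): re-reading data from storage for every pass pays the data-access cost repeatedly, while keeping the working set in RAM pays it once.

```python
# Illustrative sketch only: compare re-reading a file for every pass with
# keeping the data in memory. File size and pass count are arbitrary choices.
import os
import tempfile
import time

path = os.path.join(tempfile.gettempdir(), "imc_demo.bin")
with open(path, "wb") as f:
    f.write(os.urandom(50 * 1024 * 1024))  # 50 MB of synthetic "data"

def checksum(data: bytes) -> int:
    # A trivial stand-in for a real calculation over the data.
    return sum(data[::4096])

def storage_bound(passes: int = 5) -> int:
    # Fetch the data from storage on every pass (crossing the wall each time).
    total = 0
    for _ in range(passes):
        with open(path, "rb") as f:
            total += checksum(f.read())
    return total

def in_memory(passes: int = 5) -> int:
    # Load once, keep the working set in RAM, and compute repeatedly.
    with open(path, "rb") as f:
        data = f.read()
    return sum(checksum(data) for _ in range(passes))

for fn in (storage_bound, in_memory):
    start = time.perf_counter()
    fn()
    print(f"{fn.__name__}: {time.perf_counter() - start:.3f} s")
```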
Machine learning:
What is Machine Learning?
The core of machine learning consists of self-learning algorithms that evolve
by continuously improving at their assigned task. When structured correctly
and fed proper data, these algorithms eventually produce results in the
contexts of pattern recognition and predictive modelling.
For machine-learning algorithms, data is like exercise: the more the better.
Algorithms fine-tune themselves with the data they train on in the same way
Olympic athletes hone their bodies and skills by training every day.
Many programming languages work with machine learning, including Python,
R, Java, JavaScript and Scala. Python is the preferred choice for many
developers because of its TensorFlow library, which offers a comprehensive
ecosystem of machine-learning tools. If you’d like to practice coding on an
actual algorithm, check out our article on machine learning with Python.
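As a small, self-contained sketch in Python with TensorFlow (the library mentioned above; the six training points are invented), an algorithm can fine-tune itself to the pattern y = 2x - 1 and then predict new values:

```python
# A minimal predictive-modelling sketch with TensorFlow/Keras: the model is
# fed a handful of (x, y) examples and learns the pattern y = 2x - 1.
import numpy as np
import tensorflow as tf

x = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y = 2 * x - 1  # the pattern the algorithm should learn from the data

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(units=1),  # a single linear unit is enough here
])
model.compile(optimizer="sgd", loss="mean_squared_error")

# "Data is like exercise": more and better examples generally help; here we
# simply iterate over the same six points for 500 epochs.
model.fit(x, y, epochs=500, verbose=0)

print(model.predict(np.array([[10.0]])))  # should be close to 19
```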
Machine Learning Applications for Big Data:
Let’s look at some real-life examples that demonstrate how big data and machine learning
can work together.
Cloud Networks
A research firm has a large amount of medical data it wants to study, but in order to do so
on-premises it needs servers, online storage, networking and security assets, all of which
adds up to an unreasonable expense. Instead, the firm decides to invest in Amazon EMR, a
cloud service that offers data-analysis models within a managed framework.
Machine-learning models of this sort include GPU-accelerated image recognition and text
classification. These algorithms don’t learn once they are deployed, so they can be
distributed and supported by a content-delivery network (CDN). Check out Live Ramp's
detailed outline describing the migration of a big-data environment to the cloud.
Web Scraping
Let’s imagine that a manufacturer of kitchen appliances learns about market tendencies
and customer-satisfaction trends from a retailer’s quarterly reports. In their desire to find
out what the reports might have left out, the manufacturer decides to web-scrape the
enormous amount of existing data that pertains to online customer feedback and product
reviews. By aggregating this data and feeding it to a deep-learning model, the
manufacturer learns how to improve and better describe its products, resulting in
increased sales.
While web scraping generates a huge amount of data, it is worth noting that choosing
the sources for this data is the most important part of the process. Check out this IT Svit
guide for some data-mining best practices.
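A minimal sketch of the scraping step (not the manufacturer's actual pipeline; the URL, CSS selectors and page layout are hypothetical, real sites' terms of service and robots.txt should be respected, and the requests and beautifulsoup4 packages are assumed):

```python
# Collect review text and star ratings into structured records for later analysis.
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/product/blender-123/reviews"  # hypothetical page

response = requests.get(URL, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

reviews = []
for item in soup.select("div.review"):              # hypothetical CSS selector
    text = item.select_one("p.review-text")
    stars = item.select_one("span.stars")
    if text and stars:
        reviews.append({"text": text.get_text(strip=True),
                        "stars": stars.get_text(strip=True)})

print(f"Scraped {len(reviews)} reviews")
```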
Others:
Image recognition:
Image recognition is one of the most common applications of machine learning.
It is used to identify objects, persons, places, digital images, and so on. A popular
use case of image recognition and face detection is the automatic friend-tagging
suggestion.
Speech recognition:
Speech recognition is the process of converting voice instructions into text; it is
also known as "speech to text" or "computer speech recognition". At present,
machine learning algorithms are widely used in various speech-recognition
applications. We have virtual personal assistants such as Google Assistant, Alexa,
Cortana and Siri; as the name suggests, they help us find information using voice
instructions. These assistants can assist us in various ways just through voice
commands, such as playing music, calling someone, opening an email or
scheduling an appointment.
Conclusion:
The availability of Big Data, low-cost commodity hardware, and new information
management and analytic software have produced a unique moment in the history
of data analysis. The convergence of these trends means that we have the
capabilities required to analyse astonishing data sets quickly and cost-effectively
for the first time in history. These capabilities are neither theoretical nor trivial.
They represent a genuine leap forward and a clear opportunity to realize
enormous gains in terms of efficiency, productivity, revenue, and profitability.
Big Data is a game-changer. Many organizations are using more analytics to
drive strategic actions and offer a better customer experience. Even a slight
improvement in efficiency or the smallest saving can lead to a huge profit, which is
why most organizations are moving towards big data.