Big Data
The most widely used type of machine learning is a type of AI that learns A to B, or input to output, mappings. This is called supervised learning.
It is relevant to understand what types of tasks these algorithms can perform when enough data is available, in other words, what the Big Data capabilities are. They can be summarized as:
• Descriptions
• Predictions
• Inferences
• Classifications
• Clustering
• Recommendations
• Cognitive systems
Descriptions: the integration of all the data into a single dashboard or map. Even if it only shows a description of a phenomenon from different perspectives, making the data available in this way might trigger a better decision-making process. These capabilities can be grouped by the type of learning involved (a small sketch follows the list below):
• Supervised learning:
- Predictions
- Inferences
- Classifications
• Unsupervised learning
- Clustering
- Recommendations
- Cognitive systems
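Below is a minimal Python sketch, assuming scikit-learn is installed and using invented toy data and feature names, that contrasts a supervised task (classification from labelled examples) with an unsupervised one (clustering without labels):

    # Contrast of supervised vs. unsupervised learning on invented toy data.
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    # Input features A (e.g. hours studied, hours slept) and labels B (pass/fail).
    A = [[1, 4], [2, 5], [8, 7], [9, 8]]
    B = [0, 0, 1, 1]

    # Supervised: learn the A -> B mapping from labelled examples.
    classifier = LogisticRegression().fit(A, B)
    print(classifier.predict([[7, 6]]))      # predicted label for a new input

    # Unsupervised: no labels, the algorithm groups similar inputs (clustering).
    clusters = KMeans(n_clusters=2, n_init=10).fit(A)
    print(clusters.labels_)                  # cluster assignment per input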
3. Big Data characteristics
We can distinguish some characteristics of Big Data that are called the Vs of Big Data: the three classical Vs (Volume, Variety and Velocity), plus two more (Veracity and Valence), and finally one more (Value).
• Volume: The amount of data matters. The challenge with these volumes is how to store, acquire, retrieve, distribute and process the data, and at what cost. In general, more data means better predictions, classifications...
• Velocity: The fast rate at which data is received and acted on.
• Variety: The many types of data that are available.
• Veracity: How accurate or truthful a data set may be.
• Valence: Connectedness between data items
• Value: The final product has to have a value for it to be useful
We can distinguish five elements that determine value creation in a strategic view of a data project.
Hadoop is a whole environment of tools, an ecosystem that makes it possible to store and process data of every type and speed, based on parallelism.
Hadoop allowed big problems to be broken down into smaller elements so that analysis could be
done quickly and cost-effectively. By breaking the big data problem into small pieces that could be
processed in parallel, you can process the information and regroup the small pieces to present
results.
Hadoop is designed to parallelize data processing across computing nodes to speed computations
and hide latency. At its core, Hadoop has two primary components:
• Hadoop Distributed File System: A reliable, high-bandwidth, low-cost, data storage cluster
that facilitates the management of related files across machines.
• MapReduce engine: A high-performance parallel/distributed data processing
implementation of the MapReduce algorithm.
2. HDFS
The Hadoop Distributed File System is a versatile, resilient, clustered approach to managing files in
a big data environment.
HDFS works by breaking large files into smaller pieces called blocks. The blocks are stored on data
nodes, and it is the responsibility of the NameNode to know what blocks on which data nodes
make up the complete file. The NameNode also acts as a “traffic cop,” managing all access to the
files, including reads, writes, creates, deletes, and replication of data blocks on the data nodes.
The complete collection of all the files in the cluster is sometimes referred to as the file system
namespace.
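As a rough illustration only (plain Python with hypothetical names, not the real Hadoop code), the sketch below mimics how a file is broken into blocks, the blocks are spread over data nodes, and a NameNode keeps only the metadata needed to reassemble the file:

    # Toy illustration of HDFS-style block placement (not real Hadoop APIs).
    BLOCK_SIZE = 4            # bytes per block here; HDFS uses 128 MB by default
    DATA_NODES = ["node1", "node2", "node3"]

    def split_into_blocks(content, block_size=BLOCK_SIZE):
        """Break a file's content into fixed-size blocks."""
        return [content[i:i + block_size] for i in range(0, len(content), block_size)]

    # The "NameNode" stores only metadata: file name -> list of (block id, data node).
    name_node = {}
    # The "data nodes" store the actual block contents.
    data_nodes = {node: {} for node in DATA_NODES}

    def write_file(filename, content):
        blocks = split_into_blocks(content)
        placement = []
        for i, block in enumerate(blocks):
            node = DATA_NODES[i % len(DATA_NODES)]   # naive round-robin placement
            block_id = f"{filename}_blk{i}"
            data_nodes[node][block_id] = block
            placement.append((block_id, node))
        name_node[filename] = placement

    def read_file(filename):
        """Ask the NameNode where the blocks are, then fetch them from the data nodes."""
        return "".join(data_nodes[node][blk] for blk, node in name_node[filename])

    write_file("example.txt", "big data needs distributed storage")
    print(read_file("example.txt"))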
3. MAPREDUCE
Hadoop MapReduce is an implementation of the MapReduce algorithm, developed and maintained by the Apache Hadoop project.
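A word count in the MapReduce style is the classic illustration; the sketch below is plain Python rather than the Hadoop API, with the map phase emitting (word, 1) pairs, a shuffle grouping them by key, and the reduce phase aggregating each group:

    # Minimal word count in the MapReduce style (pure Python, not Hadoop).
    from collections import defaultdict

    def map_phase(document):
        """Map: emit a (word, 1) pair for every word in the document."""
        for word in document.split():
            yield (word, 1)

    def reduce_phase(word, counts):
        """Reduce: aggregate all the values emitted for one key."""
        return (word, sum(counts))

    documents = ["big data is big", "data is processed in parallel"]

    # Shuffle/sort: group the emitted values by key.
    grouped = defaultdict(list)
    for doc in documents:          # in Hadoop each document could be mapped on a different node
        for word, count in map_phase(doc):
            grouped[word].append(count)

    results = [reduce_phase(word, counts) for word, counts in grouped.items()]
    print(results)                 # e.g. [('big', 2), ('data', 2), ...]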
Data processing occurs when data is collected and translated into usable information. We can
distinguish several steps in this process:
Data munging: Transform data from erroneous or unusable forms into useful and use-case-specific ones. This concept includes the following steps (a small sketch follows this list):
• Data exploration: Munging usually begins with data exploration. This initial
exploration can be done with some initial graphs, correlations, histograms, or
descriptive statistics (Mean, Median, Mode, Range...) and visualize them in maps or
dashboards.
• Data transformation and integration: Once a sense of the raw data's contents and structure has been established, the data must be transformed into new formats appropriate for downstream processing. This step involves the pure restructuring of the data.
• Data enrichment (and integration): This involves finding external sources of information
to expand the scope or content of existing records.
• Data validation: This step allows users to discover typos, incorrect mappings, problems with transformation steps, and even the rare corruption caused by computational failure or error. No matter how well the data management system has prepared the data, this task is essential in the development of a new project.
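The sketch below, using pandas on an invented customer table, walks through these four munging steps in a few lines:

    # Data munging sketch with pandas: exploration, transformation,
    # enrichment and validation on an invented customer data set.
    import pandas as pd

    raw = pd.DataFrame({
        "customer_id": [1, 2, 2, 3],
        "country_code": ["ES", "FR", "FR", "es"],
        "amount": ["10.5", "20.0", "20.0", "7.25"],   # stored as text in the raw source
    })

    # Exploration: descriptive statistics and a first look at the data.
    print(raw.describe(include="all"))

    # Transformation: restructure and standardize types and values.
    clean = raw.copy()
    clean["amount"] = clean["amount"].astype(float)
    clean["country_code"] = clean["country_code"].str.upper()

    # Enrichment (and integration): join an external lookup table.
    countries = pd.DataFrame({"country_code": ["ES", "FR"],
                              "country": ["Spain", "France"]})
    clean = clean.merge(countries, on="country_code", how="left")

    # Validation: look for missing values and duplicates.
    print(clean.isna().sum())
    print(clean.duplicated(subset=["customer_id"]).sum())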
Data exploitation is the last group of tasks needed to carry out our R&D project. In the project process we will build a design of the desired output and possibly a beta version. In an operational implementation we differentiate between the front-end (user interface) of the system and the back-end (server) of the system. We include analysis and visualization in data exploitation.
Analysis: Data mining is the process of finding anomalies, patterns, and correlations within large
data sets to predict outcomes.
Some include data preparation here, but we consider it part of data munging. Others include data warehousing, which involves storing structured data in relational database management systems so it can be analyzed for business intelligence, reporting, and basic dashboarding capabilities. However, this is an operational task, not part of the data project process, so we prefer to consider it within the data management process.
The report and visualization are the last step of the project process and involve designing the output so that it is meaningful to the end user.
Finally, we need to connect with the purpose and iterate the process to obtain a correct and valuable output.
2. DATA MODELS AND TYPES OF DATA
A data model can be described through three components:
• Operations
• Constraints
• Structures
Operations: The possible operations can be summarized as follows (a sketch follows this list):
• Subsetting: given a data set and a condition, find the subset that fulfils the condition.
• Substructure extraction: given a data set, extract a part of its structure together with its elements.
• Union: given two data sets, create a new data set with the elements of both, removing duplicates.
• Join: given two data sets with complementary structures, create a new data set that combines the elements of both, removing duplicates.
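The following plain-Python sketch illustrates the four operations on two invented data sets:

    # Toy illustration of subsetting, substructure extraction, union and join.
    people = [
        {"id": 1, "name": "Ana",  "city": "Madrid"},
        {"id": 2, "name": "Luis", "city": "Paris"},
    ]
    salaries = [
        {"id": 1, "salary": 30000},
        {"id": 2, "salary": 28000},
    ]

    # Subsetting: records that fulfil a condition.
    in_madrid = [p for p in people if p["city"] == "Madrid"]

    # Substructure extraction: keep only part of each record's structure.
    names = [{"name": p["name"]} for p in people]

    # Union: elements of two data sets, removing duplicates.
    a = [1, 2, 3]
    b = [3, 4]
    union = sorted(set(a) | set(b))

    # Join: combine records from two data sets through a common key.
    joined = [{**p, **s} for p in people for s in salaries if p["id"] == s["id"]]

    print(in_madrid, names, union, joined, sep="\n")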
Constraints: these are the logical propositions that the data must comply with. For example: each person has only one name. Different models have different ways to express constraints.
Structures: we can distinguish different types of data according to their structure:
• Structured
• Semi-structured
• Unstructured
Structured: generally refers to data that has a defined length and format.
The relational model is still in wide use today and plays an important role in the evolution of big data. Understanding the relational database is important because other types of databases are used with big data. In a relational model, the data is stored in tables, and it is created, managed and queried using the structured query language (SQL).
Another aspect of the relational model using SQL is that tables can be queried using a common key (that is, the relationship). The related tables use that key to make the relation possible; in those tables it is called the foreign key, and it may contain duplicate values (see the sketch below).
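A minimal sketch with Python's built-in sqlite3 module (the tables, columns and rows are invented for illustration) shows two related tables queried through a common key; customer_id in the orders table is the foreign key and may hold duplicate values:

    # Two related tables joined through a common key (foreign key) with sqlite3.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    cur.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("""CREATE TABLE orders (
                       order_id INTEGER PRIMARY KEY,
                       customer_id INTEGER,           -- foreign key, may be duplicated
                       amount REAL,
                       FOREIGN KEY (customer_id) REFERENCES customers (customer_id))""")

    cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Luis")])
    cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                    [(10, 1, 99.0), (11, 1, 15.5), (12, 2, 42.0)])

    # Query the related tables through the common key.
    cur.execute("""SELECT c.name, o.amount
                   FROM customers c JOIN orders o ON c.customer_id = o.customer_id""")
    print(cur.fetchall())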
Semi-structured data is a type of data that has some consistent and definite characteristics, but it does not conform to a rigid structure such as that needed for relational databases.
3. DATA MANAGEMENT
Data management is the practice of collecting, keeping, and using data securely, efficiently,
and cost-effectively. The goal of data management is to help people, organizations, and
connected things optimize the use of data.
From another point of view, data management consists of the way we answer the issues that appear when making a given data project operational. We can therefore consider the following:
• Data storage
• Data ingestion
• Data integration
• Data retrieval
• Data quality
• Data Security
One of the most important services provided by operational databases (also called data stores) is persistence. Persistence guarantees that the data stored in a database won't be changed without permission and that it will be available if it is important to the business.
Given this most important requirement, you must then think about what kind of data you want
to persist, how can you access and update it, and how can you use it to make business decisions.
At this most fundamental level, the choice of your database engines is critical.
The forefather of persistent data stores is the relational database management system, or
RDBMS. The relational model is still in wide usage today and has an important role to play in the
evolution of big data.
Relational databases are built on one or more relations and are represented by tables. As the name implies, normalized data has been converted from its native format into a shared, agreed-upon format. To achieve a consistent view of the information, the fields need to be normalized to one form or the other.
Over the years, the structured query language (SQL) has evolved in lock step with RDBMS
technology and is the most widely used mechanism for creating, querying, maintaining, and
operating relational databases. These tasks are referred to as CRUD: Create, retrieve, update, and
delete are common, related operations you can use directly on a database or through an
application programming interface (API).
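As a small illustration (again with the built-in sqlite3 module and an invented table), the four CRUD operations map directly onto SQL statements:

    # CRUD with sqlite3: create, retrieve, update, delete.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    cur.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

    cur.execute("INSERT INTO products VALUES (1, 'sensor', 9.99)")          # Create
    print(cur.execute("SELECT * FROM products WHERE id = 1").fetchall())    # Retrieve
    cur.execute("UPDATE products SET price = 12.50 WHERE id = 1")           # Update
    cur.execute("DELETE FROM products WHERE id = 1")                        # Delete
    conn.commit()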
Nonrelational databases do not rely on the table/key model endemic to RDBMSs. One emerging, popular class of nonrelational database is called "not only SQL" (NoSQL). Nonrelational database technologies have the following characteristics in common:
• Scalability: capability to write data across multiple data stores simultaneously without
regard to physical limitations of the underlying infrastructure.
• Data and Query model: Instead of the row, column, key structure, nonrelational databases use specialty frameworks to store data with a requisite set of specialty query APIs to intelligently access the data (see the sketch after this list).
• Persistence design: Persistence is still a critical element in nonrelational databases.
• Interface diversity: Although most of these technologies support RESTful APIs as their
“go to” interface, they also offer a wide variety of connection mechanisms for
programmers and database managers, including analysis tools and reporting/visualization.
• Eventual Consistency: While an RDBMS uses ACID (Atomicity, Consistency, Isolation, Durability) as a mechanism for ensuring the consistency of data, nonrelational DBMSs use BASE (Basically Available, Soft state, Eventual consistency).
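As a very rough, in-memory sketch (plain Python, not a real NoSQL engine), a document-style store keeps whole records under a key instead of rows and columns, and is queried through its own API rather than SQL:

    # Toy document store: records are schemaless dictionaries kept under a key.
    store = {}

    def put(key, document):
        """Write a document; documents in the same store may have different fields."""
        store[key] = document

    def find(predicate):
        """Query API: return every document that satisfies the predicate."""
        return [doc for doc in store.values() if predicate(doc)]

    put("u1", {"name": "Ana", "city": "Madrid", "tags": ["vip"]})
    put("u2", {"name": "Luis", "followers": 120})        # different structure, no schema

    print(find(lambda d: d.get("city") == "Madrid"))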
Distributed data storage is a computer network where data or information is stored on more than one node (or replicated across nodes); distributed databases are databases that quickly retrieve data over many nodes. Distributed data stores offer increased availability of data at the expense of consistency. We will come back to these ideas when speaking about the big data platform called Hadoop and specifically HDFS (the Hadoop Distributed File System).
5.DATA MANAGEMENT: DATA INGESTION
Data ingestion is the process of acquiring and importing data into a data store or a database. If the data is ingested in real time, each record is pushed into the database as it is emitted.
• Data-in-motion (real time): analyzed as it is generated.
• Data-at-rest (batch process): collected prior to analysis.
Streaming data is data that is generated continuously by thousands of data sources, which
typically send in the data records simultaneously, and in small sizes. This data needs to be
processed sequentially and incrementally on a record-by-record basis or over sliding time
windows.
In data streaming systems, computation happens in real time on one data element, or a small window of data elements, at a time (see the sketch below).
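A minimal Python sketch of record-by-record processing over a sliding window, using an invented stream of sensor readings:

    # Sliding-window processing of a stream, one record at a time.
    from collections import deque

    WINDOW_SIZE = 3
    window = deque(maxlen=WINDOW_SIZE)     # keeps only the last WINDOW_SIZE records

    def process(record):
        """Handle each record incrementally as it arrives."""
        window.append(record)
        moving_average = sum(window) / len(window)
        print(f"record={record}  moving_average={moving_average:.2f}")

    sensor_stream = [10, 12, 11, 15, 30, 14]   # in a real system this never ends
    for reading in sensor_stream:
        process(reading)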
Data retrieval is the process of searching for, identifying, and extracting the required data from a database. A database is typically designed to make transactional systems run efficiently.
A data warehouse is a type of database that integrates copies of transaction data from disparate source systems and provisions them for analytical use. In a traditional data warehouse, the data is loaded into the warehouse after transforming it into a well-defined and structured format: this is called schema on write.
A data lake is a massive storage repository with huge processing power and the ability to handle a very large number of concurrent data management and analytical tasks. In a data lake, data is not transformed into a warehouse format unless there is a use for it. Data lakes ensure all data is stored for a potentially unknown later use: schema on read.
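The sketch below, on invented event records, contrasts the two approaches: a warehouse-style load validates and structures each record on write, while a lake-style load stores the raw records and applies the schema only when they are read:

    # Schema on write vs. schema on read, illustrated with invented event records.
    import json

    raw_events = ['{"user": "ana", "amount": "10.5"}',
                  '{"user": "luis"}']                 # incomplete record

    # Schema on write (warehouse): transform/validate before storing.
    warehouse = []
    for line in raw_events:
        event = json.loads(line)
        if "user" in event and "amount" in event:     # enforce the schema now
            warehouse.append({"user": event["user"], "amount": float(event["amount"])})

    # Schema on read (lake): store everything raw, interpret it at query time.
    lake = list(raw_events)
    amounts = [float(json.loads(line).get("amount", 0)) for line in lake]

    print(warehouse)
    print(amounts)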
8. DATA MANAGEMENT: DATA STORAGE AND RETRIEVAL INFRASTRUCTURE. SCALING
Depending on the type of physical storage system, storage and access times increase. Using these criteria, we can build a memory hierarchy.
Data quality management includes the following tasks:
• Data Profiling
• Data Parsing and Standardization
• Data Matching and Data Cleansing
Data profiling provides the metrics and reports that business information owners need to
continuously measure, monitor, track, and improve data quality at multiple points across the
organization.
Data parsing and standardization typically provides data standardization capabilities, enabling
data analysts to standardize and validate their customer data.
Data matching is the identification of potential duplicates for account, contact, and prospect records (a small sketch follows).
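A short pandas sketch on an invented contact table illustrates simple profiling and duplicate matching (the normalization rule is just an example):

    # Data quality sketch: simple profiling plus duplicate matching on contacts.
    import pandas as pd

    contacts = pd.DataFrame({
        "name":  ["Ana Perez", "ana perez ", "Luis Gil"],
        "email": ["ana@x.com", "ana@x.com", None],
    })

    # Profiling: basic metrics about completeness and distinct values.
    print(contacts.isna().mean())        # share of missing values per column
    print(contacts.nunique())            # number of distinct values per column

    # Parsing/standardization: normalize the name field before matching.
    contacts["name_norm"] = contacts["name"].str.strip().str.lower()

    # Matching: flag potential duplicate records on the normalized key.
    contacts["is_duplicate"] = contacts.duplicated(subset=["name_norm"], keep=False)
    print(contacts)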
We need to secure:
• Machines
• Data transfer across different phases of data operation