Unit 4 LT
Big data is a term for data sets that are so large or complex that traditional data
processing applications are inadequate to deal with them. Challenges include analysis,
capture, data curation, search, sharing, storage, transfer, visualization, querying, updating
and information privacy. The term "big data" often refers simply to the use of predictive
analytics, user behavior analytics, or certain other advanced data analytics methods that
extract value from data, and seldom to a particular size of data set.
Big data is commonly characterized by the following "Vs":
Volume: The quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can actually be considered big data or not.
Variety: The type and nature of the data. This helps people who analyze it to use the resulting insight effectively.
Velocity: The speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development.
Variability: Inconsistency of the data set can hamper the processes that handle and manage it.
Veracity: The quality of captured data can vary greatly, affecting the accuracy of analysis.
Implementing big data also poses several challenges:
2. The Big Data Talent Gap – The excitement around big data applications seems to imply that there is a broad community of experts available to help in implementation. However, this is not yet the case, and the talent gap poses our second challenge.
3. Getting Data into the Big Data Platform – The scale and variety of data to be absorbed
into a big data environment can overwhelm the unprepared data practitioner, making data
accessibility and integration our third challenge.
4. Synchronization Across the Data Sources – As more data sets from diverse sources are
incorporated into an analytical platform, the potential for time lags to impact data currency
and consistency becomes our fourth challenge.
5. Getting Useful Information out of the Big Data Platform – Lastly, using big data for
different purposes ranging from storage augmentation to enabling high-performance analytics
is impeded if the information cannot be adequately provisioned back within the other
components of the enterprise information architecture, making big data syndication our fifth
challenge.
Big data vs. conventional data:
- Big data: unstructured data such as text, video, and audio. Conventional data: normally structured data such as numbers and categories, though it can take other forms as well.
- Big data: needs tools such as Hadoop, Hive, HBase, Pig, Sqoop, and so on. Conventional data: tools such as SQL, SAS, R, and Excel alone may be sufficient.
- Big data: analysis needs both programming skills (such as Java) and analytical skills. Conventional data: analytical skills are sufficient; advanced analysis tools don't require expert programming skills.
The amount of data organizations process continues to increase, and the old methods for handling data no longer work. Hence, several technologies have emerged that make taming the big data tidal wave possible, such as MPP, the cloud, grid computing, and MapReduce.
Traditional Analytic Architecture- All the data had to be pulled together into a separate analytics environment to do analysis.
Modern In-Database Architecture- The processing stays in the database, where the data has been consolidated.
Massively Parallel Processing:
An MPP database breaks the data into independent chunks, each with its own disk and CPU.
Concurrent Processing: An MPP system allows the different sets of CPU and disk to run the
process concurrently.
MPP systems build in redundancy to make recovery easy. MPP systems have resource
management tools.
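As a rough single-machine illustration of the MPP idea, the following Python sketch splits a toy data set into independent chunks and processes them concurrently in separate worker processes; the workload (summing numbers) is made up for the example.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each worker handles its own chunk independently,
    # mimicking an MPP unit with its own CPU and disk.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers)]

    # Concurrent processing: the chunks are worked on in parallel.
    with Pool(n_workers) as pool:
        partials = pool.map(process_chunk, chunks)

    print("total:", sum(partials))  # combine the partial results
```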
Public Cloud:
- The services and infrastructure are provided off-site over the internet
- Greatest level of efficiency in shared resources
- Less secure and more vulnerable than private clouds
Private Cloud:
- Infrastructure operated solely for a single organization
- Offers the same features as a public cloud
- Offers the greatest level of security and control
- Requires purchasing and owning the entire cloud infrastructure
A grid is a shared collection of reliable resources (tightly coupled clusters) and unreliable resources (loosely coupled machines), along with interactively communicating researchers from different virtual organisations (doctors, biologists, physicists).
The grid system controls and coordinates the integrity of the grid by balancing the usage of reliable and unreliable resources among its participants, providing better quality of service.
A grid typically consists of:
- At least one computer, usually a server, which handles all the administrative duties for the system
- A network of computers running special grid computing network software
- A collection of computer software called middleware
Map function- Processes a key/value pair to generate a set of intermediate key/value pairs.
Reduce function- Merges all intermediate values associated with the same intermediate key.
Let's assume there are 20 terabytes of data and 20 MapReduce server nodes for a project.
- Distribute a terabyte to each of the 20 nodes using a simple file copy process.
- Submit the two programs (map, reduce) to the scheduler.
- The map program finds its data on disk and executes the logic it contains.
- The results of the map step are then passed to the reduce process to summarize and aggregate the final answers.
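To make the map and reduce roles concrete, here is a minimal single-machine sketch of the classic word-count job in Python; the function names and the in-memory shuffle step are illustrative stand-ins for what a real MapReduce framework such as Hadoop distributes across nodes.

```python
from collections import defaultdict

# Map: emit an intermediate (word, 1) pair for every word in a line.
def map_fn(line):
    return [(word, 1) for word in line.split()]

# Reduce: merge all intermediate values that share the same key.
def reduce_fn(word, counts):
    return (word, sum(counts))

lines = ["big data big analytics", "data tidal wave"]

# Shuffle: group intermediate pairs by key (the framework does this in practice).
groups = defaultdict(list)
for line in lines:
    for word, count in map_fn(line):
        groups[word].append(count)

results = [reduce_fn(word, counts) for word, counts in groups.items()]
print(results)  # e.g. [('big', 2), ('data', 2), ('analytics', 1), ...]
```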
Strengths and Weaknesses:
Good for- batch processing of very large data sets that can be split into independent chunks and processed in parallel on many commodity machines.
Bad for- low-latency, interactive queries and iterative algorithms that require many passes over the same data or heavy communication between nodes.
Big data analytics examines large amounts of data to uncover hidden patterns, correlations
and other insights.
Web analytics is the measurement, collection, analysis and reporting of web data for
purposes of understanding and optimizing web usage.
Data Analytics is the science of analysis, where statistics, data mining, data processing, and even computer technology are utilized to break down data and come up with conclusive insights and information.
Big data analytics helps organizations harness their data and use it to identify new
opportunities. That, in turn, leads to smarter business moves, more efficient operations,
higher profits and happier customers.
One key benefit is cost reduction.
• Descriptive: A set of techniques for reviewing and examining the data set(s) to
understand the data and analyze business performance.
• Diagnostic: A set of techniques for determining what has happened and why.
• Predictive: A set of techniques that analyze current and historical data to determine what is most likely to (or not to) happen.
While both areas are part of web analytics (note that analytics is not the same as analysis), there is a vast difference between them-
1. Purpose
Reporting has helped companies monitor their data since before digital technology boomed. Various organizations have depended on the information it brings to their business, as reporting extracts that information and makes it easier to understand.
Analysis interprets data at a deeper level. While reporting can link cross-channel data, provide comparisons, and make information easier to understand (think of dashboards, charts, and graphs, which are reporting tools, not analysis outputs), analysis interprets this information and provides recommendations on actions.
2. Tasks
Here's a great differentiator to keep in mind when deciding whether what you're doing is reporting or analysis:
3. Outputs
Reporting and analysis have push and pull effects on their users through their outputs.
Reporting has a push approach: it pushes information to users, and outputs come in the form of canned reports, dashboards, and alerts.
Analysis has a pull approach: a data analyst pulls information to probe further and to answer business questions. Outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations consist of insights, recommended actions, and a forecast of their impact on the company, all in a language that's easy to understand at the level of the user who'll be reading and deciding on it.
This matters because, for organizations to truly realize the value of their data, a standard report must not be mistaken for meaningful analytics.
4. Delivery
Analysis requires a more custom approach, with human minds doing the superior reasoning and analytical thinking needed to extract insights, and the technical skills to provide efficient steps toward a specific goal. This is why data analysts and scientists are in demand these days, as organizations depend on them to come up with recommendations that help leaders and business executives make decisions about their businesses.
5. Value
This isn't about identifying which one brings more value, but rather about understanding that both are indispensable when looking at the big picture. Together they should help businesses grow, expand, move forward, and make more profit or increase their value.
The Path to Value diagram illustrates how data converts into value through reporting and analysis, such that the value is not achievable by one without the other.
Data alone is useless, and action without data is baseless. Both reporting and analysis are
vital to bringing value to your data and operations.
Data analysis is about breaking data down and assessing the impact of the patterns it reveals over time.
Qubole
Qubole simplifies, speeds up, and scales big data analytics workloads against data stored on
AWS, Google, or Azure clouds. This cloud-based data platform self-manages, self-optimizes
and learns to improve automatically and as a result delivers unbeatable agility, flexibility, and
TCO.
BigML
BigML is attempting to simplify machine learning. They offer a powerful Machine Learning
service with an easy-to-use interface for you to import your data and get predictions out of it.
You can even use their models for predictive analytics.
Statwing
Statwing takes data analysis to a new level, providing everything from beautiful visuals to
complex analysis. Statwing selects statistical tests with the goal of making statistical testing
intuitive and error-free.
Domo
Domo helps you put your data in one place, so you have access to the numbers you need to
generate analysis, visualize changes, and make decisions. With plenty of visualization and
collaboration tools, Domo is trying to make spreadsheets a thing of the past.
ThoughtSpot
By using the power of relational search – like the AI behind Google – anyone can search for
terms and instantly find the data they need. ThoughtSpot will even help a user visualize and
share that data to inform decision making.
SAMPLING DISTRIBUTIONS
Population and sample: a population can be of two classes- finite and infinite.
Population- A set or collection of all the objects, actual or conceptual, mainly the set of numbers, measurements, or observations under investigation.
Infinite Population: e.g., all the water in the sea or all the sand particles on the seashore.
Populations are often described by the distributions of their values, and it is common practice
to refer to a population in terms of its distribution.
If a population is infinite it is impossible to observe all its values, and even if it is finite it
may be impractical or uneconomical to observe it in its entirety. Thus it is necessary to use a
sample.
Sample: A part of the population collected for investigation, which needs to be representative of the population and large enough to contain all relevant information about it.
1. Random Sample (finite population): A set of observations X1, X2, …, Xn constitutes a random sample of size n from a finite population of size N if its values are chosen so that each subset of n of the N elements of the population has the same probability of being selected.
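As a quick illustration, here is a minimal Python sketch of drawing a random sample from a finite population; the population of exam scores is made up for the example.

```python
import random

# A hypothetical finite population of N = 10 exam scores.
population = [62, 75, 81, 90, 58, 77, 85, 69, 73, 88]

# Draw a random sample of size n = 4: random.sample chooses without
# replacement, so every subset of 4 of the 10 elements is equally likely.
sample = random.sample(population, 4)
print("sample:", sample)
print("sample mean:", sum(sample) / len(sample))
```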
In statistics, resampling refers to a variety of methods that draw repeatedly from, or rearrange, the observed data. One important example is the permutation test.
A permutation test (also called a randomization test, re-randomization test, or an exact test)
is a type of statistical significance test in which the distribution of the test statistic under the
null hypothesis is obtained by calculating all possible values of the test statistic under
rearrangements of the labels on the observed data points.
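Here is a minimal sketch of a two-sample permutation test in Python, using a made-up pair of small samples and the difference of means as the test statistic; for larger samples one would sample rearrangements at random rather than enumerate them all.

```python
from itertools import combinations

# Two hypothetical samples; under the null hypothesis the group
# labels are exchangeable.
group_a = [4.1, 5.0, 6.2]
group_b = [3.0, 3.8, 4.4]
pooled = group_a + group_b
n_a = len(group_a)

def stat(a, b):
    # Test statistic: difference of sample means.
    return sum(a) / len(a) - sum(b) / len(b)

observed = stat(group_a, group_b)

# Enumerate every rearrangement of the labels, computing the test
# statistic under each one.
count = total = 0
for idx in combinations(range(len(pooled)), n_a):
    a = [pooled[i] for i in idx]
    b = [pooled[i] for i in range(len(pooled)) if i not in idx]
    total += 1
    if abs(stat(a, b)) >= abs(observed):
        count += 1

print("exact two-sided p-value:", count / total)
```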
STATISTICAL INFERENCE
a. Confidence intervals
b. Hypothesis testing
Confidence intervals
• Allow us to use sample data to estimate a population value, like the true mean or the
true proportion, i.e. estimate parameters.
• Example: What is the current mean GPA of U.S. college & university students?
Hypothesis testing
• Allows us to use sample data to test a claim about a population, such as testing
whether a population proportion or population mean equals some number.
• Example: Is the mean GPA of U.S. college & university students today larger than 2.70, the mean GPA in 1990?
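To make both ideas concrete, here is a minimal Python sketch using scipy; the GPA sample is fabricated for illustration, and 2.70 plays the role of the hypothesized 1990 mean from the example above.

```python
from scipy import stats

# A hypothetical sample of 10 current student GPAs.
gpas = [2.9, 3.1, 2.6, 3.4, 2.8, 3.0, 2.7, 3.3, 2.5, 3.2]

# Confidence interval: estimate the true mean GPA from the sample.
n = len(gpas)
mean = sum(gpas) / n
sem = stats.sem(gpas)  # standard error of the mean
ci = stats.t.interval(0.95, df=n - 1, loc=mean, scale=sem)
print("95% CI for the mean GPA:", ci)

# Hypothesis test: is today's mean GPA larger than 2.70 (the 1990 mean)?
t_stat, p_two_sided = stats.ttest_1samp(gpas, popmean=2.70)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print("t =", t_stat, "one-sided p =", p_one_sided)
```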
PREDICTION ERROR
Prediction attempts to form patterns that permit predicting the next event(s) given the available input data.
Prediction error captures the effect of noise through the concept of training error versus test error. We fit our model to the training set, then apply it to new data the model hasn't seen. In general, the more data you have, the bigger the sample size and the more information available, and so the lower the error.
Training: Training error is the error we get when applying the model to the same data on which it was trained.
Test: Test error is the error that we incur on new data. The test error is actually how well
we'll do on future data the model hasn't seen.
Training error usually underestimates test error when the model is very complex (compared to the training set size), and is a pretty good estimate when the model is not very complex. However, it's always possible to end up with too few hard-to-predict points in the test set, or too many in the training set; then the test error can be less than the training error, simply because by chance the test set contains easier cases than the training set.
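A minimal sketch of training versus test error in Python using numpy; the data is synthetic (a noisy line), and an overly complex polynomial fit stands in for the "very complex model" case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a noisy linear relationship.
x = rng.uniform(0, 1, 40)
y = 2 * x + 1 + rng.normal(0, 0.3, 40)

# Split into a training set and a held-out test set.
x_train, y_train = x[:20], y[:20]
x_test, y_test = x[20:], y[20:]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

for degree in (1, 10):  # simple model vs. very complex model
    coeffs = np.polyfit(x_train, y_train, degree)       # fit on training data
    train_err = mse(y_train, np.polyval(coeffs, x_train))
    test_err = mse(y_test, np.polyval(coeffs, x_test))  # error on unseen data
    print(f"degree {degree}: training error {train_err:.3f}, "
          f"test error {test_err:.3f}")
```

The complex model typically shows a much lower training error than test error, illustrating how training error underestimates test error when the model is very complex relative to the training set size.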