NOTES- BIG DATA ANALYTICS UNIT I, II, III
QUESTION BANK
PART-A
1. What is Big Data?
Big data is a field that treats ways to analyze, systematically extract information from,
or otherwise deal with data sets that are too large or complex to be dealt with by
traditional data-processing application software.
2. List out the best practices of Big Data Analytics.
1. Start at the End
2. Build an Analytical Culture.
3. Re-Engineer Data Systems for Analytics
4. Focus on Useful Data Islands.
5. Iterate often.
3. Write down the characteristics of Big Data Applications.
a) Data Throttling
b) Computation-restricted throttling
c) Large Data Volumes
d) Significant Data Variety
e) Benefits from Data parallelization
4. Write down the four computing resources of Big Data Storage.
a) Processing Capability
b) Memory
c) Storage
d) Network
5. What is HDFS?
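HDFS (Hadoop Distributed File System) is the distributed storage component of Apache
Hadoop. It stores large data sets across a cluster of commodity machines. Apache Hadoop
itself is a collection of open-source software utilities that facilitate using a
network of many computers to solve problems involving massive amounts of data and
computation. It provides a software framework for distributed storage and processing
of big data using the MapReduce programming model.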
6. What is MapReduce?
MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map
and Reduce. Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). Reduce takes the
output of Map as its input and combines those tuples into a smaller set of tuples.
7. What is YARN?
YARN is an Apache Hadoop technology and stands for Yet Another Resource
Negotiator. YARN is a large-scale, distributed operating system for big data
applications. YARN is a software rewrite that is capable of decoupling MapReduce's
resource management and scheduling capabilities from the data processing
component.
8. What is Map Reduce Programming Model?
MapReduce is a programming model and an associated implementation for processing
and generating big data sets with a parallel, distributed algorithm on a cluster.
The model is a specialization of the split-apply-combine strategy for data analysis.
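The Map and Reduce steps can be illustrated with a word count. This is a minimal single-machine sketch, not Hadoop code: the function names (map_phase, shuffle, reduce_phase) and the sample lines are assumptions made for illustration, and a real cluster would run the map and reduce calls in parallel on different nodes.

```python
from collections import defaultdict

def map_phase(line):
    """Map: break one input line into (word, 1) key/value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Apply Map to every line, group by key, then apply Reduce per key.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

The split (lines), apply (map_phase) and combine (reduce_phase) stages correspond to the split-apply-combine strategy mentioned above.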
9. What are the characteristics of big data?
Big data can be described by the following characteristics:
Volume - The quantity of generated and stored data. The size of the data
determines its value and potential insight, and whether it can actually be considered
big data or not.
Variety - The type and nature of the data. Knowing this helps the people who analyze it to
effectively use the resulting insight.
Velocity - The speed at which the data is generated and processed to
meet the demands and challenges that lie in the path of growth and development.
Variability - Inconsistency of the data set can hamper the processes that handle and
manage it.
Veracity - The quality of captured data can vary greatly, affecting the accuracy of the
analysis.
10. What is Big Data Platform?
• A Big Data Platform is an integrated IT solution for Big Data management which
combines several software systems, software tools and hardware to provide
easy-to-use tools to enterprises.
• It is a one-stop solution for all the Big Data needs of an enterprise, irrespective of
size and data volume. A Big Data Platform is an enterprise-class IT solution for
developing, deploying and managing Big Data.
PART -B & C
1. What is Bigdata? Describe the main features of a big data in detail.
Basics of Bigdata Platform
A Big Data platform is an IT solution which combines several Big Data tools
and utilities into one packaged solution for managing and analyzing Big
Data.
It combines the features and capabilities of several big data applications and
utilities within a single solution.
It is an enterprise-class IT platform that enables an organization to develop,
deploy, operate and manage a big data infrastructure/environment.
Big data is a term for data sets that are so large or complex that traditional data
processing applications are inadequate.
Challenges include
Analysis,
Capture,
Data Curation,
Search,
Sharing,
Storage,
Transfer,
Visualization,
Querying,
Updating,
Information Privacy.
The term often refers simply to the use of predictive analytics or certain other
advanced methods to extract value from data, and seldom to a particular size of
data set.
Accuracy in big data may lead to more confident decision making, and
better decisions can result in greater operational efficiency, cost reduction and
reduced risk.
Big data usually includes data sets with sizes beyond the ability of commonly used
software tools to capture, curate, manage, and process data within a tolerable
elapsed time. Big data "size" is a constantly moving target.
Big data requires a set of techniques and technologies with new forms of
integration to reveal insights from datasets that are diverse, complex, and of a
massive scale.
a) Hadoop
Hadoop is an open-source, Java-based programming framework and server
software which is used to store and analyze data with the help of hundreds or even
thousands of commodity servers in a clustered environment.
Hadoop is designed to store and process large datasets quickly and in a
fault-tolerant way.
Hadoop uses HDFS (Hadoop Distributed File System) for storing data on a cluster of
commodity computers. Data is replicated across servers, so if any server goes down
there is no loss of data even in the case of hardware failure.
Hadoop is an Apache-sponsored project and it consists of many software packages
which run on top of the Apache Hadoop system.
Top Hadoop based Commercial Big Data Analytics Platform
Hadoop provides a set of tools and software that form the backbone of a Big
Data analytics system.
The Hadoop ecosystem provides the necessary tools and software for handling and
analyzing Big Data.
On top of the Hadoop system many applications can be developed and
plugged in to provide a suitable solution for Big Data needs.
b) Cloudera
Cloudera is one of the first commercial Hadoop-based Big Data Analytics
platforms offering Big Data solutions.
Its product range includes Cloudera Analytic DB, Cloudera Operational DB,
Cloudera Data Science & Engineering and Cloudera Essentials.
All these products are based on Apache Hadoop and provide real-time
processing and analytics of massive data sets.
e) MapR
MapR is another Big Data platform, which uses the Unix file system for
handling data.
It does not use HDFS, and the system is easy to learn for anyone familiar with the
Unix system.
This solution integrates Hadoop, Spark, and Apache Drill with a real-time data
processing feature.
(iii) Velocity – The term 'velocity' refers to the speed of generation of data. How fast
the data is generated and processed to meet the demands, determines real potential in
the data.
Big Data Velocity deals with the speed at which data flows in from sources like
business processes, application logs, networks, and social media sites, sensors, Mobile
devices, etc. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency which can be shown by the data at
times, thus hampering the process of being able to handle and manage the data
effectively.
Benefits of Big Data Processing
The ability to process Big Data brings multiple benefits, such as:
Businesses can utilize outside intelligence while making decisions
Access to social data from search engines and sites like Facebook and Twitter is enabling
organizations to fine-tune their business strategies.
Improved customer service
Traditional customer feedback systems are getting replaced by new systems designed
with Big Data technologies. In these new systems, Big Data and natural language
processing technologies are being used to read and evaluate consumer responses.
Early identification of risk to the product/services, if any
Better operational efficiency
Big Data technologies can be used for creating a staging area or landing zone for new
data before identifying what data should be moved to the data warehouse. In addition,
such integration of Big Data technologies and data warehouse helps an organization to
offload infrequently accessed data.
3. Explain in detail about Nature of Data and its applications.
Data
Data is a set of values of qualitative or quantitative variables; restated, pieces
of data are individual pieces of information.
Data is measured, collected and reported, and analyzed, whereupon it can be
visualized using graphs or images.
Properties of Data
The properties of data can be examined by referring to the various definitions of data.
These definitions reveal that the following are the properties of data:
a) Amenability of use
b) Clarity
c) Accuracy
d) Essence
e) Aggregation
f) Compression
g) Refinement
Amenability of use: From the dictionary meaning of data it is learnt that data are
facts used in deciding something. In short, data are meant to be used as a base for
arriving at definitive conclusions.
TYPES OF DATA
In order to understand the nature of data it is necessary to categorize them into
various types.
Different categorizations of data are possible.
The first such categorization may be on the basis of disciplines, e.g.,
Sciences, Social Sciences, etc. in which they are generated.
Within each of these fields, there may be several ways in which data can be
categorized into types.
The distinction between the four types of scales centers on three different characteristics:
1. The order of responses – whether it matters or not
2. The distance between observations – whether it matters or is interpretable
3. The presence or inclusion of a true zero
Nominal Scales
Nominal scales measure categories and have the following characteristics:
Order: The order of the responses or observations does not matter.
Distance: Nominal scales do not hold distance. The distance between a 1 and a
2 is not the same as between a 2 and a 3.
True Zero: There is no true or real zero. In a nominal scale, zero is
uninterpretable.
Appropriate statistics for nominal scales: mode, count, frequencies
Displays: histograms or bar charts
Ordinal Scales
At the risk of providing a tautological definition, ordinal scales measure, well, order.
So, our characteristics for ordinal scales are:
Order: The order of the responses or observations matters.
Distance: Ordinal scales do not hold distance. The distance between first and
second is unknown as is the distance between first and third along with all
observations.
True Zero: There is no true or real zero. An item, observation, or category
cannot finish zero.
Appropriate statistics for ordinal scales: count, frequencies, mode
Displays: histograms or bar charts
Interval Scales
Interval scales provide insight into the variability of the observations or data.
Classic interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly
disagree) and Semantic Differential scales (e.g., 1 - dark and 9 - light).
In an interval scale, users could respond to “I enjoy opening links to the website from a
company email” with a response ranging on a scale of values.
Appropriate statistics for interval scales: count, frequencies, mode, median, mean,
standard deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
Ratio Scales
Ratio scales appear as interval scales with a true zero.
They have the following characteristics:
Order: The order of the responses or observations matters.
Distance: Ratio scales do have an interpretable distance.
True Zero: There is a true zero.
Income is a classic example of a ratio scale:
Order is established. We would all prefer $100 to $1!
Zero dollars means we have no income (or, in accounting terms, our revenue
exactly equals our expenses!)
Distance is interpretable, in that $20 is twice $10 and $50 is half of
$100.
For the web analyst, the statistics for ratio scales are the same as for interval scales.
Appropriate statistics for ratio scales: count, frequencies, mode, median, mean,
standard deviation (and variance), skewness, and kurtosis.
Displays: histograms or bar charts, line charts, and scatter plots.
HDFS is a distributed file system that handles large data sets running on commodity
hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even
thousands) of nodes. HDFS is one of the major components of Apache Hadoop, the
others being MapReduce and YARN. HDFS should not be confused with or replaced
by Apache HBase, which is a column-oriented non-relational database management
system that sits on top of HDFS and can better support real-time data needs with its
in-memory processing engine.
Fast recovery from hardware failures
Because one HDFS instance may consist of thousands of servers, failure of at least
one server is inevitable. HDFS has been built to detect faults and automatically
recover quickly.
Access to streaming data
HDFS is intended more for batch processing versus interactive use, so the emphasis in
the design is for high data throughput rates, which accommodate streaming access to
data sets.
Accommodation of large data sets
HDFS accommodates applications that have data sets typically gigabytes to terabytes
in size. HDFS provides high aggregate data bandwidth and can scale to hundreds of
nodes in a single cluster.
Portability
To facilitate adoption, HDFS is designed to be portable across multiple hardware
platforms and to be compatible with a variety of underlying operating systems.
A decision tree (also called prediction tree) uses a tree structure to specify sequences
of decisions and consequences. Given input $X = \{x_1, x_2, \ldots, x_n\}$, the goal is to
predict a response or output variable $Y$. Each member of the set
$\{x_1, x_2, \ldots, x_n\}$ is called an input variable. The prediction can be
achieved by constructing a decision tree with test points and branches. At each test
point, a decision is made to pick a specific branch and traverse down the tree.
Eventually, a final point is reached, and a prediction can be made. Each test point in a
decision tree involves testing a particular input variable (or attribute), and each branch
represents the decision being made. Due to its flexibility and easy visualization,
decision trees are commonly deployed in data mining applications for classification
purposes.
The input values of a decision tree can be categorical or continuous. A decision tree
employs a structure of test points (called nodes) and branches, which represent the
decision being made. A node without further branches is called a leaf node. The leaf
nodes return class labels and, in some implementations, they return the probability
scores. A decision tree can be converted into a set of decision rules. In the following
example rule, income and mortgage_amount are input variables, and the response is
the output variable default with a probability score.
Decision trees have two varieties: classification trees and regression trees.
Classification trees usually apply to output variables that are categorical (often binary)
in nature, such as yes or no, purchase or not purchase, and so on. Regression trees,
on the other hand, can apply to output variables that are numeric or continuous,
such as the predicted price of a consumer good or the likelihood a subscription will be
purchased.
Decision trees can be applied to a variety of situations. They can be easily represented
in a visual way, and the corresponding decision rules are quite straightforward.
Additionally, because the result is a series of logical if-then statements, there is no
underlying assumption of a linear (or nonlinear) relationship between the input
variables and the response variable.
The term branch refers to the outcome of a decision and is visualized as a line
connecting two nodes. If a decision is numerical, the “greater than” branch is usually
placed on the right, and the “less than” branch is placed on the left. Depending on the
nature of the variable, one of the branches may need to include an “equal to”
component.
Internal nodes are the decision or test points. Each internal node refers to an input
variable or an attribute. The top internal node is called the root. The decision tree in
Figure is a binary tree in that each internal node has no more than two branches. The
branching of a node is referred to as a split.
Sometimes decision trees may have more than two branches stemming from a node.
For example, if an input variable Weather is categorical and has three choices—
Sunny, Rainy, and Snowy—the corresponding node Weather in the decision tree may
have three branches labelled as Sunny, Rainy, and Snowy, respectively.
The depth of a node is the minimum number of steps required to reach the node from
the root. In Figure for example, nodes Income and Age have a depth of one, and the
four nodes on the bottom of the tree have a depth of two.
Leaf nodes are at the end of the last branches on the tree. They represent class labels
— the outcome of all the prior decisions. The path from the root to a leaf node
contains a series of decisions made at various internal nodes.
In Figure the root node splits into two branches with a Gender test. The right branch
contains all those records with the variable Gender equal to Male, and the left branch
contains all those records with the variable Gender equal to Female to create the depth
1 internal nodes. Each internal node effectively acts as the root of a subtree, and a best
test for each node is determined independently of the other internal nodes. The left-
hand side (LHS) internal node splits on a question based on the Income variable to
create leaf nodes at depth 2, whereas the right-hand side (RHS) splits on a question on
the Age variable.
The decision tree shows that females with income less than or equal to $45,000 and
males 40 years old or younger are classified as people who would purchase the
product. In traversing this tree, age does not matter for females, and income does not
matter for males.
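The tree described above is equivalent to a short series of if-then statements. The sketch below encodes only what the text states (the Gender split at the root, the $45,000 income threshold for females, the age-40 threshold for males); the function name will_purchase and the exact string values for Gender are assumptions made for illustration.

```python
def will_purchase(gender, income, age):
    """Traverse the two-level tree: Gender at the root, then Income or Age."""
    if gender == "Female":
        # Left branch: only income matters for females.
        return income <= 45000
    if gender == "Male":
        # Right branch: only age matters for males.
        return age <= 40
    raise ValueError("gender must be 'Female' or 'Male' in this sketch")

print(will_purchase("Female", 40000, 70))  # age is ignored on this branch
print(will_purchase("Male", 90000, 35))    # income is ignored on this branch
```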
Decision trees are widely used in practice. For example, to classify animals, questions
(like cold-blooded or warm-blooded, mammal or not mammal) are answered to arrive
at a certain classification. Another example is a checklist of symptoms during a
doctor’s evaluation of a patient. The artificial intelligence engine of a video game
commonly uses decision trees to control the autonomous actions of a character in
response to various scenarios. Retailers can use decision trees to segment
customers or predict response rates to marketing and promotions. Financial
institutions can use decision trees to help decide if a loan application should be
approved or denied. In the case of loan approval, computers can use the logical if-
then statements to predict whether the customer will default on the loan. For
customers with a clear (strong) outcome, no human interaction is required; for
observations that may not generate a clear response, a human is needed for the
decision.
By limiting the number of splits, a short tree can be created. Short trees are often used
as components (also called weak learners or base learners) in ensemble methods.
Ensemble methods use multiple predictive models to vote, and decisions can be made
based on the combination of the votes. Some popular ensemble methods include
random forest [4], bagging, and boosting [5]. Section 7.4 discusses these ensemble
methods more.
The simplest short tree is called a decision stump, which is a decision tree with the
root immediately connected to the leaf nodes. A decision stump makes a prediction
based on the value of just a single input variable.
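A decision stump can be sketched as a single threshold test on one numeric input. The version below is an illustrative assumption, not a reference implementation: it tries every observed value as a threshold and keeps the one with the fewest training errors, and the ages and labels are made-up data.

```python
def fit_stump(values, labels):
    """Return (threshold, left_label, right_label) with the fewest training errors."""
    best = None
    for threshold in sorted(set(values)):
        # Try both label assignments for the "<= threshold" and "> threshold" sides.
        for left_label, right_label in ((0, 1), (1, 0)):
            errors = sum(
                (left_label if v <= threshold else right_label) != y
                for v, y in zip(values, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label, right_label)
    return best[1:]

def predict_stump(stump, value):
    """Predict using the single test stored in the stump."""
    threshold, left_label, right_label = stump
    return left_label if value <= threshold else right_label

# Hypothetical training data: age and whether the customer purchased (1) or not (0).
ages = [22, 25, 31, 38, 45, 52, 60]
bought = [1, 1, 1, 1, 0, 0, 0]
stump = fit_stump(ages, bought)
print(stump, predict_stump(stump, 30), predict_stump(stump, 50))
```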
3. Explain in detail about Naïve Bayes Classification.
A naïve Bayes classifier assumes that the presence or absence of a particular feature of
a class is unrelated to the presence or absence of other features. For example, an
object can be classified based on its attributes such as shape, color, and weight. A
reasonable classification for an object that is spherical, yellow, and less than 60 grams
in weight may be a tennis ball. Even if these features depend on each other or upon the
existence of the other features, a naïve Bayes classifier considers all these properties
to contribute independently to the probability that the object is a tennis ball.
The input variables are generally categorical, but variations of the algorithm can
accept continuous variables. There are also ways to convert continuous variables into
categorical ones. This process is often referred to as the discretization of continuous
variables. In the tennis ball example, a continuous variable such as weight can be
grouped into intervals to be converted into a categorical variable. For an attribute such
as income, the attribute can be converted into categorical values as shown below.
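Discretization by fixed bin edges can be sketched as follows. The bin edges and category labels here are hypothetical values chosen only for illustration; they are not taken from the notes.

```python
import bisect

# Hypothetical bin edges (in dollars) and labels, for illustration only.
EDGES = [10_000, 50_000, 1_000_000]
LABELS = ["low", "working", "middle", "upper"]

def discretize(income):
    """Map a continuous income to a categorical label.

    An income equal to an edge falls into the bin that starts at that edge.
    """
    return LABELS[bisect.bisect_right(EDGES, income)]

for income in (9_999, 10_000, 75_000, 2_000_000):
    print(income, discretize(income))
```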
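The conditional probability of event C occurring, given that event A has already
occurred, is denoted as $P(C \mid A)$, which can be found using the formula
$$P(C \mid A) = \frac{P(A \cap C)}{P(A)}$$
Bayes’ theorem can be obtained with some minor algebra and substitution of the conditional
probability:
$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)}$$
As an example, let C = {John checked in at least two hours early} and
A = {John received an upgrade}.
John checked in at least two hours early only 40% of the time, or $P(C) = 0.4$. Therefore,
$P(\neg C) = 1 - P(C) = 0.6$.
The probability that John received an upgrade given that he checked in early is 0.75, or
$P(A \mid C) = 0.75$.
The probability that John received an upgrade given that he did not arrive two hours
early is 0.35, or $P(A \mid \neg C) = 0.35$. Therefore, the probability that John received
an upgrade can be computed as
$$P(A) = P(A \mid C)\,P(C) + P(A \mid \neg C)\,P(\neg C) = 0.75 \times 0.4 + 0.35 \times 0.6 = 0.51$$
Thus, the probability that John did not receive an upgrade is
$P(\neg A) = 1 - P(A) = 0.49$. Using Bayes’ theorem, the
probability that John did not arrive two hours early given that he did not receive his
upgrade is
$$P(\neg C \mid \neg A) = \frac{P(\neg A \mid \neg C)\,P(\neg C)}{P(\neg A)} = \frac{0.65 \times 0.6}{0.49} \approx 0.796$$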
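The arithmetic of the upgrade example can be checked with a few lines of Python. The inputs (0.40, 0.75, 0.35) come from the text; the variable names are assumptions, with "early" meaning John checked in at least two hours early and "up" meaning he received an upgrade.

```python
p_early = 0.40            # P(C): checked in at least two hours early
p_up_given_early = 0.75   # P(A | C)
p_up_given_late = 0.35    # P(A | not C)

p_late = 1 - p_early      # P(not C)

# Total probability of an upgrade: sum over both check-in cases.
p_up = p_up_given_early * p_early + p_up_given_late * p_late
p_no_up = 1 - p_up

# Bayes' theorem: P(not C | not A) = P(not A | not C) * P(not C) / P(not A)
p_late_given_no_up = (1 - p_up_given_late) * p_late / p_no_up

print(round(p_up, 2), round(p_no_up, 2), round(p_late_given_no_up, 3))
```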
Another example involves computing the probability that a patient carries a disease
based on the result of a lab test. Assume that a patient named Mary took a lab test for
a certain disease and the result came back positive. The test returns a positive result in
95% of the cases in which the disease is actually present, and it returns a positive
result in 6% of the cases in which the disease is not present. Furthermore, 1% of the
entire population has this disease. What is the probability that Mary actually has the
disease, given that the test is positive?
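Let C = {having the disease} and A = {testing positive}. The goal is to solve the
probability of having the disease, given that Mary has a positive test result,
$P(C \mid A)$. From the problem description, $P(C) = 0.01$, $P(\neg C) = 0.99$,
$P(A \mid C) = 0.95$, and $P(A \mid \neg C) = 0.06$. Bayes’ theorem defines
$P(C \mid A) = P(A \mid C)\,P(C) / P(A)$. The probability of testing positive,
that is, $P(A)$, needs to be computed first:
$$P(A) = P(A \mid C)\,P(C) + P(A \mid \neg C)\,P(\neg C) = 0.95 \times 0.01 + 0.06 \times 0.99 = 0.0689$$
According to Bayes’ theorem, the probability of having the disease, given that Mary
has a positive test result, is
$$P(C \mid A) = \frac{P(A \mid C)\,P(C)}{P(A)} = \frac{0.95 \times 0.01}{0.0689} \approx 0.1379$$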
That means that the probability of Mary actually having the disease given a positive
test result is only 13.79%. This result indicates that the lab test may not be a good one.
The likelihood of having the disease was 1% when the patient walked in the door and
only 13.79% when the patient walked out, which would suggest further tests.
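The 13.79% figure can be reproduced with a small helper. The function name posterior and its parameter names are assumptions made for illustration; the three inputs (1% prevalence, 95% positive rate when the disease is present, 6% positive rate when it is absent) come from the problem statement.

```python
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(disease | positive test) by Bayes' theorem."""
    # Total probability of a positive test across diseased and healthy patients.
    p_positive = prior * p_pos_given_disease + (1 - prior) * p_pos_given_healthy
    return prior * p_pos_given_disease / p_positive

p_mary = posterior(prior=0.01, p_pos_given_disease=0.95, p_pos_given_healthy=0.06)
print(f"{p_mary:.2%}")
```

The low posterior is driven mostly by the 1% prior: false positives among the healthy 99% outnumber true positives among the diseased 1%.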
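A more general form of Bayes’ theorem assigns a classified label to an object with
multiple attributes $A = \{a_1, a_2, \ldots, a_m\}$ such that the label corresponds to the
largest value of $P(c_i \mid A)$. The probability that a set of attribute values $A$
(composed of $m$ variables $a_1, a_2, \ldots, a_m$) should be labelled
with a classification label $c_i$ equals the probability that the set of variables
$a_1, a_2, \ldots, a_m$ given $c_i$ is true, times the probability of $c_i$, divided by
the probability of $a_1, a_2, \ldots, a_m$. Mathematically, this is
$$P(c_i \mid A) = \frac{P(a_1, a_2, \ldots, a_m \mid c_i)\,P(c_i)}{P(a_1, a_2, \ldots, a_m)}, \quad i = 1, 2, \ldots, n$$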
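A minimal sketch of a naïve Bayes classifier for categorical attributes follows. It relies on the naïve independence assumption, so the probability of the attribute values given a label is taken as the product of the per-attribute probabilities, and it drops the denominator because it is the same for every label. The function names, the training records and the absence of smoothing are all assumptions of this sketch; without smoothing, an attribute value never seen with a label gives that label a score of zero.

```python
from collections import Counter, defaultdict

def train(records, labels):
    """Count label frequencies and per-label attribute-value frequencies."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (label, attribute index) -> value counts
    for record, label in zip(records, labels):
        for j, value in enumerate(record):
            value_counts[(label, j)][value] += 1
    return label_counts, value_counts

def classify(model, record):
    """Return the label c that maximises P(c) * product over j of P(a_j | c)."""
    label_counts, value_counts = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior P(c)
        for j, value in enumerate(record):
            score *= value_counts[(label, j)][value] / count  # P(a_j | c)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: (shape, color) -> object.
records = [("round", "yellow"), ("round", "yellow"), ("round", "orange"),
           ("long", "yellow"), ("long", "yellow"), ("round", "orange")]
labels = ["tennis ball", "tennis ball", "orange", "banana", "banana", "orange"]

model = train(records, labels)
print(classify(model, ("round", "yellow")))
```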
Decision trees use greedy algorithms, in that they always choose the option that seems
the best available at that moment. At each step, the algorithm selects which attribute to
use for splitting the remaining records. This selection may not be the best overall, but it
is guaranteed to be the best at that step. This characteristic reinforces the efficiency of
decision trees. However, once a bad split is taken, it is propagated through the rest of
the tree. To address this problem, an ensemble technique (such as random forest) may
randomize the splitting or even randomize data and come up with a multiple tree
structure. These trees then vote for each class, and the class with the most votes is
chosen as the predicted class.
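The voting step can be sketched as below. The three "trees" are hypothetical stand-ins, each reduced to a single rule, and the record fields (income, age, visits) are invented for illustration; a real random forest would train each tree on a random sample of the records and features.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the class label that received the most votes."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical, already-trained trees, each reduced to one rule.
trees = [
    lambda r: "buy" if r["income"] <= 45000 else "skip",
    lambda r: "buy" if r["age"] <= 40 else "skip",
    lambda r: "buy" if r["visits"] >= 3 else "skip",
]

record = {"income": 30000, "age": 55, "visits": 5}
votes = [tree(record) for tree in trees]
prediction = majority_vote(votes)
print(votes, prediction)
```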
There are a few ways to evaluate a decision tree. First, evaluate whether the splits of
the tree make sense. Conduct sanity checks by validating the decision rules with
domain experts, and determine if the decision rules are sound.
Next, look at the depth and nodes of the tree. Having too many layers and obtaining
nodes with few members might be signs of overfitting. In overfitting, the model fits
the training set well, but it performs poorly on the new samples in the testing set.
Figure 7.7 illustrates the performance of an overfit model. The x-axis represents the amount
of data, and the y-axis represents the errors. The blue curve is the training set, and the
red curve is the testing set. The left side of the gray vertical line shows that the model
predicts well on the testing set. But on the right side of the gray line, the model
performs worse and worse on the testing set as more and more unseen data is
introduced.
For decision tree learning, overfitting can be caused by either the lack of training data
or the biased data in the training set. Two approaches [10] can help avoid overfitting
in decision tree learning.
• Stop growing the tree early, before it reaches the point where all the training data is
perfectly classified.
• Grow the full tree, and then post-prune the tree with methods such
as reduced-error pruning and rule-based post-pruning.
Last, many standard diagnostics tools that apply to classifiers can help evaluate
overfitting. These tools are further discussed in Section 7.3.
Decision trees are computationally inexpensive, and it is easy to classify the data. The
outputs are easy to interpret as a fixed sequence of simple tests. Many decision tree
algorithms are able to show the importance of each input variable. Basic measures,
such as information gain, are provided by most statistical software packages.
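Information gain for a categorical attribute can be computed as the entropy of the labels minus the weighted entropy of the labels within each attribute value. The sketch below uses the standard base-2 entropy; the function names and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on values."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

labels = [1, 1, 0, 0]
useful = ["a", "a", "b", "b"]       # separates the classes perfectly
irrelevant = ["x", "y", "x", "y"]   # tells nothing about the class
print(information_gain(useful, labels), information_gain(irrelevant, labels))
```

An attribute that separates the classes perfectly recovers the full entropy of the labels, while an irrelevant one yields a gain of zero.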
Decision trees are able to handle both numerical and categorical attributes and are
robust with redundant or correlated variables. Decision trees can handle categorical
attributes with many distinct values, such as country codes for telephone numbers.
Decision trees can also handle variables that have a nonlinear effect on the outcome,
so they work better than linear models (for example, linear regression and logistic
regression) for highly nonlinear problems. Decision trees naturally handle variable
interactions. Every node in the tree depends on the preceding nodes in the tree.
In a decision tree, the decision regions are rectangular surfaces. Figure 7.8 shows an
example of five rectangular decision surfaces (A, B, C, D, and E) defined by four
values of two attributes. The corresponding decision tree is on the right side
of the figure. A decision surface corresponds to a leaf node of the tree, and it can be
reached by traversing from the root of the tree, followed by a series of decisions
according to the value of an attribute. The decision surface can only be axis-aligned
for the decision tree.
The structure of a decision tree is sensitive to small variations in the training data.
Although the dataset is the same, constructing two decision trees based on two different
subsets may result in very different trees. If a tree is too deep, overfitting may occur,
because each split reduces the training data for subsequent splits.
Decision trees are not a good choice if the dataset contains many irrelevant variables.
This is different from the notion that they are robust with redundant variables and
correlated variables. If the dataset contains redundant variables, the resulting decision
tree ignores all but one of these variables because the algorithm cannot detect
information gain by including more redundant variables. On the other hand, if the
dataset contains irrelevant variables and if these variables are accidentally chosen as
splits in the tree, the tree may grow too large and may end up with less data at every
split, where overfitting is likely to occur. To address this problem, feature selection
can be introduced in the data pre-processing phase to eliminate the irrelevant
variables.
Although decision trees are able to handle correlated variables, decision trees are not
well suited when most of the variables in the training set are correlated, since
overfitting is likely to occur. To overcome the issue of instability and potential
overfitting of deep trees, one can combine the decisions of several randomized
shallow decision trees—the basic idea of another classifier called random forest [4]—
or use ensemble methods to combine several weak learners for better classification.
These methods have been shown to improve predictive power compared to a single
decision tree.
For binary decisions, a decision tree works better if the training dataset consists of
records with an even probability of each result. In other words, the root of the tree has
a 50% chance of either classification. This occurs by randomly selecting training
records from each possible classification in equal numbers. It counteracts the
likelihood that a tree will stump out early by passing purity tests because of bias in the
training data.
When using methods such as logistic regression on a dataset with many variables,
decision trees can help determine which variables are the most useful to select based
on information gain. Then these variables can be selected for the logistic regression.
Decision trees can also be used to prune redundant variables.
CS8091 BIG DATA ANALYTICS
UNIT III ASSOCIATION AND RECOMMENDATION SYSTEM
QUESTION BANK
PART-A
1. What are Recommenders?
• Recommenders are instances of personalization software.
• Personalization concerns adapting to the individual needs, interests, and preferences of
each user.
Includes:
– Recommending
– Filtering
– Predicting (e.g. form or calendar appt. completion)
From a business perspective, it is viewed as part of Customer Relationship Management
(CRM).
2. What is Dimensionality Reduction?
Dimension Reduction refers to:
– The process of converting a set of data having vast dimensions into data with fewer
dimensions, ensuring that it conveys similar information concisely.
– These techniques are typically used while solving machine learning problems to
obtain better features for a classification or regression task.
3. List out the problems on using Recommendation systems
• Inconclusive user feedback forms
• Finding users to take the feedback surveys
• Weak Algorithms
• Poor results
• Poor Data
• Lack of Data
• Privacy Control (May NOT explicitly collaborate with recipients)
4. List out the types of Recommender Systems.
– Content
– Collaborative
– Knowledge
5. What is Association Mining?
Finding frequent patterns, associations, correlations, or causal structures among sets
of items or objects in transaction databases, relational databases, and other information
repositories.
6. What is the Purpose of Apriori Algorithm?
Apriori algorithm is an influential algorithm for mining frequent item sets for
Boolean association rules. The name of the algorithm is based on the fact that the
algorithm uses prior knowledge of frequent item set properties.
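The level-wise idea behind Apriori can be sketched as follows: frequent itemsets of size k-1 are joined into candidates of size k, and a candidate is kept only if all of its (k-1)-subsets are already known to be frequent (the prior knowledge the name refers to). This is a simplified illustration, not an optimized implementation; the function name, the use of an absolute count for min_support, and the sample transactions are assumptions.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for itemsets in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
frequent = apriori(baskets, min_support=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```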
7. List out the applications of Association rules.
– Basket data analysis,
– cross-marketing,
– catalogue design,
– loss-leader analysis,
– clustering,
– classification
8. Define support and confidence in Association rule mining.
Support s is the percentage of transactions in D that contain A ∪ B.
Confidence c is the percentage of transactions in D containing A that also contain B.
Support (A => B) = P(A ∪ B)
Confidence (A => B) = P(B | A)
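Both measures can be computed directly from a list of transactions. In this sketch the function names and the four sample baskets are assumptions; support is returned as a fraction rather than a percentage.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of (A and B together) divided by support of A, i.e. P(B | A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]

s = support(baskets, {"bread", "milk"})
c = confidence(baskets, {"bread"}, {"milk"})
print(s, round(c, 3))
```

Here bread and milk appear together in 2 of 4 baskets, and bread appears in 3 of 4, so the rule bread => milk has support 0.5 and confidence 2/3.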
9. What is Association rule?
An association rule captures an interesting association or correlation relationship among
a large set of data items, and is used in decision-making processes. Association rule
analysis finds buying patterns, that is, items that are frequently associated or purchased
together.
10. Describe the method of generating frequent item sets without candidate generation.
Frequent-pattern growth (or FP-growth) adopts a divide-and-conquer strategy.
Steps:
– Compress the database representing frequent items into a frequent pattern tree or FP
tree
– Divide the compressed database into a set of conditional databases
– Mine each conditional database separately
PART B & C
1. Explain about the basics of Recommendation Systems
A Common Challenge:
– Assume you’re a company selling items of some sort: movies, songs, products, etc.
– Company collects millions of ratings from users of their items
– To maximize profit / user happiness, you want to recommend items that users are
likely to want
Recommender systems
• Systems for recommending items (e.g. books, movies, CDs, web pages, newsgroup
messages) to users based on examples of their preferences.
• Many websites provide recommendations (e.g. Amazon, Netflix, Pandora).
• Recommenders have been shown to substantially increase sales at on-line stores.
• Recommender systems are a technological proxy for a social process.
• Recommender systems are a way of suggesting like or similar items and ideas that fit a
user's specific way of thinking.
• Recommender systems try to automate aspects of a completely different
information discovery model where people try to find other people with similar
tastes and then ask them to suggest new things.
Motivation for Recommender Systems
Automates quotes like:
–"I like this book; you might be interested in it"
–"I saw this movie, you’ll like it"
–"Don’t go see that movie!"
Usage
• Massive E-commerce sites use this tool to suggest other items a consumer may
want to purchase
• Web personalization
Ways it is used
• Surveys filled out by past users for the use of new users
• Search-style Algorithms
• Genre matching
• Past purchase querying
Problems in using Recommendation Systems
• Inconclusive user feedback forms
• Finding users to take the feedback surveys
• Weak Algorithms
• Poor results
• Poor Data
• Lack of Data
• Privacy Control (May NOT explicitly collaborate with recipients)
Maintenance
• Costly
• Information becomes outdated
• Information quantity (large, disk space expansion)
The Future of Recommender Systems
• Extract implicit negative ratings through the analysis of returned items.
• How to integrate community with recommendations
• Recommender systems will be used in the future to predict demand for products,
enabling earlier communication back up the supply chain.
2. Explain content-based filtering in detail.
CONTENT-BASED RECOMMENDATIONS
• Main idea: Recommend items to customer x similar to previous items rated highly by x
Example:
•Movie recommendations
–Recommend movies with same actor(s), director, genre, …
•Websites, blogs, news
–Recommend other sites with “similar” content
Item Profiles
• For each item, create an item profile
• Profile is a set (vector) of features
o Movies: author, title, actor, director,…
o Text: Set of “important” words in document
• How to pick important features?
o Usual heuristic from text mining is TF-IDF
(Term frequency * Inverse Doc Frequency)
• Term … Feature
• Document … Item
• fij = frequency of term (feature) i in doc (item) j
• ni = number of docs that mention term i
• N = total number of docs
• TF-IDF score: wij = TFij × IDFi
• Doc profile = set of words with highest TF-IDF scores, together with their scores
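The notes give w_ij = TF_ij × IDF_i without spelling out the two factors. The sketch below assumes one common textbook choice, TF_ij = f_ij / max_k f_kj and IDF_i = log2(N / n_i); other normalisations exist. The function name, the `top_k` parameter, and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_profiles(docs, top_k=3):
    """For each non-empty tokenised document, return its top_k (word, score) pairs."""
    n_docs = len(docs)                         # N
    counts = [Counter(doc) for doc in docs]    # f_ij for each document j
    doc_freq = Counter()                       # n_i: number of docs mentioning term i
    for c in counts:
        doc_freq.update(c.keys())

    profiles = []
    for c in counts:
        max_f = max(c.values())                # most frequent term in this document
        scores = {
            term: (f / max_f) * math.log2(n_docs / doc_freq[term])
            for term, f in c.items()
        }
        # Highest score first; ties broken alphabetically for stable output
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        profiles.append(ranked[:top_k])
    return profiles


# Hypothetical documents, already split into words
docs = [
    "space rocket launch space".split(),
    "football match goal".split(),
    "rocket engine test".split(),
]
profiles = tf_idf_profiles(docs, top_k=2)
print(profiles[0])
```

In the first document, "space" scores highest because it is frequent there and appears in no other document, while "rocket" is penalised for appearing in two of the three documents.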
User Profiles and Prediction
Discovering Features of Documents
• There are many kinds of documents for which a recommendation system can be
useful. For example, there are many news articles published each day, and we
cannot read all of them.
• A recommendation system can suggest articles on topics a user is interested in, but
how can we distinguish among topics?
• Web pages are also a collection of documents. Can we suggest pages a user might
want to see?
• Likewise, blogs could be recommended to interested users, if we could classify
blogs by topics.
• Unfortunately, these classes of documents do not tend to have readily available
information giving features.
• A substitute that has been useful in practice is the identification of words that
characterize the topic of a document.
• First, eliminate stop words – the several hundred most common words, which tend
to say little about the topic of a document.
• For the remaining words, compute the TF.IDF score for each word in the document.
The ones with the highest scores are the words that characterize the document.
Content-Based Recommenders
• Find me things that I liked in the past.
• Machine learns preferences through user feedback and builds a user profile
• Explicit feedback – user rates items
• Implicit feedback – system records user activity
– Clickstream data classified according to page category and activity, e.g. browsing a
product page
– Time spent on an activity such as browsing a page
• Recommendation is viewed as a search process, with the user profile acting as the
query and the set of items acting as the documents to match.
Recommending Items to Users Based on Content
• With profile vectors for both users and items, we can estimate the degree to which a
user would prefer an item by computing the cosine distance between the user’s and
item’s vectors.
• As in Example 9.2, we may wish to scale various components whose values
are not boolean.
• The random-hyperplane and locality-sensitive-hashing techniques can be used to
place (just) item profiles in buckets.
• In that way, given a user to whom we want to recommend some items, we can
apply the same two techniques – random hyperplanes and LSH – to determine in
which buckets we must look for items that might have a small cosine distance from
the user.
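The profile comparison described above can be sketched as a brute-force ranking by cosine similarity (a small cosine distance corresponds to a similarity close to 1). The random-hyperplane and LSH bucketing step is not shown, and the feature layout and item names below are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_items(user_profile, item_profiles):
    """Return (item, similarity) pairs, most similar to the user profile first."""
    scored = [
        (name, cosine_similarity(user_profile, vec))
        for name, vec in item_profiles.items()
    ]
    return sorted(scored, key=lambda pair: -pair[1])


# Hypothetical feature order: [action, comedy, stars_actor_A]
user = [1.0, 0.0, 1.0]
items = {
    "movie_a": [1, 0, 1],
    "movie_b": [0, 1, 0],
    "movie_c": [1, 1, 0],
}
ranking = rank_items(user, items)
print(ranking)
```

With many items, comparing the user against every item becomes expensive, which is the motivation for the bucketing techniques mentioned in the notes.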
Pros: Content-based Approach
• +: No need for data on other users
– No cold-start or sparsity problems
• +: Able to recommend to users with unique tastes
• +: Able to recommend new & unpopular items
– No first-rater problem
• +: Able to provide explanations
– Can provide explanations of recommended items by listing content-features that
caused an item to be recommended
Cons: Content-based Approach
• –: Finding the appropriate features is hard
– E.g., images, movies, music
• –: Recommendations for new users
– How to build a user profile?
• –: Overspecialization
– Never recommends items outside user’s content profile
– People might have multiple interests
– Unable to exploit quality judgments of other users
3. Explain about Collaborative Filtering in detail.
Introduction to Collaborative Filtering
• Collaborative filtering leverages product transactions to give recommendations.
• In this type of model, for a specific customer,
– we find similar customers based on transaction history and recommend items that
the customer in question hasn’t purchased yet and which the similar customers
tended to like.
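A minimal sketch of this idea follows. The notes do not fix a similarity measure, so Jaccard similarity between purchase sets is used here as an assumption; the function names, the parameters `k` and `top_n`, and the customer data are also hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, purchases, k=2, top_n=3):
    """Suggest items bought by the k most similar customers but not by `target`."""
    mine = purchases[target]
    # Most similar customers first; ties broken by name for stable output
    neighbours = sorted(
        (u for u in purchases if u != target),
        key=lambda u: (-jaccard(mine, purchases[u]), u),
    )[:k]
    votes = Counter()
    for u in neighbours:
        for item in purchases[u] - mine:   # only items the target has not bought
            votes[item] += 1
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical transaction history: customer -> set of purchased items
purchases = {
    "ann": {"book", "lamp"},
    "bob": {"book", "lamp", "pen"},
    "cat": {"book", "pen", "mug"},
    "dan": {"sofa"},
}
suggestions = recommend("ann", purchases)
print(suggestions)
```

For "ann", the nearest customers are "bob" and "cat"; both bought a pen, so it ranks ahead of the mug.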
Collaborative Filtering
• Match people with similar interests as a basis for recommendation.
1) Many people must participate to make it likely that a person with similar interests
will be found.
2) There must be a simple way for people to express their interests.
3) There must be an efficient algorithm to match people with similar interests.
Designing Collaborative Filtering
APRIORI
1) k = 1
2) Find the frequent set Lk from Ck, the set of all candidate itemsets
3) Form Ck+1 from Lk; k = k + 1
4) Repeat steps 2-3 until Ck is empty
• Details about steps 2 and 3
o Step 2: scan D and count each itemset in Ck; if its count is at least minSup, it is frequent
o Step 3: join pairs of itemsets in Lk that share k-1 items, and discard any candidate that
has an infrequent k-item subset
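The Apriori loop above can be sketched in Python as follows. This is an illustrative version, not an optimised one: it assumes minSup is an absolute count and treats an itemset as frequent when its count is at least minSup, and it builds Ck+1 by joining pairs from Lk and pruning candidates with an infrequent subset. The function name and sample baskets are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_sup):
    """Return {itemset: count} for every itemset whose count is >= min_sup."""
    transactions = [frozenset(t) for t in transactions]
    # C1: every individual item is a candidate
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Step 2: scan D, count each candidate in Ck, keep the frequent ones as Lk
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Step 3: form Ck+1 by joining pairs from Lk, then prune any candidate
        # that has a k-item subset which is not frequent
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
        k += 1
    return frequent


# Hypothetical transaction database D
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]
result = apriori(baskets, min_sup=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

On this data every single item and every pair is frequent, but the triple {bread, butter, milk} occurs only once, so the loop stops after k = 3.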