UNIT-V
Data Analytics with R Machine Learning

Supervised learning
What is supervised learning?
Supervised learning, also known as supervised machine learning, is a subcategory of machine
learning and artificial intelligence. It is defined by its use of labeled data sets to train
algorithms that classify data or predict outcomes accurately.
As input data is fed into the model, it adjusts its weights until the model has been fitted
appropriately, which occurs as part of the cross-validation process. Supervised learning helps
organizations solve a variety of real-world problems at scale, such as classifying spam into a
separate folder from your inbox. It can be used to build highly accurate machine learning models.
How supervised learning works
Supervised learning uses a training set to teach models to yield the desired output. This
training dataset includes inputs and correct outputs, which allow the model to learn over time.
The algorithm measures its accuracy through the loss function, adjusting until the error has
been sufficiently minimized.
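The loop described above (measure the loss, adjust, repeat until the error is small) can be sketched in Python. This is only an illustration: the training data, the learning rate, the step count, and the function names are assumptions made for the example, not part of these notes.

```python
# Minimal sketch: fit y = w*x + b by gradient descent on mean squared error.
# The data, learning rate, and step count are illustrative assumptions.

def mse_loss(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on the training set."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Repeatedly adjust w and b in the direction that lowers the loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the MSE loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Labeled training set: inputs and their correct outputs (generated from y = 2x + 1).
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line(train_x, train_y)
final_loss = mse_loss(w, b, train_x, train_y)
print(w, b, final_loss)
```

On this noise-free data the fitted values approach w = 2 and b = 1, and the loss approaches zero. With real data the loss usually levels off above zero.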
Supervised learning can be separated into two types of problems when data mining—
classification and regression:
• Classification uses an algorithm to accurately assign test data into specific categories. It
recognizes specific entities within the dataset and attempts to draw some conclusions
on how those entities should be labeled or defined. Common classification algorithms
are linear classifiers, support vector machines (SVM), decision trees, k-nearest neighbor,
and random forest, which are described in more detail below.
• Regression is used to understand the relationship between dependent and independent
variables. It is commonly used to make projections, such as for sales revenue for a given
business. Linear regression, logistic regression, and polynomial regression are popular
regression algorithms.
Supervised learning algorithms
Various algorithms and computation techniques are used in supervised machine learning
processes. Below are brief explanations of some of the most commonly used learning methods,
typically calculated through use of programs like R or Python:
• Neural networks: Primarily leveraged for deep learning algorithms, neural
networks process training data by mimicking the interconnectivity of the human brain
through layers of nodes. Each node is made up of inputs, weights, a bias (or threshold),
and an output. If that output value exceeds a given threshold, it “fires” or activates the
node, passing data to the next layer in the network. Neural networks learn this mapping
function through supervised learning, adjusting based on the loss function through the
process of gradient descent. When the cost function is at or near zero, the model fits the
training data closely; its accuracy on new data still has to be checked on held-out data.
• Naive Bayes: Naive Bayes is a classification approach that adopts the principle of class
conditional independence from Bayes' theorem. This means that the presence of one
feature does not impact the presence of another in the probability of a given outcome, and
each predictor has an equal effect on that result. There are three types of Naïve Bayes
classifiers: Multinomial Naïve Bayes, Bernoulli Naïve Bayes, and Gaussian Naïve Bayes.
This technique is primarily used in text classification, spam identification, and
recommendation systems.
• Linear regression: Linear regression is used to identify the relationship between a
dependent variable and one or more independent variables and is typically leveraged to
make predictions about future outcomes. When there is only one independent variable
and one dependent variable, it is known as simple linear regression. As the number of
independent variables increases, it is referred to as multiple linear regression. For each
type of linear regression, it seeks to plot a line of best fit, which is calculated through the
method of least squares. However, unlike other regression models, this line is straight
when plotted on a graph.
• Logistic regression: While linear regression is leveraged when dependent variables are
continuous, logistic regression is selected when the dependent variable is categorical,
meaning it has binary outputs, such as "true" and "false" or "yes" and "no." While
both regression models seek to understand relationships between data inputs, logistic
regression is mainly used to solve binary classification problems, such as spam
identification.
• Support vector machines (SVM): A support vector machine is a popular supervised
learning model developed by Vladimir Vapnik, used for both data classification and
regression. That said, it is typically leveraged for classification problems, constructing a
hyperplane where the distance between two classes of data points is at its maximum. This
hyperplane is known as the decision boundary, separating the classes of data points (e.g.,
oranges vs. apples) on either side of the plane.
• K-nearest neighbor: K-nearest neighbor, also known as the KNN algorithm, is a non-
parametric algorithm that classifies data points based on their proximity and association
to other available data. This algorithm assumes that similar data points can be found near
each other. As a result, it seeks to calculate the distance between data points, usually
through Euclidean distance, and then it assigns a category based on the most frequent
category or average. Its ease of use and low calculation time make it a preferred
algorithm by data scientists, but as the test dataset grows, the processing time lengthens,
making it less appealing for classification tasks. KNN is typically used for
recommendation engines and image recognition.
• Random forest: Random forest is another flexible supervised machine learning
algorithm used for both classification and regression purposes. The "forest" references a
collection of uncorrelated decision trees, which are then merged together to reduce
variance and create more accurate data predictions.
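As one concrete instance from the list above, the K-nearest neighbor procedure (compute Euclidean distances, then take the most frequent category among the closest points) can be sketched in Python. The sample points, the labels, and the function names are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Return the most frequent label among the k training points closest to query."""
    # Sort the labeled points by distance to the query and keep the k closest.
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 6.5), (6.5, 7.0)]
labels = ["apple", "apple", "apple", "orange", "orange", "orange"]

prediction_near_apples = knn_predict(points, labels, (1.2, 1.4))
prediction_near_oranges = knn_predict(points, labels, (6.4, 6.2))
print(prediction_near_apples, prediction_near_oranges)
```

Every prediction scans the whole training set, which is why processing time grows as the dataset grows.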
Supervised learning examples

Supervised learning models can be used to build and advance a number of business
applications, including the following:
• Image and object recognition: Supervised learning algorithms can be used to locate,
isolate, and categorize objects out of videos or images, making them useful when
applied to various computer vision techniques and imagery analysis.
• Predictive analytics: A widespread use case for supervised learning models is in
creating predictive analytics systems to provide deep insights into various business data
points. This allows enterprises to anticipate certain results based on a given output
variable, helping business leaders justify decisions or pivot for the benefit of the
organization.
• Customer sentiment analysis: Using supervised machine learning algorithms,
organizations can extract and classify important pieces of information from large
volumes of data—including context, emotion, and intent—with very little human
intervention. This can be incredibly useful when gaining a better understanding of
customer interactions and can be used to improve brand engagement efforts.
• Spam detection: Spam detection is another example of a supervised learning model.
Using supervised classification algorithms, organizations can train models to recognize
patterns or anomalies in new data to organize spam and non-spam-related
correspondences effectively.
Challenges of supervised learning
Although supervised learning can offer businesses advantages, such as deep data insights and
improved automation, there are some challenges when building sustainable supervised learning
models. The following are some of these challenges:
• Supervised learning models can require certain levels of expertise to structure
accurately.
• Training supervised learning models can be very time intensive.
• Datasets can have a higher likelihood of human error, resulting in algorithms learning
incorrectly.
• Unlike unsupervised learning models, supervised learning cannot cluster or classify
data on its own.
Unsupervised learning
What is unsupervised learning?
Unsupervised learning, also known as unsupervised machine learning, uses machine learning
(ML) algorithms to analyze and cluster unlabeled data sets. These algorithms discover hidden
patterns or data groupings without the need for human intervention.
Unsupervised learning's ability to discover similarities and differences in information makes it the
ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation and
image recognition.
Common unsupervised learning approaches
Unsupervised learning models are utilized for three main tasks—clustering, association, and
dimensionality reduction. Below we’ll define each learning method and highlight common
algorithms and approaches to conduct them effectively.
Clustering
Clustering is a data mining technique which groups unlabeled data based on their similarities or
differences. Clustering algorithms are used to process raw, unclassified data objects into groups
represented by structures or patterns in the information. Clustering algorithms can be categorized
into a few types, specifically exclusive, overlapping, hierarchical, and probabilistic.
Exclusive and Overlapping Clustering
Exclusive clustering is a form of grouping that stipulates a data point can exist only in one
cluster. This can also be referred to as “hard” clustering. The K-means clustering algorithm is an
example of exclusive clustering.
• K-means clustering is a common example of an exclusive clustering method where data
points are assigned to K groups based on their distance from each group's centroid, where
K is the number of clusters. The data points closest to a given centroid will be
clustered under the same category. A larger K value will be indicative of smaller
groupings with more granularity whereas a smaller K value will have larger groupings
and less granularity. K-means clustering is commonly used in market segmentation,
document clustering, image segmentation, and image compression.
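A minimal Python sketch of the K-means loop follows: assign each point to its closest centroid, then move each centroid to the mean of its members. The sample points and the function name are hypothetical. Seeding the centroids with the first K points is an assumption made to keep the example deterministic; practical implementations usually pick the starting centroids at random.

```python
import math


def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns (centroids, assignments)."""
    # Assumption: seed the centroids with the first k points.
    centroids = [points[i] for i in range(k)]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignments


# Hypothetical 2-D data with two visible groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

Because every point ends up with exactly one cluster index, this is hard (exclusive) clustering.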
Overlapping clustering differs from exclusive clustering in that it allows data points to belong to
multiple clusters with separate degrees of membership. “Soft” or fuzzy k-means clustering is an
example of overlapping clustering.
Hierarchical clustering
Hierarchical clustering, also known as hierarchical cluster analysis (HCA), is an unsupervised
clustering algorithm that can be categorized in two ways: agglomerative or divisive.
Agglomerative clustering is considered a “bottom-up” approach. Its data points are isolated as
separate groupings initially, and then they are merged together iteratively on the basis of
similarity until one cluster has been achieved. Four different methods are commonly used to
measure similarity:
1. Ward’s linkage: This method states that the distance between two clusters is defined by
the increase in the within-cluster sum of squares after the clusters are merged.
2. Average linkage: This method is defined by the mean distance between two points in
each cluster.
3. Complete (or maximum) linkage: This method is defined by the maximum distance
between two points in each cluster.
4. Single (or minimum) linkage: This method is defined by the minimum distance between
two points in each cluster.
Euclidean distance is the most common metric used to calculate these distances; however, other
metrics, such as Manhattan distance, are also cited in clustering literature.
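The single, complete, and average linkage measures can be written directly from their definitions. The Python sketch below uses Euclidean distance and two small hypothetical clusters. The function names are illustrative, and Ward's linkage is left out to keep the example short.

```python
import math
from itertools import product


def pairwise_distances(cluster_a, cluster_b):
    """Euclidean distance for every pair with one point taken from each cluster."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    """Minimum distance between a point of one cluster and a point of the other."""
    return min(pairwise_distances(cluster_a, cluster_b))


def complete_linkage(cluster_a, cluster_b):
    """Maximum distance between a point of one cluster and a point of the other."""
    return max(pairwise_distances(cluster_a, cluster_b))


def average_linkage(cluster_a, cluster_b):
    """Mean distance over all cross-cluster pairs of points."""
    distances = pairwise_distances(cluster_a, cluster_b)
    return sum(distances) / len(distances)


# Hypothetical clusters.
cluster_a = [(0.0, 0.0), (0.0, 1.0)]
cluster_b = [(3.0, 0.0), (4.0, 0.0)]

single = single_linkage(cluster_a, cluster_b)
complete = complete_linkage(cluster_a, cluster_b)
average = average_linkage(cluster_a, cluster_b)
print(single, average, complete)
```

At each step, an agglomerative algorithm merges the pair of clusters for which the chosen measure is smallest.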
Divisive clustering can be defined as the opposite of agglomerative clustering; instead it takes a
“top-down” approach. In this case, a single data cluster is divided based on the differences
between data points. Divisive clustering is not commonly used, but it is still worth noting in the
context of hierarchical clustering. These clustering processes are usually visualized using a
dendrogram, a tree-like diagram that documents the merging or splitting of data points at each
iteration.
Probabilistic clustering
A probabilistic model is an unsupervised technique that helps us solve density estimation or
“soft” clustering problems. In probabilistic clustering, data points are clustered based on the
likelihood that they belong to a particular distribution. The Gaussian Mixture Model (GMM) is
one of the most commonly used probabilistic clustering methods.
• Gaussian Mixture Models are classified as mixture models, which means that they are
made up of an unspecified number of probability distribution functions. GMMs are
primarily leveraged to determine which Gaussian, or normal, probability distribution a
given data point belongs to. If the mean or variance are known, then we can determine
which distribution a given data point belongs to. However, in GMMs, these variables are
not known, so we assume that a latent, or hidden, variable exists to cluster data points
appropriately. While it is not required to use the Expectation-Maximization (EM)
algorithm, it is commonly used to estimate the assignment probabilities for a given data
point to a particular data cluster.
Association Rules
An association rule is a rule-based method for finding relationships between variables in a given
dataset. These methods are frequently used for market basket analysis, allowing companies to
better understand relationships between different products. Understanding consumption habits of
customers enables businesses to develop better cross-selling strategies and recommendation
engines. Examples of this can be seen in Amazon’s “Customers Who Bought This Item Also
Bought” or Spotify’s "Discover Weekly" playlist. While there are a few different algorithms
used to generate association rules, such as Apriori, Eclat, and FP-Growth, the Apriori algorithm
is most widely used.
Apriori algorithms
Apriori algorithms have been popularized through market basket analyses, leading to different
recommendation engines for music platforms and online retailers. They are used within
transactional datasets to identify frequent itemsets, or collections of items, to identify the
likelihood of consuming a product given the consumption of another product. For example, if I
play Black Sabbath’s radio on Spotify, starting with their song “Orchid”, one of the other songs
on this channel will likely be a Led Zeppelin song, such as “Over the Hills and Far Away.” This
is based on my prior listening habits as well as the ones of others. Apriori algorithms use a hash
tree to count itemsets, navigating through the dataset in a breadth-first manner.
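The breadth-first search for frequent itemsets can be sketched in a few lines of Python. This is a simplified illustration, not a full Apriori implementation: it counts support by scanning the transactions directly instead of using a hash tree, and the transactions, the support threshold, and the function names are all hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-1 itemsets first, then size-2, and so on."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    size = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        survivors = {}
        for candidate in current:
            s = support(candidate, transactions)
            if s >= min_support:
                survivors[candidate] = s
        frequent.update(survivors)
        # Build next-level candidates from items that are still frequent.
        size += 1
        alive = sorted({i for c in survivors for i in c})
        current = [
            frozenset(combo)
            for combo in combinations(alive, size)
            # Apriori property: every smaller subset must itself be frequent.
            if all(frozenset(sub) in survivors for sub in combinations(combo, size - 1))
        ]
    return frequent


def confidence(antecedent, consequent, transactions):
    """Estimated likelihood of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

frequent = frequent_itemsets(baskets, min_support=0.6)
bread_to_milk = confidence(frozenset({"bread"}), frozenset({"milk"}), baskets)
print(frequent, bread_to_milk)
```

Here the confidence of the rule bread → milk estimates the likelihood of buying milk given that bread is in the basket.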
Dimensionality reduction
While more data generally yields more accurate results, it can also impact the performance of
machine learning algorithms (e.g. overfitting) and it can also make it difficult to visualize
datasets. Dimensionality reduction is a technique used when the number of features, or
dimensions, in a given dataset is too high. It reduces the number of data inputs to a manageable
size while also preserving the integrity of the dataset as much as possible. It is commonly used in
the preprocessing data stage, and there are a few different dimensionality reduction methods that
can be used, such as:
Principal component analysis
Principal component analysis (PCA) is a type of dimensionality reduction algorithm which is
used to reduce redundancies and to compress datasets through feature extraction. This method
uses a linear transformation to create a new data representation, yielding a set of "principal
components." The first principal component is the direction which maximizes the variance of the
dataset. While the second principal component also finds the maximum variance in the data, it is
completely uncorrelated to the first principal component, yielding a direction that is
perpendicular, or orthogonal, to the first component. This process repeats based on the number of
dimensions, where each subsequent principal component is the direction orthogonal to the prior
components with the most variance.
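To make the "direction of maximum variance" concrete, the Python sketch below finds only the first principal component. It builds the covariance matrix of the centered data and applies power iteration to it. Power iteration is one simple way to get the leading direction and is chosen here as an assumption for brevity; library routines typically use an eigendecomposition or SVD instead. The data and the function name are hypothetical.

```python
import math


def first_principal_component(points, iterations=100):
    """Unit vector along the direction of greatest variance (power iteration)."""
    n = len(points)
    dims = len(points[0])
    # Center the data so that each feature has mean zero.
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    # Repeatedly multiply by the covariance matrix and renormalize.
    vec = [1.0] * dims
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


# Hypothetical data lying exactly on the line y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction = first_principal_component(data)
print(direction)
```

All of the variance in this data lies along the line y = 2x, so the returned direction is proportional to (1, 2). Projecting each point onto it reduces two features to one without losing information in this idealized case.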
Singular value decomposition
Singular value decomposition (SVD) is another dimensionality reduction approach which
factorizes a matrix, A, into three matrices. SVD is denoted by the formula A = USV^T,
where U and V are orthogonal matrices and V^T is the transpose of V. S is a diagonal matrix,
and its diagonal entries are the singular values of matrix A. Keeping only the largest singular
values gives a low-rank approximation of A. Similar to PCA, it is commonly used to reduce noise and compress
data, such as image files.
Autoencoders
Autoencoders leverage neural networks to compress data and then recreate a new representation
of the original data’s input. The hidden layer
specifically acts as a bottleneck to compress the input layer prior to reconstructing within the
output layer. The stage from the input layer to the hidden layer is referred to as “encoding” while
the stage from the hidden layer to the output layer is known as “decoding.”
Applications of unsupervised learning
Machine learning techniques have become a common method to improve a product user
experience and to test systems for quality assurance. Unsupervised learning provides an
exploratory path to view data, allowing businesses to identify patterns in large volumes of data
more quickly when compared to manual observation. Some of the most common real-world
applications of unsupervised learning are:
• News Sections: Google News uses unsupervised learning to categorize articles on the
same story from various online news outlets. For example, the results of a presidential
election could be categorized under their label for “US” news.
• Computer vision: Unsupervised learning algorithms are used for visual perception
tasks, such as object recognition.
• Medical imaging: Unsupervised machine learning provides essential features to medical
imaging devices, such as image detection, classification and segmentation, used in
radiology and pathology to diagnose patients quickly and accurately.
• Anomaly detection: Unsupervised learning models can comb through large amounts of
data and discover atypical data points within a dataset. These anomalies can raise
awareness around faulty equipment, human error, or breaches in security.
• Customer personas: Defining customer personas makes it easier to understand common
traits and business clients' purchasing habits. Unsupervised learning allows businesses to
build better buyer persona profiles, enabling organizations to align their product
messaging more appropriately.
• Recommendation Engines: Using past purchase behavior data, unsupervised learning
can help to discover data trends that can be used to develop more effective cross-selling
strategies. This is used to make relevant add-on recommendations to customers during
the checkout process for online retailers.
Challenges of unsupervised learning
While unsupervised learning has many benefits, some challenges can occur when it allows
machine learning models to execute without any human intervention. Some of these challenges
can include:
• Computational complexity due to a high volume of training data
• Longer training times
• Higher risk of inaccurate results
• Human intervention to validate output variables
• Lack of transparency into the basis on which data was clustered
Collaborative Filtering
What is Collaborative Filtering?
In collaborative filtering, we find similar users and recommend what those similar users like.
In this type of recommendation system, we do not use the features of an item to recommend it.
Instead, we classify the users into clusters of similar types and recommend to each user according to
the preferences of that user's cluster.
There are basically four types of algorithms, or techniques, for building collaborative filtering
recommender systems:
• Memory-Based
• Model-Based
• Hybrid
• Deep Learning
Advantages of Collaborative Filtering-Based Recommender Systems
There are two broad types of recommender systems: content-based and collaborative filtering.
Content-based recommender systems have limited use cases and higher time complexity, and they
rely on a limited set of item content. That is not the case for collaborative filtering algorithms.
One of the main advantages of collaborative filtering recommender systems is that they are highly
efficient at providing personalized content and are also able to adapt to changing user preferences.
Measuring Similarity
A simple example of a movie recommendation system helps to explain this:
In this scenario, User 1 and User 2 give nearly the same ratings to the movies, so we can
conclude that Movie 3 is also going to be averagely liked by User 1, and that Movie 4 will be a
good recommendation for User 2. We can also see users with different tastes: User 1 and User 3
are opposite to each other. User 3 and User 4 have a common interest in the movies, and on that
basis we can say that Movie 4 is also going to be disliked by User 4. This is collaborative
filtering: we recommend to users the items that are liked by users with similar interests.
Cosine Similarity
We can also use the cosine similarity between users to find users with similar interests. A
larger cosine implies a smaller angle between two users, and hence more similar interests. We
can compute the cosine similarity between two users' rows in the utility matrix, giving the
value zero to all the unfilled entries to make the calculation easy. If we get a smaller cosine,
there is a larger distance between the users; if the cosine is larger, there is a smaller angle
between the users, and we can recommend similar things to them.
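$$\text{similarity} = \frac{A \cdot B}{\lVert A \rVert \times \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i \times B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \times \sqrt{\sum_{i=1}^{n} B_i^2}}$$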
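The cosine similarity formula translates directly into Python. The rating rows below are hypothetical and are not the ones from the example table. Unrated movies are filled with zero, as described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical rows of a utility matrix (ratings 1 to 5, 0 = not rated).
user_1 = [5, 4, 0, 1]
user_2 = [4, 5, 0, 2]
user_3 = [1, 0, 5, 4]

sim_1_2 = cosine_similarity(user_1, user_2)
sim_1_3 = cosine_similarity(user_1, user_3)
print(sim_1_2, sim_1_3)
```

User 1 scores much closer to User 2 than to User 3, so items liked by User 2 are the better candidates to recommend to User 1.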
Rounding the Data

In collaborative filtering, we can round off the data to compare it more easily. For example, we
can assign ratings below 3 the value 0 and ratings above it the value 1:
Taking the previous example again and applying the rounding-off process makes the data much
more readable. We can see that User 1 and User 2 are more similar and that User 3 and User 4 are
more alike.
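The rounding-off step can be sketched as a small Python function. The ratings are hypothetical. The text does not say how a rating of exactly 3 should be treated, so mapping it to 1 here is an assumption. Unrated entries are kept as None.

```python
def round_ratings(ratings, threshold=3):
    """Map each rating to 0 or 1; unrated entries (None) stay None.

    Assumption: a rating equal to the threshold counts as 1.
    """
    return [None if r is None else (1 if r >= threshold else 0) for r in ratings]


# Hypothetical ratings for one user; None marks a movie that was not rated.
rounded = round_ratings([5, 4, None, 1])
print(rounded)  # [1, 1, None, 0]
```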
Normalizing Rating
In the process of normalizing, we take the average rating of a user and subtract it from each of
that user's ratings, so we get either positive or negative values as ratings, which can then be
classified further into similar groups. By normalizing the data we can make clusters of the users
that give a similar rating to similar items and then we can use these clusters to recommend
items to the users.
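A minimal Python sketch of this normalization follows, assuming each rating is centered on the user's own average. The ratings and the function name are hypothetical.

```python
def normalize_ratings(ratings):
    """Subtract the user's average rating from each given rating.

    Unrated entries (None) are ignored when averaging and stay None.
    """
    rated = [r for r in ratings if r is not None]
    mean = sum(rated) / len(rated)
    return [None if r is None else r - mean for r in ratings]


# Hypothetical ratings for one user; the average of the rated movies is 3.
normalized = normalize_ratings([5, 3, None, 1])
print(normalized)  # [2.0, 0.0, None, -2.0]
```

Positive values mark items the user rated above their own average, and negative values mark items rated below it.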
What are some of the Challenges to be Faced while using Collaborative Filtering?
Every algorithm has its pros and cons, and collaborative filtering is no exception.
Collaborative filtering algorithms are very dynamic and can adapt to changes in user
preferences over time. One of the main issues faced by these recommender systems, however, is
scalability: as the user base grows, the computation and the data storage required both
increase manifold, which leads to slow and inaccurate results.
Also, collaborative filtering algorithms fail to recommend a diversity of products, because
they are based on historical data and hence provide recommendations closely related to it.
Social Media Analytics

What is social media analytics?
Social media analytics is the ability to gather and find meaning in data gathered from social
channels to support business decisions—and measure the performance of actions based on those
decisions through social media.
Practitioners and analysts alike know social media by its many websites and channels: Facebook,
YouTube, Instagram, Twitter, LinkedIn, Reddit and many others.
Social media analytics is broader than metrics such as likes, follows, retweets, previews, clicks,
and impressions gathered from individual channels. It also differs from reporting offered by
services that support marketing campaigns such as LinkedIn or Google Analytics.
Social media analytics uses specifically designed software platforms that work similarly to web
search tools. Data about keywords or topics is retrieved through search queries or web ‘crawlers’
that span channels. Fragments of text are returned, loaded into a database, categorized and
analyzed to derive meaningful insights.
Social media analytics includes the concept of social listening. Listening is monitoring social
channels for problems and opportunities. Social media analytics tools typically incorporate
listening into more comprehensive reporting that involves listening and performance analysis.
Why is social media analytics important?
IBM points out that with the prevalence of social media: “News of a great product can spread
like wildfire. And news about a bad product — or a bad experience with a customer service rep
— can spread just as quickly. Consumers are now holding organizations to account for their
brand promises and sharing their experiences with friends, co-workers and the public at large.”
Social media analytics helps companies address these experiences and use them to:
• Spot trends related to offerings and brands
• Understand conversations — what is being said and how it is being received
• Derive customer sentiment towards products and services
• Gauge response to social media and other communications
• Identify high-value features for a product or service
• Uncover what competitors are saying and its effectiveness
• Map how third-party partners and channels may affect performance
These insights can be used not only to make tactical adjustments, like addressing an angry tweet,
but also to help drive strategic decisions. In fact, IBM finds social media analytics is now “being
brought into the core discussions about how businesses develop their strategies.”
These strategies affect a range of business activity:
• Product development - Analyzing an aggregate of Facebook posts, tweets and Amazon
product reviews can deliver a clearer picture of customer pain points, shifting needs and
desired features. Trends can be identified and tracked to shape the management of
existing product lines as well as guide new product development.
• Customer experience - An IBM study discovered “organizations are evolving from
product-led to experience-led businesses.” Behavioral analysis can be applied across
social channels to capitalize on micro-moments to delight customers and increase loyalty
and lifetime value.
• Branding - Social media may be the world’s largest focus group. Natural language
processing and sentiment analysis can continually monitor positive or negative
expectations to maintain brand health, refine positioning and develop new brand
attributes.
• Competitive Analysis - Understanding what competitors are doing and how customers
are responding is always critical. For example, a competitor may indicate that they are
forgoing a niche market, creating an opportunity. Or a spike in positive mentions for a
new product can alert organizations to market disruptors.
• Operational efficiency – Deep analysis of social media can help organizations improve
how they gauge demand. Retailers and others can use that information to manage
inventory and suppliers, reduce costs and optimize resources.
Key capabilities of effective social media analytics
The first step for effective social media analytics is developing a goal. Goals can range
from increasing revenue to pinpointing service issues. From there, topics or keywords can
be selected and parameters such as date range can be set. Sources also need to be
specified — responses to YouTube videos, Facebook conversations, Twitter arguments,
Amazon product reviews, comments from news sites. It is important to select sources
pertinent to a given product, service or brand.
Typically, a data set will be established to support the goals, topics, parameters and
sources. Data is retrieved, analyzed and reported through visualizations that make it
easier to understand and manipulate.
These steps are typical of a general social media analytics approach that can be made
more effective by capabilities found in social media analytics platforms.
• Natural language processing and machine learning technologies identify entities and
relationships in unstructured data — information not pre-formatted to work with data
analytics. Virtually all social media content is unstructured. These technologies are
critical to deriving meaningful insights.
• Segmentation is a fundamental need in social media analytics. It categorizes social
media participants by geography, age, gender, marital status, parental status and other
demographics. It can help identify influencers in those categories. Messages, initiatives
and responses can be better tuned and targeted by understanding who is interacting on key topics.
• Behavior analysis is used to understand the concerns of social media participants by
assigning behavioral types such as user, recommender, prospective user and detractor.
Understanding these roles helps develop targeted messages and responses to meet,
change or deflect their perceptions.
• Sentiment analysis measures the tone and intent of social media comments. It typically
involves natural language processing technologies to help understand entities and
relationships to reveal positive, negative, neutral or ambivalent attributes.
• Share of voice analyzes prevalence and intensity in conversations regarding brand,
products, services, reputation and more. It helps determine key issues and important
topics. It also helps classify discussions as positive, negative, neutral or ambivalent.
• Clustering analysis can uncover hidden conversations and unexpected insights. It makes
associations between keywords or phrases that appear together frequently and derives
new topics, issues and opportunities. The people that make baking soda, for example,
discovered new uses and opportunities using clustering analysis.
• Dashboards and visualization: charts, graphs, tables and other presentation tools
summarize and share social media analytics findings — a critical capability for
communicating and acting on what has been learned. They also enable users to grasp
meaning and insights more quickly and look deeper into specific findings without
advanced technical skills.
Mobile Analytics
What is Mobile Analytics?
Mobile analytics is the process of collecting and analyzing data from mobile devices, such as
smartphones and tablets, in order to gain insights into user behavior, app performance, and
business metrics. Mobile analytics tools are used to track and measure various aspects of mobile
app usage, including app downloads, user engagement, retention, in-app purchases, and other key
performance indicators.
Goals of Mobile Analytics

Marketers want to know what their customers want to see and do on their mobile device, so that
they can target the customer.
Similar to the process of analytics used to study the behaviour of users on the Web or social media,
mobile analytics is the process of analysing the behaviour of mobile users.
The primary goal of mobile analytics is to understand the following:
New users
These are users who have just started using a mobile service. Users are identified by unique device
IDs. The growth and popularity of a service greatly depend on the number of new users it is able
to attract.
Active users
These are users who use mobile services at least once in a specified period. If the period is one
day, for example, then an active user is one who uses the service at least once during that day. The number
of active users in any specific period of time shows the popularity of a service during that period.
Percentage of new users
This is the percentage of new users over the total active users of a mobile service. This figure
can never exceed 100%, but a very low value means that the particular service or app is not doing
very well.
Sessions
When a user opens an app, it is counted as one session. In other words, the session starts with the
launching of the app and finishes with the app’s termination. Note that a session is not related to
how long the app has been used by the user.
Average usage duration

This is the average duration that a mobile user uses the service.
Accumulated users
This refers to the total number of users (old as well as new) who have used an app before a specific
time.
Bounce rate
The bounce rate is expressed as a percentage (%). It can be calculated as follows:
Bounce rate = (Number of terminated sessions on any specific page of an app / Total number of
sessions of the app) × 100
The bounce rate can be used by service providers to help them monitor and improve
their service so that customers remain satisfied and do not leave the service.
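The bounce-rate formula can be checked with a small worked example in R. This is only a sketch: the function name bounce_rate and the session counts are made up for illustration.

```r
# Bounce rate as defined in the notes:
# (sessions terminated on a given page / total sessions of the app) * 100
bounce_rate <- function(terminated_on_page, total_sessions) {
  stopifnot(total_sessions > 0, terminated_on_page <= total_sessions)
  100 * terminated_on_page / total_sessions
}

# Illustrative numbers only: 120 of 800 sessions ended on the checkout screen
checkout_bounce <- bounce_rate(120, 800)
print(checkout_bounce)
```

A high value for one screen compared with the others points to the place where users tend to leave the app.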
User retention
After a certain period of time, the total number of new users still using any app is known as the
user retention of that app.
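The metrics defined in this list can be computed from a simple session log. The sketch below uses base R only; the data frame, its column names and the chosen dates are hypothetical, and real analytics tools may define the periods differently.

```r
# Hypothetical session log: one row per app session, device IDs identify users
sessions <- data.frame(
  device_id = c("a", "b", "a", "c", "b", "d", "a", "c"),
  day       = as.Date(c("2024-01-01", "2024-01-01", "2024-01-02", "2024-01-02",
                        "2024-01-03", "2024-01-03", "2024-01-08", "2024-01-08")),
  minutes   = c(5, 12, 3, 8, 10, 4, 6, 9),
  stringsAsFactors = FALSE
)

# First day on which each device was seen (a device is "new" on that day)
sessions$first_day <- as.Date(
  ave(as.numeric(sessions$day), sessions$device_id, FUN = min),
  origin = "1970-01-01"
)

# Reporting period under study
period_start <- as.Date("2024-01-02")
period_end   <- as.Date("2024-01-03")
in_period    <- sessions$day >= period_start & sessions$day <= period_end

active_users <- unique(sessions$device_id[in_period])          # used the app in the period
new_users    <- unique(sessions$device_id[in_period &
                                          sessions$first_day >= period_start])
pct_new      <- 100 * length(new_users) / length(active_users) # percentage of new users
n_sessions   <- sum(in_period)                                  # sessions in the period
avg_minutes  <- mean(sessions$minutes[in_period])               # average usage duration

# Accumulated users: everyone seen before a given date
accumulated <- length(unique(sessions$device_id[sessions$day < as.Date("2024-01-04")]))

# User retention: users first seen by 3 January who are still active on or after 8 January
cohort   <- unique(sessions$device_id[sessions$first_day <= as.Date("2024-01-03")])
retained <- intersect(cohort, sessions$device_id[sessions$day >= as.Date("2024-01-08")])

print(list(active = length(active_users), new = length(new_users),
           pct_new = pct_new, sessions = n_sessions,
           avg_minutes = avg_minutes, accumulated = accumulated,
           retained = length(retained)))
```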
Well-designed analytics reports give mobile marketers useful data about who uses their apps and
how they use them. Without such hard data, marketers are left to make educated guesses.
The following are listed as the goals of mobile analytics:
• Helps build an efficient mobile marketing strategy: Without verifying which content or
functionality customers respond to, marketers have no ground on which to strategise, and this
is where analytics comes into play. Analytics assists in defining a measurable goal.
• Discovers the popular features of the app and the ones people are not using: Analytics
tracking shows the various screens and menus users visit while navigating through your app.
Tracking the screens they spend most of their time on and often return to gives a fair
understanding of the content users look for, and finding a better way to get them there is
definitely worthwhile.
• Determines which segment of your app is converting most: Mobile analytics helps
marketers identify the segments of their app that result in more conversions than the rest of
the sections.
• Observes whether people use the app at all: According to research, only 25% of business
apps are used continually. Because repeat use is crucial to building a relationship with your
users, analytics helps track whether repeat users spend time in your app and provides
significant insight into whether the app adds enough value to give users reasons to return.
• Detects the mobile device: To increase the overall efficiency of your mobile marketing and
analytics strategy, it is crucial to find the mobile devices on which your app is downloaded
most, which helps determine the devices to prioritise.
• Realises the complete mobile app user experience: By enabling data-driven decisions at
every stage of the app life cycle, analytics helps marketers and developers create an app
experience that is more useful and appealing to users and supports the overall marketing
strategy.
Mobile Analytics and Web Analytics

Mobile analytics has several similarities with Web and social analytics; for example, both can
analyse the behaviour of the user with regard to an application and send this information to the
service provider. However, there are also several important differences between Web analytics and
mobile analytics, which we will discuss in this section.
Some of the main differences between Web analytics and mobile analytics are as follows:
Analytics segmentation
Mobile analytics works on the basis of the location of mobile devices. For example, suppose a
company is offering a cab service in a city like New York. In this case, the company can use mobile
analytics to identify and target people travelling in New York. Mobile analytics works on
location-based segments, while Web analytics works globally.
Complexity of code
Mobile analytics requires more complex code and programming languages to implement than Web
analytics, which is easier to code.
Network service providers

Mobile analytics is totally dependent on Network Service Providers (NSPs) while Web analytics
is independent of this factor.
Measure
Sometimes, it is difficult to measure information from the mobile analytics apps because they can
run offline. Web analytics always runs online so we can easily measure vital information with it.
Tools
A complete analysis of mobile data often requires Web analytics tools alongside mobile
analytics tools. Web analytics, on the other hand, does not require any other tool for analysis.
Types of Results from Mobile Analytics

The study of consumer behaviour helps business firms or other organisations to improve their
marketing strategies. Nowadays, every organisation is making that extra effort to understand and
know the behaviour of its consumers.
Mobile analytics provides an effective way of measuring large amounts of mobile data for
organisations. It also shows how useful marketing tools such as ads are in converting potential
buyers to actual purchasers. It also offers deep insight into what makes people buy a product or
service and what makes them quit a service.
The technologies behind mobile analytics like Global Positioning System (GPS) are more
sophisticated than those used in Web analytics; hence, compared to Web analytics, users can be
tracked and targeted more accurately with mobile analytics.
Mobile analytics can easily and effectively collect data from various data sources and turn
it into useful information. Mobile analytics keeps track of the following information:
Total time spent

This information shows the total time spent by the user with an application.
Visitors’ location
This information shows the location of the user using any particular application.
Number of total visitors

This is the total number of users using any particular application, useful in knowing the
application’s popularity.
Click paths of the visitors

Mobile analytics keeps track of the activities of a user visiting the pages of any application.
Pages viewed by the visitor

Mobile analytics tracks the pages of any application visited by the user, which again reflects the
popular sections of the application.
Downloading choice of users

Mobile analytics keeps track of files downloaded by the user. This helps app owners to understand
the type of data users like to download.
Type of mobile device and network used
Mobile analytics tracks the type of mobile device and network used by the user. This information
helps mobile service providers and mobile phone sellers understand the popularity of mobile
devices and networks, and make further improvements as required.
Screen resolution of the mobile phone used

Content displayed on a mobile device is laid out according to the screen size of that device.
Mobile analytics records the screen resolution in use, which helps ensure that the content fits a
particular device screen.
Performance of advertising campaigns

Mobile analytics is used to keep track of the performance of advertising campaigns and other
activities by analysing the number of visitors and time spent by them as well as other methods.
Three major types of results from mobile analytics are explained below:
Advertising/marketing analytics
Even for an outstanding app, the chances of being noticed among a million other apps are low
these days unless marketing campaigns attract the right type of users to install it, stay engaged
and contribute to the app's revenue. The most common route to market an app is partnering with
various ad networks, but even so, it is difficult to determine which ad networks and publishers are
delivering results without marketing analytics. Commonly the following marketing analytics data
are collected:
• Installs
• Opens
• Clicks
• Purchases
• Registrations
• Content viewed
• Level achieved
• Shares
• Invites
• Custom events
In-App analytics
Whether an app delivers content, sells products or offers a gaming experience, it must satisfy
user expectations to be successful. Every app's goal is to let users achieve the objectives for
which the app was designed in the simplest manner. Without user or in-app behaviour data,
identifying areas for improvement is guesswork, and this is where in-app analytics plays its role.
Being an “in-session” form of analytics, it analyses what users are doing inside an app and how
they are interacting with it. The major focal areas are conversion funnels, pathways and feature
optimisation, which are mainly used by product managers. Commonly the following in-app
analytics data are collected:
• Device Profile (Mobile phone, tablet, etc., Manufacturer, Operating system)
• User Demographics (Location, Gender, New or returning user, Approximate age, Language)
• In-App Behaviour (event tracking, i.e., buttons clicked, ads clicked, purchases made, levels
completed, articles read, screens viewed, etc.)
Performance analytics
This involves the actual performance of the app. The two major measures for performance
analytics are: App uptime and App responsiveness. The factors which can impact the performance
of an app irrespective of how well it was coded include:
App complexity
Most apps depend on various third-party services; hence, the speed of such services directly
impacts the app's performance.
Hardware variation
Apps available on both iOS and Android platforms should be compatible with the device
specification variations on both platforms, chiefly the varying hardware across phone models.
Any incompatibility impacts app performance heavily.
Available operating systems

An app developed just for Android and iOS is likely to see its performance affected when
installed on other available operating systems such as BlackBerry, Windows, Bada, Symbian,
Firefox, Ubuntu, Palm, Tizen, Sailfish, etc.
Carrier/network
Most of the major networks are expanding their technology and coverage, providing users with
better access and faster data speeds. Many still have significant gaps in the coverage of their
latest technology, however, which forces users to fall back on older standards and directly impacts
app performance.
Users expect an app to work efficiently and are becoming less patient with underperformance as
overall technology gets faster and better. Without performance analytics, identifying the root
cause of issues and prioritising solutions would be difficult.
Commonly the following performance analytics data are collected:
• API latency
• Carrier/network latency
• Data transactions
• Crashes
• Exceptions
• Errors
Types of Applications for Mobile Analytics
Mobile analytics records the demographics and behaviours of unique users by tracking them
with technologies that differ between websites, which use either JavaScript or cookies, and apps,
which require a software development kit (SDK).
Companies use this data to work out what users need in order to deliver a more satisfying
user experience. From this data, analytics shows the following information:
• The reason the visitors are drawn to the mobile site or app
• The duration visitors generally stay
• The features visitors are interacting with
• The problem areas for the visitors within the site or app
• Factors instigating purchases
• Factors responsible for higher usage and user retention
There are two types of applications made for mobile analytics. They are:
• Web mobile analytics

• Mobile application analytics
Web Mobile Analytics

Mobile Web refers to the use of mobile phones or other devices like tablets to view online content
via a lightweight mobile browser. The name of a mobile-specific site can take the form
m.example.com. The mobile Web experience sometimes depends on the screen size of the device.
For example, if you design an application for a small screen, then its images would appear blurred
on a big screen; similarly, if you make your site for the big screen, then it can be heavy for a small
screen device.
Some organisations are starting to build sites specifically for tablets because they have found that
neither their mobile-specific site nor their main website ideally serves the tablet segment.
To solve this problem, mobile Web should have a responsive design. In other words, it should have
the property to adapt the content to the screen size of the user’s device.
Figure 10.2 shows the difference between a website, a mobile site and a responsive-designed site.
In Figure 10.2, you can see that a website can be opened on both computers and mobile phones,
while a mobile site can be opened only on mobile phones; responsive-designed sites, on the other
hand, can be opened on any device, such as a computer, tablet or mobile phone.
Mobile Application Analytics
The term mobile app is short for the term mobile application software. It is an application program
designed to run on smartphones and other mobile devices.
Mobile apps are usually available through application distribution platforms like Apple App Store
and Google Play. Application distribution platforms are typically operated by the owner of the
mobile operating system.
Examples of application distribution platforms include the Apple App Store, Google Play, Windows
Phone Store, and BlackBerry App World. Some mobile apps are freely available, while others must be
bought.
Depending on the objective of analytics, an organisation should decide whether it needs a mobile
application or a mobile website. If the organisation wants to create an interactive engagement with
users on mobile devices, then a mobile app is a good option; however, for business purposes,
mobile websites are more suitable than mobile apps.
Table 10.1 lists the main differences between mobile app analytics and mobile Web analytics:
Table 10.1: Differences Between Mobile App Analytics and Mobile Web Analytics
| Factors | Mobile App Analytics | Mobile Web Analytics |
|---|---|---|
| Screen and page | Mobile app analytics does not have pages. The user can interact with various screens. | Mobile Web analytics has pages like normal websites, and users do interact with various pages. |
| Use of built-in features of mobile devices | Mobile app analytics can access built-in features such as gyroscope, GPS, accelerometer, and storage. | Mobile Web analytics does not use built-in features like gyroscope, GPS, accelerometer, etc. |
| Session time | Mobile app analytics has shorter session timeouts (around 30 seconds). | Mobile Web analytics has longer session timeouts. In general, a session will end after 30 minutes of inactivity for websites. |
| Online/offline | Depending on how it was developed, mobile app analytics may not need to be connected to a mobile network. | Mobile Web analytics requires an Internet connection and can run online only. |
| Updates | App owners provide frequent updates and new versions of the apps. | Updates are not that frequent. |
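The effect of the different session timeouts in Table 10.1 can be illustrated with a short R sketch. It assumes that a new session starts whenever the gap between two consecutive events exceeds the timeout; the function name count_sessions and the event timestamps are hypothetical.

```r
# Count sessions in a stream of event timestamps (seconds since first launch).
# A new session begins whenever the gap since the previous event exceeds the timeout.
count_sessions <- function(event_times, timeout_secs) {
  if (length(event_times) == 0) return(0L)
  gaps <- diff(sort(event_times))
  1L + sum(gaps > timeout_secs)
}

# Illustrative event times for one user
events <- c(0, 10, 50, 400, 420, 2500)

app_sessions <- count_sessions(events, timeout_secs = 30)       # app-style timeout
web_sessions <- count_sessions(events, timeout_secs = 30 * 60)  # website-style timeout

print(c(app = app_sessions, web = web_sessions))
```

The same activity is split into more sessions under the shorter app timeout, which is why session counts from app analytics and Web analytics are not directly comparable.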
Challenges of Mobile Analytics
Mobile analytics has its own challenges. Some of the main challenges can be listed as follows:
Unavailability of uniform technology

Different mobile phones support different technologies. For example, some mobile phones support
images, JavaScript, HTML and cookies, while others do not.
Random change in subscriber identity

TMSI (Temporary Mobile Subscriber Identity) is the identity of a mobile device as known to the
mobile network being used. This identity is randomly assigned by the VLR (Visitor
Location Register) to every mobile in the area as it is switched on. This random change in the
subscriber ID makes it difficult to gather important information such as the location of the user.
Redirect
Some mobile devices do not support redirects. The term ‘redirect’ is used to describe the process
in which the system automatically opens another page.
Special characters in the URL

In some mobile devices, some special characters in the URL are not supported.
Interrupted connections
The mobile connection with the tower is not always dedicated. It can be interrupted when the user
is moving from one tower to another tower. This interruption in the connection breaks the requests
sent by the devices.
In addition to the general issues mentioned above, mobile analysts also face the following
critical issues, which discourage mobile analytics marketing:
Limited understanding of the network operators

Network operators are unable to understand the business processes happening outside the carrier’s
firewall.
True real-time analysis

True real-time data analysis is not always possible with mobile analytics due to various reasons
such as signal interruption, variation in technology used in mobiles, random change in subscriber
id, etc.
Security issues
Mobile technology has various important features, but some of these, such as GPS,
cookies, Wi-Fi and beacons, can disclose important information about the user. Information like
credit card details, bank accounts, medical history, or other personal content can be easily misused.
Some techniques like Deep Packet Inspection (DPI), Deep Packet Capture (DPC), and application
logs can increase security threats.
To cope with such security threats, business organisations must intelligently monitor all
communications in real time and make sure that personal data is not accessible to everyone.
Big Data Analytics using Big R
Introduction
Big data analytics has become an integral part of decision-making and business intelligence across
various industries. With the exponential growth of data, organizations need robust tools and
techniques to extract meaningful insights. R, a powerful programming language and software
environment, has gained popularity for its extensive capabilities in data analysis and statistical
computing. In this comprehensive guide, we will explore how R can be effectively utilized for big
data analytics, covering various aspects and techniques.
Understanding R for Big Data Analytics
R Programming Language: R is an open-source programming language that provides a wide range
of statistical and graphical techniques. It offers a rich ecosystem of packages and libraries that
support data manipulation, visualization, and modeling. R's flexibility and extensibility make it an
excellent choice for big data analytics.
R for Big Data: R is traditionally associated with datasets that fit in memory, but it can also
handle big data efficiently. Several R packages have been developed specifically for big data
analytics, allowing users to process and analyze large datasets without compromising performance.
Handling Big Data in R
R Packages for Big Data Analytics: R offers several packages that facilitate big data analytics.
Some popular packages include −
• dplyr − This package provides a grammar of data manipulation, allowing users to perform
various operations like filtering, summarizing, and joining datasets efficiently.
• data.table − The data.table package enhances data manipulation by implementing fast and
memory-efficient data structures. It can handle large datasets with millions or even billions
of rows.
• SparkR − Built on Apache Spark, the SparkR package enables distributed data processing
with R. It leverages the power of Spark's distributed computing capabilities to analyze big
data efficiently.
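The filtering, summarizing and joining operations mentioned for dplyr can be sketched as follows. This assumes the dplyr package (version 1.0 or later) is installed; the tables orders and customers and their columns are hypothetical.

```r
library(dplyr)

# Hypothetical tables: orders placed by customers, and the region of each customer
orders <- data.frame(
  customer_id = c(1, 1, 2, 3, 3, 3),
  amount      = c(20, 35, 15, 50, 5, 40)
)
customers <- data.frame(
  customer_id = c(1, 2, 3),
  region      = c("North", "South", "North"),
  stringsAsFactors = FALSE
)

region_totals <- orders %>%
  filter(amount >= 10) %>%                       # filtering: drop small orders
  inner_join(customers, by = "customer_id") %>%  # joining: attach each customer's region
  group_by(region) %>%
  summarise(total = sum(amount),                 # summarizing: one row per region
            n_orders = n(),
            .groups = "drop")

print(region_totals)
```

The same pipeline style carries over to larger data, although very large tables usually call for data.table or a Spark backend as described above.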
Parallel Computing with R − Parallel computing is essential for processing big data efficiently.
R provides several approaches for parallelizing computations −
• Multicore processing − R supports parallel execution on a single machine through packages
like parallel and foreach, which distribute work across worker processes so that multiple CPU
cores are used.
• Distributed Computing − Packages like sparklyr and foreach in conjunction with
distributed computing frameworks like Apache Spark enable parallel processing across
multiple machines, scaling R's capabilities for big data analytics.
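A minimal sketch of single-machine parallelism with the parallel package, which ships with R. The task function slow_task and the input sizes are made up for illustration; mclapply relies on process forking, so the sketch falls back to one core on Windows.

```r
library(parallel)

# Hypothetical CPU-bound task: sum of squares of 1..n
slow_task <- function(n) sum(as.numeric(seq_len(n))^2)

inputs <- c(1e4, 2e4, 3e4, 4e4)

# mclapply forks worker processes; forking is unavailable on Windows,
# so use a single core there (detectCores() reports how many cores exist)
cores <- if (.Platform$OS.type == "windows") 1L else 2L

parallel_results <- mclapply(inputs, slow_task, mc.cores = cores)

# Same computation run sequentially, for comparison
serial_results <- lapply(inputs, slow_task)

print(isTRUE(all.equal(parallel_results, serial_results)))
```

For tasks this small the overhead of starting workers outweighs the gain; the benefit appears when each task takes noticeably longer than the time needed to dispatch it.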
Data Manipulation and Preprocessing
Data Cleaning − Data cleaning is a crucial step in big data analytics. R provides a variety of
functions and packages for data cleaning tasks, including missing data imputation, outlier
detection, and data transformation.
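The three cleaning tasks named above can be sketched in base R. The readings vector is hypothetical, and median imputation, the 1.5 × IQR rule and min-max scaling are just common choices, not the only ones.

```r
# Hypothetical sensor readings with one missing value and one extreme value
readings <- c(10.2, 9.8, NA, 10.5, 10.1, 55.0, 9.9)

# Missing data imputation: replace NA with the median of the observed values
imputed <- ifelse(is.na(readings), median(readings, na.rm = TRUE), readings)

# Outlier detection with the 1.5 * IQR rule
q   <- unname(quantile(imputed, c(0.25, 0.75)))
iqr <- q[2] - q[1]
is_outlier <- imputed < q[1] - 1.5 * iqr | imputed > q[2] + 1.5 * iqr

# Data transformation: rescale the remaining values to the range [0, 1]
clean  <- imputed[!is_outlier]
scaled <- (clean - min(clean)) / (max(clean) - min(clean))

print(which(is_outlier))
print(round(scaled, 2))
```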
Data Transformation − R offers powerful functions for transforming data, such as reshaping data
from wide to long format (the melt function in reshape2 or data.table), creating new variables
using calculated values (the mutate function in dplyr), and splitting or combining variables (the
separate and unite functions in tidyr).
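A short sketch of these transformations, assuming the dplyr and tidyr packages are installed. Note that it uses tidyr's pivot_longer for the wide-to-long step in place of melt; the table sales_wide and its columns are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one revenue column per quarter
sales_wide <- data.frame(
  store = c("NY-01", "NY-02"),
  q1    = c(100, 80),
  q2    = c(120, 90),
  stringsAsFactors = FALSE
)

sales_long <- sales_wide %>%
  # wide to long: one row per store and quarter
  pivot_longer(cols = c(q1, q2), names_to = "quarter", values_to = "revenue") %>%
  # new variable from a calculated value
  mutate(revenue_k = revenue / 1000) %>%
  # split one column into two
  separate(store, into = c("city", "branch"), sep = "-") %>%
  # combine two columns into one, keeping the originals
  unite("store_quarter", city, quarter, sep = "_", remove = FALSE)

print(sales_long)
```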
Feature Engineering − Feature engineering involves creating new features from existing data to
improve model performance. R provides a plethora of packages and functions for feature
engineering, including text mining, time series analysis, and dimensionality reduction techniques.
Modeling and Analysis
Machine Learning with R − R is widely used for machine learning tasks. It offers numerous
packages for various machine learning algorithms, including classification, regression, clustering,
and ensemble methods. Popular machine learning packages in R include caret, randomForest,
glmnet, and xgboost.
Deep Learning with R − Deep learning has gained significant popularity in recent years. R
provides several packages for deep learning, such as keras, tensorflow, and mxnet. These
packages allow users to build and train deep neural networks for tasks like image classification,
natural language processing, and time series analysis.
Data Visualization
Data Visualization Packages − R is renowned for its extensive data visualization capabilities. It
provides a wide range of packages for creating visually appealing and informative plots and charts.
Some popular data visualization packages in R include −
• ggplot2 − ggplot2 is a highly flexible and powerful package for creating elegant and
customizable data visualizations. It follows the grammar of graphics principles, allowing
users to build complex plots layer by layer.
• plotly − plotly is an interactive visualization package that enables the creation of
interactive and web-based plots. It offers a wide range of options for creating interactive
charts, maps, and dashboards.
• lattice − lattice provides a comprehensive set of functions for creating conditioned plots,
such as trellis plots and multi-panel plots. It is particularly useful for visualizing
multivariate data.
Visualizing Big Data − When working with big data, visualization can be challenging due to the
sheer volume of data. R offers techniques to visualize big data efficiently, such as sampling
techniques, aggregating data, and using interactive visualizations that can handle large datasets.
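Sampling and aggregation before plotting can be sketched in base R as below. The data are simulated, the sample size and bin width are arbitrary choices, and the plot is written to a temporary PDF so the script also runs non-interactively (in an interactive session the pdf() and dev.off() lines can be dropped).

```r
set.seed(42)  # reproducible illustration

# Simulated stand-in for a large dataset
n   <- 1e5
big <- data.frame(x = runif(n, 0, 100))
big$y <- 2 * big$x + rnorm(n, sd = 10)

# Option 1: plot a random sample instead of every point
sampled <- big[sample(n, 5000), ]

# Option 2: aggregate into bins and plot one summary value per bin
big$bin   <- cut(big$x, breaks = seq(0, 100, by = 5), include.lowest = TRUE)
bin_means <- tapply(big$y, big$bin, mean)
bin_mids  <- seq(2.5, 97.5, by = 5)

pdf(tempfile(fileext = ".pdf"))
plot(sampled$x, sampled$y, pch = ".", xlab = "x", ylab = "y",
     main = "5,000 sampled points with bin means")
points(bin_mids, bin_means, col = "red", pch = 19)
dev.off()
```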
Performance Optimization
Code Optimization − To enhance performance in big data analytics, optimizing code is crucial.
R provides several techniques for code optimization, including vectorization, avoiding
unnecessary loops, and efficient memory management.
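The difference between a loop that grows its result and a vectorized expression can be seen in a small base R sketch. The vector size is arbitrary and the measured times will vary by machine, so no specific timings are claimed here.

```r
set.seed(1)
x <- runif(1e4)

# Loop version: the result vector is copied and extended on every iteration
loop_time <- system.time({
  squares_loop <- numeric(0)
  for (v in x) squares_loop <- c(squares_loop, v^2)
})

# Vectorized version: one operation applied to the whole vector
vec_time <- system.time({
  squares_vec <- x^2
})

print(loop_time["elapsed"])
print(vec_time["elapsed"])
print(isTRUE(all.equal(squares_loop, squares_vec)))
```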
Memory Management − Big data often exceeds the available memory capacity, requiring careful
memory management. R provides techniques for reducing memory usage, such as using efficient
data structures (data.table), garbage collection, and loading data in chunks.
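Loading data in chunks can be sketched in base R by reading a fixed number of lines at a time from an open file connection and keeping only running summaries. The temporary CSV file, the chunk size and the column names are made up for illustration.

```r
# Write a small CSV to a temporary file to stand in for a file too large for memory
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, value = (1:1000) %% 7), path, row.names = FALSE)

con    <- file(path, open = "r")
header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])

chunk_size    <- 250
running_total <- 0
rows_seen     <- 0

repeat {
  lines <- readLines(con, n = chunk_size)   # read the next block of rows
  if (length(lines) == 0) break             # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  running_total <- running_total + sum(chunk$value)  # keep only the summary
  rows_seen     <- rows_seen + nrow(chunk)
}

close(con)
unlink(path)

print(c(rows = rows_seen, total = running_total))
```

Only one chunk is held in memory at a time, so peak memory use depends on chunk_size rather than on the size of the file.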
Real-World Applications
Finance and Banking − Big data analytics in finance and banking can help in fraud detection,
risk modeling, portfolio optimization, and customer segmentation. R's capabilities in data analysis
and modeling make it a valuable tool in this domain.
Healthcare − In the healthcare industry, big data analytics can contribute to disease prediction,
drug discovery, patient monitoring, and personalized medicine. R's statistical and machine learning
capabilities are well-suited for analyzing healthcare data.
Marketing and Customer Analytics − R plays a significant role in marketing and customer
analytics by analyzing customer behavior, sentiment analysis, market segmentation, and campaign
optimization. It helps organizations make data-driven marketing decisions.
