
PCA20S02J - DATA ANALYSIS USING R

UNIT 1
Data Analysis is a process of inspecting, cleaning, transforming and modeling data with the goal
of discovering useful information, suggesting conclusions and supporting decision-making.

Types of Data Analysis

Several data analysis techniques exist, spanning domains such as business, science and social
science, and going by a variety of names. The major data analysis approaches are −

• Data Mining
• Business Intelligence
• Statistical Analysis
• Predictive Analytics
• Text Analytics

1. Data Mining

Data Mining is the analysis of large quantities of data to extract previously unknown, interesting
patterns, unusual records and dependencies. Note that the goal is the extraction of patterns and
knowledge from large amounts of data, not the extraction of the data itself.

Data mining involves computer science methods at the intersection of artificial intelligence,
machine learning, statistics, and database systems.

The patterns obtained from data mining can be considered as a summary of the input data that
can be used in further analysis or to obtain more accurate prediction results by a decision support
system.

2. Business Intelligence

Business Intelligence techniques and tools are used for the acquisition and transformation of large
amounts of unstructured business data to help identify, develop and create new strategic business
opportunities.

The goal of business intelligence is to allow easy interpretation of large volumes of data to
identify new opportunities. It helps in implementing an effective strategy based on insights that
can provide businesses with a competitive market-advantage and long-term stability.

3. Statistical Analysis

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of
data.

In data analysis, two main statistical methodologies are used −


Descriptive statistics − In descriptive statistics, data from the entire population or a sample is
summarized with numerical descriptors such as −

• Mean, Standard Deviation for Continuous Data
• Frequency, Percentage for Categorical Data

Inferential statistics − Inferential statistics uses patterns in the sample data to draw inferences
about the population represented, while accounting for randomness. These inferences can be −

• answering yes/no questions about the data (hypothesis testing)
• estimating numerical characteristics of the data (estimation)
• describing associations within the data (correlation)
• modeling relationships within the data (e.g., regression analysis)

These summaries can be computed directly in R, as sketched below.
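The sketch uses the built-in mtcars data set, chosen here only as an example; any data frame with
continuous and categorical variables would do.

# Descriptive statistics
data(mtcars)
mean(mtcars$mpg)                      # mean of a continuous variable
sd(mtcars$mpg)                        # standard deviation
table(mtcars$cyl)                     # frequency of a categorical variable
prop.table(table(mtcars$cyl)) * 100   # percentage of each category

# Inferential statistics
t.test(mpg ~ am, data = mtcars)       # hypothesis testing: does mpg differ by transmission?
t.test(mtcars$mpg)$conf.int           # estimation: 95% confidence interval for mean mpg
cor(mtcars$wt, mtcars$mpg)            # correlation between weight and mpg
summary(lm(mpg ~ wt, data = mtcars))  # regression: mpg modeled as a function of weight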

4. Predictive Analytics

Predictive Analytics uses statistical models to analyze current and historical data and to make
forecasts (predictions) about future or otherwise unknown events. In business, predictive analytics
is used to identify risks and opportunities that aid in decision-making.

5. Text Analytics

Text Analytics, also referred to as Text Mining or Text Data Mining, is the process of deriving
high-quality information from text. Text mining usually involves structuring the input text,
deriving patterns within the structured data using means such as statistical pattern learning, and
finally evaluating and interpreting the output.

Data Analysis Process

Data Analysis was defined by the statistician John Tukey in 1961 as "procedures for analyzing
data, techniques for interpreting the results of such procedures, ways of planning the gathering of
data to make its analysis easier, more precise or more accurate, and all the machinery and results
of (mathematical) statistics which apply to analyzing data."

Thus, data analysis is a process for obtaining large, unstructured data from various sources and
converting it into information that is useful for −

• Answering questions
• Testing hypotheses
• Supporting decision-making
• Disproving theories

Data Analysis is a process of collecting, transforming, cleaning, and modeling data with the goal
of discovering the required information. The results so obtained are communicated, suggesting
conclusions, and supporting decision-making. Data visualization is at times used to portray the
data for the ease of discovering the useful patterns in it. The terms Data Modeling and Data
Analysis are often used interchangeably.

Data Analysis Process consists of the following phases that are iterative in nature −

• Data Requirements Specification
• Data Collection
• Data Processing
• Data Cleaning
• Data Analysis
• Communication

1. Data Requirements Specification

The data required for analysis is based on a question or an experiment. Based on the
requirements of those directing the analysis, the data necessary as inputs to the analysis is
identified (e.g., Population of people). Specific variables regarding a population (e.g., Age and
Income) may be specified and obtained. Data may be numerical or categorical.

2. Data Collection

Data Collection is the process of gathering information on targeted variables identified as data
requirements. The emphasis is on ensuring accurate and honest collection of data. Data
Collection ensures that data gathered is accurate such that the related decisions are valid. Data
Collection provides both a baseline to measure and a target to improve.
Data is collected from various sources ranging from organizational databases to the information
in web pages. The data thus obtained may not be structured and may contain irrelevant
information. Hence, the collected data needs to be subjected to Data Processing and Data
Cleaning.

3. Data Processing

The data that is collected must be processed or organized for analysis. This includes structuring
the data as required for the relevant Analysis Tools. For example, the data might have to be
placed into rows and columns in a table within a Spreadsheet or Statistical Application. A Data
Model might have to be created.
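As a minimal sketch of this step in R, the snippet below reads a hypothetical raw file into a data
frame and coerces each column to the type the analysis expects (the file name and column names
are assumptions made for illustration).

# Read a hypothetical raw file into a rows-and-columns structure (a data frame)
survey <- read.csv("survey_raw.csv", stringsAsFactors = FALSE)

# Keep only the variables required for the analysis
survey <- survey[, c("age", "income", "gender")]

# Coerce each column to the type expected by the analysis tools
survey$age    <- as.numeric(survey$age)
survey$income <- as.numeric(survey$income)
survey$gender <- factor(survey$gender)

str(survey)   # inspect the resulting structure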

4. Data Cleaning

The processed and organized data may be incomplete, contain duplicates, or contain errors. Data
Cleaning is the process of preventing and correcting these errors. There are several types of Data
Cleaning, depending on the type of data. For example, while cleaning financial data, certain
totals might be compared against reliable published numbers or defined thresholds. Likewise,
quantitative methods can be used to detect outliers, which can subsequently be excluded from the
analysis.
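Continuing the hypothetical survey data frame from the previous sketch, the following lines show
typical cleaning steps in R (the 1.5 × IQR outlier rule is an illustrative choice, not a fixed
standard).

# Remove exact duplicate rows
survey <- unique(survey)

# Drop rows with missing values in the key variables
survey <- survey[complete.cases(survey[, c("age", "income")]), ]

# Flag income outliers with a simple interquartile-range rule
q   <- quantile(survey$income, c(0.25, 0.75))
iqr <- q[2] - q[1]
outlier <- survey$income < q[1] - 1.5 * iqr | survey$income > q[2] + 1.5 * iqr

survey_clean <- survey[!outlier, ]   # data set with outliers excluded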

5. Data Analysis

Data that is processed, organized and cleaned would be ready for the analysis. Various data
analysis techniques are available to understand, interpret, and derive conclusions based on the
requirements. Data Visualization may also be used to examine the data in graphical format, to
obtain additional insight regarding the messages within the data.

Statistical data models such as correlation and regression analysis can be used to identify the
relations among the data variables. These models, which are descriptive of the data, are helpful in
simplifying the analysis and communicating results.

The process might require additional Data Cleaning or additional Data Collection, and hence
these activities are iterative in nature.
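As a small illustration of this phase, the sketch below explores the built-in airquality data set
with a correlation matrix and a regression model (the data set is chosen only as a stand-in for
processed, cleaned data).

# Stand-in for processed and cleaned data
data(airquality)
aq <- na.omit(airquality)

# Correlation: pairwise relations among the numeric variables
cor(aq[, c("Ozone", "Solar.R", "Wind", "Temp")])

# Regression analysis: describe Ozone as a function of the other variables
model <- lm(Ozone ~ Solar.R + Wind + Temp, data = aq)
summary(model)   # coefficients, R-squared and p-values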

6. Communication

The results of the data analysis are to be reported in a format as required by the users to support
their decisions and further action. The feedback from the users might result in additional
analysis.

The data analysts can choose data visualization techniques, such as tables and charts, which help
in communicating the message clearly and efficiently to the users. The analysis tools provide the
facility to highlight the required information with color codes and formatting in tables and charts.
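A brief R sketch of this idea, again using the built-in airquality data purely as an example: a
summary table plus a simple chart for reporting.

data(airquality)
aq <- na.omit(airquality)

# Tabular summary: mean ozone level by month
round(tapply(aq$Ozone, aq$Month, mean), 1)

# Chart: the same summary as a color-coded bar chart
barplot(tapply(aq$Ozone, aq$Month, mean),
        col  = "steelblue",
        main = "Mean ozone by month",
        xlab = "Month", ylab = "Mean ozone (ppb)")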
DIFFERENT FORMS OF DATA

Data is a set of values of subjects with respect to qualitative or quantitative variables. Data is
raw, unorganized facts that need to be processed. Data can be something simple and seemingly
random and useless until it is organized. When data is processed, organized, structured or
presented in a given context so as to make it useful, it is called information. Information
necessary for research activities is available in different forms.

The main forms of information available are:

• Primary data
• Secondary data
• Cross-sectional data
• Categorical data
• Time series data
• Spatial data
• Ordered data

1. Primary Data

• Primary data is original and unique data, which is directly collected by the researcher from a
source according to his or her requirements.
• It is the data collected by the investigator himself or herself for a specific purpose.
• Data gathered by finding out first-hand the attitudes of a community towards health services,
ascertaining the health needs of a community, evaluating a social program, determining the job
satisfaction of the employees of an organization, and ascertaining the quality of service provided
by a worker are examples of primary data.

2. Secondary Data

• Secondary data refers to the data which has already been collected for a certain purpose and
documented somewhere else.
• Data collected by someone else for some other purpose (but being utilized by the investigator
for another purpose) is secondary data.
• Gathering information with the use of census data to obtain information on the age-sex
structure of a population, the use of hospital records to find out the morbidity and mortality
patterns of a community, the use of an organization's records to ascertain its activities, and the
collection of data from sources such as articles, journals, magazines, books and periodicals to
obtain historical and other types of information are examples of secondary data.

3. Cross Sectional Data

• Cross-sectional data is a type of data collected by observing many subjects (such as
individuals, firms, countries, or regions) at the same point of time, or without regard to
differences in time.
• It is the data for a single time point or single space point.
• This type of data is limited in that it cannot describe changes over time or cause-and-effect
relationships in which one variable affects the other.

4. Categorical Data

• Categorical variables represent types of data which may be divided into groups. Examples of
categorical variables are race, sex, age group, and educational level.
• Data which cannot be measured numerically is called categorical data. Categorical data is
qualitative in nature.
• Categorical data is also known as attributes.
• A data set consisting of observations on a single characteristic is a univariate data set. A
univariate data set is categorical if the individual observations are categorical responses.
Examples of categorical data: intelligence, beauty, literacy, unemployment (see the short R
sketch below).
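The sketch represents the categories as factor levels; the literacy values are made up for
illustration.

# Categorical (qualitative) data is represented in R as a factor
literacy <- factor(c("Literate", "Illiterate", "Literate", "Literate", "Illiterate"))

levels(literacy)   # the categories (attributes)
table(literacy)    # frequency of each category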

5. Time series Data

• Time series data occurs wherever the same measurements are recorded on a regular basis.
• It consists of quantities that represent or trace the values taken by a variable over a period such
as a month, quarter, or year.
• The values of different phenomena such as temperature, weight, population, etc. can be
recorded over different periods of time.
• The values of the variable may be increasing, decreasing, or constant over time.
• Data arranged according to time periods is called time-series data, e.g., the population in
different time periods (see the short R sketch below).
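A small sketch, using made-up quarterly population figures purely for illustration:

# Hypothetical quarterly population figures (values are illustrative only)
population <- c(101, 103, 104, 107, 109, 112, 114, 118)

# ts() marks the data as a regular time series: quarterly, starting in Q1 2022
pop_ts <- ts(population, start = c(2022, 1), frequency = 4)

pop_ts
plot(pop_ts, main = "Quarterly population (illustrative)", ylab = "Population")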

6. Spatial Data

• Also known as geospatial data or geographic information, spatial data is data that identifies the
geographic location of features and boundaries on Earth, such as natural or constructed features,
oceans, and more.
• Spatial data is usually stored as coordinates and topology and is data that can be mapped.
• Spatial data is used in geographical information systems (GIS) and other geolocation or
positioning services.
• Spatial data consists of points, lines, polygons and other geographic and geometric data
primitives, which can be mapped by location, stored with an object as metadata or used by a
communication system to locate end-user devices.
• Spatial data may be classified as scalar or vector data. Each provides distinct information
pertaining to geographical or spatial locations. (A small R sketch of point data follows.)
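The simplest form of spatial point data in R is a set of coordinates; the sketch stores a few cities
as longitude/latitude pairs and plots them. The coordinates are approximate and used only for
illustration; dedicated GIS work would normally use a spatial package such as sf, which is not
shown here.

# Spatial point data stored as coordinates in a data frame
cities <- data.frame(
  name = c("Chennai", "Delhi", "Mumbai"),
  lon  = c(80.27, 77.21, 72.88),
  lat  = c(13.08, 28.61, 19.08)
)

# Map the points by location
plot(cities$lon, cities$lat, pch = 19,
     xlab = "Longitude", ylab = "Latitude", main = "Point locations")
text(cities$lon, cities$lat, labels = cities$name, pos = 3)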

7. Ordered Data

• Data arranged according to ordered categories is called ordered data.
• Ordered data is similar to categorical data except that there is a clear ordering of the
categories.
• For example, for the category economic status, the ordered values may be low, medium and
high (see the short R sketch below).
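In R this can be represented as an ordered factor:

# An ordered factor: the levels have an explicit ordering
status <- factor(c("low", "high", "medium", "low", "high"),
                 levels = c("low", "medium", "high"),
                 ordered = TRUE)

status
status > "low"   # comparisons respect the ordering
table(status)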
DIFFERENT TYPES OF DATA
Data is fundamental to business decisions. A company's ability to gather the right data, interpret
it, and act on those insights is often what will determine its level of success. But the amount of
data accessible to companies is ever increasing, as are the different kinds of data available.
Business data comes in a wide variety of formats, from strictly formed relational databases to
your last tweet. All of this data, in all its different formats, can be divided into two main
categories: structured data and unstructured data.

What is Structured Data?

The term structured data refers to data that resides in a fixed field within a file or record.
Structured data is typically stored in a relational database (RDBMS). It can consist of numbers
and text, and sourcing can happen automatically or manually, as long as it's within an RDBMS
structure. It depends on the creation of a data model, defining what types of data to include and
how to store and process it.

The programming language used for structured data is SQL (Structured Query Language).
Developed by IBM in the 1970s, SQL handles relational databases. Typical examples of
structured data are names, addresses, credit card numbers, geolocation, and so on.

What is Unstructured Data?

Unstructured data is more or less all the data that is not structured. Even though unstructured
data may have a native, internal structure, it's not structured in a predefined way. There is no data
model; the data is stored in its native format.

Typical examples of unstructured data are rich media, text, social media activity,
surveillance imagery, and so on.

The amount of unstructured data is much larger than that of structured data. Unstructured data
makes up a whopping 80% or more of all enterprise data, and the percentage keeps growing.
This means that companies not taking unstructured data into account are missing out on a lot of
valuable business intelligence.

What is Semistructured Data?

Semistructured data is a third category that falls somewhere between the other two. It's a type of
structured data that does not fit into the formal structure of a relational database. But while not
matching the description of structured data entirely, it still employs tagging systems or other
markers, separating different elements and enabling search. Sometimes, this is referred to as data
with a self-describing structure.

A typical example of semistructured data is smartphone photos. Every photo taken with a
smartphone contains unstructured image content as well as the tagged time, location, and other
identifiable (and structured) information. Semi-structured data formats include JSON, CSV, and
XML file types.
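As a small illustration of self-describing, semistructured data, the sketch below parses a JSON
record in R. The jsonlite package is one commonly used CRAN package for this (run
install.packages("jsonlite") first if it is not installed); the record itself is made up for
illustration.

library(jsonlite)

# A small semistructured record: free content plus tagged, searchable fields
photo_json <- '{
  "file": "IMG_0042.jpg",
  "taken": "2023-06-01T10:15:00",
  "location": {"lat": 13.08, "lon": 80.27},
  "tags": ["beach", "sunset"]
}'

photo <- fromJSON(photo_json)
photo$taken          # the structured, tagged part of the record
photo$location$lat
photo$tags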

Structured vs Unstructured Data: 5 Key Differences

1. Defined vs Undefined Data

Structured data is clearly defined types of data in a structure, while unstructured data is usually
stored in its native format. Structured data lives in rows and columns and it can be mapped into
pre-defined fields. Unlike structured data, which is organized and easy to access in relational
databases, unstructured data does not have a predefined data model.

2. Qualitative vs Quantitative Data

Structured data is often quantitative data, meaning it usually consists of hard numbers or things
that can be counted. Methods for analysis include regression (to predict relationships between
variables); classification (to estimate probability); and clustering of data (based on different
attributes).

Unstructured data, on the other hand, is often categorized as qualitative data, and cannot be
processed and analyzed using conventional tools and methods. In a business context, qualitative
data can, for example, come from customer surveys, interviews, and social media interactions.
Extracting insights from qualitative data requires advanced analytics techniques like data mining
and data stacking.

3. Storage in Data Warehouses vs Data Lakes

Structured data is often stored in data warehouses, while unstructured data is stored in data lakes.
A data warehouse is the endpoint for the data’s journey through an ETL pipeline. A data lake, on
the other hand, is a sort of almost limitless repository where data is stored in its original format
or after undergoing a basic “cleaning” process.

Both have the potential for cloud-use. Structured data requires less storage space, while
unstructured data requires more. For example, even a tiny image takes up more space than many
pages of text.

4. Ease of Analysis

One of the most significant differences between structured and unstructured data is how well it
lends itself to analysis. Structured data is easy to search, both for humans and for algorithms.
Unstructured data, on the other hand, is intrinsically more difficult to search and requires
processing to become understandable. It's challenging to deconstruct since it lacks a predefined
data model and hence doesn't fit into relational databases.

While there are a wide array of sophisticated analytics tools for structured data, most analytics
tools for mining and arranging unstructured data are still in the developing phase. The lack of
predefined structure makes data mining tricky, and developing best practices on how to handle
data sources like rich media, blogs, social media data, and customer communication is a
challenge.

5. Predefined Format vs Variety of Formats

The most common formats for structured data are text and numbers. Structured data has been
defined beforehand in a data model.

Unstructured data, on the other hand, comes in a variety of shapes and sizes. It can consist of
everything from audio, video, and imagery to email and sensor data. There is no data model for
the unstructured data; it is stored natively or in a data lake that doesn't require any
transformation.
As for databases, structured data is usually stored in a relational database (RDBMS), while the
best fit for unstructured data instead is so-called non-relational, or NoSQL databases.

MACHINE GENERATED DATA

Machine-generated data is information automatically generated by a computer process,
application, or other mechanism without the active intervention of a human. While the term dates
back over fifty years, there is some current indecision as to the scope of the term. Monash
Research's Curt Monash defines it as "data that was produced entirely by machines OR data that
is more about observing humans than recording their choices." Meanwhile, Daniel Abadi, CS
Professor at Yale, proposes a narrower definition, "Machine-generated data is data that is
generated as a result of a decision of an independent computational agent or a measurement of an
event that is not caused by a human action." Regardless of definition differences, both exclude
data manually entered by a person. Machine-generated data crosses all industry sectors. Often
and increasingly, humans are unaware their actions are generating the data.

1. Web Server Log

A server log is a log file (or several files) automatically created and maintained by a server
consisting of a list of activities it performed.

A typical example is a web server log which maintains a history of page requests. The W3C
maintains a standard format (the Common Log Format) for web server log files, but other
proprietary formats exist. More recent entries are typically appended to the end of the file.
Information about the request, including client IP address, request date/time, page requested,
HTTP code, bytes served, user agent, and referrer are typically added. This data can be combined
into a single file, or separated into distinct logs, such as an access log, error log, or referrer log.
However, server logs typically do not collect user-specific information.

These files are usually not accessible to general Internet users, only to the webmaster or other
administrative person of an Internet service. A statistical analysis of the server log may be used
to examine traffic patterns by time of day, day of week, referrer, or user agent. Efficient web site
administration, adequate hosting resources and the fine tuning of sales efforts can be aided by
analysis of the web server logs.
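As a small illustration of how such a log can be analysed in R, the sketch below splits one
Common Log Format entry into its fields (the log line itself is made up for illustration).

# A hypothetical Common Log Format entry
log_line <- '127.0.0.1 - frank [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'

# Extract the individual fields with a regular expression
m <- regmatches(log_line,
                regexec('^(\\S+) \\S+ (\\S+) \\[([^]]+)\\] "([^"]+)" (\\d{3}) (\\d+)',
                        log_line))[[1]]

c(ip = m[2], user = m[3], timestamp = m[4],
  request = m[5], status = m[6], bytes = m[7])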
2. Call detail record

A call detail record (CDR) is a data record produced by a telephone exchange or other
telecommunications equipment that documents the details of a telephone call or other
telecommunications transaction (e.g., text message) that passes through that facility or device.
The record contains various attributes of the call, such as time, duration, completion status,
source number, and destination number. It is the automated equivalent of the paper toll tickets
that were written and timed by operators for long-distance calls in a manual telephone exchange.

3. Financial instrument trades

Financial instruments are monetary contracts between parties. They can be created, traded,
modified and settled. They can be cash (currency), evidence of an ownership interest in an entity
or a contractual right to receive or deliver in the form of currency (forex); debt (bonds, loans);
equity (shares); or derivatives (options, futures, forwards).

International Accounting Standards IAS 32 and 39 define a financial instrument as "any contract
that gives rise to a financial asset of one entity and a financial liability or equity instrument of
another entity".

Financial instruments may be categorized by "asset class" depending on whether they are equity-
based (reflecting ownership of the issuing entity) or debt-based (reflecting a loan the investor has
made to the issuing entity). If the instrument is debt it can be further categorized into short-term
(less than one year) or long-term. Foreign exchange instruments and transactions are neither
debt- nor equity-based and belong in their own category.

4. Network Event logging

Event logging provides system administrators with information useful for diagnostics and
auditing. The different classes of events that will be logged, as well as what details will appear in
the event messages, are often considered early in the development cycle. Many event logging
technologies allow or even require each class of event to be assigned a unique "code", which is
used by the event logging software or a separate viewer (e.g., Event Viewer) to format and
output a human-readable message. This facilitates localization and allows system administrators
to more easily obtain information on problems that occur.
Because event logging is used to log high-level information (often failure information),
performance of the logging implementation is often less important.

A special concern, preventing duplicate events from being recorded "too often", is addressed
through event throttling.

5. Security information and event management (SIEM) log

Security information and event management (SIEM) is a subsection within the field of computer
security, where software products and services combine security information management (SIM)
and security event management (SEM). They provide real-time analysis of security alerts
generated by applications and network hardware.

Vendors sell SIEM as software, as appliances, or as managed services; these products are also
used to log security data and generate reports for compliance purposes.

The term and the initialism SIEM were coined by Mark Nicolett and Amrit Williams of Gartner in
2005.

6. Telemetry

Telemetry is the in situ collection of measurements or other data at remote points and their
automatic transmission to receiving equipment (telecommunication) for monitoring. The word is
derived from the Greek roots tele, "remote", and metron, "measure". Systems that need external
instructions and data to operate require the counterpart of telemetry, telecommand.

R PROGRAMMING

R is a programming language and software environment for statistical analysis, graphics
representation and reporting. R was created by Ross Ihaka and Robert Gentleman at the
University of Auckland, New Zealand, and is currently developed by the R Development Core
Team. R is freely available under the GNU General Public License, and pre-compiled binary
versions are provided for various operating systems such as Linux, Windows and Mac. The
language was named R partly after the first letters of the first names of its two authors (Robert
Gentleman and Ross Ihaka), and partly as a play on the name of the Bell Labs language S.
The core of R is an interpreted computer language which allows branching and looping as well
as modular programming using functions. For efficiency, R allows integration with procedures
written in C, C++, .NET, Python or FORTRAN.

R is free software distributed under a GNU-style copyleft license, and it is an official part of the
GNU project, called GNU S.

Evolution of R

• R was initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics of
the University of Auckland in Auckland, New Zealand. R made its first appearance in 1993.
• A large group of individuals has contributed to R by sending code and bug reports.
• Since mid-1997 there has been a core group (the "R Core Team") who can modify the R source
code archive.

Features of R

The following are the important features of R −

• R is a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions, and input and output facilities.
• R has an effective data handling and storage facility.
• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
• R provides a large, coherent and integrated collection of tools for data analysis.
• R provides graphical facilities for data analysis and display, either directly on screen or as
printed output.

A small example of these language features follows.
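The sketch shows a user-defined recursive function with a conditional, plus vectorized arithmetic.

# A user-defined recursive function with a conditional
factorial_r <- function(n) {
  if (n <= 1) return(1)      # conditional
  n * factorial_r(n - 1)     # recursion
}

factorial_r(5)               # 120

# Operators work element-wise on whole vectors
v <- c(2, 4, 6, 8)
v * 10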

R Packages

R packages are a collection of R functions, compiled code and sample data. They are stored
under a directory called "library" in the R environment. By default, R installs a set of packages
during installation. More packages are added later, when they are needed for some specific
purpose. When we start the R console, only the default packages are available by default. Other
packages which are already installed have to be loaded explicitly to be used by the R program
that is going to use them.

All the packages available in R language are listed at R Packages.

Below is a list of commands to be used to check, verify and use the R packages.

Check Available R Packages

1. To get library locations containing R packages

.libPaths()

2. To get the list of all the packages installed

library()

3. To get all packages currently loaded in the R environment

search()

Packages in library 'C:/Program Files/R/R-3.2.2/library':

base : The R Base Package
boot : Bootstrap Functions (Originally by Angelo Canty for S)
class : Functions for Classification
cluster : "Finding Groups in Data": Cluster Analysis
codetools : Code Analysis Tools for R
compiler : The R Compiler Package
datasets : The R Datasets Package
foreign : Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
graphics : The R Graphics Package
grDevices : The R Graphics Devices and Support for Colours and Fonts
grid : The Grid Graphics Package
KernSmooth : Functions for Kernel Smoothing Supporting Wand & Jones (1995)
lattice : Trellis Graphics for R
MASS : Support Functions and Datasets for Venables and Ripley's MASS
Matrix : Sparse and Dense Matrix Classes and Methods
methods : Formal Methods and Classes
mgcv : Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation
nlme : Linear and Nonlinear Mixed Effects Models
nnet : Feed-Forward Neural Networks and Multinomial Log-Linear Models
parallel : Support for Parallel Computation in R
rpart : Recursive Partitioning and Regression Trees
spatial : Functions for Kriging and Point Pattern Analysis
splines : Regression Spline Functions and Classes
stats : The R Stats Package
stats4 : Statistical Functions using S4 Classes
survival : Survival Analysis
tcltk : Tcl/Tk Interface
tools : Tools for Package Development
utils : The R Utils Package

To Install package manually

Go to the link R Packages to download the package needed. Save the package as a .zip file in a
suitable location in the local system.

Now you can run the following command to install this package in the R environment.

install.packages(file_name_with_path, repos = NULL, type = "source")


# Install the package named "XML"

install.packages("E:/XML_3.98-1.3.zip", repos = NULL, type = "source")

To Load Package to Library

Before a package can be used in code, it must be loaded into the current R environment. A
package that was installed previously but is not available in the current environment also needs to
be loaded.

A package is loaded using the following command −

library("package Name", lib.loc = "path to library")

# Load the package named "XML"

library("XML")
