Unit 1 Notes - Data Analysis Using R
UNIT 1
Data Analysis is a process of inspecting, cleaning, transforming and modeling data with the goal
of discovering useful information, suggesting conclusions and supporting decision-making.
Several data analysis techniques exist, encompassing various domains such as business, science,
social science, etc., under a variety of names. The major data analysis approaches are −
Data Mining
Business Intelligence
Statistical Analysis
Predictive Analytics
Text Analytics
1. Data Mining
Data Mining is the analysis of large quantities of data to extract previously unknown, interesting
patterns, unusual records and dependencies in the data. Note that the goal is the extraction of
patterns and knowledge from large amounts of data and not the extraction of the data itself.
Data mining analysis involves computer science methods at the intersection of artificial
intelligence, machine learning, statistics, and database systems.
The patterns obtained from data mining can be considered as a summary of the input data that
can be used in further analysis or to obtain more accurate prediction results by a decision support
system.
2. Business Intelligence
Business Intelligence techniques and tools are for acquisition and transformation of large
amounts of unstructured business data to help identify, develop and create new strategic business
opportunities.
The goal of business intelligence is to allow easy interpretation of large volumes of data to
identify new opportunities. It helps in implementing an effective strategy based on insights that
can provide businesses with a competitive market-advantage and long-term stability.
3. Statistical Analysis
Statistical Analysis involves the collection, organization and interpretation of data to uncover underlying patterns. It includes −
Descriptive statistics − It summarizes the sample data quantitatively, for example with measures such as the mean and standard deviation.
Inferential statistics − It uses patterns in the sample data to draw inferences about the population represented, accounting for randomness. These inferences can take the form of answering yes/no questions (hypothesis testing), estimating numerical characteristics of the data (estimation), and modeling relationships within the data (e.g., regression analysis).
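As a small illustration of inferential statistics in R (a sketch using made-up sample values), a one-sample t-test draws an inference about a population mean from a sample:
# hypothetical sample of 10 measurements (made-up values)
sample_data <- c(5.1, 4.9, 5.4, 5.0, 5.2, 4.8, 5.3, 5.1, 5.0, 4.7)
# test whether the population mean could plausibly be 5
t.test(sample_data, mu = 5)
# the output reports a p-value and a confidence interval for the mean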
4. Predictive Analytics
Predictive Analytics uses statistical models to analyze current and historical data in order to make
forecasts (predictions) about future or otherwise unknown events. In business, predictive analytics is used
to identify risks and opportunities that aid in decision-making.
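A minimal sketch of predictive analytics in R, assuming made-up historical sales figures: a linear model is fitted to past data and then used to forecast a future value.
# hypothetical historical data: month number and sales (made-up values)
history <- data.frame(month = 1:6,
                      sales = c(120, 135, 150, 160, 178, 190))
# fit a simple linear regression model to the historical data
model <- lm(sales ~ month, data = history)
# predict sales for month 7 (an otherwise unknown future value)
predict(model, newdata = data.frame(month = 7))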
5. Text Analytics
Text Analytics, also referred to as Text Mining or Text Data Mining, is the process of deriving
high-quality information from text. Text mining usually involves structuring the
input text, deriving patterns within the structured data using means such as statistical pattern
learning, and finally evaluating and interpreting the output.
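A very small sketch of this idea in base R; the sentences are made-up examples, and counting word frequencies stands in for "deriving patterns within the structured data".
# hypothetical input text
docs <- c("data analysis with R", "text mining derives patterns from text data")
# structure the text: lower-case it and split it into individual words
words <- unlist(strsplit(tolower(docs), "\\s+"))
# derive a simple pattern: the frequency of each word
sort(table(words), decreasing = TRUE)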
Data Analysis was defined by the statistician John Tukey in 1961 as "Procedures for analyzing
data, techniques for interpreting the results of such procedures, ways of planning the gathering of
data to make its analysis easier, more precise or more accurate, and all the machinery and results
of (mathematical) statistics which apply to analyzing data."
Thus, data analysis is a process for obtaining large, unstructured data from various sources and
converting it into information that is useful for −
Answering questions
Testing hypotheses
Making decisions
Disproving theories
Data Analysis is a process of collecting, transforming, cleaning, and modeling data with the goal
of discovering the required information. The results so obtained are communicated, suggesting
conclusions, and supporting decision-making. Data visualization is at times used to portray the
data, making useful patterns in the data easier to discover. The terms Data Modeling and
Data Analysis are often used interchangeably.
Data Analysis Process consists of the following phases, which are iterative in nature −
1. Data Requirements
The data required for analysis is based on a question or an experiment. Based on the
requirements of those directing the analysis, the data necessary as inputs to the analysis is
identified (e.g., Population of people). Specific variables regarding a population (e.g., Age and
Income) may be specified and obtained. Data may be numerical or categorical.
2. Data Collection
Data Collection is the process of gathering information on targeted variables identified as data
requirements. The emphasis is on ensuring accurate and honest collection of data. Data
Collection ensures that data gathered is accurate such that the related decisions are valid. Data
Collection provides both a baseline to measure and a target to improve.
Data is collected from various sources ranging from organizational databases to the information
in web pages. The data thus obtained, may not be structured and may contain irrelevant
information. Hence, the collected data is required to be subjected to Data Processing and Data
Cleaning.
3. Data Processing
The data that is collected must be processed or organized for analysis. This includes structuring
the data as required for the relevant Analysis Tools. For example, the data might have to be
placed into rows and columns in a table within a Spreadsheet or Statistical Application. A Data
Model might have to be created.
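For example, in R the collected values can be organized into rows and columns using a data frame; the variable names below (age, income) are only illustrative.
# hypothetical collected values placed into a rows-and-columns structure
people <- data.frame(age    = c(23, 35, 41, 29),
                     income = c(28000, 52000, 61000, 39000))
str(people)   # inspect the structure of the organized data
head(people)  # view the first rows, as in a spreadsheet table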
4. Data Cleaning
The processed and organized data may be incomplete, contain duplicates, or contain errors. Data
Cleaning is the process of preventing and correcting these errors. There are several types of Data
Cleaning that depend on the type of data. For example, while cleaning the financial data, certain
totals might be compared against reliable published numbers or defined thresholds. Likewise,
quantitative data methods can be used to detect outliers, which would subsequently be excluded
from the analysis.
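A brief sketch of such cleaning steps in R, using a hypothetical data frame df with an amount column; the threshold rule is only illustrative.
# hypothetical data with a duplicate row, a missing value and an extreme value
df <- data.frame(id = c(1, 2, 2, 3, 4),
                 amount = c(100, 250, 250, NA, 99999))
df <- df[!duplicated(df), ]      # remove duplicate rows
df <- df[!is.na(df$amount), ]    # drop rows with missing amounts
# flag values above a defined threshold as outliers and exclude them
df <- df[df$amount < 10000, ]
df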
5. Data Analysis
Data that is processed, organized and cleaned would be ready for the analysis. Various data
analysis techniques are available to understand, interpret, and derive conclusions based on the
requirements. Data Visualization may also be used to examine the data in graphical format, to
obtain additional insight regarding the messages within the data.
Statistical data models such as correlation and regression analysis can be used to identify the
relationships among the data variables. These models, which are descriptive of the data, are helpful in
simplifying the analysis and communicating the results.
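As a sketch of these statistical data models in R (hypothetical x and y values), correlation and regression can be computed with cor() and lm():
# hypothetical paired observations of two variables
x <- c(2, 4, 6, 8, 10)
y <- c(3, 7, 8, 12, 15)
cor(x, y)            # correlation between the two variables
summary(lm(y ~ x))   # regression model describing their relationship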
The process might require additional Data Cleaning or additional Data Collection, and hence
these activities are iterative in nature.
6. Communication
The results of the data analysis are to be reported in a format as required by the users to support
their decisions and further action. The feedback from the users might result in additional
analysis.
The data analysts can choose data visualization techniques, such as tables and charts, which help
in communicating the message clearly and efficiently to the users. Analysis tools provide
facilities to highlight the required information with colour codes and formatting in tables and charts.
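For instance, a base R chart can be used to communicate results to users; the category labels and counts below are made up.
# hypothetical summary to be communicated to users
counts <- c(Completed = 45, Pending = 30, Failed = 10)
barplot(counts,
        col  = c("darkgreen", "orange", "red"),  # colour coding the message
        main = "Order status")                   # chart title for the reader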
DIFFERENT FORMS OF DATA
Data is a set of values of subjects with respect to qualitative or quantitative variables. Data is
raw, unorganized facts that need to be processed. Data can be something simple and seemingly
random and useless until it is organized. When data is processed, organized, structured or
presented in a given context so as to make it useful, it is called information. Information
necessary for research activities is obtained in the following forms −
Primary data
Secondary data
Cross-sectional data
Categorical data
Time series data
Spatial data
Ordered data
1. Primary Data
Primary data is original and unique data, which is directly collected by the researcher
from a source according to the researcher's requirements.
It is the data collected by the investigator himself or herself for a specific purpose.
Data gathered by finding out first-hand the attitudes of a community towards health
services, ascertaining the health needs of a community, evaluating a social program,
determining the job satisfaction of the employees of an organization, and ascertaining the
quality of service provided by a worker are the examples of primary data.
2. Secondary Data
Secondary data refers to the data which has already been collected for a certain purpose and
documented somewhere else.
Data collected by someone else for some other purpose (but being utilized by the
investigator for another purpose) is secondary data.
Gathering information with the use of census data to obtain information on the age-sex
structure of a population, the use of hospital records to find out the morbidity and mortality
patterns of a community, the use of an organization’s records to ascertain its activities, and
the collection of data from sources such as articles, journals, magazines, books and
periodicals to obtain historical and other types of information, are examples of secondary
data.
4. Categorical Data
Categorical variables represent types of data which may be divided into groups. Examples
of categorical variables are race, sex, age group, and educational level.
Data that cannot be measured numerically is called categorical data.
Categorical data is qualitative in nature.
Categorical data is also known as attributes.
A data set consisting of observations on a single characteristic is a univariate data set. A
univariate data set is categorical if the individual observations are categorical responses.
Example of categorical data: Intelligence, Beauty, Literacy, Unemployment
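In R, categorical data is usually represented as a factor; the education levels below are illustrative.
# hypothetical categorical responses for a single characteristic
education <- c("primary", "secondary", "graduate", "secondary", "graduate")
edu_factor <- factor(education,
                     levels = c("primary", "secondary", "graduate"))
table(edu_factor)   # frequency of each category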
5. Time Series Data
Time series data occurs wherever the same measurements are recorded on a regular basis.
It consists of quantities that represent or trace the values taken by a variable over a period such as a
month, quarter, or year.
The values of different phenomena such as temperature, weight, population, etc. can be
recorded over different periods of time.
The values of the variable may be increasing, decreasing, or constant over time.
Data arranged according to time periods is called time-series data, e.g. population in different
time periods.
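A small sketch of time series data in R using the built-in ts() function; the monthly values are made up.
# hypothetical monthly values recorded on a regular basis
values <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118)
monthly <- ts(values, start = c(2023, 1), frequency = 12)  # monthly series
monthly
plot(monthly)   # trace the values taken by the variable over the year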
6. Spatial Data
Also known as geospatial data or geographic information, it is data or information that
identifies the geographic location of features and boundaries on Earth, such as natural or
constructed features, oceans, and more.
Spatial data is usually stored as coordinates and topology and is data that can be mapped.
Spatial data is used in geographical information systems (GIS) and other geolocation or
positioning services.
Spatial data consists of points, lines, polygons and other geographic and geometric data
primitives, which can be mapped by location, stored with an object as metadata or used by a
communication system to locate end-user devices.
Spatial data may be classified as scalar or vector data. Each provides distinct information
pertaining to geographical or spatial locations.
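A minimal sketch of point-type spatial data in base R (dedicated GIS packages such as sf are normally used instead); the coordinates and place names are made up.
# hypothetical point features stored as coordinates
places <- data.frame(name = c("A", "B", "C"),
                     lon  = c(174.76, 174.78, 174.74),
                     lat  = c(-36.85, -36.84, -36.87))
plot(places$lon, places$lat, pch = 19,
     xlab = "Longitude", ylab = "Latitude")   # map the points by location
text(places$lon, places$lat, labels = places$name, pos = 3)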
7. Ordered Data
STRUCTURED, SEMI-STRUCTURED AND UNSTRUCTURED DATA
The term structured data refers to data that resides in a fixed field within a file or record.
Structured data is typically stored in a relational database (RDBMS). It can consist of numbers
and text, and sourcing can happen automatically or manually, as long as it's within an RDBMS
structure. It depends on the creation of a data model, defining what types of data to include and
how to store and process it.
The programming language used for structured data is SQL (Structured Query Language).
Developed by IBM in the 1970s, SQL handles relational databases. Typical examples of
structured data are names, addresses, credit card numbers, geolocation, and so on.
Unstructured data is more or less all the data that is not structured. Even though unstructured
data may have a native, internal structure, it's not structured in a predefined way. There is no data
model; the data is stored in its native format.
Typical examples of unstructured data are rich media, text, social media activity,
surveillance imagery, and so on.
The amount of unstructured data is much larger than that of structured data. Unstructured data
makes up a whopping 80% or more of all enterprise data, and the percentage keeps growing.
This means that companies not taking unstructured data into account are missing out on a lot of
valuable business intelligence.
Semistructured data is a third category that falls somewhere between the other two. It's a type of
structured data that does not fit into the formal structure of a relational database. But while not
matching the description of structured data entirely, it still employs tagging systems or other
markers, separating different elements and enabling search. Sometimes, this is referred to as data
with a self-describing structure.
A typical example of semistructured data is smartphone photos. Every photo taken with a
smartphone contains unstructured image content as well as the tagged time, location, and other
identifiable (and structured) information. Semi-structured data formats include JSON, CSV, and
XML file types.
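As an illustration, R can read such formats; read.csv() is part of base R, while reading JSON is sketched here with the jsonlite package (assumed to be installed), and the file name and JSON content are made-up examples.
# structured/semi-structured text read into a data frame (hypothetical file)
# records <- read.csv("records.csv")
# semi-structured JSON with self-describing tags (made-up content)
json_text <- '{"device": "phone-01", "time": "2024-05-01T10:15:00", "lat": -36.85}'
# jsonlite::fromJSON() parses the tagged elements into named R objects
photo_meta <- jsonlite::fromJSON(json_text)
photo_meta$device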
Structured data is clearly defined types of data in a structure, while unstructured data is usually
stored in its native format. Structured data lives in rows and columns and it can be mapped into
pre-defined fields. Unlike structured data, which is organized and easy to access in relational
databases, unstructured data does not have a predefined data model.
Structured data is often quantitative data, meaning it usually consists of hard numbers or things
that can be counted. Methods for analysis include regression (to predict relationships between
variables); classification (to estimate probability); and clustering of data (based on different
attributes).
Unstructured data, on the other hand, is often categorized as qualitative data, and cannot be
processed and analyzed using conventional tools and methods. In a business context, qualitative
data can, for example, come from customer surveys, interviews, and social media interactions.
Extracting insights from qualitative data requires advanced analytics techniques like data mining
and data stacking.
Structured data is often stored in data warehouses, while unstructured data is stored in data lakes.
A data warehouse is the endpoint for the data’s journey through an ETL pipeline. A data lake, on
the other hand, is a sort of almost limitless repository where data is stored in its original format
or after undergoing a basic “cleaning” process.
Both can be stored in the cloud. Structured data requires less storage space, while
unstructured data requires more; for example, even a tiny image takes up more space than many
pages of text.
4. Ease of Analysis
One of the most significant differences between structured and unstructured data is how well it
lends itself to analysis. Structured data is easy to search, both for humans and for algorithms.
Unstructured data, on the other hand, is intrinsically more difficult to search and requires
processing to become understandable. It's challenging to deconstruct since it lacks a predefined
data model and hence doesn't fit into relational databases.
While there are a wide array of sophisticated analytics tools for structured data, most analytics
tools for mining and arranging unstructured data are still in the developing phase. The lack of
predefined structure makes data mining tricky, and developing best practices on how to handle
data sources like rich media, blogs, social media data, and customer communication is a
challenge.
The most common format for structured data is text and numbers. Structured data has been
defined beforehand in a data model.
Unstructured data, on the other hand, comes in a variety of shapes and sizes. It can consist of
everything from audio, video, and imagery to email and sensor data. There is no data model for
the unstructured data; it is stored natively or in a data lake that doesn't require any
transformation.
As for databases, structured data is usually stored in a relational database (RDBMS), while the
best fit for unstructured data instead is so-called non-relational, or NoSQL databases.
1. Server log
A server log is a log file (or several files) automatically created and maintained by a server
consisting of a list of activities it performed.
A typical example is a web server log which maintains a history of page requests. The W3C
maintains a standard format (the Common Log Format) for web server log files, but other
proprietary formats exist. More recent entries are typically appended to the end of the file.
Information about the request, including client IP address, request date/time, page requested,
HTTP code, bytes served, user agent, and referrer are typically added. This data can be combined
into a single file, or separated into distinct logs, such as an access log, error log, or referrer log.
However, server logs typically do not collect user-specific information.
These files are usually not accessible to general Internet users, only to the webmaster or other
administrative person of an Internet service. A statistical analysis of the server log may be used
to examine traffic patterns by time of day, day of week, referrer, or user agent. Efficient web site
administration, adequate hosting resources and the fine tuning of sales efforts can be aided by
analysis of the web server logs.
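A rough sketch of examining such a log in R; the log line below is a made-up example in the Common Log Format, and the regular expression is only illustrative.
# one made-up access log entry in the Common Log Format
log_line <- '203.0.113.7 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
fields <- regmatches(log_line,
                     regexec('^(\\S+) \\S+ \\S+ \\[([^]]+)\\] "([^"]+)" (\\d{3}) (\\d+)',
                             log_line))[[1]]
fields[2]   # client IP address
fields[3]   # request date/time
fields[4]   # page requested (request line)
fields[5]   # HTTP status code
fields[6]   # bytes served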
2. Call detail record
A call detail record (CDR) is a data record produced by a telephone exchange or other
telecommunications equipment that documents the details of a telephone call or other
telecommunications transaction (e.g., text message) that passes through that facility or device.
The record contains various attributes of the call, such as time, duration, completion status,
source number, and destination number. It is the automated equivalent of the paper toll tickets
that were written and timed by operators for long-distance calls in a manual telephone exchange.
3. Financial instruments
Financial instruments are monetary contracts between parties. They can be created, traded,
modified and settled. They can be cash (currency), evidence of an ownership interest in an entity
or a contractual right to receive or deliver in the form of currency (forex); debt (bonds, loans);
equity (shares); or derivatives (options, futures, forwards).
International Accounting Standards IAS 32 and 39 define a financial instrument as "any contract
that gives rise to a financial asset of one entity and a financial liability or equity instrument of
another entity".
Financial instruments may be categorized by "asset class" depending on whether they are equity-
based (reflecting ownership of the issuing entity) or debt-based (reflecting a loan the investor has
made to the issuing entity). If the instrument is debt it can be further categorized into short-term
(less than one year) or long-term. Foreign exchange instruments and transactions are neither
debt- nor equity-based and belong in their own category.
4. Event logging
Event logging provides system administrators with information useful for diagnostics and
auditing. The different classes of events that will be logged, as well as what details will appear in
the event messages, are often considered early in the development cycle. Many event logging
technologies allow or even require each class of event to be assigned a unique "code", which is
used by the event logging software or a separate viewer (e.g., Event Viewer) to format and
output a human-readable message. This facilitates localization and allows system administrators
to more easily obtain information on problems that occur.
Because event logging is used to log high-level information (often failure information),
performance of the logging implementation is often less important.
A special concern, preventing duplicate events from being recorded "too often" is taken care of
through event throttling.
5. Security information and event management (SIEM)
Security information and event management (SIEM) is a subsection within the field of computer
security, where software products and services combine security information management (SIM)
and security event management (SEM). They provide real-time analysis of security alerts
generated by applications and network hardware.
Vendors sell SIEM as software, as appliances, or as managed services; these products are also
used to log security data and generate reports for compliance purposes.
The term and the initialism SIEM were coined by Mark Nicolett and Amrit Williams of Gartner in
2005.
6. Telemetry
Telemetry is the in situ collection of measurements or other data at remote points and their
automatic transmission to receiving equipment (telecommunication) for monitoring. The word is
derived from the Greek roots tele, "remote", and metron, "measure". Systems that need external
instructions and data to operate require the counterpart of telemetry, telecommand.
R PROGRAMMING
R is a programming language and software environment for statistical computing, graphics
representation and reporting. R is freely available under the GNU General Public License, and
pre-compiled binary versions are provided for various operating systems such as Linux, Windows and macOS.
R is free software distributed under a GNU-style copyleft, and an official part of the GNU
project called GNU S.
Evolution of R
R was initially written by Ross Ihaka and Robert Gentleman at the Department of
Statistics of the University of Auckland in Auckland, New Zealand. R made its first
appearance in 1993.
A large group of individuals has contributed to R by sending code and bug reports.
Since mid-1997 there has been a core group (the "R Core Team") who can modify the R
source code archive.
Features of R
R is a well-developed, simple and effective programming language which includes
conditionals, loops, user-defined functions and input/output facilities.
R has an effective data handling and storage facility.
R provides a suite of operators for calculations on arrays, lists, vectors and matrices.
R provides a large, coherent and integrated collection of tools for data analysis.
R provides graphical facilities for data analysis and display.
R Packages
R packages are a collection of R functions, compiled code and sample data. They are stored
under a directory called "library" in the R environment. By default, R installs a set of packages
during installation. More packages are added later, when they are needed for some specific
purpose. When we start the R console, only the default packages are available. Other
packages which are already installed have to be loaded explicitly before they can be used by the
R program.
Below is a list of commands used to check, verify and use the R packages.
.libPaths()   # get the library locations containing R packages
library()     # list all the packages installed
search()      # list the packages currently loaded
The list of installed packages includes entries such as −
Foreign: Read Data Stored by 'Minitab', 'S', 'SAS', 'SPSS', 'Stata', 'Systat', 'Weka', 'dBase', ...
grDevices: The R Graphics Devices and Support for Colours and Fonts
MASS: Support Functions and Datasets for Venables and Ripley's MASS
Go to the link R Packages to download the package needed. Save the package as a .zip file in a
suitable location in the local system.
Now you can run the following command to install this package in the R environment.
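The exact command is not shown in these notes; a common sketch uses install.packages(), either directly from CRAN or pointing at the downloaded file (the package name and path below are placeholders).
# install a package directly from CRAN; "XML" is only an example name
install.packages("XML")
# or install a package file saved earlier in the local system
# (placeholder path; type = "win.binary" or "source" may need to be given)
install.packages("path/to/package_file.zip", repos = NULL, type = "win.binary")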
Before a package can be used in the code, it must be loaded to the current R environment. You
also need to load a package that is already installed previously but not available in the current
environment.
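A sketch of loading an installed package into the current session; "XML" is again just a placeholder name.
# load an installed package so that its functions become available
library(XML)
search()   # the package now appears in the list of loaded packages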