


Unit-II: Introduction of Data Science, Basic Data Analytics using R

R Graphical User Interfaces, Data Import and Export, Attribute and Data Types, Descriptive Statistics, Exploratory Data Analysis, Visualization Before Analysis, Dirty Data, Visualizing a Single Variable, Examining Multiple Variables, Data Exploration Versus Presentation.

1. R Graphical User Interfaces


R software uses a command-line interface (CLI).
For Windows installations, R comes with RGui.exe, which provides a basic graphical
user interface (GUI).
To improve the ease of writing, executing, and debugging R code, an additional GUI,
RStudio, is widely used.
The following figure provides a screenshot of the previous R code example executed in
RStudio.

The four highlighted window panes are as follows:

● Scripts: Serves as an area to write and save R code
● Workspace: Lists the datasets and variables in the R environment
● Plots: Displays the plots generated by the R code and provides a straightforward mechanism to export the plots
● Console: Provides a history of the executed R code and the output
The console pane can also be used to obtain help information on R.
Ex: help(lm)
Functions such as edit() and fix() allow the user to update the contents of an R variable.
R allows one to save the workspace environment, including variables and loaded libraries,
into an .RData file using the save.image() function.
An existing .RData file can be loaded using the load() function.
Tools such as RStudio prompt the user about whether to save the workspace contents
prior to exiting the GUI.
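A minimal sketch of saving and restoring a workspace; the file name mysession.RData is an arbitrary example:

```r
x <- 5
save.image("mysession.RData")  # save all workspace variables to a file
rm(x)                          # remove x from the workspace
load("mysession.RData")        # restore the saved variables, including x
x                              # x is available again
```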

2. Data Import and Export


The dataset is usually imported into R using the read.csv() function as in the following
code.

R uses a forward slash (/) as the separator character in the directory and file paths.
To simplify the import of multiple files with long path names, the setwd() function can be
used to set the working directory for the subsequent import and export operations, as
shown in the following R code.
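A sketch of the import; a temporary directory and a small generated file are used here so the example is self-contained (the file name yearly_sales.csv is an example):

```r
setwd(tempdir())                                 # set the working directory
writeLines("id,total\n1,99.5\n2,150.0", "yearly_sales.csv")
sales <- read.csv("yearly_sales.csv")            # header row and comma separator by default
sales$total                                      # access the imported column
```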

The main difference between these import functions (read.table(), read.csv(), read.csv2(),
read.delim(), and read.delim2()) is their default values.
For example, the read.delim() function expects the column separator to be a tab ("\t").
The expected defaults for headers, column separators, and decimal point notations are:

Function        Headers   Separator   Decimal Point
read.table()    FALSE     ""          "."
read.csv()      TRUE      ","         "."
read.csv2()     TRUE      ";"         ","
read.delim()    TRUE      "\t"        "."
read.delim2()   TRUE      "\t"        ","
R functions such as write.table(), write.csv(), and write.csv2() enable exporting of R
datasets to an external file.
Ex:
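A hedged sketch of exporting; the data frame contents and output file name are example values:

```r
sales <- data.frame(cust_id = 1:3, sales_total = c(800.64, 217.53, 74.58))
out_file <- file.path(tempdir(), "sales_out.csv")   # example output path
write.csv(sales, file = out_file, row.names = FALSE)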

Sometimes it is necessary to read data from a database management system (DBMS). R


packages such as DBI and RODBC are available for this purpose. These packages
provide database interfaces for communication between R and DBMSs such as MySQL,
Oracle, SQL Server, PostgreSQL, and Pivotal Greenplum.
The following R code demonstrates how to install the RODBC package with the
install.packages() function.
The library() function loads the package into the R workspace.
Finally, a connector (conn) is initialized for connecting to a Pivotal Greenplum database
named training2 via Open Database Connectivity (ODBC), using a database user account.
The training2 database must be defined either in the /etc/ODBC.ini configuration file or
using the Administrative Tools under the Windows Control Panel.
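A sketch of the connection steps; "training2" is the data source name from the text, while the credentials and the query are placeholders:

```r
install.packages("RODBC")     # install the package (once)
library(RODBC)                # load the package into the R workspace
# connect via ODBC; uid and pwd are placeholder credentials
conn <- odbcConnect("training2", uid = "user", pwd = "password")
# the table name housing is an assumption for illustration
housing_data <- sqlQuery(conn, "SELECT * FROM housing")
odbcClose(conn)               # release the connection
```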
Although plots can be saved using the RStudio GUI, plots can also be saved using R code
by specifying the appropriate graphic devices.
Using the jpeg() function, the following R code creates a new JPEG file, adds a histogram
plot to the file, and then closes the file. Such techniques are useful when automating
standard reports.
Other functions, such as png(), bmp(), pdf(), and postscript(), are available in R to save
plots in the desired format.
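A sketch of saving a plot with a graphics device; the file name and the plotted data are example values:

```r
jpeg(file = "sales_hist.jpeg")  # create a new JPEG file
hist(rnorm(100))                # add a histogram plot to the file
dev.off()                       # close the file
```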

3. Attribute and Data Types

3.1. Attributes:
In general, characteristics or attributes provide the qualitative and quantitative measures
for each item or subject of interest.
Attributes can be categorized into four types: nominal, ordinal, interval, and ratio (NOIR).
Nominal and ordinal attributes are considered categorical attributes, whereas interval and
ratio attributes are considered numeric attributes.
Data of one attribute type may be converted to another.
For example, the quality of diamonds {Fair, Good, Very Good, Premium, Ideal} is
considered ordinal but can be converted to nominal {Good, Excellent} with a defined
mapping.
Similarly, a ratio attribute like Age can be converted into an ordinal attribute such as
{Infant, Adolescent, Adult, Senior}.
Understanding the attribute types in a given dataset is important to ensure that the
appropriate descriptive statistics and analytic methods are applied and properly
interpreted.
For example, the mean and standard deviation of U.S. postal ZIP codes are not very
meaningful or appropriate.
3.2. Numeric, Character, and Logical Data Types
R supports four basic data types: numeric, character, date, and logical (Boolean).
Examples of variable assignments for these data types follow.
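A sketch of assigning each data type; the variable names and values are example choices:

```r
i <- 102                  # numeric
sport <- "football"       # character
flag <- TRUE              # logical
today <- Sys.Date()       # date
class(sport)              # "character": the abstract class of the object
typeof(i)                 # "double": numerics are stored as doubles in memory
```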

R provides functions, such as class() and typeof(), to examine the characteristics of a


given variable.
The class() function represents the abstract class of an object.
The typeof() function determines the way an object is stored in memory.

Additional R functions exist that can test the variables and coerce a variable into a
specific type.
Suppose the variable i is numeric.
The is.integer() function tests whether i is an integer, and the as.integer() function
coerces i into a new integer variable (say, j).


The application of the length() function reveals that the created variables each have a
length of 1.
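A sketch of testing and coercion; the value 102.7 is an example:

```r
i <- 102.7
is.integer(i)       # FALSE: i is stored as a double
j <- as.integer(i)  # coercion truncates to the integer 102
is.integer(j)       # TRUE
length(j)           # 1: a scalar is a vector of length 1
```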

3.3. Vectors

The following R code illustrates how a vector can be created using the combine function,
c() or the colon operator, :, to build a vector from the sequence of integers from 1 to 5.
Furthermore, the code shows how the values of an existing vector can be easily modified
or accessed.
The code related to the z vector shows how logical comparisons can be used to extract
certain elements of a given vector.
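A sketch of these operations; the vector contents are example values:

```r
u <- c("red", "yellow", "blue")  # combine function builds a vector
v <- 1:5                         # colon operator: integers 1 through 5
v[2] <- 10                       # modify the second element
z <- c(3, 8, 1, 9)
z[z > 5]                         # logical comparison extracts 8 and 9
```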

Sometimes it is necessary to initialize a vector of a specific length and then populate the
content of the vector later.

The vector() function, by default, creates a logical vector.


A vector of a different type can be specified by using the mode parameter.
The vector c, an integer vector of length 0, may be useful when the number of elements is
not initially known and new elements can later be added to the end of the vector as the
values become available.
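A sketch of initializing vectors; the lengths and appended value are example choices:

```r
a <- vector(length = 3)            # logical vector by default: FALSE FALSE FALSE
b <- vector(mode = "numeric", 3)   # numeric vector: 0 0 0
c <- vector(mode = "integer", 0)   # integer vector of length 0
c <- c(c, 7L)                      # append an element when it becomes available
length(c)                          # 1
```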

Although vectors may appear to be similar to arrays of one dimension, they are
technically dimensionless.

3.4. Arrays and Matrices


The array() function can be used to restructure a vector as an array.

A two-dimensional array is known as a matrix.
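A sketch of restructuring a vector as an array and a matrix; the dimensions are example values:

```r
my_array <- array(1:24, dim = c(3, 4, 2))  # a 3 x 4 x 2 array from a vector
my_array[2, 3, 1]                          # 8: arrays fill column by column
M <- matrix(1:6, nrow = 2, ncol = 3)       # a two-dimensional array is a matrix
dim(M)                                     # 2 3
```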


R provides the standard matrix operations such as addition, subtraction, and
multiplication, as well as the transpose function t() and the inverse matrix function
matrix.inverse() included in the matrixcalc package.
The following R code builds a 3 × 3 matrix, M, and multiplies it by its inverse to obtain
the identity matrix.
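A sketch of the identity check; base R's solve() is used here for the inverse in place of matrixcalc's matrix.inverse(), and the matrix entries are example values:

```r
M <- matrix(c(1, 3, 3, 5, 0, 4, 3, 3, 3), nrow = 3, ncol = 3)
M %*% solve(M)   # a matrix multiplied by its inverse yields the 3 x 3 identity
t(M)             # the transpose function
```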

3.5. Data Frames


Data frames provide a structure for storing and accessing several variables of possibly
different data types.

The variables stored in the data frame can be easily accessed using the $ notation.

The str() function provides the structure of the sales data frame.


This function identifies the integer and numeric (double) data types, the factor variables
and levels, as well as the first few values for each variable.
A subset of the data frame can be retrieved through subsetting operators.
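A sketch of these data frame operations; the column names echo the sales example from the text, but the values are made up:

```r
sales <- data.frame(cust_id = 100:103,
                    sales_total = c(800.64, 217.53, 74.58, 498.60),
                    gender = factor(c("F", "M", "M", "F")))
sales$sales_total                  # $ notation accesses one variable
str(sales)                         # data types, factor levels, first values
sales[sales$sales_total > 300, ]   # subsetting retrieves matching rows
```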

3.6. Lists
Lists can contain any type of objects, including other lists.

When displaying the contents of assortment, R uses double brackets, [[ ]].

A single set of brackets accesses an item as a sublist, not its content.
The str() function offers details about the structure of a list.
str(assortment)
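A sketch of a list containing mixed types, including another list; the contents are example values:

```r
housing <- list("own", "rent")                      # a list nested inside a list
assortment <- list("football", 7.5, housing, TRUE)
assortment[[4]]   # double brackets return the content itself: TRUE
assortment[4]     # single brackets return a one-item sublist
str(assortment)   # structure of the list
```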
3.7. Factors
Factors can be ordered or unordered. In the case of a gender attribute, the levels are not
ordered; gender could assume one of two levels: F or M.

Included with the ggplot2 package, the diamonds data frame contains three ordered
factors.
Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very
Good, Premium, and Ideal.
library(ggplot2)

data(diamonds) # load the data frame into the R workspace
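A sketch of unordered versus ordered factors; the gender values are made up, and cut_quality mirrors the five levels of diamonds$cut:

```r
gender <- factor(c("F", "M", "M", "F"))   # unordered factor with levels F, M
cut_quality <- factor(c("Good", "Fair", "Ideal"),
                      levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                      ordered = TRUE)     # ordered factor, as in diamonds$cut
cut_quality[1] > cut_quality[2]           # TRUE: Good is a better cut than Fair
```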

Suppose it is decided to categorize sales$sales_total into three groups: small, medium,
and big.
The cbind() function is used to combine variables column-wise, and the rbind() function
is used to combine datasets row-wise.
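A sketch of the grouping and combining steps; cut() is one way to perform the binning, and the breakpoints and values are assumptions:

```r
sales_total <- c(80, 450, 1200, 300)
group <- cut(sales_total, breaks = c(0, 250, 800, Inf),
             labels = c("small", "medium", "big"))     # bin into three groups
sales <- cbind(data.frame(sales_total), group)          # combine column-wise
sales2 <- rbind(sales,
                data.frame(sales_total = 55, group = "small"))  # combine row-wise
```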
3.8. Contingency Tables
In R, table refers to a class of objects used to store the observed counts across the factors
for a given dataset. Such a table is commonly referred to as a contingency table and is the
basis for performing a statistical test on the independence of the factors used to build the
table.
The following R code builds a contingency table based on the sales$gender and
sales$spender factors.

The summary() function performs a chi-squared test on the independence of the two
factors.
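A sketch of building and testing a contingency table; the factors stand in for sales$gender and sales$spender with made-up values:

```r
gender  <- factor(c("F", "M", "F", "M", "F", "M"))
spender <- factor(c("yes", "no", "yes", "yes", "no", "no"))
sales_table <- table(gender, spender)   # observed counts across the factors
sales_table
summary(sales_table)                    # chi-squared test of independence
```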
3.9. Descriptive Statistics
The summary() function provides several descriptive statistics, such as the mean and
median, about a variable such as the sales data frame.

Some common R functions that provide descriptive statistics include mean(), median(),
min(), max(), sd(), var(), quantile(), range(), and IQR().
The IQR() function provides the difference between the third and the first quartiles
(the interquartile range).
The function apply() is useful when the same function is to be applied to several variables
in a data frame.
For example, the following R code calculates the standard deviation for the first three
variables in sales.
In the code, setting MARGIN=2 specifies that the sd() function is applied over the
columns.
Other functions, such as lapply() and sapply(), apply a function to a list or vector.

Additional descriptive statistics can be applied with user-defined functions.
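A sketch of these descriptive statistics; the data frame contents are example values, and my_range is a hypothetical user-defined function:

```r
sales <- data.frame(cust_id = 1:4,
                    sales_total = c(800.64, 217.53, 74.58, 498.60),
                    num_of_orders = c(3, 2, 1, 3))
summary(sales$sales_total)                  # mean, median, quartiles
IQR(sales$sales_total)                      # third quartile minus first quartile
apply(sales[, 1:3], MARGIN = 2, FUN = sd)   # MARGIN=2 applies sd over columns
my_range <- function(v) max(v) - min(v)     # user-defined descriptive statistic
my_range(sales$sales_total)
```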


4. Exploratory Data Analysis
Exploratory data analysis is a data analysis approach to reveal the important
characteristics of a dataset, mainly through visualization.
Functions such as summary() can help analysts easily get an idea of the magnitude and
range of the data, but other aspects such as linear relationships and distributions are more
difficult to see from descriptive statistics.
A useful way to detect patterns and anomalies in the data is through exploratory data
analysis with visualization.
Visualization gives a succinct, holistic view of the data that may be difficult to grasp from
the numbers and summaries alone.
For example, the variables x and y of a data frame data can be visualized in a scatterplot,
which easily depicts the relationship between two variables.
An important facet of the initial data exploration, visualization assesses data cleanliness
and suggests potentially important relationships in the data prior to the model planning
and building phases.
4.1. Visualization Before Analysis
Consider the four datasets of Anscombe's quartet.
These four datasets have nearly identical statistical properties, such as mean and variance.
Based on these nearly identical statistics, one might conclude that the four datasets are
quite similar. However, the scatterplots tell a different story.
Each dataset is plotted as a scatterplot, and the fitted lines are the result of applying linear
regression models.
The estimated regression line fits Dataset 1 reasonably well. Dataset 2 is definitely
nonlinear. Dataset 3 exhibits a linear trend, with one apparent outlier at x = 13. For
Dataset 4, the regression line fits the dataset quite well; however, with only points at two
x values, it is not possible to determine whether the linearity assumption is proper.
The R code for generating the above scatter plots is shown below:
It requires the R package ggplot2, which can be installed simply by running the command
install.packages("ggplot2").
The Anscombe dataset for the plot is included in the standard R distribution. Enter data()
for a list of datasets included in the R base distribution. Enter data(DatasetName) to make
a dataset available in the current workspace.
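The text's version of the plot uses ggplot2; a base-R sketch of the same idea, using the first Anscombe dataset, is:

```r
data(anscombe)                          # Anscombe's quartet, in base R
sapply(anscombe[, c("y1", "y2", "y3", "y4")], var)  # near-identical variances
plot(anscombe$x1, anscombe$y1)          # scatterplot of Dataset 1
abline(lm(y1 ~ x1, data = anscombe))    # fitted line from linear regression
```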
4.2. Dirty Data
Dirty data, such as outliers and invalid or missing values, can be detected in the data
exploration phase with visualizations.
In general, analysts should look for anomalies, verify the data with domain knowledge,
and decide the most appropriate approach to clean the data.
Consider a scenario in which a bank is conducting data analyses of its account holders to
gauge customer retention. The following figure shows the age distribution of the account
holders.
The figure shows that the median age of the account holders is around 40.
A few accounts with account holder age less than 10 are unusual but plausible. These
could be custodial accounts or college savings accounts set up by the parents of young
children. These accounts should be retained for future analyses.
However, the left side of the graph shows a huge spike of customers who are zero years
old or have negative ages. This is likely to be evidence of missing data.
One possible explanation is that the null age values could have been replaced by 0 or
negative values during the data input. Such an occurrence may be caused by entering age
in a text box that only allows numbers and does not accept empty values. Or it might be
caused by transferring data among several systems that have different definitions for null
values (such as NULL, NA, 0, –1, or –2).
Therefore, data cleansing needs to be performed over the accounts with abnormal age
values. Analysts should take a closer look at the records to decide if the missing data
should be eliminated or if an appropriate age value can be determined using other
available information for each of the accounts.
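A sketch of the scenario; the ages are synthetic, with null values recoded as 0 or -1 to mimic the dirty data described above:

```r
set.seed(2)
# plausible ages plus nulls that were recoded as 0 or -1 during data input
age <- c(round(rnorm(1000, mean = 40, sd = 10)), rep(0, 40), rep(-1, 10))
hist(age, breaks = 100)    # the spike at and below zero signals dirty data
clean_age <- age[age > 0]  # one cleansing option: drop the abnormal ages
```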
4.3. Visualizing a Single Variable
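The figures for this section are not reproduced here, but common single-variable visualizations can be sketched as follows (the data is synthetic):

```r
x <- rnorm(4000)
hist(x, breaks = 25)   # histogram of the distribution
plot(density(x))       # density plot: a smoothed version of the histogram
counts <- table(sample(c("a", "b", "c"), 100, replace = TRUE))
barplot(counts)        # barplot of the counts of a categorical variable
```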
4.4. Examining Multiple Variables
A scatterplot is a simple and widely used visualization for finding the relationship among
multiple variables.
A scatterplot can represent data with up to five variables using x-axis, y-axis, size, color,
and shape.
But usually only two to four variables are portrayed in a scatterplot to minimize
confusion.
When examining a scatterplot, one needs to pay close attention to the possible
relationship between the variables.
If the functional relationship between the variables is somewhat pronounced, the data
may roughly lie along a straight line, a parabola, or an exponential curve.
If variable y is related exponentially to x, then the plot of x versus log(y) is approximately
linear.
If the plot looks more like a cluster without a pattern, the corresponding variables may
have a weak relationship.
The scatterplot in Figure 3-13 portrays the relationship of two variables: x and y. The red
line shown on the graph is the fitted line from the linear regression.
Figure 3-13 shows that the regression line does not fit the data well. This is a case in
which linear regression cannot model the relationship between the variables.
Alternative methods such as the loess() function can be used to fit a nonlinear line to the
data. The blue curve shown on the graph represents the LOESS curve, which fits the data
better than linear regression.
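A sketch contrasting the two fits; the data is synthetic, generated with a deliberately nonlinear relationship:

```r
set.seed(3)
x <- rnorm(50)
y <- x^2 + rnorm(50, sd = 0.3)   # a clearly nonlinear relationship
plot(x, y)
abline(lm(y ~ x), col = "red")   # linear regression fits the data poorly
lo <- loess(y ~ x)               # LOESS fits a nonlinear curve
ord <- order(x)
lines(x[ord], predict(lo)[ord], col = "blue")  # LOESS curve fits better
```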
Box-and-Whisker Plot
Box-and-whisker plots show the distribution of a continuous variable for each value of a
discrete variable.
The box-and-whisker plot in Figure 3-16 visualizes mean household incomes as a
function of region in the United States. The first digit of the U.S. postal (“ZIP”) code
corresponds to a geographical region in the United States.
In Figure 3-16, each data point corresponds to the mean household income from a
particular zip code. The horizontal axis represents the first digit of a zip code, ranging
from 0 to 9, where 0 corresponds to the northeast region of the United States (such as
Maine, Vermont, and Massachusetts), and 9 corresponds to the southwest region (such as
California and Hawaii). The vertical axis represents the logarithm of mean household
incomes. The logarithm is taken to better visualize the distribution of the mean household
incomes.
In this figure, the scatterplot is displayed beneath the box-and-whisker plot, with some
jittering for the overlap points so that each line of points widens into a strip.
The “box” of the box-and-whisker shows the range that contains the central 50% of the
data, and the line inside the box is the location of the median value.
The upper and lower hinges of the boxes correspond to the first and third quartiles of the
data. The upper whisker extends from the hinge to the highest value that is within 1.5 *
IQR of the hinge. The lower whisker extends from the hinge to the lowest value within
1.5 * IQR of the hinge.
IQR is the inter-quartile range. The points outside the whiskers can be considered
possible outliers.
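A sketch of a box-and-whisker plot over a discrete variable; region stands in for the first digit of the ZIP code, and the incomes are synthetic:

```r
set.seed(4)
region <- factor(sample(0:9, 500, replace = TRUE), levels = 0:9)
log_income <- rnorm(500, mean = 11, sd = 0.5)  # log of mean household income
boxplot(log_income ~ region,
        xlab = "first digit of ZIP code",
        ylab = "log(mean household income)")
```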
4.5. Data Exploration Versus Presentation
