Unit 2
Graphical User Interfaces, Data Import and Export, Attribute and Data Types,
Descriptive Statistics, Exploratory Data Analysis, Visualization Before
Analysis, Dirty Data, Visualizing a Single Variable, Examining Multiple
Variables, Data Exploration Versus Presentation.
R uses a forward slash (/) as the separator character in the directory and file paths.
To simplify the import of multiple files with long path names, the setwd() function can be
used to set the working directory for the subsequent import and export operations, as
shown in the following R code.
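A minimal sketch of this pattern follows. The directory and file name are assumptions made for illustration: tempdir() stands in for a real project folder so that the snippet runs on any machine, and yearly_sales.csv is a small file created on the spot.

```r
# Use a temporary folder as a stand-in for a real project directory
# (hypothetical; substitute your own path, written with forward slashes).
old_wd <- setwd(tempdir())   # setwd() returns the previous working directory

# Create a small illustrative file so the import below has something to read.
writeLines(c("cust_id,sales_total", "100001,800.64", "100002,217.53"),
           "yearly_sales.csv")

# With the working directory set, only the file name is needed.
sales <- read.csv("yearly_sales.csv")
print(sales)

setwd(old_wd)                # restore the original working directory
```

Restoring the previous directory at the end keeps the sketch from affecting later import and export calls.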
The main difference among the import functions read.table(), read.csv(), read.csv2(), read.delim(), and read.delim2() is their default values.
For example, the read.delim() function expects the column separator to be a tab ("\t").
The following table includes the expected defaults for headers, column separators, and
decimal point notations.
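The defaults can also be read directly from the functions themselves. The sketch below uses formals() to look up the header, sep, and dec arguments of each import function; the names import_funs and defaults are illustrative only.

```r
# Collect the base R import functions in a named list.
import_funs <- list(read.table  = read.table,
                    read.csv    = read.csv,
                    read.csv2   = read.csv2,
                    read.delim  = read.delim,
                    read.delim2 = read.delim2)

# Look up the default value of each argument from the function definition.
defaults <- data.frame(
  header = sapply(import_funs, function(f) formals(f)$header),
  sep    = sapply(import_funs, function(f) formals(f)$sep),
  dec    = sapply(import_funs, function(f) formals(f)$dec)
)
print(defaults)
```

In the printed result, the tab separator of read.delim() and read.delim2() appears as whitespace, and the empty separator of read.table() means "any whitespace".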
R functions such as write.table(), write.csv(), and write.csv2() enable exporting of R
datasets to an external file.
Ex:
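One way the export functions might be used is sketched here. The sales data frame and the output file names are made up for illustration, and the files are written to tempdir() so that nothing is left in the working directory.

```r
# Small illustrative data frame (hypothetical values).
sales <- data.frame(cust_id     = c(100001, 100002),
                    sales_total = c(800.64, 217.53))

out_dir  <- tempdir()
csv_file <- file.path(out_dir, "sales_export.csv")
txt_file <- file.path(out_dir, "sales_export.txt")

# write.csv(): comma separator and "." as the decimal point.
write.csv(sales, csv_file, row.names = FALSE)

# write.csv2(): semicolon separator and "," as the decimal point.
write.csv2(sales, file.path(out_dir, "sales_export2.csv"), row.names = FALSE)

# write.table(): the separator is chosen explicitly, here a tab.
write.table(sales, txt_file, sep = "\t", row.names = FALSE)

readLines(csv_file)   # inspect what was written
```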
3.1. Attributes
In general, characteristics or attributes provide the qualitative and quantitative measures
for each item or subject of interest.
Attributes can be categorized into four types: nominal, ordinal, interval, and ratio (NOIR).
Nominal and ordinal attributes are considered categorical attributes, whereas interval and
ratio attributes are considered numeric attributes.
Data of one attribute type may be converted to another.
For example, the quality of diamonds {Fair, Good, Very Good, Premium, Ideal} is
considered ordinal but can be converted to nominal {Good, Excellent} with a defined
mapping.
Similarly, a ratio attribute like Age can be converted into an ordinal attribute such as
{Infant, Adolescent, Adult, Senior}.
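Both conversions can be sketched in R. The mapping from the five quality levels to {Good, Excellent} and the age break points below are assumptions chosen for illustration; the text does not define them.

```r
# Ordinal -> nominal: collapse five ordered levels into two labels.
quality <- factor(c("Fair", "Ideal", "Very Good"),
                  levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                  ordered = TRUE)
mapping <- c("Fair" = "Good", "Good" = "Good", "Very Good" = "Good",
             "Premium" = "Excellent", "Ideal" = "Excellent")  # hypothetical mapping
quality_nominal <- factor(unname(mapping[as.character(quality)]),
                          levels = c("Good", "Excellent"))
print(quality_nominal)

# Ratio -> ordinal: bin ages into ordered groups with cut().
age <- c(1, 15, 34, 70)
age_group <- cut(age,
                 breaks = c(0, 2, 18, 65, Inf),   # hypothetical break points
                 labels = c("Infant", "Adolescent", "Adult", "Senior"),
                 right = FALSE, ordered_result = TRUE)
print(age_group)
```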
Understanding the attribute types in a given dataset is important to ensure that the
appropriate descriptive statistics and analytic methods are applied and properly
interpreted.
For example, the mean and standard deviation of U.S. postal ZIP codes are not very
meaningful or appropriate.
3.2. Numeric, Character, and Logical Data Types
R supports several data types, such as numeric, character, and logical (Boolean) values, and it also provides a Date class for calendar dates.
Example variable assignment of different data values
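A short sketch of such assignments is given below; the variable names and values are illustrative.

```r
i     <- 1            # numeric
sport <- "football"   # character
flag  <- TRUE         # logical

class(i)       # "numeric"
typeof(i)      # "double": numeric values are stored as doubles by default
class(sport)   # "character"
class(flag)    # "logical"
```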
Additional R functions exist that can test the variables and coerce a variable into a
specific type.
Suppose the variable i is numeric.
The is.integer() function can be used to test whether i is an integer.
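The test and a coercion can be sketched as follows, assuming i was assigned the value 1 (the variable j is introduced here only for illustration).

```r
i <- 1
is.integer(i)        # FALSE: 1 is stored as a double, not an integer

j <- as.integer(i)   # coerce i into an integer
is.integer(j)        # TRUE
is.numeric(j)        # TRUE: integers are still numeric

as.character(i)      # coercion to a character string
```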
3.3. Vectors
The following R code illustrates how a vector can be created using the combine function,
c(), or the colon operator, :, to build a vector from the sequence of integers from 1 to 5.
Furthermore, the code shows how the values of an existing vector can be easily modified
or accessed.
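A minimal sketch of these operations, with illustrative vector names u, v, and w:

```r
u <- c("red", "yellow", "blue")   # build a character vector with c()
u[1]                              # access the first element
u[2] <- "green"                   # modify an existing element

v <- 1:5                          # integers 1 to 5 via the colon operator
sum(v)                            # 15

w <- v * 2                        # arithmetic is applied element by element
w[3]                              # 6
```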
The code, related to the z vector, indicates how logical comparisons can be built to extract
certain elements of a given vector.
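As a sketch, assuming z holds the values 3, 6, 9, 12, and 15:

```r
z <- c(3, 6, 9, 12, 15)

z > 8              # logical vector: FALSE FALSE TRUE TRUE TRUE
z[z > 8]           # keep only the elements greater than 8
z[z > 8 | z < 5]   # "|" is a logical OR: greater than 8 or less than 5
```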
Sometimes it is necessary to initialize a vector of a specific length and then populate the
content of the vector later.
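One way to do this is with the vector() function, sketched below with illustrative names.

```r
a <- vector(mode = "numeric", length = 3)   # initialized to 0 0 0
a[3] <- 3.1                                 # populate an element later
a

b <- vector(mode = "logical", length = 2)   # initialized to FALSE FALSE
b
```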
Although vectors may appear to be similar to arrays of one dimension, they are
technically dimensionless.
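The difference can be checked with dim(), as in this small sketch (the names v and m are illustrative).

```r
v <- 1:5
length(v)    # 5
dim(v)       # NULL: a plain vector has no dimension attribute

m <- array(1:5, dim = 5)   # a true one-dimensional array
dim(m)       # 5
```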
The variables stored in the data frame can be easily accessed using the $ notation.
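A minimal sketch, assuming a small sales data frame whose column names and values are made up for illustration:

```r
sales <- data.frame(cust_id       = c(100001, 100002, 100003, 100004),
                    sales_total   = c(800.64, 217.53, 74.58, 498.60),
                    num_of_orders = c(3, 3, 2, 3),
                    gender        = c("F", "F", "M", "M"))

sales$sales_total         # access one variable with the $ notation
mean(sales$sales_total)   # use it like any other vector
sales[2, ]                # row 2, all columns
str(sales)                # structure of the data frame
```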
3.6. Lists
Lists can contain any type of objects, including other lists.
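A small sketch of a list that mixes types and nests another list (all names and values are illustrative):

```r
housing    <- list("own", "rent")
assortment <- list("football", 7.5, housing, c(1, 2, 3))

str(assortment)        # compact view of the nested structure
assortment[[2]]        # double brackets return the element itself: 7.5
assortment[[3]][[1]]   # first element of the nested list: "own"
class(assortment[3])   # single brackets return a list of length one
```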
Included with the ggplot2 package, the diamonds data frame contains three ordered
factors.
Examining the cut factor, there are five levels in order of improving cut: Fair, Good, Very
Good, Premium, and Ideal.
library(ggplot2)
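The behavior of an ordered factor can be sketched without loading the diamonds data. The vector below is a hand-made stand-in, not the actual data frame, and it reuses the five cut levels named above.

```r
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Stand-in for diamonds$cut (hypothetical values).
cut_f <- factor(c("Ideal", "Premium", "Good", "Fair"),
                levels = cut_levels, ordered = TRUE)

levels(cut_f)       # the five levels, in order of improving cut
is.ordered(cut_f)   # TRUE
cut_f < "Premium"   # ordered factors support comparisons
table(cut_f)        # counts per level, including empty levels
```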
When applied to a contingency table of two factors created with table(), the summary()
function performs a chi-squared test on the independence of the two factors.
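A minimal sketch with two made-up factors, gender and plan, follows. With so few observations the chi-squared approximation is not reliable, so the numbers only illustrate the mechanics.

```r
gender <- c("F", "M", "F", "M", "F", "M", "F", "F")   # hypothetical data
plan   <- c("A", "A", "B", "B", "A", "B", "B", "A")   # hypothetical data

tab <- table(gender, plan)   # contingency table of the two factors
tab

res <- summary(tab)          # chi-squared test of independence
res$statistic                # test statistic
res$parameter                # degrees of freedom
res$p.value                  # p-value
```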
3.9. Descriptive Statistics
The summary() function provides several descriptive statistics, such as the mean and
median, about a variable or about each column of a data frame such as sales.
The IQR() function provides the difference between the third and the first quartiles.
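A short sketch on a made-up numeric vector x:

```r
x <- c(2, 4, 4, 5, 7, 9, 10)

summary(x)   # minimum, quartiles, median, mean, maximum

q <- quantile(x, c(0.25, 0.75))   # first and third quartiles
IQR(x)                            # same value as q[2] - q[1]
sd(x)                             # standard deviation
```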
The function apply() is useful when the same function is to be applied to several variables
in a data frame.
For example, the following R code calculates the standard deviation for the first three
variables in sales.
In the code, setting MARGIN=2 specifies that the sd() function is applied over the
columns.
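The call might look like the sketch below. The sales data frame here is a small made-up stand-in with three numeric columns followed by a character column.

```r
sales <- data.frame(cust_id       = c(100001, 100002, 100003, 100004),
                    sales_total   = c(800.64, 217.53, 74.58, 498.60),
                    num_of_orders = c(3, 3, 2, 3),
                    gender        = c("F", "F", "M", "M"))

# MARGIN = 2 applies sd() to each column; MARGIN = 1 would go row by row.
col_sd <- apply(sales[, 1:3], MARGIN = 2, FUN = sd)
col_sd
```

Only the numeric columns are passed in, because apply() converts its input to a matrix and a character column would turn every value into text.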
Other functions, such as lapply() and sapply(), apply a function to a list or vector.
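The difference between the two is in the shape of the result, as this sketch on a made-up list shows.

```r
scores <- list(a = c(1, 2, 3), b = c(10, 20, 30, 40))

res_list <- lapply(scores, mean)   # always returns a list
res_vec  <- sapply(scores, mean)   # simplifies to a named vector when it can

res_list
res_vec
```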
The above four datasets have nearly identical statistical properties, such as mean and variance.
Based on the nearly identical statistical properties across each dataset, one might
conclude that these four datasets are quite similar. However, the scatterplots tell a
different story.
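The similarity of the summary statistics can be verified with the anscombe data frame that ships with R, which stores the four datasets as columns x1 to x4 and y1 to y4. The sketch below is one way to tabulate them; the name stats is illustrative.

```r
data(anscombe)

# For each of the four datasets, compute means, variances, and the
# coefficients of a simple linear regression of y on x.
stats <- sapply(1:4, function(k) {
  x   <- anscombe[[paste0("x", k)]]
  y   <- anscombe[[paste0("y", k)]]
  fit <- lm(y ~ x)
  c(mean_x    = mean(x),
    var_x     = var(x),
    mean_y    = mean(y),
    var_y     = var(y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]))
})
colnames(stats) <- paste("Dataset", 1:4)
round(stats, 2)
```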
Each dataset is plotted as a scatterplot, and the fitted lines are the result of applying linear
regression models.
The estimated regression line fits Dataset 1 reasonably well. Dataset 2 is definitely
nonlinear. Dataset 3 exhibits a linear trend, with one apparent outlier at x = 13. For
Dataset 4, the regression line fits the dataset quite well. However, with only points at two
x values, it is not possible to determine that the linearity assumption is proper.
The R code for generating the above scatter plots is shown below:
It requires the R package ggplot2, which can be installed simply by running the command
install.packages("ggplot2").
The Anscombe dataset for the plot is included in the standard R distribution. Enter data()
for a list of datasets included in the R base distribution. Enter data(DatasetName) to make
a dataset available in the current workspace.
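A sketch of how such a set of scatterplots could be produced follows. It is not necessarily the original listing: the reshaping step and the names long and p are assumptions, and the plotting part is skipped when ggplot2 is not installed.

```r
data(anscombe)

# Reshape the wide anscombe data into one row per point:
# dataset label, x value, y value.
long <- do.call(rbind, lapply(1:4, function(k) {
  data.frame(dataset = paste("Dataset", k),
             x = anscombe[[paste0("x", k)]],
             y = anscombe[[paste0("y", k)]])
}))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One panel per dataset, with points and a fitted linear regression line.
  p <- ggplot(long, aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    facet_wrap(~ dataset, ncol = 2)
  print(p)
}
```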
4.2. Dirty Data
Dirty data, such as outliers and invalid values, can be detected in the data exploration phase with visualizations.
In general, analysts should look for anomalies, verify the data with domain knowledge,
and decide the most appropriate approach to clean the data.
Consider a scenario in which a bank is conducting data analyses of its account holders to
gauge customer retention. The following figure shows the age distribution of the account
holders.
The figure shows that the median age of the account holders is around 40.
A few accounts with account holder age less than 10 are unusual but plausible. These
could be custodial accounts or college savings accounts set up by the parents of young
children. These accounts should be retained for future analyses.
However, the left side of the graph shows a huge spike of customers who are zero years
old or have negative ages. This is likely to be evidence of missing data.
One possible explanation is that the null age values could have been replaced by 0 or
negative values during the data input. Such an occurrence may be caused by entering age
in a text box that only allows numbers and does not accept empty values. Or it might be
caused by transferring data among several systems that have different definitions for null
values (such as NULL, NA, 0, –1, or –2).
Therefore, data cleansing needs to be performed over the accounts with abnormal age
values. Analysts should take a closer look at the records to decide if the missing data
should be eliminated or if an appropriate age value can be determined using other
available information for each of the accounts.
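The scenario can be imitated with simulated data. Everything in the sketch below is made up for illustration: the ages are random draws, and 0 and -1 are planted as placeholder values for missing ages.

```r
set.seed(1)

# Simulated ages: mostly plausible values plus planted placeholders.
age <- c(round(rnorm(500, mean = 40, sd = 12)),
         rep(0, 40),    # nulls stored as 0
         rep(-1, 10))   # nulls stored as -1

# A histogram makes the spike at zero and below easy to spot.
hist(age, breaks = 100, main = "Age Distribution of Account Holders",
     xlab = "Age", ylab = "Frequency", col = "gray")

# One possible cleaning step: treat non-positive ages as missing (NA).
age_clean <- ifelse(age <= 0, NA, age)
sum(is.na(age_clean))              # number of records flagged as missing
median(age_clean, na.rm = TRUE)    # median computed on valid ages only
```

Marking the values as NA rather than deleting the records keeps the accounts available in case an age can be recovered from other information.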
4.3. Visualizing a Single Variable
4.4. Examining Multiple Variables
A scatterplot is a simple and widely used visualization for finding the relationship among
multiple variables.
A scatterplot can represent data with up to five variables using x-axis, y-axis, size, color,
and shape.
But usually only two to four variables are portrayed in a scatterplot to minimize
confusion.
When examining a scatterplot, one needs to pay close attention to the possible
relationship between the variables.
If the functional relationship between the variables is somewhat pronounced, the data
may roughly lie along a straight line, a parabola, or an exponential curve.
If variable y is related exponentially to x, then the plot of x versus log(y) is approximately
linear.
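This can be checked on simulated data. In the sketch below, the growth rate 0.3, the scale factor 2, and the noise level are arbitrary choices made for illustration.

```r
set.seed(42)

x <- seq(1, 10, length.out = 50)
y <- 2 * exp(0.3 * x) * exp(rnorm(50, sd = 0.05))   # exponential in x, with noise

cor(x, y)        # linear correlation on the original scale
cor(x, log(y))   # closer to 1 after taking the logarithm

fit <- lm(log(y) ~ x)   # straight-line fit on the log scale
coef(fit)               # the slope estimates the growth rate

plot(x, log(y))         # approximately a straight line
abline(fit, col = "red")
```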
If the plot looks more like a cluster without a pattern, the corresponding variables may
have a weak relationship.
The scatterplot in Figure 3-13 portrays the relationship of two variables: x and y. The red
line shown on the graph is the fitted line from the linear regression.
Figure 3-13 shows that the regression line does not fit the data well. This is a case in
which linear regression cannot model the relationship between the variables.
Alternative methods such as the loess() function can be used to fit a nonlinear line to the
data. The blue curve shown on the graph represents the LOESS curve, which fits the data
better than linear regression.
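A comparable picture can be produced with simulated data. The curved relationship below is made up for illustration and is not the data behind Figure 3-13.

```r
set.seed(7)

x <- seq(0, 10, length.out = 100)
y <- (x - 5)^2 + rnorm(100, sd = 1)   # curved relationship plus noise

lin_fit   <- lm(y ~ x)      # straight-line fit
loess_fit <- loess(y ~ x)   # local regression with default settings

plot(x, y)
abline(lin_fit, col = "red")                     # linear regression line
lines(x, predict(loess_fit), col = "blue")       # LOESS curve

# Residual sums of squares: smaller means a closer fit to the data.
rss_lin   <- sum(residuals(lin_fit)^2)
rss_loess <- sum(residuals(loess_fit)^2)
c(linear = rss_lin, loess = rss_loess)
```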
Box-and-Whisker Plot
Box-and-whisker plots show the distribution of a continuous variable for each value of a
discrete variable.
The box-and-whisker plot in Figure 3-16 visualizes mean household incomes as a
function of region in the United States. The first digit of the U.S. postal (“ZIP”) code
corresponds to a geographical region in the United States.
In Figure 3-16, each data point corresponds to the mean household income from a
particular ZIP code. The horizontal axis represents the first digit of a ZIP code, ranging
from 0 to 9, where 0 corresponds to the northeast region of the United States (such as
Maine, Vermont, and Massachusetts), and 9 corresponds to the southwest region (such as
California and Hawaii). The vertical axis represents the logarithm of mean household
incomes. The logarithm is taken to better visualize the distribution of the mean household
incomes.
In this figure, the scatterplot is displayed beneath the box-and-whisker plot, with some
jittering of the overlapping points so that each line of points widens into a strip.
The “box” of the box-and-whisker shows the range that contains the central 50% of the
data, and the line inside the box is the location of the median value.
The lower and upper hinges of the boxes correspond to the first and third quartiles of the
data. The upper whisker extends from the hinge to the highest value that is within 1.5 *
IQR of the hinge. The lower whisker extends from the hinge to the lowest value within
1.5 * IQR of the hinge.
IQR is the inter-quartile range. The points outside the whiskers can be considered
possible outliers.
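The whisker rule can be worked through by hand on a small made-up vector. The sketch uses quantile() for the quartiles; boxplot() itself takes its hinges from fivenum(), which can differ slightly from these quartiles for some sample sizes.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)   # hypothetical values, one extreme

q   <- quantile(x, c(0.25, 0.75))   # first and third quartiles
iqr <- IQR(x)                       # inter-quartile range

# Whiskers reach the most extreme values within 1.5 * IQR of the box.
upper_whisker <- max(x[x <= q[2] + 1.5 * iqr])
lower_whisker <- min(x[x >= q[1] - 1.5 * iqr])

# Points beyond the whiskers are possible outliers.
outliers <- x[x > upper_whisker | x < lower_whisker]

c(lower = lower_whisker, upper = upper_whisker)
outliers

boxplot(x)   # the same point is drawn individually above the upper whisker
```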
4.5. Data Exploration Versus Presentation