Data Sources and Advanced Data Handling
A Summary
By
Zyana Nabila | 202431001068
2024-2025
Advanced Data Handling: Preprocessing Methods
There are many sources for data, both internal and external to your business, which make the
first question difficult to answer. Internal data should be relatively easy to obtain, although
internal political issues may interfere. In particular, different internal organizations may maintain
data sets you might need, but jealously safeguard them to maintain power. This is sometimes
referred to as creating data silos. See Wilder-James (2016) for a discussion about the need to break down silo barriers. External data would not have this problem, although they may nonetheless be difficult to locate.
Once you have your data, you must then begin to understand its structure. Structure is an
overlooked topic in all forms of data analysis so I will spend some time on it. Knowing structure
is part of “looking” at your data, a time-honored recommendation. This is usually stated,
however, in conjunction with discussions about graphing data to visualize distributions,
relationships, trends, patterns, and anomalies.
This section outlines two key sources of data, primary and secondary, and highlights the importance of understanding data sources to minimize errors that could lead to outliers in analysis.
Primary data refers to data collected specifically for a particular purpose, giving full control over the collection process, including methods, criteria, and definitions. This allows for careful management of data quality and relevance. Common primary data sources include surveys, focus groups, and experiments run for the study at hand.
Secondary data is pre-existing data collected for other purposes, which you reuse for analysis. While it is often accessible and useful, it limits control over the collection process and may contain measurement errors or irrelevant aspects. Examples include economic data from government agencies or external databases.
Understanding whether data is endogenous (internal, from within the business) or exogenous (external, driven by factors beyond the business's control) is also crucial for analysis. Exogenous data are further divided into universal (affecting all industries, like economic cycles) and local (specific to an industry or business, such as competitor actions).
The key questions to assess data quality include understanding its origin, strengths, weaknesses,
collection methods, and the reliability of the organization or person gathering the data.
There are two data domains: spatial (i.e., cross-sectional) and temporal (i.e., time series). Cross-sectional data are data on different units measured at one point in time. Measurements on sales by countries, states, industries, and households in a year are examples. The label "spatial" should not be narrowly interpreted as space or geography; it is anything without a time aspect. Time series, or temporal domain data, are data on one unit tracked through time. Monthly same-store sales for a 5-year period is an example.
Data also come at two levels of aggregation:
1. disaggregate, and
2. aggregate.
Disaggregate data are the most fundamental level, although the boundary between disaggregate and aggregate is blurry. Data on consumer purchases collected by point-of-sale (POS) scanners are an example of disaggregate data, while sales by stores and marketing regions are examples of aggregate data.
The continuity of data refers to their smoothness. There are two types:
Continuous data have an infinite number of possible values with a decimal representation, although typically only a finite number of decimal places are used or are relevant. Such data are floating-point numbers in Python.
Discrete data have only a small, finite number of possible values represented by integers. Integers are referred to as ints in Python. They are often used for classification and so are categorical.
This leads to the four measurement scales proposed by Stevens (1946), which are widely used by data analysts despite some controversy. These scales (nominal, ordinal, interval, and ratio) are foundational for data analysis, each differing in complexity and the types of statistical operations allowed.
1. Nominal Scale: The simplest scale, used for labeling or categorizing data without any order. For example, a "Buy/Don't Buy" survey question is nominal. Statistical operations are limited to counts, proportions, and mode. The numeric encoding of categories is arbitrary and has no inherent meaning.
2. Ordinal Scale: Data on this scale have a meaningful order, but the distance between values is not defined. Examples include Likert scales (e.g., levels of purchase intent) or job hierarchy (e.g., Entry, Mid, Executive). Ordinal data allow for the calculation of counts, proportions, mode, median, and percentiles, but not means or standard deviations, since the intervals between ranks are not meaningful. Despite this, means are sometimes controversially calculated for Likert-scale data.
3. Interval Scale: This scale includes ordered data with equal intervals between values, but lacks a true zero point. For instance, temperature in Fahrenheit is interval data. You can measure the difference between two temperatures, but a ratio (e.g., "twice as hot") is meaningless due to the arbitrary zero point. Interval data support more complex statistics, including counts, proportions, mode, median, mean, and standard deviation.
4. Ratio Scale: The most sophisticated scale, with ordered values, equal intervals, and a meaningful zero point. Economic and business data often fall into this category (e.g., sales, income). With ratio data, both differences and ratios are meaningful (e.g., twice as much). All statistical measures (counts, proportions, mode, median, mean, and standard deviation) can be applied.
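As an illustration of how nominal and ordinal scales can be represented in Pandas, here is a minimal sketch; the column names and category levels are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        'buy': ['Buy', "Don't Buy", 'Buy'],      # nominal: labels only, no order
        'intent': ['Low', 'High', 'Medium']      # ordinal: ordered labels
    })

    # Nominal: an unordered categorical; only counts, proportions, and mode make sense
    df['buy'] = pd.Categorical(df['buy'])

    # Ordinal: an ordered categorical; medians and percentiles are meaningful, means are not
    df['intent'] = pd.Categorical(df['intent'],
                                  categories=['Low', 'Medium', 'High'],
                                  ordered=True)

    print(df['buy'].value_counts())
    print(df['intent'].min())   # 'Low', because an ordering has been defined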
Data organization involves understanding how data are stored and structured, both externally by IT departments and internally by the analyst for specific analysis purposes.
1. External Data Structure: Managed by IT, this involves organizing data for efficient storage and retrieval through processes like Extract-Transform-Load (ETL). This structure is largely fixed and designed for general use, not tailored to specific analysis needs.
2. Internal Data Structure: Once you access the external data, you reorganize it to fit your specific analysis. This structure is flexible and changes based on the problem you are solving. It helps define what analyses are possible based on how you arrange the data.
Metadata can be anything that helps you understand and document your data. This could include:
• means of creation;
and so on. I will restrict the metadata in data dictionaries used in this book to include only:
• variable name;
• possible values;
• source; and
• mnemonic.
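A data dictionary along these lines can be kept directly in Python. This is a minimal sketch with hypothetical variables and entries:

    # A simple data dictionary: one entry per variable, restricted to the items above
    data_dictionary = {
        'sales': {
            'possible_values': 'non-negative floats, in USD',
            'source': 'Orders database',        # hypothetical source
            'mnemonic': 'SLS'
        },
        'region': {
            'possible_values': ['Midwest', 'Northeast', 'South', 'West'],
            'source': 'Marketing database',     # hypothetical source
            'mnemonic': 'RGN'
        }
    }

    print(data_dictionary['sales']['mnemonic'])   # SLS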
BASIC DATA HANDLING
Basic data handling in real-world business data analytics (BDA) is much more complex than the simple, clean datasets typically seen in textbooks on statistics and econometrics. While textbooks focus on tidy, small datasets with a few variables, BDA often involves massive, messy data spread across multiple sources. Preprocessing this data is a critical step before any analysis can begin.
1. Large Datasets: In practice, data sets are often far too large to be easily imported or processed in a computer's memory. Unlike simple, flat data files with two dimensions (rows and columns), BDA datasets can be high-dimensional, requiring complex handling.
2. Multiple Data Sources: Data often need to be merged or joined from different sources to form a single, usable dataset for analysis. Handling these complexities is a standard task in BDA, but it is not typically covered in introductory courses.
3. Reshaping Data: The form in which the data are initially organized may not be suitable for analysis. For example, wide-form data (many variables, few rows) may need to be converted to long-form (many rows, few variables), depending on the statistical or machine learning techniques being applied.
4. Missing Values and Inconsistent Scales: Data often contain missing values, which can affect analysis. Additionally, variables on different scales can bias results, as some variables may dominate due to their measurement units. Preprocessing involves addressing missing values, normalizing scales, and sometimes transforming variables to adjust their distribution or variance.
Importing Data: Efficiently importing large datasets into tools like Pandas, especially when the data exceeds your computer's memory capacity.
Reshaping Data: Converting data from wide-form to long-form or vice versa, depending on the requirements of the analysis.
These tasks are foundational to managing the complexities of data and ensure the data are properly structured for further analysis.
Business Context:
The company offers 43 products in six product lines (e.g., Living Room, Kitchen) with
discounts based on competitive factors, order size, and retailer pickup.
Data Structure:
1. Orders Database: Contains order number, customer ID, timestamps, order amount, list
price, discounts, return flags, and material costs.
2. Marketing Database: Features customer ID, loyalty program membership, buyer rating,
and satisfaction scores.
Objectives: The Living Rooms product line manager seeks to:
1. Identify sales patterns by marketing region, customer loyalty, and buyer rating.
Business Context:
A national bread-products company supplies fresh bread items to various store fronts,
requiring timely deliveries from local bakeries.
Orders are placed electronically, and fulfillment is monitored through a specialized tablet
app by customers.
Data Structure:
Each order includes measures for completeness, damage-free delivery, timeliness, and
accurate documentation, recorded as binary responses (Yes/No).
Additional data for each bakery (facility ID, marketing region, and location type) is also
maintained.
Analysis Objective: You are tasked with developing a Perfect Order Index (POI) as a fulfillment measure. The POI is calculated as the product of the percentages of affirmative responses on the four fulfillment measures: complete, damage-free, on-time, and correctly documented orders.
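As a rough sketch of this calculation in Pandas, with hypothetical column names standing in for the four Yes/No measures:

    import pandas as pd

    # Hypothetical order-level data: one row per order, Yes/No for each fulfillment measure
    orders = pd.DataFrame({
        'complete':    ['Yes', 'Yes', 'No',  'Yes'],
        'damage_free': ['Yes', 'Yes', 'Yes', 'Yes'],
        'on_time':     ['Yes', 'No',  'Yes', 'Yes'],
        'documented':  ['Yes', 'Yes', 'Yes', 'No']
    })

    # Proportion of affirmative responses for each measure
    proportions = (orders == 'Yes').mean()

    # The POI is the product of the four proportions
    poi = proportions.prod()
    print(proportions)
    print(f'POI = {poi:.3f}')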
An obvious first step in any analytical process, aside from locating the right data, is to import your data into an analytical framework. This is actually more complicated than you might imagine. Some issues to address are the current data format, the size of the dataset to import, and the nature of the data once imported.
The Comma Separated Value (CSV) and Excel formats are probably the most commonly used
formats in BDA. CSV is a simple format with each value in a record separated by a comma and
text strings often, but not always, denoted by quotation marks. This format is supported by
almost all software packages because of its simplicity. Excel is also very popular because many
analysts mistakenly believe that Excel, or any spreadsheet package, is sufficient for data
analytical work. This is far from true. Nonetheless, they store their data in Excel workbooks and
worksheets.
JavaScript Object Notation (JSON) is another popular format that allows you to transfer data, software code, and more from one installation to another. Jupyter notebooks, for example, are JSON files.
The data sources for the SQL From verb are SQL-ready data tables and the result of the Select
verb is a data table satisfying the query. A powerful and useful feature of SQL is the use of a
returned table in the From clause so that, in effect, you could embed one query in another.
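To make the nesting idea concrete, here is a minimal sketch using a hypothetical SQLite database and table; the file, table, and column names are illustrative only:

    import sqlite3
    import pandas as pd

    con = sqlite3.connect('orders.db')   # hypothetical database file

    # The inner SELECT returns a table that the outer FROM clause treats as its data source
    sql = """
        SELECT region, SUM(amount) AS total_sales
        FROM (SELECT region, amount FROM orders WHERE year = 2024) AS sub
        GROUP BY region
    """
    df = pd.read_sql(sql, con)   # the query result arrives as a Pandas DataFrame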
The basic import or read command consists of four parts:
1. the package where the function is located: Pandas, identified by its alias pd;
2. the import function itself, such as read_csv;
3. the path to the directory holding the data file; and
4. the name of the data file.
The package alias must be "chained" to the read_csv import function, otherwise the Python interpreter will not know where to find the function. Both the path and file name can be separately defined for convenience and cleaner coding; I consider this a Best Practice. You must always specify the file path so Pandas can find the file, unless the data file is in the same directory as your Jupyter notebook. Then a path is unnecessary since Pandas always begins its search in the same directory as the notebook.
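A minimal sketch of this pattern, with a hypothetical path and file name:

    import pandas as pd

    path = './data/'          # hypothetical directory holding the data file
    file = 'orders.csv'       # hypothetical file name

    # The alias pd is chained to read_csv so the interpreter knows where the function lives
    df = pd.read_csv(path + file)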
The data files you need for a BDA problem are typically large, perhaps larger than what is
practical for you to import at once. In particular, if you process a large file after importing it,
perhaps to create new variables or selectively keep specific columns, then it is very inefficient to
discard the majority of the imported data as unneeded. Too much time and computer resources
are used to justify the relatively smaller final DataFrame needed for an analysis. This
inefficiency is increased if there is a processing error (e.g., transformations are incorrectly
applied, calculations are incorrect, or the wrong variables are saved) so it all has to be redone.
Importing chunks of data, processing each separately, and then concatenating them into one final, albeit smaller and more compact, DataFrame is a better way to proceed. For example, one small chunk of data could be imported as a test chunk to check whether transformations and content are correct. Then a large number of chunks could be read, processed, and concatenated.
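Here is one way this chunked workflow might look in Pandas; the file name, chunk size, and column selection are hypothetical:

    import pandas as pd

    pieces = []
    # Read the (hypothetical) large file 100,000 rows at a time
    for chunk in pd.read_csv('orders.csv', chunksize=100_000):
        # Process each chunk: keep only the needed columns and create a new variable
        chunk = chunk[['order_id', 'amount', 'discount']]
        chunk['net_amount'] = chunk['amount'] * (1 - chunk['discount'])
        pieces.append(chunk)

    # Concatenate the processed chunks into one smaller, final DataFrame
    df = pd.concat(pieces, ignore_index=True)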
Once you have imported your data, you should perform five checks of them before beginning
your analytical work:
Check #1 Display the first few records of your DataFrame. Ask: “Do I see what I expect to
see?”
One way to “look” at your data is to determine if they are in the format you expect. For example,
if you expect floating point numbers but you see only integers, then something is wrong. Also, if
you see character strings (e.g., the words “True” and “False”) when you expect integers (e.g., 1
for True and 0 for False), then you know you will have to do extra data processing. Similarly, if you see commas as thousands separators in what should be floating-point numbers, then you have a problem because the numbers will be treated and stored as strings, since a comma is a character.
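A quick sketch of this check, assuming df is the DataFrame imported earlier:

    # Check #1: display the first few records. Do I see what I expect to see?
    print(df.head())      # first five rows by default
    print(df.head(10))    # or any number of rows you choose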
Check #2 Check the shape of your DataFrame. Ask: “Do I see all I expect to see?”
The shape of a DataFrame is a tuple (actually, a 2-tuple) whose elements are the number of rows
and the number of columns, in that order. A tuple is an immutable list which means it cannot be
modified. The shape tuple is an attribute of the DataFrame so it is an automatic characteristic of a
DataFrame that you can always access. To display the shape, use df.shape. Although a tuple is
immutable, this does not mean you cannot access its elements for separate processing. To access
an element, use the square brackets, [ ], with the element’s index inside. For example, to access
the number of rows, use df.shape[ 0 ]. Remember, Python is zero-based so indexing starts with
zero for the first element.
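A similar sketch for the shape check:

    # Check #2: shape is a (rows, columns) tuple attribute of the DataFrame
    print(df.shape)
    n_rows = df.shape[0]   # number of rows; indexing is zero-based
    n_cols = df.shape[1]   # number of columns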
Check #3 Check the column names in your DataFrame. Ask: “Do I have the correct and
cleansed variable names I need?”
Checking column (i.e., variable) names is a grossly overlooked step in the first stages of data analysis. A name will certainly not impact your analysis in any way, but failure to check names could impact the time you spend looking for errors rather than doing your analysis. Column names, which are also attributes of a DataFrame, could have stray characters and leading and trailing white spaces. White spaces are especially pernicious. Suppose a variable's name is actually ' sales', with a leading white space. When you display the head of the DataFrame, you will appear to see 'sales' without the white space, but the white space is really there. You will naturally try to use 'sales' (notice there is no white space) in a future command, say a regression command. The Python interpreter will immediately display an error message that 'sales' is not found, and the reason is simply that you typed 'sales' (no white space) rather than ' sales' (leading white space). You will needlessly spend time trying to find out why. Checking column names up front will save time later.
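A sketch of this check, along with one common way to cleanse the names:

    # Check #3: inspect the column names for stray characters and white space
    print(df.columns.tolist())

    # Strip leading and trailing white space from every column name
    df.columns = df.columns.str.strip()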
Check #4 Check for missing data in your DataFrame. Ask: “Do I have complete data?”
Missing values are a headache in statistics, econometrics, and machine learning. You cannot estimate a parameter if the data for estimation are missing. Some estimation and analysis methods automatically check for missing values in the DataFrame and then delete records with at least one missing value. If the DataFrame is very large, this automatic deletion is not worrisome. If the DataFrame is small, then it is worrisome because the degrees-of-freedom for hypothesis testing could be reduced enough to jeopardize the validity of a test. It is also troublesome if you are working with time series because the deletion will cause a break in the continuity of the series. Many time series functions require this continuity.
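A sketch of this check:

    # Check #4: count the missing values in each column
    print(df.isna().sum())

    # Or simply flag whether any column has missing values at all
    print(df.isna().any())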
Check #5 Check the data types of your variables. Ask: “Do I have the correct data types?”
The key data types are strings, integers, floats, and datetimes; there are counterparts for most of these in Python and NumPy.
Strings are text enclosed in either single or double quotation marks. Numbers could be interpreted and handled as strings if they are enclosed in quotation marks. For example, 3.14159 and "3.14159" are two different data types: the first is a number per se while the second is a string. An integer is a number without a decimal; it is a whole number that could be positive or negative. A float is a number with a decimal point that can "float" among the digits depending on the values to represent. Integers and floats are treated differently, and operations on them could give surprising and unexpected results. Dates and times, combined and referred to as datetime, are a complex object treated and stored differently than floats, integers, and strings. Their use in calculations to accurately reflect dates, times, periods, time between periods, time zones, Daylight Saving and Standard Time, and calendars in general is itself a complex topic.
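A sketch of this final check, with two typical corrections; the column names are hypothetical:

    # Check #5: inspect the data type of each column
    print(df.dtypes)

    # Typical corrections: numbers stored as strings and dates stored as strings
    df['amount'] = pd.to_numeric(df['amount'].astype(str).str.replace(',', ''))
    df['order_date'] = pd.to_datetime(df['order_date'])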
Notice that I do not have data visualization on my list. You might suppose that it should be part
of Check #1: Look at your data.
3.3 Merging or Joining DataFrames
You will often have data in two (or more) DataFrames, so you will need to merge or join them to get one complete DataFrame. As an example, a second DataFrame for the baking facilities has information on each facility: the marketing region where the facility is located (Midwest, Northeast, South, and West), the state in that region, a two-character state code, the customer location served by that facility (urban or rural), and the type of store served (convenience, grocery, or restaurant). This DataFrame must be merged with the POI DataFrame to have a complete picture of the baking facility.
There are many types of joins, but I will only describe one: an inner join, which is the default method in Pandas because it is the most common. The inner join operates by using each value of the primary key in the DataFrame on the left to find a matching value in the primary key on the right. If a match is found, then the data from the left and right DataFrames for that matching key are put into an output DataFrame. If a match is not found, then the left primary key is dropped, nothing is put into the output DataFrame, and the next left primary key is used. This is repeated for each primary key on the left.
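A minimal sketch of an inner join in Pandas, assuming a hypothetical facility ID column named FID as the primary key in both DataFrames:

    import pandas as pd

    # Hypothetical POI values by facility
    poi_df = pd.DataFrame({'FID': [1, 2, 3], 'POI': [0.92, 0.85, 0.78]})

    # Hypothetical facility attributes
    facilities_df = pd.DataFrame({'FID': [1, 2, 4],
                                  'region': ['Midwest', 'South', 'West'],
                                  'storeType': ['grocery', 'convenience', 'restaurant']})

    # Inner join on the primary key: only FIDs 1 and 2 appear in both, so only they survive
    merged = pd.merge(poi_df, facilities_df, on='FID', how='inner')
    print(merged)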
A DataFrame's shape attribute provides information on the number of rows and columns. Sometimes this shape is inappropriate for a specific form of analysis and so it must be changed: the DataFrame must be reshaped to make its shape more appropriate. Changing a DataFrame from wide- to long-form involves stacking the rows vertically on top of each other (basically transposing each row) into a new column in a new DataFrame and using the original DataFrame's column names as values in yet another new column alongside the transposed rows. Those column names will repeat for each transposed row. As an example, a DataFrame could have 5 years of monthly sales data with one column for the year and a separate column for each month. There are then 13 columns (one for year and 12 for months) and five rows for the 5 years. The shape is the tuple (5, 13). This is wide-form. A simple regression analysis for sales as a function of a year and month effect requires a different data arrangement: one column for sales, one for year, and one for month. This is long-form. So, the wide-form must be reshaped to long-form by stacking.
There are two Pandas methods for reshaping a DataFrame from wide- to long-form: stack and melt. They basically do the same thing, but the melt method is slightly more versatile in providing a name for the new columns of transposed rows. The stack method is better for operating on MultiIndexed DataFrames. The reverse reshaping operation of going from long- to wide-form is unstacking.
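A minimal sketch of melting the wide-form sales example above into long-form; the column names and values are hypothetical, and only three months are shown for brevity:

    import pandas as pd

    # Wide-form: one row per year, one column per month
    wide = pd.DataFrame({'year': [2020, 2021],
                         'Jan': [100, 110],
                         'Feb': [95, 105],
                         'Mar': [120, 125]})

    # Long-form: one row per (year, month) pair with a single sales column
    long = wide.melt(id_vars='year', var_name='month', value_name='sales')
    print(long)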
Sorting is the next most common and frequently used operation on a DataFrame. This involves putting the values in a DataFrame in a descending or ascending order based on the values in one or more variables.
In some instances, sorting the DataFrame to identify the largest customers in terms of purchases, those who purchased most recently, or the poorest-performing facilities may not have to be done. A query of the DataFrame may be all that is needed in these cases; I discuss queries in the next section. You could also use the nlargest or nsmallest methods. Each takes an argument for the number of records to return and a list of the columns to search on. For example, df.nlargest(10, 'X') returns the 10 largest values from the X column in the DataFrame.
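A short sketch of both approaches; the column name is hypothetical:

    # Sort all rows by purchase amount, largest first
    df_sorted = df.sort_values(by='amount', ascending=False)

    # Or skip the full sort and just pull the 10 largest and 10 smallest purchases
    top10 = df.nlargest(10, 'amount')
    bottom10 = df.nsmallest(10, 'amount')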
A DataFrame, whether small or large, contains a lot of latent information, as I discussed before. One way you can get some information out of it is by literally asking it a question, that is, by querying the DataFrame.
In everyday arithmetic, the equal sign has a meaning we all accept: the term or value on the left is the same as that on the right, just differently expressed. If the two sides are the same, then the expression is taken to be true; otherwise, it is false. So, the expression 2 + 2 = 4 is true while 2 + 2 = 5 is false. In Python, and many other programming languages, the equal sign has a different interpretation. It assigns a symbol to an object, which could be numeric or string; it does not signify equality as in everyday arithmetic. The assignment names the object, and the name is said to be bound to the object. The object is on the right and the name on the left. So, the expression x = 2 does not say that x is the same as 2 but that x is the name for, and is bound to, the value 2. As noted by Sedgewick et al. (2016, p. 17), "In short, the meaning of = in a program is decidedly not the same as in a mathematical equation."
The object, 2 in this case, is stored once in memory, but there could be several names bound to that memory location. So x = 2 and y = 2 have the symbols x and y naming (or pointing to) the same object in the same memory location. There is only one 2 in that memory location but two names, or pointers, to it.
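You can see this binding behavior directly in a Python session; note that the identical memory location for the two names is guaranteed here only because CPython caches small integers:

    x = 2
    y = 2

    # Both names point at the same cached object in CPython
    print(x is y)        # True
    print(id(x), id(y))  # identical identifiers (memory addresses in CPython)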
Table 3.10 This is a truth table for two Boolean comparisons: logical “and” and logical “or.” See
Sedgewick et al. (2016) for a more extensive table for Python Boolean comparisons.
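For reference, the truth table these two comparisons follow is:

    A        B        A and B    A or B
    True     True     True       True
    True     False    False      True
    False    True     False      True
    False    False    False      False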
A Boolean statement can be compactly written in mathematical notation using an indicator function. An indicator function returns a 0 or 1 for a Boolean statement, usually for the data or a subset of the data. Suppose you have a list of six values for a variable X: [1, 2, 3, 4, 5, 6] and the Boolean statement "x > 3". In mathematical notation this is written as the indicator function I(x > 3), which equals 1 when the statement is true and 0 otherwise. It returns the list [0, 0, 0, 1, 1, 1] for this example. If you consider a subset of X, say the first three entries, A = [1, 2, 3], then the indicator function is written as I_A(x > 3), which returns [0, 0, 0]. Indicator functions will be used in this book.
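A quick sketch of an indicator function in Python using NumPy:

    import numpy as np

    X = np.array([1, 2, 3, 4, 5, 6])
    indicator = (X > 3).astype(int)   # the Boolean statement returns 0/1 values
    print(indicator)                  # [0 0 0 1 1 1]

    A = X[:3]                         # restrict the indicator to a subset of X
    print((A > 3).astype(int))        # [0 0 0]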
3.6.2 Pandas Query Method
Pandas has a query method for a DataFrame that takes a Boolean expression and returns the subset of the DataFrame where the Boolean expression is True. This makes it very easy to query a DataFrame to create a subset of the data for more specific analysis.
The query method is chained to the DataFrame. The argument is a string with opening and closing single or double quotation marks (they must match). If a string value is used in the Boolean expression, you must enclose it in quotation marks that differ from the outer ones. For example, you could write "x > 'sales'". Notice the single and double quotation marks. You could also define a variable with a value before the query and then use that variable in the query. In this case, you must use an @ before the variable so that the Python interpreter knows the variable is not in the DataFrame. For example, you could define Z = 3 and then use "x > @Z".
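A minimal sketch of the query method with a hypothetical DataFrame:

    import pandas as pd

    df = pd.DataFrame({'x': [1, 5, 2, 8],
                       'region': ['East', 'West', 'East', 'West']})

    # A string value inside the expression needs its own (different) quotation marks
    east = df.query("region == 'East'")

    # An outside variable is referenced with @ so Pandas does not look for it as a column
    Z = 3
    above = df.query("x > @Z")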
Data Visualization: The Basics
Data visualization issues associated with the graphics used in a presentation, not in the analysis stage of developing the material leading to the presentation, are discussed in many books. I focus on data visualization from a practical analytical point of view in this chapter, not on presentation. This does not mean, however, that these graphs cannot be used in a presentation; they certainly can be. The graphs I describe are meant to aid and enhance the extraction of latent Rich Information from data.
An effective graph is one that conveys key information quickly and clearly, helping viewers to
understand the message at a glance. Ineffective graphs, on the other hand, obscure this message
due to poor design or the inclusion of unnecessary elements, often referred to as "chartjunk."
Chartjunk includes any extra graphical elements that clutter the chart and distract from the main
message. The goal of data visualization is to make the information buried in the data visible, so
adding unnecessary elements only complicates this task.
The Gestalt Principles of Visual Design provide guidance for creating effective graphs by
explaining how humans perceive and interpret visual elements. These principles help to ensure
that graphical representations are clear and easily interpreted by organizing the elements of the
graph in a way that aligns with natural human perception.
By using these principles, graphs can be designed to present information in a way that aligns
with how humans naturally process visual input, making them more effective at communicating
the rich information hidden within the data. This reduces cognitive load and enhances
understanding.
Table 4.4 This is a categorization of Seaborn's plotting families, their plotClass, and the kind options. See the Seaborn documentation at https://seaborn.pydata.org/ for details.
Seaborn is a data visualization package with a wide range of capabilities. A nice feature of Seaborn is that it reads Pandas DataFrames. A general Seaborn function call is sns.plotClass(x = Xvar, y = Yvar, data = data, kind = kind, options), where sns is the conventional Seaborn alias. The plotClass is a class or family of plots which I list in Table 4.4. The Xvar and Yvar parameters are the x-variable and y-variable, respectively, which vary by plotting class; options include hue, size, and style. The kind options are the same as those in Table 4.3.
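As a concrete sketch of this calling pattern, using the relational plot family and hypothetical column names in a DataFrame df:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # relplot is one plotClass (the relational family); kind selects a scatter or line plot
    # df is assumed to be a Pandas DataFrame with these (hypothetical) columns
    sns.relplot(x='amount', y='discount', data=df, kind='scatter', hue='region')
    plt.show()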
You most likely think of graphs when you think of visualization, but this thinking is simplistic since there are many types of graphs for different types of data. A brief listing includes those familiar to most analysts: bar charts, pie charts, histograms, and scatterplots. The list is actually much longer. Not all of these can be applied to any and all types of data; they have their individual uses and limitations that reveal different aspects of the latent information.
Graphs have a different symbol system, consisting of lines, bars, colors, dots, and other marks that tell a story and convey a message, no different than text. Five guiding features you should look for:
1) Distributions: The distribution of your data is their arrangement or shape on a graph axis. Distributions could be symmetric, skewed (left or right), or uni/multi-modal.
2) Relationships: These are not just associations (i.e., correlations), but more cause-and-effect behavior between or among two or more variables. Products purchased and distribution channels is one example. Customer satisfaction and future purchase intent is another. These relationships could be spatial, temporal, or both.
3) Patterns: These would be groupings of data points such as clusters or segments.
4) Trends: These are developments or changes over time (e.g., a same-store sales tracking study or attrition rates for R&D personnel in an HR study). These would be mostly temporal.
5) Anomalies: These are points that differ greatly from the bulk of the data. But not all outliers are created equal: some are innocuous while others are pernicious and must be inspected for their source and effects.
The POI DataFrame is in panel format, meaning that it has a combination of spatial and temporal data. The spatial aspect is baking facilities and/or their geographic locations, either state or marketing region.
Examining distributions of continuous spatial data is common. The boxplot, sometimes called a box-and-whisker plot, is a powerful tool for continuous data; see Tukey (1977).
Another, more classic way to visualize a distribution is to use a histogram. In particular, a histogram is a tool for estimating the probability density function of the values of a random variable, X. Let f(x) be the density function. As noted by Silverman (1986), the histogram is the oldest and most widely used density estimator.
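A brief sketch of both displays, assuming a hypothetical DataFrame poi_df with a continuous POI column and a categorical region column:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Boxplot of POI by marketing region
    sns.boxplot(x='region', y='POI', data=poi_df)
    plt.show()

    # Histogram of the POI values as a rough estimate of the density f(x)
    sns.histplot(poi_df['POI'], kde=True)
    plt.show()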
Discrete data have definite values or levels. They can be numeric or categorical. As numeric, they are whole numbers without decimal values; counts are a good example. As categorical, they could be numeric values arbitrarily assigned to levels of a concept.
Data are not always strictly categorical or continuous; you could have a mix of both. In this case, graph types can be combined to highlight one variable conditioned on another. A second categorical variable can be added to form a facet, trellis, lattice, or panel (all interchangeable terms) plot.
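A sketch of a facet plot in Seaborn, conditioning one display on a second categorical variable; all names are hypothetical and reuse the poi_df DataFrame from above:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # One boxplot panel per store type: catplot builds the facet (trellis) layout
    sns.catplot(x='region', y='POI', col='storeType', data=poi_df, kind='box')
    plt.show()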
Some of the displays I previously discussed, such as boxplots, can be used with temporal data. Otherwise, temporal data have their own problems that require variations on some displays.
Data could also be temporal, also known as time series data. Time series data are a special breed of data with a host of problems that make visualization more challenging. It is not, however, only the visualization that is complicated; the full analysis of time series, from data visualization to data handling to modeling, can become overwhelming.
A line chart is probably the simplest time series graph familiar to most analysts. It is just the series plotted against time.
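A minimal line chart sketch, assuming a hypothetical monthly sales series indexed by date:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical monthly sales series over two years
    dates = pd.date_range('2020-01-01', periods=24, freq='MS')
    sales = pd.Series(range(100, 124), index=dates)

    sales.plot()                 # Pandas plots the series against its datetime index
    plt.ylabel('Sales')
    plt.show()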
It may be possible to disaggregate a time series into constituent periods to reveal underlying patterns hidden by the more aggregate presentation. For example, U.S. annual real GDP growth rates from 1960 to 2016 can be divided into six decades to show a cyclical pattern. A boxplot for each decade could then be created and all six boxplots plotted next to each other so that the decades play the role of a categorical variable.
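A sketch of this decade-by-decade view, assuming a hypothetical DataFrame gdp with columns 'year' and 'growth':

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Derive a decade label from the year, then let it play the role of a categorical variable
    gdp['decade'] = (gdp['year'] // 10) * 10

    sns.boxplot(x='decade', y='growth', data=gdp)
    plt.show()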
Time series data have unique complications, which account for why there is so much active academic research in this area. The visualization of time series reflects this work and must address several problems unique to temporal data.
Reference:
Walter R. Paczkowski. 2021. Business Analytics: Data Science for Business Problems. Switzerland: Springer. E-ISBN: 978-3-030-87023-2.