Data Sources and Advanced Data Handling
A Summary
By
Zyana Nabila | 202431001068
2024-2025
Advanced Data Handling: Preprocessing Methods
There are many sources for data, both internal and external to your business, which make the
first question difficult to answer. Internal data should be relatively easy to obtain, although
internal political issues may interfere. In particular, different internal organizations may maintain
data sets you might need, but jealously safeguard them to maintain power. This is sometimes
referred to as creating data silos. See Wilder-James (2016) for a discussion about the need to break down silo barriers. External data would not have this problem, although they may nonetheless be difficult to locate.
Once you have your data, you must then begin to understand its structure. Structure is an
overlooked topic in all forms of data analysis so I will spend some time on it. Knowing structure
is part of “looking” at your data, a time-honored recommendation. This is usually stated,
however, in conjunction with discussions about graphing data to visualize distributions,
relationships, trends, patterns, and anomalies.
This section outlines two key sources of data, primary and secondary, and highlights the importance of understanding data sources to minimize errors that could lead to outliers in analysis.
Primary data refers to data collected specifically for a particular purpose, giving full control over the collection process, including methods, criteria, and definitions. This allows for careful management of data quality and relevance. Common primary data sources include surveys, focus groups, and experiments run for the study at hand.
Secondary data is pre-existing data collected for other purposes, which you reuse for analysis. While it is often accessible and useful, it limits control over the collection process and may contain measurement errors or irrelevant aspects. Examples include economic data from government agencies or external databases.
Understanding whether data is endogenous (internal, from within the business) or exogenous (external, driven by factors beyond the business's control) is also crucial for analysis. Exogenous data are further divided into universal (affecting all industries, like economic cycles) and local (specific to an industry or business, such as competitor actions).
The key questions to assess data quality include understanding its origin, strengths, weaknesses,
collection methods, and the reliability of the organization or person gathering the data.
There are two data domains: spatial (i.e., cross-sectional) and temporal (i.e., time series). Cross-sectional data are data on different units measured at one point in time. Measurements on sales by countries, states, industries, and households in a year are examples. The label "spatial" should not be narrowly interpreted as space or geography; it is anything without a time aspect. Time series, or temporal domain data, are data on one unit tracked through time. Monthly same-store sales for a 5-year period is an example.
Data also come at two levels of aggregation:
1. disaggregate, and
2. aggregate.
Disaggregate data are the most fundamental level, although the boundary between disaggregate and aggregate is blurry. Data on consumer purchases collected by point-of-sale (POS) scanners are an example of disaggregate data, while sales by stores and marketing regions are examples of aggregate data.
The continuity of data refers to their smoothness. There are two types:
Continuous data have an infinite number of possible values with a decimal representation, although typically only a finite number of decimal places are used or are relevant. Such data are floating-point numbers in Python.
Discrete data have only a small, finite number of possible values represented by integers. Integers are referred to as ints in Python. They are often used for classification and so are categorical.
This leads to the four measurement scales proposed by Stevens (1946), which are widely used by data analysts despite some controversy. These scales (nominal, ordinal, interval, and ratio) are foundational for data analysis, each differing in complexity and the types of statistical operations allowed.
1. Nominal Scale: The simplest scale, used for labeling or categorizing data without any order. For example, a "Buy/Don't Buy" survey question is nominal. Statistical operations are limited to counts, proportions, and mode. The numeric encoding of categories is arbitrary and has no inherent meaning.
2. Ordinal Scale: Data on this scale have a meaningful order, but the distance between values is not defined. Examples include Likert scales (e.g., levels of purchase intent) or job hierarchy (e.g., Entry, Mid, Executive). Ordinal data allow for the calculation of counts, proportions, mode, median, and percentiles, but not means or standard deviations, since the intervals between ranks are not meaningful. Despite this, means are sometimes controversially calculated for Likert-scale data.
3. Interval Scale: This scale includes ordered data with equal intervals between values, but lacks a true zero point. For instance, temperature in Fahrenheit is interval data. You can measure the difference between two temperatures, but a ratio (e.g., "twice as hot") is meaningless due to the arbitrary zero point. Interval data support more complex statistics, including counts, proportions, mode, median, mean, and standard deviation.
4. Ratio Scale: The most sophisticated scale, with ordered values, equal intervals, and a meaningful zero point. Economic and business data often fall into this category (e.g., sales, income). With ratio data, both differences and ratios are meaningful (e.g., twice as much). All statistical measures (counts, proportions, mode, median, mean, and standard deviation) can be applied.
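As an illustration of how nominal and ordinal scales can be represented in Pandas, here is a minimal sketch; the column names and category levels are hypothetical:

    import pandas as pd

    df = pd.DataFrame({
        'buy': ['Buy', "Don't Buy", 'Buy'],      # nominal: labels only, no order
        'intent': ['Low', 'High', 'Medium']      # ordinal: ordered labels
    })

    # Nominal: an unordered categorical; only counts, proportions, and mode make sense
    df['buy'] = pd.Categorical(df['buy'])

    # Ordinal: an ordered categorical; medians and percentiles are meaningful, means are not
    df['intent'] = pd.Categorical(df['intent'],
                                  categories=['Low', 'Medium', 'High'],
                                  ordered=True)

    print(df['buy'].value_counts())
    print(df['intent'].min())   # 'Low', because an ordering has been defined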
Data organization involves understanding how data are stored and structured, both externally by IT departments and internally by the analyst for specific analysis purposes.
1. External Data Structure: Managed by IT, this involves organizing data for efficient storage and retrieval through processes like Extract-Transform-Load (ETL). This structure is largely fixed and designed for general use, not tailored to specific analysis needs.
2. Internal Data Structure: Once you access the external data, you reorganize it to fit your specific analysis. This structure is flexible and changes based on the problem you are solving. It helps define what analyses are possible based on how you arrange the data.
Metadata can be anything that helps you understand and document your data. This could include:
• means of creation;
and so on. I will restrict the metadata in data dictionaries used in this book to include only:
• variable name;
• possible values;
• source; and
• mnemonic.
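A data dictionary along these lines can be kept directly in Python. This is a minimal sketch with hypothetical variables and entries:

    # A simple data dictionary: one entry per variable, restricted to the items above
    data_dictionary = {
        'sales': {
            'possible_values': 'non-negative floats, in USD',
            'source': 'Orders database',        # hypothetical source
            'mnemonic': 'SLS'
        },
        'region': {
            'possible_values': ['Midwest', 'Northeast', 'South', 'West'],
            'source': 'Marketing database',     # hypothetical source
            'mnemonic': 'RGN'
        }
    }

    print(data_dictionary['sales']['mnemonic'])   # SLS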
BASIC DATA HANDLING
Basic data handling in real-world business data analytics (BDA) is much more complex than the simple, clean datasets typically seen in textbooks on statistics and econometrics. While textbooks focus on tidy, small datasets with a few variables, BDA often involves massive, messy data spread across multiple sources. Preprocessing this data is a critical step before any analysis can begin.
1. Large Datasets: In practice, data sets are often far too large to be easily imported or processed in a computer's memory. Unlike simple, flat data files with two dimensions (rows and columns), BDA datasets can be high-dimensional, requiring complex handling.
2. Multiple Data Sources: Data often need to be merged or joined from different sources to form a single, usable dataset for analysis. Handling these complexities is a standard task in BDA, but it is not typically covered in introductory courses.
3. Reshaping Data: The form in which the data are initially organized may not be suitable for analysis. For example, wide-form data (many variables, few rows) may need to be converted to long-form (many rows, few variables), depending on the statistical or machine learning techniques being applied.
4. Missing Values and Inconsistent Scales: Data often contain missing values, which can affect analysis. Additionally, variables on different scales can bias results, as some variables may dominate due to their measurement units. Preprocessing involves addressing missing values, normalizing scales, and sometimes transforming variables to adjust their distribution or variance.
Importing Data: Efficiently importing large datasets into tools like Pandas, especially when the data exceeds your computer's memory capacity.
Reshaping Data: Converting data from wide-form to long-form or vice versa, depending on the requirements of the analysis.
These tasks are foundational to managing the complexities of data and ensure the data are properly structured for further analysis.
Business Context:
The company offers 43 products in six product lines (e.g., Living Room, Kitchen) with
discounts based on competitive factors, order size, and retailer pickup.
Data Structure:
1. Orders Database: Contains order number, customer ID, timestamps, order amount, list
price, discounts, return flags, and material costs.
2. Marketing Database: Features customer ID, loyalty program membership, buyer rating,
and satisfaction scores.
Objectives: The Living Rooms product line manager seeks to:
1. Identify sales patterns by marketing region, customer loyalty, and buyer rating.
Business Context:
A national bread-products company supplies fresh bread items to various store fronts,
requiring timely deliveries from local bakeries.
Orders are placed electronically, and fulfillment is monitored through a specialized tablet
app by customers.
Data Structure:
Each order includes measures for completeness, damage-free delivery, timeliness, and
accurate documentation, recorded as binary responses (Yes/No).
Additional data for each bakery (facility ID, marketing region, and location type) is also
maintained.
Analysis Objective: You are tasked with developing a Perfect Order Index (POI) as a fulfillment measure. The POI is calculated as the product of the percentages of affirmative responses on the four fulfillment measures: complete, damage-free, on-time, and correctly documented orders.
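As a rough sketch of this calculation in Pandas, with hypothetical column names standing in for the four Yes/No measures:

    import pandas as pd

    # Hypothetical order-level data: one row per order, Yes/No for each fulfillment measure
    orders = pd.DataFrame({
        'complete':    ['Yes', 'Yes', 'No',  'Yes'],
        'damage_free': ['Yes', 'Yes', 'Yes', 'Yes'],
        'on_time':     ['Yes', 'No',  'Yes', 'Yes'],
        'documented':  ['Yes', 'Yes', 'Yes', 'No']
    })

    # Proportion of affirmative responses for each measure
    proportions = (orders == 'Yes').mean()

    # The POI is the product of the four proportions
    poi = proportions.prod()
    print(proportions)
    print(f'POI = {poi:.3f}')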
An obvious first step in any analytical process, aside from locating the right data, is to import your data into an analytical framework. This is actually more complicated than you might imagine. Some issues to address are the current data format, the size of the dataset to import, and the nature of the data once imported.
The Comma Separated Value (CSV) and Excel formats are probably the most commonly used
formats in BDA. CSV is a simple format with each value in a record separated by a comma and
text strings often, but not always, denoted by quotation marks. This format is supported by
almost all software packages because of its simplicity. Excel is also very popular because many
analysts mistakenly believe that Excel, or any spreadsheet package, is sufficient for data
analytical work. This is far from true. Nonetheless, they store their data in Excel workbooks and
worksheets.
JavaScript Object Notation (JSON) is another popular format that allows you to transfer data, software code, and more from one installation to another. Jupyter notebooks, for example, are JSON files.
The data sources for the SQL From verb are SQL-ready data tables and the result of the Select
verb is a data table satisfying the query. A powerful and useful feature of SQL is the use of a
returned table in the From clause so that, in effect, you could embed one query in another.
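To make the nesting idea concrete, here is a minimal sketch using a hypothetical SQLite database and table; the file, table, and column names are illustrative only:

    import sqlite3
    import pandas as pd

    con = sqlite3.connect('orders.db')   # hypothetical database file

    # The inner SELECT returns a table that the outer FROM clause treats as its data source
    sql = """
        SELECT region, SUM(amount) AS total_sales
        FROM (SELECT region, amount FROM orders WHERE year = 2024) AS sub
        GROUP BY region
    """
    df = pd.read_sql(sql, con)   # the query result arrives as a Pandas DataFrame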
The basic import or read command consists of four parts:
1. the package where the function is located: Pandas, identified by its alias pd;
2. the import function itself, such as read_csv;
3. the path to the directory holding the data file; and
4. the name of the data file.
The package alias must be "chained" to the read_csv import function, otherwise the Python interpreter will not know where to find the function. Both the path and file name can be separately defined for convenience and cleaner coding; I consider this a Best Practice. You must always specify the file path so Pandas can find the file, unless the data file is in the same directory as your Jupyter notebook. Then a path is unnecessary since Pandas always begins its search in the same directory as the notebook.
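A minimal sketch of this pattern, with a hypothetical path and file name:

    import pandas as pd

    path = './data/'          # hypothetical directory holding the data file
    file = 'orders.csv'       # hypothetical file name

    # The alias pd is chained to read_csv so the interpreter knows where the function lives
    df = pd.read_csv(path + file)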
The data files you need for a BDA problem are typically large, perhaps larger than what is
practical for you to import at once. In particular, if you process a large file after importing it,
perhaps to create new variables or selectively keep specific columns, then it is very inefficient to
discard the majority of the imported data as unneeded. Too much time and computer resources
are used to justify the relatively smaller final DataFrame needed for an analysis. This
inefficiency is increased if there is a processing error (e.g., transformations are incorrectly
applied, calculations are incorrect, or the wrong variables are saved) so it all has to be redone.
Importing chunks of data, processing each separately, and then concatenating them into one final, albeit smaller and more compact, DataFrame is a better way to proceed. For example, one small chunk of data could be imported as a test chunk to check whether transformations and content are correct. Then a large number of chunks could be read, processed, and concatenated.
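Here is one way this chunked workflow might look in Pandas; the file name, chunk size, and column selection are hypothetical:

    import pandas as pd

    pieces = []
    # Read the (hypothetical) large file 100,000 rows at a time
    for chunk in pd.read_csv('orders.csv', chunksize=100_000):
        # Process each chunk: keep only the needed columns and create a new variable
        chunk = chunk[['order_id', 'amount', 'discount']]
        chunk['net_amount'] = chunk['amount'] * (1 - chunk['discount'])
        pieces.append(chunk)

    # Concatenate the processed chunks into one smaller, final DataFrame
    df = pd.concat(pieces, ignore_index=True)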
Once you have imported your data, you should perform five checks of them before beginning
your analytical work:
Check #1 Display the first few records of your DataFrame. Ask: “Do I see what I expect to
see?”
One way to “look” at your data is to determine if they are in the format you expect. For example,
if you expect floating point numbers but you see only integers, then something is wrong. Also, if
you see character strings (e.g., the words “True” and “False”) when you expect integers (e.g., 1
for True and 0 for False), then you know you will have to do extra data processing. Similarly, if you see commas as thousands separators in what should be floating-point numbers, then you have a problem because the numbers will be treated and stored as strings, since a comma is a character.
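A quick sketch of this check, assuming df is the DataFrame imported earlier:

    # Check #1: display the first few records. Do I see what I expect to see?
    print(df.head())      # first five rows by default
    print(df.head(10))    # or any number of rows you choose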
Check #2 Check the shape of your DataFrame. Ask: “Do I see all I expect to see?”
The shape of a DataFrame is a tuple (actually, a 2-tuple) whose elements are the number of rows
and the number of columns, in that order. A tuple is an immutable list which means it cannot be
modified. The shape tuple is an attribute of the DataFrame so it is an automatic characteristic of a
DataFrame that you can always access. To display the shape, use df.shape. Although a tuple is
immutable, this does not mean you cannot access its elements for separate processing. To access
an element, use the square brackets, [ ], with the element’s index inside. For example, to access
the number of rows, use df.shape[ 0 ]. Remember, Python is zero-based so indexing starts with
zero for the first element.
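A similar sketch for the shape check:

    # Check #2: shape is a (rows, columns) tuple attribute of the DataFrame
    print(df.shape)
    n_rows = df.shape[0]   # number of rows; indexing is zero-based
    n_cols = df.shape[1]   # number of columns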
Check #3 Check the column names in your DataFrame. Ask: “Do I have the correct and
cleansed variable names I need?”
Checking column (i.e., variable) names is a grossly overlooked step in the first stages of data analysis. A name will certainly not impact your analysis in any way, but failure to check names could impact the time you spend looking for errors rather than doing your analysis. Column names, which are also attributes of a DataFrame, could have stray characters and leading and trailing white spaces. White spaces are especially pernicious. Suppose a variable's name is actually ' sales', with a leading white space. When you display the head of the DataFrame, you will appear to see 'sales' without the white space, but the white space is really there. You will naturally try to use 'sales' (notice there is no white space) in a future command, say a regression command. The Python interpreter will immediately display an error message that 'sales' is not found, and the reason is simply that you typed 'sales' (no white space) rather than ' sales' (leading white space). You will needlessly spend time trying to find out why. Checking column names up front will save time later.
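A sketch of this check, along with one common way to cleanse the names:

    # Check #3: inspect the column names for stray characters and white space
    print(df.columns.tolist())

    # Strip leading and trailing white space from every column name
    df.columns = df.columns.str.strip()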
Check #4 Check for missing data in your DataFrame. Ask: “Do I have complete data?”
Missing values are a headache in statistics, econometrics, and machine learning. You cannot estimate a parameter if the data for estimation are missing. Some estimation and analysis methods automatically check for missing values in the DataFrame and then delete records with at least one missing value. If the DataFrame is very large, this automatic deletion is not worrisome. If the DataFrame is small, then it is worrisome because the degrees-of-freedom for hypothesis testing could be reduced enough to jeopardize the validity of a test. It is also troublesome if you are working with time series because the deletion will cause a break in the continuity of the series. Many time series functions require this continuity.
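A sketch of this check:

    # Check #4: count the missing values in each column
    print(df.isna().sum())

    # Or simply flag whether any column has missing values at all
    print(df.isna().any())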
Check #5 Check the data types of your variables. Ask: “Do I have the correct data types?”
The key data types are strings, integers, floats, and datetimes; there are counterparts for most of these in Python and NumPy.
Strings are text enclosed in either single or double quotation marks. Numbers could be interpreted and handled as strings if they are enclosed in quotation marks. For example, 3.14159 and "3.14159" are two different data types: the first is a number per se while the second is a string. An integer is a number without a decimal; it is a whole number that could be positive or negative. A float is a number with a decimal point that can "float" among the digits depending on the values to represent. Integers and floats are treated differently, and operations on them could give surprising and unexpected results. Dates and times, combined and referred to as datetime, are a complex object treated and stored differently than floats, integers, and strings. Their use in calculations to accurately reflect dates, times, periods, time between periods, time zones, Daylight Saving and Standard Time, and calendars in general is itself a complex topic.
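A sketch of this final check, with two typical corrections; the column names are hypothetical:

    # Check #5: inspect the data type of each column
    print(df.dtypes)

    # Typical corrections: numbers stored as strings and dates stored as strings
    df['amount'] = pd.to_numeric(df['amount'].astype(str).str.replace(',', ''))
    df['order_date'] = pd.to_datetime(df['order_date'])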
Notice that I do not have data visualization on my list. You might suppose that it should be part
of Check #1: Look at your data.
3.3 Merging or Joining DataFrames
You will often have data in two (or more) DataFrames, so you will need to merge or join them to get one complete DataFrame. As an example, a second DataFrame for the baking facilities has information on each facility: the marketing region where the facility is located (Midwest, Northeast, South, and West), the state in that region, a two-character state code, the customer location served by that facility (urban or rural), and the type of store served (convenience, grocery, or restaurant). This DataFrame must be merged with the POI DataFrame to have a complete picture of the baking facility.
There are many types of joins, but I will only describe one: an inner join, which is the default method in Pandas because it is the most common. The inner join operates by using each value of the primary key in the DataFrame on the left to find a matching value in the primary key on the right. If a match is found, then the data from the left and right DataFrames for that matching key are put into an output DataFrame. If a match is not found, then the left primary key is dropped, nothing is put into the output DataFrame, and the next left primary key is used. This is repeated for each primary key on the left.
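A minimal sketch of an inner join in Pandas, assuming a hypothetical facility ID column named FID as the primary key in both DataFrames:

    import pandas as pd

    # Hypothetical POI values by facility
    poi_df = pd.DataFrame({'FID': [1, 2, 3], 'POI': [0.92, 0.85, 0.78]})

    # Hypothetical facility attributes
    facilities_df = pd.DataFrame({'FID': [1, 2, 4],
                                  'region': ['Midwest', 'South', 'West'],
                                  'storeType': ['grocery', 'convenience', 'restaurant']})

    # Inner join on the primary key: only FIDs 1 and 2 appear in both, so only they survive
    merged = pd.merge(poi_df, facilities_df, on='FID', how='inner')
    print(merged)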
A DataFrame's shape attribute provides information on the number of rows and columns. Sometimes this shape is inappropriate for a specific form of analysis and so it must be changed: the DataFrame must be reshaped to make its shape more appropriate. Changing a DataFrame from wide- to long-form involves stacking the rows vertically on top of each other (basically transposing each row) into a new column in a new DataFrame and using the original DataFrame's column names as values in yet another new column alongside the transposed rows. Those column names will repeat for each transposed row. As an example, a DataFrame could have 5 years of monthly sales data with one column for the year and a separate column for each month. There are then 13 columns (one for year and 12 for months) and five rows for the 5 years. The shape is the tuple (5, 13). This is wide-form. A simple regression analysis for sales as a function of a year and month effect requires a different data arrangement: one column for sales, one for year, and one for month. This is long-form. So, the wide-form must be reshaped to long-form by stacking.
There are two Pandas methods for reshaping a DataFrame from wide- to long-form: stack and melt. They basically do the same thing, but the melt method is slightly more versatile in providing a name for the new columns of transposed rows. The stack method is better for operating on MultiIndexed DataFrames. The reverse reshaping operation of going from long- to wide-form is unstacking.
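A minimal sketch of melting the wide-form sales example above into long-form; the column names and values are hypothetical, and only three months are shown for brevity:

    import pandas as pd

    # Wide-form: one row per year, one column per month
    wide = pd.DataFrame({'year': [2020, 2021],
                         'Jan': [100, 110],
                         'Feb': [95, 105],
                         'Mar': [120, 125]})

    # Long-form: one row per (year, month) pair with a single sales column
    long = wide.melt(id_vars='year', var_name='month', value_name='sales')
    print(long)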
Sorting is the next most common and frequently used operation on a DataFrame. This involves putting the values in a DataFrame in a descending or ascending order based on the values in one or more variables.
In some instances, sorting the DataFrame to identify the largest customers in terms of purchases, those who purchased most recently, or the poorest-performing facilities may not have to be done. A query of the DataFrame may be all that is needed in these cases; I discuss queries in the next section. You could also use the nlargest or nsmallest methods. Each takes an argument for the number of records to return and a list of the columns to search on. For example, df.nlargest(10, 'X') returns the 10 largest values from the X column in the DataFrame.
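A short sketch of both approaches; the column name is hypothetical:

    # Sort all rows by purchase amount, largest first
    df_sorted = df.sort_values(by='amount', ascending=False)

    # Or skip the full sort and just pull the 10 largest and 10 smallest purchases
    top10 = df.nlargest(10, 'amount')
    bottom10 = df.nsmallest(10, 'amount')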
A DataFrame, whether small or large, contains a lot of latent information, as I discussed before. One way you can get some information out of it is by literally asking it a question, that is, by querying the DataFrame.
In everyday arithmetic, the equal sign has a meaning we all accept: the term or value on the left is the same as that on the right, just differently expressed. If the two sides are the same, then the expression is taken to be true; otherwise, it is false. So, the expression 2 + 2 = 4 is true while 2 + 2 = 5 is false. In Python, and many other programming languages, the equal sign has a different interpretation. It assigns a symbol to an object, which could be numeric or string; it does not signify equality as in everyday arithmetic. The assignment names the object, and the name is said to be bound to the object. The object is on the right and the name on the left. So, the expression x = 2 does not say that x is the same as 2 but that x is the name for, and is bound to, the value 2. As noted by Sedgewick et al. (2016, p. 17), "In short, the meaning of = in a program is decidedly not the same as in a mathematical equation."
The object, 2 in this case, is stored once in memory, but there could be several names bound to that memory location. So x = 2 and y = 2 have the symbols x and y naming (or pointing to) the same object in the same memory location. There is only one 2 in that memory location but two names, or pointers, to it.
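You can see this binding behavior directly in a Python session; note that the identical memory location for the two names is guaranteed here only because CPython caches small integers:

    x = 2
    y = 2

    # Both names point at the same cached object in CPython
    print(x is y)        # True
    print(id(x), id(y))  # identical identifiers (memory addresses in CPython)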
Table 3.10 This is a truth table for two Boolean comparisons: logical “and” and logical “or.” See
Sedgewick et al. (2016) for a more extensive table for Python Boolean comparisons.
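For reference, the truth table these two comparisons follow is:

    A        B        A and B    A or B
    True     True     True       True
    True     False    False      True
    False    True     False      True
    False    False    False      False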
A Boolean statement can be compactly written in mathematical notation using an indicator function. An indicator function returns a 0 or 1 for a Boolean statement, usually for the data or a subset of the data. Suppose you have a list of six values for a variable X: [1, 2, 3, 4, 5, 6] and the Boolean statement "x > 3". In mathematical notation this is written as the indicator function I(x > 3), which equals 1 when the statement is true and 0 otherwise. It returns the list [0, 0, 0, 1, 1, 1] for this example. If you consider a subset of X, say the first three entries, A = [1, 2, 3], then the indicator function is written as I_A(x > 3), which returns [0, 0, 0]. Indicator functions will be used in this book.
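A quick sketch of an indicator function in Python using NumPy:

    import numpy as np

    X = np.array([1, 2, 3, 4, 5, 6])
    indicator = (X > 3).astype(int)   # the Boolean statement returns 0/1 values
    print(indicator)                  # [0 0 0 1 1 1]

    A = X[:3]                         # restrict the indicator to a subset of X
    print((A > 3).astype(int))        # [0 0 0]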
3.6.2 Pandas Query Method
Pandas has a query method for a DataFrame that takes a Boolean expression and returns the subset of the DataFrame where the Boolean expression is True. This makes it very easy to query a DataFrame to create a subset of the data for more specific analysis.
The query method is chained to the DataFrame. The argument is a string with opening and closing single or double quotation marks (they must match). If a string value is used in the Boolean expression, you must enclose it in quotation marks that differ from the outer ones. For example, you could write "x > 'sales'". Notice the single and double quotation marks. You could also define a variable with a value before the query and then use that variable in the query. In this case, you must use an @ before the variable so that the Python interpreter knows the variable is not in the DataFrame. For example, you could define Z = 3 and then use "x > @Z".
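A minimal sketch of the query method with a hypothetical DataFrame:

    import pandas as pd

    df = pd.DataFrame({'x': [1, 5, 2, 8],
                       'region': ['East', 'West', 'East', 'West']})

    # A string value inside the expression needs its own (different) quotation marks
    east = df.query("region == 'East'")

    # An outside variable is referenced with @ so Pandas does not look for it as a column
    Z = 3
    above = df.query("x > @Z")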
Data Visualization: The Basics
Data visualization issues associated with the graphics used in a presentation, not in the analysis stage of developing the material leading to the presentation, are discussed in many books. I focus on data visualization from a practical analytical point of view in this chapter, not on presentation. This does not mean, however, that these graphs cannot be used in a presentation; they certainly can be. The graphs I describe are meant to aid and enhance the extraction of latent Rich Information from data.
An effective graph is one that conveys key information quickly and clearly, helping viewers to
understand the message at a glance. Ineffective graphs, on the other hand, obscure this message
due to poor design or the inclusion of unnecessary elements, often referred to as "chartjunk."
Chartjunk includes any extra graphical elements that clutter the chart and distract from the main
message. The goal of data visualization is to make the information buried in the data visible, so
adding unnecessary elements only complicates this task.
The Gestalt Principles of Visual Design provide guidance for creating effective graphs by
explaining how humans perceive and interpret visual elements. These principles help to ensure
that graphical representations are clear and easily interpreted by organizing the elements of the
graph in a way that aligns with natural human perception.
By using these principles, graphs can be designed to present information in a way that aligns
with how humans naturally process visual input, making them more effective at communicating
the rich information hidden within the data. This reduces cognitive load and enhances
understanding.
Table 4.4 This is a categorization of Seaborn's plotting families, their plotClass, and the kind options. See the Seaborn documentation at https://seaborn.pydata.org/ for details.
Seaborn is a data visualization package with a wide range of capabilities. A nice feature of Seaborn is that it reads Pandas DataFrames. A general Seaborn function call is sns.plotClass(x = Xvar, y = Yvar, data = data, kind = kind, options), where sns is the conventional Seaborn alias. The plotClass is a class or family of plots which I list in Table 4.4. The Xvar and Yvar parameters are the x-variable and y-variable, respectively, which vary by plotting class; options include hue, size, and style. The kind options are the same as those in Table 4.3.
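As a concrete sketch of this calling pattern, using the relational plot family and hypothetical column names in a DataFrame df:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # relplot is one plotClass (the relational family); kind selects a scatter or line plot
    # df is assumed to be a Pandas DataFrame with these (hypothetical) columns
    sns.relplot(x='amount', y='discount', data=df, kind='scatter', hue='region')
    plt.show()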
You most likely think of graphs when you think of visualization, but this thinking is simplistic since there are many types of graphs for different types of data. A brief listing includes those familiar to most analysts: bar charts, pie charts, histograms, and scatterplots. The list is actually much longer. Not all of these can be applied to any and all types of data; they have their individual uses and limitations that reveal different aspects of the latent information.
Graphs have a different symbol system, consisting of lines, bars, colors, dots, and other marks that tell a story and convey a message, no different than text. Five guiding features you should look for:
1) Distributions: The distribution of your data is their arrangement or shape on a graph axis. Distributions could be symmetric, skewed (left or right), or uni/multi-modal.
2) Relationships: These are not just associations (i.e., correlations), but more cause-and-effect behavior between or among two or more variables. Products purchased and distribution channels is one example. Customer satisfaction and future purchase intent is another. These relationships could be spatial, temporal, or both.
3) Patterns: These would be groupings of data points such as clusters or segments.
4) Trends: These are developments or changes over time (e.g., a same-store sales tracking study or attrition rates for R&D personnel in an HR study). These would be mostly temporal.
5) Anomalies: These are points that differ greatly from the bulk of the data. But not all outliers are created equal: some are innocuous while others are pernicious and must be inspected for their source and effects.
The POI DataFrame is in panel format, meaning that it has a combination of spatial and temporal data. The spatial aspect is baking facilities and/or their geographic locations, either state or marketing region.
Examining distributions of continuous spatial data is common. The boxplot, sometimes called a box-and-whisker plot, is a powerful tool for continuous data; see Tukey (1977).
Another, more classic way to visualize a distribution is to use a histogram. In particular, a histogram is a tool for estimating the probability density function of the values of a random variable, X. Let f(x) be the density function. As noted by Silverman (1986), the histogram is the oldest and most widely used density estimator.
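A brief sketch of both displays, assuming a hypothetical DataFrame poi_df with a continuous POI column and a categorical region column:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Boxplot of POI by marketing region
    sns.boxplot(x='region', y='POI', data=poi_df)
    plt.show()

    # Histogram of the POI values as a rough estimate of the density f(x)
    sns.histplot(poi_df['POI'], kde=True)
    plt.show()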
Discrete data have definite values or levels. They can be numeric or categorical. As numeric, they are whole numbers without decimal values; counts are a good example. As categorical, they could be numeric values arbitrarily assigned to levels of a concept.
Data are not always strictly categorical or continuous; you could have a mix of both. In this case, graph types can be combined to highlight one variable conditioned on another. A second categorical variable can be added to form a facet, trellis, lattice, or panel (all interchangeable terms) plot.
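A sketch of a facet plot in Seaborn, conditioning one display on a second categorical variable; all names are hypothetical and reuse the poi_df DataFrame from above:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # One boxplot panel per store type: catplot builds the facet (trellis) layout
    sns.catplot(x='region', y='POI', col='storeType', data=poi_df, kind='box')
    plt.show()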
Some of the displays I previously discussed, such as boxplots, can be used with temporal data. Otherwise, temporal data have their own problems that require variations on some displays.
Data could also be temporal, also known as time series data. Time series data are a special breed of data with a host of problems that make visualization more challenging. It is not, however, only the visualization that is complicated; the full analysis of time series, from data visualization to data handling to modeling, can become overwhelming.
A line chart is probably the simplest time series graph familiar to most analysts. It is just the series plotted against time.
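A minimal line chart sketch, assuming a hypothetical monthly sales series indexed by date:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical monthly sales series over two years
    dates = pd.date_range('2020-01-01', periods=24, freq='MS')
    sales = pd.Series(range(100, 124), index=dates)

    sales.plot()                 # Pandas plots the series against its datetime index
    plt.ylabel('Sales')
    plt.show()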
It may be possible to disaggregate a time series into constituent periods to reveal underlying patterns hidden by the more aggregate presentation. For example, U.S. annual real GDP growth rates from 1960 to 2016 can be divided into six decades to show a cyclical pattern. A boxplot for each decade could then be created and all six boxplots plotted next to each other so that the decades play the role of a categorical variable.
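A sketch of this decade-by-decade view, assuming a hypothetical DataFrame gdp with columns 'year' and 'growth':

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Derive a decade label from the year, then let it play the role of a categorical variable
    gdp['decade'] = (gdp['year'] // 10) * 10

    sns.boxplot(x='decade', y='growth', data=gdp)
    plt.show()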
Time series data have unique complications, which account for why there is so much active academic research in this area. The visualization of time series reflects this work and must address several problems unique to temporal data.
Reference:
Walter R. Paczkowski. 2021. Business Analytics: Data Science for Business Problems. Switzerland: Springer. E-ISBN: 978-3-030-87023-2.