CBDA Domain-II Source Data v0.1
Contents: Case Study, Quiz, Practice Domains
Meaning of Some Terms
Data – collections of any number of related observations. This is also called raw data. Information before it is arranged and analyzed is called raw data; it is raw because it is unprocessed by statistical methods.
For example, the yards produced yesterday by each of the 10 carpet looms in a carpet production company.
Data Set – A collection of data is called a data set, and a single observation is a data point.
A data source is where the data being used to run a report or gain information originates. For a
database management system, the source is the database. For computer programs, the data source may be a spreadsheet,
an XML file, a data sheet, or data hard-coded within the program. Data sources differ depending on the computer
system or program.
Examples include operations, ERP systems, legacy systems, point-of-sale systems, RFID systems, web usage data, external sources,
suppliers, etc.
Data can be structured (e.g., data residing in a database management system (DBMS)) and/or unstructured (e.g.,
text from word-processing documents, emails, social media sites, image, audio, or video files).
Tests for data – Before relying on any interpreted data, we need to test the data by asking the following questions –
1) Where did the data come from?
2) Is the source biased – that is, is it likely to have an interest in supplying data points that will lead to one conclusion over another?
3) Is any evidence missing that might cause us to come to a different conclusion?
4) How many observations do we have? Do they represent all groups that we wish to study?
5) Is the conclusion logical? Have we made any conclusions that the data do not support?
Tasks in “Source Data” Domain
Plan Data Collection – What data is needed, the availability of the data, the need for historical data, determining when and how the data will be collected, and how the data will be validated once collected.
Determine the Data Sets – Performing a review of the data expected from the data sources and determining specifics such as data types, data dimensions, sample size, and relationships between different data elements.
Collect Data – Activities performed to support data professionals with data setup, preparation, and collection; includes profiling of data.
Validate Data – Objective is high-level validation; can include business validation and technical validation. Assessing the quality of the data on the basis of Accuracy, Completeness, Consistency, Uniqueness, and Timeliness.
Select Techniques for Source Data – Acceptance and evaluation criteria, Data Dictionary, Interface Analysis, Survey or Questionnaire, Data Mapping.
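The profiling mentioned under Collect Data can be as simple as summarizing each column of a freshly collected data set. The sketch below is a minimal illustration using pandas; the DataFrame contents and column names (employee_id, department, rating) are hypothetical.

```python
import pandas as pd

# Minimal, hypothetical extract of freshly collected data.
df = pd.DataFrame({
    "employee_id": [101, 102, 102, 104],
    "department": ["Sales", "Sales", "IT", None],
    "rating": [4, 5, 5, 3],
})

# Basic profile: data types, summary statistics, missing values, duplicates.
print(df.dtypes)
print(df.describe(include="all"))
print(df.isna().sum())
print("duplicate rows:", df.duplicated().sum())
```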
Task – Plan Data Collection
Survey – To understand the general characteristics or opinions of a group of people. How: distribute a list of questions to a sample online, in person, or over the phone.
Experiment – To test a causal relationship. How: manipulate variables and measure their effects on others.
Interview/focus group – To gain an in-depth understanding of perceptions or opinions on a topic. How: verbally ask participants open-ended questions in individual interviews or focus group discussions.
Secondary data collection – To analyze data from populations that you can’t access first-hand. How: find existing datasets that have already been collected, from sources such as government agencies or research organizations.
Observation – To understand something in its natural setting. How: measure or survey a sample without trying to affect them.
Ethnography – To study the culture of a community or organization first-hand. How: join and participate in a community and record your observations and reflections.
Task – Determine Data Sets
Volume: The amount of data being produced and the size of the data sets that we need to process determines the
Volume.
Velocity: The speed at which data is generated and the frequency by which the data needs to be collected and
processed determines the Velocity.
Variety: is determined by the variety of data sources, formats, and types needing to be processed.
Veracity: implies the trustworthiness of the data and also represents the uncertainties and inconsistencies existing
in the data. It refers to the ability of “managing the reliability and predictability of inherently imprecise data types”.
Value: indicates the necessity of grounding the time and effort put into any analytics initiative in real, valuable
business goals.
Data discovery is a term used to describe the process of collecting data from various sources by detecting patterns
and outliers with the help of guided advanced analytics and visual navigation of data, thus enabling consolidation of
all business information. Commonly used tools include Looker, Qlik Sense, and Tableau.
An Example on incomplete or biased data
An advertisement claim by a Truck Association –
“75% of everything you use travels by truck”
Missing part – the question of double counting: what did they do when something was carried to your city by rail
and then delivered by truck? How were packages treated when they went by airmail and then by truck?
When the issue of double counting is resolved, it turns out that, although trucks are involved in delivering a relatively
high proportion of what you use, railways and ships still carry more goods for more total kilometers.
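To make the double-counting point concrete, here is a small, entirely made-up illustration: counting every shipment that a truck touches can yield a 75% “share”, even though rail and ship carry far more tonne-kilometres overall.

```python
# Illustrative (made-up) shipments: (modes used, tonne-km by truck, tonne-km by rail/ship).
shipments = [
    (["rail", "truck"], 50, 900),
    (["truck"], 120, 0),
    (["ship", "truck"], 40, 700),
    (["rail"], 0, 800),
]

# "Touched by truck" counts a shipment once per mode it uses -> inflated share.
touched_by_truck = sum(1 for modes, _, _ in shipments if "truck" in modes)
share = touched_by_truck / len(shipments)

truck_tkm = sum(t for _, t, _ in shipments)
other_tkm = sum(o for _, _, o in shipments)

print(f"Share of shipments touched by truck: {share:.0%}")            # 75%
print(f"Tonne-km by truck: {truck_tkm}, by rail/ship: {other_tkm}")   # 210 vs 2400
```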
Visualization Technique use in Determining Data Sets
Visualization provides a unique perspective on the dataset. You
can visualize data in lots of different ways.
Tables are very powerful when you are dealing with a relatively small number of data points – good at showing
one-dimensional outliers but poor at comparing multiple dimensions at the same time (e.g., population per country
over time).
Charts are helpful in displaying data across multiple dimensions.
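As a rough sketch of the point above, the snippet below plots made-up population figures per country over time with matplotlib; a single line chart lets every country and every year be compared at once, which a plain table makes difficult.

```python
import matplotlib.pyplot as plt

# Illustrative (made-up) figures: population in millions per country over time.
years = [2000, 2010, 2020]
population = {
    "Country A": [50, 55, 61],
    "Country B": [120, 131, 140],
    "Country C": [30, 36, 44],
}

# One line per country shows the country and time dimensions together.
for country, values in population.items():
    plt.plot(years, values, marker="o", label=country)

plt.xlabel("Year")
plt.ylabel("Population (millions)")
plt.title("Population per country over time")
plt.legend()
plt.show()
```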
Validate Data
Data validation may be performed by a data analyst, data scientist, or business analysis practitioner with
sufficient skills to use the necessary tools to access data and the underlying competencies to analyze the results.
Consistency – how reliable the data is; implies the value of a data element is the same across sources.
Uniqueness – no duplicates exist.
Timeliness – data is current and not out of date.
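A minimal sketch of how such quality checks might be scripted with pandas is shown below; the table, column names, rating scale, and cutoff date are all hypothetical. In practice these checks would run against the real sources identified in the data collection plan rather than in-memory samples.

```python
import pandas as pd

# Hypothetical extract of collected records.
df = pd.DataFrame({
    "employee_id": [101, 102, 102, 104],
    "department": ["Sales", "Sales", "IT", None],
    "rating": [4, 5, 5, 6],
    "collected_on": pd.to_datetime(
        ["2024-01-05", "2024-01-05", "2024-01-05", "2022-03-01"]),
})

# Completeness: required fields should not be missing.
completeness_issues = int(df["department"].isna().sum())

# Accuracy: ratings must fall within the agreed 1-5 scale.
accuracy_issues = int((~df["rating"].between(1, 5)).sum())

# Uniqueness: no duplicate employee records should exist.
uniqueness_issues = int(df["employee_id"].duplicated().sum())

# Timeliness: records older than the collection window are stale.
timeliness_issues = int((df["collected_on"] < pd.Timestamp("2023-01-01")).sum())

# Consistency: the same data element should hold the same value across sources,
# e.g. department in the survey extract vs. a (hypothetical) HR system.
hr = pd.DataFrame({"employee_id": [101, 102, 104],
                   "department": ["Sales", "IT", "Finance"]})
merged = df.merge(hr, on="employee_id", suffixes=("_survey", "_hr"))
consistency_issues = int((merged["department_survey"] != merged["department_hr"]).sum())

print(completeness_issues, accuracy_issues, uniqueness_issues,
      timeliness_issues, consistency_issues)
```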
Case Study – Data Collection
Let’s say we are researching employee perceptions of their direct managers in a large organization. How do we go
about collecting data?
• Our first aim is to assess whether there are significant differences in perceptions of managers across different
departments and office locations.
• Our second aim is to gather meaningful feedback from employees to explore new ideas for how managers can
improve.
Using our plan data collection task, we determine which type of data we shall focus on – quantitative, qualitative, or
both. We decide to use a mixed-methods approach to collect both quantitative and qualitative data.
We also decide to use the Survey Method.
We operationalize the plan by transforming the conceptual understanding of what we want to study into an operational
definition of what we will actually measure.
Using multiple ratings of a single concept can help us cross-check our data and assess the test validity of our
measures.
We make use of our sampling plan, agreed standardization procedures, and data management plan (all part of our data
collection plan) to actually collect and store the data.
1) We administer a survey with closed- and open-ended questions to a sample of 300 company employees across
different departments and locations.
2) The closed-ended questions ask participants to rate their manager’s leadership skills on scales from 1–5. The data
produced is numerical and can be statistically analyzed for averages and patterns.
3) The open-ended questions ask participants for examples of what the manager is doing well now and what they
can do better in the future. The data produced is qualitative and can be categorized through content analysis
[categorize or “code” words, themes, and concepts within the texts and then analyze the results] for further insights.
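The analysis of both data types could then be scripted. The sketch below uses pandas with hypothetical responses: a group-by for the closed-ended ratings and a crude keyword-based “coding” pass for the open-ended answers (real content analysis would be far more careful about defining and applying codes).

```python
import pandas as pd

# Hypothetical survey responses (departments, locations, and texts are illustrative only).
responses = pd.DataFrame({
    "department": ["Sales", "Sales", "IT", "IT", "Finance"],
    "location": ["Pune", "Mumbai", "Pune", "Pune", "Mumbai"],
    "leadership_rating": [4, 3, 5, 4, 2],
    "open_feedback": [
        "Great communication, but meetings run long",
        "Needs clearer communication of priorities",
        "Very supportive and gives useful feedback",
        "Supportive, could delegate more",
        "Little feedback on my work",
    ],
})

# Quantitative analysis: average rating per department and location.
averages = responses.groupby(["department", "location"])["leadership_rating"].mean()
print(averages)

# Qualitative analysis: count occurrences of pre-defined theme keywords.
themes = {"communication": "communicat", "support": "support", "feedback": "feedback"}
for theme, keyword in themes.items():
    count = responses["open_feedback"].str.lower().str.contains(keyword).sum()
    print(theme, count)
```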
Quiz
Q.1 The degree of potential harm caused by faulty data collection depends on the nature of the investigation and
whether it is used to support public policy.
a) False
b) True
Q.1 Answer – The correct answer is True. Faulty data collection compromises the validity of the results, regardless of
the analytic procedure used. You also risk potentially serious implications (economic and public safety) when such an
initiative informs public policy.
Q.2 Quality control identifies the actions necessary to correct faulty data collection practices.
Q.2 Answer – (B) The correct answer is that it minimizes future occurrence.
Q.3 __________ refers to activities that take place before data collection begins.
Q.3 Answer – (C) Quality assurance refers to activities that take place before data collection begins.
Q.4
(a) compromises
(c) decreases
(d) increases
Q.4 Answer – (C) The correct answer is to decrease the likelihood of errors occurring.
Q.5 Answer – (D) The correct answer is to detect the presence of errors.
Business Data Analytics Domains - Relationships