Introduction to Data Science Module 2

The document provides an introduction to data science, focusing on data collection and storage methods, including primary and secondary data collection techniques. It emphasizes the importance of data quality, accuracy, and the implications of improper data collection, while also discussing the characteristics and challenges of big data. Additionally, it outlines the significance of data management and the data pipeline process for effective analysis and decision-making.

Uploaded by

gabrielogondiek2017

INTRODUCTION TO DATA SCIENCE
Instructor: Abubakar Yussuf
DATA COLLECTION AND STORAGE
What is data collection?
• Data collection is the process of gathering data for use in
business decision-making, strategic planning, research and
other purposes.
• Effective data collection provides the information that's
needed to answer questions, analyze business performance or
other outcomes, and predict future trends, actions and
scenarios.
Consequences of improperly collected data include:
• Inability to answer research questions accurately
• Inability to repeat and validate the study
• Distorted findings resulting in wasted resources
• Misleading other researchers to pursue fruitless avenues of
investigation
• Compromising decisions for public policy
• Causing harm to human participants and animal subjects
Once data is collected, we need to ensure its quality, which can be
achieved by maintaining:
• Accuracy: the data must be true and relevant to the research questions
and objectives.
• Completeness: the data has no missing values.
• Uniqueness: the data has no duplicates.
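Completeness and uniqueness can be measured mechanically, whereas accuracy usually needs a trusted reference to compare against. The following is a minimal sketch, assuming records are held as Python dictionaries; the function name `quality_report`, the field names, and the sample survey rows are all made up for illustration.

```python
def quality_report(records, key_fields):
    """Summarize completeness and uniqueness for a list of dict records."""
    total = len(records)

    def is_missing(value):
        # A value counts as missing if it is None or a blank string.
        return value is None or (isinstance(value, str) and value.strip() == "")

    # A row is complete when none of its values is missing.
    complete = sum(
        1 for row in records if not any(is_missing(v) for v in row.values())
    )

    # Two rows are treated as duplicates when they agree on every key field.
    seen = set()
    duplicates = 0
    for row in records:
        key = tuple(row.get(field) for field in key_fields)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)

    return {
        "rows": total,
        "complete_rows": complete,
        "completeness": complete / total if total else 1.0,
        "duplicate_rows": duplicates,
    }


# Hypothetical survey responses, invented for this example.
survey = [
    {"respondent_id": 1, "age": 34, "city": "Nairobi"},
    {"respondent_id": 2, "age": None, "city": "Kisumu"},
    {"respondent_id": 1, "age": 34, "city": "Nairobi"},
    {"respondent_id": 3, "age": 27, "city": ""},
]
report = quality_report(survey, key_fields=["respondent_id"])
print(report)
```

Here two of the four rows are complete and one row repeats an earlier respondent, so the report flags both problems before any analysis starts.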
Data collection methods
• Primary data collection
Primary data collection involves the collection of original data directly
from the source or through direct interaction with the respondents. This
method allows researchers to obtain firsthand information specifically
tailored to their research objectives.
• Secondary data collection
Secondary data collection involves using existing data that someone else
collected, usually for a purpose different from the current research.
Researchers analyze and interpret this data to extract relevant
information.
Methods for primary data collection
• Surveys and questionnaires
Researchers design structured questionnaires or surveys to collect data
from individuals or groups. These can be conducted through face-to-face
interviews, telephone calls, mail, or online platforms.
• Observations
This technique involves observing and recording behaviors, actions, or
events in their natural setting. This method is useful for gathering data
on human behavior, interactions, or phenomena without direct intervention.
• Interviews
Interviews are a qualitative research method used to collect primary data
by asking one or more people about their opinions, experiences or
perspectives on a particular topic or subject matter. They can be
structured, unstructured or semi-structured.
• Experiments
An experiment is a data collection method in which you, as the researcher,
change some variables and observe their effect on other variables. It is
used to study cause-and-effect relationships between variables.
• Focus Groups
Focus groups bring together a small group of individuals who discuss
specific topics in a moderated setting. This method helps in understanding
opinions, perceptions, and experiences shared by the participants.
Secondary data sources
Public data sources, shared through:
• Government reports
• Press releases
• Business journals
• Libraries
• Internet (public repositories)
Internal data sources: data collected within the organization for the
benefit of the organization itself.
Secondary data collection methods
• Downloading directly from the sources.
• Web scraping from websites (mainly for people with programming
experience).
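To give a feel for what web scraping involves, here is a minimal sketch that pulls the rows out of an HTML table using only Python's standard library. The page content, the class name `TableCellParser`, and the county figures are invented for illustration; a real scraper would first download the page (for example with `urllib.request.urlopen`) and should respect the site's terms of use.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag == "td" and self._current_row is not None:
            self._in_cell = True
            self._current_row.append("")

    def handle_data(self, data):
        # Only keep text that appears inside a data cell.
        if self._in_cell:
            self._current_row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._current_row is not None:
            # Header rows contain no <td> cells, so they are skipped here.
            if self._current_row:
                self.rows.append(self._current_row)
            self._current_row = None


# Made-up page content standing in for a downloaded web page.
SAMPLE_PAGE = """
<table>
  <tr><th>County</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>1200</td></tr>
  <tr><td>Beta</td><td>850</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(SAMPLE_PAGE)
print(parser.rows)
```

The scraped values arrive as text, so they still need cleaning and type conversion before analysis.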
When choosing a data collection method, consider:
• Quality and Accuracy: The choice of data collection method directly
impacts the quality and accuracy of the data obtained. Properly
designed methods help ensure that the data collected is relevant to the
research questions and free from errors.
• Relevance, Validity, and Reliability: Effective data collection
methods help ensure that the data collected is relevant to the research
objectives, valid (measuring what it intends to measure), and reliable
(consistent and reproducible).
• Bias Reduction and Representativeness: Carefully chosen data
collection methods can help minimize biases inherent in the research
process, such as sampling bias or response bias. They also aid in
achieving a representative sample, enhancing the findings’
generalizability.
• Achievement of Research Objectives: Data collection methods
should align with the research objectives to ensure that the collected
data effectively addresses the research questions or hypotheses.
Properly collected data facilitates the attainment of these objectives.
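One common way to reduce sampling bias and improve representativeness is stratified sampling, where the same fraction is drawn from each subgroup. The sketch below is one possible illustration, not a prescribed method; the function name `stratified_sample`, the `region` field, and the 80/20 urban/rural split are assumptions made up for the example.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, fraction, seed=0):
    """Draw the same fraction from every stratum so each group is represented."""
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for unit in population:
        groups[stratum_of(unit)].append(unit)

    sample = []
    for members in groups.values():
        # Take at least one unit per stratum so small groups are not dropped.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 80 urban and 20 rural respondents.
people = [{"id": i, "region": "urban" if i < 80 else "rural"} for i in range(100)]
sample = stratified_sample(people, lambda p: p["region"], fraction=0.1)

counts = defaultdict(int)
for person in sample:
    counts[person["region"]] += 1
print(dict(counts))
```

With a 10% fraction the sample keeps the population's 80/20 split (8 urban, 2 rural), whereas a simple random draw of 10 people could by chance contain no rural respondents at all.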
Big Data
• Big data refers to any collection of data sets so large and/or complex
that it becomes difficult to store, manage, and process using traditional
data management techniques such as relational database management
systems (RDBMS).
[Figure: Sources of Big Data]
• It is data generated at a very large scale and at very high speed, and
it arrives in many different formats. It includes data such as video,
audio, Google searches, machine logs, tweets, and Facebook/WhatsApp chats.
• Big data needs modern computing technologies such as Hadoop, Cassandra,
Google Analytics, etc. in order to be processed and analyzed.
The five Vs: Characteristics of Big Data
• Volume: This refers to the sheer amount of data being generated and
stored. Datasets in big data are typically measured in terabytes,
petabytes, or even exabytes, far exceeding the capabilities of
traditional data storage and processing methods.

• Variety: Big data comes in a wide range of formats, not just the
structured data (rows and columns) found in traditional databases. It
can include unstructured data like social media posts, images, and
videos, as well as semi-structured data like emails and logs.
• Velocity: The speed at which data is generated and processed is
another key characteristic. Big data can be generated in real-time or
near real-time, requiring fast processing and analysis tools to keep up
with the data flow.
• Veracity: This refers to the accuracy and quality of the data. With
diverse sources and formats, ensuring the trustworthiness and
consistency of big data can be a challenge. Data cleaning and
validation techniques are crucial for reliable analysis.
• Value: Extracting meaningful insights and value from vast amounts of
data is the ultimate goal. Big data analytics techniques help us uncover
hidden patterns, trends, and correlations that can inform better
decision-making, optimize processes, and create new opportunities.
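The velocity point has a practical consequence: when data arrives continuously, it is often processed one item at a time instead of being stored in full and analyzed later. Below is a minimal sketch of that idea, keeping a running mean without retaining earlier readings. The function names and the sensor values are hypothetical, and a plain Python generator stands in for a real live feed.

```python
def running_mean(stream):
    """Yield the mean of all readings seen so far, one reading at a time."""
    count = 0
    mean = 0.0
    for value in stream:
        count += 1
        # Incremental update: earlier readings do not need to stay in memory.
        mean += (value - mean) / count
        yield mean


def sensor_readings():
    """Stand-in for a live feed; a real source might be a socket or a queue."""
    for value in [20.0, 22.0, 21.0, 25.0]:
        yield value


means = list(running_mean(sensor_readings()))
print(means)
```

Each new reading updates the result immediately, which is the kind of near real-time behavior that high-velocity data requires.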
Drivers of Data Growth
• More Data-Generating Devices: The proliferation of smartphones, tablets,
laptops, and other internet-connected devices contributes significantly. Each
device captures and stores data, from photos and videos to browsing history
and app activity.
• Internet of Things (IoT): The increasing number of internet-connected
sensors and devices embedded in everyday objects is another major
contributor. These devices constantly generate data on everything from
weather conditions to traffic patterns to appliance usage.
• Social Media and User-Generated Content: The rise of social media
platforms and online communities has led to a surge in user-generated content
like posts, comments, images, and videos. This data adds significantly to the
overall volume.
The ever-growing tide of data holds immense potential benefits across various
sectors. Here are some of the promising ways data growth can be harnessed for
positive change:
1. Enhanced Decision-Making: With more data available, businesses,
organizations, and even individuals can make more informed decisions. Data
analytics can uncover hidden patterns, trends, and correlations that traditional
methods might miss. This can lead to:
• Improved resource allocation: Data can reveal areas where resources are
underutilized or overspent, enabling better allocation for optimal results.
• Data-driven marketing: Businesses can personalize marketing
campaigns based on customer behavior and preferences, leading to
higher engagement and sales.
• Scientific advancements: Researchers can analyze vast datasets to
accelerate discoveries in medicine, materials science, and other fields.
• Environmental monitoring: Data from sensors can be used to monitor
environmental changes, track pollution levels, and inform sustainable
practices.
• Personalized experiences: Data allows for customization, tailoring
experiences to individual preferences. This can range from
personalized learning platforms to recommendation systems for
products and content.
2. Innovation and Development: Data is the fuel for innovation. Here's
how big data can drive progress:
• Developing new products and services: Companies can use data to
identify customer needs and preferences, informing the development of
innovative products and services that cater to those needs.
• Optimizing existing processes: Data analysis can help identify
inefficiencies in operations, leading to process improvements and cost
reductions.
3. Societal Progress: Data can play a crucial role in tackling global
challenges:
• Public health management: Data analysis can be used to track disease
outbreaks, predict epidemics, and optimize healthcare resource
allocation.
• Smart cities: Urban planning can leverage data to improve traffic flow,
optimize energy use, and enhance public safety.
4. Improved Customer Service: Businesses can leverage data to
personalize customer interactions and provide a more positive customer
experience:
• Proactive customer support: By analyzing customer data, companies
can anticipate potential issues and provide proactive support, reducing
frustration.
• Faster issue resolution: Data can help identify root causes of customer
problems, leading to quicker and more effective solutions.
• Sentiment analysis: Businesses can use data to understand customer
sentiment and feedback, allowing them to improve products and services.
5. Scientific Discovery: Big data opens doors to groundbreaking research
in various scientific fields:
• Genomics and personalized medicine: Analyzing vast amounts of
genetic data can lead to personalized healthcare approaches and
accelerate drug discovery.
• Climate change research: Data analysis from weather stations,
satellites, and other sources helps us understand climate patterns and
predict future trends.
• Social science research: Studying social media data and online
interactions can provide insights into human behavior and social trends.

It's important to acknowledge that data growth also comes with
challenges around privacy, security, and responsible data management.
However, by addressing these concerns and harnessing its potential
effectively, data can be a powerful tool for progress and positive change.
Data Management
• Data management is the practice of collecting, organizing, protecting,
and storing an organization’s data so it can be analyzed for business
decisions. As organizations create and consume data at unprecedented
rates, data management solutions become essential for making sense
of the vast quantities of data.
Data Pipeline
• A data pipeline is a method in which raw data is ingested from
various data sources, transformed and then ported to a data store, such
as a data lake or data warehouse, for analysis.
• Data ingestion is the process of importing large, assorted data files
from multiple sources into a single storage medium, often cloud-based,
such as a data warehouse, data mart or database, where it can be accessed
and analyzed.
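As a small-scale illustration of ingestion, the sketch below reads two differently formatted sources into one common staging list. The source names (`crm_export`, `web_log`), the field names, and the sample values are all hypothetical, and in-memory strings stand in for real files so that the example is self-contained.

```python
import csv
import io
import json


def ingest_csv(text, source):
    """Read CSV text into dict records tagged with where they came from."""
    return [dict(row, source=source) for row in csv.DictReader(io.StringIO(text))]


def ingest_json_lines(text, source):
    """Read newline-delimited JSON into dict records tagged with their source."""
    return [
        dict(json.loads(line), source=source)
        for line in text.splitlines()
        if line.strip()
    ]


# Made-up sample data standing in for two separate source systems.
CSV_EXPORT = "customer_id,amount\n1,250\n2,100\n"
JSON_LOG = (
    '{"customer_id": "3", "amount": "75"}\n'
    '{"customer_id": "1", "amount": "40"}\n'
)

staging = ingest_csv(CSV_EXPORT, "crm_export") + ingest_json_lines(JSON_LOG, "web_log")
print(len(staging), staging[0])
```

Tagging each record with its source keeps the data traceable once everything sits in the same place.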
• Data transformation is the process of converting, cleansing, and
structuring data into a usable format that can be analyzed to support
decision-making and to propel the growth of an organization.
• This involves removing or filling in missing values, transforming the
data into the correct format for storage, joining data from different
sources, and removing duplicate data.
• Transformation also allows for enrichment, which enhances the quality
of the data.
• Data storage is the process of storing and preserving digital
information for later retrieval and use.
• Data can be stored in cloud storage, data warehouses, floppy disks, etc.
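The transformation and storage steps described above can be sketched together in a few lines. This is only an illustration under stated assumptions: the records, the `regions` lookup used for the join, the default of 0.0 for missing amounts, and the `payments` table are all invented, and an in-memory SQLite database stands in for a real data store.

```python
import sqlite3

# Made-up raw records: inconsistent text, a missing amount, and a duplicate.
raw = [
    {"customer_id": "1", "amount": "250", "city": "nairobi"},
    {"customer_id": "2", "amount": "", "city": "Kisumu "},
    {"customer_id": "1", "amount": "250", "city": "nairobi"},
]
# Hypothetical second source, joined on the city name to enrich each record.
regions = {"Nairobi": "Central", "Kisumu": "West"}


def transform(rows, default_amount=0.0):
    """Clean, convert, join, and de-duplicate the raw rows."""
    cleaned, seen = [], set()
    for row in rows:
        city = row["city"].strip().title()
        # Fill a missing amount with a default, then convert text to a number.
        amount = float(row["amount"]) if row["amount"] else default_amount
        record = (int(row["customer_id"]), amount, city, regions.get(city))
        if record not in seen:  # drop exact duplicates
            seen.add(record)
            cleaned.append(record)
    return cleaned


def store(records, connection):
    """Persist the transformed records so they can be queried later."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments "
        "(customer_id INTEGER, amount REAL, city TEXT, region TEXT)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?, ?)", records)
    connection.commit()


conn = sqlite3.connect(":memory:")
store(transform(raw), conn)
stored = conn.execute("SELECT * FROM payments ORDER BY customer_id").fetchall()
print(stored)
```

Filling missing values with a default is only one option; depending on the analysis, dropping the incomplete row may be the better choice.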
Data pipeline
Thank you
