Introduction to Data Science Module 2

The document provides an introduction to data science, focusing on data collection and storage methods, including primary and secondary data collection techniques. It emphasizes the importance of data quality, accuracy, and the implications of improper data collection, while also discussing the characteristics and challenges of big data. Additionally, it outlines the significance of data management and the data pipeline process for effective analysis and decision-making.
Copyright
© All Rights Reserved

INTRODUCTION TO DATA SCIENCE
Instructor: Abubakar Yussuf
DATA COLLECTION AND STORAGE
What is data collection?
• Data collection is the process of gathering data for use in
business decision-making, strategic planning, research and
other purposes.
• Effective data collection provides the information that's
needed to answer questions, analyze business performance or
other outcomes, and predict future trends, actions and
scenarios.
Consequences of improperly collected data include:
• Inability to answer research questions accurately
• Inability to repeat and validate the study
• Distorted findings resulting in wasted resources
• Misleading other researchers to pursue fruitless avenues of
investigation
• Compromising decisions for public policy
• Causing harm to human participants and animal subjects
When data is collected, we need to ensure its quality. This can be
achieved by maintaining:
• Accuracy: data must be true and relevant to the research questions
and objectives.
• Completeness: ensure the data has no missing values.
• Uniqueness: ensure there are no duplicates.
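The completeness and uniqueness checks above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the records, field names, and the helper functions `completeness` and `duplicate_rows` are all hypothetical.

```python
# Hypothetical survey records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "city": "Nairobi"},
    {"id": 2, "age": None, "city": "Mombasa"},
    {"id": 3, "age": 29, "city": "Kisumu"},
    {"id": 1, "age": 34, "city": "Nairobi"},  # exact duplicate of the first row
]

def completeness(rows):
    """Return the fraction of cells that are not missing."""
    total = sum(len(row) for row in rows)
    filled = sum(1 for row in rows for value in row.values() if value is not None)
    return filled / total if total else 1.0

def duplicate_rows(rows):
    """Return rows that repeat an earlier row exactly."""
    seen = set()
    duplicates = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates

print(f"Completeness: {completeness(records):.0%}")
print("Duplicates:", duplicate_rows(records))
```

Accuracy is harder to automate because it requires comparing the data against a trusted reference, so it is left out of this sketch.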
Data collection methods
• Primary data collection
Primary data collection involves the collection of original data directly
from the source or through direct interaction with the respondents. This
method allows researchers to obtain firsthand information specifically
tailored to their research objectives.
• Secondary data collection
Secondary data collection involves using existing data, collected by
someone else, for a purpose different from the one it was originally
collected for. Researchers analyze and interpret this data to extract
relevant information.
Methods for primary data collection
• Surveys and questionnaires
Researchers design structured questionnaires or surveys to collect data
from individuals or groups. These can be conducted through face-to-face
interviews, telephone calls, mail, or online platforms.
• Observations
This technique involves observing and recording behaviors, actions, or
events in their natural setting. This method is useful for gathering data
on human behavior, interactions, or phenomena without direct intervention.
• Interviews
Interviews are a qualitative research method used to collect primary data
by asking one or more people about their opinions, experiences or
perspectives on a particular topic or subject matter. They can be
structured, unstructured or semi-structured.
• Experiments
An experiment is a data collection method where you, as a researcher,
change some variables and observe their effect on other variables. It is
used to study cause-and-effect relationships between variables.
• Focus Groups
Focus groups bring together a small group of individuals who discuss
specific topics in a moderated setting. This method helps in understanding
opinions, perceptions, and experiences shared by the participants.
Secondary data sources
Public sources: data that is shared through:
• Government reports
• Press releases
• Business journals
• Libraries
• Internet (public repositories)
Internal data sources: data collected within the organization for
the benefit of the organization itself.
Secondary data collection methods
• Downloading the data from its source.
• Web scraping from websites (mainly for people with programming
experience).
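As a rough illustration of what web scraping involves, the sketch below extracts links from an HTML page using only Python's standard library. The HTML snippet, the file paths in it, and the `LinkCollector` class are hypothetical; a real scraper would first download the page and should respect the site's terms of use.

```python
from html.parser import HTMLParser

# Hypothetical page content; in practice this would be downloaded from a website.
SAMPLE_HTML = """
<html><body>
  <h1>Public reports</h1>
  <ul>
    <li><a href="/reports/2021.csv">Report 2021</a></li>
    <li><a href="/reports/2022.csv">Report 2022</a></li>
  </ul>
</body></html>
"""

class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        # Start tracking text when an <a> tag opens.
        if tag == "a":
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        # When the link closes, record its target and visible text.
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None

parser = LinkCollector()
parser.feed(SAMPLE_HTML)
for href, text in parser.links:
    print(text, "->", href)
```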
Factors to consider when choosing a data collection method:
• Quality and Accuracy: The choice of data collection method directly
impacts the quality and accuracy of the data obtained. Properly
designed methods help ensure that the data collected is relevant to the
research questions and free from errors.
• Relevance, Validity, and Reliability: Effective data collection
methods help ensure that the data collected is relevant to the research
objectives, valid (measuring what it intends to measure), and reliable
(consistent and reproducible).
• Bias Reduction and Representativeness: Carefully chosen data
collection methods can help minimize biases inherent in the research
process, such as sampling bias or response bias. They also aid in
achieving a representative sample, enhancing the findings’
generalizability.
• Achievement of Research Objectives: Data collection methods
should align with the research objectives to ensure that the collected
data effectively addresses the research questions or hypotheses.
Properly collected data facilitates the attainment of these objectives.
BIG DATA
• Big data refers to any collection of data sets that is so large and/or
complex that it becomes difficult to store, manage, and process using
traditional data management techniques such as an RDBMS.
• It is data that is generated at a very large scale and at very high
speed, and that arrives in different formats. (It includes data such as
video, audio, Google searches, machine logs, tweets, and
Facebook/WhatsApp chats.)
• Big data needs modern computing technologies such as Hadoop,
Cassandra, Google Analytics, etc. to process and analyze it.
[Figure: Sources of Big Data]
The 5 Vs: Characteristics of Big Data
• Volume: This refers to the sheer amount of data being generated and
stored. Datasets in big data are typically measured in terabytes,
petabytes, or even exabytes, far exceeding the capabilities of
traditional data storage and processing methods.

• Variety: Big data comes in a wide range of formats, not just the
structured data (rows and columns) found in traditional databases. It
can include unstructured data like social media posts, images, and
videos, as well as semi-structured data like emails and logs.
• Velocity: The speed at which data is generated and processed is
another key characteristic. Big data can be generated in real-time or
near real-time, requiring fast processing and analysis tools to keep up
with the data flow.
• Veracity: This refers to the accuracy and quality of the data. With
diverse sources and formats, ensuring the trustworthiness and
consistency of big data can be a challenge. Data cleaning and
validation techniques are crucial for reliable analysis.
• Value: Extracting meaningful insights and value from vast amounts of
data is the ultimate goal. Big data analytics techniques help us uncover
hidden patterns, trends, and correlations that can inform better
decision-making, optimize processes, and create new opportunities.
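To give a feel for the units mentioned under Volume, the sketch below converts raw byte counts into readable sizes. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes, 1 EB = 10^18 bytes); the function name `human_size` is hypothetical.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

def human_size(num_bytes):
    """Format a byte count using decimal (power-of-1000) units."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop at the first unit where the number is below 1000,
        # or at the largest unit available.
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:.1f} {unit}"
        size /= 1000

print(human_size(10**12))        # one terabyte
print(human_size(25 * 10**14))   # 2.5 petabytes
```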
Drivers of Data Growth
• More Data-Generating Devices: The proliferation of smartphones, tablets,
laptops, and other internet-connected devices contributes significantly. Each
device captures and stores data, from photos and videos to browsing history
and app activity.
• Internet of Things (IoT): The increasing number of internet-connected
sensors and devices embedded in everyday objects is another major
contributor. These devices constantly generate data on everything from
weather conditions to traffic patterns to appliance usage.
• Social Media and User-Generated Content: The rise of social media
platforms and online communities has led to a surge in user-generated content
like posts, comments, images, and videos. This data adds significantly to the
overall volume.
The ever-growing tide of data holds immense potential benefits across various
sectors. Here are some of the promising ways data growth can be harnessed for
positive change:

1. Enhanced Decision-Making: With more data available, businesses,
organizations, and even individuals can make more informed decisions. Data
analytics can uncover hidden patterns, trends, and correlations that traditional
methods might miss. This can lead to:
• Improved resource allocation: Data can reveal areas where resources are
underutilized or overspent, enabling better allocation for optimal results.
• Data-driven marketing: Businesses can personalize marketing
campaigns based on customer behavior and preferences, leading to
higher engagement and sales.
• Scientific advancements: Researchers can analyze vast datasets to
accelerate discoveries in medicine, materials science, and other fields.
• Environmental monitoring: Data from sensors can be used to monitor
environmental changes, track pollution levels, and inform sustainable
practices.
• Personalized experiences: Data allows for customization, tailoring
experiences to individual preferences. This can range from
personalized learning platforms to recommendation systems for
products and content.
2. Innovation and Development: Data is the fuel for innovation. Here's
how big data can drive progress:
•Developing new products and services: Companies can use data to
identify customer needs and preferences, informing the development of
innovative products and services that cater to those needs.
•Optimizing existing processes: Data analysis can help identify
inefficiencies in operations, leading to process improvements and cost
reductions.
3. Societal Progress: Data can play a crucial role in tackling global
challenges:
•Public health management: Data analysis can be used to track disease
outbreaks, predict epidemics, and optimize healthcare resource
allocation.
•Smart cities: Urban planning can leverage data to improve traffic flow,
optimize energy use, and enhance public safety.
4. Improved Customer Service: Businesses can leverage data to
personalize customer interactions and provide a more positive customer
experience:
•Proactive customer support: By analyzing customer data, companies
can anticipate potential issues and provide proactive support, reducing
frustration.
•Faster issue resolution: Data can help identify root causes of customer
problems, leading to quicker and more effective solutions.
•Sentiment analysis: Businesses can use data to understand customer
sentiment and feedback, allowing them to improve products and services.
5. Scientific Discovery: Big data opens doors to groundbreaking research
in various scientific fields:
•Genomics and personalized medicine: Analyzing vast amounts of
genetic data can lead to personalized healthcare approaches and
accelerate drug discovery.
•Climate change research: Data analysis from weather stations,
satellites, and other sources helps us understand climate patterns and
predict future trends.
•Social science research: Studying social media data and online
interactions can provide insights into human behavior and social trends.
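The sentiment analysis mentioned under Improved Customer Service can be illustrated with a very simple word-counting approach. This is only a toy sketch: the word lists and function names are hypothetical, and real systems rely on much larger lexicons or trained models.

```python
import re

# Tiny hypothetical word lists, for illustration only.
POSITIVE = {"great", "good", "fast", "helpful", "love"}
NEGATIVE = {"bad", "slow", "broken", "rude", "hate"}

def sentiment_score(text):
    """Return the number of positive words minus the number of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives

def label(text):
    """Map the score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for comment in [
    "Great service and fast delivery",
    "The app is slow and the support was rude",
    "I ordered on Monday",
]:
    print(label(comment), "|", comment)
```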

It's important to acknowledge that data growth also comes with
challenges around privacy, security, and responsible data management.
However, by addressing these concerns and harnessing its potential
effectively, data can be a powerful tool for progress and positive change.
Data Management
• Data management is the practice of collecting, organizing, protecting,
and storing an organization’s data so it can be analyzed for business
decisions. As organizations create and consume data at unprecedented
rates, data management solutions become essential for making sense
of the vast quantities of data.
Data Pipeline
• A data pipeline is a method in which raw data is ingested from
various data sources, transformed and then ported to a data store, such
as a data lake or data warehouse, for analysis.
• Data ingestion is the process of importing large, assorted data files
from multiple sources into a single, cloud-based storage medium (a
data warehouse, data mart or database) where it can be accessed and
analyzed.
• Data transformation is the process of converting, cleansing, and
structuring data into a usable format that can be analyzed to support
decision making processes, and to propel the growth of an
organization.
• This involves removing or filling missing values, transforming the
data into the correct format for storage, joining data from different
sources, and removing duplicate data.
• The process of transformation also allows for enrichment, which
enhances the quality of the data.
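The transformation steps listed above (filling missing values, converting formats, joining sources, removing duplicates) might look like this in plain Python. The order records, customer names, field names, and the default amount of 0.0 are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical raw inputs from two different sources.
orders = [
    {"order_id": "1", "customer_id": "c1", "amount": "19.99", "date": "03/01/2024"},
    {"order_id": "2", "customer_id": "c2", "amount": "", "date": "04/01/2024"},
    {"order_id": "2", "customer_id": "c2", "amount": "", "date": "04/01/2024"},  # duplicate
]
customers = {"c1": "Amina", "c2": "Brian"}

def transform(rows, customer_names, default_amount=0.0):
    """Clean raw order rows and enrich them with customer names."""
    seen = set()
    cleaned = []
    for row in rows:
        # Remove duplicates: keep only the first row for each order_id.
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            # Fill a missing amount with a default; otherwise convert text to float.
            "amount": float(row["amount"]) if row["amount"] else default_amount,
            # Convert day/month/year text into ISO format (YYYY-MM-DD).
            "date": datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat(),
            # Join: enrich the order with the customer's name from the second source.
            "customer": customer_names.get(row["customer_id"], "unknown"),
        })
    return cleaned

result = transform(orders, customers)
for row in result:
    print(row)
```

Whether a missing value should be filled with a default, estimated, or dropped depends on the analysis; the default used here is only a placeholder.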
• Data storage is the process of storing and preserving digital
information for later retrieval and use.
• Data can be stored in cloud storage, data warehouses, floppy disks, etc.
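Putting the three stages together, a data pipeline can be sketched as ingest, then transform, then store. The example below is a small stand-in under stated assumptions: the CSV text, the `readings` table, and the function names are hypothetical, and an in-memory SQLite database plays the role of the data store.

```python
import csv
import io
import sqlite3

# Hypothetical raw source; a real pipeline would read files, APIs, or streams.
RAW_CSV = """sensor,reading
a,21.5
b,
a,22.0
"""

def ingest(text):
    """Ingest: parse raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing reading and convert text to float."""
    return [(row["sensor"], float(row["reading"])) for row in rows if row["reading"]]

def store(rows, conn):
    """Store: write the cleaned rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (sensor TEXT, reading REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
store(transform(ingest(RAW_CSV)), conn)

# Once stored, the data can be queried for analysis.
average = conn.execute(
    "SELECT AVG(reading) FROM readings WHERE sensor = 'a'"
).fetchone()[0]
print("Average reading for sensor a:", average)
```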
[Figure: Data pipeline]
Thank you