
Unit 2

Data Gathering
Meaning of Data Gathering:

▪ Data gathering refers to the process of collecting relevant information from various sources for analysis, decision-making, or research purposes. It is the foundational step in data analysis and involves identifying, measuring, and collecting accurate data to answer specific questions or solve problems.
Types of Data:

▪ Based on Source:
▪ Primary Data: Original data collected directly from the source through methods like surveys, interviews, and experiments.
▪ Secondary Data: Pre-existing data gathered from other sources like reports, government publications, and databases.
▪ Based on Format:
▪ Structured Data: Organized in a defined manner (e.g., rows and columns in databases or spreadsheets) like numerical data.
▪ Unstructured Data: Lacks a predefined structure (e.g., text, images, audio, and video files) such as social media posts or
emails.
▪ Semi-structured Data: Contains elements of both structured and unstructured data (e.g., XML, JSON formats).
▪ Based on Nature:
▪ Qualitative Data: Descriptive data that includes opinions, narratives, and observations (e.g., interviews, open-ended survey
responses).
▪ Quantitative Data: Numeric data that can be measured or counted (e.g., sales figures, temperature readings).
Methods of Data Gathering:
1. Surveys and Questionnaires:
▪ Collect data directly from individuals using a series of questions.

▪ Can be conducted online, via email, or in person.

▪ Best for gathering primary data on opinions, preferences, and behaviors.

2. Interviews:

▪ One-on-one or group conversations where responses to questions are recorded.

▪ Useful for obtaining detailed qualitative insights.

3. Observations:

▪ Collecting data by observing behaviors or events in real time.

▪ Suitable for studying natural environments or behaviors.


4. Experiments:

▪ Controlled trials where variables are manipulated to observe outcomes.

▪ Used to gather primary data in scientific or research settings.

5. Existing Databases:

▪ Leveraging data that has already been collected and stored in public or private databases.

▪ Typically used for secondary data analysis.

6. Web Scraping:

▪ Automated methods to extract data from websites and online sources.

▪ Commonly used for gathering data from social media or e-commerce websites (a minimal sketch follows).
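
As an illustration of the web-scraping method above, here is a minimal sketch in Python using the requests and BeautifulSoup libraries. The URL and the .product-name selector are hypothetical placeholders; a real scraper should also respect the target site's robots.txt and terms of service.

# Minimal web-scraping sketch; URL and selector are placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"  # hypothetical page
response = requests.get(url, timeout=10)
response.raise_for_status()           # stop on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every element matching a (hypothetical) product-name class.
names = [tag.get_text(strip=True) for tag in soup.select(".product-name")]
print(names)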
Data Formats:

▪ Text Files (CSV, TXT): Often used for tabular data.

▪ Spreadsheets (XLS, XLSX): Frequently used for data analysis and storage.

▪ Databases (SQL, NoSQL): Relational and non-relational data storage.

▪ JSON/XML: Common formats for web-based data exchange.

▪ Images, Audio, and Video: Multimedia formats for unstructured data.
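
As a brief sketch, several of the formats above can be loaded in Python with pandas and the standard library; the file names below are hypothetical placeholders.

# Loading common data formats; file names are made up for illustration.
import json
import pandas as pd

df_csv = pd.read_csv("sales.csv")       # text file with tabular data
df_xlsx = pd.read_excel("sales.xlsx")   # spreadsheet (requires the openpyxl package)
df_json = pd.read_json("sales.json")    # JSON records as a table

with open("config.json") as f:          # raw JSON via the standard library
    config = json.load(f)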
Parsing and Transformation:

▪ Parsing: Refers to interpreting and converting raw data into a more structured format. It is particularly useful when handling text or semi-structured data formats (e.g., parsing JSON or XML files).
▪ Transformation: Involves changing the format, structure, or values
of the data to make it suitable for analysis. This could include
operations like:
▪ Aggregation (summarizing data),
▪ Normalization (scaling values to a common range),
▪ Data type conversion (e.g., changing strings to numbers),
▪ Merging multiple data sources.
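
The following minimal Python sketch ties the two steps together: parsing a JSON string into structured objects, then applying the data type conversion, normalization, and aggregation operations listed above. The values are invented for illustration.

# Parsing raw JSON text, then transforming it for analysis.
import json
import pandas as pd

raw = '[{"region": "North", "sales": "120"}, {"region": "South", "sales": "80"}]'
records = json.loads(raw)              # parsing: raw text -> structured objects

df = pd.DataFrame(records)
df["sales"] = df["sales"].astype(int)  # data type conversion: string -> number

# Normalization: scale sales to the 0-1 range.
df["sales_norm"] = (df["sales"] - df["sales"].min()) / (df["sales"].max() - df["sales"].min())

# Aggregation: total sales per region.
totals = df.groupby("region")["sales"].sum()
print(totals)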
Data Scalability and Real-Time Issues:

▪ Scalability Issues:
▪ Volume: Large datasets (terabytes or more) may require distributed storage and processing.
▪ Velocity: High-speed data generation (e.g., real-time financial transactions) demands fast processing.
▪ Variety: Handling various data formats (structured, unstructured, semi-structured) across different
systems.
▪ Real-Time Data Issues:
▪ Latency: Delays in data collection, transmission, or processing can affect real-time decision-making.
▪ Data Consistency: Real-time data often comes from multiple sources, and ensuring consistent updates
can be difficult.
▪ Infrastructure: Real-time data systems require robust infrastructure, including memory, processors, and
networking capabilities.
Data Cleaning
▪ Data cleaning involves detecting and correcting errors in the dataset to ensure accuracy,
completeness, and consistency. It is crucial for maintaining the integrity of the data.
▪ Consistency Checking:
▪ Identifying and rectifying inconsistencies in the data, such as:
– Different units of measurement,
– Conflicting data across sources,
– Typographical errors
▪ Handling Heterogeneous Data:
▪ Heterogeneous data refers to data collected from different formats, types, and sources. Cleaning such data
requires:
– Standardization of formats,
– Integration of various datasets,
– Handling different data structures.
▪ Handling Missing Data:
▪ Common approaches to dealing with missing data include:
– Imputation: Filling missing values with the mean, median, or other statistical methods.
– Removal: Deleting rows or columns with missing data if they are insignificant.
– Prediction: Using machine learning algorithms to predict missing values based on available data.
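
A brief Python sketch of the three approaches, using pandas and scikit-learn on an invented two-column table; scikit-learn's KNNImputer stands in for the prediction-based approach.

# Three ways to handle missing values; the data is made up for illustration.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [25, np.nan, 40, 33],
                   "income": [50000, 62000, np.nan, 58000]})

# Imputation: fill each gap with the column mean (median works similarly).
imputed = df.fillna(df.mean())

# Removal: drop any row that contains a missing value.
dropped = df.dropna()

# Prediction: estimate gaps from the most similar complete rows.
predicted = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df),
                         columns=df.columns)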
What Kind of Data Errors Does Data Cleaning Fix?

Data cleaning addresses a wide range of errors in datasets, ensuring that the data is accurate, reliable, and suitable for analysis and decision-making. Some common data errors often found while cleaning data include:
▪ Duplicate records: These occur when the same data entry is
mistakenly repeated in a dataset, leading to redundancy and
potential inaccuracies in analysis.
▪ Inaccurate data: This refers to data entries that contain
incorrect values, such as typographical errors, wrong numerical
entries, or syntax errors. Inaccurate data can lead to misleading
or erroneous insights and conclusions.
▪ Missing or incomplete data: This occurs when certain data
fields are left blank or contain null values, which can affect the
overall quality and reliability of the dataset. Missing or incomplete
data can result in biased or incomplete analysis and decision
making.
▪ Inconsistent data: Inconsistencies can arise when data is
formatted differently across various sources or systems, leading
to discrepancies in values, units, or terminology. Inconsistent data
can make it difficult to accurately analyze and interpret the
information, potentially causing confusion and misinterpretation.
By identifying and addressing these common data errors, data
cleaning ensures that the dataset is of high quality and suitable for
use in various applications, such as business intelligence, analytics,
and decision-making.
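
As a short illustration, the pandas sketch below fixes three of the error types just described on an invented table: inconsistent formatting, duplicate records, and missing values.

# Fixing common data errors with pandas; records are invented.
import pandas as pd

df = pd.DataFrame({
    "city": ["Delhi", "delhi ", "Mumbai", "Mumbai"],
    "temp_c": [31.0, 31.0, None, 29.5],
})

df["city"] = df["city"].str.strip().str.title()          # inconsistent data: unify formatting
df = df.drop_duplicates()                                # duplicate records: keep one copy
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())  # missing data: mean imputation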
Data Transformation and Segmentation:

Data Transformation:
▪ Normalization/Standardization: Scaling data to a specific range or mean for uniformity.
▪ Encoding: Converting categorical data into numerical formats (e.g., one-hot encoding for
machine learning).
▪ Aggregation: Summarizing data, such as calculating the total or average of a group.
Data Segmentation:
▪ Dividing data into meaningful segments or subsets based on specific criteria (e.g., customer
demographics, time periods).
▪ Segmentation is used for targeted analysis and improves focus on particular groups or patterns.
▪ Techniques like clustering (k-means, hierarchical clustering) are used to segment large datasets.
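
A compact Python sketch combining the ideas above: one-hot encoding and normalization as transformations, followed by k-means segmentation with scikit-learn. The customer data is invented for illustration.

# Transformation (encoding, normalization) followed by k-means segmentation.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

customers = pd.DataFrame({
    "age": [22, 45, 31, 60, 28, 52],
    "spend": [200, 900, 400, 1200, 300, 1000],
    "channel": ["web", "store", "web", "store", "web", "store"],
})

encoded = pd.get_dummies(customers, columns=["channel"])  # one-hot encoding
scaled = MinMaxScaler().fit_transform(encoded)            # normalization to 0-1

# Segmentation: k-means groups customers into k clusters by similarity.
customers["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(customers)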
