Unit 2 Data Gathering
Meaning of Data Gathering:
▪ Data gathering (or data collection) is the systematic process of collecting information from relevant sources so it can be analyzed and used for decision-making.
Types of Data:
▪ Based on Source:
▪ Primary Data: Collected directly from the source or original data through methods like surveys, interviews, and experiments.
▪ Secondary Data: Pre-existing data gathered from other sources like reports, government publications, and databases.
▪ Based on Format:
▪ Structured Data: Organized in a defined manner (e.g., rows and columns in databases or spreadsheets) like numerical data.
▪ Unstructured Data: Lacks a predefined structure (e.g., text, images, audio, and video files) such as social media posts or emails.
▪ Semi-structured Data: Contains elements of both structured and unstructured data (e.g., XML, JSON formats).
▪ Based on Nature:
▪ Qualitative Data: Descriptive data that includes opinions, narratives, and observations (e.g., interviews, open-ended survey responses).
▪ Quantitative Data: Numeric data that can be measured or counted (e.g., sales figures, temperature readings).
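The distinction between structured and semi-structured data can be illustrated with a short sketch. All values below are made up for illustration:

```python
import json

# Structured: fixed rows and columns, like a database table or spreadsheet.
structured = [
    {"id": 1, "sales": 120.5},
    {"id": 2, "sales": 98.0},
]

# Semi-structured: JSON records whose optional, nested fields vary per record.
semi_structured = json.loads("""
[
  {"id": 1, "tags": ["urgent"], "meta": {"source": "email"}},
  {"id": 2}
]
""")

# Structured data supports direct column-wise operations.
total_sales = sum(row["sales"] for row in structured)  # 218.5

# Semi-structured data needs defensive access, since fields may be absent.
tags = [rec.get("tags", []) for rec in semi_structured]  # [['urgent'], []]
```

The structured rows share an identical schema, so aggregation is a one-liner; the semi-structured records must be probed field by field.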
Methods of Data Gathering:
1. Surveys and Questionnaires:
▪ Collect data directly from individuals using a series of questions.
2. Interviews:
▪ Gather in-depth information through structured, semi-structured, or open-ended conversations with respondents.
3. Observations:
▪ Record behaviours or events directly as they occur in their natural setting.
Challenges in Data Gathering:
▪ Scalability Issues:
▪ Volume: Large datasets (terabytes or more) may require distributed storage and processing.
▪ Velocity: High-speed data generation (e.g., real-time financial transactions) demands fast processing.
▪ Variety: Handling various data formats (structured, unstructured, semi-structured) across different systems.
▪ Real-Time Data Issues:
▪ Latency: Delays in data collection, transmission, or processing can affect real-time decision-making.
▪ Data Consistency: Real-time data often comes from multiple sources, and ensuring consistent updates can be difficult.
▪ Infrastructure: Real-time data systems require robust infrastructure, including memory, processors, and networking capabilities.
Data Cleaning
▪ Data cleaning involves detecting and correcting errors in the dataset to ensure accuracy, completeness, and consistency. It is crucial for maintaining the integrity of the data.
▪ Consistency Checking:
▪ Identifying and rectifying inconsistencies in the data, such as:
– Different units of measurement,
– Conflicting data across sources,
– Typographical errors
▪ Handling Heterogeneous Data:
▪ Heterogeneous data refers to data collected in different formats and types, and from different sources. Cleaning such data requires:
– Standardization of formats,
– Integration of various datasets,
– Handling different data structures.
▪ Handling Missing Data:
▪ Common approaches to dealing with missing data include:
– Imputation: Filling missing values with the mean, median, or other statistical estimates.
– Removal: Deleting rows or columns with missing data if they are insignificant.
– Prediction: Using machine learning algorithms to predict missing values based on available data.
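The first two approaches can be sketched in a few lines of standard-library Python. The temperature readings below are invented for illustration, with None marking missing values:

```python
from statistics import mean

# Hypothetical sensor readings; None marks a missing value.
readings = [21.0, None, 23.5, 22.0, None, 24.5]

# Imputation: replace each missing value with the mean of the observed ones.
observed = [r for r in readings if r is not None]
col_mean = mean(observed)  # (21.0 + 23.5 + 22.0 + 24.5) / 4 = 22.75
imputed = [r if r is not None else col_mean for r in readings]

# Removal: simply drop the incomplete entries instead.
removed = [r for r in readings if r is not None]
```

Imputation preserves the dataset's length (useful when rows must align with other columns), while removal is safer when the missing rows carry no other useful information.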
What Kind of Data Errors Does Data Cleaning Fix?
▪ Typical errors include typographical mistakes, inconsistent units or formats, duplicate records, conflicting values across sources, and missing values.
Data Transformation:
▪ Normalization/Standardization: Scaling data to a specific range or mean for uniformity.
▪ Encoding: Converting categorical data into numerical formats (e.g., one-hot encoding for machine learning).
▪ Aggregation: Summarizing data, such as calculating the total or average of a group.
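All three transformations can be sketched without any libraries. The values and category names below are made up for illustration:

```python
# Min-max normalization: scale numeric values into the range [0, 1].
values = [10.0, 20.0, 30.0, 40.0]
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]  # 0.0 ... 1.0

# One-hot encoding: map each category to a 0/1 indicator vector.
categories = ["red", "green", "red", "blue"]
vocab = sorted(set(categories))  # ['blue', 'green', 'red']
one_hot = [[1 if c == v else 0 for v in vocab] for c in categories]

# Aggregation: summarize each group, here as an average.
sales = {"north": [100, 200], "south": [300]}
group_avg = {region: sum(v) / len(v) for region, v in sales.items()}
```

Normalization keeps features on a comparable scale, one-hot encoding makes categories usable by numeric models, and aggregation condenses detail rows into group-level summaries.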
Data Segmentation:
▪ Dividing data into meaningful segments or subsets based on specific criteria (e.g., customer demographics, time periods).
▪ Segmentation is used for targeted analysis and improves focus on particular groups or patterns.
▪ Techniques like clustering (k-means, hierarchical clustering) are used to segment large datasets.
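A minimal k-means sketch (k = 2, plain Python, no libraries) shows the idea on a toy customer dataset of (age, annual spend) pairs. The points and the fixed iteration count are illustrative choices, not a production implementation:

```python
# Toy customers: (age, annual spend). Two visually obvious groups.
points = [(25, 20), (27, 22), (24, 19), (60, 80), (62, 85), (58, 78)]

def dist2(a, b):
    """Squared Euclidean distance (monotone, so the sqrt is unnecessary)."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

# Seed the centroids with two far-apart points.
centroids = [points[0], points[3]]
for _ in range(10):  # a fixed number of Lloyd iterations
    # Assignment step: each point joins its nearest centroid's cluster.
    clusters = [[], []]
    for p in points:
        idx = 0 if dist2(p, centroids[0]) <= dist2(p, centroids[1]) else 1
        clusters[idx].append(p)
    # Update step: move each centroid to its cluster's mean.
    centroids = [
        (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
        for c in clusters
    ]
```

Each resulting cluster is a segment that can then be analyzed or targeted separately; library implementations (e.g., scikit-learn's KMeans) add better initialization and convergence checks.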