
What is Data Science? What is Data Science used for?

Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large
amounts of data.
Data science is used to study data in four main ways:
i. Descriptive analysis:
Descriptive analysis examines data to gain insights into what happened or what is
happening in the data environment. It is characterized by data visualizations such
as pie charts, bar charts, line graphs, tables, or generated narratives.
For example, a flight booking service may record data like the number of tickets booked each day (a short illustrative sketch of this example appears after the list).
ii. Diagnostic analysis:
Diagnostic analysis is a deep-dive or detailed data examination to understand why
something happened. It is characterized by techniques such as drill-down, data
discovery, data mining, and correlations.
For example, the flight service might drill down on a particularly high-performing
month to better understand the booking spike.
iii. Predictive analysis:
Predictive analysis uses historical data to make accurate forecasts about data
patterns that may occur in the future. It is characterized by techniques such as
machine learning, forecasting, pattern matching, and predictive modeling.
For example, the flight service team might use data science to predict flight
booking patterns for the coming year at the start of each year.
iv. Prescriptive analysis:
Prescriptive analytics takes predictive data to the next level. It not only predicts
what is likely to happen but also suggests an optimum response to that outcome.
For the flight booking example, prescriptive analysis could look at historical
marketing campaigns to maximize the advantage of the upcoming booking spike.
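
As a minimal sketch of descriptive (and a touch of diagnostic) analysis on the flight-booking example, the Python snippet below summarizes synthetic daily ticket counts by month; the column names and values are invented for illustration only.

# Descriptive analysis sketch: summarize daily bookings by month (invented data).
import pandas as pd

bookings = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=90, freq="D"),
    "tickets_booked": [120 + (i % 30) * 3 for i in range(90)],   # synthetic values
})
bookings["month"] = bookings["date"].dt.to_period("M")
monthly = bookings.groupby("month")["tickets_booked"].agg(["sum", "mean", "max"])
print(monthly)                                    # "what happened": a simple summary table
best_month = monthly["sum"].idxmax()              # diagnostic starting point: drill down
print(bookings[bookings["month"] == best_month])  # into the highest-performing month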

What is the Data Science Process?

The Data Science Process involves a series of steps that data scientists follow to extract
insights and valuable information from data. The typical data science process can be
summarized using the OSEMN framework (a brief illustrative sketch follows the list):
 Obtain Data (O):
In this initial step, data scientists gather the data needed for analysis. This data can be
pre-existing, newly acquired, or obtained from various sources such as databases,
CRM software, web server logs, social media, or third-party sources.
 Scrub Data (S):
Data cleaning or data scrubbing involves preprocessing the data to handle missing
values, outliers, and inconsistencies. This step ensures that the data is accurate and
ready for analysis.
 Explore Data (E):
Data exploration involves conducting preliminary analysis to understand the data
better. Descriptive statistics and data visualization tools are used to identify patterns
and trends in the data.
 Model Data (M):
In this step, data scientists apply various machine learning algorithms to gain deeper
insights, predict outcomes, and prescribe the best course of action. Techniques like
classification, association, and clustering are used to analyze the data.
 Interpret Data (N):
Once the data has been modeled, data scientists interpret the results to derive
meaningful insights and make data-driven decisions. This step involves understanding
the implications of the analysis and communicating the findings effectively.
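
The OSEMN steps can be strung together as a compact Python sketch. This is illustrative only: the file name bookings.csv, the column names, and the logistic-regression model are assumptions, not part of the framework itself.

# OSEMN sketch: Obtain, Scrub, Explore, Model, iNterpret (assumed file and columns).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("bookings.csv")                        # Obtain: pull data from a source
df = df.drop_duplicates()                               # Scrub: remove duplicates...
df["price"] = df["price"].fillna(df["price"].median())  # ...and fill missing values
print(df.describe())                                    # Explore: descriptive statistics

X, y = df[["price", "lead_time_days"]], df["cancelled"]  # Model: assumed feature/target columns
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("Test accuracy:", model.score(X_test, y_test))    # iNterpret: communicate the result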

Challenges of Data Science (7)

1) Preparation of Data for Smart Enterprise AI:


 Finding and cleaning up proper data is a data scientist's top priority.
 Approximately 80% of a data scientist's day is spent on tasks such as cleaning,
organizing, mining, and gathering data, as per a CrowdFlower poll.
 Before further analysis, data is double-checked for accuracy and consistency.
2) Generation of Data from Multiple Sources:
 Organizations obtain data in a broad variety of forms from the many programs, software tools, and applications they use.
 Consolidating this data often calls for manual data entry and compilation, both of which are time-consuming and can lead to unnecessary duplication or erroneous decisions.
3) Identification of Business Issues:
 Identifying the right business issues is a crucial component of running a sound analytics function.
 Before constructing data sets and analyzing data, data scientists should concentrate on identifying enterprise-critical challenges.
 A structured workflow should be in place before analytical operations commence.
4) Communication of Results to Non-Technical Stakeholders:
 The primary objective of a data scientist is to enhance the organization's capacity
for decision-making, which is aligned with the business plan that its function
supports.
 The main challenge for data scientists is effectively communicating findings to
business leaders. Since many managers are unfamiliar with data science tools,
providing them with foundational concepts to apply models using business AI is
crucial.
5) Data Security:
 Due to the need to scale quickly, businesses have turned to cloud management for
the safekeeping of their sensitive information.
 Cyberattacks and online spoofing have left sensitive data stored in the cloud exposed to the outside world.
 Strict measures therefore have to be enforced to protect data in the central repository against hackers.
6) Efficient Collaboration:
 It is common practice for data scientists and data engineers to collaborate on the
same projects for a company.
 Maintaining strong lines of communication is very necessary to avoid any
potential conflicts.
7) Selection of Non-Specific KPI (Key Performance Indicator) Metrics:
 It is a common misunderstanding that data scientists can handle the majority of
the job on their own and come prepared with answers to all of the challenges that
are encountered by the company.
 It is vital for any company to have a certain set of metrics to measure the analyses
that a data scientist presents.

Types of Data Models (3)

Data models can generally be divided into three categories, which vary according to their
degree of abstraction. The process will start with a conceptual model, progress to a logical
model and conclude with a physical model.
1) Conceptual data models:
 They are also referred to as domain models and offer a big-picture view of
what the system will contain, how it will be organized, and which business
rules are involved.
 Typically, they include entity classes, their characteristics and constraints, the
relationships between them and relevant security and data integrity
requirements.
2) Logical data models:
 They are less abstract and provide greater detail about the concepts and
relationships in the domain under consideration.
 These indicate data attributes, such as data types and their corresponding
lengths, and show the relationships among entities.
3) Physical data models:
 They provide a schema for how the data will be physically stored within a
database.
 They offer a finalized design that can be implemented as a relational database,
including associative tables that illustrate the relationships among entities as
well as the primary keys and foreign keys that will be used to maintain those relationships (a brief schema sketch follows this list).
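
As a hedged illustration of the physical level, the sketch below creates a tiny relational schema with a primary key/foreign key relationship using Python's built-in sqlite3 module; the customer and address entities are the illustrative ones used in the next section, not a prescribed design.

# Physical data model sketch: relational tables with primary and foreign keys.
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON;")
conn.execute("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        street     TEXT NOT NULL,
        city       TEXT NOT NULL,
        zip_code   TEXT NOT NULL
    );
""")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        first_name  TEXT NOT NULL,
        last_name   TEXT NOT NULL,
        address_id  INTEGER REFERENCES address(address_id)  -- the 'lives at' relationship
    );
""")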

Data Modelling Process

As a discipline, data modeling invites stakeholders to evaluate data processing and storage in
painstaking detail. Data modeling techniques have different conventions that dictate which
symbols are used to represent the data, how models are laid out, and how business
requirements are conveyed.

All approaches provide formalized workflows that include a sequence of tasks to be performed in an iterative manner.
Those workflows generally look like this:

1. Identify the entities: The process of data modeling begins with the identification of the
things, events or concepts that are represented in the data set that is to be modeled. Each
entity should be cohesive and logically discrete from all others.
2. Identify key properties of each entity: Each entity type can be differentiated from all
others because it has one or more unique properties, called attributes. For instance, an entity
called “customer” might possess such attributes as a first name, last name, telephone number
and salutation, while an entity called “address” might include a street name and number, a
city, state, country and zip code.
3. Identify relationships among entities: The earliest draft of a data model will specify the
nature of the relationships each entity has with the others. In the above example, each
customer “lives at” an address. If that model were expanded to include an entity called
“orders,” each order would be shipped to and billed to an address as well. These relationships
are usually documented via unified modeling language (UML).
4. Assign keys as needed, and decide on a degree of normalization that balances the
need to reduce redundancy with performance requirements: Normalization is a technique
for organizing data models (and the databases they represent) in which numerical identifiers,
called keys, are assigned to groups of data to represent relationships between them without
repeating the data.
For instance, if customers are each assigned a key, that key can be linked to both their address and their order history without having to repeat this information in the table of customer names. Normalization tends to reduce the amount of storage space a database will require, but it can come at a cost to query performance (a short normalization sketch follows these steps).
5. Finalize and validate the data model: Data modeling is an iterative process that should be repeated and refined as business needs change.
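
A small, hypothetical illustration of the normalization step mentioned above: repeated customer details in an orders table are moved into their own table and referenced by a key, trading repeated storage for a join at query time.

# Normalization sketch: replace repeated customer details with a key (invented data).
import pandas as pd

orders_flat = pd.DataFrame({                     # denormalized: customer details repeat per order
    "order_id": [1, 2, 3],
    "customer_name": ["Ada Lovelace", "Ada Lovelace", "Grace Hopper"],
    "customer_city": ["London", "London", "New York"],
    "amount": [120.0, 80.0, 200.0],
})

customers = (orders_flat[["customer_name", "customer_city"]]
             .drop_duplicates().reset_index(drop=True))
customers["customer_id"] = customers.index + 1   # assign a numeric key to each customer

orders = (orders_flat.merge(customers, on=["customer_name", "customer_city"])
          [["order_id", "customer_id", "amount"]])   # normalized orders reference the key

rejoined = orders.merge(customers, on="customer_id") # the join recreates the original view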
Statistical modeling
The statistical modeling process is a way of applying statistical analysis to datasets in data
science. The statistical model involves a mathematical relationship between random and non-
random variables.
By applying statistical models to raw data, data scientists can produce intuitive visualizations that help them identify relationships between variables and make predictions.

Examples of common data sets for statistical analysis include census data, public health data, and social media data.

Statistical Modeling Techniques:


Data gathering is the foundation of statistical modeling. The data may come from the cloud,
spreadsheets, databases, or other sources. There are two categories of statistical modeling
methods used in data analysis. These are:
➢Supervised learning:
• In the supervised learning model, the algorithm uses a labeled data set for learning,
with an answer key the algorithm uses to determine accuracy as it trains on the data.
Supervised learning techniques in statistical modeling include:
Regression model: A predictive model designed to analyze the relationship between
independent and dependent variables.
• The most common regression models are logistic, polynomial, and linear. These models are used to determine relationships between variables, for forecasting, and for modeling.
Classification model: An algorithm that analyzes and classifies a large and complex set of data points. Common models include decision trees, Naive Bayes, k-nearest neighbors, random forests, and neural network models.

➢Unsupervised learning:

• In the unsupervised learning model, the algorithm is given unlabelled data and
attempts to extract features and determine patterns independently. Clustering
algorithms and association rules are examples of unsupervised learning. Here are two
examples:
K-means clustering: The algorithm groups data points into a specified number of clusters based on similarity (see the sketch after this list).
Reinforcement learning: This technique, often regarded as a third learning paradigm rather than a strictly unsupervised one, trains the algorithm through many iterations (frequently in combination with deep learning), rewarding moves that result in favourable outcomes and penalizing actions that produce undesired effects.
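
A brief scikit-learn sketch of the two categories, using synthetic data; the specific models here (ordinary linear regression and k-means) are just examples of the techniques named above, not the only options.

# Supervised vs. unsupervised sketch (synthetic data, scikit-learn).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

X = rng.uniform(0, 10, size=(100, 1))                  # supervised: features with known answers
y = 3.0 * X[:, 0] + rng.normal(0, 1, size=100)         # labels follow a noisy linear relationship
reg = LinearRegression().fit(X, y)
print("estimated slope:", reg.coef_[0])                # should be close to 3

points = np.vstack([rng.normal(0, 1, (50, 2)),         # unsupervised: unlabeled points...
                    rng.normal(5, 1, (50, 2))])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster sizes:", np.bincount(kmeans.labels_))   # ...grouped into two clusters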

Name the different types of attributes and explain briefly with examples
The different types of attributes are as follows (a short illustrative sketch appears after the list):
1.Nominal Attributes:
Nominal means “relating to names.” The values of a nominal attribute are symbols or
names of things. Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical. The values do not have any
meaningful order. In computer science, the values are also known as enumerations.
Example: Suppose that hair colour and marital status are two attributes describing person
objects. In our application, possible values for hair colour are black, brown, blond, red,
auburn, grey, and white. The attribute marital status can take on the values single,
married, divorced, and widowed. Both hair colour and marital status are nominal
attributes.
2.Binary Attributes:
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0
typically means that the attribute is absent, and 1 means that it is present.
Example: Given the attribute smoker describing a patient object, 1 indicates that the
patient smokes, while 0 indicates that the patient does not. Similarly, the patient
undergoes a medical test that has two possible outcomes. The attribute medical test is
binary, where a value of 1 means the result of the test for the patient is positive, while 0
means the result is negative.
3. Ordinal Attributes:
An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Example: Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This ordinal attribute has three possible values: small, medium, and large. The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium. Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and professional rank. Professional ranks can be enumerated in a sequential order: for example, assistant, associate, and full for professors, and private, private first class, specialist, corporal, and sergeant for army ranks.
4. Numeric Attributes:
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in
integer or real values.
• Numeric attributes can be interval-scaled or ratio-scaled.
a) Interval-Scaled Attributes:
• Interval-scaled attributes are measured on a scale of equal-size units. The values
of interval-scaled attributes have order and can be positive, 0, or negative. Thus,
in addition to providing a ranking of values, such attributes allow us to compare
and quantify the difference between values.
• Example: A temperature attribute is interval-scaled. Suppose that we have the
outdoor temperature value for a number of different days, where each day is an
object. By ordering the values, we obtain a ranking of the objects with respect to
temperature. In addition, we can quantify the difference between values. For
example, a temperature of 20°C is five degrees higher than a temperature of 15°C.
Calendar dates are another example. For instance, the years 2002 and 2010 are
eight years apart
b) Ratio-Scaled Attributes:
• A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is,
if a measurement is ratio-scaled, we can speak of a value as being a multiple (or
ratio) of another value. In addition, the values are ordered, and we can also
compute the difference between values, as well as the mean, median, and mode.
• Example: Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K) temperature scale has what is considered a true zero-point (0 K = −273.15°C): it is the point at which the particles that comprise matter have zero kinetic energy.
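
To tie the four attribute types together, the short pandas sketch below (with invented values) shows one way each type might be represented; the explicit ordered category for drink size captures the ordinal ranking.

# Attribute types sketch: nominal, binary, ordinal, and numeric columns (invented values).
import pandas as pd

people = pd.DataFrame({
    "hair_colour": ["black", "blond", "brown"],   # nominal: unordered category names
    "smoker": [1, 0, 1],                          # binary: 1 = present, 0 = absent
    "drink_size": ["small", "large", "medium"],   # ordinal: meaningful order, unknown gaps
    "temperature_c": [20.0, 15.0, 22.5],          # numeric, interval-scaled
    "income": [30000, 45000, 52000],              # numeric, ratio-scaled (true zero point)
})

people["drink_size"] = pd.Categorical(
    people["drink_size"], categories=["small", "medium", "large"], ordered=True)

print(people.dtypes)
print(people["drink_size"].min())   # ordering is meaningful: prints 'small'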

Structured data, difference between structured and non-structured data (5)


Structured Data:
 Structured data is data that has been predefined and formatted to a set structure before
being placed in data storage, which is often referred to as schema-on-write.
 The best example of structured data is the “relational database”: the data has been
formatted into precisely defined fields, such as credit card numbers or addresses, in
order to be easily queried with SQL.
 The programming language used for structured data is SQL (Structured Query
Language).
 Structured data might be generated by either humans or machines. It is easy to
manage and highly searchable, both via human-generated queries and automated
analysis by traditional statistical methods and machine learning (ML) algorithms.
 Structured data is most often categorized as quantitative data, and it's the type of data
most of us are used to working with.

Structured data vs. unstructured data:
1. Structured data is clearly defined data held in a predefined structure, whereas unstructured data has no predefined data model and is undefined.
2. Structured data is often quantitative data, meaning it usually consists of hard numbers or things that can be counted (for example, product information in a customer relationship management system, or CRM); methods for analysis include regression (to predict relationships between variables), classification (to estimate probability), and clustering of data (based on different attributes). Unstructured data, on the other hand, is often categorized as qualitative data and cannot be processed and analyzed using conventional tools and methods; in a business context, qualitative data can come from customer surveys, interviews, and social media interactions.
3. Structured data is easy to search, both for data analytics experts and for algorithms, whereas unstructured data is intrinsically more difficult to search and requires processing to become understandable.
4. Structured data has been defined beforehand in a data model, whereas unstructured data comes in a variety of shapes and sizes and can consist of everything from audio, video, and imagery to email and sensor data.
5. Structured data requires less storage space, whereas unstructured data requires more.
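
A short hedged sketch of the practical difference (the records and the free-text comment are invented): structured fields can be filtered and aggregated directly, while unstructured text needs processing, here reduced to a crude keyword check, before it yields anything.

# Structured vs. unstructured data sketch (invented examples).
import pandas as pd

crm = pd.DataFrame({                        # structured: predefined, typed fields
    "customer_id": [1, 2, 3],
    "country": ["US", "DE", "US"],
    "lifetime_value": [1200.0, 340.0, 980.0],
})
print(crm.loc[crm["country"] == "US", "lifetime_value"].mean())   # direct, precise query

note = "Great service overall, but the mobile app kept crashing during checkout."
print("mentions a crash:", "crash" in note.lower())   # unstructured text needs processing first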

Major tasks / steps involved in data preprocessing with diagram (10)


Data preprocessing is a critical phase in the data mining process that involves transforming
raw data into a clean, organized format suitable for analysis. The major tasks involved in data
preprocessing are essential for ensuring the quality and reliability of the data.
Fig: Pre-processing steps

Here is a detailed explanation of the major steps involved in data preprocessing:


1. Data Cleaning:
Data cleaning involves handling noisy data, filling in missing values, identifying and
correcting inconsistencies, and removing outliers. This step aims to ensure that the data is
accurate and reliable for analysis.
2. Data Integration:
Data integration involves combining data from multiple sources into a coherent data store,
such as a data warehouse. This step helps in resolving naming inconsistencies, redundancies,
and inconsistencies that may exist in the data from different sources.
3. Data Reduction:
Data reduction aims to reduce the data volume while producing the same or similar analytical results.
This can be achieved through techniques such as dimensionality reduction, numerical
approximation, or data cube aggregation. By reducing the data size, the efficiency of the
subsequent data mining process can be improved.
4. Data Transformation:
Data transformation involves converting data into appropriate forms suitable for mining.
This may include normalization, aggregation, generalization, and attribute construction.
Transforming the data can help in improving the mining process and the quality of the results.
By performing these major tasks in data preprocessing, the quality of the data is enhanced, leading to more accurate and efficient data analysis and decision-making processes (a compact sketch of the four tasks follows).
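
A compact sketch of the four tasks; the file names, column names, and thresholds are assumptions for illustration only.

# Preprocessing sketch: cleaning, integration, reduction, transformation (assumed columns/files).
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

sales = pd.read_csv("sales.csv")                                    # raw data (hypothetical file)
sales["amount"] = sales["amount"].fillna(sales["amount"].median())  # cleaning: fill missing values
sales = sales[sales["amount"] < sales["amount"].quantile(0.99)]     # cleaning: drop extreme outliers

customers = pd.read_csv("customers.csv")                            # integration: second source
data = sales.merge(customers, on="customer_id", how="left")

data = data[["customer_id", "amount", "region", "age"]]             # reduction: keep needed attributes

data[["amount", "age"]] = MinMaxScaler().fit_transform(             # transformation: normalization
    data[["amount", "age"]])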

Steps involved in data cleaning


Data cleaning is a crucial step in the data preprocessing phase to ensure the quality and
reliability of the data for further analysis.
The steps involved in data cleaning typically include the following (a short illustrative sketch appears after the list):
1. Handling Missing Values
 Identify missing values in the dataset.
 Decide on the best strategy to deal with missing values, such as:
 Ignoring the tuples with missing values.
 Manually filling in the missing values.
 Using statistical methods like mean, median, or mode imputation.
 Predicting missing values using machine learning algorithms.
2. Handling Noisy Data:
 Noisy data includes outliers and errors that can affect the analysis.
 Techniques to handle noisy data include:
 Smoothing techniques to remove noise.
 Outlier detection and removal methods.
 Binning, clustering, or regression to smooth out noisy data.
3. Handling Inconsistent Data:
 Inconsistent data can arise from discrepancies in data representation or entry.
 Steps to handle inconsistent data involve:
 Standardizing data formats.
 Resolving naming inconsistencies.
 Detecting and correcting data integration issues.
4. Data Transformation:
 Transforming data into a suitable format for analysis.
 Techniques include:
 Normalization to scale numerical data.
 Encoding categorical data.
 Discretization for continuous data.
5. Data Reduction:
 Reducing the data volume while producing the same or similar analytical results.
 Methods for data reduction include:
 Dimensionality reduction techniques like PCA.
 Sampling methods to reduce the dataset size.
 Aggregation to summarize data.
6. Data Integration:
 Merging data from multiple sources into a coherent dataset.
 Addressing naming inconsistencies and redundancies.
 Ensuring data consistency and coherence across sources.
7. Discrepancy Detection:
 Identifying and resolving discrepancies in the data.
 Using metadata and domain knowledge to detect inconsistencies.
 Addressing errors in data entry, data decay, and data integration issues.
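
A minimal pandas illustration of the first two steps (the ages are invented): median imputation for missing values, an interquartile-range rule for flagging outliers, and binning to smooth noisy values.

# Data cleaning sketch: imputation, outlier flagging, and smoothing by binning (invented ages).
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, 31, np.nan, 29, 120, 35, np.nan, 27]})

df["age_imputed"] = df["age"].fillna(df["age"].median())         # median imputation

q1, q3 = df["age_imputed"].quantile([0.25, 0.75])                # interquartile range
iqr = q3 - q1
df["is_outlier"] = ((df["age_imputed"] < q1 - 1.5 * iqr) |
                    (df["age_imputed"] > q3 + 1.5 * iqr))        # flag noisy values

df["age_bin"] = pd.cut(df["age_imputed"], bins=3)                # smoothing by bin means
df["age_smoothed"] = df.groupby("age_bin", observed=True)["age_imputed"].transform("mean")
print(df)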

Data reduction strategies (3)


Data reduction strategies include dimensionality reduction, numerosity reduction, and data compression; a short illustrative sketch appears at the end of this section.
Dimensionality reduction:
 “Dimensionality reduction” is the process of reducing the number of random variables
or attributes under consideration.
 Dimensionality reduction methods include wavelet transforms and principal
components analysis, which transform or project the original data onto a smaller
space. Attribute subset selection is a method of dimensionality reduction in which
irrelevant, weakly relevant, or redundant attributes or dimensions are detected and
removed.
Numerosity reduction:
 “Numerosity reduction” techniques replace the original data volume by alternative,
smaller forms of data representation. These techniques may be parametric or
nonparametric.
 For parametric methods, a model is used to estimate the data, so that typically only the data parameters need to be stored instead of the actual data. (Outliers may also be stored.) Regression and log-linear models are examples.
 Nonparametric methods for storing reduced representations of the data include
histograms, clustering, sampling, and data cube aggregation.
Data compression:
 In “Data Compression”, transformations are applied so as to obtain a reduced or
“compressed” representation of the original data.
 If the original data can be reconstructed from the compressed data without any
information loss, the data reduction is called lossless.
 If, instead, we can reconstruct only an approximation of the original data, then the
data reduction is called lossy.
 There are several lossless algorithms for string compression; however, they typically
allow only limited data manipulation.
 Dimensionality reduction and numerosity reduction techniques can also be considered
forms of data compression.
“The computational time spent on data reduction should not outweigh or “erase” the time
saved by mining on a reduced data set size.”
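
As noted at the start of this section, here is a short illustrative sketch of two of the strategies on synthetic data: PCA for dimensionality reduction and random sampling as a simple numerosity-reduction technique.

# Data reduction sketch: PCA (dimensionality) and random sampling (numerosity), synthetic data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 50))                  # 10,000 rows, 50 attributes

pca = PCA(n_components=10).fit(data)                  # project onto a smaller space
reduced = pca.transform(data)
print("variance kept:", round(pca.explained_variance_ratio_.sum(), 3))

sample_idx = rng.choice(len(data), size=1_000, replace=False)
sample = data[sample_idx]                             # keep a random sample of the rows
print("reduced shape:", reduced.shape, "sample shape:", sample.shape)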
