Data Science
Data science is the study of data to extract meaningful insights for business. It is a
multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large
amounts of data.
Data science is used to study data in four main ways:
i. Descriptive analysis:
Descriptive analysis examines data to gain insights into what happened or what is
happening in the data environment. It is characterized by data visualizations such
as pie charts, bar charts, line graphs, tables, or generated narratives.
For example, a flight booking service may record data like the number of tickets
booked each day.
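A minimal sketch of descriptive analysis in Python with pandas (the booking figures below are made up for illustration):

```python
import pandas as pd

# Hypothetical daily ticket bookings for a flight service
bookings = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=7, freq="D"),
    "tickets": [120, 135, 128, 160, 155, 210, 198],
})

# Descriptive analysis summarizes what happened
print(bookings["tickets"].describe())        # count, mean, std, min, quartiles, max
print("Total tickets booked:", bookings["tickets"].sum())
```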
ii. Diagnostic analysis:
Diagnostic analysis is a deep-dive or detailed data examination to understand why
something happened. It is characterized by techniques such as drill-down, data
discovery, data mining, and correlations.
For example, the flight service might drill down on a particularly high-performing
month to better understand the booking spike.
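A small diagnostic drill-down sketch, again with pandas; the month, channel, and promo columns are assumed purely for illustration:

```python
import pandas as pd

# Hypothetical booking records with an assumed sales channel and promo flag
df = pd.DataFrame({
    "month":   ["May", "May", "Jun", "Jun", "Jun", "Jun"],
    "channel": ["web", "app", "web", "app", "agent", "web"],
    "promo":   [0, 0, 1, 1, 0, 1],
    "tickets": [90, 80, 150, 170, 60, 140],
})

# Drill down into the high-performing month to see where the spike came from
june = df[df["month"] == "Jun"]
print(june.groupby("channel")["tickets"].sum())

# Check whether the promotion correlates with higher bookings
print("promo/tickets correlation:", df["promo"].corr(df["tickets"]))
```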
iii. Predictive analysis:
Predictive analysis uses historical data to make accurate forecasts about data
patterns that may occur in the future. It is characterized by techniques such as
machine learning, forecasting, pattern matching, and predictive modeling.
For example, the flight service team might use data science to predict flight
booking patterns for the coming year at the start of each year.
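One possible forecasting sketch: fitting a simple linear trend to a year of made-up monthly totals with NumPy (real predictive work would use richer models and more data):

```python
import numpy as np

# Hypothetical monthly booking totals for the past year (months 1-12)
months = np.arange(1, 13)
tickets = np.array([100, 105, 120, 130, 128, 160, 190, 185, 150, 140, 135, 170])

# Fit a linear trend and forecast the next three months
slope, intercept = np.polyfit(months, tickets, deg=1)
for m in range(13, 16):
    print(f"month {m}: forecast {slope * m + intercept:.0f} tickets")
```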
iv. Prescriptive analysis:
Prescriptive analysis takes predictive analysis a step further. It not only predicts
what is likely to happen but also suggests an optimum response to that outcome.
For the flight booking example, prescriptive analysis could look at historical
marketing campaigns to maximize the advantage of the upcoming booking spike.
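A toy prescriptive step under assumed campaign data: rank hypothetical historical campaigns by bookings per dollar and recommend the best one:

```python
# Hypothetical historical campaign results: spend vs. extra bookings generated
campaigns = {
    "email":      {"spend": 1000, "extra_bookings": 400},
    "social_ads": {"spend": 3000, "extra_bookings": 900},
    "search_ads": {"spend": 2000, "extra_bookings": 800},
}

# Prescriptive analysis recommends an action, not just a prediction
best = max(campaigns,
           key=lambda c: campaigns[c]["extra_bookings"] / campaigns[c]["spend"])
print("Recommended campaign for the expected booking spike:", best)
```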
The Data Science Process involves a series of steps that data scientists follow to extract
insights and valuable information from data. The typical data science process can be
summarized using the OSEMN framework:
Obtain Data (O):
In this initial step, data scientists gather the data needed for analysis. This data can be
pre-existing, newly acquired, or obtained from various sources such as databases,
CRM software, web server logs, social media, or third-party sources.
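A self-contained sketch of obtaining data: a tiny in-memory SQLite table stands in for a CRM database here, and the table and column names are invented for the example:

```python
import sqlite3
import pandas as pd

# Build a tiny in-memory "CRM" table so the example is self-contained
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (customer_id INTEGER, tickets INTEGER)")
conn.executemany("INSERT INTO bookings VALUES (?, ?)", [(1, 2), (2, 5), (3, 1)])

# Obtain the data; in practice the source might equally be a CSV file,
# web server logs, social media, or a third-party API
df = pd.read_sql_query("SELECT * FROM bookings", conn)
conn.close()
print(df)
```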
Scrub Data (S):
Data cleaning or data scrubbing involves preprocessing the data to handle missing
values, outliers, and inconsistencies. This step ensures that the data is accurate and
ready for analysis.
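A scrubbing sketch with pandas on made-up raw data, handling a missing value, an impossible outlier, and inconsistent text casing:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with a missing value, an outlier, and messy casing
raw = pd.DataFrame({
    "age":  [25, np.nan, 34, 29, 240],   # 240 is clearly an entry error
    "city": ["Pune", "pune", "Delhi", None, "Delhi"],
})

clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())    # impute missing value
clean = clean[clean["age"].between(0, 120)]                  # drop impossible outlier
clean["city"] = clean["city"].str.title().fillna("Unknown")  # fix inconsistent casing
print(clean)
```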
Explore Data (E):
Data exploration involves conducting preliminary analysis to understand the data
better. Descriptive statistics and data visualization tools are used to identify patterns
and trends in the data.
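An exploration sketch using pandas descriptive statistics and a quick plot (the plot call assumes matplotlib is installed; the data is illustrative):

```python
import pandas as pd

# Hypothetical booking data
df = pd.DataFrame({
    "tickets": [120, 135, 128, 160, 155, 210, 198],
    "price":   [99, 105, 101, 120, 118, 150, 145],
})

print(df.describe())    # summary statistics for each column
print(df.corr())        # pairwise correlations between numeric columns
df.plot(kind="scatter", x="price", y="tickets")  # visual check for a relationship
```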
Model Data (M):
In this step, data scientists apply various machine learning algorithms to gain deeper
insights, predict outcomes, and prescribe the best course of action. Techniques like
classification, association, and clustering are used to analyze the data.
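A modeling sketch: training a decision-tree classifier on scikit-learn's built-in iris dataset, one of many algorithms that could be applied at this step:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Classification sketch on a built-in dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```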
Interpret Data (N):
Once the data has been modeled, data scientists interpret the results to derive
meaningful insights and make data-driven decisions. This step involves understanding
the implications of the analysis and communicating the findings effectively.
Data models can generally be divided into three categories, which vary according to their
degree of abstraction. The process will start with a conceptual model, progress to a logical
model and conclude with a physical model.
1) Conceptual data models:
They are also referred to as domain models and offer a big-picture view of
what the system will contain, how it will be organized, and which business
rules are involved.
Typically, they include entity classes, their characteristics and constraints, the
relationships between them and relevant security and data integrity
requirements.
2) Logical data models:
They are less abstract and provide greater detail about the concepts and
relationships in the domain under consideration.
These indicate data attributes, such as data types and their corresponding
lengths, and show the relationships among entities.
3) Physical data models:
They provide a schema for how the data will be physically stored within a
database.
They offer a finalized design that can be implemented as a relational database,
including associative tables that illustrate the relationships among entities as
well as the primary keys and foreign keys that will be used to maintain those
relationships.
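A small illustration of a physical data model, expressed as SQLite DDL run from Python; the customer/address schema echoes the modeling steps below and is purely illustrative:

```python
import sqlite3

# Illustrative physical model: concrete tables, primary keys, and a foreign key
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE address (
    address_id INTEGER PRIMARY KEY,
    street     TEXT NOT NULL,
    city       TEXT NOT NULL,
    zip_code   TEXT NOT NULL
);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    first_name  TEXT NOT NULL,
    last_name   TEXT NOT NULL,
    address_id  INTEGER REFERENCES address(address_id)  -- foreign key to address
);
""")
conn.close()
```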
As a discipline, data modeling invites stakeholders to evaluate data processing and storage in
painstaking detail. Data modeling techniques have different conventions that dictate which
symbols are used to represent the data, how models are laid out, and how business
requirements are conveyed.
1. Identify the entities: The process of data modeling begins with the identification of the
things, events or concepts that are represented in the data set that is to be modeled. Each
entity should be cohesive and logically discrete from all others.
2. Identify key properties of each entity: Each entity type can be differentiated from all
others because it has one or more unique properties, called attributes. For instance, an entity
called “customer” might possess such attributes as a first name, last name, telephone number
and salutation, while an entity called “address” might include a street name and number, a
city, state, country and zip code.
3. Identify relationships among entities: The earliest draft of a data model will specify the
nature of the relationships each entity has with the others. In the above example, each
customer “lives at” an address. If that model were expanded to include an entity called
“orders,” each order would be shipped to and billed to an address as well. These relationships
are usually documented via unified modeling language (UML).
4. Assign keys as needed, and decide on a degree of normalization that balances the
need to reduce redundancy with performance requirements: Normalization is a technique
for organizing data models (and the databases they represent) in which numerical identifiers,
called keys, are assigned to groups of data to represent relationships between them without
repeating the data.
For instance, if customers are each assigned a key, that key can be linked to both their
address and their order history without having to repeat this information in the table of
customer names (see the sketch after this list). Normalization tends to reduce the amount
of storage space a database will require, but it can come at a cost to query performance.
5. Finalize and validate the data model: Data modeling is an iterative process that should
be repeated and refined as business needs change.
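The normalization sketch referenced in step 4 above: customers are stored once under a key, and orders reference that key instead of repeating customer details (all values are made up):

```python
import pandas as pd

# Customers are stored once, keyed by customer_id
customers = pd.DataFrame({
    "customer_id": [1, 2],
    "name":        ["Asha", "Ravi"],
    "city":        ["Pune", "Delhi"],
})
# Orders reference the key instead of repeating customer details
orders = pd.DataFrame({
    "order_id":    [10, 11, 12],
    "customer_id": [1, 1, 2],     # foreign key into the customers table
    "amount":      [250, 120, 400],
})

# Joining on the key reassembles the full picture when a query needs it
print(orders.merge(customers, on="customer_id"))
```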
Statistical modeling
The statistical modeling process is a way of applying statistical analysis to datasets in data
science. A statistical model expresses a mathematical relationship between random and non-
random variables.
Applying statistical models to raw data can yield intuitive visualizations that help data
scientists identify relationships between variables and make predictions.
Examples of common data sets for statistical analysis include census data, public health data,
and social media data.
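A minimal statistical-modeling sketch with SciPy: fitting a simple linear model to made-up paired observations and reading off the fitted relationship:

```python
from scipy import stats

# Hypothetical paired observations: advertising spend (in lakhs) vs. bookings
spend    = [1.0, 2.0, 3.0, 4.0, 5.0]
bookings = [110, 135, 160, 178, 205]

# Fit a simple linear statistical model: bookings = intercept + slope * spend
result = stats.linregress(spend, bookings)
print(f"slope={result.slope:.1f}, intercept={result.intercept:.1f}, "
      f"r={result.rvalue:.3f}")
```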
➢Unsupervised learning:
• In the unsupervised learning model, the algorithm is given unlabelled data and
attempts to extract features and determine patterns independently. Clustering
algorithms and association rules are examples of unsupervised learning. For
example:
K-means clustering: The algorithm partitions the data points into a specified
number of groups based on similarity (a sketch follows below).
Reinforcement learning, by contrast, is a related but distinct paradigm rather than a
form of unsupervised learning: the algorithm is trained by iterating over many
attempts, rewarding moves that result in favourable outcomes and penalizing
actions that produce undesired effects.
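The K-means sketch referred to above, using scikit-learn on a handful of made-up 2-D points:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled 2-D points forming two loose groups (illustrative data)
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [8, 9]])

# K-means with k=2 assigns each point to the nearest of two cluster centres
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels:", km.labels_)
print("centres:\n", km.cluster_centers_)
```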
Name the different types of attributes and explain briefly with examples.
The different types of attributes are:
1.Nominal Attributes:
Nominal means “relating to names.” The values of a nominal attribute are symbols or
names of things. Each value represents some kind of category, code, or state, and so
nominal attributes are also referred to as categorical. The values do not have any
meaningful order. In computer science, the values are also known as enumerations.
Example: Suppose that hair colour and marital status are two attributes describing person
objects. In our application, possible values for hair colour are black, brown, blond, red,
auburn, grey, and white. The attribute marital status can take on the values single,
married, divorced, and widowed. Both hair colour and marital status are nominal
attributes.
2.Binary Attributes:
A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0
typically means that the attribute is absent, and 1 means that it is present.
Example: Given the attribute smoker describing a patient object, 1 indicates that the
patient smokes, while 0 indicates that the patient does not. Similarly, the patient
undergoes a medical test that has two possible outcomes. The attribute medical test is
binary, where a value of 1 means the result of the test for the patient is positive, while 0
means the result is negative.
3. Ordinal Attributes:
An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Example: Suppose that drink size corresponds to the size of drinks available at a fast-food
restaurant. This ordinal attribute has three possible values: small, medium, and large.
The values have a meaningful sequence (which corresponds to increasing drink size);
however, we cannot tell from the values how much bigger, say, a large is than a medium.
Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and
professional rank. Professional ranks can be enumerated in a sequential order: for
example, assistant, associate, and full for professors, and private, private first class,
specialist, corporal, and sergeant for army ranks.
4. Numeric Attributes:
A numeric attribute is quantitative; that is, it is a measurable quantity, represented in
integer or real values.
• Numeric attributes can be interval-scaled or ratio-scaled.
a) Interval-Scaled Attributes:
• Interval-scaled attributes are measured on a scale of equal-size units. The values
of interval-scaled attributes have order and can be positive, 0, or negative. Thus,
in addition to providing a ranking of values, such attributes allow us to compare
and quantify the difference between values.
• Example: A temperature attribute is interval-scaled. Suppose that we have the
outdoor temperature value for a number of different days, where each day is an
object. By ordering the values, we obtain a ranking of the objects with respect to
temperature. In addition, we can quantify the difference between values. For
example, a temperature of 20 °C is five degrees higher than a temperature of 15 °C.
Calendar dates are another example. For instance, the years 2002 and 2010 are
eight years apart
b) Ratio-Scaled Attributes:
• A ratio-scaled attribute is a numeric attribute with an inherent zero-point. That is,
if a measurement is ratio-scaled, we can speak of a value as being a multiple (or
ratio) of another value. In addition, the values are ordered, and we can also
compute the difference between values, as well as the mean, median, and mode.
• Example: Unlike temperatures in Celsius and Fahrenheit, the Kelvin (K)
temperature scale has what is considered a true zero-point (0 K = −273.15 °C): it is
the point at which the particles that comprise matter have zero kinetic energy.
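A compact pandas sketch showing how the four attribute types might be represented in one table (all values are illustrative):

```python
import pandas as pd

# One column per attribute type discussed above
df = pd.DataFrame({
    "hair_colour": pd.Categorical(["black", "brown", "blond"]),   # nominal
    "smoker":      [1, 0, 1],                                     # binary
    "drink_size":  pd.Categorical(["small", "large", "medium"],
                                  categories=["small", "medium", "large"],
                                  ordered=True),                  # ordinal
    "temp_c":      [20.0, 15.0, 25.0],                            # interval-scaled
    "temp_k":      [293.15, 288.15, 298.15],                      # ratio-scaled
})

# Ordered categories support meaningful comparisons even without magnitudes
print(df["drink_size"].min(), "<", df["drink_size"].max())
# Ratios are meaningful only on the Kelvin (ratio) scale, not on Celsius
print(df["temp_k"][2] / df["temp_k"][1])
```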