
UNIT 4 Data Science Notes

Uploaded by

parth170309
UNIT-4 DATA SCIENCE

1. What is Data Science?


Data Science is a concept that unifies statistics, data analysis, machine learning,
and their related methods in order to understand and analyse actual phenomena
with data. It employs techniques and theories drawn from many fields within the
context of Mathematics, Statistics, Computer Science, and Information Science.
2. What are the applications of Data Science?
Data Science is not a new field. It centres on analysing data, and when it comes
to AI, that analysis helps in making the machine intelligent enough to perform
tasks by itself. Data Science has various applications in today’s world. Some of
them are:
• Fraud and Risk Detection
• Genetics & Genomics
• Internet Search
• Targeted Advertising
• Website Recommendations
• Airline Route Planning

3. What are the different sources of collecting data?


Data of any required type can be collected from various sources, and the data
collection process can be categorised in two ways: Offline and Online.

4. What points should be kept in mind while accessing data from any of the data
sources?
While accessing data from any data source, the following points should be kept
in mind:
1) Only data which is available for public usage should be taken up.
2) Personal datasets should only be used with the consent of the owner.
3) One should never breach someone’s privacy to collect data.
4) Data should only be taken from reliable sources, as data collected from
random sources can be wrong or unusable.
5) Reliable sources of data ensure the authenticity of data, which helps in
proper training of the AI model.

5. What are some of the popular formats for storing data?
1) CSV: CSV stands for comma-separated values. It is a simple file format used
to store tabular data. Each line of the file is a data record, and each record
consists of one or more fields separated by commas. Because the values in
each record are separated by commas, these files are known as CSV files.
2) Spreadsheet: A spreadsheet is a piece of paper or a computer program
used for accounting and recording data, with rows and columns into which
information can be entered. Microsoft Excel is a program which helps in
creating spreadsheets.
3) SQL: SQL (Structured Query Language) is a domain-specific programming
language designed for managing data held in different kinds of DBMS
(Database Management System). It is particularly useful in handling
structured data.
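
The CSV and SQL formats described above can be tried out with Python's standard library alone. The following is a minimal sketch, assuming a small made-up CSV text and a hypothetical table named students; neither comes from these notes.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content: one header line, then one record per line.
csv_text = "name,marks\nAsha,82\nRavi,74\n"

# csv.DictReader splits every record into fields at the commas.
records = list(csv.DictReader(io.StringIO(csv_text)))
print(records[0]["name"], records[0]["marks"])  # CSV fields are read as strings

# Load the same records into an in-memory SQLite database and query them with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO students (name, marks) VALUES (?, ?)",
    [(r["name"], int(r["marks"])) for r in records],
)
top = conn.execute(
    "SELECT name FROM students WHERE marks > 80 ORDER BY name"
).fetchall()
print(top)
conn.close()
```

Note that the CSV reader returns every field as text, so the marks are converted with int() before being stored in the database.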

6. What is NumPy?
NumPy, which stands for Numerical Python, is the fundamental package for
mathematical and logical operations on arrays in Python. It is commonly used
for working with numbers and provides a wide range of arithmetic operations
that make them easier to work with. NumPy is built around arrays, which are
homogeneous collections of data.
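
As a rough illustration of the arithmetic and logical operations mentioned above, the sketch below uses two small made-up arrays; the variable names are chosen only for this example.

```python
import numpy as np

# Two hypothetical arrays of the same length.
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

total = a + b      # arithmetic is applied element by element
scaled = a * 2     # a single number is applied to every element
mask = a > 2       # a logical operation gives an array of booleans
mean_b = b.mean()  # built-in mathematical helpers work on the whole array

print(total, scaled, mask, mean_b)
```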

7. Define Array.
An array is a set of multiple values which are of the same datatype. The values
can be numbers, characters, booleans, etc., but a single array holds only one
datatype. In NumPy, the arrays used are known as ND-arrays (N-Dimensional
Arrays), as NumPy comes with a feature for creating n-dimensional arrays in
Python.
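
The single-datatype and n-dimensional properties can be checked directly on a NumPy array. This is a small sketch with made-up marks; the values and names are assumptions for illustration.

```python
import numpy as np

# A hypothetical 2-dimensional array: 2 rows (students) by 3 columns (subjects).
marks = np.array([[82, 74, 91], [65, 88, 79]])
print(marks.ndim)   # number of dimensions
print(marks.shape)  # size along each dimension
print(marks.dtype)  # the one datatype shared by every element

# Mixing integers and a float still gives a single datatype:
# NumPy converts all the values to floating point.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)
```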

8. What is the difference between Array and List?


9. What is Pandas?
Pandas is a software library written for the Python programming language for data
manipulation and analysis. It offers data structures and operations for
manipulating numerical tables and time series. The name is derived from the terms
"panel data" and "Python data analysis". Panel data refers to observations taken
over multiple time periods for the same individuals.

10. Pandas is well suited for which kinds of data?


Pandas is well suited for many kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or
Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row
and column labels
• Any other form of observational / statistical data sets. The data need not be
labelled at all to be placed into a Pandas data structure

11. What are the two primary data structures of Pandas?


The two primary data structures of Pandas, Series (1-dimensional) and DataFrame
(2-dimensional), handle the vast majority of typical use cases in finance, statistics,
social science, and many areas of engineering.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as
non-floating-point data
• Size mutability: columns can be inserted and deleted from DataFrame and higher
dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of
labels, or the user can simply ignore the labels and let Series, DataFrame, etc.
automatically align the data for you in computations
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
• Intuitive merging and joining data sets
• Flexible reshaping and pivoting of data sets
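
A few of the features listed above can be sketched with a Series and a DataFrame. The student names, marks, and cities below are made up for illustration and are not part of these notes.

```python
import numpy as np
import pandas as pd

# Series: 1-dimensional labelled data, with one missing value (NaN).
s = pd.Series([82.0, np.nan, 91.0], index=["Asha", "Ravi", "Meena"])

# DataFrame: 2-dimensional labelled data with columns of different types.
df = pd.DataFrame(
    {"name": ["Asha", "Ravi", "Meena"], "marks": [82.0, np.nan, 91.0]}
)

# Missing data: count the NaN entries in a column.
missing = int(df["marks"].isna().sum())

# Size mutability: insert a column, then delete it again.
df["passed"] = df["marks"] >= 40
df = df.drop(columns=["passed"])

# Automatic alignment: values are matched by label, not by position.
bonus = pd.Series([5.0, 5.0], index=["Meena", "Asha"])
adjusted = s + bonus

# Label-based selection with a boolean condition.
high = s[s > 85]

# Merging: join a second table on the shared "name" column.
cities = pd.DataFrame({"name": ["Asha", "Meena"], "city": ["Pune", "Delhi"]})
merged = df.merge(cities, on="name", how="left")
print(merged)
```

In the merged table, the student with no matching row in the second table gets NaN in the city column, which is how Pandas marks the missing value.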

12. What is Matplotlib?


Matplotlib is a multiplatform data visualization library in Python for 2D plots,
built on NumPy arrays. One of the greatest benefits of visualization is that it
gives us visual access to huge amounts of data in an easily digestible form.
Matplotlib comes with a wide variety of plots. Plots help us understand trends
and patterns and make correlations. They are typically instruments for reasoning
about quantitative information. Some types of graphs that we can make with this
package are listed below:
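
For instance, a simple line plot over NumPy arrays can be sketched as follows. The data is made up, and the non-interactive "Agg" backend and the in-memory buffer are choices made here so that the script runs without a display; they are not requirements of Matplotlib.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, chosen so no window is needed
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: the numbers 1 to 5 and their squares.
x = np.arange(1, 6)
y = x ** 2

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")  # draw a line plot with a marker at every point
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("Line plot of hypothetical data")

# Render the figure as a PNG into memory instead of writing a file.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
```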
13. What are the Statistical tools widely used in Python?
Statistical tools widely used in Python are:

14. What are types of issues we can face with data?


While collecting data, it is possible that the data might come with some errors. Let
us first look at the types of issues we can face with data:
1) Erroneous Data: There are two ways in which the data can be erroneous:
• Incorrect values: The values in the dataset (at random places) are incorrect.
For example, there is a decimal value in the phone number column, or a name
in the marks column, etc.
• Invalid or Null values: At some places, the values get corrupted and hence
become invalid. Many times, you will find NaN values in the dataset. These
are null values which do not hold any meaning and cannot be processed.
2) Missing Data: In some datasets, some cells remain empty. The values of these
cells are missing and hence the cells remain empty.
3) Outliers: Data which does not fall within the usual range of values for a certain
element is referred to as an outlier. To understand this better, let us take an
example of marks of students in a class. Let us assume that a student was absent
for the exams and hence has got 0 marks. If his marks are considered, the whole
class’s average would go down. To prevent this, the average is taken over the
range of marks from highest to lowest, keeping this particular result separate.
This makes sure that the average marks of the class are true according to the data.
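
The three kinds of issues can be sketched on a small made-up marks table using Pandas. The names and values below are assumptions for illustration, and treating a mark of 0 as the absent student's outlier is a simplification made only for this example.

```python
import pandas as pd

# Hypothetical raw data: "Ravi" is an incorrect value in the marks column,
# None is a missing value, and "0" belongs to a student who was absent.
raw = pd.DataFrame(
    {
        "student": ["Asha", "Ravi", "Meena", "Kabir", "Zoya"],
        "marks": ["82", "Ravi", None, "0", "74"],
    }
)

# Erroneous data: values that are not numbers are turned into NaN.
raw["marks"] = pd.to_numeric(raw["marks"], errors="coerce")

# Missing or invalid data: count the NaN cells, then drop those rows.
n_missing = int(raw["marks"].isna().sum())
clean = raw.dropna(subset=["marks"])

# Outlier: compare the class average with and without the absent student.
mean_all = clean["marks"].mean()
mean_without_absent = clean.loc[clean["marks"] > 0, "marks"].mean()
print(n_missing, mean_all, mean_without_absent)
```

Keeping the absent student's result separate raises the average from 52.0 to 78.0, which matches the reasoning in the outlier example above.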
