Module 1 Data Science
2. Data Collection:
Once the problem has been defined, the next step is to collect and
prepare the relevant data for analysis.
This involves identifying the data sources, acquiring the data,
and transforming it into a format suitable for analysis.
Data scientists can collect data from various sources, including
internal databases, external APIs, web scraping, and surveys.
During the data collection process, it is essential to ensure the
privacy and security of the data, especially when dealing with
sensitive or personally identifiable information.
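The collection and preparation steps above can be sketched in a few
lines of Python. This is a minimal illustration, not a prescribed
method: the CSV text stands in for an export from an internal database,
and the column names (customer_email, amount), the pseudonymize helper,
and the salt value are all hypothetical. Hashing an identifier is only
one basic safeguard; real projects usually need more than this.

```python
import csv
import hashlib
import io

# Hypothetical raw export; in practice this might come from a database or an API.
RAW_CSV = """customer_email,amount
alice@example.com,120.50
bob@example.com,80
"""


def pseudonymize(value, salt="demo-salt"):
    """Replace an identifier with a salted SHA-256 digest (illustrative only)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def collect_and_prepare(raw_text):
    """Parse the raw CSV and transform it into analysis-ready records."""
    records = []
    for row in csv.DictReader(io.StringIO(raw_text)):
        records.append({
            # Keep a stable pseudonymous key instead of the raw email address.
            "customer_id": pseudonymize(row["customer_email"]),
            # Convert the text field into a number suitable for analysis.
            "amount": float(row["amount"]),
        })
    return records


prepared = collect_and_prepare(RAW_CSV)
for record in prepared:
    print(record["customer_id"][:8], record["amount"])
```

The prepared records no longer contain the email address itself, only
a digest that still lets rows from the same customer be grouped.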
5. Model Deployment:
After successfully building and evaluating the data model, the next
crucial phase in the data science lifecycle is deployment and
maintenance.
Deployment strategies
They also present their results in a clear way and communicate with
company leaders.
Skills:
1. Data Analyst:
Data analysts are responsible for reviewing data to identify key
information about the business and its customers.
They collect, process, and analyse data to extract meaningful
insights, and they support the decision-making process.
2. Data Scientist:
3. Data Engineer:
Data engineers are responsible for designing, maintaining, and
optimizing the data infrastructure used to manage and transform
data.
Data engineers are in charge of creating pipelines that convert raw
data into formats that data scientists can use.
4. Business Analyst:
Business analysts help an organization fulfil its goals. They assess
the organization, analyze its data, and improve its systems and
processes for the future.
They are experts in forecasting, budgeting, and allocating
resources in the business.
5. Data Architect:
Data architects are IT professionals who use their computer science
and design skills to analyze and review a business's data
infrastructure, plan the databases it will need in the future, and
implement suitable solutions.
Structured data
This data is organized and easy to search because it has a fixed record
format. It's usually stored in data warehouses and is often in the form
of numbers and text.
Structured data is typically tabular with rows and columns that clearly
define data attributes.
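A small sketch of what "tabular with rows and columns" means in
practice, using Python's built-in sqlite3 module with an in-memory
database. The sales table and its columns are hypothetical and chosen
only for illustration.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (region, amount) VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 50.0)],
)

# Because every row has the same columns, searching and aggregating is direct.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall()
)
conn.close()
print(totals)
```

The fixed record format is what makes the GROUP BY query possible:
every row is guaranteed to have a region and an amount.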
Unstructured data
This data doesn't fit neatly into a data table because of its size or
nature. It's often stored in its native format and can be human or
machine generated.
Unstructured data can include multimedia files, emails, text messages,
mobile activity, social media posts, satellite imagery, and more.
Unstructured data is usually stored in data lakes, which are
repositories that store data in its original format or after a basic
cleaning process.
Semi-structured data is a type of data that is not purely structured, but
also not completely unstructured.
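JSON is a common example of semi-structured data: each record carries
its own field names, but records need not share the same fields. The
sketch below, with hypothetical records, shows one simple way to bring
such data into a uniform table by filling the gaps with None. It is an
illustration under these assumptions, not the only approach.

```python
import json

# Hypothetical JSON records: each has keys, but not all share the same keys.
RAW_JSON = """[
  {"name": "Asha", "email": "asha@example.com"},
  {"name": "Ravi", "phone": "555-0100", "tags": ["vip"]}
]"""

records = json.loads(RAW_JSON)

# Collect every key that appears, then fill gaps with None to get a uniform table.
columns = sorted({key for record in records for key in record})
table = [[record.get(column) for column in columns] for record in records]

print(columns)
for row in table:
    print(row)
```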
Velocity:
o Velocity refers to the high speed at which data accumulates.
o In Big Data, data flows in from sources such as machines,
networks, social media, and mobile phones.
o There is a massive and continuous flow of data. Velocity
determines the potential of the data: how fast it is
generated and processed to meet demand.
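One simple way to put a number on velocity is to count how many events
arrive per unit of time. The sketch below uses made-up arrival times
(in seconds) and one-second windows; both are assumptions for
illustration only.

```python
from collections import Counter

# Hypothetical arrival times (in seconds) of events from a data source.
arrival_times = [0.1, 0.4, 0.9, 1.2, 1.3, 1.8, 1.9, 2.5]

# Count how many events arrived in each one-second window.
events_per_second = Counter(int(t) for t in arrival_times)

# Find the busiest window and how many events it contained.
peak_second, peak_count = max(events_per_second.items(), key=lambda item: item[1])
print(dict(events_per_second))
print("peak:", peak_second, peak_count)
```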
Variety:
o It refers to the nature of the data: structured,
semi-structured, or unstructured.
o It also refers to heterogeneous sources.
o Variety is essentially the arrival of data from new sources,
both inside and outside an enterprise. The data can be
structured, semi-structured, or unstructured.
Veracity:
o It refers to inconsistencies and uncertainty in data: the
available data can be messy, and its quality and accuracy
are difficult to control.
o Big Data is also variable because of the multitude of data
dimensions resulting from multiple disparate data types
and sources.
o Example: Data in bulk can create confusion, whereas too
little data can convey only partial or incomplete
information.
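Veracity problems can often be surfaced with a few basic checks. The
sketch below counts missing values, impossible values, and
inconsistent spellings in a handful of made-up records. The field
names and the rules (for example, that a negative age is invalid) are
illustrative assumptions.

```python
# Hypothetical records with typical quality problems: a missing value,
# inconsistent spelling of a category, and an impossible value.
records = [
    {"city": "Mumbai", "age": 34},
    {"city": "mumbai ", "age": None},
    {"city": "Delhi", "age": -5},
    {"city": "Delhi", "age": 41},
]


def quality_report(rows):
    """Count simple quality issues; the rules here are illustrative assumptions."""
    missing_age = sum(1 for row in rows if row["age"] is None)
    invalid_age = sum(
        1 for row in rows if row["age"] is not None and row["age"] < 0
    )
    # Distinct spellings before and after trimming spaces and lowercasing.
    raw_cities = {row["city"] for row in rows}
    normalized_cities = {row["city"].strip().lower() for row in rows}
    return {
        "missing_age": missing_age,
        "invalid_age": invalid_age,
        "inconsistent_city_spellings": len(raw_cities) - len(normalized_cities),
    }


report = quality_report(records)
print(report)
```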
Value:
Value is an essential characteristic of big data. What matters is
not simply the data that we process or store, but valuable and
reliable data that we store, process, and analyze.
Sources of Data:
What are the different sources of data?
1. Internal sources
When data is collected from the reports and records of the
organisation itself, these are known as internal sources.
For example, a company publishes its annual report on profit
and loss, total sales, loans, wages, etc.
2. External sources
A) Primary data
Primary data means first-hand information collected by an
investigator.
It is collected for the first time.
It is original and more reliable.
For example, the population census conducted by the Government
of India every ten years is primary data.
Data that is raw, original, and extracted directly from
official sources is known as primary data.
2. Survey method:
The survey method is a research process in which a list of
relevant questions is asked and the answers are recorded as
text, audio, or video.
Surveys can be conducted online (for example, through website
forms and email) or offline.
3. Observation method:
B) Secondary data
Secondary data refers to second-hand information.
It is not collected originally but obtained from already
published or unpublished sources.
For example, the address of a person taken from the telephone
directory or the phone number of a company taken from Just
Dial is secondary data.
Secondary data is data that has already been collected and is
reused for some valid purpose.
This type of data is previously recorded from primary data, and it
has two types of sources: internal and external.
Other sources: