Google Interview Warmup Questions
1. Explain some of the errors or problems you might look for as part
of the data cleaning process.
In the real world, it is next to impossible to have clean data that is ready to be
analyzed or visualized. Some of the errors or problems that I usually look for are:
1. Null or missing values.
2. Typos in column names, or names that could be improved.
3. Incorrect data types in columns.
4. Leading or trailing spaces, which are common in text data.
5. Outliers in the data.
6. Dates that are not in the format required by the organization
(e.g. MM/DD/YYYY, DD/MM/YYYY, YYYY/MM/DD).
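The six checks above can be sketched in Python with pandas. This is only an illustration: the table, its column names, and the 0 to 120 age range used to flag outliers are assumptions made up for the example, not part of the original answer.

```python
import pandas as pd

# Hypothetical raw data containing the kinds of problems listed above.
raw = pd.DataFrame({
    "Cust Name ": ["  Alice", "Bob  ", None, "Dana"],
    "age": ["34", "29", "41", "250"],
    "signup_date": ["01/15/2023", "02/20/2023", "03/05/2023", "04/10/2023"],
})

# 1. Count null values per column.
null_counts = raw.isna().sum()

# 2. Tidy column names: strip spaces, lower-case, use underscores.
clean = raw.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

# 3. Fix data types: age is stored as text but should be numeric.
clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

# 4. Remove leading and trailing spaces from text values.
clean["cust_name"] = clean["cust_name"].str.strip()

# 5. Flag outliers with a simple range rule (the bounds are an assumption).
outliers = clean[(clean["age"] < 0) | (clean["age"] > 120)]

# 6. Parse MM/DD/YYYY text into real datetime values.
clean["signup_date"] = pd.to_datetime(clean["signup_date"], format="%m/%d/%Y")

print(null_counts)
print(outliers)
```

In practice the outlier rule and the target date format would come from the organization's requirements rather than being hard-coded like this.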
2. Some large companies store data in data centers in multiple
countries. Why does it matter which countries your data is stored
in?
Disaster recovery: Storing data in multiple countries can help ensure that a company's operations can continue in the event of
a disaster or outage in one location.
Compliance with data protection and privacy laws: Different countries have different laws and regulations
regarding data protection and privacy, and data is generally subject to the laws of the country where it is stored.
Some regulations also restrict where personal data may be stored or transferred, so the choice of country
determines which rules a company has to comply with.
Storing data in multiple locations can help improve performance and user experience for customers or
users in different regions by reducing latency. Latency is the time it takes for data to travel from the user's
device to the server and back.
The farther away the server is from the user, the higher the latency will be. By storing data in multiple
locations, a company can ensure that the server hosting the data is physically closer to the user, reducing
latency and improving the overall user experience.
Additionally, having data stored in multiple locations can help with traffic handling and load balancing. This
means that if a data center in one location becomes unavailable, traffic can be directed to another
location, improving the availability of the service.
3. Analysts often need to combine data sets from different
sources using joins. Can you describe the common types of joins
you may need to complete?
Joins in SQL combine columns from two or more tables based on a related column.
Depending on the use case, a different type of JOIN may be used.
Inner Join outputs only the rows that have a match in both tables.
Left Join outputs all rows from the first table, with the matching data from the
second table where it exists and NULLs where it does not
(Inner Join + Unmatched First Table Rows).
Right Join outputs all rows from the second table, with the matching data from the
first table where it exists and NULLs where it does not
(Inner Join + Unmatched Second Table Rows).
There is also Self-Join, which is used to join a table with itself.
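As a rough illustration of how the join types differ, the sketch below uses the pandas merge method, whose "how" argument mirrors the SQL join types. The two small tables and their column names are made up for the example.

```python
import pandas as pd

# Hypothetical tables: customer 4 has an order but no customer record,
# and customers 2 and 3 have no orders.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Ben", "Cy"]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 1, 4]})

# Inner join: only rows whose customer_id appears in both tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# Left join: every customer, with NaN order columns where there is no order.
left = customers.merge(orders, on="customer_id", how="left")

# Right join: every order, with NaN customer columns where there is no customer.
right = customers.merge(orders, on="customer_id", how="right")

print(len(inner), len(left), len(right))
```

Comparing the row counts of the three results is a quick way to see how many rows on each side have no match.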
4. Can you describe what a subquery is in SQL?
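A subquery is a query nested inside another query, typically in the SELECT, FROM,
or WHERE clause; the main query uses its result.
When the subquery has no correlation with the main query, it can generally be
executed once, first, and its result reused. When there is a correlation (the
subquery references columns of the main query), it is conceptually evaluated for
each row of the main query, although the query optimizer decides the actual
execution plan.
For readability, and in some cases efficiency, it is often preferable to avoid
deeply nested subqueries. An alternative to subqueries can be a Common Table
Expression (CTE).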
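To make the comparison concrete, here is a small sketch that runs the same logic once as a subquery and once as a CTE, using Python's built-in sqlite3 module. The orders table and its values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, 30.0), (3, 50.0)],
)

# Subquery: the inner SELECT computes the average amount,
# and the outer query keeps orders above that average.
subquery_sql = """
    SELECT order_id FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
    ORDER BY order_id
"""

# CTE: the same average is given a name first, then used in the main query.
cte_sql = """
    WITH avg_order AS (SELECT AVG(amount) AS avg_amount FROM orders)
    SELECT order_id FROM orders, avg_order
    WHERE amount > avg_amount
    ORDER BY order_id
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(subquery_rows, cte_rows)
```

Both queries return the same rows; the CTE version mainly helps when the named step is long or reused several times.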
5. You're putting together a presentation for a client using several
data visualizations. What steps would you take in your final
review of the presentation to make sure the information is clearly
communicated?
First, I would confirm who my target audience is for the presentation. What is their
technical expertise in this subject matter?
To make sure that information is clearly communicated, I would review the
visualizations as well. Are they too intimidating? Do they contain too much information?
If a slide with a lot of information is required for some reason, I would reveal it gradually using
animation so that the information is absorbed without difficulty.
I would check that the presentation tells a clear story, because people remember stories much
better.
I would also include a Q&A section at the end of the presentation so that the stakeholders'
questions can be answered.
6. When would you choose to use a programming language like
R instead of spreadsheets or SQL?
Data Manipulation: R has a wide range of libraries and packages for data manipulation, such as "dplyr" and "tidyr", that make it
easy to clean, reshape, and transform data. These libraries offer a more powerful and flexible set of tools than are typically
available in a spreadsheet program.
Data Visualization: R has a number of powerful libraries for data visualization, such as "ggplot2" and "lattice", that allow you to
create high-quality graphics and plots. These libraries offer a wider range of options and greater control over the appearance
and style of visualizations than is typically available in a spreadsheet program.
Data Analysis: R has a wide range of libraries for statistical analysis and modeling, such as "caret" and "randomForest", that can
be used to perform complex analyses and build predictive models. These libraries offer a more powerful and flexible set of tools
than are typically available in a spreadsheet program.
Reproducibility: R scripts can be easily saved and shared, allowing others to reproduce your analysis and build upon your work.
SQL queries can be saved and rerun in a similar way, but manual steps carried out in a spreadsheet are harder to reproduce exactly.
Automation: R can be used to automate repetitive tasks and workflows by writing scripts and functions. This can save time and
reduce the risk of errors when working with large datasets or performing multiple analyses.
In summary, R is a powerful tool for data manipulation, visualization, and analysis that offers a wide range of libraries and
packages for working with data, while spreadsheets are better suited to quick, small-scale work and SQL to querying and
combining data stored in databases.
7. Please give an example of a question that is better asked of a
focus group, and an example that is better asked using a survey.
A question that is better asked of a focus group would be "How do you feel about the
design of this new product?" This type of question is better suited for a focus group
because it allows for a more in-depth discussion and the opportunity to hear different
perspectives and feedback.
An example of a question that is better asked using a survey would be "How often do you
purchase our products?" This type of question is better suited for a survey because it can
be easily quantified and analyzed to determine patterns or trends in purchasing behavior.
For example, if a company wants to know how people feel about the design of a new
product that is about to launch, a focus group would be a better option because it would
allow them to gather more in-depth information about people's opinions, likes and dislikes,
and suggestions for improvement. On the other hand, if a company wants to know how
many of their customers purchase their products on a monthly basis, a survey would be a
better option because it would allow them to gather a large amount of data quickly and
easily, and would be useful for tracking trends over time.
8. You are joining data with phone numbers as an identifier and
you find that people have entered their phone numbers with
different formats: some with dashes, some with parentheses, and
some with spaces. What do you do?
Firstly, I would check with the organization to understand what format the
phone number has to be in. The data contains various formatting
characters, such as parentheses, hyphens, and spaces. I would
delete all of these characters.
e.g. 123-456 -> 123456, 567)8 90 -> 567890
After having data consisting only of digits, I would transform the
data into the format the organization needs.
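The two steps described above (strip everything except digits, then apply the required format) could look like the following sketch. The function names and the XXX-XXX-XXXX target format are assumptions for illustration; the real target format would come from the organization.

```python
import re


def normalize_phone(raw: str) -> str:
    """Remove every non-digit character (dashes, parentheses, spaces)."""
    return re.sub(r"\D", "", raw)


def format_phone(digits: str) -> str:
    """Format ten digits as XXX-XXX-XXXX (a hypothetical target format)."""
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}")
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


for raw in ["(123) 456-7890", "123 456 7890", "123-456-7890"]:
    print(raw, "->", format_phone(normalize_phone(raw)))
```

Once both tables use the normalized digits-only value, the phone number can serve as a consistent join key.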
9. I calculated the mean age of the people in my data using a
programming language, and the answer came back as N/A.
What might have happened, and what could I do to solve this?
There might be missing or non-numeric data in the column, which results in NA.
There might also be a case where the value is a number, but the table
stores it in a string format.
In such cases, we can convert the column to the data type the use case requires.
Furthermore, after checking for null and non-numeric values in the column, I would
handle them by either removing or replacing that data.
If I am using Python, I can use functions like pd.isnull(), pd.isna(), and
pd.to_numeric().
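A minimal sketch of that workflow in pandas is shown below. The age values are invented, and filling missing values with the median is just one possible replacement strategy, not a recommendation from the original answer.

```python
import pandas as pd

# Hypothetical age column: numbers stored as text, one missing value,
# and one non-numeric entry.
ages = pd.Series(["34", "29", None, "unknown", "41"])

# Convert to numbers; anything that cannot be parsed becomes NaN.
numeric_ages = pd.to_numeric(ages, errors="coerce")

# Count how many values are missing or were not numeric.
n_missing = int(numeric_ages.isna().sum())

# Option 1: ignore missing values (pandas skips NaN in mean() by default).
mean_ignoring_missing = numeric_ages.mean()

# Option 2: replace missing values, here with the median, before averaging.
filled_ages = numeric_ages.fillna(numeric_ages.median())
mean_after_fill = filled_ages.mean()

print(n_missing, mean_ignoring_missing, mean_after_fill)
```

Whether to drop or replace the problem values depends on how many there are and on what the analysis is for.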
10. Given a dataset, you are tasked with creating a visualization
that shows the relationships between two variables. When would
you prefer to use a scatterplot, and when would you prefer to use
a heatmap?
A scatterplot and a heatmap are both useful visualization tools for showing the relationship between two variables, but
they are best used in different situations.
A scatterplot is a type of plot that uses dots to represent values for two numeric variables. It is particularly useful
when you want to show the relationship between two continuous variables, such as the relationship between income
and years of education.
Scatterplots are also useful when you want to show patterns or trends in the data, such as clusters or outliers.
Scatterplots are suitable when you have a large number of observations, and you want to show the relationship
between two variables.
A heatmap, on the other hand, is a type of plot that uses color to represent the density or frequency of values for two
variables. It is particularly useful when you want to show the relationship between two categorical variables, such as
the relationship between type of product and region of purchase.
Heatmaps are also useful when you want to show patterns or trends in the data, such as which combinations of
variables are most common. Heatmaps are more suitable when you have a large number of categories, and you want
to show the distribution of data.
In summary, when you have continuous variables and a large number of observations, and you want to show the
relationship between two variables, a scatterplot is a good option. When you have categorical variables and a large
number of categories, and you want to show the distribution of data, a heatmap is a good option.
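To illustrate the categorical case, the sketch below builds the table of counts that a heatmap would color, using pandas. The purchase data is invented, and the plotting call itself is left out; a plotting library of your choice would take the resulting table as input.

```python
import pandas as pd

# Hypothetical purchases: two categorical variables per observation.
purchases = pd.DataFrame({
    "product": ["laptop", "phone", "phone", "tablet", "laptop", "phone"],
    "region": ["north", "north", "south", "south", "south", "north"],
})

# Count how often each product/region combination occurs.
# Each cell of this table corresponds to one colored cell of a heatmap.
counts = pd.crosstab(purchases["product"], purchases["region"])

print(counts)
```

A scatterplot, by contrast, would plot the raw numeric pairs directly, one dot per observation, without aggregating them first.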
11. Ensuring data is free from bias is an essential part of Data
Analytics. What are some sources of bias that you should
consider when evaluating data sources?
Selection bias: occurs when the sample of data is not representative of the population it is intended to
represent.
Eg. A study on the effectiveness of a new medication is conducted only on patients who are already taking the
medication, rather than a random sample of the population.
Confirmation bias: occurs when data is collected or interpreted in a way that confirms pre-existing beliefs or
hypotheses.
Eg. A researcher only looks for data that confirms their belief that a certain food is unhealthy, and ignores data
that contradicts their belief.
Measurement bias: occurs when the way data is collected or recorded introduces error or distortion.
Eg. A survey on income relies on respondents reporting their own income, and many of them round up or
misstate it, so the recorded values are systematically distorted.
Self-selection bias: occurs when individuals or groups have the ability to choose whether or not to participate
in a study, potentially leading to a biased sample.
Eg. A company wants to survey their employees about their job satisfaction and sends out a survey link to all
employees via email. However, only employees who are satisfied with their jobs choose to fill out the survey,
leading to a bias towards employees who are more satisfied with their jobs and not an accurate representation
of the overall job satisfaction of all employees.
Survivorship bias: occurs when data only includes individuals or groups that have "survived" a certain event or
process, leading to a biased sample.
Eg. A study on the success rate of startup companies only looks at companies that are currently successful,
rather than considering all companies, including those that have failed.
12. When presenting information to a client, what are some key
differences between reports and dashboards? Can you think of
an example where a dashboard would be preferred over a report?
A report is a document that provides a detailed and structured summary of information, often containing tables,
charts, and written explanations. Reports are typically used to provide a historical view of data and are best used
when the client needs to see a detailed analysis of the information. They are often used to document a specific aspect
of the business, and can be shared with stakeholders.
Reports are best for providing a detailed analysis of data over a specific time period.
A dashboard, on the other hand, is an interactive visual representation of data that is designed to provide a quick
overview of key metrics and trends.
Dashboards often use real-time data and are designed to be easy to understand and navigate. They are typically
used to provide an up-to-date view of data and are best used when the client needs to quickly understand key
metrics and trends.
Dashboards are often used to monitor the performance of the business, and can be accessed by different
stakeholders.
An example where a dashboard would be preferred over a report is when a client needs to monitor the performance of
their online store in real-time. A dashboard could display key metrics such as website traffic, sales, and customer
behavior in real-time, allowing the client to quickly identify any issues or opportunities.
A report, on the other hand, would provide a detailed historical analysis of the data, which would be useful for
understanding trends and making decisions, but may not be as useful for identifying and addressing real-time issues.
In summary, reports are best used for providing a detailed analysis of data over a specific time period, while
dashboards are best for providing an up-to-date view of data and for quickly understanding key metrics and trends.
13. "If the sample is large enough, we don't need to worry about
whether it's biased." Do you agree with the statement? Why, or
why not?
The statement "If the sample is large enough, we don't need to worry about whether it's
biased" is not entirely accurate. A large sample size can help to reduce the impact of
random error and increase the precision of estimates, but it does not guarantee that a
sample is free from bias.
For example, a study on the effectiveness of a new medication conducted on a large sample
of patients who are already taking the medication will still be biased, even if the sample size
is large, because the sample is not representative of the general population. Similarly, a
large sample of data collected using a biased survey instrument or method will still be
biased, regardless of its size.
It's important to consider the potential sources of bias when evaluating data, and take steps
to minimize bias as much as possible, regardless of sample size.
Hence, a large sample size is not a substitute for a sample that is truly representative of the
population of interest, and a representative sample is still important in order to be able to
generalize the results to the population.
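The point can be checked with a small simulation, sketched below using only the Python standard library. The population (values drawn from a normal distribution with mean 50) and the selection rule (only values above 50 can be sampled) are assumptions chosen to make the bias obvious.

```python
import random
from statistics import mean

random.seed(0)

# Hypothetical population of 100,000 values centered around 50.
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = mean(population)

# Small but random sample: every member has the same chance of selection.
small_random = random.sample(population, 100)

# Large but biased sample: only members with values above 50 can be selected.
eligible = [x for x in population if x > 50]
large_biased = random.sample(eligible, 20_000)

small_random_error = abs(mean(small_random) - true_mean)
large_biased_error = abs(mean(large_biased) - true_mean)

print(f"true mean:           {true_mean:.2f}")
print(f"small random sample: {mean(small_random):.2f}")
print(f"large biased sample: {mean(large_biased):.2f}")
```

The biased sample is 200 times larger, yet its estimate stays further from the true mean, because adding more observations drawn the same biased way does not correct the bias.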
14. Can you explain why primary and foreign keys are an
important part of database management?
Primary and foreign keys are an important part of database management because they help to
ensure the integrity and consistency of data within a database.
A primary key is a unique identifier for each record in a table. It is used to identify and access
specific records, and it ensures that each record in a table is distinct and can be easily
identified. Primary keys are usually assigned to a column or set of columns, and the values in
that column must be unique for each row in the table.
A foreign key is a column or set of columns in a table that refers to the primary key of another
table. It is used to establish a relationship between two tables and ensure that data is consistent
across multiple tables. For example, if a database has a table of customers and a table of
orders, the primary key of the customers table could be a "customer_id" column, and the
foreign key in the orders table would be "customer_id" as well, linking the two tables together.
Together, primary and foreign keys also help to prevent data duplication and ensure that data is easily accessible.
For example, when a new customer is added to the customers table, a unique customer_id is
assigned to that customer, and that same customer_id is used in the orders table when adding
a new order for that customer. This ensures that any order placed is linked to the correct
customer, and that the customer's information is accurate and consistent across both tables.
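The customers and orders example can be sketched with Python's built-in sqlite3 module to show the database enforcing both kinds of key. The column names other than customer_id, the sample rows, and the try_insert helper are assumptions for illustration. Note that SQLite only enforces foreign keys once the foreign_keys pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """CREATE TABLE orders (
           order_id INTEGER PRIMARY KEY,
           customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
           amount REAL
       )"""
)

conn.execute("INSERT INTO customers VALUES (1, 'Ann')")
conn.execute("INSERT INTO orders VALUES (100, 1, 25.0)")


def try_insert(sql: str) -> str:
    """Run an INSERT and report whether a key constraint rejected it."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"


# Primary key: a second customer with customer_id 1 is refused.
duplicate_customer = try_insert("INSERT INTO customers VALUES (1, 'Bob')")

# Foreign key: an order for a customer that does not exist is refused.
orphan_order = try_insert("INSERT INTO orders VALUES (101, 99, 10.0)")

print(duplicate_customer)
print(orphan_order)
```

Because the database itself rejects these rows, every order is guaranteed to point at exactly one existing customer.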
15. Can you describe what metadata is, and why it's important in
the discussion of databases?
Metadata is data about data. It helps us work with data more
efficiently. There is Technical metadata, which consists of information
like acceptable size and data format, and which is especially useful
during debugging.
There is Business metadata as well, which gives business context
about the data. This is very helpful to understand the business context
so that we, as data and business analysts, can deliver better strategies
and analysis while keeping the business in mind.
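As one concrete example of technical metadata, a database can describe its own tables. The sketch below reads column metadata from SQLite through Python's built-in sqlite3 module; the customers table is invented, and the business_metadata dictionary is a hypothetical stand-in for the kind of business context that usually lives in a data dictionary or catalog.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers ("
    "customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL, signup_date TEXT)"
)

# Technical metadata: PRAGMA table_info returns one row per column with
# (position, name, declared type, NOT NULL flag, default value, primary key flag).
columns = conn.execute("PRAGMA table_info(customers)").fetchall()
schema = {
    name: {"type": col_type, "required": bool(notnull), "primary_key": bool(pk)}
    for _cid, name, col_type, notnull, _default, pk in columns
}

# Business metadata (hypothetical): what each column means to the business.
business_metadata = {
    "customer_id": "Internal identifier assigned when the account is created",
    "name": "Customer's full name as entered at signup",
    "signup_date": "Date the customer first registered",
}

for column, details in schema.items():
    print(column, details, "-", business_metadata[column])
```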