
Company Interview Question Bank


1) Cognizant
Explain your project (questions on the project).
What is a crawler in AWS Glue?
A crawler is a job defined in AWS Glue. It crawls databases and S3 buckets and then creates tables in the AWS Glue Data Catalog together with their schema.
How to run an AWS Glue job:

• Create a Python (or PySpark) script file
• Copy it to Amazon S3
• Give the AWS Glue user access to that S3 bucket
• Run the job in AWS Glue
• Inspect the logs in Amazon CloudWatch
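
A Glue job can also be started programmatically. A minimal boto3 sketch, assuming a job named "my-etl-job" already exists (the job name and region are made up):

import boto3

# Hypothetical job name and region
glue = boto3.client("glue", region_name="us-east-1")

run = glue.start_job_run(JobName="my-etl-job")

# Poll the run state; detailed logs land in CloudWatch under /aws-glue/jobs
state = glue.get_job_run(JobName="my-etl-job", RunId=run["JobRunId"])
print(state["JobRun"]["JobRunState"])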

How does AWS Kinesis convert a JSON file into Parquet format?
Amazon Kinesis Data Firehose can convert the format of your input data from JSON
to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and
ORC are columnar data formats that save space and enable faster queries. To
enable the conversion, go to your Firehose stream, click Edit, and turn on record
format conversion.
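
Format conversion can also be configured when creating the stream. A hedged boto3 sketch; the role/bucket ARNs and the Glue table ("mydb"/"events") that supplies the schema are all assumptions:

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="json-to-parquet",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # hypothetical
        "BucketARN": "arn:aws:s3:::my-data-bucket",                 # hypothetical
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            # Deserialize the incoming JSON records...
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            # ...and serialize them back out as Parquet
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # The schema comes from a Glue Data Catalog table
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
                "DatabaseName": "mydb",
                "TableName": "events",
            },
        },
    },
)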

What is a Parquet file?

Parquet is an open-source file format available to any project in the Hadoop ecosystem. Apache Parquet is designed as an efficient, performant, flat columnar storage format for data, in contrast to row-based formats like CSV or TSV files.

What are the storage classes of S3?

                          S3 Standard    S3 Standard-IA   S3 One Zone-IA   S3 Glacier
Minimum capacity
charge per object         N/A            128 KB           128 KB           40 KB
Minimum storage
duration charge           N/A            30 days          30 days          90 days
Retrieval fee             N/A            per GB           per GB           per GB
                                         retrieved        retrieved        retrieved
First byte latency        milliseconds   milliseconds     milliseconds     select minutes
                                                                           or hours

What is the role of AWS Lambda in your project?

A Lambda function's execution role is an AWS Identity and Access
Management (IAM) role that grants the function permission to access AWS services
and resources. You provide this role when you create a function, and Lambda assumes
the role when your function is invoked.
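
As a sketch of where that role is supplied, here is a hedged boto3 call; the function name, role ARN, and app.zip bundle are all made up:

import boto3

lam = boto3.client("lambda")

with open("app.zip", "rb") as f:
    bundle = f.read()

lam.create_function(
    FunctionName="my-func",
    Runtime="python3.12",
    Role="arn:aws:iam::123456789012:role/my-lambda-exec-role",  # the execution role
    Handler="app.handler",
    Code={"ZipFile": bundle},
)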

What are the non-relational databases in AWS? Tell me about them.
These types store data without structured linking mechanisms
(NoSQL), which allows the database to hold exceptionally large
amounts of data.

• DynamoDB
• ElastiCache
• Neptune

Have you used DynamoDB and RDS databases?

What is the difference between SQL and NoSQL databases?

What do you know about AWS EC2, VPC, IAM?

2) Infosys
Explain your project (questions on the project).
How to add a column to a DataFrame in PySpark?
5 Ways to add a new column in a PySpark Dataframe
1. Using Spark Native Functions. The most pysparkish way to create a new column in a
PySpark DataFrame is by using built-in functions.
2. Spark UDFs. Sometimes we want to do complicated things to a column or multiple
columns. ...
3. Using RDDs. Sometimes both the spark UDFs and SQL Functions are not enough
for a particular use-case. ...
4. Pandas UDF. ...
5. Using SQL. ...
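
As a quick illustration, a minimal sketch of ways 1, 2, and 5 (the DataFrame and column names are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])

# 1) Native functions
df = df.withColumn("sum_ab", F.col("a") + F.col("b"))

# 2) A UDF, for logic the native functions can't express
double_it = F.udf(lambda x: x * 2, LongType())
df = df.withColumn("a_doubled", double_it("a"))

# 5) Plain SQL over a temp view
df.createOrReplaceTempView("t")
spark.sql("SELECT *, a * b AS prod_ab FROM t").show()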

How to add a value to a column in PySpark?

In PySpark, to add a new column to a DataFrame, use the lit() function (from
pyspark.sql.functions import lit). lit() takes the constant value you want to
add and returns a Column type; to add a NULL/None, use lit(None).
The example below first adds the literal constant 0.3 to a DataFrame and then
adds a None.
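
The original example appears to have been lost; a minimal reconstruction (data and column names assumed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("bobby", 1), ("lilly", 2)], ["name", "id"])

df = df.withColumn("rate", lit(0.3))                   # constant column
df = df.withColumn("mname", lit(None).cast("string"))  # NULL column
df.show()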
What is Spark query optimization?
Adaptive Query Execution (AQE) in Spark 3.0 re-optimizes and adjusts query plans based
on runtime metrics collected during the execution of the query. This re-optimization of
the execution plan happens after each stage of the query, since a stage boundary is the
natural place to re-optimize.
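
Enabling AQE is a one-line session config (it is on by default from Spark 3.2 onward):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# AQE re-plans at stage boundaries using runtime statistics
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")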
what is DAG?
What transformations of PySpark did you use in your project?
1. Narrow transformations: transformations that do not require shuffling. These can be executed in a single stage.
Example: map() and filter()
2. Wide transformations: transformations that require shuffling across various partitions. Hence they require different stages to be created for communication across partitions.
Example: reduceByKey()
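
A small RDD sketch of both kinds (the data is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# Narrow: map() and filter() run partition-by-partition, no shuffle
scaled = rdd.map(lambda kv: (kv[0], kv[1] * 10)).filter(lambda kv: kv[1] > 10)

# Wide: reduceByKey() shuffles so that equal keys meet on one partition
totals = scaled.reduceByKey(lambda x, y: x + y)
print(totals.collect())   # [('b', 20), ('a', 30)] in some order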

What are stages in Spark?

1. ShuffleMapStage
This is basically an intermediate stage in the process of DAG execution. The output of this stage is used as the input for further stage(s). The output is in the form of map output files, which can later be used by a reduce task. A ShuffleMapStage is considered ready when all of its map outputs are available. Sometimes the output locations can be missing, in cases where the partitions are either lost or not available. This stage may contain many pipelined operations, such as map() and filter(), before the execution of shuffling. The internal registries outputLocs and _numAvailableOutputs are used by a ShuffleMapStage to track the number of shuffle map outputs. A single ShuffleMapStage can be used commonly across various jobs.

2. ResultStage
As the name suggests, this is the final stage in a Spark job; it performs an operation on one or more partitions of an RDD to calculate its result. Initialization of internal registries and counters is done by the ResultStage.
The DAGScheduler submits any missing tasks to the ResultStage for computation. For computation, it requires various mandatory parameters such as stageId, stageAttemptId, the broadcast variable of the serialized task, the partition, preferred TaskLocations, outputId, some local properties, and the TaskMetrics of that particular stage. Some of the optional parameters are the job ID, application ID, and application attempt ID.

What is a join?
What are the types of joins?

Hive questions
What is Hive?
• Hive is data warehouse infrastructure software that enables interaction between the user and HDFS. The user interfaces Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server). Table metadata is kept in the Hive Metastore.

How to load a local file into HDFS in Hive?

Use the LOAD DATA HiveQL command to load data from HDFS into a Hive
partitioned table. By default, Hive considers the specified path an HDFS location.
For example, download zipcodes.csv from GitHub, upload it to HDFS, and load it
with a command like the one below.
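
A sketch of that load via spark.sql, assuming a partitioned zipcodes table already exists (the same statement works directly in the Hive shell):

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Without LOCAL the path is taken to be on HDFS; with LOCAL it is a local file
spark.sql("""
    LOAD DATA INPATH '/data/zipcodes.csv'
    INTO TABLE zipcodes PARTITION (state = 'PR')
""")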

Difference between static partitioning and dynamic partitioning in Hive:

The difference between static and dynamic partitions is that with a static partition, the
name of the partition is hardcoded in the INSERT statement, whereas with a dynamic
partition, Hive automatically determines the partition based on the value of the
partition column.
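
A sketch of both styles, reusing the Hive-enabled session above (the sales and staging_sales tables are hypothetical):

# Static partition: the partition value is hardcoded in the statement
spark.sql("""
    INSERT INTO sales PARTITION (country = 'US')
    SELECT id, amount FROM staging_sales WHERE country = 'US'
""")

# Dynamic partition: Hive derives the partition from the trailing SELECT column
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("""
    INSERT INTO sales PARTITION (country)
    SELECT id, amount, country FROM staging_sales
""")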

3) Wipro
Asked project-based questions.
What is concurrency in AWS Lambda?
Concurrency is the number of requests your Lambda function is serving at the same
time. A request is an event that triggers an invocation of a Lambda function. By
default, AWS Lambda gives you a pool of 1,000 concurrent executions per AWS
account per Region.
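
Part of that pool can be reserved for one function. A hedged boto3 sketch (the function name is made up):

import boto3

lam = boto3.client("lambda")

# Carve 100 of the account's 1,000 concurrent executions out for this function
lam.put_function_concurrency(
    FunctionName="my-func",
    ReservedConcurrentExecutions=100,
)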
scenario - boto3
How to write code to write data to S3 in an AWS Glue job?
The Glue job code requires a script file stored in an S3 bucket. You then point
your Terraform resource aws_glue_job at the script_location, which contains an
S3 URL to your file, e.g. s3://code-bucket/glue_job.py
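
Inside the job script itself, writing to S3 usually goes through a DynamicFrame. A sketch, assuming the catalog database/table and bucket names (all made up) exist:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(database="mydb", table_name="mytable")

# Write the frame out to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/curated/"},
    format="parquet",
)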

what are the storage classes of s3?


Is a Lambda function IaaS, PaaS, or SaaS?

Parameter: full form
IaaS is an acronym for Infrastructure as a Service. PaaS is an acronym for Platform as a Service. SaaS is an acronym for Software as a Service.

Parameter: access
The IaaS service provides its users with access to raw resources like virtual storage and virtual machines. Using the PaaS services, users get access to a runtime environment for the development and deployment of applications and tools. The SaaS services give end users access to all of their services, whether it's application hosting, storage, or any other service.

Parameter: technical understanding
A user requires technical knowledge to make use of IaaS services. For PaaS, one must acquire basic knowledge of the concerned subjects to understand the setup. For SaaS, you don't need to know any technicalities; the service provider handles everything.

Parameter: used by
Network architects primarily use IaaS. Developers mainly make use of PaaS. An end user generally uses SaaS.

AWS Lambda itself is best described as Function as a Service (FaaS), a serverless model usually grouped under PaaS.

What is a CI/CD pipeline?

The continuous integration/continuous delivery pipeline is an agile DevOps workflow focused on a
frequent and reliable software delivery process. It is a framework that includes continuous integration
(CI), continuous testing, continuous delivery (CD), and continuous deployment methods.

How will you push your code into the production environment?

How will you test your code in the development environment?
How have you built automation into your code?
Automation is the ultimate need for DevOps practice, and "automate everything" is
the key principle of DevOps. In DevOps, automation starts from code generation on
the developer's machine, continues until the code is pushed to the repository, and
even after that extends to monitoring the application and system in production.
4) Quantifi
AWS
What are the components of AWS Glue?
1. Data catalog: It is the centralized catalog that stores the
metadata and structure of the data. You can point Hive and
Athena to this centralized catalog when setting them up to access
the data. Hence you can leverage the pros of both tools on the
same data without changing any configuration or methods.
2. Database: This option is used to create the database for
moving and storing the data from source to target.
3. Table: This option allows you to create tables in the database
that can be used by the source and target.
4. Crawler and classifier: A crawler is an outstanding feature
provided by AWS Glue. It crawls locations in S3 or other
sources over a JDBC connection and moves the data to the table or
other target RDS by identifying and mapping the schema. It
creates/uses metadata tables that are pre-defined in the Data
Catalog.
5. Job: A job is an application that carries out the ETL task.
Internally it uses Spark or Python as the programming language
and EMR/EC2 to execute these applications on the cluster.
6. Trigger: A trigger starts the ETL job execution on-demand or at
a specific time.
7. Development endpoint: The development environment consists
of a cluster which processes the ETL operation. It is an EMR
cluster which can then be connected to a notebook to execute
the jobs.
8. Notebook: A Jupyter notebook is a web-based IDE to develop
and run Scala or Python programs for development and
testing.

What will happen if a crawler is deleted from AWS Glue? Will it affect the AWS
Glue job?
What is an AWS Glue optimization technique?
What is the difference between a DynamicFrame and a DataFrame?
A DynamicFrame is similar to a DataFrame, except that each record is
self-describing, so no schema is required initially. Instead, AWS Glue computes a
schema on the fly when required, and explicitly encodes schema inconsistencies using
a choice (or union) type.
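
Converting between the two is common in Glue scripts. A minimal sketch (the sample row is made up):

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
df = glue_context.spark_session.createDataFrame([(1, "a")], ["id", "val"])

# DataFrame -> DynamicFrame: records become self-describing
dyf = DynamicFrame.fromDF(df, glue_context, "dyf")

# DynamicFrame -> DataFrame: a concrete schema is resolved at this point
df_back = dyf.toDF()
df_back.printSchema()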

What is a workflow in Glue?

A workflow is a collection of multiple dependent AWS Glue jobs and crawlers that are
run to complete a complex ETL task. A workflow manages the execution and
monitoring of all its jobs and crawlers.
Why do we require AWS Lambda in AWS?
Lambda is best suited for shorter, event-driven workloads, since Lambda functions
run for up to 15 minutes per invocation. Also, when you use Lambda, you are only
responsible for your code; Lambda takes care of the rest, i.e., balancing
memory, CPU, network, and other resources to run your code.

What is Redshift Spectrum?

Redshift Spectrum is a part of Amazon Redshift that offers a common
platform to query data from its hot data store as well as a cold data store
(legacy data, typically in S3) without having to switch to different software tools.

How will you connect AWS Glue to Redshift?

• Step 1: Create temporary credentials and roles using AWS Glue. AWS Glue creates
temporary credentials for you using the...
• Step 2: Specify the role in the AWS Glue script. After you've created a role for the
cluster, you'll need to specify it...
• Step 3: Handling DynamicFrames in the AWS Glue to Redshift integration. In these
examples, the role name refers to the Amazon...
• Step 4: Supply the key ID from AWS Key Management Service. The data in the
temporary folder used by AWS Glue in...
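
In the script itself, the write typically goes through from_jdbc_conf. A sketch; the catalog connection, database/table, and temp bucket are all assumptions:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())
dyf = glue_context.create_dynamic_frame.from_catalog(database="mydb", table_name="events")

glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.events", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift/",  # staging area Glue uses for COPY
)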

Difference between AWS Lambda and EC2:

AWS Lambda: Whether you need to set up multiple environments or a single one,
you do not need to do much work. You are not required to spin up or provision
containers or make them available for your applications; scaling is fully
automated.
AWS Lambda might not appeal to someone already working with an on-demand
development environment with containers and orchestration in place.

Amazon EC2: With EC2, setting up includes logging in via SSH, manually
installing Apache, and doing a git clone. Along with that, you need to install and
configure all the required software in a manner that is automated and
reproducible.
For EC2, storage volumes come in two options: first the standard ones, which serve
data at roughly the speed of a desktop hard drive, and second, provisioned
(advanced) volumes, which serve data much faster. Comparatively, this is a lot of
work.

Suppose there is a Glue job that joins two files; if this job fails,
how will you handle the failure?
Which file formats have you worked on?
What is the difference between a CSV file and a Parquet file?
What is the difference between AWS EC2 and AWS Lambda?
How will you choose the S3 storage class?
What is the difference between a global secondary index and a local secondary index
in DynamoDB?
What is the difference between RCU and WCU in DynamoDB?
What is the difference between RDS and DynamoDB?
What is the difference between a primary key, a composite key, and a sort key
(DynamoDB)?
How to launch EC2 from a local machine?
Can we access an EC2 machine from another machine if we know the public IP and
access key?
What is the difference between RDS, DynamoDB, and Redshift?
What is Redshift Spectrum?
Do you know VPC? What are an internet gateway, a routing table, and subnets in a VPC?
Which IAM role will you set for AWS Glue?

PySpark
What is an RDD?
Features of RDD.
What is lazy evaluation of RDDs?
What are the functions of transformations and actions?
What is Spark SQL?
What are PySpark SQL window functions?
What is a broadcast join?
What is the difference between RDD, DataFrame, and Dataset?
What is the difference between persist() and cache()?
What is the difference between repartition() and coalesce()?
What is the difference between df.take(5) and df.head(5)?
Write PySpark code to add a new column "full name" with the value "fname
mname lname" on a df that has three columns (fname, mname, lname).
Write PySpark code to drop duplicate rows.
Write PySpark code to join two DataFrames.
What are accumulators and broadcast variables in Spark?
What are the cases where you use an accumulator in Spark?
How to handle data skewness in Spark?

Python
What is the difference between a list and a tuple?

R.No.  List                                      Tuple
1      Lists are mutable.                        Tuples are immutable.
2      Iteration is time-consuming.              Iteration is comparatively faster.
3      Better for operations such as             Appropriate for accessing
       insertion and deletion.                   elements.
4      Consumes more memory.                     Consumes less memory.
5      Has several built-in methods.             Does not have many built-in
                                                 methods.
6      Unexpected changes and errors             Unexpected changes and errors
       are more likely to occur.                 are hard to introduce.

Is it possible to modify a list inside a tuple?

Mutable objects stored in a tuple do not lose their mutability; e.g., you can still
modify inner lists using list methods. Tuples can store any kind of object, although
tuples that contain lists (or any other mutable objects) are not hashable. This
behaviour can indeed lead to confusing errors.
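
A minimal demonstration:

t = (1, [2, 3])

t[1].append(4)      # fine: the inner list is still mutable
print(t)            # (1, [2, 3, 4])

try:
    t[1] = [9]      # rebinding a tuple slot is not allowed
except TypeError as e:
    print(e)

# hash(t) would also raise TypeError: a tuple holding a list is unhashable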
Write Python code to print the capital letters in a given txt file.

# 'sample.txt' stands in for the given text file
with open('sample.txt') as f:
    text = f.read()

upper = [char for char in text if char.isupper()]
print('Uppercase characters:', upper)

Why are NumPy arrays mostly used instead of lists?

NumPy is used because it is faster and more compact than Python lists. In Python, we
have lists that serve the purpose of arrays, but they are slow. NumPy provides an array
object that is up to 50x faster than Python lists.
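
A small sketch of the difference in style (the array operation is the one that runs in C):

import numpy as np

nums = list(range(1_000_000))
arr = np.arange(1_000_000)

doubled_arr = arr * 2                  # vectorized, no Python-level loop
doubled_list = [x * 2 for x in nums]   # element-by-element in the interpreter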

What is a percentile in Python?

Percentiles are descriptive statistics that tell us about the distribution of
values. The nth percentile is the value below which n% of the values in the given
sequence fall.
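
For example, with NumPy:

import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(np.percentile(data, 90))   # 9.1 -> 90% of the values sit below this point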

What are the cases to use a list, a tuple, and a dictionary?

SQL
What is the difference between a data warehouse and a data mart?

• A data warehouse is a large repository of data collected from different sources,
whereas a data mart is only a subtype of a data warehouse.
• A data warehouse is focused on all departments in an organization, whereas
a data mart focuses on a specific group.
• The data warehouse design process is complicated, whereas a data mart
is easy to design.
• A data warehouse takes a long time for data handling, whereas a data mart takes
a short time for data handling.
• Comparing data warehouse vs data mart, a data warehouse ranges from 100 GB
to 1 TB+ in size, whereas a data mart is less than 100 GB.
• A data warehouse implementation process takes 1 month to 1 year, whereas a
data mart takes a few months to complete.

What is the difference between DELETE, TRUNCATE, and DROP on a table?

One query is given and you are asked for its output.

What is the difference between a unique key, primary key, foreign key, and surrogate
key?
What is the difference between a unique key and a primary key?
How many values can be NULL in a unique key?
What is the difference between normalization and denormalization?
Write a query to display 10 records where a left join returns the same records
as a right join.
Write a query to display the roll numbers that are even in the table student.
Write a query to display the names of employees that have "p" in their
names.
Multiple queries were given in the chat box and their output was asked.

5) Accenture
Explain your project (all questions on the project).
How exactly does your Kinesis capture the real-time events?
How to use cron jobs, and on which AWS services do they work?
Have you used JIRA, Scrum, Agile?
How do you deploy the code and test it in the production environment?
How do you push your code into the production environment? Explain stepwise.
Suppose you have one file in S3 and another table in RDS; if you want to
merge both tables, how will you write the code in Glue (PySpark)?
What is the difference between a DynamicFrame and a DataFrame?

6) Brillio Technologies
AWS
Explain the project.
How will you schedule your job?
How will you transform only newly updated data? Suppose you had 100
records yesterday and today you got an additional 10 records in the same
file; how will you separate that data and run the transformation on only those
10 records?
Is there another way besides day and time partitioning?
What are Redshift and DynamoDB? Have you used them in your project?
How will you design the pipeline if data coming from an SNS topic is stored into
an S3 bucket, and the same data then needs to be stored into a business-team
tool like Salesforce using a Lambda function?

PySpark
How to convert an RDD to a DataFrame?

MindTree
Explain all the steps of job execution when a job is submitted to Spark.
How will memory be optimized if the job gets stuck and fails?
An employee table is given with eid, name, dept, salary:
Write PySpark code to show the third-highest salary with eid and name.
Write PySpark code to show the department-wise highest and lowest
salary of employees.

PWC
- Explain your project.
- Explain AWS Lambda.
- How to write code in an AWS Lambda function?
- Write code to read data from an S3 bucket in AWS Lambda.
- What are the storage classes of an S3 bucket?
- Tell me about AWS Glue.
- How to set up a job in AWS Glue?
- What are the components of AWS Glue?
- Which services do you know in AWS?
- What do you know about DynamoDB?

PySpark
What is the join function in PySpark?
Write code to join two DataFrames, with a suitable example.
How much do you rate yourself for PySpark out of 10?

Python
- Which project have you handled using Python?
- Write code to extract data from APIs (JSON file).
- What is indentation in Python?
- How to install a package in a Python editor?
- How much do you rate yourself for Python out of 10?

Virtusa
How will you separate the new records in AWS Glue?
How to rename 50 columns at a time in PySpark?
How to append records to a file in an S3 bucket?
How to do performance tuning in Spark?
How to eliminate data skewness?
What is the difference between repartition and coalesce?
How to use a main() function that has different packages in AWS Glue?
How to create a Hive table in PySpark?
How to write a program in Python where a table column has values like "0.001 -",
"0.002 -" and you need to place the "-" before the value?
How to check the operating system path in Python?
What is the difference between Parquet and ORC files?
What is an accumulator?
