Company Interview Question Bank
1) Cognizant-
explain your project --questions on project
what is a crawler in aws glue.
A crawler is a component of AWS Glue (it is separate from a Glue job). It scans data stores
such as S3 buckets and JDBC databases, infers the schema of the data, and then creates or
updates tables with that schema in the AWS Glue Data Catalog.
how to run the aws glue job
how does aws kinesis convert a json file into parquet file format.
Amazon Kinesis Data Firehose can convert the format of your input data from JSON
to Apache Parquet or Apache ORC before storing the data in Amazon S3. Parquet and
ORC are columnar data formats that save space and enable faster queries. Firehose takes
the schema for the conversion from a table in the AWS Glue Data Catalog. To
enable it, go to your Firehose stream, click Edit, and turn on record format conversion.
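The same setting can also be written down as configuration instead of console clicks. Below is a minimal sketch of the conversion settings as a Python dict, in the shape that (to my understanding) boto3's create_delivery_stream expects under ExtendedS3DestinationConfiguration -> DataFormatConversionConfiguration. The helper name, role ARN, database and table names are hypothetical, and nothing here calls AWS.

```python
def parquet_conversion_config(role_arn, database, table, region):
    """Build JSON-to-Parquet conversion settings for a Firehose S3 destination."""
    return {
        "Enabled": True,
        # Firehose reads the record schema from an AWS Glue Data Catalog table.
        "SchemaConfiguration": {
            "RoleARN": role_arn,
            "DatabaseName": database,
            "TableName": table,
            "Region": region,
        },
        # Incoming records are JSON, parsed with the OpenX JSON SerDe.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        # Records are written to S3 as Parquet (OrcSerDe would give ORC instead).
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    }


# All values below are hypothetical placeholders.
config = parquet_conversion_config(
    role_arn="arn:aws:iam::123456789012:role/firehose-example-role",
    database="example_db",
    table="example_events",
    region="us-east-1",
)
```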
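what are the non-relational databases in aws and tell me about them
Non-relational (NoSQL) databases store data without the fixed tables and join
relationships of a relational database. This allows the database to scale out and
hold exceptionally large amounts of data. Non-relational databases in AWS include:
DynamoDB - key-value and document database
ElastiCache - in-memory cache (Redis or Memcached)
Neptune - graph database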
2) Infosys-
explain your project -question on project
how to add column to dataframe in pyspark?
5 Ways to add a new column in a PySpark Dataframe
1. Using Spark Native Functions. The most pysparkish way to create a new column in a
PySpark DataFrame is by using built-in functions.
2. Spark UDFs. Sometimes we want to do complicated things to a column or multiple
columns. ...
3. Using RDDs. Sometimes both the spark UDFs and SQL Functions are not enough
for a particular use-case. ...
4. Pandas UDF. ...
5. Using SQL. ...
Example: reduceByKey
1. ShuffleMapStage
The output of this stage is used as the input for further stage(s). The output is in the form
of map output files, which can later be used by the reduce tasks. A ShuffleMapStage is
considered ready when all of its map outputs are available. Sometimes the output
locations can be missing, in cases where the partitions are either lost or not available.
This stage may contain many pipelined operations such as map() and filter() before the
shuffle.
2. ResultStage
As the name itself suggests, this is the final stage in a Spark job, which runs the action on
one or more partitions of an RDD to compute the result.
The DAGScheduler submits missing tasks, if any, to the ResultStage for computation.
Some of the optional parameters are Job Id, Application Id, and Application
attempt Id.
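As a rough illustration of the two stages, the sketch below imitates a reduceByKey word count in plain Python. It is not Spark code: the function names shuffle_map_stage and result_stage, the sample data and the number of reducers are assumptions made up for this example.

```python
from collections import defaultdict


def shuffle_map_stage(partitions, num_reducers):
    """Map side: run pipelined map()/filter() steps, then bucket output by reducer."""
    map_outputs = []
    for partition in partitions:
        # One bucket per reducer, standing in for the "map output files".
        buckets = [defaultdict(int) for _ in range(num_reducers)]
        for word in partition:
            word = word.lower()      # pipelined map()
            if not word:             # pipelined filter()
                continue
            # Pre-combine counts per key inside the bucket chosen by the key's hash.
            buckets[hash(word) % num_reducers][word] += 1
        map_outputs.append(buckets)
    return map_outputs


def result_stage(map_outputs, num_reducers):
    """Reduce side: each reducer merges its bucket from every map output."""
    result = {}
    for reducer in range(num_reducers):
        for buckets in map_outputs:
            for key, count in buckets[reducer].items():
                result[key] = result.get(key, 0) + count
    return result


partitions = [["a", "b", "A"], ["b", "", "c"]]
map_outputs = shuffle_map_stage(partitions, num_reducers=2)
counts = result_stage(map_outputs, num_reducers=2)
```

The result stage can only start once every partition has produced its buckets, which mirrors the point that a ShuffleMapStage is ready when all of its map outputs are available.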
what is join ?
what are the types of joins?
hive questions -
what is hive,
Hive is data warehouse infrastructure software that sits between the user and
HDFS and lets the user query the data stored there. The user interfaces that Hive supports
are Hive Web UI, Hive command line, and Hive HD Insight (in Windows Server). Hive keeps
table schemas and metadata in the Metastore.
The difference between static and dynamic partitions is that with a static partition, the
name of the partition is hardcoded in the insert statement, whereas with a dynamic
partition, Hive will automatically determine the partition based on the value
of the partition column.
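The difference can be imitated in a few lines of plain Python. This is only a sketch of the idea, not Hive itself: the table is a dict of partition name to rows, and the names sales, staging, staging_in and country are hypothetical. The comments show roughly what the matching HiveQL insert would look like.

```python
from collections import defaultdict


def insert_static(table, rows, partition_value):
    """Static partition: the caller hardcodes the partition name."""
    # Roughly: INSERT INTO TABLE sales PARTITION (country='IN')
    #          SELECT id, amount FROM staging_in;
    table[partition_value].extend(rows)


def insert_dynamic(table, rows, partition_col):
    """Dynamic partition: the partition is taken from each row's column value."""
    # Roughly: INSERT INTO TABLE sales PARTITION (country)
    #          SELECT id, amount, country FROM staging;
    for row in rows:
        data = {k: v for k, v in row.items() if k != partition_col}
        table[row[partition_col]].append(data)


sales = defaultdict(list)
insert_static(sales, [{"id": 1, "amount": 10}], "IN")
insert_dynamic(
    sales,
    [
        {"id": 2, "amount": 20, "country": "US"},
        {"id": 3, "amount": 30, "country": "IN"},
    ],
    "country",
)
```

In Hive itself, dynamic partition inserts usually also need hive.exec.dynamic.partition enabled, and hive.exec.dynamic.partition.mode set to nonstrict when no static partition is given.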
3) Wipro-
asked on project based questions
what is concurrency in aws lambda?
Concurrency is the number of requests that a Lambda function is handling at the same
time. A request is an event that triggers an invocation of a Lambda function, and each
concurrent request runs in its own execution environment. By
default, AWS Lambda gives you a pool of 1000 concurrent executions per AWS
account per Region.
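A quick way to reason about this in an interview is the rule of thumb from the AWS documentation: concurrency is roughly the average requests per second multiplied by the average request duration in seconds. The sketch below applies it. The function names and traffic numbers are made up for illustration, and 1000 is the default limit mentioned above.

```python
def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """Approximate number of requests in flight at the same time."""
    return requests_per_second * avg_duration_seconds


def fits_default_limit(requests_per_second, avg_duration_seconds, limit=1000):
    """True if the estimated concurrency stays within the concurrency limit."""
    return estimated_concurrency(requests_per_second, avg_duration_seconds) <= limit


# 100 requests per second, each taking half a second on average.
light_load = estimated_concurrency(100, 0.5)

# 5000 requests per second at the same duration would exceed the default pool.
heavy_load_ok = fits_default_limit(5000, 0.5)
```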
scenario - boto3
how to write code to write data in s3 in aws glue job?
The Glue job code requires a script file to be stored in an S3 bucket. Then you have to
point the script_location of your Terraform resource aws_glue_job to the
S3 URL of your file, e.g. s3://code-bucket/glue_job.py
Full form:
IaaS - IaaS is an acronym for Infrastructure As A Service.
PaaS - PaaS is an acronym for Platform As A Service.
SaaS - SaaS is an acronym for Software As A Service.
Access:
IaaS - The IaaS service provides its users with access to various resources like virtual
storage and virtual machines.
PaaS - Using the PaaS services, users can get access to a runtime environment (for the
development and deployment of applications and tools).
SaaS - The SaaS services give access to all of their services to the end-users, whether it's
application hosting, storage, or any other services.
Technical understanding:
IaaS - A user requires technical knowledge to make use of IaaS services.
PaaS - One must acquire the basic knowledge of the concerned subjects to understand the
setup of the PaaS services.
SaaS - You don't need to know any technicalities to understand and use the SaaS services;
the service provider can handle everything.
Used by:
IaaS - The network architects primarily use the IaaS.
PaaS - Developers mainly make use of PaaS.
SaaS - An end-user generally uses SaaS.
what will happen if a crawler is deleted from aws glue. will it affect the aws
glue job.
what are the optimization techniques in aws glue.
what is difference between dynamic frame and dataframe
A DynamicFrame is similar to a DataFrame, except that each record is self-
describing, so no schema is required initially. Instead, AWS Glue computes a
schema on-the-fly when required, and explicitly encodes schema inconsistencies using
a choice (or union) type.
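The idea of computing a schema from self-describing records can be sketched in plain Python. This is not the awsglue API: infer_schema, the sample records and the "choice<...>" notation are assumptions made up to show how a field with inconsistent types ends up as a choice.

```python
def infer_schema(records):
    """Compute a schema on the fly, marking inconsistent fields as a choice."""
    seen = {}
    for record in records:
        # Each record describes itself: its own field names and value types.
        for field, value in record.items():
            seen.setdefault(field, set()).add(type(value).__name__)

    schema = {}
    for field, types in seen.items():
        if len(types) == 1:
            schema[field] = next(iter(types))
        else:
            # More than one type was seen for this field: encode it as a choice.
            schema[field] = "choice<" + ",".join(sorted(types)) + ">"
    return schema


records = [
    {"id": 1, "price": 9.5},
    {"id": "2", "price": 3.0, "note": "late entry"},
]
schema = infer_schema(records)
```

A DataFrame would need one type for id up front. In Glue, a choice like this is typically settled later with the DynamicFrame's resolveChoice transform, for example by casting to a single type.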
Amazon EC2: With EC2, setting up includes logging in via SSH, manually
installing Apache, and doing a git clone. Along with that, you need to install and
configure all the required software in a manner that is automated and
reproducible.
For EC2, storage comes in two options: standard volumes, which serve
data at roughly the speed of a desktop hard drive, and provisioned
volumes, which serve data much faster. Comparatively, this is a lot of
work.
suppose there is a glue job which joins two files, and this job fails.
then how will you handle the failure.
on which file format you worked?
what is difference between csv file and parquet file
what is difference between aws ec2 and aws lambda
how will you choose the s3 storage class.
what is difference between global secondary index and local secondary index in
dynamodb
what is difference between RCU and WCU -dynamodb
what is difference between rds and dynamodb
what is difference between primary key,composite key and sort key
(dynamodb)
how to launch ec2 from a local machine.
can we access the ec2 machine if we know public ip and access key from
another machine.
what is difference between RDS,Dynamodb and redshift.
what is redshift spectrum
do you know vpc, what are internet gateway, route table and subnets in vpc.
which IAM role will you set for aws glue.
pyspark-
what is RDD
features of RDD
what is lazy evaluation of RDD
what are functions of transformations and actions
what is spark sql
what are pyspark sql window functions
what is broadcast join
what is difference between rdd,dataframe,dataset
what is difference between persist() and cache()
what is difference between repartition() and coalesce()
-what is difference between df.take(5) and df.head(5)
-write pyspark code to add a new column "full name" with value "fname
mname lname" on a df which has three columns (fname, mname, lname).
-write pyspark code to drop duplicate rows.
-write pyspark code to join two dataframes.
-what are accumulator and broadcast variables in spark
-what are the cases where you use an accumulator in spark.
-how to handle data skewness in spark
python -
what is difference between list and tuple
5. LIST - Lists have several built-in methods.
   TUPLE - Tuple does not have many built-in methods.
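The point about built-in methods is easy to check in the interpreter. The short sketch below lists the public methods of each type and also shows the more fundamental difference, that a list can be changed in place while a tuple cannot. The helper name public_methods is just for this example.

```python
def public_methods(cls):
    """Names of the methods a type exposes, without the dunder ones."""
    return sorted(name for name in dir(cls) if not name.startswith("_"))


list_methods = public_methods(list)    # append, extend, insert, pop, sort, ...
tuple_methods = public_methods(tuple)  # only count and index

# A list is mutable: it can grow and be reordered in place.
values_list = [3, 1, 2]
values_list.append(4)
values_list.sort()

# A tuple is immutable: item assignment raises TypeError.
values_tuple = (3, 1, 2)
try:
    values_tuple[0] = 9
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False
```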
sql -
what is difference between data warehouse and data mart.
5) Accenture-
explain your project ---all questions on projects
how exactly does your kinesis capture the real-time events.
how to use cron jobs and on which AWS services do they work
have you used JIRA,scrum,agile
how do you deploy the code and test that code in the production environment.
-how do you push your code into the production environment, explain stepwise
-suppose you have one file in s3 and another table on RDS, if you want to
merge both tables then how will you write the code in glue (pyspark)
-what is difference between dynamic frame and dataframe .
6) Brillio Technologies
AWS
-explain about the project
-how you will schedule your job
-how will you transform only newly added data. suppose you had 100
records yesterday and today you got 10 additional records in the same
file. then how will you separate that data and do the transformation on those 10
records only.
-is there another way instead of day and time partition.
- what is redshift and dynamodb. have you used them in your project?
- how will you design the pipeline if data coming from an SNS topic is stored in an
s3 bucket and the same data also needs to be sent to a business team tool like
Salesforce with a lambda function.
pyspark
how to convert RDD to dataframe?
MindTree
-explain all the steps of job execution when a job is submitted to spark
-how will you optimize memory if the job gets stuck and fails.
-employee table given with eid,name,dept,salary
write pyspark code to show the third highest salary with eid and name.
write pyspark code to show the department-wise highest and lowest
salary of employees.
PWC
- explain your project
- explain the AWS lambda.
- how to write code into aws lambda function.
- write a code to read the data from s3 bucket in aws lambda
- what are the storage classes of s3 bucket
- Tell me about aws glue.
- how to set a job in aws glue.
- what are the components of aws glue
- which services do you know in aws
- what do you know about dynamodb
Pyspark
What is the join function in pyspark
Write a code to join the two dataframes with suitable example.
How much you rate yourself for pyspark out of 10
Python-
- which project you have handled using python
- write code to extract data from an API (json file)
- what is indentation in python
- how to install package in python editor
- How much you rate yourself for python out of 10
Virtusa
How you will separate the new records in aws glue
How to rename 50 columns at a time in pyspark
How to append records to a file in an s3 bucket
How to tune performance in spark
How to eliminate data skewness
What is difference between repartition and coalesce
How to use main () function which has different packages in aws glue.
How to create a table for hive in pyspark
-How to write a program in python - a table column has values like 0.001-,
0.002- and you need to put the - before the value.
How to check the operating system path in python
What is difference between parquet and ORC file
What is accumulator