Data Engineer Interview Questions

Data engineering focuses on applying data collection and research to convert raw data from various sources into useful information. Data modeling documents complex software design as diagrams to make the relationships between data objects and rules easy to understand. The main types of schemas in data modeling are the star schema and the snowflake schema. Structured data uses databases for storage and has standard integration tools, while unstructured data uses unmanaged file structures and manual processing. A Hadoop application includes common utilities, HDFS for distributed file storage, MapReduce for large-scale processing, YARN for resource management, and a NameNode that tracks files across the cluster.

Uploaded by Ghulam Mustafa

1) Explain Data Engineering.

Data engineering is a term used in big data. It focuses on the
application of data collection and research. The data generated
from various sources is just raw data; data engineering helps
convert this raw data into useful information.

2) What is Data Modelling?


Data modeling is the method of documenting a complex software
design as a diagram so that anyone can easily understand it.
It is a conceptual representation of data objects, the
associations between them, and the rules that govern them.

3) List various types of design schemas in Data Modelling


There are mainly two types of schemas in data modeling: 1) Star schema and 2)
Snowflake schema.

4) Distinguish between structured and unstructured data

Parameter          Structured Data                  Unstructured Data

Storage            DBMS                             Unmanaged file structures
Standards          ADO.NET, ODBC, and SQL           SMTP, XML, CSV, and SMS
Integration tool   ETL (Extract, Transform, Load)   Manual data entry or batch
                                                    processing that includes codes
Scaling            Schema scaling is difficult      Scaling is very easy

5) Explain all components of a Hadoop application

Hadoop Common: It is a common set of utilities and libraries that are utilized by
Hadoop.

HDFS: It is the file system in which Hadoop data is stored. It is a distributed
file system with high bandwidth.

Hadoop MapReduce: It is a programming model based on the MapReduce algorithm,
used for large-scale data processing.

Hadoop YARN: It is used for resource management within the Hadoop cluster. It can
also be used for task scheduling for users.

6) What is NameNode?

It is the centerpiece of HDFS. It stores the metadata of HDFS and tracks the
files across the cluster. The actual data is not stored here; it is stored in
the DataNodes.

7) Define Hadoop streaming


It is a utility that allows the creation of Map and Reduce jobs and
submits them to a specific cluster.
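A word count is the classic streaming job. The sketch below mimics the mapper and reducer halves as plain Python functions so the data flow is easy to follow locally; on a real cluster each half would be a separate script passed to the streaming jar with the -mapper and -reducer options.

```python
from itertools import groupby

# Word count in the style used with Hadoop streaming: the mapper emits
# (word, 1) pairs and the reducer sums the counts per key. With real
# streaming these would be two separate scripts reading stdin and writing
# tab-separated pairs to stdout.

def mapper(lines):
    """Emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Sum the counts per word; streaming delivers the pairs sorted by key."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

lines = ["big data big cluster", "data node"]
for word, count in reducer(mapper(lines)):
    print(f"{word}\t{count}")  # big 2, cluster 1, data 2, node 1
```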

8) What is the full form of HDFS?


HDFS stands for Hadoop Distributed File System.

9) Define Block and Block Scanner in HDFS


Blocks are the smallest unit of a data file. Hadoop automatically splits huge
files into these small pieces.

Block Scanner verifies the list of blocks stored on a DataNode.
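The block count for a file is simple arithmetic. A quick sketch, assuming the common Hadoop 2.x default block size of 128 MB (configurable via dfs.blocksize; older versions defaulted to 64 MB):

```python
import math

# Assuming the Hadoop 2.x default block size of 128 MB (dfs.blocksize).
BLOCK_SIZE = 128 * 1024 * 1024  # bytes

def num_blocks(file_size_bytes):
    """A file occupies ceil(size / block_size) blocks; the last may be partial."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

print(num_blocks(1024 * 1024 * 1024))  # a 1 GB file -> 8 blocks
print(num_blocks(300 * 1024 * 1024))   # a 300 MB file -> 3 blocks (128 + 128 + 44 MB)
```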

10) What are the steps that occur when Block Scanner detects a corrupted data
block?

1) First of all, when the Block Scanner finds a corrupted data block, the
DataNode reports it to the NameNode.

2) The NameNode starts creating a new replica from an uncorrupted replica of
the block.

3) The replication count of the correct replicas is checked against the
replication factor. If they match, the corrupted data block is not deleted.
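The recovery steps above can be sketched as a toy simulation. The function and variable names are illustrative, not the actual HDFS API:

```python
# Toy simulation of re-replication after a corrupt block is found.
# Names are illustrative, not the real HDFS API.
REPLICATION_FACTOR = 3

def recover(replicas, replication_factor=REPLICATION_FACTOR):
    """replicas: list of "good"/"corrupt" flags for one block's copies.
    Returns the healthy replica list after the NameNode re-replicates."""
    good = [r for r in replicas if r == "good"]
    if not good:
        raise RuntimeError("no healthy replica to copy from")
    # The NameNode schedules new copies from a healthy replica until the
    # count of good replicas matches the replication factor.
    while len(good) < replication_factor:
        good.append("good")
    return good

print(recover(["good", "corrupt", "good"]))  # ['good', 'good', 'good']
```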

11) Name two messages that the NameNode gets from a DataNode

There are two messages that the NameNode gets from a DataNode: 1) Block report
and 2) Heartbeat.

12) List various XML configuration files in Hadoop

There are four main XML configuration files in Hadoop:
mapred-site.xml
core-site.xml
hdfs-site.xml
yarn-site.xml

13) What are the five V's of big data?

The five V's of big data are:

Velocity
Value
Variety
Volume
Veracity

14) Explain the features of Hadoop


Important features of Hadoop are:

It is an open-source framework that is freely available.
Hadoop is compatible with many types of hardware, and it is easy to add new
hardware within a node.
Hadoop supports faster distributed processing of data.
It stores the data in the cluster, independent of the rest of the operations.
By default, Hadoop creates 3 replicas of each block, placed on different nodes.

15) Explain the main methods of Reducer


setup(): It is used for configuring parameters such as the size of the input
data and the distributed cache.
cleanup(): This method is used to clean up temporary files.
reduce(): It is the heart of the reducer; it is called once per key with the
associated values.
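The real Reducer API is Java (org.apache.hadoop.mapreduce.Reducer); the lifecycle can be mirrored in Python for illustration. The class and driver below are a sketch, not the actual framework:

```python
# The Reducer lifecycle mirrored in Python: the framework calls setup()
# once, reduce() once per key with that key's values, then cleanup() once.
class WordCountReducer:
    def setup(self):
        # Configure parameters, load distributed-cache data, open side files.
        self.results = {}

    def reduce(self, key, values):
        # Called once per key with all values for that key.
        self.results[key] = sum(values)

    def cleanup(self):
        # Clean up temporary files, flush output.
        return self.results

def run_reducer(reducer, grouped):
    """Drive the lifecycle the way the framework would."""
    reducer.setup()
    for key, values in grouped.items():
        reducer.reduce(key, values)
    return reducer.cleanup()

print(run_reducer(WordCountReducer(), {"a": [1, 1], "b": [1]}))  # {'a': 2, 'b': 1}
```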

16) What is the abbreviation of COSHH?


The abbreviation of COSHH is Classification and Optimization based Scheduler
for Heterogeneous Hadoop systems.

17) Explain Star Schema


Star Schema, or Star Join Schema, is the simplest type of Data Warehouse
schema. It is known as a star schema because its structure looks like a star.
In the star schema, the center of the star may have one fact table and multiple
associated dimension tables. This schema is used for querying large data sets.
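A tiny in-memory sketch of a star schema: one fact table in the middle, dimension tables around it, and a typical query that joins a fact to a dimension and aggregates. Table and column names here are made up for illustration:

```python
# Dimension tables: small lookup tables keyed by surrogate key.
dim_product = {1: "laptop", 2: "phone"}
dim_region = {10: "EU", 20: "US"}

# Fact table: rows reference the dimensions by key and carry the measures.
fact_sales = [
    {"product_id": 1, "region_id": 10, "amount": 900},
    {"product_id": 2, "region_id": 10, "amount": 500},
    {"product_id": 1, "region_id": 20, "amount": 1100},
]

def sales_by_region():
    """A typical star-schema query: join facts to a dimension, then aggregate."""
    totals = {}
    for row in fact_sales:
        region = dim_region[row["region_id"]]
        totals[region] = totals.get(region, 0) + row["amount"]
    return totals

print(sales_by_region())  # {'EU': 1400, 'US': 1100}
```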

18) How to deploy a big data solution?

Follow these steps to deploy a big data solution:

1) Integrate data from data sources like RDBMS, SAP, MySQL, and Salesforce.

2) Store the extracted data in either a NoSQL database or HDFS.

3) Deploy the big data solution using processing frameworks like Pig, Spark,
and MapReduce.

19) Explain FSCK


File System Check, or FSCK, is a command used by HDFS. The FSCK command is used
to check for inconsistencies and problems in files; for example, running
hdfs fsck / reports on the health of the entire file system.

20) Explain Snowflake Schema
A Snowflake Schema is an extension of a Star Schema, and it adds additional
dimensions. It is called a snowflake schema because its diagram looks like a
snowflake. The dimension tables are normalized, which splits the data into
additional tables.
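The normalization step is the whole difference between the two schemas. A small sketch, with made-up table and column names, showing a star-style denormalized dimension being split into two snowflake-style tables:

```python
# Star schema: the product dimension is denormalized, so category data
# repeats on every product row.
star_dim_product = {
    1: {"name": "laptop", "category": "electronics", "category_mgr": "Ana"},
    2: {"name": "phone", "category": "electronics", "category_mgr": "Ana"},
}

# Snowflake schema: the repeated category data is normalized into its own
# table, and the product dimension keeps only a foreign key.
snow_dim_category = {100: {"category": "electronics", "category_mgr": "Ana"}}
snow_dim_product = {
    1: {"name": "laptop", "category_id": 100},
    2: {"name": "phone", "category_id": 100},
}

def product_with_category(product_id):
    """Resolving a product now takes an extra join through the category table."""
    product = snow_dim_product[product_id]
    category = snow_dim_category[product["category_id"]]
    return {**product, **category}

print(product_with_category(1)["category_mgr"])  # Ana
```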
