DSBDA ORAL Question Bank
Big Data can be defined as a collection of complex unstructured or semi-structured data sets which have the potential to deliver actionable
insights.
The four Vs of Big Data are –
Volume – Talks about the amount of data
Variety – Talks about the various formats of data
Velocity – Talks about the ever-increasing speed at which the data is growing
Veracity – Talks about the degree of accuracy of data available
3. Define HDFS and YARN, and talk about their respective components.
HDFS (the Hadoop Distributed File System) is Hadoop’s default storage unit and is responsible for storing different types of data in a distributed environment.
HDFS has the following two components:
NameNode – This is the master node that has the metadata information for all the data blocks in the HDFS.
DataNode – These are the nodes that act as slave nodes and are responsible for storing the data.
YARN, short for Yet Another Resource Negotiator, is responsible for managing resources and providing an execution environment for the processes running in Hadoop.
The two main components of YARN are –
ResourceManager – Responsible for allocating resources to respective NodeManagers based on the needs.
NodeManager – Executes tasks on every DataNode.
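The master–slave split described above is easy to picture with a toy model. The Python sketch below is not real HDFS code (all names in it are invented for illustration), but it shows the key idea: the NameNode keeps only metadata, i.e. which blocks make up a file and where their replicas live, while the DataNodes hold the actual bytes.
```python
# Toy illustration of the HDFS master/slave split (not real HDFS code).
class DataNode:
    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}            # block id -> raw bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

class NameNode:
    def __init__(self):
        self.file_to_blocks = {}    # file name -> list of block ids
        self.block_locations = {}   # block id -> list of DataNode ids

# Split a "file" into two blocks and replicate each on two DataNodes.
nn = NameNode()
dns = {i: DataNode(i) for i in range(3)}
nn.file_to_blocks["report.txt"] = ["blk_1", "blk_2"]
nn.block_locations["blk_1"] = [0, 1]
nn.block_locations["blk_2"] = [1, 2]
for blk, nodes in nn.block_locations.items():
    for n in nodes:
        dns[n].store(blk, b"...block contents...")

print(nn.file_to_blocks["report.txt"])   # the master holds metadata only
```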
4. What do you mean by commodity hardware?
Commodity Hardware refers to the minimal hardware resources needed to run the Apache Hadoop framework. Any hardware that supports
Hadoop’s minimum requirements is known as ‘Commodity Hardware.’
10. Define the Port Numbers for NameNode, Task Tracker and Job Tracker.
NameNode – Port 50070
Task Tracker – Port 50060
Job Tracker – Port 50030
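Since these are web UI ports, reachability can be checked with a few lines of Python. The snippet below is only a sketch that assumes a pseudo-distributed cluster on localhost; note that these defaults apply to Hadoop 1.x/2.x, while Hadoop 3.x moved the NameNode UI to port 9870.
```python
# Probe the NameNode web UI (assumes a local pseudo-distributed cluster).
# Hadoop 1.x/2.x serves it on port 50070; Hadoop 3.x moved it to 9870.
from urllib.request import urlopen

for port in (50070, 9870):
    try:
        with urlopen(f"http://localhost:{port}/", timeout=2) as resp:
            print(f"NameNode UI reachable on port {port} (HTTP {resp.status})")
            break
    except OSError:
        print(f"No response on port {port}")
```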
13. What are some of the data management tools used with Edge Nodes in Hadoop?
Oozie, Ambari, Pig and Flume are the most common data management tools that work with Edge Nodes in Hadoop.
14. Explain the core methods of a Reducer.
setup() – Called once at the start of the task; used to configure parameters such as the heap size, the distributed cache, and input data.
reduce() – The method called once per key with the list of values associated with that key; this is where the actual reduce logic runs.
cleanup() – Called once at the end of the task to clear temporary files and release resources.
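The real Reducer belongs to Hadoop’s Java API (org.apache.hadoop.mapreduce.Reducer). The Python sketch below only mimics its lifecycle to make the call order concrete, using a word-count reduce step as the example.
```python
# Python mimic of the Hadoop Reducer lifecycle (the real API is Java).
class WordCountReducer:
    def setup(self):
        # Runs once before any reduce() call: open resources, read config.
        self.total_keys = 0

    def reduce(self, key, values):
        # Runs once per key, with all values grouped under that key.
        self.total_keys += 1
        print(key, sum(values))

    def cleanup(self):
        # Runs once after the last reduce() call: release resources.
        print(f"processed {self.total_keys} keys")

# Drive the lifecycle the way the framework would.
r = WordCountReducer()
r.setup()
for key, values in {"big": [1, 1], "data": [1]}.items():
    r.reduce(key, values)
r.cleanup()
```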
Furthermore, Predictive Analytics allows companies to craft customized recommendations and marketing strategies for different buyer personas.
Together, Big Data tools and technologies help boost revenue, streamline business operations, increase productivity, and enhance customer
satisfaction. In fact, anyone who’s not leveraging Big Data today is losing out on an ocean of opportunities.
NFS vs HDFS –
• NFS: In the case of a system failure, you cannot access the data. HDFS: Data can be accessed even in the case of a system failure.
• NFS: Since NFS runs on a single machine, there is no chance of data redundancy. HDFS: Since HDFS runs on a cluster of machines, the replication protocol may lead to redundant data.
18. List the different file permissions in HDFS for files or directory levels.
The Hadoop Distributed File System (HDFS) has specific permissions for files and directories.
There are three user levels in HDFS – Owner, Group, and Others. For each of the user levels, there are three available permissions:
• read (r)
• write (w)
• execute (x)
These three permissions work uniquely for files and directories.
For files –
• The r permission is for reading a file
• The w permission is for writing a file.
Although there’s an execute(x) permission, you cannot execute HDFS files.
For directories –
• The r permission is for listing the contents of a specific directory.
• The w permission is for creating or deleting a directory.
• The x permission is for accessing a child directory.
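Each user level maps to one rwx triplet, exactly as in POSIX permission bits, and a command such as hdfs dfs -chmod 754 /path applies an octal mode in HDFS. The plain-Python sketch below (no Hadoop required) shows how such an octal mode expands into owner/group/others permissions.
```python
# Expand an octal HDFS/POSIX mode into per-level rwx permissions.
def describe_mode(mode):
    levels = ("owner", "group", "others")
    flags = ((4, "r"), (2, "w"), (1, "x"))
    digits = f"{mode:03o}"                  # e.g. 0o754 -> "754"
    for level, digit in zip(levels, digits):
        bits = int(digit)
        perms = "".join(ch if bits & bit else "-" for bit, ch in flags)
        print(f"{level:6s}: {perms}")

describe_mode(0o754)
# owner : rwx
# group : r-x
# others: r--
```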
19. Name the three modes in which you can run Hadoop.
• Standalone mode – This is Hadoop’s default mode; it uses the local file system for both input and output operations. Its main purpose is debugging. It does not support HDFS and also lacks the custom configuration required in the mapred-site.xml, core-site.xml, and hdfs-site.xml files.
• Pseudo-distributed mode – Also known as the single-node cluster, the pseudo-distributed mode includes both NameNode and DataNode within
the same machine. In this mode, all the Hadoop daemons will run on a single node, and hence, the Master and Slave nodes are the same.
• Fully distributed mode – This mode is known as the multi-node cluster wherein multiple nodes function simultaneously to execute Hadoop jobs.
Here, all the Hadoop daemons run on different nodes. So, the Master and Slave nodes run separately.
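Which mode a cluster runs in is reflected in these configuration files; in particular, the fs.defaultFS property in core-site.xml is file:/// in standalone mode and an hdfs:// URI otherwise. A minimal sketch that reads the property with Python’s standard library (the file path below is hypothetical):
```python
# Read fs.defaultFS from a core-site.xml (the path below is hypothetical).
import xml.etree.ElementTree as ET

def get_property(conf_path, name):
    root = ET.parse(conf_path).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

fs = get_property("/etc/hadoop/conf/core-site.xml", "fs.defaultFS")
# file:///          -> standalone mode (local file system)
# hdfs://host:9000  -> pseudo- or fully distributed mode
print(fs)
```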
The typical steps in a data analytics project are –
• Problem definition
• Data exploration
• Data preparation
• Modelling
• Validation of data
• Implementation and tracking
Data cleaning, also referred to as data cleansing, deals with identifying and removing errors and inconsistencies from data in order to enhance its quality.
34) List some of the best tools that can be useful for data analysis.
• Tableau
• RapidMiner
• OpenRefine
• KNIME
• Google Search Operators
• Solver
• NodeXL
• io
• Wolfram Alpha
• Google Fusion Tables
35) What is the difference between data mining and data profiling?
• Data profiling: It focuses on instance-level analysis of individual attributes. It gives information on various attributes such as the value range, discrete values and their frequency, the occurrence of null values, data type, length, etc.
• Data mining: It focuses on cluster analysis, detection of unusual records, dependencies, sequence discovery, relationships holding between several attributes, etc.
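In practice, a first-pass profile of a single attribute takes only a few lines of pandas (assuming pandas is available); the calls below compute exactly the per-attribute statistics listed above.
```python
# Quick single-attribute profiling with pandas (invented sample data).
import pandas as pd

s = pd.Series(["red", "blue", "red", None, "green", "red"])
print("dtype:        ", s.dtype)
print("null values:  ", s.isna().sum())
print("distinct:     ", s.nunique())
print("frequencies:\n", s.value_counts())
```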
Common data quality problems encountered during analysis include –
• Common misspellings
• Duplicate entries
• Missing values
• Illegal values
• Varying value representations
• Identifying overlapping data
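Most of the problems in this list can be handled with a few pandas operations. The sketch below uses invented sample data to fix a misspelling, remove a duplicate entry, and fill in a missing value.
```python
# Minimal data-cleaning pass with pandas (invented sample data).
import pandas as pd

df = pd.DataFrame({
    "city":  ["Pune", "Pnue", "Mumbai", "Mumbai", None],
    "sales": [100, 150, 200, 200, 120],
})
df["city"] = df["city"].replace({"Pnue": "Pune"})   # fix a misspelling
df = df.drop_duplicates()                           # remove duplicate entries
df["city"] = df["city"].fillna("Unknown")           # handle a missing value
print(df)
```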
In computing, a hash table is a map of keys to values. It is a data structure used to implement an associative array. It uses a hash function to compute an index into an array of slots, from which the desired value can be fetched.
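Python’s built-in dict is itself a hash table, but the mechanism is easy to show directly. The toy class below hashes a key to a bucket index and resolves collisions by chaining; it is an illustration of the idea, not production code.
```python
# Toy hash table with separate chaining.
class HashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # hash function -> slot index

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                        # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = HashTable()
t.put("volume", "amount of data")
print(t.get("volume"))
```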
The data types supported in Tableau are –
• Boolean – True/False
• Date – date values
• Date and time – timestamp values
• Geographical values – values used for geographical mapping
• Text/String – string values
• Number – decimal and whole numbers
39) What is data visualization?
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data
visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.
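As a concrete example (assuming matplotlib is installed), a simple line chart of invented monthly sales figures takes only a few calls:
```python
# Simple line chart with matplotlib (invented sample data).
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May"]
sales = [120, 135, 128, 160, 172]

plt.plot(months, sales, marker="o")
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.show()
```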
The split() function in Python breaks a string into smaller strings using the specified separator (whitespace by default) and returns a list of all the words present in the string.
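For example:
```python
# split() with the default separator (any whitespace) and a custom one.
text = "big data and data science"
print(text.split())         # ['big', 'data', 'and', 'data', 'science']

csv_line = "name,age,city"
print(csv_line.split(","))  # ['name', 'age', 'city']
```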
• Python comes with a huge standard library that covers most Internet formats and protocols, such as email and HTML.
• Python does not require explicit memory management, as the interpreter itself allocates and frees memory.
print() – used for displaying output to the user.
input() – used for reading input from the user.
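A minimal example using the two together:
```python
# Read a value with input() and display a result with print().
name = input("Enter your name: ")   # input() always returns a string
print("Hello,", name)
```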
Some popular NoSQL databases are –
1. MongoDB
2. Cassandra
3. ElasticSearch
4. Amazon DynamoDB
5. HBase
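As a taste of how one of these is used from Python, here is a minimal MongoDB sketch using the pymongo driver; it assumes a MongoDB server is running locally on the default port 27017.
```python
# Minimal MongoDB usage via pymongo (assumes a local server on port 27017).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["testdb"]["students"]

collection.insert_one({"name": "Asha", "marks": 87})   # store a document
print(collection.find_one({"name": "Asha"}))           # retrieve it back
```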