Adbms- Super 25
Lpad(String, number, char) – returns the string left-padded with the specified character to the specified
total length
Rpad(String, number, char) – returns the string right-padded with the specified character to the specified
total length
Ltrim(String) – removes white space or other specified characters from the left end of the string
Rtrim(String) – removes white space or other specified characters from the right end of the string
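The behaviour of these four functions can be imitated in a few lines of Python. This is only a sketch: the
function names, the argument order (string, total length, pad character) and the rule that a string longer
than the target length is cut down to that length are assumptions made for this illustration, not something
stated in the notes above.

```python
def lpad(s, length, pad=" "):
    """Left-pad s with pad up to length characters (assumed: longer input is truncated)."""
    if length <= 0:
        return ""
    if len(s) >= length:
        return s[:length]
    fill = (pad * length)[: length - len(s)]  # repeat the pad text, then cut to size
    return fill + s


def rpad(s, length, pad=" "):
    """Right-pad s with pad up to length characters (assumed: longer input is truncated)."""
    if length <= 0:
        return ""
    if len(s) >= length:
        return s[:length]
    fill = (pad * length)[: length - len(s)]
    return s + fill


def ltrim(s, chars=" "):
    """Remove any of the given characters from the left end of s."""
    return s.lstrip(chars)


def rtrim(s, chars=" "):
    """Remove any of the given characters from the right end of s."""
    return s.rstrip(chars)
```

For example, `lpad("abc", 6, "*")` gives `***abc` and `rtrim("xxhixx", "x")` gives `xxhi`.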
Work can be divided into smaller modules, which makes it more manageable and also improves the readability
of the code.
It promotes re-usability.
It is secure, as the code is in the database and hides the internal database details from the user.
2) Computing power: Hadoop's distributed computing model processes big data fast. The more computing
nodes you use the more processing power you have.
3) Fault tolerance: Data and application processing are protected against hardware failure. If a node goes
down, jobs are automatically redirected to other nodes to make sure the distributed computing does not
fail. Multiple copies of all data are stored automatically.
4) Flexibility: Unlike traditional relational databases, you don’t have to preprocess data before storing it.
You can store as much data as you want and decide how to use it later. That includes unstructured data like
text, images and videos.
5) Low cost: The open-source framework is free and uses commodity hardware to store large quantities of
data.
6) Scalability: You can easily grow your system to handle more data simply by adding nodes. Little
administration is required.
MapReduce is a parallel programming model for writing distributed applications devised at Google for efficient
processing of large amounts of data (multi-terabyte data-sets), on large clusters (thousands of nodes) of
commodity hardware in a reliable, fault-tolerant manner. The MapReduce program runs on Hadoop which is an
Apache open-source framework.
The Hadoop Distributed File System (HDFS) is based on the Google File System (GFS) and provides a distributed
file system that is designed to run on commodity hardware. It has many similarities with existing distributed file
systems. However, the differences from other distributed file systems are significant. It is highly fault-tolerant
and is designed to be deployed on low-cost hardware. It provides high throughput access to application data
and is suitable for applications having large datasets.
Apart from the two core components mentioned above, the Hadoop framework also includes two more modules:
1) Hadoop common utilities – These are the Java libraries and utilities required by other Hadoop modules.
2) Hadoop YARN – This is a framework for job scheduling and cluster resource management.
4. Explain the use of R-programming and also give the various applications where R-programming is used.
ANS: Use of R-programming:
R is a programming language and free software environment. It is used for statistical computing and graphics
and is supported by the R Foundation for Statistical Computing. The R language is widely used among
statisticians and data miners for developing statistical software and for data analysis.
Applications of R-Programming:
1. Banking
2. Finance
3. E-commerce
4. Social-Media
5. Healthcare
1. Facebook: Facebook uses R to analyze status updates and its social network graph.
Project Planning: Covers requirement gathering & project management. Requirement gathering: It is
done by the business analyst, the onsite technical lead & the client. 80% of requirement collection takes
place at the client side. The business analyst prepares the Business Requirements Specification (BRS)
document from the gathered requirements.
Requirement Analysis: After the requirements are collected, requirement analysis is carried out. This is a
very tough task, as it affects every decision. User requirement analysis falls into 4 categories:
- Data driven
- User Driven
- Goal Driven
- Mixed Driven
Technical Architecture Track: After requirement gathering & requirement analysis, the technical architecture
or project design takes place. This process involves turning the business requirement document into a
high-level design that includes the various modules of the data warehouse project. This high-level design is
prepared by the architects.
Data Track: The data track contains data warehouse design & ETL development.
Data warehouse design: The process of designing the database so that it fulfils the user requirements. A data
modeler is responsible for creating the Data Warehouse or Data Marts with different schemas such as:
1) Star schema: The simplest warehouse schema; its diagram resembles a star.
2) Snowflake schema: An extension of the star schema that adds additional dimensions; its diagram resembles
a snowflake.
ETL development: Designing ETL applications to fulfill the specifications of documents which are prepared in
the analysis phase. The ETL development contains the ETL code review, Peer review and ETL testing.
Business Intelligence track: It contains BI design & BI development. The business logic is developed by the
developers as per the requirement.
Deployment: It is the next phase after construction. The deployment phase is concerned with training, support
and maintenance of the product. This phase is also known as the pilot phase or stabilization phase.
Project Management: The overall data warehouse life cycle is managed by project management.
It contains different phases such as: approve specification, task allocation, manage issues, regular product
demonstration, regular product status updates and quality assurance.
Data Warehousing Development: A data warehouse is also known as an enterprise data warehouse. It is a system
used for reporting and data analysis. It is considered the core component of Business Intelligence.
OLAP (Online Analytical Processing): This component of BI allows executives to sort and select aggregates
of data for strategic monitoring. With the help of specific software products, business owners can use this
data to make adjustments to overall business processes.
Advanced Analytics or Corporate Performance Management (CPM): This set of tools allows business
leaders to look at the statistics of certain products or services. For instance, a fast food chain may analyze
the sale of certain items and make local, regional and national modifications on menu board offerings as a
result. The data could also be used to predict in which markets a new product may have the best success.
Real-time BI: Using software applications, a business can respond to real-time trends in email, messaging
systems or even digital displays. Because it’s all in real-time, an entrepreneur can announce special offers
that take advantage of what’s going on at that moment.
Data Warehousing: Data warehousing lets business leaders sift through subsets of data and examine
interrelated components that can help drive business. Looking at sales data over several years can help
improve product development or tailor seasonal offerings.
Data Sources: This component of BI involves various forms of stored data. It’s about taking the raw data
and using software applications to create meaningful data sources that each division can use to positively
impact business.
A Business Intelligence Framework is a framework that seamlessly connects the various elements of a
business: organizational roles, KPIs (Key Performance Indicators), authorization, and visualization. This
helps you implement Business Intelligence plans more easily and quickly.
1. Lock based protocol: To ensure serializability, it requires that the data items be accessed in a mutually
exclusive manner, i.e. while one transaction is accessing a data item, no other transaction can modify that
data item. The method used to implement this requirement is to allow a transaction to access a data item only
if it is currently holding a lock on that item.
Locks: A lock is a variable associated with a data item. Locks help synchronize access to the database items
by concurrent transactions. All lock requests are made to the concurrency-control manager. A transaction
proceeds only once its lock request is granted. There are different types of locks:
Binary lock: A binary lock on a data item can be in either the locked or the unlocked state.
Shared Lock: A shared lock is also called a read-only lock. With shared locks, a data item can be shared
between transactions, because a transaction holding only a shared lock never has permission to update the
data item. A shared lock is denoted by S.
Exclusive Lock: With an exclusive lock, a data item can be read as well as written. This lock can’t be held
concurrently with any other lock on the same data item. It is denoted by X. An exclusive lock is requested
using the lock-X instruction.
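The rule that shared locks can coexist while an exclusive lock excludes everything else can be sketched as a
small lock table. This is a toy illustration only: the class name LockManager, its methods and the choice to
return False instead of making the requester wait are assumptions made for this example.

```python
# Compatibility of a requested mode with a mode already held by ANOTHER transaction.
# Only shared (S) with shared (S) is compatible.
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}


class LockManager:
    """Toy concurrency-control manager: grants or refuses S and X locks."""

    def __init__(self):
        self.held = {}  # data item -> {transaction: mode}

    def request(self, txn, item, mode):
        """Return True if the lock is granted, False if the request conflicts."""
        holders = self.held.setdefault(item, {})
        for other, other_mode in holders.items():
            if other != txn and not COMPATIBLE[(other_mode, mode)]:
                return False
        # Keep the stronger mode if the transaction already holds X.
        if holders.get(txn) != "X":
            holders[txn] = mode
        return True

    def release(self, txn, item):
        """Drop the lock (if any) that txn holds on item."""
        self.held.get(item, {}).pop(txn, None)


lm = LockManager()
two_readers = lm.request("T1", "A", "S") and lm.request("T2", "A", "S")
writer_blocked = not lm.request("T3", "A", "X")
```

Here T1 and T2 both read A under shared locks, while T3's exclusive request is refused until both release.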
2. Two phase Locking protocol: Also known as 2PL. The two-phase locking protocol requires that each
transaction issue its lock and unlock requests in two phases:
Growing phase: A transaction may obtain locks but may not release any lock.
Shrinking phase: A transaction may release locks, but may not obtain any new locks.
If lock conversion is allowed, then upgrading of locks from S(A) to X(A) happens in the growing phase and
downgrading of locks from X(A) to S(A) happens in the shrinking phase. The 2PL protocol ensures
serializability. However, it does not ensure that deadlocks do not happen.
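The two-phase rule itself (no new lock after the first unlock) can be checked with a few lines of Python.
This is a sketch of the rule only: the class name TwoPhaseTransaction is made up for the example, and it does
not model lock modes, waiting or deadlock handling.

```python
class TwoPhaseTransaction:
    """Tracks the locks of one transaction and enforces the two-phase rule."""

    def __init__(self, name):
        self.name = name
        self.locks = set()
        self.shrinking = False  # becomes True at the first unlock

    def lock(self, item):
        # Growing phase only: acquiring after any release violates 2PL.
        if self.shrinking:
            raise RuntimeError(f"{self.name}: cannot lock {item} in the shrinking phase")
        self.locks.add(item)

    def unlock(self, item):
        if item not in self.locks:
            raise KeyError(f"{self.name} does not hold a lock on {item}")
        self.locks.remove(item)
        self.shrinking = True  # the transaction has entered its shrinking phase


t1 = TwoPhaseTransaction("T1")
t1.lock("A")      # growing phase
t1.lock("B")      # growing phase
t1.unlock("A")    # shrinking phase starts here
try:
    t1.lock("C")  # not allowed any more
    violated = False
except RuntimeError:
    violated = True
```

After the first unlock, the attempt to lock C is rejected, so `violated` ends up True and T1 still holds
only B.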
3. Time stamp based protocols: The timestamp-based algorithm uses a timestamp to serialize the execution of
concurrent transactions. This protocol ensures that conflicting read and write operations are executed in
timestamp order. The protocol uses the system time or a logical counter as the timestamp. The older
transaction is always given priority in this method. This is a commonly used concurrency protocol.
E.g.: Suppose there are transactions T1, T2 and T3.
T1 has entered the system at time 0010.
T2 has entered the system at time 0020.
T3 has entered the system at time 0030.
Thus priority will be given to transaction T1, then to transaction T2 and lastly to transaction T3.
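A sketch of how timestamp ordering decides whether an operation may run is given below. It uses the basic
timestamp-ordering rules as usually given in textbooks (each data item remembers the largest read and write
timestamps seen so far); these rules, the class name Item and the function names are not spelled out in the
notes above and are assumptions made for the illustration. The timestamps 10, 20 and 30 stand for T1, T2
and T3 from the example.

```python
class Item:
    """A data item with the largest read and write timestamps seen so far."""

    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0


def read(item, ts):
    """Allow the read unless a younger transaction has already written the item."""
    if ts < item.write_ts:
        return False  # the reading transaction would be rolled back
    item.read_ts = max(item.read_ts, ts)
    return True


def write(item, ts):
    """Allow the write unless a younger transaction has already read or written the item."""
    if ts < item.read_ts or ts < item.write_ts:
        return False  # the writing transaction would be rolled back
    item.write_ts = ts
    return True


T1, T2, T3 = 10, 20, 30
x = Item()
results = [
    read(x, T2),   # allowed, read_ts becomes 20
    write(x, T1),  # refused, T1 is older than the last reader T2
    write(x, T3),  # allowed, write_ts becomes 30
    read(x, T2),   # refused, T3 has already written x
]
```

An operation is refused whenever it arrives "too late", i.e. after a younger transaction has already touched
the item in a conflicting way.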
Databases have schemas, which are used to constrain what information can be stored in the database and
to constrain the data types of the stored information. XML has two main schema-definition languages: the
Document Type Definition (DTD), the first schema-definition language included as part of the XML standard,
and its more recently defined replacement, XML Schema.
Another XML schema-definition language called Relax NG is also in use. XML Schema defines a number of
built-in types such as string, integer, decimal, date, and boolean. In addition, it allows user-defined types;
these may be simple types with added restrictions, or complex types constructed using constructors such
as complexType and sequence. The first thing to note is that schema definitions in XML Schema are
themselves specified in XML syntax, using a variety of tags defined by XML Schema.
To avoid conflicts with user-defined tags, we prefix the XML Schema tags with the namespace prefix “xs:”;
this prefix is associated with the XML Schema namespace by the xmlns:xs specification in the root
element: <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
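Because a schema definition is itself an XML document, it can be read with any XML parser. The sketch below
parses a small XML Schema with Python's standard library and lists the declared elements with their built-in
types. The student element and its fields are made up for the example, and the code only reads the schema;
it does not validate documents against it.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"  # the XML Schema namespace bound to the prefix xs

# Hypothetical schema: a student element built with complexType and sequence.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="student">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="age" type="xs:integer"/>
        <xs:element name="gpa" type="xs:decimal"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

root = ET.fromstring(SCHEMA)

# ElementTree writes namespaced tags as {namespace}localname.
fields = [
    (el.get("name"), el.get("type"))
    for el in root.iter(f"{{{XS}}}element")
    if el.get("type") is not None  # skip student, which has a complex type instead
]
print(fields)
```

The list contains one (name, type) pair for each of the three simple elements inside the sequence.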
10. Compare supervised and unsupervised machine learning. (Any four points)
15. Compare Parallel and Distributed Databases (any six points).
a. HDFS
b. HBase
21. Consider the following input data for your MapReduce program: "Welcome to Hadoop Class Hadoop is
good Hadoop is bad". Draw the MapReduce architecture and explain its phases.
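For the input in question 21, the map, shuffle and reduce phases of a word count can be imitated in plain
Python. This is a single-machine sketch, not Hadoop code: the function names are made up, and the split of
the input into three lines is an assumption (the word counts are the same however the text is split).

```python
from collections import defaultdict

# Assumed split of the input into three records (one per line).
LINES = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle and sort: group all values that belong to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(key, values):
    """Reduce: add up the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


counts = word_count(LINES)
```

After the shuffle, the key Hadoop holds the list [1, 1, 1], which the reduce step sums to 3.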
Class: Student
Attributes: Name, Age, GPA, Subject, Gender
Methods: Store, Update