QuerySurge Slide Deck for Big Data Testing Webinar

Bill Hayduk
CEO, RTTS
Business Leader, QuerySurge
(the software division of RTTS)
Testing Big Data:
Automated ETL Testing of Hadoop and NoSQL
Jeff Bocarsly, Ph.D.
Chief Architect
QuerySurge Division, RTTS

built by
QuerySurge™
AGENDA
• About Big Data and Hadoop
• About NoSQL
• Hadoop and DWH Use Case
• How to test Big Data
• Demo of QuerySurge w/ Hadoop and NoSQL
Host: RTTS/QuerySurge
Date: July 30, 2022
Time: 1:00 pm, Eastern Standard Time (New York, GMT-05:00)
Session number: 630 771 732

About
RTTS is the leading provider of software & data quality for critical business systems
FACTS
Founded: 1996
Headquarters: New York
Customers: 700+
Strategic Partners: See logos
Enterprise Software: QuerySurge
Launched: 2012
Customers: 170+ in 30 countries
Technology Partners

Regional Consulting Firms
Technology Partners
Global System Integrators
Argentina, Australia, Belgium, Brazil, Canada, Chile, India,
Malaysia, Netherlands, New Zealand, Norway, Sweden,
Singapore, South Africa, Ukraine, US

C-level executives are using BI & Analytics to make critical business decisions with the assumption that the underlying data is fine.
We know it is not.
[Diagram: typical data issue areas across Mainframe, ETL, Data Warehouse, and Business Intelligence & Analytics]

Big data – defined as too much
volume, velocity and variety to
work on normal database
architectures.
Size
Defined as 5 petabytes or more
1 petabyte = 1,000 terabytes
1,000 terabytes = 1,000,000 gigabytes
1,000,000 gigabytes = 1,000,000,000 megabytes
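The size arithmetic above can be sketched in a few lines of Python. This is only an illustration of the decimal conversions on the slide; the function names `convert` and `is_big_data` are hypothetical and not part of any tool mentioned in this deck.

```python
# Decimal storage units as used on the slide: each step is a factor of 1,000.
UNITS = ["megabytes", "gigabytes", "terabytes", "petabytes"]

# The slide's size threshold for "big data" (an assumption taken from the slide).
BIG_DATA_THRESHOLD_PB = 5


def convert(value, from_unit, to_unit):
    """Convert a whole number of one unit into another (hypothetical helper)."""
    steps = UNITS.index(from_unit) - UNITS.index(to_unit)
    if steps >= 0:
        return value * 1000 ** steps
    # Converting to a larger unit rounds down to a whole number.
    return value // 1000 ** (-steps)


def is_big_data(size_in_terabytes):
    """True when the size meets the slide's 5 petabyte threshold."""
    threshold_tb = convert(BIG_DATA_THRESHOLD_PB, "petabytes", "terabytes")
    return size_in_terabytes >= threshold_tb


print(convert(1, "petabytes", "terabytes"))  # 1000
print(convert(1, "petabytes", "megabytes"))  # 1000000000
print(is_big_data(2500))                     # False (2.5 PB is below 5 PB)
```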

Handles more than 1 million customer transactions every hour.
• data imported into databases that contain > 2.5 petabytes of data
• the equivalent of 167 times the information contained in all the books in the US Library of
Congress.
Facebook handles 40 billion photos from its user base.
Google processes 1 Terabyte per hour
Twitter processes 85 million tweets per day
eBay processes 80 Terabytes per day
others

Requires exceptional technologies to efficiently process large quantities of
data within tolerable elapsed times.
Technologies include:
• massively parallel processing (MPP) databases
• data warehouses
• Data mining grids
• distributed file systems
• distributed databases
• cloud computing platforms
• the Internet, and
• scalable storage systems

Hadoop is an open-source project that develops software for scalable, distributed computing.
• is a framework for distributed processing of large data sets across clusters of computers using simple programming models.
• easily deals with complexities of high volume, velocity and variety of data
• scales up from single servers to 1,000’s of machines, each offering local computation and storage.
• detects and handles failures at the application layer

• Redundant and reliable
• Extremely powerful
• Easy to program distributed apps
• Runs on commodity hardware

“Spending on Hadoop software and subscriptions will increase to
approximately $677 million, with overall big data market
anticipated to reach the $50 billion mark.”
- Wikibon

Each machine runs two parts:
MapReduce: the processing part that manages the programming jobs (a.k.a. Task Tracker).
HDFS (Hadoop Distributed File System): stores data on the machines (a.k.a. Data Node).
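To make the MapReduce programming model concrete, here is a minimal single-process sketch in Python. It does not use Hadoop at all; it only imitates the map, shuffle and reduce phases on an in-memory list, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in order over a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


lines = ["big data needs testing", "testing big data at scale"]
counts = word_count(lines)
print(counts["big"], counts["testing"], counts["scale"])  # 2 2 1
```

On a real cluster the map and reduce steps would run in parallel on many machines, next to the data stored in HDFS; the sketch only shows the shape of the computation.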

Cluster
Add more machines for scaling – from 1 to 100 to 1,000
Job Tracker: accepts jobs, assigns tasks, identifies failed machines
Name Node: coordination for HDFS. Inserts and extractions are communicated through the Name Node.
[Diagram: a cluster of machines, each running a Task Tracker and a Data Node, together with a Name Node]

[Diagram: HiveQL queries running on top of MapReduce (Task Tracker) and HDFS (Data Node)]
Apache Hive - a data warehouse infrastructure built on top
of Hadoop for providing data summarization, query, and analysis.
Hive provides a mechanism to query the data using a SQL-like language
called HiveQL that interacts with the HDFS files
• create
• insert
• update
• delete
• select

What is NoSQL?
A term used to describe high-performance, non-relational databases that provide a mechanism for
storage and retrieval of data that is modeled in means other than the tabular relations used in
relational databases
NoSQL Database Types
Document databases pair each key with a complex data structure known as a document.
Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
Graph stores are used to store information about networks of data, such as social connections.
Graph stores include Neo4J and Giraph.
Key-value stores are the simplest NoSQL databases. Every single item in the database is stored as
an attribute name (or 'key'), together with its value. Examples of key-value stores are Riak and
Berkeley DB. Some key-value stores, such as Redis, allow each value to have a type, such as 'integer',
which adds functionality.
Wide-column stores such as Cassandra and HBase are optimized for queries over large datasets,
and store columns of data together, instead of rows.
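The data models described above can be illustrated with plain Python structures. This is a rough sketch only: it uses no real NoSQL database, the sample records are invented for illustration, and the column layout follows the slide's simplified description of wide-column stores.

```python
# Key-value store: every item is a key paired with a single value.
key_value = {"user:1001:name": "Ada", "user:1001:visits": 42}

# Document store: each key is paired with a nested document.
documents = {
    "user:1001": {
        "name": "Ada",
        "emails": ["ada@example.com"],
        "address": {"city": "New York", "country": "US"},
    }
}

# Graph store: nodes plus the connections between them.
graph_edges = {"Ada": {"Bob", "Cy"}, "Bob": {"Ada"}, "Cy": {"Ada"}}

# Row-oriented layout, as in a relational table.
rows = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 25},
    {"id": 3, "region": "east", "amount": 5},
]


def to_columns(table_rows):
    """Regroup row dictionaries so each column's values are stored together."""
    columns = {}
    for row in table_rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A query over one column only has to touch that column's values.
print(sum(columns["amount"]))                 # 40
print(documents["user:1001"]["address"]["city"])  # New York
print(sorted(graph_edges["Ada"]))             # ['Bob', 'Cy']
```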

Source: MongoDB, Inc.
Data Warehouse Batch Aggregation
ETL from MongoDB
ETL to MongoDB

When to use NoSQL?
• Online real-time processing
• Data set is smaller
• Measured in milliseconds
When to use Hadoop?
• Offline big data processing
• Offline analytics
• Measured in minutes & hours
Source: classpattern.com

[Diagram: data flow between Data Warehouse, Hadoop and NoSQL]

USE CASE 1***
Use Hadoop as a landing zone for big data & raw data
1) Bring all raw, big data into Hadoop
2) Perform some pre-processing of this data
3) Determine which data goes to the Data Warehouse
4) Extract, transform and load (ETL) pertinent data into the Data Warehouse
***Source: Vijay Ramaiah, IBM product manager, datanami magazine, June 10, 2013
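The four steps of this use case can be sketched as a small pipeline. This is a toy illustration in plain Python, with in-memory lists standing in for Hadoop and the Data Warehouse; the record fields, the "sale" filter and all function names are assumptions made up for the example.

```python
def land_raw(records):
    """Step 1: land every raw record unchanged (stand-in for Hadoop storage)."""
    return list(records)


def preprocess(landed):
    """Step 2: normalize fields and drop records that cannot be parsed."""
    cleaned = []
    for rec in landed:
        try:
            cleaned.append({
                "id": int(rec["id"]),
                "type": rec["type"].strip().lower(),
                "amount": float(rec["amount"]),
            })
        except (KeyError, ValueError, TypeError, AttributeError):
            continue  # unparseable raw record stays in the landing zone only
    return cleaned


def select_for_warehouse(cleaned, wanted_types=frozenset({"sale"})):
    """Step 3: decide which records are pertinent to the warehouse."""
    return [rec for rec in cleaned if rec["type"] in wanted_types]


def load_to_warehouse(selected):
    """Step 4: transform to warehouse rows (id, amount in cents) and load."""
    return [(rec["id"], round(rec["amount"] * 100)) for rec in selected]


raw = [
    {"id": "1", "type": " Sale ", "amount": "19.99"},
    {"id": "2", "type": "click", "amount": "0"},
    {"id": "x", "type": "sale", "amount": "5"},  # bad id, dropped in step 2
]

warehouse_rows = load_to_warehouse(select_for_warehouse(preprocess(land_raw(raw))))
print(warehouse_rows)  # [(1, 1999)]
```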

Recommended functional test strategy: Test every entry point in the system
(feeds, databases, internal messaging, front-end transactions).
The goal: provide rapid localization of data issues between points
[Diagram: data flow from Source Data through Source Hadoop, the ETL Process and the Target DWH to Business Intelligence software, with test entry points marked along the way]
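One way to picture "rapid localization of data issues between points" is to fingerprint the data at each test entry point and report the first pair of neighboring points that disagree. The sketch below is an assumption-laden illustration, not how QuerySurge works internally: it assumes the rows at each point have already been queried into a comparable shape, and the names `fingerprint` and `first_divergence` are hypothetical.

```python
import hashlib


def fingerprint(rows):
    """Return (row count, order-independent checksum) for a list of row tuples."""
    digests = sorted(
        hashlib.sha256(repr(row).encode("utf-8")).hexdigest() for row in rows
    )
    combined = hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
    return len(rows), combined


def first_divergence(stages):
    """Given ordered (name, rows) pairs, return the first neighboring
    pair of stage names whose data differs, or None if all agree."""
    for (prev_name, prev_rows), (name, rows) in zip(stages, stages[1:]):
        if fingerprint(prev_rows) != fingerprint(rows):
            return prev_name, name
    return None


source = [(1, "Ada", 1999), (2, "Alan", 500), (3, "Grace", 750)]
hadoop = [(3, "Grace", 750), (1, "Ada", 1999), (2, "Alan", 500)]  # same rows, new order
dwh = [(1, "Ada", 1999), (2, "Alan", 500)]                        # one row lost in ETL

stages = [("source", source), ("hadoop", hadoop), ("dwh", dwh)]
print(first_divergence(stages))  # ('hadoop', 'dwh')
```

Testing at every entry point is what makes this possible: with only an end-to-end check, the lost row would be detected but not located.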

[Diagram: Source Data, Ingestion, Relational DB & Data Warehousing, and BI, Analytics & Reporting, with a test entry point at each stage]

- We need to verify more data and to do it faster
- We need to automate the testing effort
- We need to be able to test across different platforms
We need a testing tool!

QuerySurge is the smart Data Testing solution that automates the data validation and ETL testing of Big Data, with full DevOps functionality for continuous testing

Data Quality at Speed
→ Automate the launch, execution, comparison & auto-email results
Test across different platforms
→ Data Warehouse, Hadoop, NoSQL, DB, flat files, XML, JSON, BI Reports
Smart Query Wizards - no coding needed
→ Query Wizards create tests visually, without writing SQL
Data Analytics & Data Intelligence
→ Data Analytics Dashboard, Data Intelligence Reports, emailed results,
Ready-for-Analytics back-end data access
Create Custom Tests
→ Modularize functions with snippets, set thresholds, stage data, check data types
DevOps for Data & Continuous Testing
→ API Integration with Build/Release, Continuous Integration/ETL,
Operations/DevOps Monitoring, Test Management/Issue Tracking, and more
Projects
→ Multi-project support, global admin user, activity log reports

Web-based…
Supported OS...
Connects through…
…to any JDBC compliant data source
QuerySurge™
QuerySurge
Controller
QuerySurge Server
DB Server (MySQL)
App Server (Tomcat)
QuerySurge Agents
(Ships with 10 Agents)
Installs...
…in the Cloud
…on a VM
…on a Bare Metal Server

QuerySurge™
• Design Library
• Scheduler
• Query Wizards
• Run-Time Dashboard
• Data Intelligence Reports
• Data Analytics Dashboard
• DevOps for Data
• Projects

Multi-Project Support
Multiple projects can now be created in a single QuerySurge instance. This allows for multiple groups to
work on the same QuerySurge server without seeing each other’s assets (project-level security).
Features supported in Multi-Projects are:
• Global Admin User: This new user type administers the QuerySurge instance
across multiple projects.
• Assign Users to Projects: Users can be assigned to one or more projects. In
each assignment, a user can have a different project role (administrator,
standard user or participant user).
• Assign Agents to Projects: Agents can be shared across projects or dedicated
to specific projects.
• Project Import: Import project data into another project on the same instance
or into a different environment (Dev/QA/Prod).
• Project Export: Export entire projects and store for backup purposes.
• Activity Log Reports: Two reports that track specific changes for auditing
purposes, including manipulations to users or connections.

Fast and Easy.
No programming needed.
• Perform 80% of all data tests with no SQL coding
• Opens up testing to novices & non-technical team members
• Speeds up testing for skilled coders
• Provides a huge Return-On-Investment


Design Library
• Create custom Query Pairs (source & target
SQLs for tests that have transformations)
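The idea of a Query Pair can be sketched with Python's built-in sqlite3 module: one query against the source applies the expected transformation, one query reads the target, and the two result sets are compared row by row. This is a generic illustration, not QuerySurge's implementation; the tables, columns and the `run_query_pair` function are hypothetical.

```python
import sqlite3

# In-memory stand-ins for a source system and a target warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (id INTEGER, first_name TEXT, last_name TEXT, cents INTEGER);
    CREATE TABLE tgt_orders (id INTEGER, full_name TEXT, cents INTEGER);
    INSERT INTO src_orders VALUES (1, 'Ada', 'Lovelace', 1999);
    INSERT INTO src_orders VALUES (2, 'Alan', 'Turing', 500);
    INSERT INTO tgt_orders VALUES (1, 'Ada Lovelace', 1999);
    INSERT INTO tgt_orders VALUES (2, 'Alan Turing', 50);
""")

# Source query applies the expected transformation (name concatenation).
source_sql = (
    "SELECT id, first_name || ' ' || last_name, cents "
    "FROM src_orders ORDER BY id"
)
# Target query reads what the ETL process actually loaded.
target_sql = "SELECT id, full_name, cents FROM tgt_orders ORDER BY id"


def run_query_pair(connection, source_query, target_query):
    """Run both queries and return (passed, list of mismatched row pairs)."""
    source_rows = connection.execute(source_query).fetchall()
    target_rows = connection.execute(target_query).fetchall()
    mismatches = [(s, t) for s, t in zip(source_rows, target_rows) if s != t]
    passed = len(source_rows) == len(target_rows) and not mismatches
    return passed, mismatches


passed, mismatches = run_query_pair(conn, source_sql, target_sql)
print(passed)      # False
print(mismatches)  # [((2, 'Alan Turing', 500), (2, 'Alan Turing', 50))]
```

In practice the source and target would be different systems reached over separate connections; a single in-memory database is used here only to keep the example self-contained.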
Scheduling
• Build groups of Query Pairs
• Schedule Test Runs
  • Run immediately
  • Run at set date/time
  • Have an event kick it off

Deep-Dive Reporting
• Examine and automatically email test results
Run Dashboard
• View real-time execution
• Analyze real-time results

QuerySurge DevOps for Data
• First full DevOps for Data testing solution
• Both RESTful and command line APIs
• Improves Data Quality at Speed
QuerySurge DevOps for Data integrates with:
• Continuous integration/ETL solutions
• Automated build/release/deployment solutions
• Operations and DevOps monitoring solutions
• Test management/issue tracking solutions
• Scheduling and workload automation solutions
60+ API calls with almost 100 different properties
that users can utilize to retrieve, edit, update, or
delete information.

• view data reliability & pass rate
• add, move, filter, zoom-in on any
data widget & underlying data
• verify build success or failure


(1) Trial in the Cloud of QuerySurge™, including a self-learning tutorial that works with sample data, for 3 days
(2) Downloaded Trial of QuerySurge™, including a self-learning tutorial with sample data or your data, for 15 days
For more information on our Trials, please visit:
www.querysurge.com/compare-trial-options
Big Data Testing Courses
Filled with examples and labs, this hands-on training teaches concepts and HQL techniques used in Big Data testing.
For more information on our Big Data Testing classes, please visit:
http://www.rttsweb.com/training/courses/big-data-testing-courses

To see the video of our Big Data testing webinar please visit:
http://www.querysurge.com/solutions/testing-big-data/big-data-testing-for-hadoop
Big Data is on the verge of revolutionizing enterprise data
management architectures.
- DeZyre
