Building a Scalable Analytics Environment
to Support Diverse Workloads
WHO WE ARE
Aunalytics
Key Stats
Aunalytics provides a leading-edge cloud platform
to help companies leverage data, algorithms, and
high-performance computing to help their teams
answer questions and perform tasks more
efficiently.
Our side-by-side digital transformation model
provides on-demand access to technology, data
science, and AI experts to help transform the
way our clients work.
> 200 Employees
> 1,000 Customers
Financial
Institution
partners
THE SOLUTION
daybreak
Daybreak is a data platform powered by financial
industry intelligence and smart features that enable a
variety of analytics solutions across the enterprise.
SQL
UNIVERSAL ACCESS TO DATA
Access all your data in one
shared location
Securely connect your existing systems with a
data-source-agnostic product, and then quickly put
your data to use with everything you need in one
place.
Give everyone on the team access to the latest and
most accurate data, so they can answer their pressing
questions.
Use Daybreak as a single source of information.
Whether you are using Tableau, Power BI, or feeding a
third-party system, you can pull from a single source.
Simplify the information. Get everyone on the
same page.
SQL
FASTER INSIGHTS
Get the right data at the
right time
Get the updated data you need, delivered on time and
consistently every day.
Convert rich, transactional data about your
customers into actionable insights.
Avoid wasting time wrangling data or straining your
IT department and focus on advancing your strategic
business priorities.
Make it easier to quickly understand your data and
save time with automated reporting and clean data.
Scale insights across the organization quickly
Leverage data insights and efficiently answer your
daily questions.
SMART
FEATURES
DATA MARTS
ARTIFICIAL INTELLIGENCE/
MACHINE LEARNING
MEMBER
LIBRARY
SERVICES
LIBRARY
TRANSACTION
LIBRARY
CORE
LENDING
MOBILE BANKING
ATM/ITM
WEALTH AND TRUST
CRM
ACCOUNT
LIBRARY
MEMBER-CENTRIC VIEW
DAYBREAK DATA WAREHOUSE
INSIGHTS
A new era for analytics
SIDE-BY-SIDE CLIENT SUCCESS
Support from a team of
data experts
Get tools, resources, and support throughout
our end-to-end process.
Integrate, enrich, and utilize data marts with
our team beside you, so you can get better
answers to the questions you have.
Be ready for your AI, machine learning, and
predictive analytics journey with the right
foundation.
Our talented team of data scientists and
analysts is here to help.
DATA
SCIENTISTS
CLIENT SUCCESS
MANAGER
BUSINESS
ANALYSTS
CLIENT
ADVISORY
TEAM
RELATIONSHIP
MANAGER
DATA ENGINEERS
ENGINEERS
CLIENT
INFRASTRUCTURE
INGESTION
SOFTWARE
SECURITY
PROJECT
MANAGER
The Challenge
Requirement: Data availability across a diverse
set of dynamic services
Based on
Requirement: Parallel and scalable data access layer
required, but not for all data all of the time
Typical Parallel File
System
All fast, all the time.
Tiering cost/benefit is
negligible and overhead
cost is high.
Alluxio as deployed
• Data in use is fast
• Invisible Upstream
• Scale based on
performance
• Scale de-coupled from
amount of storage
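The contrast between the two columns above can be put into rough numbers. The following is a minimal sketch, not the sizing method used in the deck: the function name, the `headroom` parameter, and the example figures (500 TB total, 4 TB processed daily) are all hypothetical and chosen only for illustration. The replication factor of 3 reflects the usual HDFS default.

```python
def fast_tier_tb(total_tb, daily_working_set_tb, replication=3, headroom=1.25):
    """Compare how much fast storage two designs need, in TB.

    all_fast:   every byte lives on performant storage, replicated
                (the "all fast, all the time" model).
    cache_only: only the data in use sits in a fast cache, with some
                headroom; everything else stays on cheaper storage.
    All inputs are hypothetical illustration values.
    """
    all_fast = total_tb * replication
    cache_only = daily_working_set_tb * headroom
    return all_fast, cache_only


# Hypothetical example: 500 TB of history, 4 TB touched per day.
all_fast, cache_only = fast_tier_tb(total_tb=500, daily_working_set_tb=4)
print(f"all fast: {all_fast} TB, cache only: {cache_only} TB")
```

In this model the fast tier grows with the working set rather than with total stored data, which is the sense in which scale is de-coupled from the amount of storage.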
CLOUD HOSTING/ANALYTICS
Legacy Hadoop Platform
Hadoop
Cluster ONE
Hadoop
Cluster TWO
Hadoop
Cluster THREE
Small Containerization
Platform Kubernetes
Job Controller: low volume
workloads (low lift activity)
Limitations
Data Stored in triplicate
Requires high speed
storage
Requires high IOPS storage
Requires many spindles
Costly Hadoop nodes
Storage is still performant
even when you are not
using it !!!
Heavy Lift Area
Lots of performant
storage
Lots of performant LAN
Legacy Platform
CLOUD HOSTING/ANALYTICS
Commercial Boutique Storage Proposal
Diskless Physical Hadoop
Nodes
Hadoop processing nodes
connected to remote
boutique storage
Limitations
Extreme cost storage
All nodes have singular
purpose
Requires high speed
dedicated LAN/FIBER
Requires many spindles
Storage vendor lock in
Storage vendor support
All data on HP storage
always
Storage is still performant
even when you are not
using it !!!
Heavy Lift Storage Area
Lots of performant
storage
Lots of performant LAN
(Fiber possibly)
Lots of replication
Extreme performance
storage
Commercial performance
storage
Option ONE
CLOUD HOSTING/ANALYTICS
Open-Source Storage Proposal
Diskless Physical Hadoop
Nodes
Hadoop processing nodes
connected to remote
boutique storage
Limitations
Learning Curve
Internal Staff cost/training
All nodes have singular
purpose
Requires high speed
dedicated LAN/FIBER
Requires many spindles
All data on HP storage
always
Storage is still performant
even when you are not
using it !!!
Heavy Lift Storage Area
Lots of performant
storage
Lots of performant LAN
(Fiber possibly)
Lots of replication
Extreme performance
storage
CEPH, Gluster, Lustre, DPFS
Open-Source Storage
Option TWO
CLOUD HOSTING/ANALYTICS
Data Cache Layer Extreme Speed Storage (Abstraction Layer)
200 Cores
6TB ALL FLASH
12 million read IOPS
40 GB per second sustained read performance
Cost effective
Average Transfer Speeds
Low IOPS requirement
Highly Available
Built in DR functionality
NFS
● Scalable Caching Layer
● RAM/FLASH based
● Compensates for lower
speed/cost underlying storage
● Supports Spark/MR
● Replaces Physical HDFS
Kubernetes Heavy Lift
Platform
Alluxio Caching Layer
Final Design Choice
NFS
NFS
20 Hadoop Clusters
Same Hardware as 2
Legacy Clusters
CLOUD HOSTING/ANALYTICS
Kubernetes Platform Handles Heavy Lift
Object Store
or NFS
Alluxio
Data Cache
GPU
Containerization Platform
(DC/OS) Kubernetes
High Volume Transient
Workloads
Enterprise Cloud Services
Static Critical Management
Workloads
25 Servers (Can scale
to thousands)
1400 Cores
100% Memory
No spinning Disk
Hadoop
Map Reduce
Spark
Aunsight Tasks
Apache Drill
All heavy lifting data
processing
Adaptive Read/Write Methods
Local Object Store
(S3 Compatible)
NFS
Cloud Object Store
(Amazon/Azure)
• All Flash
• 600GB Aggregate
LAN Speed
• Extreme IOPS
• Low Latency
• Temp storage for
processing loads
• All NVME/Flash
• High RAM nodes
• High Core Density
Pre Staged Read Methodology
NFS
NFS
1) Data written to NFS
2) Alluxio copies data into
Flash to pre-stage for
processing
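The two steps above can be sketched as a toy model. This is an illustration only and does not use the Alluxio client API: the class `PreStagedCache`, its methods, and the file paths are hypothetical, and plain dictionaries stand in for NFS and the flash tier.

```python
class PreStagedCache:
    """Toy model of a flash cache placed in front of slower NFS storage."""

    def __init__(self, under_store):
        self.under_store = under_store  # stands in for NFS (path -> bytes)
        self.flash = {}                 # stands in for the flash/RAM tier
        self.slow_reads = 0             # reads that had to go to NFS

    def prestage(self, paths):
        """Step 2: copy files into flash before any job asks for them."""
        for path in paths:
            self.flash[path] = self.under_store[path]

    def read(self, path):
        """Serve from flash when possible, otherwise fall back to NFS."""
        if path in self.flash:
            return self.flash[path]
        self.slow_reads += 1
        data = self.under_store[path]
        self.flash[path] = data  # keep it warm for the next reader
        return data


# Step 1: data has already been written to NFS (hypothetical paths).
nfs = {
    "/daily/a.parquet": b"aaa",
    "/daily/b.parquet": b"bbb",
    "/history/2015.parquet": b"old",
}
cache = PreStagedCache(nfs)
cache.prestage(["/daily/a.parquet", "/daily/b.parquet"])

daily = cache.read("/daily/a.parquet")         # served from flash
history = cache.read("/history/2015.parquet")  # not pre-staged, goes to NFS
print(cache.slow_reads)
```

Only the read of the file that was not pre-staged touches the slow store; the daily files are already in flash when processing starts.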
Adaptive Write Methods
• All Flash
• 600GB Aggregate
LAN Speed
• Extreme IOPS
• Low Latency
• Temp storage for
processing loads
NFS
NFS
Write to Alluxio only (Must
Cache)
Any Temp File (High Use)
Write through to UFS (Cache
Through)
(Rare Use)
Write Back to UFS (Async
Through)
Cache/Persist Later (High Use)
Write to UFS Only (Through)
(Rare Use)
Write modes embedded into
each write provide
maximum efficiency
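The four write modes listed above differ only in where the data lands and when. A minimal sketch of that behavior follows, under stated assumptions: the enum member names follow the labels on the slide, but `TieredStore`, its methods, and the paths are hypothetical and this is not the Alluxio client API. Dictionaries stand in for the cache and the under file system (UFS).

```python
from enum import Enum


class WriteType(Enum):
    MUST_CACHE = "MUST_CACHE"        # write to the cache only
    CACHE_THROUGH = "CACHE_THROUGH"  # write to cache and UFS together
    ASYNC_THROUGH = "ASYNC_THROUGH"  # write to cache now, persist later
    THROUGH = "THROUGH"              # write to UFS only


class TieredStore:
    """Toy cache-plus-UFS store where each write carries its own mode."""

    def __init__(self):
        self.cache = {}
        self.ufs = {}
        self.pending = []  # paths waiting for asynchronous persistence

    def write(self, path, data, write_type):
        if write_type is not WriteType.THROUGH:
            self.cache[path] = data
        if write_type in (WriteType.CACHE_THROUGH, WriteType.THROUGH):
            self.ufs[path] = data
        if write_type is WriteType.ASYNC_THROUGH:
            self.pending.append(path)

    def flush(self):
        """Persist everything queued by ASYNC_THROUGH writes."""
        for path in self.pending:
            self.ufs[path] = self.cache[path]
        self.pending.clear()


store = TieredStore()
store.write("/tmp/shuffle-0", b"scratch", WriteType.MUST_CACHE)
store.write("/out/result", b"result", WriteType.ASYNC_THROUGH)
store.write("/archive/raw", b"raw", WriteType.THROUGH)
persisted_before_flush = "/out/result" in store.ufs
store.flush()
persisted_after_flush = "/out/result" in store.ufs
print(persisted_before_flush, persisted_after_flush)
```

Because the mode is chosen per write, temp files never reach the slower storage, while results are persisted without the job waiting on it.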
Aunalytics
Use Case
Conclusions
• We have mass quantities of historical data that must be
stored but a much smaller amount of data that must be
processed daily
• The (relatively) small amount of data that we must
process daily requires parallelism from its underlying
storage in order to run in our required time frame
• ALL data must be quickly available for high speed
processing if required
• The Alluxio caching layer allows for in-memory storage
performance levels in a controlled, tunable, and
independently scalable way.
