Revolutionary
container-based
hybrid cloud
solution for ML
Making ML simple, portable and scalable
NextGenML
Platform
Radu Moldovan
Europe Lead Big Data & Machine Learning
Ness Machine Learning Platform (MLP)
• Robust container-oriented pipeline to:
- train a model
- provide metrics and generic evaluation
• Measure the performance of the trained model
• Deploy the model as a service
… of course, all integrated into a Kubernetes container solution
Ness Machine Learning Platform (MLP)
Ness MLP acts as the hub around which all data
science work takes place at enterprise scale.
Ness' data science platform puts the
entire data-modelling process in the hands of data
science teams so they can focus on gaining insights
from data and communicating them to key stakeholders in
your business.
• Infrastructure agnostic (Docker)
• deployed on-prem as well as in the cloud (Azure Kubernetes Service)
• easy to integrate with cloud services (Databricks – Spark)
• uses cutting-edge DevOps technologies in the container area (Docker, K8s)
• Azure heterogeneous cluster (CPU & GPU) for dynamic GPU allocation during ML training
In the client's production environment we ran more than 20 ML organ-segmentation models in parallel; each took around 20-24 hours. The GPU nodes were allocated on the fly, minimizing both time and cost.
No vendor lock-in: the entire technology stack is open source
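Dynamic GPU allocation on a heterogeneous cluster ultimately relies on standard Kubernetes scheduling: the training pod declares a GPU resource limit and a node selector, and the scheduler places it on a GPU node. A minimal sketch of such a pod spec as a Python dict; the image name and node-pool label are illustrative assumptions, not values from the slides:

```python
# Minimal sketch of a Kubernetes pod spec requesting one GPU.
# "accelerator: nvidia" and the image name are hypothetical placeholders;
# "nvidia.com/gpu" is the standard Kubernetes GPU resource name.
gpu_training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ml-training"},
    "spec": {
        "nodeSelector": {"accelerator": "nvidia"},
        "containers": [{
            "name": "training",
            "image": "myregistry.azurecr.io/ml-train:latest",
            "resources": {"limits": {"nvidia.com/gpu": "1"}},
        }],
    },
}
```

With such a spec, GPU nodes only need to exist (or be autoscaled in) when a pod actually requests `nvidia.com/gpu`, which is what makes on-the-fly allocation cost-effective.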
• ML pipeline (Kubeflow)
Build, deploy and manage multi-step ML workflows based on Docker containers
• pipeline creation is done programmatically using a Python DSL (low-level control)
• real-time visibility into progress, execution, logs and input/output parameters
• unifies the logs from pipeline containers and system containers, so it is easy for all
users (data engineers/scientists) to check progress
• easy to compare different parameters and hyper-parameters side by side in scientific
experiments
• keeps track of the entire history of all runs and re-runs
• Asset Catalog
Built for ML/AI collaboration (Records, Models, Experiments, Pipelines, Deployments)
• built by Ness from scratch as a single source of truth around ML concepts
• the central hub for data scientists and the data-integration reference for data engineers
• data-traceability UI/UX designed to easily drill down into the data and ML executions
• REST API and Python SDK to manage records, experiments and pipelines during ML execution
• "experiment metadata" provided through the UI => no need to SSH into the machine for info
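The REST API/SDK bullet can be pictured as a thin client wrapper. A hypothetical sketch using only the standard library; the base URL, endpoint path and bearer-token scheme are assumptions for illustration, not the actual Asset Catalog API:

```python
import json
import urllib.request


class AssetCatalogClient:
    """Hypothetical thin client for an Asset Catalog-style REST API."""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.token = token

    def _request(self, method: str, path: str, payload=None):
        # Build (but do not send) an authenticated JSON request.
        data = json.dumps(payload).encode() if payload is not None else None
        req = urllib.request.Request(f"{self.base_url}{path}", data=data, method=method)
        req.add_header("Authorization", f"Bearer {self.token}")
        req.add_header("Content-Type", "application/json")
        return req

    def create_experiment(self, name: str):
        # "/experiments" is an assumed endpoint, purely illustrative.
        return self._request("POST", "/experiments", {"name": name})


client = AssetCatalogClient("http://assetcatalog.example.com", "jwt-token")
req = client.create_experiment("organ-segmentation")
```

A pipeline step would call such a client at run time to register records and experiment metadata, which is what makes the metadata visible in the UI without SSH access.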
• ETL (Spark)
Built for easy integration with the ML platform
• full integration with the Asset Catalog
• created our own Spark adapters and a Spark Operator to deploy and run Spark on
Kubernetes
• created our own Spark Python SDK to create Databricks Spark clusters and deploy
and execute Spark jobs on Databricks
• integrated Azure Data Factory with Databricks and the Asset Catalog, backed by
Azure Service Bus (distributed queue)
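Underneath any adapter or operator, running Spark on Kubernetes comes down to a spark-submit invocation against the K8s API server. A minimal sketch of assembling that command in Python; the flags are standard Spark-on-K8s options, while the cluster URL, image and jar path are placeholders:

```python
def build_spark_submit(k8s_master: str, image: str, app_jar: str,
                       app_name: str = "etl-job") -> list:
    """Assemble spark-submit arguments for Kubernetes cluster mode.

    Uses the standard Spark-on-K8s flags; concrete values are placeholders.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{k8s_master}",  # K8s API server URL
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


cmd = build_spark_submit("https://aks-cluster:443",
                         "myregistry/spark:3.0",
                         "local:///opt/etl.jar")
```

A custom SDK like the one described above would generate and launch such commands (or the equivalent SparkApplication CRDs for the operator), hiding the flag details from data engineers.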
• Automation (CI/CD)
Created for easy code and workflow integration with users' demands
• smooth propagation of released code
• easily deploys "exactly the same Docker image" to different environments such as
development, integration and production
• built on top of Jenkins and DevOps best practices
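Promoting "exactly the same Docker image" through environments means only the deployment target and its values change, never the image tag. A sketch of how a deploy step could assemble a helm upgrade command; the release, chart and namespace names are illustrative, not taken from the actual pipeline:

```python
def helm_deploy_cmd(environment: str, image_tag: str,
                    release: str = "nextgenml", chart: str = "./chart") -> list:
    """Build a helm upgrade command pinning the same image tag per environment."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", environment,
        "--set", f"image.tag={image_tag}",  # identical tag in every environment
    ]


dev_cmd = helm_deploy_cmd("development", "1.4.2")
prod_cmd = helm_deploy_cmd("production", "1.4.2")
```

Because the tag is fixed once at release time, what was tested in development is bit-for-bit what runs in production.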
Developing a cognitive interface between engineers and technology – the data catalog
[Diagram: data engineers, DevOps and data scientists collaborating around the data catalog]
- unify the work of data scientists across different lines of business
- automate most of their manual data-processing steps
- data discovery and tracing
- normalize data coming from different systems
- visual navigation of experiments & deployments
import kfp.dsl as dsl
from kubernetes import client as k8s_client

training_op = dsl.ContainerOp(
    name='training',
    image=AZURE_ML___GPU_IMAGE,
    command=['bash', '-c'],
    arguments=[...],
    file_outputs={'output': '....'})
training_op.add_pod_annotation('....', 'false')
training_op.set_gpu_limit("1")

[Diagram: user-implemented TensorFlow/Keras training code submitted by data engineers, DevOps and data scientists; system automation extends the CPU cluster with GPU nodes on demand]
Data Ingestion
Data Preparation & Transformation
Training Model
Data/Asset Catalog
Data Catalog – Record
Data Catalog – Experiment
Data Catalog – Model
Data Catalog – Record Set
Data Catalog – Analytics
Continuous Integration (CI) / Continuous Deployment (CD)
[Diagram: git commits are built with Azure DevOps/TeamCity and images are pushed to the Azure Container Registry (CI, user-initiated); Docker, Helm and kSonnet then deploy to two Azure Kubernetes clusters (DEV & UAT, PRODUCTION), each with 2x CPU and 2x GPU nodes (CD, system-initiated), exercised by data engineers, DevOps, testing and data scientists]
Asset Catalog – authentication & authorisation
[Diagram: the browser reaches the Asset Catalog (http://assetcatalog.rotsrlxdv29.ness.com/) through an ingress (steps 1-3); Dex federates to the identity provider (GitLab/GitHub/AD) via OpenID Connect (OIDC)/OAuth2 and a JWT is issued for authentication & authorisation (step 4); the service then checks that the JWT signature (created with the private key, verified with the public key) is valid, that the token is not expired (payload data: 'iat': 1565182393), and that the user is authorized (steps 5-7); see next page for the signature and expiry checks]
[Diagram: platform topology inside an Azure VNet: a load balancer with a public IP fronts the portal, data catalog, sync service and apps; Active Directory handles identity; Kubeflow runs on K8s CPU and GPU nodes (ND6s instances) for training, boarding, monitoring and model serving, exchanging params, metrics and state reports; Azure Storage/Data Lake holds metrics, logs and blob files]
BOARDING
VISUALISATION
SERVING
TensorFlow Serving REST API (https://www.tensorflow.org/tfx/serving/api_rest)
POST http://host:port/v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]:(classify|regress)
# prepare the payload
encoded_image = base64.b64encode(THE_IMAGE)
payload_json = {
    "image": encoded_image,
    ....
}
# POST the request to TensorFlow Serving
json_response = requests.post(url=rest_url, json=payload_json)
# decode the response and extract the marked organ
response = extract_from_json_THE_2D_PREDICTION(json_response)  # numpy array
organ_contour = png.Reader(bytes=response.tostring())
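The snippet above elides the payload details. A self-contained sketch of building a TensorFlow Serving request with the standard library; the URL shape follows the documented REST API (shown here with :predict, which sits alongside the :classify and :regress endpoints on the slide), while the host, model name and image bytes are placeholders and the request is only constructed, not sent:

```python
import base64
import json


def build_predict_request(host: str, port: int, model_name: str,
                          image_bytes: bytes, version=None):
    """Construct URL and JSON body for a TensorFlow Serving :predict call."""
    version_part = f"/versions/{version}" if version is not None else ""
    url = f"http://{host}:{port}/v1/models/{model_name}{version_part}:predict"
    # TF Serving expects binary data base64-encoded under a "b64" key.
    body = {"instances": [{"b64": base64.b64encode(image_bytes).decode()}]}
    return url, json.dumps(body)


url, body = build_predict_request("localhost", 8501, "organ_seg",
                                  b"\x89PNG...", version=3)
```

The returned pair is exactly what `requests.post(url, data=body)` would send in the serving code above.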
Hybrid Cloud Topology
Azure services:
VPN + KeyVault
AZ DevOps Pipelines (Brigade)
Kubernetes Services (AKS)
HDInsight (Kafka & Spark)
Storage Account
DataLake Storage Gen 1 & 2
Data Factory (ADF)
Function App (lambda)
Batch Account (VM Pools)
Databricks + external metastore
Service Bus (ASB)
Cache for Redis
VM (GPU)
CI/CD:
Kibana, Logstash, Filebeat
GrayLog
Jenkins
Nexus
SonarQube**
Machine Learning:
Heterogeneous cluster (CPU+GPU)
Kubeflow (on K8s)
TensorBoard (Keras) - training/boarding/serving*
Seldon*
K8s:
Gangway + Dex (GitLab/AD) + RBAC
Ingress (Nginx)/Istio + Services
KubeFlow
Spark
Asset Catalog
GlusterFS
Grafana + Prometheus
Hybrid Cloud – ETL solution
Azure Data Factory
ABDOMEN – markers
"Small Intestine", [200, 200, 100]
"Large Intestine", [255, 255, 0]
"Spinal Canal", [150, 100, 50]
"Gall Bladder", [40, 100, 240]
"Spleen", [0, 200, 120]
"Liver", [30, 200, 50]
"Stomach", [120, 80, 10]
"Pancreas", [200, 70, 30]
"Duodenum", [20, 100, 50]
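The marker table maps each organ label to an RGB color. As a Python dict, with the color values taken from the slide and a small illustrative lookup helper:

```python
# RGB marker colors per organ, as listed on the ABDOMEN slide.
ABDOMEN_MARKERS = {
    "Small Intestine": [200, 200, 100],
    "Large Intestine": [255, 255, 0],
    "Spinal Canal": [150, 100, 50],
    "Gall Bladder": [40, 100, 240],
    "Spleen": [0, 200, 120],
    "Liver": [30, 200, 50],
    "Stomach": [120, 80, 10],
    "Pancreas": [200, 70, 30],
    "Duodenum": [20, 100, 50],
}


def marker_color(organ: str) -> list:
    """Look up the RGB marker color for an organ label."""
    return ABDOMEN_MARKERS[organ]
```

A serving post-processing step could use such a table to paint each predicted segmentation mask in its organ's color.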
• Technology Stack
Platform:
Kubernetes (on-prem) + Docker
Azure Kubernetes Cluster (AKS)
Nexus
Azure Container Registry (ACR)
GlusterFS
Workflow:
Argo -> Kubeflow
DevOps:
Helm
kSonnet
Kustomize
Azure DevOps
Code Management & CI/CD:
Git
TeamCity
SonarQube
Jenkins
Security:
MS Active Directory
Azure VPN
Dex (K8s) integrated with GitLab
Machine Learning:
TensorFlow (model training, boarding, serving)
Keras
Seldon
Storage (Azure):
Storage Gen1 & Gen2
Data Lake
File Storage
ETL (Azure):
Databricks
Data Factory (ADF)
HDInsight (Kafka and Spark)
Service Bus (ASB)
Lambda functions & VMs
Spark on K8s
Cache for Redis
Monitoring and Logging:
Grafana
Prometheus
GrayLog
www.ness.com | Ness Timisoara
Thank you!
Radu.Moldovan@ness.com
Skype: r.moldovan
LinkedIn: linkedin.com/in/raduadrianmoldovan