Revolutionary
container-based
hybrid cloud
solution for ML
Making ML simple, portable and scalable
NextGenML
Platform
Radu Moldovan
Europe Lead Big Data & Machine Learning
Ness Machine Learning Platform (MLP)
• Robust container-oriented pipeline to:
- train a model
- provide metrics and generic evaluation
• Measure the performance of the trained model
• Deploy the model as a service
… of course, all integrated into a Kubernetes container solution
Ness Machine Learning Platform (MLP)
Ness MLP acts as the hub around which all data
science work takes place at enterprise scale.
Ness' data science platform puts the
entire data-modelling process in the hands of data
science teams so they can focus on gaining insights
from data and communicating them to key stakeholders in
your business.
• Infrastructure agnostic (Docker)
• deployed on-prem as well as in the cloud (Azure Kubernetes Service)
• easy to integrate with cloud services (Databricks – Spark)
• uses cutting-edge DevOps technologies in the container area (Docker, K8s)
• Azure heterogeneous cluster (CPU & GPU) for dynamic GPU allocation during ML training
In the client's production environment we ran more than 20 ML organ-segmentation models in parallel; each took around 20-24 hours. The GPU nodes were allocated on the fly, minimizing both time and cost.
No vendor lock-in: the entire technology stack is open source
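Dynamic GPU allocation on a heterogeneous cluster ultimately relies on standard Kubernetes scheduling: the training pod declares a GPU resource limit and a node selector, and the scheduler places it on a GPU node. A minimal sketch of such a pod spec as a Python dict; the image name and node-pool label are illustrative assumptions, not values from the slides:

```python
# Minimal sketch of a Kubernetes pod spec requesting one GPU.
# "accelerator: nvidia" and the image name are hypothetical placeholders;
# "nvidia.com/gpu" is the standard Kubernetes GPU resource name.
gpu_training_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ml-training"},
    "spec": {
        "nodeSelector": {"accelerator": "nvidia"},
        "containers": [{
            "name": "training",
            "image": "myregistry.azurecr.io/ml-train:latest",
            "resources": {"limits": {"nvidia.com/gpu": "1"}},
        }],
    },
}
```

With such a spec, GPU nodes only need to exist (or be autoscaled in) when a pod actually requests `nvidia.com/gpu`, which is what makes on-the-fly allocation cost-effective.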
• ML pipeline (Kubeflow)
Build, deploy and manage multi-step ML workflows based on Docker containers
• pipeline creation is done programmatically using a Python DSL (low-level control)
• real-time visibility into progress, execution, logs and input/output parameters
• unifies the logs from pipeline containers and system containers, so it is easy for all
users (data engineers/scientists) to check progress
• easy to compare different parameters and hyper-parameters side by side in scientific
experiments
• keeps track of the entire history of all runs and re-runs
• Asset Catalog
Built for ML/AI collaboration (Records, Models, Experiments, Pipelines, Deployments)
• built by Ness from scratch as a single source of truth around ML concepts
• the central hub for data scientists and the data-integration reference for data engineers
• data-traceability UI/UX designed to easily drill down into the data and ML executions
• REST API and Python SDK to manage records, experiments and pipelines during ML execution
• "experiment metadata" provided through the UI => no need to SSH into the machine for info
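The REST API/SDK bullet can be pictured as a thin client wrapper. A hypothetical sketch using only the standard library; the base URL, endpoint path and bearer-token scheme are assumptions for illustration, not the actual Asset Catalog API:

```python
import json
import urllib.request


class AssetCatalogClient:
    """Hypothetical thin client for an Asset Catalog-style REST API."""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.token = token

    def _request(self, method: str, path: str, payload=None):
        # Build (but do not send) an authenticated JSON request.
        data = json.dumps(payload).encode() if payload is not None else None
        req = urllib.request.Request(f"{self.base_url}{path}", data=data, method=method)
        req.add_header("Authorization", f"Bearer {self.token}")
        req.add_header("Content-Type", "application/json")
        return req

    def create_experiment(self, name: str):
        # "/experiments" is an assumed endpoint, purely illustrative.
        return self._request("POST", "/experiments", {"name": name})


client = AssetCatalogClient("http://assetcatalog.example.com", "jwt-token")
req = client.create_experiment("organ-segmentation")
```

A pipeline step would call such a client at run time to register records and experiment metadata, which is what makes the metadata visible in the UI without SSH access.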
• ETL (Spark)
Built for easy integration with the ML platform
• full integration with the Asset Catalog
• created our own Spark adapters and a Spark Operator to deploy and run Spark on
Kubernetes
• created our own Spark Python SDK to create Databricks Spark clusters and deploy
and execute Spark jobs on Databricks
• integrated Azure Data Factory with Databricks and the Asset Catalog, backed by
Azure Service Bus (distributed queue)
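Underneath any adapter or operator, running Spark on Kubernetes comes down to a spark-submit invocation against the K8s API server. A minimal sketch of assembling that command in Python; the flags are standard Spark-on-K8s options, while the cluster URL, image and jar path are placeholders:

```python
def build_spark_submit(k8s_master: str, image: str, app_jar: str,
                       app_name: str = "etl-job") -> list:
    """Assemble spark-submit arguments for Kubernetes cluster mode.

    Uses the standard Spark-on-K8s flags; concrete values are placeholders.
    """
    return [
        "spark-submit",
        "--master", f"k8s://{k8s_master}",  # K8s API server URL
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--conf", f"spark.kubernetes.container.image={image}",
        app_jar,
    ]


cmd = build_spark_submit("https://aks-cluster:443",
                         "myregistry/spark:3.0",
                         "local:///opt/etl.jar")
```

A custom SDK like the one described above would generate and launch such commands (or the equivalent SparkApplication CRDs for the operator), hiding the flag details from data engineers.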
• Automation (CI/CD)
Created for easy code and workflow integration with users' demands
• smooth propagation of released code
• easily deploys "exactly the same Docker image" to different environments such as
development, integration and production
• built on top of Jenkins and DevOps best practices
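Promoting "exactly the same Docker image" through environments means only the deployment target and its values change, never the image tag. A sketch of how a deploy step could assemble a helm upgrade command; the release, chart and namespace names are illustrative, not taken from the actual pipeline:

```python
def helm_deploy_cmd(environment: str, image_tag: str,
                    release: str = "nextgenml", chart: str = "./chart") -> list:
    """Build a helm upgrade command pinning the same image tag per environment."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", environment,
        "--set", f"image.tag={image_tag}",  # identical tag in every environment
    ]


dev_cmd = helm_deploy_cmd("development", "1.4.2")
prod_cmd = helm_deploy_cmd("production", "1.4.2")
```

Because the tag is fixed once at release time, what was tested in development is bit-for-bit what runs in production.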
Developing a cognitive interface between engineers and technology – the data catalog
[Diagram: data engineers, DevOps and data scientists collaborating around the data catalog]
- unify the work of data scientists across different lines of business
- automate most of their manual data-processing steps
- data discovery and tracing
- normalize data coming from different systems
- visual navigation of experiments & deployments
import kfp.dsl as dsl
from kubernetes import client as k8s_client

training_op = dsl.ContainerOp(
    name='training',
    image=AZURE_ML___GPU_IMAGE,
    command=['bash', '-c'],
    arguments=[...],
    file_outputs={'output': '....'})
training_op.add_pod_annotation('....', 'false')
training_op.set_gpu_limit("1")

[Diagram: user-implemented TensorFlow/Keras training code submitted by data engineers, DevOps and data scientists; system automation extends the CPU cluster with GPU nodes on demand]
Data Ingestion
Data Preparation & Transformation
Training Model
Data/Asset Catalog
Data Catalog – Record
Data Catalog – Experiment
Data Catalog – Model
Data Catalog – Record Set
Data Catalog – Analytics
Continuous Integration (CI) / Continuous Deployment (CD)
[Diagram: git commits are built with Azure DevOps/TeamCity and images are pushed to the Azure Container Registry (CI, user-initiated); Docker, Helm and kSonnet then deploy to two Azure Kubernetes clusters (DEV & UAT, PRODUCTION), each with 2x CPU and 2x GPU nodes (CD, system-initiated), exercised by data engineers, DevOps, testing and data scientists]
Asset Catalog – authentication & authorisation
[Diagram: the browser reaches the Asset Catalog (http://assetcatalog.rotsrlxdv29.ness.com/) through an ingress (steps 1-3); Dex federates to the identity provider (GitLab/GitHub/AD) via OpenID Connect (OIDC)/OAuth2 and a JWT is issued for authentication & authorisation (step 4); the service then checks that the JWT signature (created with the private key, verified with the public key) is valid, that the token is not expired (payload data: 'iat': 1565182393), and that the user is authorized (steps 5-7); see next page for the signature and expiry checks]
[Diagram: platform topology inside an Azure VNet: a load balancer with a public IP fronts the portal, data catalog, sync service and apps; Active Directory handles identity; Kubeflow runs on K8s CPU and GPU nodes (ND6s instances) for training, boarding, monitoring and model serving, exchanging params, metrics and state reports; Azure Storage/Data Lake holds metrics, logs and blob files]
BOARDING
VISUALISATION
SERVING
TensorFlow Serving REST API (https://www.tensorflow.org/tfx/serving/api_rest)
POST http://host:port/v1/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]:(classify|regress)
# prepare the payload
encoded_image = base64.b64encode(THE_IMAGE)
payload_json = {
    "image": encoded_image,
    ....
}
# POST the request to TensorFlow Serving
json_response = requests.post(url=rest_url, json=payload_json)
# decode the response and extract the marked organ
response = extract_from_json_THE_2D_PREDICTION(json_response)  # numpy array
organ_contour = png.Reader(bytes=response.tostring())
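The snippet above elides the payload details. A self-contained sketch of building a TensorFlow Serving request with the standard library; the URL shape follows the documented REST API (shown here with :predict, which sits alongside the :classify and :regress endpoints on the slide), while the host, model name and image bytes are placeholders and the request is only constructed, not sent:

```python
import base64
import json


def build_predict_request(host: str, port: int, model_name: str,
                          image_bytes: bytes, version=None):
    """Construct URL and JSON body for a TensorFlow Serving :predict call."""
    version_part = f"/versions/{version}" if version is not None else ""
    url = f"http://{host}:{port}/v1/models/{model_name}{version_part}:predict"
    # TF Serving expects binary data base64-encoded under a "b64" key.
    body = {"instances": [{"b64": base64.b64encode(image_bytes).decode()}]}
    return url, json.dumps(body)


url, body = build_predict_request("localhost", 8501, "organ_seg",
                                  b"\x89PNG...", version=3)
```

The returned pair is exactly what `requests.post(url, data=body)` would send in the serving code above.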
Hybrid Cloud Topology
Azure services:
VPN + KeyVault
AZ DevOps Pipelines (Brigade)
Kubernetes Services (AKS)
HDInsight (Kafka & Spark)
Storage Account
DataLake Storage Gen 1 & 2
Data Factory (ADF)
Function App (lambda)
Batch Account (VM Pools)
Databricks + external metastore
Service Bus (ASB)
Cache for Redis
VM (GPU)
CI/CD:
Kibana, Logstash, Filebeat
GrayLog
Jenkins
Nexus
SonarQube**
Machine Learning:
Heterogeneous cluster (CPU+GPU)
Kubeflow (on K8s)
TensorBoard (Keras) - training/boarding/serving*
Seldon*
K8s:
Gangway + Dex (GitLab/AD) + RBAC
Ingress (Nginx)/Istio + Services
KubeFlow
Spark
Asset Catalog
GlusterFS
Grafana + Prometheus
Hybrid Cloud – ETL solution
Azure Data Factory
ABDOMEN – markers
"Small Intestine", [200, 200, 100]
"Large Intestine", [255, 255, 0]
"Spinal Canal", [150, 100, 50]
"Gall Bladder", [40, 100, 240]
"Spleen", [0, 200, 120]
"Liver", [30, 200, 50]
"Stomach", [120, 80, 10]
"Pancreas", [200, 70, 30]
"Duodenum", [20, 100, 50]
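The marker table maps each organ label to an RGB color. As a Python dict, with the color values taken from the slide and a small illustrative lookup helper:

```python
# RGB marker colors per organ, as listed on the ABDOMEN slide.
ABDOMEN_MARKERS = {
    "Small Intestine": [200, 200, 100],
    "Large Intestine": [255, 255, 0],
    "Spinal Canal": [150, 100, 50],
    "Gall Bladder": [40, 100, 240],
    "Spleen": [0, 200, 120],
    "Liver": [30, 200, 50],
    "Stomach": [120, 80, 10],
    "Pancreas": [200, 70, 30],
    "Duodenum": [20, 100, 50],
}


def marker_color(organ: str) -> list:
    """Look up the RGB marker color for an organ label."""
    return ABDOMEN_MARKERS[organ]
```

A serving post-processing step could use such a table to paint each predicted segmentation mask in its organ's color.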
• Technology Stack
Platform:
Kubernetes (on-prem) + Docker
Azure Kubernetes Cluster (AKS)
Nexus
Azure Container Registry (ACR)
GlusterFS
Workflow:
Argo -> Kubeflow
DevOps:
Helm
kSonnet
Kustomize
Azure DevOps
Code Management & CI/CD:
Git
TeamCity
SonarQube
Jenkins
Security:
MS Active Directory
Azure VPN
Dex (K8s) integrated with GitLab
Machine Learning:
TensorFlow (model training, boarding, serving)
Keras
Seldon
Storage (Azure):
Storage Gen1 & Gen2
Data Lake
File Storage
ETL (Azure):
Databricks
Data Factory (ADF)
HDInsight (Kafka and Spark)
Service Bus (ASB)
Lambda functions & VMs
Spark on K8s
Cache for Redis
Monitoring and Logging:
Grafana
Prometheus
GrayLog
www.ness.com | Ness Timisoara
Thank you!
Radu.Moldovan@ness.com
Skype: r.moldovan
LinkedIn: linkedin.com/in/raduadrianmoldovan