Co-funded by the European Commission
Horizon 2020 - Grant #777154
Software Defined
Networking in the
ATMOSPHERE project
Giacomo Verticale
Politecnico di Milano
• ATMOSPHERE is a 24-month H2020
project aiming at the design and
development of a framework and a
platform to implement trustworthy
cloud services on a federated
intercontinental cloud.
• Expected Results
• A federated cloud platform.
• A development framework.
• Trustworthy evaluation and monitoring.
• Trustworthy Distributed Data Management.
• Trustworthy Distributed Data Processing.
• A pilot use case on Medical Imaging Processing.
The Project
(Platform architecture diagram, bottom-up: Federated Infrastructure; Infrastructure Management Services (IMS); Trustworthy Data Management Services (TDMS); Trustworthy Data Processing Services (TDPS); Application; with Trustworthiness Monitoring & Assessment (TMA) spanning all layers.)
The problem
"I do not want to deal with the infrastructure, resource management, job scheduling, secure access and similar burdens. Moreover, I want a guarantee that no sensitive data is exposed outside of the country where it was produced."
"I need to build an Image Processing Tool that uses sensitive data and has high computing demands. Once developed, I want to offer it as a service, securely and with guaranteed Quality of Service."
Target: Diagnosis of RHD (Rheumatic Heart Disease)
• PROVAR study – the first large-scale RHD screening program in
Brazil.
• RHD Screening: public schools, private schools and primary health
units in the cities of Belo Horizonte,
Montes Claros and Bocaiúva,
Minas Gerais, Brazil.
The Data
• The characterization of echocardiographic images obtained in public schools
• 5,600 exams, with an average of 14 videos per exam (75,836 videos in total)
  • 5,330 exams (95%) are classified as normal (71,686 videos in total).
  • 238 exams (4%) are classified as borderline RHD (3,649 videos in total).
  • 32 exams (1%) are classified as definite RHD (501 videos in total).
• Additionally, there is another databank with 3.5 million electrocardiograms from the same population area and age group.
• Mean age: 13 ± 3 y.o.; female sex: 55%.
Image Biobank Requirements
• Sensitive data must not be accessible outside the boundaries of the hosting country
  • Sensitive data is protected by the Brazilian LGPD and must be processed with strong access protection, robust even on a potentially vulnerable cloud offering.
  • Anonymised data, though, can be released, but should be kept accessible only in a secured environment.
• Medical imaging processing and machine learning model building require intensive computing resources
  • The processing capabilities may not be available within the boundaries where the data is located, so the processing algorithms must run elsewhere.
  • Access should be coherent and secure, and image processing should be efficient.
• Experiments should be reproducible and stable
  • Model building, image processing and classification should run on well-defined environments that can be reproduced for further analysis.
Image Biobank Requirements
• Trust is a choice based on past experience. Trust takes time to build, but it can disappear in a second.
• Trusting cloud services is as complicated as trusting people: you need a way to measure it, and evidence on which to build it.
  • Trust in a cloud environment is considered the reliance of a customer on a cloud service and, consequently, on its provider.
• Trust is based on a broad spectrum of properties such as Security, Privacy, Coherence, Isolation, Stability, Fairness, Transparency and Dependability.
• Nowadays, few approaches deal with the quantification of trust in cloud computing.
What is trust?
• Trustworthiness is considered in its multiple dimensions
• Security, as the capability to defend against attacks.
• Privacy, as the inherent risk of a dataset containing re-identifiable data.
• Coherence, as the capability of providing coherent behaviour from any point of the federation.
• Isolation, as the difference in behaviour when a service runs isolated or not.
• Stability, as the idempotency and stability of the services.
• Fairness, as the absence of undesirable or hidden biases.
• Transparency, as the capability of understanding the output of a system.
• Dependability, mainly focusing on availability and reliability.
• Measuring the trustworthiness properties
  • A priori and a posteriori evaluation of vulnerability, performance, re-identification risks, data loss rate, integrity, robustness, scalability, resource consumption, classification bias and isolation.
Trustworthiness life-cycle
• Along with these requirements, we also explore:
• Measurement of the Fairness of
the models to evaluate the bias
of the model with respect to
sensitive categories, such as
gender or race.
• Evaluation of the Explainability
of the model.
• Evaluation of the privacy loss
risk to determine the quality of
the anonymisation and the
potential leakage of personal
data inside the models.
Image Biobank Requirements
... successfully reidentified the demographic data of
4478 adults (94.9%) & 2120 children (87.4%) …
(P < .001)
The Previous situation
Application Developers
- Develop the tools for processing the data.
- They require the
infrastructure to provide
some types of services and
resources, such as
computing, secure storage,
high-availability, data
persistence.
- They will deliver the
applications to others
to operate.
Application Manager
- An Application Developer may not be in charge of deploying the application on the production infrastructure.
- The deployment implies the
monitoring and management
of the resources, services,
user accounts and data.
- The Application Manager will have access credentials to the infrastructure and will decide the optimal allocation of the resources.
End-Users
- Data providers and Data
scientists exploring and
processing data.
- Need for secure data
transfer and data access
tracing, as well as
simplified processing
tools.
- No need to worry about acquiring ICT skills.
Building Trust with
The ATMOSPHERE Platform
Service classes
TDPS Layer
● Lemonade* is a web-based system for
designing and running analytics
applications.
● Users, who are not necessarily
programmers, describe applications as
workflows; Lemonade generates code
and controls their execution.
● Workflows consist of operations
(boxes) and data flows (arrows) among
them, performing:
⁃ Data preparation and engineering
⁃ Machine learning methods (MLlib)
⁃ Visualization metaphors
LEMONADE
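A minimal sketch of the workflow idea described above (purely illustrative, not LEMONADE's actual internal representation): operations as nodes, data flows as edges, and code generated by visiting nodes in topological order.

```python
# Minimal sketch: a workflow as operations (boxes) and data flows (arrows),
# turned into executable code by visiting nodes in topological order.
# Illustrative only; not LEMONADE's internal model or code templates.
from graphlib import TopologicalSorter

# node -> code template; "{inputs}" is filled with upstream result names
nodes = {
    "read_data": "read_data = load_csv('exams.csv')",
    "clean":     "clean = drop_missing({inputs})",
    "train":     "train = fit_kmeans({inputs}, k=4)",
    "visualize": "visualize = scatter_plot({inputs})",
}
# node -> set of predecessor nodes (the arrows)
edges = {"clean": {"read_data"}, "train": {"clean"}, "visualize": {"train"}}

def generate_code() -> str:
    lines = []
    for node in TopologicalSorter(edges).static_order():
        inputs = ", ".join(sorted(edges.get(node, ())))
        lines.append(nodes[node].format(inputs=inputs))
    return "\n".join(lines)

print(generate_code())
```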
Supported Trustworthiness properties
Property: Stability
  • Developers: stability strategies, e.g., cross-validation (illustrated below).
  • Data Scientists: quality assurance of the model outcome, e.g., calibrate cross-validation and evaluate accuracy variance.
Property: Privacy
  • Developers: privacy-preserving algorithms and techniques, e.g., k-anonymity.
  • Data Scientists: assess the impact of preserving privacy on the outcome's utility and effectiveness.
Property: Transparency
  • Developers: transparency methods to be combined with different data analytics flows, e.g., LIME/SHAP methods.
  • Data Scientists: execute ML models and, based on the explanations, calibrate the model or enhance the input.
Property: Fairness
  • Developers: fairness-enhancing mechanisms and strategies, e.g., the Aequitas toolkit.
  • Data Scientists: generate reports to evaluate fairness and decide on the features to include in models.
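As a concrete illustration of the stability strategies above, a minimal Python sketch (the dataset and classifier are placeholders, not the project's actual pipeline) that runs repeated cross-validation and reports the mean and variance of the accuracy:

```python
# Minimal sketch: assess model stability via repeated cross-validation.
# Dataset and classifier are placeholders for illustration.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold CV repeated 10 times yields 50 accuracy samples.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(clf, X, y, scoring="accuracy", cv=cv)

# A stable model shows low variance across folds and repetitions.
print(f"accuracy: mean={scores.mean():.3f}, std={scores.std():.3f}")
```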
• PAF assists organizations that own and process datasets in understanding how the processing of data can affect their conformance with privacy regulations (GDPR and LGPD)
• These assessments may be used to generate appropriate security/privacy policies and checks used by other services
Privacy assessment
forms (PAF)
TDMS Layer
• Typical best practices
  • Data in transit and at rest can be encrypted (sketched below)
  • Some processing can even be done over encrypted data
  • Keys and certificates are not included in repositories
• But this is not enough...
  • If an attacker has access to the machine (VM escapes, internal attacker, cold-boot attacks), code can be changed and memory can be dumped
  • Keys or data can be stolen
Data access challenges
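For instance, a minimal sketch of the at-rest encryption best practice, using the Python cryptography library (the payload is a stand-in, and key management is deliberately out of scope here):

```python
# Minimal sketch: encrypt data at rest with a symmetric key (Fernet).
# Real deployments must manage the key outside the code and the repository
# (e.g., a KMS or, as in ATMOSPHERE, inside an SGX enclave).
from cryptography.fernet import Fernet

key = Fernet.generate_key()            # never commit this key to a repository
fernet = Fernet(key)

plaintext = b"echo exam frame bytes"   # stand-in for a DICOM/video file
ciphertext = fernet.encrypt(plaintext)

# Without the key, a stolen disk image or intercepted transfer is useless;
# the same token format can also protect data in transit.
assert fernet.decrypt(ciphertext) == plaintext
print(len(ciphertext), "bytes of ciphertext")
```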
The Vallum Framework
(Architecture diagram: a Data Protection Layer (Vallum), running inside a trusted execution environment (TEE), sits between clients and the data stores, providing proxying, authentication, authorization, privacy and auditing. Client queries are rewritten into modified queries against the back ends (a columnar DBMS such as Cassandra, a relational DBMS such as MySQL, a document store such as MongoDB, and a file system such as IPFS), and only compliant results are returned to clients. Raw data is encrypted at rest; data is encrypted in transit.)
TMA Layer
TMA: Design and interfaces
Measures and enforces
the multiple dimensions
of trustworthiness:
• Security
• Privacy
• Coherence
• Isolation
• Stability
• Fairness
• Transparency
• Dependability
IMS Layer
An orchestration platform to manage a federated set of hybrid resources, providing measures, adaptive mechanisms and policies to improve trustworthiness
● orchestration platform ➔ automatic configuration via TOSCA blueprints
● federated ➔ multiple clouds independently owned and managed, multi-tenancy
● hybrid resources ➔ CPUs, SGX, GPUs
● measures ➔ metrics and tools to evaluate the trustworthiness of cloud resources (availability, performance, etc.)
● adaptive mechanisms ➔ to scale or reallocate cloud and network resources
Trustworthy
Infrastructure Management
Infrastructure Management Services
(Architecture diagram: the Federated Infrastructure consists of multiple Resource Providers, each running the Fogbow federation middleware. On top, the ATMOSPHERE Platform provides federation-wide monitoring services (probes running at each site plus a central monitoring service), an automated deployment service (EC3, TOSCA-IM) and a performance prediction & assessment service (model training, profiling).)
Network federation
(Diagram: two federated sites. Site A runs Cloud A (OpenStack) and Site B runs Cloud B (OpenNebula); each site has a DMZ and an internal zone hosting an FNS, a RAS, an XMPP service, an OVS instance and a Fogbow Dashboard. ONOS controllers, running the IMR application, control the OVS instances, and the sites are interconnected through IPSec tunnels.)
• Fogbow middleware can deploy multiple
VMs over a single VLAN spanning
multiple heterogeneous clouds
• Each federated site holds:
• a Federated Network Service
(FNS)
• a Resource Allocation Service
(RAS)
• an XMPP service
• one or more instances of
OpenVSwitch (OVS)
• Selected sites hold
• an instance of ONOS
• an instance of the Intent
Monitoring and Rerouting (IMR)
application
Creation of a Network
Federation
(Diagram: the same two-site setup, Site A with Cloud A (OpenStack) and Site B with Cloud B (OpenNebula), annotated with callouts for the numbered steps below.)
1. The Infrastructure Manager (IM) requests
a new federated network and specifies
the private IP range and the VLAN ID
2. The IM requests a new local VM in the
federation
3. The FNS chooses an IP address,
prepares the cloud-init script and
forwards the request to the RAS
4. The RAS sets up OVS to accept the
incoming tunnel
5. The RAS interacts with the cloud to
create the VM.
6. The VM executes the cloud-init script and
establishes a tunnel with OVS.
7. Other VMs are attached to the federated
network in a similar way, with requests for
VMs in remote sites being forwarded by
the RAS accordingly.
8. ONOS sets up routing intents between
pairs of VMs
9. Intents are monitored and re-routed to
guarantee availability (and latency)
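A minimal sketch of what steps 3 and 6 might look like, assuming a VXLAN tunnel between the new VM and the site's OVS gateway (bridge names, the addressing scheme and the tunnel type are illustrative assumptions; Fogbow's actual implementation may differ):

```python
# Minimal sketch of the FNS role: pick a free IP in the federated range and
# render the cloud-init script the VM runs to join the federated network.
# Addresses, bridge names and the use of VXLAN are illustrative assumptions.
import ipaddress

def build_cloud_init(cidr: str, used: set[str],
                     gateway_ip: str, vni: int) -> tuple[str, str]:
    """Choose a free private IP (step 3) and render the cloud-init user data
    that opens a tunnel from the VM to the site's OVS gateway (step 6)."""
    net = ipaddress.ip_network(cidr)
    vm_ip = next(str(h) for h in net.hosts() if str(h) not in used)
    user_data = f"""#cloud-config
runcmd:
  - ovs-vsctl add-br br-fed
  - ovs-vsctl add-port br-fed vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip={gateway_ip} options:key={vni}
  - ip addr add {vm_ip}/{net.prefixlen} dev br-fed
  - ip link set br-fed up
"""
    return vm_ip, user_data

ip, script = build_cloud_init("10.10.0.0/24", {"10.10.0.1"},
                              gateway_ip="203.0.113.10", vni=42)
print(ip)
print(script)
```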
DEMO:
1. Configuration of each datacenter:
• one gateway VM (OVS)
• one instance of ONOS
• one or more VMs belonging to two
federations
2. The IMS monitors link availability and
assigns each link an «availability» score
3. Two VMs in the same federation
exchange traffic along the shortest path
4. When an IPSec tunnel fails, traffic is immediately rerouted along a live path
5. When the faulty IPSec tunnel becomes available again, traffic remains on the backup path until the availability score recovers
6. When the availability score is high again, traffic is rerouted back (see the sketch below)
Distributed Implementation of Federated Networks (T4.4, D4.2)
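A minimal sketch of one plausible availability-scoring scheme (an assumption, not the project's published algorithm): an exponential moving average over periodic probe results, with hysteresis thresholds so traffic only returns once the score has recovered.

```python
# Minimal sketch (assumption, not ATMOSPHERE's published algorithm):
# per-link availability as an exponential moving average of probe outcomes,
# with two thresholds so routes do not flap around a single value.
class LinkAvailability:
    def __init__(self, alpha: float = 0.2, low: float = 0.5, high: float = 0.9):
        self.alpha, self.low, self.high = alpha, low, high
        self.score = 1.0
        self.on_backup = False

    def observe(self, probe_ok: bool) -> bool:
        """Update the score with one probe result; return True if the
        primary path should carry traffic, False if the backup should."""
        sample = 1.0 if probe_ok else 0.0
        self.score = (1 - self.alpha) * self.score + self.alpha * sample
        if self.on_backup and self.score >= self.high:
            self.on_backup = False      # score recovered: reroute back
        elif not self.on_backup and self.score < self.low:
            self.on_backup = True       # link degraded: fail over
        return not self.on_backup

link = LinkAvailability()
for ok in [True, True, False, False, False, False, True, True, True]:
    print(link.observe(ok), round(link.score, 2))
```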
The Intercontinental Use Case
• The underlying infrastructure is a federated cloud
• Using fogbow (www.fogbowcloud.org) on OpenStack and OpenNebula.
• With a Federated Network to provide a coherent network space among nodes.
• Heterogeneous resources: SGX-enabled and GPU nodes.
• Using EC3(1) and Infrastructure Manager(2) to deploy a virtual
infrastructure.
Intercontinental
infrastructure
(Deployment diagram: cloud resources in the EU and in Brazil, each with its own Cloud Manager, joined by a Federation Layer over a secure overlay network. The Brazilian side hosts SGX-enabled resources with a container holding the encrypted PROVAR Study data; the EU side hosts GPU-enabled resource containers. A central TMA and the EC3/TOSCA-IM deployment services manage the platform.)
(1) https://marketplace.eosc-portal.eu/services/elastic-cloud-compute-cluster-ec3
(2) https://marketplace.eosc-portal.eu/services/infrastructure-manager-im
• The virtual infrastructure is managed by an elastic Kubernetes cluster spawned over the federated network
  • Containers and services are accessible from both sites, but only through the federated network.
  • Resources are tagged (SGX and GPU capabilities, Brazil / Europe) so K8s applications are placed on the correct resources, as in the sketch below.
  • Infrastructure is described as code(3).
  • The K8s front-end is deployed first, and nodes are powered on as applications are deployed, generating requests for specific resources.
Deployment of the virtual
infrastructure
(3) https://github.com/grycap/ec3/tree/atmosphere
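A minimal sketch of such tag-driven placement with the Kubernetes Python client (the label keys and values, image name and namespace are illustrative assumptions, not the project's actual manifests):

```python
# Minimal sketch: place a GPU workload on EU nodes via node selectors.
# Label keys/values, image and namespace are illustrative assumptions.
from kubernetes import client, config

config.load_kube_config()   # requires a kubeconfig for the federated cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="rhd-trainer"),
    spec=client.V1PodSpec(
        # Scheduler only considers nodes carrying these (assumed) labels.
        node_selector={"region": "eu", "accelerator": "gpu"},
        containers=[client.V1Container(
            name="trainer",
            image="example/rhd-trainer:latest",
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/gpu": "1"}),
        )],
        restart_policy="Never",
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```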
• Secure storage is deployed on the Brazilian side
  • It uses Vallum(4), a service that provides on-the-fly anonymisation based on policies.
  • It masks (or blurs) the fields that are marked as sensitive for different profiles of users (see the sketch below).
  • It relies on an HDFS filesystem for the files and on SQL databases for the structured data.
  • It runs the data anonymisation and sensitive data access in enclaves on SGX-enabled containers, so they run securely even in untrusted clouds.
  • Data remains encrypted on disk.
Secure storage at Brazilian side
(Diagram: SGX-enabled resources on the Brazilian cloud host Vallum and the encrypted PROVAR Study data, under the local Cloud Manager.)
(4) https://www.atmosphere-eubrazil.eu/vallum-framework-access-privacy-protection
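A minimal sketch of profile-dependent field masking in the spirit of Vallum's policies (the field names, profiles and masking rules are illustrative assumptions, not Vallum's actual policy language):

```python
# Minimal sketch of policy-based masking, in the spirit of Vallum.
# Field names, profiles and rules are illustrative assumptions.
import hashlib

POLICIES = {
    # profile -> {field: action}
    "researcher": {"fullname": "remove",
                   "medical_record_id": "substitute",
                   "birth_date": "blur"},
    "clinician": {},   # clinicians at the hosting site see the raw record
}

def apply_policy(record: dict, profile: str) -> dict:
    """Return a copy of the record with the profile's masking actions applied."""
    masked = dict(record)
    for field, action in POLICIES.get(profile, {}).items():
        if field not in masked:
            continue
        if action == "remove":
            masked[field] = None
        elif action == "substitute":   # stable pseudonym, not reversible
            digest = hashlib.sha256(str(masked[field]).encode()).hexdigest()
            masked[field] = digest[:12]
        elif action == "blur":         # reduce precision: keep the year only
            masked[field] = str(masked[field])[:4]
    return masked

row = {"fullname": "Maria Silva", "medical_record_id": "MR-0042",
       "birth_date": "2006-03-14"}
print(apply_policy(row, "researcher"))
```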
• Data is requested from Vallum by external users, but they only access partially anonymised data
  • Anonymised data (~1TB) is copied to where the computing accelerators are placed.
Anonymised Data
(Diagram: the Brazilian cloud (SGX-enabled resources running Vallum over the encrypted PROVAR Study data) and the EU cloud (GPU-enabled resources running the application over plain and anonymised data, plus a storage service), each with its own Cloud Manager and TMA, joined by the Federation Layer over a secure overlay network with a central TMA and TOSCA-IM.)
• Videos are split into frames and classified by color inspection
  • A color-based segmentation using k-means clustering extracts the color pixels from the Doppler images (see the sketch below).
• Images are classified according to their acquisition view using a CNN
  • The parasternal long axis view has proven relevant for obtaining an accurate classification.
• First- and second-order texture analyses characterize the images by the spatial variation of pixel intensities.
  • Besides texture features, blood velocity information is also obtained.
• Finally, all the extracted features are classified through machine learning techniques in order to differentiate between RHD-positive and healthy cases
Building the models for the
Estimation pipeline.
(Pipeline diagram. Data Preparation: frame splitting, color-based segmentation of the Doppler data, image classification and preparation of the images for the classifier. Data Analysis: view classification (parasternal long axis), texture analysis & velocity extraction, and features classification.)
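A minimal sketch of the color-based segmentation step, assuming scikit-learn's k-means over the pixels of a single frame (the cluster count and the color-selection rule are illustrative assumptions, not the project's tuned parameters):

```python
# Minimal sketch: k-means color segmentation of one Doppler frame.
# Cluster count and the rule for picking "colored" clusters are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def color_segment(frame: np.ndarray, n_clusters: int = 4) -> np.ndarray:
    """frame: HxWx3 RGB array. Returns a boolean mask of colored pixels."""
    h, w, _ = frame.shape
    pixels = frame.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(pixels)
    # Grayscale pixels have near-equal RGB channels; Doppler color pixels don't.
    spread = km.cluster_centers_.max(axis=1) - km.cluster_centers_.min(axis=1)
    colored = spread > 30        # threshold on the 0-255 scale (assumption)
    return colored[km.labels_].reshape(h, w)

frame = np.random.randint(0, 256, size=(64, 64, 3))   # stand-in for a frame
mask = color_segment(frame)
print(mask.shape, mask.mean())
```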
• The pipeline is developed using LEMONADE(5)
  • LEMONADE provides a GUI and a machine learning library to develop data analytics pipelines.
  • Pipelines can be run interactively or transformed into executable code.
  • Code can be run interactively or further embedded into services to be exposed for production.
  • A model-building pipeline and an estimation pipeline are developed.
Coding the pipeline:
LEMONADE
(5) https://www.atmosphere-eubrazil.eu/lemonade-live-exploration-and-mining-non-trivial-amount-data-everywhere
Fairness
● Algorithms, in ML and AI, learn by identifying patterns in data collected over many years. Why may algorithms become "unfair"?
○ By using unbalanced data sets, biased towards certain populations.
○ By using data sets that perpetuate historical biases.
○ By inappropriate data handling.
○ As a result of inappropriate model selection, or incorrect algorithm design or application.
● Algorithmic fairness components:
○ The Aequitas Bias and Fairness Audit Toolkit, proposed by the DSSG group at the University of Chicago (http://aequitas.dssg.io/)
○ Properties:
■ Equal Parity & Proportional Parity.
■ False Positive Rate and False Discovery Rate Parity.
■ False Negative Rate and False Omission Rate Parity.
(Aequitas fairness tree diagram: representational fairness branches into Equal Parity and Proportional Parity; error fairness branches into False Negative Rate Parity (FNRP), False Positive Rate Parity (FPRP), False Discovery Rate Parity (FDRP) and False Omission Rate Parity (FORP).)
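As an illustration of the error-fairness properties above, a minimal pandas sketch (column names and data are hypothetical; Aequitas computes these and other parity metrics from a dataframe of scores, labels and attributes) that compares false positive rates across groups:

```python
# Minimal sketch: compare false-positive rates across a sensitive attribute.
# Column names and data are hypothetical placeholders.
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "A"],
    "label": [0, 0, 1, 0, 0, 0, 1, 1],   # ground truth
    "pred":  [1, 0, 1, 1, 1, 0, 1, 1],   # model decision
})

def fpr(g: pd.DataFrame) -> float:
    negatives = g[g["label"] == 0]
    return (negatives["pred"] == 1).mean()   # FP / (FP + TN)

rates = df.groupby("group").apply(fpr)
print(rates)
# FPR parity: each group's rate divided by the reference group's rate;
# values far from 1.0 signal a disparity worth auditing.
print(rates / rates["A"])
```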
● Increasing model complexity typically reduces interpretability
○ Complex multilayer Convolutional Neural Networks are far more difficult to explain than Decision Trees or Linear Regression.
● Effort is invested in characterizing explainability and providing information to explain how the algorithm reached its results
○ δ-Interpretability (https://arxiv.org/pdf/1707.03886.pdf).
○ LIME (https://github.com/marcotcr/lime)
■ The output of LIME is a list of explanations reflecting the contribution of each feature to the prediction of a data sample (see the sketch below).
Interpretability
(Figure: severe-retinopathy prediction using a 48-layer deep network; https://www.kaggle.com/kmader/inceptionv3-for-retinopathy-gpu-hr)
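A minimal LIME usage sketch on tabular data (the dataset and classifier are placeholders; the project applies the same idea to its imaging models):

```python
# Minimal sketch: per-sample feature contributions with LIME on tabular data.
# Dataset and model are placeholders for illustration.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
# Explain one prediction: which features pushed it towards which class.
exp = explainer.explain_instance(data.data[0], clf.predict_proba, num_features=5)
for feature, weight in exp.as_list():
    print(f"{feature}: {weight:+.3f}")
```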
Privacy Assessment Forms for GDPR and LGPD
● The international context requires dealing with multiple legal frameworks
○ The Brazilian LGPD and the GDPR in our case.
● Integrated a tool for tagging and following up on sensitive fields
○ It provides a list of Personally Identifiable Information (PII) and Sensitive Information
■ PII: full name, ethnicity, medical record id, gender, ...
■ Sensitive Info: medical information, genetics, ...
○ It traces the use of sensitive data within a processing workflow to guide the annotation of derived sensitive information.
Re-identification Risk
● Anonymisation is defined by policies
○ Policies define actions (removal, blurring, reduction, substitution) and the fields they apply to.
○ The system starts with the least restrictive policy, applies anonymisation and computes the metric (see the sketch below).
● Data Privacy Model
○ Anonymisation process.
○ k-anonymity model computation.
○ Threshold checker.
○ Linkage attack for validation.
○ Increase anonymity.
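A minimal sketch of the k-anonymity computation and threshold check (the quasi-identifier columns, generalisation steps and the threshold k = 5 are illustrative assumptions):

```python
# Minimal sketch: compute k-anonymity over quasi-identifiers and generalise
# when the threshold check fails. Columns, generalisation steps and the
# threshold are illustrative assumptions.
import pandas as pd

def k_anonymity(df: pd.DataFrame, quasi_ids: list[str]) -> int:
    """k = size of the smallest equivalence class over the quasi-identifiers."""
    return int(df.groupby(quasi_ids).size().min())

df = pd.DataFrame({
    "age":    [13, 13, 14, 14, 14, 17],
    "gender": ["F", "F", "M", "M", "M", "F"],
    "city":   ["BH", "BH", "BH", "BH", "BH", "MC"],
})
quasi_ids = ["age", "gender", "city"]

k = k_anonymity(df, quasi_ids)
print("k =", k)                        # the (17, F, MC) row is unique: k = 1

if k < 5:                              # threshold check fails: generalise more
    df["age"] = (df["age"] // 10) * 10  # blur age into 10-year bands
    df["city"] = "*"                    # suppress the city detail
    # k improves from 1 to 3; a real loop keeps generalising until k >= 5.
    print("k after generalisation =", k_anonymity(df, quasi_ids))
```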
Conclusions
Before:
• Need to manually configure the environment.
• Lack of reproducibility.
• Qualitative appraisal of the trustworthiness.
• Manual analysis of GDPR/LGPD risks.
• Need to trust the storage provider.
• Anonymisation level is qualitative.
After:
• Application templates for complex & distributed applications.
• A repeatable way to deploy the whole application.
• Quantitative measure of trustworthiness.
• Self-assessment of GDPR/LGPD compliance.
• Trustable storage environment even on an untrusted provider.
• Quantitative anonymisation level.