Invited lecture at the University of Trieste.
The lecture briefly covers the data streaming processing paradigm, research challenges related to distributed, parallel, and deterministic streaming analysis, and the research of the DCS (Distributed Computing and Systems) group at Chalmers University of Technology.
The data streaming processing paradigm and its use in modern fog architectures
1. The data streaming processing paradigm and
its use in modern fog architectures
Vincenzo Gulisano, Ph.D.
2. Agenda
• Who we are (my group)
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
4. Who we are
Chalmers University
Computer Science and Engineering department
Distributed Computing and Systems research group
5. Who we are
• ~15 PhD degrees awarded
• ~12 PhD students, 5 postdocs and 6 faculty
• Acknowledged internationally as a leading group in practical multicore synchronization algorithms and programming (results adopted by Java JDK, C++, Intel, NVIDIA)
• Extensive network of academic and industrial collaborations
• Continuously supported by national and international projects
6. Distributed Computing and Systems Research Group
Department of Computer Science and Engineering
Chalmers University of Technology
Research expertise & projects at our research team:
• Cyber security
• Efficient parallel & stream computing
• Distributed systems
• IoT & sensor networks
7. Agenda
• Who we are
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
8. Buzzwords
• IoT
• Edge
• Fog
• Cloud
• Cyber-Physical System
• Big Data
• ...
11. IoT enables increased awareness, security, power efficiency, ...
Large IoT systems are complex
Traditional data analysis techniques alone are not adequate!
13. Large IoT systems are complex
Examples: AMIs (Advanced Metering Infrastructures), VNs (Vehicular Networks)
Characteristics [15]:
1. edge location
2. location awareness
3. low latency
4. geographical distribution
5. large-scale
6. support for mobility
7. real-time interactions
8. predominance of wireless
9. heterogeneous
10. interoperability / federation
11. interaction with the cloud
14. Traditional data analysis techniques alone are not adequate! [13,14]
1. Does the infrastructure allow billions of readings per day to be transferred continuously?
2. Does the latency incurred while transferring data undermine the utility of the analysis?
3. Is it secure to concentrate all the data in a single place? [11]
4. Is it smart to give away fine-grained data? [12]
15. Agenda
• Who we are
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
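16. Motivation: DBMS vs. DSMS
DBMS: (1) data is stored on disk, (2) a query is submitted, (3) query processing in main memory produces the query results
DSMS: a continuous query resides in main memory; data flows through query processing, which continuously produces query results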
17. Before we start... about data streaming and Stream Processing Engines (SPEs)
An incomplete list of SPEs over time (cf. related work in [18]):
• Borealis
• The Aurora Project
• STREAM (STanford stREam datA Manager)
• NiagaraCQ
• COUGAR
• StreamCloud
Covering all of them, or discussing which use cases suit each one best, is out of scope.
The following slides show the connection between what is being presented and a certain SPE.
18. data stream: unbounded sequence of tuples sharing the same schema
Example: vehicles’ speed reports
Field           Type
vehicle id      text
time (secs)     text
speed (Km/h)    double
X coordinate    double
Y coordinate    double
Tuples over time:
A 8:00 55.5 X1 Y1
A 8:03 70.3 X2 Y2
A 8:07 34.3 X3 Y3
Let’s assume each source (e.g., vehicle) produces and delivers a timestamp-sorted stream
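A minimal Python sketch of the vehicle speed-report stream described above. The class name SpeedReport, the helper names, and the numeric coordinates (standing in for X1, Y1, ...) are assumptions made for illustration; a finite generator stands in for the unbounded stream.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class SpeedReport:
    """One tuple; every tuple in the stream shares this schema."""
    vehicle_id: str
    time: str         # "H:MM", kept as text as in the schema above
    speed_kmh: float
    x: float          # hypothetical value standing in for X1, X2, ...
    y: float          # hypothetical value standing in for Y1, Y2, ...


def to_minutes(hhmm: str) -> int:
    """Convert "H:MM" into minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def vehicle_a_reports() -> Iterator[SpeedReport]:
    """Finite stand-in for vehicle A's unbounded stream."""
    yield SpeedReport("A", "8:00", 55.5, 1.0, 1.0)
    yield SpeedReport("A", "8:03", 70.3, 2.0, 2.0)
    yield SpeedReport("A", "8:07", 34.3, 3.0, 3.0)


def is_timestamp_sorted(stream: Iterable[SpeedReport]) -> bool:
    """Check the assumption that a source delivers tuples in timestamp order."""
    previous = None
    for report in stream:
        current = to_minutes(report.time)
        if previous is not None and current < previous:
            return False
        previous = current
    return True
```

Calling is_timestamp_sorted(vehicle_a_reports()) checks the sortedness assumption on the three example tuples.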
19. continuous query (or simply query): Directed Acyclic Graph (DAG) of streams and operators
• source op (1+ out streams)
• op (1+ in, 1+ out streams)
• sink op (1+ in streams)
• stream (edge between two ops)
20. data streaming operators
Two main types:
• Stateless operators
  • do not maintain any state
  • one-by-one processing
  • if they maintain some state, such state does not evolve depending on the tuples being processed
• Stateful operators
  • maintain a state that evolves depending on the tuples being processed
  • produce output tuples that depend on multiple input tuples
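A small Python sketch contrasting the two operator types, assuming tuples of the form (vehicle id, time in minutes, speed in Km/h). The function names map_op, filter_op and running_max_op are hypothetical and not taken from any particular SPE.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

Report = Tuple[str, int, float]  # (vehicle id, time in minutes, speed in Km/h)


def map_op(stream: Iterable[Report], fn: Callable[[Report], Report]) -> Iterator[Report]:
    """Stateless: each output depends only on the tuple being processed."""
    for report in stream:
        yield fn(report)


def filter_op(stream: Iterable[Report], predicate: Callable[[Report], bool]) -> Iterator[Report]:
    """Stateless: forwards or drops tuples one by one."""
    for report in stream:
        if predicate(report):
            yield report


def running_max_op(stream: Iterable[Report]) -> Iterator[Report]:
    """Stateful: the per-vehicle maximum evolves with the tuples processed,
    so each output depends on multiple input tuples."""
    max_speed: Dict[str, float] = {}
    for vehicle_id, time, speed in stream:
        max_speed[vehicle_id] = max(speed, max_speed.get(vehicle_id, speed))
        yield (vehicle_id, time, max_speed[vehicle_id])
```

Because each operator consumes and produces an iterator, operators can be chained, e.g. running_max_op(filter_op(stream, predicate)).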
25. stateful operators
• Aggregate: aggregate information from multiple tuples (e.g., max, min, sum, ...)
• Join: join tuples coming from 2 streams given a certain predicate
27. Wait a moment!
If streams are unbounded, how can we aggregate or join?
28. windows and stateful analysis [18]
Stateful operations are done over windows:
• Time-based (e.g., tuples in the last 10 minutes)
• Tuple-based (e.g., given the last 50 tuples)
Example of time-based sliding windows (1 hour long, advancing every 20 minutes): [8:00,9:00), [8:20,9:20), [8:40,9:40)
Usually applications rely on time-based sliding windows
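A hedged Python sketch of the two window types. It assumes timestamps are integers (minutes since midnight) and that time-based windows start at multiples of the advance; the function names are hypothetical.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple


def windows_containing(ts: int, size: int, advance: int) -> List[Tuple[int, int]]:
    """Return every time-based window [start, start + size) that contains ts.

    Assumption: window starts are aligned to multiples of `advance`.
    """
    start = (ts // advance) * advance  # latest window start not after ts
    windows = []
    while start + size > ts:
        windows.append((start, start + size))
        start -= advance
    return sorted(windows)


def last_n(stream: Iterable, n: int) -> Iterator[list]:
    """Tuple-based window: after each item, yield the last n items seen."""
    window = deque(maxlen=n)
    for item in stream:
        window.append(item)
        yield list(window)
```

With size 60 and advance 20, a tuple at 8:45 (minute 525) falls into the windows starting at 8:00, 8:20 and 8:40, i.e. the three windows listed above.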
29. time-based sliding window aggregation (count)
Input timestamps: 8:05, 8:15, 8:22, 8:45, 9:05
Window [8:00,9:00): Counter: 1, 2, 3, 4 as 8:05, 8:15, 8:22 and 8:45 arrive; when 9:05 arrives the window is complete. Output: 4
Window [8:20,9:20): Counter: 3 (8:22, 8:45, 9:05)
We assumed each source produces and delivers a timestamp-sorted stream!
What happens if this is not the case?
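The count example above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of any specific SPE: timestamps are minutes since midnight, the input is timestamp-sorted, windows are aligned to multiples of the advance, and a window is emitted as soon as a tuple at or past its end arrives. The function name sliding_count is hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple


def sliding_count(timestamps: Iterable[int], size: int,
                  advance: int) -> Iterator[Tuple[int, int, int]]:
    """Yield (window start, window end, count) for each completed window."""
    counts: Dict[int, int] = {}  # window start -> counter
    for ts in timestamps:
        # Sorted input: any window ending at or before ts cannot receive
        # further tuples, so its result can be produced now.
        for start in sorted(s for s in counts if s + size <= ts):
            yield (start, start + size, counts.pop(start))
        # Increment the counter of every window that contains ts.
        start = (ts // advance) * advance
        while start + size > ts:
            counts[start] = counts.get(start, 0) + 1
            start -= advance
```

For the timestamps 8:05, 8:15, 8:22, 8:45, 9:05 (485, 495, 502, 525, 545) with size 60 and advance 20, the window [8:00,9:00) is emitted with count 4 when 9:05 arrives. The sketch also tracks the earlier windows [7:20,8:20) and [7:40,8:40), which overlap 8:05 but are not drawn on the slide. If the input were not sorted, a late tuple could arrive after its window has already been emitted, which is the issue the slide's closing question points at.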
31. basic operators and user-defined operators
Besides a set of basic operators, SPEs usually allow the user to define ad-hoc operators (e.g., when existing aggregations are not enough)
32. sample query
For each vehicle, raise an alert if the speed of the latest report is more than twice its average speed over the last 30 days.
Input tuples over time:
A 8:00 55.5 X1 Y1
A 8:03 70.3 X2 Y2
A 8:07 34.3 X3 Y3
33. sample query
1. Map: remove unused fields
   in: vehicle id, time (secs), speed (Km/h), X coordinate, Y coordinate
   out: vehicle id, time (secs), speed (Km/h)
2. Aggregate: compute average speed for each vehicle during the last 30 days
   out: vehicle id, time (secs), avg speed (Km/h)
3. Join: join on vehicle id (Map output with Aggregate output)
   out: vehicle id, time (secs), avg speed (Km/h), speed (Km/h)
4. Filter: check condition
34. sample query: Map (M), Aggregate (A), Join (J), Filter (F)
Notice:
• the same semantics can be defined in several ways (using different operators and composing them in different ways)
• using many basic building blocks can ease the task of distributing and parallelizing the analysis (more in the following...)
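A simplified Python sketch of the sample query, with the Map, Aggregate, Join and Filter steps fused into one function for brevity. Assumptions made here and not stated in the slides: time is in seconds, the average is recomputed on every report over the earlier reports of the same vehicle within the last 30 days (rather than at fixed window advances), and the name speed_alerts is hypothetical.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, Iterator, Tuple

THIRTY_DAYS = 30 * 24 * 3600  # seconds

Report = Tuple[str, int, float, float, float]  # id, time, speed, x, y
Alert = Tuple[str, int, float, float]          # id, time, avg speed, speed


def speed_alerts(reports: Iterable[Report], window: int = THIRTY_DAYS,
                 factor: float = 2.0) -> Iterator[Alert]:
    history: Dict[str, Deque[Tuple[int, float]]] = defaultdict(deque)
    for vehicle_id, time, speed, _x, _y in reports:  # Map: drop X and Y
        past = history[vehicle_id]
        # Aggregate: keep only reports inside the window, then average them.
        while past and past[0][0] <= time - window:
            past.popleft()
        if past:
            avg = sum(s for _, s in past) / len(past)
            # Join (latest report with its average) and Filter (condition).
            if speed > factor * avg:
                yield (vehicle_id, time, avg, speed)
        past.append((time, speed))
```

Keeping the four operators separate, as on the slide, would make the per-vehicle Aggregate state an explicit unit that can be distributed or parallelized; the fused version only illustrates the semantics.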
39. Agenda
• Who we are
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
40. Challenges and research questions
1. Distributed deployment
2. Parallel deployment
3. Ordering and determinism
4. Shared-nothing vs shared-memory parallelism
5. Load balancing
6. Elasticity
7. Fault tolerance
8. Data sharing
41. How can research support these layers? Why is it challenging?
• Data analysts
• Stream Processing Engine
• OS / Hardware
Picture source: Modern OS, by A. Tanenbaum
42. A minimal example
Data analyst: "tell me the average speed (per week) of each car"
Road Side Unit (RSU)
43. Parallelization challenges
Each RSU: compute average speed (per week) of each car
Does this work?
What if a car travels close to different RSUs (and sends data to both)?
45. Parallelization challenges
First RSU: compute average speed (per week) of each car
• Odd plate number: process locally
• Even plate number: send to other RSU
Second RSU: compute average speed (per week) of each car
• Even plate number: process locally
• Odd plate number: send to other RSU
They are now communicating!
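A minimal Python sketch of the odd/even plate routing above, assuming numeric plate numbers and that RSU i is responsible for plates with plate % num_rsus == i. The class RoadSideUnit and the helper connect are hypothetical, and the weekly window is left out for brevity.

```python
from typing import Dict, List, Optional


class RoadSideUnit:
    def __init__(self, rsu_id: int, num_rsus: int) -> None:
        self.rsu_id = rsu_id
        self.num_rsus = num_rsus
        self.peers: Dict[int, "RoadSideUnit"] = {}
        self.state: Dict[int, List[float]] = {}  # plate -> [speed sum, count]

    def owner_of(self, plate: int) -> int:
        """With two RSUs: even plates -> RSU 0, odd plates -> RSU 1."""
        return plate % self.num_rsus

    def receive(self, plate: int, speed: float) -> None:
        owner = self.owner_of(plate)
        if owner != self.rsu_id:
            self.peers[owner].receive(plate, speed)  # send to other RSU
            return
        entry = self.state.setdefault(plate, [0.0, 0])
        entry[0] += speed
        entry[1] += 1

    def average(self, plate: int) -> Optional[float]:
        entry = self.state.get(plate)
        return entry[0] / entry[1] if entry else None


def connect(num_rsus: int) -> List[RoadSideUnit]:
    """Create the RSUs and let each one know its peers."""
    rsus = [RoadSideUnit(i, num_rsus) for i in range(num_rsus)]
    for rsu in rsus:
        rsu.peers = {other.rsu_id: other for other in rsus if other is not rsu}
    return rsus
```

Because all reports of a car end up at a single RSU, the average is correct even when the car sends data to both RSUs; the price is the communication between them.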
46. Balancing challenges
First RSU: compute average speed (per week) of each car
• Odd plate number: process locally
• Even plate number: send to other RSU
Second RSU: compute average speed (per week) of each car
• Even plate number: process locally
• Odd plate number: send to other RSU
"Too much pollution, we will introduce the alternate license plates policy!"
"Thanks! Now I am using only 50% of my resources..."
47. Balancing challenges
First RSU: compute average speed (per week) of each car
• Dynamic condition TRUE: process locally
• Dynamic condition FALSE: send to other RSU
Second RSU: compute average speed (per week) of each car
• Dynamic condition FALSE: process locally
• Dynamic condition TRUE: send to other RSU
They are now communicating!
There must be monitoring!
We need state transfer protocols!
48. Balancing challenges
Request: "tell me the average speed (per week) of each car"
Over time, the RSU responsible for a car changes when the dynamic condition changes (in between the week!)
What about the data already sent to the previously responsible RSU?
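One possible way to deal with data already sent before a responsibility change, sketched in Python under assumptions not made in the slides: each RSU keeps, per car, a partial aggregate (speed sum, report count) for the current week, and partial aggregates can be merged. The names merge and transfer_state are hypothetical, and a real protocol would also need to handle tuples in flight during the transfer.

```python
from typing import Dict, Iterable, Tuple

Partial = Tuple[float, int]  # (speed sum, report count) for the current week


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partial aggregates of the same car and week."""
    return (a[0] + b[0], a[1] + b[1])


def transfer_state(old_owner: Dict[int, Partial], new_owner: Dict[int, Partial],
                   moved_cars: Iterable[int]) -> None:
    """Move the partial aggregates of the reassigned cars to the new owner."""
    for car in moved_cars:
        partial = old_owner.pop(car, None)
        if partial is not None:
            new_owner[car] = merge(new_owner.get(car, (0.0, 0)), partial)


def average(partial: Partial) -> float:
    return partial[0] / partial[1]
```

Averages cannot be merged directly, but (sum, count) pairs can, which is why the sketch transfers the pair rather than the average.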
49. Adapting challenges
First RSU: compute average speed (per week) of each car
• Dynamic condition TRUE: process locally
• Dynamic condition FALSE: send to other RSU
Second RSU: compute average speed (per week) of each car
• Dynamic condition FALSE: process locally
• Dynamic condition TRUE: send to other RSU
What about the data already sent to the previously responsible RSU?
50. Balancing challenges
First RSU: compute average speed (per week) of each car
• Dynamic condition TRUE: process locally
• Dynamic condition FALSE: send to other RSU
Second RSU: compute average speed (per week) of each car
• Dynamic condition FALSE: process locally
• Dynamic condition TRUE: send to other RSU
They are now communicating!
There must be monitoring!
We need state transfer protocols!
We need backups!
What about the data sent between and ?
Cars need to buffer data too!
51. Parallel execution of streaming operators
• Fault tolerance
• Elasticity
• Load balancing
• Determinism
Parallel execution of streaming applications
• DDoS detection and mitigation
• Intrusion detection
• Data validation
• Differentially private aggregation
• Vehicular networks analysis
• Smart Grids / AMI analysis
Security and privacy, IoT, Cyber-Physical Systems
Provenance and custom scheduling
52. Synchronization / Data structures
Many-core systems / FPGAs
Parallel execution of streaming operators
• Fault tolerance
• Elasticity
• Load balancing
• Determinism
• Parallel joins
• Parallel aggregates
• Joins modeling
Parallel execution of streaming applications
• DDoS detection and mitigation
• Intrusion detection
• Data validation
• Differentially private aggregation
• Vehicular networks analysis
• Smart Grids / AMI analysis
Security and privacy, IoT, Cyber-Physical Systems
Provenance and custom scheduling
53. • Palyvos-Giannas, D., Gulisano, V., & Papatriantafilou, M. (2019). GeneaLog: Fine-grained
data streaming provenance in cyber-physical systems. Parallel Computing
• Najdataei, H., Nikolakopoulos, Y., Papatriantafilou, M., Tsigas, P., & Gulisano, V.
(2019). STRETCH: Scalable and Elastic Deterministic Streaming Analysis with Virtual
Shared-Nothing Parallelism. Proceedings of the 13th ACM International Conference
on Distributed and Event-based Systems, DEBS 2019
• Palyvos-Giannas, D., Gulisano, V., & Papatriantafilou, M. (2019). Haren: A Framework
for Ad-Hoc Thread Scheduling Policies for Data Streaming Applications. Proceedings
of the 13th ACM International Conference on Distributed and Event-based Systems,
DEBS 2019
• Havers, B., Duvignau, R., Najdataei, H., Gulisano, V., Koppisetty, A. C., &
Papatriantafilou, M. (2019). DRIVEN: a Framework for Efficient Data Retrieval and
Clustering in Vehicular Networks. 35th IEEE International Conference on Data
Engineering, ICDE 2019
• Duvignau, R., Gulisano, V., Papatriantafilou, M., & Savic, V. (2019). Streaming
piecewise linear approximation for efficient data management in edge computing.
Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC 2019
• Walulya, I., Palyvos-Giannas, D., Nikolakopoulos, Y., Gulisano, V., Papatriantafilou,
M., & Tsigas, P. (2018). Viper: A module for communication-layer determinism and
scaling in low-latency stream processing. Future Generation Comp. Syst., 88
• Rooij, J. van, Gulisano, V., & Papatriantafilou, M. (2018). LoCoVolt: Distributed
Detection of Broken Meters in Smart Grids through Stream Processing. Proceedings
of the 12th ACM International Conference on Distributed and Event-based Systems,
DEBS 2018
Topic labels on this slide: Provenance, Determinism, Parallelism, Elasticity, Custom scheduling, Vehicular networks, Smart Grids / AMIs
54. • Najdataei, H., Nikolakopoulos, Y., Gulisano, V., & Papatriantafilou, M.
(2018). Continuous and Parallel LiDAR Point-Cloud Clustering. 38th IEEE
International Conference on Distributed Computing Systems, ICDCS 2018
• Gulisano, V., Nikolakopoulos, Y., Cederman, D., Papatriantafilou, M., &
Tsigas, P. (2017). Efficient Data Streaming Multiway Aggregation through
Concurrent Algorithmic Designs and New Abstract Data Types. TOPC
• Zacheilas, N., Kalogeraki, V., Nikolakopoulos, Y., Gulisano, V.,
Papatriantafilou, M., & Tsigas, P. (2017). Maximizing Determinism in
Stream Processing Under Latency Constraints. Proceedings of the
11th ACM International Conference on Distributed and Event-based
Systems, DEBS 2017
• Gulisano, V., Papadopoulos, A. V., Nikolakopoulos, Y., Papatriantafilou, M.,
& Tsigas, P. (2017). Performance Modeling of Stream Joins. Proceedings
of the 11th ACM International Conference on Distributed and
Event-based Systems, DEBS 2017
• Geethakumari, P. R., Gulisano, V., Svensson, B. J., Trancoso, P., & Sourdis,
I. (2017). Single window stream aggregation using reconfigurable
hardware. International Conference on Field Programmable Technology,
ICFPT 2017
• Gulisano, V., Tudor, V., Almgren, M., & Papatriantafilou, M. (2016). BES:
Differentially Private and Distributed Event Aggregation in Advanced
Metering Infrastructures. Proceedings of the 2nd ACM International
Workshop on Cyber-Physical System Security, CPSS@AsiaCCS
• Gulisano, V., Callau-Zori, M., Fu, Z., Jiménez-Peris, R., Papatriantafilou, M.,
& Patiño-Martínez, M. (2015). STONE: A streaming DDoS defense
framework. Expert Syst. Appl., 42
Topic labels on this slide: Parallelism, Clustering, Determinism, Stream aggregates, Synchronization / Data structures, Modeling, Stream joins, FPGAs, Differentially private aggregation, DDoS detection and mitigation
55. • Gulisano, V., Nikolakopoulos, Y., Papatriantafilou, M., & Tsigas, P. (2015).
Scalejoin: A deterministic, disjoint-parallel and skew-resilient stream join. 2015
IEEE International Conference on Big Data, Big Data 2015
• Gulisano, V., Nikolakopoulos, Y., Walulya, I., Papatriantafilou, M., & Tsigas, P.
(2015). Deterministic real-time analytics of geospatial data streams through
ScaleGate objects. Proceedings of the 9th ACM International Conference on
Distributed Event-Based Systems, DEBS ’15
• Gulisano, V., Almgren, M., & Papatriantafilou, M. (2014). METIS: A Two-Tier
Intrusion Detection System for Advanced Metering Infrastructures.
International Conference on Security and Privacy in Communication Networks -
10th International ICST Conference, SecureComm 2014
• Gulisano, V., Jiménez-Peris, R., Patiño-Martínez, M., Soriente, C., & Valduriez, P.
(2012). StreamCloud: An Elastic and Scalable Data Streaming System. IEEE
Trans. Parallel Distrib. Syst., 23
• Gulisano, V., Jiménez-Peris, R., Patiño-Martínez, M., & Valduriez, P. (2010).
StreamCloud: A Large Scale Data Streaming System. 2010 International
Conference on Distributed Computing Systems, ICDCS 2010
Topic labels on this slide: Parallelism, Determinism, Stream joins, Synchronization / Data structures, Intrusion detection
56. Agenda
• Who we are
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
57. Humans
Millions of years of evolution, millions of sensors
"Should I (really) have an extra piece of cake?"
• Store information
• Iterate multiple times over data
• Think, do not rush through decisions
"Danger!!! Run!!!"
• "Hard-wired" routines
• Real-time decisions
• High-throughput / low-latency
58. Computers (cyber-physical / IoT systems)
Years / decades of evolution, millions of sensors
"What traffic congestion patterns can I observe frequently?"
Databases, data mining techniques...
• Store information
• Iterate multiple times over data
• Think, do not rush through decisions
"Don’t overtake, car in opposite lane!"
Data streaming, distributed and parallel analysis
• Continuous analysis
• Real-time decisions
• High-throughput / low-latency
59. Agenda
• Motivation
• The data streaming processing paradigm
• Challenges and research questions
• Conclusions
• Bibliography
60. Bibliography
1. Zhou, Jiazhen, Rose Qingyang Hu, and Yi Qian. "Scalable distributed communication architectures to support advanced
metering infrastructure in smart grid." IEEE Transactions on Parallel and Distributed Systems 23.9 (2012): 1632-1642.
2. Gulisano, Vincenzo, et al. "BES: Differentially Private and Distributed Event Aggregation in Advanced Metering Infrastructures."
Proceedings of the 2nd ACM International Workshop on Cyber-Physical System Security. ACM, 2016.
3. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "Online and scalable data validation in advanced metering
infrastructures." IEEE PES Innovative Smart Grid Technologies, Europe. IEEE, 2014.
4. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "METIS: a two-tier intrusion detection system for advanced
metering infrastructures." International Conference on Security and Privacy in Communication Systems. Springer International
Publishing, 2014.
5. Yousefi, Saleh, Mahmoud Siadat Mousavi, and Mahmood Fathy. "Vehicular ad hoc networks (VANETs): challenges and
perspectives." 2006 6th International Conference on ITS Telecommunications. IEEE, 2006.
6. El Zarki, Magda, et al. "Security issues in a future vehicular network." European Wireless. Vol. 2. 2002.
7. Georgiadis, Giorgos, and Marina Papatriantafilou. "Dealing with storage without forecasts in smart grids: Problem
transformation and online scheduling algorithm." Proceedings of the 29th Annual ACM Symposium on Applied Computing.
ACM, 2014.
8. Fu, Zhang, et al. "Online temporal-spatial analysis for detection of critical events in Cyber-Physical Systems." Big Data (Big
Data), 2014 IEEE International Conference on. IEEE, 2014.
61. Bibliography
9. Arasu, Arvind, et al. "Linear road: a stream data management benchmark." Proceedings of the Thirtieth
international conference on Very large data bases-Volume 30. VLDB Endowment, 2004.
10. Lv, Yisheng, et al. "Traffic flow prediction with big data: a deep learning approach." IEEE Transactions on
Intelligent Transportation Systems 16.2 (2015): 865-873.
11. Grochocki, David, et al. "AMI threats, intrusion detection requirements and deployment
recommendations." Smart Grid Communications (SmartGridComm), 2012 IEEE Third International
Conference on. IEEE, 2012.
12. Molina-Markham, Andrés, et al. "Private memoirs of a smart meter." Proceedings of the 2nd ACM
workshop on embedded sensing systems for energy-efficiency in building. ACM, 2010.
13. Gulisano, Vincenzo, et al. "Streamcloud: A large scale data streaming system." Distributed Computing
Systems (ICDCS), 2010 IEEE 30th International Conference on. IEEE, 2010.
14. Stonebraker, Michael, Uǧur Çetintemel, and Stan Zdonik. "The 8 requirements of real-time stream
processing." ACM SIGMOD Record 34.4 (2005): 42-47.
15. Bonomi, Flavio, et al. "Fog computing and its role in the internet of things." Proceedings of the first
edition of the MCC workshop on Mobile cloud computing. ACM, 2012.
16. Himmelsbach, Michael, et al. "LIDAR-based 3D object perception." Proceedings of 1st international
workshop on cognition for technical systems. Vol. 1. 2008.
62. Bibliography
17. Geiger, Andreas, et al. "Vision meets robotics: The KITTI dataset." The International Journal of Robotics Research (2013):
0278364913491297.
18. Gulisano, Vincenzo Massimiliano. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Diss. Informatica,
2012.
19. Cardellini, Valeria, et al. "Optimal operator placement for distributed stream processing applications." Proceedings of the 10th
ACM International Conference on Distributed and Event-based Systems. ACM, 2016.
20. Costache, Stefania, et al. "Understanding the Data-Processing Challenges in Intelligent Vehicular Systems." Proceedings of the
2016 IEEE Intelligent Vehicles Symposium (IV16).
21. Cormode, Graham. "The continuous distributed monitoring model." ACM SIGMOD Record 42.1 (2013): 5-14.
22. Giatrakos, Nikos, Antonios Deligiannakis, and Minos Garofalakis. "Scalable Approximate Query Tracking over Highly Distributed
Data Streams." Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
23. Gulisano, Vincenzo, et al. "Streamcloud: An elastic and scalable data streaming system." IEEE Transactions on Parallel and
Distributed Systems 23.12 (2012): 2351-2365.
24. Shah, Mehul A., et al. "Flux: An adaptive partitioning operator for continuous query systems." Data Engineering, 2003.
Proceedings. 19th International Conference on. IEEE, 2003.
63. Bibliography
25. Cederman, Daniel, et al. "Brief announcement: concurrent data structures for efficient streaming aggregation." Proceedings of
the 26th ACM symposium on Parallelism in algorithms and architectures. ACM, 2014.
26. Ji, Yuanzhen, et al. "Quality-driven processing of sliding window aggregates over out-of-order data streams." Proceedings of
the 9th ACM International Conference on Distributed Event-Based Systems. ACM, 2015.
27. Ji, Yuanzhen, et al. "Quality-driven disorder handling for concurrent windowed stream queries with shared operators."
Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems. ACM, 2016.
28. Gulisano, Vincenzo, et al. "Scalejoin: A deterministic, disjoint-parallel and skew-resilient stream join." Big Data (Big Data), 2015
IEEE International Conference on. IEEE, 2015.
29. Ottenwälder, Beate, et al. "MigCEP: operator migration for mobility driven distributed complex event processing." Proceedings
of the 7th ACM international conference on Distributed event-based systems. ACM, 2013.
30. De Matteis, Tiziano, and Gabriele Mencagli. "Keep calm and react with foresight: strategies for low-latency and energy-efficient
elastic data stream processing." Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming. ACM, 2016.
31. Balazinska, Magdalena, et al. "Fault-tolerance in the Borealis distributed stream processing system." ACM Transactions on
Database Systems (TODS) 33.1 (2008): 3.
32. Castro Fernandez, Raul, et al. "Integrating scale out and fault tolerance in stream processing using operator state
management." Proceedings of the 2013 ACM SIGMOD international conference on Management of data. ACM, 2013.
64. Bibliography
33. Dwork, Cynthia. "Differential privacy: A survey of results." International Conference on Theory and Applications of
Models of Computation. Springer Berlin Heidelberg, 2008.
34. Dwork, Cynthia, et al. "Differential privacy under continual observation." Proceedings of the forty-second ACM
symposium on Theory of computing. ACM, 2010.
35. Kargl, Frank, Arik Friedman, and Roksana Boreli. "Differential privacy in intelligent transportation systems." Proceedings
of the sixth ACM conference on Security and privacy in wireless and mobile networks. ACM, 2013.
Editor's Notes
Before we start... questions and please notice
I like to be hands-on...
say these are some examples...
About the ordering we will come back later
Need to explain a bit what the approach I am following here is
maybe say the DB part is a bit of an oversimplification