Strider: An Adaptive, Inference-enabled Distributed RDF Stream Processing Engine

Xiangnan Ren 1,2, Olivier Curé 2, Li Ke 1, Jérémy Lhez 2, Badre Belabbess 1,2, Tendry Randriamalala 1, Yufan Zheng 1, Gabriel Képéklian 1
1 ATOS, 80 quai Voltaire, 95870 Bezons, France. {xiang-nan.ren, firstname.lastname}@atos.net
2 UPEM LIGM - UMR CNRS 8049, 77454 Marne-la-Vallée, France. {firstname.lastname}@u-pem.fr

Proceedings of the VLDB Endowment (PVLDB), 2017, 10(12), pp. 1905-1908. DOI: 10.14778/3137765.3137805. HAL Id: hal-01736988.

ABSTRACT

Real-time processing of data streams emanating from sensors is becoming a common task in industrial scenarios. An increasing number of processing jobs executed over such platforms require reasoning mechanisms. The key implementation goal is thus to efficiently handle massive incoming data streams while supporting reasoning and data analytics services. Moreover, in an ongoing industrial project on anomaly detection in large potable water networks, we face dynamically changing data and workload characteristics in stream processing. The Strider system addresses these research and implementation challenges while providing scalability, fault tolerance, high throughput and acceptable latency. We demonstrate the benefits of Strider in a real-world, industrial Internet of Things setting.

1. INTRODUCTION

The vast amount of data produced by the Internet of Things (IoT) generally needs to be processed in near real time. This is the case for an increasing number of industrial scenarios that require detecting anomalies, identifying commercial trends, diagnosing machine conditions, etc. In the context of the Waves FUI (Fonds Unique Interministériel) project (http://www.waves-rsp.org/), we process data streams emanating from sensors distributed over the potable water distribution network of an international resource management company. For France alone, this company distributes water to over 12 million clients through a network of more than 100,000 kilometers equipped with thousands (and growing) of sensors. Our system's main objective is to automatically detect anomalies, e.g., water leaks, from the analyzed data streams. The promptness and accuracy of our anomaly detection has both ecological (loss of treated water) and economic (price of the water consumed by clients) impact. These anomaly detections frequently depend on reasoning services.
That is, metadata emitted by IoT sensors is first mapped to predefined vocabularies that support logic-based inferences. For data integration, inference and stack homogeneity reasons, we have opted for the Semantic Web technology stack, i.e., RDF(S), SPARQL and OWL. Hence, one of our goals is to design a system that efficiently integrates reasoning services within a real-time processing platform. This is performed by finding a trade-off between the two common reasoning mechanisms adopted in knowledge-based management systems, namely inference materialization and query rewriting. Our solution consists in creating semantic-aware dictionaries for the concepts, predicates and instances of the managed vocabularies and in using the properties of the attributed identifiers to optimize query rewriting. We thus minimize the size of the stored knowledge bases (KBs) and of the data streams, as well as the size of reformulated continuous queries and the time needed to reformulate them.

In real-world scenarios, we frequently face dynamically changing data and workload characteristics, e.g., a sensor may emit different types of messages depending on user requests. These changes impact the execution performance of continuous queries registered over data streams. To ensure efficient query execution, the execution plan may have to change during its usually long-lived execution. The problem is exacerbated by the hardness of generating optimized query plans for SPARQL queries, which typically involve many more joins than SQL queries. To support this behavior, we have implemented adaptive query processing (henceforth AQP) [3] for continuous SPARQL queries over RDF data.

Moreover, processing large-scale data streams is usually performed over a distributed setting, which guarantees scalability, automatic work distribution and fault tolerance, but also has to satisfy high-throughput and acceptable-latency constraints. Such systems are better designed and operated when implemented on top of robust, proven engines such as Apache Kafka (a distributed commit log that efficiently stores and delivers data streams) and Apache Spark (a general-purpose, unified cluster computing framework). Hence, our inference-enabled SPARQL AQP needs to fit efficiently into this distributed setting. We consider that such an AQP does not yet exist because of the high expertise entry point (distributed systems, database management systems and Semantic Web) required to start this kind of project. Thus, it is not a surprise that, to the best of our knowledge, no industry-targeted, production-ready systems are currently available. In fact, the RDF Stream Processing (RSP) ecosystem regroups engines that are either (i) distributed but lacking important features, e.g., Katts [4], or not open-source, e.g., CQELS Cloud [5], or (ii) centralized and hence unable to support high throughput, e.g., C-SPARQL [1]. A common characteristic of these engines is to assume that the structure of the data stream does not change. While facing Waves' real-world use cases, we found out that one cannot assume the regularity of incoming streaming data. Finally, we consider that the best of breed in SPARQL AQP necessarily needs to mix static (heuristic-based) and dynamic (cost-based) query optimization approaches.

2. THE STRIDER SYSTEM

In this section, we first present an overview of the Strider system, then detail the reasoning and query optimization components. Finally, implementation aspects are considered.
2.1 Architecture overview

Figure 1 gives a high-level overview of the system's architecture. The left-hand side gives details of the application's data flow. Its design is relatively standard and follows the approach generally adopted in stream processing systems. In a nutshell, data sources (IoT sensors) send messages to a publish-subscribe layer. That layer emits messages to the streaming layer, which executes the registered queries. The main originality of our approach consists in transforming the data source messages, e.g., CSV files, into an RDF serialization for data integration and reasoning purposes. On the right-hand side of Figure 1, we concentrate on components related to the system's implementation. The Encoding layer runs off-line and pre-processes the encoding of all KB elements, i.e., concepts, predicates and instances, into integer values using a semantic-aware approach. This component interacts with the RDF event converter and the Inference layer. The Request layer registers continuous queries, which are then sent to the Parsing layer to compute a first version of a query plan. Plans involving any form of reasoning are extended by the Inference layer. These new plans are pushed to the Optimization layer, which consists of three collaborating sub-components: static and adaptive optimizations as well as a trigger mechanism. Finally, the Query Processing layer sets off the query execution as soon as the optimized logical plan is in place.

[Figure 1: Strider Architecture]

2.2 Encoding and Inference layers

Before any data stream processing, the Encoding layer encodes the concepts, predicates and instances of the registered KBs. In the remainder of this paper, we consider that a KB consists of a schema, a.k.a. ontology or terminological box (Tbox), and a fact base, a.k.a. assertional box (Abox). With this Encoding layer, we aim to provide an efficient encoding scheme and data structures to support the reasoning services associated with the input ontology of an application. The input ontology is considered to be the union of the (potentially aligned) ontologies necessary to operate over one's application domain. In the current version of our work, we address the ρdf subset of RDFS, meaning that we are only interested in the rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain and rdfs:range constructors. To address inferences drawn from the first two RDFS predicates, we attribute numerical identifiers to ontology terms, i.e., concepts and predicates. The compression principle of this term encoding lies in the fact that subsumption relationships are represented within the encoding of each term. This is performed by prefixing the encoding of a term with the encoding of its direct parent (a workaround is proposed to support multiple inheritance). This approach only works if the encoding is computed over a binary representation. More details on the encoding scheme can be found in [2]. Once this ontology pre-processing has been performed, we can use the generated dictionaries to encode all incoming data streams. In fact, the schemas of all our domain's sensor messages are known in advance and have been mapped to ontology elements. Thus these messages can be represented as subject, predicate and object triples. Using our dictionaries, we can represent these triples with element identifiers. Not all triple entries are transformed using our dictionaries. For instance, numerical values (highly frequent in IoT) and blank nodes are not transformed. All predicates, concepts and domain constants are transformed.
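To make the prefix-based scheme concrete, the following minimal sketch (our own illustration rather than Strider's actual code; the toy concept hierarchy, bit widths and helper names are hypothetical) assigns binary-prefixed identifiers to a small hierarchy and derives, with two bit shifts, the contiguous identifier interval covered by a concept and all of its direct and indirect sub-concepts:

    object PrefixEncodingSketch {

      // Hypothetical toy hierarchy: Sensor is the root; S1 has two sub-concepts.
      val children: Map[String, List[String]] = Map(
        "Sensor" -> List("S1", "S2"),
        "S1"     -> List("S1a", "S1b"),
        "S2"     -> Nil,
        "S1a"    -> Nil,
        "S1b"    -> Nil
      )

      val localBits = 2   // bits added per hierarchy level (hypothetical choice)
      val totalBits = 8   // common width used when comparing identifiers

      // Returns (prefix value, prefix length in bits) for a concept and its descendants.
      def encode(concept: String, prefix: Long = 1L, length: Int = 1): Map[String, (Long, Int)] =
        children(concept).zipWithIndex.foldLeft(Map(concept -> (prefix, length))) {
          case (acc, (child, i)) =>
            // child id = parent bits followed by a non-zero local id (i + 1)
            acc ++ encode(child, (prefix << localBits) | (i + 1), length + localBits)
        }

      // Left-align an encoding on totalBits: this aligned value is the dictionary identifier.
      def alignedId(enc: (Long, Int)): Long = enc._1 << (totalBits - enc._2)

      // Two bit shifts give the interval covering a concept and all of its sub-concepts.
      def interval(enc: (Long, Int)): (Long, Long) =
        (enc._1 << (totalBits - enc._2), (enc._1 + 1) << (totalBits - enc._2))

      def main(args: Array[String]): Unit = {
        val dict = encode("Sensor")
        val (lo, hi) = interval(dict("S1"))
        // Any identifier in [lo, hi) denotes S1 or one of its direct/indirect sub-concepts.
        println(s"S1 covers [$lo, $hi)")
        dict.foreach { case (c, enc) =>
          println(f"$c%-6s id=${alignedId(enc)}%3d bits=${enc._1.toBinaryString}")
        }
      }
    }

Because every sub-concept identifier shares its ancestor's prefix, checking subsumption reduces to an integer range test, which is what the query reformulation described below exploits.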
Following an analysis of the queries associated with Kafka topics, some RDF triples can be materialized into the data streams, hence potentially making a query satisfiable. The semantic-aware encoding of concepts and predicates supports the reasoning services required at query processing time. For instance, consider a query asking for the pressure value of sensors of type S1. This would be expressed with the following two triple patterns: ?x pressureValue ?v. ?x type S1. If the sensor concept S1 has n sub-concepts, a naive query reformulation requires running the union of n+1 queries. With our semantic-aware encoding, we are able, using two bit-shift operations, to compute the identifier interval, i.e., [lowerBound, upperBound[, of all direct and indirect sub-concepts of S1. We can thus evaluate the query with a simple reformulation: (i) replace the concept S1 with a new variable, ?x type ?newVar, and (ii) introduce a filter clause constraining the values of this new variable: FILTER (?newVar >= lowerBound && ?newVar < upperBound). The main benefit of this tight integration of inference and stream processing is the retrieval of all correct answers of continuous queries based on a trade-off between inference materialization and query reformulation.

2.3 Optimization layer

Conventional SPARQL query optimization techniques relying on data pre-processing, e.g., data indexing or maintaining statistics, are not appropriate under real-time or near real-time processing requirements. Besides, due to the schema-free, graph nature of RDF data, a navigation-based query language such as SPARQL potentially involves a large number of joins and self-joins. In a distributed streaming context, this behavior becomes a performance bottleneck, since a join task might lead to data shuffling, which causes heavy network communication. To cope with this performance issue, and inspired by state-of-the-art efforts [3, 6, 7], we build our query optimizer with hybrid static and adaptive optimization strategies.

In the Request layer, Strider allows registering multiple queries at once. The input SPARQL queries are submitted through different threads and executed concurrently according to the execution plans generated by the optimizer. Currently, we consider the input queries to be independent; thus multi-query optimization (e.g., sub-query sharing) is out of the scope of the current state of Strider. As described in Section 2.1, the Optimization layer possesses three sub-layers: static optimization, adaptive optimization and a trigger. Fundamentally, both static and adaptive optimizations operate on a graph G, denoted as an Undirected Connected Graph (UCG) [6], whose vertices are triple patterns and whose edges are joins between triple patterns. The weights of the UCG's vertices and edges correspond to the selectivity of triple patterns and join patterns, respectively. Once a UCG is initialized, the query planner automatically generates an optimal logical plan and triggers the query execution.

Static Optimization creates a UCG using a set of heuristic rules. This predefined rule set empirically assigns the weights of the UCG vertices and edges. Next, the query planner determines the shortest traversal path in the current UCG and generates the logical plan for query execution. The obtained logical plan represents the query execution pipeline, which is kept by the system permanently.
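The sketch below illustrates the general idea of such a heuristically weighted UCG; the weighting rule (counting bound positions), the greedy traversal and the sample patterns are simplifications chosen for illustration and do not reproduce Strider's actual rule set:

    object StaticUcgSketch {

      case class TriplePattern(s: String, p: String, o: String) {
        def vars: Set[String] = Set(s, p, o).filter(_.startsWith("?"))
      }

      // Heuristic vertex weight: fewer bound positions => less selective => higher weight.
      def vertexWeight(tp: TriplePattern): Double =
        1.0 / (1 + Seq(tp.s, tp.p, tp.o).count(!_.startsWith("?")))

      // UCG edges: one edge per pair of patterns sharing a variable, weighted from the vertices.
      def edges(tps: Seq[TriplePattern]): Seq[((TriplePattern, TriplePattern), Double)] =
        for {
          i <- tps.indices
          j <- (i + 1) until tps.size
          if (tps(i).vars intersect tps(j).vars).nonEmpty
        } yield ((tps(i), tps(j)), vertexWeight(tps(i)) * vertexWeight(tps(j)))

      // Greedy traversal: start from the cheapest edge, then repeatedly add the cheapest
      // connected pattern. The resulting order is the logical join pipeline.
      def joinOrder(tps: Seq[TriplePattern]): Seq[TriplePattern] = {
        val cheapest = edges(tps).minBy(_._2)._1          // assumes a connected query graph
        var plan = Vector(cheapest._1, cheapest._2)
        var remaining = tps.filterNot(tp => plan.contains(tp))
        while (remaining.nonEmpty) {
          val next = remaining
            .filter(tp => plan.exists(p => (p.vars intersect tp.vars).nonEmpty))
            .minBy(vertexWeight)
          plan = plan :+ next
          remaining = remaining.filterNot(_ == next)
        }
        plan
      }

      def main(args: Array[String]): Unit = {
        // Simplified star-shaped patterns in the spirit of query Q5 shown below.
        val q5 = Seq(
          TriplePattern("?s",  "ssn:hasValue", "?o1"),
          TriplePattern("?s",  "ssn:hasValue", "?o2"),
          TriplePattern("?o1", "rdf:type",     "cuahsi:flow"),
          TriplePattern("?o2", "rdf:type",     "cuahsi:temperature")
        )
        joinOrder(q5).foreach(println)
      }
    }

In Strider, such empirical weights are replaced by run-time statistic weights once the adaptive optimization is triggered, as described next.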
Our static optimization strategy depends solely on the query shape and is independent of the dataset. In practice, this is an acceptable approximation, since it applies a basic optimization and simplifies the implementation. However, static optimization cannot guarantee the optimal query plan for all input streams, and it does not cover the situation where the structure of the data stream changes. To remedy this shortcoming, we add an adaptive query optimization component.

The Trigger layer is the transition between the static and adaptive optimization stages. In a nutshell, the trigger layer notifies the system whether it is necessary to proceed with the adaptive optimization. Our adaptive strategy requires collecting statistics at run time, which in a distributed environment involves a non-negligible computation cost. The Strider prototype provides a set of straightforward trigger rules, i.e., the adaptive algebra optimization is triggered by a configurable workload threshold. The threshold refers to two factors: (1) the number of incoming RDF events/triples; (2) the fraction of the estimated input data size over the allocated executors' heap memory.

When Adaptive Optimization is activated, statistics gathering and query evaluation occur synchronously, and the UCG elements are assigned so-called statistic weights. At the same time, the optimizer recomputes and evaluates the optimal algebra tree in a bottom-up fashion. Adaptive logical plan scheduling therefore ensures that the system always executes a query in an optimal way. Here is an example of a Strider query:

    STREAMING { WINDOW [ 20 SECONDS ]
                SLIDE  [ 20 SECONDS ]
                BATCH  [ 5 SECONDS ] }
    REGISTER  { QUERYID [ Q5 ]
                SPARQL [
      prefix rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
      prefix ssn:    <http://purl.oclc.org/NET/ssnx/ssn/>
      prefix cuahsi: <http://www.cuahsi.org/waterML/>
      select ?s ?o1
      where {
        ?s ssn:hasValue ?o1 ;
           ssn:hasValue ?o2 ;
           ssn:hasValue ?o3 .
        ?o1 rdf:type cuahsi:flow .
        ?o2 rdf:type cuahsi:temperature .
        ?o3 rdf:type cuahsi:chlorine .
      } ] }

Figure 2 illustrates Strider's adaptive optimization for Q5. The query UCG (Figure 2(a),(b)) and the logical plan (Figure 2(c),(d)) change as the structure of the data stream changes dynamically.

[Figure 2: The query UCG and logical plan (TP denotes a triple pattern; t1, t2 denote computation times)]

2.4 Implementation

Strider is written in Scala and contains two principal modules: (i) data flow management and (ii) the computing core. To ensure high throughput, fault tolerance and ease of use, Strider uses Apache Kafka to manage the input data flow. The incoming RDF streams are categorized into different message topics, which in practice represent different types of RDF events. The Strider core is based on the Spark programming framework. Spark Streaming receives and maintains the messages emitted by Kafka in parallel and generates the data processing pipeline. Compared to other distributed stream processing frameworks, Spark offers a certain maturity and a rich ecosystem, which allows a horizontal expansion of the platform's functionality. Besides, thanks to the coarse-grained micro-batch model, Spark Streaming provides high throughput and acceptable latency for tasks involving complex analytics.
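As an illustration of this data flow, the following minimal sketch (broker address, topic names, window sizes and the counting stub are hypothetical; the actual engine registers continuous SPARQL queries instead of this placeholder processing) consumes RDF event topics from Kafka with Spark Streaming's micro-batch model:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    object StreamingSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("strider-like-pipeline")
        // Micro-batch interval: each batch groups the events received during 5 seconds.
        val ssc = new StreamingContext(conf, Seconds(5))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers"  -> "localhost:9092",          // hypothetical broker
          "key.deserializer"   -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id"           -> "rdf-stream-consumers",
          "auto.offset.reset"  -> "latest"
        )
        // One topic per RDF event type, as in the data-flow description above.
        val topics = Seq("pressure-events", "flow-events")

        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          LocationStrategies.PreferConsistent,
          ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
        )

        // Placeholder processing: count encoded RDF events per 20-second window.
        stream.map(_.value())
          .window(Seconds(20), Seconds(20))
          .count()
          .print()

        ssc.start()
        ssc.awaitTermination()
      }
    }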
The Encoding layer uses Apache Jena to parse RDFS or OWL ontologies and an external reasoner, namely HermiT, to classify concept hierarchies. This part of Strider is the only component that does not run in parallel, owing to the absence of an efficient reasoner that is more expressive than RDFS and OWL RL and runs in parallel. This is not a limitation of the system, since the processing is only performed when the set of ontologies is updated, and Tboxes are known to be rather small compared to Aboxes. For instance, the encoding of Wikipedia, which contains over 213,000 triples and 350 predicates, takes less than 2 minutes. The Abox is encoded in parallel using the dictionaries stored as Spark DataFrames.

To enable SPARQL query processing on Spark, Strider parses a query with Jena ARQ and obtains a query algebra tree in the Parsing layer. The system then reconstructs the algebra tree into a new Abstract Syntax Tree (AST) based on the Visitor pattern. Basically, the AST represents the logical plan of the query execution. Once the AST is created, it is pushed into the algebra Optimization layer. As stated in Section 2.3, the system applies a hybrid static and adaptive optimization strategy to the original query algebra. By traversing the AST, we bind the SPARQL operators to the corresponding Spark SQL relational operators for query evaluation.

3. DEMONSTRATION SCENARIOS

The demonstration (details at https://github.com/renxiangnan/reference-vldb-demo-2017/wiki) concentrates on real-world IoT use cases from the Waves project. We highlight two aspects of the system. First, we show some user-oriented features, e.g., Strider's user-friendly graphical interface, which allows fast, easy and intuitive Spark cluster configuration and deployment. Second, we demonstrate Strider's reasoning capability and hybrid optimization strategies for continuous SPARQL query execution through the following scenarios.

Scenario 1 concentrates on the inference component. We begin by emphasizing the efficiency of our KB encoding: Tbox encoding and Abox encoding. Then, we focus on the stream materialization and query reformulation aspects. This is demonstrated on data streams requiring some data materialization and on the query reformulation of registered queries. Both the processing durations and the internal representations of the streams and queries will be accessible.

Scenario 2 mainly focuses on continuous SPARQL query processing over a stable stream structure. In the ideal case, incoming RDF streams are structurally stable: the proportions of the various types of RDF triples do not change over time. By deploying our system on a small Amazon EMR cluster (one driver node, 3 to 4 worker nodes), we show that Strider can achieve a throughput between 400,000 and 600,000 triples per second on real queries.

Scenario 3 highlights the efficiency of our engine's adaptive query optimization. A group of Kafka message producers is configured; by randomly modifying the proportions of the different types of messages, we feed a structurally unstable RDF stream to the engine. To give an intuitive view of the system's adaptivity, we demonstrate the changes of the query execution plan in real time. We also provide a comparison between conventional static optimization and adaptive query optimization. For instance, in Figure 3, the adaptive query execution shows steady performance over a relatively long (one hour) run.
With static optimization only, on the contrary, the engine's performance fluctuates substantially over time, and we have observed high Garbage Collection pressure through log monitoring.

[Figure 3: Throughput comparison between static and adaptive optimization (query Q5)]

4. REFERENCES

[1] D. F. Barbieri et al. C-SPARQL: SPARQL for continuous querying. In 18th WWW, 2009.
[2] O. Curé et al. LiteMat: A scalable, cost-efficient inference encoding scheme for large RDF graphs. In IEEE Big Data, 2015.
[3] A. Deshpande et al. Adaptive query processing. Foundations and Trends in Databases, 2007.
[4] L. Fischer et al. Scalable linked data stream processing via network-aware workload scheduling. In SSWS, 2013.
[5] D. L. Phuoc et al. Elastic and scalable processing of linked stream data in the cloud. In ISWC, 2013.
[6] M. Stocker et al. SPARQL basic graph pattern optimization using selectivity estimation. In 17th WWW, 2008.
[7] P. Tsialiamanis et al. Heuristics-based query optimisation for SPARQL. In 15th EDBT, 2012.