-
ParvaGPU: Efficient Spatial GPU Sharing for Large-Scale DNN Inference in Cloud Environments
Authors:
Munkyu Lee,
Sihoon Seong,
Minki Kang,
Jihyuk Lee,
Gap-Joo Na,
In-Geol Chun,
Dimitrios Nikolopoulos,
Cheol-Ho Hong
Abstract:
In cloud environments, GPU-based deep neural network (DNN) inference servers are required to meet the Service Level Objective (SLO) latency for each workload under a specified request rate, while also minimizing GPU resource consumption. However, previous studies have not fully achieved this objective. In this paper, we propose ParvaGPU, a technology that facilitates spatial GPU sharing for large-scale DNN inference in cloud computing. ParvaGPU integrates NVIDIA's Multi-Instance GPU (MIG) and Multi-Process Service (MPS) technologies to enhance GPU utilization, with the goal of meeting the diverse SLOs of each workload and reducing overall GPU usage. Specifically, ParvaGPU addresses the challenges of minimizing underutilization within allocated GPU space partitions and external fragmentation in combined MIG and MPS environments. We conducted our assessment on multiple A100 GPUs, evaluating 11 diverse DNN workloads with varying SLOs. Our evaluation revealed no SLO violations and a significant reduction in GPU usage compared to state-of-the-art frameworks.
Submitted 22 September, 2024;
originally announced September 2024.
-
HiRED: Attention-Guided Token Dropping for Efficient Inference of High-Resolution Vision-Language Models in Resource-Constrained Environments
Authors:
Kazi Hasan Ibn Arif,
JinYi Yoon,
Dimitrios S. Nikolopoulos,
Hans Vandierendonck,
Deepu John,
Bo Ji
Abstract:
High-resolution Vision-Language Models (VLMs) have been widely used in multimodal tasks to enhance accuracy by preserving detailed image information. However, these models often generate excessive visual tokens due to encoding multiple partitions of the input image. Processing these excessive visual tokens is computationally challenging, especially in resource-constrained environments with commodity GPUs. To support high-resolution images while meeting resource constraints, we propose High-Resolution Early Dropping (HiRED), a token-dropping scheme that operates within a fixed token budget before the Large Language Model (LLM) stage. HiRED can be integrated with existing high-resolution VLMs in a plug-and-play manner, as it requires no additional training while still maintaining superior accuracy. We strategically use the vision encoder's attention in the initial layers to assess the visual content of each image partition and allocate the token budget accordingly. Then, using the attention in the final layer, we select the most important visual tokens from each partition within the allocated budget, dropping the rest. Empirically, when applied to LLaVA-Next-7B on an NVIDIA Tesla P40 GPU, HiRED with a 20% token budget increases token generation throughput by 4.7×, reduces first-token generation latency by 15 seconds, and saves 2.3 GB of GPU memory for a single inference.
Submitted 20 August, 2024;
originally announced August 2024.
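The two-step scheme described in the HiRED abstract (first split a fixed token budget across image partitions, then keep the highest-scoring tokens in each partition) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `allocate_budget` and `select_tokens` are hypothetical, the scores are assumed to be precomputed attention-derived numbers, and proportional allocation with largest-remainder rounding is an assumption, since the abstract only says the budget is allocated "accordingly".

```python
from typing import List


def allocate_budget(partition_scores: List[float], total_budget: int) -> List[int]:
    """Split total_budget across partitions in proportion to their scores.

    Assumption: proportional allocation; the abstract does not give the rule.
    """
    total = sum(partition_scores)
    if total <= 0:
        raise ValueError("partition scores must sum to a positive value")
    # Ideal (fractional) share of the budget for each partition.
    raw = [s * total_budget / total for s in partition_scores]
    budgets = [int(r) for r in raw]  # round each share down
    # Hand the leftover tokens to the partitions with the largest remainders.
    leftover = total_budget - sum(budgets)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - budgets[i], reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets


def select_tokens(token_scores: List[float], budget: int) -> List[int]:
    """Return the indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # Hypothetical scores for three image partitions and a budget of 4 tokens.
    budgets = allocate_budget([5, 4, 1], 4)
    print(budgets)
    # Hypothetical per-token scores for one partition.
    print(select_tokens([0.1, 0.9, 0.3, 0.7], budgets[0]))
```

Keeping the selected indices in their original order preserves the positional order of the surviving tokens, which is a design choice of this sketch rather than something the abstract states.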
-
Towards a real-time distributed feedback system for the transportation assistance of PwD
Authors:
Iosif Polenakis,
Vasileios Vouronikos,
Maria Chroni,
Stavros D. Nikolopoulos
Abstract:
In this work we propose the design principles of an integrated distributed system that augments transportation for people with disabilities within the road network of a city using information technologies. The system is built on a distributed sensor network that is incorporated into a real-time integrated feedback system. The main components of the proposed architecture are the Inaccessible City Point System, the Live Data Analysis and Response System, and the Obstruction Detection and Prevention System. Together, these subsystems provide real-time feedback that assists the transportation of individuals with mobility problems by informing them about blocked ramps along the path to their destination; they also notify the authorities about accessibility incidents at places where the sensors detect an inaccessible point. The proposed design allows further extensions for assisting individuals with mobility problems, providing a basis for further implementation and improvement. We present the fundamental parts of the interconnection of the architecture's components, as well as its potential deployment and application in a city area.
Submitted 10 June, 2024;
originally announced June 2024.
-
Towards Persistent Memory based Stateful Serverless Computing for Big Data Applications
Authors:
Yuze Li,
Kevin Assogba,
Abhijit Tripathy,
Moiz Arif,
M. Mustafa Rafique,
Ali R. Butt,
Dimitrios Nikolopoulos
Abstract:
The Function-as-a-service (FaaS) computing model has recently seen significant growth especially for highly scalable, event-driven applications. The easy-to-deploy and cost-efficient fine-grained billing of FaaS is highly attractive to big data applications. However, the stateless nature of serverless platforms poses major challenges when supporting stateful I/O intensive workloads such as a lack of native support for stateful execution, state sharing, and inter-function communication. In this paper, we explore the feasibility of performing stateful big data analytics on serverless platforms and improving I/O throughput of functions by using modern storage technologies such as Intel Optane DC Persistent Memory (PMEM). To this end, we propose Marvel, an end-to-end architecture built on top of the popular serverless platform, Apache OpenWhisk and Apache Hadoop. Marvel makes two main contributions: (1) enable stateful function execution on OpenWhisk by maintaining state information in an in-memory caching layer; and (2) provide access to PMEM backed HDFS storage for faster I/O performance. Our evaluation shows that Marvel reduces the overall execution time of big data applications by up to 86.6% compared to current MapReduce implementations on AWS Lambda.
Submitted 8 September, 2023; v1 submitted 4 September, 2023;
originally announced September 2023.
-
Adding a Tail in Classes of Perfect Graphs
Authors:
Anna Mpanti,
Stavros D. Nikolopoulos,
Leonidas Palios
Abstract:
Consider a graph $G$ which belongs to a graph class ${\cal C}$. We are interested in connecting a node $w \not\in V(G)$ to $G$ by a single edge $u w$ where $u \in V(G)$; we call such an edge a \emph{tail}. As the graph resulting from $G$ after the addition of the tail, denoted $G+uw$, need not belong to the class ${\cal C}$, we want to compute a minimum ${\cal C}$-completion of $G+w$, i.e., the minimum number of non-edges (excluding the tail $u w$) to be added to $G+uw$ so that the resulting graph belongs to ${\cal C}$. In this paper, we study this problem for the classes of split, quasi-threshold, threshold, and $P_4$-sparse graphs and we present linear-time algorithms by exploiting the structure of split graphs and the tree representation of quasi-threshold, threshold, and $P_4$-sparse graphs.
Submitted 1 February, 2023;
originally announced February 2023.
-
Adding an Edge in a $P_4$-sparse Graph
Authors:
Anna Mpanti,
Stavros D. Nikolopoulos,
Leonidas Palios
Abstract:
The minimum completion (fill-in) problem is defined as follows: Given a graph family $\mathcal{F}$ (more generally, a property $Π$) and a graph $G$, the completion problem asks for the minimum number of non-edges needed to be added to $G$ so that the resulting graph belongs to the graph family $\mathcal{F}$ (or has property $Π$). This problem is NP-complete for many subclasses of perfect graphs and polynomial solutions are available only for minimal completion sets. We study the minimum completion problem of a $P_4$-sparse graph $G$ with an added edge. For any optimal solution of the problem, we prove that there is an optimal solution whose form is of one of a small number of possibilities. This along with the solution of the problem when the added edge connects two non-adjacent vertices of a spider or connects two vertices in different connected components of the graph enables us to present a polynomial-time algorithm for the problem.
Submitted 31 January, 2023;
originally announced February 2023.
-
A Stochastic Graph-based Model for the Simulation of SARS-CoV-2 Transmission
Authors:
Christos Chondros,
Stavros D. Nikolopoulos,
Iosif Polenakis
Abstract:
In this work we propose the design principles of a stochastic graph-based model for the simulation of SARS-CoV-2 transmission. The proposed approach incorporates three sub-models, namely, the spatial model, the mobility model, and the propagation model, in order to develop a realistic environment for the study of the properties exhibited by the spread of SARS-CoV-2. The spatial model converts images of real cities taken from Google Maps into undirected weighted graphs that capture the spatial arrangement of the streets utilized next for the mobility of individuals. The mobility model implements a stochastic agent-based approach, developed in order to assign specific routes to individuals moving in the city, through the use of stochastic processes, utilizing the weights of the underlying graph to deploy shortest path algorithms. The propagation model implements both the epidemiological model and the physical substance of the transmission of an airborne virus considering the transmission parameters of SARS-CoV-2. Finally, we integrate these sub-models in order to derive an integrated framework for the study of the epidemic dynamics exhibited through the transmission of SARS-CoV-2.
Submitted 10 November, 2021;
originally announced November 2021.
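The mobility model in this abstract assigns routes by running shortest-path algorithms over the weighted street graph. As a rough illustration of that one step, the sketch below applies Dijkstra's algorithm to a small undirected weighted graph. The graph, the node names, and the function name `shortest_route` are hypothetical, and the abstract does not say which shortest-path algorithm the authors use.

```python
import heapq
from typing import Dict, List, Tuple

# Adjacency map: node -> {neighbour: edge weight}. Undirected graphs list each edge twice.
Graph = Dict[str, Dict[str, float]]


def shortest_route(graph: Graph, source: str, target: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm: return (total weight, node sequence) from source to target."""
    dist = {source: 0.0}
    prev: Dict[str, str] = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == target:
            break
        for v, w in graph.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        raise ValueError("no route between source and target")
    # Walk the predecessor links back from the target to rebuild the route.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path


if __name__ == "__main__":
    # Hypothetical street graph: intersections A-D, weights are street lengths.
    streets: Graph = {
        "A": {"B": 1.0, "C": 5.0},
        "B": {"A": 1.0, "C": 2.0},
        "C": {"A": 5.0, "B": 2.0, "D": 1.0},
        "D": {"C": 1.0},
    }
    print(shortest_route(streets, "A", "D"))
```

In a stochastic agent-based setting, the source and target of each individual would be drawn at random before calling such a routine; that sampling step is left out here.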
-
SARiSsa -- A Mobile Application for the Proactive Control of SARS-CoV-2 Spread
Authors:
Christos Chondros,
Christos Georgiou-Mousses,
Stavros D. Nikolopoulos,
Iosif Polenakis,
Vasileios Vouronikos
Abstract:
In this work we propose the design principles behind the development of a smart application for mobile devices that aims to control the spread of SARS-CoV-2, the coronavirus that caused the COVID-19 pandemic. By deploying this application on their Bluetooth-enabled devices, individuals may keep track of their close contacts; if nearby contacts using the same application are later reported as infected, the proximate individual is informed so as to be quarantined for a short period of time, thus preventing the spread of the virus. Over the past year, several applications have been published in the Google Play Store that use the Bluetooth connectivity of smart devices for nearby-device tracking. In this work, however, we propose an open architecture for the development of such applications that also incorporates a more elaborate graph-theoretic and algorithmic background for contact tracing. The proposed contact tracing algorithm, which can be embedded in the application, provides more immediate tracking of the contacts of an infected individual and a wider extent in the tracing of contacts, leading to a more immediate mitigation of the epidemic.
Submitted 29 June, 2021; v1 submitted 28 June, 2021;
originally announced June 2021.
-
Workload-Aware DRAM Error Prediction using Machine Learning
Authors:
Lev Mukhanov,
Konstantinos Tovletoglou,
Hans Vandierendonck,
Dimitrios S. Nikolopoulos,
Georgios Karakonstantis
Abstract:
The aggressive scaling of technology may have helped to meet the growing demand for higher memory capacity and density, but has also made DRAM cells more prone to errors. Such a reality triggered a lot of interest in modeling DRAM behavior for either predicting the errors in advance or for adjusting DRAM circuit parameters to achieve a better trade-off between energy efficiency and reliability. Existing modeling efforts may have studied the impact of a few operating parameters and temperature on DRAM reliability using custom FPGA setups; however, they neglected the combined effect of workload-specific features that can be systematically investigated only on a real system. In this paper, we present the results of our study on workload-dependent DRAM error behavior within a real server considering various operating parameters, such as the refresh rate, voltage and temperature. We show that the rate of single- and multi-bit errors may vary across workloads by 8x, indicating that program-inherent features can affect DRAM reliability significantly. Based on this observation, we extract 249 features, such as the memory access rate, the rate of cache misses, the memory reuse time and data entropy, from various compute-intensive, caching and analytics benchmarks. We apply several supervised learning methods to construct the DRAM error behavior model for 72 server-grade DRAM chips using the memory operating parameters and extracted program-inherent features. Our results show that, with an appropriate choice of program features and supervised learning method, the rate of single- and multi-bit errors can be predicted for a specific DRAM module with an average error of less than 10.5%, as opposed to the 2.9x estimation error obtained for a conventional workload-unaware error model.
Submitted 17 March, 2020;
originally announced March 2020.
-
Cross Architectural Power Modelling
Authors:
Kai Chen,
Peter Kilpatrick,
Dimitrios S. Nikolopoulos,
Blesson Varghese
Abstract:
Existing power modelling research focuses on the model rather than the process for developing models. An automated power modelling process that can be deployed on different processors for developing power models with high accuracy is developed. For this, (i) an automated hardware performance counter selection method that selects counters best correlated to power on both ARM and Intel processors, (ii) a noise filter based on clustering that can reduce the mean error in power models, and (iii) a two stage power model that surmounts challenges in using existing power models across multiple architectures are proposed and developed. The key results are: (i) the automated hardware performance counter selection method achieves comparable selection to the manual method reported in the literature, (ii) the noise filter reduces the mean error in power models by up to 55%, and (iii) the two stage power model can predict dynamic power with less than 8% error on both ARM and Intel processors, which is an improvement over classic models.
Submitted 17 March, 2020;
originally announced March 2020.
-
Implementing Efficient Message Logging Protocols as MPI Application Extensions
Authors:
Kiril Dichev,
Dimitrios S. Nikolopoulos
Abstract:
Message logging protocols are enablers of local rollback, a more efficient alternative to global rollback, for fault tolerant MPI applications. Until now, message logging MPI implementations have incurred the overheads of a redesign and redeployment of an MPI library, as well as continued performance penalties across various kernels. Successful research efforts for message logging implementations do exist, but not a single one of them can be easily deployed today by more than a few experts. In contrast, in this work we build efficient message logging capabilities on top of an MPI library with no message logging capabilities; we do so for two different HPC kernels, one with a global exchange pattern (CG), and one with a neighbourhood exchange pattern (LULESH). While our library of choice ULFM detects failure and recovers MPI communicators, we build on that to then restore the intra- and inter-process data consistency of both applications. This task turns out to be challenging, and we present the methodology for doing so in this work. In the end, we achieve message logging capabilities for each kernel, without the need for an actual message logging runtime underneath. On the performance side, we match state-of-the-art solutions and (a) eliminate event logging and the event logger component altogether, and (b) design a hybrid protocol, which gracefully shifts between global and local rollback, depending on the available payload logging memory. Such a hybrid protocol between local and global rollback has not been previously proposed to our knowledge. Our extensions span a few hundred lines of code for each kernel, are open-sourced, and enable local and global rollback after process failure.
Submitted 8 May, 2019;
originally announced May 2019.
-
Characterizing Watermark Numbers encoded as Reducible Permutation Graphs against Malicious Attacks
Authors:
Anna Mpanti,
Stavros D. Nikolopoulos,
Leonidas Palios
Abstract:
In the domain of software watermarking, we have proposed several graph theoretic watermarking codec systems for encoding watermark numbers $w$ as reducible permutation flow-graphs $F[π^*]$ through the use of self-inverting permutations $π^*$. Following up on our proposed methods, we theoretically study the oldest one, which we call W-RPG, in order to investigate and prove its resilience to edge-modification attacks on the flow-graphs $F[π^*]$. In particular, we characterize the integer $w\equivπ^*$ as strong or weak watermark through the structure of self-inverting permutations $π^*$ which encodes it. To this end, for any integer watermark $w \in R_n=[2^{n-1}, 2^n-1]$, where $n$ is the length of the binary representation $b(w)$ of $w$, we compute the minimum number of 01-modifications needed to be applied on $b(w)$ so that the resulting $b(w')$ represents the valid watermark number $w'$; note that a number $w'$ is called valid (or, true-incorrect watermark number) if $w'$ can be produced by the W-RPG codec system and, thus, it incorporates all the structural properties of $π^* \equiv w$.
Submitted 28 December, 2018;
originally announced December 2018.
-
Malicious Software Detection and Classification utilizing Temporal-Graphs of System-call Group Relations
Authors:
Anna Mpanti,
Stavros D. Nikolopoulos,
Iosif Polenakis
Abstract:
In this work we propose a graph-based model that, utilizing relations between groups of System-calls, distinguishes malicious from benign software samples and classifies the detected malicious samples to one of a set of known malware families. More precisely, given a System-call Dependency Graph (ScDG) that depicts the malware's behavior, we first transform it to a more abstract representation, utilizing the indexing of System-calls to a set of groups of similar functionality, thus constructing an abstract and mutation-tolerant graph that we call Group Relation Graph (GrG); then, we construct another graph representation, which we call Coverage Graph (CvG), that depicts the dominating relations between the nodes of a GrG graph. Based on the research so far in the field, we point out that behavior-based graph representations have not leveraged the temporal evolution of the graph. Hence, the novelty of our work is that, while preserving the initial representations of GrG and CvG graphs, we augment the potential of these graphs by adding further features that enhance their ability to detect an unknown malware sample and classify it to a known malware family. To that end, we construct periodical instances of the graph that represent its temporal evolution concerning its structural modifications, creating another graph representation that we call Temporal Graphs. In this paper, we present the theoretical background behind our approach, discuss the current technological status of malware detection and classification, and demonstrate the overall architecture of our proposed detection and classification model along with its underlying main principles and its key structural components.
Submitted 27 December, 2018;
originally announced December 2018.
-
RADS: Real-time Anomaly Detection System for Cloud Data Centres
Authors:
Sakil Barbhuiya,
Zafeirios Papazachos,
Peter Kilpatrick,
Dimitrios S. Nikolopoulos
Abstract:
Cybersecurity attacks in Cloud data centres are increasing alongside the growth of the Cloud services market. Existing research proposes a number of anomaly detection systems for detecting such attacks. However, these systems encounter a number of challenges, specifically due to the unknown behaviour of the attacks and the occurrence of genuine Cloud workload spikes, which must be distinguished from attacks. In this paper, we discuss these challenges and investigate the issues with the existing Cloud anomaly detection approaches. Then, we propose a Real-time Anomaly Detection System (RADS) for Cloud data centres, which uses a one class classification algorithm and a window-based time series analysis to address the challenges. Specifically, RADS can detect VM-level anomalies occurring due to DDoS and cryptomining attacks. We evaluate the performance of RADS by running lab-based experiments and by using real-world Cloud workload traces. Evaluation results demonstrate that RADS can achieve 90-95% accuracy with a low false positive rate of 0-3%. The results further reveal that RADS experiences fewer false positives when using its window-based time series analysis in comparison to using state-of-the-art average or entropy based analysis.
Submitted 11 November, 2018;
originally announced November 2018.
-
DYVERSE: DYnamic VERtical Scaling in Multi-tenant Edge Environments
Authors:
Nan Wang,
Michail Matthaiou,
Dimitrios S. Nikolopoulos,
Blesson Varghese
Abstract:
Multi-tenancy in resource-constrained environments is a key challenge in Edge computing. In this paper, we develop 'DYVERSE: DYnamic VERtical Scaling in Edge' environments, which is the first light-weight and dynamic vertical scaling mechanism for managing resources allocated to applications for facilitating multi-tenancy in Edge environments. To enable dynamic vertical scaling, one static and three dynamic priority management approaches, which are workload-aware, community-aware and system-aware, respectively, are proposed. This research advocates that dynamic vertical scaling and priority management approaches reduce Service Level Objective (SLO) violation rates. An online-game and a face detection workload in a Cloud-Edge test-bed are used to validate the research. A merit of DYVERSE is that there is only a sub-second overhead per Edge server when 32 Edge servers are deployed on a single Edge node. When compared to executing applications on the Edge servers without dynamic vertical scaling, static priorities and dynamic priorities reduce SLO violation rates of requests by up to 4% and 12% for the online game, respectively, and in both cases 6% for the face detection workload. Moreover, for both workloads, the system-aware dynamic vertical scaling method effectively reduces the latency of non-violated requests, when compared to other methods.
Submitted 21 February, 2020; v1 submitted 19 September, 2018;
originally announced October 2018.
-
VEBO: A Vertex- and Edge-Balanced Ordering Heuristic to Load Balance Parallel Graph Processing
Authors:
Jiawen Sun,
Hans Vandierendonck,
Dimitrios S. Nikolopoulos
Abstract:
Graph partitioning drives graph processing in distributed, disk-based and NUMA-aware systems. A commonly used partitioning goal is to balance the number of edges per partition in conjunction with minimizing the edge or vertex cut. While this type of partitioning is computationally expensive, we observe that such topology-driven partitioning nonetheless results in computational load imbalance. We propose Vertex- and Edge-Balanced Ordering (VEBO): balance the number of edges and the number of unique destinations of those edges. VEBO optimally balances edges and vertices for graphs with a power-law degree distribution. Experimental evaluation on three shared-memory graph processing systems (Ligra, Polymer and GraphGrind) shows that VEBO achieves excellent load balance and improves performance by 1.09x over Ligra, 1.41x over Polymer and 1.65x over GraphGrind, compared to their respective partitioning algorithms, averaged across 8 algorithms and 7 graphs.
Submitted 18 June, 2018;
originally announced June 2018.
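The idea of balancing both edge counts and vertex counts per partition can be made concrete with a simple greedy heuristic. The sketch below is an assumption-laden illustration, not the VEBO algorithm as published: it processes vertices in decreasing in-degree order and always places the next vertex in the partition that currently holds the fewest edges, breaking ties by the fewest vertices. The function name `balanced_assignment` and the sample in-degrees are hypothetical.

```python
import heapq
from typing import Dict, List


def balanced_assignment(in_degree: Dict[int, int], num_parts: int) -> List[List[int]]:
    """Greedily assign vertices to partitions, balancing edges first and vertices second."""
    # Heap entries are (edge count, vertex count, partition id); the smallest pops first.
    heap = [(0, 0, p) for p in range(num_parts)]
    heapq.heapify(heap)
    parts: List[List[int]] = [[] for _ in range(num_parts)]
    # High-degree vertices first, so the many low-degree ones can even out the load.
    for v in sorted(in_degree, key=lambda v: (-in_degree[v], v)):
        edges, verts, p = heapq.heappop(heap)
        parts[p].append(v)
        heapq.heappush(heap, (edges + in_degree[v], verts + 1, p))
    return parts


if __name__ == "__main__":
    # Hypothetical skewed in-degree distribution over eight vertices.
    degrees = {0: 5, 1: 4, 2: 3, 3: 2, 4: 1, 5: 1, 6: 0, 7: 0}
    for part in balanced_assignment(degrees, 2):
        print(part, sum(degrees[v] for v in part))
```

Zero-degree vertices add no edges, so in this sketch they end up being distributed purely by vertex count, which is what evens out the number of vertices per partition.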
-
Energy-efficient localised rollback after failures via data flow analysis
Authors:
Kiril Dichev,
Kirk Cameron,
Dimitrios Nikolopoulos
Abstract:
Exascale systems will suffer failures hourly. HPC programmers rely mostly on application-level checkpointing and a global rollback to recover. In recent years, techniques that reduce the number of processes rolling back have been implemented via message logging. However, the log-based approaches have weaknesses, such as being dependent on complex modifications within an MPI implementation, and the fact that a full restart may be required in the general case. To address the limitations of all log-based mechanisms, we return to checkpoint-only mechanisms, but advocate data-flow-driven recovery (DFR), a fundamentally different approach relying on analysis of the data flow of iterative codes, and the well-known concept of data-flow graphs. We demonstrate the effectiveness of DFR for an MPI stencil code to optimise rollback and reduce the overall energy consumption by 10-12% on idling nodes during localised rollback. We also provide large-scale estimates for the energy savings of DFR compared to global rollback, which for stencil codes increase as $n^2$ for a process count $n$.
Submitted 5 June, 2018;
originally announced June 2018.
-
TAPAS: Train-less Accuracy Predictor for Architecture Search
Authors:
R. Istrate,
F. Scheidegger,
G. Mariani,
D. Nikolopoulos,
C. Bekas,
A. C. I. Malossi
Abstract:
In recent years an increasing number of researchers and practitioners have been suggesting algorithms for large-scale neural network architecture search: genetic algorithms, reinforcement learning, learning curve extrapolation, and accuracy predictors. None of them, however, demonstrated high-performance without training new experiments in the presence of unseen datasets. We propose a new deep neural network accuracy predictor, that estimates in fractions of a second classification performance for unseen input datasets, without training. In contrast to previously proposed approaches, our prediction is not only calibrated on the topological network information, but also on the characterization of the dataset-difficulty which allows us to re-tune the prediction without any training. Our predictor achieves a performance which exceeds 100 networks per second on a single GPU, thus creating the opportunity to perform large-scale architecture search within a few minutes. We present results of two searches performed in 400 seconds on a single GPU. Our best discovered networks reach 93.67% accuracy for CIFAR-10 and 81.01% for CIFAR-100, verified by training. These networks are performance competitive with other automatically discovered state-of-the-art networks however we only needed a small fraction of the time to solution and computational resources.
Submitted 1 June, 2018;
originally announced June 2018.
-
Incremental Training of Deep Convolutional Neural Networks
Authors:
Roxana Istrate,
Adelmo Cristiano Innocenza Malossi,
Costas Bekas,
Dimitrios Nikolopoulos
Abstract:
We propose an incremental training method that partitions the original network into sub-networks, which are then gradually incorporated in the running network during the training process. To allow for a smooth dynamic growth of the network, we introduce a look-ahead initialization that outperforms the random initialization. We demonstrate that our incremental approach reaches the reference network baseline accuracy. Additionally, it allows to identify smaller partitions of the original state-of-the-art network, that deliver the same final accuracy, by using only a fraction of the global number of parameters. This allows for a potential speedup of the training time of several factors. We report training results on CIFAR-10 for ResNet and VGGNet.
Submitted 27 March, 2018;
originally announced March 2018.
-
Encoding Watermark Numbers as Reducible Permutation Graphs using Self-inverting Permutations
Authors:
Maria Chroni,
Stavros D. Nikolopoulos,
Leonidas Palios
Abstract:
Several graph theoretic watermark methods have been proposed to encode numbers as graph structures in software watermarking environments. In this paper, we propose an efficient and easily implementable codec system for encoding watermark numbers as reducible permutation flow-graphs and, thus, we extend the class of graphs used in such a watermarking environment. More precisely, we present an algorithm for encoding a watermark number $w$ as a self-inverting permutation $π^*$, an algorithm for encoding the self-inverting permutation $π^*$ into a reducible permutation graph $F[π^*]$ whose structure resembles the structure of real program graphs, as well as decoding algorithms which extract the permutation $π^*$ from the reducible permutation graph $F[π^*]$ and the number $w$ from $π^*$. Both the encoding and the decoding processes take time and space linear in the length of the binary representation of $w$. The two main components of our proposed codec system, i.e., the self-inverting permutation $π^*$ and the reducible permutation graph $F[π^*]$, incorporate the binary representation of the watermark $w$ in their structure and possess important structural properties, which make our system resilient to attacks; to this end, we experimentally evaluated our system under edge modification attacks on the graph $F[π^*]$ and the results show that we can detect such attacks with high probability.
Submitted 21 December, 2017;
originally announced December 2017.
-
Intra-node Memory Safe GPU Co-Scheduling
Authors:
Carlos Reano,
Federico Silla,
Dimitrios S. Nikolopoulos,
Blesson Varghese
Abstract:
GPUs in High-Performance Computing systems remain under-utilised due to the unavailability of schedulers that can safely schedule multiple applications to share the same GPU. The research reported in this paper is motivated to improve the utilisation of GPUs by proposing a framework, we refer to as schedGPU, to facilitate intra-node GPU co-scheduling such that a GPU can be safely shared among multiple applications by taking memory constraints into account. Two approaches, namely a client-server and a shared memory approach are explored. However, the shared memory approach is more suitable due to lower overheads when compared to the former approach.
Four policies are proposed in schedGPU to handle applications that are waiting to access the GPU, two of which account for priorities. The feasibility of schedGPU is validated on three real-world applications. The key observation is that a performance gain is achieved.
For single applications, a gain of over 10 times, as measured by GPU utilisation and GPU memory utilisation, is obtained. For workloads comprising multiple applications, a speed-up of up to 5x in the total execution time is noted. Moreover, the average GPU utilisation and average GPU memory utilisation is increased by 5 and 12 times, respectively.
Submitted 12 December, 2017;
originally announced December 2017.
-
Power Modelling for Heterogeneous Cloud-Edge Data Centers
Authors:
Kai Chen,
Blesson Varghese,
Peter Kilpatrick,
Dimitrios S. Nikolopoulos
Abstract:
Existing power modelling research focuses not on the method used for developing models but rather on the model itself. This paper aims to develop a method for deploying power models on emerging processors that will be used, for example, in cloud-edge data centers. Our research first develops a hardware counter selection method that appropriately selects counters most correlated to power on ARM and Intel processors. Then, we propose a two stage power model that works across multiple architectures. The key results are: (i) the automated hardware performance counter selection method achieves comparable selection to the manual selection methods reported in literature, and (ii) the two stage power model can predict dynamic power more accurately on both ARM and Intel processors when compared to classic power models.
Submitted 27 October, 2017;
originally announced October 2017.
-
Edge-as-a-Service: Towards Distributed Cloud Architectures
Authors:
Blesson Varghese,
Nan Wang,
Jianyu Li,
Dimitrios S. Nikolopoulos
Abstract:
We present an Edge-as-a-Service (EaaS) platform for realising distributed cloud architectures and integrating the edge of the network in the computing ecosystem. The EaaS platform is underpinned by (i) a lightweight discovery protocol that identifies edge nodes and makes them publicly accessible in a computing environment, and (ii) a scalable resource provisioning mechanism for offloading workloads from the cloud on to the edge for servicing multiple user requests. We validate the feasibility of EaaS on an online game use-case to highlight the improvement in the QoS of the application hosted on our cloud-edge platform. On this platform we demonstrate (i) low overheads of less than 6%, (ii) reduced data traffic to the cloud by up to 95% and (iii) minimised application latency between 40%-60%.
Submitted 27 October, 2017;
originally announced October 2017.
-
ENORM: A Framework For Edge NOde Resource Management
Authors:
Nan Wang,
Blesson Varghese,
Michail Matthaiou,
Dimitrios S. Nikolopoulos
Abstract:
Current computing techniques using the cloud as a centralised server will become untenable as billions of devices get connected to the Internet. This raises the need for fog computing, which leverages computing at the edge of the network on nodes, such as routers, base stations and switches, along with the cloud. However, to realise fog computing the challenge of managing edge nodes will need to be addressed. This paper is motivated to address the resource management challenge. We develop the first framework to manage edge nodes, namely the Edge NOde Resource Management (ENORM) framework. Mechanisms for provisioning and auto-scaling edge node resources are proposed. The feasibility of the framework is demonstrated on a PokeMon Go-like online game use-case. The benefits of using ENORM are observed by reduced application latency between 20% - 80% and reduced data transfer and communication frequency between the edge node and the cloud by up to 95%. These results highlight the potential of fog computing for improving the quality of service and experience.
Submitted 12 September, 2017;
originally announced September 2017.
-
Dependency-Aware Rollback and Checkpoint-Restart for Distributed Task-Based Runtimes
Authors:
Kiril Dichev,
Herbert Jordan,
Konstantinos Tovletoglou,
Thomas Heller,
Dimitrios S. Nikolopoulos,
Georgios Karakonstantis,
Charles Gillan
Abstract:
With the increase in compute nodes in large compute platforms, a proportional increase in node failures will follow. Many application-based checkpoint/restart (C/R) techniques have been proposed for MPI applications to target the reduced mean time between failures. However, rollback as part of the recovery remains a dominant cost even in highly optimised MPI applications employing C/R techniques. Continuing execution past a checkpoint (that is, reducing rollback) is possible in message-passing runtimes, but extremely complex to design and implement. Our work focuses on task-based runtimes, where task dependencies are explicit and message passing is implicit. We see an opportunity for reducing rollback for such runtimes: we explore task dependencies in the rollback, which we call dependency-aware rollback. We also design a new C/R technique, which is influenced by recursive decomposition of tasks, and combine it with dependency-aware rollback. We expect the dependency-aware rollback to cancel and recompute fewer tasks in the presence of node failures. We describe, implement and validate the proposed protocol in a simulator, which confirms these expectations. In addition, we consistently observe faster overall execution time for dependency-aware rollback in the presence of faults, despite the fact that reduced task cancellation does not guarantee reduced overall execution time.
Submitted 29 May, 2017;
originally announced May 2017.
-
Feasibility of Fog Computing
Authors:
Blesson Varghese,
Nan Wang,
Dimitrios S. Nikolopoulos,
Rajkumar Buyya
Abstract:
As billions of devices get connected to the Internet, it will not be sustainable to use the cloud as a centralised server. The way forward is to decentralise computations away from the cloud towards the edge of the network closer to the user. This reduces the latency of communication between a user device and the cloud, and is the premise of 'fog computing' defined in this paper. The aim of this paper is to highlight the feasibility and the benefits in improving the Quality-of-Service and Experience by using fog computing. For an online game use-case, we found that the average response time for a user is improved by 20% when using the edge of the network in comparison to using a cloud-only model. It was also observed that the volume of traffic between the edge and the cloud server is reduced by over 90% for the use-case. The preliminary results highlight the potential of fog computing in achieving a sustainable computing model and the benefits of integrating the edge of the network into the computing ecosystem.
Submitted 19 January, 2017;
originally announced January 2017.
-
Challenges and Opportunities in Edge Computing
Authors:
Blesson Varghese,
Nan Wang,
Sakil Barbhuiya,
Peter Kilpatrick,
Dimitrios S. Nikolopoulos
Abstract:
Many cloud-based applications employ a data centre as a central server to process data that is generated by edge devices, such as smartphones, tablets and wearables. This model places ever increasing demands on communication and computational infrastructure with inevitable adverse effect on Quality-of-Service and Experience. The concept of Edge Computing is predicated on moving some of this computational load towards the edge of the network to harness computational capabilities that are currently untapped in edge nodes, such as base stations, routers and switches. This position paper considers the challenges and opportunities that arise out of this new direction in the computing landscape.
Submitted 7 September, 2016;
originally announced September 2016.
-
Two RPG Flow-graphs for Software Watermarking using Bitonic Sequences of Self-inverting Permutations
Authors:
Anna Mpanti,
Stavros D. Nikolopoulos
Abstract:
Software watermarking has received considerable attention and was adopted by the software development community as a technique to prevent or discourage software piracy and copyright infringement. A wide range of software watermarking techniques has been proposed, among them graph-based methods that encode watermarks as graph structures. Following up on our recently proposed methods for encoding watermark numbers $w$ as reducible permutation flow-graphs $F[π^*]$ through the use of self-inverting permutations $π^*$, in this paper, we extend the types of flow-graphs available for software watermarking by proposing two different reducible permutation flow-graphs $F_1[π^*]$ and $F_2[π^*]$ incorporating important properties which are derived from the bitonic subsequences composing the self-inverting permutation $π^*$. We show that a self-inverting permutation $π^*$ can be efficiently encoded into either $F_1[π^*]$ or $F_2[π^*]$ and also efficiently decoded from these graph structures. The proposed flow-graphs $F_1[π^*]$ and $F_2[π^*]$ enrich the repository of graphs which can encode the same watermark number $w$ and, thus, enable us to embed multiple copies of the same watermark $w$ into an application program $P$. Moreover, the enrichment of that repository with new flow-graphs increases our ability to select a graph structure more similar to the structure of a given application program $P$ thereby enhancing the resilience of our codec system to attacks.
Submitted 15 July, 2016; v1 submitted 8 July, 2016;
originally announced July 2016.
-
Preventing Malware Pandemics in Mobile Devices by Establishing Response-time Bounds
Authors:
Stavros D. Nikolopoulos,
Iosif Polenakis
Abstract:
We study the propagation of malicious software in a network of mobile devices, which are moving in a specific city area, and establish time bounds for the activation of a counter-measure, i.e., an antivirus or a cleaner, in order to prevent a pandemic. More precisely, given an initial infected population (mobile devices), we establish upper bounds on the time needed for a counter-measure to take effect after infection (response-time), in order to prevent the remaining susceptible devices from getting infected. Thus, within a period of time, we guarantee that not all the susceptible devices in the city get infected and the infected ones get sanitized. In our work, we first propose a malware propagation model along with a device mobility model and then, utilizing these models, we develop a simulator that we use to study the spread of malware in such networks. Finally, we provide experimental results for the pandemic prevention taken by our simulator for various response-time intervals.
Submitted 4 July, 2016;
originally announced July 2016.
-
BDDT-SCC: A Task-parallel Runtime for Non Cache-Coherent Multicores
Authors:
Alexandros Labrineas,
Polyvios Pratikakis,
Dimitrios S. Nikolopoulos,
Angelos Bilas
Abstract:
This paper presents BDDT-SCC, a task-parallel runtime system for non cache-coherent multicore processors, implemented for the Intel Single-Chip Cloud Computer. The BDDT-SCC runtime includes a dynamic dependence analysis and automatic synchronization, and executes OpenMP-Ss tasks on a non cache-coherent architecture. We design a runtime that uses fast on-chip inter-core communication with small messages. At the same time, we use non coherent shared memory to avoid large core-to-core data transfers that would incur a high volume of unnecessary copying. We evaluate BDDT-SCC on a set of representative benchmarks, in terms of task granularity, locality, and communication. We find that memory locality and allocation play a very important role in performance, as the architecture of the SCC memory controllers can create strong contention effects. We suggest patterns that improve memory locality and thus the performance of applications, and measure their impact.
Submitted 14 June, 2016;
originally announced June 2016.
-
Myrmics: Scalable, Dependency-aware Task Scheduling on Heterogeneous Manycores
Authors:
Spyros Lyberis,
Polyvios Pratikakis,
Iakovos Mavroidis,
Dimitrios S. Nikolopoulos
Abstract:
Task-based programming models have become very popular, as they offer an attractive solution to parallelize serial application code with task and data annotations. They usually depend on a runtime system that schedules the tasks to multiple cores in parallel while resolving any data hazards. However, existing runtime system implementations are not ready to scale well on emerging manycore processors, as they often rely on centralized structures and/or locks on shared structures in a cache-coherent memory. We propose design choices, policies and mechanisms to enhance runtime system scalability for single-chip processors with hundreds of cores. Based on these concepts, we create and evaluate Myrmics, a runtime system for a dependency-aware, task-based programming model on a heterogeneous hardware prototype platform that emulates a single-chip processor of 8 latency-optimized and 512 throughput-optimized CPUs. We find that Myrmics scales successfully to hundreds of cores. Compared to MPI versions of the same benchmarks with hand-tuned message passing, Myrmics achieves similar scalability with a 10-30% performance overhead, but with less programming effort. We analyze the scalability of the runtime system in detail and identify the key factors that contribute to it.
Submitted 14 June, 2016;
originally announced June 2016.
-
TwinCG: Dual Thread Redundancy with Forward Recovery for Conjugate Gradient Methods
Authors:
Kiril Dichev,
Dimitrios S. Nikolopoulos
Abstract:
Even though iterative solvers like the Conjugate Gradients method (CG) have been studied for over fifty years, fault tolerance for such solvers has seen much attention in recent years. For iterative solvers, two major reliable strategies of recovery exist: checkpoint-restart for backward recovery, or some type of redundancy technique for forward recovery. Important redundancy techniques like ABFT techniques for sparse matrix-vector products (SpMxV) have recently been proposed, which increase the resilience of CG methods. These techniques offer limited recovery options, and introduce a tolerable overhead. In this work, we study a more powerful resilience concept, which is redundant multithreading. It offers more generic and stronger recovery guarantees, including any soft faults in CG iterations (among others covering ABFT SpMxV), but also requires more resources. We carefully study this redundancy/efficiency conflict. We propose a fault tolerant CG method, called TwinCG, which introduces minimal wallclock time overhead, and significant advantages in detection and correction strategies. Our method uses Dual Modular Redundancy instead of the more expensive Triple Modular Redundancy; still, it retains the TMR advantages of fault correction. We describe, implement, and benchmark our iterative solver, and compare it in terms of efficiency and fault tolerance capabilities to state-of-the-art techniques. We find that before parallelization, TwinCG introduces around 5-6% runtime overhead compared to standard CG, and after parallelization efficiently uses BLAS. In the presence of faults, it reliably performs forward recovery for a range of problems, outperforming SpMxV ABFT solutions.
Submitted 15 May, 2016;
originally announced May 2016.
-
Energy Optimization of Memory Intensive Parallel workloads
Authors:
Chhaya Trehan,
Hans Vandierendonck,
Georgios Karakonstantis,
Dimitrios S. Nikolopoulos
Abstract:
Energy consumption is an important concern in modern multicore processors. The energy consumed during the execution of an application can be minimized by tuning the hardware state utilizing knobs such as frequency, voltage etc. The existing theoretical work on energy minimization using Global DVFS (Dynamic Voltage and Frequency Scaling), despite being thorough, ignores the energy consumed by the CPU on memory accesses and the dynamic energy consumed by the idle cores. This article presents an analytical model for the performance and the overall energy consumed by the CPU chip on CPU instructions as well as the memory accesses without ignoring the dynamic energy consumed by the idle cores. We present an analytical framework around our energy-performance model to predict the operating frequencies for global DVFS that minimize the overall CPU energy consumption within a performance budget. Finally, we suggest a scheduling criterion for energy aware scheduling of memory intensive parallel applications.
Submitted 13 May, 2016;
originally announced May 2016.
-
ALEA: Fine-grain Energy Profiling with Basic Block Sampling
Authors:
Lev Mukhanov,
Dimitrios S. Nikolopoulos,
Bronis R. de Supinski
Abstract:
Energy efficiency is an essential requirement for all contemporary computing systems. We thus need tools to measure the energy consumption of computing systems and to understand how workloads affect it. Significant recent research effort has targeted direct power measurements on production computing systems using on-board sensors or external instruments. These direct methods have in turn guided studies of software techniques to reduce energy consumption via workload allocation and scaling. Unfortunately, direct energy measurements are hampered by the low power sampling frequency of power sensors. The coarse granularity of power sensing limits our understanding of how power is allocated in systems and our ability to optimize energy efficiency via workload allocation.
We present ALEA, a tool to measure power and energy consumption at the granularity of basic blocks, using a probabilistic approach. ALEA provides fine-grained energy profiling via statistical sampling, which overcomes the limitations of power sensing instruments. Compared to state-of-the-art energy measurement tools, ALEA provides finer granularity without sacrificing accuracy. ALEA achieves low overhead energy measurements with mean error rates between 1.4% and 3.5% in 14 sequential and parallel benchmarks tested on both Intel and ARM platforms. The sampling method caps execution time overhead at approximately 1%. ALEA is thus suitable for online energy monitoring and optimization. Finally, ALEA is a user-space tool with a portable, machine-independent sampling method. We demonstrate two use cases of ALEA, where we reduce the energy consumption of a k-means computational kernel by 37% and an ocean modelling code by 33%, compared to high-performance execution baselines, by varying the power optimization strategy between basic blocks.
Submitted 14 November, 2016; v1 submitted 3 April, 2015;
originally announced April 2015.
-
Evaluating Asymmetric Multicore Systems-on-Chip using Iso-Metrics
Authors:
Charalampos Chalios,
Dimitrios S. Nikolopoulos,
Enrique S. Quintana-Orti
Abstract:
The end of Dennard scaling has pushed power consumption into a first order concern for current systems, on par with performance. As a result, near-threshold voltage computing (NTVC) has been proposed as a potential means to tackle the limited cooling capacity of CMOS technology. Hardware operating in NTV consumes significantly less power, at the cost of lower frequency, and thus reduced performance, as well as increased error rates. In this paper, we investigate if a low-power system-on-chip, consisting of ARM's asymmetric big.LITTLE technology, can be an alternative to conventional high performance multicore processors in terms of power/energy in an unreliable scenario. For our study, we use the Conjugate Gradient solver, an algorithm representative of the computations performed by a large range of scientific and engineering codes.
Submitted 27 March, 2015;
originally announced March 2015.
-
Iso-Quality of Service: Fairly Ranking Servers for Real-Time Data Analytics
Authors:
Giorgis Georgakoudis,
Charles J. Gillan,
Ahmed Sayed,
Ivor Spence,
Richard Faloon,
Dimitrios S. Nikolopoulos
Abstract:
We present a mathematically rigorous Quality-of-Service (QoS) metric which relates the achievable QoS of a real-time analytics service to the server energy cost of offering the service. Using a new iso-QoS evaluation methodology, we scale server resources to meet QoS targets and directly rank the servers in terms of their energy efficiency and, by extension, cost of ownership. Our metric and method are platform-independent and enable fair comparison of datacenter compute servers with significant architectural diversity, including micro-servers. We deploy our metric and methodology to compare three servers running financial option pricing workloads on real-life market data. We find that server ranking is sensitive to data inputs and desired QoS level and that, although scale-out micro-servers can be up to two times more energy-efficient than conventional heavyweight servers for the same target QoS, they are still six times less energy-efficient than high-performance computational accelerators.
Submitted 14 January, 2015;
originally announced January 2015.
-
Watermarking PDF Documents using Various Representations of Self-inverting Permutations
Authors:
Maria Chroni,
Stavros D. Nikolopoulos
Abstract:
This work provides web users with copyright protection for their Portable Document Format (PDF) documents by proposing efficient and easily implementable techniques for PDF watermarking; our techniques are based on the ideas of our recently proposed watermarking techniques for software, image, and audio, thus expanding the digital objects that can be efficiently watermarked through the use of self-inverting permutations. In particular, we present various representations of a self-inverting permutation $π^*$, namely 1D-representation, 2D-representation, and RPG-representation, and show that these representations can be efficiently applied to PDF watermarking. Indeed, we first present an audio-based technique for marking a PDF document $T$ by exploiting the 1D-representation of a permutation $π^*$, and then, since pages of a PDF document $T$ are 2D objects, we present an image-based algorithm for encoding $π^*$ into $T$ by first mapping the elements of $π^*$ into a matrix $A^*$ and then using the information stored in $A^*$ to invisibly mark specific areas of PDF document $T$. Finally, we describe a graph-based watermarking algorithm for embedding a self-inverting permutation $π^*$ into the document structure of a PDF file $T$ by exploiting the RPG-representation of $π^*$ and the structure of a PDF document. We have evaluated the embedding and extracting algorithms by testing them on a variety of PDF documents with different characteristics.
Submitted 12 January, 2015;
originally announced January 2015.
-
Methods and Metrics for Fair Server Assessment under Real-Time Financial Workloads
Authors:
Giorgis Georgakoudis,
Charles J. Gillan,
Ahmed Sayed,
Ivor Spence,
Richard Faloon,
Dimitrios S. Nikolopoulos
Abstract:
Energy efficiency has been a daunting challenge for datacenters. The financial industry operates some of the largest datacenters in the world. With increasing energy costs and the growth of the financial services sector, emerging financial analytics workloads may incur extremely high operational costs to meet their latency targets. Microservers have recently emerged as an alternative to high-end servers, promising scalable performance and low energy consumption in datacenters via scale-out. Unfortunately, stark differences in architectural features, form factor and design considerations make a fair comparison between servers and microservers exceptionally challenging. In this paper we present a rigorous methodology and new metrics for fair comparison of server and microserver platforms. We deploy our methodology and metrics to compare a microserver with ARM cores against two servers with x86 cores, running the same real-time financial analytics workload. We define workload-specific but platform-independent performance metrics for platform comparison, targeting both datacenter operators and end users. Our methodology establishes that a server based on the Xeon Phi processor delivers the highest performance and energy efficiency. However, by scaling out energy-efficient microservers, we achieve competitive or better energy efficiency than a power-equivalent server with two Sandy Bridge sockets, despite the microserver's slower cores. Using a new iso-QoS (iso-Quality of Service) metric, we find that the ARM microserver scales enough to meet market throughput demand, i.e. a 100% QoS in terms of timely option pricing, with as little as 55% of the energy consumed by the Sandy Bridge server.
Submitted 30 December, 2014;
originally announced January 2015.
-
Detecting Malicious Code by Exploiting Dependencies of System-call Groups
Authors:
Stavros D. Nikolopoulos,
Iosif Polenakis
Abstract:
In this paper we present an elaborated graph-based algorithmic technique for efficient malware detection. More precisely, we utilize the system-call dependency graphs (or, for short, ScD graphs), obtained by capturing taint analysis traces, and a set of various similarity metrics in order to detect whether an unknown test sample is malicious or benign. For the sake of generalization, we decide to empower our model against strong mutations by applying our detection technique on a weighted directed graph resulting from the ScD graph after grouping disjoint subsets of its vertices. Additionally, we have developed a similarity metric, which we call NP-similarity, that combines qualitative, quantitative, and relational characteristics that are spread among the members of known malware families to achieve a clear distinction between graph representations of malware and those of benign software. Finally, we evaluate our detection model and compare our results against the results achieved by a variety of techniques, demonstrating the potential of our model.
Submitted 30 December, 2014;
originally announced December 2014.
-
A Programming Model and Runtime System for Significance-Aware Energy-Efficient Computing
Authors:
Vassilis Vassiliadis,
Konstantinos Parasyris,
Charalambos Chalios,
Christos D. Antonopoulos,
Spyros Lalis,
Nikolaos Bellas,
Hans Vandierendonck,
Dimitrios S. Nikolopoulos
Abstract:
Reducing energy consumption is one of the key challenges in computing technology. One factor that contributes to high energy consumption is that all parts of the program are considered equally significant for the accuracy of the end-result. However, in many cases, parts of computations can be performed in an approximate way, or even dropped, without affecting the quality of the final output to a significant degree.
In this paper, we introduce a task-based programming model and runtime system that exploit this observation to trade off the quality of program outputs for increased energy-efficiency. This is done in a structured and flexible way, allowing for easy exploitation of different execution points in the quality/energy space, without code modifications and without adversely affecting application performance. The programmer specifies the significance of tasks, and optionally provides approximations for them. Moreover, she provides hints to the runtime on the percentage of tasks that should be executed accurately in order to reach the target quality of results. The runtime system can apply a number of different policies to decide whether it will execute each individual less-significant task in its accurate form, or in its approximate version. Policies differ in terms of their runtime overhead but also the degree to which they manage to execute tasks according to the programmer's specification.
The results from experiments performed on top of an Intel-based multicore/multiprocessor platform show that, depending on the runtime policy used, our system can achieve an energy reduction of up to 83% compared with a fully accurate execution and up to 35% compared with an approximate version employing loop perforation. At the same time, our approach always results in graceful quality degradation.
Submitted 15 December, 2014;
originally announced December 2014.
-
WaterRPG: A Graph-based Dynamic Watermarking Model for Software Protection
Authors:
Ioannis Chionis,
Maria Chroni,
Stavros D. Nikolopoulos
Abstract:
Software watermarking involves embedding a unique identifier or, equivalently, a watermark value within a piece of software to prove the owner's authenticity and thus to prevent or discourage copyright infringement. For the embedding process, several graph-theoretic watermarking techniques encode the watermark values as graph structures and embed them in application programs. Recently, we presented an efficient codec system for encoding a watermark number $w$ as a reducible permutation graph $F[π^*]$ through the use of self-inverting permutations $π^*$. In this paper, we propose a dynamic watermarking model, which we call WaterRPG, for embedding the watermark graph $F[π^*]$ into an application program $P$. The main idea behind the proposed watermarking model is a systematic use of appropriate calls of specific functions of the program $P$. More precisely, for a specific input $I_{key}$ of the program $P$, our model takes the dynamic call-graph $G(P, I_{key})$ of $P$ and the watermark graph $F[π^*]$, and produces the watermarked program $P^*$ having the following key property: its dynamic call-graph $G(P^*, I_{key})$ is isomorphic to the watermark graph $F[π^*]$. Under this idea, the program $P^*$ is produced by only altering appropriate calls of specific functions of the input application program $P$. We have implemented our watermarking model WaterRPG in real application programs and evaluated its functionality under various broadly used watermarking assessment criteria. The evaluation results show that our model efficiently watermarks Java application programs with respect to several watermarking metrics such as data rate, bytecode instruction overhead, resiliency, and time and space efficiency. Moreover, the embedded watermarks withstand several software obfuscation and optimization attacks.
Submitted 17 March, 2014;
originally announced March 2014.
-
Efficient Encoding of Watermark Numbers as Reducible Permutation Graphs
Authors:
Maria Chroni,
Stavros D. Nikolopoulos
Abstract:
In a software watermarking environment, several graph-theoretic watermark methods use numbers as watermark values, where some of these methods encode the watermark numbers as graph structures. In this paper we extend the class of error-correcting graphs by proposing an efficient and easily implemented codec system for encoding watermark numbers as reducible permutation flow-graphs. More precisely, we first present an efficient algorithm which encodes a watermark number $w$ as a self-inverting permutation $π^*$ and, then, an algorithm which encodes the self-inverting permutation $π^*$ as a reducible permutation flow-graph $F[π^*]$ by exploiting domination relations on the elements of $π^*$ and using an efficient DAG representation of $π^*$. The whole encoding process takes $O(n)$ time and space, where $n$ is the binary size of the number $w$ or, equivalently, the number of elements of the permutation $π^*$. We also propose efficient decoding algorithms which extract the number $w$ from the reducible permutation flow-graph $F[π^*]$ within the same time and space complexity. The two main components of our proposed codec system, i.e., the self-inverting permutation $π^*$ and the reducible permutation graph $F[π^*]$, incorporate important structural properties which make our system resilient to attacks.
Submitted 6 October, 2011;
originally announced October 2011.
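The abstract above builds on self-inverting permutations. As a small illustration of that one property only (not of the authors' encoding algorithm, which the abstract does not spell out), the following Python sketch checks whether a permutation of 1..n equals its own inverse. The function name and the sample permutations are hypothetical.

```python
def is_self_inverting(perm):
    """Return True if perm, a list holding a permutation of 1..n,
    is its own inverse, i.e. perm[perm[i]] == i for every i (1-based)."""
    n = len(perm)
    # Reject inputs that are not permutations of 1..n.
    if sorted(perm) != list(range(1, n + 1)):
        return False
    # Position i (0-based) holds value perm[i]; applying perm again
    # must lead back to i + 1.
    return all(perm[perm[i] - 1] == i + 1 for i in range(n))


# Hypothetical sample: 2-cycles (1 5), (2 6), (3 9), (4 8) and fixed point 7.
print(is_self_inverting([5, 6, 9, 8, 1, 2, 7, 4, 3]))
# A 3-cycle is a permutation but not self-inverting.
print(is_self_inverting([2, 3, 1]))
```

Equivalently, a permutation is self-inverting exactly when all of its cycles have length 1 or 2.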
-
Join-Reachability Problems in Directed Graphs
Authors:
Loukas Georgiadis,
Stavros D. Nikolopoulos,
Leonidas Palios
Abstract:
For a given collection G of directed graphs we define the join-reachability graph of G, denoted by J(G), as the directed graph that, for any pair of vertices a and b, contains a path from a to b if and only if such a path exists in all graphs of G. Our goal is to compute an efficient representation of J(G). In particular, we consider two versions of this problem. In the explicit version we wish to construct the smallest join-reachability graph for G. In the implicit version we wish to build an efficient data structure (in terms of space and query time) such that we can quickly report the set of vertices that reach a query vertex in all graphs of G. This problem is related to the well-studied reachability problem and is motivated by emerging applications of graph-structured databases and graph algorithms. We consider the construction of join-reachability structures for two graphs and develop techniques that can be applied to both the explicit and the implicit problem. First we present optimal and near-optimal structures for paths and trees. Then, based on these results, we provide efficient structures for planar graphs and general directed graphs.
Submitted 22 December, 2010;
originally announced December 2010.
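The definition of join-reachability in the abstract above can be made concrete with a brute-force Python sketch. This is only a naive baseline under stated assumptions, not the efficient structures the paper develops: it computes every graph's reachable pairs by depth-first search and intersects them. The function names and the two sample graphs are hypothetical.

```python
def reachable_pairs(vertices, edges):
    """Return all ordered pairs (a, b), a != b, such that the directed
    graph (vertices, edges) contains a path from a to b."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
    pairs = set()
    for source in vertices:
        # Iterative depth-first search from source.
        stack, seen = [source], {source}
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        pairs.update((source, t) for t in seen if t != source)
    return pairs


def join_reachability_pairs(vertices, graphs):
    """Intersect the reachability relations of all edge lists in graphs."""
    common = None
    for edges in graphs:
        pairs = reachable_pairs(vertices, edges)
        common = pairs if common is None else common & pairs
    return common if common is not None else set()


# Hypothetical example: two paths over the same four vertices.
vertices = [1, 2, 3, 4]
g1 = [(1, 2), (2, 3), (3, 4)]  # 1 -> 2 -> 3 -> 4
g2 = [(1, 3), (3, 2), (2, 4)]  # 1 -> 3 -> 2 -> 4
print(sorted(join_reachability_pairs(vertices, [g1, g2])))
```

Because the intersection of transitive relations is transitive, the resulting pair set, read as an edge set, is one graph with the required reachability, though generally not the smallest one that the explicit version of the problem asks for.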
-
Linear Coloring and Linear Graphs
Authors:
Kyriaki Ioannidou,
Stavros D. Nikolopoulos
Abstract:
Motivated by the definition of linear coloring on simplicial complexes, recently introduced in the context of algebraic topology \cite{Civan}, and the framework through which it was studied, we introduce the linear coloring on graphs. We provide an upper bound for the chromatic number $χ(G)$, for any graph $G$, and show that $G$ can be linearly colored in polynomial time by proposing a simple linear coloring algorithm. Based on these results, we define a new class of perfect graphs, which we call co-linear graphs, and study their complement graphs, namely linear graphs. The linear coloring of a graph $G$ is a vertex coloring such that two vertices can be assigned the same color, if their corresponding clique sets are associated by the set inclusion relation (a clique set of a vertex $u$ is the set of all maximal cliques containing $u$); the linear chromatic number $λ(G)$ of $G$ is the least integer $k$ for which $G$ admits a linear coloring with $k$ colors. We show that linear graphs are those graphs $G$ for which the linear chromatic number achieves its theoretical lower bound in every induced subgraph of $G$. We prove inclusion relations between these two classes of graphs and other subclasses of chordal and co-chordal graphs, and also study the structure of the forbidden induced subgraphs of the class of linear graphs.
Submitted 26 July, 2008;
originally announced July 2008.
-
The 1-fixed-endpoint Path Cover Problem is Polynomial on Interval Graph
Authors:
Katerina Asdre,
Stavros D. Nikolopoulos
Abstract:
We consider a variant of the path cover problem, namely, the $k$-fixed-endpoint path cover problem, or kPC for short, on interval graphs. Given a graph $G$ and a subset $\mathcal{T}$ of $k$ vertices of $V(G)$, a $k$-fixed-endpoint path cover of $G$ with respect to $\mathcal{T}$ is a set of vertex-disjoint paths $\mathcal{P}$ that covers the vertices of $G$ such that the $k$ vertices of $\mathcal{T}$ are all endpoints of the paths in $\mathcal{P}$. The kPC problem is to find a $k$-fixed-endpoint path cover of $G$ of minimum cardinality; note that, if $\mathcal{T}$ is empty the stated problem coincides with the classical path cover problem. In this paper, we study the 1-fixed-endpoint path cover problem on interval graphs, or 1PC for short, generalizing the 1HP problem which has been proved to be NP-complete even for small classes of graphs. Motivated by a work of Damaschke, where he left both 1HP and 2HP problems open for the class of interval graphs, we show that the 1PC problem can be solved in polynomial time on the class of interval graphs. The proposed algorithm is simple, runs in $O(n^2)$ time, requires linear space, and also enables us to solve the 1HP problem on interval graphs within the same time and space complexity.
Submitted 26 June, 2008;
originally announced June 2008.
-
The Number of Spanning Trees in Kn-complements of Quasi-threshold Graphs
Authors:
Stavros D. Nikolopoulos,
Charis Papadopoulos
Abstract:
In this paper we examine the classes of graphs whose $K_n$-complements are trees and quasi-threshold graphs and derive formulas for their number of spanning trees; for a subgraph $H$ of $K_n$, the $K_n$-complement of $H$ is the graph $K_n-H$ which is obtained from $K_n$ by removing the edges of $H$. Our proofs are based on the complement spanning-tree matrix theorem, which expresses the number of spanning trees of a graph as a function of the determinant of a matrix that can be easily constructed from the adjacency relation of the graph. Our results generalize previous results and extend the family of graphs of the form $K_n-H$ admitting formulas for the number of their spanning trees.
Submitted 7 February, 2005;
originally announced February 2005.
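The quantities in the abstract above can be checked numerically on small instances. The Python sketch below is an assumption-laden illustration: it uses the classical Kirchhoff matrix-tree theorem (the number of spanning trees equals any cofactor of the graph Laplacian), which is swapped in here for the complement spanning-tree matrix theorem the paper relies on, and it yields no closed formulas. The function names are hypothetical; vertices are numbered 0..n-1.

```python
from fractions import Fraction
from itertools import combinations


def count_spanning_trees(n, edges):
    """Count spanning trees of a simple undirected graph on vertices
    0..n-1 via Kirchhoff's theorem, using exact rational arithmetic."""
    if n == 1:
        return 1
    # Laplacian L = D - A.
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    # Delete the last row and column, then take the determinant
    # by Gaussian elimination with row swaps.
    m = [row[:-1] for row in lap[:-1]]
    size = n - 1
    det = Fraction(1)
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] != 0), None)
        if pivot is None:
            return 0  # singular minor: the graph is disconnected
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for c in range(col, size):
                m[r][c] -= factor * m[col][c]
    return int(det)


def kn_complement_edges(n, h_edges):
    """Edges of K_n - H: all pairs of 0..n-1 except the edges of H."""
    removed = {frozenset(e) for e in h_edges}
    return [(u, v) for u, v in combinations(range(n), 2)
            if frozenset((u, v)) not in removed]


# K_4 itself (H empty), then K_4 with a single edge removed.
print(count_spanning_trees(4, kn_complement_edges(4, [])))
print(count_spanning_trees(4, kn_complement_edges(4, [(0, 1)])))
```

With H empty the count agrees with Cayley's formula $n^{n-2}$, which gives a quick sanity check before trying other choices of H.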