Distributed, Parallel, and Cluster Computing
- [1] arXiv:2407.20537 [pdf, other]
Title: Switchboard: An Open-Source Framework for Modular Simulation of Large Hardware Systems
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Hardware Architecture (cs.AR)
Scaling up hardware systems has become an important tactic for improving performance as Moore's law fades. Unfortunately, simulations of large hardware systems are often a design bottleneck due to slow throughput and long build times. In this article, we propose a solution targeting designs composed of modular blocks connected by latency-insensitive interfaces. Our approach is to construct the hardware simulation in the same way as the design itself, using a prebuilt simulator for each block and connecting the simulators via fast shared-memory queues at runtime. This improves build time, because simulation scale-up simply involves running more instances of the prebuilt simulators. It also addresses simulation speed, because prebuilt simulators can run in parallel, without fine-grained synchronization or global barriers. We introduce a framework, Switchboard, that implements our approach, and discuss two applications that demonstrate its speed, scalability, and accuracy: (1) a web application where users can run fast simulations of chiplets on an interposer, and (2) a wafer-scale simulation of one million RISC-V cores distributed across thousands of cloud compute cores.
- [2] arXiv:2407.20573 [pdf, other]
Title: Federated Learning as a Service for Hierarchical Edge Networks with Heterogeneous Models
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Federated learning (FL) is a distributed machine learning (ML) framework that trains a new global model by aggregating clients' locally trained models without sharing users' original data. Federated learning as a service (FLaaS) offers a privacy-preserving approach for training machine learning models on devices with various computational resources. Most proposed FL-based methods train the same model on all client devices regardless of their computational resources. However, in practical Internet of Things (IoT) scenarios, IoT devices with limited computational resources may not be capable of training the models hosted by client devices with greater hardware performance. Most existing FL frameworks that aim to solve the problem of aggregating heterogeneous models are designed for independent and identically distributed (IID) data, which may make it hard to reach the target algorithm performance in non-IID scenarios. To address these problems in hierarchical networks, we propose a heterogeneous aggregation framework for hierarchical edge systems called HAF-Edge. In our proposed framework, we introduce a communication-efficient model aggregation method designed for FL systems with two-level model aggregations running at the edge and cloud levels. This approach enhances the convergence rate of the global model by leveraging selective knowledge transfer during the aggregation of heterogeneous models. To the best of our knowledge, this work is pioneering in addressing the problem of aggregating heterogeneous models within hierarchical FL systems spanning IoT, edge, and cloud environments. We conducted extensive experiments to validate the performance of our proposed method. The evaluation results demonstrate that HAF-Edge significantly outperforms state-of-the-art methods.
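To make the two-level aggregation described above concrete, here is a minimal sketch of plain sample-weighted averaging applied first at each edge server and then at the cloud. It is not HAF-Edge: it assumes every client trains a model of the same shape and it omits selective knowledge transfer entirely. The function names `weighted_average` and `hierarchical_aggregate`, and the toy parameter vectors, are hypothetical.

```python
def weighted_average(models, weights):
    """Average parameter vectors, weighting each by its sample count."""
    total = sum(weights)
    dim = len(models[0])
    return [
        sum(w * m[d] for m, w in zip(models, weights)) / total
        for d in range(dim)
    ]


def hierarchical_aggregate(edges):
    """Two-level aggregation (assumed baseline, not the paper's method).

    edges: list of edge groups; each group is a list of
           (parameter_vector, num_samples) tuples, one per client.
    """
    edge_models, edge_sizes = [], []
    for clients in edges:
        params = [p for p, _ in clients]
        sizes = [n for _, n in clients]
        # Level 1: each edge server averages its own clients.
        edge_models.append(weighted_average(params, sizes))
        edge_sizes.append(sum(sizes))
    # Level 2: the cloud averages the edge models.
    return weighted_average(edge_models, edge_sizes)


# Toy data: two edge servers, three clients in total.
edges = [
    [([1.0, 2.0], 10), ([3.0, 4.0], 30)],
    [([5.0, 6.0], 60)],
]
global_model = hierarchical_aggregate(edges)
print(global_model)  # [4.0, 5.0]
```

Because each edge model is weighted by the total number of samples behind it, the result equals a flat weighted average over all clients; only the edge models, not the individual client models, need to travel to the cloud.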
- [3] arXiv:2407.20710 [pdf, other]
Title: On-the-fly Communication-and-Computing to Enable Representation Learning for Distributed Point Clouds
Comments: This is an ongoing work under revision
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
The advent of sixth-generation (6G) mobile networks introduces two groundbreaking capabilities: sensing and artificial intelligence (AI). Sensing leverages multi-modal sensors to capture real-time environmental data, while AI brings powerful models to the network edge, enabling intelligent Internet-of-Things (IoT) applications. These features converge in the Integrated Sensing and Edge AI (ISEA) paradigm, where edge devices collect and locally process sensor data before aggregating it centrally for AI tasks. Point clouds (PtClouds), generated by depth sensors, are crucial in this setup, supporting applications such as autonomous driving and mixed reality. However, the heavy computational load and communication demands of PtCloud fusion pose challenges. To address these, the FlyCom$^2$ framework is proposed, optimizing distributed PtCloud fusion through on-the-fly communication and computing, namely streaming on-sensor processing, progressive data uploading integrated with communication-efficient AirComp, and the progressive output of a global PtCloud representation. FlyCom$^2$ distinguishes itself by aligning PtCloud fusion with Gaussian process regression (GPR), ensuring that the global PtCloud representation progressively improves as more observations are received. Joint optimization of local observation synthesis and AirComp receiver settings is based on minimizing prediction error, balancing communication distortions, data heterogeneity, and temporal correlation. This framework enhances PtCloud fusion by balancing local processing demands with efficient central aggregation, paving the way for advanced 6G applications. Validation on real-world datasets demonstrates the efficacy of FlyCom$^2$, highlighting its potential in next-generation mobile networks.
- [4] arXiv:2407.20980 [pdf, other]
Title: Impact of Conflicting Transactions in Blockchain: Detecting and Mitigating Potential Attacks
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR)
Conflicting transactions within blockchain networks not only pose performance challenges but also introduce security vulnerabilities, potentially facilitating malicious attacks. In this paper, we explore the impact of conflicting transactions on blockchain attack vectors. Through modeling and simulation, we delve into the dynamics of four pivotal attacks: block withholding, double spending, balance, and distributed denial of service (DDoS), all orchestrated using conflicting transactions. Our analysis not only focuses on the mechanisms through which these attacks exploit transaction conflicts but also underscores their potential impact on the integrity and reliability of blockchain networks. Additionally, we propose a set of countermeasures for mitigating these attacks. Through implementation and evaluation, we show their effectiveness in lowering attack rates and enhancing overall network performance without introducing additional overhead. Our findings emphasize the critical importance of actively managing conflicting transactions to reinforce blockchain security and performance.
- [5] arXiv:2407.20983 [pdf, other]
Title: Securing Proof of Stake Blockchains: Leveraging Multi-Agent Reinforcement Learning for Detecting and Mitigating Malicious Nodes
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Cryptography and Security (cs.CR)
Proof of Stake (PoS) blockchains offer promising alternatives to traditional Proof of Work (PoW) systems, providing scalability and energy efficiency. However, blockchains operate in a decentralized manner and the network is composed of diverse users. This openness creates the potential for malicious nodes to disrupt the network in various ways. Therefore, it is crucial to embed a mechanism within the blockchain network to constantly monitor, identify, and eliminate these malicious nodes without involving any central authority. In this paper, we propose MRL-PoS+, a novel consensus algorithm to enhance the security of PoS blockchains by leveraging Multi-agent Reinforcement Learning (MRL) techniques. Our proposed consensus algorithm introduces a penalty-reward scheme for detecting and eliminating malicious nodes. This approach involves the detection of behaviors that can lead to potential attacks in a blockchain network and hence penalizes the malicious nodes, restricting them from performing certain actions. Our developed Proof of Concept demonstrates effectiveness in eliminating malicious nodes for six types of major attacks. Experimental results demonstrate that MRL-PoS+ significantly improves the attack resilience of PoS blockchains compared to the traditional schemes without incurring additional computation overhead.
New submissions for Wednesday, 31 July 2024 (showing 5 of 5 entries)
- [6] arXiv:2407.20474 (cross-list from math.AC) [pdf, other]
Title: Two parallel dynamic lexicographic algorithms for factorization sets in numerical semigroups
Subjects: Commutative Algebra (math.AC); Distributed, Parallel, and Cluster Computing (cs.DC); Combinatorics (math.CO)
To the existing dynamic algorithm FactorizationsUpToElement for factorization sets of elements in a numerical semigroup, we add lexicographic and parallel behavior. To the existing parallel lexicographic algorithm for the same, we add dynamic behavior. The (dimensionwise) dynamic algorithm is parallelized either elementwise or factorizationwise, while the parallel lexicographic algorithm is made dynamic with low-dimension tabulation. The tabulation for the parallel lexicographic algorithm can itself be performed using the dynamic algorithm. We provide reference CUDA implementations with measured runtimes.
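For readers unfamiliar with factorization sets, the following is a minimal sequential sketch of the dynamic idea: the factorizations of an element m are built from those of m - g for each generator g. It is not the paper's parallel or lexicographic algorithm, nor its CUDA implementation. The function name `factorizations_up_to` and the example generators (6, 9, 20) are assumptions chosen for illustration.

```python
def factorizations_up_to(generators, n):
    """Return a list Z where Z[m] is the set of factorizations of m.

    A factorization of m is a tuple a of non-negative integers with
    sum(a[i] * generators[i]) == m.
    """
    k = len(generators)
    table = [set() for _ in range(n + 1)]
    table[0].add((0,) * k)  # the empty sum is the only factorization of 0
    for m in range(1, n + 1):
        for i, g in enumerate(generators):
            if m >= g:
                # Extend every factorization of m - g by one more copy of g.
                for a in table[m - g]:
                    b = list(a)
                    b[i] += 1
                    table[m].add(tuple(b))
    return table


Z = factorizations_up_to((6, 9, 20), 60)
print(sorted(Z[18]))  # [(0, 2, 0), (3, 0, 0)]
print(len(Z[60]))     # 5
print(Z[43])          # set(): 43 is not in the semigroup
```

Storing each factorization set as a Python `set` removes the duplicates that arise when the same tuple is reached through different generators; a lexicographic enumeration avoids generating those duplicates in the first place.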
- [7] arXiv:2407.20539 (cross-list from cond-mat.mes-hall) [pdf, other]
Title: Memristive Linear Algebra
Comments: 11 pages, 2 columns + Appendices
Subjects: Mesoscale and Nanoscale Physics (cond-mat.mes-hall); Distributed, Parallel, and Cluster Computing (cs.DC); Classical Analysis and ODEs (math.CA); Dynamical Systems (math.DS); Adaptation and Self-Organizing Systems (nlin.AO)
The advent of memristive devices offers a promising avenue for efficient and scalable analog computing, particularly for linear algebra operations essential in various scientific and engineering applications. This paper investigates the potential of memristive crossbars in implementing matrix inversion algorithms. We explore both static and dynamic approaches, emphasizing the advantages of analog and in-memory computing for matrix operations beyond multiplication. Our results demonstrate that memristive arrays can significantly reduce computational complexity and power consumption compared to traditional digital methods for certain matrix tasks. Furthermore, we address the challenges of device variability, precision, and scalability, providing insights into the practical implementation of these algorithms.
- [8] arXiv:2407.20611 (cross-list from cs.LG) [pdf, other]
Title: The Entrapment Problem in Random Walk Decentralized Learning
Comments: 10 pages, accepted by 2024 IEEE International Symposium on Information Theory. The associated presentation of this paper can be found in this https URL
Subjects: Machine Learning (cs.LG); Distributed, Parallel, and Cluster Computing (cs.DC); Information Theory (cs.IT)
This paper explores decentralized learning in a graph-based setting, where data is distributed across nodes. We investigate a decentralized SGD algorithm that utilizes a random walk to update a global model based on local data. Our focus is on designing the transition probability matrix to speed up convergence. While importance sampling can enhance centralized learning, its decentralized counterpart, using the Metropolis-Hastings (MH) algorithm, can lead to the entrapment problem, where the random walk becomes stuck at certain nodes, slowing convergence. To address this, we propose the Metropolis-Hastings with Lévy Jumps (MHLJ) algorithm, which incorporates random perturbations (jumps) to overcome entrapment. We theoretically establish the convergence rate and error gap of MHLJ and validate our findings through numerical experiments.
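The transition-matrix design mentioned above can be illustrated with the standard Metropolis-Hastings construction on a graph. The sketch below is a textbook MH walk only; it does not implement the Lévy jumps of MHLJ, and the function name `mh_transition_matrix`, the three-node path graph, and the target weights are assumptions made for illustration.

```python
def mh_transition_matrix(adj, target):
    """Metropolis-Hastings transition probabilities on an undirected graph.

    adj:    dict mapping node -> list of neighbours (no self-loops)
    target: dict mapping node -> positive weight proportional to the
            desired stationary (importance-sampling) distribution
    """
    nodes = sorted(adj)
    P = {i: {j: 0.0 for j in nodes} for i in nodes}
    for i in nodes:
        for j in adj[i]:
            proposal_ij = 1.0 / len(adj[i])  # uniform proposal over neighbours
            proposal_ji = 1.0 / len(adj[j])
            accept = min(1.0, (target[j] * proposal_ji) / (target[i] * proposal_ij))
            P[i][j] = proposal_ij * accept
        # Rejected proposals keep the walk at the current node.
        P[i][i] = 1.0 - sum(P[i][j] for j in nodes if j != i)
    return P


# Path graph 0 - 1 - 2 where node 2 carries most of the target weight.
adj = {0: [1], 1: [0, 2], 2: [1]}
target = {0: 1.0, 1: 1.0, 2: 8.0}
P = mh_transition_matrix(adj, target)
print(P[2][2])  # 0.9375
```

In this toy graph the walk stays at node 2 with probability 0.9375 at every step, which gives a rough picture of how a walk can linger at a heavily weighted node.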
Cross submissions for Wednesday, 31 July 2024 (showing 3 of 3 entries)
- [9] arXiv:2205.03060 (replaced) [pdf, other]
Title: Regular Model Checking Upside-Down: An Invariant-Based Approach
Comments: Preparation for special issue in lmcs - minor revisions incorporated
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Logic in Computer Science (cs.LO)
Regular model checking is a well-established technique for the verification of infinite-state systems whose configurations can be represented as finite words over a suitable alphabet. It applies to systems whose set of initial configurations is regular, and whose transition relation is captured by a length-preserving transducer. To verify safety properties, regular model checking iteratively computes automata recognizing increasingly larger regular sets of reachable configurations, and checks if they contain unsafe configurations. Since this procedure often does not terminate, acceleration, abstraction, and widening techniques have been developed to compute a regular superset of the set of reachable configurations.
In this paper we develop a complementary approach. Instead of approaching the set of reachable configurations from below, we start with the set of all configurations and compute increasingly smaller regular supersets of the reachable set. We use the fact that the set of reachable configurations is equal to the intersection of all inductive invariants of the system. Since the intersection is in general non-regular, we introduce $b$-bounded invariants, defined as those representable by CNF-formulas with at most $b$ clauses. We prove that, for every $b \geq 0$, the intersection of all $b$-bounded inductive invariants is regular, and show how to construct an automaton recognizing it. We study the complexity of deciding if this automaton accepts some unsafe configuration. We show that the problem is in EXPSPACE for every $b \geq 0$, and PSPACE-complete for $b=1$. Finally, we study how large $b$ must be to prove safety properties of a number of benchmarks.
- [10] arXiv:2310.12670 (replaced) [pdf, other]
Title: Towards Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing
Authors: Yuxin Wang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu
Comments: Fault Tolerance, Checkpoint Optimization, Large Language Model, Foundation Model, Hybrid parallelism
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition between checkpointing and training, which can work under DP but can hardly scale under resource-intensive HP.
To ensure low checkpointing overhead for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It mitigates the on-host resource competition caused by in-memory checkpointing in two ways: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This approach uses three-level asynchronous on-device scheduling to enhance parallelism between snapshotting and training, thereby minimizing snapshotting overhead. (2) It proposes Hybrid In-memory Checkpoint Protection to enhance checkpoint completeness during hardware failures. Unlike methods that require inter-node communications, which may block training under HP, it creates intra-node redundancy with efficient resource utilization, protecting training against hardware failures with minimal overhead. With these methods, this work enables fast restart for failed HP training with Distributed In-memory Checkpoint Loading, bypassing inefficiencies in NFS reads. In our evaluation, we achieve zero in-memory checkpoint saving overhead on Frontier while training Llama-2-34B on 256 MI250X devices (512 GPUs).
- [11] arXiv:2407.19732 (replaced) [pdf, other]
Title: Performance Optimization of High-Conflict Transactions within the Hyperledger Fabric Blockchain
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC)
Hyperledger Fabric (HLF) is a secure and robust blockchain (BC) platform that supports high-throughput and low-latency transactions. However, it encounters challenges in managing conflicting transactions that negatively affect throughput and latency. This paper proposes a novel solution to address these challenges and improve performance, especially in applications incorporating extensive volumes of highly conflicting transactions. Our solution involves reallocating the Multi-Version Concurrency Control (MVCC) of the validation phase to a preceding stage in the transaction flow to enable early detection of conflicting transactions. Specifically, we propose and evaluate two innovative modifications, called Orderer Early MVCC (OEMVCC) and OEMVCC with Execution Avoidance (OEMVCC-EA). Our experimental evaluation results demonstrate significant throughput and latency improvements, providing a practical solution for high-conflict applications that demand high performance and scalability.
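The kind of conflict that an MVCC check detects can be sketched as a read-set version comparison. The code below is a simplified illustration, not Fabric's implementation and not OEMVCC: it assumes integer version counters, a dictionary as world state, and a hypothetical function name `validate_block`. It also does not model where in the transaction flow the check runs, which is the subject of the paper.

```python
def validate_block(world_state, transactions):
    """Mark each transaction valid or invalid by comparing read versions.

    world_state:  dict mapping key -> (value, version); updated in place
    transactions: list of dicts with
                  'reads'  {key: version observed at simulation time}
                  'writes' {key: new value}
    Returns a list of booleans, one per transaction, in order.
    """
    results = []
    for tx in transactions:
        # A transaction conflicts if any key it read has changed since.
        ok = all(
            world_state.get(key, (None, 0))[1] == version
            for key, version in tx["reads"].items()
        )
        if ok:
            for key, value in tx["writes"].items():
                _, current = world_state.get(key, (None, 0))
                world_state[key] = (value, current + 1)
        results.append(ok)
    return results


# Two transactions simulated against the same version of "acct".
state = {"acct": (100, 1)}
txs = [
    {"reads": {"acct": 1}, "writes": {"acct": 90}},
    {"reads": {"acct": 1}, "writes": {"acct": 80}},
]
results = validate_block(state, txs)
print(results)        # [True, False]
print(state["acct"])  # (90, 2)
```

The second transaction is rejected because the first one has already bumped the version of the key it read. Detecting this earlier in the flow would spare the later stages from processing a transaction that is bound to fail.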
- [12] arXiv:2407.00394 (replaced) [pdf, other]
Title: Understanding Large-Scale Plasma Simulation Challenges for Fusion Energy on Supercomputers
Authors: Jeremy J. Williams, Ashish Bhole, Dylan Kierans, Matthias Hoelzl, Ihor Holod, Weikang Tang, David Tskhakaya, Stefan Costea, Leon Kos, Ales Podolnik, Jakub Hromadka, JOREK Team, Erwin Laure, Stefano Markidis
Comments: Accepted by EPS PLASMA 2024 (50th European Physical Society Conference on Plasma Physics, Vol. 48A, ISBN: 111-22-33333-44-5), prepared in the standardized EPS conference proceedings format and consists of 4 pages, which includes the main text, references, and figures
Subjects: Plasma Physics (physics.plasm-ph); Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF); Computational Physics (physics.comp-ph)
Understanding plasma instabilities is essential for achieving sustainable fusion energy, with large-scale plasma simulations playing a crucial role in both the design and development of next-generation fusion energy devices and the modelling of industrial plasmas. To achieve sustainable fusion energy, it is essential to accurately model and predict plasma behavior under extreme conditions, requiring sophisticated simulation codes capable of capturing the complex interaction between plasma dynamics, magnetic fields, and material surfaces. In this work, we conduct a comprehensive HPC analysis of two prominent plasma simulation codes, BIT1 and JOREK, to advance understanding of plasma behavior in fusion energy applications. Our focus is on evaluating JOREK's computational efficiency and scalability for simulating non-linear MHD phenomena in tokamak fusion devices. The motivation behind this work stems from the urgent need to advance our understanding of plasma instabilities in magnetically confined fusion devices. Enhancing JOREK's performance on supercomputers improves fusion plasma code predictability, enabling more accurate modelling and faster optimization of fusion designs, thereby contributing to sustainable fusion energy. In prior studies, we analysed BIT1, a massively parallel Particle-in-Cell (PIC) code for studying plasma-material interactions in fusion devices. Our investigations into BIT1's computational requirements and scalability on advanced supercomputing architectures yielded valuable insights. Through detailed profiling and performance analysis, we have identified the primary bottlenecks and implemented optimization strategies, significantly enhancing parallel performance. This previous work serves as a foundation for our present endeavours.