The 3rd International Conference on Data Mining & Machine Learning (DMML 2022) will act as a major forum for the presentation of innovative ideas, approaches, developments, and research projects in the areas of Data Mining and Machine Learning. It will also facilitate the exchange of information between researchers and industry professionals on the latest issues and advancements in Big Data and Machine Learning. Authors are invited to contribute to the conference by submitting articles that present research results, projects, surveys, and industrial experiences describing significant advances in Data Mining and Machine Learning.
An OpenMP parallelization of the multiple-precision Taylor series method is proposed. Very good parallel scalability and parallel efficiency are observed inside one computation node of a CPU cluster. We explain the details of the parallelization on the classical example of the Lorenz equations. The same approach can be applied straightforwardly to a large class of chaotic dynamical systems.
A Parallel Genetic Algorithm for Rule Discovery in Large Databases. Dieferson Luis Alves de Araujo, Heitor S. Lopes, Alex A. Freitas. CEFET-PR – Centro Federal de Educação Tecnológica do Paraná, CPGEI – Curso ...
Task-intensive electronic control units (ECUs) in the automotive domain, equipped with multicore processors, real-time operating systems (RTOSs), and various application software, must perform efficiently and time-deterministically. The parallel computational capability offered by multicore hardware can only be exploited if the ECU application software is parallelized. Given such parallelized software, the real-time operating system scheduler must schedule the time-critical tasks so that all computational cores are well utilized and the safety-critical deadlines are met. As original equipment manufacturers (OEMs) keep adding more sophisticated features to existing ECUs, a large number of task sets must be effectively scheduled for execution within bounded time limits. In this paper, a hybrid scheduling algorithm is proposed that calculates the running slack of every task and estimates the probability of meeting its deadline either in the same partitioned queue or by migrating to another. The algorithm was run and tested using a scheduling simulator with different real-time task models of periodic tasks, and it was compared with the existing static-priority scheduler suggested by AUTOSAR (AUTomotive Open System ARchitecture). The performance parameters considered are the percentage of core utilization, the average response time, and the task deadline miss rate. It has been verified that the proposed algorithm shows considerable improvement over the existing partitioned static-priority scheduler on each of these parameters.
Thermal imagery is a substitute for visible imagery in face detection because of its illumination invariance under variations in facial appearance. This paper presents an effective method for human face detection in thermal imaging. Histogram plots are used in the feature-extraction process and later in face detection. Techniques such as thresholding, object boundary analysis, and morphological operations are applied to the images to ease detection. To enhance the performance of the algorithm and reduce computation time, parallelism is achieved using the Message Passing Interface (MPI) model. Overall, the proposed algorithm showed a high level of accuracy and a computation time of 0.11 seconds in the parallel environment, compared with 0.20 seconds in a serial environment.
In this paper we advocate a high-level programming methodology for Next Generation Sequencing (NGS) alignment tools, for both productivity and absolute performance. We analyse the problem of parallel alignment and review the parallelisation strategies of the most popular alignment tools, which can all be abstracted to a single parallel paradigm. We compare these tools against their porting onto the FastFlow pattern-based programming framework, which provides programmers with high-level parallel patterns. By using a high-level approach, programmers are freed from the complex aspects of parallel programming, such as synchronisation protocols and task scheduling, and gain more opportunity for seamless performance tuning. We show use cases in which, by parallelising NGS tools with a high-level approach, it is possible to obtain comparable or even better absolute performance on all the datasets used.
As the gap between memory and processor speeds continues to widen, cache efficiency is an increasingly important component of processor performance. Compiler techniques have been used to improve instruction and data cache performance for virtually indexed caches by mapping code and data with temporal locality to different cache blocks. In this
File and block storage are well-defined concepts in computing and have been common components of computer systems for decades. Big data has led to new types of storage. The predominant data model in cloud storage is object-based storage, and it is highly successful. Object stores follow a simpler API, with get() and put() operations to interact with the data. A wide variety of data-analysis software has been developed around objects using these APIs. However, object storage and traditional file storage are designed for different purposes and different applications. Organizations maintain file-based storage clusters, and a high volume of existing data is stored in files. Moreover, many new applications need to access data from both types of storage. In this paper, we first explore the key differences between object-based and the more traditional file-based storage systems. We have designed and implemented several file-to-object mapping algorithms to bridge the semantic ...
In this work we present a programmable and reconfigurable single-instruction multiple-data (SIMD) visual processor based on the S-CNN architecture, namely the Simplicial CNN Digital Visual Processor (SCDVP), oriented to high-performance low-level image processing. The cells in the array have a selectable neighborhood configuration and several registers, which provide the chip with extended spatial and temporal processing capabilities, in particular optical flow. A prototype 64 × 64 cell chip with two program memories and a column adder was fabricated in a 90 nm technology; running at 133 MHz, it delivers 105.5 GOPS. The calculation at the cell level is performed with time-coded signals, and the program memory is located outside the array. This yields a very efficient realization in terms of area, 53.8 GOPS per mm², which outperforms all results reported so far. We show that even after normalization to account for technology scaling, the proposed architecture is the most efficient among all reported digital processors. Its computation-performance-to-power ratio also exceeds all previous results, at 817.8 GOPS/W. Experimental results of the working chip are reported.
The mathematical model, the numerical method, and the parallelization technique are presented for the problems of detonation initiation by a comparatively weak shock wave and of detonation-wave propagation in three-dimensional tubes of complex shape. The mechanisms of detonation initiation in a tube with parabolic contraction and conical expansion and in a helical tube are analyzed. The results are of interest both for basic research, contributing to the understanding of the mechanism of detonation initiation in tubes with curved walls, and for applications, from the point of view of predictive modeling of accidents in the chemical industry.
Since 1978 I have been dealing with problems of concurrency for cluster, parallel, and distributed computers. Every mobile phone and every computer, including supercomputers, delays and waits, spinning through billions of cycles while waiting for an event to synchronise. I recently found in my archives a lecture that had been requested by BCS colleagues, as well as by the organizing committee of a large US computing conference. Part of it was included in the second edition of our book, Software Design for Resilient Computer Systems. But the book is expensive and students cannot afford it, so these lecture notes are worth publishing on Academia.
Today, communication and computing environments over the Internet are a significant aspect of everyone's life. To secure our communication from unauthorized network access and various adversaries, we can apply a biometric identification system. Facial recognition is one of the best physiological biometric techniques for detecting and recognizing human faces. By reviewing different articles, journals, tutorials, and online guides, this paper provides a brief academic survey of biometric identification techniques. We propose a high-performance autonomous parallel face detection and recognition system using an artificial neural network with an eigenface producer algorithm, and, so that the system can adapt to and learn its environment autonomously, we design and apply a reinforcement learning algorithm based on the law of effect. In addition, in this project we implement these deep learning concepts together with parallel computing in C using POSIX threads (pthreads) and compare the result with a sequential algorithm.
This paper aims to show that knowing the core concepts of a given parallel architecture is necessary to write correct code, regardless of the parallel programming paradigm used. Programmers unaware of architecture concepts, such as beginners and students, often write parallel code that is slower than its sequential version. It is also easy to write code that produces incorrect answers under specific conditions, which are hard to detect and correct. The increasing popularity of multi-core architectures motivates the implementation of parallel programming frameworks and tools, such as OpenMP, that aim to lower the difficulty of parallel programming. OpenMP uses compilation directives, or pragmas, to reduce the number of lines the programmer needs to write. However, the programmer still has to know when and how to use each of these directives. The documentation and available tutorials for OpenMP give the idea that using compilation directives for parallel programming is easy. In this paper we show that this is not always the case by analysing a set of corrections of OpenMP programs made by students of a graduate course in Parallel and Distributed Computing at the University of São Paulo. Several incorrect examples of OpenMP pragmas were found in tutorials and official documents available on the Internet. The idea that OpenMP is easy to use can lead to superficial efforts in teaching fundamental parallel programming concepts. This can in turn lead to code that does not develop the full potential of OpenMP, and that could crash inexplicably due to very specific and hard-to-detect conditions. Our main contribution is showing how important it is to teach core architecture and parallel programming concepts properly, even when powerful tools such as OpenMP are available.
This paper presents a high-performance reconfigurable hardware implementation of the Data Encryption Standard (DES) algorithm, achieved by combining pipelining with a novel skew-core key-scheduling method, and compares it with previously reported encryption implementations. The DES design is implemented on Xilinx Spartan-3E Field-Programmable Gate Array (FPGA) technology. The final 16-stage pipelined design achieves an encryption rate of 7.160 Gbit/s using 2814 configurable logic blocks (CLBs). This result is among the fastest hardware implementations, with better area utilization.
At Queen Mary, University of London, we have over twenty years of experience in parallel computing applications, mostly on "massively parallel systems" such as the Distributed Array Processors (DAPs). The applications in which we were involved included the design of numerical subroutine libraries, finite-element software, graphics tools, the physics of organic materials, medical imaging, computer vision and, more recently, financial modelling. Two of the projects related to the latter are described in this paper, namely Portfolio Optimisation and Financial Risk Assessment.
Image compression is used in many applications, for example satellite imaging, medical imaging, and video, where images require a large amount of storage space. Image compression techniques fall into two classes, lossy and lossless. Both are widely used, but they are not fast: compression and decompression take considerable time. For fast and efficient image compression, a parallel computing technique in MATLAB is used; MATLAB is employed in this project for parallel processing of images. In this paper we discuss regular image compression techniques, three alternatives for parallel computing using MATLAB, and a comparison of image compression with and without parallel computing.
In this paper we present OMP2MPI, a tool that automatically generates MPI source code from OpenMP. With this transformation, the original program can exploit a larger number of processors by surpassing the limits of the node level on large HPC clusters. The transformation can also be used to adapt the source code to run on distributed-memory many-cores with message-passing support. In addition, the resulting MPI code can be used as a starting point that can be further optimized by software engineers. The transformation process focuses on detecting OpenMP parallel loops and distributing them in a master/worker pattern. A set of micro-benchmarks has been used to verify the correctness of the transformation and to measure the resulting performance. Surprisingly, not only is the automatically generated code correct by construction, but it also often performs faster when executed with MPI.
This paper presents a new environment for simulating close-proximity dynamics around rubble-pile asteroids. The code provides methods for modeling the asteroid's gravity field and surface through granular dynamics. It implements state-of-the-art techniques to model both gravity and contact interaction between particles: 1) mutual gravity, as either direct N² summation or a Barnes-Hut GPU-parallel octree, and 2) contact dynamics, with a soft-body (force-based, smooth dynamics), hard-body (constraint-based, non-smooth dynamics), or hybrid (constraint-based with compliance and damping) approach. A very relevant feature of the code is its ability to handle complex-shaped rigid bodies and their full 6D motion. Examples of spacecraft close-proximity scenarios and their numerical simulations are shown.
This report presents a study of the performance of a square-matrix multiplication algorithm. To test the algorithm we use node 431 of the SeARCH cluster. Throughout this work we explore three different implementations of the algorithm, with matrices of sizes specifically selected to evaluate the performance impact. The internal CPU organization and the evaluation of bottlenecks are the main focus of this work. In the algorithm, the loop index order was defined as k-j-i for our work group. Modern CPU architectures implement vector computing features: the capability of using "large" processor registers to process multiple data elements at once in a clock cycle. This capability, commonly known as SIMD (Single Instruction, Multiple Data), is also explored as a performance-optimization technique for our implementation. As the main tool in the experimental component of this work we use a C library for performance analysis called the Performance Application Programming Interface (PAPI). This library allows us to access the internal CPU counters of node 431, analyse different metrics, and draw conclusions for different data sets and algorithm performance.
Humans have used algorithms since the time of the Babylonians and cuneiform writing on stone. With the Frenchman Blaise Pascal in 1642, they began to invent machines to execute algorithms. In 1937 Alan Turing presented a virtual machine, the Turing machine, which served as the basis for the design of computer architectures. The RAM machine was subsequently used as a base model for designing computer architectures. Technological progress in computer construction has made it possible to build computers with a multi-core processor, or even multi-processor computers. This work studies the notion of parallelism in the modeling of multi-core and multi-processor computer architectures. What is the optimal multi-processor or multi-core architecture model? The objective of this document is to give the reader the tools to understand the architecture of parallel machines; to invite and encourage the reader to model a new architecture for parallel machines or to evolve an existing one; and to become able to program optimally on parallel machines. The first chapter introduces the notions of parallelism and high-performance computing, in order to familiarize the reader with parallelism, show the value of understanding existing parallel architectures, and invite the reader to optimize them and use them better by adopting programming adapted to parallelism. In the second chapter, we study the architectures of parallel machines; the reader is invited to reflect on ways of modeling new, more optimal architectures.
In the third chapter, in order to exploit the technological evolution of parallel machines and the possibility of connecting, over the Internet, a very large number of computers and telephones to solve the same problem P, the reader is invited to program optimally on the available parallel machines.
The need to maximize processor performance opened the door to a new approach that emphasizes parallelism. This paper discusses how various architectures, such as superscalars and VLIWs, implement this parallelism, and presents a case study of Intel's Itanium processor, which uses EPIC (Explicitly Parallel Instruction Computing).