Regular Expressions (REs) are widely used to find patterns in data, as in genomic marker research for DNA analysis, deep packet inspection, or signature-based detection for network intrusion detection systems. This paper proposes a novel and efficient RE matching architecture for FPGAs, based on the concept of a matching core. An RE can be software-compiled into a sequence of basic matching instructions that a matching core runs on input data, and the sequence can be replaced to change the RE being matched. The architecture scales easily with the available resources and is customizable to multiple usage scenarios. We ran several experiments and compared the results with a software solution: running at 130 MHz, the architecture reaches speedups of over 100x with respect to a Flex-based matching application running on an Intel i7 CPU at 2.8 GHz.
This document describes research on implementing Curran's approximation algorithm for pricing Asian options using a dataflow architecture. The algorithm was implemented on a Maxeler dataflow engine (DFE) and compared to a CPU implementation. Different fixed-point precisions were tested on the DFE and 54-bit fixed-point provided the best balance of precision and resource usage. Implementing the algorithm across multiple DFEs provided speedups of 5-12x over a 48-core CPU. Further optimization of dynamic ranges allowed increasing the unrolling factor, improving performance and energy efficiency.
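The fixed-point precision trade-off the summary describes can be illustrated in a few lines of Python; the bit widths and values below are illustrative, not the ones from the study:

```python
# Illustrative fixed-point quantization: round a real value to a
# fixed-point number with the given number of fractional bits.
def to_fixed(x, frac_bits):
    """Quantize x to fixed point with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return round(x * scale) / scale

# Quantization error shrinks as fractional bits grow, at the cost of
# wider datapaths (more DFE resources).
x = 3.141592653589793
for bits in (8, 16, 24):
    print(bits, abs(x - to_fixed(x, bits)))
```

Choosing the narrowest width whose error stays within tolerance is exactly the kind of sweep the study performed before settling on 54 bits.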
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs - Pandey_G
Presentation for the paper C-SAW: A Framework for Graph Sampling and Random Walk on GPUs published in SC20.
Paper link: https://arxiv.org/pdf/2009.09103.pdf
[Paper Reading] Steering Query Optimizers: A Practical Take on Big Data Workl... - PingCAP
This document discusses methods for optimizing query performance in a query optimizer called Scope by selecting alternative rule configurations. It proposes using rule signatures to group similar queries and generate candidate rule configurations to execute for each group. A learning model is then trained on execution results to select the best configuration for future queries in each group. The goal is to improve upon the default configuration by adapting to workloads and addressing inaccuracies in cardinality estimation that can lead to suboptimal plans.
Photon Technical Deep Dive: How to Think Vectorized - Databricks
Photon is a new vectorized execution engine powering Databricks written from scratch in C++. In this deep dive, I will introduce you to the basic building blocks of a vectorized engine by walking you through the evaluation of an example query with code snippets. You will learn about expression evaluation, compute kernels, runtime adaptivity, filter evaluation, and vectorized operations against hash tables.
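The building blocks mentioned above (compute kernels, filter evaluation, selection vectors) can be sketched in plain Python; the kernel names and batch layout below are illustrative, not Photon's actual C++ internals:

```python
# A toy vectorized engine step: evaluate (price * qty) > 100 over a
# column batch, producing a selection vector of surviving row indices.
def mul_kernel(a, b):
    """Compute kernel: element-wise multiply of two columns."""
    return [x * y for x, y in zip(a, b)]

def gt_filter(col, threshold):
    """Filter kernel: indices of rows where col[i] > threshold."""
    return [i for i, v in enumerate(col) if v > threshold]

price = [10.0, 50.0, 7.5, 30.0]
qty   = [5,    3,    20,  4]
revenue = mul_kernel(price, qty)        # [50.0, 150.0, 150.0, 120.0]
selection = gt_filter(revenue, 100.0)   # rows that pass the filter
print(selection)  # → [1, 2, 3]
```

Later operators then read only the rows named in the selection vector, which is how a vectorized engine avoids materializing filtered batches.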
This document discusses batch processing using Apache Flink. It provides code examples of using Flink's DataSet and Table APIs to perform batch word count jobs. It also covers iterative algorithms in Flink, including how Flink handles bulk and delta iterations more efficiently than other frameworks like Spark and MapReduce. Delta iterations are optimized by only processing changes between iterations to reduce the working data set size over time.
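For readers unfamiliar with the batch word-count example, here is the same pipeline in plain Python; Flink's DataSet API expresses it as flatMap, groupBy, and sum and runs it distributed, while this sketch only illustrates the dataflow:

```python
from collections import Counter

def word_count(lines):
    """Word count as a flatMap -> groupBy(word) -> sum(1) pipeline."""
    counts = Counter()
    for line in lines:            # flatMap: split each line into words
        for word in line.lower().split():
            counts[word] += 1     # groupBy(word).sum(1)
    return dict(counts)

print(word_count(["to be or not to be"]))
# → {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```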
This document discusses a supercomputer called HYPE-2 built by Santosh Pandey, Ram Sharan Chaulagain, and Prakash Gyawali under the supervision of Prof. Dr. Subarna Shakya. It provides an overview of multiprocessor and multicore systems and discusses how HYPE-2 uses a distributed memory architecture with dynamic scaling to achieve high performance computing capabilities for research applications like cryptography, data mining, and weather forecasting. Performance tests showed near-linear speedup as nodes were added, with the system able to handle complex computations through inter-process communication, though it is not as powerful as larger supercomputers.
Apache Flink: API, runtime, and project roadmap - Kostas Tzoumas
The document provides an overview of Apache Flink, an open source stream processing framework. It discusses Flink's programming model using DataSets and transformations, real-time stream processing capabilities, windowing functions, iterative processing, and visualization tools. It also provides details on Flink's runtime architecture, including its use of pipelined and staged execution, optimizations for iterative algorithms, and how the Flink optimizer selects execution plans.
This is a talk given by Badrish Chandramouli at Portland State University on May 30, 2017, and overviews his recent and ongoing research directions in the space of stream processing and big data analytics.
Advances in the Solution of Navier-Stokes Eqs. in GPGPU Hardware. Modelling F... - Storti Mario
In this article we compare the results obtained with an implementation of the Finite Volume method for structured meshes on GPGPUs with experimental results and also with a Finite Element code using a boundary-fitted strategy. The example is a fully submerged spherical buoy immersed in a cubic water recipient. The recipient undergoes a harmonic linear motion imposed with a shake table. The experiment is recorded with a high-speed camera and the displacement of the buoy is obtained from the video with a MoCap (Motion Capture) algorithm. The amplitude and phase of the resulting motion allow the added mass and drag of the sphere to be determined indirectly.
This document provides concise summaries of key points about Flink:
1) After submitting a Flink job, the client creates and submits the job graph to the JobManager, which then creates an execution graph and deploys tasks across TaskManagers for parallel execution.
2) The batch optimizer chooses optimal execution plans by evaluating physical execution strategies like join algorithms and data shipping approaches to minimize data shuffling and network usage.
3) Flink iterations are optimized by having the runtime directly handle caching, state maintenance, and pushing work out of loops to avoid scheduling overhead between iterations. Delta iterations further improve efficiency by only updating changed elements in each iteration.
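The delta-iteration idea, re-processing only the elements that changed in the previous round, can be sketched with a toy connected-components example in Python; this illustrates the workset/solution-set mechanics, not Flink's actual runtime:

```python
# Delta iteration: each round touches only vertices whose component
# label changed (the "workset"), so the working set shrinks over time.
def connected_components(vertices, edges):
    label = {v: v for v in vertices}   # solution set: vertex -> component id
    workset = set(vertices)            # initially every vertex is "changed"
    neighbors = {v: [] for v in vertices}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    while workset:
        changed = set()
        for v in workset:              # only re-process changed vertices
            for n in neighbors[v]:
                if label[v] < label[n]:
                    label[n] = label[v]
                    changed.add(n)
        workset = changed              # next round's (smaller) workset
    return label

print(connected_components([1, 2, 3, 4], [(1, 2), (3, 4)]))
```

A bulk iteration would rescan every vertex each round; here the loop terminates as soon as no label changes.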
This document provides an overview of Apache Flink internals. It begins with an introduction and recap of Flink programming concepts. It then discusses how Flink programs are compiled into execution plans and executed in a pipelined fashion, as opposed to being executed eagerly like regular code. The document outlines Flink's architecture including the optimizer, runtime environment, and data storage integrations. It also covers iterative processing and how Flink handles iterations both by unrolling loops and with native iterative datasets.
Distributed model-to-model transformations can be computationally expensive for large models or complex transformations. The authors present an approach to distribute ATL model transformations using MapReduce. Local match and apply phases are performed in parallel by mappers. Global resolve is done by reducers to combine local results. An evaluation shows near-linear speedup on Amazon EMR for models up to 100,000 lines of code. Challenges include load balancing, persistence for concurrent read/write, and parallelizing all transformation phases.
Developing Your Own Flux Packages by David McKay | Head of Developer Relation... - InfluxData
Flux is easy to contribute to, and it is easy to share functions and libraries of Flux code with other developers. Although there are many functions in the language, the true power of Flux is its ability to be extended with custom functions. In this session, David will show you how to write your own custom function to perform some new analytics.
This document discusses different frameworks for big data processing at ResearchGate, including Hive, MapReduce, and Flink. It provides an example of using Hive to find the top 5 coauthors for each author based on publication data. Code snippets in Hive SQL and Java are included to implement the top k coauthors user defined aggregate function (UDAF) in Hive. The document evaluates different frameworks based on criteria like features, performance, and usability.
Migrate 10TB to Exadata -- Tips and Tricks - Amin Adatia
This document provides tips and tricks for migrating 10TB of data from an AIX database to an Exadata database within a limited 6 hour downtime window. It discusses approaches taken for different object types including non-partitioned tables, partitioned tables with and without LOB columns, tables with Oracle Text indexes, and tables using Oracle Label Security. Key steps taken included rebuilding Oracle Text indexes in parallel rather than using transportable tablespaces, and replacing source label tags with target tags during data migration rather than updating tags post-migration. The migration was completed on time with all objectives met.
How to Introduce Telemetry Streaming (gNMI) in Your Network with SNMP with Te... - InfluxData
This document provides an overview of introducing network telemetry using streaming protocols like gNMI with Telegraf. It discusses gNMI as a streaming telemetry protocol, using Telegraf to collect metrics from network devices via gNMI and SNMP, and how to normalize and enrich the collected data through Telegraf processors before outputting to a time-series database. It also includes a demo of collecting interface counters from devices supporting gNMI and SNMP, and processing the data in Telegraf.
Addressing performance issues in titan+cassandra - Nakul Jeirath
Slides from presentation at Graph Day Texas discussing some of the problems we faced and what we did to fix them to keep our customer facing response times low and our data ingestion pipeline humming.
Brief introduction on Hadoop, Dremel, Pig, FlumeJava and Cassandra - Somnath Mazumdar
This document provides an overview of several big data technologies including MapReduce, Pig, Flume, Cascading, and Dremel. It describes what each technology is used for, how it works, and example applications. MapReduce is a programming model for processing large datasets in a distributed environment, while Pig, Flume, and Cascading build upon MapReduce to provide higher-level abstractions. Dremel is an interactive query system for nested and complex datasets that uses a column-oriented data storage format.
TiReX is a tiled regular expression matching architecture developed by researchers at Politecnico di Milano. It uses a customized instruction set architecture implemented on an FPGA to compile regular expressions into low-level instructions and execute them in parallel across multiple processor cores. Evaluation shows it can match regular expressions over 37 times faster than software and over 100 times faster than a desktop CPU. The multi-core design allows flexible matching of multiple regular expressions over data in parallel.
The increasing demand for computing power in fields such as biology, finance, and machine learning is pushing the adoption of reconfigurable hardware in order to keep up with the required performance level at a sustainable power consumption. Within this context, FPGA devices represent an interesting solution as they combine the benefits of power efficiency, performance, and flexibility. Nevertheless, the steep learning curve and the experience needed to develop efficient FPGA-based systems represent one of the main limiting factors to a broad utilization of such devices.
In this talk, we present CAOS, a framework which helps the application designer identify acceleration opportunities and guides them through the implementation of the final FPGA-based system. The CAOS platform targets the full stack of the application optimization process, from the identification of the kernel functions to accelerate, to the optimization of such kernels, to the generation of the runtime management and configuration files needed to program the FPGA.
This document proposes a highly parallel semi-dataflow FPGA architecture for accelerating large-scale N-body simulations. The key aspects of the proposed design are: 1) A hardware/software partitioning that accelerates the computationally intensive force calculation step on the FPGA; 2) An optimized data transfer approach to reduce memory traffic; 3) A semi-dataflow architecture providing high parallelism through 48 computation pipelines; and 4) A tiling approach to further improve performance and resource utilization. Experimental results show the design achieves up to 4400 million particle-pairs per second, outperforming CPU and GPU implementations in terms of performance and performance-per-watt.
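The force-calculation step that dominates such simulations is a simple pairwise loop; the Python sketch below shows the O(N^2) computation that the 48 pipelines parallelize (the gravitational form and softening constant are illustrative, not taken from the paper):

```python
# Pairwise gravitational acceleration with a softening term eps to
# avoid division by zero; the inner loop is the particle-pair stream
# that the FPGA pipelines process in parallel.
def accelerations(pos, mass, eps=1e-3):
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + eps * eps
            inv_r3 = r2 ** -1.5
            for k in range(3):
                acc[i][k] += mass[j] * dx[k] * inv_r3
    return acc

a = accelerations([(0, 0, 0), (1, 0, 0)], [1.0, 1.0])
print(a[0][0])  # positive: particle 0 is pulled toward particle 1
```

Tiling, as the paper describes, amounts to blocking these two loops so each block of particles is reused from on-chip memory.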
The document discusses using a regular expression matching architecture called ReCPU for network intrusion detection systems (NIDS). ReCPU can efficiently match regular expressions in hardware and is well-suited for the high-speed regular expression matching needs of NIDS. It describes the ReCPU architecture, which uses parallel comparators to match multiple characters simultaneously, and how its design can be adapted for NIDS computation.
Klessydra-T: Designing Configurable Vector Co-Processors for Multi-Threaded E... - RISC-V International
The document summarizes the Klessydra-T architecture for designing vector coprocessors for multi-threaded edge computing cores. It describes the interleaved multi-threading baseline and parameterized vector acceleration schemes using the Klessydra vector intrinsic functions. Performance results show up to 3x speedup over a baseline core for benchmarks like convolution, FFT, and matrix multiplication on FPGA implementations with different configurations of vector lanes, functional units, and scratchpad memories.
Synthesis & gate-level simulation is introduced. The key topics covered include basic concepts of logic synthesis using Design Compiler, including logic level optimization, mapping, boundary optimization, and static timing analysis. Simulation of the gate-level netlist generated after synthesis is also discussed. An example lab is outlined to synthesize a simple 8-bit microprocessor and simulate the gate-level netlist.
International Journal of Engineering Research and Development - IJERD Editor
- Electrical, Electronics and Computer Engineering
- Information Engineering and Technology
- Mechanical, Industrial and Manufacturing Engineering
- Automation and Mechatronics Engineering
- Material and Chemical Engineering
- Civil and Architecture Engineering
- Biotechnology and Bio Engineering
- Environmental Engineering
- Petroleum and Mining Engineering
- Marine and Agriculture Engineering
- Aerospace Engineering
A Cryptographic Hardware Revolution in Communication Systems using Verilog HDL - idescitation
The Advanced Encryption Standard (AES) is an advancement of the Federal Information Processing Standards (FIPS) initiated by NIST. AES specifies the Rijndael algorithm, a symmetric block cipher that processes fixed 128-bit data blocks using cipher keys of 128, 192, or 256 bits. The original Rijndael algorithm had the advantage of combining data block sizes of 128, 192, and 256 bits with any of these key lengths. AES can be programmed in pure hardware with Verilog HDL, including a multiplexer to make the ciphertext more secure. The results indicate that the hardware implementation proposed in this project reduces resource utilization and power consumption (113 mW) compared with other implementations, and using an FPGA improves reliability. This project presents the AES algorithm with regard to FPGA and Verilog HDL. The software used for simulation is ModelSim-Altera 6.3g_p1 (Quartus II 8.1). Synthesis and implementation of the code are carried out on Xilinx ISE 13.4, and the XC6VCX240T device is used for hardware evaluation.
This document summarizes a research paper that proposes implementing the Advanced Encryption Standard (AES) cryptographic algorithm using Verilog HDL for hardware implementation on FPGAs. The paper describes the AES algorithm, its encryption and decryption processes, and a hardware design for AES that was tested on a Xilinx FPGA. The results showed the hardware implementation utilized less resources and had lower power consumption compared to other AES FPGA designs.
The document summarizes benchmarking results for four magnetic fusion simulation codes: GTS, TGYRO, BOUT++, and VORPAL. It was performed on the Cray XE6 "Hopper" supercomputer at NERSC to evaluate performance, scalability, memory usage, and communication overhead at large scales. For GTS, weak scaling tests showed computation time remained constant while communication time increased slightly with up to 49,152 cores. Testing also examined the codes' sensitivity to reduced memory bandwidth by increasing core count per node. Overall results provide insight to improve fusion code design and inform exascale co-design efforts.
IRJET- A Review on Various Secured Data Encryption Models based on AES Standard - IRJET Journal
This document reviews various secure data encryption models based on the AES standard. It discusses AES encryption, which involves SubBytes, ShiftRows, MixColumns, and AddRoundKey steps over multiple rounds. Various FPGA-based AES encryption models are compared on throughput, area, maximum frequency, bit width, and number of pipeline stages; the models achieve different trade-offs between speed and area. AES is widely used for data security in applications such as e-commerce and banking due to its standardization and its ability to encrypt data securely.
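Two of the four AES round steps are simple enough to show exactly; the Python sketch below implements AddRoundKey (a byte-wise XOR with the round key) and ShiftRows (row i of the 4x4 state rotated left by i), omitting SubBytes and MixColumns for brevity:

```python
# AES state and round key are modeled as 4x4 lists of byte values.
def add_round_key(state, round_key):
    """AddRoundKey: XOR each state byte with the round-key byte."""
    return [[s ^ k for s, k in zip(srow, krow)]
            for srow, krow in zip(state, round_key)]

def shift_rows(state):
    """ShiftRows: rotate row i left by i positions."""
    return [row[i:] + row[:i] for i, row in enumerate(state)]

state = [[0x00, 0x01, 0x02, 0x03]] * 4
key   = [[0xFF] * 4] * 4
print(add_round_key(state, key)[0])  # → [255, 254, 253, 252]
print(shift_rows(state)[1])          # → [1, 2, 3, 0]
```

In the fully pipelined FPGA designs the review compares, each such step of each of the 10 rounds becomes its own pipeline stage.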
This document describes the implementation of the AES (Advanced Encryption Standard) algorithm using a fully pipelined design on an FPGA. It first provides background on the AES algorithm, including its key components and previous hardware implementations. It then details the proposed fully pipelined design, which implements each of AES's 10 rounds as separate pipeline stages to achieve high throughput. Key generation is also pipelined internally. Simulation results show the design achieves a throughput higher than previous reported implementations.
Novel Adaptive Hold Logic Circuit for the Multiplier using Add Round Key and ... - IJMTST Journal
Digital multipliers are among the most critical arithmetic functional units in many applications, such as the Fourier transform, discrete cosine transforms, and digital filtering. The throughput of these applications depends on the multipliers: if the multipliers are too slow, the performance of the entire circuit is reduced. The negative bias temperature instability effect occurs when a PMOS transistor is under negative bias (Vgs = −Vdd), increasing the threshold voltage of the PMOS transistor and reducing the multiplier speed. Similarly, positive bias temperature instability occurs when an NMOS transistor is under positive bias. Both effects degrade the speed of the transistors, and in the long term the system may fail due to timing violations. It is therefore necessary to design reliable high-performance multipliers. In this paper, we implement an aging-aware multiplier design with a novel adaptive hold logic (AHL) circuit. The multiplier provides higher throughput through variable latency and can adjust the AHL circuit to lessen the performance degradation due to the aging effect. The proposed design can be applied to the column-bypass multiplier.
An OpenCL Method of Parallel Sorting Algorithms for GPU Architecture - Waqas Tariq
In this paper, we present a comparative performance analysis of different parallel sorting algorithms: Bitonic sort and Parallel Radix Sort. In order to study the interaction between the algorithms and architecture, we implemented both the algorithms in OpenCL and compared its performance with Quick Sort algorithm, the fastest algorithm. In our simulation, we have used Intel Core2Duo CPU 2.67GHz and NVidia Quadro FX 3800 as graphical processing unit.
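Bitonic sort maps well to GPUs because every compare-exchange within a stage is independent and can be assigned to its own work-item; the Python sketch below runs the same sorting network serially (array length must be a power of two):

```python
# Iterative bitonic sorting network: outer loop grows the bitonic
# sequence size k, inner loop halves the compare-exchange distance j.
def bitonic_sort(a):
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:
        j = k // 2
        while j > 0:
            for i in range(n):           # each i is independent: one
                partner = i ^ j          # GPU work-item per element
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 3, 1, 8, 2, 6, 5, 4]))  # → [1, 2, 3, 4, 5, 6, 7, 8]
```

The fixed, data-independent comparison pattern is what makes the network friendly to SIMD hardware, unlike quicksort's data-dependent branching.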
This paper presents 16 software implementations of the Advanced Encryption Standard (AES) cipher mapped to a fine-grained many-core processor array. The implementations explore different levels of data and task parallelism. The smallest design uses 6 cores for offline key expansion and 8 cores for online expansion, while the largest uses 107 and 137 cores respectively. Compared to other software platforms, the designs achieve 3.5-15.6 times higher throughput per chip area and 8.2-18.1 times higher energy efficiency.
Area efficient parallel LFSR for cyclic redundancy check - IJECEIAES
Cyclic Redundancy Check (CRC) codes for error detection find many applications in digital communication, data storage, control systems, and data compression. CRC encoding is carried out using a Linear Feedback Shift Register (LFSR). A serial implementation of CRC requires a number of clock cycles equal to the message length plus the degree of the generator polynomial, whereas a parallel implementation requires a single clock cycle if the whole message is applied at once. In previous work on parallel LFSRs, the hardware complexity of the architecture was reduced using a technique called state space transformation. This paper presents a searching algorithm and a new technique to find the number of XOR gates required for different CRC algorithms. A comparison between the proposed and previous architectures shows that the number of XOR gates is reduced, improving hardware efficiency. The searching algorithm and all matrix computations have been performed using MATLAB simulations.
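The serial baseline described above, one message bit shifted through the LFSR per clock cycle, can be sketched in a few lines of Python; CRC-8 with polynomial 0x07 is used here as an example, and the paper's XOR-gate search algorithm is not reproduced:

```python
# Bit-serial CRC: each inner-loop pass models one LFSR clock cycle,
# which is why the serial form needs (message bits + degree) cycles.
def crc_serial(data, poly=0x07, width=8):
    reg = 0
    for byte in data:
        for bit in range(7, -1, -1):            # MSB first
            msb = (reg >> (width - 1)) & 1      # LFSR feedback tap
            inp = (byte >> bit) & 1
            reg = (reg << 1) & ((1 << width) - 1)
            if msb ^ inp:
                reg ^= poly                     # XOR network of the LFSR
    return reg

print(hex(crc_serial(b"123456789")))  # CRC-8 of the standard check string
```

A parallel LFSR unrolls this inner loop algebraically so that a whole input word updates the register in one cycle, at the cost of a wider XOR network, whose gate count is what the paper minimizes.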
IBM recently announced the POWER10 processor. POWER10 brings a rich set of architectural capabilities to the processor core. Features like prefix instruction support and Matrix Multiply Assist (MMA), which were introduced in the Open POWER ISA V3.01, are implemented in the POWER10 processor. MMA is an on-chip AI acceleration capability which accelerates matrix multiplication. This talk covers these two key concepts introduced in POWER ISA V3.01: 1) prefix instructions, and how they can help extend the POWER ISA in the next generation; 2) the Matrix Multiply Assist architecture and its implementation in POWER10.
Parallel-prefix adders offer a highly efficient solution to the binary addition problem and are well-suited for VLSI implementations. In this paper, a novel framework is introduced which allows the design of parallel-prefix Ling adders. The proposed approach saves one logic level of implementation compared to the parallel-prefix structures proposed for the traditional definition of the carry look-ahead equations, and reduces the fan-out requirements of the design. Experimental results reveal that the proposed adders achieve delay reductions of up to 14 percent when compared to the fastest parallel-prefix architectures presented for the traditional definition of the carry equations.
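The prefix computation at the heart of such adders can be illustrated in Python: generate/propagate pairs are combined with the associative prefix operator over log2(n) combining levels, Kogge-Stone style. This shows ordinary parallel-prefix addition, not the Ling-adder variant the paper proposes:

```python
# Parallel-prefix addition: compute per-bit generate/propagate, run a
# Kogge-Stone prefix over them (each while-pass is one hardware level),
# then form the sum bits from the resulting carries.
def prefix_add(a, b, width=8):
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]    # generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # propagate
    d = 1
    while d < width:                 # log2(width) combining levels
        ng, np_ = g[:], p[:]
        for i in range(d, width):    # (g,p)[i] = (g,p)[i] o (g,p)[i-d]
            ng[i] = g[i] | (p[i] & g[i - d])
            np_[i] = p[i] & p[i - d]
        g, p = ng, np_
        d *= 2
    carry = [0] + g[:width - 1]      # carry into bit i (carry-in = 0)
    s = 0
    for i in range(width):
        s |= ((((a >> i) ^ (b >> i)) & 1) ^ carry[i]) << i
    return s | (g[width - 1] << width)   # append carry-out

print(prefix_add(200, 100))  # → 300
```

Saving one logic level, as the Ling formulation does, means one fewer such combining pass on the critical path.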
Compare Performance-power of Arm Cortex vs RISC-V for AI applications_oct_2021 - Deepak Shankar
The document discusses comparing the performance and power of ARM Cortex and RISC-V processors for AI applications. It outlines a methodology for modeling systems from the microarchitecture to SoC level using different instruction sets. Examples are provided to demonstrate how the methodology can be used to improve the accuracy of comparisons between architectures.
Similar to TiReX: Tiled Regular eXpression matching architecture
Marco D. Santambrogio, head of the #NECSTLab, in this talk gives guidance on how to start taking part in our research activities and on the opportunities for students interested in the #NECSTCamp project.
This document appears to be a presentation for the NECST Summer Workshop 2017. It discusses the NECST laboratory's spirit of collaboration and innovation. It provides information on getting involved with NECST through programs for 1st and 2nd year students, including internships and research opportunities. It also lists the research areas of NECST, including reconfigurable computing, computer architecture, and smart technologies. Finally, it includes names of NECST people and links to papers and resources.
The document announces the NECST Summer Workshop 2017. It provides information on research areas at NECSTLab including reconfigurable computing, computer architecture and operating systems, and smart technologies. It also lists people involved in NECSTLab and provides links to access abstracts and papers. The workshop will discuss how involvement in NECSTLab's research can impact students in their first year.
- Silvia Brembati, Product Designer
- Benedetta Bolis, Engineering Physics Student
Due to the recent COVID-19 outbreak, everybody had to quickly rearrange their lifestyle and learn how to get through isolation.
Keeping in touch has never been more compelling and challenging at the same time.
A recent survey conducted in Italy states that 80% of the population felt they needed psychological support to get through quarantine. We believe that if people had a way to feel surrounded by their friends and were able to share activities, this number would be significantly lower. This is where our new app TreeHouse comes in handy, as it guides the user in contributing to the life of the community: a virtual tree comes to life and thrives thanks to both real-life and online interactions. Sharing content, chatting with friends, or drinking a cup of tea together will make a leaf or a branch grow, but if the user is missing for too long, the tree will suffer from their absence, in complete symbiosis.
Meanwhile, checking how the tree develops helps the members feel the actual presence of the community and makes them able to support each other, letting the tree flourish again.
- Filippo Carloni, M.Sc. student in Computer Science and Engineering
Regular Expressions (REs) are widely used to find patterns in data, as in genomic marker research for DNA analysis, signature-based detection for network intrusion detection systems, or search engines. TiReX is a novel and efficient RE matching architecture for FPGAs, based on the concept of a matching core. An RE passes through a compilation and optimization phase to be efficiently translated into a sequence of basic matching instructions that a matching core runs on input data; the sequence can be replaced to change the RE being matched.
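The matching-core idea, an RE compiled into instructions that a small core executes over the input, can be illustrated with a toy interpreter; the three-instruction set below (CHAR, ANY, MATCH) is purely hypothetical and far simpler than TiReX's actual ISA:

```python
# A toy "matching core": runs a compiled instruction sequence against
# the input text starting at a given position.
def run_core(program, text, start):
    pos = start
    for op, arg in program:
        if op == "MATCH":               # end of program: RE matched
            return True
        if pos >= len(text):
            return False
        if op == "CHAR" and text[pos] != arg:
            return False
        pos += 1                        # CHAR and ANY both consume a char
    return True

# "Compiled" program for the RE "a.c": literal 'a', any char, literal 'c'.
program = [("CHAR", "a"), ("ANY", None), ("CHAR", "c"), ("MATCH", None)]
hits = [i for i in range(len("xabcax")) if run_core(program, "xabcax", i)]
print(hits)  # → [1]
```

Swapping in a different program changes the matched RE without touching the core, which is the reconfigurability the architecture exploits.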
A professor reviewed drugs being tested against COVID-19 that were repurposed from other uses. Researchers generated a knowledge graph embedding from 183k triples connecting proteins, genes, drugs and diseases. Their preliminary link prediction model achieved 50% accuracy at ranking potential interactions, higher than random chance, and integrated machine learning with biological insights on drug development.
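As a concrete illustration of link prediction over such an embedding, the TransE-style sketch below ranks candidate tails by how close head + relation lands to each tail vector; the tiny hand-made vectors and entity names are invented, not from the 183k-triple model:

```python
# TransE-style triple scoring: a triple (h, r, t) scores well when the
# head embedding plus the relation embedding is close to the tail.
def score(h, r, t):
    """Negative Euclidean distance of h + r from t (higher is better)."""
    return -sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)) ** 0.5

emb = {"aspirin": [1.0, 0.0], "cox1": [1.0, 1.0], "insulin": [5.0, 5.0]}
treats = [0.0, 1.0]  # illustrative relation vector

# Rank candidate tails for (aspirin, treats, ?).
ranked = sorted(["cox1", "insulin"],
                key=lambda e: score(emb["aspirin"], treats, emb[e]),
                reverse=True)
print(ranked)  # → ['cox1', 'insulin']
```

Ranking all candidate tails this way, and checking where the known true tail lands, is the standard way such link-prediction accuracy figures are measured.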
- Daniele Valentino de Vincenti, B.Sc. graduate in Biomedical Engineering @Politecnico di Milano
- Lorenzo Farinelli, B.Sc. graduate in Computer Science and Engineering @Politecnico di Milano
Plaster is a multi-layered infrastructure (based on C++) aimed at supporting the development of multi-FPGA systems and the management of large data flows between the nodes. In particular, the goal of the project is to provide the end user with a set of tools (by means of a Python library and a C++ service) to easily assign bitstreams to nodes and route data between them, in the context of a PYNQ-based cluster suitable for distributed acceleration of computation-intensive tasks. Using this platform, an abandoned-object detection tool is implemented, designed as a multi-FPGA distributed system exploiting a hardware-accelerated version of the YOLO neural network for image detection.
- Jessica Leoni, PhD student in Data Analysis and Decision Science @Politecnico di Milano
- Luca Stornaiuolo, PhD student in Computer Science @Politecnico di Milano
- Irene Canavesi, B.Sc. student in Biomedical Engineering
- Sara Caramaschi, B.Sc. student in Biomedical Engineering
Lung cancer is one of the most frequently diagnosed cancer forms, with a mortality of 84.2% in 2018. Our project focuses on shortening diagnosis time and improving accuracy in the overall detection of this disease. We implemented a convolutional neural network capable of automatically identifying lungs on a CT image. Segmentation is a necessary first step for the development of an algorithm capable of identifying and classifying the tumor mass since errors in the ROI identification can lead to errors in the tumor mass recognition. The network architecture follows the structure of a preexisting network, the U-Net that performs well on medical images. We reached a very good test accuracy of 99.63%: the strength of our work lies in the large number of CT images of both healthy and sick patients, used for the training and validation of the network.
BlastFunction is a serverless platform that brings FPGA acceleration capabilities for specific functions through heterogeneous computing. It enables resource sharing across multiple users to maximize FPGA utilization and minimize costs for cloud providers. BlastFunction manages functions and machines, redistributing workloads across nodes equipped with FPGAs. Initial results show BlastFunction improved FPGA utilization and increased performance per watt for benchmark applications compared to native CPU execution. Future work includes porting BlastFunction to AWS and automating cluster management.
- Sofia Breschi, B.Sc. student in Biomedical Engineering
- Beatrice Branchini, B.Sc. student in Biomedical Engineering
In the last few years, the use of Next Generation Sequencing technology in medicine has become more and more common, in particular for the diagnosis of genetic diseases and the production of personalized drugs. In this context, the identification of characteristic patterns in the human genome plays an important role, and exact pattern matching algorithms are an efficient way to identify those sequences. However, this process represents a bottleneck in the genomic field, as it is very computationally intensive and time-consuming; moreover, general-purpose architectures are not optimized to handle the huge amount of data and operations used in a genomics context. For these reasons, we propose an implementation of the Knuth-Morris-Pratt (KMP) algorithm on FPGA, a family of integrated circuits that can be reconfigured an arbitrary number of times. The KMP algorithm is very fast and efficient, as it avoids unnecessary comparisons of characters that have already been matched. Furthermore, the FPGA implementation yields an even faster and more efficient solution, speeding up the overall alignment process and providing the patient with a quicker response.
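The core of KMP is the failure table, which records how far the matcher can fall back without re-comparing characters already known to match; here is a minimal Python version of the algorithm the project maps to FPGA:

```python
# Knuth-Morris-Pratt: linear-time exact pattern matching.
def kmp_search(text, pattern):
    # Failure table: longest proper prefix of pattern[:i+1] that is
    # also a suffix of it.
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    # Scan: the text pointer never moves backwards.
    hits, k = [], 0
    for i, c in enumerate(text):
        while k > 0 and c != pattern[k]:
            k = fail[k - 1]
        if c == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits

print(kmp_search("GATTACAGATTACA", "GATTACA"))  # → [0, 7]
```

Because the text is consumed strictly left to right, one character per step, the scan loop maps naturally onto a streaming hardware pipeline.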
In the global energy equation, the IT industry is not yet a major contributor to global warming, but it is increasingly significant. From an engineering standpoint we can achieve huge energy savings by replacing electronic signal processing with optical techniques for routing and switching, whilst longer fibre spans in the local loop offer further reductions. The mobile industry, on the other hand, has engineered 5G systems demanding ~10kW/tower due to signal processing and beam steering technologies. This sees some countries (e.g. China) closing cell sites at night to save money. So, what of 6G? The assumption that all surfaces can be smart signal regenerators with beam steering looks to be a step too far, and it may be time for a rethink!
On the extreme end of the scale we have AWS planning to colocate their latest AI data centre (at 1GW power consumption) alongside two nuclear reactors because it needs 40% of their joint output. Google and Microsoft are following the AWS approach and are reportedly in negotiation with nuclear plant owners. Needless to say, AI training sessions and usage have risen to dominate the top of the IT demand curve. At this time, there appears to be no limit to the projected energy demands of AI, but there is a further contender in this technology race, and that is the IoT. In order to satisfy the ecological demands of Industry 4.0/Society 5.0 we need to instrument and tag ‘Things’ by the Trillion, and not ~100 Billion as previously thought!
Now let’s see, Trillions of devices connected to the internet with 5G, 4G, WiFi, BlueTooth, LoRaWan et al using >100mW demands more power plants…
Good Energy Haus: PHN Presents Building Electrification, A Passive House Symp...TE Studio
Tim Eian's contribution to the Passive House Network's Building Electrification Symposium on July 25, 2024.
Topics covered:
- Our Motivation to Electrify
- The Context of the Project
- The Process of Electrification
- Considerations for Electrification
- Data
- Challenges of Electrification
- Successes
- Opportunities
Computer Vision and GenAI for Geoscientists.pptxYohanes Nuwara
Presentation in a webinar hosted by Petroleum Engineers Association (PEA) in 28 July 2023. The topic of the webinar is computer vision for petroleum geoscience.
This unit explains cartesian coordinate system. This unit also explains different types of coordinate systems like one dimensional, two dimensional and three dimensional system
Manufacturing is the process of converting raw materials into finished goods through various production methods. Historically, manufacturing occurred on a small scale through apprenticeships or putting-out systems, but the Industrial Revolution led to large-scale manufacturing using machines powered by steam engines
- Earlier Induction motors were used in applications requiring a
constant speed because variable speed applications have been
dominated by DC drives
- Conventional methods for speed control of Induction motors were
either expensive or highly inefficient
- Later the availability of thyristors, power transistors, IGBT and GTO
have allowed the development of variable speed induction motor
- Later the availability of thyristors, power transistors, IGBT and GTO
have allowed the development of variable speed induction motor
drives
- DC motors require frequent maintenance due to the presence of
commutators & brushes. Also they cannot be used in explosive & dirty
environment
- On the other hand, induction motors particularly squirrel cage are
rugged, cheaper, lighter, smaller, more efficient, requires less
maintenance and can be operated in dirty & explosive environment .
maintenance and can be operated in dirty & explosive environment .
- Due to these advantages, Three-phase induction motors are the most
common machines in industry now & more than 90% of mechanical
power used in industry is supplied by 3 phase induction motors.
- Variable speed induction motor drives are expensive than DC drives
- Application
include
fans,
blowers,
cranes,
conveyors,
traction,
underground & under water installations etc
Predicting damage in notched functionally graded materials plates thr...Barhm Mohamad
Presently, Functionally Graded Materials (FGMs) are extensively utilised in several industrial sectors, and the modelling of their mechanical behaviour is consistently advancing. Most studies investigate the impact of layers on the mechanical characteristics, resulting in a discontinuity in the material. In the present study, the extended Finite Element Method (XFEM) technique is used to analyse the damage in a Metal/Ceramic plate (FGM-Al/SiC) with a circular central notch. The plate is subjected to a uniaxial tensile force. The maximum stress criterion was employed for fracture initiation and the energy criterion for its propagation and evolution. The FGM (Al/SiC) structure is graded based on its thickness using a modified power law. The plastic characteristics of the structure were estimated using the Tamura-Tomota-Ozawa (TTO) model in a user-defined field variables (USDFLD) subroutine. Validation of the numerical model in the form of a stress-strain curve with the findings of the experimental tests was established following a mesh sensitivity investigation and demonstrated good convergence. The influence of the notch dimensions and gradation exponent on the structural response and damage development was also explored. Additionally, force-displacement curves were employed to display the data, highlighting the fracture propagation pattern within the FGM structure.
### A Brief History of Artificial Intelligence
Artificial Intelligence (AI) stands as one of the most transformative technologies of the modern era, promising to reshape industries, societies, and even the nature of work itself. Its evolution spans decades of research, innovation, and breakthroughs that have captured the imagination of scientists, entrepreneurs, and the general public alike. This comprehensive exploration delves into the key milestones, developments, and ethical implications that have shaped the history of AI.
#### Early Beginnings: The Birth of Artificial Intelligence
The roots of AI can be traced back to the mid-20th century, with foundational contributions from pioneers such as Alan Turing and John McCarthy. Turing's concept of a universal machine capable of computing any problem laid the groundwork for the theoretical underpinnings of AI. McCarthy, along with Marvin Minsky, Nathaniel Rochester, and Claude Shannon, organized the Dartmouth Conference in 1956, which is often regarded as the birth of AI as an academic field.
During the 1950s and 1960s, the focus was on symbolic AI, also known as "good old-fashioned AI" (GOFAI). Researchers aimed to develop intelligent systems that could reason and solve problems using symbolic logic and algorithms. Early successes included programs like the Logic Theorist and the General Problem Solver, which demonstrated AI's potential for logical reasoning and problem-solving tasks.
#### The AI Winter and the Rise of Expert Systems
Despite initial enthusiasm, the field encountered significant challenges in the 1970s and 1980s, leading to what became known as the "AI winter." Funding and interest in AI research waned as early expectations failed to materialize, and practical applications remained elusive.
During this period, a new approach emerged with the development of expert systems. These systems aimed to capture human expertise in specific domains through rules and knowledge bases. Expert systems like MYCIN, used for diagnosing infectious blood diseases, showcased AI's potential in specialized tasks and revived interest in the field.
#### Neural Networks and Machine Learning: Revitalizing AI
The late 20th century witnessed a resurgence of interest in AI, driven by advances in neural networks and machine learning. Neural networks, inspired by the human brain's structure and function, proved effective in pattern recognition tasks such as handwriting recognition and speech understanding.
Key milestones during this period include the development of backpropagation algorithms for training neural networks and the emergence of deep learning techniques capable of handling increasingly complex data. The success of deep learning in areas like image and speech recognition, bolstered by large datasets and powerful computing hardware, propelled AI into the mainstream.
#### AI in the 21st Century: Applications and Challenges
The 21st century has seen AI integrated into diverse applications ac
Artificial Intelligence Imaging - medical imagingNeeluPari
10 stages of Artificial Intelligence,
Artificial intelligence (AI) has made significant advancements in the field of medical imaging, offering valuable tools and capabilities to improve diagnostics, treatment planning, and patient care. Here are several ways AI is used in medical imaging
Numerical comaprison of various order explicit runge kutta methods with matla...DrAzizulHasan1
Numerical analysis is the area of mathematics and computer science that creates, analyzes andimplements numerical methods for solving numerically the problems of continuous mathematics. Such problems originates from real-world applications of algebra, geometry and calculus and they involve variables that vary continuously, such problems occur throughout the natural sciences, social science, engineering, medicine.
3. Current issues
• The trade-off between performance and flexibility
• Current approaches lack flexibility
– FPGA-based approaches require embedding the regex into
the architecture (= re-synthesis for every new regex)
– ASIC technology offers no flexibility at all
4. Our solution and claims
Based on previous work [1] proposing Regular Expressions
as a high-level language driving a custom processor
The improvements with respect to ReCPU are:
• A better preprocessing mechanism for the RegExp and a
renewed single-core design
• A scalable multi-core architecture for parallelized
computations, reaching a 100x speedup over Flex
• A cross-platform design, easily integrable with
heterogeneous architectures
[1] M. Paolieri et al., "ReCPU: A parallel and pipelined architecture for regular expression matching," in VLSI-SoC, Springer, 2009
5. Outline
• Related work
• TiReX design and implementation
• Evaluation
• Conclusions and future work
6. Related Work (1)
Most works use DFAs (Deterministic Finite Automata) and address DFA
limitations, offering high matching speed at the cost of a fixed structure
Memory usage grows along with RegExp complexity:
• [1], [2] cluster states and group transitions
Others focus on achieving an efficient lookup process:
• Hash-based encoding schemes are another way to solve the problem [3]
• Bitmap index structures [4]
[1] L. Jiang et al., "A fast regular expression matching engine for NIDS applying prediction scheme," in Computers and Communication (ISCC), 2014
[2] J. van Lunteren and A. Guanella, "Hardware-accelerated regular expression matching at multiple tens of Gb/s," in INFOCOM, 2012
[3] K. Agarwal and R. Polig, "A high-speed and large-scale dictionary matching engine for information extraction systems," in Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on, IEEE, 2013
[4] X.-T. Nguyen, H.-T. Nguyen, K. Inoue, O. Shimojo, and C.-K. Pham, "Highly parallel bitmap-based regular expression matching for text analytics," in Circuits and Systems (ISCAS), 2017
7. Related Work (2)
A DFA encodes a single RegExp and matches one character at a time, so it is
intrinsically sequential:
• [5] Ternary Content Addressable Memories (TCAMs)
• [6], [7] precomputation of transitions
Some works leverage hardware parallelism to match the input against multiple RegExps:
• [8] uses a GPU to activate a new DFA for every initial character
(single-character analysis in the basic version)
[5] C. R. Meiners et al., "Fast regular expression matching using small TCAMs for network intrusion detection and prevention systems," 2010
[6] J. Yang et al., "PiDFA: A practical multi-stride regular expression matching engine based on FPGA," ICC 2016
[7] K. Atasu et al., "Hardware-accelerated regular expression matching for high-throughput text analytics," in FPL 2013
[8] G. Vasiliadis, M. Polychronakis, S. Antonatos, E. P. Markatos, and S. Ioannidis, "Regular expression matching on graphics hardware for intrusion detection," in International Workshop on Recent Advances in Intrusion Detection, Springer, 2009
8. Our Approach
As in ReCPU, RegExps are translated into program instructions:
the RegExp is software-compiled into a sequence of TiReX
instructions, and the TiReX matching core runs these instructions
on the input data, based on a dedicated Instruction Set
Architecture (ISA)
9. Flow
[Flow diagram: a Regular Expression (RE) is fed to the Compiler, which emits the Instruction Set, e.g.:
1 & ACGT
2 JIM offset
3 (
4 |)* AC
5 & TT
The instructions then run over the input Data (a genomic stream such as ACGTCGGGGCGTGCAAATGCCCCGTGCGA…) to produce the Match results.]
10. TiReX ISA

Opcode    RegExp  Description
0 00 000  NOP     No Operation
1 00 000  (       Enter subroutine
0 10 000  AND     And of cluster matches
0 01 000  OR      Or of cluster matches
0 11 000  .       Match any character
0 00 001  )*      Match any number of sub-RE
0 00 010  )+      Match one or more sub-RE
0 00 011  )|      Match previous sub-RE or next one
0 00 100  )       End of subroutine
0 00 101  OKP     Open Kleene Parenthesis
0 00 111  JIM     Jump If Match

Each instruction carries a Reference field of 32 bits, holding at most 4 characters.
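The table's encoding can be made concrete with a small sketch: a 6-bit opcode (the three bit groups shown in the Opcode column) packed next to the 32-bit Reference field. The exact bit layout below is an illustrative assumption, not the documented TiReX instruction format.

```python
# Illustrative sketch of a TiReX-style instruction word: a 6-bit opcode
# (values taken from the ISA table) plus a 32-bit Reference holding at
# most 4 characters. The bit layout is assumed for illustration only.
OPCODES = {
    "NOP": 0b000000, "(": 0b100000, "AND": 0b010000, "OR": 0b001000,
    ".": 0b011000, ")*": 0b000001, ")+": 0b000010, ")|": 0b000011,
    ")": 0b000100, "OKP": 0b000101, "JIM": 0b000111,
}

def encode(mnemonic, reference=""):
    """Pack mnemonic + reference characters into a 38-bit word."""
    assert len(reference) <= 4, "Reference holds at most 4 characters"
    ref_bits = 0
    for ch in reference:
        ref_bits = (ref_bits << 8) | ord(ch)  # 8 bits per character
    return (OPCODES[mnemonic] << 32) | ref_bits

word = encode("AND", "ACGT")
print(f"{word:038b}")  # opcode in the top 6 bits, characters below
```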
11. TiReX ISA
All characters in the Reference must be equal to the input data to have a match.
RegExp: ACCGTGGA
Input 1: TGGA GACCTACACCG
Input 2: ACCA TGGACTAGAGG
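The cluster-match rule above can be sketched in a few lines: an exact pattern is split into AND instructions referencing at most 4 characters each, and every Reference character must equal the corresponding input character. The function names and the anchored-at-position-0 matching are illustrative assumptions, not the hardware implementation.

```python
def compile_exact(pattern):
    # Split an exact pattern into AND instructions, each referencing
    # at most 4 characters (the 32-bit Reference field).
    return [("AND", pattern[i:i + 4]) for i in range(0, len(pattern), 4)]

def run(program, data):
    # Execute the AND clusters sequentially: every Reference character
    # must equal the input character at the same position.
    pos = 0
    for _, ref in program:
        if data[pos:pos + len(ref)] != ref:
            return False  # one differing character fails the cluster
        pos += len(ref)
    return True

prog = compile_exact("ACCGTGGA")     # [('AND', 'ACCG'), ('AND', 'TGGA')]
print(run(prog, "ACCGTGGACTAGAGG"))  # True: both clusters match
print(run(prog, "ACCATGGACTAGAGG"))  # False: 'ACCA' != 'ACCG'
```

The hypothetical inputs mirror the slide's example, where a single differing character ('A' vs 'G') is enough to reject the cluster.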
12. TiReX ISA
A special instruction directs the jump backward in the program, like in a «for loop», to implement the Kleene operators.
RegExp: (ACGT)+
Input 1: ACGT ACGT GACC
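The backward-jump behaviour can be mimicked in software: after a successful cluster match, control jumps back to re-run the cluster, exactly like a for loop. This is a behavioural sketch under assumed semantics, not the hardware FSM.

```python
def match_plus(cluster, data):
    """Behavioural sketch of (cluster)+ : an AND compare followed by a
    backward jump while the cluster keeps matching (assumed semantics)."""
    pos, repeats = 0, 0
    while data[pos:pos + len(cluster)] == cluster:  # AND on the cluster
        repeats += 1
        pos += len(cluster)                         # jump back and retry
    return repeats >= 1, pos  # matched at least once, chars consumed

print(match_plus("ACGT", "ACGTACGTGACC"))  # (True, 8)
```

On the slide's input, the cluster matches twice and the third compare (against GACC) fails, ending the loop.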
13. TiReX ISA
A special instruction directs the jump forward in the program, like in an «if else» statement, to implement chained ORs.
RegExp: (TTTT)|(GCAT)|(CTGA)
Input 1: GCAT GACCTAC
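The forward-jump behaviour for chained ORs can be sketched the same way: when one alternative fails, control skips forward to the next one, and a success skips past the rest of the chain. Again, the semantics below are assumed for illustration.

```python
def match_alt(alternatives, data):
    """Behavioural sketch of (A)|(B)|(C): a failed cluster triggers a
    forward jump to the next alternative; a success skips the remaining
    branches (assumed semantics, illustration only)."""
    for alt in alternatives:
        if data[:len(alt)] == alt:  # AND on this alternative
            return True             # forward jump past remaining branches
    return False                    # every branch failed

print(match_alt(["TTTT", "GCAT", "CTGA"], "GCATGACCTAC"))  # True
```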
14. Single Core Architecture: Overview
[Block diagram: the Fetch & Decode stage reads instructions from the Instruction Memory and passes the Opcode and Reference to the Execution stage, which reads input from the Data Buffer; the Control Path coordinates both stages and collects the Match signal.]
18. Single Core Architecture: Details
Data Buffer:
• Addressable buffer
• Intermediate registers to:
– Back up
– Hold data
– Shift by 1-4 characters
19. Single Core Architecture: Details
Control Path:
• Status register of the computation
• Stack for nested parentheses
• Completely redesigned FSM
20. Multi core
Since the recognition process is highly parallelizable, we adopt a
multi-core architecture.
[Diagram: n TiReX cores (core1 … coren), each fed by its own BRAM, match different RegExps (e.g. AGCT(A|C)*TT, AGCT, AG*(TTAC), GTTTG(AC)*) against the same Data stream.]
21. Multi core
[Diagram: the same RegExp AGCT(A|C)*TT is matched by n TiReX cores, each with its own BRAM, over different data chunks (Data1, Data2, …, Datan-1, Datan).]
22. Multi core: Boundary conditions
Customizable conditions avoid losing matches at chunk boundaries.
[Diagram: the Data stream split into Chunk 0-3; a match of length N spanning two adjacent chunks must still be fully seen by one core.]
23. Experimental setup and results
Evaluation environment:
• VC707 evaluation platform, powered by a Virtex-7 FPGA
• Digilent PYNQ-Z1 board, powered by a Zynq SoC
comprising an ARM CPU and a Xilinx FPGA
We compare against:
• A Flex program compiled with -O3 optimizations, running
on an Intel i7 with a peak frequency of 2.8GHz
26. Comparisons with Related works

Solution        Clock Frequency [MHz]  Bitrate [Gb/s]  Flexibility
VC707 16-core   130                    16.64 – 66.54
PYNQ 8-core     70                     4.48 – 17.92
[1] ASIC        318.47                 10.19 – 18.18
[2] FPGA        150                    230 – 430
[3] FPGA        100                    3.2
[3] ASIC        1000                   256

[1] M. Paolieri et al., "ReCPU: A parallel and pipelined architecture for regular expression matching," in VLSI-SoC: Advanced Topics on Systems on a Chip, Springer, 2009
[2] L. Jiang et al., "A fast regular expression matching engine for NIDS applying prediction scheme," in Computers and Communication (ISCC), 2014 IEEE Symposium on
[3] V. Gogte et al., "HARE: Hardware accelerator for regular expressions," in Microarchitecture (MICRO), 2016 49th Annual IEEE/ACM International Symposium on
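The reported TiReX bitrate ranges are consistent with a back-of-envelope calculation: cores × clock frequency × 1-4 characters per cycle × 8 bits per character, where the 1-4 range reflects the up-to-4-character Reference field. Treat this as a sanity check, not the paper's exact throughput model.

```python
def bitrate_gbps(freq_mhz, cores, chars_per_cycle):
    # Aggregate bitrate: each core consumes chars_per_cycle characters
    # (8 bits each) per clock cycle.
    return freq_mhz * 1e6 * cores * chars_per_cycle * 8 / 1e9

# PYNQ 8-core at 70 MHz: matches the reported 4.48 – 17.92 Gb/s range
print(bitrate_gbps(70, 8, 1), bitrate_gbps(70, 8, 4))    # 4.48 17.92
# VC707 16-core at 130 MHz: 16.64 Gb/s lower bound, ~66.5 upper bound
print(bitrate_gbps(130, 16, 1), bitrate_gbps(130, 16, 4))
```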
30. Conclusions and future work
• We have presented a multi-core pattern matching
architecture implemented on an FPGA
• It outperforms the Flex solution, gaining a 100x speedup
with remarkable flexibility
• Future work
– Performance improvements
• Exploration of different memory hierarchies
• Multi-core interconnection studies
Thank you for your attention… Questions?
Alessandro Comodi, Davide Conficconi {alessandro.comodi, davide.conficconi}@mail.polimi.it
Alberto Scolari, Marco Santambrogio {alberto.scolari, marco.santambrogio}@polimi.it
NECST: www.necst.it
Slideshare NECST: www.slideshare.net/necstlab
RAW FB Group: facebook.com/groups/ReconfigurableArchitecturesWorkshop