Tuan is currently a PhD student at the National University of Singapore. His work is on Partially Reconfigurable Heterogeneous Multiprocessor Systems-on-Chip.
Supervisors: Akash Kumar
Phone: +491628684636
Address: Germany
In-memory column-store database systems are the state of the art for analytical workloads. In these column stores, a full column scan is a fundamental operation, so optimizing this primitive is crucial for performance. Advances in hardware are an interesting opportunity for this optimization, but they also pose a major challenge. Hardware systems are increasingly shifting from homogeneous CPU systems towards hybrid systems with different computing units. In this paper, we therefore focus on column scan acceleration for hybrid hardware systems that combine a Field Programmable Gate Array (FPGA) and a CPU in a single system. The advantage of such hybrid systems is that the FPGA usually has direct access to the main memory of the CPU, which avoids the data copy that is necessary in other hybrid systems such as CPU-GPU architectures. We present several FPGA designs for a recent column scan technique that fully offload the scan operation to the FPGA. In detail, we present our basic FPGA design and different optimization techniques. We then present selected results of our exhaustive evaluation showing the benefit of our FPGA acceleration: we achieve a maximum speedup of a factor of 7 compared to a single-threaded CPU scan execution.
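To make the scan primitive concrete, here is a minimal software sketch of a full column scan with a single predicate. It is not the FPGA design from the paper, and the abstract does not name the scan technique it accelerates; the function name `column_scan_less_than` and the less-than predicate are assumptions chosen for illustration.

```python
from typing import List


def column_scan_less_than(column: List[int], threshold: int) -> List[int]:
    """Scan every value of one column and return a result bitmap.

    Hypothetical helper for illustration: entry i of the bitmap is 1
    when column[i] < threshold, and 0 otherwise.
    """
    return [1 if value < threshold else 0 for value in column]


if __name__ == "__main__":
    # One column of an in-memory table, stored contiguously.
    prices = [5, 12, 7, 30, 2]
    print(column_scan_less_than(prices, 10))
```

Every value is touched exactly once and each comparison is independent of the others, which is why a scan of this shape lends itself to parallel hardware.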
Dynamic Partial Reconfiguration (DPR) enables resource sharing in FPGA-based systems. It can also be used to mitigate aging-related permanent faults by increasing the number of redundant Partially Reconfigurable Regions (PRRs). Normally, each of these PRRs is able to host any of the Partially Reconfigurable Modules (PRMs), or tasks, at any given time. This kind of system is called homogeneous. However, FPGA resource constraints limit the amount of homogeneous redundancy that can be used and hence affect the lifetime of the system. This issue can be addressed with a heterogeneous approach in which each PRR hosts only a subset of the tasks. Further, the deadlines of the applications must be taken into account in the design phase when deciding the mapping and scheduling of tasks to PRRs. To this end, we propose an application-specific multi-objective system-level design methodology to determine the appropriate number of PRRs and the mapping and scheduling of tasks to the PRRs. Specifically, we propose a lifetime-aware scheduling method that maximizes the system's mean time to failure (MTTF) under different tolerances in the makespan specification of an application. We use the scheduler along with an automated floorplanner for design space exploration at design time to generate a feasible heterogeneous PRR-based system. Our experiments show that heterogeneous systems can offer more than 2x lifetime improvement over homogeneous ones. They also scale better with increased tolerance in the makespan specification.
Dynamic Partial Reconfiguration (DPR) in reconfigurable platforms can be used for the mitigation of aging-related permanent faults. We propose an application-specific system-level design methodology for determining the appropriate number of Partially Reconfigurable Regions (PRRs) and their compatibility with Partially Reconfigurable Modules for maximizing the system lifetime. Specifically, we propose a lifetime-aware scheduler that maximizes system MTTF. We use the scheduler along with an automated floorplanner for design space exploration at design time to generate a heterogeneous PRR system. Our experiments show that heterogeneous systems can offer up to 2x lifetime improvement over homogeneous ones.
26th International Conference on Field-Programmable Logic and Applications, Sep 2016
Network-on-Chip (NoC) is known as a scalable, high-performance interconnect for Systems-on-Chip (SoCs) with multiple processing elements (PEs). Recently, the design paradigm of SoCs has shifted from static systems to dynamic, run-time reconfigurable systems in which the PEs can be loaded and unloaded on demand. The NoC should therefore adapt to these changes as quickly as possible to maintain system performance. In this work, we present XNoC, a non-intrusive, runtime-reconfigurable, time-division-multiplexed, circuit-switched NoC that offers the following benefits: (1) it switches between different routes within a predictable latency that is strictly determined by the length of the route and the number of time slots; (2) the configuration process can be masked effectively by overlapping it with communication; and (3) multicast is supported with aggregate feedback from sink nodes. We propose an XSwitch that requires 3.5x fewer resources than a conventional switch with similar features. The overall resource cost of XNoC is also smaller than that of most known NoCs, and the clock timing is up to 50% better. We also propose a novel distributed control plane to accelerate the reconfiguration process and to improve the scalability of the NoC. The reconfiguration speedup achieved over a centralized control unit is up to 7.6x under certain conditions. On average, it takes only 74 clock cycles to activate a 12-hop connection.
24th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
Partial reconfiguration (PR) is gaining more attention from the research community because of its flexibility in dynamically changing parts of the system at runtime. However, current PR tools require the designer to manually specify the shapes and locations of the PR regions (PRRs). This requires not only deep knowledge of the FPGA device and the system architecture, but also many trial-and-error attempts to find the best possible floorplan. Many research works have therefore proposed automatic floorplanners for PR systems. However, one of the most significant limitations of those works is that they consider only the PRRs and ignore all other static modules. In this paper, we propose a novel PR floorplanner called PRFloor that takes all components in the system into account. The main idea behind PRFloor is a unique recursive pseudo-bipartitioning heuristic using a new, simple, yet effective Nonlinear Integer Programming-based bipartitioner. PRFloor performs very well in experiments with various synthetic PR system setups with up to 130 modules, 24 PRRs and 85% of the FPGA resources. The average maximum clock frequency obtained for the actual PR systems implemented using PRFloor is even 3% higher than that of similar systems without PR capability.
25th International Conference on Field Programmable Logic and Applications (FPL), 2015
Partial reconfiguration is a technique used to increase the flexibility of an FPGA-based system by reprogramming parts of the system dynamically without interrupting the operation of the other modules. Despite the runtime benefits offered by partially reconfigurable (PR) systems, creating and storing partial bitstreams (PBs) become major concerns for system architects as the numbers of reconfigurable partitions (RPs) and PR modules (PRMs) increase. It takes a significant amount of time to generate the PBs for PR systems with a large number of RPs and PRMs. More importantly, when the mapping relationship between PRMs and RPs is many-to-many, several almost-identical PBs of one PRM must be stored separately, which leads to inefficient utilization of memory storage. Bitstream relocation is therefore drawing interest from the research community as a viable solution. Yet almost none of the existing works demonstrates a coherent method that not only creates relocatable PBs for complex and large PRMs in variable-size RPs but also does so automatically, freeing the designer from tedious and error-prone manual processes. In this paper, we propose a new technique to fill that gap. The method has been successfully developed for Xilinx Virtex 7 devices using the Vivado design tool flow.
In Proceedings of International Conference on Field Programmable Logic and Applications (FPL), 2014
FPGA-based heterogeneous Multiprocessor Systems-on-Chip (HMPSoCs) are becoming quite popular for high-performance embedded systems because of their powerful computational ability and their relatively flexible architecture, which can adapt to unexpected changes in system requirements. However, given the insatiable demand to support an extensive range of applications beyond the limited resources of an FPGA chip, together with shorter time-to-market, many research works on partially reconfigurable (PR) FPGA architectures have been conducted to meet these needs. Those works have yet to fully provide a versatile framework that exploits the flexibility of PR, such as hardware/software task migration and bitstream relocation; more importantly, the on-chip debug features for accessing all processors currently loaded in the system are compromised because of the lack of native support from vendor tools. In this paper, a novel PR-HMPSoC architecture for dynamic FPGA-based embedded systems is proposed to provide solutions for all of the above issues. The results from an experimental system consisting of one static Microblaze and three PR Microblaze/hardware accelerators connected by a Network-on-Chip show that the architecture is very promising, with just an 8% reduction in operating frequency.
FPGA '11 Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Feb 2011
Many pattern matching engines for Network Intrusion Detection Systems (NIDS) have been developed on FPGA-based platforms to accelerate the pattern matching process and keep up with the steadily increasing speed of current networks. However, those systems support only a small number of short patterns, which makes them unsuitable for large databases such as the Clam Antivirus patterns. In this paper, we propose the Bloom-Bloomier Filter Extension (BBFex) as a practical pattern matching engine that handles large pattern databases with patterns of various lengths. The basic idea in designing BBFex is the combination of a Bloom Filter and a Bloomier Filter to index patterns, together with an efficient pattern fragmenting method to split and merge long patterns. As a result, BBFex can recognize nearly 84,000 Clam Antivirus static patterns, whose lengths vary from 4 to 255 characters, with a rather low on-chip memory density of approximately 0.4 bits per character, while keeping the off-chip memory access rate 5x lower than that of a previous similar system and achieving a throughput of 1.36 Gbps. In addition, BBFex is not limited to the Clam Antivirus database, because its architecture is designed for general character-based databases. Moreover, as a hash-based system, BBFex does not require reconfiguration of the entire system when the database is updated.
International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), May 2010
In this paper, we propose a high-performance architecture based on the combination of a Bloom Filter and a Bloomier Filter (BBF) to enhance the speed of the pattern matching process on the Clam Antivirus (ClamAV) database. BBF maintains a small on-chip memory and a low number of false positives, and can indicate which patterns are the candidate matches. The implementation results on a low-cost Altera Cyclone II show that our architecture can handle a ClamAV pattern set of 43,491 characters with only 9.5 bits per character and achieve a throughput of 1 gigabit per second (Gbps). Compared with previous systems, our memory utilization is up to 73% better.
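For readers unfamiliar with the Bloom Filter half of BBF, the following is a minimal software sketch of a plain Bloom filter. It is not the paper's hardware architecture and it omits the Bloomier Filter entirely; the class name, the default sizes and the use of salted SHA-256 as the hash family are assumptions made for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Plain Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the sketch simple; hardware would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive num_hashes bit positions by salting SHA-256 with an index
        # (assumed hash family, chosen only for this sketch).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    patterns = BloomFilter()
    patterns.add(b"abcd")
    print(patterns.might_contain(b"abcd"))
```

A plain Bloom filter only answers "possibly present" or "definitely absent". Identifying which pattern is the candidate match, as the abstract describes, needs the additional index that the Bloomier Filter provides.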
Papers by Tuan D. A. Nguyen
(The attached file is a draft of the work)