Search Results (1,362)

Search Parameters:
Keywords = GPUs

30 pages, 1684 KiB  
Article
Efficient GPU Implementation of the McMurchie–Davidson Method for Shell-Based ERI Computations
by Haruto Fujii, Yasuaki Ito, Nobuya Yokogawa, Kanta Suzuki, Satoki Tsuji, Koji Nakano, Victor Parque and Akihiko Kasagi
Appl. Sci. 2025, 15(5), 2572; https://doi.org/10.3390/app15052572 - 27 Feb 2025
Viewed by 68
Abstract
Quantum chemistry offers the formal machinery to derive molecular and physical properties arising from (sub)atomic interactions. However, as molecules of practical interest are largely polyatomic, contemporary approximation schemes such as the Hartree–Fock scheme are computationally expensive due to the large number of electron repulsion integrals (ERIs). Central to the Hartree–Fock method is the efficient computation of ERIs over Gaussian functions (GTO-ERIs). Here, the well-known McMurchie–Davidson method (MD) offers an elegant formalism by incrementally extending Hermite Gaussian functions and auxiliary tabulated functions. Although the MD method offers a high degree of versatility to acceleration schemes through Graphics Processing Units (GPUs), the current GPU implementations limit the practical use of supported values of the azimuthal quantum number. In this paper, we propose a generalized framework capable of computing GTO-ERIs for arbitrary azimuthal quantum numbers, provided that the intermediate terms of the MD method can be stored. Our approach benefits from extending the MD recurrence relations through shells, batches, and triple-buffering of the shared memory, and ordering similar ERIs, thus enabling the effective parallelization and use of GPU resources. Furthermore, our approach proposes four GPU implementation schemes considering the suitable mappings between Gaussian basis and CUDA blocks and threads. Our computational experiments involving the GTO-ERI computations of molecules of interest on an NVIDIA A100 Tensor Core GPU (NVIDIA, Santa Clara, CA, USA) have revealed the merits of the proposed acceleration schemes in terms of computation time, including up to a 72× improvement over our previous GPU implementation and up to a 4500× speedup compared to a naive CPU implementation, highlighting the effectiveness of our method in accelerating ERI computations for both monatomic and polyatomic molecules. Our work has the potential to explore new parallelization schemes of distinct and complex computation paths involved in ERI computation. Full article
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
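Editor's note: the mapping of "similar ERIs" onto GPU work units is easier to picture with a small sketch. The snippet below is a minimal illustration, not the authors' code; the `Shell` type and `batch_shell_quartets` helper are hypothetical. It groups shell quartets by their angular-momentum signature so that each class could, in principle, be dispatched as one CUDA kernel launch over blocks and threads.

```python
# Hedged sketch: group shell quartets by angular-momentum class so that
# "similar ERIs" are ordered together, one class per (hypothetical) kernel launch.
from collections import defaultdict
from itertools import combinations_with_replacement
from typing import NamedTuple

class Shell(NamedTuple):
    center: int   # index of the atom the shell sits on
    l: int        # azimuthal quantum number (0 = s, 1 = p, 2 = d, ...)

def batch_shell_quartets(shells):
    """Order shell quartets into batches keyed by (la, lb, lc, ld)."""
    batches = defaultdict(list)
    pairs = list(combinations_with_replacement(range(len(shells)), 2))
    for a, b in pairs:                 # bra shell pairs
        for c, d in pairs:             # ket shell pairs
            key = (shells[a].l, shells[b].l, shells[c].l, shells[d].l)
            batches[key].append((a, b, c, d))
    return batches

# Toy basis: two s shells and one p shell.
shells = [Shell(0, 0), Shell(1, 0), Shell(2, 1)]
for key, quartets in sorted(batch_shell_quartets(shells).items()):
    # In the paper's setting each batch would map onto CUDA blocks/threads;
    # here we only report how many quartets share the same recurrence depth.
    print(key, len(quartets))
```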

28 pages, 7966 KiB  
Article
Real-Time Edge Computing vs. GPU-Accelerated Pipelines for Low-Cost Microscopy Applications
by Gloria Bueno, Lucia Sanchez-Vargas, Alberto Diaz-Maroto, Jesus Ruiz-Santaquiteria, Maria Blanco, Jesus Salido and Gabriel Cristobal
Electronics 2025, 14(5), 930; https://doi.org/10.3390/electronics14050930 - 26 Feb 2025
Viewed by 78
Abstract
Environmental microscopy is crucial for analyzing microorganisms, but traditional optical microscopes are often expensive, bulky, and impractical for field use. AI-driven image recognition, powered by deep learning models like YOLO, enhances microscopy analysis but typically requires high computational resources. To address these challenges, we present two cost-effective pipelines integrating AI with low-cost microscopes and edge computing. Both approaches use the OpenFlexure Microscope and Raspberry Pi devices. The first performs real-time inference with a Raspberry Pi 5 and Hailo-8L accelerator, while the second captures images with a Raspberry Pi 4, transferring them to a GPU-equipped desktop for processing. Using YOLOv8, we evaluate their ability to detect phytoplankton species, including cyanobacteria and diatoms. Results show that edge computing enables accurate, efficient, and low-power microscopy analysis, demonstrating its potential for real-time environmental monitoring in resource-limited settings. Full article
(This article belongs to the Special Issue Real-Time Computer Vision)
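Editor's note: the GPU-side half of such a pipeline can be sketched with the standard Ultralytics YOLOv8 API. This is an illustrative outline, not the authors' code; the checkpoint name and `frames/` directory are placeholders, and the Hailo-8L compilation step is only hinted at via an ONNX export.

```python
# Hedged sketch of the desktop-GPU pipeline: YOLOv8 inference on microscope frames.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # pretrained checkpoint; the paper fine-tunes on phytoplankton
results = model.predict(source="frames/", imgsz=640, conf=0.25, device=0)

for r in results:
    # Each result holds the detected boxes and class ids for one frame.
    print(r.path, len(r.boxes))

# For the edge pipeline, the trained model would be exported and then compiled
# for the Hailo-8L accelerator; a common first step is an ONNX export.
model.export(format="onnx")
```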

17 pages, 1774 KiB  
Article
Training a Minesweeper Agent Using a Convolutional Neural Network
by Wenbo Wang and Chengyou Lei
Appl. Sci. 2025, 15(5), 2490; https://doi.org/10.3390/app15052490 - 25 Feb 2025
Viewed by 252
Abstract
The Minesweeper game is modeled as a sequential decision-making task, for which a neural network architecture, state encoding, and reward function were herein designed. Both a Deep Q-Network (DQN) and supervised learning methods were successfully applied to optimize the training of the game. The experiments were conducted on the AutoDL platform using an NVIDIA RTX 3090 GPU for efficient computation. The results showed that in a 6 × 6 grid with four mines, the DQN model achieved an average win rate of 93.3% (standard deviation: 0.77%), while the supervised learning method achieved 91.2% (standard deviation: 0.9%), both outperforming human players and baseline algorithms and demonstrating high intelligence. The mechanisms of the two methods in the Minesweeper task were analyzed, with the reasons for the faster training speed and more stable performance of supervised learning explained from the perspectives of means–ends analysis and feedback control. Although there is room for improvement in sample efficiency and training stability in the DQN model, its greater generalization ability makes it highly promising for application in more complex decision-making tasks. Full article
(This article belongs to the Section Computing and Artificial Intelligence)
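Editor's note: one plausible way to encode the board and score actions with a CNN is sketched below. The channel layout, layer sizes, and greedy action selection are illustrative assumptions, not the paper's exact architecture or training loop.

```python
# Hedged sketch: state encoding and Q-network for a 6x6 Minesweeper board with 4 mines.
import torch
import torch.nn as nn

class MinesweeperQNet(nn.Module):
    def __init__(self, board=6, channels=10):
        super().__init__()
        # channels: one-hot planes for "hidden" plus revealed digits 0..8 (assumption).
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),            # one Q-value per cell
            nn.Flatten(),                   # -> (batch, board * board)
        )

    def forward(self, x):
        return self.net(x)

q_net = MinesweeperQNet()
state = torch.zeros(1, 10, 6, 6)            # all cells hidden
q_values = q_net(state)
action = q_values.argmax(dim=1)             # greedy: open the highest-Q cell
print(q_values.shape, action.item())
```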

12 pages, 2946 KiB  
Article
Optimizing Real-Time Object Detection in a Multi-Neural Processing Unit System
by Sehyeon Oh, Yongin Kwon and Jemin Lee
Sensors 2025, 25(5), 1376; https://doi.org/10.3390/s25051376 - 24 Feb 2025
Viewed by 248
Abstract
Real-time object detection demands high throughput and low latency, necessitating the use of hardware accelerators. An NPU is specialized hardware designed to accelerate deep learning computation, providing better energy efficiency and parallel processing performance than existing CPUs or GPUs. In particular, it plays an important role in reducing latency and improving processing speed in applications that require real-time processing. In this paper, we construct a real-time object detection system based on YOLOv3, utilizing Neubla's Antara NPU, and propose two approaches for performance optimization. First, we ensure the continuity of NPU inference by allowing the CPU to process data in advance through double buffering. Second, in a multi-NPU environment, we distribute tasks among NPUs through queue-based processing and analyze the performance limits using Amdahl's law. Experimental results demonstrate that, compared to a CPU-only environment, the NPU with single buffering improved throughput by a factor of 2.13, double buffering by a factor of 3.35, and the multi-NPU configuration by a factor of 4.81. Latency decreased by a factor of 1.6 with single and double buffering, and by a factor of 1.18 in the multi-NPU environment. Accuracy remained consistent, with 31.4 mAP on the CPU and 31.8 mAP on the NPU. Full article
(This article belongs to the Special Issue Advances in Security of Mobile and Wireless Communications)
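Editor's note: the two scheduling ideas, double buffering and queue-based multi-accelerator dispatch, can be shown in plain Python. This is a minimal sketch under stated assumptions: a bounded two-slot queue stands in for the double buffer, worker threads stand in for NPUs, and `npu_infer` is a placeholder for the real Antara runtime call.

```python
# Hedged sketch: double buffering (CPU preprocesses frame i+1 while the
# accelerator runs frame i) plus queue-based distribution across workers.
import queue
import threading
import time

frames = range(8)
buf = queue.Queue(maxsize=2)        # two slots = double buffering

def preprocess(frame):
    time.sleep(0.01)                # CPU-side resize/normalize (placeholder)
    return frame

def npu_infer(npu_id, frame):
    time.sleep(0.03)                # accelerator latency (placeholder)
    return f"npu{npu_id}: detections for frame {frame}"

def producer():
    for f in frames:
        buf.put(preprocess(f))      # blocks when both slots are full
    buf.put(None)                   # sentinel: no more work

def consumer(npu_id):
    while True:
        item = buf.get()
        if item is None:
            buf.put(None)           # let the other workers terminate too
            break
        print(npu_infer(npu_id, item))

threading.Thread(target=producer).start()
workers = [threading.Thread(target=consumer, args=(i,)) for i in range(2)]
for w in workers: w.start()
for w in workers: w.join()
```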

22 pages, 12001 KiB  
Article
A Study on Systematic Improvement of Transformer Models for Object Pose Estimation
by Jungwoo Lee and Jinho Suh
Sensors 2025, 25(4), 1227; https://doi.org/10.3390/s25041227 - 18 Feb 2025
Viewed by 229
Abstract
Transformer architecture, initially developed for natural language processing and time series analysis, has been successfully adapted to various generative models in several domains. Object pose estimation, which uses images to determine the 3D position and orientation of an object, is essential for tasks such as robotic manipulation. This study introduces a transformer-based deep learning model for object pose estimation in computer vision. A baseline model derived from an encoder-only transformer faces challenges with high GPU memory usage when handling multiple objects. To improve training efficiency and support multi-object inference, the proposed model reduces memory consumption by adjusting the transformer's attention layer and incorporates low-rank weight decomposition to decrease the number of parameters. Additionally, grouped-query attention (GQA) and RMS normalization enhance multi-object pose estimation performance, resulting in reduced memory usage and improved training accuracy. The improved implementation with an extended matrix dimension reduced GPU memory usage to only 2.5% of the baseline model, although it increased the number of weight parameters. To mitigate this, the number of weight parameters was reduced by 28% using low-rank weight decomposition in the linear layer of attention. In addition, a 17% improvement in rotation training accuracy over the baseline model was achieved by applying GQA and RMS normalization. Full article
(This article belongs to the Special Issue Transformer Applications in Target Tracking)
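Editor's note: two of the ingredients mentioned above, low-rank decomposition of a linear projection and RMS normalization, are easy to sketch. The dimensions and rank below are illustrative assumptions, not the paper's settings.

```python
# Hedged sketch: low-rank factorization of an attention projection plus RMSNorm.
import torch
import torch.nn as nn

d_model, rank = 512, 64

full = nn.Linear(d_model, d_model)                    # baseline projection
low_rank = nn.Sequential(                             # W ~ B @ A with rank 64
    nn.Linear(d_model, rank, bias=False),
    nn.Linear(rank, d_model, bias=True),
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(full), "->", n_params(low_rank))       # ~262k -> ~66k weights

class RMSNorm(nn.Module):
    """Root-mean-square normalization (no mean subtraction, unlike LayerNorm)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

x = torch.randn(2, 10, d_model)
print(RMSNorm(d_model)(low_rank(x)).shape)
```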

17 pages, 6692 KiB  
Article
A Lightweight Network Based on YOLOv8 for Improving Detection Performance and the Speed of Thermal Image Processing
by Huyen Trang Dinh and Eung-Tae Kim
Electronics 2025, 14(4), 783; https://doi.org/10.3390/electronics14040783 - 17 Feb 2025
Viewed by 437
Abstract
Deep learning and image processing technology continue to evolve, with YOLO models widely used for real-time object recognition. YOLO models combine fast processing with high precision, which has made them popular in fields such as autonomous driving, security cameras, and medical support. Most YOLO models, however, are optimized for RGB images, which creates limitations. RGB images are highly sensitive to lighting conditions, whereas infrared (IR) images based on thermal data can detect objects consistently even in low-light settings. Yet infrared images pose their own challenges, such as low resolution, small object sizes, and high noise levels, which make current YOLO models difficult to apply directly. This motivates object detection models designed specifically for thermal images, particularly for real-time recognition. Given the GPU and memory constraints of edge devices, a lightweight model that maintains high speed is essential. Our research focused on training a YOLOv8 model on infrared image data to recognize humans. We propose a YOLOv8s model with unnecessary layers removed, better suited to infrared images and significantly lighter than the original. We also integrate an improved Global Attention Mechanism (GAM) module to boost precision on IR images and apply depth-wise convolution filtering to maintain processing speed. The proposed model achieves a 2% precision improvement, a 75% parameter reduction, and a 12.8% processing speed increase compared to the original YOLOv8s model. The method is well suited to thermal imaging applications such as night surveillance cameras, cameras operating in bad weather, and smart ventilation systems, particularly in environments that require real-time processing with limited computational resources. Full article
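Editor's note: the depth-wise filtering idea used to keep the model light can be sketched as a depth-wise plus point-wise pair replacing a standard convolution. The channel counts and input size below are illustrative, not the paper's configuration.

```python
# Hedged sketch: depth-wise separable convolution vs. a standard convolution.
import torch
import torch.nn as nn

cin, cout = 64, 128

standard = nn.Conv2d(cin, cout, kernel_size=3, padding=1)

depthwise_separable = nn.Sequential(
    nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),  # one filter per channel
    nn.Conv2d(cin, cout, kernel_size=1),                        # point-wise channel mixing
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, cin, 160, 120)          # e.g. a low-resolution thermal feature map
print(standard(x).shape, depthwise_separable(x).shape)
print(n_params(standard), "vs", n_params(depthwise_separable))   # ~73.9k vs ~9.0k
```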

28 pages, 8850 KiB  
Article
Real-Time Runway Detection Using Dual-Modal Fusion of Visible and Infrared Data
by Lichun Yang, Jianghao Wu, Hongguang Li, Chunlei Liu and Shize Wei
Remote Sens. 2025, 17(4), 669; https://doi.org/10.3390/rs17040669 - 16 Feb 2025
Viewed by 240
Abstract
Advancements in aviation technology have made intelligent navigation systems essential for improving flight safety and efficiency, particularly in low-visibility conditions. Radar and GPS systems face limitations in bad weather, making visible–infrared sensor fusion a promising alternative. This study proposes a salient object detection (SOD) method that integrates visible and infrared sensors for robust airport runway detection in complex environments. We introduce a large-scale visible–infrared runway dataset (RDD5000) and develop a SOD algorithm capable of detecting salient targets from unaligned visible and infrared images. To enable real-time processing, we design a lightweight dual-modal fusion network (DCFNet) with an independent–shared encoder and a cross-layer attention mechanism to enhance feature extraction and fusion. Experimental results show that the MobileNetV2-based lightweight version achieves 155 FPS on a single GPU, significantly outperforming previous methods such as DCNet (4.878 FPS) and SACNet (27 FPS), making it suitable for real-time deployment on airborne systems. This work offers a novel and efficient solution for intelligent navigation in aviation. Full article
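Editor's note: the "independent–shared encoder" idea can be sketched with two modality-specific stems feeding shared deeper layers. This is a minimal structural sketch; the fusion here is a plain concatenation plus 1×1 convolution standing in for the paper's cross-layer attention mechanism, and all layer sizes are assumptions.

```python
# Hedged sketch: independent stems per modality, shared deeper encoder, simple fusion.
import torch
import torch.nn as nn

class IndependentSharedEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem_vis = nn.Sequential(nn.Conv2d(3, 16, 3, 2, 1), nn.ReLU())   # visible branch
        self.stem_ir  = nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.ReLU())   # infrared branch
        self.shared   = nn.Sequential(nn.Conv2d(16, 32, 3, 2, 1), nn.ReLU())  # weights shared
        self.fuse     = nn.Conv2d(64, 32, 1)      # stand-in for cross-layer attention

    def forward(self, vis, ir):
        f_vis = self.shared(self.stem_vis(vis))
        f_ir  = self.shared(self.stem_ir(ir))
        return self.fuse(torch.cat([f_vis, f_ir], dim=1))

enc = IndependentSharedEncoder()
out = enc(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
print(out.shape)   # torch.Size([1, 32, 64, 64])
```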

29 pages, 6669 KiB  
Article
Implementing Deep Neural Networks on ARM-Based Microcontrollers: Application for Ventricular Fibrillation Detection
by Vessela Krasteva, Todor Stoyanov and Irena Jekova
Appl. Sci. 2025, 15(4), 1965; https://doi.org/10.3390/app15041965 - 13 Feb 2025
Viewed by 404
Abstract
GPU-based deep neural networks (DNNs) are powerful for electrocardiogram (ECG) processing and rhythm classification. Although questions often arise about their practical application in embedded systems with low computational resources, few studies have investigated the associated challenges. This study aims to show a useful workflow for deploying a pre-trained DNN model from a GPU-based development platform to two popular ARM-based microcontrollers: Raspberry Pi 4 and ARM Cortex-M7. Specifically, a five-layer convolutional neural network pre-trained in TensorFlow (TF) for the detection of ventricular fibrillation is converted to Lite Runtime (LiteRT) format and subjected to post-training quantization to reduce model size and computational complexity. Using a test dataset of 7482 10 s cardiac arrest ECGs, the inference of LiteRT DNN in Raspberry Pi 4 takes about 1 ms with a sensitivity of 98.6% and specificity of 99.5%, reproducing the TF DNN performance. An optimization study with 1300 representative datasets (RDSs), including 10 to 4000 calibration ECG signals selected by random, rhythm, or amplitude-based criteria, showed that choosing a random RDS with a relatively small size of 80 resulted in a quantized integer LiteRT DNN with minimal quantization error. The inference of both non-quantized and quantized LiteRT DNNs on a low-resource ARM Cortex-M7 microcontroller (STM32F7) shows rhythm accuracy deviation of <0.4%. Quantization reduces internal computation latency from 4.8 s to 0.6 s, flash memory usage from 40 kB to 20 kB, and energy consumption by 7.85 times. This study ensures that DNN models retain their functionality while being optimized for real-time execution on resource-constrained hardware, demonstrating application in automated external defibrillators. Full article
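Editor's note: the conversion and post-training quantization step with a representative dataset can be sketched with the standard TensorFlow/LiteRT converter API. The tiny stand-in model, the ECG shape, and the random calibration signals are placeholders; the size-80 random representative dataset follows the paper's description.

```python
# Hedged sketch: full-integer post-training quantization with a representative dataset.
import numpy as np
import tensorflow as tf

# Stand-in 1-D CNN; the paper instead loads its trained five-layer VF detector.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2500, 1)),
    tf.keras.layers.Conv1D(8, 5, activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

def representative_dataset():
    # 80 randomly chosen calibration ECGs, as in the paper's best RDS; random
    # noise here is only a placeholder for real signals.
    for _ in range(80):
        yield [np.random.randn(1, 2500, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("vf_detector_int8.tflite", "wb") as f:
    f.write(converter.convert())
```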

28 pages, 36222 KiB  
Review
Technical Review of Solar Distribution Calculation Methods: Enhancing Simulation Accuracy for High-Performance and Sustainable Buildings
by Ana Paula de Almeida Rocha, Ricardo C. L. F. Oliveira and Nathan Mendes
Buildings 2025, 15(4), 578; https://doi.org/10.3390/buildings15040578 - 13 Feb 2025
Viewed by 389
Abstract
Solar energy utilization in buildings can significantly contribute to energy savings and enhance on-site energy production. However, excessive solar gains may lead to overheating, thereby increasing cooling demands. Accurate calculation of sunlit and shaded areas is essential for optimizing solar technologies and improving the precision of building energy simulations. This paper provides a review of the solar shading calculation methods used in building performance simulation (BPS) tools, focusing on the progression from basic trigonometric models to advanced techniques such as projection and clipping (PgC) and pixel counting (PxC). These advancements have improved the accuracy and efficiency of solar shading simulations, enhancing energy performance and occupant comfort. As building designs evolve and adaptive shading systems become more common, challenges remain in ensuring that these methods can handle complex geometries and dynamic solar exposure. The PxC method, leveraging modern GPUs and parallel computing, offers a solution by providing real-time high-resolution simulations, even for irregular, non-convex surfaces. This ability to handle continuous updates positions PxC as a key tool for next-generation building energy simulations, ensuring that shading systems can adjust to changing solar conditions. Future research could focus on integrating appropriate modeling approaches with AI technologies to enhance accuracy, reliability, and computational efficiency. Full article
(This article belongs to the Special Issue Research on Sustainable Energy Performance of Green Buildings)
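Editor's note: the core of the pixel-counting (PxC) idea is that the sunlit fraction of a surface becomes a simple pixel ratio once the scene is rasterized from the sun's direction. The toy mask below stands in for a GPU-rendered buffer; the geometry is hypothetical.

```python
# Hedged sketch: sunlit fraction as a pixel ratio over a rasterized surface.
import numpy as np

H, W = 200, 300                       # raster resolution of the analyzed surface
shade_mask = np.zeros((H, W), dtype=bool)

# Hypothetical overhang casting a shadow on the upper third of the surface.
shade_mask[:H // 3, :] = True

sunlit_fraction = 1.0 - shade_mask.mean()
print(f"sunlit area fraction: {sunlit_fraction:.2%}")   # 67.00% for this toy mask
```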

15 pages, 6734 KiB  
Article
Self-Assembled Sandwich-like Mixed Matrix Membrane of Defective Zr-MOF for Efficient Gas Separation
by Yuning Li, Xinya Wang, Weiqiu Huang, Xufei Li, Ping Xia, Xiaochi Xu and Fangrui Feng
Nanomaterials 2025, 15(4), 279; https://doi.org/10.3390/nano15040279 - 12 Feb 2025
Viewed by 465
Abstract
Membrane technology is widely used in industrial CO2 capture, gas purification, and gas separation, attracting attention for its high efficiency, energy savings, and environmental benefits. In the context of reducing global carbon emissions and combating climate change, capturing and separating greenhouse gases such as CO2 is particularly important. To address the weak adhesion or bonding between the separation layer and the porous support, Zr-MOF can be applied as a multi-dimensional modification of the polymer membrane to prepare self-assembled MOF-based mixed matrix membranes (MMMs). When defective UiO-66 is applied to a PVDF membrane as a functional layer, the CO2 separation performance of the PVDF membrane improves significantly. TUT-UiO-3-TTN@PVDF achieves a CO2 permeation flux of 14,294 GPU (gas permeation units) and selectivities of 27 for CO2/N2 and 18 for CO2/CH4. The permeability and selectivity of the membrane showed little change after 40 h of continuous operation, indicating substantially improved gas separation performance and exceptional stability for large-scale applications. Full article
(This article belongs to the Special Issue Advances in Polymer Nanofilms)

25 pages, 2844 KiB  
Article
Real-Time Gesture-Based Hand Landmark Detection for Optimized Mobile Photo Capture and Synchronization
by Pedro Marques, Paulo Váz, José Silva, Pedro Martins and Maryam Abbasi
Electronics 2025, 14(4), 704; https://doi.org/10.3390/electronics14040704 - 12 Feb 2025
Viewed by 479
Abstract
Gesture recognition technology has emerged as a transformative solution for natural and intuitive human–computer interaction (HCI), offering touch-free operation across diverse fields such as healthcare, gaming, and smart home systems. In mobile contexts, where hygiene, convenience, and the ability to operate under resource constraints are critical, hand gesture recognition provides a compelling alternative to traditional touch-based interfaces. However, implementing effective gesture recognition in real-world mobile settings involves challenges such as limited computational power, varying environmental conditions, and the requirement for robust offline–online data management. In this study, we introduce ThumbsUp, which is a gesture-driven system, and employ a partially systematic literature review approach (inspired by core PRISMA guidelines) to identify the key research gaps in mobile gesture recognition. By incorporating insights from deep learning–based methods (e.g., CNNs and Transformers) while focusing on low resource consumption, we leverage Google’s MediaPipe in our framework for real-time detection of 21 hand landmarks and adaptive lighting pre-processing, enabling accurate recognition of a “thumbs-up” gesture. The system features a secure queue-based offline–cloud synchronization model, which ensures that the captured images and metadata (encrypted with AES-GCM) remain consistent and accessible even with intermittent connectivity. Experimental results under dynamic lighting, distance variations, and partially cluttered environments confirm the system’s superior low-light performance and decreased resource consumption compared to baseline camera applications. Additionally, we highlight the feasibility of extending ThumbsUp to incorporate AI-driven enhancements for abrupt lighting changes and, in the future, electromyographic (EMG) signals for users with motor impairments. Our comprehensive evaluation demonstrates that ThumbsUp maintains robust performance on typical mobile hardware, showing resilience to unstable network conditions and minimal reliance on high-end GPUs. These findings offer new perspectives for deploying gesture-based interfaces in the broader IoT ecosystem, thus paving the way toward secure, efficient, and inclusive mobile HCI solutions. Full article
(This article belongs to the Special Issue AI-Driven Digital Image Processing: Latest Advances and Prospects)
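Editor's note: the offline-first synchronization idea, AES-GCM-encrypted captures held in a local queue until connectivity returns, can be sketched with the `cryptography` package. This is a minimal illustration; the queue layout, metadata format, and the decrypt-on-upload step are assumptions, not the app's actual API.

```python
# Hedged sketch: queue-based offline storage of AES-GCM-encrypted captures.
import os
import queue
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)
aesgcm = AESGCM(key)
pending = queue.Queue()

def capture(image_bytes: bytes, metadata: bytes):
    nonce = os.urandom(12)                       # unique nonce per message
    ciphertext = aesgcm.encrypt(nonce, image_bytes, metadata)
    pending.put((nonce, metadata, ciphertext))   # survives offline periods

def sync():
    while not pending.empty():
        nonce, metadata, ciphertext = pending.get()
        image = aesgcm.decrypt(nonce, ciphertext, metadata)  # integrity-checked
        print(f"uploaded {len(image)} bytes, metadata={metadata!r}")

capture(b"...fake image bytes...", b"thumbs-up;2025-02-12T10:00")
sync()
```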

20 pages, 899 KiB  
Article
Boundary-Aware Concurrent Queue: A Fast and Scalable Concurrent FIFO Queue on GPU Environments
by Md. Sabbir Hossain Polak, David A. Troendle and Byunghyun Jang
Appl. Sci. 2025, 15(4), 1834; https://doi.org/10.3390/app15041834 - 11 Feb 2025
Viewed by 355
Abstract
This paper presents Boundary-Aware Concurrent Queue (BACQ), a high-performance queue designed for modern GPUs, which focuses on high concurrency in massively parallel environments. BACQ operates at the warp level, leveraging intra-warp locality to improve throughput. A key to BACQ’s design is its ability to replace conflicting accesses to shared data with independent accesses to private data. It uses a ticket-based system to ensure fair ordering of operations and supports infinite growth of the head and tail across its ring buffer. The leader thread of each warp coordinates enqueue and dequeue operations, broadcasting offsets for intra-warp synchronization. BACQ dynamically adjusts operation priorities based on the queue’s state, especially as it approaches boundary conditions such as overfilling the buffer. It also uses a virtual caching layer for intra-warp communication, reducing memory latency. Rigorous benchmarking results show that BACQ outperforms the BWD (Broker Queue Work Distributor), the fastest known GPU queue, by more than 2× while preserving FIFO semantics. The paper demonstrates BACQ’s superior performance through real-world empirical evaluations. Full article
(This article belongs to the Special Issue Data Structures for Graphics Processing Units (GPUs))
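Editor's note: the ticket idea, monotonically growing head and tail indices whose remainder modulo the capacity selects a ring-buffer slot, can be sketched in plain Python. Warp-level leader election, intra-warp broadcasting, and the virtual caching layer are GPU-specific and omitted; this is an illustration of the ticket/ring mechanism only, not BACQ itself.

```python
# Hedged sketch: ticket-based FIFO ring buffer with "infinitely growing" head/tail.
import threading

class TicketRingQueue:
    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.head = 0            # next ticket to dequeue
        self.tail = 0            # next ticket to enqueue
        self.lock = threading.Lock()
        self.not_full = threading.Condition(self.lock)
        self.not_empty = threading.Condition(self.lock)

    def enqueue(self, item):
        with self.not_full:
            while self.tail - self.head == self.capacity:
                self.not_full.wait()            # boundary: buffer would overfill
            self.slots[self.tail % self.capacity] = item
            self.tail += 1
            self.not_empty.notify()

    def dequeue(self):
        with self.not_empty:
            while self.tail == self.head:
                self.not_empty.wait()           # boundary: buffer empty
            item = self.slots[self.head % self.capacity]
            self.head += 1
            self.not_full.notify()
            return item

q = TicketRingQueue(capacity=4)
for i in range(4):
    q.enqueue(i)
print([q.dequeue() for _ in range(4)])          # FIFO order preserved: [0, 1, 2, 3]
```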

20 pages, 732 KiB  
Article
VCONV: A Convolutional Neural Network Accelerator for FPGAs
by Srikanth Neelam and A. Amalin Prince
Electronics 2025, 14(4), 657; https://doi.org/10.3390/electronics14040657 - 8 Feb 2025
Viewed by 325
Abstract
Field Programmable Gate Arrays (FPGAs), with their wide portfolio of configurable resources such as Look-Up Tables (LUTs), Block Random Access Memory (BRAM), and Digital Signal Processing (DSP) blocks, are the best option for custom hardware designs. Their low power consumption and cost-effectiveness give them an advantage over Graphics Processing Units (GPUs) and Central Processing Units (CPUs) in providing efficient accelerator solutions for compute-intensive Convolutional Neural Network (CNN) models. CNN accelerators are dedicated hardware modules capable of performing compute operations such as convolution, activation, normalization, and pooling with minimal intervention from a host. Designing accelerators for deeper CNN models requires FPGAs with vast resources, which undermines their advantages in terms of power and cost. In this paper, we propose the VCONV Intellectual Property (IP), an efficient and scalable CNN accelerator architecture for applications where power and cost are constraints. VCONV, with its configurable design, can be deployed across multiple smaller FPGAs instead of a single large FPGA to provide better control over cost and parallel processing. VCONV can also be deployed across heterogeneous FPGAs, depending on the performance requirements of each layer. The IP's performance can be evaluated using embedded monitors to ensure that the accelerator is configured for the best performance. VCONV can be configured for data type format, convolution engine (CE) and convolution unit (CU) configurations, and the sequence of operations based on the CNN model and layer. VCONV is interfaced through the Advanced Peripheral Bus (APB) for configuration and the Advanced eXtensible Interface (AXI) stream for data transfers. The IP was implemented and validated on the Avnet Zedboard and tested on the first layer of AlexNet, VGG16, and ResNet18 with multiple CE configurations, demonstrating 100% MAC-unit utilization with no idle time. We also synthesized the multiple VCONV instances required for AlexNet, achieving a BRAM utilization of just 1.64 Mb and a performance of 56 GOPS. Full article
(This article belongs to the Special Issue Convolutional Neural Networks and Vision Applications, 3rd Edition)
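Editor's note: a back-of-the-envelope calculation helps relate the CE/CU configuration to the work in a convolution layer. The AlexNet conv1 shape below is the standard one; the CE and CU counts are illustrative assumptions, not the VCONV configuration reported in the paper.

```python
# Hedged sketch: MAC count of a convolution layer vs. available parallel MAC units.
def conv_macs(out_h, out_w, out_c, k_h, k_w, in_c):
    """Multiply-accumulate operations for one convolution layer."""
    return out_h * out_w * out_c * k_h * k_w * in_c

macs = conv_macs(out_h=55, out_w=55, out_c=96, k_h=11, k_w=11, in_c=3)
print(f"AlexNet conv1: {macs / 1e6:.1f} M MACs per frame")   # ~105.4 M

parallel_macs = 8 * 16          # e.g. 8 CEs x 16 CUs, one MAC each per cycle (assumption)
cycles = macs / parallel_macs
print(f"ideal cycles at full MAC utilization: {cycles / 1e6:.2f} M")
```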

13 pages, 2045 KiB  
Article
A Hardware Accelerator for Real-Time Processing Platforms Used in Synthetic Aperture Radar Target Detection Tasks
by Yue Zhang, Yunshan Tang, Yue Cao and Zhongjun Yu
Micromachines 2025, 16(2), 193; https://doi.org/10.3390/mi16020193 - 7 Feb 2025
Viewed by 474
Abstract
Deep learning object detection algorithms have been widely applied in the field of synthetic aperture radar (SAR). By utilizing deep convolutional neural networks (CNNs) and other techniques, these algorithms can effectively identify and locate targets in SAR images, improving the accuracy and efficiency of detection. In recent years, real-time monitoring of regions has become a pressing need, leading to real-time SAR image target detection being performed directly on airborne or satellite-borne processing platforms. However, current GPU-based real-time processing platforms struggle to meet the power consumption requirements of airborne or satellite applications. To address this issue, a low-power, low-latency deep learning SAR object detection accelerator was designed in this study to enable real-time target detection on airborne and satellite SAR platforms. The accelerator introduces a Process Engine (PE) suited to parallel multidimensional convolution, making full use of Field-Programmable Gate Array (FPGA) computing resources to reduce convolution computing time. Furthermore, a memory arrangement designed around this PE improves memory read/write efficiency, and dataflow patterns suited to FPGA computing are applied to the accelerator to reduce computation latency. Our experimental results demonstrate that deploying the Yolov5s-based SAR object detection algorithm on this accelerator, implemented on a Virtex-7 690T chip, consumes only 7 watts of dynamic power while detecting 52.19 SAR images of size 512 × 512 per second. Full article
(This article belongs to the Section E: Engineering and Technology)
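Editor's note: the reported power and throughput figures imply the dynamic energy spent per image; the short check below is arithmetic on the numbers quoted in the abstract, nothing more.

```python
# Hedged arithmetic check: dynamic energy per 512x512 SAR image.
power_w = 7.0                 # reported dynamic power of the accelerator
images_per_s = 52.19          # reported detection throughput

energy_per_image_j = power_w / images_per_s
print(f"~{energy_per_image_j * 1000:.0f} mJ of dynamic energy per image")   # ~134 mJ
```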

27 pages, 6820 KiB  
Article
PCF-RWKV: Large Language Model for Product Carbon Footprint Estimation
by Zhen Li, Peihao Tang, Xuanlin Wang, Xueping Liu and Peng Mou
Sustainability 2025, 17(3), 1321; https://doi.org/10.3390/su17031321 - 6 Feb 2025
Viewed by 589
Abstract
As global climate change intensifies, assessing product carbon footprints serves as a foundational step for quantifying greenhouse gas emissions throughout a product’s lifecycle, forming the basis for achieving sustainability and emission reduction goals. Traditional lifecycle assessment methods face challenges such as subjective boundary definitions and time-consuming inventory construction. This study introduces PCF-RWKV, a novel model based on the RWKV architecture with task-specialized low-rank adaptations (LoRAs). Trained on carbon footprint datasets, the model minimizes memory use and data interference, enabling efficient deployment on consumer-grade GPUs without relying on cloud computing. By integrating multi-agent technology, PCF-RWKV automates the creation of lifecycle inventories and aligns production processes with emission factors to calculate carbon footprints. This approach significantly improves the efficiency and security of corporate carbon footprint assessments, providing a potential alternative to traditional methods. Full article
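Editor's note: the task-specialized low-rank adaptation (LoRA) idea can be sketched as a frozen base projection plus a trainable low-rank update. Dimensions, rank, and scaling below are illustrative assumptions; the RWKV backbone and the multi-agent pipeline are not shown.

```python
# Hedged sketch: a LoRA-style adapter around a frozen linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # frozen backbone weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(1024, 1024), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable adapter parameters: {trainable}")       # 16384, vs ~1.05M frozen
x = torch.randn(2, 1024)
print(layer(x).shape)
```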