The motion-estimation search range required for interframe encoding with the MPEG-2 video compression standard depends on a number of factors, including video content, video resolution, elapsed time between reference and predicted pictures, and, just as significantly, pragmatic considerations in implementing a cost-effective solution. In this paper we present a set of experimental results that provide a probabilistic characterization of the size of motion vectors for different types of video, from well-known standard test sequences to fast-paced sports sequences to action movie clips. We study the impact of search range on compression efficiency and video quality. Finally, on the basis of these results, we conclude with recommendations for target search ranges suitable for high-quality compression of standard- and high-definition video.
Given a high-resolution compressed video, we investigate the problem of simplifying the memory and computational requirements of the decoder when the intended playback resolution is less than the encoded resolution. We show that significant savings can be obtained in the amount of temporary storage space used, in the memory bandwidth between the processor and local memory, and in raw computational complexity, without incurring a significant loss in perceptual quality. Most video compression standards make special adjustments to deal with interlaced video. We explore how the proposed solution adapts to these situations and suggest modifications that improve the quality of the reconstructed video. Results from applying these algorithms to a number of test sequences compressed with the MPEG-2 standard are presented.
We propose a new cost function for motion estimation. A position penalty is added to the conventional cost function, the Sum of Absolute Differences (SAD), to regulate the motion field. The effort is focused on minimizing both the motion-vector difference and the residual error, which must be coded. Compared with existing motion-field regulation methods, the proposed method is more likely to select the minimum-distortion candidate, which yields better picture quality at a fixed bit rate. Simulation results show that the proposed cost function outperforms the existing methods with a minimal increase in computational cost, which makes it attractive for hardware implementation.
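As an illustrative sketch (not the paper's exact formulation), a regulated cost of this kind combines the SAD of the candidate block with a penalty on the motion-vector difference; the weight `lam` and the L1 form of the penalty here are assumptions:

```python
import numpy as np

def sad(a, b):
    # Sum of Absolute Differences between two equal-sized blocks.
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def regulated_cost(cur_block, cand_block, mv, pred_mv, lam=2.0):
    # SAD plus a position penalty proportional to the motion-vector
    # difference (mv - pred_mv); lam trades residual error against
    # motion-field smoothness. The L1 penalty and lam=2.0 are
    # illustrative choices, not the paper's exact weighting.
    mvd = abs(mv[0] - pred_mv[0]) + abs(mv[1] - pred_mv[1])
    return sad(cur_block, cand_block) + lam * mvd
```

During a block search, the candidate minimizing `regulated_cost` rather than raw SAD is chosen, which biases the estimator toward small motion-vector differences that are cheap to entropy-code.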
As medical image data sets are digitized and their number grows exponentially, there is a need for automated image processing and analysis techniques. Most medical imaging methods require human visual inspection and manual measurement, which are labor-intensive and often produce inconsistent results. In this paper, we propose an automated image segmentation and classification method that identifies tumor cell nuclei in medical images and classifies these nuclei into two categories, stained and unstained tumor cell nuclei. The proposed method segments and labels individual tumor cell nuclei, separates nuclei clusters, and produces stained and unstained tumor cell nuclei counts. Representative fields of view were chosen by a pathologist from a known diagnosis (clear cell renal cell carcinoma), and the automated results are compared with the pathologist's hand counts.
Open Computing Language® (OpenCL®), which was created to support parallel programming of heterogeneous multicore-processor systems, has great potential for high-performance computing and consumer electronics, since it provides application programming interfaces (APIs) that help make code portable across multiple devices. OpenCL is still under development, and it is not clear whether OpenCL has any advantages over other frameworks aside from portability. The purpose of our project was to define evaluation criteria, empirically evaluate OpenCL as a programming framework using those criteria (e.g., performance, productivity, and portability), define and implement parallel primitives in OpenCL, and demonstrate how the use of the implemented parallel primitives can benefit our target applications. Parallel primitive library APIs are defined to implement parallel algorithms in OpenCL, and a set of data- and task-parallel primitives is implemented and incorporated in the target applications. Multicore central processing units, the Cell Broadband Engine® (Cell/B.E.®), and graphics processing units are used as target platforms, and digital TV applications are used to evaluate the usefulness of OpenCL. Preliminary results show that parallel primitives are one way to improve application performance and programmer productivity with OpenCL while still maintaining software portability.
Emerging multi-core processors are able to accelerate medical imaging applications by exploiting the parallelism available in their algorithms. We have implemented a mutual-information-based 3D linear registration algorithm on the Cell Broadband Engine™ processor. By exploiting the highly parallel architecture and its high memory bandwidth, our implementation with two CBE processors can register a pair of 256x256x30 3D images in one second. This implementation is significantly faster than a conventional one on a traditional microprocessor and even faster than a previously reported custom-hardware implementation. In addition to parallelizing the code for multiple cores and organizing the data structures to reduce memory traffic, it is also critical to optimize the code for the SIMD pipeline structure. We note that code optimization for the SIMD pipeline alone results in a 4.2x-8.7x acceleration for the computation of small kernels; further, SIMD optimization alone results in a 4.5x end-to-end application speedup.
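For reference, the mutual-information objective that such a registration maximizes can be sketched from a joint intensity histogram. The 32-bin histogram estimator below is an illustrative assumption, not the paper's CBE implementation:

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    # Estimate mutual information between two equally sized images from
    # their joint intensity histogram. A registration search would
    # maximize this score over candidate linear transforms of img_b.
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()              # joint probability
    px = pxy.sum(axis=1, keepdims=True)    # marginal of img_a
    py = pxy.sum(axis=0, keepdims=True)    # marginal of img_b
    nz = pxy > 0                           # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())
```

Mutual information peaks when the intensities of the two images are most predictable from each other, which is why it is robust for multi-modal registration where direct intensity differences are not meaningful.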
A modular architecture with random-access on-chip local memory for real-time motion estimation is proposed. The random-access on-chip local memory with simple address generation overcomes the irregular data flow of the three-step search block matching algorithm (BMA). This architecture features simple interconnection, low memory bandwidth, and a throughput rate as high as 1/N block per clock cycle for an N×N block with a search range of dm = N/2 - 1 pixels at 100% processor utilization. By using a method called pipeline interleaving, this architecture offers a feasible solution for the Grand Alliance HDTV picture format with a large search range.
The three-step hierarchical search block matching motion estimation algorithm has played an important role in low bit rate video coding because of its low computational complexity compared with the full-search block matching motion estimation algorithm (FBMA). A modular architecture for the three-step hierarchical search BMA is presented, which features a throughput rate as high as 1/N block per clock cycle and a low memory bandwidth with random-access on-chip local memory. Furthermore, 100% processor utilization is achieved by using a method called pipeline interleaving. As such, this architecture offers a feasible solution for the Grand Alliance HDTV picture format with a large search range.
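The three-step search itself (independent of the hardware mapping) can be sketched as follows; the block size, step schedule, and boundary handling here are simplified assumptions:

```python
import numpy as np

def sad(a, b):
    # Sum of Absolute Differences between two equal-sized blocks.
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def three_step_search(cur, ref, by, bx, n=16, step=4):
    # Three-step search: evaluate the centre and its 8 neighbours at the
    # current step size, recentre on the best match, halve the step, and
    # repeat until the step reaches 1. Steps (4, 2, 1) give the classic
    # +/-7 search range; out-of-frame candidates are simply skipped.
    cur_block = cur[by:by + n, bx:bx + n]
    cy, cx = by, bx
    while step >= 1:
        best = None
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                y, x = cy + dy, cx + dx
                if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                    continue
                cost = sad(cur_block, ref[y:y + n, x:x + n])
                if best is None or cost < best[0]:
                    best = (cost, y, x)
        cy, cx = best[1], best[2]
        step //= 2
    return cy - by, cx - bx  # motion vector (dy, dx)
```

Only 25 candidates are evaluated per block (9 + 8 + 8) instead of the 225 a full search over +/-7 would need, which is the complexity advantage the abstract refers to.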
This paper proposes a modular systolic array architecture for the full-search block matching motion estimation algorithm (FBMA). With this novel architecture, the authors are able to generate a motion vector for every reference block in raster-scan order while achieving 100% processor utilization and a high throughput rate. Furthermore, they devised a scheme to reduce the pin count (I/O) by sharing memory units, which results in low memory bandwidth. The architecture is scalable in that it can easily be adapted to handle larger search ranges and different block sizes without increasing the effective latency.
In this paper, architectures that support block-based real-time motion estimation of video signals using various search methods are presented. The design effort is focused on processor-level design with a new matching criterion. With the new binary-level matching criterion, which performs a bit-wise comparison instead of the conventional eight-bit addition/subtraction, we achieve a simple processor-level design with fewer input/output lines and lower power consumption.
In recent years, minimizing power consumption has become a key issue in the design of portable electronic devices. In this paper, a low-power architecture that supports real-time motion estimation of video signals is presented. The architecture is based on a binary-level matching criterion that performs a bit-wise comparison. A processor-level design based on simple combinational logic using the binary-level matching criterion is introduced. Compared with existing architectures, the proposed architecture delivers a higher throughput rate, requires fewer input/output lines, and reduces total power consumption.
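A minimal software sketch of such a bit-wise criterion is shown below. The mean-threshold binarization is an illustrative assumption; the papers' exact one-bit transform may differ:

```python
import numpy as np

def binarize(block):
    # Reduce each pixel to a single bit by thresholding against the
    # block mean (illustrative binarization; the exact one-bit
    # transform used in the papers may differ).
    return (block >= block.mean()).astype(np.uint8)

def binary_match_cost(bin_a, bin_b):
    # Bit-wise matching criterion: count mismatching bits (XOR plus
    # popcount) instead of 8-bit absolute differences. In hardware this
    # maps to simple combinational logic rather than adder arrays,
    # which is the source of the power and I/O savings.
    return int(np.count_nonzero(bin_a ^ bin_b))
```

Because each pixel contributes one bit instead of eight, both the datapath width and the number of input/output lines per processing element shrink accordingly.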
Papers by Hangu Yeo