The paper introduces a range of efficient algorithmic solutions for implementing the fundamental filtering operation in convolutional layers of convolutional neural networks on fully parallel hardware. Specifically, these operations involve computing M inner products between neighbouring vectors generated by a sliding time window from the input data stream and an M-tap finite impulse response filter. By leveraging the factorisation of the Hankel matrix, we have successfully reduced the multiplicative complexity of the matrix-vector product calculation. This approach has been applied to develop fully parallel and resource-efficient algorithms for M values of 3, 5, 7, and 9. The fully parallel hardware implementation of our proposed algorithms achieves approximately a 30% reduction in embedded multipliers compared to the naive calculation methods.
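As a rough illustration of the operation this abstract describes, the sketch below computes the M sliding-window inner products as a Hankel matrix-vector product in the naive way, using M² multiplications. It is only the baseline that the abstract compares against, not the paper's factorised algorithm; the function names `hankel_matrix` and `sliding_inner_products` are hypothetical, and the window layout (2M − 1 input samples, entry (i, j) equal to x[i + j]) is an assumption.

```python
def hankel_matrix(x, m):
    """Build the m-by-m Hankel matrix whose (i, j) entry is x[i + j].

    Assumes x holds the 2*m - 1 samples covered by m neighbouring windows.
    """
    if len(x) != 2 * m - 1:
        raise ValueError("expected 2*m - 1 input samples")
    return [[x[i + j] for j in range(m)] for i in range(m)]


def sliding_inner_products(x, h):
    """Naive baseline: m inner products, m*m multiplications in total."""
    m = len(h)
    rows = hankel_matrix(x, m)
    # Row i is the i-th window of the input stream; dot it with the filter taps.
    return [sum(row[j] * h[j] for j in range(m)) for row in rows]


if __name__ == "__main__":
    # M = 3: five input samples, three filter taps, three outputs.
    print(sliding_inner_products([1, 2, 3, 4, 5], [1, 0, -1]))
```

For M = 3 this baseline spends 9 multiplications; the reduction of roughly 30% reported in the abstract refers to the factorised algorithms, which are not reproduced here.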
2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications (INISTA), 2017
The equations describing the position and movement of new individuals in the Bat Algorithm have the form of difference equations. The behavior of the solutions of these equations, and in particular their stability, can be analyzed after omitting the randomness in the parameters and treating the algorithm as stationary. The paper presents a simplified analysis of the choice of bat algorithm parameters, based on linear stability and the behavior of the algorithm. The study indicates the recommended regions for the parameters and shows how the different parameters affect the behavior of the algorithm. A simple tool to speed up the tuning of the algorithm was obtained.
2020 Progress in Applied Electrical Engineering (PAEE), 2020
Reducing the number of individual operations in matrix-vector multiplication is a way of accelerating the multiplication and decreasing power consumption. It is often not a simple task. The paper presents methods of searching for a suitable structure of the matrix using an evolutionary algorithm and hill climbing. The evaluation function defined in the paper improves the form of the matrix by steering it towards a suitable structure. The optimisation process uses a specially defined crossover and two types of mutation. The investigation presented in the paper confirms that a special structure of a matrix can be found automatically.
This article presents an efficient algorithm for computing a 10-point DFT. The proposed algorithm reduces the number of multiplications at the cost of a slight increase in the number of additions in comparison with the known algorithms. Using a 10-point DFT for harmonic power system analysis can improve accuracy and reduce errors caused by spectral leakage. This paper compares the computational complexity for an L×10^M-point DFT with a 2^M-point DFT.
A set of efficient algorithmic solutions suitable for the fully parallel hardware implementation of short-length circular convolution cores is proposed. The advantage of the presented algorithms is that they require significantly fewer multiplications than the naive method of implementing this operation. During the synthesis of the presented algorithms, the matrix notation of the cyclic convolution operation was used, which made it possible to represent this operation using the matrix–vector product. The fact that the matrix multiplicand is a circulant matrix allows its successful factorization, which leads to a decrease in the number of multiplications when calculating such a product. The proposed algorithms are oriented towards a completely parallel hardware implementation, but in comparison with a naive approach to a completely parallel hardware implementation, they require a significantly smaller number of hardwired multipliers. Since the wired multiplier occupies a...
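The matrix notation mentioned in this abstract can be sketched as follows: a length-N circular convolution written as the product of an N-by-N circulant matrix and a vector. This is the naive N²-multiplication form that the abstract uses as its point of comparison, not the factorised algorithms themselves; the names `circulant_matrix` and `circular_convolution` and the indexing convention C[i][j] = h[(i − j) mod N] are assumptions made for illustration.

```python
def circulant_matrix(h):
    """Return the circulant matrix C with C[i][j] = h[(i - j) mod n]."""
    n = len(h)
    return [[h[(i - j) % n] for j in range(n)] for i in range(n)]


def circular_convolution(h, x):
    """Naive circular convolution y = C(h) x using n*n multiplications."""
    if len(h) != len(x):
        raise ValueError("h and x must have the same length")
    n = len(h)
    c = circulant_matrix(h)
    return [sum(c[i][j] * x[j] for j in range(n)) for i in range(n)]


if __name__ == "__main__":
    # Convolving with a shifted unit impulse rotates h by one position.
    print(circular_convolution([1, 2, 3], [0, 1, 0]))
```

Every row of the circulant matrix is a rotation of the same N values, and it is this repeated structure that a factorisation can exploit to share multiplications.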
The article presents a parallel hardware-oriented algorithm designed to speed up the division of two octonions. The advantage of the proposed algorithm is that the number of real multiplications is halved as compared to the naive method for implementing this operation. In the synthesis of the discussed algorithm, the matrix representation of this operation was used, which allows us to present the division of octonions by means of a vector–matrix product. Taking into account a specific structure of the matrix multiplicand allows for reducing the number of real multiplications necessary for the execution of the octonion division procedure.
This paper presents a new algorithm for multiplying two Kaluza numbers. Performing this operation directly requires 1024 real multiplications and 992 real additions. We presented in a previous paper an effective algorithm that can compute the same result with only 512 real multiplications and 576 real additions. More effective solutions have not yet been proposed. Nevertheless, it turned out that an even more interesting solution could be found that would further reduce the computational complexity of this operation. In this article, we propose a new algorithm that allows one to calculate the product of two Kaluza numbers using only 192 multiplications and 384 additions of real numbers.
In this article, we propose a set of efficient algorithmic solutions for computing short linear convolutions focused on hardware implementation in VLSI. We consider convolutions for sequences of length N = 2, 3, 4, 5, 6, 7, and 8. Hardwired units that implement these algorithms can be used as building blocks when designing VLSI-based accelerators for more complex data processing systems. The proposed algorithms are focused on fully parallel hardware implementation, but compared to the naive approach to fully parallel hardware implementation, they require from 25% to about 60% fewer hardware multipliers, depending on the length N. Since the multiplier takes up a much larger area on the chip than the adder and consumes more power, the proposed algorithms are resource-efficient and energy-efficient in terms of their hardware implementation.
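To make the multiplier saving concrete, here is a minimal sketch for the smallest case, N = 2. It contrasts the naive linear convolution (4 multiplications) with the classic three-multiplication rearrangement known from Karatsuba's method (3 multiplications, i.e. 25% fewer, at the cost of extra additions). Whether the paper's N = 2 algorithm takes exactly this form is an assumption; the function names are hypothetical.

```python
def linear_convolution_naive(a, b):
    """Naive linear convolution: len(a) * len(b) multiplications."""
    y = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y


def linear_convolution_n2(a, b):
    """Length-2 linear convolution with 3 multiplications instead of 4.

    Classic Karatsuba-style rearrangement, shown for illustration only.
    """
    a0, a1 = a
    b0, b1 = b
    p0 = a0 * b0                # multiplication 1
    p2 = a1 * b1                # multiplication 2
    p1 = (a0 + a1) * (b0 + b1)  # multiplication 3
    # The middle output a0*b1 + a1*b0 is recovered with additions only.
    return [p0, p1 - p0 - p2, p2]


if __name__ == "__main__":
    a, b = [1, 2], [3, 4]
    print(linear_convolution_naive(a, b))
    print(linear_convolution_n2(a, b))
```

The trade is the one the abstract motivates: one multiplier is removed and a few adders are introduced, which is favourable when a multiplier costs much more chip area and power than an adder.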
Discrete orthogonal transforms such as the discrete Fourier transform, discrete cosine transform, discrete Hartley transform, etc., are important tools in numerical analysis, signal processing, and statistical methods. The successful application of transform techniques relies on the existence of efficient fast algorithms for their implementation. A special place in the list of transformations is occupied by the discrete fractional Fourier transform (DFrFT). In this paper, some parallel algorithms and processing unit structures for fast DFrFT implementation are proposed. The approach is based on the resourceful factorization of DFrFT matrices. Some parallel algorithms and processing unit structures for small size DFrFTs such as N = 2, 3, 4, 5, 6, and 7 are presented. In each case, we describe only the most important part of the structures of the processing units, neglecting the description of the auxiliary units and the control circuits.
2016 Progress in Applied Electrical Engineering (PAEE), 2016
Reconstructing the value of a measurand from observations of the output of the measurement path is a so-called ill-conditioned operation. This holds even when the dynamics of the path are well known and described by a linear differential equation. The task is sensitive to any distortion and measurement error, and it generally requires the use of complex mathematical operations. If the measuring circuit has linear dynamics specified by impulse and step responses, then a simplified model of the dynamics, obtained by expanding the integrand of the convolution in a Taylor series, makes it easy to restore the state of the input by observing the state of the output. The error of the model depends on the second derivative of the reconstructed signal and is negligibly small over a short time horizon. The paper proposes a model whose parameters can be selected in such a way that the error of the simplified model depends only on the fourth or higher derivatives of the input signal. This method gives good results for a smooth input signal with negligible values of derivatives higher than the third order. Complicated dependencies of the model parameters on the type of dynamic measurement channel and its parameters can be reduced to a simple, universal form. This was made possible by first subjecting the output signal to multiple integrations, an operation that is easy to perform and accurate.
Papers by Janusz Papliński