Data Placement and Duplication For Embedded Multicore Systems With Scratch Pad Memory
\[
\mathrm{CostMin}[j, i_1, i_2, \ldots, i_C] =
\begin{cases}
\displaystyle\sum_{j'=1}^{N_d} \mathrm{Cost}_M(D_{j'}), & \text{if } i_k = \mathrm{Size}_{S_k} \ \forall k \in \{1, \ldots, C\} \\[6pt]
\displaystyle\sum_{j'=2}^{N_d} \mathrm{Cost}_M(D_{j'}) + \mathrm{Cost}_{S_k}(D_1), & \text{if } j = 1,\ i_k = \mathrm{Size}_{S_k} - 1,\ \text{and } i_{k'} = \mathrm{Size}_{S_{k'}} \ \forall k' \neq k \\[6pt]
\min\bigl(\mathrm{CostMin}[j-1, i_1, i_2, \ldots, i_C], \\
\quad \mathrm{CostMin}[j-1, i_1+1, i_2, \ldots, i_C] - \mathrm{Cost}_M(D_j) + \mathrm{Cost}_{S_1}(D_j), \\
\quad \ldots, \\
\quad \mathrm{CostMin}[j-1, i_1, i_2, \ldots, i_C+1] - \mathrm{Cost}_M(D_j) + \mathrm{Cost}_{S_C}(D_j)\bigr), & \text{if } \displaystyle\sum_{k=1}^{C} i_k \ge \sum_{k=1}^{C} \mathrm{Size}_{S_k} - j \\[6pt]
\infty, & \text{if } \displaystyle\sum_{k=1}^{C} i_k < \sum_{k=1}^{C} \mathrm{Size}_{S_k} - j \ \text{or } \exists k \in \{1, \ldots, C\}: i_k > \mathrm{Size}_{S_k}
\end{cases}
\tag{2}
\]
minimum cost problem on multicore is formally defined. After
that, the RDPM algorithm is proposed. Section V-B will
present how to determine the optimal data duplication and
propose the RDPM-DUP.
A. Regional Data Placement for Multicore
Before introducing the proposed RDPM algorithm, the
formal definition of the SPM data placement with minimal
cost problem on multicore is presented.
1) Problem Definition: The inputs are: a collection of data D = (d_1, d_2, ..., d_{N_d}); the initial data placement for each core's on-chip SPM; the capacity of each core's SPM, Size_{S_i}, for core i; the number of data items N_d; the number of cores C; the read and write costs to the local on-chip SPM, R_{S_i} and W_{S_i}, for core i; the read and write costs from core i to core j's SPM, R_{C_i S_j} and W_{C_i S_j}; and the read and write costs to the main memory, R_{M_i} and W_{M_i}, for core i.
Definition 1: SPM data placement with minimal cost problem on multicore systems: Given the inputs, what is a data placement for all cores' SPMs and the shared main memory such that the total time/energy cost of memory accesses is minimized?
The output is: a data placement for all cores' SPMs and the main memory, under which the total cost of memory accesses is minimized.
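As a concrete sketch, the inputs above can be collected into a single structure. This is only an illustration; all field names here are ours, not the paper's.

```python
from dataclasses import dataclass

@dataclass
class PlacementInputs:
    """Inputs of the SPM data placement problem (field names are illustrative)."""
    num_cores: int       # C
    num_data: int        # N_d
    spm_capacity: list   # Size_{S_i} for each core i
    init_placement: list # initial location of each data item
    local_read: list     # R_{S_i}: read cost to core i's own SPM
    local_write: list    # W_{S_i}: write cost to core i's own SPM
    remote_read: list    # R_{C_i S_j}: cost of core i reading core j's SPM
    remote_write: list   # W_{C_i S_j}: cost of core i writing core j's SPM
    mem_read: list       # R_{M_i}: cost of core i reading main memory
    mem_write: list      # W_{M_i}: cost of core i writing main memory
```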
2) RDPM Algorithm:
a) Compute cost of accessing remote SPM: Let C be the number of cores in the system. Let d be the distance between two cores. Let the nondecreasing remote SPM access cost function be f. In this step, we compute the remote access costs R_{C_i S_j} and W_{C_i S_j} for every pair (i, j) using f(d).
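For illustration only, the step above can be sketched assuming a linear core layout in which the distance between cores i and j is |i - j|, and the linear cost function f(d) = d * beta that the experiments in Section VI use. The layout assumption and all names here are ours.

```python
def remote_access_costs(num_cores, beta):
    """Compute a remote SPM access cost for every ordered pair (i, j).

    f(d) = d * beta is one nondecreasing choice of cost function; the
    distance metric |i - j| (a linear core layout) is an assumption here.
    """
    cost = [[0.0] * num_cores for _ in range(num_cores)]
    for i in range(num_cores):
        for j in range(num_cores):
            cost[i][j] = abs(i - j) * beta
    return cost

# Eight cores; beta = 0.305 ns per unit distance, the SPM latency
# constant used for the remote time cost in the experiments.
time_cost = remote_access_costs(8, 0.305)
```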
b) Build cost table: After all R_{C_i S_j} and W_{C_i S_j} have been computed, the cost table T can be built. Let there be C + 1 columns in table T, indicating the C + 1 different placements for a data item. The cost of each data item in each location is computed as shown in (1), which is the sum of the executing cost and the placement cost. To compute the executing cost of a data item, we first count the number of memory accesses to the item and multiply it by the cost of each access, for every core. Then, all cores' costs are summed up for the item to obtain the executing cost. The placement cost is the cost of moving the item from its initial placement to the new placement, which includes a read operation from the original memory and a write operation to the target memory. If the item is not moved, the placement cost is 0.
\[
\text{cost of } D_j \text{ in placement } l = \text{executing cost} + \text{placement cost}, \qquad l = 1, 2, \ldots, C + 1
\tag{1}
\]
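The cost-table construction described above can be sketched as follows. This is a minimal illustration under our own conventions: locations 0..C-1 denote the SPMs of cores 0..C-1, location C denotes the main memory, and all parameter names are assumptions.

```python
def build_cost_table(num_data, num_cores, n_access, unit_cost, move_cost, init_loc):
    """Build a (num_data) x (num_cores + 1) cost table.

    n_access[d][c]      : number of accesses core c makes to data item d
    unit_cost[c][l]     : cost of one access by core c to location l
    move_cost[src][dst] : one read from src plus one write to dst
    init_loc[d]         : initial location of data item d
    """
    table = []
    for d in range(num_data):
        row = []
        for loc in range(num_cores + 1):
            # Executing cost: every core's accesses, priced at this location.
            exec_cost = sum(n_access[d][c] * unit_cost[c][loc]
                            for c in range(num_cores))
            # Placement cost: zero if the data item is not moved.
            place_cost = 0 if loc == init_loc[d] else move_cost[init_loc[d]][loc]
            row.append(exec_cost + place_cost)
        table.append(row)
    return table
```

For example, with one core, one data item initially in main memory (location 1), 10 accesses at unit cost 1 (SPM) or 5 (memory), and a move cost of 6, the table holds 10*1 + 6 = 16 for the SPM column and 10*5 + 0 = 50 for the memory column.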
c) Dynamic programming scheme: After the cost table is built, in the second step, a dynamic programming algorithm, shown in (2), is proposed to find the data placement.
First, a (C + 1)-dimensional dynamic programming table CostMin is constructed as follows: the first dimension of the table represents data items; each of the other dimensions represents the available space on a certain core's SPM, assuming the shared main memory is large enough to hold all the data items of the program.
Let CostMin[j, i_1, i_2, ..., i_C] be the minimal cost of memory accesses when the placement of data j (j <= N_d, where N_d is the number of data items for the current region) has been optimally determined while the rest of the data k (j < k <= N_d) are in the main memory, and there are i_1 empty memory units on the SPM of core 1, i_2 empty memory units on the SPM of core 2, ..., and i_C empty memory units on the SPM of core C.
The complexity of the RDPM algorithm is polynomial. The total number of iterations is N_d x Size_{S_1} x Size_{S_2} x ... x Size_{S_C}. Since the architecture is determined, C is a constant. Inside the innermost loop, there are C + 1 if/else branches to decide the value of the current cell. Thus, the complexity of RDPM is O(n^{C+1}).
B. Data Duplication on Multicore Systems
In multicore systems, traditionally, a data item only has one
copy in either one of the SPMs or in the main memory. It is
common that multiple cores may access the same data in one
parallel region. In this section, a data duplication strategy for
read-only data on SPMs of multiple cores is proposed. First,
the duplication mechanism in this paper is introduced. Then,
814 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 32, NO. 6, JUNE 2013
\[
\mathrm{CMD}[j, i_1, i_2, \ldots, i_C] =
\begin{cases}
\displaystyle\sum_{j'=1}^{N_d} \mathrm{Cost}_M(D_{j'}), & \text{if } i_k = \mathrm{Size}_{S_k} \ \forall k \in \{1, \ldots, C\} \\[6pt]
\displaystyle\sum_{j'=2}^{N_d} \mathrm{Cost}_M(D_{j'}) + \mathrm{Cost}_{S_p}(D_1), & \text{if } j = 1,\ i_p = \mathrm{Size}_{S_p} - 1, \text{ and } i_k = \mathrm{Size}_{S_k} \ \forall k \in \{1, \ldots, C\} \setminus \{p\} \\[6pt]
\min\bigl(\mathrm{CMD}[j-1, i_1, i_2, \ldots, i_C], \\
\quad \mathrm{CMD}[j-1, i_1+1, i_2, \ldots, i_C] - \mathrm{Cost}_M(D_j) + \mathrm{Cost}_{S_1}(D_j), \\
\quad \mathrm{CMD}[j-1, i_1, i_2+1, \ldots, i_C] - \mathrm{Cost}_M(D_j) + \mathrm{Cost}_{S_2}(D_j), \\
\quad \ldots, \\
\quad \mathrm{CMD}[j-1, i_1, i_2, \ldots, i_C+1] - \mathrm{Cost}_M(D_j) + \mathrm{Cost}_{S_C}(D_j), \\
\quad \mathrm{CMD}[j-1, i_1+1, i_2+1, \ldots, i_C] - \mathrm{Cost}_M(D_j) + \mathrm{Cost}_{S_{1,2}}(D_j), \\
\quad \mathrm{CMD}[j-1, i_1, i_2+1, i_3+1, \ldots, i_C] - \mathrm{Cost}_M(D_j) + \mathrm{Cost}_{S_{2,3}}(D_j), \\
\quad \ldots, \\
\quad \mathrm{CMD}[j-1, i_1, \ldots, i_p+1, \ldots, i_q+1, \ldots, i_C] - \mathrm{Cost}_M(D_j) + \mathrm{Cost}_{S_{p,q}}(D_j) \\
\qquad \text{(all two-copy combinations, } p, q \in \{1, 2, \ldots, C\}\text{)}, \\
\quad \ldots, \\
\quad \mathrm{CMD}[j-1, i_1, \ldots, i_p+1, \ldots, i_q+1, \ldots, i_r+1, \ldots, i_C] - \mathrm{Cost}_M(D_j) + \mathrm{Cost}_{S_{p,q,r}}(D_j) \\
\qquad \text{(all three-copy combinations, } p, q, r \in \{1, 2, \ldots, C\}\text{)}, \\
\quad \ldots \text{(up to all combinations of } t \text{ copies, where } t \text{ is the maximum number of cores allowed to share SPMs)}\bigr), \\
\qquad \text{if } \mathrm{Cost}_{S_{\text{combination}}}(D_j) \neq \infty \text{ and } \displaystyle\sum_{k=1}^{C} i_k + t \ge \sum_{k=1}^{C} \mathrm{Size}_{S_k} - j \\[6pt]
\infty, & \text{if } \displaystyle\sum_{k=1}^{C} i_k + t < \sum_{k=1}^{C} \mathrm{Size}_{S_k} - j, \text{ or } \mathrm{Cost}_{S_{\text{combination}}}(D_j) = \infty, \text{ or } \exists k \in \{1, \ldots, C\}: i_k > \mathrm{Size}_{S_k}
\end{cases}
\tag{3}
\]
TABLE IV
System Specification for the Eight-Core Architecture

Component    | Description
CPU core     | Number of cores: 8; frequency: 1.0 GHz
SRAM SPM     | Size: 8 kB; access latency: 0.305 ns; access energy: 0.014 nJ
Main memory  | DDR SDRAM; size: 512 MB; access latency: 19.51 ns; access energy: 0.607 nJ
how to integrate data duplication into the data placement algo-
rithm presented in the previous section in order to determine
the best data placement and duplication is shown.
When there is no data duplication mechanism, a data item that is intensively accessed by multiple cores incurs many remote accesses. Wherever the item is placed, only one core benefits from having it in local SPM. The data duplication method solves this problem by allocating a copy of the data item to each SPM that may benefit from it. As a result, the time and energy cost incurred by remote data accesses is reduced.
In exclusive copy mode, there is no need to worry about data consistency. With data duplication, however, data consistency becomes a key issue: inconsistency can occur when multiple cores want to write to the same data. In write-heavy applications, duplicating to-be-written data may be beneficial with a well-designed data consistency protocol. However, the overhead of maintaining data consistency may offset the benefits of duplicating written data. Therefore, in this paper, only read-only data is allowed to be duplicated.
The first step in integrating data duplication into RDPM is modifying the cost tables. In Section V-A, each data item has a cost for each memory placement in the cost table. With data duplication, each data item can be duplicated in multiple SPMs, so there are many different possible ways of duplication for the same item. All the possible data duplication configurations need to be considered. In the cost table, a new column is added for each possible way of duplicating a data item. For instance, in the motivational example, there is a possibility
GUO et al.: DATA PLACEMENT AND DUPLICATION FOR EMBEDDED MULTICORE SYSTEMS WITH SPM 815
TABLE V
Comparison of Time Cost Among Various Algorithms on the Eight-Core System
Benchmarks Che (J) Uday (J) RDPM (J) Imprv-Che (%) Imprv-Uday (%) RDPM-DUP (J) Imprv-Che (%) Imprv-Uday (%)
basicmath 12696.38 10434.65 6260.70 50.69 40.01 5858.08 53.86 43.85
bitcount 751.55 672.14 388.08 48.36 42.26 283.28 62.31 57.85
qsort 19005.03 12889.47 9540.48 49.80 25.98 8558.41 54.97 33.60
susan 3246.32 1806.13 1331.83 58.97 26.26 1131.25 65.15 37.36
dijkstra 914.73 686.12 416.91 54.42 39.24 351.56 61.57 48.76
patricia 10853.42 6618.85 5059.12 53.39 23.57 4408.67 59.38 33.39
stringsearch 1712.88 1098.17 827.56 51.69 24.64 675.50 60.56 38.49
rijndael 13513.36 8354.76 5895.92 56.37 29.43 5534.15 59.05 33.76
SHA 7362.61 4864.20 3347.06 54.54 31.19 3066.68 58.34 36.95
CRC32 6279.95 4172.32 2629.81 58.12 36.97 2475.66 60.58 40.66
FFT 5363.44 4126.83 2481.25 53.73 39.89 2274.12 57.60 44.91
Average 53.64 32.68 59.40 40.87
TABLE VI
Comparison of Energy Cost Among Various Algorithms on the Eight-Core System
Benchmarks Che (J) Uday (J) RDPM (J) Imprv-Che (%) Imprv-Uday (%) RDPM-DUP (J) Imprv-Che (%) Imprv-Uday (%)
basicmath 295.87 235.48 158.78 46.33 32.57 149.27 49.55 36.61
bitcount 43.31 40.07 16.13 62.75 59.74 13.57 68.67 66.13
qsort 515.03 291.58 212.35 58.77 27.30 210.84 59.06 27.70
susan 114.23 62.81 45.65 60.03 27.31 43.62 61.82 30.55
dijkstra 44.88 28.97 20.65 54.01 28.71 18.26 59.33 36.98
patricia 430.52 307.66 205.06 52.37 33.35 186.85 56.59 39.26
stringsearch 88.46 60.57 45.44 48.63 24.99 43.08 51.30 28.87
rijndael 310.65 205.62 134.54 56.69 34.57 125.83 59.50 38.81
SHA 219.45 176.57 119.23 45.67 32.48 110.55 49.62 37.39
CRC32 190.27 142.70 97.55 48.73 31.64 91.26 52.03 36.05
FFT 206.50 165.42 106.49 48.43 35.62 97.03 53.01 41.34
Average 52.95 33.47 - 56.41 38.15
that each data item has two copies, and the copies are in SPM_1 and SPM_2. If any data item cannot have multiple copies according to the restriction, its costs in the new columns are set to infinity.
Second, we define a new (C + 1)-dimensional dynamic programming table CMD for the duplication method. Let CMD[j, i_1, i_2, ..., i_C] be the minimal memory access cost when the placement and duplication of the jth data item (j <= N_d) are optimally determined, while the rest of the data is in the shared main memory, and there are i_k empty memory units on the SPM of core k (k <= C). Here, C is the number of cores.
Third, the recursive function of RDPM should be modified
for the data duplication method. The new recursive function
is shown in (3). The new equation needs to consider the cost
of all possible number of copies, and all possible places that
hold the copies.
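A sketch of how the transitions in (3) generalize those of (2): for each data item, every feasible combination of up to t SPMs is tried in addition to the single placements. This is an illustration with our own names; the dictionary combo_cost stands in for the combined Cost_{S_{p,q,...}} entries of the modified cost table, with missing entries meaning a disallowed (infinite-cost) duplication.

```python
from itertools import combinations

INF = float("inf")

def dup_transitions(prev, st, j, spm_size, cost_mem, combo_cost, t):
    """Candidate costs for placing, and possibly duplicating, data item j.

    prev        : DP table for the first j-1 items, keyed by free-space tuples
    st          : target free-space tuple (i_1, ..., i_C)
    cost_mem[j] : Cost_M(D_j)
    combo_cost  : dict mapping a frozenset of SPM indices to the combined
                  cost of holding a copy on each SPM in the set
    t           : maximum number of cores allowed to share copies
    """
    C = len(spm_size)
    yield prev[st]                          # data j stays in main memory
    for r in range(1, t + 1):               # try 1..t copies
        for combo in combinations(range(C), r):
            dup = combo_cost.get(frozenset(combo), INF)
            if dup == INF:
                continue
            donor = list(st)
            ok = True
            for k in combo:                 # each copy occupies one unit
                donor[k] += 1
                if donor[k] > spm_size[k]:
                    ok = False
                    break
            if ok:
                yield prev[tuple(donor)] - cost_mem[j] + dup
```

The outer loops over data items and free-space states stay exactly as in the RDPM sketch; each cell simply becomes the minimum of these candidates.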
VI. Experiments
Experiments are performed on a selected set of benchmarks from MiBench [36] to compare both the time and energy costs of memory accesses for four data allocation techniques: Che's algorithm [31], the Udayakumaran algorithm derived for multicore, the RDPM algorithm, and the RDPM-DUP algorithm. The experimental results show promising improvements for the algorithms proposed in this paper compared with the existing greedy algorithm.
A. Experimental Setup
All experiments are conducted on a custom simulator. The simulator is flexible for different hardware configurations. In this paper, we conduct the experiments on an eight-core system. The hardware configuration of the system is shown in Table IV. The costs of memory accesses are obtained from HP CACTI 5.3 [37]. All cores share an off-chip DRAM main memory, and any pair of cores can access each other's local SPMs.
The nondecreasing cost function f of accessing remote SPMs that we use in the experiments is a linear function f = d * beta, where d is the distance between the two cores and beta is a constant per-unit-distance cost. For the eight-core system, beta equals 0.305 ns when we compute the remote time cost and 0.014 nJ for the remote energy cost.
The benchmarks used in the experiments are from MiBench [36]. Eleven applications are selected from the MiBench benchmark suite: qsort, susan, basicmath, bitcount, dijkstra, patricia, stringsearch, rijndael, SHA, CRC32, and FFT.
The memory traces of these benchmarks are the input for the
simulator.
B. Experimental Results
In this subsection, the comparisons of time and energy cost
for the eight-core system are shown in Tables V and VI.
Tables V and VI reect the experimental results on an
eight-core environment. The average time cost reductions
are 53.64% for Che's algorithm and 32.68% for the derived Udayakumaran algorithm. The average energy cost reductions are 56.41% and 38.15%, respectively.
The reason that RDPM and RDPM-DUP are significantly better than Che's algorithm is that the goal of Che's algorithm is to achieve maximum throughput. It does not include sufficient techniques to reduce the memory access cost. Also, in Che's algorithm, all data have to be moved into the SPM before being accessed. This leads to a large number of unnecessary data movements, which significantly increases the total cost.
From the experimental results, it is easy to see that the RDPM and RDPM-DUP algorithms achieve better performance in both time latency and energy consumption than the two baseline algorithms. Furthermore, the RDPM-DUP algorithm determines the optimal number of copies for a heavily accessed data item and places the copies into appropriate SPMs. It will always generate a data placement at least as good as that of the RDPM algorithm. When there is no suitable duplication for any data item, the RDPM-DUP algorithm yields the same optimal solution as the RDPM algorithm.
VII. Conclusion
In this paper, two polynomial time regional data placement
algorithms were proposed to minimize the cost of memory
accesses for multicore systems. The RDPM algorithm can
achieve near-optimal data placement for each region with
exclusive copy, while the RDPM-DUP algorithm is able to generate near-optimal data placement and duplication when multiple copies of a single data item are allowed. Experimental results show that the proposed RDPM algorithm alone can reduce the time cost of memory accesses by 32.68% on average compared with existing algorithms. With data duplication, the RDPM-DUP algorithm further reduces the time cost by 40.87%. For energy consumption, the proposed RDPM algorithm with exclusive copy can reduce the total cost by 33.47% on average. The improvement increases to 38.15% on average when RDPM-DUP is applied.
References
[1] S. Borkar, "Thousand core chips: A technology perspective," in Proc. DAC, 2007, pp. 746–749.
[2] M. Qiu, Z. Shao, Q. Zhuge, C. Xue, M. Liu, and E. H.-M. Sha, "Efficient assignment with guaranteed probability for heterogeneous parallel DSP," in Proc. ICPADS, 2006, pp. 623–630.
[3] J. Xue, T. Liu, Z. Shao, J. Hu, Z. Jia, and E. H.-M. Sha, "Address assignment sensitive variable partitioning and scheduling for DSPs with multiple memory banks," in Proc. ICASSP, 2008, pp. 1453–1456.
[4] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, "Scratchpad memory: Design alternative for cache on-chip memory in embedded systems," in Proc. CODES, 2002, pp. 73–78.
[5] S. Gilani, N. S. Kim, and M. Schulte, "Scratchpad memory optimizations for digital signal processing applications," in Proc. DATE, 2011, pp. 1–6.
[6] S. Udayakumaran and R. Barua, "Compiler-decided dynamic memory allocation for scratch-pad based embedded systems," in Proc. CASES, 2003, pp. 276–286.
[7] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, "Scratchpad memory: Design alternative for cache on-chip memory in embedded systems," in Proc. CODES, 2002, pp. 73–78.
[8] O. Avissar, R. Barua, and D. Stewart, "An optimal memory allocation scheme for scratch-pad-based embedded systems," ACM Trans. Embed. Comput. Syst., vol. 1, no. 1, pp. 6–26, 2002.
[9] M. Kandemir, M. J. Irwin, G. Chen, and I. Kolcu, "Banked scratch-pad memory management for reducing leakage energy consumption," in Proc. ICCAD, 2004, pp. 120–124.
[10] Y. He, C. Xue, C. Xu, and E. H.-M. Sha, "Co-optimization of memory access and task scheduling on MPSoC architectures with multilevel memory," in Proc. ASP-DAC, 2010, pp. 95–100.
[11] A. Dominguez, S. Udayakumaran, and R. Barua, "Heap data allocation to scratch-pad memory in embedded systems," J. Embedded Comput., vol. 1, no. 4, pp. 521–540, 2005.
[12] S. Kaneko, H. Kondo, N. Masui, K. Ishimi, T. Itou, M. Satou, N. Okumura, Y. Takata, H. Takata, M. Sakugawa, T. Higuchi, S. Ohtani, K. Sakamoto, N. Ishikawa, M. Nakajima, S. Iwata, K. Hayase, S. Nakano, S. Nakazawa, K. Yamada, and T. Shimizu, "A 600-MHz single-chip multiprocessor with 4.8-Gb/s internal shared pipelined bus and 512-kb internal memory," IEEE J. Solid-State Circuits, vol. 39, no. 1, pp. 184–193, Jan. 2004.
[13] H. P. Hofstee, "Power efficient processor architecture and the Cell processor," in Proc. HPCA, 2005, pp. 258–262.
[14] S. Udayakumaran and R. Barua, "An integrated scratch-pad allocator for affine and non-affine code," in Proc. DATE, 2006, pp. 925–930.
[15] Y. Guo, Q. Zhuge, J. Hu, and E.-M. Sha, "Optimal data placement for memory architectures with scratch-pad memories," in Proc. ICESS, 2011, pp. 1045–1050.
[16] Q. Zhuge, Y. Guo, J. Hu, W.-C. Tseng, S. J. Xue, and E.-M. Sha, "Minimizing access cost for multiple types of memory units in embedded systems through data allocation and scheduling," IEEE Trans. Signal Process., vol. 60, no. 6, pp. 3253–3263, Jun. 2012.
[17] P. R. Panda, N. D. Dutt, and A. Nicolau, "On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems," ACM Trans. Des. Autom. Electron. Syst., vol. 5, pp. 682–704, Jul. 2000.
[18] S. Udayakumaran, A. Dominguez, and R. Barua, "Dynamic allocation for scratch-pad memory using compile-time decisions," ACM Trans. Embed. Comput. Syst., vol. 5, no. 2, pp. 472–511, 2006.
[19] M. Kandemir, M. J. Irwin, G. Chen, and I. Kolcu, "Compiler-guided leakage optimization for banked scratch-pad memories," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 13, no. 10, pp. 1136–1146, Oct. 2005.
[20] Y. Guo, Q. Zhuge, J. Hu, M. Qiu, and E.-M. Sha, "Optimal data allocation for scratch-pad memory on embedded multi-core systems," in Proc. ICPP, 2011, pp. 464–471.
[21] R. Buchty, V. Heuveline, W. Karl, and J.-P. Weiss, "A survey on hardware-aware and heterogeneous computing on multicore processors and accelerators," Concurrency Comput.: Practice Experience, vol. 24, no. 17, pp. 663–675, 2012.
[22] J. Chang and G. S. Sohi, "Cooperative cache partitioning for chip multiprocessors," in Proc. ICS, 2007, pp. 242–252.
[23] G. E. Suh, L. Rudolph, and S. Devadas, "Dynamic cache partitioning for simultaneous multithreading systems," in Proc. IASTED PDCS, 2001, pp. 116–127.
[24] L. Zhang, M. Qiu, and W.-C. Tseng, "Variable partitioning and scheduling for MPSoC with virtually shared scratch pad memory," J. Signal Process. Syst., vol. 50, no. 2, pp. 247–265, 2010.
[25] M. Kandemir, J. Ramanujam, J. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh, "Dynamic management of scratch-pad memory space," in Proc. DAC, 2001, pp. 690–695.
[26] J. Hu, C. J. Xue, W.-C. Tseng, Y. He, M. Qiu, and E. H.-M. Sha, "Reducing write activities on non-volatile memories in embedded CMPs via data migration and recomputation," in Proc. DAC, 2010, pp. 350–355.
[27] P. R. Panda, N. D. Dutt, and A. Nicolau, "Efficient utilization of scratch-pad memory in embedded processor applications," in Proc. ED&TC, 1997, p. 7.
[28] J. Hu, C. J. Xue, W.-C. Tseng, Q. Zhuge, and E. H.-M. Sha, "Minimizing write activities to non-volatile memory via scheduling and recomputation," in Proc. SASP, 2010, pp. 7–12.
[29] G. Chen, O. Ozturk, M. Kandemir, and M. Karakoy, "Dynamic scratch-pad memory management for irregular array access patterns," in Proc. DATE, 2006, pp. 931–936.
[30] J. Sjödin and C. von Platen, "Storage allocation for embedded processors," in Proc. CASES, 2001, pp. 15–23.
[31] W. Che, A. Panda, and K. S. Chatha, "Compilation of stream programs for multicore processors that incorporate scratchpad memories," in Proc. DATE, 2010, pp. 1118–1123.
[32] M. Kandemir, J. Ramanujam, and A. Choudhary, "Exploiting shared scratch pad memory space in embedded multiprocessor systems," in Proc. DAC, 2002, pp. 219–224.
[33] V. Suhendra, C. Raghavan, and T. Mitra, "Integrated scratchpad memory optimization and task scheduling for MPSoC architectures," in Proc. CASES, 2006, pp. 401–410.
[34] M. K. F. Li and G. Chen, "Improving scratch-pad memory reliability through compiler-guided data block duplication," in Proc. ICCAD, 2005, pp. 1002–1005.
[35] I. Issenin, E. Brockmeyer, B. Durinck, and N. Dutt, "Multiprocessor system-on-chip data reuse analysis for exploring customized memory hierarchies," in Proc. DAC, 2006, pp. 49–52.
[36] M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown, "MiBench: A free, commercially representative embedded benchmark suite," in Proc. WWC, 2001, pp. 3–14.
[37] S. J. E. Wilton and N. P. Jouppi, "CACTI: An enhanced cache access and cycle time model," IEEE J. Solid-State Circuits, vol. 31, no. 5, pp. 677–688, May 1996.
[38] M. Qiu and E. H.-M. Sha, "Cost minimization while satisfying hard/soft timing constraints for heterogeneous embedded systems," ACM Trans. Des. Autom. Electron. Syst., vol. 14, no. 2, pp. 1–30, Apr. 2009.
Yibo Guo received the B.S. degree in information
security from Hunan University, Hunan, China, in
2009, and the M.S. degree in computer science from
the University of Texas at Dallas, Richardson, TX,
in 2011, where he is currently pursuing the Ph.D.
degree in computer science.
His current research interests include memory
scheduling and data allocation on MPSoC.
Qingfeng Zhuge received the B.S. and M.S. degrees
in electronics engineering from Fudan University,
Shanghai, China, and the Ph.D. degree from the
Department of Computer Science at the University
of Texas at Dallas, Richardson, TX, in 2003.
She is currently a Full Professor at Chongqing Uni-
versity, Chongqing, China. She has published more
than 60 research articles in premier journals and
conferences. Her current research interests include
parallel architectures, embedded systems, supply-
chain management, real-time systems, optimization
algorithms, compilers, and scheduling.
Dr. Zhuge received the Best Ph.D. Dissertation Award in 2003.
Jingtong Hu (SM'09) received the B.E. degree from
the School of Computer Science and Technology,
Shandong University, Shandong, China, in 2007 and
the M.S. degree from the Department of Computer
Science from the University of Texas at Dallas,
Richardson, TX, in May 2010, where he is currently
pursuing the Ph.D. degree from the Department of
Computer Science.
His current research interests include low power
and high-performance embedded systems, wireless
sensor networks, memory optimization, nonvolatile
memory, and compiler optimization.
Juan Yi received the B.E. degree from the School
of Software Engineering at Chongqing University,
Chongqing, China, in 2006 and is currently pursuing
the Ph.D. degree from the Department of Computer
Science at the same university.
Her current research interests include multicore
architecture optimization and high-performance par-
allel computing.
Meikang Qiu (SM'07) received the B.E. and M.E.
degrees from Shanghai Jiao Tong University, Shang-
hai, China, and the M.S. and Ph.D. degrees in com-
puter science from the University of Texas at Dallas,
Richardson, TX, in 2003 and 2007, respectively.
He was with the Chinese Helicopter Research and
Development Institute and was also with IBM. He
is currently an Assistant Professor of ECE at the
University of Kentucky, Lexington. He also holds
three patents and has published three books. His
current research interests include embedded systems,
computer security, and wireless sensor networks.
Dr. Qiu is an ACM Senior member. He has published 160 peer-reviewed
papers, including 16 IEEE/ACM Transactions on Networking papers and more
than 60 journal papers. He is the recipient of the ACM Transactions on Design
Automation of Electronic Systems 2011 Best Paper Award. He also received
four other Best Paper Awards (IEEE EUC'09, IEEE/ACM GreenCom'10,
IEEE CSE'10, and IEEE ICESS) and one best paper nomination in the last
four years. He was named to the Navy Summer Faculty in 2012 and SFFP
Air Force Summer Faculty in 2009. His research is supported by the National
Science Foundation, Navy, and Air Force. He has held various chair positions
and served as a TPC member for many international conferences. He served
as the Program Chair of IEEE EmbeddCom'09 and EM-Com'09.
Edwin H.-M. Sha (S'88–M'92–SM'04) received
the Ph.D. degree from the Department of Com-
puter Science, Princeton University, Princeton, NJ,
in 1992.
From August 1992 to August 2000, he was with
the Department of Computer Science and Engi-
neering at the University of Notre Dame, Notre
Dame, IN. Since 2000, he has been a Tenured Full
Professor at the Department of Computer Science at
the University of Texas at Dallas, Richardson, TX.
Since 2012, he has been serving as the Dean of the
College of Computer Science at Chongqing University, Chongqing, China.
He has published more than 280 research papers in refereed conferences
and journals. He has served as an editor for many journals, on program
committees, and as a Chair for numerous international conferences.
Dr. Sha received the Teaching Award, Microsoft Trustworthy Computing
Curriculum Award, NSF CAREER Award, NSFC Overseas Distinguished
Young Scholar Award, and Chang Jiang Honorary Chair Professorship. He
is a member of the China Thousand-Talent Program.