Thread to strand binding of parallel network applications in massive multi-threaded systems

Authors:

Petar Radojković,

Vladimir Čakarević,

Francisco J. Cazorla,

Mario Nemirovsky,

Mateo Valero

PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 191 - 202

https://doi.org/10.1145/1693453.1693480

Published: 09 January 2010

Abstract

In processors with several levels of hardware resource sharing, such as CMPs in which each core is an SMT, the scheduling process becomes more complex than in processors with a single level of resource sharing, such as pure-SMT or pure-CMP processors. Once the operating system selects the set of applications to schedule simultaneously on the processor (the workload), each application/thread must be assigned to one of the hardware contexts (strands). We call this last scheduling step the Thread to Strand Binding, or TSB. In this paper, we show that the TSB has a high impact on the performance of processors with several levels of shared resources. We measure a variation of up to 59% between different TSBs of real multithreaded network applications running on the UltraSPARC T2 processor, which has three levels of resource sharing. In our view, this problem will become more acute in future multithreaded architectures comprising more cores, more contexts per core, and more levels of resource sharing.

We propose a resource-sharing-aware TSB algorithm (TSBSched) that significantly simplifies the problem of thread to strand binding for software-pipelined applications, which are representative of multithreaded network applications. Our systematic approach captures both the characteristics of the multithreaded processor under study and the structure of the software-pipelined applications. Once calibrated for a given processor architecture, our proposal requires neither hardware knowledge on the part of the programmer nor extensive profiling of the application. We validate our algorithm on the UltraSPARC T2 processor running a set of real multithreaded network applications, on which we report improvements of up to 46% compared to current state-of-the-art dynamic schedulers.
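To make the notion of a binding concrete, the following minimal Python sketch classifies the hardware level at which two strands share resources and summarizes a binding by counting thread pairs per level. It is not the TSBSched algorithm. The topology (8 cores, 2 pipelines per core, 4 strands per pipeline), the flat strand numbering, the level labels, and the thread names `rx`, `proc`, and `tx` are all assumptions made for illustration and are not taken from the abstract.

```python
from itertools import combinations

# Assumed topology for illustration (not stated in the abstract):
# 8 cores x 2 hardware pipelines per core x 4 strands per pipeline.
CORES = 8
PIPES_PER_CORE = 2
STRANDS_PER_PIPE = 4
STRANDS_PER_CORE = PIPES_PER_CORE * STRANDS_PER_PIPE


def locate(strand):
    """Map a flat strand id to a (core, pipeline) pair under the assumed layout."""
    if not 0 <= strand < CORES * STRANDS_PER_CORE:
        raise ValueError(f"strand id out of range: {strand}")
    core = strand // STRANDS_PER_CORE
    pipe = (strand % STRANDS_PER_CORE) // STRANDS_PER_PIPE
    return core, pipe


def sharing_level(a, b):
    """Return the closest level at which two distinct strands share hardware."""
    if a == b:
        raise ValueError("two threads cannot be bound to the same strand")
    core_a, pipe_a = locate(a)
    core_b, pipe_b = locate(b)
    if core_a != core_b:
        return "different cores"
    if pipe_a != pipe_b:
        return "same core"
    return "same pipeline"


def summarize(binding):
    """Count thread pairs per sharing level for a {thread: strand} binding."""
    counts = {"same pipeline": 0, "same core": 0, "different cores": 0}
    for (_, s1), (_, s2) in combinations(sorted(binding.items()), 2):
        counts[sharing_level(s1, s2)] += 1
    return counts


# Two hypothetical bindings of a three-stage software pipeline.
packed = {"rx": 0, "proc": 1, "tx": 2}   # all threads in one pipeline
spread = {"rx": 0, "proc": 8, "tx": 16}  # one thread per core

print("packed:", summarize(packed))
print("spread:", summarize(spread))
```

The two example bindings place the same three threads either in a single pipeline or on three separate cores, which are the two extremes of contention for shared resources under this assumed layout.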

Index Terms

Thread to strand binding of parallel network applications in massive multi-threaded systems
1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies

Published In

PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

January 2010

372 pages

ISBN:9781605588773

DOI:10.1145/1693453

General Chairs: R. Govindarajan (Indian Institute of Science), David Padua (UIUC)
Program Chair: Mary Hall (University of Utah)

ACM SIGPLAN Notices Volume 45, Issue 5
PPoPP '10
May 2010
346 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1837853

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PPoPP '10

Sponsor:

SIGPLAN

PPoPP '10: ACM SIGPLAN Principles and Practice of Parallel Computing

January 9 - 14, 2010

Bangalore, India

Acceptance Rates

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Bibliometrics

Total Citations: 10
Total Downloads: 392
Downloads (Last 12 months): 8
Downloads (Last 6 weeks): 1

Reflects downloads up to 16 Oct 2024

Cited By

Radojkovic P, Carpenter P, Moreto M, Cakarevic V, Verdu J, Pajuelo A, Cazorla F, Nemirovsky M, Valero M (2016). Thread Assignment in Multicore/Multithreaded Processors: A Statistical Approach. IEEE Transactions on Computers 65:1 (256-269). Online publication date: 1-Jan-2016.
https://dl.acm.org/doi/10.1109/TC.2015.2417533

Moore R, Childers B, Fettweis G, Nebel W (2014). Program affinity performance models for performance and utilization. Proceedings of the conference on Design, Automation & Test in Europe (1-4). Online publication date: 24-Mar-2014.
https://dl.acm.org/doi/10.5555/2616606.2616634

Zhang Y, Zhao L, Illikkal R, Iyer R, Herdrich A, Peng L (2014). QoS management on heterogeneous architecture for parallel applications. 2014 IEEE 32nd International Conference on Computer Design (ICCD) (332-339). Online publication date: Oct-2014.
https://doi.org/10.1109/ICCD.2014.6974702

Radojkovic P, Cakarevic V, Verdu J, Pajuelo A, Cazorla F, Nemirovsky M, Valero M (2013). Thread Assignment of Multithreaded Network Applications in Multicore/Multithreaded Processors. IEEE Transactions on Parallel and Distributed Systems 24:12 (2513-2525). Online publication date: 1-Dec-2013.
https://dl.acm.org/doi/10.1109/TPDS.2012.311

Moore R, Childers B (2013). Automatic generation of program affinity policies using machine learning. Proceedings of the 22nd international conference on Compiler Construction (184-203). Online publication date: 16-Mar-2013.
https://dl.acm.org/doi/10.1007/978-3-642-37051-9_10

Radojković P, Čakarević V, Moretó M, Verdú J, Pajuelo A, Cazorla F, Nemirovsky M, Valero M (2012). Optimal task assignment in multithreaded processors. ACM SIGPLAN Notices 47:4 (235-248). Online publication date: 3-Mar-2012.
https://dl.acm.org/doi/10.1145/2248487.2151002

Radojković P, Čakarević V, Moretó M, Verdú J, Pajuelo A, Cazorla F, Nemirovsky M, Valero M (2012). Optimal task assignment in multithreaded processors. ACM SIGARCH Computer Architecture News 40:1 (235-248). Online publication date: 3-Mar-2012.
https://dl.acm.org/doi/10.1145/2189750.2151002

Radojković P, Čakarević V, Moretó M, Verdú J, Pajuelo A, Cazorla F, Nemirovsky M, Valero M, Harris T, Scott M (2012). Optimal task assignment in multithreaded processors. Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems (235-248). Online publication date: 3-Mar-2012.
https://dl.acm.org/doi/10.1145/2150976.2151002

Tang L, Mars J, Vachharajani N, Hundt R, Soffa M (2011). The impact of memory subsystem resource sharing on datacenter applications. ACM SIGARCH Computer Architecture News 39:3 (283-294). Online publication date: 4-Jun-2011.
https://dl.acm.org/doi/10.1145/2024723.2000099

Tang L, Mars J, Vachharajani N, Hundt R, Soffa M, Iyer R, Yang Q, González A (2011). The impact of memory subsystem resource sharing on datacenter applications. Proceedings of the 38th annual international symposium on Computer architecture (283-294). Online publication date: 4-Jun-2011.
https://dl.acm.org/doi/10.1145/2000064.2000099
