A Study On Hyper-Threading: Vimal Reddy Ambarish Sule Aravindh Anantaraman

Microarchitectural trends
- Higher degrees of instruction-level parallelism
- Different generations:
  I. Serial Processors: fetch and execute each instruction back to back
  II. Pipelined Processors: overlap different phases of instruction processing for higher throughput
  III. Superscalar Processors: overlap different phases of instruction processing, and issue and execute multiple instructions in parallel for IPC > 1
  IV. ???
Superscalar limits
Limitations with superscalar approach:
- Amount of ILP in most programs is limited
- Nature of ILP in programs can be bursty
- Bottom line: resources can be utilized better
Simultaneous Multithreading
- Finds parallelism at thread level
- Executes multiple instructions from multiple threads each cycle
- No significant increase in chip area over a superscalar processor
[Figure: SMT pipeline block diagram. Blocks: PC, Fetch Unit, Instruction Cache, Decode, Register Renaming, Int. queue, FP queue, Int. Registers, FP Registers, Int. + load/store units, FP units, Data Cache. Annotations for SMT support: multiple PCs; thread selection; replicate RAS; BTB thread ids; multiple rename map tables; multiple arch. map tables; multiple active lists; replicate architectural state; selective squash; per-thread disambiguation.]
From ece721 notes, Prof. Eric Rotenberg, NCSU
Hyper-Threading
- Brings the benefits of Simultaneous Multithreading (SMT) to Intel Architecture
- Motivation (same as that for SMT):
  - High processor utilization
  - Better throughput (by exploiting thread-level parallelism, TLP)
  - Power efficient due to smaller processor cores compared to CMP
Hyper-Threading Contd.
- 2 logical processors (2 threads in SMT terminology)
- Shared instruction trace cache and L1 D-cache
- 2 PCs and 2 register renamers
- Other resources partitioned equally between 2 threads
- Recombines shared resources when single threaded (no degradation of single thread performance)
Intel NetBurst Microarchitecture Pipeline With Hyper-Threading Technology
Project Goal
- Measure performance of micro-benchmarks (kernels) on Pentium 4.
- Form workloads to utilize different processor resources and study behavior.
Pentium 4 Functional Units
- 3 integer ALU units (2 double speed)
- 1 unit for floating point computation
- Separate address generator units for loads and stores
Micro-benchmarks
Created 3 types of kernels:
- Floating point intensive kernel (flt)
  - Performs FP add, subtract, multiply and divide operations a large number of times
  - Targets the single FP unit
- Integer intensive kernel (int)
  - Performs integer add, subtract and shift a large number of times
  - Targets the integer units (2 double speed and 1 slow)
- Memory intensive kernel (mem, mem_s)
  - Dynamically allocates a linked list larger than the L1 D$ and traverses it
  - Targets the shared data cache and the memory hierarchy in general
Micro-benchmarks (contd.)
Integer kernel
Floating Point kernel
Memory intensive kernel
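
The kernel sources themselves are not included in the text of these slides. The following is a minimal sketch of what the three kernels could look like, based only on the descriptions above. It is written in Python purely to show the loop structure; the function names, operand values and the 32-bit masking are hypothetical, and an interpreted loop does not stress the functional units the way a native kernel would.

```python
MASK32 = 0xFFFFFFFF  # assumption: emulate 32-bit integer registers


def int_kernel(iterations):
    """Integer-intensive loop: add, subtract and shift, repeated many times."""
    acc = 1
    for i in range(iterations):
        acc = (acc + i) & MASK32                     # add
        acc = (acc - 3) & MASK32                     # subtract
        acc = ((acc << 1) | (acc >> 31)) & MASK32    # shift (rotate left by 1)
    return acc


def flt_kernel(iterations):
    """Floating-point-intensive loop: add, subtract, multiply, divide."""
    x = 1.0
    for _ in range(iterations):
        x = x + 1.5
        x = x - 0.5
        x = x * 1.0001
        x = x / 1.0001
    return x


class Node:
    """One element of a singly linked list."""
    __slots__ = ("value", "next")

    def __init__(self, value):
        self.value = value
        self.next = None


def build_list(n_nodes):
    """Dynamically allocate a linked list holding 0 .. n_nodes-1."""
    head = None
    for value in range(n_nodes - 1, -1, -1):
        node = Node(value)
        node.next = head
        head = node
    return head


def mem_kernel(head, passes):
    """Memory-intensive loop: walk the whole list `passes` times."""
    total = 0
    for _ in range(passes):
        node = head
        while node is not None:
            total += node.value
            node = node.next
    return total
```

In this sketch the difference between mem and mem_s would only be the value of `n_nodes`; the list sizes actually used are not given in the slides.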
Workbench
- Machine: Pentium 4 Northwood, 2.53-2.66 GHz, with Hyper-Threading
- Operating system: Linux 2.4.18-SMP kernel. OS views each thread as a processor
- BIOS setting to turn HT on/off
- Perl script to fork processes at the same time
- top (Linux utility) to monitor processes (processor and memory utilization)
- time utility to get timing statistics for each program
- Ran each experiment 10 times and took the average execution time
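
The launch-and-average procedure above can be sketched as follows. This is not the Perl script used in the study, which is not shown in the slides; it is a Python stand-in under stated assumptions. The helper names `run_workload` and `average_time` are hypothetical, and the sketch measures wall-clock time until the last process exits rather than the per-program statistics that the time utility reports.

```python
import subprocess
import sys
import time


def run_workload(commands):
    """Start all commands at nearly the same time and wait for all of them.

    Returns the wall-clock seconds from the first launch until the last exit.
    """
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for cmd in commands]
    for proc in procs:
        proc.wait()
    return time.perf_counter() - start


def average_time(commands, runs=10):
    """Repeat the workload `runs` times and return the mean elapsed time."""
    times = [run_workload(commands) for _ in range(runs)]
    return sum(times) / len(times)


# Stand-in command so the sketch runs anywhere; a real experiment would
# launch the kernel binaries instead (their names are not given in the slides).
NOOP = [sys.executable, "-c", "pass"]

if __name__ == "__main__":
    print(average_time([NOOP, NOOP], runs=3))
```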
Methodology
- Run different workload combinations:
  - fltflt: 2 floating point kernels
  - mem_smem_s: 2 small memory intensive kernels
  - intflt: 1 integer and 1 float kernel
  - and so on
- Run in 3 modes:
  1. back-to-back: run each program individually
  2. HT off: no Hyper-Threading, but OS context switching
  3. HT on: Hyper-Threading on and OS context switching
- Find contending workloads: compete for resources and degrade performance (increase execution time with HT on)
- Find complementary workloads: utilize idle resources and increase performance (decrease execution time with HT on)
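
A minimal sketch of the contending/complementary distinction, assuming the comparison is made between the HT-on time and a baseline time for the same workload (the slides do not say whether the baseline is the back-to-back or the HT-off run). The function name `classify`, the 5% tolerance and the "neutral" label are assumptions added for illustration.

```python
def classify(baseline_seconds, ht_on_seconds, tolerance=0.05):
    """Label a workload by how HT changes its execution time.

    A workload whose time rises by more than `tolerance` is contending,
    one whose time drops by more than `tolerance` is complementary,
    and anything in between is treated as neutral.
    """
    change = (ht_on_seconds - baseline_seconds) / baseline_seconds
    if change > tolerance:
        return "contending"
    if change < -tolerance:
        return "complementary"
    return "neutral"


if __name__ == "__main__":
    # Hypothetical timings, not measurements from the study.
    print(classify(100.0, 120.0))
    print(classify(100.0, 80.0))
```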
Experiments: Single thread performance
[Figure: Single thread performance. Execution time (seconds) of the int, flt, mem_s and mem kernels with HT off and HT on.]
Hyper-Threading does not degrade single thread performance
Experiments (Contd.)
[Figure: Contending workloads. Execution time (seconds) of the fltflt, fltfltflt and mem_smem_s workloads run back-to-back, with HT off and with HT on.]
- Contention for single FP unit increases execution time
- Contention for data cache can lead to thrashing
[Figure: Complementary workloads. Execution time (seconds) run back-to-back, with HT off and with HT on; the workload labels on the x-axis are not recoverable from the extracted text.]
- Integer workloads perform well: 3 integer units (2 double speed) are well utilized
- Workloads with complementary resource requirements perform well (intflt, memint)
- OS plays an important role when number of programs > number of hardware contexts available
[Figure: Noteworthy results. Execution time (seconds) of the fltflt, intfltflt, memflt and memintflt workloads with HT off and HT on.]
Experiments (contd.)
[Figure: Interesting results with Hyper-Threading. Execution time of the fltflt, intfltflt, fltfltint, memflt and memintflt workloads.]
- Execution time with a 3-kernel workload is less than that for a 2-kernel workload!
- Scheduling is important!
  - intfltflt: int kernel has 100% of 1 thread, 50:50 between flt and flt
  - fltfltint: flt kernel has 100% of 1 thread, 50:50 between int and flt. Has higher execution time!
Project Goal
Model Hyper-Threading on a simulator. Vary key parameters and study first order effects
Simulator details
- Execution-driven, cycle-accurate simulator based on the SimpleScalar toolset
- Extended the simulator to model SMT and Hyper-Threading:
  - Resource sharing by tagging thread id (I$, D$)
  - Resource replication through multiple instantiation (PC, map tables, branch history, RAS)
  - Resource partitioning by having separate instances but imposing a global limit on entries (active list, load/store buffers, IQs)
  - Stop simulation after completion of all threads
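
The partitioning scheme (separate per-thread instances under a global entry limit) can be sketched as below. This is an illustration under stated assumptions, not the simulator's code: the class name `PartitionedQueue` and its methods are hypothetical, and the optional per-thread cap is added here to model a fixed split such as 63+63; setting it equal to the global limit leaves only the global limit in force.

```python
from collections import deque


class PartitionedQueue:
    """Per-thread FIFO instances that together obey one global entry limit."""

    def __init__(self, num_threads, global_limit, per_thread_limit=None):
        self.queues = [deque() for _ in range(num_threads)]
        self.global_limit = global_limit
        # Assumption: with no explicit cap, a thread may use the whole structure.
        self.per_thread_limit = (
            global_limit if per_thread_limit is None else per_thread_limit
        )

    def occupancy(self):
        """Total entries currently held by all threads."""
        return sum(len(q) for q in self.queues)

    def can_allocate(self, tid):
        """True if thread `tid` may take one more entry this cycle."""
        return (
            len(self.queues[tid]) < self.per_thread_limit
            and self.occupancy() < self.global_limit
        )

    def allocate(self, tid, entry):
        """Insert `entry` for thread `tid`; False means the thread must stall."""
        if not self.can_allocate(tid):
            return False
        self.queues[tid].append(entry)
        return True

    def release(self, tid):
        """Retire the oldest entry of thread `tid`."""
        return self.queues[tid].popleft()
```

For example, `PartitionedQueue(2, 126, per_thread_limit=63)` would behave like a 126-entry structure split 63+63, while `PartitionedQueue(2, 126)` lets either thread fill all 126 entries.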
Simulator details
Feature                           | Pentium 4                                     | Simulator
ISA                               | x86                                           | SimpleScalar (MIPS like)
Branch misprediction pipeline     | 20 stage                                      | 20 stage (used dummy stages)
Bandwidths                        | Fetch=3, Dispatch=3, Issue=6                  | Same
Rename map table                  | Replicated                                    | Same
Architecture map table            | Replicated                                    | Same
MEM IQ and ALU IQ                 | Partitioned                                   | Same
Store buffers (24)                | Partitioned (12+12)                           | Same
Load buffers (48)                 | Partitioned (24+24)                           | Same
Unified L2 cache                  | 8-way set assoc., 128 Byte lines, 256 KB      | No L2 cache
Fetch unit                        | Shared, RR.1.3                                | Same (RR.2.3, ICOUNT, BRCOUNT, MISSCOUNT)
Instruction cache                 | Trace cache (12K micro-ops, 6 per trace line) | Shared L1 I$
Branch history register           | Replicated                                    | Same
Branch predictor table            | Shared (algorithm unknown)                    | Shared (Gshare, 2K entries)
Program counters                  | Replicated                                    | Same
Return address stack              | Replicated                                    | Same
ROB (126)                         | Partitioned (63+63)                           | Shared (126)
L1 data cache                     | Shared, 4-way set assoc., 64 Byte lines, 8KB  | Same
Double speed ALU/functional units | Yes                                           | No
Simulator SMT/HT validation
[Figure: Validation of SMT/HT results for GCC. IPC versus number of GCC threads (1 to 4); one label in the plot reads "Hyper-Threading".]
Experiment: Modeling L1 data cache interference
[Figure: Studying data cache interference. Percentage of total misses split into interference misses and actual misses, shown for loads and stores of GCC and GO.]
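
The slides do not define how a miss is counted as an interference miss. One possible definition, used in the sketch below as an assumption, is: a miss in the shared cache that would have been a hit if the same thread had the cache to itself. The class `LRUCache` and function `count_misses` are hypothetical names; the default geometry (32 sets, 4 ways, 64-byte lines, 8 KB in total) follows the L1 data cache parameters listed for the Pentium 4, and LRU replacement is assumed.

```python
from collections import OrderedDict


class LRUCache:
    """Set-associative cache with LRU replacement and thread-id tagged lines."""

    def __init__(self, num_sets, ways, line_size):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, tid, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = addr // self.line_size
        lines = self.sets[block % self.num_sets]
        key = (tid, block)                 # the thread id is part of the tag
        if key in lines:
            lines.move_to_end(key)         # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)      # evict the least recently used line
        lines[key] = True
        return False


def count_misses(trace, num_threads, num_sets=32, ways=4, line_size=64):
    """Replay (tid, addr) accesses; return (total misses, interference misses).

    Each access goes to the shared cache and to a private cache of the same
    geometry owned by that thread alone. A shared-cache miss that hits in the
    private cache is counted as an interference miss.
    """
    shared = LRUCache(num_sets, ways, line_size)
    private = [LRUCache(num_sets, ways, line_size) for _ in range(num_threads)]
    total = interference = 0
    for tid, addr in trace:
        hit_shared = shared.access(tid, addr)
        hit_private = private[tid].access(tid, addr)
        if not hit_shared:
            total += 1
            if hit_private:
                interference += 1
    return total, interference
```

With a one-line cache, two threads that alternate accesses evict each other every time: the trace `[(0, 0), (1, 0), (0, 0), (1, 0)]` with `num_sets=1, ways=1` gives 4 misses, of which 2 are interference misses under this definition.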
Experiment: Modeling issue queue partitioning
[Figure: Varying ALU and MEM IQ for total IQ size of 64. IPC for (ALU IQ, MEM IQ) splits of (10,54), (20,44), (32,32), (40,24) and (50,14), for the compress_gcc, gcc_go, gcc_perl and go_perl workloads.]
Experiment: Modeling total issue queue size with partitioning
[Figure: Varying MEM IQ and ALU IQ sizes and total IQ size. IPC for (%ALU, %MEM) splits of (15%, 85%), (31%, 69%), (50%, 50%), (62.5%, 37.5%) and (78%, 22%), for compress_gcc, gcc_go and gcc_perl at total IQ sizes of 64 and 126.]
Experiment: Varying Load/Store buffer sizes (Pentium4: 48 Load, 24 Store)
[Figure: Varying load/store buffer sizes. IPC for Thread 1 and IPC for Thread 2 at buffer sizes of 24, 36, 48, 60 and 72, for the GCC-PERL, COMP-GCC, GO-PERL and GCC-GO workloads.]
Experiment: Comparison of fetch policies
[Figure: Comparison of fetch policies. IPC for Thread 1 and IPC for Thread 2 under the RR, ICOUNT, BRCOUNT and MISSCOUNT policies, for the GCC-GO, GCC-PERL, COMP-GCC and GO-PERL workloads.]
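
The four fetch policies compared here carry the names used by Tullsen et al. [4]. A minimal sketch of the thread-selection step is given below, assuming the usual reading of those names: RR rotates between threads, ICOUNT favors the thread with the fewest instructions in the front end and issue queues, BRCOUNT the thread with the fewest unresolved branches, and MISSCOUNT the thread with the fewest outstanding data cache misses. The `ThreadState` fields, the function names and the round-robin tie-break are assumptions; how the simulator in this study implements the policies is not shown in the slides.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    """Per-thread counters a fetch unit could consult each cycle."""
    tid: int
    in_flight: int = 0            # instructions in decode, rename and issue queues
    unresolved_branches: int = 0  # branches fetched but not yet resolved
    outstanding_misses: int = 0   # data cache misses still pending


def select_round_robin(threads, cycle):
    """RR: rotate the fetching thread every cycle."""
    return threads[cycle % len(threads)].tid


def select_by_count(threads, cycle, counter):
    """Pick the thread with the smallest `counter`; ties go round-robin."""
    n = len(threads)
    # Rotate the scan order so equal counts do not always favor thread 0.
    order = [threads[(cycle + i) % n] for i in range(n)]
    return min(order, key=lambda t: getattr(t, counter)).tid


POLICIES = {
    "RR": select_round_robin,
    "ICOUNT": lambda threads, cycle: select_by_count(threads, cycle, "in_flight"),
    "BRCOUNT": lambda threads, cycle: select_by_count(
        threads, cycle, "unresolved_branches"
    ),
    "MISSCOUNT": lambda threads, cycle: select_by_count(
        threads, cycle, "outstanding_misses"
    ),
}
```

Each entry in `POLICIES` takes the list of thread states and the current cycle number and returns the id of the thread to fetch from, so a simulator loop could switch policies by name.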
References
[1] Prof. Eric Rotenberg, Course Notes, ECE 792E Advanced Microarchitecture, Fall 2002, NC State University.
[2] Deborah T. Marr et al., Hyper-Threading Technology Architecture and Microarchitecture, Intel Technology Journal, 1st Qtr 2002, Vol. 6, Issue 1.
[3] Vimal Reddy, Ambarish Sule, Aravindh Anantaraman, Hyperthreading on the Pentium 4, ECE792E Project, Fall 2002. http://www.tinker.ncsu.edu/ericro/ece721/student_projects/avananta.pdf
[4] D. M. Tullsen et al., Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor, 23rd Annual ISCA, pp. 191-202, May 1996.
Questions
