CS 162 All Slides

Operating Systems and Systems Programming
Lecture 1: What is an Operating System?

August 30th, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

• Background in hardware design
  » Alewife project at MIT
  » Designed CMMU, modified SPARC processor
  » Helped to write operating system
• Background in operating systems
  » Worked for Project Athena (MIT)
  » OS developer (device drivers, network file systems)
  » Worked on clustered high-availability systems (CLAM Associates)
  » OS lead researcher for the new Berkeley PARLab (Tessellation OS). More later.
• Peer-to-peer
  » OceanStore project – store your data for 1000 years
  » Tapestry and Bamboo – find your data around the globe
• Quantum computing
  » Well, this is just cool, but probably not apropos
8/30/10 Kubiatowicz CS162 ©UCB Fall 2010 Lec 1.2
Interactive is important! Ask Questions!

Note: Some slides and/or pictures in the following are adapted from slides ©2005 Silberschatz, Galvin, and Gagne. Slides courtesy of Kubiatowicz, AJ Shankar, George Necula, Alex Aiken, Eric Brewer, Ras Bodik, Ion Stoica, Doug Tygar, and David Wagner.

Moore’s Law
• 2X transistors/chip every 1.5 years – called “Moore’s Law”
• Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
• Microprocessors have become smaller, denser, and more powerful.
Societal Scale Information Systems
• The world is a large parallel system
  – Microprocessors in everything
  – Vast infrastructure behind them
  – Scalable, reliable, secure services: Internet connectivity, databases, information collection, remote storage, online games, commerce, …
  – Spans massive clusters, gigabit Ethernet clusters, and MEMS for sensor nets (figure from David Culler)

People-to-Computer Ratio Over Time
• Today: Multiple CPUs/person!
  – Approaching 100s?
New Challenge: Slowdown in Joy’s Law of Performance
• Uniprocessor performance (measured vs. VAX-11/780) grew rapidly for years, then slowed to ??%/year
  – From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, Sept. 15, 2006

ManyCore Chips: The future is here
• Intel 80-core multicore chip (Feb 2007)
  – 80 simple cores
  – Two FP-engines / core
  – Mesh-like network
  – 100 million transistors
  – 65nm feature size

• Computer-system operation
  – One or more CPUs and device controllers connect through a common bus providing access to shared memory
  – Concurrent execution of CPUs and devices competing for memory cycles
• Hardware concerns below the OS (from the layered architecture figure): memory hierarchy and latency; network communication; controllers and adapters for other processors, disks, and I/O devices (displays, keyboards, networks); L1 caches and VLSI; instruction set architecture (addressing, protection, exception handling); pipelining, hazard resolution, superscalar, reordering, prediction, speculation, vector, and dynamic compilation for instruction-level parallelism
Increasing Software Complexity
Example: Some Mars Rover (“Pathfinder”) Requirements
• Pathfinder hardware limitations/complexity:
  – 20MHz processor, 128MB of DRAM, VxWorks OS
  – Cameras, scientific instruments, batteries, solar panels, and locomotion equipment
  – Many independent processes work together
• Can’t hit reset button very easily!
  – Must reboot itself if necessary
  – Must always be able to receive commands from Earth
• Individual programs must not interfere
  – Suppose the MUT (Martian Universal Translator Module) is buggy
  – Better not crash the antenna positioning software!
• Further, all software may crash occasionally
  – Automatic restart with diagnostics sent to Earth
  – Periodic checkpoint of results saved?
• Certain functions are time critical:
  – Need to stop before hitting something
  – Must track orbit of Earth for communication
(From MIT’s 6.033 course)
Class Schedule
• Class Time: M/W 4:00-5:30 PM, 277 Cory Hall
  – Please come to class. Lecture notes do not have everything in them. The best part of class is the interaction!
  – Also: 10% of the grade is from class participation (section and class)
• Sections:
  – Important information is in the sections
  – The sections assigned to you by Telebears are temporary!
  – Every member of a project group must be in the same section
  – No sections this week (obviously); start next week

Section   Time              Location   TA
101       F 9:00A-10:00A    85 Evans   Christos Stergiou
102       F 10:00A-11:00A   6 Evans    Angela Juang
103       F 11:00A-12:00P   2 Evans    Angela Juang
104       F 12:00P-1:00P    75 Evans   Hilfi Alkaff
105 (New) F 1:00P-2:00P     85 Evans   Christos Stergiou

Textbook
• Text: Operating System Concepts, 8th Edition, Silberschatz, Galvin, Gagne
• Online supplements
  – See “Information” link on course website
  – Includes appendices, sample problems, etc.
• Question: need the 8th edition?
  – No, but it has new material that we may cover
  – Completely reorganized
  – Will try to give readings from both the 7th and 8th editions on the lecture page
Group Project Simulates Industrial Environment / Typical Lecture Format — Interactive!!!
• Project accounts:
  » All of the final runs must be done on your cs162-xx account and must run on the x86 Solaris machines
• Make sure to log into your new account this week and fill out the questions
• Project information:
  – See the “Projects and Nachos” link off the course home page
• Newsgroup (ucb.class.cs162):
  – Read this regularly!
Academic Dishonesty Policy
• Copying all or part of another person's work, or using reference material not specifically allowed, are forms of cheating and will not be tolerated. A student involved in an incident of cheating will be notified by the instructor and the following policy will apply:
  http://www.eecs.berkeley.edu/Policies/acad.dis.shtml
• The instructor may take actions such as:
  – require repetition of the subject work,
  – assign an F grade or a 'zero' grade to the subject work,
  – for serious offenses, assign an F grade for the course.
• The instructor must inform the student and the Department Chair in writing of the incident, the action taken, if any, and the student's right to appeal to the Chair of the Department Grievance Committee or to the Director of the Office of Student Conduct.
• The Office of Student Conduct may choose to conduct a formal hearing on the incident and to assess a penalty for misconduct.
• The Department will recommend that students involved in a second incident of cheating be dismissed from the University.

What does an Operating System do?
• Silberschatz and Galvin: “An OS is similar to a government”
  – Begs the question: does a government do anything useful by itself?
• Coordinator and Traffic Cop:
  – Manages all resources
  – Settles conflicting requests for resources
  – Prevents errors and improper use of the computer
• Facilitator:
  – Provides facilities that everyone needs
  – Standard libraries, windowing systems
  – Makes application programming easier, faster, less error-prone
• Some features reflect both tasks:
  – E.g. the file system is needed by everyone (Facilitator)
  – But the file system must be protected (Traffic Cop)
• No universally accepted definition of what is part of the OS
• “Everything a vendor ships when you order an operating system” is a good approximation
  – But varies wildly
• Most likely:
  – Memory management
  – I/O management
  – CPU scheduling
  – Communications? (Does email belong in the OS?)
  – Multitasking/multiprogramming?
• What about?
  – File system?
  – Multimedia support?
  – User interface?
  – Internet browser?
• “The one program running at all times on the computer” is the kernel.
  – Everything else is either a system program (ships with the operating system) or an application program
• Is this only interesting to academics??
What if we didn’t have an Operating System? Simple OS: What if only one application?
Altair 8080
More complex OS: Multiple Apps
• Full coordination and protection
  – Manage interactions between different users
  – Multiple programs running simultaneously
  – Multiplex and protect hardware resources
    » CPU, memory, I/O devices like disks, printers, etc.
• Facilitator
  – Still provides standard libraries, facilities
• Would this complexity make sense if there were only one application that you cared about?

Example: Protecting Processes from Each Other
• Problem: Run multiple applications in such a way that they are protected from one another
• Goal:
  – Keep user programs from crashing the OS
  – Keep user programs from crashing each other
  – [Keep parts of the OS from crashing other parts?]
• (Some of the required) Mechanisms:
  – Address translation
  – Dual-mode operation
• Simple policy:
  – Programs are not allowed to read/write memory of other programs or of the operating system
• For now, assume translation happens with a table (called a Page Table):
  – A virtual address is split into a virtual page number and a 10-bit offset
  – The virtual page number indexes into the page table (itself located in physical memory)
  – Each page table entry holds a valid bit (V), access rights, and a physical page number
  – The physical address is the physical page number concatenated with the offset
• Translation helps protection:
  – Control translations, control access
  – Should users be able to change the page table???
• Hardware provides at least two modes:
  – “Kernel” mode (or “supervisor” or “protected”)
  – “User” mode: normal programs executed
• Some instructions/ops prohibited in user mode:
  – Example: cannot modify page tables in user mode
    » Attempt to modify ⇒ exception generated
• Transitions from user mode to kernel mode:
  – System calls, interrupts, other exceptions
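The page-table translation described above can be sketched in a few lines. This is a minimal illustrative model, not the actual CS162/Nachos code: a 10-bit offset (so 1024-byte pages), and a hypothetical `page_table` dict holding (valid, writable, physical page) per virtual page number.

```python
OFFSET_BITS = 10
PAGE_SIZE = 1 << OFFSET_BITS  # 2^10 = 1024-byte pages, as in the diagram

# Hypothetical page table: virtual page number -> (valid, writable, physical page)
page_table = {
    0: (True, True, 5),   # virtual page 0 maps to physical page 5
    1: (True, False, 2),  # read-only page
}

def translate(vaddr, write=False):
    vpn = vaddr >> OFFSET_BITS        # virtual page number indexes the table
    offset = vaddr & (PAGE_SIZE - 1)  # offset passes through unchanged
    if vpn not in page_table:
        raise MemoryError("page fault: no translation")   # exception to OS
    valid, writable, ppn = page_table[vpn]
    if not valid or (write and not writable):
        raise PermissionError("protection fault")         # exception to OS
    return (ppn << OFFSET_BITS) | offset                  # physical address

print(hex(translate(0x0004)))   # page 0, offset 4 -> 0x1404
```

Because the table itself lives in protected physical memory, user programs can be prevented from changing their own translations — which is the point of the last question above.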
“In conclusion…”
• Layered view: Application ⇒ Virtual Machine Interface ⇒ Operating System ⇒ Physical Machine Interface ⇒ Hardware
• Software engineering problem:
  – Turn hardware/software quirks into what programmers want/need
  – Optimize for convenience, utilization, security, reliability, etc.
• For any OS area (e.g. file systems, virtual memory, networking, scheduling):
  – What’s the hardware interface? (physical reality)
  – What’s the application interface? (nicer abstraction)

Operating Systems and Systems Programming
Lecture 2: History of the World Parts 1—5; Operating Systems Structures

September 1st, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162
Review: Example of Address Translation
• Prog 1 and Prog 2 each have their own virtual address space (Code, Data, Heap, Stack)
• Translation maps place their Code 1/2, Data 1/2, Heap 1/2, and Stack 1/2 into disjoint regions of physical memory, alongside shared OS code

Goals for Today
• Finish protection example
• History of operating systems
  – Really a history of resource-driven choices
• Operating systems structures
• Operating systems organizations
• Abstractions and layering
The other half of protection: Dual Mode Operation
• Hardware provides at least two modes:
  – “Kernel” mode (or “supervisor” or “protected”)
  – “User” mode: normal programs executed
• Some instructions/ops prohibited in user mode:
  – Example: cannot modify page tables in user mode
    » Attempt to modify ⇒ exception generated
• Transitions from user mode to kernel mode:
  – System calls, interrupts, other exceptions

UNIX System Structure
• User mode: applications, standard libs
• Kernel mode: the kernel
• Hardware underneath
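The dual-mode rule above can be sketched as a tiny simulation: a privileged operation checks the mode bit and raises an exception in user mode, while a system call is the controlled transition into kernel mode. All names here are illustrative, not any real kernel's API.

```python
mode = "user"          # the hardware mode bit, modeled as a global
page_table = {0: 5}    # hypothetical privileged state

def set_page_table(vpn, ppn):
    """A privileged instruction: only legal in kernel mode."""
    if mode != "kernel":
        raise PermissionError("exception: privileged op in user mode")
    page_table[vpn] = ppn

def syscall(fn, *args):
    """Transition user -> kernel, run the handler, return to user mode."""
    global mode
    mode = "kernel"
    try:
        return fn(*args)
    finally:
        mode = "user"   # always drop back to user mode on return

try:
    set_page_table(1, 9)        # direct attempt from user mode
except PermissionError as e:
    print(e)                    # exception generated, as the slide says

syscall(set_page_table, 1, 9)   # allowed via the kernel-mode path
print(page_table[1])            # 9
```

The design point: user code never flips the mode bit itself; only the controlled entry points (system calls, interrupts, exceptions) do.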
Moore’s Law Change Drives OS Change / Moore’s Law Effects
Core Memories (1950s & 60s)
• Core memory stored data as magnetization in iron rings
  – Iron “cores” woven into a 2-dimensional mesh of wires
  – Origin of the term “dump core”
  – Rumor that IBM consulted the Life Savers company
• The first magnetic core memory, from the IBM 405 Alphabetical Accounting Machine.
• See: http://www.columbia.edu/acis/history/core.html

History Phase 1½ (late 60s/early 70s)
• Data channels, interrupts: overlap I/O and compute
  – DMA – Direct Memory Access for I/O devices
  – I/O can be completed asynchronously
• Multiprogramming: several programs run simultaneously
  – Small jobs not delayed by large jobs
  – More overlap between I/O and CPU
  – Need memory protection between programs and/or OS
• Complexity gets out of hand:
  – Multics: announced in 1963, ran in 1969
    » 1777 people “contributed to Multics” (30-40 core dev)
    » Turing award lecture from Fernando Corbató (key researcher): “On building systems that will fail”
  – OS 360: released with 1000 known bugs (APARs)
    » “Anomalous Program Activity Report”
• OS finally becomes an important science:
  – How to deal with complexity???
  – UNIX based on Multics, but vastly simplified
• The 6180 at MIT IPC, skin doors open, circa 1976:
  – “We usually ran the machine with doors open so the operators could see the AQ register display, which gave you an idea of the machine load, and for convenient access to the EXECUTE button, which the operator would push to enter BOS if the machine crashed.”
  – http://www.multicians.org/multics-stories.html
• Disk density over time:
  – 1973: 1.7 Mbit/sq. in, 140 MBytes
  – 1979: 7.7 Mbit/sq. in, 2,300 MBytes
  – Contrast: Seagate 2TB, 400 GB/sq. in, 3½ in disk, 4 platters
Administrivia
• Waitlist:
  – All CS/EECS seniors should be in the class
  – Remaining: 18 CS/EECS juniors, 4 grad students, 2 non-CS/EECS seniors
• cs162-xx accounts:
  – We have more forms for those who didn’t get one
  – If you haven’t logged in yet, you need to do so
• Nachos readers:
  – TBA: Will be down at Copy Central on Hearst
  – Will include lectures and printouts of all of the code
• Video “screencast” archives available off lectures page
  – If you have an mp4 player, just click on the title of a lecture
  – Otherwise, click on the link at top middle of lecture page
• No slip days on first design document for each phase
  – Need to get design reviews in on time
• Don’t know Java well?
  – Perhaps try the CS 9G self-paced Java course

Administrivia: Time to start thinking about groups
• Project signup: Not quite ready, but will be
  – 4-5 members to a group
    » Everyone in group must be able to actually attend same section
    » The sections assigned to you by Telebears are temporary!
  – Only submit once per group!
    » Everyone in group must have logged into their cs162-xx accounts once before you register the group
    » Make sure that you select at least 2 potential sections
    » Due Tuesday 9/7 by 11:59pm
• Sections:
  – Watch for section assignments next Wednesday/Thursday
  – Attend new sections next week; Telebears sections this Friday

Section   Time              Location   TA
101       F 9:00A-10:00A    85 Evans   Christos Stergiou
102       F 10:00A-11:00A   6 Evans    Angela Juang
103       F 11:00A-12:00P   2 Evans    Angela Juang
104       F 12:00P-1:00P    75 Evans   Hilfi Alkaff
105 (New) F 1:00P-2:00P     85 Evans   Christos Stergiou
• Windows releases (figure labels: NT line “HAL/Protection”; 9x line “No HAL/Full Prot”):
  – Win NT (1993)
  – Win 95 (1995)
  – Win 2000 (2000)
  – Win XP (2001)
  – Win Vista (2007)
History Phase 4 (1988—): Internet
• Developed by the research community
  – Based on an open standard: Internet Protocol
  – Internet Engineering Task Force (IETF)
• Technical basis for many other types of networks
  – Intranet: enterprise IP network
• Services provided by the Internet
  – Shared access to computing resources: telnet (1970’s)
  – Shared access to data/files: FTP, NFS, AFS (1980’s)
  – Communication medium over which people interact
    » email (1980’s), on-line chat rooms, instant messaging (1990’s)
    » audio, video (1990’s, early 00’s)
  – Medium for information dissemination
    » USENET (1980’s)
    » WWW (1990’s)
    » Audio, video (late 90’s, early 00’s) – replacing radio, TV?
    » File sharing (late 90’s, early 00’s)

ARPANet Evolves into Internet
• First e-mail SPAM message: 1 May 1978 12:33 EDT
• 80-83: TCP/IP, DNS; ARPANET and MILNET split
• 85-86: NSF builds NSFNET as backbone, links 6 supercomputer centers, 1.5 Mbps, 10,000 computers
• 87-90: link regional networks, NSI (NASA), ESNet (DOE), DARTnet, TWBNet (DARPA), 100,000 computers
• Timeline (1965–2005): ARPANet/SATNet/PRNet ⇒ TCP/IP ⇒ NSFNet, WWW ⇒ Deregulation & Commercialization ⇒ ISP/ASP/AIP
  – SATNet: satellite network
  – PRNet: radio network
• Hardware components: links (fibers, coaxial cable), interfaces (Ethernet card, wireless card), switches/routers (large routers, telephone switches)
• Geographical distance
  – Local Area Networks (LAN): Ethernet, Token Ring, FDDI
  – Metropolitan Area Networks (MAN): DQDB, SMDS
  – Wide Area Networks (WAN): X.25, ATM, frame relay
  – Caveat: LAN, MAN, WAN may mean different things
    » Service, network technology, networks
• Information type
  – Data networks vs. telecommunication networks
• Application type
  – Special purpose networks: airline reservation network, banking network, credit card network, telephony
  – General purpose network: Internet
[Figure: several regional networks interconnected through a common backbone]
Backbones + NAPs + ISPs / Computers Inside the Core
[Figure: consumer and business ISPs — dial-up, always-on DSL (e.g. Covad), cable head ends (e.g. @home), satellite, fixed wireless, cell (e.g. Cingular, Sprint), AOL — and LANs, all connected through NAPs to the backbones]
History of OS: Summary
• Change is continuous and OSs should adapt
  – Not: look how stupid batch processing was
  – But: it made sense at the time
• Situation today is much like the late 60s
  – Small OS: 100K lines
  – Large OS: 10M lines (5M for the browser!)
    » 100-1000 person-years
• Complexity still reigns
  – NT developed (early to late 90’s): never worked well
  – Windows 2000/XP: very successful
  – Windows Vista (aka “Longhorn”) delayed many times
    » Finally released in January 2007
    » Promises kept only by removing some of the intended technology
    » Slow adoption rate, even in 2008/2009
• CS162: understand OSs to simplify them

Now for a quick tour of OS Structures
System Calls (What is the API?)
• See Chapter 2 of the 7th edition or Chapter 3 of the 6th

Operating Systems Structure (What is the organizational principle?)
• Simple
  – Only one or two levels of code
• Layered
  – Lower levels independent of upper levels
• Microkernel
  – OS built from many user-level processes
• Modular
  – Core kernel with dynamically loadable modules
UNIX System Structure / Layered Structure
Modules-based Structure / Partition Based Structure for Multicore chips?
• Partitioned services: persistent storage & file system, device drivers, HCI/voice recognition, identity
  – Monitoring services
    » Performance counters
    » Introspection
  – Identity/environment services (security)
    » Biometric, GPS, possession tracking
• Applications given larger partitions
  – Freedom to use resources arbitrarily
Implementation Issues (How is the OS implemented?)
• Policy vs. mechanism
  – Policy: What do you want to do?
  – Mechanism: How are you going to do it?
  – Should be separated, since both change
• Algorithms used
  – Linear, tree-based, log structured, etc.
• Event models used
  – Threads vs. event loops
• Backward compatibility issues
  – Very important for Windows 2000/XP
• System generation/configuration
  – How to make a generic OS fit on specific hardware

Conclusion
• Rapid change in hardware leads to changing OS
  – Batch ⇒ multiprogramming ⇒ timesharing ⇒ graphical UI ⇒ ubiquitous devices ⇒ cyberspace/metaverse/??
• OS features migrated from mainframes ⇒ PCs
• Standard components and services
  – Process control
  – Main memory
  – I/O
  – File system
  – UI
• Policy vs. mechanism
  – Crucial division: not always properly separated!
• Complexity is always out of control
  – However, “Resistance is NOT Useless!”
Review: History of OS
Goals for Today: Concurrency

The Basic Problem of Concurrency
• Assume a single processor. How do we provide the illusion of multiple processors (CPU1, CPU2, CPU3, …)?
  – Multiplex in time!
• Each virtual “CPU” needs a structure to hold:
  – Program Counter (PC), Stack Pointer (SP)
  – Registers (integer, floating point, others…?)
• How do we switch from one virtual CPU to the next?
  – Save PC, SP, and registers in current state block
  – Load PC, SP, and registers from new state block
• What triggers a switch?
  – Timer, voluntary yield, I/O, other things

Recall (61C): What happens during execution?
• All virtual CPUs share the same non-CPU resources
  – I/O devices the same
  – Memory the same
• Consequence of sharing:
  – Each thread can access the data of every other thread (good for sharing, bad for protection)
  – Threads can share instructions (good for sharing, bad for protection)
  – Can threads overwrite OS functions?
• This (unprotected) model is common in:
  – Embedded applications
  – Windows 3.1/Macintosh (switch only with yield)
  – Windows 95—ME? (switch with both yield and timer)
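The save/load state-block idea above can be sketched in a few lines of Python. Everything here is illustrative — a real switch happens in assembly on the actual register file — but the LIFO of "save mine, load yours" is the same:

```python
import copy

# The one real CPU, and one saved state block per virtual CPU
cpu = {"pc": 0, "sp": 0, "regs": [0] * 8}
state_blocks = [copy.deepcopy(cpu) for _ in range(3)]
current = 0

def switch(to):
    """Save current virtual CPU's state, load the next one's."""
    global current
    state_blocks[current] = copy.deepcopy(cpu)   # save PC, SP, registers
    cpu.update(copy.deepcopy(state_blocks[to]))  # load the new state block
    current = to

cpu["pc"] = 100          # virtual CPU 0 runs for a while
switch(1)                # timer or yield triggers a switch
assert cpu["pc"] == 0    # virtual CPU 1 resumes from its own saved state
switch(0)
assert cpu["pc"] == 100  # virtual CPU 0 resumes exactly where it left off
```

Each virtual CPU is oblivious to the switch: from its point of view, its PC, SP, and registers are untouched.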
• For a 32-bit processor there are 2^32 ≈ 4 billion addresses
• What happens when you read or write to an address?
  – Perhaps nothing
  – Perhaps acts like regular memory
  – Perhaps ignores writes
  – Perhaps causes I/O operation
    » (Memory-mapped I/O)
  – Perhaps causes exception (fault)
• [Figure: Prog 1 and Prog 2, each with a virtual address space of Code, Data, Heap, and Stack, mapped through Translation Map 1 and Translation Map 2 into physical memory alongside OS code, OS data, and OS heap & stacks]
Examples of multithreaded programs
• Embedded systems
  – Elevators, planes, medical systems, wristwatches
  – Single program, concurrent operations
• Most modern OS kernels
  – Internally concurrent because they have to deal with concurrent requests by multiple users
  – But no protection needed within kernel
• Database servers
  – Access to shared data by many concurrent users
  – Also background utility processing must be done

Examples of multithreaded programs (con’t)
• Network servers
  – Concurrent requests from network
  – Again, single program, multiple concurrent operations
  – File server, web server, and airline reservation systems
• Parallel programming (more than one physical CPU)
  – Split program into multiple threads for parallelism
  – This is called multiprocessing
• Some multiprocessors are actually uniprogrammed:
  – Multiple threads in one address space but one program at a time
Classification
• Classify OSs by # of address spaces × # of threads per address space:
  – One address space, one thread: MS/DOS, early Macintosh
  – One address space, many threads: embedded systems (Geoworks, VxWorks, JavaOS, etc), JavaOS, Pilot(PC)
  – Many address spaces, one thread each: traditional UNIX
  – Many address spaces, many threads: Mach, OS/2, Linux, Windows 9x???, Win NT to XP, Solaris, HP-UX, OS X
• Real operating systems have either
  – One or many address spaces
  – One or many threads per address space
• Did Windows 95/98/ME have real memory protection?
  – No: users could overwrite process tables/System DLLs

Example: Implementation of a Java OS
• Structure: Java apps ⇒ Java OS ⇒ hardware (many threads, one address space)
• Why another OS?
  – Recommended minimum memory sizes:
    » UNIX + X Windows: 32MB
    » Windows 98: 16-32MB
    » Windows NT: 32-64MB
    » Windows 2000/XP: 64-128MB
  – What if we want a cheap network point-of-sale computer?
    » Say we need 1000 terminals
    » Want < 8MB
• What language to write this OS in?
  – C/C++/ASM? Not terribly high-level. Hard to debug.
  – Java/Lisp? Not quite sufficient – need direct access to HW/memory management
Summary
• Processes have two parts
  – Threads (concurrency)
  – Address spaces (protection)
• Concurrency accomplished by multiplexing CPU time:
  – Unloading current thread (PC, registers)
  – Loading new thread (PC, registers)
  – Such context switching may be voluntary (yield(), I/O operations) or involuntary (timer, other interrupts)
• Protection accomplished by restricting access:
  – Memory mapping isolates processes from each other
  – Dual mode for isolating I/O, other resources
• The book talks about processes
  – When this concerns concurrency, it is really talking about the thread portion of a process
  – When this concerns protection, it is talking about the address space portion of a process
Recall: Modern Process with Multiple Threads
• State shared by all threads in process/address space
  – Contents of memory (global variables, heap)
  – I/O state (file system, network connections, etc)
• State “private” to each thread
  – Kept in TCB ≡ Thread Control Block
  – CPU registers (including program counter)
  – Execution stack – what is this?
• Execution stack
  – Parameters, temporary variables
  – Return PCs are kept while called procedures are executing
• (Classification table repeated from Lecture 3: one/many address spaces × one/many threads — MS/DOS, early Macintosh; embedded systems (Geoworks, VxWorks, JavaOS, etc), JavaOS, Pilot(PC); traditional UNIX; Mach, OS/2, Linux, Win 95?, Mac OS X, Win NT to XP, Solaris, HP-UX)
• Real operating systems have either
  – One or many address spaces
  – One or many threads per address space
• Did Windows 95/98/ME have real memory protection?
  – No: users could overwrite process tables/System DLLs
Execution stacks and calling conventions
• An execution stack permits recursive execution (e.g. a call such as A(1); from within A itself)
• Crucial to modern languages
• Before calling a procedure:
  – Save caller-saves regs
  – Save v0, v1
  – Save ra
• After return, assume:
  – Callee-saves regs OK
  – gp, sp, fp OK (restored!)
  – Other things trashed
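The point that a stack permits recursive execution can be made concrete with a tiny Python sketch. This is illustrative only — a real frame also holds saved registers and a return PC — but the LIFO push/pop discipline is the same:

```python
def fact_with_explicit_stack(n):
    """Compute n! by simulating the call stack of a recursive factorial."""
    stack = []            # simulated execution stack
    while n > 1:
        stack.append(n)   # "call": push a frame holding the argument
        n -= 1
    result = 1
    while stack:
        result *= stack.pop()  # "return": frames unwind in LIFO order
    return result

print(fact_with_explicit_stack(5))   # 120
```

Each nested call gets its own frame, which is why the same procedure can be active many times at once.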
Single-Threaded Example
• Imagine the following C program:

    main() {
      ComputePI(“pi.txt”);
      PrintClassList(“clist.text”);
    }

• What is the behavior here?
  – Program would never print out class list
  – Why? ComputePI would never finish

Use of Threads
• Version of program with threads:

    main() {
      CreateThread(ComputePI(“pi.txt”));
      CreateThread(PrintClassList(“clist.text”));
    }

• What does “CreateThread” do?
  – Start independent thread running given procedure
• What is the behavior here?
  – Now, you would actually see the class list
  – This should behave as if there are two separate CPUs
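A runnable analogue of the threaded version, with Python's standard threading module standing in for the slide's CreateThread(); the function bodies are stand-ins, not real pi or class-list code:

```python
import threading
import time

def compute_pi():            # stand-in for ComputePI("pi.txt"): never finishes
    while True:
        time.sleep(0.01)

def print_class_list(out):   # stand-in for PrintClassList("clist.text")
    out.append("class list printed")

results = []
# daemon=True so the never-ending thread doesn't block interpreter exit
threading.Thread(target=compute_pi, daemon=True).start()
t = threading.Thread(target=print_class_list, args=(results,))
t.start()
t.join()
print(results[0])   # the list appears even though compute_pi never ends
```

The scheduler time-multiplexes the CPU between the two threads, so the infinite loop no longer starves the second function — exactly the behavior change the slide describes.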
• If we stopped this program and examined it with a debugger, we would see
  – Two sets of CPU registers
  – Two sets of stacks (Stack 1 and Stack 2 in the address space, above the heap, global data, and code)
• Questions:
  – How do we position stacks relative to each other?
  – What maximum size should we choose for the stacks?
  – What happens if threads violate this?
  – How might you catch violations?

• Each thread has a Thread Control Block (TCB)
  – Execution state: CPU registers, program counter, pointer to stack
  – Scheduling info: state (more later), priority, CPU time
  – Accounting info
  – Various pointers (for implementing scheduling queues)
  – Pointer to enclosing process (PCB)?
  – Etc (add stuff as you find a need)
• In Nachos: “Thread” is a class that includes the TCB
• OS keeps track of TCBs in protected memory
  – In array, or linked list, or …
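A TCB as described above might be sketched as a small record type. The field names here are illustrative (not Nachos's actual members):

```python
from dataclasses import dataclass, field

@dataclass
class TCB:
    tid: int
    pc: int = 0                               # saved program counter
    sp: int = 0                               # saved stack pointer
    regs: dict = field(default_factory=dict)  # other saved registers
    state: str = "ready"                      # new/ready/running/waiting/terminated
    priority: int = 0                         # scheduling info
    cpu_time: float = 0.0                     # accounting info

# The OS keeps TCBs of non-running threads on queues, e.g. a ready list:
ready_list = [TCB(tid=1), TCB(tid=2, priority=5)]
print([t.tid for t in ready_list])   # [1, 2]
```

Real kernels keep such structures in protected memory, and the "various pointers" bullet usually becomes intrusive next/prev links so a TCB can sit on a queue without extra allocation.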
Lifecycle of a Thread (or Process) / Ready Queue And Various I/O Device Queues
• When a thread is not running, its TCB is in some scheduler queue
  – Separate queue for each device/signal/condition
  – Each queue can have a different scheduler policy
Dispatch Loop
• Conceptually, the dispatching loop of the operating system looks as follows:

    Loop {
      RunThread();
      ChooseNextThread();
      SaveStateOfCPU(curTCB);
      LoadStateOfCPU(newTCB);
    }

Running a thread
• Consider the first portion: RunThread()
Yielding the CPU
• The act of requesting I/O implicitly yields the CPU
  – Stack growth: yield ⇒ trap to OS ⇒ kernel_yield ⇒ run_new_thread ⇒ switch
• Waiting on a “signal” from another thread
  – Thread asks to wait and thus yields the CPU
• A thread may also execute an explicit yield():

    proc B() {
      while(TRUE) {
        yield();
      }
    }

• Suppose we have 2 threads, S and T, both running B(); each has the same stack growth:
  B(while) ⇒ yield ⇒ run_new_thread ⇒ switch
• The switch routine saves the old thread’s registers and loads the new thread’s:

    switch(tCur, tNew) {
      /* Unload old thread */
      TCB[tCur].regs.r7 = CPU.r7;
      …
      TCB[tCur].regs.r0 = CPU.r0;
      TCB[tCur].regs.sp = CPU.sp;
      TCB[tCur].regs.retpc = CPU.retpc; /* return addr */

      /* Load and execute new thread */
      CPU.r7 = TCB[tNew].regs.r7;
      …
      CPU.r0 = TCB[tNew].regs.r0;
      CPU.sp = TCB[tNew].regs.sp;
      CPU.retpc = TCB[tNew].regs.retpc;
      return; /* Return to CPU.retpc */
    }
Blocking on I/O
• What happens when a thread requests a block of data from the file system?
  – User code invokes a system call
  – Read operation is initiated
  – Run new thread/switch
  – Stack growth: CopyFile ⇒ read ⇒ trap to OS ⇒ kernel_read ⇒ run_new_thread ⇒ switch
• Thread communication similar
  – Wait for Signal/Join
  – Networking

External Events
• What happens if a thread never does any I/O, never waits, and never yields control?
  – Could the ComputePI program grab all resources and never release the processor?
    » What if it didn’t print to console?
  – Must find a way for the dispatcher to regain control!
• Answer: utilize external events
  – Interrupts: signals from hardware or software that stop the running code and jump to kernel
  – Timer: like an alarm clock that goes off every so many milliseconds
• If we make sure that external events occur frequently enough, we can ensure the dispatcher runs
Example: Network Interrupt
  Interrupted code:          “Interrupt Handler”:
    add  $r1,$r2,$r3           Dispatch to Handler
    subi $r4,$r1,#4            (Pipeline Flush)
    slli $r4,$r4,#2            Transfer Network Packet from hardware
    lw   $r2,0($r4)              to Kernel Buffers
    lw   $r3,4($r4)            Restore registers
    add  $r2,$r2,$r3           Clear current Int
    sw   8($r4),$r2            Disable All Ints
                               Restore priority
                               RTI

• An interrupt is a hardware-invoked context switch
  – No separate step to choose what to run next
  – Always run the interrupt handler immediately

Use of Timer Interrupt to Return Control
  Stack (grows downward):
    Some Routine → Interrupt → TimerInterrupt → run_new_thread → switch

• Timer Interrupt routine:

    TimerInterrupt() {
      DoPeriodicHouseKeeping();
      run_new_thread();
    }

• I/O interrupt: same as timer interrupt except that
  DoHousekeeping() replaced by ServiceIO().
Choosing a Thread to Run
• How does Dispatcher decide what to run?
  – Zero ready threads – dispatcher loops
    » Alternative is to create an “idle thread”
    » Can put machine into low-power mode
  – Exactly one ready thread – easy
  – More than one ready thread: use scheduling priorities
• Possible priorities:
  – LIFO (last in, first out):
    » put ready threads on front of list, remove from front
  – Pick one at random
  – FIFO (first in, first out):
    » Put ready threads on back of list, pull them from front
    » This is fair and is what Nachos does
  – Priority queue:
    » keep ready list sorted by TCB priority field

Summary
• The state of a thread is contained in the TCB
  – Registers, PC, stack pointer
  – States: New, Ready, Running, Waiting, or Terminated
• Multithreading provides simple illusion of multiple CPUs
  – Switch registers and stack to dispatch new thread
  – Provide mechanism to ensure dispatcher regains control
• Switch routine
  – Can be very expensive if many registers
  – Must be very carefully constructed!
• Many scheduling options
  – Decision of which thread to run complex enough for complete
    lecture
CS162
Operating Systems and Systems Programming
Lecture 5

Cooperating Threads

September 15, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

Review: Per Thread State
• Each Thread has a Thread Control Block (TCB)
  – Execution State: CPU registers, program counter, pointer to stack
  – Scheduling info: State (more later), priority, CPU time
  – Accounting Info
  – Various Pointers (for implementing scheduling queues)
  – Pointer to enclosing process? (PCB)?
  – Etc (add stuff as you find a need)
• OS Keeps track of TCBs in protected memory
  – In Arrays, or Linked Lists, or …
  [Ready Queue diagram: Head and Tail pointers into a linked list of
   TCB9 → TCB6 → TCB16, each holding Registers and Other State]
Review: Yielding through Internal Events
• Blocking on I/O
  – The act of requesting I/O implicitly yields the CPU
• Waiting on a “signal” from other thread
  – Thread asks to wait and thus yields the CPU
• Thread executes a yield()
  – Thread volunteers to give up CPU

    computePI() {
      while(TRUE) {
        ComputeNextDigit();
        yield();
      }
    }

  – Note that yield() must be called by programmer frequently enough!

Review: Stack for Yielding Thread
  Stack (grows downward):
    yield → Trap to OS → kernel_yield → run_new_thread → switch

• How do we run a new thread?

    run_new_thread() {
      newThread = PickNewThread();
      switch(curThread, newThread);
      ThreadHouseKeeping(); /* Later in lecture */
    }

• How does dispatcher switch to a new thread?
  – Save anything next thread may trash: PC, regs, stack
  – Maintain isolation for each thread
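The cooperative yield() model above can be sketched with Python generators, where each `yield` plays the role of the thread's voluntary trap into the dispatcher. The names `compute_pi` and `dispatcher` are illustrative, not from the slides:

```python
def compute_pi(n):
    """Cooperative 'thread': n units of work, a yield() after each one."""
    for _ in range(n):
        # ... ComputeNextDigit() would go here ...
        yield          # voluntary yield(): trap back to the dispatcher

def dispatcher(ready):
    """run_new_thread() analog: round-robins over the ready queue."""
    schedule = []                  # order in which threads got the CPU
    while ready:
        thread = ready.pop(0)      # PickNewThread()
        schedule.append(thread)
        try:
            next(thread)           # run thread until its next yield()
            ready.append(thread)   # still runnable: back on ready queue
        except StopIteration:
            pass                   # thread finished: drop its "TCB"
    return schedule

s, t = compute_pi(2), compute_pi(1)
order = dispatcher([s, t])         # the two threads strictly alternate
```

As in the two-thread yield example, each thread runs until its next yield() and then goes to the back of the ready queue, so with two live threads the CPU strictly alternates between them.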
Review: Two Thread Yield Example
• Suppose we have 2 threads: Threads S and T, each running:

    proc B() {
      while(TRUE) {
        yield();
      }
    }

  – Each stack at the switch point:
    B(while) → yield → run_new_thread → switch

Goals for Today

Note: Some slides and/or pictures in the following are adapted from
slides ©2005 Silberschatz, Galvin, and Gagne.
Many slides generated from my lecture notes by Kubiatowicz.
Review: Interrupt Controller
  [Hardware diagram: device interrupt lines (Timer, Network, Software
   Interrupt, NMI) feed a mask and priority encoder; the controller
   raises Interrupt plus an IntID line to the CPU, which has an
   internal Int Disable bit and a Control register]

• Interrupts invoked with interrupt lines from devices
• Interrupt controller chooses interrupt request to honor
  – Mask enables/disables interrupts
  – Priority encoder picks highest enabled interrupt
  – Software Interrupt Set/Cleared by Software
  – Interrupt identity specified with ID line
• CPU can disable all interrupts with internal flag
• Non-maskable interrupt line (NMI) can’t be disabled

Review: Network Interrupt (handler detail)
  Interrupted code:          “Interrupt Handler”:
    add  $r1,$r2,$r3           Raise priority
    subi $r4,$r1,#4            Reenable All Ints
    slli $r4,$r4,#2            Save registers
    (Pipeline Flush,           Dispatch to Handler
     External Interrupt)       Transfer Network Packet from hardware
    lw   $r2,0($r4)              to Kernel Buffers
    lw   $r3,4($r4)            Restore registers
    add  $r2,$r2,$r3           Clear current Int
    sw   8($r4),$r2            Disable All Ints
                               Restore priority
                               RTI

• Disable/Enable All Ints ⇒ Internal CPU disable bit
  – RTI reenables interrupts, returns to user mode
• Raise/lower priority: change interrupt mask
• Software interrupts can be provided entirely in software at
  priority switching boundaries
Review: Preemptive Multithreading
• Use the timer interrupt to force scheduling decisions

  Stack (grows downward):
    Some Routine → Interrupt → TimerInterrupt → run_new_thread → switch

Administrivia
Setting up the stack for a new thread
• Initialize stack data?
  – Includes things like recording start time of thread
  – Other Statistics
  [Stack layout: ThreadRoot stub at the base, then Thread Code, then
   the Running Stack; stack grows downward]
• Stack will grow and shrink with execution of thread
• Final return from thread returns into ThreadRoot(), which calls
  ThreadFinish()
  – ThreadFinish() will start at user-level

Deallocating a finished thread
• Call run_new_thread() to run another thread:

    run_new_thread() {
      newThread = PickNewThread();
      switch(curThread, newThread);
      ThreadHouseKeeping();
    }

  – ThreadHouseKeeping() notices waitingToBeDestroyed and deallocates
    the finished thread’s TCB and stack
Additional Detail

Parent-Child relationship
Summary
• Create initial TCB and stack to point at ThreadRoot()
  – ThreadRoot() calls thread code, then ThreadFinish()
  – ThreadFinish() wakes up waiting threads then prepares TCB/stack
    for destruction
• Threads can wait for other threads using ThreadJoin()
• Threads may be at user-level or kernel level
• Cooperating threads have many potential advantages
  – But: introduces non-reproducibility and non-determinism
  – Need to have Atomic operations

Thread Pool: a master thread hands connections to worker threads
through a shared queue

    master() {
      allocThreads(worker,queue);
      while(TRUE) {
        con=AcceptCon();
        Enqueue(queue,con);
        wakeUp(queue);
      }
    }

    worker(queue) {
      while(TRUE) {
        con=Dequeue(queue);
        if (con==null)
          sleepOn(queue);
        else
          ServiceWebPage(con);
      }
    }
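A minimal working analog of the master/worker thread pool, using Python's `queue.Queue` to provide the blocking Dequeue/sleepOn behavior internally. The integer "connections" and the doubling stand-in for ServiceWebPage() are assumptions for illustration:

```python
import queue
import threading

def worker(q, results):
    """worker(queue) analog: dequeue connections until a None sentinel."""
    while True:
        con = q.get()              # Dequeue(); blocks (sleeps) when empty
        if con is None:
            break                  # shutdown signal from the master
        results.append(con * 2)    # stand-in for ServiceWebPage(con)

def master(connections, nworkers=3):
    """master() analog: allocThreads(worker, queue), then enqueue work."""
    q, results = queue.Queue(), []
    threads = [threading.Thread(target=worker, args=(q, results))
               for _ in range(nworkers)]
    for t in threads:
        t.start()
    for con in connections:        # stand-in for the AcceptCon() loop
        q.put(con)                 # Enqueue() + wakeUp() in one call
    for _ in threads:              # one sentinel per worker to shut down
        q.put(None)
    for t in threads:
        t.join()
    return results

print(sorted(master([1, 2, 3, 4])))   # prints [2, 4, 6, 8]
```

Note that `q.put()` both enqueues and wakes a sleeping worker, so the explicit wakeUp() call from the slide pseudocode disappears.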
CS162
Operating Systems and Systems Programming
Lecture 6

Synchronization

September 20, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

Review: ThreadFork(): Create a New Thread
• ThreadFork() is a user-level procedure that creates a new thread
  and places it on ready queue
• Arguments to ThreadFork()
  – Pointer to application routine (fcnPtr)
  – Pointer to array of arguments (fcnArgPtr)
  – Size of stack to allocate
• Implementation
  – Sanity Check arguments
  – Enter Kernel-mode and Sanity Check arguments again
  – Allocate new Stack and TCB
  – Initialize TCB and place on ready list (Runnable).
Review: How does Thread get started?
  [Diagram: the other thread’s stack (A → B(while) → yield) next to
   the new thread’s stack, which is set up to begin execution at
   ThreadRoot; both stacks grow downward]

Review: What does ThreadRoot() look like?
• ThreadRoot() is the root for the thread routine:

    ThreadRoot() {
      DoStartupHousekeeping();
      UserModeSwitch(); /* enter user mode */
      Call fcnPtr(fcnArgPtr);
      ThreadFinish();
    }

• Startup Housekeeping
Multithreaded Web Server

    serverLoop() {
      connection = AcceptCon();
      ThreadFork(ServiceWebPage(),connection);
    }

• Advantages of threaded version:
  – Can share file caches kept in memory, results of CGI scripts,
    other things
  – Threads are much cheaper to create than processes, so this has a
    lower per-request overhead
• What if too many requests come in at once?

Thread Pools: a master thread queues connections for a bounded set of
slave threads

    master() {
      allocThreads(slave,queue);
      while(TRUE) {
        con=AcceptCon();
        Enqueue(queue,con);
        wakeUp(queue);
      }
    }

    slave(queue) {
      while(TRUE) {
        con=Dequeue(queue);
        if (con==null)
          sleepOn(queue);
        else
          ServiceWebPage(con);
      }
    }
More Definitions
• Lock: prevents someone from doing something
  – Lock before entering critical section and before accessing
    shared data
  – Unlock when leaving, after accessing shared data
  – Wait if locked
    » Important idea: all synchronization involves waiting
• For example: fix the milk problem by putting a key on the
  refrigerator
  – Lock it and take key if you are going to go buy milk
  – Fixes too much: roommate angry if only wants OJ

Too Much Milk: Correctness Properties
• Need to be careful about correctness of concurrent programs, since
  non-deterministic
  – Always write down behavior first
  – Impulse is to start coding first, then when it doesn’t work, pull
    hair out
  – Instead, think first, then code
• What are the correctness properties for the “Too much milk”
  problem???
  – Never more than one person buys
  – Someone buys if needed
• Restrict ourselves to use only atomic load and store operations as
  building blocks
Summary
• Concurrent threads are a very useful abstraction
– Allow transparent overlapping of computation and I/O
– Allow use of parallel processing when available
• Concurrent threads introduce problems when accessing
shared data
– Programs must be insensitive to arbitrary interleavings
– Without careful design, shared variables can become
completely inconsistent
• Important concept: Atomic Operations
– An operation that runs to completion or not at all
– These are the primitives on which to construct various
synchronization primitives
• Showed how to protect a critical section with only atomic load and
  store ⇒ pretty complex!
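The load-and-store-only protection the summary refers to is not reproduced in this excerpt; the classic two-thread construction is Peterson's algorithm, sketched here in Python. CPython's interpreter lock makes the plain loads and stores behave sequentially consistently, an assumption this sketch relies on:

```python
import threading

# Shared state: ordinary loads and stores only -- no test&set needed.
flag = [False, False]    # flag[i]: thread i wants the critical section
turn = 0                 # which thread must defer when both want in
counter = 0              # shared data the critical section protects

def worker(i, n):
    global turn, counter
    other = 1 - i
    for _ in range(n):
        flag[i] = True
        turn = other                          # let the other go first
        while flag[other] and turn == other:
            pass                              # busy-wait (the complex part)
        counter += 1                          # critical section
        flag[i] = False                       # exit protocol

threads = [threading.Thread(target=worker, args=(i, 50)) for i in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)   # 100: no increment was lost
```

Both correctness properties hold: at most one thread is ever inside the critical section, and a thread that wants in eventually gets in.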
CS162
Operating Systems and Systems Programming
Lecture 7

Interrupt disable and enable across context switches /
Atomic Read-Modify-Write instructions

Review: Synchronization problem with Threads
• One thread per transaction, each running:

    Deposit(acctId, amount) {
      acct = GetAccount(acctId);  /* May use disk I/O */
      acct->balance += amount;
      StoreAccount(acct);         /* Involves disk I/O */
    }
Examples of Read-Modify-Write
• test&set (&address) {            /* most architectures */
      result = M[address];
      M[address] = 1;
      return result;
  }
• swap (&address, register) {      /* x86 */
      temp = M[address];
      M[address] = register;
      register = temp;
  }
• compare&swap (&address, reg1, reg2) { /* 68000 */
      if (reg1 == M[address]) {
        M[address] = reg2;
        return success;
      } else {
        return failure;
      }
  }
• load-linked&store-conditional (&address) { /* R4000, alpha */
      loop:
        ll   r1, M[address];
        movi r2, 1;                /* Can do arbitrary computation */
        sc   r2, M[address];
        beqz r2, loop;
  }

Implementing Locks with test&set
• Another flawed, but simple solution:

    int value = 0; // Free
    Acquire() {
      while (test&set(value));     // while busy
    }
    Release() {
      value = 0;
    }

• Simple explanation:
  – If lock is free, test&set reads 0 and sets value=1, so lock is
    now busy. It returns 0, so the while exits.
  – If lock is busy, test&set reads 1 and sets value=1 (no change).
    It returns 1, so the while loop continues.
  – When we set value = 0, someone else can get the lock.
• Busy-Waiting: thread consumes cycles while waiting
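The test&set lock above can be modeled directly. Python has no hardware test&set, so a small lock stands in for the instruction's atomicity; that substitution is the only assumption here:

```python
import threading

class SpinLock:
    """Models the slide's lock: value 0 = Free, 1 = Busy."""
    def __init__(self):
        self.value = 0
        self._hw = threading.Lock()   # stands in for hardware atomicity

    def test_and_set(self):
        # Atomically: result = M[address]; M[address] = 1; return result
        with self._hw:
            result, self.value = self.value, 1
            return result

    def acquire(self):
        while self.test_and_set():    # spin while the old value was 1 (busy)
            pass

    def release(self):
        self.value = 0

lock = SpinLock()
lock.acquire()
assert lock.value == 1    # busy: another acquire() would now spin
lock.release()
assert lock.value == 0    # free: the next test&set returns 0 and wins
```

The acquire loop exhibits exactly the busy-waiting the slide warns about: a blocked caller burns CPU cycles until release() stores 0.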
CS162
Operating Systems and Systems Programming
Lecture 8

Readers-Writers
Language Support for Synchronization

September 27, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

Review: lock implementation by disabling interrupts, with a wait queue:

    Acquire() {
      disable interrupts;
      if (value == BUSY) {
        put thread on wait queue;
        Go to sleep();
        // Enable interrupts?
      } else {
        value = BUSY;
      }
      enable interrupts;
    }

    Release() {
      disable interrupts;
      if (anyone on wait queue) {
        take thread off wait queue;
        Place on ready queue;
      } else {
        value = FREE;
      }
      enable interrupts;
    }
Readers/Writers Problem
  [Diagram: one writer W and several readers R sharing a database]

• Motivation: Consider a shared database
  – Two classes of users:
    » Readers – never modify database
    » Writers – read and modify database
  – Is using a single lock on the whole database sufficient?
    » Like to have many readers at the same time
    » Only one writer at a time

Basic Readers/Writers Solution
• Correctness Constraints:
  – Readers can access database when no writers
  – Writers can access database when no readers or writers
  – Only one thread manipulates state variables at a time
• Basic structure of a solution:
  – Reader()
      Wait until no writers
      Access data base
      Check out – wake up a waiting writer
  – Writer()
      Wait until no active readers or writers
      Access database
      Check out – wake up waiting readers or writer
  – State variables (Protected by a lock called “lock”):
    » int AR: Number of active readers; initially = 0
    » int WR: Number of waiting readers; initially = 0
    » int AW: Number of active writers; initially = 0
    » int WW: Number of waiting writers; initially = 0
    » Condition okToRead = NIL
    » Condition okToWrite = NIL
Simulation (3)
• When writer wakes up, get:
    AR = 0, WR = 1, AW = 1, WW = 0
• Then, when writer finishes:

    if (WW > 0){              // Give priority to writers
      okToWrite.signal();     // Wake up one writer
    } else if (WR > 0) {      // Otherwise, wake reader
      okToRead.broadcast();   // Wake all readers
    }

  – Writer wakes up reader, so get:
    AR = 1, WR = 0, AW = 0, WW = 0
• When reader completes, we are finished

Questions
• Can readers starve? Consider Reader() entry code:

    while ((AW + WW) > 0) {   // Is it safe to read?
      WR++;                   // No. Writers exist
      okToRead.wait(&lock);   // Sleep on cond var
      WR--;                   // No longer waiting
    }
    AR++;                     // Now we are active!

• What if we erase the condition check in Reader exit?

    AR--;                     // No longer active
    if (AR == 0 && WW > 0)    // No other active readers
      okToWrite.signal();     // Wake up one writer

• Further, what if we turn the signal() into broadcast()?

    AR--;                     // No longer active
    okToWrite.broadcast();    // Wake up one writer

• Finally, what if we use only one condition variable (call it
  “okToContinue”) instead of two separate ones?
  – Both readers and writers sleep on this variable
  – Must use broadcast() instead of signal()
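The slides' Reader()/Writer() entry and exit code maps almost line-for-line onto Python condition variables. This sketch keeps the slides' AR/WR/AW/WW state and writer-priority wakeups; the class and method names are mine:

```python
import threading

class RWDatabase:
    """Readers/writers monitor with the slides' state variables."""
    def __init__(self):
        self.lock = threading.Lock()
        self.okToRead = threading.Condition(self.lock)
        self.okToWrite = threading.Condition(self.lock)
        self.AR = self.WR = self.AW = self.WW = 0
        self.violations = 0        # counts any constraint breach seen

    def start_read(self):
        with self.lock:
            while self.AW + self.WW > 0:   # writers exist: not safe
                self.WR += 1
                self.okToRead.wait()
                self.WR -= 1
            self.AR += 1                   # now we are active!

    def done_read(self):
        with self.lock:
            self.AR -= 1
            if self.AR == 0 and self.WW > 0:
                self.okToWrite.notify()    # wake one waiting writer

    def start_write(self):
        with self.lock:
            while self.AR + self.AW > 0:   # active readers or writers
                self.WW += 1
                self.okToWrite.wait()
                self.WW -= 1
            self.AW += 1

    def done_write(self):
        with self.lock:
            self.AW -= 1
            if self.WW > 0:
                self.okToWrite.notify()    # give priority to writers
            elif self.WR > 0:
                self.okToRead.notify_all() # otherwise wake all readers

db = RWDatabase()

def access(write):
    if write:
        db.start_write()
        if db.AW != 1 or db.AR != 0:       # one writer, zero readers
            db.violations += 1
        db.done_write()
    else:
        db.start_read()
        if db.AW != 0:                     # readers only when no writer
            db.violations += 1
        db.done_read()

threads = [threading.Thread(target=access, args=(i % 3 == 0,))
           for i in range(30)]
for t in threads: t.start()
for t in threads: t.join()
print((db.AR, db.WR, db.AW, db.WW, db.violations))  # (0, 0, 0, 0, 0)
```

Because writers are preferred on wakeup, a steady stream of writers can starve the readers, which is exactly the starvation question the slide raises.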
Can we construct Monitors from Semaphores?
• Locking aspect is easy: Just use a mutex
• Can we implement condition variables this way?

    Wait()   { semaphore.P(); }
    Signal() { semaphore.V(); }

  – Doesn’t work: Wait() may sleep with lock held
• Does this work better?

    Wait(Lock lock) {
      lock.Release();
      semaphore.P();
      lock.Acquire();
    }
    Signal() { semaphore.V(); }

  – No: Condition vars have no history, semaphores have history:
    » What if thread signals and no one is waiting? NO-OP
    » What if thread later waits? Thread Waits
    » What if thread V’s and no one is waiting? Increment
    » What if thread later does P? Decrement and continue

Construction of Monitors from Semaphores (con’t)
• Problem with previous try:
  – P and V are commutative – result is the same no matter what order
    they occur
  – Condition variables are NOT commutative
• Does this fix the problem?

    Wait(Lock lock) {
      lock.Release();
      semaphore.P();
      lock.Acquire();
    }
    Signal() {
      if semaphore queue is not empty
        semaphore.V();
    }

  – Not legal to look at contents of semaphore queue
  – There is a race condition – signaler can slip in after lock
    release and before waiter executes semaphore.P()
• It is actually possible to do this correctly
  – Complex solution for Hoare scheduling in book
  – Can you come up with simpler Mesa-scheduled solution?
C-Language Support for Synchronization
• Basic structure of monitor-based program:

    lock                       // Check and/or update
    while (need to wait) {     //   state variables;
      condvar.wait();          //   wait if necessary
    }
    unlock

    do something so no need to wait

    lock
    condvar.signal();          // Check and/or update
    unlock                     //   state variables

• A common C pattern, with an early error exit:

    lock.acquire();
    …
    if (exception) {
      lock.release();
      return errReturnCode;
    }
    …
    lock.release();
    return OK;

• Watch out for setjmp/longjmp!
  » Can cause a non-local jump out of procedure
  » In example, procedure E calls longjmp, popping stack back to
    procedure B
    [Stack: Proc B (calls setjmp) → Proc C (lock.acquire) → Proc D →
     Proc E (calls longjmp); stack grows downward]
  » If Procedure C had lock.acquire, problem!
C++ Language Support for Synchronization
• Languages with exceptions like C++
  – Languages that support exceptions are problematic (easy to make a
    non-local exit without releasing lock)
  – Consider:

      void Rtn() {
        lock.acquire();
        …
        DoFoo();
        …
        lock.release();
      }
      void DoFoo() {
        …
        if (exception) throw errException;
        …
      }

  – Notice that an exception in DoFoo() will exit without releasing
    the lock

C++ Language Support for Synchronization (con’t)
• Must catch all exceptions in critical sections
  – Catch exceptions, release lock, and re-throw exception:

      void Rtn() {
        lock.acquire();
        try {
          …
          DoFoo();
          …
        } catch (…) {        // catch exception
          lock.release();    // release lock
          throw;             // re-throw the exception
        }
        lock.release();
      }
      void DoFoo() {
        …
        if (exception) throw errException;
        …
      }

  – Even Better: auto_ptr<T> facility. See C++ Spec.
    » Can deallocate/free lock regardless of exit method
Java Language Support for Synchronization
• Java has explicit support for threads and thread synchronization
• Bank Account example:

    class Account {
      private int balance;
      // object constructor
      public Account (int initialBalance) {
        balance = initialBalance;
      }
      public synchronized int getBalance() {
        return balance;
      }
      public synchronized void deposit(int amount) {
        balance += amount;
      }
    }

  – Every object has an associated lock which gets automatically
    acquired and released on entry and exit from a synchronized
    method.

Java Language Support for Synchronization (con’t)
• Java also has synchronized statements:

    synchronized (object) {
      …
    }

  – Since every Java object has an associated lock, this type of
    statement acquires and releases the object’s lock on entry and
    exit of the body
  – Works properly even with exceptions:

    synchronized (object) {
      …
      DoFoo();
      …
    }
    void DoFoo() {
      throw errException;
    }
Java Language Support for Synchronization (con’t 2)
• In addition to a lock, every object has a single condition variable
  associated with it
  – How to wait inside a synchronized method or block:
    » void wait(long timeout);                  // Wait for timeout
    » void wait(long timeout, int nanoseconds); // variant
    » void wait();
  – How to signal in a synchronized method or block:
    » void notify();    // wakes up oldest waiter
    » void notifyAll(); // like broadcast, wakes everyone
  – Condition variables can wait for a bounded length of time. This
    is useful for handling exception cases:

      t1 = time.now();
      while (!ATMRequest()) {
        wait(CHECKPERIOD);
        t2 = time.now();
        if (t2 – t1 > LONG_TIME) checkMachine();
      }

  – Not all Java VMs equivalent!
    » Different scheduling policies, not necessarily preemptive!

Summary
• Semaphores: Like integers with restricted interface
  – Two operations:
    » P(): Wait if zero; decrement when becomes non-zero
    » V(): Increment and wake a sleeping task (if exists)
    » Can initialize value to any non-negative value
  – Use separate semaphore for each constraint
• Monitors: A lock plus one or more condition variables
  – Always acquire lock before accessing shared data
  – Use condition variables to wait inside critical section
    » Three Operations: Wait(), Signal(), and Broadcast()
• Readers/Writers
  – Readers can access database when no writers
  – Writers can access database when no readers
  – Only one thread manipulates state variables at a time
• Language support for synchronization:
  – Java provides synchronized keyword and one condition-variable per
    object (with wait() and notify())
CS162
Operating Systems and Systems Programming
Lecture 9

Tips for Working in a Project Team /
Cooperating Processes and Deadlock

September 29, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

Review: Definition of Monitor
• Semaphores are confusing because dual purpose:
  – Both mutual exclusion and scheduling constraints
  – Cleaner idea: Use locks for mutual exclusion and condition
    variables for scheduling constraints
• Monitor: a lock and zero or more condition variables for managing
  concurrent access to shared data
  – Use of Monitors is a programming paradigm
• Lock: provides mutual exclusion to shared data:
  – Always acquire before accessing shared data structure
  – Always release after finishing with shared data
• Condition Variable: a queue of threads waiting for something inside
  a critical section
  – Key idea: allow sleeping inside critical section by atomically
    releasing lock at time we go to sleep
  – Contrast to semaphores: Can’t wait inside critical section
Documents to Maintain
• Project objectives: goals, constraints, and priorities
• Specifications: the manual plus performance specs
  – This should be the first document generated and the last one
    finished
• Meeting notes
  – Document all decisions
  – You can often cut & paste for the design documents
• Schedule: What is your anticipated timing?
  – This document is critical!
• Organizational Chart
  – Who is responsible for what task?

Use Software Tools
• Source revision control software
  – (Subversion, CVS, others…)
  – Easy to go back and see history/undo mistakes
  – Figure out where and why a bug got introduced
  – Communicates changes to everyone (use CVS’s features)
• Use automated testing tools
  – Write scripts for non-interactive software
  – Use “expect” for interactive software
  – JUnit: automate unit testing
  – Microsoft rebuilds the Vista kernel every night with the day’s
    changes. Everyone is running/testing the latest software
• Use E-mail and instant messaging consistently to leave a history
  trail
Test Continuously

Administrivia
Resources
• Resources – passive entities needed by threads to do
their work
– CPU time, disk space, memory
• Two types of resources:
– Preemptable – can take it away
» CPU, Embedded security chip
– Non-preemptable – must leave it with the thread
» Disk space, plotter, chunk of virtual address space
» Mutual exclusion – the right to enter a critical section
• Resources may require exclusive access or may be
sharable
– Read-only files are typically sharable
– Printers are not sharable during time of printing
• One of the major tasks of an operating system is to
manage resources
Starvation vs Deadlock
• Starvation vs. Deadlock
  – Starvation: thread waits indefinitely
    » Example, low-priority thread waiting for resources constantly
      in use by high-priority threads
  – Deadlock: circular waiting for resources
    » Thread A owns Res 1 and is waiting for Res 2;
      Thread B owns Res 2 and is waiting for Res 1
    [Diagram: Thread A waits for Res 2, owned by Thread B;
     Thread B waits for Res 1, owned by Thread A]
  – Deadlock ⇒ Starvation but not vice versa
    » Starvation can end (but doesn’t have to)
    » Deadlock can’t end without external intervention

Conditions for Deadlock
• Deadlock not always deterministic – Example 2 mutexes:

    Thread A    Thread B
    x.P();      y.P();
    y.P();      x.P();
    y.V();      x.V();
    x.V();      y.V();

  – Deadlock won’t always happen with this code
    » Have to have exactly the right timing (“wrong” timing?)
    » So you release a piece of software, and you tested it, and
      there it is, controlling a nuclear power plant…
• Deadlocks occur with multiple resources
  – Means you can’t decompose the problem
  – Can’t solve deadlock for each resource independently
• Example: System with 2 disk drives and two threads
  – Each thread needs 2 disk drives to function
  – Each thread gets one disk and waits for another one
Review: Resource Allocation Graph Examples
• Recall:
  – request edge – directed edge Ti → Rj
  – assignment edge – directed edge Rj → Ti
  [Three example graphs over threads T1–T4 and resources R1–R4:
   a simple resource allocation graph; an allocation graph with
   deadlock; and an allocation graph with a cycle, but no deadlock]

Review: Methods for Handling Deadlocks
• Allow system to enter deadlock and then recover
  – Requires deadlock detection algorithm
  – Some technique for selectively preempting resources and/or
    terminating tasks
• Ensure that system will never enter a deadlock
  – Need to monitor all lock acquisitions
  – Selectively deny those that might lead to deadlock
• Ignore the problem and pretend that deadlocks never occur in the
  system
  – used by most operating systems, including UNIX
Goals for Today
• Preventing Deadlock
• Scheduling Policy goals
• Policy Options
• Implementation Considerations

Note: Some slides and/or pictures in the following are adapted from
slides ©2005 Silberschatz, Galvin, and Gagne.
Many slides generated from my lecture notes by Kubiatowicz.

Deadlock Detection Algorithm
• Only one of each type of resource ⇒ look for loops
• More General Deadlock Detection Algorithm
  – Let [X] represent an m-ary vector of non-negative integers
    (quantities of resources of each type):
      [FreeResources]: Current free resources each type
      [RequestX]:      Current requests from thread X
      [AllocX]:        Current resources held by thread X
  – See if tasks can eventually terminate on their own

      [Avail] = [FreeResources]
      Add all nodes to UNFINISHED
      do {
        done = true
        Foreach node in UNFINISHED {
          if ([Requestnode] <= [Avail]) {
            remove node from UNFINISHED
            [Avail] = [Avail] + [Allocnode]
            done = false
          }
        }
      } until(done)

  – Nodes left in UNFINISHED ⇒ deadlocked
  [Example graph with threads T1–T4 and resources R1, R2]
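The vector-based detection algorithm above translates directly into Python. The two-disk-drive scenario from the earlier deadlock example serves as a test case; the specific resource quantities are assumptions for illustration:

```python
def detect_deadlock(free, requests, allocs):
    """Vector-based deadlock detection, as in the slide's pseudocode.

    free     -- list: currently free units of each resource type
    requests -- dict: thread -> its current request vector
    allocs   -- dict: thread -> its current allocation vector
    Returns the set of deadlocked threads (UNFINISHED at termination).
    """
    avail = list(free)                      # [Avail] = [FreeResources]
    unfinished = set(requests)              # add all nodes to UNFINISHED
    done = False
    while not done:
        done = True
        for t in sorted(unfinished):        # Foreach node in UNFINISHED
            if all(r <= a for r, a in zip(requests[t], avail)):
                unfinished.discard(t)       # node can finish on its own
                avail = [a + h for a, h in zip(avail, allocs[t])]
                done = False                # freed resources: rescan
    return unfinished                       # leftovers are deadlocked

# Two disk drives, two threads; each holds one drive and requests the
# other -- the circular-wait example:
free = [0]
requests = {"A": [1], "B": [1]}
allocs = {"A": [1], "B": [1]}
print(detect_deadlock(free, requests, allocs))   # both threads deadlocked
```

Adding a third drive (`free = [1]`) lets either thread finish and return its drives, after which the other can run too, so the deadlock disappears.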
Comparisons between FCFS and Round Robin
• Assuming zero-cost context-switching time, is RR always better than
  FCFS?
• Simple example: 10 jobs, each take 100s of CPU time
                  RR scheduler quantum of 1s
                  All jobs start at the same time
• Completion Times:

    Job #   FIFO    RR
    1        100   991
    2        200   992
    …         …     …
    9        900   999
    10      1000  1000

  – Both RR and FCFS finish at the same time
  – Average response time is much worse under RR!
    » Bad when all jobs same length
• Also: Cache state must be shared between all jobs with RR but can
  be devoted to each job with FIFO
  – Total time for RR longer even for zero-cost switch!

Earlier Example with Different Time Quantum
  Best FCFS order: P2 [8], P4 [24], P1 [53], P3 [68]
  (timeline: 0, 8, 32, 85, 153)

  Wait Time:
    Quantum        P1    P2    P3    P4   Average
    Best FCFS      32     0    85     8    31¼
    Q = 1          84    22    85    57    62
    Q = 5          82    20    85    58    61¼
    Q = 8          80     8    85    56    57¼
    Q = 10         82    10    85    68    61¼
    Q = 20         72    20    85    88    66¼
    Worst FCFS     68   145     0   121    83½

  Completion Time:
    Quantum        P1    P2    P3    P4   Average
    Best FCFS      85     8   153    32    69½
    Q = 1         137    30   153    81   100½
    Q = 5         135    28   153    82    99½
    Q = 8         133    16   153    80    95½
    Q = 10        135    18   153    92    99½
    Q = 20        125    28   153   112   104½
    Worst FCFS    121   153    68   145   121¾
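The 10-job completion-time comparison can be replayed with a short simulation. `fifo_completions` and `rr_completions` (names mine) reproduce the slide's FIFO column 100…1000 and RR column 991…1000:

```python
def fifo_completions(burst, njobs):
    """FIFO/FCFS: jobs run to completion in submission order."""
    t, out = 0, []
    for _ in range(njobs):
        t += burst
        out.append(t)
    return out

def rr_completions(burst, njobs, quantum=1):
    """Round Robin: cycle over the jobs, one quantum at a time."""
    remaining = [burst] * njobs
    done = [0] * njobs
    t, i = 0, 0
    while any(remaining):
        if remaining[i]:
            run = min(quantum, remaining[i])
            remaining[i] -= run
            t += run
            if remaining[i] == 0:
                done[i] = t           # job i completes at time t
        i = (i + 1) % njobs
    return done

# 10 jobs of 100s each, RR quantum of 1s -- the slide's example:
print(fifo_completions(100, 10))      # [100, 200, ..., 1000]
print(rr_completions(100, 10))        # [991, 992, ..., 1000]
```

Both schedules finish at t = 1000, but the average completion time is 550s under FIFO versus 995.5s under RR, matching the slide's point that RR is at its worst when all jobs have the same length.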
What if we Knew the Future?
• Could we always mirror best FCFS?
• Shortest Job First (SJF):
  – Run whatever job has the least amount of computation to do
  – Sometimes called “Shortest Time to Completion First” (STCF)
• Shortest Remaining Time First (SRTF):
  – Preemptive version of SJF: if job arrives and has a shorter time
    to completion than the remaining time on the current job,
    immediately preempt CPU
  – Sometimes called “Shortest Remaining Time to Completion
    First” (SRTCF)
• These can be applied either to a whole program or the current CPU
  burst of each program
  – Idea is to get short jobs out of the system
  – Big effect on short jobs, only small effect on long ones
  – Result is better average response time

Discussion
• SJF/SRTF are the best you can do at minimizing average response
  time
  – Provably optimal (SJF among non-preemptive, SRTF among
    preemptive)
  – Since SRTF is always at least as good as SJF, focus on SRTF
• Comparison of SRTF with FCFS and RR
  – What if all jobs the same length?
    » SRTF becomes the same as FCFS (i.e. FCFS is best can do if all
      jobs the same length)
  – What if jobs have varying length?
    » SRTF (and RR): short jobs not stuck behind long ones
Summary (Scheduling)
• Scheduling: selecting a waiting process from the ready
queue and allocating the CPU to it
• FCFS Scheduling:
– Run threads to completion in order of submission
– Pros: Simple
– Cons: Short jobs get stuck behind long ones
• Round-Robin Scheduling:
– Give each thread a small amount of CPU time when it
executes; cycle between all ready threads
– Pros: Better for short jobs
– Cons: Poor when jobs are same length
• Shortest Job First (SJF)/Shortest Remaining Time
First (SRTF):
– Run whatever job has the least amount of computation to
do/least remaining amount of computation to do
– Pros: Optimal (average response time)
– Cons: Hard to predict future, Unfair
Review: Banker’s Algorithm for Preventing Deadlock

Review: Last Time
• Scheduling: selecting a waiting process from the ready queue and
  allocating the CPU to it
• FCFS Scheduling: …

Review: FCFS and RR Example with Different Quantum
  Best FCFS order: P2 [8], P4 [24], P1 [53], P3 [68]
  (timeline: 0, 8, 32, 85, 153)

Assumption: CPU Bursts
  [Plot: response time vs. utilization, rising sharply toward 100%]
  » Assuming you’re paying for worse response time in reduced
    productivity, customer angst, etc…
  » Might think that you should buy a faster X when X is utilized
    100%, but usually, response time goes to infinity as
    utilization → 100%
• An interesting implication of this curve:
  – Most scheduling algorithms work fine in the “linear” portion of
    the load curve, fail otherwise
  – Argues for buying a faster X when hit “knee” of curve

Virtualizing Resources
• Why worry about memory sharing?
  – The complete working state of a process and/or kernel is defined
    by its data in memory (and registers)
  – Need to multiplex disk and devices (later in term)
  – Consequently, cannot just let different threads of control use
    the same memory
    » Physics: two different pieces of data cannot occupy the same
      locations in memory
  – Probably don’t want different threads to even have access to each
    other’s memory (protection)
Uniprogramming (no translation or protection)
  [Memory layout: a single Application loaded at 0x00000000]
  – Application given illusion of dedicated machine by giving it
    reality of a dedicated machine
• Of course, this doesn’t help us with multithreading

Multiprogramming (no translation or protection)
  [Memory layout: Application1 at 0x00000000, with further
   applications loaded above it]
  – Trick: Use Loader/Linker: Adjust addresses while program loaded
    into memory (loads, stores, jumps)
    » Everything adjusted to memory location of program
    » Translation done by a linker-loader
    » Was pretty common in early days
• With this solution, no protection: bugs in any program can cause
  other programs to crash or even the OS
Multiprogramming (Version with Protection)
• Can we protect programs from each other without translation?
  – Yes: use two special registers BaseAddr and LimitAddr to prevent user from straying outside designated area
    » If user tries to access an illegal address, cause an error
  – During switch, kernel loads new base/limit from TCB
    » User not allowed to change base/limit registers
(Figure: Operating System at top of memory (0xFFFFFFFF), Application2 at 0x00020000, Application1 at 0x00000000; e.g. BaseAddr=0x20000, LimitAddr=0x10000.)

Segmentation with Base and Limit registers
(Figure: CPU emits virtual address; Base is added to form the physical address sent to DRAM; the address is checked against Limit (<?); if the check fails: Error.)
• Could use base/limit for dynamic address translation (often called “segmentation”):
  – Alter address of every load/store by adding “base”
  – User allowed to read/write within segment
    » Accesses are relative to segment so don’t have to be relocated when program moved to different segment
  – User may have multiple segments available (e.g. x86)
    » Loads and stores include segment ID in opcode: x86 Example: mov [es:bx],ax
    » Operating system moves around segment base pointers as necessary
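The base-and-limit check above can be sketched in a few lines. This is an illustrative Python model of what the hardware does on every access, using the example register values from the slide; the names `translate` and `SegFaultError` are invented:

```python
# Sketch of base-and-limit address translation, as done in hardware
# on every load/store (illustrative model, not real hardware).

class SegFaultError(Exception):
    """Raised when a virtual address falls outside the segment."""

def translate(vaddr, base, limit):
    # Hardware checks the virtual address against the limit...
    if vaddr >= limit:
        raise SegFaultError(f"address {vaddr:#x} exceeds limit {limit:#x}")
    # ...and adds the base to form the physical address.
    return base + vaddr

# Example register values from the slide:
BASE, LIMIT = 0x20000, 0x10000
```

With these values, virtual address 0x1000 maps to physical 0x21000, while anything at or above 0x10000 faults.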
Issues with simple segmentation method
(Figure: user view of memory space vs. physical memory space as processes 2, 5, 6, 9, and 10 are allocated and freed over time, leaving fragmented holes.)
• Fragmentation problem
  – Not every process is the same size
  – Over time, memory space becomes fragmented
• Hard to do inter-process sharing
  – Want to share code segments when possible
  – Want to share memory between processes
  – Helped by providing multiple segments per process
• Need enough physical memory for every process

Multiprogramming (Translation and Protection version 2)
• Problem: Run multiple applications in such a way that they are protected from one another
• Goals:
  – Isolate processes and kernel from one another
  – Allow flexible translation that:
    » Doesn’t lead to fragmentation
    » Allows easy sharing between processes
    » Allows only part of process to be resident in physical memory
• (Some of the required) Hardware Mechanisms:
  – General Address Translation
    » Flexible: Can fit physical chunks of memory into arbitrary places in user’s address space
    » Not limited to small number of segments
    » Think of this as providing a large number (thousands) of fixed-sized segments (called “pages”)
  – Dual Mode Operation
    » Protection base involving kernel/user distinction
Example of General Address Translation
(Figure: Prog 1’s virtual address space (Code, Data, Heap, Stack) and Prog 2’s virtual address space each map, through Translation Map 1 and Translation Map 2, into one physical address space holding Code 1/Code 2, Data 1/Data 2, Heap 1/Heap 2, Stack 1/Stack 2, plus OS code, OS data, and OS heap & stacks.)

Two Views of Memory
(Figure: CPU issues virtual addresses; the MMU translates them to physical addresses sent to memory; an untranslated read/write path also exists.)
• Recall: Address Space:
  – All the addresses and state a process can touch
  – Each process and kernel has different address space
• Consequently: two views of memory:
  – View from the CPU (what program sees, virtual memory)
  – View from memory (physical memory)
  – Translation box converts between the two views
• Translation helps to implement protection
  – If task A cannot even gain access to task B’s data, no way for A to adversely affect B
• With translation, every program can be linked/loaded into same region of user address space
  – Overlap avoided through translation, not relocation
Summary (2)
• Memory is a resource that must be shared
– Controlled Overlap: only shared when appropriate
– Translation: Change Virtual Addresses into Physical
Addresses
– Protection: Prevent unauthorized Sharing of resources
• Simple Protection through Segmentation
– Base+limit registers restrict memory accessible to user
– Can be used to translate as well
• Full translation of addresses through Memory
Management Unit (MMU)
– Every access translated through page table
– Changing of page tables only available to kernel
• Dual-Mode
– Kernel/User distinction: User restricted
– User→Kernel: System calls, Traps, or Interrupts
– Inter-process communication: shared memory, or
through kernel (system calls)
CS162
Operating Systems and Systems Programming
Lecture 12
Protection (continued)
Address Translation
October 11, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

Review: Important Aspects of Memory Multiplexing
• Controlled overlap:
  – Separate state of threads should not collide in physical memory. Obviously, unexpected overlap causes chaos!
  – Conversely, would like the ability to overlap when desired (for communication)
• Translation:
  – Ability to translate accesses from one address space (virtual) to a different one (physical)
  – When translation exists, processor uses virtual addresses, physical memory uses physical addresses
  – Side effects:
    » Can be used to avoid overlap
    » Can be used to give uniform view of memory to programs
• Protection:
  – Prevent access to private memory of other processes
    » Different pages of memory can be given special behavior (Read Only, Invisible to user programs, etc.)
    » Kernel data protected from User programs
    » Programs protected from themselves
Review: General Address Translation
(Figure: repeat of the two-program picture; Prog 1 and Prog 2 virtual address spaces mapped through their translation maps into one physical address space.)

Review: Simple Segmentation: Base and Bounds (CRAY-1)
(Figure: CPU virtual address plus Base forms the physical address sent to DRAM; the address is compared against Limit (>?); if bigger: Error.)
• Can use base & bounds/limit for dynamic address translation (Simple form of “segmentation”):
  – Alter every address by adding “base”
  – Generate error if address bigger than limit
• This gives program the illusion that it is running on its own dedicated machine, with memory starting at 0
  – Program gets contiguous region of memory
  – Addresses within program do not have to be relocated when program placed in different region of DRAM
Review: Cons for Simple Segmentation Method
• Fragmentation problem (complex memory allocation)
  – Not every process is the same size
  – Over time, memory space becomes fragmented
  – Really bad if want space to grow dynamically (e.g. heap)
(Figure: user view of memory space vs. physical memory space as processes 2, 5, 6, 9, and 10 come and go, fragmenting memory.)

Goals for Today
• Address Translation Schemes
  – Segmentation
  – Paging
  – Multi-level translation

Implementation of Multi-Segment Model
(Figure: virtual address = Seg # | Offset. The Seg # indexes a table of Base/Limit pairs (Base0–Base7, Limit0–Limit7), each with a V/N bit. The offset is checked against the limit (> ⇒ Error) and added to the base to form the physical address.)
• Segment map resides in processor
  – Segment number mapped into base/limit pair
  – Base added to offset to generate physical address
  – Error check catches offset out of range
• As many chunks of physical memory as entries
  – Segment addressed by portion of virtual address
  – However, could be included in instruction instead:
    » x86 Example: mov [es:bx],ax
• What is “V/N”?
  – Can mark segments as invalid; requires check as well
• Logical View: multiple separate segments
  – Typical: Code, Data, Stack
  – Others: memory sharing, etc.
• Each segment is given region of contiguous memory
  – Has a base and limit
  – Can reside anywhere in physical memory
Intel x86 Special Registers
(Figure: 80386 special registers and typical segment register layout. Current privilege is the RPL of the Code Segment (CS).)

Example: Four Segments (16 bit addresses)
• Virtual Address Format: Seg # in bits 15–14, Offset in bits 13–0

  Seg ID #    Base     Limit
  0 (code)    0x4000   0x0800
  1 (data)    0x4800   0x1400
  2 (shared)  0xF000   0x1000
  3 (stack)   0x0000   0x3000

(Figure: the virtual address space maps into physical memory: code at 0x4000, data at 0x4800–0x5C00 (might be shared), stack at 0x0000, shared segment at 0xF000 (shared with other apps); 0x8000–0xC000 is space for other apps.)
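The four-segment example can be walked through mechanically: a 16-bit virtual address is split into a 2-bit segment ID (bits 15–14) and a 14-bit offset (bits 13–0), then looked up in the segment table from the slide. A minimal Python sketch (the helper name `translate16` is invented):

```python
# Sketch of the four-segment, 16-bit-address example from the slide.
SEG_TABLE = {  # seg id -> (base, limit)
    0: (0x4000, 0x0800),  # code
    1: (0x4800, 0x1400),  # data
    2: (0xF000, 0x1000),  # shared
    3: (0x0000, 0x3000),  # stack
}

def translate16(vaddr):
    seg = (vaddr >> 14) & 0x3   # top two bits select the segment
    offset = vaddr & 0x3FFF     # remaining 14 bits are the offset
    base, limit = SEG_TABLE[seg]
    if offset >= limit:
        raise ValueError(f"offset {offset:#x} out of range for segment {seg}")
    return base + offset
```

For instance, virtual address 0x4050 has segment ID 1 (data) and offset 0x0050, so it translates to physical 0x4850.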
CS162 Lecture 13
Address Translation (con’t)
Caches and TLBs
October 13, 2010
http://inst.eecs.berkeley.edu/~cs162

Review: Multi-level Translation
(Figure: the segment map holds Base/Limit pairs with V/N bits (Base0 Limit0 V; Base1 Limit1 V; Base2 Limit2 V; Base3 Limit3 N; Base4 Limit4 V; Base5 Limit5 N; Base7 Limit7 V); a failed limit check (>) raises Access Error. The selected page table holds entries page #0 V,R; page #1 V,R; page #2 V,R,W; page #3 V,R,W; page #4 N; page #5 V,R,W; the physical page # plus offset forms the Physical Address, and a permission check (Check Perm) can raise Access Error.)
• What must be saved/restored on context switch?
  – Contents of top-level segment registers (for this example)
  – Pointer to top-level table (page table)
Two-level page table
(Figure: PageTablePtr points at a fixed-size top-level table; each 4-byte entry points at a second-level table of 4-byte entries.)
• Tree of Page Tables
• Tables fixed size (1024 entries)
  – On context-switch: save single PageTablePtr register
• Sometimes, top-level page tables called “directories” (Intel)
• Each entry called a (surprise!) Page Table Entry (PTE)
What is in a PTE?
• What is in a Page Table Entry (or PTE)?
  – Pointer to next-level page table or to actual page
  – Permission bits: valid, read-only, read-write, write-only
• Example: Intel x86 architecture PTE:
  – Address same format as previous slide (10, 10, 12-bit offset)
  – Intermediate page tables called “Directories”

  Bits 31–12: Page Frame Number (Physical Page Number)
  Bits 11–9:  Free (for OS use)
  Bit 8: 0    Bit 7: L    Bit 6: D    Bit 5: A    Bit 4: PCD    Bit 3: PWT    Bit 2: U    Bit 1: W    Bit 0: P

  P: Present (same as “valid” bit in other architectures)
  W: Writeable
  U: User accessible
  PWT: Page write transparent: external cache write-through
  PCD: Page cache disabled (page cannot be cached)
  A: Accessed: page has been accessed recently
  D: Dirty (PTE only): page has been modified recently
  L: L=1 ⇒ 4MB page (directory only). Bottom 22 bits of virtual address serve as offset

Examples of how to use a PTE
• How do we use the PTE?
  – Invalid PTE can imply different things:
    » Region of address space is actually invalid, or
    » Page/directory is just somewhere else than memory
  – Validity checked first
    » OS can use other (say) 31 bits for location info
• Usage Example: Demand Paging
  – Keep only active pages in memory
  – Place others on disk and mark their PTEs invalid
• Usage Example: Copy on Write
  – UNIX fork gives copy of parent address space to child
    » Address spaces disconnected after child created
  – How to do this cheaply?
    » Make copy of parent’s page tables (point at same memory)
    » Mark entries in both sets of page tables as read-only
    » Page fault on write creates two copies
• Usage Example: Zero Fill On Demand
  – New data pages must carry no information (say be zeroed)
  – Mark PTEs as invalid; page fault on use gets zeroed page
  – Often, OS creates zeroed pages in background
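The bit layout above can be checked with a tiny decoder. This Python sketch follows the slide's bit positions; `decode_pte` and `PTE_FLAGS` are invented helper names:

```python
# Sketch decoding the x86 PTE bits listed on the slide.
PTE_FLAGS = {  # bit position -> flag name
    0: "P", 1: "W", 2: "U", 3: "PWT", 4: "PCD", 5: "A", 6: "D", 7: "L",
}

def decode_pte(pte):
    flags = {name: bool(pte >> bit & 1) for bit, name in PTE_FLAGS.items()}
    frame = pte >> 12            # bits 31-12: physical page frame number
    os_bits = (pte >> 9) & 0x7   # bits 11-9: free for the OS to use
    return frame, os_bits, flags

# Example: frame 0x12345, Present, Writeable, Accessed, Dirty
frame, os_bits, flags = decode_pte((0x12345 << 12) | 0b0110_0011)
```

Here `frame` comes out as 0x12345, with P, W, A, and D set and U clear.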
(Figure: processor-memory performance gap, 1980–2000. µProc performance grows 60%/yr ("Moore's Law", really Joy's Law: 2X/1.5yr); DRAM grows 9%/yr (2X/10 yrs); the Processor-Memory Performance Gap grows 50%/year. "Less' Law?")
• Cache: a repository for copies that can be accessed more quickly than the original
  – Make frequent case fast and infrequent case less dominant
• Caching underlies many of the techniques that are used today to make computers fast
  – Can cache: memory locations, address translations, pages, file blocks, file names, network routes, etc…
• Only good if:
  – Frequent case frequent enough and
  – Infrequent case not too expensive
• Important measure: Average Access Time = (Hit Rate × Hit Time) + (Miss Rate × Miss Time)
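The average access time formula is worth a quick numeric check. A one-line helper (`avg_access_time` is an invented name):

```python
# Average access time = hit_rate * hit_time + miss_rate * miss_time
def avg_access_time(hit_rate, hit_time, miss_time):
    miss_rate = 1.0 - hit_rate
    return hit_rate * hit_time + miss_rate * miss_time

# e.g. a 99%-hit cache with 1 ns hits and 100 ns misses averages 1.99 ns
amat = avg_access_time(0.99, 1.0, 100.0)
```

Note how even a 1% miss rate roughly doubles the average time when misses are 100x slower than hits, which is why miss rate dominates cache design.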
Another Major Reason to Deal with Caching
(Figure: repeat of the multi-level translation datapath: Virtual Address = Virtual Seg # | Virtual Page # | Offset; segment map of Base/Limit pairs with V/N bits; page table entries page #0 V,R through page #5 V,R,W; limit and permission checks can raise Access Error; result is Physical Page # | Offset.)
• Cannot afford to translate on every access
  – At least three DRAM accesses per actual DRAM access
  – Or: perhaps I/O if page table partially on disk!
• Even worse: What if we are using caching to make memory access faster than DRAM access???
• Solution? Cache translations!
  – Translation Cache: TLB (“Translation Lookaside Buffer”)

Why Does Caching Help? Locality!
(Figure: probability of reference varies across the address space 0 … 2^n − 1; blocks X and Y move between the upper-level and lower-level memory.)
• Temporal Locality (Locality in Time):
  – Keep recently accessed data items closer to processor
• Spatial Locality (Locality in Space):
  – Move contiguous blocks to the upper levels
Memory Hierarchy of a Modern Computer System
• Take advantage of the principle of locality to:
  – Present as much memory as in the cheapest technology
  – Provide access at speed offered by the fastest technology
(Figure: Processor (control, datapath, registers, on-chip cache) → second-level cache → main memory → secondary storage (disk) → tertiary storage (tape).)

A Summary on Sources of Cache Misses
• Compulsory (cold start or process migration, first reference): first access to a block
  – “Cold” fact of life: not a whole lot you can do about it
  – Note: If you are going to run “billions” of instructions, Compulsory Misses are insignificant
• Capacity:
  – Cache cannot contain all blocks accessed by the program
  – Solution: increase cache size
• Conflict (collision):
  – Index identifies the set; Tag used to identify actual copy
  – If no candidates match, then declare cache miss
• Block is minimum quantum of caching
  – Data select field used to select data within block
  – Many caching applications don’t have data select field
(Figure: direct-mapped cache rows with Valid Bit, Cache Tag, and 32-byte Cache Data: row 0 holds Byte 0–Byte 31, row 1 (tag 0x50) holds Byte 32–Byte 63, …, row 31 holds Byte 992–Byte 1023.)
Review: Set Associative Cache
• N-way set associative: N entries per Cache Index
  – N direct mapped caches operate in parallel
• Example: Two-way set associative cache
  – Cache Index selects a “set” from the cache
  – Two tags in the set are compared to input in parallel
  – Data is selected based on the tag result
  – Still have byte select to choose from within block
(Figure: address fields Cache Tag (bits 31–9) | Cache Index (bits 8–5) | Byte Select (bits 4–0); two banks of Valid/Cache Tag/Cache Data are read, both tags compared (=) in parallel, and a mux (Sel1/Sel0) picks the matching Cache Block.)

Review: Fully Associative Cache
• Fully Associative: Every block can hold any line
  – Address does not include a cache index
  – Compare Cache Tags of all Cache Entries in Parallel
• Example: Block Size = 32B blocks
  – We need N 27-bit comparators
(Figure: address fields Cache Tag (27 bits, bits 31–5) | Byte Select (bits 4–0, Ex: 0x01); every entry’s tag is compared (=) in parallel; the OR of the matches signals Hit and the matching entry supplies the Cache Block.)
Where does a Block Get Placed in a Cache?
• Example: Block 12 placed in 8 block cache (32-block address space)
  – Direct mapped: only into block (12 mod 8) = 4
  – 2-way set associative: anywhere in set (12 mod 4) = 0
  – Fully associative: anywhere in the cache

Review: Which block should be replaced on a miss?
• Easy for Direct Mapped: Only one possibility
• Set Associative or Fully Associative:
  – Random
  – LRU (Least Recently Used)
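The placement arithmetic above is just modular indexing. A sketch with invented helper names (`direct_mapped_slot`, `set_assoc_set`):

```python
# Where block 12 lands in an 8-block cache under each placement policy.
def direct_mapped_slot(block_no, num_blocks):
    return block_no % num_blocks        # exactly one candidate line

def set_assoc_set(block_no, num_sets):
    return block_no % num_sets          # any way within the chosen set

# Block 12, 8-block cache:
#   direct mapped         -> line 12 mod 8 = 4
#   2-way set associative -> set 12 mod 4 = 0 (two candidate lines)
#   fully associative     -> any of the 8 lines
```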
What TLB organization makes sense?
(Figure: CPU → TLB → Cache → Memory.)
• Needs to be really fast
  – Critical path of memory access
    » In simplest view: before the cache
    » Thus, this adds to access time (reducing cache speed)
  – Seems to argue for Direct Mapped or Low Associativity
• However, needs to have very few conflicts!
  – With TLB, the Miss Time extremely high!
  – This argues that cost of Conflict (Miss Time) is much higher than slightly increased cost of access (Hit Time)
• Thrashing: continuous conflicts between accesses
  – What if use low order bits of page as index into TLB?
    » First page of code, data, stack may map to same entry
    » Need 3-way associativity at least?
  – What if use high order bits as index?
    » TLB mostly unused for small programs

TLB organization: include protection
• How big does TLB actually have to be?
  – Usually small: 128-512 entries
  – Not very big, can support higher associativity
• TLB usually organized as fully-associative cache
  – Lookup is by Virtual Address
  – Returns Physical Address + other info
• What happens when fully-associative is too slow?
  – Put a small (4-16 entry) direct-mapped cache in front
  – Called a “TLB Slice”
• Example for MIPS R3000:

  Virtual Address  Physical Address  Dirty  Ref  Valid  Access  ASID
  0xFA00           0x0003            Y      N    Y      R/W     34
  0x0040           0x0010            N      Y    Y      R       0
  0x0041           0x0011            N      Y    Y      R       0
Example: R3000 pipeline includes TLB “stages”
• MIPS R3000 Pipeline: Inst Fetch (TLB, I-Cache) | Dcd/Reg (RF) | ALU / E.A (Operation, E.A. TLB) | Memory (D-Cache) | Write Reg (WB)
  – TLB: 64 entry, on-chip, fully associative; software TLB fault handler

Reducing translation time further
• As described, TLB lookup is in serial with cache lookup:
(Figure: Virtual Address = virtual page number | 10-bit offset; TLB Lookup yields V, Access Rights, and PA, which then drive the cache access.)
(Figure residue from the memory-hierarchy slide: registers ~1 ns (100s of bytes), cache 10s–100s ns (Ks–Ms), main memory 100s ns (Ms), disk 10,000,000s ns = 10s ms (Gs), tape 10,000,000,000s ns = 10s sec (Ts).)
Caching Applied to Address Translation
(Figure: CPU issues Virtual Address → TLB: cached? Yes → Physical Address straight to Physical Memory; No → Translate via MMU/page tables, then access. Data read or write is untranslated.)
• Question is one of page locality: does it exist?
  – Instruction accesses spend a lot of time on the same page (since accesses sequential)
  – Stack accesses have definite locality of reference
  – Data accesses have less page locality, but still some…
• Can we have a TLB hierarchy?
  – Sure: multiple levels at different sizes/speeds

What Actually Happens on a TLB Miss?
• Hardware traversed page tables:
  – On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels)
    » If PTE valid, hardware fills TLB and processor never knows
    » If PTE marked as invalid, causes Page Fault, after which kernel decides what to do
• Software traversed page tables (like MIPS)
  – On TLB miss, processor receives TLB fault
  – Kernel traverses page table to find PTE
    » If PTE valid, fills TLB and returns from fault
    » If PTE marked as invalid, internally calls Page Fault handler
• Most chip sets provide hardware traversal
  – Modern operating systems tend to have more TLB faults since they use translation for many things
  – Examples:
    » shared segments
    » user-level portions of an operating system
Reducing translation time further (con’t)
– Need to do something else. See CS152/252
• Another option: Virtual Caches

Demand Paging
(Figure: memory hierarchy with cache (SRAM), physical memory 512 MB (DRAM), virtual memory 4 GB, and disk 500 GB; the Page Table maps virtual to physical at 1-page granularity.)
• Disk is larger than physical memory ⇒
  – In-use virtual memory can be bigger than physical memory
  – Combined memory of running processes much larger than physical memory
    » More programs fit into memory, allowing more concurrency
• Principle: Transparent Level of Indirection (page table)
  – Supports flexible placement of physical data
    » Data could be on disk or somewhere across network
  – Variable location of data transparent to user program
    » Performance issue, not correctness issue

Demand Paging is Caching
• Since demand paging is caching, must ask:
  – What is organization of this cache (i.e. direct-mapped, set-associative, fully-associative)?
    » Fully associative: arbitrary virtual→physical mapping
  – How do we find a page in the cache when we look for it?
    » First check TLB, then page-table traversal
  – What is page replacement policy? (i.e. LRU, Random…)
    » This requires more explanation… (kinda LRU)
  – What happens on a miss?
    » Go to lower level to fill miss (i.e. disk)
  – What happens on a write? (write-through, write back)
    » Definitely write-back. Need dirty bit!
Review: What is in a PTE?
• What is in a Page Table Entry (or PTE)?
  – Pointer to next-level page table or to actual page
  – Permission bits: valid, read-only, read-write, write-only
• Example: Intel x86 architecture PTE:
  – Address same format as previous slide (10, 10, 12-bit offset)
  – Intermediate page tables called “Directories”
  – Bits: P (Present, same as “valid” elsewhere), W (Writeable), U (User accessible), PWT (page write transparent: external cache write-through), PCD (page cache disabled), A (Accessed recently), D (Dirty, PTE only: modified recently), L (L=1 ⇒ 4MB page, directory only; bottom 22 bits of virtual address serve as offset); bits 11–9 free for the OS

Demand Paging Mechanisms
• PTE helps us implement demand paging
  – Valid ⇒ Page in memory, PTE points at physical page
  – Not Valid ⇒ Page not in memory; use info in PTE to find it on disk when necessary
• Suppose user references page with invalid PTE?
  – Memory Management Unit (MMU) traps to OS
    » Resulting trap is a “Page Fault”
  – What does OS do on a Page Fault?:
    » Choose an old page to replace
    » If old page modified (“D=1”), write contents back to disk
    » Change its PTE and any cached TLB to be invalid
    » Load new page into memory from disk
    » Update page table entry, invalidate TLB for new entry
    » Continue thread from original faulting location
  – TLB for new page will be loaded when thread continued!
  – While pulling pages off disk for one process, OS runs another process from ready queue
    » Suspended process sits on wait queue
Software-Loaded TLB
• MIPS approach: the OS, not hardware, fills the TLB
  – High TLB hit rate ⇒ ok to trap to software to fill the TLB, even if slower
  – Simpler hardware and added flexibility: software can maintain translation tables in whatever convenient format
• How can a process run without access to page table?
  – Fast path (TLB hit with valid=1):
    » Translation to physical page done by hardware
  – Slow path (TLB hit with valid=0 or TLB miss)
    » Hardware receives a “TLB Fault”
  – What does OS do on a TLB Fault?
    » Traverse page table to find appropriate PTE
    » If valid=1, load page table entry into TLB, continue thread
    » If valid=0, perform “Page Fault” detailed previously
    » Continue thread
• Everything is transparent to the user process:
  – It doesn’t know about paging to/from disk
  – It doesn’t even know about software TLB handling

Transparent Exceptions
(Figure: the user thread runs Faulting Inst 1, Faulting Inst 2, …; each TLB fault traps to the OS, which fetches the page / loads the TLB, then resumes the user at the faulting instruction.)
• How to transparently restart faulting instructions?
  – Could we just skip it?
    » No: need to perform load or store after reconnecting physical page
• Hardware must help out by saving:
  – Faulting instruction and partial state
    » Need to know which instruction caused fault
    » Is single PC sufficient to identify faulting position?
  – Processor State: sufficient to restart user thread
    » Save/restore registers, stack, etc
• What if an instruction has side-effects?
Consider weird things that can happen
• What if an instruction has side effects?
  – Options:
    » Unwind side-effects (easy to restart)
    » Finish off side-effects (messy!)
  – Example 1: mov (sp)+,10
    » What if page fault occurs when write to stack pointer?
    » Did sp get incremented before or after the page fault?
  – Example 2: strcpy (r1), (r2)
    » Source and destination overlap: can’t unwind in principle!
    » IBM S/370 and VAX solution: execute twice – once read-only
• What about “RISC” processors?
  – For instance delayed branches?
    » Example: bne somewhere
               ld r1,(sp)
    » Precise exception state consists of two PCs: PC and nPC
  – Delayed exceptions:
    » Example: div r1, r2, r3
               ld r1, (sp)
    » What if takes many cycles to discover divide by zero, but load has already caused page fault?

Precise Exceptions
• Precise ⇒ state of the machine is preserved as if program executed up to the offending instruction
  – All previous instructions completed
  – Offending instruction and all following instructions act as if they have not even started
  – Same system code will work on different implementations
  – Difficult in the presence of pipelining, out-of-order execution, ...
  – MIPS takes this position
• Imprecise ⇒ system software has to figure out what is where and put it all back together
• Performance goals often lead designers to forsake precise interrupts
  – System software developers, users, markets etc. usually wish they had not done this
• Modern techniques for out-of-order execution and branch prediction help implement precise interrupts
10/20/10 Kubiatowicz CS162 ©UCB Fall 2010 Lec 14.25 10/20/10 Kubiatowicz CS162 ©UCB Fall 2010 Lec 14.26
– MIN: 5 faults
  » Where will D be brought in? Look for page not referenced farthest in future
– FIFO: 7 faults
  » When referencing D, replacing A is bad choice, since need A again right away
• What will LRU do?
  – Same decisions as MIN here, but won’t always be true!
When will LRU perform badly?
• Consider the following reference stream: A B C D A B C D A B C D
• With 3 frames, LRU performs as follows (same as FIFO here):

  Ref:     A B C D A B C D A B C D
  Frame 1: A → D → C → B
  Frame 2: B → A → D → C
  Frame 3: C → B → A → D

• Every reference is a page fault!

Graph of Page Faults Versus The Number of Frames
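Simulating LRU on this stream confirms the pathology: with 3 frames and the cyclic 4-page stream, every one of the 12 references faults. A small sketch (`lru_faults` is an invented helper):

```python
# Count page faults for an LRU-managed memory of num_frames frames.
def lru_faults(refs, num_frames):
    frames = []                 # most-recently-used page kept at the end
    faults = 0
    for page in refs:
        if page in frames:
            frames.remove(page)  # hit: just refresh recency
        else:
            faults += 1
            if len(frames) == num_frames:
                frames.pop(0)    # evict least recently used (front)
        frames.append(page)
    return faults

# A B C D A B C D A B C D with 3 frames -> 12 faults (every reference misses)
```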
Review: Clock Algorithm: Not Recently Used
(Figure: a single clock hand sweeps the ring of all pages in memory; it advances only on page fault, checking for pages not used recently and marking pages as not used recently.)
• Clock Algorithm: pages arranged in a ring
  – Hardware “use” bit per physical page:
    » Hardware sets use bit on each reference
    » If use bit isn’t set, means not referenced in a long time
    » Nachos hardware sets use bit in the TLB; you have to copy this back to page table when TLB entry gets replaced
  – On page fault:
    » Advance clock hand (not real time)
    » Check use bit: 1 ⇒ used recently; clear and leave alone. 0 ⇒ selected candidate for replacement

Review: Nth Chance version of Clock Algorithm
• Nth chance algorithm: Give page N chances
  – OS keeps counter per page: # sweeps
  – On page fault, OS checks use bit:
    » 1 ⇒ clear use and also clear counter (used in last sweep)
    » 0 ⇒ increment counter; if count=N, replace page
  – Means that clock hand has to sweep by N times without page being used before page is replaced
• How do we pick N?
  – Why pick large N? Better approximation to LRU
    » If N ~ 1K, really good approximation
  – Why pick small N? More efficient
    » Otherwise might have to look a long way to find free page
• What about dirty pages?
  – Takes extra overhead to replace a dirty page, so give dirty pages an extra chance before replacing?
  – Common approach:
    » Clean pages, use N=1
    » Dirty pages, use N=2 (and write back to disk when N=1)
Goals for Today
• Finish Page Allocation Policies
• Working Set/Thrashing
• I/O Systems
  – Hardware Access
  – Device Drivers
Note: Some slides and/or pictures in the following are
adapted from slides ©2005 Silberschatz, Galvin, and Gagne.
Many slides generated from my lecture notes by Kubiatowicz.

Second-Chance List Algorithm (VAX/VMS)
[Figure: Directly Mapped Pages (Marked: RW, access list: FIFO) overflow
into a Second Chance List (Marked: Invalid, list: LRU); new pages come
in from disk to the Active list; LRU victims leave the SC list]
• Split memory in two: Active list (RW), SC list (Invalid)
• Access pages in Active list at full speed
• Otherwise, Page Fault
  – Always move overflow page from end of Active list to
    front of Second-chance list (SC) and mark invalid
  – Desired Page On SC List: move to front of Active list, mark RW
  – Not on SC list: page in to front of Active list, mark RW;
    page out LRU victim at end of SC list
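The fault rules above can be sketched with two deques. This is an illustrative model only; the list sizes and access trace are invented, and the "soft"/"hard" fault labels are my naming, not VMS terminology.

```python
# Sketch of the VAX/VMS second-chance list scheme described above.
from collections import deque

active = deque()              # pages mapped RW, FIFO order (left = front)
sc = deque()                  # pages marked invalid, LRU order
ACTIVE_MAX, SC_MAX = 2, 2     # invented sizes for illustration

def overflow_active():
    """Move the page at the end of the active list to the front of SC."""
    if len(active) > ACTIVE_MAX:
        sc.appendleft(active.pop())      # mark invalid, front of SC
        if len(sc) > SC_MAX:
            sc.pop()                     # page out the LRU victim

def access(page):
    if page in active:
        return "full speed"              # RW mapping: no fault
    if page in sc:                       # fault, but page still in memory
        sc.remove(page)
        active.appendleft(page)          # move to front of active, mark RW
        overflow_active()
        return "soft fault"
    active.appendleft(page)              # page in from disk
    overflow_active()
    return "hard fault"

kinds = [access(p) for p in [1, 2, 3, 4, 2]]
print(kinds)   # page 2 is rescued from the SC list without a disk access
```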
How does the processor actually talk to the device?
[Figure: CPU on the Processor Memory Bus with Regular Memory; a Bus
Adaptor bridges to other buses holding a Device Controller and a
Graphics Adaptor; an Interrupt Controller delivers Interrupt Requests;
the controller exposes read/write/control/status Registers (port 0x20),
Addressable Memory and/or Queues, and a Memory Mapped Region at
0x8f008020]
• CPU interacts with a Controller
  – Contains a set of registers that can be read and written
  – May contain memory for request queues or bit-mapped images
• Regardless of the complexity of the connections and buses,
  processor accesses registers in two ways:
  – I/O instructions: in/out instructions
    » Example from the Intel architecture: out 0x21,AL
  – Memory mapped I/O: load/store instructions
    » Registers/memory appear in physical address space
    » I/O accomplished with load and store instructions

Example: Memory-Mapped Display Controller
• Memory-Mapped:
  – Hardware maps control registers and display memory into
    physical address space
    » Addresses set by hardware jumpers or programming at boot time
  – Simply writing to display memory (also called the “frame
    buffer”) changes image on screen
    » Addr: 0x8000F000—0x8000FFFF
  – Writing graphics description to command-queue area
    » Say enter a set of triangles that describe some scene
    » Addr: 0x80010000—0x8001FFFF
  – Writing to the command register may cause on-board graphics
    hardware to do something
    » Say render the above scene
    » Addr: 0x0007F004
• Can protect with page tables
[Figure: physical address space with the Graphics Command Queue below
0x80020000, Display Memory below 0x80010000, and Command (0x0007F004)
and Status (0x0007F000) registers]
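The memory-mapped idea — a plain store changes device state — can be mimicked at user level with Python's `mmap`. This is only an analogy: the mapping below is anonymous memory standing in for a frame buffer, and the hardware addresses on the slide are not used.

```python
# User-level analogy for memory-mapped I/O: stores into a mapped
# region "change the image" (here, a fake 4KB frame buffer).
import mmap

FRAME_BUFFER_SIZE = 0x1000             # pretend 4KB of display memory
fb = mmap.mmap(-1, FRAME_BUFFER_SIZE)  # anonymous mapping, not a real device

def set_pixel(offset, value):
    fb[offset] = value                 # an ordinary store, no I/O instruction

set_pixel(0, 0xFF)
val = fb[0]
print(val)                             # 255
fb.close()
```

On real hardware the kernel would map the device's physical range (e.g. via /dev/mem or an MMIO driver) instead of anonymous memory.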
Transferring Data To/From Controller
• Programmed I/O:
  – Each byte transferred via processor in/out or load/store
  – Pro: Simple hardware, easy to program
  – Con: Consumes processor cycles proportional to data size
• Direct Memory Access:
  – Give controller access to memory bus
  – Ask it to transfer data to/from memory directly
• Sample interaction with DMA controller (from book)

Summary
• Second-Chance List algorithm: Yet another approximation of LRU
  – Divide pages into two groups, one of which is truly LRU
    and managed on page faults.
• Working Set:
  – Set of pages touched by a process recently
• Thrashing: a process is busy swapping pages in and out
  – Process will thrash if working set doesn’t fit in memory
  – Need to swap out a process
• I/O Device Types:
  – Many different speeds (0.1 bytes/sec to GBytes/sec)
  – Different Access Patterns:
    » Block Devices, Character Devices, Network Devices
  – Different Access Timing:
    » Blocking, Non-blocking, Asynchronous
• I/O Controllers: Hardware that controls actual device
  – Processor accesses through I/O instructions, load/store
    to special physical memory
  – Report their results through either interrupts or a status
    register that processor looks at occasionally (polling)
CS162
Operating Systems and Systems Programming
Lecture 17

Disk Management and File Systems

November 1, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

Review: Want Standard Interfaces to Devices
• Block Devices: e.g. disk drives, tape drives, CD-ROM
  – Access blocks of data
  – Commands include open(), read(), write(), seek()
  – Raw I/O or file-system access
  – Memory-mapped file access possible
• Character Devices: e.g. keyboards, mice, serial ports,
  some USB devices
  – Single characters at a time
  – Commands include get(), put()
  – Libraries layered on top allow line editing
• Network Devices: e.g. Ethernet, Wireless, Bluetooth
  – Different enough from block/character to have own interface
  – Unix and Windows include socket interface
    » Separates network protocol from network operation
    » Includes select() functionality
  – Usage: pipes, FIFOs, streams, queues, mailboxes
11/1/10 Kubiatowicz CS162 ©UCB Fall 2010 Lec 17.2
Review: How Does User Deal with Timing?
• Blocking Interface: “Wait”
  – When request data (e.g. read() system call), put process
    to sleep until data is ready
  – When write data (e.g. write() system call), put process
    to sleep until device is ready for data
• Non-blocking Interface: “Don’t Wait”
  – Returns quickly from read or write request with count of
    bytes successfully transferred
  – Read may return nothing, write may write nothing
• Asynchronous Interface: “Tell Me Later”
  – When request data, take pointer to user’s buffer, return
    immediately; later kernel fills buffer and notifies user
  – When send data, take pointer to user’s buffer, return
    immediately; later kernel takes data and notifies user

Goals for Today
• Finish Discussing I/O Systems
  – Hardware Access
  – Device Drivers
• Disk Performance
  – Hardware performance parameters
  – Queuing Theory
• File Systems
  – Structure, Naming, Directories, and Caching
Note: Some slides and/or pictures in the following are
adapted from slides ©2005 Silberschatz, Galvin, and Gagne.
Many slides generated from my lecture notes by Kubiatowicz.
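The blocking vs. non-blocking distinction above can be seen directly with a POSIX pipe (this sketch assumes a Unix-like system with Python 3.5+):

```python
# A non-blocking read "returns quickly ... may return nothing":
import os

r, w = os.pipe()
os.set_blocking(r, False)        # switch the read end to "don't wait"

try:
    data = os.read(r, 100)       # no data yet: returns immediately...
except BlockingIOError:
    data = b""                   # ...signalling "nothing transferred"

os.write(w, b"hello")
got = os.read(r, 100)            # data now ready, so the read succeeds
print(data, got)                 # b'' b'hello'
os.close(r); os.close(w)
```

With blocking left on (the default), the first `os.read` would instead put the process to sleep until data arrived.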
Main components of Intel Chipset: Pentium 4
[Figure: Pentium 4 chipset block diagram]

How does the processor talk to the device?
[Figure: processor memory bus with regular memory, as in Lecture 16]
[Figure: disk with platters, tracks, and controller; a user thread
issues requests through OS paths and a disk queue]
• Properties
  – Independently addressable element: sector
    » OS always transfers groups of sectors together—“blocks”
  – A disk can access directly any given block of information
    it contains (random access). Can access any file either
    sequentially or randomly.
  – A disk can be rewritten in place: it is possible to
    read/modify/write a block from the disk
• Typical numbers (depending on the disk size):
  – 500 to more than 20,000 tracks per surface
  – 32 to 800 sectors per track
    » A sector is the smallest unit that can be read or written
• Zoned bit recording
  – Constant bit density: more sectors on outer tracks
  – Speed varies with track location

Performance of disk drive/file system
[Figure: Response Time = Queue + Disk Service Time, rising sharply as
throughput/utilization (% of total BW) approaches 100%]
• Metrics: Response Time, Throughput
• Contributing factors to latency:
  » Software paths (can be loosely modeled by a queue)
  » Hardware controller
  » Physical disk media
• Queuing behavior:
  – Can lead to big increases of latency as utilization
    approaches 100%
Magnetic Disk Characteristic
[Figure: platter with track, sector, head, and cylinder]
• Cylinder: all the tracks under the head at a given point
  on all surfaces
• Read/write data is a three-stage process:
  – Seek time: position the head/arm over the proper track
    (into proper cylinder)
  – Rotational latency: wait for the desired sector
    to rotate under the read/write head
  – Transfer time: transfer a block of bits (sector)
    under the read-write head
• Disk Latency = Queueing Time + Controller time +
  Seek Time + Rotation Time + Xfer Time
  [Figure: request flows through software queue (device driver),
  controller hardware, then media access time (Seek+Rot+Xfer)]
• Highest Bandwidth:
  – Transfer large group of blocks sequentially from one track

Typical Numbers of a Magnetic Disk
• Average seek time as reported by the industry:
  – Typically in the range of 8 ms to 12 ms
  – Due to locality of disk reference, may only be 25% to 33%
    of the advertised number
• Rotational Latency:
  – Most disks rotate at 3,600 to 7,200 RPM (up to 15,000 RPM
    or more)
  – Approximately 16 ms to 8 ms per revolution, respectively
  – An average latency to the desired information is halfway
    around the disk: 8 ms at 3600 RPM, 4 ms at 7200 RPM
• Transfer Time is a function of:
  – Transfer size (usually a sector): 512B – 1KB per sector
  – Rotation speed: 3600 RPM to 15000 RPM
  – Diameter: ranges from 1 in to 5.25 in
  – Typical values: 2 to 50 MB per second
• Controller time depends on controller hardware
• Cost drops by factor of two per year (since 1991)
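The latency formula above can be worked through with representative numbers. The specific values here (7200 RPM, 8 ms seek, 50 MB/s, 512B sector) are illustrative picks from the ranges on the slide, and queueing/controller time is ignored:

```python
# Disk Latency = Queueing + Controller + Seek + Rotation + Transfer.
rpm = 7200
seek_ms = 8.0                          # average seek
rotation_ms = 0.5 * (60_000 / rpm)     # half a revolution on average
sector_bytes = 512
bandwidth_bytes_per_s = 50e6
xfer_ms = sector_bytes / bandwidth_bytes_per_s * 1000

latency_ms = seek_ms + rotation_ms + xfer_ms  # queue/controller ignored
print(f"rotation={rotation_ms:.2f} ms, total={latency_ms:.2f} ms")
# rotation=4.17 ms, total=12.18 ms
```

Note that the sector transfer itself (~0.01 ms) is negligible next to seek and rotation, which is why sequential transfers from one track give the highest bandwidth.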
Review: Disk Latency
[Figure: request flows through software queue (device driver),
controller hardware, then media access time (Seek+Rot+Xfer)]
– Seek time: position the head/arm over the proper track
  (into proper cylinder)
– Rotational latency: wait for the desired sector
  to rotate under the read/write head
– Transfer time: transfer a block of bits (sector)
  under the read-write head
• Disk Latency = Queueing Time + Controller time +
  Seek Time + Rotation Time + Xfer Time
• Highest Bandwidth:
  – Transfer large group of blocks sequentially from one track

Introduction to Queuing Theory
[Figure: arrivals → controller queue → disk → departures
(the queuing system)]
• Server spends variable time with customers
  – Mean (Average): m1 = Σ p(T)×T
  – Variance: σ² = Σ p(T)×(T-m1)² = Σ p(T)×T² - m1²
• What about queuing time??
  – Let’s apply some queuing theory
  – Queuing Theory applies to long term, steady state
    behavior ⇒ Arrival rate = Departure rate
• Little’s Law:
  Mean # tasks in system = arrival rate × mean response time
  – Simple interpretation: you should see the same number of
    tasks in queue when entering as when leaving.
• Applies to any system in equilibrium, as long as nothing
  in black box is creating or destroying tasks
  – Typical queuing theory doesn’t deal with transient
    behavior, only steady-state behavior
11/03/10 Kubiatowicz CS162 ©UCB Fall 2010 Lec 18.3 11/03/10 Kubiatowicz CS162 ©UCB Fall 2010 Lec 18.4
Goals for Today
• Queuing Theory: Continued
• File Systems
  – Structure, Naming, Directories
Note: Some slides and/or pictures in the following are
adapted from slides ©2005 Silberschatz, Galvin, and Gagne.
Many slides generated from my lecture notes by Kubiatowicz.

Background: Use of random distributions
[Figure: distribution of service times around mean m1]
• Server spends variable time with customers
  – Mean (Average): m1 = Σ p(T)×T
  – Variance: σ² = Σ p(T)×(T-m1)² = Σ p(T)×T² - m1²
  – Squared coefficient of variance: C = σ²/m1²
    Aggregate description of the distribution.
• Important values of C:
  – No variance or deterministic ⇒ C=0
  – “Memoryless” or exponential ⇒ C=1
    » Past tells nothing about future
    » Many complex systems (or aggregates)
      well described as memoryless
  – Disk response times ⇒ C ≈ 1.5 (majority of seeks < avg)
• Mean Residual Wait Time, m1(z):
  – Mean time must wait for server to complete current task
  – Can derive: m1(z) = ½m1×(1 + C)
    » Not just ½m1 because that doesn’t capture variance
  – C = 0 ⇒ m1(z) = ½m1;  C = 1 ⇒ m1(z) = m1
A Little Queuing Theory: Mean Wait Time
[Figure: arrival rate λ → queue → server with service rate μ=1/Tser]
• Parameters that describe our system:
  – λ: mean number of arriving customers/second
  – Tser: mean time to service a customer (“m1”)
  – C: squared coefficient of variance = σ²/m1²
  – μ: service rate = 1/Tser
  – u: server utilization (0≤u≤1): u = λ/μ = λ×Tser
• Parameters we wish to compute:
  – Tq: Time spent in queue
  – Lq: Length of queue = λ×Tq (by Little’s law)
• Basic Approach:
  – Customers before us must finish; mean time = Lq×Tser
  – If something at server, takes m1(z) to complete on avg
    » m1(z): mean residual wait time at server = Tser×½(1+C)
    » Chance something at server = u ⇒ mean time is u×m1(z)
• Computation of wait time in queue (Tq):
  – Tq = Lq×Tser + u×m1(z)

A Little Queuing Theory: M/G/1 and M/M/1
• Computation of wait time in queue (Tq):
  Tq = Lq×Tser + u×m1(z)
  Tq = λ×Tq×Tser + u×m1(z)      (Little’s Law: Lq = λ×Tq)
  Tq = u×Tq + u×m1(z)           (Defn of utilization: u = λ×Tser)
  Tq×(1 – u) = m1(z)×u  ⇒  Tq = m1(z)×u/(1 – u)
  Tq = Tser×½(1+C)×u/(1 – u)
• Notice that as u→1, Tq→∞!
• Assumptions so far:
  – System in equilibrium; No limit to the queue: works
    First-In-First-Out
  – Time between two successive arrivals in line are random
    and memoryless (M for C=1, exponentially random)
  – Server can start on next customer immediately after
    prior finishes
• General service distribution (no restrictions), 1 server:
  – Called M/G/1 queue: Tq = Tser×½(1+C)×u/(1 – u)
• Memoryless service distribution (C = 1):
  – Called M/M/1 queue: Tq = Tser×u/(1 – u)
A Little Queuing Theory: An Example
• Example Usage Statistics:
  – User requests 10 × 8KB disk I/Os per second
  – Requests & service exponentially distributed (C=1.0)
  – Avg. service = 20 ms (controller + seek + rot + xfer time)
• Questions:
  – How utilized is the disk?
    » Ans: server utilization, u = λ×Tser
  – What is the average time spent in the queue?
    » Ans: Tq
  – What is the number of requests in the queue?
    » Ans: Lq = λ×Tq
  – What is the avg response time for disk request?
    » Ans: Tsys = Tq + Tser (Wait in queue, then get served)
• Computation:
  λ (avg # arriving customers/s) = 10/s
  Tser (avg time to service customer) = 20 ms (0.02s)
  u (server utilization) = λ×Tser = 10/s × 0.02s = 0.2
  Tq (avg time/customer in queue) = Tser × u/(1 – u)
     = 20 × 0.2/(1 – 0.2) = 20 × 0.25 = 5 ms (0.005s)
  Lq (avg length of queue) = λ×Tq = 10/s × 0.005s = 0.05
  Tsys (avg time/customer in system) = Tq + Tser = 25 ms

Queuing Theory Resources
• Handouts page contains Queueing Theory Resources:
  – Scanned pages from Patterson and Hennessy book that
    gives further discussion and simple proof for general eq.
  – A complete website full of resources
• Midterms with queueing theory questions:
  – Midterm IIs from previous years that I’ve taught
• Assume that Queueing theory is fair game for the final!
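The example above computes directly (C = 1, so the ½(1+C) factor is 1 and the M/M/1 formula applies):

```python
# The M/M/1 disk example above, in seconds.
lam = 10.0          # arrival rate λ, requests/second
t_ser = 0.020       # mean service time Tser

u = lam * t_ser                 # utilization
t_q = t_ser * u / (1 - u)       # mean time in queue (M/M/1)
l_q = lam * t_q                 # mean queue length (Little's Law)
t_sys = t_q + t_ser             # mean response time

print(u, t_q, l_q, t_sys)       # u=0.2, Tq≈5 ms, Lq≈0.05, Tsys≈25 ms
```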
Disk Scheduling
[Figure: disk head at track 3, sector 10; request queue holds
(track, sector) requests 2,2  5,2  7,2  2,1  2,3; the SSTF diagram
numbers the service order of the requests]
• FIFO Order
  – Fair among requesters, but order of arrival may be to
    random spots on the disk ⇒ very long seeks
• SSTF: Shortest seek time first
  – Pick the request that’s closest on the disk
  – Although called SSTF, today must include
    rotational delay in calculation, since
    rotation can be as long as seek
  – Con: SSTF good at reducing seeks, but
    may lead to starvation
• SCAN: Implements an Elevator Algorithm: take the
  closest request in the direction of travel
  – No starvation, but retains flavor of SSTF
• C-SCAN: Circular-Scan: only goes in one direction
  – Skips any requests on the way back
  – Fairer than SCAN, not biased towards pages in middle
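The elevator (SCAN) policy above can be sketched on track numbers alone; this ignores the sector numbers the slide's requests carry (which SSTF would additionally weigh via rotational delay):

```python
# SCAN: serve requests in the current direction of travel, then reverse.
def scan(head, tracks, direction=+1):
    pending = sorted(tracks)
    ahead  = [t for t in pending if (t - head) * direction >= 0]
    behind = [t for t in pending if (t - head) * direction < 0]
    if direction > 0:
        return ahead + behind[::-1]      # sweep up, then back down
    return ahead[::-1] + behind          # sweep down, then back up

# Tracks from the slide's request queue (2, 5, 7, 2, 2), head at track 3:
order = scan(3, [2, 5, 7, 2, 2], direction=+1)
print(order)   # [5, 7, 2, 2, 2]: upward requests first, then the rest
```

C-SCAN would instead return to the lowest track after the upward sweep and serve the remaining requests going up again.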
Building a File System
• File System: Layer of OS that transforms block
  interface of disks (or other block devices) into Files,
  Directories, etc.
• File System Components
  – Disk Management: collecting disk blocks into files
  – Naming: Interface to find files by name, not by blocks
  – Protection: Layers to keep data secure
  – Reliability/Durability: Keeping files durable despite
    crashes, media failures, attacks, etc.
• User vs. System View of a File
  – User’s view:
    » Durable Data Structures
  – System’s view (system call interface):
    » Collection of Bytes (UNIX)
    » Doesn’t matter to system what kind of data structures you
      want to store on disk!
  – System’s view (inside OS):
    » Collection of blocks (a block is a logical transfer unit,
      while a sector is the physical transfer unit)
    » Block size ≥ sector size; in UNIX, block size is 4KB

Translating from User to System View
[Figure: user byte-level requests translated by the File System
into block-level disk accesses]
• What happens if user says: give me bytes 2—12?
  – Fetch block corresponding to those bytes
  – Return just the correct portion of the block
• What about: write bytes 2—12?
  – Fetch block
  – Modify portion
  – Write out block
• Everything inside File System is in whole-size blocks
  – For example, getc(), putc() ⇒ buffers something like
    4096 bytes, even if interface is one byte at a time
• From now on, file is a collection of blocks
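The byte-to-block translation above can be sketched directly; the fake one-block "disk" is invented for illustration:

```python
# Mapping a user byte range onto 4KB blocks, then slicing out the
# requested portion — the "give me bytes 2—12" case from the slide.
BLOCK_SIZE = 4096

def byte_range_to_blocks(start, end):
    """Block numbers covering bytes start..end (inclusive)."""
    return list(range(start // BLOCK_SIZE, end // BLOCK_SIZE + 1))

def read_bytes(disk_blocks, start, end):
    """Fetch whole blocks, return just the requested bytes."""
    data = b"".join(disk_blocks[b] for b in byte_range_to_blocks(start, end))
    base = (start // BLOCK_SIZE) * BLOCK_SIZE
    return data[start - base : end - base + 1]

disk = {0: bytes(range(256)) * 16}        # one fake 4KB block
print(read_bytes(disk, 2, 12))            # bytes 2..12 of block 0
```

A write of bytes 2—12 would do the same fetch, patch the slice in memory, and write the whole block back out.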
Linked List Allocation
[Figure: file header pointing to a chain of linked blocks ending in
Null; the free list is kept the same way]
• Linked allocation: each block points to the next
  – Pros: Can grow files dynamically; Free list same as a file
  – Cons: Bad Sequential Access (seek between each block),
    Unreliable (lose block, lose rest of file)
  – Serious Con: Bad random access!!!!
  – Technique originally from Alto (first personal computer,
    built at Xerox)
    » No attempt to allocate contiguous blocks
• MSDOS links pages together to create a file
  – Links not in pages, but in the File Allocation Table (FAT)
    » FAT contains an entry for each block on the disk
    » FAT entries corresponding to blocks of a file linked together
  – Access properties:
    » Sequential access expensive unless FAT cached in memory
    » Random access expensive always, but really expensive if
      FAT not cached in memory
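The FAT idea above — links stored in a table indexed by block number, not in the blocks themselves — can be sketched as follows. The block numbers are invented for illustration:

```python
# FAT-style linked allocation: the chain lives in the table.
EOF = -1
fat = {4: 7, 7: 2, 2: EOF}     # a file occupying blocks 4 -> 7 -> 2

def file_blocks(start):
    """Follow FAT entries from the file's starting block to EOF."""
    blocks, b = [], start
    while b != EOF:
        blocks.append(b)
        b = fat[b]             # one table lookup per block
    return blocks

print(file_blocks(4))          # [4, 7, 2]
```

Random access to the Nth block still walks N links, which is why it is cheap only when the whole FAT is cached in memory.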
Indexed Allocation: Multilevel Indexed Files (UNIX 4.1)
• Multilevel Indexed Files: Like multilevel address
  translation (from UNIX 4.1 BSD)
  – Key idea: efficient for small files, but still allow big files

Multilevel Indexed Files (UNIX 4.1): Discussion
• Basic technique places an upper limit on file size that
  is approximately 16 Gbytes
  – Designers thought this was bigger than anything anyone
    would need. Much bigger than a disk at the time…
  – Fallacy: today, EOS producing 2TB of data per day
• Pointers get filled in dynamically: need to allocate
  indirect block only when file grows > 10 blocks
  – On small files, no indirection needed

Example of Multilevel Indexed Files
• Sample file in multilevel indexed format:
  – How many accesses for block #23? (assume file
    header accessed on open)
    » Two: One for indirect block, one for data
  – How about block #5?
    » One: One for data
  – Block #340?
    » Three: double indirect block, indirect block, and data
• UNIX 4.1 Pros and cons
  – Pros: Simple (more or less)
    Files can easily expand (up to a point)
    Small files particularly cheap and easy
  – Cons: Lots of seeks
    Very large files must read many indirect blocks (four
    I/Os per block!)
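The access counts worked above follow from the inode layout (10 direct pointers, then 256-entry indirect blocks, per the Lecture 19 review), assuming the inode itself is already in memory after open:

```python
# Disk reads needed to fetch logical block n of an open UNIX 4.1 file.
DIRECT, PER_INDIRECT = 10, 256

def accesses_for_block(n):
    if n < DIRECT:
        return 1                                   # data block only
    if n < DIRECT + PER_INDIRECT:
        return 2                                   # indirect + data
    if n < DIRECT + PER_INDIRECT + PER_INDIRECT**2:
        return 3                                   # double, indirect, data
    return 4                                       # triple-indirect chain

print(accesses_for_block(5), accesses_for_block(23), accesses_for_block(340))
# 1 2 3, matching the slide's answers
```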
File Allocation for Cray-1 DEMOS
[Figure: file header with (disk group, base, size) entries 1,3,2 …
1,3,9 — basic segmentation structure, each segment contiguous on disk;
the large-file version replaces entries with indirect block groups]
• DEMOS: File system structure similar to segmentation
  – Idea: reduce disk seeks by
    » using contiguous allocation in normal case
    » but allow flexibility to have non-contiguous allocation
  – Cray-1 had 12ns cycle time, so CPU:disk speed ratio about
    the same as today (a few million instructions per seek)
• Header: table of base & size (10 “block group” pointers)
  – Each block chunk is a contiguous group of disk blocks
  – Sequential reads within a block chunk can proceed at high
    speed – similar to contiguous allocation
• How do you find an available block group?
  – Use freelist bitmap to find block of 0’s.

Large File Version of DEMOS
• What if need much bigger files?
  – If need more than 10 groups, set flag in header: BIGFILE
    » Each table entry now points to an indirect block group
  – Suppose 1000 blocks in a block group ⇒ 80GB max file
    » Assuming 8KB blocks, 8-byte entries:
      (10 ptrs × 1024 groups/ptr × 1000 blocks/group) × 8K = 80GB
• Discussion of DEMOS scheme
  – Pros: Fast sequential access, Free areas merge simply
    Easy to find free block groups (when disk not full)
  – Cons: Disk full ⇒ No long runs of blocks (fragmentation),
    so high overhead allocation/access
  – Full disk ⇒ worst of 4.1BSD (lots of seeks) with worst of
    contiguous allocation (lots of recompaction needed)
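The BIGFILE capacity arithmetic above checks out directly (the slide's "80GB" rounds the exact product):

```python
# DEMOS BIGFILE maximum size: 10 pointers, each to an indirect block
# group of 1024 entries, each naming a 1000-block group of 8KB blocks.
ptrs, groups_per_ptr, blocks_per_group, block_bytes = 10, 1024, 1000, 8 * 1024
max_file = ptrs * groups_per_ptr * blocks_per_group * block_bytes
print(max_file / 1e9)   # ~83.9 decimal GB, the slide's "80GB" rounded
```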
• Files named by ordered set (e.g., /programs/p/list)
• How are directories modified?
  – Originally, direct read/write of special file
  – System calls for manipulation: mkdir, rmdir
  – Ties to file creation/destruction
    » On creating a file by name, new inode grabbed and
      associated with new file in particular directory
Directory Structure
[Figure: directory tree]
• Not really a hierarchy!
  – Many systems allow directory structure to be organized
    as an acyclic graph or even a (potentially) cyclic graph
  – Hard Links: different names for the same file
    » Multiple directory entries point at the same file
  – Soft Links: “shortcut” pointers to other files
    » Implemented by storing the logical name of actual file
• Name Resolution: The process of converting a logical
  name into a physical resource (like a file)
  – Traverse succession of directories until reach target file
  – Global file system: May be spread across the network

Directory Structure (Con’t)
• How many disk accesses to resolve “/my/book/count”?
  – Read in file header for root (fixed spot on disk)
  – Read in first data block for root
    » Table of file name/index pairs. Search linearly – ok since
      directories typically very small
  – Read in file header for “my”
  – Read in first data block for “my”; search for “book”
  – Read in file header for “book”
  – Read in first data block for “book”; search for “count”
  – Read in file header for “count”
• Current working directory: Per-address-space pointer
  to a directory (inode) used for resolving file names
  – Allows user to specify relative filename instead of
    absolute path (say CWD=“/my/book” can resolve “count”)
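The walk above costs two reads (header plus first data block) per directory traversed, plus one final header read. A small sketch, with invented directory contents:

```python
# Counting disk accesses to resolve an absolute path, as on the slide.
dirs = {"/": ["my"], "/my": ["book"], "/my/book": ["count"]}

def resolve(path):
    accesses, parts = 0, path.strip("/").split("/")
    cur = "/"
    for name in parts[:-1]:
        accesses += 2                  # directory header + first data block
        assert name in dirs[cur]       # linear search of name/index table
        cur = cur.rstrip("/") + "/" + name
    accesses += 2                      # parent directory of the target
    accesses += 1                      # header for the target file itself
    return accesses

print(resolve("/my/book/count"))       # 7, matching the slide's walk
```

This is why a per-process current working directory pays off: resolving the relative name "count" from CWD=/my/book skips the first two directories entirely.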
Where are inodes stored?
• In early UNIX and DOS/Windows’ FAT file system, headers
  stored in special array in outermost cylinders
  – Header not stored anywhere near the data blocks.
    To read a small file, seek to get header, seek
    back to data.
  – Fixed size, set when disk is formatted. At formatting
    time, a fixed number of inodes were created (they were
    each given a unique number, called an “inumber”)
• Later versions of UNIX moved the header information
  to be closer to the data blocks
  – Often, inode for file stored in same “cylinder group”
    as parent directory of the file (makes an ls of that
    directory run fast).
  – Pros:
    » Reliability: whatever happens to the disk, you can find
      all of the files (even if directories might be disconnected)
    » UNIX BSD 4.2 puts a portion of the file header array
      on each cylinder. For small directories, can fit all
      data, file headers, etc. in same cylinder ⇒ no seeks!
    » File headers much smaller than whole block (a few
      hundred bytes), so multiple headers fetched from disk
      at same time
Summary
• Queuing Latency:
  – M/M/1 and M/G/1 queues: simplest to analyze
  – As utilization approaches 100%, latency → ∞:
    Tq = Tser × ½(1+C) × u/(1 – u)
• File System:
  – Transforms blocks into Files and Directories
  – Optimize for access and usage patterns
  – Maximize sequential access, allow efficient random access
• File (and directory) defined by header
  – Called “inode” with index called “inumber”
• Multilevel Indexed Scheme
  – Inode contains file info, direct pointers to blocks,
    indirect blocks, doubly indirect, etc.
• DEMOS:
  – CRAY-1 scheme like segmentation
  – Emphasized contiguous allocation of blocks, but allowed
    use of non-contiguous allocation when necessary
• Naming: the process of turning user-visible names into
  resources (such as files)
CS162
Operating Systems and Systems Programming
Lecture 19

File Systems continued
Distributed Systems

November 8, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

Review: A Little Queuing Theory: Some Results
• Assumptions:
  – System in equilibrium; No limit to the queue
  – Time between successive arrivals is random and memoryless
[Figure: arrival rate λ → queue → server with service rate μ=1/Tser]
• Parameters that describe our system:
  – λ: mean number of arriving customers/second
  – Tser: mean time to service a customer (“m1”)
  – C: squared coefficient of variance = σ²/m1²
  – μ: service rate = 1/Tser
  – u: server utilization (0≤u≤1): u = λ/μ = λ×Tser
• Parameters we wish to compute:
  – Tq: Time spent in queue
  – Lq: Length of queue = λ×Tq (by Little’s law)
• Results:
  – Memoryless service distribution (C = 1):
    » Called M/M/1 queue: Tq = Tser × u/(1 – u)
  – General service distribution (no restrictions), 1 server:
    » Called M/G/1 queue: Tq = Tser × ½(1+C) × u/(1 – u)
11/08/10 Kubiatowicz CS162 ©UCB Fall 2010 Lec 19.2
[Figure: review of disk scheduling — head with request queue
2,2  5,2  7,2  2,1  2,3]
Multilevel Indexed Files (UNIX BSD 4.1)
• Multilevel Indexed Files: Like multilevel address
  translation (from UNIX 4.1 BSD)
  – Key idea: efficient for small files, but still allow big files
  – File header contains 13 pointers
    » Fixed size table, pointers not all equivalent
    » This header is called an “inode” in UNIX
  – File Header format:
    » First 10 pointers are to data blocks
    » Block 11 points to “indirect block” containing 256 blocks
    » Block 12 points to “doubly indirect block” containing 256
      indirect blocks for total of 64K blocks
    » Block 13 points to a triply indirect block (16M blocks)
• Discussion
  – Basic technique places an upper limit on file size that is
    approximately 16 Gbytes
    » Designers thought this was bigger than anything anyone
      would need. Much bigger than a disk at the time…
    » Fallacy: today, EOS producing 2TB of data per day
  – Pointers get filled in dynamically: need to allocate
    indirect block only when file grows > 10 blocks.
    » On small files, no indirection needed

Example of Multilevel Indexed Files
• Sample file in multilevel indexed format:
  – How many accesses for block #23? (assume file
    header accessed on open)
    » Two: One for indirect block, one for data
  – How about block #5?
    » One: One for data
  – Block #340?
    » Three: double indirect block, indirect block, and data
• UNIX 4.1 Pros and cons
  – Pros: Simple (more or less)
    Files can easily expand (up to a point)
    Small files particularly cheap and easy
  – Cons: Lots of seeks
    Very large files must read many indirect blocks (four
    I/Os per block!)
Administrivia

File Allocation for Cray-1 DEMOS
[Figure: file header with (disk group, base, size) entries 1,3,2 …
1,3,9 — basic segmentation structure, each segment contiguous on disk]
• DEMOS: File system structure similar to segmentation
  – Idea: reduce disk seeks by
    » using contiguous allocation in normal case
    » but allow flexibility to have non-contiguous allocation
  – Cray-1 had 12ns cycle time, so CPU:disk speed ratio about
    the same as today (a few million instructions per seek)
• Header: table of base & size (10 “block group” pointers)
  – Each block chunk is a contiguous group of disk blocks
  – Sequential reads within a block chunk can proceed at high
    speed – similar to contiguous allocation
• How do you find an available block group?
  – Use freelist bitmap to find block of 0’s.
Where are inodes stored?
• Later versions of UNIX moved the header information
  to be closer to the data blocks
  – Often, inode for file stored in same “cylinder group” as
    parent directory of the file (makes an ls of that
    directory run fast).
  – Pros:
    » UNIX BSD 4.2 puts a portion of the file header array
      on each cylinder. For small directories, can fit all
      data, file headers, etc. in same cylinder ⇒ no seeks!
    » File headers much smaller than whole block (a few
      hundred bytes), so multiple headers fetched from disk
      at same time
    » Reliability: whatever happens to the disk, you can
      find many of the files (even if directories disconnected)
  – Part of the Fast File System (FFS)
    » General optimization to avoid seeks

In-Memory File System Structures
[Figure: per-process open-file table and system-wide table pointing
to in-memory file control blocks (inodes)]
• Open system call:
  – Resolves file name, finds file control block (inode)
  – Makes entries in per-process and system-wide tables
  – Returns index (called “file handle”) in open-file table
• Read/write system calls:
  – Use file handle to locate inode
  – Perform appropriate reads or writes
How to make file system durable?
• Disk blocks contain Reed-Solomon error correcting
  codes (ECC) to deal with small defects in disk drive
  – Can allow recovery of data from small media defects
• Make sure writes survive in short term
  – Either abandon delayed writes or
  – use special, battery-backed RAM (called non-volatile RAM
    or NVRAM) for dirty blocks in buffer cache.
• Make sure that data survives in long term
  – Need to replicate! More than one copy of data!
  – Important element: independence of failure
    » Could put copies on one disk, but if disk head fails…
    » Could put copies on different disks, but if server fails…
    » Could put copies on different servers, but if building is
      struck by lightning….
    » Could put copies on servers in different continents…
• RAID: Redundant Arrays of Inexpensive Disks
  – Data stored on multiple disks (redundancy)
  – Either in software or hardware
    » In hardware case, done by disk controller; file system may
      not even know that there is more than one disk in use

Log Structured and Journaled File Systems
• Better reliability through use of log
  – All changes are treated as transactions
  – A transaction is committed once it is written to the log
    » Data forced to disk for reliability
    » Process can be accelerated with NVRAM
  – Although File system may not be updated immediately,
    data preserved in the log
• Difference between “Log Structured” and “Journaled”
  – In a Log Structured filesystem, data stays in log form
  – In a Journaled filesystem, Log used for recovery
• For Journaled system:
  – Log used to asynchronously update filesystem
    » Log entries removed after used
  – After crash:
    » Remaining transactions in the log performed (“Redo”)
    » Modifications done in way that can survive crashes
• Examples of Journaled File Systems:
  – Ext3 (Linux), XFS (Unix), etc.
Conclusion
• Multilevel Indexed Scheme
  – Inode contains file info, direct pointers to blocks,
    indirect blocks, doubly indirect, etc.
• Cray DEMOS: optimization for sequential access
  – Inode holds set of disk ranges, similar to segmentation
• 4.2 BSD Multilevel index files
  – Inode contains pointers to actual blocks, indirect blocks,
    double indirect blocks, etc.
  – Optimizations for sequential access: start new files in
    open ranges of free blocks
  – Rotational Optimization
• Naming: act of translating from user-visible names to
  actual system resources
  – Directories used for naming for local file systems
• Important system properties
  – Availability: how often is the resource available?
  – Durability: how well is data preserved against faults?
  – Reliability: how often is resource performing correctly?
CS162
Operating Systems and Systems Programming
Lecture 20

Reliability and Access Control /
Distributed Systems

November 10, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

Review: Example of Multilevel Indexed Files
• Multilevel Indexed Files: (from UNIX 4.1 BSD)
  – Key idea: efficient for small files, but still allow big files
  – File Header format:
    » First 10 ptrs to data blocks
    » Block 11 points to “indirect block” containing 256 blocks
    » Block 12 points to “doubly-indirect block” containing
      256 indirect blocks for total of 64K blocks
    » Block 13 points to a triply indirect block (16M blocks)
• UNIX 4.1 Pros and cons
  – Pros: Simple (more or less)
    Files can easily expand (up to a point)
    Small files particularly cheap and easy
  – Cons: Lots of seeks
    Very large files must read many indirect blocks (four
    I/Os per block!)
11/10/09 Kubiatowicz CS162 ©UCB Fall 2010 Lec 20.2
11/10/09 Kubiatowicz CS162 ©UCB Fall 2010 Lec 20.23 11/10/09 Kubiatowicz CS162 ©UCB Fall 2010 Lec 20.24
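The pointer arithmetic behind this layout can be checked with a small sketch. The `index_depth` helper and its constants are assumptions matching the slide's numbers (10 direct pointers, 256 pointers per indirect block), not real BSD code:

```python
# Sketch of the 4.1 BSD lookup arithmetic: given a logical block number,
# how many indirect blocks must be read before reaching the data block?
DIRECT, PER_BLOCK = 10, 256

def index_depth(block_num):
    """Return the number of indirect blocks traversed (0 = direct pointer)."""
    if block_num < DIRECT:
        return 0                         # ptrs 0..9: direct
    block_num -= DIRECT
    if block_num < PER_BLOCK:
        return 1                         # block 11: singly indirect
    block_num -= PER_BLOCK
    if block_num < PER_BLOCK ** 2:
        return 2                         # block 12: doubly indirect (64K blocks)
    return 3                             # block 13: triply indirect (16M blocks)
```

This makes the “four I/Os per block” complaint concrete: a block deep in a huge file costs three indirect-block reads plus the data read.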
Authorization: Two Implementation Choices
• Access Control Lists: store permissions with object
– Still might be lots of users!
– UNIX limits each file to: r,w,x for owner, group, world
– More recent systems allow definition of groups of users and permissions for each group
– ACLs allow easy changing of an object’s permissions
» Example: add Users C, D, and F with rw permissions
• Capability List: each process tracks which objects it has permission to touch
– Popular in the past, idea out of favor today
– Consider page table: Each process has list of pages it has access to, not each page has list of processes …
– Capability lists allow easy changing of a domain’s permissions
» Example: you are promoted to system administrator and should be given access to all system files

Authorization: Combination Approach
• Users have capabilities, called “groups” or “roles”
– Everyone with particular group access is “equivalent” when accessing group resource
– Like passport (which gives access to country of origin)
– Possessors of proper credentials get access
• Objects have ACLs
– ACLs can refer to users or groups
– Change object permissions by modifying ACL
– Change broad user permissions via changes in group membership
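The two bookkeeping choices can be contrasted in a toy sketch. All names here (`acl`, `caps`, the files and domains) are made up for illustration:

```python
# Toy contrast of the two implementation choices:
# an ACL hangs off each object; a capability list hangs off each domain.
acl = {"file1": {"A": "rw", "B": "r"}}               # object -> {user: rights}
caps = {"procA": {("file1", "rw"), ("file2", "r")}}  # domain -> capabilities

def acl_allows(obj, user, right):
    """ACL check: look up the object's list, then the user's entry."""
    return right in acl.get(obj, {}).get(user, "")

def cap_allows(domain, obj, right):
    """Capability check: scan the domain's list for a matching capability."""
    return any(o == obj and right in r for o, r in caps.get(domain, set()))
```

Note how the data layout dictates what is cheap: changing one object's permissions touches one ACL entry, while changing one domain's permissions touches one capability list.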
• How does one revoke someone’s access rights to a particular object?
– Easy with ACLs: just remove entry from the list
– Takes effect immediately since the ACL is checked on each object access
• Harder to do with capabilities since they aren’t stored with the object being controlled:
– Not so bad in a single machine: could keep all capability lists in a well-known place (e.g., the OS capability table)
– Very hard in distributed system, where remote hosts may have crashed or may not cooperate (more in a future lecture)
• Various approaches to revoking capabilities:
– Put expiration dates on capabilities and force reacquisition
– Put epoch numbers on capabilities and revoke all capabilities by bumping the epoch number (which gets checked on each access attempt)
– Maintain back pointers to all capabilities that have been handed out (tough if capabilities can be copied)
– Maintain a revocation list that gets checked on every access attempt
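The epoch-number idea can be sketched as follows; the class and method names are invented for illustration:

```python
# Sketch of epoch-numbered capabilities: bumping the object's epoch
# invalidates every capability issued before the bump, without ever
# tracking where those capabilities went.
class ProtectedObject:
    def __init__(self):
        self.epoch = 0

    def grant(self):
        """Hand out a capability stamped with the current epoch."""
        return {"obj": self, "epoch": self.epoch}

    def revoke_all(self):
        """Cheap mass revocation: just bump the epoch."""
        self.epoch += 1

    def access(self, cap):
        """Checked on every access attempt: stale epochs are rejected."""
        return cap["obj"] is self and cap["epoch"] == self.epoch

f = ProtectedObject()
cap = f.grant()
ok_before = f.access(cap)     # valid capability
f.revoke_all()
ok_after = f.access(cap)      # stale epoch: rejected
```

The trade-off: revocation is all-or-nothing per object, and every legitimate holder must reacquire a fresh capability afterwards.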
Centralized vs Distributed Systems
[Figure: a central server in a client/server model, contrasted with machines cooperating directly in a peer-to-peer model]
• Centralized System: System in which major functions are performed by a single physical computer
– Originally, everything on single computer
– Later: client/server model
• Distributed System: physically separate computers working together on some task
– Early model: multiple servers working together
» Probably in the same room or building
» Often called a “cluster”
– Later models: peer-to-peer/wide-spread collaboration

Distributed Systems: Motivation/Issues
• Why do we want distributed systems?
– Cheaper and easier to build lots of simple computers
– Easier to add power incrementally
– Users can have complete control over some components
– Collaboration: Much easier for users to collaborate through network resources (such as network file systems)
• The promise of distributed systems:
– Higher availability: one machine goes down, use another
– Better durability: store data in multiple locations
– More security: each piece easier to make secure
• Reality has been disappointing
– Worse availability: depend on every machine being up
» Lamport: “a distributed system is one where I can’t do work because some machine I’ve never heard of isn’t working!”
– Worse reliability: can lose data if any machine crashes
– Worse security: anyone in world can break into system
• Coordination is more difficult
– Must coordinate multiple copies of shared state information (using only a network)
– What would be easy in a centralized system becomes a lot more difficult
Conclusion
• Important system properties
– Availability: how often is the resource available?
– Durability: how well is data preserved against faults?
– Reliability: how often is resource performing correctly?
• Use of Log to improve Reliability
– Journaled file systems such as ext3
• RAID: Redundant Arrays of Inexpensive Disks
– RAID1: mirroring, RAID5: Parity block
• Authorization
– Controlling access to resources using
» Access Control Lists
» Capabilities
• Network: physical connection that allows two
computers to communicate
– Packet: unit of transfer, sequence of bits carried over
the network
Review: RAID 5+: High I/O Rate Parity
• Data striped across multiple disks
– Successive blocks stored on successive (non-parity) disks
– Increased bandwidth over single disk
• Parity block constructed by XORing data blocks in stripe
– P0 = D0 ⊕ D1 ⊕ D2 ⊕ D3
– Can destroy any one disk and still reconstruct data
– Suppose D3 fails, then can reconstruct: D3 = D0 ⊕ D1 ⊕ D2 ⊕ P0
• Later in term: talk about spreading information widely across internet for durability.

[Figure: stripe units laid out across five disks, with the parity block rotating per stripe and logical disk addresses increasing downward:
  Disk 1  Disk 2  Disk 3  Disk 4  Disk 5
  D0      D1      D2      D3      P0
  D4      D5      D6      P1      D7
  D8      D9      P2      D10     D11
  D12     P3      D13     D14     D15
  P4      D16     D17     D18     D19
  D20     D21     D22     D23     P5 ]

Goals for Today
• Authorization
• Networking
– Broadcast
– Point-to-Point Networking
– Routing
– Internet Protocol (IP)
Note: Some slides and/or pictures in the following are adapted from slides ©2005 Silberschatz, Galvin, and Gagne. Many slides generated from my lecture notes by Kubiatowicz.
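The XOR arithmetic above can be demonstrated directly. This is a minimal sketch assuming byte-string blocks; the `parity` helper is illustrative, not a real RAID implementation:

```python
# Sketch of RAID-5 parity: P0 = D0 ^ D1 ^ D2 ^ D3, and reconstruction
# of a failed disk's block from the surviving blocks plus parity.
from functools import reduce

def parity(blocks):
    """XOR a list of equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]  # D0..D3
p0 = parity(stripe)

# Disk holding D3 fails: rebuild its block from D0, D1, D2, and P0.
rebuilt = parity([stripe[0], stripe[1], stripe[2], p0])
```

Reconstruction works because XOR is its own inverse: XORing the parity with the surviving data blocks cancels them out, leaving the missing block.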
Authorization: Who Can Do What?
• How do we decide who is authorized to do actions in the system?
• Access Control Matrix: contains all permissions in the system
– Resources across top
» Files, Devices, etc…
– Domains in columns
» A domain might be a user or a group of users
» E.g. above: User D3 can read F2 or execute F3
– In practice, table would be huge and sparse!

Authorization: Two Implementation Choices
• Access Control Lists: store permissions with object
– Still might be lots of users!
– UNIX limits each file to: r,w,x for owner, group, world
» More recent systems allow definition of groups of users and permissions for each group
– ACLs allow easy changing of an object’s permissions
» Example: add Users C, D, and F with rw permissions
– Requires mechanisms to prove identity
• Capability List: each process tracks which objects it has permission to touch
– Consider page table: Each process has list of pages it has access to, not each page has list of processes …
» Capability list easy to change/augment permissions
» E.g.: you are promoted to system administrator and should be given access to all system files
– Implementation: Capability like a “Key” for access
» Example: cryptographically secure (non-forgeable) chunk of data that can be exchanged for access
Revoking Capabilities

Centralized vs Distributed Systems
– Originally, Ethernet was a broadcast network
» All computers on local subnet connected to one another
– More examples (wireless: medium is air): cellular phones, GSM GPRS, EDGE, CDMA 1xRTT, and 1EvDO
• Delivery: When you broadcast a packet, how does a receiver know who it is for? (packet goes to everyone!)
– Put header on front of packet: [ Destination | Packet ]
– Everyone gets packet, discards if not the target
– In Ethernet, this check is done in hardware
» No OS interrupt if not for particular destination
– This is layering: we’re going to build complex network protocols by layering on top of the packet
Broadcast Network Arbitration
• Arbitration: Act of negotiating use of shared medium
– What if two senders try to broadcast at same time?
– Concurrent activity but can’t use shared memory to coordinate!
• Aloha network (70’s): packet radio within Hawaii
– Blind broadcast, with checksum at end of packet. If received correctly (not garbled), send back an acknowledgement. If not received correctly, discard.
» Need checksum anyway – in case airplane flies overhead
– Sender waits for a while, and if doesn’t get an acknowledgement, re-transmits.
– If two senders try to send at same time, both get garbled, both simply re-send later.
– Problem: Stability: what if load increases?
» More collisions → less gets through → more resent → more load… → more collisions…
» Unfortunately: some sender may have started in clear, get scrambled without finishing

Carrier Sense, Multiple Access/Collision Detection
• Ethernet (early 80’s): first practical local area network
– It is the most common LAN for UNIX, PC, and Mac
– Use wire instead of radio, but still broadcast medium
• Key advance was in arbitration called CSMA/CD: Carrier sense, multiple access/collision detection
– Carrier Sense: don’t send unless idle
» Don’t mess up communications already in process
– Collision Detect: sender checks if packet trampled.
» If so, abort, wait, and retry.
– Backoff Scheme: Choose wait time before trying again
• How long to wait after trying to send and failing?
– What if everyone waits the same length of time? Then, they all collide again at some time!
– Must find way to break up shared behavior with nothing more than shared communication channel
• Adaptive randomized waiting strategy:
– Adaptive and Random: First time, pick random wait time with some initial mean. If collide again, pick random value from bigger mean wait time. Etc.
– Randomness is important to decouple colliding senders
– Scheme figures out how many people are trying to send!
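The adaptive randomized strategy above is essentially binary exponential backoff. A minimal sketch, with the `max_exp` cap as an assumption in the spirit of Ethernet's limit on backoff growth:

```python
# Sketch of Ethernet-style binary exponential backoff: after the n-th
# consecutive collision, wait a random number of slot times in [0, 2^n - 1].
import random

def backoff_slots(collisions, max_exp=10):
    """Pick the random wait (in slot times) after `collisions` collisions."""
    exp = min(collisions, max_exp)   # cap growth, as real Ethernet does
    return random.randrange(2 ** exp)
```

Doubling the mean wait after each collision is what makes the scheme adaptive: the expected backoff grows until it matches the number of contending senders.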
Routing
• Routing: the process of forwarding packets hop-by-hop through routers to reach their destination
– Need more than just a destination address!
» Need a path
– Post Office Analogy:
» Destination address on each letter is not sufficient to get it to the destination
» To get a letter from here to Florida, must route to local post office, sorted and sent on plane to somewhere in Florida, be routed to post office, sorted and sent with carrier who knows where street and house is…
• Internet routing mechanism: routing tables
– Each router does table lookup to decide which link to use to get packet closer to destination
– Don’t need 4 billion entries in table: routing is by subnet
– Could packets be sent in a loop? Yes, if tables incorrect
• Routing table contains:
– Destination address range → output link closer to destination
– Default entry (for subnets without explicit entries)

Setting up Routing Tables
• How do you set up routing tables?
– Internet has no centralized state!
» No single machine knows entire topology
» Topology constantly changing (faults, reconfiguration, etc)
– Need dynamic algorithm that acquires routing tables
» Ideally, have one entry per subnet or portion of address
» Could have “default” routes that send packets for unknown subnets to a different router that has more information
• Possible algorithm for acquiring routing table
– Routing table has “cost” for each entry
» Includes number of hops to destination, congestion, etc.
» Entries for unknown subnets have infinite cost
– Neighbors periodically exchange routing tables
» If neighbor knows cheaper route to a subnet, replace your entry with neighbor’s entry (+1 for hop to neighbor)
• In reality:
– Internet has networks of many different scales
– Different algorithms run at different scales
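The neighbor-exchange rule above is the core of distance-vector (Bellman-Ford style) routing. A minimal sketch, with made-up subnet names and a uniform hop cost of 1:

```python
# Sketch of the distance-vector update rule: adopt a neighbor's route
# to a subnet whenever its advertised cost + 1 hop beats our current cost.
INF = float("inf")

def merge(my_table, neighbor_table):
    """Update my_table in place from one neighbor's advertised table."""
    for subnet, cost in neighbor_table.items():
        if cost + 1 < my_table.get(subnet, INF):
            my_table[subnet] = cost + 1   # one extra hop to reach the neighbor
    return my_table

mine = {"10.0.0.0/8": 2}
neighbor = {"10.0.0.0/8": 3, "128.32.0.0/16": 1}
merge(mine, neighbor)
```

Run repeatedly between all neighbor pairs, these local updates converge to shortest hop counts; unknown subnets start at infinite cost exactly as the slide says.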
• How to map human-readable names to IP addresses?
– E.g. www.berkeley.edu → 128.32.139.48
– E.g. www.google.com → different addresses depending on location, and load
[Figure: DNS hierarchy — top-level domains edu and com; berkeley.edu and Mit.edu under edu; www, calmail, and eecs under berkeley; a name→address table with entries such as 169.229.131.81 and eecs.berkeley.edu → 128.32.61.103]

CS162 Operating Systems and Systems Programming
Lecture 22: Networking II
November 17, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

• Point-to-point network: a network in which every physical wire is connected to only two computers
• Switch: a bridge that transforms a shared-bus (broadcast) configuration into a point-to-point network.
• Hub: a multiport device that acts like a repeater broadcasting from each input to every output
• Router: a device that acts as a junction between two networks to transfer data packets among them.
[Figure: hosts connected through a Switch to a Router and on to the Internet]
[Figure: packet format — IP header (20 bytes), UDP header (20 bytes shown), UDP data]
[Figure: TCP sender’s sliding window over sequence numbers 100–400 — segments Seq:100/Size:40, Seq:140/Size:50, Seq:190/Size:40, Seq:230/Size:30, Seq:260/Size:40, Seq:300/Size:40, Seq:340/Size:40, Seq:380/Size:20, partitioned into “sent and acked”, “sent not acked”, and “not yet sent”; receiver acknowledgements carry ack number/window, e.g. A:100/300, A:190/210, A:380/20, A:400/0]

CS162 Operating Systems and Systems Programming
Lecture 23
– One Abstraction: send/receive messages
» Already atomic: no receiver gets portion of a message and two receivers cannot get same message
• Interface:
– Mailbox (mbox): temporary holding area for messages
» Includes both destination location and queue
– Send(message,mbox)
» Send message to remote mailbox identified by mbox
– Receive(buffer,mbox)
» Wait until mbox has message, copy into buffer, and return
» If threads sleeping on this mbox, wake up one of them
• Mailbox provides 1-way communication from T1 → T2
– T1 → buffer → T2
– Very similar to producer/consumer
» Send = V, Receive = P
» However, can’t tell if sender/receiver is local or not!
• Actually two questions here:
– When can the sender be sure that receiver actually received the message?
– When can sender reuse the memory containing message?
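The Send = V, Receive = P correspondence above can be sketched with a counting semaphore guarding a queue. The `Mailbox` class is an illustration of the interface, not any particular OS's implementation:

```python
# Sketch of the mbox interface: Send = V (release), Receive = P (acquire),
# exactly the producer/consumer pattern from earlier in the course.
import threading
from collections import deque

class Mailbox:
    def __init__(self):
        self.queue = deque()
        self.available = threading.Semaphore(0)  # counts queued messages
        self.lock = threading.Lock()             # protects the queue itself

    def send(self, message):
        """V: enqueue the message and wake one waiting receiver."""
        with self.lock:
            self.queue.append(message)
        self.available.release()

    def receive(self):
        """P: block until a message is available, then dequeue it."""
        self.available.acquire()
        with self.lock:
            return self.queue.popleft()

mbox = Mailbox()
mbox.send("hello")
msg = mbox.receive()
```

Note the point the slide makes: nothing in this interface reveals whether sender and receiver are on the same machine.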
» Contains, among other things, types of arguments/return
– Output: stub code in the appropriate source language
» Code for client to pack message, send it off, wait for result, unpack result and return to caller
» Code for server to unpack message, call procedure, pack results, send them off
• Cross-platform issues:
– What if client/server machines are different architectures or in different languages?
» Convert everything to/from some canonical form
» Tag every item with an indication of how it is encoded (avoids unnecessary conversions).
[Figure: server side of RPC on Machine B — packet handler receives into mbox1; server stub unbundles args and calls the server (callee); on return, the stub bundles ret vals and sends them back]
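The bundle/unbundle work a generated stub does can be sketched as follows. Real stubs use a canonical binary form such as XDR; JSON here is only an illustration of “canonical, self-describing” encoding, and the procedure name is made up:

```python
# Sketch of stub marshalling: pack the procedure name and arguments into
# a canonical wire form (bundle), and recover them on the far side
# (unbundle). JSON stands in for a real canonical encoding like XDR.
import json

def bundle(proc, args):
    """Client stub side: pack proc name + args into bytes for the wire."""
    return json.dumps({"proc": proc, "args": args}).encode("utf-8")

def unbundle(packet):
    """Server stub side: unpack the canonical form back into proc + args."""
    msg = json.loads(packet.decode("utf-8"))
    return msg["proc"], msg["args"]

packet = bundle("ReadAt", [1337, 4096])   # e.g. inumber, position
proc, args = unbundle(packet)
```

Because the wire form is self-describing, a client and server written in different languages on different architectures can still agree on what was sent.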
• Remote Disk: Reads and writes forwarded to server
– Use RPC to translate file system calls
– No local caching/can be caching at server-side
• Advantage: Server provides completely consistent view of file system to multiple clients
• Problems? Performance!
– Going over network is slower than going to local memory
– Lots of network traffic/not well pipelined
– Server can be a bottleneck

• Idea: Use caching to reduce network load
– In practice: use buffer cache at source and destination
• Advantage: if open/read/write/close can be done locally, don’t need to do any network traffic…fast!
• Problems:
– Failure:
» Client caches have data not committed at server
– Cache consistency!
» Client caches not consistent with server/each other
CS162 Operating Systems and Systems Programming
Lecture 24: Distributed File Systems
November 24, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162

[Figure: RPC information flow — on Machine A the client (caller) calls the client stub, which bundles args and sends a packet across the network to Machine B; there the packet handler delivers to mbox1, the server stub unbundles args and calls the server (callee); the results flow back as bundled ret vals to mbox2, where the client stub unbundles them and returns to the caller]

[Figure: remote-disk model — client issues Read (RPC) to server; server returns data from its cache]

• Remote Disk: Reads and writes forwarded to server
– Use RPC to translate file system calls
– No local caching/can be caching at server-side
• Advantage: Server provides completely consistent view of file system to multiple clients
• Problems? Performance!
– Going over network is slower than going to local memory
– Lots of network traffic/not well pipelined
– Server can be a bottleneck
Use of caching to reduce network load
[Figure: clients with local caches read f1 → V1 from the server’s cache (F1:V1); one client writes f1 (OK), so the server now holds F1:V2 and that client’s later read returns V2, while another client’s cache still holds F1:V1]
• Idea: Use caching to reduce network load
– In practice: use buffer cache at source and destination
• Advantage: if open/read/write/close can be done locally, don’t need to do any network traffic…fast!
• Problems:
– Failure:
» Client caches have data not committed at server
– Cache consistency!
» Client caches not consistent with server/each other

Failures
• What if server crashes? Can client wait until server comes back up and continue as before?
– Any data in server memory but not on disk can be lost
– Shared state across RPC: What if server crashes after seek? Then, when client does “read”, it will fail
– Message retries: suppose server crashes after it does UNIX “rm foo”, but before acknowledgment?
» Message system will retry: send it again
» How does it know not to delete it again? (could solve with two-phase commit protocol, but NFS takes a more ad hoc approach)
• Stateless protocol: A protocol in which all information required to process a request is passed with the request
– Server keeps no state about client, except as hints to help improve performance (e.g. a cache)
– Thus, if server crashes and is restarted, requests can continue where left off (in many cases)
• What if client crashes?
– Might lose modified data in client cache
NFS Continued
• NFS servers are stateless; each request provides all arguments required for execution
– E.g. reads include information for entire operation, such as ReadAt(inumber,position), not Read(openfile)
– No need to perform network open() or close() on file – each operation stands on its own
• Idempotent: Performing requests multiple times has same effect as performing it exactly once
– Example: Server crashes between disk I/O and message send, client resends read, server does operation again
– Example: Read and write file blocks: just re-read or re-write file block – no side effects
– Example: What about “remove”? NFS does operation twice and second time returns an advisory error
• Failure Model: Transparent to client system
– Is this a good idea? What if you are in the middle of reading a file and server crashes?
– Options (NFS provides both):
» Hang until server comes back up (next week?)
» Return an error. (Of course, most applications don’t know they are talking over network)

NFS Cache consistency
• NFS protocol: weak consistency
– Client polls server periodically to check for changes
» Polls server if data hasn’t been checked in last 3-30 seconds (exact timeout is a tunable parameter).
» Thus, when file is changed on one client, server is notified, but other clients use old version of file until timeout.
[Figure: one client’s cache holds F1:V1 and asks the server “F1 still ok?”; the server, now holding F1:V2, answers “No: (F1:V2)”; the writing client’s cache already holds F1:V2]
– What if multiple clients write to same file?
» In NFS, can get either version (or parts of both)
» Completely arbitrary!
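The 3-30 second polling rule can be sketched as a TTL check on each cache entry. The class and helper names are illustrative, not the actual NFS client code:

```python
# Sketch of NFS-style weak consistency: reuse a cached value unless it
# hasn't been validated with the server within the last `timeout` seconds.
class CacheEntry:
    def __init__(self, data, checked_at):
        self.data = data
        self.checked_at = checked_at   # time of last server check

def read(entry, now, server_version, timeout=30):
    """Return cached data, re-validating only if the check interval expired."""
    if now - entry.checked_at >= timeout:
        entry.data = server_version    # poll server; pick up the new version
        entry.checked_at = now
    return entry.data                  # may be stale within the window!

e = CacheEntry("V1", checked_at=0)
stale = read(e, now=10, server_version="V2")   # within 30 s: still "V1"
fresh = read(e, now=40, server_version="V2")   # window expired: now "V2"
```

This makes the slide's point concrete: another client can keep reading the old version for up to the timeout after a write.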
• Key idea: graphical front-end to RPC protocol
• What happens when a web server fails?
– System breaks!
– Solution: Transport or network-layer redirection
» Invisible to applications
» Can also help with scalability (load balancers)
» Must handle “sessions” (e.g., banking/e-commerce)
• Initial version: no caching
– Didn’t scale well – easy to overload servers

• Use client-side caching to reduce number of interactions between clients and servers and/or reduce the size of the interactions:
– Time-to-Live (TTL) fields – HTTP “Expires” header from server
– Client polling – HTTP “If-Modified-Since” request headers from clients
– Server refresh – HTML “META Refresh” tag causes periodic client poll
• What is the polling frequency for clients and servers?
– Could be adaptive based upon a page’s age and its rate of change
• Server load is still significant!
WWW Proxy Caches
• Place caches in the network to reduce server load
– But, increases latency in lightly loaded case
– Caches near servers called “reverse proxy caches”
» Offloads busy server machines
– Caches at the “edges” of the network called “content distribution networks”
» Offloads servers and reduce client latency
• Challenges:
– Caching static traffic easy, but only ~40% of traffic
– Dynamic and multimedia is harder
» Multimedia is a big win: Megabytes versus Kilobytes
– Same cache consistency problems as before
• Caching is changing the Internet architecture
– Places functionality at higher levels of comm. protocols

Conclusion
• Remote Procedure Call (RPC): Call procedure on remote machine
– Provides same interface as procedure
– Automatic packing and unpacking of arguments without user programming (in stub)
• VFS: Virtual File System layer
– Provides mechanism which gives same system call interface for different types of file systems
• Distributed File System:
– Transparent access to files stored on a remote disk
» NFS: Network File System
» AFS: Andrew File System
– Caching for performance
• Cache Consistency: Keeping contents of client caches consistent with one another
– If multiple clients, some reading and some writing, how do stale cached copies get updated?
– NFS: check periodically for changes
– AFS: clients register callbacks so can be notified by server of changes
Review: RPC Information Flow
[Figure: Machine A — client (caller) calls the client stub, which bundles args and sends via the packet handler; Machine B — packet handler receives into mbox1, server stub unbundles args and calls the server (callee), then bundles ret vals and sends back; the client stub receives via mbox2, unbundles ret vals, and returns to the caller]

CS162 Operating Systems and Systems Programming
Lecture 25: Protection and Security in Distributed Systems
November 29, 2010
Prof. John Kubiatowicz
http://inst.eecs.berkeley.edu/~cs162
– Need way to prevent exposure of information while still proving identity to remote system
– Many of the original UNIX tools sent passwords over the wire “in clear text”
» E.g.: telnet, ftp, yp (yellow pages, for distributed login)
» Result: Snooping programs widespread
• What do we need? Cannot rely on physical security!
– Encryption: Privacy, restrict receivers
– Authentication: Remote Authenticity, restrict senders

[Figure: symmetric-key encryption — plaintext → Encrypt (with key) → insecure transmission of ciphertext (visible to the SPY) → Decrypt (with the same key) → plaintext at the CIA]
• Important properties
– Can’t derive plain text from ciphertext (decode) without access to key
– Can’t derive key from plain text and ciphertext
– As long as password stays secret, get both secrecy and authentication
• Symmetric Key Algorithms: DES, Triple-DES, AES
• Hash Function: Short summary of data (message)
– For instance, h1=H(M1) is the hash of message M1
» h1 fixed length, despite size of message M1.
» Often, h1 is called the “digest” of M1.
• Hash function H is considered secure if
– It is infeasible to find M2 with h1=H(M2); i.e., can’t easily find other message with same digest as given message.
– It is infeasible to locate two messages, m1 and m2, which “collide”, i.e. for which H(m1) = H(m2)
– A small change in a message changes many bits of digest/can’t tell anything about message given its hash
• Example use: verifying a mirror download
– First, ask server for digest of desired file
» Use secure channel with server
– Then ask mirror server for file
» Can be insecure channel
» Check digest of result and catch faulty or malicious mirrors
[Figure: client asks the server “Read File X” over a secure channel and gets back “Here is hx = H(X)”; the client then fetches file X from an insecure mirror and checks its digest against hx]
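The mirror-download check above is a few lines with a real hash function. SHA-256 is used here as an example secure hash; the file contents and helper names are made up:

```python
# Sketch of the mirror-verification scheme: obtain the digest over a
# trusted channel, the bytes over an untrusted one, then compare digests.
import hashlib

def digest(data):
    """Fixed-length digest of arbitrary-length data."""
    return hashlib.sha256(data).hexdigest()

trusted_hx = digest(b"contents of file X")   # hx = H(X), from the secure server

def verify_download(data, expected_digest):
    """Accept the mirror's bytes only if they hash to the trusted digest."""
    return digest(data) == expected_digest

good = verify_download(b"contents of file X", trusted_hx)
bad = verify_download(b"tampered contents", trusted_hx)
```

Because finding another input with the same digest is infeasible, a faulty or malicious mirror cannot substitute different bytes without being caught.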
Signatures/Certificate Authorities
• Can use Xpublic for person X to define their identity
– Presumably they are the only ones who know Xprivate.
– Often, we think of Xpublic as a “principle” (user)
• Suppose we want X to sign message M?
– Use private key to encrypt the digest, i.e. {H(M)}Xprivate
– Send both M and its signature:
» Signed message = [M, {H(M)}Xprivate]
– Now, anyone can verify that M was signed by X
» Simply decrypt the digest with Xpublic
» Verify that result matches H(M)
• Now: How do we know that the version of Xpublic that we have is really from X???
– Answer: Certificate Authority
» Examples: Verisign, Entrust, Etc.
– X goes to organization, presents identifying papers
» Organization signs X’s key: [ Xpublic, {H(Xpublic)}CAprivate ]
» Called a “Certificate”
– Before we use Xpublic, ask X for certificate verifying key
» Check that signature over Xpublic produced by trusted authority
• How do we get keys of certificate authority?
– Compiled into your browser, for instance!

Security through SSL
• SSL Web Protocol
– Port 443: secure http
– Use public-key encryption for key-distribution
[Figure: handshake — client sends nc; server replies with ns and certs; client sends {pms}Ks]
• Server has a certificate signed by certificate authority
– Contains server info (organization, IP address, etc)
– Also contains server’s public key and expiration date
• Establishment of Shared, 48-byte “master secret”
– Client sends 28-byte random value nc to server
– Server returns its own 28-byte random value ns, plus its certificate certs
– Client verifies certificate by checking with public key of certificate authority compiled into browser
» Also check expiration date
– Client picks 46-byte “premaster” secret (pms), encrypts it with public key of server, and sends to server
– Now, both server and client have nc, ns, and pms
» Each can compute 48-byte master secret using one-way and collision-resistant function on three values
» Random “nonces” nc and ns make sure master secret fresh
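The last step can be sketched as follows. This is only an illustration of “both sides compute the same 48-byte secret from nc, ns, pms with a one-way, collision-resistant function”; real SSL/TLS defines a specific PRF, which this is not:

```python
# Sketch only: derive a 48-byte shared secret from the two nonces and the
# premaster secret using a collision-resistant hash, stretched to 48 bytes.
# (Not the real SSL/TLS PRF; purely to show both sides agree.)
import hashlib

def master_secret(nc, ns, pms):
    out = b""
    counter = 0
    while len(out) < 48:              # stretch 32-byte SHA-256 output to 48
        out += hashlib.sha256(bytes([counter]) + nc + ns + pms).digest()
        counter += 1
    return out[:48]

nc, ns, pms = b"c" * 28, b"s" * 28, b"p" * 46   # sizes from the slide
client_side = master_secret(nc, ns, pms)
server_side = master_secret(nc, ns, pms)
```

Since nc and ns change every connection, the derived secret is fresh even if the same premaster secret were somehow reused.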
Recall: Authorization: Who Can Do What?
• How do we decide who is authorized to do actions in the system?
• Access Control Matrix: contains all permissions in the system
– Resources across top
» Files, Devices, etc…
– Domains in columns
» A domain might be a user or a group of users
» E.g. above: User D3 can read F2 or execute F3
– In practice, table would be huge and sparse!
• Two approaches to implementation
– Access Control Lists: store permissions with each object
» Still might be lots of users!
» UNIX limits each file to: r,w,x for owner, group, world
» More recent systems allow definition of groups of users and permissions for each group
– Capability List: each process tracks which objects it has permission to touch
» Popular in the past, idea out of favor today
» Consider page table: Each process has list of pages it has access to, not each page has list of processes …

How fine-grained should access control be?
• Example of the problem:
– Suppose you buy a copy of a new game from “Joe’s Game World” and then run it.
– It’s running with your userid
» It removes all the files you own, including the project due the next day…
• How can you prevent this?
– Have to run the program under some userid.
» Could create a second games userid for the user, which has no write privileges.
» Like the “nobody” userid in UNIX – can’t do much
– But what if the game needs to write out a file recording scores?
» Would need to give write privileges to one particular file (or directory) to your games userid.
– But what about non-game programs you want to use, such as Quicken?
» Now you need to create your own private quicken userid, if you want to make sure that the copy of Quicken you bought can’t corrupt non-quicken-related files
– But – how to get this right??? Pretty complex…
Authorization Continued
• Principle of least privilege: programs, users, and systems should get only enough privileges to perform their tasks
– Very hard to do in practice
» How do you figure out what the minimum set of privileges is needed to run your programs?
– People often run at higher privilege than necessary
» Such as the “administrator” privilege under Windows
• One solution: Signed Software
– Only use software from sources that you trust, thereby dealing with the problem by means of authentication
– Fine for big, established firms such as Microsoft, since they can make their signing keys well known and people trust them
» Actually, not always fine: recently, one of Microsoft’s signing keys was compromised, leading to malicious software that looked valid
– What about new startups?
» Who “validates” them?
» How easy is it to fool them?

How to perform Authorization for Distributed Systems?
[Figure: users and servers spread across different authorization domains]
• Issues: Are all user names in world unique?
– No! They only have small number of characters
» kubi@mit.edu → kubitron@lcs.mit.edu → kubitron@cs.berkeley.edu
» However, someone thought their friend was kubi@mit.edu and I got very private email intended for someone else…
– Need something better, more unique to identify person
• Suppose want to connect with any server at any time?
– Need an account on every machine! (possibly with different user name for each account)
– OR: Need to use something more universal as identity
» Public Keys! (Called “Principles”)
» People are their public keys
[Figure: Client 1 (Domain 1) fetches a Group ACL from a GACL verifier on Server 2 (Domain 3); the GACL lists keys (0xA786EF889A…, 0x6647DBC9AC…) and carries a hash, timestamp, and group signature]

• Distributed Access Control List (ACL)
– Contains list of attributes (Read, Write, Execute, etc) with attached identities (here, we show public keys)
» ACLs signed by owner of file, only changeable by owner
» Group lists signed by group key
– ACLs can be on different servers than data
» Signatures allow us to validate them
» ACLs could even be stored separately from verifiers

• Revocation:
– What if someone steals your private key?
» Need to walk through all ACLs with your key and change them…!
» This is very expensive
– Better to have a unique string identifying you that people place into ACLs
» Then, ask Certificate Authority to give you a certificate matching unique string to your current public key
» Client Request: (request + unique ID)Cprivate; give server certificate if they ask for it.
» Key compromise ⇒ must distribute “certificate revocation”, since can’t wait for previous certificate to expire.
– What if you remove someone from ACL of a given file?
» If server caches old ACL, then person retains access!
» Here, cache inconsistency leads to security violations!
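The certificate idea above — ACLs hold a stable unique ID, and a Certificate Authority maps that ID to the current public key — can be sketched as a toy model. The `ToyCA` class, IDs, and ACL contents are invented for illustration; a real system would use signed X.509-style certificates and public-key signatures, not in-memory dicts.

```python
import time

class ToyCA:
    """Maps a stable unique ID ("principal") to its current public key."""
    def __init__(self):
        self.certs = {}       # unique_id -> (public_key, issued_at)
        self.revoked = set()  # distributed "certificate revocation" list

    def issue(self, unique_id, public_key):
        self.certs[unique_id] = (public_key, time.time())
        self.revoked.discard(unique_id)   # a new key supersedes revocation

    def revoke(self, unique_id):
        self.revoked.add(unique_id)

    def current_key(self, unique_id):
        if unique_id in self.revoked or unique_id not in self.certs:
            return None
        return self.certs[unique_id][0]

# The ACL stores the stable ID, not the key itself, so a stolen key
# needs one revocation at the CA instead of a walk over every ACL.
acl = {"payroll": {"kubi": {"Read"}}}

def check_access(ca, unique_id, obj, right):
    return (ca.current_key(unique_id) is not None
            and right in acl.get(obj, {}).get(unique_id, set()))

ca = ToyCA()
ca.issue("kubi", "0xA786EF889A")
print(check_access(ca, "kubi", "payroll", "Read"))  # True
ca.revoke("kubi")                                   # private key stolen
print(check_access(ca, "kubi", "payroll", "Read"))  # False
```

Note what the toy does not solve: a server caching an old ACL (or an old revocation list) still grants stale access, which is exactly the cache-inconsistency problem on the slide.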
Conclusion
• User Identification
– Passwords/Smart Cards/Biometrics
• Passwords
– Encrypt them to help hide them
– Force them to be longer/not amenable to dictionary attack
– Use zero-knowledge request-response techniques
• Distributed identity
– Use cryptography
• Symmetrical (or Private Key) Encryption
– Single Key used to encode and decode
– Introduces key-distribution problem
• Public-Key Encryption
– Two keys: a public key and a private key
• Secure Hash Function
– Used to summarize data
– Hard to find another block of data with same hash
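The two hash properties in the summary — fixed-size digest, and the difficulty of finding another block with the same hash — can be illustrated with Python's standard `hashlib`; the choice of SHA-256 and the example strings are ours, not the slides'.

```python
import hashlib

# A secure hash summarizes data of any size into a fixed-size digest.
msg = b"The quick brown fox jumps over the lazy dog"
digest = hashlib.sha256(msg).hexdigest()
print(len(digest))   # 64 hex characters = 256 bits, whatever the input size

# Changing even one character yields an unrelated digest, and finding a
# second input with the SAME digest is computationally infeasible.
other = hashlib.sha256(b"The quick brown fox jumps over the lazy cog").hexdigest()
print(digest == other)   # False
```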
Analysis Continued
• Who signs the data?
– Or: How does the client know they are getting valid data?
– Signed by server?
» What if server compromised? Should client trust server?
– Signed by owner of file?
» Better, but now only the owner can update the file!
» Pretty inconvenient!
– Signed by group of servers that accepted latest update?
» If we must have signatures from all servers ⇒ safe, but one bad server can prevent the update from happening
» Instead: ask for a threshold number of signatures
» Byzantine agreement can help here
• How do you know that data is up-to-date?
– A valid signature only means the data is a valid (possibly older) version
– Freshness attack:
» Malicious server returns old data instead of recent data
» Problem with both ACLs and data
» E.g.: you just got a raise, but an enemy breaks into a server and prevents payroll from seeing the latest version of the update
– Hard problem
» Needs to be fixed by invalidating old copies or having a trusted group of servers (Byzantine Agreement?)
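The threshold-signature rule above can be sketched in a few lines. `toy_sign` is a stand-in for a real per-server public-key signature, and the server set and threshold values are invented for illustration.

```python
import hashlib

def toy_sign(server_key, data):
    """Stand-in for a real signature by one server (illustrative only)."""
    return hashlib.sha256(server_key.encode() + data).hexdigest()

def toy_verify(server_key, data, sig):
    return toy_sign(server_key, data) == sig

def accept_update(server_keys, data, signatures, threshold):
    """Accept an update iff at least `threshold` servers signed it.
    Requiring ALL servers would be safe but lets one bad server block
    every update; a threshold tolerates a few failed/malicious servers."""
    valid = sum(1 for k in server_keys
                if k in signatures and toy_verify(k, data, signatures[k]))
    return valid >= threshold

servers = ["s1", "s2", "s3", "s4"]
update = b"salary: raised"
sigs = {k: toy_sign(k, update) for k in ["s1", "s2", "s3"]}  # s4 refuses

print(accept_update(servers, update, sigs, threshold=3))  # True
print(accept_update(servers, update, sigs, threshold=4))  # False
```

Note the threshold alone does not fix the freshness attack: three servers could still sign (or replay) an old-but-valid version, which is why the slide points at invalidation or Byzantine agreement.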
Defeating Password Checking
• Tenex used VM, and it interacts badly with password-checking code that compares one character at a time and stops at the first mismatch
– Key idea: force page faults at inopportune times to break passwords quickly
• Arrange 1st char in string to be last char in pg, rest on next pg
– Then arrange for pg with 1st char to be in memory, and rest to be on disk (e.g., ref lots of other pgs, then ref 1st page)

           a | aaaaaa
             |
 page in memory | page on disk

• Time password check to determine if first character is correct!
– If fast, 1st char is wrong
– If slow, 1st char is right: pg fault, one of the others wrong
– So try all first characters, until one is slow
– Repeat with first two characters in memory, rest on disk
• Only 256 * 8 attempts to crack an 8-character password (256 guesses per character, one character at a time)
– Fix is easy: don’t stop until you look at all the characters

ManyCore Chips: The future is here (for EVERYONE)
• Intel 80-core multicore chip (Feb 2007)
– 80 simple cores
– Two floating point engines/core
– Mesh-like “network-on-a-chip”
– 100 million transistors
– 65nm feature size
• “ManyCore” refers to many processors/chip
– 64? 128? Hard to say exact boundary
• Question: How can ManyCore change our view of OSs?
– ManyCore is a challenge
» Need to be able to take advantage of parallelism
» Must utilize many processors somehow
– ManyCore is an opportunity
» Manufacturers are desperate to figure out how to program these chips
» Willing to change many things: hardware, software, etc.
– Can we improve: security, responsiveness, programmability?
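The Tenex password-timing attack can be sketched in a few lines. The code below is a simulation: it counts character comparisons as a stand-in for the page-fault timing signal, and the secret, alphabet, and helper names are all made up for illustration.

```python
# Simulation of the Tenex-style attack on an early-exit password check.
SECRET = "hunter2"  # hypothetical password, "unknown" to the attacker
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def check_password(guess):
    """Flawed early-exit check: returns (match?, #comparisons made)."""
    work = 0
    for g, s in zip(guess, SECRET):
        work += 1
        if g != s:          # stops at the first mismatch -- the flaw
            return False, work
    return len(guess) == len(SECRET), work

def crack(length):
    recovered = ""
    for _ in range(length - 1):
        # Pad with junk so a correct next char forces exactly one extra
        # comparison (the analogue of "slow means page fault").
        best = max(ALPHABET,
                   key=lambda c: check_password((recovered + c).ljust(length, "?"))[1])
        recovered += best   # the character that made checking "slowest"
    # Timing reveals nothing about the final character: try each directly
    for c in ALPHABET:
        if check_password(recovered + c)[0]:
            return recovered + c
    return recovered

print(crack(len(SECRET)))   # recovers "hunter2" in ~36*7 checks, not 36**7
```

The fix on the slide — compare all characters before answering — removes exactly the per-character signal this loop exploits.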
PARLab OS Goals: RAPPidS
• Responsiveness: Meets real-time guarantees
– Good user experience with UI expected
– Illusion of Rapid I/O while still providing guarantees
– Real-Time applications (speech, music, video) will be assumed
• Agility: Can deal with rapidly changing environment
– Programs not completely assembled until runtime
– User may request complex mix of services at moment’s notice
– Resources change rapidly (bandwidth, power, etc)
• Power-Efficiency: Efficient power-performance tradeoffs
– Application-Specific parallel scheduling on Bare Metal partitions
– Explicitly parallel, power-aware OS service architecture
• Persistence: User experience persists across device failures
– Fully integrated with persistent storage infrastructures
– Customizations not lost on “reboot”
• Security and Correctness: Must be hard to compromise
– Untrusted and/or buggy components handled gracefully
– Combination of verification and isolation at many levels
– Privacy, Integrity, Authenticity of information asserted

The Problem with Current OSs
• What is wrong with current Operating Systems?
– They do not allow expression of application requirements
» Minimal Frame Rate, Minimal Memory Bandwidth, Minimal QoS from system Services, Real Time Constraints, …
» No clean interfaces for reflecting these requirements
– They do not provide guarantees that applications can use
» They do not provide performance isolation
» Resources can be removed or decreased without permission
» Maximum response time to events cannot be characterized
– They do not provide fully custom scheduling
» In a parallel programming environment, ideal scheduling can depend crucially on the programming model
– They do not provide sufficient Security or Correctness
» Monolithic Kernels get compromised all the time
» Applications cannot express domains of trust within themselves without using a heavyweight process model
• The advent of ManyCore both:
– Exacerbates the above with greater number of shared resources
– Provides an opportunity to change the fundamental model
A First Step: Two-Level Scheduling

[Figure: Monolithic CPU and Resource Scheduling split into Two-Level Scheduling — Resource Allocation and Distribution above Application-Specific Scheduling]

• Split monolithic scheduling into two pieces:
– Coarse-Grained Resource Allocation and Distribution
» Chunks of resources (CPUs, Memory Bandwidth, QoS to Services) distributed to application (system) components
» Option to simply turn off unused resources (Important for Power)
– Fine-Grained Application-Specific Scheduling
» Applications are allowed to utilize their resources in any way they see fit
» Other components of the system cannot interfere with their use of resources

Important New Mechanism: Spatial Partitioning
• Spatial Partition: group of processors acting within hardware boundary
– Boundaries are “hard”, communication between partitions controlled
– Anything goes within partition
• Each Partition receives a vector of resources
– Some number of dedicated processors
– Some set of dedicated resources (exclusive access)
» Complete access to certain hardware devices
» Dedicated raw storage partition
– Some guaranteed fraction of other resources (QoS guarantee):
» Memory bandwidth, Network bandwidth
» Fractional services from other partitions
• Key Idea: Resource Isolation Between Partitions
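The “vector of resources” can be made concrete with a short sketch. Everything here — the field names, the 64-CPU machine, the example partitions — is illustrative, not Tessellation's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class ResourceVector:
    dedicated_cpus: int                                 # processors owned outright
    devices: set = field(default_factory=set)           # exclusive-access devices
    qos_fractions: dict = field(default_factory=dict)   # guaranteed shares (0..1)

TOTAL_CPUS = 64  # hypothetical chip

audio = ResourceVector(dedicated_cpus=4,
                       devices={"audio-dma"},
                       qos_fractions={"mem_bw": 0.25, "net_bw": 0.10})
gui = ResourceVector(dedicated_cpus=2,
                     devices={"framebuffer"},
                     qos_fractions={"mem_bw": 0.15})

def feasible(partitions):
    """Isolation check: dedicated CPUs must fit the machine, no device
    may be granted to two partitions, and no QoS resource may be
    oversubscribed past 100%."""
    if sum(p.dedicated_cpus for p in partitions) > TOTAL_CPUS:
        return False
    all_devs = [d for p in partitions for d in p.devices]
    if len(all_devs) != len(set(all_devs)):
        return False
    shares = {}
    for p in partitions:
        for r, f in p.qos_fractions.items():
            shares[r] = shares.get(r, 0.0) + f
    return all(total <= 1.0 for total in shares.values())

print(feasible([audio, gui]))  # True: this allocation fits the machine
```

The point of the check mirrors the slide's key idea: because grants are exclusive or bounded fractions, one partition's behavior cannot eat into another's guarantee.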
Tessellation: The Exploded OS

[Figure: Tessellation partitions — a large compute-bound application, a real-time video & window application, device drivers, firewall/virus-checking/intrusion-detection services, persistent storage, identity, and a monitor-and-adapt service — connected by secure channels]

• Normal Components split into pieces
– Device drivers (Security/Reliability)
– Network Services (Performance)
» TCP/IP stack
» Firewall
» Virus Checking
» Intrusion Detection
– Persistent Storage (Performance, Security, Reliability)
– Monitoring services
– Identity

OS as Distributed System

[Figure: secure channels connecting a balanced gang, an individual partition, and a device channel]

• Use lessons from Large Distributed Systems
– Like Peer-to-Peer on chip
– OS is a set of independent interacting components
– Shared state across components minimized
• Component-based design:
– All applications designed with pieces from many sources
– Communication represents a security vulnerability
– Quality of Service (QoS) boils down to message tracking
– Communication efficiency impacts decomposability
• Shared components complicate resource isolation:
– Need distributed mechanism for tracking and accounting of resource usage
» E.g.: How do we guarantee that each partition gets a guaranteed fraction of the service?

[Figure: Application A and Application B both using a Shared File Service]

• Spatial Partitioning Varies over Time

[Figure: partitions laid out in space, adapting over time]

– Partitioning adapts to needs of the system
– Some partitions persist, others change with time
– Further, partitions can be time multiplexed
» Services (i.e. file system), device drivers, hard realtime partitions
» User-level schedulers may time-multiplex threads within partition
• Global Partitioning Goals:
– Power-performance tradeoffs
– Setup to achieve QoS and/or Responsiveness guarantees
– Isolation of real-time partitions for better guarantees
• Monitoring and Adaptation
– Integration of performance/power/efficiency counters
Another Look: Two-Level Scheduling
• First Level: Gross partitioning of resources
– Goals: Power Budget, Overall Responsiveness/QoS, Security
– Partitioning of CPUs, Memory, Interrupts, Devices, other resources
– Constant for sufficient period of time to:
» Amortize cost of global decision making
» Allow time for partition-level scheduling to be effective
– Hard boundaries ⇒ interference-free use of resources
• Second Level: Application-Specific Scheduling
– Goals: Performance, Real-time Behavior, Responsiveness, Predictability
– CPU scheduling tuned to specific applications
– Resources distributed in application-specific fashion
– External events (I/O, active messages, etc) deferrable as appropriate
• Justifications for two-level scheduling?
– Global/cross-app decisions made by 1st level
» E.g. Save power by focusing I/O handling to smaller # of cores
– App-scheduler (2nd level) better tuned to application
» Lower overhead/better match to app than global scheduler
» No global scheduler could handle all applications

Space-Time Resource Graph

[Figure: cells (lightweight protection domains) arranged in a space-time resource graph — Cell 1, Cell 2, Cell 3; one cell’s resource label reads “4 Proc, 50% time; 1GB network BW; 25% File Server”]

• Space-Time resource graph: the explicit instantiation of resource assignments
– Directed Arrows Express Parent/Child Spawning Relationship
– All resources have a Space/Time component
» E.g. X Processors/fraction of time, or Y Bytes/Sec
• What does it mean to give resources to a Cell?
– The Cell has a position in the Space-Time resource graph and
– The resources are added to the cell’s resource label
– Resources cannot be taken away except via explicit APIs
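The space-time resource graph can be sketched as a toy: cells spawn child cells and delegate part of their own resource label through an explicit API. The class and the resource names are illustrative, not Tessellation's actual interface.

```python
class Cell:
    """A cell in a toy space-time resource graph."""
    def __init__(self, name, label):
        self.name = name
        self.label = dict(label)  # e.g. {"proc_time": 2.0, "net_bw_gbps": 1.0}
        self.children = []        # directed arrows: parent -> child

    def spawn(self, name, grant):
        """Give part of this cell's label to a new child cell.
        Resources leave a cell only via this explicit API."""
        for r, amount in grant.items():
            if self.label.get(r, 0) < amount:
                raise ValueError(f"cell {self.name} lacks {amount} of {r}")
            self.label[r] -= amount
        child = Cell(name, grant)
        self.children.append(child)
        return child

# Root cell owns the machine: 4 processors at 50% time = 2.0 processor-time
# (a space AND time component), plus 1 GB/s of network bandwidth.
root = Cell("root", {"proc_time": 2.0, "net_bw_gbps": 1.0})
audio = root.spawn("audio", {"proc_time": 1.0, "net_bw_gbps": 0.25})
print(root.label)   # resources remaining at the root after the grant
```

An over-draw (asking a child for more than the parent's label holds) raises an error rather than silently oversubscribing, mirroring the "cannot be taken away except via explicit APIs" rule.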
Tessellation Kernel

[Figure: Partition Management Layer (scheduler and resource allocator) above a trusted Partition Mechanism Layer (paravirtualized hardware) that configures partitions; below, hardware partitioning mechanisms — interconnect bandwidth, message passing, physical memory, cache, performance counters, CPUs]

• Partition Management Layer (Resource Distributer)
– Time-slices at a coarse granularity
– Performs bin-packing-like mapping to implement the space-time graph
– In the limit of many processors, no time multiplexing of processors: merely distributing resources
• Partition Mechanism Layer (Trusted)
– Implements hardware partitions and secure channels
– Configures partitions; resources enforced by HW at runtime
– Device Dependent: makes use of more or less hardware support for QoS and Partitions
• Hardware Partitioning Mechanisms To Support Partitions
– Interconnect Bandwidth, Message Passing, Physical Memory, Cache, Performance Counters, CPUs
Example of Music Application

[Figure: Music program partitioned into an Audio-processing / Synthesis Engine (Pinned/TT partition) between Input and Output devices (each a Pinned/TT Partition), a GUI Subsystem backed by a Graphical Interface (GUI Partition), and a Time-sensitive Network Subsystem backed by a Network Service (Net Partition) handling communication with other nodes and preliminary audio-processing]

Conclusion
• Distributed identity
– Use cryptography (Public Key, Signed by PKI)
• Distributed storage example
– Revocation: How to remove permissions from someone?
– Integrity: How to know whether data is valid
– Freshness: How to know whether data is recent
• Buffer-Overrun Attack: exploit bug to execute code
• Space-Time Partitioning: grouping processors & resources behind hardware boundary
– Focus on Quality of Service
– Two-level scheduling
1) Global Distribution of resources
2) Application-Specific scheduling of resources
– Bare Metal Execution within partition
– Composable performance, security, QoS
• Tessellation Paper:
– Off my “publications” page (near top): http://www.cs.berkeley.edu/~kubitron/papers