Resource Management in Distributed Systems: 
Distributed File Systems 
CS-550: Distributed File Systems [SiS] 1
Distributed File Systems 
Definition: 
• Implement a common file system that can be shared by all 
autonomous computers in a distributed system 
Goals: 
• Network transparency 
• High availability 
Architectural options: 
• Fully distributed: files distributed to all sites 
– Issues: performance, implementation complexity 
• Client-server Model: 
– File servers: dedicated sites that store files and perform storage and 
retrieval operations 
– Clients: the remaining sites, which use the servers to access files 
Distributed File Systems: Client-Server Architecture 
Distributed File Systems Services 
• Services provided by the distributed file system: 
(1) Name Server: Maps (resolves) the names supplied by clients to 
objects (files and directories) 
• Takes place when a process first attempts to access a file or 
directory. 
(2) Cache manager: Improves performance through file caching 
• Caching at the client - When client references file at server: 
– Copy of data brought from server to client machine 
– Subsequent accesses done locally at the client 
• Caching at the server: 
– File saved in memory to reduce subsequent access time 
* Issue: different cached copies can become inconsistent. Cache 
managers (at server and clients) have to provide coordination. 
Typical Data Access in a Client/File Server Architecture 
Mechanisms used in distributed file systems 
(1) Mounting 
• The mount mechanism binds together several filename spaces (collection 
of files and directories) into a single hierarchically structured name space 
(Example: UNIX and its derivatives) 
• A name space ‘A’ can be mounted (bound) at an internal node (mount 
point) of a name space ‘B’ 
• Implementation: kernel maintains the mount table, mapping mount 
points to storage devices 
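
Illustration (a minimal sketch, not the implementation of any real kernel): the mount table can be modelled as a map from mount points to devices, resolved by longest-prefix match. The table contents, the device names, and the function name resolve_mount are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical mount table: mount point -> storage device (or remote export).
MOUNT_TABLE: Dict[str, str] = {
    "/": "disk0",
    "/usr": "disk1",
    "/usr/remote": "serverX:/export",
}


def resolve_mount(path: str, table: Dict[str, str] = MOUNT_TABLE) -> Tuple[str, str]:
    """Return (device, path relative to the mount point) by longest-prefix match."""
    best = "/"
    for mount_point in table:
        # Compare on whole path components so "/usr" does not match "/usrlocal".
        prefix = mount_point.rstrip("/") + "/"
        matches = path == mount_point or path.startswith(prefix)
        if matches and len(mount_point) > len(best):
            best = mount_point
    relative = path[len(best):].lstrip("/")
    return table[best], relative
```

With this table, a path under /usr/remote is served by the remote export, while /usr/bin/ls is served by disk1.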
Mechanisms used in distributed file systems (cont.) 
(1) Mounting (cont.) 
• Location of mount information 
a. Mount information maintained at clients 
– Each client mounts every file system 
– Different clients may not see the same filename space 
– If files move to another server, every client needs to update its mount table 
– Example: SUN NFS 
b. Mount information maintained at servers 
– Every client sees the same filename space 
– If files move to another server, only the mount information at the server needs to change 
– Example: Sprite File System 
Mechanisms used in distributed file systems (cont.) 
(2) Caching 
– Improves file system performance by exploiting the locality of 
reference 
– When client references a remote file, the file is cached in the main 
memory of the server (server cache) and at the client (client cache) 
– When multiple clients modify shared (cached) data, cache consistency 
becomes a problem 
– It is very difficult to implement a solution that guarantees consistency 
(3) Hints 
– Treat the cached data as hints, i.e. cached data may not be completely 
accurate 
– Can be used by applications that can discover that the cached data is 
invalid and can recover 
• Example: 
– After the name of a file is mapped to an address, that address is stored as a hint 
in the cache 
– If the address later fails, it is purged from the cache 
– The name server is consulted to provide the actual location of the file and the 
cache is updated 
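
Illustration (a minimal sketch under stated assumptions): the hint example above, with a cached address that is tried first, purged when it fails, and refreshed from the name server. The classes HintNameServer and HintCache and the reachable callback are hypothetical stand-ins for a real name service and a real access attempt.

```python
class HintNameServer:
    """Hypothetical authoritative mapping: file name -> server address."""

    def __init__(self, locations):
        self.locations = dict(locations)
        self.lookups = 0

    def resolve(self, name):
        self.lookups += 1
        return self.locations[name]


class HintCache:
    def __init__(self, name_server, reachable):
        self.name_server = name_server
        # reachable(address, name) -> bool stands in for actually trying the address.
        self.reachable = reachable
        self.hints = {}

    def locate(self, name):
        address = self.hints.get(name)
        if address is not None and self.reachable(address, name):
            return address              # the hint was still valid
        self.hints.pop(name, None)      # purge the stale (or missing) hint
        address = self.name_server.resolve(name)
        self.hints[name] = address      # update the cache with the actual location
        return address
```

The name server is consulted only on the first access and after a hint has failed, which is the point of treating cached addresses as hints.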
Mechanisms used in distributed file systems (cont.) 
(4) Bulk data transfer 
– Observations: 
• Overhead introduced by protocols does not depend on the amount of data 
transferred in one transaction 
• Most files are accessed in their entirety 
– Common practice: when client requests one block of data, multiple 
consecutive blocks are transferred 
(5) Encryption 
– Encryption is needed to provide security in distributed systems 
– Entities that need to communicate send request to authentication server 
– Authentication server provides key for conversation 
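
Illustration for (4) Bulk data transfer (a minimal sketch): when the client asks for one block, the server returns that block plus a few consecutive ones in the same transaction. The block size, the read-ahead count, and the class names BulkServer and BulkClient are assumptions chosen to keep the example small.

```python
BLOCK_SIZE = 4          # bytes per block; tiny so the example is easy to follow
READ_AHEAD_BLOCKS = 3   # assumed number of extra consecutive blocks per request


class BulkServer:
    def __init__(self, data: bytes):
        self.data = data
        self.requests = 0   # number of transactions, each paying the protocol overhead once

    def read_blocks(self, first: int, count: int) -> dict:
        """One transaction returning up to `count` consecutive blocks."""
        self.requests += 1
        blocks = {}
        for n in range(first, first + count):
            chunk = self.data[n * BLOCK_SIZE:(n + 1) * BLOCK_SIZE]
            if chunk:
                blocks[n] = chunk
        return blocks


class BulkClient:
    def __init__(self, server: BulkServer):
        self.server = server
        self.cache = {}

    def read_block(self, n: int) -> bytes:
        if n not in self.cache:
            # Ask for the wanted block plus the following ones in a single request.
            self.cache.update(self.server.read_blocks(n, 1 + READ_AHEAD_BLOCKS))
        return self.cache.get(n, b"")
```

Reading a five-block file sequentially then costs two transactions instead of five.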
Design Issues 
1. Naming and name resolution 
– Terminology 
• Name: each object in a file system (file, directory) has a unique name 
• Name resolution: mapping a name to an object or multiple objects (replication) 
• Name space: collection of names, which may or may not share the same resolution mechanism 
– Approaches to naming files in a distributed system 
(a) Concatenate name of host to names of files on that host 
– Advantage: unique filenames, simple resolution 
– Disadvantages: 
» Conflicts with network transparency 
» Moving file to another host requires changing its name and the applications using it 
(b) Mount remote directories onto local directories 
– Requires that host of remote directory is known 
– After mounting, files are referenced in a location-transparent way (i.e., the file name does 
not reveal its location) 
(c) Have a single global directory 
– All files belong to a single name space 
– Limitation: having unique system-wide filenames requires a single computing facility or 
cooperating facilities 
Design Issues (cont.) 
1. Naming and Name Resolution (cont.) 
– Contexts 
• Solve the problem of system-wide unique names, by partitioning a name space 
into contexts (geographical, organizational, etc.) 
• Name resolution is done within that context 
• Interpretation may lead to another context 
• File Name = Context + Name local to context 
– Nameserver 
• Process that maps file names to objects (files, directories) 
• Implementation options 
– Single name Server 
» Simple implementation, reliability and performance issues 
– Several Name Servers (on different hosts) 
» Each server responsible for a domain 
» Example: 
Client requests access to file ‘A/B/C’ 
Local name server looks up a table (in kernel) 
Local name server points to a remote server for ‘/B/C’ mapping 
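
Illustration (a minimal sketch, assuming each name server owns one domain): the local name server either resolves a name itself or refers the remainder of the path to the server responsible for that domain. The class DomainNameServer, its fields, and the object identifiers are hypothetical.

```python
class DomainNameServer:
    """Hypothetical name server responsible for one domain of the name space."""

    def __init__(self, name):
        self.name = name
        self.objects = {}     # path local to this domain -> object id
        self.referrals = {}   # first path component -> name server for that domain

    def resolve(self, path):
        path = path.strip("/")
        head, _, rest = path.partition("/")
        if head in self.referrals:
            # Another server owns this domain: hand over the rest of the path.
            return self.referrals[head].resolve(rest)
        return self.objects[path]


remote = DomainNameServer("remote")
remote.objects["B/C"] = "file-42@remote"

local = DomainNameServer("local")
local.referrals["A"] = remote          # 'A' is handled by the remote server
local.objects["tmp/x"] = "file-7@local"
```

Resolving 'A/B/C' at the local server follows the referral for 'A' and lets the remote server map 'B/C'.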
Design Issues (Cont.) 
2. Caching 
– Caching at the client: Main memory vs. Disk 
• Main memory: (+) Fast, (+) Works for diskless clients, (-) Expensive memory, 
(-) Complex Virtual Memory Management. 
• Disk: (+) Large files, (+) Simpler Virtual Memory Management (-) Requires 
local disk. 
– Cache consistency 
• Server initiated 
– Server informs cache managers when data in client caches is stale 
– Client cache managers invalidate stale data or retrieve new data 
– Disadvantage: extensive communication 
• Client initiated 
– Cache managers at the clients validate data with server before returning it to 
clients 
– Disadvantage: extensive communication 
• Prohibit file caching under concurrent-write sharing 
– Several clients open a file, at least one of them for writing 
– Server informs all clients to purge that cached file 
• Lock files when concurrent-write sharing (at least one client opens for write) 
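
Illustration (a minimal sketch of the server-initiated approach only): the server remembers which clients cache a file and tells them to invalidate their copies when another client writes it. The classes InvalidatingServer and CachingClient are hypothetical, and direct method calls stand in for network messages.

```python
class InvalidatingServer:
    def __init__(self):
        self.files = {}     # file name -> contents
        self.cachers = {}   # file name -> set of clients holding a cached copy

    def read(self, client, name):
        self.cachers.setdefault(name, set()).add(client)
        return self.files[name]

    def write(self, writer, name, data):
        self.files[name] = data
        # Server-initiated consistency: every other caching client is told its copy is stale.
        for client in self.cachers.get(name, set()) - {writer}:
            client.invalidate(name)
        self.cachers[name] = {writer}


class CachingClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def read(self, name):
        if name not in self.cache:
            self.cache[name] = self.server.read(self, name)
        return self.cache[name]

    def write(self, name, data):
        self.cache[name] = data
        self.server.write(self, name, data)

    def invalidate(self, name):
        self.cache.pop(name, None)
```

The per-file set of caching clients is exactly the state (and the message traffic) that makes this approach communication-heavy as the system grows.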
Design Issues (Cont.) 
3. Writing policy 
– Question: once a client writes into a file (and the local cache), when should 
the modified cache be sent to the server? 
– Options: 
• Write-through: all writes at the clients are immediately transferred to the 
servers 
– Advantage: reliability 
– Disadvantage: performance; it does not take advantage of the cache 
• Delayed writing: delay transfer to servers 
– Advantages: 
» Many writes take place (including intermediate results) before a 
transfer 
» Some data may be deleted 
– Disadvantage: reliability 
• Delayed writing until file is closed at client 
– For short open intervals, same as delayed writing 
– For long intervals, reliability problems 
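
Illustration (a minimal sketch): write-through and delayed writing compared by the number of transfers to the server. The class names are hypothetical, and flush() stands in for whatever timer or close event triggers the delayed transfer.

```python
class BlockServer:
    def __init__(self):
        self.blocks = {}
        self.transfers = 0

    def store(self, updates):
        self.transfers += 1
        self.blocks.update(updates)


class WriteThroughClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}

    def write(self, block, data):
        self.cache[block] = data
        self.server.store({block: data})   # every write reaches the server immediately


class DelayedWriteClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}
        self.dirty = set()

    def write(self, block, data):
        self.cache[block] = data
        self.dirty.add(block)              # remembered locally only, for now

    def flush(self):
        """Run after a delay or on close; a client crash before this loses the dirty blocks."""
        if self.dirty:
            self.server.store({b: self.cache[b] for b in self.dirty})
            self.dirty.clear()
```

Three successive writes to the same block cost three transfers with write-through and one with delayed writing, at the price of a window in which the server copy is out of date.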
Design Issues (Cont.) 
4. Availability 
– Issue: what is the level of availability of files in a distributed file system? 
– Resolution: use replication to increase availability, i.e. many copies 
(replicas) of files are maintained at different sites/servers 
– Replication issues: 
• How to keep replicas consistent 
• How to detect inconsistency among replicas 
– Unit of replication 
• File 
• Group of files 
a) Volume: group of all files of a user or group or all files in a server 
» Advantage: ease of implementation 
» Disadvantage: wasteful, user may need only a subset replicated 
b) Primary pack vs. pack 
» Primary pack: all files of a user 
» Pack: subset of a primary pack. Each pack can receive a different degree of 
replication 
Design Issues (Cont.) 
5. Scalability 
– Issue: can the design support a growing system? 
– Example: server-initiated cache invalidation complexity and load grow with 
size of system. Possible solutions: 
• Do not provide cache invalidation service for read-only files 
• Provide design to allow users to share cached data 
– Design file servers for scalability: threads, SMPs, clusters 
6. Semantics 
– Expected semantics: a read will return data stored by the latest write 
– Possible options: 
• All reads and writes go through the server 
– Disadvantage: communication overhead 
• Use of lock mechanism 
– Disadvantage: file not always available 
Case Studies: 
The Sun Network File System (NFS) 
• Developed by Sun Microsystems to provide a distributed file 
system independent of the hardware and operating system 
• Architecture 
– Virtual File System (VFS): 
File system interface that allows NFS to support different file systems 
– Requests for operation on remote files are routed by VFS to NFS 
– Requests are sent to the VFS on the remote server using 
• The remote procedure call (RPC), and 
• The external data representation (XDR) 
– VFS on the remote server initiates the file system operation locally 
– Vnode (Virtual Node): 
• There is a network-wide vnode for every object in the file system (file or 
directory)- equivalent of UNIX inode 
• vnode has a mount table, allowing any node to be a mount node 
Case Studies: NFS Architecture 
NFS (Cont.) 
• Naming and location: 
– Workstations are designated as clients or file servers 
– A client defines its own private file system by mounting a subdirectory of 
a remote file system on its local file system 
– Each client maintains a table which maps the remote file directories to 
servers 
– Mapping a filename to an object is done the first time a client references 
the file. Example: 
Filename: /A/B/C 
• Assume ‘A’ corresponds to ‘vnode1’ 
• Look up on ‘vnode1/B’ returns ‘vnode2’ for ‘B’ where ‘vnode2’ 
indicates that object is on server ‘X’ 
• Client asks server ‘X’ to lookup ‘vnode2/C’ 
• ‘file handle’ returned to client by server storing that file 
• Client uses ‘file handle’ for all subsequent operations on that file 
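
Illustration (a simplified sketch, not the actual NFS protocol messages): the /A/B/C example resolved one component at a time, ending with a file handle. The Vnode class here is a hypothetical stand-in that only records which server holds each object.

```python
class Vnode:
    """Simplified stand-in for a vnode: directory entries plus the server holding the object."""

    def __init__(self, server, children=None, handle=None):
        self.server = server
        self.children = children or {}   # component name -> Vnode
        self.handle = handle             # opaque file handle (files only)


def lookup(root, path):
    """Resolve a path component by component; return (file handle, lookup trace)."""
    vnode = root
    trace = []
    for component in path.strip("/").split("/"):
        vnode = vnode.children[component]       # one lookup per path component
        trace.append((component, vnode.server))
    return vnode.handle, trace


# 'A' is local; 'B' and everything below it live on server 'X'.
file_c = Vnode("X", handle="fh-C")
dir_b = Vnode("X", {"C": file_c})
dir_a = Vnode("local", {"B": dir_b})
root = Vnode("local", {"A": dir_a})
```

The trace shows where each component was resolved; the returned handle is what the client would present in later operations on the file.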
NFS (Cont.) 
• Caching: 
– Caching done in main memory of clients 
– Caching done for: file blocks, translation of filenames to vnodes, and attributes 
of files and directories 
(1) Caching of file blocks 
• Cached on demand with time stamp of the file (when last modified on the server) 
• Entire file cached, if under certain size, with timestamp when last modified 
• After certain age, blocks have to be validated with server 
• Delayed writing policy: Modified blocks flushed to the server after certain delay 
(2) Caching of filenames to vnodes for remote directory names 
• Speeds up the lookup procedure 
(3) Caching of file and directory attributes 
• Updated when new attributes received from the server, discarded after certain time 
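
Illustration (a minimal sketch of timestamp-based validation): a cached copy is trusted until it reaches a certain age, then its modification time is compared with the server's. The max_age value, the explicit now arguments, and the class names are assumptions; a direct attribute read stands in for an attribute request to the server.

```python
class TimestampedServer:
    def __init__(self):
        self.data = b""
        self.mtime = 0

    def write(self, data, now):
        self.data, self.mtime = data, now


class ValidatingCache:
    def __init__(self, server, max_age):
        self.server = server
        self.max_age = max_age    # assumed validation interval, in seconds
        self.data = None
        self.mtime = None         # server modification time of the cached copy
        self.checked_at = None    # when the copy was last fetched or validated

    def read(self, now):
        if self.data is None:
            self._fetch(now)
        elif now - self.checked_at >= self.max_age:
            if self.server.mtime != self.mtime:   # cached copy is out of date
                self._fetch(now)
            self.checked_at = now
        return self.data

    def _fetch(self, now):
        self.data = self.server.data
        self.mtime = self.server.mtime
        self.checked_at = now
```

Between validations a client can read stale data, which is the consistency cost of this scheme.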
• Stateless Server 
– Servers are stateless 
• File access requests from clients contain all needed information (pointer position, etc) 
• Servers have no record of past requests 
– Simple recovery from crashes. 
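
Illustration (a minimal sketch): with a stateless server, every read request carries the file handle, offset, and length, and the file pointer lives on the client. The class names and the handle value are hypothetical.

```python
class StatelessServer:
    def __init__(self, files):
        self.files = files   # file handle -> contents; no per-client state is kept

    def read(self, handle, offset, count):
        """Each request names the file, the position, and the length explicitly."""
        return self.files[handle][offset:offset + count]


class StatelessClient:
    def __init__(self, server, handle):
        self.server = server
        self.handle = handle
        self.offset = 0      # the file pointer is kept by the client

    def read(self, count):
        data = self.server.read(self.handle, self.offset, count)
        self.offset += len(data)
        return data
```

Because the server remembers nothing between requests, replacing it with a freshly restarted instance does not disturb a client that is halfway through a file.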
