1. What is Distributed Data Processing?

A distributed file system (DFS) is a method of storing and accessing files based on a client/server architecture. In a distributed file system, one or more central servers store files that can be accessed, with the proper authorization rights, by any number of remote clients in the network.

Much like an operating system organizes files in a hierarchical file management system, a distributed file system uses a uniform naming convention and a mapping scheme to keep track of where files are located. When a client device retrieves a file from the server, the file appears as a normal file on the client machine, and the user can work with it in the same way as if it were stored locally on the workstation. When the user finishes working with the file, it is returned over the network to the server, which stores the now-altered file for later retrieval.
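The cycle described above (retrieve a file from the server, work on it as if it were local, then return it) can be made concrete with a small sketch. The following Python fragment is illustrative only and assumes a hypothetical server object exposing get_file and put_file calls; it is not the API of any particular distributed file system.

# Illustrative sketch of the DFS access pattern described above.
# The client fetches a file from a central server, works on a local copy,
# and returns the modified file to the server when finished.

from pathlib import Path

class DFSClient:
    def __init__(self, server):
        self.server = server  # hypothetical remote file server object

    def open_local_copy(self, remote_path: str, local_dir: str) -> Path:
        """Retrieve a remote file; to the user it looks like a normal local file."""
        data = self.server.get_file(remote_path)              # assumed server call
        local_path = Path(local_dir) / Path(remote_path).name
        local_path.write_bytes(data)
        return local_path

    def return_to_server(self, local_path: Path, remote_path: str) -> None:
        """Send the (possibly modified) file back to the server for later retrieval."""
        self.server.put_file(remote_path, local_path.read_bytes())  # assumed server call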

Distributed file systems can be advantageous because they make it easier to distribute documents to multiple clients and they provide centralized storage, so client machines do not have to use their own resources to store files.

[Beal, Vangie. (n.d.). Distributed File System. Webopedia. Retrieved December 14,
2020, from https://www.webopedia.com/TERM/D/distributed_file_system.html]

2. Watch on YouTube: “Google File System – Paper that inspired Hadoop”
https://www.youtube.com/watch?v=eRgFNW4QFDc

3. Illustrate and discuss the Google File System and its Components.

The Google File System (GFS) is essentially a distributed file store. Any given GFS cluster can contain hundreds or thousands of commodity servers, and the cluster provides an interface for any number of clients to read or write files. Conceptually, it works exactly like a file system, but one distributed over hundreds or thousands of servers.

A GFS cluster consists of a single master and multiple chunk servers that are continuously accessed by different client systems. Chunk servers store data as Linux files on local disks. Stored data is divided into large chunks (64 MB), which are replicated in the network a minimum of three times. The large chunk size reduces network overhead.
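To make the chunk arithmetic concrete, here is a minimal sketch; the 64 MB chunk size and three-way replication come from the description above, while the function and variable names are assumptions for illustration. A client turns a byte offset within a file into a chunk index before asking the master which chunk servers hold that chunk.

# Illustrative chunk arithmetic, assuming the figures given above.
CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks
REPLICATION_FACTOR = 3          # each chunk stored on at least three servers

def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (chunk index, offset within that chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# Example: byte 200,000,000 of a file lies in chunk 2 (the third chunk),
# roughly 62.7 MB past the start of that chunk.
chunk_index, within_chunk = locate(200_000_000)

# A 1 GB file therefore occupies 16 chunks and, with three replicas each,
# about 3 GB of raw disk across the cluster.
raw_bytes = 16 * CHUNK_SIZE * REPLICATION_FACTOR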
GFS is designed to accommodate Google's large cluster requirements without burdening applications. Files are stored in hierarchical directories identified by path names. Metadata - such as the namespace, access control data, and mapping information - is controlled by the master, which interacts with and monitors the status of each chunk server through timed heartbeat messages.
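A highly simplified sketch of the master's bookkeeping may help here. The data structures and names below are illustrative assumptions rather than the actual GFS implementation: the master keeps a namespace mapping path names to chunk handles, and a separate mapping from chunk handles to the chunk servers currently holding replicas.

# Illustrative sketch of master metadata (not the real GFS data structures).
from dataclasses import dataclass, field

@dataclass
class MasterMetadata:
    # namespace: full path name -> chunk handles, in file order
    namespace: dict[str, list[int]] = field(default_factory=dict)
    # chunk handle -> chunk servers currently holding a replica
    chunk_locations: dict[int, set[str]] = field(default_factory=dict)

    def chunk_for(self, path: str, chunk_index: int) -> tuple[int, set[str]]:
        """Answer a client lookup: which chunk handle, and which servers hold it?"""
        handle = self.namespace[path][chunk_index]
        return handle, self.chunk_locations[handle]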
GFS features include:
 Fault tolerance
 Critical data replication
 Automatic and efficient data recovery
 High aggregate throughput
 Reduced client and master interaction because of the large chunk size
 Namespace management and locking
 High availability

The largest GFS clusters have more than 1,000 nodes with over 300 TB of disk storage capacity, accessed by hundreds of clients on a continuous basis.

COMPONENTS
1) Commodity Hardware – Commodity servers are cheap and can be made to scale horizontally with the right software.
2) Google Large Files – The system is optimized for storing and reading large files, ranging from 100 MB to multiple GB.
3) File Operations – GFS is optimized for two kinds of operations: reads and appends. Google keeps appending newly crawled content to files and uses batch processing (large reads) to create the index.
4) Chunks – Each file is split into 64 MB chunks that are distributed across multiple machines.
5) Replicas – The Google File System ensures that each chunk of a file has at least three replicas on three different servers, so that even if one server goes down, the other two replicas are still available.
6) Google File System Master – A single master server that holds the metadata described above (the namespace, access control data, and file-to-chunk mapping) and coordinates the chunk servers.
7) Heartbeats – The chunk servers run on cheap, off-the-shelf commodity hardware and can go down for any number of reasons, so it is important that they pass periodic heartbeat messages to the master; as long as heartbeats keep arriving, the master knows the chunk server is still alive (a monitoring sketch follows this list).
8) Ensure Chunk Replica Count – If a chunk server goes down, the master ensures that every chunk that was stored on it is copied to other servers until the replica count is restored, as sketched after this list.
9) Operations Log – Each file operation, together with its timestamp and the details of the user who performed it, is recorded in the operations log.
10) Shadow Master – A shadow master provides read-only access to the file system metadata when the primary master is down. Files are identified by path names; the namespace, access control data, and mapping information are controlled by the master server, while each file is divided into fixed-size chunks stored by the chunk servers, and data transfer happens directly between clients and chunk servers.
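To make items 7 and 8 above concrete, the following sketch shows one way a master could use heartbeat timestamps to notice a dead chunk server and then restore the replica count of the chunks it held. The timings, data structures, and names are assumptions for illustration, not GFS code.

import time

HEARTBEAT_TIMEOUT = 60.0       # seconds of silence before a chunk server is presumed dead
REPLICATION_FACTOR = 3         # minimum replicas per chunk, as described above

last_heartbeat: dict[str, float] = {}      # chunk server -> time of its last heartbeat
chunk_locations: dict[int, set[str]] = {}  # chunk handle -> servers holding a replica

def record_heartbeat(server: str) -> None:
    """Called whenever a chunk server's periodic heartbeat reaches the master."""
    last_heartbeat[server] = time.time()

def handle_dead_servers() -> None:
    """Detect silent chunk servers and bring their chunks back up to full replication."""
    now = time.time()
    dead = [s for s, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]
    for server in dead:
        del last_heartbeat[server]
    for handle, servers in chunk_locations.items():
        servers -= set(dead)                                   # those replicas are gone
        missing = max(REPLICATION_FACTOR - len(servers), 0)
        spares = [s for s in last_heartbeat if s not in servers][:missing]
        for target in spares:
            # In a real system the master would instruct `target` to copy the chunk
            # from a surviving replica; here we only record the new location.
            servers.add(target)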
