The Java Interface

• The Hadoop FileSystem class: the API for interacting with one of Hadoop’s filesystems.
Reading Data from a Hadoop URL:
One of the simplest ways to read a file from a Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from.
The general idiom is:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
How do we make Java recognize Hadoop’s hdfs URL scheme?
• This is achieved by calling the setURLStreamHandlerFactory() method on
URL with an instance of FsUrlStreamHandlerFactory.
• This method can be called only once per JVM, so it is typically executed in a
static block.
• This limitation means that if some other part of our program—perhaps a
third-party component outside our control— sets a
URLStreamHandlerFactory, we won’t be able to use this approach for
reading data from Hadoop.
Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

    static {
        // May only be called once per JVM, hence the static initializer
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Output:
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Reading Data Using the FileSystem API
• A file in a Hadoop filesystem is represented by a Hadoop Path object
• FileSystem is a general filesystem API, so the first step is to retrieve an
instance for the filesystem we want to use—HDFS in this case.
• There are several static factory methods for getting a FileSystem
instance
• public static FileSystem get(Configuration conf) throws IOException
• public static FileSystem get(URI uri, Configuration conf) throws IOException
• public static FileSystem get(URI uri, Configuration conf, String user) throws
IOException
• Configuration object encapsulates a client or server’s configuration,
which is set using configuration files read from the class path, such as
conf/core-site.xml.
• The first method returns the default filesystem (as specified in the file
conf/core-site.xml, or the default local filesystem if not specified
there).
• The second uses the given URI’s scheme and authority to determine
the filesystem to use, falling back to the default filesystem if no
scheme is specified in the given URI.
• The third retrieves the filesystem as the given user.
• We may want to retrieve a local filesystem instance. For this, we can
use the convenience method getLocal()
public static LocalFileSystem getLocal(Configuration conf) throws IOException
• With a FileSystem instance in hand, we invoke an open() method to
get the input stream for a file:
• public FSDataInputStream open(Path f) throws IOException
• public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Output:

% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream:
• The open() method on FileSystem actually returns an
FSDataInputStream rather than a standard java.io class.
• This class is a specialization of java.io.DataInputStream with support
for random access, so we can read from any part of the stream.
package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable {
    // implementation elided
}
• The Seekable interface permits seeking to a position in the file and
provides a query method for the current offset from the start of the
file (getPos()):
public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
• seek() can move to an arbitrary, absolute position in the file.
• Calling seek() with a position that is greater than the length of the file
will result in an IOException.
Example:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Displays a file twice by seeking back to the start of the stream
public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
• PositionedReadable interface: for reading parts of a file at a given offset.
public interface PositionedReadable {
    public int read(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer) throws IOException;
}
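For illustration, here is a minimal sketch (not from the original slides) of a positioned read; the path and offsets below are assumed values.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: read a few bytes at a fixed offset using the
// PositionedReadable methods exposed by FSDataInputStream
public class PositionedReadExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost/user/tom/quangle.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = fs.open(new Path(uri));
        byte[] buf = new byte[10];
        in.readFully(0, buf); // positioned read; leaves the stream's current offset unchanged
        System.out.println(new String(buf, "UTF-8"));
        in.close();
    }
}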
Writing Data

• The FileSystem class has a number of methods for creating a file. The simplest is the method
that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
NOTE:
• The create() methods create any parent directories of the file to be written that don’t
already exist. Though convenient, this behavior may be unexpected. If we want the write to
fail if the parent directory doesn’t exist, then we should check for the existence of the parent
directory first by calling the exists() method.
• There is also an overloaded method for passing a callback interface, Progressable, so our
application can be notified of the progress of the data being written to the datanodes:
package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}
• As an alternative to creating a new file, we can append to an existing file using the append() method:
public FSDataOutputStream append(Path f) throws IOException
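A rough sketch of how append() might be used follows (the file path is an assumed argument; note that append is an optional operation and is not implemented by every Hadoop filesystem).

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: append a line of text to an existing file
public class AppendExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // path to an existing file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataOutputStream out = fs.append(new Path(uri));
        out.writeBytes("one more line\n"); // inherited from DataOutputStream
        out.close();
    }
}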
Copying a local file to a Hadoop filesystem

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print("."); // print a dot for each progress callback
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
FSDataOutputStream

• The create() method on FileSystem returns an FSDataOutputStream, which, like
FSDataInputStream, has a method for querying the current position in the file:

public class FSDataOutputStream extends DataOutputStream implements Syncable {

    public long getPos() throws IOException {
        // implementation elided
    }

    // implementation elided
}
• However, unlike FSDataInputStream, FSDataOutputStream does not permit
seeking. This is because HDFS allows only sequential writes to an open file or
appends to an already written file.
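A short, illustrative sketch (not from the slides) of querying the write position with getPos(); the destination path is an assumed argument.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: write some bytes and report how far into the file we are
public class GetPosExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // destination path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataOutputStream out = fs.create(new Path(uri));
        out.writeBytes("hello\n");
        System.out.println("position after write: " + out.getPos()); // 6 bytes written so far
        out.close();
    }
}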
Directories
• FileSystem provides a method to create a directory:
public boolean mkdirs(Path f) throws IOException
• It returns true if the directory (and all parent directories) was (were)
successfully created.
• We don’t need to explicitly create a directory, because writing a file by
calling create() will automatically create any parent directories.
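A minimal sketch of creating a directory (the directory URI is an assumed argument):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: create a directory (and any missing parents)
public class MkdirsExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost/user/tom/new/dir
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        boolean created = fs.mkdirs(new Path(uri));
        System.out.println("created: " + created);
    }
}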
Querying the Filesystem
• The FileStatus class encapsulates filesystem metadata for files and
directories, including file length, block size, replication, modification
time, ownership, and permission information.
• The method getFileStatus() on FileSystem provides a way of getting a
FileStatus object for a single file or directory.
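To illustrate (a sketch, not from the slides), a small program that prints a few FileStatus fields for a given path:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: print selected metadata from a FileStatus object
public class ShowFileStatus {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FileStatus stat = fs.getFileStatus(new Path(uri));
        System.out.println("path: " + stat.getPath());
        System.out.println("length: " + stat.getLen());
        System.out.println("block size: " + stat.getBlockSize());
        System.out.println("replication: " + stat.getReplication());
        System.out.println("modification time: " + stat.getModificationTime());
        System.out.println("owner: " + stat.getOwner());
        System.out.println("permissions: " + stat.getPermission());
    }
}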
Listing files
• Finding information on a single file or directory is useful, but we also often need to
be able to list the contents of a directory. That’s what FileSystem’s listStatus()
methods are for:
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException

• When the argument is a file, the simplest variant returns an array of FileStatus
objects of length 1. When the argument is a directory, it returns zero or more
FileStatus objects representing the files and directories contained in the directory.
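A short sketch of listing a directory’s contents (assuming a single directory path is passed as the argument):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Hypothetical example: list the contents of a directory
public class ListStatusExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // directory (or file) to list
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FileStatus[] status = fs.listStatus(new Path(uri));
        // FileUtil.stat2Paths() converts an array of FileStatus objects to an array of Paths
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}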
File patterns
• It is a common requirement to process sets of files in a single operation.
• Rather than having to enumerate each file and directory to specify the
input, it is convenient to use wildcard characters to match multiple files
with a single expression, an operation that is known as globbing.
• Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws
IOException
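As a brief sketch (the glob pattern below is an assumed example, and the default filesystem from core-site.xml is used):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Hypothetical example: expand a glob such as /2007/*/* and print the matching paths
public class GlobStatusExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // default filesystem
        FileStatus[] matches = fs.globStatus(new Path(args[0])); // e.g. /2007/*/*
        for (Path p : FileUtil.stat2Paths(matches)) {
            System.out.println(p);
        }
    }
}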
PathFilter

• Glob patterns are not always powerful enough to describe a set of files we want to access.
For example, it is not generally possible to exclude a particular file using a glob pattern.
The listStatus() and globStatus() methods of FileSystem take an optional PathFilter, which
allows programmatic control over matching.

package org.apache.hadoop.fs;

public interface PathFilter {
    boolean accept(Path path);
}
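For example, a filter that excludes paths matching a regular expression might look like this (a sketch; the class name and usage are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: rejects any path whose string form matches the given regex
public class RegexExcludePathFilter implements PathFilter {

    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}

Such a filter could then be passed to listStatus() or globStatus(), for example fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$")), to match a glob while excluding particular paths.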

Deleting Data
• Use the delete() method on FileSystem to permanently remove files or directories:
public boolean delete(Path f, boolean recursive) throws IOException

• If f is a file or an empty directory, the value of recursive is ignored. A nonempty directory is
deleted, along with its contents, only if recursive is true (otherwise, an IOException is thrown).
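A one-method sketch of a recursive delete (the path is an assumed argument; use with care, since the removal is permanent):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: recursively delete a path and report whether it succeeded
public class DeleteExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        boolean deleted = fs.delete(new Path(uri), true); // recursive = true
        System.out.println("deleted: " + deleted);
    }
}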
Data Flow
• Anatomy of a File Read
• The client opens the file it wishes to read by calling open() on the FileSystem object, which for
HDFS is an instance of DistributedFileSystem (step 1).
• DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the
locations of the first few blocks in the file (step 2). For each block, the namenode returns the
addresses of the datanodes that have a copy of that block.
• The datanodes are sorted according to their proximity to the client, based on the topology of
the cluster’s network.
• The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file
seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream,
which manages the datanode and namenode I/O.
• The client then calls read() on the stream (step 3). DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first (closest)
datanode for the first block in the file.
• Data is streamed from the datanode back to the client, which calls read() repeatedly on the
stream (step 4).
• When the end of the block is reached, DFSInputStream will close the connection to the datanode,
then find the best datanode for the next block (step 5). This happens transparently to the
client, which from its point of view is just reading a continuous stream.
• It will also call the namenode to retrieve the datanode locations for the next batch of blocks as
needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
• During reading, if the DFSInputStream encounters an error while communicating with a
datanode, it will try the next closest one for that block. It will also remember datanodes that have
failed so that it doesn’t needlessly retry them for later blocks.
• The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a
corrupted block is found, the DFSInputStream attempts to read a replica of the block from
another datanode; it also reports the corrupted block to the namenode.
• One important aspect of this design is that the client contacts datanodes directly to retrieve data
and is guided by the namenode to the best datanode for each block.
• This design allows HDFS to scale to a large number of concurrent clients because the data traffic is
spread across all the datanodes in the cluster.
Network Topology
• What does it mean for two nodes in a local network to be “close” to each other? In the
context of high-volume data processing, the limiting factor is the rate at which we can
transfer data between nodes—bandwidth is a scarce commodity. The idea is to use the
bandwidth between two nodes as a measure of distance.
• Hadoop takes a simple approach in which the network is represented as a tree and the
distance between two nodes is the sum of their distances to their closest common ancestor.
• The idea is that the bandwidth available for each of the following scenarios becomes
progressively less:
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
• Nodes in different data centers
• For example, imagine a node n1 on rack r1 in data center d1. This can be
represented as /d1/r1/n1. Using this notation, here are the distances for the four
scenarios:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
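The distance calculation above can be sketched in plain Java (a hypothetical helper for illustration, not Hadoop’s own implementation):

// Hypothetical helper: computes the tree distance between two node locations
// written as /datacenter/rack/node, as the sum of each node's distance to
// their closest common ancestor
public class TopologyDistance {
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6
    }
}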
Anatomy of a File Write
• The client creates the file by calling create() on DistributedFileSystem (step 1).
DistributedFileSystem makes an RPC call to the namenode to create a new file in the
filesystem’s namespace, with no blocks associated with it (step 2).
• The namenode performs various checks to make sure the file doesn’t already exist and that the
client has the right permissions to create the file. If these checks pass, the namenode makes a
record of the new file; otherwise, file creation fails and the client is thrown an IOException.
• The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data
to. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the
datanodes and the namenode.
• As the client writes data (step 3), the DFSOutputStream splits it into packets, which it writes to
an internal queue called the data queue. The data queue is consumed by the DataStreamer,
which is responsible for asking the namenode to allocate new blocks by picking a list of suitable
datanodes to store the replicas.
• The DataStreamer streams the packets to the first datanode in the pipeline (step 4), which stores
each packet and forwards it to the second datanode in the pipeline.
• The DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only
when it has been acknowledged by all the datanodes in the pipeline (step 5).
• If any datanode fails while data is being written to it, then the following actions are taken, which
are transparent to the client writing the data. First, the pipeline is closed, and any packets in the
ack queue are added to the front of the data queue so that datanodes that are downstream from
the failed node will not miss any packets.
• The current block on the good datanodes is given a new identity, which is communicated to the
namenode, so that the partial block on the failed datanode will be deleted if the failed datanode
recovers later on.
• When the client has finished writing data, it calls close() on the stream (step 6). This action
flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before
contacting the namenode to signal that the file is complete (step 7).
