The Java Interface
• The Hadoop FileSystem class: the API for interacting with one of Hadoop’s filesystems.
Reading Data from a Hadoop URL:
One of the simplest ways to read a file from a Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from.
The general idiom is:
InputStream in = null;
try {
in = new URL("hdfs://host/path").openStream();
// process in
} finally {
IOUtils.closeStream(in);
}
How do we make Java recognize Hadoop’s hdfs URL scheme?
• This is achieved by calling the setURLStreamHandlerFactory() method on
URL with an instance of FsUrlStreamHandlerFactory.
• This method can be called only once per JVM, so it is typically executed in a
static block.
• This limitation means that if some other part of our program (perhaps a
third-party component outside our control) sets a
URLStreamHandlerFactory, we won’t be able to use this approach for
reading data from Hadoop.
Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler
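A minimal sketch of such a program is shown below (the class name and argument handling are illustrative and chosen to match the sample run that follows; the key point is the static block that installs the FsUrlStreamHandlerFactory):

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

// Sketch: reads the file named by a Hadoop URL on the command line and copies it to standard output.
public class FileSystemCat {

    static {
        // setURLStreamHandlerFactory() may be called only once per JVM, hence the static block.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}

A sample run might look like this: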
% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream:
• The open() method on FileSystem actually returns an
FSDataInputStream rather than a standard java.io class.
• This class is a specialization of java.io.DataInputStream with support
for random access, so we can read from any part of the stream.
package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream
implements Seekable, PositionedReadable {
// implementation elided
}
• The Seekable interface permits seeking to a position in the file and
provides a query method for the current offset from the start of the
file (getPos()):
public interface Seekable {
void seek(long pos) throws IOException;
long getPos() throws IOException;
}
• seek() can move to an arbitrary, absolute position in the file.
• Calling seek() with a position that is greater than the length of the file
will result in an IOException.
Example:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Displays a file from a Hadoop filesystem on standard output twice, by using seek().
public class FileSystemDoubleCat {

    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
• PositionedReadable interface: for reading parts of a file at a given offset.
public interface PositionedReadable {
    public int read(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer) throws IOException;
}
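A short sketch of a positioned read is given below (the path and offsets are illustrative). Unlike seek(), positioned reads do not change the stream’s current offset:

// Assumes fs is a FileSystem obtained as in the examples above.
FSDataInputStream in = fs.open(new Path("/user/tom/quangle.txt")); // illustrative path
byte[] buffer = new byte[16];
in.readFully(8, buffer);                       // fill the buffer from file offset 8
int n = in.read(64, buffer, 0, buffer.length); // read up to 16 bytes starting at offset 64
// positioned reads leave getPos() unchanged
IOUtils.closeStream(in);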
Writing Data
• The FileSystem class has a number of methods for creating a file. The simplest is the method
that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
NOTE:
• The create() methods create any parent directories of the file to be written that don’t
already exist. Though convenient, this behavior may be unexpected. If we want the write to
fail if the parent directory doesn’t exist, we should check for the existence of the parent
directory first by calling the exists() method.
• There is also an overloaded method for passing a callback interface, Progressable, so our
application can be notified of the progress of the data being written to the datanodes:
package org.apache.hadoop.util;
public interface Progressable {
public void progress();
}
• As an alternative to creating a new file, we can append to an existing file using the append() method:
public FSDataOutputStream append(Path f) throws IOException
Copying a local file to a Hadoop filesystem
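A minimal sketch of such a copy program is given below (the class name and argument handling are illustrative); the local source and destination URI come from the command line. It also illustrates the Progressable callback described above: progress() is called as data is written to the datanodes, and here each call prints a dot.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

// Sketch: copies a local file to a Hadoop filesystem, printing a dot for each progress callback.
public class FileCopyWithProgress {

    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];

        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print("."); // called as data is written to the datanodes
            }
        });

        IOUtils.copyBytes(in, out, 4096, true); // true: close the streams when the copy completes
    }
}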
Listing files
• The listStatus() method on FileSystem lists the contents of a directory, or retrieves the
status of a single file, returning an array of FileStatus objects.
• When the argument is a file, the simplest variant returns an array of FileStatus
objects of length 1. When the argument is a directory, it returns zero or more
FileStatus objects representing the files and directories contained in the directory.
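A short sketch of listing a directory is given below (the directory is illustrative). FileUtil.stat2Paths() converts an array of FileStatus objects into an array of Path objects:

// Assumes fs is a FileSystem obtained as in the earlier examples;
// FileStatus and FileUtil are in org.apache.hadoop.fs.
FileStatus[] status = fs.listStatus(new Path("/user/tom")); // illustrative directory
Path[] listedPaths = FileUtil.stat2Paths(status);
for (Path p : listedPaths) {
    System.out.println(p);
}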
File patterns
• It is a common requirement to process sets of files in a single operation.
• Rather than having to enumerate each file and directory to specify the
input, it is convenient to use wildcard characters to match multiple files
with a single expression, an operation that is known as globbing.
• Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException
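For example, with a hypothetical layout of one directory per day (such as /2007/12/30 and /2007/12/31), a single glob selects all the days in a year:

// Assumes fs is a FileSystem obtained as in the earlier examples.
FileStatus[] status = fs.globStatus(new Path("/2007/*/*")); // matches /2007/12/30, /2007/12/31, ...
Path[] matched = FileUtil.stat2Paths(status);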
PathFilter
• Glob patterns are not always powerful enough to describe a set of files we want to access.
For example, it is not generally possible to exclude a particular file using a glob pattern.
The listStatus() and globStatus() methods of FileSystem take an optional PathFilter, which
allows programmatic control over matching.
package org.apache.hadoop.fs;
public interface PathFilter {
boolean accept(Path path);
}
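As a sketch, a PathFilter that excludes paths matching a regular expression (the class name here is ours) could be written as follows:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Sketch: a filter that rejects any path whose string form matches the given regular expression.
public class RegexExcludePathFilter implements PathFilter {

    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}

Combined with a glob, for example fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$")), it excludes a single directory from an otherwise matching set.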
Deleting Data
• Use the delete() method on FileSystem to permanently remove files or directories:
public boolean delete(Path f, boolean recursive) throws IOException
• If f is a file or an empty directory, the value of recursive is ignored. A nonempty
directory is deleted, along with its contents, only if recursive is true (otherwise an
IOException is thrown).