The Java Interface

• The Hadoop FileSystem class: the API for interacting with one of Hadoop’s filesystems.
Reading Data from a Hadoop URL:
One of the simplest ways to read a file from a Hadoop filesystem is by using a
java.net.URL object to open a stream to read the data from.
The general idiom is:
InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}
How do we make Java recognize Hadoop’s hdfs URL scheme?
• This is achieved by calling the setURLStreamHandlerFactory() method on
URL with an instance of FsUrlStreamHandlerFactory.
• This method can be called only once per JVM, so it is typically executed in a
static block.
• This limitation means that if some other part of our program—perhaps a
third-party component outside our control— sets a
URLStreamHandlerFactory, we won’t be able to use this approach for
reading data from Hadoop.
Displaying files from a Hadoop filesystem on standard output using a URLStreamHandler

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

    static {
        // May only be called once per JVM, hence the static initializer
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Output:
% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
Reading Data Using the FileSystem API
• A file in a Hadoop filesystem is represented by a Hadoop Path object
• FileSystem is a general filesystem API, so the first step is to retrieve an
instance for the filesystem we want to use—HDFS in this case.
• There are several static factory methods for getting a FileSystem
instance
• public static FileSystem get(Configuration conf) throws IOException
• public static FileSystem get(URI uri, Configuration conf) throws IOException
• public static FileSystem get(URI uri, Configuration conf, String user) throws
IOException
• Configuration object encapsulates a client or server’s configuration,
which is set using configuration files read from the class path, such as
conf/core-site.xml.
• The first method returns the default filesystem (as specified in the file
conf/core-site.xml, or the default local filesystem if not specified
there).
• The second uses the given URI’s scheme and authority to determine
the filesystem to use, falling back to the default filesystem if no
scheme is specified in the given URI.
• The third retrieves the filesystem as the given user.
• We may want to retrieve a local filesystem instance. For this, we can
use the convenience method getLocal()
public static LocalFileSystem getLocal(Configuration conf) throws IOException
• With a FileSystem instance in hand, we invoke an open() method to
get the input stream for a file:
• public FSDataInputStream open(Path f) throws IOException
• public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException
Displaying files from a Hadoop filesystem on standard output by using the FileSystem directly
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
Output:

% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
FSDataInputStream:
• The open() method on FileSystem actually returns an
FSDataInputStream rather than a standard java.io class.
• This class is a specialization of java.io.DataInputStream with support
for random access, so we can read from any part of the stream.
package org.apache.hadoop.fs;

public class FSDataInputStream extends DataInputStream
        implements Seekable, PositionedReadable {
    // implementation elided
}
• The Seekable interface permits seeking to a position in the file and
provides a query method for the current offset from the start of the
file (getPos()):
public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
• seek() can move to an arbitrary, absolute position in the file.
• Calling seek() with a position that is greater than the length of the file
will result in an IOException.
Example:
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Displays a file twice by seeking back to the start of the stream
public class FileSystemDoubleCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.seek(0); // go back to the start of the file
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
• PositionedReadable interface: for reading parts of a file at a given offset.
public interface PositionedReadable {
    public int read(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;
    public void readFully(long position, byte[] buffer) throws IOException;
}
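For illustration, here is a minimal sketch (not from the original slides) of a positioned read; the path and offsets below are assumed values.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: read a few bytes at a fixed offset using the
// PositionedReadable methods exposed by FSDataInputStream
public class PositionedReadExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost/user/tom/quangle.txt
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataInputStream in = fs.open(new Path(uri));
        byte[] buf = new byte[10];
        in.readFully(0, buf); // positioned read; leaves the stream's current offset unchanged
        System.out.println(new String(buf, "UTF-8"));
        in.close();
    }
}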
Writing Data

• The FileSystem class has a number of methods for creating a file. The simplest is the method
that takes a Path object for the file to be created and returns an output stream to write to:
public FSDataOutputStream create(Path f) throws IOException
NOTE:
• The create() methods create any parent directories of the file to be written that don’t
already exist. Though convenient, this behavior may be unexpected. If we want the write to
fail if the parent directory doesn’t exist, then we should check for the existence of the parent
directory first by calling the exists() method.
• There is also an overloaded method for passing a callback interface, Progressable, so our
application can be notified of the progress of the data being written to the datanodes:
package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}
• As an alternative to creating a new file, we can append to an existing file using the append() method:
public FSDataOutputStream append(Path f) throws IOException
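A rough sketch of how append() might be used follows (the file path is an assumed argument; note that append is an optional operation and is not implemented by every Hadoop filesystem).

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: append a line of text to an existing file
public class AppendExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // path to an existing file
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataOutputStream out = fs.append(new Path(uri));
        out.writeBytes("one more line\n"); // inherited from DataOutputStream
        out.close();
    }
}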
Copying a local file to a Hadoop filesystem

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];
        String dst = args[1];
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        OutputStream out = fs.create(new Path(dst), new Progressable() {
            public void progress() {
                System.out.print("."); // print a dot for each progress callback
            }
        });
        IOUtils.copyBytes(in, out, 4096, true);
    }
}
FSDataOutputStream

• The create() method on FileSystem returns an FSDataOutputStream, which, like
FSDataInputStream, has a method for querying the current position in the file:

public class FSDataOutputStream extends DataOutputStream implements Syncable {

    public long getPos() throws IOException {
        // implementation elided
    }

    // implementation elided
}
• However, unlike FSDataInputStream, FSDataOutputStream does not permit
seeking. This is because HDFS allows only sequential writes to an open file or
appends to an already written file.
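A short, illustrative sketch (not from the slides) of querying the write position with getPos(); the destination path is an assumed argument.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: write some bytes and report how far into the file we are
public class GetPosExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // destination path
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FSDataOutputStream out = fs.create(new Path(uri));
        out.writeBytes("hello\n");
        System.out.println("position after write: " + out.getPos()); // 6 bytes written so far
        out.close();
    }
}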
Directories
• FileSystem provides a method to create a directory:
public boolean mkdirs(Path f) throws IOException
• It returns true if the directory (and all parent directories) was (were)
successfully created.
• We don’t need to explicitly create a directory, because writing a file by
calling create() will automatically create any parent directories.
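A minimal sketch of creating a directory (the directory URI is an assumed argument):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: create a directory (and any missing parents)
public class MkdirsExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // e.g. hdfs://localhost/user/tom/new/dir
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        boolean created = fs.mkdirs(new Path(uri));
        System.out.println("created: " + created);
    }
}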
Querying the Filesystem
• The FileStatus class encapsulates filesystem metadata for files and
directories, including file length, block size, replication, modification
time, ownership, and permission information.
• The method getFileStatus() on FileSystem provides a way of getting a
FileStatus object for a single file or directory.
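To illustrate (a sketch, not from the slides), a small program that prints a few FileStatus fields for a given path:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: print selected metadata from a FileStatus object
public class ShowFileStatus {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FileStatus stat = fs.getFileStatus(new Path(uri));
        System.out.println("path: " + stat.getPath());
        System.out.println("length: " + stat.getLen());
        System.out.println("block size: " + stat.getBlockSize());
        System.out.println("replication: " + stat.getReplication());
        System.out.println("modification time: " + stat.getModificationTime());
        System.out.println("owner: " + stat.getOwner());
        System.out.println("permissions: " + stat.getPermission());
    }
}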
Listing files
• Finding information on a single file or directory is useful, but we also often need to
be able to list the contents of a directory. That’s what FileSystem’s listStatus()
methods are for:
public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException

• When the argument is a file, the simplest variant returns an array of FileStatus
objects of length 1. When the argument is a directory, it returns zero or more
FileStatus objects representing the files and directories contained in the directory.
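A short sketch of listing a directory’s contents (assuming a single directory path is passed as the argument):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Hypothetical example: list the contents of a directory
public class ListStatusExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0]; // directory (or file) to list
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        FileStatus[] status = fs.listStatus(new Path(uri));
        // FileUtil.stat2Paths() converts an array of FileStatus objects to an array of Paths
        Path[] listedPaths = FileUtil.stat2Paths(status);
        for (Path p : listedPaths) {
            System.out.println(p);
        }
    }
}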
File patterns
• It is a common requirement to process sets of files in a single operation.
• Rather than having to enumerate each file and directory to specify the
input, it is convenient to use wildcard characters to match multiple files
with a single expression, an operation that is known as globbing.
• Hadoop provides two FileSystem methods for processing globs:
public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws
IOException
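As a brief sketch (the glob pattern below is an assumed example, and the default filesystem from core-site.xml is used):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Hypothetical example: expand a glob such as /2007/*/* and print the matching paths
public class GlobStatusExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf); // default filesystem
        FileStatus[] matches = fs.globStatus(new Path(args[0])); // e.g. /2007/*/*
        for (Path p : FileUtil.stat2Paths(matches)) {
            System.out.println(p);
        }
    }
}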
PathFilter

• Glob patterns are not always powerful enough to describe a set of files we want to access.
For example, it is not generally possible to exclude a particular file using a glob pattern.
The listStatus() and globStatus() methods of FileSystem take an optional PathFilter, which
allows programmatic control over matching.

package org.apache.hadoop.fs;

public interface PathFilter {
    boolean accept(Path path);
}
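For example, a filter that excludes paths matching a regular expression might look like this (a sketch; the class name and usage are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter: rejects any path whose string form matches the given regex
public class RegexExcludePathFilter implements PathFilter {

    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}

Such a filter could then be passed to listStatus() or globStatus(), for example fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$")), to match a glob while excluding particular paths.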

Deleting Data
• Use the delete() method on FileSystem to permanently remove files or directories:
public boolean delete(Path f, boolean recursive) throws IOException

• If f is a file or an empty directory, the value of recursive is ignored. A nonempty directory is
deleted, along with its contents, only if recursive is true (otherwise, an IOException is thrown).
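A one-method sketch of a recursive delete (the path is an assumed argument; use with care, since the removal is permanent):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical example: recursively delete a path and report whether it succeeded
public class DeleteExample {
    public static void main(String[] args) throws Exception {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        boolean deleted = fs.delete(new Path(uri), true); // recursive = true
        System.out.println("deleted: " + deleted);
    }
}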
Data Flow
• Anatomy of a File Read
• The client opens the file it wishes to read by calling open() on the FileSystem object, which for
HDFS is an instance of DistributedFileSystem (step 1).
• DistributedFileSystem calls the namenode, using remote procedure calls (RPCs), to determine the
locations of the first few blocks in the file (step 2). For each block, the namenode returns the
addresses of the datanodes that have a copy of that block.
• The datanodes are sorted according to their proximity to the client, based on the topology of
the cluster’s network.
• The DistributedFileSystem returns an FSDataInputStream (an input stream that supports file
seeks) to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream,
which manages the datanode and namenode I/O.
• The client then calls read() on the stream (step 3). DFSInputStream, which has stored the
datanode addresses for the first few blocks in the file, then connects to the first (closest)
datanode for the first block in the file.
• Data is streamed from the datanode back to the client, which calls read() repeatedly on the
stream (step 4).
• When the end of the block is reached, DFSInputStream will close the connection to the datanode,
then find the best datanode for the next block (step 5). This happens transparently to the
client, which from its point of view is just reading a continuous stream.
• It will also call the namenode to retrieve the datanode locations for the next batch of blocks as
needed. When the client has finished reading, it calls close() on the FSDataInputStream (step 6).
• During reading, if the DFSInputStream encounters an error while communicating with a
datanode, it will try the next closest one for that block. It will also remember datanodes that have
failed so that it doesn’t needlessly retry them for later blocks.
• The DFSInputStream also verifies checksums for the data transferred to it from the datanode. If a
corrupted block is found, the DFSInputStream attempts to read a replica of the block from
another datanode; it also reports the corrupted block to the namenode.
• One important aspect of this design is that the client contacts datanodes directly to retrieve data
and is guided by the namenode to the best datanode for each block.
• This design allows HDFS to scale to a large number of concurrent clients because the data traffic is
spread across all the datanodes in the cluster.
Network Topology
• What does it mean for two nodes in a local network to be “close” to each other? In the
context of high-volume data processing, the limiting factor is the rate at which we can
transfer data between nodes—bandwidth is a scarce commodity. The idea is to use the
bandwidth between two nodes as a measure of distance.
• Hadoop takes a simple approach in which the network is represented as a tree and the
distance between two nodes is the sum of their distances to their closest common ancestor.
• The idea is that the bandwidth available for each of the following scenarios becomes
progressively less:
• Processes on the same node
• Different nodes on the same rack
• Nodes on different racks in the same data center
• Nodes in different data centers
• For example, imagine a node n1 on rack r1 in data center d1. This can be
represented as /d1/r1/n1. Using this notation, here are the distances for the four
scenarios:
• distance(/d1/r1/n1, /d1/r1/n1) = 0 (processes on the same node)
• distance(/d1/r1/n1, /d1/r1/n2) = 2 (different nodes on the same rack)
• distance(/d1/r1/n1, /d1/r2/n3) = 4 (nodes on different racks in the same data center)
• distance(/d1/r1/n1, /d2/r3/n4) = 6 (nodes in different data centers)
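The distance calculation above can be sketched in plain Java (a hypothetical helper for illustration, not Hadoop’s own implementation):

// Hypothetical helper: computes the tree distance between two node locations
// written as /datacenter/rack/node, as the sum of each node's distance to
// their closest common ancestor
public class TopologyDistance {
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0;
        while (common < pa.length && common < pb.length && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // 0
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // 2
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n3")); // 4
        System.out.println(distance("/d1/r1/n1", "/d2/r3/n4")); // 6
    }
}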
Anatomy of a File Write
• The client creates the file by calling create() on DistributedFileSystem (step 1).
DistributedFileSystem makes an RPC call to the namenode to create a new file in the
filesystem’s namespace, with no blocks associated with it (step 2).
• The namenode performs various checks to make sure the file doesn’t already exist and that the
client has the right permissions to create the file. If these checks pass, the namenode makes a
record of the new file; otherwise, file creation fails and the client is thrown an IOException.
• The DistributedFileSystem returns an FSDataOutputStream for the client to start writing data
to. FSDataOutputStream wraps a DFSOutputStream, which handles communication with the
datanodes and the namenode.
• As the client writes data (step 3), the DFSOutputStream splits it into packets, which it writes to
an internal queue called the data queue. The data queue is consumed by the DataStreamer,
which is responsible for asking the namenode to allocate new blocks by picking a list of suitable
datanodes to store the replicas.
• The DataStreamer streams the packets to the first datanode in the pipeline (step 4), which stores
each packet and forwards it to the second datanode in the pipeline.
• The DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue only
when it has been acknowledged by all the datanodes in the pipeline (step 5).
• If any datanode fails while data is being written to it, then the following actions are taken, which
are transparent to the client writing the data. First, the pipeline is closed, and any packets in the
ack queue are added to the front of the data queue so that datanodes that are downstream from
the failed node will not miss any packets.
• The current block on the good datanodes is given a new identity, which is communicated to the
namenode, so that the partial block on the failed datanode will be deleted if the failed datanode
recovers later on.
• When the client has finished writing data, it calls close() on the stream (step 6). This action
flushes all the remaining packets to the datanode pipeline and waits for acknowledgments before
contacting the namenode to signal that the file is complete (step 7).
