Assume a client application uses a FileSplit object to read the actual bytes from the corresponding file.
To do so, an InputStream has to be created from the FileSplit, with code like:
FileSplit split = ... // The FileSplit reference
FileSystem fs = ... // The HDFS reference
FSDataInputStream fsin = fs.open(split.getPath());
long start = split.getStart() - 1; // The byte before the first byte of the split
if (start >= 0) {
    fsin.seek(start);
}
This adjustment of the stream position by -1 appears in some code, such as the Hadoop MapReduce LineRecordReader class. However, the documentation of the FSDataInputStream seek() method says explicitly that, after seeking to a location, the next read will be from that location. That seems to mean that the code above starts reading one byte before the split actually begins.
So, the question is: is that "-1" adjustment necessary for all InputSplit reading cases?
By the way, if one wants to read a FileSplit correctly, seeking to its start is not enough, because every split also has an end that may not be identical to the end of the actual HDFS file. So, the corresponding InputStream should be "bounded", i.e. have a maximum length, like the following:
InputStream is = new BoundedInputStream(fsin, split.getLength());
In this case, after the "native" fsin stream has been created above, the org.apache.commons.io.input.BoundedInputStream class is used to implement the "bounding".
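To make the "bounding" idea concrete without any Hadoop or Commons IO dependency, here is a minimal JDK-only sketch (Java 9+ for readAllBytes). The class LimitedInputStream and the method readSplit are hypothetical names made up for this illustration. A byte array stands in for the HDFS file, and skip() stands in for fsin.seek(split.getStart()). It is not how BoundedInputStream is implemented, only the same idea in miniature.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class BoundedReadDemo {

    /** Hypothetical stand-in for a bounding wrapper: exposes at most `limit` bytes. */
    static final class LimitedInputStream extends FilterInputStream {
        private long remaining;

        LimitedInputStream(InputStream in, long limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;   // bound reached: report end of stream
            int b = super.read();
            if (b >= 0) remaining--;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (len == 0) return 0;
            if (remaining <= 0) return -1;
            // Never ask the underlying stream for more than the bound allows.
            int n = super.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }
        // Note: skip() and available() are not bounded in this sketch.
    }

    /** Reads exactly the bytes [start, start + length) of the simulated file. */
    static byte[] readSplit(byte[] file, long start, long length) {
        InputStream raw = new ByteArrayInputStream(file);
        raw.skip(start); // simulates seeking to the split start, with no -1
        try (InputStream in = new LimitedInputStream(raw, length)) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] file = "0123456789".getBytes(StandardCharsets.US_ASCII);
        // Split starting at offset 3 with length 4
        System.out.println(new String(readSplit(file, 3, 4), StandardCharsets.US_ASCII)); // prints 3456
    }
}
```

In this sketch, positioning the stream exactly at the split start and bounding it by the split length yields precisely the split's bytes, so no -1 is involved when the data is treated as raw bytes.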
UPDATE
Apparently the adjustment is necessary only for use cases like the one of the LineRecordReader class, which reads past the boundaries of a split to make sure that it reads the full last line.
A good discussion with more details on this can be found in an earlier question and in the comments for MAPREDUCE-772.
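For illustration, here is a simplified JDK-only simulation of the line-oriented convention as I understand it from older versions of LineRecordReader. This is my own sketch, not the Hadoop source: SplitLineDemo and readSplit are hypothetical names, a byte array stands in for the file, and only '\n' line endings are handled. A split that does not start at offset 0 backs up one byte and discards everything through the next newline, then reads whole lines as long as a line starts before the split's end.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SplitLineDemo {

    /** Returns the lines "owned" by the split [start, start + length). */
    static List<String> readSplit(byte[] data, long start, long length) {
        long end = start + length;
        int pos = (int) start;
        if (start != 0) {
            // Back up one byte, then discard everything through the next newline.
            // If the byte at start - 1 is itself a newline, only that byte is
            // discarded and the line beginning exactly at `start` is kept.
            pos = (int) start - 1;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            pos++; // step past the newline
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it starts before the split's end,
        // even when its last bytes lie in the next split.
        while (pos < end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') {
                pos++;
            }
            lines.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++; // step past the newline
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        // Splits cut in the middle of "beta": each line is still read exactly once.
        System.out.println(readSplit(data, 0, 8));  // prints [alpha, beta]
        System.out.println(readSplit(data, 8, 9));  // prints [gamma]
        // Split starting exactly on a line boundary: "beta" is kept thanks to the -1.
        System.out.println(readSplit(data, 6, 5));  // prints [beta]
    }
}
```

Without the -1 in this simulation, the split starting at offset 6 would discard "beta" as if it were a partial first line, and no other split would read it. That seems to be the only reason for the adjustment: it lets the reader tell whether the split begins on a line boundary, which is irrelevant when reading a split as raw bytes.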