
I have a problem with viewing chunks of a very large text file. This file, approximately 19 GB, is obviously too big to view by any traditional means.

I have tried head and tail (head -n 1 and tail -n 1), piping the two commands together in various ways to get at a piece in the middle, with no luck. My Linux machine running Ubuntu 9.10 cannot process this file.

How do I handle this file? My ultimate goal is to home in on lines 45000000 through 45000100.

nicorellius

4 Answers

You should use sed.

sed -n -e 45000000,45000100p -e 45000101q bigfile > savedlines

This tells sed to print lines 45000000-45000100 inclusive, and to quit at line 45000101 so it doesn't scan the rest of the 19 GB file.
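
With GNU sed the same thing can be written as a single expression, for example:

sed -n '45000000,45000100p;45000101q' bigfile > savedlines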

Kyle Jones

Create a MySQL database with a single table which has a single field. Then import your file into the database. This will make it very easy to look up a certain line.

I don't think anything else could be faster (if head and tail already fail). In the end, any application that wants to find line n has to seek through the whole file until it has counted n newlines. Without some sort of lookup (an index from line number to byte offset into the file), no better performance is possible.
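
For illustration, a rough sketch of what such a lookup could look like, built with awk (the one-million-line granularity and the .idx file name are only assumptions for the example):

# Record the byte offset of every 1,000,000th line; awk's length() counts
# characters, so this assumes a single-byte encoding and one trailing newline per line.
awk 'NR % 1000000 == 1 { printf "%d\t%d\n", NR, offset } { offset += length($0) + 1 }' /tmp/my_large_file > /tmp/my_large_file.idx

With an index like that you could later jump close to line 45000000 instead of rescanning the whole 19 GB.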

Given how easy it is to create a MySQL database and import data into it, I feel like this is a viable approach.

Here is how to do it:

DROP DATABASE IF EXISTS helperDb;
CREATE DATABASE `helperDb`;
CREATE TABLE `helperDb`.`helperTable` (
  `lineIndex` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `lineContent` MEDIUMTEXT,
  PRIMARY KEY (`lineIndex`)
);
LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable (lineContent);
SELECT lineContent FROM helperDb.helperTable WHERE lineIndex BETWEEN 45000000 AND 45000100;

/tmp/my_large_file would be the file you want to read.

If the lines in your file contain tab characters (the default field delimiter), the correct syntax is to terminate fields on the newline instead, so that each whole line ends up in the single column:

LOAD DATA INFILE '/tmp/my_large_file' INTO TABLE helperDb.helperTable FIELDS TERMINATED BY '\n' (lineContent);

Another major advantage of this approach is that if you later decide to extract another set of lines, you won't have to wait hours for the processing again (unless, of course, you delete the database).
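
For example, once the import has finished, a later lookup is just a quick query from the shell (the -u root credentials here are only a placeholder):

mysql -u root -p -e "SELECT lineContent FROM helperDb.helperTable WHERE lineIndex BETWEEN 45000000 AND 45000100;"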

Oliver Salzburg

You have the right tools but are using them incorrectly. As previously answered over at U&L, tail -n +X file | head -n Y (note the +) is 10-15% faster than sed when extracting Y lines starting at line X. And conveniently, you don't have to explicitly tell the process to quit as you do with sed.

tail will read and discard the first X-1 lines (there's no way around that), then read and print the following lines. head will read and print the requested number of lines and then exit. When head exits, tail receives a SIGPIPE and dies, so it won't have read more than a buffer's worth (typically a few kilobytes) beyond the requested lines.
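
For the range in the question that would be (assuming the file is called bigfile, as in the sed answer):

tail -n +45000000 bigfile | head -n 101 > savedlines

head -n 101 rather than 100 because the range 45000000-45000100 is inclusive.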

Erich

Two good old tools for big files are split and cat. You can use split with the --lines=<number> option to cut the file into multiple pieces with a fixed number of lines each.

For example, split --lines=45000000 huge_file.txt. The resulting pieces are named xaa, xab, and so on. You can then head the piece xab, which holds the bulk of the range you want (line 45000000 itself ends up as the last line of xaa). When you are done, cat puts the pieces back together into a single big file.
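
A sketch for the specific range in the question (the default xaa/xab piece names are assumed):

split --lines=45000000 huge_file.txt   # xaa: lines 1-45000000, xab: the rest
tail -n 1 xaa > savedlines             # line 45000000
head -n 100 xab >> savedlines          # lines 45000001-45000100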

Anssi