34

I have several .txt files with more than 30 million lines each, and anywhere from 20 to 40 "columns" (some comma-separated, some space-separated, all ASCII, with lines separated by a newline). I don't need all (or even most) of the columns, and some of them have empty fields filled with NULL.

My goals are to:

  1. Remove the columns I don't need
  2. Re-order the columns as I see fit (for readability)
  3. Pipe output to another .txt file, with columns delimited by colons

I just did this with one large text file by splitting it into ~40 .txt files with 1,000,000 lines each, importing them one by one into Excel, and using CONCATENATE, but that approach has yielded no fruit with my next target. The file is comma-separated, but still needs to be converted from .txt into .csv, and Excel chokes on it during the importing process; even if I got it into Excel, the main file breaks down into 200+ smaller files to comply with Excel's upper line limit, and doing the same thing more than 200 times is not efficient.

I'm working on a late-2020 MacBook Pro and am not versed enough in any coding languages to even know where to begin, but I'm semi-comfortable scripting in shell and always down to learn new tricks, just don't know where to start.

JW0914
hadrian4909

10 Answers

62

Choose your tools

It seems Excel is not a suitable tool for what you want to do.

One approach would be to use a different tool to consolidate or summarise the data. awk, sed, grep or perl might be better suited to this initial processing, creating a smaller CSV file that can then be processed in Excel or other tools.

There are other tools that may be better suited to doing the whole job. Maybe something like R or a DBMS. It depends what you want to do with the data.

For simply taking a bunch of dissimilar text files and reordering and selecting columns I'd jump immediately to perl. Others would use awk.

Since tools like awk and perl can process files line by line and don't need to store everything in memory, they can process huge files that would choke other tools. They can also be surprisingly fast.


Just for fun, an example

With this data

Apples,27,500,10.2,fruit,100,200,300
Chairs  1   501 123.78  furniture 101   201 301
Europe, 655, 502,0.0001,continent,   102, 202,302 

we can produce this output

fruit:Apples:10.2
furniture:Chairs:123.78
continent:Europe:0.0001

using this command

perl -l -n -e "print join(':',(split(/[, \t] */))[4,0,3])" *.txt

Explanation

element              what it does
-l                   add a newline after every print
-n                   process line by line but don't implicitly print
-e                   what follows is a program to execute
print                print the result of the following expression
join(":", list)      create one string from a list, with ":" between each element
split(/expr/)        use a regular expression to divide the line into fields
[, \t]               either a comma, a space or a tab, followed by
 * (space asterisk)  zero or more spaces
(list)[4,0,3]        select the 4th, 0th and 3rd items from the list

That one-line program is equivalent to the following, which may be easier to follow

#!perl
use warnings;
use strict;

while (<>) {                            # read line by line from all input or all listed files
    my @columns = split(/[, \t] */);    # split on whitespace or commas
    my @chosen = @columns[4,0,3];       # choose what to keep
    my $new_line = join(":", @chosen);  # join items using colons between
    print "$new_line\n";                # print with line-separator
}

invoked as perl data.pl *.txt > newdata.txt

I like perl and am moderately familiar with a subset of it, although it is waning in popularity, partly because it is easy to write perl programs that are very hard to read. However, it was designed for exactly your use case. Anyone familiar with awk, python, ruby or any of a large number of tools would solve this just as easily.

33

OS agnostic answer:

Learn just a tiny bit of Python and you will have a tool to do similar conversions in any way you wish.

Type this into a file, save it as e.g. cvt.py (originally based on code from here)

import sys

exportcolumns = [3, 4, 5]
with open(sys.argv[1], 'r') as fi:
    for line in fi:
        columns = line.split(',')
        print('\t'.join(columns[col] for col in exportcolumns))

After installing Python (version 3, nothing else!) you should be able to run the above by
python3 cvt.py filename > newfile
where filename is one of your datafiles, and newfile is where you want the result.

As written, the code looks for , as the column separator and outputs columns 3, 4, 5 (in that order, counting from 0) with a tab \t between them.


If you have more complex (inconsistent) column separation, you may well use

import re

... as shown here: https://stackoverflow.com/a/4998688/3720510 (a sketch of that variant follows the explanation below).


Short explanation for the above

  • The first line makes the sys module available. This allows the use of sys.argv here, making the command line arguments available to the script as a simple list.
  • The second line creates a list with the indices of the columns to extract from the input data.
  • The with line opens the file and makes it available during the following indented block; the file is closed once the block has been executed.
  • for - loop once for every line that can be read from the file.
  • The next line creates a list from one line of content by splitting it at every ,.
  • The print uses a generator expression to pick the columns from the list, joins them with \t (a tab char) between them, then prints the result to sys.stdout (implicit with print()), which may be a file if you redirected using > on the command line.
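
For example, a minimal sketch of that re-based variant, assuming the fields are separated by either a comma or runs of spaces/tabs (the pattern and the chosen columns are placeholders to adjust for your data):

import re
import sys

exportcolumns = [3, 4, 5]
splitter = re.compile(r'[,\s]+')   # a comma or any run of whitespace (assumed pattern)

with open(sys.argv[1], 'r') as fi:
    for line in fi:
        columns = splitter.split(line.rstrip('\n'))
        print('\t'.join(columns[col] for col in exportcolumns))

Save it in place of cvt.py (or under another name) and run it the same way as before.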

Hannu
15

Disclaimer: I have not actually tried this with a 70 GB file, but I have done a couple of GB and above 4 million rows.

The intended workflow for huge files is not to load the whole file into a sheet, but to connect to the file.

Open the Data tab, select "From text/CSV" and select your file. When the preview dialog appears, click the caret beside the "Load" button and choose "Only Create Connection". That's it. Here's a more elaborate tutorial: https://excel.officetuts.net/en/examples/open-large-csv

There might be some quirks and more challenges to solve with the column transformations, but it's worth giving a try if you feel much better in Excel than with command line tools.

Another option — if you have access to Access, you can import and handle the data there as well. That software is THE database system for power users.

That being said, I would choose awk for the particular task. But you should be at least somewhat comfortable with shell then.

Džuris
9

If your data format is well-known (CSV or other character delimited file, JSON, etc.) you can usually find a general-purpose command-line tool to help query it.

  • xsv is a popular one for comma-separated/space-separated data
  • jq is popular for JSON data (download available here)

xsv can chew through hundreds of MB per second depending on your hardware and the type of query.
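
For example, selecting and reordering three columns from a comma-separated file and switching the delimiter to colons might look something like the line below. Treat the exact flags as an assumption from memory and check xsv select --help and xsv fmt --help (xsv numbers columns from 1 and treats the first row as a header unless told otherwise with -n):

xsv select -n 5,1,4 input.txt | xsv fmt -t ':' > output.txt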

4

Lots of good advice from elsewhere about the mechanics of data extraction; however, you are going to need some dirty coding skills to do anything useful with it.

Large data sets often contain corrupt lines, loopy data, strange characters, the letter O instead of zeroes, and all manner of formatting glitches. You need to validate and filter what you've got. (An example: split a file into two, then join them again. There may well be the most subtle of flaws at the join. Possibly all normal lines end in CRLF, but at the join the end of line is just CR. This can go unnoticed or even cause the read-in to assume end of file!) As a minimum I would make sure that you're outputting exactly the same number of lines as you read.

Still on line-by-line processing, it's very simple, and worthwhile, to add very basic sanity checking to the data. Even if a field isn't being output, if it's easy to check then do it, because it could indicate some more subtle trouble. Be aware that actual data may not conform to the official specs. Why does a price of -1 sometimes appear? An especially useful field to check is one that should always have something in it, or the last field in each row.

Log your processing somewhere. That way you can set the process running and go to lunch, and you have a record of which version of your program was used to create which outputs. Of course you're looking for '...lines rejected: 0' all the time.

Bad source lines should be written to a failure file (but quit after 15 lines). You can visually examine a small amount of data to see what sort of weirdness you've got.

It may well be that inside the loop that processes each line you have to apply filters. This may not happen on the first pass, but as downstream analysis progresses you may be asked to produce a more selective data-set, e.g. exclude lines for products with 'test' in the name or a product code starting with 9.

An often missed validation issue is missing or duplicated data. For example, somehow Friday's raw data has been added to the end of Thursday's, and Friday's is actually from the week before. How will anybody know? The network failed from 3pm to 5pm, so nothing was recorded. Monday was a bank holiday when there shouldn't be any transactions, but somebody has supplied data from the previous Monday. You are in a good position to do some simple sums, for example daily turnover or maximum period of no activity, etc. These are bulk sanity checks used to give a human pause for thought and prompt checking before poisoned data is passed further down the chain. It's probably not your job to decide what to do with a loopy batch, but you can highlight it and probably tweak your code to give a better data-set.
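
As a concrete illustration of the points above, here is a minimal Python sketch of that kind of defensive pass: count lines in and out, reject lines that fail a basic field check, write the first few rejects to a failure file and log the totals. The expected column count, the file names and the columns being kept are placeholders, not anything taken from your data.

import sys

EXPECTED_COLUMNS = 24      # placeholder: whatever your files are supposed to contain
MAX_BAD_LINES = 15         # stop writing rejects to the failure file after this many
KEEP = (4, 0, 3)           # placeholder column choice, reordered

lines_in = lines_out = rejected = 0

with open(sys.argv[1], 'r') as fi, \
     open('output.txt', 'w') as fo, \
     open('failures.txt', 'w') as fbad:
    for line in fi:
        lines_in += 1
        columns = line.rstrip('\n').split(',')
        # basic sanity checks: right number of fields, last field not empty
        if len(columns) != EXPECTED_COLUMNS or not columns[-1].strip():
            rejected += 1
            if rejected <= MAX_BAD_LINES:
                fbad.write(line)
            continue
        fo.write(':'.join(columns[c] for c in KEEP) + '\n')
        lines_out += 1

# the line you want to see in the log is "lines rejected: 0"
print(f"lines read: {lines_in}, written: {lines_out}, rejected: {rejected}", file=sys.stderr)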

All the above is 'easy', one step at a time programming. You'll learn about automation, tidy workflows, loopy formatting and basic data anomalies. You'll also be a bit of an expert on spotting unusual data and what the fields are supposed to mean. That will be useful for...

Doing something useful with the data. You should be involved with the downstream analysis. This is not to suggest you should build analysis into your translation program, but you've got a framework ready to do it. Totals, averages, max and min, hourly, daily, weekly are all possible easy (NB automated) outputs. You might think a database is a better tool, but for fiddly things simple coding may be better. Let me give an example: smooth a set of data points. An easy moving average is nextPoint = (lastPoint * 0.8) + (rawValue * 0.2) [adjust 0.8 and 0.2 to suit]. That's fine for continuous data, but what about the start of business each day? That's a special case where nextPoint = rawValue. Something to code, perhaps.
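
A short sketch of that smoothing rule, using the 0.8/0.2 weights and the start-of-business reset described above (the (day, value) pairs are assumed to have been parsed already):

def smooth(readings, old_weight=0.8, new_weight=0.2):
    """Exponentially smooth (day, raw_value) pairs, restarting at each new day."""
    smoothed = []
    last_point = None
    last_day = None
    for day, raw_value in readings:
        if day != last_day:          # start of business each day: special case
            next_point = raw_value
        else:
            next_point = last_point * old_weight + raw_value * new_weight
        smoothed.append((day, next_point))
        last_point, last_day = next_point, day
    return smoothed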

Spurious data values are a good example of the cross-over between raw data crunching and analysis. When somebody punched in £175 when they meant £1.75, do we really want to include that in our analysis? It's a bit of an art, or fudge, but the raw data processor can easily calculate a mean and standard deviation for a couple of thousand data points, or an actual distribution for all rows of data. You /might/ want to throw out, mark, highlight or otherwise draw attention to unexpected values at the data crunching stage, or use it to inform the analysis stage. Perhaps add another column with a blank for OK and 'H' for higher than expected, and so on.
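
A sketch of that flagging idea, assuming the values of interest are already collected in a list; the two-standard-deviation threshold is an arbitrary choice:

from statistics import mean, stdev

def flag_high(values, threshold=2.0):
    """Return a mark per value: '' for OK, 'H' for higher than expected."""
    m, sd = mean(values), stdev(values)
    return ['H' if v > m + threshold * sd else '' for v in values]

The marks can then be appended as an extra column when the rows are written back out.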

You will become a skilled craftsman, able to turn a huge tree into useful planks from start to finish. You'll learn who wants what sort of planks for what purpose and be able to saw up the raw wood in the right way to avoid splits and shakes. Moreover if you spot a diseased tree you can raise the alarm.

Peter Fox
2

I don't use Macs, but assuming the standard UNIX tools are available, then

  1. Open a terminal
  2. type in cd and drag in the folder containing the files
  3. type in awk -v FS=, -v OFS=, '{ print $3, $2, $5 }' < source.txt > dest.txt

Note the space after cd and the use of single quotes with the awk command. For both commands, the case (upper/lower) is important.

This will read in all of the CSV file source.txt, and print the third, second and fifth columns as a CSV file, dest.txt

AWK is primarily designed to operate on text files that contain data in columns. The -v options set the input (FS) and output (OFS) column separators to a comma.
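
If you want the colon-delimited output described in the question, changing the output separator should be all that is needed; a variant of the same command:

awk -v FS=, -v OFS=: '{ print $3, $2, $5 }' < source.txt > dest.txt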

Should you get stuck, we have a Mac-specific forum, Ask Different, which may be better suited to your question.

CSM
1

As always, you have two ways to choose from.

A. Make a fairly complex program (multiple choices of separators, multiple choices of data handling etc.)

B. Use a "Unix toolbox" approach. Learn some tools you have available in every Unix or Unix-like system (macOS was once certified Unix system, you should have everything already at your disposition). There is some effort needed, but it's worth it.

If you would like to follow way A, there are plenty of languages to choose from. For this purpose perl would be great, with python being the trendy option now, but there are many more...

If you would like to follow way B, I would suggest a calm start and some patience. I love a book called Classic Shell Scripting; I even use it in courses I teach...

I guess you need one step backwards. Before your step 1 you need a kind of step 0:

  0. Avoid unnecessary mess. You can convert the files to a common delimiter, if that's possible. Yes, this can be the first command in a pipe.

Then you can use more commands in the pipe (cut might be a good option, or awk - it would even be able to select columns, change their order and set the proper delimiter in one step).

d.c.
0

If you want to use the built-in commands in Windows, rather than downloading something else like awk, Python or Perl (which are much better), then you could always just use the Windows Command Prompt:

https://ss64.com/nt/for_f.html

Again, I'm not suggesting this as the "best" tool but if you have to distribute your solution or don't have the ability to download third party software, it may be your best bet.

throx
0

If you don't mind throwing a few bucks at the solution, EmEditor can handle extremely large text files in a Notepad++-like interface. I used it at an eDiscovery firm that had to work with very large datasets, and it was user-friendly enough for the paralegals to use.

0

All this can be done easily in GS-Base. GS-Base is a database that handles up to 256 million rows and should be very efficient with very large CSV files (e.g. a 10 GB file in about 2 minutes). Data can be displayed in tables or forms. You can re-order columns with a simple drag-and-drop, and for further analysis (field calculations, pivot tables, filtering and much more) GS-Base can use up to 100 processor cores. The installation requires around 7 MB and it can be installed on a portable USB storage device. Not free, but the price is rather hard to beat… ( https://citadel5.com/gs-base.htm )