I have to parse a large (100+ MB) JSON file with the following format:
{
  "metadata": {
    "account_id": 1234
    // etc.
  },
  "transactions": [
    {
      "transaction_id": 1234,
      "amount": 2
    },
    // etc. for (potentially) 1000's of lines
  ]
}
The output of this parsing is a JSON array with the account_id appended to each of the transactions:
[
  {
    "account_id": 1234,
    "transaction_id": 1234,
    "amount": 2
  },
  // etc.
]
I'm using the stream-json library to avoid loading the whole file into memory at once. stream-json allows me to pick individual properties and then stream them one at a time, depending on whether they're an array or an object.
I'm also trying to avoid parsing the JSON twice, by piping the read of the JSON file into two separate pipelines, which is possible in Node.js.
I'm using a Transform stream to generate the output, setting a property on the Transform stream object that stores the account_id.
Pseudo code (with obvious race condition) below:
const fs = require('fs');
const { parser } = require('stream-json');
const { pick } = require('stream-json/filters/Pick');
const { streamArray } = require('stream-json/streamers/StreamArray');
const { streamObject } = require('stream-json/streamers/StreamObject');
const Chain = require('stream-chain');
const { Transform } = require('stream');

let createOutputObject = new Transform({
  writableObjectMode: true,
  readableObjectMode: true,
  transform(chunk, enc, next) {
    if (createOutputObject.account_id !== null) {
      // generate the output object
    } else {
      // Somehow store the chunk until we get the account_id...
    }
  }
});
createOutputObject.account_id = null;

let jsonRead = fs.createReadStream('myJSON.json');

// Pipeline 1: pull account_id out of the metadata object
let metadataPipeline = new Chain([
  jsonRead,
  parser(),
  pick({filter: 'metadata'}),
  streamObject(),
]);
metadataPipeline.on('data', data => {
  if (data.key === 'account_id') {
    createOutputObject.account_id = data.value;
  }
});

// Pipeline 2: stream the transactions array and build the output
let generatorPipeline = new Chain([
  jsonRead, // Note: same Readable stream as above
  parser(),
  pick({filter: 'transactions'}),
  streamArray(),
  createOutputObject,
  transformToJSONArray(), // pseudo-code: serializes the objects back into a JSON array
  fs.createWriteStream('myOutput.json')
]);
To resolve this race condition (i.e. converting to a JSON array before account_id is set), I've tried:
- Using createOutputObject.cork() to hold data up until account_id is set.
  - The data just passes through to transformToJSONArray().
- Keeping the chunks in an array in createOutputObject until account_id is set (see the sketch after this list).
  - Can't figure out how to re-add the stored chunks after account_id is set.
- Using setImmediate() and process.nextTick() to call createOutputObject.transform later on, hoping that account_id is set by then.
  - Overloaded the stack so that nothing could get done.
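For reference, here is a minimal sketch of the buffering idea from the second bullet: hold incoming chunks in an array until account_id arrives, then re-emit them. The class name, the setAccountId() helper, and the internal buffered array are my own inventions for illustration, not part of my actual code:

const { Transform } = require('stream');

// Sketch only: a Transform that buffers chunks until account_id is known.
class OutputObjectStream extends Transform {
  constructor() {
    super({ writableObjectMode: true, readableObjectMode: true });
    this.account_id = null;
    this.buffered = []; // chunks received before account_id arrived
  }

  setAccountId(id) {
    this.account_id = id;
    // Re-emit everything that was held back, now that account_id is known.
    for (const chunk of this.buffered) {
      this.push(this._decorate(chunk));
    }
    this.buffered = [];
  }

  _decorate(chunk) {
    // streamArray() emits { key, value } pairs; merge account_id into the value.
    return { account_id: this.account_id, ...chunk.value };
  }

  _transform(chunk, enc, next) {
    if (this.account_id !== null) {
      this.push(this._decorate(chunk));
    } else {
      this.buffered.push(chunk); // hold until setAccountId() is called
    }
    next();
  }
}

The metadata pipeline would then call createOutputObject.setAccountId(data.value) instead of assigning the property directly, but this still leaves the question of what happens if the transactions finish before the metadata arrives.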
I've considered using stream-json's streamValues function, which would allow me to do a pick of metadata and transactions. But the documentation leads me to believe that all of transactions would be loaded into memory, which is what I'm trying to avoid:
As every streamer, it assumes that individual objects can fit in memory, but the whole file, or any other source, should be streamed.
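For completeness, my understanding of the streamValues approach is sketched below (the regular-expression filter and the wiring are my assumptions based on the stream-json docs, not tested code); each picked subtree arrives as a single fully assembled value, so the whole transactions array would end up in memory at once:

const fs = require('fs');
const { parser } = require('stream-json');
const { pick } = require('stream-json/filters/Pick');
const { streamValues } = require('stream-json/streamers/StreamValues');
const Chain = require('stream-chain');

// Sketch of the streamValues alternative.
const pipeline = new Chain([
  fs.createReadStream('myJSON.json'),
  parser(),
  pick({ filter: /^(metadata|transactions)$/ }), // pick both top-level properties
  streamValues(),
]);

pipeline.on('data', data => {
  // data.value is the entire metadata object, or the ENTIRE transactions array,
  // assembled in memory -- which is exactly what I'm trying to avoid.
});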
Is there something else that can resolve this race condition? Is there any way I can avoid parsing this large JSON stream twice?