Context
In an earlier post, I asked about deserializing a 1.2GB JSON file.
An answer posted there does work, but it's extremely slow.
Sample data
So that you don't have to use a 1.2GB file, here's a small data example for use with this question. It's just the first few items from the original large JSON file.
example.json:
[{"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AMD230728C00115000", "exchange": 304, "id": null, "tape": null, "price": 0.38, "size": 1, "conditions": [227], "timestamp": 1690471217275, "sequence_number": 1477738810, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:AFRM230728C00019500", "exchange": 302, "id": null, "tape": null, "price": 0.07, "size": 10, "conditions": [209], "timestamp": 1690471217278, "sequence_number": 1477739110, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 325, "id": null, "tape": null, "price": 4.8, "size": 7, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519150, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519166, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 1, "conditions": [209], "timestamp": 1690471217282, "sequence_number": 341519167, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 319, "id": null, "tape": null, "price": 4.8, "size": 5, "conditions": [219], "timestamp": 1690471217282, "sequence_number": 341519170, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 312, "id": null, "tape": null, "price": 4.8, "size": 
19, "conditions": [209], "timestamp": 1690471217284, "sequence_number": 341519682, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 2, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519926, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:TSLA230804C00270000", "exchange": 301, "id": null, "tape": null, "price": 4.8, "size": 15, "conditions": [219], "timestamp": 1690471217290, "sequence_number": 341519927, "trf_id": null, "trf_timestamp": null}, {"py/object": "polygon.websocket.models.models.EquityTrade", "event_type": "T", "symbol": "O:META230728C00315000", "exchange": 302, "id": null, "tape": null, "price": 4.76, "size": 1, "conditions": [227], "timestamp": 1690471217323, "sequence_number": 1290750877, "trf_id": null, "trf_timestamp": null}]
Code
Here's (slow) code that works. It takes hours to run on the 1.2GB file.
$path = Convert-Path ".\example.json" # full path; .NET does not follow PowerShell's current location
$stream = [System.IO.File]::Open($path, [System.IO.FileMode]::Open)
$i = 0
$stream.ReadByte() | Out-Null # read '['
$i++
$json = ''
$data = @()
while ($i -lt $stream.Length)
{
    $byte = $stream.ReadByte(); $i++
    $char = [Convert]::ToChar($byte)

    if ($char -eq '}')
    {
        # end of one record (assumes flat records with no '}' inside string values)
        $json = $json + $char

        $data = $data + ($json | ConvertFrom-Json)
        $json = ''
        $stream.ReadByte() | Out-Null # read the comma (or the closing ']')
        $i++
        if ($data.Count % 100 -eq 0)
        {
            Write-Host $data.Count
        }
    }
    else
    {
        $json = $json + $char
    }
}
$stream.Close()
After running it, you should have the records in $data:
PS C:\Users\dharm\Dropbox\Documents\polygon-io.ps1> $data | ft *
py/object                                   event_type symbol                exchange id tape price size conditions     timestamp sequence_number trf_id trf_timestamp
---------                                   ---------- ------                -------- -- ---- ----- ---- ----------     --------- --------------- ------ -------------
polygon.websocket.models.models.EquityTrade T          O:AMD230728C00115000       304          0.38    1 {227}      1690471217275      1477738810
polygon.websocket.models.models.EquityTrade T          O:AFRM230728C00019500      302          0.07   10 {209}      1690471217278      1477739110
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      325           4.8    7 {219}      1690471217282       341519150
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8    1 {209}      1690471217282       341519166
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8    1 {209}      1690471217282       341519167
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      319           4.8    5 {219}      1690471217282       341519170
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      312           4.8   19 {209}      1690471217284       341519682
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301           4.8    2 {219}      1690471217290       341519926
polygon.websocket.models.models.EquityTrade T          O:TSLA230804C00270000      301           4.8   15 {219}      1690471217290       341519927
polygon.websocket.models.models.EquityTrade T          O:META230728C00315000      302          4.76    1 {227}      1690471217323      1290750877
Question
What's a good way to make this more efficient?
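For reference, here is one direction I could imagine, as a rough sketch that I have not timed on the 1.2GB file. It keeps the same record-by-record idea but avoids `+` on strings and arrays, each of which copies the whole value on every append. The function name `Read-JsonArrayItem` and the 64K buffer size are my own assumptions, not part of any module. It also tracks string state and brace depth, so a `}` inside a string value or a nested object should not end a record early.

```powershell
# Hypothetical helper: the name and buffer size are illustrative assumptions.
function Read-JsonArrayItem {
    param([Parameter(Mandatory)][string]$Path)

    # Convert-Path gives .NET a full path (it ignores PowerShell's current location).
    $reader = [System.IO.StreamReader]::new((Convert-Path -LiteralPath $Path))
    $sb     = [System.Text.StringBuilder]::new()                  # replaces $json + ...
    $items  = [System.Collections.Generic.List[object]]::new()    # replaces $data + ...
    $buffer = [char[]]::new(65536)
    $depth = 0; $inString = $false; $escaped = $false
    try {
        # Read the file in large chunks instead of one byte at a time.
        while (($n = $reader.Read($buffer, 0, $buffer.Length)) -gt 0) {
            for ($k = 0; $k -lt $n; $k++) {
                $c = $buffer[$k]
                if ($depth -gt 0) { [void]$sb.Append($c) }   # inside a record: keep the char
                if ($inString) {
                    # Braces inside string values must not change the depth.
                    if ($escaped) { $escaped = $false }
                    elseif ($c -eq '\') { $escaped = $true }
                    elseif ($c -eq '"') { $inString = $false }
                }
                elseif ($c -eq '"') { $inString = $true }
                elseif ($c -eq '{') {
                    if ($depth -eq 0) { [void]$sb.Append($c) }   # first char of a new record
                    $depth++
                }
                elseif ($c -eq '}') {
                    $depth--
                    if ($depth -eq 0) {
                        # One complete top-level object: parse it and reset the builder.
                        $items.Add(($sb.ToString() | ConvertFrom-Json))
                        [void]$sb.Clear()
                    }
                }
            }
        }
    }
    finally { $reader.Dispose() }
    return ,$items   # the comma returns the list itself rather than enumerating it
}

# Usage (assumes example.json is in the current directory):
# $data = Read-JsonArrayItem -Path .\example.json
# $data | ft *
```

This still calls `ConvertFrom-Json` once per record and loops over every character in PowerShell, so it may well remain slow on a file this size. It only removes the repeated copying.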
Notes
There is an answer that illustrates an approach for C# using Newtonsoft Json.NET.
Here's the code for it:
JsonSerializer serializer = new JsonSerializer();
MyObject o;
using (FileStream s = File.Open("bigfile.json", FileMode.Open))
using (StreamReader sr = new StreamReader(s))
using (JsonReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        // deserialize only when there's "{" character in the stream
        if (reader.TokenType == JsonToken.StartObject)
        {
            o = serializer.Deserialize<MyObject>(reader);
        }
    }
}
One approach would be to download the Newtonsoft Json.NET DLL, and convert the above to PowerShell. One challenge is this line:
o = serializer.Deserialize<MyObject>(reader);
As you can see, it's making a generic method call. It's not clear to me how this would be translated to Windows PowerShell 5.1.
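One possible way around the generic call, sketched under assumptions: `JsonSerializer` also has a non-generic overload, `Deserialize(JsonReader, Type)`, which PowerShell can call directly by passing a type literal. In the sketch below, the DLL location, the helper name `Read-JsonArrayWithJsonNet`, and the choice of `Dictionary[string,object]` as the target type (standing in for `MyObject`) are all my own assumptions.

```powershell
# Assumption: Newtonsoft.Json.dll has been downloaded next to the script (hypothetical path).
# Loading is skipped if the type is already available in the session.
if (-not ('Newtonsoft.Json.JsonSerializer' -as [type])) {
    Add-Type -Path (Convert-Path '.\Newtonsoft.Json.dll')
}

# Hypothetical helper name, for illustration only.
function Read-JsonArrayWithJsonNet {
    param([Parameter(Mandatory)][string]$Path)

    $serializer = [Newtonsoft.Json.JsonSerializer]::new()
    # Stand-in for MyObject: each record becomes a string-keyed dictionary.
    $itemType   = [System.Collections.Generic.Dictionary[string,object]]
    $items      = [System.Collections.Generic.List[object]]::new()
    $sr         = [System.IO.StreamReader]::new((Convert-Path -LiteralPath $Path))
    $reader     = [Newtonsoft.Json.JsonTextReader]::new($sr)
    try {
        while ($reader.Read()) {
            # Deserialize only when the reader is positioned at the start of an object.
            if ($reader.TokenType -eq [Newtonsoft.Json.JsonToken]::StartObject) {
                # Non-generic overload: Deserialize(JsonReader, Type), no <T> needed.
                $items.Add($serializer.Deserialize($reader, $itemType))
            }
        }
    }
    finally {
        $reader.Close()
        $sr.Dispose()
    }
    return ,$items   # return the list itself rather than enumerating it
}

# Usage (assumes example.json is in the current directory):
# $data = Read-JsonArrayWithJsonNet -Path .\example.json
# $data[0]['symbol']
```

With this target type the records come back as dictionaries rather than `PSCustomObject`s, so fields are read with `$item['symbol']`, and nested values such as `conditions` arrive as Json.NET token types.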
A solution that only depends on native JSON deserialization libraries would be preferred, but the Newtonsoft approach would be acceptable if necessary.
