So I have some data files that look like this:
      47
   425   425  -3 15000 15000 900   385   315   3   370   330   2   340   330   2
   325   315   2   325   240   2   340   225   2   370   225   2   385   240   2
   385   315   2   475   240   3   460   240   2   460   255   2   475   255   2
   475   240   2   595   315   3   580   330   2   550   330   2   535   315   2
   535   240   2   550   225   2   580   225   2   595   240   2   595   315   2
   700   315   3   685   330   2   655   330   2   640   315   2   640   240   2
   655   225   2   685   225   2   700   240   2   700   315   2   700   315   3
  9076   456   2  9102   449   2  9127   443   2  9152   437   2  9178   433   2
  9203   430   2  9229   428   2  9254   427   2  9280   425   2  9305   425   2
     0     0 999  6865    259999
      20
   425   425  -3 15000 15000 900   385   315   3   370   330   2   340   330   2
   325   315   2   325   240   2   340   225   2   370   225   2   385   240   2
   385   315   2   475   240   3   460   240   2   460   255   2   475   255   2
   475   240   2   595   315   3   580   330   2   550   330   2   535   315   2
The first number is the number of points in the block that follows, and that block then contains that many points, with up to 5 points per line. Each point has 3 components (I'll call them x, y, z). x and y get 6 characters each, while z gets 4, so each point takes 16 characters. Occasionally z is 9999, leaving no space between y and z, so using split() mis-parses those lines. All the numbers are integers (no decimals), but some are negative.
In the actual file the blocks are generally 1000 points long, with some blocks being smaller (at the end of a "page", where page breaks are denoted by z = 9999).
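For example (made-up values, but the same widths), here is how split() trips over a point whose z is 9999, while fixed-width slicing still recovers all three fields:
# made-up line in the same fixed-width layout: two points, the first with z = 9999
line = "  9305   4259999  9330   430   2"
print(line.split())        # ['9305', '4259999', '9330', '430', '2'] -- y and z of the first point merge
chunk = line[0:16]         # fixed-width slicing doesn't care about the missing separator
print((chunk[0:6], chunk[6:12], chunk[12:16]))   # ('  9305', '   425', '9999')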
My initial solution was to use regex:
import re
def get_points_regex(filename):
    with open(filename, 'r') as f:
        text = f.read()
    points = []
    # each point is three fixed-width fields: 6 + 6 + 4 characters, possibly negative
    for m in re.finditer(r'([ \d-]{6})([ \d-]{6})([ \d-]{4})', text):
        point = tuple(int(i) for i in m.groups())
        points.append(point)
    return points
My test file is 55283 lines long (4.4 MB) and contains 274761 points.
Using timeit on get_points_regex I get 560 ms.
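(All the timings quoted here were taken the same way, roughly like this in the notebook:)
%timeit get_points_regex(filename)    # filename -> the 4.4 MB test file; ~560 ms per loop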
I then figured that while finditer is memory efficient, generating thousands of match objects is slow when I don't need any of their features, so I made a version using re.findall:
def get_points_regex2(filename):
    with open(filename, 'r') as f:
        text = f.read()
    points = re.findall(r'([ \d-]{6})([ \d-]{6})([ \d-]{4})', text)
    points = [tuple(map(int, point)) for point in points]
    return points
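To make the difference concrete, here are both APIs on two points taken from the sample data above:
import re

sample = "   385   315   3   370   330   2"          # two points from the sample block above
pattern = r'([ \d-]{6})([ \d-]{6})([ \d-]{4})'
print(next(re.finditer(pattern, sample)).groups())   # ('   385', '   315', '   3') -- one Match object per point
print(re.findall(pattern, sample))                   # [('   385', '   315', '   3'), ('   370', '   330', '   2')]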
This version runs in 414 ms, 1.35x faster than finditer.
Then I was thinking that for such simple patterns regex might be overkill, so I made a version in pure Python:
def get_points_simple(filename):
    points = []
    with open(filename, 'r') as f:
        for line in f:
            # each line holds up to 5 points of 16 characters each
            n_chunks = int(len(line)/16)
            for i in range(n_chunks):
                chunk = line[16*i:16*(i+1)]
                x = int(chunk[0:6])
                y = int(chunk[6:12])
                z = int(chunk[12:16])
                points.append((x, y, z))
    return points
This runs in 386 ms, 1.07x faster than regex.
Then I broke down and tried Cython for the first time. I'm just running it with the %%cython cell magic in a Jupyter notebook. I came up with this:
%%cython
def get_points_cython(filename):
    cdef int i, x, y, z
    points = []
    f = open(filename, 'r')
    for line in f:
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = int(chunk[0:6])
            y = int(chunk[6:12])
            z = int(chunk[12:16])
            points.append((x, y, z))
    f.close()
    return points
The Cython function runs in 196 ms (2x faster than the pure Python version).
I tried to simplify some expressions, like not using a context manager for file opening. While I declared the integers, I wasn't sure what else to do, so I left the rest alone. I made a couple of attempts at using a 2D integer array instead of a list of tuples for points, but Python segfaulted (I'm assuming that's what happened; the IPython kernel died). I had cdef int points[1000000][3] and then assigned with statements like points[j][1] = x while incrementing j (a rough reconstruction is below). From some light reading and very little C background, I think that might be a rather large array? Stack vs. heap (I don't know what these really are)? Do I need things like malloc? I'm a bit lost on that stuff.
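Reconstructed from memory, so the details may be slightly off, the crashing version looked about like this:
%%cython
def get_points_cython_carray(filename):
    # reconstruction of the attempt that killed the kernel
    cdef int i, j, x, y, z
    cdef int points[1000000][3]   # ~12 MB as a plain C array local to the function
    j = 0
    f = open(filename, 'r')
    for line in f:
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = int(chunk[0:6])
            y = int(chunk[6:12])
            z = int(chunk[12:16])
            points[j][0] = x
            points[j][1] = y
            points[j][2] = z
            j = j + 1
    f.close()
    # never got as far as deciding how to hand the array back to Python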
Next, I had read that maybe I should just use NumPy, since Cython is good at that. Following this I was able to create this function:
%%cython
import numpy as np
cimport numpy as np
DTYPE = np.int_   # np.int_ is the NumPy type matching the np.int_t typedef below
ctypedef np.int_t DTYPE_t
def get_points_cython_numpy(filename):
    cdef int i, j, x, y, z
    cdef np.ndarray points = np.zeros([1000000, 3], dtype=DTYPE)
    f = open(filename, 'r')
    j = 0
    for line in f:
        n_chunks = int(len(line)/16)
        for i in range(n_chunks):
            chunk = line[16*i:16*(i+1)]
            x = int(chunk[0:6])
            y = int(chunk[6:12])
            z = int(chunk[12:16])
            points[j, 0] = x
            points[j, 1] = y
            points[j, 2] = z
            j = j + 1
    f.close()
    return points
Unfortunately this takes 263 ms, so it's a little slower.
Am I missing something obvious with Cython or the Python standard library that would make parsing this any faster, or is this about as fast as it gets for a file of this size?
I thought about the pandas and NumPy loading functions, but I figured the block-count rows would complicate things too much. At one point I just about had something working with pandas read_fwf followed by DataFrame.values.reshape(-1, 3) and then dropping the rows with NaNs (a rough reconstruction is below), but I knew that had to be slower by that point.
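That abandoned attempt was roughly along these lines (reconstructed, and never cleaned up to filter the count and terminator rows):
import pandas as pd

def get_points_pandas(filename):
    # read each line as 5 fixed-width (x, y, z) triples, flatten, drop incomplete rows
    df = pd.read_fwf(filename, widths=[6, 6, 4] * 5, header=None)
    arr = df.values.reshape(-1, 3)
    return arr[~pd.isnull(arr).any(axis=1)]    # rows with NaN come from short lines and headers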
Any ideas to speed this up would be greatly appreciated!
I'd love to get this below 100 ms so that a GUI can be updated rapidly as these files are generated (move slider > run background analysis > load data > plot results in real time).