I have a speed/efficiency-related question about Python:
I need to extract multiple fields from a nested JSON file. After writing them out, the .txt files have ~64k lines each, and the current snippet takes ~9 minutes. Each line can contain floats and strings.
Normally, I would just put all my data in numpy and use np.savetxt() to save it.
Instead, I have resorted to simply assembling the lines as strings, but this is rather slow. So far I'm doing the following:
- Assemble each line as a string (extracting the desired fields from the JSON)
- Write the string to the corresponding file
The problem with this approach:
- it leads to many separate file.write() calls, which are also very slow (around 64k × 8 calls, one per line for each of the 8 files)
So my questions are:
- What is a good routine for this kind of problem, one that balances speed against memory consumption for the most efficient writing to disk?
- Should I increase my DEFAULT_BUFFER_SIZE? (It's currently 8192.)
I have checked File I/O in Every Programming Language and the python.org docs on IO, but they didn't help much beyond confirming (as I understand it) that file I/O is already buffered in Python 3.6.x, and that my DEFAULT_BUFFER_SIZE is 8192.
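For what it's worth, here's a minimal sketch of what I mean by increasing the buffer size, assuming out.txt is just a placeholder path. As far as I know, the buffering argument to open() is the standard per-file way to override the default:

import io

print(io.DEFAULT_BUFFER_SIZE)  # 8192 on my machine

# Open with a ~1 MiB write buffer so data is flushed to disk in larger chunks:
with open('out.txt', 'w', encoding='utf-8', buffering=2**20) as f:
    for i in range(64000):
        f.write('line {}\n'.format(i))  # buffered; few actual disk writes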
Here's the relevant part of my snippet:
import json
import os
import re

from tqdm import tqdm_notebook

# strip_tags() and PATH_TO_RAW_DATA are defined elsewhere in my notebook.

def read_json_line(line=None):
    result = None
    try:
        result = json.loads(line)
    except Exception as e:
        # Pull the offending character index out of the error message,
        # blank out that character, and retry recursively:
        idx_to_replace = int(str(e).split(' ')[-1].replace(')', ''))
        new_line = list(line)
        new_line[idx_to_replace] = ' '
        new_line = ''.join(new_line)
        return read_json_line(line=new_line)
    return result
def extract_features_and_write(path_to_data, inp_filename, is_train=True):
    # Currently issues 8 file.write() calls per input line, which is probably
    # slow because each call carries its own overhead on the way to disk.
    features = ['meta_tags__twitter-data1', 'url', 'meta_tags__article-author',
                'domain', 'title', 'published__$date', 'content',
                'meta_tags__twitter-description']

    prefix = 'train' if is_train else 'test'

    feature_files = [open(os.path.join(path_to_data, '{}_{}.txt'.format(prefix, feat)),
                          'w', encoding='utf-8')
                     for feat in features]

    with open(os.path.join(PATH_TO_RAW_DATA, inp_filename),
              encoding='utf-8') as inp_json_file:
        for line in tqdm_notebook(inp_json_file):
            json_data = read_json_line(line)
            content = json_data['meta_tags']['twitter:data1'].replace('\n', ' ').replace('\r', ' ').split()[0]
            feature_files[0].write(content + '\n')
            content = json_data['url'].split('/')[-1].lower()
            feature_files[1].write(content + '\n')
            content = json_data['meta_tags']['article:author'].split('/')[-1].replace('@', '').lower()
            feature_files[2].write(content + '\n')
            content = json_data['domain']
            feature_files[3].write(content + '\n')
            content = json_data['title'].replace('\n', ' ').replace('\r', ' ').lower()
            feature_files[4].write(content + '\n')
            content = json_data['published']['$date']
            feature_files[5].write(content + '\n')
            content = json_data['content'].replace('\n', ' ').replace('\r', ' ')
            content = strip_tags(content).lower()
            content = re.sub(r'[^a-zA-Z0-9]', ' ', content)
            feature_files[6].write(content + '\n')
            content = json_data['meta_tags']['twitter:description'].replace('\n', ' ').replace('\r', ' ').lower()
            feature_files[7].write(content + '\n')

    for f in feature_files:
        f.close()
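For context, the alternative I've been considering is to batch: accumulate the lines per feature in a list and write each file with a single join + write call. A minimal sketch of the idea, assuming holding ~64k short lines per file in memory is acceptable (write_lines_once and the toy buffers data are my own placeholders, not part of the snippet above):

import os

def write_lines_once(path, lines):
    # One big write() per file instead of one write() per line:
    with open(path, 'w', encoding='utf-8') as f:
        f.write('\n'.join(lines) + '\n')

# Toy usage: buffers[feat] would be filled inside the main parsing loop above.
buffers = {'title': ['first title', 'second title'],
           'domain': ['example.com', 'example.org']}
for feat, lines in buffers.items():
    write_lines_once(os.path.join('.', 'train_{}.txt'.format(feat)), lines)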
 
     
    