I am using Jupyter Notebook to run some basic natural language processing on multiple text files. I am using two .ipynb files. One, which I am calling the "shell", reads in the files. It calls the second .ipynb (the core program), which runs the NLP.
(As you can tell, I am very much a beginner at this. I recognize that Jupyter Notebook is not ideal for this, but it is the current setup I'm using.)
The core file results in this:
return {'Cor':numCor, 'Sub':numSub, 'Ins':numIns, 'Del':numDel}
I have ten txt files I want to run the core NLP program on, and I want to end up with a dataframe with columns: 1) Filename (extracted from the name of the txt file), 2) Cor, 3) Sub, 4) Ins, and 5) Del. The integer results will populate the rows.
Each time I run the core:
z=wer(y,x)
it produces this:
{'Cor': 8, 'Sub': 0, 'Ins': 0, 'Del': 52}
But it produces it in this form:
    0
Cor 8
Sub 0
Ins 0
Del 52
I need to try to transpose it, so I did this:
df2=pd.Series(z).to_frame()
df2.reset_index()
df = df2.T 
Which produces this:
    Cor Sub Ins Del
0   8   0   0   52
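As a side note, the same one-row layout can probably be reached without the Series and transpose steps by wrapping the dict in a list. This is only a minimal sketch using the values shown above; `one_row` and `transposed` are names made up for the comparison.

```python
import pandas as pd

# Dict in the shape returned by wer(), with the values from the example above.
z = {'Cor': 8, 'Sub': 0, 'Ins': 0, 'Del': 52}

# A list containing one dict becomes a frame with one row and one column per key.
one_row = pd.DataFrame([z])

# The Series -> frame -> transpose route, for comparison.
transposed = pd.Series(z).to_frame().T

print(one_row)
print(transposed)
```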
So far so good (I think). I want to use this sort of command to append the results in a loop, where it adds a row for each of the 10 text files:
 orf += [{'Cor': df.Cor, 'Sub': df.Sub, 'Ins': df.Ins}]
'orf' is capturing from the dataframe, and I think that is part of my problem. Here are the results from the first two text files -- when it appends from the dataframe it's also taking the metadata (not sure that's the correct term) such as data type:
[{'filename': '/Users/jeannehsinclair/COVFEFE/miscues_ORF/anton/716_Anton_test.txt',
  'Cor': 0    52
  Name: Cor, dtype: int64,
  'Sub': 0    3
  Name: Sub, dtype: int64,
  'Ins': 0    0
  Name: Ins, dtype: int64,
  'Del': 0    5
  Name: Del, dtype: int64},
 {'filename': '/Users/jeannehsinclair/COVFEFE/miscues_ORF/anton/936_Anton.txt',
  'Cor': 0    60
  Name: Cor, dtype: int64,
  'Sub': 0    0
  Name: Sub, dtype: int64,
  'Ins': 0    0
  Name: Ins, dtype: int64,
  'Del': 0    0
  Name: Del, dtype: int64},
I want to convert it back to a dataframe. The problem is that when I convert to a dataframe, I get this (only included 3 variables here for ease of formatting):
    Cor                             Ins                         Sub
0   0 52 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 3 Name: Sub, dtype: int64
1   0 60 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
2   0 60 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
3   0 59 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 1 Name: Sub, dtype: int64
4   0 60 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
5   0 59 Name: Cor, dtype: int64    0 0 Name: Ins, dtype: int64 0 0 Name: Sub, dtype: int64
I don't want all the strings that are printed there. I just want the second integer in each cell. For example, for the first row, I just want the cells to hold 52 (Cor), 3 (Sub), 0 (Ins), and 5 (Del).
What I am looking for is help with streamlining the appending process. I imagine there is a good way to do this without converting to a dataframe twice.
Ultimately I need a dataframe that looks like this:
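From what I can tell, the extra text appears because `df.Cor` is a whole pandas Series (index, name, and dtype included) rather than a single number. A minimal sketch of pulling out just the value with `.iloc[0]`, assuming `df` is the one-row frame built by the transpose step; the values are copied from the first file's output, and `row` is a name made up for illustration.

```python
import pandas as pd

# One-row frame like the one produced by the transpose step
# (values copied from the first file's output above).
z = {'Cor': 52, 'Sub': 3, 'Ins': 0, 'Del': 5}
df = pd.Series(z).to_frame().T

# df[col] is a Series; .iloc[0] selects its single value, and int()
# converts the NumPy integer to a plain Python int.
row = {col: int(df[col].iloc[0]) for col in df.columns}
print(row)
```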
    Cor Sub Ins Del Filename
1   8   0   1   52  File1
2   6   0   0   52  File2
3   2   2   1   52  File3
4   1   3   0   52  File4
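To make the target concrete, here is a rough sketch of one way that table might be built: collect one plain dict per file in a list, then convert to a dataframe once at the end. The `wer` below is a hypothetical stand-in that only mimics the shape of the real return value, and the `inputs` mapping, the paths in it, and the names `rows` and `result` are all assumptions for illustration.

```python
import os
import pandas as pd

def wer(ref, hyp):
    # Hypothetical stand-in for the real core function: it returns a dict
    # with the same keys but does no actual scoring.
    return {'Cor': len(ref), 'Sub': 0, 'Ins': 0, 'Del': len(hyp)}

# Hypothetical inputs: file path -> the (y, x) arguments passed to wer().
inputs = {
    'data/File1.txt': ('abc', 'de'),
    'data/File2.txt': ('abcd', 'e'),
}

rows = []
for path, (y, x) in inputs.items():
    z = wer(y, x)  # plain dict of ints, no Series involved
    # Keep only the file name, without directory or extension.
    name = os.path.splitext(os.path.basename(path))[0]
    rows.append({**z, 'Filename': name})

# Convert once, after the loop; `columns` fixes the column order.
result = pd.DataFrame(rows, columns=['Cor', 'Sub', 'Ins', 'Del', 'Filename'])
print(result)
```

Because the dict returned by `wer` already holds plain integers, nothing needs to pass through a Series or an intermediate dataframe inside the loop.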
Thank you in advance for any advice you could offer!
