Hello all. I am working on a personal NLP/NLU project using the nps_chat corpus. I am identifying all the questions asked and then doing some further analysis.
It is a rather large data set and is formatted as such:
Data columns (total 4 columns):
 #   Column               Dtype 
---  ------               ----- 
 0   episode              int64 
 1   episode_order        int64 
 2   speaker              object
 3   utterance            object
dtypes: int64(2), object(2)
For each episode, there is a series of utterances by speakers, ordered by the episode_order column.
I've sentence-tokenized each utterance and identified any questions in it. These questions are stored in a fifth column called 'questions' as a list. Most rows have an empty list []; the rest range from a list with one question to a list of multiple questions asked in series.
What I am trying to solve: I'd like to lengthen the data frame wherever an utterance contains multiple questions. At each row with more than one question, I'd like to:
- leave only the first question asked in the original row
- add additional rows below the original, each containing one of the remaining questions in the list. Each new row is a copy of all columns of the original row, except that its 'questions' column contains the next question.
(Credit to the user below who answered.) Here is what I am trying to achieve:
import pandas as pd
df = pd.DataFrame(
    {
        "episodes": [1, 2],
        "utterance": ["hey", "ho"],
        "questions": [["Where?", "Who?"], ["What?", "When?"]],
    }
)
df
>>>
    episodes    utterance   questions
0   1           hey         [Where?, Who?]
1   2           ho          [What?, When?]
Desired output:
    episodes    utterance   questions
0   1           hey         Where?
0   1           hey         Who?
1   2           ho          What?
1   2           ho          When?
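For what it's worth, pandas' built-in `DataFrame.explode` (available since pandas 0.25) seems to produce exactly this shape in one vectorized call, without any apply/lambda or per-episode looping. A minimal sketch on the toy frame above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "episodes": [1, 2],
        "utterance": ["hey", "ho"],
        "questions": [["Where?", "Who?"], ["What?", "When?"]],
    }
)

# explode() turns each element of a list-valued column into its own row,
# copying every other column and repeating the original index values
out = df.explode("questions")
print(out)
#    episodes utterance questions
# 0         1       hey    Where?
# 0         1       hey      Who?
# 1         2        ho     What?
# 1         2        ho     When?
```

One caveat to check against the real data: rows whose 'questions' list is empty are kept by explode as a single row with NaN in that column, so they may need to be filtered out (or kept, depending on the analysis) afterwards.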
What is the best approach for this? I am trying to think through an apply/lambda solution. I've also thought about going through the data frame episode by episode: carve out a whole episode, pass it into a function that lengthens it as described, and append the result to a new data frame. There are 3M rows in this data set, though, so that could take a while.
Any advice is appreciated. Thanks!