I have been using the FUNSD dataset for sequence labeling in unstructured documents, following the paper LayoutLM: Pre-training of Text and Layout for Document Image Understanding. After cleaning the data and converting it from a dict to a dataframe, it looks like this:
 The dataset is laid out as follows:
- The column 'id' is the unique identifier for each word group inside a document; the word group's text is shown in the column 'text' (like nodes).
- The column 'label' identifies whether the word group is classified as a 'question' or an 'answer'.
- The column 'linking' denotes the word groups that are linked (like edges), connecting corresponding 'questions' to 'answers'.
- The column 'box' denotes the location coordinates (x, y of the top left; x, y of the bottom right) of the word group, relative to the top-left corner (0, 0).
- The column 'words' holds each individual word inside the word group, along with its location (box).
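For concreteness, here is a minimal sketch of that layout with two made-up rows. The values are invented for illustration, not copied from FUNSD, and I am assuming each entry in 'linking' is a pair of word-group ids and each entry in 'words' is a dict with 'text' and 'box' keys.

```python
import pandas as pd

# Hypothetical toy rows mimicking the layout described above.
# All values are invented; only the column names come from my dataframe.
df = pd.DataFrame(
    {
        "id": [0, 1],
        "text": ["Date:", "12 May"],
        "label": ["question", "answer"],
        # [x_top_left, y_top_left, x_bottom_right, y_bottom_right]
        "box": [[10, 10, 50, 20], [60, 10, 120, 20]],
        # assumed: each link is a pair of word-group ids
        "linking": [[[0, 1]], [[0, 1]]],
        # assumed: one dict per word, holding its text and its own box
        "words": [
            [{"text": "Date:", "box": [10, 10, 50, 20]}],
            [
                {"text": "12", "box": [60, 10, 80, 20]},
                {"text": "May", "box": [85, 10, 120, 20]},
            ],
        ],
    }
)

print(df[["id", "text", "label", "linking"]])
```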
I aim to train a Graph Neural Network (GNN) classifier to identify which words in the column 'words' are linked together. The first step is to transform my current dataset into a network (graph). My questions are as follows:
- Is there a way to break each row of the column 'words' into two columns, [box_word, text_word], with one word per row, while replicating the other columns that stay the same, [id, label, text, box], resulting in a final dataframe with these columns: [box, text, label, box_word, text_word]?
- I can tokenize the columns 'text' and 'text_word', one-hot encode the column 'label', and split the multi-valued numeric columns 'box' and 'box_word' into individual columns, but how do I split up or rearrange the column 'linking' to define the edges of my network graph?
- Am I taking the correct route in using the dataframe to generate a network and then using it to train a GNN?
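To make the first two questions concrete, here is a rough sketch of the kind of transformation I have in mind, using pandas on invented toy rows. It assumes each entry in 'words' is a dict with 'text' and 'box' keys and each entry in 'linking' is a pair of word-group ids. The names `words_df` and `edge_df` are placeholders, and I keep 'id' in the per-word table so rows can be matched against the edges. This may well not be the best approach.

```python
import pandas as pd

# Toy rows with invented values, only to make the sketch runnable.
df = pd.DataFrame(
    {
        "id": [0, 1],
        "text": ["Date:", "12 May"],
        "label": ["question", "answer"],
        "box": [[10, 10, 50, 20], [60, 10, 120, 20]],
        "linking": [[[0, 1]], [[0, 1]]],
        "words": [
            [{"text": "Date:", "box": [10, 10, 50, 20]}],
            [
                {"text": "12", "box": [60, 10, 80, 20]},
                {"text": "May", "box": [85, 10, 120, 20]},
            ],
        ],
    }
)

# Question 1: one row per word, with the word-group columns replicated.
words_df = df.explode("words", ignore_index=True)
words_df["text_word"] = words_df["words"].apply(lambda w: w["text"])
words_df["box_word"] = words_df["words"].apply(lambda w: w["box"])
words_df = words_df.drop(columns=["words", "linking"])

# Question 2: flatten 'linking' into a de-duplicated edge table.
# The same pair can appear on both linked rows, so a set removes repeats.
edge_set = {tuple(pair) for links in df["linking"] for pair in links}
edge_df = pd.DataFrame(sorted(edge_set), columns=["source", "target"])

print(words_df)
print(edge_df)
```

An edge table in this shape could then be handed to a graph library, for example `networkx.from_pandas_edgelist(edge_df, "source", "target")`, with the word groups in 'id' as nodes. I have not verified that this is the right input format for a GNN.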
Any and all help or tips are appreciated.
 
    