I have an untidy DataFrame of Tweet objects. Two of its columns contain lists: hashtags and expanded_urls. I'm trying to follow tidy data principles by keeping only one value in each cell.
EDIT: This question was marked as a duplicate of this answer, which simply splits the list into more columns. That doesn't work for me because there could be a variable number of hashtags in 1 tweet.
Here's a sample of my tweet DataFrame:
-----------------------------------------------------------
tweet_id | hashtags       | expanded_urls
-----------------------------------------------------------
  123    | ['lol','bff']  | ['fakeurl.com']
  124    | []             | ['url1.io', 'url2.expanded.co']
There are two possible ways I could go about tidying this data.
1: Simply add new rows to the DataFrame with almost all row contents copied over:
---------------------------------------------
tweet_id | hashtag   | expanded_url
---------------------------------------------
  123    | 'lol'    | 'fakeurl.com'
  123    | 'bff'    | 'fakeurl.com'
  124    | ''       | 'url1.io'
  124    | ''       | 'url2.expanded.co'
I don't think this would be very efficient, especially because there would be many insert/append operations. However, having a single DataFrame to pass into a single scikit-learn model would make things very simple.
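For concreteness, here is a minimal sketch of what option 1 might look like without any manual insert/append calls. It assumes pandas 0.25 or later, where `DataFrame.explode` is available; the output column names `hashtag` and `expanded_url` are just the ones from my table above. Note that an empty list becomes NaN rather than `''`.

```python
import pandas as pd

# Sample data mirroring the table above.
df = pd.DataFrame({
    "tweet_id": [123, 124],
    "hashtags": [["lol", "bff"], []],
    "expanded_urls": [["fakeurl.com"], ["url1.io", "url2.expanded.co"]],
})

# Explode one list column at a time. Resetting the index in between
# keeps the index unique before the second explode.
tidy = (
    df.explode("hashtags")
      .reset_index(drop=True)
      .explode("expanded_urls")
      .reset_index(drop=True)
      .rename(columns={"hashtags": "hashtag", "expanded_urls": "expanded_url"})
)

print(tidy)
```

This yields one row per (hashtag, url) combination within each tweet, so a tweet with many hashtags and many URLs multiplies out quickly. If empty strings are preferred over NaN, `tidy.fillna("")` would be one way to get them.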
2: Create 2 new DataFrames:
The first would be hashtags with their corresponding tweet_ids:
------------------
tweet_id | hashtag
------------------
123      | `lol`
123      | `bff`
The other would be urls with their corresponding tweet_ids: 
------------------
tweet_id | url
------------------
123      | `fakeurl.com`
124      | `url1.io`
124      | `url2.expanded.co`
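A rough sketch of how the two tables in option 2 could be built, again assuming pandas 0.25+ for `DataFrame.explode`. The variable names `tweets`, `hashtags` and `urls` are only illustrative.

```python
import pandas as pd

# Sample data mirroring the original table.
tweets = pd.DataFrame({
    "tweet_id": [123, 124],
    "hashtags": [["lol", "bff"], []],
    "expanded_urls": [["fakeurl.com"], ["url1.io", "url2.expanded.co"]],
})

# One row per (tweet_id, hashtag); tweets with no hashtags drop out
# because explode turns an empty list into NaN.
hashtags = (
    tweets[["tweet_id", "hashtags"]]
    .explode("hashtags")
    .dropna(subset=["hashtags"])
    .rename(columns={"hashtags": "hashtag"})
    .reset_index(drop=True)
)

# One row per (tweet_id, url), built the same way.
urls = (
    tweets[["tweet_id", "expanded_urls"]]
    .explode("expanded_urls")
    .dropna(subset=["expanded_urls"])
    .rename(columns={"expanded_urls": "url"})
    .reset_index(drop=True)
)

print(hashtags)
print(urls)
```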
This seems cleaner, but I'm not entirely sure how I would modify the original DataFrame; would I just drop the corresponding columns and keep 3 separate tables? Is there a good way of merging these 3 DataFrames into 1, or would I have to do a separate lookup every time I wanted to know which hashtags are associated with a tweet?
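To illustrate what I mean by merging versus looking up, here is a sketch under the assumption that the three tables share a `tweet_id` key. The `tweets` frame here is a hypothetical stand-in for the original DataFrame with the two list columns dropped.

```python
import pandas as pd

# Hypothetical tweet-level table after dropping the list columns.
tweets = pd.DataFrame({"tweet_id": [123, 124]})
hashtags = pd.DataFrame({"tweet_id": [123, 123], "hashtag": ["lol", "bff"]})
urls = pd.DataFrame({
    "tweet_id": [123, 124, 124],
    "url": ["fakeurl.com", "url1.io", "url2.expanded.co"],
})

# Left joins keep tweets that have no hashtags or no URLs (as NaN).
merged = (
    tweets.merge(hashtags, on="tweet_id", how="left")
          .merge(urls, on="tweet_id", how="left")
)

# A lookup of the hashtags for a single tweet, without merging.
tags_for_123 = hashtags.loc[hashtags["tweet_id"] == 123, "hashtag"].tolist()

print(merged)
print(tags_for_123)
```

The merged result has the same shape as the table in option 1, so the separate tables could be kept as the stored form and joined only when a single flat DataFrame is needed.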
 
     
    