Problem: df.assign in pandas produces NaN values where it shouldn't when I work with dates, and I want to know why.
I have a pandas dataframe of tweets. Here are the columns:
id_str            object
coordinates       object
created_at        object   ***
text              object
user              object
favorite_count     int64
retweet_count      int64
username          object
clean_text        object
The time of the tweet is in the column created_at. Although its dtype is object, the values are consistently formatted, e.g. Mon Oct 16 23:58:55 +0000 2017.
Now, I want to make a new column created_date that contains only the date (year, month, day). This ended up being my solution, inspired by this question and this question.
df = df.assign(created_date = pd.Series([x.date() for x in pd.to_datetime(df['created_at'])]))
# Bonus question: how efficient is this code in terms of runtime and memory? If it isn't, why?
The pd.Series constructor is there because I thought it was necessary, and I'm leaving it in here in case it is what's causing the problem.
I later found out that 2764 of the 29984 tweets had NaN in created_date instead of the correct value (e.g. 2017-10-16, dtype 'O'). At first I thought the Series constructor was to blame, but the Series itself turned out to be correct; the NaN values only appear after df.assign.
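On the bonus question, one alternative to the list comprehension is the .dt.date accessor, which does the conversion without an explicit Python loop and returns a Series that keeps the original index. A minimal sketch on made-up rows (the sample frame and the explicit format string are assumptions for illustration, not my real data):

```python
import pandas as pd

# Hypothetical two-row sample; the real frame has 29984 rows.
sample = pd.DataFrame({
    "created_at": [
        "Mon Oct 16 23:58:55 +0000 2017",
        "Sun Oct 15 08:01:10 +0000 2017",
    ]
})

# Parse the Twitter-style timestamps; the format string is an assumption
# based on the example value shown above.
parsed = pd.to_datetime(sample["created_at"], format="%a %b %d %H:%M:%S %z %Y")

# .dt.date yields datetime.date objects (object dtype) with the same index
# as the source column, so no separate pd.Series wrapper is needed.
sample = sample.assign(created_date=parsed.dt.date)
print(sample["created_date"].tolist())
```

Because the result shares the index of sample, assigning it back cannot misalign rows.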
I used the Counter from collections for checking.
I first checked if the Series I created was correct.
# pd.Series([x.date() for x in pd.to_datetime(df['created_at'])])
Counter({datetime.date(2017, 10, 9): 8165,
         datetime.date(2017, 10, 10): 5898,
         datetime.date(2017, 10, 11): 3104,
         datetime.date(2017, 10, 12): 2067,
         datetime.date(2017, 10, 13): 1647,
         datetime.date(2017, 10, 14): 2750,
         datetime.date(2017, 10, 15): 2778,
         datetime.date(2017, 10, 16): 3575})
Nothing wrong there; the counts add up to 29984, the total number of tweets. So I rechecked the newly assigned column, and that is where the problem shows up.
# Counter(df['created_date'])
Counter({datetime.date(2017, 10, 16): 3240,
         datetime.date(2017, 10, 15): 2413,
         datetime.date(2017, 10, 14): 2431,
         datetime.date(2017, 10, 13): 1369,
         datetime.date(2017, 10, 12): 1680,
         datetime.date(2017, 10, 11): 2736,
         datetime.date(2017, 10, 10): 5409,
         datetime.date(2017, 10, 9): 7942,
         nan: 2764})
I've fixed the problem by removing the pd.Series constructor, but now I want to know why this is happening, since I can't find anything similar. I apologize in advance if the only problem is that I wasn't thinking hard enough.
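One possible explanation, sketched on a made-up frame: pandas aligns a Series on index labels when it is assigned as a column, while a plain list is assigned by position. The sketch below assumes the DataFrame's index is not the default 0..n-1 (for example after rows were filtered out); whether that is true of the real df is an assumption, and df_demo and its values are hypothetical.

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as if rows 1, 3 and 4 were dropped.
df_demo = pd.DataFrame(
    {
        "created_at": [
            "Mon Oct 09 10:00:00 +0000 2017",
            "Tue Oct 10 11:00:00 +0000 2017",
            "Wed Oct 11 12:00:00 +0000 2017",
        ]
    },
    index=[0, 2, 5],
)

fmt = "%a %b %d %H:%M:%S %z %Y"  # assumed format of created_at
dates = [x.date() for x in pd.to_datetime(df_demo["created_at"], format=fmt)]

# pd.Series(dates) gets a fresh index 0, 1, 2. assign() matches it to
# df_demo by label: 0 and 2 exist in both, 5 has no partner and becomes NaN.
# Label 2 also picks up the third date rather than the second.
with_series = df_demo.assign(created_date=pd.Series(dates))

# A plain list has no index, so it is assigned strictly by position.
with_list = df_demo.assign(created_date=dates)

print(with_series)
print(with_list)
```

If this is the cause, it would account for both the NaN rows and the shifted per-day counts, since some rows would receive a date that belongs to a different position. Passing index=df.index to pd.Series, or calling df.reset_index(drop=True) beforehand, should avoid the mismatch under the same assumption.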
