I have a dataframe analysis_df with the following structure:
...   FileName          UserSID  ImageSize  ImageChecksum
0     2197173372750839  0        17068032   11781483
1     5966634109289989  0        24576      42058
...   ...               ...      ...        ...
7500  6817023204572264  0        22000      123456
7501  6817023204572264  0        22000      123456
and I need to create a new column that tells how many times each ImageChecksum value appears in the table. So I count them:
count_db = {}
for checksum in analysis_df['ImageChecksum']:
    checksum = str(checksum)
    if checksum in count_db:
        count_db[checksum] += 1
    else:
        count_db[checksum] = 1
print(f"count_db: {count_db}")
output:
count_db: {'11781483': 100, '42058': 100, '56817': 100, '491537': 100, '195631': 100, '146603': 100, '104915': 100, ... [snip] ..., '123456': 2}
According to an answer to a related, but not identical, question, I can do something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame([['dog', 'hound', 5],
                   ['cat', 'ragdoll', 1]],
                  columns=['animal', 'type', 'age'])
df['description'] = 'A ' + df.age.astype(str) + ' years old ' \
                    + df.type + ' ' + df.animal
But when I try to apply this solution to my own case, I get an error:
analysis_df['ImageChecksum_Count'] = count_db[str(analysis_df['ImageChecksum'])]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [22], in <cell line: 21>()
     17         count_db[checksum] = 1
     19 print(f"count_db: {count_db}")
---> 21 analysis_df['ImageChecksum_Count'] = count_db[str(analysis_df['ImageChecksum'])]
     23 analysis_df.head()
KeyError: '0       11781483\n1          42058\n2          56817\n3         491537\n4         195631\n          ...   \n7497      125321\n7498       57364\n7499           0\n7500      123456\n7501      123456\nName: ImageChecksum, Length: 7502, dtype: int64'
Looking at this error, I basically get what I did wrong: I'm trying to apply ordinary scalar programming to this kind of Pythonic, vectorized functionality, and it doesn't work.
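To illustrate what the KeyError seems to be saying, here is a minimal sketch on a made-up four-row Series (the name `checksums` and its values are stand-ins, not my real data). As far as I can tell, `str()` on a Series produces one string containing the printed form of the whole Series, which then gets used as a single dictionary key, whereas `.astype(str)` converts element by element.

```python
import pandas as pd

# Hypothetical miniature stand-in for analysis_df['ImageChecksum']
checksums = pd.Series([11781483, 42058, 123456, 123456], name="ImageChecksum")

# str() on a Series returns ONE string: the printed representation of the whole Series
whole_repr = str(checksums)

# .astype(str) returns a Series of strings, one per row
per_element = checksums.astype(str)

print(type(whole_repr).__name__)
print(per_element.tolist())
```

So `count_db[str(series)]` looks up that one long string, which matches the key shown in the traceback.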
I always find vectorized syntax and programming confusing in Python, what with overloaded operators and whatever magic is happening behind that kind of syntax. It's very foreign to me coming from a JavaScript background.
Can someone explain the correct way to do this?
Edit:
I found that this works:
for i, row in analysis_df.iterrows():
    analysis_df.iat[i, checksum_count_col_index] = count_db[str(analysis_df.iat[i, checksum_col_index])]
But doesn't this approach sort of go against the vectorized approach you're supposed to use with DataFrames, especially with large datasets? I'd still be glad to learn the right way to do it.
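For reference, here is a sketch of what a vectorized version might look like, assuming the goal is simply "for each row, how many rows share this ImageChecksum". The small DataFrame and the three-entry `count_db` below are made-up stand-ins for the real data; the pandas calls used (`value_counts`, `map`, `groupby(...).transform`) are standard, but I have only tried this on toy input.

```python
import pandas as pd

# Hypothetical miniature stand-in for analysis_df
analysis_df = pd.DataFrame({
    "FileName": [2197173372750839, 5966634109289989,
                 6817023204572264, 6817023204572264],
    "UserSID": [0, 0, 0, 0],
    "ImageSize": [17068032, 24576, 22000, 22000],
    "ImageChecksum": [11781483, 42058, 123456, 123456],
})

# Option 1: build a checksum -> count table, then look up every row's checksum in it
counts = analysis_df["ImageChecksum"].value_counts()
analysis_df["ImageChecksum_Count"] = analysis_df["ImageChecksum"].map(counts)

# Option 2: same result in one step with groupby/transform
via_transform = analysis_df.groupby("ImageChecksum")["ImageChecksum"].transform("count")

# Option 3: reuse a dict with string keys like count_db (toy version of it here)
count_db = {"11781483": 1, "42058": 1, "123456": 2}
via_dict = analysis_df["ImageChecksum"].astype(str).map(count_db)

print(analysis_df)
```

Option 3 is the closest to the line that raised the KeyError: the conversion to string happens per element with `.astype(str)`, and `.map()` performs the dictionary lookup for each row instead of once for the whole Series.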
 
    