Apologies if this is a duplicate; I looked and couldn't find this question answered (well) online.
Let's say I have a pandas dataframe with a name column. I'd like to query a database and get the id associated with each name, and store that value in the dataframe.
In this case, I'm reading CSV files (with name), and querying a MySQL database to get id. I do some data cleaning and then upload the dataframe to the MySQL database (a different table than the one I query to get id).
What's the best way to go about doing this? Here's what I came up with:
df["id"] = pd.Series(
    [pd.read_sql_query(
        f"SELECT id FROM table WHERE name LIKE '{name}'",
        connection)["id"][0] for name in df["name"]]
)
As I understand it, this avoids modifying df while iterating over it.
- We iterate over the values in df["name"]
- The SQL query returns a DataFrame
- The ["id"][0] pulls the value of interest (id) out of the DataFrame and stores it in a list
- This list of values is converted to a pd.Series
- This pd.Series is added to the original df
This works, but it doesn't really force an association between name and id, which (I think) would be more robust. (And, obviously, there's no error handling in the code snippet above.)
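To illustrate what I mean by forcing the association, here is a minimal sketch of one possible approach: fetch the name/id pairs in a single query and merge them onto the dataframe by name. This is not something I have confirmed is the best way. The in-memory SQLite connection and the table name people are stand-ins I made up so the example runs on its own; the real code would use the MySQL connection. It also assumes exact name matching rather than LIKE.

```python
import sqlite3

import pandas as pd

# Stand-in for the real MySQL connection: an in-memory SQLite database
# with a hypothetical lookup table called "people".
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
connection.executemany(
    "INSERT INTO people (id, name) VALUES (?, ?)",
    [(1, "alice"), (2, "bob"), (3, "carol")],
)

# Stand-in for the dataframe read from CSV.
df = pd.DataFrame({"name": ["bob", "alice", "dave"]})

# One query for the whole lookup table instead of one query per row.
lookup = pd.read_sql_query("SELECT id, name FROM people", connection)

# A left merge keeps every row of df and matches id to name by value,
# not by position. Names with no match get NaN in the id column.
merged = df.merge(lookup, on="name", how="left")
print(merged)
```

With this, a name that is missing from the database shows up as a missing id instead of raising an IndexError, and the ids cannot drift out of order relative to the names.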
I tried combinations of apply or map or iterrows or iteritems, and that didn't work so well. For example:
for index, name in df["name"].iteritems():
    id_df = pd.read_sql_query(f"SELECT id FROM table WHERE name LIKE '{name}'", connection)
    temp_df[temp_df["name"] == name]["id"] = id_df["id"][0]
Warning:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
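If I read the warning correctly, the chained indexing in the loop (a boolean mask followed by ["id"]) assigns into a temporary copy, so the original dataframe is not updated. A minimal sketch of what the .loc form might look like, with a hypothetical dict named ids standing in for the per-name SQL results:

```python
import pandas as pd

# Stand-in dataframe with an empty id column.
df = pd.DataFrame({"name": ["alice", "bob"], "id": [None, None]})

# Hypothetical lookup results standing in for the per-name SQL queries.
ids = {"alice": 1, "bob": 2}

for name, found_id in ids.items():
    # A single .loc call with a row mask and a column label writes to df
    # itself, rather than to a copy produced by chained indexing.
    df.loc[df["name"] == name, "id"] = found_id

print(df)
```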