
I am parsing a larger csv that looks loosely like so:

time   id        angle
0.0   1_2_3       ...
0.0   ad_42       ...
0.0   34_02_03    ...
0.1   1_2_3       ...
0.1   ad_42       ...
0.1   f_1         ...
....

As you can see, the id field has a lot of variety in its naming scheme, but definitely has values that repeat. My goal is to read in the csv and reassign id values in order of appearance while keeping repeated ids consistent. So it would be nice to write into the dataframe and get output like so:

time   id      angle
0.0   id1       ...
0.0   id2       ...
0.0   id3       ...
0.1   id1       ...
0.1   id2       ...
0.1   id4       ...
....

Where the ids correspond to the originals but have a more human-readable form (i.e. id1, id2, ...).

Any advice would be greatly appreciated.

trincot
  • Do you need the new ID for each old ID to be consistent over time (e.g., today you process this .csv, and two weeks from now you process a new .csv -- do you need 1_2_3 to appear as id1 in both files)? – K. Thorspear May 04 '22 at 21:24
  • That would be an added bonus @K.Thorspear, but no. I can just parse this file the one time and that would be sufficient enough for my use case. So long as I had a schematic that was consistent for the single parse I would be pretty happy. – anothercoderbro May 04 '22 at 21:29

3 Answers


You can do:

ids = df['id'].unique().tolist()
id_dict = {v: 'id' + str(i) for i, v in enumerate(ids, 1)}
df['id'] = df['id'].map(id_dict)

Here ids gives you the unique id values, and id_dict assigns 'id' plus a running number to each unique value. Mapping the dict onto your column then replaces the old ids with the new ones.

And note that you don't need to worry about the order of the values: unique() preserves the order in which the values first appear.
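As a quick sanity check, here is the approach on a tiny made-up frame mirroring the question's data (values are hypothetical):

```python
import pandas as pd

# Hypothetical sample mirroring the question's csv
df = pd.DataFrame({
    'time':  [0.0, 0.0, 0.0, 0.1, 0.1, 0.1],
    'id':    ['1_2_3', 'ad_42', '34_02_03', '1_2_3', 'ad_42', 'f_1'],
    'angle': [10.0, 20.0, 30.0, 11.0, 21.0, 41.0],
})

ids = df['id'].unique().tolist()
id_dict = {v: 'id' + str(i) for i, v in enumerate(ids, 1)}
df['id'] = df['id'].map(id_dict)

print(df['id'].tolist())  # ['id1', 'id2', 'id3', 'id1', 'id2', 'id4']
```

Repeated ids (1_2_3, ad_42) keep the same new label each time they appear.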

For chunks:

If you are dealing with chunks, you can do as follows - you just need to identify the ids that are not yet in the dictionary and continue the numbering from where the previous chunk left off:

id_dict = {}
for chunk in chunks:
    # ids in this chunk that have not been seen in any earlier chunk
    new_ids = chunk.loc[~chunk['id'].isin(id_dict), 'id'].unique()
    # continue numbering from the current dictionary size
    id_dict.update({v: 'id' + str(i) for i, v in enumerate(new_ids, len(id_dict) + 1)})

    chunk['id'] = chunk['id'].map(id_dict)
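To tie this together, here is a self-contained sketch of the chunked loop. It uses an in-memory buffer with made-up rows in place of the real file (pd.read_csv accepts a chunksize argument to iterate a csv in pieces):

```python
import io
import pandas as pd

# Simulated csv contents (hypothetical rows; in practice pass a filename to read_csv)
csv_text = "time,id,angle\n0.0,1_2_3,10\n0.0,ad_42,20\n0.1,1_2_3,11\n0.1,f_1,40\n"

id_dict = {}
out = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    # ids not seen in any earlier chunk
    new_ids = chunk.loc[~chunk['id'].isin(id_dict), 'id'].unique()
    # continue numbering from the current dictionary size
    id_dict.update({v: 'id' + str(i) for i, v in enumerate(new_ids, len(id_dict) + 1)})
    chunk['id'] = chunk['id'].map(id_dict)
    out.append(chunk)

df = pd.concat(out, ignore_index=True)
print(df['id'].tolist())  # ['id1', 'id2', 'id1', 'id3']
```

Because the dictionary persists across chunks, 1_2_3 maps to id1 in both the first and second chunk.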
SomeDude
  • Say I wanted to take this a step further as I am reading the csv in chunks and want the dictionary to stay consistent among id's and reading consecutive chunks. How could I integrate this with a global dictionary and ID list to maintain the relationship mappings @SomeDude? – anothercoderbro May 04 '22 at 23:13
  • That's really, really close @SomeDude. I am looking through it on my end, but the first values assigned from id1 to id9 work great, then it has aberrant behavior and assigns the 10th id to be id18. then id19, then goes to id20, then skips to id30 and starts going into the 100s and 1000s quickly from there. – anothercoderbro May 05 '22 at 01:04
  • That’s just a guide you need to debug :) – SomeDude May 05 '22 at 01:16
  • 1
    Hahaha I appreciate it. I ended up just calling len(id_dict) instead of updating and maintaining the size variable at the end and that seemed to make it work and maintain consistency. I appreciate your help so, so much! – anothercoderbro May 05 '22 at 01:40

Try using pd.factorize()

df['id'] = ['id{}'.format(i) for i in (pd.factorize(df['id'])[0]+1)]
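For illustration, factorize() returns 0-based codes in order of first appearance, so adding 1 gives the id1, id2, ... numbering (sample values are made up):

```python
import pandas as pd

# Hypothetical sample with a repeated id
s = pd.Series(['1_2_3', 'ad_42', '1_2_3', 'f_1'])

codes, uniques = pd.factorize(s)
print(codes.tolist())  # [0, 1, 0, 2] - repeats get the same code

labels = ['id{}'.format(i) for i in codes + 1]
print(labels)  # ['id1', 'id2', 'id1', 'id3']
```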
rhug123

One approach would be to hash the id column with the pandas.util.hash_pandas_object() function. Since the hash is reproducible each run, you would have a consistent ID for each original value that remains the same over time.

Here's a StackOverflow answer: Get the same hash value for a Pandas DataFrame each time
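A minimal sketch of the idea on a made-up series. Note that the hashes are stable 64-bit integers rather than the human-readable id1/id2 form the question asks for, so this mainly buys cross-file consistency:

```python
import pandas as pd

# Hypothetical ids with a repeat; index=False hashes only the values
s = pd.Series(['1_2_3', 'ad_42', '1_2_3'])
hashes = pd.util.hash_pandas_object(s, index=False)

# Identical id strings always hash to the same value, run after run
print(hashes.iloc[0] == hashes.iloc[2])  # True
```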

K. Thorspear