I have a dataframe (z) that looks like this:
timestamp                   source  price
2004-01-05 14:55:09+00:00   Bank1   420.975
2004-01-05 14:55:10+00:00   Bank2   421.0
2004-01-05 14:55:22+00:00   Bank1   421.075
2004-01-05 14:55:34+00:00   Bank1   420.975
2004-01-05 14:55:39+00:00   Bank1   421.175
2004-01-05 14:55:45+00:00   Bank1   421.075
2004-01-05 14:55:52+00:00   Bank1   421.175
2004-01-05 14:56:12+00:00   Bank2   421.1
2004-01-05 14:56:33+00:00   Bank1   421.275
Sometimes there are time windows where Bank2 submits only one quote. I need to throw out every day where this happens, because I need two or more quotes per bank: if Bank2 appears one or fewer times, throw out the day.
I've accomplished this by creating a boolean mask from which I plan to filter out all days that satisfy the criteria:
r = z.groupby([z.index.date, z['source']]).size() > 1
    # boolean for each day/source pair: True if the source appears more than once on that day
r = r.groupby(level=0).all()
    # i.e. if every source under a date (the 0th index level) is True, return True, otherwise False (meaning at least one source failed the criteria)
This yields:
2004-01-05  True
2004-01-06  True
2004-01-07  True
2004-01-08  False
2004-01-09  True
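For anyone who wants to reproduce the mask, here is a minimal sketch on made-up data. The timestamps and prices are hypothetical (not the real quotes), and the intermediate name counts is only there for readability.

```python
import pandas as pd

# Hypothetical sample: 2004-01-05 has two quotes per bank,
# 2004-01-06 has only a single Bank2 quote.
idx = pd.to_datetime(
    [
        "2004-01-05 14:55:09", "2004-01-05 14:55:10",
        "2004-01-05 14:55:22", "2004-01-05 14:56:12",
        "2004-01-06 14:55:09", "2004-01-06 14:55:22",
        "2004-01-06 14:56:12",
    ],
    utc=True,
)
z = pd.DataFrame(
    {
        "source": ["Bank1", "Bank2", "Bank1", "Bank2", "Bank1", "Bank1", "Bank2"],
        "price": [420.975, 421.0, 421.075, 421.1, 421.0, 421.1, 421.2],
    },
    index=idx,
)

# Number of quotes per (day, source) pair.
counts = z.groupby([z.index.date, z["source"]]).size()

# A day passes only if every source seen that day has more than one quote.
r = (counts > 1).groupby(level=0).all()
print(r)
```

One caveat with this approach: size() only produces rows for sources that appear at least once on a given day, so a day on which a bank submits no quotes at all would not be flagged by this mask.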
Perfect. Now I just need to apply it to the original dataframe z while keeping the original structure (i.e. second-level frequency, not day-by-day). I assume that means using the groupby .filter() method.
Grouped by day, my original dataframe has the same structure as r (and their .shape[0] values are the same). The row counts per day are:
2004-01-05  94
2004-01-06  24
2004-01-07  62
2004-01-08  30
2004-01-09  36
Great.
Here's where I get confused. I run:
t = y.groupby(y.index.date).filter(lambda x: [x for x in r])
And receive TypeError: filter function returned a list, but expected a scalar bool.
- Basically, I need the lambda function to simply return each x (boolean) in r.
I solved this in a really convoluted way instead (take the whole expression I wrote before and, rather than storing it in an r variable, make it part of the lambda function).
t = y.groupby(y.index.date).filter(lambda x: (x.groupby([x.index.date, x['source']]).size() > 1).groupby(level=0).all() == True) # ie. the datetime 0th-level index
This is super messy, and there must be a basic way to say: here is my dataframe z, group it by z.index.date, then .filter() based on the boolean mask r.
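One possible way to express that, offered as a sketch rather than a confirmed best practice, is to skip .filter() entirely: look up each row's date in r and use the result for ordinary boolean indexing. The sample frame and the name row_mask below are hypothetical.

```python
import pandas as pd

# Hypothetical sample: 2004-01-06 has only one Bank2 quote, so it should be dropped.
idx = pd.to_datetime(
    [
        "2004-01-05 14:55:09", "2004-01-05 14:55:10",
        "2004-01-05 14:55:22", "2004-01-05 14:56:12",
        "2004-01-06 14:55:09", "2004-01-06 14:55:22",
        "2004-01-06 14:56:12",
    ],
    utc=True,
)
z = pd.DataFrame(
    {
        "source": ["Bank1", "Bank2", "Bank1", "Bank2", "Bank1", "Bank1", "Bank2"],
        "price": [420.975, 421.0, 421.075, 421.1, 421.0, 421.1, 421.2],
    },
    index=idx,
)

# Day-level mask, built the same way as r in the question.
r = (z.groupby([z.index.date, z["source"]]).size() > 1).groupby(level=0).all()

# Look up each row's calendar date in r to get one boolean per row.
row_mask = pd.Series(z.index.date, index=z.index).map(r).astype(bool)

# Plain boolean indexing keeps the original second-level rows.
t = z[row_mask.to_numpy()]
print(t)
```

The design choice here is that r is computed once and reused, instead of being recomputed inside a lambda for every group.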
Edit: this is what I found from the pandas tutorial, but for some reason the .between_time() part doesn't work. It filters out everything with length <= 1, not only the groups where the .between_time() condition is true.
t = y.groupby([y.index.date, y['source']]).filter(lambda x: len(x.between_time('14:00','15:00')) > 1)
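A minimal sketch of the windowed version, assuming the intent is to keep a (day, source) group only when it has more than one quote between 14:00 and 15:00. The sample data and the helper name enough_in_window are hypothetical. The placement of the parenthesis matters: len(...) should wrap only the between_time() result, with the > 1 comparison applied to that length.

```python
import pandas as pd

# Hypothetical sample: Bank2 has two quotes on the day,
# but only one of them falls between 14:00 and 15:00.
idx = pd.to_datetime(
    [
        "2004-01-05 14:55:09", "2004-01-05 14:55:10",
        "2004-01-05 14:55:22", "2004-01-05 16:10:00",
    ],
    utc=True,
)
z = pd.DataFrame(
    {
        "source": ["Bank1", "Bank2", "Bank1", "Bank2"],
        "price": [420.975, 421.0, 421.075, 421.1],
    },
    index=idx,
)

def enough_in_window(group):
    # Restrict the group to the time window first, then count the rows left.
    in_window = group.between_time("14:00", "15:00")
    return len(in_window) > 1

# Groups that fail the check are dropped entirely, including their
# rows outside the window.
t = z.groupby([z.index.date, z["source"]]).filter(enough_in_window)
print(t)
```

Note that this filters per (day, source) pair, so a day can survive with only one bank left. Dropping the whole day would need the day-level mask instead.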
 
    