python - Pandas DataFrame merge between two values instead of matching one -


i have dataframe date column , want merge not on match column if date column between 2 columns on second dataframe.

i believe can achieve using apply on first filter second based on these criterion , combining results apply has in practice been horribly slow way go things.

is there way merge match being between instead of exact match.

example dataframe:

,code,description,begindate,enddate,refsessiontypeid,organizationcalendarid 0,2014-2015,school year: 2014-2015,2014-08-18 00:00:00.000,2015-08-01 00:00:00.000,1,3 1,2012-2013,school year: 2012-2013,2012-09-01 00:00:00.000,2013-08-16 00:00:00.000,1,2 2,2013-2014,school year: 2013-2014,2013-08-19 00:00:00.000,2014-08-17 00:00:00.000,1,1 

instead of merge on date=begindate or date=enddate want match on date between(begindate, enddate)

you can use numpy.searchsorted() simulate between.

say data , lookup value this:

in [162]: data = pd.dataframe({    .....:     'date': pd.series(pd.np.random.randint(1429449000, 1429649000, 1000) * 1e9).astype('datetime64[ns]'),    .....:     'value': pd.np.random.randint(0, 100, 1000),    .....: })  in [163]: data.head() out[163]:                  date  value 0 2015-04-21 13:37:37     60 1 2015-04-20 06:27:43     76 2 2015-04-20 09:01:51     70 3 2015-04-21 10:47:31      5 4 2015-04-19 18:39:45     27  in [164]:  in [164]: lookup = pd.series(    .....:     pd.np.random.randint(0, 10, 5),    .....:     index=pd.series(pd.np.random.randint(1429449000, 1429649000, 5) * 1e9).astype('datetime64[ns]'),    .....: )  in [165]: lookup out[165]: 2015-04-21 11:10:39    4 2015-04-21 07:07:51    1 2015-04-20 08:27:19    1 2015-04-21 09:58:42    6 2015-04-20 06:46:12    7 dtype: int32 

you'd first want make sure dates in data['date'] available in lookup's index. then, sort lookup date.

in [166]: lookup[data['date'].max()] = lookup[data['date'].min()] = none  in [167]: lookup = lookup.sort_index() 

now comes important bit -- use numpy's extremely fast searchsorted() method indices:

in [168]: indices = pd.np.searchsorted(lookup.index.astype(long), data['date'].astype(long).values, side='left')  in [169]: data['lookup'] = lookup.iloc[indices].values  in [170]: data.head() out[170]:                  date  value lookup 0 2015-04-21 13:37:37     60   none 1 2015-04-20 06:27:43     76      7 2 2015-04-20 09:01:51     70      1 3 2015-04-21 10:47:31      5      4 4 2015-04-19 18:39:45     27      7 

edit: might want convert date range have in dataset single series lookup above. that's because in case of overlapping date ranges, isn't clear value up.


Comments

Popular posts from this blog

java - Ebean enhancement ignores a model -

ubuntu - How to disable Kernel Module Signing in linux -

SQL php on different pages to Insert (mysqli) -