python - Pandas DataFrame merge between two values instead of matching one -


i have dataframe date column , want merge not on match column if date column between 2 columns on second dataframe.

i believe can achieve using apply on first filter second based on these criterion , combining results apply has in practice been horribly slow way go things.

is there way merge match being between instead of exact match.

example dataframe:

,code,description,begindate,enddate,refsessiontypeid,organizationcalendarid 0,2014-2015,school year: 2014-2015,2014-08-18 00:00:00.000,2015-08-01 00:00:00.000,1,3 1,2012-2013,school year: 2012-2013,2012-09-01 00:00:00.000,2013-08-16 00:00:00.000,1,2 2,2013-2014,school year: 2013-2014,2013-08-19 00:00:00.000,2014-08-17 00:00:00.000,1,1 

instead of merge on date=begindate or date=enddate want match on date between(begindate, enddate)

you can use numpy.searchsorted() simulate between.

say data , lookup value this:

in [162]: data = pd.dataframe({    .....:     'date': pd.series(pd.np.random.randint(1429449000, 1429649000, 1000) * 1e9).astype('datetime64[ns]'),    .....:     'value': pd.np.random.randint(0, 100, 1000),    .....: })  in [163]: data.head() out[163]:                  date  value 0 2015-04-21 13:37:37     60 1 2015-04-20 06:27:43     76 2 2015-04-20 09:01:51     70 3 2015-04-21 10:47:31      5 4 2015-04-19 18:39:45     27  in [164]:  in [164]: lookup = pd.series(    .....:     pd.np.random.randint(0, 10, 5),    .....:     index=pd.series(pd.np.random.randint(1429449000, 1429649000, 5) * 1e9).astype('datetime64[ns]'),    .....: )  in [165]: lookup out[165]: 2015-04-21 11:10:39    4 2015-04-21 07:07:51    1 2015-04-20 08:27:19    1 2015-04-21 09:58:42    6 2015-04-20 06:46:12    7 dtype: int32 

you'd first want make sure dates in data['date'] available in lookup's index. then, sort lookup date.

in [166]: lookup[data['date'].max()] = lookup[data['date'].min()] = none  in [167]: lookup = lookup.sort_index() 

now comes important bit -- use numpy's extremely fast searchsorted() method indices:

in [168]: indices = pd.np.searchsorted(lookup.index.astype(long), data['date'].astype(long).values, side='left')  in [169]: data['lookup'] = lookup.iloc[indices].values  in [170]: data.head() out[170]:                  date  value lookup 0 2015-04-21 13:37:37     60   none 1 2015-04-20 06:27:43     76      7 2 2015-04-20 09:01:51     70      1 3 2015-04-21 10:47:31      5      4 4 2015-04-19 18:39:45     27      7 

edit: might want convert date range have in dataset single series lookup above. that's because in case of overlapping date ranges, isn't clear value up.


Comments

Popular posts from this blog

c++ - No viable overloaded operator for references a map -

java - Custom OutputStreamAppender not run: LOGBACK: No context given for <MYAPPENDER> -

java - Cannot secure connection using TLS -