python - Pandas DataFrame merge between two values instead of matching one -
i have dataframe date column , want merge not on match column if date column between 2 columns on second dataframe.
i believe can achieve using apply on first filter second based on these criterion , combining results apply has in practice been horribly slow way go things.
is there way merge match being between instead of exact match.
example dataframe:
,code,description,begindate,enddate,refsessiontypeid,organizationcalendarid 0,2014-2015,school year: 2014-2015,2014-08-18 00:00:00.000,2015-08-01 00:00:00.000,1,3 1,2012-2013,school year: 2012-2013,2012-09-01 00:00:00.000,2013-08-16 00:00:00.000,1,2 2,2013-2014,school year: 2013-2014,2013-08-19 00:00:00.000,2014-08-17 00:00:00.000,1,1
instead of merge on date=begindate or date=enddate want match on date between(begindate, enddate)
you can use numpy.searchsorted() simulate between.
say data , lookup value this:
in [162]: data = pd.dataframe({ .....: 'date': pd.series(pd.np.random.randint(1429449000, 1429649000, 1000) * 1e9).astype('datetime64[ns]'), .....: 'value': pd.np.random.randint(0, 100, 1000), .....: }) in [163]: data.head() out[163]: date value 0 2015-04-21 13:37:37 60 1 2015-04-20 06:27:43 76 2 2015-04-20 09:01:51 70 3 2015-04-21 10:47:31 5 4 2015-04-19 18:39:45 27 in [164]: in [164]: lookup = pd.series( .....: pd.np.random.randint(0, 10, 5), .....: index=pd.series(pd.np.random.randint(1429449000, 1429649000, 5) * 1e9).astype('datetime64[ns]'), .....: ) in [165]: lookup out[165]: 2015-04-21 11:10:39 4 2015-04-21 07:07:51 1 2015-04-20 08:27:19 1 2015-04-21 09:58:42 6 2015-04-20 06:46:12 7 dtype: int32
you'd first want make sure dates in data['date']
available in lookup
's index. then, sort lookup date.
in [166]: lookup[data['date'].max()] = lookup[data['date'].min()] = none in [167]: lookup = lookup.sort_index()
now comes important bit -- use numpy's extremely fast searchsorted() method indices:
in [168]: indices = pd.np.searchsorted(lookup.index.astype(long), data['date'].astype(long).values, side='left') in [169]: data['lookup'] = lookup.iloc[indices].values in [170]: data.head() out[170]: date value lookup 0 2015-04-21 13:37:37 60 none 1 2015-04-20 06:27:43 76 7 2 2015-04-20 09:01:51 70 1 3 2015-04-21 10:47:31 5 4 4 2015-04-19 18:39:45 27 7
edit: might want convert date range have in dataset single series lookup
above. that's because in case of overlapping date ranges, isn't clear value up.
Comments
Post a Comment