python - Why is Pandas Concatenation (pandas.concat) so Memory Inefficient?
I have 30 GB of data (in a list of 900 dataframes) that I am attempting to concatenate together. The machine I am working on is a moderately powerful Linux box with 256 GB of RAM. However, when I try to concatenate my files I quickly run out of available RAM. I have tried all sorts of workarounds to fix this (concatenating in smaller batches with loops, etc.) but I still cannot get these to concatenate. Two questions spring to mind:
1. Has anyone else dealt with this and found an effective workaround? I cannot use a straight append because I need the 'column merging' (for lack of a better word) functionality of the join='outer' argument in pd.concat().

2. Why is Pandas concatenation (which I know is just calling numpy.concatenate) so inefficient in its use of memory?
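For reference, the batched concatenation I tried looks roughly like this (a sketch rather than my exact code; datalist stands for the list of 900 dataframes and batch_size is an arbitrary choice):

import pandas as pd

# datalist: the list of 900 dataframes described above (placeholder name)
# concatenate in batches, then concatenate the partial results;
# the final concat over the partials still exhausts RAM
batch_size = 50
partials = []
for i in range(0, len(datalist), batch_size):
    batch = datalist[i:i + batch_size]
    partials.append(pd.concat(batch, join="outer", axis=0, ignore_index=True))
data = pd.concat(partials, join="outer", axis=0, ignore_index=True)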
I should also note that I do not think the problem is an explosion of columns: concatenating 100 of the dataframes together gives about 3,000 columns, whereas the base dataframe has about 1,000.
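(For what it's worth, a quick way to check the width the outer-joined result will have before concatenating anything, again with datalist as a placeholder for the list of dataframes:)

# the final frame has one column per distinct column name,
# so the size of the union of the column names is its width
all_cols = set()
for df in datalist:
    all_cols.update(df.columns)
print(len(all_cols))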
Edit:
The data I am working with is financial data, about 1,000 columns wide and 50,000 rows deep for each of my 900 dataframes. The types of data going across from left to right are:
- date in string format,
- string,
- np.float,
- int,
... and so on, repeating. I am concatenating on column name with an outer join, which means that any columns in df2 that are not in df1 are not discarded but shunted off to the side.
Example:
#example code
data = pd.concat(datalist4, join="outer", axis=0, ignore_index=True)

#two example dataframes (about 90% of column names should be in common
#between the 2 dataframes; unnamed columns, etc. are not a significant
#number of columns)

print datalist4[0].head()
                 800_1     800_2   800_3  800_4                900_1     900_2
0  2014-08-06 09:00:00  best_bid  1117.1    103  2014-08-06 09:00:00  best_bid
1  2014-08-06 09:00:00  best_ask  1120.0    103  2014-08-06 09:00:00  best_ask
2  2014-08-06 09:00:00  best_bid  1106.9     11  2014-08-06 09:00:00  best_bid
3  2014-08-06 09:00:00  best_ask  1125.8     62  2014-08-06 09:00:00  best_ask
4  2014-08-06 09:00:00  best_bid  1117.1    103  2014-08-06 09:00:00  best_bid

    900_3  900_4               1000_1    1000_2 ...  2400_4
0  1017.2    103  2014-08-06 09:00:00  best_bid ...     NaN
1  1020.1    103  2014-08-06 09:00:00  best_ask ...     NaN
2  1004.3     11  2014-08-06 09:00:00  best_bid ...     NaN
3  1022.9     11  2014-08-06 09:00:00  best_ask ...     NaN
4  1006.7     10  2014-08-06 09:00:00  best_bid ...     NaN

                      _1   _2   _3   _4                   _1.1  _2.1  _3.1  _4.1
0  #N/A Invalid Security  NaN  NaN  NaN  #N/A Invalid Security   NaN   NaN   NaN
1                    NaN  NaN  NaN  NaN                    NaN   NaN   NaN   NaN
2                    NaN  NaN  NaN  NaN                    NaN   NaN   NaN   NaN
3                    NaN  NaN  NaN  NaN                    NaN   NaN   NaN   NaN
4                    NaN  NaN  NaN  NaN                    NaN   NaN   NaN   NaN

       dater
0  2014.8.6
1  2014.8.6
2  2014.8.6
3  2014.8.6
4  2014.8.6

[5 rows x 777 columns]

print datalist4[1].head()
                 150_1     150_2   150_3  150_4                200_1     200_2
0  2013-12-04 09:00:00  best_bid  1639.6     30  2013-12-04 09:00:00  best_ask
1  2013-12-04 09:00:00  best_ask  1641.8    133  2013-12-04 09:00:08  best_bid
2  2013-12-04 09:00:01  best_bid  1639.5     30  2013-12-04 09:00:08  best_ask
3  2013-12-04 09:00:05  best_bid  1639.4     30  2013-12-04 09:00:08  best_ask
4  2013-12-04 09:00:08  best_bid  1639.3    133  2013-12-04 09:00:08  best_bid

    200_3  200_4                250_1     250_2 ...               2500_1
0  1591.9    133  2013-12-04 09:00:00  best_bid ...  2013-12-04 10:29:41
1  1589.4     30  2013-12-04 09:00:00  best_ask ...  2013-12-04 11:59:22
2  1591.6    103  2013-12-04 09:00:01  best_bid ...  2013-12-04 11:59:23
3  1591.6    133  2013-12-04 09:00:04  best_bid ...  2013-12-04 11:59:26
4  1589.4    133  2013-12-04 09:00:07  best_bid ...  2013-12-04 11:59:29

     2500_2  2500_3  2500_4         Unnamed: 844_1  Unnamed: 844_2
0  best_ask    0.35      50  #N/A Invalid Security             NaN
1  best_ask    0.35      11                    NaN             NaN
2  best_ask    0.40      11                    NaN             NaN
3  best_ask    0.45      11                    NaN             NaN
4  best_ask    0.50      21                    NaN             NaN

   Unnamed: 844_3  Unnamed: 844_4         Unnamed: 848_1      dater
0             NaN             NaN  #N/A Invalid Security  2013.12.4
1             NaN             NaN                    NaN  2013.12.4
2             NaN             NaN                    NaN  2013.12.4
3             NaN             NaN                    NaN  2013.12.4
4             NaN             NaN                    NaN  2013.12.4

[5 rows x 850 columns]
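To make the 'column merging' behaviour concrete, here is a tiny self-contained demonstration with toy column names (not my real data):

import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
df2 = pd.DataFrame({"a": [5, 6], "c": ["x", "y"]})

# outer join on columns: 'b' and 'c' are both kept, and NaN fills
# the rows coming from the frame that lacks that column
out = pd.concat([df1, df2], join="outer", axis=0, ignore_index=True)
print(out)
#    a    b    c
# 0  1  3.0  NaN
# 1  2  4.0  NaN
# 2  5  NaN    x
# 3  6  NaN    y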
I've had performance issues when concatenating a large number of dataframes to a 'growing' dataframe. My workaround was to append the sub-dataframes to a list, and then concatenate the list of dataframes once the processing of the sub-dataframes had been completed.
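A minimal sketch of that pattern, with chunks and process() as placeholder names for whatever produces your sub-dataframes:

import pandas as pd

frames = []
for chunk in chunks:               # placeholder: whatever yields the raw pieces
    frames.append(process(chunk))  # placeholder: per-chunk work returning a dataframe

# a single concat at the end; growing a dataframe inside the loop
# would recopy all of the accumulated data on every iteration
result = pd.concat(frames, join="outer", axis=0, ignore_index=True)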