python - Why is Pandas Concatenation (pandas.concat) so Memory Inefficient?


I have about 30 GB of data (in a list of 900 dataframes) that I am attempting to concatenate together. The machine I am working with is a moderately powerful Linux box with 256 GB of RAM. However, when I try to concatenate my files I run out of available RAM. I have tried all sorts of workarounds to fix this (concatenating in smaller batches with for loops, etc.) but I still cannot get these to concatenate. Two questions spring to mind:

  1. Has anyone else dealt with this and found an effective workaround? I cannot use a straight append because I need the 'column merging' (for lack of a better word) functionality of the join='outer' argument in pd.concat() (see the toy sketch after this list).

  2. Why is Pandas concatenation (which I know is just calling numpy.concatenate) so inefficient with its use of memory?
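To make the 'column merging' in question 1 concrete, here is a toy sketch (invented frames, not my real data) of the join='outer' behaviour I rely on:

 import pandas as pd

 # Toy frames: 'a' is shared between them; 'b' and 'c' are not.
 df1 = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
 df2 = pd.DataFrame({"a": [3, 4], "c": [0.5, 0.6]})

 # join='outer' keeps the union of the columns; cells with no source
 # value are filled with NaN rather than the columns being discarded.
 print(pd.concat([df1, df2], join="outer", axis=0, ignore_index=True))
 #    a    b    c
 # 0  1    x  NaN
 # 1  2    y  NaN
 # 2  3  NaN  0.5
 # 3  4  NaN  0.6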

I should note that I do not think the problem is an explosion of columns, as concatenating 100 of the dataframes together gives about 3000 columns whereas the base dataframe has about 1000.
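For what it's worth, this is roughly how I checked that (a sketch, assuming datalist4 is the list of dataframes from the example below):

 # Count the union of column names across the first 100 frames. If
 # concat were exploding columns, this count would be far above ~3000.
 all_cols = set()
 for df in datalist4[:100]:
     all_cols.update(df.columns)
 print(len(all_cols))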

Edit:

The data I am working with is financial data, about 1000 columns wide and about 50,000 rows deep for each of my 900 dataframes. The types of data going across from left to right are:

  1. date in string format,
  2. string
  3. np.float
  4. int

... and so on, repeating. I am concatenating on the column name with an outer join, which means that any columns in df2 that are not in df1 are not discarded but shunted off to the side.
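One thing I suspect matters for memory here (an assumption on my part, illustrated with toy frames rather than my real data): the NaN filling that the outer join performs forces integer columns that are missing from some frames to be upcast to float64, and the string columns are object dtype, so the result can be considerably bigger than the sum of the inputs:

 import numpy as np
 import pandas as pd

 df1 = pd.DataFrame({"px": np.arange(3, dtype=np.int64)})
 df2 = pd.DataFrame({"px": np.arange(3, dtype=np.int64),
                     "qty": np.arange(3, dtype=np.int64)})

 out = pd.concat([df1, df2], join="outer", axis=0, ignore_index=True)
 print(out.dtypes)
 # px       int64    <- shared column keeps its dtype
 # qty    float64    <- missing from df1, NaN-filled, upcast from int64
 print(out.memory_usage(deep=True).sum())  # deep=True counts strings too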


Example:

 #Example code
 data = pd.concat(datalist4, join="outer", axis=0, ignore_index=True)

 #Two example dataframes (about 90% of column names should be in common
 #between the two dataframes; unnamed columns, etc. are not a significant
 #number of columns)

 print(datalist4[0].head())
                 800_1     800_2   800_3  800_4               900_1     900_2
 0 2014-08-06 09:00:00  best_bid  1117.1    103 2014-08-06 09:00:00  best_bid
 1 2014-08-06 09:00:00  best_ask  1120.0    103 2014-08-06 09:00:00  best_ask
 2 2014-08-06 09:00:00  best_bid  1106.9     11 2014-08-06 09:00:00  best_bid
 3 2014-08-06 09:00:00  best_ask  1125.8     62 2014-08-06 09:00:00  best_ask
 4 2014-08-06 09:00:00  best_bid  1117.1    103 2014-08-06 09:00:00  best_bid

     900_3  900_4              1000_1    1000_2    ...     2400_4
 0  1017.2    103 2014-08-06 09:00:00  best_bid    ...        NaN
 1  1020.1    103 2014-08-06 09:00:00  best_ask    ...        NaN
 2  1004.3     11 2014-08-06 09:00:00  best_bid    ...        NaN
 3  1022.9     11 2014-08-06 09:00:00  best_ask    ...        NaN
 4  1006.7     10 2014-08-06 09:00:00  best_bid    ...        NaN

                       _1  _2  _3  _4                   _1.1 _2.1 _3.1  _4.1
 0  #N/A Invalid Security NaN NaN NaN  #N/A Invalid Security  NaN  NaN   NaN
 1                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN
 2                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN
 3                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN
 4                    NaN NaN NaN NaN                    NaN  NaN  NaN   NaN

       dater
 0  2014.8.6
 1  2014.8.6
 2  2014.8.6
 3  2014.8.6
 4  2014.8.6

 [5 rows x 777 columns]

 print(datalist4[1].head())
                 150_1     150_2   150_3  150_4               200_1     200_2
 0 2013-12-04 09:00:00  best_bid  1639.6     30 2013-12-04 09:00:00  best_ask
 1 2013-12-04 09:00:00  best_ask  1641.8    133 2013-12-04 09:00:08  best_bid
 2 2013-12-04 09:00:01  best_bid  1639.5     30 2013-12-04 09:00:08  best_ask
 3 2013-12-04 09:00:05  best_bid  1639.4     30 2013-12-04 09:00:08  best_ask
 4 2013-12-04 09:00:08  best_bid  1639.3    133 2013-12-04 09:00:08  best_bid

     200_3  200_4               250_1     250_2    ...                 2500_1
 0  1591.9    133 2013-12-04 09:00:00  best_bid    ...    2013-12-04 10:29:41
 1  1589.4     30 2013-12-04 09:00:00  best_ask    ...    2013-12-04 11:59:22
 2  1591.6    103 2013-12-04 09:00:01  best_bid    ...    2013-12-04 11:59:23
 3  1591.6    133 2013-12-04 09:00:04  best_bid    ...    2013-12-04 11:59:26
 4  1589.4    133 2013-12-04 09:00:07  best_bid    ...    2013-12-04 11:59:29

      2500_2 2500_3 2500_4         Unnamed: 844_1  Unnamed: 844_2
 0  best_ask   0.35     50  #N/A Invalid Security             NaN
 1  best_ask   0.35     11                    NaN             NaN
 2  best_ask   0.40     11                    NaN             NaN
 3  best_ask   0.45     11                    NaN             NaN
 4  best_ask   0.50     21                    NaN             NaN

   Unnamed: 844_3 Unnamed: 844_4         Unnamed: 848_1      dater
 0            NaN            NaN  #N/A Invalid Security  2013.12.4
 1            NaN            NaN                    NaN  2013.12.4
 2            NaN            NaN                    NaN  2013.12.4
 3            NaN            NaN                    NaN  2013.12.4
 4            NaN            NaN                    NaN  2013.12.4

 [5 rows x 850 columns]

I've had performance issues concatenating a large number of dataframes to a 'growing' dataframe. My workaround was appending all the sub dataframes to a list, and then concatenating the list of dataframes once processing of the sub dataframes has been completed.
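A minimal sketch of that pattern (the process() helper and the toy frames are hypothetical stand-ins for your real per-file work):

 import pandas as pd

 # Hypothetical stand-in for whatever builds each sub dataframe.
 def process(i):
     return pd.DataFrame({"a": [i], "b": [i * 2.0]})

 # Anti-pattern: growing a dataframe inside the loop re-copies all rows
 # accumulated so far on every iteration.
 #
 # Better: collect the sub dataframes in a list...
 frames = [process(i) for i in range(900)]

 # ...and concatenate exactly once at the end.
 result = pd.concat(frames, join="outer", axis=0, ignore_index=True)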

