Is the a way to force Spark Aggregate / Reduce to "bubble-up"? -
i tried both aggregate
, reduce
in spark produce large datasets. noticed part of reduction executed in driver
. according mllib blog, have managed implement bubbling
, ie. once workers have reduced each task
/partition
move reduction phase subset of workers until delegated driver.
in use case, have 580 partitions don't have many entries in common, ie. each partition size 2gb partitions aggregated 2gbs. driver
delegating reduction of partitions driver
oome. have missed api call can or best way force behaviours applying incremental repartitioning
?
tnx
i think looking rdd.treeaggregate
applies reducer in multi-leveled way, reducing amount of data passed driver final reduction.
it has been moved mllib spark core on spark 1.3.0. see spark-5430
Comments
Post a Comment