distributed computing - Spark collectAsMap
I know how collectAsMap works in Spark. More specifically, I would like to know where the aggregation of the partitions' data takes place. The aggregation takes place either on the master or on the workers. In the first case, each worker sends its data to the master, and once the master has collected the data from every worker, the master aggregates the results. In the second case, the workers are responsible for aggregating the results (after exchanging data among themselves), and the results are then sent to the master.
It is critical for me to find a way for the master to collect the data from each partition separately, without the workers exchanging data.
You can see how collectAsMap is implemented here. Since the RDD's element type is a tuple, it looks like it simply uses the normal RDD collect and then translates the tuples into a map of key/value pairs. A comment there mentions that a multi-map isn't supported, so you need a one-to-one key/value mapping across your data.
What collect does is execute a Spark job, get the results of each partition back from the workers, and aggregate them with a reduce/concat phase on the driver.
So given that, it should be the case that the driver collects the data from each partition separately, without the workers exchanging data, in order to perform collectAsMap.
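To make that flow concrete, here is a minimal pure-Python sketch of the behaviour described above. It is only a model, not Spark's actual code: the names run_task, collect, collect_as_map and the sample partitions are made up for illustration, and having the later value win for a duplicate key is an assumption of this sketch.

```python
from typing import Dict, Hashable, List, Tuple

Pair = Tuple[Hashable, object]


def run_task(partition: List[Pair]) -> List[Pair]:
    """Hypothetical stand-in for one worker task: it only reads its own partition."""
    return list(partition)


def collect(partitions: List[List[Pair]]) -> List[Pair]:
    """Model of collect(): one result per partition, concatenated on the driver."""
    # Each task runs independently; no task looks at another partition's data.
    per_partition_results = [run_task(p) for p in partitions]

    # The "driver" side: concatenate the per-partition results in partition order.
    merged: List[Pair] = []
    for chunk in per_partition_results:
        merged.extend(chunk)
    return merged


def collect_as_map(partitions: List[List[Pair]]) -> Dict[Hashable, object]:
    """Model of collectAsMap(): collect, then build a plain (non-multi) map."""
    result: Dict[Hashable, object] = {}
    for key, value in collect(partitions):
        # Assumption of this sketch: a later value overwrites an earlier one,
        # so only one value per key survives.
        result[key] = value
    return result


# Made-up sample data: three partitions, with key "a" appearing twice.
partitions = [[("a", 1), ("b", 2)], [("c", 3)], [("a", 4)]]
as_map = collect_as_map(partitions)
print(as_map)  # {'a': 4, 'b': 2, 'c': 3}
```

The duplicate key "a" shows why the data needs a one-to-one key/value mapping: only one of its two values ends up in the resulting map.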
Note that if you are doing transformations on your RDD prior to using collectAsMap that cause a shuffle to occur, there may be an intermediate step that causes the workers to exchange data amongst themselves. Check out your cluster master's application UI to see more information about how Spark is executing your application.