distributed computing - Spark collectAsMap

- February 15, 2015

I would like to know how collectAsMap works in Spark. More specifically, I would like to know where the aggregation of the data from all partitions takes place. The aggregation takes place either on the master or on the workers. In the first case, each worker sends its data to the master, and once the master has collected the data from each worker, the master aggregates the results. In the second case, the workers are responsible for aggregating the results (after exchanging data among themselves), and only after that are the results sent to the master.

It is critical for me to find a way for the master to collect the data from each partition separately, without the workers exchanging data.

You can see how collectAsMap is implemented in the Spark source. Since the RDD type is a tuple, it looks like it just uses the normal RDD collect and then translates the tuples into a map of key/value pairs. The comment there does mention that a multi-map isn't supported, so you need a 1-to-1 key/value mapping across your data.

collectAsMap function
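As a rough illustration of the tuple-to-map step described above, here is a pure-Python sketch. It is not Spark's code: `collect_as_map` is a hypothetical helper, and the input list stands in for whatever `collect` has already returned to the driver. It also shows why a multi-map is not supported: with duplicate keys only one value per key survives (in this sketch, the last one seen).

```python
# Pure-Python sketch, not Spark code. The list of tuples stands in for
# the result that collect() has already brought back to the driver.

def collect_as_map(collected_pairs):
    """Build a plain dict from already-collected (key, value) tuples."""
    result = {}
    for key, value in collected_pairs:
        # A later value for the same key overwrites the earlier one,
        # so this cannot represent a multi-map.
        result[key] = value
    return result


pairs = [("a", 1), ("b", 2), ("a", 3)]
print(collect_as_map(pairs))  # {'a': 3, 'b': 2}
```

Note that this step runs entirely on the driver. Nothing in it requires the workers to talk to each other.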

What collect does is execute a Spark job, get the results from each partition on the workers, and aggregate them with a reduce/concat phase on the driver.

collect function
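The behaviour described above (run one task per partition, then concatenate on the driver) can be sketched in plain Python. This is a simulation under stated assumptions, not Spark's implementation: partitions are modelled as ordinary lists, and `run_job` and `collect` are hypothetical names chosen to mirror the description.

```python
# Pure-Python simulation, not Spark code. Each inner list plays the role
# of one partition living on some worker.

def run_job(partitions, func):
    """Apply func to each partition independently.

    Returns one result per partition, in partition order. No partition
    ever sees another partition's data.
    """
    return [func(iter(part)) for part in partitions]


def collect(partitions):
    """Gather every partition's elements, then concatenate on the 'driver'."""
    per_partition = run_job(partitions, list)  # one list per partition
    merged = []
    for chunk in per_partition:  # the concat phase
        merged.extend(chunk)
    return merged


partitions = [[1, 2], [3], [4, 5, 6]]
print(collect(partitions))  # [1, 2, 3, 4, 5, 6]
```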

So given that, it should be the case that the driver collects the data from each partition separately, without the workers exchanging data, to perform collectAsMap.
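If the goal is for the driver to keep each partition's data separate rather than merged into one map, one option is to skip the concat step and build one map per partition. The sketch below simulates that in plain Python; `collect_per_partition` is a hypothetical helper, not a Spark API. In Spark itself, something along the lines of `rdd.glom().collect()` should give a similar one-list-per-partition result, though that is a suggestion to verify against your Spark version rather than something the answer above states.

```python
# Pure-Python simulation, not Spark code. Each inner list is one partition
# of (key, value) tuples.

def collect_per_partition(partitions):
    """Return one dict per partition, in partition order.

    Partitions are never merged, so duplicate keys in different
    partitions stay visible to the caller.
    """
    return [dict(part) for part in partitions]


partitions = [[("a", 1), ("b", 2)], [("a", 3)]]
print(collect_per_partition(partitions))  # [{'a': 1, 'b': 2}, {'a': 3}]
```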

Note that if you are doing transformations on your RDD prior to using collectAsMap that cause a shuffle to occur, there may be an intermediate step that causes the workers to exchange data amongst themselves. Check out your cluster master's application UI to see more information about how Spark is executing your application.
