python - Spark MLlib - trainImplicit warning


I keep seeing these warnings when using trainImplicit:

WARN TaskSetManager: Stage 246 contains a task of very large size (208 KB). The maximum recommended task size is 100 KB.

And the task size keeps increasing. I tried calling repartition() on the input RDD, but the warnings stay the same.
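For context, this is roughly how the model is trained; the input path, the parsing, and the parameter values below are placeholders, not my real job:

    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="ALSImplicitExample")

    # Hypothetical input: "user,item,count" lines.
    raw = sc.textFile("hdfs:///data/interactions.csv")
    ratings = raw.map(lambda line: line.split(",")) \
                 .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))

    # Repartitioning the input RDD does not make the warnings go away.
    ratings = ratings.repartition(200)

    model = ALS.trainImplicit(ratings, rank=20, iterations=10, alpha=0.01)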

All these warnings come from the ALS iterations, from flatMap and from aggregate. For instance, this is the origin of the stage where flatMap shows these warnings (with Spark 1.3.0, but they are also shown in Spark 1.3.1):

    org.apache.spark.rdd.RDD.flatMap(RDD.scala:296)
    org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1065)
    org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:530)
    org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
    scala.collection.immutable.Range.foreach(Range.scala:141)
    org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
    org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)

And this is the origin for aggregate:

    org.apache.spark.rdd.RDD.aggregate(RDD.scala:968)
    org.apache.spark.ml.recommendation.ALS$.computeYtY(ALS.scala:1112)
    org.apache.spark.ml.recommendation.ALS$.org$apache$spark$ml$recommendation$ALS$$computeFactors(ALS.scala:1064)
    org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:538)
    org.apache.spark.ml.recommendation.ALS$$anonfun$train$3.apply(ALS.scala:527)
    scala.collection.immutable.Range.foreach(Range.scala:141)
    org.apache.spark.ml.recommendation.ALS$.train(ALS.scala:527)
    org.apache.spark.mllib.recommendation.ALS.run(ALS.scala:203)

A similar problem is described in the Apache Spark user mailing list: http://apache-spark-user-list.1001560.n3.nabble.com/large-task-size-td9539.html

I think you can try to play with the number of partitions (using the repartition() method), depending on how many hosts, how much RAM, and how many CPUs you have.
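For example, here is a sketch that derives the partition count from the parallelism Spark already knows about instead of hard-coding a number; the multiplier of 3 is just a common rule of thumb (roughly 2-4 tasks per core), not something Spark prescribes:

    # Assumes `sc` and `ratings` as in the question; sc.defaultParallelism
    # reflects the total number of cores the application sees.
    num_partitions = sc.defaultParallelism * 3
    ratings = ratings.repartition(num_partitions)

    model = ALS.trainImplicit(ratings, rank=20, iterations=10, alpha=0.01)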

Also try to investigate all the steps via the web UI (normally at http://&lt;driver&gt;:4040 while the application runs), where you can see the number of stages, the memory usage on each stage, and the data locality.

Or just ignore these warnings, as long as everything works correctly and fast enough.

This notification is hard-coded in Spark (scheduler/TaskSetManager.scala):

    if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
        !emittedTaskSizeWarning) {
      emittedTaskSizeWarning = true
      logWarning(s"Stage ${task.stageId} contains a task of very large size " +
        s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
        s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
    }

The threshold itself is defined as:

    private[spark] object TaskSetManager {
      // The user will be warned if any stages contain a task that has a serialized
      // size greater than this.
      val TASK_SIZE_TO_WARN_KB = 100
    }
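Since TASK_SIZE_TO_WARN_KB is a plain val rather than a configuration property, there is no spark.* setting to raise the threshold or disable the check. If you conclude the warning is harmless noise for your job, one option, assuming the stock log4j setup that Spark 1.3 uses, is to raise the log level for just that class in conf/log4j.properties:

    # Hide WARN messages (including the task-size one) from TaskSetManager;
    # errors from this class are still logged.
    log4j.logger.org.apache.spark.scheduler.TaskSetManager=ERROR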
