Spark - Naive Bayes classifier value error
I have the following issue when training a Naive Bayes classifier. I'm getting this error:
File "/home/juande/desktop/spark-1.3.0-bin-hadoop2.4/python/pyspark/mllib/classification.py", line 372, in train
    return NaiveBayesModel(labels.toArray(), pi.toArray(), numpy.array(theta))
ValueError: invalid __array_struct__
when training the model using these lines:
dataframe = dataframe.map(lambda x: LabeledPoint(sections_to_number[x[4]], tf.transform([x[0], x[1], x[2], x[3]])))
model = NaiveBayes.train(dataframe, 1.0)
where sections_to_number is a dictionary that maps string values to float numbers, for example sports -> 0, weather -> 1, and so on.
However, if I train using a constant number instead of the sections_to_number mapping, I get no error:
dataframe = dataframe.map(lambda x: LabeledPoint(10.0, tf.transform([x[0], x[1], x[2], x[3]])))
model = NaiveBayes.train(dataframe, 1.0)
Am I missing something? Thanks.
NaiveBayes in the Spark ML package expects a DataFrame with two columns, label and feature. The label column is the target or class, and the feature column is an org.apache.spark.ml.linalg.Vector. For a numeric/continuous dataset, the feature column is created as a vector directly. A categorical dataset first needs to be converted to numeric using OneHotEncoder or one of the other feature extraction techniques described at http://spark.apache.org/docs/latest/ml-features.html#stringindexer.
E.g. categorical values such as foo and baar are mapped to 0 and 1 and encoded as a vector of doubles, and the DataFrame with the label and feature columns is then passed to the algorithm.
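The encoding idea described in the answer can be sketched in plain Python. This is only an illustration of the concept, not Spark's API: the helpers `build_index` and `one_hot` and the sample rows are hypothetical, and the sketch assumes indices are assigned in order of first appearance and that every category gets its own slot in the vector. Spark's own transformers may order indices and size vectors differently.

```python
# Plain-Python illustration of the idea; this is not Spark's API.

def build_index(values):
    """Map each distinct string to a float index, in order of first appearance."""
    index = {}
    for value in values:
        if value not in index:
            index[value] = float(len(index))
    return index


def one_hot(position, size):
    """Return a list of doubles with 1.0 at `position` and 0.0 elsewhere."""
    vector = [0.0] * size
    vector[int(position)] = 1.0
    return vector


# Hypothetical sample rows: (section label, categorical feature)
rows = [("sports", "foo"), ("weather", "baar"), ("sports", "baar")]

label_index = build_index(label for label, _ in rows)
feature_index = build_index(feature for _, feature in rows)

# Each row becomes (numeric label, feature vector of doubles).
encoded = [
    (label_index[label], one_hot(feature_index[feature], len(feature_index)))
    for label, feature in rows
]
print(encoded)
```

Each encoded row pairs a float label with a vector of doubles, which is the shape of data the answer says the classifier expects.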