How to More Efficiently Load Parquet Files in Spark (pySpark v1.2.0)


I'm loading in high-dimensional Parquet files but only need a few columns. My current code looks like:

dat = sqc.parquetFile(path) \
          .filter(lambda r: len(r.a) > 0) \
          .map(lambda r: (r.a, r.b, r.c))

My mental model of what's happening is that it loads in all the data and then throws out the columns I don't want. I'd prefer it not read in those columns at all, and from what I understand about Parquet that seems to be possible.

So there are two questions:

  1. Is my mental model wrong? Or is the Spark compiler smart enough to read in only columns a, b, and c in the example above?
  2. How can I force sqc.parquetFile() to read the data in more efficiently?

You should use the Spark DataFrame API: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframe-operations

Something like:

dat.select("a", "b", "c").filter("length(a) > 0")
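
For context, here is a minimal end-to-end sketch of that approach (assuming Spark 1.3+, a freshly created SparkContext, and a made-up Parquet path; only the select/filter chain comes from the snippet above). Selecting the columns before filtering lets Spark's Parquet reader scan just a, b, and c instead of every column on disk:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-column-pruning")   # hypothetical app name
sqc = SQLContext(sc)

# parquetFile() returns a DataFrame in Spark 1.3+, so the column selection and
# the length() predicate are handled by Spark SQL rather than by Python lambdas.
dat = sqc.parquetFile("/path/to/wide_table.parquet")  # hypothetical path
result = dat.select("a", "b", "c").filter("length(a) > 0")

result.show()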

Or you can use Spark SQL:

dat.registerTempTable("dat")
sqc.sql("SELECT a, b, c FROM dat WHERE length(a) > 0")
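
If you want to confirm that the pruning actually happens, one option (an addition here, not part of the original answer) is to inspect the physical plan, continuing from the registerTempTable() call above:

pruned = sqc.sql("SELECT a, b, c FROM dat WHERE length(a) > 0")
# The printed physical plan should show a Parquet scan limited to columns
# a, b and c; the exact output format differs between Spark versions.
pruned.explain()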
