How to More Efficiently Load Parquet Files in Spark (pySpark v1.2.0)


I'm loading in high-dimensional Parquet files but only need a few columns. My current code looks like:

dat = sqc.parquetFile(path) \
          .filter(lambda r: len(r.a) > 0) \
          .map(lambda r: (r.a, r.b, r.c))

My mental model of what's happening is that it loads in all the data and then throws out the columns I don't want. I'd prefer it not read in those columns at all, and from what I understand about Parquet that seems to be possible.

So there are two questions:

  1. Is my mental model wrong? Or is the Spark compiler smart enough to read in only columns a, b, and c in the example above?
  2. How can I force sqc.parquetFile() to read the data in more efficiently?

You should use the Spark DataFrame API: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframe-operations

Something like:

dat.select("a", "b", "c").filter("length(a) > 0")
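
For context, here is a minimal end-to-end sketch of that approach (assuming Spark 1.3+, a freshly created SparkContext, and a made-up Parquet path; only the select/filter chain comes from the snippet above). Selecting the columns before filtering lets Spark's Parquet reader scan just a, b, and c instead of every column on disk:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-column-pruning")   # hypothetical app name
sqc = SQLContext(sc)

# parquetFile() returns a DataFrame in Spark 1.3+, so the column selection and
# the length() predicate are handled by Spark SQL rather than by Python lambdas.
dat = sqc.parquetFile("/path/to/wide_table.parquet")  # hypothetical path
result = dat.select("a", "b", "c").filter("length(a) > 0")

result.show()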

Or you can use Spark SQL:

dat.registerTempTable("dat")
sqc.sql("SELECT a, b, c FROM dat WHERE length(a) > 0")
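
If you want to confirm that the pruning actually happens, one option (an addition here, not part of the original answer) is to inspect the physical plan, continuing from the registerTempTable() call above:

pruned = sqc.sql("SELECT a, b, c FROM dat WHERE length(a) > 0")
# The printed physical plan should show a Parquet scan limited to columns
# a, b and c; the exact output format differs between Spark versions.
pruned.explain()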
