How to More Efficiently Load Parquet Files in Spark (pySpark v1.2.0)
I'm loading in high-dimensional Parquet files but only need a few columns. My current code looks like:
dat = sqc.parquetFile(path) \
      .filter(lambda r: len(r.a) > 0) \
      .map(lambda r: (r.a, r.b, r.c))
My mental model of what's happening is that it loads in all the data, then throws out the columns I don't want. I'd obviously prefer it not read in those columns at all, and from what I understand of Parquet that seems to be possible.
So there are two questions:
- Is my mental model wrong? Or is the Spark compiler smart enough to only read in columns a, b, and c in the example above?
- How can I force sqc.parquetFile() to read in the data more efficiently?
You should use the Spark DataFrame API: https://spark.apache.org/docs/1.3.0/sql-programming-guide.html#dataframe-operations
Something like:
dat.select("a", "b", "c").filter("length(a) > 0")
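To see how that fits together end to end, here is a minimal, self-contained sketch of the DataFrame approach (Spark 1.3+). The path and the column names a, b, c are placeholders taken from the question, and the SQLContext is assumed to be called sqc as in the original code:

# Minimal sketch of the DataFrame approach; path is a placeholder.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="parquet-column-pruning")
sqc = SQLContext(sc)
path = "/path/to/data.parquet"  # placeholder location

# parquetFile returns a DataFrame in 1.3+. select() restricts the scan to the
# requested columns, so Spark should read only those columns from the Parquet
# files instead of loading everything and discarding most of it afterwards.
# Note: length() assumes a string column; on some older Spark SQL versions it
# may only be available through a HiveContext.
dat = sqc.parquetFile(path)
result = dat.select("a", "b", "c").filter("length(a) > 0")
result.show()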
Or you can use Spark SQL:
dat.registerTempTable("dat")
sqc.sql("SELECT a, b, c FROM dat WHERE length(a) > 0")
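For completeness, a short usage sketch of the SQL route, reusing the sqc and path placeholders from the sketch above. This variant also applies to the SchemaRDD returned by parquetFile in 1.2.x, though on that version the length() function may require a HiveContext rather than the basic SQLContext parser:

# Register the loaded Parquet data as a temp table, then query only the needed columns.
dat = sqc.parquetFile(path)   # SchemaRDD in 1.2.x, DataFrame in 1.3+
dat.registerTempTable("dat")

# The projection in the SQL lets Spark's Parquet support prune unneeded columns at read time.
pruned = sqc.sql("SELECT a, b, c FROM dat WHERE length(a) > 0")
print(pruned.take(5))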