DSLab Week 8
Part 2/3: Spark DataFrames
An API for computing with structured data (if it’s a table, it fits)
Some key points:
- does not require using basic RDD methods (`map`, `reduce`, etc.)
- data in a DataFrame follows a specified “schema” that tells Spark what kind of data is in each column
- the basic data unit is a `Row`, which contains a value for each of the columns
- can automatically read data from a variety of formats, e.g. JSON, Parquet, and some database formats (through plugins); see the sketch after this list
- can give some nice performance improvements!
- Spark can optimize the execution because it knows more about the data layout
- most of the work can take place in the JVM without needing to ship data to Python
- operations can be ‘pushed down’ to the optimized DB layer, e.g. filtering
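A minimal sketch of these points, assuming PySpark is installed; the file name `events.json` and its columns (`user`, `amount`) are made up for illustration:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# Reading a structured format gives a DataFrame with a schema
df = spark.read.json("events.json")   # hypothetical file
df.printSchema()                      # column names and types

# The basic data unit is a Row, one value per column
print(df.first())                     # e.g. Row(amount=3.5, user='alice')

# Built-in column expressions instead of RDD map/reduce
totals = (df
          .filter(F.col("amount") > 0)
          .groupBy("user")
          .agg(F.sum("amount").alias("total")))
totals.show()
```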
---
PySpark architecture overview
<img src="figs/pyspark_architecture.png" height="400px">
- Spark workers pull data from sources into the JVM
- Data is actually processed in Python subprocesses
- (De)serialization and streaming happen at every step (see the sketch below)
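A sketch of the cost of crossing the JVM/Python boundary; the toy DataFrame and the `slow_double` UDF are made up and only serve to contrast a Python UDF with the equivalent built-in expression:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()
df = spark.createDataFrame([(1.0,), (2.5,), (4.0,)], ["value"])  # toy data

# Python UDF: each Row is serialized, sent to a Python worker subprocess,
# evaluated there, and serialized back into the JVM
slow_double = F.udf(lambda x: x * 2.0, DoubleType())
df.select(slow_double("value").alias("doubled")).show()

# Equivalent built-in expression: stays entirely inside the JVM,
# no (de)serialization round trip
df.select((F.col("value") * 2.0).alias("doubled")).show()
```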
---
PySpark DataFrame optimization
<img src="figs/databricks_catalyst.png" height="200px">
- Because the data is typed and built-in functions are used, the Catalyst optimizer can rewrite and optimize operations
- Certain operations (e.g. filtering) can be pushed one level lower (to the DB, for example)
- In the end, it still runs on RDDs, so you can always fall back to the lower-level API (see the sketch below)
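A sketch using `explain()` to inspect the optimized plan and `.rdd` to drop down to the RDD API; the Parquet file name and column are hypothetical:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Hypothetical Parquet source; any source supporting filter pushdown works
df = spark.read.parquet("events.parquet")

# explain(True) prints the logical and physical plans; with a Parquet scan
# the filter typically appears as a "PushedFilters" entry on the scan node
df.filter(F.col("amount") > 100).explain(True)

# A DataFrame is still backed by an RDD of Rows, so the lower-level API
# remains available
print(df.rdd.take(1))
```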
---
Useful Links