Five-minute DataFrame demo

In [1]:
import os

import findspark

# set spark_home to point to your Spark installation
spark_home = os.path.join(os.path.expanduser('~'), 'src/spark')
findspark.init(spark_home=spark_home)

import pyspark
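
If SPARK_HOME is already exported in your shell, findspark can locate Spark without an explicit path -- a small variant of the cell above:

import findspark

# with SPARK_HOME set in the environment, no argument is needed
findspark.init()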

Initialize the SparkSession, which gives us access to both the RDD and DataFrame APIs:

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("DataFrame demo") \
    .getOrCreate()
    
sc = spark.sparkContext
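
As a quick sanity check, the SparkContext exposes the master URL and default parallelism we just configured (here they should be local[2] and 2):

# confirm the master URL and thread count configured above
print(sc.master, sc.defaultParallelism)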

Read in some data and turn it into an RDD of lists:

In [3]:
file_path = os.path.join(spark_home, 'examples/src/main/resources/people.txt')

people_rdd = (sc.textFile('file://{0}'.format(file_path))
              .map(lambda line: line.split(',')))
In [4]:
people_rdd.first()
Out[4]:
['Michael', ' 29']
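
Note the leading space in the age field. int() will tolerate it below, but if you want clean strings throughout, you could strip each field at split time -- a sketch varying the cell above:

# optional variant: strip whitespace from every field while splitting
clean_rdd = (sc.textFile('file://{0}'.format(file_path))
               .map(lambda line: [field.strip() for field in line.split(',')]))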

Now we can use this data to create Row objects and convert the RDD into a DataFrame:

In [5]:
from pyspark.sql import Row

# build an RDD of Rows; int() tolerates the leading space in the age field
row_rdd = people_rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

row_rdd.first()

df = spark.createDataFrame(row_rdd)
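
Equivalently, once a SparkSession exists it patches a toDF method onto RDDs, so the conversion can be a one-liner:

# equivalent shortcut: toDF() works on an RDD of Rows
df = row_rdd.toDF()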

When the DataFrame is constructed, the data type for each column is inferred:

In [6]:
df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
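
If you'd rather not rely on inference, createDataFrame also accepts an explicit schema built from pyspark.sql.types -- a sketch mirroring the inferred schema above:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# declare the schema explicitly instead of letting Spark infer it
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', LongType(), True),
])
df_explicit = spark.createDataFrame(
    people_rdd.map(lambda x: (x[0], int(x[1]))), schema)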

In [7]:
df.first()
Out[7]:
Row(age=29, name='Michael')

There are some convenient methods for pretty-printing the contents of a DataFrame:

In [8]:
df.show()
+---+-------+
|age|   name|
+---+-------+
| 29|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+
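
show() also takes a row limit and a truncate flag, and a small DataFrame can be pulled into pandas on the driver (assuming pandas is installed):

# limit to two rows and disable column truncation
df.show(2, truncate=False)

# collect a *small* DataFrame into pandas on the driver
pdf = df.toPandas()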

Let's compare the RDD and DataFrame approaches -- we want to get all the people older than 20:

In [9]:
# using the usual RDD methods
people_rdd.filter(lambda x: int(x[1]) > 20).collect()
Out[9]:
[['Michael', ' 29'], ['Andy', ' 30']]
In [10]:
# using the DataFrame
df.filter(df.age > 20).take(20)
Out[10]:
[Row(age=29, name='Michael'), Row(age=30, name='Andy')]
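
The same query can also be written in SQL by registering the DataFrame as a temporary view -- the standard Spark SQL pattern:

# register the DataFrame as a temp view and query it with SQL
df.createOrReplaceTempView('people')
spark.sql('SELECT name, age FROM people WHERE age > 20').show()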

No need to write map functions when you can express the operation with the built-in column operations. You refer to columns as attributes of the DataFrame object:

In [11]:
# this is a column that you can use in arithmetic expressions
df.age
Out[11]:
Column<b'age'>
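
Columns can also be built by name with pyspark.sql.functions.col, which is handy when you don't have the DataFrame variable in scope:

from pyspark.sql.functions import col

# col('age') builds the same Column expression as df.age
df.filter(col('age') > 20).count()
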
In [12]:
df.select(df.age, (df.age*2).alias('times two')).show()
+---+---------+
|age|times two|
+---+---------+
| 29|       58|
| 30|       60|
| 19|       38|
+---+---------+
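
To keep the derived column instead of just displaying it, withColumn returns a new DataFrame with the extra column appended:

# keep the derived column in a new DataFrame
df_doubled = df.withColumn('times_two', df.age * 2)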

In [13]:
# equivalent RDD method
people_rdd.map(lambda x: int(x[1])*2).collect()
Out[13]:
[58, 60, 38]
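
When you're done, stop the session so the local executors are released:

# shut down the SparkSession and its underlying SparkContext
spark.stop()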