Five-minute DataFrame demo

In [1]:
import os

import findspark

# set spark_home to point to your Spark installation
spark_home = os.path.join(os.path.expanduser('~'), 'src/spark')
findspark.init(spark_home=spark_home)

import pyspark
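
If SPARK_HOME is already exported in your shell, findspark can locate Spark without an explicit path -- a small variant of the cell above:

import findspark

# with SPARK_HOME set in the environment, no argument is needed
findspark.init()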

Initialize the SparkSession, which gives us access to both the RDD and DataFrame APIs:

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .master("local[2]") \
    .appName("DataFrame demo") \
    .getOrCreate()
    
sc = spark.sparkContext
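
As a quick sanity check, the SparkContext exposes the master URL and default parallelism we just configured (here they should be local[2] and 2):

# confirm the master URL and thread count configured above
print(sc.master, sc.defaultParallelism)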

Read in some data and turn it into an RDD of lists:

In [3]:
file_path = os.path.join(spark_home, 'examples/src/main/resources/people.txt')

people_rdd = (sc.textFile('file://{0}'.format(file_path))
              .map(lambda line: line.split(',')))
In [4]:
people_rdd.first()
Out[4]:
['Michael', ' 29']
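
Note the leading space in the age field. int() will tolerate it below, but if you want clean strings throughout, you could strip each field at split time -- a sketch varying the cell above:

# optional variant: strip whitespace from every field while splitting
clean_rdd = (sc.textFile('file://{0}'.format(file_path))
               .map(lambda line: [field.strip() for field in line.split(',')]))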

Now we can use this data to create Row objects and convert the RDD into a DataFrame:

In [5]:
from pyspark.sql import Row

# build an RDD of Rows; int() tolerates the leading space in the age field
row_rdd = people_rdd.map(lambda x: Row(name=x[0], age=int(x[1])))

row_rdd.first()

df = spark.createDataFrame(row_rdd)
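
Equivalently, once a SparkSession exists it patches a toDF method onto RDDs, so the conversion can be a one-liner:

# equivalent shortcut: toDF() works on an RDD of Rows
df = row_rdd.toDF()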

When the DataFrame is constructed, the data type for each column is inferred:

In [6]:
df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
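
If you'd rather not rely on inference, createDataFrame also accepts an explicit schema built from pyspark.sql.types -- a sketch mirroring the inferred schema above:

from pyspark.sql.types import StructType, StructField, StringType, LongType

# declare the schema explicitly instead of letting Spark infer it
schema = StructType([
    StructField('name', StringType(), True),
    StructField('age', LongType(), True),
])
df_explicit = spark.createDataFrame(
    people_rdd.map(lambda x: (x[0], int(x[1]))), schema)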

In [7]:
df.first()
Out[7]:
Row(age=29, name='Michael')

There are some convenient methods for pretty-printing the contents of a DataFrame:

In [8]:
df.show()
+---+-------+
|age|   name|
+---+-------+
| 29|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+
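
show() also takes a row limit and a truncate flag, and a small DataFrame can be pulled into pandas on the driver (assuming pandas is installed):

# limit to two rows and disable column truncation
df.show(2, truncate=False)

# collect a *small* DataFrame into pandas on the driver
pdf = df.toPandas()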

Let's compare the RDD and DataFrame approaches -- we want to get all the people older than 20:

In [9]:
# using the usual RDD methods
people_rdd.filter(lambda x: int(x[1]) > 20).collect()
Out[9]:
[['Michael', ' 29'], ['Andy', ' 30']]
In [10]:
# using the DataFrame
df.filter(df.age > 20).take(20)
Out[10]:
[Row(age=29, name='Michael'), Row(age=30, name='Andy')]
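
The same query can also be written in SQL by registering the DataFrame as a temporary view -- the standard Spark SQL pattern:

# register the DataFrame as a temp view and query it with SQL
df.createOrReplaceTempView('people')
spark.sql('SELECT name, age FROM people WHERE age > 20').show()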

No need to write map functions when you can express the operation with the built-in column operations. You refer to columns as attributes of the DataFrame object:

In [11]:
# this is a column that you can use in arithmetic expressions
df.age
Out[11]:
Column<b'age'>
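
Columns can also be built by name with pyspark.sql.functions.col, which is handy when you don't have the DataFrame variable in scope:

from pyspark.sql.functions import col

# col('age') builds the same Column expression as df.age
df.filter(col('age') > 20).count()
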
In [12]:
df.select(df.age, (df.age*2).alias('times two')).show()
+---+---------+
|age|times two|
+---+---------+
| 29|       58|
| 30|       60|
| 19|       38|
+---+---------+
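
To keep the derived column instead of just displaying it, withColumn returns a new DataFrame with the extra column appended:

# keep the derived column in a new DataFrame
df_doubled = df.withColumn('times_two', df.age * 2)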

In [13]:
# equivalent RDD method
people_rdd.map(lambda x: int(x[1])*2).collect()
Out[13]:
[58, 60, 38]
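
When you're done, stop the session so the local executors are released:

# shut down the SparkSession and its underlying SparkContext
spark.stop()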