{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analyzing the Gutenberg Books Corpus - part 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this notebook, we will use the Gutenberg Corpus in the same form as last week.\n", "\n", "In the [first analysis notebook](https://github.com/dslab2018/dslab2018.github.io/blob/master/notebooks/DSLab_week7_gutenberg_corpus.ipynb) we explored various RDD methods and finished by building an N-gram viewer for the Project Gutenberg books. Now, we will use the corpus to train a simple language classification model using [Spark's machine learning library](http://spark.apache.org/docs/latest/mllib-guide.html) and Spark DataFrames.\n", "\n", "
SparkSession - in-memory; SparkContext: version v2.3.0, master local[2], app name Gutenberg text modelling