Week 5 Data Science Lab

Introduction

This week’s exercises can all be conducted on the EPFL Big Data cluster provided for the class.

If you wish to experiment with advanced features that require admin privileges, we encourage you to try the Hortonworks HDP Sandbox. From there, you will have the option to download the Sandbox for VirtualBox or for Docker.

Exercise series 1 - HDFS

First steps

In this series of exercises you will experiment with the Hadoop Distributed File System (HDFS).

Open a terminal on your laptop and ssh to the cluster with your EPFL gaspar username and password.

ssh your-gaspar-username@iccluster042.iccluster.epfl.ch

To get started with HDFS, type:

hdfs dfs

Notice how most of the commands behave like corresponding Unix commands.
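For reference, here is a minimal sketch of a few hdfs dfs commands and their Unix counterparts (the file path used below is only a placeholder):

# list the contents of the HDFS root directory (like ls /)
hdfs dfs -ls /
# print the content of a file stored on HDFS (like cat)
hdfs dfs -cat /path/to/some-file
# show the built-in help for a specific command
hdfs dfs -help ls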

Exercises

As a first exercise, you will explore the content of the cluster’s HDFS file system using the hdfs dfs command, and walk your way down the HDFS directory structure from there.

  1. We have created a directory on HDFS for each of you. Can you find yours?
  2. Create a folder work1 in your HDFS directory and change the access rights so that only you and the group hadoop can read and write into it.
  3. Copy the 2017 Traffic Count data published by the Calderdale Metropolitan Borough Council (UK) to your work1 directory. A copy of the data is also available from the dslab 2019 GitHub repository.

Hints

  1. You can get started with the hdfs dfs -ls / command and walk your way down from there.
  2. Use absolute paths in your hdfs dfs commands.
  3. HDFS does not like spaces in filenames.
  4. Use the scp or wget commands to copy the data locally into your home directory, then use one of the hdfs dfs commands to copy the local file to your HDFS directory (see the sketch below).
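The following is a minimal sketch of the commands involved, assuming a hypothetical download URL and file name; adapt the paths, group, and permissions to your own situation:

# download a local copy of the data into your home directory on the cluster
# (the URL below is a placeholder for the actual data source)
wget https://example.org/traffic-count-2017.csv

# create the work1 folder in your HDFS directory
hdfs dfs -mkdir /path/to/your-hdfs-directory/work1

# assign it to the group hadoop and allow read/write only for you and that group
hdfs dfs -chgrp hadoop /path/to/your-hdfs-directory/work1
hdfs dfs -chmod 770 /path/to/your-hdfs-directory/work1

# copy the local file into the work1 folder on HDFS
hdfs dfs -put traffic-count-2017.csv /path/to/your-hdfs-directory/work1/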

Exercise series 2 - Hive

Hive is a data warehouse built on top of Apache Hadoop. It is used to organize your data into tables on HDFS, and to execute SQL-like queries on them as MapReduce jobs.
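To give a flavour of what this looks like, here is a minimal sketch run from a terminal with the hive command (the table name, columns, and HDFS path are made up for this example):

# organize a CSV file that already sits on HDFS into a Hive table
hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS traffic_count (site STRING, vehicles INT)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/path/to/your-hdfs-directory/work1';"

# run a SQL-like query on that table; Hive executes it as a MapReduce job
hive -e "SELECT site, SUM(vehicles) FROM traffic_count GROUP BY site;"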

For this series of exercises we will use the Zeppelin notebook. This is yet another commonly used notebook, similar to the Jupyter notebooks you are already familiar with. We are using this notebook because it is installed by default with the Hortonworks HDP distribution.

Open a browser and log in to the Zeppelin UI with your EPFL gaspar username and password at https://iccluster042.iccluster.epfl.ch:9995/ .

Once logged in, from your Zeppelin homepage, select the import note option, and copy the URL of this class’s notebook. We recommend that you use a name of the form /your-gaspar-username/week5 for the notebook.

You can now open the notebook in Zeppelin and start working on the exercises. As a bonus, you can repeat the same exercises using the hive command in a terminal, as sketched below.
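A minimal sketch of that terminal workflow, assuming you are already connected to the cluster via ssh:

# start an interactive Hive session
hive
# ... or run a single statement without entering the interactive shell
hive -e "SHOW TABLES;"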

Data source in this exercise: the Twitter stream grab, provided courtesy of the Internet Archive.