Website for the EPFL Lab in Data Science 2019
This week’s exercises can all be conducted on the Big Data cluster provided for the class; you will be using this EPFL cluster throughout.
If you wish to experiment with advanced features that require admin privileges, we encourage you to try the Hortonworks HDP Sandbox. From there, you can download the Sandbox for either VirtualBox or Docker.
In this series of exercises you will experiment with the Hadoop Distributed File System (HDFS).
Open a terminal on your laptop and ssh to the cluster with your EPFL gaspar username and password.
ssh your-gaspar-username@iccluster042.iccluster.epfl.ch
To get started with HDFS, type:
hdfs dfs
Notice how most of the commands behave like corresponding Unix commands.
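For instance, many of the subcommands are named after the Unix commands they mimic. The paths below are only placeholders to illustrate the pattern; adapt them to your own username and files.
# List the available subcommands
hdfs dfs -help
# Like ls: list the contents of your HDFS home directory
hdfs dfs -ls /user/your-gaspar-username
# Like mkdir: create a new directory in HDFS
hdfs dfs -mkdir /user/your-gaspar-username/test
# Like cat: print an HDFS file to the terminal (somefile.txt is a placeholder)
hdfs dfs -cat /user/your-gaspar-username/somefile.txt
# Like rm: delete an HDFS file
hdfs dfs -rm /user/your-gaspar-username/somefile.txt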
As a first exercise, you will explore the content of the cluster’s HDFS file system using the hdfs dfs command.
Start with the
hdfs dfs -ls /
command and walk your way down the HDFS directory structure from there.

Use the scp or the wget commands to copy the data locally into your home directory, and one of the hdfs dfs commands to copy the local file to your HDFS directory.
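As a sketch of that workflow (the URL and file name below are placeholders, not the actual dataset location), you could fetch a file with wget and then push it to HDFS with hdfs dfs -put:
# Download a data file to your local home directory (the URL is a placeholder)
cd ~ && wget https://example.org/path/to/twitter-sample.json.bz2
# Copy the local file into your HDFS home directory
hdfs dfs -put twitter-sample.json.bz2 /user/your-gaspar-username/
# Check that the file is now on HDFS
hdfs dfs -ls /user/your-gaspar-username/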
Hive is a data warehouse built on top of Apache Hadoop. It is used to organize your data into tables on HDFS, and to execute SQL-like queries on them as MapReduce jobs.
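To make this concrete, here is a minimal sketch of how a Hive table over files already stored in HDFS could be created and queried from a terminal with the hive command. The table name and HDFS path are hypothetical; the actual table definitions for this exercise are given in the notebook.
# Run a HiveQL statement from the terminal with hive -e
hive -e "SHOW DATABASES;"
# Declare a hypothetical external table over data already sitting in HDFS
hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS my_tweets (json STRING) LOCATION '/user/your-gaspar-username/tweets';"
# SQL-like queries on the table are executed as MapReduce jobs
hive -e "SELECT COUNT(*) FROM my_tweets;"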
For this series of exercises we will use the Zeppelin notebook. This is yet another commonly used notebook environment, similar to the Jupyter notebooks you are already familiar with. We are using it because it is installed by default with the Hortonworks HDP distribution.
Open a browser and log in to the Zeppelin UI at https://iccluster042.iccluster.epfl.ch:9995/ with your EPFL gaspar username and password.
Once logged in, from your Zeppelin homepage, select the import note option and provide the URL of this class’s notebook. We recommend that you use a name of the form /your-gaspar-username/week5 for the notebook.
You can now open the notebook in Zeppelin and start working on the exercises. As a bonus you can repeat the same exercises using the hive command in a terminal.
Data source for this exercise: the Twitter stream grab, provided thanks to the Internet Archive.