Lab in Data Science

Website for the EPFL Lab in Data Science 2019

Final Assignment: Robust Journey Planning

The final assignment is to be done in groups of 4 or, preferably, 5.

Important dates

The assignment (a clear, well-annotated, report-like notebook), together with a short video of your presentation (7 minutes maximum), is due on June 18th at 23:59 (note the change of date).

Instead of an oral defense as initially planned, we will organize short Q&A discussions of 10 minutes per group. The Q&A sessions will be scheduled on June 20th, from 10am to noon and from 2pm to 4pm.

Problem Motivation

Imagine you are a regular user of the public transport system, and you are checking the operator’s schedule to meet your friends for a class reunion. The choices are:

  1. You could leave in 10 minutes, and arrive with enough time to spare for gossip before the reunion starts.

  2. You could leave now on a different route and arrive just in time for the reunion.

Undoubtedly, if this is the only information available, most of us will opt for option 1.

Now suppose we tell you that option 1 carries a fifty percent chance of missing a connection and being late for the reunion, whereas option 2 is almost guaranteed to get you there on time. Would you still consider option 1?

Probably not. However, most public transport applications will insist on the first option. This is because they are programmed to plan routes that offer the shortest travel times, without considering the risk factors.

Problem Description

In this final project you will build your own public transport route planner to improve on that. You will reuse the SBB dataset (see the next section, Dataset Description).

Given a desired departure or arrival time, your route planner will compute the fastest route between two stops within a provided uncertainty tolerance expressed as interquartiles. For instance: “What route from A to B is the fastest at least Q% of the time if I want to leave from A (resp. arrive at B) at instant T?” Note that uncertainty here measures the risk that a route is not feasible within the time computed by the algorithm.
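To make the "at least Q% of the time" criterion concrete, here is a minimal sketch of one possible reading, not a prescribed method. It assumes that each connection has a list of historically observed arrival delays and a transfer slack in seconds, that delays at different connections are statistically independent, and that the connecting service departs on schedule. All function names and the route dictionary layout are hypothetical.

```python
def connection_success_probability(arrival_delays_s, slack_s):
    """Empirical share of past arrivals whose delay did not exceed the slack."""
    if not arrival_delays_s:
        raise ValueError("need at least one observed delay")
    made = sum(1 for d in arrival_delays_s if d <= slack_s)
    return made / len(arrival_delays_s)


def route_success_probability(connections):
    """Multiply per-connection probabilities (ASSUMES independent delays)."""
    p = 1.0
    for delays, slack in connections:
        p *= connection_success_probability(delays, slack)
    return p


def fastest_feasible_route(routes, q):
    """Return the shortest route whose success probability is at least q."""
    feasible = [
        r for r in routes
        if route_success_probability(r["connections"]) >= q
    ]
    return min(feasible, key=lambda r: r["travel_time_s"], default=None)


# Hypothetical candidates: a fast route with one risky transfer,
# and a slower direct route with no transfer at all.
routes = [
    {"name": "fast, one transfer", "travel_time_s": 1500,
     "connections": [([0, 60, 200, 300], 120)]},
    {"name": "slow, direct", "travel_time_s": 2400, "connections": []},
]

cautious = fastest_feasible_route(routes, q=0.9)
relaxed = fastest_feasible_route(routes, q=0.5)
```

With these made-up numbers the risky transfer succeeds in half of the observed cases, so a tolerance of 0.9 selects the direct route while a tolerance of 0.5 accepts the faster one.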

In order to answer this question you will need to:

Solving this problem accurately can be difficult. You are allowed a few simplifying assumptions:

Dataset Description

For this project we will use the data published by the Open Data Platform Swiss Public Transport (https://opentransportdata.swiss).

You can find the dataset in the following two places.

The folder contains the actual data istdaten and the station list data BFKOORD_GEO, which you can also get here

Format: the dataset is presented as a collection of text files with fields separated by ‘;’ (semicolon). There is one file per day.
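As an illustration of this layout, the sketch below reads a semicolon-separated file with Python's standard `csv` module. The sample text is made up: the presence of a header row, the `HALTESTELLEN_NAME` column, and the timestamp layout are assumptions for the example and should be checked against the actual files.

```python
import csv
import io

# Made-up sample in the ';'-separated layout. The header row, the column
# subset and the timestamp layout are ASSUMPTIONS for illustration only.
SAMPLE = (
    "HALTESTELLEN_NAME;ANKUNFTSZEIT;AN_PROGNOSE;AN_PROGNOSE_STATUS\n"
    "Zürich HB;13.05.2019 08:00;13.05.2019 08:02:10;PROGNOSE\n"
    "Zürich Oerlikon;;;\n"
)


def read_rows(text_stream):
    """Return the rows of one daily file as dicts keyed by column name."""
    reader = csv.DictReader(text_stream, delimiter=";")
    return list(reader)


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row["HALTESTELLEN_NAME"], repr(row["ANKUNFTSZEIT"]))
```

For a real file, the same function can be given a handle from `open(path, newline="", encoding="utf-8")`, where the encoding is also an assumption. Empty fields come back as empty strings, which matters for the missing arrival and departure times.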

Unfortunately, the full description from opentransportdata.swiss is only provided in German. You can use an automated translator (DeepL seems to provide a better translation at the time of writing) to get more information, but here are the relevant column descriptions:

Each line of the file represents a stop and contains arrival and departure times. When the stop is the start or end of a journey, the corresponding columns will be empty (ANKUNFTSZEIT/ABFAHRTSZEIT). In some cases, the actual times were not measured so the AN_PROGNOSE_STATUS/AB_PROGNOSE_STATUS will be empty or set to PROGNOSE and AN_PROGNOSE/AB_PROGNOSE will be empty.
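The rules above translate into a small filter when computing delays. The following sketch returns the arrival delay of one row, or `None` when the row cannot be used. The timestamp formats and the status value `REAL` in the sample rows are assumptions, and the function name is hypothetical; only the column names come from the description above.

```python
from datetime import datetime

# ASSUMED timestamp layouts; verify them against the actual files.
SCHEDULED_FMT = "%d.%m.%Y %H:%M"
ACTUAL_FMT = "%d.%m.%Y %H:%M:%S"


def arrival_delay_seconds(row):
    """Return actual minus scheduled arrival in seconds, or None if unusable."""
    scheduled = row.get("ANKUNFTSZEIT", "")
    actual = row.get("AN_PROGNOSE", "")
    status = row.get("AN_PROGNOSE_STATUS", "")
    # Empty scheduled arrival: the stop is the start of a journey.
    # Empty or PROGNOSE status, or empty actual time: not measured.
    if not scheduled or not actual or status in ("", "PROGNOSE"):
        return None
    delta = (datetime.strptime(actual, ACTUAL_FMT)
             - datetime.strptime(scheduled, SCHEDULED_FMT))
    return delta.total_seconds()


# Hypothetical rows; "REAL" stands in for a measured status value.
measured = {"ANKUNFTSZEIT": "13.05.2019 08:00",
            "AN_PROGNOSE": "13.05.2019 08:02:10",
            "AN_PROGNOSE_STATUS": "REAL"}
forecast_only = dict(measured, AN_PROGNOSE_STATUS="PROGNOSE")
first_stop = {"ANKUNFTSZEIT": "", "AN_PROGNOSE": "", "AN_PROGNOSE_STATUS": ""}

delays = [arrival_delay_seconds(r) for r in (measured, forecast_only, first_stop)]
```

The same pattern applies to departures with ABFAHRTSZEIT, AB_PROGNOSE and AB_PROGNOSE_STATUS. Whether to drop unmeasured rows or treat them differently is a modelling choice left to you.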

We will use the SBB data restricted to the Zurich area, focusing on all the stops within 10 km of the Zurich train station.
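One way to apply the 10 km restriction is a great-circle (haversine) distance filter on the station coordinates. The sketch below assumes that the distance is meant as a straight line, uses approximate coordinates for the Zurich main station, and works on a hypothetical list of `(name, latitude, longitude)` tuples rather than the actual station file.

```python
import math

# Approximate (lat, lon) of the Zurich main station; an ASSUMPTION to verify.
ZURICH_HB = (47.378177, 8.540192)
EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def stops_within(stops, center=ZURICH_HB, radius_km=10.0):
    """Keep the names of stops at most radius_km away from center."""
    return [name for name, lat, lon in stops
            if haversine_km(lat, lon, *center) <= radius_km]


# Hypothetical stop list with approximate coordinates.
stops = [("Zürich Oerlikon", 47.4115, 8.5441), ("Bern", 46.9490, 7.4391)]
nearby = stops_within(stops)
```

The spherical-Earth approximation is more than accurate enough at this scale; the main thing to verify is the reference point you choose for the station.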

Grading Method

We will grade both your Jupyter-based report (60%) and your 15-minute oral presentation (40%).

We will use the following criteria:

  1. The clarity and conciseness of the written and oral reports (written: 15 pts, oral: 10 pts)
  2. The formulation of the problem and its decomposition into smaller tasks (written: 5 pts, oral: 5 pts)
  3. The originality of the solution (system design, analytics, visualization) (written: 10 pts, oral: 5 pts)
  4. The quality of the solution (system design, analytics and associated implementation) (written: 20 pts, oral: 10 pts)
  5. The explanation of the pros and cons / shortcomings of the proposed solution (written: 10 pts, oral: 10 pts)

The solution and associated implementation & explanations will be weighted across the different parts as follows:

Hints

Before you get started, we offer a few hints:

References

We offer a list of useful references as a starting point:

FAQ

This section will be updated with the Frequently Asked Questions during the course of this project. Please stay tuned.

1 - Q: Do we need to take into account walking times at the connections?

2 - Q: Can we assume statistical independence between the observed delays?

3 - Q: Can I take advantage of the fact that a connection departs late most of the time to allow a plan that would otherwise not be possible according to the official schedule?