User Guide

This guide walks through the practical workflow for preparing and processing the data used in rdmpy.

Data Download

Before you can clean and preprocess the data, you need to download the necessary files from the Rail Data Marketplace.

Where to Find the Data

All required datasets are available from the Rail Data Marketplace (RDM). You will need to create an account to access these files.

Required Files

You need to download two main datasets:

NWR Historic Delay Attribution (Transparency Data)
NWR Schedule Data

File Specifications and Location

For Delays - Search “NWR Historic Delay Attribution”

Under “data files”, you will find .zip files organized by year and period. Download and extract them to find files named:

Transparency_23-24_P12.csv
Transparency_23-24_P13.csv
Transparency_24-25_P01.csv
...

File Naming Convention:

Transparency refers to the Rail Delivery Group (RDG) transparency initiative for public operational data
23-24 stands for the financial year (April to March)
P01 is the rail period (sometimes called a rail month) within the financial year, numbered from P01 (starting in April) to P13

You may also find files named like 202425 data files 20250213.zip or Transparency 25-26 P01 20250516.zip, where the date at the end indicates the last entry date in the data itself.
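
The naming convention above can be turned into a small parser, which is handy for sorting the extracted files chronologically. This is a minimal sketch, not part of rdmpy: the regular expression is inferred only from the example names listed above, the helper name parse_transparency_name is hypothetical, and two-digit years are assumed to fall in the 2000s.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from names such as "Transparency_23-24_P12.csv" (an assumption).
_NAME_RE = re.compile(r"^Transparency_(\d{2})-(\d{2})_P(\d{2})\.csv$")


class TransparencyFile(NamedTuple):
    start_year: int  # calendar year in which the financial year starts
    end_year: int    # calendar year in which the financial year ends
    period: int      # period number within the financial year


def parse_transparency_name(name: str) -> Optional[TransparencyFile]:
    """Return the parsed parts of a delay file name, or None if it does not match."""
    match = _NAME_RE.match(name)
    if match is None:
        return None
    start, end, period = (int(group) for group in match.groups())
    # Assumption: two-digit years belong to the 2000s.
    return TransparencyFile(2000 + start, 2000 + end, period)


names = ["Transparency_24-25_P01.csv", "notes.txt", "Transparency_23-24_P12.csv"]
parsed = [parse_transparency_name(name) for name in names]
# Tuples compare field by field, so sorting orders by financial year, then period.
ordered = sorted(item for item in parsed if item is not None)
print(ordered)
```

Names that do not follow the pattern, such as the .zip archives mentioned above, return None and are skipped.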

For Schedule Data - Search “NWR Schedule”

Under “data files”, you will find:

CIF_ALL_FULL_DAILY_toc-full.json.gz

File Details:

CIF = Common Interface Format
toc-full = Train Operating Companies (TOC) Full Extract
Format = Daily format, but the data itself spans a standard week and contains all scheduled trains for each day of that week

Setup Instructions

Once downloaded, follow these steps:

Create a data/ folder inside the demo/ folder if it doesn’t exist
Save all downloaded .csv files and the .json.gz file in data/ without creating subfolders
For detailed specifications of each file and how to modify entries for different rail months/years, refer to:
incidents.py for delay file specifications
schedule.py for schedule file specifications

The tool will automatically detect and load these files from the data/ folder.
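
Before running anything, it can help to confirm that the folder is laid out as described. The following is a minimal sketch, not part of rdmpy: the function name check_data_folder and the glob patterns are assumptions based on the file names shown in this guide, and the demonstration runs on empty placeholder files in a temporary folder.

```python
import tempfile
from pathlib import Path


def check_data_folder(data_dir: Path) -> dict:
    """Summarise which expected input files sit directly inside data_dir."""
    if not data_dir.is_dir():
        return {"delay_files": [], "schedule_files": [], "subfolders": []}
    return {
        # Non-recursive globs: files placed in subfolders are deliberately not counted.
        "delay_files": sorted(p.name for p in data_dir.glob("Transparency_*.csv")),
        "schedule_files": sorted(p.name for p in data_dir.glob("*.json.gz")),
        # Subfolders are reported because the setup steps ask for a flat layout.
        "subfolders": sorted(p.name for p in data_dir.iterdir() if p.is_dir()),
    }


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "Transparency_24-25_P01.csv").touch()
    (folder / "CIF_ALL_FULL_DAILY_toc-full.json.gz").touch()
    (folder / "extracted").mkdir()
    summary = check_data_folder(folder)
print(summary)
```

On a real checkout you would call check_data_folder(Path("demo/data")) instead; an empty delay_files or schedule_files list, or a non-empty subfolders list, suggests the setup steps need another look.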

Reference Files

Additional reference data files are provided in the reference/ folder, including:

Station reference files with latitude and longitude
Station description and classification information

These are the only data files shipped with the project, so they do not need to be downloaded separately.

Data Cleaning

Before running the preprocessor, you must clean the schedule data file.

The NWR Schedule data comes as a newline-delimited JSON (NDJSON) file containing five sections:

JsonTimetableV1 - Header/metadata
TiplocV1 - Location codes
JsonAssociationV1 - Train associations
JsonScheduleV1 - Schedule data (this is what we need)
EOF - End of file marker
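
To see how a file splits into these sections, you can count records by type. This is a minimal sketch under one stated assumption: each NDJSON line is a JSON object whose first top-level key is the section name listed above. The sample records are made up and carry none of the real fields.

```python
import json
from collections import Counter
from typing import Iterable


def count_record_types(lines: Iterable[str]) -> Counter:
    """Count NDJSON records by their first top-level key."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: the first key of each object names its section.
        key = next(iter(record), None)
        counts[key] += 1
    return counts


# Tiny made-up sample; real records carry many more fields.
sample = [
    '{"JsonTimetableV1": {}}',
    '{"TiplocV1": {}}',
    '{"JsonAssociationV1": {}}',
    '{"JsonScheduleV1": {}}',
    '{"JsonScheduleV1": {}}',
    '{"EOF": true}',
]
counts = count_record_types(sample)
print(counts)
```

For the real download, the same function accepts an open file handle, for example one returned by gzip.open(path, "rt", encoding="utf-8") from the standard library.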

How to Clean the Schedule

Run the schedule cleaning script:

python demo/data/schedule_cleaning.py

This extracts the JsonScheduleV1 section and saves it as a cleaned pickle file:

CIF_ALL_FULL_DAILY_toc-full_p4.pkl

Important: The “p4” suffix refers to the 4th section being extracted. The preprocessor expects this cleaned file and will use it automatically.
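
For orientation, the extraction step can be sketched as follows. This is not the code in schedule_cleaning.py, whose internals are not shown in this guide: the function name extract_schedules is hypothetical, each line is assumed to be a JSON object keyed by its section name, and the records are stored as a plain list of dictionaries, whereas the real script may use a different structure.

```python
import gzip
import json
import pickle
import tempfile
from pathlib import Path


def extract_schedules(source: Path, target: Path, key: str = "JsonScheduleV1") -> int:
    """Pickle the records stored under `key` in a gzipped NDJSON file; return their count."""
    schedules = []
    with gzip.open(source, "rt", encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            # Assumption: schedule records look like {"JsonScheduleV1": {...}}.
            if isinstance(record, dict) and key in record:
                schedules.append(record[key])
    with open(target, "wb") as out:
        pickle.dump(schedules, out)
    return len(schedules)


# Demonstration on a tiny made-up file in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "sample.json.gz"
    with gzip.open(source, "wt", encoding="utf-8") as handle:
        handle.write('{"TiplocV1": {"id": 0}}\n')
        handle.write('{"JsonScheduleV1": {"id": 1}}\n')
        handle.write('{"JsonScheduleV1": {"id": 2}}\n')
        handle.write('{"EOF": true}\n')
    target = Path(tmp) / "sample_p4.pkl"
    kept = extract_schedules(source, target)
    with open(target, "rb") as handle:
        loaded = pickle.load(handle)
print(kept, loaded)
```

Reading the file line by line keeps memory use bounded by the schedule records alone, which matters for a full weekly extract.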

Data Pre-Processing

After cleaning the schedule data, run the preprocessor to match schedules with delays and organize results by station.

What the Preprocessor Does

The preprocessor:

Loads the cleaned schedule data
Loads the delay attribution (Transparency) files
Matches scheduled trains with actual delays
Organizes data by station code
Saves results in the processed_data/ folder

Output Structure

After preprocessing, the processed_data/ folder is organized as:

processed_data/
├── <STANOX_CODE_1>/
│   ├── MO.parquet
│   ├── TU.parquet
│   └── ...
├── <STANOX_CODE_2>/
│   ├── MO.parquet
│   ├── TU.parquet
│   └── ...
└── ...

Each station has its data organized by day of the week (MO, TU, WE, TH, FR, SA, SU for Monday to Sunday).
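
Given this layout, the file for a particular station and calendar date can be located with a few lines of Python. This is a minimal sketch that assumes the folder and file names exactly as shown in the tree above; the helper name station_day_file is hypothetical, and 50001 is only the example code used elsewhere in this guide.

```python
from datetime import date
from pathlib import Path

# Day codes as listed above, indexed by date.weekday() (Monday is 0).
DAY_CODES = ("MO", "TU", "WE", "TH", "FR", "SA", "SU")


def station_day_file(root: Path, stanox: str, day: date) -> Path:
    """Build the expected parquet path for one station and one calendar date."""
    return root / stanox / f"{DAY_CODES[day.weekday()]}.parquet"


# 1 April 2024 was a Monday, so the MO file is selected.
path = station_day_file(Path("processed_data"), "50001", date(2024, 4, 1))
print(path.as_posix())  # processed_data/50001/MO.parquet
```

The resulting path can then be opened with any parquet reader, for example pandas.read_parquet, provided pandas and a parquet engine are installed.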

Running the Preprocessor

The preprocessor can be run with different options:

Process All Stations

To process all stations in categories A, B, C1, and C2:

python -m rdmpy.preprocessor --all-categories

This is recommended for comprehensive network analysis. Note: This takes approximately 1 full day to complete.

Process by Category

To process stations by DfT (Department for Transport) category:

python -m rdmpy.preprocessor --category-A
python -m rdmpy.preprocessor --category-B
python -m rdmpy.preprocessor --category-C1
python -m rdmpy.preprocessor --category-C2
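
If you prefer to run the categories one after another from a script, the command lines above can be built programmatically. This is a minimal sketch that only constructs and prints the commands; the flags are the ones listed above, while the helper name preprocessor_command is hypothetical.

```python
import sys

CATEGORIES = ("A", "B", "C1", "C2")


def preprocessor_command(category: str) -> list:
    """Build the command line that preprocesses one station category."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    # sys.executable keeps the run inside the current Python environment.
    return [sys.executable, "-m", "rdmpy.preprocessor", f"--category-{category}"]


commands = [preprocessor_command(category) for category in CATEGORIES]
for command in commands:
    # Dry run: show each command without the interpreter path.
    print(" ".join(command[1:]))
```

To execute them for real, pass each list to subprocess.run(command, check=True), which stops at the first category that fails.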

Process a Single Station

To test or process a specific station:

python -m rdmpy.preprocessor <STANOX_CODE>

Replace <STANOX_CODE> with the station’s numeric code (e.g., 50001).

Important Considerations

Partial Processing Impact: If you only process a subset of stations (e.g., one category), the aggregate demos will show incomplete network data. See the Troubleshooting Guide for details.
Processing Time: Processing all categories takes approximately one full day. Run during off-peak hours if possible.
Disk Space: Ensure adequate disk space for processed data files.
No Interruption: Avoid interrupting the preprocessor mid-run to prevent data inconsistency.
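
After a partial or interrupted run, a quick inventory of the output folder shows which stations look incomplete. This is a minimal sketch, not an rdmpy feature: the function name missing_day_files is hypothetical, the layout is assumed to match the Output Structure section, and a missing file is only a hint, since a station may legitimately have no data for a given day. The station codes in the demonstration are placeholders.

```python
import tempfile
from pathlib import Path

DAY_CODES = ("MO", "TU", "WE", "TH", "FR", "SA", "SU")


def missing_day_files(root: Path) -> dict:
    """Map each station folder under root to the day codes that have no parquet file."""
    report = {}
    for station_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = [
            day for day in DAY_CODES
            if not (station_dir / f"{day}.parquet").is_file()
        ]
        if missing:
            report[station_dir.name] = missing
    return report


# Demonstration with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for stanox, days in {"50001": DAY_CODES, "50002": ("MO", "TU")}.items():
        (root / stanox).mkdir()
        for day in days:
            (root / stanox / f"{day}.parquet").touch()
    report = missing_day_files(root)
print(report)
```

On a real run you would call missing_day_files(Path("processed_data")); stations that appear in the report are candidates for reprocessing with the single-station command.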

You can find further information on the preprocessor’s functionality, along with troubleshooting tips, in the Troubleshooting Guide.

Next Steps

After preprocessing completes:

Run the demos in the demo/ folder for different analytical perspectives
Explore the data using the analysis tools

See the API Reference for detailed API documentation.