User Guide
This guide walks through the practical workflow for preparing and processing the data used in rdmpy.
Data Download
Before you can clean and preprocess the data, you need to download the necessary files from the Rail Data Marketplace.
Where to Find the Data
All required datasets are available from the Rail Data Marketplace (RDM). You will need to create an account to access these files.
Required Files
You need to download two main datasets:
NWR Historic Delay Attribution (Transparency Data)
NWR Schedule Data
File Specifications and Location
For Delays - Search “NWR Historic Delay Attribution”
Under “data files”, you will find .zip files organized by year and period. Download and extract them to find files named:
Transparency_23-24_P12.csv
Transparency_23-24_P13.csv
Transparency_24-25_P01.csv
...
File Naming Convention:
Transparency refers to the Rail Delivery Group (RDG) transparency initiative for public operational data
23-24 stands for the financial year (April to March)
P01 is the railway period within the financial year (a four-week reporting period, thirteen per year, starting in April)
You may also find files named like 202425 data files 20250213.zip or Transparency 25-26 P01 20250516.zip, where the date at the end indicates the last entry date in the data itself.
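The naming convention above can be checked programmatically. The helper below is purely illustrative (it is not part of rdmpy) and assumes the patterns shown above, with either underscores or spaces as separators:

```python
import re

# Hypothetical helper: pull the financial year and railway period out of a
# Transparency delay-file name. Returns None for non-matching names.
def parse_transparency_name(filename):
    match = re.match(r"Transparency[_ ](\d{2}-\d{2})[_ ]P(\d{2})", filename)
    if match is None:
        return None
    year, period = match.groups()
    return year, int(period)
```

For example, both Transparency_23-24_P12.csv and Transparency 25-26 P01 20250516.zip parse to a (year, period) pair.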
For Schedule Data - Search “NWR Schedule”
Under “data files”, you will find:
CIF_ALL_FULL_DAILY_toc-full.json.gz
File Details:
CIF = Common Interface Format
toc-full = Train Operating Companies (TOC) Full Extract
Format = daily, but the full extent of the data is weekly, containing all daily scheduled trains for a standard week
Setup Instructions
Once downloaded, follow these steps:
Create a data/ folder inside the demo/ folder if it doesn't exist
Save all downloaded .csv files and the .json.gz file in data/ without creating subfolders
For detailed specifications of each file and how to modify entries for different rail months/years, refer to:
incidents.py for delay file specifications
schedule.py for schedule file specifications
The tool will automatically detect and load these files from the data/ folder.
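As a rough sketch of what that detection looks like (this is not rdmpy's actual loader; the folder path and glob patterns are assumptions based on the file names described above):

```python
from pathlib import Path

# Illustrative only: list the delay and schedule files that would be picked
# up from the data/ folder, assuming the naming conventions above.
def find_input_files(data_dir="demo/data"):
    data = Path(data_dir)
    delays = sorted(data.glob("Transparency_*.csv"))
    schedule = sorted(data.glob("CIF_ALL_FULL_DAILY_*.json.gz"))
    return delays, schedule
```

If either list comes back empty, double-check that the files sit directly in data/ rather than in extracted subfolders.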
Reference Files
Additional reference data files are provided in the reference/ folder, including:
Station reference files with latitude and longitude
Station description and classification information
These are the only files directly provided and do not need to be downloaded separately.
Data Cleaning
Before running the preprocessor, you must clean the schedule data file.
The NWR Schedule data comes as a newline-delimited JSON (NDJSON) file containing five sections:
JsonTimetableV1 - Header/metadata
TiplocV1 - Location codes
JsonAssociationV1 - Train associations
JsonScheduleV1 - Schedule data (this is what we need)
EOF - End of file marker
How to Clean the Schedule
Run the schedule cleaning script:
python demo/data/schedule_cleaning.py
This extracts the JsonScheduleV1 section and saves it as a cleaned pickle file:
CIF_ALL_FULL_DAILY_toc-full_p4.pkl
Important: The “p4” suffix refers to the 4th section being extracted. The preprocessor expects this cleaned file and will use it automatically.
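Conceptually, the cleaning step boils down to the following sketch. It is a simplified stand-in for schedule_cleaning.py (which remains the authoritative implementation) and assumes one JSON object per line, keyed by section name, with a bare EOF marker on the last line:

```python
import gzip
import json
import pickle

# Simplified sketch: keep only the JsonScheduleV1 records from the
# NDJSON schedule file and pickle them. Paths are illustrative.
def extract_schedules(src="CIF_ALL_FULL_DAILY_toc-full.json.gz",
                      dst="CIF_ALL_FULL_DAILY_toc-full_p4.pkl"):
    schedules = []
    with gzip.open(src, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines and non-JSON markers such as EOF
            if not line.startswith("{"):
                continue
            record = json.loads(line)
            if "JsonScheduleV1" in record:
                schedules.append(record["JsonScheduleV1"])
    with open(dst, "wb") as out:
        pickle.dump(schedules, out)
    return len(schedules)
```

Filtering by top-level key is what lets the script ignore the header, TIPLOC, and association sections in a single pass.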
Data Pre-Processing
After cleaning the schedule data, run the preprocessor to match schedules with delays and organize results by station.
What the Preprocessor Does
The preprocessor:
Loads the cleaned schedule data
Loads the delay attribution (Transparency) files
Matches scheduled trains with actual delays
Organizes data by station code
Saves results in the processed_data/ folder
Output Structure
After preprocessing, the processed_data/ folder is organized as:
processed_data/
├── <STANOX_CODE_1>/
│ ├── MO.parquet
│ ├── TU.parquet
│ └── ...
├── <STANOX_CODE_2>/
│ ├── MO.parquet
│ ├── TU.parquet
│ └── ...
└── ...
Each station has its data organized by day of the week (MO, TU, WE, TH, FR, SA, SU for Monday to Sunday).
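A station's processed data can be read back with pandas. The loader below is illustrative (not part of rdmpy's API) and assumes the layout shown above:

```python
from pathlib import Path

import pandas as pd

# Day-of-week codes as used in the processed_data/ layout, Monday first.
DAY_CODES = ["MO", "TU", "WE", "TH", "FR", "SA", "SU"]

# Illustrative loader: read each day-of-week parquet file for one station,
# returning a dict mapping day code to DataFrame. Missing files are skipped.
def load_station(stanox, base="processed_data"):
    station_dir = Path(base) / str(stanox)
    frames = {}
    for day in DAY_CODES:
        path = station_dir / f"{day}.parquet"
        if path.exists():
            frames[day] = pd.read_parquet(path)
    return frames
```

For example, load_station("50001")["MO"] would give the Monday data for that station, if it has been processed.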
Running the Preprocessor
The preprocessor can be run with different options:
Process All Stations
To process stations in all DFT categories (A, B, C1, C2):
python -m rdmpy.preprocessor --all-categories
This is recommended for comprehensive network analysis. Note: This takes approximately 1 full day to complete.
Process by Category
To process stations by DFT category:
python -m rdmpy.preprocessor --category-A
python -m rdmpy.preprocessor --category-B
python -m rdmpy.preprocessor --category-C1
python -m rdmpy.preprocessor --category-C2
Process a Single Station
To test or process a specific station:
python -m rdmpy.preprocessor <STANOX_CODE>
Replace <STANOX_CODE> with the station’s numeric code (e.g., 50001).
Important Considerations
Partial Processing Impact: If you only process a subset of stations (e.g., one category), the aggregate demos will show incomplete network data. See the Troubleshooting Guide for details.
Processing Time: Full preprocessing takes significant time. Run during off-peak hours if possible.
Disk Space: Ensure adequate disk space for processed data files.
No Interruption: Avoid interrupting the preprocessor mid-run to prevent data inconsistency.
You can find further information on the preprocessor's functionality and troubleshooting tips in the Troubleshooting Guide.
Next Steps
After preprocessing completes:
Run the demos in the demo/ folder for different analytical perspectives
Explore the data using the analysis tools
See the API Reference for detailed API documentation.