Troubleshooting Guide

This guide covers common issues and best practices when using rdmpy for rail network analysis.

File Naming Issues with Downloaded Data

One of the most common issues arises from inconsistencies in file naming from the Rail Data Marketplace (RDM). These inconsistencies can stem from form updates, data corrections, or human errors during file uploads.

Transparency File Name Variations

The delay attribution data files are expected to follow the naming convention:

Transparency_YY-YY_PXX.csv

However, you may encounter files with spelling variations. For example:

Transparency_23-24_P12.csv ✓ (correct)
Transparancy_23-24_P12.csv ✗ (misspelled)

The most common variation is “Transparancy”, where the second “e” is replaced by “a”. If the preprocessor locates files by exact name or by pattern matching, a misspelled file can be skipped without any error message.

What to Do

Before running the preprocessor, manually verify the spelling of all downloaded files in your demo/data/ folder
Check the Python scripts already present in demo/data/ to confirm that the file names they expect match your downloaded files
Rename any files with spelling errors to match the expected Transparency_YY-YY_PXX.csv pattern
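The manual check above can be partly automated. The following is a minimal sketch, not part of rdmpy: it assumes the files sit in demo/data/ and that a misspelling only affects the word “Transparency”, not the year or period part. The function name suggest_renames and the similarity threshold of 0.8 are illustrative choices.

```python
import re
from difflib import SequenceMatcher
from pathlib import Path

# Expected pattern, taken from the naming convention described in this guide.
EXPECTED = re.compile(r"^Transparency_\d{2}-\d{2}_P\d{2}\.csv$")
# Looser pattern: any alphabetic prefix followed by the same year/period layout.
LOOSE = re.compile(r"^([A-Za-z]+)_(\d{2}-\d{2})_[Pp](\d{2})\.csv$")


def suggest_renames(names, threshold=0.8):
    """Return {current_name: suggested_name} for near-miss file names."""
    suggestions = {}
    for name in names:
        if EXPECTED.match(name):
            continue  # already correct
        m = LOOSE.match(name)
        if m is None:
            continue  # unrelated file, leave it alone
        prefix, years, period = m.groups()
        # Compare the prefix with the expected word, ignoring case.
        score = SequenceMatcher(None, prefix.lower(), "transparency").ratio()
        if score >= threshold:
            suggestions[name] = f"Transparency_{years}_P{period}.csv"
    return suggestions


if __name__ == "__main__":
    data_dir = Path("demo/data")  # folder name as used in this guide
    if data_dir.is_dir():
        found = suggest_renames(p.name for p in data_dir.glob("*.csv"))
        for old, new in found.items():
            print(f"{old} -> {new}")
```

The sketch only prints suggestions and never renames anything, so you can review each proposed name before changing files by hand.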

Additional Naming Considerations

Pay attention to the financial-year format (e.g., “23-24” for April 2023 to March 2024)
Verify the period notation (e.g., “P01” for April, “P12” for March)
Note that some downloads may include date suffixes (e.g., “202324 data files 20250213.zip”) where the date indicates the last data entry
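To check the year and period parts of a name, a small parser can help. This is a sketch under stated assumptions: the function parse_transparency_name is hypothetical, and the upper limit of 12 periods is inferred from the “P01” to “P12” examples above, so it is exposed as a parameter in case your data uses a different scheme.

```python
import re
from typing import NamedTuple


class TransparencyName(NamedTuple):
    start_year: int  # two-digit start of the financial year, e.g. 23
    end_year: int    # two-digit end of the financial year, e.g. 24
    period: int      # period number, e.g. 12


NAME_RE = re.compile(r"^Transparency_(\d{2})-(\d{2})_P(\d{2})\.csv$")


def parse_transparency_name(name, max_period=12):
    """Parse a file name and raise ValueError if any part looks wrong.

    max_period=12 is an assumption based on the examples in this guide.
    """
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"{name!r} does not match Transparency_YY-YY_PXX.csv")
    start, end, period = (int(g) for g in m.groups())
    # A financial year spans two consecutive calendar years.
    if end != (start + 1) % 100:
        raise ValueError(f"{name!r}: years {start:02d}-{end:02d} are not consecutive")
    if not 1 <= period <= max_period:
        raise ValueError(f"{name!r}: period {period} is outside 1..{max_period}")
    return TransparencyName(start, end, period)
```

For example, parse_transparency_name("Transparency_23-24_P12.csv") returns the parts of the name, while a name such as "Transparency_23-25_P01.csv" raises a ValueError.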

Partial Station Processing

The preprocessor in this toolkit can process all stations or specific subsets based on category. This is a deliberate feature that allows users to test workflows or process data incrementally. However, incomplete processing has important implications for analysis.

Impact of Partial Station Processing

When you process only a subset of stations (e.g., Category A only, or specific individual stations), the following occurs:

Limited Data Coverage: Only the processed stations’ data will be available in the processed_data/ folder
Altered Network Representation: The demos will show a fragmented view of the network, not representative of the full system. Stations that were not processed will not appear in aggregated analyses
Demo Limitations:
  - ✓ Station View demo will work (analyzes individual stations)
  - ✗ Aggregate View demo will be incomplete or show partial network statistics
  - ✗ Incident View demo will miss incidents at non-processed stations
  - ✗ Time View demo will show incomplete temporal patterns
  - ✗ Train View demo may have incomplete journey data
Errors from Missing Data: Some analyses or visualizations may fail with errors because expected data is unavailable. For example, network-wide statistics or delay propagation analysis cannot be computed without comprehensive station coverage

Best Practice

Run the full preprocessor with --all-categories for comprehensive analysis:

python -m rdmpy.preprocessor --all-categories
Only process partial datasets if you are:
  - Testing the workflow on a small subset
  - Conducting station-specific analysis
  - Operating under severe computational constraints
Document Your Processing Choice: If you do process partial data, note which stations or categories were included. This prevents misinterpretation of results
Expect Incomplete Results: Be aware that aggregate metrics and network-wide visualizations will not reflect the true state of the full network
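One lightweight way to document a partial run is to save a small note next to the processed data. The sketch below is an illustration only: the function write_processing_note and the note's fields are hypothetical and are not read or written by rdmpy itself.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_processing_note(path, categories=(), stanox_codes=(), all_categories=False):
    """Write a JSON note recording which stations or categories were processed."""
    note = {
        "created_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "all_categories": all_categories,
        "categories": sorted(categories),
        "stanox_codes": sorted(stanox_codes),
    }
    Path(path).write_text(json.dumps(note, indent=2), encoding="utf-8")
    return note
```

For instance, write_processing_note("processed_data/processing_note.json", categories=["A"]) would record that only Category A was processed; the file name here is just a suggestion.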

Processing Time Considerations

Be Aware of Execution Time

As of November 2025, processing all stations takes approximately 1 full day (24 hours) to complete
Processing is computationally intensive due to the volume of train schedule and delay data
Do not interrupt the process unless necessary, as partial runs may require cleanup

Recommendations

Run the full preprocessor during off-peak hours or overnight
Monitor disk space: the processed data files can be substantial
Run on a machine with adequate RAM (at least 8GB recommended) to avoid slowdowns
Consider running on a server or high-performance machine if available
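Before starting a long run, free disk space can be checked from Python using only the standard library. This is a minimal sketch: the 10 GB default is a placeholder, not a figure from the rdmpy project, since the actual size of the processed data depends on how many stations and periods you process.

```python
import shutil


def free_space_gb(path="."):
    """Return the free space, in GiB, on the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3


def has_enough_space(path=".", required_gb=10.0):
    """Return True if at least `required_gb` GiB are free (placeholder default)."""
    return free_space_gb(path) >= required_gb


if __name__ == "__main__":
    print(f"Free space: {free_space_gb('.'):.1f} GiB")
```

Checking available RAM is not shown here because the standard library has no portable call for it; a third-party package would be needed.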

Data Validation Best Practices

Before Running the Preprocessor

Verify All Required Files Are Present
  - Ensure you have downloaded all months of delay data for your desired period
  - Check that the schedule file (CIF_ALL_FULL_DAILY_toc-full_p4.pkl) exists (after running cleaning)
  - Confirm the schedule cleaning step was completed: python data/schedule_cleaning.py
Check File Integrity
  - Confirm that extracted .zip files contain the expected number of CSV files
  - Verify file sizes are reasonable (not corrupted or partial downloads)
  - Spot-check a few rows in the delay files to ensure proper formatting
Validate File Spellings and Naming
  - Use the checklist from the “File Naming Issues” section above
  - Create a small test with a single station first: python -m rdmpy.preprocessor <STANOX_CODE>
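The file integrity checks above can be sketched with the standard library. The helper names inspect_zip and undersized_csvs are hypothetical, and the 1 KiB size threshold is an arbitrary placeholder; choose a value that suits the files you actually download.

```python
import zipfile
from pathlib import Path


def inspect_zip(zip_path):
    """Return (csv_member_names, first_corrupt_member_or_None) for an archive."""
    with zipfile.ZipFile(zip_path) as zf:
        csv_names = [n for n in zf.namelist() if n.lower().endswith(".csv")]
        # testzip() reads every member and returns the first bad one, if any.
        first_bad = zf.testzip()
    return csv_names, first_bad


def undersized_csvs(folder, min_bytes=1024):
    """List CSV files in `folder` smaller than `min_bytes` (possible partial downloads)."""
    return sorted(
        p.name for p in Path(folder).glob("*.csv") if p.stat().st_size < min_bytes
    )
```

Comparing len(csv_names) with the number of periods you expected to download gives a quick sanity check, and any file reported by undersized_csvs is worth opening by hand.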

After Running the Preprocessor

Check Output
  - Verify that files have been created in processed_data/
  - Spot-check the demo notebooks to ensure data loads correctly
  - Look for any warning messages in the preprocessor logs
Validate Data Completeness
  - Check that the number of processed stations matches your expectations
  - Review the date ranges covered in processed files
  - Ensure no stations have all-zero or missing data
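A first pass over the output folder can be done without opening any data. The sketch below assumes only that the preprocessor writes parquet files somewhere under processed_data/; the internal folder layout is not assumed, and summarize_outputs is a hypothetical helper.

```python
from pathlib import Path


def summarize_outputs(folder="processed_data"):
    """Count parquet files under `folder` and list any that are zero bytes."""
    root = Path(folder)
    files = sorted(root.rglob("*.parquet"))
    empty = [p.relative_to(root).as_posix() for p in files if p.stat().st_size == 0]
    return {"n_files": len(files), "empty": empty}


if __name__ == "__main__":
    if Path("processed_data").is_dir():
        print(summarize_outputs("processed_data"))
```

This only checks that files exist and are non-empty. Checking for all-zero values or date ranges requires reading the files, for example with a parquet-capable library, which is not shown here.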

Common Error Messages and Solutions

ModuleNotFoundError: No module named 'rdmpy'

Ensure you are running the preprocessor from the repository root directory
Verify that the Python environment has the required packages installed (see Requirements section)
Install missing dependencies: pip install -r requirements.txt
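To see whether Python can locate the package from your current directory, a quick check along these lines may help. It is a generic sketch using importlib and makes no assumptions about how rdmpy is laid out, beyond the package name given in the error message.

```python
import importlib.util
import os


def module_available(name):
    """Return True if a top-level module called `name` can be found on sys.path."""
    return importlib.util.find_spec(name) is not None


def import_report(name="rdmpy"):
    """Collect the facts most relevant to a ModuleNotFoundError."""
    return {
        "module": name,
        "importable": module_available(name),
        "working_directory": os.getcwd(),
    }


if __name__ == "__main__":
    print(import_report())
```

If "importable" is False and the working directory is not the repository root, changing to the repository root is the first thing to try.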

FileNotFoundError: Cannot find data files

Check that the demo/data/ folder exists and contains the downloaded files
Verify file names match the expected format (see File Naming Issues section)
Ensure the schedule file has been cleaned and saved as CIF_ALL_FULL_DAILY_toc-full_p4.pkl

No Output or Empty Results in Demos

Check that the preprocessor completed successfully for all required stations
Verify that processed_data/ contains parquet files
If you processed only some stations, use the Station View demo, which handles single-station data best

Memory or Performance Issues During Preprocessing

Close other applications to free up RAM
Consider processing by category instead of all stations at once
Run on a machine with more available memory
Check available disk space before starting

Network Analysis Considerations

When Interpreting Results

Account for Data Lag: The delay attribution data on RDM may be revised after publication. Ensure you are using the latest available data files and that the incident files cover consecutive months with no gaps
Station Coverage: Remember that not all stations may be equally represented. Some stations may have better data quality than others. The data is operational rather than research-grade, so expect some inconsistencies
Seasonal Patterns: Results should account for the financial year structure (April to March) when exploring temporal trends
Train Operating Companies (TOCs): The schedule data is based on published timetables. Actual operations may differ due to cancellations, rerouting, or service disruptions
Incident Attribution: Delays are attributed to specific incidents. Missing or misclassified incidents at unprocessed stations can affect delay propagation analysis
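Gaps in the month-to-month coverage of the delay files can be spotted from the file names alone. The sketch below assumes the Transparency_YY-YY_PXX.csv convention and 12 periods per financial year, as in the examples earlier in this guide; missing_periods is a hypothetical helper, and it does not handle a run of files that crosses a century boundary (99-00).

```python
import re

NAME_RE = re.compile(r"^Transparency_(\d{2})-\d{2}_P(\d{2})\.csv$")


def missing_periods(names, periods_per_year=12):
    """Return labels for periods missing between the earliest and latest file."""
    seen = set()
    for name in names:
        m = NAME_RE.match(name)
        if m:
            year, period = int(m.group(1)), int(m.group(2))
            # Map (financial year, period) onto one running index.
            seen.add(year * periods_per_year + (period - 1))
    if not seen:
        return []
    missing = []
    for idx in range(min(seen), max(seen) + 1):
        if idx not in seen:
            year, p = divmod(idx, periods_per_year)
            missing.append(f"{year:02d}-{(year + 1) % 100:02d} P{p + 1:02d}")
    return missing
```

An empty list means every period between the first and last file is present; it says nothing about whether the files themselves are complete.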

Reporting Issues and Getting Help

If you encounter issues not covered in this guide:

Check the project documentation in the docs/ folder
Review error messages carefully for file path and naming issues
Verify that your data files match the expected structure (e.g., CSV columns, parquet schemas) and the format specified in the README
Contact the project maintainers at ji-eun.byun@glasgow.ac.uk with:
  - A description of the issue
  - The error message or unexpected behavior
  - Your preprocessing configuration (which stations/categories you processed)
  - The dates of data files you are using