User Guide
==========
This guide walks through the practical workflow for preparing and processing the data used in rdmpy.
Data Download
-------------
Before you can clean and preprocess the data, you need to download the necessary files from the Rail Data Marketplace.
**Where to Find the Data**
All required datasets are available from the Rail Data Marketplace (RDM). You will need to create an account to access these files.
**Required Files**
You need to download two main datasets:
1. **NWR Historic Delay Attribution (Transparency Data)**
2. **NWR Schedule Data**
**File Specifications and Location**
For Delays - Search "NWR Historic Delay Attribution"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under "data files", you will find ``.zip`` files organized by year and period. Download and extract them to find files named:

.. code-block:: text

   Transparency_23-24_P12.csv
   Transparency_23-24_P13.csv
   Transparency_24-25_P01.csv
   ...
**File Naming Convention:**
- ``Transparency`` refers to the Rail Delivery Group (RDG) transparency initiative for public operational data
- ``23-24`` stands for the financial year (April to March)
- ``P01`` is the reporting period within the financial year (UK rail uses 13 four-week periods, so P01 starts in April and P13 ends in late March)
You may also find files named like ``202425 data files 20250213.zip`` or ``Transparency 25-26 P01 20250516.zip``, where the date at the end indicates the last entry date in the data itself.
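To make the convention concrete, the standard file names can be decoded programmatically. The helper below is a hypothetical illustration (it is not part of rdmpy) and assumes the ``Transparency_YY-YY_PNN.csv`` pattern:

.. code-block:: python

   import re

   # Hypothetical helper illustrating the naming convention; not part of rdmpy.
   NAME_PATTERN = re.compile(r"Transparency_(\d{2})-(\d{2})_P(\d{2})\.csv$")

   def parse_transparency_name(filename):
       """Split a delay file name into its financial year and period."""
       match = NAME_PATTERN.match(filename)
       if match is None:
           raise ValueError(f"unrecognised delay file name: {filename!r}")
       start, end, period = match.groups()
       return {
           "financial_year": f"20{start}-20{end}",  # April to March
           "period": int(period),                   # P01 = April
       }

   print(parse_transparency_name("Transparency_23-24_P12.csv"))
   # {'financial_year': '2023-2024', 'period': 12}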
For Schedule Data - Search "NWR Schedule"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under "data files", you will find:

.. code-block:: text

   CIF_ALL_FULL_DAILY_toc-full.json.gz
**File Details:**
- ``CIF`` = Common Interface Format
- ``toc-full`` = Train Operating Companies (TOC) Full Extract
- Format = daily extract; although a new file is published daily, each file contains the full set of scheduled trains for a standard week
**Setup Instructions**
Once downloaded, follow these steps:
1. Create a ``data/`` folder inside the ``demo/`` folder if it doesn't exist
2. Save all downloaded ``.csv`` files and the ``.json.gz`` file in ``data/`` without creating subfolders
3. For detailed specifications of each file and how to modify entries for different rail months/years, refer to:
- ``incidents.py`` for delay file specifications
- ``schedule.py`` for schedule file specifications
The tool will automatically detect and load these files from the ``data/`` folder.
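As a rough sketch of that detection (illustrative only; the real loading logic lives in ``incidents.py`` and ``schedule.py``), the expected files can be located with simple glob patterns:

.. code-block:: python

   from pathlib import Path

   # Illustrative sketch of file detection; the actual logic lives in rdmpy.
   def find_input_files(data_dir):
       """Return the delay CSVs and schedule archives found in data_dir."""
       data_dir = Path(data_dir)
       delay_files = sorted(data_dir.glob("Transparency_*.csv"))
       schedule_files = sorted(data_dir.glob("CIF_ALL_FULL_DAILY_*.json.gz"))
       return delay_files, schedule_files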
**Reference Files**
Additional reference data files are provided in the ``reference/`` folder, including:
- Station reference files with latitude and longitude
- Station description and classification information
These are the only files directly provided and do not need to be downloaded separately.
Data Cleaning
-------------
Before running the preprocessor, you must clean the schedule data file.
The NWR Schedule data comes as a newline-delimited JSON (NDJSON) file containing five sections:
1. ``JsonTimetableV1`` - Header/metadata
2. ``TiplocV1`` - Location codes
3. ``JsonAssociationV1`` - Train associations
4. ``JsonScheduleV1`` - Schedule data (**this is what we need**)
5. ``EOF`` - End-of-file marker
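Because each line of the file is a standalone JSON object whose top-level key names its section, the file can be inspected line by line without loading it whole. A minimal sketch, assuming every line is a one-key JSON object (the function name is illustrative):

.. code-block:: python

   import gzip
   import json

   def section_counts(path):
       """Count how many NDJSON lines belong to each section."""
       counts = {}
       with gzip.open(path, "rt", encoding="utf-8") as fh:
           for line in fh:
               line = line.strip()
               if not line:
                   continue
               # The first (only) top-level key names the section.
               section = next(iter(json.loads(line)))
               counts[section] = counts.get(section, 0) + 1
       return counts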
**How to Clean the Schedule**
Run the schedule cleaning script:
.. code-block:: bash

   python demo/data/schedule_cleaning.py
This extracts the JsonScheduleV1 section and saves it as a cleaned pickle file:
.. code-block:: text

   CIF_ALL_FULL_DAILY_toc-full_p4.pkl
**Important:** The "p4" suffix refers to the 4th section being extracted. The preprocessor expects this cleaned file and will use it automatically.
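A condensed sketch of what the cleaning step does: collect every ``JsonScheduleV1`` record from the NDJSON file and pickle the result (the actual ``schedule_cleaning.py`` may differ in detail; the function name here is illustrative):

.. code-block:: python

   import gzip
   import json
   import pickle

   def extract_schedule_section(src, dst):
       """Pickle the JsonScheduleV1 records found in the NDJSON schedule file."""
       records = []
       with gzip.open(src, "rt", encoding="utf-8") as fh:
           for line in fh:
               line = line.strip()
               if line:
                   obj = json.loads(line)
                   if "JsonScheduleV1" in obj:
                       records.append(obj["JsonScheduleV1"])
       with open(dst, "wb") as out:
           pickle.dump(records, out)
       return len(records)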
Data Pre-Processing
--------------------
After cleaning the schedule data, run the preprocessor to match schedules with delays and organize results by station.
**What the Preprocessor Does**
The preprocessor:
1. Loads the cleaned schedule data
2. Loads the delay attribution (Transparency) files
3. Matches scheduled trains with actual delays
4. Organizes data by station code
5. Saves results in the ``processed_data/`` folder
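Step 4 can be pictured as a simple group-by over the matched records. The field names below are illustrative, not the preprocessor's actual columns:

.. code-block:: python

   from collections import defaultdict

   # Illustrative records; the real preprocessor works on matched
   # schedule/Transparency rows with many more columns.
   records = [
       {"station": "50001", "day": "MO", "delay_minutes": 4},
       {"station": "50001", "day": "TU", "delay_minutes": 0},
       {"station": "50002", "day": "MO", "delay_minutes": 7},
   ]

   by_station = defaultdict(list)
   for record in records:
       by_station[record["station"]].append(record)

   print(sorted(by_station))  # ['50001', '50002']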
**Output Structure**
After preprocessing, the ``processed_data/`` folder is organized as:
.. code-block:: text

   processed_data/
   ├── <station_code>/
   │   ├── MO.parquet
   │   ├── TU.parquet
   │   └── ...
   ├── <station_code>/
   │   ├── MO.parquet
   │   ├── TU.parquet
   │   └── ...
   └── ...
Each station has its data organized by day of the week (MO, TU, WE, TH, FR, SA, SU for Monday to Sunday).
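To read a specific file back, its path can be built from a station code and a day code. This is a hypothetical helper (reading the file itself would typically use ``pandas.read_parquet``):

.. code-block:: python

   from pathlib import Path

   DAY_CODES = ("MO", "TU", "WE", "TH", "FR", "SA", "SU")

   def station_day_path(root, station_code, day):
       """Build the path to one station's parquet file for a day of the week."""
       if day not in DAY_CODES:
           raise ValueError(f"day must be one of {DAY_CODES}, got {day!r}")
       return Path(root) / station_code / f"{day}.parquet"

For example, ``station_day_path("processed_data", "50001", "MO")`` yields ``processed_data/50001/MO.parquet``.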
**Running the Preprocessor**
The preprocessor can be run with different options:
Process All Stations
~~~~~~~~~~~~~~~~~~~~

To process stations in all DfT categories (A, B, C1, C2):
.. code-block:: bash

   python -m rdmpy.preprocessor --all-categories
This is recommended for comprehensive network analysis. **Note:** This takes approximately 1 full day to complete.
Process by Category
~~~~~~~~~~~~~~~~~~~

To process stations by DfT category:
.. code-block:: bash

   python -m rdmpy.preprocessor --category-A
   python -m rdmpy.preprocessor --category-B
   python -m rdmpy.preprocessor --category-C1
   python -m rdmpy.preprocessor --category-C2
Process a Single Station
~~~~~~~~~~~~~~~~~~~~~~~~
To test or process a specific station:
.. code-block:: bash

   python -m rdmpy.preprocessor <station_code>

Replace ``<station_code>`` with the station's numeric code (e.g., ``50001``).
**Important Considerations**
- **Partial Processing Impact**: If you only process a subset of stations (e.g., one category), the aggregate demos will show incomplete network data. See the :doc:`troubleshooting` guide for details.
- **Processing Time**: Full preprocessing takes significant time. Run during off-peak hours if possible.
- **Disk Space**: Ensure adequate disk space for processed data files.
- **No Interruption**: Avoid interrupting the preprocessor mid-run to prevent data inconsistency.
You can find further information on the preprocessor's functionality and troubleshooting tips in the :doc:`troubleshooting` guide.
Next Steps
----------
After preprocessing completes:
1. Run the demos in the ``demo/`` folder for different analytical perspectives
2. Explore the data using the analysis tools
See the :doc:`api` for detailed API documentation.