User Guide
==========

This guide walks through the practical workflow for preparing and processing the data used in rdmpy.

Data Download
-------------

Before you can clean and preprocess the data, you need to download the necessary files from the Rail Data Marketplace.

**Where to Find the Data**

All required datasets are available from the Rail Data Marketplace (RDM). You will need to create an account to access these files.

**Required Files**

You need to download two main datasets:

1. **NWR Historic Delay Attribution (Transparency Data)**
2. **NWR Schedule Data**

**File Specifications and Location**

For Delays - Search "NWR Historic Delay Attribution"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under "data files", you will find .zip files organized by year and period. Download and extract them to find files named:

.. code-block:: text

   Transparency_23-24_P12.csv
   Transparency_23-24_P13.csv
   Transparency_24-25_P01.csv
   ...

**File Naming Convention:**

- ``Transparency`` refers to the Rail Delivery Group (RDG) transparency initiative for public operational data
- ``23-24`` is the financial year (April to March)
- ``P01`` is the period within the financial year, numbered from the start of the year in April (the listing above runs to ``P13``, so periods are not calendar months)

You may also find files named like ``202425 data files 20250213.zip`` or ``Transparency 25-26 P01 20250516.zip``, where the date at the end indicates the last entry date in the data itself.

For Schedule Data - Search "NWR Schedule"
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Under "data files", you will find:

.. code-block:: text

   CIF_ALL_FULL_DAILY_toc-full.json.gz

**File Details:**

- ``CIF`` = Common Interface Format
- ``toc-full`` = Train Operating Companies (TOC) Full Extract
- ``DAILY`` = daily format (the extract nonetheless covers a full standard week, containing all trains scheduled on each day)

**Setup Instructions**

Once downloaded, follow these steps:

1. Create a ``data/`` folder inside the ``demo/`` folder if it doesn't exist
2. Save all downloaded .csv files and the .json.gz file in ``data/`` without creating subfolders
3. For detailed specifications of each file and how to modify entries for different rail periods or financial years, refer to:

   - ``incidents.py`` for delay file specifications
   - ``schedule.py`` for schedule file specifications

The tool will automatically detect and load these files from the ``data/`` folder.

**Reference Files**

Additional reference data files are provided in the ``reference/`` folder, including:

- Station reference files with latitude and longitude
- Station description and classification information

These are the only files directly provided and do not need to be downloaded separately.

Data Cleaning
-------------

Before running the preprocessor, you must clean the schedule data file. The NWR Schedule data comes as a newline-delimited JSON (NDJSON) file containing five sections:

1. JsonTimetableV1 - Header/metadata
2. TiplocV1 - Location codes
3. JsonAssociationV1 - Train associations
4. JsonScheduleV1 - Schedule data (**this is what we need**)
5. EOF - End of file marker

**How to Clean the Schedule**

Run the schedule cleaning script:

.. code-block:: bash

   python demo/data/schedule_cleaning.py

This extracts the JsonScheduleV1 section and saves it as a cleaned pickle file:

.. code-block:: text

   CIF_ALL_FULL_DAILY_toc-full_p4.pkl

**Important:** The "p4" suffix refers to the 4th section being extracted. The preprocessor expects this cleaned file and will use it automatically.

Data Pre-Processing
--------------------

After cleaning the schedule data, run the preprocessor to match schedules with delays and organize results by station.

**What the Preprocessor Does**

The preprocessor:

1. Loads the cleaned schedule data
2. Loads the delay attribution (Transparency) files
3. Matches scheduled trains with actual delays
4. Organizes data by station code
5. Saves results in the ``processed_data/`` folder

**Output Structure**

After preprocessing, the ``processed_data/`` folder is organized as:

.. code-block:: text

   processed_data/
   ├── <station_code>/
   │   ├── MO.parquet
   │   ├── TU.parquet
   │   └── ...
   ├── <station_code>/
   │   ├── MO.parquet
   │   ├── TU.parquet
   │   └── ...
   └── ...

Each station has its data organized by day of the week (MO, TU, WE, TH, FR, SA, SU for Monday to Sunday).

**Running the Preprocessor**

The preprocessor can be run with different options:

Process All Stations
---------------------

To process all stations in categories A, B, C1 and C2:

.. code-block:: bash

   python -m rdmpy.preprocessor --all-categories

This is recommended for comprehensive network analysis.

**Note:** This takes approximately 1 full day to complete.

Process by Category
--------------------

To process stations by DfT (Department for Transport) category:

.. code-block:: bash

   python -m rdmpy.preprocessor --category-A
   python -m rdmpy.preprocessor --category-B
   python -m rdmpy.preprocessor --category-C1
   python -m rdmpy.preprocessor --category-C2

Process a Single Station
-------------------------

To test or process a specific station:

.. code-block:: bash

   python -m rdmpy.preprocessor <station_code>

Replace ``<station_code>`` with the station's numeric code (e.g., ``50001``).

**Important Considerations**

- **Partial Processing Impact**: If you only process a subset of stations (e.g., one category), the aggregate demos will show incomplete network data. See the :doc:`troubleshooting` guide for details.
- **Processing Time**: Full preprocessing of all categories takes approximately one full day. Run during off-peak hours if possible.
- **Disk Space**: Ensure adequate disk space for processed data files.
- **No Interruption**: Avoid interrupting the preprocessor mid-run to prevent data inconsistency.

You can find further information on the preprocessor's functionality and troubleshooting tips in the :doc:`troubleshooting` guide.

Next Steps
----------

After preprocessing completes:

1. Run the demos in the ``demo/`` folder for different analytical perspectives
2. Explore the data using the analysis tools

See the :doc:`api` for detailed API documentation.
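
The delay file naming convention described under Data Download can be checked programmatically before files are placed in ``data/``. The following is a minimal sketch, not part of rdmpy: ``parse_transparency_name``, ``TransparencyFile`` and the regular expression are hypothetical names, and the sketch assumes two-digit years always belong to the 2000s.

.. code-block:: python

   import re
   from typing import NamedTuple, Optional


   class TransparencyFile(NamedTuple):
       """Parsed parts of a delay attribution file name (hypothetical helper)."""
       start_year: int  # e.g. 2023 for "23-24"
       end_year: int    # e.g. 2024 for "23-24"
       period: int      # e.g. 12 for "P12"


   # Assumption: "_" or " " separates the parts, since both appear in this guide.
   _NAME_PATTERN = re.compile(r"^Transparency[_ ](\d{2})-(\d{2})[_ ]P(\d{2})")


   def parse_transparency_name(name: str) -> Optional[TransparencyFile]:
       """Return the financial year and period, or None if the name does not match."""
       match = _NAME_PATTERN.match(name)
       if match is None:
           return None
       start, end, period = (int(group) for group in match.groups())
       # Assumption: two-digit years are in the 2000s.
       return TransparencyFile(2000 + start, 2000 + end, period)

Names that do not follow the convention, such as ``202425 data files 20250213.zip``, return ``None`` and would need to be inspected by hand.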
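
To illustrate what the Data Cleaning step involves, the sketch below filters one section out of a gzipped NDJSON file. It is not the code in ``schedule_cleaning.py``: the function name ``iter_schedule_records`` is hypothetical, it assumes each line is a JSON object whose single top-level key names its section (for example ``JsonScheduleV1``), and it does not write the pickle file.

.. code-block:: python

   import gzip
   import json
   from typing import Iterator


   def iter_schedule_records(path: str, key: str = "JsonScheduleV1") -> Iterator[dict]:
       """Yield the payload of every line whose top-level key equals `key`.

       Assumption: one JSON object per line, keyed by its section name.
       """
       with gzip.open(path, "rt", encoding="utf-8") as handle:
           for line in handle:
               line = line.strip()
               if not line:
                   continue  # skip blank lines
               record = json.loads(line)
               if key in record:
                   yield record[key]

Reading line by line keeps memory use low while scanning; collecting the results with ``list(iter_schedule_records(path))`` loads the whole section at once.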
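
Because partial or interrupted preprocessing leaves gaps in ``processed_data/``, it can help to check which day files exist before running the demos. The sketch below relies only on the output structure described in this guide; ``missing_day_files`` is a hypothetical helper, not an rdmpy function, and a missing day may simply mean that no data was produced for that station on that day.

.. code-block:: python

   from pathlib import Path
   from typing import Dict, List

   # Day codes as listed under Output Structure.
   DAY_CODES = ["MO", "TU", "WE", "TH", "FR", "SA", "SU"]


   def missing_day_files(root: str = "processed_data") -> Dict[str, List[str]]:
       """Map each station folder name to the day codes that have no parquet file."""
       missing: Dict[str, List[str]] = {}
       for station_dir in sorted(Path(root).iterdir()):
           if not station_dir.is_dir():
               continue  # ignore stray files at the top level
           missing[station_dir.name] = [
               day for day in DAY_CODES
               if not (station_dir / f"{day}.parquet").is_file()
           ]
       return missing

Stations mapped to an empty list have all seven day files; stations absent from the result were never processed.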