{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Self-supervsied pre-training"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To pre-train the DreaMS model from scratch, one needs run the `pre_train.sh` script located in the `DreaMS/dreams/training` folder. The following code snippet shows the content of this file."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```bash\n",
    "#!/bin/bash\n",
    "#SBATCH --job-name DreaMS_pre-training\n",
    "#SBATCH --account OPEN-29-57\n",
    "#SBATCH --partition qgpu\n",
    "#SBATCH --nodes 1\n",
    "#SBATCH --gpus 8\n",
    "#SBATCH --time 24:00:00\n",
    "\n",
    "# Activate conda environment\n",
    "eval \"$(conda shell.bash hook)\"\n",
    "conda activate dreams\n",
    "\n",
    "# Export project definitions\n",
    "$(python -c \"from dreams.definitions import export; export()\")\n",
    "\n",
    "# Move to running dir\n",
    "cd \"${DREAMS_DIR}\" || exit 3\n",
    "\n",
    "job_key=\"my_pre_training_run\"\n",
    "\n",
    "# Run the training script\n",
    "# Replace `python3 training/train.py` with `srun --export=ALL --preserve-env python3 training/train.py \\`\n",
    "# when executing on a SLURM cluster via `sbatch`.\n",
    "python3 training/train.py \\\n",
    " --project_name SSL_VAL_4.0 \\\n",
    " --job_key \"${job_key}\" \\\n",
    " --run_name \"${job_key}\" \\\n",
    " --frac_masks 0.3 \\\n",
    " --train_regime pre-training \\\n",
    " --dataset_pth \"${GEMS_DIR}/GeMS_A/GeMS_A10.hdf5\" \\\n",
    " --val_check_interval 0.1 \\\n",
    " --train_objective mask_mz_hot \\\n",
    " --hot_mz_bin_size 0.05 \\\n",
    " --dformat A \\\n",
    " --model DreaMS \\\n",
    " --ff_peak_depth 1 \\\n",
    " --ff_fourier_depth 5 \\\n",
    " --ff_fourier_d 512 \\\n",
    " --ff_out_depth 1 \\\n",
    " --prec_intens 1.1 \\\n",
    " --num_devices 8 \\\n",
    " --max_epochs 3000 \\\n",
    " --log_every_n_steps 20 \\\n",
    " --seed 3402 \\\n",
    " --n_layers 7 \\\n",
    " --n_heads 8 \\\n",
    " --d_peak 44 \\\n",
    " --d_fourier 980 \\\n",
    " --lr 1e-4 \\\n",
    " --batch_size 256 \\\n",
    " --dropout 0.1 \\\n",
    " --save_top_k -1 \\\n",
    " --att_dropout 0.1 \\\n",
    " --residual_dropout 0.1 \\\n",
    " --ff_dropout 0.1 \\\n",
    " --weight_decay 0 \\\n",
    " --attn_mech dot-product \\\n",
    " --train_precision 32 \\\n",
    " --mask_peaks \\\n",
    " --mask_intens_strategy intens_p \\\n",
    " --max_peaks_n 60 \\\n",
    " --ssl_probing_depth 0 \\\n",
    " --focal_loss_gamma 5 \\\n",
    " --no_transformer_bias \\\n",
    " --n_warmup_steps 5000 \\\n",
    " --fourier_strategy lin_float_int \\\n",
    " --mz_shift_aug_p 0.2 \\\n",
    " --mz_shift_aug_max 50 \\\n",
    " --pre_norm \\\n",
    " --graphormer_mz_diffs \\\n",
    " --ret_order_loss_w 0.2\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are several important points to note here.\n",
    "\n",
    "- The header of the file specifies the job name, account, partition, nodes, GPUs, and time. This is used by the SLURM scheduler to allocate resources for the job when submitted to a SLURM cluster via `sbatch training/pre_train.sh`. Importantly, to run the training script on a SLURM cluster, you need to replace `python3 training/train.py` with `srun --export=ALL --preserve-env python3 training/train.py \\` when executing the script via `sbatch`. If your cluster uses a different scheduler (e.g., PBS), you can modify the header to suit your needs. Note, that the `#SBATCH` commands are ignored when the script is run locally or on a non-SLURM cluster.\n",
    "\n",
    "- The script activates the `dreams` conda environment and exports the project definitions. This ensures that the training script can access the project's configurations. Therefore, the `dreams` conda environment needs to be installed beforehand (according to [Getting started](https://dreams-docs.readthedocs.io/en/latest/) section of the documentation).\n",
    "\n",
    "- The `--dataset_pth` specifies the path to the training dataset. In this tutorial, we use the dataset, which can be downloaded from the [GeMS Hugging Face Hub repository](https://huggingface.co/datasets/roman-bushuiev/GeMS/resolve/main/data/GeMS_A/GeMS_A10.hdf5).\n",
    "\n",
    "- It is recommended to sign up for a [WandB](https://wandb.ai/site) account and log in via the command line using `wandb login`. This allows you to monitor the training progress and inspect the model's performance.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dreams",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}