Fine-tuning for downstream tasks

This tutorial demonstrates how to fine-tune DreaMS for a downstream task: predicting 10 molecular properties, including logP, quantitative estimate of drug-likeness (QED), and synthetic accessibility. We use the MassSpecGym dataset split using Murcko histograms (as detailed in the previous tutorial); it can be downloaded from the Hugging Face Hub. To fine-tune DreaMS, run the fine_tune.sh script located in the DreaMS/dreams/training folder. The content of this script is shown in the following code snippet.

#!/bin/bash
#SBATCH --job-name DreaMS_fine-tuning
#SBATCH --account OPEN-29-57
#SBATCH --partition qgpu
#SBATCH --nodes 1
#SBATCH --gpus 8
#SBATCH --time 10:00:00

# Activate conda environment
eval "$(conda shell.bash hook)"
conda activate dreams

# Export project definitions
$(python -c "from dreams.definitions import export; export()")

# Move to running dir
cd "${DREAMS_DIR}" || exit 3

# Run the training script
# Replace `python3 training/train.py` with `srun --export=ALL --preserve-env python3 training/train.py \`
# when executing on a SLURM cluster via `sbatch`.
python3 training/train.py \
 --project_name MolecularProperties \
 --job_key "my_run_name" \
 --run_name "my_run_name" \
 --train_objective mol_props \
 --train_regime fine-tuning \
 --dataset_pth "${DATA_DIR}/MassSpecGym_MurckoHist_split.hdf5" \
 --dformat A \
 --model DreaMS \
 --lr 3e-5 \
 --batch_size 64 \
 --prec_intens 1.1 \
 --num_devices 8 \
 --max_epochs 103 \
 --log_every_n_steps 5 \
 --head_depth 1 \
 --seed 3407 \
 --train_precision 64 \
 --pre_trained_pth "${PRETRAINED}/ssl_model.ckpt" \
 --val_check_interval 0.1 \
 --max_peaks_n 100 \
 --save_top_k -1

There are several important points to note here.

The header of the file specifies the job name, account, partition, number of nodes, number of GPUs, and time limit. The SLURM scheduler uses these directives to allocate resources when the job is submitted via sbatch training/fine_tune.sh. Importantly, when submitting the script this way, you need to replace python3 training/train.py with srun --export=ALL --preserve-env python3 training/train.py so that training is launched through SLURM. If your cluster uses a different scheduler (e.g., PBS), modify the header to suit your needs. Note that the #SBATCH lines are ordinary comments to bash, so they are ignored when the script is run locally or on a non-SLURM cluster.
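The replacement described above can also be made at run time instead of by editing the script. The following is a minimal sketch, not part of DreaMS: it assumes that SLURM sets the SLURM_JOB_ID environment variable inside an allocation (as it does for jobs started with sbatch), and the helper name choose_launcher is hypothetical.

```shell
#!/bin/bash
# Sketch only: `choose_launcher` is a hypothetical helper, not part of DreaMS.
# Prints the command prefix that should start the training script.
choose_launcher() {
    if [ -n "${SLURM_JOB_ID:-}" ]; then
        # Inside a SLURM allocation (e.g. submitted via `sbatch`): go through srun.
        echo "srun --export=ALL --preserve-env python3"
    else
        # Local run or non-SLURM cluster: call the interpreter directly.
        echo "python3"
    fi
}

# Show which launcher would be used in the current environment.
echo "Launcher: $(choose_launcher)"
```

With this helper defined in fine_tune.sh, the line starting with python3 training/train.py could begin with $(choose_launcher) training/train.py instead; the unquoted substitution is deliberately left to word splitting so that srun and its flags become separate arguments.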
The script activates the dreams conda environment and exports the project definitions. This ensures that the training script can access the project’s configuration. The dreams conda environment therefore needs to be installed beforehand (as described in the Getting started section of the documentation).
The --dataset_pth argument specifies the path to the training dataset. In this tutorial, we use the MassSpecGym_MurckoHist_split.hdf5 file, which can be downloaded from the GeMS Hugging Face Hub repository.
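Because a multi-GPU job may wait in the queue for a while before it starts, it can help to verify that the dataset is in place before launching training. The following is a minimal sketch under assumptions: the helper require_file is hypothetical (it is not part of DreaMS), and DATA_DIR is assumed to be set by the exported project definitions, as in the script above.

```shell
#!/bin/bash
# Sketch only: `require_file` is a hypothetical helper, not part of DreaMS.
# Returns 0 if the file exists and is non-empty; otherwise prints an error
# to stderr and returns 1.
require_file() {
    local path="$1"
    if [ ! -s "${path}" ]; then
        echo "Missing or empty file: ${path}" >&2
        return 1
    fi
}

# Example check to place before the `python3 training/train.py` call.
# DATA_DIR is assumed to come from the exported project definitions.
if require_file "${DATA_DIR:-}/MassSpecGym_MurckoHist_split.hdf5"; then
    echo "Dataset found."
else
    echo "Download the dataset before starting fine-tuning." >&2
fi
```

The same check could be applied to the checkpoint passed via --pre_trained_pth.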
The --train_objective mol_props argument sets the training objective to molecular property prediction. It tells the training script to compute the molecular properties of each molecule and use them as training labels. To train DreaMS on another task, either specify a different available objective (e.g., has_F for fluorine detection) or implement a custom one. Implementing a custom objective requires writing a class that inherits from `FineTuningHead <https://dreams-docs.readthedocs.io/en/latest/dreams.models.heads.html>`__.
It is recommended to sign up for a WandB account and log in via the command line using wandb login. This allows you to monitor the training progress and inspect the model’s performance.
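In a batch job there is no terminal in which to answer a login prompt, so it can be useful to decide before launching whether WandB should log online or offline. The following is a minimal sketch under assumptions: the helper wandb_mode is hypothetical, it assumes that wandb login stores credentials in ~/.netrc under the host api.wandb.ai, and it assumes that the training script uses the standard wandb client, which reads the WANDB_API_KEY and WANDB_MODE environment variables.

```shell
#!/bin/bash
# Sketch only: `wandb_mode` is a hypothetical helper, not part of DreaMS or WandB.
# Prints "online" if WandB credentials appear to be available, else "offline".
# Optional first argument: path to the netrc file (defaults to ~/.netrc).
wandb_mode() {
    local netrc="${1:-${HOME}/.netrc}"
    if [ -n "${WANDB_API_KEY:-}" ]; then
        # An API key in the environment is enough for non-interactive logging.
        echo "online"
    elif [ -f "${netrc}" ] && grep -q "api.wandb.ai" "${netrc}"; then
        # Credentials saved earlier by `wandb login` (assumed location).
        echo "online"
    else
        # No credentials found: log locally instead of prompting.
        echo "offline"
    fi
}

export WANDB_MODE="$(wandb_mode)"
echo "WANDB_MODE=${WANDB_MODE}"
```

Runs logged offline can be uploaded later with wandb sync from a machine that is logged in.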