DreaMS-Fluorine

1. Obtain model weights

Please contact us at roman.bushuiev@uochb.cas.cz to request the DreaMS-Fluorine model weight files (dreams_fluorine_epoch=1-step=7000.ckpt and dreams_fluorine_epoch=30-step=111000.ckpt). After receiving them, place both files in /DreaMS/dreams/models/pretrained/.

Note that DreaMS-Fluorine was trained using NIST20, so we can share the weights only with NIST library license holders. Please attach your NIST license or order confirmation to the email.
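
Before moving on, it can help to confirm that both checkpoint files are where the code expects them. The following is a minimal sketch rather than part of DreaMS itself: the helper name missing_checkpoints is hypothetical, and the directory you pass to it is assumed to be the pretrained folder of your local DreaMS checkout.

```python
from pathlib import Path

# Checkpoint file names as listed in step 1.
CHECKPOINTS = (
    "dreams_fluorine_epoch=1-step=7000.ckpt",
    "dreams_fluorine_epoch=30-step=111000.ckpt",
)


def missing_checkpoints(pretrained_dir):
    """Return the checkpoint file names that are not present in pretrained_dir.

    Hypothetical helper for illustration; it only checks that the files exist,
    not that they are valid checkpoints.
    """
    pretrained_dir = Path(pretrained_dir)
    return [name for name in CHECKPOINTS if not (pretrained_dir / name).is_file()]
```

An empty list means both files were found. For example, missing_checkpoints("dreams/models/pretrained") could be called from the repository root, assuming that relative path matches your checkout.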

2. Run DreaMS-Fluorine

Execute the following command, where --in_dir specifies the folder containing .mzML or .mgf files (here, data/MSV000099559). The script will generate dreams_fluorine_predictions.csv in the same folder, containing predicted fluorine-presence probabilities for each MS/MS spectrum in each file. The current model version supports positive-mode data only.

python3 dreams/cli.py dreams_fluorine --in_dir data/MSV000099559
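
If several folders need to be processed, the same command can be driven from Python. This is only a sketch: build_command and run_on_folders are hypothetical helper names, sys.executable is used in place of python3 so that the current environment's interpreter is picked up, and the script is assumed to be launched from the repository root so that the relative path dreams/cli.py resolves.

```python
import subprocess
import sys


def build_command(in_dir):
    """Build the argument list for the command shown above (hypothetical helper)."""
    return [sys.executable, "dreams/cli.py", "dreams_fluorine", "--in_dir", str(in_dir)]


def run_on_folders(in_dirs):
    """Run the command once per folder, raising on the first non-zero exit code."""
    for in_dir in in_dirs:
        subprocess.run(build_command(in_dir), check=True)
```

Calling run_on_folders(["data/MSV000099559"]) is then equivalent to the single command above; each folder receives its own dreams_fluorine_predictions.csv.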

3. Examine the predictions

import pandas as pd
df = pd.read_csv('../data/MSV000099559/dreams_fluorine_predictions.csv')
df
RT charge file_name polarity precursor_mz precursor_target_mz scan_number spectrum window_lo window_uo F_preds_111k_steps F_preds_7k_steps dformat tag
0 810.063540 1 MO23S_030.mzML 1 304.891639 304.891632 4153 [[81.07015991210938, 83.0491714477539, 84.9596... 0.5 0.5 0.923519 0.210867 A Only 111k checkpoint > 0.9 hit
1 34.545791 1 MO23S_030.mzML 1 241.999878 241.999878 133 [[81.07012939453125, 84.95979309082031, 84.964... 0.5 0.5 0.921477 0.217751 A Only 111k checkpoint > 0.9 hit
2 514.087512 1 MO23S_030.mzML 1 204.138540 204.138535 2640 [[78.58903503417969, 79.05422973632812, 84.044... 0.5 0.5 0.913897 0.604710 A Only 111k checkpoint > 0.9 hit
3 810.776700 1 MO23S_027.mzML 1 328.915390 328.915405 4179 [[81.06999206542969, 87.28488159179688, 90.947... 0.5 0.5 0.903643 0.140945 A Only 111k checkpoint > 0.9 hit
4 809.980320 1 MO23S_027.mzML 1 304.891539 304.891541 4175 [[84.95977783203125, 85.2956314086914, 90.0553... 0.5 0.5 0.878663 0.227668 A Only 111k checkpoint > 0.75 hit
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11125 745.386000 2 MO23S_027.mzML 1 615.455358 615.455383 3859 [[79.05448150634766, 80.05440521240234, 81.069... 0.5 0.5 0.000000 0.000000 A NaN
11126 745.794840 2 MO23S_027.mzML 1 593.442417 593.442444 3861 [[80.05455780029297, 81.06998443603516, 83.085... 0.5 0.5 0.000000 0.000000 A NaN
11127 747.321360 2 MO23S_027.mzML 1 571.429528 571.429504 3868 [[80.05457305908203, 81.06989288330078, 83.085... 0.5 0.5 0.000000 0.000000 A NaN
11128 748.533720 2 MO23S_027.mzML 1 549.416251 549.416260 3874 [[80.05448150634766, 81.06990814208984, 81.132... 0.5 0.5 0.000000 0.000000 A NaN
11129 749.549340 2 MO23S_027.mzML 1 527.403199 527.403198 3879 [[77.15992736816406, 77.27590942382812, 80.054... 0.5 0.5 0.000000 0.000000 A NaN

11130 rows × 14 columns

The prediction file contains standard metadata parsed from the input files, such as retention time (RT) and precursor m/z (precursor_mz), which can be used to reference the input spectra. In addition, it includes DreaMS-Fluorine predictions in the columns F_preds_111k_steps and F_preds_7k_steps. These columns represent fluorine presence probabilities predicted by two versions of the DreaMS-Fluorine model trained for different numbers of steps.
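
As an illustration of how the metadata columns can be used to trace predictions back to the input spectra, the sketch below ranks spectra by the 111k-step score and keeps only the columns needed to locate them. The function name top_predictions and the choice of sorting column are assumptions made for this example; the column names are those shown in the table above.

```python
import pandas as pd

# Columns that identify a spectrum in the input files.
REFERENCE_COLUMNS = ["file_name", "scan_number", "RT", "precursor_mz"]
# Columns holding the two model outputs.
SCORE_COLUMNS = ["F_preds_111k_steps", "F_preds_7k_steps"]


def top_predictions(df, n=10):
    """Return the n spectra with the highest 111k-step fluorine probability.

    Hypothetical helper for illustration; sorting by the 7k-step score instead
    would be an equally valid choice.
    """
    ranked = df.sort_values("F_preds_111k_steps", ascending=False)
    return ranked[REFERENCE_COLUMNS + SCORE_COLUMNS + ["tag"]].head(n)
```

With the DataFrame loaded above, top_predictions(df, n=5) returns the file name and scan number of the five highest-scoring spectra, which is enough to look them up in the original .mzML files.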

We consider a prediction to be confident if both models predict a fluorine probability of at least 0.75. Such cases are annotated in the tag column as > 0.75 hit, > 0.9 hit, or > 0.95 hit, depending on the prediction scores. Less confident predictions are labeled as Only 111k checkpoint > 0.75 hit, Only 111k checkpoint > 0.9 hit, or Only 111k checkpoint > 0.95 hit, indicating that only the model trained for 111k steps produced a confident prediction.
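
The tagging rule described above can be sketched as follows. This is not the DreaMS implementation, only one reading of the text: it assumes strict comparisons (matching the ">" in the tag names), assumes that when both models agree the tier is set by the lower of the two scores, and ignores the spectrum-quality prefix. The names highest_threshold and sketch_tag are hypothetical.

```python
# Thresholds named in the tags, checked from strictest to loosest.
THRESHOLDS = (0.95, 0.9, 0.75)


def highest_threshold(p):
    """Return the strictest threshold that p exceeds, or None if it exceeds none."""
    for t in THRESHOLDS:
        if p > t:
            return t
    return None


def sketch_tag(p_111k, p_7k):
    """Approximate the tag for one spectrum from its two probabilities.

    Assumption: for agreement between the models, the tier follows the lower score.
    Returns None where the prediction file would contain NaN.
    """
    both = highest_threshold(min(p_111k, p_7k))
    if both is not None:
        return f"> {both} hit"
    only = highest_threshold(p_111k)
    if only is not None:
        return f"Only 111k checkpoint > {only} hit"
    # The text does not describe a tag for the case where only the 7k model is confident.
    return None
```

Under these assumptions the sketch reproduces the tags in the first rows of the table above, for example a 111k-step score of 0.923519 with a 7k-step score of 0.210867 yields Only 111k checkpoint > 0.9 hit.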

We further classify spectra into low- and high-quality categories (see Fig. 2b). If a spectrum is of low quality but still yields a confident prediction, the model may be hallucinating the result (for example, for spectra containing only a single signal). Such cases are therefore marked with a Low quality prefix in the tag column. If the tag value is NaN, the spectrum is not predicted to correspond to a fluorinated molecule.
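
To work with these categories programmatically, the tag strings can be collapsed into a few coarse groups. The sketch below is an illustration only: tag_group and its group labels are hypothetical, and it assumes that low-quality tags begin with the literal text "Low quality" and one-model tags begin with "Only 111k checkpoint", as described above.

```python
import pandas as pd


def tag_group(tag):
    """Map one value of the tag column to a coarse group label (hypothetical helper)."""
    if pd.isna(tag):
        return "not predicted"   # NaN tag: no fluorine predicted
    if tag.startswith("Low quality"):
        return "low quality"     # confident score on a low-quality spectrum
    if tag.startswith("Only 111k checkpoint"):
        return "one model"       # only the 111k-step model is confident
    return "both models"         # both models are confident
```

Applied to the predictions, df['tag'].map(tag_group).value_counts() gives the number of spectra in each group, and df[df['tag'].map(tag_group) == 'both models'] keeps only the most trustworthy hits.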

The plot below summarizes these tags for three files from the example MSV000099559 dataset. No prediction is supported by both models, but 18 spectra are predicted to be fluorinated by the 111k-step model alone (the last two rows of the heatmap).

import matplotlib.pyplot as plt
import seaborn as sns

# Untagged spectra have NaN in the tag column (not an empty string), so drop them with notna()
df_plot = df[df['tag'].notna()]
tag_file_counts = df_plot.pivot_table(index='tag', columns='file_name', values='RT', aggfunc='count', fill_value=0)
plt.figure(figsize=(5, 5))
sns.heatmap(tag_file_counts, annot=True, fmt='d', cmap='Greens', cbar_kws={'label': 'Count'})
plt.xlabel('File Name')
plt.ylabel('Tag')
plt.title('Counts of fluorine predictions per file')
plt.show()
[Figure: heatmap "Counts of fluorine predictions per file"; rows: Tag, columns: File Name, color scale: Count]