Assessing spectral quality

The DreaMS project uses a set of mass spectral quality metrics to filter out low-quality or large-molecule spectra mined from public repositories (Figure 2b,c in the paper). This tutorial demonstrates how to apply MS/MS single-spectrum quality metrics to a custom dataset of MS/MS spectra. If you’re interested in using LC-MS/MS dataset-level metrics (e.g., instrument accuracy estimation), please refer to the utils/lcms.py subpackage within the DreaMS package.

[6]:
import pandas as pd
import numpy as np
from pathlib import Path
from dreams.utils.dformats import DataFormatA
from dreams.utils.data import MSData
from dreams.utils.io import append_to_stem

Load example dataset in the .mzML format

[2]:
in_pth = Path('../data/MSV00008490/G73954_1x_BC8_01_17287.mzML')
msdata = MSData.from_mzml(in_pth)
Loading dataset G73954_1x_BC8_01_17287 into memory (1930 spectra)...
/Users/builder/jenkins/ws/enms_ntly_pyoms_whl_Release3.0.0/OpenMS/src/openms/source/FORMAT/HANDLERS/XMLHandler.cpp(130): While loading '../data/MSV00008490/G73954_1x_BC8_01_17287.mzML': Required attribute 'softwareRef' not present!
Warning: Parsing error, "processingMethod" is missing the required attribute "softwareRef".
The software tool which generated this mzML should be fixed. Please notify the maintainers.

Subject spectra to quality control checks

[5]:
# Get spectra (m/z and inetsnity arrays) and precursor m/z values from the input dataset
spectra = msdata['spectrum']
prec_mzs = msdata['precursor_mz']

# Subject each spectrum to spectral quality checks
dformat = DataFormatA()
quality_lvls = [dformat.val_spec(s, p, return_problems=True) for s, p in zip(spectra, prec_mzs)]

# Check how many spectra passed all filters (`All checks passed`) and how many spectra did not pass some of the filters
pd.Series(quality_lvls).value_counts()
/Users/roman/DreaMS/dreams/utils/spectra.py:258: RuntimeWarning: divide by zero encountered in scalar divide
  return max(peak_list[1]) / min(peak_list[1])
[5]:
All checks passed                      1039
Number of high intensity peaks >= 3     863
m/z range <= 1000.0                      17
Precursor m/z <= 1000.0                   7
Intensity amplitude >= 20.0               4
Name: count, dtype: int64

Create new dataset with only high-quality spectra

[4]:
# Define path for output high-quality file
hq_pth = append_to_stem(in_pth, 'high_quality').with_suffix('.hdf5')

# Pick only high-quality spectra and save them to `hq_pth`
msdata.form_subset(
    idx=np.where(np.array(quality_lvls) == 'All checks passed')[0],
    out_pth=hq_pth
)

# Try reading the new file
msdata_hq = MSData.load(hq_pth)
msdata_hq
[4]:
MSData(pth=../data/MSV00008490/G73954_1x_BC8_01_17287_high_quality.hdf5, in_mem=False) with 1,039 spectra.