Assessing spectral quality
The DreaMS project uses a set of mass spectral quality metrics to filter out low-quality or large-molecule spectra mined from public repositories (Figure 2b,c in the paper). This tutorial demonstrates how to apply MS/MS single-spectrum quality metrics to a custom dataset of MS/MS spectra. If you’re interested in using LC-MS/MS dataset-level metrics (e.g., instrument accuracy estimation), please refer to the utils/lcms.py subpackage within the DreaMS package.
[6]:
import pandas as pd
import numpy as np
from pathlib import Path
from dreams.utils.dformats import DataFormatA
from dreams.utils.data import MSData
from dreams.utils.io import append_to_stem
Load example dataset in the .mzML format
[2]:
in_pth = Path('../data/MSV00008490/G73954_1x_BC8_01_17287.mzML')
msdata = MSData.from_mzml(in_pth)
Loading dataset G73954_1x_BC8_01_17287 into memory (1930 spectra)...
/Users/builder/jenkins/ws/enms_ntly_pyoms_whl_Release3.0.0/OpenMS/src/openms/source/FORMAT/HANDLERS/XMLHandler.cpp(130): While loading '../data/MSV00008490/G73954_1x_BC8_01_17287.mzML': Required attribute 'softwareRef' not present!
Warning: Parsing error, "processingMethod" is missing the required attribute "softwareRef".
The software tool which generated this mzML should be fixed. Please notify the maintainers.
Subject spectra to quality control checks
[5]:
# Get spectra (m/z and inetsnity arrays) and precursor m/z values from the input dataset
spectra = msdata['spectrum']
prec_mzs = msdata['precursor_mz']
# Subject each spectrum to spectral quality checks
dformat = DataFormatA()
quality_lvls = [dformat.val_spec(s, p, return_problems=True) for s, p in zip(spectra, prec_mzs)]
# Check how many spectra passed all filters (`All checks passed`) and how many spectra did not pass some of the filters
pd.Series(quality_lvls).value_counts()
/Users/roman/DreaMS/dreams/utils/spectra.py:258: RuntimeWarning: divide by zero encountered in scalar divide
return max(peak_list[1]) / min(peak_list[1])
[5]:
All checks passed 1039
Number of high intensity peaks >= 3 863
m/z range <= 1000.0 17
Precursor m/z <= 1000.0 7
Intensity amplitude >= 20.0 4
Name: count, dtype: int64
Create new dataset with only high-quality spectra
[4]:
# Define path for output high-quality file
hq_pth = append_to_stem(in_pth, 'high_quality').with_suffix('.hdf5')
# Pick only high-quality spectra and save them to `hq_pth`
msdata.form_subset(
idx=np.where(np.array(quality_lvls) == 'All checks passed')[0],
out_pth=hq_pth
)
# Try reading the new file
msdata_hq = MSData.load(hq_pth)
msdata_hq
[4]:
MSData(pth=../data/MSV00008490/G73954_1x_BC8_01_17287_high_quality.hdf5, in_mem=False) with 1,039 spectra.