dreams.utils package

Submodules

dreams.utils.annotation module

class dreams.utils.annotation.FingerprintInChIRetrieval(df_pkl_pth: Path, candidate_smiles_col: str, fp_name: str, top_k: int | List[int], index_smiles_col: str = 'SMILES', candidate_inchi14_col: str = None)

Bases: object

compute_reset_metrics(metrics_prefix='', return_unaveraged=False)
retrieve_inchi14s(query_fp, label_smiles)
class dreams.utils.annotation.SpectralLibraryRetrieval(df_lib: Path | DataFrame)

Bases: object

retrieve(query_dreams_emb: ndarray, top_k=inf, precursor_mz=None, prec_mz_tolerance=None)
retrieve_df(df, prec_mz_tolerance=0.01, top_k=10)

dreams.utils.data module

class dreams.utils.data.AnnotatedSpectraDataset(spectra: List[MSnSpectrum], label: str, spec_preproc: SpectrumPreprocessor, dformat: DataFormat, return_smiles=False)

Bases: Dataset

NOTE: This class is deprecated in favor of LabeledSpectraDataset.

class dreams.utils.data.AttentionEntropyValidation(nist_like_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, n_samples=None, as_plot=False, save_out_basename=None)

Bases: ImplExplValidation

get_res()
class dreams.utils.data.CSRKNN(csr, one_minus_weights=True)

Bases: object

static from_edge_list(s, t, w)

Construct CSR matrix from edge list.

Parameters:
  • s – Source nodes.

  • t – Target nodes.

  • w – Edge weights.

static from_ngt_index(ngt_index, k, one_minus_for_weights=False)
static from_npz(pth: Path, one_minus_weights=False)

Load a CSR matrix that was stored as a COO matrix using the CSRKNN.to_npz method.

inv_neighbors(i, sort=True, exclude_self_loops=True) ndarray

Get nodes that have i-th node as a neighbor.

neighbors(i, sort=True, exclude_self_loops=True) ndarray

Get neighbors of the i-th node, i.e. all non-zero columns in i-th row of the CSR matrix.

to_edge_list(one_minus_weights=False)

Convert CSR matrix to edge list.
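The relationship between an edge list, CSR rows, neighbors and inverse neighbors can be illustrated with a small standalone sketch. It does not use CSRKNN or SciPy; the helper names (csr_from_edge_list, neighbors, inv_neighbors as free functions) and the explicit n_nodes argument are assumptions made for illustration only.

```python
def csr_from_edge_list(s, t, w, n_nodes):
    """Build CSR arrays (indptr, indices, data) from parallel edge lists."""
    # Sort edges by (source, target) so each row's columns are contiguous.
    order = sorted(range(len(s)), key=lambda e: (s[e], t[e]))
    indptr = [0] * (n_nodes + 1)
    for src in s:
        indptr[src + 1] += 1
    for i in range(n_nodes):
        indptr[i + 1] += indptr[i]
    indices = [t[e] for e in order]
    data = [w[e] for e in order]
    return indptr, indices, data


def neighbors(csr, i, exclude_self_loops=True):
    """Non-zero columns of row i, i.e. targets of edges leaving node i."""
    indptr, indices, _ = csr
    row = indices[indptr[i]:indptr[i + 1]]
    return [j for j in row if not (exclude_self_loops and j == i)]


def inv_neighbors(csr, i, exclude_self_loops=True):
    """Rows that contain column i, i.e. sources of edges entering node i."""
    indptr, indices, _ = csr
    out = []
    for src in range(len(indptr) - 1):
        if exclude_self_loops and src == i:
            continue
        if i in indices[indptr[src]:indptr[src + 1]]:
            out.append(src)
    return out


# Toy directed graph with one self-loop on node 2.
csr = csr_from_edge_list(
    s=[0, 0, 1, 2, 2], t=[1, 2, 2, 0, 2], w=[0.9, 0.5, 0.7, 0.4, 1.0], n_nodes=3
)
print(neighbors(csr, 0), inv_neighbors(csr, 2))
```

Row slicing makes neighbors cheap, while inv_neighbors has to scan every row in this sketch; a real implementation would more likely use a transposed or CSC copy.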

to_graph(directed=True, graph_class='igraph')

Convert CSR matrix to a graph object using the specified library.

Parameters:
  • directed (bool) – If True, creates a directed graph. If False, creates an undirected graph.

  • graph_class (str) – Specifies which graph library to use: “igraph” for igraph.Graph, “networkx” for networkx.Graph.

Returns:

Graph object: igraph.Graph or networkx.Graph, depending on graph_class.

to_npz(pth: Path) None

Save CSR matrix as COO matrix on disk. COO seems to work better and does not produce errors when saving large matrices.

class dreams.utils.data.CVDataModule(dataset: Dataset, fold_idx: Series, batch_size: int, num_workers=0)

Bases: LightningDataModule

prepare_data_per_node

If True, each LOCAL_RANK=0 will call prepare data. Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare data.

allow_zero_length_dataloader_with_multiple_devices

If True, dataloader with zero length within local rank is allowed. Default value is False.

get_num_folds() int
setup_fold_index(fold_i: int) None
train_dataloader() DataLoader

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

  • fit()

  • prepare_data()

  • setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

val_dataloader() DataLoader

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

  • fit()

  • validate()

  • prepare_data()

  • setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

class dreams.utils.data.ContrastiveSpectraDataset(df: DataFrame, spec_preproc: SpectrumPreprocessor, msn_spec_col='MSnSpectrum', pos_idx_col='pos_idx', neg_idx_col='neg_idx', n_pos_samples=1, n_neg_samples=10, return_smiles=False, logger=None)

Bases: Dataset

class dreams.utils.data.ContrastiveValidation(nist_like_pkl_pth, pairs_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, n_instances=None, n_samples=None, seed=3, save_out_basename: Path = None, euclidean=False)

Bases: ImplExplValidation

get_labels()
get_name()
get_res()
get_umap_plot()
class dreams.utils.data.CorrelationValidation(nist_like_pkl_pth, corr_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, n_samples=None)

Bases: ImplExplValidation

get_res()
class dreams.utils.data.ImplExplValidation(nist_like_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, df_idx=None, n_samples=None, seed=1)

Bases: ABC

get_data(device=None, torch_dtype=None)
abstract get_res()
set_model_gains(model_gains)
class dreams.utils.data.KNNValidation(nist_like_pkl_pth, pairs_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, k=3, n_instances=None, n_samples=None, seed=3, save_out_basename: Path = None)

Bases: ContrastiveValidation

get_res()
class dreams.utils.data.LabeledSpectraDataset(msdata: Path | str | MSData, label: str, spec_preproc: SpectrumPreprocessor, dformat: DataFormat, return_smiles=False)

Bases: Dataset

class dreams.utils.data.MSData(hdf5_pth: Path | str | List[Path], in_mem: bool = False, mode: str = 'r', spec_col: str = 'spectrum', prec_mz_col: str = 'precursor_mz', index_col: str | None = None)

Bases: object

add_column(name, data, remove_old_if_exists=False)
at(i, plot_mol=True, plot_spec=True, return_spec=False, vals=None, unpad_spec=True)
columns()
extend_column(name, data)
form_subset(idx, out_pth, verbose=False, **kwargs)
static from_hdf5(pth: Path, **kwargs)
static from_hdf5_chunks(pths: List[Path], **kwargs)
static from_mgf(pth: Path | str, in_mem=True, **kwargs)
static from_msp(pth: Path | str, in_mem=True, **kwargs)
static from_mzml(pth: Path | str, scan_range: Tuple[int, int] | None = None, verbose_parser: bool = False, **kwargs)
static from_mzxml(pth: Path | str, scan_range: Tuple[int, int] | None = None, verbose_parser: bool = False, **kwargs)
static from_pandas(df: Path | str | DataFrame, n_highest_peaks=128, spec_col='spectrum', prec_mz_col='precursor_mz', adduct_col='adduct', charge_col='charge', mol_col='smiles', rt_col='RT', feature_id_col='scan_number', ignore_cols=(), in_mem=True, hdf5_pth=None, compression_opts=0, mode='r')
static from_pickle(pth: Path | str, in_mem=True, **kwargs)
get_adducts(idx=None)
get_charges(idx=None)
get_prec_mzs(idx=None)
get_smiles(idx=None)
get_spectra(idx=None)
get_values(col, idx=None, decode_strings=True)
static load(pth: Path | str, in_mem=False, **kwargs)
load_col_in_mem(col)
load_hdf5_in_mem(group)
static merge(pths: List[Path], out_pth: Path, cols=['spectrum', 'precursor_mz', 'charge', 'adduct'], show_tqdm=True, logger=None, add_dataset_col=True, in_mem=False, spectra_already_trimmed=False, filter_idx=None)
remove_column(name)
rename_column(old_name, new_name, remove_old_if_exists=False)
spec_to_matchms(i: int) Spectrum
to_matchms(progress_bar=True) List[Spectrum]
to_mgf(out_pth: Path | str)
to_pandas(unpad=True, ignore_cols=('DreaMS_embedding',))
to_pynndescent(embedding_col='DreaMS_embedding', out_pth=None, store_index=False, n_neighbors=50, metric='cosine', verbose=True)
to_torch_dataset(spec_preproc: SpectrumPreprocessor, label=None, **kwargs)
use_col_as_index(col)
class dreams.utils.data.ManualValidation(nist_like_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, n_samples=None, seed=1, df_idx=None)

Bases: ImplExplValidation

get_res()
class dreams.utils.data.MaskedSpectraDataset(in_pth: Path, dformat: DataFormat, ssl_objective: str, spec_preproc: SpectrumPreprocessor, mask_peaks=True, mask_intens_strategy='intens_p', frac_masks=0.2, min_n_masks=2, mask_val=-1.0, min_mask_intens=0.1, mask_prec=False, n_samples=None, logger=None, deterministic_mask=True, ret_order_pairs=False, return_charge=False, acc_est_weight=False, lsh_weight=False, bert801010_masking=False)

Bases: Dataset

A dataset class for masked spectra used in self-supervised learning tasks.

This class prepares spectra data for various self-supervised learning objectives such as peak masking, m/z masking, and intensity masking.

data

The dataset containing spectra and related information.

Type:

dict

dformat

The data format specification.

Type:

DataFormat

ssl_objective

The self-supervised learning objective.

Type:

str

spec_preproc

The spectrum preprocessor.

Type:

SpectrumPreprocessor

frac_masks

The fraction of peaks to mask.

Type:

float

min_n_masks

The minimum number of peaks to mask.

Type:

int

n_samples

The number of samples to use from the dataset.

Type:

int

mask_val

The value to use for masked peaks.

Type:

float

min_mask_intens

The minimum intensity for peaks to be considered for masking.

Type:

float

mask_prec

Whether to mask the precursor peak.

Type:

bool

deterministic_mask

Whether to use deterministic masking.

Type:

bool

mask_peaks

Whether to mask peaks.

Type:

bool

mask_intens_strategy

The strategy for masking intensities.

Type:

str

ret_order_pairs

Whether to return retention order pairs.

Type:

bool

return_charge

Whether to return charge information.

Type:

bool

acc_est_weight

Whether to use accuracy estimation weighting.

Type:

bool

lsh_weight

Whether to use LSH weighting.

Type:

bool

bert801010_masking

Whether to use BERT-style 80-10-10 masking.

Type:

bool

__len__()

Return the length of the dataset.

__getitem__(i)

Get a single item from the dataset.

get_spec(i)

Get a preprocessed spectrum.

Initialize the MaskedSpectraDataset.

Parameters:
  • in_pth (Path) – The input path for the dataset.

  • dformat (DataFormat) – The data format specification.

  • ssl_objective (str) – The self-supervised learning objective.

  • spec_preproc (SpectrumPreprocessor) – The spectrum preprocessor.

  • mask_peaks (bool) – Whether to mask peaks.

  • mask_intens_strategy (str) – The strategy for masking intensities.

  • frac_masks (float) – The fraction of peaks to mask.

  • min_n_masks (int) – The minimum number of peaks to mask.

  • mask_val (float) – The value to use for masked peaks.

  • min_mask_intens (float) – The minimum intensity for peaks to be considered for masking.

  • mask_prec (bool) – Whether to mask the precursor peak.

  • n_samples (int) – The number of samples to use from the dataset.

  • logger (Logger) – The logger object.

  • deterministic_mask (bool) – Whether to use deterministic masking.

  • ret_order_pairs (bool) – Whether to return retention order pairs.

  • return_charge (bool) – Whether to return charge information.

  • acc_est_weight (bool) – Whether to use accuracy estimation weighting.

  • lsh_weight (bool) – Whether to use LSH weighting.

  • bert801010_masking (bool) – Whether to use BERT-style 80-10-10 masking.
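How frac_masks, min_n_masks, min_mask_intens, mask_val and BERT-style 80-10-10 masking could interact is sketched below. This is not the class's actual implementation. The rule for the number of masks (the larger of min_n_masks and frac_masks times the number of eligible peaks), the use of relative intensities, and both helper names are assumptions.

```python
import random


def choose_mask_indices(intensities, frac_masks=0.2, min_n_masks=2,
                        min_mask_intens=0.1, rng=None):
    """Pick peak indices to mask among peaks with intensity >= min_mask_intens."""
    if rng is None:
        rng = random.Random(0)
    candidates = [i for i, x in enumerate(intensities) if x >= min_mask_intens]
    # Assumed rule: mask a fraction of eligible peaks, but at least min_n_masks.
    n_masks = max(min_n_masks, int(frac_masks * len(candidates)))
    n_masks = min(n_masks, len(candidates))
    return sorted(rng.sample(candidates, n_masks))


def bert_801010(values, mask_idx, mask_val=-1.0, rng=None):
    """BERT-style corruption: 80% mask value, 10% random value, 10% unchanged."""
    if rng is None:
        rng = random.Random(0)
    out = list(values)
    for i in mask_idx:
        r = rng.random()
        if r < 0.8:
            out[i] = mask_val
        elif r < 0.9:
            out[i] = rng.choice(values)
        # Otherwise the original value is kept on purpose.
    return out


intensities = [1.0, 0.05, 0.5, 0.3, 0.02, 0.2, 0.8, 0.15, 0.6, 0.4]
idx = choose_mask_indices(intensities)
print(idx, bert_801010(intensities, idx))
```

Passing a seeded random.Random instance keeps the mask reproducible, which is one plausible reading of deterministic_mask.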

get_spec(i)

Get a preprocessed spectrum.

Parameters:

i (int) – The index of the spectrum to retrieve.

Returns:

A dictionary containing the preprocessed spectrum and related information.

Return type:

dict

class dreams.utils.data.MatchmsSpectraDataset(spectra: List[Spectrum], spec_preproc: SpectrumPreprocessor)

Bases: Dataset

class dreams.utils.data.RandomSplitDataModule(dataset, batch_size: int, max_var_features=None, val_frac=0.1, num_workers=0)

Bases: LightningDataModule

prepare_data_per_node

If True, each LOCAL_RANK=0 will call prepare data. Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare data.

allow_zero_length_dataloader_with_multiple_devices

If True, dataloader with zero length within local rank is allowed. Default value is False.

test_dataloader()

An iterable or collection of iterables specifying test samples.

For more information about multiple dataloaders, see this section.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

  • test()

  • prepare_data()

  • setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

train_dataloader() DataLoader

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

  • fit()

  • prepare_data()

  • setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

val_dataloader() DataLoader

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

  • fit()

  • validate()

  • prepare_data()

  • setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

class dreams.utils.data.RawSpectraDataset(spectra, prec_mzs, spec_preproc: SpectrumPreprocessor)

Bases: Dataset

class dreams.utils.data.SSLProbingValidation(labeled_data_module: SplittedDataModule, evaluator_impl='torch', n_hidden_layers=[0], n_epochs=100, probing_batch_freq=2500, prefix=None, save_fps_dir=typing.Optional[pathlib.Path])

Bases: Callback

on_train_batch_start(trainer, pl_module, batch, batch_idx)

Called when the train batch begins.

class dreams.utils.data.SpecRetrievalValidation(nist_like_pkl_pth, pairs_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor)

Bases: ImplExplValidation

Cosine similarity on embeddings <-> equality of InChI keys validation.

get_res()
class dreams.utils.data.SpectrumPreprocessor(dformat: DataFormat, prec_intens=1.1, n_highest_peaks=None, spec_entropy_cleaning=False, normalize_mzs=False, precision=32, mz_shift_aug_p=0, mz_shift_aug_max=0, to_relative_intensities=True)

Bases: object

A class for preprocessing mass spectrometry spectra.

This class provides functionality to preprocess mass spectrometry spectra, including peak trimming, padding, intensity normalization, and various data augmentation techniques.

dformat

The data format specification.

Type:

DataFormat

prec_intens

The intensity of the precursor peak.

Type:

float

n_highest_peaks

The number of highest intensity peaks to keep.

Type:

int

spec_entropy_cleaning

Whether to apply spectral entropy cleaning.

Type:

bool

normalize_mzs

Whether to normalize m/z values.

Type:

bool

to_relative_intensities

Whether to convert intensities to relative values.

Type:

bool

precision

The precision of the output data (32 or 64 bit).

Type:

int

mz_shift_aug_p

The probability of applying m/z shift augmentation.

Type:

float

mz_shift_aug_max

The maximum m/z shift for augmentation.

Type:

float

__call__(spec, prec_mz, high_form, augment)

Preprocess a single spectrum.

Initialize the SpectrumPreprocessor.

Parameters:
  • dformat (DataFormat) – The data format specification.

  • prec_intens (float) – The intensity of the precursor peak.

  • n_highest_peaks (int) – The number of highest intensity peaks to keep.

  • spec_entropy_cleaning (bool) – Whether to apply spectral entropy cleaning.

  • normalize_mzs (bool) – Whether to normalize m/z values.

  • precision (int) – The precision of the output data (32 or 64 bit).

  • mz_shift_aug_p (float) – The probability of applying m/z shift augmentation.

  • mz_shift_aug_max (float) – The maximum m/z shift for augmentation.

  • to_relative_intensities (bool) – Whether to convert intensities to relative values.
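The peak trimming, intensity normalization and padding mentioned in the class description can be pictured with the following standalone sketch. It does not call SpectrumPreprocessor. The function name, the (m/z, intensity) tuple representation, and the ordering of the steps are assumptions; precursor-peak handling and augmentation are left out.

```python
def preprocess_peaks(mzs, intensities, n_highest_peaks, to_relative_intensities=True):
    """Trim to the most intense peaks, normalize intensities, pad with zeros."""
    # Keep the n_highest_peaks most intense peaks.
    peaks = sorted(zip(mzs, intensities), key=lambda p: p[1], reverse=True)
    peaks = peaks[:n_highest_peaks]
    # Restore ascending m/z order after trimming.
    peaks.sort(key=lambda p: p[0])
    if to_relative_intensities and peaks:
        top = max(i for _, i in peaks)
        if top > 0:
            peaks = [(mz, i / top) for mz, i in peaks]
    # Pad with (0, 0) rows so that every spectrum has the same length.
    peaks += [(0.0, 0.0)] * (n_highest_peaks - len(peaks))
    return peaks


spec = preprocess_peaks(
    mzs=[100.0, 150.0, 200.0, 250.0],
    intensities=[10.0, 50.0, 5.0, 25.0],
    n_highest_peaks=3,
)
print(spec)
```

Fixed-length output is what allows spectra to be stacked into a batch tensor.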

class dreams.utils.data.SplittedDataModule(dataset, split_mask: Series | ndarray | list, batch_size: int | None, num_workers=0, n_train_samples=None, seed=None, include_val_in_train=False)

Bases: LightningDataModule

prepare_data_per_node

If True, each LOCAL_RANK=0 will call prepare data. Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare data.

allow_zero_length_dataloader_with_multiple_devices

If True, dataloader with zero length within local rank is allowed. Default value is False.

test_dataloader() DataLoader

An iterable or collection of iterables specifying test samples.

For more information about multiple dataloaders, see this section.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

  • test()

  • prepare_data()

  • setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a test dataset and a test_step(), you don’t need to implement this method.

train_dataloader() DataLoader

An iterable or collection of iterables specifying training samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

For data processing use the following pattern:

  • download in prepare_data()

  • process and split in setup()

However, the above are only necessary for distributed processing.

Warning

do not assign state in prepare_data

  • fit()

  • prepare_data()

  • setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

val_dataloader() DataLoader

An iterable or collection of iterables specifying validation samples.

For more information about multiple dataloaders, see this section.

The dataloader you return will not be reloaded unless you set Trainer.reload_dataloaders_every_n_epochs to a positive integer.

It’s recommended that all data downloads and preparation happen in prepare_data().

  • fit()

  • validate()

  • prepare_data()

  • setup()

Note

Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.

Note

If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.

dreams.utils.data.condense_dreams_knn(graph, thld, embs, logger)
dreams.utils.data.evaluate_split(df_split, n_workers=5, smiles_col='smiles', fold_col='fold')

Evaluates data split based on the Morgan Tanimoto similarity between validation and train folds.

Parameters:
  • df_split (pd.DataFrame) – DataFrame containing the data split.

  • n_workers (int) – Number of workers for parallel processing. Default is 5.

  • smiles_col (str) – Column name for SMILES strings. Default is ‘smiles’.

  • fold_col (str) – Column name for fold information. Default is ‘fold’.

Returns:

A dictionary containing maximum Tanimoto similarities for each fold.

Return type:

dict

Raises:

ValueError – If the fold column is not found or contains invalid values.
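The idea behind this check can be sketched without RDKit by treating each fingerprint as a set of on-bit indices. The sketch is not the function's implementation. The set-based toy fingerprints, the function names, and the choice to treat every fold in turn as held out are assumptions.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity_to_train(fingerprints, folds):
    """For each fold, the maximum similarity of each member to all other folds."""
    result = {}
    for held_out in sorted(set(folds)):
        train = [fp for fp, f in zip(fingerprints, folds) if f != held_out]
        result[held_out] = [
            max((tanimoto(fp, t) for t in train), default=0.0)
            for fp, f in zip(fingerprints, folds)
            if f == held_out
        ]
    return result


# Toy fingerprints standing in for Morgan fingerprints computed from SMILES.
fps = [{1, 2, 3}, {2, 3, 4}, {1, 2, 3, 4}, {7, 8}]
folds = ["train", "train", "val", "val"]
print(max_similarity_to_train(fps, folds))
```

Values close to 1 for validation molecules indicate near-duplicates in the training fold, i.e. possible leakage across the split.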

dreams.utils.data.load_hdf5_in_mem(dct)
dreams.utils.data.subset_lsh(in_pth: Path | str, out_pth: Path | None = None, lsh_col: str = 'lsh', max_specs_per_lsh: int = 1, random_seed: int = 42)

dreams.utils.dformats module

class dreams.utils.dformats.DataFormat

Bases: ABC

Abstract class for DataFormats.

high_intensity_thld

alias of NotImplementedError

lsh_bin_size

alias of NotImplementedError

lsh_n_hplanes

alias of NotImplementedError

max_charge

alias of NotImplementedError

max_ms_level

alias of NotImplementedError

max_mz

alias of NotImplementedError

max_peaks_n

alias of NotImplementedError

max_prec_mz

alias of NotImplementedError

max_tbxic_stdev

alias of NotImplementedError

min_charge

alias of NotImplementedError

min_file_spectra

alias of NotImplementedError

min_intensity_ampl

alias of NotImplementedError

min_peaks_n

alias of NotImplementedError

val_spec(spec: ndarray, prec_mz: float, tbxic_stdev: float | None = None, charge: int | None = None, mslevel: int | None = None, verbose: bool = False, return_problems: bool = False) bool | str
class dreams.utils.dformats.DataFormatA

Bases: DataFormat

high_intensity_thld: float = 0.1
max_charge: int = 1
max_ms_level: int = 2
max_mz: float = 1000.0
max_peaks_n: int = 128
max_prec_mz: float = 1000.0
max_tbxic_stdev: float = 0.0001
min_charge: int = 1
min_file_spectra: int = 3
min_intensity_ampl: float = 20.0
min_peaks_n: int = 3
class dreams.utils.dformats.DataFormatA1

Bases: DataFormatA

min_peaks_n: int = 2
class dreams.utils.dformats.DataFormatA2

Bases: DataFormatA

high_intensity_thld: float = 0.05
min_intensity_ampl: float = 15
min_peaks_n: int = 2
class dreams.utils.dformats.DataFormatA3

Bases: DataFormatA

high_intensity_thld: float = 0.05
min_charge: int = -1
min_intensity_ampl: float = 15
min_peaks_n: int = 2
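The listings above follow a plain class-attribute override pattern: the base class uses NotImplementedError as a placeholder, DataFormatA fills in concrete thresholds, and A1 to A3 override a few of them. The snippet below re-declares a subset of these attributes locally so that it runs without the package. It only mirrors the documented values and says nothing about how val_spec uses them.

```python
class DataFormat:
    # Placeholders: subclasses are expected to override these.
    high_intensity_thld = NotImplementedError
    max_mz = NotImplementedError
    min_charge = NotImplementedError
    min_intensity_ampl = NotImplementedError
    min_peaks_n = NotImplementedError


class DataFormatA(DataFormat):
    high_intensity_thld: float = 0.1
    max_mz: float = 1000.0
    min_charge: int = 1
    min_intensity_ampl: float = 20.0
    min_peaks_n: int = 3


class DataFormatA1(DataFormatA):
    min_peaks_n: int = 2


class DataFormatA3(DataFormatA):
    high_intensity_thld: float = 0.05
    min_charge: int = -1
    min_intensity_ampl: float = 15
    min_peaks_n: int = 2


# Attributes not overridden in a subclass resolve to the parent's value.
for fmt in (DataFormatA, DataFormatA1, DataFormatA3):
    print(fmt.__name__, fmt.min_peaks_n, fmt.min_charge, fmt.max_mz)
```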
class dreams.utils.dformats.DataFormatB

Bases: DataFormat

high_intensity_thld: float = 0.1
max_charge: int = 1
max_ms_level: int = 2
max_mz: float = 1500.0
max_peaks_n: int = 128
max_prec_mz: float = 1500.0
max_tbxic_stdev: float = 0.001
min_charge: int = 1
min_file_spectra: int = 3
min_intensity_ampl: float = 20.0
min_peaks_n: int = 3
class dreams.utils.dformats.DataFormatBuilder(dformat_name)

Bases: object

get_dformat()
class dreams.utils.dformats.DataFormatC

Bases: DataFormat

high_intensity_thld: float = 0.1
max_charge: int = 1
max_ms_level: int = 10
max_mz: float = 1500.0
max_peaks_n: int = 128
max_prec_mz: float = 1500.0
max_tbxic_stdev: float = 0.001
min_charge: int = -1
min_file_spectra: int = 3
min_intensity_ampl: float = 18.0
min_peaks_n: int = 3
dreams.utils.dformats.assign_dformat(spec: ndarray, prec_mz: float, **kwargs) str
dreams.utils.dformats.to_A_format(df: DataFrame, filter=True, trimming=False, reset_index=True, verbose=True, add_msn_col=True, filter_block_mask=None)
dreams.utils.dformats.to_format(df: DataFrame, dformat: DataFormat, filter=True, trimming=False, reset_index=True, verbose=True, add_msn_col=True, filter_block_mask=None)
Parameters:
  • df – df with NIST-like columns

  • filter_block_mask – if not None, a boolean mask with the same length as df. True entries are not filtered.
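One way to read filter_block_mask is that rows marked True are kept regardless of the filter outcome. The sketch below encodes that reading with plain lists. The function name and the passes_filter input are hypothetical, and the actual function operates on a DataFrame.

```python
def rows_to_keep(passes_filter, filter_block_mask=None):
    """Combine per-row filter results with a mask that exempts rows from filtering."""
    if filter_block_mask is None:
        return list(passes_filter)
    if len(filter_block_mask) != len(passes_filter):
        raise ValueError("filter_block_mask must have the same length as the data")
    # A row survives if it passes the filter or is protected by the mask.
    return [ok or blocked for ok, blocked in zip(passes_filter, filter_block_mask)]


print(rows_to_keep([True, False, False], [False, True, False]))
```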

dreams.utils.io module

class dreams.utils.io.ChunkedDatasetAccessor(parent, dataset_name)

Bases: object

property shape

Return the shape of the dataset.

class dreams.utils.io.ChunkedHDF5File(file_paths)

Bases: object

Initialize the ChunkedHDF5File with a list of HDF5 file paths.

Parameters:
  • file_paths (list of Path) – Paths to the HDF5 files.

close()

Close all files.

keys()

Return a list of dataset names.

class dreams.utils.io.TqdmToLogger(logger, level=None, mininterval=5)

Bases: StringIO

buf = ''
flush()

Flush write buffers, if applicable.

This is not implemented for read-only and non-blocking streams.

level = None
logger = None
write(buf)

Write string to file.

Returns the number of characters written, which is always equal to the length of the string.
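A common way to route progress-bar output into a logger is a StringIO subclass that remembers the last written line and emits it on flush. The sketch below follows that pattern under a hypothetical name, ProgressToLogger. It omits the mininterval argument and is not claimed to match TqdmToLogger line for line.

```python
import io
import logging


class ProgressToLogger(io.StringIO):
    """Buffer the most recent progress line and send it to a logger on flush."""

    def __init__(self, logger, level=None):
        super().__init__()
        self.logger = logger
        self.level = level if level is not None else logging.INFO
        self.buf = ""

    def write(self, buf):
        # Progress bars redraw with carriage returns; keep only the visible text.
        self.buf = buf.strip("\r\n\t ")
        return len(buf)

    def flush(self):
        if self.buf:
            self.logger.log(self.level, self.buf)
            self.buf = ""


class ListHandler(logging.Handler):
    """Collect formatted messages in memory so the result can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("progress_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

stream = ProgressToLogger(logger)
stream.write("\r 50%|#####     | 5/10")
stream.flush()
print(handler.messages)
```

With tqdm installed, such a stream would typically be passed through tqdm's file argument so that each refresh ends up as a log record.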

dreams.utils.io.append_to_stem(pth: Path, s, sep='_')

path/to/file.txt -> path/to/file{sep}{s}.txt

dreams.utils.io.bytes_to_human_str(size_bytes, decimal_places=2)
dreams.utils.io.bytes_to_units(size_bytes, unit='MB')
dreams.utils.io.cache_pkl(pth)
dreams.utils.io.clean_ftps(ftps: dict, verbose=True)

Cleans the dict of MassIVE ftps (see code comments 1., 2., 3. for details).

Parameters:
  • ftps – keys are ftps, values are the corresponding file sizes.

dreams.utils.io.compress_hdf(hdf_pth, out_pth=None, compression='gzip', compression_opts=4)
dreams.utils.io.downloadpublicdata_to_hdf5s(downloads_log: Path, del_in=True, verbose=False, only_msn=False, only_format=None) None

Convert downloaded LC-MS/MS data (e.g., .mzML or .mzXML) to .hdf5 format.

Parameters:
  • downloads_log (Path) – Path to the log file from downloadpublicdata containing information about downloaded files.

  • del_in (bool, optional) – Whether to delete the input files after conversion. Defaults to True.

  • verbose (bool, optional) – Whether to print additional information during the conversion. Defaults to False.

dreams.utils.io.ftp_to_msv_id(ftp)
dreams.utils.io.lcmsms_to_hdf5(input_path, output_path=None, num_peaks=None, num_prec_peaks=None, store_precursors=True, compress_peaks_lvl=0, compress_full_lvl=0, pwiz_stats=False, del_in=False, assign_dformats=True, log_path=None, verbose=False, only_msn=False, only_format=None)

Convert LC-MS/MS data from an input file (.mzML or .mzXML) to an output file (.hdf5).

Parameters:
  • input_path (str) – Path to the input file (.mzML or .mzXML).

  • output_path (str, optional) – Path to the output file (.hdf5). If not provided, the output is stored under the input file name with the .hdf5 extension.

  • num_peaks (int, optional) – Number of peaks to which the MSn peak lists are zero-padded. If not specified, it is set to the maximum number of peaks among the spectra to be stored.

  • num_prec_peaks (int, optional) – Number of peaks to which the MS1 peak lists are zero-padded. If not specified, it is set to the maximum number of peaks among the spectra to be stored.

  • store_precursors (bool, optional) – Whether to store the data of precursor spectra (peak list and scan id) for each MSn spectrum as a separate hdf5 dataset. Defaults to True.

  • compress_peaks_lvl (int, optional) – The compression level for peak lists in the output .hdf5 file. Should be an integer from 0 to 9. Defaults to 0.

  • compress_full_lvl (int, optional) – The compression level for all stored attributes (e.g. RTs, polarities, etc.) except for peak lists. Should be an integer from 0 to 9. Defaults to 0.

  • pwiz_stats (bool, optional) – Whether to collect ProteoWizard msconvert statistics, including the histogram of types of spectra converted by msconvert and the number of spectra centroided by msconvert but having zero intensities. Defaults to False.

  • del_in (bool, optional) – Whether to delete the input .mzML or .mzXML file. Defaults to False.

  • assign_dformats (bool, optional) – Whether to assign data formats to MSn spectra. Defaults to True.

  • log_path (str, optional) – Path to the log file containing errors during opening of files and flaws of invalid spectra. If set to None, the log file is stored as the input file name with .hdf5 extension.

  • verbose (bool, optional) – Whether to log the scan number for each invalid spectrum and log additional statistics. The statistics are redundant in the sense that they can be calculated from the output .hdf5 file, but they are helpful for fast analysis of the input file and for debugging. Defaults to False.

dreams.utils.io.list_from_txt(txt_pth, sep='\n', apply_lambda=None, progress_bar=False)
dreams.utils.io.list_to_txt(lst, txt_pth, sep='\n')
dreams.utils.io.lsh_subset(in_pth, dformat, n_hplanes=None, bin_size=1, max_specs_per_lsh=None, seed=333)

Subset the input .hdf5 file using Locality Sensitive Hashing (LSH) algorithm.

Parameters:
  • in_pth (str) – Path to the input file.

  • dformat (DataFormatBuilder) – Data format builder object.

  • n_hplanes (int, optional) – Number of hyperplanes for LSH. Defaults to None.

  • bin_size (float, optional) – Bin size for LSH. Defaults to 1.

  • max_specs_per_lsh (int, optional) – Maximum number of spectra per LSH. Defaults to None.

  • seed (int, optional) – Random seed for LSH initialization and selection. Defaults to 333.
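The parameters above match the usual random-hyperplane LSH recipe: bin each spectrum into a vector, take the sign of its projection onto n_hplanes random hyperplanes as the hash, and keep at most max_specs_per_lsh spectra per hash. The standalone sketch below illustrates that recipe. All function names, the binning scheme and the max_mz argument are assumptions and not the library's code.

```python
import random
from collections import defaultdict


def bin_spectrum(peaks, bin_size, max_mz):
    """Sum intensities into fixed-width m/z bins."""
    vec = [0.0] * (int(max_mz / bin_size) + 1)
    for mz, intensity in peaks:
        if 0 <= mz <= max_mz:
            vec[int(mz / bin_size)] += intensity
    return vec


def make_hyperplanes(n_hplanes, dim, seed):
    """Draw Gaussian normal vectors, one per hyperplane."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_hplanes)]


def lsh_hash(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(h * v for h, v in zip(plane, vec)) >= 0) for plane in hyperplanes
    )


def subset_by_lsh(spectra, hyperplanes, bin_size, max_mz, max_specs_per_lsh, seed):
    """Keep at most max_specs_per_lsh randomly chosen spectra per hash bucket."""
    buckets = defaultdict(list)
    for idx, peaks in enumerate(spectra):
        key = lsh_hash(bin_spectrum(peaks, bin_size, max_mz), hyperplanes)
        buckets[key].append(idx)
    rng = random.Random(seed)
    kept = []
    for members in buckets.values():
        if max_specs_per_lsh is not None and len(members) > max_specs_per_lsh:
            members = rng.sample(members, max_specs_per_lsh)
        kept.extend(members)
    return sorted(kept)


spec_a = [(100.2, 1.0), (250.7, 0.4)]
spec_b = [(310.1, 1.0), (512.9, 0.8)]
planes = make_hyperplanes(n_hplanes=16, dim=1001, seed=333)
print(subset_by_lsh([spec_a, spec_a, spec_b], planes, 1, 1000.0, 1, seed=333))
```

Identical spectra always share a hash, so duplicates collapse to one representative when max_specs_per_lsh is 1.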

dreams.utils.io.merge_lcmsms_hdf5s(in_pths: Path | Iterable[Path], out_pth: Path, dformat: str = 'A', store_acc_est: bool = True, verbose: bool = True, compression: str | None = 'gzip', compression_level: int = 6)

Merge .hdf5 files generated with lcmsms_to_hdf5.

Parameters:
  • in_pths – Directory or iterable of .hdf5 files.

  • out_pth – Output .hdf5 file.

  • store_acc_est – Whether to store instrument accuracy estimate.

  • verbose – Verbose output.

  • compression – Compression method for HDF5 datasets (e.g. ‘gzip’, None).

  • compression_level – Compression level (0–9) when using gzip.

dreams.utils.io.merge_ms_hdfs(in_hdf_pths, out_pth, group='MSn data', max_peaks_n=512, del_in=False, show_tqdm=True, logger=None, add_file_name_dataset=True, mzs_dataset='mzs', intensities_dataset='intensities')

TODO: This should be removed after MSData is completely implemented (including the .merge method).

NOTE: Currently ignores MS1 data and file-level metadata.

NOTE: Assumes identical keys in all files.

TODO: When ms_hdfs are created with process_ms_file, mzs and intensities should be refactored to a single spectrum dataset. Here, args and body should be refactored to reflect this change.

TODO: Recursively merge groups?

TODO: max_peaks_n=None

Parameters:
  • group (str) – Merge only datasets within the given group.

  • del_in (bool) – Delete input files after merging.

  • show_tqdm (bool) – Show tqdm progress bar.

  • logger (logging.Logger) – Logger to log the progress.

  • add_file_name_dataset (bool) – Add a new dataset constantly filled with the file names of merged files.

dreams.utils.io.parse_sirius_ms(spectra_file: str) Tuple[dict, List[Tuple[str, ndarray]]]

Parses spectra from the SIRIUS .ms file.

Copied from the code of Goldman et al.: https://github.com/samgoldman97/mist/blob/4c23d34fc82425ad5474a53e10b4622dcdbca479/src/mist/utils/parse_utils.py#LL10C77-L10C77.

Returns:

Metadata and a list of spectra tuples, each containing a name and an array.

Return type:

Tuple[dict, List[Tuple[str, np.ndarray]]]

dreams.utils.io.parsed_lcmsms_to_hdf(output_path, file_props, df_msn_data, df_prec_data, logger, num_peaks=None, num_prec_peaks=None, compress_peaks_lvl=3, compress_full_lvl=3, only_msn=False, only_format=None)
dreams.utils.io.prepend_to_stem(pth: Path, s, sep='_')

path/to/file.txt -> path/to/{s}{sep}file.txt
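Both stem helpers (append_to_stem and prepend_to_stem) describe simple pathlib transformations. A minimal sketch of the two documented mappings follows; the _sketch suffix marks these as illustrative re-implementations rather than the library functions themselves.

```python
from pathlib import Path


def append_to_stem_sketch(pth, s, sep="_"):
    """path/to/file.txt -> path/to/file{sep}{s}.txt"""
    return pth.with_name(f"{pth.stem}{sep}{s}{pth.suffix}")


def prepend_to_stem_sketch(pth, s, sep="_"):
    """path/to/file.txt -> path/to/{s}{sep}file.txt"""
    return pth.with_name(f"{s}{sep}{pth.name}")


src = Path("path/to/file.txt")
print(append_to_stem_sketch(src, "v2"))
print(prepend_to_stem_sketch(src, "v2"))
```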

dreams.utils.io.read_json(pth)
dreams.utils.io.read_json_spec(pth, peaks_key='peaks', prec_mz_key='precursor_mz')
dreams.utils.io.read_lcmsms(input_path, logger, store_precursors=True, pwiz_stats=False, assign_dformats=True, verbose=False)
dreams.utils.io.read_mgf(pth, **kwargs)
dreams.utils.io.read_ms(pth, peaks_tag='>ms2peaks', charge_tag='#Charge', prec_mz_tag='#Precursor_MZ')
dreams.utils.io.read_msp(pth, **kwargs)
dreams.utils.io.read_mzml(pth: Path | str, verbose: bool = False, scan_range: Tuple[int, int] | None = None)
dreams.utils.io.read_pickle(pth)
dreams.utils.io.read_textual_ms_format(pth, spectrum_end_line, name_value_sep, spectrum_start_line=None, prec_mz_name=['PEPMASS', 'PRECURSORMZ', 'PRECURSOR_MZ'], charge_name=['CHARGE'], adduct_name=['ADDUCT'], smiles_name=['SMILES'], rt_name=['RTINSECONDS', 'RETENTION_TIME', 'RTINMINUTES', 'RT'], ionmode_name=['IONMODE'], feature_id_name=['FEATURE_ID'], scan_number_name=['SCAN_NUMBER'], name_name=['NAME'], ignore_line_prefixes=(), encoding='utf-8')
dreams.utils.io.sample_hdf(hdf_pth, n_samples, out_pth=None, seed=333, compression='gzip', compression_opts=4)
dreams.utils.io.save_nist_like_df_to_mgf(df, out_pth: Path, remove_mol_info=False, all_mplush_adducts=False)
dreams.utils.io.savefig(name, path, extension='pdf')
dreams.utils.io.setup_logger(log_file_path=None, log_name='log')
dreams.utils.io.suppress_output()
dreams.utils.io.wandb_import(project_name, entity_name='roman-bushuiev', tags={}, run_name_suffixes=None)
dreams.utils.io.write_json(obj, pth)
dreams.utils.io.write_pickle(obj, pth)

dreams.utils.lcms module

class dreams.utils.lcms.MSLevelsOrder(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum representing order of spectra in MS file.

Further l_i denotes MS level of i-th spectrum.

CONSEQUENT_MSN = 6
EMPTY = 1
INVALID = 8
MIXED_MSN = 7
SINGLE_MS1 = 2
SINGLE_MSN = 3
UNIFORM_MS1 = 4
UNIFORM_MSN = 5
class dreams.utils.lcms.SpecType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum representing the type of spectrum (centroid, profile and other corner cases).

CENTROID = 1
INVALID = 6
PROFILE = 2
SIZE_OF_SPECTRUMTYPE = 5
THRESHOLDED = 3
UNKNOWN = 4
dreams.utils.lcms.estimate_peak_list_type(pl: array, to_int=True, verbose=False)

Reproduced from MZmine. https://github.com/mzmine/mzmine3/blob/master/src/main/java/io/github/mzmine/util/scans/ScanUtils.java#L609

ASSUMES PEAK LIST TO BE SORTED BY M/Z (no check in favor of performance).

dreams.utils.lcms.get_instrument_props(msdata)
dreams.utils.lcms.get_order_of_spectra(msdata) MSLevelsOrder
dreams.utils.lcms.get_pwiz_stats(msdata)

Checks the presence of spectra centroided by ProteoWizard msconvert yet having zero intensities. Outputs the number of such spectra and the histogram of types of spectra converted by msconvert.

dreams.utils.lcms.get_spectrum_type(spec: _MSSpectrumDF, to_int=False) SpecType
dreams.utils.lcms.get_tight_xics(msdata, mz_tol_1=0.5, mz_tol_2=0.01, intensity_rel_tol=0.1, xic_len_thld=5, n_highest_peaks=3)

A tight XIC at a given m/z is a cut of the MS data across the rt dimension, containing the highest peak and all peaks in its neighbourhood with respect to rt. The length of the neighbourhood is defined independently in each direction by the m/z and intensity tolerance parameters. When the algorithm builds an XIC, it starts from a particular peak (xic_mz, xic_in) and consecutively examines the peaks in its neighbourhood one by one. Suppose (prev_mz, prev_in) and (next_mz, next_in) are two peaks compared during the run, where (prev_mz, prev_in) is the current border of the neighbourhood. The neighbourhood is then extended to (next_mz, next_in) only if it satisfies two conditions:

  1. abs(next_mz - xic_mz) <= m/z tolerance

  2. next_in <= prev_in * intensity tolerance

The algorithm performs 2 traversals across all MS1 spectra:
  1. Builds tight XICs for the m/z’s of the n_highest_peaks highest peaks of each spectrum, where the m/z tolerance window is “wide” (mz_tol_1).

  2. Computes medians of the m/z’s across the XICs obtained in step I., which are used to build new tight XICs with a “smaller” m/z tolerance window (mz_tol_2).

Parameters:
  • msdata – MS data to build XICs from

  • mz_tol_1 – absolute width of the m/z tolerance window for traversal I.

  • mz_tol_2 – absolute width of the m/z tolerance window for traversal II.

  • intensity_rel_tol – peaks

  • xic_len_thld – threshold for the number of peaks in XICs (XICs are filtered both after I. and II.)

  • n_highest_peaks – number of highest peaks to choose in I.

NOTE: Since such XICs contain all peaks in the neighbourhood, they are referred to as tight XICs.

TODO: improve speed, very slow.

dreams.utils.lcms.remove_electromagnetic_spectra(msdata)
dreams.utils.lcms.sort_by_rt(msdata)
dreams.utils.lcms.sorted_by_rt(msdata)
dreams.utils.lcms.standartize_gnps_species(species: Series)

dreams.utils.misc module

dreams.utils.misc.all_close_pairwise(numbers, eps=0.01)
dreams.utils.misc.calc_attention_entropy(attention_scores, as_plot=True, save_out_basename: Path = None)
Parameters:

attention_scores – dict with [0, num_layers] keys and tensor (batch_size, num_heads, seq_len, seq_len) values.

dreams.utils.misc.chunk_list(lst, chunks_n)
dreams.utils.misc.chunk_list_eq_sum(lst, chunks_n, val=<function <lambda>>)

Partitions the list lst into chunks_n bins with approximately equal sums. If the elements of lst are not numbers, the val function can be specified to extract the desired number from each element.
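One common way to obtain bins with approximately equal sums is a greedy heuristic: take the items from largest to smallest and put each one into the currently lightest bin. The sketch below illustrates that idea only; the algorithm actually used by chunk_list_eq_sum is not documented here, and the name chunk_list_eq_sum_sketch is hypothetical.

```python
def chunk_list_eq_sum_sketch(lst, chunks_n, val=lambda x: x):
    # Assumption: greedy "largest first into the lightest bin" heuristic.
    chunks = [[] for _ in range(chunks_n)]
    sums = [0] * chunks_n
    for item in sorted(lst, key=val, reverse=True):
        lightest = sums.index(min(sums))  # first bin with the smallest running sum
        chunks[lightest].append(item)
        sums[lightest] += val(item)
    return chunks


# Plain numbers.
print(chunk_list_eq_sum_sketch([8, 7, 6, 5, 4], 2))

# Non-numeric elements: val extracts the number to balance on.
files = [("a.mzML", 3), ("b.mzML", 1), ("c.mzML", 2)]
print(chunk_list_eq_sum_sketch(files, 2, val=lambda f: f[1]))
```

The heuristic does not guarantee an optimal partition, but it is simple and usually good enough for balancing work across workers.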

dreams.utils.misc.complete_permutation(arr: array)

Returns a copy of arr with shuffled elements such that each element has a different position.

Parameters:

arr – 1D NumPy array
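A permutation in which no element keeps its position is called a derangement. A minimal sketch of one way to sample it is rejection sampling: draw random permutations until none of the indices is a fixed point. The function name, the seed argument, and the choice of rejection sampling are assumptions for illustration; the library may use a different method.

```python
import numpy as np


def complete_permutation_sketch(arr: np.ndarray, seed=None) -> np.ndarray:
    # Assumption: rejection sampling; about 1 in e (~37%) of random
    # permutations has no fixed point, so few retries are expected.
    n = len(arr)
    if n < 2:
        raise ValueError("No derangement exists for fewer than 2 elements.")
    rng = np.random.default_rng(seed)
    positions = np.arange(n)
    while True:
        perm = rng.permutation(n)
        if not np.any(perm == positions):
            return arr[perm]  # fancy indexing returns a copy


arr = np.arange(10)
shuffled = complete_permutation_sketch(arr, seed=0)
print(shuffled)
```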

dreams.utils.misc.contains_similar(lst, query_val, epsilon, return_idx=False)
dreams.utils.misc.download_pretrained_model(model_name: str = 'embedding_model.ckpt', download_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/dreams-docs/envs/latest/lib/python3.11/site-packages/dreams/models/pretrained'), verbose: bool = True)

Download a pre-trained model from the Hugging Face Hub and return its location on disk.

Parameters:
  • model_name (str) – Name of the model to download.

  • download_dir (Path) – Local directory to download the model to.

  • verbose (bool) – Whether to print verbose output.

dreams.utils.misc.gems_hf_download(file_pth: str, local_dir: str | Path | None = None) str

Download a GeMS file from the Hugging Face Hub and return its location on disk.

Parameters:
  • file_pth (str) – Name of the file to download.

  • local_dir (Optional[Union[str, Path]]) – Local directory to download the file to.

dreams.utils.misc.get_closest_values(lst, query_val, n=1, return_idx=False)
dreams.utils.misc.hf_download(repo_id: str, file_pth: str, local_dir: str | Path | None = None, repo_type: str = 'dataset') str

Download a file from the Hugging Face Hub and return its location on disk.

Parameters:
  • repo_id (str) – Hugging Face repository ID.

  • file_pth (str) – Name of the file to download.

  • local_dir (Optional[Union[str, Path]]) – Local directory to download the file to.

  • repo_type (str) – Type of the repository.

dreams.utils.misc.interpolate_interval(a, b, n, only_inter=False, rounded=False)
Parameters:
  • a – Start point.

  • b – End point.

  • n – Num. of steps.

  • only_inter – Does not return interval ends a and b.

  • rounded

dreams.utils.misc.is_float(s)
dreams.utils.misc.is_sorted(lst)
dreams.utils.misc.lists_to_legends(lists)

E.g. [[‘a’, ‘b’, ‘c’], [1, 2, 3]] -> [‘a | 1’, ‘b | 2’, ‘c | 3’].
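The example above amounts to joining the lists element-wise. A minimal sketch, with the hypothetical name lists_to_legends_sketch, assuming all lists have the same length:

```python
def lists_to_legends_sketch(lists):
    # zip(*lists) pairs up the i-th elements of every list.
    return [" | ".join(str(v) for v in values) for values in zip(*lists)]


print(lists_to_legends_sketch([["a", "b", "c"], [1, 2, 3]]))
```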

dreams.utils.misc.merge_stats(stats: Sequence[dict], sets_len=False)
dreams.utils.misc.networkx_to_dataframe(G: Graph) DataFrame

dreams.utils.mols module

class dreams.utils.mols.MolPropertyCalculator

Bases: object

denormalize_prop(prop, prop_name, do_not_add_min=False)
denormalize_props(props)
mol_to_props(mol, min_max_norm=False)
normalize_prop(prop, prop_name)
normalize_props(props)
dreams.utils.mols.closest_mz_frags(query_mz, frags, n=1, mass_shift=1, return_masses=False, print_masses=True)
dreams.utils.mols.disable_rdkit_log()
dreams.utils.mols.formula_is_carbohydrate(formula)
dreams.utils.mols.formula_is_halogenated(formula)
dreams.utils.mols.formula_to_dict(formula)

Transforms chemical formula string to dictionary mapping elements to their frequencies e.g. ‘C15H24’ -> {‘C’: 15, ‘H’: 24}
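The transformation can be sketched with a regular expression that matches an element symbol followed by an optional count. This is an illustration under simplifying assumptions (no parentheses, charges, or isotope labels), not the library's implementation; formula_to_dict_sketch is a hypothetical name.

```python
import re


def formula_to_dict_sketch(formula: str) -> dict:
    counts = {}
    # An element is an uppercase letter plus an optional lowercase letter;
    # a missing count means 1. Repeated elements are summed.
    for element, n in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(n) if n else 1)
    return counts


print(formula_to_dict_sketch("C15H24"))
print(formula_to_dict_sketch("C2H5OH"))
```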

dreams.utils.mols.formula_type(f)
dreams.utils.mols.fp_func_from_str(s)
Parameters:

s – E.g. “fp_rdkit_2048” or “fp_maccs_166”.

dreams.utils.mols.generate_fragments(mol: Mol, max_cuts: int = None)

Generates all possible fragments of a molecule up to a certain number of bond cuts or without the restriction if max_cuts is not specified.

Parameters:
  • mol – an RDKit molecule object

  • max_cuts – the maximum number of bonds to cut

Returns:

A set of RDKit Mol objects representing all possible fragments.

dreams.utils.mols.generate_spectrum(mol: Mol, prec_mz: float = None, fragments: List = None, max_cuts: int = None)

Generates an MS/MS spectrum by exhaustively simulating the m/z values of theoretical fragments of a given molecule. The algorithm is very simplistic since it considers only subgraph-like fragments, does not consider isotopes, etc.

Parameters:
  • mol – An RDKit molecule object.

  • prec_mz – The m/z value of a molecule. If not specified, it is calculated as the sum of the exact molecular weight of the molecule and 1.

  • fragments – A list of RDKit Mol objects representing pre-generated fragments of the molecule. If not specified, the function will generate the fragments automatically.

  • max_cuts – The maximum number of bonds to cut when generating fragments. If not specified, all possible fragments will be generated without any restriction on the number of cuts.

Returns:

A spectrum represented as a numpy array with two columns: m/z values and their respective intensities.

dreams.utils.mols.get_mol_mass(mol)
dreams.utils.mols.maccs_fp(mol, as_numpy=True)
NOTE: Since indexing of MACCS keys starts from 1, when converting to numpy array with as_numpy, the first element is removed, so the resulting array has 166 elements instead of 167.

dreams.utils.mols.mol_to_formula(mol, as_dict=False)
dreams.utils.mols.mol_to_img_str(mol, svg_size=200)

Supposed to be used with pyvis for showing molecule images as graph nodes.

dreams.utils.mols.mol_to_inchi14(mol: Mol)
dreams.utils.mols.morgan_fp(mol, binary=True, fp_size=4096, radius=2, as_numpy=True)
dreams.utils.mols.morgan_mol_sim(m1, m2, fp_size=4096, radius=2)
dreams.utils.mols.morgan_smiles_sim(s1, s2, fp_size=4096, radius=2)
dreams.utils.mols.np_classify(smiles: List[str], progress_bar=True, sleep_each_n_requests=100)
dreams.utils.mols.np_to_rdkit_fp(fp)
dreams.utils.mols.rdkit_fp(mol, fp_size=4096)

Default RDKit fingerprint.

dreams.utils.mols.rdkit_fp_to_np(fp)
dreams.utils.mols.rdkit_mol_sim(m1, m2, fp_size=4096)

Default RDKit Tanimoto similarity on default RDKit fingerprint.

dreams.utils.mols.rdkit_smiles_sim(s1, s2, fp_size=4096)

Default RDKit Tanimoto similarity on default RDKit fingerprint.

dreams.utils.mols.show_mols(mols, legends='new_indices', smiles_in=None, svg=False, sort_by_legend=False, max_mols=500, legend_float_decimals=4, mols_per_row=6, save_pth: Path | None = None)

Returns svg image representing a grid of skeletal structures of the given molecules

Parameters:
  • mols – list of rdkit molecules

  • legends – list of labels for each molecule, length must be equal to the length of mols. Can be ‘new_indices’ for default numbering, ‘masses’ for molecular weights, or a list of custom labels

  • smiles_in – True - SMILES inputs, False - RDKit mols, None - determine automatically

  • svg – True - return svg image, False - return png image

  • sort_by_legend – True - sort molecules by legend values

  • max_mols – maximum number of molecules to show

  • legend_float_decimals – number of decimal places to show for float legends

  • mols_per_row – number of molecules per row to show

  • save_pth – path to save the .svg image to

dreams.utils.mols.smiles_to_formula(s, as_dict=False, invalid_mol_smiles='')
dreams.utils.mols.smiles_to_inchi14(s)
dreams.utils.mols.tanimoto_sim(fp1, fp2)

Default RDKit Tanimoto similarity.
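For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. The NumPy sketch below only illustrates that formula on 0/1 arrays; it does not call RDKit, and both the name tanimoto_sim_sketch and the convention of returning 0.0 for two all-zero fingerprints are assumptions.

```python
import numpy as np


def tanimoto_sim_sketch(fp1, fp2) -> float:
    a = np.asarray(fp1, dtype=bool)
    b = np.asarray(fp2, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # assumption: two empty fingerprints are treated as dissimilar
    return float(np.logical_and(a, b).sum() / union)


# One shared bit out of three set bits overall gives 1/3.
print(tanimoto_sim_sketch([1, 1, 0, 0], [1, 0, 1, 0]))
```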

dreams.utils.plots module

dreams.utils.plots.assign_colors(x)
dreams.utils.plots.color_generator(n_colors, cmap='plotly')
dreams.utils.plots.distr_density(values, domain=None, show_mean=True, show_median=False, title=None)
dreams.utils.plots.get_nature_hex_colors(extended=True)
dreams.utils.plots.get_palette(cmap='plotly', reversed_order=False, as_hex=False)
dreams.utils.plots.init_plotting(figsize=(6, 2), font_scale=0.95, style='whitegrid', cmap='plotly', font=None, legend_outside=False)
dreams.utils.plots.pie_chart(values, other_percent_thld='auto', title=None, figsize=(6, 6))
dreams.utils.plots.plot_nx_graph(G: Graph, node_attrs: list = [], special_node: int = None, special_nodes: list = [], pos: dict = None, node_color_attr: str = None, node_size: int = 10, edge_color: str = 'black', edge_width: int = 2, title: str = None, html_pth: Path | str = None) None

Plots a NetworkX graph using Plotly, with options to customize node attributes and highlight special nodes.

Parameters:
  • G (nx.Graph) – The NetworkX graph to be plotted.

  • node_attrs (list) – List of node attributes to be displayed in hover text.

  • special_node (int) – Node to be highlighted with a star symbol and larger size.

  • special_nodes (list) – List of nodes to be highlighted with a triangle symbol.

  • pos (dict) – Dictionary specifying the positions of nodes. If None, a spring layout will be computed.

  • node_color_attr (str) – Node attribute used to determine node colors.

  • node_size (int) – Size of the nodes.

  • edge_color (str) – Color of the edges.

  • edge_width (int) – Width of the edges.

  • title (str) – Title of the plot.

dreams.utils.plots.rgb_to_hex(r, g, b)
dreams.utils.plots.save_fig(name, dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/dreams-docs/envs/latest/lib/python3.11/site-packages/misc/figures'), dpi=None, transparent=True)

dreams.utils.spectra module

class dreams.utils.spectra.MSnSpectrum(peak_list, precursor_mol=None, precursor_mz=None, precursor_charge=None, ionization_mode=None, collision_energy=None, assert_is_valid=True)

Bases: object

get_collision_energy()
get_intensities()
get_ionization_mode()
get_mzs()
get_peak_list()
get_peaks_n()
get_precursor_charge()
get_precursor_formula(to_dict=False)
get_precursor_mass()
get_precursor_mol()
get_precursor_mz()
class dreams.utils.spectra.PeakListModifiedCosine(mz_tolerance: float = 0.05, unpad: bool = True)

Bases: object

compute(spec1: ndarray, spec2: ndarray, prec_mz1: float, prec_mz2: float) float
compute_pairwise(specs: ndarray, prec_mzs: ndarray, avg=False) ndarray | float
dreams.utils.spectra.bin_peak_list(peak_list: array, max_mz: float, bin_step: float) array
dreams.utils.spectra.bin_peak_lists(peak_lists: array, max_mz: float, bin_step: float) array
dreams.utils.spectra.df_to_MSnSpectra(df, assert_is_valid=True, as_new_column=False)

Processes NIST-like DataFrame to Series of MSnSpectra. # TODO: include more columns.

dreams.utils.spectra.from_hot(hots: Tensor, bin_size: float, dtype=torch.float64) Tensor

Makes the last dimension singleton.

dreams.utils.spectra.from_hot_logits(vals: Tensor, bin_size: float) Tensor
dreams.utils.spectra.get_base_peak(peak_list: array, return_i=False)
dreams.utils.spectra.get_closest_mz_peak(peak_list: array, query_mz)
dreams.utils.spectra.get_closest_mz_peaks(peak_list: array, query_mz, n)

Returns list of pairs (mz, intensity) of length n containing peaks having m/z values closest to the query_mz sorted ascending by the difference.
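A minimal NumPy sketch of this lookup, assuming the peak list has shape (2, num_peaks) with m/z values in the first row and intensities in the second. The name get_closest_mz_peaks_sketch is hypothetical, and the library's own return types may differ.

```python
import numpy as np


def get_closest_mz_peaks_sketch(peak_list, query_mz, n):
    mzs, intensities = np.asarray(peak_list, dtype=float)
    # Sort peak indices by absolute m/z difference, smallest first.
    order = np.argsort(np.abs(mzs - query_mz), kind="stable")[:n]
    return [(float(mzs[i]), float(intensities[i])) for i in order]


peak_list = np.array([[50.0, 100.0, 150.0, 200.0],
                      [0.1, 1.0, 0.5, 0.2]])
print(get_closest_mz_peaks_sketch(peak_list, query_mz=140.0, n=2))
```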

dreams.utils.spectra.get_highest_peaks(peak_list: array, n)

Returns n highest peaks.

dreams.utils.spectra.get_num_peaks(peak_list)
dreams.utils.spectra.get_peak_intens_nbhd(peak_list, peak_i, intens_thld, intens_thld_below=True)

Returns indices determining the range of the neighbourhood around the peak at peak_i. The neighbourhood is defined as all consecutive peaks above (or below if intens_thld_below=False) the intens_thld intensity.

dreams.utils.spectra.has_peak_at(peak_list: array, query_mz, epsilon)
dreams.utils.spectra.intens_amplitude(peak_list)
dreams.utils.spectra.is_valid_peak_list(peak_list: array, relative_intensities=True, verbose=None, return_problems_list=False)

Returns True if the peak list is valid (the numbers of m/z and intensity values are equal, m/z values are sorted in ascending order, etc.), else False. TODO: consider padded spectra.

Parameters:
  • peak_list – np.array of shape (2, n), where n is a number of peaks.

  • relative_intensities – if True, performs additional checks for the intensities to be relative.

  • return_problems_list – if True, list of strings describing problems (invalid causes) will be returned (e.g. [‘#mzs != #intensities’, ‘Exists m/z < 0.0’]).

  • verbose – for ‘problems’ the reasons why peak list is not valid will be printed, for ‘problems_and_peak_list’, the peak list will be printed as well.

dreams.utils.spectra.max_mz(peak_list)
dreams.utils.spectra.merge_peak_lists(peak_lists: List[array], eps=0.01, n_highest_peaks=None) array

Merges peak lists without creating new “artificial” m/z values. The algorithm traverses all peaks (from all spectra) in descending order of intensity and creates merged peaks by summing up the intensities of all peaks in the range m/z ± eps. Each peak is used exactly once (either as the one determining the range or as one belonging to the range), and final peaks are not transitively connected. Note that the complexity is O(n^2), where n is the total number of peaks within all spectra.

Parameters:
  • peak_lists – List of NumPy arrays of shape (2, num. of peaks).

  • eps – Epsilon determining the range of m/z values of peaks which are aggregated.

  • n_highest_peaks – If not None, n_highest_peaks highest peaks are selected from each peak list.
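The merging procedure described above can be sketched as follows. This is a simplified illustration, not the library's code: it omits n_highest_peaks, and sorting the output by m/z is an assumption. The name merge_peak_lists_sketch is hypothetical.

```python
import numpy as np


def merge_peak_lists_sketch(peak_lists, eps=0.01):
    # Pool the peaks of all spectra into one (2, n) array.
    peaks = np.concatenate([np.asarray(pl, dtype=float) for pl in peak_lists], axis=1)
    mzs, intens = peaks
    used = np.zeros(mzs.size, dtype=bool)
    merged = []
    # Visit peaks from the most to the least intense.
    for i in np.argsort(-intens, kind="stable"):
        if used[i]:
            continue
        # Unused peaks within m/z +- eps of peak i (this includes peak i itself).
        in_range = ~used & (np.abs(mzs - mzs[i]) <= eps)
        # The merged peak keeps the m/z of peak i, so no new m/z value is created.
        merged.append((mzs[i], intens[in_range].sum()))
        used |= in_range
    merged.sort()  # assumption: return peaks ordered by m/z
    return np.array(merged).T


a = np.array([[100.0, 200.0], [1.0, 0.5]])
b = np.array([[100.005, 300.0], [0.4, 0.2]])
print(merge_peak_lists_sketch([a, b]))
```

Because each unused peak is compared against all others, the loop is quadratic in the total number of peaks, matching the complexity noted in the description.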

dreams.utils.spectra.normalize_mzs(peak_list: array, max_mz: float, in_place=True, high=False)
dreams.utils.spectra.num_high_peaks(peak_list, high_intensity_thld)
dreams.utils.spectra.num_hot_classes(max_val: float, bin_size: float) int
dreams.utils.spectra.pad_peak_list(pl: ndarray, target_len: int, pad_val: float = 0, axis: int = -1) ndarray

Pads peak list to the target_len with pad_val or performs this for a batch of peak lists.

Parameters:
  • pl – Peak list of shape (2, num_peaks) or a batch of peak lists of shape (batch_size, 2, num_peaks).

  • target_len – Target num. of peaks of the peak list.

  • pad_val – Value used for padding.

  • axis – Axis along which the padding is performed.
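Padding along one axis can be sketched with numpy.pad. The sketch assumes padding is appended at the end of the axis and that a peak list longer than target_len is an error; neither detail is stated above. The name pad_peak_list_sketch is hypothetical.

```python
import numpy as np


def pad_peak_list_sketch(pl, target_len, pad_val=0.0, axis=-1):
    pl = np.asarray(pl, dtype=float)
    missing = target_len - pl.shape[axis]
    if missing < 0:
        raise ValueError("Peak list is longer than target_len.")
    # Pad only the chosen axis, and only at its end.
    pad_width = [(0, 0)] * pl.ndim
    pad_width[axis] = (0, missing)
    return np.pad(pl, pad_width, mode="constant", constant_values=pad_val)


single = np.array([[50.0, 100.0, 150.0], [0.1, 1.0, 0.5]])
print(pad_peak_list_sketch(single, target_len=5).shape)

batch = np.stack([single] * 4)
print(pad_peak_list_sketch(batch, target_len=5).shape)
```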

dreams.utils.spectra.parse_raw_peak_list(peak_list: str)

Parses a peak list string into NumPy arrays of m/z and intensity values, e.g. ‘53.0379 0.894101\n54.0335 0.661867 ‘ -> ([53.0379, 54.0335], [0.894101, 0.661867])
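A minimal sketch of such a parser, assuming one peak per line with whitespace-separated m/z and intensity values, as the example suggests. The separators accepted by the library function are not documented here, and the name parse_raw_peak_list_sketch is hypothetical.

```python
import numpy as np


def parse_raw_peak_list_sketch(peak_list: str):
    # One "m/z intensity" pair per non-empty line.
    rows = [line.split() for line in peak_list.splitlines() if line.strip()]
    mzs = np.array([float(row[0]) for row in rows])
    intensities = np.array([float(row[1]) for row in rows])
    return mzs, intensities


mzs, intensities = parse_raw_peak_list_sketch("53.0379 0.894101\n54.0335 0.661867 ")
print(mzs, intensities)
```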

dreams.utils.spectra.plot_spectrum(spec, hue=None, xlim=None, ylim=None, mirror_spec=None, highl_idx=None, high_peaks_at=None, figsize=(6, 2), colors=None, save_pth=None, prec_mz=None, mirror_prec_mz=None, normalize_intensities=True, spec_text=None, mirror_spec_text=None)

Plots a mass spectrum with optional mirror spectrum and highlighted peaks.

TODO: Whole function should be refactored, it is a mess.

Parameters:
  • spec – The spectrum to be plotted.

  • hue – Optional values to color the peaks.

  • xlim – X-axis limits.

  • ylim – Y-axis limits.

  • mirror_spec – Optional mirror spectrum to be plotted.

  • highl_idx – Indices of peaks to be highlighted.

  • high_peaks_at – M/z values of peaks to be highlighted.

  • figsize – Figure size.

  • colors – Colors for the plot.

  • save_pth – Path to save the plot.

  • prec_mz – Precursor m/z value to display.

  • mirror_prec_mz – Precursor m/z value of the mirror spectrum to display.

  • spec_text – Text to display on the spectrum.

  • mirror_spec_text – Text to display on the mirror spectrum.

dreams.utils.spectra.prepend_precursor_peak(peak_list: array, prec_mz, prec_in=1.1, high=False)
dreams.utils.spectra.process_peak_list(peak_list, n_highest=None, sort_mzs=False, to_rel_intens=False)
dreams.utils.spectra.to_classes(vals: Tensor, max_val: float, bin_size: float, special_vals: List[float] = (), return_num_classes: bool = False) Tensor

Assumes that last dimension of mzs is singleton.

dreams.utils.spectra.to_hot(vals: Tensor, max_val: float, bin_size: float, dtype=torch.float64)

Assumes that last dimension of mzs is singleton.

dreams.utils.spectra.to_rel_intensity(peak_list: array, scale_factor=None)
dreams.utils.spectra.trim_peak_list(peak_list: array, n_highest: int)

Trims peak list by selecting n_highest highest peaks or performs this for a batch of peak lists.

Parameters:
  • peak_list – np.array of shape (2, num_peaks) or (num_spectra, 2, num_peaks).

  • n_highest – Number of highest peaks to be selected.

dreams.utils.spectra.unpad_peak_list(peak_list: array, pad_val=0.0)

Module contents