dreams.utils package
Submodules
dreams.utils.annotation module
- class dreams.utils.annotation.FingerprintInChIRetrieval(df_pkl_pth: Path, candidate_smiles_col: str, fp_name: str, top_k: int | List[int], index_smiles_col: str = 'SMILES', candidate_inchi14_col: str = None)
Bases:
object
- compute_reset_metrics(metrics_prefix='', return_unaveraged=False)
- retrieve_inchi14s(query_fp, label_smiles)
dreams.utils.data module
- class dreams.utils.data.AnnotatedSpectraDataset(spectra: List[MSnSpectrum], label: str, spec_preproc: SpectrumPreprocessor, dformat: DataFormat, return_smiles=False)
Bases:
Dataset
NOTE: This class is deprecated in favor of LabeledSpectraDataset.
- class dreams.utils.data.AttentionEntropyValidation(nist_like_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, n_samples=None, as_plot=False, save_out_basename=None)
Bases:
ImplExplValidation
- get_res()
- class dreams.utils.data.CSRKNN(csr, one_minus_weights=True)
Bases:
object
- static from_edge_list(s, t, w)
Construct CSR matrix from edge list.
- Parameters:
s – Source nodes.
t – Target nodes.
w – Edge weights.
- static from_ngt_index(ngt_index, k, one_minus_for_weights=False)
- static from_npz(pth: Path, one_minus_weights=False)
Load a CSR matrix that was stored as a COO matrix using the CSRKNN.save method.
- inv_neighbors(i, sort=True, exclude_self_loops=True) → ndarray
Get nodes that have i-th node as a neighbor.
- neighbors(i, sort=True, exclude_self_loops=True) → ndarray
Get neighbors of the i-th node, i.e. all non-zero columns in i-th row of the CSR matrix.
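To illustrate the CSR layout that `from_edge_list` and `neighbors` rely on, here is a minimal pure-Python sketch. The helper names `edges_to_csr` and `row_neighbors` are hypothetical and are not part of `dreams`; the sketch only shows how an edge list maps to CSR arrays and how the neighbors of node i are read from row i.

```python
from typing import List, Sequence, Tuple


def edges_to_csr(
    s: Sequence[int], t: Sequence[int], w: Sequence[float], n_nodes: int
) -> Tuple[List[int], List[int], List[float]]:
    """Build CSR arrays (indptr, indices, data) from parallel edge lists."""
    # Count edges per source node, then prefix-sum to obtain row offsets.
    indptr = [0] * (n_nodes + 1)
    for src in s:
        indptr[src + 1] += 1
    for i in range(n_nodes):
        indptr[i + 1] += indptr[i]
    # Sorting by source groups the edges row by row, matching indptr.
    indices: List[int] = []
    data: List[float] = []
    for _, tgt, wt in sorted(zip(s, t, w)):
        indices.append(tgt)
        data.append(wt)
    return indptr, indices, data


def row_neighbors(
    indptr: List[int], indices: List[int], i: int, exclude_self_loops: bool = True
) -> List[int]:
    """Return the non-zero columns of row i, i.e. the neighbors of node i."""
    cols = indices[indptr[i]:indptr[i + 1]]
    return [c for c in cols if not (exclude_self_loops and c == i)]


indptr, indices, data = edges_to_csr(
    s=[0, 0, 1, 2, 2], t=[1, 2, 2, 2, 0], w=[0.9, 0.5, 0.7, 1.0, 0.5], n_nodes=3
)
print(row_neighbors(indptr, indices, 0))  # [1, 2]
print(row_neighbors(indptr, indices, 2))  # [0] (the self-loop 2 -> 2 is skipped)
```

In practice a sparse-matrix library would hold these arrays; the point here is only the row-slicing idea behind neighbor lookup.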
- to_edge_list(one_minus_weights=False)
Convert CSR matrix to edge list.
- to_graph(directed=True, graph_class='igraph')
Convert CSR matrix to a graph object using the specified library.
- Parameters:
directed (bool) – If True, creates a directed graph. If False, creates an undirected graph.
graph_class (str) – Specifies which graph library to use: “igraph” for igraph.Graph, “networkx” for networkx.Graph.
- Returns:
Graph object: igraph.Graph or networkx.Graph, depending on the specified graph_class.
- to_npz(pth: Path) → None
Save CSR matrix as COO matrix on disk. COO seems to work better and does not produce errors when saving large matrices.
- class dreams.utils.data.CVDataModule(dataset: Dataset, fold_idx: Series, batch_size: int, num_workers=0)
Bases:
LightningDataModule
- prepare_data_per_node
If True, each LOCAL_RANK=0 will call prepare data. Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare data.
- allow_zero_length_dataloader_with_multiple_devices
If True, dataloader with zero length within local rank is allowed. Default value is False.
- get_num_folds() → int
- setup_fold_index(fold_i: int) → None
- train_dataloader() → DataLoader
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see the PyTorch Lightning documentation.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
Related methods: fit(), prepare_data(), setup().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- val_dataloader() → DataLoader
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see the PyTorch Lightning documentation.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
Related methods: fit(), validate(), prepare_data(), setup().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- class dreams.utils.data.ContrastiveSpectraDataset(df: DataFrame, spec_preproc: SpectrumPreprocessor, msn_spec_col='MSnSpectrum', pos_idx_col='pos_idx', neg_idx_col='neg_idx', n_pos_samples=1, n_neg_samples=10, return_smiles=False, logger=None)
Bases:
Dataset
- class dreams.utils.data.ContrastiveValidation(nist_like_pkl_pth, pairs_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, n_instances=None, n_samples=None, seed=3, save_out_basename: Path = None, euclidean=False)
Bases:
ImplExplValidation
- get_labels()
- get_name()
- get_res()
- get_umap_plot()
- class dreams.utils.data.CorrelationValidation(nist_like_pkl_pth, corr_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, n_samples=None)
Bases:
ImplExplValidation
- get_res()
- class dreams.utils.data.ImplExplValidation(nist_like_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, df_idx=None, n_samples=None, seed=1)
Bases:
ABC
- get_data(device=None, torch_dtype=None)
- abstract get_res()
- set_model_gains(model_gains)
- class dreams.utils.data.KNNValidation(nist_like_pkl_pth, pairs_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, k=3, n_instances=None, n_samples=None, seed=3, save_out_basename: Path = None)
Bases:
ContrastiveValidation
- get_res()
- class dreams.utils.data.LabeledSpectraDataset(msdata: Path | str | MSData, label: str, spec_preproc: SpectrumPreprocessor, dformat: DataFormat, return_smiles=False)
Bases:
Dataset
- class dreams.utils.data.MSData(hdf5_pth: Path | str | List[Path], in_mem: bool = False, mode: str = 'r', spec_col: str = 'spectrum', prec_mz_col: str = 'precursor_mz', index_col: str | None = None)
Bases:
object
- add_column(name, data, remove_old_if_exists=False)
- at(i, plot_mol=True, plot_spec=True, return_spec=False, vals=None, unpad_spec=True)
- columns()
- extend_column(name, data)
- form_subset(idx, out_pth, verbose=False, **kwargs)
- static from_hdf5(pth: Path, **kwargs)
- static from_hdf5_chunks(pths: List[Path], **kwargs)
- static from_mgf(pth: Path | str, in_mem=True, **kwargs)
- static from_msp(pth: Path | str, in_mem=True, **kwargs)
- static from_mzml(pth: Path | str, scan_range: Tuple[int, int] | None = None, verbose_parser: bool = False, **kwargs)
- static from_mzxml(pth: Path | str, scan_range: Tuple[int, int] | None = None, verbose_parser: bool = False, **kwargs)
- static from_pandas(df: Path | str | DataFrame, n_highest_peaks=128, spec_col='spectrum', prec_mz_col='precursor_mz', adduct_col='adduct', charge_col='charge', mol_col='smiles', rt_col='RT', feature_id_col='scan_number', ignore_cols=(), in_mem=True, hdf5_pth=None, compression_opts=0, mode='r')
- static from_pickle(pth: Path | str, in_mem=True, **kwargs)
- get_adducts(idx=None)
- get_charges(idx=None)
- get_prec_mzs(idx=None)
- get_smiles(idx=None)
- get_spectra(idx=None)
- get_values(col, idx=None, decode_strings=True)
- static load(pth: Path | str, in_mem=False, **kwargs)
- load_col_in_mem(col)
- load_hdf5_in_mem(group)
- static merge(pths: List[Path], out_pth: Path, cols=['spectrum', 'precursor_mz', 'charge', 'adduct'], show_tqdm=True, logger=None, add_dataset_col=True, in_mem=False, spectra_already_trimmed=False, filter_idx=None)
- remove_column(name)
- rename_column(old_name, new_name, remove_old_if_exists=False)
- spec_to_matchms(i: int) → Spectrum
- to_matchms(progress_bar=True) → List[Spectrum]
- to_mgf(out_pth: Path | str)
- to_pandas(unpad=True, ignore_cols=('DreaMS_embedding',))
- to_pynndescent(embedding_col='DreaMS_embedding', out_pth=None, store_index=False, n_neighbors=50, metric='cosine', verbose=True)
- to_torch_dataset(spec_preproc: SpectrumPreprocessor, label=None, **kwargs)
- use_col_as_index(col)
- class dreams.utils.data.ManualValidation(nist_like_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor, n_samples=None, seed=1, df_idx=None)
Bases:
ImplExplValidation
- get_res()
- class dreams.utils.data.MaskedSpectraDataset(in_pth: Path, dformat: DataFormat, ssl_objective: str, spec_preproc: SpectrumPreprocessor, mask_peaks=True, mask_intens_strategy='intens_p', frac_masks=0.2, min_n_masks=2, mask_val=-1.0, min_mask_intens=0.1, mask_prec=False, n_samples=None, logger=None, deterministic_mask=True, ret_order_pairs=False, return_charge=False, acc_est_weight=False, lsh_weight=False, bert801010_masking=False)
Bases:
Dataset
A dataset class for masked spectra used in self-supervised learning tasks.
This class prepares spectra data for various self-supervised learning objectives such as peak masking, m/z masking, and intensity masking.
- data
The dataset containing spectra and related information.
- Type:
dict
- dformat
The data format specification.
- Type:
DataFormat
- ssl_objective
The self-supervised learning objective.
- Type:
str
- spec_preproc
The spectrum preprocessor.
- Type:
SpectrumPreprocessor
- frac_masks
The fraction of peaks to mask.
- Type:
float
- min_n_masks
The minimum number of peaks to mask.
- Type:
int
- n_samples
The number of samples to use from the dataset.
- Type:
int
- mask_val
The value to use for masked peaks.
- Type:
float
- min_mask_intens
The minimum intensity for peaks to be considered for masking.
- Type:
float
- mask_prec
Whether to mask the precursor peak.
- Type:
bool
- deterministic_mask
Whether to use deterministic masking.
- Type:
bool
- mask_peaks
Whether to mask peaks.
- Type:
bool
- mask_intens_strategy
The strategy for masking intensities.
- Type:
str
- ret_order_pairs
Whether to return retention order pairs.
- Type:
bool
- return_charge
Whether to return charge information.
- Type:
bool
- acc_est_weight
Whether to use accuracy estimation weighting.
- Type:
bool
- lsh_weight
Whether to use LSH weighting.
- Type:
bool
- bert801010_masking
Whether to use BERT-style 80-10-10 masking.
- Type:
bool
- __len__()
Return the length of the dataset.
- __getitem__(i)
Get a single item from the dataset.
- get_spec(i)
Get a preprocessed spectrum.
Initialize the MaskedSpectraDataset.
- Parameters:
in_pth (Path) – The input path for the dataset.
dformat (DataFormat) – The data format specification.
ssl_objective (str) – The self-supervised learning objective.
spec_preproc (SpectrumPreprocessor) – The spectrum preprocessor.
mask_peaks (bool) – Whether to mask peaks.
mask_intens_strategy (str) – The strategy for masking intensities.
frac_masks (float) – The fraction of peaks to mask.
min_n_masks (int) – The minimum number of peaks to mask.
mask_val (float) – The value to use for masked peaks.
min_mask_intens (float) – The minimum intensity for peaks to be considered for masking.
mask_prec (bool) – Whether to mask the precursor peak.
n_samples (int) – The number of samples to use from the dataset.
logger (Logger) – The logger object.
deterministic_mask (bool) – Whether to use deterministic masking.
ret_order_pairs (bool) – Whether to return retention order pairs.
return_charge (bool) – Whether to return charge information.
acc_est_weight (bool) – Whether to use accuracy estimation weighting.
lsh_weight (bool) – Whether to use LSH weighting.
bert801010_masking (bool) – Whether to use BERT-style 80-10-10 masking.
- get_spec(i)
Get a preprocessed spectrum.
- Parameters:
i (int) – The index of the spectrum to retrieve.
- Returns:
A dictionary containing the preprocessed spectrum and related information.
- Return type:
dict
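The following sketch illustrates how the `frac_masks`, `min_n_masks`, `min_mask_intens` and `mask_val` parameters could interact when masking peaks. It is not the implementation of `MaskedSpectraDataset`: the function name `mask_peaks_sketch`, the uniform sampling and the rule for combining `frac_masks` with `min_n_masks` are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def mask_peaks_sketch(
    peaks: Sequence[Tuple[float, float]],
    frac_masks: float = 0.2,
    min_n_masks: int = 2,
    mask_val: float = -1.0,
    min_mask_intens: float = 0.1,
    seed: int = 0,
) -> Tuple[List[Tuple[float, float]], List[int]]:
    """Mask a random subset of sufficiently intense (m/z, intensity) peaks.

    Returns the masked peak list and the sorted indices of the masked peaks.
    """
    rng = random.Random(seed)  # fixed seed gives a reproducible mask
    # Only peaks with relative intensity >= min_mask_intens are eligible.
    candidates = [i for i, (_, intens) in enumerate(peaks) if intens >= min_mask_intens]
    # Assumed rule: mask a fraction of all peaks, but at least min_n_masks,
    # and never more than the number of eligible peaks.
    n_masks = min(len(candidates), max(min_n_masks, int(frac_masks * len(peaks))))
    masked_idx = sorted(rng.sample(candidates, n_masks))
    masked = list(peaks)
    for i in masked_idx:
        masked[i] = (mask_val, mask_val)  # hide both m/z and intensity
    return masked, masked_idx


peaks = [(50.0, 0.05), (75.1, 0.3), (90.2, 1.0), (101.0, 0.08), (120.5, 0.6),
         (133.3, 0.2), (150.0, 0.15), (171.9, 0.9), (180.4, 0.02), (199.9, 0.4)]
masked, idx = mask_peaks_sketch(peaks)
print(len(idx))  # 2 peaks are masked for 10 peaks with frac_masks=0.2
```

A model trained on such inputs would be asked to reconstruct the hidden values at the masked positions.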
- class dreams.utils.data.MatchmsSpectraDataset(spectra: List[Spectrum], spec_preproc: SpectrumPreprocessor)
Bases:
Dataset
- class dreams.utils.data.RandomSplitDataModule(dataset, batch_size: int, max_var_features=None, val_frac=0.1, num_workers=0)
Bases:
LightningDataModule
- prepare_data_per_node
If True, each LOCAL_RANK=0 will call prepare data. Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare data.
- allow_zero_length_dataloader_with_multiple_devices
If True, dataloader with zero length within local rank is allowed. Default value is False.
- test_dataloader()
An iterable or collection of iterables specifying test samples.
For more information about multiple dataloaders, see the PyTorch Lightning documentation.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
Related methods: test(), prepare_data(), setup().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- train_dataloader() → DataLoader
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see the PyTorch Lightning documentation.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
Related methods: fit(), prepare_data(), setup().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- val_dataloader() → DataLoader
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see the PyTorch Lightning documentation.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
Related methods: fit(), validate(), prepare_data(), setup().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- class dreams.utils.data.RawSpectraDataset(spectra, prec_mzs, spec_preproc: SpectrumPreprocessor)
Bases:
Dataset
- class dreams.utils.data.SSLProbingValidation(labeled_data_module: SplittedDataModule, evaluator_impl='torch', n_hidden_layers=[0], n_epochs=100, probing_batch_freq=2500, prefix=None, save_fps_dir=typing.Optional[pathlib.Path])
Bases:
Callback
- on_train_batch_start(trainer, pl_module, batch, batch_idx)
Called when the train batch begins.
- class dreams.utils.data.SpecRetrievalValidation(nist_like_pkl_pth, pairs_pkl_pth, dformat: DataFormat, spec_preproc: SpectrumPreprocessor)
Bases:
ImplExplValidation
Cosine similarity on embeddings <-> equality of InChI keys validation.
- get_res()
- class dreams.utils.data.SpectrumPreprocessor(dformat: DataFormat, prec_intens=1.1, n_highest_peaks=None, spec_entropy_cleaning=False, normalize_mzs=False, precision=32, mz_shift_aug_p=0, mz_shift_aug_max=0, to_relative_intensities=True)
Bases:
object
A class for preprocessing mass spectrometry spectra.
This class provides functionality to preprocess mass spectrometry spectra, including peak trimming, padding, intensity normalization, and various data augmentation techniques.
- dformat
The data format specification.
- Type:
DataFormat
- prec_intens
The intensity of the precursor peak.
- Type:
float
- n_highest_peaks
The number of highest intensity peaks to keep.
- Type:
int
- spec_entropy_cleaning
Whether to apply spectral entropy cleaning.
- Type:
bool
- normalize_mzs
Whether to normalize m/z values.
- Type:
bool
- to_relative_intensities
Whether to convert intensities to relative values.
- Type:
bool
- precision
The precision of the output data (32 or 64 bit).
- Type:
int
- mz_shift_aug_p
The probability of applying m/z shift augmentation.
- Type:
float
- mz_shift_aug_max
The maximum m/z shift for augmentation.
- Type:
float
- __call__(spec, prec_mz, high_form, augment)
Preprocess a single spectrum.
Initialize the SpectrumPreprocessor.
- Parameters:
dformat (DataFormat) – The data format specification.
prec_intens (float) – The intensity of the precursor peak.
n_highest_peaks (int) – The number of highest intensity peaks to keep.
spec_entropy_cleaning (bool) – Whether to apply spectral entropy cleaning.
normalize_mzs (bool) – Whether to normalize m/z values.
precision (int) – The precision of the output data (32 or 64 bit).
mz_shift_aug_p (float) – The probability of applying m/z shift augmentation.
mz_shift_aug_max (float) – The maximum m/z shift for augmentation.
to_relative_intensities (bool) – Whether to convert intensities to relative values.
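As a rough illustration of peak trimming and intensity normalization, the sketch below keeps the `n_highest_peaks` most intense peaks, rescales intensities relative to the base peak, and prepends a precursor peak with intensity `prec_intens`. The function name `preprocess_sketch`, the ordering of the steps and the way the precursor peak is added are assumptions, not the behaviour of `SpectrumPreprocessor.__call__`.

```python
from typing import List, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def preprocess_sketch(
    peaks: Sequence[Peak],
    prec_mz: float,
    n_highest_peaks: int = 3,
    prec_intens: float = 1.1,
) -> List[Peak]:
    """Trim to the most intense peaks and convert to relative intensities."""
    # Keep the n most intense peaks, then restore ascending m/z order.
    top = sorted(peaks, key=lambda p: p[1], reverse=True)[:n_highest_peaks]
    top.sort(key=lambda p: p[0])
    if not top:
        return [(prec_mz, prec_intens)]
    # Relative intensities: the base (most intense) peak becomes 1.0.
    base = max(intens for _, intens in top)
    relative = [(mz, intens / base) for mz, intens in top]
    # Assumed convention: the precursor is prepended as an artificial peak.
    return [(prec_mz, prec_intens)] + relative


spec = [(50.0, 10.0), (75.0, 200.0), (100.0, 50.0), (120.0, 100.0)]
print(preprocess_sketch(spec, prec_mz=150.0))
# [(150.0, 1.1), (75.0, 1.0), (100.0, 0.25), (120.0, 0.5)]
```

A precursor intensity above 1.0 would make the precursor peak distinguishable from all relative fragment intensities, which may be the motivation for the default of 1.1.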
- class dreams.utils.data.SplittedDataModule(dataset, split_mask: Series | ndarray | list, batch_size: int | None, num_workers=0, n_train_samples=None, seed=None, include_val_in_train=False)
Bases:
LightningDataModule
- prepare_data_per_node
If True, each LOCAL_RANK=0 will call prepare data. Otherwise only NODE_RANK=0, LOCAL_RANK=0 will prepare data.
- allow_zero_length_dataloader_with_multiple_devices
If True, dataloader with zero length within local rank is allowed. Default value is False.
- test_dataloader() → DataLoader
An iterable or collection of iterables specifying test samples.
For more information about multiple dataloaders, see the PyTorch Lightning documentation.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
Related methods: test(), prepare_data(), setup().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a test dataset and a test_step(), you don’t need to implement this method.
- train_dataloader() → DataLoader
An iterable or collection of iterables specifying training samples.
For more information about multiple dataloaders, see the PyTorch Lightning documentation.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
For data processing use the following pattern:
- download in prepare_data()
- process and split in setup()
However, the above are only necessary for distributed processing.
Warning
Do not assign state in prepare_data().
Related methods: fit(), prepare_data(), setup().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
- val_dataloader() → DataLoader
An iterable or collection of iterables specifying validation samples.
For more information about multiple dataloaders, see the PyTorch Lightning documentation.
The dataloader you return will not be reloaded unless you set the Trainer argument reload_dataloaders_every_n_epochs to a positive integer.
It’s recommended that all data downloads and preparation happen in prepare_data().
Related methods: fit(), validate(), prepare_data(), setup().
Note
Lightning tries to add the correct sampler for distributed and arbitrary hardware. There is no need to set it yourself.
Note
If you don’t need a validation dataset and a validation_step(), you don’t need to implement this method.
- dreams.utils.data.condense_dreams_knn(graph, thld, embs, logger)
- dreams.utils.data.evaluate_split(df_split, n_workers=5, smiles_col='smiles', fold_col='fold')
Evaluates a data split based on the Tanimoto similarity of Morgan fingerprints between validation and train folds.
- Parameters:
df_split (pd.DataFrame) – DataFrame containing the data split.
n_workers (int) – Number of workers for parallel processing. Default is 5.
smiles_col (str) – Column name for SMILES strings. Default is ‘smiles’.
fold_col (str) – Column name for fold information. Default is ‘fold’.
- Returns:
A dictionary containing maximum Tanimoto similarities for each fold.
- Return type:
dict
- Raises:
ValueError – If the fold column is not found or contains invalid values.
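The quantity reported by such an evaluation can be sketched as follows: for each validation molecule, take the maximum Tanimoto similarity to any training molecule. In this simplified, dependency-free sketch, fingerprints are represented as sets of on-bit indices; the function names `tanimoto` and `max_train_similarity` are hypothetical, and computing real Morgan fingerprints from SMILES (for example with a cheminformatics toolkit) is left out.

```python
from typing import List, Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as on-bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_train_similarity(
    val_fps: Sequence[Set[int]], train_fps: Sequence[Set[int]]
) -> List[float]:
    """For each validation fingerprint, the highest similarity to the train fold."""
    return [max(tanimoto(v, t) for t in train_fps) for v in val_fps]


train_fps = [{1, 2, 3, 4}, {2, 5, 6}]
val_fps = [{1, 2, 3}, {7, 8}]
print(max_train_similarity(val_fps, train_fps))  # [0.75, 0.0]
```

Values close to 1.0 indicate validation molecules that are nearly identical to something in the training fold, which suggests leakage between folds.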
- dreams.utils.data.load_hdf5_in_mem(dct)
- dreams.utils.data.subset_lsh(in_pth: Path | str, out_pth: Path | None = None, lsh_col: str = 'lsh', max_specs_per_lsh: int = 1, random_seed: int = 42)
dreams.utils.dformats module
- class dreams.utils.dformats.DataFormat
Bases:
ABC
Abstract class for DataFormats.
- high_intensity_thld
alias of
NotImplementedError
- lsh_bin_size
alias of
NotImplementedError
- lsh_n_hplanes
alias of
NotImplementedError
- max_charge
alias of
NotImplementedError
- max_ms_level
alias of
NotImplementedError
- max_mz
alias of
NotImplementedError
- max_peaks_n
alias of
NotImplementedError
- max_prec_mz
alias of
NotImplementedError
- max_tbxic_stdev
alias of
NotImplementedError
- min_charge
alias of
NotImplementedError
- min_file_spectra
alias of
NotImplementedError
- min_intensity_ampl
alias of
NotImplementedError
- min_peaks_n
alias of
NotImplementedError
- val_spec(spec: ndarray, prec_mz: float, tbxic_stdev: float | None = None, charge: int | None = None, mslevel: int | None = None, verbose: bool = False, return_problems: bool = False) → bool | str
- class dreams.utils.dformats.DataFormatA
Bases:
DataFormat
- high_intensity_thld: float = 0.1
- max_charge: int = 1
- max_ms_level: int = 2
- max_mz: float = 1000.0
- max_peaks_n: int = 128
- max_prec_mz: float = 1000.0
- max_tbxic_stdev: float = 0.0001
- min_charge: int = 1
- min_file_spectra: int = 3
- min_intensity_ampl: float = 20.0
- min_peaks_n: int = 3
- class dreams.utils.dformats.DataFormatA1
Bases:
DataFormatA
- min_peaks_n: int = 2
- class dreams.utils.dformats.DataFormatA2
Bases:
DataFormatA
- high_intensity_thld: float = 0.05
- min_intensity_ampl: float = 15
- min_peaks_n: int = 2
- class dreams.utils.dformats.DataFormatA3
Bases:
DataFormatA
- high_intensity_thld: float = 0.05
- min_charge: int = -1
- min_intensity_ampl: float = 15
- min_peaks_n: int = 2
- class dreams.utils.dformats.DataFormatB
Bases:
DataFormat
- high_intensity_thld: float = 0.1
- max_charge: int = 1
- max_ms_level: int = 2
- max_mz: float = 1500.0
- max_peaks_n: int = 128
- max_prec_mz: float = 1500.0
- max_tbxic_stdev: float = 0.001
- min_charge: int = 1
- min_file_spectra: int = 3
- min_intensity_ampl: float = 20.0
- min_peaks_n: int = 3
- class dreams.utils.dformats.DataFormatC
Bases:
DataFormat
- high_intensity_thld: float = 0.1
- max_charge: int = 1
- max_ms_level: int = 10
- max_mz: float = 1500.0
- max_peaks_n: int = 128
- max_prec_mz: float = 1500.0
- max_tbxic_stdev: float = 0.001
- min_charge: int = -1
- min_file_spectra: int = 3
- min_intensity_ampl: float = 18.0
- min_peaks_n: int = 3
- dreams.utils.dformats.assign_dformat(spec: ndarray, prec_mz: float, **kwargs) → str
- dreams.utils.dformats.to_A_format(df: DataFrame, filter=True, trimming=False, reset_index=True, verbose=True, add_msn_col=True, filter_block_mask=None)
- dreams.utils.dformats.to_format(df: DataFrame, dformat: DataFormat, filter=True, trimming=False, reset_index=True, verbose=True, add_msn_col=True, filter_block_mask=None)
- Parameters:
df – df with NIST-like columns
filter_block_mask – if not None, a boolean mask with the same length as df. True entries are not filtered.
dreams.utils.io module
- class dreams.utils.io.ChunkedDatasetAccessor(parent, dataset_name)
Bases:
object
- property shape
Return the shape of the dataset.
- class dreams.utils.io.ChunkedHDF5File(file_paths)
Bases:
object
Initialize the ChunkedHDF5File with a list of HDF5 file paths.
- Parameters:
file_paths (list of Path) – Paths to the HDF5 files.
- close()
Close all files.
- keys()
Return a list of dataset names.
- class dreams.utils.io.TqdmToLogger(logger, level=None, mininterval=5)
Bases:
StringIO
- buf = ''
- flush()
Flush write buffers, if applicable.
This is not implemented for read-only and non-blocking streams.
- level = None
- logger = None
- write(buf)
Write string to file.
Returns the number of characters written, which is always equal to the length of the string.
- dreams.utils.io.append_to_stem(pth: Path, s, sep='_')
path/to/file.txt -> path/to/file{sep}{s}.txt
- dreams.utils.io.bytes_to_human_str(size_bytes, decimal_places=2)
- dreams.utils.io.bytes_to_units(size_bytes, unit='MB')
- dreams.utils.io.cache_pkl(pth)
- dreams.utils.io.clean_ftps(ftps: dict, verbose=True)
Cleans the dict of MassIVE ftps (see code comments 1., 2., 3. for details).
- Parameters:
ftps – Dict with ftps as keys and the corresponding file sizes as values.
- dreams.utils.io.compress_hdf(hdf_pth, out_pth=None, compression='gzip', compression_opts=4)
- dreams.utils.io.downloadpublicdata_to_hdf5s(downloads_log: Path, del_in=True, verbose=False, only_msn=False, only_format=None) → None
Convert downloaded LC-MS/MS data (e.g., .mzML or .mzXML) to .hdf5 format.
- Parameters:
downloads_log (Path) – Path to the log file from downloadpublicdata containing information about downloaded files.
del_in (bool, optional) – Whether to delete the input files after conversion. Defaults to True.
verbose (bool, optional) – Whether to print additional information during the conversion. Defaults to False.
- dreams.utils.io.ftp_to_msv_id(ftp)
- dreams.utils.io.lcmsms_to_hdf5(input_path, output_path=None, num_peaks=None, num_prec_peaks=None, store_precursors=True, compress_peaks_lvl=0, compress_full_lvl=0, pwiz_stats=False, del_in=False, assign_dformats=True, log_path=None, verbose=False, only_msn=False, only_format=None)
Convert LC-MS/MS data from an input file (.mzML or .mzXML) to an output file (.hdf5).
- Parameters:
input_path (str) – Path to the input file (.mzML or .mzXML).
output_path (str, optional) – Path to the output file (.hdf5). If not provided, the output is stored as the input file name with .hdf5 extension.
num_peaks (int, optional) – The length to which MSn peak lists are padded with zeros. If not specified, it is set to the maximum number of peaks within the spectra that are to be stored.
num_prec_peaks (int, optional) – The length to which MS1 peak lists are padded with zeros. If not specified, it is set to the maximum number of peaks within the spectra that are to be stored.
store_precursors (bool, optional) – Whether to store the data of precursor spectra (peak list and scan id) for each MSn spectrum as a separate hdf5 dataset. Defaults to True.
compress_peaks_lvl (int, optional) – The compression level for peak lists in the output .hdf5 file. Should be an integer from 0 to 9. Defaults to 0.
compress_full_lvl (int, optional) – The compression level for all stored attributes (e.g. RTs, polarities, etc.) except for peak lists. Should be an integer from 0 to 9. Defaults to 0.
pwiz_stats (bool, optional) – Whether to collect ProteoWizard msconvert statistics, including the histogram of types of spectra converted by msconvert and the number of spectra centroided by msconvert but having zero intensities. Defaults to False.
del_in (bool, optional) – Whether to delete the input .mzML or .mzXML file. Defaults to False.
assign_dformats (bool, optional) – Whether to assign data formats to MSn spectra. Defaults to True.
log_path (str, optional) – Path to the log file containing errors during opening of files and flaws of invalid spectra. If set to None, the log file is stored as the input file name with .hdf5 extension.
verbose (bool, optional) – Whether to log the scan number for each invalid spectrum and log additional statistics. The statistics are redundant in the sense that they can be calculated from the output .hdf5 file, but they are helpful for fast analysis of the input file and for debugging. Defaults to False.
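The zero-padding controlled by `num_peaks` and `num_prec_peaks` can be sketched as below. The function name `pad_peak_lists` is hypothetical, and trimming peak lists that are longer than the target length is an assumption of this sketch, not documented behaviour.

```python
from typing import List, Optional, Sequence, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def pad_peak_lists(
    peak_lists: Sequence[Sequence[Peak]], num_peaks: Optional[int] = None
) -> List[List[Peak]]:
    """Pad every peak list with (0.0, 0.0) entries to a common length."""
    if num_peaks is None:
        # Default: the longest peak list among the spectra to be stored.
        num_peaks = max((len(pl) for pl in peak_lists), default=0)
    padded = []
    for pl in peak_lists:
        kept = list(pl[:num_peaks])  # assumption: longer lists are cut
        padded.append(kept + [(0.0, 0.0)] * (num_peaks - len(kept)))
    return padded


spectra = [[(100.0, 1.0)], [(50.0, 0.5), (60.0, 1.0)]]
print(pad_peak_lists(spectra))
# [[(100.0, 1.0), (0.0, 0.0)], [(50.0, 0.5), (60.0, 1.0)]]
```

Equal-length peak lists can then be stored as a single rectangular array, which is what makes a fixed-shape dataset in an .hdf5 file possible.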
- dreams.utils.io.list_from_txt(txt_pth, sep='\n', apply_lambda=None, progress_bar=False)
- dreams.utils.io.list_to_txt(lst, txt_pth, sep='\n')
- dreams.utils.io.lsh_subset(in_pth, dformat, n_hplanes=None, bin_size=1, max_specs_per_lsh=None, seed=333)
Subset the input .hdf5 file using Locality Sensitive Hashing (LSH) algorithm.
- Parameters:
in_pth (str) – Path to the input file.
dformat (DataFormatBuilder) – Data format builder object.
n_hplanes (int, optional) – Number of hyperplanes for LSH. Defaults to None.
bin_size (float, optional) – Bin size for LSH. Defaults to 1.
max_specs_per_lsh (int, optional) – Maximum number of spectra per LSH. Defaults to None.
seed (int, optional) – Random seed for LSH initialization and selection. Defaults to 333.
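A generic random-hyperplane LSH subsetting scheme, which the parameters above suggest, can be sketched as follows. This is not the code of `lsh_subset`: the helper names, the binning of spectra into fixed-width m/z bins and the per-bucket random selection are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Peaks = Sequence[Tuple[float, float]]  # (m/z, intensity) pairs


def bin_spectrum(peaks: Peaks, max_mz: float, bin_size: float) -> List[float]:
    """Sum intensities into fixed-width m/z bins."""
    n_bins = int(max_mz / bin_size)
    vec = [0.0] * n_bins
    for mz, intens in peaks:
        idx = int(mz / bin_size)
        if 0 <= idx < n_bins:
            vec[idx] += intens
    return vec


def lsh_hash(vec: Sequence[float], hplanes: Sequence[Sequence[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: the sign of the dot product with the vector."""
    return tuple(int(sum(h * v for h, v in zip(plane, vec)) >= 0.0) for plane in hplanes)


def lsh_subset_sketch(
    spectra: Sequence[Peaks],
    max_mz: float = 200.0,
    bin_size: float = 1.0,
    n_hplanes: int = 8,
    max_specs_per_lsh: int = 1,
    seed: int = 333,
) -> List[int]:
    """Return indices of spectra kept after capping each LSH bucket."""
    rng = random.Random(seed)
    n_bins = int(max_mz / bin_size)
    hplanes = [[rng.gauss(0.0, 1.0) for _ in range(n_bins)] for _ in range(n_hplanes)]
    buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for i, peaks in enumerate(spectra):
        buckets[lsh_hash(bin_spectrum(peaks, max_mz, bin_size), hplanes)].append(i)
    kept: List[int] = []
    for members in buckets.values():
        rng.shuffle(members)  # random choice within each bucket
        kept.extend(members[:max_specs_per_lsh])
    return sorted(kept)


spectra = [
    [(50.2, 1.0), (120.7, 0.4)],
    [(50.4, 1.0), (120.9, 0.4)],  # falls into the same bins as the first spectrum
    [(77.0, 0.8), (150.3, 1.0)],
]
kept = lsh_subset_sketch(spectra)
print(kept)  # at most one of the two near-duplicate spectra survives
```

Spectra whose binned vectors point in similar directions tend to share a hash, so capping each bucket removes near-duplicates without any pairwise comparison.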
- dreams.utils.io.merge_lcmsms_hdf5s(in_pths: Path | Iterable[Path], out_pth: Path, dformat: str = 'A', store_acc_est: bool = True, verbose: bool = True, compression: str | None = 'gzip', compression_level: int = 6)
Merge .hdf5 files generated with lcmsms_to_hdf5.
- Parameters:
in_pths – Directory or iterable of .hdf5 files.
out_pth – Output .hdf5 file.
store_acc_est – Whether to store instrument accuracy estimate.
verbose – Verbose output.
compression – Compression method for HDF5 datasets (e.g. ‘gzip’, None).
compression_level – Compression level (0–9) when using gzip.
- dreams.utils.io.merge_ms_hdfs(in_hdf_pths, out_pth, group='MSn data', max_peaks_n=512, del_in=False, show_tqdm=True, logger=None, add_file_name_dataset=True, mzs_dataset='mzs', intensities_dataset='intensities')
TODO: This should be removed after MSData is completely implemented (including the .merge method).
NOTE: currently ignores MS1 data and file-level metadata.
NOTE: assumes identical keys in all files.
TODO: when ms_hdfs are created with process_ms_file, mzs and intensities should be refactored to a single spectrum dataset. Here, args and body should be refactored to reflect this change.
TODO: recursively merge groups?
TODO: max_peaks_n=None
- Parameters:
group (str) – Merge only datasets within the given group.
del_in (bool) – Delete input files after merging.
show_tqdm (bool) – Show tqdm progress bar.
logger (logging.Logger) – Logger to log the progress.
add_file_name_dataset (bool) – Add a new dataset constantly filled with the file names of merged files.
- dreams.utils.io.parse_sirius_ms(spectra_file: str) → Tuple[dict, List[Tuple[str, ndarray]]]
Parses spectra from the SIRIUS .ms file.
Copied from the code of Goldman et al.: https://github.com/samgoldman97/mist/blob/4c23d34fc82425ad5474a53e10b4622dcdbca479/src/mist/utils/parse_utils.py#LL10C77-L10C77.
- Returns:
Metadata and a list of spectra tuples, each containing a name and an array.
- Return type:
Tuple[dict, List[Tuple[str, np.ndarray]]]
- dreams.utils.io.parsed_lcmsms_to_hdf(output_path, file_props, df_msn_data, df_prec_data, logger, num_peaks=None, num_prec_peaks=None, compress_peaks_lvl=3, compress_full_lvl=3, only_msn=False, only_format=None)
- dreams.utils.io.prepend_to_stem(pth: Path, s, sep='_')
path/to/file.txt -> path/to/{s}{sep}file.txt
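The path transformations described for `append_to_stem` and `prepend_to_stem` can be reproduced with `pathlib` alone. The functions below are illustrative re-implementations with hypothetical names, not the library code.

```python
from pathlib import Path


def append_to_stem_sketch(pth: Path, s, sep: str = "_") -> Path:
    """path/to/file.txt -> path/to/file{sep}{s}.txt"""
    return pth.with_name(f"{pth.stem}{sep}{s}{pth.suffix}")


def prepend_to_stem_sketch(pth: Path, s, sep: str = "_") -> Path:
    """path/to/file.txt -> path/to/{s}{sep}file.txt"""
    return pth.with_name(f"{s}{sep}{pth.stem}{pth.suffix}")


p = Path("path/to/file.txt")
print(append_to_stem_sketch(p, "v2").as_posix())    # path/to/file_v2.txt
print(prepend_to_stem_sketch(p, "tmp").as_posix())  # path/to/tmp_file.txt
```

Note that `Path.stem` strips only the last suffix, so for a name such as `archive.tar.gz` this sketch would insert the string before `.gz`.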
- dreams.utils.io.read_json(pth)
- dreams.utils.io.read_json_spec(pth, peaks_key='peaks', prec_mz_key='precursor_mz')
- dreams.utils.io.read_lcmsms(input_path, logger, store_precursors=True, pwiz_stats=False, assign_dformats=True, verbose=False)
- dreams.utils.io.read_mgf(pth, **kwargs)
- dreams.utils.io.read_ms(pth, peaks_tag='>ms2peaks', charge_tag='#Charge', prec_mz_tag='#Precursor_MZ')
- dreams.utils.io.read_msp(pth, **kwargs)
- dreams.utils.io.read_mzml(pth: Path | str, verbose: bool = False, scan_range: Tuple[int, int] | None = None)
- dreams.utils.io.read_pickle(pth)
- dreams.utils.io.read_textual_ms_format(pth, spectrum_end_line, name_value_sep, spectrum_start_line=None, prec_mz_name=['PEPMASS', 'PRECURSORMZ', 'PRECURSOR_MZ'], charge_name=['CHARGE'], adduct_name=['ADDUCT'], smiles_name=['SMILES'], rt_name=['RTINSECONDS', 'RETENTION_TIME', 'RTINMINUTES', 'RT'], ionmode_name=['IONMODE'], feature_id_name=['FEATURE_ID'], scan_number_name=['SCAN_NUMBER'], name_name=['NAME'], ignore_line_prefixes=(), encoding='utf-8')
- dreams.utils.io.sample_hdf(hdf_pth, n_samples, out_pth=None, seed=333, compression='gzip', compression_opts=4)
- dreams.utils.io.save_nist_like_df_to_mgf(df, out_pth: Path, remove_mol_info=False, all_mplush_adducts=False)
- dreams.utils.io.savefig(name, path, extension='pdf')
- dreams.utils.io.setup_logger(log_file_path=None, log_name='log')
- dreams.utils.io.suppress_output()
- dreams.utils.io.wandb_import(project_name, entity_name='roman-bushuiev', tags={}, run_name_suffixes=None)
- dreams.utils.io.write_json(obj, pth)
- dreams.utils.io.write_pickle(obj, pth)
dreams.utils.lcms module
- class dreams.utils.lcms.MSLevelsOrder(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases:
Enum
Enum representing order of spectra in MS file.
Below, l_i denotes the MS level of the i-th spectrum.
- CONSEQUENT_MSN = 6
- EMPTY = 1
- INVALID = 8
- MIXED_MSN = 7
- SINGLE_MS1 = 2
- SINGLE_MSN = 3
- UNIFORM_MS1 = 4
- UNIFORM_MSN = 5
- class dreams.utils.lcms.SpecType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: Enum
Enum representing the type of spectrum (centroid, profile and other corner cases).
- CENTROID = 1
- INVALID = 6
- PROFILE = 2
- SIZE_OF_SPECTRUMTYPE = 5
- THRESHOLDED = 3
- UNKNOWN = 4
- dreams.utils.lcms.estimate_peak_list_type(pl: array, to_int=True, verbose=False)
Reproduced from MZmine. https://github.com/mzmine/mzmine3/blob/master/src/main/java/io/github/mzmine/util/scans/ScanUtils.java#L609
ASSUMES PEAK LIST TO BE SORTED BY M/Z (no check in favor of performance).
- dreams.utils.lcms.get_instrument_props(msdata)
- dreams.utils.lcms.get_order_of_spectra(msdata) MSLevelsOrder
- dreams.utils.lcms.get_pwiz_stats(msdata)
Checks the presence of spectra centroided by ProteoWizard msconvert yet having zero intensities. Outputs the number of such spectra and the histogram of types of spectra converted by msconvert.
- dreams.utils.lcms.get_tight_xics(msdata, mz_tol_1=0.5, mz_tol_2=0.01, intensity_rel_tol=0.1, xic_len_thld=5, n_highest_peaks=3)
A tight XIC at a given m/z is a cut of MS data across the rt dimension, containing the highest peak and all peaks in its neighbourhood with respect to rt. The length of the neighbourhood is defined independently in each direction by the m/z and intensity tolerance parameters. When the algorithm builds an XIC, it starts from a particular peak (xic_mz, xic_in) and consecutively examines the peaks in its neighbourhood, peak by peak. Suppose (prev_mz, prev_in) and (next_mz, next_in) are two peaks compared during the run, where (prev_mz, prev_in) is the current border of the neighbourhood. The neighbourhood is extended to (next_mz, next_in) only if it satisfies two conditions:
abs(next_mz - xic_mz) <= m/z tolerance
next_in <= prev_in * intensity tolerance
- The algorithm performs two traversals across all MS1 spectra:
I. Builds tight XICs for the m/z values of the n_highest_peaks highest peaks of each spectrum, where the m/z tolerance window is “wide” (mz_tol_1).
II. Computes medians of the m/z values across the XICs obtained in step I, which are used to build new tight XICs with a “smaller” m/z tolerance window (mz_tol_2).
- Parameters:
msdata – MS data to build XICs from
mz_tol_1 – absolute width of the m/z tolerance window for traversal I
mz_tol_2 – absolute width of the m/z tolerance window for traversal II
intensity_rel_tol – relative intensity tolerance (the “intensity tolerance” in the second condition above)
xic_len_thld – threshold for the number of peaks in XICs (XICs are filtered after both I and II)
n_highest_peaks – number of highest peaks to choose in traversal I
NOTE: Since such XICs contain all peaks in the neighbourhood, they are referred to as tight XICs.
TODO: improve speed, very slow.
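The neighbourhood-extension rule described for get_tight_xics can be sketched in plain Python. This is only an illustration that takes the two documented conditions literally; the helper name extend_one_direction, its arguments, and the "one candidate peak per consecutive MS1 scan" input layout are assumptions and do not come from dreams.utils.lcms.

```python
def extend_one_direction(peaks, start, step, mz_tol, intensity_rel_tol):
    """Walk from peaks[start] in direction step (+1 or -1) along the rt axis.

    peaks: list of (mz, intensity) tuples, one candidate peak per scan (assumed layout).
    Returns the indices added to the neighbourhood, in the order they were visited.
    """
    xic_mz, prev_in = peaks[start]
    included = []
    i = start + step
    while 0 <= i < len(peaks):
        next_mz, next_in = peaks[i]
        # Condition 1: stay within the m/z tolerance of the starting peak.
        if abs(next_mz - xic_mz) > mz_tol:
            break
        # Condition 2: intensity bound relative to the current border peak.
        if next_in > prev_in * intensity_rel_tol:
            break
        included.append(i)
        prev_in = next_in  # the accepted peak becomes the new border
        i += step
    return included


peaks = [(100.0, 1.0), (100.004, 5.0), (100.001, 10.0), (100.3, 9.0)]
left = extend_one_direction(peaks, start=2, step=-1, mz_tol=0.01, intensity_rel_tol=1.0)
right = extend_one_direction(peaks, start=2, step=+1, mz_tol=0.01, intensity_rel_tol=1.0)
```

In this toy input the walk to the left accepts both peaks, while the walk to the right stops immediately because the m/z of the next peak is outside the tolerance. A tolerance of 1.0 is used here only to make the example easy to follow.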
- dreams.utils.lcms.remove_electromagnetic_spectra(msdata)
- dreams.utils.lcms.sort_by_rt(msdata)
- dreams.utils.lcms.sorted_by_rt(msdata)
- dreams.utils.lcms.standartize_gnps_species(species: Series)
dreams.utils.misc module
- dreams.utils.misc.all_close_pairwise(numbers, eps=0.01)
- dreams.utils.misc.calc_attention_entropy(attention_scores, as_plot=True, save_out_basename: Path = None)
- Parameters:
attention_scores – dict with [0, num_layers] keys and tensor (batch_size, num_heads, seq_len, seq_len) values.
- dreams.utils.misc.chunk_list(lst, chunks_n)
- dreams.utils.misc.chunk_list_eq_sum(lst, chunks_n, val=<function <lambda>>)
Partitions list lst into chunks_n bins with approximately equal sums. If the elements of lst are not numbers, the val function can be specified to extract the desired number from each element.
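One common way to obtain bins with approximately equal sums is the greedy "largest item into the currently lightest bin" heuristic, sketched below. The documentation does not say which strategy chunk_list_eq_sum uses, so this is an assumption for illustration; the name chunk_eq_sum_sketch is hypothetical.

```python
import heapq


def chunk_eq_sum_sketch(lst, chunks_n, val=lambda x: x):
    """Greedy partition of lst into chunks_n bins with roughly equal sums of val(item)."""
    bins = [[] for _ in range(chunks_n)]
    # Min-heap of (current bin sum, bin index); the lightest bin is always on top.
    heap = [(0, i) for i in range(chunks_n)]
    for item in sorted(lst, key=val, reverse=True):
        total, i = heapq.heappop(heap)
        bins[i].append(item)
        heapq.heappush(heap, (total + val(item), i))
    return bins


bins = chunk_eq_sum_sketch([5, 4, 3, 2, 1, 1], chunks_n=2)
# Non-numeric elements: extract the number to balance on with val.
named = chunk_eq_sum_sketch([("a", 3), ("b", 1), ("c", 2)], chunks_n=2, val=lambda t: t[1])
```

The heuristic is not guaranteed to be optimal, but it keeps the bin sums close for typical inputs and runs in O(n log n).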
- dreams.utils.misc.complete_permutation(arr: array)
Returns a copy of arr with shuffled elements such that each element has a different position.
- Parameters:
arr – 1D NumPy array
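A permutation in which no element keeps its position is a derangement. The sketch below produces one by rejection sampling on plain Python lists; it is an assumption about how such a shuffle could be done, not the implementation of complete_permutation, and the name and the seed argument are hypothetical.

```python
import random


def complete_permutation_sketch(arr, seed=None):
    """Return a shuffled copy of arr in which no element stays at its original index."""
    n = len(arr)
    if n == 1:
        raise ValueError("A single element cannot change its position.")
    rng = random.Random(seed)
    idx = list(range(n))
    while True:
        rng.shuffle(idx)
        # Accept the shuffle only if it has no fixed points.
        if all(i != j for i, j in enumerate(idx)):
            return [arr[j] for j in idx]


shuffled = complete_permutation_sketch([10, 20, 30, 40, 50], seed=0)
```

Roughly one in e (about 37%) of random shuffles is a derangement, so the loop is expected to finish after a few attempts.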
- dreams.utils.misc.contains_similar(lst, query_val, epsilon, return_idx=False)
- dreams.utils.misc.download_pretrained_model(model_name: str = 'embedding_model.ckpt', download_dir: Path = PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/dreams-docs/envs/latest/lib/python3.11/site-packages/dreams/models/pretrained'), verbose: bool = True)
Download a pre-trained model from the Hugging Face Hub and return its location on disk.
- Parameters:
model_name (str) – Name of the model to download.
download_dir (Path) – Local directory to download the model to.
verbose (bool) – Whether to print verbose output.
- dreams.utils.misc.gems_hf_download(file_pth: str, local_dir: str | Path | None = None) str
Download a GeMS file from the Hugging Face Hub and return its location on disk.
- Parameters:
file_pth (str) – Name of the file to download.
local_dir (Optional[Union[str, Path]]) – Local directory to download the file to.
- dreams.utils.misc.get_closest_values(lst, query_val, n=1, return_idx=False)
- dreams.utils.misc.hf_download(repo_id: str, file_pth: str, local_dir: str | Path | None = None, repo_type: str = 'dataset') str
Download a file from the Hugging Face Hub and return its location on disk.
- Parameters:
repo_id (str) – Hugging Face repository ID.
file_pth (str) – Name of the file to download.
local_dir (Optional[Union[str, Path]]) – Local directory to download the file to.
repo_type (str) – Type of the repository.
- dreams.utils.misc.interpolate_interval(a, b, n, only_inter=False, rounded=False)
- Parameters:
a – Start point.
b – End point.
n – Num. of steps.
only_inter – Does not return interval ends a and b.
rounded
- dreams.utils.misc.is_float(s)
- dreams.utils.misc.is_sorted(lst)
- dreams.utils.misc.lists_to_legends(lists)
E.g. [[‘a’, ‘b’, ‘c’], [1, 2, 3]] -> [‘a | 1’, ‘b | 2’, ‘c | 3’].
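The example above amounts to zipping the lists and joining the items of each position with " | ". A minimal sketch, assuming that this is all the function does (the name lists_to_legends_sketch is hypothetical):

```python
def lists_to_legends_sketch(lists):
    """Join the i-th items of every list into one ' | '-separated legend string."""
    return [" | ".join(str(item) for item in items) for items in zip(*lists)]


legends = lists_to_legends_sketch([["a", "b", "c"], [1, 2, 3]])
```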
- dreams.utils.misc.merge_stats(stats: Sequence[dict], sets_len=False)
- dreams.utils.misc.networkx_to_dataframe(G: Graph) DataFrame
dreams.utils.mols module
- class dreams.utils.mols.MolPropertyCalculator
Bases: object
- denormalize_prop(prop, prop_name, do_not_add_min=False)
- denormalize_props(props)
- mol_to_props(mol, min_max_norm=False)
- normalize_prop(prop, prop_name)
- normalize_props(props)
- dreams.utils.mols.closest_mz_frags(query_mz, frags, n=1, mass_shift=1, return_masses=False, print_masses=True)
- dreams.utils.mols.disable_rdkit_log()
- dreams.utils.mols.formula_is_carbohydrate(formula)
- dreams.utils.mols.formula_is_halogenated(formula)
- dreams.utils.mols.formula_to_dict(formula)
Transforms chemical formula string to dictionary mapping elements to their frequencies e.g. ‘C15H24’ -> {‘C’: 15, ‘H’: 24}
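For simple formulas without parentheses, charges, or isotope labels, this transformation can be sketched with a regular expression. The code below is an illustration under those assumptions, not the library's parser; the name formula_to_dict_sketch is hypothetical.

```python
import re


def formula_to_dict_sketch(formula):
    """Map each element symbol in a flat formula string to its atom count."""
    counts = {}
    # An element is an uppercase letter plus an optional lowercase letter,
    # followed by an optional count (a missing count means 1).
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


parsed = formula_to_dict_sketch("C15H24")
```

Repeated symbols are summed, so a formula such as C2H5OH yields six hydrogens.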
- dreams.utils.mols.formula_type(f)
- dreams.utils.mols.fp_func_from_str(s)
- Parameters:
s – E.g. “fp_rdkit_2048” or “fp_maccs_166”.
- dreams.utils.mols.generate_fragments(mol: Mol, max_cuts: int = None)
Generates all possible fragments of a molecule up to a certain number of bond cuts or without the restriction if max_cuts is not specified.
- Parameters:
mol – an RDKit molecule object
max_cuts – the maximum number of bonds to cut
- Returns:
A set of RDKit Mol objects representing all possible fragments.
- dreams.utils.mols.generate_spectrum(mol: Mol, prec_mz: float = None, fragments: List = None, max_cuts: int = None)
Generates an MS/MS spectrum by exhaustively simulating the m/z values of theoretical fragments of a given molecule. The algorithm is very simplistic since it considers only subgraph-like fragments, does not consider isotopes, etc.
- Parameters:
mol – An RDKit molecule object.
prec_mz – The m/z value of a molecule. If not specified, it is calculated as the sum of the exact molecular weight of the molecule and 1.
fragments – A list of RDKit Mol objects representing pre-generated fragments of the molecule. If not specified, the function will generate the fragments automatically.
max_cuts – The maximum number of bonds to cut when generating fragments. If not specified, all possible fragments will be generated without any restriction on the number of cuts.
- Returns:
A spectrum represented as a numpy array with two columns: m/z values and their respective intensities.
- dreams.utils.mols.get_mol_mass(mol)
- dreams.utils.mols.maccs_fp(mol, as_numpy=True)
- NOTE: Since indexing of MACCS keys starts from 1, when converting to numpy array with as_numpy, the first element
is removed, so the resulting array has 166 elements instead of 167.
- dreams.utils.mols.mol_to_formula(mol, as_dict=False)
- dreams.utils.mols.mol_to_img_str(mol, svg_size=200)
Supposed to be used with pyvis for showing molecule images as graph nodes.
- dreams.utils.mols.mol_to_inchi14(mol: Mol)
- dreams.utils.mols.morgan_fp(mol, binary=True, fp_size=4096, radius=2, as_numpy=True)
- dreams.utils.mols.morgan_mol_sim(m1, m2, fp_size=4096, radius=2)
- dreams.utils.mols.morgan_smiles_sim(s1, s2, fp_size=4096, radius=2)
- dreams.utils.mols.np_classify(smiles: List[str], progress_bar=True, sleep_each_n_requests=100)
- dreams.utils.mols.np_to_rdkit_fp(fp)
- dreams.utils.mols.rdkit_fp(mol, fp_size=4096)
Default RDKit fingerprint.
- dreams.utils.mols.rdkit_fp_to_np(fp)
- dreams.utils.mols.rdkit_mol_sim(m1, m2, fp_size=4096)
Default RDKit Tanimoto similarity on default RDKit fingerprint.
- dreams.utils.mols.rdkit_smiles_sim(s1, s2, fp_size=4096)
Default RDKit Tanimoto similarity on default RDKit fingerprint.
- dreams.utils.mols.show_mols(mols, legends='new_indices', smiles_in=None, svg=False, sort_by_legend=False, max_mols=500, legend_float_decimals=4, mols_per_row=6, save_pth: Path | None = None)
Returns an image (SVG or PNG, depending on svg) representing a grid of skeletal structures of the given molecules.
- Parameters:
mols – list of rdkit molecules
legends – list of labels for each molecule, length must be equal to the length of mols. Can be ‘new_indices’ for default numbering, ‘masses’ for molecular weights, or a list of custom labels
smiles_in – True - SMILES inputs, False - RDKit mols, None - determine automatically
svg – True - return svg image, False - return png image
sort_by_legend – True - sort molecules by legend values
max_mols – maximum number of molecules to show
legend_float_decimals – number of decimal places to show for float legends
mols_per_row – number of molecules per row to show
save_pth – path to save the .svg image to
- dreams.utils.mols.smiles_to_formula(s, as_dict=False, invalid_mol_smiles='')
- dreams.utils.mols.smiles_to_inchi14(s)
- dreams.utils.mols.tanimoto_sim(fp1, fp2)
Default RDKit Tanimoto similarity.
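For binary fingerprints, the Tanimoto coefficient is the number of bits set in both fingerprints divided by the number of bits set in at least one of them. The sketch below computes it on plain 0/1 sequences; it does not use RDKit, and the name tanimoto_sketch as well as the choice to return 0.0 for two empty fingerprints are assumptions.

```python
def tanimoto_sketch(fp1, fp2):
    """Tanimoto coefficient of two equal-length binary fingerprints (0/1 sequences)."""
    if len(fp1) != len(fp2):
        raise ValueError("Fingerprints must have the same length.")
    both = sum(1 for a, b in zip(fp1, fp2) if a and b)
    either = sum(1 for a, b in zip(fp1, fp2) if a or b)
    # Assumption: two all-zero fingerprints are treated as having similarity 0.0.
    return both / either if either else 0.0


sim = tanimoto_sketch([1, 1, 0, 1], [1, 0, 0, 1])
```

In the usage line, two bits are shared and three bits are set in at least one fingerprint, so the coefficient is 2/3.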
dreams.utils.plots module
- dreams.utils.plots.assign_colors(x)
- dreams.utils.plots.color_generator(n_colors, cmap='plotly')
- dreams.utils.plots.distr_density(values, domain=None, show_mean=True, show_median=False, title=None)
- dreams.utils.plots.get_nature_hex_colors(extended=True)
- dreams.utils.plots.get_palette(cmap='plotly', reversed_order=False, as_hex=False)
- dreams.utils.plots.init_plotting(figsize=(6, 2), font_scale=0.95, style='whitegrid', cmap='plotly', font=None, legend_outside=False)
- dreams.utils.plots.pie_chart(values, other_percent_thld='auto', title=None, figsize=(6, 6))
- dreams.utils.plots.plot_nx_graph(G: Graph, node_attrs: list = [], special_node: int = None, special_nodes: list = [], pos: dict = None, node_color_attr: str = None, node_size: int = 10, edge_color: str = 'black', edge_width: int = 2, title: str = None, html_pth: Path | str = None) None
Plots a NetworkX graph using Plotly, with options to customize node attributes and highlight special nodes.
- Parameters:
G (nx.Graph) – The NetworkX graph to be plotted.
node_attrs (list) – List of node attributes to be displayed in hover text.
special_node (int) – Node to be highlighted with a star symbol and larger size.
special_nodes (list) – List of nodes to be highlighted with a triangle symbol.
pos (dict) – Dictionary specifying the positions of nodes. If None, a spring layout will be computed.
node_color_attr (str) – Node attribute used to determine node colors.
node_size (int) – Size of the nodes.
edge_color (str) – Color of the edges.
edge_width (int) – Width of the edges.
title (str) – Title of the plot.
- dreams.utils.plots.rgb_to_hex(r, g, b)
- dreams.utils.plots.save_fig(name, dir=PosixPath('/home/docs/checkouts/readthedocs.org/user_builds/dreams-docs/envs/latest/lib/python3.11/site-packages/misc/figures'), dpi=None, transparent=True)
dreams.utils.spectra module
- class dreams.utils.spectra.MSnSpectrum(peak_list, precursor_mol=None, precursor_mz=None, precursor_charge=None, ionization_mode=None, collision_energy=None, assert_is_valid=True)
Bases: object
- get_collision_energy()
- get_intensities()
- get_ionization_mode()
- get_mzs()
- get_peak_list()
- get_peaks_n()
- get_precursor_charge()
- get_precursor_formula(to_dict=False)
- get_precursor_mass()
- get_precursor_mol()
- get_precursor_mz()
- class dreams.utils.spectra.PeakListModifiedCosine(mz_tolerance: float = 0.05, unpad: bool = True)
Bases: object
- compute(spec1: ndarray, spec2: ndarray, prec_mz1: float, prec_mz2: float) float
- compute_pairwise(specs: ndarray, prec_mzs: ndarray, avg=False) ndarray | float
- dreams.utils.spectra.bin_peak_list(peak_list: array, max_mz: float, bin_step: float) array
- dreams.utils.spectra.bin_peak_lists(peak_lists: array, max_mz: float, bin_step: float) array
- dreams.utils.spectra.df_to_MSnSpectra(df, assert_is_valid=True, as_new_column=False)
Processes NIST-like DataFrame to Series of MSnSpectra. # TODO: include more columns.
- dreams.utils.spectra.from_hot(hots: Tensor, bin_size: float, dtype=torch.float64) Tensor
Makes the last dimension singleton.
- dreams.utils.spectra.from_hot_logits(vals: Tensor, bin_size: float) Tensor
- dreams.utils.spectra.get_base_peak(peak_list: array, return_i=False)
- dreams.utils.spectra.get_closest_mz_peak(peak_list: array, query_mz)
- dreams.utils.spectra.get_closest_mz_peaks(peak_list: array, query_mz, n)
Returns list of pairs (mz, intensity) of length n containing peaks having m/z values closest to the query_mz sorted ascending by the difference.
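The selection described above can be sketched by sorting the peaks by their absolute m/z difference from the query. The code uses two plain lists instead of a NumPy peak list; the name closest_mz_peaks_sketch and this input layout are assumptions made for illustration.

```python
def closest_mz_peaks_sketch(mzs, intensities, query_mz, n):
    """Return the n (mz, intensity) pairs whose m/z is closest to query_mz."""
    pairs = sorted(zip(mzs, intensities), key=lambda peak: abs(peak[0] - query_mz))
    return pairs[:n]


closest = closest_mz_peaks_sketch(
    mzs=[50.0, 75.0, 100.0, 101.0],
    intensities=[0.1, 0.5, 1.0, 0.2],
    query_mz=99.0,
    n=2,
)
```

The result is ordered by increasing distance from query_mz, so the peak at 100.0 comes before the peak at 101.0.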
- dreams.utils.spectra.get_highest_peaks(peak_list: array, n)
Returns n highest peaks.
- dreams.utils.spectra.get_num_peaks(peak_list)
- dreams.utils.spectra.get_peak_intens_nbhd(peak_list, peak_i, intens_thld, intens_thld_below=True)
Returns indices determining the range of the neighbourhood around the peak at peak_i. The neighbourhood is defined as all consecutive peaks above (or below if intens_thld_below=False) the intens_thld intensity.
- dreams.utils.spectra.has_peak_at(peak_list: array, query_mz, epsilon)
- dreams.utils.spectra.intens_amplitude(peak_list)
- dreams.utils.spectra.is_valid_peak_list(peak_list: array, relative_intensities=True, verbose=None, return_problems_list=False)
Returns True if the peak list is valid (the numbers of m/z and intensity values are equal, m/z values are sorted in ascending order, etc.), else False. TODO: consider padded spectra.
- Parameters:
peak_list – np.array of shape (2, n), where n is a number of peaks.
relative_intensities – if True, performs additional checks for the intensities to be relative.
return_problems_list – if True, list of strings describing problems (invalid causes) will be returned (e.g. [‘#mzs != #intensities’, ‘Exists m/z < 0.0’]).
verbose – for ‘problems’ the reasons why peak list is not valid will be printed, for ‘problems_and_peak_list’, the peak list will be printed as well.
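A few of the checks named above can be sketched as a function that collects problem strings, similar in spirit to return_problems_list=True. Only the two messages quoted in the parameter description come from the documentation; the third message, the function name, and the two-list input layout are assumptions, and the library's actual checks may differ.

```python
def peak_list_problems_sketch(mzs, intensities):
    """Collect human-readable reasons why a peak list would be considered invalid."""
    problems = []
    if len(mzs) != len(intensities):
        problems.append("#mzs != #intensities")
    if any(mz < 0.0 for mz in mzs):
        problems.append("Exists m/z < 0.0")
    # Compare each m/z with its successor to detect unsorted values.
    if any(a > b for a, b in zip(mzs, mzs[1:])):
        problems.append("m/z values are not sorted")  # hypothetical message
    return problems


ok = peak_list_problems_sketch([50.0, 60.0], [0.3, 1.0])
bad = peak_list_problems_sketch([60.0, -1.0], [0.3])
```

An empty list of problems corresponds to a valid peak list under these three checks.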
- dreams.utils.spectra.max_mz(peak_list)
- dreams.utils.spectra.merge_peak_lists(peak_lists: List[array], eps=0.01, n_highest_peaks=None) array
Merges peak lists without creating new “artificial” m/z values. The algorithm traverses all peaks (from all spectra) in descending order of intensity and creates merged peaks by summing the intensities of all peaks in the range m/z ± eps. Each peak is used exactly once (either as the one determining the range or as one belonging to the range), and the final peaks are not transitively connected. Note that the complexity is O(n^2), where n is the total number of peaks across all spectra.
- Parameters:
peak_lists – List of NumPy arrays of shape (2, num. of peaks).
eps – Epsilon determining the range of m/z values of peaks which are aggregated.
n_highest_peaks – If not None, the n_highest_peaks highest peaks are selected from each peak list.
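The merging procedure described for merge_peak_lists can be sketched in plain Python as follows. This is an illustration of the description, not the library code: peak lists are given as (mzs, intensities) pairs of lists rather than NumPy arrays, the n_highest_peaks option is omitted, and returning the merged peaks sorted by m/z is an assumption.

```python
def merge_peaks_sketch(peak_lists, eps=0.01):
    """Merge peaks from several peak lists by summing intensities within m/z +- eps."""
    # Pool all peaks and visit them from the most to the least intense.
    peaks = sorted(
        ((mz, inten) for mzs, intens in peak_lists for mz, inten in zip(mzs, intens)),
        key=lambda peak: peak[1],
        reverse=True,
    )
    used = [False] * len(peaks)
    merged = []
    for i, (mz, inten) in enumerate(peaks):
        if used[i]:
            continue
        used[i] = True
        total = inten
        # Absorb every unused peak within eps of this peak's m/z (O(n^2) overall).
        for j in range(i + 1, len(peaks)):
            if not used[j] and abs(peaks[j][0] - mz) <= eps:
                used[j] = True
                total += peaks[j][1]
        merged.append((mz, total))  # keeps an existing m/z, no "artificial" value
    merged.sort()
    return [mz for mz, _ in merged], [inten for _, inten in merged]


mzs, intens = merge_peaks_sketch(
    [([100.0, 200.0], [1.0, 0.5]), ([100.005, 150.0], [0.4, 0.3])]
)
```

In the usage example the peaks at 100.0 and 100.005 fall within eps of each other, so they are merged into a single peak at 100.0 whose intensity is the sum of both.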
- dreams.utils.spectra.normalize_mzs(peak_list: array, max_mz: float, in_place=True, high=False)
- dreams.utils.spectra.num_high_peaks(peak_list, high_intensity_thld)
- dreams.utils.spectra.num_hot_classes(max_val: float, bin_size: float) int
- dreams.utils.spectra.pad_peak_list(pl: ndarray, target_len: int, pad_val: float = 0, axis: int = -1) ndarray
Pads peak list to the target_len with pad_val or performs this for a batch of peak lists.
- Parameters:
pl – Peak list of shape (2, num_peaks) or a batch of peak lists of shape (batch_size, 2, num_peaks).
target_len – Target num. of peaks of the peak list.
pad_val – Value used for padding.
axis – Axis along which the padding is performed.
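For a single peak list of shape (2, num_peaks), padding means appending pad_val to both the m/z row and the intensity row until each has target_len entries. The sketch below shows this on nested Python lists; it does not cover the batched case or the axis argument, and the name pad_peak_list_sketch is hypothetical.

```python
def pad_peak_list_sketch(pl, target_len, pad_val=0.0):
    """Right-pad each row of a (2, num_peaks) nested list with pad_val up to target_len."""
    # A negative repeat count yields an empty list, so longer rows are left unchanged.
    return [row + [pad_val] * (target_len - len(row)) for row in pl]


padded = pad_peak_list_sketch([[50.0, 60.0], [0.3, 1.0]], target_len=4)
```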
- dreams.utils.spectra.parse_raw_peak_list(peak_list: str)
Parses peak list string into numpy arrays of m/z and intensity values e.g. ‘53.0379 0.894101\n54.0335 0.661867 ‘ -> ([53.0379, 54.0335], [0.894101, 0.661867])
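The example above suggests one peak per line, with m/z and intensity separated by whitespace. A minimal sketch of such a parser follows; it returns plain lists instead of NumPy arrays, and the name parse_raw_peak_list_sketch is hypothetical.

```python
def parse_raw_peak_list_sketch(peak_list):
    """Parse 'mz intensity' lines into two parallel lists of floats."""
    mzs, intensities = [], []
    for line in peak_list.splitlines():
        if not line.strip():
            continue  # skip blank lines and trailing whitespace
        mz, intensity = line.split()
        mzs.append(float(mz))
        intensities.append(float(intensity))
    return mzs, intensities


mzs, intensities = parse_raw_peak_list_sketch("53.0379 0.894101\n54.0335 0.661867 ")
```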
- dreams.utils.spectra.plot_spectrum(spec, hue=None, xlim=None, ylim=None, mirror_spec=None, highl_idx=None, high_peaks_at=None, figsize=(6, 2), colors=None, save_pth=None, prec_mz=None, mirror_prec_mz=None, normalize_intensities=True, spec_text=None, mirror_spec_text=None)
TODO: Whole function should be refactored, it is a mess. Plots a mass spectrum with optional mirror spectrum and highlighted peaks.
- Parameters:
spec – The spectrum to be plotted.
hue – Optional values to color the peaks.
xlim – X-axis limits.
ylim – Y-axis limits.
mirror_spec – Optional mirror spectrum to be plotted.
highl_idx – Indices of peaks to be highlighted.
high_peaks_at – M/z values of peaks to be highlighted.
figsize – Figure size.
colors – Colors for the plot.
save_pth – Path to save the plot.
prec_mz – Precursor m/z value to display.
mirror_prec_mz – Precursor m/z value of the mirror spectrum to display.
spec_text – Text to display on the spectrum.
mirror_spec_text – Text to display on the mirror spectrum.
- dreams.utils.spectra.prepend_precursor_peak(peak_list: array, prec_mz, prec_in=1.1, high=False)
- dreams.utils.spectra.process_peak_list(peak_list, n_highest=None, sort_mzs=False, to_rel_intens=False)
- dreams.utils.spectra.to_classes(vals: Tensor, max_val: float, bin_size: float, special_vals: List[float] = (), return_num_classes: bool = False) Tensor
Assumes that the last dimension of vals is singleton.
- dreams.utils.spectra.to_hot(vals: Tensor, max_val: float, bin_size: float, dtype=torch.float64)
Assumes that the last dimension of vals is singleton.
- dreams.utils.spectra.to_rel_intensity(peak_list: array, scale_factor=None)
- dreams.utils.spectra.trim_peak_list(peak_list: array, n_highest: int)
Trims peak list by selecting n_highest highest peaks or performs this for a batch of peak lists.
- Parameters:
peak_list – np.array of shape (2, num_peaks) or (num_spectra, 2, num_peaks).
n_highest – Number of highest peaks to be selected.
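Selecting the n_highest most intense peaks of a single peak list can be sketched as below. The code works on two plain lists and keeps the selected peaks in their original m/z order; that ordering, the input layout, and the name trim_peak_list_sketch are assumptions, since the documentation does not state them.

```python
def trim_peak_list_sketch(mzs, intensities, n_highest):
    """Keep the n_highest most intense peaks, preserving their original order."""
    by_intensity = sorted(range(len(mzs)), key=lambda i: intensities[i], reverse=True)
    keep = sorted(by_intensity[:n_highest])  # restore the original peak order
    return [mzs[i] for i in keep], [intensities[i] for i in keep]


trimmed = trim_peak_list_sketch([50.0, 60.0, 70.0, 80.0], [0.2, 1.0, 0.1, 0.5], n_highest=2)
```

In the usage line the peaks at 60.0 and 80.0 are kept because they have the two highest intensities.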
- dreams.utils.spectra.unpad_peak_list(peak_list: array, pad_val=0.0)