DreaMS Atlas

The DreaMS Atlas is a large-scale molecular network containing 201 million MS/MS spectra from the MassIVE GNPS repository, constructed using DreaMS embeddings. Each node in the network corresponds to a mass spectrum derived from a specific biological or environmental sample (e.g., human skin or blood, plant extracts, marine environments, food, and many others). Each edge represents a DreaMS similarity and links a node to one of its three nearest neighbors across the entire MassIVE GNPS repository. This tutorial demonstrates several ways to explore and analyze the DreaMS Atlas through its Python API.
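To make the "three nearest neighbors" construction concrete, here is a minimal pure-Python sketch of a k-NN graph built from toy vectors. It assumes that the similarity between two embeddings is their cosine similarity; the function names `cosine` and `knn_edges` and the 2-D vectors are hypothetical and are not part of the DreaMS package.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_edges(embeddings, k=3):
    """Map each node to its k most similar other nodes as (neighbor, similarity)."""
    edges = {}
    for i, u in enumerate(embeddings):
        sims = [(j, cosine(u, v)) for j, v in enumerate(embeddings) if j != i]
        sims.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        edges[i] = sims[:k]
    return edges

# Hypothetical 2-D "embeddings"; real embeddings are much longer vectors.
toy_embeddings = [
    (1.0, 0.0),
    (0.9, 0.1),
    (0.8, 0.3),
    (0.0, 1.0),
    (0.1, 0.9),
]
toy_edges = knn_edges(toy_embeddings, k=3)
for node, nbrs in toy_edges.items():
    print(node, [(j, round(s, 3)) for j, s in nbrs])
```

This brute-force version compares every pair of vectors, which is only practical for toy inputs; it is meant to show what an edge means, not how the Atlas was actually built.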


Initialization

Import all the necessary packages.

[1]:
import networkx as nx
from rdkit import Chem
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import dreams.utils.plots as plots
import dreams.utils.spectra as su
from dreams.api import DreaMSAtlas
from dreams.utils.misc import networkx_to_dataframe
from dreams.definitions import *
%reload_ext autoreload
%autoreload 2

Initialize the DreaMS Atlas. Please note that the first initialization downloads over 400 GB of data files. Once downloaded, the files are accessed directly from disk rather than loaded into memory, so working with the Atlas does not require a machine with a large amount of RAM.

ℹ️ In future updates, we plan to develop a web server that will allow access to the DreaMS Atlas from a remote server, removing the need to host all the data locally. Future releases will also include the ability to extend the Atlas with new nodes and to query the Atlas by DreaMS similarity using a spectrum of interest.

[2]:
atlas = DreaMSAtlas()
Initializing DreaMS Atlas data structures...
Loaded spectral library (79,300 spectra).
Loaded GeMS-C1 dataset (75,520,646 spectra).
Loaded DreaMS Atlas edges (134,524,452 edges).
Loaded DreaMS Atlas nodes representing DreaMS k-NN clusters of GeMS-C1 (33,631,113 nodes).
Loaded LSH clusters of DreaMS Atlas nodes representing GeMS-C (201,223,336 spectra).

Accessing data from the Atlas

Let’s pick one of the 76 million spectra in the GeMS-C1 dataset.

[27]:
i = 37552437
atlas.get_data(i, plot=True, return_spec=False, msv_metadata=True)
../_images/tutorials_atlas_9_0.png
[27]:
{37552437: {'DreaMS_embedding': array([-0.8768827 , -0.41617227,  0.02967745, ..., -1.2626797 ,
          1.6408856 , -0.64122283], dtype=float32),
  'RT': 550.4616,
  'charge': 1,
  'instrument accuracy est.': 0.00013683233,
  'lsh': -450688754114588762,
  'name': '20160906_pgk965_SloanSurfaceProject_Metabolomics_2-21',
  'precursor_mz': 404.1232,
  'msv_id': 'MSV000086209',
  'msv_species': 'NCBITaxon:2;NCBITaxon:4751',
  'msv_species_resolved': 'Bacteria (NCBITaxon:2)',
  'msv_instrument': nan,
  'msv_instrument_resolved': nan,
  'msv_title': 'GNPS - Microbial and metabolic succession on common building materials under high humidity conditions',
  'msv_description': 'Despite considerable efforts to characterize the microbial ecology of the built environment, the metabolic mechanisms underpinning microbial colonization and successional dynamics remain unclear, particularly at high moisture conditions. Here, we applied bacterial/viral particle counting, qPCR, amplicon sequencing of the genes encoding 16S and ITS rRNA, and metabolomics to longitudinally characterize the ecological dynamics of four common building materials maintained at high humidity. We varied the natural inoculum provided to each material and wet half of the samples to simulate a potable water leak. Wetted materials had higher growth rates and lower alpha diversity compared to non-wetted materials, and wetting described the majority of the variance in bacterial, fungal, and metabolite structure. Inoculation location was weakly associated with bacterial and fungal beta diversity. Material type influenced bacterial and viral particle abundance and bacterial and metabolic (but not fungal) diversity. Metabolites indicative of microbial activity were identified, and they too differed by material.',
  'msv_create_time': '2020-09-29 08:10:16.0',
  'msv_user': nan,
  'msv_keywords': nan}}

The displayed spectrum represents a single node in the Atlas. In addition to mass spectrometry attributes such as MS/MS peaks, precursor m/z, and retention time, the spectrum is associated with a DreaMS embedding and MassIVE GNPS metadata, which includes, for example, the species studied or the study description.
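The returned value is a dictionary keyed by node index, so its metadata fields can be processed with ordinary Python. The sketch below uses a hand-copied subset of the record printed above; the helper `taxon_ids` is hypothetical and not part of the DreaMS API. It assumes that missing metadata appears as NaN (a float), as in the output above.

```python
# Hand-copied subset of the record shown above (embedding and long text omitted).
record = {
    37552437: {
        'RT': 550.4616,
        'charge': 1,
        'precursor_mz': 404.1232,
        'msv_id': 'MSV000086209',
        'msv_species': 'NCBITaxon:2;NCBITaxon:4751',
        'msv_instrument': float('nan'),
    }
}

def taxon_ids(species_field):
    """Split a semicolon-separated 'NCBITaxon:<id>' string into integer IDs."""
    if not isinstance(species_field, str):  # NaN marks a missing value
        return []
    return [int(item.split(':')[1]) for item in species_field.split(';') if item]

for node_id, attrs in record.items():
    print(node_id, attrs['msv_id'], taxon_ids(attrs['msv_species']))
```

Checking the type before splitting keeps the helper safe for fields such as `msv_instrument`, which is NaN for this node.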

By construction, each node of the DreaMS Atlas represents a cluster of MS/MS spectra obtained in two stages: spectra are first grouped by LSH, and the LSH cluster representatives are then grouped by DreaMS similarity. Let us explore the cluster corresponding to the selected node.

[28]:
dreams_cluster = atlas.get_node_cluster(i, lsh=True)
print(f'Node {i} represents a cluster of {len(dreams_cluster)} spectra with high DreaMS similarity.')
for spec_i, lsh_cluster in dreams_cluster.items():
    print(f'Spectrum with index {spec_i} further represents an LSH cluster of {len(lsh_cluster)} spectra.')
    print('Showing first spectrum:')
    su.plot_spectrum(lsh_cluster[0][SPECTRUM], prec_mz=lsh_cluster[0][PRECURSOR_MZ], figsize=(3, 1.2))
Node 37552437 represents a cluster of 6 spectra with high DreaMS similarity.
Spectrum with index 32730250 further represents an LSH cluster of 1 spectra.
Showing first spectrum:
../_images/tutorials_atlas_11_1.png
Spectrum with index 32730265 further represents an LSH cluster of 2 spectra.
Showing first spectrum:
../_images/tutorials_atlas_11_3.png
Spectrum with index 32730269 further represents an LSH cluster of 42 spectra.
Showing first spectrum:
../_images/tutorials_atlas_11_5.png
Spectrum with index 37552435 further represents an LSH cluster of 3 spectra.
Showing first spectrum:
../_images/tutorials_atlas_11_7.png
Spectrum with index 37552437 further represents an LSH cluster of 5 spectra.
Showing first spectrum:
../_images/tutorials_atlas_11_9.png
Spectrum with index 37552438 further represents an LSH cluster of 2 spectra.
Showing first spectrum:
../_images/tutorials_atlas_11_11.png
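Because the result is a dictionary that maps each spectrum index to the list of spectra in its LSH cluster, summary statistics follow from plain Python. The sketch below rebuilds a stand-in with the cluster sizes printed above (placeholders instead of real spectra) and counts how many raw spectra the node represents in total; the names `lsh_sizes` and `toy_cluster` are hypothetical.

```python
# Stand-in for the dictionary returned above: keys are spectrum indices,
# values are lists with one placeholder per spectrum in the LSH cluster.
lsh_sizes = {
    32730250: 1,
    32730265: 2,
    32730269: 42,
    37552435: 3,
    37552437: 5,
    37552438: 2,
}
toy_cluster = {idx: [None] * n for idx, n in lsh_sizes.items()}

# Total number of raw spectra summarized by this single Atlas node
total_spectra = sum(len(members) for members in toy_cluster.values())
# Index of the spectrum that represents the largest LSH cluster
largest_idx = max(toy_cluster, key=lambda idx: len(toy_cluster[idx]))

print(total_spectra)  # 55
print(largest_idx)    # 32730269
```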

Exploring local structure of the Atlas

The DreaMS Atlas API allows for the visualization of a neighborhood of a given node as an interactive graph.

[1]:
i = 37542683
g_nbhd = atlas.get_neighbors(i, n_hops=3, msv_metadata=True)

plots.plot_nx_graph(
    g_nbhd,
    node_attrs=[PRECURSOR_MZ, NAME, SMILES, 'msv_id'],
    node_color_attr='msv_id',
    special_node=i,
    special_nodes=[n[0] for n in g_nbhd.nodes(data=True) if SMILES in n[1]],
    node_size=12
)
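The `n_hops` argument bounds how far the neighborhood extends from the query node. How `get_neighbors` traverses the graph internally is not shown in this tutorial; the following sketch only illustrates the general idea of an n-hop neighborhood with a breadth-first search over a toy adjacency dictionary. The function `n_hop_neighborhood` and the toy graph are hypothetical.

```python
from collections import deque

def n_hop_neighborhood(adjacency, start, n_hops):
    """Return {node: hop distance} for all nodes within n_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == n_hops:
            continue  # do not expand beyond the hop limit
        for nbr in adjacency.get(node, ()):
            if nbr not in distance:  # first visit gives the shortest distance
                distance[nbr] = distance[node] + 1
                queue.append(nbr)
    return distance

# Toy graph: each key lists the nodes it points to.
toy_adjacency = {
    'a': ['b', 'c'],
    'b': ['d'],
    'c': ['d'],
    'd': ['e'],
    'e': ['f'],
    'f': [],
}
hops = n_hop_neighborhood(toy_adjacency, 'a', n_hops=3)
print(hops)  # {'a': 0, 'b': 1, 'c': 1, 'd': 2, 'e': 3}
```

Node `f` is four hops away from `a`, so it is excluded. In a graph where every node links to three neighbors, the neighborhood can grow quickly with each extra hop, which is why small values of `n_hops` are convenient for plotting.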


Let’s examine the similarities between three spectra from the neighborhood and our query spectrum of interest. Note that one of the nodes represents an entry from the MoNA spectral library and is therefore annotated with a molecular structure.

[37]:
for n in list(g_nbhd.nodes(data=True))[-3:]:
    print('Node', n[0])
    su.plot_spectrum(spec=n[1][SPECTRUM], mirror_spec=g_nbhd.nodes[i][SPECTRUM], prec_mz=n[1][PRECURSOR_MZ], mirror_prec_mz=g_nbhd.nodes[i][PRECURSOR_MZ])
    if SMILES in n[1]:
        display(Chem.MolFromSmiles(n[1][SMILES]))
Node 37496414
../_images/tutorials_atlas_17_1.png
Node 34207
../_images/tutorials_atlas_17_3.png
../_images/tutorials_atlas_17_4.png
Node 37552437
../_images/tutorials_atlas_17_6.png

One can also explore the neighborhood as a pandas data frame.

[47]:
networkx_to_dataframe(g_nbhd).head()
[47]:
node_id RT msv_keywords msv_instrument_resolved msv_create_time name msv_species_resolved id DreaMS_embedding msv_instrument ... msv_species charge lsh msv_description spectrum instrument accuracy est. msv_id msv_title neighbors edge_weight
0 37542683 309.739990 Skin maXis 2018-07-31 14:41:27.0 BD4_V14_S2-015_arm_psoriasis_BD4_01_14378 Homo sapiens None [-0.10430071, 0.32463387, 1.2688519, 1.7766136... MS:1001541 ... NCBITaxon:9606 0 -4.529384e+17 GNPS - Skin psoriasis molecular cartography st... [[106.06563568115234, 106.07821655273438, 119.... 0.000712 MSV000082674 GNPS - Skin psoriasis molecular cartography st... [37542661, 37542656, 37648590] [0.8408746152701184, 0.829488577059148, 0.7978...
1 37542661 306.300995 Facial cleanser, skin, temporal maXis 2018-06-01 15:00:35.0 5A9_%20V1_%20Nose_%20D14H0_%20W7_BA9_01_11583 Homo sapiens None [0.19762689, 0.17532083, 1.2078881, 1.549347, ... MS:1001541 ... NCBITaxon:9606 0 -4.529386e+17 MS/MS spectra were collected from face of 6 in... [[121.03773498535156, 151.0904998779297, 156.0... 0.000614 MSV000082432 GNPS - Colgate facial cleanser longitudinal st... [37542653, 32640049, 37542632] [0.8550434399274379, 0.8537323876318667, 0.851...
2 37542656 309.917999 Skin maXis 2018-07-31 14:41:27.0 BD3_V14_S2-014_arm_healthy_BD3_01_14337 Homo sapiens None [0.38015515, -0.5561968, 1.060871, 0.93926096,... MS:1001541 ... NCBITaxon:9606 0 -4.529386e+17 GNPS - Skin psoriasis molecular cartography st... [[119.0505599975586, 134.0594482421875, 145.02... 0.000678 MSV000082674 GNPS - Skin psoriasis molecular cartography st... [37542683, 37542627, 37542661] [0.829488577059148, 0.7041841665583883, 0.6607...
3 37648590 541.183533 NaN NaN 2020-09-29 08:10:16.0 20160906_pgk965_SloanSurfaceProject_Metabolomi... Bacteria (NCBITaxon:2) None [-0.18013018, 0.5833932, 1.3610793, 2.4781635,... NaN ... NCBITaxon:2;NCBITaxon:4751 1 -4.326744e+17 Despite considerable efforts to characterize t... [[89.41886901855469, 89.42453002929688, 96.847... 0.000139 MSV000086209 GNPS - Microbial and metabolic succession on c... [32730262, 37542653, 32640049] [0.8631632498896021, 0.8612309271933151, 0.842...
4 37542653 349.489990 citrus maXis 4G 2020-05-15 11:47:12.0 Plate_2_1_820_RE7_01_41664 Citrus sinensis (NCBITaxon:2711) None [0.024766939, -0.28584233, 1.1211307, 1.363762... MS:1002279 ... NCBITaxon:2711 0 -4.529386e+17 Leaf tissues of orange trees extracted with et... [[130.02940368652344, 133.05169677734375, 134.... 0.000519 MSV000085416 GNPS_UCR_citrus_survivor_study_orchard_samples... [32640049, 37496366, 37542658] [0.8816033077004346, 0.8651546656827798, 0.864...

5 rows × 24 columns

Exploring global structure of the Atlas

The global structure of the DreaMS Atlas can be analyzed in two ways. First, one can efficiently analyze the graph by accessing its adjacency matrix, which is stored as a sparse array. For example, let’s plot the distribution of the DreaMS similarities representing the graph edges.

[10]:
A = atlas.csrknn.csr
print('Adjacency matrix shape:', A.shape)
edges = np.asarray(A[A > 0]).ravel()

print('Number of edges:', len(edges))
sns.histplot(edges)
plt.title('Distribution of DreaMS similarities')
plt.show()
Adjacency matrix shape: (33631113, 33631113)
Number of edges: 101104558
../_images/tutorials_atlas_21_1.png
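For readers unfamiliar with sparse storage, the sketch below shows the compressed sparse row (CSR) layout in pure Python on a toy four-node graph. It assumes that the Atlas adjacency matrix follows the usual CSR convention of three arrays (row pointers, column indices, values), which is also how `scipy.sparse.csr_matrix` stores its `indptr`, `indices`, and `data`; the values and the helper `row_neighbors` are hypothetical.

```python
# Toy similarity graph with 4 nodes in CSR form (hypothetical values).
# The entries of row r live in positions indptr[r]:indptr[r + 1].
indptr = [0, 2, 3, 5, 6]
indices = [1, 2, 0, 0, 3, 2]            # column (neighbor) of each stored entry
data = [0.9, 0.7, 0.9, 0.7, 0.6, 0.6]   # similarity of each stored entry

def row_neighbors(row):
    """Return (neighbor, similarity) pairs stored in one row, without densifying."""
    start, end = indptr[row], indptr[row + 1]
    return list(zip(indices[start:end], data[start:end]))

print(row_neighbors(2))  # [(0, 0.7), (3, 0.6)]

# Number of stored (non-zero) entries in the whole matrix
n_stored = len(data)
print(n_stored)  # 6
```

In this toy matrix every undirected edge is stored twice, once per direction, so 6 stored entries correspond to 3 distinct node pairs. Whether the same holds for the Atlas matrix is not stated in this tutorial.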

Second, one can work with the Atlas as an igraph graph object, leveraging the many graph analysis methods implemented within the igraph package. For example, let’s plot the distribution of node degrees across the entire Atlas.

[38]:
G = atlas.csrknn.to_graph()
Retrieving graph edges: 100%|██████████| 33631113/33631113 [03:09<00:00, 177295.81it/s]
[48]:
degrees = G.degree()
sns.histplot(degrees, log_scale=True, shrink=10)
plt.title('Distribution of node degrees')
plt.show()
../_images/tutorials_atlas_24_0.png
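Node degrees vary in a k-NN graph even though every node selects the same number of neighbors, because a node can also be selected by many other nodes. The sketch below computes degrees from a toy edge list using only the standard library. Treating edges as undirected and merging reciprocal pairs is an assumption made for this illustration and may differ from how `to_graph()` builds the igraph object.

```python
from collections import Counter

# Hypothetical directed k-NN edges (source, target) with k = 2
toy_knn_edges = [
    (0, 1), (0, 2),
    (1, 0), (1, 2),
    (2, 0), (2, 1),
    (3, 0), (3, 2),
]

# Treat edges as undirected and merge reciprocal pairs such as (0, 1) and (1, 0)
undirected = {tuple(sorted(edge)) for edge in toy_knn_edges}

degree = Counter()
for u, v in undirected:
    degree[u] += 1
    degree[v] += 1

# Histogram: how many nodes have each degree
degree_hist = Counter(degree.values())
print(sorted(degree.items()))       # [(0, 3), (1, 2), (2, 3), (3, 2)]
print(sorted(degree_hist.items()))  # [(2, 2), (3, 2)]
```

Nodes 0 and 2 end up with degree 3 because node 3 selects them in addition to their own reciprocal links.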