Getting started with MosaicMPI#
[1]:
import mosaicmpi
mosaicmpi.start_logging()
2024-05-17 15:10:54,470 [INFO] mosaicMPI version 2.4.10
Data guidelines
mosaicMPI can factorize a wide variety of datasets, but will work optimally in these conditions:
Use untransformed, raw data data where possible, and avoid log-transformed data
For single-cell, spatial, or bulk RNA-Seq data, the best data to use is feature counts, then TPM-normalized values, then RPKM/FPKM-normalized values.
Working with Dataset objects#
Datasets can be created from pandas DataFrames quite easily.
[ ]:
import pandas as pd
rna_data = pd.read_table("cptac_data/cptac_RNA.txt", index_col=0)
rna_metadata = pd.read_table("cptac_data/cptac_RNA.metadata.txt", index_col=0) # sample metadata
# create dataset from DataFrames
rna = mosaicmpi.Dataset.from_df(data=rna_data, obs=rna_metadata, is_normalized=True, patient_id_col="patient_id")
They can be written to and read from AnnData files (h5ad format).
[ ]:
# write to .h5ad file
rna.write_h5ad("rna.h5ad")
# read from .h5ad file
rna = mosaicmpi.Dataset.from_h5ad("rna.h5ad")
Dataset objects contain an AnnData object which can also be used for interfacing with other tools
[ ]:
rna.adata
mosaicMPI can recognize and import AnnData .h5ad files whether they are created in Seurat, scanpy, or other single-cell software tools.
[ ]:
import scanpy
pbmc_adata = scanpy.datasets.pbmc3k()
pbmc_dataset = mosaicmpi.Dataset(pbmc_adata)
Selecting overdispersed genes#
[ ]:
# calculates overdispersion for each gene
rna.model_overdispersed_genes(odg_default_spline_degree=3, odg_default_dof=30) # calculates gene statistics and stores in the Dataset object
# thresholds for gene overdispersion
rna.select_overdispersed_genes(overdispersion_metric="odscore", min_score=1)
fig = mosaicmpi.plot_feature_dispersion(rna, show_selected=True)
Default parameters for select_overdispersed_genes() results in about 40-50% of genes as being overdispersed:
[ ]:
rna.adata.var["selected"].value_counts()
Factorization#
[ ]:
cnmf_results_dir = "cnmf_results"
run_name = "rna"
# by default, k=2-60 is run with n_iter=200. For this demo, we will speed it up by drastically subsetting.
kvals = [2, 3 ,4, 5, 6, 7, 8]
n_iter = 10
cnmf_run = rna.initialize_cnmf(cnmf_output_dir=cnmf_results_dir, cnmf_name=run_name, kvals=kvals, n_iter=n_iter)
# these steps take long
cnmf_run.factorize(verbose=False)
cnmf_run.postprocess()
# Merges cNMF results into the `Dataset` object
rna.add_cnmf_results(cnmf_output_dir=cnmf_results_dir, cnmf_name=run_name)
rna.write_h5ad("rna.h5ad") # overwrite original file
Stability-Error Plot#
[ ]:
fig = mosaicmpi.plot_stability_error(rna, figsize=[4,3])
fig.savefig("rna_stability-error.pdf") # Save figures in PDF or PNG format
Accessing program usage values#
Get dataframe with usage of each program across samples
[ ]:
rna.get_usages().head()
Plotting program usages in a heatmap#
[ ]:
colors = mosaicmpi.Colors.from_dataset(rna, pastel_factor=0.4) # create distinct colors for metadata tracks
fig = mosaicmpi.plot_usage_heatmap(rna, k=6, colors=colors,
title="CPTAC RNA dataset, k=6 Program usage")
fig.savefig("k6_usages_heatmap.pdf")
fig_legend = colors.plot_metadata_colors_legend()
fig.savefig("rna_metadata_colors_legend.pdf")
Factorize the proteomics data#
[ ]:
data = pd.read_csv("cptac_data/cptac_protein.csv", index_col=0).T # normalized expression data
metadata = pd.read_table("cptac_data/cptac_protein.metadata.txt", index_col=0) # sample metadata
# create dataset from CPTAC example data
protein = mosaicmpi.Dataset.from_df(data=data, obs=metadata, is_normalized=True, patient_id_col = "patient_id")
protein.model_overdispersed_genes()
protein.select_overdispersed_genes()
# creates directory with cNMF results
cnmf_results_dir = "cnmf_results"
run_name = "protein"
cnmf_run = protein.initialize_cnmf(cnmf_output_dir=cnmf_results_dir, cnmf_name=run_name, kvals=kvals, n_iter=n_iter)
cnmf_run.factorize(verbose=False)
cnmf_run.postprocess()
# Merges cNMF results into the `Dataset` object
protein.add_cnmf_results(cnmf_output_dir=cnmf_results_dir, cnmf_name=run_name)
protein.write_h5ad("protein.h5ad") # write to h5ad file
Factorize the snRNA data#
[ ]:
# Note that the snRNA data has been subsetted for the purposes of this tutorial
data = pd.read_table("cptac_data\cptac_snRNA_subsampled.txt", index_col=0, sep="\t") # normalized counts
metadata = pd.read_table("cptac_data/cptac_snRNA_subsampled.metadata.txt", index_col=0) # sample metadata
# create dataset from CPTAC example data
snrna = mosaicmpi.Dataset.from_df(data=data, obs=metadata, is_normalized=True, patient_id_col = "patient")
snrna.model_overdispersed_genes()
snrna.select_overdispersed_genes()
# creates directory with cNMF results
cnmf_results_dir = "cnmf_results"
run_name = "snrna"
cnmf_run = snrna.initialize_cnmf(cnmf_output_dir=cnmf_results_dir, cnmf_name=run_name, kvals=kvals, n_iter=n_iter)
cnmf_run.factorize(verbose=False)
cnmf_run.postprocess()
# Merges cNMF results into the `Dataset` object
snrna.add_cnmf_results(cnmf_output_dir=cnmf_results_dir, cnmf_name=run_name)
snrna.write_h5ad("snrna.h5ad") # write to h5ad file
Integrate multiple datasets together#
[ ]:
datasets = {"RNA": rna, "Protein": protein, "snRNA": snrna}
k_subset = (2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60) # regardless of the ranks that are factorized, subset for these
integration = mosaicmpi.Integration(datasets=datasets, k_subset = k_subset) # create an integration object
colors = mosaicmpi.Colors.from_integration(integration) # create color scheme for metadata and datasets
colors.to_toml("colors.toml") # save to file for reference (TOML file can be re-imported)
Identify feature overlap between datasets
[ ]:
mosaicmpi.plot_features_upset(integration)
A subset of these are identified as overdispersed in each dataset, and there is a significant overlap between datasets, indicating similar variation is seen in the two separate datasets.
[ ]:
mosaicmpi.plot_overdispersed_features_upset(integration)
[ ]:
# plot the correlation matrix of all programs to each other
fig = mosaicmpi.plot_program_correlation_matrix(integration, colors=colors, figsize=[6, 6], hide_program_labels=True)
fig.savefig("correlation_matrix.pdf")
# plot the legend separately as it applies to multiple figures
figlegend = colors.plot_dataset_colors_legend()
figlegend.savefig("datasets_legend.pdf")
To see if the max_median_corr threshold removed any ranks from either of the datasets, the following plot can be generated. The x-axis is the max-k, a threshold that excludes ranks above. The y-axis is the median of the correlation coefficients for all non-self edges in the correlation network. As this this threshold is slowly increased, the number of ranks, and thus nodes, and thus edges increases. The correlation between all edges slowly increases. In some datasets, this median of
correlations will exceed 0 at high ranks. These high ranks will be excluded by this filter.
You can easily see for each rank whether there is a cNMF result, the stability/error of the result, as well as whether the ranks will be excluded on the basis of a max-k filter (derived from the max_median_corr parameter). You can also see which ranks will be selected (selected_k) based on automatic node subsetting for the final SNS maps.
[ ]:
integration.k_table
[ ]:
fig = mosaicmpi.plot_rank_reduction(integration)
You can see that no k-values exceeded the threshold, so no max_k filter was applied. Now, let’s plot the distribution of correlations for programs within and between datasets. This will show the min_corr thresholds. There is one threshold per dataset pair, and one threshold for within each dataset.
[ ]:
# fig = mosaicmpi.plot_pairwise_corr(integration)
fig2 = mosaicmpi.plot_pairwise_corr_overlaid(integration) # overlaid plots show the mirrored distributions
See the number of nodes from each dataset with and without the node filters (including maxk and selectedk filters) and the edge filter (min_corr thresholds).
[ ]:
integration.get_node_table()
As you can see, for this tutorial, since we chose low ranks at the beginning, no nodes were excluded due to node and edge filters.
Create an Network integration#
[ ]:
network = mosaicmpi.Network(integration)
network.community_search(algorithm="greedy_modularity", resolution=1)
fig = mosaicmpi.plot_community_contribution(network, colors, figsize=[8, 3])
Optional: prune communities with at least 2 datasets#
[ ]:
network.prune_communities(min_datasets = 2) # can also filter communities by number of nodes in total (min_nodes) and number of nodes per dataset (min_nodes_per_dataset)
fig = mosaicmpi.plot_community_contribution(network, colors, figsize=[7, 2])
Save network object and underlying data#
[ ]:
network.to_pkl("network_integration.pkl.gz")
# to read it back, use this
network = mosaicmpi.Network.from_pkl("network_integration.pkl.gz")
Plot an program network#
[ ]:
network.compute_layout(algorithm="neato", community_layout_algorithm="spring") # available algorithms: "community_weighted_spring", "spring", "neato"
fig = mosaicmpi.plot_program_network_datasets(network, colors, node_size_kval=False)
How many samples have each program/node as their highest usage program?
[ ]:
fig = mosaicmpi.plot_program_network_nsamples(network,
colors,
node_size=1e3,
font_size=12,
discretize=True)
How many patients is each program primarily associated to?
[ ]:
fig = mosaicmpi.plot_program_network_npatients(network, colors, node_size=1e3, font_size=12)
Overrepresentation of sample categories for each program, based on the Protein dataset
[ ]:
fig = mosaicmpi.plots.plot_overrepresentation_program_network(network, colors, subset_datasets="RNA", layer="simple_category", pie_size=0.2)
fig.savefig("rna_tumor-normal_overrepresentation.pdf")
And again, using the CPTAC annotations
[ ]:
fig = mosaicmpi.plot_overrepresentation_program_network(network, colors, subset_datasets="snRNA", layer="CellType")
fig.savefig("snRNA_celltype_overrepresentation.pdf")
You can also look at continuous metadata correlated with program usage, like estimated tumor purity (from bulk RNA and Protein datasets):
[ ]:
fig = mosaicmpi.plot_metadata_correlation_program_network(network, colors, layer='purity_TSNet')
Note, that grey nodes have no “purity_TSNet” annotation track (these are the snRNA-Seq programs).
We can also look at correlation with percent mitochondrial reads in snRNA-Seq data:
[ ]:
fig = mosaicmpi.plot_metadata_correlation_program_network(network, colors, layer='percent.mt')
Identifying communities from the program network#
[ ]:
colors.add_missing_community_colors(network)
fig = mosaicmpi.plot_program_network_communities(network, colors)
Plot a summary showing the size of each community (node size) and number of edges connecting communities (edge width).
[ ]:
fig = mosaicmpi.plot_summary_community_network(network, colors)
Plot program usage heatmap summarized by Community
[ ]:
network.integration.get_metadata_df()
[ ]:
fig = mosaicmpi.plot_community_usage_heatmap(network, colors, subset_datasets=['RNA', 'Protein'], prepend_dataset_colors=True, show_sample_labels=False)
fig_legend = colors.plot_metadata_colors_legend()
Correlate categorical variables with usage of individual programs, grouped by community#
[ ]:
fig = mosaicmpi.plot_overrepresentation_program_bar(network, colors, dataset_name="RNA")
# fig_legend = colors.plot_metadata_colors_legend()
Correlate programs in each community with continuous variables#
[ ]:
fig = mosaicmpi.plot_metadata_correlation_program_bar(network, colors, dataset_name="snRNA")
Community-level summary of overrepresentation#
[ ]:
fig = mosaicmpi.plot_overrepresentation_community_bar(network, colors, layer="CellType", subset_datasets="snRNA")
Or, plot it on the network:
[ ]:
fig = mosaicmpi.plot_overrepresentation_community_network(network, colors, layer='CellType', subset_datasets="snRNA")
Summarizing correlation with metadata#
[ ]:
fig = mosaicmpi.plot_metadata_correlation_community_bar(network, colors, layer='purity_TSNet', subset_datasets="RNA", figsize=[2, 3])
Or, plot it on the network:
[ ]:
fig = mosaicmpi.plot_metadata_correlation_community_network(network, colors, layer='purity_TSNet', subset_datasets="RNA")
Shannon diversity of program usage by dataset#
[ ]:
fig = mosaicmpi.plot_sample_entropy(network, colors)