muffin.load

Functions to create a dataset from arrays or bam files.

muffin.load.dataset_from_arrays(count_table, row_names=None, col_names=None, input_count_table=None, random_loc_count_table=None, random_loc_input_table=None)[source]

Import count matrices from python array.

Parameters:
  • count_table (integer ndarray) – (sample, genomic feature) RAW, NON NORMALIZED COUNT data. To save time and memory you can use an int32 format.

  • row_annot (None or dict) – If set to None, will give an integer id to each row (1->n). If set to a dict, expects a dict of the form {annotation_name:annotations}, where annotations are ordered according to the count table. By default None.

  • col_annot (None or dict) – If set to None, will give an integer id to each column (1->n). If set to a dict, expects a dict of the form {annotation_name:annotations}, where annotations are ordered according to the count table. By default None.

  • input_count_table (integer or float ndarray or None, optional) – For sequencing data with input only ! (sample, genomic feature) Raw, non normalized input counts (int) or input with custom normalization. If your input counts are already normalized, you have to set input_already_normalized to True. To save time and memory you can use a 32bits format. By default None

  • random_loc_count_table (integer ndarray, optional) – For sequencing data with input only ! Mandatory if your input count table is not already normalized. (sample, n_random_intervals) Raw, non normalized counts (int) sampled at random genomic intervals (preferably at least 5,000). To save time and memory you can use an int32 format. By default None

  • random_loc_input_table (integer ndarray, optional) – For sequencing data with input only ! Mandatory if your input count table is not already normalized. (sample, n_random_intervals) Raw, non normalized counts (int) sampled at random genomic intervals (preferably at least 5,000). To save time and memory you can use an int32 format. By default None

Returns:

dataset – Annotated data matrix

Return type:

AnnData

muffin.load.dataset_from_bam(bam_paths, genomic_regions=None, genomic_regions_path=None, row_names=None, col_names=None, featureCounts_params=None, isBEDannotation=True, input_bam_paths=None, n_random_input=10000, chromsizes=None)[source]

Initialize count matrices using BAM files and query random regions.

Parameters:
  • bam_paths (str array-like) – Paths to BAM files.

  • genomic_regions (pandas Dataframe) – Genomic annotation in bed format

  • genomic_regions_path (str) – Path to a bed or SAF genome annotation file

  • row_annot (None or dict, optional) – If set to None, will give file name to each row (1->n). If set to a dict, expects a dict of the form {annotation_name:annotations}, where annotations are ordered according to the BAM file order. By default None.

  • col_annot (None or dict, optional) – If set to None, will give use the given genomic regions as annotations. If set to a dict, expects a dict of the form {annotation_name:annotations}, where annotations are ordered according to the genomic_regions_path. By default None.

  • featureCounts_params (None or dict, optional) – Dictionary of keyword arguments to be passed to featureCounts, see featureCounts R package documentation, by default None

  • isBEDannotation (bool, optional) – If set to true assumes the genomic_regions_path is in .bed format with strand information otherwise, it assumes a SAF file (see featureCounts doc), by default True.

  • input_bam_paths (str, optional) – For sequencing data with input only ! Paths to input BAM files, path order has to match the signal bam_paths. By default None.

  • n_random_input (int, optional) – For sequencing data with input only ! Number of random regions to sample to estimate centering and scaling factors. By default 10000.

  • chromsizes (dict or None, optional) – Mandatory for sequencing data with input ! Set of chromosome that will be used to generate random sampling locations. Dict with format : {“chr1”:3000000, “chr2”:456455412}.

Returns:

_description_

Return type:

dataset

muffin.load.set_design_matrix(dataset, design='intercept')[source]

Set design matrix for statistical modelling. If you do not want to remove any confounding factors, set it to a column vector of ones. E.g. : np.ones((model.dataset.layers[“input”].shape[0],1))

Parameters:

design (ndarray or "intercept") – Design matrix. If set to intercept will automatically set a column vector of ones.

muffin.load.set_normalization_factors(dataset, normalization_factors)[source]

Load normalization factors from a numpy array

Parameters:
  • dataset (_type_) – _description_

  • normalization_factors (float ndarray (observations x variables)) – Normalization factors per observation, per variable

muffin.load.set_size_factors(dataset, values)[source]

Load custom size factors for the model.

Parameters:

values (float ndarray) – Size factors, e.g. np.sum(model.matrices[“counts”], axis=1). They will be standardized to unit mean to ensure numerical stability when fitting models.

muffin.tools

Functions for data processing (normalization factors, deviance residuals, PCA with optimal number of components, UMAP…).

muffin.tools.cca(dataset1, dataset2, n_comps=30, layer='residuals')[source]

Perform Canonical Correspondance Analysis as in seurat. Results will be stored in the .obsm[“X_cca”] slot of each dataset.

Parameters:
  • dataset1 (anndata) – First dataset to integrate (order does not matter).

  • dataset2 (_type_) – Second dataset to integrate (order does not matter).

  • n_comps (int, optional) – Number of CCA components to use, by default 30.

  • layer (str, optional) – .layers key present in both datasets, by default “residuals”.

muffin.tools.cluster_rows_leiden(dataset, on='reduced_dims', which='X_pca', feature_mask=None, metric='auto', k='auto', r=1.0, restarts=10)[source]

Computes Shared Nearest Neighbor graph clustering.

Parameters:
  • on (str, optional) – On which data representation to perform clustering. Use “reduced_dims” or “features”, by default “reduced_dims”.

  • which (str, optional) – Which reduced_dims or feature to use. I.e. “PCA” or “residuals”, by default “PCA”.

  • feature_mask (boolean ndarray, optional) – Subset of features to use for Umap (works with PCA as well), by default None

  • metric (str, optional) – Metric to use for kNN search, by default “auto”. If set to auto Pearson correlation is used as the metric when there are more than 10 input dimensions; otherwise, the Euclidean distance is used.

  • k ("auto" or int, optional) – Number of nearest neighbors to find, by default “auto” uses 4*nFeatures^0.2 as a rule of thumb.

  • r (float, optional) – Resolution parameter of the graph clustering, by default 1.0

  • restarts (int, optional) – Number of times to restart the graph clustering before keeping the best partition, by default 10

Return type:

self

muffin.tools.compute_pa_pca(dataset, layer='residuals', feature_mask=None, perm=3, alpha=0.01, solver='arpack', whiten=False, max_rank=None, mincomp=2, plot=False)[source]

Permutation Parallel Analysis to find the optimal number of PCA components. Stored under self.reduced_dims[“PCA”].

Parameters:
  • feature_name (str, optional) – On which feature to compute PCA on (e.g. “residuals”)

  • feature_mask (boolean ndarray or None, optional) – Subset of features to use for Umap, by default None.

  • perm (int (default 3)) – Number of permutations. On large matrices (row * col > 100,000) the eigenvalues are very stable, so there is no need to use a large number of permutations, and only one permutation can be a reasonable choice on large matrices. On smaller matrices the number of permutations should be increased.

  • alpha (float (default 0.01)) – Permutation p-value threshold.

  • solver ("arpack" or "randomized") – Chooses the SVD solver. Randomized is much faster but less accurate and yields different results machine-to-machine.

  • whiten (bool (default True)) – If set to true, each component is transformed to have unit variance.

  • max_rank (int or None (default None)) – Maximum number of principal components to compute. Must be strictly less than the minimum of n_features and n_samples. If set to None, computes up to min(n_samples, n_features)-1 components.

  • mincomp (int (default 2)) – Number of components to return

  • plot (bool (default None)) – Whether to plot the randomized eigenvalues of not

muffin.tools.compute_residuals(dataset, residuals='quantile', clip=inf, subSampleEst=2000, maxThreads=-1, verbose=True, plot=True)[source]

Compute residuals from a regularized NB model for each variable.

Parameters:
  • residuals (str, optional) – Whether to compute “anscombe”, “deviance”, “quantile”, “rqr” or “pearson” residuals, by default “anscombe”

  • clip (float, optional) – Value to clip residuals to (+-value), you can also provide “auto” to clip at +-sqrt(9+len(dataset)/4), default np.inf

  • subSampleEst (int, optional) – Number of samples to use to estimate mean-overdispersion relationship.

  • maxThreads (int, optional) – Number of threads to use, by default -1 (all)

  • verbose (bool, optional) – Verbosity, by default True

  • plot (bool, optional) – Whether to plot the mean-variance graph, by default True

Returns:

dataset

Return type:

AnnData

muffin.tools.compute_size_factors(dataset, method='top_fpkm')[source]

Compute size factors.

Parameters:
  • dataset (AnnData) – Dataset in AnnData format.

  • method (str, optional) –

    Method to use, by default “top_fpkm”. Available methods:

    • ”top fpkm”Selects top 5% most detectable variables and

      computes sum of counts normalization.

    • ”fpkm” : Computes sum of counts normalization.

    • ”scran”Selects top 5% most detectable variables and

      computes scran pooling and deconvolution normalization.

    • ”deseq”Applies median of ratios, works well with deeply sequenced

      datasets. Will raise an error if not suitable.

    • ”fpkm_uq” : Computes Upper Quartile normalization.

Returns:

dataset

Return type:

AnnData

muffin.tools.compute_umap(dataset, on='reduced_dims', which='X_pca', feature_mask=None, umap_params={})[source]

Compute UMAP and stores it under dataset.reduced_dims[“X_umap”].

Parameters:
  • on (str, optional) – On which data representation to perform umap. Use “reduced_dims” or “features”, by default “reduced_dims”.

  • which (str, optional) – Which reduced_dims or feature to use. I.e. “PCA” or “residuals”, by default “PCA”.

  • feature_mask (boolean ndarray, optional) – Subset of features to use for Umap (works with PCA as well), by default None

  • umap_params (dict, optional) – Dictionnary of keyword arguments for UMAP, see UMAP documentation, by default if not provided metric is set to euclidean. Random state is also fixed and should not be provided.

Return type:

dataset

muffin.tools.differential_expression_A_vs_B(dataset, category, ref_category, alternative='two-sided', method='auto')[source]

Performs differential expression between two categories. Results will be stored in dataset.varm[“DE_results”], and for parity with scanpy, in dataset.uns[“rank_genes_groups”].

Parameters:
  • dataset (AnnData) – Dataset in AnnData format.

  • category (str) – Name of the obs column to use for grouping. If more than two unique values are present, will raise an error.

  • ref_category (str) – Name of the reference category used for log fold change computations.

  • method (str, optional) – Method to be used for differential expression, by default “auto”. Available methods: - “auto” : Uses deseq if dataset size > 50, t-test otherwise. - “deseq”: Uses DESeq2 with the supplied design matrix beforehand. - “t-test”: Performs welsh t-test on residuals. - “wilcoxon”: Performs mann-whitney u-test on residuals (prefer using t-test).

muffin.tools.feature_selection_chisquare(dataset, alpha=0.05)[source]

Returns a boolean array of the features which do not passes the sum of squared residuals chisquared goodness of fit test. It is usually too stringent.

Parameters:
  • dataset (AnnData) – Dataset in AnnData format.

  • alpha (float, optional) – FDR, by default 0.05

Returns:

Boolean array of highly variable features.

Return type:

boolean ndarray

muffin.tools.feature_selection_elbow(dataset, plot: bool = True, subsample=20000)[source]

Performs feature selection by finding a conservative ankle point of the ranked standard deviation of residuals curve. We find the ankle point of a third degree polynomial fit, which yields a quite conservative feature selection, but typically removes at least half of the features.

Parameters:
  • dataset (AnnData) – Dataset in AnnData format.

  • plot (bool, optional) – Whether to plot the ranked SSR curve with the ankle point, by default True

Returns:

Boolean array of highly variable features.

Return type:

boolean ndarray

muffin.tools.feature_selection_top_k(dataset, k_features)[source]

Returns a boolean array of the k features with the largest sum of squared residuals.

Parameters:
  • dataset (AnnData) – Dataset in AnnData format.

  • k_features (int) – Number of features to keep

Returns:

Boolean array of highly variable features.

Return type:

boolean ndarray

muffin.tools.pseudo_peak_calling(dataset, alpha=0.05, minFC=0.5, minExp=2)[source]

Performs a pseudo-peak calling using the fitted NB model and normalized input. Returns a boolean array of features with significant enrichment over input in at least minExp rows.

Parameters:
  • dataset (AnnData) – Dataset in AnnData format.

  • alpha (float, optional) – Peak fdr threshold, by default 0.05

  • minFC (float, optional) – Fold change over input threshold, by default 1.0

  • minExp (int, optional) – Minimal number of samples/cells with significant enrichment over input, by default 3.

Returns:

kept_features – Boolean array of features with significant enrichment over input in at least minExp rows.

Return type:

boolean ndarray

muffin.tools.rescale_input_center_scale(dataset, plot=True)[source]

For data with input only ! Centers and scales fold changes by modifying the input count matrix. Novel scaled input is stored in dataset.layers[“normalization_factors”]

Parameters:
  • dataset (AnnData) – Dataset in AnnData format, with input layer filled

  • plot (bool, optional) – Whether to plot or not, by default False

muffin.tools.rescale_input_quantile(dataset, plot=False)[source]

For data with input only ! Centers and scales fold changes by modifying the input count matrix. Novel scaled input is stored in dataset.layers[“normalization_factors”]

Parameters:
  • dataset (AnnData) – Dataset in AnnData format, with input layer filled

  • plot (bool, optional) – Whether to plot or not, by default False

muffin.tools.transfer_categorical_labels(dataset_ref, dataset_target, label, representation_common, representation_target='X_pca', k_smoothing='auto', metric='auto')[source]

Transfer a categorical label from observations of dataset_ref to dataset_target. e.g clustering info from scRNA-seq to scATAC. Uses a RF classifier fitted on reference dataset to predict labels on the target dataset. Predictions are then smoothed using kNN, and untransferrable points are detected using Local Outlier Factor.

Parameters:
  • dataset_ref (anndata) – Dataset with reference labels

  • dataset_target (anndata) – Dataset to which predict labels.

  • label (str) – Column name in dataset_ref.obs

  • representation_common (str) – .obsm key present in both datasets. e.g. joint embedding produced by cca, then batch corrected by a tool such as harmony.

  • representation_target (str, optional) – .obsm key present in dataset_target, that will be used to smooth transferred labels to nearest neighbors, by default “X_pca”

  • k_smoothing ("auto" or int) – Number of nearest neighbors used for label smoothing.

muffin.tools.trim_low_counts(dataset, min_exp=3, min_counts=1)[source]

Returns a boolean array of variables with at least min_counts in min_exp rows.

Parameters:
  • dataset (AnnData) – Dataset in AnnData format.

  • min_exp (int, optional) – Minimum number of experiment to have at least min_counts, by default 3, at least 1

  • min_counts (int, optional) – Minimum number of counts, by default 1

Returns:

kept_features – Variables that satisfies above criterions.

Return type:

boolean ndarray

muffin.great

Genomic Regions Enrichment Analysis

class muffin.great.pyGREAT(geneFile, chrFile, gmtFile=None, regulatory_bed=None, distal=1000000, upstream=5000, downstream=1000, gtfGeneCol='gene_name', gene_biotype='all')[source]

Bases: object

Genomic regions GSEA tool.

Parameters:
  • geneFile (str) – Path to a GTF file.

  • chrFile (str) – Path to a chrInfo file. (chrom-tab-Chrom size-line return)

  • gmtFile (str or list of str, optional) – Path to gmt file or list of paths to gmt files that will get concatenated.

  • regulatory_bed (str, optional) – Path to a bed file of the form “chr start end gene”, that will override muffin’s default regulatory domains.

  • distal (int, optional) – Size of inferred distal regulatory regions, by default 1000000

  • upstream (int, optional) – Size of inferred upstream regulatory regions, by default 5000

  • downstream (int, optional) – Size of inferred downstream regulatory regions, by default 1000

  • gtfGeneCol (str, optional) – Name of the gene name/id column in the GTF file, by default “gene_name”

  • gene_biotype (str, optional) – Type of gene to keep e.g. “protein_coding”, by default “all”

barplot_enrich(enrichDF, title='', by='P(Beta > 0)', alpha=0.05, topK=10, savePath=None)[source]

Draw Enrichment barplots

Parameters:
  • enrichDF (pandas dataframe or tuple of pandas dataframes) – The result of the findEnriched method.

  • savePath (string (optional)) – If set to None, does not save the figure.

cluster_treemap(enrichDF, alpha=0.05, score='-log10(qval)', metric='yule', resolution=1.0, output=None, topK=9)[source]

Plot a treemap of clustered gene set terms. Clustering is done according to similarities between annotated genes of a gene set.

Parameters:
  • enrichDF (dataframe) – Output of “findEnriched” method.

  • alpha (float, optional) – FDR cutoff, by default 0.05

  • score (str, optional) – Which column of the result dataframe to use to identify lead term, by default “-log10(qval)”

  • metric (str, optional) – Similarity measure of gene annotation between terms, by default “yule”

  • resolution (float, optional) – Influences the size of the clusters (larger = more clusters), by default 1.0

  • output (str or None, optional) – Path to save figure, by default None

  • topK (int or None, optional) – Number of terms to display, if set to None displays all, by default 9

Returns:

filtered

Return type:

pandas dataframe

find_enriched(query, background=None, min_genes=3, max_genes=1000, cores=-1)[source]

Find enriched terms in genes near query.

Parameters:
  • query (pandas dataframe in bed-like format or PyRanges) – Set of genomic regions to compute enrichment on. If a pandas dataframe, assume the first three columns are chromosome, start, end.

  • background (None, pandas dataframe in bed-like format, or PyRanges) –

  • (default (None)) – If set to None considers the whole genome as the possible, equiprobable locations of the query. Otherwise it supposes the query is a subset of these background regions. (Note that it does not explicitly check for it).

  • min_genes (int, (default 3)) – Minimum number of intersected genes for a geneset.

  • max_genes (int, (default 1000)) – Maximum number of genes for a geneset.

  • cores (int, (default -1)) – Max number of cores to used for parallelized computations. Default uses all available cores (-1).

Returns:

results

Return type:

pandas dataframe

find_enriched_genes(query, background=None)[source]

Find enriched terms in genes near query.

Parameters:
  • query (pandas dataframe in bed-like format or PyRanges) – Set of genomic regions to compute enrichment on.

  • background (None, pandas dataframe in bed-like format, or PyRanges) –

  • (default (None)) – If set to None considers the whole genome as the possible locations of the query. Otherwise it supposes the query is a subset of these background regions.

Returns:

results

Return type:

pandas dataframe

find_genes_for_geneset(term, enrichedGeneTab, alpha=0.05)[source]

Find genes enriched for a particular geneset

Parameters:
  • term (str) – Name of the GSEA term.

  • enrichedGeneTab (pandas dataframe) – Results of findEnrichedGenes.

  • alpha (float, optional) – Adjusted P-value threshold, by default 0.05

Returns:

Enriched – Enriched genes

Return type:

pandas Dataframe

gene_activity_matrix(dataset, extend=1000)[source]

Creates a gene-activity matrix, useful to quantify activity (e.g. ATAC or ChIP) for each gene. Sums the counts of peaks located over gene bodies.

Parameters:
  • dataset (anndata) – The .var fields need to contain the “Chromosome”, “Start” and “End” fields.

  • extend (int) – How much to extend gene body in bp.

Returns:

gene_activity – observations-genes activity matrix. Will also copy obs,obsm,obsp and uns fields of the original dataset.

Return type:

anndata

get_nearest_gene(query, max_dist=inf)[source]

Get the nearest TSS for each row. If it can’t be retrieved will be named “None” (str).

Parameters:
  • query (pandas Dataframe or PyRanges) – Set of genomic regions.

  • max_dist (integer or float) – Maximum distance for association, default np.inf

label_by_nearest_gene(query)[source]

Label each row by its nearest gene TSS, avoiding duplicates. E.g. NADK_1, MYC_1, MYC_2, None_1, None_2

Parameters:

query (pandas Dataframe or PyRanges) – Set of genomic regions.

Returns:

labels – Each corresponding label.

Return type:

ndarray

tss_activity_matrix(dataset, extend=1000)[source]

Creates a gene tss-activity matrix, useful to quantify activity (e.g. ATAC or ChIP) for each gene. Sums the counts of peaks located over TSS.

Parameters:
  • dataset (anndata) – The .var fields need to contain the “Chromosome”, “Start” and “End” fields.

  • extend (int) – How much to extend TSSs in bp.

Returns:

gene_activity – observations-genes activity matrix. Will also copy obs,obsm,obsp and uns fields of the original dataset.

Return type:

anndata

muffin.peakMerge

Generate consensus peaks.

muffin.peakMerge.merge_peaks(beds, chrom_sizes, fileFormat='narrowPeak', inferCenter=False, forceUnstranded=False, sigma='auto', perPeakDensity=False, minOverlap=2, output_bedgraph=None)[source]

Read peak called files, generate consensuses and the matrix.

Parameters:
  • beds (list of pandas dataframes) – Each dataframe should be formatted in bed format. Summit is assumed to be 7th column (=thickStart) for bed format, and 9th column for narrowPeak format.

  • chrom_sizes (str or dict) – Path to tab separated (chromosome, length) annotation file. If a dict, must be of the form {“ChrName”: “chrSize”}.

  • fileFormat ("bed" or "narrowPeak") – Format of the files being read. Bed file format assumes the max signal position to be at the 6th column (0-based) in absolute coordinates. The narrowPeak format assumes the max signal position to be the 9th column with this position being relative to the start position.

  • inferCenter (boolean (optional, default False)) – If set to true will use the position halfway between start and end positions. Enable this only if the summit position is missing. Can also be suitable for broad peaks as the summit position can be unreliable.

  • forceUnstranded (Boolean (optional, default False)) – If set to true, assumes all peaks are not strand-specific even if strand specific information was found.

  • sigma (float or "auto" (optional, default "auto")) – Size of the gaussian filter (lower values = more separation). Only effective if perPeakDensity is set to False. “auto” automatically selects the filter width at (average peak size)/8.

  • perPeakDensity (Boolean (optional, default False)) – Recommended for broad peaks. If set to false will perform a gaussian filter along the genome (faster), assuming all peaks have roughly the same size. If set to true will create the density curve per peak based on each peak individual size. This is much more slower than the filter method. May be useful if peaks are expected to have very different sizes. Can also be faster when the number of peaks is small.

  • minOverlap (integer (optional, default 2)) – Minimum number of peaks required at a consensus. 2 Indicates that a peak must be replicated at least once.

muffin.plots

Functions to generate plots.

muffin.plots.mega_heatmap(dataset, layer='residuals', label_col=None, method='ward', metric='euclidean', resolution=4000, vmin=None, vmax=None, show_dendrogram=True, show_bar=True, max_dendro=4000, cmap='vlag', figsize=(6, 3.375))[source]
muffin.plots.plot_reduced_dim(dataset, which='Umap', components=[0, 1], points_labels=None, label_type='categorical', palette='auto', highlight=None)[source]
Parameters:
  • which (str, optional) – Which reduced dimensionality representation to use, by default “X_umap”

  • components (integer list of length 2, optional) – Which components of the reduced dimensionality representation to plot, by default [0,1]

  • points_labels (ndarray, optional) – Label for each point, by default None

  • label_type ("categorical" or "numeric", optional) – If set to “categorical” assumes, by default “categorical”

  • palette ("auto" or matplotlib colormap or dict, optional) – If set “auto”, will automatically choose the color palette according to label type and number of labels. If label_type is set to “numeric”, it should be a matplotlib colormap. If label_type is set to “categorical” should be a dictionnary {category:color} with color being an hexadecimal str or rgb array-like (r,g,b). By default “auto”.

  • highlight (boolean ndarray, optional) – Points to highlight on the graph, by default None.

Return type:

self