Optimization of Thresholds

This section covers the optimization of thresholds for iModulon gene weights using Otsu’s method, which automatically determines optimal cutoffs to distinguish significant genes from background noise.

Overview

After running multi-view ICA, gene weights in the M matrices need to be thresholded to identify which genes significantly contribute to each iModulon. The optimize_M_thresholds method applies an adapted version of Otsu’s method to automatically determine these thresholds.

What is Otsu’s Method?

Otsu’s method is an automatic threshold selection technique originally developed for image segmentation. The method finds the optimal threshold by maximizing the between-class variance, effectively separating data into two distinct groups.

Key principles:

  • Analyzes the distribution of values (histogram)

  • Divides data into two classes: signal and background

  • Finds the threshold that maximizes separation between classes

  • Works best when data has a bimodal distribution

Mathematical basis:

The method maximizes the between-class variance:

\[\sigma_b^2(t) = \omega_0(t)\omega_1(t)[\mu_0(t) - \mu_1(t)]^2\]
Where:
  • \(\omega_0\) and \(\omega_1\) are the class probabilities

  • \(\mu_0\) and \(\mu_1\) are the class mean values

  • \(t\) is the threshold being evaluated

Adaptation for Gene Expression Data

Gene expression data often has heavy-tailed distributions with many genes showing low weights and few showing high weights. To handle this, MultiModulon uses a quantile-based pre-filtering approach:

  1. Pre-filtering: Remove the bottom X% of absolute values (default: 90%)

  2. Apply Otsu: Run Otsu’s method on the remaining high-value genes

  3. Binarization: Create presence matrix based on optimized thresholds

This approach helps identify true regulatory genes while filtering out noise.

Basic Usage

MultiModulon.optimize_M_thresholds(method="Otsu's method", quantile_threshold=90)

Optimize thresholds for M matrices across all species.

Parameters:
  • method (str) – Threshold optimization method (currently only “Otsu’s method”)

  • quantile_threshold (float) – Percentile for pre-filtering (default: 90)

Example:

# After running ICA
multiModulon.run_robust_multiview_ica(
    a={'Species1': 50, 'Species2': 60},
    c=20,
    num_runs=100
)

# Optimize thresholds
multiModulon.optimize_M_thresholds(
    method="Otsu's method",
    quantile_threshold=90
)

# Results are stored in each species
for species in multiModulon.species:
    thresholds = multiModulon[species].M_thresholds
    presence = multiModulon[species].presence_matrix
    print(f"{species}: {presence.sum().mean():.1f} genes per component")

Understanding the Results

M_thresholds DataFrame

Contains the optimized threshold for each component:

# Access thresholds
thresholds = multiModulon['Species1'].M_thresholds

# Structure:
# Index: Component names (Core_1, Core_2, ..., Unique_1, ...)
# Column: 'M_threshold' - the optimized threshold value

print("Component thresholds:")
print(thresholds.head(10))

# Average threshold across components
avg_threshold = thresholds['M_threshold'].mean()
print(f"Average threshold: {avg_threshold:.3f}")

Presence Matrix

Binary matrix indicating which genes belong to each component:

# Access presence matrix
presence = multiModulon['Species1'].presence_matrix

# Structure:
# Rows: Genes
# Columns: Components
# Values: 1 if |gene weight| > threshold, 0 otherwise

# Count genes per component
genes_per_comp = presence.sum(axis=0)
print(f"Genes per component: min={genes_per_comp.min()}, max={genes_per_comp.max()}")

# Find components with many genes
large_components = genes_per_comp[genes_per_comp > 100].index
print(f"Components with >100 genes: {large_components.tolist()}")

# Get genes in a specific component
component = 'Core_1'
component_genes = presence[presence[component] == 1].index.tolist()
print(f"{component} contains {len(component_genes)} genes")

Customizing Pre-filtering

Adjust the quantile threshold based on your data:

# More stringent: Keep only top 5% of values
multiModulon.optimize_M_thresholds(quantile_threshold=95)

# Less stringent: Keep top 20% of values
multiModulon.optimize_M_thresholds(quantile_threshold=80)

# Compare results
for threshold in [80, 85, 90, 95]:
    multiModulon.optimize_M_thresholds(quantile_threshold=threshold)
    avg_genes = []
    for species in multiModulon.species:
        presence = multiModulon[species].presence_matrix
        avg_genes.append(presence.sum().mean())
    print(f"Quantile {threshold}%: {np.mean(avg_genes):.1f} genes/component")

Next Steps

After optimizing thresholds:

  1. Visualization of iModulons - Visualize thresholded components

  2. Use presence matrices for enrichment analysis

  3. Compare gene sets across species for core components

  4. Export gene lists for biological validation