Optimization of Thresholds
This section covers the optimization of thresholds for iModulon gene weights using Otsu’s method, which automatically determines optimal cutoffs to distinguish significant genes from background noise.
Overview
After running multi-view ICA, gene weights in the M matrices need to be thresholded to identify which genes significantly contribute to each iModulon. The optimize_M_thresholds method applies an adapted version of Otsu’s method to automatically determine these thresholds.
What is Otsu’s Method?
Otsu’s method is an automatic threshold selection technique originally developed for image segmentation. The method finds the optimal threshold by maximizing the between-class variance, effectively separating data into two distinct groups.
Key principles:
Analyzes the distribution of values (histogram)
Divides data into two classes: signal and background
Finds the threshold that maximizes separation between classes
Works best when data has a bimodal distribution
Mathematical basis:
The method maximizes the between-class variance:
- Where:
\(\omega_0\) and \(\omega_1\) are the class probabilities
\(\mu_0\) and \(\mu_1\) are the class mean values
\(t\) is the threshold being evaluated
Adaptation for Gene Expression Data
Gene expression data often has heavy-tailed distributions with many genes showing low weights and few showing high weights. To handle this, MultiModulon uses a quantile-based pre-filtering approach:
Pre-filtering: Remove the bottom X% of absolute values (default: 90%)
Apply Otsu: Run Otsu’s method on the remaining high-value genes
Binarization: Create presence matrix based on optimized thresholds
This approach helps identify true regulatory genes while filtering out noise.
Basic Usage
- MultiModulon.optimize_M_thresholds(method="Otsu's method", quantile_threshold=90)
Optimize thresholds for M matrices across all species.
Example:
# After running ICA
multiModulon.run_robust_multiview_ica(
a={'Species1': 50, 'Species2': 60},
c=20,
num_runs=100
)
# Optimize thresholds
multiModulon.optimize_M_thresholds(
method="Otsu's method",
quantile_threshold=90
)
# Results are stored in each species
for species in multiModulon.species:
thresholds = multiModulon[species].M_thresholds
presence = multiModulon[species].presence_matrix
print(f"{species}: {presence.sum().mean():.1f} genes per component")
Understanding the Results
M_thresholds DataFrame
Contains the optimized threshold for each component:
# Access thresholds
thresholds = multiModulon['Species1'].M_thresholds
# Structure:
# Index: Component names (Core_1, Core_2, ..., Unique_1, ...)
# Column: 'M_threshold' - the optimized threshold value
print("Component thresholds:")
print(thresholds.head(10))
# Average threshold across components
avg_threshold = thresholds['M_threshold'].mean()
print(f"Average threshold: {avg_threshold:.3f}")
Presence Matrix
Binary matrix indicating which genes belong to each component:
# Access presence matrix
presence = multiModulon['Species1'].presence_matrix
# Structure:
# Rows: Genes
# Columns: Components
# Values: 1 if |gene weight| > threshold, 0 otherwise
# Count genes per component
genes_per_comp = presence.sum(axis=0)
print(f"Genes per component: min={genes_per_comp.min()}, max={genes_per_comp.max()}")
# Find components with many genes
large_components = genes_per_comp[genes_per_comp > 100].index
print(f"Components with >100 genes: {large_components.tolist()}")
# Get genes in a specific component
component = 'Core_1'
component_genes = presence[presence[component] == 1].index.tolist()
print(f"{component} contains {len(component_genes)} genes")
Customizing Pre-filtering
Adjust the quantile threshold based on your data:
# More stringent: Keep only top 5% of values
multiModulon.optimize_M_thresholds(quantile_threshold=95)
# Less stringent: Keep top 20% of values
multiModulon.optimize_M_thresholds(quantile_threshold=80)
# Compare results
for threshold in [80, 85, 90, 95]:
multiModulon.optimize_M_thresholds(quantile_threshold=threshold)
avg_genes = []
for species in multiModulon.species:
presence = multiModulon[species].presence_matrix
avg_genes.append(presence.sum().mean())
print(f"Quantile {threshold}%: {np.mean(avg_genes):.1f} genes/component")
Next Steps
After optimizing thresholds:
Visualization of iModulons - Visualize thresholded components
Use presence matrices for enrichment analysis
Compare gene sets across species for core components
Export gene lists for biological validation