Optimization of Dimensions
This section covers the optimization of component numbers for multi-view ICA, including both core (shared) and unique (species-specific) components.
Overview
Choosing the right number of components is crucial for meaningful results. MultiModulon provides automated optimization methods to determine:
Optimal number of core components - Shared across all species
Optimal number of unique components - Specific to each species
Two optimization metrics are available:
Cohen’s d effect size - Separation between each component’s top genes and all other genes (default and recommended)
NRE (Normalized Reconstruction Error) - From the paper at https://proceedings.mlr.press/v216/pandeva23a.html
Optimizing Core Components
- MultiModulon.optimize_number_of_core_components(**kwargs)
Optimize the number of core (shared) components across species.
- Parameters:
max_k (int) – Maximum number of core components to test (default: auto-determined)
step (int) – Step size for k candidates (default: 5)
max_a_per_view (int) – Maximum components per species (default: max_k)
train_frac (float) – Fraction of data for training (default: 0.75)
num_runs (int) – Number of cross-validation runs (default: 1)
mode (str) – Computation mode ‘gpu’ or ‘cpu’ (default: ‘gpu’)
seed (int) – Random seed for reproducibility (default: 42)
metric (str) – Optimization metric ‘nre’ or ‘effect_size’ (default: ‘effect_size’)
effect_size_threshold (float) – Cohen’s d threshold (default: 5)
num_top_gene (int) – Number of top genes for Cohen’s d (default: 20)
save_path (str) – Directory to save optimization plot
fig_size (tuple) – Figure size as (width, height) (default: (5, 3))
font_path (str) – Path to font file for plots
- Returns:
Tuple of (optimal_num_core_components, metric_scores)
- Return type:
Basic Usage
# Optimize using Cohen's d effect size
optimal_core, scores = multiModulon.optimize_number_of_core_components(
    max_k=30,
    step=5,
    metric='effect_size',
    effect_size_threshold=5,  # Minimum Cohen's d
    num_top_gene=20,  # Top genes to consider
    save_path="effect_size_optimization/"  # Directory for the optimization plot
)
print(f"Optimal number of core components: {optimal_core}")
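As a rough illustration of how max_k and step interact, the sketch below builds a list of candidate values. It assumes the candidates run from step up to max_k in increments of step; the exact grid MultiModulon tests is not stated in this section, and candidate_ks is a hypothetical helper, not part of the library.

```python
def candidate_ks(max_k, step=5):
    """Hypothetical helper: candidate component counts from step to max_k.

    Assumption: the grid starts at `step` and includes `max_k` when it is
    a multiple of `step`. The library may build its grid differently.
    """
    return list(range(step, max_k + 1, step))


candidates = candidate_ks(max_k=30, step=5)
print(candidates)  # [5, 10, 15, 20, 25, 30]
```

A smaller step gives a finer search at the cost of more ICA runs.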
Understanding the Metrics
NRE (Normalized Reconstruction Error):
Measures how well core components reconstruct the data
Lower values are better
Can favor a number that includes noise components, because extra components still lower the reconstruction error
Cohen’s d Effect Size:
Measures separation between top genes and others
Higher values indicate components with clearer gene membership
Better for biological interpretability
Filters out noise components
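To make the effect-size idea concrete, here is a minimal sketch of Cohen's d between a component's top genes and the rest. It assumes genes are ranked by absolute weight and that a pooled standard deviation is used; MultiModulon's exact computation may differ, and cohens_d_top_vs_rest is a hypothetical helper. The data are synthetic.

```python
import numpy as np


def cohens_d_top_vs_rest(weights, num_top_gene=20):
    """Cohen's d between the top-|weight| genes and all remaining genes.

    Assumptions: ranking by absolute weight, pooled sample variance.
    """
    magnitudes = np.abs(np.asarray(weights, dtype=float))
    order = np.argsort(magnitudes)[::-1]  # largest magnitude first
    top = magnitudes[order[:num_top_gene]]
    rest = magnitudes[order[num_top_gene:]]
    n_top, n_rest = len(top), len(rest)
    pooled_var = (
        (n_top - 1) * top.var(ddof=1) + (n_rest - 1) * rest.var(ddof=1)
    ) / (n_top + n_rest - 2)
    return (top.mean() - rest.mean()) / np.sqrt(pooled_var)


rng = np.random.default_rng(0)

# Synthetic "noise" component: no gene stands out.
noise = rng.normal(0.0, 1.0, 2000)

# Synthetic "sparse" component: 20 genes carry large weights.
sparse = rng.normal(0.0, 0.1, 2000)
sparse[:20] = 10.0

d_noise = cohens_d_top_vs_rest(noise)
d_sparse = cohens_d_top_vs_rest(sparse)
print(f"noise: {d_noise:.2f}, sparse: {d_sparse:.2f}")
```

A component with a small set of clearly dominant genes scores far higher than a noise-like one, which is what effect_size_threshold is meant to separate.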
Optimizing Unique Components
After determining core components, optimize unique components:
- MultiModulon.optimize_number_of_unique_components(**kwargs)
Optimize the number of unique components for each species.
- Parameters:
optimal_num_core_components (int) – Number of core components (from previous step)
step (int) – Step size for testing unique components (default: 5)
mode (str) – Computation mode ‘gpu’ or ‘cpu’ (default: ‘gpu’)
seed (int) – Random seed (default: 42)
effect_size_threshold (float) – Cohen’s d threshold (default: 5)
num_top_gene (int) – Number of top genes for Cohen’s d (default: 20)
save_path (str) – Directory to save plots for each species
fig_size (tuple) – Figure size (default: (5, 3))
font_path (str) – Path to font file
- Returns:
Tuple of (optimal_unique_components, optimal_total_components)
- Return type:
Basic Usage
# Optimize unique components
optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=20,  # From previous step
    step=5,
    save_path="unique_optimization/"  # Directory for per-species plots
)
# Results
print("Optimal unique components per species:")
for species, n_unique in optimal_unique.items():
    n_total = optimal_total[species]
    print(f"{species}: {n_unique} unique, {n_total} total")
How It Works
For each species:
Tests different numbers of unique components
Runs ICA with fixed core + varying unique
Calculates mean Cohen’s d for unique components
Selects the number that maximizes the count of interpretable components
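The selection step above can be sketched as follows. This is one plausible reading, not the library's implementation: it assumes a component counts as interpretable when its Cohen's d meets effect_size_threshold, and that ties go to the smaller candidate. select_num_components is a hypothetical helper and the effect sizes are made up.

```python
def select_num_components(effect_sizes_by_k, effect_size_threshold=5.0):
    """Pick the candidate k with the most components at or above the threshold.

    effect_sizes_by_k maps each candidate k to the Cohen's d values of the
    components obtained with that k. Ties go to the smaller k (assumption).
    """
    best_k, best_count = None, -1
    for k in sorted(effect_sizes_by_k):
        count = sum(d >= effect_size_threshold for d in effect_sizes_by_k[k])
        if count > best_count:  # strict '>' keeps the smaller k on ties
            best_k, best_count = k, count
    return best_k, best_count


# Made-up Cohen's d values, for illustration only.
example = {
    5: [9.1, 8.4, 7.7, 6.2, 5.5],
    10: [9.0, 8.8, 7.9, 7.1, 6.4, 6.0, 5.3, 4.1, 3.2, 2.8],
    15: [8.7, 8.1, 7.5, 6.9, 6.1, 5.6, 5.2, 4.4, 3.9, 3.1,
         2.7, 2.5, 2.2, 1.9, 1.4],
}

chosen_k, n_interpretable = select_num_components(example)
print(chosen_k, n_interpretable)  # 10 7
```

In this example, going from 10 to 15 components adds only components below the threshold, so the smaller value is kept.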
Custom Thresholds
How strict the threshold should be depends on the species being analyzed:
# Strict threshold for well-studied species
optimal_unique_strict, _ = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=20,
    effect_size_threshold=7,  # Higher threshold
    save_path="strict_optimization/"
)
# Permissive threshold for novel species
optimal_unique_permissive, _ = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=20,
    effect_size_threshold=3,  # Lower threshold
    save_path="permissive_optimization/"
)
Complete Optimization Workflow
Here’s a complete optimization workflow:
# Step 1: Optimize core components
print("Optimizing core components...")
optimal_core, core_scores = multiModulon.optimize_number_of_core_components(
    max_k=40,
    step=5,
    metric='effect_size',
    effect_size_threshold=5,
    num_runs=3,
    save_path="optimization_results/",
    fig_size=(6, 4)
)
print(f"Optimal core components: {optimal_core}")
# Step 2: Optimize unique components
print("\nOptimizing unique components...")
optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=optimal_core,
    step=5,
    effect_size_threshold=5,
    save_path="optimization_results/",
    fig_size=(6, 4)
)
print("\nOptimization complete!")
print(f"Core components: {optimal_core}")
for species in multiModulon.species:
    print(f"{species}: {optimal_unique[species]} unique, "
          f"{optimal_total[species]} total")
# Step 3: Run ICA with optimal parameters
print("\nRunning multi-view ICA with optimal parameters...")
M_matrices, A_matrices = multiModulon.run_robust_multiview_ica(
    a=optimal_total,
    c=optimal_core,
    num_runs=100,
    mode='gpu'
)
Best Practices
Start with the effect_size metric - It is more biologically relevant
Use multiple runs - At least 3-5 for reliability
Inspect plots - Don’t just trust automatic selection
Validate results - Check if components make biological sense
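One simple way to act on the advice about multiple runs is to repeat the optimization with different seeds and check how often the same value is selected. The sketch below only summarizes such results; summarize_selections is a hypothetical helper, and the list of selected values is made up rather than produced by MultiModulon.

```python
from collections import Counter


def summarize_selections(selected_ks):
    """Return the most frequently selected k and the fraction of runs choosing it."""
    counts = Counter(selected_ks)
    most_common_k, votes = counts.most_common(1)[0]
    return most_common_k, votes / len(selected_ks)


# Made-up results from five repeated optimizations with different seeds.
selected = [20, 20, 25, 20, 15]
consensus_k, agreement = summarize_selections(selected)
print(f"Consensus: {consensus_k} ({agreement:.0%} of runs)")
```

Low agreement between runs is a signal to inspect the optimization plots before settling on a value.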
Next Steps
After optimization:
Robust Multi-view ICA - Run ICA with optimal parameters
Visualization of iModulons - Visualize and interpret components