Basic Workflow
This example demonstrates the complete MultiModulon analysis workflow from data loading to component visualization.
Step 1: Initialize MultiModulon Object
Load data from the Input_Data directory containing expression matrices, gene annotations, and sample metadata for all strains.
from multimodulon import MultiModulon
import pandas as pd
import numpy as np
# Initialize MultiModulon object
multiModulon = MultiModulon(input_data_path)
The initialization will:
Scan the input directory for species/strain subdirectories
Load expression data (log_tpm.csv)
Load sample metadata (sample_sheet.csv)
Validate data consistency
Report the number of genes and samples for each species
Step 2: Generate BBH Files
Generate Bidirectional Best Hits (BBH) files for ortholog detection between all strain pairs.
# Generate BBH files using multiple threads
multiModulon.generate_BBH(output_path, threads=8)
This step:
Extracts protein sequences from genome files
Runs BLAST between all species pairs
Identifies reciprocal best hits
Saves results as CSV files
Step 3: Align Genes Across Strains
Create a unified gene database by aligning genes across all strains using the BBH results.
# Align genes across all strains
combined_gene_db = multiModulon.align_genes(
input_bbh_dir=output_bbh_path,
output_dir=output_gene_info_path,
reference_order=['Species1', 'Species2', 'Species3'], # optional
bbh_threshold=90 # optional: minimum percent identity
)
The alignment process:
Groups orthologous genes using Union-Find algorithm
Creates gene families across species
Generates a combined gene database
Saves the alignment for future use
Step 4: Create Gene Tables
Parse GFF files to create gene annotation tables for each strain.
# Create gene tables from GFF files
multiModulon.create_gene_table()
# Optional: Add eggNOG annotations
multiModulon.add_eggnog_annotation(eggnog_output_path)
This enriches gene information with:
Gene names and products
COG categories
Functional annotations
Step 5: Generate Aligned Expression Matrices
Create expression matrices with consistent gene indexing across all strains.
# Generate aligned expression matrices
multiModulon.generate_X(gene_info_folder_path)
This step:
Aligns expression matrices based on gene families
Handles missing genes with NaN values
Reports dimensions and recommendations
Step 6: Optimize Number of Core Components
Use Cohen’s d effect size metric to automatically determine the optimal number of core components.
# Optimize number of core components
optimal_num_core_components = multiModulon.optimize_number_of_core_components(
metric='effect_size', # Use Cohen's d effect size
step=5, # Test k = 5, 10, 15, 20, ...
save_path=output_dir, # Save plots
fig_size=(7, 5), # Figure size
)
The optimization:
Tests different numbers of core components
Evaluates using Cohen’s d effect size
Selects optimal k based on interpretability
Saves optimization plots
Step 7: Optimize Number of Unique Components
Determine the optimal number of unique (species-specific) components for each strain.
# Optimize unique components for each species
optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
optimal_num_core_components=optimal_num_core_components,
step=5,
save_path=output_dir,
fig_size=(7, 5),
effect_size_threshold=1
)
This process:
Tests different numbers of unique components per species
Evaluates component quality using effect size
Returns optimal numbers for each species
Step 8: Run Robust Multi-view ICA
Perform robust multi-view ICA with multiple runs and clustering to identify consistent components.
# Run robust multi-view ICA
M_matrices, A_matrices = multiModulon.run_robust_multiview_ica(
a=optimal_total, # Total components per species
c=optimal_num_core_components, # Number of core components
num_runs=20, # Number of runs for robustness
effect_size_threshold_core=3, # Cohen's d threshold for core
effect_size_threshold_unique=1, # Cohen's d threshold for unique
seed=42 # Random seed
)
The robust ICA:
Runs ICA multiple times with different initializations
Clusters components across runs
Selects consistent components
Generates final M (gene weights) and A (activities) matrices
Step 9: Optimize M Matrix Thresholds
Calculate thresholds for binarizing the M matrices using Otsu’s method.
# Optimize thresholds for each component
multiModulon.optimize_M_thresholds(
method="Otsu's method",
quantile_threshold=95
)
This creates:
Component-specific thresholds
Binarized presence matrices
Statistics on genes per component
Step 10: Save Results
Save the complete MultiModulon object for future use.
# Save to compressed JSON format
multiModulon.save_to_json_multimodulon("multiModulon_results.json.gz")
# Load saved object
multiModulon = MultiModulon.load_json_multimodulon("multiModulon_results.json.gz")
Summary
This workflow covers the complete pipeline from raw data to interpretable multi-species regulatory modules:
Data loading and validation
Ortholog detection via BBH
Gene alignment across species
Expression matrix alignment
Component number optimization
Robust multi-view ICA
Threshold optimization
Results storage
The output includes:
Core components conserved across species
Unique components specific to each species
Gene membership for each component
Activity profiles across samples
Optimized thresholds for interpretation