Basic Workflow

This example demonstrates the complete MultiModulon analysis workflow from data loading to component visualization.

Step 1: Initialize MultiModulon Object

Load data from the Input_Data directory containing expression matrices, gene annotations, and sample metadata for all strains.

from multimodulon import MultiModulon
import pandas as pd
import numpy as np

# Initialize MultiModulon object
multiModulon = MultiModulon(input_data_path)

The initialization will:

Scan the input directory for species/strain subdirectories
Load expression data (log_tpm.csv)
Load sample metadata (sample_sheet.csv)
Validate data consistency
Report the number of genes and samples for each species

Step 2: Generate BBH Files

Generate Bidirectional Best Hits (BBH) files for ortholog detection between all strain pairs.

# Generate BBH files using multiple threads
multiModulon.generate_BBH(output_path, threads=8)

This step:

Extracts protein sequences from genome files
Runs BLAST between all species pairs
Identifies reciprocal best hits
Saves results as CSV files

Step 3: Align Genes Across Strains

Create a unified gene database by aligning genes across all strains using the BBH results.

# Align genes across all strains
combined_gene_db = multiModulon.align_genes(
    input_bbh_dir=output_bbh_path,
    output_dir=output_gene_info_path,
    reference_order=['Species1', 'Species2', 'Species3'],  # optional
    bbh_threshold=90  # optional: minimum percent identity
)

The alignment process:

Groups orthologous genes using Union-Find algorithm
Creates gene families across species
Generates a combined gene database
Saves the alignment for future use

Step 4: Create Gene Tables

Parse GFF files to create gene annotation tables for each strain.

# Create gene tables from GFF files
multiModulon.create_gene_table()

# Optional: Add eggNOG annotations
multiModulon.add_eggnog_annotation(eggnog_output_path)

This enriches gene information with:

Gene names and products
COG categories
Functional annotations

Step 5: Generate Aligned Expression Matrices

Create expression matrices with consistent gene indexing across all strains.

# Generate aligned expression matrices
multiModulon.generate_X(gene_info_folder_path)

This step:

Aligns expression matrices based on gene families
Handles missing genes with NaN values
Reports dimensions and recommendations

Step 6: Optimize Number of Core Components

Use Cohen’s d effect size metric to automatically determine the optimal number of core components.

# Optimize number of core components
optimal_num_core_components = multiModulon.optimize_number_of_core_components(
    metric='effect_size',       # Use Cohen's d effect size
    step=5,                     # Test k = 5, 10, 15, 20, ...
    save_path=output_dir,       # Save plots
    fig_size=(7, 5),           # Figure size
)

The optimization:

Tests different numbers of core components
Evaluates using Cohen’s d effect size
Selects optimal k based on interpretability
Saves optimization plots

Step 7: Optimize Number of Unique Components

Determine the optimal number of unique (species-specific) components for each strain.

# Optimize unique components for each species
optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=optimal_num_core_components,
    step=5,
    save_path=output_dir,
    fig_size=(7, 5),
    effect_size_threshold=1
)

This process:

Tests different numbers of unique components per species
Evaluates component quality using effect size
Returns optimal numbers for each species

Step 8: Run Robust Multi-view ICA

Perform robust multi-view ICA with multiple runs and clustering to identify consistent components.

# Run robust multi-view ICA
M_matrices, A_matrices = multiModulon.run_robust_multiview_ica(
    a=optimal_total,                 # Total components per species
    c=optimal_num_core_components,   # Number of core components
    num_runs=20,                     # Number of runs for robustness
    effect_size_threshold_core=3,    # Cohen's d threshold for core
    effect_size_threshold_unique=1,  # Cohen's d threshold for unique
    seed=42                          # Random seed
)

The robust ICA:

Runs ICA multiple times with different initializations
Clusters components across runs
Selects consistent components
Generates final M (gene weights) and A (activities) matrices

Step 9: Optimize M Matrix Thresholds

Calculate thresholds for binarizing the M matrices using Otsu’s method.

# Optimize thresholds for each component
multiModulon.optimize_M_thresholds(
    method="Otsu's method",
    quantile_threshold=95
)

This creates:

Component-specific thresholds
Binarized presence matrices
Statistics on genes per component

Step 10: Save Results

Save the complete MultiModulon object for future use.

# Save to compressed JSON format
multiModulon.save_to_json_multimodulon("multiModulon_results.json.gz")

# Load saved object
multiModulon = MultiModulon.load_json_multimodulon("multiModulon_results.json.gz")

Summary

This workflow covers the complete pipeline from raw data to interpretable multi-species regulatory modules:

Data loading and validation
Ortholog detection via BBH
Gene alignment across species
Expression matrix alignment
Component number optimization
Robust multi-view ICA
Threshold optimization
Results storage

The output includes:

Core components conserved across species
Unique components specific to each species
Gene membership for each component
Activity profiles across samples
Optimized thresholds for interpretation