MultiModulon Worflow

This notebook demonstrates the first step for multi-species/strain/modality analysis using the MultiModulon package.

Step 1: Initialize MultiModulon

Load data from the Input_Data directory containing expression matrices, gene annotations, and sample metadata for all strains.

# Import required libraries
from multimodulon import MultiModulon
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)

# Path to the Input_Data folder
input_data_path = './Input_Data'

# Initialize MultiModulon object
multiModulon = MultiModulon(input_data_path)

Step 2: Create Gene Tables

Parse GFF files to create gene annotation tables for each strain.

# Create gene tables from GFF files
print("Creating gene tables from GFF files...")
multiModulon.create_gene_table()

multiModulon.add_eggnog_annotation("./Output_eggnog_mapper")

Step 3: Generate BBH Files

Generate Bidirectional Best Hits (BBH) files for ortholog detection between all strain pairs.

# Generate BBH files using multiple threads for faster computation
output_bbh_path = './Output_BBH'

multiModulon.generate_BBH(output_bbh_path, threads=16)

Step 4: Align Genes Across Strains

Create a unified gene database by aligning genes across all strains using the BBH results.

# Align genes across all strains
output_gene_info_path = './Output_Gene_Info'

combined_gene_db = multiModulon.align_genes(
    input_bbh_dir=output_bbh_path,
    output_dir=output_gene_info_path,
    reference_order=['MG1655', 'BL21', 'W3110'],  # optional: specify order
    # bbh_threshold=90  # optional: minimum percent identity threshold
)

combined_gene_db.head()

Step 5: Generate Aligned Expression Matrices

Create expression matrices with consistent gene indexing across all strains for multi-view ICA.

# Generate aligned expression matrices
print("Generating aligned expression matrices...")
multiModulon.generate_X(output_gene_info_path)

# The output shows aligned X matrices and dimension recommendations

Step 6: Optimize Number of Core Components

Determine the optimal number of core components.

# Optimize number of core components
print("Optimizing number of core components...")
print("This will test different values of k and find the optimal number.")

optimal_num_core_components = multiModulon.optimize_number_of_core_components(
    step=10,                        # Test k = 5, 10, 15, 20, ...
    save_path='./Output_Optimization_Figures',  # Save plots to directory
    fig_size=(7, 5),                # Figure size
    num_runs_per_dimension=10,
    seed=10
)
Core component optimization figure

Core component optimization result.

Step 7: Optimize Number of Unique Components

Determine the optimal number of unique (species-specific) components for each strain.

# Optimize unique components for each species
print("Optimizing unique components per species...")
print("This will test different numbers of unique components for each species.\n")

optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
    optimal_num_core_components=optimal_num_core_components,
    step=10,
    save_path='./Output_Optimization_Figures',
    fig_size=(7, 5),
    num_runs_per_dimension=10,
    seed=10
)
Unique component optimization for MG1655

Unique component optimization result for MG1655.

Unique component optimization for BL21

Unique component optimization result for BL21.

Unique component optimization for W3110

Unique component optimization result for W3110.

optimal_num_core_components

optimal_total

Step 8: Run Robust Multi-view ICA

Perform robust multi-view ICA with multiple runs and clustering to identify consistent components.

# Run robust multi-view ICA
print("Running robust multi-view ICA with clustering...")
print("This performs multiple ICA runs and clusters the results for robustness.\n")

M_matrices, A_matrices = multiModulon.run_robust_multiview_ica(
    a=optimal_total,                 # Dictionary of total components per species
    c=optimal_num_core_components,   # Number of core components
    num_runs=10,                     # Number of runs for robustness
    seed=100                         # Random seed for reproducibility
)

Step 9: Optimize thresholds to binarize the M matrices

Use Otsu’s method to calculate thresholds for each component in M matrices across all species.

multiModulon.optimize_M_thresholds(method="Otsu's method", quantile_threshold=95)

Step 10: Save the multiModulon object to json

Save the multiModulon object to json in the given path and file name.

multiModulon.save_to_json_multimodulon("./multiModulon_E_coli_comparison_demo.json.gz")

for i in multiModulon.species:
    print(i, " : ", multiModulon[i].M.shape)