Basic Workflow
==============

This example demonstrates the complete MultiModulon analysis workflow from data loading to component visualization.

Step 1: Initialize MultiModulon Object
--------------------------------------

Load data from the Input_Data directory containing expression matrices, gene annotations, and sample metadata for all strains.

.. code-block:: python

   from multimodulon import MultiModulon
   import pandas as pd
   import numpy as np
   
   # Initialize MultiModulon object
   multiModulon = MultiModulon(input_data_path)

The initialization will:

* Scan the input directory for species/strain subdirectories
* Load expression data (log_tpm.csv)
* Load sample metadata (sample_sheet.csv)
* Validate data consistency
* Report the number of genes and samples for each species

Step 2: Generate BBH Files
--------------------------

Generate Bidirectional Best Hits (BBH) files for ortholog detection between all strain pairs.

.. code-block:: python

   # Generate BBH files using multiple threads
   multiModulon.generate_BBH(output_path, threads=8)

This step:

* Extracts protein sequences from genome files
* Runs BLAST between all species pairs
* Identifies reciprocal best hits
* Saves results as CSV files

Step 3: Align Genes Across Strains
-----------------------------------

Create a unified gene database by aligning genes across all strains using the BBH results.

.. code-block:: python

   # Align genes across all strains
   combined_gene_db = multiModulon.align_genes(
       input_bbh_dir=output_bbh_path,
       output_dir=output_gene_info_path,
       reference_order=['Species1', 'Species2', 'Species3'],  # optional
       bbh_threshold=90  # optional: minimum percent identity
   )

The alignment process:

* Groups orthologous genes using Union-Find algorithm
* Creates gene families across species
* Generates a combined gene database
* Saves the alignment for future use

Step 4: Create Gene Tables
--------------------------

Parse GFF files to create gene annotation tables for each strain.

.. code-block:: python

   # Create gene tables from GFF files
   multiModulon.create_gene_table()
   
   # Optional: Add eggNOG annotations
   multiModulon.add_eggnog_annotation(eggnog_output_path)

This enriches gene information with:

* Gene names and products
* COG categories
* Functional annotations

Step 5: Generate Aligned Expression Matrices
--------------------------------------------

Create expression matrices with consistent gene indexing across all strains.

.. code-block:: python

   # Generate aligned expression matrices
   multiModulon.generate_X(gene_info_folder_path)

This step:

* Aligns expression matrices based on gene families
* Handles missing genes with NaN values
* Reports dimensions and recommendations

Step 6: Optimize Number of Core Components
-------------------------------------------

Use Cohen's d effect size metric to automatically determine the optimal number of core components.

.. code-block:: python

   # Optimize number of core components
   optimal_num_core_components = multiModulon.optimize_number_of_core_components(
       metric='effect_size',       # Use Cohen's d effect size
       step=5,                     # Test k = 5, 10, 15, 20, ...
       save_path=output_dir,       # Save plots
       fig_size=(7, 5),           # Figure size
   )

The optimization:

* Tests different numbers of core components
* Evaluates using Cohen's d effect size
* Selects optimal k based on interpretability
* Saves optimization plots

Step 7: Optimize Number of Unique Components
---------------------------------------------

Determine the optimal number of unique (species-specific) components for each strain.

.. code-block:: python

   # Optimize unique components for each species
   optimal_unique, optimal_total = multiModulon.optimize_number_of_unique_components(
       optimal_num_core_components=optimal_num_core_components,
       step=5,
       save_path=output_dir,
       fig_size=(7, 5),
       effect_size_threshold=1
   )

This process:

* Tests different numbers of unique components per species
* Evaluates component quality using effect size
* Returns optimal numbers for each species

Step 8: Run Robust Multi-view ICA
---------------------------------

Perform robust multi-view ICA with multiple runs and clustering to identify consistent components.

.. code-block:: python

   # Run robust multi-view ICA
   M_matrices, A_matrices = multiModulon.run_robust_multiview_ica(
       a=optimal_total,                 # Total components per species
       c=optimal_num_core_components,   # Number of core components
       num_runs=20,                     # Number of runs for robustness
       effect_size_threshold_core=3,    # Cohen's d threshold for core
       effect_size_threshold_unique=1,  # Cohen's d threshold for unique
       seed=42                          # Random seed
   )

The robust ICA:

* Runs ICA multiple times with different initializations
* Clusters components across runs
* Selects consistent components
* Generates final M (gene weights) and A (activities) matrices

Step 9: Optimize M Matrix Thresholds
------------------------------------

Calculate thresholds for binarizing the M matrices using Otsu's method.

.. code-block:: python

   # Optimize thresholds for each component
   multiModulon.optimize_M_thresholds(
       method="Otsu's method", 
       quantile_threshold=95
   )

This creates:

* Component-specific thresholds
* Binarized presence matrices
* Statistics on genes per component

Step 10: Save Results
---------------------

Save the complete MultiModulon object for future use.

.. code-block:: python

   # Save to compressed JSON format
   multiModulon.save_to_json_multimodulon("multiModulon_results.json.gz")
   
   # Load saved object
   multiModulon = MultiModulon.load_json_multimodulon("multiModulon_results.json.gz")

Summary
-------

This workflow covers the complete pipeline from raw data to interpretable multi-species regulatory modules:

1. Data loading and validation
2. Ortholog detection via BBH
3. Gene alignment across species
4. Expression matrix alignment
5. Component number optimization
6. Robust multi-view ICA
7. Threshold optimization
8. Results storage

The output includes:

* Core components conserved across species
* Unique components specific to each species
* Gene membership for each component
* Activity profiles across samples
* Optimized thresholds for interpretation