Gene Alignment

This section covers gene alignment across multiple species using Bidirectional Best Hits (BBH) and the creation of a unified gene database.

Overview

Gene alignment is a crucial step that enables multi-species analysis by:

Identifying orthologous genes across species
Creating a unified gene indexing system
Aligning expression matrices for multi-view ICA

The process involves two main steps:

BBH Generation: Find ortholog pairs between all species pairs
Gene Alignment: Group orthologs into gene families using Union-Find

Generating BBH

Basic BBH Generation

MultiModulon.generate_BBH(output_path='Output_BBH', threads=1)

Generate Bidirectional Best Hits between all species pairs.

Parameters:

output_path (str) – Directory to save BBH results (default: “Output_BBH”)
threads (int) – Number of threads for BLAST (default: 1)

Example:

# Generate BBH with multiple threads
multiModulon.generate_BBH(
    output_path="Output_BBH",
    threads=8
)

# Check results
print(f"BBH files generated in Output_BBH/")

How BBH Works

Protein sequences: Uses existing protein.faa files from each strain’s ref_genome directory
Gene ID mapping: Creates mappings from protein IDs (NCBI_GP) to locus_tags using genomic.gff files
- Standard strains: Uses locus_tag attribute from GFF files
- W3110 strain: Extracts JW-numbers (e.g., JW4367) from Note field format “ECK0001:JW4367:b0001”
- Warnings are printed when fallback extraction methods are used, including the strain name
BLAST search: Runs all-vs-all BLAST comparisons between species
Best hits: Identifies reciprocal best hits
Output format: CSV files with locus_tag-to-locus_tag mappings containing:
- gene: Query gene locus_tag
- subject: Subject gene locus_tag
- PID: Percent identity
- alnLength: Alignment length
- mismatchCount: Number of mismatches
- gapOpenCount: Number of gap openings
- queryStart/queryEnd: Query alignment coordinates
- subjectStart/subjectEnd: Subject alignment coordinates
- eVal: E-value
- bitScore: Bit score
- gene_length: Query gene length
- COV: Coverage (alnLength/gene_length)
- BBH: Bidirectional best hit indicator (“<=>”)

Requirements

For BBH generation, each species needs either:

Option 1: genome.fna and genome.gff files
Option 2: Pre-computed protein.faa file

Docker Support

BBH uses containerized BLAST for reproducibility:

# The default Docker image is used automatically
# To use a different BLAST version:
from multimodulon.utils.bbh import BBHAnalyzer

analyzer = BBHAnalyzer(
    docker_image="quay.io/biocontainers/blast:2.16.0--h66d330f_5"
)

Aligning Genes

After BBH generation, align genes across species:

MultiModulon.align_genes(input_bbh_dir='Output_BBH', output_dir='Output_Gene_Info', reference_order=None, bbh_threshold=None)

Align genes across all species using Union-Find algorithm.

Parameters:

input_bbh_dir (str) – Directory containing BBH CSV files
output_dir (str) – Directory to save combined gene database
reference_order (list) – Order of species columns in output (optional)
bbh_threshold (float) – Minimum percent identity for BBH (optional)

Returns:

Combined gene database DataFrame

Return type:

pd.DataFrame

Example:

# Basic alignment
combined_gene_db = multiModulon.align_genes()

# With custom parameters
combined_gene_db = multiModulon.align_genes(
    input_bbh_dir="Output_BBH",
    output_dir="Output_Gene_Info",
    reference_order=['Species1', 'Species2', 'Species3'],
    bbh_threshold=90  # 90% identity threshold
)

# Examine the results
print(combined_gene_db.head())

Combined Gene Database Format

The output is a DataFrame where:

Rows: Gene families (groups of orthologs)
Columns: Species names
Values: Gene IDs for each species (NaN if absent)

Example output:

Species1    Species2    Species3    row_label
gene001     geneA_001   locus_001   gene001
gene002     geneA_002   NaN         gene002
NaN         geneA_003   locus_003   geneA_003
NaN         NaN         locus_004   locus_004

Union-Find Algorithm

The alignment uses Union-Find to group genes:

Each gene starts in its own group
BBH relationships merge groups
Transitive closure creates gene families
Result: Genes in same family are orthologs

Filtering Options

Control alignment stringency:

# Strict alignment - high identity threshold
strict_db = multiModulon.align_genes(bbh_threshold=80)

# Permissive alignment - lower threshold
permissive_db = multiModulon.align_genes(bbh_threshold=30)

# Check alignment statistics
print(f"Strict: {strict_db.notna().sum().sum()} genes aligned")
print(f"Permissive: {permissive_db.notna().sum().sum()} genes aligned")

Generating Expression Matrices

After alignment, generate aligned expression matrices:

MultiModulon.generate_X(gene_info_folder)

Generate X matrices with consistent row indices based on combined_gene_db.

Parameters:: gene_info_folder (str) – Path to folder containing combined_gene_db.csv

Example:

# Generate aligned expression matrices
multiModulon.generate_X("Output_Gene_Info")

# Access aligned matrices
for species in multiModulon.species:
    X = multiModulon[species].X
    print(f"{species}: {X.shape}")
    print(f"Missing genes: {X.isna().sum().sum()}")

The aligned matrices have:

Consistent row order: Same gene families in same positions
Ready for ICA: Can be directly used for multi-view ICA

Quality Control

Check alignment quality:

# Alignment statistics
total_families = len(combined_gene_db)

# Genes per species
for species in combined_gene_db.columns:
    gene_count = combined_gene_db[species].notna().sum()
    print(f"{species}: {gene_count} genes")

# Core genes (present in all species)
core_genes = combined_gene_db.notna().all(axis=1).sum()
print(f"Core genes: {core_genes}")

# Species-specific genes
for species in combined_gene_db.columns:
    specific = (
        combined_gene_db[species].notna() &
        combined_gene_db.drop(columns=species).isna().all(axis=1)
    ).sum()
    print(f"{species}-specific: {specific}")

Advanced Usage

Custom Gene Grouping

For custom ortholog definitions:

# Load your own ortholog mappings
custom_orthologs = pd.read_csv("custom_orthologs.csv")

# Create combined gene database manually
from multimodulon.gene_alignment import create_combined_gene_db
combined_db = create_combined_gene_db(
    custom_orthologs,
    species_list=multiModulon.species
)

Next Steps

After gene alignment:

Optimization of Dimensions - Optimize component numbers
Robust Multi-view ICA - Run multi-view ICA
Visualization of iModulons - Visualize aligned components