Gene Alignment
This section covers gene alignment across multiple species using Bidirectional Best Hits (BBH) and the creation of a unified gene database.
Overview
Gene alignment is a crucial step that enables multi-species analysis by:
Identifying orthologous genes across species
Creating a unified gene indexing system
Aligning expression matrices for multi-view ICA
The process involves two main steps:
BBH Generation: Find ortholog pairs between all species pairs
Gene Alignment: Group orthologs into gene families using Union-Find
Generating BBH
Basic BBH Generation
- MultiModulon.generate_BBH(output_path='Output_BBH', threads=1)
Generate Bidirectional Best Hits between all species pairs.
- Parameters:
Example:
# Generate BBH with multiple threads multiModulon.generate_BBH( output_path="Output_BBH", threads=8 ) # Check results print(f"BBH files generated in Output_BBH/")
How BBH Works
Protein sequences: Uses existing protein.faa files from each strain’s ref_genome directory
Gene ID mapping: Creates mappings from protein IDs (NCBI_GP) to locus_tags using genomic.gff files
Standard strains: Uses
locus_tagattribute from GFF filesW3110 strain: Extracts JW-numbers (e.g., JW4367) from Note field format “ECK0001:JW4367:b0001”
Warnings are printed when fallback extraction methods are used, including the strain name
BLAST search: Runs all-vs-all BLAST comparisons between species
Best hits: Identifies reciprocal best hits
Output format: CSV files with locus_tag-to-locus_tag mappings containing:
gene: Query gene locus_tag
subject: Subject gene locus_tag
PID: Percent identity
alnLength: Alignment length
mismatchCount: Number of mismatches
gapOpenCount: Number of gap openings
queryStart/queryEnd: Query alignment coordinates
subjectStart/subjectEnd: Subject alignment coordinates
eVal: E-value
bitScore: Bit score
gene_length: Query gene length
COV: Coverage (alnLength/gene_length)
BBH: Bidirectional best hit indicator (“<=>”)
Requirements
For BBH generation, each species needs either:
Option 1:
genome.fnaandgenome.gfffilesOption 2: Pre-computed
protein.faafile
Docker Support
BBH uses containerized BLAST for reproducibility:
# The default Docker image is used automatically
# To use a different BLAST version:
from multimodulon.utils.bbh import BBHAnalyzer
analyzer = BBHAnalyzer(
docker_image="quay.io/biocontainers/blast:2.16.0--h66d330f_5"
)
Aligning Genes
After BBH generation, align genes across species:
- MultiModulon.align_genes(input_bbh_dir='Output_BBH', output_dir='Output_Gene_Info', reference_order=None, bbh_threshold=None)
Align genes across all species using Union-Find algorithm.
- Parameters:
- Returns:
Combined gene database DataFrame
- Return type:
pd.DataFrame
Example:
# Basic alignment combined_gene_db = multiModulon.align_genes() # With custom parameters combined_gene_db = multiModulon.align_genes( input_bbh_dir="Output_BBH", output_dir="Output_Gene_Info", reference_order=['Species1', 'Species2', 'Species3'], bbh_threshold=90 # 90% identity threshold ) # Examine the results print(combined_gene_db.head())
Combined Gene Database Format
The output is a DataFrame where:
Rows: Gene families (groups of orthologs)
Columns: Species names
Values: Gene IDs for each species (NaN if absent)
Example output:
Species1 Species2 Species3 row_label
gene001 geneA_001 locus_001 gene001
gene002 geneA_002 NaN gene002
NaN geneA_003 locus_003 geneA_003
NaN NaN locus_004 locus_004
Union-Find Algorithm
The alignment uses Union-Find to group genes:
Each gene starts in its own group
BBH relationships merge groups
Transitive closure creates gene families
Result: Genes in same family are orthologs
Filtering Options
Control alignment stringency:
# Strict alignment - high identity threshold
strict_db = multiModulon.align_genes(bbh_threshold=80)
# Permissive alignment - lower threshold
permissive_db = multiModulon.align_genes(bbh_threshold=30)
# Check alignment statistics
print(f"Strict: {strict_db.notna().sum().sum()} genes aligned")
print(f"Permissive: {permissive_db.notna().sum().sum()} genes aligned")
Generating Expression Matrices
After alignment, generate aligned expression matrices:
- MultiModulon.generate_X(gene_info_folder)
Generate X matrices with consistent row indices based on combined_gene_db.
- Parameters:
gene_info_folder (str) – Path to folder containing combined_gene_db.csv
Example:
# Generate aligned expression matrices multiModulon.generate_X("Output_Gene_Info") # Access aligned matrices for species in multiModulon.species: X = multiModulon[species].X print(f"{species}: {X.shape}") print(f"Missing genes: {X.isna().sum().sum()}")
The aligned matrices have:
Consistent row order: Same gene families in same positions
Ready for ICA: Can be directly used for multi-view ICA
Quality Control
Check alignment quality:
# Alignment statistics
total_families = len(combined_gene_db)
# Genes per species
for species in combined_gene_db.columns:
gene_count = combined_gene_db[species].notna().sum()
print(f"{species}: {gene_count} genes")
# Core genes (present in all species)
core_genes = combined_gene_db.notna().all(axis=1).sum()
print(f"Core genes: {core_genes}")
# Species-specific genes
for species in combined_gene_db.columns:
specific = (
combined_gene_db[species].notna() &
combined_gene_db.drop(columns=species).isna().all(axis=1)
).sum()
print(f"{species}-specific: {specific}")
Advanced Usage
Custom Gene Grouping
For custom ortholog definitions:
# Load your own ortholog mappings
custom_orthologs = pd.read_csv("custom_orthologs.csv")
# Create combined gene database manually
from multimodulon.gene_alignment import create_combined_gene_db
combined_db = create_combined_gene_db(
custom_orthologs,
species_list=multiModulon.species
)
Next Steps
After gene alignment:
Optimization of Dimensions - Optimize component numbers
Robust Multi-view ICA - Run multi-view ICA
Visualization of iModulons - Visualize aligned components