Data Preparation

This guide explains how to prepare your data for MultiModulon analysis.

Directory Structure

MultiModulon expects data organized in the directory structure below, which matches the output of MAPPED (https://github.com/Gaoyuan-Li/MAPPED):

Input_Data/
├── Species1/
│   ├── samplesheet/
│   │   └── sample_sheet.csv      # Sample metadata (required)
│   ├── expression_matrices/
│   │   ├── log_tpm.csv           # Expression matrix (required)
│   │   ├── log_tpm_norm.csv      # Normalized expression (optional)
│   │   ├── counts.csv            # Counts matrix (optional)
│   │   └── tpm.csv               # TPM matrix (optional)
│   └── ref_genome/
│       ├── genome.fna            # Genome sequence (required)
│       ├── genome.gff            # Gene annotations (required)
│       └── protein.faa           # Protein sequences (required)
├── Species2/
│   └── ... (same structure)
└── Species3/
    └── ... (same structure)
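
Before starting an analysis it can help to confirm that every species folder follows this layout. The following is a minimal sketch using only the Python standard library. The names REQUIRED_FILES, missing_required_files, and check_input_data are illustrative helpers, not part of MultiModulon.

```python
from pathlib import Path

# Relative paths the tree above describes as needed for each species.
# log_tpm_norm.csv is deliberately left out: the Optional Files section
# says log_tpm.csv is used when the normalized matrix is absent.
REQUIRED_FILES = [
    "samplesheet/sample_sheet.csv",
    "expression_matrices/log_tpm.csv",
    "ref_genome/genome.fna",
    "ref_genome/genome.gff",
    "ref_genome/protein.faa",
]


def missing_required_files(species_dir):
    """Return the required relative paths that are absent from species_dir."""
    root = Path(species_dir)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


def check_input_data(input_dir):
    """Map each species folder name to its list of missing required files."""
    return {
        folder.name: missing_required_files(folder)
        for folder in sorted(Path(input_dir).iterdir())
        if folder.is_dir()
    }
```

Calling check_input_data("Input_Data") returns a dictionary such as {"Species1": [], ...}. An empty list means nothing is missing for that species.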

Required Files

Expression Matrix (log_tpm.csv)

Format: CSV file with genes as rows and samples as columns
Values: Log-transformed TPM (Transcripts Per Million) values
Index: Gene identifiers (must match gene_table if provided)

Example:

gene_id,Sample1,Sample2,Sample3
gene001,5.2,4.8,5.1
gene002,0.3,0.5,0.2
gene003,7.1,7.3,6.9
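
To make the expected shape concrete, here is a sketch that parses the example above with the standard csv module. The function name read_expression_matrix is hypothetical. With pandas installed, pandas.read_csv(path, index_col=0) gives the same genes-by-samples layout as a DataFrame.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a genes-by-samples CSV into (sample_names, {gene_id: [values]})."""
    reader = csv.reader(handle)
    header = next(reader)
    samples = header[1:]  # the first column holds the gene identifiers
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_id, values = row[0], row[1:]
        if gene_id in matrix:
            raise ValueError(f"duplicate gene identifier: {gene_id}")
        if len(values) != len(samples):
            raise ValueError(
                f"{gene_id}: expected {len(samples)} values, got {len(values)}"
            )
        matrix[gene_id] = [float(v) for v in values]
    return samples, matrix


example = io.StringIO(
    "gene_id,Sample1,Sample2,Sample3\n"
    "gene001,5.2,4.8,5.1\n"
    "gene002,0.3,0.5,0.2\n"
    "gene003,7.1,7.3,6.9\n"
)
samples, matrix = read_expression_matrix(example)
print(samples)            # ['Sample1', 'Sample2', 'Sample3']
print(matrix["gene002"])  # [0.3, 0.5, 0.2]
```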

Sample Sheet (sample_sheet.csv)

Format: CSV file with samples as rows
Required columns: None. The index (first column) must match the column names of the expression matrix.
Recommended columns (include each only when the information is available):
- condition: Experimental condition
- project: Project or study name
- biological_replicate: Replicate number (1, 2, 3, etc.)
- study_accession: Study identifier (e.g., from GEO)
- sample_description: Brief description

Example:

sample_id,condition,project,biological_replicate
Sample1,Control,ProjectA,1
Sample2,Control,ProjectA,2
Sample3,Treatment,ProjectA,1
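
The requirement that the sample sheet index matches the expression matrix columns can be checked in a few lines. This sketch assumes the sample identifiers sit in the first column of the sample sheet, as in the example above. The helper names are illustrative only.

```python
import csv
import io


def sheet_sample_ids(handle):
    """Return the first-column values of a sample sheet, skipping the header."""
    reader = csv.reader(handle)
    next(reader)  # header row
    return [row[0] for row in reader if row]


def matrix_sample_names(handle):
    """Return the sample names from the header row of an expression matrix."""
    return next(csv.reader(handle))[1:]


def compare_samples(sheet_ids, matrix_samples):
    """Return (only_in_sheet, only_in_matrix) as sorted lists."""
    sheet, matrix = set(sheet_ids), set(matrix_samples)
    return sorted(sheet - matrix), sorted(matrix - sheet)


sheet = io.StringIO(
    "sample_id,condition,project,biological_replicate\n"
    "Sample1,Control,ProjectA,1\n"
    "Sample2,Control,ProjectA,2\n"
    "Sample3,Treatment,ProjectA,1\n"
)
matrix = io.StringIO("gene_id,Sample1,Sample2,Sample3\ngene001,5.2,4.8,5.1\n")

only_in_sheet, only_in_matrix = compare_samples(
    sheet_sample_ids(sheet), matrix_sample_names(matrix)
)
print(only_in_sheet, only_in_matrix)  # [] []
```

Two empty lists mean every sample appears in both files.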

Optional Files

Gene Table (gene_table.csv)

Format: CSV file with genes as rows
Index: Must match expression matrix gene identifiers
Useful columns:
- gene_name: Human-readable gene name
- product: Gene product description
- COG: COG category
- start, end, strand: Genomic coordinates
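
As a sketch of the index requirement, the snippet below reads a gene table keyed by its first column and lists expression-matrix genes that have no annotation row. The function names and the annotation values in the example are invented for illustration.

```python
import csv
import io


def read_gene_table(handle):
    """Return {gene_id: {column: value}}, keyed by the table's first column."""
    reader = csv.DictReader(handle)
    index_col = reader.fieldnames[0]
    return {
        row[index_col]: {k: v for k, v in row.items() if k != index_col}
        for row in reader
    }


def unannotated_genes(expression_genes, gene_table):
    """Return expression-matrix genes that are missing from the gene table."""
    return [gene for gene in expression_genes if gene not in gene_table]


# Made-up annotation rows, for illustration only.
table = read_gene_table(io.StringIO(
    "gene_id,gene_name,product,COG,start,end,strand\n"
    "gene001,abcA,transporter,P,100,1300,+\n"
    "gene002,xyzB,hypothetical protein,S,1400,2000,-\n"
))
print(unannotated_genes(["gene001", "gene002", "gene003"], table))  # ['gene003']
```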

Normalized Expression (log_tpm_norm.csv)

Pre-normalized expression matrix
Same format as log_tpm.csv
If this file is not provided, log_tpm.csv is used directly
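
If your own scripts need to follow the same rule, the fallback can be mirrored as below. This is a sketch of the behavior described here, not MultiModulon's internal code. The function name pick_expression_file is hypothetical, and the paths assume the directory structure shown earlier.

```python
from pathlib import Path


def pick_expression_file(species_dir):
    """Prefer log_tpm_norm.csv and fall back to log_tpm.csv when it is absent."""
    matrices = Path(species_dir) / "expression_matrices"
    normalized = matrices / "log_tpm_norm.csv"
    return normalized if normalized.is_file() else matrices / "log_tpm.csv"
```

For example, pick_expression_file("Input_Data/Species1") returns the path to the normalized matrix only if that file exists.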

Files for BBH Analysis

To align genes across species using bidirectional best hits (BBH), you need:

genome.fna: Genome sequence in FASTA format
genome.gff: Gene annotations in GFF3 format
protein.faa: Protein sequences in FASTA format

Data Quality Checklist

Before running analysis, ensure:

✓ Sample names are consistent between expression matrix and sample sheet
✓ Gene identifiers are consistent across all files
✓ The expression matrix has no missing values (or they are handled before analysis)
✓ Biological replicates, when available, are numbered distinctly (1, 2, 3, not 1, 1, 2)
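
The last two checks can be automated with a short script. The sketch below uses only the standard library, and both function names are hypothetical. Grouping replicates by project and condition is an assumption about what counts as one replicate series, so adjust group_cols to fit your sample sheet.

```python
import csv
import io
import math
from collections import defaultdict


def cells_with_missing_values(handle):
    """Return (gene_id, sample) pairs whose value is empty, non-numeric, or NaN."""
    reader = csv.reader(handle)
    samples = next(reader)[1:]
    bad = []
    for row in reader:
        if not row:
            continue
        values = row[1:]
        values += [""] * (len(samples) - len(values))  # pad short rows
        for sample, raw in zip(samples, values):
            try:
                value = float(raw)
            except ValueError:
                bad.append((row[0], sample))
                continue
            if math.isnan(value):
                bad.append((row[0], sample))
    return bad


def duplicated_replicates(handle, group_cols=("project", "condition")):
    """Return {(group, replicate): [sample ids]} for replicate numbers used twice."""
    reader = csv.DictReader(handle)
    id_col = reader.fieldnames[0]
    seen = defaultdict(list)
    for row in reader:
        group = tuple(row.get(col, "") for col in group_cols)
        seen[(group, row["biological_replicate"])].append(row[id_col])
    return {key: ids for key, ids in seen.items() if len(ids) > 1}


matrix = io.StringIO("gene_id,Sample1,Sample2\ngene001,5.2,\ngene002,0.3,0.5\n")
print(cells_with_missing_values(matrix))  # [('gene001', 'Sample2')]

sheet = io.StringIO(
    "sample_id,condition,project,biological_replicate\n"
    "Sample1,Control,ProjectA,1\n"
    "Sample2,Control,ProjectA,1\n"
    "Sample3,Treatment,ProjectA,1\n"
)
print(duplicated_replicates(sheet))
```

Here the second call reports Sample1 and Sample2, which share replicate number 1 within the same project and condition.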