Data Preparation
This guide explains how to prepare your data for MultiModulon analysis.
Directory Structure
MultiModulon expects data organized in the following directory structure, which matches the output of MAPPED (https://github.com/Gaoyuan-Li/MAPPED):
Input_Data/
├── Species1/
│   ├── samplesheet/
│   │   └── sample_sheet.csv        # Sample metadata (required)
│   ├── expression_matrices/
│   │   ├── log_tpm.csv             # Expression matrix (required)
│   │   ├── log_tpm_norm.csv        # Normalized expression (required)
│   │   ├── counts.csv              # Counts matrix (optional)
│   │   └── tpm.csv                 # TPM matrix (optional)
│   └── ref_genome/
│       ├── genome.fna              # Genome sequence (required)
│       ├── genome.gff              # Gene annotations (required)
│       └── protein.faa             # Protein sequences (required)
├── Species2/
│   └── ... (same structure)
└── Species3/
    └── ... (same structure)
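Before running an analysis, it can help to confirm that every species folder contains the files marked as required. The following is a minimal sketch using only the Python standard library; `REQUIRED_FILES` and `find_missing_files` are hypothetical names chosen for this illustration and are not part of the MultiModulon API.

```python
from pathlib import Path

# Relative paths marked "required" in the directory layout (illustrative list).
REQUIRED_FILES = [
    "samplesheet/sample_sheet.csv",
    "expression_matrices/log_tpm.csv",
    "expression_matrices/log_tpm_norm.csv",
    "ref_genome/genome.fna",
    "ref_genome/genome.gff",
    "ref_genome/protein.faa",
]


def find_missing_files(input_dir):
    """Return {species_name: [missing relative paths]} for each species folder."""
    missing = {}
    for species_dir in sorted(Path(input_dir).iterdir()):
        if not species_dir.is_dir():
            continue  # ignore stray files next to the species folders
        absent = [rel for rel in REQUIRED_FILES if not (species_dir / rel).is_file()]
        if absent:
            missing[species_dir.name] = absent
    return missing
```

Calling `find_missing_files("Input_Data")` returns an empty dictionary when every species folder is complete; otherwise each key names a species and each value lists the files that still need to be added.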
Required Files
Expression Matrix (log_tpm.csv)
Format: CSV file with genes as rows and samples as columns
Values: Log-transformed TPM (Transcripts Per Million) values
Index: Gene identifiers (must match gene_table if provided)
Example:
gene_id,Sample1,Sample2,Sample3
gene001,5.2,4.8,5.1
gene002,0.3,0.5,0.2
gene003,7.1,7.3,6.9
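A quick way to check that a file follows this layout is to parse it and confirm that every gene has one numeric value per sample. The sketch below uses only the standard library and the example rows shown above; `read_expression_matrix` is a hypothetical helper written for this guide, not a MultiModulon function.

```python
import csv
import io


def read_expression_matrix(handle):
    """Parse a genes-by-samples CSV into (sample_ids, {gene_id: [values]})."""
    reader = csv.reader(handle)
    header = next(reader)
    sample_ids = header[1:]  # the first column holds the gene identifiers
    matrix = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        gene_id, values = row[0], row[1:]
        if gene_id in matrix:
            raise ValueError(f"duplicate gene identifier: {gene_id}")
        if len(values) != len(sample_ids):
            raise ValueError(
                f"{gene_id}: expected {len(sample_ids)} values, got {len(values)}"
            )
        matrix[gene_id] = [float(v) for v in values]  # raises on non-numeric cells
    return sample_ids, matrix


example = io.StringIO(
    "gene_id,Sample1,Sample2,Sample3\n"
    "gene001,5.2,4.8,5.1\n"
    "gene002,0.3,0.5,0.2\n"
    "gene003,7.1,7.3,6.9\n"
)
samples, matrix = read_expression_matrix(example)
```

For a real file, pass an open file handle instead of the in-memory example, for instance `open("log_tpm.csv", newline="")`.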
Sample Sheet (sample_sheet.csv)
Format: CSV file with samples as rows
Required columns: None, but the index (the first column, sample_id in the example below) must match the column names of the expression matrix
Recommended columns (include each only when available):
condition: Experimental condition
project: Project or study name
biological_replicate: Replicate number (1, 2, 3, etc.)
study_accession: Study identifier (e.g., from GEO)
sample_description: Brief description
Example:
sample_id,condition,project,biological_replicate
Sample1,Control,ProjectA,1
Sample2,Control,ProjectA,2
Sample3,Treatment,ProjectA,1
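Because the sample sheet index has to match the expression matrix columns, a small consistency check can catch mismatched or misspelled sample names early. This is a sketch under the assumption that the first column of each file holds the identifiers, as in the examples; `compare_sample_ids` is a hypothetical helper, not part of MultiModulon.

```python
import csv
import io


def compare_sample_ids(sample_sheet, expression_matrix):
    """Return (ids only in the sample sheet, ids only in the expression matrix)."""
    sheet_rows = list(csv.reader(sample_sheet))
    sheet_ids = [row[0] for row in sheet_rows[1:] if row]  # skip the header row
    matrix_ids = next(csv.reader(expression_matrix))[1:]   # skip the gene_id column
    only_in_sheet = sorted(set(sheet_ids) - set(matrix_ids))
    only_in_matrix = sorted(set(matrix_ids) - set(sheet_ids))
    return only_in_sheet, only_in_matrix


sheet = io.StringIO(
    "sample_id,condition,project,biological_replicate\n"
    "Sample1,Control,ProjectA,1\n"
    "Sample2,Control,ProjectA,2\n"
    "Sample3,Treatment,ProjectA,1\n"
)
matrix = io.StringIO("gene_id,Sample1,Sample2,Sample3\ngene001,5.2,4.8,5.1\n")
only_in_sheet, only_in_matrix = compare_sample_ids(sheet, matrix)
```

Both returned lists are empty when the two files describe the same set of samples.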
Normalized Expression (log_tpm_norm.csv)
Pre-normalized expression matrix
Same format as log_tpm.csv
If this file is not provided, log_tpm.csv is used directly
Files for BBH Analysis and Gene Annotation
To align genes across species (bidirectional best hit, BBH, analysis) and add gene annotations, you need:
genome.fna: Genome sequence in FASTA format
genome.gff: Gene annotations in GFF3 format
protein.faa: Protein sequences in FASTA format
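A light sanity check on these reference files can rule out obviously malformed input. The sketch below relies only on two general format facts: each FASTA record starts with a `>` header line, and each GFF3 feature line has nine tab-separated columns. Both function names are hypothetical and written for this guide; the check is not a full validator.

```python
def count_fasta_records(lines):
    """Count sequences in FASTA text: each record starts with a '>' header line."""
    return sum(1 for line in lines if line.startswith(">"))


def malformed_gff_lines(lines):
    """Return 1-based numbers of feature lines lacking nine tab-separated columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip("\n")
        if stripped.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not stripped or stripped.startswith("#"):
            continue  # blank lines, comments, and directives carry no feature
        if len(stripped.split("\t")) != 9:
            bad.append(number)
    return bad
```

Both functions accept any iterable of lines, so an open file handle can be passed directly, for example `count_fasta_records(open("protein.faa"))`.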