Cnnamon Documentation
A modular, end-to-end Python framework for DNA sequence-based Convolutional Neural Network (CNN) model training and explainability.
Overview
Cnnamon is designed to bridge the gap between deep learning model development and biological interpretation in genomics. While training high-performance models on DNA sequences is common, understanding what patterns the model learns is often the bottleneck. Cnnamon integrates data processing, model construction, and a suite of rich explainability tools into a unified ecosystem.
What to Expect
By using Cnnamon, researchers can expect to:
- Streamline Data Prep: Go from raw genomic intervals (BED files) and Reference Genomes (FASTA) to model-ready, one-hot encoded tensors with a single command.
- No-Code Model Building: Design complex 1D CNN architectures using simple JSON configuration files, ensuring reproducibility and easy sharing of model architectures.
- Biological Discovery: The framework is tailored to extract biological meaning from the "black box." You will be able to visualize Position Frequency Matrices (PFMs) of learned motifs, cluster them by similarity, and determine which transcription factors or patterns drive specific class predictions.
- Publication-Ready Outputs: Generate high-quality Sequence Logos, ROC curves, Confusion Matrices, and Heatmaps directly from the pipeline.
Modules
The library is organized into two primary namespaces:
1. cn.utility
Handles the "plumbing" of the deep learning pipeline. It includes PrepareData for sequence extraction and KerasModelBuilder for JSON-based training.
2. cn.CNN1D
The "Explainability Suite." This module provides post-hoc analysis tools:
- FilterVisualize: Convert filters to motifs (Logos) and export for TOMTOM.
- FilterImportance: Perturbation analysis to rank filters.
- FilterClustering: Group redundant filters.
- FilterEnrichment: Statistical tests for class specificity.
Installation
We highly recommend installing Cnnamon within a fresh virtual environment to ensure dependency compatibility.
Recommended Setup (Conda)
Create a new environment with Python 3.10 and install the package from PyPI. TensorFlow 2.15 is recommended for optimal compatibility.
# 1. Create a fresh environment
conda create -n cnnamon_env python=3.10
conda activate cnnamon_env
# 2. Install Cnnamon and TensorFlow
pip install cnnamon tensorflow==2.15
Requirements
- Python 3.10 (Recommended)
- bedtools (Required for sequence extraction, must be in system PATH)
Basic Workflow
This guide walks you through a complete analysis pipeline: from raw data to biological insights.
Step 1: Dataset Preparation
First, we convert genomic coordinates (BED) and a reference genome (FASTA) into one-hot encoded training tensors. The PrepareData module handles splitting (Train/Val/Test) and augmentation.
import cnnamon as cn

# Initialize data preparer
preparer = cn.utility.PrepareData(
    intervalfile="peaks.bed",        # Input BED file
    genomefasta="hg38.fa",           # Reference genome
    outdir="processed_data/",        # Output folder
    split_segmentation="random",     # Split strategy
    augment_RC=True                  # Augment with reverse complements
)

# Run extraction
train, test, val = preparer.run()
print(f"Training shape: {train['x'].shape}")
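Under the hood, one-hot encoding and reverse-complement augmentation are simple transforms. Here is a minimal, dependency-free sketch of both (illustrative only; `one_hot` and `reverse_complement` are not Cnnamon API names):

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def one_hot(seq):
    """Encode a DNA string as an (L, 4) list of 0/1 rows (order A, C, G, T).
    Ambiguous bases such as 'N' become all-zero rows."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

def reverse_complement(seq):
    """Return the reverse complement used for augment_RC-style augmentation."""
    return "".join(COMPLEMENT[b] for b in reversed(seq.upper()))

encoded = one_hot("ACGTN")
print(encoded[0])                   # A -> [1, 0, 0, 0]
print(reverse_complement("GATA"))   # -> TATC
```

The augmented sample keeps the same label as the original, since a motif occurrence on either strand carries the same biological signal.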
Step 2: Model Configuration (JSON)
Define your model architecture and training parameters in a JSON file. This ensures your experiments are reproducible.
Create a file named model_config.json with the following structure:
{
  "model": {
    "layers": [
      {
        "class_name": "Conv1D",
        "config": {
          "name": "motif_scanner",
          "filters": 12,
          "kernel_size": 20,
          "padding": "same",
          "use_bias": true,
          "input_shape": [300, 4],
          "activation": "leaky_relu"
        }
      },
      { "class_name": "MaxPooling1D", "config": { "pool_size": 2 } },
      { "class_name": "Dropout", "config": { "rate": 0.1 } },
      { "class_name": "Flatten", "config": {} },
      { "class_name": "Dense", "config": { "units": 32, "activation": "relu" } },
      { "class_name": "Dense", "config": { "units": 1, "activation": "sigmoid" } }
    ]
  },
  "compile": {
    "optimizer": {
      "class_name": "AdamW",
      "config": { "learning_rate": 0.001 }
    },
    "loss": "binary_crossentropy",
    "metrics": ["accuracy", "AUC"]
  },
  "training_params": {
    "epochs": 50,
    "batch_size": 64
  },
  "callbacks": {
    "EarlyStopping": {
      "monitor": "val_loss",
      "patience": 10,
      "restore_best_weights": true
    },
    "CSVLogger": {
      "filename": "training_log.csv"
    }
  }
}
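To make the JSON-to-model step less of a black box, here is a hedged sketch of the dispatch pattern such a builder typically uses: a registry maps each `class_name` to a constructor, and each layer's `config` dict is splatted as keyword arguments. `DummyLayer` stands in for real Keras layer classes; this is not Cnnamon's actual implementation.

```python
import json
from functools import partial

class DummyLayer:
    """Placeholder for a Keras layer: records its kind and config."""
    def __init__(self, kind, **config):
        self.kind = kind
        self.config = config

# Registry: "class_name" string -> constructor (Keras resolves names similarly).
REGISTRY = {name: partial(DummyLayer, name)
            for name in ("Conv1D", "MaxPooling1D", "Dropout", "Flatten", "Dense")}

def build_layers(config_text):
    """Parse a config like model_config.json and instantiate each layer."""
    spec = json.loads(config_text)
    return [REGISTRY[layer["class_name"]](**layer["config"])
            for layer in spec["model"]["layers"]]

layers = build_layers("""{"model": {"layers": [
    {"class_name": "Conv1D", "config": {"filters": 12, "kernel_size": 20}},
    {"class_name": "Flatten", "config": {}},
    {"class_name": "Dense", "config": {"units": 1, "activation": "sigmoid"}}
]}}""")
print([layer.kind for layer in layers])  # ['Conv1D', 'Flatten', 'Dense']
```

Because the architecture lives entirely in data, two runs with the same JSON file build byte-identical model definitions, which is what makes the configs shareable.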
Step 3: Model Development
Initialize the KerasModelBuilder with your JSON config and start training. The framework builds the Keras model dynamically.
# Build model from JSON
builder = cn.utility.KerasModelBuilder.from_json("model_config.json")

# Train the model
history = builder.train(
    x_train=train['x'], y_train=train['y'],
    x_val=val['x'], y_val=val['y']
)

# Plot training history (Loss/Accuracy)
builder.plot_history(savefig="training_curves.png")
Step 4: Evaluation
Evaluate the model's performance on the hold-out test set using the built-in evaluation suite. This generates publication-ready ROC curves and Confusion Matrices.
# Plot ROC Curve
builder.eval.roc(
    test['x'], test['y'],
    title="Model Performance (ROC)",
    savefig="roc_curve.png"
)

# Plot Confusion Matrix
builder.eval.cm(
    test['x'], test['y'],
    class_names=["Background", "Peak"],
    savefig="confusion_matrix.png"
)
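For intuition about the number the ROC plot reports, here is a dependency-free sketch of the underlying statistic (the eval suite adds plotting on top of this; `auc_score` is an illustrative name, not part of the API). AUC equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counting half.

```python
def auc_score(y_true, y_score):
    """Pairwise-comparison AUC for binary labels (0/1)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5          # ties count as half a win
    return wins / (len(pos) * len(neg))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The O(n²) pair loop is fine for intuition; production implementations use a rank-based formula instead.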
Step 5: Saving and Reloading
Essential: Before running explainability modules, save your trained model. The explainability tools often require a clean Keras model object or the path to a saved file.
from tensorflow import keras
# Save the trained model
builder.save("final_model.keras")
# --- LATER OR IN A NEW SCRIPT ---
# Reload the model for analysis
loaded_model = keras.models.load_model("final_model.keras")
# Reload data splits if needed
_, test, _ = cn.utility.PrepareData.load_splits_from_disk("processed_data/")
Step 6: Explainability Modules
This is where Cnnamon shines. Use the following modules to interpret what your model has learned.
A. Filter Visualization (Motif Discovery)
Extract sequence motifs using either the standard "Top Activating" method or the rigorous "Significant Activating" method. Crucially, you can export these motifs to check against known databases (like JASPAR) using TOMTOM.
# Method 1: Top Activating (Faster, Standard)
# Extracts motifs from the top 5% of activating sequences
motifs_standard = cn.CNN1D.FilterVisualize.top_activating(
    loaded_model,
    test,
    percentile=95.0,
    n_cores=4
)
motifs_standard.to_motifs(savefig="motifs_standard.png")

# Method 2: Significant Activating (Rigorous, Slower)
# Uses permutation testing to find statistically significant motifs
motifs_sig = cn.CNN1D.FilterVisualize.significant_activating(
    loaded_model,
    test,
    n_perturbations=200,    # Shuffles per filter
    q_value_cutoff=0.05,    # FDR threshold
    n_cores=8
)
motifs_sig.to_motifs(savefig="motifs_significant.png")

# --- JASPAR / TOMTOM Validation ---
# Export significant motifs to MEME format for database comparison
motifs_sig.to_meme("significant_motifs.meme")
print("Motifs exported! Run TOMTOM to check matches against JASPAR.")
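The conversion from activating subsequences to a logo rests on two simple ideas, sketched below (illustrative only, not the Cnnamon implementation): a Position Frequency Matrix is the column-wise base frequency over the aligned subsequences, and logo letter heights come from each column's information content (2 bits minus the Shannon entropy).

```python
import math

def pfm_from_seqs(seqs):
    """Equal-length sequences -> list of {base: frequency} per column."""
    n = len(seqs)
    return [{b: [s[i] for s in seqs].count(b) / n for b in "ACGT"}
            for i in range(len(seqs[0]))]

def information_content(column):
    """2 - entropy in bits; 2.0 means the position is fully conserved."""
    entropy = -sum(p * math.log2(p) for p in column.values() if p > 0)
    return 2.0 - entropy

pfm = pfm_from_seqs(["GATAAG", "GATTAG", "GATAAG", "CATAAG"])
print(information_content(pfm[1]))  # the all-'A' column scores 2.0
```

A column where all four bases are equally likely has entropy 2 bits and therefore zero information content, which is why uninformative flanks vanish from a logo.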
B. Filter Importance
Determine which filters actually drive model predictions by perturbing them and measuring the resulting increase in loss.
# Rank filters by importance
importance = cn.CNN1D.FilterImportance(
    loaded_model,
    test,
    n_iterations=10,
    n_cores=4
)

# Visualize ranking (Boxplot)
importance.boxplot(savefig="filter_importance.png")
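To show the perturbation idea in miniature, here is a toy sketch under stated assumptions: the "model" is a linear scorer over per-filter activations and the loss is mean squared error (the real module perturbs activations inside the trained network). Silencing one filter at a time, the importance score is the increase in loss over the unperturbed baseline.

```python
def mse_loss(weights, activations, targets):
    """Mean squared error of a linear model over filter activations."""
    preds = [sum(w * a for w, a in zip(weights, row)) for row in activations]
    return sum((p - y) ** 2 for p, y in zip(preds, targets)) / len(targets)

weights = [1.0, 0.0, 0.5]                        # filter 1 is dead weight
activations = [[1, 5, 2], [2, 1, 0], [0, 3, 1]]  # rows = sequences
targets = [2.0, 2.0, 0.5]
baseline = mse_loss(weights, activations, targets)

importance = {}
for f in range(len(weights)):
    # Silence filter f everywhere and re-measure the loss.
    silenced = [[0 if i == f else a for i, a in enumerate(row)]
                for row in activations]
    importance[f] = mse_loss(weights, silenced, targets) - baseline

ranked = sorted(importance, key=importance.get, reverse=True)
print(ranked)  # [0, 2, 1] -- the unused filter ranks last
```

Repeating the perturbation (`n_iterations`) with shuffled rather than zeroed activations gives a distribution per filter, which is what the boxplot summarizes.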
C. Filter Clustering
Group redundant filters (e.g., multiple filters learning the same "GATA" motif) to simplify interpretation.
# Cluster filters based on activation similarity
clustering = cn.CNN1D.FilterClustering(
    loaded_model,
    test,
    linkage_method='ward'
)

# Plot circular dendrogram
clustering.plot_circlize(savefig="filter_tree.png")

# Get cluster assignments
clusters = clustering.get_clusters(n_clusters=5)
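A sketch of the similarity measure such clustering is built on: pairwise Pearson correlation of per-sequence activation profiles. The linkage step (e.g. Ward) would then operate on this matrix; it needs scipy and is omitted here. The profile values below are made up for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length activation profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Rows = filters, columns = mean activation on each test sequence.
profiles = [
    [0.1, 0.9, 0.2, 0.8],   # filter 0
    [0.2, 1.0, 0.1, 0.9],   # filter 1: tracks filter 0 -> redundant
    [0.9, 0.1, 0.8, 0.2],   # filter 2: anti-correlated with filter 0
]
similarity = [[pearson(a, b) for b in profiles] for a in profiles]
print(round(similarity[0][1], 2), round(similarity[0][2], 2))
```

Filters 0 and 1 would merge early in the dendrogram; filter 2 would sit on its own branch.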
D. Filter Enrichment
Identify which filters are specifically associated with positive or negative classes.
# Run enrichment analysis
enrichment = cn.CNN1D.FilterEnrichment(
    loaded_model,
    test,
    method='fold_change',   # Uses Mann-Whitney U test
    n_cores=4
)

# Plot enrichment heatmap with significance markers (*)
enrichment.plot_heatmap(q_cutoff=0.05, savefig="enrichment.png")
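For reference, this is the Mann-Whitney U statistic the `'fold_change'` method relies on: a rank-based comparison of a filter's activations between two classes, with no normality assumption. Converting U to a p-value (and FDR q-values) is left to e.g. `scipy.stats.mannwhitneyu`; the numbers below are made up.

```python
def mann_whitney_u(a, b):
    """U statistic for sample a against sample b; ties count as half-wins."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

peak_acts = [3.1, 2.8, 3.5, 2.9]        # filter activations on class "Peak"
background_acts = [0.4, 1.1, 0.9, 1.6]  # activations on class "Background"
u = mann_whitney_u(peak_acts, background_acts)
print(u, "of a maximum", len(peak_acts) * len(background_acts))  # 16.0 of a maximum 16
```

A U at the maximum (every Peak activation exceeds every Background activation) is the signature of a strongly class-specific filter.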
E. Nucleotide Sensitivity (Mutagenesis)
Perform in-silico mutagenesis to see exactly which bases in a motif are critical for activation.
# Analyze sensitivity of significant motifs
sensitivity_df = cn.CNN1D.analyze_nucleotide_sensitivity(
    loaded_model,
    motifs_sig,    # Use the MotifSet from Step 6A
    n_cores=4
)

# Plot boxplots of mutational impact
cn.CNN1D.plot_nucleotide_sensitivity(sensitivity_df, savefig="mutagenesis.png")
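The mutagenesis loop itself is easy to picture. In this toy sketch a hand-written PWM stands in for a trained filter's activation (the real module scores mutants with the model): mutate every position of a reference subsequence to each alternative base and record the fold change in score.

```python
PWM = [  # per-position weights for a 4-bp "GATA-like" toy motif
    {"A": 0.1, "C": 0.1, "G": 2.0, "T": 0.1},
    {"A": 2.0, "C": 0.1, "G": 0.1, "T": 0.5},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 2.0},
    {"A": 2.0, "C": 0.1, "G": 0.1, "T": 0.1},
]

def score(seq):
    """Toy activation: sum of per-position weights."""
    return sum(PWM[i][b] for i, b in enumerate(seq))

ref = "GATA"
base = score(ref)
records = []  # (position, ref_base, alt_base, fold_change)
for i, ref_base in enumerate(ref):
    for alt in "ACGT":
        if alt != ref_base:
            mut = ref[:i] + alt + ref[i + 1:]
            records.append((i, ref_base, alt, score(mut) / base))

worst = min(records, key=lambda r: r[3])
print(worst[:3])  # the first of the most damaging mutations
```

Positions whose mutations drag the fold change far below 1 are the bases the filter genuinely depends on; the real DataFrame carries one such 'FoldChange' row per mutation.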
PrepareData
The PrepareData class handles genomic sequence data preparation for CNN training, including sequence extraction, one-hot encoding, and flexible dataset splitting.
Initialization
Core Parameters:
- intervalfile: BED file of genomic intervals.
- genomefasta: Reference genome FASTA file.
- outdir: Output directory for the processed data.
Configuration Options (**kwargs):
- split_segmentation: Split strategy; one of 'random', 'chromosome', or 'custom'. (Default: 'random')
- Split ratios for Train/Test/Validation (e.g. [0.6, 0.2, 0.2]). Normalized automatically if the sum != 1.
- augment_RC: If True, adds Reverse Complement sequences to the dataset with the same labels. (Default: False)
- save_splits: Set to "1" to save the generated numpy arrays and info CSVs to disk. (Default: "0")
- Chromosome-based options: restrict splitting to a set of chromosomes via chrlist, or to a single chromosome with e.g. specificchr="1".
- Custom split definitions (each requires split_segmentation='custom').
Accessible Attributes
After initialization/running, the following attributes are available:
Methods
run(): Execute the pipeline. Returns three dictionaries (Train, Test, Validation), each containing:
- "x": Sequence tensor (N, L, 4)
- "y": Label tensor (N, Classes)
- "info": List of "chr,start,end" strings, one per sample.
Retrieve the split data dictionaries without re-running sequence extraction (must be called after initialization or loading).
load_splits_from_disk(): Load splits previously saved with save_splits="1". Returns the same structure as run().
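The "normalized automatically" behavior of the split ratios described above can be sketched in two lines (`normalize_ratios` is an illustrative name, not the library's):

```python
def normalize_ratios(ratios):
    """Rescale arbitrary non-negative split weights to sum to 1."""
    total = sum(ratios)
    return [r / total for r in ratios]

print(normalize_ratios([6, 2, 2]))  # [0.6, 0.2, 0.2]
```

This is why passing weights like [6, 2, 2] and fractions like [0.6, 0.2, 0.2] yields the same Train/Test/Validation split.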
KerasModelBuilder
A wrapper for building, training, and evaluating Keras models from JSON.
Methods
from_json(): Initialize the builder from a JSON file path or a dictionary.
Compiles and returns the Keras model based on the configuration.
Train the model using config parameters. Returns the Keras History object.
Plot training metrics (loss/accuracy) vs epochs.
Print Keras model summary.
Evaluation (model.eval)
Access via builder.eval. Provides publication-ready plotting.
cm(): Plot a row-normalized confusion matrix. Annotations include probability and raw count.
roc(): Plot the ROC curve (binary classification only). Displays the AUC in the legend.
Calculate and return AUC score. Supports 'ovr' (One-vs-Rest) for multiclass.
Filter Visualization
The cn.CNN1D.FilterVisualize module extracts sequence motifs from model filters.
Discovery Methods
top_activating(): Extract motifs from sequences causing high activation.
significant_activating(): Extract motifs using permutation testing to keep only significant activations.
Construct consensus motifs strictly from positive filter weights.
Construct motifs by applying softmax to filter weights.
External Validation (TOMTOM/JASPAR)
Cnnamon integrates with the MEME Suite to validate discovered motifs against known biological databases (like JASPAR or HOCOMOCO).
to_meme(): Export all discovered motifs to a standard `.meme` file. This file can be used as input for `tomtom`.
# 1. Export motifs from Cnnamon
motifs_sig.to_meme("output/learned_motifs.meme")
# 2. Run TOMTOM (Terminal Command)
# Compare learned motifs against the JASPAR Core database
tomtom -no-ssc -oc output/tomtom_results \
    output/learned_motifs.meme \
    JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt
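To make the exported file format transparent, here is a hedged sketch of what a minimal MEME-format writer needs to emit (the actual to_meme output may differ in detail; `to_meme_text` is an illustrative name). The layout follows the MEME Suite's minimal motif format: a version line, alphabet, background frequencies, then one letter-probability matrix per motif.

```python
def to_meme_text(motifs, background=(0.25, 0.25, 0.25, 0.25)):
    """motifs: dict of name -> list of (pA, pC, pG, pT) rows summing to 1."""
    lines = ["MEME version 4", "", "ALPHABET= ACGT", "", "strands: + -", "",
             "Background letter frequencies",
             "A %.3f C %.3f G %.3f T %.3f" % background, ""]
    for name, rows in motifs.items():
        lines.append("MOTIF %s" % name)
        lines.append("letter-probability matrix: alength= 4 w= %d" % len(rows))
        # One line per motif position: probabilities in A C G T order.
        lines.extend(" ".join("%.6f" % p for p in row) for row in rows)
        lines.append("")
    return "\n".join(lines)

meme_text = to_meme_text({"filter_0": [(0.70, 0.10, 0.10, 0.10),
                                       (0.05, 0.05, 0.85, 0.05)]})
print(meme_text.splitlines()[0])  # MEME version 4
```

Any file in this shape is accepted by tomtom, which is what makes the JASPAR comparison above a one-line shell command.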
Redundancy Analysis
Calculate Pearson correlation between raw filter weights. Returns similarity matrix.
Calculate Pearson correlation between filters and their Reverse Complements. Returns similarity matrix.
The MotifSet Object
Returned by discovery methods. Behaves like a dictionary of PFMs (Pandas DataFrames) but includes rich metadata.
Attributes
Per-filter metadata (significance fields are populated only by significant_activating).
Export Methods
to_motifs(): Plot Sequence Logos (Information Content) for all filters in a grid.
Export each filter as an individual SVG file.
Cluster learned PFMs by similarity (Pearson correlation of PWMs).
Nucleotide Sensitivity
analyze_nucleotide_sensitivity(): Perform in-silico mutagenesis on activating subsequences. Returns a DataFrame with a 'FoldChange' for every mutation.
plot_nucleotide_sensitivity(): Plot mutagenesis results as boxplots per filter position, showing the impact of mutations on activation.
Filter Importance
Rank filters by their contribution to model prediction accuracy.
Initialize and run perturbation experiment.
Attributes
Methods
boxplot(): Plot the importance distribution (boxplots) against the baseline loss.
Plot importance distribution (violin plots) vs Baseline Loss.
Filter Clustering
Group filters by activation profile similarity.
Initialize clustering analysis.
Attributes
Methods
Run silhouette analysis to determine optimal cluster count.
Plot clustered heatmap of filter activations.
plot_circlize(): Plot a circular dendrogram with cluster highlighting.
get_clusters(): Return a DataFrame mapping filter_name to cluster_id.
Filter Enrichment
Identify filters enriched for specific output classes.
Parameters:
- method: 'fold_change' (Mann-Whitney U) or 'odds_ratio' (Fisher Exact).
- An additional option used only with method='odds_ratio'.
- Sequence scope: 'activated' (only activating sequences) or 'global' (all sequences).
Attributes
Methods
plot_heatmap(): Plot the enrichment heatmap. Filters with q-value ≤ q_cutoff are marked with (*).