Cnnamon Documentation
A modular framework for DNA sequence-based Convolutional Neural Network (CNN) model training and explainability.
What is Cnnamon?
Cnnamon is a comprehensive Python framework designed for building, training, and interpreting 1D CNNs for DNA sequence analysis. It provides:
- Data Preparation: Tools for loading genomic intervals and converting sequences to one-hot encoded arrays
- Model Building: JSON-based CNN architecture specification and training
- Explainability: Advanced visualization and interpretation tools for understanding learned patterns
Key Features
🔬 Filter Visualization
Extract and visualize sequence motifs learned by convolutional filters using multiple methods (softmax, top_activating, positive_activating, significant_activating).
📊 Filter Importance
Identify the most important filters in your model through perturbation analysis.
🔗 Filter Clustering
Group functionally similar filters based on activation profiles.
🎯 Class Enrichment
Discover which filters are enriched for specific output classes.
Installation
Requirements
- Python ≥ 3.7
- TensorFlow ≥ 2.0
- NumPy
- Pandas
- Matplotlib, Seaborn
- scikit-learn
- logomaker
- pycirclize
- bedtools (for sequence extraction)
Install from Source
git clone https://github.com/yourusername/cnnamon.git
cd cnnamon
pip install -e .
Dependencies
pip install tensorflow numpy pandas matplotlib seaborn scikit-learn logomaker pycirclize scipy statsmodels joblib tqdm
Quick Start
Basic Workflow
import cnnamon
from cnnamon.util import PrepareData, KerasModelBuilder
from cnnamon.CNN1D import FilterVisualize, FilterImportance
# 1. Prepare data
preparer = PrepareData(
intervalfile="peaks.bed",
genomefasta="genome.fa",
outdir="output/"
)
train, test, val = preparer.run()
# 2. Build and train model
model = KerasModelBuilder.from_json("model_config.json")
model.train(train['x'], train['y'], val['x'], val['y'])
model.save("my_model.keras")
# 3. Visualize learned motifs
motifs = FilterVisualize.top_activating(model.model, test['x'])
motifs.to_motifs(savefig="motifs.png")
# 4. Analyze filter importance
importance = FilterImportance(model.model, test, n_cores=4)
importance.boxplot(savefig="importance.png")
PrepareData
The PrepareData class handles genomic sequence data preparation for CNN training, including sequence extraction, one-hot encoding, and flexible dataset splitting.
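PrepareData performs the one-hot encoding internally; for readers new to the representation, here is a minimal sketch of the idea (the A/C/G/T channel order is an assumption about the internal layout, and one_hot is an illustrative helper, not part of the API):
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    """Encode a DNA string as an (L, 4) array; ambiguous bases (e.g., N) stay all-zero."""
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASE_INDEX.get(base)
        if j is not None:
            arr[i, j] = 1.0
    return arr

print(one_hot("ACGT"))  # one 1 per row, landing on the diagonal for this sequence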
Initialization
Initialize data preparer with genomic intervals and reference genome.
Core Parameters:
- intervalfile: Path to a BED file of genomic intervals (e.g., peaks)
- genomefasta: Path to the reference genome FASTA file
- outdir: Output directory for generated files
Configuration Options (**kwargs):
- split_segmentation: Splitting strategy, one of 'random', 'chromosome', 'custom'. (Default: 'random')
- ratios: Train/test/validation fractions (e.g., [0.8, 0.1, 0.1])
- augment_RC: If True, augment the dataset with reverse-complement sequences
- seed: Random seed for reproducible splits
- save_splits: Set to "1" to save the resulting splits as .npy files to disk. (Default: "0")
Custom Split Parameters (if split_segmentation='custom'):
- train_chr_list / val_chr_list / test_chr_list: Chromosomes assigned to each set (e.g., ['chr1', 'chr2'])
Methods
run()
Execute the full data preparation pipeline.
Returns:
(train_dict, test_dict, validation_dict), where each dictionary contains:
- 'x': One-hot encoded sequences (N × L × 4)
- 'y': Label array
- 'info': Interval information (chr, start, end)
Load previously saved data splits (if save_splits="1" was used).
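For orientation, the structure returned by run() can be inspected like this (shapes are illustrative, assuming 400 bp intervals):
train, test, val = preparer.run()
print(train['x'].shape)   # e.g., (N_train, 400, 4) one-hot sequences
print(train['y'].shape)   # labels; (N_train,) for binary, (N_train, n_classes) otherwise
print(train['info'][:3])  # first three (chr, start, end) records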
Splitting Strategies
Cnnamon supports three robust strategies for dividing your genomic data:
🔀 Random Split ('random')
Pools all intervals from the genome and splits them randomly based on ratios. Best for non-overlapping, independent peaks.
🧬 Chromosome Split ('chromosome')
Randomly selects entire chromosomes to hold out for testing/validation based on ratios. Prevents data leakage between similar sequences on the same chromosome.
🛠 Custom Split ('custom')
Manually assign specific chromosomes to each set using *_chr_list parameters. Ideal for benchmarking against standard splits (e.g., "chr8 for test").
Examples
1. Random Split with Augmentation
preparer = PrepareData(
intervalfile="peaks.bed", genomefasta="hg38.fa", outdir="out_rnd/",
split_segmentation="random",
ratios=[0.8, 0.1, 0.1],
augment_RC=True, # Double data with Reverse Complements
seed=42
)
train, test, val = preparer.run()
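With augment_RC=True the dataset is doubled with reverse complements. For a one-hot array in A/C/G/T column order (an assumption about the internal layout; the helper below is illustrative), the reverse complement is a double flip:
import numpy as np

def reverse_complement(onehot):
    """Reverse complement of a one-hot (L, 4) array in A/C/G/T column order."""
    # Flipping the base axis reverses the sequence; flipping the channel axis
    # swaps A<->T and C<->G, which is exactly the complement under this ordering.
    return onehot[::-1, ::-1]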
2. Chromosome Holdout
preparer = PrepareData(
intervalfile="peaks.bed", genomefasta="hg38.fa", outdir="out_chr/",
split_segmentation="chromosome",
ratios=[0.7, 0.15, 0.15],
seed=123
)
3. Custom Chromosome Sets
preparer = PrepareData(
intervalfile="peaks.bed", genomefasta="hg38.fa", outdir="out_custom/",
split_segmentation="custom",
train_chr_list=["chr1", "chr2", "chr3"],
val_chr_list=["chr4"],
test_chr_list=["chr5"]
)
KerasModelBuilder
A flexible wrapper for building, training, and evaluating Keras models directly from JSON configurations. It supports the full Keras Functional/Sequential API capabilities by dynamically loading layers, optimizers, and callbacks.
1. Configuration Structure (JSON)
The configuration file is divided into four main sections: model, compile, training_params, and callbacks.
A. Model Architecture
Define layers using the standard Keras class_name and config dictionary. You can use any layer available in tf.keras.layers.
"model": {
"layers": [
{
"class_name": "Conv1D",
"config": {
"filters": 64,
"kernel_size": 8,
"activation": "relu",
"input_shape": [400, 4]
}
},
{ "class_name": "MaxPooling1D", "config": { "pool_size": 2 } },
{ "class_name": "Flatten", "config": {} },
{ "class_name": "Dense", "config": { "units": 1, "activation": "sigmoid" } }
]
}
B. Compilation
Specifies the optimizer, loss function, and metrics. Optimizers can be simple strings or detailed objects.
"compile": {
"optimizer": {
"class_name": "Adam",
"config": { "learning_rate": 0.0001 }
},
"loss": "binary_crossentropy",
"metrics": ["accuracy"]
}
C. Training Parameters
Arguments passed directly to the model.fit() method.
"training_params": {
"epochs": 50,
"batch_size": 32,
"verbose": 1
}
D. Callbacks
Define Keras callbacks. Keys match the callback class names in tf.keras.callbacks.
"callbacks": {
"EarlyStopping": {
"monitor": "val_loss",
"patience": 5,
"restore_best_weights": true
},
"CSVLogger": {
"filename": "logs/training_log.csv"
}
}
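To make the dynamic loading concrete, here is a minimal sketch of how a configuration like the one above could be turned into a compiled model with callbacks (illustrative only; the actual builder may differ in details):
import json
import tensorflow as tf

def build_from_config(path):
    with open(path) as fh:
        cfg = json.load(fh)
    # Instantiate each layer class by name from tf.keras.layers.
    model = tf.keras.Sequential([
        getattr(tf.keras.layers, spec["class_name"])(**spec["config"])
        for spec in cfg["model"]["layers"]
    ])
    # Optimizers may be plain strings ("adam") or {class_name, config} objects.
    opt = cfg["compile"]["optimizer"]
    if isinstance(opt, dict):
        opt = getattr(tf.keras.optimizers, opt["class_name"])(**opt["config"])
    model.compile(optimizer=opt, loss=cfg["compile"]["loss"],
                  metrics=cfg["compile"]["metrics"])
    # Callback keys map directly onto tf.keras.callbacks class names.
    callbacks = [getattr(tf.keras.callbacks, name)(**kwargs)
                 for name, kwargs in cfg.get("callbacks", {}).items()]
    return model, cfg["training_params"], callbacks

# model, fit_kwargs, callbacks = build_from_config("model_config.json")
# model.fit(train_x, train_y, callbacks=callbacks, **fit_kwargs)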
2. Core Methods
from_json(config)
Initialize and build the model directly from a JSON file path or a dictionary.
train(train_x, train_y, val_x, val_y)
Train the model using the parameters and callbacks defined in the JSON configuration.
A standard Keras evaluation helper returns loss and metrics.
plot_history(savefig=...)
Plot training vs. validation metrics (loss, accuracy, etc.) over epochs.
save(path)
Save the trained model to a .keras file. A pre-trained model can likewise be loaded back in.
3. Advanced Evaluation (builder.eval)
The builder includes a helper eval object for generating publication-ready plots and metrics.
eval.cm(x, y, ...)
Plot a confusion matrix with row-normalized proportions (recall) and raw counts. Parameters include class_names for axis labels (e.g., ["Negative", "Positive"]), title, and savefig.
eval.roc(x, y, ...)
Plot the Receiver Operating Characteristic (ROC) curve with its AUC score (binary classification only).
An AUC helper calculates and prints the AUC score (supports One-vs-Rest for multiclass).
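The AUC computation is equivalent in spirit to scikit-learn's roc_auc_score; a hedged sketch (builder.model is assumed to expose the underlying Keras model, mirroring model.model in the Quick Start; test_x/test_y follow the Full Example below):
import numpy as np
from sklearn.metrics import roc_auc_score

y_prob = np.asarray(builder.model.predict(test_x))
if y_prob.ndim == 1 or y_prob.shape[1] == 1:    # binary: single sigmoid output
    auc = roc_auc_score(test_y, y_prob.ravel())
else:                                            # multiclass: One-vs-Rest over class scores
    auc = roc_auc_score(test_y, y_prob, multi_class="ovr")  # test_y as integer labels
print(f"AUC: {auc:.3f}")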
4. Full Example
from cnnamon.util import KerasModelBuilder
# 1. Initialize and Build
builder = KerasModelBuilder.from_json("config.json")
# 2. Train
builder.train(train_x, train_y, val_x, val_y)
# 3. Plot History
builder.plot_history(savefig="plots/history.png")
# 4. Advanced Evaluation
# Confusion Matrix with custom labels
builder.eval.cm(
test_x, test_y,
class_names=["Non-Promoter", "Promoter"],
title="Promoter Prediction Performance",
savefig="plots/cm.png"
)
# ROC Curve
builder.eval.roc(
test_x, test_y,
title="Model ROC",
savefig="plots/roc.png"
)
Filter Visualization
The FilterVisualize class is your primary tool for interpreting what the convolutional filters in your model have learned. It provides multiple methods to convert the raw numerical filter weights into human-readable sequence motifs (position frequency matrices).
1. Visualization Methods
The "Quick & Dirty" Method. Applies a Softmax function directly to the filter weights. This gives you a theoretical idea of what the filter wants to see, but doesn't tell you if it actually fires on real data.
Parameters:
{'A':0.25, ...}). Default is uniform.The "Data-Driven" Method. Feeds your actual sequences through the model and extracts the subsequences that produce the highest activation scores. This is the most accurate representation of the biological signals your model has discovered.
Parameters:
percentile and builds motifs from all subsequences with activation > 0. Useful for rare motifs.The "Consensus" Method. Builds a motif by looking only at the positive weights in the filter kernel. It constructs a "consensus" sequence that would theoretically maximize activation.
The "Statistical" Method. Performs a perturbation test to compare real filter activations against a "null" model of random noise. It only builds motifs from subsequences that are statistically significant (p < cutoff). Warning: Computationally expensive.
Parameters:
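As referenced above, here is a minimal softmax sketch for a single Conv1D kernel; the (width, 4, n_filters) weight layout is the standard Keras convention, and the function name is illustrative rather than Cnnamon's API:
import numpy as np

def softmax_pfm(kernel):
    """Turn one (width, 4) Conv1D kernel into a position frequency matrix."""
    # Softmax across the 4 base channels so each position's row sums to 1.
    e = np.exp(kernel - kernel.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Kernel of filter i from the first conv layer (Keras weights: (width, 4, n_filters)):
# pfm = softmax_pfm(model.layers[0].get_weights()[0][:, :, i])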
2. The MotifSet Object
All visualization methods return a MotifSet object, which behaves like a dictionary of DataFrames but has powerful export capabilities.
to_motifs(savefig=...)
Plots all filters in a grid layout as sequence logos (information content in bits).
to_meme(path)
Exports all motifs to a MEME-formatted text file, compatible with tools like TOMTOM (for matching against JASPAR/HOCOMOCO databases) or FIMO.
Saves each individual filter as a high-quality SVG vector graphic in the specified directory. Great for publications or clustering analysis.
3. Examples
Example 1: Standard Discovery
from cnnamon.CNN1D import FilterVisualize
# 1. Generate motifs from Test Set (Top 10% of activators)
motifs = FilterVisualize.top_activating(
model,
test['x'],
percentile=90.0,
n_cores=4
)
# 2. Visualize in a grid
motifs.to_motifs(savefig="figures/all_motifs.png")
# 3. Export for external tools
motifs.to_meme("results/learned_motifs.meme")
Example 2: Analyzing Rare Motifs
If you suspect a filter detects a very rare signal (e.g., only 50 instances in 10,000 sequences), a 90th percentile cutoff might be too strict. Use include_all_positive=True to capture every instance.
rare_motifs = FilterVisualize.top_activating(
model,
test['x'],
include_all_positive=True, # Capture ALL positive activations
n_cores=4
)
rare_motifs.to_motifs(savefig="figures/rare_motifs.png")
Example 3: High-Rigor Verification
Use the statistical method to filter out noise "motifs" that are just random GC-rich patches.
rigorous_motifs = FilterVisualize.significant_activating(
model,
test['x'],
n_perturbations=200,
p_value_cutoff=0.01,
n_cores=8
)
# Only statistically significant patterns will appear here
rigorous_motifs.to_motifs()
Filter Importance
Identify the most important filters through perturbation analysis.
Initialization
Initializing FilterImportance runs the full perturbation experiment and ranks filters by importance.
Parameters:
- model: Trained Keras model to analyze
- testset: Test data dictionary (as returned by PrepareData)
- n_iterations: Number of perturbation repeats per filter
- method: How to aggregate loss changes across iterations (e.g., 'mean')
- n_cores: Number of CPU cores for parallelization
- batch_size: Batch size used during evaluation
How It Works
- Calculate baseline model loss on test data
- For each filter:
  - Perturb its weights with Gaussian noise (preserving mean/std)
  - Evaluate model loss with the perturbed filter
  - Repeat n_iterations times
- Rank filters by average loss increase
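A minimal sketch of a single perturbation step, assuming a standard Conv1D layer with weights [kernel, bias]; the function is illustrative, not Cnnamon's internal API:
import numpy as np

def perturbed_loss(model, x, y, layer_name, filter_idx, rng=None):
    """Loss after replacing one filter's kernel with moment-matched Gaussian noise."""
    rng = rng or np.random.default_rng()
    layer = model.get_layer(layer_name)
    kernel, bias = layer.get_weights()          # kernel: (width, 4, n_filters)
    original = kernel.copy()
    col = kernel[:, :, filter_idx]
    kernel[:, :, filter_idx] = rng.normal(col.mean(), col.std(), size=col.shape)
    layer.set_weights([kernel, bias])
    result = model.evaluate(x, y, verbose=0)    # [loss, *metrics] if metrics are compiled
    layer.set_weights([original, bias])         # restore the trained weights
    return result[0] if isinstance(result, list) else result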
Visualization Methods
boxplot(savefig=...)
Plot the distribution of perturbed losses as boxplots, ordered by importance.
violin(savefig=...)
Plot the distribution of perturbed losses as violin plots, ordered by importance.
Example
from cnnamon.CNN1D import FilterImportance
importance = FilterImportance(
model,
testset=test,
n_iterations=20,
method='mean',
n_cores=8,
batch_size=256
)
# Visualize results
importance.boxplot(savefig="importance_boxplot.png")
importance.violin(savefig="importance_violin.png")
# Access ranking
print("Most important filters:", importance.filter_importance_ranking[:5])
Performance tip: increase batch_size (e.g., 256-512) for GPU acceleration, and use n_cores for CPU parallelization.
Filter Clustering
Group functionally similar filters based on their activation profiles.
Initialization
Initializing FilterClustering performs hierarchical clustering of the filters.
Parameters:
- model: Trained Keras model
- testset: Test data dictionary used to compute activation profiles
- target_layer: Name of the convolutional layer to analyze (e.g., 'conv1d_0')
- linkage_method: Linkage criterion for hierarchical clustering (e.g., 'ward')
Visualization Methods
plot_heatmap(savefig=...)
Plot a clustered heatmap showing filter activation patterns.
plot_dendrogram(savefig=...)
Plot the hierarchical clustering dendrogram.
plot_circlize(savefig=...)
Plot a circular phylogenetic-style tree of filter relationships.
get_clusters(n_clusters)
Extract filter groups by cutting the dendrogram at the specified number of clusters.
Returns:
Dictionary mapping cluster names to filter lists.
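For intuition, the mechanics might resemble this scipy sketch (an assumed illustration, not Cnnamon's exact implementation): each filter is represented by its activation profile across sequences, clustered hierarchically, and the tree is cut into n_clusters groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_filters(activations, n_clusters, method="ward"):
    """activations: (n_sequences, n_filters) array of filter activation profiles."""
    profiles = activations.T                          # one row per filter
    Z = linkage(profiles, method=method)              # hierarchical clustering
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    return {f"cluster_{c}": np.where(labels == c)[0].tolist()
            for c in np.unique(labels)}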
Example
from cnnamon.CNN1D import FilterClustering
clustering = FilterClustering(
model,
testset=test,
target_layer='conv1d_0',
linkage_method='ward'
)
# Visualize clustering
clustering.plot_heatmap(savefig="cluster_heatmap.png")
clustering.plot_dendrogram(savefig="dendrogram.png")
clustering.plot_circlize(savefig="circular_tree.png")
# Get cluster assignments
clusters = clustering.get_clusters(n_clusters=5)
for cluster_name, filters in clusters.items():
print(f"{cluster_name}: {filters}")
Filter Enrichment
Discover which filters are enriched for specific output classes through statistical testing.
Initialization
Initializing FilterEnrichment performs class-specific enrichment analysis with FDR correction.
Parameters:
- model: Trained Keras model
- testset: Test data dictionary
- class_names: Labels for the output classes (e.g., ['Enhancer', 'Promoter', 'Silencer'])
- method: Statistical test to use (e.g., 'mann-whitney')
- n_cores: Number of CPU cores for parallelization
How It Works
- Extract filter activations for all sequences
- For each filter and each class:
  - Split activations into class-positive and class-negative groups
  - Calculate the log2 fold change (enrichment direction)
  - Perform a statistical test (Mann-Whitney U test)
- Apply FDR correction (Benjamini-Hochberg) across all tests
- Identify significantly enriched filter-class pairs
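A minimal sketch of one filter-class test using scipy and statsmodels (both listed as dependencies); this illustrates the procedure, not Cnnamon's exact code:
import numpy as np
from scipy.stats import mannwhitneyu
from statsmodels.stats.multitest import multipletests

def filter_class_test(activations, labels, cls, eps=1e-9):
    """activations: (n_sequences,) summary activation of one filter; labels: class per sequence."""
    pos = activations[labels == cls]
    neg = activations[labels != cls]
    logfc = np.log2((pos.mean() + eps) / (neg.mean() + eps))   # enrichment direction
    _, p = mannwhitneyu(pos, neg, alternative="two-sided")     # non-parametric test
    return logfc, p

# After collecting p-values across all filter-class pairs:
# reject, q_values, _, _ = multipletests(p_values, method="fdr_bh")  # Benjamini-Hochberg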
Methods
get_results(value_type=..., q_cutoff=..., logFC_cutoff=...)
Extract filtered enrichment results. Parameters include value_type (e.g., 'logFC'), q_cutoff, and logFC_cutoff.
plot_heatmap(q_cutoff=..., savefig=...)
Plot a log2 fold change heatmap with significance markers (*) for q ≤ cutoff.
Example
from cnnamon.CNN1D import FilterEnrichment
enrichment = FilterEnrichment(
model,
testset=test,
class_names=['Enhancer', 'Promoter', 'Silencer'],
method='mann-whitney',
n_cores=4
)
# Visualize enrichment
enrichment.plot_heatmap(
q_cutoff=0.05,
savefig="enrichment_heatmap.png"
)
# Get significant results
significant = enrichment.get_results(
value_type='logFC',
q_cutoff=0.05,
logFC_cutoff=1.0
)
print(significant)
# Access raw data
q_values = enrichment.q_values # DataFrame: filters × classes
logFCs = enrichment.logFCs # DataFrame: filters × classes
Interpreting the results:
- Positive logFC: the filter activates more strongly for sequences in that class
- Negative logFC: the filter activates less strongly for sequences in that class
- * marker: q-value ≤ the significance threshold (statistically significant)