Cnnamon Documentation

A modular, end-to-end Python framework for DNA sequence-based Convolutional Neural Network (CNN) model training and explainability.

Overview

Cnnamon is designed to bridge the gap between deep learning model development and biological interpretation in genomics. While training high-performance models on DNA sequences is common, understanding what patterns the model learns is often the bottleneck. Cnnamon integrates data processing, model construction, and a suite of rich explainability tools into a unified ecosystem.

What to Expect

By using Cnnamon, researchers can expect to:

  • Convert BED intervals and a reference FASTA into one-hot encoded, split-ready training tensors.
  • Build, train, and evaluate Keras CNNs from reproducible JSON configurations.
  • Generate publication-ready plots: training curves, ROC curves, and confusion matrices.
  • Extract learned motifs and validate them against databases such as JASPAR via TOMTOM.
  • Rank, cluster, and statistically test filters for importance, redundancy, and class enrichment.

Modules

The library is organized into two primary namespaces:

1. cn.utility

Handles the "plumbing" of the deep learning pipeline. It includes PrepareData for sequence extraction and KerasModelBuilder for JSON-based training.

2. cn.CNN1D

The "Explainability Suite." This module provides post-hoc analysis tools:

  • FilterVisualize: Convert filters to motifs (Logos) and export for TOMTOM.
  • FilterImportance: Perturbation analysis to rank filters.
  • FilterClustering: Group redundant filters.
  • FilterEnrichment: Statistical tests for class specificity.

Installation

We highly recommend installing Cnnamon within a fresh virtual environment to ensure dependency compatibility.

Recommended Setup (Conda)

Create a new environment with Python 3.10 and install the package from PyPI. TensorFlow 2.15 is recommended for optimal compatibility.

# 1. Create a fresh environment
conda create -n cnnamon_env python=3.10
conda activate cnnamon_env

# 2. Install Cnnamon and TensorFlow
pip install cnnamon tensorflow==2.15

Dependencies: When you run the pip install command above, all necessary dependencies (NumPy, Pandas, Matplotlib, Seaborn, PyCirclize, Logomaker, Scikit-learn) will be automatically fetched and installed.

Basic Workflow

This guide walks you through a complete analysis pipeline: from raw data to biological insights.

Step 1: Dataset Preparation

First, we convert genomic coordinates (BED) and a reference genome (FASTA) into one-hot encoded training tensors. The PrepareData module handles splitting (Train/Val/Test) and augmentation.

import cnnamon as cn

# Initialize data preparer
preparer = cn.utility.PrepareData(
    intervalfile="peaks.bed",      # Input BED file
    genomefasta="hg38.fa",         # Reference Genome
    outdir="processed_data/",      # Output folder
    split_segmentation="random",   # Split strategy
    augment_RC=True                # Augment with Reverse Complements
)

# Run extraction
train, test, val = preparer.run()

print(f"Training shape: {train['x'].shape}")
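
To make the encoding concrete, here is a minimal pure-NumPy sketch of one-hot encoding and reverse-complement augmentation (illustrative only; not Cnnamon's internal code):

```python
import numpy as np

# Hypothetical illustration: one-hot encode a DNA sequence in
# A, C, G, T column order, then build its reverse complement by
# flipping both the sequence axis and the base axis.
BASES = "ACGT"

def one_hot(seq):
    """Return an (L, 4) array with a single 1 per row."""
    idx = {b: i for i, b in enumerate(BASES)}
    arr = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq):
        arr[pos, idx[base]] = 1.0
    return arr

def reverse_complement(onehot):
    """Reverse the sequence axis and swap A<->T, C<->G columns."""
    return onehot[::-1, ::-1]

x = one_hot("GATTACA")
x_rc = reverse_complement(x)   # encodes "TGTAATC"
print(x.shape)                 # (7, 4)
```

With augment_RC=True, each sample and its reverse complement both enter the training set with the same label.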

Step 2: Model Configuration (JSON)

Define your model architecture and training parameters in a JSON file. This ensures your experiments are reproducible.

Create a file named model_config.json with the following structure:

{
  "model": {
    "layers": [
      {
        "class_name": "Conv1D",
        "config": {
          "name": "motif_scanner",
          "filters": 12,
          "kernel_size": 20,
          "padding": "same",
          "use_bias": true,
          "input_shape": [300, 4],
          "activation": "leaky_relu"
        }
      },
      { "class_name": "MaxPooling1D", "config": { "pool_size": 2 } },
      { "class_name": "Dropout", "config": { "rate": 0.1 } },
      { "class_name": "Flatten", "config": {} },
      { "class_name": "Dense", "config": { "units": 32, "activation": "relu" } },
      { "class_name": "Dense", "config": { "units": 1, "activation": "sigmoid" } }
    ]
  },
  "compile": {
    "optimizer": {
      "class_name": "AdamW",
      "config": { "learning_rate": 0.001 }
    },
    "loss": "binary_crossentropy",
    "metrics": ["accuracy", "AUC"]
  },
  "training_params": {
    "epochs": 50,
    "batch_size": 64
  },
  "callbacks": {
    "EarlyStopping": {
      "monitor": "val_loss",
      "patience": 10,
      "restore_best_weights": true
    },
    "CSVLogger": {
      "filename": "training_log.csv"
    }
  }
}
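
Before training, you can sanity-check the config with plain Python. The JSON is embedded here as a string so the snippet is self-contained; normally you would read model_config.json from disk.

```python
import json

# Plain-Python structural check of a (shortened) model config.
config_text = """
{
  "model": {"layers": [
    {"class_name": "Conv1D", "config": {"filters": 12, "kernel_size": 20}},
    {"class_name": "Dense",  "config": {"units": 1, "activation": "sigmoid"}}
  ]},
  "compile": {"loss": "binary_crossentropy"},
  "training_params": {"epochs": 50, "batch_size": 64}
}
"""
config = json.loads(config_text)

# Every layer entry needs a class_name and a config dict.
for layer in config["model"]["layers"]:
    assert "class_name" in layer and "config" in layer

print([layer["class_name"] for layer in config["model"]["layers"]])
```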

Step 3: Model Development

Initialize the KerasModelBuilder with your JSON config and start training. The framework builds the Keras model dynamically.

# Build model from JSON
builder = cn.utility.KerasModelBuilder.from_json("model_config.json")

# Train the model
history = builder.train(
    x_train=train['x'], y_train=train['y'],
    x_val=val['x'], y_val=val['y']
)

# Plot training history (Loss/Accuracy)
builder.plot_history(savefig="training_curves.png")

Step 4: Evaluation

Evaluate the model's performance on the hold-out test set using the built-in evaluation suite. This generates publication-ready ROC curves and Confusion Matrices.

# Plot ROC Curve
builder.eval.roc(
    test['x'], test['y'],
    title="Model Performance (ROC)",
    savefig="roc_curve.png"
)

# Plot Confusion Matrix
builder.eval.cm(
    test['x'], test['y'],
    class_names=["Background", "Peak"],
    savefig="confusion_matrix.png"
)

Step 5: Saving and Reloading

Essential: Before running explainability modules, save your trained model. The explainability tools often require a clean Keras model object or the path to a saved file.

from tensorflow import keras

# Save the trained model
builder.save("final_model.keras")

# --- LATER OR IN A NEW SCRIPT ---

# Reload the model for analysis
loaded_model = keras.models.load_model("final_model.keras")

# Reload data splits if needed
_, test, _ = cn.utility.PrepareData.load_splits_from_disk("processed_data/")

Step 6: Explainability Modules

This is where Cnnamon shines. Use the following modules to interpret what your model has learned.

A. Filter Visualization (Motif Discovery)

Extract sequence motifs using either the standard "Top Activating" method or the rigorous "Significant Activating" method. Crucially, you can export these motifs to check against known databases (like JASPAR) using TOMTOM.

# Method 1: Top Activating (Faster, Standard)
# Extracts motifs from the top 5% of activating sequences
motifs_standard = cn.CNN1D.FilterVisualize.top_activating(
    loaded_model, 
    test, 
    percentile=95.0,
    n_cores=4
)
motifs_standard.to_motifs(savefig="motifs_standard.png")

# Method 2: Significant Activating (Rigorous, Slower)
# Uses permutation testing to find statistically significant motifs
motifs_sig = cn.CNN1D.FilterVisualize.significant_activating(
    loaded_model, 
    test, 
    n_perturbations=200,    # Shuffles per filter
    q_value_cutoff=0.05,    # FDR threshold
    n_cores=8
)
motifs_sig.to_motifs(savefig="motifs_significant.png")

# --- JASPAR / TOMTOM Validation ---
# Export significant motifs to MEME format for database comparison
motifs_sig.to_meme("significant_motifs.meme")
print("Motifs exported! Run TOMTOM to check matches against JASPAR.")

B. Filter Importance

Determine which filters actually drive model predictions by perturbing them and measuring loss increase.

# Rank filters by importance
importance = cn.CNN1D.FilterImportance(
    loaded_model, 
    test, 
    n_iterations=10, 
    n_cores=4
)

# Visualize ranking (Boxplot)
importance.boxplot(savefig="filter_importance.png")

C. Filter Clustering

Group redundant filters (e.g., multiple filters learning the same "GATA" motif) to simplify interpretation.

# Cluster filters based on activation similarity
clustering = cn.CNN1D.FilterClustering(
    loaded_model,
    test,
    linkage_method='ward'
)

# Plot circular dendrogram
clustering.plot_circlize(savefig="filter_tree.png")

# Get cluster assignments
clusters = clustering.get_clusters(n_clusters=5)

D. Filter Enrichment

Identify which filters are specifically associated with positive or negative classes.

# Run enrichment analysis
enrichment = cn.CNN1D.FilterEnrichment(
    loaded_model,
    test,
    method='fold_change',  # Uses Mann-Whitney U test
    n_cores=4
)

# Plot enrichment heatmap with significance markers (*)
enrichment.plot_heatmap(q_cutoff=0.05, savefig="enrichment.png")

E. Nucleotide Sensitivity (Mutagenesis)

Perform in-silico mutagenesis to see exactly which bases in a motif are critical for activation.

# Analyze sensitivity of significant motifs
sensitivity_df = cn.CNN1D.analyze_nucleotide_sensitivity(
    loaded_model, 
    motifs_sig,   # Use the MotifSet from Step 6A
    n_cores=4
)

# Plot boxplots of mutational impact
cn.CNN1D.plot_nucleotide_sensitivity(sensitivity_df, savefig="mutagenesis.png")

PrepareData

The PrepareData class handles genomic sequence data preparation for CNN training, including sequence extraction, one-hot encoding, and flexible dataset splitting.

Initialization

cn.utility.PrepareData(intervalfile, genomefasta, outdir, **kwargs)

Core Parameters:

intervalfile (str): Path to BED file. First 3 columns are coordinates; 4th is label (0/1 or pipe-separated for multilabel).
genomefasta (str): Path to reference genome FASTA.
outdir (str): Output directory for processed data.
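
For reference, a binary-label BED input might look like this (tab-separated; coordinates and labels are illustrative):

```text
chr1	10000	10300	1
chr1	52000	52300	0
chr2	7500	7800	1
```

For multilabel tasks, the fourth column instead holds pipe-separated labels, e.g. 0|1|0.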

Configuration Options (**kwargs):

split_segmentation (str): Strategy for splitting data. Options: 'random', 'chromosome', 'custom'. (Default: 'random')
ratios (list): Train/Val/Test split ratios (e.g., [0.6, 0.2, 0.2]). Normalized automatically if sum != 1.
augment_RC (bool): If True, adds Reverse Complement sequences to the dataset with same labels. (Default: False)
seed (int): Random seed for reproducibility.
save_splits (str): Set to "1" to save generated numpy arrays and info CSVs to disk. (Default: "0")
specificchr (str): Set to "1" to limit processing to chromosomes in chrlist.
chrlist (list/str): List of chromosomes to process if specificchr="1".
train_chr_list (list): List of chromosomes for training set (if split_segmentation='custom').
val_chr_list (list): List of chromosomes for validation set (if split_segmentation='custom').
test_chr_list (list): List of chromosomes for test set (if split_segmentation='custom').
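
The automatic ratio normalization mentioned above can be sketched in plain Python (normalize_ratios is a hypothetical helper for illustration, not part of the API):

```python
# Illustrative sketch: ratios that do not sum to 1 are rescaled so
# that they do, preserving their relative proportions.
def normalize_ratios(ratios):
    total = sum(ratios)
    return [r / total for r in ratios]

print(normalize_ratios([3, 1, 1]))  # [0.6, 0.2, 0.2]
```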

Accessible Attributes

After initialization/running, the following attributes are available:

.intervals (dict): Map of {chrom: [[start, end], ...]}.
.sequence_onehot (dict): Map of {chrom: numpy_array}.
.labels (dict): Map of {chrom: label_array}.
.info_map (dict): Map of {chrom: [info_strings]}.
.label_dim (int): Detected dimension of labels (1 for binary, >1 for multilabel).

Methods

run() → Tuple[Dict, Dict, Dict]

Execute pipeline. Returns three dictionaries (Train, Test, Validation). Each contains:

  • "x": Sequence tensor (N, L, 4)
  • "y": Label tensor (N, Classes)
  • "info": List of strings "chr,start,end" corresponding to each sample.

get_splits() → Tuple[Dict, Dict, Dict]

Retrieve the split data dictionaries without re-running sequence extraction. Valid only after run() has been executed or splits have been loaded from disk.

load_splits_from_disk(directory_path) → Tuple[Dict, Dict, Dict]
STATIC

Load splits previously saved with save_splits="1". Returns the same structure as run().

KerasModelBuilder

A wrapper for building, training, and evaluating Keras models from JSON.

Methods

cn.utility.KerasModelBuilder.from_json(json_path_or_dict)
CLASS METHOD

Initialize builder from a JSON file path or a dictionary.

build() → keras.Model

Compiles and returns the Keras model based on the configuration.

train(x_train, y_train, x_val=None, y_val=None) → History

Train the model using config parameters. Returns the Keras History object.

plot_history(title="Training History", savefig=None)

Plot training metrics (loss/accuracy) vs epochs.

summary()

Print Keras model summary.

Evaluation (builder.eval)

Access via builder.eval. Provides publication-ready plotting.

cm(x_test, y_test, class_names=None, title="Confusion Matrix", xlabel="Predicted", ylabel="Actual", savefig=None)

Plot row-normalized confusion matrix. Annotations include probability and raw count.

roc(x_test, y_test, class_names=None, title="ROC Curve", xlabel="False Positive Rate", ylabel="True Positive Rate", savefig=None)

Plot ROC curve (Binary classification only). Displays AUC in legend.

auc(x_test, y_test) → float

Calculate and return AUC score. Supports 'ovr' (One-vs-Rest) for multiclass.
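
For the binary case, the AUC is the probability that a randomly chosen positive outranks a randomly chosen negative. A small pure-Python sketch of that quantity (auc_binary is a hypothetical illustration, not the library's implementation):

```python
# Illustrative rank-based AUC: fraction of positive/negative pairs
# where the positive receives the higher score (ties count half).
def auc_binary(y_true, y_prob):
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc_binary([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```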

Filter Visualization

The cn.CNN1D.FilterVisualize module extracts sequence motifs from model filters.

Discovery Methods

cn.CNN1D.FilterVisualize.top_activating(model, data, percentile=90.0, include_all_positive=False, n_cores=1) → MotifSet

Extract motifs from sequences causing high activation.

percentile (float): Cutoff for top activations (default 90.0).
include_all_positive (bool): If True, uses all activations > 0 (ignores percentile).

cn.CNN1D.FilterVisualize.significant_activating(model, data, n_perturbations=1000, q_value_cutoff=1.0, n_cores=1) → MotifSet
RECOMMENDED

Extract motifs using permutation testing to filter significant activations.

n_perturbations (int): Number of shuffles per filter.
q_value_cutoff (float): FDR cutoff for subsequences.

cn.CNN1D.FilterVisualize.pos_activating(model, n_cores=1, background=None) → MotifSet

Construct consensus motifs strictly from positive filter weights.

cn.CNN1D.FilterVisualize.softmax(model, background=None) → MotifSet

Construct motifs by applying softmax to filter weights.
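
The idea can be sketched with NumPy: a position-wise softmax turns one filter's weight matrix into a valid frequency matrix (weights_to_pfm is a hypothetical helper, not Cnnamon's internal code):

```python
import numpy as np

# Illustrative sketch: convert a (kernel_size, 4) filter weight matrix
# into a position frequency matrix via a numerically stable softmax.
def weights_to_pfm(w):
    e = np.exp(w - w.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)       # rows sum to 1

rng = np.random.default_rng(0)
w = rng.normal(size=(20, 4))       # one filter: kernel_size=20, 4 bases
pfm = weights_to_pfm(w)
print(pfm.shape)                   # (20, 4)
```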

External Validation (TOMTOM/JASPAR)

Cnnamon integrates with the MEME Suite to validate discovered motifs against known biological databases (like JASPAR or HOCOMOCO).

to_meme(outfile)

Export all discovered motifs to a standard `.meme` file. This file can be used as input for TOMTOM.

# 1. Export motifs from Cnnamon
motifs_sig.to_meme("output/learned_motifs.meme")

# 2. Run TOMTOM (Terminal Command)
# Compare learned motifs against the JASPAR Core database
tomtom -no-ssc -oc output/tomtom_results \
       output/learned_motifs.meme \
       JASPAR2022_CORE_vertebrates_non-redundant_pfms_meme.txt

Redundancy Analysis

cn.CNN1D.FilterVisualize.filter_redundancy(model, plot=True, savefig=None, **kwargs) → DataFrame

Calculate Pearson correlation between raw filter weights. Returns similarity matrix.

cn.CNN1D.FilterVisualize.filter_RC_redundancy(model, plot=True, savefig=None, **kwargs) → DataFrame

Calculate Pearson correlation between filters and their Reverse Complements. Returns similarity matrix.
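
The underlying computation can be sketched with NumPy's corrcoef on flattened filter weights (a sketch based on the documented description, not the actual implementation):

```python
import numpy as np

# Illustrative sketch of filter_redundancy / filter_RC_redundancy:
# Pearson correlation between flattened filter weight matrices, and
# between each filter and the reverse complements of all filters
# (sequence axis and base axis both flipped).
rng = np.random.default_rng(1)
filters = rng.normal(size=(12, 20, 4))     # (n_filters, kernel_size, 4)

flat = filters.reshape(12, -1)             # one row per filter
sim = np.corrcoef(flat)                    # (12, 12) similarity matrix

rc_flat = filters[:, ::-1, ::-1].reshape(12, -1)
rc_sim = np.corrcoef(flat, rc_flat)[:12, 12:]   # filter vs RC-filter
print(sim.shape, rc_sim.shape)
```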

The MotifSet Object

Returned by discovery methods. Behaves like a dictionary of PFMs (Pandas DataFrames) but includes rich metadata.

Attributes

.subsequences (dict): The actual DNA subsequences driving activation for each filter.
.subseq_classes (dict): The class labels associated with each activating subsequence.
.subseq_info (dict): Genomic coordinates (chr, start, end) for each subsequence.
.q_values / .p_values (dict): Statistical significance (if using significant_activating).

Export Methods

to_motifs(savefig=None, figsize=None, **kwargs)

Plot Sequence Logos (Information Content) for all filters in a grid.

to_svgs(outdir)

Export each filter as an individual SVG file.

filter_redundancy(savefig=None, **kwargs)

Cluster learned PFMs by similarity (Pearson correlation of PWMs).
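
Sequence logos scale letters by per-position information content (IC = 2 - H bits for DNA, where H is the Shannon entropy of the base frequencies). A minimal sketch of that computation (not Cnnamon's internal code):

```python
import numpy as np

# Illustrative sketch of the information content behind sequence logos.
def information_content(pfm, eps=1e-9):
    h = -(pfm * np.log2(pfm + eps)).sum(axis=1)   # entropy per position
    return 2.0 - h                                 # max 2 bits for DNA

pfm = np.array([
    [0.25, 0.25, 0.25, 0.25],   # uniform position: ~0 bits
    [1.00, 0.00, 0.00, 0.00],   # fully conserved A: ~2 bits
])
print(np.round(information_content(pfm), 3))
```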

Nucleotide Sensitivity

cn.CNN1D.analyze_nucleotide_sensitivity(model, motifset, n_cores=1) → DataFrame

Perform in-silico mutagenesis on activating subsequences. Returns a DataFrame with 'FoldChange' for every mutation.

cn.CNN1D.plot_nucleotide_sensitivity(df, savefig=None)

Plot mutagenesis results as boxplots per filter position, showing the impact of mutations on activation.
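
The mutagenesis loop can be illustrated with a toy scoring function (everything below is hypothetical; the real module scores subsequences with the trained model's filters):

```python
# Illustrative in-silico mutagenesis: mutate each position of a
# subsequence to every other base and report the activation fold
# change relative to the original sequence.
BASES = "ACGT"

def score(seq):
    # stand-in for a filter's activation: reward matches to "GATA"
    return 1 + sum(a == b for a, b in zip(seq, "GATA"))

def mutagenesis(seq):
    base_score = score(seq)
    rows = []
    for pos, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                mutant = seq[:pos] + alt + seq[pos + 1:]
                rows.append((pos, ref, alt, score(mutant) / base_score))
    return rows

results = mutagenesis("GATA")
print(len(results))   # 4 positions x 3 alternative bases = 12 mutations
```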

Filter Importance

Rank filters by their contribution to model prediction accuracy.

cn.CNN1D.FilterImportance(model, testset, n_iterations=10, method='mean', n_cores=1, batch_size=32)

Initialize and run perturbation experiment.

n_iterations (int): Number of random perturbations per filter.
method (str): Aggregation method for loss ('mean' or 'median').
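
The perturbation logic can be illustrated with a toy linear "model": knock out one weight, re-measure the loss, and rank by the increase (purely illustrative; Cnnamon's actual perturbation scheme differs):

```python
import numpy as np

# Illustrative sketch: importance of a "filter" is the loss increase
# caused by knocking it out, relative to the unperturbed baseline.
rng = np.random.default_rng(2)
weights = np.array([3.0, 0.1, 1.5])          # three toy "filters"
x = rng.normal(size=(100, 3))
y = x @ weights                               # ground-truth signal

def loss(w):
    return np.mean((x @ w - y) ** 2)

baseline = loss(weights)                      # 0.0: toy model is perfect
importance = {}
for i in range(3):
    perturbed = weights.copy()
    perturbed[i] = 0.0                        # knock the filter out
    importance[f"filter_{i}"] = loss(perturbed) - baseline

ranking = sorted(importance, key=importance.get, reverse=True)
print(ranking)                                # filter_0 first: largest weight
```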

Attributes

.filter_importance_ranking (list): Filter names ordered by importance.
.importance_score (Series): The loss increase caused by perturbing each filter.
.perturbation_results (DataFrame): Raw loss values for every iteration.

Methods

boxplot(savefig=None, **kwargs)

Plot importance distribution (boxplots) vs Baseline Loss.

violin(savefig=None, **kwargs)

Plot importance distribution (violin plots) vs Baseline Loss.

Filter Clustering

Group filters by activation profile similarity.

cn.CNN1D.FilterClustering(model, testset, target_layer=None, linkage_method='ward')

Initialize clustering analysis.

linkage_method (str): Scipy linkage method ('ward', 'average', 'complete', etc.).

Attributes

.optimal_k (int): Suggested number of clusters based on silhouette score.
.fingerprints (DataFrame): Raw filter activation profiles.
.norm_fingerprints (DataFrame): Z-score normalized fingerprints.
.newick_tree (str): Tree structure in Newick format.

Methods

suggest_optimal_clusters(max_k=10, plot=True) → int

Run silhouette analysis to determine optimal cluster count.
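
The silhouette scan can be sketched with scikit-learn (already a Cnnamon dependency); this illustrates the idea, not the class's internal code:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Illustrative silhouette scan over candidate cluster counts.
rng = np.random.default_rng(3)
# toy "filter fingerprints": two well-separated groups of 6 filters
fingerprints = np.vstack([rng.normal(0, 0.3, (6, 5)),
                          rng.normal(4, 0.3, (6, 5))])

scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(fingerprints)
    scores[k] = silhouette_score(fingerprints, labels)

optimal_k = max(scores, key=scores.get)    # expect 2 for two clean blobs
print(optimal_k)
```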

plot_heatmap(savefig=None)

Plot clustered heatmap of filter activations.

plot_circlize(n_clusters=None, savefig=None)

Plot circular dendrogram with cluster highlighting.

get_clusters(n_clusters=None) → DataFrame

Return a DataFrame mapping filter_name to cluster_id.

Filter Enrichment

Identify filters enriched for specific output classes.

cn.CNN1D.FilterEnrichment(model, data, motifset=None, universe='activated', method='fold_change', n_cores=1)

Parameters:

method (str): 'fold_change' (Mann-Whitney U test) or 'odds_ratio' (Fisher's exact test).
motifset (MotifSet): Required if method='odds_ratio'.
universe (str): Background for Fisher test. 'activated' (only activating sequences) or 'global' (all sequences).

Attributes

.df (DataFrame): Main results table. Contains Log Fold Changes or Odds Ratios.
.pvals (DataFrame): Raw P-values.
.qvals (DataFrame): FDR-corrected Q-values.
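
The q-values are FDR-corrected p-values. A minimal Benjamini-Hochberg sketch (illustrative; not necessarily Cnnamon's exact procedure):

```python
import numpy as np

# Illustrative Benjamini-Hochberg FDR correction: each p-value is
# scaled by n/rank, then monotonicity is enforced from the largest
# p-value downward, and results are capped at 1.
def bh_qvalues(pvals):
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)         # p * n / rank
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    q = np.empty(n)
    q[order] = np.minimum(ranked, 1.0)
    return q

q = bh_qvalues([0.001, 0.01, 0.03, 0.2])
print(np.round(q, 4))
```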

Methods

plot_heatmap(q_cutoff=0.05, savefig=None, **kwargs)

Plot enrichment heatmap. Filters with q-value ≤ q_cutoff are marked with (*).