Metadata-Version: 2.4
Name: variantflow
Version: 1.0.0
Summary: A production-quality platform for downstream genomic variant interpretation and prioritization
Author-email: Robin Tomar <itsrobintomar@gmail.com>
Maintainer-email: Robin Tomar <itsrobintomar@gmail.com>
License: MIT
Project-URL: Homepage, https://github.com/imrobintomar/VariantFlow
Project-URL: Repository, https://github.com/imrobintomar/VariantFlow
Project-URL: Bug Tracker, https://github.com/imrobintomar/VariantFlow/issues
Project-URL: Documentation, https://github.com/imrobintomar/VariantFlow/wiki
Keywords: bioinformatics,genomics,variant,annotation,ACMG,ClinVar,InterVar,ANNOVAR,prioritization,rare-variant
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=2.1
Requires-Dist: numpy>=1.26
Requires-Dist: scipy>=1.11
Requires-Dist: plotly>=5.18
Requires-Dist: dash>=2.14
Requires-Dist: dash-bootstrap-components>=1.5
Requires-Dist: gseapy>=1.1
Requires-Dist: openpyxl>=3.1
Requires-Dist: reportlab>=4.0
Requires-Dist: jinja2>=3.1
Requires-Dist: pydantic>=2.5
Requires-Dist: pydantic-settings>=2.1
Requires-Dist: typer[all]>=0.9
Requires-Dist: rich>=13.7
Requires-Dist: loguru>=0.7
Requires-Dist: tqdm>=4.66
Requires-Dist: xlsxwriter>=3.1
Requires-Dist: kaleido>=0.2
Requires-Dist: goatools>=1.4
Requires-Dist: requests>=2.31
Requires-Dist: python-dotenv>=1.0
Provides-Extra: dev
Requires-Dist: pytest>=7.4; extra == "dev"
Requires-Dist: pytest-cov>=4.1; extra == "dev"
Requires-Dist: pytest-mock>=3.12; extra == "dev"
Requires-Dist: black>=23.0; extra == "dev"
Requires-Dist: ruff>=0.1; extra == "dev"
Requires-Dist: mypy>=1.7; extra == "dev"
Requires-Dist: pre-commit>=3.5; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.5; extra == "docs"
Requires-Dist: mkdocs-material>=9.4; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.24; extra == "docs"
Dynamic: license-file

<div align="center">

# VariantFlow

**A production-quality platform for downstream genomic variant interpretation and prioritization**

[![Python 3.11+](https://img.shields.io/badge/Python-3.11%2B-blue?logo=python&logoColor=white)](https://www.python.org/)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
[![CI](https://github.com/imrobintomar/VariantFlow/actions/workflows/ci.yml/badge.svg)](https://github.com/imrobintomar/VariantFlow/actions)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![Genome Build](https://img.shields.io/badge/Genome-hg38%20%7C%20hg19-orange)](https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40)

<br/>

*Accepts ANNOVAR multianno files and automatically performs variant filtering, ACMG classification,
candidate gene prioritization, and pathway enrichment, then generates publication-ready reports and
interactive dashboards, with full reproducibility tracking.*

<br/>

**Made with ❤️ in INDIA** &nbsp;||&nbsp; **Dr Prabudh Goel Lab, AIIMS New Delhi**

</div>

---

## Table of Contents

- [Overview](#overview)
- [Key Features](#key-features)
- [Architecture](#architecture)
- [Installation](#installation)
- [Quick Start](#quick-start)
- [CLI Reference](#cli-reference)
- [Input Formats](#input-formats)
- [Output Structure](#output-structure)
- [Configuration](#configuration)
- [Variant Scoring](#variant-scoring)
- [Pathway Enrichment](#pathway-enrichment)
- [Dashboard](#dashboard)
- [Cohort and Family Analysis](#cohort-and-family-analysis)
- [Reproducibility](#reproducibility)
- [Citation](#citation)
- [Contributing](#contributing)
- [License](#license)

---

## Overview

VariantFlow is a modular, extensible Python platform designed for **downstream analysis of ANNOVAR-annotated genomic variant files**. It is built for clinical genomics research and is intended for publication in journals such as *BMC Genomics*, *Bioinformatics*, and *Briefings in Bioinformatics*.

The platform takes a standard ANNOVAR multianno file as input and executes a complete, auditable analysis pipeline — from raw variant filtering through to HTML, Excel, and PDF reports — without requiring any manual column mapping or configuration.

---

## Key Features

| Feature | Description |
|---|---|
| **Automatic column detection** | ColumnMapper uses regex pattern matching across 50+ field types — never hardcodes ANNOVAR column names. Supports gnomAD v2/v3/v4.1.1, ClinVar date-stamped columns, SIFT4G, and more |
| **Multi-tier filtering** | Sequential quality → population frequency → functional consequence → exonic consequence → ACMG benign removal pipeline with full audit trail |
| **ClinVar interpretation** | Parses CLNSIG text; auto-detects presence-flag columns and falls back to InterVar classification |
| **InterVar ACMG** | Full evidence extraction — PVS1, PS1–4, PM1–6, PP1–5, BA1, BS1–4, BP1–7 |
| **Transparent scoring** | Configurable multi-factor variant score with per-variant breakdown |
| **Gene prioritization** | Ranked candidate gene tables with natural-language score explanations |
| **GO / KEGG / Reactome** | Enrichment via gseapy with bubble plots and bar charts per database |
| **Interactive dashboard** | Multi-page Dash app with live filters, drill-down tables, and export |
| **Multi-format reports** | HTML, Excel (multi-sheet), PDF — all with lab branding |
| **Cohort analysis** | Shared/unique variants, gene burden, recurrent genes across samples |
| **Family analysis** | De novo, autosomal recessive, compound heterozygous, and X-linked detection |
| **Reproducibility** | `project.json` manifest + auto-generated methods text for manuscripts |
| **3D visualizations** | 3D variant landscape (Score × CADD × REVEL) and 3D pathway landscape |
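
The multi-tier filtering row above describes a sequential pipeline with an audit trail. A minimal sketch of that idea follows, assuming variants are plain dictionaries. The field names (`dp`, `af`, `func`), the comparison operators, and the `run_filters` helper are hypothetical illustrations, not VariantFlow's actual API.

```python
from typing import Callable

Variant = dict
Filter = tuple[str, Callable[[Variant], bool]]


def run_filters(variants: list[Variant], filters: list[Filter]):
    """Apply filters in order and record how many variants survive each step."""
    audit = [{"step": "input", "remaining": len(variants)}]
    current = variants
    for name, keep in filters:
        current = [v for v in current if keep(v)]
        audit.append({"step": name, "remaining": len(current)})
    return current, audit


# Hypothetical thresholds that mirror the CLI defaults (--min-dp 10, --af 0.01).
filters: list[Filter] = [
    ("quality", lambda v: v["dp"] >= 10),
    ("population_frequency", lambda v: v["af"] < 0.01),
    ("functional_consequence", lambda v: v["func"] in {"exonic", "splicing"}),
]

variants = [
    {"id": "v1", "dp": 35, "af": 0.0002, "func": "exonic"},
    {"id": "v2", "dp": 6, "af": 0.0001, "func": "exonic"},
    {"id": "v3", "dp": 40, "af": 0.12, "func": "exonic"},
    {"id": "v4", "dp": 22, "af": 0.004, "func": "intronic"},
]

kept, audit = run_filters(variants, filters)
```

Each entry in `audit` records the step name and the number of variants remaining, which is the kind of information a filtering funnel plot needs.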

---

## Architecture

```
variantflow/
├── core/               # Data models (Pydantic), exceptions, logging, pipeline orchestrator
├── io/                 # ColumnMapper, MultiannoReader — auto-detect all ANNOVAR fields
├── filters/            # Quality, population frequency, functional, exonic, ACMG filters
├── annotations/        # ClinVar engine, InterVar ACMG evidence parser
├── scoring/            # Transparent multi-factor variant scorer
├── prioritization/     # Gene ranker with score explanation
├── enrichment/         # GO BP/MF/CC, KEGG, Reactome via gseapy
├── statistics/         # Summary statistics engine → statistics.json
├── visualization/      # Plotly 2D figures + 3D landscapes + enrichment plots
├── dashboard/          # Multi-page Dash app with live callbacks
├── reports/            # HTML (Jinja2), Excel (openpyxl), PDF (ReportLab)
├── cohort/             # Multi-sample shared/unique/burden analysis
├── family/             # Pedigree-based inheritance detection
├── config/             # Pydantic v2 settings — fully configurable, env-var overridable
└── cli/                # Typer CLI — analyze / dashboard / cohort / family
```

---

## Installation

### From source (recommended)

```bash
git clone https://github.com/imrobintomar/VariantFlow.git
cd VariantFlow
pip install -e . --no-build-isolation
```

### Dependencies

```bash
pip install pandas numpy scipy plotly dash dash-bootstrap-components \
            gseapy openpyxl reportlab jinja2 pydantic pydantic-settings \
            typer rich loguru tqdm xlsxwriter kaleido goatools requests \
            python-dotenv
```

### Docker

```bash
docker build -t variantflow:1.0.0 .
docker run --rm -v "$(pwd)/data:/data" -v "$(pwd)/results:/results" \
  variantflow:1.0.0 analyze /data/sample.hg38_multianno.txt --output /results
```

---

## Quick Start

```bash
# Single-sample analysis
python variantflow_run.py analyze sample.hg38_multianno.txt \
  --output results/ --sample-id SAMPLE01

# Launch interactive dashboard
python variantflow_run.py dashboard results/ --port 8050

# Cohort analysis (directory of multianno files)
python variantflow_run.py cohort cohort_dir/ --output cohort_results/

# Family / trio analysis
python variantflow_run.py family family_dir/ \
  --proband PROBAND01 --father DAD01 --mother MOM01
```

---

## CLI Reference

### `analyze`

```
python variantflow_run.py analyze <input_file> [OPTIONS]

Arguments:
  input_file          ANNOVAR multianno file (.txt or .txt.gz)

Options:
  -o, --output        Output directory           [default: variantflow_results]
  -s, --sample-id     Sample identifier          [default: sample]
  -g, --genome        Genome build: hg38 / hg19  [default: hg38]
  --af                AF threshold (rare variant) [default: 0.01]
  --min-dp            Minimum read depth          [default: 10]
  --nonframeshift     Include nonframeshift indels
  --no-enrichment     Skip pathway enrichment
  --no-pdf            Skip PDF report
  -c, --config        JSON configuration file
  -v, --verbose       Verbose logging
```

### `dashboard`

```
python variantflow_run.py dashboard <results_dir> [OPTIONS]

Options:
  --host    Dashboard host  [default: 127.0.0.1]
  --port    Dashboard port  [default: 8050]
  --debug   Enable debug mode
```

### `cohort`

```
python variantflow_run.py cohort <cohort_dir> [OPTIONS]

Options:
  -o, --output   Output directory  [default: cohort_results]
  --pattern      File glob pattern [default: *.txt]
```

### `family`

```
python variantflow_run.py family <family_dir> [OPTIONS]

Options:
  -p, --proband  Proband sample ID  [required]
  -f, --father   Father sample ID
  -m, --mother   Mother sample ID
  -o, --output   Output directory  [default: family_results]
```

---

## Input Formats

VariantFlow accepts standard ANNOVAR multianno files:

| Format | Example |
|---|---|
| Plain text (hg38) | `sample.hg38_multianno.txt` |
| Plain text (hg19) | `sample.hg19_multianno.txt` |
| Gzip-compressed | `sample.hg38_multianno.txt.gz` |
| Tab-separated (`.tsv` extension) | `sample.hg38_multianno.tsv` |

**Automatically detected fields include:**

- Genomic coordinates: `Chr`, `Start`, `End`, `Ref`, `Alt`
- Gene annotations: `Gene.refGene`, `Func.refGene`, `ExonicFunc.refGene`, `AAChange.refGene`
- Population frequencies: `gnomad411_exome_AF`, `gnomAD_exome_ALL`, `ExAC_ALL`, `1000g2015aug_all`
- ClinVar: `CLNSIG`, `clinvar_20260503` (date-stamped), `CLNDN`
- InterVar: `InterVar_automated`, `InterVar_ACMG`
- Predictors: `REVEL_score`, `CADD_phred`, `SIFT_score`, `SIFT4G_score`, `Polyphen2_HDIV_score`
- Other: `GERP++_RS`, `phyloP100way_vertebrate`, `MutationTaster_pred`, `SpliceAI_DS_max`
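
Pattern-based detection of the fields listed above could look roughly like the sketch below. The patterns, the logical field names, and the `map_columns` helper are assumptions made for illustration; the real ColumnMapper covers many more field types and may work differently.

```python
import re

# Hypothetical patterns for four logical fields (illustration only).
FIELD_PATTERNS = {
    "gnomad_af": re.compile(r"^gnomad\d*_(exome|genome)_(AF|ALL)$", re.IGNORECASE),
    "clinvar_sig": re.compile(r"^(CLNSIG|clinvar_\d{8})$", re.IGNORECASE),
    "revel": re.compile(r"^REVEL(_score)?$", re.IGNORECASE),
    "cadd_phred": re.compile(r"^CADD_phred$", re.IGNORECASE),
}


def map_columns(header: list[str]) -> dict[str, str]:
    """Return {logical_field: first header column matching its pattern}."""
    mapping = {}
    for field, pattern in FIELD_PATTERNS.items():
        for column in header:
            if pattern.match(column):
                mapping[field] = column
                break
    return mapping


header = ["Chr", "Start", "End", "Ref", "Alt", "gnomad411_exome_AF",
          "clinvar_20260503", "REVEL_score", "CADD_phred"]
mapping = map_columns(header)
```

Because matching is done on patterns rather than fixed names, the same code would pick up `gnomAD_exome_ALL` or a differently dated ClinVar column without changes.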

---

## Output Structure

```
results/
├── report.html                  # Self-contained interactive HTML report
├── VariantFlow_Report.xlsx      # Multi-sheet Excel workbook
│   ├── Summary                  # Key statistics and metadata
│   ├── CandidateVariants        # Top 500 variants ranked by score
│   ├── CandidateGenes           # Ranked candidate genes
│   ├── ClinVar_Pathogenic       # Pathogenic / Likely Pathogenic variants
│   ├── InterVar_Pathogenic      # ACMG Pathogenic / Likely Pathogenic variants
│   └── go_* / kegg / reactome   # Enrichment results per database
├── report.pdf                   # PDF report with tables and methods
├── CandidateVariants.tsv        # Tab-separated candidate variants
├── CandidateGenes.tsv           # Tab-separated ranked genes
├── statistics.json              # Full summary statistics
├── project.json                 # Reproducibility manifest
├── methods.txt                  # Auto-generated methods section
└── figures/
    ├── filtering_funnel.html
    ├── clinvar_distribution.html
    ├── intervar_distribution.html
    ├── gene_ranking.html
    ├── chromosome_distribution.html
    ├── variant_score_histogram.html
    ├── af_distribution.html
    ├── acmg_evidence.html
    ├── variant_landscape_3d.html
    ├── enrichment_go_biological_process_dot.html
    ├── enrichment_go_biological_process_bar.html
    ├── enrichment_go_cellular_component_dot.html
    ├── enrichment_kegg_dot.html
    ├── enrichment_reactome_dot.html
    └── pathway_landscape_3d.html
```

---

## Configuration

VariantFlow uses a Pydantic v2 settings system. All parameters can be overridden via:

1. **JSON config file** (`--config my_config.json`)
2. **Environment variables** (prefix `VF_`)

### Example `config.json`

```json
{
  "project_name": "Rare Disease Study",
  "sample_id": "PATIENT_001",
  "genome_build": "hg38",
  "output_dir": "results/",
  "filters": {
    "active_af_threshold": 0.001,
    "min_dp": 20,
    "include_nonframeshift": true
  },
  "scoring": {
    "clinvar_pathogenic": 10.0,
    "revel_high": 3.0,
    "cadd_very_high": 3.0
  },
  "enrichment": {
    "organism": "human",
    "qvalue_cutoff": 0.05,
    "top_n_terms": 20
  }
}
```

### Environment variable override

```bash
export VF_FILTERS__ACTIVE_AF_THRESHOLD=0.001
export VF_FILTERS__MIN_DP=20
export VF_LOG_LEVEL=DEBUG
python variantflow_run.py analyze sample.txt
```
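
In the variable names above, the `VF_` prefix marks a VariantFlow setting and the double underscore separates nesting levels, so `VF_FILTERS__MIN_DP` targets `filters.min_dp`. The standard-library sketch below only illustrates that name mapping. pydantic-settings offers `env_prefix` and `env_nested_delimiter` options for this purpose, and whether VariantFlow configures it exactly this way is an assumption. The `overrides_from_env` helper is hypothetical.

```python
def overrides_from_env(environ: dict[str, str], prefix: str = "VF_",
                       delimiter: str = "__") -> dict:
    """Turn prefixed environment variables into a nested dict of overrides."""
    overrides: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated variable, e.g. PATH
        keys = name[len(prefix):].lower().split(delimiter)
        node = overrides
        for key in keys[:-1]:
            node = node.setdefault(key, {})
        # Values stay strings here; a settings model would coerce the types.
        node[keys[-1]] = value
    return overrides


env = {
    "VF_FILTERS__ACTIVE_AF_THRESHOLD": "0.001",
    "VF_FILTERS__MIN_DP": "20",
    "VF_LOG_LEVEL": "DEBUG",
    "PATH": "/usr/bin",
}
overrides = overrides_from_env(env)
```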

---

## Variant Scoring

VariantFlow uses a transparent, configurable multi-factor scoring system. Every score contribution is stored in a `score_breakdown` column for full auditability.

| Source | Criterion | Score |
|---|---|---|
| ClinVar | Pathogenic | +10.0 |
| ClinVar | Likely Pathogenic | +8.0 |
| ClinVar | VUS | +3.0 |
| ClinVar | Likely Benign | -2.0 |
| ClinVar | Benign | -5.0 |
| InterVar | Pathogenic | +8.0 |
| InterVar | Likely Pathogenic | +6.0 |
| Consequence | Stop-gain / Stop-loss / Start-loss | +5.0 |
| Consequence | Frameshift indel | +5.0 |
| Consequence | Splicing | +3.0 |
| Consequence | Nonsynonymous SNV | +2.0 |
| Population AF | < 0.0001 (ultra-rare) | +4.0 |
| Population AF | < 0.001 (very rare) | +3.0 |
| Population AF | < 0.01 (rare) | +1.5 |
| REVEL | ≥ 0.75 | +3.0 |
| REVEL | 0.50 to < 0.75 | +1.5 |
| CADD | ≥ 30 | +3.0 |
| CADD | 20 to < 30 | +2.0 |
| SIFT | Deleterious | +1.0 |
| PolyPhen-2 | Damaging | +1.0 |

All weights are configurable in `config.json` under the `scoring` key.

> **Note:** Variants classified as **Benign** or **Likely Benign** by InterVar are automatically removed from the candidate set after annotation.
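
To make the additive scheme concrete, here is a minimal sketch that applies a subset of the weights from the table and keeps a per-variant breakdown. Only `clinvar_pathogenic`, `revel_high`, and `cadd_very_high` appear as keys in the example `config.json`; the other key names, the input field names, and the `score_variant` helper are hypothetical. The sketch also assumes that the tiers within one source are mutually exclusive, which the table does not state explicitly.

```python
# Weights copied from the table above; only a subset of criteria is shown.
WEIGHTS = {
    "clinvar_pathogenic": 10.0,
    "af_ultra_rare": 4.0,
    "af_very_rare": 3.0,
    "af_rare": 1.5,
    "revel_high": 3.0,
    "revel_moderate": 1.5,
    "cadd_very_high": 3.0,
    "cadd_high": 2.0,
}


def score_variant(variant: dict, weights: dict = WEIGHTS) -> tuple[float, dict]:
    """Return (total score, breakdown of contributions) for one variant."""
    breakdown = {}
    if variant.get("clinvar") == "Pathogenic":
        breakdown["clinvar_pathogenic"] = weights["clinvar_pathogenic"]

    af = variant.get("af")
    if af is not None:
        # Assumption: only the rarest matching tier contributes.
        if af < 0.0001:
            breakdown["af_ultra_rare"] = weights["af_ultra_rare"]
        elif af < 0.001:
            breakdown["af_very_rare"] = weights["af_very_rare"]
        elif af < 0.01:
            breakdown["af_rare"] = weights["af_rare"]

    revel = variant.get("revel")
    if revel is not None:
        if revel >= 0.75:
            breakdown["revel_high"] = weights["revel_high"]
        elif revel >= 0.50:
            breakdown["revel_moderate"] = weights["revel_moderate"]

    cadd = variant.get("cadd")
    if cadd is not None:
        if cadd >= 30:
            breakdown["cadd_very_high"] = weights["cadd_very_high"]
        elif cadd >= 20:
            breakdown["cadd_high"] = weights["cadd_high"]

    return sum(breakdown.values()), breakdown


total, breakdown = score_variant(
    {"clinvar": "Pathogenic", "af": 0.00005, "revel": 0.81, "cadd": 27.3}
)
```

For this example variant the contributions are 10.0 (ClinVar Pathogenic), 4.0 (ultra-rare), 3.0 (REVEL ≥ 0.75), and 2.0 (CADD between 20 and 30), for a total of 19.0.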

---

## Pathway Enrichment

Enrichment analysis is performed using [gseapy](https://gseapy.readthedocs.io/) against:

| Database | Gene Sets |
|---|---|
| Gene Ontology | GO Biological Process 2023 |
| Gene Ontology | GO Molecular Function 2023 |
| Gene Ontology | GO Cellular Component 2023 |
| KEGG | KEGG 2021 Human |
| Reactome | Reactome 2022 |

Each database produces:
- **Bubble plot** — x = -log₁₀(adj. p-value), size = gene count, color = odds ratio
- **Bar chart** — ranked terms colored by gene count
- **Full results table** with export

Significance threshold: adjusted p-value ≤ 0.2 (Benjamini-Hochberg). Requires ≥ 5 candidate genes.
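
For readers unfamiliar with the Benjamini-Hochberg correction, the sketch below shows how adjusted p-values are derived and compared against the 0.2 threshold. It is a plain-Python illustration with made-up p-values; enrichment tools typically report the adjusted values themselves, so VariantFlow may not compute them this way.

```python
def benjamini_hochberg(pvalues: list[float]) -> list[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvalues = [0.001, 0.04, 0.03, 0.5]  # made-up raw p-values for four terms
adjusted = benjamini_hochberg(pvalues)
significant = [q <= 0.2 for q in adjusted]
```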

---

## Dashboard

The interactive Dash dashboard (`http://127.0.0.1:8050`) provides eight analysis pages:

| Page | Content |
|---|---|
| **Overview** | KPI cards, filtering funnel, ClinVar/InterVar/chromosome distribution |
| **Variant Explorer** | Live-filtered table with score slider, ClinVar and InterVar dropdowns, histogram |
| **Genes** | Ranked bar chart (color-coded by ClinVar), Top N slider, full gene table |
| **ClinVar** | Classification distribution pie chart, filtered variant table |
| **InterVar** | ACMG classification bar chart, evidence criterion heatmap |
| **Enrichment** | Bubble + bar plots per database (GO CC, GO BP, GO MF, KEGG, Reactome) |
| **3D Landscape** | 3D variant landscape (Score × CADD × REVEL) and 3D pathway landscape |
| **Chromosome** | Variant density by chromosome |

All tables support column filtering, sorting, and Excel export.

---

## Cohort and Family Analysis

### Cohort

```bash
python variantflow_run.py cohort cohort_dir/ --output cohort_results/
```

Outputs:
- `cohort_shared_variants.tsv` — variants present in ≥ 2 samples
- `cohort_unique_variants.tsv` — sample-private variants
- `cohort_gene_burden.tsv` — per-gene variant counts per sample
- `cohort_recurrent_genes.tsv` — genes affected in ≥ 2 samples
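
The definitions above (shared means present in ≥ 2 samples, recurrent means a gene affected in ≥ 2 samples) can be sketched with plain sets. The sample data and the tuple layout `(chr, pos, ref, alt, gene)` are hypothetical; this is not the cohort module's actual implementation.

```python
from collections import defaultdict

# Hypothetical per-sample variant sets keyed by (chr, pos, ref, alt, gene).
samples = {
    "S1": {("chr1", 100, "A", "G", "GENE1"), ("chr2", 200, "C", "T", "GENE2")},
    "S2": {("chr1", 100, "A", "G", "GENE1"), ("chr3", 300, "G", "A", "GENE2")},
    "S3": {("chr4", 400, "T", "C", "GENE3")},
}

carriers = defaultdict(set)      # variant -> samples carrying it
gene_samples = defaultdict(set)  # gene -> samples with at least one variant
for sample_id, variants in samples.items():
    for variant in variants:
        carriers[variant].add(sample_id)
        gene_samples[variant[4]].add(sample_id)

shared = {v for v, s in carriers.items() if len(s) >= 2}
unique = {v for v, s in carriers.items() if len(s) == 1}
recurrent_genes = sorted(g for g, s in gene_samples.items() if len(s) >= 2)
```

Note that `GENE2` counts as recurrent even though the two samples carry different variants in it, which is the distinction between shared variants and recurrent genes.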

### Family (Trio/Quad)

```bash
python variantflow_run.py family family_dir/ \
  --proband PROBAND --father FATHER --mother MOTHER
```

Detects and outputs:
- `family_de_novo.tsv`
- `family_autosomal_recessive.tsv`
- `family_compound_heterozygous.tsv`
- `family_x_linked.tsv`
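
As a rough illustration of trio-based detection, the sketch below classifies a single autosomal variant from unphased genotype strings. The genotype encoding and the `inheritance_pattern` helper are assumptions. A real implementation also has to handle missing calls, phased genotypes, and sex chromosomes, and compound heterozygosity requires looking at pairs of variants within a gene.

```python
from typing import Optional

HOM_REF, HET, HOM_ALT = "0/0", "0/1", "1/1"


def inheritance_pattern(proband: str, father: str, mother: str) -> Optional[str]:
    """Classify one autosomal variant from trio genotypes (simplified)."""
    # De novo: the proband carries the alternate allele, neither parent does.
    if proband in (HET, HOM_ALT) and father == HOM_REF and mother == HOM_REF:
        return "de_novo"
    # Autosomal recessive: homozygous proband, both parents are carriers.
    if proband == HOM_ALT and father == HET and mother == HET:
        return "autosomal_recessive"
    return None


calls = {
    "varA": inheritance_pattern("0/1", "0/0", "0/0"),
    "varB": inheritance_pattern("1/1", "0/1", "0/1"),
    "varC": inheritance_pattern("0/1", "0/1", "0/0"),
}
```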

---

## Reproducibility

Every analysis generates a `project.json` manifest containing:

```json
{
  "run_id": "16e51f54",
  "variantflow_version": "1.0.0",
  "created_at": "2026-06-03T10:36:51",
  "python_version": "3.13.11",
  "genome_build": "hg38",
  "input_files": ["sample.hg38_multianno.txt"],
  "filters_applied": ["quality", "population_frequency", "functional_consequence",
                       "exonic_consequence", "acmg_benign_removal"],
  "total_input_variants": 86299,
  "total_output_variants": 917,
  "total_candidate_genes": 100,
  "config": { "..." }
}
```

A `methods.txt` file is also generated, ready to paste into a manuscript Methods section.
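
Because the manifest is plain JSON, it is easy to consume from other scripts. The sketch below parses a trimmed copy of the example manifest and derives a one-line run summary. In practice the text would be read from `project.json` in the results directory, and the summary format is only an illustration.

```python
import json

# Trimmed copy of the example manifest shown above.
manifest_text = """
{
  "run_id": "16e51f54",
  "genome_build": "hg38",
  "filters_applied": ["quality", "population_frequency", "functional_consequence",
                       "exonic_consequence", "acmg_benign_removal"],
  "total_input_variants": 86299,
  "total_output_variants": 917
}
"""

manifest = json.loads(manifest_text)
retained = manifest["total_output_variants"] / manifest["total_input_variants"]
summary = (
    f"Run {manifest['run_id']} ({manifest['genome_build']}): "
    f"{manifest['total_output_variants']} of {manifest['total_input_variants']} "
    f"variants retained ({retained:.2%}) after "
    f"{len(manifest['filters_applied'])} filters"
)
```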

---

## Citation

If you use VariantFlow in your research, please cite:

> Tomar R. (2024). *VariantFlow: A production-quality platform for genomic variant interpretation and prioritization.* Dr Prabudh Goel Lab, AIIMS New Delhi. GitHub. https://github.com/imrobintomar/VariantFlow

---

## Contributing

Contributions are welcome. Please open an issue before submitting a pull request. All contributors must follow the existing code style (black, ruff) and include unit tests.

```bash
# Run tests
pytest tests/unit/ -v --cov=variantflow

# Lint and format
ruff check variantflow/
black variantflow/
```

---

## License

MIT License © 2024 Robin Tomar — Dr Prabudh Goel Lab, AIIMS New Delhi

See [LICENSE](LICENSE) for full terms.

---

<div align="center">
Made with ❤️ in INDIA &nbsp;||&nbsp; Dr Prabudh Goel Lab, AIIMS New Delhi
<br/>
<a href="https://github.com/imrobintomar/VariantFlow">github.com/imrobintomar/VariantFlow</a>
</div>
