Metadata-Version: 2.4
Name: sectioniq
Version: 0.1.0
Summary: Structured PDF retrieval with typed evidence blocks, hybrid search, and grounded citations.
License-Expression: Apache-2.0
Project-URL: Homepage, https://github.com/Vedant-Ratn-Nema/SectionIQ
Project-URL: Repository, https://github.com/Vedant-Ratn-Nema/SectionIQ
Project-URL: Issues, https://github.com/Vedant-Ratn-Nema/SectionIQ/issues
Project-URL: Changelog, https://github.com/Vedant-Ratn-Nema/SectionIQ/blob/main/CHANGELOG.md
Keywords: pdf,rag,retrieval,bm25,embeddings,citations
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Text Processing :: Indexing
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.26
Requires-Dist: pypdf>=5.0.0
Provides-Extra: openai
Requires-Dist: openai>=1.0.0; extra == "openai"
Provides-Extra: local
Requires-Dist: model2vec>=0.3.0; extra == "local"
Provides-Extra: pdf
Requires-Dist: pymupdf>=1.24.0; extra == "pdf"
Provides-Extra: ocr
Requires-Dist: ocrmypdf>=16.0.0; extra == "ocr"
Provides-Extra: dev
Requires-Dist: build>=1.2.0; extra == "dev"
Requires-Dist: mypy>=1.10.0; extra == "dev"
Requires-Dist: openai>=1.0.0; extra == "dev"
Requires-Dist: pymupdf>=1.24.0; extra == "dev"
Requires-Dist: pytest>=8.0.0; extra == "dev"
Requires-Dist: ruff>=0.4.0; extra == "dev"
Requires-Dist: twine>=5.0.0; extra == "dev"
Provides-Extra: bench
Requires-Dist: python-dotenv>=1.0.0; extra == "bench"
Dynamic: license-file

# SectionIQ

SectionIQ is a local-first Python library for structured PDF retrieval. It
ingests PDFs into typed evidence blocks, builds hybrid sparse+dense indexes, and
returns grounded answers with page/block citations.

The core design is deliberately not tree-first: hierarchy is used as context and
a ranking prior, while retrieval still fans out across sparse, dense, heading,
and table-aware signals.
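
As a rough illustration of how several ranked signals can be merged, the sketch below uses reciprocal rank fusion (RRF) with an optional multiplicative prior. This is an assumed stand-in, not SectionIQ's actual scoring code: the function name, the `k=60` constant, the signal names, and the `priors` argument are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, priors=None):
    """Fuse several ranked lists of block ids into one ranking.

    rankings: dict of signal name -> list of block ids, best first.
    priors:   optional dict of block id -> multiplicative boost
              (a hypothetical stand-in for a hierarchy-based prior).
    """
    scores = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, block_id in enumerate(ranked_ids, start=1):
            # Each signal contributes more for blocks it ranks higher.
            scores[block_id] += 1.0 / (k + rank)
    if priors:
        for block_id in scores:
            scores[block_id] *= priors.get(block_id, 1.0)
    # Sort by descending score; break ties by block id for determinism.
    return sorted(scores, key=lambda b: (-scores[b], b))


# Hypothetical per-signal rankings for one query.
rankings = {
    "sparse": ["b00780", "b01507", "b00742"],
    "dense": ["b00742", "b00780", "b00100"],
    "heading": ["b00742"],
}
fused = reciprocal_rank_fusion(rankings)
print(fused)
```

A block that several signals agree on rises to the top even if no single signal ranks it first, which is the behavior the paragraph above describes: hierarchy nudges the ranking without gating which blocks are considered.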

## Install

```bash
pip install sectioniq                # core: BM25 + hash embeddings, no API key needed
pip install "sectioniq[local]"       # + local semantic embeddings (model2vec, ~30MB, no torch)
pip install "sectioniq[pdf]"         # + layout-aware extraction (PyMuPDF, two-column support)
pip install "sectioniq[ocr]"         # + OCR for scanned PDFs (ocrmypdf; needs Tesseract)
pip install "sectioniq[openai]"      # + OpenAI embeddings, reranking, answer generation
```

Recommended local setup: `pip install "sectioniq[local,pdf]"`.

For local development:

```bash
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e ".[dev,bench]"
python -m pytest
```

## Try It in 60 Seconds

The session below runs against a public-domain U.S. Army maintenance manual
(127 pages), fully local, with no API key. Ingesting and indexing together take
well under a second:

```bash
pip install sectioniq
curl -L -o tm3.pdf "https://commons.wikimedia.org/wiki/Special:Redirect/file/TM-1-1500-204-23-3.pdf"

sectioniq ingest tm3.pdf --index
```

```
Ingested 'Tm 1 1500 204 23 3' (127 pages) -> 40ab9bf3-8025-5be2-8118-2d90fbf81c2c
  blocks=2977 tables=23 warnings=0 headings=1208
Indexed 2977 blocks from 1 document(s).
```

```bash
sectioniq search "fuel cell purging procedure"
```

```
1. [paragraph] Tm 1 1500 204 23 3 (p.38, 40ab9bf3...:b00780)
   6 After purging of the fuel cell has been completed, wait approximately two to
   three hours, and test fuel cell for the presence of dangerous fuel vapors...
2. [paragraph] Tm 1 1500 204 23 3 (p.73, 40ab9bf3...:b01507)
   Purge fuel cell prior to inspection and repair. Refer to paragraph 2-5f(4)
   purging procedure.
...
```

```bash
sectioniq answer "How should fuel cells be purged before maintenance?"
```

```
(4) Purging. Fuel cells may be purged and preserved by either of the following
methods. [Tm 1 1500 204 23 3 (p.37, 40ab9bf3...:b00742)]
...

Citations:
  - Tm 1 1500 204 23 3 (p.37, 40ab9bf3...:b00742)

(Extractive answer: verbatim evidence. Set OPENAI_API_KEY for synthesized answers.)
```

Everything works without an API key: search runs locally, and `answer` returns
extractive, citation-backed evidence. Set `OPENAI_API_KEY` for synthesized
answers and LLM reranking. Run `sectioniq info` to inspect the local store.

## Python Quick Start

```python
from sectioniq import SectionIQ

engine = SectionIQ(store_path=".sectioniq")
doc_id = engine.ingest("/path/to/public-manual.pdf")
engine.build_index()

hits = engine.search("What safety cautions apply before maintenance?", top_k=5)
for hit in hits:
    print(hit.block_type, hit.citation, hit.text_preview)

result = engine.answer("What safety cautions apply before maintenance?", top_k=5)
print(result.answer)
print(engine.get_citations(result))
```

## Configuration

SectionIQ picks the best available backends automatically:

| Setup | Embeddings | Reranker | Answers |
|---|---|---|---|
| No API key (core) | Deterministic hash (lexical) | Heuristic | Extractive with citations |
| No API key + `[local]` | model2vec semantic | Heuristic | Extractive with citations |
| `OPENAI_API_KEY` set | OpenAI embeddings | LLM reranker | Synthesized, grounded |
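
To make the "deterministic hash (lexical)" row in the table above concrete, here is a minimal sketch of a signed feature-hashing embedding. It shows the general technique only; SectionIQ's actual tokenizer, hash function, and dimensionality are not documented here, so `hash_embed`, `dim=256`, and the BLAKE2 hash are assumptions.

```python
import hashlib
import math
import re


def hash_embed(text, dim=256):
    """Map text to a unit-length vector by hashing each token to a bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h % dim                              # bucket from the low bits
        sign = 1.0 if (h >> 63) & 1 == 0 else -1.0   # sign from the top bit
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


a = hash_embed("fuel cell purging")
b = hash_embed("Purging fuel cell")
print(round(cosine(a, b), 6))
```

Because the vector depends only on which tokens appear, the same text always maps to the same vector without a model download, but paraphrases that share no tokens look unrelated. That is why the table labels this backend "lexical" rather than semantic.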

Optional environment variables:

- `SECTIONIQ_EMBEDDING_MODEL`
- `SECTIONIQ_LOCAL_EMBEDDING_MODEL` (default `minishlab/potion-base-8M`)
- `SECTIONIQ_RERANK_MODEL`
- `SECTIONIQ_LLM_MODEL`
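
The selection rules in the table can be sketched as a small function. This is an illustration of the documented behavior, not the library's internal code: `pick_backends`, its return keys, and the `find_spec` check for `model2vec` are hypothetical. Only `OPENAI_API_KEY`, `SECTIONIQ_LOCAL_EMBEDDING_MODEL`, and its default come from this README.

```python
import importlib.util
import os


def pick_backends(env=None, has_model2vec=None):
    """Return the backend names the README table implies for a given setup."""
    env = os.environ if env is None else env
    if has_model2vec is None:
        # Assumption: the [local] extra is detected by whether model2vec imports.
        has_model2vec = importlib.util.find_spec("model2vec") is not None

    if env.get("OPENAI_API_KEY"):
        return {"embeddings": "openai", "reranker": "llm", "answers": "synthesized"}

    backends = {
        "embeddings": "model2vec" if has_model2vec else "hash",
        "reranker": "heuristic",
        "answers": "extractive",
    }
    if has_model2vec:
        # Default model name taken from the environment variable list above.
        backends["local_model"] = env.get(
            "SECTIONIQ_LOCAL_EMBEDDING_MODEL", "minishlab/potion-base-8M"
        )
    return backends


print(pick_backends(env={}, has_model2vec=False))
```

Passing `env` and `has_model2vec` explicitly keeps the sketch testable without touching the real environment.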

## Benchmark Results

Measured on the public-domain U.S. Army `TM-1-1500-204-23` corpus (9 volumes,
2,231 pages, 62,813 blocks with layout-aware extraction) with 20 public
queries at `top_k=5`, fully local, no API key:

| System | Source recall@5 | Term recall@5 | p50 query latency |
|--------|-----------------|---------------|-----------|
| SectionIQ `[local,pdf]` (semantic + layout-aware) | **0.90** | **0.75** | 19 ms |
| SectionIQ `[pdf]` (hash + layout-aware) | 0.85 | 0.65 | 17 ms |
| SectionIQ core (hash + pypdf) | 0.85 | 0.60 | 12 ms |
| Naive 400-char chunk baseline | 0.75 | 0.50 | — |
| Tree-first hierarchy baseline | 0.60 | 0.35 | — |

Ingest throughput: ~350 pages/s with PyMuPDF (~250 pages/s with pypdf); full
62k-block index build in ~3 s. SectionIQ reaches 1.00 recall on the table/spec
lookups that tree-first navigation and naive chunking miss. See
[docs/benchmarking.md](docs/benchmarking.md) for the full methodology and
reproduction steps.
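
This README does not define the two recall metrics, so the sketch below shows one plausible reading, with every name hypothetical. It assumes source recall@k is the fraction of queries with at least one expected source among the top k hits, and term recall@k is the mean fraction of expected terms found in the top k hit texts. The methodology document is the authority on the actual definitions.

```python
def source_recall_at_k(ranked_sources, expected_sources, k=5):
    """Fraction of queries with at least one expected source in the top k."""
    if not expected_sources:
        return 0.0
    hits = sum(
        1
        for query, expected in expected_sources.items()
        if any(src in expected for src in ranked_sources.get(query, [])[:k])
    )
    return hits / len(expected_sources)


def term_recall_at_k(ranked_texts, expected_terms, k=5):
    """Mean fraction of expected terms found in the top-k text, per query."""
    if not expected_terms:
        return 0.0
    total = 0.0
    for query, terms in expected_terms.items():
        haystack = " ".join(ranked_texts.get(query, [])[:k]).lower()
        found = sum(1 for term in terms if term.lower() in haystack)
        total += found / len(terms) if terms else 0.0
    return total / len(expected_terms)


# Hypothetical two-query evaluation set; source labels are made up.
ranked_sources = {"purging": ["vol3:p38", "vol3:p73"], "torque": ["vol1:p10"]}
expected_sources = {"purging": {"vol3:p38"}, "torque": {"vol2:p5"}}
ranked_texts = {
    "purging": ["Purge fuel cell prior to inspection and repair."],
    "torque": ["General shop practices."],
}
expected_terms = {"purging": ["purge", "fuel cell"], "torque": ["torque", "wrench"]}

print(source_recall_at_k(ranked_sources, expected_sources))
print(term_recall_at_k(ranked_texts, expected_terms))
```

Under this assumed definition, a source recall@5 of 0.90 on 20 queries would mean 18 queries had an expected source in their top five hits.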

## Public Benchmark Corpus

SectionIQ uses the public-domain U.S. Army `TM-1-1500-204-23` aviation
maintenance manual series as its release validation corpus. The tracked manifest
contains source URLs and metadata only; downloaded PDFs stay local and are
ignored by version control.

```bash
python scripts/prepare_public_corpus.py
python scripts/benchmark_vs_pageindex.py --rebuild-index
```

To include the optional PageIndex comparison:

```bash
python scripts/benchmark_vs_pageindex.py --run-pageindex
```

See [docs/benchmarking.md](docs/benchmarking.md) for the benchmark workflow.

## Privacy

SectionIQ stores extracted PDF text, metadata, and indexes in the configured
local store. Do not commit local stores, PDFs, notebooks, spreadsheets, logs, or
benchmark outputs from private documents.

## Documentation

- [docs/architecture.md](docs/architecture.md): pipeline, modules, store layout, extension points
- [docs/api.md](docs/api.md): SDK and CLI reference
- [docs/benchmarking.md](docs/benchmarking.md): methodology and reproduction steps

## Project Layout

- `src/sectioniq/`: library code
- `benchmarks/`: public corpus manifest and public query set
- `docs/`: architecture, API reference, benchmarking
- `tests/`: unit tests
- `examples/`: local example runners
- `scripts/`: corpus preparation, benchmark, and release-safety utilities
