Metadata-Version: 2.4
Name: mineros
Version: 3.0.10
Summary: Apache 2.0 document parsing engine: PDF/DOCX/images → Markdown/JSON with tracked-changes (strikethrough) support
License: Apache-2.0
Project-URL: homepage, https://github.com/loganpowell/MinerOS
Project-URL: repository, https://github.com/loganpowell/MinerOS
Project-URL: issues, https://github.com/loganpowell/MinerOS/issues
Keywords: mineros,pdf,markdown,document-parsing,ocr,vlm,strikethrough,tracked-changes
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Requires-Python: <3.14,>=3.10
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: boto3>=1.28.43
Requires-Dist: click>=8.1.7
Requires-Dist: loguru>=0.7.2
Requires-Dist: numpy>=1.21.6
Requires-Dist: pdfminer.six>=20251230
Requires-Dist: tqdm>=4.67.1
Requires-Dist: requests
Requires-Dist: httpx
Requires-Dist: pillow>=11.0.0
Requires-Dist: pypdfium2>=4.30.0
Requires-Dist: pypdf>=5.6.0
Requires-Dist: reportlab
Requires-Dist: pdftext>=0.6.3
Requires-Dist: modelscope>=1.26.0
Requires-Dist: huggingface-hub>=0.32.4
Requires-Dist: json-repair>=0.46.2
Requires-Dist: opencv-python>=4.11.0.86
Requires-Dist: fast-langdetect<0.3.0,>=0.2.3
Requires-Dist: scikit-image<1.0.0,>=0.25.0
Requires-Dist: openai<3,>=1.70.0
Requires-Dist: beautifulsoup4<5,>=4.13.5
Requires-Dist: magika<1.1.0,>=0.6.2
Requires-Dist: mineru-vl-utils<1,>=0.2.3
Requires-Dist: qwen-vl-utils<1,>=0.0.14
Requires-Dist: python-docx<2,>=1.2.0
Requires-Dist: pypptx-with-oxml<2,>=1.0.3
Requires-Dist: mammoth<2,>=1.11.0
Requires-Dist: pylatexenc<3,>=2.10
Requires-Dist: lxml<7.0.0,>=4.0.0
Requires-Dist: pandas<3,>=2.3.3
Requires-Dist: openpyxl<4,>=3.1.5
Requires-Dist: fastapi
Requires-Dist: python-multipart
Requires-Dist: uvicorn
Provides-Extra: test
Requires-Dist: mineros[core]; extra == "test"
Requires-Dist: pytest; extra == "test"
Requires-Dist: pytest-cov; extra == "test"
Requires-Dist: coverage; extra == "test"
Requires-Dist: fuzzywuzzy; extra == "test"
Provides-Extra: dev
Requires-Dist: mineros[test]; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: pylint; extra == "dev"
Provides-Extra: vlm
Requires-Dist: torch<3,>=2.6.0; extra == "vlm"
Requires-Dist: transformers<5.0.0,>=4.57.3; extra == "vlm"
Requires-Dist: accelerate>=1.5.1; extra == "vlm"
Provides-Extra: vllm
Requires-Dist: vllm<0.12,>=0.10.1.1; extra == "vllm"
Provides-Extra: lmdeploy
Requires-Dist: lmdeploy<0.12,>=0.10.2; extra == "lmdeploy"
Provides-Extra: mlx
Requires-Dist: mlx-vlm<0.4,>=0.3.3; extra == "mlx"
Provides-Extra: pipeline
Requires-Dist: dill<1,>=0.3.8; extra == "pipeline"
Requires-Dist: PyYAML<7,>=6.0.1; extra == "pipeline"
Requires-Dist: ftfy<7,>=6.3.1; extra == "pipeline"
Requires-Dist: shapely<3,>=2.0.7; extra == "pipeline"
Requires-Dist: pyclipper<2,>=1.3.0; extra == "pipeline"
Requires-Dist: omegaconf<3,>=2.3.0; extra == "pipeline"
Requires-Dist: torch<3,>=2.6.0; extra == "pipeline"
Requires-Dist: torchvision; extra == "pipeline"
Requires-Dist: transformers<5.0.0,>=4.57.3; extra == "pipeline"
Requires-Dist: onnxruntime>1.17.0; extra == "pipeline"
Requires-Dist: albumentations<3,>=2.0.8; extra == "pipeline"
Provides-Extra: gradio
Requires-Dist: gradio!=6.0.0,!=6.0.1,!=6.0.2,!=6.1.0,!=6.2.0,!=6.3.0,!=6.4.0,!=6.5.0,!=6.5.1,!=6.6.0,!=6.7.0,<6.9.0,>=5.49.1; extra == "gradio"
Requires-Dist: gradio-pdf>=0.0.22; extra == "gradio"
Provides-Extra: core
Requires-Dist: mineros[vlm]; extra == "core"
Requires-Dist: mineros[pipeline]; extra == "core"
Requires-Dist: mineros[gradio]; extra == "core"
Provides-Extra: all
Requires-Dist: mineros[core]; extra == "all"
Requires-Dist: mineros[mlx]; sys_platform == "darwin" and extra == "all"
Requires-Dist: mineros[vllm]; sys_platform == "linux" and extra == "all"
Requires-Dist: mineros[lmdeploy]; sys_platform == "win32" and extra == "all"
Dynamic: license-file

# MinerOS

[![open issues](https://img.shields.io/github/issues-raw/loganpowell/MinerOS)](https://github.com/loganpowell/MinerOS/issues)
[![License](https://img.shields.io/badge/license-Apache%202.0-blue.svg)](LICENSE.md)
[![Python Version](https://img.shields.io/badge/python-3.10--3.13-blue)](https://github.com/loganpowell/MinerOS)

**MinerOS** is an Apache 2.0-licensed fork of [MinerU](https://github.com/opendatalab/MinerU), a high-accuracy document parsing engine that converts PDF, DOCX, PPTX, and image files into structured Markdown/JSON for LLM, RAG, and agent workflows.

## Why "OS"?

MinerU's upstream license history is complicated: the project briefly adopted AGPLv3 before reverting. MinerOS is pinned to the last clean **Apache 2.0** commit (`e148afa9`) and stays Apache 2.0 going forward, so it can be embedded in commercial and government applications without AGPL copyleft concerns.

The **OS** suffix signals:

- **Open Source** — fully Apache 2.0, no AGPL, no CC-BY-NC
- **Open Standard** — suitable for government procurement, regulated industries, and open-data pipelines
- **OS-level reliability** — designed to run as infrastructure, not just a script

### What MinerOS adds over upstream MinerU

| Feature                                   | MinerU upstream          | MinerOS                            |
| ----------------------------------------- | ------------------------ | ---------------------------------- |
| License                                   | Briefly AGPLv3, reverted | Apache 2.0 throughout              |
| Tracked-changes / strikethrough detection | ❌                       | ✅ (`~~struck text~~` in Markdown) |
| Remote VLM via `.env` auto-config         | manual                   | `.env` loaded automatically        |
| Package name                              | `mineru`                 | `mineros`                          |

## Core Parsing Capabilities

- **PDF · DOCX · PPTX · Images** → Markdown + JSON
- **Tracked-changes detection** — renders struck-through text as `~~...~~` in Markdown output (critical for government contracts, legislative drafts, redlined legal documents)
- Formulas → LaTeX · Tables → HTML · accurate layout reconstruction
- Scanned docs, handwriting, multi-column layouts, cross-page table merging
- Output follows human reading order with automatic header/footer removal
- VLM + OCR dual engine, 109-language OCR recognition
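
Because struck-through text is emitted as `~~...~~`, downstream code can separate deleted wording from the surviving text. The following is a minimal sketch, assuming struck spans do not cross line breaks; `split_tracked_changes` is a hypothetical helper, not part of the MinerOS API.

```python
import re

# Matches ~~struck text~~ on a single line (non-greedy, so adjacent
# spans are captured separately).
STRIKE_RE = re.compile(r"~~(.+?)~~")


def split_tracked_changes(markdown: str) -> tuple[str, list[str]]:
    """Return (markdown with struck spans removed, list of struck spans).

    Hypothetical helper for post-processing Markdown output.
    """
    struck = STRIKE_RE.findall(markdown)
    kept = STRIKE_RE.sub("", markdown)
    return kept, struck


if __name__ == "__main__":
    kept, struck = split_tracked_changes("The fee is ~~$100~~ $120 per ~~month~~ year.")
    print(struck)
    print(kept)
```

Removing a span leaves its surrounding whitespace in place, so callers that need clean prose may want to collapse repeated spaces afterwards.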

## Deployment Backends

| Backend              | Best For                                                                      |
| -------------------- | ----------------------------------------------------------------------------- |
| `pipeline`           | Fast & stable, no hallucination, runs on CPU or GPU                           |
| `vlm-http-client`    | High accuracy via remote OpenAI-compatible VLM server (e.g., Azure llama.cpp) |
| `hybrid-http-client` | High accuracy + local OCR, minimal local VRAM                                 |
| `vlm-auto-engine`    | High accuracy via local vLLM / LMDeploy / mlx                                 |
| `hybrid-auto-engine` | Best accuracy, native text extraction, low hallucination                      |

## Quick Start

### Install

```bash
pip install uv
uv pip install -e ".[core]"
```

Or from PyPI (once published):

```bash
uv pip install "mineros[core]"
```
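
The package metadata declares `Requires-Python: <3.14,>=3.10`. A quick pre-install check might look like the sketch below; `python_supported` is a hypothetical helper written for illustration.

```python
import sys


def python_supported(version_info=None) -> bool:
    """True if the interpreter is within the declared range (>=3.10, <3.14).

    Hypothetical helper; pass a tuple such as (3, 12) to check a version
    other than the running interpreter.
    """
    major_minor = tuple((version_info or sys.version_info)[:2])
    return (3, 10) <= major_minor < (3, 14)


if __name__ == "__main__":
    print("supported" if python_supported() else "unsupported Python version")
```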

### Run

```bash
# Basic parsing (auto-selects best available backend)
mineros -p <input.pdf> -o <output_dir>

# CPU-only (pipeline backend)
mineros -p <input.pdf> -o <output_dir> -b pipeline

# Remote VLM server (reads MINERU_VL_SERVER / MINERU_VL_API_KEY / MINERU_VL_MODEL_NAME from .env)
mineros -p <input.pdf> -o <output_dir> -b vlm-http-client

# Specific page range (-s start page, -e end page)
mineros -p <input.pdf> -o <output_dir> -b vlm-http-client -s 0 -e 3
```
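
For batch jobs, the same commands can be assembled from Python. This is a sketch that uses only the flags shown above (`-p`, `-o`, `-b`, `-s`, `-e`) and assumes the `mineros` executable is on `PATH`; the function names are hypothetical and are not part of the package.

```python
import subprocess


def build_mineros_command(input_path, output_dir, backend=None, start=None, end=None):
    """Build the argument list for the `mineros` CLI (hypothetical helper)."""
    cmd = ["mineros", "-p", str(input_path), "-o", str(output_dir)]
    if backend is not None:
        cmd += ["-b", backend]
    if start is not None:
        cmd += ["-s", str(start)]
    if end is not None:
        cmd += ["-e", str(end)]
    return cmd


def run_mineros(input_path, output_dir, **options):
    """Run the CLI and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_mineros_command(input_path, output_dir, **options), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(build_mineros_command("input.pdf", "out", backend="pipeline"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.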

### Environment Variables (`.env`)

```dotenv
# Remote VLM server (OpenAI-compatible)
MINERU_VL_SERVER=https://your-llm-server.example.com
MINERU_VL_API_KEY=your-api-key
MINERU_VL_MODEL_NAME=your-model-name

# Required when server n_ctx is small (e.g., llama.cpp with 8192 context)
MINEROS_PROCESSING_WINDOW_SIZE=1
```

The `.env` file is loaded automatically via `python-dotenv` — no manual export needed.
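
Before calling a remote backend it can help to confirm the settings are present. The sketch below assumes all three `MINERU_VL_*` variables are required, which this README does not state explicitly; `missing_vlm_settings` is a hypothetical helper.

```python
import os

# Variable names taken from the .env example above.
# Assumption: all three must be set for the remote VLM backends.
REQUIRED_VLM_VARS = ("MINERU_VL_SERVER", "MINERU_VL_API_KEY", "MINERU_VL_MODEL_NAME")


def missing_vlm_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VLM_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vlm_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
```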

## Hardware Requirements

<table>
  <thead>
    <tr>
      <th rowspan="2">Backend</th>
      <th rowspan="2">pipeline</th>
      <th colspan="2">*-auto-engine</th>
      <th colspan="2">*-http-client</th>
    </tr>
    <tr>
      <th>hybrid</th>
      <th>vlm</th>
      <th>hybrid</th>
      <th>vlm</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>Pure CPU</th>
      <td style="text-align:center;">✅</td>
      <td colspan="2" style="text-align:center;">❌</td>
      <td colspan="2" style="text-align:center;">✅</td>
    </tr>
    <tr>
      <th>Min VRAM</th>
      <td style="text-align:center;">4 GB</td>
      <td style="text-align:center;">8 GB</td>
      <td style="text-align:center;">8 GB</td>
      <td style="text-align:center;">2 GB</td>
      <td style="text-align:center;">None</td>
    </tr>
    <tr>
      <th>Min RAM</th>
      <td colspan="3" style="text-align:center;">16 GB (32 GB recommended)</td>
      <td colspan="2" style="text-align:center;">16 GB</td>
    </tr>
    <tr>
      <th>Python</th>
      <td colspan="5" style="text-align:center;">3.10 – 3.13</td>
    </tr>
    <tr>
      <th>OS</th>
      <td colspan="5" style="text-align:center;">Linux (distributions released in 2019 or later) · Windows · macOS 14+</td>
    </tr>
  </tbody>
</table>
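
The table above can be read as a lookup from available VRAM to usable backends. The following is a minimal sketch of that reading, treating "None" as 0 GB and assuming the backends marked as CPU-capable remain usable whatever GPU is present; the constant and function names are hypothetical.

```python
# Minimum VRAM in GB per backend, transcribed from the table above.
MIN_VRAM_GB = {
    "pipeline": 4,
    "hybrid-auto-engine": 8,
    "vlm-auto-engine": 8,
    "hybrid-http-client": 2,
    "vlm-http-client": 0,
}

# Backends the table marks as able to run on pure CPU.
CPU_ONLY_OK = {"pipeline", "hybrid-http-client", "vlm-http-client"}


def feasible_backends(vram_gb=None):
    """List backends usable with the given VRAM (None means no GPU)."""
    feasible = set(CPU_ONLY_OK)
    if vram_gb is not None:
        feasible |= {name for name, need in MIN_VRAM_GB.items() if vram_gb >= need}
    return sorted(feasible)


if __name__ == "__main__":
    for vram in (None, 6, 8):
        print(vram, feasible_backends(vram))
```

The sketch ignores the RAM, Python, and OS rows, which apply to every backend.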

## Docker

```bash
# Build
docker build -f docker/global/Dockerfile -t mineros:latest .

# Run via Compose
docker compose -f docker/compose.yaml up
```

## Known Issues

- Reading order may be out of sequence in extremely complex multi-column layouts.
- Strikethrough detection relies on the VLM visually identifying struck text — accuracy depends on model capability and image resolution.
- Tables of contents and lists are recognized via rules; uncommon formats may be missed.
- Comic books, art albums, and heavily stylized documents parse poorly.
- OCR may produce inaccurate characters for lesser-known languages.

## License

[Apache 2.0](LICENSE.md)

MinerOS is a derivative of [MinerU](https://github.com/opendatalab/MinerU) (opendatalab), used and redistributed under the terms of the Apache 2.0 license as it existed at commit `e148afa9`. All modifications are also released under Apache 2.0.

## Acknowledgments

MinerOS stands on the shoulders of MinerU and its dependencies:

- [MinerU](https://github.com/opendatalab/MinerU) — opendatalab
- [UniMERNet](https://github.com/opendatalab/UniMERNet)
- [TableStructureRec](https://github.com/RapidAI/TableStructureRec)
- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
- [pypdfium2](https://github.com/pypdfium2-team/pypdfium2)
- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
- [pypdf](https://github.com/py-pdf/pypdf)
- [magika](https://github.com/google/magika)
- [vLLM](https://github.com/vllm-project/vllm)
- [LMDeploy](https://github.com/InternLM/lmdeploy)
