*(VisionIngest logo)*

# VisionIngest

**AI-Powered Document Parser | PDF to JSON | OCR + LLM Pipeline**

Extract structured data from PDFs and images using DeepSeek-OCR-2 and local LLMs.
100% local processing. No cloud APIs. Your data stays private.



## What is VisionIngest?

VisionIngest is a local-first document parsing application that converts PDFs, scanned documents, and images into structured JSON data. It uses a two-stage AI pipeline:

  1. Vision Model (Eyes): DeepSeek-OCR-2 extracts text and layout as markdown
  2. Language Model (Brain): Local LLM (via Ollama) converts markdown to structured JSON

No data leaves your machine. Perfect for sensitive documents like resumes, contracts, invoices, and medical records.
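The two stages above can be sketched as plain functions. This is an illustrative outline only, not the project's actual code: `run_ocr` and `run_llm` are hypothetical stand-ins for the adapters that wrap DeepSeek-OCR-2 and the Ollama-served LLM.

```python
import json

def run_ocr(image_bytes: bytes) -> str:
    """Stage 1 (eyes): the vision model turns pixels into markdown.
    Stubbed here with a canned result for illustration."""
    return "# Invoice\n\n| Item | Price |\n|--------|-------|\n| Widget | $9.99 |"

def run_llm(markdown: str, doc_type: str) -> str:
    """Stage 2 (brain): the local LLM maps markdown onto the target schema.
    Also stubbed; the real pipeline prompts an Ollama model."""
    return json.dumps({"items": [{"name": "Widget", "price": 9.99}]})

def parse_document(image_bytes: bytes) -> dict:
    """Chain the two stages: image -> markdown -> structured JSON."""
    markdown = run_ocr(image_bytes)
    raw = run_llm(markdown, doc_type="invoice")
    return json.loads(raw)

result = parse_document(b"...image bytes...")
```

The separation matters: the vision model never has to understand the document's semantics, and the LLM never has to read pixels, so either stage can be swapped independently.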


## Screenshots

*(Screenshot: upload interface with document type selection, processing modes, and OCR quality control)*

*(Screenshot: side-by-side view of the original document vs. the extracted, editable JSON)*


## Key Features

| Feature | Description |
|---------|-------------|
| Multi-Document Support | Resumes, invoices, receipts, contracts, business cards, and more |
| 100% Local Processing | No cloud APIs; your data never leaves your machine |
| GPU Accelerated | CUDA support for fast processing on NVIDIA GPUs |
| Flexible Processing Modes | Performance, Batch, and Low VRAM modes for any hardware |
| Adjustable OCR Quality | 4 presets from fast (512px) to high-quality (1280px) |
| Editable JSON Output | Edit extracted data directly in the UI |
| REST API | Full API for integration with other applications |
| Modern Web UI | React-based interface with real-time GPU monitoring |

## Supported Document Types

- **HR & Recruiting:** Resumes, CVs, Cover Letters
- **Finance:** Invoices, Receipts, Bank Statements
- **Legal:** Contracts, NDAs, Agreements
- **Business:** Business Cards, Reports, Forms
- **Academic:** Transcripts, Certificates, Research Papers
- **Custom:** Add your own templates with JSON schema + prompt
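A custom template pairs a JSON schema with a prompt. The exact file format the app expects is not shown here; the fragment below is a hypothetical sketch of what such a template could look like, with invented field names (`name`, `schema`, `prompt`):

```json
{
  "name": "purchase_order",
  "schema": {
    "type": "object",
    "properties": {
      "po_number": { "type": "string" },
      "vendor":    { "type": "string" },
      "total":     { "type": "number" }
    },
    "required": ["po_number", "total"]
  },
  "prompt": "Extract the purchase order number, vendor, and total from the markdown below. Return only JSON matching the schema."
}
```

Check `backend/templates/` for the shipped templates to see the real structure before writing your own.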

## Technology Stack

| Component | Technology |
|-----------|------------|
| Vision Model | DeepSeek-OCR-2 (Visual Causal Flow architecture) |
| Language Model | Local LLM via Ollama (gpt-oss, llama3, mistral, etc.) |
| Backend | FastAPI (Python 3.11+) |
| Frontend | React 18 + TypeScript + Vite |
| GPU | CUDA 12.8 with Flash Attention 2 |
| Package Manager | uv (10x faster than pip) |

## Quick Start

### 1. Install Ollama (One-Time)

```shell
# Windows
winget install Ollama.Ollama

# Pull a model
ollama pull gpt-oss
```

### 2. Download the OCR Model (One-Time)

```shell
download_model.bat
```

### 3. Start the Application

```shell
start.bat
```

That's it! The browser opens automatically at http://localhost:5173.


## Processing Modes

| Mode | Description | VRAM Usage |
|------|-------------|------------|
| Performance | Both models stay in VRAM. Fastest. | 16+ GB |
| Batch | OCR all files first, then parse all. Good for multiple files. | 12+ GB |
| Low VRAM | Unload OCR after each file. Slowest, but works on 8 GB. | 8 GB |
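The trade-off in the table above amounts to a simple decision rule. As an illustration (this helper is not part of the codebase, and the mode names are just labels for the rows above):

```python
# Illustrative only: pick a processing mode from available VRAM,
# mirroring the thresholds in the Processing Modes table.
def pick_mode(vram_gb: float) -> str:
    if vram_gb >= 16:
        return "performance"  # both models stay resident; fastest
    if vram_gb >= 12:
        return "batch"        # OCR every file first, then parse them all
    return "low_vram"         # unload the OCR model between files; fits 8 GB
```

For example, `pick_mode(24)` selects `"performance"`, while a typical 8 GB card falls through to `"low_vram"`.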

## OCR Quality Presets

| Preset | Resolution | Speed | Best For |
|--------|------------|-------|----------|
| Tiny | 512px | ~8 s/page | Quick previews, simple docs |
| Small | 640px | ~12 s/page | Standard documents |
| Base | 1024px | ~25 s/page | Most documents (default) |
| Large | 1280px | ~40 s/page | Dense text, fine print |
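To budget processing time for a batch, the per-page figures above can be turned into a rough estimate. A minimal sketch (the dict values come straight from the table; the helper itself is not part of the codebase, and real timings depend on your GPU):

```python
# Preset data from the OCR Quality Presets table above (illustrative).
OCR_PRESETS = {
    "tiny":  {"resolution": 512,  "sec_per_page": 8},
    "small": {"resolution": 640,  "sec_per_page": 12},
    "base":  {"resolution": 1024, "sec_per_page": 25},  # default
    "large": {"resolution": 1280, "sec_per_page": 40},
}

def estimate_seconds(preset: str, pages: int) -> int:
    """Rough wall-clock estimate for OCR'ing a document at a given preset."""
    return OCR_PRESETS[preset]["sec_per_page"] * pages
```

So a 10-page contract at `"large"` budgets roughly 400 seconds, versus about 80 at `"tiny"`.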

## Project Structure

```
VisionIngest/
├── backend/              # FastAPI application
│   ├── templates/        # Document templates (JSON schemas + prompts)
│   ├── ocr_adapter.py    # DeepSeek-OCR-2 adapter
│   ├── llm_parser.py     # LLM parser adapter
│   └── main.py           # API endpoints
├── frontend/             # React UI
│   └── src/
│       ├── components/   # UI components
│       └── App.tsx       # Main app
├── models/               # AI models (downloaded separately)
├── assets/               # Images for README
├── start.bat             # One-click startup script
└── requirements.txt      # Python dependencies
```

## License

Business Source License 1.1

- Non-commercial use only
- Converts to Apache 2.0 on 2029-01-01

See LICENSE for full terms.
