Ocean-OCR: Towards General OCR Application via a Vision-Language Model

Chen, Song; Guo, Xinyu; Li, Yadong; Zhang, Tao; Lin, Mingan; Kuang, Dongdong; Zhang, Youwei; Ming, Lingfeng; Zhang, Fengyu; Wang, Yuran; Xu, Jianhua; Zhou, Zenan; Chen, Weipeng

arXiv:2501.15558 (cs)

[Submitted on 26 Jan 2025]

Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities across various domains, excelling in processing and understanding information from multiple modalities. Despite the rapid progress made previously, insufficient OCR ability hinders MLLMs from excelling in text-related tasks. In this paper, we present Ocean-OCR, a 3B MLLM with state-of-the-art performance on various OCR scenarios and comparable understanding ability on general tasks. We employ Native Resolution ViT to enable variable resolution input and utilize a substantial collection of high-quality OCR datasets to enhance the model performance. We demonstrate the superiority of Ocean-OCR through comprehensive experiments on open-source OCR benchmarks and across various OCR scenarios. These scenarios encompass document understanding, scene text recognition, and handwritten recognition, highlighting the robust OCR capabilities of Ocean-OCR. Note that Ocean-OCR is the first MLLM to outperform professional OCR models such as TextIn and PaddleOCR.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2501.15558 [cs.CV]
	(or arXiv:2501.15558v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2501.15558

Submission history

From: Xinyu Guo
[v1] Sun, 26 Jan 2025 15:20:39 UTC (19,421 KB)
