Ghosting appears when translating scanned documents / feat (main): supports ocr on scanned document #19

Open
jackiehejian opened this issue Nov 7, 2024 · 15 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@jackiehejian

jackiehejian commented Nov 7, 2024

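Translation error

When every page of the PDF is an image rather than editable (copyable) text, translation fails completely. See the screenshot for details.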
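
One way to detect this case up front, as a minimal sketch: treat a page as scanned when it has almost no extractable text but does contain embedded images. It assumes a PyMuPDF-style page object exposing `get_text()` and `get_images()`; the helper names `looks_scanned` and `scanned_page_numbers` and the `min_chars` threshold are hypothetical, not part of pdf2zh.

```python
def looks_scanned(page, min_chars: int = 20) -> bool:
    """Heuristic: little or no text layer, but at least one embedded image."""
    text = page.get_text().strip()
    return len(text) < min_chars and len(page.get_images()) > 0


def scanned_page_numbers(doc, min_chars: int = 20) -> list[int]:
    """Return the 0-based indices of pages that look like scans."""
    return [i for i, page in enumerate(doc) if looks_scanned(page, min_chars)]
```

With PyMuPDF this could be called as `scanned_page_numbers(pymupdf.open(path))`. The threshold is a guess and would need tuning on real documents.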

@Byaidu
Owner

Byaidu commented Nov 7, 2024

Image-based PDF documents can't be translated yet. For now the focus is mainly on improving translation quality for e-books and papers.

@jackiehejian
Author

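Image-based PDF documents can't be translated yet. For now the focus is mainly on improving translation quality for e-books and papers.

OK, thank you very much.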

@Byaidu Byaidu added the enhancement New feature or request label Nov 7, 2024
@fireinrain

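Image-only PDFs are a tough ask: OCR quality determines the quality of the extracted text, which in turn determines the quality of the translation.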
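
To make that dependency concrete, a minimal sketch that drops low-confidence OCR lines before they reach the translator. It assumes a PaddleOCR-style result layout (one list per image, each entry `[box, (text, confidence)]`); `keep_confident` and the 0.8 threshold are hypothetical.

```python
def keep_confident(result, threshold: float = 0.8) -> list[str]:
    """Keep only the recognized lines whose confidence meets the threshold."""
    lines = []
    for page in result:
        # A page entry may be None when nothing was detected.
        for _box, (text, score) in page or []:
            if score >= threshold:
                lines.append(text)
    return lines
```

A stricter threshold loses more text, a looser one lets more misreads through to the translation, so the value is a trade-off rather than a fixed rule.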

@xxsunyxx

Add PaddleOCR as an optional step.

@xxnuo
Contributor

xxnuo commented Nov 20, 2024

sayura
This model is very accurate, but it needs more compute than Paddle OCR.

@Byaidu
Owner

Byaidu commented Nov 20, 2024

sayura: This model is very accurate, but it needs more compute than Paddle OCR.

How does it compare with minerU/marker?

@xxnuo
Contributor

xxnuo commented Nov 21, 2024

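sayura is the open-source multilingual and table OCR model made by the author of marker 😂
I haven't tested minerU. I only tested the PaddleOCR high-accuracy model, and Sayura performs much better than it, with very good multilingual support.
Judging from minerU's issues, its multilingual support seems to be poor.
The downside is that Sayura needs a lot of GPU memory, which is a headache, and I don't really know how to quantize models.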

@reycn reycn changed the title Cannot translate when every page of the PDF is an image feat (main): supports ocr on scanned document Nov 21, 2024
@reycn reycn added the help wanted Extra attention is needed label Nov 21, 2024
@xxnuo
Contributor

xxnuo commented Dec 2, 2024

Folks, how is the OCR work going? I think building it on paddleocr would work well. If someone is already working on it, I won't reinvent the wheel. @reycn @Byaidu

@Byaidu
Owner

Byaidu commented Dec 2, 2024

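Folks, how is the OCR work going? I think building it on paddleocr would work well. If someone is already working on it, I won't reinvent the wheel. @reycn @Byaidu

Nothing has been done on it yet…

If you get it written, you're welcome to contribute the code.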

@hellofinch
Contributor

from typing import BinaryIO

import numpy as np
import tqdm
from paddleocr import PaddleOCR
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
from pymupdf import Document, Font

from pdf2zh.converter import TranslateConverter
from pdf2zh.pdfinterp import PDFPageInterpreterEx

file=""

def extract_text_to_fp(
    inf: BinaryIO,
    pages=None,
    password: str = "",
    debug: bool = False,
    page_count: int = 0,
    vfont: str = "",
    vchar: str = "",
    thread: int = 0,
    doc_en: Document = None,
    model=None,
    lang_in: str = "",
    lang_out: str = "",
    service: str = "",
    resfont: str = "",
    noto: Font = None,
    callback: object = None,
    **kwarg,
) -> tuple[dict, list[str]]:
    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    rsrcmgr = PDFResourceManager()
    layout = {}
    device = TranslateConverter(
        rsrcmgr, vfont, vchar, thread, layout, lang_in, lang_out, service, resfont, noto
    )

    assert device is not None
    obj_patch = {}
    interpreter = PDFPageInterpreterEx(rsrcmgr, device, obj_patch)
    if pages:
        total_pages = len(pages)
    else:
        total_pages = page_count

    parser = PDFParser(inf)
    doc = PDFDocument(parser, password=password)
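    # Collect OCR text across all pages; resetting the list inside the loop
    # would return only the last page's text.
    result_text = []
    with tqdm.tqdm(
        enumerate(PDFPage.create_pages(doc)),
        total=total_pages,
    ) as progress:
        for pageno, page in progress:
            if pages and (pageno not in pages):
                continue
            if callback:
                callback(progress)
            page.pageno = pageno
            pix = doc_en[page.pageno].get_pixmap()
            # np.fromstring is deprecated for binary data; np.frombuffer replaces it.
            # The trailing copy() keeps the BGR array writable and contiguous.
            image = np.frombuffer(pix.samples, np.uint8).reshape(
                pix.height, pix.width, 3
            )[:, :, ::-1].copy()
            page_layout = model.predict(image, imgsz=int(pix.height / 32) * 32)[0]
            # No k-d tree here; render the layout into a page-sized mask instead, trading space for time
            box = np.ones((pix.height, pix.width))
            h, w = box.shape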
            vcls = ["abandon", "figure", "table", "isolate_formula", "formula_caption"]
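            for i, d in enumerate(page_layout.boxes):
                text = ""
                label = page_layout.names[int(d.cls)]
                if label not in vcls:
                    x0, y0, x1, y1 = d.xyxy.squeeze()
                    # Crop window in image space (origin at the top-left), assuming
                    # the layout model reports boxes in image pixels.
                    cx0, cx1 = np.clip([int(x0), int(x1)], 0, w)
                    cy0, cy1 = np.clip([int(y0), int(y1)], 0, h)
                    # The layout mask is indexed with y flipped (origin at the bottom-left).
                    x0, y0, x1, y1 = (
                        np.clip(int(x0 - 1), 0, w - 1),
                        np.clip(int(h - y1 - 1), 0, h - 1),
                        np.clip(int(x1 + 1), 0, w - 1),
                        np.clip(int(h - y0 + 1), 0, h - 1),
                    )
                    box[y0:y1, x0:x1] = i + 2
                    if label == "plain text":
                        # OCR the unflipped crop; the flipped rows would select
                        # the vertically mirrored region of the page image.
                        imagex = image[cy0:cy1, cx0:cx1]
                        result = ocr.ocr(imagex, cls=False)
                        # Guard against None when no text is detected.
                        for res in result or []:
                            for line in res or []:
                                text += line[1][0]
                        result_text.append(text)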
            for i, d in enumerate(page_layout.boxes):
                if page_layout.names[int(d.cls)] in vcls:
                    x0, y0, x1, y1 = d.xyxy.squeeze()
                    x0, y0, x1, y1 = (
                        np.clip(int(x0 - 1), 0, w - 1),
                        np.clip(int(h - y1 - 1), 0, h - 1),
                        np.clip(int(x1 + 1), 0, w - 1),
                        np.clip(int(h - y0 + 1), 0, h - 1),
                    )
                    box[y0:y1, x0:x1] = 0
            layout[page.pageno] = box
            # Create a new xref to hold the new instruction stream
            page.page_xref = doc_en.get_new_xref()  # hack: new xref inserted for the page
            doc_en.update_object(page.page_xref, "<<>>")
            doc_en.update_stream(page.page_xref, b"")
            doc_en[page.pageno].set_contents(page.page_xref)
            interpreter.process_page(page)

    device.close()
    return obj_patch,result_text

This only covers the OCR part. I really can't figure out how to pass the OCR results on to the later stages.
:(
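
One possible shape for that hand-off, as a sketch rather than pdf2zh's actual interface: convert each OCR line into a plain record holding its text and a bounding box in PDF coordinates, which a later stage could use to place translated text. It assumes PaddleOCR-style entries (`[four corner points, (text, confidence)]`) in image pixels with the origin at the top-left, and a PDF page measured in points with the origin at the bottom-left. `OcrLine`, `to_pdf_lines`, and the `offset` parameter are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class OcrLine:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def to_pdf_lines(result, img_w, img_h, page_w, page_h, offset=(0, 0)):
    """Map OCR boxes from image pixels to PDF points and sort in reading order."""
    sx = page_w / img_w
    sy = page_h / img_h
    ox, oy = offset  # top-left corner of the OCR crop within the full page image
    lines = []
    for page in result:
        # A page entry may be None when nothing was detected.
        for points, (text, _score) in page or []:
            xs = [p[0] + ox for p in points]
            ys = [p[1] + oy for p in points]
            # Flip y: image rows grow downward, PDF y grows upward.
            lines.append(
                OcrLine(
                    text,
                    min(xs) * sx,
                    page_h - max(ys) * sy,
                    max(xs) * sx,
                    page_h - min(ys) * sy,
                )
            )
    # Top to bottom (descending PDF y), then left to right.
    lines.sort(key=lambda line: (-line.y1, line.x0))
    return lines
```

Returning records like these per page, instead of one concatenated string, keeps the position of each line available to whatever stage writes the translation back onto the page.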

@Byaidu Byaidu changed the title feat (main): supports ocr on scanned document Ghosting appears when translating scanned documents / feat (main): supports ocr on scanned document Dec 13, 2024
@Byaidu Byaidu pinned this issue Dec 13, 2024
@xxnuo
Contributor

xxnuo commented Dec 15, 2024

@xxnuo
Contributor

xxnuo commented Dec 20, 2024

https://huggingface.co/spaces/stepfun-ai/GOT_official_online_demo

@gfhdhytghd

I suggest using Youdao for OCR translation.
You don't pay for it for nothing.

@gfhdhytghd

Try integrating tesseract to implement OCR.

@jj-a-li
Copy link

jj-a-li commented Jan 11, 2025

In practice a very large share of PDFs are scanned, so if they can't be handled, the range of use will shrink sharply.

This was referenced Jan 13, 2025