Possible encoding error for Japanese document #19

Open
anhhaibkhn opened this issue Dec 12, 2024 · 0 comments
Thank you for making Docling available for spaCy. I am concerned that OCR is a bottleneck for scanned documents: if the OCR-ed text is incorrect, the structured output built on top of it cannot be reliable. For example, here is the input sample:
Invoice_D04_002

Here is the code which I used:

import spacy
from spacy_layout import spaCyLayout

# python -m spacy download ja_core_news_trf
nlp = spacy.load("ja_core_news_trf")

layout = spaCyLayout(nlp)

# Process a document and create a spaCy Doc object
doc = layout(r"tests\data\sample.png")

# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)
# Tables in the document and their extracted data
print(doc._.tables)
# Markdown representation of the document
print(doc._.markdown)

# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)
  

I was not able to get any sensible results. I suspect one of two causes: either the OCR text was UTF-8 encoded and then not decoded properly, or the OCR step failed to recognize the correct text. Please correct me if I am doing something wrong, and let me know your opinion on this issue. Thank you.
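
A quick way to tell these two cases apart might look like the sketch below. It is only a heuristic built on the Python standard library. The helper names (`japanese_ratio`, `try_repair_mojibake`, `diagnose`), the assumption that the wrong codec was Latin-1, and the 0.3 threshold are illustrative choices of mine, not part of spacy-layout or Docling.

```python
import unicodedata
from typing import Optional


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are kana or CJK characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_japanese(c: str) -> bool:
        # Unicode character names identify the script, e.g. "HIRAGANA LETTER A".
        name = unicodedata.name(c, "")
        return any(key in name for key in ("HIRAGANA", "KATAKANA", "CJK"))

    return sum(is_japanese(c) for c in chars) / len(chars)


def try_repair_mojibake(text: str, wrong_codec: str = "latin-1") -> Optional[str]:
    """Undo 'UTF-8 bytes decoded as wrong_codec'. Returns None if not applicable.

    Assumption: the bad decode used Latin-1; other codecs (e.g. cp1252) are possible.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


def diagnose(text: str, threshold: float = 0.3) -> str:
    """Return "encoding", "ocr", or "ok". The threshold is an arbitrary guess."""
    repaired = try_repair_mojibake(text)
    # If reversing the bad decode yields more Japanese text, it was an encoding problem.
    if repaired is not None and japanese_ratio(repaired) > japanese_ratio(text):
        return "encoding"
    # Otherwise, too little Japanese in the output points at the OCR step itself.
    if japanese_ratio(text) < threshold:
        return "ocr"
    return "ok"
```

Running `diagnose(doc.text)` on the output of the script above would give a first hint: "encoding" suggests the text can be recovered by re-decoding, while "ocr" suggests the recognizer never produced Japanese characters in the first place.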
