Possible encoding error for Japanese document #19

anhhaibkhn · 2024-12-12T09:32:47Z

Thank you for making Docling available for spaCy. My concern is that OCR is a bottleneck for scanned data: if the OCR-ed text is incorrect, the structured output built on top of it cannot be reliable. For example, here is the input sample.

Here is the code I used:

import spacy
from spacy_layout import spaCyLayout

# python -m spacy download ja_core_news_trf
nlp = spacy.load("ja_core_news_trf")

layout = spaCyLayout(nlp)

# Process a document and create a spaCy Doc object
doc = layout(r"tests\data\sample.png")

# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)
# Tables in the document and their extracted data
print(doc._.tables)
# Markdown representation of the document
print(doc._.markdown)

# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)

I was not able to get any meaningful results. I suspect one of two causes: either the OCR text was UTF-8 encoded and not decoded properly, or the OCR step failed to extract the correct text. Please correct me if I am using the library incorrectly, and let me know your opinion on this issue. Thank you.
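One way to tell the two suspected causes apart is to check whether the output is mojibake (correct UTF-8 bytes decoded with the wrong codec) or simply wrong text. Below is a minimal sketch, assuming that any mis-decoding went through Latin-1. The helper names `japanese_ratio`, `try_repair_mojibake`, and `diagnose`, as well as the 0.5 threshold, are hypothetical and not part of spacy-layout or Docling.

```python
import unicodedata


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are kana or CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_japanese(c: str) -> bool:
        # Unicode character names identify the script, e.g. "HIRAGANA LETTER A".
        name = unicodedata.name(c, "")
        return name.startswith(("HIRAGANA", "KATAKANA", "CJK UNIFIED"))

    return sum(is_japanese(c) for c in chars) / len(chars)


def try_repair_mojibake(text: str, wrong_codec: str = "latin-1"):
    """Undo a wrong decode: re-encode with wrong_codec, then decode as UTF-8.

    Returns None when the round trip is impossible, which means the text
    was not produced by that particular mis-decoding.
    """
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


def diagnose(text: str, threshold: float = 0.5) -> str:
    """Rough classification of extracted text (threshold is an assumption)."""
    if japanese_ratio(text) >= threshold:
        return "text is Japanese"
    repaired = try_repair_mojibake(text)
    if repaired is not None and japanese_ratio(repaired) >= threshold:
        return "likely mojibake"
    return "likely OCR failure"
```

Calling `diagnose(doc.text)` on the output of the script above would give a first hint. If the result is "likely OCR failure", the encoding is probably fine and the OCR engine or its language setting is the more likely culprit.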
