Thank you for making Docling available for spaCy. I am concerned that OCR is a bottleneck for scanned data: if the OCR-ed text is incorrect, the structured output cannot be reliable. For example, here is the input sample.
Here is the code which I used:
import spacy
from spacy_layout import spaCyLayout
# python -m spacy download ja_core_news_trf
nlp = spacy.load("ja_core_news_trf")
layout = spaCyLayout(nlp)
# Process a document and create a spaCy Doc object
doc = layout(r"tests\data\sample.png")
# The text-based contents of the document
print(doc.text)
# Document layout including pages and page sizes
print(doc._.layout)
# Tables in the document and their extracted data
print(doc._.tables)
# Markdown representation of the document
print(doc._.markdown)
# Layout spans for different sections
for span in doc.spans["layout"]:
    # Document section and token and character offsets into the text
    print(span.text, span.start, span.end, span.start_char, span.end_char)
    # Section type, e.g. "text", "title", "section_header" etc.
    print(span.label_)
    # Layout features of the section, including bounding box
    print(span._.layout)
    # Closest heading to the span (accuracy depends on document structure)
    print(span._.heading)
I was not able to get any meaningful results. I suspect one of two causes: either the OCR text was UTF-8 encoded and not decoded properly, or the OCR step failed to extract the correct text. Please correct me if I am doing something wrong, and let me know your opinion on this issue. Thank you.
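To tell the two suspected causes apart, a rough check along the following lines might help. This is only a minimal sketch using the Python standard library: the helper names (`japanese_ratio`, `try_repair_mojibake`, `diagnose`), the 0.5 threshold, and the guess that a wrong decode would have gone through Latin-1 are all my own assumptions, not part of spaCy, spacy-layout, or Docling.

```python
def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are Hiragana, Katakana, or common CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_japanese(c: str) -> bool:
        cp = ord(c)
        return (
            0x3040 <= cp <= 0x309F      # Hiragana
            or 0x30A0 <= cp <= 0x30FF   # Katakana
            or 0x4E00 <= cp <= 0x9FFF   # CJK Unified Ideographs
        )

    return sum(is_japanese(c) for c in chars) / len(chars)


def try_repair_mojibake(text: str, wrong_encoding: str = "latin-1"):
    """Undo a hypothetical 'UTF-8 bytes decoded as wrong_encoding' mistake.

    Returns the re-decoded string, or None if the text cannot be round-tripped.
    """
    try:
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


def diagnose(text: str, threshold: float = 0.5) -> str:
    """Classify extracted text as plausible Japanese, a decoding problem, or a likely OCR failure."""
    if japanese_ratio(text) >= threshold:
        return "looks_japanese"
    repaired = try_repair_mojibake(text)
    if repaired is not None and japanese_ratio(repaired) >= threshold:
        return "encoding_problem"
    return "ocr_problem_likely"
```

Calling `diagnose(doc.text)` on the output of the code above would then hint at which of the two causes applies. A result of `"ocr_problem_likely"` only means the text is neither Japanese nor recoverable by this one re-decoding guess, so it is a hint rather than proof.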