.djvu adapter #166
phiresky
started this conversation in
Show your adapter
Replies: 2 comments
-
It's possible to use the integrated ASCII page breaks converter using …
-
By replacing … Here is what I am using now in …:
{
"name": "djvutorga",
"version": 2,
"description": "Uses djvused to extract plain text from DJVU files",
"extensions": ["djvu"],
"mimetypes": ["image/vnd.djvu"],
"binary": "djvutorga",
"args": ["$input_virtual_path"],
"output_path_hint": "${input_virtual_path}.txt.asciipagebreaks",
"disabled_by_default": false,
"match_only_by_mime": false
},
where `djvutorga` is:
#!/bin/bash
# `djvused -e 'print-pure-txt'` adds page breaks for blank pages, but
# also for each non-page file included in the djvu file (shared image or
# annotation data and thumbnail data).
#
# `djvused -e 'ls'` gives the list of included files, page files and others.
# We can thus get the page number associated to a file number. The latter
# is what we get by counting page breaks in `djvused -e 'print-pure-txt'`’s
# output.
input_file="$1"
# Lines produced by `djvused -e 'ls'` are of the following forms:
# 45 P …
# 145 P …
# I …
# A …
# T …
# file_to_page[i] holds the page number of the i-th included file,
# or an empty string if that file is not a page.
while IFS= read -r file_info; do
    page="${file_info%% [APIT]*}"
    page="${page// }"
    file_to_page+=( "$page" )
done < <(djvused "$input_file" -e 'ls')

remove_non_pages() {
    # Remove all occurrences of \x0c due to a non-page.
    file=-1
    while IFS= read -r -d $'\x0c' file_text; do
        file=$((file + 1))
        # Skip chunks that belong to a non-page file.
        [[ "${file_to_page[$file]}" == '' ]] && continue
        echo "$file_text"$'\x0c'
    done
}

djvused "$input_file" -e 'print-pure-txt' | remove_non_pages
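To see how the file-to-page mapping and the filter interact without needing a DJVU file, here is a small sketch on made-up data. The `sample_ls` function and its column layout are assumptions meant to imitate `djvused -e 'ls'` output; only the parsing and filtering logic is taken from the script. It uses `printf` instead of `echo`, so no newline is added after each page break.

```shell
#!/bin/bash
# Hypothetical stand-in for `djvused -e 'ls'`: two pages with one shared
# (non-page) file between them. Real output may differ in detail.
sample_ls() {
    printf '%s\n' \
        '   1 P   1234  p0001.djvu' \
        '     I    567  shared_anno.iff' \
        '   2 P   2345  p0002.djvu'
}

# Fill the global array file_to_page from `ls`-style lines on stdin:
# the page number for page files, an empty string for everything else.
build_file_to_page() {
    file_to_page=()
    local file_info page
    while IFS= read -r file_info; do
        page="${file_info%% [APIT]*}"   # keep what precedes the type letter
        page="${page// }"               # strip the padding spaces
        file_to_page+=( "$page" )
    done
}

# Copy stdin to stdout, dropping every form-feed-terminated chunk that
# belongs to a non-page file.
remove_non_pages() {
    local file=-1 file_text
    while IFS= read -r -d $'\f' file_text; do
        file=$((file + 1))
        [[ -z "${file_to_page[$file]}" ]] && continue
        printf '%s\f' "$file_text"
    done
}
```

With the sample data, `build_file_to_page < <(sample_ls)` yields the entries `1`, empty, `2`, so a three-chunk text stream piped through `remove_non_pages` keeps only the first and third chunks. The process substitution matters: feeding `build_file_to_page` through a pipe would fill the array in a subshell and lose it.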
-
Posted by @vejkse here
I tried to make a custom adapter for DJVU files (working with the latest unreleased version). The simple methods don’t seem to work satisfactorily:

- `ebook-convert` doesn’t seem to add any page breaks.
- `djvutxt` doesn’t add a page break for blank pages, so the page numbers will be too small if there are such pages.
- `djvused -e 'print-pure-txt'` does add page breaks for blank pages, but also for each non-page file included in the djvu file (shared image or annotation data and thumbnail data), so the page numbers will be too large.

Moreover, it looks like `rga` sends the file to the standard input of the external program, and it doesn’t look like `djvused` can read the file from there, so we need to write it to a temporary file to use `djvused` (also, in my solution below, I use `djvused` twice on the file).

I ended up using `djvused` but removing the extra page breaks, or determining the correct page number, based on the information given by `djvused -e 'ls'`. This gives the list of included files, page files and others, which helps us identify which page breaks in the output of `djvused -e 'print-pure-txt'` correspond to a page.

So I ended up with the following configuration:

where `djvutorga` is the following `bash` script:

Notes:

- The `add_page_numbers` function takes more than half of the execution time. If the generic page break postprocessor can be used for custom adapters (which doesn’t seem to be possible right now), it could be replaced by the following somewhat faster function that only removes extraneous page breaks. If the latter postprocessor is fast enough, the overall extraction process would be faster.
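The temporary-file step mentioned in the post (saving what arrives on standard input so that `djvused` can be given a real path) could look roughly like the sketch below. The helper name `stdin_to_tempfile` is hypothetical and not from the original script; the sketch only assumes that `mktemp` is available.

```shell
#!/bin/bash
# Hypothetical helper (not from the post): copy everything on stdin to a
# fresh temporary file and print that file's path on stdout.
stdin_to_tempfile() {
    local tmp
    tmp="$(mktemp)" || return 1
    # Clean up and report failure if the copy does not succeed.
    cat > "$tmp" || { rm -f "$tmp"; return 1; }
    printf '%s\n' "$tmp"
}

# Possible use inside an adapter script (needs djvused, so left commented):
#   tmp="$(stdin_to_tempfile)" || exit 1
#   trap 'rm -f "$tmp"' EXIT
#   djvused "$tmp" -e 'ls'
#   djvused "$tmp" -e 'print-pure-txt'
```

Saving the input once lets both `djvused` calls read the same file, and the `trap` removes it when the script exits.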