.djvu adapter #166

phiresky · 2023-05-26T14:51:55Z

Maintainer

I tried to make a custom adapter for DJVU files (working with the latest unreleased version). The simple methods don’t seem to work satisfactorily:

ebook-convert doesn’t seem to add any page breaks.
djvutxt doesn’t add a page break for blank pages, so page numbers come out too small whenever the file contains blank pages.
djvused -e 'print-pure-txt' does add page breaks for blank pages, but also for each non-page file included in the DjVu file (shared image or annotation data, and thumbnail data), so page numbers come out too large.
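
One way to compare the tools is to count the form feed characters (the page breaks) each one emits and check the total against the real page count of the document. A minimal sketch, assuming each tool writes its text to standard output and marks page breaks with \x0c; count_page_breaks is a hypothetical helper name and book.djvu a placeholder file:

```shell
#!/bin/bash

# Count the form feed (\x0c) characters arriving on standard input.
count_page_breaks() {
  tr -cd '\f' | wc -c | tr -d '[:space:]'
}

# Simulated extractor output: three chunks of text, each followed by a form feed.
printf 'first\n\fsecond\n\fthird\n\f' | count_page_breaks
echo

# With DjVuLibre installed, the same helper could be applied to real output:
#   djvutxt book.djvu | count_page_breaks
#   djvused book.djvu -e 'print-pure-txt' | count_page_breaks
```

If the two counts differ for the same file, the difference hints at blank pages or non-page files in the document.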

Moreover, it looks like rga sends the file to the standard input of the external program, and djvused doesn’t appear to be able to read it from there. So we need to write the input to a temporary file before calling djvused (which my solution below does twice on the same file).

I ended up using djvused and either removing the extra page breaks or determining the correct page number, based on the output of djvused -e 'ls'. That command lists every file included in the document, page files as well as others, which lets us identify which page breaks in the output of djvused -e 'print-pure-txt' correspond to an actual page.
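
To make the mapping concrete, here is a minimal sketch of the 'ls' parsing run on made-up input. The sample lines (sizes and file names) are invented for illustration and only mimic the shape of the listing, where page files carry a page number before the code P and other files do not; parse_ls is a hypothetical helper name:

```shell
#!/bin/bash

# Read `djvused -e 'ls'`-style lines on stdin and fill the global array
# file_to_page: index = position of the file in the listing,
# value = page number, or empty for a non-page file.
parse_ls() {
  file_to_page=()
  local file_info page
  while IFS= read -r file_info; do
    page="${file_info%% [APIT]*}"   # keep what precedes the type code
    page="${page// /}"              # drop the padding spaces
    file_to_page+=( "$page" )
  done
}

# Invented sample listing: a shared file, two pages, a thumbnail file.
parse_ls <<'EOF'
     I    1024  shared_anno.iff
   1 P   20480  p0001.djvu
   2 P   19456  p0002.djvu
     T    2048  thumb.thumb
EOF

for i in "${!file_to_page[@]}"; do
  printf 'file %d -> page "%s"\n' "$i" "${file_to_page[$i]}"
done
```

Entries with an empty page are the ones whose page breaks have to be ignored or removed later.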

So I ended up with the following configuration:

    {
        "name": "djvused",
        "version": 1,
        "description": "Uses djvused to extract plain text from DJVU files",

        "extensions": ["djvu"],
        "mimetypes": ["image/vnd.djvu"],

        "binary": "djvutorga",
        "args": [],
        "disabled_by_default": false,
        "match_only_by_mime": false
    }

where djvutorga is the following bash script:

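#!/bin/bash

# rga passes the file on standard input, but djvused needs a path,
# so copy the input to a temporary file first.
temp_file="$(mktemp tempXXXXXXXXXX.djvu)"
clean_up() {
  rm -f "$temp_file"
}
trap clean_up EXIT
cat > "$temp_file"

# file_to_page[i] is the page number of the i-th included file,
# or empty if that file is not a page.
file_to_page=()
while IFS= read -r file_info; do
  # Each line of djvused -e 'ls' output starts with the page number,
  # if applicable, followed by a code: A, P, I or T.
  page="${file_info%% [APIT]*}"
  page="${page// /}"
  file_to_page+=( "$page" )
done < <(djvused "$temp_file" -e 'ls')

add_page_numbers() {
  local file=0 line pagebreaks
  # Text before the first page break belongs to the first included file.
  local page="${file_to_page[0]}"
  while IFS= read -r line || [[ -n "$line" ]]; do
    if [[ "$line" == *$'\x0c'* ]]; then
      # Each \x0c moves on to the next included file.
      pagebreaks="${line//[^$'\x0c']}"
      file=$((file + ${#pagebreaks}))
      page="${file_to_page[$file]}"
      line="${line//$'\x0c'}"
    fi
    echo "Page $page: $line"
  done
}

djvused "$temp_file" -e 'print-pure-txt' | add_page_numbers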

Notes:

There is no Unicode sanitization like in @meesvermeulen’s branch.
The add_page_numbers function takes more than half of the execution time. If the generic page break postprocessor could be used for custom adapters (which doesn’t seem to be possible right now), add_page_numbers could be replaced by the following somewhat faster function, which only removes the extraneous page breaks. Provided that postprocessor is fast enough, the overall extraction would be faster.

remove_non_pages() {
  # Remove all occurrences of \x0c due to a non-page.
  file=-1
  while IFS= read -r -d $'\x0c' file_text; do
    file=$((file + 1))
    [[ "${file_to_page[$file]}" == '' ]] && continue
    echo "$file_text"$'\x0c'
  done
}
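
For illustration, here is the same function run on a simulated stream, so its effect can be checked without a DjVu file. The mapping and the input text are made up for the demo, and the function is repeated only so that the block runs on its own:

```shell
#!/bin/bash

# Hand-written mapping for the demo: file 0 is a non-page file,
# files 1 and 2 are pages 1 and 2, file 3 is a non-page file.
file_to_page=( '' 1 2 '' )

remove_non_pages() {
  # Remove all occurrences of \x0c due to a non-page.
  local file=-1 file_text
  while IFS= read -r -d $'\x0c' file_text; do
    file=$((file + 1))
    [[ "${file_to_page[$file]}" == '' ]] && continue
    echo "$file_text"$'\x0c'
  done
}

# Simulated print-pure-txt stream: one form feed per included file.
filtered="$(printf '\fpage one\n\fpage two\n\f\f' | remove_non_pages)"

# Count the form feeds that survived the filter.
ff=$'\f'
kept=${filtered//[^$ff]/}
echo "${#kept} page breaks kept"
```

Four form feeds go in and only the two that follow real pages should come out.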

phiresky · 2023-05-26T14:53:06Z

Maintainer Author

It's possible to use the integrated ASCII page break converter by setting output_path_hint: "${input_virtual_path}.txt.asciipagebreaks" in the adapter configuration.
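
As a rough way to preview what the adapter output will look like after that converter, the step can be imitated in bash. This is only a sketch under an assumption about the converter's behaviour (each line gets a "Page N: " prefix, with N starting at 1 and increasing by one per form feed); it is not rga's actual code, and emulate_asciipagebreaks is a hypothetical name:

```shell
#!/bin/bash

# Stand-in for the integrated converter (assumed behaviour): prefix each
# line with "Page N: ", starting at page 1 and advancing N by one for
# every form feed seen.
emulate_asciipagebreaks() {
  local ff=$'\f' page=1 line breaks
  while IFS= read -r line || [[ -n "$line" ]]; do
    breaks=${line//[^$ff]/}            # keep only the form feeds
    page=$((page + ${#breaks}))
    printf 'Page %d: %s\n' "$page" "${line//$ff/}"
  done
}

# Simulated adapter output: three pages separated by form feeds.
printf 'first page\n\fsecond page\nstill second\n\fthird page\n' | emulate_asciipagebreaks
```

Under that assumption, an adapter only needs to emit exactly one form feed per real page, which is why extra breaks from non-page files would shift the numbering.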

vejkse · 2023-07-13T08:04:45Z


By replacing add_page_numbers with remove_non_pages and using the integrated asciipagebreaks converter, caching djvu files takes about 30% less time. Moreover, I now see that there is no need for a temporary file, since the path to the original file is available as $input_virtual_path.

Here is what I am using now in ripgrep-all/config.jsonc:

    {
        "name": "djvutorga",
        "version": 2,
        "description": "Uses djvused to extract plain text from DJVU files",

        "extensions": ["djvu"],
        "mimetypes": ["image/vnd.djvu"],

        "binary": "djvutorga",
        "args": ["$input_virtual_path"],
        "output_path_hint": "${input_virtual_path}.txt.asciipagebreaks",
        "disabled_by_default": false,
        "match_only_by_mime": false
    },

where djvutorga is now:

#!/bin/bash

# `djvused -e 'print-pure-txt'` adds page breaks for blank pages, but
# also for each non-page file included in the djvu file (shared image or
# annotation data and thumbnail data).
#
# `djvused -e 'ls'` gives the list of included files, page files and others.
# We can thus get the page number associated with a file number. The file
# number is what we get by counting page breaks in the output of
# `djvused -e 'print-pure-txt'`.

input_file="$1"

# Lines produced by `djvused -e 'ls'` are of the following forms:
#  45 P …
# 145 P …
#     I …
#     A …
#     T …
while IFS= read -r file_info; do
  page="${file_info%% [APIT]*}"
  page="${page// }"
  file_to_page+=( "$page" )
done < <(djvused "$input_file" -e 'ls')

remove_non_pages() {
  # Remove all occurrences of \x0c due to a non-page.
  file=-1
  while IFS= read -r -d $'\x0c' file_text; do
    file=$((file + 1))
    [[ "${file_to_page[$file]}" == '' ]] && continue
    echo "$file_text"$'\x0c'
  done
}

djvused "$input_file" -e 'print-pure-txt' | remove_non_pages
