Make the outline page translation more robust and configurable#26
Make the outline page translation more robust and configurable#26Alkaid-Benetnash wants to merge 1 commit intokcroker:masterfrom
Conversation
v--
left a comment
There was a problem hiding this comment.
In addition to the less-important comments:
As far as I understand, there are two scenarios:
-
In the average case, pages do not have explicit names set, and their page numbers are simply incremental. This is the case currently covered. I have determined that this case behaves correctly with an offset.
-
In the more complicated cases, page titles contain page numbers, and it is sometimes possible to extract those numbers. I have given an example in
test_outline_with_page_titleswhere this is not-at-all straightforward, and why I did not implement it in the first place. Anyhow, this case does not need an offset.
Some conclusions:
- I believe we can skip the offset parameter altogether because one use case requires a static offset and the other does not, in general, require an offset. Implementation details may change, but the public interface must only be changed in radical situations. This is why I try to be very careful about what parameters the program provides.
- Make sure to write several tests for your use cases, because we want to preserve current behavior and make the new behavior consistent. In particular, see the aforementioned test.
| @click.option('-p', '--pool-size', type=click.IntRange(min=0), default=4, help='Size of MultiProcessing pool for handling page-by-page operations.') | ||
| @click.option('-q', '--quality', type=click.IntRange(min=0, max=100), default=75, help="Quality of images in output. Used only for JPEG compression, i.e. RGB and Grayscale images. Passed directly to Pillow and to OCRmyPDF's optimizer.") | ||
| @click.option('--ocr', type=str, is_flag=False, flag_value='{}', help='Perform OCR via OCRmyPDF rather than trying to convert the text layer. If this parameter has a value, it should be a JSON dictionary of options to be passed to OCRmyPDF.') | ||
| @click.option('--toc-pg-offset', type=int, default=-1, help='The page offset to be applied when translating outline/toc. default is -1') |
There was a problem hiding this comment.
There is no need to specify the default in the help string. click will display a default.
Also, this should be added to the README and the dpsprep.1.ronn page.
There was a problem hiding this comment.
Hi v, I am not seeing click to display a default. I am updating this PR according to (stackoverflow)[https://stackoverflow.com/questions/57550012/how-to-tell-click-to-always-show-option-defaults) to let click display default values.
README and dpsprep.1.ronn are updated.
| return | ||
|
|
||
| if page_number < 0 or page_number >= self.total_pages: | ||
| logger.warning(f'The page title \'{page.value}\' will be translated to an invalid number {page_number}.') |
There was a problem hiding this comment.
"will be" implies that this will happen; and you return specifically for it not to happen.
| # Based on | ||
| # https://github.com/pmaupin/pdfrw/issues/52#issuecomment-271190546 | ||
| class OutlineTransformVisitor(SExpressionVisitor): | ||
| def __init__(self, toc_pg_offset : int, total_pages : int): |
There was a problem hiding this comment.
This code fails the type checker - try running make lint.
-
These two type hints should also be added to the class itself (dataclasses can then be used to avoid writing a trivial constructor, but given that the other code here doesn't use dataclasses, I wouldn't go as far as to recommend it).
-
Please use consistent styling. The other code has no space before the colons.
There was a problem hiding this comment.
Passed make lint on my side. Fixed the inconsistent styling.
| super().__init__() | ||
| def visit_plain_list(self, node: djvu.sexpr.StringExpression, parent: IndirectPdfDict): | ||
| title, page, *rest = node | ||
| # I have experimentally determined that we need to translate page indices. -- Ianis, 2023-05-03 |
There was a problem hiding this comment.
This comment refers to the approach that you are trying to replace. It should be either deleted or updated.
|
@Alkaid-Benetnash I appreciate you addressing my concrete code-related comments, but I am afraid you missed the more essential one. Please see #26 (review) . Review the failing unit tests and, more concretely, see #23 for an example file where your approach fails. I hope you will understand why I am hesitant towards the approach in the pull request. I have tried several solutions myself and have come to the conclusion that none of them work well. A proper solution would require digging into libdjvu, but I am afraid neither of us is willing to do that. Please review the aforementioned comment and issue, review the failing unit tests and see if you come up with a solution to the problems outlined there. |
Hi,
I encountered outline "page title" that the existing code could not recognize:
So I added a regex matching for the first consecutive number substring.
Also the page table translation rule is not the same neither. Existing code assumes the outline always start from page number 1 while for PDF the page number should start from 0.
This PR also introduced an option to config such outline translation page offset.