Mx ocrmypdf #690
base: master
This change was needed for slowly growing files coming in from a scanner. The old implementation grabbed them even when they were not completely written. So now (when not using inotify) we wait and watch for changes in the mtime and/or size of the file. Normally this gives the scanning process enough time to complete its work and finish the transfer.
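The waiting described above could look roughly like the following. This is only a minimal sketch of the idea, not the code from this PR: the names `file_signature` and `wait_until_stable`, and the `interval` and `timeout` parameters, are hypothetical. It treats a file as complete once its mtime and size stay unchanged for one polling interval.

```python
import os
import time


def file_signature(path):
    """Return the (mtime, size) pair used to detect ongoing writes."""
    st = os.stat(path)
    return (st.st_mtime, st.st_size)


def wait_until_stable(path, interval=1.0, timeout=60.0, sleep=time.sleep):
    """Poll `path` until mtime and size stop changing.

    Hypothetical helper: returns True once two consecutive checks,
    `interval` seconds apart, see the same signature, and False if the
    file is still changing when `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    previous = file_signature(path)
    while time.monotonic() < deadline:
        sleep(interval)
        current = file_signature(path)
        if current == previous:
            return True  # nothing changed during the last interval
        previous = current  # still growing, keep waiting
    return False
```

A single unchanged interval is a heuristic: a scanner that pauses for longer than `interval` in the middle of a transfer would still be picked up too early, so the interval has to be chosen with the slowest device in mind.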
The ocrmypdf process came back with "tesseract took too long"... So instead of investigating this, I took the shortcut.
It was too short when deleting some 10 files. So I set this to 90s for now.
better user experience
The handling of the source filename is non-trivial and possibly broken, so this is work for a later moment... When the file object is created, the file doesn't have to exist, but this is assumed for now (and a moved file is searched for).
There were different functions to clean up the directory structure in MEDIAROOT/originals when using an intelligent file naming format. These were reduced and cleaned up to be more to the point. The idea is to address all situations where a file is moved or could be renamed by us or others. After that, it is time to clean up empty dirs. There is no need for optimization, so one recursive function can do this for us.
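To illustrate the single recursive cleanup mentioned above, here is a minimal sketch. It is not the function from this PR; `remove_empty_dirs` and its `remove_root` parameter are hypothetical names. It walks depth-first and removes every directory that ends up containing nothing, while leaving the top-level directory itself in place.

```python
import os


def remove_empty_dirs(root, remove_root=False):
    """Recursively delete empty directories below `root`.

    Hypothetical helper: returns True if `root` contains no files
    (only empty directories, which are removed). `root` itself is
    only deleted when `remove_root` is True.
    """
    # Snapshot the entries first so removals do not disturb iteration.
    with os.scandir(root) as it:
        entries = list(it)

    empty = True
    for entry in entries:
        if entry.is_dir(follow_symlinks=False):
            # A subdirectory counts as empty if the recursion removed it.
            if not remove_empty_dirs(entry.path, remove_root=True):
                empty = False
        else:
            empty = False  # a file (or symlink) keeps this directory alive

    if empty and remove_root:
        os.rmdir(root)
    return empty
```

Calling it on the originals directory with the default `remove_root=False` would prune stale subdirectories after a move or rename but never delete the originals directory itself.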
Thank you for this huge contribution! I just wanted to let you know that you will have a very hard time getting a review for this PR, mainly because it is incredibly large and contains multiple, distinct features, which makes reviewing it extra tough. If you want to see your changes land in Paperless, your best bet is to split this up into separate PRs, where each PR contains a distinct feature. Also, without looking into this in detail, dropping support for Python 3.5, while sensible since it is EOL, at least requires an update to the documentation, which I'm not seeing. I'm not seeing any documentation at all, and for new features it would be nice to at least have short descriptions for what each thing does (and what might change for users of the application, because there seem to be quite a few changes that could be breaking to our users).
Thank you for your kind note and the hints! I did some straightforward development to suit my needs. The current state is a point where I am thinking about what to do next. I can break down the rather huge change into small and distinct branches, each a separate PR. But I'm unsure if this is the way for me. Yes, it is completely understood that any merge into upstream needs review. My main question is: will there be any review in the near future? The second point of interest is whether my changes are acceptable at all. I tried not to break the behavior, besides some changes, e.g. in the consumer code, where I let the configuration decide. So I made some rather small enhancements, started i18n with German localization, and fixed small errors. One big feature was the support for another OCR backend, which indeed changes the content of the consumed document so that it becomes a searchable PDF. Thanks again for the time you spent looking into it and writing the comment.
It was too hard to build a reliably working ocrmypdf. So for now we use the pre-built Docker container and make a multi-stage build for the rest.
First: I like the project and it was the point for me to really switch over to a paperless workflow, thanks a lot for your work!
Second: I'm new to Python and Django. I used to code in C and C++, but Python was always on the todo list... So these are my first lines ever in Python. And Django? It's really, really cool, but there is a lot to learn.
So in short I made the following changes:
I hope someone will find it useful.