Refactor auto-archiver to use a modular structure for feeders/extractors/enrichers etc. #185

Open
wants to merge 72 commits into
base: main

pjrobertson
Collaborator

@pjrobertson pjrobertson commented Jan 29, 2025

Also:

  • View a full list of options and config settings by running auto-archiver --help
  • Quicker startup of auto-archiver, by only installing the enabled modules
  • New/additional modules can be enabled/disabled on the fly using command line arguments such as --extractor=new_temporary_extractor
  • You can now run auto-archiver for the first time without creating an orchestration.yaml file first; auto-archiver will set up a basic one so you can get started in no time 🚀
  • You can choose to save your latest command line args to your config file (so you don't have to keep passing them) by setting the -s or --store flag on the command line
  • All config settings are checked/validated at startup, along with module dependencies, to make sure you have set up modules correctly. Developers: you can create your own 'validators' to check values, and have the program check against them. Just drop them into the validators.py file (uses argparse type under the hood). Fixes 'improve config parsing to accommodate type casting and validation rules' #130
  • Improved errors/warnings if there are any config/dependency issues and more helpful guidance when getting started (plus: colour output)
  • You can now log directly to files using the logging: file option. Set the logging level as well using logging: level
  • Fixed up unit tests + added a few new ones
  • Set your own custom module folder with module_paths=/my/own/modules/ to easily extend auto-archiver with new modules. Simply create a new module, place it in that folder, then pass the folder path on the command line or save it in your orchestration.yaml
  • Update yt-dlp to the latest version and factor out obsolete code from the bluesky.py dropin (no longer needed, as we can call the yt-dlp extractor directly)
  • Create an 'authentication' mechanism for storing site authentication information (username/password, API keys, cookies etc.) in the config file, which can then be used by all extractors. Set your authentication details in one place, and all modules will be able to use them without having to set multiple additional configs!
  • Fixes using the screenshot enricher with Facebook (clicks 'accept cookies' and closes the 'login' modals to take a proper screenshot). The screenshot enricher can now also use your cookies to log in to sites to take screenshots (cookies from a cookiejar file or directly from your browser).
  • New CSV feeder, which allows you to input CSV files of URLs (more info here)
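
To illustrate the shared authentication idea from the list above, here is a minimal sketch of a per-site lookup. The `AUTHENTICATION` dict layout and the `lookup_auth` helper are assumptions made for illustration, not the project's actual config schema or implementation; only the idea of keeping credentials in one place and returning whatever the user has set comes from this PR.

```python
from urllib.parse import urlparse

# Hypothetical config section: site domain -> authentication options.
# The real layout in orchestration.yaml may differ.
AUTHENTICATION = {
    "example.com": {"username": "alice", "password": "secret"},
    "video.example.org": {"api_key": "abc123"},
}


def lookup_auth(url: str, authentication: dict) -> dict:
    """Return the auth options configured for the URL's site, or {} if none."""
    host = (urlparse(url).hostname or "").lower()
    # Treat "www.example.com" the same as "example.com".
    if host.startswith("www."):
        host = host[len("www."):]
    return authentication.get(host, {})


print(lookup_auth("https://www.example.com/post/1", AUTHENTICATION))
print(lookup_auth("https://unknown.test/", AUTHENTICATION))
```

Returning an empty dict for unknown sites lets callers use `.get("api_key")` style access without special-casing sites that have no credentials.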

Developers:

  • Removes ArchivingContext completely, and now passes contextual information around using the metadata: metadata.set_context('key', 'val') and metadata.get_context('key', default='something')
  • Easily get authentication information in your module using self.auth_for_site(url) - this returns a dict with the different authentication options available depending on what the user has set (e.g. username/password, cookiejar, api_key, cookie).
  • Add your own modules without touching the auto-archiver project structure. Pass --module_paths /my/modules/folder/ on the command line, or add module_paths: /my/modules/folder/ to your config file.
  • Easily leverage the power of yt-dlp by using it to extract information for a website without having to create an entirely new extractor. Subclass 'GenericDropin' from the generic_extractor module, and add the two methods extract_post and create_metadata, which both have access to yt-dlp's 'InfoExtractor' under the hood. An example of how to do this can be found here
  • Allow modules to have 'dependencies' on either a) other modules, b) 3rd party Python packages, or c) binary packages, by setting them in the 'dependencies' field in the manifest.py. (Note: fixes 'No hashing algorithm defined for use in HtmlFormatter when HashEnricher not included' #156)
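
As a rough illustration of the dependency check described in the last point, the sketch below tests a manifest-like dict for missing Python packages and binaries. The manifest key names (`python`, `bin`) and the `missing_dependencies` helper are assumptions for this example and may not match the project's manifest format; dependencies on other modules are left out for brevity.

```python
import importlib.util
import shutil

# Hypothetical manifest fragment; key names are assumptions.
manifest = {
    "name": "Example Extractor",
    "dependencies": {
        "python": ["json"],                      # importable Python packages
        "bin": ["definitely-not-a-real-binary"],  # executables expected on PATH
    },
}


def missing_dependencies(manifest: dict) -> dict:
    """Return the Python packages and binaries that cannot be found."""
    deps = manifest.get("dependencies", {})
    # find_spec returns None when a top-level package is not importable.
    missing_python = [
        pkg for pkg in deps.get("python", [])
        if importlib.util.find_spec(pkg) is None
    ]
    # shutil.which returns None when no executable with that name is on PATH.
    missing_bin = [
        exe for exe in deps.get("bin", [])
        if shutil.which(exe) is None
    ]
    return {"python": missing_python, "bin": missing_bin}


print(missing_dependencies(manifest))
```

Checking at startup, before any step runs, means a missing package surfaces as one clear message rather than as a failure partway through archiving.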

Other Fixes:

pjrobertson and others added 30 commits January 21, 2025 17:53
(two simple helper functions to convert between dot and dict notation)
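
For readers unfamiliar with the two notations, a minimal sketch of such a pair of helpers might look like the following. The function names `to_dot_notation` and `from_dot_notation` are hypothetical, and this is not the code from the commit.

```python
def to_dot_notation(nested: dict, prefix: str = "") -> dict:
    """Flatten {'a': {'b': 1}} into {'a.b': 1}."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections, carrying the dotted prefix along.
            flat.update(to_dot_notation(value, full_key))
        else:
            flat[full_key] = value
    return flat


def from_dot_notation(flat: dict) -> dict:
    """Expand {'a.b': 1} back into {'a': {'b': 1}}."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, last = dotted_key.split(".")
        node = nested
        for part in parents:
            # Create intermediate dicts on demand.
            node = node.setdefault(part, {})
        node[last] = value
    return nested


config = {"logging": {"level": "INFO", "file": "out.log"}}
flat = to_dot_notation(config)
print(flat)
print(from_dot_notation(flat) == config)
```

Note that in this sketch an empty nested dict disappears when flattened, so the round trip is only exact for configs without empty sections.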
# Conflicts:
#	src/auto_archiver/databases/__init__.py
# Conflicts:
#	src/auto_archiver/core/orchestrator.py
pjrobertson and others added 30 commits January 28, 2025 11:37
… isn't installed by default on most machines)
1. Allow loading modules from --module_paths=/extra/path/here
2. Improved unit tests for module loading
3. Further small tidy ups/clean ups
* Add implementation tests for orchestrator + logging tests
* Standardise method/class vars for extractors to see if they are suitable
* Fix bugs with removing default loguru logger (allows further customisation)
* Fix bug loading required fields from file
* Removes (partly) the ArchivingOrchestrator
* Removes the cli_feeder module, and makes it the 'default', allowing you to pass URLs directly on the command line, without having to use the cumbersome --cli_feeder.urls. Just do auto-archiver https://my.url.com
* More unit tests
* Improved error handling
Context for a specific url/item is now passed around via the metadata (metadata.set_context('key', 'val') and metadata.get_context('key', default='something'))
The only other thing that was passed around in ArchivingContext was the storage info, which is already accessible now via self.config
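
A minimal sketch of how per-item context on a metadata object could behave, using the two method names from the note above. The `Metadata` class here is a stand-in written for this example, and the context keys are made up; the project's real class carries much more than this.

```python
class Metadata:
    """Stand-in for the project's Metadata class; only context is sketched."""

    def __init__(self):
        # Context lives on the item itself rather than in a global object.
        self._context = {}

    def set_context(self, key, val):
        self._context[key] = val

    def get_context(self, key, default=None):
        # Fall back to the default when the key was never set.
        return self._context.get(key, default)


item = Metadata()
item.set_context("folder", "case-42")            # hypothetical key
print(item.get_context("folder"))
print(item.get_context("missing", default="something"))
```

Because each item holds its own context, two URLs processed in the same run cannot overwrite each other's values, which is the main hazard of a shared global context.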
E.g. you can use the validator 'is_file' to check if a config is a valid file
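
Since the PR description says validators use argparse's `type` under the hood, a validator like this can be sketched as a plain callable that returns the value or raises `argparse.ArgumentTypeError`. This is an illustrative version, not the project's actual `is_file`, and the `--cookie_file` option name is made up for the example.

```python
import argparse
import os
import tempfile


def is_file(value: str) -> str:
    """Accept the value only if it points at an existing file."""
    if not os.path.isfile(value):
        # argparse turns this exception into a readable usage error.
        raise argparse.ArgumentTypeError(f"'{value}' is not a valid file")
    return value


parser = argparse.ArgumentParser()
# Hypothetical option; argparse calls is_file on the raw string.
parser.add_argument("--cookie_file", type=is_file)

# Demonstrate with a temporary file that is guaranteed to exist.
with tempfile.NamedTemporaryFile() as tmp:
    existing_path = tmp.name
    parsed = parser.parse_args(["--cookie_file", existing_path])

print(parsed.cookie_file == existing_path)
```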
Don't override the values in config['steps'] – the config should be left as is
2 participants