Refactor auto-archiver to use a modular structure for feeders/extractors/enrichers etc. #185

pjrobertson · 2025-01-29T18:34:07Z

Also:

View a full list of options and config settings by running auto-archiver --help
Quicker startup of auto-archiver, by only installing the enabled modules
New/additional modules can be enabled/disabled on the fly using command line arguments such as --extractor=new_temporary_extractor
You can now run auto-archiver for the first time without creating an orchestration.yaml file first; auto-archiver will set up a basic one so you can get started in no time 🚀
You can choose to write your latest command line args back to your config file (so you don't have to keep passing them) by setting the -s or --store flag on the command line
All config settings are checked/validated at startup, along with module dependencies, to make sure you have set up your modules correctly. Developers: you can create your own 'validators' to check values, and have the program check against them. Just drop them into the validators.py file (uses argparse type under the hood). Fixes "improve config parsing to accommodate type casting and validation rules" #130
Improved errors/warnings if there are any config/dependency issues and more helpful guidance when getting started (plus: colour output)
You can now log directly to files using the logging: file option. Set the logging level as well using logging: level
Fixed up unit tests + added a few new ones
Set your own custom module folder with module_paths=/my/own/modules/ to easily extend auto-archiver with new modules. Simply create a new module, place it in that folder, then pass the folder path on the command line or save it in your orchestration.yaml
Update yt-dlp to the latest version and remove obsolete code from the bluesky.py dropin (no longer needed, as we can call the yt-dlp extractor directly)
Create an 'authentication' mechanism for storing site authentication information (username/password, api keys, cookies etc.) in the config file, which can then be used by all extractors. Set your authentication details in one place, and all modules will be able to use them without having to set multiple additional configs!
Fixes using the screenshot enricher with facebook (it now clicks 'accept cookies' and closes the 'login' modal to take a proper screenshot). The screenshot enricher can now also use your cookies to log in to sites to take screenshots (cookies from a cookiejar file or directly from your browser).
New CSV feeder, which allows you to input CSV files of URLs (more info here)
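
The validation point above says validators use argparse type under the hood. A minimal sketch of what that could look like is below. The `is_file` name appears in the commit messages, but its body here, the `positive_int` validator, and the `--cookie_file` / `--retries` options are all assumptions made for illustration, not the project's actual code.

```python
import argparse
import os


def is_file(value: str) -> str:
    """Hypothetical validator: accept the value only if it names an existing file."""
    if not os.path.isfile(value):
        # argparse converts this exception into a readable usage error
        raise argparse.ArgumentTypeError(f"'{value}' is not a valid file")
    return value


def positive_int(value: str) -> int:
    """Hypothetical validator that also casts: '3' becomes 3, values below 1 are rejected."""
    try:
        number = int(value)
    except ValueError:
        raise argparse.ArgumentTypeError(f"'{value}' is not an integer") from None
    if number <= 0:
        raise argparse.ArgumentTypeError(f"'{value}' must be greater than 0")
    return number


def build_parser() -> argparse.ArgumentParser:
    """Build a small parser whose options are checked by the validators above."""
    parser = argparse.ArgumentParser(prog="example")
    # Option names are illustrative only
    parser.add_argument("--cookie_file", type=is_file, default=None)
    parser.add_argument("--retries", type=positive_int, default=1)
    return parser
```

Because argparse calls the `type` callable on each raw string, one function can both cast and validate a value, which is presumably why the same mechanism suits config checking.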

Developers:

Removes ArchivingContext completely; contextual information is now passed around on the metadata using metadata.set_context('key', 'val') and metadata.get_context('key', default='something')
Easily get authentication information in your module using self.auth_for_site(url) - this returns a dict with the different authentication options available depending on what the user has set (e.g. username/password, cookiejar, api_key, cookie).
Add your own modules without touching the auto-archiver project structure. Pass --module_paths /my/modules/folder/ on the command line, or add module_paths: /my/modules/folder/ to your config file.
Easily leverage the power of yt-dlp by using it to extract information for a website without having to create an entirely new extractor. Subclass 'GenericDropin' from the generic_extractor module and add the two methods extract_post and create_metadata, which both have access to yt-dlp's 'InfoExtractor' under the hood. An example of how to do this can be found here
Allow modules to have 'dependencies' on a) other modules, b) 3rd party python packages, or c) binary packages, by setting them in the 'dependencies' field in the __manifest__.py. (Note: fixes "No hashing algorithm defined for use in HtmlFormatter when HashEnricher not included" #156)
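
As a rough illustration of the dependency checking described above, the sketch below looks for missing Python packages and binaries using only the standard library. The manifest layout (a 'dependencies' dict with 'python' and 'bin' keys) and the `missing_dependencies` helper are assumptions for this example; the real manifest format and checker may differ.

```python
import importlib.util
import shutil

# Hypothetical manifest contents; the key names under 'dependencies' are assumed
EXAMPLE_MANIFEST = {
    "name": "Example Enricher",
    "dependencies": {
        "python": ["json"],   # importable top-level Python packages
        "bin": ["exiftool"],  # executables expected on PATH
    },
}


def missing_dependencies(manifest: dict) -> dict:
    """Return the Python packages and binaries from the manifest that cannot be found."""
    deps = manifest.get("dependencies", {})
    # find_spec returns None when a top-level package is not installed
    missing_python = [
        pkg for pkg in deps.get("python", []) if importlib.util.find_spec(pkg) is None
    ]
    # shutil.which returns None when no matching executable is on PATH
    missing_bin = [exe for exe in deps.get("bin", []) if shutil.which(exe) is None]
    return {"python": missing_python, "bin": missing_bin}
```

Running a check like this at startup lets the program report every missing dependency in one go, rather than failing partway through an archiving run.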

Other Fixes:

Fix a bug sending items to the database (fixes "Database.failed() method called without all required params" #173)

* Add implementation tests for orchestrator + logging tests
* Standardise method/class vars for extractors to see if they are suitable
* Fix bugs with removing default loguru logger (allows further customisation)
* Fix bug loading required fields from file

* Removes (partly) the ArchivingOrchestrator
* Removes the cli_feeder module, and makes it the 'default', allowing you to pass URLs directly on the command line, without having to use the cumbersome --cli_feeder.urls. Just do auto-archiver https://my.url.com
* More unit tests
* Improved error handling

Context for a specific url/item is now passed around via the metadata (metadata.set_context('key', 'val') and metadata.get_context('key', default='something')). The only other thing that was passed around in ArchivingContext was the storage info, which is already accessible via self.config.
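
To make the per-item context idea concrete, here is a minimal sketch of the pattern. The `Metadata` class below is a simplified stand-in written for this example, not the project's real class; only the `set_context` / `get_context` call shapes are taken from the description above.

```python
from typing import Any


class Metadata:
    """Simplified stand-in that carries per-item context alongside the URL."""

    def __init__(self, url: str) -> None:
        self.url = url
        self._context = {}  # context lives on the item, not in a global object

    def set_context(self, key: str, val: Any) -> None:
        """Store a contextual value for this item only."""
        self._context[key] = val

    def get_context(self, key: str, default: Any = None) -> Any:
        """Fetch a contextual value, falling back to `default` when it is unset."""
        return self._context.get(key, default)


item = Metadata("https://my.url.com")
item.set_context("folder", "archive/2025")
folder = item.get_context("folder")
missing = item.get_context("not_set", default="something")
```

Keeping the context on each item means two items processed side by side cannot overwrite each other's values, which a shared global context could not guarantee.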

pjrobertson and others added 30 commits January 21, 2025 17:53

Use already implemented helper to get version

c41d93a

Ignore pylint statements for manifest files

bdfc855

Add __manifest__.py for generic_extractor

03f3770

Initial changes to move to '__manifest__' format

241b350

Get parsing of manifest and combining with config file working

4830f99

Create manifest files for archiver modules.

7b3a146

Further tweaks based on __manifest__.py files

54995ad

Loading configs now works

Switch back to using yaml with dot notation

b6b0858

(two simple helper functions to convert between dot and dict notation)

Tidy up imports + start on loading modules - program now starts much …

ade5ea0

…faster

Manifests for databases

99c8c69

Merge branch 'load_modules' into more_mainifests

c517d35

# Conflicts: # src/auto_archiver/databases/__init__.py

Get module loading working properly

550097a

Fix loading already loaded modules - don't load them twice

65ef46d

Set up feeder manifests (not merged by source yet)

79684f8

Merge branch 'load_modules' into more_mainifests

9db26cd

# Conflicts: # src/auto_archiver/core/orchestrator.py

More manifests, base modules and rename from archiver to extractor.

1274a1b

Rename storages for clarity

c3403ce

Move storage configs into individual manifests, assert format on useage.

50f4ebc

Fix up loading/storing configs + unit tests

b27bf8f

Revert changes to orchestrator to avoid merge conflicts

06f6e34

Fix loading modules when entry_point isn't set

9befb97

Revert Dockerfile changes

cbafbfa

Merge remote-tracking branch 'origin/more_mainifests' into more_maini…

ba4b330

…fests

Update manifests and modules

aa7ca93

fix config parsing in manifests

0453d95

fix config parsing in manifests, remove module level configs

024fe58

Gsheets utility revert

1942e8b

Merge branch 'main' into load_modules

f1e9ab6

Tweaks to logging strings

3fc6ddf

Fix and add types to manifest

dd402b4

pjrobertson and others added 30 commits January 28, 2025 11:37

Validate orchestration.yaml file inputs - so if a user enters invalid…

27b25c5

… values, it also validates them

more user friendly error logging when config issues are found

9635449

Fix up unit tests for new structure

7a4871d

set metadata enricher to requires_setup=True (requires exiftool which…

dcd5576

… isn't installed by default on most machines)

Tidy ups + unit tests:

3d37c49

1. Allow loading modules from --module_paths=/extra/path/here 2. Improved unit tests for module loading 3. Further small tidy ups/clean ups

Fix up dependency checking (use 'dependencies' instead of 'external_d…

00a7018

…ependencies' -> simpler/easier to remember

Add ruamel to dependencies (replaces pyyaml)

18ff36c

Update modules for new core structure.

cddae65

Fix up unit tests - dataclass + subclasses not having @dataclass was …

fade68c

…breaking it

Fix manifests for required configs.

5274388

Don't make modules 'dataclasses'

953011f

Fix unit tests

d76063c

Fix getting/setting folder context for metadata

9a8c94b

Remove lingering reference to ArchivingContext

9c9e9b3

Add cookie extraction to 'authentication' options, get generic_extrac…

7a2be5a

…tor working using this info

Remove cookie options from generic_extractor - it now uses 'authentic…

7ec328a

…ation' global settings :D

Set up screenshot enricher to use authentication/cookies

c574b69

Restore headless arg

72b5ea9

Remove old csv_feeder file - now inside a module

a873e56

Fix using validators set in __manifest__.py

b301f60

E.g. you can use the validator 'is_file' to check if a config is a valid file

Unit tests for csv feeder + fix some bugs

78e6418

Fix typos in csv feeder docs (in manifest)

034197a

Close the facebook 'login' window if it's there - to allow for proper…

0633e17

… screenshots

Update yt-dlp to latest version + remove code no longer needed from b…

91ca325

…luesky dropin

Remove dangling screenshot_enricher file. Moved to modules/screenshot…

48abb5e

…_enricher

Tidy up setting modules as Orchestrator attributes on startup.

6ab8fd2

Don't override the values in config['steps'] – the config should be left as is

Clarify that an extractor's method can also return False if no valid …

a506f2a

…data was found
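
The commit "Switch back to using yaml with dot notation" mentions two simple helper functions to convert between dot and dict notation. A minimal sketch of such a pair is below; the function names and exact behaviour are assumptions for illustration, not the code from that commit.

```python
def to_dot_notation(nested: dict, prefix: str = "") -> dict:
    """Flatten {'logging': {'level': 'INFO'}} into {'logging.level': 'INFO'}."""
    flat = {}
    for key, value in nested.items():
        full_key = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            # Recurse into non-empty dicts, extending the dotted prefix
            flat.update(to_dot_notation(value, full_key))
        else:
            # Leaves (including lists and empty dicts) are stored as-is
            flat[full_key] = value
    return flat


def from_dot_notation(flat: dict) -> dict:
    """Rebuild the nested dict from dotted keys (inverse of to_dot_notation)."""
    nested = {}
    for dotted_key, value in flat.items():
        parts = dotted_key.split(".")
        node = nested
        for part in parts[:-1]:
            # Create intermediate dicts on demand
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return nested
```

Dotted keys such as logging.level map naturally onto command line options, while the nested form matches how the YAML file is laid out. This sketch assumes the keys themselves contain no dots.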
