Description
This PR contains all new features added in #67 while also refactoring the work therein to abstract the node worker pool functionality out into its own module for use across multiple projects.
A summary of features added before the abstraction/refactor (May 5th, 2023):
Two main features have been added to the CID importer cron:
(cid-batch-import.js in the crons directory). The first part of the main cid-importer.js script is unchanged: a manifest list of CIDs to download is still generated and stored to tmp/cid-files/cid-manifest.txt. However, where the files in the manifest list were previously retrieved in batches processed in series, the script now delegates batches to worker threads to process in parallel. Two new arguments can be passed to the cid-importer.js script: --threads, followed by the integer number of workers to add, and a boolean argument --all, which, if true, skips the search for the last imported document in the database and retrieves all CIDs starting from the oldest existing upload. The two previous arguments, both of which still apply, are --pagesize, an integer specifying the import/backup batch size, and --maxpages, an integer specifying how many batches to process; if left unspecified, no limit is placed on the number of batches.
Ticket:
https://www.notion.so/agencyundone/Backup-all-dataset-manifests-to-Backblaze-3196a93f141546a3a91602d78b3dbd7f?pvs=4
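As a rough illustration of the four command-line arguments described above, here is a minimal sketch of how they might be parsed. It is not taken from cid-importer.js: the helper name parseImporterArgs and the default values for threads and pagesize are assumptions, and whether --all is passed bare or with an explicit true/false is not stated in this PR, so the sketch accepts both.

```javascript
// Hypothetical helper: not the parsing code used by cid-importer.js.
function parseImporterArgs(argv) {
  // threads and pagesize defaults are assumptions for illustration.
  // maxpages defaults to Infinity because "unspecified" means no batch limit.
  const options = { threads: 1, all: false, pagesize: 100, maxpages: Infinity };

  for (let i = 0; i < argv.length; i++) {
    const flag = argv[i];
    if (flag === '--all') {
      // Accept a bare "--all" as well as "--all true" / "--all false".
      const next = argv[i + 1];
      if (next === 'true' || next === 'false') {
        options.all = next === 'true';
        i++;
      } else {
        options.all = true;
      }
    } else if (['--threads', '--pagesize', '--maxpages'].includes(flag)) {
      const value = Number.parseInt(argv[i + 1], 10);
      if (!Number.isInteger(value) || value < 1) {
        throw new Error(`${flag} expects a positive integer`);
      }
      options[flag.slice(2)] = value; // "--threads" -> options.threads
      i++;
    }
  }
  return options;
}
```

Under these assumptions, running the script with --threads 4 --all --pagesize 50 would yield four workers, a full re-import from the oldest upload, batches of 50, and no limit on the number of batches.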
Most recent features added with this PR (Multithread Batch Processing via Worker Pool Module):
Overview
The worker-pool-batch-processor.js module manages a pool of Node workers to process large lists of data or objects in batches. It is a thin layer built on top of the workerpool npm package that makes processing large datasets in batches quick and straightforward. The module exports two functions: CreateWorkerPool and InitializeWorker.
CreateWorkerPool instantiates a pool of workers using the workerpool npm package. It then siphons and queues batches from a manifest list one at a time. The manifest list can be a list of data, objects, or references to files. How this list is processed depends entirely on the script supplied to CreateWorkerPool, the contents of which are passed on to each individual worker. The CreateWorkerPool function is agnostic to the contents of the script; it only handles delegating batches to workers and managing the results returned from each worker.
There are four arguments that the function will accept:
These arguments would typically be used to display the worker pool's progress as each batch is completed.
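The batching and delegation behaviour described for CreateWorkerPool can be sketched roughly as follows. This is not the module's implementation: plain async functions stand in for worker threads so the example runs without the workerpool package, and the names takeBatches, runBatches, and onBatchDone are hypothetical. In a workerpool-based version, the call to operation would presumably be replaced by a call that executes the named method on the pool.

```javascript
// Sketch only: async functions stand in for worker threads here.

// Yield successive batches of `batchSize` items from the manifest list.
function* takeBatches(manifest, batchSize) {
  for (let start = 0; start < manifest.length; start += batchSize) {
    yield manifest.slice(start, start + batchSize);
  }
}

// Run `operation` over every batch with at most `threads` batches in flight.
// `onBatchDone` is a hypothetical progress callback.
async function runBatches(manifest, { batchSize, threads, operation, onBatchDone }) {
  const batches = takeBatches(manifest, batchSize);
  const results = [];
  let index = 0;

  // Each lane pulls the next batch as soon as it finishes its previous one,
  // so batches are siphoned from the manifest one at a time.
  async function lane() {
    for (;;) {
      const next = batches.next();
      if (next.done) return;
      const batchIndex = index++;
      results[batchIndex] = await operation(next.value);
      if (onBatchDone) onBatchDone(batchIndex, results[batchIndex]);
    }
  }

  await Promise.all(Array.from({ length: threads }, lane));
  return results; // Results are stored in manifest order, not completion order.
}
```

Pulling batches lazily from a generator, rather than queuing every batch up front, keeps memory use proportional to the number of workers instead of the size of the manifest.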
InitializeWorker is a simple initialization function that takes a single argument:
Use
The module must be used with two separate scripts:
One runs in the main thread and is the source of the manifest list passed to the worker pool. In this script, the CreateWorkerPool function must be imported from the worker-pool-batch-processor.js module and called at the appropriate step with the arguments described above.
Likewise, the InitializeWorker function must be imported in a script containing all processes to be executed in a worker thread. This function must be called as the last step in the script and must be passed the top-most function in the script, whose name must exactly match the operation string passed to the CreateWorkerPool function.
This function (the operation passed as an argument to InitializeWorker) must return a Promise that either resolves with the result of a processed batch or rejects on error or a null result.
Ticket:
https://www.notion.so/agencyundone/Abstract-node-workers-and-worker-pools-782d974e21c8455282b033783a40198e?pvs=4
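To make the worker-side contract above concrete, here is a minimal sketch of an operation that resolves with a processed batch and rejects on an error or a null result. The function name importBatch and its placeholder work are hypothetical; in a real worker script this function would be passed to InitializeWorker as the last step, which is omitted here so the example stays self-contained.

```javascript
// Hypothetical top-level operation for a worker script.
function importBatch(batch) {
  return new Promise((resolve, reject) => {
    try {
      // Placeholder work: a real operation might download each CID here.
      const result = Array.isArray(batch)
        ? batch.map((cid) => ({ cid, ok: true }))
        : null;

      if (result === null) {
        // A null result is treated as a failure, per the contract above.
        reject(new Error('importBatch produced a null result'));
        return;
      }
      resolve(result);
    } catch (error) {
      reject(error);
    }
  });
}

// The operation string given to the pool must equal the function's name.
const operationName = importBatch.name;
```

Deriving the operation string from the function's name property, rather than typing it twice, is one way to keep the two from drifting apart.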