
Parallelize auditing? #14

Open · woodruffw opened this issue Oct 6, 2022 · 3 comments
@woodruffw (Member)
I'm not sure if this is a good idea yet.

When dealing with lots of specs (especially full Python package histories), auditing is pretty slow, since it's entirely serial. It doesn't need to be this way: auditing is embarrassingly parallel (each step is entirely independent).

The only real obstacles here are UI/UX ones: if we break auditing up into a pool of threads or processes, we'll want to make sure that the current output and progress bars remain about the same (or get nicer).

woodruffw self-assigned this Oct 6, 2022
@giampaolo commented Sep 30, 2024

Hi @woodruffw. FYI a while back I submitted this PR for the autoflake CLI tool:
PyCQA/autoflake#107.

I'm not familiar with the abi3audit codebase. If by "specs" you mean "files" that you read and process independently, then this is definitely the sort of work that you can delegate to a process pool and see a noticeable speedup.

As for how to properly log / print to stdout: multiprocessing.Pool lets you run a function and get its return value. You may want to print to stdout only when the worker has finished its work. More or less:

import multiprocessing

# note: on platforms that spawn workers (Windows, macOS), the pool should
# be created under an `if __name__ == "__main__":` guard
with multiprocessing.Pool() as pool:
    futs = []
    # submit one task per file to the pool
    for file in files:
        fut = pool.apply_async(audit_file, args=(file,))
        futs.append(fut)
    # collect the workers' results; fut.get() re-raises any exception
    # that was raised inside the worker
    for fut in futs:
        try:
            result = fut.get()
        except Exception as err:
            print(err)
        else:
            print(result)

If, on the other hand, you print progress inside your worker function (audit_file() in the example above), things may be a bit more complicated, since you'll want some sort of synchronization primitive (a lock, a pipe, etc.) to serialize writes to stdout (maybe the logging module already provides something in this regard, but I'm not sure).
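
For the lock route, a minimal sketch of what I mean (audit_file's body and the file list are just placeholders; the shared lock is handed to each worker via the pool's initializer):

import multiprocessing

def init_worker(l):
    # runs once in each worker process; stash the shared lock in a global
    global lock
    lock = l

def audit_file(path):
    # ... do the actual auditing work here ...
    with lock:
        # only one worker at a time writes to stdout
        print(f"audited {path}")

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    with multiprocessing.Pool(initializer=init_worker, initargs=(lock,)) as pool:
        pool.map(audit_file, ["a.whl", "b.whl"])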

Hope this helps.

@woodruffw (Member, Author)

Thanks, that is indeed helpful!

A "spec" in abi3audit is anything that gets passed in via the CLI, which can be unfolded into one more or auditable shared objects. For example if someone runs abi3audit cryptography, we fetch every wheel for cryptography ever and audit every object in every wheel 🙂

@giampaolo

Got it. In that case you probably want to fetch all the wheels first and put them in a list, and process that list via the process pool.

If the fetching operation consists of downloading files from the internet, you can also use a separate pool just for that, but it should be a thread pool, since that work is I/O bound rather than CPU bound. For thread pools you can use https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor. For process pools I personally find multiprocessing.Pool more convenient (but I forgot why :)).
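
Putting the two stages together, roughly (a sketch only: fetch_wheel and audit_wheel are placeholders, and wheel_urls would come from wherever you enumerate the wheels, e.g. PyPI's JSON API):

import concurrent.futures
import multiprocessing
import urllib.request

def fetch_wheel(url):
    # I/O bound: download one wheel and return its local path
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)
    return filename

def audit_wheel(path):
    # CPU bound: placeholder for the actual audit
    return f"{path}: ok"

if __name__ == "__main__":
    wheel_urls = []  # e.g. collected from PyPI's JSON API

    # stage 1: thread pool for the downloads (I/O bound)
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as tpool:
        paths = list(tpool.map(fetch_wheel, wheel_urls))

    # stage 2: process pool for the audits (CPU bound)
    with multiprocessing.Pool() as ppool:
        for result in ppool.map(audit_wheel, paths):
            print(result)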
