
Parallelize auditing? #14

Open · woodruffw opened this issue Oct 6, 2022 · 3 comments
@woodruffw (Member)
I'm not sure if this is a good idea yet.

When dealing with lots of specs (especially full Python package histories), auditing is pretty slow, since it's entirely serial. It doesn't need to be this way: auditing is embarrassingly parallel (each step is entirely independent).

The only real obstacles here are UI/UX ones: if we break auditing up into a pool of threads or processes, we'll want to make sure that the current output and progress bars remain about the same (or get nicer).

woodruffw self-assigned this Oct 6, 2022
@giampaolo commented Sep 30, 2024

Hi @woodruffw. FYI a while back I submitted this PR for the autoflake CLI tool:
PyCQA/autoflake#107.

I'm not familiar with the abi3audit codebase. If by "specs" you mean "files" that you read and process independently, then this is definitely the sort of work that you can delegate to a process pool and see a noticeable speedup.

As for how to properly log / print to stdout: multiprocessing.Pool lets you run a function and get its return value. You may want to print to stdout only when the worker has finished its work. More or less:

import multiprocessing

# note: on platforms that spawn workers (Windows, macOS), the pool should
# be created under an `if __name__ == "__main__":` guard
with multiprocessing.Pool() as pool:
    futs = []
    # submit one task per file to the pool
    for file in files:
        fut = pool.apply_async(audit_file, args=(file,))
        futs.append(fut)
    # collect the workers' results; fut.get() re-raises any exception
    # that was raised inside the worker
    for fut in futs:
        try:
            result = fut.get()
        except Exception as err:
            print(err)
        else:
            print(result)

If, on the other hand, you print progress inside your worker function (audit_file() in the example above), things may be a bit more complicated, since you'll want some sort of synchronization primitive (a lock, a pipe, etc.) to serialize writes to stdout (maybe the logging module already provides something in this regard, but I'm not sure).
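
For the lock route, a minimal sketch of what I mean (audit_file's body and the file list are just placeholders; the shared lock is handed to each worker via the pool's initializer):

import multiprocessing

def init_worker(l):
    # runs once in each worker process; stash the shared lock in a global
    global lock
    lock = l

def audit_file(path):
    # ... do the actual auditing work here ...
    with lock:
        # only one worker at a time writes to stdout
        print(f"audited {path}")

if __name__ == "__main__":
    lock = multiprocessing.Lock()
    with multiprocessing.Pool(initializer=init_worker, initargs=(lock,)) as pool:
        pool.map(audit_file, ["a.whl", "b.whl"])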

Hope this helps.

@woodruffw (Member, Author)

Thanks, that is indeed helpful!

A "spec" in abi3audit is anything that gets passed in via the CLI, which can be unfolded into one more or auditable shared objects. For example if someone runs abi3audit cryptography, we fetch every wheel for cryptography ever and audit every object in every wheel 🙂

@giampaolo

Got it. In that case you probably want to fetch all the wheels first and put them in a list, and process that list via the process pool.

If the fetching operation consists of downloading files from the internet, you can also use a separate pool just for that, but it should be a thread pool, since that work is I/O bound rather than CPU bound. For thread pools you can use https://docs.python.org/3/library/concurrent.futures.html#threadpoolexecutor. For process pools I personally find multiprocessing.Pool more convenient (but I forgot why :)).
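
Putting the two stages together, roughly (a sketch only: fetch_wheel and audit_wheel are placeholders, and wheel_urls would come from wherever you enumerate the wheels, e.g. PyPI's JSON API):

import concurrent.futures
import multiprocessing
import urllib.request

def fetch_wheel(url):
    # I/O bound: download one wheel and return its local path
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)
    return filename

def audit_wheel(path):
    # CPU bound: placeholder for the actual audit
    return f"{path}: ok"

if __name__ == "__main__":
    wheel_urls = []  # e.g. collected from PyPI's JSON API

    # stage 1: thread pool for the downloads (I/O bound)
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as tpool:
        paths = list(tpool.map(fetch_wheel, wheel_urls))

    # stage 2: process pool for the audits (CPU bound)
    with multiprocessing.Pool() as ppool:
        for result in ppool.map(audit_wheel, paths):
            print(result)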
