Parallelize update script #115
base: gh-pages
Conversation
Hi, thank you for your contribution. I only have some comments :)
```js
const readFile = util.promisify(fs.readFile)
const readdir = util.promisify(fs.readdir)
```
Why not use `fs.promises` instead? E.g.:

```js
const { readdir, readFile } = require('node:fs/promises')
// ...
const files = await readdir(path)
```
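(Worth noting: the promise-based `fs` API ships with Node itself, so the `util.promisify` wrappers become unnecessary; the `node:`-prefixed specifier assumes a reasonably recent Node version.)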
```js
)
const pool = workerpool.pool('./update-worker.js', {
  maxWorkers: options.maxWorkerProcesses,
  // Threads would be ideal, but do not work as expected due to an issue with the swarmhash module
```
What issue? Could you provide more information about the problem with the `thread` workerType?
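For reference, the threaded variant would presumably just swap the worker type. A minimal sketch, assuming workerpool's `workerType` option and the same entry point:

```js
const workerpool = require('workerpool')

// Hypothetical: same pool, but backed by worker_threads instead of child processes.
const pool = workerpool.pool('./update-worker.js', {
  maxWorkers: 4, // or options.maxWorkerProcesses, as in the PR
  workerType: 'thread'
})
```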
```js
  pool.terminate()
}

main();
```
Remove the extra semicolon. Also, please remember to run `npm run lint -- --fix`.
```diff
- main();
+ main()
```
```js
const latestRelease = parsedList
  .slice()
  .reverse()
  .filter(function (listEntry) { return listEntry.prerelease === null })
```
It seems that you changed the behavior of the `processDir` function. Some entries that do not have a prerelease now contain the `prerelease` field set to `null` in the final `list.json` instead of not having the field at all. The `prerelease` field must not be set in the `parsedList` when there is no prerelease.
For example, this one:
```json
{
  "path": "soljson-v0.1.1+commit.6ff4cd6.js",
  "version": "0.1.1",
  "prerelease": null,
  "build": "commit.6ff4cd6",
  "longVersion": "0.1.1+commit.6ff4cd6",
  "keccak256": "0xd8b8c64f4e9de41e6604e6ac30274eff5b80f831f8534f0ad85ec0aff466bb25",
  "sha256": "0xfc54135edfe07d835216f87fa6c3b9049e1415c2825027b926daad018e09269f",
  "urls": [
    "bzzr://8f3c028825a1b72645f46920b67dca9432a87fc37a8940a2b2ce1dd6ddc2e29b",
    "dweb:/ipfs/QmPPGxsMtQSEUt9hn9VAtrtBTa1u9S5KF1myw78epNNFkx"
  ]
},
```
should be without the `prerelease` field:
```json
{
  "path": "soljson-v0.1.1+commit.6ff4cd6.js",
  "version": "0.1.1",
  "build": "commit.6ff4cd6",
  "longVersion": "0.1.1+commit.6ff4cd6",
  "keccak256": "0xd8b8c64f4e9de41e6604e6ac30274eff5b80f831f8534f0ad85ec0aff466bb25",
  "sha256": "0xfc54135edfe07d835216f87fa6c3b9049e1415c2825027b926daad018e09269f",
  "urls": [
    "bzzr://8f3c028825a1b72645f46920b67dca9432a87fc37a8940a2b2ce1dd6ddc2e29b",
    "dweb:/ipfs/QmPPGxsMtQSEUt9hn9VAtrtBTa1u9S5KF1myw78epNNFkx"
  ]
},
```
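One way to keep the old shape would be to attach the field only when a prerelease exists and to filter on its absence. A rough sketch (the `parsed` object below is illustrative, not the script's actual code):

```js
// Illustrative sketch: include `prerelease` only when one is present.
const listEntry = {
  path: parsed.path,
  version: parsed.version,
  ...(parsed.prerelease ? { prerelease: parsed.prerelease } : {}),
  build: parsed.build,
  longVersion: parsed.longVersion
}

// ...and treat "no prerelease" as the field being absent rather than null:
const latestRelease = parsedList
  .slice()
  .reverse()
  .filter(function (entry) { return entry.prerelease === undefined })
```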
```js
const batchPromises = []

for (let i = 0; i < values.length; i += batchSize) {
  batchPromises.push(pool.exec('workerMain', [dir, values.slice(i, i + batchSize), oldList]))
```
Did you run any profiling to see how your changes improve the performance of the current code? The execution time of your implementation does not seem to show any significant improvement on my machine:

```
Current implementation: 702.21s user 16.80s system 106% cpu 11:15.54 total
PR implementation:      670.77s user 17.92s system 104% cpu 11:01.69 total
```

Maybe we could also move some of the I/O operations to the workers instead of only moving the hash operations? For example, we could have a worker per binary directory and then multiple child workers per directory (each one reading/processing a subset of the files in that directory). The main worker for each directory would then aggregate the child workers' results and write the respective `list.[json,js,txt]` files, or something like that. What do you think?
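A hypothetical sketch of that layout, reusing workerpool's worker API (`processDir`, the flat-directory assumption, and the sha256-only hashing are illustrative, not the script's actual code):

```js
// update-worker.js (hypothetical sketch): the worker performs the reads as
// well as the hashing, so the main process never touches the file contents.
const workerpool = require('workerpool')
const { readdir, readFile } = require('node:fs/promises')
const { createHash } = require('node:crypto')
const path = require('node:path')

async function processDir (dir) {
  const files = await readdir(dir) // assumes a flat binary directory
  const results = []
  for (const file of files) {
    const content = await readFile(path.join(dir, file))
    // Hash inside the worker; keccak256/swarmhash would slot in the same way.
    results.push({
      file,
      sha256: '0x' + createHash('sha256').update(content).digest('hex')
    })
  }
  return results
}

workerpool.worker({ processDir })
```

The main process would then run `pool.exec('processDir', [dir])` once per directory and write the list files from the aggregated results.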
I ran some quick profiling here, which seems to hint that your code spends most of its time waiting for file descriptors to become ready. Not sure if this is actually the case, though:
PR code:

```
 [C++]:
   ticks  total  nonlib   name
    1987  71.3%   93.9%   epoll_pwait@@GLIBC_2.6
      21   0.8%    1.0%   __write@@GLIBC_2.2.5
      14   0.5%    0.7%   fwrite@@GLIBC_2.2.5
      12   0.4%    0.6%   recvmsg@@GLIBC_2.2.5
      11   0.4%    0.5%   std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)
       9   0.3%    0.4%   std::ostream::sentry::sentry(std::ostream&)
       8   0.3%    0.4%   __libc_malloc@@GLIBC_2.2.5

 [Summary]:
   ticks  total  nonlib   name
      37   1.3%    1.7%   JavaScript
    2078  74.5%   98.3%   C++
      72   2.6%    3.4%   GC
     673  24.1%           Shared libraries
```
Current code:

```
 [C++]:
   ticks  total  nonlib   name
   18358   2.9%    4.9%   epoll_pwait@@GLIBC_2.6
     901   0.1%    0.2%   __dynamic_cast
     793   0.1%    0.2%   __cxxabiv1::__vmi_class_type_info::__do_dyncast(long, __cxxabiv1::__class_type_info::__sub_kind, __cxxabiv1::__class_type_info const*, void const*, __cxxabiv1::__class_type_info const*, void const*, __cxxabiv1::__class_type_info::__dyncast_result&) const
     711   0.1%    0.2%   __libc_malloc@@GLIBC_2.2.5

 [Summary]:
   ticks  total  nonlib   name
    1923   0.3%    0.5%   JavaScript
   24553   3.9%    6.5%   C++
   12993   2.1%    3.5%   GC
  256700  40.5%           Shared libraries
  349903  55.3%           Unaccounted
```
I think it would be beneficial to have some benchmark showing the performance improvement of your implementation.
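Even something minimal around the top-level pass would help. A hypothetical sketch (`processAllDirs` stands in for whatever the script's actual entry point is):

```js
// Hypothetical timing harness around the update pass.
async function timedRun () {
  const start = process.hrtime.bigint()
  await processAllDirs() // placeholder for the script's top-level entry point
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6
  console.log(`update pass took ${elapsedMs.toFixed(0)} ms`)
}
```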
This addresses ethereum/solidity#12421

It does so by adding the `workerpool` node module, moving a bunch of code into a separate worker file, and running that code in the worker pool. It also adds a `--max-worker-processes` flag to configure the number of workers.

Edit: Missed a `pool.terminate()` call initially. Had to do a bit of refactoring in order to be able to throw that in: as it was, there was no way to know when all of the async branches were done. This last commit makes the diff quite a bit less readable, but that was rather unavoidable. The script is a bit dated and uses a combination of callbacks and async/await.