
feat: multithread linting #129

Open · wants to merge 9 commits into main

Conversation

fasttime
Member

Summary

This document proposes adding a new multithread mode for ESLint#lintFiles(). The aim of multithread linting is to speed up ESLint by allowing workload distribution across multiple CPU cores.

Related Issues

eslint/eslint#3565

@bradzacher

bradzacher commented Dec 20, 2024

I don't see any mention of the previous RFCs around parallelisation:

#42
#87

Both of these have a lot of context about the difficulties of parallelisation outside of the core rules - eg in cases where the parser or the rules store state in any form.
Two quick and prevalent examples:

  • eslint-plugin-import and friends store a cache of modules that they have resolved and parsed outside of the ESLint cycle.
  • @typescript-eslint has type information produced by the parser and consumed by rules.

Naively parallelising by just "randomly" distributing files across threads may lead to a SLOWER lint run in cases where people use such stateful plugins because the cached work may need to be redone once for each thread.

I would like to see such usecases addressed as part of this RFC given that these mentioned usecases are very prevalent - with both mentioned plugins in use by the majority of the ecosystem.

These problems have been discussed before in the context of language plugins and parser contexts (I can try to find the threads a bit later).

@fasttime
Member Author

Thanks for the input @bradzacher. How would you go about incorporating context from #42 and #87 into this RFC?

I see that #42 suggests introducing a plugin setting disallowWorkerThreads and also limiting the number of threads depending on the number of files. Both of those measures could actually be useful when the concurrency is calculated automatically. Do you think that would be helpful?

As for #87, it seems to be about an unrelated feature that doesn't even require multithreading. But I get why it would be beneficial to limit the number of instances of the same parser across threads, especially if the parser takes a long time to load its initial state, like typescript-eslint with type-aware parsing. If you have any concrete suggestions on how to do that, I'd love to know.

Naively parallelising by just "randomly" distributing files across threads may lead to a SLOWER lint run in cases where people use such stateful plugins because the cached work may need to be redone once for each thread.

I would like to see such usecases addressed as part of this RFC given that these mentioned usecases are very prevalent - with both mentioned plugins in use by the majority of the ecosystem.

I imagine the way one would address such use cases is by making no changes, i.e. not enabling multithread linting if the results are not satisfactory. But if something can be done to improve performance for popular plugins that would be awesome.

@bradzacher

To be clear - I'm all for such a system existing. Like caching it can vastly improve the experience for those that fit within the bounds.

The thing I want to make sure of is that we ensure the bounds are either intentionally designed to be tight to avoid complexity explosion, or that we are at least planning a path forward for the mentioned cases.


#87 has some discussions around parallel parsing which are relevant to the sorts of ideas we'd need to consider here.

Some other relevant discussions can be found in
eslint/eslint#16819
eslint/eslint#16818 (some concepts discussed are semi-relevant here)
eslint/eslint#16557 (comment) (and other threads in that discussion)
eslint/eslint#14139 (some more relevant context about plugin setup)

I'm pretty swamped atm cos holiday season and kids and probably won't be able to get back to this properly until the new year.

@nzakas
Member

nzakas commented Dec 31, 2024

Thanks for putting this together. I'm going to need more time to dig into the details, and I really appreciate the amount of thought and explanation you've included in this RFC. I have a few high-level thoughts from reviewing this today:

  1. I'd like to see an exploration of how other tools handle concurrency. I know other tools aren't a one-to-one comparison with ESLint, but there are plenty of tools in the ecosystem that do concurrency. For instance, Jest was very early to implement concurrency, so it would be good to include how they do it. Ava also runs tests concurrently. (Threads vs. processes, at what level, how do they determine defaults etc.)
  2. There have been a number of forks that implement parallelization of ESLint over the years. Even though some are old and eslintrc-based, I'd still like to see those included in this RFC with a summary of how each worked. There's a lot of history here, so we should be sure we're considering all past attempts. Here are a few:
  3. I'm wondering if implementing this as part of ESLint#lintFiles() is the correct abstraction? If each worker needs to create an instance of ESLint, then I wonder if perhaps a separate class that solely manages concurrency would make things a bit cleaner?

@fasttime
Member Author

fasttime commented Jan 9, 2025

  1. I'd like to see an exploration of how other tools handle concurrency. I know other tools aren't a one-to-one comparison with ESLint, but there are plenty of tools in the ecosystem that do concurrency. For instance, Jest was very early to implement concurrency, so it would be good to include how they do it. Ava also runs tests concurrently. (Threads vs. processes, at what level, how do they determine defaults etc.)

Yes, it would be interesting to look into other tools to understand how they handle concurrency. This could actually bring in some interesting ideas even if the general approach is different. I was thinking of checking Prettier but haven't managed to do that yet. Jest and Ava are also good candidates.

  2. There have been a number of forks that implement parallelization of ESLint over the years. Even though some are old and eslintrc-based, I'd still like to see those included in this RFC with a summary of how each worked. There's a lot of history here, so we should be sure we're considering all past attempts. Here are a few:

Thanks for the list. I missed most of those links while skimming through the discussion in eslint#3565. I'll be sure to go through the items and add a prior art mention.

  3. I'm wondering if implementing this as part of ESLint#lintFiles() is the correct abstraction? If each worker needs to create an instance of ESLint, then I wonder if perhaps a separate class that solely manages concurrency would make things a bit cleaner?

Workers don't need to create a new instance of ESLint each. Only the constructor options and the list of files must be known in each thread, so it makes perfect sense to keep the runtime logic in a separate module/class. I will emphasize this in the wording.

@jfmengels

One thing I'd like to point out before it's too late and just in case it's relevant: multithreading makes multifile analysis harder. If there ever comes a system where a single rule can look at the contents of multiple files—as implemented in elm-review, or as planned in Biome—then splitting the analysis across several threads means it's likely several threads will need to load all of the files in the project, increasing at least the I/O load per thread.

I imagine this kind of analysis is not really in the scope for ESLint at the moment (I haven't seen anything in the RFCs at least), as it would have a high complexity impact on the project (I've written about some tradeoffs in this post). But if this proposal were to be implemented without consideration for multifile analysis—which seems to be the case currently—then the cost of implementing it later would skyrocket, and I imagine it would never be implemented.

I'm looking forward to seeing how this evolves, as I have unfortunately not figured out multi-threading well enough to even try implementing it for elm-review, where multifile analysis has worked wonders for the quality of the linting rules.

@bradzacher

The only way to parallelise and efficiently maintain cross-file analysis is with shared memory. Unfortunately in JS as a whole this is nigh-impossible with the current state of the world. Sharing memory via SharedArrayBuffer is not efficient in JS due to the need to encode/decode things (you can't just tell JS "load this range of bytes as an object" and instead need to decode the bytes by hand -- cloning the data).

The shared structs proposal would go a long way in enabling shared memory models and is currently at stage 2 -- so there is some hope for this in the relatively near future! I know the TS team is eagerly looking forward to this proposal landing in node so they can explore parallelising TS's type analysis.

For now at least the best way to efficiently do parallelised multi-file analysis is to do some "grouping aware" task splitting. I.e. instead of assigning files to threads randomly you instead try to keep "related" files in the same thread to minimise duplication of data across threads.
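As a minimal sketch of such grouping-aware splitting (names like `groupFiles` and `groupKeyFor` are illustrative, not ESLint API; the grouping function could, for example, map each file to its nearest tsconfig):

```js
// Bucket files by a project key before assigning buckets to threads,
// so "related" files stay together and per-project state isn't
// recomputed in multiple threads.
function groupFiles(filePaths, groupKeyFor) {
    const groups = new Map();
    for (const filePath of filePaths) {
        const key = groupKeyFor(filePath);
        if (!groups.has(key)) {
            groups.set(key, []);
        }
        groups.get(key).push(filePath);
    }
    return [...groups.values()];
}
```

Each resulting group would then be assigned to a single worker thread rather than having its files scattered across the pool.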

But this is what I was alluding to in my above comments [1] [2] -- there needs to be an explicit decision encoded in this RFC:

  • Is it going to go the way of the cache system where it ONLY supports lint setups that purely use single-file analysis (thus locking out all users of eslint-plugin-import's cross-file analysis rules, typescript-eslint's type-aware rules, etc), or
  • Is it going to try to build in primitives to help support cross-file analysis setups (thus enabling said users to leverage parallelism).

The former is "the easy route" for obvious reasons -- there's a lot to think about and work through for the latter.

As a quick-and-dirty example that we have discussed before (see eslint/eslint#16819):
If ESLint included a mechanism to allow the user to specify a file grouping strategy for parallelism then typescript-eslint could build a strategy for users to configure that makes ESLint group files by tsconfig. This would then keep each project's type information in one thread and thus minimise duplicated work in calculating type information.
I have actually prototyped such a strategy in the form of a wrapper around ESLint (typescript-eslint/typescript-eslint#4359) and it showed some decent wins for certain setups.


Just to reiterate my earlier comments -- I'm 100% on board with going with the former decision and ignoring the cross-file problem. I just want to ensure that this case has been fully considered and intentionally discarded, or that the design has a consideration to eventually grow the parallelism system to support such usecases.

@nzakas
Member

nzakas commented Jan 17, 2025

I think what we're going for here is effectively a "stop the bleeding" situation where we can get ESLint's current behavior to go faster, as this becomes an even bigger problem as people start to use ESLint to lint their JSON, CSS, and Markdown files, significantly increasing the number of files an average user will lint.

I'm on board with discounting the cross-file problem at this point, as I think it's clear that many companies have created their own concurrent linting solutions built on top of ESLint that also discount this issue.

I would like to revisit cross-file linting at some point in the future, but before we can do that, we really need to get the core rewrite in progress.

**[Backstage](https://backstage.io/)**

The Backstage CLI has an option to run ESLint (currently ESLint v8) in multithread mode along with other tools.
Each file is linted in the next available thread.
Member

Does this mean each thread is passed a single file to lint, passes back the results, and then gets another file to lint?

Member Author

I've clarified the description.


**[Trunk Code Quality](https://trunk.io/code-quality)**

Trunk manages to parallelize ESLint and other linters by splitting the workload over multiple processes.
Member

Does that mean it's splitting up the file list and spreading across multiple processes? Or just one process for ESLint and other processes for other tools?

Member Author

The file list is split into chunks of a predefined size, and I can see that multiple threads are also being spawned in a process, but I'm not sure what each thread is doing. I will look deeper into the details.

**[eslint-p](https://www.npmjs.com/package/eslint-p)**

A CLI-only wrapper around ESLint v9 that adds multithread linting support, authored by myself.
After starting a worker thread pool, each file is linted in the next available thread.
Member

Similar question: threads are just passed one file at a time?

This is of particular interest to me as an approach vs. passing multiple files to each thread and limiting the back-and-forth communication.

Member Author

Each thread receives the complete list of files to be linted in the beginning. A file index counter is used to make sure that each file is processed only once. In practice, the counter consists of a SharedArrayBuffer whose single value is accessed using Atomics.add:

```js
// main thread
const fileIndexCounter = new Int32Array(new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT));
```

```js
// worker thread
const fileIndex = Atomics.add(fileIndexCounter, 0, 1);
```

This is specifically to avoid cross-thread communication.
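Put together, each worker's claim loop might look like the following sketch (`claimNextFile` and the inline file list are illustrative; in a real worker both the counter and the file list would arrive via `workerData`):

```js
// Shared, lock-free work counter: Atomics.add returns the previous
// value, so every file index is claimed by exactly one thread even
// when several threads call this concurrently.
const fileIndexCounter = new Int32Array(
    new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT)
);
const filePaths = ["a.js", "b.js", "c.js"];

function claimNextFile() {
    const fileIndex = Atomics.add(fileIndexCounter, 0, 1);
    // null signals "no work left"; the worker would then finish up.
    return fileIndex < filePaths.length ? filePaths[fileIndex] : null;
}
```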

@fasttime
Member Author

One thing I'd like to point out before it's too late and just in case it's relevant: multithreading makes multifile analysis harder. If there ever comes a system where a single rule can look at the contents of multiple files—as implemented in elm-review, or as planned in Biome—then splitting the analysis across several threads means it's likely several threads will need to load all of the files in the project, increasing at least the I/O load per thread.

Thanks for the feedback @jfmengels. In fact, multifile analysis or project-aware analysis is not a concept we have implemented in the ESLint core at this time, which is why it's not covered in this RFC.

I imagine this kind of analysis is not really in the scope for ESLint at the moment (I haven't seen anything in the RFCs at least), as it would have a high complexity impact on the project (I've written about some tradeoffs in this post). But if this proposal were to be implemented without consideration for multifile analysis—which seems to be the case currently—then the cost of implementing it later would skyrocket, and I imagine it would never be implemented.

Do you think it would be too difficult to have multifile analysis and multithread linting at the same time? Or are you suggesting that implementing multifile analysis before multithread linting would be easier than the other way around? If you could clarify your concern we could add that in the drawbacks section for further consideration.

Member

@nzakas nzakas left a comment

This is looking really good and I love the level of detail. Just left a few questions throughout.

This module will be uniquely identified by a (serializable) URL, and only this URL will have to be passed to worker threads.
The module can be either a file or a memory module (**note**: Node.js can only import modules with the schemes `file`, `data`, and `node`[^2]; other runtimes like Deno also support `blob` and `http`/`https`).

In order to let consumers load those options and create an `ESLint` instance, a new static method will be added to the `ESLint` class.
Member

Just to double-check: this solution is meant primarily for API consumers, correct? If so, can you please make that explicit?


When `auto` concurrency is selected, ESLint will use a heuristic to determine the best concurrency setting, which could be any number of threads or `"off"`.
How this heuristic will work is an open question.
An approach I have tested is using half the number of available CPU cores, which is a reasonable starting point for modern machines with many (4 or more) cores and fast I/O peripherals to access the file system.
Member

I think we can just start by using this heuristic with the docs stating that you may get better performance by manually setting the concurrency level.


```js
if (!options[disableSerializabilityCheck]) {
    const unserializableKeys = Object.keys(options).filter(key => !isSerializable(options[key]));
```
Member

Potentially faster to just run JSON.stringify() and catch any errors (faster than structuredClone(), which is creating objects as it goes). If there are errors, only then do we check which keys are causing the problem. I'm just not sure if we want to pay this cost 100% of the time.
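A sketch of that fast-path idea (`findUnserializableKeys` is an illustrative helper, not proposed API):

```js
// Fast path: one JSON.stringify over the whole options object; only
// when it throws (e.g. on circular references or BigInt) do we pay
// for a per-key scan to name the offending keys in an error message.
function findUnserializableKeys(options) {
    try {
        JSON.stringify(options);
        return [];
    } catch {
        return Object.keys(options).filter(key => {
            try {
                JSON.stringify(options[key]);
                return false;
            } catch {
                return true;
            }
        });
    }
}
```

One caveat: `JSON.stringify` silently drops functions rather than throwing, so this approximates a structured-clone check instead of matching it exactly.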

Each worker thread repeatedly reads and lints a single file until all files have been linted or an abort signal is triggered. Errors in worker threads are not caught: they will be intercepted as error events by the main thread where they will trigger an abort signal that causes all other threads to exit.

The main thread itself does not lint any files: it waits until all files are linted or an error occurs.
When a worker tread terminates successfully it submits a list of `LintReport`s to the main thread. Each result is enriched with the index of the associated file.
Member

Suggested change
When a worker tread terminates successfully it submits a list of `LintReport`s to the main thread. Each result is enriched with the index of the associated file.
When a worker thread terminates successfully it submits a list of `LintReport`s to the main thread. Each result is enriched with the index of the associated file.

Comment on lines +288 to +297
```js
const abortController = new AbortController();
const fileIndexCounter = new Int32Array(new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT));
const workerPath = require.resolve("./worker.js");
const workerOptions = {
    workerData: {
        filePaths,
        fileIndexCounter,
        eslintOptions
    },
};
```
Member

I'm a bit concerned about passing all file paths to every thread, as this is a lot of duplicated memory. Thinking about a situation where there are 10,000 files to be linted (which we have received reports of), that means we'd have 10,000 × thread_count file paths stored in memory.

I wonder about an alternative approach where each thread is seeded with maybe 5-10 file paths (or maybe just one?) that it's responsible for linting. When they are all linted, it sends a message asking for more. I know this creates more chatter, but I'm wondering if it might end up being more memory-efficient in the long-run?

Any insights into how other tools handle this?
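The batched alternative could be sketched as a generator on the main thread (the batch size and names are illustrative):

```js
// The main thread keeps the single full list and hands out small
// slices on demand, so workers never hold more than one batch of
// paths at a time.
function* makeBatches(filePaths, batchSize = 8) {
    for (let i = 0; i < filePaths.length; i += batchSize) {
        yield filePaths.slice(i, i + batchSize);
    }
}

// On a "need-work" message from a worker, the main thread would reply
// with batches.next().value, or null once the generator is exhausted.
```

This trades the shared counter's zero-message design for more main-thread chatter, but bounds per-worker memory to one batch.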


The main task of a thread is reading and linting files.

When treads are created, they all receive a copy of the same list of files to lint.
Member

Suggested change
When treads are created, they all receive a copy of the same list of files to lint.
When threads are created, they all receive a copy of the same list of files to lint.

Comment on lines +462 to +467
Another possible solution is retrieving rules `meta` objects in each worker thread and returning this information to the main thread.
When `getRulesMetaForResults()` is called in the main thread, rules `meta` objects from all threads will be deduped and merged and the results will be returned synchronously.

This solution removes the need to load config files in the main thread but it still requires worker threads to do potentially useless work by adding an extra processing step unconditionally.
Another problem is that rules `meta` objects for custom rules aren't always serializable.
In order for `meta` objects to be passed to the main thread, unserializable properties will need to be stripped off, which is probably undesirable.
Member

I like this approach. Do you have an example of when a rule uses unserializable values in meta?

Overall, I think it's safe for us to assume that meta is always serializable and deal with any outliers through documentation.
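The dedupe-and-merge step described in the excerpt might look like this sketch (`mergeRulesMeta` is an illustrative name; per-thread results are assumed to arrive as maps of rule ID to `meta`):

```js
// Merge per-thread maps of ruleId -> meta, keeping the first copy of
// each rule's meta and discarding duplicates reported by other
// threads that loaded the same rules.
function mergeRulesMeta(perThreadMetas) {
    const merged = new Map();
    for (const metaMap of perThreadMetas) {
        for (const [ruleId, meta] of metaMap) {
            if (!merged.has(ruleId)) {
                merged.set(ruleId, meta);
            }
        }
    }
    return merged;
}
```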


Errors created in worker threads cannot be cloned to the main thread without changes, because they can contain unserializable properties.
Instead, Node.js creates a serializable copy of the error, stripped of unserializable properties, and reports it to the main thread as a parameter of the [`error`](https://nodejs.org/docs/latest-v18.x/api/worker_threads.html#event-error) event.
During this process `message` and `stack` are preserved because they are strings.
Member

Does this preserve all serializable properties? My main concern is the messageTemplate property that we add to produce readable error messages: https://github.com/eslint/eslint/blob/8bcd820f37f2361e4f7261a9876f52d21bd9de8f/bin/eslint.js#L80

@jfmengels

Do you think it would be too difficult to have multifile analysis and multithread linting at the same time? Or are you suggesting that implementing multifile analysis before multithread linting would be easier than the other way around?

I don't know that it is too hard, but it's mostly that having each thread analyze a portion of the project won't work, at least with the multifile analysis approach I've chosen for elm-review and explained in this blog post, because rules need to (potentially) analyze every file.

For my use-case, it's more likely that analysis can be split by rule, instead of by file, i.e. thread 1 will run rules 1 and 2, thread 2 will run rules 3 and 4, etc. But this means that, memory-wise, the project's contents (and derived data) need to be stored on every thread (multiplying the memory), which doesn't sound great for performance. Maybe this would get improved with the shared structs proposal.

Right now, both multifile analysis and multithreading are do-able in isolation, but doing both requires a lot more thinking and maybe additional tools. Therefore doing one may exclude the other. But if you ever figure it out I'll definitely be interested to hear about it!
