
Update YACHT to v1.4.0 #129

Status: Open. Wants to merge 36 commits into main.

Conversation

chunyuma (Member)

Integrate Mahmudur's scripts to calculate similarities within yacht run.


codecov bot commented Dec 10, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.

Project coverage is 84.28%. Comparing base (44b2e40) to head (9784812).

Files with missing lines | Patch % | Lines
src/yacht/hypothesis_recovery_src.py | 86.66% | 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #129   +/-   ##
=======================================
  Coverage   84.27%   84.28%           
=======================================
  Files          11       11           
  Lines        1094     1101    +7     
=======================================
+ Hits          922      928    +6     
- Misses        172      173    +1     


@dkoslicki (Member) left a comment

Thanks both for doing this! One in-line comment, and the following higher level comments:

  1. I assume this has been thoroughly tested to check consistency with the previous approach (save for order that ties are being broken, perhaps). Is that correct?
  2. It'd be nice to include in the PR text a quick benchmark/summary of the performance improvement (e.g. maybe also in a changelog)
  3. I noticed that the test_run_yacht_pretrained_ref_db is marked as a slow test. Does the CI/CD ignore these? If so, we might want to revert or pick a smaller reference database (or change the ANI to 0.95) to make it run faster so this important integration test isn't skipped. Also, I would have assumed this is actually a much faster test now after the training C++ conversion previously

num_threads: int,
path_to_genome_temp_dir: str,
path_to_sample_temp_dir: str,
num_genome_threshold: int = 1000000
Member

This argument doesn't seem to be documented. Is the intent of this just to calculate the number of passes, @mahmudhera? Or does it set some other threshold with respect to the number of organisms detected, or the like?

Member Author

@dkoslicki, this parameter is not expected to be modified by users; I could probably just fix it at 1000000. It does not affect the YACHT results; it only controls how many genomes are processed in a block within Mahmudur's algorithm. So, don't worry about it.
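Purely as an illustration of what such a block-size parameter typically does (the function and variable names below are hypothetical, not taken from the YACHT codebase), a parameter like `num_genome_threshold` would partition the reference genomes into fixed-size blocks so each pass bounds peak memory:

```python
from math import ceil

def genome_blocks(genomes, num_genome_threshold=1_000_000):
    """Yield successive blocks of genomes; each pass processes at most
    num_genome_threshold genomes, bounding peak memory per block.
    Illustrative sketch only, not the actual YACHT/C++ logic."""
    num_passes = ceil(len(genomes) / num_genome_threshold)
    for i in range(num_passes):
        yield genomes[i * num_genome_threshold:(i + 1) * num_genome_threshold]

# With 25 genomes and a block size of 10, three passes are needed.
blocks = list(genome_blocks(list(range(25)), num_genome_threshold=10))
# → block sizes [10, 10, 5]
```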

Member

Yes, understood. But two years from now, when we want to make further changes, it should probably be documented in the code with a comment about what it's doing.

Member Author

I have added documentation for these new parameters. Please check the new commit.

Member

I don't quite remember doing this, I guess @chunyuma has already addressed this

@dkoslicki (Member)

Oh, and another question. Does the code coverage take into account the C++ code? Or is the C++ code covered by upstream python tests?

@chunyuma (Member, Author)

chunyuma commented Dec 10, 2024

Hi @dkoslicki,

> I assume this has been thoroughly tested to check consistency with the previous approach (save for order that ties are being broken, perhaps). Is that correct?

This update is not like the previous one. The previous update integrated Mahmudur's C++ script into yacht train, which could affect the trained genomes because of the order in which genomes with tied sizes are removed. This update integrates a similar similarity-calculation script to replace sourmash multisearch for identifying all organisms in the reference database that share at least one hash overlap with the sample, so the risk of changing the results is small. Although I didn't test it "thoroughly", I checked the results using the demo data and the GTDB k31 0.95 database and confirmed that they are consistent.
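As a toy illustration of that prefiltering step (hypothetical Python, not the actual YACHT or C++ code): with sketches represented as hash sets, finding every reference organism that shares at least one hash with the sample is a simple disjointness check:

```python
def organisms_with_overlap(sample_hashes, ref_sketches):
    """Return names of reference organisms sharing at least one hash with
    the sample. Illustrative stand-in for the multisearch-style step."""
    sample = set(sample_hashes)
    return [name for name, hashes in ref_sketches.items()
            if not sample.isdisjoint(hashes)]

# Toy data: three reference organisms and a two-hash sample.
refs = {"orgA": {1, 2, 3}, "orgB": {9, 10}, "orgC": {3, 7}}
hits = organisms_with_overlap([3, 5], refs)
# → ["orgA", "orgC"]
```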

> It'd be nice to include in the PR text a quick benchmark/summary of the performance improvement (eg. maybe also in a change log)

Do we need to do this before merging into main? What kinds of metrics would you like to include in the summary?

> I noticed that the test_run_yacht_pretrained_ref_db is marked as a slow test. Does the CI/CD ignore these? If so, we might want to revert or pick a smaller reference database (or change the ANI to 0.95) to make it run faster so this important integration test isn't skipped. Also, I would have assumed this is actually a much faster test now after the training C++ conversion previously

I have run all CI/CD tests locally and they all pass. However, for some reason, they fail on the CI/CD server. I think it is due to some CI/CD machine setting (memory or something else). Do you think it is necessary to figure out the reason? If so, could you please help me take a look? I spent the entire morning trying to figure out why but couldn't resolve it.

Also, I don't think it has become "much" faster, since it still takes 5-10 minutes.

@chunyuma (Member, Author)

> Does the code coverage take into account the C++ code? Or is the C++ code covered by upstream python tests?

The C++ code mainly replaces sourmash multisearch within the existing Python functions, so it is covered by the upstream Python tests. If you compare the coverage percentage between main and chunyu_v1.4.0, they are both 84%, so there is no change.

@dkoslicki (Member)

> Hi @dkoslicki,

>> I assume this has been thoroughly tested to check consistency with the previous approach (save for order that ties are being broken, perhaps). Is that correct?

> This update is not like the previous one. The previous update integrated Mahmudur's C++ script into yacht train, which could affect the trained genomes because of the order in which genomes with tied sizes are removed. This update integrates a similar similarity-calculation script to replace sourmash multisearch for identifying all organisms in the reference database that share at least one hash overlap with the sample, so the risk of changing the results is small. Although I didn't test it "thoroughly", I checked the results using the demo data and the GTDB k31 0.95 database and confirmed that they are consistent.

I see, thanks. @mahmudhera did you perform tests to ensure consistency with the sourmash multisearch results? I recall you mentioning that you checked it in this repo: https://github.com/KoslickiLab/sourmash_alternate_implementations but don't know if you did for these changes.

>> It'd be nice to include in the PR text a quick benchmark/summary of the performance improvement (eg. maybe also in a change log)

> Do we need to do this before merging into main? What kinds of metrics would you like to include in the summary?

No, not needed before merging into main. It would just be nice to include in some changelog somewhere.

>> I noticed that the test_run_yacht_pretrained_ref_db is marked as a slow test. Does the CI/CD ignore these? If so, we might want to revert or pick a smaller reference database (or change the ANI to 0.95) to make it run faster so this important integration test isn't skipped. Also, I would have assumed this is actually a much faster test now after the training C++ conversion previously

> I have run all CI/CD tests locally and they all pass. However, for some reason, they fail on the CI/CD server. I think it is due to some CI/CD machine setting (memory or something else). Do you think it is necessary to figure out the reason? If so, could you please help me take a look? I spent the entire morning trying to figure out why but couldn't resolve it.

> Also, I don't think it has become "much" faster, since it still takes 5-10 minutes.

That is quite frustrating (and, as you know all too well from Translator, frequently encountered).
@mlupei Can you check what's going on with the CI/CD issue? No need to have it block this PR due to locally passing tests, but moving forward, we want to keep the CI/CD tests in sync (and for it to run the slow tests, even if they are skipped locally).

@chunyuma (Member, Author)

chunyuma commented Dec 10, 2024

Hi @dkoslicki,

After a simple test using the demo sample.sig.zip file (which contains only one genome) and the gtdb-rs214-reps.k31_0.95_pretrained reference database, the performance of yacht run is actually even worse. Please see the comparison below:

New version 1.4.0 (integrating the C++ script into yacht run): 6:23.76 minutes, 23.45 GB

Old version 1.3.0: 0:54.35 minutes, 2.946216 GB

It seems the C++ script needs to spend time building indexes, while sourmash multisearch does not have this step and is therefore faster. @mahmudhera, do you know if this is normal? When the number of genomes is small (e.g., one genome in the sample), is sourmash multisearch actually faster than our C++ algorithm?

@dkoslicki (Member)

Ooo, yeah, that's a pretty significant slowdown. Let's wait to hear from @mahmudhera about what's going on. And along those lines, @mahmudhera and @chunyuma, let's run the timing on a few different data sets (small, and realistically large) so we get a good sense of the speed impact. Slower code would defeat the purpose of moving to the cpp implementation...

@chunyuma (Member, Author)

chunyuma commented Dec 10, 2024

Hi Professor @dkoslicki again,

As you might notice in the comparison above, the new algorithm requires much more memory. This explains why test_run_yacht_pretrained_ref_db failed on the CI/CD server but worked locally: as I previously suspected, it is indeed a memory issue, and the CI/CD server has a memory limit. In my latest commit, I switched to a smaller database (ANI 0.8), which resolved the previously failing test.

@dkoslicki (Member)

I think we'll need to hold off on this PR until @mahmudhera and I can discuss the RAM and time impact of this move away from sourmash multisearch

@mahmudhera (Member)

mahmudhera commented Dec 10, 2024

Hi guys, looks like I missed a lot here. Let me answer by organizing things into a few high-level questions.

When is it beneficial to use the alternate implementation (implemented in cpp)?

@chunyuma and @dkoslicki : the alternate implementation of sourmash multisearch works by building an index. Indeed, building the index is slower than starting the comparisons straight off. Therefore, when the number of comparisons is very large (such as an all-vs-all comparison in the yacht train phase), building the index makes sense. If the number of comparisons is relatively small (such as a one-vs-all comparison in the yacht run phase), then perhaps not building the index makes sense.

From previous experience, we know that for an all-vs-all comparison, this alternate implementation is greatly beneficial. Is it indeed slower to use this alternate implementation for a one-vs-all? That has not been tested yet: I need to test that separately and get back to you guys.
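The tradeoff described above can be sketched with a toy inverted index (illustrative Python, not the actual C++ implementation): building the index costs a full pass over every reference sketch, so the up-front cost only amortizes when many queries follow, as in all-vs-all.

```python
from collections import defaultdict

def build_inverted_index(ref_sketches):
    """Up-front cost: one pass over every hash of every reference sketch."""
    index = defaultdict(set)
    for name, hashes in ref_sketches.items():
        for h in hashes:
            index[h].add(name)
    return index

def query(index, sample_hashes):
    """Cheap per query: touches only the sample's hashes. Worth the build
    for all-vs-all (many queries), but for a single one-vs-all query the
    build may cost more than comparing sketches directly."""
    hits = set()
    for h in sample_hashes:
        hits |= index.get(h, set())
    return hits

refs = {"orgA": {1, 2, 3}, "orgB": {9, 10}, "orgC": {3, 7}}
idx = build_inverted_index(refs)
# query(idx, [3, 5]) → {"orgA", "orgC"}
```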

Can we make the index building faster?

Yes, I have implemented a parallel implementation of building the index (by building something I am calling a Segregated Multi-Index), which is now here: https://github.com/KoslickiLab/sourmash_alternate_implementations

I can benchmark it separately to see if using this parallelized index building makes it fast enough so that it's faster than using sourmash multisearch for a one-vs-all comparison.

Has it been tested that this version of yacht produces the same output?

I did not do this test. What I have tested is the following: does the cpp implementation of sourmash multisearch produce the same output in terms of pairwise similarity measures? My tests show that yes, the cpp implementation is equivalent to sourmash multisearch. In a previous release, this implementation was incorporated by @chunyuma and used in yacht train. Both @chunyuma and I verified that yacht train using my cpp code and yacht train using sourmash multisearch produce (almost) the same result. I think in this iteration @chunyuma is doing the same thing, but for yacht run. For yacht run, I don't think this equivalency test has been done yet, at least not that I am aware of.

What I can do is add the tests I do have to the yacht repo. I can also help test whether this version of yacht run produces the same results as the previous one.

@chunyuma (Member, Author)

OK, I have removed version 1.4.0 from the GitHub page, which will hold off the update on bioconda.

@mahmudhera (Member)

> I think we'll need to hold off on this PR until @mahmudhera and I can discuss the RAM and time impact of this move away from sourmash multisearch

I think that'd be the most appropriate way to go.

@chunyuma (Member, Author)

Thank you for your thoughts @mahmudhera.

> @chunyuma and @dkoslicki : the alternate implementation of sourmash multisearch works by building an index. Indeed, building the index is slower than starting the comparisons straight off. Therefore, when the number of comparisons is very large (such as an all-vs-all comparison in the yacht train phase), building the index makes sense. If the number of comparisons is relatively small (such as a one-vs-all comparison in the yacht run phase), then perhaps not building the index makes sense.
>
> From previous experience, we know that for an all-vs-all comparison, this alternate implementation is greatly beneficial. Is it indeed slower to use this alternate implementation for a one-vs-all? That has not been tested yet: I need to test that separately and get back to you guys.

I think that makes sense. Given that the alternative implementation may perform differently at different data sizes, instead of abandoning sourmash multisearch entirely, we could probably keep both and provide a switch to choose the method for different datasets. @dkoslicki, do you think it is worthwhile?

@mahmudhera (Member)

mahmudhera commented Dec 10, 2024

@dkoslicki and @chunyuma: I suspect that there could be some potential for improvement if we do the following:

(a) from the reference sketches, build the index, and store the index in file(s)
(b) instead of providing the reference sketches as input, use the pre-built index file as input to yacht run

If the index is pre-built, both all-vs-all and many-vs-many comparisons should be much faster. We can provide pre-built index files for various ANI thresholds for users to download and run. Additionally, we can provide instructions for building a new index for the user's desired reference genomes.

Of course, this is just a thought. There are many "file-based" optimizations possible if we go in this direction; these optimizations usually lead to very low memory usage (but I am not promising anything 😜)
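A minimal sketch of ideas (a) and (b), with hypothetical names and a pickle-based format chosen only for illustration: serialize the index once offline, then have yacht run load the file instead of rebuilding the index on every run.

```python
import pickle

def save_index(index, path):
    """Persist a pre-built hash -> genome-names index to disk so a later
    run can skip the expensive build step. Illustrative sketch only."""
    with open(path, "wb") as f:
        pickle.dump(index, f)

def load_index(path):
    """Reload a previously saved index; no rebuild pass is needed."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical usage: build once offline, then reuse across runs.
index = {101: {"orgA"}, 202: {"orgA", "orgB"}}
save_index(index, "refs.idx")
restored = load_index("refs.idx")
# → restored == index
```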

@chunyuma (Member, Author)

Hi @mahmudhera,

Thanks for these potential improvement ideas. Let's wait for @dkoslicki's feedback, as I'm uncertain whether we should try them given the submission timeline and the related projects.

@mahmudhera (Member)

> Hi @mahmudhera,
>
> Thanks for these potential improvement ideas. Let's wait for @dkoslicki's feedback, as I'm uncertain whether we should try them given the submission timeline and the related projects.

Definitely. Implementing these will take time for me too; let's not blunder anything by rushing.

@dkoslicki (Member)

>> Hi @mahmudhera,
>> Thanks for these potential improvement ideas. Let's wait for @dkoslicki's feedback, as I'm uncertain whether we should try them given the submission timeline and the related projects.
>
> Definitely. Implementing these will take time for me too; let's not blunder anything by rushing.

Agreed: let's not rush this. Mahmudur and I are meeting tomorrow and will map out a plan forward. Note: this part (replacing branchwater with the C++ code) is not required for the JOSS submission, so there's no big time deadline on that front.
