
Update YACHT to v1.4.0 #129

Status: Open. Wants to merge 36 commits into main.

Conversation

chunyuma (Member)

Integrate Mahmudur's scripts to calculate similarities within yacht run.


codecov bot commented Dec 10, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.

Project coverage is 84.28%. Comparing base (44b2e40) to head (9784812).

Files with missing lines | Patch % | Lines
src/yacht/hypothesis_recovery_src.py | 86.66% | 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #129   +/-   ##
=======================================
  Coverage   84.27%   84.28%           
=======================================
  Files          11       11           
  Lines        1094     1101    +7     
=======================================
+ Hits          922      928    +6     
- Misses        172      173    +1     


@dkoslicki (Member) left a comment

Thanks both for doing this! One in-line comment, and the following higher level comments:

  1. I assume this has been thoroughly tested to check consistency with the previous approach (save for order that ties are being broken, perhaps). Is that correct?
  2. It'd be nice to include in the PR text a quick benchmark/summary of the performance improvement (e.g. maybe also in a changelog)
  3. I noticed that the test_run_yacht_pretrained_ref_db is marked as a slow test. Does the CI/CD ignore these? If so, we might want to revert or pick a smaller reference database (or change the ANI to 0.95) to make it run faster so this important integration test isn't skipped. Also, I would have assumed this is actually a much faster test now after the training C++ conversion previously

num_threads: int,
path_to_genome_temp_dir: str,
path_to_sample_temp_dir: str,
num_genome_threshold: int = 1000000
Member

This argument doesn't seem to be documented. Is the intent of this just to calculate the number of passes, @mahmudhera? Or does it set some other threshold with respect to the number of organisms detected, or the like?

Member Author

@dkoslicki, this parameter is not expected to be modified by users; I could probably just fix it at 1000000. It does not affect the YACHT results; it only controls how many genomes are processed in a block within Mahmudur's algorithm. So, don't worry about it.
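Purely as an illustration of what such a block-size parameter typically does (the function and variable names below are hypothetical, not taken from the YACHT codebase), a parameter like `num_genome_threshold` would partition the reference genomes into fixed-size blocks so each pass bounds peak memory:

```python
from math import ceil

def genome_blocks(genomes, num_genome_threshold=1_000_000):
    """Yield successive blocks of genomes; each pass processes at most
    num_genome_threshold genomes, bounding peak memory per block.
    Illustrative sketch only, not the actual YACHT/C++ logic."""
    num_passes = ceil(len(genomes) / num_genome_threshold)
    for i in range(num_passes):
        yield genomes[i * num_genome_threshold:(i + 1) * num_genome_threshold]

# With 25 genomes and a block size of 10, three passes are needed.
blocks = list(genome_blocks(list(range(25)), num_genome_threshold=10))
# → block sizes [10, 10, 5]
```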

Member

Yes, understood. But two years from now, when we want to make further changes, it should probably be documented in the code with a comment about what it's doing.

Member Author

I have added documentation for these new parameters. Please check the new commit.

Member

I don't quite remember doing this, I guess @chunyuma has already addressed this

@dkoslicki (Member)

Oh, and another question. Does the code coverage take into account the C++ code? Or is the C++ code covered by upstream python tests?

@chunyuma (Member, Author)

chunyuma commented Dec 10, 2024

Hi @dkoslicki,

> I assume this has been thoroughly tested to check consistency with the previous approach (save for order that ties are being broken, perhaps). Is that correct?

This update is not like the previous one. The previous update integrated Mahmudur's C++ script into yacht train, which could affect the trained genomes because of the order in which genomes with tied sizes are removed. This update integrates a similar similarity-calculation script to replace sourmash multisearch for identifying all organisms in the reference database that share at least one hash overlap with the sample, so the risk of changing the results is small. Although I didn't test it "thoroughly", I checked the results using the demo data and the GTDB k31 0.95 database and confirmed that they are consistent.
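As a toy illustration of that prefiltering step (hypothetical Python, not the actual YACHT or C++ code): with sketches represented as hash sets, finding every reference organism that shares at least one hash with the sample is a simple disjointness check:

```python
def organisms_with_overlap(sample_hashes, ref_sketches):
    """Return names of reference organisms sharing at least one hash with
    the sample. Illustrative stand-in for the multisearch-style step."""
    sample = set(sample_hashes)
    return [name for name, hashes in ref_sketches.items()
            if not sample.isdisjoint(hashes)]

# Toy data: three reference organisms and a two-hash sample.
refs = {"orgA": {1, 2, 3}, "orgB": {9, 10}, "orgC": {3, 7}}
hits = organisms_with_overlap([3, 5], refs)
# → ["orgA", "orgC"]
```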

> It'd be nice to include in the PR text a quick benchmark/summary of the performance improvement (eg. maybe also in a change log)

Do we need to do this before merging into main? What kinds of metrics would you like to include in the summary?

> I noticed that the test_run_yacht_pretrained_ref_db is marked as a slow test. Does the CI/CD ignore these? If so, we might want to revert or pick a smaller reference database (or change the ANI to 0.95) to make it run faster so this important integration test isn't skipped. Also, I would have assumed this is actually a much faster test now after the training C++ conversion previously

I have run all CI/CD tests locally and they all pass. However, for some reason, they fail on the CI/CD server. I think it is due to some CI/CD machine setting (memory or something else). Do you think it is necessary to figure out the reason? If so, could you please help me take a look? I spent the entire morning trying to figure out why but couldn't resolve it.

Also, I don't think it has become "much" faster, since it still takes 5-10 minutes.

@chunyuma (Member, Author)

> Does the code coverage take into account the C++ code? Or is the C++ code covered by upstream python tests?

The C++ code mainly replaces sourmash multisearch within the existing Python functions, so it is covered by the upstream Python tests. If you compare the coverage percentage between main and chunyu_v1.4.0, they are both 84%, so there is no change.

@dkoslicki (Member)

> Hi @dkoslicki,

>> I assume this has been thoroughly tested to check consistency with the previous approach (save for order that ties are being broken, perhaps). Is that correct?

> This update is not like the previous one. The previous update integrated Mahmudur's C++ script into yacht train, which could affect the trained genomes because of the order in which genomes with tied sizes are removed. This update integrates a similar similarity-calculation script to replace sourmash multisearch for identifying all organisms in the reference database that share at least one hash overlap with the sample, so the risk of changing the results is small. Although I didn't test it "thoroughly", I checked the results using the demo data and the GTDB k31 0.95 database and confirmed that they are consistent.

I see, thanks. @mahmudhera did you perform tests to ensure consistency with the sourmash multisearch results? I recall you mentioning that you checked it in this repo: https://github.com/KoslickiLab/sourmash_alternate_implementations but don't know if you did for these changes.

>> It'd be nice to include in the PR text a quick benchmark/summary of the performance improvement (eg. maybe also in a change log)

> Do we need to do this before merging into main? What kinds of metrics would you like to include in the summary?

No, not needed before merging into main. It would just be nice to include in some changelog somewhere.

>> I noticed that the test_run_yacht_pretrained_ref_db is marked as a slow test. Does the CI/CD ignore these? If so, we might want to revert or pick a smaller reference database (or change the ANI to 0.95) to make it run faster so this important integration test isn't skipped. Also, I would have assumed this is actually a much faster test now after the training C++ conversion previously

> I have run all CI/CD tests locally and they all pass. However, for some reason, they fail on the CI/CD server. I think it is due to some CI/CD machine setting (memory or something else). Do you think it is necessary to figure out the reason? If so, could you please help me take a look? I spent the entire morning trying to figure out why but couldn't resolve it.

> Also, I don't think it has become "much" faster, since it still takes 5-10 minutes.

That is quite frustrating (and, as you know all too well from Translator, frequently encountered).
@mlupei Can you check what's going on with the CI/CD issue? No need to have it block this PR due to locally passing tests, but moving forward, we want to keep the CI/CD tests in sync (and for it to run the slow tests, even if they are skipped locally).

@chunyuma (Member, Author)

chunyuma commented Dec 10, 2024

Hi @dkoslicki,

After a simple test using the demo sample.sig.zip file (which contains only one genome) and the gtdb-rs214-reps.k31_0.95_pretrained reference database, the performance of yacht run is actually even worse. Please see the comparison below:

New version 1.4.0 (integrating the C++ script into yacht run): 6:23.76 minutes, 23.45 GB

Old version 1.3.0: 0:54.35 minutes, 2.946216 GB

It seems the C++ script needs to spend time building indexes, while sourmash multisearch does not have this step and is therefore faster. @mahmudhera, do you know if this is normal? When the number of genomes is small (e.g., one genome in the sample), is sourmash multisearch actually faster than our C++ algorithm?

@dkoslicki (Member)

Ooo, yeah, that's a pretty significant slowdown. Let's wait to hear from @mahmudhera about what's going on. And along those lines, @mahmudhera and @chunyuma, let's run the timing on a few different data sets (small, and realistically large) so we get a good sense of the speed impact. Slower code would defeat the purpose of moving to the cpp implementation...

@chunyuma (Member, Author)

chunyuma commented Dec 10, 2024

Hi Professor @dkoslicki again,

As you might notice in the comparison above, the new algorithm requires much more memory. This explains why test_run_yacht_pretrained_ref_db failed on the CI/CD server but worked locally: as I previously suspected, it is indeed a memory issue, and the CI/CD server has a memory limit. In my latest commit, I switched to a smaller database (ANI 0.8), which resolved the previously failing test.

@dkoslicki (Member)

I think we'll need to hold off on this PR until @mahmudhera and I can discuss the RAM and time impact of this move away from sourmash multisearch

@mahmudhera (Member)

mahmudhera commented Dec 10, 2024

Hi guys, looks like I missed a lot here. Let me answer by organizing things into a few high-level questions.

When is it beneficial to use the alternate implementation (implemented in cpp)?

@chunyuma and @dkoslicki : the alternate implementation of sourmash multisearch works by building an index. Indeed, building the index is slower than starting the comparisons straight off. Therefore, when the number of comparisons is very large (such as an all-vs-all comparison in the yacht train phase), building the index makes sense. If the number of comparisons is relatively small (such as a one-vs-all comparison in the yacht run phase), then perhaps not building the index makes sense.

From previous experience, we know that for an all-vs-all comparison, this alternate implementation is greatly beneficial. Is it indeed slower to use this alternate implementation for a one-vs-all? That has not been tested yet: I need to test that separately and get back to you guys.
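The tradeoff described above can be sketched with a toy inverted index (illustrative Python, not the actual C++ implementation): building the index costs a full pass over every reference sketch, so the up-front cost only amortizes when many queries follow, as in all-vs-all.

```python
from collections import defaultdict

def build_inverted_index(ref_sketches):
    """Up-front cost: one pass over every hash of every reference sketch."""
    index = defaultdict(set)
    for name, hashes in ref_sketches.items():
        for h in hashes:
            index[h].add(name)
    return index

def query(index, sample_hashes):
    """Cheap per query: touches only the sample's hashes. Worth the build
    for all-vs-all (many queries), but for a single one-vs-all query the
    build may cost more than comparing sketches directly."""
    hits = set()
    for h in sample_hashes:
        hits |= index.get(h, set())
    return hits

refs = {"orgA": {1, 2, 3}, "orgB": {9, 10}, "orgC": {3, 7}}
idx = build_inverted_index(refs)
# query(idx, [3, 5]) → {"orgA", "orgC"}
```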

Can we make the index building faster?

Yes, I have implemented a parallel implementation of building the index (by building something I am calling a Segregated Multi-Index), which is now here: https://github.com/KoslickiLab/sourmash_alternate_implementations

I can benchmark it separately to see if using this parallelized index building makes it fast enough so that it's faster than using sourmash multisearch for a one-vs-all comparison.

Has it been tested that this version of yacht produces the same output?

I did not do this test. What I have tested is the following: does the cpp implementation of sourmash multisearch produce the same output in terms of pairwise similarity measures? My tests show that yes, the cpp implementation is equivalent to sourmash multisearch. In a previous release, this implementation was incorporated by @chunyuma and used in yacht train. Both @chunyuma and I verified that yacht train using my cpp code and yacht train using sourmash multisearch produce (almost) the same result. I think in this iteration @chunyuma is doing the same thing, but for yacht run. For yacht run, I don't think this equivalency test has been done yet, at least not that I am aware of.

What I can do is add the tests I do have to the yacht repo. I can also help test whether this version of yacht run produces the same results as the previous one.

@chunyuma (Member, Author)

OK, I have removed version 1.4.0 from the GitHub page, which will hold off the update on bioconda.

@mahmudhera (Member)

> I think we'll need to hold off on this PR until @mahmudhera and I can discuss the RAM and time impact of this move away from sourmash multisearch

I think that'd be the most appropriate way to go.

@chunyuma (Member, Author)

Thank you for your thoughts @mahmudhera.

> @chunyuma and @dkoslicki : the alternate implementation of sourmash multisearch works by building an index. Indeed, building the index is slower than starting the comparisons straight off. Therefore, when the number of comparisons is very large (such as an all-vs-all comparison in the yacht train phase), building the index makes sense. If the number of comparisons is relatively small (such as a one-vs-all comparison in the yacht run phase), then perhaps not building the index makes sense.
>
> From previous experience, we know that for an all-vs-all comparison, this alternate implementation is greatly beneficial. Is it indeed slower to use this alternate implementation for a one-vs-all? That has not been tested yet: I need to test that separately and get back to you guys.

I think that makes sense. Given that the alternative implementation may perform differently at different data sizes, instead of abandoning sourmash multisearch entirely, we could probably keep both and provide a switch to choose the method for different datasets. @dkoslicki, do you think it is worthwhile?

@mahmudhera (Member)

mahmudhera commented Dec 10, 2024

@dkoslicki and @chunyuma: I suspect that there could be some potential for improvement if we do the following:

(a) from the reference sketches, build the index, and store the index in file(s)
(b) instead of providing the reference sketches as input, use the pre-built index file as input to yacht run

If the index is pre-built, both all-vs-all and many-vs-many comparisons should be much faster. We can provide pre-built index files for various ANI thresholds for users to download and run. Additionally, we can provide instructions for building a new index for the user's desired reference genomes.

Of course, this is just a thought. There are many "file-based" optimizations possible if we go in this direction; these optimizations usually lead to very low memory usage (but I am not promising anything 😜)
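A minimal sketch of ideas (a) and (b), with hypothetical names and a pickle-based format chosen only for illustration: serialize the index once offline, then have yacht run load the file instead of rebuilding the index on every run.

```python
import pickle

def save_index(index, path):
    """Persist a pre-built hash -> genome-names index to disk so a later
    run can skip the expensive build step. Illustrative sketch only."""
    with open(path, "wb") as f:
        pickle.dump(index, f)

def load_index(path):
    """Reload a previously saved index; no rebuild pass is needed."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Hypothetical usage: build once offline, then reuse across runs.
index = {101: {"orgA"}, 202: {"orgA", "orgB"}}
save_index(index, "refs.idx")
restored = load_index("refs.idx")
# → restored == index
```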

@chunyuma (Member, Author)

Hi @mahmudhera,

Thanks for these potential improvement ideas. Let's wait for @dkoslicki's feedback, as I'm uncertain whether we should try them given the submission timeline and the related projects.

@mahmudhera (Member)

> Hi @mahmudhera,
>
> Thanks for these potential improvement ideas. Let's wait for @dkoslicki's feedback, as I'm uncertain whether we should try them given the submission timeline and the related projects.

Definitely. Implementing these will take time for me too; let's not blunder anything by rushing.

@dkoslicki (Member)

>> Hi @mahmudhera,
>> Thanks for these potential improvement ideas. Let's wait for @dkoslicki's feedback, as I'm uncertain whether we should try them given the submission timeline and the related projects.
>
> Definitely. Implementing these will take time for me too; let's not blunder anything by rushing.

Agreed: let's not rush this. Mahmudur and I are meeting tomorrow and will map out a plan forward. Note: this part (replacing branchwater with the C++ code) is not required for the JOSS submission, so there's no big time deadline on that front.
