Update YACHT to v1.4.0 #129
Conversation
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@ Coverage Diff @@
##             main     #129   +/-  ##
======================================
  Coverage   84.27%   84.28%
======================================
  Files          11       11
  Lines        1094     1101     +7
======================================
+ Hits          922      928     +6
- Misses        172      173     +1

☔ View full report in Codecov by Sentry.
Thanks both for doing this! One in-line comment, and the following higher-level comments:
- I assume this has been thoroughly tested to check consistency with the previous approach (save for the order in which ties are broken, perhaps). Is that correct?
- It'd be nice to include in the PR text a quick benchmark/summary of the performance improvement (e.g., maybe also in a changelog).
- I noticed that test_run_yacht_pretrained_ref_db is marked as a slow test. Does the CI/CD ignore these? If so, we might want to revert, or pick a smaller reference database (or change the ANI to 0.95) to make it run faster, so this important integration test isn't skipped. Also, I would have assumed this is actually a much faster test now, after the earlier C++ conversion of the training step.
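For context on the slow-test question, here is a minimal sketch of the usual pytest convention for marking and deselecting slow tests. This is an assumption about how the repo's CI might be configured, not something confirmed from it:

```python
# Common pytest "slow" marker pattern (assumed, not verified for this repo).
import pytest

@pytest.mark.slow  # marker should be registered under [tool.pytest.ini_options] / pytest.ini
def test_run_yacht_pretrained_ref_db():
    ...  # full integration run against a pretrained reference database

# A CI pipeline typically deselects such tests with:
#     pytest -m "not slow"
# Dropping that flag (or shrinking the reference DB) keeps the test in CI.
```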
num_threads: int,
path_to_genome_temp_dir: str,
path_to_sample_temp_dir: str,
num_genome_threshold: int = 1000000
This argument doesn't seem to be documented. Is the intent of this just to calculate the number of passes, @mahmudhera? Or does it set some other threshold with respect to the number of organisms detected, or the like?
@dkoslicki, this parameter is not expected to be modified by users; alternatively, I could always set it to 1000000. It is not associated with the YACHT results. It only controls how many genomes are processed in a block within Mahmudur's algorithm. So, don't worry about it.
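To make the parameter's role concrete, here is a minimal illustrative sketch of how a block-size threshold like this typically determines the number of passes. The function name and structure are hypothetical, not YACHT's actual code:

```python
# Hypothetical sketch; `process_in_blocks` is not YACHT's real function name.
import math

def process_in_blocks(genome_sigs: list, num_genome_threshold: int = 1_000_000):
    """Yield reference genomes in blocks of at most `num_genome_threshold`.

    The threshold never changes the results; it only bounds how many
    genomes are held in memory at once, which fixes the number of passes.
    """
    num_passes = math.ceil(len(genome_sigs) / num_genome_threshold)
    for i in range(num_passes):
        yield genome_sigs[i * num_genome_threshold:(i + 1) * num_genome_threshold]
```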
Yes, understood. But two years from now, when we want to make yet further changes, it should probably be documented in the code with a comment about what it's doing.
I have added documentation for these new parameters. Please check the new commit.
I don't quite remember doing this; I guess @chunyuma has already addressed it.
Oh, and another question: does the code coverage take into account the C++ code? Or is the C++ code covered by upstream Python tests?
Hi @dkoslicki,
This update isn't like the previous one. The previous update involved integrating Mahmudur's cpp script into yacht train, whereas this one targets yacht run.
Do we need to do this before merging it into main?
I have run all the CI/CD tests locally and they all pass. However, for some reason, they fail on the CI/CD server. I think it is due to some CI/CD machine setting (memory or something else). Do you think it is necessary to figure out the reason? If you think it is necessary, could you please help me take a look? I have spent the entire morning trying to figure out why, but I can't resolve it. Also, I don't think it has become "much" faster, because it still needs 5-10 minutes.
The C++ code is mainly used to replace the sourmash multisearch step.
I see, thanks. @mahmudhera, did you perform tests to ensure consistency with the previous sourmash multisearch approach?
No, not needed before merging into main. It would just be nice to include in some changelog somewhere.
That is quite frustrating (and, as you know all too well with Translator, frequently encountered).
Hi @dkoslicki,
After a simple test using the demo data:
- New version: 1.4.0 (integrating the cpp script into yacht run): …
- Old version: 1.3.0: …
It seems like the cpp script needs to spend some time building indexes, while the sourmash multisearch approach does not.
Ooo, yeah, that's a pretty significant slowdown. Let's wait to hear from @mahmudhera about what's going on. And along those lines, @mahmudhera and @chunyuma, let's run the timing on a few different data sets (small, and realistically large), so we get a good sense of the speed impact. Slower code would defeat the purpose of moving to the cpp implementation...
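Along the lines of that timing request, a minimal benchmarking sketch is shown below. The executable names, dataset labels, and --config flag are placeholders/assumptions, not the real YACHT CLI:

```python
# Hypothetical harness; "yacht_v1.3"/"yacht_v1.4" and "--config" are
# placeholders, not the real CLI. Unix-only (uses the resource module).
import resource
import subprocess
import time

DATASETS = ["demo_small", "realistic_large"]  # placeholder dataset names
VERSIONS = {"1.3.0 (multisearch)": "yacht_v1.3",
            "1.4.0 (cpp)": "yacht_v1.4"}

for data in DATASETS:
    for label, exe in VERSIONS.items():
        t0 = time.perf_counter()
        subprocess.run([exe, "run", "--config", f"{data}.yaml"], check=True)
        wall = time.perf_counter() - t0
        # ru_maxrss is the peak over *all* children so far (KB on Linux),
        # so run each version in a fresh process for a clean per-run peak.
        peak_kb = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
        print(f"{label:>22} on {data}: {wall:8.1f}s, peak RSS ~{peak_kb / 1024:.0f} MB")
```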
Hi Professor @dkoslicki again, As you might notice in the comparison above, the new algorithm requires more memory. This explains why the CI/CD tests fail on the server.
I think we'll need to hold off on this PR until @mahmudhera and I can discuss the RAM and time impact of this move away from sourmash multisearch.
Hi guys, looks like I missed a lot here. Let me answer by organizing things into a few high-level questions.

When is it beneficial to use the alternate implementation (implemented in cpp)?
@chunyuma and @dkoslicki: the alternate implementation of … From previous experience, we know that for an all-vs-all comparison, this alternate implementation is greatly beneficial.

Is it indeed slower to use this alternate implementation for a one-vs-all?
That has not been tested yet: I need to test that separately and get back to you guys.

Can we make the index building faster?
Yes, I have implemented a parallel implementation of index building (by building something I am calling a Segregated Multi-Index), which is now here: https://github.com/KoslickiLab/sourmash_alternate_implementations
I can benchmark it separately to see if using this parallelized index building makes it fast enough that it's faster than using …

Has it been tested that this version of …
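To illustrate why an index helps all-vs-all far more than one-vs-all, here is a toy inverted-index sketch. It is purely illustrative; it is not the Segregated Multi-Index or YACHT's actual data structure:

```python
# Toy inverted index over sketch hashes (illustrative only).
from collections import defaultdict

def build_index(ref_sketches: dict[str, set[int]]) -> dict[int, set[str]]:
    """Map each hash to the set of references containing it (built once)."""
    index: dict[int, set[str]] = defaultdict(set)
    for name, hashes in ref_sketches.items():
        for h in hashes:
            index[h].add(name)
    return index

def shared_hashes(index: dict[int, set[str]], sample: set[int]) -> dict[str, int]:
    """Count hashes shared with each reference (a containment numerator)."""
    counts: dict[str, int] = defaultdict(int)
    for h in sample:
        for name in index.get(h, ()):
            counts[name] += 1
    return counts

# One-vs-all: the build cost is paid for a single query, so it can dominate.
# All-vs-all: build once, query N times -- the build cost is amortized away.
```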
OK, I have removed version 1.4.0 from the GitHub page, which will hold off the update on bioconda.
I think that'd be the most appropriate way to go.
Thank you for your thoughts @mahmudhera.
I think it makes sense. Considering the potentially varying performance of the alternative implementation on different data sizes, instead of totally abandoning sourmash multisearch, …
@dkoslicki and @chunyuma: I suspect that there could be some potential for improvement if we do the following (see the sketch after this comment):
(a) from the reference sketches, build the index, and store the index in file(s)
If the index is pre-built, both all-vs-all and many-vs-many comparisons should be much faster. We can provide the pre-built index files for various ANI thresholds for the user to download and run. Additionally, we can provide instructions to build a new index for the user's desired reference genomes.
Of course, this is just a thought. There are many "file-based" optimizations possible if we go in this direction; these optimizations usually lead to very low memory usage (but I am not promising anything 😜)
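As a concrete sketch of the store-the-index-in-files idea: the file name, the ANI-threshold naming scheme, and the pickle format are all assumptions for illustration, not a proposed on-disk format:

```python
# Illustrative persistence of a pre-built index (format is an assumption).
import pickle

def save_index(index: dict, path: str = "ref_index_ani0.95.pkl") -> None:
    """Write the index once; users then download and reuse it."""
    with open(path, "wb") as f:
        pickle.dump(index, f)

def load_index(path: str = "ref_index_ani0.95.pkl") -> dict:
    """Load a pre-built index instead of rebuilding it on every run."""
    with open(path, "rb") as f:
        return pickle.load(f)
```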
Hi @mahmudhera, Thanks for these potential improvement ideas. Let's wait for @dkoslicki's feedback, as I'm uncertain whether we should try them given the submission timeline and the related projects.
Definitely; implementing these will take time for me too. Let's not blunder anything by rushing.
Agreed: let's not rush this. Mahmudur and I are meeting tomorrow and will map out a plan forward. Note: this part (replacing branchwater with the C++ code) is not required for the JOSS submission, so there's no big deadline on that front.
Integrate Mahmudur's scripts to calculate similarities within yacht run.