Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different results for v1.2.0 and v1.2.2 and v1.3.0 #74

Open
Stikus opened this issue Oct 2, 2024 · 13 comments
Open

Different results for v1.2.0 and v1.2.2 and v1.3.0 #74

Stikus opened this issue Oct 2, 2024 · 13 comments
Assignees
Labels
good first issue Good for newcomers

Comments

@Stikus
Copy link

Stikus commented Oct 2, 2024

Hello, thanks for updates for your tool.

We're trying to update our docker containers with your tool but registered different results for v1.2.0 and v1.2.2:
Run commands:

/soft/msisensor-pro-1.2.0/bin/msisensor-pro msi -d /ref/human_g1k_v37.list -n /inputs/normal.bam -t /inputs/tumor.bam -b 32 -o /outputs/output/test_pair

/soft/msisensor-pro-1.2.2/bin/msisensor-pro msi -d /ref/human_g1k_v37.list -n /inputs/normal.bam -t /inputs/tumor.bam -b 32 -p 8 -o /outputs/output/test_pair

First of all - number of sites and windows changed:

  • v1.2.0:
Total loading windows:  5804 
Total loading homopolymer and microsatellites:  1786895
...
*** Summary information ***

Number of total sites: 1786895
Number of sites with enough coverage: 58
Number of MSI sites: 0
  • v1.2.2:
Total loading windows:  5696 
Total loading homopolymer and microsatellites:  363133 
...
*** Summary information ***

Number of total sites: 363133
Number of sites with enough coverage: 0
Number of MSI sites: 0

And more important - we have empty germline result now:

  • v1.2.0:
chromosome	location	left_flank_bases	repeat_times	repeat_unit_bases	right_flank_bases	genotype	difference	P_value	FDR
19	2008302	CACTC	13	T	CTAGG	13|12	0.21678	0.54982	1.063
19	2011861	TTTCC	10	T	GAGAC	10|10	0	1	1.3488
19	2017120	CGTCT	6	CAAA	ACAAA	6|6	0.08	0.28932	1.1986
19	2017151	AAAAC	10	A	CTAGG	10|10	0.076167	0.57701	1.0458
19	2018642	TGGAC	12	T	AAAGA	12|12	0.10526	0.33506	0.97168
19	2018961	ACAGG	6	CA	CGGGA	6|6	0.076923	0.37455	0.90517
19	2019980	CTTTT	6	TTCC	TTCTT	6|6	0	1	1.0175
19	2022596	TTTCC	11	T	CCCTG	11|11	0.082596	0.81839	1.1577
19	2025188	AATAA	12	T	GAGAC	12|12	0.42105	0.037295	2.1631
19	2025522	CCTTC	5	CT	CCTTC	5|5	0	1	1.234
19	2027127	ACCAG	7	AC	CAGAC	7|7	0.028169	0.80606	1.1688
19	2027144	ACCAG	7	AC	CAGAC	7|7	0.028169	0.80606	1.1988
19	2027161	ACCAG	5	AC	AGGGC	5|5	0	1	1.3182
19	2027495	GAGCC	10	AG	AAAGA	10|10	0.21053	0.30538	1.1808
19	2028158	CCTCT	11	A	TTACA	11|11	0	1	1.2609
19	2031306	ATCTC	12	A	TAAAA	11|11	0.40209	0.1242	1.2006
19	2032700	GTGCC	5	TTCA	ATCTG	5|5	0	1	1.2083
19	2036371	AAAGG	12	A	TACAA	12|12	0.34862	0.48119	1.0734
19	2039434	ACAGG	10	A	CCCAA	11|10	0.13397	0.46656	1.0824
19	2046939	ACTCA	5	GT	GGATT	5|5	0	1	1.16
19	2049907	CAGAG	7	AC	AAACA	7|7	0.097561	0.21243	1.369
19	2049923	CACAA	5	AC	ATACA	5|5	0	1	1.0943
19	2052598	TAAGA	7	GT	GCACG	7|7	0.071429	0.22779	1.101
19	2058828	GGCTT	5	TTTG	TTTTG	4|5	0.28571	0.21417	1.1293
19	2068016	CGTTT	12	A	TGTGT	12|12	0.19048	0.23128	1.0319
19	2069060	CCTTT	11	A	TGGCA	11|11	0.09482	0.59809	1.0203
19	2069767	CTCAG	5	AACA	ACTAG	5|5	0	1	1.0741
19	2074245	GAGAT	5	CAAA	AAAAC	5|5	0	1	1.0545
19	2074718	GCTAA	11	T	CCCCT	11|11	0	1	1.0357
19	2075031	TTTTC	15	T	GAGAC	15|15	0.6255	0.12035	1.7451
19	2077400	GTCTC	11	A	GAAAA	11|11	0.38462	0.18621	1.35
19	2077787	AATCC	13	T	GAGAC	13|13	0.16667	0.51553	1.0679
19	2080383	ACCTC	11	T	GAGGC	11|11	0	1	1
19	2083003	ATTTT	5	TTG	TATTT	5|4	0.35897	0.071532	2.0744
19	2084890	CCTAG	10	T	ATTTA	10|10	0	1	1.1373
19	2090669	CGTTA	11	T	AAGAC	11|11	0.0625	0.30973	1.1228
19	2092000	AACAG	10	A	GAGAA	10|10	0.076923	0.33953	0.93774
19	2093160	GGGAA	13	T	CGAGA	13|13	0.24561	0.60897	1.0092
19	2095176	CTCAA	7	AAAC	AAAAA	7|7	0.08	0.6604	1.0352
19	2095756	TGTTC	12	T	CCCCC	12|12	0.094845	0.57166	1.0696
19	2103219	TTATT	11	A	TGGGA	11|11	0.25	0.10777	2.0835
19	2125834	GCTAG	11	A	CAAAA	12|12	0.27338	0.37271	0.93987
19	2133510	CCAGC	12	T	GAGAC	12|12	0.086957	0.61346	0.98836
19	2137313	ATGTG	13	T	CTTGG	13|13	0.13333	0.31937	1.0291
19	2140479	TTGAC	13	T	GAGAC	13|14	0.32558	0.32124	0.98061
19	2142754	TTCTG	12	T	CTTTT	11|11	0.10526	0.5009	1.076
19	2143763	GCCTC	10	A	TAAAA	10|10	0.094708	0.57759	1.0152
19	2144904	TCCAA	5	AAAC	AAAAC	5|5	0	1	1.1154
19	2145758	ACCTC	6	CA	CCAAG	6|6	0	1	1.381
19	2147882	GTCTC	11	A	GAAAA	11|11	0.13421	0.52212	1.0442
19	2152539	CGTCT	10	A	GAGAG	10|10	0	1	1.1837
19	2157097	CCCAT	5	CATC	CACCC	5|5	0.086957	0.67485	1.03
19	2158999	TATAA	11	T	AATGG	11|10	0.27869	0.31511	1.0751
19	2159210	TTCCA	5	TC	TTTTT	5|5	0.30769	0.12186	1.4136
19	2166076	ATTTA	5	TATGT	TGTTA	5|5	0	1	1.2889
19	2172506	CTTTC	13	T	GAAGA	13|13	0.2439	0.35051	0.92408
19	2177298	AACAA	12	T	GGGAG	12|12	0.16	0.17247	1.4291
19	2183629	AGTGC	14	T	AAGGC	14|14	0.26087	0.21288	1.2347

  • v1.2.2:
chromosome	location	left_flank_bases	repeat_times	repeat_unit_bases	right_flank_bases	genotype	difference	P_value	FDR

Is this intended? Is this due to MinMicrosate and MaxMicrosate change in commit 3d3d7b8 ? Looks like now zero sites have enough coverage but covCutoff is same (15).

@Stikus
Copy link
Author

Stikus commented Oct 2, 2024

We've found reason for empty result - we used /ref/human_g1k_v37.list from v1.2.0 scan command for v1.2.2 (we don't know that they will change after update).

==> human_g1k_v37.list_v1.2.0 (71504K size) <==
chromosome      location        repeat_unit_length      repeat_times    repeat_unit_bases       left_flank_bases        right_flank_bases
1       10108   6       6       AACCCT  AACCC   AACCC
1       10147   6       5       CCCTAA  CTAAC   CCTAA
1       10177   6       9       CCTAAC  CCTAA   CCCTA
1       10255   6       5       AACCCT  CCCTA   AACCC
1       10285   6       6       AACCCC  ACCCT   AACCC
1       10352   6       6       ACCCTA  ACCCT   ACCCC
1       10397   6       7       CCCTAA  CTAAC   CCCCT
1       26453   2       6       GT      TGTTA   TTGCT
1       28588   1       15      T       GGTGG   GCATC

==> human_g1k_v37.list_v1.2.2 (54029K size) <==
chromosome      location        repeat_unit_length      repeat_unit_binary      repeat_times    left_flank_binary       right_flank_binary      repeat_unit_bases     left_flank_bases        right_flank_bases
1       26453   2       11      6       956     999     GT      TGTTA   TTGCT
1       28588   1       3       15      698     589     T       GGTGG   GCATC
1       30867   2       7       12      885     319     CT      TCTCC   CATTT
1       30910   2       13      6       847     627     TC      TCATT   GCTAT
1       30935   2       13      6       255     1015    TC      ATTTT   TTTCT
1       31719   1       0       14      466     711     A       CTCAG   GTACT
1       31807   2       1       5       51      275     AC      AATAT   CACAT
1       33449   1       0       15      335     645     A       CCATT   GGACC
1       33530   1       3       11      1021    204     T       TTTTC   ATATA

After rescan we got new results:

  • v1.2.2:
Total loading windows:  5804 
Total loading homopolymer and microsatellites:  1789437 

*** Summary information ***

Number of total sites: 1789437
Number of sites with enough coverage: 63
Number of MSI sites: 0
chromosome	location	left_flank_bases	repeat_times	repeat_unit_bases	right_flank_bases	genotype	difference	P_value	FDR
19	2008302	CACTC	13	T	CTAGG	13|12	0.21678	0.54982	0.96218
19	2011861	TTTCC	10	T	GAGAC	10|10	0.16667	0.13451	1.6948
19	2017120	CGTCT	6	CAAA	ACAAA	6|6	0.08	0.28932	1.0126
19	2017151	AAAAC	10	A	CTAGG	10|10	0.1875	0.52981	0.95366
19	2018642	TGGAC	12	T	AAAGA	12|12	0.10526	0.33506	0.84436
19	2018961	ACAGG	6	CA	CGGGA	6|6	0.076923	0.37455	0.78656
19	2019980	CTTTT	6	TTCC	TTCTT	6|6	0	1	1.125
19	2022596	TTTCC	11	T	CCCTG	11|11	0.082596	0.81839	1.0312
19	2025188	AATAA	12	T	GAGAC	12|12	0.42105	0.037295	2.3496
19	2025522	CCTTC	5	CT	CCTTC	5|5	0	1	1.2353
19	2027127	ACCAG	7	AC	CAGAC	7|7	0.028169	0.80606	1.0364
19	2027144	ACCAG	7	AC	CAGAC	7|7	0.028169	0.80606	1.058
19	2027161	ACCAG	5	AC	AGGGC	5|5	0	1	1.0678
19	2027495	GAGCC	10	AG	AAAGA	10|10	0.21053	0.30538	0.96194
19	2028158	CCTCT	11	A	TTACA	11|11	0	1	1.05
19	2031306	ATCTC	12	A	TAAAA	11|11	0.40209	0.1242	2.6082
19	2032700	GTGCC	5	TTCA	ATCTG	5|5	0	1	1.0862
19	2036371	AAAGG	12	A	TACAA	12|12	0.36184	0.50028	0.95507
19	2039434	ACAGG	10	A	CCCAA	11|10	0.15982	0.4243	0.86228
19	2046939	ACTCA	5	GT	GGATT	5|5	0	1	1.0328
19	2049907	CAGAG	7	AC	AAACA	7|7	0.097561	0.21243	1.1153
19	2049923	CACAA	5	AC	ATACA	5|5	0	1	1.1455
19	2052598	TAAGA	7	GT	GCACG	7|7	0.071429	0.22779	0.89691
19	2058828	GGCTT	5	TTTG	TTTTG	4|5	0.28571	0.21417	0.96376
19	2068016	CGTTT	12	A	TGTGT	12|12	0.19048	0.23128	0.85711
19	2069060	CCTTT	11	A	TGGCA	11|11	0.09482	0.59809	0.942
19	2069767	CTCAG	5	AACA	ACTAG	5|5	0	1	1.1667
19	2074245	GAGAT	5	CAAA	AAAAC	5|5	0.069767	0.29665	0.98364
19	2074718	GCTAA	11	T	CCCCT	11|11	0	1	1.1887
19	2075031	TTTTC	15	T	GAGAC	15|15	0.64407	0.14965	1.3468
19	2077400	GTCTC	11	A	GAAAA	11|11	0.46667	0.21991	0.92363
19	2077787	AATCC	13	T	GAGAC	13|13	0.16667	0.51553	0.95525
19	2080383	ACCTC	11	T	GAGGC	11|11	0.16667	0.19761	1.2449
19	2083003	ATTTT	5	TTG	TATTT	5|4	0.37014	0.12715	2.0027
19	2084890	CCTAG	10	T	ATTTA	10|10	0	1	1.2115
19	2090669	CGTTA	11	T	AAGAC	11|11	0.0625	0.30973	0.9292
19	2092000	AACAG	10	A	GAGAA	10|10	0.076923	0.33953	0.8227
19	2093160	GGGAA	13	T	CGAGA	13|13	0.24561	0.60897	0.93574
19	2095176	CTCAA	7	AAAC	AAAAA	7|7	0.14815	0.6567	0.88025
19	2095756	TGTTC	12	T	CCCCC	12|12	0.14894	0.55354	0.94251
19	2103219	TTATT	11	A	TGGGA	11|11	0.25	0.10777	3.3947
19	2125834	GCTAG	11	A	CAAAA	12|12	0.27338	0.37271	0.80967
19	2133510	CCAGC	12	T	GAGAC	12|12	0.086957	0.61346	0.92019
19	2137313	ATGTG	13	T	CTTGG	13|13	0.13333	0.31937	0.8748
19	2140479	TTGAC	13	T	GAGAC	13|14	0.32558	0.32124	0.84324
19	2142754	TTCTG	12	T	CTTTT	11|11	0.13481	0.58299	0.94176
19	2143763	GCCTC	10	A	TAAAA	10|10	0.25	0.19776	1.1326
19	2144904	TCCAA	5	AAAC	AAAAC	5|5	0	1	1
19	2145758	ACCTC	6	CA	CCAAG	6|6	0	1	1.0161
19	2147882	GTCTC	11	A	GAAAA	11|11	0.17647	0.57594	0.95484
19	2152539	CGTCT	10	A	GAGAG	10|10	0.045195	0.36771	0.82735
19	2157097	CCCAT	5	CATC	CACCC	5|5	0.20134	0.45914	0.90394
19	2158999	TATAA	11	T	AATGG	11|10	0.27869	0.31511	0.90236
19	2159210	TTCCA	5	TC	TTTTT	5|5	0.34545	0.16105	1.2683
19	2162681	CACTC	10	A	CCCCA	10|9	0.47898	0.13651	1.4333
19	2166076	ATTTA	5	TATGT	TGTTA	5|5	0	1	1.1053
19	2172506	CTTTC	13	T	GAAGA	13|13	0.2439	0.35051	0.81787
19	2177298	AACAA	12	T	GGGAG	12|12	0.16	0.17247	1.2073
19	2183629	AGTGC	14	T	AAGGC	14|14	0.26087	0.21288	1.0317
19	2467148	AAAAC	10	A	CAAAA	7|7	0.31579	0.62109	0.85063
19	2467159	AAAAC	10	A	CAAAA	7|7	0.31579	0.62109	0.86953
19	2532160	AAAAC	11	A	CAAAA	7|7	0.31579	0.62109	0.88929
19	2567130	AAAAC	10	A	CAAAA	7|7	0.31579	0.62109	0.90997

Visual comparison (v1.2.0 - left, v1.2.2 - right):
image

Looks like major differences are numerical, but we have 5 new lines (as you can see on attached screenshot).

So we have 2 questions:

  1. Are these differences expected?
  2. Should we rescan our reference genomes every new msisensor version?

@PengJia6
Copy link
Collaborator

PengJia6 commented Oct 2, 2024

Thank you very much for your report. In versions prior to v1.2.2, if the repeat count of a read at a given site was less than the minimum repeat count (which can be set via the command line), the read would be filtered out, which is an unreasonable operation. As shown in your screenshot, all these records added in v1.2.2 have genotypes with fewer than 10 repeats of A repeat.

@PengJia6
Copy link
Collaborator

PengJia6 commented Oct 2, 2024

@Stikus
For each version, we recommend that you use the same scan and msi, pro commands. Additionally, the latest version of msisensor-pro is v1.3.0. This version will be a long-term support version, and we welcome you to test it.

@Stikus
Copy link
Author

Stikus commented Oct 2, 2024

We'll test v1.3.0 today, when we test it we got error about same genome for scan and msi commands and decided to downgrade to 1.2.2 first. We completely forgot about separate scan step - now I understand it.

Thanks for detailed answer - I'll post v1.3.0 results here if you don't mind.

@PengJia6
Copy link
Collaborator

PengJia6 commented Oct 2, 2024

We'll test v1.3.0 today, when we test it we got error about same genome for scan and msi commands and decided to downgrade to 1.2.2 first. We completely forgot about separate scan step - now I understand it.

Thanks for detailed answer - I'll post v1.3.0 results here if you don't mind.

I would be very grateful. Please feel free to post here.

@Stikus
Copy link
Author

Stikus commented Oct 2, 2024

Can you help me with new naming?

Before we have "${outputPrefix}_somatic" and "${outputPrefix}_germline" - now we have "${outputPrefix}_unstable" and "${outputPrefix}_all". Is this correct match?

According to

outputSomatic.open((outputPrefix + "_unstable").c_str());
- looks like correct.

@PengJia6
Copy link
Collaborator

PengJia6 commented Oct 2, 2024

Can you help me with new naming?

Before we have "${outputPrefix}_somatic" and "${outputPrefix}_germline" - now we have "${outputPrefix}_unstable" and "${outputPrefix}_all". Is this correct match?

According to

outputSomatic.open((outputPrefix + "_unstable").c_str());

  • looks like correct.

Yes, to align with the results produced by the pro command, the _all table includes all microsatellite sites with sufficient coverage, while the _unstable table indicates unstable sites, meaning sites where somatic mutations have occurred.

@Stikus
Copy link
Author

Stikus commented Oct 2, 2024

Here is a comparison (v1.3.0 - left, v1.2.2 - right):
image

Total loading windows:  5812 
Total loading homopolymer and microsatellites:  2678803 

*** Summary information ***

Number of total sites: 2678803
Number of sites with enough coverage: 116
Number of MSI sites: 0

As you can see - genotype field is missing (but header have leftover tab - here is the reason) - is this expected?
New output contain twice more lines - 116 instead of 63.

@PengJia6
Copy link
Collaborator

PengJia6 commented Oct 2, 2024

Here is a comparison (v1.3.0 - left, v1.2.2 - right): image

Total loading windows:  5812 
Total loading homopolymer and microsatellites:  2678803 

*** Summary information ***

Number of total sites: 2678803
Number of sites with enough coverage: 116
Number of MSI sites: 0

As you can see - genotype field is missing (but header have leftover tab - here is the reason) - is this expected? New output contain twice more lines - 116 instead of 63.

Yes, we removed the display of genotypes because many sites have undergone somatic mutations, making the genotypes here inaccurate. Additionally, for those sites, we adjusted the default minimum repeat count for homopolymers from 10 to 8, as in our experience, homopolymers with repeat counts of 8 and 9 also have many somatic mutations. The tab in the header is not as expected; I will remove it.

@Stikus
Copy link
Author

Stikus commented Oct 2, 2024

Correct me if I'm wrong - all new lines (green on screen) have repeat count 8 or 9. But old lines have repeat counts 5,6,7 and 10,11,12. So - how we get 5,6,7 in results if default was lowered from 10 to 8?

Upd. Or low repeat counts can be only for repeats with length more than one? So resulting length is above threshold?

@Stikus Stikus changed the title Different results for v1.2.0 and v1.2.2 Different results for v1.2.0 and v1.2.2 and v1.3.0 Oct 2, 2024
@PengJia6
Copy link
Collaborator

PengJia6 commented Oct 2, 2024

Correct me if I'm wrong - all new lines (green on screen) have repeat count 8 or 9. But old lines have repeat counts 5,6,7 and 10,11,12. So - how we get 5,6,7 in results if default was lowered from 10 to 8?

Upd. Or low repeat counts can be only for repeats with length more than one? So resulting length is above threshold?

In the scan command, the -l and -r parameters control the minimum repeat count for single nucleotide repeats and other sites (where the repeat unit length is greater than 2), respectively. The default values for -l and -r are 8 and 5, respectively.

@PengJia6 PengJia6 added the good first issue Good for newcomers label Oct 2, 2024
@PengJia6 PengJia6 self-assigned this Oct 2, 2024
@Stikus
Copy link
Author

Stikus commented Oct 2, 2024

Great thx for answers, we'll use latest version now. The issue can be closed, I think.

FYI - we are using your tool with latest samtools without any problems with this patch before make:
sed -i "s|\$(realpath ../vendor/htslib-1.11)|$SOFT/samtools-$SAMTOOLS_VERSION-src/htslib-$SAMTOOLS_VERSION|" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/makefile"

Our full build part from Dockerfile:

ENV MSISENSOR_PRO_VERSION='1.3.0'
RUN cd "$SOFT" \
    && wget -q "https://github.com/xjtu-omics/msisensor-pro/archive/refs/tags/v${MSISENSOR_PRO_VERSION}.tar.gz" -O "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}.tar.gz" \
    && tar -xzf "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}.tar.gz" \
    && mv "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src" \
    && cd "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp" \
    && sed -i "s|\$(realpath ../vendor/htslib-1.11)|$SOFT/samtools-$SAMTOOLS_VERSION-src/htslib-$SAMTOOLS_VERSION|" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/makefile" \
    && sed -ie '/export LD_LIBRARY_PATH/s/^/#/' "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/makefile" \
    && make -j"$(($(nproc)+1))" \
    && mkdir -p "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}/bin" \
    && cp "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src/cpp/msisensor-pro" "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}/bin" \
    && cd "$SOFT" \
    && rm -r "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}-src" \
    && rm "$SOFT/msisensor-pro-${MSISENSOR_PRO_VERSION}.tar.gz"

@PengJia6
Copy link
Collaborator

PengJia6 commented Oct 2, 2024

Thank you very much for providing the Docker file and for testing it. I will keep this issue open for a while; if you have any other questions, please feel free to post.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

2 participants