[fix](cloud) shorten cache lock held time and add metrics #47472

Open
wants to merge 1 commit into
base: master

freemandealer
Contributor

When updating bvar metrics, we held the block lock inside the critical section of the cache lock. This kept the cache lock held for too long and affected other cache logic. We now use an unsafe method to update the bvar, which improves performance.

Key lock metrics and other useful metrics are also added to better monitor cache time costs.
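
The idea described above can be illustrated with a minimal sketch. It is not the code from this PR: `Block`, `Cache`, `refresh_downloaded_bytes`, and the two metric fields are hypothetical stand-ins, and plain `std::atomic` counters are used in place of bvar. It assumes the block state can be read without the per-block lock and that the block size never changes after construction.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

// All names below are hypothetical stand-ins, not Doris identifiers.
enum BlockState : int { kEmpty = 0, kDownloaded = 1 };

struct Block {
    explicit Block(int64_t bytes) : size(bytes) {}
    std::mutex mutex;                // per-block lock, deliberately NOT taken below
    std::atomic<int> state{kEmpty};  // atomic, so reading it without the lock is well defined
    const int64_t size;              // assumed immutable after construction
};

struct Cache {
    std::mutex mutex;                            // cache-wide lock
    std::vector<std::unique_ptr<Block>> blocks;
    std::atomic<int64_t> downloaded_bytes{0};    // stand-in for a bvar gauge
    std::atomic<int64_t> last_lock_held_us{0};   // stand-in for a lock-held-time metric
};

// Sums the size of downloaded blocks while holding only the cache lock,
// then publishes the metrics after the lock has been released.
int64_t refresh_downloaded_bytes(Cache& cache) {
    using Clock = std::chrono::steady_clock;
    int64_t total = 0;
    int64_t held_us = 0;
    {
        std::lock_guard<std::mutex> guard(cache.mutex);
        const auto start = Clock::now();
        for (const auto& block : cache.blocks) {
            // No nested block->mutex acquisition inside the cache critical section.
            if (block->state.load(std::memory_order_acquire) == kDownloaded) {
                total += block->size;
            }
        }
        held_us = std::chrono::duration_cast<std::chrono::microseconds>(
                          Clock::now() - start)
                          .count();
    }
    cache.downloaded_bytes.store(total, std::memory_order_relaxed);
    cache.last_lock_held_us.store(held_us, std::memory_order_relaxed);
    return total;
}
```

The trade-off in this sketch is that the metric may be slightly stale, since a block can change state right after it is read. For monitoring purposes that is usually acceptable, but it would not be for logic that depends on the exact state.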

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Signed-off-by: zhengyu <[email protected]>
@Thearas
Contributor

Thearas commented Jan 27, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@freemandealer
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32396 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit f444546bd2c2496ac2064aefd489c8430c4e6111, data reload: false

------ Round 1 ----------------------------------
q1	17600	5502	5410	5410
q2	2056	320	165	165
q3	10509	1274	741	741
q4	10223	992	537	537
q5	7623	2428	2148	2148
q6	188	167	135	135
q7	931	751	596	596
q8	9224	1370	1183	1183
q9	5208	5029	4955	4955
q10	6859	2333	1897	1897
q11	483	272	262	262
q12	350	359	217	217
q13	17765	3709	3071	3071
q14	233	224	228	224
q15	524	478	454	454
q16	645	626	582	582
q17	587	883	336	336
q18	7119	6452	6535	6452
q19	1967	968	543	543
q20	316	327	190	190
q21	3120	2195	1982	1982
q22	365	344	316	316
Total cold run time: 103895 ms
Total hot run time: 32396 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5575	5498	5500	5498
q2	234	329	229	229
q3	2283	2673	2318	2318
q4	1436	1846	1395	1395
q5	4349	4792	4760	4760
q6	173	164	128	128
q7	2077	1956	1841	1841
q8	2631	2873	2738	2738
q9	7318	7178	7296	7178
q10	3020	3293	2769	2769
q11	572	516	502	502
q12	661	730	593	593
q13	3581	3983	3237	3237
q14	281	301	277	277
q15	520	473	466	466
q16	638	673	644	644
q17	1261	1782	1281	1281
q18	7666	7511	7308	7308
q19	834	1195	1076	1076
q20	2078	2062	1917	1917
q21	5763	5392	5028	5028
q22	640	601	567	567
Total cold run time: 53591 ms
Total hot run time: 51750 ms

@doris-robot

TPC-DS: Total hot run time: 192068 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f444546bd2c2496ac2064aefd489c8430c4e6111, data reload: false

query1	1315	956	947	947
query2	6209	2019	2033	2019
query3	10953	4621	4430	4430
query4	61065	29023	23377	23377
query5	5467	619	438	438
query6	424	197	202	197
query7	5576	514	310	310
query8	329	258	217	217
query9	8301	2667	2655	2655
query10	457	300	251	251
query11	17654	15170	15572	15170
query12	158	103	109	103
query13	1432	544	410	410
query14	11520	7020	7079	7020
query15	221	208	186	186
query16	7389	638	469	469
query17	1120	754	584	584
query18	1984	414	326	326
query19	201	183	159	159
query20	117	120	118	118
query21	220	126	110	110
query22	4330	4880	4431	4431
query23	33666	33475	33521	33475
query24	5793	2348	2351	2348
query25	470	482	395	395
query26	637	271	151	151
query27	1704	483	333	333
query28	4056	2536	2500	2500
query29	529	565	421	421
query30	214	190	154	154
query31	921	895	815	815
query32	118	57	58	57
query33	425	359	329	329
query34	875	872	510	510
query35	828	846	738	738
query36	1022	1015	961	961
query37	118	99	78	78
query38	4243	4357	4345	4345
query39	1502	1463	1466	1463
query40	218	119	106	106
query41	51	64	47	47
query42	120	110	99	99
query43	526	541	497	497
query44	1352	843	827	827
query45	182	175	164	164
query46	879	1058	659	659
query47	1968	1922	1879	1879
query48	407	403	322	322
query49	723	493	387	387
query50	655	687	399	399
query51	4226	4285	4293	4285
query52	106	102	98	98
query53	242	250	189	189
query54	498	505	420	420
query55	82	84	75	75
query56	256	267	253	253
query57	1214	1217	1169	1169
query58	238	236	237	236
query59	3079	3244	3010	3010
query60	284	286	251	251
query61	113	114	111	111
query62	738	717	628	628
query63	219	187	185	185
query64	1290	1024	655	655
query65	3265	3170	3128	3128
query66	772	387	312	312
query67	15830	15629	15589	15589
query68	2949	788	574	574
query69	472	289	250	250
query70	1179	1158	1124	1124
query71	417	280	252	252
query72	6148	4096	3808	3808
query73	656	770	359	359
query74	10029	8958	8738	8738
query75	3211	3138	2672	2672
query76	2958	1169	759	759
query77	469	347	267	267
query78	10045	9918	9377	9377
query79	3096	806	602	602
query80	1696	523	490	490
query81	546	273	236	236
query82	356	158	128	128
query83	268	167	150	150
query84	295	96	79	79
query85	763	373	303	303
query86	458	318	296	296
query87	4510	4473	4353	4353
query88	4780	2185	2144	2144
query89	395	320	293	293
query90	1616	190	183	183
query91	130	138	105	105
query92	62	58	52	52
query93	2818	871	531	531
query94	753	404	261	261
query95	329	261	256	256
query96	498	620	282	282
query97	2799	2880	2713	2713
query98	215	204	190	190
query99	1285	1366	1261	1261
Total cold run time: 309297 ms
Total hot run time: 192068 ms

@doris-robot

ClickBench: Total hot run time: 30.08 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f444546bd2c2496ac2064aefd489c8430c4e6111, data reload: false

query1	0.03	0.03	0.03
query2	0.08	0.03	0.03
query3	0.23	0.08	0.07
query4	1.61	0.10	0.10
query5	0.41	0.41	0.39
query6	1.17	0.66	0.65
query7	0.02	0.02	0.01
query8	0.04	0.03	0.03
query9	0.60	0.49	0.50
query10	0.56	0.56	0.54
query11	0.14	0.10	0.11
query12	0.14	0.11	0.10
query13	0.60	0.60	0.60
query14	2.71	2.73	2.83
query15	0.90	0.84	0.82
query16	0.39	0.36	0.39
query17	0.97	1.04	1.06
query18	0.24	0.21	0.21
query19	1.97	1.76	1.98
query20	0.02	0.01	0.02
query21	15.37	0.93	0.59
query22	0.76	0.70	0.67
query23	15.35	1.43	0.54
query24	3.26	2.19	0.66
query25	0.16	0.05	0.20
query26	0.34	0.14	0.14
query27	0.06	0.06	0.04
query28	13.70	0.96	0.43
query29	12.55	3.90	3.24
query30	0.26	0.08	0.06
query31	2.82	0.58	0.38
query32	3.24	0.55	0.47
query33	2.94	3.02	3.01
query34	16.62	5.14	4.44
query35	4.54	4.51	4.62
query36	0.65	0.49	0.48
query37	0.10	0.07	0.06
query38	0.05	0.04	0.03
query39	0.04	0.03	0.02
query40	0.17	0.13	0.12
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.04	0.04	0.03
Total cold run time: 105.97 s
Total hot run time: 30.08 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 42.07% (10972/26083)
Line Coverage: 32.33% (92733/286812)
Region Coverage: 31.49% (47563/151027)
Branch Coverage: 27.53% (24084/87488)
Coverage Report: http://coverage.selectdb-in.cc/coverage/f444546bd2c2496ac2064aefd489c8430c4e6111_f444546bd2c2496ac2064aefd489c8430c4e6111/report/index.html

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jan 29, 2025
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

@@ -59,6 +59,10 @@ FileBlock::State FileBlock::state() const {
    return _download_state;
}

FileBlock::State FileBlock::state_unsafe() const {
Contributor

@gavinchou gavinchou Jan 30, 2025

this only works on x86
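
One plausible reading of this comment is that `state_unsafe()` reads the state field without taking the block lock, which is a data race under the C++ memory model and relies on the strong ordering of x86 loads to behave as expected. A portable alternative, sketched below under that assumption, stores the state in a `std::atomic` so the lock-free read is well defined on any architecture. `FileBlockSketch` and `State` are hypothetical stand-ins and do not reproduce the real `FileBlock` class.

```cpp
#include <atomic>
#include <mutex>

// Hypothetical stand-in for the block state enum.
enum class State : int { EMPTY, DOWNLOADING, DOWNLOADED };

class FileBlockSketch {
public:
    // Locked read: serializes with writers that hold the block lock.
    State state() const {
        std::lock_guard<std::mutex> guard(_mutex);
        return _state.load(std::memory_order_relaxed);
    }

    // Lock-free read: an acquire load pairs with the release store in
    // set_state(), so it is defined behavior on weakly ordered CPUs too.
    State state_unsafe() const { return _state.load(std::memory_order_acquire); }

    void set_state(State s) {
        std::lock_guard<std::mutex> guard(_mutex);
        _state.store(s, std::memory_order_release);
    }

private:
    mutable std::mutex _mutex;
    std::atomic<State> _state{State::EMPTY};
};
```

On x86 an acquire load compiles to an ordinary load, so this variant should cost about the same as the unsynchronized read there while remaining correct elsewhere.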

Comment on lines +1063 to +1064
DEFINE_mInt64(cache_lock_wait_long_tail_threshold_us, "30000000");
DEFINE_mInt64(cache_lock_held_long_tail_threshold_us, "30000000");
Contributor

30000000 us is 30 seconds, which seems too long...
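
The unit conversion behind this comment can be checked at compile time. The sketch below copies the default value from the diff, but `kThresholdUs`, `us_to_seconds`, and `is_long_tail` are illustrative names, and the strict `>` comparison is an assumption about how such a threshold would be applied, not something taken from the PR.

```cpp
#include <chrono>
#include <cstdint>

// Default value from the diff; the constant name is illustrative.
constexpr int64_t kThresholdUs = 30000000;

// Converts a microsecond count to whole seconds (truncating).
constexpr int64_t us_to_seconds(int64_t us) {
    return std::chrono::duration_cast<std::chrono::seconds>(
                   std::chrono::microseconds(us))
            .count();
}

static_assert(us_to_seconds(kThresholdUs) == 30,
              "30000000 us should equal 30 s");

// Assumed policy: a wait or hold counts as long-tail only when it
// strictly exceeds the threshold.
constexpr bool is_long_tail(int64_t elapsed_us, int64_t threshold_us) {
    return elapsed_us > threshold_us;
}
```

With the default in the diff, a lock held for half a second would not be flagged, which illustrates the reviewer's concern that the threshold may be too coarse to catch realistic stalls.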

Labels
approved Indicates a PR has been approved by one committer. dev/3.0.x p0_c reviewed

5 participants