<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta name="generator" content="Hugo 0.88.1" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,700" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/8.4/styles/github.min.css">
<link rel="stylesheet" href="css/custom.css">
<link rel="stylesheet" href="css/normalize.css">
<title>FunAudioLLM Homepage</title>
<link href="css/bootstrap.min.css" rel="stylesheet">
</head>
<body>
<div class="container" >
<header role="banner">
</header>
<main role="main">
<article itemscope itemtype="https://schema.org/BlogPosting">
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<div class="text-center">
<h2>FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs </h2>
[<a href="https://arxiv.org/abs/2407.04051">Paper</a>]
[<a href="https://github.com/FunAudioLLM">Code</a>]
[Modelscope: <a href="https://www.modelscope.cn/studios/iic/SenseVoice">SenseVoice</a> <a href="https://www.modelscope.cn/studios/iic/CosyVoice-300M">CosyVoice</a>]
[HuggingFace: <a href="https://huggingface.co/FunAudioLLM/SenseVoiceSmall">SenseVoice</a> CosyVoice]
<p class="fst-italic mb-0">
<br>
Tongyi SpeechTeam
</p>
<p><b>Alibaba Group</b></p>
</div>
<p><b>Abstract:</b>
<!-- We introduce FunAudioLLM, ....-->
This report introduces <b>FunAudioLLM</b>, a framework designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two models: <b>SenseVoice</b> for high-precision multilingual speech recognition, emotion recognition, and audio event detection; and <b>CosyVoice</b> for natural speech generation with control over language, timbre, and emotion. SenseVoice delivers exceptionally low latency and supports over 50 languages, while CosyVoice excels in multilingual voice generation, zero-shot voice generation, cross-lingual voice cloning, and instruction following. The SenseVoice and CosyVoice models have been open-sourced on Modelscope and Huggingface, and the corresponding training, inference, and fine-tuning code has been released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology.
</p>
<p style="text-align: center;">
<img src="pics/highlight.png" height="480" width="1280">
</p>
<!--
<li> SenseVoice supports high-precision multilingual ASR, excels over Whisper in Chinese and Cantonese recognition, and has SOTA emotion recognition capabilities. </li>
<li> SenseVoice-small has extremely high inference efficiency, being 7 times faster than Whisper-small and 17 times faster than Whisper-large. </li>
<li> The CosyVoice model, pre-trained on 150,000+ hours of meticulous multi-lingual, multi-speaker data, pioneers as the first industry-grade, open-source speech synthesis tool. </li>
<li> CosyVoice offers multi-language support for English, Mandarin, Japanese, Korean, and Cantonese, enabling zero-shot voice generation, and includes seven speakers fine-tuned for immediate use. </li>
<li> CosyVoice-instruct also offers emotional expression, event control, and detailed adjustments like speed and pitch, and it ensures versatile and nuanced speech generation.</li>
<li> FunAudioLLM has developed a variety of application demos, including simultaneous interpretation, emotional voice interaction, interactive podcasts, and multi-character audiobook narration.</li>
-->
<p>
<b>Contents</b>
<ul>
<li><a href="#App-S2ST">Applications</a>
<ul>
<li><a href="#App-S2ST">Speech-to-Speech Translation</a></li>
<li><a href="#App-E-chat">Emotional VoiceChat</a></li>
<li><a href="#App-agentic">Interactive Podcast</a></li>
<li><a href="#App-audiobook">Expressive Audiobook </a></li>
</ul>
</li>
<li><a href="#CosyVoice-overview">Voice Generation Model: CosyVoice</a>
<ul>
<li><a href="#CosyVoice-overview">Overview of CosyVoice</a></li>
<li><a href="#CosyVoice-basic">Multi-lingual Voice Generation</a></li>
<li><a href="#CosyVoice-zeroshot">Zero-shot In-context Generation</a></li>
<li><a href="#CosyVoice-instructed">Instructed Voice Generation</a></li>
<li><a href="#CosyVoice-emotion">Emotionally Expressive Voice Generation</a></li>
<li><a href="#CosyVoice-sft">Speaker Fine-tune</a></li>
<li><a href="#CosyVoice-intep">Speaker Interpolation</a></li>
<li><a href="#Demos">Demo</a></li>
</ul>
</li>
<li><a href="#SenseVoice-overview">Voice Understanding Model: SenseVoice</a>
<ul>
<li><a href="#SenseVoice-overview">Overview of SenseVoice </a></li>
<li><a href="#SenseVoice-ASR"> Multilingual Speech Recognition</a></li>
<li><a href="#SenseVoice-emotion">Speech Emotion Recognition</a></li>
<li><a href="#SenseVoice-AED">Audio Event Detection</a></li>
<li><a href="#SenseVoice-demo">Rich Transcribe Demo Samples</a></li>
</ul>
</li>
</ul>
</p>
</div>
<!------ Application -------------------------->
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="App-S2ST" style="text-align: center;">Speech-to-Speech Translation</h2>
<p>By integrating SenseVoice, LLMs, and CosyVoice, we can perform speech-to-speech translation (S2ST). In the table below, the transcript of each original recording is shown in bold; the other cells in the same row are its translations.
</p>
<p style="text-align: center;">
<img src="pics/S2ST.png" height="368" width="1600">
</p>
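<p>The three-stage flow described above (recognition, translation, synthesis) can be sketched as a small composition of callables. This is only an illustrative sketch: <code>S2STPipeline</code> and its three fields are hypothetical names, not part of the SenseVoice or CosyVoice APIs, and passing the original recording as the voice prompt is an assumption made here to illustrate cross-lingual voice cloning.</p>

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class S2STPipeline:
    """Chains three stages; each callable is a hypothetical stand-in for a real model."""

    recognize: Callable[[bytes], str]          # speech -> source-language text (SenseVoice's role)
    translate: Callable[[str, str], str]       # (text, target language) -> translated text (LLM's role)
    synthesize: Callable[[str, bytes], bytes]  # (text, voice prompt) -> speech (CosyVoice's role)

    def run(self, audio: bytes, target_lang: str) -> Tuple[str, bytes]:
        source_text = self.recognize(audio)
        target_text = self.translate(source_text, target_lang)
        # Assumption: the original recording doubles as the voice prompt,
        # so the translated speech keeps the speaker's timbre.
        return target_text, self.synthesize(target_text, audio)


# Usage with trivial stubs in place of the three models.
pipeline = S2STPipeline(
    recognize=lambda audio: audio.decode("utf-8"),
    translate=lambda text, lang: f"[{lang}] {text}",
    synthesize=lambda text, prompt: text.encode("utf-8"),
)
translated_text, translated_speech = pipeline.run(b"hello", "ja")
```

<p>Keeping each stage behind a plain callable means any one model can be swapped out without touching the other two.</p>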
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">ZH </th>
<th style="vertical-align : middle;text-align: center">EN </th>
<th style="vertical-align : middle;text-align: center">JP </th>
<th style="vertical-align : middle;text-align: center">Yue </th>
<th style="vertical-align : middle;text-align: center">KO </th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/zh/zh_prompt.wav" autoplay/>Your browser does not support the audio element.</audio><br><b>对,所以说你现在的话,这个账单的话,你既然说能处理,那你就想办法处理掉。</b></td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/zh/zh_en.wav" autoplay/>Your browser does not support the audio element.</audio><br> Yes, that's why I'm saying, regarding the bill you're currently discussing, if you say you can handle it, then find a way to take care of it. </td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/zh/zh_jp.wav" autoplay/>Your browser does not support the audio element.</audio><br>そう、だから今あなたが言っていること、この請求書について、あなたが処理できると言ったのなら、何とかして処理してください。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/zh/zh_yue.wav" autoplay/>Your browser does not support the audio element.</audio><br>对,所以话你而家讲嘅,呢张账单嘅话,你既然话得掂,噉你就要想办法搞掂佢。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/zh/zh_ko.wav" autoplay/>Your browser does not support the audio element.</audio><br>맞아, 그래서 네가 지금 말하는 것, 이 계산서에 대해서, 네가 처리할 수 있다고 했다면, 그렇다면 방법을 찾아서 처리해야 해.</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/en/en_zh.wav" autoplay/>Your browser does not support the audio element.</audio><br>在那之后,完全收购那家公司。因此,保持管理层的一致性,利益与即将加入家族的资产保持一致,这就是我们有时不买下全部的原因。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/en/en_prompt.wav" autoplay/>Your browser does not support the audio element.</audio><br><b>And then later on, fully acquiring that company. So keeping management in line, interest in line with the asset that's coming into the family is a reason why sometimes we don't buy the whole thing.</b></td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/en/en_jp.wav" autoplay/>Your browser does not support the audio element.</audio><br>その後、その会社を完全に買収する。だから、経営陣を一列に並べ、家族に入る資産との利益を一致させることが、私たちが全てを買わない理由の一つです。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/en/en_yue.wav" autoplay/>Your browser does not support the audio element.</audio><br>之后,就完全收购咗嗰间公司。所以,保持管理层同兴趣与即将加入家族嘅资产保持一致,系我们有时唔买晒全部嘅原因。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/en/en_ko.wav" autoplay/>Your browser does not support the audio element.</audio><br>그리고 나중에, 그 회사를 완전히 인수하게 됩니다. 그래서 경영진을 일치시키고, 가족에 들어오는 자산과의 이익을 일치시키는 것이 우리가 가끔 전부를 사지 않는 이유입니다.</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/jp/jp_zh.wav" autoplay/>Your browser does not support the audio element.</audio><br>只是雨滴有什么麻烦的?这还没有打雷呢!</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/jp/jp_en.wav" autoplay/>Your browser does not support the audio element.</audio><br>Rainfall alone does not constitute a storm. Thunder is required.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/jp/jp_prompt.wav" autoplay/>Your browser does not support the audio element.</audio><br><b>雨なんて大したことありません。まだ雷も鳴っていないのですから。</b></td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/jp/jp_yue.wav" autoplay/>Your browser does not support the audio element.</audio><br>只系雨滴有咩麻烦嘅?连打雷都未有呢!</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/jp/jp_ko.wav" autoplay/>Your browser does not support the audio element.</audio><br>고작 빗방울로 호들갑은, 아직 번개도 치지 않았는데.</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/yue/yue_zh.wav" autoplay/>Your browser does not support the audio element.</audio><br>早点回家吧,到家了如果可以的话给我回个信息。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/yue/yue_en.wav" autoplay/>Your browser does not support the audio element.</audio><br>Come home early, and if possible, let me know when you're back.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/yue/yue_jp.wav" autoplay/>Your browser does not support the audio element.</audio><br>早く家に帰って、家に着いたら可能なら私にメッセージを返してください。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/yue/yue_prompt.wav" autoplay/>Your browser does not support the audio element.</audio><br><b>早啲翻屋企吧,到屋企可以嘅话俾我回个信息。</b></td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/yue/yue_ko.wav" autoplay/>Your browser does not support the audio element.</audio><br>일찍 집에 돌아오세요. 집에 도착하면 가능하다면 저에게 연락해 주세요.</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/ko/ko_zh.wav" autoplay/>Your browser does not support the audio element.</audio><br>虽然这种时候已经错过了云先生的戏,但夜晚的璃月仍有许多好去处。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/ko/ko_en.wav" autoplay/>Your browser does not support the audio element.</audio><br>By this time of night, we will have already missed Yun's play. But LeeYew is full of interesting places to explore, even at night.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/ko/ko_jp.wav" autoplay/>Your browser does not support the audio element.</audio><br>もう雲殿の芝居に間に合わないが、夜の璃月はまだまだいいところがたくさんあるぞ。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/ko/ko_yue.wav" autoplay/>Your browser does not support the audio element.</audio><br>虽然呢个时候已经错过咗云先生嘅戏,但夜晚嘅璃月仲有好多好去处。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/s2st/ko/ko_prompt.wav" autoplay/>Your browser does not support the audio element.</audio><br><b>어디가서 눈을 피하지. 난 괜찮은데, 넌 감기 걸릴지도 모르니.</b></td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="App-E-chat" style="text-align: center;">Emotional VoiceChat</h2>
<p>By integrating SenseVoice, LLMs, and CosyVoice, we can develop an Emotional Voice Chat application. In the following example, both the User and Assistant content are synthesized by CosyVoice.</p>
<p style="text-align: center;">
<img src="pics/EmotionalVoiceChat.png" height="368" width="1600">
</p>
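<p>In the transcript below, user turns may carry tags such as <code>&lt;|HAPPY|&gt;</code> or <code>&lt;|Laughter|&gt;</code>, and each assistant turn pairs a style line (生成风格) with the content to be spoken (播报内容). The following sketch shows one way such turns could be parsed. The function names and regular expressions are hypothetical and inferred only from the example dialogue on this page; they are not the project's actual parsing code.</p>

```python
import re
from typing import List, Tuple

# Matches tags of the form <|NAME|>, as seen in the example user turns.
TAG_PATTERN = re.compile(r"<\|([A-Za-z_]+)\|>")

# Matches "生成风格: <style> 播报内容: <content>", as seen in the example assistant turns.
REPLY_PATTERN = re.compile(
    r"生成风格:\s*(?P<style>.*?)\s*播报内容:\s*(?P<content>.*)", re.DOTALL
)


def split_tags(transcript: str) -> Tuple[str, List[str]]:
    """Return the plain text and the list of tag names found in a transcript."""
    tags = TAG_PATTERN.findall(transcript)
    text = TAG_PATTERN.sub("", transcript).strip()
    return text, tags


def parse_reply(reply: str) -> Tuple[str, str]:
    """Split an assistant reply into (style instruction, content to speak)."""
    match = REPLY_PATTERN.search(reply)
    if match is None:
        # No style line found: treat the whole reply as content.
        return "", reply.strip()
    return match.group("style").strip(), match.group("content").strip()


user_text, user_tags = split_tags("<|HAPPY|>今天特别开心")
style, content = parse_reply(
    "生成风格: high pitch, fast speaking rate, and happy emotion.\n播报内容: 哎哟"
)
```

<p>The tags could then be passed to the LLM as context, and the style string to the speech generator as an instruction.</p>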
<div class="table-responsive pt-3">
<table class="table table-hover pt-2" style="width: 100%; border-collapse: collapse;">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">Audio </th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;">
<audio controls="controls" style="width: 480px;"><source src="audios/app/Emotional_Voice_Chat.wav" autoplay/>Your browser does not support the audio element.</audio>
</td>
</tr>
</tbody>
<thead>
<!--<tr>
<th style="vertical-align : middle;text-align: center">Chat Transcription </th>
</tr>-->
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align: left; padding: 5px;">
<p style="text-align: left; color: #000; background-color: #fbf2d1; padding-right: 300px; margin: 2px;"><strong>User</strong>: 早上好<br></p>
<p style="background-color: #F2F2F2; padding-left: 400px; margin: 5px;"><strong>Assistant</strong>: 生成风格: warm tone, slow speaking rate, and friendly emotion.<br>播报内容: 早上好呀,今天的精神状态如何呀?<br></p>
<p style="text-align: left; color: #000; background-color: #fbf2d1; padding-right: 300px; margin: 2px;"><strong>User</strong>: <|HAPPY|>今天特别开心<br></p>
<p style="background-color: #F2F2F2; padding-left: 400px; margin: 5px;"><strong>Assistant</strong>: 生成风格: high pitch, fast speaking rate, and happy emotion.<br>播报内容: [breath]哎哟,听到你这么开心[breath]我也很高兴呢,发生什么好事啦?<br></p>
<p style="text-align: left; color: #000; background-color: #fbf2d1; padding-right: 300px; margin: 2px;"><strong>User</strong>: 我们要发布一个新的语音大模型啦<|Laughter|><br></p>
<p style="background-color: #F2F2F2; padding-left: 400px; margin: 5px;"><strong>Assistant</strong>: 生成风格: normal tone, medium speaking rate, with a hint of amusement.<br>播报内容: [laughter]太棒了,听说你们要发布新的语音大模型,真是个令人兴奋的消息啊!<br></p>
<p style="text-align: left; color: #000; background-color: #fbf2d1; padding-right: 300px; margin: 2px;"><strong>User</strong>: 悲伤的说一句光头强台词吧<br></p>
<p style="background-color: #F2F2F2; padding-left: 400px; margin: 5px;"><strong>Assistant</strong>: 生成风格: low pitch, slow speaking rate, and sad emotion.<br>播报内容: "唉---看来这次又要空手而归了---"<br></p>
<p style="text-align: left; color: #000; background-color: #fbf2d1; padding-right: 300px; margin: 2px;"><strong>User</strong>: 真棒<br></p>
<p style="background-color: #F2F2F2; padding-left: 400px; margin: 5px;"><strong>Assistant</strong>: 生成风格: positive tone, moderate speaking rate, and encouraging emotion.<br>播报内容: 那真是太棒啦,继续保持这种势头哦!<br></p>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="App-agentic" style="text-align: center;">Interactive Podcast</h2>
<p>By integrating SenseVoice, an LLM-based multi-agent system with real-time world knowledge, and CosyVoice, we can create an interactive podcast.</p>
<p style="text-align: center;">
<img src="pics/InteractivePodcast.png" height="368" width="1600">
</p>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2" style="width: 100%; border-collapse: collapse;">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">Audio </th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 480px;"><source src="audios/app/broadcaster.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
<thead>
<!--<tr>
<th style="vertical-align : middle;text-align: center">Chat Transcription </th>
</tr>-->
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align: left; padding: 5px;">
<div style="background-color: #F2F2F2;">
<p style="text-align: left; color: #000; padding-left: 400px; margin: 2px;"><strong>Assistant_小夏</strong>: 聆听未来AI主播带你走在知识前沿。大家好,欢迎来到我们的播客节目,我是主播小夏。<br></p>
<p style="text-align: left; color: #000; padding-left: 400px; margin: 2px;"><strong>Assistant_大东</strong>: Hello, 大家好,我是你们的搞笑担当大东。今天我们要聊的话题非常劲爆,那就是五月天演唱会假唱风波。<br></p>
<p style="text-align: left; color: #000; padding-left: 400px; margin: 2px;"><strong>Assistant_小夏</strong>: 是啊,最近这个话题可是引发了不小的争议。首先我们来看看网友爆料的一些证据和分析。有网友上传了五月天演唱会上的一些视频片段,从这些视频中可以看出,主唱阿信在演唱时明显是在对口型,甚至有些地方的口型都不太对。<br></p>
<p style="text-align: left; color: #000; padding-left: 400px; margin: 2px;"><strong>Assistant_大东</strong>: 哈哈,我看了那个视频确实有点尴尬,我还记得有个镜头是阿信把麦克风放下来。<br></p>
</div>
<p style="background-color: #fbf2d1; margin: 5px;"><strong>User</strong>: 但这也太没有职业道德了吧,他。<br></p>
<p style="text-align: left; color: #000; background-color: #F2F2F2; margin: 2px; padding-left: 400px;"><strong>Assistant_大东</strong>: 职业道德这个问题确实值得讨论。五月天一直以来都是以现场实力著称,乐队如果真有假唱,那粉丝们可能会感到失望,但也有时候考虑到音乐质量或者艺人身体状况,可能会采取一些措施,不过粉丝们通常更喜欢看到真实的现场表演,不是吗?你觉得呢?<br></p>
<p style="background-color: #fbf2d1; margin: 5px;"><strong>User</strong>: 没什么意思,换一个。<br></p>
<div style="background-color: #F2F2F2; margin: 2px;">
<p style="text-align: left; color: #000; padding-left: 400px; margin: 2px;"><strong>Assistant_大东</strong>: 接下来我们转入下一个话题,即将讨论的内容非常热门,那就是泰国队在36强赛对阵中国的海报,真是让人充满期待又感到紧张。<br></p>
<p style="text-align: left; color: #000; padding-left: 400px; margin: 2px;"><strong>Assistant_小夏</strong>: 没错,我看到那海报的时候内心五味杂陈。我们国足上一次大比分失利于泰国队,那1至5的比分简直是噩梦。不过这次泰国队用这个当宣传,我觉得咱们国足的小伙子们可得打起精神来,不能再让历史重演了。<br></p>
<p style="text-align: left; color: #000; padding-left: 400px; margin: 2px;"><strong>Assistant_大东</strong>: 对,这个海报上还贴出了2013年的热身赛和2019年中国杯赛事中泰国队战胜国足的情况,泰国队可真不客气。</p>
</div>
</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="App-audiobook" style="text-align: center;">Expressive Audiobook </h2>
<p>An LLM first analyzes the book, structuring the text and identifying the emotions it contains; CosyVoice then synthesizes the result, producing an audiobook with enhanced expressiveness.</p>
<p style="text-align: center;">
<img src="pics/AudioBook.png" height="368" width="1600">
</p>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">Text </th>
<th style="vertical-align : middle;text-align: center">Audio </th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align: left">神雕侠侣,第九回,百计避敌。<br>
- 陆无双正自惶急,听她忽问傻话,怒道:“傻蛋!又胡说甚么?”<br>
- 杨过笑道:“咱们来玩拜天地成亲,你扮新娘子好不好?那才叫美呢!脸上披了红布,别人说什么也瞧你不见。”<br>
- 陆无双一怔,道:“你教我扮新娘子躲过师父?”<br>
- 杨过嘻嘻笑道:“我不知道,你扮新娘子,我就扮新官人。”</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 380px;"><source src="audios/app/audiobook/ghy_eric_qiyue.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
</div>
</div>
<!------ CosyVoice -------------------------->
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="CosyVoice-overview" style="text-align: center;">Overview of CosyVoice</h2>
<p style="text-align: center;">
<img src="pics/overview-cosyvoice.png" height="400" width="1024">
</p>
<p style="text-align: center;" >
<b>Figure 1.</b> An overview of the CosyVoice models at the inference stage.
CosyVoice consists of three components: an autoregressive transformer that generates speech tokens
for the input text; an ODE-based diffusion model (flow matching)
that reconstructs the Mel spectrogram from the generated speech tokens;
and a HiFTNet-based vocoder that synthesizes the waveform.
Dashed modules are optional in certain usage modes, such as cross-lingual and SFT inference.
[<a href="https://arxiv.org/abs/2407.05407">Paper</a>]
</p>
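<p>To illustrate what "ODE-based" means for the flow matching stage, the toy sketch below integrates an ODE dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler updates. It is not CosyVoice's implementation: <code>euler_sample</code> and <code>toy_velocity</code> are hypothetical names, and the hand-written velocity field stands in for a trained network that would predict the velocity from the speech tokens.</p>

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector], x0: Vector, steps: int = 10
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Velocity of the straight-line path toward `target` (stand-in for a trained model)."""

    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


# Start from a fixed "noise" point and follow the flow toward a toy 2-D target.
sample = euler_sample(toy_velocity([1.0, -2.0]), x0=[0.0, 0.0], steps=10)
```

<p>In a real system the start point would be random noise and the result would be a Mel spectrogram frame sequence rather than a two-element list.</p>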
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="CosyVoice-basic" style="text-align: center;">Multi-lingual Voice Generation</h2>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">Language </th>
<th style="vertical-align : middle;text-align: center">Speaker </th>
<th style="vertical-align : middle;text-align: center">Text </th>
<th style="vertical-align : middle;text-align: center">Audio </th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="2">ZH</td>
<td style="vertical-align : middle;text-align:center;">Female</td>
<td style="vertical-align : middle;text-align:center;">我是通义实验室语音团队全新推出的生成式语音大模型,提供舒适自然的语音合成能力。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/sft_generation/ZH/generated/female.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Male</td>
<td style="vertical-align : middle;text-align:center;">我是通义实验室语音团队全新推出的生成式语音大模型,提供舒适自然的语音合成能力。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/sft_generation/ZH/generated/male.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="2">EN</td>
<td style="vertical-align : middle;text-align:center;">Female</td>
<td style="vertical-align : middle;text-align:center;">I am the latest generative text to speech model launched by the Tongyi speech team, offering comfortable and natural speech synthesis capabilities.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/sft_generation/EN/generated/female.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Male</td>
<td style="vertical-align : middle;text-align:center;">I am the latest generative text to speech model launched by the Tongyi speech team, offering comfortable and natural speech synthesis capabilities.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/sft_generation/EN/generated/male.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="1">JP</td>
<td style="vertical-align : middle;text-align:center;">Male</td>
<td style="vertical-align : middle;text-align:center;">私は通義ラボ音声チームによって新たにリリースされた生成型音声大規模モデルで、快適で自然な音声合成能力を提供します。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/sft_generation/JP/generated/male.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="1">Yue</td>
<td style="vertical-align : middle;text-align:center;">Female</td>
<td style="vertical-align : middle;text-align:center;">我是通义实验室语音团队全新推出的生成式语音大模型,提供舒适自然的语音合成能力。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/sft_generation/Yue/generated/female.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="1">KO</td>
<td style="vertical-align : middle;text-align:center;">Female</td>
<td style="vertical-align : middle;text-align:center;">저는 통의 연구소 음성 팀이 새롭게 공개한 생성형 음성 모델이며, 부드럽고 자연스러운 음성 합성 경험을 제공하도록 설계되었습니다.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/sft_generation/KO/generated/female.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="CosyVoice-zeroshot" style="text-align: center;">Zero-shot In-context Generation</h2>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">Language </th>
<th style="vertical-align : middle;text-align: center">Prompt </th>
<th style="vertical-align : middle;text-align: center">Generated 1</th>
<th style="vertical-align : middle;text-align: center">Generated 2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="4">ZH</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/gghy.wav" autoplay/>Your browser does not support the audio element.</audio><br>随着大军缓缓前进,他忍不住琢磨起了回京之后会被派到什么艰苦的地方。顶缸。要知道皇帝一向就是这么干的,几乎没让他过过什么安生日子。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/gghy-01.wav" autoplay/>Your browser does not support the audio element.</audio><br>想着即将到来的未知与挑战,他的心中不禁泛起一丝苦涩,但也很快被坚毅取代。这次,无论是荒凉的边陲小镇,还是险象环生的前线阵地,他都已做好准备,迎接新的使命。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/gghy-02.wav" autoplay/>Your browser does not support the audio element.</audio><br>大军的步伐愈发坚定,每一步都踏出了他们对未来的期许与决心。而他,作为这铁血军团的一员,更是心怀壮志,期待着用自己的双手,再次书写一段传奇,证明给所有人看。</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/210024_happy.wav" autoplay/>Your browser does not support the audio element.</audio><br>希望你以后能够做的比我还好呦。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/210024_happy-utt0002_happy.wav" autoplay/>Your browser does not support the audio element.</audio><br>收到好友从远方寄来的生日礼物,那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐,笑容如花儿般绽放。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/210024_happy-utt0003_happy.wav" autoplay/>Your browser does not support the audio element.</audio><br>漫步在金秋的枫林中,阳光透过斑斓的叶片洒在身上,清风拂面,我陶醉在这宁静而美好的时光里,快乐得几乎要翩翩起舞。</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/230007_sad.wav" autoplay/>Your browser does not support the audio element.</audio><br>我的故事,过程很美,而结局却满是悲伤。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/230007_sad-utt0004_sad.wav" autoplay/>Your browser does not support the audio element.</audio><br>望着空荡荡的房间,昔日共度的美好时光历历在目,如今却物是人非,泪水止不住地滑落,心如刀割。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/230007_sad-utt0005_sad.wav" autoplay/>Your browser does not support the audio element.</audio><br>得知挚爱的宠物因病离世,我沉浸在无尽的哀伤中,那熟悉的身影、温顺的眼神仿佛还在眼前,令人心碎不已。</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/220011_angry.wav" autoplay/>Your browser does not support the audio element.</audio><br>多少年了,还没有人敢这样对我拍桌子瞪眼睛。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/220011_angry-utt0008_angry.wav" autoplay/>Your browser does not support the audio element.</audio><br>目睹不法分子公然破坏公共设施,无视社会规则,我怒不可遏,心中充满了对这种恶劣行径的强烈谴责与愤慨。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/220011_angry-utt0009_angry.wav" autoplay/>Your browser does not support the audio element.</audio><br>得知商家以次充好,欺诈消费者,我怒火中烧,对于这种丧失诚信、侵害消费者权益的行为感到极度愤恨,誓要讨回公道。</td>
</tr>
<!--
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/260022_hate.wav" autoplay/>Your browser does not support the audio element.</audio><br>我无法忍受他这种粗枝大叶的作风。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/260022_hate-utt0013_hate.wav" autoplay/>Your browser does not support the audio element.</audio><br>看到餐盘中蠕动的蛆虫,我胃中翻江倒海,那种恶心的感觉瞬间席卷全身,对这种肮脏景象的厌恶之情无以言表。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/zh/260022_hate-utt0014_hate.wav" autoplay/>Your browser does not support the audio element.</audio><br>闻到邻居家中飘出的刺鼻烟味,我眉头紧皱,喉咙发痒,对这种严重干扰他人生活的不良习惯感到极度厌恶。</td>
</tr>
-->
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="2">EN</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/en/Musk_Time19.41.wav" autoplay/>Your browser does not support the audio element.</audio><br>I think people online have actually assembled videos showing every launch and it just gets like crazy fast as you get to twenty twenty three. So yeah, so we've done a nineteen three flight. We're now qualifying Falcon nine to be able to do forty flights.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/en/utt047-musk-utt038-musk.wav" autoplay/>Your browser does not support the audio element.</audio><br>In the quest for sustainable energy, Tesla leads the charge; every electric vehicle on the road is an emissary saluting clearer skies, collectively weaving the tapestry of our planet's verdant future.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/en/utt047-musk-utt039-musk.wav" autoplay/>Your browser does not support the audio element.</audio><br>From space exploration to subterranean tunnels, from AI to the neurotechnology revolution, my pursuit transcends mere technological frontiers; it's about carving out unprecedented realms of existence and progress for mankind.</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/en/TaylorSwift_Time16.92.wav" autoplay/>Your browser does not support the audio element.</audio><br> I'm so happy I got to do this. I really wanted to work with Tom Hooper. I know that he records live and he films and records your vocals live. It's such an interesting thing to me and I wanted to see him work. I had actually done screen tests for Les Mis.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/en/utt052-TaylorSwift-utt080-TaylorSwift.wav" autoplay/>Your browser does not support the audio element.</audio><br>Every stage is a fresh adventure, and as the lights ignite, it's an unspoken pact between me and the audience, weaving unforgettable nights where dreams meet reality.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/en/utt052-TaylorSwift-utt081-TaylorSwift.wav" autoplay/>Your browser does not support the audio element.</audio><br>Creating is my way of extracting magic from life's moments. Whether it's joy or tears, I embrace it all, transmuting those feelings into notes, with the hope of touching the depths of every soul.</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="2">JP</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/jp/gongteng-prompt.wav" autoplay/>Your browser does not support the audio element.</audio><br>の匂いを嗅ぎつけて現場に赴き、モテる感覚の全てを使って犯人を割り出し、食らいついたら相手が観念するまで証拠という鋭い歯を食い込ませるそれが探偵さん。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/jp/gongteng-1.wav" autoplay/>Your browser does not support the audio element.</audio><br>投資で安定収入を得たい人達で情報交換をしませんか?</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/jp/gongteng-2.wav" autoplay/>Your browser does not support the audio element.</audio><br>自分でもユナに提案してからやっぱり暑すぎるか。</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/jp/huiyuanai-prompt.wav" autoplay/>Your browser does not support the audio element.</audio><br>どうして、どうしてお姉ちゃんを助けてくれなかったの?</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/jp/huiyuanai-1.wav" autoplay/>Your browser does not support the audio element.</audio><br>クレジットカード現金化の店舗のスタッフブログです。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/jp/huiyuanai-2.wav" autoplay/>Your browser does not support the audio element.</audio><br>某ハンドメイドブログの別館ともなっております。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="2">Yue</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/yue/female01.wav" autoplay/>Your browser does not support the audio element.</audio><br>结果学校苦心安排佢哋自行排成三队,走到行列最前端。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/yue/female01-1.wav" autoplay/>Your browser does not support the audio element.</audio><br>不可能吧我挂住你啦点算啊,你又唔接我电话。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/yue/female01-2.wav" autoplay/>Your browser does not support the audio element.</audio><br>老公,今晚石河子好似有六级地震。</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/yue/G0070_S0276.wav" autoplay/>Your browser does not support the audio element.</audio><br>你系男嘅定系女嘅,你叫咩名,边度嚟噶?</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/yue/G0070_S0276-1.wav" autoplay/>Your browser does not support the audio element.</audio><br>衞生署提醒市民,近期流感病例增加,建議出門佩戴口罩,留意個人衛生。</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/yue/G0070_S0276-2.wav" autoplay/>Your browser does not support the audio element.</audio><br>呢度嘅风景真係靓到爆,连带咁多年嚟我见过嘅都比唔上。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;" rowspan="2">KO</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/ko/gcvihizo3tz49qwukreibbf4ejj95ql.wav" autoplay/>Your browser does not support the audio element.</audio><br>물고기들은 빗물이 물 위로 떨어지는 소리에 놀라 달아나 버리지. 그래서 비가 그친 후엔 낚시하기 딱 좋아.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/ko/gcvihizo3tz49qwukreibbf4ejj95ql-1.wav" autoplay/>Your browser does not support the audio element.</audio><br>여기에 가만히 서 있는다고 상대가 찾아오진 않아.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/ko/gcvihizo3tz49qwukreibbf4ejj95ql-2.wav" autoplay/>Your browser does not support the audio element.</audio><br>기억해둬, 친구, 난 언젠가 이 세상을 정복할 사람이야!</td>
</tr>
<tr>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/ko/hrawubezzoq2m7tjjw8i4f76xlpk5pl.wav" autoplay/>Your browser does not support the audio element.</audio><br>고작 빗방울로 호들갑은, 아직 번개도 치지 않았는데.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/ko/hrawubezzoq2m7tjjw8i4f76xlpk5pl-1.wav" autoplay/>Your browser does not support the audio element.</audio><br>하늘이 맑다고 번개는 숨길 수 없고, 비가 내리지 않아도 신의 번개를 내릴 수 있다.</td>
<td style="vertical-align : top;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/zeroshot_generation/ko/hrawubezzoq2m7tjjw8i4f76xlpk5pl-2.wav" autoplay/>Your browser does not support the audio element.</audio><br>비록 몸과 꿈은 사라졌어도, 그녀는 결국 신성한 벚나무가 되었다.</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="CosyVoice-instructed" style="text-align: center;">Instructed Voice Generation</h2>
<h3 id="CosyVoice-identity" style="text-align: center;">Speaker Identity Control</h3>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">Instruction</th>
<th style="vertical-align : middle;text-align: center">Generated</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;">Theo 'Crimson', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.<br>(翻译:西奥(Theo)‘绯红’,是一位火热、充满激情的反叛领袖。他为正义而战,斗志高昂,但容易冲动。)</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/speaker_identity/speaker_identity1.wav" autoplay/>Your browser does not support the audio element.</audio><br>人总是要为自己所做的事情付出代价的,我不怕付出代价,只是不愿意让无辜的人来陪葬。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Kai 'Torrent', is a cool-headed, tactical water mage who plans his moves carefully. A soothing presence with hidden depths.<br>(翻译:凯(Kai)‘激流’,是一位冷静、讲究策略的水法师,仔细规划每一步行动。他是一个具有安抚作用的存在,内心深藏深度。)</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/speaker_identity/speaker_identity2.wav" autoplay/>Your browser does not support the audio element.</audio><br>一个人最大的悲哀,不是牺牲,而是看着自己辛辛苦苦得来的一切,最后变成了别人的功劳。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Zara 'Wildfire', is an impulsive, fearless firebrand who loves a challenge. Her bravery inspires others, though she often acts recklessly.<br>(翻译:扎拉(Zara)‘狂野之火’,是一个冲动而无畏的煽动者,她热爱挑战。她的勇敢鼓舞了他人,尽管她的行为常常是鲁莽的。)</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/speaker_identity/speaker_identity3.wav" autoplay/>Your browser does not support the audio element.</audio><br>真正的速度你是看不见的,就像风起云涌、日落月升,就像你不知道树叶什么时候变黄,不知道你的孩子什么时候长出第一颗牙来。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Selene 'Moonshade', is a mysterious, elegant dancer with a connection to the night. Her movements are both mesmerizing and deadly.<br>(翻译:赛琳娜(Selene)‘月影’,是一位神秘、优雅的舞者,与夜晚有着特殊的联系。她的舞姿既令人着迷又致命。)</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/speaker_identity/speaker_identity4.wav" autoplay/>Your browser does not support the audio element.</audio><br>我们走的每一步,都是我们策略的一部分;你看到的所有一切,包括我此刻与你交谈,所做的一切,所说的每一句话,都有深远的含义。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Priya, the humanitarian doctor, heals wounds of the world with her boundless empathy and skill.<br>(翻译:普莉娅(Priya),这位人道主义医生,用她无尽的同情心和高超的医术治愈着世界的创伤。)</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/speaker_identity/speaker_identity5.wav" autoplay/>Your browser does not support the audio element.</audio><br>You don't know about real loss, because it only occurs when you've loved something more than you love yourself.</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Ivan, the old sea captain, navigates life's storms with timeless wisdom and a heart of gold.<br>(翻译:伊凡(Ivan),这位老船长,用永恒的智慧和一颗善良的心驾驭人生的风暴。)</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/speaker_identity/speaker_identity6.wav" autoplay/>Your browser does not support the audio element.</audio><br>Hope is a good thing, maybe the best of things, and no good thing ever dies.</td>
</tr>
</tbody>
</table>
</div>
<h3 id="CosyVoice-finegrained" style="text-align: center;">Fine-grained Control</h3>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">Text</th>
<th style="vertical-align : middle;text-align: center">Generated</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;">[laughter]有时候,最简单的事情[laughter]能让我们笑得最开心,就像是无意中听到的一个傻笑话[laughter]。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/fine-grained/laughter1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">他搞的一个恶作剧,让大家&lt;laughter&gt;忍俊不禁&lt;/laughter&gt;。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/fine-grained/laughter2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">成功并不是预先设定的终点,它需要你一步一步地<strong>努力</strong>,持续地<strong>努力</strong>,最终将梦想变成现实。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/fine-grained/strong1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">当你深入了解一个文化[breath],你会开始欣赏那些最初看似平凡无奇的细节[breath],并且通过这些细节,逐渐理解这个文化的精神。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/fine-grained/breath1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Well that's kind of scary [laughter] I'm not near that age [laughter] I'm way over it but I do have children to think about you know.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/fine-grained/laughter3.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Well that pretty much covers &lt;laughter&gt;the subject&lt;/laughter&gt; well thanks for calling me.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/fine-grained/laughter4.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">The team's <strong>unity</strong> and <strong>resilience</strong> helped them win the championship.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/fine-grained/strong2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">I don't think I over eat yeah [breath] and um I do exercise regularly.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/fine-grained/breath2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
</div>
<h3 id="CosyVoice-style" style="text-align: center;">Style Control</h3>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">Instruction</th>
<th style="vertical-align : middle;text-align: center">Generated</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;">A female speaker with normal pitch and normal speaking rate.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/style/style0.wav" autoplay/>Your browser does not support the audio element.</audio><br>他讲的冷笑话虽然老套,但仍然让大家笑个不停。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">A female speaker with high pitch, normal speaking rate, and happy emotion.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/style/style1.wav" autoplay/>Your browser does not support the audio element.</audio><br>他讲的冷笑话虽然老套,但仍然让大家笑个不停。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">A male speaker with low pitch, fast speaking rate, and angry emotion.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/style/style2.wav" autoplay/>Your browser does not support the audio element.</audio><br>生活的美不在于宏大的时刻,而在于那些我们经常忽视的简单而日常的奇迹。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">A female speaker with normal pitch, slow speaking rate, and sad emotion.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/style/style3.wav" autoplay/>Your browser does not support the audio element.</audio><br>当我们离开这个世界时,人们记住的不是我们积累的财物,而是我们对他们生活的影响和我们共享的爱。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">A male speaker with low pitch, slow speaking rate, and fearful emotion.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/style/style4.wav" autoplay/>Your browser does not support the audio element.</audio><br>深夜独行于荒芜的小巷,忽闻身后传来诡异的脚步声,我寒毛直竖,心跳如雷,无法抑制对未知危险的深深恐惧。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">A male speaker with low pitch, slow speaking rate, and sad emotion.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/style/style5.wav" autoplay/>Your browser does not support the audio element.</audio><br>Every choice we make, every path we take, molds our identity. We are the sum of our choices, and it's up to us to make them meaningful.</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">A female speaker with angry emotion.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/instruct/style/style6.wav" autoplay/>Your browser does not support the audio element.</audio><br>I’m really struggling to stay calm right now because what you did was totally out of line!</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="CosyVoice-emotion" style="text-align: center;">Emotionally Expressive Voice Generation</h2>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="vertical-align : middle;text-align: center">Emotion </th>
<th style="vertical-align : middle;text-align: center">Generated 1</th>
<th style="vertical-align : middle;text-align: center">Generated 2</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;">Neutral</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/中立1.wav" autoplay/>Your browser does not support the audio element.</audio><br>我是通义实验室语音团队全新推出的生成式语音大模型,提供舒适自然的语音合成能力。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/中立2.wav" autoplay/>Your browser does not support the audio element.</audio><br>西红柿炒鸡蛋是一道简单又经典的家常菜。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Sad</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/伤心1.wav" autoplay/>Your browser does not support the audio element.</audio><br>等你熬过那些孤独无助的时刻,你才会发现,原来自己并没有想象中那么脆弱。原来一个人,也可以活成千军万马的模样。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/伤心2.wav" autoplay/>Your browser does not support the audio element.</audio><br>我可以安慰很多人,但就是不能安慰自己那颗千疮百孔的心。总有一些人会慢慢淡出你的生活。你要学会接受,而不是怀念。有些事。不管我们如何努力,回不去就是回不去了。我们漫长的岁月中有太多的过客,有太多的无奈。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Happy</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/开心1.wav" autoplay/>Your browser does not support the audio element.</audio><br>小丽抿着嘴,弓着腰,蹑手蹑脚地,一步一步慢慢地靠近它。靠近了,靠近了,又见她悄悄地将右手伸向蝴蝶,张开的两个手指一合,夹住了粉蝶的翅膀。小丽高兴得又蹦又跳。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/开心2.wav" autoplay/>Your browser does not support the audio element.</audio><br>除夕晚上,儿子孙子都来到她身边,她满脸皱纹都舒展开了,就像盛开的菊花瓣,每根皱纹里都洋溢着笑意。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Angry</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/愤怒1.wav" autoplay/>Your browser does not support the audio element.</audio><br>突然有一个不认识的西班牙老粗,捶着台子站了起来,涨红着脸,激动的演说着,他说得口沫横飞,气得双眼要炸了似的弹出着,两手又挥又举,恨不能表达他的愤怒。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/愤怒2.wav" autoplay/>Your browser does not support the audio element.</audio><br>无可抑制的愤怒在他的血管中奔腾翻滚着,它一阵飓风般的疯狂奔跑,没有任何事情能挡它,它看见两个那种恶魔吸附在马上,还有两条狗。他是一个狂魔,也是一阵毁灭一切的龙卷风。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Fearful</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/害怕1.wav" autoplay/>Your browser does not support the audio element.</audio><br>在漆黑的夜晚,月光洒在寂静的街道上,一道身影颤抖着站在破旧的木门前。他的名字叫做李明,一个平凡的邮差,此刻却满眼恐惧地盯着那扇似乎隐藏着无尽黑暗的门。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/emotion/害怕2.wav" autoplay/>Your browser does not support the audio element.</audio><br>他试图在心中寻找一丝勇气,回忆起过去的日子里,那些快乐而平凡的日子。他试图用这些回忆来驱散心中的恐惧。然而,那扇破旧的木门似乎在呼唤着他,吸引着他走向黑暗。李明的心跳如同在狂奔的野马,无法控制。</td>
</tr>
</tbody>
</table>
</div>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="CosyVoice-sft" style="text-align: center;">Speaker Fine-tuning</h2>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="text-align: center">Speaker </th>
<th style="text-align: center">Text </th>
<th style="text-align: center">Generated</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;">Speaker 1</td>
<td style="vertical-align : middle;text-align:center;">这也不知道为啥哈,反正,它刚出来的时候儿叫台湾手抓饼,现在就是可能这个,大陆这边儿都给改良了,整的都像那种,烙的那种,鸡蛋灌饼儿似的啦,哎呦,就有那种感觉哈。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/finetuned/finetuned-zh-2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Speaker 2</td>
<td style="vertical-align : middle;text-align:center;">明月几时有?把酒问青天。不知天上宫阙,今夕是何年。我欲乘风归去,又恐琼楼玉宇,高处不胜寒。起舞弄清影,何似在人间。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/finetuned/finetuned-zh-1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Speaker 3</td>
<td style="vertical-align : middle;text-align:center;">生活不在于拥有最好的一切,而在于把一切都变得最好。别怕失败,它是通往成功的必经之路。每一次跌倒,都是为了更坚强地站起。梦想不是等来的,是追出来的。迈出那一步,让汗水成为你成功的见证!</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/finetuned/finetuned-zh-3.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Speaker 4</td>
<td style="vertical-align : middle;text-align:center;">In the heart of the whispering woods, Ellie the adventurous elf put on her leafy green cloak, picking up her map sprinkled with mystical runes and set out on a quest to find the enchanted crystal that was said to hold the key to endless joy and laughter.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/finetuned/finetuned-en-1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Speaker 5</td>
<td style="vertical-align : middle;text-align:center;">In the stately grandeur of Pemberley, Elizabeth Bennet's prejudices began to crumble as she gazed upon the portrait of Mr. Darcy, realizing for the first time that the true measure of a man lay not in the fineries of his estate, but in the depth of his character and the kindness he bestowed upon those of lower station.</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/finetuned/finetuned-en-2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
</div>
<p>* Due to copyright restrictions, we cannot open-source the SFT models, but we will release the SFT training script so that you can run SFT on your own data.</p>
<br>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="CosyVoice-intep" style="text-align: center;">Speaker Interpolation</h2>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="text-align: center">Text </th>
<th style="text-align: center">speaker A </th>
<th>
<input style="text-align: center" type="range" id="audioSlider1" min="0.1" max="0.9" step="0.1" value="0.5" oninput="updateGeneratedAudio(1)">
<span id="sliderValue1">0.5</span>
</th>
<th style="text-align: center">speaker B </th>
</tr>
</thead>
<tr>
<td style="vertical-align : middle;text-align:center;">晴空万里不如你心情愉悦,今天有什么开心的事吗?</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/F030_M037/0.0/1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td>
<audio id="generatedAudio1" controls="" style='width: 190px;'>
<!-- Placeholder for generated audio source -->
Your browser does not support the audio element.
</audio>
</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/F030_M037/1.0/1.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<thead>
<tr>
<th style="text-align: center">Text </th>
<th style="text-align: center">speaker A </th>
<th>
<input style="text-align: center" type="range" id="audioSlider2" min="0.1" max="0.9" step="0.1" value="0.5" oninput="updateGeneratedAudio(2)">
<span id="sliderValue2">0.5</span>
</th>
<th style="text-align: center">speaker B </th>
</tr>
</thead>
<tr>
<td style="vertical-align : middle;text-align:center;">早上好啊,今天也是元气满满的一天呢,一起加油吧</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/F030_M037/0.0/2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td>
<audio id="generatedAudio2" controls="" style='width: 190px;'>
<!-- Placeholder for generated audio source -->
Your browser does not support the audio element.
</audio>
</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/F030_M037/1.0/2.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<thead>
<tr>
<th style="text-align: center">Text </th>
<th style="text-align: center">speaker A </th>
<th>
<input style="text-align: center" type="range" id="audioSlider3" min="0.1" max="0.9" step="0.1" value="0.5" oninput="updateGeneratedAudio(3)">
<span id="sliderValue3">0.5</span>
</th>
<th style="text-align: center">speaker C </th>
</tr>
</thead>
<tr>
<td style="vertical-align : middle;text-align:center;">哎哟,为什么每次打游戏你都不认真的,我都快被怪物抓走了,你却在那里发呆,下一局你可要保护我啊</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/F030_jielidou/0.0/4.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td>
<audio id="generatedAudio3" controls="" style='width: 190px;'>
<!-- Placeholder for generated audio source -->
Your browser does not support the audio element.
</audio>
</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/F030_jielidou/1.0/4.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<thead>
<tr>
<th style="text-align: center">Text </th>
<th style="text-align: center">speaker A </th>
<th>
<input style="text-align: center" type="range" id="audioSlider4" min="0.1" max="0.9" step="0.1" value="0.5" oninput="updateGeneratedAudio(4)">
<span id="sliderValue4">0.5</span>
</th>
<th style="text-align: center">speaker C </th>
</tr>
</thead>
<tr>
<td style="vertical-align : middle;text-align:center;">哼,你明明答应我要一起去图书馆的,难道你忘记我们的约定了吗,真是的,下次你要记得守时哦</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/F030_jielidou/0.0/5.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td>
<audio id="generatedAudio4" controls="" style='width: 190px;'>
<!-- Placeholder for generated audio source -->
Your browser does not support the audio element.
</audio>
</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/F030_jielidou/1.0/5.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
<thead>
<tr>
<th style="text-align: center">Text </th>
<th style="text-align: center">speaker B </th>
<th>
<input style="text-align: center" type="range" id="audioSlider5" min="0.1" max="0.9" step="0.1" value="0.5" oninput="updateGeneratedAudio(5)">
<span id="sliderValue5">0.5</span>
</th>
<th style="text-align: center">speaker C </th>
</tr>
</thead>
<tr>
<td style="vertical-align : middle;text-align:center;">有什么让你感到不开心的事情吗?哎呀,听到你这么说,我很难过。如果你愿意的话,可以跟我分享一下,或许我可以帮助你排解烦恼。</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/M037_jielidou/0.0/8.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td>
<audio id="generatedAudio5" controls="" style='width: 190px;'>
<!-- Placeholder for generated audio source -->
Your browser does not support the audio element.
</audio>
</td>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/intep/M037_jielidou/1.0/8.wav" autoplay/>Your browser does not support the audio element.</audio></td>
</tr>
</tbody>
</table>
</div>
<br>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="SenseVoice-overview" style="text-align: center;">Overview of SenseVoice</h2>
<p style="text-align: center;">
<img src="pics/overview-sensevoice.png" height="400" width="800">
</p>
<p style="text-align: center;" >
<b>Figure 2.</b> An overview of the SenseVoice models. SenseVoice is a speech foundation model with multiple speech understanding capabilities, including automatic speech recognition (ASR), language identification (LID), speech emotion recognition (SER), and audio event detection (AED). It comes in two variants: SenseVoice-Small, an encoder-only model for fast speech understanding, and SenseVoice-Large, an encoder-decoder model for more accurate speech understanding with more languages supported.</p>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="SenseVoice-ASR" style="text-align: center;">Multilingual Speech Recognition</h2>
<p> We compared the multilingual recognition performance and inference efficiency of SenseVoice and Whisper on open-source benchmark datasets, including AISHELL-1, AISHELL-2, WenetSpeech, LibriSpeech, and Common Voice. Inference efficiency was measured on an A800 machine. SenseVoice-Small uses a non-autoregressive end-to-end architecture, which gives it extremely low inference latency: 7 times faster than Whisper-Small and 17 times faster than Whisper-Large. </p>
<p style="text-align: center;">
<img src="pics/chinese_english_open_corpus.png" height="250" width="580">
<img src="pics/common_voice.png" height="250" width="580">
</p>
<p style="text-align: center;" >
<b>Figure 3.</b> Comparison of SenseVoice and Whisper on multilingual speech recognition benchmarks. </p>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="text-align: center">Model</th>
<th style="text-align: center">Framework</th>
<th style="text-align: center">Parameters</th>
<th style="text-align: center">Supported Languages</th>
<th style="text-align: center">3s Audio Latency</th>
<th style="text-align: center">5s Audio Latency</th>
<th style="text-align: center">10s Audio Latency</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;">Whisper-Small</td>
<td style="vertical-align : middle;text-align:center;">Autoregressive</td>
<td style="vertical-align : middle;text-align:center;">244 M</td>
<td style="vertical-align : middle;text-align:center;">50+</td>
<td style="vertical-align : middle;text-align:center;">285ms</td>
<td style="vertical-align : middle;text-align:center;">367ms</td>
<td style="vertical-align : middle;text-align:center;">518ms</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Whisper-Large-V3</td>
<td style="vertical-align : middle;text-align:center;">Autoregressive</td>
<td style="vertical-align : middle;text-align:center;">1550 M</td>
<td style="vertical-align : middle;text-align:center;">50+</td>
<td style="vertical-align : middle;text-align:center;">751ms</td>
<td style="vertical-align : middle;text-align:center;">1009ms</td>
<td style="vertical-align : middle;text-align:center;">1281ms</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">Paraformer-zh</td>
<td style="vertical-align : middle;text-align:center;">Non-Autoregressive</td>
<td style="vertical-align : middle;text-align:center;">220 M</td>
<td style="vertical-align : middle;text-align:center;">zh</td>
<td style="vertical-align : middle;text-align:center;">76ms</td>
<td style="vertical-align : middle;text-align:center;">85ms</td>
<td style="vertical-align : middle;text-align:center;">100ms</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><b>SenseVoice-Small</b></td>
<td style="vertical-align : middle;text-align:center;"><b>Non-Autoregressive</b></td>
<td style="vertical-align : middle;text-align:center;"><b>234 M</b></td>
<td style="vertical-align : middle;text-align:center;"><b>zh,yue,en,ja,ko</b></td>
<td style="vertical-align : middle;text-align:center;"><b>63ms</b></td>
<td style="vertical-align : middle;text-align:center;"><b>67ms</b></td>
<td style="vertical-align : middle;text-align:center;"><b>70ms</b></td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;">SenseVoice-Large</td>
<td style="vertical-align : middle;text-align:center;">Autoregressive</td>
<td style="vertical-align : middle;text-align:center;">1587 M</td>
<td style="vertical-align : middle;text-align:center;">50+</td>
<td style="vertical-align : middle;text-align:center;">738ms</td>
<td style="vertical-align : middle;text-align:center;">1207ms</td>
<td style="vertical-align : middle;text-align:center;">1623ms</td>
</tr>
</tbody>
</table>
<p style="text-align: center;" ><b>Table 1.</b> Comparison of model architecture, parameter scale, supported languages, and inference efficiency of SenseVoice, Paraformer, and Whisper. SenseVoice-Small employs a non-autoregressive architecture, which gives it a significant inference-efficiency advantage over Whisper.</p>
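<p>As a rough illustration of how the latency figures in Table 1 translate into speedup ratios, the sketch below divides one model's latency by another's at each audio length. The <code>speedup</code> helper and the <code>latencyMs</code> object are hypothetical names introduced here; only the numbers come from the table.</p>

```javascript
// Latencies in milliseconds, copied from Table 1 (3 s, 5 s, 10 s audio).
const latencyMs = {
  "Whisper-Small": [285, 367, 518],
  "Whisper-Large-V3": [751, 1009, 1281],
  "Paraformer-zh": [76, 85, 100],
  "SenseVoice-Small": [63, 67, 70],
  "SenseVoice-Large": [738, 1207, 1623],
};

// Hypothetical helper: how many times faster `fast` is than `slow`
// at each audio length, rounded to one decimal place.
function speedup(slow, fast) {
  return latencyMs[slow].map((ms, i) =>
    Math.round((ms / latencyMs[fast][i]) * 10) / 10
  );
}

console.log(speedup("Whisper-Small", "SenseVoice-Small"));    // [4.5, 5.5, 7.4]
console.log(speedup("Whisper-Large-V3", "SenseVoice-Small")); // [11.9, 15.1, 18.3]
```

<p>With the table's values, the ratio grows with audio length, because the non-autoregressive model's latency stays nearly flat while the autoregressive models' latency increases.</p>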
</div>
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="text-align: center">Speech </th>
<th style="text-align: center">w/o ITN</th>
<th style="text-align: center">w ITN</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/asr/zh/zh.mp3" autoplay/>Your browser does not support the audio element.</audio><br></td>
<td style="vertical-align : middle;text-align:center;">开放时间早上九点至下午五点</td>
<td style="vertical-align : middle;text-align:center;">开放时间早上9点至下午5点。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/asr/yue/yue.mp3" autoplay/>Your browser does not support the audio element.</audio><br></td>
<td style="vertical-align : middle;text-align:center;">呢几个字都表达唔到我想讲嘅意思</td>
<td style="vertical-align : middle;text-align:center;">呢几个字都表达唔到,我想讲嘅意思。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/asr/en/en.mp3" autoplay/>Your browser does not support the audio element.</audio><br></td>
<td style="vertical-align : middle;text-align:center;">the tribal chieftain called for the boy and presented him with fifty pieces of gold</td>
<td style="vertical-align : middle;text-align:center;">The tribal chieftain called for the boy and presented him with 50 pieces of gold.</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/asr/ja/ja.mp3" autoplay/>Your browser does not support the audio element.</audio><br></td>
<td style="vertical-align : middle;text-align:center;">うちの中学は弁当制で持っていけない場合は50円の学校販売のパンを買う</td>
<td style="vertical-align : middle;text-align:center;">うちの中学は弁当制で持っていけない 場合は、50 円の学校販売の パンを買う。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/asr/ko/ko.mp3" autoplay/>Your browser does not support the audio element.</audio><br></td>
<td style="vertical-align : middle;text-align:center;">조 금만 생각 을 하 면서 살 면 훨씬 편할 거야</td>
<td style="vertical-align : middle;text-align:center;">조 금만 생각 을 하 면서 살 면 훨씬 편할 거야.</td>
</tr>
</tbody>
</table>
<p style="text-align: center;" ><b>Table 2.</b> SenseVoice-Small can enable or disable Inverse Text Normalization (ITN) during recognition via a tag prompt.</p>
</div>
</div>
<div class="container pt-5 mt-5 shadow-lg p-5 mb-5 bg-white rounded">
<h2 id="SenseVoice-emotion" style="text-align: center;">Speech Emotion Recognition</h2>
<p> SenseVoice can also be used for discrete emotion recognition. Four emotions are supported: Happy, Sad, Angry, and Neutral.
We evaluated it on 7 popular emotion recognition datasets.
SenseVoice-Large approaches or exceeds the SOTA results on most datasets, even without fine-tuning on the target corpus.
</p>
<p style="text-align: center;">
<img src="pics/ser_result.png" height="480" width="1024">
</p>
<p style="text-align: center;" >
<b>Figure 4.</b> Weighted Average Accuracy (WA(%)) comparison on 7 emotion recognition datasets.
<a href="https://github.com/emo-box/EmoBox" target="_blank">EmoBox</a> is a recent speech emotion recognition benchmark based on Self-Supervised Models and Whisper.
<a href="https://huggingface.co/ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition" target="_blank">Model on HF</a> stands for the most popular speech emotion recognition model on HuggingFace.
</p>
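<p>For readers unfamiliar with the metric, the sketch below computes weighted accuracy as it is commonly defined in the speech emotion recognition literature: the share of all utterances classified correctly, so larger classes carry more weight. This page does not spell out its exact formula, so treat this as an assumption; the function name and the toy labels are made up for illustration.</p>

```javascript
// Weighted accuracy (WA): fraction of all utterances classified correctly,
// so each class contributes in proportion to its number of samples.
function weightedAccuracy(labels, predictions) {
  if (labels.length !== predictions.length || labels.length === 0) {
    throw new Error("labels and predictions must be non-empty and equal in length");
  }
  let correct = 0;
  for (let i = 0; i < labels.length; i++) {
    if (labels[i] === predictions[i]) correct++;
  }
  return (100 * correct) / labels.length;
}

// Made-up toy labels, not taken from any of the 7 datasets.
const labels      = ["HAPPY", "SAD", "ANGRY", "NEUTRAL", "NEUTRAL"];
const predictions = ["HAPPY", "SAD", "NEUTRAL", "NEUTRAL", "NEUTRAL"];
console.log(weightedAccuracy(labels, predictions)); // 80
```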
<div class="table-responsive pt-3">
<table class="table table-hover pt-2">
<thead>
<tr>
<th style="text-align: center">Audio </th>
<th style="text-align: center">SenseVoice-Large </th>
<th style="text-align: center">SenseVoice-Small</th>
</tr>
</thead>
<tbody>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/ser/zh_happy.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td style="vertical-align : middle;text-align:center;"><|zh|><|Speech|>英国的哲学家曾经说过。<|/Speech|><|HAPPY|></td>
<td style="vertical-align : middle;text-align:center;"><|zh|><|HAPPY|><|Speech|>英国的哲学家曾经说过。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/ser/zh_sad.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td style="vertical-align : middle;text-align:center;"><|zh|><|Speech|>英国的哲学家曾经说过。<|/Speech|><|SAD|></td>
<td style="vertical-align : middle;text-align:center;"><|zh|><|SAD|><|Speech|>英国的哲学家曾经说过。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/ser/zh_angry.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td style="vertical-align : middle;text-align:center;"><|zh|><|Speech|>英国的哲学家曾经说过。<|/Speech|><|ANGRY|></td>
<td style="vertical-align : middle;text-align:center;"><|zh|><|ANGRY|><|Speech|>英国的哲学家曾经说过。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/ser/zh_neutral.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td style="vertical-align : middle;text-align:center;"><|zh|><|Speech|>英国的哲学家曾经说过。<|/Speech|><|NEUTRAL|></td>
<td style="vertical-align : middle;text-align:center;"><|zh|><|NEUTRAL|><|Speech|>英国的哲学家曾经说过。</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/ser/en_happy.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td style="vertical-align : middle;text-align:center;"><|en|><|Speech|>I did go, and made many prisoners. <|/Speech|><|HAPPY|></td>
<td style="vertical-align : middle;text-align:center;"><|en|><|HAPPY|><|Speech|>I did go and made many prisoners.</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/ser/en_sad.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td style="vertical-align : middle;text-align:center;"><|en|><|Speech|>I did go, and made many prisoners. <|/Speech|><|SAD|></td>
<td style="vertical-align : middle;text-align:center;"><|en|><|SAD|><|Speech|>I did go and made many prisoners.</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/ser/en_angry.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td style="vertical-align : middle;text-align:center;"><|en|><|Speech|>I did go, and made many prisoners. <|/Speech|><|ANGRY|></td>
<td style="vertical-align : middle;text-align:center;"><|en|><|ANGRY|><|Speech|>I did go and made many prisoners.</td>
</tr>
<tr>
<td style="vertical-align : middle;text-align:center;"><audio controls="controls" style="width: 190px;"><source src="audios/ser/en_neutral.wav" autoplay/>Your browser does not support the audio element.</audio></td>
<td style="vertical-align : middle;text-align:center;"><|en|>I did go, and made many prisoners. <|/Speech|><|NEUTRAL|></td>
<td style="vertical-align : middle;text-align:center;"><|en|><|NEUTRAL|><|Speech|>I did go and made many prisoners.</td>
</tr>
</tbody>
</table>
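<p>The tagged outputs in the table above can be split into language, emotion, and transcript with a small parser. The sketch below is a minimal, hypothetical example that assumes only the tag shapes visible in the table (<code>&lt;|xx|&gt;</code> language tags, the four emotion tags, and <code>&lt;|Speech|&gt;</code> / <code>&lt;|/Speech|&gt;</code> markers); the function name <code>parseTagged</code> is not part of SenseVoice.</p>

```javascript
// Matches any <|...|> tag and captures the name between the bars.
const TAG = /<\|([^|]*)\|>/g;
const EMOTIONS = new Set(["HAPPY", "SAD", "ANGRY", "NEUTRAL"]);

// Hypothetical parser for the tagged strings shown in the table.
function parseTagged(output) {
  const result = { language: null, emotion: null, text: "" };
  for (const [, tag] of output.matchAll(TAG)) {
    if (EMOTIONS.has(tag)) {
      result.emotion = tag;
    } else if (tag !== "Speech" && tag !== "/Speech" && result.language === null) {
      // First tag that is neither an emotion nor a Speech marker, e.g. "zh" or "en".
      result.language = tag;
    }
  }
  // The transcript is whatever remains once every tag is removed.
  result.text = output.replace(TAG, "").trim();
  return result;
}

// SenseVoice-Small style: emotion tag precedes the transcript.
console.log(parseTagged("<|en|><|HAPPY|><|Speech|>I did go and made many prisoners."));
// SenseVoice-Large style: emotion tag follows the closing Speech marker.
console.log(parseTagged("<|en|><|Speech|>I did go, and made many prisoners. <|/Speech|><|SAD|>"));
```

<p>Because the parser looks at tag names rather than positions, it handles both orderings shown in the table.</p>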