The version of HKCanCor published on HuggingFace by NTU is different from the version offered by this library in at least four undocumented ways:
The total token counts differ: NTU has 160836 tokens, while PyCantonese has 153654.
PyCantonese defines an utterance differently: it appears to be a single sentence ending with a period or question mark, whereas an utterance in the NTU version can span multiple sentences. PyCantonese's segmentation also separates closing quotation marks from their sentences, leaving utterances such as "一路一路剝. that begin with a quotation mark but have no closing one.
PyCantonese uses English punctuation marks, while NTU uses Chinese punctuation.
Some private-use characters in the NTU version appear in PyCantonese as standard Unicode Chinese characters.
Would it be possible to explain these differences and any preprocessing step done by PyCantonese in the doc somewhere?
You can use the following script to compare the two versions of the corpus:
from datasets import load_dataset
import pycantonese
if __name__ == "__main__":
    # Load the NTU version from HuggingFace
    print('==== HuggingFace ====')
    hf = load_dataset("nanyang-technological-university-singapore/hkcancor", trust_remote_code=True)
    hf_utterances = []
    hf_tokens = 0
    for utterance in hf['train']:
        hf_utterances.append(''.join(utterance['tokens']))
        hf_tokens += len(utterance['tokens'])
    print('Total tokens:', hf_tokens)
    print('Utterances before deduplication:', len(hf_utterances))
    hf_utterances = sorted(list(set(hf_utterances)))
    print('Utterances after deduplication:', len(hf_utterances))
    longest_hf_utterance = max(hf_utterances, key=len)
    print('Length of the longest utterance:', len(longest_hf_utterance))

    # Load HKCanCor data
    hkcancor_data = pycantonese.hkcancor()
    hkcancor_tokens = 0
    hkcancor_utterances = [''.join(token.word for token in utterance) for utterance in hkcancor_data.tokens(by_utterances=True)]
    for utterance in hkcancor_data.tokens(by_utterances=True):
        hkcancor_tokens += len(utterance)
    print('==== PyCantonese ====')
    print('Total tokens:', hkcancor_tokens)
    print('Utterances before deduplication:', len(hkcancor_utterances))
    hkcancor_utterances = sorted(list(set(hkcancor_utterances)))
    print('Utterances after deduplication:', len(hkcancor_utterances))
    longest_hkcancor_utterance = max(hkcancor_utterances, key=len)
    print('Length of the longest utterance:', len(longest_hkcancor_utterance))
Outputs:
==== HuggingFace ====
Total tokens: 160836
Utterances before deduplication: 10801
Utterances after deduplication: 9117
Length of the longest utterance: 876
==== PyCantonese ====
Total tokens: 153654
Utterances before deduplication: 16162
Utterances after deduplication: 13118
Length of the longest utterance: 145
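The punctuation and private-use differences can also be checked at the character level. The following is a minimal sketch, not part of the comparison script above: `profile`, `ASCII_PUNCT`, and `CJK_PUNCT` are hypothetical names, and the two punctuation sets are assumptions that may not cover every mark in either corpus. Private-use characters are detected through the Unicode general category `Co`.

```python
import unicodedata
from collections import Counter

# Assumed punctuation inventories; extend them to match the actual corpora.
ASCII_PUNCT = set('.,?!"')
CJK_PUNCT = set("。，？！、「」")


def profile(utterances):
    """Count private-use characters and ASCII vs. CJK punctuation marks."""
    counts = Counter()
    for text in utterances:
        for ch in text:
            if unicodedata.category(ch) == "Co":  # Co = private use
                counts["private_use"] += 1
            elif ch in ASCII_PUNCT:
                counts["ascii_punct"] += 1
            elif ch in CJK_PUNCT:
                counts["cjk_punct"] += 1
    return counts


if __name__ == "__main__":
    # Tiny made-up sample; pass hf_utterances or hkcancor_utterances instead.
    print(profile(["你好。", "你好.", "\ue000?"]))
```

Running `profile` on each version's list of utterance strings should show which punctuation style dominates and whether any private-use characters remain.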
I converted the source data into the CHAT data format for compatibility with other conversational datasets in linguistics. That's the reason for English punctuation marks instead of Chinese ones, utterances defined by periods or question marks, and probably also Unicode characters replacing non-Unicode ones. Unfortunately, I've been unable to track down the exact code I used for the conversion and I did this almost ten years ago, so I'm afraid I may not be able to explain every difference.
For reference, here are the relevant commits in the pycantonese codebase:
a527f2b (January 2015): HKCanCor source data added, probably no conversion or preprocessing whatsoever
bc1fa1e (February 2016): data converted to the CHAT format
Perhaps it's possible (ideal?) to re-do the CHAT format conversion work by pulling data from https://github.com/fcbond/hkcancor and keep everything (including the preprocessing/conversion code) properly versioned so that we would be able to track these things -- a project for another day :-)
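To illustrate the kind of step a versioned conversion would need to pin down, here is a hypothetical sketch of utterance splitting and punctuation mapping. It is not the original conversion code, which could not be located. The function names, the punctuation mapping, the assumption that the source uses 「」 quotation marks, and the sample string are all made up for illustration. The naive splitter cuts right after a sentence terminator and so strands a closing quotation mark, an artifact of the kind reported above; the quote-aware variant keeps the closing mark with its sentence.

```python
import re

# Assumed mapping from Chinese to ASCII punctuation (illustrative only).
PUNCT_MAP = str.maketrans(
    {"。": ".", "？": "?", "，": ",", "！": "!", "「": '"', "」": '"'}
)


def to_ascii_punct(text):
    """Replace Chinese punctuation with ASCII equivalents."""
    return text.translate(PUNCT_MAP)


def split_naive(text):
    """Cut immediately after each period or question mark."""
    return re.findall(r"[^。？]+[。？]?", text)


def split_quote_aware(text):
    """Like split_naive, but keep a closing quote with its sentence."""
    return re.findall(r"[^。？]+[。？]?」?", text)


if __name__ == "__main__":
    sample = "「一路一路剝。」跟住就走。"  # made-up example string
    for splitter in (split_naive, split_quote_aware):
        print([to_ascii_punct(s) for s in splitter(sample)])
```

Splitting before the punctuation mapping matters here: once both 「 and 」 become `"`, an opening and a closing quotation mark can no longer be told apart.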