Merge pull request #2 from CanCLID/chaak
Publish to PyPI
Showing 4 changed files with 174 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
name: Publish to PyPI | ||
|
||
jobs: | ||
  pypi: | ||
    name: upload release to PyPI | ||
    runs-on: ubuntu-latest | ||
    permissions: | ||
      id-token: write | ||
    steps: | ||
      - uses: actions/checkout@v3 | ||
|
||
      - uses: actions/setup-python@v4 | ||
        with: | ||
          python-version: "3.11" | ||
|
||
      - name: Install dependencies | ||
        run: | | ||
          python -m pip install --upgrade pip | ||
          pip install setuptools wheel twine | ||
      - name: Run tests | ||
        run: | | ||
          python -m unittest tests/test_judge.py | ||
      - name: Update version in setup.py | ||
        run: | | ||
          version=$(python -c "exec(open('cantonesedetect/version.py').read()); print(__version__)") | ||
          sed -i "s/version=.*/version='$version',/" setup.py | ||
      - name: Build and publish | ||
        run: | | ||
          python setup.py sdist bdist_wheel | ||
          twine upload dist/* | ||
      - name: mint API token | ||
        id: mint-token | ||
        run: | | ||
          # retrieve the ambient OIDC token | ||
          resp=$(curl -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \ | ||
            "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=pypi") | ||
          oidc_token=$(jq -r '.value' <<< "${resp}") | ||
          # exchange the OIDC token for an API token | ||
          resp=$(curl -X POST https://pypi.org/_/oidc/mint-token -d "{\"token\": \"${oidc_token}\"}") | ||
          api_token=$(jq -r '.token' <<< "${resp}") | ||
          # mask the newly minted API token, so that we don't accidentally leak it | ||
          echo "::add-mask::${api_token}" | ||
          # see the next step in the workflow for an example of using this step output | ||
          echo "api-token=${api_token}" >> "${GITHUB_OUTPUT}" | ||
      - name: publish | ||
        # gh-action-pypi-publish uses TWINE_PASSWORD automatically | ||
        uses: pypa/gh-action-pypi-publish@release/v1 | ||
        with: | ||
          password: ${{ steps.mint-token.outputs.api-token }} |
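The `Update version in setup.py` step in the workflow above reads `__version__` from `cantonesedetect/version.py` and rewrites the `version=` line of `setup.py` with `sed`. A minimal Python sketch of the same two operations is given here, working on strings rather than files so it can be tried without the repository. The helper names `read_version` and `rewrite_version` are hypothetical and not part of the project.

```python
import re


def read_version(version_source: str) -> str:
    """Execute the text of a version.py file and return its __version__."""
    namespace: dict = {}
    # Same approach as the workflow's `python -c "exec(open(...).read())"`.
    exec(version_source, namespace)
    return namespace["__version__"]


def rewrite_version(setup_source: str, version: str) -> str:
    """Replace everything from `version=` to the end of that line, like the sed call."""
    replacement = f"version='{version}',"
    # A function replacement keeps backslashes in `version` from being
    # interpreted as regex escapes.
    return re.sub(r"version=.*", lambda _match: replacement, setup_source)
```

Like the `sed` expression, the pattern matches any line containing `version=`, so a `setup.py` that also uses a keyword such as `python_version=` would need a stricter pattern.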
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
# CantoneseDetect 粵語特徵分類器 | ||
|
||
[![license](https://img.shields.io/github/license/DAVFoundation/captain-n3m0.svg?style=flat-square)](https://github.com/DAVFoundation/captain-n3m0/blob/master/LICENSE) | ||
|
||
本項目為 [CantoFilter](https://github.com/CanCLID/cantonese-classifier) 之後續。 | ||
This is an extension of the [CantoFilter](https://github.com/CanCLID/cantonese-classifier) project. | ||
|
||
## 引用 Citation | ||
|
||
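抽出字詞特徵嘅策略同埋實踐方式,喺下面整理。討論本分類器時,請引用: | ||
The strategy and implementation for extracting lexical features are described in the paper below. When discussing this classifier, please cite: | ||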
|
||
Chaak-ming Lau, Mingfei Lau, and Ann Wai Huen To. 2024. | ||
[The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection.](https://aclanthology.org/2024.eurali-1.4/) | ||
In Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) | ||
@ LREC-COLING 2024, pages 24–29, Torino, Italia. ELRA and ICCL. | ||
|
||
分類器採用嘅分類標籤及基準,參考咗對使用者嘅語言意識形態嘅研究。討論分類準則時,請引用: | ||
|
||
The definitions and boundaries of the labels depend on the user's language ideology. | ||
When discussing the criteria adopted by this tool, please cite: | ||
|
||
Lau, Chaak Ming. 2024. Ideologically driven divergence in Cantonese vernacular writing practices. In J.-F. Dupré, editor, _Politics of Language in Hong Kong_, Routledge. | ||
|
||
--- | ||
|
||
## 簡介 Introduction | ||
|
||
分類方法係利用粵語同書面中文嘅特徵字詞,用 Regex 方式加以識別。 | ||
|
||
The classifier uses regex rules to detect lexical features specific to Cantonese or to Standard Written Chinese (SWC). | ||
|
||
### 標籤 Labels | ||
|
||
分類器會將輸入文本分成四類(粗疏)或六類(精細),分類如下: | ||
The classifier assigns each input text to one of four (coarse) or six (fine-grained) categories. The labels are: | ||
|
||
1. `Cantonese`: 純粵文,僅含有粵語特徵字詞,例如“你喺邊度” | Pure Cantonese text, containing only Cantonese feature words. E.g. 你喺邊度 | ||
1. `SWC`: 書面中文,僅含有書面語特徵字詞,例如“你在哪裏” | Pure Standard Written Chinese (SWC) text, containing only Mandarin feature words. E.g. 你在哪裏 | ||
1. `Mixed`:書粵混雜文,同時含有書面語同粵語特徵嘅字詞,例如“是咁的” | Mixed Cantonese-Mandarin text, containing both Cantonese and Mandarin feature words. E.g. 是咁的 | ||
1. `Neutral`:無特徵中文,唔含有官話同粵語特徵,既可以當成粵文亦可以當成官話文,例如“去學校讀書” | Featureless Chinese text, containing neither Cantonese nor Mandarin feature words. Such sentences can be included in both Cantonese and Mandarin corpora. E.g. 去學校讀書 | ||
1. `MixedQuotesInSWC` : 書面中文,引文入面係 `Mixed` | SWC text that quotes `Mixed` content | ||
1. `CantoneseQuotesInSWC` : 書面中文,引文入面係純粵文 `Cantonese` | SWC text that quotes `Cantonese` content | ||
|
||
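The four coarse labels above follow from two yes/no questions: does the text contain any Cantonese feature, and does it contain any SWC feature? A minimal sketch of that decision is shown here. The two character sets are hypothetical and deliberately tiny, and the function name `coarse_label` is an assumption for illustration; the project's actual regex rules are not reproduced.

```python
import re

# Hypothetical, deliberately tiny feature sets for illustration only.
# Matching single characters is naive: it ignores word boundaries.
CANTONESE_FEATURES = re.compile(r"[喺咁嘅哋佢唔冇嚟]")
SWC_FEATURES = re.compile(r"[在是的這哪他她們]")


def coarse_label(text: str) -> str:
    """Map the presence of each feature type to one of the four coarse labels."""
    has_cantonese = bool(CANTONESE_FEATURES.search(text))
    has_swc = bool(SWC_FEATURES.search(text))
    if has_cantonese and has_swc:
        return "Mixed"
    if has_cantonese:
        return "Cantonese"
    if has_swc:
        return "SWC"
    return "Neutral"
```

With these toy sets, the four example sentences from the list map to `Cantonese`, `SWC`, `Mixed` and `Neutral` respectively. Real text needs word-level rules, since a character such as 的 also occurs inside Cantonese words.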
## 用法 Usage | ||
|
||
### 系統要求 Requirement | ||
|
||
Python >= 3.11 | ||
|
||
### 安裝 Installation | ||
|
||
首先用 pip 安裝 | ||
Install the package with pip first: | ||
|
||
```bash | ||
pip install cantonesedetect | ||
``` | ||
|
||
### Python | ||
|
||
Use `judge()` | ||
|
||
```python | ||
from cantonesedetect import judge | ||
|
||
print(judge('你喺邊度')) # expected label: Cantonese | ||
print(judge('你在哪裏')) # expected label: SWC | ||
print(judge('是咁的')) # expected label: Mixed | ||
print(judge('去學校讀書')) # expected label: Neutral | ||
``` | ||
|
||
### CLI | ||
待補充 to be added. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,29 @@ | ||
from cantonesedetect.judge import judge | ||
import unittest | ||
|
||
|
||
def load_test_sentences(file_path): | ||
    with open(file_path, 'r', encoding='utf-8') as file: | ||
        lines = [line.strip() for line in file if line.strip() | ||
                 and not line.startswith('#')] | ||
    test_cases = [] | ||
    for line in lines: | ||
        if '|' in line: | ||
            sentence, quotemode, expected = line.split('|') | ||
            test_cases.append((sentence, quotemode, expected)) | ||
    return test_cases | ||
|
||
|
||
test_cases = load_test_sentences('tests/test_judge_sentences.txt') | ||
|
||
|
||
class TestJudgeFunction(unittest.TestCase): | ||
    def test_judge(self): | ||
        for sentence, quotemode, expected in test_cases: | ||
            result = judge(sentence, get_quote=(quotemode == 'Quote'))[0] | ||
            self.assertEqual( | ||
                result, expected, f"Failed for input: {sentence}. Expected: {expected}, Quote Mode: {quotemode} but got: {result}") | ||
|
||
|
||
if __name__ == "__main__": | ||
    unittest.main() |
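`load_test_sentences` above reads one `sentence|quotemode|expected` record per line, skipping blank lines and `#` comments. A minimal sketch of the same parsing on an in-memory string follows. The name `parse_test_cases` is hypothetical, and two behaviours are assumptions added here rather than taken from the original loader: each field is stripped (the original keeps the trailing space that precedes `|` in some data rows), and rows without exactly three fields are skipped instead of raising.

```python
def parse_test_cases(text: str) -> list[tuple[str, bool, str]]:
    """Parse `sentence|quotemode|expected` rows into (sentence, use_quotes, label)."""
    cases = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("|")]
        # Assumption: silently skip malformed rows rather than fail.
        if len(fields) != 3:
            continue
        sentence, quotemode, expected = fields
        cases.append((sentence, quotemode == "Quote", expected))
    return cases
```

Returning the quote mode as a boolean mirrors how the test passes `get_quote=(quotemode == 'Quote')` to `judge`.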
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
你喺邊度|NoQuote|Cantonese | ||
乜你今日唔使返學咩|NoQuote|Cantonese | ||
今日好可能會嚟唔到|NoQuote|Cantonese | ||
我哋影張相留念|NoQuote|Cantonese | ||
你在哪裏|NoQuote|SWC | ||
家長也應做好家居防蚊措施|NoQuote|SWC | ||
教育不只是為了傳授知識|NoQuote|SWC | ||
是咁的|NoQuote|Mixed | ||
佢在屋企吃飯|NoQuote|Mixed | ||
去學校讀書|NoQuote|Neutral | ||
做人最重要開心|NoQuote|Neutral | ||
外交部駐香港特別行政區特派員公署副特派員|NoQuote|Neutral | ||
全日制或大學生於晚市星期一至星期四一天前訂座|NoQuote|Neutral | ||
這就是「你哋都戇鳩嘅」的意思 |Quote|CantoneseQuotesInSWC | ||
今天我是一個「冇嘢好做」的狀態 |Quote|CantoneseQuotesInSWC | ||
他們跟我說:「是咁的,即係噉講」 |Quote|MixedQuotesInSWC | ||
他說:「佢在屋企吃飯」 |Quote|MixedQuotesInSWC |
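The `Quote` rows above expect a label that depends on two parts of the sentence: the text inside 「」 and the text around it. One way to picture this is to label the two parts separately and combine the results. The sketch below does that under stated assumptions: the feature sets are hypothetical and tiny, the names `coarse_label` and `fine_label` are illustrative, and the combination rule is a guess based only on the label descriptions, not the project's actual logic.

```python
import re

# Hypothetical, deliberately tiny feature sets for illustration only.
CANTONESE_FEATURES = re.compile(r"[喺咁嘅哋佢唔冇嚟]")
SWC_FEATURES = re.compile(r"[在是的這哪他她們]")
# Text enclosed in corner brackets, the quotation marks used in the test data.
QUOTED = re.compile(r"「([^」]*)」")


def coarse_label(text: str) -> str:
    """Four-way label from the presence of Cantonese and SWC features."""
    has_cantonese = bool(CANTONESE_FEATURES.search(text))
    has_swc = bool(SWC_FEATURES.search(text))
    if has_cantonese and has_swc:
        return "Mixed"
    if has_cantonese:
        return "Cantonese"
    return "SWC" if has_swc else "Neutral"


def fine_label(text: str) -> str:
    """Label quoted and unquoted parts separately, then combine them."""
    quoted = "".join(QUOTED.findall(text))
    surrounding = QUOTED.sub("", text)
    # Assumed rule: the quote-specific labels apply only when the
    # surrounding text is SWC on its own.
    if quoted and coarse_label(surrounding) == "SWC":
        inner = coarse_label(quoted)
        if inner == "Cantonese":
            return "CantoneseQuotesInSWC"
        if inner == "Mixed":
            return "MixedQuotesInSWC"
    return coarse_label(text)
```

Sentences without quotation marks fall through to the coarse label, which matches the four-way behaviour of the `NoQuote` rows.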