Merge pull request #2 from CanCLID/chaak

Publish to PyPI
CanCLID · Jun 28, 2024 · 2a26921
2 parents 3125f98 + f648607

Showing 4 changed files with 174 additions and 0 deletions.
diff --git a/.github/workflows/workflow.yml b/.github/workflows/workflow.yml
@@ -0,0 +1,56 @@
+name: Publish to PyPI
+
+jobs:
+  pypi:
+    name: upload release to PyPI
+    runs-on: ubuntu-latest
+    permissions:
+      id-token: write
+    steps:
+      - uses: actions/checkout@v3
+
+      - uses: actions/setup-python@v4
+        with:
+          python-version: "3.11"
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install setuptools wheel twine
+      - name: Run tests
+        run: |
+          python -m unittest tests/test_judge.py
+
+      - name: Update version in setup.py
+        run: |
+          version=$(python -c "exec(open('cantonesedetect/version.py').read()); print(__version__)")
+          sed -i "s/version=.*/version='$version',/" setup.py
+
+      - name: Build distributions
+        run: |
+          # build only; the upload happens in the publish step, once the API token has been minted
+          python setup.py sdist bdist_wheel
+
+      - name: mint API token
+        id: mint-token
+        run: |
+          # retrieve the ambient OIDC token
+          resp=$(curl -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
+            "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=pypi")
+          oidc_token=$(jq -r '.value' <<< "${resp}")
+
+          # exchange the OIDC token for an API token
+          resp=$(curl -X POST https://pypi.org/_/oidc/mint-token -d "{\"token\": \"${oidc_token}\"}")
+          api_token=$(jq -r '.token' <<< "${resp}")
+
+          # mask the newly minted API token, so that we don't accidentally leak it
+          echo "::add-mask::${api_token}"
+
+          # see the next step in the workflow for an example of using this step output
+          echo "api-token=${api_token}" >> "${GITHUB_OUTPUT}"
+
+      - name: publish
+        # gh-action-pypi-publish uses TWINE_PASSWORD automatically
+        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          password: ${{ steps.mint-token.outputs.api-token }}
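The "Update version in setup.py" step relies on `exec` plus a broad `sed` pattern that rewrites everything after `version=` on any matching line. As a rough sketch of the same idea in Python, the two helpers below read `__version__` from source text and replace only the quoted value of the first `version=` argument. The names `read_version` and `set_setup_version` are hypothetical and are not part of this repository.

```python
import re


def read_version(version_py_source: str) -> str:
    """Return the string assigned to __version__ in the given source text."""
    match = re.search(
        r"""^__version__\s*=\s*['"]([^'"]+)['"]""",
        version_py_source,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no __version__ assignment found")
    return match.group(1)


def set_setup_version(setup_py_source: str, version: str) -> str:
    """Replace the quoted value of the first version=... argument."""
    # \b keeps the pattern from matching inside names such as python_version.
    new_source, count = re.subn(
        r"""\bversion\s*=\s*['"][^'"]*['"]""",
        lambda _match: f"version='{version}'",
        setup_py_source,
        count=1,
    )
    if count == 0:
        raise ValueError("no version= argument found")
    return new_source
```

Unlike the `sed` expression, this leaves the trailing comma and anything else on the line untouched, and it fails loudly when no version is found instead of silently publishing a stale one.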
diff --git a/README.md b/README.md
@@ -0,0 +1,72 @@
+# CantoneseDetect 粵語特徵分類器
+
+[![license](https://img.shields.io/github/license/DAVFoundation/captain-n3m0.svg?style=flat-square)](https://github.com/DAVFoundation/captain-n3m0/blob/master/LICENSE)
+
+本項目為 [CantoFilter](https://github.com/CanCLID/cantonese-classifier) 之後續。
+This is an extension of the [CantoFilter](https://github.com/CanCLID/cantonese-classifier) project.
+
+## 引用 Citation
+
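+抽出字詞特徵嘅策略同埋實踐方式，喺下面整理。討論本分類器時，請引用： The feature-extraction strategy and its implementation are described in the paper below. When discussing this classifier, please cite: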
+
+Chaak-ming Lau, Mingfei Lau, and Ann Wai Huen To. 2024. 
+[The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection.](https://aclanthology.org/2024.eurali-1.4/) 
+In Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI) 
+@ LREC-COLING 2024, pages 24–29, Torino, Italia. ELRA and ICCL.
+
+分類器採用嘅分類標籤及基準，參考咗對使用者嘅語言意識形態嘅研究。討論分類準則時，請引用：
+
+The definitions and boundaries of the labels depend on the user's language ideology. 
+When discussing the criteria adopted by this tool, please cite:
+
+Lau, Chaak Ming. 2024. Ideologically driven divergence in Cantonese vernacular writing practices. In J.-F. Dupré, editor, _Politics of Language in Hong Kong_, Routledge.
+
+---
+
+## 簡介 Introduction
+
+分類方法係利用粵語同書面中文嘅特徵字詞，用 Regex 方式加以識別。
+
+The classifier uses regular expressions (regex) to detect lexical features specific to Cantonese or to written Chinese.
+
+### 標籤 Labels
+
+分類器會將輸入文本分成四類（粗疏）或六類（精細），分類如下:
+The classifier sorts input text into four (coarse) or six (fine-grained) categories. The labels are:
+
+1. `Cantonese`: 純粵文，僅含有粵語特徵字詞，例如“你喺邊度” | Pure Cantonese text, containing only Cantonese-featured words. E.g. 你喺邊度
+1. `SWC`: 書面中文，僅含有書面語特徵字詞，例如“你在哪裏” | Pure Standard Written Chinese (SWC) text, containing only Mandarin-featured words. E.g. 你在哪裏
+1. `Mixed`: 書粵混雜文，同時含有書面語同粵語特徵嘅字詞，例如“是咁的” | Mixed Cantonese-Mandarin text, containing both Cantonese-featured and Mandarin-featured words. E.g. 是咁的
+1. `Neutral`: 無特徵中文，唔含有官話同粵語特徵，既可以當成粵文亦可以當成官話文，例如“去學校讀書” | Featureless Chinese text, containing neither Cantonese-featured nor Mandarin-featured words. Such sentences can go into either a Cantonese or a Mandarin text corpus. E.g. 去學校讀書
+1. `MixedQuotesInSWC`: 書面中文，引文入面係 `Mixed` | SWC text that quotes `Mixed` content
+1. `CantoneseQuotesInSWC`: 書面中文，引文入面係純粵文 `Cantonese` | SWC text that quotes `Cantonese` content
+
+## 用法 Usage
+
+### 系統要求 Requirements
+
+Python >= 3.11
+
+### 安裝 Installation
+
+首先用 pip 安裝 First, install with pip:
+
+```bash
+pip install cantonesedetect
+```
+
+### Python
+
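+Call `judge()` on the input text; the first element of its return value is the label (this is how `tests/test_judge.py` reads it):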
+
+```python
+from cantonesedetect.judge import judge
+
+print(judge('你喺邊度')[0])  # Cantonese
+print(judge('你在哪裏')[0])  # SWC
+print(judge('是咁的')[0])  # Mixed
+print(judge('去學校讀書')[0])  # Neutral
+```
+
+### CLI
+待補充 to be added.
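The README describes the approach as regex matching on feature words. As a rough illustration only, a toy version of that idea could look like the sketch below. The two character sets are hypothetical, hand-picked examples rather than the package's actual rules, and `toy_judge` is not part of the `cantonesedetect` API.

```python
import re

# Hypothetical, hand-picked feature characters for illustration only.
# The package's real rules are not shown in this diff.
CANTONESE_FEATURES = re.compile(r"[喺嘅咁唔佢哋冇]")
SWC_FEATURES = re.compile(r"[在是的哪這他們]")


def toy_judge(text: str) -> str:
    """Classify text into one of the four coarse labels from the README."""
    has_cantonese = CANTONESE_FEATURES.search(text) is not None
    has_swc = SWC_FEATURES.search(text) is not None
    if has_cantonese and has_swc:
        return "Mixed"
    if has_cantonese:
        return "Cantonese"
    if has_swc:
        return "SWC"
    return "Neutral"
```

With these sets, the four README examples (你喺邊度, 你在哪裏, 是咁的, 去學校讀書) map to `Cantonese`, `SWC`, `Mixed`, and `Neutral` respectively. Single characters over-match (的 also occurs in Cantonese words such as 的士), so a realistic rule set would need to work on words and context rather than isolated characters.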
diff --git a/tests/test_judge.py b/tests/test_judge.py
@@ -0,0 +1,29 @@
+from cantonesedetect.judge import judge
+import unittest
+
+
+def load_test_sentences(file_path):
+    with open(file_path, 'r', encoding='utf-8') as file:
+        lines = [line.strip() for line in file if line.strip()
+                 and not line.startswith('#')]
+    test_cases = []
+    for line in lines:
+        if '|' in line:
+            sentence, quotemode, expected = (field.strip() for field in line.split('|'))
+            test_cases.append((sentence, quotemode, expected))
+    return test_cases
+
+
+test_cases = load_test_sentences('tests/test_judge_sentences.txt')
+
+
+class TestJudgeFunction(unittest.TestCase):
+    def test_judge(self):
+        for sentence, quotemode, expected in test_cases:
+            result = judge(sentence, get_quote=(quotemode == 'Quote'))[0]
+            self.assertEqual(
+                result, expected, f"Failed for input: {sentence} (quote mode: {quotemode}). Expected: {expected}, but got: {result}")
+
+
+if __name__ == "__main__":
+    unittest.main()
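The loader above skips lines without a `|` silently and raises a bare unpacking error when a line has the wrong number of fields. A stricter parser, sketched below under the assumption that every data line has exactly three pipe-separated fields, reports the offending line number instead. `JudgeCase` and `parse_cases` are hypothetical names introduced for this sketch.

```python
from typing import Iterable, List, NamedTuple


class JudgeCase(NamedTuple):
    sentence: str
    get_quote: bool
    expected: str


def parse_cases(lines: Iterable[str]) -> List[JudgeCase]:
    """Parse 'sentence|Quote-or-NoQuote|label' lines into JudgeCase tuples."""
    cases = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        # Blank lines and '#' comments are ignored, as in the original loader.
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 3:
            raise ValueError(f"line {number}: expected 3 fields, got {len(fields)}")
        sentence, quotemode, expected = fields
        if quotemode not in ("Quote", "NoQuote"):
            raise ValueError(f"line {number}: unknown quote mode {quotemode!r}")
        cases.append(JudgeCase(sentence, quotemode == "Quote", expected))
    return cases
```

Each parsed case can then be run inside `with self.subTest(sentence=case.sentence):` so that one failing sentence does not hide the results for the rest.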
diff --git a/tests/test_judge_sentences.txt b/tests/test_judge_sentences.txt
@@ -0,0 +1,17 @@
+你喺邊度|NoQuote|Cantonese
+乜你今日唔使返學咩|NoQuote|Cantonese
+今日好可能會嚟唔到|NoQuote|Cantonese
+我哋影張相留念|NoQuote|Cantonese
+你在哪裏|NoQuote|SWC
+家長也應做好家居防蚊措施|NoQuote|SWC
+教育不只是為了傳授知識|NoQuote|SWC
+是咁的|NoQuote|Mixed
+佢在屋企吃飯|NoQuote|Mixed
+去學校讀書|NoQuote|Neutral
+做人最重要開心|NoQuote|Neutral
+外交部駐香港特別行政區特派員公署副特派員|NoQuote|Neutral
+全日制或大學生於晚市星期一至星期四一天前訂座|NoQuote|Neutral
+這就是「你哋都戇鳩嘅」的意思 |Quote|CantoneseQuotesInSWC
+今天我是一個「冇嘢好做」的狀態 |Quote|CantoneseQuotesInSWC
+他們跟我說：「是咁的，即係噉講」 |Quote|MixedQuotesInSWC
+他說：「佢在屋企吃飯」 |Quote|MixedQuotesInSWC
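Since the expected labels in this file are compared as exact strings, a misspelled or outdated label would only surface as a confusing test failure. The sketch below checks the last field of each line against the six labels listed in the README. `KNOWN_LABELS` and `find_unknown_labels` are hypothetical helpers, not part of the repository.

```python
from typing import Iterable, List, Tuple

# The six labels listed in the README's Labels section.
KNOWN_LABELS = {
    "Cantonese",
    "SWC",
    "Mixed",
    "Neutral",
    "MixedQuotesInSWC",
    "CantoneseQuotesInSWC",
}


def find_unknown_labels(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, label) pairs whose label is not a known label."""
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # The expected label is the text after the last '|'.
        label = line.rsplit("|", 1)[-1].strip()
        if label not in KNOWN_LABELS:
            problems.append((number, label))
    return problems
```

For example, a line ending in `|Mandarin` would be reported, because the label used by this test data for Standard Written Chinese is `SWC`.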