Merge pull request #2 from CanCLID/chaak
Publish to PyPI
chaaklau authored Jun 28, 2024
2 parents 3125f98 + f648607 commit 2a26921
Showing 4 changed files with 174 additions and 0 deletions.
56 changes: 56 additions & 0 deletions .github/workflows/workflow.yml
@@ -0,0 +1,56 @@
name: Publish to PyPI

jobs:
  pypi:
    name: upload release to PyPI
    runs-on: ubuntu-latest
    permissions:
      id-token: write
    steps:
      - uses: actions/checkout@v3

      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install setuptools wheel twine
      - name: Run tests
        run: |
          python -m unittest tests/test_judge.py
      - name: Update version in setup.py
        run: |
          version=$(python -c "exec(open('cantonesedetect/version.py').read()); print(__version__)")
          sed -i "s/version=.*/version='$version',/" setup.py
      - name: Build and publish
        run: |
          python setup.py sdist bdist_wheel
          twine upload dist/*
      - name: mint API token
        id: mint-token
        run: |
          # retrieve the ambient OIDC token
          resp=$(curl -H "Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN" \
            "$ACTIONS_ID_TOKEN_REQUEST_URL&audience=pypi")
          oidc_token=$(jq -r '.value' <<< "${resp}")
          # exchange the OIDC token for an API token
          resp=$(curl -X POST https://pypi.org/_/oidc/mint-token -d "{\"token\": \"${oidc_token}\"}")
          api_token=$(jq -r '.token' <<< "${resp}")
          # mask the newly minted API token, so that we don't accidentally leak it
          echo "::add-mask::${api_token}"
          # see the next step in the workflow for an example of using this step output
          echo "api-token=${api_token}" >> "${GITHUB_OUTPUT}"
      - name: publish
        # gh-action-pypi-publish uses TWINE_PASSWORD automatically
        uses: pypa/gh-action-pypi-publish@release/v1
        with:
          password: ${{ steps.mint-token.outputs.api-token }}
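The "Update version in setup.py" step above reads `__version__` from `cantonesedetect/version.py` and rewrites the `version=` line of `setup.py` with `sed`. A rough Python equivalent is sketched below for readers who want to trace what the step does. It is an illustration, not part of the commit: the function names and the sample file contents are hypothetical, and it reads the version with a regular expression instead of `exec`.

```python
import re


def read_version(version_source: str) -> str:
    """Extract the string assigned to __version__ in version.py source text."""
    match = re.search(r"""__version__\s*=\s*["']([^"']+)["']""", version_source)
    if match is None:
        raise ValueError("__version__ not found")
    return match.group(1)


def set_setup_version(setup_source: str, version: str) -> str:
    """Rewrite every `version=...` line tail, mirroring the sed substitution."""
    # A function replacement avoids any escape processing of the version string.
    return re.sub(r"version=.*", lambda _: f"version='{version}',", setup_source)


# Hypothetical file contents, used only to exercise the two helpers.
version_py = '__version__ = "1.2.3"\n'
setup_py = "setup(\n    name='cantonesedetect',\n    version='0.0.0',\n)\n"

print(set_setup_version(setup_py, read_version(version_py)))
```

Like the `sed` command, the pattern matches any text of the form `version=` up to the end of the line, so a keyword such as a hypothetical `min_version=` would be rewritten too.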
72 changes: 72 additions & 0 deletions README.md
@@ -0,0 +1,72 @@
# CantoneseDetect 粵語特徵分類器

[![license](https://img.shields.io/github/license/DAVFoundation/captain-n3m0.svg?style=flat-square)](https://github.com/DAVFoundation/captain-n3m0/blob/master/LICENSE)

本項目為 [CantoFilter](https://github.com/CanCLID/cantonese-classifier) 之後續。
This is an extension of the [CantoFilter](https://github.com/CanCLID/cantonese-classifier) project.

## 引用 Citation

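抽出字詞特徵嘅策略同埋實踐方式,喺下面整理。討論本分類器時,請引用:
The strategy and implementation for extracting lexical features are described in the paper below. When discussing this classifier, please cite: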

Chaak-ming Lau, Mingfei Lau, and Ann Wai Huen To. 2024.
[The Extraction and Fine-grained Classification of Written Cantonese Materials through Linguistic Feature Detection.](https://aclanthology.org/2024.eurali-1.4/)
In Proceedings of the 2nd Workshop on Resources and Technologies for Indigenous, Endangered and Lesser-resourced Languages in Eurasia (EURALI)
@ LREC-COLING 2024, pages 24–29, Torino, Italia. ELRA and ICCL.

分類器採用嘅分類標籤及基準,參考咗對使用者嘅語言意識形態嘅研究。討論分類準則時,請引用:

The definitions and boundaries of the labels depend on the user's language ideology.
When discussing the criteria adopted by this tool, please cite:

Lau, Chaak Ming. 2024. Ideologically driven divergence in Cantonese vernacular writing practices. In J.-F. Dupré, editor, _Politics of Language in Hong Kong_, Routledge.

---

## 簡介 Introduction

分類方法係利用粵語同書面中文嘅特徵字詞,用 Regex 方式加以識別。

The classifier uses regex rules to detect lexical features specific to Cantonese or to Standard Written Chinese (SWC).

### 標籤 Labels

分類器會將輸入文本分成四類(粗疏)或六類(精細),分類如下:
The classifier sorts input text into four (coarse) or six (fine-grained) categories. The labels are:

1. `Cantonese`: 純粵文,僅含有粵語特徵字詞,例如“你喺邊度” | Pure Cantonese text, containing only Cantonese-feature words. E.g. 你喺邊度
1. `SWC`: 書面中文,僅含有書面語特徵字詞,例如“你在哪裏” | Pure Standard Written Chinese (SWC) text, containing only Mandarin-feature words. E.g. 你在哪裏
1. `Mixed`: 書粵混雜文,同時含有書面語同粵語特徵嘅字詞,例如“是咁的” | Mixed Cantonese-Mandarin text, containing both Cantonese-feature and Mandarin-feature words. E.g. 是咁的
1. `Neutral`: 無特徵中文,唔含有官話同粵語特徵,既可以當成粵文亦可以當成官話文,例如“去學校讀書” | Featureless Chinese text, containing neither Cantonese-feature nor Mandarin-feature words. Such sentences can go into either a Cantonese or a Mandarin text corpus. E.g. 去學校讀書
1. `MixedQuotesInSWC`: 書面中文,引文入面係 `Mixed` | SWC text that quotes `Mixed` content
1. `CantoneseQuotesInSWC`: 書面中文,引文入面係純粵文 `Cantonese` | SWC text that quotes `Cantonese` content
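
The four coarse labels follow from two yes/no questions: does the text contain any Cantonese feature, and does it contain any SWC feature? The sketch below illustrates that decision with two tiny, hypothetical regex patterns. The patterns and the `coarse_label` function are assumptions made for this example; the package's real rules are far more extensive and are not reproduced here.

```python
import re

# Hypothetical, heavily abbreviated feature patterns (illustration only).
CANTONESE_FEATURES = re.compile(r"[喺咁嘅唔佢哋冇嚟]|邊度")
SWC_FEATURES = re.compile(r"[在是的這他吃不]|哪裏")


def coarse_label(text: str) -> str:
    """Return one of the four coarse labels for a piece of text."""
    has_cantonese = bool(CANTONESE_FEATURES.search(text))
    has_swc = bool(SWC_FEATURES.search(text))
    if has_cantonese and has_swc:
        return "Mixed"
    if has_cantonese:
        return "Cantonese"
    if has_swc:
        return "SWC"
    return "Neutral"


for sentence in ["你喺邊度", "你在哪裏", "是咁的", "去學校讀書"]:
    print(sentence, coarse_label(sentence))
```

Matching single characters like this is crude: it will mislabel many real sentences, which is why a practical rule set needs word-level and context-sensitive patterns.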

## 用法 Usage

### 系統要求 Requirement

Python >= 3.11

### 安裝 Installation

首先用 pip 安裝:
First, install with pip:

```bash
pip install cantonesedetect
```

### Python

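Call `judge()` with the text to classify. As in the project's tests, the label is the first element of the return value.

```python
from cantonesedetect.judge import judge

print(judge('你喺邊度')[0])  # Cantonese
print(judge('你在哪裏')[0])  # SWC
print(judge('是咁的')[0])  # Mixed
print(judge('去學校讀書')[0])  # Neutral
```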
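
The tests in this commit also pass `get_quote=True` to `judge()` for sentences expected to yield `CantoneseQuotesInSWC` or `MixedQuotesInSWC`. How the package separates quotations from the surrounding text is not shown in this commit. One plausible approach, sketched under the assumption that quotations are delimited by 「」, is to split the input into the text outside the quotes and the quoted segments. `split_quotes` is a hypothetical helper, not part of the package's API.

```python
import re

# Assumption: quotations are delimited by corner brackets and are not nested.
QUOTE = re.compile(r"「([^「」]*)」")


def split_quotes(text: str) -> tuple[str, list[str]]:
    """Return (text outside the quotes, list of quoted segments)."""
    quotes = QUOTE.findall(text)
    matrix = QUOTE.sub("", text)
    return matrix, quotes


matrix, quotes = split_quotes("他說:「佢在屋企吃飯」")
print(matrix)  # 他說:
print(quotes)  # ['佢在屋企吃飯']
```

Each part could then be labelled on its own: SWC text outside the quotes combined with a `Mixed` quotation would correspond to `MixedQuotesInSWC`.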

### CLI
待補充 to be added.
29 changes: 29 additions & 0 deletions tests/test_judge.py
@@ -0,0 +1,29 @@
from cantonesedetect.judge import judge
import unittest


def load_test_sentences(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        lines = [line.strip() for line in file if line.strip()
                 and not line.startswith('#')]
    test_cases = []
    for line in lines:
        if '|' in line:
            sentence, quotemode, expected = line.split('|')
            test_cases.append((sentence, quotemode, expected))
    return test_cases


test_cases = load_test_sentences('tests/test_judge_sentences.txt')


class TestJudgeFunction(unittest.TestCase):
    def test_judge(self):
        for sentence, quotemode, expected in test_cases:
            result = judge(sentence, get_quote=(quotemode == 'Quote'))[0]
            self.assertEqual(
                result, expected, f"Failed for input: {sentence}. Expected: {expected}, Quote Mode: {quotemode} but got: {result}")


if __name__ == "__main__":
    unittest.main()
17 changes: 17 additions & 0 deletions tests/test_judge_sentences.txt
@@ -0,0 +1,17 @@
你喺邊度|NoQuote|Cantonese
乜你今日唔使返學咩|NoQuote|Cantonese
今日好可能會嚟唔到|NoQuote|Cantonese
我哋影張相留念|NoQuote|Cantonese
你在哪裏|NoQuote|SWC
家長也應做好家居防蚊措施|NoQuote|SWC
教育不只是為了傳授知識|NoQuote|SWC
是咁的|NoQuote|Mixed
佢在屋企吃飯|NoQuote|Mixed
去學校讀書|NoQuote|Neutral
做人最重要開心|NoQuote|Neutral
外交部駐香港特別行政區特派員公署副特派員|NoQuote|Neutral
全日制或大學生於晚市星期一至星期四一天前訂座|NoQuote|Neutral
這就是「你哋都戇鳩嘅」的意思 |Quote|CantoneseQuotesInSWC
今天我是一個「冇嘢好做」的狀態 |Quote|CantoneseQuotesInSWC
他們跟我說:「是咁的,即係噉講」 |Quote|MixedQuotesInSWC
他說:「佢在屋企吃飯」 |Quote|MixedQuotesInSWC
