This repository contains the two modules developed for machine transliteration from raw data (Bengali glyphs typed through the s550 font) to Meeteilon (Meetei/Meitei Mayek Unicode). It includes:

- glyph correction (s-550 → bn_unicode)
- transliteration (bn_unicode → mm_unicode)

1. Clone this repository: `git clone https://github.com/hoomexsun/xlit.git`
2. Install the Python requirements listed in `requirements.txt`.
3. Add your input file, specify its location, and run either `main.py` or `run.py`, for example `python run.py`.

For custom usage, complete steps 1 and 2 above, then follow the steps below.

For glyph correction:

1. Add your input file.
2. Extract the string from the file and call either `gc.correct()` or `gc.correct_words()`.

   ```python
   # run.py
   from pathlib import Path

   from src.gc_ import GlyphCorrection

   content = Path("<YOUR_FILE_PATH>").read_text(encoding="utf-8")
   gc = GlyphCorrection()
   output_1 = gc.correct_words(content)  # For huge text
   # or
   output_2 = gc.correct(content)  # Simpler
   ```

3. Now, run `run.py`.

For transliteration:

1. Add your input file.
2. Extract the string from the file and call either `mt.transliterate()` or `mt.transliterate_words()`.

   ```python
   # run.py
   from pathlib import Path

   from src.mt_ import MTransliteration

   content = Path("<YOUR_FILE_PATH>").read_text(encoding="utf-8")
   mt = MTransliteration()
   output_1 = mt.transliterate_words(content)  # For huge text
   # or
   output_2 = mt.transliterate(content)  # Simpler
   ```

3. Now, run `run.py`.
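
Because glyph correction produces Bengali Unicode and transliteration consumes it, the two calls can be chained. The sketch below relies only on the two method names shown above, `correct()` and `transliterate()`, and assumes each takes and returns a string. The helper `s550_to_meetei_mayek` and the stub classes are hypothetical stand-ins so the snippet runs without the repository; in real use you would pass `GlyphCorrection()` and `MTransliteration()` instead.

```python
from typing import Protocol


class Corrector(Protocol):
    def correct(self, text: str) -> str: ...


class Transliterator(Protocol):
    def transliterate(self, text: str) -> str: ...


def s550_to_meetei_mayek(text: str, gc: Corrector, mt: Transliterator) -> str:
    """Hypothetical helper: glyph-correct first, then transliterate."""
    bn_unicode = gc.correct(text)        # s550 glyphs -> Bengali Unicode
    return mt.transliterate(bn_unicode)  # Bengali Unicode -> Meetei Mayek


# Stub objects (NOT the repository's classes) that only tag the text,
# so the data flow is visible without the real modules.
class StubGC:
    def correct(self, text: str) -> str:
        return text.replace("s550:", "bn:")


class StubMT:
    def transliterate(self, text: str) -> str:
        return text.replace("bn:", "mm:")


result = s550_to_meetei_mayek("s550:word", StubGC(), StubMT())
print(result)
```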
The repository contains a high-level implementation in Python, and its contents are organized by directory. Refer to the Theory section for a better understanding.

- The modules are stored in the `src` directory. Because it is large, it is not described further here. You can modify or extend the implementation for your own work.
- The data is stored in the `data` directory. You can use your own data to test the methods.
- Additionally, baseline models are included in `src`.

1. Add this repository as a submodule: `git submodule add https://github.com/hoomexsun/xlit.git`
2. Create a `GlyphCorrection` or `MTransliteration` object after importing, then use its functions.

   ```python
   from xlit import GlyphCorrection, MTransliteration

   gc = GlyphCorrection()
   mt = MTransliteration()
   ...
   ```

This repository contains a built-in function, `run()`, in `run.py`, which can be used for experimenting. It runs the given function in the specified mode.

Parameters:

- `func` (Callable): The function to be executed.
- `mode` (str): The mode in which to run the function. Options are "evaluate", "simple", "detailed", and "wordmap".
- `model_name` (str, optional): The name of the model. Defaults to "Proposed".
- `root_dir` (str | Path, optional): The root directory for input/output files. Defaults to an empty string.

The table below lists the modes and what each one does in the two modules that contain the callable functions.

Modes | src.mt_ | src.gc_ |
---|---|---|
'simple' | Transliteration of Bengali text inside a file. | Glyph correction of s550 text inside a file. |
'detailed' | Step-wise transliteration, including syllabified Bengali words, phonemes and Meetei Mayek words, from a list of Bengali words inside a file. | Step-wise glyph correction, showing every step, from a list of s550 words inside a file. |
'wordmap' | Building a wordmap (json, csv & txt) from a list of Bengali words inside a file. | Building a wordmap (json, csv & txt) from a list of s550 words inside a file. |
'evaluate' | Evaluation (Accuracy & CER) of a list of parallel Bengali words and Meetei Mayek words inside a file by comparing edit distances. | Evaluation (Accuracy & CER) of a list of parallel Unicode-incompatible s550 words and Bengali Unicode words inside a file by comparing edit distances. |

The methods can be called in script mode via `main.py` as follows:

```
usage: main.py [-h] [-m] [-g] [-d] [-w] [-e] [--root ROOT]

Run from main

options:
  -h, --help   show this help message and exit
  -m           Select module mt
  -g           Select module gc
  -d           Enable detailed mode
  -w           Enable wordmap mode
  -e           Enable evaluation mode
  --root ROOT  Directory path which contains words.txt or targets.txt
```

If neither the input file nor the output directory is specified, the defaults defined in the functions are used.
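
For readers wiring these flags into their own script, here is a minimal `argparse` sketch that reproduces the option list above. It is not taken from `main.py`: the function names `build_parser` and `pick_mode` are hypothetical, and the precedence used when several mode flags are given (evaluate, then wordmap, then detailed, else simple) is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the same flags as the usage text above."""
    parser = argparse.ArgumentParser(prog="main.py", description="Run from main")
    parser.add_argument("-m", action="store_true", help="Select module mt")
    parser.add_argument("-g", action="store_true", help="Select module gc")
    parser.add_argument("-d", action="store_true", help="Enable detailed mode")
    parser.add_argument("-w", action="store_true", help="Enable wordmap mode")
    parser.add_argument("-e", action="store_true", help="Enable evaluation mode")
    parser.add_argument(
        "--root",
        default="",
        help="Directory path which contains words.txt or targets.txt",
    )
    return parser


def pick_mode(args: argparse.Namespace) -> str:
    """Map the boolean flags to one of the four mode strings.

    ASSUMPTION: this precedence order is illustrative only.
    """
    if args.e:
        return "evaluate"
    if args.w:
        return "wordmap"
    if args.d:
        return "detailed"
    return "simple"


args = build_parser().parse_args(["-m", "-e", "--root", "data"])
print(pick_mode(args), args.root)
```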
- word accuracy = c/M, where c is the number of words with err = 0
- character accuracy = 1 - (Σ err)/N, that is, 1 - CER
- where M is the total number of words
- where N is the total number of characters
- where err is the minimum edit distance needed to correct a word
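
As a rough illustration of how such metrics are commonly computed, the sketch below uses a standard Levenshtein distance and assumes the conventional definitions: word accuracy is the fraction of words whose edit distance to the target is zero, and CER is the summed edit distance divided by the total number of target characters. The function names `edit_distance` and `evaluate` are hypothetical and not taken from the repository.

```python
from typing import Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def evaluate(predictions: Sequence[str], targets: Sequence[str]) -> Tuple[float, float]:
    """Return (word_accuracy, cer) for parallel lists of words.

    ASSUMPTION: N counts the characters of the target words.
    """
    errs = [edit_distance(p, t) for p, t in zip(predictions, targets)]
    m = len(targets)                  # M: total number of words
    n = sum(len(t) for t in targets)  # N: total number of characters
    word_accuracy = sum(e == 0 for e in errs) / m
    cer = sum(errs) / n
    return word_accuracy, cer


acc, cer = evaluate(["abc", "abd"], ["abc", "abc"])
print(f"word accuracy = {acc:.2f}, CER = {cer:.4f}")
```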
Check out the GUI, built using tkinter, at XLIT GUI.
This repository is an implementation of a paper currently submitted for review. To replicate the paper, run `replicate_paper_gc.py` and `replicate_paper_mt.py`.

Visualization of the Transliteration module.

Input: Unicode used as Bengali Glyph, A or {a₀, a₁, …, aₘ₋₁}

Output: Correct Bengali Unicode, B or {b₀, b₁, …, bₙ₋₁}

- Pre-adjust glyphs in A
- B ← map_unicode[A]; B ← (B ∪ `r_glyph`) - A
- for bᵢ in B:
  - if bᵢ is `r glyph` written on right:
    - bᵢ is removed and inserted using Jump (Reverse)
    - bᵢ ← map_unicode[bᵢ]
- for bᵢ in B:
  - if bᵢ is `vowel` written on left:
    - bᵢ is removed and inserted using Jump
  - if bᵢ and bᵢ₊₁ are actually vowel written by enclosing:
    - bᵢ and bᵢ₊₁ are replaced with correct unicode
- Return the resulting string B
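
To make the vowel-handling step more concrete, here is a small, simplified sketch of reordering a left-written vowel sign and merging an enclosing vowel pair. It is not the repository's implementation. It assumes that "Jump" means moving the vowel sign to just after the following consonant or conjunct, it covers only the three Bengali vowel signs that render on the left (U+09BF, U+09C7, U+09C8), and the function names are hypothetical. The two merge pairs follow the Unicode canonical decompositions of U+09CB and U+09CC.

```python
LEFT_VOWELS = {"\u09bf", "\u09c7", "\u09c8"}  # vowel signs i, e, ai
HASANTA = "\u09cd"                            # Bengali virama
# Two-part vowel signs: (left part, right part) -> single code point
ENCLOSING = {
    ("\u09c7", "\u09be"): "\u09cb",  # e + aa     -> o
    ("\u09c7", "\u09d7"): "\u09cc",  # e + au mark -> au
}


def is_consonant(ch: str) -> bool:
    """Rough check: main Bengali consonant block, KA (U+0995) to HA (U+09B9)."""
    return "\u0995" <= ch <= "\u09b9"


def jump_left_vowels(text: str) -> str:
    """Move each left-written vowel sign to after the next consonant/conjunct."""
    chars = list(text)
    out = []
    i = 0
    while i < len(chars):
        ch = chars[i]
        if ch in LEFT_VOWELS and i + 1 < len(chars) and is_consonant(chars[i + 1]):
            j = i + 1
            # Extend over conjuncts of the form consonant + hasanta + consonant
            while j + 2 < len(chars) and chars[j + 1] == HASANTA:
                j += 2
            out.extend(chars[i + 1:j + 1])  # the consonant or conjunct
            out.append(ch)                  # the vowel sign now follows it
            i = j + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def merge_enclosing(text: str) -> str:
    """Replace adjacent two-part vowel signs with their single code point."""
    out = []
    for ch in text:
        if out and (out[-1], ch) in ENCLOSING:
            out[-1] = ENCLOSING[(out[-1], ch)]
        else:
            out.append(ch)
    return "".join(out)


fixed = merge_enclosing(jump_left_vowels("\u09c7\u0995\u09be"))
print([hex(ord(c)) for c in fixed])
```

The r glyph step and the pre-adjustment step are left out here because they depend on the font-specific mapping table.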

Input: List of phonemes, {p₀, p₁, …, pₙ₋₁} ∈ P ∪ {`U+09CD`}

Output: Meetei Mayek String, S

- initialize S ← mm_begin[p₀]
- assign flag ← True if p₀ is vowel else False
- for each phoneme pᵢ from i ← 1 to n-1:
  - if pᵢ is `U+09CD`:
    - if i ≠ n-1:
      - append mm_char_apun to S
  - else if flag is True:
    - append mm_end[pᵢ] to S
    - flag ← False
  - else if pᵢ is consonant:
    - append mm_begin[pᵢ] to S
  - else:
    - flag ← True
    - append mm_begin[pᵢ] to S
- Return the resulting string S
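
The phoneme-to-Meetei-Mayek steps above can be sketched in Python as follows. The control flow mirrors the listed steps, but everything else is an assumption for illustration: the phoneme labels, the tiny `MM_BEGIN` and `MM_END` tables, and the `APUN` marker are placeholders written in Latin letters, not the repository's actual mapping tables or Meetei Mayek code points.

```python
from typing import List

VIRAMA = "\u09cd"  # U+09CD, Bengali hasanta
APUN = "+"         # placeholder standing in for mm_char_apun

# Toy tables (hypothetical): upper case = "begin" form, lower case = "end" form
VOWELS = {"a", "i"}
MM_BEGIN = {"k": "K", "m": "M", "a": "A", "i": "I"}
MM_END = {"k": "k", "m": "m", "a": "a", "i": "i"}


def phonemes_to_mm(phonemes: List[str]) -> str:
    """Follow the steps above: begin forms by default, end form after a vowel."""
    if not phonemes:
        return ""
    s = [MM_BEGIN[phonemes[0]]]
    flag = phonemes[0] in VOWELS  # True when the previous phoneme was a vowel
    n = len(phonemes)
    for i in range(1, n):
        p = phonemes[i]
        if p == VIRAMA:
            if i != n - 1:        # a trailing virama is dropped
                s.append(APUN)
        elif flag:
            s.append(MM_END[p])   # phoneme right after a vowel takes the end form
            flag = False
        elif p not in VOWELS:     # consonant
            s.append(MM_BEGIN[p])
        else:                     # vowel
            flag = True
            s.append(MM_BEGIN[p])
    return "".join(s)


print(phonemes_to_mm(["k", "a", "m"]))
print(phonemes_to_mm(["k", VIRAMA, "m", "a"]))
```

With these toy tables, the upper/lower case of each output letter shows which table was consulted at each position.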

Model | Indigenous words | Exotic words | News Corpus | Literature Corpus |
---|---|---|---|---|
Baseline 1 | 38.92 | 29.88 | 29.08 | 54.32 |
Baseline 2 | 54.88 | 47.08 | 46.06 | 69.52 |
Proposed | 85.02 | 86.18 | 84.48 | 90.08 |

Model | Indigenous words | Exotic words | News Corpus | Literature Corpus |
---|---|---|---|---|
Baseline 1 | 11.86 | 14.94 | 14.50 | 9.01 |
Baseline 2 | 8.97 | 8.77 | 9.51 | 6.58 |
Proposed | 2.50 | 2.23 | 2.56 | 1.61 |