OpusCleaner fixes for Japanese and Korean #1001

Open: wants to merge 34 commits into base `main`

ZJaume (Collaborator) commented Jan 21, 2025

[skip ci]

ZJaume and others added 30 commits November 21, 2024 13:37
It removes many sentences that have exactly the same numbers.
This seems to happen when the Chinese side has the numbers glued to
the Chinese characters.
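
One way glued digits can defeat a number-matching filter is a word-boundary pattern: Chinese characters count as word characters in Unicode-aware regexes, so there is no boundary between them and an adjacent digit. The sketch below is illustrative only. The function names and both patterns are assumptions and are not taken from OpusCleaner.

```python
import re

# Hypothetical extractors for illustration, not OpusCleaner's actual patterns.
BOUNDED = re.compile(r"\b\d+\b")  # requires a word boundary on both sides
UNBOUNDED = re.compile(r"\d+")    # matches any run of digits


def numbers_bounded(text: str) -> list[str]:
    """Extract numbers that stand alone as words."""
    return BOUNDED.findall(text)


def numbers_unbounded(text: str) -> list[str]:
    """Extract every run of digits, even when glued to other characters."""
    return UNBOUNDED.findall(text)


english = "I bought 3 apples"
chinese = "我买了3个苹果"  # the digit is glued to the surrounding characters

# The bounded pattern finds the number on the English side only,
# so a filter comparing both sides would see a mismatch.
print(numbers_bounded(english), numbers_bounded(chinese))

# The unbounded pattern finds the same number on both sides.
print(numbers_unbounded(english), numbers_unbounded(chinese))
```

Under these assumptions, a pair with identical numbers would be dropped only because the extractor misses the number on the Chinese side.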
…lds (#945)

* Fix inference in CODEOWNERS file

This fixes an oversight from a previous PR when
`inference-engine` was renamed to `inference`,
however the path was not updated in `CODEOWNERS`.

* Improve eslint string-formatting configuration

This is a miscellaneous change to the eslint config that
now allows different string types based on whether certain
types of quotes need to be escaped within the string.

* Add a --force-rebuild flag to WASM build commands

This commit adds a --force-rebuild flag to the
WASM build commands that will trigger a rebuild
without having to fully clobber and start over.

* Fix misc. formatting in build-bergamot.py

This commit fixes miscellaneous formatting that I noticed
looked misaligned in the terminal. For some reason, some
emojis need two spaces after them, when other emojis only
need one space to achieve the same alignment.

* Rename `appendEndingWhitespace`

This commit renames `appendEndingWhitespace` to
`handleEndingWhitespace`, because the whitespace
logic will be made more complex by this PR, and
whitespace is no longer guaranteed to be appended.

* Add capability to register languages

This commit adds the capability for several of the
C++ classes to register either a source language tag
or a target language tag (depending on their needs).

I had experimented with changing the constructors themselves,
but maintaining backward compatibility got messy very quickly
with native builds continuing to use `ssplit` and WASM builds now
using `Intl.Segmenter`.

The least-invasive and cleanest-to-implement compromise that I came up
with was to add WASM-specific functionality to register the language tags
for classes after construction.

* Implement WASM segmentation with `Intl.Segmenter`

This is the largest commit of the stack, and likely the
one to pay the most attention to.

In addition to utilizing `Intl.Segmenter` instead of `ssplit`
when segmenting text in WASM builds, this patch also necessarily
modifies the logic of how whitespace is handled during translations.

We now have to concern ourselves with whether the source
language and/or target language utilize whitespace between
sentences or omit whitespace between sentences.

For example:

* When translating from Chinese to English, then whitespace
  must be added between sentences.

* When translating from English to Chinese, then whitespace
  must be removed between sentences.

* When translating from Chinese to Japanese, then whitespace
  must be inserted between sentences for the English pivot,
  and then removed for the final output.
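
The whitespace rule in the examples above can be sketched as follows. This is a minimal Python illustration, not the C++/WASM implementation from this commit. The `NO_SENTENCE_SPACE` set and both function names are hypothetical.

```python
# Assumption: primary language subtags whose sentences are written
# back to back, with no whitespace between them.
NO_SENTENCE_SPACE = {"zh", "ja"}


def sentence_separator(lang_tag: str) -> str:
    """Return the separator to place between sentences for a language tag."""
    primary = lang_tag.split("-")[0].lower()
    return "" if primary in NO_SENTENCE_SPACE else " "


def join_sentences(sentences: list[str], lang_tag: str) -> str:
    """Join translated sentences using the target language's convention."""
    return sentence_separator(lang_tag).join(s.strip() for s in sentences)


# Chinese -> English: whitespace is added between sentences.
print(join_sentences(["Hello.", "Goodbye."], "en"))

# English -> Chinese: no whitespace between sentences.
print(join_sentences(["你好。", "再见。"], "zh"))
```

For a pivot translation such as Chinese to Japanese, the same function would be applied twice: once with `"en"` for the intermediate English text, and once with `"ja"` for the final output.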

* Remove WASM dependency on ssplit

This commit entirely removes the build dependency on
`ssplit` when building the WASM target.

This ultimately reduces the size of the compiled
WASM binary from 5.01 MB to 4.73 MB.

* Bump Bergamot Version 0.4.5 => 0.5.0

* Update WASM Bindings

Part 1 of 2

This commit updates the WASM bindings to take the source
language and target language tags in order to construct
the TranslationModels that now utilize the locale-specific
`Intl.Segmenter`.

This effectively takes the `LanguageTranslationModelFiles`
object and makes that a sub-object of `TranslationModelPayload`,
which includes the language tags as well as the files.

This hierarchical separation is ideal, because the `LanguageTranslationModelFiles`
object is designed to be iterated over and chunked into aligned memory,
whereas the language tags are plain strings that are distinctly separate
in the way that they are handled.

* Rework TranslationsEngine to utilize new bindings

Part 2 of 2

This commit reworks the TranslationsEngine worker code
to utilize the new bindings implemented in the previous
commit.

* Insert whitespace between full-width punctuation and opening quotes

This commit introduces extra logic to the text cleaning
that purposely inserts whitespace into CJK text to trick
the segmenter into doing the right thing.

See the in-code comment for more context.

* Add `zhen` test model files

This commit adds our work-in-progress `zhen` model
to the repository for use in testing.

* Add test cases for testing `zhen` models.

This commit adds several test cases for translating
from Chinese into other languages, which will both
guard against regressions and demonstrate correct
segmentation behavior.

* Add temporary `enzh` models for testing

Part 1 of 2

The final two commits of this stack may be slightly controversial.
We do not currently have a viable `enzh` model, even for testing
purposes, however, I need to test the functionality of removing
whitespace between sentences for target languages that require it.

This patch adds our `enes` models under the `enzh` directory, which
will trick the implementation into translating into "Chinese" with
a Spanish output. The key difference is that the Spanish output should
not include spaces between sentences, which is, in my opinion, good
enough for testing in the interim.

* Add makeshift `enzh` tests

Part 2 of 2

This patch adds test cases for translating into "Chinese", which
at present, is actually a Spanish translation that omits spaces
between sentences.
The rename of this repository has exposed that we rely on fork names to set `project` in PRs. This causes issues in certain circumstances, e.g. if we try to fire an action in a PR where the head repository is not named the same, we end up with [the wrong treeherder routes](https://github.com/mozilla/translations/blob/f9010478a45cd40bf7ad3d0aecdd62dd281ec5d6/.taskcluster.yml#L146), which causes scope issues.
* fix: set permission for train action

This used to default to the action name, but changed in taskgraph 11.0. Without this, we try to use the `generic` permission, which doesn't have the scopes needed.

* chore: bump taskgraph in poetry requirements
* Add pyright as a dependency

* Add pyright to Taskfile and to CI

* Add the utils folder to be type checked

* Fix the dependencies for type checking
This PR fixes an issue where we are checking for equality on the
language tag for CJK languages, when we really need to be checking
if the tag starts with the language tag.

Before this PR `zh-Hans` would not match. After this PR it will match.
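
The difference between the two checks can be shown with a short Python sketch. The actual fix lives in the project's own code. The `CJK_TAGS` tuple and both function names here are assumptions made for illustration.

```python
# Assumption: the base language tags treated as CJK.
CJK_TAGS = ("zh", "ja", "ko")


def is_cjk_exact(tag: str) -> bool:
    """Old behavior: the tag must equal a base language tag."""
    return tag in CJK_TAGS


def is_cjk_prefix(tag: str) -> bool:
    """New behavior: the tag only needs to start with a base language tag."""
    return tag.startswith(CJK_TAGS)


for tag in ("zh", "zh-Hans", "en"):
    print(tag, is_cjk_exact(tag), is_cjk_prefix(tag))
```

With equality, a tag carrying a script subtag such as `zh-Hans` is missed. With the prefix check it matches, while `en` is rejected by both.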
This commit removes Korean from the list of target languages
that omit whitespace between sentences.
* Rename kind

* Roll back target renaming
This will remove docker-worker from translations, and use the most recent generic-worker version instead.
* WANDB Test failure

* Rename DataDir.load to DataDir.read_text and allow for reading compressed files

* Add compress and decompress common utilities

* Use decompression utilities everywhere

* Re-work the marian-decoder fixture to correctly output nbest

* Rewrite translate.sh to python

* Add a requirements file for ctranslate2

* Add support for ctranslate2

* Add gpustats to the train requirements

* Add logging for translations

* Remove old translate scripts

* Handle review feedback
* Add ICU tokenizer

* Use ICU tokenizer in alignments

* Update to OpusTrainer with ICU detokenization support

* Update docs

* Add pyicu pypi package

* Use ICU system package

* Strip new lines

* Refactor abstract classes

* Fix typo

* Add todo with issue link for OpusTrainer package

* Add test cases from sacremoses

* Update to the latest commit

* Relock poetry
* Add contributing guide

* Add a link from the readme

* Fix link formatting

* Clarify how to skip a dataset
The current mtdata dataset here has been unavailable for many days in a row, making it difficult to verify pipeline changes.
* Update OpusCleaner

* Update OpusTrainer

* Update OpusCleaner

* Add Japanese and Korean to sacrebleu

* Add opusfilter package
* Update development docs

* Fix the note on Snakemake
* Update model training guide

* Update docs/training-guide.md

Co-authored-by: Greg Tatum <[email protected]>

* Clarify language resource classification

---------

Co-authored-by: Greg Tatum <[email protected]>
eu9ene (Collaborator) commented Jan 28, 2025

@ZJaume I'm using this in my training. Is it ready to be merged?

@ZJaume ZJaume marked this pull request as ready for review January 29, 2025 10:38
@ZJaume ZJaume requested a review from a team as a code owner January 29, 2025 10:38
5 participants