encoding-related test failures on Alpine Linux #6350

Open
bastistician opened this issue Aug 3, 2024 · 4 comments
Labels
encoding issues related to Encoding

@bastistician

(Only reporting now, seeing that data.table is being developed again.)

Checking released data.table 1.15.4, my Alpine Linux server gives

Error: 3 error(s) out of 11070. Search tests/tests.Rraw.bz2 for test number(s) 1590.05, 1590.06, 1997.14. Duration: 34.4s elapsed (34.9s cpu).

but at this point it is probably more useful to look at the development version of data.table.

So in a vanilla Alpine Linux container,

docker run --rm -it alpine

running

export TZ=UTC
apk add R R-dev R-doc
## get data.table (devel) and suggested packages
R -s -e 'install.packages("data.table", repos = "https://rdatatable.gitlab.io/data.table", dependencies = TRUE, destdir = "/tmp")'
export _R_CHECK_TESTS_NLINES_=0
R CMD check --extra-arch /tmp/data.table_*.tar.gz

gives only 2 failures for test numbers 1590.05 and 1590.06:

Error in test.data.table()
* using R version 4.4.0 (2024-04-24)
* using platform: x86_64-pc-linux-musl
* R was compiled by
    gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309
    GNU Fortran (Alpine 13.2.1_git20240309) 13.2.1 20240309
* running under: Alpine Linux v3.20
* using session charset: UTF-8
[...]
Running the tests in ‘tests/main.R’ failed.
Complete output:
  > require(data.table)
  Loading required package: data.table
  > 
  > test.data.table()  # runs the main test suite of 5,000+ tests in /inst/tests/tests.Rraw
  getDTthreads(verbose=TRUE):
    OpenMP version (_OPENMP)       201511
    omp_get_num_procs()            12
    R_DATATABLE_NUM_PROCS_PERCENT  unset (default 50)
    R_DATATABLE_NUM_THREADS        unset
    R_DATATABLE_THROTTLE           unset (default 1024)
    omp_get_thread_limit()         2147483647
    omp_get_max_threads()          12
    OMP_THREAD_LIMIT               unset
    OMP_NUM_THREADS                unset
    RestoreAfterFork               true
    data.table is using 6 threads with throttle==1024. See ?setDTthreads.
  test.data.table() running: //data.table.Rcheck/data.table/tests/tests.Rraw
  Test 1590.05 ran without errors but failed check that x equals y:
  > x = x1 != x2 
  First 1 of 1 (type 'logical'): 
  [1] FALSE
  > y = TRUE 
  First 1 of 1 (type 'logical'): 
  [1] TRUE
  1 element mismatch
  Test 1590.06 ran without errors but failed check that x equals y:
  > x = forderv(c(x2, x1, x1, x2)) 
  First 0 of 0 (type 'integer'): 
  integer(0)
  > y = INT(1, 4, 2, 3) 
  First 4 of 4 (type 'integer'): 
  [1] 1 4 2 3
  Numeric: lengths (0, 4) differ
  Unloading package bit64
  
  Sat Aug  3 13:25:45 2024  endian==little, sizeof(long double)==16, longdouble.digits==64, sizeof(pointer)==8, TZ=='UTC', Sys.timezone()=='UTC', Sys.getlocale()=='C.UTF-8;C;C;C;C;C', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE; codeset=UTF-8', getDTthreads()=='OpenMP version (_OPENMP)==201511; omp_get_num_procs()==12; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==12; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 6 threads with throttle==1024. See ?setDTthreads.', .libPaths()=='//data.table.Rcheck','/usr/lib/R/library', zlibVersion()==1.3.1 ZLIB_VERSION==1.3.1
  Error in test.data.table() : 
    2 error(s) out of 11369. Search tests/tests.Rraw for test number(s) 1590.05, 1590.06. Duration: 26.9s elapsed (29.1s cpu).

Here is the relevant R code, with comments indicating results on Alpine Linux:

x1 <- "fa\xE7ile"
Encoding(x1) <- "latin1"
x2 <- iconv(x1, "latin1", "UTF-8")
identical(x1, x2)  # TRUE, ok
x1 == x2           # TRUE, ok

Encoding(x2) <- "unknown"  #  <-- an invalid string in a non-UTF-8 locale
identical(x1, x2)  # TRUE on Alpine even in the C locale, but FALSE on, e.g., Ubuntu in the C locale
x1 == x2           # the same

It seems this test (1590.05) relies on (undocumented) platform-dependent behaviour for invalid strings, so should probably be dropped.
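
To see what the two strings look like at the byte level, here is a minimal base R sketch (not part of the test suite). It only inspects the raw bytes and the declared encodings. It deliberately makes no claim about what identical() or == return once the encoding mark is removed, since that is the platform-dependent part.

```r
# x1: latin1-marked string; x2: the same text converted to UTF-8
x1 <- "fa\xE7ile"
Encoding(x1) <- "latin1"
x2 <- iconv(x1, "latin1", "UTF-8")

# Same text, different bytes: 0xE7 in latin1 vs 0xC3 0xA7 in UTF-8
bytes1 <- charToRaw(x1)
bytes2 <- charToRaw(x2)
print(bytes1)
print(bytes2)
print(Encoding(c(x1, x2)))

# Dropping the mark leaves the bytes untouched; only the declared
# encoding changes, so R has to interpret them via the native locale
x2_unmarked <- x2
Encoding(x2_unmarked) <- "unknown"
print(identical(charToRaw(x2_unmarked), bytes2))
print(Encoding(x2_unmarked))
```

In a C locale the unmarked UTF-8 bytes are not valid native text, which is presumably why the comparison result differs between libc implementations.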

I cannot say anything about the unexpected length-0 result of data.table:::forderv(c(x2,x1,x1,x2)) (test number 1590.06).

@MichaelChirico
Member

The nearby comments look relevant:

test(1590.03, forderv(    c(x2,x1,x1,x2)), integer())     # desirable consistent result given identical(x1, x2)
                                                          #           ^^ data.table consistent over time regardless of which version of R or locale
baseR = base::order(c(x2,x1,x1,x2))
  # Even though C locale and identical(x1,x2), base R<=4.0.0 considers the encoding too; i.e. orders the encoding together x2 (UTF-8) before x1 (latin1).
  # Then around May 2020, R-devel (but just on Windows) started either respecting identical() like data.table has always done, or put latin1 before UTF-8.
  # Jan emailed R-devel on 23 May 2020.
  # We relaxed 1590.04 and 1590.07 (tests of base R behaviour) rather than remove them, PR#4492 and its follow-up. But these two tests
  # are so relaxed now that they are barely testing anything. It appears base R behaviour is undefined in this rare case of identical strings in different encodings.

This will take some time to go through the history and figure out what this test was trying to do exactly and how to handle it.

Should we consider this a potential blocker for CRAN in the near future? We're just about to release a new version -- we can just deactivate those tests in the short term if needed.
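
If the tests were deactivated only where they are known to fail, one option would be to key off the platform string, which the check log above reports as x86_64-pc-linux-musl. The helper name is_musl_platform is hypothetical and not part of data.table. This is a sketch of the idea under that assumption, not how the test suite is actually gated.

```r
# Hypothetical helper: TRUE when the R platform triplet mentions musl,
# as in "x86_64-pc-linux-musl" from the check log
is_musl_platform <- function(platform = R.version$platform) {
  grepl("musl", platform, fixed = TRUE)
}

# Example gate around a platform-dependent check
if (is_musl_platform()) {
  message("musl libc detected: skipping platform-dependent encoding checks")
}
```

Matching on the platform string is coarse: it identifies the libc, not the actual string-comparison behaviour, so dropping the non-portable tests outright may still be the cleaner fix.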

@tdhock added the encoding label (issues related to Encoding) on Aug 7, 2024
@bastistician
Author

The report shows that these two tests are not portable. If they were disabled I could drop the --no-tests flag for data.table when mass-checking packages on Alpine Linux (against specific R patches).

@MichaelChirico
Member

Something to keep an eye on: actions/runner#801

It would be nice to attack this issue with a GHA job running just one flavor on Alpine.

Support on GitLab CI does seem easier, in case @ben-schwen / @jangorecki want to throw some spare cycles at it :)

@MichaelChirico
Member

OK, so

  • I agree with your assessment to drop 1590.05. It seems like we're just testing R's own behavior -- no real reason for a package to be doing that (certainly not on CRAN). Do you think this is worthy of a bug report for R itself, though?
  • For 1590.06, the integer(0) result we see on Alpine is just telling us forder() thinks the input is already sorted. I think that's not ideal in that we expect forder() to give platform-independent sorting results, and here we have a different sort order on Alpine. But it's not the more worrisome kind of bug, where forder() would be totally broken on Alpine.
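
The "already sorted" convention mentioned for 1590.06 can be mimicked in base R. In this sketch, order_or_empty is a hypothetical helper, not data.table code, and it relies on base::order with method = "radix" rather than forder()'s own implementation.

```r
# Hypothetical helper: return integer(0) when x is already in order,
# otherwise the ordering permutation
order_or_empty <- function(x) {
  o <- order(x, method = "radix")  # radix sorts strings as in the C locale
  if (identical(o, seq_along(x))) integer(0) else o
}

sorted_result   <- order_or_empty(c("a", "b", "c"))
unsorted_result <- order_or_empty(c("b", "a", "c"))
print(sorted_result)
print(unsorted_result)
```

Read this way, a length-0 result for c(x2, x1, x1, x2) means the four elements were treated as being in order already, whereas the expected INT(1, 4, 2, 3) places both x2 entries ahead of the x1 entries.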

@MichaelChirico modified the milestones: 1.17.0, 1.18.0 on Jan 17, 2025