Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add search index #21

Merged
merged 11 commits into from
Dec 15, 2023
Merged

Add search index #21

merged 11 commits into from
Dec 15, 2023

Conversation

apainintheneck
Copy link
Owner

@apainintheneck apainintheneck commented Dec 11, 2023

Resolves #14.

Overview

The main change here is to pre-compute all word to country mappings for performance reasons. This adds quite a bit of complexity to the code though because we have to make sure that these mappings stay up-to-date.

Changes

  • Add script/shared/cache_generator.rb class to make it easier to generate cache files with documentation.
  • Add script/generate_search_index.rb to facilitate generating cached search indexes and checking to make sure they are up-to-date.
  • Add cache.rb module to facilitate getting values from the cache.
  • Update code to use new search indexes.
  • Refactor: Move generate_readme.rb to new script/ folder.
  • Add rake task to generate and check for outdated search indexes.
  • Add non-readme cache files to the gem files (update in atlasq.gemspec).
  • Check for outdated indexes on CI.
  • Add info about the cache to the readme in the dev section.

Benchmarks

The performance gains come from two main areas.

  1. We have mappings country names and country name fragments which are pre-computed so we don't have to allocate 100s of thousands of strings when we normalize strings at runtime.
  2. We no longer need to load the internationalization library for 93 different languages. We can just use the default English one.

This even notably speeds up tests from around 8 seconds to around 4 seconds.


This is a measurement of the worst-case scenario before this change since it required creating a partial country name index on the fly.

$ hyperfine \
  --parameter-list branch main,add-search-index \
  --setup 'git switch {branch}' \
  --warmup 3 \
  'exe/atlasq -c united'
Benchmark 1: exe/atlasq -c united (branch = main)
  Time (mean ± σ):      1.211 s ±  0.019 s    [User: 0.899 s, System: 0.275 s]
  Range (min … max):    1.191 s …  1.256 s    10 runs
 
Benchmark 2: exe/atlasq -c united (branch = add-search-index)
  Time (mean ± σ):     621.9 ms ±   3.2 ms    [User: 341.8 ms, System: 252.7 ms]
  Range (min … max):   617.8 ms … 628.6 ms    10 runs
 
Summary
  exe/atlasq -c united (branch = add-search-index) ran
    1.95 ± 0.03 times faster than exe/atlasq -c united (branch = main)

The results would look like this.

$ exe/atlasq -c united
*
* Countries (Partial Match)
* * * * * * * * * * * * * * *
(🇦🇪 | 784 | AE | ARE | United Arab Emirates)
(🇬🇧 | 826 | GB | GBR | United Kingdom of Great Britain and Northern Ireland)
(🇲🇽 | 484 | MX | MEX | Mexico)
(🇹🇿 | 834 | TZ | TZA | Tanzania, United Republic of)
(🇺🇲 | 581 | UM | UMI | United States Minor Outlying Islands)
(🇺🇸 | 840 | US | USA | United States of America)
(🇻🇮 | 850 | VI | VIR | Virgin Islands (U.S.))

For normal country matching the results are still significant.

$ hyperfine \
  --parameter-list branch main,add-search-index \
  --setup 'git switch {branch}' \
  --warmup 3 \
  'exe/atlasq -c france'
Benchmark 1: exe/atlasq -c france (branch = main)
  Time (mean ± σ):     895.3 ms ±  10.6 ms    [User: 594.2 ms, System: 266.4 ms]
  Range (min … max):   874.3 ms … 908.2 ms    10 runs
 
Benchmark 2: exe/atlasq -c france (branch = add-search-index)
  Time (mean ± σ):     592.8 ms ±   4.4 ms    [User: 318.2 ms, System: 248.4 ms]
  Range (min … max):   585.8 ms … 598.8 ms    10 runs
 
Summary
  exe/atlasq -c france (branch = add-search-index) ran
    1.51 ± 0.02 times faster than exe/atlasq -c france (branch = main)

The results should look like this.

$ exe/atlasq -c france
*
* Country: The French Republic
* * * * * * * * * * * * * * * * *
(🇫🇷 | 250 | FR | FRA | France)
 | Search Term: france
  | Languages: French
   | Nationality: French
    | Region: Western Europe
     | Continent: Europe
      | Currency: € Euro
       |________________________________________

Next Steps

It would be nice to not have to bundle all of the data dependencies with the program. These could easily be added to the cache reducing the number of runtime dependencies.

This ends up speeding up the partial country search by around
3/10s of a second on my old computer. The slow thing was creating
the index on the fly. Loading it from JSON ends up being very fast.
This moves another search index to the cache. The goal here is
to move all data to the cache and remove all data dependencies.

I also made some changes to the rubocop config and how the
CacheGenerator class is set up. The tests had to be updated as well.
- add docs for cached files
- update gemspec to include cached files
@apainintheneck apainintheneck marked this pull request as ready for review December 12, 2023 06:40
test/test_util.rb Outdated Show resolved Hide resolved
script/generate_search_index.rb Outdated Show resolved Hide resolved
script/shared/cache_generator.rb Show resolved Hide resolved
script/shared/cache_generator.rb Outdated Show resolved Hide resolved
script/shared/cache_generator.rb Show resolved Hide resolved
@apainintheneck apainintheneck merged commit 3e2b1c2 into main Dec 15, 2023
9 checks passed
@apainintheneck apainintheneck deleted the add-search-index branch December 15, 2023 05:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pre-compute all country mappings
1 participant