There are two quasi-independent parts to this repository:
- solr/ contains all of the configuration for the catalog solr, which currently uses a non-managed schema (i.e., we hand-edit the schema.xml file).
- Everything else is concerned with the actual indexing code (based on traject).
In addition to this overview, more detailed explanations can be found in:
git clone https://github.com/hathitrust/hathitrust_catalog_indexer
cd hathitrust_catalog_indexer
docker-compose build
docker-compose up -d solr-sdr-catalog
docker-compose run --rm traject bundle install
The contents of lib/translation_maps/ht/collection_code_to_original_from.yaml will change depending on the contents of the sample database you are running locally. To keep git from bugging you about this constantly-drifting-from-upstream file, run

git update-index --skip-worktree lib/translation_maps/ht/collection_code_to_original_from.yaml

To undo this (e.g., if you update the file based on the production database and want the repo up to date), run

git update-index --no-skip-worktree lib/translation_maps/ht/collection_code_to_original_from.yaml
Generate Solr documents given an input file of MARC records in JSON format, one per line:
docker-compose run --rm traject bundle exec bin/cictl index file --no-commit --writer=json input-marc-records.jsonl
Output will be in debug.json.
This will build a docker image with a Solr core pre-loaded with a set of records.
- As above, put the records you want to load in example-index/records-to-index.jsonl. Some records are included with this repository; this is a set of 2,000 records from a variety of contributors that were updated in HathiTrust on May 1, 2022.
- Then run:

docker build . -f example-index/Dockerfile -t my-sample-solr

and run it with, e.g., docker run -p 9033:9033 my-sample-solr, or use it in another docker-compose.yml, etc.
A multi-platform (amd64/arm64) image with the sample records pre-loaded is available:
docker pull ghcr.io/hathitrust/catalog-solr-sample
The sample HathiTrust catalog records are made available for development and testing purposes only, and are not intended for further re-use in other contexts.
Index a file of records without using the database or hardcoded filesystem paths:
# get some sample records somehow
docker-compose run --rm traject bundle exec bin/cictl index file examples/sample_records.json.gz
Zephir records for the last monthly up to the current date should be in the examples directory:
docker-compose run --rm traject bundle exec bin/cictl index all
Solr should be accessible at http://localhost:9033
Start solr and index as above. In the other project, ensure docker-compose.yml contains, e.g.:
services:
  web:
    build: .
    ports:
      - "3000:3000"
    # Add this networks entry to the service that needs to reach solr
    networks:
      - catalog_indexer

# Add this network information
networks:
  catalog_indexer:
    external: true
    # If you checked out into another directory than
    # 'hathitrust_catalog_indexer', adjust to match
    # (appending '_default')
    name: hathitrust_catalog_indexer_default
If you checked out into another directory than hathitrust_catalog_indexer, adjust the name of the network above to match. This will ensure the application uses the solr running on this docker network (i.e., the one started with docker-compose up from this repository). Solr should be reachable via the solr-sdr-catalog hostname.
For use in production environments where daily and monthly indexing are ongoing activities, we enable the indexer to maintain state by writing "journal" files: empty datestamped files in a known location (JOURNAL_DIRECTORY). The command cictl index continue does whatever full or daily indexing is appropriate given the state of the journals.

Note that all of the cictl index * commands write journal files, with the exception of cictl index file, which takes only an upd MARC file rather than a MARC-deletes pair and is not expected to be used in an environment where date independence is in force.
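A rough illustration of the idea (a sketch only -- the real logic lives in cictl, and the assumption here that each journal filename embeds a YYYYMMDD datestamp may not match the actual naming scheme):

require "date"

# Sketch: find the newest datestamp recorded in JOURNAL_DIRECTORY and report
# which daily runs would still be needed to catch up to today.
journal_dir = ENV.fetch("JOURNAL_DIRECTORY", "journal")
indexed_dates = Dir.children(journal_dir)
                   .filter_map { |name| name[/\d{8}/] }
                   .map { |stamp| Date.strptime(stamp, "%Y%m%d") }

if indexed_dates.empty?
  puts "no journal entries; a full index (cictl index all) is needed"
else
  ((indexed_dates.max + 1)..Date.today).each do |date|
    puts "would run: bin/cictl index date #{date.strftime('%Y%m%d')}"
  end
end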
- On beeftea-2, go to /htsolr/catalog/bin/ht_catalog_indexer and git pull to get up to date.
- Shut down the catalog indexing solr with systemctl stop solr-current-catalog.
- Copy the new configuration over:

cd /htsolr/catalog/cores/catalog
rm -rf conf
rm core.properties
cp -r /htsolr/catalog/bin/ht_catalog_indexer/solr/catalog/conf .

- (Optional) If your new solr config requires a full reindex, go ahead and get rid of the data with rm -rf data.
- Fire solr back up: systemctl start solr-current-catalog.
- Give it a minute and then go to http://beeftea-2.umdl.umich.edu:9033/solr to make sure the core came back up.
- Do whatever indexing needs doing.
Note that "today's file" is "the file that became available today", which will have yesterday's date embedded in it.
Re-process today's file:
bin/cictl index today
Process all deletes/marcfiles with a date on or after YYYYMMDD in their names, sequentially:
bin/cictl index since 20220302
Re-build the entire index based on the last full file, making sure everything is up-to-date:
bin/cictl index all
Note that the full index file does not contain that day's updates (e.g., on July 1, you need to index both the zephir_full_20230630 file and the zephir_upd_20230630 file). The index all command takes care of that, but if running things by hand, keep in mind that you need to index both the full file and the update file for that day.
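A by-hand monthly reindex might look roughly like this (a sketch only; the directory and file extensions below are assumptions -- check the actual filenames under the prep directory):

bin/cictl delete all
bin/cictl index file /htsolr/catalog/prep/zephir_full_20230630.json.gz
bin/cictl index file /htsolr/catalog/prep/zephir_upd_20230630.json.gz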
Adding a field requires two things:
- adding a to_field definition in an indexing file (see the example below)
- adding the field definition to the solr schema file schema.xml
After that, of course, you need to get the solr conf directory on the catalog indexing solr updated, restart that solr, and then reindex everything. See solr/README.md.
While solr support for dynamic fields has gotten pretty good, we've never used it.
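For example, the indexing-rules half of a new field might look something like this (a sketch; the field name and MARC spec are purely illustrative, and the matching field entry still has to be added to schema.xml):

# in one of the files under indexers/ -- illustrative only
to_field 'new_note_field', extract_marc('500a', trim_punctuation: true)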
The main driver script is bin/cictl:
> bundle exec bin/cictl help
Commands:
cictl delete SUBCOMMAND ARGS # Delete some or all Solr records
cictl help [COMMAND] # Describe available commands or one specific...
cictl index SUBCOMMAND ARGS # Index a set of records from a file or date
cictl pry # Open a pry-shell with environment loaded
Options:
[--verbose], [--no-verbose] # Emit 'debug' in addition to 'info' log entries
[--log=<logfile>] # Log to <logfile> instead of STDOUT/STDERR
The index command has a number of possibilities:
> bundle exec bin/cictl help index
Commands:
cictl index all # Empty the catalog and index the most recent m...
cictl index continue # index all files not represented in the indexe...
cictl index date YYYYMMDD # Run the catchup (delete and index) for a part...
cictl index file FILE # Index a single file
cictl index help [COMMAND] # Describe subcommands or one specific subcommand
cictl index since YYYYMMDD # Run all deletes/includes in order since the g...
cictl index today # Run the catchup (delete and index) for last n...
Options:
[--reader=READER] # Reader name/path
[--writer=WRITER] # Writer name/path
The delete command has fewer subcommands:
> bundle exec bin/cictl help delete
Commands:
cictl delete all # Delete all records from the Solr index
cictl delete file FILE # Delete records from a single file
cictl delete help [COMMAND] # Describe subcommands or one specific subcommand
TODO: Add in non-hardcoded mechanisms for dictating where the marc/delete files will be, where the redirect file will be, etc.
- bin/ contains the cictl indexing CLI.
- indexers/ would more appropriately be called "indexing rules". It contains all the traject rules for turning a marc record into a solr document, independent of the source of the marc record or where it's being written to.
  - The organization of this dir reflects the HT/UMich joint code policy long after it's no longer in place.
  - Order matters when loading these files. In particular, the file indexers/common.rb has a ton of require statements, settings, etc., and most of the other files assume that stuff is all defined.
- readers/ has a variety of files that contain nothing but traject settings that specify what reader to use. We use newline-delimited marc (jsonl), but others are available for running test files.
- writers/, similarly, has files with traject settings for different writers. This might involve pushing the resulting documents to solr (as in localhost.rb, which actually uses SOLR_URL), pushing to a local file for human (debug.rb) or machine (json.rb) inspection, or doing nothing at all (null.rb) for benchmarking (see the sketch after this list).
- lib/ contains all the code called by the indexing code. The organization is...not so much organized at all. See that directory for more info.
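As a sketch of what a settings-only file in writers/ looks like (the actual contents of writers/json.rb may differ), it is really nothing but traject settings:

# writers/json.rb -- a sketch; the real file may differ
settings do
  provide "writer_class_name", "Traject::JsonWriter"
  provide "output_file", "debug.json"
end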
lib is put into the search path, so one can require those files directly.
lib/translation_maps is automatically searched for translation map files (.yaml, .rb, or .ini) when they are referenced, either by explicitly creating one (with Traject::TranslationMap.new) or implicitly from an extract_marc (e.g., extract_marc('1004:1104:1114', translation_map: 'ht/relators')). Note that all translation maps are cached, so loading one more than once isn't a big deal.
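For example (a sketch; the value returned depends on what's actually in the map file):

relators = Traject::TranslationMap.new('ht/relators')
relators['aut']  # => whatever 'aut' maps to in lib/translation_maps/ht/relators.*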
In addition to the obvious target solr instance, the indexing process pulls data from a number of external sources:
- Mapping of collections to institution names. This is pulled by the script bin/get_collection_map.rb from the database tables ht_institutions and ht_collections and is cached locally in lib/translation_maps/ht. (FIXME: should cictl expose this functionality?)
- The holdings_htitem_htmember database table for print holdings
- The oclc_concordance table for adding in canonical OCLC numbers
- The file for the current month in /htapps/babel/hathifiles/catalog_redirects/redirects for setting up redirects
TODO: Get the rights info from rights_current so it's up-to-date. It would be nice if rights_current had an index on the whole HTID instead of us having to split out the namespace...
Connection string is exposed by the Services object based on environment variables and config/env. The defaults in the repository suffice for testing under Docker only.
- DDIR: data directory, defaults to /htsolr/catalog/prep
- JOURNAL_DIRECTORY: location of journal files (see Date-Independent Indexing above), defaulting to journal/ inside the repo directory.
- LOG_DIR: where to store logs, defaults to /htsolr/catalog/prep.
- MYSQL_HOST, MYSQL_DATABASE, MYSQL_USER, MYSQL_PASSWORD: required unless run with NO_DB.
- NO_DB: if you want to skip all the database stuff. Useful for testing. Implied by NO_EXTERNAL_DATA.
- NO_EXTERNAL_DATA: combines NO_DB and NO_REDIRECTS.
- NO_REDIRECTS: do not read the catalog redirects file. Implied by NO_EXTERNAL_DATA.
- REDIRECT_FILE (and the now-deprecated redirect_file): path to the redirect file. Default is /htapps/babel/hathifiles/catalog_redirects/redirects/redirects_YYYYMM.txt.gz
- SOLR_URL (required): the solr core URL (i.e., ending in /catalog)
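Outside of docker-compose (which wires these up for you), one way to supply them is on the command line; a sketch, assuming a local solr on port 9033 and that setting NO_EXTERNAL_DATA to any value counts as "set":

SOLR_URL=http://localhost:9033/solr/catalog NO_EXTERNAL_DATA=1 bundle exec bin/cictl index file examples/sample_records.json.gz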
- JOB_NAME: if not set, defaults to the cictl command, e.g., index_continue from cictl index continue.
- JOB_SUCCESS_INTERVAL: handled by PushMetrics; no default set by this repository.
- PUSHGATEWAY: set to http://pushgateway:9091 in the docker-compose file, otherwise no default.
These are used internally, mainly for testing. They are not exposed by the Services object.
- CICTL_SEMANTIC_LOGGER_SYNC: forces SemanticLogger to run on the main thread in order to mitigate testing headaches.
- CICTL_ZEPHIR_FILE_TEMPLATE_PREFIX: for test fixtures, overrides the default "zephir".
(the nerd version)
traject is really designed to be run from the command line, which makes things like testing a pain.
The lifecycle is:
- a new Traject::Indexer is created, more-or-less blank.
- each file passed to the traject command with a -c is read and subjected to indexer.instance_eval. Note that this causes closures to be created for any lambdas defined in those files.
- a to_field or each_record call adds the given proc/lambda to the list of Things To Do in the indexer. These are run in order for every record.
- Note that macros (like the traject-provided extract_marc) actually return a lambda, so to_field 'id', extract_marc('001') is just a mapping from a name to a lambda.
- As of traject 3.0, you can stack an arbitrary number of lambdas/procs on a to_field call, optionally culminating with a block. This allows post-processing calls like first_only to work (see the sketch after this list).
- Once all the files have been read and processed, it makes a reader and a writer and starts processing the input records in turn and spitting them out to the writer.
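For instance, a rule like the following stacks a macro-generated lambda with a post-processing macro and a final block (a sketch; the field name is illustrative):

to_field 'title_display', extract_marc('245ab'), first_only do |record, accumulator, context|
  accumulator.map! { |title| title.gsub(/\s+/, ' ').strip }  # must mutate in place; see below
end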
- THE ACCUMULATOR MUST BE CHANGED IN-SITU. This is the one that messes people up. Methods like map will have no effect because they return a new array. You must use things like:
  - map!, reject!, select!, etc.
  - concat (which should be a ! method but isn't)
  - replace (ditto)
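For example, inside a block (field name illustrative):

to_field 'example_field' do |record, accumulator, context|
  accumulator << 'raw value'
  accumulator.map(&:upcase)   # WRONG: returns a new array and changes nothing
  accumulator.map!(&:upcase)  # RIGHT: mutates the accumulator in place
end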
- Scopes: The accumulator exists only during the processing of a single to_field. The context lasts throughout the processing of a single record.
- The basic structure is to_field(field_name, lambdas/procs) { optional_block }. The list of lambdas is optional, as is the final block, but you need at least one macro or the block.
- Every proc/lambda (and the final block) must have the signature (record, accumulator_array, traject_context_object) or just (record, accumulator_array). A traject 'macro' is just a method that returns a lambda with that structure.
- Everything is pass-by-reference -- the record, acc, and context are all the same as you go down the list of lambdas. Thus, as noted above, the accumulator must be changed in-place.
- The context keeps track of where in the configuration files things are defined (for error reporting), but it also holds two important areas:
  - context.clipboard is simply a hash where you can store things for later.
  - context.output_hash is the actual mapping of field name to value -- this is what's actually sent to the writer and then on to solr (or a debug file or whatever).
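For example (a sketch; the clipboard key and output field are illustrative):

each_record do |record, context|
  # stash something for a later rule in the same record's run
  context.clipboard[:field_tags] = record.fields.map(&:tag)
  # write straight into the document that will be sent to the writer
  (context.output_hash['example_tag_count'] ||= []) << context.clipboard[:field_tags].size
end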