fix: Implement Unique Document ID for Elasticsearch Indexing to Prevent Duplication #2

esloch · 2024-02-23T13:56:19Z

Pull Request Description

This pull request introduces a mechanism for generating unique document IDs in the Elasticsearch indexing process used by the LiteRev platform. Its purpose is to prevent duplicate documents from being created during the daily indexing of new data from the MedRxiv and BioRxiv servers. Ensuring that each document indexed into Elasticsearch is unique maintains data integrity and improves the platform's overall search efficiency.
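As an illustration of the idea, here is a minimal sketch of deterministic ID generation. It is not the code from this PR: it assumes each record carries a `doi` and a `version` field, and the helper names `make_document_id` and `to_bulk_action` are hypothetical. The action dictionary uses the `_index` / `_id` / `_source` layout accepted by the bulk helpers of the official Python client.

```python
import hashlib


def make_document_id(doi: str, version: str = "1") -> str:
    """Return a deterministic ID for a preprint record.

    Assumption: a record is uniquely identified by its DOI and version.
    """
    # Normalize so trivial formatting differences do not produce new IDs.
    key = f"{doi.strip().lower()}|{version.strip()}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def to_bulk_action(index_name: str, record: dict) -> dict:
    """Build one bulk-style action that carries an explicit `_id`."""
    return {
        "_index": index_name,
        "_id": make_document_id(record["doi"], str(record.get("version", "1"))),
        "_source": record,
    }
```

Because the ID depends only on the record's content, indexing the same record on a later day targets the same `_id` and overwrites the existing document instead of creating a second copy.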

Closes Optimize Elasticsearch Indexing Process to Prevent Duplicating Data #3

How to Test These Changes

To test these changes, follow the steps below:

1. Ensure the Elasticsearch service is running and accessible.
2. Execute the modified indexing script for either medrxiv or biorxiv data.
3. Verify that documents are indexed correctly without duplication by querying the Elasticsearch database.
4. Check the logs generated by the indexing script to ensure unique document IDs are generated and used during the indexing process.

Example script execution:

python scripts/index_rxivx_data.py medrxiv
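One way to carry out the duplication check is a terms aggregation that returns only values appearing in two or more documents. The sketch below builds the search body and reads the response. It assumes a `doi` field mapped as a keyword (a text field would need something like `doi.keyword`). The function names are hypothetical and nothing here is taken from the PR's code.

```python
def duplicate_query(field: str = "doi", max_buckets: int = 100) -> dict:
    """Search body asking for values of `field` found in 2+ documents."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "duplicates": {
                "terms": {
                    "field": field,
                    "min_doc_count": 2,
                    "size": max_buckets,
                }
            }
        },
    }


def extract_duplicates(response: dict) -> dict:
    """Map each duplicated value to the number of documents holding it."""
    buckets = (
        response.get("aggregations", {})
        .get("duplicates", {})
        .get("buckets", [])
    )
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

The body can be sent to the index's `_search` endpoint. An empty result from `extract_duplicates` after running the indexing script twice suggests that the second run did not add copies.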

Pull Request Checklists

This PR is a:

bug-fix
new feature
maintenance

About this PR:

It includes tests.
The tests are executed on CI.
The tests generate log file(s) (path: /tmp/elasticrxivx_{index_name}_{timestamp}.log).
Pre-commit hooks were executed locally.
This PR requires a project documentation update.

Author's Checklist:

I have reviewed the changes and they contain no misspellings.
The code is well commented, especially in the parts that contain more complexity.
New and old tests passed locally.

Additional Implementation

1. Secure Password Management for Elasticsearch

Introduced a script to automatically reset and update the Elasticsearch 'elastic' user password, enhancing security by automating credential management. This script is executed as part of the container startup process, ensuring that Elasticsearch credentials are securely managed and updated as needed.

Example script execution:

containers/init-scripts/set_passwords.sh
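The contents of `set_passwords.sh` are not shown here, so the following is only a sketch of how an automated password update could look in Python, not a description of that script. It swaps in a different technique: generating a random password with the standard `secrets` module and preparing a request for Elasticsearch's change-password endpoint (`POST /_security/user/<username>/_password`). The function names and the base URL are assumptions, and the request is built but not sent.

```python
import base64
import json
import secrets
import string
import urllib.request


def generate_password(length: int = 24) -> str:
    """Return a random alphanumeric password from a secure source."""
    alphabet = string.ascii_letters + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))


def build_password_request(
    base_url: str,
    current_password: str,
    new_password: str,
    username: str = "elastic",
) -> urllib.request.Request:
    """Prepare (but do not send) a change-password request."""
    # HTTP basic auth with the credentials that are valid right now.
    token = base64.b64encode(
        f"{username}:{current_password}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_security/user/{username}/_password",
        data=json.dumps({"password": new_password}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the update. A real startup script would also need to store the new password wherever the indexing script reads it from.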

Reviewer's Checklist

Please use the following checklist for reviewing this PR:

## Reviewer's Checklist

- [ ] I managed to reproduce the problem locally from the `main` branch.
- [ ] I managed to test the new changes locally.
- [ ] I confirm that the issues mentioned were fixed/resolved.

esloch added 2 commits February 23, 2024 13:35

feat: Create script to set password on startup

8499c16

feat(indexing): implement unique document ID to prevent duplication

5375501

esloch force-pushed the improve-indexing branch from 0bb0762 to 47e77ab Compare February 23, 2024 13:58

feat(security): Automate elastic user password reset and update

181ac6f

esloch force-pushed the improve-indexing branch from 47e77ab to 181ac6f Compare February 23, 2024 14:03

xmnlab changed the title ~~Implement Unique Document ID for Elasticsearch Indexing to Prevent Duplication~~ fix: Implement Unique Document ID for Elasticsearch Indexing to Prevent Duplication Feb 23, 2024

xmnlab merged commit 7ffac5f into main Feb 23, 2024
0 of 3 checks passed

xmnlab deleted the improve-indexing branch February 23, 2024 14:31

esloch self-assigned this Mar 1, 2024
