fix: Implement Unique Document ID for Elasticsearch Indexing to Prevent Duplication #2

Merged 3 commits on Feb 23, 2024

@esloch (Collaborator) commented Feb 23, 2024

Pull Request Description

This Pull Request introduces unique document ID generation for the Elasticsearch indexing process used within the LiteRev platform. The goal is to prevent data duplication during the daily indexing of new data from the MedRxiv and BioRxiv servers. Indexing each document under a unique ID preserves data integrity and keeps duplicate entries out of the platform's search results.
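The description does not show how the IDs are generated. One common way to get this behaviour is to derive the ID deterministically from fields that identify a preprint, so that indexing the same record twice targets the same `_id`. The sketch below assumes that approach; the field names `doi` and `version` and the helper names `make_doc_id` and `to_bulk_action` are hypothetical and not taken from the PR. The returned dict follows the action shape accepted by `elasticsearch.helpers.bulk`.

```python
import hashlib


def make_doc_id(record: dict, key_fields=("doi", "version")) -> str:
    """Derive a stable document ID from fields that identify a preprint.

    The field names are assumptions for illustration, not taken from the PR.
    """
    # Normalise each field so trivial differences do not change the ID.
    parts = [str(record.get(field, "")).strip().lower() for field in key_fields]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def to_bulk_action(index_name: str, record: dict) -> dict:
    """Build one bulk action; reusing an existing _id overwrites that document."""
    return {"_index": index_name, "_id": make_doc_id(record), "_source": record}


if __name__ == "__main__":
    first = {"doi": "10.1101/2024.01.01.000001", "version": "1", "title": "Example"}
    again = dict(first)  # the same record fetched again on a later day
    print(make_doc_id(first) == make_doc_id(again))
```

With an explicit `_id`, Elasticsearch replaces the existing document instead of creating a new one with an auto-generated ID, which is what removes the duplicates.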

How to Test These Changes

To test these changes, follow the steps below:

  1. Ensure the Elasticsearch service is running and accessible.
  2. Execute the modified indexing script for either medrxiv or biorxiv data.
  3. Verify that documents are indexed correctly without duplication by querying the Elasticsearch database.
  4. Check the logs generated by the indexing script to ensure unique document IDs are generated and used during the indexing process.
  • Example script execution:
    python scripts/index_rxivx_data.py medrxiv0
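For step 3, one possible check is a `terms` aggregation with `min_doc_count` set to 2, which returns only values that occur in more than one document. The following is a minimal sketch under assumptions: the field name `doi.keyword` and both helper names are hypothetical, and the index mapping used by LiteRev may differ.

```python
def duplicate_check_query(field: str = "doi.keyword", max_buckets: int = 100) -> dict:
    """Build a search body that lists values of `field` shared by 2+ documents.

    `doi.keyword` is an assumed field name; adjust it to the real mapping.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "aggs": {
            "duplicates": {
                "terms": {"field": field, "min_doc_count": 2, "size": max_buckets}
            }
        },
    }


def extract_duplicates(response: dict) -> dict:
    """Map each duplicated value to its document count (empty if none)."""
    buckets = response.get("aggregations", {}).get("duplicates", {}).get("buckets", [])
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```

The query body can be sent to the index's `_search` endpoint, and the response body passed to `extract_duplicates` as a plain dict. An empty result means no value of the chosen field is shared by two documents, up to the `max_buckets` limit.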

Pull Request Checklists

This PR is a:

  • bug-fix
  • new feature
  • maintenance

About this PR:

  • It includes tests.
  • The tests are executed on CI.
  • The tests generate log file(s) (path: /tmp/elasticrxivx_{index_name}_{timestamp}.log).
  • Pre-commit hooks were executed locally.
  • This PR requires a project documentation update.
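The log path pattern in the checklist above can be used to locate the log files for a given index. This is a small sketch, assuming the `{timestamp}` part contains no path separators; the helper name `find_index_logs` is hypothetical, and the log contents are not described in the PR, so the sketch only finds the files.

```python
from pathlib import Path


def find_index_logs(index_name: str, log_dir: str = "/tmp") -> list:
    """Return log files for one index, oldest first by modification time."""
    # Mirrors the pattern /tmp/elasticrxivx_{index_name}_{timestamp}.log
    pattern = f"elasticrxivx_{index_name}_*.log"
    return sorted(Path(log_dir).glob(pattern), key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    for log_file in find_index_logs("medrxiv"):
        print(log_file)
```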

Author's Checklist:

  • I have reviewed the changes and they contain no misspellings.
  • The code is well commented, especially in the parts that contain more complexity.
  • New and old tests passed locally.

Additional Implementation

1. Secure Password Management for Elasticsearch

Introduced a script that automatically resets and updates the password of the Elasticsearch 'elastic' user. The script runs as part of the container startup process, so credentials are managed and updated without manual steps.

  • Example script execution:
containers/init-scripts/set_passwords.sh

Reviewer's Checklist

Please use the following checklist for reviewing this PR:

## Reviewer's Checklist

- [ ] I managed to reproduce the problem locally from the `main` branch.
- [ ] I managed to test the new changes locally.
- [ ] I confirm that the issues mentioned were fixed/resolved.

@xmnlab xmnlab changed the title Implement Unique Document ID for Elasticsearch Indexing to Prevent Duplication fix: Implement Unique Document ID for Elasticsearch Indexing to Prevent Duplication Feb 23, 2024
@xmnlab xmnlab merged commit 7ffac5f into main Feb 23, 2024
0 of 3 checks passed
@xmnlab xmnlab deleted the improve-indexing branch February 23, 2024 14:31
@esloch esloch self-assigned this Mar 1, 2024
Development

Successfully merging this pull request may close these issues.

Optimize Elasticsearch Indexing Process to Prevent Duplicating Data
2 participants