Virtuoso Wikidata import performance - Virtuoso Wikidata endpoints as part of the snapquery Wikidata mirror network #1326

Open
WolfgangFahl opened this issue Nov 4, 2024 · 7 comments

Comments

@WolfgangFahl

WolfgangFahl commented Nov 4, 2024

@TallTed

Tim Holzheim has successfully imported Wikidata into a Virtuoso instance; see https://cr.bitplan.com/index.php/Wikidata_import_2024-10-28_Virtuoso and
https://wiki.bitplan.com/index.php/Wikidata_import_2024-10-28_Virtuoso
for the documentation. The endpoint is available at https://virtuoso.wikidata.dbis.rwth-aachen.de/sparql/ and we would love to integrate this and other Virtuoso endpoints into our snapquery infrastructure at https://github.com/WolfgangFahl/snapquery.

Ted suggested that I should open a ticket to get the discussion going about how Virtuoso endpoints could be made part of the snapquery Wikidata mirror infrastructure. The idea is to use named parameterized queries that hide the details of the endpoints, so that it does not matter whether you use Blazegraph, QLever, Jena, Virtuoso, Stardog, ... you name it. Queries should just work as specified and be proactively monitored for non-functional aspects.
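As a rough illustration of the backend-agnostic idea (the query below is just an example, not snapquery's actual parameterization syntax), the same SPARQL request can be sent unchanged to the Virtuoso mirror or to any other endpoint in the network:

$ curl -G 'https://virtuoso.wikidata.dbis.rwth-aachen.de/sparql/' \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode 'query=
      PREFIX wd:  <http://www.wikidata.org/entity/>
      PREFIX wdt: <http://www.wikidata.org/prop/direct/>
      SELECT ?cat WHERE { ?cat wdt:P31 wd:Q146 } LIMIT 5'

snapquery's job would then be to resolve a named query into such a request and choose a suitable endpoint, independent of the backend engine.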

@TallTed
Collaborator

TallTed commented Nov 4, 2024

Note that we (OpenLink Software [1], [2]) have also loaded Wikidata into a live Virtuoso instance, available at https://wikidata.demo.openlinksw.com/sparql.

I'm not sure whether I'm the "Ted" referenced in the last paragraph; if so, regrettably, I've forgotten the specifics of that conversation. Could you provide more detail about the "question" being asked by this issue, especially to benefit others who may have more to contribute to the "answer" than I?

@WolfgangFahl
Author

WolfgangFahl commented Nov 4, 2024

https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours has the background information, as does https://www.wikidata.org/wiki/Wikidata:Scholia/Events/Hackathon_October_2024

We are well aware of the Virtuoso endpoint; it is already configured in the default https://github.com/WolfgangFahl/snapquery/blob/main/snapquery/samples/endpoints.yaml file.

The question here is how we can quickly get a Virtuoso endpoint that is as up-to-date as possible. We intend to "rotate" images based on dumps as long as streaming updates are not possible, so currently that would be roughly weekly; ad-freiburg/qlever-control#82 is an example.

This is just an initial issue to start the communication, as suggested by Ted in the online meeting of the Wikidata Search Platform mentioned above. Depending on how the Virtuoso open source project is going to be involved, we might need multiple tickets for the different aspects. I suggest sticking with the import performance issue in this ticket for the time being and waiting for Tim's comment.

@TallTed
Collaborator

TallTed commented Nov 5, 2024

"wait for Tim's comment"

Is Tim a GitHub user? Tagging their handle seems appropriate, if so. If not, I wonder how they are to comment here? (Also if not a GitHub user, it might make sense to instead raise these threads on the OpenLink Community Forum. They would need to register there, but this could be done using various third-party IdPs.)

@tholzheim

The import took ~4 days, and the Virtuoso instance was configured with the recommendation for 64 GB RAM (the highest available recommendation in the documentation).
Dump used:

  • file: latest-all.nt.bz2
  • size: 166GB

To improve the import performance I want to try:

  • increasing the RAM configuration
  • splitting the dump file into smaller subsets as recommended on some doc pages for the bulk load

Is there a recommendation for a configuration that would allow the import of the dump in a single day?

I noticed that once the RAM was full, a lot of write-lock log messages appeared (or messages about waiting to write; unfortunately I did not save the import logs). To avoid this, the RAM configuration in the next attempt will be increased, e.g. to 300 GB.

@pkleef
Collaborator

pkleef commented Nov 7, 2024

NumberOfBuffers

The virtuoso.ini config file traditionally has a table with some pre-calculated settings for NumberOfBuffers and MaxDirtyBuffers as a starting point based on the amount of free memory space.

Say you have 64 GB of free memory in your system, which corresponds to 64 * 1024 * 1024 * 1024 / 8192 = 8388608 as the maximum NumberOfBuffers. As you also need memory for related caches, transactions, etc., we recommend using about 2/3 of the maximum, or 5592405, which got rounded down to 5450000 in the table.

The MaxDirtyBuffers is normally set to around 75% of the NumberOfBuffers.
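
The same arithmetic can be checked with plain shell arithmetic (just restating the numbers above; each buffer caches an 8 KB page, hence the division by 8192):

$ echo $(( 64 * 1024 * 1024 * 1024 / 8192 ))   # maximum NumberOfBuffers for 64 GB
8388608
$ echo $(( 8388608 * 2 / 3 ))                  # ~2/3 of the maximum
5592405
$ echo $(( 5450000 * 75 / 100 ))               # MaxDirtyBuffers at ~75% of the table value
4087500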

In commit b6845d1, we enhanced the way you can specify the number of buffers:

Say you have a machine with 300GB of free memory, and you want to use about 250GB of that for database buffers, leaving around 50GB for Virtuoso's caches, transactions, etc. Instead of performing the above calculation(s), we can simply use the following settings:

[Parameters]
...
NumberOfBuffers = 250G    ; calculate max NumberOfBuffers that will fit in 250GB memory
MaxDirtyBuffers = 75%        ; allow up to 75% of NumberOfBuffers to be dirty

Or in your Dockerfile or docker-compose.yml file:

environment:
      - VIRT_PARAMETERS_NUMBEROFBUFFERS=250G
      - VIRT_PARAMETERS_MAXDIRTYBUFFERS=75%

Splitting latest-all.nt.bz2

Splitting a big dump into smaller files will inevitably take some time; however, depending on the number of cores and threads in your CPU, such a split can greatly reduce the time it takes to bulk-load this dataset using multiple Virtuoso threaded loaders in parallel.

You can try the following perl-split.pl Perl script to split the data into chunks of roughly the same size. Depending on how the split program you used before was written, our script may be fractionally faster.

#!/usr/bin/perl
#
#  Simple perl script to split n-triple files like the one from Wikidata
#  into parts.
#
#  Copyright (C) OpenLink Software
#

use strict;
use warnings;

#
#  Vars
#
my $counter = 0;
my $in_file = $ARGV[0];


#
#  Number of bytes to read
#
my $chunk_sz = 500000000;


#
#  Open the source file
#
open (FH, "bzip2 -cd $in_file.nt.bz2 |  ") or die "Could not open source file. $!";

while (1) {
    my $chunk;
    my $out_file;

    #
    #  Open the next part
    #
    print "processing part $counter\n";
    $out_file = sprintf ("wikidata/%s-part-%05d.nt.gz", $in_file, $counter);
    open(OUT, "| gzip -2 >$out_file") or die "Could not open destination file";
    $counter++;

    #
    #  Read the next chunk_sz bytes
    #
    if (!eof(FH)) {
        read(FH, $chunk, $chunk_sz);
        print OUT $chunk;
    }

    #
    #  Read up to the next \n to complete the part
    #
    if (!eof(FH)) {
        $chunk = <FH>;
        print OUT $chunk;
    }

    close(OUT);
    last if eof(FH);
}

To use it, you can run:

$ wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-all.nt.bz2
$ mkdir wikidata
$ perl perl-split.pl latest-all

This will inevitably take a few hours, after which you can remove the latest-all.nt.bz2 file and use the content of the wikidata directory during bulk-load.

Bulk-loading using multiple threads

Using the scripts in the initdb.d directory, loading is done single-threaded, which of course is not ideal on a machine with lots of memory, cores, and threads.

I will think about a slightly different way this can be automated.
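
Until that automation exists, a minimal sketch of the standard Virtuoso parallel bulk-load approach is shown below; the directory, graph IRI, port, and credentials are assumptions for illustration, and the directory holding the split files must be listed in DirsAllowed in virtuoso.ini:

$ # register the split files with the bulk loader
$ isql 1111 dba dba exec="ld_dir('/data/wikidata', '*.nt.gz', 'http://www.wikidata.org/');"
$ # start several loader sessions in parallel, scaled to the available cores
$ for i in 1 2 3 4; do isql 1111 dba dba exec="rdf_loader_run();" & done
$ wait
$ # write a checkpoint once all loaders have finished
$ isql 1111 dba dba exec="checkpoint;"

In a Docker setup this could also be wrapped into a custom script in the initdb.d directory, but the commands can just as well be run manually against the running instance.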

@WolfgangFahl
Author

@tholzheim thanks Tim for showing up and bringing the discussion forward.
