error when loading Wikidata - Non unique primary key on DB.DBA.RDF_LANGUAGE. #1329

Open
pfps opened this issue Nov 13, 2024 · 6 comments

pfps commented Nov 13, 2024

When I tried to load Wikidata into the current version of Virtuoso Open Source I got the following error just after starting:

08:50:42 PL LOG: File /home/local/wikidata/split/wikidata-part-005.ttl error 23000 TURTLE RDF loader, line 112: SR197: Non unique primary key on DB.DBA.RDF_LANGUAGE.

I'm loading from a bunch of Turtle files split up from the Wikidata RDF dump.

I think I reported this error several years ago, but I cannot find my original bug report. As I recall, the problem is a timing issue that shows up when several files are being loaded at the same time.
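
For anyone scripting around this failure, the log line quoted above can be picked apart mechanically. The following is a minimal Python sketch and not part of Virtuoso: the regular expression is inferred only from the single log line in this report, and the name `parse_loader_error` is hypothetical.

```python
import re

# Pattern inferred from the one log line quoted in this issue; other
# Virtuoso log lines may be laid out differently (assumption).
LOADER_ERROR_RE = re.compile(
    r"File (?P<file>\S+) error (?P<sqlstate>\w+) .*?, "
    r"line (?P<line>\d+): (?P<code>\w+): (?P<message>.*)"
)

def parse_loader_error(log_line):
    """Return file, SQLSTATE, line number, error code and message, or None."""
    match = LOADER_ERROR_RE.search(log_line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["line"] = int(fields["line"])  # line number as an integer
    return fields

example = (
    "08:50:42 PL LOG: File /home/local/wikidata/split/wikidata-part-005.ttl "
    "error 23000 TURTLE RDF loader, line 112: SR197: "
    "Non unique primary key on DB.DBA.RDF_LANGUAGE."
)
info = parse_loader_error(example)
print(info["file"], info["line"], info["code"])
```

The extracted file name and line number are what one would need to locate the point at which the loader stopped.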

Author

pfps commented Nov 13, 2024

What is the effect of this error? Is the rest of the file ignored? Or is it just a matter of one triple not being loaded?

Collaborator

pkleef commented Nov 13, 2024

You are correct. Any loading error will stop parsing the file at that point, so the rest of the file is ignored.

If you are using the virtuoso bulk loader, you can check if there are any parts that returned an error with the following command:

SQL> select ll_file, ll_error from load_list where ll_error is not null;
ll_file                                                                           ll_error
VARCHAR NOT NULL                                                                  VARCHAR
_______________________________________________________________________________


0 Rows. -- 3 msec.

If there are any errors, you can create a new file that starts at the line following the failure in the original file and do another round of bulk loading. Note that the bulk loader skips files that have already been loaded, so you will have to use a new file name such as wikidata-part-005-repaired.ttl. In your case, though, it looks more like an issue with the number of parallel loaders you are using, so I will discuss this further within development.
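
The repair step described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not an official tool: the names `repaired_lines` and `write_repaired` are hypothetical, the prefix declarations are copied so that the new file parses on its own, and the sketch assumes that the line after the failing one begins a new Turtle statement. If it falls in the middle of a multi-line statement, the start point has to be moved forward to the next statement boundary by hand.

```python
import re

# Matches Turtle "@prefix"/"@base" and SPARQL-style "PREFIX"/"BASE" lines.
PREFIX_RE = re.compile(r"^\s*@?(prefix|base)\s", re.IGNORECASE)

def repaired_lines(lines, error_line):
    """Yield prefix declarations plus every line after error_line (1-based)."""
    for lineno, line in enumerate(lines, start=1):
        if PREFIX_RE.match(line) or lineno > error_line:
            yield line

def write_repaired(src_path, dst_path, error_line):
    """Write the repaired part to dst_path, which must be a NEW file name."""
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(repaired_lines(src, error_line))

# Tiny in-memory demonstration: the loader is assumed to have failed on line 3.
sample = [
    "@prefix ex: <http://example.org/> .\n",
    "ex:a ex:p ex:b .\n",
    "ex:c ex:p ex:d .\n",
    "ex:e ex:p ex:f .\n",
]
print("".join(repaired_lines(sample, error_line=3)))
```

The new file then has to be registered with the bulk loader under its new name before another loading round is started.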

Couple of questions:

  1. What is the exact version of the virtuoso-t binary you are using? (output of virtuoso-t -?)

  2. How did you create the .ttl parts from wikidata?

  3. How many CPU cores/threads are available on your system?

  4. How many parallel rdf_loader_run() commands did you use?

  5. Can we get a copy of your virtuoso.ini?

Author

pfps commented Nov 13, 2024

1/ I assume that you want only

Virtuoso Open Source Edition (Column Store) (multi threaded)
Version 7.2.15-dev.3240-pthreads as of Nov 12 2024 (d275e564e)
Compiled for Linux (x86_64-pc-linux-gnu)
Copyright (C) 1998-2024 OpenLink Software

2/ I created the .ttl parts by reading the full file, storing the namespace declarations, and then splitting the file: after reading 2GB I look for the next item boundary and write that section preceded by the namespace declarations.

3/ I am running on Ryzen 9950X with 16 cores and 32 threads with 196GB of memory.

4/ I used 10 loaders.

5/ Here is virtuoso.ini

;
;  virtuoso.ini
;
;  Configuration file for the OpenLink Virtuoso VDBMS Server
;
;  To learn more about this product, or any other product in our
;  portfolio, please check out our web site at:
;
;      http://virtuoso.openlinksw.com/
;
;  or contact us at:
;
;      [email protected]
;
;  If you have any technical questions, please contact our support
;  staff at:
;
;      [email protected]
;

;
;  Database setup
;
[Database]
DatabaseFile			= virtuoso.db
ErrorLogFile			= virtuoso.log
LockFile			= virtuoso.lck
TransactionFile			= virtuoso.trx
xa_persistent_file		= virtuoso.pxa
ErrorLogLevel			= 7
FileExtend			= 200
MaxCheckpointRemap		= 2000
Striping			= 0
TempStorage			= TempDatabase


[TempDatabase]
DatabaseFile			= virtuoso-temp.db
TransactionFile			= virtuoso-temp.trx
MaxCheckpointRemap		= 2000
Striping			= 0


;
;  Server parameters
;
[Parameters]
ServerPort			= 1111
LiteMode			= 0
DisableUnixSocket		= 1
DisableTcpSocket		= 0
;SSLServerPort			= 2111
;SSLCertificate			= cert.pem
;SSLPrivateKey			= pk.pem
;X509ClientVerify		= 0
;X509ClientVerifyDepth		= 0
;X509ClientVerifyCAFile		= ca.pem
MaxClientConnections		= 10
CheckpointInterval		= 60
O_DIRECT			= 0
CaseMode			= 2
MaxStaticCursorRows		= 5000
CheckpointAuditTrail		= 0
AllowOSCalls			= 0
SchedulerInterval		= 10
DirsAllowed			= ., ../vad/, /usr/share/proj, /home/local/wikidata/split
ThreadCleanupInterval		= 0
ThreadThreshold			= 10
ResourcesCleanupInterval	= 0
FreeTextBatchSize		= 100000
SingleCPU			= 0
VADInstallDir			= ../vad/
PrefixResultNames               = 0
RdfFreeTextRulesSize		= 100
IndexTreeMaps			= 256
MaxMemPoolSize                  = 200000000
PrefixResultNames               = 0
MacSpotlight                    = 0
IndexTreeMaps                   = 64
MaxQueryMem 		 	= 64G		; memory allocated to query processor
VectorSize 		 	= 1000		; initial parallel query vector (array of query operations) size
MaxVectorSize 		 	= 1000000	; query vector size threshold.
AdjustVectorSize 	 	= 0
ThreadsPerQuery 	 	= 4
AsyncQueueMaxThreads 	 	= 10
MaxSortedTopRows = 10000000
;;
;; When running with large data sets, one should configure the Virtuoso
;; process to use between 2/3 to 3/5 of free system memory and to stripe
;; storage on all available disks.
;;
;; Uncomment next two lines if there is 2 GB system memory free
;NumberOfBuffers          = 170000
;MaxDirtyBuffers          = 130000
;; Uncomment next two lines if there is 4 GB system memory free
;NumberOfBuffers          = 340000
; MaxDirtyBuffers          = 250000
;; Uncomment next two lines if there is 8 GB system memory free
;NumberOfBuffers          = 680000
;MaxDirtyBuffers          = 500000
;; Uncomment next two lines if there is 16 GB system memory free
;NumberOfBuffers          = 1360000
;MaxDirtyBuffers          = 1000000
;; Uncomment next two lines if there is 32 GB system memory free
;NumberOfBuffers          = 2720000
;MaxDirtyBuffers          = 2000000
;; Uncomment next two lines if there is 48 GB system memory free
;NumberOfBuffers          = 4000000
;MaxDirtyBuffers          = 3000000
;; Uncomment next two lines if there is 64 GB system memory free
;NumberOfBuffers          = 5450000
;MaxDirtyBuffers          = 4000000
;;
;; Note the default settings will take very little memory
;; but will not result in very good performance
;;
NumberOfBuffers          = 11000000
MaxDirtyBuffers          = 8000000


[HTTPServer]
ServerPort			= 8890
ServerRoot			= ../vsp
MaxClientConnections		= 12
DavRoot				= DAV
EnabledDavVSP			= 0
HTTPProxyEnabled		= 0
TempASPXDir			= 0
DefaultMailServer		= localhost:25
ServerThreads			= 16
MaxKeepAlives			= 10
KeepAliveTimeout		= 10
MaxCachedProxyConnections	= 10
ProxyConnectionCacheTimeout	= 15
HTTPThreadSize			= 280000
HttpPrintWarningsInOutput	= 0
Charset				= UTF-8
;HTTPLogFile		        = logs/http.log
MaintenancePage             	= atomic.html
EnabledGzipContent          	= 1


[AutoRepair]
BadParentLinks			= 0


[Client]
SQL_PREFETCH_ROWS		= 10000
SQL_PREFETCH_BYTES		= 1600000
SQL_QUERY_TIMEOUT		= 0
SQL_TXN_TIMEOUT			= 0
;SQL_NO_CHAR_C_ESCAPE		= 1
;SQL_UTF8_EXECS			= 0
;SQL_NO_SYSTEM_TABLES		= 0
;SQL_BINARY_TIMESTAMP		= 1
;SQL_ENCRYPTION_ON_PASSWORD	= -1


[VDB]
ArrayOptimization		= 0
NumArrayParameters		= 10
VDBDisconnectTimeout		= 1000
KeepConnectionOnFixedThread	= 0


[Replication]
ServerName			= VIRTUOSO
ServerEnable			= 1
QueueMax			= 50000


;
;  Striping setup
;
;  These parameters have only effect when Striping is set to 1 in the
;  [Database] section, in which case the DatabaseFile parameter is ignored.
;
;  With striping, the database is spawned across multiple segments
;  where each segment can have multiple stripes.
;
;  Format of the lines below:
;    Segment<number> = <size>, <stripe file name> [, <stripe file name> .. ]
;
;  <number> must be ordered from 1 up.
;
;  The <size> is the total size of the segment which is equally divided
;  across all stripes forming  the segment. Its specification can be in
;  gigabytes (g), megabytes (m), kilobytes (k) or in database blocks
;  (b, the default)
;
;  Note that the segment size must be a multiple of the database page size
;  which is currently 8k. Also, the segment size must be divisible by the
;  number of stripe files forming  the segment.
;
;  The example below creates a 200 meg database striped on two segments
;  with two stripes of 50 meg and one of 100 meg.
;
;  You can always add more segments to the configuration, but once
;  added, do not change the setup.
;
[Striping]
Segment1			= 100M, db-seg1-1.db, db-seg1-2.db
Segment2			= 100M, db-seg2-1.db
;...

;[TempStriping]
;Segment1			= 100M, db-seg1-1.db, db-seg1-2.db
;Segment2			= 100M, db-seg2-1.db
;...

;[Ucms]
;UcmPath			= <path>
;Ucm1				= <file>
;Ucm2				= <file>
;...


[Zero Config]
ServerName			= VIRTUOSO
;ServerDSN			= ZDSN
;SSLServerName			= 
;SSLServerDSN			= 


[URIQA]
DynamicLocal			= 0
DefaultHost			= localhost:8890


[SPARQL]
;ExternalQuerySource		= 1
;ExternalXsltSource 		= 1
;DefaultGraph      		= http://localhost:8890/dataspace
;ImmutableGraphs    		= http://localhost:8890/dataspace
ResultSetMaxRows           	= 10000000
MaxQueryCostEstimationTime 	= 600	; in seconds
MaxQueryExecutionTime      	= 180	; in seconds
DefaultQuery               	= select distinct ?Concept where {[] a ?Concept} LIMIT 100
DeferInferenceRulesInit    	= 0  ; controls inference rules loading


[Plugins]
LoadPath			= ../hosting
Load1				= plain, wikiv
Load2				= plain, mediawiki
Load3				= plain, creolewiki
Load4				= plain, im
;Load5				= plain, wbxml2
;Load6				= attach, libphp5.so
;Load7				= Hosting, hosting_php.so
Load8				= plain, proj4
Load9				= plain, geos
Load10				= plain, shapefileio
;Load11				= plain, hslookup
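
The splitting procedure described in answer 2/ above could look roughly like the following Python sketch. It is not the script that was actually used: the name `split_turtle` is hypothetical, a blank line is assumed to mark an item boundary, and all namespace declarations are assumed to appear before the first item.

```python
def split_turtle(lines, max_bytes):
    """Yield parts of about max_bytes each, every part preceded by the prefixes."""
    prefixes, chunk, size = [], [], 0
    for line in lines:
        if line.lstrip().lower().startswith(("@prefix", "@base")):
            prefixes.append(line)  # remember namespace declarations
            continue
        chunk.append(line)
        size += len(line.encode("utf-8"))
        # Cut only at an item boundary (assumed here to be a blank line).
        if size >= max_bytes and line.strip() == "":
            yield "".join(prefixes + chunk)
            chunk, size = [], 0
    if any(l.strip() for l in chunk):  # flush a non-empty remainder
        yield "".join(prefixes + chunk)

sample = [
    "@prefix ex: <http://example.org/> .\n",
    "ex:a ex:p ex:b .\n",
    "\n",
    "ex:c ex:p ex:d .\n",
    "\n",
]
parts = list(split_turtle(sample, max_bytes=1))
print(len(parts))
```

For a real dump, `lines` would be an open file object and `max_bytes` about 2 * 1024**3, with each yielded part written to its own .ttl file.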

Collaborator

pkleef commented Nov 13, 2024

Additionally, can you run the following 2 queries and show me the output:

select count(*) from RDF_LANGUAGE table option (index DB_DBA_RDF_LANGUAGE_UNQC_RL_TWOBYTE);

select count(*) from RDF_LANGUAGE table option (index RDF_LANGUAGE);

Author

pfps commented Nov 13, 2024

That's not going to be useful as I have since restarted the load (and this time it is working).

I seem to remember that I reported this bug quite some years ago and even had a patch for it. I can't find the bug message but here is the patch (against version 7.20).

Patch the Turtle loader by editing the end of rdf_rl_lang_id in <parent directory>/virtuoso-opensource/libsrc/Wi/ttlpv.sql so that it looks as follows:

  id:= sequence_next ('RDF_LANGUAGE_TWOBYTE', 1, 1);
--pfps  insert into rdf_language (rl_twobyte, rl_id) values (id, ln);
  insert soft rdf_language (rl_twobyte, rl_id) values (id, ln);
  commit work; -- if load non transactional, this is still a sharp transaction boundary.
  log_enable (old_mode, 1);
--pfps get the actual id, as it may be different
  id := (select RL_TWOBYTE from DB.DBA.RDF_LANGUAGE where RL_ID = ln);
  rdf_cache_id ('l', ln, id);
  return id;
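
The idea behind the patch (insert without failing on a duplicate key, then read back whichever id actually won) can be illustrated outside Virtuoso. The sketch below uses Python's sqlite3 module purely as a stand-in: `INSERT OR IGNORE` plays the role of `insert soft`, the two-column table is a simplified assumption modeled on the names in this thread, and `get_or_create_lang_id` is a hypothetical helper, not Virtuoso code.

```python
import sqlite3

def get_or_create_lang_id(conn, lang, candidate_id):
    """Register lang with candidate_id unless it already exists; return the stored id."""
    # A duplicate key is silently skipped instead of raising an error.
    conn.execute(
        "INSERT OR IGNORE INTO rdf_language (rl_id, rl_twobyte) VALUES (?, ?)",
        (lang, candidate_id),
    )
    conn.commit()
    # Read back the actual id, which may differ from candidate_id
    # if another loader inserted the language first.
    row = conn.execute(
        "SELECT rl_twobyte FROM rdf_language WHERE rl_id = ?", (lang,)
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rdf_language (rl_id TEXT PRIMARY KEY, rl_twobyte INTEGER UNIQUE)"
)
first = get_or_create_lang_id(conn, "en", 1)   # first loader wins with id 1
second = get_or_create_lang_id(conn, "en", 2)  # second loader gets id 1 back
print(first, second)
```

The second call does not raise a primary key error and returns the id stored by the first call, which is the behaviour the patch aims for when two loaders meet the same language tag at the same time. The sketch assumes candidate ids are unique, as they would be when drawn from a sequence.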

Collaborator

pkleef commented Nov 13, 2024

Thanks for your feedback. I will discuss this within development and let you know.

pkleef self-assigned this Nov 13, 2024