Get your own copy of wikidata with qlever #668
-
The latest attempt is documented at https://wiki.bitplan.com/index.php/WikiData_Import_2022-05-21.
-
https://wiki.bitplan.com/index.php/WikiData_Import_2022-05-22 has been started. ad-freiburg/qlever-control#4 was hit again - retrying another time ...
-
I could do the next wikidata import trial on multiple machines:
Options 1-3 are readily available and worked fine in 2018, and I see no reason why they shouldn't still work, since the machines run the OS and dozens of other software packages without problems. As for the instructions: I would git clone/pull the qlever-control script and the qlever C++ software, set up the packages as outlined in https://wiki.bitplan.com/index.php/WikiData_Import_2022-03-16#Native_approach, recompile, and then run the qlever-control script; see the sketch below. For me the biggest reproducibility issue is still https://wiki.bitplan.com/index.php/WikiData_Import_2022-03-16#Native_approach, because the libraries of the OS seem to need tweaking. If we can't use Docker to make sure these prerequisites are met, I think this setup should also be scripted and checked; otherwise reproducibility might suffer. Is there any chance for my macOS environment?
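As a rough outline of those instructions - a sketch, not the exact commands from the wiki page; the CMake invocation is the standard one for QLever and assumes the OS prerequisites from the Native approach section are already in place:

```bash
# Sketch of the native (non-Docker) build; see the wiki page above
# for the exact, tested steps.
git clone --recursive https://github.com/ad-freiburg/qlever.git
git clone https://github.com/ad-freiburg/qlever-control.git

# Standard CMake release build of the QLever binaries; assumes the
# OS-level libraries mentioned in the "Native approach" are installed.
cd qlever && mkdir -p build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make -j"$(nproc)"
```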
-
@joka921 see https://wiki.bitplan.com/index.php/WikiData_Import_2022-05-22#wikidata.settings.json - it only sets the number of triples per batch.
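For context, such a settings file is tiny; a minimal sketch, assuming QLever's num-triples-per-batch key (the value here is illustrative, not the one from the import linked above):

```bash
# Minimal settings file for the index builder; key name per the
# QLever wiki, value purely illustrative.
cat > wikidata.settings.json <<'EOF'
{
  "num-triples-per-batch": 10000000
}
EOF
```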
-
https://wiki.bitplan.com/index.php/WikiData_Import_2022-06-24 has the report for the latest attempt.
-
https://wiki.bitplan.com/index.php/WikiData_Import_2022-06-25 has the report for the latest attempt.
-
Are you saying it worked 🙂 Can you post the output of
-
@hannahbast - the last few lines of the log are: I don't know what the expected output of the log is with this version - my logstat script assumes entries as they were found in the February log. In the last few crashes there was no core dump or log entry, so there is no graceful handling of out-of-memory or out-of-disk-space situations - each of these should IMHO be a separate ticket. I would also appreciate it if the logfile had an "I am still alive" entry after a specified interval, say 1h, so that it is not necessary to check the process state separately when monitoring the indexing, see #565 (a possible workaround is sketched below). The last successful attempt on this machine took some 4 days - after the recent speedups things might go a bit quicker - let's see what the timing will be for this attempt.
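Until such a heartbeat exists, a small watchdog along the following lines can approximate it - a sketch assuming the indexer runs as IndexBuilderMain and logs to nohup.out; the interval and names are hypothetical:

```bash
# Hypothetical watchdog: once per hour, report that the index builder
# still runs and how large its log has grown; exit when it stops.
LOG=nohup.out
while pgrep -f IndexBuilderMain > /dev/null; do
  echo "$(date) still alive, log size: $(wc -c < "$LOG") bytes"
  sleep 3600
done
echo "$(date) index builder no longer running"
```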
-
In the meantime the process seems to have stopped and the result is:
Is that the expected result? The logfile seems unchanged and there is no success or finish message. I fear this might be because I had to delete some lines from the nohup.out file while the process was still running.
-
@WolfgangFahl I would strongly suggest following the instructions on https://github.com/ad-freiburg/qlever/wiki/Using-QLever-for-Wikidata exactly. It's maximally easy, I have done it many times on many different machines already, and it always worked like a charm (in around 14 hours on our machines). Make sure that you use the latest version of the qlever script.
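For readers arriving later: with the current qlever-control package the wiki's workflow reduces to a handful of commands - a sketch following the qlever-control README (the 2022-era script discussed in this thread worked differently):

```bash
# Wikidata workflow with the current qlever-control CLI, per its
# README; the 2022 version of the script used different invocations.
pip install qlever
mkdir wikidata && cd wikidata
qlever setup-config wikidata   # generate a Qleverfile for Wikidata
qlever get-data                # download the dump files
qlever index                   # build the QLever index
qlever start                   # start the SPARQL server
```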
-
@hannahbast could you please give me a hint about the setup of your 14h run, so that it can be included in https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Performance_Reports
-
@WolfgangFahl What exactly is your question? Have you read https://github.com/ad-freiburg/qlever/wiki/Using-QLever-for-Wikidata ? Have you looked at the QLeverfile produced by the qlever script for Wikidata? It contains all the relevant settings.
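For readers who have not seen one: a Qleverfile is a plain INI-style config generated by the qlever script. The excerpt below is only a sketch - section and field names follow the qlever-control documentation, the values are hypothetical:

```bash
# Illustrative Qleverfile excerpt (normally generated by
# `qlever setup-config wikidata`); field names per qlever-control's
# docs, values hypothetical.
cat > Qleverfile <<'EOF'
[data]
NAME = wikidata

[index]
INPUT_FILES   = latest-all.ttl.bz2 latest-lexemes.ttl.bz2
SETTINGS_JSON = { "num-triples-per-batch": 10000000 }

[server]
PORT = 7001
EOF
```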
-
I just looked at https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData#Performance_Reports . I don't think it makes a lot of sense to have a table of absolute running times measured on totally different machines. As a minimum, the table should also include the processor specification, the number of cores, and the amount of RAM used. And since you include unverified information from https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/CKN3QPV2NJ5TAKHORMYDTTXG7Y65UXAF , you should also include a line with the 14 hours from https://github.com/ad-freiburg/qlever/wiki/Using-QLever-for-Wikidata .
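To illustrate the suggested format, such a table could carry columns like these (all values below are placeholders, not real measurements):

| Machine | CPU | Cores | RAM | Disk | Import time |
| --- | --- | --- | --- | --- | --- |
| example-server | (CPU model) | 32 | 256 GB | NVMe SSD | 14 h |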
-
By the way: thanks for including Stardog in your evaluation. I just tried it myself on the DBLP dataset (266M triples), which is large enough to be interesting, but small enough that one can play around with it easily. Here are some very preliminary findings:
-
We'd love to start our qlever wikidata instance, but can't get it to work, neither with qlever-control nor with a command line.
-
@hannahbast @joka921 - thanks for helping with this. https://wiki.bitplan.com/index.php/WikiData_Import_2022-06-24#Starting_server has the description of the working start command. I still have trouble with the GUI, but the server itself works.
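For comparison with the Docker route, a native server start has roughly this shape - a sketch in which the index basename, port, and binary path are placeholders; the authoritative command is on the wiki page linked above:

```bash
# Rough sketch of a native QLever server start; -i is the index
# basename, -p the port (both placeholders - verify the flags
# against ServerMain --help).
./ServerMain -i /data/wikidata/wikidata -p 7001
```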
-
Currently I am busy with my main project, so it might take another 2-3 weeks before I can continue more in-depth experiments for https://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData

Since https://wiki.bitplan.com/index.php/WikiData_Import_2022-01-29 was a success, while the two attempts https://wiki.bitplan.com/index.php/WikiData_Import_2022-03-11 and https://wiki.bitplan.com/index.php/WikiData_Import_2022-03-16 were not, the main problem at this point is that the test environment is unstable. The qlever script #562 I originally used has by now been declared outdated, and if I simply retried with it, that would IMHO not be considered a valid attempt by you.

For the next run I intend to try a smaller dataset first, simply to check the setup, and only then do the wikidata import. That way general issues will be caught much earlier. Ideally the smaller dataset would be a useful subset of wikidata, which I might be able to generate with my Trulytabular tool by selecting only the relevant item classes and properties; a sketch of the general idea follows below. For the "bigger" classes "author" and "paper" qlever is most interesting for me, since the WikidataQueryService times out on the truly tabular analysis attempts.