proper use of Wikimedia Commons images #5

kolossos · 2022-10-24T18:48:57Z

I hope this is the right place to raise this.
For the next version of LAION-5B (Stable Diffusion), it would be good to support Wikimedia Commons images properly.
Wikimedia Commons holds over 80 million images, mostly in high resolution and without watermarks.
The total file size is over 200 TB at full resolution.

The Commons community also invests a lot of effort in describing all the images. It is a very diverse dataset of free images and the basis of Wikipedia's visual content.
However, Commons does not use alt text on its web pages; instead, it shows a visible description inside the HTML rendered from the wikitext.
The easiest way to extract the descriptions might be to parse the web page or the XML dump:
https://dumps.wikimedia.org/commonswiki/20221020/commonswiki-{date}-pages-articles-multistream.xml.bz2
The dump contains the file name (pages with ns=6) as well as the "Information" template, which holds the description in different languages. Alternatively, it is easy to parse the CSS class "description mw-content-ltr {lang}" on the web page. So this looks like low-hanging fruit to me.
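
To make the dump route concrete, here is a minimal sketch in Python (standard library only). It streams the pages, keeps those with ns=6 and pulls the per-language descriptions out of the "Information" template. The function names, the regular expressions and the "und" fallback key are my own assumptions for illustration. The regexes only cover the simple `{{en|1=...}}` form and would miss nested templates or other description templates, so a real pipeline would probably want a proper wikitext parser.

```python
import re
import xml.etree.ElementTree as ET

FILE_NAMESPACE = "6"  # ns=6 is the "File:" namespace on Commons

# Assumption: the description field ends at the next "|param=" line or at the
# closing "}}" of the template. This is a simplification of real wikitext.
DESCRIPTION_RE = re.compile(
    r"\|\s*description\s*=(.*?)(?=\n\s*\|\s*[\w ]+=|\n\s*\}\}|\Z)",
    re.IGNORECASE | re.DOTALL,
)
# Matches language wrappers such as {{en|1=text}} or {{en|text}} (no nesting).
LANG_RE = re.compile(
    r"\{\{\s*([a-z][a-z-]{1,11})\s*\|\s*(?:1\s*=\s*)?([^{}]*?)\s*\}\}"
)


def local(tag):
    """Strip the XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def extract_descriptions(wikitext):
    """Return {language code: description} from an Information template."""
    match = DESCRIPTION_RE.search(wikitext)
    if not match:
        return {}
    raw = match.group(1).strip()
    found = dict(LANG_RE.findall(raw))
    if not found and raw:
        # Hypothetical fallback: keep untagged text under "und" (undetermined).
        found = {"und": raw}
    return found


def iter_file_pages(stream):
    """Yield (title, descriptions) for every ns=6 page in a dump stream."""
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem.iter()}
        ns, title, text = fields.get("ns"), fields.get("title"), fields.get("text")
        if ns is not None and ns.text == FILE_NAMESPACE and title is not None:
            wikitext = (text.text or "") if text is not None else ""
            yield title.text, extract_descriptions(wikitext)
        elem.clear()  # free memory; the full dump is far too large to keep
```

For the real dump, the stream could be opened with `bz2.open(path, "rb")` and passed to `iter_file_pages`; Python's `bz2` module reads multi-stream files transparently.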

I could also give you some hints on how to get the URLs for downloading the images at low resolution.

Would this be of interest, and do you see it as an improvement for LAION-5B?
