proper use of Wikimedia Commons images #5

Open
kolossos opened this issue Oct 24, 2022 · 0 comments

I hope this is the right place to raise this.
For the next version of LAION-5B (Stable Diffusion), it would be good to support Wikimedia Commons images properly.
Wikimedia Commons holds over 80 million images, mostly in high resolution and without any watermark clutter.
The total file size is >200 TB at full resolution.

The Commons community also invests a lot of effort in the descriptions of all the images. It is a very diverse dataset of free images and the foundation of Wikipedia's visual content.
However, Commons doesn't use ALT text on the web page; instead it uses a visible description inside the HTML rendered from the wikitext.
Maybe the easiest way to extract the description would be to parse the web page or to parse the XML dump:
https://dumps.wikimedia.org/commonswiki/20221020/commonswiki-{date}-pages-articles-multistream.xml.bz2
There, the image name (ns=6) is available, along with the "Information" template, which holds the description in different languages. Alternatively, it's easy to parse the CSS class "description mw-content-ltr {lang}" on the web page. So this seems like low-hanging fruit to me.
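
To make the dump-based idea concrete, here is a minimal sketch in Python (standard library only). It streams the XML dump, keeps pages with ns=6, and pulls per-language descriptions out of the "Information" template. The function names and the regular expressions are illustrative assumptions, not an existing tool, and the wikitext handling is deliberately simplified: it only understands plain language templates such as `{{en|1=...}}` and will miss nested templates or descriptions written without a language template.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Simple language templates such as {{en|1=A cat}} or {{de|Eine Katze}}.
# Assumption: the description text itself contains no nested "{{...}}".
LANG_TEMPLATE = re.compile(
    r"\{\{\s*([a-z]{2,3}(?:-[a-z]+)?)\s*\|\s*(?:1\s*=\s*)?(.*?)\}\}", re.S
)
# Start of the description parameter inside {{Information ...}}.
DESCRIPTION_FIELD = re.compile(r"\|\s*description\s*=", re.I)
# Start of the next template parameter, e.g. "\n|date=".
NEXT_FIELD = re.compile(r"\n\s*\|\s*\w+\s*=")


def local_name(tag):
    """Drop the XML namespace prefix that the MediaWiki export format adds."""
    return tag.rsplit("}", 1)[-1]


def extract_descriptions(wikitext):
    """Return {language code: description} from a file page's wikitext."""
    start = DESCRIPTION_FIELD.search(wikitext)
    if not start:
        return {}
    rest = wikitext[start.end():]
    end = NEXT_FIELD.search(rest)
    field = rest[:end.start()] if end else rest
    return {lang: text.strip() for lang, text in LANG_TEMPLATE.findall(field)}


def iter_file_descriptions(xml_stream):
    """Yield (title, descriptions) for every ns=6 page that has a description."""
    title = ns = text = None
    for _, elem in ET.iterparse(xml_stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "ns":
            ns = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            if ns == "6":
                descriptions = extract_descriptions(text or "")
                if descriptions:
                    yield title, descriptions
            title = ns = text = None
            elem.clear()  # release the page's children once processed


def iter_dump(path):
    """Stream a .xml.bz2 dump from disk without decompressing it first."""
    with bz2.open(path, "rb") as stream:
        yield from iter_file_descriptions(stream)
```

Calling `iter_dump()` on a downloaded dump file would yield pairs like `("File:Cat.jpg", {"en": "A cat", "de": "Eine Katze"})`. For real use, a proper wikitext parser would be more robust than these regular expressions, and the web-page route via the CSS class mentioned above remains an alternative.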

I could also give you some hints on how to get the URLs, for example to download the images at a lower resolution.
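
As one possible illustration (not necessarily the approach meant above), MediaWiki's `Special:FilePath` page redirects to a file and accepts a `width` query parameter for a scaled-down version. The sketch below only builds such a URL from a page title; the helper name `thumbnail_url` and the default width are hypothetical choices.

```python
from urllib.parse import quote


def thumbnail_url(title, width=320):
    """Build a Special:FilePath URL that redirects to a scaled-down image.

    `title` may be given with or without the "File:" prefix.
    """
    name = title.split(":", 1)[1] if title.startswith("File:") else title
    name = name.replace(" ", "_")  # MediaWiki uses underscores for spaces in URLs
    return (
        "https://commons.wikimedia.org/wiki/Special:FilePath/"
        f"{quote(name)}?width={width}"
    )
```

A crawler would still need to follow the redirect and respect Wikimedia's rate limits and User-Agent policy when downloading at scale.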

Would this be of interest, and would you also see it as an improvement to LAION-5B?
