Why do we only get ~25% of movies with Wikidata matches? #14
Comments
The publication date is definitely a large part of the issue here. After logging just the first 10 movies, only 3 matched (e.g. Fighter 2001 should actually match Fighter (2000)). We could partially solve this by also looking in a range before the given year, in case the DVD was released the year after the movie. That still misses cases where the DVD came out several years after the film (for example, if it was only on VHS for a while), since those fall outside the lookback window. Maybe it's worth looking into other user rating datasets, or even scraping our own from somewhere? Someone posted a link in the Discord to an IMDb ratings dataset on Hugging Face, though I haven't looked at it yet.
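A rough sketch of the lookback idea, in case it helps the discussion. The function names and the `(title, release_year)` candidate shape are made up for illustration and are not taken from the project code; the window only looks backwards because the dataset year is assumed to be the DVD year.

```python
def year_in_lookback(dataset_year, release_year, lookback=1):
    """True if release_year is at most `lookback` years before dataset_year."""
    return 0 <= dataset_year - release_year <= lookback


def match_with_lookback(title, dataset_year, candidates, lookback=1):
    """Filter (title, release_year) candidates by title and year window.

    `candidates` is a hypothetical list of tuples, e.g. results already
    fetched from Wikidata.
    """
    wanted = title.casefold()
    return [
        (cand_title, cand_year)
        for cand_title, cand_year in candidates
        if cand_title.casefold() == wanted
        and year_in_lookback(dataset_year, cand_year, lookback)
    ]


candidates = [("Fighter", 2000), ("Fighter", 1995)]
print(match_with_lookback("Fighter", 2001, candidates))
```

A larger `lookback` catches late DVD releases but also lets in more unrelated films with the same title, which is the trade-off described above.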
Discussed a solution: it's likely best to find an explicit match rather than assume a year range in which a movie should exist. A good approach would be to prioritize matching titles, then keep some kind of database/hash that explicitly maps duplicates.
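One way this could look, as a minimal sketch: check a hand-curated override table first, then fall back to title matching. The `resolve` function, the candidate dict shape, and the `Q_EXAMPLE_*` identifiers are all placeholders for illustration, not real Wikidata IDs or existing project code.

```python
def resolve(title, year, explicit, candidates):
    """Return an ID for (title, year), or None if the match is ambiguous.

    explicit:   dict mapping (title, year) -> ID, curated by hand (hypothetical).
    candidates: list of dicts with "id", "title" and optional "year" keys.
    """
    key = (title, year)
    if key in explicit:
        return explicit[key]  # explicit mapping always wins

    title_hits = [c for c in candidates if c["title"].casefold() == title.casefold()]
    same_year = [c for c in title_hits if c.get("year") == year]
    if len(same_year) == 1:
        return same_year[0]["id"]
    if len(title_hits) == 1:
        return title_hits[0]["id"]  # unique title, year missing or different
    return None  # zero or several hits: needs a manual entry in `explicit`


explicit = {("Fighter", 2001): "Q_EXAMPLE_A"}  # placeholder ID
print(resolve("Fighter", 2001, explicit, []))
```

Anything that comes back as `None` would be a candidate for a new row in the explicit table.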
Looks like we might actually have a different problem. After manually checking a couple of missed matches, the query should be matching the correct title and year, but isn't (e.g. movie id 7, "8 Man", 1992). I tried removing the year filter and it still wouldn't match, so the title string itself must be failing to match.
Great work, we should investigate further! One thing to note is that we're using the "search" functionality for the movie titles. Maybe we should be looking for an exact match instead?
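If we go the exact-match route, we would probably still want to normalize both titles first so that case, punctuation, and spacing differences don't cause misses. A small sketch, with hypothetical helper names and normalization rules that are only a guess at what the data needs:

```python
import re
import unicodedata


def normalize_title(title):
    """Casefold, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", title).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(text.split())


def exact_title_match(a, b):
    """Compare two titles after normalization."""
    return normalize_title(a) == normalize_title(b)


print(exact_title_match("8 Man", "8-Man"))
```

This is stricter than a search endpoint but more forgiving than a raw string comparison.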
Here's the query for "8 Man" in the Wikidata SPARQL query playground: Sorry for the huge URL! We should be able to play with parameters in there to see if we can get a match. In this case, I think it has something to do with the
However, it doesn't match movies that have no release date specified at all, which seems to be the case with this movie: https://www.wikidata.org/wiki/Q116548163
How about this? It adds a clause to the UNION to include movies with no release date at all, and moves the release date filtering into the spots where a release date is found:
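For anyone reading without the query in front of them, here is a sketch of the general shape described above, written as a Python helper that builds the SPARQL text. It is not the query from this thread: it assumes an exact `rdfs:label` match instead of the search service, and only items that are directly `wdt:P31 wd:Q11424` (film), so subclasses would be missed. `wdt:P577` is the publication date property. The function name is made up.

```python
def build_film_query(title, year, lang="en"):
    """Build a SPARQL query matching a film by exact label.

    Branch 1: the film has a publication date in `year`.
    Branch 2: the film has no publication date at all.
    """
    # Escape backslashes and double quotes for a SPARQL string literal.
    escaped = title.replace("\\", "\\\\").replace('"', '\\"')
    label = f'"{escaped}"@{lang}'
    return f"""
SELECT DISTINCT ?item ?date WHERE {{
  {{
    ?item wdt:P31 wd:Q11424 ;
          rdfs:label {label} ;
          wdt:P577 ?date .
    FILTER(YEAR(?date) = {int(year)})
  }}
  UNION
  {{
    ?item wdt:P31 wd:Q11424 ;
          rdfs:label {label} .
    FILTER NOT EXISTS {{ ?item wdt:P577 ?anyDate }}
  }}
}}
""".strip()


print(build_film_query("8 Man", 1992))
```

The item pattern is repeated in both branches so that the `FILTER NOT EXISTS` in the second branch is evaluated against a bound `?item`.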
Of course, now we get more "duplicates"/matching results, which we have to deal with.
It seems that last query has problems with
Hmm... To be fair, other sources (like Letterboxd) will likely be correct, or will link an ID to another source like The Movie Database, which would be cross-referenced on Wikidata, so it's not totally crucial to match titles.
Whether we match to Wikidata or "other sources like letterboxd", we will still need to perform something akin to Named Entity Recognition to identify the movie in those systems based on just a title and a year.
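To make that concrete, here is one possible sketch of picking the best candidate from just a title and a year, using `difflib.SequenceMatcher` from the standard library for string similarity. The scoring weights, the threshold, and the function names are arbitrary assumptions for illustration, not tuned values.

```python
from difflib import SequenceMatcher


def score(title, year, cand_title, cand_year, year_weight=0.1):
    """Title similarity in [0, 1], minus a penalty for year distance."""
    sim = SequenceMatcher(None, title.casefold(), cand_title.casefold()).ratio()
    if cand_year is None:
        penalty = year_weight  # unknown year: small fixed penalty
    else:
        penalty = year_weight * min(abs(year - cand_year), 5)
    return sim - penalty


def best_candidate(title, year, candidates, threshold=0.8):
    """Return the best (title, year) candidate, or None if none scores high enough."""
    scored = [(score(title, year, t, y), t, y) for t, y in candidates]
    if not scored:
        return None
    best = max(scored, key=lambda s: s[0])
    return (best[1], best[2]) if best[0] >= threshold else None


candidates = [("Fighter", 2000), ("The Fighter", 2010)]
print(best_candidate("Fighter", 2001, candidates))
```

The same scoring could be applied to candidates from any source, since it only needs a title and an optional year per candidate.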
The number might be wrong, but last I heard, only around 25% of the movies in the Netflix data set were matching. This is a bit disappointing. We should look into this and see whether anything obvious explains it.
A couple of current theories: