Add paragraph concerning YouTube exact search inconsistency

Benjamin Loison 2023-01-24 22:47:00 +01:00
parent 0e6d8d5051
commit 05eaf8a7c9

34
Home.md

@ -40,6 +40,40 @@ Note that all of these videos are partial uploads of the original video and they
Note that the video only has auto-generated captions.
### - [`gJjLdnycuyU`](https://www.youtube.com/watch?v=gJjLdnycuyU) `My kids have seen a lot of cartoons`
Following [my project proposal](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal), I received the following remark:
> It's not clear to me from the "proof" part whether the video "o8NPllzkFhE" is not returned because of an indexing problem or because it is considered to be a duplicate of the video "Vo9KPk-gqKk".
> Did you manage to identify a case where a video is not returned even though it is the only match to a query? (Indeed, if the goal of your project is just to work around the fact that some duplicate videos are removed from search results, then it limits a bit the appeal.)
**Let's try to answer this question with the best approach and show how YouTube search doesn't make sense sometimes.**
Let's look at videos which have both automatically generated captions and not automatically generated captions and let's focus on English, so we will consider [`@TED`](https://www.youtube.com/@TED) videos, as they are quite an interesting dataset for this purpose.
Thanks to [YouTube operational API Videos: list endpoint](https://stackoverflow.com/a/74324720) we learn that its channel id is [`UCAuUUnT6oDeKwE6v1NGQxug`](https://www.youtube.com/channel/UCAuUUnT6oDeKwE6v1NGQxug).
Then let's list their videos thanks to the [YouTube Data API v3 PlaylistItems: list](https://developers.google.com/youtube/v3/docs/playlistItems/list) endpoint, from the oldest one to the newest one. That way we will work with old videos that have had enough time to be processed. As of January 24, 2023, they have `4185` videos retrievable that way.
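The listing step can be sketched as follows. This is a minimal sketch, not the exact script used here: it assumes that the uploads playlist id is the channel id with its `UC` prefix replaced by `UU`, that an API key is available, and that sorting on `contentDetails.videoPublishedAt` gives the oldest-first order. The helper names (`uploads_playlist_id`, `fetch_page`, `list_video_ids_oldest_first`) are hypothetical.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/youtube/v3/playlistItems"


def uploads_playlist_id(channel_id):
    # Assumption: the uploads playlist id is the channel id with "UC" replaced by "UU".
    if not channel_id.startswith("UC"):
        raise ValueError("expected a channel id starting with 'UC'")
    return "UU" + channel_id[2:]


def fetch_page(params):
    # One HTTP GET against PlaylistItems: list, decoded from JSON.
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def list_video_ids_oldest_first(channel_id, api_key, fetch=fetch_page):
    # `fetch` is injectable so that the pagination logic can be tried offline.
    params = {
        "part": "contentDetails",
        "playlistId": uploads_playlist_id(channel_id),
        "maxResults": 50,
        "key": api_key,
    }
    videos = []
    while True:
        page = fetch(dict(params))
        for item in page.get("items", []):
            details = item["contentDetails"]
            videos.append((details.get("videoPublishedAt", ""), details["videoId"]))
        token = page.get("nextPageToken")
        if token is None:
            break
        params["pageToken"] = token
    # Publication dates share one ISO 8601 format, so sorting the strings sorts by date.
    videos.sort()
    return [video_id for _, video_id in videos]
```

Calling `list_video_ids_oldest_first("UCAuUUnT6oDeKwE6v1NGQxug", api_key)` with a valid key would walk through all the pages, 50 items at a time.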
Then we will focus on videos that have fewer than 2 caption tracks, including one that is both in English and not automatically generated. The first one matching this criterion is [`gJjLdnycuyU`](https://www.youtube.com/watch?v=gJjLdnycuyU), which is the 2970th oldest video. The hope in looking for the oldest videos matching this criterion is that they are simple in terms of captions (not having a lot of caption tracks) and, as they are old videos whose captions aren't translated into many languages, that they don't have many views, which makes duplicates less likely.
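The selection criterion can be sketched as a small predicate. This is only an illustration under assumptions: each track is a dictionary with the `language` and `trackKind` fields of the YouTube Data API v3 caption resource (where `ASR` marks automatically generated tracks), and whether the automatically generated track counts toward the limit is not stated above, so it is a parameter here. The function names are hypothetical.

```python
def is_candidate(tracks, max_tracks=2, count_auto_generated=False):
    # Assumption: by default, automatically generated ("ASR") tracks are not counted.
    counted = [t for t in tracks if count_auto_generated or t["trackKind"] != "ASR"]
    # At least one track must be in English and not automatically generated.
    has_manual_english = any(
        t["trackKind"] != "ASR" and t["language"].split("-")[0].lower() == "en"
        for t in tracks
    )
    return len(counted) < max_tracks and has_manual_english


def first_candidate(videos):
    # `videos` is an iterable of (video id, tracks) pairs ordered oldest first.
    for video_id, tracks in videos:
        if is_candidate(tracks):
            return video_id
    return None
```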
Let's focus on the sentence at [0:08](https://www.youtube.com/watch?v=gJjLdnycuyU&t=8s) of this video, that is *my kids have seen a lot of cartoons*. More precisely, according to the:
- not automatically generated captions: `My kids have seen a lot of cartoons`
- automatically generated captions: `my kids have seen a lot of cartoons`
Let's put the odds on our side by assuming that YouTube's exact search feature, `"Your query"`, is case sensitive, and consider only the part common to both caption tracks: `"kids have seen a lot of cartoons"`. If we provide it to the [YouTube Data API v3 Search: list](https://developers.google.com/youtube/v3/docs/search/list) endpoint, [we get at least `50` results](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22kids%20have%20seen%20a%20lot%20of%20cartoons%22&maxResults=50), among which `gJjLdnycuyU` doesn't appear. Let's assume for now that all these videos contain `kids have seen a lot of cartoons` and that our study video would appear on a following page. As we have better things to do than watch random videos lasting tens of minutes, let's use the [YouTube operational API Videos: list endpoint with `part=contentDetails`](https://stackoverflow.com/a/70908689) to find the shortest video, in order to verify that the YouTube exact search feature works as expected. The shortest video is [`dC7tUcRCS58`](https://www.youtube.com/watch?v=dC7tUcRCS58) and lasts 175 seconds. Neither its audio, its video nor its automatically generated captions contain anything close to `kids have seen a lot of cartoons`.
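As an illustration, the query used above can be rebuilt from the two caption lines. This sketch compares the lines word by word and case sensitively, keeps the longest common run of words, wraps it in double quotes and URL-encodes it. The function names `common_words` and `exact_search_url` are hypothetical, and the URL layout simply mirrors the link above.

```python
import difflib
import urllib.parse


def common_words(first, second):
    # Longest run of consecutive words shared by both lines (case sensitive).
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


def exact_search_url(phrase, base="https://yt.lemnoslife.com/noKey/search"):
    # Double quotes request an exact search; quote() encodes them as %22.
    query = urllib.parse.quote('"' + phrase + '"')
    return f"{base}?part=snippet&q={query}&maxResults=50"


phrase = common_words(
    "My kids have seen a lot of cartoons",
    "my kids have seen a lot of cartoons",
)
print(phrase)
print(exact_search_url(phrase))
```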
So YouTube is just giving us random videos related to the words we typed, not the results of the exact search we asked it to perform.
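The verification described above (take the shortest result, then look for the phrase in its captions) can be sketched as follows. It assumes durations in the ISO 8601 form used by the YouTube Data API v3 `contentDetails.duration` field, limited to hours, minutes and seconds, and that the caption text has already been retrieved by some other means. All function names are hypothetical.

```python
import re

# Assumption: durations look like "PT1H2M3S"; durations with a day part are not handled.
DURATION = re.compile(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")


def duration_seconds(iso_duration):
    match = DURATION.fullmatch(iso_duration)
    if match is None:
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    hours, minutes, seconds = (int(group or 0) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def shortest_video(durations):
    # `durations` maps a video id to its ISO 8601 duration.
    return min(durations, key=lambda video_id: duration_seconds(durations[video_id]))


def contains_phrase(caption_text, phrase):
    # Case-sensitive check that ignores line breaks and repeated spaces.
    def normalize(text):
        return " ".join(text.split())

    return normalize(phrase) in normalize(caption_text)
```

For instance, `duration_seconds("PT2M55S")` gives `175`, the duration reported above for the shortest result.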
By contrast, concerning [the video from the project proposal](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#o8npllzkfhe-https-www-youtube-com-watch-v-o8npllzkfhe-linux-is-in-millions-of-computers), [`The mind behind Linux | Linus Torvalds`](https://www.youtube.com/watch?v=o8NPllzkFhE), an exact search for [`"your software Linux is in millions of computers"`](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22your%20software%20Linux%20is%20in%20millions%20of%20computers%22&maxResults=50) returns only one result, [`Vo9KPk-gqKk`](https://www.youtube.com/watch?v=Vo9KPk-gqKk), which, as discussed, contains the exact sentence `your software Linux is in millions of computers`.
So trying to *identify a case where a video is not returned even though it is the only match to a query* shows inconsistent behavior from YouTube exact search: it gives exactly what we asked for in our test with `The mind behind Linux | Linus Torvalds`, but not in our test with `The creative power of misfits | WorkLife with Adam Grant (Audio only)`.
Note that the [YouTube UI](https://www.youtube.com/results?search_query=%22kids+have+seen+a+lot+of+cartoons%22) shows the same bug of returning too many results concerning `The creative power of misfits | WorkLife with Adam Grant (Audio only)`.
From [my experience with YouTube](https://stackoverflow.com/users/7123660/benjamin-loison), which is starting to be significant, we can't rely on the YouTube search feature, as it gives weird results, as shown above. However, YouTube gives quite correct information concerning a given video id, so [the best approach that I am aware of](https://stackoverflow.com/a/69259093) to return exactly correct and, as far as possible, exhaustive results consists in discovering the maximum number of videos through some crawling approach, as I sketch in the last paragraph of the project proposal.
## Concerning 20,000 videos limit for YouTube Data API v3 PlaylistItems: list endpoint
Could try both (`-i` was required for ignoring errors such as age-restricted videos):