From 15354283721b397d430844babea61c5746e26e1b Mon Sep 17 00:00:00 2001 From: Benjamin Loison Date: Thu, 23 Feb 2023 22:43:52 +0100 Subject: [PATCH] Move `Home.md` to `Trying-to-understand-YouTube-results.md` and move `The-crawler.md` to `Home.md` --- Home.md | 202 ++++++------------------ The-crawler.md | 77 --------- Trying-to-understand-YouTube-results.md | 177 +++++++++++++++++++++ 3 files changed, 228 insertions(+), 228 deletions(-) delete mode 100644 The-crawler.md create mode 100644 Trying-to-understand-YouTube-results.md diff --git a/Home.md b/Home.md index ec7882b..68e89a5 100644 --- a/Home.md +++ b/Home.md @@ -1,177 +1,77 @@ -As described on [the project proposal page](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal), there is a discovery process consisting in going through comments, so we will try to also keep comments. That way we could end up, potentially after the project, doing interesting stuff such as listing all comments written by a given user, as [my French only without discovery process project](https://github.com/Benjamin-Loison/YouTube-comments-graph) was doing. +Note that the following does not address the captions problem described in [the project proposal](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal). Here I only explain how the crawler discovers channels. -## Dive into YouTube search results +Here I mostly talk about retrieving the maximum number of channels, while we are in fact interested in videos for their captions. That isn't a problem, as we discover all the videos of a channel thanks to the [YouTube Data API v3 PlaylistItems: list](https://developers.google.com/youtube/v3/docs/playlistItems/list) endpoint. -As a first feeling it seems that YouTube returns videos that only match auto-generated captions. 
+## Context -### - [`7TXEZ4tP06c`](https://www.youtube.com/watch?v=7TXEZ4tP06c) `how many people here would say they can draw` +The crawler relies on two data sources: +- the official [YouTube Data API v3](https://developers.google.com/youtube/v3) +- my unofficial [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API) -Let's consider [`7TXEZ4tP06c`](https://www.youtube.com/watch?v=7TXEZ4tP06c) at [0:18](https://www.youtube.com/watch?v=7TXEZ4tP06c&t=18s) the auto-generated captions and not auto-generated captions are `how many people here would say they can draw`. +To put it in a nutshell from [my crawling experience with YouTube data](https://stackoverflow.com/users/7123660/benjamin-loison) in YouTube Data API v3: -[Passing this sentence to YouTube Data API v3 Search: list endpoint](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22how%20many%20people%20here%20would%20say%20they%20can%20draw%22&maxResults=50) returns these videos: +- some data are missing ([retrieving `CHANNELS` channel tab content](https://stackoverflow.com/a/74213174) for instance) +- sometimes the data that we retrieve aren't correct ([the video duration](https://stackoverflow.com/a/70908689) for instance) +- the data access is limited ([to an absolute number of results (500 for searches)](https://stackoverflow.com/a/73357447) and it's by default limited to a given number of requests per day (at most 10,000) ([quota available](https://developers.google.com/youtube/v3/getting-started#quota) and [quota cost](https://developers.google.com/youtube/v3/determine_quota_cost) documentations)) -- [`7TXEZ4tP06c`](https://www.youtube.com/watch?v=7TXEZ4tP06c): is the original video ([`7TXEZ4tP06c`](https://www.youtube.com/watch?v=7TXEZ4tP06c)) ([0:18](https://www.youtube.com/watch?v=7TXEZ4tP06c&t=18s)) -- [`qH-yY7UZW_k`](https://www.youtube.com/watch?v=qH-yY7UZW_k): reupload part of the original video at [1:13](https://www.youtube.com/watch?v=qH-yY7UZW_k&t=73s) 
([1:16](https://www.youtube.com/watch?v=qH-yY7UZW_k&t=76s)) -- [`cOwYXnpW-8A`](https://www.youtube.com/watch?v=cOwYXnpW-8A): reupload part of the original video ([0:05](https://www.youtube.com/watch?v=cOwYXnpW-8A&t=5s)) -- [`vzH9Fo9GI9Y`](https://www.youtube.com/watch?v=vzH9Fo9GI9Y): reupload part of the original video at [3:31](https://youtu.be/vzH9Fo9GI9Y?t=211s) ([3:39](https://www.youtube.com/watch?v=vzH9Fo9GI9Y&t=219s)) -- [`gpMp6tz3d7w`](https://www.youtube.com/watch?v=gpMp6tz3d7w): reupload part of the original video at [0:37](https://youtu.be/gpMp6tz3d7w?t=37s) ([0:41](https://www.youtube.com/watch?v=gpMp6tz3d7w&t=41s)) -- [`ZI7XTsGTl34`](https://www.youtube.com/watch?v=gpMp6tz3d7w): reupload part of the original video at [23:36](https://youtu.be/ZI7XTsGTl34?t=1416s) ([23:43](https://www.youtube.com/watch?v=ZI7XTsGTl34&t=1423s)) +so I coded a complementary API that is the YouTube operational API that relies on web-scraping YouTube UI. -Note that all of these videos are partial uploads of the original video and they have auto-generated captions and all exactly contain `how many people here would say they can draw`. +The YouTube operational API has two parts: +- the actual YouTube operational API based on web-scraping YouTube UI +- [the no-key service](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/noKey/index.php) which consists in a proxy to fetch YouTube Data API v3 without any developer key by using a batch of more than 200 keys that I gathered thanks to people having unintentionally leaked their developer keys on Stack Overflow, GitHub... 
-### - [`o8NPllzkFhE`](https://www.youtube.com/watch?v=o8NPllzkFhE) `linux is in millions of computers` +Note that the web-scraping of the YouTube UI is not done with tools similar to Selenium which runs JavaScript but is a low-level web-scraping by just proceeding to HTTP requests and parsing the JavaScript variable containing data that the webpage is generated with. -Completing [the project proposal example](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal), [`Vo9KPk-gqKk`](https://www.youtube.com/watch?v=Vo9KPk-gqKk) reupload part of the original video and has only auto-generated captions which contains `your software Linux is in millions of computers` while [`o8NPllzkFhE`](https://www.youtube.com/watch?v=o8NPllzkFhE) that is the original video has auto-generated captions which contains `your software uh linux is in millions of computers` and has not auto-generated captions which contains `Your software, Linux, is in millions of computers`. +Note that I host a YouTube operational API instance with both parts at https://yt.lemnoslife.com
+Some of its metrics are available at https://yt.lemnoslife.com/metrics/. Note that on January 24 2023 my official instance successfully served, without any downtime and without me using it: +- 674,861 no-key requests +- 58,332 web-scraping requests -The weird thing is that when [passing `linux is in millions of computers` to YouTube Data API v3 Search: list endpoint](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22linux%20is%20in%20millions%20of%20computers%22), it returns only these videos: -- [`Vo9KPk-gqKk`](https://www.youtube.com/watch?v=Vo9KPk-gqKk): reupload part of the original video (`your software Linux is in millions of computers`) -- [`krakddj30eU`](https://www.youtube.com/watch?v=krakddj30eU): reupload part of the original video at [0:05](https://www.youtube.com/watch?v=krakddj30eU?t=5s) (`your software uh linux is in millions of computers`) -- [`NvPaFoIbbzg`](https://www.youtube.com/watch?v=NvPaFoIbbzg): reupload part of the original video (`your software uh Linux is in millions of computers`) +Concerning the web-scraping of the YouTube UI, [YouTube may detect and block requests](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/11) temporarily (for a few hours) if the instance is used for many requests. However, if you have your own YouTube operational API instance and [you proceed in a single-threaded way by waiting for the answer to your request before making another one](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/11#issuecomment-1317163330), then as far as I know there isn't any problem. Note that even if the crawler is multi-threaded, this single-threaded limitation won't bother us as we mainly rely on YouTube Data API v3. -So it returns similar videos but not the original one we focused on while it should be clearly returned. 
+## Features that the crawler relies on -Note that all of these videos are partial uploads of the original video and they have auto-generated captions and all exactly contain case-insensitively `Linux is in millions of computers`. +Note that you can run my algorithm with [your own YouTube operational API instance](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L60) and [your own set of YouTube Data API v3 keys](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L57), or you can rely on my official instance for both. -### - [`f6nxcfbDfZo`](https://www.youtube.com/watch?v=f6nxcfbDfZo) `at tedx about to give a killer talk` +The crawler can be [multi-threaded with the `--threads=N` argument](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L72). -[Passing this sentence to YouTube Data API v3 Search: list endpoint](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22at%20tedx%20about%20to%20give%20a%20killer%20talk%22) returns these videos: +Note that [the algorithm is able to pause and resume at any point](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L100), at the cost of losing progress on the channels that it was working on. -- [`f6nxcfbDfZo`](https://www.youtube.com/watch?v=f6nxcfbDfZo): is the original video +The crawler starts with [the channel ids that you provide in `channels.txt`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L56). -Note that the video only have auto-generated captions. 
+Each thread works on a different channel id and proceeds as follows: -### - [`gJjLdnycuyU`](https://www.youtube.com/watch?v=gJjLdnycuyU) `My kids have seen a lot of cartoons` +### High overview -Following [my project proposal](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal), I've been noticed: +- Lists all videos (including livestreams and shorts) of the given channel, lists their comments and proceeds recursively with these comment authors +- [Lists all other tabs content](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/11): + - [`CHANNELS`](https://www.youtube.com/@who/channels) lists *related channels* + - [`COMMUNITY`](https://www.youtube.com/@who/community) lists comments on community posts + - [`PLAYLISTS`](https://www.youtube.com/@who/playlists) lists videos in playlists to find unlisted videos but also videos from other channels + - [`LIVE`](https://www.youtube.com/@who/streams) lists livestreams to find new channels in the livestream chat messages -> It's not clear to me from the "proof" part whether the video "o8NPllzkFhE" is not returned because of an indexing problem or because it is considered to be a duplicate of the video "Vo9KPk-gqKk". -> Did you manage to identify a case where a video is not returned even though it is the only match to a query? (Indeed, if the goal of your project is just to work around the fact that some duplicate videos are removed from search results, then it limits a bit the appeal.) +### Technically -**Let's try to answer this question with the best approach and show how YouTube search doesn't make sense sometimes.** +This description follows the same structure as the `High overview` one.
-Let's look at videos which have both automatically generated captions and not automatically generated captions and let's focus on English, so we will consider [`@TED`](https://www.youtube.com/@TED) videos, as they are quite an interesting dataset for this purpose. +- Uses [YouTube Data API v3 CommentThreads: list](https://developers.google.com/youtube/v3/docs/commentThreads/list) endpoint with [`allThreadsRelatedToChannelId` filter](https://developers.google.com/youtube/v3/docs/commentThreads/list#allThreadsRelatedToChannelId) [to retrieve all comments left by people on the given channel videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L205) + - Some channels have disabled to some extent comments on their channels so [we can't proceed with this endpoint as it returns an error](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L208), so in the case that there are comments on some videos ([which happens](https://www.youtube.com/watch?v=3F8dFt8LsXY)) we use [YouTube Data API v3 PlaylistItems: list](https://developers.google.com/youtube/v3/docs/playlistItems/list) endpoint to retrieve the channel videos and we use YouTube Data API v3 CommentThreads: list endpoint with [`videoId`](https://developers.google.com/youtube/v3/docs/commentThreads/list#videoId) filter to retrieve all comments left by people on the given channel video. In fact the algorithm prints [the number of videos to treat](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L262) thanks to [YouTube Data API v3 Channels: list](https://developers.google.com/youtube/v3/docs/channels/list) endpoint with [`part=statistics`](https://developers.google.com/youtube/v3/docs/channels/list#statistics). 
Note that as I don't know how to [retrieve all video ids for channels having more than 20,000 videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#concerning-20-000-videos-limit-for-youtube-data-api-v3-playlistitems-list-endpoint), [the algorithm stops if it finds such a channel](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L305). +- Uses other channel tabs: + - [`CHANNELS`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L311) by using [YouTube operational API Channels: list endpoint with `part=channels`](https://stackoverflow.com/a/74213174) + - [`COMMUNITY`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L342) by using [YouTube operational API Channels: list endpoint with `part=community`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L346) then using [YouTube operational API Community: list](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/community.php) endpoint to retrieve comments on community posts and proceed with [YouTube operational API CommentThreads: list](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/commentThreads.php) endpoint to proceed to comments pagination + - [`PLAYLISTS`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L402), we proceed with [YouTube operational API Channels: list endpoint with `part=playlists`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L406) 
then we proceed with [YouTube Data API v3 PlaylistItems: list](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L418) endpoint and we focus on [videos that are unlisted](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L429), as we can't find them any other way. That way we treat the video (so notably its comments) and, as with other playlists, we look for new channels in [the authors of the playlist videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L438) + - [`LIVE`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L468), here we focus on livestream chat messages and not comments. We first retrieve channel videos thanks to [YouTube Data API v3 PlaylistItems: list](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L474) endpoint, then we proceed with [YouTube Data API v3 Videos: list endpoint](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L484) [to check which videos are livestreams](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L488), then there are two possibilities: + - [the livestreams are ongoing](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L493) and [we proceed](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L496) with [YouTube Livestreaming API LiveChatMessages: 
list](https://developers.google.com/youtube/v3/live/docs/liveChatMessages/list) endpoint + - the livestreams are ended and [we proceed](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L512) with [YouTube operational API LiveChats: list](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/liveChats.php) endpoint + -Thanks to [YouTube operational API Videos: list endpoint](https://stackoverflow.com/a/74324720) we learn that its channel id is [`UCAuUUnT6oDeKwE6v1NGQxug`](https://www.youtube.com/channel/UCAuUUnT6oDeKwE6v1NGQxug). +### Side notes -Then let's list their videos thanks to [YouTube Data API v3 PlaylistItems: list](https://developers.google.com/youtube/v3/docs/playlistItems/list) endpoint from the oldest one to the newest one, that way we will work with old videos that have had enough time to be processed. As of January 24 2023, they have `4185` videos retrievable that way. +Note that once a channel is treated, [the data retrieved from it are compressed](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L186). -Then we will focus on videos that have less than 2 caption tracks, including one that is both in English and not automatically generated. The first one matching this criteria is [`gJjLdnycuyU`](https://www.youtube.com/watch?v=gJjLdnycuyU) which is the 2970th oldest video. The hope by looking for oldest videos matching this criteria is that they are simple in terms of captions (not having a lot of caption tracks) and, as they are old videos and their captions aren't translated in many languages, we can hope that they doesn't have many views which will make duplicates less likely. 
+Currently, unlisted videos that could be found thanks to the [`PLAYLISTS` channel tab](https://www.youtube.com/@who/playlists) aren't treated, as I haven't found a way to treat them logically: we can find these unlisted videos before, during and after having treated all channel tabs. Consider the problematic situation of finding the video after having treated all channel tabs (channel compression isn't a problem): we then have to treat the video in a manner that allows pause and resume, even if the comments on the unlisted video lead to the whole YouTube graph, which may not have been treated. +Otherwise, as far as I know, my algorithm is reliably exhaustive in terms of features used to find all YouTube channels. Indeed we don't try to parse YouTube ids in plaintext spaces such as comments, as it's endless work: we would have to support identification of `VIDEO_ID`, `youtu.be/watch?v=VIDEO_ID`... without even being sure that `VIDEO_ID` is a correct video id. In addition, [id formats aren't documented](https://stackoverflow.com/q/47670064). In contrast, our approach guarantees that the ids that we find are correct. Furthermore we could use [unreliable features such as YouTube Data API v3 Search: list endpoint](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#gjjldnycuyu-https-www-youtube-com-watch-v-gjjldnycuyu-my-kids-have-seen-a-lot-of-cartoons). -Let's focus on the sentence at [0:08](https://www.youtube.com/watch?v=gJjLdnycuyU&t=8s) of this video, that is *my kids have seen a lot of cartoons*. 
More precisely according to: +Note that each time we use YouTube Data API v3 CommentThreads: list endpoint, we use [YouTube Data API v3 Comments: list](https://developers.google.com/youtube/v3/docs/comments/list) endpoint [in the case that there are more than 5 answers on a comment](https://stackoverflow.com/a/71284334) (code: [YouTube Data API v3 Comments: list call](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L224) and [test on number of replies](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L219)), as CommentThreads: list only returns first 5 answers to a given comment. -- not automatically generated captions: `My kids have seen a lot of cartoons` -- automatically generated captions: `my kids have seen a lot of cartoons` - -Let's put the chances on our side by assuming that the exact search feature using `"Your query"` from YouTube is case sensitive, so let's consider only the common `"kids have seen a lot of cartoons"` of both caption tracks. If we provide it to [YouTube Data API v3 Search: list](https://developers.google.com/youtube/v3/docs/search/list) endpoint, [we get at least `50` results](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22kids%20have%20seen%20a%20lot%20of%20cartoons%22&maxResults=50) where `gJjLdnycuyU` doesn't appear. Let's say that all these videos contain `kids have seen a lot of cartoons` and our study video is going to appear on a following page. As we have other things to do than watching a random video of tens of minutes, let's extract thanks to [YouTube operational API Videos: list endpoint with `part=contentDetails`](https://stackoverflow.com/a/70908689) the shortest video, in order to verify that YouTube exact search feature works as expected. 
The shortest video is [`dC7tUcRCS58`](https://www.youtube.com/watch?v=dC7tUcRCS58) and lasts 175 seconds. The audio, the video and the automatically generated captions don't contain neither near nor far `kids have seen a lot of cartoons`. - -So YouTube is just giving us random videos about the words we typed but not exactly the exact search we asked him to proceed. - -While concerning [the project proposal video concerning](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#o8npllzkfhe-https-www-youtube-com-watch-v-o8npllzkfhe-linux-is-in-millions-of-computers) [`The mind behind Linux | Linus Torvalds`](https://www.youtube.com/watch?v=o8NPllzkFhE) proceeding to exact search with [`"your software Linux is in millions of computers"`](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22your%20software%20Linux%20is%20in%20millions%20of%20computers%22&maxResults=50) we get only one result that is [`Vo9KPk-gqKk`](https://www.youtube.com/watch?v=Vo9KPk-gqKk) which as discussed contains the exact sentence `your software Linux is in millions of computers`. - -So trying to *identify a case where a video is not returned even though it is the only match to a query* shows inconsistent behavior from YouTube exact search, as it gives exactly what we asked concerning our test with `The mind behind Linux | Linus Torvalds` and it doesn't give exactly what we asked concerning `The creative power of misfits | WorkLife with Adam Grant (Audio only)`. - -Note that [YouTube UI](https://www.youtube.com/results?search_query=%22kids+have+seen+a+lot+of+cartoons%22) has the same too many results bug concerning `The creative power of misfits | WorkLife with Adam Grant (Audio only)`. - -From [my experience with YouTube](https://stackoverflow.com/users/7123660/benjamin-loison) which starts to be significant, we can't rely on YouTube search feature, as they give weird results as shown. 
However YouTube gives quite correctly the information concerning a given video id, so [the best approach that I am aware of](https://stackoverflow.com/a/69259093) to returns exactly correct and as far as possible exhaustive results consists in discovering the maximum number of videos through some crawling approach as I sketch in the last paragraph of the project proposal. - -
-The code associated to this approach is here: - -```py -import requests, json, subprocess - -channelId = 'UCAuUUnT6oDeKwE6v1NGQxug' -uploadsPlaylistId = 'UU' + channelId[2:] - -def getJson(url): - url = f'https://yt.lemnoslife.com/{url}' - content = requests.get(url).text - data = json.loads(content) - return data - -videoIds = [] - -pageToken = '' -while True: - data = getJson(f'noKey/playlistItems?part=snippet&playlistId={uploadsPlaylistId}&maxResults=50&pageToken={pageToken}') - items = data['items'] - print(len(videoIds)) - for item in items: - #print(item) - videoId = item['snippet']['resourceId']['videoId'] - #print(videoId) - videoIds += [videoId] - if 'nextPageToken' in data: - pageToken = data['nextPageToken'] - else: - break - -print(len(videoIds)) -# 4185 - -videoIds = videoIds[::-1] - -def execute(command): - subprocess.check_output(command, shell = True) - -videoIds = videoIds[2968:] - -## - -# 2968 SMnKboI4fvY - -for videoIndex, videoId in enumerate(videoIds): - print(videoIndex, videoId) - data = getJson(f'noKey/captions?part=snippet&videoId={videoId}') - items = data['items'] - if len(items) <= 2: - for item in items: - snippet = item['snippet'] - trackKind = snippet['trackKind'] - language = snippet['language'] - if language == 'en' and trackKind == 'standard': - print('Found') - #execute('notify-send "Found"') - break - -## - -# Find shortest video: - -url = 'noKey/search?part=snippet&q="your software Linux is in millions of computers"&maxResults=50' -data = getJson(url) -items = data['items'] -setVideoIds = [] -shortestVideo = 10 ** 9 -shortestVideoId = None -for item in items: - videoId = item['id']['videoId'] - print(videoId) - setVideoIds += [videoId] - url = f'videos?part=contentDetails&id={videoId}' - data = getJson(url) - duration = data['items'][0]['contentDetails']['duration'] - if shortestVideo > duration and duration > 0: - shortestVideo = duration - shortestVideoId = videoId - -print(shortestVideoId, shortestVideo) -``` - -
- -Following my answer my supervisor answered: - -> Thanks for the answer! Long story short, this does seems to answer my question: indeed, there are cases where a search for a string `S` does not prominently return any video containing `S` in the subtitles, but such videos do exist and are not returned. - -## Concerning 20,000 videos limit for YouTube Data API v3 PlaylistItems: list endpoint - -Could try both (`-i` was required for ignoring errors such as age-restricted videos): -```sh -youtube-dl --dump-json "https://www.youtube.com/channel/UCf8w5m0YsRa8MHQ5bwSGmbw/videos" -i | jq -r '[.id]|@csv' | wc -l -``` -```sh -yt-dlp --dump-json "https://www.youtube.com/channel/UCf8w5m0YsRa8MHQ5bwSGmbw/videos" -i | jq -r '[.id]|@csv' | wc -l -``` - -As mentioned in [this commit](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/commit/6f04109fe21434b4bf47176985d19676b987d06a), could give a try with date filters or [the YouTube operational API issue](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4). \ No newline at end of file +Note that currently [all requests to both APIs are stored](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/2), that way all this data can be used later on for other purpose. \ No newline at end of file diff --git a/The-crawler.md b/The-crawler.md deleted file mode 100644 index 68e89a5..0000000 --- a/The-crawler.md +++ /dev/null @@ -1,77 +0,0 @@ -Note that the following doesn't even treat about the captions problem described in [the project proposal](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal). Here I just explains how the crawler works in terms of channels discovery. 
- -Here I talk much about retrieving the maximum number of channels but in fact we are interested in videos for captions and that isn't a problem as we discover all videos from their channels thanks to [YouTube Data API v3 PlaylistItems: list](https://developers.google.com/youtube/v3/docs/playlistItems/list) endpoint. - -## Context - -The crawler relies on two data sources: -- the official [YouTube Data API v3](https://developers.google.com/youtube/v3) -- my unofficial [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API) - -To put it in a nutshell from [my crawling experience with YouTube data](https://stackoverflow.com/users/7123660/benjamin-loison) in YouTube Data API v3: - -- some data are missing ([retrieving `CHANNELS` channel tab content](https://stackoverflow.com/a/74213174) for instance) -- sometimes the data that we retrieve aren't correct ([the video duration](https://stackoverflow.com/a/70908689) for instance) -- the data access is limited ([to an absolute number of results (500 for searches)](https://stackoverflow.com/a/73357447) and it's by default limited to a given number of requests per day (at most 10,000) ([quota available](https://developers.google.com/youtube/v3/getting-started#quota) and [quota cost](https://developers.google.com/youtube/v3/determine_quota_cost) documentations)) - -so I coded a complementary API that is the YouTube operational API that relies on web-scraping YouTube UI. - -The YouTube operational API has two parts: -- the actual YouTube operational API based on web-scraping YouTube UI -- [the no-key service](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/noKey/index.php) which consists in a proxy to fetch YouTube Data API v3 without any developer key by using a batch of more than 200 keys that I gathered thanks to people having unintentionally leaked their developer keys on Stack Overflow, GitHub... 
- -Note that the web-scraping of the YouTube UI is not done with tools similar to Selenium which runs JavaScript but is a low-level web-scraping by just proceeding to HTTP requests and parsing the JavaScript variable containing data that the webpage is generated with. - -Note that I host a YouTube operational API instance with both parts at https://yt.lemnoslife.com
-Some of its metrics are available at https://yt.lemnoslife.com/metrics/, note that my official instance has proceeded successfully (without any downtime) on January 24 2023 (without me using it): -- 674,861 no-key requests -- 58,332 web-scraping requests - -Concerning the web-scraping of YouTube UI, [YouTube may detect and block requests](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/11) for a temporary amount of times (a few hours) if it is used for many requests. However if you have your own YouTube operational instance and [you proceed in a mono-thread way by waiting the answer to your request before making another one](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/11#issuecomment-1317163330), then as far as I know there isn't any problem with this scenario. Note that even if we proceed in a multi-threading way, this mono-thread limitation won't bother us as we mainly rely on YouTube Data API v3. - -## Features that the crawler relies on - -Note that you can run my algorithm with [your own YouTube operational API instance](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L60) and [your own set of YouTube Data API v3 keys](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L57) or you can rely for both on my official instance. - -The crawler can be [multi-threaded with the `--threads=N` argument](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L72). - -Note that [the algorithm is able to pause and resume at any point](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L100) by loosing progress on channels that he was working on. 
- -The crawler starts with [some channel ids that you provided in `channels.txt`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L56). - -Each thread works on a different channel id and proceed as follows: - -### High overview - -- Lists all videos (including livestreams and shorts) of this given channel and list their comments and proceed recursively with these comment authors -- [Lists all other tabs content](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/11): - - [`CHANNELS`](https://www.youtube.com/@who/channels) lists *related channels* - - [`COMMUNITY`](https://www.youtube.com/@who/community) lists comments on community posts - - [`PLAYLISTS`](https://www.youtube.com/@who/playlists) lists videos in playlists to find unlisted videos but also videos from other channels - - [`LIVE`](https://www.youtube.com/@who/streams) lists livestreams to find new channels in the livestreams chat messages - -### Technically - -The architecture of the description is the same as the `High overview` one. 
- -- Uses [YouTube Data API v3 CommentThreads: list](https://developers.google.com/youtube/v3/docs/commentThreads/list) endpoint with [`allThreadsRelatedToChannelId` filter](https://developers.google.com/youtube/v3/docs/commentThreads/list#allThreadsRelatedToChannelId) [to retrieve all comments left by people on the given channel videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L205) - - Some channels have disabled to some extent comments on their channels so [we can't proceed with this endpoint as it returns an error](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L208), so in the case that there are comments on some videos ([which happens](https://www.youtube.com/watch?v=3F8dFt8LsXY)) we use [YouTube Data API v3 PlaylistItems: list](https://developers.google.com/youtube/v3/docs/playlistItems/list) endpoint to retrieve the channel videos and we use YouTube Data API v3 CommentThreads: list endpoint with [`videoId`](https://developers.google.com/youtube/v3/docs/commentThreads/list#videoId) filter to retrieve all comments left by people on the given channel video. In fact the algorithm prints [the number of videos to treat](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L262) thanks to [YouTube Data API v3 Channels: list](https://developers.google.com/youtube/v3/docs/channels/list) endpoint with [`part=statistics`](https://developers.google.com/youtube/v3/docs/channels/list#statistics). 
Note that as I don't know how to [retrieve all video ids for channels having more than 20,000 videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#concerning-20-000-videos-limit-for-youtube-data-api-v3-playlistitems-list-endpoint), [the algorithm stops if it finds such a channel](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L305). -- Uses other channel tabs: - - [`CHANNELS`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L311) by using [YouTube operational API Channels: list endpoint with `part=channels`](https://stackoverflow.com/a/74213174) - - [`COMMUNITY`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L342) by using [YouTube operational API Channels: list endpoint with `part=community`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L346) then using [YouTube operational API Community: list](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/community.php) endpoint to retrieve comments on community posts and proceed with [YouTube operational API CommentThreads: list](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/commentThreads.php) endpoint to proceed to comments pagination - - [`PLAYLISTS`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L402), we proceed with [YouTube operational API Channels: list endpoint with `part=playlists`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L406) 
then we proceed with [YouTube Data API v3 PlaylistItems: list](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L418) endpoint and we focus on [videos that are unlisted](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L429), as we can't find them in anyway, that way we treat the video (so notably its comments) and as other playlists, we look for new channels in [the authors of the playlist videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L438) - - [`LIVE`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L468), here we focus on livestream chat messages on not comments. We first retrieve channel videos thanks to [YouTube Data API v3 PlaylistItems: list](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L474) endpoint, then we proceed with [YouTube Data API v3 Videos: list endpoint](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L484) [to check which videos are livestreams](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L488), then there are two possibilities: - - [the livestreams are ongoing](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L493) and [we proceed](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L496) with [YouTube Livestreaming API LiveChatMessages: 
list](https://developers.google.com/youtube/v3/live/docs/liveChatMessages/list) endpoint - - the livestreams are ended and [we proceed](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L512) with [YouTube operational API LiveChats: list](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/liveChats.php) endpoint - - -### Side notes - -Note that once a channel is treated, [the data retrieved from it are compressed](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L186). - -Currently unlisted videos that could be found thanks to [`PLAYLISTS` channel tab](https://www.youtube.com/@who/playlists) aren't treated as I haven't found a way to treat them logically: we can find these unlisted videos before, during and after having treated all channel tabs. Let's just consider a problematic situation that is finding the video after having treated all channel tabs (channel compression isn't a problem), then we have to treat the video in a manner that allows pause and resume even if the comments on the unlisted video brings to the whole YouTube graph that may not be treated. -Otherwise as far as I know my algorithm is reliably exhaustive in terms of features to find all YouTube channels. Indeed we don't focus on trying to parse YouTube ids in the plaintext spaces such as comments, as it's an endless work, as we have to support identification of `VIDEO_ID`, `youtu.be/watch?v=VIDEO_ID`... without even being sure that `VIDEO_ID` is a correct video id. In addition that [id formats aren't documented](https://stackoverflow.com/q/47670064). While our approach guarantees that ids that we find are correct. 
Furthermore we could use [unreliable features such as YouTube Data API v3 Search: list endpoint](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#gjjldnycuyu-https-www-youtube-com-watch-v-gjjldnycuyu-my-kids-have-seen-a-lot-of-cartoons). - -Note that each time we use YouTube Data API v3 CommentThreads: list endpoint, we use [YouTube Data API v3 Comments: list](https://developers.google.com/youtube/v3/docs/comments/list) endpoint [in the case that there are more than 5 answers on a comment](https://stackoverflow.com/a/71284334) (code: [YouTube Data API v3 Comments: list call](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L224) and [test on number of replies](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L219)), as CommentThreads: list only returns first 5 answers to a given comment. - -Note that currently [all requests to both APIs are stored](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/2), that way all this data can be used later on for other purpose. \ No newline at end of file diff --git a/Trying-to-understand-YouTube-results.md b/Trying-to-understand-YouTube-results.md new file mode 100644 index 0000000..ec7882b --- /dev/null +++ b/Trying-to-understand-YouTube-results.md @@ -0,0 +1,177 @@ +As described on [the project proposal page](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal), there is a discovery process consisting in going through comments, so we will try to also keep comments. That way we could end up, potentially after the project, doing interesting stuff such as listing all comments written by a given user, as [my French only without discovery process project](https://github.com/Benjamin-Loison/YouTube-comments-graph) was doing. 
+ +## Dive into YouTube search results + +As a first feeling it seems that YouTube returns videos that only match auto-generated captions. + +### - [`7TXEZ4tP06c`](https://www.youtube.com/watch?v=7TXEZ4tP06c) `how many people here would say they can draw` + +Let's consider [`7TXEZ4tP06c`](https://www.youtube.com/watch?v=7TXEZ4tP06c) at [0:18](https://www.youtube.com/watch?v=7TXEZ4tP06c&t=18s) the auto-generated captions and not auto-generated captions are `how many people here would say they can draw`. + +[Passing this sentence to YouTube Data API v3 Search: list endpoint](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22how%20many%20people%20here%20would%20say%20they%20can%20draw%22&maxResults=50) returns these videos: + +- [`7TXEZ4tP06c`](https://www.youtube.com/watch?v=7TXEZ4tP06c): is the original video ([`7TXEZ4tP06c`](https://www.youtube.com/watch?v=7TXEZ4tP06c)) ([0:18](https://www.youtube.com/watch?v=7TXEZ4tP06c&t=18s)) +- [`qH-yY7UZW_k`](https://www.youtube.com/watch?v=qH-yY7UZW_k): reupload part of the original video at [1:13](https://www.youtube.com/watch?v=qH-yY7UZW_k&t=73s) ([1:16](https://www.youtube.com/watch?v=qH-yY7UZW_k&t=76s)) +- [`cOwYXnpW-8A`](https://www.youtube.com/watch?v=cOwYXnpW-8A): reupload part of the original video ([0:05](https://www.youtube.com/watch?v=cOwYXnpW-8A&t=5s)) +- [`vzH9Fo9GI9Y`](https://www.youtube.com/watch?v=vzH9Fo9GI9Y): reupload part of the original video at [3:31](https://youtu.be/vzH9Fo9GI9Y?t=211s) ([3:39](https://www.youtube.com/watch?v=vzH9Fo9GI9Y&t=219s)) +- [`gpMp6tz3d7w`](https://www.youtube.com/watch?v=gpMp6tz3d7w): reupload part of the original video at [0:37](https://youtu.be/gpMp6tz3d7w?t=37s) ([0:41](https://www.youtube.com/watch?v=gpMp6tz3d7w&t=41s)) +- [`ZI7XTsGTl34`](https://www.youtube.com/watch?v=gpMp6tz3d7w): reupload part of the original video at [23:36](https://youtu.be/ZI7XTsGTl34?t=1416s) ([23:43](https://www.youtube.com/watch?v=ZI7XTsGTl34&t=1423s)) + +Note that all of these videos are 
partial uploads of the original video, they have auto-generated captions and they all contain exactly `how many people here would say they can draw`. + +### - [`o8NPllzkFhE`](https://www.youtube.com/watch?v=o8NPllzkFhE) `linux is in millions of computers` + +Completing [the project proposal example](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal), [`Vo9KPk-gqKk`](https://www.youtube.com/watch?v=Vo9KPk-gqKk) reuploads part of the original video and has only auto-generated captions, which contain `your software Linux is in millions of computers`, while [`o8NPllzkFhE`](https://www.youtube.com/watch?v=o8NPllzkFhE), which is the original video, has auto-generated captions which contain `your software uh linux is in millions of computers` and non-auto-generated captions which contain `Your software, Linux, is in millions of computers`. + +The weird thing is that when [passing `linux is in millions of computers` to YouTube Data API v3 Search: list endpoint](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22linux%20is%20in%20millions%20of%20computers%22), it returns only these videos: +- [`Vo9KPk-gqKk`](https://www.youtube.com/watch?v=Vo9KPk-gqKk): reupload part of the original video (`your software Linux is in millions of computers`) +- [`krakddj30eU`](https://www.youtube.com/watch?v=krakddj30eU): reupload part of the original video at [0:05](https://www.youtube.com/watch?v=krakddj30eU&t=5s) (`your software uh linux is in millions of computers`) +- [`NvPaFoIbbzg`](https://www.youtube.com/watch?v=NvPaFoIbbzg): reupload part of the original video (`your software uh Linux is in millions of computers`) + +So it returns similar videos but not the original one we focused on, while it should clearly be returned. + +Note that all of these videos are partial uploads of the original video, they have auto-generated captions and they all contain exactly, case-insensitively, `Linux is in millions of computers`. 
+ +### - [`f6nxcfbDfZo`](https://www.youtube.com/watch?v=f6nxcfbDfZo) `at tedx about to give a killer talk` + +[Passing this sentence to YouTube Data API v3 Search: list endpoint](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22at%20tedx%20about%20to%20give%20a%20killer%20talk%22) returns only this video: + +- [`f6nxcfbDfZo`](https://www.youtube.com/watch?v=f6nxcfbDfZo): is the original video + +Note that the video only has auto-generated captions. + +### - [`gJjLdnycuyU`](https://www.youtube.com/watch?v=gJjLdnycuyU) `My kids have seen a lot of cartoons` + +Following [my project proposal](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal), I received the following remark: + +> It's not clear to me from the "proof" part whether the video "o8NPllzkFhE" is not returned because of an indexing problem or because it is considered to be a duplicate of the video "Vo9KPk-gqKk". +> Did you manage to identify a case where a video is not returned even though it is the only match to a query? (Indeed, if the goal of your project is just to work around the fact that some duplicate videos are removed from search results, then it limits a bit the appeal.) + +**Let's try to answer this question with the best approach and show how YouTube search doesn't make sense sometimes.** + +Let's look at videos which have both automatically generated captions and non-automatically generated captions and let's focus on English, so we will consider [`@TED`](https://www.youtube.com/@TED) videos, as they are quite an interesting dataset for this purpose. + +Thanks to [YouTube operational API Videos: list endpoint](https://stackoverflow.com/a/74324720) we learn that its channel id is [`UCAuUUnT6oDeKwE6v1NGQxug`](https://www.youtube.com/channel/UCAuUUnT6oDeKwE6v1NGQxug). 
+ +Then let's list their videos thanks to [YouTube Data API v3 PlaylistItems: list](https://developers.google.com/youtube/v3/docs/playlistItems/list) endpoint from the oldest one to the newest one, that way we will work with old videos that have had enough time to be processed. As of January 24 2023, they have `4185` videos retrievable that way. + +Then we will focus on videos that have at most 2 caption tracks, including one that is both in English and not automatically generated. The first one matching these criteria is [`gJjLdnycuyU`](https://www.youtube.com/watch?v=gJjLdnycuyU) which is the 2970th oldest video. The hope by looking for the oldest videos matching these criteria is that they are simple in terms of captions (not having a lot of caption tracks) and, as they are old videos and their captions aren't translated into many languages, we can hope that they don't have many views, which will make duplicates less likely. + +Let's focus on the sentence at [0:08](https://www.youtube.com/watch?v=gJjLdnycuyU&t=8s) of this video, that is *my kids have seen a lot of cartoons*. More precisely according to: + +- not automatically generated captions: `My kids have seen a lot of cartoons` +- automatically generated captions: `my kids have seen a lot of cartoons` + +Let's put the chances on our side by assuming that the exact search feature using `"Your query"` from YouTube is case sensitive, so let's consider only the part common to both caption tracks, `"kids have seen a lot of cartoons"`. If we provide it to [YouTube Data API v3 Search: list](https://developers.google.com/youtube/v3/docs/search/list) endpoint, [we get at least `50` results](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22kids%20have%20seen%20a%20lot%20of%20cartoons%22&maxResults=50) where `gJjLdnycuyU` doesn't appear. Let's say that all these videos contain `kids have seen a lot of cartoons` and our study video is going to appear on a following page. 
As we have other things to do than watching a random video of tens of minutes, let's extract thanks to [YouTube operational API Videos: list endpoint with `part=contentDetails`](https://stackoverflow.com/a/70908689) the shortest video, in order to verify that YouTube exact search feature works as expected. The shortest video is [`dC7tUcRCS58`](https://www.youtube.com/watch?v=dC7tUcRCS58) and lasts 175 seconds. Neither the audio, the video nor the automatically generated captions contain anything close to `kids have seen a lot of cartoons`. + +So YouTube is just giving us random videos about the words we typed but not the exact search we asked it to perform. + +Whereas concerning [the project proposal video](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#o8npllzkfhe-https-www-youtube-com-watch-v-o8npllzkfhe-linux-is-in-millions-of-computers) [`The mind behind Linux | Linus Torvalds`](https://www.youtube.com/watch?v=o8NPllzkFhE), proceeding to an exact search with [`"your software Linux is in millions of computers"`](https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22your%20software%20Linux%20is%20in%20millions%20of%20computers%22&maxResults=50) we get only one result, [`Vo9KPk-gqKk`](https://www.youtube.com/watch?v=Vo9KPk-gqKk), which as discussed contains the exact sentence `your software Linux is in millions of computers`. + +So trying to *identify a case where a video is not returned even though it is the only match to a query* shows inconsistent behavior from YouTube exact search, as it gives exactly what we asked concerning our test with `The mind behind Linux | Linus Torvalds` and it doesn't give exactly what we asked concerning `The creative power of misfits | WorkLife with Adam Grant (Audio only)`. 
+ +Note that [YouTube UI](https://www.youtube.com/results?search_query=%22kids+have+seen+a+lot+of+cartoons%22) has the same bug of returning too many results concerning `The creative power of misfits | WorkLife with Adam Grant (Audio only)`. + +From [my experience with YouTube](https://stackoverflow.com/users/7123660/benjamin-loison), which is starting to be significant, we can't rely on the YouTube search feature, as it gives weird results as shown. However, YouTube returns quite reliably the information concerning a given video id, so [the best approach that I am aware of](https://stackoverflow.com/a/69259093) to return exactly correct and, as far as possible, exhaustive results consists in discovering the maximum number of videos through some crawling approach, as I sketch in the last paragraph of the project proposal. + +
+The code associated to this approach is here:
+
+```py
+import requests, json, subprocess
+
+channelId = 'UCAuUUnT6oDeKwE6v1NGQxug'
+uploadsPlaylistId = 'UU' + channelId[2:]
+
+def getJson(url):
+    url = f'https://yt.lemnoslife.com/{url}'
+    content = requests.get(url).text
+    data = json.loads(content)
+    return data
+
+videoIds = []
+
+pageToken = ''
+while True:
+    data = getJson(f'noKey/playlistItems?part=snippet&playlistId={uploadsPlaylistId}&maxResults=50&pageToken={pageToken}')
+    items = data['items']
+    print(len(videoIds))
+    for item in items:
+        #print(item)
+        videoId = item['snippet']['resourceId']['videoId']
+        #print(videoId)
+        videoIds += [videoId]
+    if 'nextPageToken' in data:
+        pageToken = data['nextPageToken']
+    else:
+        break
+
+print(len(videoIds))
+# 4185
+
+videoIds = videoIds[::-1]
+
+def execute(command):
+    subprocess.check_output(command, shell = True)
+
+videoIds = videoIds[2968:]
+
+##
+
+# 2968 SMnKboI4fvY
+
+for videoIndex, videoId in enumerate(videoIds):
+    print(videoIndex, videoId)
+    data = getJson(f'noKey/captions?part=snippet&videoId={videoId}')
+    items = data['items']
+    if len(items) <= 2:
+        for item in items:
+            snippet = item['snippet']
+            trackKind = snippet['trackKind']
+            language = snippet['language']
+            if language == 'en' and trackKind == 'standard':
+                print('Found')
+                #execute('notify-send "Found"')
+                break
+
+##
+
+# Find shortest video:
+
+url = 'noKey/search?part=snippet&q="your software Linux is in millions of computers"&maxResults=50'
+data = getJson(url)
+items = data['items']
+setVideoIds = []
+shortestVideo = 10 ** 9
+shortestVideoId = None
+for item in items:
+    videoId = item['id']['videoId']
+    print(videoId)
+    setVideoIds += [videoId]
+    url = f'videos?part=contentDetails&id={videoId}'
+    data = getJson(url)
+    duration = data['items'][0]['contentDetails']['duration']
+    if shortestVideo > duration and duration > 0:
+        shortestVideo = duration
+        shortestVideoId = videoId
+
+print(shortestVideoId, shortestVideo)
+```
+
+
+ +Following my answer my supervisor answered: + +> Thanks for the answer! Long story short, this does seems to answer my question: indeed, there are cases where a search for a string `S` does not prominently return any video containing `S` in the subtitles, but such videos do exist and are not returned. + +## Concerning 20,000 videos limit for YouTube Data API v3 PlaylistItems: list endpoint + +Could try both (`-i` was required for ignoring errors such as age-restricted videos): +```sh +youtube-dl --dump-json "https://www.youtube.com/channel/UCf8w5m0YsRa8MHQ5bwSGmbw/videos" -i | jq -r '[.id]|@csv' | wc -l +``` +```sh +yt-dlp --dump-json "https://www.youtube.com/channel/UCf8w5m0YsRa8MHQ5bwSGmbw/videos" -i | jq -r '[.id]|@csv' | wc -l +``` + +As mentioned in [this commit](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/commit/6f04109fe21434b4bf47176985d19676b987d06a), could give a try with date filters or [the YouTube operational API issue](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4). \ No newline at end of file