Add The crawler page

Benjamin Loison 2023-01-25 00:47:11 +01:00
parent 05eaf8a7c9
commit 49ff87ac54

75
The-crawler.md Normal file

@ -0,0 +1,75 @@
Note that the following doesn't address the captions problem described in [the project proposal](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/Project-proposal). Here I only explain how the crawler works in terms of channel discovery.
I mostly talk about retrieving the maximum number of channels, whereas we are actually interested in videos for their captions. That isn't a problem, as we discover all the videos of a channel thanks to the [YouTube Data API v3 PlaylistItems: list](https://developers.google.com/youtube/v3/docs/playlistItems/list) endpoint.
## Context
The crawler relies on two data sources:
- the official [YouTube Data API v3](https://developers.google.com/youtube/v3)
- my unofficial [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API)
In a nutshell, from [my crawling experience with YouTube data](https://stackoverflow.com/users/7123660/benjamin-loison), YouTube Data API v3 has the following limits:
- some data are missing ([retrieving `CHANNELS` channel tab content](https://stackoverflow.com/a/74213174) for instance)
- the data that we retrieve are sometimes incorrect ([the video duration](https://stackoverflow.com/a/70908689) for instance)
- data access is limited: [to an absolute number of results (500 for searches)](https://stackoverflow.com/a/73357447) and, by default, to a given number of requests per day (at most 10,000) (see the [quota available](https://developers.google.com/youtube/v3/getting-started#quota) and [quota cost](https://developers.google.com/youtube/v3/determine_quota_cost) documentation)
So I coded a complementary API, the YouTube operational API, which relies on web-scraping the YouTube UI.
The YouTube operational API has two parts:
- the actual YouTube operational API based on web-scraping YouTube UI
- [the no-key service](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/noKey/index.php), which is a proxy for querying YouTube Data API v3 without any developer key. It uses a pool of more than 200 keys that I gathered from people who unintentionally leaked their developer keys on Stack Overflow, GitHub...
Note that I host a YouTube operational API instance with both parts at https://yt.lemnoslife.com<br/>
Some of its metrics are available at https://yt.lemnoslife.com/metrics/. For instance, on January 24, 2023, my official instance successfully served, without any downtime and without me using it:
- 674,861 no-key requests
- 58,332 web-scraping requests
Concerning the web-scraping of the YouTube UI, [YouTube may detect and block requests](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/11) temporarily (for a few hours) if it is used for many requests. However, if you have your own YouTube operational API instance and [you proceed in a mono-threaded way, waiting for the answer to each request before making the next one](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/11#issuecomment-1317163330), then as far as I know there isn't any problem. Note that even if we proceed in a multi-threaded way, this mono-thread limitation won't bother us, as we mainly rely on YouTube Data API v3.
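
Should the web-scraping calls ever need to stay strictly one at a time inside a multi-threaded program, a lock around them is one possible approach. The following Python sketch is only an illustration of that idea; `SerializedFetcher` and the `fetch` callable are hypothetical names, and nothing here is taken from the crawler's code.

```python
import threading


class SerializedFetcher:
    """Wrap a fetch callable so that only one request is in flight at a time."""

    def __init__(self, fetch):
        # `fetch` is a hypothetical callable: fetch(url) -> response
        self._fetch = fetch
        self._lock = threading.Lock()

    def get(self, url):
        # The lock is held until the response is returned, so a second
        # thread cannot start its request before the first one finishes.
        with self._lock:
            return self._fetch(url)
```

Any thread can then call `get`, and the requests reach the instance one after another.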
## Features that the crawler relies on
Note that you can run my algorithm with [your own YouTube operational API instance](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L60) and [your own set of YouTube Data API v3 keys](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L57), or you can rely on my official instance for both.
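
To illustrate the two options, here is a minimal Python sketch (not the crawler's C++ code) that builds a YouTube Data API v3 URL either with your own key or through the no-key service. The `build_url` helper is hypothetical, and the `/noKey/` path on the instance is an assumption based on the location of the no-key service in the repository.

```python
from urllib.parse import urlencode

OFFICIAL_BASE = "https://www.googleapis.com/youtube/v3"
# Assumption: the no-key proxy is served under /noKey/ on the instance.
NO_KEY_BASE = "https://yt.lemnoslife.com/noKey"


def build_url(endpoint, params, api_key=None):
    """Return the URL of a YouTube Data API v3 request.

    With `api_key`, the request targets Google directly; without it, the
    request goes through the no-key proxy, which supplies a key itself.
    """
    query = dict(params)
    if api_key is not None:
        query["key"] = api_key
        base = OFFICIAL_BASE
    else:
        base = NO_KEY_BASE
    return f"{base}/{endpoint}?{urlencode(query)}"
```

For example, `build_url("videos", {"part": "snippet", "id": "VIDEO_ID"})` targets the no-key service, while passing `api_key="YOUR_KEY"` targets the official API.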
The crawler can be [multi-threaded with the `--threads=N` argument](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L72).
Note that [the algorithm is able to pause and resume at any point](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L100), at the cost of losing progress on the channels that it was working on.
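
One way to obtain this behavior, sketched here in Python under my own assumptions, is to move a channel id from a pending file to a treated file only once the channel is fully treated. The file names `to_treat.txt` and `treated.txt` and all function names are hypothetical; the crawler's actual bookkeeping may differ.

```python
from pathlib import Path


def load_ids(path):
    """Read one channel id per line; a missing file means no ids."""
    path = Path(path)
    if not path.exists():
        return []
    return [line.strip() for line in path.read_text().splitlines() if line.strip()]


def save_ids(path, ids):
    Path(path).write_text("".join(f"{channel_id}\n" for channel_id in ids))


def finish_channel(state_dir, channel_id):
    """Record that channel_id is fully treated (hypothetical file layout)."""
    state_dir = Path(state_dir)
    pending = [c for c in load_ids(state_dir / "to_treat.txt") if c != channel_id]
    treated = load_ids(state_dir / "treated.txt") + [channel_id]
    save_ids(state_dir / "treated.txt", treated)
    save_ids(state_dir / "to_treat.txt", pending)


def resume(state_dir):
    """Channels interrupted mid-treatment are still pending, so they restart."""
    return load_ids(Path(state_dir) / "to_treat.txt")
```

With this layout, a channel interrupted halfway is simply treated again from scratch after a restart, which matches the loss of progress mentioned above.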
The crawler starts with [some channel ids that you provide in `channels.txt`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L56).
Each thread works on a different channel id and proceeds as follows:
### High-level overview
- Lists all videos (including livestreams and shorts) of the given channel, lists their comments and proceeds recursively with the comment authors
- [Lists the content of all other tabs](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/11):
    - [`CHANNELS`](https://www.youtube.com/@who/channels) lists *related channels*
    - [`COMMUNITY`](https://www.youtube.com/@who/community) lists comments on community posts
    - [`PLAYLISTS`](https://www.youtube.com/@who/playlists) lists videos in playlists to find unlisted videos but also videos from other channels
    - [`LIVE`](https://www.youtube.com/@who/streams) lists livestreams to find new channels in the livestream chat messages
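
The recursive discovery described above amounts to a graph traversal over channels. Here is a minimal multi-threaded Python sketch of it, not the crawler's C++ code: `crawl` and `discover` are hypothetical names, and `discover(channel_id)` stands for everything a thread does with one channel, returning the channel ids it found.

```python
import queue
import threading


def crawl(seed_ids, discover, num_threads=2):
    """Treat every channel reachable from seed_ids exactly once."""
    to_treat = queue.Queue()
    seen = set(seed_ids)
    seen_lock = threading.Lock()
    for channel_id in seen:
        to_treat.put(channel_id)

    def worker():
        while True:
            channel_id = to_treat.get()
            if channel_id is None:  # sentinel: no work left
                to_treat.task_done()
                return
            try:
                for found in discover(channel_id):
                    with seen_lock:
                        if found in seen:
                            continue
                        seen.add(found)
                    # Queued before task_done(), so join() cannot return early.
                    to_treat.put(found)
            finally:
                to_treat.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    to_treat.join()  # every queued channel has been treated
    for _ in threads:
        to_treat.put(None)
    for thread in threads:
        thread.join()
    return seen
```

Each worker takes a different channel id from the queue, so no two threads treat the same channel.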
### Technically
The structure of this description mirrors that of the overview above.
- Uses the [YouTube Data API v3 CommentThreads: list](https://developers.google.com/youtube/v3/docs/commentThreads/list) endpoint with the [`allThreadsRelatedToChannelId` filter](https://developers.google.com/youtube/v3/docs/commentThreads/list#allThreadsRelatedToChannelId) [to retrieve all comments left by people on the given channel videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L205)
    - Some channels have disabled comments on their channel to some extent, so [we can't proceed with this endpoint as it returns an error](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L208). As there may still be comments on some of their videos ([which happens](https://www.youtube.com/watch?v=3F8dFt8LsXY)), we use the [YouTube Data API v3 PlaylistItems: list](https://developers.google.com/youtube/v3/docs/playlistItems/list) endpoint to retrieve the channel videos, then the YouTube Data API v3 CommentThreads: list endpoint with the [`videoId`](https://developers.google.com/youtube/v3/docs/commentThreads/list#videoId) filter to retrieve all comments left by people on each of these videos. The algorithm also prints [the number of videos to treat](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L262) thanks to the [YouTube Data API v3 Channels: list](https://developers.google.com/youtube/v3/docs/channels/list) endpoint with [`part=statistics`](https://developers.google.com/youtube/v3/docs/channels/list#statistics). Note that as I don't know how to [retrieve all video ids for channels having more than 20,000 videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#concerning-20-000-videos-limit-for-youtube-data-api-v3-playlistitems-list-endpoint), [the algorithm stops if it finds such a channel](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L305).
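
The channel-wide request and its per-video fallback can be sketched as follows in Python (the crawler itself is C++). `api(endpoint, query)` is a hypothetical callable returning the decoded JSON response, the function names are mine, and deriving the uploads playlist id by replacing the `UC` prefix with `UU` is an assumption of this sketch.

```python
def paginate(api, endpoint, params):
    """Yield every item of a paginated YouTube Data API v3 listing."""
    page_token = None
    while True:
        query = dict(params)
        if page_token:
            query["pageToken"] = page_token
        data = api(endpoint, query)
        if "error" in data:
            raise RuntimeError(data["error"])
        yield from data.get("items", [])
        page_token = data.get("nextPageToken")
        if not page_token:
            return


def author_id(thread):
    """Channel id of the top-level comment author, or None if absent."""
    snippet = thread["snippet"]["topLevelComment"]["snippet"]
    return snippet.get("authorChannelId", {}).get("value")


def comment_authors(api, channel_id):
    """Return the channel ids of people who commented on channel_id's videos."""
    base = {"part": "snippet", "maxResults": 100}
    try:
        threads = list(paginate(
            api, "commentThreads",
            {**base, "allThreadsRelatedToChannelId": channel_id}))
    except RuntimeError:
        # Fallback: list the channel videos, then query each video.
        uploads = "UU" + channel_id[2:]  # assumption: "UC..." -> "UU..."
        threads = []
        for item in paginate(api, "playlistItems",
                             {"part": "contentDetails",
                              "playlistId": uploads, "maxResults": 50}):
            video_id = item["contentDetails"]["videoId"]
            try:
                threads.extend(paginate(
                    api, "commentThreads", {**base, "videoId": video_id}))
            except RuntimeError:
                continue  # comments are disabled on this particular video
    return {a for a in map(author_id, threads) if a is not None}
```

Passing `api` as a parameter keeps the sketch independent of whether requests go to the official API or to the no-key service.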
- Uses other channel tabs:
    - [`CHANNELS`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L311) by using [YouTube operational API Channels: list endpoint with `part=channels`](https://stackoverflow.com/a/74213174)
    - [`COMMUNITY`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L342) by using [YouTube operational API Channels: list endpoint with `part=community`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L346), then the [YouTube operational API Community: list](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/community.php) endpoint to retrieve comments on community posts, and the [YouTube operational API CommentThreads: list](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/commentThreads.php) endpoint to paginate through the comments
    - [`PLAYLISTS`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L402): we proceed with [YouTube operational API Channels: list endpoint with `part=playlists`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L406), then with the [YouTube Data API v3 PlaylistItems: list](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L418) endpoint. We focus on [videos that are unlisted](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L429), as we can't find them in any other way; that way we treat the video (so notably its comments). As with other playlists, we also look for new channels among [the authors of the playlist videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L438)
    - [`LIVE`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L468): here we focus on livestream chat messages and not on comments. We first retrieve the channel videos thanks to the [YouTube Data API v3 PlaylistItems: list](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L474) endpoint, then we proceed with the [YouTube Data API v3 Videos: list endpoint](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L484) [to check which videos are livestreams](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L488). Then there are two possibilities:
        - [the livestream is ongoing](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L493) and [we proceed](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L496) with the [YouTube Livestreaming API LiveChatMessages: list](https://developers.google.com/youtube/v3/live/docs/liveChatMessages/list) endpoint
        - the livestream has ended and [we proceed](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L512) with the [YouTube operational API LiveChats: list](https://github.com/Benjamin-Loison/YouTube-operational-API/blob/7ff59b2d477c8d2caf6813a114f4201791627cc1/liveChats.php) endpoint
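
The choice between the two livestream cases can be sketched in Python as below. This assumes the Videos: list request includes `part=liveStreamingDetails`, and that a present `activeLiveChatId` means the chat can still be read through the official API; `live_chat_source` is a hypothetical name and the crawler's actual test may differ.

```python
def live_chat_source(video):
    """Decide how to read the chat of one Videos: list item.

    Returns ("not_live", None), ("ongoing", live_chat_id)
    or ("ended", video_id).
    """
    details = video.get("liveStreamingDetails")
    if details is None:
        # Regular upload: there is no livestream chat to read.
        return ("not_live", None)
    chat_id = details.get("activeLiveChatId")
    if chat_id is not None:
        # Chat still open: usable with LiveChatMessages: list.
        return ("ongoing", chat_id)
    # Chat closed: fall back to the YouTube operational API.
    return ("ended", video["id"])
```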
### Side notes
Note that once a channel is treated, [the data retrieved from it are compressed](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L186).
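
As an illustration only, compressing the data of a treated channel could look like the following Python sketch. The one-directory-per-channel layout, the zip format and the `compress_channel` name are assumptions of mine, not details confirmed above.

```python
import shutil
from pathlib import Path


def compress_channel(channel_dir):
    """Replace a channel directory with a sibling .zip archive."""
    channel_dir = Path(channel_dir)
    # make_archive appends ".zip" to the base name, so the archive is
    # written next to the directory, not inside it.
    archive = shutil.make_archive(str(channel_dir), "zip", root_dir=str(channel_dir))
    shutil.rmtree(channel_dir)
    return Path(archive)
```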
Currently, unlisted videos that could be found thanks to the [`PLAYLISTS` channel tab](https://www.youtube.com/@who/playlists) aren't treated, as I haven't found a logical way to treat them: we can find these unlisted videos before, during and after having treated all the tabs of their channel. Let's just consider one problematic situation: finding the video after having treated all the channel tabs (channel compression isn't a problem). We then have to treat the video in a manner that allows pause and resume, even if the comments on the unlisted video lead to the whole YouTube graph, which may not have been treated yet.
Otherwise, as far as I know, my algorithm is reliably exhaustive in terms of the features used to find all YouTube channels. Note however that we don't try to parse YouTube ids in plaintext spaces such as comments, as it's endless work: we would have to support identifying `VIDEO_ID`, `youtu.be/watch?v=VIDEO_ID`... without even being sure that `VIDEO_ID` is a correct video id. In addition, [id formats aren't documented](https://stackoverflow.com/q/47670064), whereas our approach guarantees that the ids that we find are correct. Furthermore, we could use [unreliable features such as YouTube Data API v3 Search: list endpoint](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#gjjldnycuyu-https-www-youtube-com-watch-v-gjjldnycuyu-my-kids-have-seen-a-lot-of-cartoons).
Note that each time we use the YouTube Data API v3 CommentThreads: list endpoint, we also use the [YouTube Data API v3 Comments: list](https://developers.google.com/youtube/v3/docs/comments/list) endpoint [in the case that there are more than 5 replies to a comment](https://stackoverflow.com/a/71284334) (code: [YouTube Data API v3 Comments: list call](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L224) and [test on number of replies](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/4e162e34c3a5debad1ca6bcbf02701b2c4faa431/main.cpp#L219)), as CommentThreads: list only returns the first 5 replies to a given comment.
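
The reply handling just described can be sketched in Python as follows (the crawler's own implementation is the C++ code linked above). `api(endpoint, query)` is a hypothetical callable returning the decoded JSON response, `all_replies` is a name of mine, and the thread is assumed to have been requested with `part=snippet,replies`.

```python
MAX_EMBEDDED_REPLIES = 5  # CommentThreads: list embeds at most this many replies


def all_replies(api, thread):
    """Return every reply to the top-level comment of a CommentThreads item."""
    total = thread["snippet"]["totalReplyCount"]
    embedded = thread.get("replies", {}).get("comments", [])
    if total <= MAX_EMBEDDED_REPLIES:
        return embedded
    # Some replies are missing from the thread: list them all by parent id.
    parent_id = thread["snippet"]["topLevelComment"]["id"]
    replies, page_token = [], None
    while True:
        query = {"part": "snippet", "parentId": parent_id, "maxResults": 100}
        if page_token:
            query["pageToken"] = page_token
        data = api("comments", query)
        replies.extend(data.get("items", []))
        page_token = data.get("nextPageToken")
        if not page_token:
            return replies
```

Threads with 5 replies or fewer cost no extra request, so the additional Comments: list calls are only made where replies would otherwise be lost.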
Note that currently [all requests to both APIs are stored](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/2), so that all this data can be used later on for other purposes.
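
As a rough illustration of storing every request, a wrapper around the request function can write each response to disk before returning it. This Python sketch is hypothetical throughout: the `stored` name, the `api(endpoint, query)` callable and the numbered JSON file layout are my assumptions, not the crawler's actual storage format.

```python
import itertools
import json
from pathlib import Path


def stored(api, directory):
    """Wrap api(endpoint, query) so that each response is also saved to disk."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    numbers = itertools.count(1)  # one numbered file per request

    def wrapper(endpoint, query):
        data = api(endpoint, query)
        record = {"endpoint": endpoint, "query": query, "response": data}
        (directory / f"{next(numbers)}.json").write_text(json.dumps(record))
        return data

    return wrapper
```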