YouTube_captions_search_engine

Benjamin_Loison/YouTube_captions_search_engine

Table of Contents

Context
Features that the crawler relies on

High overview
Technically
Side notes

Note that the following doesn't even treat about the captions problem described in the project proposal. Here I just explains how the crawler works in terms of channels discovery.

Here I talk much about retrieving the maximum number of channels but in fact we are interested in videos for captions and that isn't a problem as we discover all videos from their channels thanks to YouTube Data API v3 PlaylistItems: list endpoint.

Context

The crawler relies on two data sources:

the official YouTube Data API v3
my unofficial YouTube operational API

To put it in a nutshell from my crawling experience with YouTube data in YouTube Data API v3:

some data are missing (retrieving CHANNELS channel tab content for instance)
sometimes the data that we retrieve aren't correct (the video duration for instance)
the data access is limited (to an absolute number of results (500 for searches) and it's by default limited to a given number of requests per day (at most 10,000) (quota available and quota cost documentations))

so I coded a complementary API that is the YouTube operational API that relies on web-scraping YouTube UI.

The YouTube operational API has two parts:

the actual YouTube operational API based on web-scraping YouTube UI
the no-key service which consists in a proxy to fetch YouTube Data API v3 without any developer key by using a batch of more than 200 keys that I gathered thanks to people having unintentionally leaked their developer keys on Stack Overflow, GitHub...

Note that the web-scraping of the YouTube UI is not done with tools similar to Selenium which runs JavaScript but is a low-level web-scraping by just proceeding to HTTP requests and parsing the JavaScript variable containing data that the webpage is generated with.

Note that I host a YouTube operational API instance with both parts at https://yt.lemnoslife.com
Some of its metrics are available at https://yt.lemnoslife.com/metrics/, note that my official instance has proceeded successfully (without any downtime) on January 24 2023 (without me using it):

674,861 no-key requests
58,332 web-scraping requests

Concerning the web-scraping of YouTube UI, YouTube may detect and block requests for a temporary amount of times (a few hours) if it is used for many requests. However if you have your own YouTube operational instance and you proceed in a mono-thread way by waiting the answer to your request before making another one, then as far as I know there isn't any problem with this scenario. Note that even if we proceed in a multi-threading way, this mono-thread limitation won't bother us as we mainly rely on YouTube Data API v3.

Features that the crawler relies on

Note that you can run my algorithm with your own YouTube operational API instance and your own set of YouTube Data API v3 keys or you can rely for both on my official instance.

The crawler can be multi-threaded with the --threads=N argument.

Note that the algorithm is able to pause and resume at any point by loosing progress on channels that he was working on.

The crawler starts with some channel ids that you provided in channels.txt.

Each thread works on a different channel id and proceed as follows:

High overview

Lists all videos (including livestreams and shorts) of this given channel and list their comments and proceed recursively with these comment authors
Lists all other tabs content:
- CHANNELS lists related channels
- COMMUNITY lists comments on community posts
- PLAYLISTS lists videos in playlists to find unlisted videos but also videos from other channels
- LIVE lists livestreams to find new channels in the livestreams chat messages

Technically

The architecture of the description is the same as the High overview one.

Uses YouTube Data API v3 CommentThreads: list endpoint with allThreadsRelatedToChannelId filter to retrieve all comments left by people on the given channel videos
- Some channels have disabled to some extent comments on their channels so we can't proceed with this endpoint as it returns an error, so in the case that there are comments on some videos (which happens) we use YouTube Data API v3 PlaylistItems: list endpoint to retrieve the channel videos and we use YouTube Data API v3 CommentThreads: list endpoint with videoId filter to retrieve all comments left by people on the given channel video. In fact the algorithm prints the number of videos to treat thanks to YouTube Data API v3 Channels: list endpoint with part=statistics. Note that as I don't know how to retrieve all video ids for channels having more than 20,000 videos, the algorithm stops if it finds such a channel.
Uses other channel tabs:
- CHANNELS by using YouTube operational API Channels: list endpoint with part=channels
- COMMUNITY by using YouTube operational API Channels: list endpoint with part=community then using YouTube operational API Community: list endpoint to retrieve comments on community posts and proceed with YouTube operational API CommentThreads: list endpoint to proceed to comments pagination
- PLAYLISTS, we proceed with YouTube operational API Channels: list endpoint with part=playlists then we proceed with YouTube Data API v3 PlaylistItems: list endpoint and we focus on videos that are unlisted, as we can't find them in anyway, that way we treat the video (so notably its comments) and as other playlists, we look for new channels in the authors of the playlist videos
- LIVE, here we focus on livestream chat messages on not comments. We first retrieve channel videos thanks to YouTube Data API v3 PlaylistItems: list endpoint, then we proceed with YouTube Data API v3 Videos: list endpoint to check which videos are livestreams, then there are two possibilities:
  - the livestreams are ongoing and we proceed with YouTube Livestreaming API LiveChatMessages: list endpoint
  - the livestreams are ended and we proceed with YouTube operational API LiveChats: list endpoint

Side notes

Note that once a channel is treated, the data retrieved from it are compressed.

Currently unlisted videos that could be found thanks to PLAYLISTS channel tab aren't treated as I haven't found a way to treat them logically: we can find these unlisted videos before, during and after having treated all channel tabs. Let's just consider a problematic situation that is finding the video after having treated all channel tabs (channel compression isn't a problem), then we have to treat the video in a manner that allows pause and resume even if the comments on the unlisted video brings to the whole YouTube graph that may not be treated. Otherwise as far as I know my algorithm is reliably exhaustive in terms of features to find all YouTube channels. Indeed we don't focus on trying to parse YouTube ids in the plaintext spaces such as comments, as it's an endless work, as we have to support identification of VIDEO_ID, youtu.be/watch?v=VIDEO_ID... without even being sure that VIDEO_ID is a correct video id. In addition that id formats aren't documented. While our approach guarantees that ids that we find are correct. Furthermore we could use unreliable features such as YouTube Data API v3 Search: list endpoint.

Note that each time we use YouTube Data API v3 CommentThreads: list endpoint, we use YouTube Data API v3 Comments: list endpoint in the case that there are more than 5 answers on a comment (code: YouTube Data API v3 Comments: list call and test on number of replies), as CommentThreads: list only returns first 5 answers to a given comment.

Note that currently all requests to both APIs are stored, that way all this data can be used later on for other purpose.