Home
Benjamin Loison edited this page 2024-03-18 12:01:56 +01:00

Note that the following does not address the captions problem described in the project proposal. Here I only explain how the crawler discovers channels.

I mostly discuss retrieving the maximum number of channels, although what we are actually interested in for captions is videos. This isn't a problem, as we discover all the videos of a channel thanks to the YouTube Data API v3 PlaylistItems: list endpoint.
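A minimal sketch of how the videos of a channel can be listed through PlaylistItems: list. It assumes the common but undocumented convention that the uploads playlist id is the channel id with its `UC` prefix replaced by `UU`. The helper names and the `fetch_json` callable (which performs the HTTP request and returns the decoded JSON) are hypothetical and only serve the illustration.

```python
from typing import Callable, Iterator

API_URL = "https://www.googleapis.com/youtube/v3/playlistItems"


def uploads_playlist_id(channel_id: str) -> str:
    """Assumption: the uploads playlist id is the channel id with UC -> UU."""
    if not channel_id.startswith("UC"):
        raise ValueError(f"unexpected channel id: {channel_id!r}")
    return "UU" + channel_id[2:]


def list_channel_video_ids(
    channel_id: str,
    fetch_json: Callable[[str, dict], dict],
) -> Iterator[str]:
    """Yield every video id of the uploads playlist, following pagination."""
    params = {
        "part": "contentDetails",
        "playlistId": uploads_playlist_id(channel_id),
        "maxResults": 50,  # maximum page size accepted by the endpoint
    }
    while True:
        page = fetch_json(API_URL, params)
        for item in page.get("items", []):
            yield item["contentDetails"]["videoId"]
        token = page.get("nextPageToken")
        if token is None:
            break
        # Ask for the next page with the token returned by the previous one.
        params = {**params, "pageToken": token}
```

Passing `fetch_json` as a parameter keeps the sketch independent of whether requests go to Google directly with a key or through a no-key proxy.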

Context

The crawler relies on two data sources:

To put it in a nutshell from my crawling experience with YouTube data in YouTube Data API v3:

so I coded a complementary API, the YouTube operational API, which relies on web-scraping the YouTube UI.

The YouTube operational API has two parts:

  • the actual YouTube operational API based on web-scraping YouTube UI
  • the no-key service, which is a proxy to fetch YouTube Data API v3 without any developer key. It uses a batch of more than 200 keys that I gathered thanks to people having unintentionally leaked their developer keys on Stack Overflow, GitHub...

Note that the web-scraping of the YouTube UI is not done with tools such as Selenium, which run JavaScript. It is low-level web-scraping that only makes HTTP requests and parses the JavaScript variable containing the data that the webpage is generated from.
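The low-level approach can be sketched as follows. This is only an illustration: it assumes the data sits in a `var ytInitialData = {...};` assignment inside the HTML, and the function name is hypothetical. The real pages and the variables used may differ.

```python
import json


def extract_js_variable(html: str, name: str = "ytInitialData") -> dict:
    """Return the JSON object assigned to the JavaScript variable `name`."""
    marker = f"var {name} = "
    start = html.find(marker)
    if start == -1:
        raise ValueError(f"variable {name!r} not found in page")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the trailing ';' and the rest of the HTML).
    value, _end = json.JSONDecoder().raw_decode(html, start + len(marker))
    return value
```

For example, `extract_js_variable('<script>var ytInitialData = {"a": 1};</script>')` returns the dictionary `{"a": 1}` without running any JavaScript.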

Note that I host a YouTube operational API instance with both parts at https://yt.lemnoslife.com
Some of its metrics are available at https://yt.lemnoslife.com/metrics/. Note that on January 24, 2023, my official instance successfully served (without any downtime and without me using it):

  • 674,861 no-key requests
  • 58,332 web-scraping requests

Concerning the web-scraping of the YouTube UI, YouTube may detect and temporarily block requests (for a few hours) if an instance is used for many requests. However, if you have your own YouTube operational API instance and you proceed in a single-threaded way, waiting for the answer to each request before making the next one, then as far as I know there isn't any problem. Note that even if we proceed in a multi-threaded way, this single-thread limitation won't bother us, as we mainly rely on YouTube Data API v3.
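One way to respect the one-request-at-a-time constraint from several threads is to put the web-scraping requests behind a lock. This is a sketch under assumptions, not a description of how the crawler is implemented: `SerializedFetcher` and the wrapped `fetch` callable are hypothetical names.

```python
import threading
from typing import Callable


class SerializedFetcher:
    """Wrap a fetch function so that only one call runs at any time."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._lock = threading.Lock()

    def __call__(self, url: str) -> str:
        # A thread waits here until the previous request has been answered.
        with self._lock:
            return self._fetch(url)
```

Threads that only call YouTube Data API v3 never take the lock, so they are not slowed down by it.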

Features that the crawler relies on

Note that you can run my algorithm with your own YouTube operational API instance and your own set of YouTube Data API v3 keys, or you can rely on my official instance for both.

The crawler can be multi-threaded with the --threads=N argument.

Note that the algorithm is able to pause and resume at any point, at the cost of losing progress on the channels that it was working on.
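A minimal sketch of one way to get this pause and resume behaviour, assuming the state is kept in a JSON file (the file layout and the function names are hypothetical, not the crawler's actual format). A channel only moves from `pending` to `done` once it is fully processed, so a channel interrupted midway is simply started again on resume.

```python
import json
import os
from pathlib import Path
from typing import Iterable


def save_state(path: Path, pending: Iterable[str], done: Iterable[str]) -> None:
    """Write the state to a temporary file, then swap it in atomically."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"pending": sorted(pending), "done": sorted(done)}))
    os.replace(tmp, path)  # an interruption never leaves a half-written file


def load_state(path: Path, seeds: Iterable[str]) -> tuple[list[str], set[str]]:
    """Resume from the saved state, or start from the seed channels."""
    if not path.exists():
        return list(seeds), set()
    state = json.loads(path.read_text())
    return state["pending"], set(state["done"])
```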

The crawler starts with the channel ids that you provide in channels.txt.

Each thread works on a different channel id and proceeds as follows:

High overview

  • Lists all videos (including livestreams and shorts) of the given channel, lists their comments and proceeds recursively with the comment authors
  • Lists all other tabs content:
    • CHANNELS lists related channels
    • COMMUNITY lists comments on community posts
    • PLAYLISTS lists videos in playlists to find unlisted videos but also videos from other channels
    • LIVE lists livestreams to find new channels in the livestreams chat messages
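The overall loop can be sketched as a multi-threaded graph traversal. This is an illustration under assumptions: `discover` is a hypothetical callable standing for all the steps above (videos, comments and the other tabs) and returns the channel ids found while processing one channel.

```python
import queue
import threading
from typing import Callable, Iterable


def crawl(
    seeds: Iterable[str],
    discover: Callable[[str], Iterable[str]],
    threads: int = 4,
) -> set[str]:
    """Process every channel reachable from the seeds exactly once."""
    seen = set(seeds)
    seen_lock = threading.Lock()
    todo: queue.Queue = queue.Queue()
    for channel_id in seen:
        todo.put(channel_id)

    def worker() -> None:
        while True:
            channel_id = todo.get()
            if channel_id is None:  # sentinel: no work left
                todo.task_done()
                return
            try:
                for found in discover(channel_id):
                    with seen_lock:
                        if found in seen:
                            continue
                        seen.add(found)
                    todo.put(found)  # queued before the parent is marked done
            finally:
                todo.task_done()

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    todo.join()  # returns once every queued channel has been processed
    for _ in workers:
        todo.put(None)
    for w in workers:
        w.join()
    return seen
```

Each worker holds one channel at a time, which matches the rule that each thread works on a different channel id.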

Technically

The architecture of the description is the same as the High overview one.

Side notes

Note that once a channel has been treated, the data retrieved from it is compressed.

Currently, unlisted videos that could be found thanks to the PLAYLISTS channel tab aren't treated, as I haven't found a way to treat them logically: we can find these unlisted videos before, during and after having treated all channel tabs. Consider just one problematic situation, finding the video after having treated all channel tabs (channel compression isn't a problem): we then have to treat the video in a manner that allows pause and resume, even if the comments on the unlisted video lead to the whole YouTube graph, which may not have been treated yet.

Otherwise, as far as I know, my algorithm is reliably exhaustive in terms of features used to find all YouTube channels. Indeed, we don't try to parse YouTube ids in plaintext spaces such as comments, as it's endless work: we would have to support identifying VIDEO_ID, youtu.be/watch?v=VIDEO_ID... without even being sure that VIDEO_ID is a correct video id. In addition, id formats aren't documented. In contrast, our approach guarantees that the ids we find are correct. Furthermore, we could use unreliable features such as the YouTube Data API v3 Search: list endpoint.

Note that each time we use the YouTube Data API v3 CommentThreads: list endpoint, we also use the YouTube Data API v3 Comments: list endpoint when a comment has more than 5 replies (code: YouTube Data API v3 Comments: list call and test on number of replies), as CommentThreads: list only returns the first 5 replies to a given comment.
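The test on the number of replies can be sketched as below. The field names (`snippet.totalReplyCount`, `snippet.topLevelComment`, `replies.comments`) are those of the CommentThreads resource. The function names are hypothetical, and `fetch_replies` is assumed to call Comments: list with `parentId` set to the given comment id and to follow pagination.

```python
from typing import Callable

INLINE_REPLY_LIMIT = 5  # CommentThreads: list returns at most this many replies


def thread_replies(
    thread: dict,
    fetch_replies: Callable[[str], list],
) -> list:
    """Return all replies of a comment thread, fetching them only if needed."""
    snippet = thread["snippet"]
    inline = thread.get("replies", {}).get("comments", [])
    if snippet["totalReplyCount"] <= INLINE_REPLY_LIMIT:
        # Everything is already in the CommentThreads: list response.
        return inline
    # Some replies are missing: ask Comments: list for the complete set.
    return fetch_replies(snippet["topLevelComment"]["id"])
```

The extra request is only made for threads that actually exceed the limit.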

Note that currently all requests to both APIs are stored, so that all this data can be used later on for other purposes.