2023-01-06 16:09:12 +01:00

The algorithm:

To retrieve as many video captions as possible, we need as many YouTube video ids as possible, which in turn requires discovering as many YouTube channels as possible. We therefore explore the graph of YouTube channels with a breadth-first search, as follows:

  1. Provide a starting set of channels.
  2. Given a channel, retrieve other channels referenced in its content by using the YouTube Data API v3 and the YouTube operational API, then repeat this step for each retrieved channel.
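
The two steps above can be sketched as a standard breadth-first search. This is only a minimal illustration, not the project's actual code: the names discoverChannels, relatedChannels and maxChannels are hypothetical, and relatedChannels stands in for the calls to the YouTube Data API v3 and the YouTube operational API that return the channels found in a given channel's content.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

using ChannelId = std::string;
// Hypothetical callback: returns the channels found in the content of a channel.
// In the real project this role is played by API requests.
using RelatedChannelsFn = std::function<std::vector<ChannelId>(const ChannelId&)>;

// Breadth-first discovery of channels, returned in discovery order.
// maxChannels is an assumed safety bound so that the sketch always terminates.
std::vector<ChannelId> discoverChannels(const std::vector<ChannelId>& startingChannels,
                                        const RelatedChannelsFn& relatedChannels,
                                        std::size_t maxChannels)
{
    std::vector<ChannelId> discovered;   // every channel seen so far, in order
    std::unordered_set<ChannelId> seen;  // avoids treating a channel twice
    std::queue<ChannelId> toTreat;       // FIFO queue gives breadth-first order

    // Step 1: seed the search with the starting set of channels.
    for (const ChannelId& channel : startingChannels) {
        if (discovered.size() >= maxChannels) break;
        if (seen.insert(channel).second) {
            discovered.push_back(channel);
            toTreat.push(channel);
        }
    }

    // Step 2: treat each channel and enqueue the new channels it leads to.
    while (!toTreat.empty() && discovered.size() < maxChannels) {
        const ChannelId current = toTreat.front();
        toTreat.pop();
        for (const ChannelId& next : relatedChannels(current)) {
            if (discovered.size() >= maxChannels) break;
            if (seen.insert(next).second) {
                discovered.push_back(next);
                toTreat.push(next);
            }
        }
    }
    return discovered;
}
```

The seen set is what keeps the search from looping when two channels reference each other, and the FIFO queue ensures that channels closest to the starting set are treated first.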

A ready-to-use website instance of this project is hosted at: https://crawler.yt.lemnoslife.com

See more details on the Wiki.

Running the algorithm:

Because of the current compression mechanism, Linux is the only known OS able to run this algorithm.

sudo apt install nlohmann-json3-dev yt-dlp
make
./youtubeCaptionsSearchEngine -h

If you plan to use the front-end website, also run:

pip install webvtt-py

Unless you provide the argument --youtube-operational-api-instance-url https://yt.lemnoslife.com, you have to host your own instance of the YouTube operational API.

Unless you provide the argument --no-keys, you have to provide at least one YouTube Data API v3 key in keys.txt.

./youtubeCaptionsSearchEngine
Description
The search features of the YouTube UI and the YouTube Data API v3 do not search through video captions that are not auto-generated.