# The algorithm:
To retrieve as many video captions as possible, we need as many YouTube video ids as possible, which in turn requires discovering as many YouTube channels as possible.
To explore the graph of YouTube channels with a breadth-first search, we proceed as follows:
1. Provide a starting set of channels.
2. Given a channel, retrieve the other channels referenced in its content by using the [YouTube Data API v3](https://developers.google.com/youtube/v3) and the [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API), then repeat this step for each retrieved channel.
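
The two steps above can be sketched as a plain breadth-first search. This is only a minimal illustration, not the project's actual implementation: `crawl_channels`, `get_related_channels`, `max_channels` and the toy channel ids are hypothetical names, and the callback stands in for the calls to the YouTube Data API v3 and the YouTube operational API.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl_channels(
    starting_channels: Iterable[str],
    get_related_channels: Callable[[str], Iterable[str]],
    max_channels: int = 1000,
) -> List[str]:
    """Return channel ids in breadth-first discovery order.

    `get_related_channels` is a hypothetical callback standing in for
    the API requests that list the channels found in a channel's content.
    """
    queue = deque()
    seen = set()
    # Step 1: enqueue the starting set of channels, skipping duplicates.
    for channel in starting_channels:
        if channel not in seen:
            seen.add(channel)
            queue.append(channel)

    order = []
    # Step 2: treat channels in FIFO order and enqueue newly found ones.
    while queue and len(order) < max_channels:
        channel = queue.popleft()
        order.append(channel)
        for related in get_related_channels(channel):
            if related not in seen:
                seen.add(related)
                queue.append(related)
    return order


# Toy in-memory graph replacing the API calls (made-up channel ids).
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["D"],
    "D": [],
}
discovered = crawl_channels(["A"], lambda channel: toy_graph.get(channel, []))
print(discovered)  # ['A', 'B', 'C', 'D']
```

The `seen` set is what keeps the search from looping when two channels reference each other, and the FIFO queue is what makes the traversal breadth-first rather than depth-first.
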

A ready-to-use website instance of this project is hosted at: https://crawler.yt.lemnoslife.com
See [the Wiki](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki) for more details.
# Running the algorithm:
Because of [the current compression mechanism](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/30), Linux is the only known OS able to run this algorithm.
```sh
sudo apt install nlohmann-json3-dev yt-dlp
make
./youtubeCaptionsSearchEngine -h
```
If you plan to use the front-end website, also run:
```sh
pip install webvtt-py
```
Unless you provide the argument `--youtube-operational-api-instance-url https://yt.lemnoslife.com`, you have to [host your own instance of the YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API/#install-your-own-instance-of-the-api).
Unless you provide the argument `--no-keys`, you have to provide at least one [YouTube Data API v3 key](https://developers.google.com/youtube/v3/getting-started) in `keys.txt`.
```sh
./youtubeCaptionsSearchEngine
```