A video introducing this project is available [here](https://crawler.yt.lemnoslife.com/presentation).
# The algorithm:
To retrieve as many video captions as possible, we need to retrieve as many YouTube video ids as possible, which in turn requires discovering as many YouTube channels as possible.
So to discover the YouTube channels graph with a breadth-first search, we proceed as follows:
1. Provide a starting set of channels.
2. Given a channel, retrieve other channels from its content by using [YouTube Data API v3](https://developers.google.com/youtube/v3) and the [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API), then repeat this step for each newly discovered channel (see the sketch below this list).
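
For illustration, here is a minimal sketch of this breadth-first search, where `fetchRelatedChannels` is a hypothetical stand-in for the API calls actually performed in `main.cpp`:
```cpp
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the YouTube Data API v3 and YouTube operational API
// calls that extract channel ids from a channel's content (comments, tabs...);
// the real logic lives in `treatChannelOrVideo` in `main.cpp`.
std::vector<std::string> fetchRelatedChannels(const std::string& channelId)
{
    return {};
}

void discoverChannels(const std::vector<std::string>& startingChannels)
{
    std::set<std::string> treatedChannels;
    std::queue<std::string> channelsToTreat;
    // Step 1: provide a starting set of channels.
    for (const std::string& channelId : startingChannels)
        channelsToTreat.push(channelId);
    // Step 2: treat each channel, queueing the newly discovered ones.
    while (!channelsToTreat.empty())
    {
        const std::string channelId = channelsToTreat.front();
        channelsToTreat.pop();
        if (!treatedChannels.insert(channelId).second)
            continue; // Already treated.
        for (const std::string& discoveredChannelId : fetchRelatedChannels(channelId))
            if (treatedChannels.count(discoveredChannelId) == 0)
                channelsToTreat.push(discoveredChannelId);
    }
}
```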
A ready-to-use website instance of this project is hosted at: https://crawler.yt.lemnoslife.com
See more details on [the Wiki](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki).
# The project structure:
- `main.cpp` contains the C++ multi-threaded algorithm proceeding to the YouTube channels discovery. It is notably made of the following functions:
    - `main`, which takes into account the command line arguments, loads variables from files (`channels.txt`, `keys.txt`, `channels/` content) and starts the threads executing the `treatChannels` function
    - `treatChannels`, which gets a YouTube channel to treat, treats it in the `treatChannelOrVideo` function and compresses the retrieved data
    - `treatChannelOrVideo`, which, provided a YouTube channel id or a video id, treats this resource. In both cases it treats the comments left on this resource. In the case of a channel it also treats its `CHANNELS`, `COMMUNITY`, `PLAYLISTS` and `LIVE` tabs and downloads the captions of the channel videos.
    - `markChannelAsRequiringTreatmentIfNeeded`, which, provided a YouTube channel id, marks it as requiring treatment if it wasn't already treated
    - `execute`, which, provided a `yt-dlp` command, executes it in a shell
    - `getJson`, which, provided an API request, returns a JSON structure with its result. In the case that the requested API is YouTube Data API v3 and a set of keys is provided (see `keys.txt` below), it rotates the keys as required (see the sketch after this list)
- `channels.txt` contains a starting set of channels, mostly the 100 most subscribed French channels
- `keys.txt` contains a set of YouTube Data API v3 keys (not provided) to have the ability to request this API (see the `--no-keys` command line argument in the section below as an alternative to filling it)
- `scripts/` contains Python scripts to:
    - generate `channels.txt` as described above (`retrieveTop100SubscribersFrance.py`)
    - remove channels being treated before a restart of the algorithm, as described in [the `main` function documentation](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/8dd89e6e881da0a905b6fa4b23775c4344dd0d9d/main.cpp#L126-L128) (`removeChannelsBeingTreated.py`)
- `website/` is a PHP website using WebSocket to allow the end-user to make requests on the retrieved dataset. When fetching the website, the end-user receives the interpreted `index.php`, which upon a request interacts with `websocket.php`, which in the back-end dispatches the requests from the various end-users to `search.py` (which treats the actual end-user request on the compressed dataset), using `users/` for the inter-process communication.
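
For illustration, the key rotation in `getJson` can be sketched as follows (a simplified signature; the real function in `main.cpp` takes more parameters, and the hypothetical `httpGetJson` helper stands in for the actual HTTP request logic):
```cpp
#include <string>
#include <vector>

#include <nlohmann/json.hpp>

using json = nlohmann::json;

std::vector<std::string> keys; // Loaded from `keys.txt`.
unsigned int currentKeyIndex = 0;

// Hypothetical stand-in for the actual HTTP request performed in `main.cpp`.
json httpGetJson(const std::string& url)
{
    return json::parse(R"({"error": {"code": 403}})"); // Stub: simulate an exhausted key.
}

// Sketch of the key rotation: when the current key is exhausted (the API
// answers with an `error` object), switch to the next key and retry.
json getJson(const std::string& request)
{
    for (size_t attempt = 0; attempt < keys.size(); attempt++)
    {
        const std::string url = "https://www.googleapis.com/youtube/v3/" + request
                              + "&key=" + keys[currentKeyIndex];
        const json data = httpGetJson(url);
        if (!data.contains("error"))
            return data;
        currentKeyIndex = (currentKeyIndex + 1) % keys.size();
    }
    return json(); // All keys exhausted.
}
```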
Note that this project heavily relies on [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API) [which was modified for this project](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/YouTube-operational-API-commits).
# Running the YouTube graph discovery algorithm:
Because of [the current compression mechanism](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/30), Linux is the only known OS able to run this algorithm.
To clone the repository, run:
```sh
git clone https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine
```
Move to the cloned repository by running:
```sh
cd YouTube_captions_search_engine/
```
To install this project's dependencies on an `apt`-based Linux distribution, make sure to have [`pip`](https://pip.pypa.io/en/stable/installation/) installed and run:
```sh
sudo apt install nlohmann-json3-dev
pip install yt-dlp
```
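To quickly check that the `nlohmann-json3-dev` header is visible to your compiler, you can build and run this minimal program (not part of the project):
```cpp
#include <iostream>

#include <nlohmann/json.hpp>

int main()
{
    // Parse and pretty-print a small JSON document.
    const nlohmann::json data = nlohmann::json::parse(R"({"kind": "youtube#channel"})");
    std::cout << data.dump(4) << std::endl;
    return 0;
}
```
If it compiles and prints the document back, the dependency is correctly installed.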
To compile the YouTube graph discovery algorithm, run:
```sh
make
```
To see the command line arguments of the algorithm, run:
```sh
./youtubeCaptionsSearchEngine -h
```
To run the YouTube graph discovery algorithm, run:
```sh
./youtubeCaptionsSearchEngine
```
Unless you provide the argument `--youtube-operational-api-instance-url https://yt.lemnoslife.com`, you have to [host your own instance of the YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API/#install-your-own-instance-of-the-api).
Unless you provide the argument `--no-keys`, you have to provide at least one [YouTube Data API v3 key](https://developers.google.com/youtube/v3/getting-started) in `keys.txt`.
# Hosting the website enabling users to make requests:
Move to the `website/` folder by running:
```sh
cd website/
```
To install its dependencies, make sure to have [`composer`](https://getcomposer.org/doc/00-intro.md) installed and run:
```sh
sudo apt install nginx
composer install
pip install webvtt-py
```
Add the following to your Nginx configuration for the website:
```nginx
# Make `index.php` the default webpage of your website.
index index.php;

# Allow end-users to retrieve the content of a file within a channel zip.
location /channels {
    rewrite ^(.*).zip$ /channels.php;
    rewrite ^(.*).zip/(.*).json$ /channels.php;
    rewrite ^(.*).zip/(.*).txt$ /channels.php;
    rewrite ^(.*).zip/(.*).vtt$ /channels.php;
    # Allow end-users to list `channels/` content.
    autoindex on;
}

# Prevent end-users from accessing other end-users' requests.
location /users {
    deny all;
}

# Configure the WebSocket endpoint.
location /websocket {
    # Switch off logging.
    access_log off;
    # Redirect all HTTP traffic to localhost.
    proxy_pass http://localhost:4430;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    # WebSocket support (nginx 1.4).
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    # Timeout extension; possibly keep this short if using a ping strategy.
    proxy_read_timeout 99999s;
}
```
Start the WebSocket worker by running:
```sh
php websockets.php
```