# The algorithm:
To retrieve as many YouTube video ids as possible, and hence as many video captions as possible, we need to discover as many YouTube channels as possible.
So we explore the YouTube channels graph with a breadth-first search, proceeding as follows:
1. Provide a starting set of channels.
2. Given a channel, retrieve other channels through its content by using the [YouTube Data API v3](https://developers.google.com/youtube/v3) and the [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API), then repeat step 2 for each newly retrieved channel.
A ready-to-use instance of this project's website is hosted at: https://crawler.yt.lemnoslife.com
See more details on [the Wiki](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki).
# Running the algorithm:
Because of [the current compression mechanism](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/30), Linux is the only known OS able to run this algorithm.
```sh
sudo apt install nlohmann-json3-dev yt-dlp
make
./youtubeCaptionsSearchEngine -h
```
If you plan to use the front-end website, also run:
```sh
pip install webvtt-py
```
Unless you provide the argument `--youtube-operational-api-instance-url https://yt.lemnoslife.com`, you have [to host your own instance of the YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API/#install-your-own-instance-of-the-api).
Unless you provide the argument `--no-keys`, you have to provide at least one [YouTube Data API v3 key](https://developers.google.com/youtube/v3/getting-started) in `keys.txt`.
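A `keys.txt` presumably lists one API key per line (the values below are placeholders, not real keys):

```
AIzaSyExampleKey1
AIzaSyExampleKey2
```

Providing several keys lets the crawler spread requests across them when one key's daily quota is exhausted.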
```sh
./youtubeCaptionsSearchEngine
```