Add: Note that we aren't interested in Auto-translated captions.

Benjamin Loison 2023-02-10 17:51:29 +01:00
parent 0bd7a90c74
commit b202a37560

@ -20,6 +20,6 @@ Focusing on French channels would restrict the dataset we are looking for, howev
The YouTube Data API v3 [Captions: download](https://developers.google.com/youtube/v3/docs/captions/download) endpoint would be interesting, but it is only usable by the channel that owns the videos whose captions we want (source: [this Stack Overflow comment](https://stackoverflow.com/questions/30653865/downloading-captions-always-returns-a-403#comment49414961_30660549), I verified this fact).
I know how to retrieve the captions of a video using [a reverse-engineered approach](https://stackoverflow.com/a/70013529) that I developed, but we will try to rely on less technical tools such as `yt-dlp`. To retrieve captions that are not auto-generated, `yt-dlp --all-subs --skip-download 'VIDEO_ID'` works fine. However, both `youtube-dl --write-auto-sub --skip-download 'VIDEO_ID'` and `yt-dlp --write-auto-sub --skip-download 'VIDEO_ID'` return incorrectly formatted files, even with the latest releases. Nevertheless, `yt-dlp --write-auto-subs --sub-format ttml --convert-subs vtt --skip-download 'VIDEO_ID'` works (source: [this Stack Overflow answer](https://stackoverflow.com/a/74935253)). If we have time, we will also try to download auto-generated captions so that we can compare our results with YouTube's, possibly by using the reverse-engineered approach (which is known to work).
I know how to retrieve the captions of a video using [a reverse-engineered approach](https://stackoverflow.com/a/70013529) that I developed, but we will try to rely on less technical tools such as `yt-dlp`. To retrieve captions that are not auto-generated, `yt-dlp --all-subs --skip-download 'VIDEO_ID'` works fine. However, both `youtube-dl --write-auto-sub --skip-download 'VIDEO_ID'` and `yt-dlp --write-auto-sub --skip-download 'VIDEO_ID'` return incorrectly formatted files, even with the latest releases. Nevertheless, `yt-dlp --write-auto-subs --sub-format ttml --convert-subs vtt --skip-download 'VIDEO_ID'` works (source: [this Stack Overflow answer](https://stackoverflow.com/a/74935253)). If we have time, we will also try to download auto-generated captions so that we can compare our results with YouTube's, possibly by using the reverse-engineered approach (which is known to work). Note that we aren't interested in `Auto-translate`d captions.
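As a minimal sketch, the two `yt-dlp` invocations quoted above could be wrapped in a small Python helper that only builds the argument list. The function name `caption_command` and its `auto_generated` parameter are hypothetical. The flags are exactly the ones from the commands above.

```python
def caption_command(video_id: str, auto_generated: bool = False) -> list[str]:
    """Build the yt-dlp argument list that fetches captions without the video.

    Hypothetical helper: it only assembles the flags quoted in the text,
    it does not run yt-dlp itself.
    """
    if auto_generated:
        # Auto-generated captions: request TTML and convert to VTT, since the
        # plain --write-auto-sub variant is reported to give incorrect files.
        return [
            "yt-dlp",
            "--write-auto-subs",
            "--sub-format", "ttml",
            "--convert-subs", "vtt",
            "--skip-download",
            video_id,
        ]
    # Captions that are not auto-generated.
    return ["yt-dlp", "--all-subs", "--skip-download", video_id]
```

Assuming `yt-dlp` is installed and on the `PATH`, the result can be passed to `subprocess.run(caption_command('VIDEO_ID'), check=True)`.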
As I explained in my answer to [this Stack Overflow question](https://stackoverflow.com/q/68970958), since [YouTube Data API v3 doesn't provide a way to enumerate all videos (even for just one country)](https://github.com/Benjamin-Loison/YouTube-comments-graph/issues/2), the idea for retrieving all video ids is to start from an initial set of channels, list their videos using YouTube Data API v3 [PlaylistItems: list](https://stackoverflow.com/a/74579030), then list the comments on these videos, and then restart the process, as we may have discovered new channels through the authors of comments on videos from already known channels.
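The discovery loop described above can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions. `crawl`, `list_videos`, `list_comment_authors` and `max_channels` are hypothetical names. The two callables stand in for the actual YouTube Data API v3 requests (listing the videos of a channel, and the channels of the comment authors on a video), which are not shown here.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def crawl(
    seed_channels: Iterable[str],
    list_videos: Callable[[str], Iterable[str]],
    list_comment_authors: Callable[[str], Iterable[str]],
    max_channels: Optional[int] = None,
) -> tuple[set[str], set[str]]:
    """Discover channels and video ids starting from a set of seed channels.

    list_videos(channel_id) and list_comment_authors(video_id) are
    placeholders (assumptions) for the real API calls.
    """
    known_channels: set[str] = set()
    queue: deque[str] = deque()
    for channel in seed_channels:
        if channel not in known_channels:
            known_channels.add(channel)
            queue.append(channel)

    video_ids: set[str] = set()
    while queue:
        channel = queue.popleft()
        for video_id in list_videos(channel):
            if video_id in video_ids:
                continue
            video_ids.add(video_id)
            # Comment authors may be channels we have not seen yet.
            for author in list_comment_authors(video_id):
                under_limit = max_channels is None or len(known_channels) < max_channels
                if author not in known_channels and under_limit:
                    known_channels.add(author)
                    queue.append(author)
    return known_channels, video_ids
```

The optional `max_channels` bound is an addition for illustration, since an unbounded crawl of this kind may not terminate in a reasonable time.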