From 0bd7a90c743a5ecdba42218fc03bba1d8496193d Mon Sep 17 00:00:00 2001
From: Benjamin_Loison
Date: Fri, 10 Feb 2023 17:06:44 +0100
Subject: [PATCH] Specify that automatically generated captions still have problems even with the latest releases, but that there is a workaround

---
 Project-proposal.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Project-proposal.md b/Project-proposal.md
index c9f42a2..733db06 100644
--- a/Project-proposal.md
+++ b/Project-proposal.md
@@ -20,6 +20,6 @@ Focusing on French channels would restrict the dataset we are looking for, howev
 
 YouTube Data API v3 interesting [Captions: download](https://developers.google.com/youtube/v3/docs/captions/download) endpoint is only usable by the channel owning the given videos we want the captions of (source: [this StackOverflow comment](https://stackoverflow.com/questions/30653865/downloading-captions-always-returns-a-403#comment49414961_30660549), I verified this fact).
 
-I know how to retrieve captions of a video using [a reverse-engineered approach](https://stackoverflow.com/a/70013529) I developed, but we will try to focus on less technical tools such as `yt-dlp` to get the captions of videos. To retrieve not auto-generated captions `yt-dlp --all-subs --skip-download 'VIDEO_ID'` works fine, however both `youtube-dl --write-auto-sub --skip-download 'VIDEO_ID'` and `yt-dlp --write-auto-sub --skip-download 'VIDEO_ID'` return incorrect format files. If we have time, we will try to also download auto-generated video captions to be able to make comparison of our results with YouTube ones, so maybe by using a reverse-engineering approach (this works for sure).
+I know how to retrieve captions of a video using [a reverse-engineered approach](https://stackoverflow.com/a/70013529) I developed, but we will try to focus on less technical tools such as `yt-dlp` to get the captions of videos. To retrieve captions that are not auto-generated, `yt-dlp --all-subs --skip-download 'VIDEO_ID'` works fine; however, both `youtube-dl --write-auto-sub --skip-download 'VIDEO_ID'` and `yt-dlp --write-auto-sub --skip-download 'VIDEO_ID'` return incorrectly formatted files, even with the latest releases. Nevertheless, using `yt-dlp --write-auto-subs --sub-format ttml --convert-subs vtt --skip-download 'VIDEO_ID'` works (source: [this Stack Overflow answer](https://stackoverflow.com/a/74935253)). If we have time, we will also try to download auto-generated video captions so that we can compare our results with YouTube's, possibly by using a reverse-engineering approach (which works for sure).
 
 As I answered to [this StackOverflow question](https://stackoverflow.com/q/68970958), as [YouTube Data API v3 doesn't propose a way to enumerate all videos (even for just a country)](https://github.com/Benjamin-Loison/YouTube-comments-graph/issues/2), the idea to retrieve all video ids is to start from a starting set of channels, then list their videos using YouTube Data API v3 [PlaylistItems: list](https://stackoverflow.com/a/74579030), then list the comments on their videos and then restart the process as we potentially retrieved new channels thanks to comment authors on videos from already known channels.
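
The `yt-dlp` workaround for auto-generated captions mentioned in this patch could be scripted roughly as follows. This is only a minimal sketch: the function names `build_auto_caption_command` and `download_auto_captions` are hypothetical and not part of the project, and the only `yt-dlp` options used are the ones quoted in the proposal, plus the standard `--` separator so that a video ID starting with `-` is not parsed as an option.

```python
from __future__ import annotations

import shutil
import subprocess


def build_auto_caption_command(video_id: str) -> list[str]:
    """Build the yt-dlp argument list for the workaround quoted in the proposal."""
    return [
        "yt-dlp",
        "--write-auto-subs",       # request auto-generated captions
        "--sub-format", "ttml",    # ask for TTML, as in the quoted workaround
        "--convert-subs", "vtt",   # then convert the result to WebVTT
        "--skip-download",         # do not download the video itself
        "--",                      # end of options, protects IDs starting with '-'
        video_id,
    ]


def download_auto_captions(video_id: str, output_dir: str = ".") -> bool:
    """Run yt-dlp in output_dir; return False if yt-dlp is missing or fails."""
    if shutil.which("yt-dlp") is None:
        return False
    result = subprocess.run(
        build_auto_caption_command(video_id),
        cwd=output_dir,
        check=False,
    )
    return result.returncode == 0
```

Calling `download_auto_captions("VIDEO_ID")` would then leave the caption files in the current directory, assuming `yt-dlp` is installed and the video has auto-generated captions.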