From 21ef5c1dc64e338e2c99642f70180603a4ec41bd Mon Sep 17 00:00:00 2001
From: Benjamin_Loison
Date: Tue, 13 Dec 2022 02:56:12 +0100
Subject: [PATCH] Update 'Project proposal'

---
 Project-proposal.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)
 create mode 100644 Project-proposal.md

diff --git a/Project-proposal.md b/Project-proposal.md
new file mode 100644
index 0000000..cbb82a1
--- /dev/null
+++ b/Project-proposal.md
@@ -0,0 +1,25 @@
+# Project:
+
+The search features of the YouTube UI and of YouTube Data API v3 don't always use video captions.
+The goal of this project is to gather as many video captions as possible, along with their video ids. That way we could offer a video search engine that searches through video captions, which is something YouTube doesn't always do properly.
+
+## Proof:
+
+For instance, the video `o8NPllzkFhE` starts with the following English caption, which is not auto-generated:
+> Chris Anderson: This is such a strange thing. Your software, Linux, is in millions of computers
+
+By searching `"Your software, Linux, is in millions of computers"` with the [YouTube UI](https://www.youtube.com/results?search_query=%22Your+software%2C+Linux%2C+is+in+millions+of+computers%22) and the [YouTube Data API v3](https://developers.google.com/youtube/v3) Search: list endpoint ([endpoint documentation](https://developers.google.com/youtube/v3/docs/search/list) and [request](https://yt.lemnoslife.com/noKey/search?part=snippet&q="Your%20software,%20Linux,%20is%20in%20millions%20of%20computers")), both return only the video `Vo9KPk-gqKk`, and not `o8NPllzkFhE`, whose caption contains the query verbatim. The `Vo9KPk-gqKk` match makes sense, as it is another upload of the same TED conference, so if we ignore punctuation, this video contains our query, that is `"your software Linux is in millions of computers"`. We can assume that YouTube processed both videos a while ago, as they were uploaded in 2016.
+
+Note that enclosing our query in `"` restricts the results to content containing exactly the given query. Also note that the API doesn't list recommended videos that don't perfectly match our query, while the UI does.
+
+## Technical details:
+
+[Git](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine)
+
+Focusing on French channels would narrow the dataset we are looking for. However, as I experienced (notably with [this implementation](https://github.com/Benjamin-Loison/YouTube-comments-graph/blob/9802fd2c5d11c6dd866f4e39343630b98a01b4e3/CPP/main.cpp#L1140-L1233)), the country isn't given for every YouTube channel, so not restricting to a given country sounds like a less fragile approach.
+
+The otherwise interesting YouTube Data API v3 [Captions: download](https://developers.google.com/youtube/v3/docs/captions/download) endpoint is only usable by the channel that owns the videos whose captions we want (source: comments of [this StackOverflow answer](https://stackoverflow.com/a/30660549); I verified this fact).
+
+I know how to retrieve the captions of a video using [a reverse-engineered approach](https://stackoverflow.com/a/70013529) I developed, but we will try to focus on less technical tools such as `yt-dlp` to get the captions of videos. To retrieve captions that are not auto-generated, `yt-dlp --all-subs --skip-download 'VIDEO_ID'` works fine. However, both `youtube-dl --write-auto-sub --skip-download 'VIDEO_ID'` and `yt-dlp --write-auto-sub --skip-download 'VIDEO_ID'` return files in an incorrect format. If we have time, we will also try to download auto-generated video captions in order to compare our results with YouTube's, possibly by using the reverse-engineered approach (which works for sure).
+
+As I answered on [this StackOverflow question](https://stackoverflow.com/q/68970958), since [YouTube Data API v3 doesn't provide a way to enumerate all videos (even for just one country)](https://github.com/Benjamin-Loison/YouTube-comments-graph/issues/2), the idea for retrieving all video ids is to start from an initial set of channels, list their videos using YouTube Data API v3 [PlaylistItems: list](https://stackoverflow.com/a/74579030), then list the comments on these videos, and finally repeat the process, as the comment authors on videos from already known channels potentially reveal new channels.
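
The crawl described in the last paragraph can be sketched as a breadth-first traversal over channels. This is only a minimal sketch under stated assumptions: `list_videos` and `list_comment_authors` are hypothetical callables that would wrap the YouTube Data API v3 requests (PlaylistItems: list for the former, a comment-listing endpoint for the latter), and `crawl_channels` and `max_channels` are illustrative names that do not come from the project. Pagination, quota handling and error handling are left out, and the toy dictionaries stand in for real API responses.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Set, Tuple


def crawl_channels(
    seed_channels: Iterable[str],
    list_videos: Callable[[str], Iterable[str]],
    list_comment_authors: Callable[[str], Iterable[str]],
    max_channels: int = 1000,
) -> Tuple[Set[str], Set[str]]:
    """Discover channels and videos starting from a set of seed channels.

    list_videos(channel_id) and list_comment_authors(video_id) are
    hypothetical helpers (assumptions, not part of the project).
    """
    known_channels: Set[str] = set()
    known_videos: Set[str] = set()
    queue: Deque[str] = deque()

    # Enqueue each seed channel once, preserving the given order.
    for channel_id in seed_channels:
        if channel_id not in known_channels:
            known_channels.add(channel_id)
            queue.append(channel_id)

    while queue:
        channel_id = queue.popleft()
        for video_id in list_videos(channel_id):
            if video_id in known_videos:
                continue
            known_videos.add(video_id)
            # Comment authors are channels too: they feed the next round.
            for author_id in list_comment_authors(video_id):
                if author_id in known_channels:
                    continue
                if len(known_channels) >= max_channels:
                    continue  # illustrative cap to keep the crawl bounded
                known_channels.add(author_id)
                queue.append(author_id)

    return known_channels, known_videos


# Toy data standing in for API responses (made-up ids).
videos_by_channel = {"chanA": ["vid1", "vid2"], "chanB": ["vid3"], "chanC": []}
authors_by_video = {"vid1": ["chanB"], "vid2": ["chanA", "chanC"], "vid3": ["chanA"]}

channels, videos = crawl_channels(
    ["chanA"],
    lambda channel_id: videos_by_channel.get(channel_id, []),
    lambda video_id: authors_by_video.get(video_id, []),
)
```

Passing the two listing functions as parameters keeps the traversal independent of how the data is fetched, so the same loop could be run against the official API, a no-key proxy, or recorded fixtures.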