Update 'Project proposal'

Benjamin Loison 2022-12-13 02:56:12 +01:00
parent 425692ed24
commit 21ef5c1dc6

Project-proposal.md Normal file

@ -0,0 +1,25 @@
# Project:
The YouTube UI and YouTube Data API v3 search features don't always use video captions.<br/>
The goal of this project is to gather as many video captions as possible, along with their video ids. That way we could offer a video search engine that browses through video captions, which is something YouTube doesn't always do properly.
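
A minimal sketch of what such a caption search could look like, assuming the captions are already stored as plain text keyed by video id. The `captions` dictionary, the `normalize` and `search` helpers and the sample texts are all hypothetical. Ignoring case and punctuation is a choice made for this sketch, not a description of how YouTube matches queries.

```python
import re

def normalize(text):
    """Lowercase the text and drop punctuation (an assumption of this sketch)."""
    return " ".join(re.findall(r"\w+", text.lower()))

def search(captions, query):
    """Return the ids of the videos whose captions contain the query as an exact phrase."""
    # Padding with spaces makes the phrase match on whole words only.
    needle = f" {normalize(query)} "
    return sorted(
        video_id
        for video_id, text in captions.items()
        if needle in f" {normalize(text)} "
    )

# Hypothetical sample data: video id -> caption text.
captions = {
    "VIDEO_ID_A": "Your software, Linux, is in millions of computers",
    "VIDEO_ID_B": "An unrelated caption about cooking",
}
matches = search(captions, "your software Linux is in millions of computers")
```

Here `matches` only contains `VIDEO_ID_A`. A linear scan like this one would probably need to be replaced by a proper index once the dataset grows.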
## Proof:
For instance, the video `o8NPllzkFhE` starts with the following English caption, which is not auto-generated:
> Chris Anderson: This is such a strange thing. Your software, Linux, is in millions of computers

Searching for `"Your software, Linux, is in millions of computers"` with the [YouTube UI](https://www.youtube.com/results?search_query=%22Your+software%2C+Linux%2C+is+in+millions+of+computers%22) and with the [YouTube Data API v3](https://developers.google.com/youtube/v3) Search: list endpoint ([endpoint documentation](https://developers.google.com/youtube/v3/docs/search/list) and [request](https://yt.lemnoslife.com/noKey/search?part=snippet&q="Your%20software,%20Linux,%20is%20in%20millions%20of%20computers")) returns only the video `Vo9KPk-gqKk` in both cases, so the video `o8NPllzkFhE` itself is missing from the results even though its caption contains the query. The match with `Vo9KPk-gqKk` makes sense, as it is another upload of the same TED conference: if we ignore punctuation, this video contains our query, that is `"your software Linux is in millions of computers"`. We can assume that YouTube processed both videos long ago, as they were uploaded in 2016.<br/>
Note that by enclosing our query in `"` we restrict the results to content containing exactly the given query. Also note that the API doesn't list recommended videos that don't perfectly match our query, while the UI does.
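
As an illustration, the quoted request above could be built as follows. This is only a sketch: the `build_exact_search_url` helper is hypothetical, the endpoint and the `part` and `q` parameters are the ones from the request linked above, and the code only builds the URL without sending it.

```python
from urllib.parse import quote, urlencode

# Endpoint taken from the request linked above.
SEARCH_ENDPOINT = "https://yt.lemnoslife.com/noKey/search"

def build_exact_search_url(phrase):
    """Build a Search: list request URL for an exact-phrase query."""
    # Enclosing the phrase in double quotes asks for an exact match.
    params = {"part": "snippet", "q": f'"{phrase}"'}
    # quote_via=quote encodes spaces as %20 instead of +.
    return f"{SEARCH_ENDPOINT}?{urlencode(params, quote_via=quote)}"

url = build_exact_search_url("Your software, Linux, is in millions of computers")
```

The resulting `url` percent-encodes the quotes and commas, which should be equivalent to the request linked above.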
## Technical details:
[Git](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine)
Focusing on French channels would restrict the dataset we are looking for. However, as I experienced (notably with [this implementation](https://github.com/Benjamin-Loison/YouTube-comments-graph/blob/9802fd2c5d11c6dd866f4e39343630b98a01b4e3/CPP/main.cpp#L1140-L1233)), the country isn't given for every YouTube channel, so not restricting to a given country sounds like a less fragile approach.
The otherwise interesting YouTube Data API v3 [Captions: download](https://developers.google.com/youtube/v3/docs/captions/download) endpoint is only usable by the channel owning the videos whose captions we want (source: comments on [this StackOverflow answer](https://stackoverflow.com/a/30660549); I verified this fact).
I know how to retrieve the captions of a video using [a reverse-engineered approach](https://stackoverflow.com/a/70013529) I developed, but we will try to focus on less technical tools such as `yt-dlp` to get the captions of videos. To retrieve captions that are not auto-generated, `yt-dlp --all-subs --skip-download 'VIDEO_ID'` works fine; however, both `youtube-dl --write-auto-sub --skip-download 'VIDEO_ID'` and `yt-dlp --write-auto-sub --skip-download 'VIDEO_ID'` return incorrectly formatted files. If we have time, we will also try to download auto-generated video captions in order to compare our results with YouTube's, possibly by using a reverse-engineering approach (which works for sure).
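
The `yt-dlp` command quoted above could be driven from a script along these lines. This is a sketch under assumptions: the `build_caption_command` and `download_captions` helpers are hypothetical, the flags are exactly the ones quoted above, and actually running the command requires `yt-dlp` to be installed and network access.

```python
import subprocess

def build_caption_command(video_id):
    """Return the yt-dlp command quoted above for captions that are not auto-generated."""
    return ["yt-dlp", "--all-subs", "--skip-download", video_id]

def download_captions(video_id):
    """Run the command; raises CalledProcessError if yt-dlp exits with an error."""
    return subprocess.run(build_caption_command(video_id), check=True)

# Building the command does not touch the network; download_captions would.
command = build_caption_command("o8NPllzkFhE")
```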
As I explained in my answer to [this StackOverflow question](https://stackoverflow.com/q/68970958), since [YouTube Data API v3 doesn't provide a way to enumerate all videos (even for just one country)](https://github.com/Benjamin-Loison/YouTube-comments-graph/issues/2), the idea for retrieving all video ids is to start from an initial set of channels, list their videos using YouTube Data API v3 [PlaylistItems: list](https://stackoverflow.com/a/74579030), list the comments on their videos, and then repeat the process, as we may have discovered new channels through the comment authors on videos from already known channels.
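
The discovery process described above can be sketched as a breadth-first traversal. This is a simplified illustration, not the project's implementation: `list_videos` and `list_comment_authors` are hypothetical callbacks standing in for the PlaylistItems: list and comment listing requests, and the sample data is made up.

```python
from collections import deque

def crawl(seed_channels, list_videos, list_comment_authors):
    """Discover channels and videos starting from an initial set of channels.

    list_videos(channel_id) -> iterable of video ids (hypothetical callback)
    list_comment_authors(video_id) -> iterable of channel ids (hypothetical callback)
    """
    known_channels = set(seed_channels)
    queue = deque(known_channels)
    video_ids = set()
    while queue:
        channel_id = queue.popleft()
        for video_id in list_videos(channel_id):
            if video_id in video_ids:
                continue
            video_ids.add(video_id)
            # Comment authors may be channels we have not seen yet.
            for author_id in list_comment_authors(video_id):
                if author_id not in known_channels:
                    known_channels.add(author_id)
                    queue.append(author_id)
    return known_channels, video_ids

# Made-up data replacing the API calls.
videos_by_channel = {"channel_a": ["video_1", "video_2"], "channel_b": ["video_3"]}
authors_by_video = {
    "video_1": ["channel_b"],
    "video_2": ["channel_a", "channel_b"],
    "video_3": ["channel_c"],
}
channels, videos = crawl(
    ["channel_a"],
    lambda channel_id: videos_by_channel.get(channel_id, []),
    lambda video_id: authors_by_video.get(video_id, []),
)
```

Starting from `channel_a` alone, the traversal reaches all three channels and all three videos. In practice the callbacks would also have to deal with pagination and quota, which this sketch leaves out.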