Project proposal
Benjamin_Loison edited this page 2023-02-10 17:51:29 +01:00

Project:

The search features of the YouTube UI and of the YouTube Data API v3 don't always use video captions.
The goal of this project is to gather as many video captions as possible, along with their video ids. With them we could offer a video search engine that searches through video captions, which is something YouTube doesn't always do properly.

Proof:

For instance, the video o8NPllzkFhE starts with the following English caption, which is not auto-generated:

Chris Anderson: This is such a strange thing. Your software, Linux, is in millions of computers

Searching for "Your software, Linux, is in millions of computers" with the YouTube UI and with the YouTube Data API v3 Search: list endpoint (endpoint documentation and request) returns only the video Vo9KPk-gqKk in both cases. This match makes sense, as it is another upload of the same TED conference: ignoring punctuation, this video contains our query, that is "your software Linux is in millions of computers". The video o8NPllzkFhE itself, however, is not returned, even though its caption contains the query. We can assume that YouTube processed both videos long ago, as they were uploaded in 2016.

Note that by enclosing our query in double quotes (") we restrict the results to content containing exactly the given query. Also note that, unlike the UI, the API doesn't list recommended videos that don't perfectly match our query.
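To make the exact-phrase search above reproducible, here is a minimal sketch of how such a Search: list request URL could be built. The function name build_exact_search_url and the API_KEY placeholder are hypothetical; the part, type, q and key parameters are the usual ones for this endpoint, and the choice of part=snippet and type=video is an assumption, not necessarily the request used for the test above.

```python
from urllib.parse import urlencode

# Base URL of the YouTube Data API v3 Search: list endpoint.
SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"


def build_exact_search_url(phrase: str, api_key: str) -> str:
    """Return a Search: list URL that looks for `phrase` as an exact phrase.

    Hypothetical helper: the double quotes around the phrase ask for
    content containing strictly the given query.
    """
    params = {
        "part": "snippet",   # assumption: we only need basic video metadata
        "type": "video",     # assumption: ignore channels and playlists
        "q": f'"{phrase}"',  # enclose the query in double quotes
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    # API_KEY is a placeholder, not a real key.
    print(build_exact_search_url(
        "Your software, Linux, is in millions of computers", "API_KEY"))
```

The resulting URL can then be fetched with any HTTP client; the video ids of the matches are expected under items[].id.videoId in the JSON response.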

Technical details:

Git

Focusing on French channels would restrict the dataset we are looking for. However, as I experienced (notably with this implementation), the country isn't given for every YouTube channel, so not restricting to a given country sounds like a less fragile approach.

The Captions: download endpoint of the YouTube Data API v3 looks interesting, but it is only usable by the owner of the channel whose video captions we want (source: this Stack Overflow comment; I verified this fact).

I know how to retrieve the captions of a video using a reverse-engineered approach I developed, but we will try to rely on less technical tools such as yt-dlp instead. To retrieve captions that are not auto-generated, yt-dlp --all-subs --skip-download 'VIDEO_ID' works fine. However, both youtube-dl --write-auto-sub --skip-download 'VIDEO_ID' and yt-dlp --write-auto-sub --skip-download 'VIDEO_ID' return files in an incorrect format, even with the latest releases. Nevertheless, yt-dlp --write-auto-subs --sub-format ttml --convert-subs vtt --skip-download 'VIDEO_ID' works (source: this Stack Overflow answer). If we have time, we will also try to download auto-generated captions in order to compare our results with YouTube's, possibly by using the reverse-engineered approach (which works for sure). Note that we aren't interested in auto-translated captions.
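As an illustration of how the two working yt-dlp invocations above could be scripted, here is a minimal sketch. The helper names (watch_url, manual_captions_command, auto_captions_command, download_captions) are hypothetical, the flags are exactly those quoted above, and passing a full watch URL instead of the bare video id is an assumption made so that ids starting with a dash are not mistaken for options.

```python
import subprocess


def watch_url(video_id: str) -> str:
    """Build the watch URL of a video (assumption: used instead of the bare id)."""
    return f"https://www.youtube.com/watch?v={video_id}"


def manual_captions_command(video_id: str) -> list[str]:
    """Command for captions that are not auto-generated."""
    return ["yt-dlp", "--all-subs", "--skip-download", watch_url(video_id)]


def auto_captions_command(video_id: str) -> list[str]:
    """Command for auto-generated captions, fetched as ttml and converted to vtt."""
    return [
        "yt-dlp", "--write-auto-subs",
        "--sub-format", "ttml", "--convert-subs", "vtt",
        "--skip-download", watch_url(video_id),
    ]


def download_captions(video_id: str, auto_generated: bool = False) -> int:
    """Run yt-dlp (assumed to be installed and on PATH) and return its exit code."""
    build = auto_captions_command if auto_generated else manual_captions_command
    return subprocess.run(build(video_id), check=False).returncode


if __name__ == "__main__":
    # Only prints the commands; call download_captions() to actually run yt-dlp.
    print(manual_captions_command("o8NPllzkFhE"))
    print(auto_captions_command("o8NPllzkFhE"))
```

Building the command as a list and keeping it separate from its execution makes it easy to log or test the exact invocation without touching the network.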

As I answered to this Stack Overflow question, the YouTube Data API v3 doesn't provide a way to enumerate all videos (even for just one country). The idea for retrieving all video ids is therefore to start from an initial set of channels, list their videos using YouTube Data API v3 PlaylistItems: list, then list the comments on these videos, and then repeat the process, since the comment authors on videos of already known channels potentially reveal new channels.
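The discovery process described above amounts to a breadth-first traversal over channels. Here is a minimal sketch of it, not the project's actual implementation. The two callables are hypothetical wrappers: list_video_ids would call PlaylistItems: list on a channel's uploads, and list_comment_author_channel_ids would call a comment-listing endpoint. They are passed in as parameters so that the traversal logic can be tried without network access. The max_channels limit is also an assumption, added to keep a run bounded.

```python
from collections import deque
from typing import Callable, Iterable, Optional


def crawl_channels(
    seed_channel_ids: Iterable[str],
    list_video_ids: Callable[[str], Iterable[str]],
    list_comment_author_channel_ids: Callable[[str], Iterable[str]],
    max_channels: Optional[int] = None,
) -> tuple[set[str], set[str]]:
    """Return (known channel ids, discovered video ids)."""
    known_channels = set(seed_channel_ids)
    queue = deque(known_channels)  # channels whose videos are not listed yet
    video_ids: set[str] = set()
    processed = 0
    while queue and (max_channels is None or processed < max_channels):
        channel_id = queue.popleft()
        processed += 1
        for video_id in list_video_ids(channel_id):
            if video_id in video_ids:
                continue  # already seen, don't list its comments again
            video_ids.add(video_id)
            # Comment authors potentially reveal channels we don't know yet.
            for author_id in list_comment_author_channel_ids(video_id):
                if author_id not in known_channels:
                    known_channels.add(author_id)
                    queue.append(author_id)
    return known_channels, video_ids


if __name__ == "__main__":
    # Tiny in-memory stand-in for the API, for illustration only.
    videos = {"A": ["v1", "v2"], "B": ["v3"], "C": []}
    comments = {"v1": ["B"], "v2": ["A"], "v3": ["C"]}
    channels, found = crawl_channels(
        ["A"],
        lambda channel: videos.get(channel, []),
        lambda video: comments.get(video, []),
    )
    print(sorted(channels), sorted(found))
```

Keeping the set of known channels separate from the queue ensures each channel is processed at most once, even when the same author comments on many videos.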