YouTube_captions_search_engine

Benjamin_Loison/YouTube_captions_search_engine

Table of Contents

Dive into YouTube search results

- 7TXEZ4tP06c how many people here would say they can draw
- o8NPllzkFhE linux is in millions of computers
- f6nxcfbDfZo at tedx about to give a killer talk
- gJjLdnycuyU My kids have seen a lot of cartoons

Concerning 20,000 videos limit for YouTube Data API v3 PlaylistItems: list endpoint

As described on the project proposal page, the discovery process consists of going through comments, so we will also try to keep the comments. That way we could, potentially after the project, do interesting things such as listing all comments written by a given user, as my earlier project (French only, without a discovery process) was doing.

Dive into YouTube search results

At first glance, it seems that YouTube only returns videos whose auto-generated captions match the query.

- `7TXEZ4tP06c` `how many people here would say they can draw`

Consider 7TXEZ4tP06c: at 0:18, both the auto-generated captions and the not auto-generated captions read `how many people here would say they can draw`.

Passing this sentence to YouTube Data API v3 Search: list endpoint returns these videos:

- 7TXEZ4tP06c: the original video (0:18)
- qH-yY7UZW_k: reuploads part of the original video at 1:13 (1:16)
- cOwYXnpW-8A: reuploads part of the original video (0:05)
- vzH9Fo9GI9Y: reuploads part of the original video at 3:31 (3:39)
- gpMp6tz3d7w: reuploads part of the original video at 0:37 (0:41)
- ZI7XTsGTl34: reuploads part of the original video at 23:36 (23:43)

Note that, apart from the original itself, all of these videos are partial reuploads of the original video. They all have auto-generated captions that contain exactly `how many people here would say they can draw`.

- `o8NPllzkFhE` `linux is in millions of computers`

Completing the project proposal example: Vo9KPk-gqKk reuploads part of the original video and only has auto-generated captions, which contain `your software Linux is in millions of computers`. The original video, o8NPllzkFhE, has auto-generated captions, which contain `your software uh linux is in millions of computers`, and not auto-generated captions, which contain `Your software, Linux, is in millions of computers.`

The weird thing is that passing `linux is in millions of computers` to YouTube Data API v3 Search: list endpoint returns only these videos:

- Vo9KPk-gqKk: reuploads part of the original video (`your software Linux is in millions of computers`)
- krakddj30eU: reuploads part of the original video at 0:05 (`your software uh linux is in millions of computers`)
- NvPaFoIbbzg: reuploads part of the original video (`your software uh Linux is in millions of computers`)

So it returns similar videos, but not the original one we focused on, although it clearly should be returned.

Note that all of these videos are partial reuploads of the original video, all have auto-generated captions, and all contain, case-insensitively, the exact phrase `Linux is in millions of computers`.
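The case-insensitive containment check mentioned above can be sketched locally as follows. This is only an illustration: the helper names `normalize` and `contains_phrase` are hypothetical, the caption strings are the ones quoted in this section, and stripping punctuation is an extra assumption (the observations above only concern case).

```python
import re


def normalize(text):
    """Lowercase, drop punctuation (an assumption), and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def contains_phrase(caption_text, phrase):
    """Return True if `phrase` occurs as whole words in `caption_text`, ignoring case."""
    padded = f" {normalize(caption_text)} "
    return f" {normalize(phrase)} " in padded


# Caption excerpts quoted in this section.
captions = {
    "Vo9KPk-gqKk": "your software Linux is in millions of computers",
    "krakddj30eU": "your software uh linux is in millions of computers",
    "o8NPllzkFhE (not auto-generated)": "Your software, Linux, is in millions of computers.",
}
for video_id, text in captions.items():
    print(video_id, contains_phrase(text, "linux is in millions of computers"))
```

Padding both strings with spaces keeps the match on word boundaries, so a phrase is not found inside a longer word.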

- `f6nxcfbDfZo` `at tedx about to give a killer talk`

Passing this sentence to YouTube Data API v3 Search: list endpoint returns these videos:

- f6nxcfbDfZo: the original video

Note that this video only has auto-generated captions.

- `gJjLdnycuyU` `My kids have seen a lot of cartoons`

Following my project proposal, I received this remark:

It's not clear to me from the "proof" part whether the video "o8NPllzkFhE" is not returned because of an indexing problem or because it is considered to be a duplicate of the video "Vo9KPk-gqKk". Did you manage to identify a case where a video is not returned even though it is the only match to a query? (Indeed, if the goal of your project is just to work around the fact that some duplicate videos are removed from search results, then it limits a bit the appeal.)

Let's try to answer this question as rigorously as possible and show that YouTube search sometimes doesn't make sense.

Let's look at videos which have both automatically generated captions and not automatically generated captions and let's focus on English, so we will consider @TED videos, as they are quite an interesting dataset for this purpose.

Thanks to YouTube operational API Videos: list endpoint we learn that its channel id is UCAuUUnT6oDeKwE6v1NGQxug.

Then let's list their videos thanks to YouTube Data API v3 PlaylistItems: list endpoint from the oldest one to the newest one, that way we will work with old videos that have had enough time to be processed. As of January 24 2023, they have 4185 videos retrievable that way.

Then we will focus on videos that have at most 2 caption tracks, including one that is both in English and not automatically generated. The first one matching this criterion is gJjLdnycuyU, which is the 2970th oldest video. The hope in looking for the oldest videos matching this criterion is that they are simple in terms of captions (not having many caption tracks) and, as they are old videos whose captions aren't translated into many languages, that they don't have many views, which makes duplicates less likely.
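The selection criterion, as the script further down applies it (at most two caption tracks, one of them in English with `trackKind` equal to `standard`), can be isolated in a small predicate. This is a minimal sketch: the function name `has_manual_english_track` is hypothetical and the sample data is hand-written to mimic the `items` list read by the script, not an actual API response.

```python
def has_manual_english_track(items, max_tracks=2):
    """Check the criterion on the `items` list of a captions listing."""
    if len(items) > max_tracks:
        return False
    return any(
        item["snippet"]["language"] == "en"
        and item["snippet"]["trackKind"] == "standard"
        for item in items
    )


# Hand-written sample, not real data: one automatic and one manual English track.
sample = [
    {"snippet": {"language": "en", "trackKind": "asr"}},
    {"snippet": {"language": "en", "trackKind": "standard"}},
]
print(has_manual_english_track(sample))
```

Keeping the predicate separate from the network calls makes it easy to check against saved responses.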

Let's focus on the sentence at 0:08 of this video, that is `my kids have seen a lot of cartoons`. More precisely, according to the:

- not automatically generated captions: `My kids have seen a lot of cartoons`
- automatically generated captions: `my kids have seen a lot of cartoons`

Let's put the odds on our side by assuming that YouTube's exact search feature (wrapping the query in double quotes, as in "Your query") is case sensitive, so let's consider only the part common to both caption tracks: "kids have seen a lot of cartoons". If we provide it to YouTube Data API v3 Search: list endpoint, we get at least 50 results, among which gJjLdnycuyU doesn't appear. Let's give YouTube the benefit of the doubt and suppose that all these videos contain `kids have seen a lot of cartoons` and that our study video would appear on a later page. As we have other things to do than watching a random video of tens of minutes, let's extract the shortest video, thanks to YouTube operational API Videos: list endpoint with part=contentDetails, in order to verify that YouTube exact search feature works as expected. The shortest video is dC7tUcRCS58 and lasts 175 seconds. Neither its audio, its video, nor its automatically generated captions contain anything close to `kids have seen a lot of cartoons`.
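The two steps just described (keeping the part common to both caption tracks, then wrapping it in double quotes for an exact search) can be sketched as below. The helper names `common_phrase` and `exact_search_path` are hypothetical; the request path mirrors the `noKey/search` path used by the script further down, and the word-level comparison is one possible way to compute the common part.

```python
from difflib import SequenceMatcher
from urllib.parse import urlencode


def common_phrase(first, second):
    """Return the longest run of consecutive words shared by both strings (case sensitive)."""
    a, b = first.split(), second.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return " ".join(a[match.a:match.a + match.size])


def exact_search_path(phrase, max_results=50):
    """Build a search path with the phrase wrapped in double quotes."""
    params = {"part": "snippet", "q": f'"{phrase}"', "maxResults": max_results}
    return "noKey/search?" + urlencode(params)


manual = "My kids have seen a lot of cartoons"
automatic = "my kids have seen a lot of cartoons"
phrase = common_phrase(manual, automatic)
print(phrase)  # kids have seen a lot of cartoons
print(exact_search_path(phrase))
```

Comparing word lists rather than characters avoids ending up with a fragment such as `y kids have ...` when only the case of the first letter differs.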

So YouTube is just giving us loosely related videos about the words we typed, and is not performing the exact search we asked for.

By contrast, for the project proposal video The mind behind Linux | Linus Torvalds, the exact search with "your software Linux is in millions of computers" returns only one result, Vo9KPk-gqKk, which as discussed contains the exact sentence `your software Linux is in millions of computers`.

So trying to identify a case where a video is not returned even though it is the only match to a query reveals inconsistent behavior from YouTube exact search: it gives exactly what we asked for in our test with The mind behind Linux | Linus Torvalds, but not in our test with The creative power of misfits | WorkLife with Adam Grant (Audio only).

Note that the YouTube UI has the same too-many-results bug for The creative power of misfits | WorkLife with Adam Grant (Audio only).

From my experience with YouTube, which is starting to be significant, we can't rely on the YouTube search feature, as it gives weird results, as shown above. However, YouTube gives quite correct information about a given video id, so the best approach that I am aware of to return exactly correct and, as far as possible, exhaustive results consists in discovering the maximum number of videos through some crawling approach, as I sketch in the last paragraph of the project proposal.
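The project proposal is not reproduced here, so the following is only one possible reading of such a crawling approach: a breadth-first discovery that starts from known channels and follows whatever links a `discover` callback returns (for instance, channels of users who commented on a video). Both `crawl` and `discover` are hypothetical names, and the toy graph stands in for real API calls.

```python
from collections import deque


def crawl(seeds, discover, limit=1000):
    """Breadth-first discovery of ids, visiting at most `limit` of them."""
    seen, order, queue = set(), [], deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    while queue and len(order) < limit:
        current = queue.popleft()
        order.append(current)
        for neighbour in discover(current):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return order


# Toy graph (made-up ids): which channels are discovered from which channel.
toy_graph = {"A": ["B", "C"], "B": ["C", "D"], "C": [], "D": ["A"]}
print(crawl(["A"], lambda channel: toy_graph.get(channel, [])))
```

The `seen` set guarantees that each id is visited once even when channels reference each other, and `limit` bounds the number of visited ids.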

The code associated with this approach is here:

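import json
import subprocess

import requests

channelId = 'UCAuUUnT6oDeKwE6v1NGQxug'
# The uploads playlist id is the channel id with its `UC` prefix replaced by `UU`.
uploadsPlaylistId = 'UU' + channelId[2:]

def getJson(url):
    # Query yt.lemnoslife.com and decode the JSON response.
    url = f'https://yt.lemnoslife.com/{url}'
    content = requests.get(url).text
    data = json.loads(content)
    return data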

videoIds = []

# Page through the uploads playlist, 50 items at a time, until no `nextPageToken` is returned.
pageToken = ''
while True:
    data = getJson(f'noKey/playlistItems?part=snippet&playlistId={uploadsPlaylistId}&maxResults=50&pageToken={pageToken}')
    items = data['items']
    # Print the number of video ids collected so far, as a progress indicator.
    print(len(videoIds))
    for item in items:
        videoId = item['snippet']['resourceId']['videoId']
        videoIds += [videoId]
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(videoIds))
# 4185

# Reverse the list so that it goes from the oldest video to the newest one.
videoIds = videoIds[::-1]

def execute(command):
    subprocess.check_output(command, shell = True)

videoIds = videoIds[2968:]

##

# 2968 SMnKboI4fvY

# Look for videos with at most 2 caption tracks, one of which is in English
# and not automatically generated (`trackKind` equal to `standard`).
for videoIndex, videoId in enumerate(videoIds):
    print(videoIndex, videoId)
    data = getJson(f'noKey/captions?part=snippet&videoId={videoId}')
    items = data['items']
    if len(items) <= 2:
        for item in items:
            snippet = item['snippet']
            trackKind = snippet['trackKind']
            language = snippet['language']
            if language == 'en' and trackKind == 'standard':
                print('Found')
                #execute('notify-send "Found"')
                break

##

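# Find the shortest video among the search results.
# The query below is the exact search on the Linux sentence; for the shortest-video
# check described above, the quoted phrase would be "kids have seen a lot of cartoons".

url = 'noKey/search?part=snippet&q="your software Linux is in millions of computers"&maxResults=50'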
data = getJson(url)
items = data['items']
setVideoIds = []
shortestVideo = 10 ** 9
shortestVideoId = None
for item in items:
    videoId = item['id']['videoId']
    print(videoId)
    setVideoIds += [videoId]
    url = f'videos?part=contentDetails&id={videoId}'
    data = getJson(url)
    duration = data['items'][0]['contentDetails']['duration']
    if shortestVideo > duration and duration > 0:
        shortestVideo = duration
        shortestVideoId = videoId

print(shortestVideoId, shortestVideo)

Following my answer, my supervisor replied:

Thanks for the answer! Long story short, this does seems to answer my question: indeed, there are cases where a search for a string S does not prominently return any video containing S in the subtitles, but such videos do exist and are not returned.

Concerning 20,000 videos limit for YouTube Data API v3 PlaylistItems: list endpoint

One could try both of the following (`-i` was required to ignore errors such as age-restricted videos):

youtube-dl --dump-json "https://www.youtube.com/channel/UCf8w5m0YsRa8MHQ5bwSGmbw/videos" -i | jq -r '[.id]|@csv' | wc -l

yt-dlp --dump-json "https://www.youtube.com/channel/UCf8w5m0YsRa8MHQ5bwSGmbw/videos" -i | jq -r '[.id]|@csv' | wc -l
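The counting done by `jq` and `wc -l` in the commands above can also be sketched in Python, which makes it easy to count distinct ids rather than lines. This assumes, as the commands above do, that `--dump-json` prints one JSON object per line with an `id` field; the function name `count_video_ids` is hypothetical and the sample lines are hand-written, not real tool output.

```python
import json


def count_video_ids(json_lines):
    """Count distinct `id` values in output made of one JSON object per line."""
    ids = set()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not JSON, such as stray error messages.
            continue
        if isinstance(entry, dict) and "id" in entry:
            ids.add(entry["id"])
    return len(ids)


# Hand-written sample: two distinct ids, one repeated, plus a non-JSON line.
sample_lines = [
    '{"id": "7TXEZ4tP06c"}',
    '{"id": "o8NPllzkFhE"}',
    "ERROR: some message",
    '{"id": "7TXEZ4tP06c"}',
]
print(count_video_ids(sample_lines))
```

Comparing this count with the 20,000 limit would show whether a tool retrieves more videos than the PlaylistItems: list endpoint does.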

As mentioned in this commit, one could also give date filters or the YouTube operational API issue a try.
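One way to read "date filters" is to split the channel history into time windows, each small enough to stay under the limit, and to query every window separately, for example with the `publishedAfter` and `publishedBefore` parameters of the Search: list endpoint. That reading is an assumption, since the commit is not quoted here, and `date_windows` is a hypothetical helper that only generates the window bounds.

```python
from datetime import datetime, timedelta, timezone


def date_windows(start, end, days):
    """Yield (after, before) timestamp pairs in RFC 3339 format covering start to end."""
    step = timedelta(days=days)
    current = start
    while current < end:
        upper = min(current + step, end)
        yield (
            current.strftime("%Y-%m-%dT%H:%M:%SZ"),
            upper.strftime("%Y-%m-%dT%H:%M:%SZ"),
        )
        current = upper


start = datetime(2023, 1, 1, tzinfo=timezone.utc)
end = datetime(2023, 1, 24, tzinfo=timezone.utc)
for after, before in date_windows(start, end, days=10):
    print(after, before)
```

Results from adjacent windows would still need deduplicating by video id, since a video published exactly on a boundary could show up in both windows.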
