Table of Contents
- Dive into YouTube search results
  - 7TXEZ4tP06c: how many people here would say they can draw
  - o8NPllzkFhE: linux is in millions of computers
  - f6nxcfbDfZo: at tedx about to give a killer talk
  - gJjLdnycuyU: My kids have seen a lot of cartoons
- Concerning 20,000 videos limit for YouTube Data API v3 PlaylistItems: list endpoint
As described on the project proposal page, there is a discovery process consisting of going through comments, so we will also try to keep comments. That way we could, potentially after the project, end up doing interesting things such as listing all comments written by a given user, as my French-only project without the discovery process was doing.
Dive into YouTube search results
At first glance, it seems that YouTube only returns videos matching the auto-generated captions.
- 7TXEZ4tP06c
how many people here would say they can draw
Let's consider 7TXEZ4tP06c: at 0:18, both the auto-generated and the manually created captions read how many people here would say they can draw.
Passing this sentence to the YouTube Data API v3 Search: list endpoint returns these videos:
- 7TXEZ4tP06c: the original video (0:18)
- qH-yY7UZW_k: partial reupload of the original video, at 1:13 (1:16)
- cOwYXnpW-8A: partial reupload of the original video (0:05)
- vzH9Fo9GI9Y: partial reupload of the original video, at 3:31 (3:39)
- gpMp6tz3d7w: partial reupload of the original video, at 0:37 (0:41)
- ZI7XTsGTl34: partial reupload of the original video, at 23:36 (23:43)
Note that the other videos are all partial reuploads of the original one, that they all have auto-generated captions, and that they all exactly contain how many people here would say they can draw.
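Such an exact-phrase query can be reproduced with a small script. The following is a minimal sketch assuming the keyless yt.lemnoslife.com/noKey/ proxy used later in these notes; building the URL is pure, while actually fetching it requires network access, so the fetch is left commented out:

```python
import urllib.parse

API_BASE = 'https://yt.lemnoslife.com/noKey/'  # assumed keyless proxy

def exact_search_url(phrase, max_results=50):
    # Wrap the phrase in double quotes to request YouTube's "exact" search,
    # then URL-encode the whole query string.
    params = urllib.parse.urlencode({
        'part': 'snippet',
        'q': f'"{phrase}"',
        'maxResults': max_results,
        'type': 'video',
    })
    return f'{API_BASE}search?{params}'

# With network access, the returned video ids could then be listed with:
# import requests
# items = requests.get(exact_search_url('how many people here would say they can draw')).json()['items']
# for item in items:
#     print(item['id']['videoId'])
```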
- o8NPllzkFhE
linux is in millions of computers
Completing the project proposal example: Vo9KPk-gqKk is a partial reupload of the original video and has only auto-generated captions, which contain your software Linux is in millions of computers, while o8NPllzkFhE, the original video, has auto-generated captions containing your software uh linux is in millions of computers and manually created captions containing Your software, Linux, is in millions of computers.
The weird thing is that, when passing linux is in millions of computers to the YouTube Data API v3 Search: list endpoint, it returns only these videos:
- Vo9KPk-gqKk: partial reupload of the original video (your software Linux is in millions of computers)
- krakddj30eU: partial reupload of the original video, at 0:05 (your software uh linux is in millions of computers)
- NvPaFoIbbzg: partial reupload of the original video (your software uh Linux is in millions of computers)
So it returns similar videos but not the original one we focused on, even though that one should clearly be returned.
Note that all of these videos are partial reuploads of the original video, that they all have auto-generated captions, and that they all exactly contain, case-insensitively, linux is in millions of computers.
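To check programmatically that the original video is missing from the results, a minimal sketch can be used. The items structure below mimics the YouTube Data API v3 Search: list response format; the actual fetch requires network access and is left commented out:

```python
def contains_video(items, video_id):
    # `items` follows the Search: list response format, where each item
    # carries the video id under item['id']['videoId'].
    return any(item['id'].get('videoId') == video_id for item in items)

# With a real response (requires network access):
# import requests
# url = 'https://yt.lemnoslife.com/noKey/search?part=snippet&q=%22linux+is+in+millions+of+computers%22&maxResults=50'
# items = requests.get(url).json()['items']
# print(contains_video(items, 'o8NPllzkFhE'))
```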
- f6nxcfbDfZo
at tedx about to give a killer talk
Passing this sentence to the YouTube Data API v3 Search: list endpoint returns only this video:
- f6nxcfbDfZo: the original video
Note that this video only has auto-generated captions.
- gJjLdnycuyU
My kids have seen a lot of cartoons
Following my project proposal, I received the following remark:
It's not clear to me from the "proof" part whether the video "o8NPllzkFhE" is not returned because of an indexing problem or because it is considered to be a duplicate of the video "Vo9KPk-gqKk". Did you manage to identify a case where a video is not returned even though it is the only match to a query? (Indeed, if the goal of your project is just to work around the fact that some duplicate videos are removed from search results, then it limits a bit the appeal.)
Let's try to answer this question with the best approach, and show how YouTube search sometimes doesn't make sense.
Let's look at videos which have both automatically generated and manually created captions, and let's focus on English; so we will consider @TED videos, as they are quite an interesting dataset for this purpose.
Thanks to the YouTube operational API Videos: list endpoint, we learn that its channel id is UCAuUUnT6oDeKwE6v1NGQxug.
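The uploads playlist id used below can be derived from the channel id by replacing its leading UC with UU, a convention that can also be confirmed through the Data API Channels: list endpoint (contentDetails.relatedPlaylists.uploads). A minimal sketch:

```python
def uploads_playlist_id(channel_id):
    # A channel's uploads playlist id is its channel id with the
    # leading 'UC' replaced by 'UU'.
    assert channel_id.startswith('UC')
    return 'UU' + channel_id[2:]

print(uploads_playlist_id('UCAuUUnT6oDeKwE6v1NGQxug'))
# UUAuUUnT6oDeKwE6v1NGQxug
```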
Then let's list their videos thanks to the YouTube Data API v3 PlaylistItems: list endpoint, from the oldest one to the newest one, so that we work with old videos that have had enough time to be processed. As of January 24, 2023, they have 4185 videos retrievable that way.
Then we will focus on videos that have at most 2 caption tracks, including one that is both in English and manually created. The first one matching these criteria is gJjLdnycuyU, which is the 2970th oldest video. The hope in looking for the oldest videos matching these criteria is that they are simple in terms of captions (not having many caption tracks) and that, as they are old videos whose captions aren't translated into many languages, they don't have many views, which makes duplicates less likely.
Let's focus on the sentence at 0:08 of this video, that is my kids have seen a lot of cartoons. More precisely, according to the:
- not automatically generated captions:
My kids have seen a lot of cartoons
- automatically generated captions:
my kids have seen a lot of cartoons
Let's put the chances on our side by assuming that YouTube's exact search feature (using "Your query") is case-sensitive, so let's consider only the part "kids have seen a lot of cartoons" common to both caption tracks. If we provide it to the YouTube Data API v3 Search: list endpoint, we get at least 50 results in which gJjLdnycuyU doesn't appear. Let's say that all these videos contain kids have seen a lot of cartoons and that our study video would appear on a following page. As we have better things to do than watch a random video of tens of minutes, let's extract the shortest video, thanks to the YouTube operational API Videos: list endpoint with part=contentDetails, in order to verify that the YouTube exact search feature works as expected. The shortest video is dC7tUcRCS58 and lasts 175 seconds. Neither the audio, the video, nor the automatically generated captions contain anything close to kids have seen a lot of cartoons.
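Note that the YouTube operational API returns the duration directly in seconds, while the official YouTube Data API v3 returns an ISO 8601 string such as PT2M55S for the same video. Should one use the official API instead, a minimal converter for the common PT...H...M...S shape (day-long durations are not handled here) could look like:

```python
import re

def iso8601_duration_to_seconds(duration):
    # Parse durations like 'PT2M55S' or 'PT1H3S', as returned by the
    # YouTube Data API v3 Videos: list endpoint with part=contentDetails.
    match = re.fullmatch(r'PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', duration)
    hours, minutes, seconds = (int(group) if group else 0 for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds

print(iso8601_duration_to_seconds('PT2M55S'))
# 175
```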
So YouTube is just giving us random videos related to the words we typed, not the exact search we asked it to perform.
Meanwhile, concerning the project proposal video The mind behind Linux | Linus Torvalds, proceeding to an exact search with "your software Linux is in millions of computers" yields only one result, Vo9KPk-gqKk, which as discussed contains the exact sentence your software Linux is in millions of computers.
So trying to identify a case where a video is not returned even though it is the only match to a query reveals inconsistent behavior of YouTube exact search: it gives exactly what we asked for in our test with The mind behind Linux | Linus Torvalds, but not in the case of The creative power of misfits | WorkLife with Adam Grant (Audio only).
Note that the YouTube UI exhibits the same too-many-results bug for The creative power of misfits | WorkLife with Adam Grant (Audio only).
From my experience with YouTube, which is starting to be significant, we can't rely on the YouTube search feature, as it gives weird results, as shown. However, YouTube returns the information about a given video id quite reliably, so the best approach I am aware of to return exactly correct and as-exhaustive-as-possible results consists in discovering the maximum number of videos through some crawling approach, as sketched in the last paragraph of the project proposal.
The code associated with this approach is:
```python
import requests, json, subprocess

channelId = 'UCAuUUnT6oDeKwE6v1NGQxug'
# The uploads playlist id is the channel id with its 'UC' prefix replaced by 'UU'.
uploadsPlaylistId = 'UU' + channelId[2:]

def getJson(url):
    url = f'https://yt.lemnoslife.com/{url}'
    content = requests.get(url).text
    data = json.loads(content)
    return data

# List all video ids of the channel by paginating over PlaylistItems: list.
videoIds = []
pageToken = ''
while True:
    data = getJson(f'noKey/playlistItems?part=snippet&playlistId={uploadsPlaylistId}&maxResults=50&pageToken={pageToken}')
    items = data['items']
    print(len(videoIds))
    for item in items:
        videoId = item['snippet']['resourceId']['videoId']
        videoIds += [videoId]
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(videoIds))
# 4185

# Work from the oldest video to the newest one.
videoIds = videoIds[::-1]

def execute(command):
    subprocess.check_output(command, shell = True)

# Resume from where the previous run stopped.
videoIds = videoIds[2968:]

##
# 2968 SMnKboI4fvY
# Find the first video having at most 2 caption tracks, one of them being in
# English and manually created ('standard', as opposed to 'asr').
found = False
for videoIndex, videoId in enumerate(videoIds):
    print(videoIndex, videoId)
    data = getJson(f'noKey/captions?part=snippet&videoId={videoId}')
    items = data['items']
    if len(items) <= 2:
        for item in items:
            snippet = item['snippet']
            trackKind = snippet['trackKind']
            language = snippet['language']
            if language == 'en' and trackKind == 'standard':
                print('Found')
                #execute('notify-send "Found"')
                found = True
                break
    if found:
        break

##
# Find the shortest video among the exact search results:
url = 'noKey/search?part=snippet&q="your software Linux is in millions of computers"&maxResults=50'
data = getJson(url)
items = data['items']
videoIdsFound = []
shortestVideoDuration = 10 ** 9
shortestVideoId = None
for item in items:
    videoId = item['id']['videoId']
    print(videoId)
    videoIdsFound += [videoId]
    # The YouTube operational API returns `duration` directly in seconds.
    url = f'videos?part=contentDetails&id={videoId}'
    data = getJson(url)
    duration = data['items'][0]['contentDetails']['duration']
    if shortestVideoDuration > duration and duration > 0:
        shortestVideoDuration = duration
        shortestVideoId = videoId

print(shortestVideoId, shortestVideoDuration)
```
Following my answer, my supervisor replied:
Thanks for the answer! Long story short, this does seem to answer my question: indeed, there are cases where a search for a string S does not prominently return any video containing S in the subtitles, but such videos do exist and are not returned.
Concerning 20,000 videos limit for YouTube Data API v3 PlaylistItems: list endpoint
One could try both (-i was required for ignoring errors such as age-restricted videos):
youtube-dl --dump-json "https://www.youtube.com/channel/UCf8w5m0YsRa8MHQ5bwSGmbw/videos" -i | jq -r '[.id]|@csv' | wc -l
yt-dlp --dump-json "https://www.youtube.com/channel/UCf8w5m0YsRa8MHQ5bwSGmbw/videos" -i | jq -r '[.id]|@csv' | wc -l
As mentioned in this commit, one could give a try to date filters or to the YouTube operational API issue.
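Concerning date filters: the Search: list endpoint supports the publishedAfter and publishedBefore parameters, so one can split the channel's lifetime into windows, each hopefully containing fewer videos than the retrievable-results cap. A sketch of the windowing part only (the per-window requests are left out; window size is an assumption to tune per channel):

```python
from datetime import datetime, timedelta

def date_windows(start, end, days):
    # Split [start, end] into consecutive windows of `days` days, each usable
    # as a (publishedAfter, publishedBefore) pair for Search: list, expressed
    # as the RFC 3339 timestamps that the API expects.
    windows = []
    cursor = start
    while cursor < end:
        windowEnd = min(cursor + timedelta(days=days), end)
        windows.append((cursor.isoformat() + 'Z', windowEnd.isoformat() + 'Z'))
        cursor = windowEnd
    return windows

for publishedAfter, publishedBefore in date_windows(datetime(2020, 1, 1), datetime(2020, 3, 1), 30):
    print(publishedAfter, publishedBefore)
```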