Current algorithm doesn't check all channel tabs #11

Open
opened 2023-01-04 02:09:03 +01:00 by Benjamin_Loison · 8 comments

The YouTube Data API v3 CommentThreads: list endpoint doesn't support this feature, with neither the `allThreadsRelatedToChannelId` filter nor the `channelId` one.

My YouTube operational API makes it possible to treat them.
Author
Owner

Should check whether, in a similar manner, we can't extract channels, at least in some cases, from [other tabs](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/48#issuecomment-1270271158).

- [x] `LIVE` (if the channel is livestreaming, or for previous livestreams whose chat we can still access)
  - [x] [`@LofiGirl`](https://www.youtube.com/@LofiGirl/streams) current livestreams (less important, as it doesn't seem that we can go up in the chat: even with the ability to go back in time up to 12 hours thanks to [DVR](https://support.google.com/youtube/answer/9296823?hl=en), the chat still only shows current new messages; maybe doable by reverse-engineering). A question I want to raise: for a livestream that started a while ago, is a first chat message still visible after a while if there wasn't any other message? Even with the official method I mention in the following item, we can only retrieve current and future chat messages, not previous ones.
  - [x] [`@LeFatShow`](https://www.youtube.com/@LeFatShow/streams) previous livestreams

    Both are good examples to make sure my algorithm works fine. We can't just use the YouTube Data API v3 PlaylistItems: list endpoint to determine whether or not a given video is an ended livestream, even though such videos are listed by this endpoint. So we have to use the YouTube Data API v3 Videos: list endpoint with `part=liveStreamingDetails`. Note that no `activeLiveChatId` may be returned for this latter video for instance, while the first one has no problem following [the official method](https://stackoverflow.com/a/74894763). In the latter case I should rely on my YouTube operational API `liveChats` endpoint.

    ~~Note that current pagination algorithm doesn't work properly (in addition that the returned data aren't much parsed perfectly).~~ Note that even with reverse-engineering, by providing a negative `time`, we can't retrieve all the chat messages that were sent before the livestream started (we only get those returned when requesting `time = 0`). An example of a livestream having messages before its beginning is [`FuFjLL7gKv4`](https://www.youtube.com/watch?v=FuFjLL7gKv4).

    I used the following algorithm to find some interesting `LIVE` tabs:

    ```py
    import subprocess

    # channels.txt is expected to contain one channel id per line.
    with open('channels.txt') as f:
        lines = f.read().splitlines()

    for line in lines:
        print(line)
        url = f'https://www.youtube.com/channel/{line}/streams'
        # Open the LIVE tab of the channel in a new Firefox tab.
        # Passing a list of arguments avoids going through a shell.
        subprocess.check_output(['firefox', '--new-tab', '--url', url])
    ```
- [ ] `PLAYLISTS` (in case of playlists with videos of other channels; should also keep unlisted videos found that way, see [`PLe0Nm0KU0Zo2vuYY_DDLHK4iP7SQVOUSm`](https://www.youtube.com/playlist?list=PLe0Nm0KU0Zo2vuYY_DDLHK4iP7SQVOUSm) for such an example) - ~~as far as I know [retrieving `Saved playlists` isn't possible with YouTube APIs](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/113)~~ I solved this issue by adding the ability to retrieve `Saved playlists`, as YouTube official APIs don't allow us to do so. ~~Currently having an issue with [`@Goldenmoustache`](https://www.youtube.com/@Goldenmoustache/playlists), as there is a YouTube Originals playlist.~~ Should pay attention not to treat the same unlisted video twice. Should make sure that unlisted videos are treated with our livestream procedure if they are livestreams. We don't treat shows (cf #47), but it's unclear whether there can be comments on such videos, as I'm unable to see my comment on [`kZfSRj-cOJk`](https://www.youtube.com/watch?v=kZfSRj-cOJk&lc=UgyctQfeXjEeAOqbRQF4AaABAg) when not logged in. [YouTube Data API v3 seems to agree](https://yt.lemnoslife.com/noKey/commentThreads?part=snippet,replies&videoId=kZfSRj-cOJk) with my observation. In fact, concerning shows, it doesn't seem that we can extract any channel id that may indirectly have comments, as the series channel [`UCubjFsqje4qtdyQVZiQAUsg`](https://www.youtube.com/channel/UCubjFsqje4qtdyQVZiQAUsg) isn't available.
- [x] `CHANNELS`
  - ~~have to implement pagination for [`@LeFatShow`](https://www.youtube.com/@LeFatShow/channels) for instance~~
  - [`@cyprien`](https://www.youtube.com/@cyprien/channels) has a different layout (cf [YouTube-operational-API/issues/121](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/121))
  - from the [`@LeFatShow`](https://www.youtube.com/@LeFatShow/channels) `CHANNELS` tab, by proceeding recursively, I retrieved at least 20,538 channels (stopped it by mistake)
- [x] `COMMUNITY` thanks to post comments (have to add its support in YouTube operational API, as I confirm again that the YouTube Data API v3 CommentThreads: list endpoint doesn't support it with [`allThreadsRelatedToChannelId`](https://developers.google.com/youtube/v3/docs/commentThreads/list#allThreadsRelatedToChannelId) nor [`channelId`](https://developers.google.com/youtube/v3/docs/commentThreads/list#channelId))

Note that in theory we should call the YouTube operational API instance at the end of the process, after the YouTube official APIs; otherwise, if we do it at the beginning, we risk being detected as having unusual activity. However, we should adopt this method only for the long run, as otherwise we would lose quite some time when testing.
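
For the `LIVE` item above, the ended-livestream check with Videos: list and `part=liveStreamingDetails` could be sketched as follows. This is only a minimal sketch, not the code of the project (which is in C++): the helper names are hypothetical, and it assumes, as far as I understand the official documentation, that `liveStreamingDetails` is only present for livestreams, that `actualEndTime` is only present once the livestream has ended, and that `activeLiveChatId` is only present while a chat is active.

```python
from urllib.parse import urlencode

VIDEOS_ENDPOINT = 'https://www.googleapis.com/youtube/v3/videos'


def build_videos_url(video_id, api_key):
    """Build a Videos: list request URL asking only for liveStreamingDetails."""
    params = {'part': 'liveStreamingDetails', 'id': video_id, 'key': api_key}
    return f'{VIDEOS_ENDPOINT}?{urlencode(params)}'


def classify_video(item):
    """Classify one item of a Videos: list response (hypothetical helper)."""
    details = item.get('liveStreamingDetails')
    if details is None:
        # Assumption: regular videos don't have this part at all.
        return 'not a livestream'
    if 'actualEndTime' in details:
        return 'ended livestream'
    return 'current or upcoming livestream'


def has_official_chat(item):
    """Tell whether the official method is usable, i.e. an activeLiveChatId is returned."""
    return 'activeLiveChatId' in item.get('liveStreamingDetails', {})
```

When `has_official_chat` returns `False` for an ended livestream, the fallback described above is the YouTube operational API `liveChats` endpoint.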
Benjamin_Loison added the enhancement label 2023-01-06 18:52:52 +01:00
Benjamin_Loison added the medium priority label 2023-01-06 19:32:41 +01:00
Benjamin_Loison added the medium label 2023-01-06 19:34:49 +01:00
Author
Owner

Should add another boolean `USE_YT_LEMNOSLIFE_COM_YOUTUBE_OPERATIONAL_API_INSTANCE`, like [`USE_YT_LEMNOSLIFE_COM_NO_KEY_SERVICE`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/6ce29051c06624859c57f475f0f6cf6788da195d/main.cpp#L52), to work even if the user doesn't have a YouTube operational API instance running locally.

In fact, a `--youtube-operational-api-instance URL` option, with a `URL` like `https://yt.lemnoslife.com`, would be nice, and by default the `URL` used would be the one I just mentioned.
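
The behavior of such an option could look like the following. This is a Python sketch for illustration only, as the project itself is written in C++; `parse_arguments` and `DEFAULT_INSTANCE` are hypothetical names, only the option name and the default `URL` come from the comment above.

```python
import argparse

# Default instance mentioned above, used when the option isn't provided.
DEFAULT_INSTANCE = 'https://yt.lemnoslife.com'


def parse_arguments(argv=None):
    """Parse the command line (hypothetical helper, not the project's code)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--youtube-operational-api-instance',
        metavar='URL',
        default=DEFAULT_INSTANCE,
        help='YouTube operational API instance to use (default: %(default)s)',
    )
    return parser.parse_args(argv)


if __name__ == '__main__':
    arguments = parse_arguments()
    print(arguments.youtube_operational_api_instance)
```

A user running an instance locally would then pass its address with `--youtube-operational-api-instance`, while others wouldn't have to pass anything.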
Benjamin_Loison added epic and removed medium labels 2023-01-15 00:05:10 +01:00
Benjamin_Loison started working 2023-01-15 00:32:52 +01:00
Benjamin_Loison stopped working 2023-01-15 00:32:58 +01:00
6 seconds
Benjamin_Loison deleted spent time 2023-01-15 00:49:04 +01:00
- 6 seconds
Author
Owner

Note that we could find even more channels/videos by analyzing user messages etc. However, as we don't have any guarantee that a message contains a channel or a video, we don't treat them. So we limit ourselves to the parts of YouTube webpages where we are sure that if there is something, then it's a YouTube channel/video...
Author
Owner

Note that we could also give a try to the YouTube Data API v3 endpoints Search: list, notably with the `relatedToVideoId` filter, and Videos: list with `part=suggestions` using OAuth (note that it is maybe to some extent equivalent to the Search: list possibility).
Benjamin_Loison added the youtube-operational-api label 2023-01-21 21:55:58 +01:00
Benjamin_Loison added this to the (deleted) milestone 2023-01-22 01:42:53 +01:00
Author
Owner

Treated [`@LeParisien`](https://www.youtube.com/@LeParisien/streams) livestreams:

- in 1 hour and 15 minutes
- to discover 29,650 channels
- to treat about 261,000 comments (with 134,338 distinct)
- to treat 164 videos
- with about 3,300 requests
- resulting in 572 MB of uncompressed data and 50 MB of compressed data

To filter video ids out of the logs, I used:

```py
#!/usr/bin/python3

# Print the log lines whose part after the last '1: ' is 11 characters long,
# which is the length of a YouTube video id.
with open('logs.txt') as f:
    lines = f.read().splitlines()

for line in lines:
    if len(line.split('1: ')[-1]) == 11:
        print(line)
```
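
A slightly stricter variant of this filter could also check the characters, not only the length. This is a sketch under the assumption that video ids only contain letters, digits, `-` and `_`; the function names and the example lines are hypothetical, as the exact log format isn't shown here.

```python
import re

# Assumption: a video id is 11 characters among letters, digits, '-' and '_'.
VIDEO_ID_REGEX = re.compile(r'[A-Za-z0-9_-]{11}')


def is_video_id_line(line):
    """Tell whether the part after the last '1: ' looks like a video id."""
    last_field = line.split('1: ')[-1]
    return VIDEO_ID_REGEX.fullmatch(last_field) is not None


def filter_video_id_lines(lines):
    """Keep only the lines ending with something that looks like a video id."""
    return [line for line in lines if is_video_id_line(line)]


if __name__ == '__main__':
    # 'hello world' is 11 characters long but contains a space, so it is rejected.
    for line in filter_video_id_lines(['1: FuFjLL7gKv4', '1: hello world']):
        print(line)
```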
132 KiB
Benjamin_Loison changed title from Note that current algorithm doesn't check comments on the `COMMUNITY` tab to Note that current algorithm doesn't check all channel tabs 2023-01-24 23:58:43 +01:00
Author
Owner

[Just writing the id of a channel having unlisted videos](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/afd9e1b0b682af3c9d3eb5af77d0165669cd04ac/main.cpp#L440) doesn't make much sense in my opinion.
Benjamin_Loison changed title from Note that current algorithm doesn't check all channel tabs to Current algorithm doesn't check all channel tabs 2023-02-22 04:10:27 +01:00
Author
Owner

If you are looking for unlisted videos, you can use:

```sh
grep 'Found non public' nohup.out | sed 's/.*(//' | sed 's/)//' | sort | uniq
```

If you don't mind listing an unlisted video as many times as it appears in playlists, you can use:

```sh
grep 'Found non public' nohup.out | sed 's/.*(//' | sed 's/).*//' | sort | uniq
```
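
For those who prefer Python, the second pipeline above could be sketched as follows. It mirrors the `grep` and `sed` steps literally: keep lines containing `Found non public`, keep what follows the last `(`, cut at the first `)`, then sort and deduplicate. The function names are hypothetical and the log format is an assumption, as it isn't shown here.

```python
def extract_parenthesized(line):
    """Mimic sed 's/.*(//' followed by sed 's/).*//' on one line."""
    after_last_open = line.rsplit('(', 1)[-1]
    return after_last_open.split(')', 1)[0]


def unlisted_video_ids(lines):
    """Mimic grep 'Found non public' | sed ... | sort | uniq."""
    return sorted({
        extract_parenthesized(line)
        for line in lines
        if 'Found non public' in line
    })


if __name__ == '__main__':
    with open('nohup.out') as f:
        for video_id in unlisted_video_ids(f.read().splitlines()):
            print(video_id)
```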
Author
Owner

Could also get more channels thanks to the `HOME` tab, as [it may introduce some](https://www.youtube.com/@Squeezie); however, maybe they overlap with [the `CHANNELS` ones](https://www.youtube.com/@Squeezie/channels).
Reference: Benjamin_Loison/YouTube_captions_search_engine#11