Update README.md to remove the question of whether both methods return the same comments, as they do

More precisely, I used the following algorithm with these three channels:
channel id               | 1st method            | 2nd method
-------------------------|-----------------------|-----------
UCt5USYpzzMCYhkirVQGHwKQ | 16                    | 16
UCUo1RqYV8tGjV38sQ8S5p9A | 58,165                | 58,165
UCWIdqSQekeGmUWlSFeCiEnA | *error* (as expected) | 27

```py
"""
Algorithm comparing comments count using:
1. CommentThreads: list with allThreadsRelatedToChannelId filter
2. PlaylistItems: list and CommentThreads: list
Note that the second approach isn't *atomic*, so counts will differ if some comments are posted while retrieving data.
"""

import requests, json

CHANNEL_ID = 'UC...'
API_KEY = 'AIzaSy...'

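def getJSON(url, firstTry = True):
    # Only build the full URL on the first attempt, as retries already receive it complete.
    if firstTry:
        url = 'https://www.googleapis.com/youtube/v3/' + url + f'&key={API_KEY}'
    try:
        content = requests.get(url).text
    except requests.exceptions.RequestException:
        # Retry on network errors only, so that other exceptions (e.g. KeyboardInterrupt) aren't swallowed.
        print('retry')
        return getJSON(url, False)
    data = json.loads(content)
    return data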

items = []
pageToken = ''
while True:
    # I verified that using `allThreadsRelatedToChannelId` doesn't return comments of the `COMMUNITY` tab
    data = getJSON(f'commentThreads?part=id,snippet,replies&allThreadsRelatedToChannelId={CHANNEL_ID}&maxResults=100&pageToken={pageToken}')
    items += data['items']
    # Once we have the top-level comments, both methods are equivalent as long as the replies *count* is correct, since both use the same Comments: list endpoint
    """for item in data['items']:
        if 'replies' in item:
            if len(item['replies']['comments']) >= 5:
                print('should consider replies too!')"""
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))

PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

videoIds = []
pageToken = ''
while True:
    data = getJSON(f'playlistItems?part=snippet&playlistId={PLAYLIST_ID}&maxResults=50&pageToken={pageToken}')
    for item in data['items']:
        videoIds += [item['snippet']['resourceId']['videoId']]
    print(len(videoIds))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(videoIds))
items = []

for videoIndex, videoId in enumerate(videoIds):
    pageToken = ''
    while True:
        data = getJSON(f'commentThreads?part=id,snippet,replies&videoId={videoId}&maxResults=100&pageToken={pageToken}')
        if 'items' in data:
            items += data['items']
            # Repeat the replies check, as it could apply here and not there
            """for item in data['items']:
                if 'replies' in item:
                    if len(item['replies']['comments']) >= 5:
                        print('should consider replies too!')"""
        print(videoIndex, len(videoIds), len(items))
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break

print(len(items))
```
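Equal counts don't strictly prove that both methods return the same comment threads. A minimal sketch of a stricter check, assuming the `items` list of each method is kept under a separate name (the function and its parameters are hypothetical and not part of the script above), compares the thread ids directly:

```python
def diff_thread_ids(first_items, second_items):
    """Return the thread ids found only by the first method and only by the second.

    Both arguments are assumed to be lists of CommentThreads: list items,
    each being a dict with an 'id' key, as collected in the script above.
    """
    first_ids = {item['id'] for item in first_items}
    second_ids = {item['id'] for item in second_items}
    return first_ids - second_ids, second_ids - first_ids
```

Two empty sets would mean both methods returned exactly the same threads. As the second approach isn't atomic, a few differing ids may still appear if comments are posted during retrieval.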
Benjamin Loison 2022-12-22 03:18:25 +01:00
parent d776c09fec
commit 53acda6abe
Signed by: Benjamin_Loison
SSH Key Fingerprint: SHA256:BtnEgYTlHdOg1u+RmYcDE0mnfz1rhv5dSbQ2gyxW8B8

@@ -2,7 +2,7 @@ As explained in the project proposal, the idea to retrieve all video ids is to s
For a given channel, there are two ways to list comments users published on it:
1. As explained, YouTube Data API v3 PlaylistItems: list endpoint enables us to list the channel videos up to 20,000 videos (so we will not treat and write down channels in this case) and CommentThreads: list and Comments: list endpoints enable us to retrieve their comments
-2. A simpler approach consists in using YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId`. The main upside of this method, in addition to be simpler, is that for channels with many videos we spare much time by working 100 comments at a time instead of a video at a time with possibly not a single comment. Note that this approach doesn't list all videos etc so we don't retrieve some information. **As I haven't gone this way previously (or I forgot) making sure that for a given video we retrieve all its comments would make sense.** Note that this approach doesn't work for some channels that have comments enabled on some videos but not the whole channels.**
+2. A simpler approach consists in using YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId`. The main upside of this method, in addition to be simpler, is that for channels with many videos we spare much time by working 100 comments at a time instead of a video at a time with possibly not a single comment. Note that this approach doesn't list all videos etc so we don't retrieve some information. Note that this approach doesn't work for some channels that have comments enabled on some videos but not the whole channels.
So when possible we will proceed with 2. and use 1. as a fallback approach.
We can multi-thread this process by channel or we can multi-thread per videos of a given channel.
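
The fallback and the per-channel multi-threading described above could be sketched as follows. This is only an illustration under assumptions: `list_by_channel` and `list_by_videos` are hypothetical callables standing for approaches 2. and 1. respectively, and `ChannelListingError` is a hypothetical exception that approach 2. is assumed to raise when `allThreadsRelatedToChannelId` fails for a channel.

```python
from concurrent.futures import ThreadPoolExecutor


class ChannelListingError(Exception):
    """Hypothetical error raised when approach 2. fails for a channel."""


def fetch_channel_comments(channel_id, list_by_channel, list_by_videos):
    # Try approach 2. first and use approach 1. as a fallback.
    try:
        return list_by_channel(channel_id)
    except ChannelListingError:
        return list_by_videos(channel_id)


def fetch_all(channel_ids, list_by_channel, list_by_videos, max_workers=4):
    # Multi-thread by channel: each worker thread treats one channel at a time.
    channel_ids = list(channel_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda channel_id: fetch_channel_comments(channel_id, list_by_channel, list_by_videos),
            channel_ids,
        )
        # `map` yields results in the same order as `channel_ids`.
        return dict(zip(channel_ids, results))
```

Multi-threading per video of a given channel would follow the same pattern, mapping over video ids inside `list_by_videos` instead of over channel ids.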