Commit Graph

105 Commits

Author SHA1 Message Date
53acda6abe
Update README.md to remove the question about whether both methods return the same comments, since they do
More precisely, I used the following algorithm with these three channels:
channel id               | 1st method            | 2nd method
-------------------------|-----------------------|-----------
UCt5USYpzzMCYhkirVQGHwKQ | 16                    | 16
UCUo1RqYV8tGjV38sQ8S5p9A | 58,165                | 58,165
UCWIdqSQekeGmUWlSFeCiEnA | *error* (as expected) | 27

```py
"""
Algorithm comparing comments count using:
1. CommentThreads: list with allThreadsRelatedToChannelId filter
2. PlaylistItems: list and CommentThreads: list
Note that the second approach isn't *atomic*, so counts will differ if some comments are posted while retrieving data.
"""

import requests, json

CHANNEL_ID = 'UC...'
API_KEY = 'AIzaSy...'

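def getJSON(url, firstTry = True):
    # Prefix the endpoint and append the API key only once, so retries reuse the full URL.
    if firstTry:
        url = 'https://www.googleapis.com/youtube/v3/' + url + f'&key={API_KEY}'
    try:
        content = requests.get(url).text
    except requests.exceptions.RequestException:
        # Retry on network errors only (a bare `except` would also swallow KeyboardInterrupt).
        print('retry')
        return getJSON(url, False)
    data = json.loads(content)
    return data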

items = []
pageToken = ''
while True:
    # I verified that `allThreadsRelatedToChannelId` doesn't return comments from the `COMMUNITY` tab.
    data = getJSON(f'commentThreads?part=id,snippet,replies&allThreadsRelatedToChannelId={CHANNEL_ID}&maxResults=100&pageToken={pageToken}')
    items += data['items']
    # Once we have the top-level comments, both methods are equivalent as long as the replies *count* is correct, since both then rely on the same Comments: list endpoint.
    """for item in data['items']:
        if 'replies' in item:
            if len(item['replies']['comments']) >= 5:
                print('should consider replies too!')"""
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))

PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

videoIds = []
pageToken = ''
while True:
    data = getJSON(f'playlistItems?part=snippet&playlistId={PLAYLIST_ID}&maxResults=50&pageToken={pageToken}')
    for item in data['items']:
        videoIds += [item['snippet']['resourceId']['videoId']]
    print(len(videoIds))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(videoIds))
items = []

for videoIndex, videoId in enumerate(videoIds):
    pageToken = ''
    while True:
        data = getJSON(f'commentThreads?part=id,snippet,replies&videoId={videoId}&maxResults=100&pageToken={pageToken}')
        if 'items' in data:
            items += data['items']
            # Repeat the replies check, as it could trigger here and not above.
            """for item in data['items']:
                if 'replies' in item:
                    if len(item['replies']['comments']) >= 5:
                        print('should consider replies too!')"""
        print(videoIndex, len(videoIds), len(items))
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break

print(len(items))
```
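Comparing counts alone cannot tell whether the two methods return the same threads. A minimal sketch of an id-based comparison, assuming `first` and `second` are the two `items` lists collected by each method (the script above reuses a single `items` variable, so the first list would have to be saved under another name); `diff_thread_ids` is a hypothetical helper, not part of the original script:

```py
def diff_thread_ids(first, second):
    """Return the thread ids present in only one of the two item lists."""
    first_ids = {item['id'] for item in first}
    second_ids = {item['id'] for item in second}
    return first_ids - second_ids, second_ids - first_ids

# Hypothetical sample data shaped like CommentThreads: list items.
first = [{'id': 'a'}, {'id': 'b'}]
second = [{'id': 'b'}, {'id': 'c'}]
only_first, only_second = diff_thread_ids(first, second)
print(only_first, only_second)
```

Two empty sets would mean both methods returned exactly the same threads.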
2022-12-22 03:18:25 +01:00
d776c09fec
Update README.md to clean up notes concerning optimized approaches 2022-12-22 02:02:48 +01:00
daf14d4b5b
Update README.md to make clear that different strategies should be used to optimize the process
Note that, as far as I know (and as StackOverflow ([1.](https://stackoverflow.com/q/63387215) and [2.](https://stackoverflow.com/q/67652250)) seems to confirm), there is no workaround for the 20,000 limit of PlaylistItems: list. This issue can be checked with:

```py
import requests, json

PLAYLIST_ID = 'UUf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=id&playlistId={PLAYLIST_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

Returns >= 19,000.
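
The `nextPageToken` loop above recurs for every list endpoint. A sketch of how it could be factored into a generator, assuming a caller-supplied `fetch_page(page_token)` function (a hypothetical name) that returns the decoded JSON of one page; a fake fetcher stands in for the API here so the example runs offline:

```py
def paginate(fetch_page):
    """Yield every item across pages, following `nextPageToken` until absent."""
    page_token = ''
    while True:
        data = fetch_page(page_token)
        # Error responses have no `items` key, so default to an empty list.
        yield from data.get('items', [])
        if 'nextPageToken' not in data:
            break
        page_token = data['nextPageToken']

# Fake two-page response standing in for the API (hypothetical data).
PAGES = {
    '': {'items': [1, 2], 'nextPageToken': 'p2'},
    'p2': {'items': [3]},
}
items = list(paginate(lambda token: PAGES[token]))
print(len(items))
```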

Note that this algorithm says that:
- [france24](https://www.youtube.com/@FRANCE24) has 6,086 videos while [SocialBlade states that it has 101,196 videos](https://socialblade.com/youtube/user/france24)
- [CNN](https://www.youtube.com/@CNN) has 19,289 while [SocialBlade states that it has 157,321 videos](https://socialblade.com/youtube/user/cnn)

Indeed, neither YouTube Data API v3 Search: list (I verified with the code below that https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4 applies here) nor web-scraping the `VIDEOS` tab works (see the second SO link).

```py
import requests, json

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/search?part=id&type=video&channelId={CHANNEL_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

Got ~18,734.

Another attempt using Search: list with a date filter may make sense.
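
A rough sketch of what such a date-filtered attempt could look like, using the `publishedAfter` and `publishedBefore` parameters of Search: list. This is untested against the API: the window size is arbitrary, whether each window stays under the result limit is an assumption, and `date_windows` and `search_query` are hypothetical helper names. The sketch only builds the queries; each one would still need the key and the pagination loop used above.

```py
from datetime import datetime, timedelta, timezone

def date_windows(start, end, step):
    """Yield consecutive (after, before) pairs covering [start, end)."""
    current = start
    while current < end:
        upper = min(current + step, end)
        yield current, upper
        current = upper

def search_query(channel_id, after, before):
    """Build the Search: list query (without key and pageToken) for one window."""
    fmt = '%Y-%m-%dT%H:%M:%SZ'  # RFC 3339 timestamps in UTC
    return (f'search?part=id&type=video&channelId={channel_id}&maxResults=50'
            f'&publishedAfter={after.strftime(fmt)}&publishedBefore={before.strftime(fmt)}')

start = datetime(2022, 1, 1, tzinfo=timezone.utc)
end = datetime(2022, 1, 31, tzinfo=timezone.utc)
queries = [search_query('UCf8w5m0YsRa8MHQ5bwSGmbw', after, before)
           for after, before in date_windows(start, end, timedelta(days=10))]
print(len(queries))
```

Since the same video could in principle show up in two adjacent windows, collected video ids would be best stored in a set.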

Note that according to SocialBlade:
- [asianetnews has 195,600 videos](https://socialblade.com/youtube/user/asianetnews)
- [RoelVandePaar has 2,2025,566 videos](https://socialblade.com/youtube/c/roelvandepaar)
2022-12-22 01:54:57 +01:00
3aa9947f8e
Update README.md to remove the possibility of using the YouTube Data API v3 CommentThreads: list endpoint with the allThreadsRelatedToChannelId filter
As we want to retrieve as many comments as possible, we have to proceed video by video: [`3F8dFt8LsXY`](https://www.youtube.com/watch?v=3F8dFt8LsXY) for instance has comments, but the YouTube Data API v3 CommentThreads: list endpoint with the `allThreadsRelatedToChannelId` filter returns for `UCWIdqSQekeGmUWlSFeCiEnA`:
```json
{
  "error": {
    "code": 403,
    "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
    "errors": [
      {
        "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
        "domain": "youtube.commentThread",
        "reason": "commentsDisabled",
        "location": "videoId",
        "locationType": "parameter"
      }
    ]
  }
}
```
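When proceeding video by video, such an error can be detected per request and the affected video skipped. A minimal sketch, assuming the response has already been decoded from JSON; `is_comments_disabled` is a hypothetical helper name:

```py
def is_comments_disabled(data):
    """Tell whether an API response is a `commentsDisabled` error."""
    errors = data.get('error', {}).get('errors', [])
    return any(error.get('reason') == 'commentsDisabled' for error in errors)

# Trimmed copy of the error response above.
response = {
    'error': {
        'code': 403,
        'errors': [{'domain': 'youtube.commentThread', 'reason': 'commentsDisabled'}],
    }
}
print(is_comments_disabled(response), is_comments_disabled({'items': []}))
```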
2022-12-21 23:49:27 +01:00
c828b118d3
Add README.md with first sketching questions 2022-12-21 23:46:14 +01:00