Also modified the compression command, as I got `sh: 1: zip: Argument list too long` when compressing the 248,868 JSON files of the most-subscribed French channel.
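For reference, the error comes from passing every file name on a single `zip` command line, which exceeds the kernel's argument-size limit (ARG_MAX). A minimal sketch of one workaround that sidesteps the shell entirely by using Python's `zipfile` module (the `channel` directory and archive name are hypothetical):

```py
import os
import zipfile

# Stream the files into the archive one by one instead of passing
# hundreds of thousands of names on a single command line.
with zipfile.ZipFile('channel.zip', 'w', zipfile.ZIP_DEFLATED) as archive:
    for name in os.listdir('channel'):  # hypothetical directory of JSON files
        if name.endswith('.json'):
            archive.write(os.path.join('channel', name), arcname=name)
```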
I also got the following crash:
```
terminate called after throwing an instance of 'nlohmann::detail::parse_error'
terminate called recursively
what(): [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
terminate called recursively
```
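The `unexpected end of input` points at an empty or truncated response being fed straight to the JSON parser. A minimal sketch of the guard, in Python for illustration (the C++ side could similarly use `nlohmann::json::accept` or catch `nlohmann::detail::parse_error`):

```py
import json

def parseIfComplete(content):
    # Returns None on an empty or truncated payload instead of aborting,
    # mirroring the "unexpected end of input" failure above.
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return None

print(parseIfComplete(''))          # None
print(parseIfComplete('{"a": 1}'))  # {'a': 1}
```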
Tested with `UCWIdqSQekeGmUWlSFeCiEnA`, which correctly treated the 36 comments of `3F8dFt8LsXY`, its only video with comments enabled.
Note that this commit doesn't support comments-disabled channels with more than 20,000 videos, as the per-video fallback enumerates the uploads playlist and PlaylistItems: list stops after about 20,000 items.
More precisely, I used the following algorithm with these three channels:
channel id | 1st method | 2nd method
-------------------------|-----------------------|-----------
UCt5USYpzzMCYhkirVQGHwKQ | 16 | 16
UCUo1RqYV8tGjV38sQ8S5p9A | 58,165 | 58,165
UCWIdqSQekeGmUWlSFeCiEnA | *error* (as expected) | 27
```py
"""
Algorithm comparing a channel's comment count obtained via:
1. CommentThreads: list with the allThreadsRelatedToChannelId filter
2. PlaylistItems: list, then CommentThreads: list per video
Note that the second approach isn't *atomic*, so counts will differ if comments are posted while the data is being retrieved.
"""
import json

import requests

CHANNEL_ID = 'UC...'
API_KEY = 'AIzaSy...'

def getJSON(url, firstTry=True):
    if firstTry:
        url = 'https://www.googleapis.com/youtube/v3/' + url + f'&key={API_KEY}'
    try:
        content = requests.get(url).text
        # Parsing inside the `try`, so an empty or truncated response also triggers a retry.
        data = json.loads(content)
    except Exception:
        print('retry')
        return getJSON(url, False)
    return data

# First method: list every comment thread of the channel directly.
items = []
pageToken = ''
while True:
    # Verified: `allThreadsRelatedToChannelId` doesn't return comments from the `COMMUNITY` tab.
    data = getJSON(f'commentThreads?part=id,snippet,replies&allThreadsRelatedToChannelId={CHANNEL_ID}&maxResults=100&pageToken={pageToken}')
    # On a comments-disabled channel the API returns an error without `items`, hence the *error* entry in the table above.
    items += data['items']
    # Once we have the top-level comment, both methods fetch replies through the same Comments: list endpoint, so if the replies *count* is correct we are fine.
    """for item in data['items']:
        if 'replies' in item:
            if len(item['replies']['comments']) >= 5:
                print('should consider replies too!')"""
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break
print(len(items))

# Second method: enumerate the uploads playlist, then list the comment threads of each video.
PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]
videoIds = []
pageToken = ''
while True:
    data = getJSON(f'playlistItems?part=snippet&playlistId={PLAYLIST_ID}&maxResults=50&pageToken={pageToken}')
    for item in data['items']:
        videoIds += [item['snippet']['resourceId']['videoId']]
    print(len(videoIds))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break
print(len(videoIds))

items = []
for videoIndex, videoId in enumerate(videoIds):
    pageToken = ''
    while True:
        data = getJSON(f'commentThreads?part=id,snippet,replies&videoId={videoId}&maxResults=100&pageToken={pageToken}')
        # Comments-disabled videos return an error instead of `items`.
        if 'items' in data:
            items += data['items']
        # Same replies check as above, as a discrepancy could appear here and not there.
        """for item in data['items']:
            if 'replies' in item:
                if len(item['replies']['comments']) >= 5:
                    print('should consider replies too!')"""
        print(videoIndex, len(videoIds), len(items))
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break
print(len(items))
```
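Since the script only compares totals, equal counts could in principle hide offsetting differences. A minimal sketch going one step further by comparing thread ids instead (the two parameters are hypothetical lists holding the results of each pass):

```py
def compareThreadIds(itemsMethod1, itemsMethod2):
    # Compare the sets of comment thread ids rather than raw counts, so
    # comments posted while the script runs don't mask real gaps.
    ids1 = {item['id'] for item in itemsMethod1}
    ids2 = {item['id'] for item in itemsMethod2}
    print('only with allThreadsRelatedToChannelId:', ids1 - ids2)
    print('only with the per-video enumeration:', ids2 - ids1)
```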