2022-12-21 23:46:14 +01:00
As explained in the project proposal, the idea to retrieve all video ids is to start from an initial set of channels, list their videos using the YouTube Data API v3 PlaylistItems: list endpoint, then list the comments on those videos, and then restart the process, as comment authors on videos from already known channels potentially reveal new channels.
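One practical detail this crawl relies on (compare `CHANNEL_ID` starting with `UC` and `PLAYLIST_ID` starting with `UU` in the snippets below): a channel's uploads playlist id, the one to pass to PlaylistItems: list, is its channel id with the leading `UC` replaced by `UU`. A minimal helper:

```py
def uploads_playlist_id(channel_id):
    # The uploads playlist of a channel (listable with PlaylistItems: list)
    # has the same id as the channel, with the 'UC' prefix replaced by 'UU'.
    assert channel_id.startswith('UC')
    return 'UU' + channel_id[2:]
```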
2022-12-21 23:49:27 +01:00
For a given channel, there is a single way to list the comments users published on it:
Update `README.md` to make clear that different strategies are used to optimize the process.
Note that, as far as I know (and Stack Overflow seems to agree: [1.](https://stackoverflow.com/q/63387215) and [2.](https://stackoverflow.com/q/67652250)), there is no workaround to the 20,000-item limit of PlaylistItems: list. This issue can be checked with:
```py
import requests, json

PLAYLIST_ID = 'UUf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=id&playlistId={PLAYLIST_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break
print(len(items))
```
Returns >= 19,000.
Note that this algorithm says that:
- [france24](https://www.youtube.com/@FRANCE24) has 6,086 videos while [SocialBlade states that it has 101,196 videos](https://socialblade.com/youtube/user/france24)
- [CNN](https://www.youtube.com/@CNN) has 19,289 videos while [SocialBlade states that it has 157,321 videos](https://socialblade.com/youtube/user/cnn)
Indeed, neither YouTube Data API v3 Search: list (I verified with the code below that https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4 applies here) nor web-scraping the `VIDEOS` tab works (see the second Stack Overflow link).
```py
import requests, json

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/search?part=id&type=video&channelId={CHANNEL_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break
print(len(items))
```
Got ~18,734.
Another attempt working with Search: list with a date filter may make sense.
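A sketch of that date-filter idea, assuming the `publishedAfter`/`publishedBefore` parameters of Search: list (which expect RFC 3339 timestamps): split the channel's lifetime into windows small enough that each window stays below the pagination cap, then paginate each window separately. The helper names below are mine, not from the project:

```py
from datetime import datetime, timedelta

def to_rfc3339(dt):
    # Search: list expects RFC 3339 timestamps for publishedAfter/publishedBefore.
    return dt.strftime('%Y-%m-%dT%H:%M:%SZ')

def date_windows(start, end, days=30):
    # Split [start, end) into consecutive windows; querying Search: list once
    # per window may collect more ids than a single paginated query can return.
    windows = []
    cursor = start
    while cursor < end:
        nxt = min(cursor + timedelta(days=days), end)
        windows.append((to_rfc3339(cursor), to_rfc3339(nxt)))
        cursor = nxt
    return windows
```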
Note that according to SocialBlade:
- [asianetnews has 195,600 videos](https://socialblade.com/youtube/user/asianetnews)
- [RoelVandePaar has 2,2025,566 videos](https://socialblade.com/youtube/c/roelvandepaar)
2022-12-22 01:54:57 +01:00
As explained, the YouTube Data API v3 PlaylistItems: list endpoint enables us to list a channel's videos (up to 20,000 of them), and the CommentThreads: list and Comments: list endpoints enable us to retrieve their comments.
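The pagination loop is the same for all three endpoints, so it can be factored out. Below is a sketch where `fetch_page` is a hypothetical callback returning one decoded JSON page, shown wired to CommentThreads: list (whose `maxResults` goes up to 100, unlike the 50 of PlaylistItems: list):

```py
def paginate(fetch_page):
    # Generic pagination loop shared by PlaylistItems: list,
    # CommentThreads: list and Comments: list: `fetch_page` takes a
    # pageToken and returns the decoded JSON response.
    items = []
    pageToken = ''
    while True:
        data = fetch_page(pageToken)
        items += data.get('items', [])
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break
    return items

def fetch_video_comment_threads(video_id, api_key):
    # Example use with CommentThreads: list.
    import requests
    def fetch_page(pageToken):
        url = ('https://www.googleapis.com/youtube/v3/commentThreads'
               f'?part=snippet&videoId={video_id}&maxResults=100'
               f'&key={api_key}&pageToken={pageToken}')
        return requests.get(url).json()
    return paginate(fetch_page)
```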
We can multi-thread this process by channel, or we can multi-thread per video of a given channel.
As we would like to proceed channel by channel, the question is **how much time does it take to retrieve all comments from the biggest YouTube channel? If the answer is a long period of time, then multi-threading per video of a given channel may make sense.**
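A sketch of the per-video variant with a thread pool; `fetch_comments` is a hypothetical callback retrieving all comments of one video (e.g. through CommentThreads: list), and the worker count is an arbitrary placeholder:

```py
from concurrent.futures import ThreadPoolExecutor

def fetch_comments_of_channel(video_ids, fetch_comments, max_workers=8):
    # Process the videos of a single channel in parallel; executor.map
    # preserves input order, so results line up with video_ids.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(fetch_comments, video_ids))
    return dict(zip(video_ids, results))
```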
**In fact we should proceed quickly with CommentThreads: list with `allThreads...` when possible.**
**Do I have an example of a channel where CommentThreads: list works but doesn't list a comment of one of its videos ...?**
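Assuming the truncated `allThreads...` refers to CommentThreads: list's `allThreadsRelatedToChannelId` filter, which returns comment threads from all of a channel's videos in one paginated stream (avoiding per-video calls), the request could be built like this:

```py
def channel_comment_threads_url(channel_id, api_key, page_token=''):
    # Single paginated stream of comment threads across all of the
    # channel's videos, via the allThreadsRelatedToChannelId filter.
    return ('https://www.googleapis.com/youtube/v3/commentThreads'
            f'?part=snippet&allThreadsRelatedToChannelId={channel_id}'
            f'&maxResults=100&key={api_key}&pageToken={page_token}')
```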
We have to proceed with a breadth-first search approach, as exhaustively treating all *child* channels first might take a time equivalent to treating the whole original tree.
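The breadth-first crawl can be sketched with an explicit frontier; `list_videos` and `list_comment_authors` are hypothetical callbacks wrapping PlaylistItems: list and CommentThreads: list, and the `visited` set guarantees each channel is treated once:

```py
from collections import deque

def bfs_crawl(seed_channels, list_videos, list_comment_authors):
    visited = set(seed_channels)
    frontier = deque(seed_channels)
    video_ids = set()
    while frontier:
        channel_id = frontier.popleft()  # FIFO order makes this breadth-first
        for video_id in list_videos(channel_id):
            video_ids.add(video_id)
            for author in list_comment_authors(video_id):
                if author not in visited:
                    visited.add(author)
                    frontier.append(author)
    return video_ids, visited
```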