YouTube_captions_search_engine

Author SHA1 Message Date

Benjamin Loison

daf14d4b5b

Update README.md to make clear to use different strategies to optimize the process

Note that as far as I know (and StackOverflow ([1.](https://stackoverflow.com/q/63387215) and [2.](https://stackoverflow.com/q/67652250)) seems to agree), there is no workaround for the 20,000-item limit of PlaylistItems: list. This limit can be checked with:

```py
import requests

PLAYLIST_ID = 'UUf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
# Follow `nextPageToken` until the API stops returning one.
while True:
    url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=id&playlistId={PLAYLIST_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    data = requests.get(url).json()
    items += data['items']
    # Running total of the items retrieved so far.
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

# Final number of items that the endpoint exposed.
print(len(items))
```

It returns >= 19,000 items.

Note that this algorithm says that:
- [france24](https://www.youtube.com/@FRANCE24) has 6,086 videos while [SocialBlade states that it has 101,196 videos](https://socialblade.com/youtube/user/france24)
- [CNN](https://www.youtube.com/@CNN) has 19,289 while [SocialBlade states that it has 157,321 videos](https://socialblade.com/youtube/user/cnn)

Indeed, neither YouTube Data API v3 Search: list (I verified with the code below that https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4 applies here) nor web-scraping the `VIDEOS` tab works (see the second SO link).

```py
import requests

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
# Follow `nextPageToken` until the API stops returning one.
while True:
    url = f'https://www.googleapis.com/youtube/v3/search?part=id&type=video&channelId={CHANNEL_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    data = requests.get(url).json()
    items += data['items']
    # Running total of the items retrieved so far.
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

# Final number of items that the endpoint exposed.
print(len(items))
```

Got ~18,734.

Another attempt using Search: list with a date filter may make sense.
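A minimal sketch of what such a date-filtered attempt could look like, assuming the `publishedAfter` and `publishedBefore` parameters of Search: list are used to split a channel's history into smaller windows that are then paginated one by one. The helper names (`date_windows`, `rfc3339`, `search_url`) and the window size are hypothetical, and this has not been verified to get past the limit observed above.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

SEARCH_URL = 'https://www.googleapis.com/youtube/v3/search'


def date_windows(start, end, step):
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    current = start
    while current < end:
        upper = min(current + step, end)
        yield current, upper
        current = upper


def rfc3339(moment):
    # Search: list expects RFC 3339 timestamps such as 1970-01-01T00:00:00Z.
    # `moment` is assumed to be a timezone-aware datetime.
    return moment.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')


def search_url(channel_id, api_key, published_after, published_before, page_token=''):
    """Build the Search: list URL for one date window (hypothetical helper)."""
    params = {
        'part': 'id',
        'type': 'video',
        'channelId': channel_id,
        'maxResults': 50,
        'publishedAfter': rfc3339(published_after),
        'publishedBefore': rfc3339(published_before),
        'key': api_key,
    }
    if page_token:
        params['pageToken'] = page_token
    return f'{SEARCH_URL}?{urlencode(params)}'
```

Each window would be paginated with `nextPageToken` exactly as in the loop above. Since a video published on a window boundary might be returned twice, deduplicating the results by video id seems prudent.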

Note that according to SocialBlade:
- [asianetnews has 195,600 videos](https://socialblade.com/youtube/user/asianetnews)
- [RoelVandePaar has 2,2025,566 videos](https://socialblade.com/youtube/c/roelvandepaar)

2022-12-22 01:54:57 +01:00

Benjamin Loison

3aa9947f8e

Update README.md to remove possibility to proceed using YouTube Data API v3 CommentThreads: list endpoint with allThreadsRelatedToChannelId filter

As we want to retrieve as many comments as possible, we have to proceed video by video: [`3F8dFt8LsXY`](https://www.youtube.com/watch?v=3F8dFt8LsXY) for instance has comments, but the YouTube Data API v3 CommentThreads: list endpoint with the `allThreadsRelatedToChannelId` filter returns the following for `UCWIdqSQekeGmUWlSFeCiEnA`:
```json
{
  "error": {
    "code": 403,
    "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
    "errors": [
      {
        "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
        "domain": "youtube.commentThread",
        "reason": "commentsDisabled",
        "location": "videoId",
        "locationType": "parameter"
      }
    ]
  }
}
```
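Proceeding video by video could be sketched as follows, assuming CommentThreads: list is called with the `videoId` filter and that a video with disabled comments answers with the same `commentsDisabled` reason as in the error above. The function names and the injectable `fetch` parameter are hypothetical and only there to keep the sketch testable without network access.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

COMMENT_THREADS_URL = 'https://www.googleapis.com/youtube/v3/commentThreads'


def fetch_json(url, params):
    """Perform the GET request and decode the JSON body, even for HTTP errors."""
    try:
        with urlopen(f'{url}?{urlencode(params)}') as response:
            return json.load(response)
    except HTTPError as error:
        # The API describes problems such as commentsDisabled in a JSON error body.
        return json.load(error)


def comment_threads_of_video(video_id, api_key, fetch=fetch_json):
    """Return all comment threads of one video, or [] if its comments are disabled."""
    threads = []
    page_token = ''
    while True:
        params = {'part': 'snippet', 'videoId': video_id, 'maxResults': 100, 'key': api_key}
        if page_token:
            params['pageToken'] = page_token
        data = fetch(COMMENT_THREADS_URL, params)
        error = data.get('error')
        if error is not None:
            reasons = [entry.get('reason') for entry in error.get('errors', [])]
            if 'commentsDisabled' in reasons:
                # Skip this video instead of aborting the whole crawl.
                return []
            raise RuntimeError(error.get('message', 'unknown API error'))
        threads += data['items']
        page_token = data.get('nextPageToken', '')
        if not page_token:
            return threads
```

Calling `comment_threads_of_video` for every video id of a channel means that one video with disabled comments only yields an empty list, rather than making the whole channel-level request fail.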

2022-12-21 23:49:27 +01:00

Benjamin Loison

c828b118d3

Add README.md with first sketching questions

2022-12-21 23:46:14 +01:00

53 Commits