YouTube_captions_search_engine

4 Commits 1 Branch 3 Tags

Author	SHA1	Message	Date
Benjamin Loison	54eb948a7e	Update `README.md` to clean notes concerning optimized approaches	2022-12-22 02:02:48 +01:00
Benjamin Loison	6f04109fe2	Update `README.md` to make clear to use different strategies to optimize the process

Note that as far as I (and StackOverflow ([1.](https://stackoverflow.com/q/63387215) and [2.](https://stackoverflow.com/q/67652250)) seems to) know there is no workaround to the 20,000 limit of PlaylistItems: list. This issue can be checked with:

```py
import requests, json

PLAYLIST_ID = 'UUf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=id&playlistId={PLAYLIST_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break
print(len(items))
```

Returns >= 19,000.

Note that this algorithm says that:

- [france24](https://www.youtube.com/@FRANCE24) has 6,086 videos while [SocialBlade states that it has 101,196 videos](https://socialblade.com/youtube/user/france24)
- [CNN](https://www.youtube.com/@CNN) has 19,289 while [SocialBlade states that it has 157,321 videos](https://socialblade.com/youtube/user/cnn)

Indeed both YouTube Data API v3 Search: list (I verified that https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4 applied here with below code) and web-scraping `VIDEOS` tab don't work (see second SO link).

```py
import requests, json

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/search?part=id&type=video&channelId={CHANNEL_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break
print(len(items))
```

Got ~18,734.

Another try by working with Search: list with date filter may make sense.

Note that according to SocialBlade:

- [asianetnews has 195,600 videos](https://socialblade.com/youtube/user/asianetnews)
- [RoelVandePaar has 2,2025,566 videos](https://socialblade.com/youtube/c/roelvandepaar)	2022-12-22 01:54:57 +01:00
Benjamin Loison	db3db57a9f	Update `README.md` to remove possibility to proceed using YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId` filter

As we want to retrieve as many comments as possible, we have to proceed video per video, as [`3F8dFt8LsXY`](https://www.youtube.com/watch?v=3F8dFt8LsXY) for instance has comments but using YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId` filter returns for `UCWIdqSQekeGmUWlSFeCiEnA`:

```json
{
  "error": {
    "code": 403,
    "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
    "errors": [
      {
        "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
        "domain": "youtube.commentThread",
        "reason": "commentsDisabled",
        "location": "videoId",
        "locationType": "parameter"
      }
    ]
  }
}
```
	2022-12-21 23:49:27 +01:00
Benjamin Loison	b4e99c1eca	Add `README.md` with first sketching questions	2022-12-21 23:46:14 +01:00
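
Commit 6f04109fe2 suggests that Search: list with a date filter may be worth trying. A minimal sketch of that idea follows, assuming the upload history is split into fixed-size time windows and each window is paginated separately through the documented `publishedAfter` and `publishedBefore` parameters. The helper names (`date_windows`, `rfc3339`, `search_url`), the 7-day window size and the `order=date` choice are illustrative assumptions, not something taken from the repository. The sketch only builds request URLs; whether this actually gets past the observed cap is untested here.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

SEARCH_URL = 'https://www.googleapis.com/youtube/v3/search'


def date_windows(start, end, step):
    """Yield consecutive (after, before) pairs that cover [start, end)."""
    cursor = start
    while cursor < end:
        upper = min(cursor + step, end)
        yield cursor, upper
        cursor = upper


def rfc3339(moment):
    """Format a timezone-aware datetime as an RFC 3339 UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')


def search_url(channel_id, api_key, after, before, page_token=''):
    """Build a Search: list URL restricted to one time window (hypothetical helper)."""
    params = {
        'part': 'id',
        'type': 'video',
        'channelId': channel_id,
        'maxResults': 50,
        'order': 'date',  # assumption: chronological order within a window
        'publishedAfter': rfc3339(after),
        'publishedBefore': rfc3339(before),
        'key': api_key,
    }
    if page_token:
        params['pageToken'] = page_token
    return f'{SEARCH_URL}?{urlencode(params)}'


# Example: one month split into 7-day windows (the last one is shorter).
start = datetime(2022, 1, 1, tzinfo=timezone.utc)
end = datetime(2022, 2, 1, tzinfo=timezone.utc)
windows = list(date_windows(start, end, timedelta(days=7)))
first_url = search_url('UCf8w5m0YsRa8MHQ5bwSGmbw', 'AIzaSy...', *windows[0])
print(len(windows))
print(first_url)
```

Each window would then be paginated with the same `nextPageToken` loop as in the commit message above. How the API treats videos published exactly on a window boundary is not checked here, so deduplicating the collected items by video ID is a sensible precaution. A window that still returns a suspiciously round or capped number of results could be split further.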