YouTube_captions_search_engine

Benjamin_Loison/YouTube_captions_search_engine

Author	SHA1	Message	Date
Benjamin_Loison	ad3e90fe92	Fix #8 : Support comments disabled channels. Tested with `UCWIdqSQekeGmUWlSFeCiEnA`, for which the 36 comments of its only comments-enabled video, `3F8dFt8LsXY`, were treated correctly. Note that this commit doesn't support comments disabled channels with more than 20,000 videos.	2023-01-03 02:56:07 +01:00
Benjamin_Loison	b12fa15288	#2 : Add data logging	2023-01-02 19:46:32 +01:00
Benjamin_Loison	0675314fe6	Apply `astyle` formatting to `main.cpp`	2023-01-02 18:31:16 +01:00
Benjamin_Loison	dfd9ee9c41	Fix #4 : Provide a version relying on the no-key service of https://yt.lemnoslife.com	2023-01-02 18:30:18 +01:00
Benjamin_Loison	68800a25a0	Make compatible with Debian. More precisely, make compatible with `gcc version 10.2.1 20210110 (Debian 10.2.1-6)`	2023-01-02 18:23:30 +01:00
Benjamin_Loison	7a1eac5e40	Add progression save and use spaces instead of tabs	2022-12-22 06:18:22 +01:00
Benjamin_Loison	273537bc8d	Add time to logging	2022-12-22 05:47:16 +01:00
Benjamin_Loison	6685c13706	Add resilience to missing `authorChannelId` in `main.cpp`	2022-12-22 05:41:38 +01:00
Benjamin_Loison	95a9421ad0	Add `main.cpp`, `Makefile` and `channelsToTreat.txt`. Note that running this algorithm ends up with channel [`UC-99odscxh1xxTyxHyXuRrg`](https://www.youtube.com/channel/UC-99odscxh1xxTyxHyXuRrg), more precisely the video [`Tq5aPNzfYcg`](https://www.youtube.com/watch?v=Tq5aPNzfYcg), and more precisely the comment [`Ugx-TlSq6SNCbOX04mx4AaABAg`](https://www.youtube.com/watch?v=Tq5aPNzfYcg&lc=Ugx-TlSq6SNCbOX04mx4AaABAg), [which doesn't have any author](https://yt.lemnoslife.com/noKey/comments?part=snippet&id=Ugx-TlSq6SNCbOX04mx4AaABAg)...	2022-12-22 05:20:32 +01:00
Benjamin_Loison	6af8a168d7	Update `README.md` to remove the question about whether or not both methods return the same comments, as it's the case.

More precisely, I used the following algorithm with these three channels:

channel id               | 1st method          | 2nd method
-------------------------|---------------------|-----------
UCt5USYpzzMCYhkirVQGHwKQ | 16                  | 16
UCUo1RqYV8tGjV38sQ8S5p9A | 58,165              | 58,165
UCWIdqSQekeGmUWlSFeCiEnA | error (as expected) | 27

```py
"""
Algorithm comparing comments count using:
1. CommentThreads: list with allThreadsRelatedToChannelId filter
2. PlaylistItems: list and CommentThreads: list

Note that the second approach isn't atomic, so counts will differ if some comments are posted while retrieving data.
"""

import requests, json

CHANNEL_ID = 'UC...'
API_KEY = 'AIzaSy...'

def getJSON(url, firstTry = True):
    if firstTry:
        url = 'https://www.googleapis.com/youtube/v3/' + url + f'&key={API_KEY}'
    try:
        content = requests.get(url).text
    except:
        print('retry')
        return getJSON(url, False)
    data = json.loads(content)
    return data

items = []
pageToken = ''
while True:
    # After having verified, I confirm that using `allThreadsRelatedToChannelId` doesn't return comments of the `COMMUNITY` tab
    data = getJSON(f'commentThreads?part=id,snippet,replies&allThreadsRelatedToChannelId={CHANNEL_ID}&maxResults=100&pageToken={pageToken}')
    items += data['items']
    # In fact once we have top level comment, then with both methods if the replies count is correct, then we are fine as we both use the same Comments: list endpoint
    """for item in data['items']:
        if 'replies' in item:
            if len(item['replies']['comments']) >= 5:
                print('should consider replies too!')"""
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))

PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]
videoIds = []
pageToken = ''
while True:
    data = getJSON(f'playlistItems?part=snippet&playlistId={PLAYLIST_ID}&maxResults=50&pageToken={pageToken}')
    for item in data['items']:
        videoIds += [item['snippet']['resourceId']['videoId']]
    print(len(videoIds))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(videoIds))

items = []
for videoIndex, videoId in enumerate(videoIds):
    pageToken = ''
    while True:
        data = getJSON(f'commentThreads?part=id,snippet,replies&videoId={videoId}&maxResults=100&pageToken={pageToken}')
        if 'items' in data:
            items += data['items']
            # repeat replies check as could be the case here and not there
            """for item in data['items']:
                if 'replies' in item:
                    if len(item['replies']['comments']) >= 5:
                        print('should consider replies too!')"""
        print(videoIndex, len(videoIds), len(items))
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break

print(len(items))
```	2022-12-22 03:18:25 +01:00
Benjamin_Loison	54eb948a7e	Update `README.md` to clean notes concerning optimized approaches	2022-12-22 02:02:48 +01:00
Benjamin_Loison	6f04109fe2	Update `README.md` to make clear to use different strategies to optimize the process.

Note that as far as I know (and as StackOverflow ([1.](https://stackoverflow.com/q/63387215) and [2.](https://stackoverflow.com/q/67652250)) seems to confirm), there is no workaround to the 20,000 limit of PlaylistItems: list. This issue can be checked with:

```py
import requests, json

PLAYLIST_ID = 'UUf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=id&playlistId={PLAYLIST_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

Returns >= 19,000.

Note that this algorithm says that:
- [france24](https://www.youtube.com/@FRANCE24) has 6,086 videos while [SocialBlade states that it has 101,196 videos](https://socialblade.com/youtube/user/france24)
- [CNN](https://www.youtube.com/@CNN) has 19,289 while [SocialBlade states that it has 157,321 videos](https://socialblade.com/youtube/user/cnn)

Indeed, neither YouTube Data API v3 Search: list (I verified with the code below that https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4 applies here) nor web-scraping the `VIDEOS` tab works (see second SO link).

```py
import requests, json

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/search?part=id&type=video&channelId={CHANNEL_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

Got ~18,734.

Another attempt using Search: list with a date filter may make sense.

Note that according to SocialBlade:
- [asianetnews has 195,600 videos](https://socialblade.com/youtube/user/asianetnews)
- [RoelVandePaar has 2,2025,566 videos](https://socialblade.com/youtube/c/roelvandepaar)	2022-12-22 01:54:57 +01:00
Benjamin_Loison	db3db57a9f	Update `README.md` to remove possibility to proceed using YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId` filter.

As we want to retrieve as many comments as possible, we have to proceed video per video, as [`3F8dFt8LsXY`](https://www.youtube.com/watch?v=3F8dFt8LsXY) for instance has comments but using YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId` filter returns for `UCWIdqSQekeGmUWlSFeCiEnA`:

```json
{
    "error": {
        "code": 403,
        "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
        "errors": [
            {
                "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
                "domain": "youtube.commentThread",
                "reason": "commentsDisabled",
                "location": "videoId",
                "locationType": "parameter"
            }
        ]
    }
}
```	2022-12-21 23:49:27 +01:00
Benjamin_Loison	b4e99c1eca	Add `README.md` with first sketching questions	2022-12-21 23:46:14 +01:00
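
The Python snippets in the commit messages above all repeat the same `nextPageToken` loop. As a minimal sketch (not code from this repository), that loop could be factored into one generator. The names `paginate`, `fake_fetch` and `FAKE_PAGES` are hypothetical, and the in-memory pages only stand in for real API responses so that the example runs without an API key. Treating a response without `items` as an empty page is an assumption, modeled on the `commentsDisabled` error shown above.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[str], dict]) -> Iterator[dict]:
    """Yield every item across pages, following `nextPageToken` until it is absent."""
    page_token = ''
    while True:
        data = fetch_page(page_token)
        # Assumption: an error response (e.g. comments disabled) has no `items` key,
        # so it is treated as an empty page instead of raising a KeyError.
        yield from data.get('items', [])
        if 'nextPageToken' not in data:
            break
        page_token = data['nextPageToken']


# Hypothetical in-memory stand-in for the API: two normal pages, then an error page.
FAKE_PAGES = {
    '': {'items': [{'id': 'a'}, {'id': 'b'}], 'nextPageToken': 'p2'},
    'p2': {'items': [{'id': 'c'}], 'nextPageToken': 'p3'},
    'p3': {'error': {'code': 403}},
}


def fake_fetch(page_token: str) -> dict:
    """Return the fake page associated with `page_token`."""
    return FAKE_PAGES[page_token]


if __name__ == '__main__':
    print([item['id'] for item in paginate(fake_fetch)])
```

With a real endpoint, `fetch_page` would wrap a request such as the `getJSON` calls above, receiving the page token as its only argument.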