YouTube_captions_search_engine

Benjamin_Loison/YouTube_captions_search_engine

Author	SHA1	Message	Date
Benjamin Loison	f436007836	Fix #16 : Provide an algorithm to determine the progress of retrieving comments for huge YouTube channels	2023-01-06 17:51:00 +01:00
Benjamin Loison	dfbf38b071	#1 : Add GNU AGPLv3 license	2023-01-06 16:09:12 +01:00
Benjamin Loison	292dd8919e	Add `try`/`catch` around json parser	2023-01-06 00:31:05 +01:00

As I got:

```
terminate called after throwing an instance of 'nlohmann::detail::parse_error'
terminate called recursively
  what():  [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
terminate called recursively
```
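Commit `292dd8919e` concerns the C++ parser (`nlohmann::json`), while the Python snippets in this log call `json.loads` without any guard. As a rough Python analogue of the same idea, here is a minimal sketch; `parseJSON` is a hypothetical helper name, not something taken from the repository, and returning `None` on failure is an assumption about how a caller would want to react.

```py
import json

def parseJSON(content):
    """Return the parsed JSON document, or None if `content` isn't valid JSON."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        # An empty response body ends up here, similarly to the
        # "unexpected end of input" error quoted in the commit message.
        print(f'JSON parse error: {e}')
        return None

print(parseJSON('{"items": []}'))
print(parseJSON(''))
```

A caller can then retry or skip the request when `None` is returned instead of letting the whole process terminate.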
Benjamin Loison	dab4c8ff69	Modify `removeChannelsBeingTreated.py` to be more resilient against non-existent files in the treatment process	2023-01-04 03:10:28 +01:00
Benjamin Loison	9d5c9fde2a	#2 : Add compression to `channels/` folder	2023-01-04 03:06:33 +01:00

The following Python script can be used to compress an existing uncompressed `channels/` folder.

```py
import os, shutil

path = 'channels/'

os.chdir(path)

# List the immediate sub-folders of `channels/`, one per channel id.
d = next(os.walk('.'))[1]
for channelIndex, channelId in enumerate(d):
    print(f'{channelIndex} / {len(d)}: {channelId}')
    # Create `channelId.zip` from the folder, then remove the uncompressed folder.
    shutil.make_archive(channelId, 'zip', channelId)
    shutil.rmtree(channelId)
```
Benjamin Loison	f201ae7a91	Make #7 : Add multi-threading compatible with my Debian setup	2023-01-04 02:51:40 +01:00
Benjamin Loison	4cae7e09d1	Add `{removeChannelsBeingTreated, findTreatedChannelWithMost{Comments, Subscribers}}.py`	2023-01-04 02:41:07 +01:00
Benjamin Loison	e4b4ce21a2	Fix #7 : Add multi-threading	2023-01-03 04:56:19 +01:00
Benjamin Loison	a2990c7699	Fix #8 : Support comments-disabled channels. Tested with `UCWIdqSQekeGmUWlSFeCiEnA`, for which the 36 comments of the only comments-enabled video `3F8dFt8LsXY` were treated correctly. Note that this commit doesn't support comments-disabled channels with more than 20,000 videos.	2023-01-03 02:56:07 +01:00
Benjamin Loison	923c14a77b	#2 : Add data logging	2023-01-02 19:46:32 +01:00
Benjamin Loison	73a9dea023	Apply `astyle` formatting to `main.cpp`	2023-01-02 18:31:16 +01:00
Benjamin Loison	938ae4b0fb	Fix #4 : Provide a version relying on the no-key service of https://yt.lemnoslife.com	2023-01-02 18:30:18 +01:00
Benjamin Loison	c50a82df1b	Make compatible with Debian. More precisely, make compatible with `gcc version 10.2.1 20210110 (Debian 10.2.1-6)`	2023-01-02 18:23:30 +01:00
Benjamin Loison	36f1fb9e83	Add progression save and use spaces instead of tabs	2022-12-22 06:18:22 +01:00
Benjamin Loison	934954092a	Add time to logging	2022-12-22 05:47:16 +01:00
Benjamin Loison	eaae954e1b	Add resilience to missing `authorChannelId` in `main.cpp`	2022-12-22 05:41:38 +01:00
Benjamin Loison	2ffc1d0e5d	Add `main.cpp`, `Makefile` and `channelsToTreat.txt`. Note that running this algorithm ends up at channel [`UC-99odscxh1xxTyxHyXuRrg`](https://www.youtube.com/channel/UC-99odscxh1xxTyxHyXuRrg), more precisely at the video [`Tq5aPNzfYcg`](https://www.youtube.com/watch?v=Tq5aPNzfYcg), and more precisely at the comment [`Ugx-TlSq6SNCbOX04mx4AaABAg`](https://www.youtube.com/watch?v=Tq5aPNzfYcg&lc=Ugx-TlSq6SNCbOX04mx4AaABAg), [which doesn't have any author](https://yt.lemnoslife.com/noKey/comments?part=snippet&id=Ugx-TlSq6SNCbOX04mx4AaABAg)...	2022-12-22 05:20:32 +01:00
Benjamin Loison	53acda6abe	Update `README.md` to remove the question about whether or not both methods return the same comments, as they do	2022-12-22 03:18:25 +01:00

More precisely, I used the following algorithm with these three channels:

channel id | 1st method | 2nd method
-------------------------|-----------------------|-----------
UCt5USYpzzMCYhkirVQGHwKQ | 16 | 16
UCUo1RqYV8tGjV38sQ8S5p9A | 58,165 | 58,165
UCWIdqSQekeGmUWlSFeCiEnA | error (as expected) | 27

```py
"""
Algorithm comparing comments count using:
1. CommentThreads: list with allThreadsRelatedToChannelId filter
2. PlaylistItems: list and CommentThreads: list

Note that the second approach isn't atomic, so counts will differ if some comments are posted while retrieving data.
"""

import requests, json

CHANNEL_ID = 'UC...'
API_KEY = 'AIzaSy...'

def getJSON(url, firstTry = True):
    if firstTry:
        url = 'https://www.googleapis.com/youtube/v3/' + url + f'&key={API_KEY}'
    try:
        content = requests.get(url).text
    except:
        print('retry')
        return getJSON(url, False)
    data = json.loads(content)
    return data

items = []
pageToken = ''
while True:
    # After having verified, I confirm that using `allThreadsRelatedToChannelId` doesn't return comments of the `COMMUNITY` tab
    data = getJSON(f'commentThreads?part=id,snippet,replies&allThreadsRelatedToChannelId={CHANNEL_ID}&maxResults=100&pageToken={pageToken}')
    items += data['items']
    # In fact once we have top level comment, then with both methods if the replies count is correct, then we are fine as we both use the same Comments: list endpoint
    """for item in data['items']:
        if 'replies' in item:
            if len(item['replies']['comments']) >= 5:
                print('should consider replies too!')"""
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))

PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]
videoIds = []
pageToken = ''
while True:
    data = getJSON(f'playlistItems?part=snippet&playlistId={PLAYLIST_ID}&maxResults=50&pageToken={pageToken}')
    for item in data['items']:
        videoIds += [item['snippet']['resourceId']['videoId']]
    print(len(videoIds))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(videoIds))

items = []
for videoIndex, videoId in enumerate(videoIds):
    pageToken = ''
    while True:
        data = getJSON(f'commentThreads?part=id,snippet,replies&videoId={videoId}&maxResults=100&pageToken={pageToken}')
        if 'items' in data:
            items += data['items']
        # repeat replies check as could be the case here and not there
        """for item in data['items']:
            if 'replies' in item:
                if len(item['replies']['comments']) >= 5:
                    print('should consider replies too!')"""
        print(videoIndex, len(videoIds), len(items))
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break

print(len(items))
```
Benjamin Loison	d776c09fec	Update `README.md` to clean notes concerning optimized approaches	2022-12-22 02:02:48 +01:00
Benjamin Loison	daf14d4b5b	Update `README.md` to make clear to use different strategies to optimize the process	2022-12-22 01:54:57 +01:00

Note that as far as I know (and as far as StackOverflow ([1.](https://stackoverflow.com/q/63387215) and [2.](https://stackoverflow.com/q/67652250)) seems to know), there is no workaround for the 20,000 limit of PlaylistItems: list. This issue can be checked with:

```py
import requests, json

PLAYLIST_ID = 'UUf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=id&playlistId={PLAYLIST_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

It returns >= 19,000.

Note that this algorithm says that:
- [france24](https://www.youtube.com/@FRANCE24) has 6,086 videos while [SocialBlade states that it has 101,196 videos](https://socialblade.com/youtube/user/france24)
- [CNN](https://www.youtube.com/@CNN) has 19,289 while [SocialBlade states that it has 157,321 videos](https://socialblade.com/youtube/user/cnn)

Indeed, neither YouTube Data API v3 Search: list (I verified with the code below that https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4 applies here) nor web-scraping the `VIDEOS` tab works (see the second SO link).

```py
import requests, json

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/search?part=id&type=video&channelId={CHANNEL_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

Got ~18,734.

Another attempt using Search: list with a date filter may make sense.

Note that according to SocialBlade:
- [asianetnews has 195,600 videos](https://socialblade.com/youtube/user/asianetnews)
- [RoelVandePaar has 2,2025,566 videos](https://socialblade.com/youtube/c/roelvandepaar)
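Commit `daf14d4b5b` mentions that another try with Search: list and a date filter may make sense. A minimal sketch of how such requests could be prepared, assuming the `publishedAfter` and `publishedBefore` parameters of Search: list and a fixed window size; the helper names `dateWindows` and `searchUrls` and the 30-day default are hypothetical, and whether this actually gets around the limit is not verified here.

```py
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

def dateWindows(start, end, days):
    """Yield consecutive (after, before) datetime pairs covering [start, end)."""
    step = timedelta(days=days)
    current = start
    while current < end:
        upper = min(current + step, end)
        yield current, upper
        current = upper

def searchUrls(channelId, apiKey, start, end, days=30):
    """Build the first-page Search: list URL of each date window."""
    base = 'https://www.googleapis.com/youtube/v3/search?'
    fmt = '%Y-%m-%dT%H:%M:%SZ'  # RFC 3339 timestamp, assuming UTC datetimes
    urls = []
    for after, before in dateWindows(start, end, days):
        params = {
            'part': 'id',
            'type': 'video',
            'channelId': channelId,
            'maxResults': 50,
            'publishedAfter': after.strftime(fmt),
            'publishedBefore': before.strftime(fmt),
            'key': apiKey,
        }
        urls.append(base + urlencode(params))
    return urls

if __name__ == '__main__':
    start = datetime(2022, 1, 1, tzinfo=timezone.utc)
    end = datetime(2022, 3, 1, tzinfo=timezone.utc)
    for url in searchUrls('UCf8w5m0YsRa8MHQ5bwSGmbw', 'AIzaSy...', start, end):
        print(url)
```

Each URL would then be paginated with `pageToken` as in the snippets of that commit, and a window still returning a suspiciously round number of results could be split into smaller ones.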
Benjamin Loison	3aa9947f8e	Update `README.md` to remove possibility to proceed using YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId` filter As we want to retrieve as many comments as possible, we have to proceed video per video, as [`3F8dFt8LsXY`](https://www.youtube.com/watch?v=3F8dFt8LsXY) for instance has comments but using YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId` filter returns for `UCWIdqSQekeGmUWlSFeCiEnA`: ```json { "error": { "code": 403, "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.", "errors": [ { "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.", "domain": "youtube.commentThread", "reason": "commentsDisabled", "location": "videoId", "locationType": "parameter" } ] } } ```	2022-12-21 23:49:27 +01:00
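The `commentsDisabled` error quoted in commit `3aa9947f8e` is what a per-video retrieval has to tolerate in order to support comments-disabled channels (see commit `a2990c7699`). A minimal Python sketch of detecting it, assuming the response has already been parsed into a dict; `isCommentsDisabled` is a hypothetical helper, not code from `main.cpp`.

```py
import json

def isCommentsDisabled(data):
    """Return True if a parsed API response is a `commentsDisabled` error."""
    errors = data.get('error', {}).get('errors', [])
    return any(error.get('reason') == 'commentsDisabled' for error in errors)

# Trimmed-down version of the error response quoted in the commit message.
response = json.loads('''
{
    "error": {
        "code": 403,
        "errors": [
            {
                "domain": "youtube.commentThread",
                "reason": "commentsDisabled",
                "location": "videoId",
                "locationType": "parameter"
            }
        ]
    }
}
''')

print(isCommentsDisabled(response))
print(isCommentsDisabled({'items': []}))
```

With such a check, a video with disabled comments can be skipped while the other videos of the channel are still processed.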
Benjamin Loison	c828b118d3	Add `README.md` with first sketching questions	2022-12-21 23:46:14 +01:00

72 Commits