Commit Graph

43 Commits

SHA1 Message Date
bdb4e6443a #11: Add current livestreams support to discover channels 2023-01-22 04:00:11 +01:00
d2391e5d54 Instead of looping over items when we expect only one, just use items[0] 2023-01-22 02:19:26 +01:00
993d0b9771 Make PRINT not require specifying threadId 2023-01-22 02:04:03 +01:00
0fcb5a0426 #11: Treat COMMUNITY post comments to discover channels 2023-01-22 01:37:32 +01:00
57200da482 Note in README.md that, as documented in #30, this algorithm is only known to work fine on Linux 2023-01-21 22:20:45 +01:00
a0880c79bb #11: Update channel CHANNELS tab treatment following YouTube-operational-API/issues/121 closure 2023-01-21 02:24:42 +01:00
10c5c1d605 #11: Add the treatment of channels' CHANNELS tab, but postpone the unlisted videos treatment 2023-01-15 14:56:44 +01:00
51a70f6e54 #7: Make commentsCount and requestsPerChannel compatible with multithreading 2023-01-15 14:31:55 +01:00
aa97c94bf8 #11: Add a first iteration for the CHANNELS retrieval 2023-01-15 02:19:31 +01:00
d1b84335d1 #11: Add --youtube-operational-api-instance-url parameter and use exit(EXIT_{SUCCESS, FAILURE}) instead of exit({0, 1}) 2023-01-15 00:49:32 +01:00
6ce29051c0 Fix #26: Keep efficient search algorithm while keeping order (notably of the starting set) 2023-01-14 15:14:24 +01:00
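The commit row above has no body; a minimal sketch of how ordering and efficiency can be combined, assuming a hypothetical breadth-first exploration with a set mirroring the queue (not the actual main.cpp implementation):

```py
from collections import deque

def treatChannel(channelId):
    # Hypothetical helper: would return the channel ids discovered
    # while treating `channelId`'s comments.
    return []

channelsToTreat = deque(['UC...0', 'UC...1'])  # keeps the starting-set order
channelsToTreatSet = set(channelsToTreat)      # mirrors the queue for O(1) lookups
channelsAlreadyTreated = set()

while channelsToTreat:
    channelId = channelsToTreat.popleft()  # FIFO, so order is preserved
    channelsToTreatSet.remove(channelId)
    channelsAlreadyTreated.add(channelId)
    for discoveredChannelId in treatChannel(channelId):
        if discoveredChannelId not in channelsAlreadyTreated and discoveredChannelId not in channelsToTreatSet:
            channelsToTreat.append(discoveredChannelId)
            channelsToTreatSet.add(discoveredChannelId)
```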
ad9f96b33c Fix #24: Stop using macros for user inputs, notably to make releases possible 2023-01-08 18:26:20 +01:00
d498c86058 Fix #6: Add support for multiple keys to be resilient against exceeded quota errors 2023-01-08 17:59:08 +01:00
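This commit row has no body either; a minimal sketch of the multiple-keys idea in the style of the Python snippets further down (the key pool and helper name are hypothetical):

```py
import requests

API_KEYS = ['AIzaSy...0', 'AIzaSy...1']  # hypothetical pool of keys
keyIndex = 0

def getJSONRotatingKeys(url):
    global keyIndex
    # An exceeded daily quota makes the API answer HTTP 403 with the
    # `quotaExceeded` reason, so on an `error` answer we move to the next key.
    for _ in range(len(API_KEYS)):
        data = requests.get(f'https://www.googleapis.com/youtube/v3/{url}&key={API_KEYS[keyIndex]}').json()
        if 'error' not in data:
            return data
        keyIndex = (keyIndex + 1) % len(API_KEYS)
    raise Exception('all keys have exceeded their quota')
```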
1ee767abbc Fix #23: YouTube Data API v3 PlaylistItems: list endpoint returns a playlistNotFound error for regular uploads playlists 2023-01-08 16:31:57 +01:00
7e35a6473a Fix #20: YouTube Data API v3 rarely and suddenly returns a commentsDisabled error, which involves an unwanted method switch
Also modified the compression command, as I got `sh: 1: zip: Argument list too long` when compressing the 248,868 JSON files of the French channel with the most subscribers.
2023-01-08 15:43:27 +01:00
ba37d6a111 Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated 2023-01-07 15:45:31 +01:00
5a7e5b6f78 Add a note about the timing percentage of findLatestTreatedCommentsForChannelsBeingTreated.py going backward 2023-01-07 15:35:12 +01:00
e3cab4c204 Fix #9: Make sure that the algorithm correctly treats errors returned by the YouTube Data API v3
Note that in case of error the algorithm used to skip the received content, as if it simply contained no `items`.
2023-01-06 20:55:32 +01:00
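A minimal sketch of the change in Python terms (the actual fix is in main.cpp; names are hypothetical):

```py
import requests, json

def getJSONChecked(url):
    data = json.loads(requests.get(url).text)
    # The algorithm used to skip such content as if `items` were empty;
    # now an `error` answer is treated explicitly instead.
    if 'error' in data:
        raise Exception(f"YouTube Data API v3 error: {data['error']}")
    return data
```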
156a621413 Fix #15: Provide an algorithm to retrieve the list of the 100 French channels with the most subscribers (and provide that list too) 2023-01-06 18:06:00 +01:00
fdfec17817 #7: Remove remaining undefined behavior due to missing mutex use 2023-01-06 18:00:51 +01:00
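The fix itself is C++, but the pattern translates directly; a minimal Python sketch of protecting a shared counter with a lock (hypothetical names):

```py
import threading

commentsCount = 0  # shared between worker threads
commentsCountLock = threading.Lock()

def onCommentTreated():
    global commentsCount
    # An unsynchronized read-modify-write from several threads can lose
    # increments; the lock makes the update atomic.
    with commentsCountLock:
        commentsCount += 1
```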
3ef5fa0707 Fix #17: Add to stdout live statistics of the number of comments treated per second 2023-01-06 17:55:16 +01:00
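A minimal sketch of such live statistics, as a hypothetical Python analogue of the main.cpp logging (assuming a `commentsCount` updated by the treatment threads):

```py
import time

def printCommentsPerSecond():
    lastCommentsCount = 0
    while True:
        time.sleep(1)
        # Difference over one second approximates the treatment rate.
        print(f'{commentsCount - lastCommentsCount} comments treated per second')
        lastCommentsCount = commentsCount
```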
0259dfb3fb Fix #16: Provide an algorithm to determine the progress of retrieving comments for huge YouTube channels 2023-01-06 17:51:00 +01:00
b2fafb721c #1: Add GNU AGPLv3 license 2023-01-06 16:09:12 +01:00
01394769fd Add try/catch around the JSON parser
As I got:
```
terminate called after throwing an instance of 'nlohmann::detail::parse_error'
terminate called recursively
  what():  [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
terminate called recursively
```
2023-01-06 00:31:05 +01:00
5d13bd3c44 Modify removeChannelsBeingTreated.py to be more resilient against files that no longer exist during the treatment process 2023-01-04 03:10:28 +01:00
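A minimal sketch of that kind of resilience, assuming a hypothetical helper around the deletion:

```py
import shutil

def removeIfExists(path):
    # The treatment process may have already compressed or removed the
    # channel folder, so tolerate paths that no longer exist.
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        pass
```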
512485b1b8 #2: Add compression to channels/ folder
The following Python script can be used to compress an existing
uncompressed `channels/` folder.

```py
import os, shutil

path = 'channels/'

os.chdir(path)

# Each subdirectory of `channels/` is named after a channel id.
channelIds = next(os.walk('.'))[1]
for channelIndex, channelId in enumerate(channelIds):
    print(f'{channelIndex} / {len(channelIds)}: {channelId}')
    # Zip the channel folder into `<channelId>.zip`, then delete the folder.
    shutil.make_archive(channelId, 'zip', channelId)
    shutil.rmtree(channelId)
```
2023-01-04 03:06:33 +01:00
a4a282642d Make "#7: Add multi-threading" compatible with my Debian setup 2023-01-04 02:51:40 +01:00
dde52da8c8 Add {removeChannelsBeingTreated, findTreatedChannelWithMost{Comments, Subscribers}}.py 2023-01-04 02:41:07 +01:00
2a33be9272 Fix #7: Add multi-threading 2023-01-03 04:56:19 +01:00
ad3e90fe92 Fix #8: Support comments-disabled channels
Tested with `UCWIdqSQekeGmUWlSFeCiEnA`, whose 36 comments on the only comments-enabled video `3F8dFt8LsXY` were treated correctly.

Note that this commit doesn't support comments-disabled channels with more than 20,000 videos.
2023-01-03 02:56:07 +01:00
b12fa15288 #2: Add data logging 2023-01-02 19:46:32 +01:00
0675314fe6 Apply astyle formatting to main.cpp 2023-01-02 18:31:16 +01:00
dfd9ee9c41 Fix #4: Provide a version relying on the no-key service of https://yt.lemnoslife.com 2023-01-02 18:30:18 +01:00
68800a25a0 Make compatible with Debian
More precisely, make compatible with `gcc version 10.2.1 20210110 (Debian 10.2.1-6)`
2023-01-02 18:23:30 +01:00
7a1eac5e40 Add progression save and use spaces instead of tabs 2022-12-22 06:18:22 +01:00
273537bc8d Add time to logging 2022-12-22 05:47:16 +01:00
6685c13706 Add resilience to missing authorChannelId in main.cpp 2022-12-22 05:41:38 +01:00
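The change itself is in main.cpp; expressed in the Python style of the snippets below, the defensive access could look like (hypothetical helper):

```py
def getAuthorChannelId(commentThread):
    # Some comments, like the one referenced in the commit below, have no
    # author, so `authorChannelId` may be absent from the snippet.
    snippet = commentThread['snippet']['topLevelComment']['snippet']
    return snippet.get('authorChannelId', {}).get('value')
```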
95a9421ad0 Add main.cpp, Makefile and channelsToTreat.txt
Note that running this algorithm ends up on channel [`UC-99odscxh1xxTyxHyXuRrg`](https://www.youtube.com/channel/UC-99odscxh1xxTyxHyXuRrg), more precisely on the video [`Tq5aPNzfYcg`](https://www.youtube.com/watch?v=Tq5aPNzfYcg), and even more precisely on the comment [`Ugx-TlSq6SNCbOX04mx4AaABAg`](https://www.youtube.com/watch?v=Tq5aPNzfYcg&lc=Ugx-TlSq6SNCbOX04mx4AaABAg), [which doesn't have any author](https://yt.lemnoslife.com/noKey/comments?part=snippet&id=Ugx-TlSq6SNCbOX04mx4AaABAg)...
2022-12-22 05:20:32 +01:00
6af8a168d7 Update README.md to remove the question about whether both methods return the same comments, as they do
More precisely, I used the following algorithm with these three channels:
channel id               | 1st method            | 2nd method
-------------------------|-----------------------|-----------
UCt5USYpzzMCYhkirVQGHwKQ | 16                    | 16
UCUo1RqYV8tGjV38sQ8S5p9A | 58,165                | 58,165
UCWIdqSQekeGmUWlSFeCiEnA | *error* (as expected) | 27

```py
"""
Algorithm comparing comments count using:
1. CommentThreads: list with allThreadsRelatedToChannelId filter
2. PlaylistItems: list and CommentThreads: list
Note that the second approach isn't *atomic*, so counts will differ if some comments are posted while retrieving data.
"""

import requests, json

CHANNEL_ID = 'UC...'
API_KEY = 'AIzaSy...'

def getJSON(url, firstTry = True):
    # On the first try, prepend the API base URL and append the key;
    # retries receive the already-complete URL.
    if firstTry:
        url = 'https://www.googleapis.com/youtube/v3/' + url + f'&key={API_KEY}'
    try:
        content = requests.get(url).text
    except requests.exceptions.RequestException:
        print('retry')
        return getJSON(url, False)
    return json.loads(content)

items = []
pageToken = ''
while True:
    # I verified that using `allThreadsRelatedToChannelId` doesn't return comments of the `COMMUNITY` tab
    data = getJSON(f'commentThreads?part=id,snippet,replies&allThreadsRelatedToChannelId={CHANNEL_ID}&maxResults=100&pageToken={pageToken}')
    items += data['items']
    # In fact, once we have the top-level comments, both methods use the same Comments: list endpoint for replies, so as long as the replies *count* is correct we are fine
    """for item in data['items']:
        if 'replies' in item:
            if len(item['replies']['comments']) >= 5:
                print('should consider replies too!')"""
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))

# The uploads playlist id is the channel id with its `UC` prefix replaced by `UU`.
PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

videoIds = []
pageToken = ''
while True:
    data = getJSON(f'playlistItems?part=snippet&playlistId={PLAYLIST_ID}&maxResults=50&pageToken={pageToken}')
    for item in data['items']:
        videoIds += [item['snippet']['resourceId']['videoId']]
    print(len(videoIds))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(videoIds))
items = []

for videoIndex, videoId in enumerate(videoIds):
    pageToken = ''
    while True:
        data = getJSON(f'commentThreads?part=id,snippet,replies&videoId={videoId}&maxResults=100&pageToken={pageToken}')
        if 'items' in data:
            items += data['items']
            # Repeat the replies check, as it could trigger here and not with the first method.
            """for item in data['items']:
                if 'replies' in item:
                    if len(item['replies']['comments']) >= 5:
                        print('should consider replies too!')"""
        print(videoIndex, len(videoIds), len(items))
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break

print(len(items))
```
2022-12-22 03:18:25 +01:00
54eb948a7e Update README.md to clean notes concerning optimized approaches 2022-12-22 02:02:48 +01:00
6f04109fe2 Update README.md to make clear that different strategies should be used to optimize the process
Note that, as far as I know (and StackOverflow ([1.](https://stackoverflow.com/q/63387215) and [2.](https://stackoverflow.com/q/67652250)) seems to agree), there is no workaround for the 20,000-item limit of PlaylistItems: list. This issue can be checked with:

```py
import requests, json

PLAYLIST_ID = 'UUf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=id&playlistId={PLAYLIST_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

It returns >= 19,000 items.

Note that this algorithm says that:
- [france24](https://www.youtube.com/@FRANCE24) has 6,086 videos, while [SocialBlade states that it has 101,196 videos](https://socialblade.com/youtube/user/france24)
- [CNN](https://www.youtube.com/@CNN) has 19,289 videos, while [SocialBlade states that it has 157,321 videos](https://socialblade.com/youtube/user/cnn)

Indeed, both YouTube Data API v3 Search: list (I verified with the code below that https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4 applies here) and web-scraping the `VIDEOS` tab (see the second SO link) don't work.

```py
import requests, json

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/search?part=id&type=video&channelId={CHANNEL_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

It returned ~18,734 items.

Another attempt, using Search: list with a date filter, may make sense.
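A minimal sketch of such a date-windowed attempt, reusing the structure of the snippets above (the one-year window is an arbitrary assumption):

```py
import requests, json

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

def searchVideoIdsInWindow(publishedAfter, publishedBefore):
    # `Search: list` accepts RFC 3339 `publishedAfter`/`publishedBefore`
    # filters, so slicing the channel history into small enough windows
    # might avoid the pagination cap of a single query.
    items = []
    pageToken = ''
    while True:
        url = (f'https://www.googleapis.com/youtube/v3/search?part=id&type=video&channelId={CHANNEL_ID}'
               f'&publishedAfter={publishedAfter}&publishedBefore={publishedBefore}'
               f'&maxResults=50&key={API_KEY}&pageToken={pageToken}')
        data = json.loads(requests.get(url).text)
        items += data['items']
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break
    return items

print(len(searchVideoIdsInWindow('2022-01-01T00:00:00Z', '2023-01-01T00:00:00Z')))
```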

Note that according to SocialBlade:
- [asianetnews has 195,600 videos](https://socialblade.com/youtube/user/asianetnews)
- [RoelVandePaar has 2,025,566 videos](https://socialblade.com/youtube/c/roelvandepaar)
2022-12-22 01:54:57 +01:00
db3db57a9f Update README.md to remove the possibility of proceeding with the YouTube Data API v3 CommentThreads: list endpoint and its allThreadsRelatedToChannelId filter
As we want to retrieve as many comments as possible, we have to proceed video by video: [`3F8dFt8LsXY`](https://www.youtube.com/watch?v=3F8dFt8LsXY) for instance has comments, but the YouTube Data API v3 CommentThreads: list endpoint with the `allThreadsRelatedToChannelId` filter returns for `UCWIdqSQekeGmUWlSFeCiEnA`:
```json
{
  "error": {
    "code": 403,
    "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
    "errors": [
      {
        "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
        "domain": "youtube.commentThread",
        "reason": "commentsDisabled",
        "location": "videoId",
        "locationType": "parameter"
      }
    ]
  }
}
```
2022-12-21 23:49:27 +01:00
b4e99c1eca Add README.md with first sketching questions 2022-12-21 23:46:14 +01:00