Commit Graph

67 Commits

SHA1 Message Date
3c4664a4b1 Fix #13: Add captions extraction
In addition, I was about to commit:

```c++
// Videos that have automatically generated captions set to `Off` by default aren't retrieved with `--sub-langs '.*orig'`.
// My workaround is to first call the YouTube Data API v3 Captions: list endpoint with `part=snippet` and retrieve the language whose `snippet` contains `"trackKind": "asr"` (automatic speech recognition).
/*json data = getJson(threadId, "captions?part=snippet&videoId=" + videoId, true, channelToTreat),
     items = data["items"];
for(const auto& item : items)
{
    json snippet = item["snippet"];
    if(snippet["trackKind"] == "asr")
    {
        string language = snippet["language"];
        cmd = cmdCommonPrefix + "--write-auto-subs --sub-langs '" + language + "-orig' --sub-format ttml --convert-subs vtt" + cmdCommonPostfix;
        exec(threadId, cmd);
        // As there should be a single automatic speech recognition track, there is no need to go through all tracks.
        break;
    }
}*/
```

Instead of:

```c++
cmd = cmdCommonPrefix + "--write-auto-subs --sub-langs '.*orig' --sub-format ttml --convert-subs vtt" + cmdCommonPostfix;
exec(threadId, cmd);
```

But I realized, as the GitHub comment I was about to add to https://github.com/yt-dlp/yt-dlp/issues/2655 shows, that I was wrong:

> `yt-dlp --cookies cookies.txt --sub-langs 'en.*,.*orig' --write-auto-subs https://www.youtube.com/watch?v=tQqDBySHYlc` works as expected. Many thanks again.
>
> ```
> 'subtitleslangs': ['en.*','.*orig'],
> 'writeautomaticsub': True,
> ```
>
> Works as expected too. Thank you.
>
> Very sorry for the video sample. I hadn't even watched it.

Thank you for this workaround. However, note that videos whose automatically generated subtitles are set to `Off` by default aren't retrieved with your method (an example of such a video: [`mozyXsZJnQ4`](https://www.youtube.com/watch?v=mozyXsZJnQ4)). My workaround is to first call the [YouTube Data API v3](https://developers.google.com/youtube/v3) [Captions: list](https://developers.google.com/youtube/v3/docs/captions/list) endpoint with [`part=snippet`](https://developers.google.com/youtube/v3/docs/captions/list#part) and retrieve the [`language`](https://developers.google.com/youtube/v3/docs/captions#snippet.language) of the track whose [`snippet`](https://developers.google.com/youtube/v3/docs/captions#snippet) contains [`"trackKind": "asr"`](https://developers.google.com/youtube/v3/docs/captions#snippet.trackKind) (automatic speech recognition).
2023-02-10 20:03:08 +01:00
7fcc8b09fa Fix #36: Make the program stop by crashing when the YouTube operational API instance is detected as sending unusual traffic 2023-02-10 12:02:39 +01:00
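A minimal sketch of what this detection could look like in the C++ crawler, assuming the instance answers with a page mentioning unusual traffic instead of JSON; the marker string and helper name are assumptions:

```c++
#include <cstdlib>
#include <iostream>
#include <string>

// Hypothetical helper: terminate as soon as the instance's response looks
// like an "unusual traffic" interstitial instead of the expected JSON.
void checkUnusualTraffic(const std::string& response)
{
    if (response.find("unusual traffic") != std::string::npos)
    {
        std::cerr << "The YouTube operational API instance is detected as sending unusual traffic, exiting!" << std::endl;
        exit(EXIT_FAILURE);
    }
}
```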
87d67e4e85 Correct the termination of COMMUNITY tab process due to missing page tokens 2023-02-10 00:37:28 +01:00
8f9b1275be Remove the Content-Type: application/json HTTP header when retrieving urls.txt inside a .zip 2023-02-09 02:07:10 +01:00
afd9e1b0b6 Add a verification that snippet/authorChannelId/value isn't null when using commentThreads for COMMUNITY
As it can happen, see https://www.youtube.com/channel/UCWeg2Pkate69NFdBeuRFTAw/community?lc=UgwGfjNxGuwqP8qYPPN4AaABAg&lb=UgkxYiEAo9-b1vWPasxFy13f959rrctQpZwW
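A minimal sketch of such a verification with nlohmann::json, assuming the `commentThreads` item layout documented by YouTube Data API v3; the helper name is hypothetical:

```c++
#include <nlohmann/json.hpp>
#include <optional>
#include <string>

using json = nlohmann::json;

// Returns the author channel id of a comment `snippet`, or std::nullopt when
// `authorChannelId/value` is missing or null, as in the COMMUNITY comment
// linked above.
std::optional<std::string> getAuthorChannelId(const json& snippet)
{
    if (snippet.contains("authorChannelId"))
    {
        const json& authorChannelId = snippet["authorChannelId"];
        if (authorChannelId.contains("value") && !authorChannelId["value"].is_null())
            return authorChannelId["value"].get<std::string>();
    }
    return std::nullopt;
}
```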
2023-02-09 01:51:22 +01:00
5a1df71bb9 Append to channels.txt all channels mentioned in the Wiki 2023-02-08 16:28:44 +01:00
622188d6d9 Add in urls.txt whether the URL is related to YouTube Data API v3 or the YouTube operational API 2023-02-08 16:05:03 +01:00
0c51bd05bc Fix #34: Correct JSON files by putting first line in another metadata file 2023-02-07 23:08:09 +01:00
e0f521d572 Restore ability to download whole archives
As API keys aren't written in the first line of JSON files.
2023-02-07 23:01:26 +01:00
e5a50bcba4 Remove the ability in channels.php to download a whole archive, so as not to leak the API keys used 2023-02-07 22:42:24 +01:00
2179e9b6f4 Add channels.php, adding support for downloading a zip (or a file within it) 2023-02-07 22:39:43 +01:00
e9b77369fb #31: Add zip files search 2023-02-07 20:15:36 +01:00
b45384bab7 Comment the WebSocket mechanism to work with an arbitrary number of independent sends 2023-02-07 18:14:49 +01:00
126cc75dc6 Make WebSocket able to manage arbitrary feedback to end-user
While the previous implementation was able to send two independent messages, now we can send an arbitrary number of independent messages.
2023-02-07 17:25:17 +01:00
7302679a81 Make websockets.php able to process blocking treatments 2023-02-07 01:22:26 +01:00
0dba8e0c7d Make a WebSocket example work with crawler.yt.lemnoslife.com 2023-01-31 01:05:09 +01:00
155d372186 Run php-cs-fixer fix --rules=@PSR12 websocket.php 2023-01-31 00:57:06 +01:00
bd184bd0f0 Rename chat.php to websocket.php 2023-01-30 22:24:02 +01:00
0193f05143 Copy-paste the quick example from the README.md of ratchetphp/Ratchet
5012dc9545 (a-quick-example)
2023-01-30 22:19:04 +01:00
931b2df563 Add static website/index.php 2023-01-30 22:14:05 +01:00
0f4b89ccd9 Correct typo: the channel tab is LIVE, not LIVES 2023-01-25 01:00:29 +01:00
4e162e34c3 Add comment in README.md about the usage of --no-keys or generating a YouTube Data API v3 key 2023-01-22 15:41:13 +01:00
10e8811817 Introduce {,MAIN_}EXIT_WITH_ERROR macros for exiting with an error 2023-01-22 15:17:14 +01:00
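A minimal sketch of what such a macro could look like; the actual definitions in the repository (notably the MAIN_ variant and any threadId handling) may differ:

```c++
#include <cstdlib>
#include <iostream>

// Print the error message to stderr, then terminate with a failure status.
#define EXIT_WITH_ERROR(message)           \
    do                                     \
    {                                      \
        std::cerr << message << std::endl; \
        exit(EXIT_FAILURE);                \
    } while (0)

// Usage example:
// EXIT_WITH_ERROR("Unable to open channels.txt!");
```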
0f15bb0235 #11: Add the discovery of channels having commented on ended livestreams 2023-01-22 15:15:27 +01:00
bdb4e6443a #11: Add current livestreams support to discover channels 2023-01-22 04:00:11 +01:00
d2391e5d54 Instead of looping on items when we expect only one, we just use items[0] 2023-01-22 02:19:26 +01:00
993d0b9771 Make PRINT not require specifying threadId 2023-01-22 02:04:03 +01:00
0fcb5a0426 #11: Treat COMMUNITY post comments to discover channels 2023-01-22 01:37:32 +01:00
57200da482 Add in README.md the fact that, as documented in #30, this algorithm is only known to be working fine on Linux 2023-01-21 22:20:45 +01:00
a0880c79bb #11: Update channel CHANNELS tab treatment following YouTube-operational-API/issues/121 closure 2023-01-21 02:24:42 +01:00
10c5c1d605 #11: Add the treatment of channels' CHANNELS tab, but postpone only the treatment of unlisted videos 2023-01-15 14:56:44 +01:00
51a70f6e54 #7: Make commentsCount and requestsPerChannel compatible with multithreading 2023-01-15 14:31:55 +01:00
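A minimal sketch of one way to make these shared counters thread-safe, assuming a std::mutex guards them; the function name and counter types are assumptions:

```c++
#include <map>
#include <mutex>
#include <string>

// Counters shared by the worker threads and the mutex guarding them.
unsigned int commentsCount = 0;
std::map<std::string, unsigned int> requestsPerChannel;
std::mutex countersMutex;

// Worker threads call this instead of touching the counters directly.
void onRequestTreated(const std::string& channelId, unsigned int commentsTreated)
{
    std::lock_guard<std::mutex> lock(countersMutex);
    commentsCount += commentsTreated;
    requestsPerChannel[channelId]++;
}
```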
aa97c94bf8 #11: Add a first iteration for the CHANNELS retrieval 2023-01-15 02:19:31 +01:00
d1b84335d1 #11: Add --youtube-operational-api-instance-url parameter and use exit(EXIT_{SUCCESS, FAILURE}) instead of exit({0, 1}) 2023-01-15 00:49:32 +01:00
6ce29051c0 Fix #26: Keep efficient search algorithm while keeping order (notably of the starting set) 2023-01-14 15:14:24 +01:00
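A minimal sketch of one way to combine efficient membership tests with preserved insertion order, pairing a std::set with a std::deque; whether the actual fix uses these exact containers is an assumption:

```c++
#include <deque>
#include <set>
#include <string>

std::set<std::string> channelsAlreadyTreatedOrToTreat; // fast membership tests
std::deque<std::string> channelsToTreat;               // preserves discovery order

// Only enqueue channels never seen before, without losing the order in which
// they were discovered (notably the starting set from channels.txt).
void addChannel(const std::string& channelId)
{
    if (channelsAlreadyTreatedOrToTreat.insert(channelId).second)
        channelsToTreat.push_back(channelId);
}
```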
ad9f96b33c Fix #24: Stop using macros for user inputs, notably to make releases 2023-01-08 18:26:20 +01:00
d498c86058 Fix #6: Add support for multiple keys to be resilient against exceeded quota errors 2023-01-08 17:59:08 +01:00
1ee767abbc Fix #23: YouTube Data API v3 PlaylistItems: list endpoint returns a playlistNotFound error for regular uploads playlists 2023-01-08 16:31:57 +01:00
7e35a6473a Fix #20: YouTube Data API v3 rarely and suddenly returns a commentsDisabled error, which involves an unwanted method switch
Also modified the compression command, as I got `sh: 1: zip: Argument list too long` when compressing the 248,868 JSON files of the French channel with the most subscribers.
2023-01-08 15:43:27 +01:00
ba37d6a111 Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated 2023-01-07 15:45:31 +01:00
5a7e5b6f78 Add a note about the timing percentage of findLatestTreatedCommentsForChannelsBeingTreated.py going backward 2023-01-07 15:35:12 +01:00
e3cab4c204 Fix #9: Make sure that in case of an error returned by the YouTube Data API v3 the algorithm treats it correctly
Note that in case of an error, the algorithm used to skip the received content, as if no `items` were in it.
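A minimal sketch of such a check with nlohmann::json, assuming the API reports failures through a top-level `error` object; the helper name is hypothetical:

```c++
#include <nlohmann/json.hpp>
#include <iostream>

using json = nlohmann::json;

// Returns true when the API response is an error, so the caller treats it as
// a failure instead of silently behaving as if no `items` were returned.
bool isApiError(const json& data)
{
    if (data.contains("error"))
    {
        std::cerr << "YouTube Data API v3 error: " << data["error"].dump() << std::endl;
        return true;
    }
    return false;
}
```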
2023-01-06 20:55:32 +01:00
156a621413 Fix #15: Provide an algorithm to retrieve the list of the 100 French channels with the most subscribers (and provide the list too) 2023-01-06 18:06:00 +01:00
fdfec17817 #7: Remove remaining undefined behavior due to missing mutex use 2023-01-06 18:00:51 +01:00
3ef5fa0707 Fix #17: Add to stdout live statistics of the number of comments treated per second 2023-01-06 17:55:16 +01:00
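A minimal sketch of printing such live statistics from a dedicated thread, assuming an atomic counter incremented by the worker threads; the names are assumptions:

```c++
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

std::atomic<unsigned int> commentsPerSecondCount{0}; // incremented by worker threads

// Every second, print and reset the counter to get comments treated per second.
void printCommentsPerSecond()
{
    while (true)
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        std::cout << "Comments treated per second: " << commentsPerSecondCount.exchange(0) << std::endl;
    }
}
```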
0259dfb3fb Fix #16: Provide an algorithm to determine the progress of retrieving comments for huge YouTube channels 2023-01-06 17:51:00 +01:00
b2fafb721c #1: Add GNU AGPLv3 license 2023-01-06 16:09:12 +01:00
01394769fd Add try/catch around json parser
As I got:
```
terminate called after throwing an instance of 'nlohmann::detail::parse_error'
terminate called recursively
  what():  [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
terminate called recursively
```
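A minimal sketch of the guard with nlohmann::json; returning an empty object so the caller skips the content, mirroring the error treatment of the surrounding commits, is an assumption:

```c++
#include <nlohmann/json.hpp>
#include <iostream>
#include <string>

using json = nlohmann::json;

json parseJsonSafely(const std::string& content)
{
    try
    {
        return json::parse(content);
    }
    catch (const json::parse_error& e)
    {
        // Truncated or empty responses previously terminated the program.
        std::cerr << "JSON parse error: " << e.what() << std::endl;
        return json::object();
    }
}
```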
2023-01-06 00:31:05 +01:00
5d13bd3c44 Modify removeChannelsBeingTreated.py to be more resilient against non-existent files in the treatment process 2023-01-04 03:10:28 +01:00
512485b1b8 #2: Add compression to channels/ folder
The following Python script can be used to compress an existing uncompressed
`channels/` folder.

```py
import os, shutil

path = 'channels/'

os.chdir(path)

# List the channel directories directly under `channels/`.
channelIds = next(os.walk('.'))[1]
for channelIndex, channelId in enumerate(channelIds):
    print(f'{channelIndex} / {len(channelIds)}: {channelId}')
    # Compress the channel directory into `CHANNEL_ID.zip`, then remove the
    # uncompressed directory.
    shutil.make_archive(channelId, 'zip', channelId)
    shutil.rmtree(channelId)
```
2023-01-04 03:06:33 +01:00