YouTube_captions_search_engine

Benjamin_Loison/YouTube_captions_search_engine

Author	SHA1	Message	Date
Benjamin_Loison	94ad823e3b	Modify website to support new sub-folders architecture	2023-02-13 05:45:08 +01:00
Benjamin_Loison	454503271e	Fix #37 : Use a number of channels seen (possibly repeated) instead of YouTube Data API v3 Comment(Thread): resource	2023-02-12 16:31:27 +01:00
Benjamin_Loison	54fe40e588	Add logging to `exec` and make it crashless, `requests` and `captions` folders support for compressing, clean captions support for videos being livestreams and videos starting with `-`	2023-02-12 16:24:16 +01:00
Benjamin_Loison	8cf5698051	Move YouTube API requests logging to `requests/` channel sub-folder	2023-02-10 20:17:49 +01:00
Benjamin_Loison	04c59eb025	Fix #13 : Add captions extraction
I was about to commit in addition:
```c++
// Due to videos with automatically generated captions but being set to `Off` by default aren't retrieved with `--sub-langs '.*orig'`.
// My workaround is to first call YouTube Data API v3 Captions: list endpoint with `part=snippet` and retrieve the language that has `"trackKind": "asr"` (automatic speech recognition) in `snippet`.
/*json data = getJson(threadId, "captions?part=snippet&videoId=" + videoId, true, channelToTreat),
     items = data["items"];
for(const auto& item : items)
{
    json snippet = item["snippet"];
    if(snippet["trackKind"] == "asr")
    {
        string language = snippet["language"];
        cmd = cmdCommonPrefix + "--write-auto-subs --sub-langs '" + language + "-orig' --sub-format ttml --convert-subs vtt" + cmdCommonPostfix;
        exec(threadId, cmd);
        // As there should be a single automatic speech recognized track, there is no need to go through all tracks.
        break;
    }
}*/
```
Instead of:
```c++
cmd = cmdCommonPrefix + "--write-auto-subs --sub-langs '.*orig' --sub-format ttml --convert-subs vtt" + cmdCommonPostfix;
exec(threadId, cmd);
```
But I realized that, as the GitHub comment I was about to add to https://github.com/yt-dlp/yt-dlp/issues/2655, I was wrong:
> `yt-dlp --cookies cookies.txt --sub-langs 'en.*,.*orig' --write-auto-subs https://www.youtube.com/watch?v=tQqDBySHYlc` work as expected. Many thanks again.
>
> ```
> 'subtitleslangs': ['en.*','.*orig'],
> 'writeautomaticsub': True,
> ```
>
> Work as expected too. Thank you
>
> Very sorry for the video sample. I even not watched it.

Thank you for this workaround. However note that videos having automatically generated subtitles but being set to `Off` by default aren't retrieved with your method (example of such video: [`mozyXsZJnQ4`](https://www.youtube.com/watch?v=mozyXsZJnQ4)).
My workaround is to first call [YouTube Data API v3](https://developers.google.com/youtube/v3) [Captions: list](https://developers.google.com/youtube/v3/docs/captions/list) endpoint with [`part=snippet`](https://developers.google.com/youtube/v3/docs/captions/list#part) and retrieve the [`language`](https://developers.google.com/youtube/v3/docs/captions#snippet.language) that has [`"trackKind": "asr"`](https://developers.google.com/youtube/v3/docs/captions#snippet.trackKind) (automatic speech recognition) in [`snippet`](https://developers.google.com/youtube/v3/docs/captions#snippet).	2023-02-10 20:03:08 +01:00
Benjamin_Loison	9b792015fa	Fix #36 : Make the program stop by crashing when the YouTube operational API instance is detected as sending unusual traffic	2023-02-10 12:02:39 +01:00
Benjamin_Loison	4508b12b6c	Correct the termination of `COMMUNITY` tab process due to missing page tokens	2023-02-10 00:37:28 +01:00
Benjamin_Loison	ea604cce40	Remove the `Content-Type: application/json` HTTP header when retrieving `urls.txt` inside a `.zip`	2023-02-09 02:07:10 +01:00
Benjamin_Loison	01aac3f66e	Add a verification that `snippet/authorChannelId/value` isn't null when using `commentThreads` for `COMMUNITY` As it can happen cf https://www.youtube.com/channel/UCWeg2Pkate69NFdBeuRFTAw/community?lc=UgwGfjNxGuwqP8qYPPN4AaABAg&lb=UgkxYiEAo9-b1vWPasxFy13f959rrctQpZwW	2023-02-09 01:51:22 +01:00
Benjamin_Loison	d0dee043c6	Append to `channels.txt` all channels mentioned in the Wiki	2023-02-08 16:28:44 +01:00
Benjamin_Loison	d5c55e756a	Add in `urls.txt` if the URL is related to YouTube Data API v3 or YouTube operational API	2023-02-08 16:05:03 +01:00
Benjamin_Loison	50306aff5d	Fix #34 : Correct JSON files by putting first line in another metadata file	2023-02-07 23:08:09 +01:00
Benjamin_Loison	4fa433495b	Restore ability to download whole archives As API keys aren't written in the first line of JSON files.	2023-02-07 23:01:26 +01:00
Benjamin_Loison	a0ba474fcc	Remove ability in `channels.php` to download whole archive for not leaking API keys used	2023-02-07 22:42:24 +01:00
Benjamin_Loison	b4cb072770	Add `channels.php` adding support for (file in) zip download	2023-02-07 22:39:43 +01:00
Benjamin_Loison	fda8fc728e	#31 : Add zip files search	2023-02-07 20:15:36 +01:00
Benjamin_Loison	82e597f205	Comment WebSocket mechanism to work with an arbitrary number of independent sends	2023-02-07 18:14:49 +01:00
Benjamin_Loison	03c2566a20	Make WebSocket able to manage arbitrary feedback to end-user While previous implementation was able to send two independent messages, now we can send an arbitrary amount of independent messages.	2023-02-07 17:25:17 +01:00
Benjamin_Loison	a116b29df9	Make `websockets.php` able to proceed blocking treatments	2023-02-07 01:22:26 +01:00
Benjamin_Loison	1fe92ec2d0	Make a WebSocket example work with `crawler.yt.lemnoslife.com`	2023-01-31 01:05:09 +01:00
Benjamin_Loison	411a3db465	Run `php-cs-fixer fix --rules=@PSR12 websocket.php`	2023-01-31 00:57:06 +01:00
Benjamin_Loison	08b465753d	Rename `chat.php` to `websocket.php`	2023-01-30 22:24:02 +01:00
Benjamin_Loison	45c5d8a940	Copy-pasted the `README.md` quick example of `ratchetphp/Ratchet` https://github.com/ratchetphp/Ratchet/tree/5012dc954541b40c5599d286fd40653f5716a38f#a-quick-example	2023-01-30 22:19:04 +01:00
Benjamin_Loison	668aa608ed	Add static `website/index.php`	2023-01-30 22:14:05 +01:00
Benjamin_Loison	c746d43ddf	Correct typo: the channel tab is `LIVE`, not `LIVES`	2023-01-25 01:00:29 +01:00
Benjamin_Loison	05cd243abd	Add comment in `README.md` about the usage of `--no-keys` or generating a YouTube Data API v3 key	2023-01-22 15:41:13 +01:00
Benjamin_Loison	9d40fef429	Introduce `{,MAIN_}EXIT_WITH_ERROR` macros for exiting with an error	2023-01-22 15:17:14 +01:00
Benjamin_Loison	d34fade0cd	#11 : Add the discovering of channels having commented on ended livestreams	2023-01-22 15:15:27 +01:00
Benjamin_Loison	68b1f9a77f	#11 : Add current livestreams support to discover channels	2023-01-22 04:00:11 +01:00
Benjamin_Loison	c17a33d181	Instead of looping on `items` where we expect only one to be, we just use `items[0]`	2023-01-22 02:19:26 +01:00
Benjamin_Loison	59dc5676cc	Make `PRINT` not require specifying `threadId`	2023-01-22 02:04:03 +01:00
Benjamin_Loison	548a797ee8	#11 : Treat `COMMUNITY` post comments to discover channels	2023-01-22 01:37:32 +01:00
Benjamin_Loison	46ef8146f8	Add in `README.md` the fact that, as documented in #30, this algorithm is only known to be working fine on Linux	2023-01-21 22:20:45 +01:00
Benjamin_Loison	4133faad41	#11 : Update channel `CHANNELS` tab treatment following YouTube-operational-API/issues/121 closure	2023-01-21 02:24:42 +01:00
Benjamin_Loison	fced9e0a3a	#11 : Add the treatment of channels' tab, but only postpone unlisted videos treatment	2023-01-15 14:56:44 +01:00
Benjamin_Loison	f114aac0cf	#7 : Make `commentsCount` and `requestsPerChannel` compatible with multithreading	2023-01-15 14:31:55 +01:00
Benjamin_Loison	7456685f2b	#11 : Add a first iteration for the `CHANNELS` retrieval	2023-01-15 02:19:31 +01:00
Benjamin_Loison	270c48da02	#11 : Add `--youtube-operational-api-instance-url` parameter and use `exit(EXIT_{SUCCESS, FAILURE})` instead of `exit({0, 1})`	2023-01-15 00:49:32 +01:00
Benjamin_Loison	f6c11b54f3	Fix #26 : Keep efficient search algorithm while keeping order (notably of the starting set)	2023-01-14 15:14:24 +01:00
Benjamin_Loison	27cd5c3a64	Fix #24 : Stop using macros for user inputs to notably make releases	2023-01-08 18:26:20 +01:00
Benjamin_Loison	eb805f5ced	Fix #6 : Add support for multiple keys to be resilient against exceeded quota errors	2023-01-08 17:59:08 +01:00
Benjamin_Loison	d6f6b26361	Fix #23 : YouTube Data API v3 PlaylistItems: list endpoint returns `playlistNotFound` error for regular `uploads` ones	2023-01-08 16:31:57 +01:00
Benjamin_Loison	b3779fe49a	Fix #20 : YouTube Data API v3 rarely and suddenly returns a `commentsDisabled` error, which triggers an unwanted method switch Also modified compression command, as I got `sh: 1: zip: Argument list too long` when compressing the 248,868 JSON files of the most-subscribed French channel.	2023-01-08 15:43:27 +01:00
Benjamin_Loison	3ae0f4e924	Make all Python scripts executable and add `findAlreadyTreatedCommentsCount.py` to find how many comments were already treated	2023-01-07 15:45:31 +01:00
Benjamin_Loison	3758405f52	Add a note about the timing percentage of `findLatestTreatedCommentsForChannelsBeingTreated.py` going backward	2023-01-07 15:35:12 +01:00
Benjamin_Loison	71e4bd95a9	Fix #9 : Make sure that in case of error returned by the YouTube Data API v3 the algorithm treats it correctly Note that in case of error the algorithm used to skip the received content, as if just no `items` were in it.	2023-01-06 20:55:32 +01:00
Benjamin_Loison	34bbc216f6	Fix #15 : Provide an algorithm to retrieve the list of 100 French channels with most subscribers (and provide it too)	2023-01-06 18:06:00 +01:00
Benjamin_Loison	baec8fcb6c	#7 : Remove remaining undefined behavior due to missing mutex use	2023-01-06 18:00:51 +01:00
Benjamin_Loison	773f86c551	Fix #17 : Add to `stdout` live statistics of the number of comments treated per second	2023-01-06 17:55:16 +01:00
Benjamin_Loison	f436007836	Fix #16 : Provide an algorithm to determine the progress of retrieving comments for huge YouTube channels	2023-01-06 17:51:00 +01:00
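
The workaround described in commit 04c59eb025 (find the caption track whose `trackKind` is `asr`, then ask yt-dlp for `<language>-orig`) can be sketched as below. This is only an illustration under stated assumptions: the commit message says the author concluded this approach was not needed, and the `CaptionSnippet` struct, `findAsrLanguage` and `buildSubtitleOptions` are hypothetical names that do not come from the repository. The sketch assumes the Captions: list response has already been parsed, so it needs no JSON library.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, already-parsed view of one `snippet` of a Captions: list response.
struct CaptionSnippet
{
    std::string language;
    std::string trackKind;
};

// Return the language of the first automatic speech recognition ("asr") track, if any.
// As the commit message notes, a single such track is expected, so the first match is enough.
std::optional<std::string> findAsrLanguage(const std::vector<CaptionSnippet>& snippets)
{
    for(const auto& snippet : snippets)
    {
        if(snippet.trackKind == "asr")
        {
            return snippet.language;
        }
    }
    return std::nullopt;
}

// Build the yt-dlp subtitle options quoted in the commit message for a given language.
std::string buildSubtitleOptions(const std::string& language)
{
    assert(!language.empty());
    return "--write-auto-subs --sub-langs '" + language + "-orig' --sub-format ttml --convert-subs vtt";
}
```

A caller would run `findAsrLanguage` on the parsed tracks of a video and, only when a language is returned, pass the result of `buildSubtitleOptions` between its own command prefix and postfix.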

71 Commits
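
Commit f6c11b54f3 mentions keeping an efficient search while preserving order, notably that of the starting set. The log does not show how the repository does this; one common way, sketched here as an assumption, pairs a `std::vector` (insertion order) with a `std::unordered_set` (average constant-time membership tests). `OrderedChannelSet` and its member names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical container: remembers insertion order and answers membership queries quickly.
class OrderedChannelSet
{
    std::vector<std::string> order;
    std::unordered_set<std::string> seen;

public:
    // Add a channel id if it was not seen before; return whether it was inserted.
    bool add(const std::string& channelId)
    {
        if(!seen.insert(channelId).second)
        {
            return false;
        }
        order.push_back(channelId);
        // Both containers always hold exactly the same ids.
        assert(order.size() == seen.size());
        return true;
    }

    // Average constant-time lookup, independent of insertion order.
    bool contains(const std::string& channelId) const
    {
        return seen.count(channelId) > 0;
    }

    // Channel ids in the order they were first added.
    const std::vector<std::string>& inOrder() const
    {
        return order;
    }
};
```

The cost of this design is storing every id twice; in exchange, iteration follows the order of the starting set and duplicates are rejected without a linear scan.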