Make a website with a search engine notably based on the captions extracted #31

Closed
opened 2023-01-25 22:32:37 +01:00 by Benjamin_Loison · 6 comments

Could propose two options:

- filtering only on captions
- filtering on all data retrieved (~~as an archive with matched files~~; make sure to remove/censor API keys, although doing so currently disables the ability to download whole archives)

Could also add `channels.txt` (done) and logs retrieval. Make sure that the logs contain nothing secret: that is currently not the case, cf. at least [1.](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/5a1df71bb97a7e45242756f895f04d7590f966ca/main.cpp#L711) and [2.](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/5a1df71bb97a7e45242756f895f04d7590f966ca/main.cpp#L716). It's not much of a problem, as it's a *small* file that we can post-process. I am waiting to have a substantial log to verify experimentally that only these two occurrences are problematic.
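
As an illustration of the post-processing mentioned above, here is a minimal Python sketch that censors API keys in a log before publishing it. The function name `redact_api_keys` and the `REDACTED` placeholder are hypothetical. The sketch assumes that keys show up either as a `key=` URL query parameter or as a bare token shaped like a Google API key (`AIza` followed by 35 URL-safe characters); the actual log format may differ and should be checked against a real log.

```python
import re

# Assumption: keys appear as a `key=` query parameter in logged URLs...
KEY_PARAM = re.compile(r"([?&]key=)[^&\s\"']+")
# ...or as a bare token shaped like a Google API key.
BARE_KEY = re.compile(r"AIza[0-9A-Za-z_-]{35}")


def redact_api_keys(text: str, placeholder: str = "REDACTED") -> str:
    """Return `text` with anything that looks like an API key replaced."""
    # Keep the `?key=` / `&key=` prefix and replace only the value.
    text = KEY_PARAM.sub(lambda match: match.group(1) + placeholder, text)
    # Catch keys that were logged outside of a URL.
    return BARE_KEY.sub(placeholder, text)
```

Running this over each line of the logs before serving them would leave everything else untouched, so the result stays useful for debugging.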
Benjamin_Loison added the epic, enhancement and captions labels 2023-01-25 22:32:37 +01:00
Author
Owner

Introduce https://crawler.yt.lemnoslife.com for this purpose.

The plan is to use WebSocket, as it's well suited to this use case and widely [compatible](https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API#browser_compatibility).
Benjamin_Loison added the high priority label 2023-01-30 21:17:48 +01:00
Benjamin_Loison added this to the 0.0.1 milestone 2023-02-10 17:14:13 +01:00
Benjamin_Loison added a new dependency 2023-02-10 20:22:09 +01:00
Author
Owner

Should add `vtt` parsing so that matching isn't limited by line wrapping.

Should add an option to only search through captions.

Should also update [`findLatestTreatedCommentsForChannelsBeingTreated.py`](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/fa7da64879b2f8180dcf0b6bd964f9d9af3e2dde/findLatestTreatedCommentsForChannelsBeingTreated.py) with all features to better evaluate the algorithm's progress.
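
To make the `vtt` parsing idea concrete, here is a minimal Python sketch that turns a WebVTT file into a list of cues whose text is joined across wrapped lines. The names `Cue` and `parse_vtt` are hypothetical, and the sketch only handles the basic cue layout (a timing line followed by text lines); it is not the parser used by this project.

```python
import re
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds
    end: float    # seconds
    text: str     # cue text with wrapped lines joined by single spaces


# A timing line looks like `00:00:01.000 --> 00:00:03.500`; hours are optional.
TIMING = re.compile(
    r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})"
)
TAG = re.compile(r"<[^>]*>")  # inline markup such as <c> or inline timestamps


def _seconds(hours, minutes, seconds, millis) -> float:
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_vtt(content: str) -> List[Cue]:
    """Parse WebVTT text into cues, ignoring the header and NOTE blocks."""
    cues = []
    # Blocks are separated by blank lines.
    for block in re.split(r"\n\s*\n", content.replace("\r\n", "\n")):
        lines = block.strip().split("\n")
        for index, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                groups = match.groups()
                raw = " ".join(TAG.sub("", text_line) for text_line in lines[index + 1:])
                text = " ".join(raw.split())  # collapse line wrapping and extra spaces
                cues.append(Cue(_seconds(*groups[:4]), _seconds(*groups[4:]), text))
                break
    return cues
```

Once cues are available as structured data, a search can run on the joined text instead of on individual wrapped lines.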
Benjamin_Loison added the website label 2023-02-13 22:56:45 +01:00
Author
Owner

Working at the PHP level didn't make much sense:

```php
$searchOnlyCaptions = str_starts_with($msg, 'search-only-captions ');
$msg = substr($msg, strpos($msg, ' ') + 1);
```
Author
Owner

Could later add a link directly to the YouTube video timestamp. However, as we aren't limited to line wrapping, this feature isn't easy to implement efficiently.
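
One possible approach, sketched in Python under stated assumptions: join the caption texts into a single string, remember the character offset at which each caption starts, and use a binary search to map a match back to the caption it begins in. The names `find_matches` and `youtube_link` are hypothetical, the input is assumed to be a list of `(start_seconds, text)` pairs, and the link relies on YouTube's `t=<seconds>s` URL parameter.

```python
import bisect
from typing import Iterator, List, Tuple

# (start_seconds, text) pairs, assumed to come from parsed captions.
Cue = Tuple[float, str]


def find_matches(cues: List[Cue], query: str) -> Iterator[float]:
    """Yield the start time of the cue in which each match of `query` begins."""
    offsets, parts, position = [], [], 0
    for _, text in cues:
        text = text.lower()
        offsets.append(position)   # offset of this cue in the joined text
        parts.append(text)
        position += len(text) + 1  # +1 for the joining space
    joined = " ".join(parts)
    needle = " ".join(query.lower().split())
    if not needle:
        return
    index = joined.find(needle)
    while index != -1:
        # Last cue whose starting offset is <= the match offset.
        cue_index = bisect.bisect_right(offsets, index) - 1
        yield cues[cue_index][0]
        index = joined.find(needle, index + 1)


def youtube_link(video_id: str, seconds: float) -> str:
    """Build a link that opens the video at the given time."""
    return f"https://www.youtube.com/watch?v={video_id}&t={int(seconds)}s"
```

The offsets list is built once per video, so each match costs only a logarithmic lookup. Because the function yields every match, it could also serve to list all occurrences within a video.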
Author
Owner

We could list all occurrences within a video.

Author
Owner

Note that the returned match timestamps may not be as precise as they could be (for instance, the search may return the start timestamp of the previous caption). Ideally, this should be investigated.
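
If the caption start time turns out to be too coarse, one option is to estimate where in the caption the match falls. The Python sketch below interpolates linearly between a caption's start and end times based on the character offset of the match. The function name `estimate_match_time` is hypothetical, and the sketch assumes speech is spread evenly over the caption, which is only an approximation.

```python
def estimate_match_time(start: float, end: float, text: str, offset: int) -> float:
    """Estimate when the character at `offset` in `text` is spoken.

    Assumption: characters are spread evenly between `start` and `end`.
    """
    if not text or end <= start:
        return start  # nothing to interpolate over
    # Clamp so that out-of-range offsets stay within the caption.
    fraction = min(max(offset / len(text), 0.0), 1.0)
    return start + (end - start) * fraction
```

Comparing this estimate with the timestamp currently returned for a few known matches would be a cheap way to check whether the returned timestamp belongs to the right caption.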