Benjamin_Loison opened issue Benjamin_Loison/YouTube_captions_search_engine#46 2023-02-17 16:54:09 +01:00
yt-dlp seems able to download more live chat than I do
Benjamin_Loison commented on issue Benjamin_Loison/YouTube_captions_search_engine#31 2023-02-17 01:02:41 +01:00
Make a website with a search engine notably based on the captions extracted

Note that the returned match timestamps may not be as precise as they could be (for instance, the search may return the beginning timestamp of the previous caption). Ideally, this should be investigated.

Benjamin_Loison opened issue Benjamin_Loison/YouTube_captions_search_engine#45 2023-02-16 23:32:14 +01:00
What does the website return for a video with two captions matching the query?
Benjamin_Loison commented on issue Benjamin_Loison/YouTube_captions_search_engine#40 2023-02-16 21:05:06 +01:00
Publish nginx configuration

Note that, as I'm hosting multiple websites, I'm using a private subdomain private.sub.domain to select which website (here the YouTube operational API one) to talk to. However reaching this…

Benjamin_Loison commented on issue Benjamin_Loison/YouTube_captions_search_engine#35 2023-02-16 13:19:18 +01:00
Prepare the presentation

To verify the correct format of channels.txt, as I ran dos2unix on it while the algorithm was running:

verifyChannels.py:

#!/usr/bin/python3

with open('channels.txt') as f:
  
Benjamin_Loison commented on issue Benjamin_Loison/YouTube_captions_search_engine#35 2023-02-16 13:16:40 +01:00
Prepare the presentation

To verify that the starting set was treated:

isStartingSetTreated.py:

#!/usr/bin/python3

import os

with open('newChannels.txt') as f:
    lines = f.read().splitlines()
    for
Benjamin_Loison commented on issue Benjamin_Loison/YouTube_captions_search_engine#35 2023-02-16 12:12:27 +01:00
Prepare the presentation

Concerning channels/, because of the crashes that occurred while the process was still unstable, I checked the archives using:

find -name '*.zip' -exec unzip -t {} \; 
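
The same integrity check can be sketched in Python, in the spirit of the other verification scripts in this issue. This is a minimal sketch and not the project's code: the function name `find_corrupt_archives` is hypothetical, and it relies only on the standard `zipfile` module, whose `testzip()` reads every member and returns the name of the first corrupt one (or `None`).

```python
import zipfile
from pathlib import Path


def find_corrupt_archives(root):
    """Return (path, reason) pairs for every .zip under root that fails an integrity check.

    Hypothetical helper, roughly equivalent to running `unzip -t` on each archive.
    """
    corrupt = []
    for path in sorted(Path(root).rglob('*.zip')):
        try:
            with zipfile.ZipFile(path) as archive:
                # testzip() checks the CRC of every member and returns the first bad name, or None.
                bad_member = archive.testzip()
        except (zipfile.BadZipFile, OSError) as error:
            # Typical for an archive truncated by a crash: it cannot even be opened.
            corrupt.append((path, str(error)))
            continue
        if bad_member is not None:
            corrupt.append((path, f'corrupt member {bad_member}'))
    return corrupt
```

Calling `find_corrupt_archives('channels')` would list the archives worth re-downloading, assuming the archives live under a `channels` directory relative to the working directory.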
Benjamin_Loison opened issue Benjamin_Loison/YouTube_captions_search_engine#44 2023-02-15 23:52:16 +01:00
Improve indexing of website
Benjamin_Loison commented on issue Benjamin_Loison/YouTube_captions_search_engine#25 2023-02-15 16:20:13 +01:00
Make a not pre-release release

Will publish such a release after having treated all the channels I provided it initially.

Benjamin_Loison opened issue Benjamin_Loison/YouTube_captions_search_engine#43 2023-02-15 00:00:05 +01:00
Could wonder whether doing our own speech-to-text would make sense for videos
Benjamin_Loison commented on issue Benjamin_Loison/YouTube_captions_search_engine#35 2023-02-14 23:50:53 +01:00
Prepare the presentation

Also verifying quality by verifying debug/*.err content:

cat *.err 
Benjamin_Loison closed issue Benjamin_Loison/YouTube_captions_search_engine#14 2023-02-14 23:44:10 +01:00
Current code is retrieving a maximum number of channels, not videos
Benjamin_Loison closed issue Benjamin_Loison/YouTube_captions_search_engine#2 2023-02-14 23:43:19 +01:00
Add data logging
Benjamin_Loison commented on issue Benjamin_Loison/YouTube_captions_search_engine#2 2023-02-14 23:43:00 +01:00
Add data logging

My 604G of free storage is already enough for a nice dataset: it would take 604 / 14 ≈ 43 days to fill.
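
The estimate above can be reproduced with a few lines of Python. This is only a sketch: the comment does not state the unit of the divisor, so the code assumes that 14 is a fill rate per day expressed in the same unit as the 604 of free storage, and `days_to_fill` is a hypothetical helper name.

```python
FREE_STORAGE = 604  # free storage quoted in the comment
FILL_RATE_PER_DAY = 14  # divisor quoted in the comment; assumed to be the same unit per day


def days_to_fill(free_storage, fill_rate_per_day):
    """Return how many days it takes to fill the free storage at a constant daily rate."""
    return free_storage / fill_rate_per_day


print(round(days_to_fill(FREE_STORAGE, FILL_RATE_PER_DAY)))  # prints 43
```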

Benjamin_Loison commented on issue Benjamin_Loison/YouTube_captions_search_engine#42 2023-02-14 23:41:31 +01:00
Could propose a version that can be run on multiple computers

Note that the main problem might be having multiple YouTube operational API instances running on multiple IPs, cf. [this Git issue comment](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/11

Benjamin_Loison opened issue Benjamin_Loison/YouTube_captions_search_engine#42 2023-02-14 23:38:36 +01:00
Could propose a version that can be run on multiple computers
Benjamin_Loison closed issue Benjamin_Loison/YouTube_captions_search_engine#21 2023-02-14 23:35:30 +01:00
Delete WIP channels/ archives
eb8431746e Make the first channel of channels.txt being treated again, solve temporary empty response from YouTube Data API v3 issue and temporarily remove sanity check failing very rarely #39