The search features of the YouTube UI and the YouTube Data API v3 do not search through video captions that are not auto-generated.
Benjamin Loison cba2535d97 Make search.py search across displayed captions.
Otherwise `Linux, is in millions of computers` doesn't match the not automatically generated caption of [`o8NPllzkFhE`](https://www.youtube.com/watch?v=o8NPllzkFhE). Not to be confused with the search across captions that already worked, for instance with `is in millions of computers, it`.
2023-02-24 14:46:00 +01:00
website Make search.py search across displayed captions. 2023-02-24 14:46:00 +01:00
.gitignore Add .gitignore to ignore {keys, channels}.txt 2023-02-13 06:18:42 +01:00
channels.txt Append to channels.txt all channels mentioned in the Wiki 2023-02-08 16:28:44 +01:00
findAlreadyTreatedCommentsCount.py Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated 2023-01-07 15:45:31 +01:00
findLatestTreatedCommentsForChannelsBeingTreated.py Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated 2023-01-07 15:45:31 +01:00
findTreatedChannelWithMostComments.py Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated 2023-01-07 15:45:31 +01:00
findTreatedChannelWithMostSubscribers.py Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated 2023-01-07 15:45:31 +01:00
keys.txt Fix #6: Add support for multiple keys to be resilient against exceeded quota errors 2023-01-08 17:59:08 +01:00
LICENSE #1: Add GNU AGPLv3 license 2023-01-06 16:09:12 +01:00
main.cpp Remove unused setFromVector function 2023-02-23 23:50:07 +01:00
Makefile #11: Add a first iteration for the CHANNELS retrieval 2023-01-15 02:19:31 +01:00
README.md Precise in README.md in which folder each command has to be ran 2023-02-23 23:48:40 +01:00
removeChannelsBeingTreated.py #48: Modify removeChannelsBeingTreated.py to temporarily solve the issue 2023-02-19 02:04:28 +01:00
retrieveTop100SubscribersFrance.py Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated 2023-01-07 15:45:31 +01:00

The algorithm:

To retrieve as many video captions as possible, we need as many YouTube video ids as possible, and therefore as many YouTube channels as possible. To discover the graph of YouTube channels with a breadth-first search, we proceed as follows:

  1. Provide a starting set of channels.
  2. Given a channel, retrieve other channels from its content by using the YouTube Data API v3 and the YouTube operational API, then repeat this step for each retrieved channel.
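
The two steps above can be sketched as a standard breadth-first traversal. This is only an illustration, not the code in main.cpp: `discover_channels` and `get_related_channels` are hypothetical names, and the toy graph stands in for the actual calls to the YouTube Data API v3 and the YouTube operational API.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional


def discover_channels(
    starting_channels: Iterable[str],
    get_related_channels: Callable[[str], Iterable[str]],
    max_channels: Optional[int] = None,
) -> List[str]:
    """Return channel ids in the order a breadth-first search treats them."""
    queue = deque()
    seen = set()
    # Step 1: seed the queue with the starting set of channels.
    for channel in starting_channels:
        if channel not in seen:
            seen.add(channel)
            queue.append(channel)
    treated = []
    # Step 2: treat the oldest channel in the queue and enqueue its new neighbors.
    while queue and (max_channels is None or len(treated) < max_channels):
        channel = queue.popleft()
        treated.append(channel)
        for neighbor in get_related_channels(channel):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return treated


# Toy graph (hypothetical data) replacing the API calls.
TOY_GRAPH = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}
print(discover_channels(["A"], lambda channel: TOY_GRAPH.get(channel, [])))
# prints ['A', 'B', 'C', 'D']
```

The `seen` set ensures that a channel reachable through several other channels is only treated once.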

A ready-to-use website instance of this project for end-users is hosted at: https://crawler.yt.lemnoslife.com

See more details on the Wiki.

Running the YouTube graph discovery algorithm:

Because of the current compression mechanism, Linux is the only known OS able to run this algorithm.

To clone the repository, run:

git clone https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine

Move to the cloned repository by running:

cd YouTube_captions_search_engine/

To install the dependencies of this project on an apt-based Linux distribution, make sure pip is installed and run:

sudo apt install nlohmann-json3-dev
pip install yt-dlp

To compile the YouTube graph discovery algorithm, run:

make

To see the command line arguments of the algorithm, run:

./youtubeCaptionsSearchEngine -h

To run the YouTube graph discovery algorithm, run:

./youtubeCaptionsSearchEngine

Unless you provide the argument --youtube-operational-api-instance-url https://yt.lemnoslife.com, you have to host your own instance of the YouTube operational API.

Unless you provide the argument --no-keys, you have to provide at least one YouTube Data API v3 key in keys.txt.
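
Providing several keys makes it possible to fall back on another key when one has exceeded its quota. A minimal sketch of that idea follows; it is not the implementation in main.cpp. It assumes keys.txt holds one key per line, and `QuotaExceededError`, `load_keys`, `call_with_key_rotation` and the `request` callable are hypothetical names introduced for illustration.

```python
from typing import Callable, List, Sequence


class QuotaExceededError(Exception):
    """Hypothetical error raised when a key has exhausted its quota."""


def load_keys(path: str = "keys.txt") -> List[str]:
    # Assumption: one key per line; blank lines are ignored.
    with open(path, encoding="utf-8") as keys_file:
        return [line.strip() for line in keys_file if line.strip()]


def call_with_key_rotation(keys: Sequence[str], request: Callable[[str], object]) -> object:
    """Try `request` with each key in turn until one does not hit its quota."""
    for key in keys:
        try:
            return request(key)
        except QuotaExceededError:
            # This key is exhausted, so move on to the next one.
            continue
    raise RuntimeError("All provided keys have exceeded their quota.")
```

Here `request` would wrap a single YouTube Data API v3 call made with the given key and raise `QuotaExceededError` when the API reports an exceeded quota.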

Hosting the website enabling users to make requests:

Move to the website/ folder by running:

cd website/

To install its dependencies, make sure composer is installed and run:

composer install
pip install webvtt-py

Add the following configuration to the Nginx configuration of your website:

    # Make `index.php` the default webpage of your website.
    index index.php;

    # Allow end-users to retrieve the content of a file within a channel zip.
    location /channels {
        # Escape the dots so that they only match a literal `.` before the extension.
        rewrite ^(.*)\.zip$ /channels.php;
        rewrite ^(.*)\.zip/(.*)\.json$ /channels.php;
        rewrite ^(.*)\.zip/(.*)\.txt$ /channels.php;
        rewrite ^(.*)\.zip/(.*)\.vtt$ /channels.php;
        # Allow end-users to list `channels/` content.
        autoindex on;
    }

    # Prevent end-users from accessing the requests of other end-users.
    location /users {
        deny all;
    }

    # Configure the websocket endpoint.
    location /websocket {
        # switch off logging
        access_log off;

        # redirect all HTTP traffic to localhost
        proxy_pass http://localhost:4430;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # WebSocket support (nginx 1.4)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # timeout extension, possibly keep this short if using a ping strategy
        proxy_read_timeout 99999s;
    }
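
To check which request paths the `/channels` rewrite rules hand over to channels.php, the patterns can be tried out in Python, whose `re` syntax behaves like Nginx's PCRE for expressions this simple. This is a sketch under stated assumptions: the patterns are those of the rewrite rules with the dot before each extension escaped, `is_rewritten` is a hypothetical helper, and the example paths (including the layout inside the zip) are made up for illustration.

```python
import re

# Patterns equivalent to the rewrite rules, with literal dots escaped.
REWRITE_PATTERNS = [
    r"^(.*)\.zip$",
    r"^(.*)\.zip/(.*)\.json$",
    r"^(.*)\.zip/(.*)\.txt$",
    r"^(.*)\.zip/(.*)\.vtt$",
]


def is_rewritten(uri: str) -> bool:
    """Return True if at least one rewrite pattern matches the request path."""
    return any(re.search(pattern, uri) for pattern in REWRITE_PATTERNS)


# Hypothetical request paths.
for uri in [
    "/channels/UC_example.zip",
    "/channels/UC_example.zip/captions/video.vtt",
    "/channels/UC_example.zip/thumbnail.png",
    "/channels/",
]:
    print(uri, is_rewritten(uri))
```

Only the first two paths match a pattern; the others are left to Nginx's regular file handling, for instance the `autoindex` listing for `/channels/`.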

Start the websocket worker by running:

php websockets.php