YouTube_captions_search_engine/README.md

The algorithm:

To retrieve as many video captions as possible, we need as many YouTube video ids as possible, and therefore as many YouTube channels as possible. We discover the graph of YouTube channels with a breadth-first search, proceeding as follows:

  1. Provide a starting set of channels.
  2. Given a channel, retrieve other channels from its content by using the YouTube Data API v3 and the YouTube operational API, then repeat this step for each retrieved channel.
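
The two steps above can be sketched in Python as a plain breadth-first search. This is only an illustration and not the project's implementation: `get_related_channels` is a hypothetical callback standing in for the calls to the YouTube Data API v3 and the YouTube operational API, and the channel ids in the toy graph are made up.

```python
from collections import deque
from typing import Callable, Iterable


def discover_channels(
    starting_channels: Iterable[str],
    get_related_channels: Callable[[str], Iterable[str]],
) -> list[str]:
    """Return channel ids in the order a breadth-first search discovers them."""
    seen = set()
    queue = deque()
    order = []
    # Step 1: seed the queue with the starting set of channels.
    for channel in starting_channels:
        if channel not in seen:
            seen.add(channel)
            queue.append(channel)
    # Step 2: treat each channel and enqueue the channels it leads to.
    while queue:
        channel = queue.popleft()
        order.append(channel)
        for related in get_related_channels(channel):
            # The `seen` set keeps each channel from being treated twice.
            if related not in seen:
                seen.add(related)
                queue.append(related)
    return order


# Toy graph standing in for the API responses (hypothetical ids).
TOY_GRAPH = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}

if __name__ == "__main__":
    print(discover_channels(["A"], lambda c: TOY_GRAPH.get(c, [])))
```

The first-in, first-out queue is what makes the traversal breadth-first: channels close to the starting set are treated before channels that are further away.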

A ready-to-use website instance of this project is hosted at: https://crawler.yt.lemnoslife.com

See more details on the Wiki.

Running the YouTube graph discovery algorithm:

Because of the current compression mechanism, Linux is the only known OS able to run this algorithm.

To install the dependencies of this project on an apt-based Linux distribution, run:

sudo apt install nlohmann-json3-dev yt-dlp

To compile the YouTube graph discovery algorithm, run:

make

To see the command-line arguments of the algorithm, run:

./youtubeCaptionsSearchEngine -h

To run the YouTube graph discovery algorithm, run:

./youtubeCaptionsSearchEngine

Unless you provide the argument --youtube-operational-api-instance-url https://yt.lemnoslife.com, you have to host your own instance of the YouTube operational API.

Unless you provide the argument --no-keys, you have to provide at least one YouTube Data API v3 key in keys.txt.
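
As a quick sanity check before starting a run, a small Python helper can verify that keys.txt holds at least one key. This is a sketch under an assumption that this README does not confirm: that keys.txt contains one key per line. The function names are hypothetical and are not part of the project.

```python
from pathlib import Path


def parse_keys(text: str) -> list[str]:
    """Return the non-empty lines of `text`, assuming one key per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_keys(path: str = "keys.txt") -> list[str]:
    """Read the keys from `path` and fail early if there are none."""
    keys = parse_keys(Path(path).read_text(encoding="utf-8"))
    if not keys:
        raise ValueError(f"{path} contains no key; add one or use --no-keys")
    return keys
```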

Hosting the website enabling users to make requests:

To install its dependencies, make sure pip and composer are installed, then run:

composer install
pip install webvtt-py

Add the following configuration to the Nginx configuration of your website:

    # Make `index.php` the default webpage of your website.
    index index.php;

    # Allow end-users to retrieve the content of a file within a channel zip.
    location /channels {
        # The dots are escaped so that they only match a literal `.`.
        rewrite ^(.*)\.zip$ /channels.php;
        rewrite ^(.*)\.zip/(.*)\.json$ /channels.php;
        rewrite ^(.*)\.zip/(.*)\.txt$ /channels.php;
        rewrite ^(.*)\.zip/(.*)\.vtt$ /channels.php;
        # Allow end-users to list `channels/` content.
        autoindex on;
    }

    # Prevent end-users from accessing other end-users' requests.
    location /users {
        deny all;
    }

    # Configure the websocket endpoint.
    location /websocket {
        # switch off logging
        access_log off;

        # redirect all HTTP traffic to localhost
        proxy_pass http://localhost:4430;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # WebSocket support (nginx 1.4)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # timeout extension, possibly keep this short if using a ping strategy
        proxy_read_timeout 99999s;
    }
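
To see which request paths the `rewrite` rules of the `/channels` location hand over to `channels.php`, patterns of the same shape can be tried out in Python. This is only a sketch: Python's `re` module stands in for the regular expression engine used by Nginx, the request paths are made up, and the dot before each extension is written escaped here so that it only matches a literal `.`.

```python
import re

# Patterns of the same shape as the `rewrite` rules, with escaped dots.
REWRITE_PATTERNS = [
    re.compile(r"^(.*)\.zip$"),
    re.compile(r"^(.*)\.zip/(.*)\.json$"),
    re.compile(r"^(.*)\.zip/(.*)\.txt$"),
    re.compile(r"^(.*)\.zip/(.*)\.vtt$"),
]

# The same first pattern with an unescaped dot, for comparison.
UNESCAPED_ZIP = re.compile(r"^(.*).zip$")


def is_rewritten(path: str) -> bool:
    """Tell whether `path` matches at least one of the patterns above."""
    return any(pattern.search(path) for pattern in REWRITE_PATTERNS)


if __name__ == "__main__":
    for path in (
        "/channels/UCexample.zip",
        "/channels/UCexample.zip/videoId.json",
        "/channels/UCexample.zip/videoId.mp4",
        "/channels/",
    ):
        print(path, is_rewritten(path))
```

An unescaped dot matches any character, so `UNESCAPED_ZIP` also accepts a path such as `/channels/UCexample-zip`, whereas the escaped patterns do not.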

Start the websocket worker by running:

php websockets.php