Benjamin_Loison/YouTube_captions_search_engine

Add support for channelsToTreat to be empty

It's the case when providing a single channel in `channels.txt` for
instance.

2023-02-23 23:45:36 +01:00

website

#19 : Detail how to run the website and reference channels.txt on it

2023-02-23 23:12:18 +01:00

.gitignore

Add .gitignore to ignore {keys, channels}.txt

2023-02-13 06:18:42 +01:00

channels.txt

Append to channels.txt all channels mentioned in the Wiki

2023-02-08 16:28:44 +01:00

findAlreadyTreatedCommentsCount.py

Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated

2023-01-07 15:45:31 +01:00

findLatestTreatedCommentsForChannelsBeingTreated.py

Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated

2023-01-07 15:45:31 +01:00

findTreatedChannelWithMostComments.py

Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated

2023-01-07 15:45:31 +01:00

findTreatedChannelWithMostSubscribers.py

Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated

2023-01-07 15:45:31 +01:00

keys.txt

Fix #6 : Add support for multiple keys to be resilient against exceeded quota errors

2023-01-08 17:59:08 +01:00

LICENSE

#1 : Add GNU AGPLv3 license

2023-01-06 16:09:12 +01:00

main.cpp

Add support for channelsToTreat to be empty

2023-02-23 23:45:36 +01:00

Makefile

#11 : Add a first iteration for the CHANNELS retrieval

2023-01-15 02:19:31 +01:00

README.md

Advertize pip instead of apt in README.md to install the latest version of yt-dlp

2023-02-23 23:16:36 +01:00

removeChannelsBeingTreated.py

#48 : Modify removeChannelsBeingTreated.py to temporarily solve the issue

2023-02-19 02:04:28 +01:00

retrieveTop100SubscribersFrance.py

Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated

2023-01-07 15:45:31 +01:00

README.md

The algorithm:

To retrieve as many YouTube video ids as possible, and therefore as many video captions as possible, we need to discover as many YouTube channels as possible. To explore the YouTube channels graph with a breadth-first search, we proceed as follows:

1. Provide a starting set of channels.
2. Given a channel, retrieve other channels thanks to its content by using YouTube Data API v3 and YouTube operational API and then repeat 1. for each retrieved channel.
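
The two steps above can be sketched as a plain breadth-first search. This is only an illustration and not the project's C++ implementation: `discover_channels`, `get_related_channels` and `TOY_GRAPH` are hypothetical names, and the toy graph stands in for the calls to YouTube Data API v3 and the YouTube operational API.

```python
from collections import deque
from typing import Callable, Iterable, List


def discover_channels(
    starting_channels: Iterable[str],
    get_related_channels: Callable[[str], Iterable[str]],
) -> List[str]:
    """Return every channel reachable from the starting set, in breadth-first order."""
    seen = set()
    queue = deque()
    # Step 1: enqueue the starting set of channels.
    for channel in starting_channels:
        if channel not in seen:
            seen.add(channel)
            queue.append(channel)
    order = []
    # Step 2: treat a channel, then enqueue the channels found through its content.
    while queue:
        channel = queue.popleft()
        order.append(channel)
        for related in get_related_channels(channel):
            if related not in seen:  # never treat the same channel twice
                seen.add(related)
                queue.append(related)
    return order


# Hypothetical toy graph replacing the real API calls.
TOY_GRAPH = {"A": ["B", "C"], "B": ["A", "D"], "C": ["D"], "D": []}

print(discover_channels(["A"], lambda channel: TOY_GRAPH.get(channel, [])))
# prints ['A', 'B', 'C', 'D']
```

The `seen` set is what keeps the search finite, since channels frequently reference each other.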

A ready-to-use website instance of this project for end-users is hosted at: https://crawler.yt.lemnoslife.com

See more details on the Wiki.

Running the YouTube graph discovery algorithm:

Because of the current compression mechanism, Linux is the only known OS able to run this algorithm.

To install the dependencies of this project on an apt-based Linux distribution, make sure to have pip installed and run:

    sudo apt install nlohmann-json3-dev
    pip install yt-dlp

To compile the YouTube graph discovery algorithm, run:

    make

To see the command line arguments of the algorithm, run:

    ./youtubeCaptionsSearchEngine -h

To run the YouTube graph discovery algorithm, run:

    ./youtubeCaptionsSearchEngine

Unless you provide the argument --youtube-operational-api-instance-url https://yt.lemnoslife.com, you have to host your own instance of the YouTube operational API.

Unless you provide the argument --no-keys, you have to provide at least one YouTube Data API v3 key in keys.txt.
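
Providing several keys makes the crawl resilient against exceeded quota errors. The idea can be sketched as follows. This is a hypothetical illustration, not the project's code: `QuotaExceededError`, `load_keys`, `call_with_key_rotation` and `fake_request` are made-up names, and the one-key-per-line layout of keys.txt is an assumption.

```python
class QuotaExceededError(Exception):
    """Hypothetical error raised when a key has no quota left."""


def load_keys(text):
    """Parse the content of a keys file, assuming one key per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def call_with_key_rotation(keys, request):
    """Call `request(key)` with each key in turn and return the first successful result."""
    for key in keys:
        try:
            return request(key)
        except QuotaExceededError:
            continue  # this key is exhausted, try the next one
    raise RuntimeError("All keys have exceeded their quota")


def fake_request(key):
    """Stand-in for an API call: pretend that KEY_1 is out of quota."""
    if key == "KEY_1":
        raise QuotaExceededError(key)
    return f"response obtained with {key}"


keys = load_keys("KEY_1\nKEY_2\n")
print(call_with_key_rotation(keys, fake_request))
# prints: response obtained with KEY_2
```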

Hosting the website enabling users to make requests:

To install its dependencies, make sure to have composer installed and run:

    composer install
    pip install webvtt-py

Add the following configuration to the Nginx configuration of your website:

    # Make `index.php` the default webpage of your website.
    index index.php;

    # Allow end-users to retrieve the content of a file within a channel zip.
    location /channels {
        rewrite ^(.*).zip$ /channels.php;
        rewrite ^(.*).zip/(.*).json$ /channels.php;
        rewrite ^(.*).zip/(.*).txt$ /channels.php;
        rewrite ^(.*).zip/(.*).vtt$ /channels.php;
        # Allow end-users to list `channels/` content.
        autoindex on;
    }

    # Prevent end-users from accessing other end-users' requests.
    location /users {
        deny all;
    }

    # Configure the websocket endpoint.
    location /websocket {
        # switch off logging
        access_log off;

        # redirect all HTTP traffic to localhost
        proxy_pass http://localhost:4430;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # WebSocket support (nginx 1.4)
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # timeout extension, possibly keep this short if using a ping strategy
        proxy_read_timeout 99999s;
    }
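
To check which request paths the four `rewrite` rules of the `/channels` location hand over to `channels.php`, the patterns can be tried out in Python. This is a rough sketch: Nginx uses PCRE while Python uses its own `re` module, which should behave the same for patterns this simple, and the example paths and the `is_rewritten` helper are hypothetical.

```python
import re

# The four patterns copied verbatim from the `rewrite` directives above.
REWRITE_PATTERNS = [
    r"^(.*).zip$",
    r"^(.*).zip/(.*).json$",
    r"^(.*).zip/(.*).txt$",
    r"^(.*).zip/(.*).vtt$",
]


def is_rewritten(path):
    """Return True if at least one rewrite pattern matches the request path."""
    return any(re.search(pattern, path) for pattern in REWRITE_PATTERNS)


# Hypothetical request paths.
for path in [
    "/channels/CHANNEL_ID.zip",
    "/channels/CHANNEL_ID.zip/captions/VIDEO_ID.vtt",
    "/channels/CHANNEL_ID.zip/video.mp4",
    "/channels/",
]:
    print(path, is_rewritten(path))
```

The first two paths match and would be served by `channels.php`, while the last two do not match any rule. Note that the dots in the patterns are unescaped, so each one matches any single character and not only a literal `.`.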

Start the websocket worker by running:

    php websockets.php

Releases 2

Add quite exhaustive channels discovery and captions extraction through almost all channel features. Latest

2023-02-26 13:25:42 +01:00
