A video introducing this project is available here.
The algorithm:
To retrieve as many YouTube video ids as possible, and therefore as many video captions as possible, we need to retrieve as many YouTube channels as possible. So we discover the YouTube channels graph with a breadth-first search, proceeding as follows:
1. Provide a starting set of channels.
2. Given a channel, retrieve other channels through its content by using YouTube Data API v3 and YouTube operational API, then repeat step 2 for each retrieved channel.
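For illustration, here is a minimal single-threaded sketch of this breadth-first traversal (the actual implementation in main.cpp is multi-threaded and persists its state to disk); retrieveRelatedChannels is a hypothetical placeholder standing for the YouTube Data API v3 and YouTube operational API calls that extract channel ids from a channel's content:

// Minimal sketch of the channels graph breadth-first search.
#include <iostream>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical placeholder for the YouTube Data API v3 / YouTube operational API
// calls that extract channel ids from a channel's content.
std::vector<std::string> retrieveRelatedChannels(const std::string& channelId)
{
    return {};
}

int main()
{
    std::queue<std::string> channelsToTreat;
    std::set<std::string> channelsAlreadySeen;

    // Starting set of channels (channels.txt in the real project);
    // "UC_PLACEHOLDER" stands for a real channel id.
    channelsToTreat.push("UC_PLACEHOLDER");
    channelsAlreadySeen.insert("UC_PLACEHOLDER");

    while (!channelsToTreat.empty())
    {
        const std::string channelId = channelsToTreat.front();
        channelsToTreat.pop();
        for (const std::string& relatedChannelId : retrieveRelatedChannels(channelId))
        {
            // Only enqueue channels that were never treated nor enqueued before.
            if (channelsAlreadySeen.insert(relatedChannelId).second)
            {
                channelsToTreat.push(relatedChannelId);
            }
        }
    }
    std::cout << channelsAlreadySeen.size() << " channels discovered" << std::endl;
    return 0;
}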
A ready-to-use website instance of this project is hosted at: https://crawler.yt.lemnoslife.com
See more details on the Wiki.
The project structure:
- main.cpp contains the C++ multi-threaded algorithm proceeding to the YouTube channels discovery. It is notably made of the following functions:
  - main, which takes into account the command line arguments, loads variables from files (channels.txt, keys.txt, channels/ content) and starts the threads executing the treatChannels function
  - treatChannels, which gets a YouTube channel to treat, treats it in the treatChannelOrVideo function and compresses the retrieved data
  - treatChannelOrVideo, which, provided a YouTube channel id or a video id, treats this resource. In both cases it treats the comments left on this resource. In the case of a channel it also treats its CHANNELS, COMMUNITY, PLAYLISTS and LIVE tabs and downloads the captions of the channel videos.
  - markChannelAsRequiringTreatmentIfNeeded, which, provided a YouTube channel id, marks it as requiring treatment if it wasn't already treated
  - execute, which, provided a yt-dlp command, executes it in a shell
  - getJson, which, provided an API request, returns a JSON structure with its result. If the requested API is YouTube Data API v3 and a set of keys is provided (see keys.txt below), it rotates the keys as required (see the sketch after this list)
- channels.txt contains a starting set of channels, mostly the 100 most subscribed French channels
- keys.txt contains a set of YouTube Data API v3 keys (not provided) to have the ability to request this API (see an alternative to filling it in the section below with the --no-keys command line argument)
- scripts/ contains Python scripts to:
  - generate channels.txt as described above (retrieveTop100SubscribersFrance.py)
  - remove channels being treated before a restart of the algorithm, as described in the main function documentation (removeChannelsBeingTreated.py)
- website/ is a PHP website using WebSocket to allow the end-user to make requests on the retrieved dataset. When fetching the website, the end-user receives the interpreted index.php which, upon making a request, interacts with websocket.php which, in the back-end, dispatches the requests from various end-users to search.py (which treats the actual end-user request on the compressed dataset), using users/ for the inter-process communication.
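As an illustration of the key rotation done by getJson, here is a simplified sketch (not the actual implementation); httpGet is a hypothetical helper returning the body of an HTTP GET request, and quotaExceeded is the usual YouTube Data API v3 error reason when a key has exhausted its quota:

// Simplified sketch of getJson key rotation for YouTube Data API v3 requests.
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical HTTP helper: in the real project this would perform the actual
// HTTP GET (for instance with libcurl) and return the response body.
std::string httpGet(const std::string& url)
{
    (void)url;
    return "{}";
}

// Keys loaded from keys.txt in the real project.
std::vector<std::string> keys = {"KEY_1", "KEY_2"};
unsigned int keyIndex = 0;

nlohmann::json getJson(const std::string& request)
{
    for (unsigned int attempt = 0; attempt < keys.size(); attempt++)
    {
        const std::string url = "https://www.googleapis.com/youtube/v3/" + request
                              + "&key=" + keys[keyIndex];
        nlohmann::json data = nlohmann::json::parse(httpGet(url));
        // On a quota error, rotate to the next key and retry the same request.
        if (data.contains("error")
            && data["error"]["errors"][0]["reason"] == "quotaExceeded")
        {
            keyIndex = (keyIndex + 1) % keys.size();
            continue;
        }
        return data;
    }
    throw std::runtime_error("all YouTube Data API v3 keys exhausted their quota");
}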
Note that this project heavily relies on the YouTube operational API, which was modified for this project.
Running the YouTube graph discovery algorithm:
Because of the current compression mechanism, Linux is the only known OS able to run this algorithm.
To clone the repository, run:
git clone https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine
Move to the cloned repository by running:
cd YouTube_captions_search_engine/
To install this project's dependencies on an apt based Linux distribution, make sure to have pip and run:
sudo apt install nlohmann-json3-dev
pip install yt-dlp
To compile the YouTube graph discovery algorithm, run:
make
To see the command line arguments of the algorithm, run:
./youtubeCaptionsSearchEngine -h
To run the YouTube graph discovery algorithm, run:
./youtubeCaptionsSearchEngine
Unless you provide the argument --youtube-operational-api-instance-url https://yt.lemnoslife.com, you have to host your own instance of the YouTube operational API.
Unless you provide the argument --no-keys, you have to provide at least one YouTube Data API v3 key in keys.txt.
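For instance, assuming these two arguments can be combined, the following invocation relies on the public YouTube operational API instance and does not require any YouTube Data API v3 key:
./youtubeCaptionsSearchEngine --no-keys --youtube-operational-api-instance-url https://yt.lemnoslife.com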
Hosting the website enabling users to make requests:
Move to the website/
folder by running:
cd website/
To install its dependencies, make sure to have composer installed and run:
sudo apt install nginx
composer install
pip install webvtt-py
Add the following to your Nginx site configuration:
# Make the default webpage of your website to be `index.php`.
index index.php;
# Allow end-users to retrieve the content of a file within a channel zip.
location /channels {
rewrite ^(.*).zip$ /channels.php;
rewrite ^(.*).zip/(.*).json$ /channels.php;
rewrite ^(.*).zip/(.*).txt$ /channels.php;
rewrite ^(.*).zip/(.*).vtt$ /channels.php;
# Allow end-users to list `channels/` content.
autoindex on;
}
# Prevent end-users from accessing other end-users' requests.
location /users {
deny all;
}
# Configure the websocket endpoint.
location /websocket {
# switch off logging
access_log off;
# redirect all HTTP traffic to localhost
proxy_pass http://localhost:4430;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header Host $host;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
# WebSocket support (nginx 1.4)
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
# timeout extension, possibly keep this short if using a ping strategy
proxy_read_timeout 99999s;
}
Start the websocket worker by running:
php websockets.php
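If the worker should keep running after you close your shell, one option (an assumption, not documented by this project) is to start it in the background with a standard tool such as nohup:
nohup php websockets.php &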