A video introducing this project is available [here](https://crawler.yt.lemnoslife.com/presentation).
# The algorithm:

To retrieve as many YouTube video ids as possible, and thereby as many video captions as possible, we need to retrieve as many YouTube channels as possible.

So to discover the YouTube channels graph with a breadth-first search, we proceed as follows:

1. Provide a starting set of channels.
2. Given a channel, retrieve other channels from its content by using the [YouTube Data API v3](https://developers.google.com/youtube/v3) and the [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API), then repeat step 2 for each retrieved channel.
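The two steps above amount to a breadth-first search over the channels graph. A minimal sketch in Python (the `get_related_channels` callback is hypothetical, standing in for the API calls that extract channel ids from a channel's content):

```python
from collections import deque

def discover_channels(starting_channels, get_related_channels):
    """Breadth-first search over the YouTube channels graph.

    `get_related_channels` is a hypothetical callback returning the channel
    ids discovered in a given channel's content (comments, tabs, ...).
    """
    treated = set(starting_channels)   # channels already marked for treatment
    to_treat = deque(starting_channels)
    order = []
    while to_treat:
        channel = to_treat.popleft()
        order.append(channel)
        for related in get_related_channels(channel):
            if related not in treated:  # mark each channel only once
                treated.add(related)
                to_treat.append(related)
    return order

# Toy graph standing in for real API responses.
graph = {'A': ['B', 'C'], 'B': ['C', 'D'], 'C': [], 'D': ['A']}
print(discover_channels(['A'], lambda c: graph.get(c, [])))  # ['A', 'B', 'C', 'D']
```

The real algorithm additionally persists the treated set on disk so a crawl can be resumed after a restart.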
A ready-to-use website instance of this project is hosted at: https://crawler.yt.lemnoslife.com

See more details on [the Wiki](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki).
# The project structure:

- `main.cpp` contains the C++ multi-threaded algorithm performing the YouTube channels discovery. It is notably made of the following functions:
    - `main`, which takes into account the command line arguments, loads variables from files (`channels.txt`, `keys.txt`, `channels/` content) and starts the threads executing the `treatChannels` function
    - `treatChannels`, which gets a YouTube channel to treat, treats it in the `treatChannelOrVideo` function and compresses the retrieved data
    - `treatChannelOrVideo`, which, provided a YouTube channel id or a video id, treats this resource. In both cases it treats the comments left on this resource. In the case of a channel it also treats its `CHANNELS`, `COMMUNITY`, `PLAYLISTS` and `LIVE` tabs and downloads the captions of the channel videos.
    - `markChannelAsRequiringTreatmentIfNeeded`, which, provided a YouTube channel id, marks it as requiring treatment if it wasn't already treated
    - `execute`, which, provided a `yt-dlp` command, executes it in a shell
    - `getJson`, which, provided an API request, returns a JSON structure with its result. In the case that the requested API is YouTube Data API v3 and a set of keys is provided (see `keys.txt` below), it rotates the keys as required
- `channels.txt` contains a starting set of channels, consisting mostly of the 100 most subscribed French channels
- `keys.txt` contains a set of YouTube Data API v3 keys (not provided) granting the ability to request this API (see the `--no-keys` command line argument in the section below for an alternative to filling it)
- `scripts/` contains Python scripts to:
    - generate `channels.txt` as described above (`retrieveTop100SubscribersFrance.py`)
    - remove channels being treated before a restart of the algorithm, as described in [the `main` function documentation](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/8dd89e6e881da0a905b6fa4b23775c4344dd0d9d/main.cpp#L126-L128) (`removeChannelsBeingTreated.py`)
- `website/` is a PHP website using WebSocket to allow the end-user to make requests on the retrieved dataset. When fetching the website, the end-user receives the interpreted `index.php`, which upon making a request interacts with `websocket.php`, which in the back-end dispatches the requests from the various end-users to `search.py` (which treats the actual end-user request on the compressed dataset), using `users/` for the inter-process communication.

Note that this project heavily relies on the [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API), [which was modified for this project](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/YouTube-operational-API-commits).
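The key rotation performed by `getJson` can be sketched as follows. This is a simplified illustration, not the actual C++ implementation; the `KeyRotator` name and its interface are hypothetical:

```python
import itertools

class KeyRotator:
    """Cycle through a set of YouTube Data API v3 keys: when the current key
    is exhausted (e.g. its daily quota is spent), move on to the next one,
    wrapping around to the first key after the last."""

    def __init__(self, keys):
        self._cycle = itertools.cycle(keys)
        self.current = next(self._cycle)

    def rotate(self):
        # Called when a request with the current key fails for quota reasons.
        self.current = next(self._cycle)
        return self.current

rotator = KeyRotator(['KEY_A', 'KEY_B', 'KEY_C'])
print(rotator.current)   # KEY_A
print(rotator.rotate())  # KEY_B
print(rotator.rotate())  # KEY_C
print(rotator.rotate())  # KEY_A (wrapped around)
```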
# Running the YouTube graph discovery algorithm:

Because of [the current compression mechanism](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/30), Linux is the only known OS able to run this algorithm.

To clone the repository, run:

```sh
git clone https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine
```
Move to the cloned repository by running:

```sh
cd YouTube_captions_search_engine/
```
To install this project's dependencies on an `apt`-based Linux distribution, make sure to have [`pip`](https://pip.pypa.io/en/stable/installation/) installed and run:

```sh
sudo apt install nlohmann-json3-dev
pip install yt-dlp
```
To compile the YouTube graph discovery algorithm, run:

```sh
make
```
To see the command line arguments of the algorithm, run:

```sh
./youtubeCaptionsSearchEngine -h
```
To run the YouTube graph discovery algorithm, run:

```sh
./youtubeCaptionsSearchEngine
```
Unless you provide the argument `--youtube-operational-api-instance-url https://yt.lemnoslife.com`, you have [to host your own instance of the YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API/#install-your-own-instance-of-the-api).

Unless you provide the argument `--no-keys`, you have to provide at least one [YouTube Data API v3 key](https://developers.google.com/youtube/v3/getting-started) in `keys.txt`.
# Hosting the website enabling users to make requests:

Move to the `website/` folder by running:

```sh
cd website/
```
To install its dependencies, make sure to have [`composer`](https://getcomposer.org/doc/00-intro.md) installed and run:

```sh
sudo apt install nginx
composer install
pip install webvtt-py
```
Add the following to your website's Nginx configuration:

```nginx
# Make `index.php` the default webpage of your website.
index index.php;

# Allow end-users to retrieve the content of a file within a channel zip.
location /channels {
    rewrite ^(.*).zip$ /channels.php;
    rewrite ^(.*).zip/(.*).json$ /channels.php;
    rewrite ^(.*).zip/(.*).txt$ /channels.php;
    rewrite ^(.*).zip/(.*).vtt$ /channels.php;
    # Allow end-users to list `channels/` content.
    autoindex on;
}

# Prevent end-users from accessing other end-users' requests.
location /users {
    deny all;
}

# Configure the WebSocket endpoint.
location /websocket {
    # Switch off logging.
    access_log off;

    # Redirect all HTTP traffic to localhost.
    proxy_pass http://localhost:4430;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    # WebSocket support (nginx 1.4).
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";

    # Timeout extension; possibly keep this short if using a ping strategy.
    proxy_read_timeout 99999s;
}
```
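The `location /channels` rewrites map any request ending in `.zip`, or targeting a `.json`/`.txt`/`.vtt` file inside a zip, to `channels.php`. A quick way to sanity-check the patterns is to replay them with Python's `re` module (a sketch; the example paths, including the channel id, are hypothetical):

```python
import re

# The same regexes as in the `location /channels` block above.
patterns = [
    r'^(.*).zip$',
    r'^(.*).zip/(.*).json$',
    r'^(.*).zip/(.*).txt$',
    r'^(.*).zip/(.*).vtt$',
]

def rewrites_to_channels_php(path):
    """Return True if any rewrite rule would send `path` to channels.php."""
    return any(re.search(p, path) for p in patterns)

print(rewrites_to_channels_php('/channels/UC_x5XG1OV2P6uZZ5FSM9Ttw.zip'))                 # True
print(rewrites_to_channels_php('/channels/UC_x5XG1OV2P6uZZ5FSM9Ttw.zip/captions/en.vtt')) # True
print(rewrites_to_channels_php('/channels/'))                                             # False (served by autoindex)
```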
Start the WebSocket worker by running:

```sh
php websockets.php
```