# The algorithm:

To retrieve as many video captions as possible, we need to retrieve as many YouTube video ids as possible, and therefore as many YouTube channels as possible. So we discover the YouTube channels graph with a breadth-first search, proceeding as follows:

1. Provide a starting set of channels.
2. Given a channel, retrieve other channels through its content by using the [YouTube Data API v3](https://developers.google.com/youtube/v3) and the [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API), then repeat this step for each retrieved channel.

A ready-to-use website instance of this project is hosted at: https://crawler.yt.lemnoslife.com

See [the Wiki](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki) for more details.

# Running the YouTube graph discovery algorithm:

Because of [the current compression mechanism](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/30), Linux is the only known OS able to run this algorithm.

To clone the repository, run:

```sh
git clone https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine
```

Move to the cloned repository by running:

```sh
cd YouTube_captions_search_engine/
```

To install this project's dependencies on an `apt`-based Linux distribution, make sure [`pip`](https://pip.pypa.io/en/stable/installation/) is installed and run:

```sh
sudo apt install nlohmann-json3-dev
pip install yt-dlp
```

To compile the YouTube graph discovery algorithm, run:

```sh
make
```

To see the command line arguments of the algorithm, run:

```sh
./youtubeCaptionsSearchEngine -h
```

To run the YouTube graph discovery algorithm, run:

```sh
./youtubeCaptionsSearchEngine
```

Unless you provide the argument `--youtube-operational-api-instance-url https://yt.lemnoslife.com`, you have [to host your own instance of the YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API/#install-your-own-instance-of-the-api).

Unless you provide the argument `--no-keys`, you have to provide at least one [YouTube Data API v3 key](https://developers.google.com/youtube/v3/getting-started) in `keys.txt`.
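For instance, if you neither want to host your own YouTube operational API instance nor provide YouTube Data API v3 keys, a possible invocation is sketched below (assuming the two arguments above can be combined in a single run):

```sh
# Use the public YouTube operational API instance and do not read any
# YouTube Data API v3 key from `keys.txt`.
./youtubeCaptionsSearchEngine --youtube-operational-api-instance-url https://yt.lemnoslife.com --no-keys
```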
# Hosting the website enabling users to make requests:

Move to the `website/` folder by running:

```sh
cd website/
```

To install its dependencies, make sure [`composer`](https://getcomposer.org/doc/00-intro.md) is installed and run:

```sh
composer install
pip install webvtt-py
```

Add the following to your Nginx website configuration:

```nginx
# Make `index.php` the default webpage of your website.
index index.php;

# Allow end-users to retrieve the content of a file within a channel zip.
location /channels {
    rewrite ^(.*).zip$ /channels.php;
    rewrite ^(.*).zip/(.*).json$ /channels.php;
    rewrite ^(.*).zip/(.*).txt$ /channels.php;
    rewrite ^(.*).zip/(.*).vtt$ /channels.php;
    # Allow end-users to list `channels/` content.
    autoindex on;
}

# Prevent end-users from accessing other end-users' requests.
location /users {
    deny all;
}

# Configure the websocket endpoint.
location /websocket {
    # Switch off logging.
    access_log off;

    # Redirect all HTTP traffic to the local websocket worker.
    proxy_pass http://localhost:4430;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

    # WebSocket support (nginx 1.4).
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";

    # Timeout extension; possibly keep this short if using a ping strategy.
    proxy_read_timeout 99999s;
}
```

Start the websocket worker by running:

```sh
php websockets.php
```
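The websocket worker has to keep running for the website to answer requests. A minimal sketch for keeping it alive in the background, assuming a plain `nohup` approach is acceptable rather than a proper service manager:

```sh
# Keep the websocket worker running after the shell is closed,
# appending its output to `websocket.log` in the current directory.
nohup php websockets.php >> websocket.log 2>&1 &
```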