The algorithm:
To retrieve as many YouTube video ids as possible, and hence as many video captions as possible, we need to discover as many YouTube channels as possible. To explore the YouTube channel graph with a breadth-first search, we proceed as follows:
- Provide a starting set of channels.
- Given a channel, retrieve other channels from its content by using the YouTube Data API v3 and the YouTube operational API, then repeat this step for each newly retrieved channel (a minimal sketch of this traversal is given after this list).
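As an illustration of this breadth-first traversal, here is a minimal Python sketch that discovers new channels through the authors of comments left on a channel's videos, via the YouTube Data API v3 `commentThreads.list` endpoint. The seed channel id, the API key placeholder and the `MAX_CHANNELS` bound are hypothetical, and comments are only one of the content sources the actual crawler (`main.cpp`) exploits, so this is not the crawler's exact logic.

```python
# Minimal sketch of the channel-graph breadth-first search, assuming a
# YouTube Data API v3 key and a hypothetical starting set of channel ids.
from collections import deque

import requests

API_KEY = "YOUR_YOUTUBE_DATA_API_V3_KEY"   # placeholder
STARTING_CHANNELS = ["UC_seed_channel_id"]  # hypothetical seed channel
MAX_CHANNELS = 100                          # stop condition for this sketch only

def commenter_channel_ids(channel_id):
    """Yield channel ids of commenters related to the given channel."""
    response = requests.get(
        "https://www.googleapis.com/youtube/v3/commentThreads",
        params={
            "part": "snippet",
            "allThreadsRelatedToChannelId": channel_id,
            "maxResults": 100,
            "key": API_KEY,
        },
    )
    response.raise_for_status()
    for item in response.json().get("items", []):
        author = item["snippet"]["topLevelComment"]["snippet"].get("authorChannelId")
        if author:
            yield author["value"]

seen = set(STARTING_CHANNELS)
queue = deque(STARTING_CHANNELS)
while queue and len(seen) < MAX_CHANNELS:
    current = queue.popleft()
    for discovered in commenter_channel_ids(current):
        if discovered not in seen:
            seen.add(discovered)
            queue.append(discovered)

print(f"Discovered {len(seen)} channels")
```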
A ready-to-use instance of this project's website is hosted at: https://crawler.yt.lemnoslife.com
See more details on the Wiki.
Running the algorithm:
Because of the current compression mechanism, Linux is the only known OS able to run this algorithm.
Install the dependencies with `sudo apt install nlohmann-json3-dev yt-dlp`, then compile the crawler with `make`.

To list the available command-line arguments, run `./youtubeCaptionsSearchEngine -h`.
If you plan to use the front-end website, also run `pip install webvtt-py`.
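The front end presumably uses webvtt-py to parse downloaded WebVTT caption files. As a rough sketch of such parsing (the file name and search term below are hypothetical, not taken from this project):

```python
# Minimal sketch: searching a downloaded WebVTT caption file with webvtt-py.
# The file name and search term are placeholders for illustration only.
import webvtt

QUERY = "example search term"

for caption in webvtt.read("captions.en.vtt"):
    if QUERY.lower() in caption.text.lower():
        # Each caption exposes its start/end timestamps and text.
        print(f"{caption.start} --> {caption.end}: {caption.text}")
```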
Unless you provide the argument `--youtube-operational-api-instance-url https://yt.lemnoslife.com`, you have to host your own instance of the YouTube operational API.
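As a quick sanity check that the instance you pass via `--youtube-operational-api-instance-url` is reachable, you could send it a simple request. The sketch below is an assumption-laden example: it probes the `videos?part=chapters` endpoint of the YouTube operational API with an arbitrary public video id, which may not match how the crawler itself talks to the instance.

```python
# Hedged sketch: checking that a YouTube operational API instance responds.
# The instance URL, endpoint parameters and video id are illustrative only.
import requests

INSTANCE_URL = "https://yt.lemnoslife.com"  # or your self-hosted instance
VIDEO_ID = "dQw4w9WgXcQ"                    # arbitrary public video used as a probe

response = requests.get(
    f"{INSTANCE_URL}/videos",
    params={"part": "chapters", "id": VIDEO_ID},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```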
Unless you provide the argument `--no-keys`, you have to provide at least one YouTube Data API v3 key in `keys.txt`.
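The sketch below shows one way rotating between several keys could work. It assumes, without having checked the crawler's exact behaviour, that `keys.txt` holds one API key per line and that a quota-exhausted key answers with HTTP 403.

```python
# Hedged sketch: rotating YouTube Data API v3 keys read from keys.txt.
# Assumes one key per line, which may differ from the crawler's actual format.
import requests

with open("keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]

def get_with_key_rotation(url, params):
    """Try each key in turn, falling through on quota errors (HTTP 403)."""
    for key in keys:
        response = requests.get(url, params={**params, "key": key})
        if response.status_code == 403:  # typically quotaExceeded
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("All YouTube Data API v3 keys exhausted")

# Example: fetch basic statistics for a hypothetical channel id.
data = get_with_key_rotation(
    "https://www.googleapis.com/youtube/v3/channels",
    {"part": "statistics", "id": "UC_some_channel_id"},
)
print(data)
```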
Then run the crawler with `./youtubeCaptionsSearchEngine`.