The algorithm:
To retrieve as many YouTube video ids as possible, and hence as many video captions as possible, we need to discover as many YouTube channels as possible. To explore the YouTube channel graph with a breadth-first search, we proceed as follows:
- Provide a starting set of channels.
- Given a channel, retrieve other channels from its content by using the YouTube Data API v3 and the YouTube operational API, then repeat this step for each newly retrieved channel (a minimal sketch of this traversal is given after this list).
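As an illustration of this breadth-first traversal, here is a minimal Python sketch that discovers new channels through the authors of comments left on a channel's videos, via the YouTube Data API v3 `commentThreads.list` endpoint. The seed channel id, the API key placeholder and the `MAX_CHANNELS` bound are hypothetical, and comments are only one of the content sources the actual crawler (`main.cpp`) exploits, so this is not the crawler's exact logic.

```python
# Minimal sketch of the channel-graph breadth-first search, assuming a
# YouTube Data API v3 key and a hypothetical starting set of channel ids.
from collections import deque

import requests

API_KEY = "YOUR_YOUTUBE_DATA_API_V3_KEY"   # placeholder
STARTING_CHANNELS = ["UC_seed_channel_id"]  # hypothetical seed channel
MAX_CHANNELS = 100                          # stop condition for this sketch only

def commenter_channel_ids(channel_id):
    """Yield channel ids of commenters related to the given channel."""
    response = requests.get(
        "https://www.googleapis.com/youtube/v3/commentThreads",
        params={
            "part": "snippet",
            "allThreadsRelatedToChannelId": channel_id,
            "maxResults": 100,
            "key": API_KEY,
        },
    )
    response.raise_for_status()
    for item in response.json().get("items", []):
        author = item["snippet"]["topLevelComment"]["snippet"].get("authorChannelId")
        if author:
            yield author["value"]

seen = set(STARTING_CHANNELS)
queue = deque(STARTING_CHANNELS)
while queue and len(seen) < MAX_CHANNELS:
    current = queue.popleft()
    for discovered in commenter_channel_ids(current):
        if discovered not in seen:
            seen.add(discovered)
            queue.append(discovered)

print(f"Discovered {len(seen)} channels")
```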
A ready-to-use instance of this project's website is hosted at: https://crawler.yt.lemnoslife.com
See more details on the Wiki.
Running the algorithm:
Because of the current compression mechanism, Linux is the only known OS able to run this algorithm.
Install the dependencies with `sudo apt install nlohmann-json3-dev yt-dlp`, then compile the crawler with `make`.

To list the available command-line arguments, run `./youtubeCaptionsSearchEngine -h`.
If you plan to use the front-end website, also run `pip install webvtt-py`.
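The front end presumably uses webvtt-py to parse downloaded WebVTT caption files. As a rough sketch of such parsing (the file name and search term below are hypothetical, not taken from this project):

```python
# Minimal sketch: searching a downloaded WebVTT caption file with webvtt-py.
# The file name and search term are placeholders for illustration only.
import webvtt

QUERY = "example search term"

for caption in webvtt.read("captions.en.vtt"):
    if QUERY.lower() in caption.text.lower():
        # Each caption exposes its start/end timestamps and text.
        print(f"{caption.start} --> {caption.end}: {caption.text}")
```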
Unless you provide the argument `--youtube-operational-api-instance-url https://yt.lemnoslife.com`, you have to host your own instance of the YouTube operational API.
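As a quick sanity check that the instance you pass via `--youtube-operational-api-instance-url` is reachable, you could send it a simple request. The sketch below is an assumption-laden example: it probes the `videos?part=chapters` endpoint of the YouTube operational API with an arbitrary public video id, which may not match how the crawler itself talks to the instance.

```python
# Hedged sketch: checking that a YouTube operational API instance responds.
# The instance URL, endpoint parameters and video id are illustrative only.
import requests

INSTANCE_URL = "https://yt.lemnoslife.com"  # or your self-hosted instance
VIDEO_ID = "dQw4w9WgXcQ"                    # arbitrary public video used as a probe

response = requests.get(
    f"{INSTANCE_URL}/videos",
    params={"part": "chapters", "id": VIDEO_ID},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```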
Unless you provide the argument `--no-keys`, you have to provide at least one YouTube Data API v3 key in `keys.txt`.
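The sketch below shows one way rotating between several keys could work. It assumes, without having checked the crawler's exact behaviour, that `keys.txt` holds one API key per line and that a quota-exhausted key answers with HTTP 403.

```python
# Hedged sketch: rotating YouTube Data API v3 keys read from keys.txt.
# Assumes one key per line, which may differ from the crawler's actual format.
import requests

with open("keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]

def get_with_key_rotation(url, params):
    """Try each key in turn, falling through on quota errors (HTTP 403)."""
    for key in keys:
        response = requests.get(url, params={**params, "key": key})
        if response.status_code == 403:  # typically quotaExceeded
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("All YouTube Data API v3 keys exhausted")

# Example: fetch basic statistics for a hypothetical channel id.
data = get_with_key_rotation(
    "https://www.googleapis.com/youtube/v3/channels",
    {"part": "statistics", "id": "UC_some_channel_id"},
)
print(data)
```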
Then run the crawler with `./youtubeCaptionsSearchEngine`.