Compare commits

...

No commits in common. "master" and "0.0.1" have entirely different histories.

10 changed files with 24 additions and 84 deletions

View File

@ -1,5 +1,3 @@
A video introducing this project is available [here](https://crawler.yt.lemnoslife.com/presentation).
# The algorithm:
To retrieve the most YouTube video ids in order to retrieve the most video captions, we need to retrieve the most YouTube channels.
@ -11,24 +9,6 @@ A ready to be used by the end-user website instance of this project is hosted at
See more details on [the Wiki](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki).
# The project structure:
- `main.cpp` contains the C++ multi-threaded algorithm proceeding to the YouTube channels discovery. It is notably made of the following functions:
- `main` which takes into account the command line arguments, load variables from files (`channels.txt`, `keys.txt`, `channels/` content) and start the threads as executing `treatChannels` function
- `treatChannels` gets a YouTube channel to treat, treat it in `treatChannelOrVideo` function and compress the retrieved data
- `treatChannelOrVideo` which provided a YouTube channel id or a video id, treats this resource. In both cases it treats comments left on this resource. In the case of a channel it also treats its `CHANNELS`, `COMMUNITY`, `PLAYLISTS` and `LIVE` tabs and downloads the captions of the channel videos.
- `markChannelAsRequiringTreatmentIfNeeded` which provided a YouTube channel id marks it as requiring treatment if it wasn't already treated
- `execute` which provided an `yt-dlp` command executes it in a shell
- `getJson` which provided an API request returns a JSON structure with its result. In the case that the API requested is YouTube Data API v3 and a set of keys is provided (see below `keys.txt`), it rotates the keys as required
- `channels.txt` contains a starting set of channels which contains mostly the 100 most subscribed French channels
- `keys.txt` contains a set of YouTube Data API v3 keys (not provided) to have the ability to request this API (see an alternative to filling it in the section below with `--no-keys` command line argument)
- `scripts/` contains Python scripts to:
- generate the `channels.txt` as described above (`retrieveTop100SubscribersFrance.py`)
- remove channels being treated before a restart of the algorithm as described in [the `main` function documentation](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/src/commit/8dd89e6e881da0a905b6fa4b23775c4344dd0d9d/main.cpp#L126-L128) (`removeChannelsBeingTreated.py`)
- `website/` is a PHP website using WebSocket to allow the end-user to proceed to requests on the retrieved dataset. When fetching the website, the end-user receives the interpreted `index.php` which upon making a request interacts with `websocket.php` which in the back-end dispatches the requests from various end-users to `search.py` (which treats the actual end-user request on the compressed dataset) by using `users/` to make the inter-process communication.
Note that this project heavily relies on [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API) [which was modified for this project](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki/YouTube-operational-API-commits).
# Running the YouTube graph discovery algorithm:
Because of [the current compression mechanism](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/30), Linux is the only known OS able to run this algorithm.

View File

@ -1,6 +1,6 @@
#!/usr/bin/python3
PREFIX = 'Channels per second: '
PREFIX = 'Comments per second: '
alreadyTreatedCommentsCount = 0
with open('nohup.out') as f:
@ -8,7 +8,5 @@ with open('nohup.out') as f:
for line in lines:
if PREFIX in line:
alreadyTreatedCommentsCount += int(line.split(PREFIX)[-1])
#if 'UCsT0YIqwnpJCM-mx7-gSA4Q' in line:
# break
print(alreadyTreatedCommentsCount)

View File

@ -1,7 +1,5 @@
#!/usr/bin/python3
# This algorithm should also take in account other features that we use to retrieve channels.
import os, requests, json, time, datetime
path = 'channels/'
@ -9,24 +7,23 @@ path = 'channels/'
os.chdir(path)
def getTimestampFromDateString(dateString):
return int(time.mktime(datetime.datetime.strptime(dateString, '%Y-%m-%dT%H:%M:%SZ').timetuple()))
return int(time.mktime(datetime.datetime.strptime(dateString, "%Y-%m-%dT%H:%M:%SZ").timetuple()))
for channelId in list(os.walk('.'))[1]:
channelId = channelId[2:]
#print(channelId)
numberOfRequests = len(list(os.walk(f'{channelId}/requests'))[0][2]) - 1
numberOfRequests = len(list(os.walk(channelId))[0][2])
# Assume that the folder isn't empty (may not be the case, but it is most of the time).
filePath = f'{channelId}/requests/{str(numberOfRequests - 1)}.json'
with open(filePath) as f:
print(filePath)
#content = '\n'.join(f.read().splitlines()[1:])
data = json.load(f)#json.loads(content)
with open(f'{channelId}/{str(numberOfRequests - 1)}.json') as f:
content = "\n".join(f.read().splitlines()[1:])
data = json.loads(content)
snippet = data['items'][-1]['snippet']
if 'topLevelComment' in snippet:
snippet = snippet['topLevelComment']['snippet']
latestTreatedCommentDate = snippet['publishedAt']
url = f'https://yt.lemnoslife.com/noKey/channels?part=snippet&id={channelId}'
data = requests.get(url).json()
content = requests.get(url).text
data = json.loads(content)
channelCreationDate = data['items'][0]['snippet']['publishedAt']
#print(channelCreationDate)
# Timing percentage not taking into account the not uniform in time distribution of comments. Note that in the case of the last request is to list replies to a comment, the percentage might goes a bit backward, as replies are posted after the initial comment.

View File

@ -1,8 +1,8 @@
#!/usr/bin/python3
import os, requests
import os, requests, json
channelIds = [channelId.replace('.zip', '') for channelId in next(os.walk('channels/'))[2]]
channelIds = next(os.walk('channels/'))[1]
maxResults = 50
channelIdsChunks = [channelIds[i : i + maxResults] for i in range(0, len(channelIds), maxResults)]
@ -11,7 +11,8 @@ mostSubscriberChannel = None
for channelIds in channelIdsChunks:
url = 'https://yt.lemnoslife.com/noKey/channels?part=statistics&id=' + ','.join(channelIds)
data = requests.get(url).json()
content = requests.get(url).text
data = json.loads(content)
items = data['items']
for item in items:
subscriberCount = int(item['statistics']['subscriberCount'])

View File

@ -27,9 +27,7 @@ void createDirectory(string path),
markChannelAsRequiringTreatmentIfNeeded(unsigned short threadId, string channelId),
execute(unsigned short threadId, string command, bool debug = true);
string getHttps(string url),
join(vector<string> parts, string delimiter),
escapeShellArgument(string shellArgument),
replaceAll(string str, const string& from, const string& to);
join(vector<string> parts, string delimiter);
size_t writeCallback(void* contents, size_t size, size_t nmemb, void* userp);
bool doesFileExist(string filePath),
writeFile(unsigned short threadId, string filePath, string option, string toWrite);
@ -244,7 +242,7 @@ void treatChannels(unsigned short threadId)
// As I haven't found any well-known library that compress easily a directory, I have chosen to rely on `zip` cli.
// We precise no `debug`ging, as otherwise the zipping operation doesn't work as expected.
// As the zipping process isn't recursive, we can't just rely on `ls`, but we are obliged to use `find`.
execute(threadId, "cd " + escapeShellArgument(channelToTreatDirectory) + " && find | zip " + escapeShellArgument("../" + channelToTreat + ".zip") + " -@");
execute(threadId, "cd " + channelToTreatDirectory + " && find | zip ../" + channelToTreat + ".zip -@");
PRINT("Compression finished, started deleting initial directory...")
// Get rid of the uncompressed data.
@ -683,7 +681,7 @@ void treatChannelOrVideo(unsigned short threadId, bool isIdAChannelId, string id
// The underscore in `-o` argument is used to not end up with hidden files.
// We are obliged to precise the video id after `--`, otherwise if the video id starts with `-` it's considered as an argument.
string commandCommonPrefix = "yt-dlp --skip-download ",
commandCommonPostfix = " -o " + escapeShellArgument(channelCaptionsToTreatDirectory + "_") + " -- " + escapeShellArgument(videoId);
commandCommonPostfix = " -o '" + channelCaptionsToTreatDirectory + "_' -- " + videoId;
string command = commandCommonPrefix + "--write-sub --sub-lang all,-live_chat" + commandCommonPostfix;
execute(threadId, command);
@ -931,20 +929,3 @@ size_t writeCallback(void* contents, size_t size, size_t nmemb, void* userp)
((string*)userp)->append((char*)contents, size * nmemb);
return size * nmemb;
}
// Source: https://stackoverflow.com/a/3669819
string escapeShellArgument(string shellArgument)
{
return "'" + replaceAll(shellArgument, "'", "'\\''") + "'";
}
string replaceAll(string str, const string& from, const string& to)
{
size_t start_pos = 0;
while((start_pos = str.find(from, start_pos)) != string::npos)
{
str.replace(start_pos, from.length(), to);
start_pos += to.length(); // Handles case where 'to' is a substring of 'from'
}
return str;
}

View File

@ -1,7 +1,4 @@
#!/usr/bin/python3
# We can't proceed automatically by using `requests` Python module because https://socialblade.com/youtube/top/country/fr/mostsubscribed is protected by CloudFlare.
# Note that `undetected-chromedriver` might be a workaround this limitation.
with open('mostsubscribed.html') as f:
lines = f.read().splitlines()

View File

@ -10,11 +10,6 @@ pathSearchMessageParts = sys.argv[2].split(' ')
pathSearch = pathSearchMessageParts[1]
message = ' '.join(pathSearchMessageParts[2:])
pathSearchRegex = re.compile(pathSearch)
messageRegex = re.compile(message)
isPathSearchAChannelId = re.match(r'[a-zA-Z0-9-_]{24}', pathSearch)
searchOnlyCaptions = pathSearchMessageParts[0] == 'search-only-captions'
clientFilePath = f'users/{clientId}.txt'
@ -27,7 +22,7 @@ def write(s):
read = f.read()
# We are appening content, as we moved in-file cursor.
if read != '':
f.write('\n')
f.write("\n")
f.write(s)
f.flush()
fcntl.flock(f, fcntl.LOCK_UN)
@ -38,31 +33,23 @@ def cleanCaption(caption):
return caption.replace('\n', ' ')
# As `zipgrep` doesn't support arguments to stop on first match for each file, we proceed manually to keep a good theoretical complexity.
if isPathSearchAChannelId:
file = pathSearch + '.zip'
if os.path.isfile(path + file):
files = [file]
else:
write(f'progress:0 / 0')
else:
files = [file for file in os.listdir(path) if file.endswith('.zip')]
files = [file for file in os.listdir(path) if file.endswith('.zip')]
for fileIndex, file in enumerate(files):
write(f'progress:{fileIndex} / {len(files)}')
write(f'progress:{fileIndex + 1} / {len(files)}')
zip = zipfile.ZipFile(path + file)
for fileInZip in zip.namelist():
endsWithVtt = fileInZip.endswith('.vtt')
if searchOnlyCaptions and not endsWithVtt:
continue
toWrite = f'{file}/{fileInZip}'
if not bool(pathSearchRegex.search(toWrite)):
if not bool(re.search(pathSearch, toWrite)):
continue
with zip.open(fileInZip) as f:
if endsWithVtt:
content = f.read().decode('utf-8')
stringIOf = StringIO(content)
wholeCaption = ' '.join([cleanCaption(caption.text) for caption in webvtt.read_buffer(stringIOf)])
messagePositions = [m.start() for m in messageRegex.finditer(wholeCaption)]
messagePositions = [m.start() for m in re.finditer(message, wholeCaption)]
if messagePositions != []:
timestamps = []
for messagePosition in messagePositions:
@ -80,7 +67,6 @@ for fileIndex, file in enumerate(files):
if message in str(line):
write(toWrite)
break
write(f'progress:{fileIndex + 1} / {len(files)}')
with open(clientFilePath) as f:
while True:

View File

@ -27,7 +27,7 @@ class Client
posix_kill($this->pid, SIGTERM);
$clientFilePath = getClientFilePath($this->id);
if (file_exists($clientFilePath)) {
$fp = fopen($clientFilePath, 'r+');
$fp = fopen($clientFilePath, "r+");
if (flock($fp, LOCK_EX, $WAIT_IF_LOCKED)) { // acquire an exclusive lock
unlink($clientFilePath); // delete file
flock($fp, LOCK_UN); // release the lock
@ -92,6 +92,8 @@ class MyProcess implements MessageComponentInterface
public function onMessage(ConnectionInterface $from, $msg)
{
// As we are going to use this argument in a shell command, we escape it.
$msg = escapeshellarg($msg);
$client = $this->clients->offsetGet($from);
// If a previous request was received, we execute the new one with another client for simplicity otherwise with current file deletion approach, we can't tell the worker `search.py` that we don't care about its execution anymore.
if ($client->pid !== null) {
@ -103,8 +105,6 @@ class MyProcess implements MessageComponentInterface
$clientFilePath = getClientFilePath($clientId);
// Create the worker output file otherwise it would believe that we don't need this worker anymore.
file_put_contents($clientFilePath, '');
// As we are going to use this argument in a shell command, we escape it.
$msg = escapeshellarg($msg);
// Start the independent worker.
// Redirecting `stdout` is mandatory otherwise `exec` is blocking.
$client->pid = exec("./search.py $clientId $msg > /dev/null & echo $!");
@ -114,7 +114,7 @@ class MyProcess implements MessageComponentInterface
// If the worker output file doesn't exist anymore, then it means that the worker have finished its work and acknowledged that `websocket.php` completely read its output.
if (file_exists($clientFilePath)) {
// `flock` requires `r`eading permission and we need `w`riting one due to `ftruncate` usage.
$fp = fopen($clientFilePath, 'r+');
$fp = fopen($clientFilePath, "r+");
$read = null;
if (flock($fp, LOCK_EX, $WAIT_IF_LOCKED)) { // acquire an exclusive lock
// We assume that the temporary output is less than 1 MB long.