Fix #19: Improve documentation and code comments

Benjamin Loison 2023-02-23 22:50:30 +01:00
parent f44ee4b3c1
commit 68cd27c263
Signed by: Benjamin_Loison
SSH Key Fingerprint: SHA256:BtnEgYTlHdOg1u+RmYcDE0mnfz1rhv5dSbQ2gyxW8B8
2 changed files with 187 additions and 88 deletions


@ -1,16 +1,15 @@
As explained in the project proposal, the idea for retrieving all video ids is to start from an initial set of channels, list their videos using the YouTube Data API v3 PlaylistItems: list endpoint, then list the comments on those videos, and then restart the process, as comment authors on videos from already known channels potentially give us new channels.
# The algorithm:
For a given channel, there are two ways to list comments users published on it:
1. As explained, the YouTube Data API v3 PlaylistItems: list endpoint enables us to list a channel's videos (up to 20,000 videos, so channels exceeding this limit are neither treated nor written down), and the CommentThreads: list and Comments: list endpoints enable us to retrieve their comments
2. A simpler approach consists of using the YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId`. The main upside of this method, besides being simpler, is that for channels with many videos we save a lot of time by working on 100 comments at a time instead of one video at a time, possibly without a single comment. Note that this approach doesn't list all videos, etc., so some information is not retrieved. Note also that this approach doesn't work for some channels that have comments enabled on some videos but not on the whole channel.
So when possible we proceed with approach 2. and use approach 1. as a fallback.
To retrieve the most YouTube video ids in order to retrieve the most video captions, we need to retrieve the most YouTube channels.
So to discover the YouTube channels graph with a breadth-first search, we proceed as follows:
1. Provide a starting set of channels.
2. Given a channel, retrieve other channels thanks to its content by using [YouTube Data API v3](https://developers.google.com/youtube/v3) and the [YouTube operational API](https://github.com/Benjamin-Loison/YouTube-operational-API), then repeat this step for each retrieved channel.
We can multi-thread this process by channel, or multi-thread per video of a given channel (losing the optimization of CommentThreads: list with `allThreadsRelatedToChannelId`). In any case we shouldn't do something hybrid in terms of multi-threading, as it would be too complex.
As we would like to proceed channel per channel, the question is **how much time does it take to retrieve all comments from the biggest YouTube channel? If the answer is a long period of time, then multi-threading per video of a given channel may make sense.** There are two possibilities following our methods:
1. Here the complexity is linear in the number of the channel's comments, more precisely this number divided by 100 (as we retrieve 100 comments per request) - we could guess that the channel with the most subscribers ([T-Series](https://www.youtube.com/@tseries)) has the most comments
2. Here the complexity is linear in the number of videos - as far as I know [RoelVandePaar](https://www.youtube.com/@RoelVandePaar) has the most videos, [2,026,566 according to SocialBlade](https://socialblade.com/youtube/c/roelvandepaar). However due to the 20,000 limit of YouTube Data API v3 PlaylistItems: list the actual limit is 20,000 [as far as I know](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki#user-content-concerning-20-000-videos-limit-for-youtube-data-api-v3-playlistitems-list-endpoint).
A ready-to-use website instance of this project for end users is hosted at: https://crawler.yt.lemnoslife.com
We have to proceed with a breadth-first search approach, as treating all *child* channels first might take as much time as treating the whole original tree.
See more details on [the Wiki](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/wiki).
# Running the algorithm:
Because of [the current compression mechanism](https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/30), Linux is the only known OS able to run this algorithm.

main.cpp

@ -12,6 +12,8 @@ using namespace std;
using namespace chrono;
using json = nlohmann::json;
// Concerning `retryOnCommentsDisabled`: `commentThreads` can claim for some channels that they have disabled their comments even though we can find comments on some of their videos, so in that case we enumerate the channel videos and request the comments on each video.
// Concerning `returnErrorIfPlaylistNotFound`, it is used when not trying to retrieve a channel `uploads` playlist content as it seems to always work.
enum getJsonBehavior { normal, retryOnCommentsDisabled, returnErrorIfPlaylistNotFound };
set<string> setFromVector(vector<string> vec);
@ -20,17 +22,18 @@ json getJson(unsigned short threadId, string url, bool usingYouTubeDataApiV3, st
void createDirectory(string path),
print(ostringstream* toPrint),
treatComment(unsigned short threadId, json comment, string channelId),
treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, string channelToTreat),
treatChannelOrVideo(unsigned short threadId, bool isIdAChannelId, string id, string channelToTreat),
treatChannels(unsigned short threadId),
deleteDirectory(string path),
addChannelToTreat(unsigned short threadId, string channelId),
exec(unsigned short threadId, string cmd, bool debug = true);
markChannelAsRequiringTreatmentIfNeeded(unsigned short threadId, string channelId),
execute(unsigned short threadId, string command, bool debug = true);
string getHttps(string url),
join(vector<string> parts, string delimiter);
size_t writeCallback(void* contents, size_t size, size_t nmemb, void* userp);
bool doesFileExist(string filePath),
writeFile(unsigned short threadId, string filePath, string option, string toWrite);
// Use macros so as not to have to repeat `threadId` in each call to the `print` function.
#define THREAD_PRINT(threadId, x) { ostringstream toPrint; toPrint << threadId << ": " << x; print(&toPrint); }
#define PRINT(x) THREAD_PRINT(threadId, x)
#define DEFAULT_THREAD_ID 0
@ -39,33 +42,53 @@ bool doesFileExist(string filePath),
#define EXIT_WITH_ERROR(x) { PRINT(x); exit(EXIT_FAILURE); }
#define MAIN_EXIT_WITH_ERROR(x) { MAIN_PRINT(x); exit(EXIT_FAILURE); }
// Note that in the following a `channel` designates a `string` that is the channel id starting with `UC`.
// The only resources shared are:
// - standard streams
// - the ordered set of channels to treat and the unordered set of channels already treated
// - the ordered set of YouTube Data API v3 keys
mutex printMutex,
channelsAlreadyTreatedAndToTreatMutex,
quotaMutex;
// We use `set`s and `map`s for performance reasons.
set<string> channelsAlreadyTreated;
// Two `map`s to simulate a bidirectional map.
map<unsigned int, string> channelsToTreat;
map<string, unsigned int> channelsToTreatRev;
vector<string> keys;
unsigned int channelsPerSecondCount = 0;
map<unsigned short, unsigned int> channelsCountThreads,
requestsPerChannelThreads;
vector<string> youtubeDataApiV3keys;
// For statistics we count the number of:
// - channels found per second (`channelsFoundPerSecondCount`)
// - channels found (`channelsTreatedCountThreads`) and requests made (`requestsCountThreads`) by each thread for the channel it is treating
unsigned int channelsFoundPerSecondCount = 0;
map<unsigned short, unsigned int> channelsTreatedCountThreads,
requestsCountThreads;
// Variables that can be overridden by command line arguments.
unsigned short THREADS_NUMBER = 1;
// Use `string` variables instead of macros to have `string` properties, even if could use a meta-macro inlining as `string`s.
// Can be https://yt.lemnoslife.com to use the official YouTube operational API instance for instance.
string YOUTUBE_OPERATIONAL_API_INSTANCE_URL = "http://localhost/YouTube-operational-API";
bool USE_YT_LEMNOSLIFE_COM_NO_KEY_SERVICE = false;
// Constants written as `string` variables instead of macros to have `string` properties, even if we could use a meta-macro inlining them as `string`s.
string CHANNELS_DIRECTORY = "channels/",
CHANNELS_FILE_PATH = "channels.txt",
KEYS_FILE_PATH = "keys.txt",
STARTING_CHANNELS_SET_FILE_PATH = "channels.txt",
YOUTUBE_DATA_API_V3_KEYS_FILE_PATH = "keys.txt",
UNLISTED_VIDEOS_FILE_PATH = "unlistedVideos.txt",
apiKey = "", // Will firstly be filled with `KEYS_FILE_PATH` first line.
YOUTUBE_OPERATIONAL_API_INSTANCE_URL = "http://localhost/YouTube-operational-API", // Can be "https://yt.lemnoslife.com" for instance.
CAPTIONS_DIRECTORY = "captions/",
DEBUG_DIRECTORY = "debug/",
YOUTUBE_API_REQUESTS_DIRECTORY = "requests/",
YOUTUBE_APIS_REQUESTS_DIRECTORY = "requests/";
// The keys are used as in the YouTube operational API no-key service: completely use the daily quota of the first key before moving to the next one, and so on, looping back once the end of the ordered key set is reached.
string currentYouTubeDataAPIv3Key = "", // Will initially be set to the first line of `YOUTUBE_DATA_API_V3_KEYS_FILE_PATH`.
CURRENT_WORKING_DIRECTORY;
bool USE_YT_LEMNOSLIFE_COM_NO_KEY_SERVICE = false;
int main(int argc, char *argv[])
{
// Process the passed command line arguments.
for(unsigned short argvIndex = 1; argvIndex < argc; argvIndex++)
{
string argvStr = string(argv[argvIndex]);
@ -82,6 +105,7 @@ int main(int argc, char *argv[])
MAIN_PRINT("Usage: " << argv[0] << " [--help/-h] [--no-keys] [--threads=N] [--youtube-operational-api-instance-url URL]")
exit(EXIT_SUCCESS);
}
// Contrary to `--threads=`, the separator between the command line argument label and its value is a space, not an equals sign.
else if(argvStr == "--youtube-operational-api-instance-url")
{
if(argvIndex < argc - 1)
@ -100,22 +124,24 @@ int main(int argc, char *argv[])
}
}
// The starting set should be written to `CHANNELS_FILE_PATH`.
// The starting set should be written to `STARTING_CHANNELS_SET_FILE_PATH`.
// To resume this algorithm after a shutdown, just restart it after having deleted the last channel folders in `CHANNELS_DIRECTORY` being treated.
// On a restart, `CHANNELS_FILE_PATH` is read and every channel not found in `CHANNELS_DIRECTORY` is added to `channelsToTreat*` or `channelsToTreat*` otherwise before continuing, as if `CHANNELS_FILE_PATH` was containing a **treated** starting set.
vector<string> channelsVec = getFileContent(CHANNELS_FILE_PATH);
for(unsigned int channelsVecIndex = 0; channelsVecIndex < channelsVec.size(); channelsVecIndex++)
// On a restart, `STARTING_CHANNELS_SET_FILE_PATH` is read and every channel not found in `CHANNELS_DIRECTORY` is added to `channelsToTreat*`, or to `channelsAlreadyTreated` otherwise, before continuing, as if `STARTING_CHANNELS_SET_FILE_PATH` contained a **treated** starting set.
vector<string> startingChannelsSet = getFileContent(STARTING_CHANNELS_SET_FILE_PATH);
for(unsigned int startingChannelsSetIndex = 0; startingChannelsSetIndex < startingChannelsSet.size(); startingChannelsSetIndex++)
{
string channel = channelsVec[channelsVecIndex];
channelsToTreat[channelsVecIndex] = channel;
channelsToTreatRev[channel] = channelsVecIndex;
string startingChannel = startingChannelsSet[startingChannelsSetIndex];
channelsToTreat[startingChannelsSetIndex] = startingChannel;
channelsToTreatRev[startingChannel] = startingChannelsSetIndex;
}
keys = getFileContent(KEYS_FILE_PATH);
apiKey = keys[0];
// Load the YouTube Data API v3 keys stored in `YOUTUBE_DATA_API_V3_KEYS_FILE_PATH`.
youtubeDataApiV3keys = getFileContent(YOUTUBE_DATA_API_V3_KEYS_FILE_PATH);
currentYouTubeDataAPIv3Key = youtubeDataApiV3keys[0];
createDirectory(CHANNELS_DIRECTORY);
// Remove already treated channels from channels to treat.
for(const auto& entry : filesystem::directory_iterator(CHANNELS_DIRECTORY))
{
string fileName = entry.path().filename();
@ -130,6 +156,7 @@ int main(int argc, char *argv[])
}
}
// Load at runtime the current working directory.
char cwd[PATH_MAX];
if (getcwd(cwd, sizeof(cwd)) != NULL) {
CURRENT_WORKING_DIRECTORY = string(cwd) + "/";
@ -137,19 +164,26 @@ int main(int argc, char *argv[])
MAIN_EXIT_WITH_ERROR("`getcwd()` error");
}
// Print the number of:
// - channels to treat
// - channels already treated
MAIN_PRINT(channelsToTreat.size() << " channel(s) to treat")
MAIN_PRINT(channelsAlreadyTreated.size() << " channel(s) already treated")
// Start the `THREADS_NUMBER` threads.
// Note that there is an additional thread: the one running the `main` function, which continues with the code below this `for` loop.
vector<thread> threads;
for(unsigned short threadsIndex = 0; threadsIndex < THREADS_NUMBER; threadsIndex++)
{
threads.push_back(thread(treatChannels, threadsIndex + 1));
}
// Every second print the number of channels found during the last second.
// Note that if the same channel is found multiple times, the count is incremented each time.
while(true)
{
MAIN_PRINT("Channels per second: " << channelsPerSecondCount)
channelsPerSecondCount = 0;
MAIN_PRINT("Channels treated per second: " << channelsFoundPerSecondCount)
channelsFoundPerSecondCount = 0;
sleep(1);
}
@ -162,25 +196,30 @@ int main(int argc, char *argv[])
return 0;
}
// Function each thread loops in until the whole YouTube graph is completely treated.
void treatChannels(unsigned short threadId)
{
// For the moment we assume that we have never completely treated YouTube; otherwise we would have to pay attention to how to proceed if the starting set involves starvation for some threads.
while(true)
{
// As we're about to mark a channel as being treated, we need to make sure that no other thread is also modifying the set of channels we are working on.
channelsAlreadyTreatedAndToTreatMutex.lock();
if(channelsToTreat.empty())
{
channelsAlreadyTreatedAndToTreatMutex.unlock();
// This consumer thread waits for a producer thread to provide a channel to work on.
sleep(1);
continue;
}
// Treat channels in the order we found them in `STARTING_CHANNELS_SET_FILE_PATH` or discovered them.
string channelToTreat = channelsToTreat.begin()->second;
// Print the channel id the thread is going to work on and recall the numbers of channels already treated and still to treat.
PRINT("Treating channel " << channelToTreat << " (treated: " << channelsAlreadyTreated.size() << ", to treat: " << channelsToTreat.size() << ")")
channelsCountThreads[threadId] = 0;
requestsPerChannelThreads[threadId] = 0;
channelsTreatedCountThreads[threadId] = 0;
requestsCountThreads[threadId] = 0;
channelsToTreat.erase(channelsToTreatRev[channelToTreat]);
channelsToTreatRev.erase(channelToTreat);
@ -189,12 +228,14 @@ void treatChannels(unsigned short threadId)
channelsAlreadyTreatedAndToTreatMutex.unlock();
// Create the directories in which we are going to store the requests we make to YouTube.
string channelToTreatDirectory = CHANNELS_DIRECTORY + channelToTreat + "/";
createDirectory(channelToTreatDirectory);
createDirectory(DEBUG_DIRECTORY);
createDirectory(channelToTreatDirectory + CAPTIONS_DIRECTORY);
createDirectory(channelToTreatDirectory + YOUTUBE_API_REQUESTS_DIRECTORY);
createDirectory(channelToTreatDirectory + YOUTUBE_APIS_REQUESTS_DIRECTORY);
// Actually treat the given channel.
treatChannelOrVideo(threadId, true, channelToTreat, channelToTreat);
// Note that compressing the French channel with the most subscribers took 4 minutes and 42 seconds.
@ -202,29 +243,36 @@ void treatChannels(unsigned short threadId)
// As I haven't found any well-known library that easily compresses a directory, I have chosen to rely on the `zip` CLI.
// We specify no `debug`ging, as otherwise the zipping operation doesn't work as expected.
// As the zipping process isn't recursive, we can't just rely on `ls`; we are obliged to use `find`.
exec(threadId, "cd " + channelToTreatDirectory + " && find | zip ../" + channelToTreat + ".zip -@");
execute(threadId, "cd " + channelToTreatDirectory + " && find | zip ../" + channelToTreat + ".zip -@");
PRINT("Compression finished, started deleting initial directory...")
// Get rid of the uncompressed data.
deleteDirectory(channelToTreatDirectory);
PRINT("Deleting directory finished.")
PRINT(channelsCountThreads[threadId] << " channels were found for this channel.")
PRINT(channelsTreatedCountThreads[threadId] << " channels were found for this channel.")
}
// This `unlock` seems to be dead code currently as the algorithm doesn't support treating the whole YouTube graph.
channelsAlreadyTreatedAndToTreatMutex.unlock();
}
// We have to pay attention not to recursively call this function with another channel, otherwise we would break the ability of the program to halt at any top-level channel.
void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, string channelToTreat)
// Note that `id` can be a channel id or a video id. We provide `channelToTreat` in both cases, even when it is identical to `id`.
void treatChannelOrVideo(unsigned short threadId, bool isIdAChannelId, string id, string channelToTreat)
{
string pageToken = "";
// Treat all comments:
// - of a given channel thanks to YouTube Data API v3 CommentThreads: list endpoint and `allThreadsRelatedToChannelId` filter if the provided `id` is a channel id
// - of a given video thanks to YouTube Data API v3 CommentThreads: list endpoint and `videoId` filter otherwise (if the provided `id` is a video id)
while(true)
{
ostringstream toString;
toString << "commentThreads?part=snippet,replies&" << (isChannel ? "allThreadsRelatedToChannelId" : "videoId") << "=" << id << "&maxResults=100&pageToken=" << pageToken;
toString << "commentThreads?part=snippet,replies&" << (isIdAChannelId ? "allThreadsRelatedToChannelId" : "videoId") << "=" << id << "&maxResults=100&pageToken=" << pageToken;
string url = toString.str();
json data = getJson(threadId, url, true, channelToTreat, pageToken == "" ? normal : retryOnCommentsDisabled);
bool doesRelyingOnCommentThreadsIsEnough = (!isChannel) || data["error"]["errors"][0]["reason"] != "commentsDisabled";
// This condition doesn't hold for non-existing channels.
bool doesRelyingOnCommentThreadsIsEnough = (!isIdAChannelId) || data["error"]["errors"][0]["reason"] != "commentsDisabled";
if(doesRelyingOnCommentThreadsIsEnough)
{
json items = data["items"];
@ -235,6 +283,8 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
treatComment(threadId, comment, channelToTreat);
if(item.contains("replies"))
{
// If there are more than 5 replies, they need to be requested using pagination with the YouTube Data API v3 Comments: list endpoint.
// In that case we delay the treatment of the first 5 retrieved replies so as not to treat them twice.
if(item["snippet"]["totalReplyCount"] > 5)
{
string pageToken = "";
@ -278,11 +328,19 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
else
{
PRINT("Comments disabled channel, treating differently...")
json data = getJson(threadId, "channels?part=statistics&id=" + channelToTreat, true, channelToTreat);
// As far as I know we can't retrieve all videos of a channel if it has more than 20,000 videos; in such a case the program stops so that this can be investigated further.
json data = getJson(threadId, "channels?part=statistics&id=" + channelToTreat, true, channelToTreat),
items = data["items"];
if(items.empty())
{
PRINT("The provided channel doesn't exist, skipping it.");
break;
}
// YouTube Data API v3 Videos: list endpoint returns `videoCount` as a string and not an integer...
unsigned int videoCount = atoi(string(data["items"][0]["statistics"]["videoCount"]).c_str());
unsigned int videoCount = atoi(string(items[0]["statistics"]["videoCount"]).c_str());
PRINT("The channel has about " << videoCount << " videos.")
// `UC-3A9g4U1PpLaeAuD4jSP_w` has a `videoCount` of 2, while its `uploads` playlist contains 3 videos. So we use a strict inequality here.
// The `0 < videoCount` check is an optimization to avoid making a request to the YouTube Data API v3 PlaylistItems: list endpoint when we already know that no results will be returned. As many YouTube channels don't have any videos, this optimization is worthwhile.
if(0 < videoCount && videoCount < 20000)
{
string playlistToTreat = "UU" + channelToTreat.substr(2),
@ -290,16 +348,17 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
while(true)
{
// `snippet` and `status` are unneeded `part`s here but may be interesting later, as we log them.
json data = getJson(threadId, "playlistItems?part=snippet,contentDetails,status&playlistId=" + playlistToTreat + "&maxResults=50&pageToken=" + pageToken, true, channelToTreat, returnErrorIfPlaylistNotFound);
json data = getJson(threadId, "playlistItems?part=contentDetails,snippet,status&playlistId=" + playlistToTreat + "&maxResults=50&pageToken=" + pageToken, true, channelToTreat, returnErrorIfPlaylistNotFound);
if(data.contains("error"))
{
// This is a sanity check that hasn't ever been violated.
EXIT_WITH_ERROR("Not listing comments on videos, as `playlistItems` hasn't found the `uploads` playlist!")
}
json items = data["items"];
for(const auto& item : items)
{
string videoId = item["contentDetails"]["videoId"];
// To keep the same amount of logs for each channel, I comment the following `PRINT`.
// To keep the same amount of logs for each regular channel, I comment the following `PRINT`.
//PRINT("Treating video " << videoId)
treatChannelOrVideo(threadId, false, videoId, channelToTreat);
}
@ -325,22 +384,26 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
}
}
}
if(isChannel)
// If the provided `id` is a channel id, then we treat its tabs.
if(isIdAChannelId)
{
// `CHANNELS`
// Treat the `CHANNELS` tab.
string pageToken = "";
while(true)
{
json data = getJson(threadId, "channels?part=channels&id=" + id + (pageToken == "" ? "" : "&pageToken=" + pageToken), false, id),
// There is no need to verify that the channel exists, as the previous comments listing already proved that it does.
channelSections = data["items"][0]["channelSections"];
// We don't care about the channel sections themselves, we are only looking for channel ids.
for(const auto& channelSection : channelSections)
{
for(const auto& sectionChannel : channelSection["sectionChannels"])
{
string channelId = sectionChannel["channelId"];
addChannelToTreat(threadId, channelId);
markChannelAsRequiringTreatmentIfNeeded(threadId, channelId);
}
}
// There is a pagination mechanism only when there is a single channel section.
if(channelSections.size() == 1)
{
json channelSection = channelSections[0];
@ -358,16 +421,18 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
break;
}
}
// `COMMUNITY`
// Treat the `COMMUNITY` tab.
pageToken = "";
while(true)
{
// First we retrieve the community post ids, then we retrieve their comments and replies.
json data = getJson(threadId, "channels?part=community&id=" + id + (pageToken == "" ? "" : "&pageToken=" + pageToken), false, id);
data = data["items"][0];
json posts = data["community"];
for(const auto& post : posts)
{
string postId = post["id"];
// As with livestream chats, comments can be filtered as `Top comments` or `Newest first`; from my experience `Top comments` hides some comments, so we use time ordering everywhere it is possible.
json data = getJson(threadId, "community?part=snippet&id=" + postId + "&order=time", false, id);
string pageToken = data["items"][0]["snippet"]["comments"]["nextPageToken"];
while(pageToken != "")
@ -381,8 +446,9 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
if(!authorChannelId["value"].is_null())
{
string channelId = authorChannelId["value"];
addChannelToTreat(threadId, channelId);
markChannelAsRequiringTreatmentIfNeeded(threadId, channelId);
}
// Contrary to YouTube Data API v3, for a given comment having replies we don't switch from the CommentThreads: list endpoint to the Comments: list endpoint; here we keep working with the YouTube operational API CommentThreads: list endpoint and only change the page token.
string pageToken = snippet["nextPageToken"];
while(pageToken != "")
{
@ -391,7 +457,7 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
for(const auto& item : items)
{
string channelId = item["snippet"]["authorChannelId"]["value"];
addChannelToTreat(threadId, channelId);
markChannelAsRequiringTreatmentIfNeeded(threadId, channelId);
}
if(data.contains("nextPageToken"))
{
@ -413,6 +479,7 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
}
}
}
// See https://github.com/Benjamin-Loison/YouTube-operational-API/issues/49
if(data.contains("nextPageToken") && data["nextPageToken"] != "")
{
pageToken = data["nextPageToken"];
@ -422,18 +489,24 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
break;
}
}
// `PLAYLISTS`
// Treat the `PLAYLISTS` tab.
pageToken = "";
while(true)
{
json data = getJson(threadId, "channels?part=playlists&id=" + id + (pageToken == "" ? "" : "&pageToken=" + pageToken), false, id),
playlistSections = data["items"][0]["playlistSections"];
// We don't care about the playlist sections themselves, we are only looking for channel ids.
for(const auto& playlistSection : playlistSections)
{
for(const auto& playlist : playlistSection["playlists"])
{
string playlistId = playlist["id"];
// We exclude shows, as they don't contain any comments, even indirectly (at least the free ones don't).
if(playlistId.substr(0, 2) == "SC")
{
continue;
}
//PRINT(threadId, playlistId)
string pageToken = "";
while(true)
@ -443,6 +516,7 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
for(const auto& item : items)
{
json snippet = item["snippet"];
// This section is a bit out of the scope of the YouTube captions search engine goal, as we are just curious about the unlisted videos we find; but in fact it is also a bit within the initial scope, as it enables us to treat unlisted content.
string privacyStatus = item["status"]["privacyStatus"];
// `5-CXVU8si3A` in `PLTYUE9O6WCrjQsnOm56rMMNmFy_A-SjUx` has its privacy status on `privacyStatusUnspecified` and is inaccessible.
// `GMiVi8xkEXA` in `PLTYUE9O6WCrgNpeSiryP8LYVX-7tOJ1f1` has its privacy status on `private`.
@ -462,9 +536,10 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
{
// There isn't any `videoOwnerChannelId` to retrieve for `5-CXVU8si3A` for instance.
string channelId = snippet["videoOwnerChannelId"];
// As we are already treating the given channel, verifying whether it needs to be treated again would only be a waste of time, so we skip the verification in this case.
if(channelId != id)
{
addChannelToTreat(threadId, channelId);
markChannelAsRequiringTreatmentIfNeeded(threadId, channelId);
}
}
}
@ -488,12 +563,13 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
break;
}
}
// `LIVE`
// Treat the `LIVE` tab.
pageToken = "";
string playlistId = "UU" + id.substr(2);
vector<string> videoIds;
while(true)
{
// We verify, in batches of 50 videos, whether they are livestreams thanks to the YouTube Data API v3 PlaylistItems: list and Videos: list endpoints, as the PlaylistItems: list endpoint doesn't state on its own whether a given video is a livestream.
json data = getJson(threadId, "playlistItems?part=contentDetails,snippet,status&playlistId=" + playlistId + "&maxResults=50&pageToken=" + pageToken, true, id, returnErrorIfPlaylistNotFound),
items = data["items"];
for(const auto& item : items)
@ -513,6 +589,8 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
string videoId = item["id"];
//PRINT(videoId)
json liveStreamingDetails = item["liveStreamingDetails"];
// There are two possibilities for a live stream: either it has ended or not.
// If it has ended, we can no longer use the YouTube Live Streaming API LiveChat/messages: list endpoint.
if(liveStreamingDetails.contains("activeLiveChatId"))
{
string activeLiveChatId = liveStreamingDetails["activeLiveChatId"];
@ -521,12 +599,12 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
for(const auto& item : items)
{
string channelId = item["snippet"]["authorChannelId"];
addChannelToTreat(threadId, channelId);
markChannelAsRequiringTreatmentIfNeeded(threadId, channelId);
}
}
else
{
// As there isn't the usual pagination mechanism for these ended livestreams, we proceed in an uncertain way as follows.
// As there isn't the usual pagination mechanism for these ended livestreams, we proceed in an uncertain way, as follows, based on time pagination.
set<string> messageIds;
unsigned long long lastMessageTimestampRelativeMsec = 0;
while(true)
@ -543,6 +621,7 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
// We verify that we don't skip any message by verifying that the first message was already treated if we already treated some messages.
if(!messageIds.empty() && messageIds.find(firstMessageId) == messageIds.end())
{
// This sometimes happens, cf https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/39.
PRINT("The verification that we don't skip any message failed! Continuing anyway...")
}
for(const auto& message : snippet)
@ -552,7 +631,7 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
{
messageIds.insert(messageId);
string channelId = message["authorChannelId"];
addChannelToTreat(threadId, channelId);
markChannelAsRequiringTreatmentIfNeeded(threadId, channelId);
}
}
json lastMessage = snippet.back();
@ -593,7 +672,7 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
for(const auto& item : items)
{
string videoId = item["contentDetails"]["videoId"];
// Could proceed as follows by verifying `!isChannel` but as we don't know how to manage unlisted videos, we don't proceed this way.
// Could proceed as follows by verifying `!isIdAChannelId` but as we don't know how to manage unlisted videos, we don't proceed this way.
//treatChannelOrVideo(threadId, false, videoId, channelToTreat);
string channelCaptionsToTreatDirectory = CHANNELS_DIRECTORY + channelToTreat + "/" + CAPTIONS_DIRECTORY + videoId + "/";
@ -602,14 +681,14 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
// Firstly download all captions that are not automatically generated.
// The underscore in the `-o` argument is used so as not to end up with hidden files.
// We are obliged to put the video id after `--`, otherwise if the video id starts with `-` it is considered an argument.
string cmdCommonPrefix = "yt-dlp --skip-download ",
cmdCommonPostfix = " -o '" + channelCaptionsToTreatDirectory + "_' -- " + videoId;
string cmd = cmdCommonPrefix + "--write-sub --sub-lang all,-live_chat" + cmdCommonPostfix;
exec(threadId, cmd);
string commandCommonPrefix = "yt-dlp --skip-download ",
commandCommonPostfix = " -o '" + channelCaptionsToTreatDirectory + "_' -- " + videoId;
string command = commandCommonPrefix + "--write-sub --sub-lang all,-live_chat" + commandCommonPostfix;
execute(threadId, command);
// Secondly download the automatically generated captions.
cmd = cmdCommonPrefix + "--write-auto-subs --sub-langs '.*orig' --sub-format ttml --convert-subs vtt" + cmdCommonPostfix;
exec(threadId, cmd);
command = commandCommonPrefix + "--write-auto-subs --sub-langs '.*orig' --sub-format ttml --convert-subs vtt" + commandCommonPostfix;
execute(threadId, command);
}
if(data.contains("nextPageToken"))
{
@ -623,11 +702,12 @@ void treatChannelOrVideo(unsigned short threadId, bool isChannel, string id, str
}
}
// This function verifies that the given hasn't already been treated.
void addChannelToTreat(unsigned short threadId, string channelId)
// This function verifies that the given channel hasn't already been treated and, if it hasn't, marks it as requiring treatment.
void markChannelAsRequiringTreatmentIfNeeded(unsigned short threadId, string channelId)
{
channelsPerSecondCount++;
channelsCountThreads[threadId]++;
channelsFoundPerSecondCount++;
channelsTreatedCountThreads[threadId]++;
// As other threads may be writing to the sets we are reading, we need to hold the mutex to ensure consistency.
channelsAlreadyTreatedAndToTreatMutex.lock();
if(channelsAlreadyTreated.find(channelId) == channelsAlreadyTreated.end() && channelsToTreatRev.find(channelId) == channelsToTreatRev.end())
{
@@ -638,7 +718,7 @@ void addChannelToTreat(unsigned short threadId, string channelId)
channelsAlreadyTreatedAndToTreatMutex.unlock();
-writeFile(threadId, CHANNELS_FILE_PATH, "a", "\n" + channelId);
+writeFile(threadId, STARTING_CHANNELS_SET_FILE_PATH, "a", "\n" + channelId);
}
else
{
@@ -646,6 +726,7 @@ void addChannelToTreat(unsigned short threadId, string channelId)
}
}
+// Mark the comment author's channel as requiring treatment if needed.
void treatComment(unsigned short threadId, json comment, string channelId)
{
json snippet = comment["snippet"];
@@ -653,10 +734,11 @@ void treatComment(unsigned short threadId, json comment, string channelId)
if(snippet.contains("authorChannelId"))
{
string channelId = snippet["authorChannelId"]["value"];
-addChannelToTreat(threadId, channelId);
+markChannelAsRequiringTreatmentIfNeeded(threadId, channelId);
}
}
+// Join `parts` with the `delimiter`.
string join(vector<string> parts, string delimiter)
{
string result = "";
@@ -672,24 +754,27 @@ string join(vector<string> parts, string delimiter)
return result;
}
-void exec(unsigned short threadId, string cmd, bool debug)
+// Execute the provided command as if it were run in a shell.
+// This is necessary because, as far as I know, there is no C++ API for `yt-dlp`.
+void execute(unsigned short threadId, string command, bool debug)
{
+// The debugging gives us confidence that `yt-dlp` is working as expected, cf https://gitea.lemnoslife.com/Benjamin_Loison/YouTube_captions_search_engine/issues/35#issuecomment-578.
if(debug)
{
ostringstream toString;
toString << threadId;
-string initialCmd = cmd,
+string initialCommand = command,
threadIdStr = toString.str(),
debugCommonFilePath = CURRENT_WORKING_DIRECTORY + DEBUG_DIRECTORY + threadIdStr,
debugOutFilePath = debugCommonFilePath + ".out",
debugErrFilePath = debugCommonFilePath + ".err";
-cmd += " >> " + debugOutFilePath;
-cmd += " 2>> " + debugErrFilePath;
+command += " >> " + debugOutFilePath;
+command += " 2>> " + debugErrFilePath;
-writeFile(threadId, debugOutFilePath, "a", initialCmd + "\n");
-writeFile(threadId, debugErrFilePath, "a", initialCmd + "\n");
+writeFile(threadId, debugOutFilePath, "a", initialCommand + "\n");
+writeFile(threadId, debugErrFilePath, "a", initialCommand + "\n");
}
-system(cmd.c_str());
+system(command.c_str());
}
bool writeFile(unsigned short threadId, string filePath, string option, string toWrite)
@@ -714,16 +799,20 @@ bool doesFileExist(string filePath)
return stat(filePath.c_str(), &buffer) == 0;
}
+// Create a directory if it doesn't already exist.
void createDirectory(string path)
{
mkdir(path.c_str(), S_IRWXU | S_IRWXG | S_IROTH | S_IXOTH);
}
+// Delete a directory even if it's not empty.
void deleteDirectory(string path)
{
filesystem::remove_all(path);
}
+// Get the current date in `%d-%m-%Y %H-%M-%S.%MS` format.
+// Return for instance `22-02-2023 00-43-24.602`.
string getDate()
{
auto t = time(nullptr);
@@ -737,11 +826,13 @@ string getDate()
return toString.str();
}
+// Return a set built from the given vector.
set<string> setFromVector(vector<string> vec)
{
return set(vec.begin(), vec.end());
}
+// Return the lines of the file at the given `filePath` as a vector.
vector<string> getFileContent(string filePath)
{
vector<string> lines;
@@ -752,12 +843,14 @@ vector<string> getFileContent(string filePath)
return lines;
}
+// Execute and return the result of a given request to a YouTube API.
json getJson(unsigned short threadId, string url, bool usingYoutubeDataApiv3, string channelId, getJsonBehavior behavior)
{
+// If using the YouTube operational API official instance no-key service, we don't need to provide any YouTube Data API v3 key.
string finalUrl = usingYoutubeDataApiv3 ?
(USE_YT_LEMNOSLIFE_COM_NO_KEY_SERVICE ?
"https://yt.lemnoslife.com/noKey/" + url :
-"https://www.googleapis.com/youtube/v3/" + url + "&key=" + apiKey) :
+"https://www.googleapis.com/youtube/v3/" + url + "&key=" + currentYouTubeDataAPIv3Key) :
YOUTUBE_OPERATIONAL_API_INSTANCE_URL + "/" + url,
content = getHttps(finalUrl);
json data;
@@ -774,22 +867,26 @@ json getJson(unsigned short threadId, string url, bool usingYoutubeDataApiv3, st
if(data.contains("error"))
{
+// The YouTube operational API shouldn't return any error; if it does, we stop the execution to investigate the problem.
if(!usingYoutubeDataApiv3)
{
-EXIT_WITH_ERROR("Found error in JSON retrieve from YouTube operational API at URL: " << finalUrl << " for content: " << content << " !")
+EXIT_WITH_ERROR("Found error in JSON retrieved from YouTube operational API at URL: " << finalUrl << " for content: " << content << " !")
}
string reason = data["error"]["errors"][0]["reason"];
-// Contrarily to YouTube operational API no-key service we don't rotate keys in `KEYS_FILE_PATH`, as we keep them in memory here.
+// Contrary to the YouTube operational API no-key service, we don't rotate keys in `YOUTUBE_DATA_API_V3_KEYS_FILE_PATH`; we keep them in memory and rotate them there.
if(reason == "quotaExceeded")
{
quotaMutex.lock();
-keys.erase(keys.begin());
-keys.push_back(apiKey);
-PRINT("No more quota on " << apiKey << " switching to " << keys[0] << ".")
-apiKey = keys[0];
+// Move the currently exhausted YouTube Data API v3 key from the first slot to the last one.
+youtubeDataApiV3keys.erase(youtubeDataApiV3keys.begin());
+youtubeDataApiV3keys.push_back(currentYouTubeDataAPIv3Key);
+PRINT("No more quota on " << currentYouTubeDataAPIv3Key << " switching to " << youtubeDataApiV3keys[0] << ".")
+currentYouTubeDataAPIv3Key = youtubeDataApiV3keys[0];
quotaMutex.unlock();
+// We retry the request so that our key management doesn't surface as a temporary error to the caller.
return getJson(threadId, url, true, channelId);
}
+// Errors from YouTube Data API v3 are expected in some cases, for instance when requesting the comments of a channel that doesn't have any; we still have to make the request to find that out, which is why we proceed this way.
PRINT("Found error in JSON at URL: " << finalUrl << " for content: " << content << " !")
if(reason != "commentsDisabled" || behavior == retryOnCommentsDisabled)
{
@@ -797,10 +894,11 @@ json getJson(unsigned short threadId, string url, bool usingYoutubeDataApiv3, st
}
}
+// Write the request URL and the retrieved content to logs.
ostringstream toString;
-toString << CHANNELS_DIRECTORY << channelId << "/" << YOUTUBE_API_REQUESTS_DIRECTORY;
+toString << CHANNELS_DIRECTORY << channelId << "/" << YOUTUBE_APIS_REQUESTS_DIRECTORY;
writeFile(threadId, toString.str() + "urls.txt", "a", url + " " + (usingYoutubeDataApiv3 ? "true" : "false") + "\n");
-toString << requestsPerChannelThreads[threadId]++ << ".json";
+toString << requestsCountThreads[threadId]++ << ".json";
writeFile(threadId, toString.str(), "w", content);
return data;
@@ -817,6 +915,7 @@ void print(ostringstream* toPrint)
}
// Is this function really multi-threading friendly? If not, could consider executing `curl` using the command line.
+// Retrieve content from a URL. Note that this function verifies the validity of the certificate in case of HTTPS.
string getHttps(string url)
{
CURL* curl = curl_easy_init();
@@ -831,6 +930,7 @@ string getHttps(string url)
return got;
}
+// Auxiliary function required by the `getHttps` function.
size_t writeCallback(void* contents, size_t size, size_t nmemb, void* userp)
{
((string*)userp)->append((char*)contents, size * nmemb);