Add data logging #2
We could remove some repetitive fields, depending on the data logging.
Adding compression might make sense (just storing compressed responses is also a possibility). In practice, compressing the `channels/` folder, which weighs 20 GB, results in 3.2 GB, so we win approximately a factor of 6.25 (if we actually implement compression in the algorithm, we would have to proceed folder by folder or make a global archive, which might be more difficult).

We could also spare a bit of storage by not storing a pretty-printed version of the JSONs (personal note: the JSON files here use spaces instead of tabs). We could also rename fields to `0`, `1`, etc.

We could also stop logging data that we don't need, but there is no turning back on that. In particular, comment ids, authors and texts would also interest me.
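To make the compression and compact-JSON ideas concrete, here is a minimal Python sketch; the function names and the `.json.gz` layout are assumptions for illustration, not the project's actual code:

```python
import gzip
import json

def store_response(path, response):
    # separators=(',', ':') drops the whitespace that pretty-printing adds;
    # gzip on each response should recover much of the factor-6.25 gain
    # measured on the whole channels/ folder (per-file compression
    # typically gains a bit less than a global archive).
    compact = json.dumps(response, separators=(',', ':'))
    with gzip.open(path + '.json.gz', 'wt', encoding='utf-8') as f:
        f.write(compact)

def load_response(path):
    with gzip.open(path + '.json.gz', 'rt', encoding='utf-8') as f:
        return json.load(f)
```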
For instance we could add:

- `&fields=items(snippet/topLevelComment/snippet/authorChannelId,replies/comments/snippet/authorChannelId),nextPageToken` to calls to the CommentThreads: list endpoint (see the request sketch after this list)
- `&fields=items(snippet/authorChannelId),nextPageToken` to calls to the Comments: list endpoint
- `&fields=items(statistics/videoCount)` to calls to the Videos: list endpoint
- `&fields=items(contentDetails/videoId),nextPageToken` to calls to the PlaylistItems: list endpoint

However, note that currently the YouTube operational API doesn't provide support for the no-key service (related to #4). I solved the YouTube operational API issue that was preventing us from using the `fields` parameter with the no-key service.

We should also consider finding an unused hard disk at home, if it's big enough, to use until the presentation day.
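To make the first list item concrete, here is a minimal sketch of a CommentThreads: list call restricted with `fields`, using the official YouTube Data API v3 URL shape; the video id and API key are placeholders, and the no-key service is expected to accept the same parameters:

```python
import requests

# Only author channel ids and the pagination token are requested, which
# shrinks each response considerably compared to the full payload.
FIELDS = ('items(snippet/topLevelComment/snippet/authorChannelId,'
          'replies/comments/snippet/authorChannelId),nextPageToken')

response = requests.get(
    'https://www.googleapis.com/youtube/v3/commentThreads',
    params={
        'part': 'snippet,replies',
        'videoId': 'VIDEO_ID',  # placeholder
        'fields': FIELDS,
        'key': 'API_KEY',       # placeholder
    },
)
data = response.json()
```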
If we had a lot of storage compared to what a typical execution of the algorithm needs, using multiple computers to run this YouTube graph discovery would make sense. We should then also pay attention to global variables (notably mutexes) so that the same channel isn't treated multiple times (once on each of several computers); see the sketch below.
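One way to guarantee that no two computers treat the same channel, without sharing mutexes across machines, is to partition the channel id space deterministically; a hypothetical helper, not the project's actual code:

```python
import hashlib

def is_assigned_to_me(channel_id, machine_index, machine_count):
    # Hash the channel id and bucket it by machine: every machine applies
    # the same function, so each channel belongs to exactly one machine
    # and no cross-machine locking is needed.
    digest = hashlib.sha1(channel_id.encode('utf-8')).digest()
    return int.from_bytes(digest[:8], 'big') % machine_count == machine_index

# Machine 0 of 2 only treats channels whose hash falls in its bucket:
# if is_assigned_to_me(channel_id, 0, 2): treat(channel_id)
```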
As a practical data point, with 4 threads we treated the 43 French channels with the most subscribers in 24 hours, which amounts to about 14 GB (including 4.5 GB of uncompressed channels currently being worked on) and 14,910,408 discovered channels.

With my 604 GB of free storage, that's already enough for a nice dataset; filling it with 4 threads would take 604 / 14 ≈ 43 days.
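That estimate is just the storage budget divided by the measured daily growth:

```python
FREE_STORAGE_GB = 604  # free storage available
GB_PER_DAY = 14        # measured over 24 hours with 4 threads

print(FREE_STORAGE_GB / GB_PER_DAY)  # ≈ 43.1 days to fill the disk
```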