Add data logging #2

Closed
opened 2022-12-22 04:37:44 +01:00 by Benjamin_Loison · 3 comments

We could remove some repetitive fields, depending on what data we log.

Adding compression might make sense (storing only compressed responses is also a possibility). In practice, compressing the channels/ folder, which weighs 20 GB, results in 3.2 GB, so we gain a factor of approximately 6.25. If we actually implement compression in the algorithm, we would have to proceed folder by folder, or make a global archive, which might be more difficult.
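
As a rough illustration of storing only compressed responses, here is a minimal Python sketch. It assumes responses are JSON objects and uses gzip from the standard library. The helper names and the sample payload are hypothetical, and the ratio on real data will differ from the one measured on channels/.

```python
import gzip
import json


def compress_response(response: dict) -> bytes:
    """Serialize a response to JSON and gzip it (hypothetical helper)."""
    raw = json.dumps(response).encode("utf-8")
    return gzip.compress(raw)


def decompress_response(blob: bytes) -> dict:
    """Inverse of compress_response."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical, repetitive payload standing in for a real API response.
sample = {
    "items": [
        {"snippet": {"authorChannelId": {"value": f"UC{i:022d}"}}}
        for i in range(1000)
    ]
}

raw_size = len(json.dumps(sample).encode("utf-8"))
blob = compress_response(sample)
print(f"{raw_size} bytes -> {len(blob)} bytes")
```

Compressing each response separately keeps every file readable on its own, at the cost of a worse ratio than a single archive over the whole folder.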

We could also save a bit of storage by not storing the pretty-printed version of the JSON files (personal note: the JSON files use spaces instead of tabs here). We could also rename fields to 0, 1, etc.

We could also not log data that we don't need, but there is no going back. In particular, comment ids, authors and texts would interest me too.
For instance we could add:
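
To get a feel for these two savings, the following Python sketch compares a pretty-printed dump, a compact dump, and a compact dump with renamed fields. The key mapping and the sample object are made up for illustration; a real mapping would have to be stored somewhere to read the data back.

```python
import json

# Hypothetical mapping from verbose field names to short ones.
SHORT_KEYS = {"items": "0", "snippet": "1", "authorChannelId": "2", "value": "3"}


def shorten(obj, mapping=SHORT_KEYS):
    """Recursively rename dictionary keys using the mapping."""
    if isinstance(obj, dict):
        return {mapping.get(k, k): shorten(v, mapping) for k, v in obj.items()}
    if isinstance(obj, list):
        return [shorten(v, mapping) for v in obj]
    return obj


# Hypothetical sample standing in for a logged response.
sample = {"items": [{"snippet": {"authorChannelId": {"value": "UC_example"}}}]}

pretty = json.dumps(sample, indent=4)
compact = json.dumps(sample, separators=(",", ":"))
renamed = json.dumps(shorten(sample), separators=(",", ":"))
print(len(pretty), len(compact), len(renamed))
```

Once compression is applied, the gain from renaming fields is likely smaller than these raw sizes suggest, since repeated key names compress well.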

  • &fields=items(snippet/topLevelComment/snippet/authorChannelId,replies/comments/snippet/authorChannelId),nextPageToken to calls to CommentThreads: list endpoint
  • &fields=items(snippet/authorChannelId),nextPageToken to calls to Comments: list endpoint
  • &fields=items(statistics/videoCount) to calls to Videos: list endpoint
  • &fields=items(contentDetails/videoId),nextPageToken to calls to PlaylistItems: list endpoint
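
A minimal Python sketch of how such a fields filter could be appended to a request URL. The base URL is the public YouTube Data API v3 one and is an assumption about what the project calls; the helper name and the example parameter values are hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL; the project may call another instance instead.
BASE_URL = "https://www.googleapis.com/youtube/v3"

# Field filters taken from the list above.
FIELDS = {
    "comments": "items(snippet/authorChannelId),nextPageToken",
    "playlistItems": "items(contentDetails/videoId),nextPageToken",
}


def build_url(endpoint: str, **params: str) -> str:
    """Build a request URL for an endpoint, adding its fields filter."""
    query = dict(params, fields=FIELDS[endpoint])
    # Keep the characters of the filter syntax unescaped for readability.
    return f"{BASE_URL}/{endpoint}?{urlencode(query, safe='(),/')}"


# Hypothetical parameter values.
url = build_url("comments", part="snippet", parentId="PARENT_ID")
print(url)
```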

Previously the YouTube operational API did not support the fields parameter with the no-key service (https://github.com/Benjamin-Loison/YouTube-operational-API/issues/27, related to #4). I have since solved that issue, so the fields parameter can now be used with the no-key service.

We should also consider finding an unused hard disk at home and, if it is large enough, using it until the presentation day.

Benjamin_Loison added the enhancement label 2023-01-06 18:52:52 +01:00
Benjamin_Loison added the high priority label 2023-01-06 19:34:01 +01:00
Benjamin_Loison added the epic label 2023-01-06 19:35:29 +01:00
Author
Owner

If we had a lot of storage compared to what a typical execution of the algorithm needs, using multiple computers to run this YouTube graph discovery would make sense. We should then also pay attention to global variables (notably mutexes) so as not to treat the same channel multiple times (once on each of several computers).
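
One possible way to avoid treating a channel on several computers without shared mutexes is to assign each channel id to exactly one machine through a stable hash. This is only a sketch of that idea in Python, not something the project implements; the function names and the example channel ids are hypothetical.

```python
import hashlib


def owner_of(channel_id: str, worker_count: int) -> int:
    """Map a channel id to the index of the single worker responsible for it."""
    # sha256 is stable across machines, unlike Python's built-in hash().
    digest = hashlib.sha256(channel_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % worker_count


def should_treat(channel_id: str, worker_index: int, worker_count: int) -> bool:
    """Return True if this worker is the one that should treat the channel."""
    return owner_of(channel_id, worker_count) == worker_index


# Hypothetical channel ids.
for channel_id in ["UC_example_1", "UC_example_2", "UC_example_3"]:
    print(channel_id, "->", owner_of(channel_id, worker_count=3))
```

With this scheme a machine that discovers a channel it does not own would have to forward the id to the owner, which is the part that still needs a communication channel between computers.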

Author
Owner

As a practical experience, with 4 threads, we treat the 43 most-subscribed French channels in 24 hours. This takes about 14 GB (including 4.5 GB of uncompressed data for channels still being worked on) and discovers 14,910,408 channels.

Author
Owner

My 604 GB of free storage is already enough for a nice dataset: at about 14 GB per day, it would take 604 / 14 ≈ 43 days to fill with 4 threads.
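
The same estimate as a small Python sketch, assuming the storage growth stays at the rate observed over the first 24 hours, which is not guaranteed. The function name is hypothetical.

```python
def days_to_fill(free_gb: float, gb_per_day: float) -> float:
    """Number of days before the free storage is used up at a constant rate."""
    return free_gb / gb_per_day


# Figures from the comments above: 604 GB free, about 14 GB per day.
days = days_to_fill(604, 14)
print(f"{days:.1f} days")
```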
