Add data logging #2

Closed
opened 2022-12-22 04:37:44 +01:00 by Benjamin_Loison · 3 comments

Could remove some repetitive fields depending on the data logging.

Adding compression might make sense (just storing compressed responses is also a possibility). In practice, compressing the `channels/` folder, weighing 20 GB, results in 3.2 GB, so we gain approximately a factor of 6.25. If we actually implement compression in the algorithm, we have to proceed folder by folder or make a global archive, but the latter might be more difficult.
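A minimal sketch of the folder-by-folder approach, assuming one subfolder per channel under `channels/` (the layout and archive naming are illustrative):

```python
import os
import shutil

CHANNELS_DIR = 'channels'  # assumption: one subfolder per channel

# Compress each finished channel folder into its own archive, then
# remove the uncompressed folder to reclaim space.
for channel_id in os.listdir(CHANNELS_DIR):
    folder = os.path.join(CHANNELS_DIR, channel_id)
    if os.path.isdir(folder):
        # Creates `channels/<channel_id>.tar.gz` next to the folder.
        shutil.make_archive(folder, 'gztar', folder)
        shutil.rmtree(folder)
```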

Could also spare a bit of storage by not storing the pretty-printed version of the JSON files (personal note: the JSON files use spaces instead of tabs here). Could also rename fields to `0`, `1`, etc.
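For instance, in Python, dumping the responses without indentation and with compact separators already removes all the pretty-printing overhead (a sketch, assuming responses are handled as dictionaries):

```python
import json

response = {'items': [{'id': 'exampleId'}], 'nextPageToken': 'CAUQAA'}

# Pretty-printed: newlines and spaces inflate every file.
pretty = json.dumps(response, indent=4)

# Minified: no whitespace at all between tokens.
minified = json.dumps(response, separators=(',', ':'))

print(len(pretty), len(minified))  # the minified form is noticeably smaller
```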

Could also not log data that we don't need, but then there is no turning back. Comment ids, authors, and texts would particularly interest me too.
For instance we could add the following (see the sketch after this list):

  • `&fields=items(snippet/topLevelComment/snippet/authorChannelId,replies/comments/snippet/authorChannelId),nextPageToken` to calls to the CommentThreads: list endpoint
  • `&fields=items(snippet/authorChannelId),nextPageToken` to calls to the Comments: list endpoint
  • `&fields=items(statistics/videoCount)` to calls to the Videos: list endpoint
  • `&fields=items(contentDetails/videoId),nextPageToken` to calls to the PlaylistItems: list endpoint
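A sketch of appending such a `fields` filter to a YouTube Data API v3 CommentThreads: list request (the API key and video id are placeholders):

```python
import json
import urllib.parse
import urllib.request

API_KEY = 'YOUR_API_KEY'   # placeholder
VIDEO_ID = 'dQw4w9WgXcQ'   # example video id

params = {
    'part': 'snippet,replies',
    'videoId': VIDEO_ID,
    'key': API_KEY,
    # Only keep the author channel ids and the pagination token.
    'fields': 'items(snippet/topLevelComment/snippet/authorChannelId,'
              'replies/comments/snippet/authorChannelId),nextPageToken',
}
url = ('https://www.googleapis.com/youtube/v3/commentThreads?'
       + urllib.parse.urlencode(params))

with urllib.request.urlopen(url) as response:
    data = json.load(response)
print(data.get('nextPageToken'))
```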

~~However note that currently [the YouTube operational API doesn't provide support for the no-key service](https://github.com/Benjamin-Loison/YouTube-operational-API/issues/27) (related to #4).~~ I solved the YouTube operational API issue that prevented us from using the `fields` parameter with the no-key service.

Should also consider finding an unused hard disk at home, if its capacity is enough, to use until the presentation day.

Benjamin_Loison added the enhancement label 2023-01-06 18:52:52 +01:00
Benjamin_Loison added the high priority label 2023-01-06 19:34:01 +01:00
Benjamin_Loison added the epic label 2023-01-06 19:35:29 +01:00
Author
Owner

If we had a lot of storage compared to what a typical execution of the algorithm requires, using multiple computers to run this YouTube graph discovery would make sense. We should then also pay attention to global variables (notably mutexes) so as not to treat the same channel multiple times (once on each of several computers).
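A minimal sketch of one way to split channels across machines without shared mutexes, assuming a fixed machine count and a deterministic partition by channel id (the names here are illustrative):

```python
import hashlib

NUM_MACHINES = 4   # assumption: machines are numbered 0..NUM_MACHINES-1
MACHINE_INDEX = 0  # this machine's index

def is_ours(channel_id: str) -> bool:
    """Deterministically assign each channel to exactly one machine."""
    digest = hashlib.sha256(channel_id.encode()).hexdigest()
    return int(digest, 16) % NUM_MACHINES == MACHINE_INDEX

# Each machine only treats the channels in its own partition, so no
# channel is treated on two computers.
discovered = ['UC_x5XG1OV2P6uZZ5FSM9Ttw', 'UCBR8-60-B28hp2BmDPdntcQ']
to_treat = [c for c in discovered if is_ours(c)]
print(to_treat)
```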
Author
Owner

As a practical experience, with 4 threads, we treat the 43 French channels with the most subscribers in 24 hours, which amounts to about 14 GB (including 4.5 GB of uncompressed data for the channels currently being worked on) and 14,910,408 discovered channels.
Author
Owner

With my 604 GB of free storage, it's enough to already build a nice dataset, which would need 604 / 14 ≈ 43 days to be filled with 4 threads.
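The estimate above as a quick sketch, assuming the daily growth stays at the measured 14 GB per day:

```python
free_storage_gb = 604  # measured free space
daily_growth_gb = 14   # measured growth per day with 4 threads

days_to_fill = free_storage_gb / daily_growth_gb
print(f'{days_to_fill:.1f} days')  # ≈ 43.1 days
```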