Commit Graph

29 Commits

Author SHA1 Message Date
b3779fe49a
Fix #20: YouTube Data API v3 returns rarely suddenly commentsDisabled error which involves an unwanted method switch
Also modified compression command, as I got `sh: 1: zip: Argument list too long` when compressing the 248,868 json files of the French most subscribers channel.
2023-01-08 15:43:27 +01:00
3ae0f4e924
Make all Python scripts executable and add findAlreadyTreatedCommentsCount.py to find how many comments were already treated 2023-01-07 15:45:31 +01:00
3758405f52
Add a note about the timing percentage of findLatestTreatedCommentsForChannelsBeingTreated.py going backward 2023-01-07 15:35:12 +01:00
71e4bd95a9
Fix #9: Make sure that in case of error returned by the YouTube Data API v3 the algorithm treats it correctly
Note that in case of error the algorithm used to skip the received content, as if just no `items` were in it.
2023-01-06 20:55:32 +01:00
34bbc216f6
Fix #15: Provide an algorithm to retrieve the list of 100 French channels with most subscribers (and provide it too) 2023-01-06 18:06:00 +01:00
baec8fcb6c
#7: Remove remaining undefined behavior due to missing mutex use 2023-01-06 18:00:51 +01:00
773f86c551
Fix #17: Add to stdout live statistics of the number of comments treated per second 2023-01-06 17:55:16 +01:00
f436007836
Fix #16: Provide an algorithm to determine the progress of retrieving comments for huge YouTube channels 2023-01-06 17:51:00 +01:00
dfbf38b071
#1: Add GNU AGPLv3 license 2023-01-06 16:09:12 +01:00
292dd8919e
Add try/catch around json parser
As got:
```
terminate called after throwing an instance of 'nlohmann::detail::parse_error'
terminate called recursively
  what():  [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
terminate called recursively
```
2023-01-06 00:31:05 +01:00
dab4c8ff69
Modify removeChannelsBeingTreated.py to be more resilient against not existing files in the treatment process 2023-01-04 03:10:28 +01:00
9d5c9fde2a
#2: Add compression to channels/ folder
Can use following Python script to compress existing uncompressed
`channels/` folder.

```py

import os, shutil

path = 'channels/'

os.chdir(path)

d = next(os.walk('.'))[1]
for channelIndex, channelId in enumerate(d):
    print(f'{channelIndex} / {len(d)}: {channelId}')
    shutil.make_archive(channelId, 'zip', channelId)
    shutil.rmtree(channelId)
```
2023-01-04 03:06:33 +01:00
f201ae7a91
Make #7: Add multi-threading compatible with my Debian setup 2023-01-04 02:51:40 +01:00
4cae7e09d1
Add {removeChannelsBeingTreated, findTreatedChannelWithMost{Comments, Subscribers}}.py 2023-01-04 02:41:07 +01:00
e4b4ce21a2
Fix #7: Add multi-threading 2023-01-03 04:56:19 +01:00
a2990c7699
Fix #8: Support comments disabled channels
Tested with `UCWIdqSQekeGmUWlSFeCiEnA` which treated correctly the 36 comments of the only comments enabled video `3F8dFt8LsXY`.

Note that this commit doesn't support comments disabled channels with more than 20,000 videos.
2023-01-03 02:56:07 +01:00
923c14a77b
#2: Add data logging 2023-01-02 19:46:32 +01:00
73a9dea023
Apply astyle formatting to main.cpp 2023-01-02 18:31:16 +01:00
938ae4b0fb
Fix #4: Provide a version relying on the no-key service of https://yt.lemnoslife.com 2023-01-02 18:30:18 +01:00
c50a82df1b
Make compatible with Debian
More precise ly make compatible with `gcc version 10.2.1 20210110 (Debian 10.2.1-6)`
2023-01-02 18:23:30 +01:00
36f1fb9e83
Add progression save and use spaces instead of tabs 2022-12-22 06:18:22 +01:00
934954092a
Add time to logging 2022-12-22 05:47:16 +01:00
eaae954e1b
Add resilience to missing authorChannelId in main.cpp 2022-12-22 05:41:38 +01:00
2ffc1d0e5d
Add main.cpp, Makefile and channelsToTreat.txt
Note that running this algorithm end up with channel [`UC-99odscxh1xxTyxHyXuRrg`](https://www.youtube.com/channel/UC-99odscxh1xxTyxHyXuRrg) and more precisely the video [`Tq5aPNzfYcg`](https://www.youtube.com/watch?v=Tq5aPNzfYcg) and more precisely the comment [`Ugx-TlSq6SNCbOX04mx4AaABAg`](https://www.youtube.com/watch?v=Tq5aPNzfYcg&lc=Ugx-TlSq6SNCbOX04mx4AaABAg) [which doesn't have any author](https://yt.lemnoslife.com/noKey/comments?part=snippet&id=Ugx-TlSq6SNCbOX04mx4AaABAg)...
2022-12-22 05:20:32 +01:00
53acda6abe
Update README.md to remove the question about whether or not both methods return the same comments, as it's the case
More precisely I used following algorithm with these three channels:
channel id               | 1st method            | 2nd method
-------------------------|-----------------------|-----------
UCt5USYpzzMCYhkirVQGHwKQ | 16                    | 16
UCUo1RqYV8tGjV38sQ8S5p9A | 58,165                | 58,165
UCWIdqSQekeGmUWlSFeCiEnA | *error* (as expected) | 27

```py
"""
Algorithm comparing comments count using:
1. CommentThreads: list with allThreadsRelatedToChannelId filter
2. PlaylistItems: list and CommentThreads: list
Note that the second approach isn't *atomic*, so counts will differ if some comments are posted while retrieving data.
"""

import requests, json

CHANNEL_ID = 'UC...'
API_KEY = 'AIzaSy...'

def getJSON(url, firstTry = True):
    if firstTry:
        url = 'https://www.googleapis.com/youtube/v3/' + url + f'&key={API_KEY}'
    try:
        content = requests.get(url).text
    except:
        print('retry')
        return getJSON(url, False)
    data = json.loads(content)
    return data

items = []
pageToken = ''
while True:
    # After having verified, I confirm that using `allThreadsRelatedToChannelId` doesn't return comments of the `COMMUNITY` tab
    data = getJSON(f'commentThreads?part=id,snippet,replies&allThreadsRelatedToChannelId={CHANNEL_ID}&maxResults=100&pageToken={pageToken}')
    items += data['items']
    # In fact once we have top level comment, then with both methods if the replies *count* is correct, then we are fine as we both use the same Comments: list endpoint
    """for item in data['items']:
        if 'replies' in item:
            if len(item['replies']['comments']) >= 5:
                print('should consider replies too!')"""
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))

PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]

videoIds = []
pageToken = ''
while True:
    data = getJSON(f'playlistItems?part=snippet&playlistId={PLAYLIST_ID}&maxResults=50&pageToken={pageToken}')
    for item in data['items']:
        videoIds += [item['snippet']['resourceId']['videoId']]
    print(len(videoIds))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(videoIds))
items = []

for videoIndex, videoId in enumerate(videoIds):
    pageToken = ''
    while True:
        data = getJSON(f'commentThreads?part=id,snippet,replies&videoId={videoId}&maxResults=100&pageToken={pageToken}')
        if 'items' in data:
            items += data['items']
        # repeat replies check as could be the case here and not there
            """for item in data['items']:
                if 'replies' in item:
                    if len(item['replies']['comments']) >= 5:
                        print('should consider replies too!')"""
        print(videoIndex, len(videoIds), len(items))
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break

print(len(items))
```
2022-12-22 03:18:25 +01:00
d776c09fec
Update README.md to clean notes concerning optimized approaches 2022-12-22 02:02:48 +01:00
daf14d4b5b
Update README.md to make clear to use different strategies to optimize the process
Note that as far as I (and StackOverflow ([1.](https://stackoverflow.com/q/63387215) and [2.](https://stackoverflow.com/q/67652250)) seems to) know there is no workaround to the 20,000 limit of PlaylistItems: list. This issue can be checked with:

```py
import requests, json

PLAYLIST_ID = 'UUf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/playlistItems?part=id&playlistId={PLAYLIST_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

Returns >= 19,000.

Note that this algorithm says that:
- [france24](https://www.youtube.com/@FRANCE24) has 6,086 videos while [SocialBlade states that it has 101,196 videos](https://socialblade.com/youtube/user/france24)
- [CNN](https://www.youtube.com/@CNN) has 19,289 while [SocialBlade states that it has 157,321 videos](https://socialblade.com/youtube/user/cnn)

Indeed both YouTube Data API v3 Search: list (I verified that https://github.com/Benjamin-Loison/YouTube-operational-API/issues/4 applied here with below code) and web-scraping `VIDEOS` tab don't work (see second SO link).

```py
import requests, json

CHANNEL_ID = 'UCf8w5m0YsRa8MHQ5bwSGmbw'
API_KEY = 'AIzaSy...'

items = []
pageToken = ''
while True:
    url = f'https://www.googleapis.com/youtube/v3/search?part=id&type=video&channelId={CHANNEL_ID}&maxResults=50&key={API_KEY}&pageToken={pageToken}'
    content = requests.get(url).text
    data = json.loads(content)
    items += data['items']
    print(len(items))
    if 'nextPageToken' in data:
        pageToken = data['nextPageToken']
    else:
        break

print(len(items))
```

Got ~18,734.

Another try by working with Search: list with date filter may make sense.

Note that according to SocialBlade:
- [asianetnews has 195,600 videos](https://socialblade.com/youtube/user/asianetnews)
- [RoelVandePaar has 2,2025,566 videos](https://socialblade.com/youtube/c/roelvandepaar)
2022-12-22 01:54:57 +01:00
3aa9947f8e
Update README.md to remove possibility to proceed using YouTube Data API v3 CommentThreads: list endpoint with allThreadsRelatedToChannelId filter
As we want to retrieve as many comments as possible, we have to proceed video per video, as [`3F8dFt8LsXY`](https://www.youtube.com/watch?v=3F8dFt8LsXY) for instance has comments but using YouTube Data API v3 CommentThreads: list endpoint with `allThreadsRelatedToChannelId` filter returns for `UCWIdqSQekeGmUWlSFeCiEnA`:
```json
{
  "error": {
    "code": 403,
    "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
    "errors": [
      {
        "message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
        "domain": "youtube.commentThread",
        "reason": "commentsDisabled",
        "location": "videoId",
        "locationType": "parameter"
      }
    ]
  }
}
```
2022-12-21 23:49:27 +01:00
c828b118d3
Add README.md with first sketching questions 2022-12-21 23:46:14 +01:00