I was about to commit the following addition:
```c++
// Videos that have automatically generated captions but set to `Off` by default aren't retrieved with `--sub-langs '.*orig'`.
// My workaround is to first call the YouTube Data API v3 Captions: list endpoint with `part=snippet` and retrieve the `language` that has `"trackKind": "asr"` (automatic speech recognition) in `snippet`.
/*json data = getJson(threadId, "captions?part=snippet&videoId=" + videoId, true, channelToTreat),
     items = data["items"];
for(const auto& item : items)
{
    json snippet = item["snippet"];
    if(snippet["trackKind"] == "asr")
    {
        string language = snippet["language"];
        cmd = cmdCommonPrefix + "--write-auto-subs --sub-langs '" + language + "-orig' --sub-format ttml --convert-subs vtt" + cmdCommonPostfix;
        exec(threadId, cmd);
        // As there should be a single automatic speech recognized track, there is no need to go through all tracks.
        break;
    }
}*/
```
Instead of:
```c++
cmd = cmdCommonPrefix + "--write-auto-subs --sub-langs '.*orig' --sub-format ttml --convert-subs vtt" + cmdCommonPostfix;
exec(threadId, cmd);
```
But I realized that I was wrong, as acknowledged in the GitHub comment I was about to add to https://github.com/yt-dlp/yt-dlp/issues/2655:
> `yt-dlp --cookies cookies.txt --sub-langs 'en.*,.*orig' --write-auto-subs https://www.youtube.com/watch?v=tQqDBySHYlc` works as expected. Many thanks again.
>
> ```
> 'subtitleslangs': ['en.*','.*orig'],
> 'writeautomaticsub': True,
> ```
>
> Works as expected too. Thank you.
>
> Very sorry for the video sample. I had not even watched it.
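
For reference, the two quoted option lines are meant for embedding yt-dlp as a Python module. A minimal sketch of how one would pass them, assuming yt-dlp is installed as a package and reusing the video from the quoted command:
```py
# Minimal sketch: fetch the automatic subtitles with the quoted options (assumes `yt-dlp` is installed).
from yt_dlp import YoutubeDL

ydl_opts = {
    'subtitleslangs': ['en.*', '.*orig'],
    'writeautomaticsub': True,
    'skip_download': True,  # only the subtitles are of interest here
}
with YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://www.youtube.com/watch?v=tQqDBySHYlc'])
```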
Thank you for this workaround. However, note that videos that have automatically generated subtitles but set to `Off` by default aren't retrieved with your method (example of such a video: [`mozyXsZJnQ4`](https://www.youtube.com/watch?v=mozyXsZJnQ4)). My workaround is to first call the [YouTube Data API v3](https://developers.google.com/youtube/v3) [Captions: list](https://developers.google.com/youtube/v3/docs/captions/list) endpoint with [`part=snippet`](https://developers.google.com/youtube/v3/docs/captions/list#part) and retrieve the [`language`](https://developers.google.com/youtube/v3/docs/captions#snippet.language) that has [`"trackKind": "asr"`](https://developers.google.com/youtube/v3/docs/captions#snippet.trackKind) (automatic speech recognition) in [`snippet`](https://developers.google.com/youtube/v3/docs/captions#snippet), as sketched below.
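A minimal Python sketch of that Captions-based lookup (the `API_KEY` placeholder is an assumption, and the printed yt-dlp command only mirrors the flags from the C++ snippet above):
```py
# Minimal sketch: find the automatic speech recognition caption language of a video,
# then print the yt-dlp command that would download that track.
import requests

API_KEY = 'AIzaSy...'  # placeholder, assumed to be a valid YouTube Data API v3 key
videoId = 'mozyXsZJnQ4'

data = requests.get('https://www.googleapis.com/youtube/v3/captions',
                    params={'part': 'snippet', 'videoId': videoId, 'key': API_KEY}).json()

for item in data['items']:
    snippet = item['snippet']
    if snippet['trackKind'] == 'asr':
        language = snippet['language']
        print(f"yt-dlp --write-auto-subs --sub-langs '{language}-orig' --sub-format ttml --convert-subs vtt https://www.youtube.com/watch?v={videoId}")
        # There should be a single automatic speech recognized track, so stop here.
        break
```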
More precisely, I used the following algorithm with these three channels:
Channel id | 1st method (`allThreadsRelatedToChannelId`) | 2nd method (video per video)
-------------------------|-----------------------|-----------
UCt5USYpzzMCYhkirVQGHwKQ | 16 | 16
UCUo1RqYV8tGjV38sQ8S5p9A | 58,165 | 58,165
UCWIdqSQekeGmUWlSFeCiEnA | *error* (as expected) | 27
```py
"""
Algorithm comparing comments count using:
1. CommentThreads: list with allThreadsRelatedToChannelId filter
2. PlaylistItems: list and CommentThreads: list
Note that the second approach isn't *atomic*, so counts will differ if some comments are posted while retrieving data.
"""
import requests, json
CHANNEL_ID = 'UC...'
API_KEY = 'AIzaSy...'
def getJSON(url, firstTry = True):
if firstTry:
url = 'https://www.googleapis.com/youtube/v3/' + url + f'&key={API_KEY}'
try:
content = requests.get(url).text
except:
print('retry')
return getJSON(url, False)
data = json.loads(content)
return data
items = []
pageToken = ''
while True:
# After having verified, I confirm that using `allThreadsRelatedToChannelId` doesn't return comments of the `COMMUNITY` tab
data = getJSON(f'commentThreads?part=id,snippet,replies&allThreadsRelatedToChannelId={CHANNEL_ID}&maxResults=100&pageToken={pageToken}')
items += data['items']
# In fact once we have top level comment, then with both methods if the replies *count* is correct, then we are fine as we both use the same Comments: list endpoint
"""for item in data['items']:
if 'replies' in item:
if len(item['replies']['comments']) >= 5:
print('should consider replies too!')"""
print(len(items))
if 'nextPageToken' in data:
pageToken = data['nextPageToken']
else:
break
print(len(items))
PLAYLIST_ID = 'UU' + CHANNEL_ID[2:]
videoIds = []
pageToken = ''
while True:
data = getJSON(f'playlistItems?part=snippet&playlistId={PLAYLIST_ID}&maxResults=50&pageToken={pageToken}')
for item in data['items']:
videoIds += [item['snippet']['resourceId']['videoId']]
print(len(videoIds))
if 'nextPageToken' in data:
pageToken = data['nextPageToken']
else:
break
print(len(videoIds))
items = []
for videoIndex, videoId in enumerate(videoIds):
pageToken = ''
while True:
data = getJSON(f'commentThreads?part=id,snippet,replies&videoId={videoId}&maxResults=100&pageToken={pageToken}')
if 'items' in data:
items += data['items']
# repeat replies check as could be the case here and not there
"""for item in data['items']:
if 'replies' in item:
if len(item['replies']['comments']) >= 5:
print('should consider replies too!')"""
print(videoIndex, len(videoIds), len(items))
if 'nextPageToken' in data:
pageToken = data['nextPageToken']
else:
break
print(len(items))
```
As we want to retrieve as many comments as possible, we have to proceed video per video: for instance [`3F8dFt8LsXY`](https://www.youtube.com/watch?v=3F8dFt8LsXY) has comments, but the YouTube Data API v3 CommentThreads: list endpoint with the `allThreadsRelatedToChannelId` filter returns the following for `UCWIdqSQekeGmUWlSFeCiEnA`:
```json
{
"error": {
"code": 403,
"message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
"errors": [
{
"message": "The video identified by the \u003ccode\u003e\u003ca href=\"/youtube/v3/docs/commentThreads/list#videoId\"\u003evideoId\u003c/a\u003e\u003c/code\u003e parameter has disabled comments.",
"domain": "youtube.commentThread",
"reason": "commentsDisabled",
"location": "videoId",
"locationType": "parameter"
}
]
}
}
```
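When proceeding video per video, the script above silently skips such errors, because `items` is then absent from the response. A minimal sketch that explicitly distinguishes a `commentsDisabled` video from an unexpected failure (the `getVideoCommentThreads` helper name is mine; it reuses the `getJSON` helper from the script above):
```py
# Minimal sketch: return the comment threads of a video, or an empty list if its comments are disabled.
def getVideoCommentThreads(videoId):
    threads = []
    pageToken = ''
    while True:
        data = getJSON(f'commentThreads?part=id,snippet,replies&videoId={videoId}&maxResults=100&pageToken={pageToken}')
        if 'error' in data:
            reasons = [error['reason'] for error in data['error']['errors']]
            if 'commentsDisabled' in reasons:
                # Comments are disabled on this video: nothing to retrieve, move on to the next one.
                return threads
            raise Exception(data['error']['message'])
        threads += data['items']
        if 'nextPageToken' in data:
            pageToken = data['nextPageToken']
        else:
            break
    return threads
```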