Prepare the presentation #35

Benjamin_Loison · 2023-02-10T08:44:47+01:00

Benjamin_Loison commented

2023-02-10 08:44:47 +01:00

The defenses will take place on Monday, February 27, during the usual class hours.
17h55-18h20: Benjamin Loison, Searching in YouTube subtitles

The defenses will consist of a presentation of 15 min, followed by 10 min of questions. You must give the defense on-site. The presentation should feature a demo of the outcome of the project. Please make sure to bring your laptop for this, as well as adapters to be able to project (the room has VGA or HDMI).

Remember that the outcome of your project must be available on an open version control platform (e.g., GitLab, Codeberg, or GitHub), licensed as open-source, and documented with some minimal documentation (README file) to explain the goal of the project, the structure of the code, the dependencies, and how to deploy and run the system.

Details about the evaluation criteria of the projects are available in the project description PDF on the Moodle platform.

See below for the project description PDF summary.

Source: email of 14/02/23.

1 Overview of the project

The purpose of this project is to design and implement a novel application of Web data, that includes aspects such as data acquisition, extraction, cleaning, processing, integration, visualization, on one or several data sources from the Web (downloaded or used as services)

Other contributions to Web-related software such as Web browsers, servers, caches, etc., can also be considered.

2 Organization

Projects can be carried out by individual students, or (preferably) in groups of two. Projects chosen by different groups of students can integrate with each other (if relevant). The project chosen by each group of students will need to be submitted on the Moodle platform by December 14 at the latest, for approval. Groups will defend their contributions by giving an overall presentation and showing a demonstration of their system on February 27. The software needs to be developed on an open version control platform such as GitLab or GitHub and licensed as open source. Minimal documentation (README file) should be provided to explain the goal of the project, the structure of the code, the dependencies, and how to deploy and run the system.

(repeated in the email)

3 Evaluation

The project is expected to be an implementation project; contributions that are more at the algorithmic level will also be accepted, but implementations of these algorithms are still required.
The following elements will be particularly valued when evaluating a group’s work:
• Depth of the contribution;
• Applicability value of the software and usability;
• Integration within existing software, services, platforms (in particular, contributions to existing open-source code bases are allowed and encouraged);
• Impact, wow effect of the demonstration;
• Initiative, creativity, originality;
• Good engineering practices, code quality;
• Quality of the presentation itself.

Source: https://moodle.r2.enst.fr/moodle/pluginfile.php/39334/mod_resource/content/4/project.pdf

Related to #19.

Benjamin_Loison added the waiting presentation and medium priority labels 2023-02-10 08:44:47 +01:00

Benjamin_Loison commented

2023-02-14 23:50:53 +01:00

Also verify algorithm quality by checking debug/*.err content:

grep -vE "yt-dlp --skip-download|find \| zip|style information loss|Video unavailable|This video is not available|Offline|player = |nsig extraction failed|Retrying \(1/3\)|Retrying \(2/3\)|Temporary failure in name resolution|This live event will begin in|We're processing this video|This live stream recording is not available|inappropriate|Internal Server Error|Invalid start time|Premieres in |Premiere will begin shortly|Some formats are possibly damaged|Remote end closed connection without response|only available to Music Premium members|The handshake operation timed out|read operation timed out|This live event has ended|Error 404|Incomplete chapter" *.err
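The same triage can be sketched in Python, which may be easier to maintain than one long alternation. This is only a sketch: `KNOWN_PATTERNS` is a shortened subset of the patterns above, and `unexplained_lines` is a hypothetical helper name.

```python
import re

# Shortened subset of the patterns from the grep command above (illustrative only).
KNOWN_PATTERNS = [
    r'yt-dlp --skip-download',
    r'Video unavailable',
    r'nsig extraction failed',
    r'Retrying \([12]/3\)',
    r'Invalid start time',
    r'Error 404',
]

KNOWN = re.compile('|'.join(KNOWN_PATTERNS))


def unexplained_lines(lines):
    """Return the lines matching none of the known patterns, like `grep -vE`."""
    return [line for line in lines if not KNOWN.search(line)]


sample = [
    'ERROR: [youtube] abc: Video unavailable',
    'WARNING: [youtube] Retrying (2/3)...',
    'ERROR: something new',
]
print(unexplained_lines(sample))
```

Only the lines returned by `unexplained_lines` need a manual look.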

Currently, QAlRgdhz6sU is the only video that returns:

ERROR: [youtube] QAlRgdhz6sU: Sign in to confirm your age. This video may be inappropriate for some users.

and it doesn't have captions. Maybe videos flagged as inappropriate for some users are prevented from having captions.

Also got:

WARNING: [youtube] Invalid start time (488.0 < 458.0) for chapter "Assure-toi qu'il y a des offres d'emploi dans la région où tu habites"

for video edF4q1DxcEU, as it mentions chapter timestamps that are later than the video duration.

Despite this warning, the captions of this video are retrieved correctly.

grep 'Invalid start time' *.err | sort | uniq > invalidStartTime.txt

verifyStartTime.py:

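#!/usr/bin/python3

# Print each distinct chapter title that triggered an `Invalid start time` warning.

chapters = set()

with open('invalidStartTime.txt') as f:
    for line in f:
        # Fields 0 to 9 are the grep file prefix and the fixed warning text,
        # the chapter title starts at field 10.
        chapter = ' '.join(line.split()[10:])
        if chapter not in chapters:
            chapters.add(chapter)
            print(chapter)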
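Splitting on a fixed field index depends on the exact `grep` prefix. A regular expression is a possible alternative. The sketch below assumes the warning always has the shape quoted above, and `parse_invalid_start_time` is a hypothetical helper name.

```python
import re

# Assumed shape, taken from the warning quoted above:
# ... Invalid start time (488.0 < 458.0) for chapter "<title>"
PATTERN = re.compile(r'Invalid start time \(([\d.]+) < ([\d.]+)\) for chapter "(.*)"')


def parse_invalid_start_time(line):
    """Return (first number, second number, chapter title), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    first, second, title = match.groups()
    return float(first), float(second), title


line = 'a.err:WARNING: [youtube] Invalid start time (488.0 < 458.0) for chapter "Intro"'
print(parse_invalid_start_time(line))
```

Keeping the two numbers alongside the title makes it possible to review how far off each chapter timestamp is.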

No problem with:

WARNING: [youtube] 00eyFxG1kFs: Native nsig extraction failed: Trying with PhantomJS

as 00eyFxG1kFs has captions and they are retrieved correctly.

Note that this seems to be a solved yt-dlp issue (https://github.com/yt-dlp/yt-dlp/issues/6131).

Found a video id with captions thanks to:

getVideoIdHavingCaptions.py:

#!/usr/bin/python3

import requests

with open('trying.txt') as f:
    for line in f:
        # The third field is `<videoId>:`, so strip the trailing colon.
        videoId = line.split()[2][:-1]
        print(videoId)
        content = requests.get(f'https://yt.lemnoslife.com/noKey/captions?part=snippet&videoId={videoId}').json()
        # Stop at the first video that has at least one caption track.
        if content['items'] != []:
            print('Found captions')
            break

Using:

grep 'Trying' *.err | sort | uniq > trying.txt

Also got:

yt-dlp --skip-download --sub-lang all,-live_chat -o 'channels/UC_i8X3p8oZNaik8X513Zn1Q/captions/D9A3e4TJhUo/_' -- D9A3e4TJhUo
WARNING: [youtube] Failed to download MPD manifest: HTTP Error 500: Internal Server Error

But the only video currently suffering this problem doesn't have captions.

The following error seems to be on yt-dlp's end; I should investigate it.

ERROR: [youtube] cmcHYsVbYi0: Unable to extract uploader id; please report this issue on  https://github.com/yt-dlp/yt-dlp/issues?q= , filling out the appropriate issue template. Confirm you are on the latest version using  yt-dlp -U

In fact, it's a very recently reported yt-dlp bug (https://github.com/yt-dlp/yt-dlp/issues/6247).

However, even after building the latest version of yt-dlp, I'm unable to download captions with:

yt-dlp --skip-download --sub-lang all -o '_' -- o8NPllzkFhE

It's unclear whether the above command is supposed to still work with the latest commit. I'm subscribed to yt-dlp's GitHub releases in case a new one is published.
I should also check https://github.com/ytdl-patched/yt-dlp/releases in a day.

I may have to restart the algorithm; at the very least, verifying that all treated videos were downloaded correctly seems to make sense.

It may be an error on my end as:

yt-dlp --skip-download --all-subs -o '_' -- o8NPllzkFhE

works fine.

See commit 78b2bf18fa4f56496ed432905e062e85c9cac13e for the patch. However, I won't try to determine, channel by channel, which ones were treated correctly, as even Cyprien, the second most subscribed French YouTube channel, wasn't treated correctly.

I verified that the captions were downloaded correctly with the following script:

verifyDownloadedAllCaptions.py:

#!/usr/bin/python3

# Could proceed in a multi-threaded way.

import zipfile, os, requests

os.chdir('channels/')

CAPTIONS_FOLDER = 'captions'
API_KEY = 'AIzaSy...'

#for file in os.listdir():
for file in ['UCyWqModMQlbIo8274Wh_ZsQ.zip']:
    print(file)
    isFine = True

    z = zipfile.ZipFile(file)
    files = [x for x in z.namelist() if x.startswith(f'{CAPTIONS_FOLDER}/')]
    # One directory per video: `captions/<videoId>`.
    directories = {os.path.dirname(x) for x in files}
    for directory in directories:
        if directory == CAPTIONS_FOLDER:
            continue
        videoId = directory.split('/')[1]
        url = f'https://www.googleapis.com/youtube/v3/captions?part=snippet&videoId={videoId}&key={API_KEY}'
        # Retry until the request succeeds and returns valid JSON.
        while True:
            try:
                content = requests.get(url).json()
            except (requests.RequestException, ValueError):
                continue
            break
        for item in content['items']:
            print(item)
            snippet = item['snippet']
            # Tracks whose `trackKind` is `asr` are expected with an `-orig` suffix.
            filePath = f'{CAPTIONS_FOLDER}/{videoId}/_.{snippet["language"]}{"-orig" if snippet["trackKind"] == "asr" else ""}.vtt'
            if filePath not in files:
                print('Not fine!')
                isFine = False
                break
        if not isFine:
            break

Note that this script can't be relied on for its purpose: for instance, LHu-CJbPyCo doesn't have captions in the YouTube UI, but according to the API (https://yt.lemnoslife.com/noKey/captions?part=snippet&videoId=LHu-CJbPyCo) it has some.
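The file-naming convention that the script checks can be isolated in small pure functions, which makes it testable without calling the API. This is a sketch based only on the script above: `expected_caption_path` and `missing_caption_paths` are hypothetical names, and the `-orig` suffix for `asr` tracks is taken from the script, not from yt-dlp documentation.

```python
CAPTIONS_FOLDER = 'captions'


def expected_caption_path(video_id, snippet):
    """Build the archive path the script above expects for one caption track."""
    suffix = '-orig' if snippet['trackKind'] == 'asr' else ''
    return f'{CAPTIONS_FOLDER}/{video_id}/_.{snippet["language"]}{suffix}.vtt'


def missing_caption_paths(video_id, items, files):
    """Return the expected paths that are absent from the archive listing."""
    expected = [expected_caption_path(video_id, item['snippet']) for item in items]
    return [path for path in expected if path not in files]


# Illustrative API items and archive listing (not real data).
items = [
    {'snippet': {'language': 'en', 'trackKind': 'asr'}},
    {'snippet': {'language': 'fr', 'trackKind': 'standard'}},
]
files = {'captions/aSyZa3b58QA/_.en-orig.vtt'}
print(missing_caption_paths('aSyZa3b58QA', items, files))
```

Returning the full list of missing paths, instead of stopping at the first one, makes it easier to inspect each mismatch by hand.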

Also got:

WARNING: [youtube] aSyZa3b58QA: Some formats are possibly damaged. They will be deprioritized

But the retrieved captions look correct and are written in the correct format, as checked with the following algorithm.

verifyCaptionsNotDamaged.py:

#!/usr/bin/python3

import webvtt

for caption in webvtt.read('../channels/UC0KU8F9jJqSLS11LRXvFWmg/captions/aSyZa3b58QA/_.en-orig.vtt'):
    print(caption.start)
    print(caption.end)
    print(caption.text)
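If the `webvtt` package is not available, a rough sanity check is possible with the standard library alone. The sketch below only looks at cue timing lines of the form `HH:MM:SS.mmm --> HH:MM:SS.mmm` (hours optional) and checks that each cue ends no earlier than it starts. It is not a full WebVTT parser, and `check_cue_timings` is a hypothetical name.

```python
import re

TIMESTAMP = r'(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})'
TIMING_LINE = re.compile(rf'^{TIMESTAMP} --> {TIMESTAMP}')


def to_seconds(hours, minutes, seconds, millis):
    """Convert the captured timestamp fields to seconds."""
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def check_cue_timings(text):
    """Return (number of cues, timing lines whose end precedes their start)."""
    cues, bad = 0, []
    for line in text.splitlines():
        match = TIMING_LINE.match(line)
        if match is None:
            continue
        cues += 1
        groups = match.groups()
        if to_seconds(*groups[4:]) < to_seconds(*groups[:4]):
            bad.append(line)
    return cues, bad


sample = """WEBVTT

00:00:01.000 --> 00:00:03.500 align:start
Hello

00:00:05.000 --> 00:00:04.000
Broken cue
"""
print(check_cue_timings(sample))
```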

Also got:

WARNING: [youtube] Unable to download webpage: Remote end closed connection without response

But the captions download for YabFeyjN47Y was completely successful.

Also got:

ERROR: [youtube] 0Y680gpug9g: This video is only available to Music Premium members

But as 0Y680gpug9g doesn't have comments and we can only know if it has captions by paying, we won't pay to know that.

Also got:

ERROR: [youtube] bxulItfpOuI: Premiere will begin shortly

The premiere bxulItfpOuI should have started a while ago...

Also got:

ERROR: unable to open for writing: [Errno 2] No such file or directory: 'channels/UCI4NJYRqP_zPFWXM8_LygjQ/captions/5UfPGzmbLXs/_.ru-orig.ttml.part'
WARNING: Skipping embedding ru-orig subtitle because the file is missing

But I have deleted the associated retrieved data, so I added the channel to my channels.txt to verify once it has been treated.

I stopped giving explanations for messages where the captions are retrieved successfully anyway.

Benjamin_Loison commented

2023-02-16 12:12:27 +01:00

Concerning channels/, as the process was unstable and crashed at times, using:

find -name '*.zip' -exec unzip -t {} \; | grep -vE 'OK|No errors'

verifies that none of the archives is corrupted.
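A comparable integrity check can be sketched with Python's `zipfile` module, whose `testzip()` method reads every member and returns the name of the first corrupted one, or `None`. `corrupted_archives` is a hypothetical helper name; the sketch is not meant to reproduce the `unzip -t` output exactly.

```python
import zipfile
from pathlib import Path


def corrupted_archives(root):
    """Return (path, reason) for every .zip under `root` that fails an integrity test."""
    problems = []
    for path in sorted(Path(root).rglob('*.zip')):
        try:
            with zipfile.ZipFile(path) as archive:
                # testzip() checks the CRC of every member.
                bad_member = archive.testzip()
        except zipfile.BadZipFile as error:
            # The file is not a readable zip archive at all.
            problems.append((str(path), str(error)))
            continue
        if bad_member is not None:
            problems.append((str(path), f'bad member: {bad_member}'))
    return problems
```

Calling `corrupted_archives('channels/')` would return an empty list when every archive passes.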

Benjamin_Loison commented

2023-02-16 13:16:40 +01:00

To verify that the starting set was treated:

isStartingSetTreated.py:

#!/usr/bin/python3

import os

with open('newChannels.txt') as f:
    lines = f.read().splitlines()
    for line in lines:
        commonPath = f'channels/{line}'
        isDir = os.path.isdir(commonPath)
        isZip = os.path.isfile(f'{commonPath}.zip')
        # Flag the channels that don't have a zip archive.
        toPrint = ''
        if not isDir and not isZip:
            toPrint = 'neither a directory nor a zip'
        elif not isZip:
            toPrint = 'only a directory'
        print(f'{line} {toPrint}')

Benjamin_Loison commented

2023-02-16 13:19:18 +01:00

To verify the correct format of channels.txt, as I ran dos2unix on it while the algorithm was running:

verifyChannels.py:

#!/usr/bin/python3

# Could be only file size based.

with open('channels.txt') as f:
    lines = f.read().splitlines()
    for line in lines:
        if len(line) != 24 or not line.startswith('UC'):
            print(f'{line} doesn\'t look like a channel')

In fact, it seems that dos2unix was writing to a separate temporary file, d2utmpfh4L3d.
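A slightly stricter variant of this check can be sketched with a regular expression. Note the assumption, which goes beyond the script above: after the `UC` prefix, a channel ID is taken to contain only letters, digits, `_` and `-`. `invalid_channel_lines` is a hypothetical name.

```python
import re

# Assumption (not stated in the script above): after `UC`, a channel ID uses
# only letters, digits, `_` and `-`, for a total length of 24 characters.
CHANNEL_ID = re.compile(r'UC[0-9A-Za-z_-]{22}')


def invalid_channel_lines(lines):
    """Return the lines that do not look like a channel ID."""
    return [line for line in lines if CHANNEL_ID.fullmatch(line) is None]


sample = ['UCyWqModMQlbIo8274Wh_ZsQ', 'UC' + '!' * 22, 'not a channel']
print(invalid_channel_lines(sample))
```

Unlike the length-and-prefix test, this also rejects 24-character lines that contain unexpected characters.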

Benjamin_Loison added high priority and removed medium priority labels 2023-02-17 00:45:19 +01:00

Benjamin_Loison referenced this issue

2023-02-17 01:07:29 +01:00

What does the website returns for a video with two captions matching the query? #45

Benjamin_Loison referenced this issue from a commit

2023-02-17 16:57:13 +01:00

#35: Make the not automatically generated captions correctly downloaded

Benjamin_Loison referenced this issue

2023-02-22 17:14:34 +01:00

Once have established my definitive crawling YouTube channels algorithm, should share it #3

Benjamin_Loison commented

2023-02-22 17:34:37 +01:00

Use the:

- [ ] commits
- [ ] issues
- [ ] releases
- [ ] Wiki

to prepare the presentation.

Benjamin_Loison referenced this issue

2023-02-22 20:03:50 +01:00

Improve indexing of website #44

Benjamin_Loison referenced this issue from a commit

2023-02-26 15:07:36 +01:00

#35: Move Python scripts to `scripts/` and describe the project structure in `README.md`

Benjamin_Loison referenced this issue from a commit

2023-02-26 15:12:09 +01:00

#35: Move Python scripts to `scripts/` and describe the project structure in `README.md`

Benjamin_Loison commented

2023-02-26 15:38:41 +01:00

The statistics of the presentation were generated with:

presentationStats.py:

#!/usr/bin/python3

import os, zipfile

with open('newChannels.txt') as f:
    lines = [f'{line}.zip' for line in f.read().splitlines()]

os.chdir('channels/')

# Total size in bytes of the channel archives.
channelsSize = sum(os.path.getsize(line) for line in lines)

print(channelsSize)

videosWithCaptions = 0
requestsToYouTubeDataAPIv3 = 0
requestsToYouTubeOperationalAPI = 0
youtubeDataAPIv3QuotaSpent = 0

# Endpoints for which a cost of 1 quota unit per request is assumed.
KNOWN_ENDPOINT_PREFIXES = ('comment', 'playlistItems', 'videos', 'channels', 'liveChat/messages')

for line in lines:
    z = zipfile.ZipFile(line)
    # Count a video once, whatever its number of `.vtt` caption files.
    videoIds = {x.split('/')[1] for x in z.namelist() if x.startswith('captions/') and x.endswith('.vtt')}
    with z.open('requests/urls.txt') as f:
        urls = f.read().decode('utf-8').splitlines()
        for url in urls:
            isYouTubeDataAPIv3 = url.endswith('true')
            if isYouTubeDataAPIv3:
                requestsToYouTubeDataAPIv3 += 1
                quota = 1
                # Neither https://developers.google.com/youtube/v3/determine_quota_cost nor https://developers.google.com/youtube/v3/live/docs/liveChatMessages/list specifies the quota cost of the LiveChatMessages: list endpoint.
                # Print the requests made to any other endpoint.
                if not url.startswith(KNOWN_ENDPOINT_PREFIXES):
                    print(url)
                youtubeDataAPIv3QuotaSpent += quota
            else:
                requestsToYouTubeOperationalAPI += 1
    videosWithCaptions += len(videoIds)

print(f'{videosWithCaptions=}')
print(f'{requestsToYouTubeDataAPIv3=}')
print(f'{youtubeDataAPIv3QuotaSpent=}')
print(f'{requestsToYouTubeOperationalAPI=}')
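For the presentation, a per-endpoint breakdown may be more telling than a single total. The sketch below assumes that each logged URL starts with the endpoint name followed by `?`, and reuses the script's convention that YouTube Data API v3 requests end with `true`. The sample lines are hypothetical and `requests_per_endpoint` is a name introduced here.

```python
from collections import Counter


def requests_per_endpoint(urls):
    """Count YouTube Data API v3 requests per endpoint (text before the first `?`)."""
    counts = Counter()
    for url in urls:
        # Same convention as the script above: Data API v3 URLs end with `true`.
        if url.endswith('true'):
            counts[url.split('?', 1)[0]] += 1
    return counts


# Hypothetical lines, for illustration only.
sample = [
    'videos?part=snippet&id=VIDEO_ID&true',
    'videos?part=statistics&id=VIDEO_ID&true',
    'playlistItems?part=snippet&playlistId=PLAYLIST_ID&true',
    'channels?part=snippet&false',
]
print(requests_per_endpoint(sample))
```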

Benjamin_Loison closed this issue

2023-03-08 01:02:46 +01:00

Benjamin_Loison referenced this issue from a commit

2023-03-24 23:46:27 +01:00

#35: Make the not automatically generated captions correctly downloaded

Benjamin_Loison referenced this issue from a commit

2023-03-24 23:46:27 +01:00

#35: Move Python scripts to `scripts/` and describe the project structure in `README.md`
