Compare commits

...

120 Commits
3.0.0 ... 3.0.4

Author SHA1 Message Date
2650943f9d v3.0.4 (#160)
* Update README.md

* Update README.md

* update asciinema link

* v3.0.4

* update video link
2022-02-18 16:05:58 +05:30
4b218d35cb Cdx based oldest newest and near (#159)
* implement oldest newest and near methods in the cdx interface class, now cli uses the cdx methods instead of availablity api methods.

* handle the closest parameter derivative methods more efficiently and also handle exceptions gracefully.

* update test code
2022-02-18 13:17:40 +05:30
f990b93f8a Add sort, use_pagination and closest (#158)
* add sort param support in CDX API class

see https://nla.github.io/outbackcdx/api.html#operation/query

sort takes string input which must be one of the follwoing:
- default
- closest
- reverse

This commit shall help in closing issue at https://github.com/akamhy/waybackpy/issues/155

* add BlockedSiteError for cases when archiving is blocked by site's robots.txt

* create check_for_blocked_site for handling the BlockedSiteError for sites that are blocking wayback machine by their robots.txt policy

* add attrs use_pagination and closest, which are can be used to use the pagination API and lookup archive close to a timestamp respectively. And now to get out of infinte blank pages loop just check for two succesive black and not total two blank pages while using the CDX server API.

* added cli support for sort, use-pagination and closest

* added tests

* fix codeql warnings, nothing to worry about here.

* fix save test for archive_url
2022-02-18 00:24:14 +05:30
3a44a710d3 add sort param support in CDX API class (#156)
see https://nla.github.io/outbackcdx/api.html#operation/query

sort takes string input which must be one of the follwoing:
- default
- closest
- reverse

This commit shall help in closing issue at https://github.com/akamhy/waybackpy/issues/155
2022-02-17 12:17:23 +05:30
f63c6adf79 Trigger Build 2022-02-09 17:29:19 +05:30
b4d3393ef1 fix: move metadata from __init__.py into setup.cfg (#153) 2022-02-09 17:20:23 +05:30
cd5c3c61a5 fix imports with isort 2022-02-09 16:18:25 +05:30
87fb5ecd58 remove latest version funcs from utils, they were unused. 2022-02-09 16:12:30 +05:30
5954fcc646 format with black 2022-02-09 15:51:11 +05:30
89016d433c added trove Typing :: Typed and Development Status :: 5 - Production/Stable 2022-02-09 15:47:38 +05:30
edaa1d5d54 update value to the new limit. 2022-02-09 15:40:38 +05:30
16f94db144 incr version to v3.0.3 2022-02-09 14:33:16 +05:30
25eb709ade improve doc strings and comments and remove useless exceptions. 2022-02-09 14:32:15 +05:30
6d233f24fc apply isort 2022-02-09 11:20:59 +05:30
ec341fa8b3 refactor code in cli module 2022-02-09 11:20:10 +05:30
cf18090f90 fix typo 2022-02-09 09:52:20 +05:30
81162eebd0 issues with HN 2022-02-08 21:28:25 +05:30
ca4f79a2e3 + jfinkhaeuser and rafael (#150) 2022-02-08 20:34:34 +05:30
27f2727049 add cli alias for --start-timestamp(--from) and --end-timestamp(--to) to conform with the CDX API docs. 2022-02-08 20:12:19 +05:30
118dc6c523 add test for wrapper module 2022-02-08 20:08:44 +05:30
1216ffbc70 lint and refactor cli module 2022-02-08 20:06:17 +05:30
d58a5f0ee5 explicitly exculde some dirs from flake8 check 2022-02-08 18:59:13 +05:30
7e7412d9d1 remove deepsource, LGTM is better and has fewer False Postives. 2022-02-08 18:49:44 +05:30
70c38c5a60 + codecov badge 2022-02-08 17:49:05 +05:30
f8bf9c16f9 Add tests (#149)
* enable codecov

* fix save_urls_on_file

* increase the limit of CDX to 25000 from 5000. 5X increase.

* added test for the CLI module

* make flake 8 happy

* make mypy happy
2022-02-08 17:46:59 +05:30
2bbfee7b2f replace non-ASCII emojis with GitHub hosted equivalent images (#148) 2022-02-08 11:43:32 +05:30
7317bd7183 Remove blank lines after docstring (#146)
Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
2022-02-08 10:14:20 +05:30
e0dfbe0b7d Fix comparison constant position (#145)
* Fix comparison constant position

* format with black

Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
Co-authored-by: Akash Mahanty <akamhy@yahoo.com>
2022-02-08 10:06:23 +05:30
0b631592ea Improve pylint score (#142)
* fix: errors to improve pylint scores

* fix: test

* fix

* add: flake ignore rule to pip8speaks conf

* fix

* add: test patterns to deepsource conf
2022-02-08 06:42:20 +09:00
d3a8f343f8 + [eggplants](https://github.com/eggplants) (#143) 2022-02-08 01:41:10 +05:30
97f8b96411 added docstrings, added some static type hints and also lint. (#141)
* added docstrings, added some static type hints and also lint.

* added doc strings and changed some internal variable names for more clarity.

* make flake8 happy

* add descriptive docstrings and type hints in waybackpy/cdx_snapshot.py

* remove useless code and add docstrings and also lint using pylint.

* remove unwarented test

* added docstrings, lint using pylint and add a raise on 509 SC

* added docstrings and lint with pylint

* lint

* add doc strings and lint

* add docstrings and lint
2022-02-07 19:40:37 +05:30
004ff26196 Add .deepsource.toml 2022-02-07 12:55:57 +00:00
a772c22431 explicitly tell pep8speaks that mll is 88. 2022-02-06 21:00:15 +05:30
b79f1c471e Merge pull request #135 from eggplants/fix_cli
Fix cli.py
2022-02-05 16:54:36 +05:30
f49d67a411 Merge pull request #136 from eggplants/429_error
Add TooManyRequestsError
2022-02-05 11:28:27 +05:30
ad8bd25633 added badge of codacy (#139) 2022-02-05 10:05:17 +05:30
d2a3946425 fix: escape banner 2022-02-05 10:12:27 +09:00
7b6401d59b fix: delete useless conds 2022-02-05 06:20:03 +09:00
ed6160c54f add: TooManyRequestsError 2022-02-05 06:19:02 +09:00
fcab19a40a fix: cli
print error message to stderr and specify defaults of url
2022-02-05 05:55:04 +09:00
5f3cd28046 Fix Pylint errors were pointed out by codacy (#133)
* fix: pylint errors were pointed out by codacy

* fix: line length

* fix: help text

* fix: revert

https://stackoverflow.com/a/64477857 makes cli unusable

* fix: cli error and refactor codes
2022-02-05 05:25:40 +09:00
9d9cc3328b add .pep8speaks.yml, override deafult 2022-02-05 00:53:38 +05:30
b69e4dff37 rename params of main in cli.py to avoid using built-ins (#132)
* rename params of main in cli.py to avoid using built-ins

* Fix Line 32:80: E501 line too long (102 > 79 characters)
2022-02-05 00:30:35 +05:30
d8cabdfdb5 Typing (#128)
* fix: CI yml name

* add: mypy configuraion

* add: type annotation to waybackpy modules

* add: type annotation to test modules

* fix: mypy command

* add: types-requests to dev deps

* fix: disable max-line-length

* fix: move pytest.ini into setup.cfg

* add: urllib3 to deps

* fix: Retry (ref: https://github.com/python/typeshed/issues/6893)

* fix: f-string

* fix: shorten long lines

* add: staticmethod decorator to no-self-use methods

* fix: str(headers)->headers_str

* fix: error message

* fix: revert "str(headers)->headers_str" and ignore assignment CaseInsensitiveDict with str

* fix: mypy error
2022-02-05 03:23:36 +09:00
320ef30371 fix: format md and yml (#129) 2022-02-04 22:31:46 +05:30
e61447effd Format and lint codes and fix packaging (#125)
* add: configure files (setup.py->setup.py+setup.cfg+pyproject.toml)

* add: __download_url__

* format with black and isort

* fix: flake8 section in setup.cfg

* add: E501 to flake ignore

* fix: metadata.name does not accept attr

* fix: merge __version__.py into __init__.py

* fix: flake8 errors in tests/

* fix: datetime.datetime -> datetime

* fix: banner

* fix: ignore W605 for banner

* fix: way to install deps in CI

* add: versem to setuptools

* fix: drop python<=3.6 (#126) from package and CI
2022-02-03 19:13:39 +05:30
947647f2e7 Merge pull request #124 from eggplants/fix_save_retry
Fix save retry mechanism
2022-02-03 18:01:51 +05:30
bc1dc4dc96 fix: save retry mechanism 2022-02-03 19:45:16 +09:00
5cbdfc040b waybackpy/cli.py : remove duplicate original_string from output_string in cdx 2022-01-30 21:02:25 +05:30
3be6ac01fc created tests/test_cdx_api.py: added tests for cdx_api.py 2022-01-30 20:03:40 +05:30
b8b9bc098f tests/test_utils.py: test latest_version_pypi and latest_version_github of waybackpy.utils 2022-01-30 20:02:17 +05:30
946c28eddf waybackpy/cli.py: Added help text, fix bug in the cdx_print parameter and lots of other stuff
parameter --filters is now --filter

parameter --collapses is now --collapse

added a new --license flag for fetching the license from GitHub repo and printing it.
2022-01-30 20:00:50 +05:30
004027f73b waybackpy/utils.py : Add a new function(latest_version_github) to fetch the latest release from github api and renamed latest_version to latest_version_pypi as now we have two functions to get the latest release. 2022-01-30 13:28:13 +05:30
e86dd93b29 Delete custom.md 2022-01-30 11:45:51 +05:30
988568e8f0 Update issue templates 2022-01-30 11:44:30 +05:30
f4c32a44fd Merge pull request #123 from akamhy/add-code-of-conduct-1
Create CODE_OF_CONDUCT.md
2022-01-30 11:39:22 +05:30
7755e6391c Create CODE_OF_CONDUCT.md 2022-01-30 11:39:11 +05:30
9dbe3b3bf4 In waybackpy/wrapper.py set self.timestamp to None on init.
In older interface(2.x.x) we had timestamp set to none in the constructer, so maybe it should be best to set it to None in the backwards compatiblliy module.)
2022-01-29 22:12:02 +05:30
e84ba9f2c3 Merge pull request #122 from akamhy/update-readme
add conda install and related links and tell users that they can copy…
2022-01-27 00:25:49 +05:30
1250d105b4 update install command for conda and replace the link to conda-forge.org with https://anaconda.org/conda-forge/waybackpy 2022-01-27 00:17:36 +05:30
f03b2cb6cb fix formatting of ASCII art 2022-01-26 18:24:24 +05:30
5e0ea023e6 update CLI help text 2022-01-26 16:23:24 +05:30
8dff6f349e add maintainers 2022-01-26 15:45:03 +05:30
e04cfdfeaf add conda install and related links and tell users that they can copy text from asciinema.org 2022-01-26 15:40:33 +05:30
0e2cc8f5ba + asciicast https://asciinema.org/a/464367
[![asciicast](https://asciinema.org/a/464367.svg)](https://asciinema.org/a/464367)
2022-01-26 14:51:06 +05:30
9007149fef 3.0.1 -- > 3.0.2, for condaforge staged-recipes issues 2022-01-26 01:54:58 +05:30
8b7603e241 the test is faulty as it fails when we increment the version on dunder version file but did not upstreamed the code to PyPi. 2022-01-26 01:51:24 +05:30
5ea1d3ba4f Replace NON-ASCII character figlet with ASCII character figlet. 2022-01-26 01:46:42 +05:30
4408c5e2ca add snapcraft.yaml 2022-01-25 20:54:09 +05:30
9afe29a819 Merge pull request #119 from akamhy/akamhy-patch-1
v3.0.0 --> v3.0.1
2022-01-25 19:54:01 +05:30
d79b10c74c v3.0.0 --> v3.0.1 2022-01-25 19:52:10 +05:30
32314dc102 Merge branch 'build-test' #118
Add build test to CI
 see #117
2022-01-25 14:02:36 +05:30
50e176e2ba .github/workflows/build_test.yml : change python versions from '3.4', '3.8', '3.10' to '3.6', '3.10' as 3.4 not found by GitHub. 2022-01-25 13:56:49 +05:30
4007859c92 Install dependencies for build test in CI : setuptools wheel 2022-01-25 13:35:58 +05:30
d8bd6c628d Add build test to CI 2022-01-25 13:30:16 +05:30
28f6ff8df2 Merge pull request #116 from akamhy/patch-setup-py
Fix syntax for opening the README.md and __version__.py
2022-01-25 13:11:33 +05:30
7ac9353f74 Fix syntax for opening the README.md and __version__.py
For some reason updates made at https://github.com/akamhy/waybackpy/pull/114
are breaking the build using setup, caught while deploying to a cloud service
provider.

The exact error is:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-req-build-n3b9e5pj/setup.py", line 5
  os.path.join(os.path.dirname(__file__), README.md), encoding=utf-8),
                                                                                ^
SyntaxError: invalid syntax
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

See also :
https://github.com/conda-forge/staged-recipes/pull/17634
2022-01-25 13:05:01 +05:30
15c7244a22 Merge pull request #115 from akamhy/akamhy-patch-1
do not use f-strings in setup.py
2022-01-25 10:42:27 +05:30
8510210e94 do not use f-strings in setup.py
These are not supported in <Python 3.6 version of the cpython.
2022-01-25 10:34:46 +05:30
552967487e Merge pull request #114 from rafaelrdealmeida/patch-1
Update setup.py

See also <https://github.com/akamhy/waybackpy/issues/111#issuecomment-1020673814>
2022-01-25 10:30:34 +05:30
86a90a3840 Update setup.py
pep8
2022-01-24 22:03:28 -03:00
759874cdc6 Update setup.py
see: https://github.com/akamhy/waybackpy/issues/111#issuecomment-1020673814
2022-01-24 21:23:31 -03:00
06095202fe BUG FIX : forgot to use the endpoint from the instance and also assign payload to param. Bug caught by the flake8 in the CI tests. 2022-01-24 23:35:48 +05:30
06fc7855bf waybackpy/cdx_api.py : deafult user agent is now DEFAULT_USER_AGENT, get_response now take url and headers as arguments and request url is generated by full_url function. max_tries added as parameter for the WaybackMachineCDXServerAPI class with default value of 3. 2022-01-24 23:20:49 +05:30
c49fe971fd update the older deprecation not for Url class, the newer date is now 2025 instead of 2024. 2022-01-24 23:15:59 +05:30
d6783d5525 added tests for cdx_utils.py 2022-01-24 23:05:47 +05:30
9262f5da21 improve functions get_total_pages, get_response and lint check_filters, check_collapses and check_match_type
get_total_pages : default user agent is now DEFAULT_USER_AGENT
                  and now instead of str formatting passing payload
                  as param to full_url to generate the request url
                  also get_response make the request instead of directly
                  using requests.get()

get_response : get_response is now not taking param as keyword arguments
               instead the invoker is supposed to pass the full url which
               may be generated by the full_url function therefore the return_full_url=False,
               is deprecated also.
               Also now closing the session via session.close()
               No need to check 'Exceeded 30 redirects' as save API uses a
               diffrent method.

check_filters : Not assigning to variables the return of match groups
                beacause we wont be using them and the linter picks these
                unused assignments.

check_collapses : Same reason as for check_filters but also removed a foolish
                  test that checks equality with objects that are guaranteed
                  to be same.

check_match_type : Updated the text that of WaybackError
2022-01-24 22:57:20 +05:30
d1a1cf2546 added tests for utils.py at tests/test_utils.py also changed a keyword argument from headers to user_agent for latest_version of utils.py with the usual default vaule. 2022-01-24 17:50:36 +05:30
cd8a32ed1f added tests for cdx_snapshot.py at tests/test_cdx_snapshot.py 2022-01-24 16:29:44 +05:30
57512c65ff change test oldest method from google.com to example.com, the oldest on google is for some unknown reason is not very stable. 2022-01-24 16:27:35 +05:30
d9ea26e11c added code style black badge 2022-01-24 13:46:31 +05:30
2bea92b348 fix bug with the third matching case of the archive_url_parser, caught while writing more tests fo the save API interface. 2022-01-24 13:31:30 +05:30
d506685f68 added some tests for save_api interface 2022-01-23 18:35:54 +05:30
7844d15d99 close the session in save api interface 2022-01-23 18:34:06 +05:30
c0252edff2 updated tests for availability_api.py and also added max_tries(default value is 3) with delay (sleep) between successive API calls. The dealy actually improves the performace of the availability_api interface. 2022-01-23 15:05:10 +05:30
e7488f3a3e added test badge, rename test to Tests from ubuntu and fix the Incomplete URL substring sanitization(or trying to) 2022-01-23 02:26:53 +05:30
aed75ad1db Make modules imprtable as part of a Python package, waybackpy by creating __init__.py file in tests 2022-01-23 02:14:38 +05:30
d740959c34 more dev reqs 2022-01-23 02:10:12 +05:30
2d83043ef7 + flake8 in requirements-dev.txt 2022-01-23 02:05:08 +05:30
31b1056217 fix typo in CI 2022-01-23 02:03:30 +05:30
97712b2c1e add CI unit_test.yml 2022-01-23 02:00:15 +05:30
a8acc4c4d8 Fix Incomplete URL substring sanitization in the last commit. 2022-01-23 01:42:48 +05:30
1bacd73002 created pytest.ini, added test for waybackpy/availability_api.py, new exceptions all of which inherit from the main WaybackError and created requirements-dev.txt 2022-01-23 01:29:07 +05:30
79901ba968 updated README.md 2022-01-22 03:08:26 +05:30
df64e839d7 added trove classifiers for python 3.10 2022-01-22 00:57:10 +05:30
405e9a2a79 waybackpy/save_api.py : Added doc strings and also lint with black. 2022-01-22 00:41:10 +05:30
db551abbf6 lint waybackpy/cdx_api.py and added some doc strings 2022-01-22 00:11:35 +05:30
d13dd4db1a added notice on waybackpy/wrapper.py that the Url class will cease to exist after 2024-01-01 and also removed unused imports. 2022-01-21 23:14:20 +05:30
d3bb8337a1 make setup.py smarter, now no need to update the URL again and also added more keywords. And in __version__.py updated the __author__ 2022-01-21 23:01:09 +05:30
fd5e85420c waybackpy/availability_api.py : removed unused imports, added doc strings, removed redundant function. 2022-01-21 22:47:44 +05:30
5c685ef5d7 upload logo and make p path not text
I was dumb to forget to convert the p to path.
2022-01-21 21:11:42 +05:30
6a3d96b453 Logo (#113)
* Create logo.txt

* Delete waybackpy_logo.svg

* Add files via upload

* Delete logo.txt
2022-01-21 21:02:38 +05:30
afe1b15a5f Add files via upload 2022-01-21 20:58:53 +05:30
4fd9d142e7 Merge pull request #112 from akamhy/fix
escape '.' before 'archive.org'
2022-01-21 19:52:55 +05:30
5e9fdb40ce escape '.' before 'archive.org'
escape '.' before 'archive.org' on line 88 so it does not match more hosts than expected.
2022-01-21 19:51:08 +05:30
fa72098270 _get_response is not used anymore
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.
2022-01-21 19:43:35 +05:30
d18f955044 date year range 2020-2022 2022-01-21 11:55:42 +05:30
9c340d6967 Create codeql-analysis.yml 2022-01-21 11:12:59 +05:30
78d0e0c126 Update README.md 2022-01-21 09:54:04 +05:30
564101e6f5 🐳 for docker image 2022-01-21 01:23:05 +05:30
38 changed files with 2763 additions and 695 deletions

34
.github/ISSUE_TEMPLATE/bug_report.md vendored Normal file
View File

@ -0,0 +1,34 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: bug
assignees: akamhy
---
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error
**Expected behavior**
A clear and concise description of what you expected to happen.
**Screenshots**
If applicable, add screenshots to help explain your problem.
**Version:**
- OS: [e.g. iOS]
- Version [e.g. 22]
- Is latest version? [e.g. Yes/No]
**Additional context**
Add any other context about the problem here.

View File

@ -0,0 +1,19 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: enhancement
assignees: akamhy
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.
**Additional context**
Add any other context or screenshots about the feature request here.

30
.github/workflows/build-test.yml vendored Normal file
View File

@ -0,0 +1,30 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.7', '3.10']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -U setuptools wheel
- name: Build test the package
run: |
python setup.py sdist bdist_wheel

70
.github/workflows/codeql-analysis.yml vendored Normal file
View File

@ -0,0 +1,70 @@
# For most projects, this workflow file will not need changing; you simply need
# to commit it to your repository.
#
# You may wish to alter this file to override the set of languages analyzed,
# or to provide custom queries or build logic.
#
# ******** NOTE ********
# We have attempted to detect the languages in your repository. Please check
# the `language` matrix defined below to confirm you have the correct set of
# supported CodeQL languages.
#
name: "CodeQL"
on:
push:
branches: [ master ]
pull_request:
# The branches below must be a subset of the branches above
branches: [ master ]
schedule:
- cron: '30 6 * * 1'
jobs:
analyze:
name: Analyze
runs-on: ubuntu-latest
permissions:
actions: read
contents: read
security-events: write
strategy:
fail-fast: false
matrix:
language: [ 'python' ]
# CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ]
# Learn more about CodeQL language support at https://git.io/codeql-language-support
steps:
- name: Checkout repository
uses: actions/checkout@v2
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@v1
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
# By default, queries listed here will override any specified in a config file.
# Prefix the list here with "+" to use these queries and those in the config file.
# queries: ./path/to/local/query, your-org/your-repo/queries@main
# Autobuild attempts to build any compiled languages (C/C++, C#, or Java).
# If this step fails, then you should remove it and run the build manually (see below)
- name: Autobuild
uses: github/codeql-action/autobuild@v1
# Command-line programs to run using the OS shell.
# 📚 https://git.io/JvXDl
# ✏️ If the Autobuild fails above, remove it and uncomment the following three lines
# and modify them (or add more) to build your code if your project
# uses a compiled language
#- run: |
# make bootstrap
# make release
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v1

43
.github/workflows/unit-test.yml vendored Normal file
View File

@ -0,0 +1,43 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Tests
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.10']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install '.[dev]'
- name: Lint with flake8
run: |
flake8 . --count --show-source --statistics
- name: Lint with black
run: |
black . --check --diff
- name: Static type test with mypy
run: |
mypy -p waybackpy -p tests
- name: Test with pytest
run: |
pytest
- name: Upload coverage to Codecov
run: |
bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

7
.pep8speaks.yml Normal file
View File

@ -0,0 +1,7 @@
scanner:
diff_only: True
linter: flake8
flake8:
max-line-length: 88
extend-ignore: W503,W605

View File

@ -9,4 +9,4 @@
"issueSettings": {
"minSeverityLevel": "LOW"
}
}
}

128
CODE_OF_CONDUCT.md Normal file
View File

@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
akamhy@yahoo.com.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
<https://www.contributor-covenant.org/version/2/0/code_of_conduct.html>.
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see the FAQ at
<https://www.contributor-covenant.org/faq>. Translations are available at
<https://www.contributor-covenant.org/translations>.

View File

@ -1,10 +1,16 @@
# CONTRIBUTORS
## AUTHORS
- akamhy (<https://github.com/akamhy>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- jonasjancarik (<https://github.com/jonasjancarik>)
- akamhy (<https://github.com/akamhy>)
- eggplants (<https://github.com/eggplants>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- rafaelrdealmeida (<https://github.com/rafaelrdealmeida>)
- jonasjancarik (<https://github.com/jonasjancarik>)
- jfinkhaeuser (<https://github.com/jfinkhaeuser>)
## ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

View File

@ -1,6 +1,6 @@
MIT License
Copyright (c) 2020 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Copyright (c) 2020-2022 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

185
README.md
View File

@ -1,66 +1,80 @@
<!-- markdownlint-disable MD033 MD041 -->
<div align="center">
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<h3>Python package & CLI tool that interfaces with the Wayback Machine API</h3>
<h3>A Python package & CLI tool that interfaces with the Wayback Machine API</h3>
</div>
<p align="center">
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ATests"><img alt="Unit Tests" src="https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg"></a>
<a href="https://codecov.io/gh/akamhy/waybackpy"><img alt="codecov" src="https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg"></a>
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md"><img alt="Contributions Welcome" src="https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://app.codacy.com/gh/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade_Settings"><img alt="Codacy Badge" src="https://api.codacy.com/project/badge/Grade/6d777d8509f642ac89a20715bb3a6193"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square"></a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
</p>
-----------------------------------------------------------------------------------------------------------------------------------------------
---
## ⭐️ Introduction
Waybackpy is a [Python package](https://www.udacity.com/blog/2021/01/what-is-a-python-package.html) and a CLI tool that interfaces with the Wayback Machine API.
# <img src="https://github.githubassets.com/images/icons/emoji/unicode/2b50.png" width="30"></img> Introduction
Wayback Machine has 3 client side APIs.
Waybackpy is a [Python package](https://www.udacity.com/blog/2021/01/what-is-a-python-package.html) and a [CLI](https://www.w3schools.com/whatis/whatis_cli.asp) tool that interfaces with the [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine) API.
- Save API
- Availability API
- CDX API
Wayback Machine has 3 client side [API](https://www.redhat.com/en/topics/api/what-are-application-programming-interfaces)s.
All three of these can be accessed by waybackpy.
- [Save API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#save-api)
- [Availability API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#availability-api)
- [CDX API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#cdx-api)
These three APIs can be accessed via the waybackpy either by importing it in a script or from the CLI.
### 🏗 Installation
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f3d7.png" width="20"></img> Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
**Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)), from [PyPI](https://pypi.org/) (recommended)**:
```bash
pip install waybackpy
```
Install directly from GitHub:
**Using [conda](https://en.wikipedia.org/wiki/Conda_(package_manager)), from [conda-forge](https://anaconda.org/conda-forge/waybackpy) (recommended)**:
See also [waybackpy feedstock](https://github.com/conda-forge/waybackpy-feedstock), maintainers are [@rafaelrdealmeida](https://github.com/rafaelrdealmeida/),
[@labriunesp](https://github.com/labriunesp/)
and [@akamhy](https://github.com/akamhy/).
```bash
conda install -c conda-forge waybackpy
```
**Install directly from [this git repository](https://github.com/akamhy/waybackpy) (NOT recommended)**:
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
### Docker Image
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f433.png" width="20"></img> Docker Image
Docker Hub : <https://hub.docker.com/r/secsi/waybackpy>
Docker image is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
[Docker image](https://searchitoperations.techtarget.com/definition/Docker-image) is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
RAUDI is a tool by SecSI (<https://secsi.io>), an Italian cybersecurity startup.
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f680.png" width="20"></img> Usage
### As a Python package
### Usage
#### Save API aka SavePageNow
#### As a Python package
##### Save API aka SavePageNow
```python
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
@ -70,26 +84,68 @@ False
datetime.datetime(2022, 1, 18, 12, 52, 49)
```
##### Availability API
```python
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>>
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>>
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
>>>
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
```
#### CDX API aka CDXServerAPI
##### CDX API aka CDXServerAPI
```python
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
```
##### oldest
```python
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
```
##### newest
```python
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newest.archive_url
'https://web.archive.org/web/20220217234427/http://@google.com/'
>>> newest.timestamp
'20220217234427'
```
##### near
```python
>>> near = cdx_api.near(year=2010, month=10, day=10, hour=10, minute=10)
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.timestamp
'20101010101435'
>>> near.timestamp
'20101010101435'
>>> near = cdx_api.near(wayback_machine_timestamp=2008080808)
>>> near.archive_url
'https://web.archive.org/web/20080808051143/http://google.com/'
>>> near = cdx_api.near(unix_timestamp=1286705410)
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>>
```
##### snapshots
```python
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
@ -97,7 +153,7 @@ https://web.archive.org/web/20101010101708/http://www.google.com/
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
... print(item.archive_url)
...
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
@ -107,23 +163,48 @@ https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
```
> Documentation at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
#### Availability API
It is recommended to not use the availability API due to performance issues. All the methods of availability API interface class, `WaybackMachineAvailabilityAPI`, are also implemented in the CDX server API interface class, `WaybackMachineCDXServerAPI`.
#### As a CLI tool
```bash
$ waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
$ waybackpy --oldest --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid
$ waybackpy --newest --url "https://en.wikipedia.org/wiki/Remote_sensing" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20201221130522/https://en.wikipedia.org/wiki/Remote_sensing
```python
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
```
##### oldest
```python
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
```
##### newest
```python
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
```
##### near
```python
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
```
> Documentation is at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
### As a CLI tool
Demo video on [asciinema.org](https://asciinema.org/a/469890), you can copy the text from video:
[![asciicast](https://asciinema.org/a/469890.svg)](https://asciinema.org/a/469890)
> CLI documentation is at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
### 🛡 License
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f6e1.png" width="20"></img> License
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
Copyright (c) 2020-2022 Akash Mahanty Et al.
Released under the MIT License. See [license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.

View File

@ -1 +1,14 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 176.612 41.908" height="158.392" width="667.51" xmlns:v="https://github.com/akamhy/waybackpy"><text transform="matrix(.862888 0 0 1.158899 -.748 -98.312)" y="110.937" x="0.931" xml:space="preserve" font-weight="bold" font-size="28.149" font-family="sans-serif" letter-spacing="0" word-spacing="0" writing-mode="lr-tb" fill="#003dff"><tspan y="110.937" x="0.931"><tspan y="110.937" x="0.931" letter-spacing="3.568" writing-mode="lr-tb">waybackpy</tspan></tspan></text><path d="M.749 0h153.787v4.864H.749zm22.076 37.418h153.787v4.49H22.825z" fill="navy"/><path d="M0 37.418h22.825v4.49H0zM154.536 0h21.702v4.864h-21.702z" fill="#f0f"/></svg>
<?xml version="1.0" encoding="utf-8"?>
<svg width="711.80188pt" height="258.30469pt" viewBox="0 0 711.80188 258.30469" version="1.1" id="svg2" xmlns="http://www.w3.org/2000/svg">
<g id="surface1" transform="translate(-40.045801,-148)">
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 224.09 309.814 L 224.09 197.997 L 204.768 197.994 L 204.768 312.635 C 204.768 312.635 205.098 312.9 204.105 313.698 C 203.113 314.497 202.408 313.849 202.408 313.849 L 200.518 313.849 L 200.518 197.991 L 181.139 197.991 L 181.139 313.849 L 179.253 313.849 C 179.253 313.849 178.544 314.497 177.551 313.698 C 176.558 312.9 176.888 312.635 176.888 312.635 L 176.888 197.994 L 157.57 197.997 L 157.57 309.814 C 157.57 309.814 156.539 316.772 162.615 321.658 C 168.691 326.546 177.551 326.049 177.551 326.049 L 204.11 326.049 C 204.11 326.049 212.965 326.546 219.041 321.658 C 225.118 316.772 224.09 309.814 224.09 309.814" id="path5"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 253.892 299.821 C 253.892 299.821 253.632 300.965 251.888 300.965 C 250.143 300.965 249.629 299.821 249.629 299.821 L 249.629 278.477 C 249.629 278.477 249.433 278.166 250.078 277.645 C 250.726 277.124 251.243 277.179 251.243 277.179 L 253.892 277.228 Z M 251.588 199.144 C 230.266 199.144 231.071 213.218 231.071 213.218 L 231.071 254.303 L 249.675 254.303 L 249.675 213.69 C 249.675 213.69 249.775 211.276 251.787 211.276 C 253.8 211.276 254 213.542 254 213.542 L 254 265.146 L 246.156 265.146 C 246.156 265.146 240.022 264.579 235.495 268.22 C 230.968 271.858 231.071 276.791 231.071 276.791 L 231.071 298.955 C 231.071 298.955 229.461 308.016 238.914 312.058 C 248.368 316.103 254.805 309.795 254.805 309.795 L 254.805 312.706 L 272.508 312.706 L 272.508 212.895 C 272.508 212.895 272.907 199.144 251.588 199.144" id="path7"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 404.682 318.261 C 404.682 318.261 404.398 319.494 402.485 319.494 C 400.568 319.494 400.001 318.261 400.001 318.261 L 400.001 295.216 C 400.001 295.216 399.786 294.879 400.496 294.315 C 401.208 293.757 401.776 293.812 401.776 293.812 L 404.682 293.868 Z M 402.152 209.568 C 378.728 209.568 379.61 224.761 379.61 224.761 L 379.61 269.117 L 400.051 269.117 L 400.051 225.273 C 400.051 225.273 400.162 222.665 402.374 222.665 C 404.582 222.665 404.805 225.109 404.805 225.109 L 404.805 280.82 L 396.187 280.82 C 396.187 280.82 389.447 280.213 384.475 284.141 C 379.499 288.072 379.61 293.396 379.61 293.396 L 379.61 317.324 C 379.61 317.324 377.843 327.104 388.232 331.469 C 398.616 335.838 405.69 329.027 405.69 329.027 L 405.69 332.169 L 425.133 332.169 L 425.133 224.413 C 425.133 224.413 425.578 209.568 402.152 209.568" id="path9"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 321.114 328.636 L 321.114 206.587 L 302.582 206.587 L 302.582 304.902 C 302.582 304.902 303.211 307.094 300.624 307.094 C 298.035 307.094 298.316 304.902 298.316 304.902 L 298.316 206.587 L 279.784 206.587 C 279.784 206.587 279.922 304.338 279.922 306.756 C 279.922 309.175 280.27 310.526 280.831 312.379 C 281.391 314.238 282.579 318.116 290.901 319.186 C 299.224 320.256 302.44 315.813 302.44 315.813 L 302.44 327.736 C 302.44 327.736 302.862 329.366 300.554 329.366 C 298.246 329.366 298.316 327.849 298.316 327.849 L 298.316 322.957 L 279.642 322.957 L 279.642 327.791 C 279.642 327.791 278.523 341.514 300.274 341.514 C 322.026 341.514 321.114 328.636 321.114 328.636" id="path11"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 352.449 209.811 L 352.449 273.495 C 352.449 277.49 347.911 277.194 347.911 277.194 L 347.911 207.592 C 347.911 207.592 346.929 207.542 349.567 207.542 C 352.817 207.542 352.449 209.811 352.449 209.811 M 352.326 310.393 C 352.326 310.393 352.143 312.366 350.425 312.366 L 348.033 312.366 L 348.033 289.478 L 349.628 289.478 C 349.628 289.478 352.326 289.428 352.326 292.092 Z M 371.341 287.505 C 371.341 284.791 370.727 282.966 368.826 280.993 C 366.925 279.02 363.367 277.441 363.367 277.441 C 363.367 277.441 365.514 276.948 368.704 274.728 C 371.893 272.509 371.525 267.921 371.525 267.921 L 371.525 212.919 C 371.525 212.919 371.801 204.509 366.925 200.587 C 362.049 196.665 352.515 196.363 352.515 196.363 L 328.711 196.363 L 328.711 324.107 L 350.609 324.107 C 360.055 324.107 364.594 322.232 368.336 318.286 C 372.077 314.34 371.341 308.321 371.341 308.321 Z M 371.341 287.505" id="path13"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 452.747 226.744 L 452.747 268.806 L 471.581 268.806 L 471.581 227.459 C 471.581 227.459 471.846 213.532 450.516 213.532 C 429.182 213.532 430.076 227.533 430.076 227.533 L 430.076 313.381 C 430.076 313.381 428.825 327.523 450.872 327.523 C 472.919 327.523 471.401 313.526 471.401 313.526 L 471.401 292.064 L 452.835 292.064 L 452.835 314.389 C 452.835 314.389 452.923 315.61 450.961 315.61 C 448.997 315.61 448.729 314.389 448.729 314.389 L 448.729 226.524 C 448.729 226.524 448.821 225.378 450.692 225.378 C 452.566 225.378 452.747 226.744 452.747 226.744" id="path15"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 520.624 281.841 C 517.672 278.98 514.317 277.904 514.317 277.904 C 514.317 277.904 517.538 277.796 520.489 274.775 C 523.442 271.753 523.173 267.924 523.173 267.924 L 523.173 208.211 L 503.185 208.211 L 503.185 276.014 C 503.185 276.014 503.185 277.361 501.172 277.361 L 498.761 277.309 L 498.761 191.655 L 478.973 191.655 L 478.973 327.905 L 498.692 327.905 L 498.692 290.039 L 501.709 290.039 C 501.709 290.039 502.112 290.039 502.648 290.523 C 503.185 291.01 503.185 291.602 503.185 291.602 L 503.185 327.905 L 523.307 327.905 L 523.307 288.636 C 523.307 288.636 523.576 284.699 520.624 281.841" id="path17"/>
<path style="fill-opacity: 1; fill-rule: nonzero; stroke: none; fill: rgb(255, 222, 87);" d="M 638.021 327.182 L 638.021 205.132 L 619.489 205.132 L 619.489 303.448 C 619.489 303.448 620.119 305.64 617.53 305.64 C 614.944 305.64 615.223 303.448 615.223 303.448 L 615.223 205.132 L 596.692 205.132 C 596.692 205.132 596.83 302.884 596.83 305.301 C 596.83 307.721 597.178 309.071 597.738 310.924 C 598.299 312.784 599.487 316.662 607.809 317.732 C 616.132 318.802 619.349 314.359 619.349 314.359 L 619.349 326.281 C 619.349 326.281 619.77 327.913 617.462 327.913 C 615.154 327.913 615.223 326.396 615.223 326.396 L 615.223 321.502 L 596.55 321.502 L 596.55 326.336 C 596.55 326.336 595.43 340.059 617.182 340.059 C 638.934 340.059 638.021 327.182 638.021 327.182" id="path-1"/>
<path d="M 592.159 233.846 C 593.222 238.576 593.75 243.873 593.745 249.735 C 593.74 255.598 593.135 261.281 591.931 266.782 C 590.726 272.285 588.901 277.144 586.453 281.361 C 584.006 285.578 580.938 288.946 577.248 291.466 C 573.559 293.985 569.226 295.246 564.25 295.246 C 561.585 295.246 559.008 294.936 556.521 294.32 C 554.033 293.703 551.813 292.854 549.859 291.774 C 547.905 290.694 546.284 289.512 544.997 288.226 C 543.71 286.94 542.934 285.578 542.668 284.138 L 542.629 328.722 L 526.369 328.722 L 526.475 207.466 L 541.003 207.466 L 542.728 216.259 C 544.507 213.38 547.197 211.065 550.797 209.317 C 554.397 207.568 558.374 206.694 562.728 206.694 C 565.66 206.694 568.637 207.157 571.657 208.083 C 574.677 209.008 577.497 210.551 580.116 212.711 C 582.735 214.871 585.11 217.698 587.239 221.196 C 589.369 224.692 591.009 228.909 592.159 233.846 Z M 558.932 280.744 C 561.597 280.744 564.019 279.972 566.197 278.429 C 568.376 276.887 570.243 274.804 571.801 272.182 C 573.358 269.559 574.582 266.423 575.474 262.772 C 576.366 259.121 576.814 255.238 576.817 251.124 C 576.821 247.113 576.424 243.307 575.628 239.708 C 574.831 236.108 573.701 232.92 572.237 230.143 C 570.774 227.366 568.999 225.155 566.912 223.51 C 564.825 221.864 562.405 221.041 559.65 221.041 C 556.985 221.041 554.54 221.813 552.318 223.356 C 550.095 224.898 548.183 226.981 546.581 229.603 C 544.98 232.226 543.755 235.311 542.908 238.86 C 542.061 242.408 541.635 246.239 541.632 250.353 C 541.628 254.466 542.002 258.349 542.754 262 C 543.506 265.651 544.637 268.865 546.145 271.642 C 547.653 274.419 549.472 276.63 551.603 278.276 C 553.734 279.922 556.177 280.744 558.932 280.744 Z" style="fill: rgb(69, 132, 182); white-space: pre;"/>
</g>
</svg>

Before

Width:  |  Height:  |  Size: 694 B

After

Width:  |  Height:  |  Size: 8.3 KiB

3
pyproject.toml Normal file
View File

@ -0,0 +1,3 @@
[build-system]
requires = ["wheel", "setuptools"]
build-backend = "setuptools.build_meta"

10
requirements-dev.txt Normal file
View File

@ -0,0 +1,10 @@
black
click
codecov
flake8
mypy
pytest
pytest-cov
requests
setuptools>=46.4.0
types-requests

View File

@ -1,2 +1,3 @@
click
requests
urllib3

View File

@ -1,7 +1,97 @@
[metadata]
description-file = README.md
license_file = LICENSE
name = waybackpy
version = attr: waybackpy.__version__
description = Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily.
long_description = file: README.md
long_description_content_type = text/markdown
license = MIT
author = Akash Mahanty
author_email = akamhy@yahoo.com
url = https://akamhy.github.io/waybackpy/
download_url = https://github.com/akamhy/waybackpy/releases
project_urls =
Documentation = https://github.com/akamhy/waybackpy/wiki
Source = https://github.com/akamhy/waybackpy
Tracker = https://github.com/akamhy/waybackpy/issues
keywords =
Archive Website
Wayback Machine
Internet Archive
Wayback Machine CLI
Wayback Machine Python
Internet Archiving
Availability API
CDX API
savepagenow
classifiers =
Development Status :: 5 - Production/Stable
Intended Audience :: Developers
Intended Audience :: End Users/Desktop
Natural Language :: English
Typing :: Typed
License :: OSI Approved :: MIT License
Programming Language :: Python
Programming Language :: Python :: 3
Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Programming Language :: Python :: 3.9
Programming Language :: Python :: 3.10
Programming Language :: Python :: Implementation :: CPython
[options]
packages = find:
python_requires = >= 3.7
install_requires =
click
requests
urllib3
[options.extras_require]
dev =
black
codecov
flake8
mypy
pytest
pytest-cov
setuptools>=46.4.0
types-requests
[options.entry_points]
console_scripts =
waybackpy = waybackpy.cli:main
[isort]
profile = black
[flake8]
indent-size = 4
max-line-length = 88
extend-ignore = E203,W503
extend-ignore = W503,W605
exclude =
venv
__pycache__
.venv
./env
venv/
env
.env
./build
[mypy]
python_version = 3.9
show_error_codes = True
pretty = True
strict = True
[tool:pytest]
addopts =
# show summary of all tests that did not pass
-ra
# enable all warnings
-Wd
# coverage and html report
--cov=waybackpy
--cov-report=html
testpaths =
tests

View File

@ -1,51 +1,3 @@
import os.path
from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), "README.md")) as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")) as f:
exec(f.read(), about)
setup(
name=about["__title__"],
packages=["waybackpy"],
version=about["__version__"],
description=about["__description__"],
long_description=long_description,
long_description_content_type="text/markdown",
license=about["__license__"],
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/3.0.0.tar.gz",
keywords=[
"Archive Website",
"Wayback Machine",
"Internet Archive",
],
install_requires=["requests", "click"],
python_requires=">=3.4",
classifiers=[
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Natural Language :: English",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: Implementation :: CPython",
],
entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},
project_urls={
"Documentation": "https://github.com/akamhy/waybackpy/wiki",
"Source": "https://github.com/akamhy/waybackpy",
"Tracker": "https://github.com/akamhy/waybackpy/issues",
},
)
setup()

23
snapcraft.yaml Normal file
View File

@ -0,0 +1,23 @@
name: waybackpy
summary: Wayback Machine API interface and a command-line tool
description: |
Waybackpy is a CLI tool that interfaces with the Wayback Machine APIs.
Wayback Machine has three client side public APIs, Save API,
Availability API and CDX API. These three APIs can be accessed via
the waybackpy from the terminal.
version: git
grade: stable
confinement: strict
base: core20
architectures:
- build-on: [arm64, armhf, amd64]
apps:
waybackpy:
command: bin/waybackpy
plugs: [home, network, network-bind, removable-media]
parts:
waybackpy:
plugin: python
source: https://github.com/akamhy/waybackpy.git

0
tests/__init__.py Normal file
View File

View File

@ -0,0 +1,113 @@
import random
import string
from datetime import datetime, timedelta
import pytest
from waybackpy.availability_api import WaybackMachineAvailabilityAPI
from waybackpy.exceptions import (
ArchiveNotInAvailabilityAPIResponse,
InvalidJSONInAvailabilityAPIResponse,
)
now = datetime.utcnow()
url = "https://example.com/"
user_agent = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
)
def rndstr(n: int) -> str:
return "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_oldest() -> None:
"""
Test the oldest archive of Google.com and also checks the attributes.
"""
url = "https://example.com/"
user_agent = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
)
availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
oldest = availability_api.oldest()
oldest_archive_url = oldest.archive_url
assert "2002" in oldest_archive_url
oldest_timestamp = oldest.timestamp()
assert abs(oldest_timestamp - now) > timedelta(days=7000) # More than 19 years
assert (
availability_api.json is not None
and availability_api.json["archived_snapshots"]["closest"]["available"] is True
)
assert repr(oldest).find("example.com") != -1
assert "2002" in str(oldest)
def test_newest() -> None:
"""
Assuming that the recent most Google Archive was made no more earlier than
last one day which is 86400 seconds.
"""
url = "https://www.youtube.com/"
user_agent = "Mozilla/5.0 (X11; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0"
availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
newest = availability_api.newest()
newest_timestamp = newest.timestamp()
# betting in favor that latest youtube archive was not before the last 3 days
# high tarffic sites like youtube are archived mnay times a day, so seems
# very reasonable to me.
assert abs(newest_timestamp - now) < timedelta(seconds=86400 * 3)
def test_invalid_json() -> None:
"""
When the API is malfunctioning or we don't pass a URL,
it may return invalid JSON data.
"""
with pytest.raises(InvalidJSONInAvailabilityAPIResponse):
availability_api = WaybackMachineAvailabilityAPI(url="", user_agent=user_agent)
_ = availability_api.archive_url
def test_no_archive() -> None:
"""
ArchiveNotInAvailabilityAPIResponse may be raised if Wayback Machine did not
replied with the archive despite the fact that we know the site has million
of archives. Don't know the reason for this wierd behavior.
And also if really there are no archives for the passed URL this exception
is raised.
"""
with pytest.raises(ArchiveNotInAvailabilityAPIResponse):
availability_api = WaybackMachineAvailabilityAPI(
url=f"https://{rndstr(30)}.cn", user_agent=user_agent
)
_ = availability_api.archive_url
def test_no_api_call_str_repr() -> None:
"""
Some entitled users maybe want to see what is the string representation
if they dont make any API requests.
str() must not return None so we return ""
"""
availability_api = WaybackMachineAvailabilityAPI(
url=f"https://{rndstr(30)}.gov", user_agent=user_agent
)
assert str(availability_api) == ""
def test_no_call_timestamp() -> None:
"""
If no API requests were made the bound timestamp() method returns
the datetime.max as a default value.
"""
availability_api = WaybackMachineAvailabilityAPI(
url=f"https://{rndstr(30)}.in", user_agent=user_agent
)
assert datetime.max == availability_api.timestamp()

178
tests/test_cdx_api.py Normal file
View File

@ -0,0 +1,178 @@
import random
import string
import pytest
from waybackpy.cdx_api import WaybackMachineCDXServerAPI
from waybackpy.exceptions import NoCDXRecordFound
def rndstr(n: int) -> str:
return "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_a() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
url = "https://twitter.com/jack"
wayback = WaybackMachineCDXServerAPI(
url=url,
user_agent=user_agent,
match_type="prefix",
collapses=["urlkey"],
start_timestamp="201001",
end_timestamp="201002",
)
# timeframe bound prefix matching enabled along with active urlkey based collapsing
snapshots = wayback.snapshots() # <class 'generator'>
for snapshot in snapshots:
assert snapshot.timestamp.startswith("2010")
def test_b() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
url = "https://www.google.com"
wayback = WaybackMachineCDXServerAPI(
url=url,
user_agent=user_agent,
start_timestamp="202101",
end_timestamp="202112",
collapses=["urlkey"],
)
# timeframe bound prefix matching enabled along with active urlkey based collapsing
snapshots = wayback.snapshots() # <class 'generator'>
for snapshot in snapshots:
assert snapshot.timestamp.startswith("2021")
def test_c() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
url = "https://www.google.com"
cdx = WaybackMachineCDXServerAPI(
url=url,
user_agent=user_agent,
closest="201010101010",
sort="closest",
limit="1",
)
snapshots = cdx.snapshots()
for snapshot in snapshots:
archive_url = snapshot.archive_url
timestamp = snapshot.timestamp
break
assert str(archive_url).find("google.com")
assert "20101010" in timestamp
def test_d() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="akamhy.github.io",
user_agent=user_agent,
match_type="prefix",
use_pagination=True,
filters=["statuscode:200"],
)
snapshots = cdx.snapshots()
count = 0
for snapshot in snapshots:
count += 1
assert str(snapshot.archive_url).find("akamhy.github.io")
assert count > 50
def test_oldest() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="google.com",
user_agent=user_agent,
filters=["statuscode:200"],
)
oldest = cdx.oldest()
assert "1998" in oldest.timestamp
assert "google" in oldest.urlkey
assert oldest.original.find("google.com") != -1
assert oldest.archive_url.find("google.com") != -1
def test_newest() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="google.com",
user_agent=user_agent,
filters=["statuscode:200"],
)
newest = cdx.newest()
assert "google" in newest.urlkey
assert newest.original.find("google.com") != -1
assert newest.archive_url.find("google.com") != -1
def test_near() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="google.com",
user_agent=user_agent,
filters=["statuscode:200"],
)
near = cdx.near(year=2010, month=10, day=10, hour=10, minute=10)
assert "2010101010" in near.timestamp
assert "google" in near.urlkey
assert near.original.find("google.com") != -1
assert near.archive_url.find("google.com") != -1
near = cdx.near(wayback_machine_timestamp="201010101010")
assert "2010101010" in near.timestamp
assert "google" in near.urlkey
assert near.original.find("google.com") != -1
assert near.archive_url.find("google.com") != -1
near = cdx.near(unix_timestamp=1286705410)
assert "2010101010" in near.timestamp
assert "google" in near.urlkey
assert near.original.find("google.com") != -1
assert near.archive_url.find("google.com") != -1
with pytest.raises(NoCDXRecordFound):
dne_url = f"https://{rndstr(30)}.in"
cdx = WaybackMachineCDXServerAPI(
url=dne_url,
user_agent=user_agent,
filters=["statuscode:200"],
)
cdx.near(unix_timestamp=1286705410)

View File

@ -0,0 +1,44 @@
from datetime import datetime
from waybackpy.cdx_snapshot import CDXSnapshot
def test_CDXSnapshot() -> None:
sample_input = (
"org,archive)/ 20080126045828 http://github.com "
"text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415"
)
prop_values = sample_input.split(" ")
properties = {}
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
snapshot = CDXSnapshot(properties)
assert properties["urlkey"] == snapshot.urlkey
assert properties["timestamp"] == snapshot.timestamp
assert properties["original"] == snapshot.original
assert properties["mimetype"] == snapshot.mimetype
assert properties["statuscode"] == snapshot.statuscode
assert properties["digest"] == snapshot.digest
assert properties["length"] == snapshot.length
assert (
datetime.strptime(properties["timestamp"], "%Y%m%d%H%M%S")
== snapshot.datetime_timestamp
)
archive_url = (
"https://web.archive.org/web/"
+ properties["timestamp"]
+ "/"
+ properties["original"]
)
assert archive_url == snapshot.archive_url
assert sample_input == str(snapshot)
assert sample_input == repr(snapshot)

113
tests/test_cdx_utils.py Normal file
View File

@ -0,0 +1,113 @@
from typing import Any, Dict, List
import pytest
from waybackpy.cdx_utils import (
check_collapses,
check_filters,
check_match_type,
check_sort,
full_url,
get_response,
get_total_pages,
)
from waybackpy.exceptions import WaybackError
def test_get_total_pages() -> None:
url = "twitter.com"
user_agent = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.0.2 Safari/605.1.15"
)
assert get_total_pages(url=url, user_agent=user_agent) >= 56
def test_full_url() -> None:
endpoint = "https://web.archive.org/cdx/search/cdx"
params: Dict[str, Any] = {}
assert endpoint == full_url(endpoint, params)
params = {"a": "1"}
assert full_url(endpoint, params) == "https://web.archive.org/cdx/search/cdx?a=1"
assert (
full_url(endpoint + "?", params) == "https://web.archive.org/cdx/search/cdx?a=1"
)
params["b"] = 2
assert (
full_url(endpoint + "?", params)
== "https://web.archive.org/cdx/search/cdx?a=1&b=2"
)
params["c"] = "foo bar"
assert (
full_url(endpoint + "?", params)
== "https://web.archive.org/cdx/search/cdx?a=1&b=2&c=foo%20bar"
)
def test_get_response() -> None:
url = "https://github.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": str(user_agent)}
response = get_response(url, headers=headers)
assert not isinstance(response, Exception) and response.status_code == 200
def test_check_filters() -> None:
filters: List[str] = []
check_filters(filters)
filters = ["statuscode:200", "timestamp:20215678901234", "original:https://url.com"]
check_filters(filters)
with pytest.raises(WaybackError):
check_filters("not-list") # type: ignore[arg-type]
with pytest.raises(WaybackError):
check_filters(["invalid"])
def test_check_collapses() -> None:
collapses: List[str] = []
check_collapses(collapses)
collapses = ["timestamp:10"]
check_collapses(collapses)
collapses = ["urlkey"]
check_collapses(collapses)
collapses = "urlkey" # type: ignore[assignment]
with pytest.raises(WaybackError):
check_collapses(collapses)
collapses = ["also illegal collapse"]
with pytest.raises(WaybackError):
check_collapses(collapses)
def test_check_match_type() -> None:
assert check_match_type(None, "url")
match_type = "exact"
url = "test_url"
assert check_match_type(match_type, url)
url = "has * in it"
with pytest.raises(WaybackError):
check_match_type("domain", url)
with pytest.raises(WaybackError):
check_match_type("not a valid type", "url")
def test_check_sort() -> None:
assert check_sort("default")
assert check_sort("closest")
assert check_sort("reverse")
with pytest.raises(WaybackError):
assert check_sort("random crap")

136
tests/test_cli.py Normal file
View File

@ -0,0 +1,136 @@
import requests
from click.testing import CliRunner
from waybackpy import __version__
from waybackpy.cli import main
def test_oldest() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--url", " https://github.com ", "--oldest"])
assert result.exit_code == 0
assert (
result.output
== "Archive URL:\nhttps://web.archive.org/web/2008051421\
0148/http://github.com/\n"
)
def test_near() -> None:
runner = CliRunner()
result = runner.invoke(
main,
[
"--url",
" https://facebook.com ",
"--near",
"--year",
"2010",
"--month",
"5",
"--day",
"10",
"--hour",
"6",
],
)
assert result.exit_code == 0
assert (
result.output
== "Archive URL:\nhttps://web.archive.org/web/2010051008\
2647/http://www.facebook.com/\n"
)
def test_newest() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--url", " https://microsoft.com ", "--newest"])
assert result.exit_code == 0
assert (
result.output.find("microsoft.com") != -1
and result.output.find("Archive URL:\n") != -1
)
def test_cdx() -> None:
runner = CliRunner()
result = runner.invoke(
main,
"--url https://twitter.com/jack --cdx --user-agent some-user-agent \
--start-timestamp 2010 --end-timestamp 2012 --collapse urlkey \
--match-type prefix --cdx-print archiveurl --cdx-print length \
--cdx-print digest --cdx-print statuscode --cdx-print mimetype \
--cdx-print original --cdx-print timestamp --cdx-print urlkey".split(
" "
),
)
assert result.exit_code == 0
assert result.output.count("\n") > 3000
def test_save() -> None:
runner = CliRunner()
result = runner.invoke(
main,
"--url https://yahoo.com --user_agent my-unique-user-agent \
--save --headers".split(
" "
),
)
assert result.exit_code == 0
assert result.output.find("Archive URL:") != -1
assert (result.output.find("Cached save:\nTrue") != -1) or (
result.output.find("Cached save:\nFalse") != -1
)
assert result.output.find("Save API headers:\n") != -1
assert result.output.find("yahoo.com") != -1
def test_version() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--version"])
assert result.exit_code == 0
assert result.output == f"waybackpy version {__version__}\n"
def test_license() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--license"])
assert result.exit_code == 0
assert (
result.output
== requests.get(
url="https://raw.githubusercontent.com/akamhy/waybackpy/master/LICENSE"
).text
+ "\n"
)
def test_only_url() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--url", "https://google.com"])
assert result.exit_code == 0
assert (
result.output
== "NoCommandFound: Only URL passed, but did not specify what to do with the URL. Use \
--help flag for help using waybackpy.\n"
)
def test_known_url() -> None:
# with file generator enabled
runner = CliRunner()
result = runner.invoke(
main, ["--url", "https://akamhy.github.io", "--known-urls", "--file"]
)
assert result.exit_code == 0
assert result.output.count("\n") > 40
assert result.output.count("akamhy.github.io") > 40
assert result.output.find("in the current working directory.\n") != -1
# without file
runner = CliRunner()
result = runner.invoke(main, ["--url", "https://akamhy.github.io", "--known-urls"])
assert result.exit_code == 0
assert result.output.count("\n") > 40
assert result.output.count("akamhy.github.io") > 40

223
tests/test_save_api.py Normal file
View File

@ -0,0 +1,223 @@
import random
import string
import time
from datetime import datetime
from typing import cast
import pytest
from requests.structures import CaseInsensitiveDict
from waybackpy.exceptions import MaximumSaveRetriesExceeded
from waybackpy.save_api import WaybackMachineSaveAPI
def rndstr(n: int) -> str:
return "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_save() -> None:
url = "https://github.com/akamhy/waybackpy"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.save()
archive_url = save_api.archive_url
timestamp = save_api.timestamp()
headers = save_api.headers # CaseInsensitiveDict
cached_save = save_api.cached_save
assert cached_save in [True, False]
assert archive_url.find("github.com/akamhy/waybackpy") != -1
assert timestamp is not None
assert str(headers).find("github.com/akamhy/waybackpy") != -1
assert isinstance(save_api.timestamp(), datetime)
def test_max_redirect_exceeded() -> None:
with pytest.raises(MaximumSaveRetriesExceeded):
url = f"https://{rndstr}.gov"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent, max_tries=3)
save_api.save()
def test_sleep() -> None:
"""
sleeping is actually very important for SaveAPI
interface stability.
The test checks that the time taken by sleep method
is as intended.
"""
url = "https://example.com"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
s_time = int(time.time())
save_api.sleep(6) # multiple of 3 sleep for 10 seconds
e_time = int(time.time())
assert (e_time - s_time) >= 10
s_time = int(time.time())
save_api.sleep(7) # sleeps for 5 seconds
e_time = int(time.time())
assert (e_time - s_time) >= 5
def test_timestamp() -> None:
url = "https://example.com"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
save_api._archive_url = f"https://web.archive.org/web/{now}/{url}/"
save_api.timestamp()
assert save_api.cached_save is False
now = "20100124063622"
save_api._archive_url = f"https://web.archive.org/web/{now}/{url}/"
save_api.timestamp()
assert save_api.cached_save is True
def test_archive_url_parser() -> None:
"""
Testing three regex for matches and also tests the response URL.
"""
url = "https://example.com"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
h = (
"\nSTART\nContent-Location: "
"/web/20201126185327/https://www.scribbr.com/citing-sources/et-al"
"\nEND\n"
)
save_api.headers = h # type: ignore[assignment]
expected_url = (
"https://web.archive.org/web/20201126185327/"
"https://www.scribbr.com/citing-sources/et-al"
)
assert save_api.archive_url_parser() == expected_url
headers = {
"Server": "nginx/1.15.8",
"Date": "Sat, 02 Jan 2021 09:40:25 GMT",
"Content-Type": "text/html; charset=UTF-8",
"Transfer-Encoding": "chunked",
"Connection": "keep-alive",
"X-Archive-Orig-Server": "nginx",
"X-Archive-Orig-Date": "Sat, 02 Jan 2021 09:40:09 GMT",
"X-Archive-Orig-Transfer-Encoding": "chunked",
"X-Archive-Orig-Connection": "keep-alive",
"X-Archive-Orig-Vary": "Accept-Encoding",
"X-Archive-Orig-Last-Modified": "Fri, 01 Jan 2021 12:19:00 GMT",
"X-Archive-Orig-Strict-Transport-Security": "max-age=31536000, max-age=0;",
"X-Archive-Guessed-Content-Type": "text/html",
"X-Archive-Guessed-Charset": "utf-8",
"Memento-Datetime": "Sat, 02 Jan 2021 09:40:09 GMT",
"Link": (
'<https://www.scribbr.com/citing-sources/et-al/>; rel="original", '
"<https://web.archive.org/web/timemap/link/https://www.scribbr.com/"
'citing-sources/et-al/>; rel="timemap"; type="application/link-format", '
"<https://web.archive.org/web/https://www.scribbr.com/citing-sources/"
'et-al/>; rel="timegate", <https://web.archive.org/web/20200601082911/'
'https://www.scribbr.com/citing-sources/et-al/>; rel="first memento"; '
'datetime="Mon, 01 Jun 2020 08:29:11 GMT", <https://web.archive.org/web/'
"20201126185327/https://www.scribbr.com/citing-sources/et-al/>; "
'rel="prev memento"; datetime="Thu, 26 Nov 2020 18:53:27 GMT", '
"<https://web.archive.org/web/20210102094009/https://www.scribbr.com/"
'citing-sources/et-al/>; rel="memento"; datetime="Sat, 02 Jan 2021 '
'09:40:09 GMT", <https://web.archive.org/web/20210102094009/'
"https://www.scribbr.com/citing-sources/et-al/>; "
'rel="last memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT"'
),
"Content-Security-Policy": (
"default-src 'self' 'unsafe-eval' 'unsafe-inline' "
"data: blob: archive.org web.archive.org analytics.archive.org "
"pragma.archivelab.org",
),
"X-Archive-Src": "spn2-20210102092956-wwwb-spn20.us.archive.org-8001.warc.gz",
"Server-Timing": (
"captures_list;dur=112.646325, exclusion.robots;dur=0.172010, "
"exclusion.robots.policy;dur=0.158205, RedisCDXSource;dur=2.205932, "
"esindex;dur=0.014647, LoadShardBlock;dur=82.205012, "
"PetaboxLoader3.datanode;dur=70.750239, CDXLines.iter;dur=24.306278, "
"load_resource;dur=26.520179"
),
"X-App-Server": "wwwb-app200",
"X-ts": "200",
"X-location": "All",
"X-Cache-Key": (
"httpsweb.archive.org/web/20210102094009/"
"https://www.scribbr.com/citing-sources/et-al/IN",
),
"X-RL": "0",
"X-Page-Cache": "MISS",
"X-Archive-Screenname": "0",
"Content-Encoding": "gzip",
}
save_api.headers = cast(CaseInsensitiveDict[str], headers)
expected_url2 = (
"https://web.archive.org/web/20210102094009/"
"https://www.scribbr.com/citing-sources/et-al/"
)
assert save_api.archive_url_parser() == expected_url2
expected_url_3 = (
"https://web.archive.org/web/20171128185327/"
"https://www.scribbr.com/citing-sources/et-al/US"
)
h = f"START\nX-Cache-Key: {expected_url_3}\nEND\n"
save_api.headers = h # type: ignore[assignment]
expected_url4 = (
"https://web.archive.org/web/20171128185327/"
"https://www.scribbr.com/citing-sources/et-al/"
)
assert save_api.archive_url_parser() == expected_url4
h = "TEST TEST TEST AND NO MATCH - TEST FOR RESPONSE URL MATCHING"
save_api.headers = h # type: ignore[assignment]
save_api.response_url = (
"https://web.archive.org/web/20171128185327/"
"https://www.scribbr.com/citing-sources/et-al"
)
expected_url5 = (
"https://web.archive.org/web/20171128185327/"
"https://www.scribbr.com/citing-sources/et-al"
)
assert save_api.archive_url_parser() == expected_url5
def test_archive_url() -> None:
"""
Checks the attribute archive_url's value when the save method was not
explicitly invoked by the end-user but the save method was invoked implicitly
by the archive_url method which is an attribute due to @property.
"""
url = "https://example.com"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.saved_archive = (
"https://web.archive.org/web/20220124063056/https://example.com/"
)
save_api._archive_url = save_api.saved_archive
assert save_api.archive_url == save_api.saved_archive

9
tests/test_utils.py Normal file
View File

@ -0,0 +1,9 @@
from waybackpy import __version__
from waybackpy.utils import DEFAULT_USER_AGENT
def test_default_user_agent() -> None:
assert (
DEFAULT_USER_AGENT
== f"waybackpy {__version__} - https://github.com/akamhy/waybackpy"
)

45
tests/test_wrapper.py Normal file
View File

@ -0,0 +1,45 @@
from waybackpy.wrapper import Url
def test_oldest() -> None:
url = "https://bing.com"
oldest_archive = (
"https://web.archive.org/web/20030726111100/http://www.bing.com:80/"
)
wayback = Url(url).oldest()
assert wayback.archive_url == oldest_archive
assert str(wayback) == oldest_archive
assert len(wayback) > 365 * 15 # days in a year times years
def test_newest() -> None:
url = "https://www.youtube.com/"
wayback = Url(url).newest()
assert "youtube" in str(wayback.archive_url)
assert "archived_snapshots" in str(wayback.json)
def test_near() -> None:
url = "https://www.google.com"
wayback = Url(url).near(year=2010, month=10, day=10, hour=10, minute=10)
assert "20101010" in str(wayback.archive_url)
def test_total_archives() -> None:
wayback = Url("https://akamhy.github.io")
assert wayback.total_archives() > 10
wayback = Url("https://gaha.ef4i3n.m5iai3kifp6ied.cima/gahh2718gs/ahkst63t7gad8")
assert wayback.total_archives() == 0
def test_known_urls() -> None:
wayback = Url("akamhy.github.io")
assert len(list(wayback.known_urls(subdomain=True))) > 40
def test_Save() -> None:
wayback = Url("https://en.wikipedia.org/wiki/Asymptotic_equipartition_property")
wayback.save()
archive_url = str(wayback.archive_url)
assert archive_url.find("Asymptotic_equipartition_property") != -1

View File

@ -1,14 +1,16 @@
from .wrapper import Url
"""Module initializer and provider of static information."""
__version__ = "3.0.4"
from .availability_api import WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .__version__ import (
__title__,
__description__,
__url__,
__version__,
__author__,
__author_email__,
__license__,
__copyright__,
)
from .wrapper import Url
__all__ = [
"__version__",
"WaybackMachineAvailabilityAPI",
"WaybackMachineCDXServerAPI",
"WaybackMachineSaveAPI",
"Url",
]

View File

@ -1,11 +0,0 @@
__title__ = "waybackpy"
__description__ = (
"Python package that interfaces with the Internet Archive's Wayback Machine APIs. "
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "3.0.0"
__author__ = "akamhy"
__author_email__ = "akamhy@yahoo.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020-2022 Akash Mahanty et al."

View File

@ -1,69 +1,188 @@
import re
"""
This module interfaces the Wayback Machine's availability API.
The interface is useful for looking up archives and finding archives
that are close to a specific date and time.
It has a class WaybackMachineAvailabilityAPI, and the class has
methods like:
near() for retrieving archives close to a specific date and time.
oldest() for retrieving the first archive URL of the webpage.
newest() for retrieving the latest archive of the webpage.
The Wayback Machine Availability API response must be a valid JSON and
if it is not then an exception, InvalidJSONInAvailabilityAPIResponse is raised.
If the Availability API returned valid JSON but archive URL could not be found
it it then ArchiveNotInAvailabilityAPIResponse is raised.
"""
import json
import time
import requests
from datetime import datetime
from .__version__ import __version__
from .utils import DEFAULT_USER_AGENT
from typing import Any, Dict, Optional
import requests
from requests.models import Response
def full_url(endpoint, params):
if not params:
return endpoint.strip()
from .exceptions import (
ArchiveNotInAvailabilityAPIResponse,
InvalidJSONInAvailabilityAPIResponse,
)
from .utils import (
DEFAULT_USER_AGENT,
unix_timestamp_to_wayback_timestamp,
wayback_timestamp,
)
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = (
full_url
+ amp
+ "{key}={val}".format(key=key, val=requests.utils.quote(str(val)))
)
return full_url
ResponseJSON = Dict[str, Any]
class WaybackMachineAvailabilityAPI:
def __init__(self, url, user_agent=DEFAULT_USER_AGENT):
"""
Class that interfaces the Wayback Machine's availability API.
"""
def __init__(
self, url: str, user_agent: str = DEFAULT_USER_AGENT, max_tries: int = 3
) -> None:
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = user_agent
self.headers = {"User-Agent": self.user_agent}
self.payload = {"url": "{url}".format(url=self.url)}
self.endpoint = "https://archive.org/wayback/available"
self.JSON = None
self.headers: Dict[str, str] = {"User-Agent": self.user_agent}
self.payload: Dict[str, str] = {"url": self.url}
self.endpoint: str = "https://archive.org/wayback/available"
self.max_tries: int = max_tries
self.tries: int = 0
self.last_api_call_unix_time: int = int(time.time())
self.api_call_time_gap: int = 5
self.json: Optional[ResponseJSON] = None
self.response: Optional[Response] = None
def unix_timestamp_to_wayback_timestamp(self, unix_timestamp):
return datetime.utcfromtimestamp(int(unix_timestamp)).strftime("%Y%m%d%H%M%S")
def __repr__(self) -> str:
"""
Same as string representation, just return the archive URL as a string.
"""
return str(self)
def __repr__(self):
return str(self) # self.__str__()
def __str__(self) -> str:
"""
String representation of the class. If atleast one API
call was successfully made then return the archive URL
as a string. Else returns "" (empty string literal).
"""
# __str__ can not return anything other than a string object
# So, if a string repr is asked even before making a API request
# just return ""
if not self.json:
return ""
def __str__(self):
if not self.JSON:
return None
return self.archive_url
def json(self):
self.request_url = full_url(self.endpoint, self.payload)
self.response = requests.get(self.request_url, self.headers)
self.JSON = self.response.json()
return self.JSON
def setup_json(self) -> Optional[ResponseJSON]:
"""
Makes the API call to the availability API and set the JSON response
to the JSON attribute of the instance and also returns the JSON
attribute.
def timestamp(self):
if not self.JSON["archived_snapshots"] or not self.JSON:
time_diff and sleep_time makes sure that you are not making too many
requests in a short interval of item, making too many requests is bad
as Wayback Machine may reject them above a certain threshold.
The end-user can change the api_call_time_gap attribute of the instance
to increase or decrease the default time gap between two successive API
calls, but it is not recommended to increase it.
"""
time_diff = int(time.time()) - self.last_api_call_unix_time
sleep_time = self.api_call_time_gap - time_diff
if sleep_time > 0:
time.sleep(sleep_time)
self.response = requests.get(
self.endpoint, params=self.payload, headers=self.headers
)
self.last_api_call_unix_time = int(time.time())
self.tries += 1
try:
self.json = None if self.response is None else self.response.json()
except json.decoder.JSONDecodeError as json_decode_error:
raise InvalidJSONInAvailabilityAPIResponse(
f"Response data:\n{self.response.text}"
) from json_decode_error
return self.json
def timestamp(self) -> datetime:
"""
Converts the timestamp form the JSON response to datetime object.
If JSON attribute of the instance is None it implies that the either
the the last API call failed or one was never made.
If not JSON or if JSON but no timestamp in the JSON response then
returns the maximum value for datetime object that is possible.
If you get an URL as a response form the availability API it is
guaranteed that you can get the datetime object from the timestamp.
"""
if self.json is None or "archived_snapshots" not in self.json:
return datetime.max
return datetime.strptime(
self.JSON["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
if (
self.json is not None
and "archived_snapshots" in self.json
and self.json["archived_snapshots"] is not None
and "closest" in self.json["archived_snapshots"]
and self.json["archived_snapshots"]["closest"] is not None
and "timestamp" in self.json["archived_snapshots"]["closest"]
):
return datetime.strptime(
self.json["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
raise ValueError("Timestamp not found in the Availability API's JSON response.")
@property
def archive_url(self):
data = self.JSON
def archive_url(self) -> str:
"""
Reads the the JSON response data and returns
the timestamp if found and if not found raises
ArchiveNotInAvailabilityAPIResponse.
"""
archive_url = ""
data = self.json
if not data["archived_snapshots"]:
archive_url = None
# If the user didn't invoke oldest, newest or near but tries to access
# archive_url attribute then assume they that are fine with any archive
# and invoke the oldest method.
if not data:
self.oldest()
# If data is still not none then probably there are no
# archive for the requested URL.
if not data or not data["archived_snapshots"]:
while (self.tries < self.max_tries) and (
not data or not data["archived_snapshots"]
):
self.setup_json() # It makes a new API call
data = self.json # setup_json() updates value of json attribute
# If exhausted max_tries, then give up and
# raise ArchiveNotInAvailabilityAPIResponse.
if not data or not data["archived_snapshots"]:
raise ArchiveNotInAvailabilityAPIResponse(
"Archive not found in the availability "
"API response, the URL you requested may not have any archives "
"yet. You may retry after some time or archive the webpage now.\n"
"Response data:\n"
""
if self.response is None
else self.response.text
)
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
@ -71,39 +190,57 @@ class WaybackMachineAvailabilityAPI:
)
return archive_url
def wayback_timestamp(self, **kwargs):
return "".join(
str(kwargs[key]).zfill(2)
for key in ["year", "month", "day", "hour", "minute"]
)
def oldest(self) -> "WaybackMachineAvailabilityAPI":
"""
Passes the date 1994-01-01 to near which should return the oldest archive
because Wayback Machine was started in May, 1996 and it is assumed that
there would be no archive older than January 1, 1994.
"""
return self.near(year=1994, month=1, day=1)
def oldest(self):
return self.near(year=1994)
def newest(self) -> "WaybackMachineAvailabilityAPI":
"""
Passes the current UNIX time to near() for retrieving the newest archive
from the availability API.
def newest(self):
Remember UNIX time is UTC and Wayback Machine is also UTC based.
"""
return self.near(unix_timestamp=int(time.time()))
def near(
self,
year=None,
month=None,
day=None,
hour=None,
minute=None,
unix_timestamp=None,
):
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
unix_timestamp: Optional[int] = None,
) -> "WaybackMachineAvailabilityAPI":
"""
The most important method of this Class, oldest() and newest() are
dependent on it.
It generates the timestamp based on the input either by calling the
unix_timestamp_to_wayback_timestamp or wayback_timestamp method with
appropriate arguments for their respective parameters.
Adds the timestamp to the payload dictionary.
And finally invokes the setup_json method to make the API call then
finally returns the instance.
"""
if unix_timestamp:
timestamp = self.unix_timestamp_to_wayback_timestamp(unix_timestamp)
timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = self.wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
timestamp = wayback_timestamp(
year=now.tm_year if year is None else year,
month=now.tm_mon if month is None else month,
day=now.tm_mday if day is None else day,
hour=now.tm_hour if hour is None else hour,
minute=now.tm_min if minute is None else minute,
)
self.payload["timestamp"] = timestamp
self.json()
self.setup_json()
return self

View File

@ -1,83 +1,144 @@
from .exceptions import WaybackError
"""
This module interfaces the Wayback Machine's CDX server API.
The module has WaybackMachineCDXServerAPI which should be used by the users of
this module to consume the CDX server API.
WaybackMachineCDXServerAPI has a snapshot method that yields the snapshots, and
the snapshots are yielded as instances of the CDXSnapshot class.
"""
import time
from datetime import datetime
from typing import Dict, Generator, List, Optional, Union, cast
from .cdx_snapshot import CDXSnapshot
from .cdx_utils import (
get_total_pages,
get_response,
check_filters,
check_collapses,
check_filters,
check_match_type,
check_sort,
full_url,
get_response,
get_total_pages,
)
from .exceptions import NoCDXRecordFound, WaybackError
from .utils import (
DEFAULT_USER_AGENT,
unix_timestamp_to_wayback_timestamp,
wayback_timestamp,
)
from .utils import DEFAULT_USER_AGENT
class WaybackMachineCDXServerAPI:
"""
Class that interfaces the CDX server API of the Wayback Machine.
snapshot() returns a generator that can be iterated upon by the end-user,
the generator returns the snapshots/entries as instance of CDXSnapshot to
make the usage easy, just use '.' to get any attribute as the attributes are
accessible via a dot ".".
"""
# start_timestamp: from, can not use from as it's a keyword
# end_timestamp: to, not using to as can not use from
def __init__(
self,
url,
user_agent=None,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=None,
collapses=[],
limit=None,
):
url: str,
user_agent: str = DEFAULT_USER_AGENT,
start_timestamp: Optional[str] = None,
end_timestamp: Optional[str] = None,
filters: Optional[List[str]] = None,
match_type: Optional[str] = None,
sort: Optional[str] = None,
gzip: Optional[str] = None,
collapses: Optional[List[str]] = None,
limit: Optional[str] = None,
max_tries: int = 3,
use_pagination: bool = False,
closest: Optional[str] = None,
) -> None:
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = str(user_agent) if user_agent else DEFAULT_USER_AGENT
self.start_timestamp = str(start_timestamp) if start_timestamp else None
self.end_timestamp = str(end_timestamp) if end_timestamp else None
self.filters = filters
self.user_agent = user_agent
self.start_timestamp = None if start_timestamp is None else str(start_timestamp)
self.end_timestamp = None if end_timestamp is None else str(end_timestamp)
self.filters = [] if filters is None else filters
check_filters(self.filters)
self.match_type = str(match_type).strip() if match_type else None
self.match_type = None if match_type is None else str(match_type).strip()
check_match_type(self.match_type, self.url)
self.gzip = gzip if gzip else True
self.collapses = collapses
self.sort = None if sort is None else str(sort).strip()
check_sort(self.sort)
self.gzip = gzip
self.collapses = [] if collapses is None else collapses
check_collapses(self.collapses)
self.limit = limit if limit else 5000
self.last_api_request_url = None
self.use_page = False
self.limit = 25000 if limit is None else limit
self.max_tries = max_tries
self.use_pagination = use_pagination
self.closest = None if closest is None else str(closest)
self.last_api_request_url: Optional[str] = None
self.endpoint = "https://web.archive.org/cdx/search/cdx"
def cdx_api_manager(self, payload, headers, use_page=False):
def cdx_api_manager(
self, payload: Dict[str, str], headers: Dict[str, str]
) -> Generator[str, None, None]:
"""
This method uses the pagination API of the CDX server if
use_pagination attribute is True else uses the standard
CDX server response data.
"""
# When using the pagination API of the CDX server.
if self.use_pagination is True:
total_pages = get_total_pages(self.url, self.user_agent)
successive_blank_pages = 0
total_pages = get_total_pages(self.url, self.user_agent)
# If we only have two or less pages of archives then we care for accuracy
# pagination API can be lagged sometimes
if use_page == True and total_pages >= 2:
blank_pages = 0
for i in range(total_pages):
payload["page"] = str(i)
url, res = get_response(
self.endpoint, params=payload, headers=headers, return_full_url=True
)
url = full_url(self.endpoint, params=payload)
res = get_response(url, headers=headers)
if isinstance(res, Exception):
raise res
self.last_api_request_url = url
text = res.text
if len(text) == 0:
blank_pages += 1
if blank_pages >= 2:
# Reset the counter if the last page was blank
# but the current page is not.
if successive_blank_pages == 1:
if len(text) != 0:
successive_blank_pages = 0
# Increase the succesive page counter on encountering
# blank page.
if len(text) == 0:
successive_blank_pages += 1
# If two succesive pages are blank
# then we don't have any more pages left to
# iterate.
if successive_blank_pages >= 2:
break
yield text
else:
# When not using the pagination API of the CDX server
else:
payload["showResumeKey"] = "true"
payload["limit"] = str(self.limit)
resumeKey = None
resume_key = None
more = True
while more:
if resume_key:
payload["resumeKey"] = resume_key
if resumeKey:
payload["resumeKey"] = resumeKey
url, res = get_response(
self.endpoint, params=payload, headers=headers, return_full_url=True
)
url = full_url(self.endpoint, params=payload)
res = get_response(url, headers=headers)
if isinstance(res, Exception):
raise res
self.last_api_request_url = url
@ -92,63 +153,153 @@ class WaybackMachineCDXServerAPI:
if len(second_last_line) == 0:
resumeKey = lines[-1].strip()
text = text.replace(resumeKey, "", 1).strip()
resume_key = lines[-1].strip()
text = text.replace(resume_key, "", 1).strip()
more = True
yield text
def add_payload(self, payload):
def add_payload(self, payload: Dict[str, str]) -> None:
"""
Adds the payload to the payload dictionary.
"""
if self.start_timestamp:
payload["from"] = self.start_timestamp
if self.end_timestamp:
payload["to"] = self.end_timestamp
if self.gzip != True:
if self.gzip is None:
payload["gzip"] = "false"
if self.closest:
payload["closest"] = self.closest
if self.match_type:
payload["matchType"] = self.match_type
if self.sort:
payload["sort"] = self.sort
if self.filters and len(self.filters) > 0:
for i, f in enumerate(self.filters):
payload["filter" + str(i)] = f
for i, _filter in enumerate(self.filters):
payload["filter" + str(i)] = _filter
if self.collapses and len(self.collapses) > 0:
for i, f in enumerate(self.collapses):
payload["collapse" + str(i)] = f
for i, collapse in enumerate(self.collapses):
payload["collapse" + str(i)] = collapse
# Don't need to return anything as it's dictionary.
payload["url"] = self.url
def snapshots(self):
payload = {}
def near(
self,
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
unix_timestamp: Optional[int] = None,
wayback_machine_timestamp: Optional[Union[int, str]] = None,
) -> CDXSnapshot:
"""
Fetch archive close to a datetime, it can only return
a single URL. If you want more do not use this method
instead use the class.
"""
if unix_timestamp:
timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
elif wayback_machine_timestamp:
timestamp = str(wayback_machine_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = wayback_timestamp(
year=now.tm_year if year is None else year,
month=now.tm_mon if month is None else month,
day=now.tm_mday if day is None else day,
hour=now.tm_hour if hour is None else hour,
minute=now.tm_min if minute is None else minute,
)
self.closest = timestamp
self.sort = "closest"
self.limit = 1
first_snapshot = None
for snapshot in self.snapshots():
first_snapshot = snapshot
break
if not first_snapshot:
raise NoCDXRecordFound(
"Wayback Machine's CDX server did not return any records "
+ "for the query. The URL may not have any archives "
+ " on the Wayback Machine or the URL may have been recently "
+ "archived and is still not available on the CDX server."
)
return first_snapshot
def newest(self) -> CDXSnapshot:
"""
Passes the current UNIX time to near() for retrieving the newest archive
from the availability API.
Remember UNIX time is UTC and Wayback Machine is also UTC based.
"""
return self.near(unix_timestamp=int(time.time()))
def oldest(self) -> CDXSnapshot:
"""
Passes the date 1994-01-01 to near which should return the oldest archive
because Wayback Machine was started in May, 1996 and it is assumed that
there would be no archive older than January 1, 1994.
"""
return self.near(year=1994, month=1, day=1)
def snapshots(self) -> Generator[CDXSnapshot, None, None]:
"""
This function yields the CDX data lines as snapshots.
As it is a generator it exhaustible, the reason that this is
a generator and not a list are:
a) CDX server API can return millions of entries for a query and list
is not suitable for such cases.
b) Preventing memory usage issues, as told before this method may yield
millions of records for some queries and your system may not have enough
memory for such a big list. Also Remember this if outputing to Jupyter
Notebooks.
The objects yielded by this method are instance of CDXSnapshot class,
you can access the attributes of the entries as the attribute of the instance
itself.
"""
payload: Dict[str, str] = {}
headers = {"User-Agent": self.user_agent}
self.add_payload(payload)
if not self.start_timestamp or self.end_timestamp:
self.use_page = True
entries = self.cdx_api_manager(payload, headers)
if self.collapses != []:
self.use_page = False
for entry in entries:
texts = self.cdx_api_manager(payload, headers, use_page=self.use_page)
for text in texts:
if text.isspace() or len(text) <= 1 or not text:
if entry.isspace() or len(entry) <= 1 or not entry:
continue
snapshot_list = text.split("\n")
# each line is a snapshot aka entry of the CDX server API.
# We are able to split the page by lines because it only
# splits the lines on a sinlge page and not all the entries
# at once, thus there should be no issues of too much memory usage.
snapshot_list = entry.split("\n")
for snapshot in snapshot_list:
if len(snapshot) < 46: # 14 + 32 (timestamp+digest)
# 14 + 32 == 46 ( timestamp + digest ), ignore the invalid entries.
# they are invalid if their length is smaller than sum of length
# of a standard wayback_timestamp and standard digest of an entry.
if len(snapshot) < 46:
continue
properties = {
properties: Dict[str, Optional[str]] = {
"urlkey": None,
"timestamp": None,
"original": None,
@ -158,18 +309,16 @@ class WaybackMachineCDXServerAPI:
"length": None,
}
prop_values = snapshot.split(" ")
property_value = snapshot.split(" ")
prop_values_len = len(prop_values)
properties_len = len(properties)
total_property_values = len(property_value)
warranted_total_property_values = len(properties)
if prop_values_len != properties_len:
if total_property_values != warranted_total_property_values:
raise WaybackError(
"Snapshot returned by Cdx API has {prop_values_len} properties instead of expected {properties_len} properties.\nInvolved Snapshot : {snapshot}".format(
prop_values_len=prop_values_len,
properties_len=properties_len,
snapshot=snapshot,
)
f"Snapshot returned by CDX API has {total_property_values} prop"
f"erties instead of expected {warranted_total_property_values} "
f"properties.\nProblematic Snapshot: {snapshot}"
)
(
@ -180,6 +329,6 @@ class WaybackMachineCDXServerAPI:
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
) = property_value
yield CDXSnapshot(properties)
yield CDXSnapshot(cast(Dict[str, str], properties))

View File

@ -1,27 +1,90 @@
"""
Module that contains the CDXSnapshot class, CDX records/lines are casted
to CDXSnapshot objects for easier access.
The CDX index format is plain text data. Each line ('record') indicates a
crawled document. And these lines are casted to CDXSnapshot.
"""
from datetime import datetime
from typing import Dict
class CDXSnapshot:
def __init__(self, properties):
self.urlkey = properties["urlkey"]
self.timestamp = properties["timestamp"]
self.datetime_timestamp = datetime.strptime(self.timestamp, "%Y%m%d%H%M%S")
self.original = properties["original"]
self.mimetype = properties["mimetype"]
self.statuscode = properties["statuscode"]
self.digest = properties["digest"]
self.length = properties["length"]
self.archive_url = (
"https://web.archive.org/web/" + self.timestamp + "/" + self.original
"""
Class for the CDX snapshot lines('record') returned by the CDX API,
Each valid line of the CDX API is casted to an CDXSnapshot object
by the CDX API interface, just use "." to access any attribute of the
CDX server API snapshot.
This provides the end-user the ease of using the data as attributes
of the CDXSnapshot.
The string representation of the class is identical to the line returned
by the CDX server API.
Besides all the attributes of the CDX server API this class also provides
archive_url attribute, yes it is the archive url of the snapshot.
Attributes of the this class and what they represents and are useful for:
urlkey: The document captured, expressed as a SURT
SURT stands for Sort-friendly URI Reordering Transform, and is a
transformation applied to URIs which makes their left-to-right
representation better match the natural hierarchy of domain names.
A URI <scheme://domain.tld/path?query> has SURT
form <scheme://(tld,domain,)/path?query>.
timestamp: The timestamp of the archive, format is yyyyMMddhhmmss and type
is string.
datetime_timestamp: The timestamp as a datetime object.
original: The original URL of the archive. If archive_url is
https://web.archive.org/web/20220113130051/https://google.com then the
original URL is https://google.com
mimetype: The documents file type. e.g. text/html
statuscode: HTTP response code for the document at the time of its crawling
digest: Base32-encoded SHA-1 checksum of the document for discriminating
with others
length: Documents volume of bytes in the WARC file
archive_url: The archive url of the snapshot, this is not returned by the
CDX server API but created by this class on init.
"""
def __init__(self, properties: Dict[str, str]) -> None:
self.urlkey: str = properties["urlkey"]
self.timestamp: str = properties["timestamp"]
self.datetime_timestamp: datetime = datetime.strptime(
self.timestamp, "%Y%m%d%H%M%S"
)
self.original: str = properties["original"]
self.mimetype: str = properties["mimetype"]
self.statuscode: str = properties["statuscode"]
self.digest: str = properties["digest"]
self.length: str = properties["length"]
self.archive_url: str = (
f"https://web.archive.org/web/{self.timestamp}/{self.original}"
)
def __str__(self):
return "{urlkey} {timestamp} {original} {mimetype} {statuscode} {digest} {length}".format(
urlkey=self.urlkey,
timestamp=self.timestamp,
original=self.original,
mimetype=self.mimetype,
statuscode=self.statuscode,
digest=self.digest,
length=self.length,
def __repr__(self) -> str:
"""
Same as __str__()
"""
return str(self)
def __str__(self) -> str:
"""
The string representation is same as the line returned by the
CDX server API for the snapshot.
"""
return (
f"{self.urlkey} {self.timestamp} {self.original} "
f"{self.mimetype} {self.statuscode} {self.digest} {self.length}"
)

View File

@ -1,154 +1,201 @@
"""
Utility functions required for accessing the CDX server API.
These are here in this module so that we dont make any module too
long.
"""
import re
from typing import Any, Dict, List, Optional, Union
from urllib.parse import quote
import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .exceptions import WaybackError
from urllib3.util.retry import Retry
from .exceptions import BlockedSiteError, WaybackError
from .utils import DEFAULT_USER_AGENT
def get_total_pages(url, user_agent):
request_url = (
"https://web.archive.org/cdx/search/cdx?url={url}&showNumPages=true".format(
url=url
)
)
def get_total_pages(url: str, user_agent: str = DEFAULT_USER_AGENT) -> int:
"""
When using the pagination use adding showNumPages=true to the request
URL makes the CDX server return an integer which is the number of pages
of CDX pages available for us to query using the pagination API.
"""
endpoint = "https://web.archive.org/cdx/search/cdx?"
payload = {"showNumPages": "true", "url": str(url)}
headers = {"User-Agent": user_agent}
return int((requests.get(request_url, headers=headers).text).strip())
request_url = full_url(endpoint, params=payload)
response = get_response(request_url, headers=headers)
check_for_blocked_site(response, url)
if isinstance(response, requests.Response):
return int(response.text.strip())
raise response
def full_url(endpoint, params):
def check_for_blocked_site(
response: Union[requests.Response, Exception], url: Optional[str] = None
) -> None:
"""
Checks that the URL can be archived by wayback machine or not.
robots.txt policy of the site may prevent the wayback machine.
"""
# see https://github.com/akamhy/waybackpy/issues/157
# the following if block is to make mypy happy.
if isinstance(response, Exception):
raise response
if not url:
url = "The requested content"
if (
"org.archive.util.io.RuntimeIOException: "
+ "org.archive.wayback.exception.AdministrativeAccessControlException: "
+ "Blocked Site Error"
in response.text.strip()
):
raise BlockedSiteError(
f"{url} is excluded from Wayback Machine by the site's robots.txt policy."
)
def full_url(endpoint: str, params: Dict[str, Any]) -> str:
"""
As the function's name already implies that it returns
full URL, but why we need a function for generating full URL?
The CDX server can support multiple arguments for parameters
such as filter and collapse and this function adds them without
overwriting earlier added arguments.
"""
if not params:
return endpoint
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
_full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = (
full_url
+ amp
+ "{key}={val}".format(key=key, val=requests.utils.quote(str(val)))
)
return full_url
amp = "" if _full_url.endswith("?") else "&"
val = quote(str(val), safe="")
_full_url += f"{amp}{key}={val}"
return _full_url
def get_response(
endpoint,
params=None,
headers=None,
return_full_url=False,
retries=5,
backoff_factor=0.5,
no_raise_on_redirects=False,
):
url: str,
headers: Optional[Dict[str, str]] = None,
retries: int = 5,
backoff_factor: float = 0.5,
) -> Union[requests.Response, Exception]:
"""
Makes get request to the CDX server and returns the response.
"""
session = requests.Session()
s = requests.Session()
retries = Retry(
retries_ = Retry(
total=retries,
backoff_factor=backoff_factor,
status_forcelist=[500, 502, 503, 504],
)
s.mount("https://", HTTPAdapter(max_retries=retries))
# The URL with parameters required for the get request
url = full_url(endpoint, params)
try:
if not return_full_url:
return s.get(url, headers=headers)
return (url, s.get(url, headers=headers))
except Exception as e:
reason = str(e)
if no_raise_on_redirects:
if "Exceeded 30 redirects" in reason:
return
exc_message = "Error while retrieving {url}.\n{reason}".format(
url=url, reason=reason
)
exc = WaybackError(exc_message)
exc.__cause__ = e
raise exc
session.mount("https://", HTTPAdapter(max_retries=retries_))
response = session.get(url, headers=headers)
session.close()
check_for_blocked_site(response)
return response
def check_filters(filters):
def check_filters(filters: List[str]) -> None:
"""
Check that the filter arguments passed by the end-user are valid.
If not valid then raise WaybackError.
"""
if not isinstance(filters, list):
raise WaybackError("filters must be a list.")
# [!]field:regex
for _filter in filters:
try:
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):"
r"(.*)",
_filter,
)
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):(.*)",
_filter,
)
if match is None or len(match.groups()) != 2:
key = match.group(1)
val = match.group(2)
except Exception:
exc_message = (
"Filter '{_filter}' is not following the cdx filter syntax.".format(
_filter=_filter
)
)
exc_message = f"Filter '{_filter}' is not following the cdx filter syntax."
raise WaybackError(exc_message)
def check_collapses(collapses):
def check_collapses(collapses: List[str]) -> bool:
"""
Check that the collapse arguments passed by the end-user are valid.
If not valid then raise WaybackError.
"""
if not isinstance(collapses, list):
raise WaybackError("collapses must be a list.")
if len(collapses) == 0:
return
return True
for collapse in collapses:
try:
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?",
collapse,
)
field = match.group(1)
N = None
if 2 == len(match.groups()):
N = match.group(2)
if N:
if not (field + N == collapse):
raise Exception
else:
if not (field == collapse):
raise Exception
except Exception:
exc_message = "collapse argument '{collapse}' is not following the cdx collapse syntax.".format(
collapse=collapse
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)"
r"(:?[0-9]{1,99})?",
collapse,
)
if match is None or len(match.groups()) != 2:
exc_message = (
f"collapse argument '{collapse}' "
"is not following the cdx collapse syntax."
)
raise WaybackError(exc_message)
return True
def check_match_type(match_type, url):
if not match_type:
return
if "*" in url:
raise WaybackError("Can not use wildcard with match_type argument")
def check_match_type(match_type: Optional[str], url: str) -> bool:
"""
Check that the match_type argument passed by the end-user is valid.
If not valid then raise WaybackError.
"""
legal_match_type = ["exact", "prefix", "host", "domain"]
if not match_type:
return True
if "*" in url:
raise WaybackError(
"Can not use wildcard in the URL along with the match_type arguments."
)
if match_type not in legal_match_type:
exc_message = "{match_type} is not an allowed match type.\nUse one from 'exact', 'prefix', 'host' or 'domain'".format(
match_type=match_type
exc_message = (
f"{match_type} is not an allowed match type.\n"
"Use one from 'exact', 'prefix', 'host' or 'domain'"
)
raise WaybackError(exc_message)
return True
def check_sort(sort: Optional[str]) -> bool:
"""
Check that the sort argument passed by the end-user is valid.
If not valid then raise WaybackError.
"""
legal_sort = ["default", "closest", "reverse"]
if not sort:
return True
if sort not in legal_sort:
exc_message = (
f"{sort} is not an allowed argument for sort.\n"
"Use one from 'default', 'closest' or 'reverse'"
)
raise WaybackError(exc_message)
return True

View File

@ -1,17 +1,168 @@
import click
import re
"""
Module responsible for enabling waybackpy to function as a CLI tool.
"""
import os
import json as JSON
import random
import re
import string
from .__version__ import __version__
from .utils import DEFAULT_USER_AGENT
from typing import Any, Dict, Generator, List, Optional
import click
import requests
from . import __version__
from .cdx_api import WaybackMachineCDXServerAPI
from .exceptions import BlockedSiteError, NoCDXRecordFound
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .utils import DEFAULT_USER_AGENT
from .wrapper import Url
def handle_cdx_closest_derivative_methods(
cdx_api: "WaybackMachineCDXServerAPI",
oldest: bool,
near: bool,
newest: bool,
near_args: Optional[Dict[str, int]] = None,
) -> None:
"""
Handles the closest parameter derivative methods.
near, newest and oldest use the closest parameter with active
closest based sorting.
"""
try:
if near:
if near_args:
archive_url = cdx_api.near(**near_args).archive_url
else:
archive_url = cdx_api.near().archive_url
elif newest:
archive_url = cdx_api.newest().archive_url
elif oldest:
archive_url = cdx_api.oldest().archive_url
click.echo("Archive URL:")
click.echo(archive_url)
except NoCDXRecordFound as exc:
click.echo(click.style("NoCDXRecordFound: ", fg="red") + str(exc), err=True)
except BlockedSiteError as exc:
click.echo(click.style("BlockedSiteError: ", fg="red") + str(exc), err=True)
def handle_cdx(data: List[Any]) -> None:
"""
Handles the CDX CLI options and output format.
"""
url = data[0]
user_agent = data[1]
start_timestamp = data[2]
end_timestamp = data[3]
cdx_filter = data[4]
collapse = data[5]
cdx_print = data[6]
limit = data[7]
gzip = data[8]
match_type = data[9]
sort = data[10]
use_pagination = data[11]
closest = data[12]
filters = list(cdx_filter)
collapses = list(collapse)
cdx_print = list(cdx_print)
cdx_api = WaybackMachineCDXServerAPI(
url,
user_agent=user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
closest=closest,
filters=filters,
match_type=match_type,
sort=sort,
use_pagination=use_pagination,
gzip=gzip,
collapses=collapses,
limit=limit,
)
snapshots = cdx_api.snapshots()
for snapshot in snapshots:
if len(cdx_print) == 0:
click.echo(snapshot)
else:
output_string = []
if any(val in cdx_print for val in ["urlkey", "url-key", "url_key"]):
output_string.append(snapshot.urlkey)
if any(
val in cdx_print for val in ["timestamp", "time-stamp", "time_stamp"]
):
output_string.append(snapshot.timestamp)
if "original" in cdx_print:
output_string.append(snapshot.original)
if any(val in cdx_print for val in ["mimetype", "mime-type", "mime_type"]):
output_string.append(snapshot.mimetype)
if any(
val in cdx_print for val in ["statuscode", "status-code", "status_code"]
):
output_string.append(snapshot.statuscode)
if "digest" in cdx_print:
output_string.append(snapshot.digest)
if "length" in cdx_print:
output_string.append(snapshot.length)
if any(
val in cdx_print for val in ["archiveurl", "archive-url", "archive_url"]
):
output_string.append(snapshot.archive_url)
click.echo(" ".join(output_string))
def save_urls_on_file(url_gen: Generator[str, None, None]) -> None:
"""
Save output of CDX API on file.
Mainly here because of backwards compatibility.
"""
domain = None
sys_random = random.SystemRandom()
uid = "".join(
sys_random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url_count = 0
file_name = None
for url in url_gen:
url_count += 1
if not domain:
match = re.search("https?://([A-Za-z_0-9.-]+).*", url)
domain = "domain-unknown"
if match:
domain = match.group(1)
file_name = f"{domain}-urls-{uid}.txt"
file_path = os.path.join(os.getcwd(), file_name)
if not os.path.isfile(file_path):
with open(file_path, "w+", encoding="utf-8") as file:
file.close()
with open(file_path, "a", encoding="utf-8") as file:
file.write(f"{url}\n")
click.echo(url)
if url_count > 0:
click.echo(
f"\n\n{url_count} URLs saved inside '{file_name}' in the current "
+ "working directory."
)
else:
click.echo("No known URLs found. Please try a diffrent input!")
@click.command()
@click.option(
"-u", "--url", help="URL on which Wayback machine operations are to be performed."
@ -21,10 +172,17 @@ from .wrapper import Url
"--user-agent",
"--user_agent",
default=DEFAULT_USER_AGENT,
help="User agent, default user agent is '%s' " % DEFAULT_USER_AGENT,
help=f"User agent, default value is '{DEFAULT_USER_AGENT}'.",
)
@click.option("-v", "--version", is_flag=True, default=False, help="waybackpy version.")
@click.option(
"-v", "--version", is_flag=True, default=False, help="Print waybackpy version."
"-l",
"--show-license",
"--show_license",
"--license",
is_flag=True,
default=False,
help="Show license of Waybackpy.",
)
@click.option(
"-n",
@ -34,24 +192,21 @@ from .wrapper import Url
"--archive-url",
default=False,
is_flag=True,
help="Fetch the newest archive of the specified URL",
help="Retrieve the newest archive of URL.",
)
@click.option(
"-o",
"--oldest",
default=False,
is_flag=True,
help="Fetch the oldest archive of the specified URL",
help="Retrieve the oldest archive of URL.",
)
@click.option(
"-j",
"--json",
"-N",
"--near",
default=False,
is_flag=True,
help="Spit out the JSON data for availability_api commands.",
)
@click.option(
"-N", "--near", default=False, is_flag=True, help="Archive near specified time."
help="Archive close to a specified time.",
)
@click.option("-Y", "--year", type=click.IntRange(1994, 9999), help="Year in integer.")
@click.option("-M", "--month", type=click.IntRange(1, 12), help="Month in integer.")
@ -70,7 +225,7 @@ from .wrapper import Url
"--headers",
default=False,
is_flag=True,
help="Spit out the headers data for save_api commands.",
help="Headers data of the SavePageNow API.",
)
@click.option(
"-ku",
@ -95,151 +250,178 @@ from .wrapper import Url
help="Use with '--known_urls' to save the URLs in file at current directory.",
)
@click.option(
"-c",
"--cdx",
default=False,
is_flag=True,
help="Spit out the headers data for save_api commands.",
help="Flag for using CDX API.",
)
@click.option(
"-st",
"--start-timestamp",
"--start_timestamp",
"--from",
help="Start timestamp for CDX API in yyyyMMddhhmmss format.",
)
@click.option(
"-et",
"--end-timestamp",
"--end_timestamp",
"--to",
help="End timestamp for CDX API in yyyyMMddhhmmss format.",
)
@click.option(
"-C",
"--closest",
help="Archive that are closest the timestamp passed as arguments to this "
+ "parameter.",
)
@click.option(
"-f",
"--filters",
"--cdx-filter",
"--cdx_filter",
"--filter",
multiple=True,
help="Filter on a specific field or all the CDX fields.",
)
@click.option(
"-mt",
"--match-type",
"--match_type",
help="The default behavior is to return matches for an exact URL. "
+ "However, the CDX server can also return results matching a certain prefix, "
+ "a certain host, or all sub-hosts by using the match_type",
)
@click.option(
"-st",
"--sort",
help="Choose one from default, closest or reverse. It returns sorted CDX entries "
+ "in the response.",
)
@click.option(
"-up",
"--use-pagination",
"--use_pagination",
default=False,
is_flag=True,
help="Use the pagination API of the CDX server instead of the default one.",
)
@click.option(
"-gz",
"--gzip",
help="To disable gzip compression pass false as argument to this parameter. "
+ "The default behavior is gzip compression enabled.",
)
@click.option(
"-c",
"--collapses",
"--collapse",
multiple=True,
help="Filtering or 'collapse' results based on a field, or a substring of a field.",
)
@click.option(
"-l",
"--limit",
help="Number of maximum record that CDX API is asked to return per API call, "
+ "default value is 25000 records.",
)
@click.option(
"-cp",
"--cdx-print",
"--cdx_print",
multiple=True,
help="Print only certain fields of the CDX API response, "
+ "if this parameter is not used then the plain text response of the CDX API "
+ "will be printed.",
)
def main(
url,
user_agent,
version,
newest,
oldest,
json,
near,
year,
month,
day,
hour,
minute,
save,
headers,
known_urls,
subdomain,
file,
cdx,
start_timestamp,
end_timestamp,
filters,
match_type,
gzip,
collapses,
limit,
cdx_print,
):
"""
┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
def main( # pylint: disable=no-value-for-parameter
user_agent: str,
version: bool,
show_license: bool,
newest: bool,
oldest: bool,
near: bool,
save: bool,
headers: bool,
known_urls: bool,
subdomain: bool,
file: bool,
cdx: bool,
use_pagination: bool,
cdx_filter: List[str],
collapse: List[str],
cdx_print: List[str],
url: Optional[str] = None,
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
start_timestamp: Optional[str] = None,
end_timestamp: Optional[str] = None,
closest: Optional[str] = None,
match_type: Optional[str] = None,
sort: Optional[str] = None,
gzip: Optional[str] = None,
limit: Optional[str] = None,
) -> None:
"""\b
_ _
| | | |
__ ____ _ _ _| |__ __ _ ___| | ___ __ _ _
\\ \\ /\\ / / _` | | | | '_ \\ / _` |/ __| |/ / '_ \\| | | |
\\ V V / (_| | |_| | |_) | (_| | (__| <| |_) | |_| |
\\_/\\_/ \\__,_|\\__, |_.__/ \\__,_|\\___|_|\\_\\ .__/ \\__, |
__/ | | | __/ |
|___/ |_| |___/
waybackpy : Python package & CLI tool that interfaces the Wayback Machine API
Python package & CLI tool that interfaces the Wayback Machine APIs
Released under the MIT License.
License @ https://github.com/akamhy/waybackpy/blob/master/LICENSE
Repository: https://github.com/akamhy/waybackpy
Copyright (c) 2020 waybackpy contributors. Contributors list @
https://github.com/akamhy/waybackpy/graphs/contributors
Documentation: https://github.com/akamhy/waybackpy/wiki/CLI-docs
https://github.com/akamhy/waybackpy
waybackpy - CLI usage(Demo video): https://asciinema.org/a/469890
https://pypi.org/project/waybackpy
Released under the MIT License. Use the flag --license for license.
"""
if version:
click.echo("waybackpy version %s" % __version__)
return
click.echo(f"waybackpy version {__version__}")
if not url:
click.echo("No URL detected. Please pass an URL.")
return
elif show_license:
click.echo(
requests.get(
url="https://raw.githubusercontent.com/akamhy/waybackpy/master/LICENSE"
).text
)
elif url is None:
click.echo(
click.style("NoURLDetected: ", fg="red")
+ "No URL detected. "
+ "Please provide an URL.",
err=True,
)
def echo_availability_api(availability_api_instance):
click.echo("Archive URL:")
if not availability_api_instance.archive_url:
archive_url = (
"NO ARCHIVE FOUND - The requested URL is probably "
+ "not yet archived or if the URL was recently archived then it is "
+ "not yet available via the Wayback Machine's availability API "
+ "because of database lag and should be available after some time."
)
else:
archive_url = availability_api_instance.archive_url
click.echo(archive_url)
if json:
click.echo("JSON response:")
click.echo(JSON.dumps(availability_api_instance.JSON))
elif oldest:
cdx_api = WaybackMachineCDXServerAPI(url, user_agent=user_agent)
handle_cdx_closest_derivative_methods(cdx_api, oldest, near, newest)
availability_api = WaybackMachineAvailabilityAPI(url, user_agent=user_agent)
elif newest:
cdx_api = WaybackMachineCDXServerAPI(url, user_agent=user_agent)
handle_cdx_closest_derivative_methods(cdx_api, oldest, near, newest)
if oldest:
availability_api.oldest()
echo_availability_api(availability_api)
return
if newest:
availability_api.newest()
echo_availability_api(availability_api)
return
if near:
elif near:
cdx_api = WaybackMachineCDXServerAPI(url, user_agent=user_agent)
near_args = {}
keys = ["year", "month", "day", "hour", "minute"]
args_arr = [year, month, day, hour, minute]
for key, arg in zip(keys, args_arr):
if arg:
near_args[key] = arg
availability_api.near(**near_args)
echo_availability_api(availability_api)
return
handle_cdx_closest_derivative_methods(
cdx_api, oldest, near, newest, near_args=near_args
)
if save:
elif save:
save_api = WaybackMachineSaveAPI(url, user_agent=user_agent)
save_api.save()
click.echo("Archive URL:")
@ -249,99 +431,44 @@ def main(
if headers:
click.echo("Save API headers:")
click.echo(save_api.headers)
return
def save_urls_on_file(url_gen):
domain = None
sys_random = random.SystemRandom()
uid = "".join(
sys_random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url_count = 0
for url in url_gen:
url_count += 1
if not domain:
match = re.search("https?://([A-Za-z_0-9.-]+).*", url)
domain = "domain-unknown"
if match:
domain = match.group(1)
file_name = "{domain}-urls-{uid}.txt".format(domain=domain, uid=uid)
file_path = os.path.join(os.getcwd(), file_name)
if not os.path.isfile(file_path):
open(file_path, "w+").close()
with open(file_path, "a") as f:
f.write("{url}\n".format(url=url))
click.echo(url)
if url_count > 0:
click.echo(
"\n\n'{file_name}' saved in current working directory".format(
file_name=file_name
)
)
else:
click.echo("No known URLs found. Please try a diffrent input!")
if known_urls:
elif known_urls:
wayback = Url(url, user_agent)
url_gen = wayback.known_urls(subdomain=subdomain)
if file:
return save_urls_on_file(url_gen)
save_urls_on_file(url_gen)
else:
for url in url_gen:
click.echo(url)
for url_ in url_gen:
click.echo(url_)
if cdx:
filters = list(filters)
collapses = list(collapses)
cdx_print = list(cdx_print)
cdx_api = WaybackMachineCDXServerAPI(
elif cdx:
data = [
url,
user_agent=user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
filters=filters,
match_type=match_type,
gzip=gzip,
collapses=collapses,
limit=limit,
user_agent,
start_timestamp,
end_timestamp,
cdx_filter,
collapse,
cdx_print,
limit,
gzip,
match_type,
sort,
use_pagination,
closest,
]
handle_cdx(data)
else:
click.echo(
click.style("NoCommandFound: ", fg="red")
+ "Only URL passed, but did not specify what to do with the URL. "
+ "Use --help flag for help using waybackpy.",
err=True,
)
snapshots = cdx_api.snapshots()
for snapshot in snapshots:
if len(cdx_print) == 0:
click.echo(snapshot)
else:
output_string = ""
if "urlkey" or "url-key" or "url_key" in cdx_print:
output_string = output_string + snapshot.urlkey + " "
if "timestamp" or "time-stamp" or "time_stamp" in cdx_print:
output_string = output_string + snapshot.timestamp + " "
if "original" in cdx_print:
output_string = output_string + snapshot.original + " "
if "original" in cdx_print:
output_string = output_string + snapshot.original + " "
if "mimetype" or "mime-type" or "mime_type" in cdx_print:
output_string = output_string + snapshot.mimetype + " "
if "statuscode" or "status-code" or "status_code" in cdx_print:
output_string = output_string + snapshot.statuscode + " "
if "digest" in cdx_print:
output_string = output_string + snapshot.digest + " "
if "length" in cdx_print:
output_string = output_string + snapshot.length + " "
if "archiveurl" or "archive-url" or "archive_url" in cdx_print:
output_string = output_string + snapshot.archive_url + " "
click.echo(output_string)
if __name__ == "__main__":
main()
main() # pylint: disable=no-value-for-parameter

View File

@ -8,21 +8,35 @@ This module contains the set of Waybackpy's exceptions.
class WaybackError(Exception):
"""
Raised when Waybackpy can not return what you asked for.
1) Wayback Machine API Service is unreachable/down.
2) You passed illegal arguments.
1) Wayback Machine API Service is unreachable/down.
2) You passed illegal arguments.
All other exceptions are inherited from this main exception.
"""
class RedirectSaveError(WaybackError):
class NoCDXRecordFound(WaybackError):
"""
Raised when the original URL is redirected and the
redirect URL is archived but not the original URL.
No records returned by the CDX server for a query.
Raised when the user invokes near(), newest() or oldest() methods
and there are no archives.
"""
class URLError(Exception):
class BlockedSiteError(WaybackError):
"""
Raised when malformed URLs are passed as arguments.
Raised when the archives for website/URLs that was excluded from Wayback
Machine are requested via the CDX server API.
"""
class TooManyRequestsError(WaybackError):
"""
Raised when you make more than 15 requests per
minute and the Wayback Machine returns 429.
See https://github.com/akamhy/waybackpy/issues/131
"""
@ -36,3 +50,15 @@ class MaximumSaveRetriesExceeded(MaximumRetriesExceeded):
"""
MaximumSaveRetriesExceeded
"""
class ArchiveNotInAvailabilityAPIResponse(WaybackError):
"""
Could not parse the archive in the JSON response of the availability API.
"""
class InvalidJSONInAvailabilityAPIResponse(WaybackError):
"""
availability api returned invalid JSON
"""

View File

@ -1,44 +1,84 @@
"""
This module interfaces the Wayback Machine's SavePageNow (SPN) API.
The module has WaybackMachineSaveAPI class which should be used by the users of
this module to use the SavePageNow API.
"""
import re
import time
import requests
from datetime import datetime
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from typing import Dict, Optional
import requests
from requests.adapters import HTTPAdapter
from requests.models import Response
from requests.structures import CaseInsensitiveDict
from urllib3.util.retry import Retry
from .exceptions import MaximumSaveRetriesExceeded, TooManyRequestsError, WaybackError
from .utils import DEFAULT_USER_AGENT
from .exceptions import MaximumSaveRetriesExceeded
class WaybackMachineSaveAPI:
"""
WaybackMachineSaveAPI class provides an interface for saving URLs on the
Wayback Machine.
"""
def __init__(self, url, user_agent=DEFAULT_USER_AGENT, max_tries=8):
def __init__(
self,
url: str,
user_agent: str = DEFAULT_USER_AGENT,
max_tries: int = 8,
) -> None:
self.url = str(url).strip().replace(" ", "%20")
self.request_url = "https://web.archive.org/save/" + self.url
self.user_agent = user_agent
self.request_headers = {"User-Agent": self.user_agent}
self.request_headers: Dict[str, str] = {"User-Agent": self.user_agent}
if max_tries < 1:
raise ValueError("max_tries should be positive")
self.max_tries = max_tries
self.total_save_retries = 5
self.backoff_factor = 0.5
self.status_forcelist = [500, 502, 503, 504]
self._archive_url = None
self._archive_url: Optional[str] = None
self.instance_birth_time = datetime.utcnow()
self.response: Optional[Response] = None
self.headers: Optional[CaseInsensitiveDict[str]] = None
self.status_code: Optional[int] = None
self.response_url: Optional[str] = None
self.cached_save: Optional[bool] = None
self.saved_archive: Optional[str] = None
@property
def archive_url(self):
def archive_url(self) -> str:
"""
Returns the archive URL is already cached by _archive_url
else invoke the save method to save the archive which returns the
archive thus we return the methods return value.
"""
if self._archive_url:
return self._archive_url
else:
return self.save()
def get_save_request_headers(self):
return self.save()
def get_save_request_headers(self) -> None:
"""
Creates a session and tries 'retries' number of times to
retrieve the archive.
If successful in getting the response, sets the headers, status_code
and response_url attributes.
The archive is usually in the headers but it can also be the response URL
as the Wayback Machine redirects to the archive after a successful capture
of the webpage.
Wayback Machine's save API is known
to be very unreliable thus if it fails first check opening
the response URL yourself in the browser.
"""
session = requests.Session()
retries = Retry(
total=self.total_save_retries,
@ -47,12 +87,37 @@ class WaybackMachineSaveAPI:
)
session.mount("https://", HTTPAdapter(max_retries=retries))
self.response = session.get(self.request_url, headers=self.request_headers)
# requests.response.headers is requests.structures.CaseInsensitiveDict
self.headers = self.response.headers
self.status_code = self.response.status_code
self.response_url = self.response.url
session.close()
def archive_url_parser(self):
if self.status_code == 429:
# why wait 5 minutes and 429?
# see https://github.com/akamhy/waybackpy/issues/97
raise TooManyRequestsError(
f"Can not save '{self.url}'. "
f"Save request refused by the server. "
f"Save Page Now limits saving 15 URLs per minutes. "
f"Try waiting for 5 minutes and then try again."
)
# why 509?
# see https://github.com/akamhy/waybackpy/pull/99
# also https://t.co/xww4YJ0Iwc
if self.status_code == 509:
raise WaybackError(
f"Can not save '{self.url}'. You have probably reached the "
f"limit of active sessions."
)
def archive_url_parser(self) -> Optional[str]:
"""
Three regexen (like oxen?) are used to search for the
archive URL in the headers and finally look in the response URL
for the archive URL.
"""
regex1 = r"Content-Location: (/web/[0-9]{14}/.*)"
match = re.search(regex1, str(self.headers))
if match:
@ -60,36 +125,63 @@ class WaybackMachineSaveAPI:
regex2 = r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>"
match = re.search(regex2, str(self.headers))
if match:
if match is not None and len(match.groups()) == 1:
return "https://" + match.group(1)
regex3 = r"X-Cache-Key:\shttps(.*)[A-Z]{2}"
match = re.search(regex3, str(self.headers))
if match:
return "https://" + match.group(1)
if match is not None and len(match.groups()) == 1:
return "https" + match.group(1)
if self.response_url:
self.response_url = self.response_url.strip()
if "web.archive.org/web" in self.response_url:
regex = r"web\.archive\.org/web/(?:[0-9]*?)/(?:.*)$"
match = re.search(regex, self.response_url)
if match:
return "https://" + match.group(0)
self.response_url = (
"" if self.response_url is None else self.response_url.strip()
)
regex4 = r"web\.archive\.org/web/(?:[0-9]*?)/(?:.*)$"
match = re.search(regex4, self.response_url)
if match is not None:
return "https://" + match.group(0)
def sleep(self, tries):
return None
@staticmethod
def sleep(tries: int) -> None:
"""
Ensure that the we wait some time before succesive retries so that we
don't waste the retries before the page is even captured by the Wayback
Machine crawlers also ensures that we are not putting too much load on
the Wayback Machine's save API.
If tries are multiple of 3 sleep 10 seconds else sleep 5 seconds.
"""
sleep_seconds = 5
if tries % 3 == 0:
sleep_seconds = 10
time.sleep(sleep_seconds)
def timestamp(self):
m = re.search(
r"https?://web.archive.org/web/([0-9]{14})/http", self._archive_url
)
string_timestamp = m.group(1)
timestamp = datetime.strptime(string_timestamp, "%Y%m%d%H%M%S")
def timestamp(self) -> datetime:
"""
Read the timestamp off the archive URL and convert the Wayback Machine
timestamp to datetime object.
Also check if the time on archive is URL and compare it to instance birth
time.
If time on the archive is older than the instance creation time set the
cached_save to True else set it to False. The flag can be used to check
if the Wayback Machine didn't serve a Cached URL. It is quite common for
the Wayback Machine to serve cached archive if last archive was captured
before last 45 minutes.
"""
regex = r"https?://web\.archive.org/web/([0-9]{14})/http"
match = re.search(regex, str(self._archive_url))
if match is None or len(match.groups()) != 1:
raise ValueError(
f"Can not parse timestamp from archive URL, '{self._archive_url}'."
)
string_timestamp = match.group(1)
timestamp = datetime.strptime(string_timestamp, "%Y%m%d%H%M%S")
timestamp_unixtime = time.mktime(timestamp.timetuple())
instance_birth_time_unixtime = time.mktime(self.instance_birth_time.timetuple())
@ -100,32 +192,34 @@ class WaybackMachineSaveAPI:
return timestamp
def save(self):
def save(self) -> str:
"""
Calls the SavePageNow API of the Wayback Machine with required parameters
and headers to save the URL.
saved_archive = None
Raises MaximumSaveRetriesExceeded is maximum retries are exhausted but still
we were unable to retrieve the archive from the Wayback Machine.
"""
self.saved_archive = None
tries = 0
while True:
if tries >= 1:
self.sleep(tries)
self.get_save_request_headers()
self.saved_archive = self.archive_url_parser()
if isinstance(self.saved_archive, str):
self._archive_url = self.saved_archive
self.timestamp()
return self.saved_archive
tries += 1
if tries >= self.max_tries:
raise MaximumSaveRetriesExceeded(
"Tried %s times but failed to save and return the archive for %s.\nResponse URL:\n%s \nResponse Header:\n%s\n"
% (str(tries), self.url, self.response_url, str(self.headers)),
f"Tried {tries} times but failed to save "
f"and retrieve the archive for {self.url}.\n"
f"Response URL:\n{self.response_url}\n"
f"Response Header:\n{self.headers}"
)
if not saved_archive:
if tries > 1:
self.sleep(tries)
self.get_save_request_headers()
saved_archive = self.archive_url_parser()
if not saved_archive:
continue
else:
self._archive_url = saved_archive
self.timestamp()
return saved_archive

View File

@ -1,11 +1,29 @@
import requests
from .__version__ import __version__
"""
Utility functions and shared variables like DEFAULT_USER_AGENT are here.
"""
DEFAULT_USER_AGENT = "waybackpy %s - https://github.com/akamhy/waybackpy" % __version__
from datetime import datetime
from . import __version__
DEFAULT_USER_AGENT: str = (
f"waybackpy {__version__} - https://github.com/akamhy/waybackpy"
)
def latest_version(package_name, headers):
request_url = "https://pypi.org/pypi/" + package_name + "/json"
response = requests.get(request_url, headers=headers)
data = response.json()
return data["info"]["version"]
def unix_timestamp_to_wayback_timestamp(unix_timestamp: int) -> str:
"""
Converts Unix time to Wayback Machine timestamp, Wayback Machine
timestamp format is yyyyMMddhhmmss.
"""
return datetime.utcfromtimestamp(int(unix_timestamp)).strftime("%Y%m%d%H%M%S")
def wayback_timestamp(**kwargs: int) -> str:
"""
Prepends zero before the year, month, day, hour and minute so that they
are conformable with the YYYYMMDDhhmmss Wayback Machine timestamp format.
"""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)

View File

@ -1,39 +1,73 @@
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI
from .utils import DEFAULT_USER_AGENT
from .exceptions import WaybackError
"""
This module exists because backwards compatibility matters.
Don't touch this or add any new functionality here and don't use
the Url class.
"""
from datetime import datetime, timedelta
from typing import Generator, Optional
from requests.structures import CaseInsensitiveDict
from .availability_api import ResponseJSON, WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .utils import DEFAULT_USER_AGENT
class Url:
def __init__(self, url, user_agent=DEFAULT_USER_AGENT):
"""
The Url class is not recommended to be used anymore, instead use:
- WaybackMachineSaveAPI
- WaybackMachineAvailabilityAPI
- WaybackMachineCDXServerAPI
The reason it is still in the code is backwards compatibility with 2.x.x
versions.
If were are using the Url before the update to version 3.x.x, your code should
still be working fine and there is no hurry to update the interface but is
recommended that you do not use the Url class for new code as it would be
removed after 2025 also the first 3.x.x versions was released in January 2022
and three years are more than enough to update the older interface code.
"""
def __init__(self, url: str, user_agent: str = DEFAULT_USER_AGENT) -> None:
self.url = url
self.user_agent = str(user_agent)
self.archive_url = None
self.archive_url: Optional[str] = None
self.timestamp: Optional[datetime] = None
self.wayback_machine_availability_api = WaybackMachineAvailabilityAPI(
self.url, user_agent=self.user_agent
)
self.wayback_machine_save_api: Optional[WaybackMachineSaveAPI] = None
self.headers: Optional[CaseInsensitiveDict[str]] = None
self.json: Optional[ResponseJSON] = None
def __str__(self):
def __str__(self) -> str:
if not self.archive_url:
self.newest()
return self.archive_url
return str(self.archive_url)
def __len__(self):
def __len__(self) -> int:
td_max = timedelta(
days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)
if not self.timestamp:
if not isinstance(self.timestamp, datetime):
self.oldest()
if not isinstance(self.timestamp, datetime):
raise TypeError("timestamp must be a datetime")
if self.timestamp == datetime.max:
return td_max.days
return (datetime.utcnow() - self.timestamp).days
def save(self):
def save(self) -> "Url":
"""Save the URL on wayback machine."""
self.wayback_machine_save_api = WaybackMachineSaveAPI(
self.url, user_agent=self.user_agent
)
@ -44,14 +78,14 @@ class Url:
def near(
self,
year=None,
month=None,
day=None,
hour=None,
minute=None,
unix_timestamp=None,
):
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
unix_timestamp: Optional[int] = None,
) -> "Url":
"""Returns the archive of the URL close to a date and time."""
self.wayback_machine_availability_api.near(
year=year,
month=month,
@ -63,22 +97,32 @@ class Url:
self.set_availability_api_attrs()
return self
def oldest(self):
def oldest(self) -> "Url":
"""Returns the oldest archive of the URL."""
self.wayback_machine_availability_api.oldest()
self.set_availability_api_attrs()
return self
def newest(self):
def newest(self) -> "Url":
"""Returns the newest archive of the URL."""
self.wayback_machine_availability_api.newest()
self.set_availability_api_attrs()
return self
def set_availability_api_attrs(self):
def set_availability_api_attrs(self) -> None:
"""Set the attributes for total backwards compatibility."""
self.archive_url = self.wayback_machine_availability_api.archive_url
self.JSON = self.wayback_machine_availability_api.JSON
self.json = self.wayback_machine_availability_api.json
self.JSON = self.json # for backwards compatibility, do not remove it.
self.timestamp = self.wayback_machine_availability_api.timestamp()
def total_archives(self, start_timestamp=None, end_timestamp=None):
def total_archives(
self, start_timestamp: Optional[str] = None, end_timestamp: Optional[str] = None
) -> int:
"""
Returns an integer which indicates total number of archives for an URL.
Useless in my opinion, only here because of backwards compatibility.
"""
cdx = WaybackMachineCDXServerAPI(
self.url,
user_agent=self.user_agent,
@ -93,12 +137,13 @@ class Url:
def known_urls(
self,
subdomain=False,
host=False,
start_timestamp=None,
end_timestamp=None,
match_type="prefix",
):
subdomain: bool = False,
host: bool = False,
start_timestamp: Optional[str] = None,
end_timestamp: Optional[str] = None,
match_type: str = "prefix",
) -> Generator[str, None, None]:
"""Yields known URLs for any URL."""
if subdomain:
match_type = "domain"
if host:
@ -114,4 +159,4 @@ class Url:
)
for snapshot in cdx.snapshots():
yield (snapshot.original)
yield snapshot.original