Compare commits

...

204 Commits
2.3.0 ... 3.0.6

Author SHA1 Message Date
5407681c34 v3.0.6 (#170)
* remove the license section from readme

This does not mean that I'm waiving the copyright; I'm just reformatting the README

* remove useless external links from the README lead

and also added a line about how recent the newest method's results are, comparing the availability and CDX server APIs.

* incr version to 3.0.6 and change date to today's date, that is 15th of March, 2022.

* update secsi and DI section

* v3.0.5 --> v3.0.6
2022-03-15 20:33:51 +05:30
cfd977135d Update CITATION.cff (#169) 2022-03-04 11:48:49 +05:30
7a5e0bfdaf fix: cff format (#168) 2022-03-04 03:10:27 +05:30
48dcda8020 add: typed marker (PEP561) (#167) 2022-03-03 19:05:43 +05:30
3ed2170a32 add: CITATION.cff (#166) 2022-03-03 19:05:23 +05:30
d6ef55020c undo drop python3.6, see #162 (#163) 2022-02-18 21:54:33 +05:30
2650943f9d v3.0.4 (#160)
* Update README.md

* Update README.md

* update asciinema link

* v3.0.4

* update video link
2022-02-18 16:05:58 +05:30
4b218d35cb Cdx based oldest newest and near (#159)
* implement oldest, newest and near methods in the CDX interface class; the CLI now uses the CDX methods instead of the availability API methods.

* handle the closest parameter derivative methods more efficiently and also handle exceptions gracefully.

* update test code
2022-02-18 13:17:40 +05:30
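The CDX-backed lookups above can be pictured as choosing query parameters for the CDX server. A hypothetical sketch of that mapping (the helper name and exact parameter choices are illustrative, not waybackpy's actual code):

```python
def cdx_params_for(method, timestamp=None):
    """Map oldest/newest/near onto CDX server query parameters."""
    if method == "oldest":
        return {"limit": 1}   # first snapshot in default (ascending) order
    if method == "newest":
        return {"limit": -1}  # negative limit: take from the end of the results
    if method == "near":
        # sort=closest orders results by distance from the given timestamp
        return {"sort": "closest", "closest": timestamp, "limit": 1}
    raise ValueError(f"unknown method: {method!r}")
```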
f990b93f8a Add sort, use_pagination and closest (#158)
* add sort param support in CDX API class

see https://nla.github.io/outbackcdx/api.html#operation/query

sort takes a string input which must be one of the following:
- default
- closest
- reverse

This commit helps close the issue at https://github.com/akamhy/waybackpy/issues/155

* add BlockedSiteError for cases when archiving is blocked by the site's robots.txt

* create check_for_blocked_site to handle the BlockedSiteError for sites that block the Wayback Machine via their robots.txt policy

* add attrs use_pagination and closest, which can be used to use the pagination API and to look up the archive closest to a timestamp, respectively. And to get out of the infinite blank-pages loop, now check for two successive blank pages instead of two blank pages in total while using the CDX server API.

* added cli support for sort, use-pagination and closest

* added tests

* fix codeql warnings, nothing to worry about here.

* fix save test for archive_url
2022-02-18 00:24:14 +05:30
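Since sort must be one of default, closest, or reverse, a simple validation helper might look like this (an illustrative sketch; waybackpy's actual check may be named and structured differently):

```python
# Valid values per the CDX server docs linked in the commit message above.
VALID_SORTS = ("default", "closest", "reverse")

def check_sort(sort):
    """Raise ValueError if sort is not a value the CDX server accepts."""
    if sort not in VALID_SORTS:
        raise ValueError(f"sort must be one of {VALID_SORTS}, got {sort!r}")
    return sort
```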
3a44a710d3 add sort param support in CDX API class (#156)
see https://nla.github.io/outbackcdx/api.html#operation/query

sort takes a string input which must be one of the following:
- default
- closest
- reverse

This commit helps close the issue at https://github.com/akamhy/waybackpy/issues/155
2022-02-17 12:17:23 +05:30
f63c6adf79 Trigger Build 2022-02-09 17:29:19 +05:30
b4d3393ef1 fix: move metadata from __init__.py into setup.cfg (#153) 2022-02-09 17:20:23 +05:30
cd5c3c61a5 fix imports with isort 2022-02-09 16:18:25 +05:30
87fb5ecd58 remove latest version funcs from utils, they were unused. 2022-02-09 16:12:30 +05:30
5954fcc646 format with black 2022-02-09 15:51:11 +05:30
89016d433c added trove Typing :: Typed and Development Status :: 5 - Production/Stable 2022-02-09 15:47:38 +05:30
edaa1d5d54 update value to the new limit. 2022-02-09 15:40:38 +05:30
16f94db144 incr version to v3.0.3 2022-02-09 14:33:16 +05:30
25eb709ade improve doc strings and comments and remove useless exceptions. 2022-02-09 14:32:15 +05:30
6d233f24fc apply isort 2022-02-09 11:20:59 +05:30
ec341fa8b3 refactor code in cli module 2022-02-09 11:20:10 +05:30
cf18090f90 fix typo 2022-02-09 09:52:20 +05:30
81162eebd0 issues with HN 2022-02-08 21:28:25 +05:30
ca4f79a2e3 + jfinkhaeuser and rafael (#150) 2022-02-08 20:34:34 +05:30
27f2727049 add cli alias for --start-timestamp(--from) and --end-timestamp(--to) to conform with the CDX API docs. 2022-02-08 20:12:19 +05:30
118dc6c523 add test for wrapper module 2022-02-08 20:08:44 +05:30
1216ffbc70 lint and refactor cli module 2022-02-08 20:06:17 +05:30
d58a5f0ee5 explicitly exclude some dirs from flake8 check 2022-02-08 18:59:13 +05:30
7e7412d9d1 remove deepsource, LGTM is better and has fewer false positives. 2022-02-08 18:49:44 +05:30
70c38c5a60 + codecov badge 2022-02-08 17:49:05 +05:30
f8bf9c16f9 Add tests (#149)
* enable codecov

* fix save_urls_on_file

* increase the limit of CDX to 25000 from 5000. 5X increase.

* added test for the CLI module

* make flake 8 happy

* make mypy happy
2022-02-08 17:46:59 +05:30
2bbfee7b2f replace non-ASCII emojis with GitHub hosted equivalent images (#148) 2022-02-08 11:43:32 +05:30
7317bd7183 Remove blank lines after docstring (#146)
Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
2022-02-08 10:14:20 +05:30
e0dfbe0b7d Fix comparison constant position (#145)
* Fix comparison constant position

* format with black

Co-authored-by: deepsource-autofix[bot] <62050782+deepsource-autofix[bot]@users.noreply.github.com>
Co-authored-by: Akash Mahanty <akamhy@yahoo.com>
2022-02-08 10:06:23 +05:30
0b631592ea Improve pylint score (#142)
* fix: errors to improve pylint scores

* fix: test

* fix

* add: flake ignore rule to pip8speaks conf

* fix

* add: test patterns to deepsource conf
2022-02-08 06:42:20 +09:00
d3a8f343f8 + [eggplants](https://github.com/eggplants) (#143) 2022-02-08 01:41:10 +05:30
97f8b96411 added docstrings, added some static type hints and also lint. (#141)
* added docstrings, added some static type hints and also lint.

* added doc strings and changed some internal variable names for more clarity.

* make flake8 happy

* add descriptive docstrings and type hints in waybackpy/cdx_snapshot.py

* remove useless code and add docstrings and also lint using pylint.

* remove unwarranted test

* added docstrings, lint using pylint and add a raise on 509 SC

* added docstrings and lint with pylint

* lint

* add doc strings and lint

* add docstrings and lint
2022-02-07 19:40:37 +05:30
004ff26196 Add .deepsource.toml 2022-02-07 12:55:57 +00:00
a772c22431 explicitly tell pep8speaks that the max line length (mll) is 88. 2022-02-06 21:00:15 +05:30
b79f1c471e Merge pull request #135 from eggplants/fix_cli
Fix cli.py
2022-02-05 16:54:36 +05:30
f49d67a411 Merge pull request #136 from eggplants/429_error
Add TooManyRequestsError
2022-02-05 11:28:27 +05:30
ad8bd25633 added badge of codacy (#139) 2022-02-05 10:05:17 +05:30
d2a3946425 fix: escape banner 2022-02-05 10:12:27 +09:00
7b6401d59b fix: delete useless conds 2022-02-05 06:20:03 +09:00
ed6160c54f add: TooManyRequestsError 2022-02-05 06:19:02 +09:00
fcab19a40a fix: cli
print error message to stderr and specify defaults of url
2022-02-05 05:55:04 +09:00
5f3cd28046 Fix Pylint errors were pointed out by codacy (#133)
* fix: pylint errors were pointed out by codacy

* fix: line length

* fix: help text

* fix: revert

https://stackoverflow.com/a/64477857 makes cli unusable

* fix: cli error and refactor codes
2022-02-05 05:25:40 +09:00
9d9cc3328b add .pep8speaks.yml, override default 2022-02-05 00:53:38 +05:30
b69e4dff37 rename params of main in cli.py to avoid using built-ins (#132)
* rename params of main in cli.py to avoid using built-ins

* Fix Line 32:80: E501 line too long (102 > 79 characters)
2022-02-05 00:30:35 +05:30
d8cabdfdb5 Typing (#128)
* fix: CI yml name

* add: mypy configuraion

* add: type annotation to waybackpy modules

* add: type annotation to test modules

* fix: mypy command

* add: types-requests to dev deps

* fix: disable max-line-length

* fix: move pytest.ini into setup.cfg

* add: urllib3 to deps

* fix: Retry (ref: https://github.com/python/typeshed/issues/6893)

* fix: f-string

* fix: shorten long lines

* add: staticmethod decorator to no-self-use methods

* fix: str(headers)->headers_str

* fix: error message

* fix: revert "str(headers)->headers_str" and ignore assignment CaseInsensitiveDict with str

* fix: mypy error
2022-02-05 03:23:36 +09:00
320ef30371 fix: format md and yml (#129) 2022-02-04 22:31:46 +05:30
e61447effd Format and lint codes and fix packaging (#125)
* add: configure files (setup.py->setup.py+setup.cfg+pyproject.toml)

* add: __download_url__

* format with black and isort

* fix: flake8 section in setup.cfg

* add: E501 to flake ignore

* fix: metadata.name does not accept attr

* fix: merge __version__.py into __init__.py

* fix: flake8 errors in tests/

* fix: datetime.datetime -> datetime

* fix: banner

* fix: ignore W605 for banner

* fix: way to install deps in CI

* add: versem to setuptools

* fix: drop python<=3.6 (#126) from package and CI
2022-02-03 19:13:39 +05:30
947647f2e7 Merge pull request #124 from eggplants/fix_save_retry
Fix save retry mechanism
2022-02-03 18:01:51 +05:30
bc1dc4dc96 fix: save retry mechanism 2022-02-03 19:45:16 +09:00
5cbdfc040b waybackpy/cli.py : remove duplicate original_string from output_string in cdx 2022-01-30 21:02:25 +05:30
3be6ac01fc created tests/test_cdx_api.py: added tests for cdx_api.py 2022-01-30 20:03:40 +05:30
b8b9bc098f tests/test_utils.py: test latest_version_pypi and latest_version_github of waybackpy.utils 2022-01-30 20:02:17 +05:30
946c28eddf waybackpy/cli.py: Added help text, fix bug in the cdx_print parameter and lots of other stuff
parameter --filters is now --filter

parameter --collapses is now --collapse

added a new --license flag for fetching the license from GitHub repo and printing it.
2022-01-30 20:00:50 +05:30
004027f73b waybackpy/utils.py : Add a new function(latest_version_github) to fetch the latest release from github api and renamed latest_version to latest_version_pypi as now we have two functions to get the latest release. 2022-01-30 13:28:13 +05:30
e86dd93b29 Delete custom.md 2022-01-30 11:45:51 +05:30
988568e8f0 Update issue templates 2022-01-30 11:44:30 +05:30
f4c32a44fd Merge pull request #123 from akamhy/add-code-of-conduct-1
Create CODE_OF_CONDUCT.md
2022-01-30 11:39:22 +05:30
7755e6391c Create CODE_OF_CONDUCT.md 2022-01-30 11:39:11 +05:30
9dbe3b3bf4 In waybackpy/wrapper.py set self.timestamp to None on init.
In the older interface (2.x.x) we had timestamp set to None in the constructor, so it is best to set it to None in the backwards-compatibility module as well.
2022-01-29 22:12:02 +05:30
e84ba9f2c3 Merge pull request #122 from akamhy/update-readme
add conda install and related links and tell users that they can copy…
2022-01-27 00:25:49 +05:30
1250d105b4 update install command for conda and replace the link to conda-forge.org with https://anaconda.org/conda-forge/waybackpy 2022-01-27 00:17:36 +05:30
f03b2cb6cb fix formatting of ASCII art 2022-01-26 18:24:24 +05:30
5e0ea023e6 update CLI help text 2022-01-26 16:23:24 +05:30
8dff6f349e add maintainers 2022-01-26 15:45:03 +05:30
e04cfdfeaf add conda install and related links and tell users that they can copy text from asciinema.org 2022-01-26 15:40:33 +05:30
0e2cc8f5ba + asciicast https://asciinema.org/a/464367
[![asciicast](https://asciinema.org/a/464367.svg)](https://asciinema.org/a/464367)
2022-01-26 14:51:06 +05:30
9007149fef 3.0.1 -- > 3.0.2, for condaforge staged-recipes issues 2022-01-26 01:54:58 +05:30
8b7603e241 the test is faulty as it fails when we increment the version in the dunder version file but have not yet uploaded the code to PyPI. 2022-01-26 01:51:24 +05:30
5ea1d3ba4f Replace NON-ASCII character figlet with ASCII character figlet. 2022-01-26 01:46:42 +05:30
4408c5e2ca add snapcraft.yaml 2022-01-25 20:54:09 +05:30
9afe29a819 Merge pull request #119 from akamhy/akamhy-patch-1
v3.0.0 --> v3.0.1
2022-01-25 19:54:01 +05:30
d79b10c74c v3.0.0 --> v3.0.1 2022-01-25 19:52:10 +05:30
32314dc102 Merge branch 'build-test' #118
Add build test to CI
 see #117
2022-01-25 14:02:36 +05:30
50e176e2ba .github/workflows/build_test.yml : change python versions from '3.4', '3.8', '3.10' to '3.6', '3.10' as 3.4 not found by GitHub. 2022-01-25 13:56:49 +05:30
4007859c92 Install dependencies for build test in CI : setuptools wheel 2022-01-25 13:35:58 +05:30
d8bd6c628d Add build test to CI 2022-01-25 13:30:16 +05:30
28f6ff8df2 Merge pull request #116 from akamhy/patch-setup-py
Fix syntax for opening the README.md and __version__.py
2022-01-25 13:11:33 +05:30
7ac9353f74 Fix syntax for opening the README.md and __version__.py
For some reason updates made at https://github.com/akamhy/waybackpy/pull/114
are breaking the build using setup, caught while deploying to a cloud service
provider.

The exact error is:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-req-build-n3b9e5pj/setup.py", line 5
  os.path.join(os.path.dirname(__file__), README.md), encoding=utf-8),
                                                                                ^
SyntaxError: invalid syntax
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

See also :
https://github.com/conda-forge/staged-recipes/pull/17634
2022-01-25 13:05:01 +05:30
15c7244a22 Merge pull request #115 from akamhy/akamhy-patch-1
do not use f-strings in setup.py
2022-01-25 10:42:27 +05:30
8510210e94 do not use f-strings in setup.py
These are not supported in <Python 3.6 version of the cpython.
2022-01-25 10:34:46 +05:30
552967487e Merge pull request #114 from rafaelrdealmeida/patch-1
Update setup.py

See also <https://github.com/akamhy/waybackpy/issues/111#issuecomment-1020673814>
2022-01-25 10:30:34 +05:30
86a90a3840 Update setup.py
pep8
2022-01-24 22:03:28 -03:00
759874cdc6 Update setup.py
see: https://github.com/akamhy/waybackpy/issues/111#issuecomment-1020673814
2022-01-24 21:23:31 -03:00
06095202fe BUG FIX : forgot to use the endpoint from the instance and also to assign payload to param. Bug caught by flake8 in the CI tests. 2022-01-24 23:35:48 +05:30
06fc7855bf waybackpy/cdx_api.py : the default user agent is now DEFAULT_USER_AGENT; get_response now takes url and headers as arguments, and the request URL is generated by the full_url function. Added max_tries as a parameter to the WaybackMachineCDXServerAPI class with a default value of 3. 2022-01-24 23:20:49 +05:30
c49fe971fd update the older deprecation notice for the Url class; the new date is now 2025 instead of 2024. 2022-01-24 23:15:59 +05:30
d6783d5525 added tests for cdx_utils.py 2022-01-24 23:05:47 +05:30
9262f5da21 improve functions get_total_pages and get_response; lint check_filters, check_collapses and check_match_type
get_total_pages : the default user agent is now DEFAULT_USER_AGENT,
                  and instead of string formatting, the payload is now
                  passed as a param to full_url to generate the request URL;
                  get_response also makes the request instead of directly
                  using requests.get()

get_response : get_response no longer takes params as keyword arguments;
               the invoker is expected to pass the full URL, which may be
               generated by the full_url function, so return_full_url=False
               is deprecated as well.
               Also now closing the session via session.close()
               No need to check for 'Exceeded 30 redirects' as the save API
               uses a different method.

check_filters : not assigning the return of match groups to variables,
                because we won't be using them and the linter flags these
                unused assignments.

check_collapses : same reason as for check_filters, but also removed a
                  pointless test that checked equality of objects that are
                  guaranteed to be the same.

check_match_type : updated the WaybackError message text
2022-01-24 22:57:20 +05:30
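The full_url helper referenced in the commits above presumably appends URL-encoded parameters to an endpoint. A minimal sketch of that behavior (assumed, not the exact waybackpy implementation):

```python
from urllib.parse import urlencode

def full_url(endpoint, params):
    """Build a request URL from an endpoint and a payload dict."""
    if not params:
        return endpoint
    # urlencode handles percent-escaping of the parameter values
    return endpoint + "?" + urlencode(params)
```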
d1a1cf2546 added tests for utils.py at tests/test_utils.py; also changed a keyword argument from headers to user_agent for latest_version of utils.py, with the usual default value. 2022-01-24 17:50:36 +05:30
cd8a32ed1f added tests for cdx_snapshot.py at tests/test_cdx_snapshot.py 2022-01-24 16:29:44 +05:30
57512c65ff change test oldest method from google.com to example.com; the oldest archive of google.com is, for some unknown reason, not very stable. 2022-01-24 16:27:35 +05:30
d9ea26e11c added code style black badge 2022-01-24 13:46:31 +05:30
2bea92b348 fix bug with the third matching case of the archive_url_parser, caught while writing more tests for the save API interface. 2022-01-24 13:31:30 +05:30
d506685f68 added some tests for save_api interface 2022-01-23 18:35:54 +05:30
7844d15d99 close the session in save api interface 2022-01-23 18:34:06 +05:30
c0252edff2 updated tests for availability_api.py and also added max_tries (default value is 3) with a delay (sleep) between successive API calls. The delay actually improves the performance of the availability_api interface. 2022-01-23 15:05:10 +05:30
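A retry loop with a fixed delay between attempts, as described above, can be sketched like this (a generic pattern; the function names are illustrative, not waybackpy's):

```python
import time

def with_retries(fn, max_tries=3, delay=1):
    """Call fn up to max_tries times, sleeping `delay` seconds between tries."""
    for attempt in range(max_tries):
        try:
            return fn()
        except Exception:
            if attempt == max_tries - 1:
                raise  # out of retries, surface the last error
            time.sleep(delay)
```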
e7488f3a3e added test badge, rename test to Tests from ubuntu and fix the Incomplete URL substring sanitization(or trying to) 2022-01-23 02:26:53 +05:30
aed75ad1db Make modules importable as part of a Python package, waybackpy, by creating an __init__.py file in tests 2022-01-23 02:14:38 +05:30
d740959c34 more dev reqs 2022-01-23 02:10:12 +05:30
2d83043ef7 + flake8 in requirements-dev.txt 2022-01-23 02:05:08 +05:30
31b1056217 fix typo in CI 2022-01-23 02:03:30 +05:30
97712b2c1e add CI unit_test.yml 2022-01-23 02:00:15 +05:30
a8acc4c4d8 Fix Incomplete URL substring sanitization in the last commit. 2022-01-23 01:42:48 +05:30
1bacd73002 created pytest.ini, added test for waybackpy/availability_api.py, new exceptions all of which inherit from the main WaybackError and created requirements-dev.txt 2022-01-23 01:29:07 +05:30
79901ba968 updated README.md 2022-01-22 03:08:26 +05:30
df64e839d7 added trove classifiers for python 3.10 2022-01-22 00:57:10 +05:30
405e9a2a79 waybackpy/save_api.py : Added doc strings and also lint with black. 2022-01-22 00:41:10 +05:30
db551abbf6 lint waybackpy/cdx_api.py and added some doc strings 2022-01-22 00:11:35 +05:30
d13dd4db1a added notice on waybackpy/wrapper.py that the Url class will cease to exist after 2024-01-01 and also removed unused imports. 2022-01-21 23:14:20 +05:30
d3bb8337a1 make setup.py smarter, now no need to update the URL again and also added more keywords. And in __version__.py updated the __author__ 2022-01-21 23:01:09 +05:30
fd5e85420c waybackpy/availability_api.py : removed unused imports, added doc strings, removed redundant function. 2022-01-21 22:47:44 +05:30
5c685ef5d7 upload logo and make p path not text
I was dumb to forget to convert the p to path.
2022-01-21 21:11:42 +05:30
6a3d96b453 Logo (#113)
* Create logo.txt

* Delete waybackpy_logo.svg

* Add files via upload

* Delete logo.txt
2022-01-21 21:02:38 +05:30
afe1b15a5f Add files via upload 2022-01-21 20:58:53 +05:30
4fd9d142e7 Merge pull request #112 from akamhy/fix
escape '.' before 'archive.org'
2022-01-21 19:52:55 +05:30
5e9fdb40ce escape '.' before 'archive.org'
escape '.' before 'archive.org' on line 88 so it does not match more hosts than expected.
2022-01-21 19:51:08 +05:30
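The fix matters because an unescaped dot in a regex matches any character, so a pattern meant for archive.org hosts can also match unrelated strings:

```python
import re

loose = re.compile(r"web.archive.org")     # '.' matches ANY character
strict = re.compile(r"web\.archive\.org")  # '\.' matches a literal dot only

# Both match the real host...
assert loose.search("https://web.archive.org/web/")
assert strict.search("https://web.archive.org/web/")
# ...but only the unescaped pattern also matches a lookalike.
assert loose.search("web-archive-org")
assert strict.search("web-archive-org") is None
```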
fa72098270 _get_response is not used anymore
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response was based on this amazing answer.
2022-01-21 19:43:35 +05:30
d18f955044 date year range 2020-2022 2022-01-21 11:55:42 +05:30
9c340d6967 Create codeql-analysis.yml 2022-01-21 11:12:59 +05:30
78d0e0c126 Update README.md 2022-01-21 09:54:04 +05:30
564101e6f5 🐳 for docker image 2022-01-21 01:23:05 +05:30
de5a3e1561 improve usage code 2022-01-18 21:18:17 +05:30
52e46fecc2 more usage example 2022-01-18 20:58:39 +05:30
3b6415abc7 updating examples 2022-01-18 20:44:47 +05:30
66e16d6d89 define __repr__ for the Availability API class 2022-01-18 20:34:21 +05:30
16b9bdd7f9 output the file name if known_url and file flag are passed. 2022-01-18 20:14:44 +05:30
7adc01bff2 implement known_urls for cli from the newer interface. Although use of CDX is recommended, backward compatibility matters. 2022-01-18 20:07:12 +05:30
9bbd056268 Update README.md 2022-01-17 02:15:38 +05:30
2ab44391cf close #107, added link to SecSI/Docker image 2022-01-16 23:01:31 +05:30
cc3628ae18 define __str__ for objects of the WaybackMachineAvailabilityAPI class; the check for self.JSON ensures that the API was at least called. 2022-01-16 22:28:12 +05:30
1d751b942b invoke json; removing it in the earlier commit was a bad idea, as the end user should not have to call it 2022-01-16 22:15:25 +05:30
261a867a21 near() method of WaybackMachineAvailabilityAPI return self to preserve past behaviour 2022-01-16 21:53:54 +05:30
2e487e88d3 define __len__ on Url objects, if any method not used prior to len op then default to len of oldest archive. 2022-01-16 21:29:43 +05:30
c8d0ad493a defined __str__ for Url objects, print func should print the url. 2022-01-16 21:22:43 +05:30
ce869177fd Merge pull request #103 from akamhy/whitesource/configure
Configure WhiteSource Bolt for GitHub
2022-01-02 16:04:15 +05:30
58616fb986 Add .whitesource configuration file 2022-01-02 08:45:07 +00:00
4e68cd5743 Create separate modules for the 3 different APIs; CDX is now also supported by the CLI. 2022-01-02 14:14:45 +05:30
a7b805292d changes made for v2.4.4 (update download_url) (#100)
* v2.4.4 (update download_url)

* v2.4.4 (update __version__)

* +1

add jonasjancarik
2021-09-03 11:28:26 +05:30
6dc6124dc4 Raise error on a 509 response (too many sessions) (#99)
* Raise error on a 509 response (too many sessions)

When the response code is 509, raise an error with an explanation (based on the actual error message contained in the response HTML).

* Raise error on a 509 response (too many sessions) - linting
2021-09-03 08:04:36 +05:30
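The 509 handling described above can be sketched as follows (the exception name and helper are hypothetical, chosen here to avoid confusion with waybackpy's real TooManyRequestsError, which covers HTTP 429):

```python
class TooManySessionsError(Exception):
    """Raised when the Save API answers HTTP 509: too many sessions."""

def raise_on_509(status_code, response_html=""):
    """Raise with the explanation embedded in the response HTML, if any."""
    if status_code == 509:
        raise TooManySessionsError(
            response_html or "Too many sessions, try again later."
        )
    return status_code
```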
5a7fc7d568 Fix typo (#95) 2021-04-13 16:58:34 +05:30
5a9c861cad v2.4.3 (#94)
* 2.4.3

* 2.4.3
2021-04-02 10:41:59 +05:30
dd1917c77e added RedirectSaveError - for failed saves if the URL is a permanent … (#93)
* added RedirectSaveError - for failed saves if the URL is a permanent redirect.

* check if url is redirect before throwing exceptions, res.url is the redirect url if redirected at all

* update tests and cli errors
2021-04-02 10:38:17 +05:30
db8f902cff Add doc strings (#90)
* Added some docstrings in utils.py

* renamed some func/meth to better names and added doc strings + lint

* added more docstrings

* more docstrings

* improve docstrings

* docstrings

* added more docstrings, lint

* fix import error
2021-01-26 11:56:03 +05:30
88cda94c0b v2.4.2 (#89)
* v2.4.2

* v2.4.2
2021-01-24 17:03:35 +05:30
09290f88d1 fix one more error 2021-01-24 16:58:53 +05:30
e5835091c9 import re 2021-01-24 16:56:59 +05:30
7312ed1f4f set cached_save to True if archive older than 3 mins. 2021-01-24 16:53:36 +05:30
6ae8f843d3 add --file to --known_urls 2021-01-24 16:15:11 +05:30
36b936820b known urls now yields, more reliable. And saves the file in chunks with respect to the response. The --file arg can be used to create an output file; if --file is not used, no output will be saved to any file. (#88) 2021-01-24 16:11:39 +05:30
a3bc6aad2b too much API usage by duplicate tests was causing too much tests failure 2021-01-23 21:08:21 +05:30
edc2f63d93 Output valid JSON, dumps python dict. Make JSON valid. 2021-01-23 20:43:52 +05:30
ffe0810b12 flag to check if the archive saved is 30 mins older or not 2021-01-16 12:06:08 +05:30
40233eb115 improve code quality, remove unused imports, use system randomness etc 2021-01-16 11:35:13 +05:30
d549d31421 improve save method, now we know that 302 errors indicate that the Wayback Machine is archiving the URL and hasn't finished yet. We construct an artificial archive with the current UTC time and check for HTTP status code 20* or 30*. If we verify the archival, we return the artificial archive. The artificial archive will automatically point to the new archive or, in the best case, will be the new archive after some time. 2021-01-16 10:47:43 +05:30
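Constructing that artificial archive URL from the current UTC time might look like this (a sketch of the idea only; the real code also verifies that the URL returns a 20*/30* status before returning it):

```python
from datetime import datetime, timezone

def artificial_archive_url(url):
    """Build a Wayback Machine URL stamped with the current UTC time."""
    # Wayback timestamps are 14 digits: YYYYMMDDhhmmss
    ts = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"https://web.archive.org/web/{ts}/{url}"
```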
0725163af8 mimify the logo, remove ugly old logos 2021-01-15 18:14:48 +05:30
712471176b better error messages(str), check latest version before asking for an upgrade and rm alive checking 2021-01-15 16:47:26 +05:30
dcd7b03302 getting rid of c style str formatting, now using .format 2021-01-14 19:30:07 +05:30
76205d9cf6 backoff_factor=2 for save, incr success by 25% 2021-01-13 10:13:16 +05:30
ec0a0d04cc + dequeued0
dequeued0 (https://github.com/dequeued0) for reporting bugs and useful feature requests.
2021-01-12 10:52:41 +05:30
7bb01df846 v2.4.1 2021-01-12 10:18:09 +05:30
6142e0b353 get should retrieve the last fetched archive by default 2021-01-12 10:07:14 +05:30
a65990aee3 don't use pagination API if total pages <= 2 2021-01-12 09:46:07 +05:30
259a024eb1 joke? they changed their robots.txt 2021-01-11 23:17:01 +05:30
91402792e6 + Supported Features
tell what the package can do, many users probably do not read the full usage.
2021-01-11 23:01:18 +05:30
eabf4dc046 don't fetch more pages if >=2 pages are empty 2021-01-11 22:43:14 +05:30
5a7bd73565 support unix ts as an arg in near 2021-01-11 19:53:37 +05:30
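Accepting a Unix timestamp in near() only needs a conversion to the 14-digit Wayback format; a minimal sketch (the helper name is assumed, not waybackpy's):

```python
from datetime import datetime, timezone

def unix_ts_to_wayback(unix_ts):
    """Convert seconds since the epoch to a YYYYMMDDhhmmss Wayback timestamp."""
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y%m%d%H%M%S")
```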
4693dbf9c1 change str repr of cdxsnapshot to cdx line 2021-01-11 09:34:37 +05:30
f4f2e51315 V2.4.0 (#62)
* v 2.4.0

* v 2.4.0
2021-01-10 11:53:45 +05:30
d6b7df6837 no need to de-duplicate as we are collapsing the results by urlkey
Same URLs aren't received
2021-01-10 11:36:46 +05:30
dafba5d0cb collapses=["urlkey"] for known urls 2021-01-10 11:34:06 +05:30
6c71dfbe41 use cdx matchtype for domain and host 2021-01-10 11:10:49 +05:30
a6470b1036 not passing dict to cdxsnapshot 2021-01-10 10:40:32 +05:30
04cda4558e fix test 2021-01-10 03:18:09 +05:30
625ed63482 remove asserts stmnts 2021-01-10 03:05:48 +05:30
a03813315f full cdx api support 2021-01-10 02:23:53 +05:30
a2550f17d7 retries support for get requests 2021-01-06 01:58:38 +05:30
15ef5816db Always cast url to string, avoid passing waybackpy objects to _get_response 2021-01-05 19:46:17 +05:30
93b52bd0fe FIX : don't use self.user_agent if user_agent passed in get() 2021-01-05 19:31:27 +05:30
28ff877081 Update README.md 2021-01-05 19:08:35 +05:30
3e3ecff9df l2 heading and lint 2021-01-05 01:59:29 +05:30
ce64135ba8 ce 2021-01-05 01:52:35 +05:30
2af6580ffb docs link 2021-01-05 01:51:53 +05:30
8a3c515176 v2.3.3 2021-01-05 01:49:26 +05:30
d98c4f32ad v2.3.3 2021-01-05 01:48:54 +05:30
e0a4b007d5 improve docs 2021-01-05 01:46:12 +05:30
6fb6b2deee Update readme + new file CONTRIBUTORS.md (#59)
* remove some badges

* remove made with python button, obvious

* - maintained badge, we already have latest commit badge

- [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)

* re arranged order of badges

* a bit more re odering

* - release badge

* - license section

* center h1

* try once more'

* removed the TOC

* move the hr

* Update README.md

* + hr

* h1 --> h2

* remove tests and pacakging info from here to docs/wiki

* Update README.md

* example inspired by psf/requests

* CLI tool example gist

* Update README.md

* Update README.md

* + license

* Update README.md

* authors list

* Update CONTRIBUTORS.md

* fix code

* Update README.md

* Update README.md

* center the button
2021-01-05 00:30:07 +05:30
1882862992 now using cdx Pagination API 2021-01-04 20:46:54 +05:30
0c6107e675 increase coverage 2021-01-04 01:54:40 +05:30
bd079978bf inc coverage 2021-01-04 00:44:55 +05:30
5dec4927cd refactoring, try to reduce code complexity 2021-01-04 00:14:38 +05:30
62e5217b9e reduce code complexity: refactoring, less flow breaking structures 2021-01-03 19:38:25 +05:30
9823c809e9 Added doc strings in wrapper.py, documenting code and improving docs. 2021-01-03 17:11:32 +05:30
db5737a857 JSON is now available for near and other other methods that call it 2021-01-02 18:52:46 +05:30
ca0821a466 Wiki docs (#58)
* move docs to wiki

* Update README.md

* Update setup.py
2021-01-02 12:20:43 +05:30
bb4dbc7d3c rm url = obj.url 2021-01-02 11:19:09 +05:30
7c7fd75376 No need to fetch archive_url and timestamp from availability API on init (#55)
* No need to fetch archive_url and timestamp from availability API on init. 

Not useful if all I want is to archive a page

* Update test_wrapper.py

* Update wrapper.py

* Update test_wrapper.py

* Update wrapper.py

* Update cli.py

* Update wrapper.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update setup.py

* Update README.md
2021-01-02 11:10:23 +05:30
0b71433667 v2.3.1 (#54)
* 2.3.1

* 2.3.1
2021-01-01 19:15:23 +05:30
1b499a7594 removed JSON from init, this was resulting in too much unnecessary traffic. Some users with thousands of URLs were blocked by IA (#53)
closes #52
2021-01-01 16:38:57 +05:30
da390ee8a3 improve maintainability and reduce code cognitive complexity (#49) 2020-12-15 10:24:13 +05:30
44 changed files with 3243 additions and 1862 deletions

.github/ISSUE_TEMPLATE/bug_report.md (new file)

@@ -0,0 +1,34 @@
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: bug
assignees: akamhy
---
**Describe the bug**
A clear and concise description of what the bug is.
**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error
**Expected behavior**
A clear and concise description of what you expected to happen.
**Screenshots**
If applicable, add screenshots to help explain your problem.
**Version:**
- OS: [e.g. iOS]
- Version [e.g. 22]
- Is latest version? [e.g. Yes/No]
**Additional context**
Add any other context about the problem here.

@@ -0,0 +1,19 @@
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: enhancement
assignees: akamhy
---
**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
**Describe the solution you'd like**
A clear and concise description of what you want to happen.
**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.
**Additional context**
Add any other context or screenshots about the feature request here.

.github/workflows/build-test.yml (new file)

@@ -0,0 +1,30 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.7', '3.10']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -U setuptools wheel
- name: Build test the package
run: |
python setup.py sdist bdist_wheel

.github/workflows/codeql-analysis.yml (new file)

@@ -0,0 +1,70 @@
# For most projects, this workflow file will not need changing; you simply need
# to commit it to your repository.
#
# You may wish to alter this file to override the set of languages analyzed,
# or to provide custom queries or build logic.
#
# ******** NOTE ********
# We have attempted to detect the languages in your repository. Please check
# the `language` matrix defined below to confirm you have the correct set of
# supported CodeQL languages.
#
name: "CodeQL"
on:
push:
branches: [ master ]
pull_request:
# The branches below must be a subset of the branches above
branches: [ master ]
schedule:
- cron: '30 6 * * 1'
jobs:
analyze:
name: Analyze
runs-on: ubuntu-latest
permissions:
actions: read
contents: read
security-events: write
strategy:
fail-fast: false
matrix:
language: [ 'python' ]
# CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ]
# Learn more about CodeQL language support at https://git.io/codeql-language-support
steps:
- name: Checkout repository
uses: actions/checkout@v2
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@v1
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
# By default, queries listed here will override any specified in a config file.
# Prefix the list here with "+" to use these queries and those in the config file.
# queries: ./path/to/local/query, your-org/your-repo/queries@main
# Autobuild attempts to build any compiled languages (C/C++, C#, or Java).
# If this step fails, then you should remove it and run the build manually (see below)
- name: Autobuild
uses: github/codeql-action/autobuild@v1
# Command-line programs to run using the OS shell.
# 📚 https://git.io/JvXDl
# ✏️ If the Autobuild fails above, remove it and uncomment the following three lines
# and modify them (or add more) to build your code if your project
# uses a compiled language
#- run: |
# make bootstrap
# make release
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v1

@@ -1,7 +1,7 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: CI
name: Tests
on:
push:
@@ -15,8 +15,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.8']
python-version: ['3.10']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
@@ -26,17 +25,19 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest codecov pytest-cov
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
pip install '.[dev]'
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
flake8 . --count --show-source --statistics
- name: Lint with black
run: |
black . --check --diff
- name: Static type test with mypy
run: |
mypy -p waybackpy -p tests
- name: Test with pytest
run: |
pytest --cov=waybackpy tests/
pytest
- name: Upload coverage to Codecov
run: |
bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

.gitignore

@@ -1,3 +1,6 @@
# Files generated while testing
*-urls-*.txt
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

@@ -1,4 +1,7 @@
# File : .pep8speaks.yml
scanner:
diff_only: True # If True, errors caused by only the patch are shown
diff_only: True
linter: flake8
flake8:
max-line-length: 88
extend-ignore: W503,W605

@@ -1,5 +0,0 @@
# autogenerated pyup.io config file
# see https://pyup.io/docs/configuration/ for all available options
schedule: ''
update: false

@@ -1,8 +1,12 @@
{
"scanSettings": {
"baseBranches": []
},
"checkRunSettings": {
"vulnerableCheckRunConclusionLevel": "failure"
"vulnerableCheckRunConclusionLevel": "failure",
"displayMode": "diff"
},
"issueSettings": {
"minSeverityLevel": "LOW"
}
}

CITATION.cff (new file)

@@ -0,0 +1,25 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: waybackpy
abstract: "Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily."
version: '3.0.6'
doi: 10.5281/ZENODO.3977276
date-released: 2022-03-15
type: software
authors:
- given-names: Akash
family-names: Mahanty
email: akamhy@yahoo.com
orcid: https://orcid.org/0000-0003-2482-8227
keywords:
- Archive Website
- Wayback Machine
- Internet Archive
- Wayback Machine CLI
- Wayback Machine Python
- Internet Archiving
- Availability API
- CDX API
- savepagenow
license: MIT
repository-code: "https://github.com/akamhy/waybackpy"

CODE_OF_CONDUCT.md (new file)

@@ -0,0 +1,128 @@
# Contributor Covenant Code of Conduct
## Our Pledge
We as members, contributors, and leaders pledge to make participation in our
community a harassment-free experience for everyone, regardless of age, body
size, visible or invisible disability, ethnicity, sex characteristics, gender
identity and expression, level of experience, education, socio-economic status,
nationality, personal appearance, race, religion, or sexual identity
and orientation.
We pledge to act and interact in ways that contribute to an open, welcoming,
diverse, inclusive, and healthy community.
## Our Standards
Examples of behavior that contributes to a positive environment for our
community include:
* Demonstrating empathy and kindness toward other people
* Being respectful of differing opinions, viewpoints, and experiences
* Giving and gracefully accepting constructive feedback
* Accepting responsibility and apologizing to those affected by our mistakes,
and learning from the experience
* Focusing on what is best not just for us as individuals, but for the
overall community
Examples of unacceptable behavior include:
* The use of sexualized language or imagery, and sexual attention or
advances of any kind
* Trolling, insulting or derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or email
address, without their explicit permission
* Other conduct which could reasonably be considered inappropriate in a
professional setting
## Enforcement Responsibilities
Community leaders are responsible for clarifying and enforcing our standards of
acceptable behavior and will take appropriate and fair corrective action in
response to any behavior that they deem inappropriate, threatening, offensive,
or harmful.
Community leaders have the right and responsibility to remove, edit, or reject
comments, commits, code, wiki edits, issues, and other contributions that are
not aligned to this Code of Conduct, and will communicate reasons for moderation
decisions when appropriate.
## Scope
This Code of Conduct applies within all community spaces, and also applies when
an individual is officially representing the community in public spaces.
Examples of representing our community include using an official e-mail address,
posting via an official social media account, or acting as an appointed
representative at an online or offline event.
## Enforcement
Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported to the community leaders responsible for enforcement at
akamhy@yahoo.com.
All complaints will be reviewed and investigated promptly and fairly.
All community leaders are obligated to respect the privacy and security of the
reporter of any incident.
## Enforcement Guidelines
Community leaders will follow these Community Impact Guidelines in determining
the consequences for any action they deem in violation of this Code of Conduct:
### 1. Correction
**Community Impact**: Use of inappropriate language or other behavior deemed
unprofessional or unwelcome in the community.
**Consequence**: A private, written warning from community leaders, providing
clarity around the nature of the violation and an explanation of why the
behavior was inappropriate. A public apology may be requested.
### 2. Warning
**Community Impact**: A violation through a single incident or series
of actions.
**Consequence**: A warning with consequences for continued behavior. No
interaction with the people involved, including unsolicited interaction with
those enforcing the Code of Conduct, for a specified period of time. This
includes avoiding interactions in community spaces as well as external channels
like social media. Violating these terms may lead to a temporary or
permanent ban.
### 3. Temporary Ban
**Community Impact**: A serious violation of community standards, including
sustained inappropriate behavior.
**Consequence**: A temporary ban from any sort of interaction or public
communication with the community for a specified period of time. No public or
private interaction with the people involved, including unsolicited interaction
with those enforcing the Code of Conduct, is allowed during this period.
Violating these terms may lead to a permanent ban.
### 4. Permanent Ban
**Community Impact**: Demonstrating a pattern of violation of community
standards, including sustained inappropriate behavior, harassment of an
individual, or aggression toward or disparagement of classes of individuals.
**Consequence**: A permanent ban from any sort of public interaction within
the community.
## Attribution
This Code of Conduct is adapted from the [Contributor Covenant][homepage],
version 2.0, available at
<https://www.contributor-covenant.org/version/2/0/code_of_conduct.html>.
Community Impact Guidelines were inspired by [Mozilla's code of conduct
enforcement ladder](https://github.com/mozilla/diversity).
[homepage]: https://www.contributor-covenant.org
For answers to common questions about this code of conduct, see the FAQ at
<https://www.contributor-covenant.org/faq>. Translations are available at
<https://www.contributor-covenant.org/translations>.

@@ -1,58 +0,0 @@
# Contributing to waybackpy
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
- Reporting a bug
- Discussing the current state of the code
- Submitting a fix
- Proposing new features
- Becoming a maintainer
## We Develop with Github
We use github to host code, to track issues and feature requests, as well as accept pull requests.
## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
## Any contributions you make will be under the MIT Software License
In short, when you submit code changes, your submissions are understood to be under the same [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
## Report bugs using Github's [issues](https://github.com/akamhy/waybackpy/issues)
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/akamhy/waybackpy/issues/new); it's that easy!
## Write bug reports with detail, background, and sample code
**Great Bug Reports** tend to have:
- A quick summary and/or background
- Steps to reproduce
- Be specific!
- Give sample code if you can.
- What you expected would happen
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)
People *love* thorough bug reports. I'm not even kidding.
## Use a Consistent Coding Style
* You can try running `flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics` for style unification.
## License
By contributing, you agree that your contributions will be licensed under its [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE).
## References
This document is forked from [this gist](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62) by [briandk](https://github.com/briandk) which was itself adapted from the open-source contribution guidelines for [Facebook's Draft](https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md)

CONTRIBUTORS.md (new file)

@@ -0,0 +1,16 @@
# CONTRIBUTORS
## AUTHORS
- akamhy (<https://github.com/akamhy>)
- eggplants (<https://github.com/eggplants>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- rafaelrdealmeida (<https://github.com/rafaelrdealmeida>)
- jonasjancarik (<https://github.com/jonasjancarik>)
- jfinkhaeuser (<https://github.com/jfinkhaeuser>)
## ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

@@ -1,6 +1,6 @@
MIT License
Copyright (c) 2020 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Copyright (c) 2020-2022 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

README.md

@@ -1,452 +1,206 @@
<!-- markdownlint-disable MD033 MD041 -->
<div align="center">
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<h3>A Python package & CLI tool that interfaces with the Wayback Machine API</h3>
</div>
-----------------
<p align="center">
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ATests"><img alt="Unit Tests" src="https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg"></a>
<a href="https://codecov.io/gh/akamhy/waybackpy"><img alt="codecov" src="https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg"></a>
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://app.codacy.com/gh/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade_Settings"><img alt="Codacy Badge" src="https://api.codacy.com/project/badge/Grade/6d777d8509f642ac89a20715bb3a6193"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square"></a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
</p>
## Python package & CLI tool that interfaces with the Wayback Machine API.
[![pypi](https://img.shields.io/pypi/v/waybackpy.svg)](https://pypi.org/project/waybackpy/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![Build Status](https://github.com/akamhy/waybackpy/workflows/CI/badge.svg)](https://github.com/akamhy/waybackpy/actions)
[![codecov](https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg)](https://codecov.io/gh/akamhy/waybackpy)
[![contributions welcome](https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square)](https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![Downloads](https://pepy.tech/badge/waybackpy/month)](https://pepy.tech/project/waybackpy)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Maintainability](https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability)](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
[![GitHub last commit](https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square)](https://github.com/akamhy/waybackpy/commits/master)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
---
# <img src="https://github.githubassets.com/images/icons/emoji/unicode/2b50.png" width="30"></img> Introduction
Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.
Table of contents
=================
<!--ts-->
The Wayback Machine has three client-side APIs.
* [Installation](#installation)
- SavePageNow or Save API
- CDX Server API
- Availability API
* [Usage](#usage)
* [As a Python package](#as-a-python-package)
* [Saving a webpage](#capturing-aka-saving-an-url-using-save)
* [Retrieving archive](#retrieving-the-archive-for-an-url-using-archive_url)
* [Retrieving the oldest archive](#retrieving-the-oldest-archive-for-an-url-using-oldest)
* [Retrieving the latest/newest archive](#retrieving-the-newest-archive-for-an-url-using-newest)
* [Retrieving the JSON response of availability API](#retrieving-the-json-response-for-the-availability-api-request)
* [Retrieving archive close to a specified year, month, day, hour, and minute](#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage](#get-the-content-of-webpage-using-get)
* [Count total archives for an URL](#count-total-archives-for-an-url-using-total_archives)
* [List of URLs that Wayback Machine knows and has archived for a domain name](#list-of-urls-that-wayback-machine-knows-and-has-archived-for-a-domain-name)
These three APIs can be accessed via waybackpy either by importing it in a Python file/module or from the command-line interface.
* [With the Command-line interface](#with-the-command-line-interface)
* [Saving webpage](#save)
* [Archive URL](#get-archive-url)
* [Oldest archive URL](#oldest-archive)
* [Newest archive URL](#newest-archive)
* [JSON response of API](#get-json-data-of-availability-api)
* [Total archives](#total-number-of-archives)
* [Archive near specified time](#archive-near-time)
* [Get the source code](#get-the-source-code)
* [Fetch all the URLs that the Wayback Machine knows for a domain](#fetch-all-the-urls-that-the-wayback-machine-knows-for-a-domain)
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f3d7.png" width="20"></img> Installation
* [Tests](#tests)
* [Packaging](#packaging)
* [License](#license)
<!--te-->
## Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
**Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)), from [PyPI](https://pypi.org/) (recommended)**:
```bash
pip install waybackpy
```
or direct from this repository using git.
**Using [conda](https://en.wikipedia.org/wiki/Conda_(package_manager)), from [conda-forge](https://anaconda.org/conda-forge/waybackpy) (recommended)**:
See also the [waybackpy feedstock](https://github.com/conda-forge/waybackpy-feedstock); its maintainers are [@rafaelrdealmeida](https://github.com/rafaelrdealmeida/),
[@labriunesp](https://github.com/labriunesp/)
and [@akamhy](https://github.com/akamhy/).
```bash
conda install -c conda-forge waybackpy
```
**Install directly from [this git repository](https://github.com/akamhy/waybackpy) (NOT recommended)**:
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
## Usage
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f433.png" width="20"></img> Docker Image
Docker Hub: [hub.docker.com/r/secsi/waybackpy](https://hub.docker.com/r/secsi/waybackpy)
The Docker image is automatically updated on every release by [Regularly and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
RAUDI is a tool by [SecSI](https://secsi.io), an Italian cybersecurity startup.
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f680.png" width="20"></img> Usage
### As a Python package
#### Capturing aka Saving an URL using save()
#### Save API aka SavePageNow
```python
import waybackpy
url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive = waybackpy_url_obj.save()
print(archive)
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
```
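The returned archive URL embeds the same 14-digit timestamp that `timestamp()` parses. A minimal stdlib sketch of that extraction (the helper name is hypothetical, not part of waybackpy's API):

```python
from datetime import datetime

def timestamp_from_archive_url(archive_url: str) -> datetime:
    # A Wayback Machine archive URL has the form
    # https://web.archive.org/web/<14-digit timestamp>/<original URL>
    ts = archive_url.split("/web/")[1][:14]
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

print(timestamp_from_archive_url(
    "https://web.archive.org/web/20220118125249/https://github.com/"
))
# 2022-01-18 12:52:49
```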
```bash
https://web.archive.org/web/20201016171808/https://en.wikipedia.org/wiki/Multivariable_calculus
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPySaveExample></sub>
#### Retrieving the archive for an URL using archive_url
#### CDX API aka CDXServerAPI
```python
import waybackpy
url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive_url = waybackpy_url_obj.archive_url
print(archive_url)
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
```
##### oldest
```python
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
```
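The `urlkey` field above (`com,google)/`) is a SURT-style key: host labels reversed and comma-joined. A sketch of that transform for a bare hostname (real CDX urlkeys also canonicalize path and query, which this deliberately skips):

```python
def surt_urlkey(hostname: str) -> str:
    # Reverse the dot-separated labels and join with commas,
    # then close with ")/" as the CDX server does for the root path.
    return ",".join(reversed(hostname.split("."))) + ")/"

print(surt_urlkey("google.com"))
# com,google)/
```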
##### newest
```python
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newest.archive_url
'https://web.archive.org/web/20220217234427/http://@google.com/'
>>> newest.timestamp
'20220217234427'
```
##### near
```python
>>> near = cdx_api.near(year=2010, month=10, day=10, hour=10, minute=10)
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.timestamp
'20101010101435'
>>> near = cdx_api.near(wayback_machine_timestamp=2008080808)
>>> near.archive_url
'https://web.archive.org/web/20080808051143/http://google.com/'
>>> near = cdx_api.near(unix_timestamp=1286705410)
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>>
```
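`near()` accepts either a 14-digit Wayback Machine timestamp or a Unix timestamp. The two are interconvertible with the stdlib; a sketch of rendering a Unix timestamp in Wayback form (the helper name is illustrative, and `near()` of course returns the closest *archive*, not this exact instant):

```python
from datetime import datetime, timezone

def unix_to_wayback(unix_ts: int) -> str:
    # Wayback Machine timestamps are YYYYMMDDhhmmss in UTC.
    return datetime.fromtimestamp(unix_ts, tz=timezone.utc).strftime("%Y%m%d%H%M%S")

print(unix_to_wayback(1286705410))
# 20101010101010
```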
##### snapshots
```python
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
... print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
```
```bash
https://web.archive.org/web/20201016153320/https://www.google.com/
```
#### Availability API
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyArchiveUrl></sub>
#### Retrieving the oldest archive for an URL using oldest()
It is recommended not to use the Availability API, due to performance issues. All the methods of the Availability API interface class, `WaybackMachineAvailabilityAPI`, are also implemented in the CDX server API interface class, `WaybackMachineCDXServerAPI`. Also note
that the archive returned by the `newest()` method of `WaybackMachineAvailabilityAPI` can be more recent than the one returned by the same method of `WaybackMachineCDXServerAPI`.
```python
import waybackpy
url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
oldest_archive_url = waybackpy_url_obj.oldest()
print(oldest_archive_url)
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
```
```bash
http://web.archive.org/web/19981111184551/http://google.com:80/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyOldestExample></sub>
#### Retrieving the newest archive for an URL using newest()
##### oldest
```python
import waybackpy
url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
newest_archive_url = waybackpy_url_obj.newest()
print(newest_archive_url)
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
```
```bash
https://web.archive.org/web/20201016150543/https://www.facebook.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNewestExample></sub>
#### Retrieving the JSON response for the availability API request
##### newest
```python
import waybackpy
url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
json_dict = waybackpy_url_obj.JSON
print(json_dict)
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
```
```python
{'url': 'https://www.facebook.com/', 'archived_snapshots': {'closest': {'available': True, 'url': 'http://web.archive.org/web/20201016150543/https://www.facebook.com/', 'timestamp': '20201016150543', 'status': '200'}}}
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyJSON></sub>
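The availability JSON shown above is a plain nested dict, so the closest snapshot needs only standard key access. A sketch using that same response:

```python
# The response dict shown above, verbatim.
response = {
    "url": "https://www.facebook.com/",
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20201016150543/https://www.facebook.com/",
            "timestamp": "20201016150543",
            "status": "200",
        }
    },
}

closest = response["archived_snapshots"]["closest"]
print(closest["url"])
# http://web.archive.org/web/20201016150543/https://www.facebook.com/
```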
#### Retrieving archive close to a specified year, month, day, hour, and minute using near()
##### near
```python
from waybackpy import Url
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
url = "https://github.com/"
waybackpy_url_obj = Url(url, user_agent)
# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
```
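The note above about not zero-padding makes sense once you see that a `near()` lookup reduces to formatting the integer components into a 14-digit timestamp prefix, where the formatter supplies the padding. A sketch of that idea (illustrative only, not waybackpy's actual internals):

```python
def to_wayback_prefix(year, month=1, day=1, hour=0, minute=0):
    # Callers pass unpadded ints (month=1, not month=01, which is a
    # Python syntax error anyway); zero-padding happens here.
    return f"{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}"

print(to_wayback_prefix(2010, 10, 10, 10))
# 201010101000
```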
```python
github_archive_near_2010 = waybackpy_url_obj.near(year=2010)
print(github_archive_near_2010)
```
> Documentation is at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
```bash
https://web.archive.org/web/20101018053604/http://github.com:80/
```
### As a CLI tool
```python
github_archive_near_2011_may = waybackpy_url_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)
```
Demo video on [asciinema.org](https://asciinema.org/a/469890); you can copy the text from the video:
```bash
https://web.archive.org/web/20110518233639/https://github.com/
```
[![asciicast](https://asciinema.org/a/469890.svg)](https://asciinema.org/a/469890)
```python
github_archive_near_2015_january_26 = waybackpy_url_obj.near(year=2015, month=1, day=26)
print(github_archive_near_2015_january_26)
```
> CLI documentation is at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
```bash
https://web.archive.org/web/20150125102636/https://github.com/
```
```python
github_archive_near_2018_4_july_9_2_am = waybackpy_url_obj.near(year=2018, month=7, day=4, hour=9, minute=2)
print(github_archive_near_2018_4_july_9_2_am)
```
```bash
https://web.archive.org/web/20180704090245/https://github.com/
```
<sub>The package doesn't support a seconds argument yet. You are encouraged to create a PR ;)</sub>
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNearExample></sub>
#### Get the content of webpage using get()
```python
import waybackpy
google_url = "https://www.google.com/"
User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
waybackpy_url_object = waybackpy.Url(google_url, User_Agent)
# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)
# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(waybackpy_url_object.save())
print(google_newest_archive_source)
# waybackpy_url_object.oldest() type is str, it's oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(waybackpy_url_object.oldest())
print(google_oldest_archive_source)
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyGetExample#main.py></sub>
#### Count total archives for an URL using total_archives()
```python
import waybackpy
URL = "https://en.wikipedia.org/wiki/Python (programming language)"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)
archive_count = waybackpy_url_object.total_archives()
print(archive_count) # total_archives() returns an int
```
```bash
2516
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyTotalArchivesExample></sub>
#### List of URLs that Wayback Machine knows and has archived for a domain name
1) If alive=True is set, waybackpy will check all URLs to identify the alive URLs. Don't use it with popular websites like Google, or it will take too long.
2) To include URLs from subdomains, set subdomain=True
```python
import waybackpy
URL = "akamhy.github.io"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)
known_urls = waybackpy_url_object.known_urls(alive=True, subdomain=False) # alive and subdomain are optional.
print(known_urls) # known_urls() returns list of URLs
```
```bash
['http://akamhy.github.io',
'https://akamhy.github.io/waybackpy/',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py></sub>
### With the Command-line interface
#### Save
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashSave></sub>
#### Get archive URL
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --archive_url
https://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashArchiveUrl></sub>
#### Oldest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashOldest></sub>
#### Newest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNewest></sub>
#### Get JSON data from the availability API
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --json
```
```python
{'archived_snapshots': {'closest': {'timestamp': '20201007132458', 'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX'}}, 'url': 'https://en.wikipedia.org/wiki/SpaceX'}
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashJSON></sub>
#### Total number of archives
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
853
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashTotal></sub>
#### Archive near time
```bash
$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
https://web.archive.org/web/20120512142515/https://www.facebook.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNear></sub>
#### Get the source code
```bash
waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the URL
waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on the Wayback Machine, then print its source code.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashGet></sub>
#### Fetch all the URLs that the Wayback Machine knows for a domain
1) Add the '--alive' flag to fetch only links that are still live.
2) Add the '--subdomain' flag to include subdomains.
3) The '--alive' and '--subdomain' flags can be used simultaneously.
4) All links will be saved in a file created in the current working directory.
```bash
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --alive
# Prints all known URLs under akamhy.github.io that are still live.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io, including subdomains.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain --alive
# Prints all known URLs under akamhy.github.io, including subdomains, that are still live.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh></sub>
## Tests
To run tests locally:
1) Install or update the testing/coverage tools
```bash
pip install codecov pytest pytest-cov -U
```
2) Inside the repository, run the following command
```bash
pytest --cov=waybackpy tests/
```
3) To report coverage, run
```bash
bash <(curl -s https://codecov.io/bash) -t SECRET_CODECOV_TOKEN
```
You can find the tests [here](https://github.com/akamhy/waybackpy/tree/master/tests).
## Packaging
1. Increment version.
2. Build package ``python setup.py sdist bdist_wheel``.
3. Sign & upload the package ``twine upload -s dist/*``.
## License
Released under the MIT License. See
[license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.


@ -1,268 +0,0 @@
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="629.000000pt" height="103.000000pt" viewBox="0 0 629.000000 103.000000"
preserveAspectRatio="xMidYMid meet">
<g transform="translate(0.000000,103.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M0 515 l0 -515 3145 0 3145 0 0 515 0 515 -3145 0 -3145 0 0 -515z
m5413 439 c31 -6 36 -10 31 -26 -3 -10 0 -26 7 -34 6 -8 10 -17 7 -20 -3 -2
-17 11 -32 31 -15 19 -41 39 -59 44 -38 11 -10 14 46 5z m150 -11 c-7 -2 -21
-2 -30 0 -10 3 -4 5 12 5 17 0 24 -2 18 -5z m-4869 -23 c-6 -6 -21 -6 -39 -1
-30 9 -30 9 10 10 25 1 36 -2 29 -9z m452 -37 c-3 -26 -15 -65 -25 -88 -10
-22 -21 -64 -25 -94 -3 -29 -14 -72 -26 -95 -11 -23 -20 -51 -20 -61 0 -30
-39 -152 -53 -163 -6 -5 -45 -12 -85 -14 -72 -5 -102 4 -102 33 0 6 -9 31 -21
56 -11 25 -26 72 -33 103 -6 31 -17 64 -24 73 -8 9 -22 37 -32 64 l-18 48 -16
-39 c-9 -21 -16 -44 -16 -50 0 -6 -7 -24 -15 -40 -8 -16 -24 -63 -34 -106 -11
-43 -26 -93 -34 -112 -14 -34 -15 -35 -108 -46 -70 -9 -96 -9 -106 0 -21 17
-43 64 -43 92 0 14 -4 27 -9 31 -12 7 -50 120 -66 200 -8 35 -25 81 -40 103
-14 22 -27 52 -28 68 -2 28 0 29 48 31 28 1 82 5 120 9 54 4 73 3 82 -7 11
-15 53 -148 53 -170 0 -7 9 -32 21 -56 20 -41 39 -49 39 -17 0 8 -5 12 -10 9
-6 -3 -13 2 -16 12 -3 10 -10 26 -15 36 -14 26 7 21 29 -8 l20 -26 7 33 c7 35
41 149 56 185 7 19 16 23 56 23 27 0 80 2 120 6 80 6 88 1 97 -71 3 -20 9 -42
14 -48 5 -7 20 -43 32 -82 13 -38 24 -72 26 -74 2 -2 13 4 24 14 13 12 20 31
20 55 0 20 7 56 15 81 7 24 19 63 25 87 12 47 31 60 89 61 l34 1 -7 -47z
m3131 41 c17 -3 34 -12 37 -20 3 -7 1 -48 -4 -91 -4 -43 -7 -80 -4 -82 2 -2
11 2 20 10 9 7 24 18 34 24 9 5 55 40 101 77 79 64 87 68 136 68 28 0 54 -4
58 -10 3 -5 12 -7 20 -3 9 3 15 -1 15 -9 0 -13 -180 -158 -197 -158 -4 0 -14
-9 -20 -20 -11 -17 -7 -27 27 -76 22 -32 40 -63 40 -70 0 -7 6 -19 14 -26 7
-8 37 -48 65 -89 l52 -74 -28 -3 c-51 -5 -74 -12 -68 -22 9 -14 -59 -12 -73 2
-20 20 -13 30 10 14 34 -24 44 -19 17 8 -25 25 -109 140 -109 149 0 7 -60 97
-64 97 -2 0 -11 -10 -22 -22 -18 -21 -18 -21 0 -15 10 4 25 2 32 -4 18 -15 19
-35 2 -22 -7 6 -25 13 -39 17 -34 8 -39 -5 -39 -94 0 -38 -3 -75 -6 -84 -6
-16 -54 -22 -67 -9 -4 3 -40 7 -81 8 -101 2 -110 10 -104 97 3 37 10 73 16 80
6 8 10 77 10 174 0 89 2 166 6 172 6 11 162 15 213 6z m301 -1 c-25 -2 -52
-11 -58 -19 -7 -7 -17 -14 -23 -14 -5 0 -2 9 8 20 14 16 29 20 69 18 l51 -2
-47 -3z m809 -9 c33 -21 65 -89 62 -132 -1 -21 1 -47 5 -59 9 -28 -26 -111
-51 -120 -10 -3 -25 -12 -33 -19 -10 -8 -70 -15 -170 -21 l-155 -8 4 -73 c4
-93 -10 -112 -80 -112 -26 0 -60 5 -74 12 -19 8 -31 8 -51 -1 -45 -20 -55 -1
-55 98 0 47 -1 111 -3 141 -2 30 -5 107 -7 170 l-4 115 65 2 c36 2 103 7 150
11 150 15 372 13 397 -4z m338 -19 c11 -14 46 -54 78 -88 l58 -62 62 65 c34
36 75 73 89 83 28 18 113 24 122 9 3 -5 -32 -51 -77 -102 -147 -167 -134 -143
-139 -253 -3 -54 -10 -103 -16 -109 -8 -8 -8 -17 -1 -30 14 -26 11 -28 -47
-29 -119 -2 -165 3 -174 22 -6 10 -9 69 -8 131 l2 113 -57 75 c-32 41 -80 102
-107 134 -27 33 -47 62 -45 66 3 4 58 6 122 4 113 -3 119 -5 138 -29z m-4233
13 c16 -13 98 -150 98 -164 0 -4 29 -65 65 -135 36 -71 65 -135 65 -143 0 -10
-14 -17 -37 -21 -21 -4 -48 -10 -61 -16 -40 -16 -51 -10 -77 41 -29 57 -35 59
-157 38 -65 -11 -71 -14 -84 -43 -10 -25 -21 -34 -46 -38 -41 -6 -61 8 -48 33
15 28 12 38 -12 42 -18 2 -23 10 -24 36 -1 27 3 35 23 43 13 5 34 9 46 9 23 0
57 47 57 78 0 9 10 33 22 52 14 24 21 52 22 92 1 49 4 58 24 67 13 6 31 11 40
11 9 0 26 7 36 15 24 18 28 18 48 3z m1701 0 c16 -12 97 -143 97 -157 0 -3 32
-69 70 -146 39 -76 67 -142 62 -147 -4 -4 -28 -12 -52 -17 -25 -6 -57 -13 -72
-17 -25 -6 -29 -2 -50 42 -14 30 -31 50 -43 53 -11 2 -57 -2 -103 -9 -79 -12
-83 -13 -96 -45 -10 -24 -22 -34 -46 -38 -43 -9 -53 -1 -45 39 5 30 3 34 -15
34 -17 0 -20 6 -20 39 0 40 13 50 65 51 19 0 55 48 55 72 0 6 8 29 19 52 32
72 41 107 31 127 -8 14 -5 21 12 33 12 9 32 16 43 16 11 0 29 7 39 15 24 18
28 18 49 3z m-3021 -11 c-29 -9 -32 -13 -27 -39 8 -36 -11 -37 -20 -1 -8 32
15 54 54 52 24 -1 23 -2 -7 -12z m3499 4 c-12 -8 -51 -4 -51 5 0 2 15 4 33 4
22 0 28 -3 18 -9z m1081 -67 c2 -42 0 -78 -4 -81 -5 -2 -8 18 -8 45 0 27 -3
64 -6 81 -4 19 -2 31 4 31 6 0 12 -32 14 -76z m-1951 46 c12 -7 19 -21 19 -38
l-1 -27 -15 28 c-8 15 -22 27 -32 27 -9 0 -24 5 -32 10 -21 14 35 13 61 0z
m1004 -3 c73 -19 135 -61 135 -92 0 -15 -8 -29 -21 -36 -18 -9 -30 -6 -69 15
-37 20 -62 26 -109 26 -54 0 -62 -3 -78 -26 -21 -32 -33 -130 -25 -191 9 -58
41 -84 111 -91 38 -3 61 1 97 17 36 17 49 19 60 10 25 -21 15 -48 -28 -76 -38
-24 -54 -28 -148 -31 -114 -4 -170 10 -190 48 -6 11 -16 20 -23 20 -24 0 -59
95 -59 159 0 59 20 122 42 136 6 3 10 13 10 22 0 31 80 82 130 83 19 0 42 5
50 10 21 13 57 12 115 -3z m-1682 -23 c-14 -14 -28 -23 -31 -20 -8 8 29 46 44
46 7 0 2 -11 -13 -26z m159 -2 c-20 -15 -22 -23 -16 -60 4 -28 3 -42 -5 -42
-7 0 -11 19 -11 50 0 36 5 52 18 59 28 17 39 12 14 -7z m1224 -28 c-39 -40
-46 -38 -19 7 15 24 40 41 52 33 2 -2 -13 -20 -33 -40z m-1538 -33 l62 -66 63
68 c56 59 68 67 100 67 19 0 38 -3 40 -7 3 -5 -32 -53 -76 -108 -88 -108 -84
-97 -90 -255 l-2 -55 -87 -3 c-49 -1 -88 -1 -89 0 0 2 -3 50 -5 107 -3 75 -8
109 -19 121 -8 9 -15 20 -15 25 0 4 -18 29 -41 54 -83 94 -89 102 -84 111 3 6
45 9 93 9 l87 -1 63 -67z m786 59 c33 -12 48 -42 52 -107 3 -43 0 -57 -16 -73
l-20 -20 20 -28 c26 -35 35 -89 21 -125 -18 -46 -66 -60 -226 -64 -77 -3 -166
-7 -198 -10 -84 -7 -99 9 -97 102 1 38 -1 125 -4 191 l-5 122 47 5 c26 3 103
4 171 2 69 -2 134 1 145 5 29 12 80 12 110 0z m-1050 -16 c3 -8 2 -12 -4 -9
-6 3 -10 10 -10 16 0 14 7 11 14 -7z m-374 -22 c0 -9 -5 -24 -10 -32 -7 -11
-10 -5 -10 23 0 23 4 36 10 32 6 -3 10 -14 10 -23z m1701 16 c2 -21 -2 -43
-10 -51 -4 -4 -7 9 -8 28 -1 32 15 52 18 23z m2859 -28 c-11 -20 -50 -28 -50
-10 0 6 9 10 19 10 11 0 23 5 26 10 12 19 16 10 5 -10z m-4759 -47 c-8 -15
-10 -15 -11 -2 0 17 10 32 18 25 2 -3 -1 -13 -7 -23z m2599 9 c0 -9 -40 -35
-46 -29 -6 6 25 37 37 37 5 0 9 -3 9 -8z m316 -127 c-4 -19 -12 -37 -18 -41
-8 -5 -9 -1 -5 10 4 10 7 36 7 59 1 35 2 39 11 24 6 -10 8 -34 5 -52z m1942
38 c-15 -16 -30 -45 -33 -65 -4 -21 -12 -38 -17 -38 -19 0 3 74 30 103 14 15
30 27 36 27 5 0 -2 -12 -16 -27z m-3855 -16 c-6 -12 -15 -33 -20 -47 -9 -23
-10 -23 -15 -3 -3 12 3 34 14 52 23 35 37 34 21 -2z m3282 -82 c-23 -18 -81
-35 -115 -34 -17 1 -11 5 21 13 25 7 54 18 65 24 30 18 53 15 29 -3z m-2585
-130 c-7 -8 -19 -15 -27 -15 -10 0 -7 8 9 31 18 24 24 27 26 14 2 -9 -2 -22
-8 -30z m-1775 -5 c-4 -12 -9 -19 -12 -17 -3 3 -2 15 2 27 4 12 9 19 12 17 3
-3 2 -15 -2 -27z m820 -29 c-9 -8 -25 21 -25 44 0 16 3 14 15 -9 9 -16 13 -32
10 -35z m2085 47 c0 -17 -31 -48 -47 -48 -11 0 -8 8 9 29 24 32 38 38 38 19z
m-1655 -47 c-11 -10 -35 11 -35 30 0 21 0 21 19 -2 11 -13 18 -26 16 -28z
m1221 24 c13 -14 21 -25 18 -25 -11 0 -54 33 -54 41 0 15 12 10 36 -16z
m-1428 -7 c-3 -7 -18 -14 -34 -15 -20 -1 -22 0 -6 4 12 2 22 9 22 14 0 5 5 9
11 9 6 0 9 -6 7 -12z m3574 -45 c8 -10 6 -13 -11 -13 -18 0 -21 6 -20 38 0 34
1 35 10 13 5 -14 15 -31 21 -38z m-4097 14 c19 -4 19 -4 2 -12 -18 -7 -46 16
-47 39 0 6 6 3 13 -6 6 -9 21 -18 32 -21z m1700 1 c19 -5 19 -5 2 -13 -18 -7
-46 17 -46 40 0 6 5 3 12 -6 7 -9 21 -19 32 -21z m-1970 12 c-3 -5 -21 -9 -38
-9 l-32 2 35 7 c19 4 36 8 38 9 2 0 0 -3 -3 -9z m350 0 c-27 -12 -35 -12 -35
0 0 6 12 10 28 9 24 0 25 -1 7 -9z m1350 0 c-3 -5 -18 -9 -33 -9 l-27 1 30 8
c17 4 31 8 33 9 2 0 0 -3 -3 -9z m355 0 c-19 -13 -30 -13 -30 0 0 6 10 10 23
10 18 0 19 -2 7 -10z m-2324 -35 c-6 -22 -11 -25 -44 -24 -31 2 -32 3 -9 6 18
3 32 14 39 29 14 30 23 24 14 -11z m2839 16 c-14 -14 -73 -26 -60 -13 6 5 19
12 30 15 34 8 40 8 30 -2z m212 -21 l48 -8 -47 -1 c-56 -1 -78 6 -78 26 0 12
3 13 14 3 8 -6 36 -15 63 -20z m116 -1 c-6 -6 -18 -6 -28 -3 -18 7 -18 8 1 14
23 9 39 1 27 -11z m633 -14 c31 5 35 4 21 -5 -9 -6 -34 -10 -55 -8 -31 3 -37
7 -40 28 l-3 25 19 -23 c16 -20 24 -23 58 -17z m939 15 c16 -7 11 -9 -20 -9
-29 -1 -36 2 -25 9 17 11 19 11 45 0z m-5445 -24 c6 -8 21 -16 33 -18 19 -3
20 -4 5 -10 -12 -5 -27 1 -45 17 -16 13 -23 25 -17 25 6 0 17 -6 24 -14z m150
-76 c0 -11 -4 -20 -10 -20 -14 0 -13 -103 1 -117 21 -21 2 -43 -36 -43 -19 0
-35 5 -35 11 0 8 -5 7 -15 -1 -21 -17 -44 2 -28 22 22 26 20 128 -2 128 -8 0
-15 9 -15 19 0 18 8 20 70 20 63 0 70 -2 70 -19z m1189 -63 c17 -32 31 -62 31
-66 0 -14 -43 -21 -57 -9 -7 6 -29 12 -48 14 -26 2 -35 -1 -40 -16 -4 -12 -12
-17 -21 -13 -8 3 -13 12 -10 19 3 8 1 14 -4 14 -18 0 -10 22 9 27 22 6 43 46
35 67 -3 9 5 20 23 30 34 18 38 14 82 -67z m2146 -8 l34 -67 -25 -6 c-14 -4
-31 -3 -37 2 -7 5 -29 12 -49 16 -31 6 -38 4 -38 -9 0 -8 -7 -15 -15 -15 -8 0
-15 7 -15 15 0 8 -4 15 -10 15 -19 0 -10 21 14 30 16 6 27 20 31 40 4 18 16
41 27 52 26 26 40 14 83 -73z m-3205 51 c8 -10 20 -26 27 -36 10 -17 12 -14
12 19 1 36 2 37 37 37 l37 0 -8 -72 c-3 -40 -11 -76 -17 -79 -20 -13 -43 3
-62 42 -27 56 -34 56 -41 4 -7 -42 -9 -44 -34 -39 -35 9 -34 6 -35 71 -1 41 4
62 14 70 18 15 50 7 70 -17z m280 11 c-5 -11 -15 -21 -21 -23 -13 -4 -14 -101
-3 -120 5 -8 1 -9 -10 -5 -10 4 -29 7 -42 7 -22 0 -24 3 -24 55 0 52 -1 55
-26 55 -19 0 -25 5 -22 18 2 13 17 18 68 23 36 3 71 6 78 7 9 2 10 -3 2 -17z
m178 -3 c3 -15 -4 -18 -32 -18 -25 0 -36 -4 -36 -15 0 -10 11 -15 35 -15 24 0
35 -5 35 -15 0 -11 -11 -15 -41 -15 -55 0 -47 -24 9 -28 29 -2 42 -8 42 -18 0
-16 -25 -17 -108 -7 l-53 6 2 56 c3 92 1 90 77 88 55 -2 67 -5 70 -19z m230
10 c18 -18 14 -56 -7 -77 -17 -17 -18 -21 -5 -40 14 -19 13 -21 -4 -21 -10 0
-28 11 -40 25 -24 27 -52 24 -52 -5 0 -24 -9 -29 -43 -23 -26 5 -27 7 -27 73
0 45 4 70 13 73 26 11 153 7 165 -5z m557 -2 c47 -20 47 -40 0 -32 -53 10 -77
-7 -73 -52 l3 -37 48 1 c26 0 47 -3 47 -6 0 -35 -108 -42 -140 -10 -29 29 -27
94 5 125 28 28 60 31 110 11z m213 -8 c3 -15 -4 -18 -38 -18 -50 0 -51 -22 -1
-30 44 -7 44 -24 -1 -28 -54 -5 -52 -32 2 -32 29 0 40 -4 40 -15 0 -17 -28
-19 -104 -9 l-46 7 0 72 0 72 72 -1 c61 -1 73 -4 76 -18z m312 6 c0 -9 -9 -18
-21 -21 -19 -5 -20 -12 -17 -69 3 -63 3 -63 -22 -58 -49 11 -50 12 -50 64 0
43 -3 50 -20 50 -13 0 -20 7 -20 20 0 17 8 20 68 23 37 2 70 4 75 5 4 1 7 -5
7 -14z m155 6 c65 -15 94 -73 62 -125 -14 -24 -25 -28 -92 -33 -44 -3 -54 0
-78 24 -34 34 -36 82 -4 111 37 34 53 37 112 23z m505 -3 c0 -8 -9 -40 -20
-72 -11 -31 -18 -60 -16 -64 3 -4 -9 -8 -25 -9 -25 -2 -31 3 -51 45 l-22 47
-21 -46 c-17 -38 -25 -47 -51 -50 -24 -3 -30 0 -32 17 -1 12 -8 40 -17 64 -21
59 -20 61 20 61 27 0 35 -4 35 -17 0 -10 4 -24 9 -32 7 -11 13 -6 25 23 14 35
18 37 53 34 32 -2 39 -7 41 -28 6 -43 19 -43 36 -1 15 40 36 55 36 28z m136
-4 c27 -45 64 -115 64 -122 0 -13 -42 -22 -54 -12 -6 5 -28 11 -49 15 -32 6
-38 4 -45 -13 -8 -24 -26 -16 -36 16 -5 16 -2 25 13 32 11 6 25 28 32 48 17
55 53 71 75 36z m840 -4 c22 -18 16 -32 -11 -25 -59 15 -94 -18 -74 -71 8 -21
15 -24 47 -22 40 3 66 -7 57 -21 -3 -5 -12 -7 -20 -3 -8 3 -15 1 -15 -4 0 -17
-111 4 -126 24 -26 34 -13 100 25 131 18 14 96 9 117 -9z m816 -54 l37 -70
-25 -8 c-16 -6 -30 -5 -40 3 -22 19 -81 22 -88 4 -7 -19 -26 -18 -26 1 0 8 -4
15 -10 15 -20 0 -9 21 15 30 24 9 30 24 27 63 -1 10 2 16 7 13 5 -3 12 1 15
10 4 9 15 14 28 12 17 -2 33 -22 60 -73z m183 61 c47 -20 47 -40 0 -32 -46 9
-75 -7 -75 -42 0 -45 13 -56 59 -49 30 4 41 2 41 -8 0 -32 -95 -35 -134 -4
-30 24 -34 64 -11 109 22 43 60 51 120 26z m398 4 c19 0 24 -26 6 -32 -13 -4
-16 -42 -5 -84 l7 -32 -55 -1 c-57 0 -68 7 -41 29 17 14 21 90 5 90 -5 0 -10
10 -10 21 0 19 4 21 38 15 20 -3 45 -6 55 -6z m117 0 c5 0 17 -13 27 -30 9
-16 21 -30 25 -30 4 0 8 14 8 30 0 28 3 30 36 30 l36 0 -5 -71 c-2 -42 -9 -74
-17 -79 -15 -9 -50 -1 -50 12 0 5 -11 25 -24 45 l-24 35 -9 -42 c-4 -23 -11
-41 -15 -41 -5 1 -19 1 -32 1 -23 0 -23 2 -20 67 3 66 15 88 42 78 8 -3 18 -5
22 -5z m317 -3 c21 -15 4 -27 -38 -27 -50 0 -49 -23 1 -30 50 -8 51 -30 1 -30
-30 0 -41 -4 -41 -15 0 -11 12 -15 45 -15 33 0 45 -4 45 -15 0 -17 -24 -19
-108 -8 l-54 6 6 66 c3 36 5 69 6 72 0 11 124 7 137 -4z m-4374 -7 c9 0 17 -4
17 -10 0 -5 -16 -10 -35 -10 -28 0 -35 -4 -35 -19 0 -15 8 -21 35 -23 20 -2
35 -7 35 -13 0 -5 -15 -11 -35 -13 -30 -3 -35 -7 -35 -28 0 -18 -5 -24 -23
-24 -13 0 -28 -5 -33 -10 -7 -7 -11 9 -13 51 -1 35 -6 70 -11 79 -7 13 -2 16
28 18 20 2 39 5 41 8 3 3 15 3 26 0 11 -3 28 -6 38 -6z m1856 -14 c23 -21 38
-20 51 4 6 11 17 20 25 20 16 0 20 -16 6 -24 -17 -11 -50 -94 -44 -114 4 -18
0 -20 -34 -19 l-38 2 3 40 c3 33 -1 45 -22 64 -36 34 -34 53 5 47 17 -2 39
-12 48 -20z m299 -18 c-3 -24 -1 -55 3 -70 6 -24 4 -29 -14 -32 -41 -9 -155
-14 -163 -7 -5 3 -10 36 -12 73 l-2 67 67 4 c38 2 81 4 97 5 27 2 28 1 24 -40z
m512 22 c0 -11 4 -20 9 -20 4 0 20 9 34 20 25 20 57 27 57 12 0 -5 -14 -18
-30 -31 l-30 -22 26 -44 c24 -41 24 -45 7 -45 -10 0 -27 14 -37 31 -21 35 -40
34 -44 -4 -3 -22 -8 -27 -32 -27 -39 0 -43 11 -35 86 l7 64 34 0 c27 0 34 -4
34 -20z m511 12 c0 -4 1 -36 2 -72 l2 -65 -32 -3 c-28 -3 -32 0 -39 30 l-7 33
-14 -33 c-16 -40 -34 -41 -51 -2 -16 35 -35 31 -26 -6 6 -22 3 -24 -30 -24
l-36 0 -1 55 c-1 30 -2 61 -3 68 -1 7 14 13 34 15 33 3 38 -1 59 -39 l24 -42
18 24 c10 13 19 29 19 35 0 5 4 14 10 20 11 11 70 16 71 6z m509 -28 c0 -31 3
-35 23 -32 17 2 23 11 25 36 3 29 6 32 36 32 l34 0 1 -75 1 -75 -29 0 c-23 0
-30 5 -35 26 -5 19 -12 25 -29 22 -17 -2 -22 -10 -22 -30 1 -24 -2 -27 -25
-22 -45 10 -50 13 -50 33 0 11 -6 21 -12 24 -10 4 -10 7 0 18 6 7 12 25 12 39
0 34 7 40 42 40 25 0 28 -3 28 -36z"/>
<path d="M800 860 c30 -24 44 -25 36 -4 -3 9 -6 18 -6 20 0 2 -12 4 -27 4
l-28 0 25 -20z"/>
<path d="M310 850 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M366 851 c-8 -12 21 -34 33 -27 6 4 8 13 4 21 -6 17 -29 20 -37 6z"/>
<path d="M920 586 c0 -9 7 -16 16 -16 9 0 14 5 12 12 -6 18 -28 21 -28 4z"/>
<path d="M965 419 c-4 -6 -5 -13 -2 -16 7 -7 27 6 27 18 0 12 -17 12 -25 -2z"/>
<path d="M362 388 c3 -7 15 -14 29 -16 24 -4 24 -3 4 12 -24 19 -38 20 -33 4z"/>
<path d="M4106 883 c-14 -14 -5 -31 14 -26 11 3 20 9 20 13 0 10 -26 20 -34
13z"/>
<path d="M4590 870 c-14 -10 -22 -22 -18 -25 7 -8 57 25 58 38 0 12 -14 8 -40
-13z"/>
<path d="M4380 655 c7 -8 17 -15 22 -15 6 0 5 7 -2 15 -7 8 -17 15 -22 15 -6
0 -5 -7 2 -15z"/>
<path d="M4082 560 c-6 -11 -12 -28 -12 -37 0 -13 6 -10 20 12 11 17 20 33 20
38 0 14 -15 7 -28 -13z"/>
<path d="M4496 466 c3 -9 11 -16 16 -16 13 0 5 23 -10 28 -7 2 -10 -2 -6 -12z"/>
<path d="M4236 445 c-9 -24 5 -41 16 -20 7 11 7 20 0 27 -6 6 -12 3 -16 -7z"/>
<path d="M4540 400 c0 -5 5 -10 11 -10 5 0 7 5 4 10 -3 6 -8 10 -11 10 -2 0
-4 -4 -4 -10z"/>
<path d="M5330 891 c0 -11 26 -22 34 -14 3 3 3 10 0 14 -7 12 -34 11 -34 0z"/>
<path d="M4805 880 c-8 -13 4 -32 16 -25 12 8 12 35 0 35 -6 0 -13 -4 -16 -10z"/>
<path d="M5070 821 l-35 -6 0 -75 0 -75 40 -3 c22 -2 58 3 80 10 38 12 40 16
47 63 12 88 -16 107 -132 86z m109 -36 c3 -19 2 -19 -15 -4 -11 9 -26 19 -34
22 -8 4 -2 5 15 4 21 -1 31 -8 34 -22z"/>
<path d="M5411 694 c0 -11 3 -14 6 -6 3 7 2 16 -1 19 -3 4 -6 -2 -5 -13z"/>
<path d="M5223 674 c-10 -22 -10 -25 3 -20 9 3 18 6 20 6 2 0 4 9 4 20 0 28
-13 25 -27 -6z"/>
<path d="M5001 422 c-14 -27 -12 -35 8 -23 7 5 11 17 9 27 -4 17 -5 17 -17 -4z"/>
<path d="M5673 883 c9 -9 19 -14 23 -11 10 10 -6 28 -24 28 -15 0 -15 -1 1
-17z"/>
<path d="M5866 717 c-14 -10 -16 -16 -7 -22 15 -9 35 8 30 24 -3 8 -10 7 -23
-2z"/>
<path d="M5700 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M5700 451 c0 -23 25 -46 34 -32 4 6 -2 19 -14 31 -19 19 -20 19 -20
1z"/>
<path d="M1375 850 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M1391 687 c-5 -12 -7 -35 -6 -50 2 -15 -1 -27 -7 -27 -5 0 -6 9 -3
21 5 15 4 19 -4 15 -6 -4 -11 -18 -11 -30 0 -19 7 -25 33 -29 17 -2 42 1 55 7
l22 12 -27 52 c-29 57 -39 63 -52 29z"/>
<path d="M1240 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M1575 490 c4 -14 9 -27 11 -29 7 -7 34 9 34 20 0 7 -3 9 -7 6 -3 -4
-15 1 -26 10 -19 17 -19 17 -12 -7z"/>
<path d="M3094 688 c-4 -13 -7 -35 -6 -50 1 -16 -2 -28 -8 -28 -5 0 -6 7 -3
17 4 11 3 14 -5 9 -16 -10 -15 -49 1 -43 6 2 20 0 29 -4 10 -6 27 -5 41 2 28
13 26 30 -8 86 -24 39 -31 41 -41 11z"/>
<path d="M3270 502 c0 -19 29 -47 39 -37 6 7 1 16 -15 28 -13 10 -24 14 -24 9z"/>
<path d="M3570 812 c-13 -10 -21 -24 -19 -31 3 -7 15 0 34 19 31 33 21 41 -15
12z"/>
<path d="M3855 480 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M3585 450 c3 -5 13 -10 21 -10 8 0 12 5 9 10 -3 6 -13 10 -21 10 -8
0 -12 -4 -9 -10z"/>
<path d="M1880 820 c0 -5 7 -10 16 -10 8 0 12 5 9 10 -3 6 -10 10 -16 10 -5 0
-9 -4 -9 -10z"/>
<path d="M2042 668 c-7 -7 -12 -23 -12 -37 1 -24 2 -24 16 8 16 37 14 47 -4
29z"/>
<path d="M2015 560 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M1915 470 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M2320 795 c0 -14 5 -25 10 -25 6 0 10 11 10 25 0 14 -4 25 -10 25 -5
0 -10 -11 -10 -25z"/>
<path d="M2660 771 c0 -6 5 -13 10 -16 6 -3 10 1 10 9 0 9 -4 16 -10 16 -5 0
-10 -4 -10 -9z"/>
<path d="M2487 763 c-4 -3 -7 -23 -7 -43 0 -36 1 -38 40 -43 68 -9 116 20 102
61 -3 10 -7 10 -18 1 -11 -9 -14 -7 -14 10 0 18 -6 21 -48 21 -27 0 -52 -3
-55 -7z"/>
<path d="M2320 719 c0 -5 5 -7 10 -4 6 3 10 8 10 11 0 2 -4 4 -10 4 -5 0 -10
-5 -10 -11z"/>
<path d="M2480 550 l0 -40 66 1 c58 1 67 4 76 25 18 39 -4 54 -78 54 l-64 0 0
-40z m40 15 c-7 -8 -16 -15 -21 -15 -5 0 -6 7 -3 15 4 8 13 15 21 15 13 0 13
-3 3 -15z"/>
<path d="M2665 527 c-4 -10 -5 -21 -1 -24 10 -10 18 4 13 24 -4 17 -4 17 -12
0z"/>
<path d="M1586 205 c-9 -23 -8 -25 9 -25 17 0 19 9 6 28 -7 11 -10 10 -15 -3z"/>
<path d="M3727 200 c-3 -13 0 -20 9 -20 15 0 19 26 5 34 -5 3 -11 -3 -14 -14z"/>
<path d="M1194 229 c-3 -6 -2 -15 3 -20 13 -13 43 -1 43 17 0 16 -36 19 -46 3z"/>
<path d="M2470 224 c-18 -46 -12 -73 15 -80 37 -9 52 1 59 40 5 26 3 41 -8 51
-23 24 -55 18 -66 -11z"/>
<path d="M3120 196 c0 -9 7 -16 16 -16 17 0 14 22 -4 28 -7 2 -12 -3 -12 -12z"/>
<path d="M4750 201 c0 -12 5 -21 10 -21 6 0 10 6 10 14 0 8 -4 18 -10 21 -5 3
-10 -3 -10 -14z"/>
<path d="M3515 229 c-8 -12 14 -31 30 -26 6 2 10 10 10 18 0 17 -31 24 -40 8z"/>
<path d="M3521 161 c-7 -5 -9 -11 -4 -14 14 -9 54 4 47 14 -7 11 -25 11 -43 0z"/>
</g>
</svg>


Binary file not shown.



@ -1,85 +1,14 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
id="svg8"
version="1.1"
viewBox="0 0 176.61171 41.907883"
height="41.907883mm"
width="176.61171mm">
<defs
id="defs2" />
<metadata
id="metadata5">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<g
transform="translate(-0.74835286,-98.31182)"
id="layer1">
<flowRoot
transform="scale(0.26458333)"
style="font-style:normal;font-weight:normal;font-size:40px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none"
id="flowRoot4598"
xml:space="preserve"><flowRegion
id="flowRegion4600"><rect
y="415.4129"
x="-38.183765"
height="48.08326"
width="257.38687"
id="rect4602" /></flowRegion><flowPara
id="flowPara4604"></flowPara></flowRoot> <text
transform="scale(0.86288797,1.158899)"
id="text4777"
y="110.93711"
x="0.93061"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;line-height:4.25;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke:none;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
xml:space="preserve"><tspan
style="stroke-width:7.51955223"
id="tspan4775"
y="110.93711"
x="0.93061"><tspan
id="tspan4773"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:3.56786728px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
y="110.93711"
x="0.93061">waybackpy</tspan></tspan></text>
<rect
y="98.311821"
x="1.4967092"
height="4.8643045"
width="153.78688"
id="rect4644"
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4648"
width="153.78688"
height="4.490128"
x="23.573174"
y="135.72957" />
<rect
y="135.72957"
x="0.74835336"
height="4.4901319"
width="22.82482"
id="rect4650"
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4652"
width="21.702286"
height="4.8643003"
x="155.2836"
y="98.311821" />
<?xml version="1.0" encoding="utf-8"?>
<svg width="711.80188pt" height="258.30469pt" viewBox="0 0 711.80188 258.30469" version="1.1" id="svg2" xmlns="http://www.w3.org/2000/svg">
<g id="surface1" transform="translate(-40.045801,-148)">
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 224.09 309.814 L 224.09 197.997 L 204.768 197.994 L 204.768 312.635 C 204.768 312.635 205.098 312.9 204.105 313.698 C 203.113 314.497 202.408 313.849 202.408 313.849 L 200.518 313.849 L 200.518 197.991 L 181.139 197.991 L 181.139 313.849 L 179.253 313.849 C 179.253 313.849 178.544 314.497 177.551 313.698 C 176.558 312.9 176.888 312.635 176.888 312.635 L 176.888 197.994 L 157.57 197.997 L 157.57 309.814 C 157.57 309.814 156.539 316.772 162.615 321.658 C 168.691 326.546 177.551 326.049 177.551 326.049 L 204.11 326.049 C 204.11 326.049 212.965 326.546 219.041 321.658 C 225.118 316.772 224.09 309.814 224.09 309.814" id="path5"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 253.892 299.821 C 253.892 299.821 253.632 300.965 251.888 300.965 C 250.143 300.965 249.629 299.821 249.629 299.821 L 249.629 278.477 C 249.629 278.477 249.433 278.166 250.078 277.645 C 250.726 277.124 251.243 277.179 251.243 277.179 L 253.892 277.228 Z M 251.588 199.144 C 230.266 199.144 231.071 213.218 231.071 213.218 L 231.071 254.303 L 249.675 254.303 L 249.675 213.69 C 249.675 213.69 249.775 211.276 251.787 211.276 C 253.8 211.276 254 213.542 254 213.542 L 254 265.146 L 246.156 265.146 C 246.156 265.146 240.022 264.579 235.495 268.22 C 230.968 271.858 231.071 276.791 231.071 276.791 L 231.071 298.955 C 231.071 298.955 229.461 308.016 238.914 312.058 C 248.368 316.103 254.805 309.795 254.805 309.795 L 254.805 312.706 L 272.508 312.706 L 272.508 212.895 C 272.508 212.895 272.907 199.144 251.588 199.144" id="path7"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 404.682 318.261 C 404.682 318.261 404.398 319.494 402.485 319.494 C 400.568 319.494 400.001 318.261 400.001 318.261 L 400.001 295.216 C 400.001 295.216 399.786 294.879 400.496 294.315 C 401.208 293.757 401.776 293.812 401.776 293.812 L 404.682 293.868 Z M 402.152 209.568 C 378.728 209.568 379.61 224.761 379.61 224.761 L 379.61 269.117 L 400.051 269.117 L 400.051 225.273 C 400.051 225.273 400.162 222.665 402.374 222.665 C 404.582 222.665 404.805 225.109 404.805 225.109 L 404.805 280.82 L 396.187 280.82 C 396.187 280.82 389.447 280.213 384.475 284.141 C 379.499 288.072 379.61 293.396 379.61 293.396 L 379.61 317.324 C 379.61 317.324 377.843 327.104 388.232 331.469 C 398.616 335.838 405.69 329.027 405.69 329.027 L 405.69 332.169 L 425.133 332.169 L 425.133 224.413 C 425.133 224.413 425.578 209.568 402.152 209.568" id="path9"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 321.114 328.636 L 321.114 206.587 L 302.582 206.587 L 302.582 304.902 C 302.582 304.902 303.211 307.094 300.624 307.094 C 298.035 307.094 298.316 304.902 298.316 304.902 L 298.316 206.587 L 279.784 206.587 C 279.784 206.587 279.922 304.338 279.922 306.756 C 279.922 309.175 280.27 310.526 280.831 312.379 C 281.391 314.238 282.579 318.116 290.901 319.186 C 299.224 320.256 302.44 315.813 302.44 315.813 L 302.44 327.736 C 302.44 327.736 302.862 329.366 300.554 329.366 C 298.246 329.366 298.316 327.849 298.316 327.849 L 298.316 322.957 L 279.642 322.957 L 279.642 327.791 C 279.642 327.791 278.523 341.514 300.274 341.514 C 322.026 341.514 321.114 328.636 321.114 328.636" id="path11"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 352.449 209.811 L 352.449 273.495 C 352.449 277.49 347.911 277.194 347.911 277.194 L 347.911 207.592 C 347.911 207.592 346.929 207.542 349.567 207.542 C 352.817 207.542 352.449 209.811 352.449 209.811 M 352.326 310.393 C 352.326 310.393 352.143 312.366 350.425 312.366 L 348.033 312.366 L 348.033 289.478 L 349.628 289.478 C 349.628 289.478 352.326 289.428 352.326 292.092 Z M 371.341 287.505 C 371.341 284.791 370.727 282.966 368.826 280.993 C 366.925 279.02 363.367 277.441 363.367 277.441 C 363.367 277.441 365.514 276.948 368.704 274.728 C 371.893 272.509 371.525 267.921 371.525 267.921 L 371.525 212.919 C 371.525 212.919 371.801 204.509 366.925 200.587 C 362.049 196.665 352.515 196.363 352.515 196.363 L 328.711 196.363 L 328.711 324.107 L 350.609 324.107 C 360.055 324.107 364.594 322.232 368.336 318.286 C 372.077 314.34 371.341 308.321 371.341 308.321 Z M 371.341 287.505" id="path13"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 452.747 226.744 L 452.747 268.806 L 471.581 268.806 L 471.581 227.459 C 471.581 227.459 471.846 213.532 450.516 213.532 C 429.182 213.532 430.076 227.533 430.076 227.533 L 430.076 313.381 C 430.076 313.381 428.825 327.523 450.872 327.523 C 472.919 327.523 471.401 313.526 471.401 313.526 L 471.401 292.064 L 452.835 292.064 L 452.835 314.389 C 452.835 314.389 452.923 315.61 450.961 315.61 C 448.997 315.61 448.729 314.389 448.729 314.389 L 448.729 226.524 C 448.729 226.524 448.821 225.378 450.692 225.378 C 452.566 225.378 452.747 226.744 452.747 226.744" id="path15"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 520.624 281.841 C 517.672 278.98 514.317 277.904 514.317 277.904 C 514.317 277.904 517.538 277.796 520.489 274.775 C 523.442 271.753 523.173 267.924 523.173 267.924 L 523.173 208.211 L 503.185 208.211 L 503.185 276.014 C 503.185 276.014 503.185 277.361 501.172 277.361 L 498.761 277.309 L 498.761 191.655 L 478.973 191.655 L 478.973 327.905 L 498.692 327.905 L 498.692 290.039 L 501.709 290.039 C 501.709 290.039 502.112 290.039 502.648 290.523 C 503.185 291.01 503.185 291.602 503.185 291.602 L 503.185 327.905 L 523.307 327.905 L 523.307 288.636 C 523.307 288.636 523.576 284.699 520.624 281.841" id="path17"/>
<path style="fill-opacity: 1; fill-rule: nonzero; stroke: none; fill: rgb(255, 222, 87);" d="M 638.021 327.182 L 638.021 205.132 L 619.489 205.132 L 619.489 303.448 C 619.489 303.448 620.119 305.64 617.53 305.64 C 614.944 305.64 615.223 303.448 615.223 303.448 L 615.223 205.132 L 596.692 205.132 C 596.692 205.132 596.83 302.884 596.83 305.301 C 596.83 307.721 597.178 309.071 597.738 310.924 C 598.299 312.784 599.487 316.662 607.809 317.732 C 616.132 318.802 619.349 314.359 619.349 314.359 L 619.349 326.281 C 619.349 326.281 619.77 327.913 617.462 327.913 C 615.154 327.913 615.223 326.396 615.223 326.396 L 615.223 321.502 L 596.55 321.502 L 596.55 326.336 C 596.55 326.336 595.43 340.059 617.182 340.059 C 638.934 340.059 638.021 327.182 638.021 327.182" id="path-1"/>
<path d="M 592.159 233.846 C 593.222 238.576 593.75 243.873 593.745 249.735 C 593.74 255.598 593.135 261.281 591.931 266.782 C 590.726 272.285 588.901 277.144 586.453 281.361 C 584.006 285.578 580.938 288.946 577.248 291.466 C 573.559 293.985 569.226 295.246 564.25 295.246 C 561.585 295.246 559.008 294.936 556.521 294.32 C 554.033 293.703 551.813 292.854 549.859 291.774 C 547.905 290.694 546.284 289.512 544.997 288.226 C 543.71 286.94 542.934 285.578 542.668 284.138 L 542.629 328.722 L 526.369 328.722 L 526.475 207.466 L 541.003 207.466 L 542.728 216.259 C 544.507 213.38 547.197 211.065 550.797 209.317 C 554.397 207.568 558.374 206.694 562.728 206.694 C 565.66 206.694 568.637 207.157 571.657 208.083 C 574.677 209.008 577.497 210.551 580.116 212.711 C 582.735 214.871 585.11 217.698 587.239 221.196 C 589.369 224.692 591.009 228.909 592.159 233.846 Z M 558.932 280.744 C 561.597 280.744 564.019 279.972 566.197 278.429 C 568.376 276.887 570.243 274.804 571.801 272.182 C 573.358 269.559 574.582 266.423 575.474 262.772 C 576.366 259.121 576.814 255.238 576.817 251.124 C 576.821 247.113 576.424 243.307 575.628 239.708 C 574.831 236.108 573.701 232.92 572.237 230.143 C 570.774 227.366 568.999 225.155 566.912 223.51 C 564.825 221.864 562.405 221.041 559.65 221.041 C 556.985 221.041 554.54 221.813 552.318 223.356 C 550.095 224.898 548.183 226.981 546.581 229.603 C 544.98 232.226 543.755 235.311 542.908 238.86 C 542.061 242.408 541.635 246.239 541.632 250.353 C 541.628 254.466 542.002 258.349 542.754 262 C 543.506 265.651 544.637 268.865 546.145 271.642 C 547.653 274.419 549.472 276.63 551.603 278.276 C 553.734 279.922 556.177 280.744 558.932 280.744 Z" style="fill: rgb(69, 132, 182); white-space: pre;"/>
</g>
</svg>
</svg>

(image diff: 3.6 KiB before, 8.3 KiB after)

pyproject.toml Normal file

@@ -0,0 +1,3 @@
[build-system]
requires = ["wheel", "setuptools"]
build-backend = "setuptools.build_meta"

requirements-dev.txt Normal file

@@ -0,0 +1,10 @@
black
click
codecov
flake8
mypy
pytest
pytest-cov
requests
setuptools>=46.4.0
types-requests


@@ -1 +1,3 @@
requests>=2.24.0
click
requests
urllib3

setup.cfg

@@ -1,7 +1,102 @@
[metadata]
description-file = README.md
license_file = LICENSE
name = waybackpy
version = attr: waybackpy.__version__
description = Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily.
long_description = file: README.md
long_description_content_type = text/markdown
license = MIT
author = Akash Mahanty
author_email = akamhy@yahoo.com
url = https://akamhy.github.io/waybackpy/
download_url = https://github.com/akamhy/waybackpy/releases
project_urls =
Documentation = https://github.com/akamhy/waybackpy/wiki
Source = https://github.com/akamhy/waybackpy
Tracker = https://github.com/akamhy/waybackpy/issues
keywords =
Archive Website
Wayback Machine
Internet Archive
Wayback Machine CLI
Wayback Machine Python
Internet Archiving
Availability API
CDX API
savepagenow
classifiers =
Development Status :: 5 - Production/Stable
Intended Audience :: Developers
Intended Audience :: End Users/Desktop
Natural Language :: English
Typing :: Typed
License :: OSI Approved :: MIT License
Programming Language :: Python
Programming Language :: Python :: 3
Programming Language :: Python :: 3.6
Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Programming Language :: Python :: 3.9
Programming Language :: Python :: 3.10
Programming Language :: Python :: Implementation :: CPython
[options]
packages = find:
include-package-data = True
python_requires = >= 3.6
install_requires =
click
requests
urllib3
[options.package_data]
waybackpy = py.typed
[options.extras_require]
dev =
black
codecov
flake8
mypy
pytest
pytest-cov
setuptools>=46.4.0
types-requests
[options.entry_points]
console_scripts =
waybackpy = waybackpy.cli:main
[isort]
profile = black
[flake8]
indent-size = 4
max-line-length = 88
extend-ignore = E203,W503
extend-ignore = W503,W605
exclude =
venv
__pycache__
.venv
./env
venv/
env
.env
./build
[mypy]
python_version = 3.9
show_error_codes = True
pretty = True
strict = True
[tool:pytest]
addopts =
# show summary of all tests that did not pass
-ra
# enable all warnings
-Wd
# coverage and html report
--cov=waybackpy
--cov-report=html
testpaths =
tests


@@ -1,54 +1,3 @@
import os.path
from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), "README.md")) as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")) as f:
exec(f.read(), about)
setup(
name=about["__title__"],
packages=["waybackpy"],
version=about["__version__"],
description=about["__description__"],
long_description=long_description,
long_description_content_type="text/markdown",
license=about["__license__"],
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/2.3.0.tar.gz",
keywords=[
"Archive It",
"Archive Website",
"Wayback Machine",
"waybackurls",
"Internet Archive",
],
install_requires=["requests"],
python_requires=">=3.4",
classifiers=[
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"Natural Language :: English",
"Topic :: Software Development :: Build Tools",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: Implementation :: CPython",
],
entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},
project_urls={
"Documentation": "https://akamhy.github.io/waybackpy/",
"Source": "https://github.com/akamhy/waybackpy",
"Tracker": "https://github.com/akamhy/waybackpy/issues",
},
)
setup()

snapcraft.yaml Normal file

@@ -0,0 +1,23 @@
name: waybackpy
summary: Wayback Machine API interface and a command-line tool
description: |
Waybackpy is a CLI tool that interfaces with the Wayback Machine APIs.
The Wayback Machine has three client-side public APIs: the Save API,
the Availability API, and the CDX API. All three can be accessed via
waybackpy from the terminal.
version: git
grade: stable
confinement: strict
base: core20
architectures:
- build-on: [arm64, armhf, amd64]
apps:
waybackpy:
command: bin/waybackpy
plugs: [home, network, network-bind, removable-media]
parts:
waybackpy:
plugin: python
source: https://github.com/akamhy/waybackpy.git


@@ -0,0 +1,113 @@
import random
import string
from datetime import datetime, timedelta
import pytest
from waybackpy.availability_api import WaybackMachineAvailabilityAPI
from waybackpy.exceptions import (
ArchiveNotInAvailabilityAPIResponse,
InvalidJSONInAvailabilityAPIResponse,
)
now = datetime.utcnow()
url = "https://example.com/"
user_agent = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
)
def rndstr(n: int) -> str:
return "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_oldest() -> None:
"""
Test the oldest archive of example.com and also check the attributes.
"""
url = "https://example.com/"
user_agent = (
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
)
availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
oldest = availability_api.oldest()
oldest_archive_url = oldest.archive_url
assert "2002" in oldest_archive_url
oldest_timestamp = oldest.timestamp()
assert abs(oldest_timestamp - now) > timedelta(days=7000) # More than 19 years
assert (
availability_api.json is not None
and availability_api.json["archived_snapshots"]["closest"]["available"] is True
)
assert repr(oldest).find("example.com") != -1
assert "2002" in str(oldest)
def test_newest() -> None:
"""
Assuming that the most recent YouTube archive was made within the
last three days (one day is 86400 seconds).
"""
url = "https://www.youtube.com/"
user_agent = "Mozilla/5.0 (X11; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0"
availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
newest = availability_api.newest()
newest_timestamp = newest.timestamp()
# betting in favor that the latest YouTube archive was made within the last
# 3 days; high traffic sites like YouTube are archived many times a day, so
# this seems very reasonable.
assert abs(newest_timestamp - now) < timedelta(seconds=86400 * 3)
def test_invalid_json() -> None:
"""
When the API is malfunctioning or we don't pass a URL,
it may return invalid JSON data.
"""
with pytest.raises(InvalidJSONInAvailabilityAPIResponse):
availability_api = WaybackMachineAvailabilityAPI(url="", user_agent=user_agent)
_ = availability_api.archive_url
def test_no_archive() -> None:
"""
ArchiveNotInAvailabilityAPIResponse may be raised if the Wayback Machine
did not reply with an archive even though we know the site has millions
of archives. The reason for this weird behavior is unknown.
The exception is also raised if there really are no archives for the
passed URL.
"""
with pytest.raises(ArchiveNotInAvailabilityAPIResponse):
availability_api = WaybackMachineAvailabilityAPI(
url=f"https://{rndstr(30)}.cn", user_agent=user_agent
)
_ = availability_api.archive_url
def test_no_api_call_str_repr() -> None:
"""
Some entitled users may want to see the string representation
even if they don't make any API requests.
str() must not return None, so we return "" instead.
"""
availability_api = WaybackMachineAvailabilityAPI(
url=f"https://{rndstr(30)}.gov", user_agent=user_agent
)
assert str(availability_api) == ""
def test_no_call_timestamp() -> None:
"""
If no API requests were made, the bound timestamp() method returns
datetime.max as a default value.
"""
availability_api = WaybackMachineAvailabilityAPI(
url=f"https://{rndstr(30)}.in", user_agent=user_agent
)
assert datetime.max == availability_api.timestamp()
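The Availability API tests above key off the `archived_snapshots` → `closest` structure of the API's JSON response. Below is a minimal sketch of parsing such a payload into an archive URL; the sample dict mirrors the shape asserted in the tests, with the 2002 example.com snapshot values filled in for illustration only, and `closest_archive_url` is a hypothetical helper name, not waybackpy's own API.

```python
from typing import Any, Dict, Optional


def closest_archive_url(data: Dict[str, Any]) -> Optional[str]:
    """Return the closest snapshot URL if the response contains one."""
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return str(closest["url"])
    return None


# Illustrative payload shaped like an Availability API response.
sample = {
    "url": "https://example.com/",
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20020120142510",
            "url": "http://web.archive.org/web/20020120142510/http://example.com",
        }
    },
}

print(closest_archive_url(sample))
print(closest_archive_url({}))  # no snapshots -> None
```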

tests/test_cdx_api.py Normal file

@@ -0,0 +1,178 @@
import random
import string
import pytest
from waybackpy.cdx_api import WaybackMachineCDXServerAPI
from waybackpy.exceptions import NoCDXRecordFound
def rndstr(n: int) -> str:
return "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_a() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
url = "https://twitter.com/jack"
wayback = WaybackMachineCDXServerAPI(
url=url,
user_agent=user_agent,
match_type="prefix",
collapses=["urlkey"],
start_timestamp="201001",
end_timestamp="201002",
)
# timeframe bound prefix matching enabled along with active urlkey based collapsing
snapshots = wayback.snapshots() # <class 'generator'>
for snapshot in snapshots:
assert snapshot.timestamp.startswith("2010")
def test_b() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
url = "https://www.google.com"
wayback = WaybackMachineCDXServerAPI(
url=url,
user_agent=user_agent,
start_timestamp="202101",
end_timestamp="202112",
collapses=["urlkey"],
)
# timeframe bound query with active urlkey based collapsing
snapshots = wayback.snapshots() # <class 'generator'>
for snapshot in snapshots:
assert snapshot.timestamp.startswith("2021")
def test_c() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
url = "https://www.google.com"
cdx = WaybackMachineCDXServerAPI(
url=url,
user_agent=user_agent,
closest="201010101010",
sort="closest",
limit="1",
)
snapshots = cdx.snapshots()
for snapshot in snapshots:
archive_url = snapshot.archive_url
timestamp = snapshot.timestamp
break
assert str(archive_url).find("google.com")
assert "20101010" in timestamp
def test_d() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="akamhy.github.io",
user_agent=user_agent,
match_type="prefix",
use_pagination=True,
filters=["statuscode:200"],
)
snapshots = cdx.snapshots()
count = 0
for snapshot in snapshots:
count += 1
assert str(snapshot.archive_url).find("akamhy.github.io")
assert count > 50
def test_oldest() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="google.com",
user_agent=user_agent,
filters=["statuscode:200"],
)
oldest = cdx.oldest()
assert "1998" in oldest.timestamp
assert "google" in oldest.urlkey
assert oldest.original.find("google.com") != -1
assert oldest.archive_url.find("google.com") != -1
def test_newest() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="google.com",
user_agent=user_agent,
filters=["statuscode:200"],
)
newest = cdx.newest()
assert "google" in newest.urlkey
assert newest.original.find("google.com") != -1
assert newest.archive_url.find("google.com") != -1
def test_near() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="google.com",
user_agent=user_agent,
filters=["statuscode:200"],
)
near = cdx.near(year=2010, month=10, day=10, hour=10, minute=10)
assert "2010101010" in near.timestamp
assert "google" in near.urlkey
assert near.original.find("google.com") != -1
assert near.archive_url.find("google.com") != -1
near = cdx.near(wayback_machine_timestamp="201010101010")
assert "2010101010" in near.timestamp
assert "google" in near.urlkey
assert near.original.find("google.com") != -1
assert near.archive_url.find("google.com") != -1
near = cdx.near(unix_timestamp=1286705410)
assert "2010101010" in near.timestamp
assert "google" in near.urlkey
assert near.original.find("google.com") != -1
assert near.archive_url.find("google.com") != -1
with pytest.raises(NoCDXRecordFound):
dne_url = f"https://{rndstr(30)}.in"
cdx = WaybackMachineCDXServerAPI(
url=dne_url,
user_agent=user_agent,
filters=["statuscode:200"],
)
cdx.near(unix_timestamp=1286705410)
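`test_near` above accepts either date components or a Unix timestamp, and both resolve to snapshots matching `2010101010`. A sketch of how such inputs map to the `YYYYMMDDhhmmss` timestamps used by the CDX API; the helper names are illustrative, not waybackpy's own:

```python
from datetime import datetime, timezone


def wayback_timestamp(year: int, month: int, day: int, hour: int, minute: int) -> str:
    """Format date components as the 12-digit prefix of a Wayback timestamp."""
    return f"{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}"


def from_unix(unix_timestamp: int) -> str:
    """Convert a Unix timestamp to the YYYYMMDDhhmmss form used by the CDX API."""
    return datetime.fromtimestamp(unix_timestamp, tz=timezone.utc).strftime("%Y%m%d%H%M%S")


print(wayback_timestamp(2010, 10, 10, 10, 10))  # 201010101010
print(from_unix(1286705410))  # 20101010101010 (2010-10-10 10:10:10 UTC)
```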


@@ -0,0 +1,44 @@
from datetime import datetime
from waybackpy.cdx_snapshot import CDXSnapshot
def test_CDXSnapshot() -> None:
sample_input = (
"org,archive)/ 20080126045828 http://github.com "
"text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415"
)
prop_values = sample_input.split(" ")
properties = {}
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
snapshot = CDXSnapshot(properties)
assert properties["urlkey"] == snapshot.urlkey
assert properties["timestamp"] == snapshot.timestamp
assert properties["original"] == snapshot.original
assert properties["mimetype"] == snapshot.mimetype
assert properties["statuscode"] == snapshot.statuscode
assert properties["digest"] == snapshot.digest
assert properties["length"] == snapshot.length
assert (
datetime.strptime(properties["timestamp"], "%Y%m%d%H%M%S")
== snapshot.datetime_timestamp
)
archive_url = (
"https://web.archive.org/web/"
+ properties["timestamp"]
+ "/"
+ properties["original"]
)
assert archive_url == snapshot.archive_url
assert sample_input == str(snapshot)
assert sample_input == repr(snapshot)

tests/test_cdx_utils.py Normal file

@@ -0,0 +1,113 @@
from typing import Any, Dict, List
import pytest
from waybackpy.cdx_utils import (
check_collapses,
check_filters,
check_match_type,
check_sort,
full_url,
get_response,
get_total_pages,
)
from waybackpy.exceptions import WaybackError
def test_get_total_pages() -> None:
url = "twitter.com"
user_agent = (
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.0.2 Safari/605.1.15"
)
assert get_total_pages(url=url, user_agent=user_agent) >= 56
def test_full_url() -> None:
endpoint = "https://web.archive.org/cdx/search/cdx"
params: Dict[str, Any] = {}
assert endpoint == full_url(endpoint, params)
params = {"a": "1"}
assert full_url(endpoint, params) == "https://web.archive.org/cdx/search/cdx?a=1"
assert (
full_url(endpoint + "?", params) == "https://web.archive.org/cdx/search/cdx?a=1"
)
params["b"] = 2
assert (
full_url(endpoint + "?", params)
== "https://web.archive.org/cdx/search/cdx?a=1&b=2"
)
params["c"] = "foo bar"
assert (
full_url(endpoint + "?", params)
== "https://web.archive.org/cdx/search/cdx?a=1&b=2&c=foo%20bar"
)
def test_get_response() -> None:
url = "https://github.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": str(user_agent)}
response = get_response(url, headers=headers)
assert not isinstance(response, Exception) and response.status_code == 200
def test_check_filters() -> None:
filters: List[str] = []
check_filters(filters)
filters = ["statuscode:200", "timestamp:20215678901234", "original:https://url.com"]
check_filters(filters)
with pytest.raises(WaybackError):
check_filters("not-list") # type: ignore[arg-type]
with pytest.raises(WaybackError):
check_filters(["invalid"])
def test_check_collapses() -> None:
collapses: List[str] = []
check_collapses(collapses)
collapses = ["timestamp:10"]
check_collapses(collapses)
collapses = ["urlkey"]
check_collapses(collapses)
collapses = "urlkey" # type: ignore[assignment]
with pytest.raises(WaybackError):
check_collapses(collapses)
collapses = ["also illegal collapse"]
with pytest.raises(WaybackError):
check_collapses(collapses)
def test_check_match_type() -> None:
assert check_match_type(None, "url")
match_type = "exact"
url = "test_url"
assert check_match_type(match_type, url)
url = "has * in it"
with pytest.raises(WaybackError):
check_match_type("domain", url)
with pytest.raises(WaybackError):
check_match_type("not a valid type", "url")
def test_check_sort() -> None:
assert check_sort("default")
assert check_sort("closest")
assert check_sort("reverse")
with pytest.raises(WaybackError):
assert check_sort("random crap")
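The behavior pinned down by `test_full_url` above (empty params return the endpoint unchanged, a trailing `?` is tolerated, spaces are encoded as `%20`) can be sketched as a small helper. This is an illustration consistent with those assertions, not waybackpy's actual implementation:

```python
from typing import Any, Dict
from urllib.parse import quote


def build_full_url(endpoint: str, params: Dict[str, Any]) -> str:
    """Append URL-encoded query parameters to an endpoint URL."""
    if not params:
        return endpoint
    base = endpoint.rstrip("?")  # tolerate an endpoint already ending in "?"
    query = "&".join(f"{key}={quote(str(value))}" for key, value in params.items())
    return f"{base}?{query}"


endpoint = "https://web.archive.org/cdx/search/cdx"
print(build_full_url(endpoint, {"a": "1", "c": "foo bar"}))
# https://web.archive.org/cdx/search/cdx?a=1&c=foo%20bar
```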


@@ -1,307 +1,136 @@
# -*- coding: utf-8 -*-
import sys
import os
import pytest
import argparse
import requests
from click.testing import CliRunner
sys.path.append("..")
import waybackpy.cli as cli # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
# Namespace(day=None, get=None, hour=None, minute=None, month=None, near=False,
# newest=False, oldest=False, save=False, total=False, url=None, user_agent=None, version=False, year=None)
from waybackpy import __version__
from waybackpy.cli import main
def test_save():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=True,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
def test_oldest() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--url", " https://github.com ", "--oldest"])
assert result.exit_code == 0
assert (
result.output
== "Archive URL:\nhttps://web.archive.org/web/2008051421\
0148/http://github.com/\n"
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
def test_json():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=True,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
def test_near() -> None:
runner = CliRunner()
result = runner.invoke(
main,
[
"--url",
" https://facebook.com ",
"--near",
"--year",
"2010",
"--month",
"5",
"--day",
"10",
"--hour",
"6",
],
)
reply = cli.args_handler(args)
assert "archived_snapshots" in str(reply)
def test_archive_url():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=True,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
assert result.exit_code == 0
assert (
result.output
== "Archive URL:\nhttps://web.archive.org/web/2010051008\
2647/http://www.facebook.com/\n"
)
reply = cli.args_handler(args)
assert "https://web.archive.org/web/" in str(reply)
def test_oldest():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
def test_newest() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--url", " https://microsoft.com ", "--newest"])
assert result.exit_code == 0
assert (
result.output.find("microsoft.com") != -1
and result.output.find("Archive URL:\n") != -1
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
def test_newest():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
def test_cdx() -> None:
runner = CliRunner()
result = runner.invoke(
main,
"--url https://twitter.com/jack --cdx --user-agent some-user-agent \
--start-timestamp 2010 --end-timestamp 2012 --collapse urlkey \
--match-type prefix --cdx-print archiveurl --cdx-print length \
--cdx-print digest --cdx-print statuscode --cdx-print mimetype \
--cdx-print original --cdx-print timestamp --cdx-print urlkey".split(
" "
),
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
assert result.exit_code == 0
assert result.output.count("\n") > 3000
def test_total_archives():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=True,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
def test_save() -> None:
runner = CliRunner()
result = runner.invoke(
main,
"--url https://yahoo.com --user_agent my-unique-user-agent \
--save --headers".split(
" "
),
)
reply = cli.args_handler(args)
assert isinstance(reply, int)
def test_known_urls():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://akamhy.github.io",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=True,
subdomain=True,
known_urls=True,
get=None,
assert result.exit_code == 0
assert result.output.find("Archive URL:") != -1
assert (result.output.find("Cached save:\nTrue") != -1) or (
result.output.find("Cached save:\nFalse") != -1
)
reply = cli.args_handler(args)
assert "github" in str(reply)
assert result.output.find("Save API headers:\n") != -1
assert result.output.find("yahoo.com") != -1
def test_near():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
alive=False,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
def test_version() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--version"])
assert result.exit_code == 0
assert result.output == f"waybackpy version {__version__}\n"
def test_license() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--license"])
assert result.exit_code == 0
assert (
result.output
== requests.get(
url="https://raw.githubusercontent.com/akamhy/waybackpy/master/LICENSE"
).text
+ "\n"
)
reply = cli.args_handler(args)
assert "202007" in str(reply)
def test_get():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="url",
def test_only_url() -> None:
runner = CliRunner()
result = runner.invoke(main, ["--url", "https://google.com"])
assert result.exit_code == 0
assert (
result.output
== "NoCommandFound: Only URL passed, but did not specify what to do with the URL. Use \
--help flag for help using waybackpy.\n"
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="oldest",
def test_known_url() -> None:
# with file generator enabled
runner = CliRunner()
result = runner.invoke(
main, ["--url", "https://akamhy.github.io", "--known-urls", "--file"]
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
assert result.exit_code == 0
assert result.output.count("\n") > 40
assert result.output.count("akamhy.github.io") > 40
assert result.output.find("in the current working directory.\n") != -1
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="newest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="save",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="BullShit",
)
reply = cli.args_handler(args)
assert "get the source code of the" in str(reply)
def test_args_handler():
args = argparse.Namespace(version=True)
reply = cli.args_handler(args)
assert ("waybackpy version %s" % (__version__)) == reply
args = argparse.Namespace(url=None, version=False)
reply = cli.args_handler(args)
assert ("waybackpy %s" % (__version__)) in str(reply)
def test_main():
# This also tests the parse_args method in cli.py
cli.main(["temp.py", "--version"])
# without file
runner = CliRunner()
result = runner.invoke(main, ["--url", "https://akamhy.github.io", "--known-urls"])
assert result.exit_code == 0
assert result.output.count("\n") > 40
assert result.output.count("akamhy.github.io") > 40

tests/test_save_api.py Normal file

@@ -0,0 +1,223 @@
import random
import string
import time
from datetime import datetime
from typing import cast
import pytest
from requests.structures import CaseInsensitiveDict
from waybackpy.exceptions import MaximumSaveRetriesExceeded
from waybackpy.save_api import WaybackMachineSaveAPI
def rndstr(n: int) -> str:
return "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_save() -> None:
url = "https://github.com/akamhy/waybackpy"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.save()
archive_url = save_api.archive_url
timestamp = save_api.timestamp()
headers = save_api.headers # CaseInsensitiveDict
cached_save = save_api.cached_save
assert cached_save in [True, False]
assert archive_url.find("github.com/akamhy/waybackpy") != -1
assert timestamp is not None
assert str(headers).find("github.com/akamhy/waybackpy") != -1
assert isinstance(save_api.timestamp(), datetime)
def test_max_redirect_exceeded() -> None:
with pytest.raises(MaximumSaveRetriesExceeded):
url = f"https://{rndstr(30)}.gov"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent, max_tries=3)
save_api.save()
def test_sleep() -> None:
"""
Sleeping is actually very important for SaveAPI interface stability.
This test checks that the time taken by the sleep method is as
intended.
"""
url = "https://example.com"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
s_time = int(time.time())
save_api.sleep(6)  # tries that are a multiple of 3 sleep for 10 seconds
e_time = int(time.time())
assert (e_time - s_time) >= 10
s_time = int(time.time())
save_api.sleep(7) # sleeps for 5 seconds
e_time = int(time.time())
assert (e_time - s_time) >= 5
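The durations asserted above imply a simple retry schedule: attempts whose count is a multiple of 3 sleep about 10 seconds, all others about 5. A sketch of that inferred schedule follows; `sleep_seconds` is a hypothetical helper name, and the real `WaybackMachineSaveAPI.sleep` may differ in detail:

```python
def sleep_seconds(tries: int) -> int:
    """Return how long a given retry attempt should sleep, per the
    schedule inferred from test_sleep: every 3rd try sleeps longer."""
    return 10 if tries % 3 == 0 else 5


print([sleep_seconds(t) for t in range(1, 8)])  # [5, 5, 10, 5, 5, 10, 5]
```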
def test_timestamp() -> None:
url = "https://example.com"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
save_api._archive_url = f"https://web.archive.org/web/{now}/{url}/"
save_api.timestamp()
assert save_api.cached_save is False
now = "20100124063622"
save_api._archive_url = f"https://web.archive.org/web/{now}/{url}/"
save_api.timestamp()
assert save_api.cached_save is True
def test_archive_url_parser() -> None:
"""
Testing three regex for matches and also tests the response URL.
"""
url = "https://example.com"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
h = (
"\nSTART\nContent-Location: "
"/web/20201126185327/https://www.scribbr.com/citing-sources/et-al"
"\nEND\n"
)
save_api.headers = h # type: ignore[assignment]
expected_url = (
"https://web.archive.org/web/20201126185327/"
"https://www.scribbr.com/citing-sources/et-al"
)
assert save_api.archive_url_parser() == expected_url
headers = {
"Server": "nginx/1.15.8",
"Date": "Sat, 02 Jan 2021 09:40:25 GMT",
"Content-Type": "text/html; charset=UTF-8",
"Transfer-Encoding": "chunked",
"Connection": "keep-alive",
"X-Archive-Orig-Server": "nginx",
"X-Archive-Orig-Date": "Sat, 02 Jan 2021 09:40:09 GMT",
"X-Archive-Orig-Transfer-Encoding": "chunked",
"X-Archive-Orig-Connection": "keep-alive",
"X-Archive-Orig-Vary": "Accept-Encoding",
"X-Archive-Orig-Last-Modified": "Fri, 01 Jan 2021 12:19:00 GMT",
"X-Archive-Orig-Strict-Transport-Security": "max-age=31536000, max-age=0;",
"X-Archive-Guessed-Content-Type": "text/html",
"X-Archive-Guessed-Charset": "utf-8",
"Memento-Datetime": "Sat, 02 Jan 2021 09:40:09 GMT",
"Link": (
'<https://www.scribbr.com/citing-sources/et-al/>; rel="original", '
"<https://web.archive.org/web/timemap/link/https://www.scribbr.com/"
'citing-sources/et-al/>; rel="timemap"; type="application/link-format", '
"<https://web.archive.org/web/https://www.scribbr.com/citing-sources/"
'et-al/>; rel="timegate", <https://web.archive.org/web/20200601082911/'
'https://www.scribbr.com/citing-sources/et-al/>; rel="first memento"; '
'datetime="Mon, 01 Jun 2020 08:29:11 GMT", <https://web.archive.org/web/'
"20201126185327/https://www.scribbr.com/citing-sources/et-al/>; "
'rel="prev memento"; datetime="Thu, 26 Nov 2020 18:53:27 GMT", '
"<https://web.archive.org/web/20210102094009/https://www.scribbr.com/"
'citing-sources/et-al/>; rel="memento"; datetime="Sat, 02 Jan 2021 '
'09:40:09 GMT", <https://web.archive.org/web/20210102094009/'
"https://www.scribbr.com/citing-sources/et-al/>; "
'rel="last memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT"'
),
"Content-Security-Policy": (
"default-src 'self' 'unsafe-eval' 'unsafe-inline' "
"data: blob: archive.org web.archive.org analytics.archive.org "
"pragma.archivelab.org"
),
"X-Archive-Src": "spn2-20210102092956-wwwb-spn20.us.archive.org-8001.warc.gz",
"Server-Timing": (
"captures_list;dur=112.646325, exclusion.robots;dur=0.172010, "
"exclusion.robots.policy;dur=0.158205, RedisCDXSource;dur=2.205932, "
"esindex;dur=0.014647, LoadShardBlock;dur=82.205012, "
"PetaboxLoader3.datanode;dur=70.750239, CDXLines.iter;dur=24.306278, "
"load_resource;dur=26.520179"
),
"X-App-Server": "wwwb-app200",
"X-ts": "200",
"X-location": "All",
"X-Cache-Key": (
"httpsweb.archive.org/web/20210102094009/"
"https://www.scribbr.com/citing-sources/et-al/IN"
),
"X-RL": "0",
"X-Page-Cache": "MISS",
"X-Archive-Screenname": "0",
"Content-Encoding": "gzip",
}
save_api.headers = cast(CaseInsensitiveDict[str], headers)
expected_url2 = (
"https://web.archive.org/web/20210102094009/"
"https://www.scribbr.com/citing-sources/et-al/"
)
assert save_api.archive_url_parser() == expected_url2
expected_url_3 = (
"https://web.archive.org/web/20171128185327/"
"https://www.scribbr.com/citing-sources/et-al/US"
)
h = f"START\nX-Cache-Key: {expected_url_3}\nEND\n"
save_api.headers = h # type: ignore[assignment]
expected_url4 = (
"https://web.archive.org/web/20171128185327/"
"https://www.scribbr.com/citing-sources/et-al/"
)
assert save_api.archive_url_parser() == expected_url4
h = "TEST TEST TEST AND NO MATCH - TEST FOR RESPONSE URL MATCHING"
save_api.headers = h # type: ignore[assignment]
save_api.response_url = (
"https://web.archive.org/web/20171128185327/"
"https://www.scribbr.com/citing-sources/et-al"
)
expected_url5 = (
"https://web.archive.org/web/20171128185327/"
"https://www.scribbr.com/citing-sources/et-al"
)
assert save_api.archive_url_parser() == expected_url5
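The three lookups exercised above (the Content-Location header, then the X-Cache-Key header, then the response URL) can be sketched with plain regexes. This is a minimal illustration driven by the test data above, not the library's exact implementation; the helper name is hypothetical:

```python
import re

def parse_archive_url(headers_text: str, response_url: str = "") -> str:
    # Sketch of the fallback chain the tests above exercise (hypothetical helper).
    # 1. Content-Location carries a /web/<14-digit-timestamp>/... path.
    match = re.search(r"Content-Location: (/web/\d{14}/\S+)", headers_text)
    if match:
        return "https://web.archive.org" + match.group(1)
    # 2. X-Cache-Key embeds the archive URL followed by a two-letter region code.
    match = re.search(
        r"X-Cache-Key:\s?(https?://web\.archive\.org/web/\d{14}/\S+?)[A-Z]{2}\b",
        headers_text,
    )
    if match:
        return match.group(1)
    # 3. Fall back to the response URL if it already looks like an archive URL.
    if "web.archive.org/web/" in response_url:
        return response_url
    raise ValueError("no archive URL found in headers or response URL")
```

Feeding it the same header strings as the tests above yields the same expected URLs, with the two-letter region suffix stripped in the X-Cache-Key case.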
def test_archive_url() -> None:
"""
Checks the value of the archive_url attribute when the save method was not
explicitly invoked by the end-user but was invoked implicitly by accessing
archive_url, which is a property.
"""
url = "https://example.com"
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 "
"(KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.saved_archive = (
"https://web.archive.org/web/20220124063056/https://example.com/"
)
save_api._archive_url = save_api.saved_archive
assert save_api.archive_url == save_api.saved_archive
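test_timestamp() above relies on the fact that a fresh capture carries a timestamp close to now, while a cached capture carries an old one. A standalone sketch of that check, assuming a simple age threshold (the helper name and the 30-minute tolerance are illustrative, not the library's actual values):

```python
import re
from datetime import datetime

def is_cached_save(archive_url: str, tolerance_minutes: int = 30) -> bool:
    # Extract the 14-digit wayback timestamp from the archive URL and
    # treat a capture stamped well before "now" (UTC) as a cached save.
    match = re.search(r"/web/(\d{14})/", archive_url)
    if match is None:
        raise ValueError("not a Wayback Machine archive URL")
    stamped = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    age_seconds = (datetime.utcnow() - stamped).total_seconds()
    return age_seconds > tolerance_minutes * 60
```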

tests/test_utils.py (new file)
@@ -0,0 +1,9 @@
from waybackpy import __version__
from waybackpy.utils import DEFAULT_USER_AGENT
def test_default_user_agent() -> None:
assert (
DEFAULT_USER_AGENT
== f"waybackpy {__version__} - https://github.com/akamhy/waybackpy"
)

@@ -1,184 +1,45 @@
# -*- coding: utf-8 -*-
import sys
import pytest
import random
import requests
sys.path.append("..")
import waybackpy.wrapper as waybackpy # noqa: E402
from waybackpy.wrapper import Url
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_clean_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network_security"
target = waybackpy.Url(test_url, user_agent)
test_result = target._clean_url()
assert answer == test_result
def test_dunders():
url = "https://en.wikipedia.org/wiki/Network_security"
user_agent = "UA"
target = waybackpy.Url(url, user_agent)
assert "waybackpy.Url(url=%s, user_agent=%s)" % (url, user_agent) == repr(target)
assert "en.wikipedia.org" in str(target)
def test_archive_url_parser():
endpoint = "https://amazon.com"
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
headers = {"User-Agent": "%s" % user_agent}
response = waybackpy._get_response(endpoint, params=None, headers=headers)
header = response.headers
with pytest.raises(Exception):
waybackpy._archive_url_parser(header)
def test_url_check():
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
waybackpy.Url(broken_url, user_agent)
def test_save():
# Test for urls that exist and can be archived.
url_list = [
"en.wikipedia.org",
"www.wikidata.org",
"commons.wikimedia.org",
"www.wiktionary.org",
"www.w3schools.com",
"www.ibm.com",
]
x = random.randint(0, len(url_list) - 1)
url1 = url_list[x]
target = waybackpy.Url(
url1,
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36",
def test_oldest() -> None:
url = "https://bing.com"
oldest_archive = (
"https://web.archive.org/web/20030726111100/http://www.bing.com:80/"
)
archived_url1 = str(target.save())
assert url1 in archived_url1
# Test for urls that are incorrect.
with pytest.raises(Exception):
url2 = "ha ha ha ha"
waybackpy.Url(url2, user_agent)
url3 = "http://www.archive.is/faq.html"
with pytest.raises(Exception):
target = waybackpy.Url(
url3,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
"AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
"Safari/533.20.27",
)
target.save()
wayback = Url(url).oldest()
assert wayback.archive_url == oldest_archive
assert str(wayback) == oldest_archive
assert len(wayback) > 365 * 15 # days in a year times years
def test_near():
url = "google.com"
target = waybackpy.Url(
url,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 "
"(KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
)
archive_near_year = target.near(year=2010)
assert "2010" in str(archive_near_year)
archive_near_month_year = str(target.near(year=2015, month=2))
assert (
("201502" in archive_near_month_year)
or ("201501" in archive_near_month_year)
or ("201503" in archive_near_month_year)
)
target = waybackpy.Url(
"www.python.org",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
)
archive_near_hour_day_month_year = str(
target.near(year=2008, month=5, day=9, hour=15)
)
assert (
("2008050915" in archive_near_hour_day_month_year)
or ("2008050914" in archive_near_hour_day_month_year)
or ("2008050913" in archive_near_hour_day_month_year)
)
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = waybackpy.Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
def test_newest() -> None:
url = "https://www.youtube.com/"
wayback = Url(url).newest()
assert "youtube" in str(wayback.archive_url)
assert "archived_snapshots" in str(wayback.json)
def test_oldest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "20200504141153" in str(target.oldest())
def test_near() -> None:
url = "https://www.google.com"
wayback = Url(url).near(year=2010, month=10, day=10, hour=10, minute=10)
assert "20101010" in str(wayback.archive_url)
def test_json():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "archived_snapshots" in str(target.JSON)
def test_total_archives() -> None:
wayback = Url("https://akamhy.github.io")
assert wayback.total_archives() > 10
wayback = Url("https://gaha.ef4i3n.m5iai3kifp6ied.cima/gahh2718gs/ahkst63t7gad8")
assert wayback.total_archives() == 0
def test_archive_url():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "github.com/akamhy" in str(target.archive_url)
def test_known_urls() -> None:
wayback = Url("akamhy.github.io")
assert len(list(wayback.known_urls(subdomain=True))) > 40
def test_newest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert url in str(target.newest())
def test_get():
target = waybackpy.Url("google.com", user_agent)
assert "Welcome to Google" in target.get(target.oldest())
def test_wayback_timestamp():
ts = waybackpy._wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4)
assert "202001020304" in str(ts)
def test_get_response():
endpoint = "https://www.google.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = waybackpy._get_response(endpoint, params=None, headers=headers)
assert response.status_code == 200
def test_total_archives():
target = waybackpy.Url(" https://google.com ", user_agent)
assert target.total_archives() > 500000
target = waybackpy.Url(
" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
)
assert target.total_archives() == 0
def test_known_urls():
target = waybackpy.Url("akamhy.github.io", user_agent)
assert len(target.known_urls(alive=True, subdomain=True)) > 2
target = waybackpy.Url("akamhy.github.io", user_agent)
assert len(target.known_urls()) > 3
def test_Save() -> None:
wayback = Url("https://en.wikipedia.org/wiki/Asymptotic_equipartition_property")
wayback.save()
archive_url = str(wayback.archive_url)
assert archive_url.find("Asymptotic_equipartition_property") != -1

@@ -1,40 +1,16 @@
# -*- coding: utf-8 -*-
"""Module initializer and provider of static information."""
# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
# ┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
# ┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
# ━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
# ━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
"""
Waybackpy is a Python package that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Archive pages and retrieve archived pages easily.
Usage:
>>> import waybackpy
>>> target_url = waybackpy.Url('https://www.python.org', 'Your-apps-cool-user-agent')
>>> new_archive = target_url.save()
>>> print(new_archive)
https://web.archive.org/web/20200502170312/https://www.python.org/
Full documentation @ <https://akamhy.github.io/waybackpy/>.
:copyright: (c) 2020 by akamhy.
:license: MIT
"""
__version__ = "3.0.6"
from .availability_api import WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .wrapper import Url
from .__version__ import (
__title__,
__description__,
__url__,
__version__,
__author__,
__author_email__,
__license__,
__copyright__,
)
__all__ = [
"__version__",
"WaybackMachineAvailabilityAPI",
"WaybackMachineCDXServerAPI",
"WaybackMachineSaveAPI",
"Url",
]

@@ -1,13 +0,0 @@
# -*- coding: utf-8 -*-
__title__ = "waybackpy"
__description__ = (
"A Python package that interfaces with the Internet Archive's Wayback Machine API. "
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "2.3.0"
__author__ = "akamhy"
__author_email__ = "akash3pro@gmail.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020 akamhy"

@@ -0,0 +1,246 @@
"""
This module interfaces the Wayback Machine's availability API.
The interface is useful for looking up archives and finding archives
that are close to a specific date and time.
It has a class WaybackMachineAvailabilityAPI, and the class has
methods like:
near() for retrieving archives close to a specific date and time.
oldest() for retrieving the first archive URL of the webpage.
newest() for retrieving the latest archive of the webpage.
The Wayback Machine Availability API response must be valid JSON; if it is
not, an InvalidJSONInAvailabilityAPIResponse exception is raised.
If the Availability API returns valid JSON but the archive URL can not be
found in it, then ArchiveNotInAvailabilityAPIResponse is raised.
"""
import json
import time
from datetime import datetime
from typing import Any, Dict, Optional
import requests
from requests.models import Response
from .exceptions import (
ArchiveNotInAvailabilityAPIResponse,
InvalidJSONInAvailabilityAPIResponse,
)
from .utils import (
DEFAULT_USER_AGENT,
unix_timestamp_to_wayback_timestamp,
wayback_timestamp,
)
ResponseJSON = Dict[str, Any]
class WaybackMachineAvailabilityAPI:
"""
Class that interfaces the Wayback Machine's availability API.
"""
def __init__(
self, url: str, user_agent: str = DEFAULT_USER_AGENT, max_tries: int = 3
) -> None:
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = user_agent
self.headers: Dict[str, str] = {"User-Agent": self.user_agent}
self.payload: Dict[str, str] = {"url": self.url}
self.endpoint: str = "https://archive.org/wayback/available"
self.max_tries: int = max_tries
self.tries: int = 0
self.last_api_call_unix_time: int = int(time.time())
self.api_call_time_gap: int = 5
self.json: Optional[ResponseJSON] = None
self.response: Optional[Response] = None
def __repr__(self) -> str:
"""
Same as string representation, just return the archive URL as a string.
"""
return str(self)
def __str__(self) -> str:
"""
String representation of the class. If at least one API
call was successfully made then returns the archive URL
as a string. Else returns "" (the empty string).
"""
# __str__ can not return anything other than a string object,
# so if a string repr is requested even before making an API request
# just return ""
if not self.json:
return ""
return self.archive_url
def setup_json(self) -> Optional[ResponseJSON]:
"""
Makes the API call to the availability API, sets the JSON response
on the json attribute of the instance, and also returns the JSON
attribute.
time_diff and sleep_time make sure that you are not making too many
requests in a short interval of time; making too many requests is bad
as the Wayback Machine may reject them above a certain threshold.
The end-user can change the api_call_time_gap attribute of the instance
to increase or decrease the default time gap between two successive API
calls, but decreasing it is not recommended.
"""
time_diff = int(time.time()) - self.last_api_call_unix_time
sleep_time = self.api_call_time_gap - time_diff
if sleep_time > 0:
time.sleep(sleep_time)
self.response = requests.get(
self.endpoint, params=self.payload, headers=self.headers
)
self.last_api_call_unix_time = int(time.time())
self.tries += 1
try:
self.json = None if self.response is None else self.response.json()
except json.decoder.JSONDecodeError as json_decode_error:
raise InvalidJSONInAvailabilityAPIResponse(
f"Response data:\n{self.response.text}"
) from json_decode_error
return self.json
def timestamp(self) -> datetime:
"""
Converts the timestamp from the JSON response to a datetime object.
If the json attribute of the instance is None, it implies that either
the last API call failed or one was never made.
If there is no JSON, or the JSON has no timestamp, then returns
the maximum possible value for a datetime object.
If you get a URL as a response from the availability API, it is
guaranteed that you can get the datetime object from the timestamp.
"""
if self.json is None or "archived_snapshots" not in self.json:
return datetime.max
if (
self.json is not None
and "archived_snapshots" in self.json
and self.json["archived_snapshots"] is not None
and "closest" in self.json["archived_snapshots"]
and self.json["archived_snapshots"]["closest"] is not None
and "timestamp" in self.json["archived_snapshots"]["closest"]
):
return datetime.strptime(
self.json["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
raise ValueError("Timestamp not found in the Availability API's JSON response.")
@property
def archive_url(self) -> str:
"""
Reads the JSON response data and returns the archive URL if
found; if not found, raises ArchiveNotInAvailabilityAPIResponse.
"""
archive_url = ""
data = self.json
# If the user didn't invoke oldest, newest or near but tries to access
# the archive_url attribute, then assume that they are fine with any
# archive and invoke the oldest method.
if not data:
self.oldest()
# If data is still falsy then there are probably no
# archives for the requested URL.
if not data or not data["archived_snapshots"]:
while (self.tries < self.max_tries) and (
not data or not data["archived_snapshots"]
):
self.setup_json() # It makes a new API call
data = self.json # setup_json() updates value of json attribute
# If exhausted max_tries, then give up and
# raise ArchiveNotInAvailabilityAPIResponse.
if not data or not data["archived_snapshots"]:
raise ArchiveNotInAvailabilityAPIResponse(
"Archive not found in the availability "
"API response, the URL you requested may not have any archives "
"yet. You may retry after some time or archive the webpage now.\n"
"Response data:\n"
""
if self.response is None
else self.response.text
)
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
return archive_url
def oldest(self) -> "WaybackMachineAvailabilityAPI":
"""
Passes the date 1994-01-01 to near(), which should return the oldest
archive because the Wayback Machine was started in May 1996 and it is
assumed that there is no archive older than January 1, 1994.
"""
return self.near(year=1994, month=1, day=1)
def newest(self) -> "WaybackMachineAvailabilityAPI":
"""
Passes the current UNIX time to near() for retrieving the newest
archive from the availability API.
Remember, UNIX time is UTC-based and so is the Wayback Machine.
"""
return self.near(unix_timestamp=int(time.time()))
def near(
self,
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
unix_timestamp: Optional[int] = None,
) -> "WaybackMachineAvailabilityAPI":
"""
The most important method of this class; oldest() and newest() depend
on it.
It generates the timestamp based on the input, by calling either
unix_timestamp_to_wayback_timestamp or wayback_timestamp with the
appropriate arguments for their respective parameters,
adds the timestamp to the payload dictionary,
and finally invokes the setup_json method to make the API call and
returns the instance.
"""
if unix_timestamp:
timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = wayback_timestamp(
year=now.tm_year if year is None else year,
month=now.tm_mon if month is None else month,
day=now.tm_mday if day is None else day,
hour=now.tm_hour if hour is None else hour,
minute=now.tm_min if minute is None else minute,
)
self.payload["timestamp"] = timestamp
self.setup_json()
return self
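The near() method above builds its timestamp with two small helpers imported from waybackpy.utils; minimal sketches consistent with the 12-digit payload checked in test_wayback_timestamp ("202001020304") and the 14-digit wayback format, not necessarily the library's exact implementations:

```python
from datetime import datetime, timezone

def wayback_timestamp(**kwargs: int) -> str:
    # Zero-pad each component and concatenate, e.g. year=2020, month=1,
    # day=2, hour=3, minute=4 -> "202001020304".
    return "".join(
        str(kwargs[key]).zfill(2)
        for key in ("year", "month", "day", "hour", "minute")
    )

def unix_timestamp_to_wayback_timestamp(unix_timestamp: int) -> str:
    # UNIX time is UTC, matching the Wayback Machine's own timestamps.
    stamp = datetime.fromtimestamp(int(unix_timestamp), tz=timezone.utc)
    return stamp.strftime("%Y%m%d%H%M%S")
```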

waybackpy/cdx_api.py (new file)
@@ -0,0 +1,334 @@
"""
This module interfaces the Wayback Machine's CDX server API.
The module has WaybackMachineCDXServerAPI which should be used by the users of
this module to consume the CDX server API.
WaybackMachineCDXServerAPI has a snapshots() method that yields the snapshots,
and the snapshots are yielded as instances of the CDXSnapshot class.
"""
import time
from datetime import datetime
from typing import Dict, Generator, List, Optional, Union, cast
from .cdx_snapshot import CDXSnapshot
from .cdx_utils import (
check_collapses,
check_filters,
check_match_type,
check_sort,
full_url,
get_response,
get_total_pages,
)
from .exceptions import NoCDXRecordFound, WaybackError
from .utils import (
DEFAULT_USER_AGENT,
unix_timestamp_to_wayback_timestamp,
wayback_timestamp,
)
class WaybackMachineCDXServerAPI:
"""
Class that interfaces the CDX server API of the Wayback Machine.
snapshots() returns a generator that can be iterated upon by the end-user;
the generator yields the snapshots/entries as instances of CDXSnapshot to
make usage easy: every attribute of an entry is accessible via the dot
operator ".".
"""
# start_timestamp: the API's "from" parameter; "from" is a Python keyword, so it can not be used as an argument name
# end_timestamp: the API's "to" parameter, renamed for symmetry with start_timestamp
def __init__(
self,
url: str,
user_agent: str = DEFAULT_USER_AGENT,
start_timestamp: Optional[str] = None,
end_timestamp: Optional[str] = None,
filters: Optional[List[str]] = None,
match_type: Optional[str] = None,
sort: Optional[str] = None,
gzip: Optional[str] = None,
collapses: Optional[List[str]] = None,
limit: Optional[str] = None,
max_tries: int = 3,
use_pagination: bool = False,
closest: Optional[str] = None,
) -> None:
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = user_agent
self.start_timestamp = None if start_timestamp is None else str(start_timestamp)
self.end_timestamp = None if end_timestamp is None else str(end_timestamp)
self.filters = [] if filters is None else filters
check_filters(self.filters)
self.match_type = None if match_type is None else str(match_type).strip()
check_match_type(self.match_type, self.url)
self.sort = None if sort is None else str(sort).strip()
check_sort(self.sort)
self.gzip = gzip
self.collapses = [] if collapses is None else collapses
check_collapses(self.collapses)
self.limit = 25000 if limit is None else limit
self.max_tries = max_tries
self.use_pagination = use_pagination
self.closest = None if closest is None else str(closest)
self.last_api_request_url: Optional[str] = None
self.endpoint = "https://web.archive.org/cdx/search/cdx"
def cdx_api_manager(
self, payload: Dict[str, str], headers: Dict[str, str]
) -> Generator[str, None, None]:
"""
This method uses the pagination API of the CDX server if the
use_pagination attribute is True, else uses the standard
CDX server response data.
"""
# When using the pagination API of the CDX server.
if self.use_pagination is True:
total_pages = get_total_pages(self.url, self.user_agent)
successive_blank_pages = 0
for i in range(total_pages):
payload["page"] = str(i)
url = full_url(self.endpoint, params=payload)
res = get_response(url, headers=headers)
if isinstance(res, Exception):
raise res
self.last_api_request_url = url
text = res.text
# Reset the counter if the last page was blank
# but the current page is not.
if successive_blank_pages == 1:
if len(text) != 0:
successive_blank_pages = 0
# Increase the successive blank page counter on encountering
# a blank page.
if len(text) == 0:
successive_blank_pages += 1
# If two successive pages are blank
# then we don't have any more pages left to
# iterate.
if successive_blank_pages >= 2:
break
yield text
# When not using the pagination API of the CDX server
else:
payload["showResumeKey"] = "true"
payload["limit"] = str(self.limit)
resume_key = None
more = True
while more:
if resume_key:
payload["resumeKey"] = resume_key
url = full_url(self.endpoint, params=payload)
res = get_response(url, headers=headers)
if isinstance(res, Exception):
raise res
self.last_api_request_url = url
text = res.text.strip()
lines = text.splitlines()
more = False
if len(lines) >= 3:
second_last_line = lines[-2]
if len(second_last_line) == 0:
resume_key = lines[-1].strip()
text = text.replace(resume_key, "", 1).strip()
more = True
yield text
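The resume-key branch at the end of cdx_api_manager() depends on the CDX server appending a blank line followed by the resume key after the data lines; a standalone sketch of that extraction with a made-up response body:

```python
# Hypothetical CDX response body: data lines, a blank separator, then the key.
text = (
    "com,example)/ 20201126185327 ...\n"
    "com,example)/ 20210102094009 ...\n"
    "\n"
    "RESUME_KEY_123"
)
lines = text.splitlines()
resume_key = None
more = False
# At least three lines are needed: data, the blank separator, and the key.
if len(lines) >= 3 and len(lines[-2]) == 0:
    resume_key = lines[-1].strip()
    # Strip the key (and the trailing blank line) from the yielded text.
    text = text.replace(resume_key, "", 1).strip()
    more = True
```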
def add_payload(self, payload: Dict[str, str]) -> None:
"""
Populates the payload dictionary from the attributes of the instance.
"""
if self.start_timestamp:
payload["from"] = self.start_timestamp
if self.end_timestamp:
payload["to"] = self.end_timestamp
if self.gzip is None:
payload["gzip"] = "false"
if self.closest:
payload["closest"] = self.closest
if self.match_type:
payload["matchType"] = self.match_type
if self.sort:
payload["sort"] = self.sort
if self.filters and len(self.filters) > 0:
for i, _filter in enumerate(self.filters):
payload["filter" + str(i)] = _filter
if self.collapses and len(self.collapses) > 0:
for i, collapse in enumerate(self.collapses):
payload["collapse" + str(i)] = collapse
payload["url"] = self.url
def near(
self,
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
unix_timestamp: Optional[int] = None,
wayback_machine_timestamp: Optional[Union[int, str]] = None,
) -> CDXSnapshot:
"""
Fetches the archive closest to the given datetime; it can only return
a single snapshot. If you want more snapshots, do not use this method;
iterate the snapshots() generator instead.
"""
if unix_timestamp:
timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
elif wayback_machine_timestamp:
timestamp = str(wayback_machine_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = wayback_timestamp(
year=now.tm_year if year is None else year,
month=now.tm_mon if month is None else month,
day=now.tm_mday if day is None else day,
hour=now.tm_hour if hour is None else hour,
minute=now.tm_min if minute is None else minute,
)
self.closest = timestamp
self.sort = "closest"
self.limit = 1
first_snapshot = None
for snapshot in self.snapshots():
first_snapshot = snapshot
break
if not first_snapshot:
raise NoCDXRecordFound(
"Wayback Machine's CDX server did not return any records "
+ "for the query. The URL may not have any archives "
+ "on the Wayback Machine or the URL may have been recently "
+ "archived and is still not available on the CDX server."
)
return first_snapshot
def newest(self) -> CDXSnapshot:
"""
Passes the current UNIX time to near() for retrieving the newest archive
from the CDX server API.
Remember, UNIX time is UTC-based and so is the Wayback Machine.
"""
return self.near(unix_timestamp=int(time.time()))
def oldest(self) -> CDXSnapshot:
"""
Passes the date 1994-01-01 to near(), which should return the oldest
archive because the Wayback Machine was started in May 1996 and it is
assumed that there is no archive older than January 1, 1994.
"""
return self.near(year=1994, month=1, day=1)
def snapshots(self) -> Generator[CDXSnapshot, None, None]:
"""
This method yields the CDX data lines as snapshots.
It is a generator and therefore exhaustible; the reasons this is
a generator and not a list are:
a) The CDX server API can return millions of entries for a query and a
list is not suitable for such cases.
b) Preventing memory usage issues: as mentioned, this method may yield
millions of records for some queries and your system may not have enough
memory for such a big list. Also remember this when outputting to Jupyter
Notebooks.
The objects yielded by this method are instances of the CDXSnapshot class;
you can access the fields of an entry as attributes of the instance
itself.
"""
payload: Dict[str, str] = {}
headers = {"User-Agent": self.user_agent}
self.add_payload(payload)
entries = self.cdx_api_manager(payload, headers)
for entry in entries:
if entry.isspace() or len(entry) <= 1 or not entry:
continue
# Each line is a snapshot aka entry of the CDX server API.
# We are able to split the page by lines because we only
# split the lines of a single page and not all the entries
# at once, thus there should be no issue of excessive memory usage.
snapshot_list = entry.split("\n")
for snapshot in snapshot_list:
# 14 + 32 == 46 (timestamp + digest); ignore the invalid entries.
# They are invalid if their length is smaller than the sum of the
# lengths of a standard wayback timestamp and a standard digest.
if len(snapshot) < 46:
continue
properties: Dict[str, Optional[str]] = {
"urlkey": None,
"timestamp": None,
"original": None,
"mimetype": None,
"statuscode": None,
"digest": None,
"length": None,
}
property_value = snapshot.split(" ")
total_property_values = len(property_value)
warranted_total_property_values = len(properties)
if total_property_values != warranted_total_property_values:
raise WaybackError(
f"Snapshot returned by CDX API has {total_property_values} prop"
f"erties instead of expected {warranted_total_property_values} "
f"properties.\nProblematic Snapshot: {snapshot}"
)
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = property_value
yield CDXSnapshot(cast(Dict[str, str], properties))
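Putting the snapshots() parsing together: each CDX line splits on spaces into exactly seven fields, and the archive URL is derived from the timestamp and the original URL. A sketch using a made-up record (the digest and length are invented sample values):

```python
# A hypothetical CDX record, space-separated into seven fields.
line = (
    "com,example)/ 20201126185327 https://example.com/ "
    "text/html 200 H7MWX53PSMZSDAXVZB6NLPR5JC2VTFCY 1219"
)
fields = ("urlkey", "timestamp", "original", "mimetype",
          "statuscode", "digest", "length")
values = line.split(" ")
# snapshots() raises WaybackError unless the field count matches exactly.
assert len(values) == len(fields)
properties = dict(zip(fields, values))
# Same construction CDXSnapshot performs on init.
archive_url = (
    f"https://web.archive.org/web/{properties['timestamp']}"
    f"/{properties['original']}"
)
```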

waybackpy/cdx_snapshot.py (new file)
@@ -0,0 +1,90 @@
"""
Module that contains the CDXSnapshot class; CDX records/lines are cast
to CDXSnapshot objects for easier access.
The CDX index format is plain text data. Each line ('record') indicates a
crawled document, and these lines are cast to CDXSnapshot.
"""
from datetime import datetime
from typing import Dict
class CDXSnapshot:
"""
Class for the CDX snapshot lines ('records') returned by the CDX API.
Each valid line of the CDX API response is cast to a CDXSnapshot object
by the CDX API interface; just use "." to access any attribute of the
CDX server API snapshot.
This provides the end-user the ease of using the data as attributes
of the CDXSnapshot.
The string representation of the class is identical to the line returned
by the CDX server API.
Besides all the attributes of the CDX server API, this class also provides
an archive_url attribute, the archive URL of the snapshot.
Attributes of this class, what they represent, and what they are useful for:
urlkey: The document captured, expressed as a SURT
SURT stands for Sort-friendly URI Reordering Transform, and is a
transformation applied to URIs which makes their left-to-right
representation better match the natural hierarchy of domain names.
A URI <scheme://domain.tld/path?query> has SURT
form <scheme://(tld,domain,)/path?query>.
timestamp: The timestamp of the archive, format is yyyyMMddhhmmss and type
is string.
datetime_timestamp: The timestamp as a datetime object.
original: The original URL of the archive. If archive_url is
https://web.archive.org/web/20220113130051/https://google.com then the
original URL is https://google.com
mimetype: The document's file type, e.g. text/html
statuscode: HTTP response code for the document at the time of its crawling
digest: Base32-encoded SHA-1 checksum of the document, for distinguishing
it from other documents
length: The document's size in bytes in the WARC file
archive_url: The archive url of the snapshot, this is not returned by the
CDX server API but created by this class on init.
"""
def __init__(self, properties: Dict[str, str]) -> None:
self.urlkey: str = properties["urlkey"]
self.timestamp: str = properties["timestamp"]
self.datetime_timestamp: datetime = datetime.strptime(
self.timestamp, "%Y%m%d%H%M%S"
)
self.original: str = properties["original"]
self.mimetype: str = properties["mimetype"]
self.statuscode: str = properties["statuscode"]
self.digest: str = properties["digest"]
self.length: str = properties["length"]
self.archive_url: str = (
f"https://web.archive.org/web/{self.timestamp}/{self.original}"
)
def __repr__(self) -> str:
"""
Same as __str__()
"""
return str(self)
def __str__(self) -> str:
"""
The string representation is same as the line returned by the
CDX server API for the snapshot.
"""
return (
f"{self.urlkey} {self.timestamp} {self.original} "
f"{self.mimetype} {self.statuscode} {self.digest} {self.length}"
)

waybackpy/cdx_utils.py (new file)
@@ -0,0 +1,201 @@
"""
Utility functions required for accessing the CDX server API.
They live in this module so that no single module grows too
long.
"""
import re
from typing import Any, Dict, List, Optional, Union
from urllib.parse import quote
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from .exceptions import BlockedSiteError, WaybackError
from .utils import DEFAULT_USER_AGENT
def get_total_pages(url: str, user_agent: str = DEFAULT_USER_AGENT) -> int:
"""
Adding showNumPages=true to the request URL makes the CDX server
return an integer: the number of CDX pages available for us to query
using the pagination API.
"""
endpoint = "https://web.archive.org/cdx/search/cdx?"
payload = {"showNumPages": "true", "url": str(url)}
headers = {"User-Agent": user_agent}
request_url = full_url(endpoint, params=payload)
response = get_response(request_url, headers=headers)
check_for_blocked_site(response, url)
if isinstance(response, requests.Response):
return int(response.text.strip())
raise response
def check_for_blocked_site(
response: Union[requests.Response, Exception], url: Optional[str] = None
) -> None:
"""
Checks whether the URL can be archived by the Wayback Machine.
The site's robots.txt policy may block the Wayback Machine.
"""
# see https://github.com/akamhy/waybackpy/issues/157
# the following if block is to make mypy happy.
if isinstance(response, Exception):
raise response
if not url:
url = "The requested content"
if (
"org.archive.util.io.RuntimeIOException: "
+ "org.archive.wayback.exception.AdministrativeAccessControlException: "
+ "Blocked Site Error"
in response.text.strip()
):
raise BlockedSiteError(
f"{url} is excluded from Wayback Machine by the site's robots.txt policy."
)
def full_url(endpoint: str, params: Dict[str, Any]) -> str:
"""
As the function's name implies, it returns the full URL. But why do
we need a function for this? The CDX server supports multiple
arguments for parameters such as filter and collapse, and this
function appends them without overwriting earlier added arguments.
"""
if not params:
return endpoint
_full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if _full_url.endswith("?") else "&"
val = quote(str(val), safe="")
_full_url += f"{amp}{key}={val}"
return _full_url
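A minimal, self-contained sketch of the logic above, showing why repeated filter-style keys survive (a plain dict would overwrite them); the endpoint and parameter values are illustrative:

```python
from urllib.parse import quote

def full_url(endpoint, params):
    # Keys like "filter1", "filter2" are normalized back to "filter" and
    # appended, so repeated CDX parameters are not overwritten.
    if not params:
        return endpoint
    url = endpoint if endpoint.endswith("?") else endpoint + "?"
    for key, val in params.items():
        if key.startswith("filter"):
            key = "filter"
        elif key.startswith("collapse"):
            key = "collapse"
        amp = "" if url.endswith("?") else "&"
        url += amp + key + "=" + quote(str(val), safe="")
    return url

query = full_url(
    "https://web.archive.org/cdx/search/cdx",
    {"url": "example.com", "filter": "statuscode:200", "filter2": "mimetype:text/html"},
)
```

Both filter arguments end up in the query string as separate `filter=` parameters.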
def get_response(
url: str,
headers: Optional[Dict[str, str]] = None,
retries: int = 5,
backoff_factor: float = 0.5,
) -> Union[requests.Response, Exception]:
"""
Makes a GET request to the CDX server and returns the response.
"""
session = requests.Session()
retries_ = Retry(
total=retries,
backoff_factor=backoff_factor,
status_forcelist=[500, 502, 503, 504],
)
session.mount("https://", HTTPAdapter(max_retries=retries_))
response = session.get(url, headers=headers)
session.close()
check_for_blocked_site(response)
return response
def check_filters(filters: List[str]) -> None:
"""
Check that the filter arguments passed by the end-user are valid.
If not valid then raise WaybackError.
"""
if not isinstance(filters, list):
raise WaybackError("filters must be a list.")
# [!]field:regex
for _filter in filters:
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):"
r"(.*)",
_filter,
)
if match is None or len(match.groups()) != 2:
exc_message = f"Filter '{_filter}' is not following the cdx filter syntax."
raise WaybackError(exc_message)
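The accepted filter syntax can be demonstrated against the same regex; the sample filter strings below are illustrative:

```python
import re

# The filter syntax checked above: an optional "!", one of the known CDX
# field names, a colon, then an arbitrary regex.
FILTER_RE = re.compile(
    r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):(.*)"
)

good = ["statuscode:200", "!mimetype:text/html", "urlkey:.*blog.*"]
bad = ["status:200"]  # hypothetical: "status" is not a legal field name

good_ok = all(FILTER_RE.search(f) for f in good)
bad_rejected = all(FILTER_RE.search(f) is None for f in bad)
```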
def check_collapses(collapses: List[str]) -> bool:
"""
Check that the collapse arguments passed by the end-user are valid.
If not valid then raise WaybackError.
"""
if not isinstance(collapses, list):
raise WaybackError("collapses must be a list.")
if len(collapses) == 0:
return True
for collapse in collapses:
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)"
r"(:?[0-9]{1,99})?",
collapse,
)
if match is None or len(match.groups()) != 2:
exc_message = (
f"collapse argument '{collapse}' "
"is not following the cdx collapse syntax."
)
raise WaybackError(exc_message)
return True
def check_match_type(match_type: Optional[str], url: str) -> bool:
"""
Check that the match_type argument passed by the end-user is valid.
If not valid then raise WaybackError.
"""
legal_match_type = ["exact", "prefix", "host", "domain"]
if not match_type:
return True
if "*" in url:
raise WaybackError(
"Can not use wildcard in the URL along with the match_type arguments."
)
if match_type not in legal_match_type:
exc_message = (
f"{match_type} is not an allowed match type.\n"
"Use one from 'exact', 'prefix', 'host' or 'domain'"
)
raise WaybackError(exc_message)
return True
def check_sort(sort: Optional[str]) -> bool:
"""
Check that the sort argument passed by the end-user is valid.
If not valid then raise WaybackError.
"""
legal_sort = ["default", "closest", "reverse"]
if not sort:
return True
if sort not in legal_sort:
exc_message = (
f"{sort} is not an allowed argument for sort.\n"
"Use one from 'default', 'closest' or 'reverse'"
)
raise WaybackError(exc_message)
return True


@@ -1,259 +1,474 @@
# -*- coding: utf-8 -*-
import sys
"""
Module responsible for enabling waybackpy to function as a CLI tool.
"""
import os
import re
import argparse
import string
import random
from waybackpy.wrapper import Url
from waybackpy.__version__ import __version__
import re
import string
from typing import Any, Dict, Generator, List, Optional
import click
import requests
from . import __version__
from .cdx_api import WaybackMachineCDXServerAPI
from .exceptions import BlockedSiteError, NoCDXRecordFound
from .save_api import WaybackMachineSaveAPI
from .utils import DEFAULT_USER_AGENT
from .wrapper import Url
def _save(obj):
return obj.save()
def handle_cdx_closest_derivative_methods(
cdx_api: "WaybackMachineCDXServerAPI",
oldest: bool,
near: bool,
newest: bool,
near_args: Optional[Dict[str, int]] = None,
) -> None:
"""
Handles the closest parameter derivative methods.
near, newest and oldest use the closest parameter with closest-based
sorting enabled.
"""
try:
if near:
if near_args:
archive_url = cdx_api.near(**near_args).archive_url
else:
archive_url = cdx_api.near().archive_url
elif newest:
archive_url = cdx_api.newest().archive_url
elif oldest:
archive_url = cdx_api.oldest().archive_url
click.echo("Archive URL:")
click.echo(archive_url)
except NoCDXRecordFound as exc:
click.echo(click.style("NoCDXRecordFound: ", fg="red") + str(exc), err=True)
except BlockedSiteError as exc:
click.echo(click.style("BlockedSiteError: ", fg="red") + str(exc), err=True)
def _archive_url(obj):
return obj.archive_url
def handle_cdx(data: List[Any]) -> None:
"""
Handles the CDX CLI options and output format.
"""
url = data[0]
user_agent = data[1]
start_timestamp = data[2]
end_timestamp = data[3]
cdx_filter = data[4]
collapse = data[5]
cdx_print = data[6]
limit = data[7]
gzip = data[8]
match_type = data[9]
sort = data[10]
use_pagination = data[11]
closest = data[12]
filters = list(cdx_filter)
collapses = list(collapse)
cdx_print = list(cdx_print)
def _json(obj):
return obj.JSON
def _oldest(obj):
return obj.oldest()
def _newest(obj):
return obj.newest()
def _total_archives(obj):
return obj.total_archives()
def _near(obj, args):
_near_args = {}
if args.year:
_near_args["year"] = args.year
if args.month:
_near_args["month"] = args.month
if args.day:
_near_args["day"] = args.day
if args.hour:
_near_args["hour"] = args.hour
if args.minute:
_near_args["minute"] = args.minute
return obj.near(**_near_args)
def _save_urls_on_file(input_list, live_url_count):
m = re.search("https?://([A-Za-z_0-9.-]+).*", input_list[0])
if m:
domain = m.group(1)
else:
domain = "domain-unknown"
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
cdx_api = WaybackMachineCDXServerAPI(
url,
user_agent=user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
closest=closest,
filters=filters,
match_type=match_type,
sort=sort,
use_pagination=use_pagination,
gzip=gzip,
collapses=collapses,
limit=limit,
)
file_name = "%s-%d-urls-%s.txt" % (domain, live_url_count, uid)
file_content = "\n".join(input_list)
file_path = os.path.join(os.getcwd(), file_name)
with open(file_path, "w+") as f:
f.write(file_content)
return "%s\n\n'%s' saved in current working directory" % (file_content, file_name)
snapshots = cdx_api.snapshots()
for snapshot in snapshots:
if len(cdx_print) == 0:
click.echo(snapshot)
else:
output_string = []
if any(val in cdx_print for val in ["urlkey", "url-key", "url_key"]):
output_string.append(snapshot.urlkey)
if any(
val in cdx_print for val in ["timestamp", "time-stamp", "time_stamp"]
):
output_string.append(snapshot.timestamp)
if "original" in cdx_print:
output_string.append(snapshot.original)
if any(val in cdx_print for val in ["mimetype", "mime-type", "mime_type"]):
output_string.append(snapshot.mimetype)
if any(
val in cdx_print for val in ["statuscode", "status-code", "status_code"]
):
output_string.append(snapshot.statuscode)
if "digest" in cdx_print:
output_string.append(snapshot.digest)
if "length" in cdx_print:
output_string.append(snapshot.length)
if any(
val in cdx_print for val in ["archiveurl", "archive-url", "archive_url"]
):
output_string.append(snapshot.archive_url)
click.echo(" ".join(output_string))
def _known_urls(obj, args):
"""Abbreviations:
sd = subdomain
al = alive
def save_urls_on_file(url_gen: Generator[str, None, None]) -> None:
"""
sd = False
al = False
if args.subdomain:
sd = True
if args.alive:
al = True
url_list = obj.known_urls(alive=al, subdomain=sd)
total_urls = len(url_list)
Save output of CDX API on file.
Mainly here because of backwards compatibility.
"""
domain = None
sys_random = random.SystemRandom()
uid = "".join(
sys_random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url_count = 0
file_name = None
if total_urls > 0:
text = _save_urls_on_file(url_list, total_urls)
for url in url_gen:
url_count += 1
if not domain:
match = re.search("https?://([A-Za-z_0-9.-]+).*", url)
domain = "domain-unknown"
if match:
domain = match.group(1)
file_name = f"{domain}-urls-{uid}.txt"
file_path = os.path.join(os.getcwd(), file_name)
if not os.path.isfile(file_path):
with open(file_path, "w+", encoding="utf-8") as file:
file.close()
with open(file_path, "a", encoding="utf-8") as file:
file.write(f"{url}\n")
click.echo(url)
if url_count > 0:
click.echo(
f"\n\n{url_count} URLs saved inside '{file_name}' in the current "
+ "working directory."
)
else:
text = "No known URLs found. Please try a diffrent domain!"
return text
click.echo("No known URLs found. Please try a diffrent input!")
def _get(obj, args):
if args.get.lower() == "url":
return obj.get()
@click.command()
@click.option(
"-u", "--url", help="URL on which Wayback machine operations are to be performed."
)
@click.option(
"-ua",
"--user-agent",
"--user_agent",
default=DEFAULT_USER_AGENT,
help=f"User agent, default value is '{DEFAULT_USER_AGENT}'.",
)
@click.option("-v", "--version", is_flag=True, default=False, help="waybackpy version.")
@click.option(
"-l",
"--show-license",
"--show_license",
"--license",
is_flag=True,
default=False,
help="Show license of Waybackpy.",
)
@click.option(
"-n",
"--newest",
"-au",
"--archive_url",
"--archive-url",
default=False,
is_flag=True,
help="Retrieve the newest archive of URL.",
)
@click.option(
"-o",
"--oldest",
default=False,
is_flag=True,
help="Retrieve the oldest archive of URL.",
)
@click.option(
"-N",
"--near",
default=False,
is_flag=True,
help="Archive close to a specified time.",
)
@click.option("-Y", "--year", type=click.IntRange(1994, 9999), help="Year in integer.")
@click.option("-M", "--month", type=click.IntRange(1, 12), help="Month in integer.")
@click.option("-D", "--day", type=click.IntRange(1, 31), help="Day in integer.")
@click.option("-H", "--hour", type=click.IntRange(0, 24), help="Hour in integer.")
@click.option("-MIN", "--minute", type=click.IntRange(0, 60), help="Minute in integer.")
@click.option(
"-s",
"--save",
default=False,
is_flag=True,
help="Save the specified URL's webpage and print the archive URL.",
)
@click.option(
"-h",
"--headers",
default=False,
is_flag=True,
help="Headers data of the SavePageNow API.",
)
@click.option(
"-ku",
"--known-urls",
"--known_urls",
default=False,
is_flag=True,
help="List known URLs. Uses CDX API.",
)
@click.option(
"-sub",
"--subdomain",
default=False,
is_flag=True,
help="Use with '--known_urls' to include known URLs for subdomains.",
)
@click.option(
"-f",
"--file",
default=False,
is_flag=True,
help="Use with '--known_urls' to save the URLs in file at current directory.",
)
@click.option(
"--cdx",
default=False,
is_flag=True,
help="Flag for using CDX API.",
)
@click.option(
"-st",
"--start-timestamp",
"--start_timestamp",
"--from",
help="Start timestamp for CDX API in yyyyMMddhhmmss format.",
)
@click.option(
"-et",
"--end-timestamp",
"--end_timestamp",
"--to",
help="End timestamp for CDX API in yyyyMMddhhmmss format.",
)
@click.option(
"-C",
"--closest",
help="Archive that are closest the timestamp passed as arguments to this "
+ "parameter.",
)
@click.option(
"-f",
"--cdx-filter",
"--cdx_filter",
"--filter",
multiple=True,
help="Filter on a specific field or all the CDX fields.",
)
@click.option(
"-mt",
"--match-type",
"--match_type",
help="The default behavior is to return matches for an exact URL. "
+ "However, the CDX server can also return results matching a certain prefix, "
+ "a certain host, or all sub-hosts by using the match_type",
)
@click.option(
"-st",
"--sort",
help="Choose one from default, closest or reverse. It returns sorted CDX entries "
+ "in the response.",
)
@click.option(
"-up",
"--use-pagination",
"--use_pagination",
default=False,
is_flag=True,
help="Use the pagination API of the CDX server instead of the default one.",
)
@click.option(
"-gz",
"--gzip",
help="To disable gzip compression pass false as argument to this parameter. "
+ "The default behavior is gzip compression enabled.",
)
@click.option(
"-c",
"--collapse",
multiple=True,
help="Filtering or 'collapse' results based on a field, or a substring of a field.",
)
@click.option(
"-l",
"--limit",
help="Number of maximum record that CDX API is asked to return per API call, "
+ "default value is 25000 records.",
)
@click.option(
"-cp",
"--cdx-print",
"--cdx_print",
multiple=True,
help="Print only certain fields of the CDX API response, "
+ "if this parameter is not used then the plain text response of the CDX API "
+ "will be printed.",
)
def main( # pylint: disable=no-value-for-parameter
user_agent: str,
version: bool,
show_license: bool,
newest: bool,
oldest: bool,
near: bool,
save: bool,
headers: bool,
known_urls: bool,
subdomain: bool,
file: bool,
cdx: bool,
use_pagination: bool,
cdx_filter: List[str],
collapse: List[str],
cdx_print: List[str],
url: Optional[str] = None,
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
start_timestamp: Optional[str] = None,
end_timestamp: Optional[str] = None,
closest: Optional[str] = None,
match_type: Optional[str] = None,
sort: Optional[str] = None,
gzip: Optional[str] = None,
limit: Optional[str] = None,
) -> None:
"""\b
_ _
| | | |
__ ____ _ _ _| |__ __ _ ___| | ___ __ _ _
\\ \\ /\\ / / _` | | | | '_ \\ / _` |/ __| |/ / '_ \\| | | |
\\ V V / (_| | |_| | |_) | (_| | (__| <| |_) | |_| |
\\_/\\_/ \\__,_|\\__, |_.__/ \\__,_|\\___|_|\\_\\ .__/ \\__, |
__/ | | | __/ |
|___/ |_| |___/
if args.get.lower() == "archive_url":
return obj.get(obj.archive_url)
Python package & CLI tool that interfaces the Wayback Machine APIs
if args.get.lower() == "oldest":
return obj.get(obj.oldest())
Repository: https://github.com/akamhy/waybackpy
if args.get.lower() == "latest" or args.get.lower() == "newest":
return obj.get(obj.newest())
Documentation: https://github.com/akamhy/waybackpy/wiki/CLI-docs
if args.get.lower() == "save":
return obj.get(obj.save())
waybackpy - CLI usage(Demo video): https://asciinema.org/a/469890
return "Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
\n3) oldest - get the source code of the oldest archive for the supplied url.\
\n4) newest - get the source code of the newest archive for the supplied url.\
\n5) save - Create a new archive and get the source code of this new archive for the supplied url."
Released under the MIT License. Use the flag --license for license.
"""
if version:
click.echo(f"waybackpy version {__version__}")
def args_handler(args):
if args.version:
return "waybackpy version %s" % __version__
if not args.url:
return (
"waybackpy %s \nSee 'waybackpy --help' for help using this tool."
% __version__
elif show_license:
click.echo(
requests.get(
url="https://raw.githubusercontent.com/akamhy/waybackpy/master/LICENSE"
).text
)
elif url is None:
click.echo(
click.style("NoURLDetected: ", fg="red")
+ "No URL detected. "
+ "Please provide an URL.",
err=True,
)
if args.user_agent:
obj = Url(args.url, args.user_agent)
elif oldest:
cdx_api = WaybackMachineCDXServerAPI(url, user_agent=user_agent)
handle_cdx_closest_derivative_methods(cdx_api, oldest, near, newest)
elif newest:
cdx_api = WaybackMachineCDXServerAPI(url, user_agent=user_agent)
handle_cdx_closest_derivative_methods(cdx_api, oldest, near, newest)
elif near:
cdx_api = WaybackMachineCDXServerAPI(url, user_agent=user_agent)
near_args = {}
keys = ["year", "month", "day", "hour", "minute"]
args_arr = [year, month, day, hour, minute]
for key, arg in zip(keys, args_arr):
if arg:
near_args[key] = arg
handle_cdx_closest_derivative_methods(
cdx_api, oldest, near, newest, near_args=near_args
)
elif save:
save_api = WaybackMachineSaveAPI(url, user_agent=user_agent)
save_api.save()
click.echo("Archive URL:")
click.echo(save_api.archive_url)
click.echo("Cached save:")
click.echo(save_api.cached_save)
if headers:
click.echo("Save API headers:")
click.echo(save_api.headers)
elif known_urls:
wayback = Url(url, user_agent)
url_gen = wayback.known_urls(subdomain=subdomain)
if file:
save_urls_on_file(url_gen)
else:
for url_ in url_gen:
click.echo(url_)
elif cdx:
data = [
url,
user_agent,
start_timestamp,
end_timestamp,
cdx_filter,
collapse,
cdx_print,
limit,
gzip,
match_type,
sort,
use_pagination,
closest,
]
handle_cdx(data)
else:
obj = Url(args.url)
if args.save:
return _save(obj)
if args.archive_url:
return _archive_url(obj)
if args.json:
return _json(obj)
if args.oldest:
return _oldest(obj)
if args.newest:
return _newest(obj)
if args.known_urls:
return _known_urls(obj, args)
if args.total:
return _total_archives(obj)
if args.near:
return _near(obj, args)
if args.get:
return _get(obj, args)
message = (
"You only specified the URL. But you also need to specify the operation."
"\nSee 'waybackpy --help' for help using this tool."
)
return message
def parse_args(argv):
parser = argparse.ArgumentParser()
requiredArgs = parser.add_argument_group("URL argument (required)")
requiredArgs.add_argument(
"--url", "-u", help="URL on which Wayback machine operations would occur"
)
userAgentArg = parser.add_argument_group("User Agent")
help_text = 'User agent, default user_agent is "waybackpy python package - https://github.com/akamhy/waybackpy"'
userAgentArg.add_argument("--user_agent", "-ua", help=help_text)
saveArg = parser.add_argument_group("Create new archive/save URL")
saveArg.add_argument(
"--save", "-s", action="store_true", help="Save the URL on the Wayback machine"
)
auArg = parser.add_argument_group("Get the latest Archive")
auArg.add_argument(
"--archive_url",
"-au",
action="store_true",
help="Get the latest archive URL, alias for --newest",
)
jsonArg = parser.add_argument_group("Get the JSON data")
jsonArg.add_argument(
"--json",
"-j",
action="store_true",
help="JSON data of the availability API request",
)
oldestArg = parser.add_argument_group("Oldest archive")
oldestArg.add_argument(
"--oldest",
"-o",
action="store_true",
help="Oldest archive for the specified URL",
)
newestArg = parser.add_argument_group("Newest archive")
newestArg.add_argument(
"--newest",
"-n",
action="store_true",
help="Newest archive for the specified URL",
)
totalArg = parser.add_argument_group("Total number of archives")
totalArg.add_argument(
"--total",
"-t",
action="store_true",
help="Total number of archives for the specified URL",
)
getArg = parser.add_argument_group("Get source code")
getArg.add_argument(
"--get",
"-g",
help="Prints the source code of the supplied url. Use '--get help' for extended usage",
)
knownUrlArg = parser.add_argument_group(
"URLs known and archived to Waybcak Machine for the site."
)
knownUrlArg.add_argument(
"--known_urls", "-ku", action="store_true", help="URLs known for the domain."
)
help_text = "Use with '--known_urls' to include known URLs for subdomains."
knownUrlArg.add_argument("--subdomain", "-sub", action="store_true", help=help_text)
help_text = "Only include live URLs. Will not inlclude dead links."
knownUrlArg.add_argument("--alive", "-a", action="store_true", help=help_text)
nearArg = parser.add_argument_group("Archive close to time specified")
nearArg.add_argument(
"--near", "-N", action="store_true", help="Archive near specified time"
)
nearArgs = parser.add_argument_group("Arguments that are used only with --near")
nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
nearArgs.add_argument("--hour", "-H", type=int, help="Hour in intege")
nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")
parser.add_argument(
"--version", "-v", action="store_true", help="Waybackpy version"
)
return parser.parse_args(argv[1:])
def main(argv=None):
if argv is None:
argv = sys.argv
args = parse_args(argv)
output = args_handler(args)
print(output)
click.echo(
click.style("NoCommandFound: ", fg="red")
+ "Only URL passed, but did not specify what to do with the URL. "
+ "Use --help flag for help using waybackpy.",
err=True,
)
if __name__ == "__main__":
sys.exit(main(sys.argv))
main() # pylint: disable=no-value-for-parameter


@@ -1,13 +1,64 @@
# -*- coding: utf-8 -*-
"""
waybackpy.exceptions
~~~~~~~~~~~~~~~~~~~
This module contains the set of Waybackpy's exceptions.
"""
class WaybackError(Exception):
"""
Raised when Wayback Machine API Service is unreachable/down.
Raised when Waybackpy cannot return what you asked for.
1) Wayback Machine API Service is unreachable/down.
2) You passed illegal arguments.
All other exceptions are inherited from this main exception.
"""
class URLError(Exception):
class NoCDXRecordFound(WaybackError):
"""
Raised when malformed URLs are passed as arguments.
No records returned by the CDX server for a query.
Raised when the user invokes near(), newest() or oldest() methods
and there are no archives.
"""
class BlockedSiteError(WaybackError):
"""
Raised when archives are requested via the CDX server API for a
website/URL that was excluded from the Wayback Machine.
"""
class TooManyRequestsError(WaybackError):
"""
Raised when you make more than 15 requests per
minute and the Wayback Machine returns 429.
See https://github.com/akamhy/waybackpy/issues/131
"""
class MaximumRetriesExceeded(WaybackError):
"""
Raised when the maximum number of retries is exceeded.
"""
class MaximumSaveRetriesExceeded(MaximumRetriesExceeded):
"""
Raised when the maximum number of save retries is exceeded.
"""
class ArchiveNotInAvailabilityAPIResponse(WaybackError):
"""
Could not parse the archive in the JSON response of the availability API.
"""
class InvalidJSONInAvailabilityAPIResponse(WaybackError):
"""
The availability API returned invalid JSON.
"""

waybackpy/py.typed Normal file

waybackpy/save_api.py Normal file

@@ -0,0 +1,225 @@
"""
This module interfaces the Wayback Machine's SavePageNow (SPN) API.
The module has the WaybackMachineSaveAPI class, which users of this module
should use to access the SavePageNow API.
"""
import re
import time
from datetime import datetime
from typing import Dict, Optional
import requests
from requests.adapters import HTTPAdapter
from requests.models import Response
from requests.structures import CaseInsensitiveDict
from urllib3.util.retry import Retry
from .exceptions import MaximumSaveRetriesExceeded, TooManyRequestsError, WaybackError
from .utils import DEFAULT_USER_AGENT
class WaybackMachineSaveAPI:
"""
WaybackMachineSaveAPI class provides an interface for saving URLs on the
Wayback Machine.
"""
def __init__(
self,
url: str,
user_agent: str = DEFAULT_USER_AGENT,
max_tries: int = 8,
) -> None:
self.url = str(url).strip().replace(" ", "%20")
self.request_url = "https://web.archive.org/save/" + self.url
self.user_agent = user_agent
self.request_headers: Dict[str, str] = {"User-Agent": self.user_agent}
if max_tries < 1:
raise ValueError("max_tries should be positive")
self.max_tries = max_tries
self.total_save_retries = 5
self.backoff_factor = 0.5
self.status_forcelist = [500, 502, 503, 504]
self._archive_url: Optional[str] = None
self.instance_birth_time = datetime.utcnow()
self.response: Optional[Response] = None
self.headers: Optional[CaseInsensitiveDict[str]] = None
self.status_code: Optional[int] = None
self.response_url: Optional[str] = None
self.cached_save: Optional[bool] = None
self.saved_archive: Optional[str] = None
@property
def archive_url(self) -> str:
"""
Returns the archive URL if it is already cached in _archive_url;
otherwise invokes the save method, which saves the archive and
returns its URL, which we then return.
if self._archive_url:
return self._archive_url
return self.save()
def get_save_request_headers(self) -> None:
"""
Creates a session and tries 'retries' number of times to
retrieve the archive.
If successful in getting the response, sets the headers, status_code
and response_url attributes.
The archive is usually in the headers but it can also be the response URL
as the Wayback Machine redirects to the archive after a successful capture
of the webpage.
The Wayback Machine's save API is known to be very unreliable; if it
fails, first try opening the response URL yourself in the browser.
"""
session = requests.Session()
retries = Retry(
total=self.total_save_retries,
backoff_factor=self.backoff_factor,
status_forcelist=self.status_forcelist,
)
session.mount("https://", HTTPAdapter(max_retries=retries))
self.response = session.get(self.request_url, headers=self.request_headers)
# requests.response.headers is requests.structures.CaseInsensitiveDict
self.headers = self.response.headers
self.status_code = self.response.status_code
self.response_url = self.response.url
session.close()
if self.status_code == 429:
# why wait 5 minutes and 429?
# see https://github.com/akamhy/waybackpy/issues/97
raise TooManyRequestsError(
f"Can not save '{self.url}'. "
f"Save request refused by the server. "
f"Save Page Now limits saving 15 URLs per minutes. "
f"Try waiting for 5 minutes and then try again."
)
# why 509?
# see https://github.com/akamhy/waybackpy/pull/99
# also https://t.co/xww4YJ0Iwc
if self.status_code == 509:
raise WaybackError(
f"Can not save '{self.url}'. You have probably reached the "
f"limit of active sessions."
)
def archive_url_parser(self) -> Optional[str]:
"""
Three regexen (like oxen?) are used to search for the
archive URL in the headers and finally look in the response URL
for the archive URL.
"""
regex1 = r"Content-Location: (/web/[0-9]{14}/.*)"
match = re.search(regex1, str(self.headers))
if match:
return "https://web.archive.org" + match.group(1)
regex2 = r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>"
match = re.search(regex2, str(self.headers))
if match is not None and len(match.groups()) == 1:
return "https://" + match.group(1)
regex3 = r"X-Cache-Key:\shttps(.*)[A-Z]{2}"
match = re.search(regex3, str(self.headers))
if match is not None and len(match.groups()) == 1:
return "https" + match.group(1)
self.response_url = (
"" if self.response_url is None else self.response_url.strip()
)
regex4 = r"web\.archive\.org/web/(?:[0-9]*?)/(?:.*)$"
match = re.search(regex4, self.response_url)
if match is not None:
return "https://" + match.group(0)
return None
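The first of the header patterns above can be demonstrated in isolation; the header value below is an illustrative sample, not a real SavePageNow response:

```python
import re

# After a successful capture the archive path is usually present in a
# Content-Location header; the sample header text below is made up.
headers_text = "Content-Location: /web/20220115104000/https://example.com/"

match = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", headers_text)
archive_url = "https://web.archive.org" + match.group(1)
```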
@staticmethod
def sleep(tries: int) -> None:
"""
Ensure that we wait some time between successive retries so that we
don't waste the retries before the page is even captured by the
Wayback Machine crawlers, and also to avoid putting too much load on
the Wayback Machine's save API.
If tries is a multiple of 3, sleep 10 seconds, else sleep 5 seconds.
"""
sleep_seconds = 5
if tries % 3 == 0:
sleep_seconds = 10
time.sleep(sleep_seconds)
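The retry pacing above can be restated as a pure function (without actually sleeping) to make the schedule visible:

```python
# Every third try sleeps 10 seconds, any other try sleeps 5 seconds.
def sleep_seconds(tries):
    return 10 if tries % 3 == 0 else 5

schedule = [sleep_seconds(t) for t in range(1, 7)]
```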
def timestamp(self) -> datetime:
"""
Read the timestamp off the archive URL and convert the Wayback Machine
timestamp to a datetime object.
Also compare the time on the archive URL to the instance birth
time.
If the time on the archive is older than the instance creation time,
set cached_save to True, else set it to False. The flag can be used to
check whether the Wayback Machine served a cached archive. It is quite
common for the Wayback Machine to serve a cached archive if the last
archive was captured within the last 45 minutes.
"""
regex = r"https?://web\.archive.org/web/([0-9]{14})/http"
match = re.search(regex, str(self._archive_url))
if match is None or len(match.groups()) != 1:
raise ValueError(
f"Can not parse timestamp from archive URL, '{self._archive_url}'."
)
string_timestamp = match.group(1)
timestamp = datetime.strptime(string_timestamp, "%Y%m%d%H%M%S")
timestamp_unixtime = time.mktime(timestamp.timetuple())
instance_birth_time_unixtime = time.mktime(self.instance_birth_time.timetuple())
if timestamp_unixtime < instance_birth_time_unixtime:
self.cached_save = True
else:
self.cached_save = False
return timestamp
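The cached-save check above can be sketched end to end; the archive URL below is a made-up sample:

```python
import re
import time
from datetime import datetime

# A timestamp parsed from the archive URL that predates instance creation
# means the Wayback Machine served a cached archive.
archive_url = "https://web.archive.org/web/20220113130051/https://example.com/"

match = re.search(r"https?://web\.archive\.org/web/([0-9]{14})/http", archive_url)
ts = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
instance_birth_time = datetime.utcnow()
cached_save = time.mktime(ts.timetuple()) < time.mktime(instance_birth_time.timetuple())
```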
def save(self) -> str:
"""
Calls the SavePageNow API of the Wayback Machine with required parameters
and headers to save the URL.
Raises MaximumSaveRetriesExceeded if the maximum retries are exhausted
and we were still unable to retrieve the archive from the Wayback Machine.
"""
self.saved_archive = None
tries = 0
while True:
if tries >= 1:
self.sleep(tries)
self.get_save_request_headers()
self.saved_archive = self.archive_url_parser()
if isinstance(self.saved_archive, str):
self._archive_url = self.saved_archive
self.timestamp()
return self.saved_archive
tries += 1
if tries >= self.max_tries:
raise MaximumSaveRetriesExceeded(
f"Tried {tries} times but failed to save "
f"and retrieve the archive for {self.url}.\n"
f"Response URL:\n{self.response_url}\n"
f"Response Header:\n{self.headers}"
)

waybackpy/utils.py Normal file

@@ -0,0 +1,29 @@
"""
Utility functions and shared variables like DEFAULT_USER_AGENT are here.
"""
from datetime import datetime
from . import __version__
DEFAULT_USER_AGENT: str = (
f"waybackpy {__version__} - https://github.com/akamhy/waybackpy"
)
def unix_timestamp_to_wayback_timestamp(unix_timestamp: int) -> str:
"""
Converts Unix time to a Wayback Machine timestamp; the Wayback Machine
timestamp format is yyyyMMddhhmmss.
"""
return datetime.utcfromtimestamp(int(unix_timestamp)).strftime("%Y%m%d%H%M%S")
def wayback_timestamp(**kwargs: int) -> str:
"""
Zero-pads the year, month, day, hour and minute so that they
conform to the YYYYMMDDhhmmss Wayback Machine timestamp format.
"""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
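The two helpers above can be exercised on sample inputs (restated here so the snippet is self-contained):

```python
from datetime import datetime

# Unix epoch seconds -> yyyyMMddhhmmss Wayback Machine timestamp.
def unix_timestamp_to_wayback_timestamp(unix_timestamp):
    return datetime.utcfromtimestamp(int(unix_timestamp)).strftime("%Y%m%d%H%M%S")

# Zero-pad year/month/day/hour/minute into a Wayback Machine timestamp.
def wayback_timestamp(**kwargs):
    return "".join(
        str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
    )

ts = wayback_timestamp(year=2022, month=3, day=5, hour=9, minute=7)
epoch = unix_timestamp_to_wayback_timestamp(0)
```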


@@ -1,272 +1,162 @@
# -*- coding: utf-8 -*-
"""
This module exists because backwards compatibility matters.
Don't touch this or add any new functionality here and don't use
the Url class.
"""
import re
from datetime import datetime, timedelta
from waybackpy.exceptions import WaybackError, URLError
from waybackpy.__version__ import __version__
import requests
import concurrent.futures
from typing import Generator, Optional
from requests.structures import CaseInsensitiveDict
default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"
def _archive_url_parser(header):
"""Parse out the archive from header."""
# Regex1
arch = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
if arch:
return "web.archive.org" + arch.group(1)
# Regex2
arch = re.search(
r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
)
if arch:
return arch.group(1)
# Regex3
arch = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
if arch:
return arch.group(1)
raise WaybackError(
"No archive URL found in the API response. "
"This version of waybackpy (%s) is likely out of date. Visit "
"https://github.com/akamhy/waybackpy for the latest version "
"of waybackpy.\nHeader:\n%s" % (__version__, str(header))
)
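The first regex in `_archive_url_parser` matches a `Content-Location` header pointing into `/web/<14-digit timestamp>/`. A minimal check against a synthetic header string (real input comes from a web.archive.org/save response) looks like this:

```python
import re

# Synthetic header text for illustration; a real value comes from the
# response headers of a https://web.archive.org/save/ request.
header = "Content-Location: /web/20220315083000/https://example.com/"

match = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", header)
archive = ("web.archive.org" + match.group(1)) if match else None
print(archive)  # web.archive.org/web/20220315083000/https://example.com/
```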
def _wayback_timestamp(**kwargs):
"""Return a formatted timestamp."""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
def _get_response(endpoint, params=None, headers=None):
"""Get response for the supplied request."""
try:
response = requests.get(endpoint, params=params, headers=headers)
except Exception:
try:
response = requests.get(endpoint, params=params, headers=headers) # nosec
except Exception as e:
exc = WaybackError("Error while retrieving %s" % endpoint)
exc.__cause__ = e
raise exc
return response
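`_get_response` retries a failed request exactly once before wrapping the error. The retry-once shape can be sketched without any network access by passing in a callable (`get_with_one_retry` and `flaky` are illustrative names, not part of waybackpy):

```python
def get_with_one_retry(fetch):
    # Mirrors _get_response: one silent retry, then raise with the cause attached.
    try:
        return fetch()
    except Exception:
        try:
            return fetch()
        except Exception as error:
            raise RuntimeError("Error while retrieving endpoint") from error


calls = {"n": 0}


def flaky():
    # Fails on the first call, succeeds on the second.
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient failure")
    return "ok"


result = get_with_one_retry(flaky)
print(result)  # ok
```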
from .availability_api import ResponseJSON, WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .utils import DEFAULT_USER_AGENT
class Url:
"""waybackpy Url object"""
"""
The Url class is not recommended to be used anymore, instead use:
def __init__(self, url, user_agent=default_UA):
- WaybackMachineSaveAPI
- WaybackMachineAvailabilityAPI
- WaybackMachineCDXServerAPI
The reason it is still in the code is backwards compatibility with 2.x.x
versions.
If you were using the Url class before the update to version 3.x.x, your code
should still work fine and there is no hurry to update the interface, but it
is recommended that you do not use the Url class for new code, as it will be
removed after 2025. The first 3.x.x version was released in January 2022, and
three years are more than enough to update code written against the older
interface.
"""
def __init__(self, url: str, user_agent: str = DEFAULT_USER_AGENT) -> None:
self.url = url
self.user_agent = user_agent
self._url_check() # checks url validity on init.
self.JSON = self._JSON() # JSON of most recent archive
self.archive_url = self._archive_url() # URL of archive
self.timestamp = self._archive_timestamp() # timestamp for last archive
self._alive_url_list = []
self.user_agent = str(user_agent)
self.archive_url: Optional[str] = None
self.timestamp: Optional[datetime] = None
self.wayback_machine_availability_api = WaybackMachineAvailabilityAPI(
self.url, user_agent=self.user_agent
)
self.wayback_machine_save_api: Optional[WaybackMachineSaveAPI] = None
self.headers: Optional[CaseInsensitiveDict[str]] = None
self.json: Optional[ResponseJSON] = None
def __repr__(self):
return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)
def __str__(self) -> str:
if not self.archive_url:
self.newest()
return str(self.archive_url)
def __str__(self):
return "%s" % self.archive_url
def __len__(self):
def __len__(self) -> int:
td_max = timedelta(
days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)
if not isinstance(self.timestamp, datetime):
self.oldest()
if not isinstance(self.timestamp, datetime):
raise TypeError("timestamp must be a datetime")
if self.timestamp == datetime.max:
return td_max.days
diff = datetime.utcnow() - self.timestamp
return diff.days
return (datetime.utcnow() - self.timestamp).days
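The new `__len__` returns the age of the archive in days, with `datetime.max` serving as the "no archive found" sentinel that maps to `timedelta.max.days`. The arithmetic can be checked in isolation (`archive_age_days` is a hypothetical helper extracting that logic):

```python
from datetime import datetime, timedelta

TD_MAX = timedelta(
    days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)


def archive_age_days(timestamp: datetime, now: datetime) -> int:
    # datetime.max is the sentinel the Url class uses for "no archive".
    if timestamp == datetime.max:
        return TD_MAX.days
    return (now - timestamp).days


print(archive_age_days(datetime(2022, 1, 1), datetime(2022, 3, 15)))  # 73
print(archive_age_days(datetime.max, datetime(2022, 3, 15)))  # 999999999
```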
def _url_check(self):
"""Check for common URL problems."""
if "." not in self.url:
raise URLError("'%s' is not a valid URL." % self.url)
def _JSON(self):
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": "%s" % self.user_agent}
payload = {"url": "%s" % self._clean_url()}
response = _get_response(endpoint, params=payload, headers=headers)
return response.json()
def _archive_url(self):
"""Get URL of archive."""
data = self.JSON
if not data["archived_snapshots"]:
archive_url = None
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
return archive_url
def _archive_timestamp(self):
"""Get timestamp of last archive."""
data = self.JSON
if not data["archived_snapshots"]:
time = datetime.max
else:
time = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
return time
def _clean_url(self):
"""Fix the URL, if possible."""
return str(self.url).strip().replace(" ", "_")
def save(self):
"""Create a new Wayback Machine archive for this URL."""
request_url = "https://web.archive.org/save/" + self._clean_url()
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
self.archive_url = "https://" + _archive_url_parser(response.headers)
self.timestamp = datetime.utcnow()
def save(self) -> "Url":
"""Save the URL on wayback machine."""
self.wayback_machine_save_api = WaybackMachineSaveAPI(
self.url, user_agent=self.user_agent
)
self.archive_url = self.wayback_machine_save_api.archive_url
self.timestamp = self.wayback_machine_save_api.timestamp()
self.headers = self.wayback_machine_save_api.headers
return self
def get(self, url="", user_agent="", encoding=""):
"""Return the source code of the supplied URL.
If encoding is not supplied, it is auto-detected from the response.
"""
if not url:
url = self._clean_url()
if not user_agent:
user_agent = self.user_agent
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(url, params=None, headers=headers)
if not encoding:
try:
encoding = response.encoding
except AttributeError:
encoding = "UTF-8"
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
def near(self, year=None, month=None, day=None, hour=None, minute=None):
"""Return the closest Wayback Machine archive to the time supplied.
Supported params are year, month, day, hour and minute.
Any non-supplied parameters default to the current time.
"""
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
def near(
self,
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
unix_timestamp: Optional[int] = None,
) -> "Url":
"""Returns the archive of the URL close to a date and time."""
self.wayback_machine_availability_api.near(
year=year,
month=month,
day=day,
hour=hour,
minute=minute,
unix_timestamp=unix_timestamp,
)
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": "%s" % self.user_agent}
payload = {"url": "%s" % self._clean_url(), "timestamp": timestamp}
response = _get_response(endpoint, params=payload, headers=headers)
data = response.json()
if not data["archived_snapshots"]:
raise WaybackError(
"Can not find archive for '%s' try later or use wayback.Url(url, user_agent).save() "
"to create a new archive." % self._clean_url()
)
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self.archive_url = archive_url
self.timestamp = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
self.set_availability_api_attrs()
return self
def oldest(self, year=1994):
"""Return the oldest Wayback Machine archive for this URL."""
return self.near(year=year)
def oldest(self) -> "Url":
"""Returns the oldest archive of the URL."""
self.wayback_machine_availability_api.oldest()
self.set_availability_api_attrs()
return self
def newest(self):
"""Return the newest Wayback Machine archive available for this URL.
def newest(self) -> "Url":
"""Returns the newest archive of the URL."""
self.wayback_machine_availability_api.newest()
self.set_availability_api_attrs()
return self
Due to Wayback Machine database lag, this may not always be the
most recent archive.
def set_availability_api_attrs(self) -> None:
"""Set the attributes for total backwards compatibility."""
self.archive_url = self.wayback_machine_availability_api.archive_url
self.json = self.wayback_machine_availability_api.json
self.JSON = self.json # for backwards compatibility, do not remove it.
self.timestamp = self.wayback_machine_availability_api.timestamp()
def total_archives(
self, start_timestamp: Optional[str] = None, end_timestamp: Optional[str] = None
) -> int:
"""
return self.near()
def total_archives(self):
"""Returns the total number of Wayback Machine archives for this URL."""
endpoint = "https://web.archive.org/cdx/search/cdx"
headers = {
"User-Agent": "%s" % self.user_agent,
"output": "json",
"fl": "statuscode",
}
payload = {"url": "%s" % self._clean_url()}
response = _get_response(endpoint, params=payload, headers=headers)
# Most efficient method to count number of archives (yet)
return response.text.count(",")
def pick_live_urls(self, url):
try:
response_code = requests.get(url).status_code
except Exception:
return # we don't care if urls are not opening
# 200s are OK and 300s are usually redirects, if you don't want redirects replace 400 with 300
if response_code >= 400:
return
self._alive_url_list.append(url)
def known_urls(self, alive=False, subdomain=False):
"""Returns list of URLs known to exist for given domain name
because these URLs were crawled by WayBack Machine bots.
Useful for pen-testers and others.
Idea by Mohammed Diaa (https://github.com/mhmdiaa) from:
https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
Returns an integer indicating the total number of archives for a URL.
Useless in my opinion; kept only for backwards compatibility.
"""
cdx = WaybackMachineCDXServerAPI(
self.url,
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
)
url_list = []
count = 0
for _ in cdx.snapshots():
count = count + 1
return count
def known_urls(
self,
subdomain: bool = False,
host: bool = False,
start_timestamp: Optional[str] = None,
end_timestamp: Optional[str] = None,
match_type: str = "prefix",
) -> Generator[str, None, None]:
"""Yields known URLs for any URL."""
if subdomain:
request_url = (
"https://web.archive.org/cdx/search/cdx?url=*.%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
else:
request_url = (
"http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
match_type = "domain"
if host:
match_type = "host"
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
data = response.json()
url_list = [y[0] for y in data if y[0] != "original"]
cdx = WaybackMachineCDXServerAPI(
self.url,
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
match_type=match_type,
collapses=["urlkey"],
)
# Remove all deadURLs from url_list if alive=True
if alive:
with concurrent.futures.ThreadPoolExecutor() as executor:
executor.map(self.pick_live_urls, url_list)
url_list = self._alive_url_list
return url_list
for snapshot in cdx.snapshots():
yield snapshot.original
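The refactor replaces comma-counting in raw CDX text with iteration over `WaybackMachineCDXServerAPI.snapshots()`: `total_archives()` counts the snapshots and `known_urls()` lazily yields each snapshot's `original` URL. The pattern can be sketched with a stand-in snapshot source, avoiding network calls (`Snapshot` and `snapshots()` below are illustrative stand-ins, not the real API):

```python
from typing import Iterator, NamedTuple


class Snapshot(NamedTuple):
    original: str


def snapshots() -> Iterator[Snapshot]:
    # Stand-in for WaybackMachineCDXServerAPI.snapshots(); no network access.
    yield Snapshot("https://example.com/")
    yield Snapshot("https://example.com/about")


# total_archives(): count snapshots rather than counting commas in CDX text.
total = sum(1 for _ in snapshots())

# known_urls(): yield each snapshot's original URL lazily.
known = [snap.original for snap in snapshots()]
print(total, known)
```

Yielding (rather than returning a list) means callers can stop early on large domains instead of materializing every known URL up front.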