Compare commits

...

177 Commits
2.0.1 ... 2.4.0

Author SHA1 Message Date
f4f2e51315 V2.4.0 (#62)
* v 2.4.0

* v 2.4.0
2021-01-10 11:53:45 +05:30
d6b7df6837 no need to de-duplicate as we are collapsing the results by urlkey
Same urls aren't recieved
2021-01-10 11:36:46 +05:30
dafba5d0cb collapses=["urlkey"] for known urls 2021-01-10 11:34:06 +05:30
6c71dfbe41 use cdx matchtype for domain and host 2021-01-10 11:10:49 +05:30
a6470b1036 not passing dict to cdxsnapshot 2021-01-10 10:40:32 +05:30
04cda4558e fix test 2021-01-10 03:18:09 +05:30
625ed63482 remove asserts stmnts 2021-01-10 03:05:48 +05:30
a03813315f full cdx api support 2021-01-10 02:23:53 +05:30
a2550f17d7 retries support for get requests 2021-01-06 01:58:38 +05:30
15ef5816db Always cast url to string, avoid passing waybackpy objects to _get_response 2021-01-05 19:46:17 +05:30
93b52bd0fe FIX : don't use self.user_agent if user_agent passed in get() 2021-01-05 19:31:27 +05:30
28ff877081 Update README.md 2021-01-05 19:08:35 +05:30
3e3ecff9df l2 heading and lint 2021-01-05 01:59:29 +05:30
ce64135ba8 ce 2021-01-05 01:52:35 +05:30
2af6580ffb docs link 2021-01-05 01:51:53 +05:30
8a3c515176 v2.3.3 2021-01-05 01:49:26 +05:30
d98c4f32ad v2.3.3 2021-01-05 01:48:54 +05:30
e0a4b007d5 improve docs 2021-01-05 01:46:12 +05:30
6fb6b2deee Update readme + new file CONTRIBUTORS.md (#59)
* remove some badges

* remove made with python button, obvious

* - maintained badge, we already have latest commit badge

- [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)

* re arranged order of badges

* a bit more re odering

* - release badge

* - license section

* center h1

* try once more'

* removed the TOC

* move the hr

* Update README.md

* + hr

* h1 --> h2

* remove tests and pacakging info from here to docs/wiki

* Update README.md

* example inspired by psf/requests

* CLI tool example gist

* Update README.md

* Update README.md

* + license

* Update README.md

* authors list

* Update CONTRIBUTORS.md

* fix code

* Update README.md

* Update README.md

* center the button
2021-01-05 00:30:07 +05:30
1882862992 now using cdx Pagination API 2021-01-04 20:46:54 +05:30
0c6107e675 increase coverage 2021-01-04 01:54:40 +05:30
bd079978bf inc coverage 2021-01-04 00:44:55 +05:30
5dec4927cd refactoring, try to code complexity 2021-01-04 00:14:38 +05:30
62e5217b9e reduce code complexity: refactoring, less flow breaking structures 2021-01-03 19:38:25 +05:30
9823c809e9 Added doc strings in wrapper.py, documenting code and improving docs. 2021-01-03 17:11:32 +05:30
db5737a857 JSON is now available for near and other other methods that call it 2021-01-02 18:52:46 +05:30
ca0821a466 Wiki docs (#58)
* move docs to wiki

* Update README.md

* Update setup.py
2021-01-02 12:20:43 +05:30
bb4dbc7d3c rm url = obj.url 2021-01-02 11:19:09 +05:30
7c7fd75376 No need to fetch archive_url and timestamp from availability API on init (#55)
* No need to fetch archive_url and timestamp from availability API on init. 

Not useful if all I want is to archive a page

* Update test_wrapper.py

* Update wrapper.py

* Update test_wrapper.py

* Update wrapper.py

* Update cli.py

* Update wrapper.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update setup.py

* Update README.md
2021-01-02 11:10:23 +05:30
0b71433667 v2.3.1 (#54)
* 2.3.1

* 2.3.1
2021-01-01 19:15:23 +05:30
1b499a7594 removed JSON from init, this was resulting in too much unnecessary taffic. Some users who are thousands of URLs were blocked by IA (#53)
closes #52
2021-01-01 16:38:57 +05:30
da390ee8a3 improve maintainability and reduce code cognitive complexity (#49) 2020-12-15 10:24:13 +05:30
d3e68d0e70 code formated with black (#47) 2020-12-14 01:18:04 +05:30
fde28d57aa Update CONTRIBUTING.md 2020-12-14 00:16:29 +05:30
6092e504c8 Update CONTRIBUTING.md 2020-12-14 00:15:51 +05:30
93ef60ecd2 v2.3.0 (#46)
* v2.3.0

* v2.3.0

* decrease line length
2020-12-14 00:14:54 +05:30
461b3f74c9 UPDATE header image url 2020-12-13 23:09:59 +05:30
3c53b411b0 Improve the appearance of readme (#45)
* replaced text header wth image

* svg

* Update README.md

* Update README.md

* Update README.md

* level 2

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Create CONTRIBUTING.md

* Update README.md

* Add files via upload

* Update README.md

* Delete waybackpy-colored 284.png

* Delete waybackpy colored.png

* Update README.md

* Update index.rst

* Update index.rst

* Update index.rst

* Update setup.py

* Delete index.rst

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md
2020-12-13 23:08:16 +05:30
8125526061 create pyup.io config file (#44) 2020-12-13 22:31:49 +05:30
2dc81569a8 Create .pep8speaks.yml 2020-12-13 17:58:09 +05:30
fd163f3d36 Update wrapper.py 2020-12-13 17:12:32 +05:30
a0a918cf0d . 2020-12-13 17:10:28 +05:30
4943cf6873 remove print stmnt, update ci 2020-12-13 16:37:35 +05:30
bc3efc7d63 now using requests lib as it handles errors nicely (#42)
* now using requests lib as it handles errors nicely

* remove unused import (urllib)

* FIX : replaced full_url with endpoint (not using urlib)

* LINT :  Found in waybackpy\wrapper.py:88  Unnecessary else after return
2020-12-13 15:44:37 +05:30
f89368f16d LINT : Found in waybackpy\wrapper.py:88 Unnecessary else after return 2020-12-13 15:39:23 +05:30
c919a6a605 FIX : replaced full_url with endpoint (not using urlib) 2020-12-13 15:22:56 +05:30
0280fca189 remove unused import (urllib) 2020-12-13 15:13:51 +05:30
60ee8b95a8 now using requests lib as it handles errors nicely 2020-12-13 15:05:57 +05:30
ca51c14332 deleted .travis.yml, link with flake (#41)
close #38
2020-11-26 13:06:50 +05:30
525cf17c6f Update ci.yml 2020-11-26 12:14:15 +05:30
406e03c52f Update ci.yml 2020-11-26 12:04:45 +05:30
672b33e83a Update ci.yml 2020-11-26 10:10:10 +05:30
b19b840628 Update ci.yml 2020-11-26 10:01:55 +05:30
a6df4f899c Update ci.yml 2020-11-26 09:26:11 +05:30
7686e9c20d Update README.md (#40) 2020-11-26 09:18:26 +05:30
3c5932bc39 now using gh actions (#39) 2020-11-26 09:09:53 +05:30
f9a986f489 Create ci.yml 2020-11-26 08:55:23 +05:30
0d7458ee90 per https://docs.travis-ci.com/user/languages/python/, Python builds are not available on the macOS 2020-11-26 08:08:59 +05:30
ac8b9d6a50 use osx, huge backlog on .org travis for linux builds 2020-11-26 08:03:27 +05:30
58cd9c28e7 Threading enabled checking for URLs 2020-11-26 06:15:42 +05:30
5088305a58 removed python2 compatibility code 2020-11-21 17:00:11 +05:30
9f847a5e55 change pepy.tech download count link, they removed the month page 2020-11-11 10:44:14 +05:30
6c04c2f3d3 + https://github.com/akamhy/waybackpy/graphs/contributors 2020-11-04 08:09:30 +05:30
925be7b17e V2.2.0 2020-10-17 17:10:46 +05:30
2b132456ac updated index.rst and minor docs updated. 2020-10-17 16:56:51 +05:30
50e3154a4e lint README.md 2020-10-17 12:01:49 +05:30
7aef50428f add link to the repo 2020-10-17 11:51:56 +05:30
d8ec0f5025 More pythonic code snippets in README (#36) 2020-10-17 11:49:27 +05:30
0a2f97c034 Update README, drop python 2 support
* Drop python 2 support

* updated docs

* added new docs
2020-10-16 22:37:32 +05:30
3e9cf23578 3.9 archive doesn't not exist yet. 2020-10-16 19:43:06 +05:30
7f927ec7be added tests for json and archive_url, updated broken tests (#34)
* added tests for json and archive_url, updated broken tests

* drop 2.7 support
2020-10-16 19:25:45 +05:30
9de6393cd5 Add support for JSON and archive_url (#33)
CLI support for JSON and archive_url attributes
2020-10-16 15:16:18 +05:30
91e7f65617 Fixing len() bug (#32)
* added class functionality

* Update wrapper.py

* style edits

* fixed bug with len() of url()

* fixing len() bug

* fixing len() bug

* squashing bug

* removed test notebook
2020-10-16 10:04:13 +05:30
d465454019 Adding attributes to Url class (#28)
* added class functionality

* Update wrapper.py

* style edits
2020-10-15 22:10:32 +05:30
1a81eb97fb lint 2020-10-03 16:58:11 +05:30
6b3b2e2a7d tests for newly added known_urls feature 2020-10-03 09:33:50 +05:30
82c65454e6 2.1.9 2020-10-03 01:34:15 +05:30
19710461b6 Update setup.py 2020-10-03 01:33:46 +05:30
a3661d6b85 Update index.rst 2020-10-03 01:33:15 +05:30
58375e4ef4 fix broken links 2020-10-03 01:31:28 +05:30
ea023e98da update 2020-10-03 01:22:51 +05:30
f1065ed1c8 v2.1.8 2020-10-03 01:18:30 +05:30
315519b21f 2.1.8 2020-10-03 01:18:08 +05:30
07c98661de add usage for known urls (#26)
* Update README.md

* Update README.md

* Update README.md

* bash example for known urls

* python examples / usage for known urls :)

* Update README.md

* Update README.md

* Update README.md

* Update README.md
2020-10-03 01:16:19 +05:30
2cd991a54e lint markdown 2020-10-02 23:34:06 +05:30
ede251afb3 update tests 2020-10-02 23:10:48 +05:30
a8ce970ca0 fixed yet another issue with tests :( 2020-10-02 23:01:59 +05:30
243af26bf6 update version format in tests 2020-10-02 22:23:58 +05:30
0f1db94884 license & packaging info 2020-10-02 22:10:30 +05:30
c304f58ea2 update tests 2020-10-02 21:35:39 +05:30
23f7222cb5 tweak 2020-10-02 21:01:32 +05:30
ce7294d990 Implemented new feature, known urls for domain. 2020-10-02 20:27:28 +05:30
c9fa114d2e grammar 2020-10-01 23:50:03 +05:30
8b6bacb28e Add files via upload 2020-09-08 09:23:59 +05:30
32d8ad7780 Update README.md (#24)
- IA and Wayback machine logo, added new waybackpy logo.
+ changed pages to webpages in lead
2020-09-08 09:12:48 +05:30
cbf2f90faa Add files via upload 2020-09-08 09:06:33 +05:30
4dde3e3134 Delete a.txt 2020-09-08 09:02:36 +05:30
1551e8f1c6 Add files via upload 2020-09-08 09:02:19 +05:30
c84f09e2d2 Create a.txt 2020-09-08 08:59:28 +05:30
57a32669b5 v2.1.7 2020-08-09 11:06:29 +05:30
fe017cbcc8 v2.1.7 2020-08-09 11:06:04 +05:30
5edb03d24b update docs 2020-08-09 11:05:04 +05:30
c5de2232ba Update test_wrapper.py 2020-08-09 10:53:00 +05:30
ca9186c301 update message, sometimes raised for poor performance by wayback machine even if the url is archived. 2020-08-09 10:43:16 +05:30
8a4b631c13 new regex to parse archive, IA changed the header again :( 2020-08-09 10:36:25 +05:30
ec9ce92f48 Update README.md (#23)
* Update README.md

* fix grammar
2020-07-26 10:30:54 +05:30
e95d35c37f re arrange the badges, moved contributions welcome to top 2020-07-26 10:24:31 +05:30
36d662b961 Update __version__.py 2020-07-24 16:24:57 +05:30
2835f8877e Update setup.py 2020-07-24 16:24:38 +05:30
18cbd2fd30 Update cli.py 2020-07-24 16:10:29 +05:30
a2812fb56f patch for cli 2020-07-24 16:09:47 +05:30
77effcf649 Update setup.py 2020-07-24 15:34:14 +05:30
7272ef45a0 Update __version__.py 2020-07-24 15:33:58 +05:30
56116551ac Coverge improvements (#22)
* Update cli.py

* improved tests

* chnages for proper testing

* Type check using isinstance

* Replace elifs with if when used after return

* twitter.com --> www.ibm.com

* fix typo

* test archive urll parser and dunders

* Update test_wrapper.py
2020-07-24 15:31:21 +05:30
4dcda94cb0 v2.1.4 2020-07-24 01:03:44 +05:30
09f59b0182 v2.1.4 2020-07-24 01:03:04 +05:30
ed24184b99 Remove duplicate get response method 2020-07-24 00:57:22 +05:30
56bef064b1 only test save on >3.7 2020-07-23 20:51:46 +05:30
44bb2cf5e4 some cli tests 2020-07-23 20:44:14 +05:30
e231228721 Update README.md (#21)
* Update README.md

* example bash oldest newest

* total archives bash example

* near bash example

* format the list

* ce

* get bash example

* pip git install example

* Update index.rst

* + argparse

* + argparse
2020-07-22 21:35:02 +05:30
b8b2d6dfa9 v2.1.3 2020-07-22 20:21:37 +05:30
3eca6294df v2.1.3 2020-07-22 20:20:44 +05:30
eb037a0284 Rename test_1.py to test_wrapper.py 2020-07-22 20:19:59 +05:30
a01821f20b Update .travis.yml 2020-07-22 17:33:59 +05:30
b21036f8df Update .travis.yml 2020-07-22 17:31:57 +05:30
b43bacb7ac fix error language 2020-07-22 17:25:15 +05:30
f7313b255a Update cli.py 2020-07-22 17:22:38 +05:30
7457e1c793 - print(repr(obj)) 2020-07-22 17:18:27 +05:30
f7493d823f Update cli.py 2020-07-22 17:16:53 +05:30
7fa7b59ce3 if version don't try to create object 2020-07-22 17:15:28 +05:30
78a608db50 Update cli.py 2020-07-22 17:12:44 +05:30
93f7dfdaf9 resolve args conflict 2020-07-22 17:09:32 +05:30
83c6f256c9 version arg 2020-07-22 17:03:56 +05:30
dee9105794 command_line support (#18)
* Update wrapper.py

* entry points cli

* Suppress the urllib2/3 Exception

* rm cli code, will create a new cli.py file

* Create cli.py

* update cli entry pts

* Update cli.py

* Update cli.py

* import print_function

* Update cli.py

* Update cli.py

* Delete pypi_uploader.sh

* resolve conflicts with the master

* update the test ; resolve the conflicts

* decrease code complexity

* cli method changed to main

* get is not for just local usage

* get method should be available from interface

* get is used in the interface

* Update cli.py
2020-07-22 16:40:13 +05:30
3bfc3b46d0 Delete SECURITY.md 2020-07-22 11:07:59 +05:30
553f150bee replace youtube with twitter.com
for some reason Wayback API is returing diffrent youtube URL now.
2020-07-22 11:07:23 +05:30
b3a7e714a5 Update wrapper.py 2020-07-22 10:57:43 +05:30
cd9841713c Update wrapper.py 2020-07-22 10:52:43 +05:30
1ea9548d46 Raise WaybackError from URLError and include URL (#19)
* Raise WaybackError from URLError and include URL

* python2 compatibility

Co-authored-by: Akash <64683866+akamhy@users.noreply.github.com>
2020-07-22 10:51:44 +05:30
be7642c837 Code style improvements (#20)
* Add sane line length to setup.cfg

* Use Black for quick readability improvements

* Clean up exceptions, docstrings, and comments

Docstrings on dunder functions are redundant and typically ignored
Limit to reasonable line length
General grammar and style corrections
Clarify docstrings and exceptions
Format docstrings per PEP 257 -- Docstring Conventions

* Move archive_url_parser out of Url.save()

It's generally poor form to define a function in a function, as it will
be re-defined each time the function is run.

archive_url_parser does not depend on anything in Url, so it makes sense
to move it out of the class.

* move wayback_timestamp out of class, mark private functions

* DRY in _wayback_timestamp

* Url._url_check should return None

There's no point in returning True if it's never checked and won't ever
be False.
Implicitly returning None or raising an exception is more idiomatic.

* Default parameters should be type-consistant with expected values

* Specify parameters to near

* Use datetime.datetime in _wayback_timestamp

* cleanup __init__.py

* Cleanup formatting in tests

* Fix names in tests

* Revert "Use datetime.datetime in _wayback_timestamp"

This reverts commit 5b30380865.

Introduced unnecessary complexity

* Move _get_response outside of Url

Because Codacy reminded me that I missed it.

* fix imports in tests
2020-07-22 10:09:14 +05:30
a418a4e464 Update SECURITY.md 2020-07-21 10:38:41 +05:30
aec035ef1e Create SECURITY.md 2020-07-21 08:41:07 +05:30
6d37993ab9 moved to manuals 2020-07-21 08:14:40 +05:30
72b80ca44e Create pypi_uploader.sh 2020-07-21 08:14:21 +05:30
c10aa9279c Create python-publish.yml 2020-07-21 08:08:55 +05:30
68d809a7d6 Update test_1.py 2020-07-20 23:45:49 +05:30
4ad09a419b Fix bash syntax 2020-07-20 23:44:23 +05:30
ddc6620f09 Only report coverage if python 3.8 or greater 2020-07-20 23:29:54 +05:30
4066a65678 Update .travis.yml 2020-07-20 23:20:43 +05:30
8e46a9ba7a Update .travis.yml 2020-07-20 23:16:33 +05:30
a5a98b9b00 Update .travis.yml 2020-07-20 23:11:56 +05:30
a721ab7d6c Update .travis.yml 2020-07-20 23:10:06 +05:30
7db27ae5e1 Create pypi_uploader.sh 2020-07-20 22:28:35 +05:30
8fd4462025 Update wrapper.py 2020-07-20 20:17:18 +05:30
c458a15820 Update .travis.yml 2020-07-20 15:38:33 +05:30
bae3412bee Update .travis.yml 2020-07-20 15:24:26 +05:30
94cb08bb37 Update setup.py 2020-07-20 10:41:00 +05:30
af888db13e 2.1.2 2020-07-20 10:40:37 +05:30
d24f2408ee Update test_1.py 2020-07-20 10:31:47 +05:30
ddd2274015 Update test_1.py 2020-07-20 10:21:15 +05:30
99abdb7c67 Update test_1.py 2020-07-20 10:16:39 +05:30
f3bb9a8540 Update wrapper.py 2020-07-20 10:11:36 +05:30
bb94e0d1c5 Update index.rst and remove dupes 2020-07-20 10:07:31 +05:30
1a78d88be2 2.1.1 2020-07-19 23:17:01 +05:30
3ec61758b3 Update __version__.py 2020-07-19 23:16:13 +05:30
83c962166d Raise 2020-07-19 23:02:04 +05:30
e87dee3bdf Waybackpy example on replit (#15)
* Waybackpy save example on replit

* Oldest example

* Newest method replit link

* Near method example

* Get example

* Total archive method example
2020-07-19 22:28:08 +05:30
b27bfff15a v2.1.0 2020-07-19 21:08:01 +05:30
970fc1cd08 Update __version__.py 2020-07-19 21:06:54 +05:30
65391bf14b update 2020-07-19 21:04:32 +05:30
8ab116f276 API chnaged again. updated
* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* api changed; fix archive url parser

* Update wrapper.py

* - Trailing whitespace

* include the header in exception
2020-07-19 20:39:07 +05:30
6f82041ec9 Update README.md (#13)
* Update README.md

* Update README.md

* replit demo for waybackpy.Url.save()

* Update README.md

* Update README.md

* replit demo for oldest()

* replit demo for newest()

* Update README.md

* replit demo for total_archives

* demo at replit for get()

* demo for near

* Update README.md

* Update README.md

* Update README.md
2020-07-19 16:39:39 +05:30
11059c960e Update setup.py 2020-07-18 19:27:04 +05:30
eee1b8eba1 Update __version__.py 2020-07-18 19:26:41 +05:30
f7de8f5575 sleeps to prevent too many requests in a timeframe 2020-07-18 19:25:19 +05:30
3fa0c32064 V2.0.1 link 2020-07-18 19:09:18 +05:30
aa1e3b8825 V2.0.1 2020-07-18 19:08:39 +05:30
33 changed files with 2692 additions and 677 deletions

42
.github/workflows/ci.yml vendored Normal file
View File

@ -0,0 +1,42 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: CI
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.8']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest codecov pytest-cov
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest --cov=waybackpy tests/
- name: Upload coverage to Codecov
run: |
bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

31
.github/workflows/python-publish.yml vendored Normal file
View File

@ -0,0 +1,31 @@
# This workflows will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
name: Upload Python Package
on:
release:
types: [created]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools wheel twine
- name: Build and publish
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
run: |
python setup.py sdist bdist_wheel
twine upload dist/*

3
.gitignore vendored
View File

@ -1,3 +1,6 @@
# Files generated while testing
*-urls-*.txt
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

4
.pep8speaks.yml Normal file
View File

@ -0,0 +1,4 @@
# File : .pep8speaks.yml
scanner:
diff_only: True # If True, errors caused by only the patch are shown

5
.pyup.yml Normal file
View File

@ -0,0 +1,5 @@
# autogenerated pyup.io config file
# see https://pyup.io/docs/configuration/ for all available options
schedule: ''
update: false

View File

@ -1,14 +0,0 @@
language: python
python:
- "2.7"
- "3.6"
- "3.8"
os: linux
dist: xenial
cache: pip
install:
- pip install pytest
before_script:
cd tests
script:
- pytest test_1.py

58
CONTRIBUTING.md Normal file
View File

@ -0,0 +1,58 @@
# Contributing to waybackpy
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
- Reporting a bug
- Discussing the current state of the code
- Submitting a fix
- Proposing new features
- Becoming a maintainer
## We Develop with Github
We use github to host code, to track issues and feature requests, as well as accept pull requests.
## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
## Any contributions you make will be under the MIT Software License
In short, when you submit code changes, your submissions are understood to be under the same [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
## Report bugs using Github's [issues](https://github.com/akamhy/waybackpy/issues)
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/akamhy/waybackpy/issues/new); it's that easy!
## Write bug reports with detail, background, and sample code
**Great Bug Reports** tend to have:
- A quick summary and/or background
- Steps to reproduce
- Be specific!
- Give sample code if you can.
- What you expected would happen
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)
People *love* thorough bug reports. I'm not even kidding.
## Use a Consistent Coding Style
* You can try running `flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics` for style unification.
## License
By contributing, you agree that your contributions will be licensed under its [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE).
## References
This document is forked from [this gist](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62) by [briandk](https://github.com/briandk) which was itself adapted from the open-source contribution guidelines for [Facebook's Draft](https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md)

8
CONTRIBUTORS.md Normal file
View File

@ -0,0 +1,8 @@
## AUTHORS
- akamhy (<https://github.com/akamhy>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
## ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.

View File

@ -1,6 +1,6 @@
MIT License
Copyright (c) 2020 akamhy
Copyright (c) 2020 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

197
README.md
View File

@ -1,142 +1,101 @@
# waybackpy
<div align="center">
[![Build Status](https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square)](https://travis-ci.org/akamhy/waybackpy)
[![Downloads](https://img.shields.io/pypi/dm/waybackpy.svg)](https://pypistats.org/packages/waybackpy)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![Maintainability](https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability)](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[![CodeFactor](https://www.codefactor.io/repository/github/akamhy/waybackpy/badge)](https://www.codefactor.io/repository/github/akamhy/waybackpy)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
![pypi](https://img.shields.io/pypi/v/waybackpy.svg)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
[![codecov](https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg)](https://codecov.io/gh/akamhy/waybackpy)
![](https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square)
![contributions welcome](https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square)
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<h2>Python package & CLI tool that interfaces with the Wayback Machine API</h2>
![Internet Archive](https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png)
![Wayback Machine](https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png)
</div>
Waybackpy is a Python library that interfaces with the [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine) API. Archive pages and retrieve archived pages easily.
<p align="center">
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ACI"><img alt="Build Status" src="https://github.com/akamhy/waybackpy/workflows/CI/badge.svg"></a>
<a href="https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade"><img alt="Codacy Badge" src="https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65"></a>
<a href="https://codecov.io/gh/akamhy/waybackpy"><img alt="codecov" src="https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg"></a>
<a href="https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md"><img alt="Contributions Welcome" src="https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square"></a>
</p>
Table of contents
=================
<!--ts-->
-----------------------------------------------------------------------------------------------------------------------------------------------
* [Installation](#installation)
### Installation
* [Usage](#usage)
* [Saving an url using save()](#capturing-aka-saving-an-url-using-save)
* [Receiving the oldest archive for an URL Using oldest()](#receiving-the-oldest-archive-for-an-url-using-oldest)
* [Receiving the recent most/newest archive for an URL using newest()](#receiving-the-newest-archive-for-an-url-using-newest)
* [Receiving archive close to a specified year, month, day, hour, and minute using near()](#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage using get()](#get-the-content-of-webpage-using-get)
* [Count total archives for an URL using total_archives()](#count-total-archives-for-an-url-using-total_archives)
* [Tests](#tests)
* [Dependency](#dependency)
* [License](#license)
<!--te-->
## Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
```bash
pip install waybackpy
```
Install directly from GitHub:
## Usage
#### Capturing aka Saving an url Using save()
```python
import waybackpy
# Capturing a new archive on Wayback machine.
target_url = waybackpy.Url("https://github.com/akamhy/waybackpy", user_agnet="My-cool-user-agent")
archived_url = target_url.save()
print(archived_url)
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
This should print an URL similar to the following archived URL:
> <https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy>
### Usage
#### Receiving the oldest archive for an URL Using oldest()
#### As a Python package
```python
import waybackpy
# retrieving the oldest archive on Wayback machine.
target_url = waybackpy.Url("https://www.google.com/", "My-cool-user-agent")
oldest_archive = target_url.oldest()
print(oldest_archive)
>>> import waybackpy
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
>>> archive.archive_url
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)
>>> oldest_archive = wayback.oldest()
>>> oldest_archive.archive_url
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> archive_close_to_2010_feb.archive_url
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> wayback.newest().archive_url
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
```
This should print the oldest available archive for <https://google.com>.
> <http://web.archive.org/web/19981111184551/http://google.com:80/>
> Full Python package documentation can be found at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
#### Receiving the newest archive for an URL using newest()
```python
import waybackpy
# retrieving the newest/latest archive on Wayback machine.
target_url = waybackpy.Url(url="https://www.google.com/", user_agnet="My-cool-user-agent")
newest_archive = target_url.newest()
print(newest_archive)
#### As a CLI tool
```bash
$ waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
$ waybackpy --oldest --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid
$ waybackpy --newest --url "https://en.wikipedia.org/wiki/Remote_sensing" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20201221130522/https://en.wikipedia.org/wiki/Remote_sensing
$ waybackpy --total --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent"
1904
$ waybackpy --known_urls --url akamhy.github.io --user_agent "my-unique-user-agent"
https://akamhy.github.io
https://akamhy.github.io/assets/js/scale.fix.js
https://akamhy.github.io/favicon.ico
https://akamhy.github.io/robots.txt
https://akamhy.github.io/waybackpy/
'akamhy.github.io-10-urls-m2a24y.txt' saved in current working directory
```
This print the newest available archive for <https://www.microsoft.com/en-us>, something just like this:
> <http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/>
#### Receiving archive close to a specified year, month, day, hour, and minute using near()
```python
import waybackpy
# retriving the the closest archive from a specified year.
# supported argumnets are year,month,day,hour and minute
target_url = waybackpy.Url(https://www.facebook.com/", "Any-User-Agent")
archive_near_year = target_url.near(year=2010)
print(archive_near_year)
```
returns : <http://web.archive.org/web/20100504071154/http://www.facebook.com/>
> Please note that if you only specify the year, the current month and day are default arguments for month and day respectively. Just putting the year parameter would not return the archive closer to January but the current month you are using the package. You need to specify the month "1" for January , 2 for february and so on.
> Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
#### Get the content of webpage using get()
```python
import waybackpy
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# supported argumnets encoding and user_agent
target = waybackpy.Url("google.com", "any-user_agent")
oldest_url = target.oldest()
webpage = target.get(oldest_url) # We are getting the source of oldest archive of google.com.
print(webpage)
```
> This should print the source code for oldest archive of google.com. If no URL is passed in get() then it should retrive the source code of google.com and not any archive.
#### Count total archives for an URL using total_archives()
```python
from waybackpy import Url
# retriving the content of a webpage from any url including but not limited to the archived urls.
count = Url("https://en.wikipedia.org/wiki/Python (programming language)", "User-Agent").total_archives()
print(count)
```
> This should print an integer (int), which is the number of total archives on archive.org
## Tests
* [Here](https://github.com/akamhy/waybackpy/tree/master/tests)
## Dependency
* None, just python standard libraries (re, json, urllib and datetime). Both python 2 and 3 are supported :)
> Full CLI documentation can be found at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
## License
[MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
Released under the MIT License. See
[license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.
-----------------------------------------------------------------------------------------------------------------------------------------------

View File

@ -1 +1 @@
theme: jekyll-theme-cayman
theme: jekyll-theme-cayman

View File

@ -0,0 +1,268 @@
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="629.000000pt" height="103.000000pt" viewBox="0 0 629.000000 103.000000"
preserveAspectRatio="xMidYMid meet">
<g transform="translate(0.000000,103.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M0 515 l0 -515 3145 0 3145 0 0 515 0 515 -3145 0 -3145 0 0 -515z
m5413 439 c31 -6 36 -10 31 -26 -3 -10 0 -26 7 -34 6 -8 10 -17 7 -20 -3 -2
-17 11 -32 31 -15 19 -41 39 -59 44 -38 11 -10 14 46 5z m150 -11 c-7 -2 -21
-2 -30 0 -10 3 -4 5 12 5 17 0 24 -2 18 -5z m-4869 -23 c-6 -6 -21 -6 -39 -1
-30 9 -30 9 10 10 25 1 36 -2 29 -9z m452 -37 c-3 -26 -15 -65 -25 -88 -10
-22 -21 -64 -25 -94 -3 -29 -14 -72 -26 -95 -11 -23 -20 -51 -20 -61 0 -30
-39 -152 -53 -163 -6 -5 -45 -12 -85 -14 -72 -5 -102 4 -102 33 0 6 -9 31 -21
56 -11 25 -26 72 -33 103 -6 31 -17 64 -24 73 -8 9 -22 37 -32 64 l-18 48 -16
-39 c-9 -21 -16 -44 -16 -50 0 -6 -7 -24 -15 -40 -8 -16 -24 -63 -34 -106 -11
-43 -26 -93 -34 -112 -14 -34 -15 -35 -108 -46 -70 -9 -96 -9 -106 0 -21 17
-43 64 -43 92 0 14 -4 27 -9 31 -12 7 -50 120 -66 200 -8 35 -25 81 -40 103
-14 22 -27 52 -28 68 -2 28 0 29 48 31 28 1 82 5 120 9 54 4 73 3 82 -7 11
-15 53 -148 53 -170 0 -7 9 -32 21 -56 20 -41 39 -49 39 -17 0 8 -5 12 -10 9
-6 -3 -13 2 -16 12 -3 10 -10 26 -15 36 -14 26 7 21 29 -8 l20 -26 7 33 c7 35
41 149 56 185 7 19 16 23 56 23 27 0 80 2 120 6 80 6 88 1 97 -71 3 -20 9 -42
14 -48 5 -7 20 -43 32 -82 13 -38 24 -72 26 -74 2 -2 13 4 24 14 13 12 20 31
20 55 0 20 7 56 15 81 7 24 19 63 25 87 12 47 31 60 89 61 l34 1 -7 -47z
m3131 41 c17 -3 34 -12 37 -20 3 -7 1 -48 -4 -91 -4 -43 -7 -80 -4 -82 2 -2
11 2 20 10 9 7 24 18 34 24 9 5 55 40 101 77 79 64 87 68 136 68 28 0 54 -4
58 -10 3 -5 12 -7 20 -3 9 3 15 -1 15 -9 0 -13 -180 -158 -197 -158 -4 0 -14
-9 -20 -20 -11 -17 -7 -27 27 -76 22 -32 40 -63 40 -70 0 -7 6 -19 14 -26 7
-8 37 -48 65 -89 l52 -74 -28 -3 c-51 -5 -74 -12 -68 -22 9 -14 -59 -12 -73 2
-20 20 -13 30 10 14 34 -24 44 -19 17 8 -25 25 -109 140 -109 149 0 7 -60 97
-64 97 -2 0 -11 -10 -22 -22 -18 -21 -18 -21 0 -15 10 4 25 2 32 -4 18 -15 19
-35 2 -22 -7 6 -25 13 -39 17 -34 8 -39 -5 -39 -94 0 -38 -3 -75 -6 -84 -6
-16 -54 -22 -67 -9 -4 3 -40 7 -81 8 -101 2 -110 10 -104 97 3 37 10 73 16 80
6 8 10 77 10 174 0 89 2 166 6 172 6 11 162 15 213 6z m301 -1 c-25 -2 -52
-11 -58 -19 -7 -7 -17 -14 -23 -14 -5 0 -2 9 8 20 14 16 29 20 69 18 l51 -2
-47 -3z m809 -9 c33 -21 65 -89 62 -132 -1 -21 1 -47 5 -59 9 -28 -26 -111
-51 -120 -10 -3 -25 -12 -33 -19 -10 -8 -70 -15 -170 -21 l-155 -8 4 -73 c4
-93 -10 -112 -80 -112 -26 0 -60 5 -74 12 -19 8 -31 8 -51 -1 -45 -20 -55 -1
-55 98 0 47 -1 111 -3 141 -2 30 -5 107 -7 170 l-4 115 65 2 c36 2 103 7 150
11 150 15 372 13 397 -4z m338 -19 c11 -14 46 -54 78 -88 l58 -62 62 65 c34
36 75 73 89 83 28 18 113 24 122 9 3 -5 -32 -51 -77 -102 -147 -167 -134 -143
-139 -253 -3 -54 -10 -103 -16 -109 -8 -8 -8 -17 -1 -30 14 -26 11 -28 -47
-29 -119 -2 -165 3 -174 22 -6 10 -9 69 -8 131 l2 113 -57 75 c-32 41 -80 102
-107 134 -27 33 -47 62 -45 66 3 4 58 6 122 4 113 -3 119 -5 138 -29z m-4233
13 c16 -13 98 -150 98 -164 0 -4 29 -65 65 -135 36 -71 65 -135 65 -143 0 -10
-14 -17 -37 -21 -21 -4 -48 -10 -61 -16 -40 -16 -51 -10 -77 41 -29 57 -35 59
-157 38 -65 -11 -71 -14 -84 -43 -10 -25 -21 -34 -46 -38 -41 -6 -61 8 -48 33
15 28 12 38 -12 42 -18 2 -23 10 -24 36 -1 27 3 35 23 43 13 5 34 9 46 9 23 0
57 47 57 78 0 9 10 33 22 52 14 24 21 52 22 92 1 49 4 58 24 67 13 6 31 11 40
11 9 0 26 7 36 15 24 18 28 18 48 3z m1701 0 c16 -12 97 -143 97 -157 0 -3 32
-69 70 -146 39 -76 67 -142 62 -147 -4 -4 -28 -12 -52 -17 -25 -6 -57 -13 -72
-17 -25 -6 -29 -2 -50 42 -14 30 -31 50 -43 53 -11 2 -57 -2 -103 -9 -79 -12
-83 -13 -96 -45 -10 -24 -22 -34 -46 -38 -43 -9 -53 -1 -45 39 5 30 3 34 -15
34 -17 0 -20 6 -20 39 0 40 13 50 65 51 19 0 55 48 55 72 0 6 8 29 19 52 32
72 41 107 31 127 -8 14 -5 21 12 33 12 9 32 16 43 16 11 0 29 7 39 15 24 18
28 18 49 3z m-3021 -11 c-29 -9 -32 -13 -27 -39 8 -36 -11 -37 -20 -1 -8 32
15 54 54 52 24 -1 23 -2 -7 -12z m3499 4 c-12 -8 -51 -4 -51 5 0 2 15 4 33 4
22 0 28 -3 18 -9z m1081 -67 c2 -42 0 -78 -4 -81 -5 -2 -8 18 -8 45 0 27 -3
64 -6 81 -4 19 -2 31 4 31 6 0 12 -32 14 -76z m-1951 46 c12 -7 19 -21 19 -38
l-1 -27 -15 28 c-8 15 -22 27 -32 27 -9 0 -24 5 -32 10 -21 14 35 13 61 0z
m1004 -3 c73 -19 135 -61 135 -92 0 -15 -8 -29 -21 -36 -18 -9 -30 -6 -69 15
-37 20 -62 26 -109 26 -54 0 -62 -3 -78 -26 -21 -32 -33 -130 -25 -191 9 -58
41 -84 111 -91 38 -3 61 1 97 17 36 17 49 19 60 10 25 -21 15 -48 -28 -76 -38
-24 -54 -28 -148 -31 -114 -4 -170 10 -190 48 -6 11 -16 20 -23 20 -24 0 -59
95 -59 159 0 59 20 122 42 136 6 3 10 13 10 22 0 31 80 82 130 83 19 0 42 5
50 10 21 13 57 12 115 -3z m-1682 -23 c-14 -14 -28 -23 -31 -20 -8 8 29 46 44
46 7 0 2 -11 -13 -26z m159 -2 c-20 -15 -22 -23 -16 -60 4 -28 3 -42 -5 -42
-7 0 -11 19 -11 50 0 36 5 52 18 59 28 17 39 12 14 -7z m1224 -28 c-39 -40
-46 -38 -19 7 15 24 40 41 52 33 2 -2 -13 -20 -33 -40z m-1538 -33 l62 -66 63
68 c56 59 68 67 100 67 19 0 38 -3 40 -7 3 -5 -32 -53 -76 -108 -88 -108 -84
-97 -90 -255 l-2 -55 -87 -3 c-49 -1 -88 -1 -89 0 0 2 -3 50 -5 107 -3 75 -8
109 -19 121 -8 9 -15 20 -15 25 0 4 -18 29 -41 54 -83 94 -89 102 -84 111 3 6
45 9 93 9 l87 -1 63 -67z m786 59 c33 -12 48 -42 52 -107 3 -43 0 -57 -16 -73
l-20 -20 20 -28 c26 -35 35 -89 21 -125 -18 -46 -66 -60 -226 -64 -77 -3 -166
-7 -198 -10 -84 -7 -99 9 -97 102 1 38 -1 125 -4 191 l-5 122 47 5 c26 3 103
4 171 2 69 -2 134 1 145 5 29 12 80 12 110 0z m-1050 -16 c3 -8 2 -12 -4 -9
-6 3 -10 10 -10 16 0 14 7 11 14 -7z m-374 -22 c0 -9 -5 -24 -10 -32 -7 -11
-10 -5 -10 23 0 23 4 36 10 32 6 -3 10 -14 10 -23z m1701 16 c2 -21 -2 -43
-10 -51 -4 -4 -7 9 -8 28 -1 32 15 52 18 23z m2859 -28 c-11 -20 -50 -28 -50
-10 0 6 9 10 19 10 11 0 23 5 26 10 12 19 16 10 5 -10z m-4759 -47 c-8 -15
-10 -15 -11 -2 0 17 10 32 18 25 2 -3 -1 -13 -7 -23z m2599 9 c0 -9 -40 -35
-46 -29 -6 6 25 37 37 37 5 0 9 -3 9 -8z m316 -127 c-4 -19 -12 -37 -18 -41
-8 -5 -9 -1 -5 10 4 10 7 36 7 59 1 35 2 39 11 24 6 -10 8 -34 5 -52z m1942
38 c-15 -16 -30 -45 -33 -65 -4 -21 -12 -38 -17 -38 -19 0 3 74 30 103 14 15
30 27 36 27 5 0 -2 -12 -16 -27z m-3855 -16 c-6 -12 -15 -33 -20 -47 -9 -23
-10 -23 -15 -3 -3 12 3 34 14 52 23 35 37 34 21 -2z m3282 -82 c-23 -18 -81
-35 -115 -34 -17 1 -11 5 21 13 25 7 54 18 65 24 30 18 53 15 29 -3z m-2585
-130 c-7 -8 -19 -15 -27 -15 -10 0 -7 8 9 31 18 24 24 27 26 14 2 -9 -2 -22
-8 -30z m-1775 -5 c-4 -12 -9 -19 -12 -17 -3 3 -2 15 2 27 4 12 9 19 12 17 3
-3 2 -15 -2 -27z m820 -29 c-9 -8 -25 21 -25 44 0 16 3 14 15 -9 9 -16 13 -32
10 -35z m2085 47 c0 -17 -31 -48 -47 -48 -11 0 -8 8 9 29 24 32 38 38 38 19z
m-1655 -47 c-11 -10 -35 11 -35 30 0 21 0 21 19 -2 11 -13 18 -26 16 -28z
m1221 24 c13 -14 21 -25 18 -25 -11 0 -54 33 -54 41 0 15 12 10 36 -16z
m-1428 -7 c-3 -7 -18 -14 -34 -15 -20 -1 -22 0 -6 4 12 2 22 9 22 14 0 5 5 9
11 9 6 0 9 -6 7 -12z m3574 -45 c8 -10 6 -13 -11 -13 -18 0 -21 6 -20 38 0 34
1 35 10 13 5 -14 15 -31 21 -38z m-4097 14 c19 -4 19 -4 2 -12 -18 -7 -46 16
-47 39 0 6 6 3 13 -6 6 -9 21 -18 32 -21z m1700 1 c19 -5 19 -5 2 -13 -18 -7
-46 17 -46 40 0 6 5 3 12 -6 7 -9 21 -19 32 -21z m-1970 12 c-3 -5 -21 -9 -38
-9 l-32 2 35 7 c19 4 36 8 38 9 2 0 0 -3 -3 -9z m350 0 c-27 -12 -35 -12 -35
0 0 6 12 10 28 9 24 0 25 -1 7 -9z m1350 0 c-3 -5 -18 -9 -33 -9 l-27 1 30 8
c17 4 31 8 33 9 2 0 0 -3 -3 -9z m355 0 c-19 -13 -30 -13 -30 0 0 6 10 10 23
10 18 0 19 -2 7 -10z m-2324 -35 c-6 -22 -11 -25 -44 -24 -31 2 -32 3 -9 6 18
3 32 14 39 29 14 30 23 24 14 -11z m2839 16 c-14 -14 -73 -26 -60 -13 6 5 19
12 30 15 34 8 40 8 30 -2z m212 -21 l48 -8 -47 -1 c-56 -1 -78 6 -78 26 0 12
3 13 14 3 8 -6 36 -15 63 -20z m116 -1 c-6 -6 -18 -6 -28 -3 -18 7 -18 8 1 14
23 9 39 1 27 -11z m633 -14 c31 5 35 4 21 -5 -9 -6 -34 -10 -55 -8 -31 3 -37
7 -40 28 l-3 25 19 -23 c16 -20 24 -23 58 -17z m939 15 c16 -7 11 -9 -20 -9
-29 -1 -36 2 -25 9 17 11 19 11 45 0z m-5445 -24 c6 -8 21 -16 33 -18 19 -3
20 -4 5 -10 -12 -5 -27 1 -45 17 -16 13 -23 25 -17 25 6 0 17 -6 24 -14z m150
-76 c0 -11 -4 -20 -10 -20 -14 0 -13 -103 1 -117 21 -21 2 -43 -36 -43 -19 0
-35 5 -35 11 0 8 -5 7 -15 -1 -21 -17 -44 2 -28 22 22 26 20 128 -2 128 -8 0
-15 9 -15 19 0 18 8 20 70 20 63 0 70 -2 70 -19z m1189 -63 c17 -32 31 -62 31
-66 0 -14 -43 -21 -57 -9 -7 6 -29 12 -48 14 -26 2 -35 -1 -40 -16 -4 -12 -12
-17 -21 -13 -8 3 -13 12 -10 19 3 8 1 14 -4 14 -18 0 -10 22 9 27 22 6 43 46
35 67 -3 9 5 20 23 30 34 18 38 14 82 -67z m2146 -8 l34 -67 -25 -6 c-14 -4
-31 -3 -37 2 -7 5 -29 12 -49 16 -31 6 -38 4 -38 -9 0 -8 -7 -15 -15 -15 -8 0
-15 7 -15 15 0 8 -4 15 -10 15 -19 0 -10 21 14 30 16 6 27 20 31 40 4 18 16
41 27 52 26 26 40 14 83 -73z m-3205 51 c8 -10 20 -26 27 -36 10 -17 12 -14
12 19 1 36 2 37 37 37 l37 0 -8 -72 c-3 -40 -11 -76 -17 -79 -20 -13 -43 3
-62 42 -27 56 -34 56 -41 4 -7 -42 -9 -44 -34 -39 -35 9 -34 6 -35 71 -1 41 4
62 14 70 18 15 50 7 70 -17z m280 11 c-5 -11 -15 -21 -21 -23 -13 -4 -14 -101
-3 -120 5 -8 1 -9 -10 -5 -10 4 -29 7 -42 7 -22 0 -24 3 -24 55 0 52 -1 55
-26 55 -19 0 -25 5 -22 18 2 13 17 18 68 23 36 3 71 6 78 7 9 2 10 -3 2 -17z
m178 -3 c3 -15 -4 -18 -32 -18 -25 0 -36 -4 -36 -15 0 -10 11 -15 35 -15 24 0
35 -5 35 -15 0 -11 -11 -15 -41 -15 -55 0 -47 -24 9 -28 29 -2 42 -8 42 -18 0
-16 -25 -17 -108 -7 l-53 6 2 56 c3 92 1 90 77 88 55 -2 67 -5 70 -19z m230
10 c18 -18 14 -56 -7 -77 -17 -17 -18 -21 -5 -40 14 -19 13 -21 -4 -21 -10 0
-28 11 -40 25 -24 27 -52 24 -52 -5 0 -24 -9 -29 -43 -23 -26 5 -27 7 -27 73
0 45 4 70 13 73 26 11 153 7 165 -5z m557 -2 c47 -20 47 -40 0 -32 -53 10 -77
-7 -73 -52 l3 -37 48 1 c26 0 47 -3 47 -6 0 -35 -108 -42 -140 -10 -29 29 -27
94 5 125 28 28 60 31 110 11z m213 -8 c3 -15 -4 -18 -38 -18 -50 0 -51 -22 -1
-30 44 -7 44 -24 -1 -28 -54 -5 -52 -32 2 -32 29 0 40 -4 40 -15 0 -17 -28
-19 -104 -9 l-46 7 0 72 0 72 72 -1 c61 -1 73 -4 76 -18z m312 6 c0 -9 -9 -18
-21 -21 -19 -5 -20 -12 -17 -69 3 -63 3 -63 -22 -58 -49 11 -50 12 -50 64 0
43 -3 50 -20 50 -13 0 -20 7 -20 20 0 17 8 20 68 23 37 2 70 4 75 5 4 1 7 -5
7 -14z m155 6 c65 -15 94 -73 62 -125 -14 -24 -25 -28 -92 -33 -44 -3 -54 0
-78 24 -34 34 -36 82 -4 111 37 34 53 37 112 23z m505 -3 c0 -8 -9 -40 -20
-72 -11 -31 -18 -60 -16 -64 3 -4 -9 -8 -25 -9 -25 -2 -31 3 -51 45 l-22 47
-21 -46 c-17 -38 -25 -47 -51 -50 -24 -3 -30 0 -32 17 -1 12 -8 40 -17 64 -21
59 -20 61 20 61 27 0 35 -4 35 -17 0 -10 4 -24 9 -32 7 -11 13 -6 25 23 14 35
18 37 53 34 32 -2 39 -7 41 -28 6 -43 19 -43 36 -1 15 40 36 55 36 28z m136
-4 c27 -45 64 -115 64 -122 0 -13 -42 -22 -54 -12 -6 5 -28 11 -49 15 -32 6
-38 4 -45 -13 -8 -24 -26 -16 -36 16 -5 16 -2 25 13 32 11 6 25 28 32 48 17
55 53 71 75 36z m840 -4 c22 -18 16 -32 -11 -25 -59 15 -94 -18 -74 -71 8 -21
15 -24 47 -22 40 3 66 -7 57 -21 -3 -5 -12 -7 -20 -3 -8 3 -15 1 -15 -4 0 -17
-111 4 -126 24 -26 34 -13 100 25 131 18 14 96 9 117 -9z m816 -54 l37 -70
-25 -8 c-16 -6 -30 -5 -40 3 -22 19 -81 22 -88 4 -7 -19 -26 -18 -26 1 0 8 -4
15 -10 15 -20 0 -9 21 15 30 24 9 30 24 27 63 -1 10 2 16 7 13 5 -3 12 1 15
10 4 9 15 14 28 12 17 -2 33 -22 60 -73z m183 61 c47 -20 47 -40 0 -32 -46 9
-75 -7 -75 -42 0 -45 13 -56 59 -49 30 4 41 2 41 -8 0 -32 -95 -35 -134 -4
-30 24 -34 64 -11 109 22 43 60 51 120 26z m398 4 c19 0 24 -26 6 -32 -13 -4
-16 -42 -5 -84 l7 -32 -55 -1 c-57 0 -68 7 -41 29 17 14 21 90 5 90 -5 0 -10
10 -10 21 0 19 4 21 38 15 20 -3 45 -6 55 -6z m117 0 c5 0 17 -13 27 -30 9
-16 21 -30 25 -30 4 0 8 14 8 30 0 28 3 30 36 30 l36 0 -5 -71 c-2 -42 -9 -74
-17 -79 -15 -9 -50 -1 -50 12 0 5 -11 25 -24 45 l-24 35 -9 -42 c-4 -23 -11
-41 -15 -41 -5 1 -19 1 -32 1 -23 0 -23 2 -20 67 3 66 15 88 42 78 8 -3 18 -5
22 -5z m317 -3 c21 -15 4 -27 -38 -27 -50 0 -49 -23 1 -30 50 -8 51 -30 1 -30
-30 0 -41 -4 -41 -15 0 -11 12 -15 45 -15 33 0 45 -4 45 -15 0 -17 -24 -19
-108 -8 l-54 6 6 66 c3 36 5 69 6 72 0 11 124 7 137 -4z m-4374 -7 c9 0 17 -4
17 -10 0 -5 -16 -10 -35 -10 -28 0 -35 -4 -35 -19 0 -15 8 -21 35 -23 20 -2
35 -7 35 -13 0 -5 -15 -11 -35 -13 -30 -3 -35 -7 -35 -28 0 -18 -5 -24 -23
-24 -13 0 -28 -5 -33 -10 -7 -7 -11 9 -13 51 -1 35 -6 70 -11 79 -7 13 -2 16
28 18 20 2 39 5 41 8 3 3 15 3 26 0 11 -3 28 -6 38 -6z m1856 -14 c23 -21 38
-20 51 4 6 11 17 20 25 20 16 0 20 -16 6 -24 -17 -11 -50 -94 -44 -114 4 -18
0 -20 -34 -19 l-38 2 3 40 c3 33 -1 45 -22 64 -36 34 -34 53 5 47 17 -2 39
-12 48 -20z m299 -18 c-3 -24 -1 -55 3 -70 6 -24 4 -29 -14 -32 -41 -9 -155
-14 -163 -7 -5 3 -10 36 -12 73 l-2 67 67 4 c38 2 81 4 97 5 27 2 28 1 24 -40z
m512 22 c0 -11 4 -20 9 -20 4 0 20 9 34 20 25 20 57 27 57 12 0 -5 -14 -18
-30 -31 l-30 -22 26 -44 c24 -41 24 -45 7 -45 -10 0 -27 14 -37 31 -21 35 -40
34 -44 -4 -3 -22 -8 -27 -32 -27 -39 0 -43 11 -35 86 l7 64 34 0 c27 0 34 -4
34 -20z m511 12 c0 -4 1 -36 2 -72 l2 -65 -32 -3 c-28 -3 -32 0 -39 30 l-7 33
-14 -33 c-16 -40 -34 -41 -51 -2 -16 35 -35 31 -26 -6 6 -22 3 -24 -30 -24
l-36 0 -1 55 c-1 30 -2 61 -3 68 -1 7 14 13 34 15 33 3 38 -1 59 -39 l24 -42
18 24 c10 13 19 29 19 35 0 5 4 14 10 20 11 11 70 16 71 6z m509 -28 c0 -31 3
-35 23 -32 17 2 23 11 25 36 3 29 6 32 36 32 l34 0 1 -75 1 -75 -29 0 c-23 0
-30 5 -35 26 -5 19 -12 25 -29 22 -17 -2 -22 -10 -22 -30 1 -24 -2 -27 -25
-22 -45 10 -50 13 -50 33 0 11 -6 21 -12 24 -10 4 -10 7 0 18 6 7 12 25 12 39
0 34 7 40 42 40 25 0 28 -3 28 -36z"/>
<path d="M800 860 c30 -24 44 -25 36 -4 -3 9 -6 18 -6 20 0 2 -12 4 -27 4
l-28 0 25 -20z"/>
<path d="M310 850 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M366 851 c-8 -12 21 -34 33 -27 6 4 8 13 4 21 -6 17 -29 20 -37 6z"/>
<path d="M920 586 c0 -9 7 -16 16 -16 9 0 14 5 12 12 -6 18 -28 21 -28 4z"/>
<path d="M965 419 c-4 -6 -5 -13 -2 -16 7 -7 27 6 27 18 0 12 -17 12 -25 -2z"/>
<path d="M362 388 c3 -7 15 -14 29 -16 24 -4 24 -3 4 12 -24 19 -38 20 -33 4z"/>
<path d="M4106 883 c-14 -14 -5 -31 14 -26 11 3 20 9 20 13 0 10 -26 20 -34
13z"/>
<path d="M4590 870 c-14 -10 -22 -22 -18 -25 7 -8 57 25 58 38 0 12 -14 8 -40
-13z"/>
<path d="M4380 655 c7 -8 17 -15 22 -15 6 0 5 7 -2 15 -7 8 -17 15 -22 15 -6
0 -5 -7 2 -15z"/>
<path d="M4082 560 c-6 -11 -12 -28 -12 -37 0 -13 6 -10 20 12 11 17 20 33 20
38 0 14 -15 7 -28 -13z"/>
<path d="M4496 466 c3 -9 11 -16 16 -16 13 0 5 23 -10 28 -7 2 -10 -2 -6 -12z"/>
<path d="M4236 445 c-9 -24 5 -41 16 -20 7 11 7 20 0 27 -6 6 -12 3 -16 -7z"/>
<path d="M4540 400 c0 -5 5 -10 11 -10 5 0 7 5 4 10 -3 6 -8 10 -11 10 -2 0
-4 -4 -4 -10z"/>
<path d="M5330 891 c0 -11 26 -22 34 -14 3 3 3 10 0 14 -7 12 -34 11 -34 0z"/>
<path d="M4805 880 c-8 -13 4 -32 16 -25 12 8 12 35 0 35 -6 0 -13 -4 -16 -10z"/>
<path d="M5070 821 l-35 -6 0 -75 0 -75 40 -3 c22 -2 58 3 80 10 38 12 40 16
47 63 12 88 -16 107 -132 86z m109 -36 c3 -19 2 -19 -15 -4 -11 9 -26 19 -34
22 -8 4 -2 5 15 4 21 -1 31 -8 34 -22z"/>
<path d="M5411 694 c0 -11 3 -14 6 -6 3 7 2 16 -1 19 -3 4 -6 -2 -5 -13z"/>
<path d="M5223 674 c-10 -22 -10 -25 3 -20 9 3 18 6 20 6 2 0 4 9 4 20 0 28
-13 25 -27 -6z"/>
<path d="M5001 422 c-14 -27 -12 -35 8 -23 7 5 11 17 9 27 -4 17 -5 17 -17 -4z"/>
<path d="M5673 883 c9 -9 19 -14 23 -11 10 10 -6 28 -24 28 -15 0 -15 -1 1
-17z"/>
<path d="M5866 717 c-14 -10 -16 -16 -7 -22 15 -9 35 8 30 24 -3 8 -10 7 -23
-2z"/>
<path d="M5700 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M5700 451 c0 -23 25 -46 34 -32 4 6 -2 19 -14 31 -19 19 -20 19 -20
1z"/>
<path d="M1375 850 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M1391 687 c-5 -12 -7 -35 -6 -50 2 -15 -1 -27 -7 -27 -5 0 -6 9 -3
21 5 15 4 19 -4 15 -6 -4 -11 -18 -11 -30 0 -19 7 -25 33 -29 17 -2 42 1 55 7
l22 12 -27 52 c-29 57 -39 63 -52 29z"/>
<path d="M1240 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M1575 490 c4 -14 9 -27 11 -29 7 -7 34 9 34 20 0 7 -3 9 -7 6 -3 -4
-15 1 -26 10 -19 17 -19 17 -12 -7z"/>
<path d="M3094 688 c-4 -13 -7 -35 -6 -50 1 -16 -2 -28 -8 -28 -5 0 -6 7 -3
17 4 11 3 14 -5 9 -16 -10 -15 -49 1 -43 6 2 20 0 29 -4 10 -6 27 -5 41 2 28
13 26 30 -8 86 -24 39 -31 41 -41 11z"/>
<path d="M3270 502 c0 -19 29 -47 39 -37 6 7 1 16 -15 28 -13 10 -24 14 -24 9z"/>
<path d="M3570 812 c-13 -10 -21 -24 -19 -31 3 -7 15 0 34 19 31 33 21 41 -15
12z"/>
<path d="M3855 480 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M3585 450 c3 -5 13 -10 21 -10 8 0 12 5 9 10 -3 6 -13 10 -21 10 -8
0 -12 -4 -9 -10z"/>
<path d="M1880 820 c0 -5 7 -10 16 -10 8 0 12 5 9 10 -3 6 -10 10 -16 10 -5 0
-9 -4 -9 -10z"/>
<path d="M2042 668 c-7 -7 -12 -23 -12 -37 1 -24 2 -24 16 8 16 37 14 47 -4
29z"/>
<path d="M2015 560 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M1915 470 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M2320 795 c0 -14 5 -25 10 -25 6 0 10 11 10 25 0 14 -4 25 -10 25 -5
0 -10 -11 -10 -25z"/>
<path d="M2660 771 c0 -6 5 -13 10 -16 6 -3 10 1 10 9 0 9 -4 16 -10 16 -5 0
-10 -4 -10 -9z"/>
<path d="M2487 763 c-4 -3 -7 -23 -7 -43 0 -36 1 -38 40 -43 68 -9 116 20 102
61 -3 10 -7 10 -18 1 -11 -9 -14 -7 -14 10 0 18 -6 21 -48 21 -27 0 -52 -3
-55 -7z"/>
<path d="M2320 719 c0 -5 5 -7 10 -4 6 3 10 8 10 11 0 2 -4 4 -10 4 -5 0 -10
-5 -10 -11z"/>
<path d="M2480 550 l0 -40 66 1 c58 1 67 4 76 25 18 39 -4 54 -78 54 l-64 0 0
-40z m40 15 c-7 -8 -16 -15 -21 -15 -5 0 -6 7 -3 15 4 8 13 15 21 15 13 0 13
-3 3 -15z"/>
<path d="M2665 527 c-4 -10 -5 -21 -1 -24 10 -10 18 4 13 24 -4 17 -4 17 -12
0z"/>
<path d="M1586 205 c-9 -23 -8 -25 9 -25 17 0 19 9 6 28 -7 11 -10 10 -15 -3z"/>
<path d="M3727 200 c-3 -13 0 -20 9 -20 15 0 19 26 5 34 -5 3 -11 -3 -14 -14z"/>
<path d="M1194 229 c-3 -6 -2 -15 3 -20 13 -13 43 -1 43 17 0 16 -36 19 -46 3z"/>
<path d="M2470 224 c-18 -46 -12 -73 15 -80 37 -9 52 1 59 40 5 26 3 41 -8 51
-23 24 -55 18 -66 -11z"/>
<path d="M3120 196 c0 -9 7 -16 16 -16 17 0 14 22 -4 28 -7 2 -12 -3 -12 -12z"/>
<path d="M4750 201 c0 -12 5 -21 10 -21 6 0 10 6 10 14 0 8 -4 18 -10 21 -5 3
-10 -3 -10 -14z"/>
<path d="M3515 229 c-8 -12 14 -31 30 -26 6 2 10 10 10 18 0 17 -31 24 -40 8z"/>
<path d="M3521 161 c-7 -5 -9 -11 -4 -14 14 -9 54 4 47 14 -7 11 -25 11 -43 0z"/>
</g>
</svg>

After

Width:  |  Height:  |  Size: 18 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 10 KiB

85
assets/waybackpy_logo.svg Normal file
View File

@ -0,0 +1,85 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
id="svg8"
version="1.1"
viewBox="0 0 176.61171 41.907883"
height="41.907883mm"
width="176.61171mm">
<defs
id="defs2" />
<metadata
id="metadata5">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<g
transform="translate(-0.74835286,-98.31182)"
id="layer1">
<flowRoot
transform="scale(0.26458333)"
style="font-style:normal;font-weight:normal;font-size:40px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none"
id="flowRoot4598"
xml:space="preserve"><flowRegion
id="flowRegion4600"><rect
y="415.4129"
x="-38.183765"
height="48.08326"
width="257.38687"
id="rect4602" /></flowRegion><flowPara
id="flowPara4604"></flowPara></flowRoot> <text
transform="scale(0.86288797,1.158899)"
id="text4777"
y="110.93711"
x="0.93061"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;line-height:4.25;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke:none;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
xml:space="preserve"><tspan
style="stroke-width:7.51955223"
id="tspan4775"
y="110.93711"
x="0.93061"><tspan
id="tspan4773"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:3.56786728px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
y="110.93711"
x="0.93061">waybackpy</tspan></tspan></text>
<rect
y="98.311821"
x="1.4967092"
height="4.8643045"
width="153.78688"
id="rect4644"
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4648"
width="153.78688"
height="4.490128"
x="23.573174"
y="135.72957" />
<rect
y="135.72957"
x="0.74835336"
height="4.4901319"
width="22.82482"
id="rect4650"
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4652"
width="21.702286"
height="4.8643003"
x="155.2836"
y="98.311821" />
</g>
</svg>

After

Width:  |  Height:  |  Size: 3.6 KiB

225
index.rst
View File

@ -1,225 +0,0 @@
waybackpy
=========
|Build Status| |Downloads| |Release| |Codacy Badge| |License: MIT|
|Maintainability| |CodeFactor| |made-with-python| |pypi| |PyPI - Python
Version| |Maintenance| |codecov| |image1| |contributions welcome|
.. |Build Status| image:: https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square
:target: https://travis-ci.org/akamhy/waybackpy
.. |Downloads| image:: https://img.shields.io/pypi/dm/waybackpy.svg
:target: https://pypistats.org/packages/waybackpy
.. |Release| image:: https://img.shields.io/github/v/release/akamhy/waybackpy.svg
:target: https://github.com/akamhy/waybackpy/releases
.. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65
:target: https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
:target: https://github.com/akamhy/waybackpy/blob/master/LICENSE
.. |Maintainability| image:: https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability
:target: https://codeclimate.com/github/akamhy/waybackpy/maintainability
.. |CodeFactor| image:: https://www.codefactor.io/repository/github/akamhy/waybackpy/badge
:target: https://www.codefactor.io/repository/github/akamhy/waybackpy
.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
:target: https://www.python.org/
.. |pypi| image:: https://img.shields.io/pypi/v/waybackpy.svg
.. |PyPI - Python Version| image:: https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square
.. |Maintenance| image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
:target: https://github.com/akamhy/waybackpy/graphs/commit-activity
.. |codecov| image:: https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg
:target: https://codecov.io/gh/akamhy/waybackpy
.. |image1| image:: https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square
.. |contributions welcome| image:: https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square
|Internet Archive| |Wayback Machine|
Waybackpy is a Python library that interfaces with the `Internet
Archive`_\ s `Wayback Machine`_ API. Archive pages and retrieve
archived pages easily.
.. _Internet Archive: https://en.wikipedia.org/wiki/Internet_Archive
.. _Wayback Machine: https://en.wikipedia.org/wiki/Wayback_Machine
.. |Internet Archive| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png
.. |Wayback Machine| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png
Table of contents
=================
.. raw:: html
<!--ts-->
- `Installation`_
- `Usage`_
- `Saving an url using save()`_
- `Receiving the oldest archive for an URL Using oldest()`_
- `Receiving the recent most/newest archive for an URL using
newest()`_
- `Receiving archive close to a specified year, month, day, hour,
and minute using near()`_
- `Get the content of webpage using get()`_
- `Count total archives for an URL using total_archives()`_
- `Tests`_
- `Dependency`_
- `License`_
.. raw:: html
<!--te-->
.. _Installation: #installation
.. _Usage: #usage
.. _Saving an url using save(): #capturing-aka-saving-an-url-using-save
.. _Receiving the oldest archive for an URL Using oldest(): #receiving-the-oldest-archive-for-an-url-using-oldest
.. _Receiving the recent most/newest archive for an URL using newest(): #receiving-the-newest-archive-for-an-url-using-newest
.. _Receiving archive close to a specified year, month, day, hour, and minute using near(): #receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near
.. _Get the content of webpage using get(): #get-the-content-of-webpage-using-get
.. _Count total archives for an URL using total_archives(): #count-total-archives-for-an-url-using-total_archives
.. _Tests: #tests
.. _Dependency: #dependency
.. _License: #license
Installation
------------
Using `pip`_:
.. code:: bash
pip install waybackpy
Usage
-----
Capturing aka Saving an url Using save()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
# Capturing a new archive on Wayback machine.
target_url = waybackpy.Url("https://github.com/akamhy/waybackpy", user_agnet="My-cool-user-agent")
archived_url = target_url.save()
print(archived_url)
This should print an URL similar to the following archived URL:
https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy
.. _pip: https://en.wikipedia.org/wiki/Pip_(package_manager)
Receiving the oldest archive for an URL Using oldest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
# retrieving the oldest archive on Wayback machine.
target_url = waybackpy.Url("https://www.google.com/", "My-cool-user-agent")
oldest_archive = target_url.oldest()
print(oldest_archive)
This should print the oldest available archive for https://google.com.
http://web.archive.org/web/19981111184551/http://google.com:80/
Receiving the newest archive for an URL using newest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
# retrieving the newest/latest archive on Wayback machine.
target_url = waybackpy.Url(url="https://www.google.com/", user_agnet="My-cool-user-agent")
newest_archive = target_url.newest()
print(newest_archive)
This print the newest available archive for
https://www.microsoft.com/en-us, something just like this:
http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/
Receiving archive close to a specified year, month, day, hour, and minute using near()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
# retriving the the closest archive from a specified year.
# supported argumnets are year,month,day,hour and minute
target_url = waybackpy.Url(https://www.facebook.com/", "Any-User-Agent")
archive_near_year = target_url.near(year=2010)
print(archive_near_year)
returns :
http://web.archive.org/web/20100504071154/http://www.facebook.com/
Please note that if you only specify the year, the current month and
day are default arguments for month and day respectively. Just
putting the year parameter would not return the archive closer to
January but the current month you are using the package. You need to
specify the month “1” for January , 2 for february and so on.
..
Do not pad (dont use zeros in the month, year, day, minute, and hour
arguments). e.g. For January, set month = 1 and not month = 01.
Get the content of webpage using get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# supported argumnets encoding and user_agent
target = waybackpy.Url("google.com", "any-user_agent")
oldest_url = target.oldest()
webpage = target.get(oldest_url) # We are getting the source of oldest archive of google.com.
print(webpage)
..
This should print the source code for oldest archive of google.com.
If no URL is passed in get() then it should retrive the source code
of google.com and not any archive.
Count total archives for an URL using total_archives()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
from waybackpy import Url
# retriving the content of a webpage from any url including but not limited to the archived urls.
count = Url("https://en.wikipedia.org/wiki/Python (programming language)", "User-Agent").total_archives()
print(count)
..
This should print an integer (int), which is the number of total
archives on archive.org
Tests
-----
- `Here`_
Dependency
----------
- None, just python standard libraries (re, json, urllib and datetime).
Both python 2 and 3 are supported :)
License
-------
`MIT License`_
.. _Here: https://github.com/akamhy/waybackpy/tree/master/tests
.. _MIT License: https://github.com/akamhy/waybackpy/blob/master/LICENSE

1
requirements.txt Normal file
View File

@ -0,0 +1 @@
requests>=2.24.0

View File

@ -1,3 +1,7 @@
[metadata]
description-file = README.md
license_file = LICENSE
[flake8]
max-line-length = 88
extend-ignore = E203,W503

View File

@ -1,49 +1,54 @@
import os.path
from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
with open(os.path.join(os.path.dirname(__file__), "README.md")) as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), 'waybackpy', '__version__.py')) as f:
with open(os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")) as f:
exec(f.read(), about)
setup(
name = about['__title__'],
packages = ['waybackpy'],
version = about['__version__'],
description = about['__description__'],
name=about["__title__"],
packages=["waybackpy"],
version=about["__version__"],
description=about["__description__"],
long_description=long_description,
long_description_content_type='text/markdown',
license= about['__license__'],
author = about['__author__'],
author_email = about['__author_email__'],
url = about['__url__'],
download_url = 'https://github.com/akamhy/waybackpy/archive/2.0.0.tar.gz',
keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
install_requires=[],
python_requires= ">=2.7",
long_description_content_type="text/markdown",
license=about["__license__"],
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/2.4.0.tar.gz",
keywords=[
"Archive It",
"Archive Website",
"Wayback Machine",
"waybackurls",
"Internet Archive",
],
install_requires=["requests"],
python_requires=">=3.4",
classifiers=[
'Development Status :: 5 - Production/Stable',
'Intended Audience :: Developers',
'Natural Language :: English',
'Topic :: Software Development :: Build Tools',
'License :: OSI Approved :: MIT License',
'Programming Language :: Python',
'Programming Language :: Python :: 2',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.2',
'Programming Language :: Python :: 3.3',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
],
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"Natural Language :: English",
"Topic :: Software Development :: Build Tools",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: Implementation :: CPython",
],
entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},
project_urls={
'Documentation': 'https://waybackpy.readthedocs.io',
'Source': 'https://github.com/akamhy/waybackpy',
"Documentation": "https://github.com/akamhy/waybackpy/wiki",
"Source": "https://github.com/akamhy/waybackpy",
"Tracker": "https://github.com/akamhy/waybackpy/issues",
},
)

0
tests/__init__.py Normal file
View File

View File

@ -1,127 +0,0 @@
# -*- coding: utf-8 -*-
import sys
sys.path.append("..")
import waybackpy
import pytest
import random
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_clean_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network_security"
target = waybackpy.Url(test_url, user_agent)
test_result = target.clean_url()
assert answer == test_result
def test_url_check():
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception) as e_info:
waybackpy.Url(broken_url, user_agent)
def test_save():
# Test for urls that exist and can be archived.
url_list = [
"en.wikipedia.org",
"www.wikidata.org",
"commons.wikimedia.org",
"www.wiktionary.org",
"www.w3schools.com",
"www.youtube.com"
]
x = random.randint(0, len(url_list)-1)
url1 = url_list[x]
target = waybackpy.Url(url1, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
archived_url1 = target.save()
assert url1 in archived_url1
if sys.version_info > (3, 6):
# Test for urls that are incorrect.
with pytest.raises(Exception) as e_info:
url2 = "ha ha ha ha"
waybackpy.Url(url2, user_agent)
# Test for urls not allowed to archive by robot.txt.
with pytest.raises(Exception) as e_info:
url3 = "http://www.archive.is/faq.html"
target = waybackpy.Url(url3, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0")
target.save()
# Non existent urls, test
with pytest.raises(Exception) as e_info:
url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
target = waybackpy.Url(url3, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27")
target.save()
else:
pass
def test_near():
url = "google.com"
target = waybackpy.Url(url, "Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4")
archive_near_year = target.near(year=2010)
assert "2010" in archive_near_year
if sys.version_info > (3, 6):
archive_near_month_year = target.near( year=2015, month=2)
assert ("201502" in archive_near_month_year) or ("201501" in archive_near_month_year) or ("201503" in archive_near_month_year)
archive_near_day_month_year = target.near(year=2006, month=11, day=15)
assert ("20061114" in archive_near_day_month_year) or ("20061115" in archive_near_day_month_year) or ("2006116" in archive_near_day_month_year)
target = waybackpy.Url("www.python.org", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246")
archive_near_hour_day_month_year = target.near(year=2008, month=5, day=9, hour=15)
assert ("2008050915" in archive_near_hour_day_month_year) or ("2008050914" in archive_near_hour_day_month_year) or ("2008050913" in archive_near_hour_day_month_year)
with pytest.raises(Exception) as e_info:
NeverArchivedUrl = "https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
target = waybackpy.Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
else:
pass
def test_oldest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "20200504141153" in target.oldest()
def test_newest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert url in target.newest()
def test_get():
target = waybackpy.Url("google.com", user_agent)
assert "Welcome to Google" in target.get(target.oldest())
def test_total_archives():
if sys.version_info > (3, 6):
target = waybackpy.Url(" https://google.com ", user_agent)
assert target.total_archives() > 500000
else:
pass
target = waybackpy.Url(" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent)
assert target.total_archives() == 0
if __name__ == "__main__":
test_clean_url()
print(".") #1
test_url_check()
print(".") #1
test_get()
print(".") #3
test_near()
print(".") #4
test_newest()
print(".") #5
test_save()
print(".") #6
test_oldest()
print(".") #7
test_total_archives()
print(".") #8
print("OK")

93
tests/test_cdx.py Normal file
View File

@ -0,0 +1,93 @@
import pytest
from waybackpy.cdx import Cdx
from waybackpy.exceptions import WaybackError
def test_all_cdx():
url = "akamhy.github.io"
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, \
like Gecko) Chrome/45.0.2454.85 Safari/537.36"
cdx = Cdx(
url=url,
user_agent=user_agent,
start_timestamp=2017,
end_timestamp=2020,
filters=[
"statuscode:200",
"mimetype:text/html",
"timestamp:20201002182319",
"original:https://akamhy.github.io/",
],
gzip=False,
collapses=["timestamp:10", "digest"],
limit=50,
match_type="prefix",
)
snapshots = cdx.snapshots()
for snapshot in snapshots:
ans = snapshot.archive_url
assert "https://web.archive.org/web/20201002182319/https://akamhy.github.io/" == ans
url = "akahfjgjkmhy.gihthub.ip"
cdx = Cdx(
url=url,
user_agent=user_agent,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=True,
collapses=[],
limit=10,
)
snapshots = cdx.snapshots()
print(snapshots)
i = 0
for _ in snapshots:
i += 1
assert i == 0
url = "https://github.com/akamhy/waybackpy/*"
cdx = Cdx(url=url, user_agent=user_agent, limit=50)
snapshots = cdx.snapshots()
for snapshot in snapshots:
print(snapshot.archive_url)
url = "https://github.com/akamhy/waybackpy"
with pytest.raises(WaybackError):
cdx = Cdx(url=url, user_agent=user_agent, limit=50, filters=["ghddhfhj"])
snapshots = cdx.snapshots()
with pytest.raises(WaybackError):
cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp", "ghdd:hfhj"])
snapshots = cdx.snapshots()
url = "https://github.com"
cdx = Cdx(url=url, user_agent=user_agent, limit=50)
snapshots = cdx.snapshots()
c = 0
for snapshot in snapshots:
c += 1
if c > 100:
break
url = "https://github.com/*"
cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp"])
snapshots = cdx.snapshots()
c = 0
for snapshot in snapshots:
c += 1
if c > 30_529: # deafult limit is 10k
break
url = "https://github.com/*"
cdx = Cdx(url=url, user_agent=user_agent)
c = 0
snapshots = cdx.snapshots()
for snapshot in snapshots:
c += 1
if c > 100_529:
break

418
tests/test_cli.py Normal file
View File

@ -0,0 +1,418 @@
import sys
import os
import pytest
import random
import string
import argparse
sys.path.append("..")
import waybackpy.cli as cli # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
def test_save():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=True,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
args = argparse.Namespace(
user_agent=None,
url="https://hfjfjfjfyu6r6rfjvj.fjhgjhfjgvjm",
total=False,
version=False,
oldest=False,
save=True,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "could happen because either your waybackpy" in str(reply)
def test_json():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=True,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "archived_snapshots" in str(reply)
def test_archive_url():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=True,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "https://web.archive.org/web/" in str(reply)
def test_oldest():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_newest():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_total_archives():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=True,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert isinstance(reply, int)
def test_known_urls():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://www.keybr.com",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=True,
get=None,
)
reply = cli.args_handler(args)
assert "keybr" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://akfyfufyjcujfufu6576r76r6amhy.gitd6r67r6u6hub.yfjyfjio",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=True,
subdomain=True,
known_urls=True,
get=None,
)
reply = cli.args_handler(args)
assert "No known URLs found" in str(reply)
def test_near():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
alive=False,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "202007" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
alive=False,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_get():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://github.com/akamhy",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="url",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://github.com/akamhy/waybackpy",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="oldest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://akamhy.github.io/waybackpy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="newest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="save",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="foobar",
)
reply = cli.args_handler(args)
assert "get the source code of the" in str(reply)
def test_args_handler():
args = argparse.Namespace(version=True)
reply = cli.args_handler(args)
assert ("waybackpy version %s" % (__version__)) == reply
args = argparse.Namespace(url=None, version=False)
reply = cli.args_handler(args)
assert ("waybackpy %s" % (__version__)) in str(reply)
def test_main():
# This also tests the parse_args method in cli.py
cli.main(["temp.py", "--version"])

32
tests/test_snapshot.py Normal file
View File

@ -0,0 +1,32 @@
import pytest
from waybackpy.snapshot import CdxSnapshot, datetime
def test_CdxSnapshot():
sample_input = "org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415"
prop_values = sample_input.split(" ")
properties = {}
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
snapshot = CdxSnapshot(properties)
assert properties["urlkey"] == snapshot.urlkey
assert properties["timestamp"] == snapshot.timestamp
assert properties["original"] == snapshot.original
assert properties["mimetype"] == snapshot.mimetype
assert properties["statuscode"] == snapshot.statuscode
assert properties["digest"] == snapshot.digest
assert properties["length"] == snapshot.length
assert datetime.strptime(properties["timestamp"], "%Y%m%d%H%M%S") == snapshot.datetime_timestamp
archive_url = "https://web.archive.org/web/" + properties["timestamp"] + "/" + properties["original"]
assert archive_url == snapshot.archive_url
assert archive_url == str(snapshot)

186
tests/test_utils.py Normal file
View File

@ -0,0 +1,186 @@
import pytest
import json
from waybackpy.utils import (
_cleaned_url,
_url_check,
_full_url,
URLError,
WaybackError,
_get_total_pages,
_archive_url_parser,
_wayback_timestamp,
_get_response,
_check_match_type,
_check_collapses,
_check_filters,
_ts,
)
def test_ts():
timestamp = True
data = {}
assert _ts(timestamp, data)
data = """
{"archived_snapshots": {"closest": {"timestamp": "20210109155628", "available": true, "status": "200", "url": "http://web.archive.org/web/20210109155628/https://www.google.com/"}}, "url": "https://www.google.com/"}
"""
data = json.loads(data)
assert data["archived_snapshots"]["closest"]["timestamp"] == "20210109155628"
def test_check_filters():
filters = []
_check_filters(filters)
filters = ["statuscode:200", "timestamp:20215678901234", "original:https://url.com"]
_check_filters(filters)
with pytest.raises(WaybackError):
_check_filters("not-list")
def test_check_collapses():
collapses = []
_check_collapses(collapses)
collapses = ["timestamp:10"]
_check_collapses(collapses)
collapses = ["urlkey"]
_check_collapses(collapses)
collapses = "urlkey" # NOT LIST
with pytest.raises(WaybackError):
_check_collapses(collapses)
collapses = ["also illegal collapse"]
with pytest.raises(WaybackError):
_check_collapses(collapses)
def test_check_match_type():
assert None == _check_match_type(None, "url")
match_type = "exact"
url = "test_url"
assert None == _check_match_type(match_type, url)
url = "has * in it"
with pytest.raises(WaybackError):
_check_match_type("domain", url)
with pytest.raises(WaybackError):
_check_match_type("not a valid type", "url")
def test_cleaned_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network%20security"
assert answer == _cleaned_url(test_url)
def test_url_check():
good_url = "https://akamhy.github.io"
assert None == _url_check(good_url)
bad_url = "https://github-com"
with pytest.raises(URLError):
_url_check(bad_url)
def test_full_url():
params = {}
endpoint = "https://web.archive.org/cdx/search/cdx"
assert endpoint == _full_url(endpoint, params)
params = {"a": "1"}
assert "https://web.archive.org/cdx/search/cdx?a=1" == _full_url(endpoint, params)
assert "https://web.archive.org/cdx/search/cdx?a=1" == _full_url(
endpoint + "?", params
)
params["b"] = 2
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2" == _full_url(
endpoint + "?", params
)
params["c"] = "foo bar"
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2&c=foo%20bar" == _full_url(
endpoint + "?", params
)
def test_get_total_pages():
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
url = "github.com*"
assert 212890 <= _get_total_pages(url, user_agent)
url = "https://zenodo.org/record/4416138"
assert 2 >= _get_total_pages(url, user_agent)
def test_archive_url_parser():
perfect_header = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:40:25 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Server': 'nginx', 'X-Archive-Orig-Date': 'Sat, 02 Jan 2021 09:40:09 GMT', 'X-Archive-Orig-Transfer-Encoding': 'chunked', 'X-Archive-Orig-Connection': 'keep-alive', 'X-Archive-Orig-Vary': 'Accept-Encoding', 'X-Archive-Orig-Last-Modified': 'Fri, 01 Jan 2021 12:19:00 GMT', 'X-Archive-Orig-Strict-Transport-Security': 'max-age=31536000, max-age=0;', 'X-Archive-Guessed-Content-Type': 'text/html', 'X-Archive-Guessed-Charset': 'utf-8', 'Memento-Datetime': 'Sat, 02 Jan 2021 09:40:09 GMT', 'Link': '<https://www.scribbr.com/citing-sources/et-al/>; rel="original", <https://web.archive.org/web/timemap/link/https://www.scribbr.com/citing-sources/et-al/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.scribbr.com/citing-sources/et-al/>; rel="timegate", <https://web.archive.org/web/20200601082911/https://www.scribbr.com/citing-sources/et-al/>; rel="first memento"; datetime="Mon, 01 Jun 2020 08:29:11 GMT", <https://web.archive.org/web/20201126185327/https://www.scribbr.com/citing-sources/et-al/>; rel="prev memento"; datetime="Thu, 26 Nov 2020 18:53:27 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="last memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'X-Archive-Src': 'spn2-20210102092956-wwwb-spn20.us.archive.org-8001.warc.gz', 'Server-Timing': 'captures_list;dur=112.646325, exclusion.robots;dur=0.172010, exclusion.robots.policy;dur=0.158205, RedisCDXSource;dur=2.205932, esindex;dur=0.014647, LoadShardBlock;dur=82.205012, PetaboxLoader3.datanode;dur=70.750239, CDXLines.iter;dur=24.306278, load_resource;dur=26.520179', 'X-App-Server': 'wwwb-app200', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/IN', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0', 'Content-Encoding': 'gzip'}
"""
archive = _archive_url_parser(
perfect_header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "web.archive.org/web/20210102094009" in archive
header = """
vhgvkjv
Content-Location: /web/20201126185327/https://www.scribbr.com/citing-sources/et-al
ghvjkbjmmcmhj
"""
archive = _archive_url_parser(
header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "20201126185327" in archive
header = """
hfjkfjfcjhmghmvjm
X-Cache-Key: https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al/US
yfu,u,gikgkikik
"""
archive = _archive_url_parser(
header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "20171128185327" in archive
# The below header should result in Exception
no_archive_header = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:42:45 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache', 'X-App-Server': 'wwwb-app52', 'X-ts': '523', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0'}
"""
with pytest.raises(WaybackError):
_archive_url_parser(
no_archive_header, "https://www.scribbr.com/citing-sources/et-al/"
)
def test_wayback_timestamp():
ts = _wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4)
assert "202001020304" in str(ts)
def test_get_response():
endpoint = "https://www.google.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = _get_response(endpoint, params=None, headers=headers)
assert response.status_code == 200
endpoint = "http/wwhfhfvhvjhmom"
with pytest.raises(WaybackError):
_get_response(endpoint, params=None, headers=headers)
endpoint = "https://akamhy.github.io"
url, response = _get_response(
endpoint, params=None, headers=headers, return_full_url=True
)
assert endpoint == url

145
tests/test_wrapper.py Normal file
View File

@ -0,0 +1,145 @@
import sys
import pytest
import random
import requests
from datetime import datetime
from waybackpy.wrapper import Url, Cdx
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_url_check():
"""No API Use"""
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
Url(broken_url, user_agent)
def test_save():
# Test for urls that exist and can be archived.
url_list = [
"en.wikipedia.org",
"akamhy.github.io",
"www.wiktionary.org",
"www.w3schools.com",
"youtube.com",
]
x = random.randint(0, len(url_list) - 1)
url1 = url_list[x]
target = Url(
url1,
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36",
)
archived_url1 = str(target.save())
assert url1 in archived_url1
# Test for urls that are incorrect.
with pytest.raises(Exception):
url2 = "ha ha ha ha"
Url(url2, user_agent)
url3 = "http://www.archive.is/faq.html"
with pytest.raises(Exception):
target = Url(
url3,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
"AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
"Safari/533.20.27",
)
target.save()
def test_near():
url = "google.com"
target = Url(
url,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 "
"(KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
)
archive_near_year = target.near(year=2010)
assert "2010" in str(archive_near_year.timestamp)
archive_near_month_year = str(target.near(year=2015, month=2).timestamp)
assert (
("2015-02" in archive_near_month_year)
or ("2015-01" in archive_near_month_year)
or ("2015-03" in archive_near_month_year)
)
target = Url(
"www.python.org",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
)
archive_near_hour_day_month_year = str(
target.near(year=2008, month=5, day=9, hour=15)
)
assert (
("2008050915" in archive_near_hour_day_month_year)
or ("2008050914" in archive_near_hour_day_month_year)
or ("2008050913" in archive_near_hour_day_month_year)
)
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
def test_oldest():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
o = target.oldest()
assert "20200504141153" in str(o)
assert "2020-05-04" in str(o._timestamp)
def test_json():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
assert "archived_snapshots" in str(target.JSON)
def test_archive_url():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
assert "github.com/akamhy" in str(target.archive_url)
def test_newest():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
assert url in str(target.newest())
def test_get():
target = Url("google.com", user_agent)
assert "Welcome to Google" in target.get(target.oldest())
def test_total_archives():
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
target = Url(" https://outlook.com ", user_agent)
assert target.total_archives() > 80000
target = Url(
" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
)
assert target.total_archives() == 0
def test_known_urls():
target = Url("akamhy.github.io", user_agent)
assert len(target.known_urls(alive=True, subdomain=False)) > 2
target = Url("akamhy.github.io", user_agent)
assert len(target.known_urls()) > 3

View File

@ -1,5 +1,3 @@
# -*- coding: utf-8 -*-
# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
@ -10,23 +8,50 @@
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
"""
Waybackpy is a Python library that interfaces with the Internet Archive's Wayback Machine API.
Waybackpy is a Python package & command-line program that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Archive pages and retrieve archived pages easily.
Archive webpage and retrieve archived URLs easily.
Usage:
>>> import waybackpy
>>> target_url = waybackpy.Url('https://www.python.org', 'Your-apps-cool-user-agent')
>>> new_archive = target_url.save()
>>> print(new_archive)
https://web.archive.org/web/20200502170312/https://www.python.org/
>>> import waybackpy
Full documentation @ <https://akamhy.github.io/waybackpy/>.
:copyright: (c) 2020 by akamhy.
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
>>> str(archive)
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)
>>> oldest_archive = wayback.oldest()
>>> str(oldest_archive)
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> str(archive_close_to_2010_feb)
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> str(wayback.newest())
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
Full documentation @ <https://github.com/akamhy/waybackpy/wiki>.
:copyright: (c) 2020-2021 AKash Mahanty Et al.
:license: MIT
"""
from .wrapper import Url
from .__version__ import __title__, __description__, __url__, __version__
from .__version__ import __author__, __author_email__, __license__, __copyright__
from .wrapper import Url, Cdx
from .__version__ import (
__title__,
__description__,
__url__,
__version__,
__author__,
__author_email__,
__license__,
__copyright__,
)

View File

@ -1,10 +1,11 @@
# -*- coding: utf-8 -*-
__title__ = "waybackpy"
__description__ = "A Python library that interfaces with the Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily."
__description__ = (
"A Python package that interfaces with the Internet Archive's Wayback Machine API. "
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "2.0.0"
__version__ = "2.4.0"
__author__ = "akamhy"
__author_email__ = "akash3pro@gmail.com"
__author_email__ = "akamhy@yahoo.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020 akamhy"
__copyright__ = "Copyright 2020-2021 Akash Mahanty et al."

205
waybackpy/cdx.py Normal file
View File

@ -0,0 +1,205 @@
from .snapshot import CdxSnapshot
from .exceptions import WaybackError
from .utils import (
_get_total_pages,
_get_response,
default_user_agent,
_check_filters,
_check_collapses,
_check_match_type,
_add_payload,
)
# TODO : Threading support for pagination API. It's designed for Threading.
class Cdx:
def __init__(
self,
url,
user_agent=default_user_agent,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=True,
collapses=[],
limit=10000,
):
self.url = str(url).strip()
self.user_agent = str(user_agent)
self.start_timestamp = str(start_timestamp) if start_timestamp else None
self.end_timestamp = str(end_timestamp) if end_timestamp else None
self.filters = filters
_check_filters(self.filters)
self.match_type = str(match_type).strip() if match_type else None
_check_match_type(self.match_type, self.url)
self.gzip = gzip
self.collapses = collapses
_check_collapses(self.collapses)
self.limit = limit
self.last_api_request_url = None
self.use_page = False
def cdx_api_manager(self, payload, headers, use_page=False):
"""
We have two options to get the snapshots, we use this
method to make a selection between pagination API and
the normal one with Resumption Key, sequential querying
of CDX data. For very large querying (for example domain query),
it may be useful to perform queries in parallel and also estimate
the total size of the query.
read more about the pagination API at:
https://web.archive.org/web/20201228063237/https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#pagination-api
if use_page is false if will use the normal sequential query API,
else use the pagination API.
two mutually exclusive cases possible:
1) pagination API is selected
a) get the total number of pages to read, using _get_total_pages()
b) then we use a for loop to get all the pages and yield the response text
2) normal sequential query API is selected.
a) get use showResumeKey=true to ask the API to add a query resumption key
at the bottom of response
b) check if the page has more than 3 lines, if not return the text
c) if it has atleast three lines, we check the second last line for zero length.
d) if the second last line has length zero than we assume that the last line contains
the resumption key, we set the resumeKey and remove the resumeKey from text
e) if the second line has non zero length we return the text as there will no resumption key
f) if we find the resumption key we set the "more" variable status to True which is always set
to False on each iteration. If more is not True the iteration stops and function returns.
"""
endpoint = "https://web.archive.org/cdx/search/cdx"
if use_page == True:
total_pages = _get_total_pages(self.url, self.user_agent)
for i in range(total_pages):
payload["page"] = str(i)
url, res = _get_response(
endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
yield res.text
else:
payload["showResumeKey"] = "true"
payload["limit"] = str(self.limit)
resumeKey = None
more = True
while more:
if resumeKey:
payload["resumeKey"] = resumeKey
url, res = _get_response(
endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text.strip()
lines = text.splitlines()
more = False
if len(lines) >= 3:
second_last_line = lines[-2]
if len(second_last_line) == 0:
resumeKey = lines[-1].strip()
text = text.replace(resumeKey, "", 1).strip()
more = True
yield text
def snapshots(self):
"""
This function yeilds snapshots encapsulated
in CdxSnapshot for more usability.
All the get request values are set if the conditions match
And we use logic that if someone's only inputs don't have any
of [start_timestamp, end_timestamp] and don't use any collapses
then we use the pagination API as it returns archives starting
from the first archive and the recent most archive will be on
the last page.
"""
payload = {}
headers = {"User-Agent": self.user_agent}
_add_payload(self, payload)
if not self.start_timestamp or self.end_timestamp:
self.use_page = True
if self.collapses != []:
self.use_page = False
texts = self.cdx_api_manager(payload, headers, use_page=self.use_page)
for text in texts:
if text.isspace() or len(text) <= 1 or not text:
continue
snapshot_list = text.split("\n")
for snapshot in snapshot_list:
if len(snapshot) < 46: # 14 + 32 (timestamp+digest)
continue
properties = {
"urlkey": None,
"timestamp": None,
"original": None,
"mimetype": None,
"statuscode": None,
"digest": None,
"length": None,
}
prop_values = snapshot.split(" ")
# Making sure that we get the same number of
# property values as the number of properties
prop_values_len = len(prop_values)
properties_len = len(properties)
if prop_values_len != properties_len:
raise WaybackError(
"Snapshot returned by Cdx API has %s properties instead of expected %s properties.\nInvolved Snapshot : %s"
% (prop_values_len, properties_len, snapshot)
)
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
yield CdxSnapshot(properties)

312
waybackpy/cli.py Normal file
View File

@ -0,0 +1,312 @@
import os
import re
import sys
import random
import string
import argparse
from .wrapper import Url
from .exceptions import WaybackError
from .__version__ import __version__
def _save(obj):
try:
return obj.save()
except Exception as err:
e = str(err)
m = re.search(r"Header:\n(.*)", e)
if m:
header = m.group(1)
if "No archive URL found in the API response" in e:
return (
"\n[waybackpy] Can not save/archive your link.\n[waybackpy] This "
"could happen because either your waybackpy (%s) is likely out of "
"date or Wayback Machine is malfunctioning.\n[waybackpy] Visit "
"https://github.com/akamhy/waybackpy for the latest version of "
"waybackpy.\n[waybackpy] API response Header :\n%s"
% (__version__, header)
)
return WaybackError(err)
def _archive_url(obj):
return obj.archive_url
def _json(obj):
return obj.JSON
def no_archive_handler(e, obj):
m = re.search(r"archive\sfor\s\'(.*?)\'\stry", str(e))
if m:
url = m.group(1)
ua = obj.user_agent
if "github.com/akamhy/waybackpy" in ua:
ua = "YOUR_USER_AGENT_HERE"
return (
"\n[Waybackpy] Can not find archive for '%s'.\n[Waybackpy] You can"
" save the URL using the following command:\n[Waybackpy] waybackpy --"
'user_agent "%s" --url "%s" --save' % (url, ua, url)
)
return WaybackError(e)
def _oldest(obj):
try:
return obj.oldest()
except Exception as e:
return no_archive_handler(e, obj)
def _newest(obj):
try:
return obj.newest()
except Exception as e:
return no_archive_handler(e, obj)
def _total_archives(obj):
return obj.total_archives()
def _near(obj, args):
_near_args = {}
args_arr = [args.year, args.month, args.day, args.hour, args.minute]
keys = ["year", "month", "day", "hour", "minute"]
for key, arg in zip(keys, args_arr):
if arg:
_near_args[key] = arg
try:
return obj.near(**_near_args)
except Exception as e:
return no_archive_handler(e, obj)
def _save_urls_on_file(input_list, live_url_count):
m = re.search("https?://([A-Za-z_0-9.-]+).*", input_list[0])
domain = "domain-unknown"
if m:
domain = m.group(1)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
file_name = "%s-%d-urls-%s.txt" % (domain, live_url_count, uid)
file_content = "\n".join(input_list)
file_path = os.path.join(os.getcwd(), file_name)
with open(file_path, "w+") as f:
f.write(file_content)
return "%s\n\n'%s' saved in current working directory" % (file_content, file_name)
def _known_urls(obj, args):
"""
Known urls for a domain.
"""
subdomain = False
if args.subdomain:
subdomain = True
alive = False
if args.alive:
alive = True
url_list = obj.known_urls(alive=alive, subdomain=subdomain)
total_urls = len(url_list)
if total_urls > 0:
return _save_urls_on_file(url_list, total_urls)
return "No known URLs found. Please try a diffrent domain!"
def _get(obj, args):
if args.get.lower() == "url":
return obj.get()
if args.get.lower() == "archive_url":
return obj.get(obj.archive_url)
if args.get.lower() == "oldest":
return obj.get(obj.oldest())
if args.get.lower() == "latest" or args.get.lower() == "newest":
return obj.get(obj.newest())
if args.get.lower() == "save":
return obj.get(obj.save())
return "Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
\n3) oldest - get the source code of the oldest archive for the supplied url.\
\n4) newest - get the source code of the newest archive for the supplied url.\
\n5) save - Create a new archive and get the source code of this new archive for the supplied url."
def args_handler(args):
if args.version:
return "waybackpy version %s" % __version__
if not args.url:
return (
"waybackpy %s \nSee 'waybackpy --help' for help using this tool."
% __version__
)
obj = Url(args.url)
if args.user_agent:
obj = Url(args.url, args.user_agent)
if args.save:
output = _save(obj)
elif args.archive_url:
output = _archive_url(obj)
elif args.json:
output = _json(obj)
elif args.oldest:
output = _oldest(obj)
elif args.newest:
output = _newest(obj)
elif args.known_urls:
output = _known_urls(obj, args)
elif args.total:
output = _total_archives(obj)
elif args.near:
return _near(obj, args)
elif args.get:
output = _get(obj, args)
else:
output = (
"You only specified the URL. But you also need to specify the operation."
"\nSee 'waybackpy --help' for help using this tool."
)
return output
def add_requiredArgs(requiredArgs):
requiredArgs.add_argument(
"--url", "-u", help="URL on which Wayback machine operations would occur"
)
def add_userAgentArg(userAgentArg):
help_text = 'User agent, default user_agent is "waybackpy python package - https://github.com/akamhy/waybackpy"'
userAgentArg.add_argument("--user_agent", "-ua", help=help_text)
def add_saveArg(saveArg):
saveArg.add_argument(
"--save", "-s", action="store_true", help="Save the URL on the Wayback machine"
)
def add_auArg(auArg):
auArg.add_argument(
"--archive_url",
"-au",
action="store_true",
help="Get the latest archive URL, alias for --newest",
)
def add_jsonArg(jsonArg):
jsonArg.add_argument(
"--json",
"-j",
action="store_true",
help="JSON data of the availability API request",
)
def add_oldestArg(oldestArg):
oldestArg.add_argument(
"--oldest",
"-o",
action="store_true",
help="Oldest archive for the specified URL",
)
def add_newestArg(newestArg):
newestArg.add_argument(
"--newest",
"-n",
action="store_true",
help="Newest archive for the specified URL",
)
def add_totalArg(totalArg):
totalArg.add_argument(
"--total",
"-t",
action="store_true",
help="Total number of archives for the specified URL",
)
def add_getArg(getArg):
getArg.add_argument(
"--get",
"-g",
help="Prints the source code of the supplied url. Use '--get help' for extended usage",
)
def add_knownUrlArg(knownUrlArg):
knownUrlArg.add_argument(
"--known_urls", "-ku", action="store_true", help="URLs known for the domain."
)
help_text = "Use with '--known_urls' to include known URLs for subdomains."
knownUrlArg.add_argument("--subdomain", "-sub", action="store_true", help=help_text)
help_text = "Only include live URLs. Will not inlclude dead links."
knownUrlArg.add_argument("--alive", "-a", action="store_true", help=help_text)
def add_nearArg(nearArg):
nearArg.add_argument(
"--near", "-N", action="store_true", help="Archive near specified time"
)
def add_nearArgs(nearArgs):
nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
nearArgs.add_argument("--hour", "-H", type=int, help="Hour in intege")
nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")
def parse_args(argv):
parser = argparse.ArgumentParser()
add_requiredArgs(parser.add_argument_group("URL argument (required)"))
add_userAgentArg(parser.add_argument_group("User Agent"))
add_saveArg(parser.add_argument_group("Create new archive/save URL"))
add_auArg(parser.add_argument_group("Get the latest Archive"))
add_jsonArg(parser.add_argument_group("Get the JSON data"))
add_oldestArg(parser.add_argument_group("Oldest archive"))
add_newestArg(parser.add_argument_group("Newest archive"))
add_totalArg(parser.add_argument_group("Total number of archives"))
add_getArg(parser.add_argument_group("Get source code"))
add_knownUrlArg(
parser.add_argument_group(
"URLs known and archived to Waybcak Machine for the site."
)
)
add_nearArg(parser.add_argument_group("Archive close to time specified"))
add_nearArgs(parser.add_argument_group("Arguments that are used only with --near"))
parser.add_argument(
"--version", "-v", action="store_true", help="Waybackpy version"
)
return parser.parse_args(argv[1:])
def main(argv=None):
argv = sys.argv if argv is None else argv
print(args_handler(parse_args(argv)))
if __name__ == "__main__":
sys.exit(main(sys.argv))

View File

@ -1,6 +1,19 @@
# -*- coding: utf-8 -*-
"""
waybackpy.exceptions
~~~~~~~~~~~~~~~~~~~
This module contains the set of Waybackpy's exceptions.
"""
class WaybackError(Exception):
"""
Raised when API Service error.
Raised when Waybackpy can not return what you asked for.
1) Wayback Machine API Service is unreachable/down.
2) You passed illegal arguments.
"""
class URLError(Exception):
"""
Raised when malformed URLs are passed as arguments.
"""

28
waybackpy/snapshot.py Normal file
View File

@ -0,0 +1,28 @@
from datetime import datetime
class CdxSnapshot:
"""
This class helps to use the Cdx Snapshots easily.
Raw Snapshot data looks like:
org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
properties is a dict containg all of the 7 cdx snapshot properties.
"""
def __init__(self, properties):
self.urlkey = properties["urlkey"]
self.timestamp = properties["timestamp"]
self.datetime_timestamp = datetime.strptime(self.timestamp, "%Y%m%d%H%M%S")
self.original = properties["original"]
self.mimetype = properties["mimetype"]
self.statuscode = properties["statuscode"]
self.digest = properties["digest"]
self.length = properties["length"]
self.archive_url = (
"https://web.archive.org/web/" + self.timestamp + "/" + self.original
)
def __str__(self):
return self.archive_url

286
waybackpy/utils.py Normal file
View File

@ -0,0 +1,286 @@
import re
import requests
from .exceptions import WaybackError, URLError
from datetime import datetime
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .__version__ import __version__
quote = requests.utils.quote
default_user_agent = "waybackpy python package - https://github.com/akamhy/waybackpy"
def _add_payload(self, payload):
if self.start_timestamp:
payload["from"] = self.start_timestamp
if self.end_timestamp:
payload["to"] = self.end_timestamp
if self.gzip != True:
payload["gzip"] = "false"
if self.match_type:
payload["matchType"] = self.match_type
if self.filters and len(self.filters) > 0:
for i, f in enumerate(self.filters):
payload["filter" + str(i)] = f
if self.collapses and len(self.collapses) > 0:
for i, f in enumerate(self.collapses):
payload["collapse" + str(i)] = f
payload["url"] = self.url
def _ts(timestamp, data):
"""
Get timestamp of last fetched archive.
If used before fetching any archive, will
use whatever self.JSON returns.
self.timestamp is None implies that
self.JSON will return any archive's JSON
that wayback machine provides it.
"""
if timestamp:
return timestamp
if not data["archived_snapshots"]:
return datetime.max
return datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
def _check_match_type(match_type, url):
if not match_type:
return
if "*" in url:
raise WaybackError("Can not use wildcard with match_type argument")
legal_match_type = ["exact", "prefix", "host", "domain"]
if match_type not in legal_match_type:
raise WaybackError(
"%s is not an allowed match type.\nUse one from 'exact', 'prefix', 'host' or 'domain'"
% match_type
)
def _check_collapses(collapses):
if not isinstance(collapses, list):
raise WaybackError("collapses must be a list.")
if len(collapses) == 0:
return
for c in collapses:
try:
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?",
c,
)
field = match.group(1)
N = None
if 2 == len(match.groups()):
N = match.group(2)
if N:
if not (field + N == c):
raise Exception
else:
if not (field == c):
raise Exception
except Exception:
e = "collapse argument '%s' is not following the cdx collapse syntax." % c
raise WaybackError(e)
def _check_filters(filters):
if not isinstance(filters, list):
raise WaybackError("filters must be a list.")
# [!]field:regex
for f in filters:
try:
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):(.*)",
f,
)
key = match.group(1)
val = match.group(2)
except Exception:
e = "Filter '%s' not following the cdx filter syntax." % f
raise WaybackError(e)
def _cleaned_url(url):
return str(url).strip().replace(" ", "%20")
def _url_check(url):
"""
Check for common URL problems.
What we are checking:
1) '.' in self.url, no url that ain't '.' in it.
If you known any others, please create a PR on the github repo.
"""
if "." not in url:
raise URLError("'%s' is not a vaild URL." % url)
def _full_url(endpoint, params):
full_url = endpoint
if params:
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = full_url + amp + "%s=%s" % (key, quote(str(val)))
return full_url
def _get_total_pages(url, user_agent):
"""
If showNumPages is passed in cdx API, it returns
'number of archive pages'and each page has many archives.
This func returns number of pages of archives (type int).
"""
total_pages_url = (
"https://web.archive.org/cdx/search/cdx?url=%s&showNumPages=true" % url
)
headers = {"User-Agent": user_agent}
return int((_get_response(total_pages_url, headers=headers).text).strip())
def _archive_url_parser(header, url):
"""
The wayback machine's save API doesn't
return JSON response, we are required
to read the header of the API response
and look for the archive URL.
This method has some regexen (or regexes)
that search for archive url in header.
This method is used when you try to
save a webpage on wayback machine.
Two cases are possible:
1) Either we find the archive url in
the header.
2) Or we didn't find the archive url in
API header.
If we found the archive URL we return it.
And if we couldn't find it, we raise
WaybackError with an error message.
"""
# Regex1
m = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
if m:
return "web.archive.org" + m.group(1)
# Regex2
m = re.search(
r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
)
if m:
return m.group(1)
# Regex3
m = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
if m:
return m.group(1)
raise WaybackError(
"No archive URL found in the API response. "
"If '%s' can be accessed via your web browser then either "
"this version of waybackpy (%s) is out of date or WayBack Machine is malfunctioning. Visit "
"'https://github.com/akamhy/waybackpy' for the latest version "
"of waybackpy.\nHeader:\n%s" % (url, __version__, str(header))
)
def _wayback_timestamp(**kwargs):
"""
Wayback Machine archive URLs
have a timestamp in them.
The standard archive URL format is
https://web.archive.org/web/20191214041711/https://www.youtube.com
If we break it down in three parts:
1 ) The start (https://web.archive.org/web/)
2 ) timestamp (20191214041711)
3 ) https://www.youtube.com, the original URL
The near method takes year, month, day, hour and minute
as Arguments, their type is int.
This method takes those integers and converts it to
wayback machine timestamp and returns it.
Return format is string.
"""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
def _get_response(
endpoint, params=None, headers=None, retries=5, return_full_url=False
):
"""
This function is used make get request.
We use the requests package to make the
requests.
We try five times and if it fails it raises
WaybackError exception.
You can handles WaybackError by importing:
from waybackpy.exceptions import WaybackError
try:
...
except WaybackError as e:
# handle it
"""
# From https://stackoverflow.com/a/35504626
# By https://stackoverflow.com/users/401467/datashaman
s = requests.Session()
retries = Retry(
total=retries, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504]
)
s.mount("https://", HTTPAdapter(max_retries=retries))
url = _full_url(endpoint, params)
try:
if not return_full_url:
return s.get(url, headers=headers)
return (url, s.get(url, headers=headers))
except Exception as e:
exc = WaybackError("Error while retrieving %s" % url)
exc.__cause__ = e
raise exc

View File

@ -1,153 +1,317 @@
# -*- coding: utf-8 -*-
import re
import sys
import json
from datetime import datetime
from waybackpy.exceptions import WaybackError
if sys.version_info >= (3, 0): # If the python ver >= 3
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
else: # For python2.x
from urllib2 import Request, urlopen, HTTPError, URLError
default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"
class Url():
"""waybackpy Url object"""
import requests
import concurrent.futures
from datetime import datetime, timedelta
from .exceptions import WaybackError
from .cdx import Cdx
from .utils import (
_archive_url_parser,
_wayback_timestamp,
_get_response,
default_user_agent,
_url_check,
_cleaned_url,
_ts,
)
def __init__(self, url, user_agent=default_UA):
class Url:
def __init__(self, url, user_agent=default_user_agent):
self.url = url
self.user_agent = user_agent
self.url_check() # checks url validity on init.
self.user_agent = str(user_agent)
_url_check(self.url)
self._archive_url = None
self.timestamp = None
self._JSON = None
self._alive_url_list = []
def __repr__(self):
"""Representation of the object."""
return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)
def __str__(self):
"""String representation of the object."""
return "%s" % self.clean_url()
"""
Output when print() is used on <class 'waybackpy.wrapper.Url'>
This should print an archive URL.
We check if self._archive_url is not None.
If not None, good. We return string of self._archive_url.
If self._archive_url is None, it means we ain't used any method that
sets self._archive_url, we now set self._archive_url to self.archive_url
and return it.
"""
if not self._archive_url:
self._archive_url = self.archive_url
return "%s" % self._archive_url
def __len__(self):
"""Length of the URL."""
return len(self.clean_url())
def url_check(self):
"""Check for common URL problems."""
if "." not in self.url:
raise URLError("'%s' is not a vaild url." % self.url)
return True
def clean_url(self):
"""Fix the URL, if possible."""
return str(self.url).strip().replace(" ","_")
def wayback_timestamp(self, **kwargs):
"""Return the formatted the timestamp."""
return (
str(kwargs["year"])
+
str(kwargs["month"]).zfill(2)
+
str(kwargs["day"]).zfill(2)
+
str(kwargs["hour"]).zfill(2)
+
str(kwargs["minute"]).zfill(2)
)
def handle_HTTPError(self, e):
"""Handle some common HTTPErrors."""
if e.code == 404:
raise HTTPError(e)
if e.code >= 400:
raise WaybackError(e)
def save(self):
"""Create a new archives for an URL on the Wayback Machine."""
request_url = ("https://web.archive.org/save/" + self.clean_url())
hdr = { 'User-Agent' : '%s' % self.user_agent } #nosec
req = Request(request_url, headers=hdr) #nosec
try:
response = urlopen(req, timeout=30) #nosec
except Exception:
try:
response = urlopen(req) #nosec
except Exception as e:
raise WaybackError(e)
header = response.headers
try:
arch = re.search(r"rel=\"memento.*?web\.archive\.org(/web/[0-9]{14}/.*?)>", str(header)).group(1)
except KeyError as e:
raise WaybackError(e)
return "https://web.archive.org" + arch
def get(self, url=None, user_agent=None, encoding=None):
"""Returns the source code of the supplied URL. Auto detects the encoding if not supplied."""
if not url:
url = self.clean_url()
if not user_agent:
user_agent = self.user_agent
hdr = { 'User-Agent' : '%s' % user_agent }
req = Request(url, headers=hdr) #nosec
try:
resp=urlopen(req) #nosec
except URLError:
try:
resp=urlopen(req) #nosec
except URLError as e:
raise HTTPError(e)
if not encoding:
try:
encoding= resp.headers['content-type'].split('charset=')[-1]
except AttributeError:
encoding = "UTF-8"
return resp.read().decode(encoding.replace("text/html", "UTF-8", 1))
def near(self, **kwargs):
""" Returns the archived from Wayback Machine for an URL closest to the time supplied.
Supported params are year, month, day, hour and minute.
The non supplied parameters are default to the runtime time.
"""
year=kwargs.get("year", datetime.utcnow().strftime('%Y'))
month=kwargs.get("month", datetime.utcnow().strftime('%m'))
day=kwargs.get("day", datetime.utcnow().strftime('%d'))
hour=kwargs.get("hour", datetime.utcnow().strftime('%H'))
minute=kwargs.get("minute", datetime.utcnow().strftime('%M'))
timestamp = self.wayback_timestamp(year=year,month=month,day=day,hour=hour,minute=minute)
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (self.clean_url(), str(timestamp))
hdr = { 'User-Agent' : '%s' % self.user_agent }
req = Request(request_url, headers=hdr) # nosec
try:
response = urlopen(req) #nosec
except Exception as e:
self.handle_HTTPError(e)
data = json.loads(response.read().decode("UTF-8"))
Why do we have len here?
Applying len() on <class 'waybackpy.wrapper.Url'>
will calculate the number of days between today and
the archive timestamp.
Can be applied on return values of near and its
childs (e.g. oldest) and if applied on waybackpy.Url()
whithout using any functions, it just grabs
self._timestamp and def _timestamp gets it
from def JSON.
"""
td_max = timedelta(
days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)
if not self.timestamp:
self.timestamp = self._timestamp
if self.timestamp == datetime.max:
return td_max.days
return (datetime.utcnow() - self.timestamp).days
@property
def JSON(self):
"""
If the end user has used near() or its childs like oldest, newest
and archive_url then the JSON response of these are cached in self._JSON
If we find that self._JSON is not None we return it.
else we get the response of 'https://archive.org/wayback/available?url=YOUR-URL'
and return it.
"""
if self._JSON:
return self._JSON
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": self.user_agent}
payload = {"url": "%s" % _cleaned_url(self.url)}
response = _get_response(endpoint, params=payload, headers=headers)
return response.json()
@property
def archive_url(self):
"""
Returns any random archive for the instance.
But if near, oldest, newest were used before
then it returns the same archive again.
We cache archive in self._archive_url
"""
if self._archive_url:
return self._archive_url
data = self.JSON
if not data["archived_snapshots"]:
raise WaybackError("'%s' is not yet archived." % url)
archive_url = (data["archived_snapshots"]["closest"]["url"])
# wayback machine returns http sometimes, idk why? But they support https
archive_url = archive_url.replace("http://web.archive.org/web/","https://web.archive.org/web/",1)
archive_url = None
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self._archive_url = archive_url
return archive_url
@property
def _timestamp(self):
self.timestamp = _ts(self.timestamp, self.JSON)
return self.timestamp
def save(self):
"""
To save a webpage on WayBack machine we
need to send get request to https://web.archive.org/save/
And to get the archive URL we are required to read the
header of the API response.
_get_response() takes care of the get requests. It uses requests
package.
_archive_url_parser() parses the archive from the header.
"""
request_url = "https://web.archive.org/save/" + _cleaned_url(self.url)
headers = {"User-Agent": self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
self._archive_url = "https://" + _archive_url_parser(response.headers, self.url)
self.timestamp = datetime.utcnow()
return self
def get(self, url="", user_agent="", encoding=""):
"""
Return the source code of the supplied URL.
If encoding is not supplied, it is auto-detected
from the response itself by requests package.
"""
if not url:
url = _cleaned_url(self.url)
if not user_agent:
user_agent = self.user_agent
headers = {"User-Agent": str(user_agent)}
response = _get_response(str(url), params=None, headers=headers)
if not encoding:
try:
encoding = response.encoding
except AttributeError:
encoding = "UTF-8"
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
def near(self, year=None, month=None, day=None, hour=None, minute=None):
"""
Wayback Machine can have many archives of a webpage,
sometimes we want archive close to a specific time.
This method takes year, month, day, hour and minute as input.
The input type must be integer. Any non-supplied parameters
default to the current time.
We convert the input to a wayback machine timestamp using
_wayback_timestamp(), it returns a string.
We use the wayback machine's availability API
(https://archive.org/wayback/available)
to get the closest archive from the timestamp.
We set self._archive_url to the archive found, if any.
If archive found, we set self.timestamp to its timestamp.
We self._JSON to the response of the availability API.
And finally return self.
"""
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
)
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": self.user_agent}
payload = {"url": "%s" % _cleaned_url(self.url), "timestamp": timestamp}
response = _get_response(endpoint, params=payload, headers=headers)
data = response.json()
if not data["archived_snapshots"]:
raise WaybackError(
"Can not find archive for '%s' try later or use wayback.Url(url, user_agent).save() "
"to create a new archive.\nAPI response:\n%s"
% (_cleaned_url(self.url), response.text)
)
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self._archive_url = archive_url
self.timestamp = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
self._JSON = data
return self
def oldest(self, year=1994):
"""Returns the oldest archive from Wayback Machine for an URL."""
"""
Returns the earliest/oldest Wayback Machine archive for the webpage.
Wayback machine has started archiving the internet around 1997 and
therefore we can't have any archive older than 1997, we use 1994 as the
deafult year to look for the oldest archive.
We simply pass the year in near() and return it.
"""
return self.near(year=year)
def newest(self):
"""Returns the newest archive on Wayback Machine for an URL, sometimes you may not get the newest archive because wayback machine DB lag."""
"""
Return the newest Wayback Machine archive available for this URL.
We return the output of self.near() as it deafults to current utc time.
Due to Wayback Machine database lag, this may not always be the
most recent archive.
"""
return self.near()
def total_archives(self):
"""Returns the total number of archives on Wayback Machine for an URL."""
hdr = { 'User-Agent' : '%s' % self.user_agent }
request_url = "https://web.archive.org/cdx/search/cdx?url=%s&output=json&fl=statuscode" % self.clean_url()
req = Request(request_url, headers=hdr) # nosec
def total_archives(self, start_timestamp=None, end_timestamp=None):
"""
A webpage can have multiple archives on the wayback machine
If someone wants to count the total number of archives of a
webpage on wayback machine they can use this method.
Returns the total number of Wayback Machine archives for the URL.
Return type in integer.
"""
cdx = Cdx(
_cleaned_url(self.url),
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
)
i = 0
for _ in cdx.snapshots():
i = i + 1
return i
def live_urls_finder(self, url):
"""
This method is used to check if supplied url
is >= 400.
"""
try:
response = urlopen(req) #nosec
except Exception as e:
self.handle_HTTPError(e)
return str(response.read()).count(",") # Most efficient method to count number of archives (yet)
response_code = requests.get(url).status_code
except Exception:
return # we don't care if Exception
# 200s are OK and 300s are usually redirects, if you don't want redirects replace 400 with 300
if response_code >= 400:
return
self._alive_url_list.append(url)
def known_urls(
self, alive=False, subdomain=False, start_timestamp=None, end_timestamp=None
):
"""
Returns list of URLs known to exist for given domain name
because these URLs were crawled by WayBack Machine spider.
Useful for pen-testing.
"""
# Idea by Mohammed Diaa (https://github.com/mhmdiaa) from:
# https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
url_list = []
if subdomain:
cdx = Cdx(_cleaned_url(self.url), user_agent=self.user_agent, start_timestamp=start_timestamp, end_timestamp=end_timestamp, match_type="domain", collapses=["urlkey"])
else:
cdx = Cdx(_cleaned_url(self.url), user_agent=self.user_agent, start_timestamp=start_timestamp, end_timestamp=end_timestamp, match_type="host", collapses=["urlkey"])
snapshots = cdx.snapshots()
url_list = []
for snapshot in snapshots:
url_list.append(snapshot.original)
# Remove all deadURLs from url_list if alive=True
if alive:
with concurrent.futures.ThreadPoolExecutor() as executor:
executor.map(self.live_urls_finder, url_list)
url_list = self._alive_url_list
return url_list