Compare commits

...

186 Commits
v1.6 ... 2.3.2

Author SHA1 Message Date
7c7fd75376 No need to fetch archive_url and timestamp from availability API on init (#55)
* No need to fetch archive_url and timestamp from availability API on init. 

Not useful if all I want is to archive a page

* Update test_wrapper.py

* Update wrapper.py

* Update test_wrapper.py

* Update wrapper.py

* Update cli.py

* Update wrapper.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update setup.py

* Update README.md
2021-01-02 11:10:23 +05:30
0b71433667 v2.3.1 (#54)
* 2.3.1

* 2.3.1
2021-01-01 19:15:23 +05:30
1b499a7594 removed JSON from init, this was resulting in too much unnecessary taffic. Some users who are thousands of URLs were blocked by IA (#53)
closes #52
2021-01-01 16:38:57 +05:30
da390ee8a3 improve maintainability and reduce code cognitive complexity (#49) 2020-12-15 10:24:13 +05:30
d3e68d0e70 code formated with black (#47) 2020-12-14 01:18:04 +05:30
fde28d57aa Update CONTRIBUTING.md 2020-12-14 00:16:29 +05:30
6092e504c8 Update CONTRIBUTING.md 2020-12-14 00:15:51 +05:30
93ef60ecd2 v2.3.0 (#46)
* v2.3.0

* v2.3.0

* decrease line length
2020-12-14 00:14:54 +05:30
461b3f74c9 UPDATE header image url 2020-12-13 23:09:59 +05:30
3c53b411b0 Improve the appearance of readme (#45)
* replaced text header wth image

* svg

* Update README.md

* Update README.md

* Update README.md

* level 2

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Create CONTRIBUTING.md

* Update README.md

* Add files via upload

* Update README.md

* Delete waybackpy-colored 284.png

* Delete waybackpy colored.png

* Update README.md

* Update index.rst

* Update index.rst

* Update index.rst

* Update setup.py

* Delete index.rst

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md
2020-12-13 23:08:16 +05:30
8125526061 create pyup.io config file (#44) 2020-12-13 22:31:49 +05:30
2dc81569a8 Create .pep8speaks.yml 2020-12-13 17:58:09 +05:30
fd163f3d36 Update wrapper.py 2020-12-13 17:12:32 +05:30
a0a918cf0d . 2020-12-13 17:10:28 +05:30
4943cf6873 remove print stmnt, update ci 2020-12-13 16:37:35 +05:30
bc3efc7d63 now using requests lib as it handles errors nicely (#42)
* now using requests lib as it handles errors nicely

* remove unused import (urllib)

* FIX : replaced full_url with endpoint (not using urlib)

* LINT :  Found in waybackpy\wrapper.py:88  Unnecessary else after return
2020-12-13 15:44:37 +05:30
f89368f16d LINT : Found in waybackpy\wrapper.py:88 Unnecessary else after return 2020-12-13 15:39:23 +05:30
c919a6a605 FIX : replaced full_url with endpoint (not using urlib) 2020-12-13 15:22:56 +05:30
0280fca189 remove unused import (urllib) 2020-12-13 15:13:51 +05:30
60ee8b95a8 now using requests lib as it handles errors nicely 2020-12-13 15:05:57 +05:30
ca51c14332 deleted .travis.yml, link with flake (#41)
close #38
2020-11-26 13:06:50 +05:30
525cf17c6f Update ci.yml 2020-11-26 12:14:15 +05:30
406e03c52f Update ci.yml 2020-11-26 12:04:45 +05:30
672b33e83a Update ci.yml 2020-11-26 10:10:10 +05:30
b19b840628 Update ci.yml 2020-11-26 10:01:55 +05:30
a6df4f899c Update ci.yml 2020-11-26 09:26:11 +05:30
7686e9c20d Update README.md (#40) 2020-11-26 09:18:26 +05:30
3c5932bc39 now using gh actions (#39) 2020-11-26 09:09:53 +05:30
f9a986f489 Create ci.yml 2020-11-26 08:55:23 +05:30
0d7458ee90 per https://docs.travis-ci.com/user/languages/python/, Python builds are not available on the macOS 2020-11-26 08:08:59 +05:30
ac8b9d6a50 use osx, huge backlog on .org travis for linux builds 2020-11-26 08:03:27 +05:30
58cd9c28e7 Threading enabled checking for URLs 2020-11-26 06:15:42 +05:30
5088305a58 removed python2 compatibility code 2020-11-21 17:00:11 +05:30
9f847a5e55 change pepy.tech download count link, they removed the month page 2020-11-11 10:44:14 +05:30
6c04c2f3d3 + https://github.com/akamhy/waybackpy/graphs/contributors 2020-11-04 08:09:30 +05:30
925be7b17e V2.2.0 2020-10-17 17:10:46 +05:30
2b132456ac updated index.rst and minor docs updated. 2020-10-17 16:56:51 +05:30
50e3154a4e lint README.md 2020-10-17 12:01:49 +05:30
7aef50428f add link to the repo 2020-10-17 11:51:56 +05:30
d8ec0f5025 More pythonic code snippets in README (#36) 2020-10-17 11:49:27 +05:30
0a2f97c034 Update README, drop python 2 support
* Drop python 2 support

* updated docs

* added new docs
2020-10-16 22:37:32 +05:30
3e9cf23578 3.9 archive doesn't not exist yet. 2020-10-16 19:43:06 +05:30
7f927ec7be added tests for json and archive_url, updated broken tests (#34)
* added tests for json and archive_url, updated broken tests

* drop 2.7 support
2020-10-16 19:25:45 +05:30
9de6393cd5 Add support for JSON and archive_url (#33)
CLI support for JSON and archive_url attributes
2020-10-16 15:16:18 +05:30
91e7f65617 Fixing len() bug (#32)
* added class functionality

* Update wrapper.py

* style edits

* fixed bug with len() of url()

* fixing len() bug

* fixing len() bug

* squashing bug

* removed test notebook
2020-10-16 10:04:13 +05:30
d465454019 Adding attributes to Url class (#28)
* added class functionality

* Update wrapper.py

* style edits
2020-10-15 22:10:32 +05:30
1a81eb97fb lint 2020-10-03 16:58:11 +05:30
6b3b2e2a7d tests for newly added known_urls feature 2020-10-03 09:33:50 +05:30
82c65454e6 2.1.9 2020-10-03 01:34:15 +05:30
19710461b6 Update setup.py 2020-10-03 01:33:46 +05:30
a3661d6b85 Update index.rst 2020-10-03 01:33:15 +05:30
58375e4ef4 fix broken links 2020-10-03 01:31:28 +05:30
ea023e98da update 2020-10-03 01:22:51 +05:30
f1065ed1c8 v2.1.8 2020-10-03 01:18:30 +05:30
315519b21f 2.1.8 2020-10-03 01:18:08 +05:30
07c98661de add usage for known urls (#26)
* Update README.md

* Update README.md

* Update README.md

* bash example for known urls

* python examples / usage for known urls :)

* Update README.md

* Update README.md

* Update README.md

* Update README.md
2020-10-03 01:16:19 +05:30
2cd991a54e lint markdown 2020-10-02 23:34:06 +05:30
ede251afb3 update tests 2020-10-02 23:10:48 +05:30
a8ce970ca0 fixed yet another issue with tests :( 2020-10-02 23:01:59 +05:30
243af26bf6 update version format in tests 2020-10-02 22:23:58 +05:30
0f1db94884 license & packaging info 2020-10-02 22:10:30 +05:30
c304f58ea2 update tests 2020-10-02 21:35:39 +05:30
23f7222cb5 tweak 2020-10-02 21:01:32 +05:30
ce7294d990 Implemented new feature, known urls for domain. 2020-10-02 20:27:28 +05:30
c9fa114d2e grammar 2020-10-01 23:50:03 +05:30
8b6bacb28e Add files via upload 2020-09-08 09:23:59 +05:30
32d8ad7780 Update README.md (#24)
- IA and Wayback machine logo, added new waybackpy logo.
+ changed pages to webpages in lead
2020-09-08 09:12:48 +05:30
cbf2f90faa Add files via upload 2020-09-08 09:06:33 +05:30
4dde3e3134 Delete a.txt 2020-09-08 09:02:36 +05:30
1551e8f1c6 Add files via upload 2020-09-08 09:02:19 +05:30
c84f09e2d2 Create a.txt 2020-09-08 08:59:28 +05:30
57a32669b5 v2.1.7 2020-08-09 11:06:29 +05:30
fe017cbcc8 v2.1.7 2020-08-09 11:06:04 +05:30
5edb03d24b update docs 2020-08-09 11:05:04 +05:30
c5de2232ba Update test_wrapper.py 2020-08-09 10:53:00 +05:30
ca9186c301 update message, sometimes raised for poor performance by wayback machine even if the url is archived. 2020-08-09 10:43:16 +05:30
8a4b631c13 new regex to parse archive, IA changed the header again :( 2020-08-09 10:36:25 +05:30
ec9ce92f48 Update README.md (#23)
* Update README.md

* fix grammar
2020-07-26 10:30:54 +05:30
e95d35c37f re arrange the badges, moved contributions welcome to top 2020-07-26 10:24:31 +05:30
36d662b961 Update __version__.py 2020-07-24 16:24:57 +05:30
2835f8877e Update setup.py 2020-07-24 16:24:38 +05:30
18cbd2fd30 Update cli.py 2020-07-24 16:10:29 +05:30
a2812fb56f patch for cli 2020-07-24 16:09:47 +05:30
77effcf649 Update setup.py 2020-07-24 15:34:14 +05:30
7272ef45a0 Update __version__.py 2020-07-24 15:33:58 +05:30
56116551ac Coverge improvements (#22)
* Update cli.py

* improved tests

* chnages for proper testing

* Type check using isinstance

* Replace elifs with if when used after return

* twitter.com --> www.ibm.com

* fix typo

* test archive urll parser and dunders

* Update test_wrapper.py
2020-07-24 15:31:21 +05:30
4dcda94cb0 v2.1.4 2020-07-24 01:03:44 +05:30
09f59b0182 v2.1.4 2020-07-24 01:03:04 +05:30
ed24184b99 Remove duplicate get response method 2020-07-24 00:57:22 +05:30
56bef064b1 only test save on >3.7 2020-07-23 20:51:46 +05:30
44bb2cf5e4 some cli tests 2020-07-23 20:44:14 +05:30
e231228721 Update README.md (#21)
* Update README.md

* example bash oldest newest

* total archives bash example

* near bash example

* format the list

* ce

* get bash example

* pip git install example

* Update index.rst

* + argparse

* + argparse
2020-07-22 21:35:02 +05:30
b8b2d6dfa9 v2.1.3 2020-07-22 20:21:37 +05:30
3eca6294df v2.1.3 2020-07-22 20:20:44 +05:30
eb037a0284 Rename test_1.py to test_wrapper.py 2020-07-22 20:19:59 +05:30
a01821f20b Update .travis.yml 2020-07-22 17:33:59 +05:30
b21036f8df Update .travis.yml 2020-07-22 17:31:57 +05:30
b43bacb7ac fix error language 2020-07-22 17:25:15 +05:30
f7313b255a Update cli.py 2020-07-22 17:22:38 +05:30
7457e1c793 - print(repr(obj)) 2020-07-22 17:18:27 +05:30
f7493d823f Update cli.py 2020-07-22 17:16:53 +05:30
7fa7b59ce3 if version don't try to create object 2020-07-22 17:15:28 +05:30
78a608db50 Update cli.py 2020-07-22 17:12:44 +05:30
93f7dfdaf9 resolve args conflict 2020-07-22 17:09:32 +05:30
83c6f256c9 version arg 2020-07-22 17:03:56 +05:30
dee9105794 command_line support (#18)
* Update wrapper.py

* entry points cli

* Suppress the urllib2/3 Exception

* rm cli code, will create a new cli.py file

* Create cli.py

* update cli entry pts

* Update cli.py

* Update cli.py

* import print_function

* Update cli.py

* Update cli.py

* Delete pypi_uploader.sh

* resolve conflicts with the master

* update the test ; resolve the conflicts

* decrease code complexity

* cli method changed to main

* get is not for just local usage

* get method should be available from interface

* get is used in the interface

* Update cli.py
2020-07-22 16:40:13 +05:30
3bfc3b46d0 Delete SECURITY.md 2020-07-22 11:07:59 +05:30
553f150bee replace youtube with twitter.com
for some reason Wayback API is returing diffrent youtube URL now.
2020-07-22 11:07:23 +05:30
b3a7e714a5 Update wrapper.py 2020-07-22 10:57:43 +05:30
cd9841713c Update wrapper.py 2020-07-22 10:52:43 +05:30
1ea9548d46 Raise WaybackError from URLError and include URL (#19)
* Raise WaybackError from URLError and include URL

* python2 compatibility

Co-authored-by: Akash <64683866+akamhy@users.noreply.github.com>
2020-07-22 10:51:44 +05:30
be7642c837 Code style improvements (#20)
* Add sane line length to setup.cfg

* Use Black for quick readability improvements

* Clean up exceptions, docstrings, and comments

Docstrings on dunder functions are redundant and typically ignored
Limit to reasonable line length
General grammar and style corrections
Clarify docstrings and exceptions
Format docstrings per PEP 257 -- Docstring Conventions

* Move archive_url_parser out of Url.save()

It's generally poor form to define a function in a function, as it will
be re-defined each time the function is run.

archive_url_parser does not depend on anything in Url, so it makes sense
to move it out of the class.

* move wayback_timestamp out of class, mark private functions

* DRY in _wayback_timestamp

* Url._url_check should return None

There's no point in returning True if it's never checked and won't ever
be False.
Implicitly returning None or raising an exception is more idiomatic.

* Default parameters should be type-consistant with expected values

* Specify parameters to near

* Use datetime.datetime in _wayback_timestamp

* cleanup __init__.py

* Cleanup formatting in tests

* Fix names in tests

* Revert "Use datetime.datetime in _wayback_timestamp"

This reverts commit 5b30380865.

Introduced unnecessary complexity

* Move _get_response outside of Url

Because Codacy reminded me that I missed it.

* fix imports in tests
2020-07-22 10:09:14 +05:30
a418a4e464 Update SECURITY.md 2020-07-21 10:38:41 +05:30
aec035ef1e Create SECURITY.md 2020-07-21 08:41:07 +05:30
6d37993ab9 moved to manuals 2020-07-21 08:14:40 +05:30
72b80ca44e Create pypi_uploader.sh 2020-07-21 08:14:21 +05:30
c10aa9279c Create python-publish.yml 2020-07-21 08:08:55 +05:30
68d809a7d6 Update test_1.py 2020-07-20 23:45:49 +05:30
4ad09a419b Fix bash syntax 2020-07-20 23:44:23 +05:30
ddc6620f09 Only report coverage if python 3.8 or greater 2020-07-20 23:29:54 +05:30
4066a65678 Update .travis.yml 2020-07-20 23:20:43 +05:30
8e46a9ba7a Update .travis.yml 2020-07-20 23:16:33 +05:30
a5a98b9b00 Update .travis.yml 2020-07-20 23:11:56 +05:30
a721ab7d6c Update .travis.yml 2020-07-20 23:10:06 +05:30
7db27ae5e1 Create pypi_uploader.sh 2020-07-20 22:28:35 +05:30
8fd4462025 Update wrapper.py 2020-07-20 20:17:18 +05:30
c458a15820 Update .travis.yml 2020-07-20 15:38:33 +05:30
bae3412bee Update .travis.yml 2020-07-20 15:24:26 +05:30
94cb08bb37 Update setup.py 2020-07-20 10:41:00 +05:30
af888db13e 2.1.2 2020-07-20 10:40:37 +05:30
d24f2408ee Update test_1.py 2020-07-20 10:31:47 +05:30
ddd2274015 Update test_1.py 2020-07-20 10:21:15 +05:30
99abdb7c67 Update test_1.py 2020-07-20 10:16:39 +05:30
f3bb9a8540 Update wrapper.py 2020-07-20 10:11:36 +05:30
bb94e0d1c5 Update index.rst and remove dupes 2020-07-20 10:07:31 +05:30
1a78d88be2 2.1.1 2020-07-19 23:17:01 +05:30
3ec61758b3 Update __version__.py 2020-07-19 23:16:13 +05:30
83c962166d Raise 2020-07-19 23:02:04 +05:30
e87dee3bdf Waybackpy example on replit (#15)
* Waybackpy save example on replit

* Oldest example

* Newest method replit link

* Near method example

* Get example

* Total archive method example
2020-07-19 22:28:08 +05:30
b27bfff15a v2.1.0 2020-07-19 21:08:01 +05:30
970fc1cd08 Update __version__.py 2020-07-19 21:06:54 +05:30
65391bf14b update 2020-07-19 21:04:32 +05:30
8ab116f276 API chnaged again. updated
* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* api changed; fix archive url parser

* Update wrapper.py

* - Trailing whitespace

* include the header in exception
2020-07-19 20:39:07 +05:30
6f82041ec9 Update README.md (#13)
* Update README.md

* Update README.md

* replit demo for waybackpy.Url.save()

* Update README.md

* Update README.md

* replit demo for oldest()

* replit demo for newest()

* Update README.md

* replit demo for total_archives

* demo at replit for get()

* demo for near

* Update README.md

* Update README.md

* Update README.md
2020-07-19 16:39:39 +05:30
11059c960e Update setup.py 2020-07-18 19:27:04 +05:30
eee1b8eba1 Update __version__.py 2020-07-18 19:26:41 +05:30
f7de8f5575 sleeps to prevent too many requests in a timeframe 2020-07-18 19:25:19 +05:30
3fa0c32064 V2.0.1 link 2020-07-18 19:09:18 +05:30
aa1e3b8825 V2.0.1 2020-07-18 19:08:39 +05:30
58d2d585c8 No timeout for final try 2020-07-18 18:29:41 +05:30
e8efed2e2f Update test_1.py 2020-07-18 17:24:54 +05:30
49089b7321 2.0.0 link 2020-07-18 17:09:07 +05:30
55d8687566 Update test_1.py 2020-07-18 16:58:23 +05:30
0fa28527af Update index.rst 2020-07-18 16:54:07 +05:30
68259fd2d9 Update index.rst 2020-07-18 16:53:27 +05:30
e7086a89d3 Update index.rst 2020-07-18 16:52:37 +05:30
e39467227c Update index.rst 2020-07-18 16:51:47 +05:30
ba840404cf Update index.rst 2020-07-18 16:50:37 +05:30
8fbd2d9e55 Update index.rst 2020-07-18 16:49:03 +05:30
eebf6043de Update index.rst 2020-07-18 16:48:29 +05:30
3d3b09d6d8 Update README.md 2020-07-18 16:46:40 +05:30
ef15b5863c Update index.rst 2020-07-18 16:44:32 +05:30
256c0cdb6b update test - save 2020-07-18 16:39:35 +05:30
12c72a8294 fix link 2020-07-18 16:30:20 +05:30
0ad27f5ecc update readme for newer oop and some test changes (#12)
* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* docstrings

* user agent ; more variants

* description update

* Update __init__.py

* # -*- coding: utf-8 -*-

* Update test_1.py

* update docs for get()

* Update README.md
2020-07-18 16:22:09 +05:30
700b60b5f8 Update README.md 2020-07-18 08:16:59 +05:30
11032596c8 Update README.md 2020-07-18 08:15:43 +05:30
9727f92168 Update README.md 2020-07-18 08:12:33 +05:30
d2893fec13 Delete CONTRIBUTING.md 2020-07-18 08:12:00 +05:30
f1353b2129 Update CONTRIBUTING.md 2020-07-18 00:58:50 +05:30
c76a95ef90 Create CONTRIBUTING.md (#11) 2020-07-18 00:57:48 +05:30
62d88359ce Update README.md 2020-07-18 00:40:21 +05:30
9942c474c9 Update README.md 2020-07-18 00:35:12 +05:30
dfb736e794 Size 2020-07-18 00:32:00 +05:30
84d1766917 Update README.md 2020-07-18 00:20:58 +05:30
9d3cdfafb3 Update README.md 2020-07-18 00:20:17 +05:30
20a16bfa45 Version 2.0.0 on it's way for release (tommorow) 2020-07-18 00:09:28 +05:30
f2112c73f6 Python 2 support 2020-07-17 21:08:32 +05:30
9860527d96 OOP (#10)
* Update wrapper.py

* Update exceptions.py

* Update __init__.py

* test adjusted for new changes

* Update wrapper.py
2020-07-17 20:50:00 +05:30
9ac1e877c8 Update README.md 2020-07-16 20:39:12 +05:30
f881705d00 detecet python version whith sys.version_info (#9) 2020-06-26 15:48:01 +05:30
f015c3f4f3 test on the worst case possible 2020-05-08 09:56:01 +05:30
42ac399362 Most efficient method to count (yet) 2020-05-08 09:47:13 +05:30
e9d010c793 just count the status code, consumes less memory 2020-05-08 09:28:18 +05:30
58a6409528 v1.6 2020-05-07 20:14:59 +05:30
7ca2029158 Update setup.py 2020-05-07 20:14:40 +05:30
26 changed files with 2031 additions and 651 deletions

42
.github/workflows/ci.yml vendored Normal file
View File

@ -0,0 +1,42 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: CI
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.8']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest codecov pytest-cov
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest --cov=waybackpy tests/
- name: Upload coverage to Codecov
run: |
bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

31
.github/workflows/python-publish.yml vendored Normal file
View File

@ -0,0 +1,31 @@
# This workflows will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
name: Upload Python Package
on:
release:
types: [created]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools wheel twine
- name: Build and publish
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
run: |
python setup.py sdist bdist_wheel
twine upload dist/*

3
.gitignore vendored
View File

@ -1,3 +1,6 @@
# Files generated while testing
*-urls-*.txt
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

4
.pep8speaks.yml Normal file
View File

@ -0,0 +1,4 @@
# File : .pep8speaks.yml
scanner:
diff_only: True # If True, errors caused by only the patch are shown

5
.pyup.yml Normal file
View File

@ -0,0 +1,5 @@
# autogenerated pyup.io config file
# see https://pyup.io/docs/configuration/ for all available options
schedule: ''
update: false

View File

@ -1,14 +0,0 @@
language: python
python:
- "2.7"
- "3.6"
- "3.8"
os: linux
dist: xenial
cache: pip
install:
- pip install pytest
before_script:
cd tests
script:
- pytest test_1.py

58
CONTRIBUTING.md Normal file
View File

@ -0,0 +1,58 @@
# Contributing to waybackpy
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
- Reporting a bug
- Discussing the current state of the code
- Submitting a fix
- Proposing new features
- Becoming a maintainer
## We Develop with Github
We use github to host code, to track issues and feature requests, as well as accept pull requests.
## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
## Any contributions you make will be under the MIT Software License
In short, when you submit code changes, your submissions are understood to be under the same [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
## Report bugs using Github's [issues](https://github.com/akamhy/waybackpy/issues)
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/akamhy/waybackpy/issues/new); it's that easy!
## Write bug reports with detail, background, and sample code
**Great Bug Reports** tend to have:
- A quick summary and/or background
- Steps to reproduce
- Be specific!
- Give sample code if you can.
- What you expected would happen
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)
People *love* thorough bug reports. I'm not even kidding.
## Use a Consistent Coding Style
* You can try running `flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics` for style unification.
## License
By contributing, you agree that your contributions will be licensed under its [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE).
## References
This document is forked from [this gist](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62) by [briandk](https://github.com/briandk) which was itself adapted from the open-source contribution guidelines for [Facebook's Draft](https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md)

View File

@ -1,6 +1,6 @@
MIT License
Copyright (c) 2020 akamhy
Copyright (c) 2020 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

492
README.md
View File

@ -1,176 +1,452 @@
# waybackpy
[![Build Status](https://travis-ci.org/akamhy/waybackpy.svg?branch=master)](https://travis-ci.org/akamhy/waybackpy)
[![Downloads](https://img.shields.io/pypi/dm/waybackpy.svg)](https://pypistats.org/packages/waybackpy)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
<div align="center">
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
</div>
-----------------
## Python package & CLI tool that interfaces with the Wayback Machine API.
[![pypi](https://img.shields.io/pypi/v/waybackpy.svg)](https://pypi.org/project/waybackpy/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![Build Status](https://github.com/akamhy/waybackpy/workflows/CI/badge.svg)](https://github.com/akamhy/waybackpy/actions?query=workflow%3ACI)
[![codecov](https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg)](https://codecov.io/gh/akamhy/waybackpy)
[![contributions welcome](https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square)](https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![Downloads](https://pepy.tech/badge/waybackpy/month)](https://pepy.tech/project/waybackpy)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Maintainability](https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability)](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[![CodeFactor](https://www.codefactor.io/repository/github/akamhy/waybackpy/badge)](https://www.codefactor.io/repository/github/akamhy/waybackpy)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
![pypi](https://img.shields.io/pypi/v/waybackpy.svg)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
[![GitHub last commit](https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square)](https://github.com/akamhy/waybackpy/commits/master)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
![Internet Archive](https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png)
![Wayback Machine](https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png)
The waybackpy is a python wrapper for [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine).
Table of contents
=================
<!--ts-->
* [Installation](https://github.com/akamhy/waybackpy#installation)
* [Installation](#installation)
* [Usage](https://github.com/akamhy/waybackpy#usage)
* [Saving an url using save()](https://github.com/akamhy/waybackpy#capturing-aka-saving-an-url-using-save)
* [Receiving the oldest archive for an URL Using oldest()](https://github.com/akamhy/waybackpy#receiving-the-oldest-archive-for-an-url-using-oldest)
* [Receiving the recent most/newest archive for an URL using newest()](https://github.com/akamhy/waybackpy#receiving-the-newest-archive-for-an-url-using-newest)
* [Receiving archive close to a specified year, month, day, hour, and minute using near()](https://github.com/akamhy/waybackpy#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage using get()](https://github.com/akamhy/waybackpy#get-the-content-of-webpage-using-get)
* [Count total archives for an URL using total_archives()](https://github.com/akamhy/waybackpy#count-total-archives-for-an-url-using-total_archives)
* [Usage](#usage)
* [As a Python package](#as-a-python-package)
* [Saving a webpage](#capturing-aka-saving-an-url-using-save)
* [Retrieving archive](#retrieving-the-archive-for-an-url-using-archive_url)
* [Retrieving the oldest archive](#retrieving-the-oldest-archive-for-an-url-using-oldest)
* [Retrieving the latest/newest archive](#retrieving-the-newest-archive-for-an-url-using-newest)
* [Retrieving the JSON response of availability API](#retrieving-the-json-response-for-the-availability-api-request)
* [Retrieving archive close to a specified year, month, day, hour, and minute](#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage](#get-the-content-of-webpage-using-get)
* [Count total archives for an URL](#count-total-archives-for-an-url-using-total_archives)
* [List of URLs that Wayback Machine knows and has archived for a domain name](#list-of-urls-that-wayback-machine-knows-and-has-archived-for-a-domain-name)
* [With the Command-line interface](#with-the-command-line-interface)
* [Saving webpage](#save)
* [Archive URL](#get-archive-url)
* [Oldest archive URL](#oldest-archive)
* [Newest archive URL](#newest-archive)
* [JSON response of API](#get-json-data-of-avaialblity-api)
* [Total archives](#total-number-of-archives)
* [Archive near specified time](#archive-near-time)
* [Get the source code](#get-the-source-code)
* [Fetch all the URLs that the Wayback Machine knows for a domain](#fetch-all-the-urls-that-the-wayback-machine-knows-for-a-domain)
* [Tests](https://github.com/akamhy/waybackpy#tests)
* [Tests](#tests)
* [Dependency](https://github.com/akamhy/waybackpy#dependency)
* [Packaging](#packaging)
* [License](https://github.com/akamhy/waybackpy#license)
* [License](#license)
<!--te-->
## Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
**pip install waybackpy**
```bash
pip install waybackpy
```
or direct from this repository using git.
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
## Usage
#### Capturing aka Saving an url Using save()
```diff
+ waybackpy.save(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
```python
import waybackpy
# Capturing a new archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
print(archived_url)
```
This should print something similar to the following archived URL:
<https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy>
#### Receiving the oldest archive for an URL Using oldest()
```diff
+ waybackpy.oldest(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
### As a Python package
#### Capturing aka Saving an URL using save()
```python
import waybackpy
# retrieving the oldest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
print(oldest_archive)
url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive = waybackpy_url_obj.save()
print(archive)
```
This returns the oldest available archive for <https://google.com>.
<http://web.archive.org/web/19981111184551/http://google.com:80/>
#### Receiving the newest archive for an URL using newest()
```diff
+ waybackpy.newest(url, UA=user_agent)
```bash
https://web.archive.org/web/20201016171808/https://en.wikipedia.org/wiki/Multivariable_calculus
```
> url is mandatory. UA is not, but highly recommended.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPySaveExample></sub>
#### Retrieving the archive for an URL using archive_url
```python
import waybackpy
# retrieving the newest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
print(newest_archive)
url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive_url = waybackpy_url_obj.archive_url
print(archive_url)
```
This returns the newest available archive for <https://www.microsoft.com/en-us>, something just like this:
<http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/>
#### Receiving archive close to a specified year, month, day, hour, and minute using near()
```diff
+ waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)
```bash
https://web.archive.org/web/20201016153320/https://www.google.com/
```
> url is mandotory. year,month,day,hour and minute are optional arguments. UA is not mandotory, but higly recomended.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyArchiveUrl></sub>
#### Retrieving the oldest archive for an URL using oldest()
```python
import waybackpy
# retriving the the closest archive from a specified year.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are year,month,day,hour and minute
archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
print(archive_near_year)
url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
oldest_archive_url = waybackpy_url_obj.oldest()
print(oldest_archive_url)
```
returns : <http://web.archive.org/web/20100504071154/http://www.facebook.com/>
```waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20101111173430/http://www.facebook.com//>
```bash
http://web.archive.org/web/19981111184551/http://google.com:80/
```
```waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html>
> Please note that if you only specify the year, the current month and day are default arguments for month and day respectively. Do not expect just putting the year parameter would return the archive closer to January but the current month you are using the package. If you are using it in July 2018 and let's say you use ```waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``` then you would be returned the nearest archive to July 2011 and not January 2011. You need to specify the month "1" for January.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyOldestExample></sub>
> Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
#### Retrieving the newest archive for an URL using newest()
```python
import waybackpy
url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
newest_archive_url = waybackpy_url_obj.newest()
print(newest_archive_url)
```
```bash
https://web.archive.org/web/20201016150543/https://www.facebook.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNewestExample></sub>
#### Retrieving the JSON response for the availability API request
```python
import waybackpy
url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
json_dict = waybackpy_url_obj.JSON
print(json_dict)
```
```javascript
{'url': 'https://www.facebook.com/', 'archived_snapshots': {'closest': {'available': True, 'url': 'http://web.archive.org/web/20201016150543/https://www.facebook.com/', 'timestamp': '20201016150543', 'status': '200'}}}
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyJSON></sub>
#### Retrieving archive close to a specified year, month, day, hour, and minute using near()
```python
from waybackpy import Url
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
url = "https://github.com/"
waybackpy_url_obj = Url(url, user_agent)
# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
```
```python
github_archive_near_2010 = waybackpy_url_obj.near(year=2010)
print(github_archive_near_2010)
```
```bash
https://web.archive.org/web/20101018053604/http://github.com:80/
```
```python
github_archive_near_2011_may = waybackpy_url_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)
```
```bash
https://web.archive.org/web/20110518233639/https://github.com/
```
```python
github_archive_near_2015_january_26 = waybackpy_url_obj.near(year=2015, month=1, day=26)
print(github_archive_near_2015_january_26)
```
```bash
https://web.archive.org/web/20150125102636/https://github.com/
```
```python
github_archive_near_2018_4_july_9_2_am = waybackpy_url_obj.near(year=2018, month=7, day=4, hour=9, minute=2)
print(github_archive_near_2018_4_july_9_2_am)
```
```bash
https://web.archive.org/web/20180704090245/https://github.com/
```
<sub>The package doesn't support the seconds' argument yet. You are encouraged to create a PR ;)</sub>
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNearExample></sub>
#### Get the content of webpage using get()
```diff
+ waybackpy.get(url, encoding="UTF-8", UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended. encoding is detected automatically, don't specify unless necessary.
```python
from waybackpy import get
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url, encoding and UA
webpage = get("https://example.com/", UA="User-Agent")
print(webpage)
import waybackpy
google_url = "https://www.google.com/"
User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
waybackpy_url_object = waybackpy.Url(google_url, User_Agent)
# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)
# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(waybackpy_url_object.save())
print(google_newest_archive_source)
# waybackpy_url_object.oldest() type is str, it's oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(waybackpy_url_object.oldest())
print(google_oldest_archive_source)
```
> This should print the source code for <https://example.com/>.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyGetExample#main.py></sub>
#### Count total archives for an URL using total_archives()
```diff
+ waybackpy.total_archives(url, UA=user_agent)
```python
import waybackpy
URL = "https://en.wikipedia.org/wiki/Python (programming language)"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)
archive_count = waybackpy_url_object.total_archives()
print(archive_count) # total_archives() returns an int
```
> url is mandatory. UA is not, but highly recommended.
```bash
2516
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyTotalArchivesExample></sub>
#### List of URLs that Wayback Machine knows and has archived for a domain name
1) If alive=True is set, waybackpy will check all URLs to identify the alive URLs. Don't use with popular websites like google or it would take too long.
2) To include URLs from subdomain set sundomain=True
```python
from waybackpy import total_archives
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url and UA
count = total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA="User-Agent")
print(count)
import waybackpy
URL = "akamhy.github.io"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)
known_urls = waybackpy_url_object.known_urls(alive=True, subdomain=False) # alive and subdomain are optional.
print(known_urls) # known_urls() returns list of URLs
```
> This should print an integer (int), which is the number of total archives on archive.org
```bash
['http://akamhy.github.io',
'https://akamhy.github.io/waybackpy/',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py></sub>
### With the Command-line interface
#### Save
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashSave></sub>
#### Get archive URL
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --archive_url
https://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashArchiveUrl></sub>
#### Oldest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashOldest></sub>
#### Newest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNewest></sub>
#### Get JSON data of avaialblity API
```bash
waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --json
```
```javascript
{'archived_snapshots': {'closest': {'timestamp': '20201007132458', 'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX'}}, 'url': 'https://en.wikipedia.org/wiki/SpaceX'}
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashJSON></sub>
#### Total number of archives
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
853
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashTotal></sub>
#### Archive near time
```bash
$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
https://web.archive.org/web/20120512142515/https://www.facebook.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNear></sub>
#### Get the source code
```bash
waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the URL
waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on Wayback machine then print the source code of this archive.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashGet></sub>
#### Fetch all the URLs that the Wayback Machine knows for a domain
1) You can add the '--alive' flag to only fetch alive links.
2) You can add the '--subdomain' flag to add subdomains.
3) '--alive' and '--subdomain' flags can be used simultaneously.
4) All links will be saved in a file, and the file will be created in the current working directory.
```bash
pip install waybackpy
# Ignore the above installation line.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --alive
# Prints all known URLs under akamhy.github.io which are still working and not dead links.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io including subdomain
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain --alive
# Prints all known URLs under akamhy.github.io including subdomain which are not dead links and still alive.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh></sub>
## Tests
* [Here](https://github.com/akamhy/waybackpy/tree/master/tests)
## Dependency
* None, just python standard libraries (json, urllib and datetime). Both python 2 and 3 are supported :)
To run tests locally:
1) Install or update the testing/coverage tools
```bash
pip install codecov pytest pytest-cov -U
```
2) Inside the repository run the following commands
```bash
pytest --cov=waybackpy tests/
```
3) To report coverage run
```bash
bash <(curl -s https://codecov.io/bash) -t SECRET_CODECOV_TOKEN
```
You can find the tests [here](https://github.com/akamhy/waybackpy/tree/master/tests).
## Packaging
1. Increment version.
2. Build package ``python setup.py sdist bdist_wheel``.
3. Sign & upload the package ``twine upload -s dist/*``.
## License
[MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
Released under the MIT License. See
[license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.

View File

@ -1 +1 @@
theme: jekyll-theme-cayman
theme: jekyll-theme-cayman

View File

@ -0,0 +1,268 @@
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="629.000000pt" height="103.000000pt" viewBox="0 0 629.000000 103.000000"
preserveAspectRatio="xMidYMid meet">
<g transform="translate(0.000000,103.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M0 515 l0 -515 3145 0 3145 0 0 515 0 515 -3145 0 -3145 0 0 -515z
m5413 439 c31 -6 36 -10 31 -26 -3 -10 0 -26 7 -34 6 -8 10 -17 7 -20 -3 -2
-17 11 -32 31 -15 19 -41 39 -59 44 -38 11 -10 14 46 5z m150 -11 c-7 -2 -21
-2 -30 0 -10 3 -4 5 12 5 17 0 24 -2 18 -5z m-4869 -23 c-6 -6 -21 -6 -39 -1
-30 9 -30 9 10 10 25 1 36 -2 29 -9z m452 -37 c-3 -26 -15 -65 -25 -88 -10
-22 -21 -64 -25 -94 -3 -29 -14 -72 -26 -95 -11 -23 -20 -51 -20 -61 0 -30
-39 -152 -53 -163 -6 -5 -45 -12 -85 -14 -72 -5 -102 4 -102 33 0 6 -9 31 -21
56 -11 25 -26 72 -33 103 -6 31 -17 64 -24 73 -8 9 -22 37 -32 64 l-18 48 -16
-39 c-9 -21 -16 -44 -16 -50 0 -6 -7 -24 -15 -40 -8 -16 -24 -63 -34 -106 -11
-43 -26 -93 -34 -112 -14 -34 -15 -35 -108 -46 -70 -9 -96 -9 -106 0 -21 17
-43 64 -43 92 0 14 -4 27 -9 31 -12 7 -50 120 -66 200 -8 35 -25 81 -40 103
-14 22 -27 52 -28 68 -2 28 0 29 48 31 28 1 82 5 120 9 54 4 73 3 82 -7 11
-15 53 -148 53 -170 0 -7 9 -32 21 -56 20 -41 39 -49 39 -17 0 8 -5 12 -10 9
-6 -3 -13 2 -16 12 -3 10 -10 26 -15 36 -14 26 7 21 29 -8 l20 -26 7 33 c7 35
41 149 56 185 7 19 16 23 56 23 27 0 80 2 120 6 80 6 88 1 97 -71 3 -20 9 -42
14 -48 5 -7 20 -43 32 -82 13 -38 24 -72 26 -74 2 -2 13 4 24 14 13 12 20 31
20 55 0 20 7 56 15 81 7 24 19 63 25 87 12 47 31 60 89 61 l34 1 -7 -47z
m3131 41 c17 -3 34 -12 37 -20 3 -7 1 -48 -4 -91 -4 -43 -7 -80 -4 -82 2 -2
11 2 20 10 9 7 24 18 34 24 9 5 55 40 101 77 79 64 87 68 136 68 28 0 54 -4
58 -10 3 -5 12 -7 20 -3 9 3 15 -1 15 -9 0 -13 -180 -158 -197 -158 -4 0 -14
-9 -20 -20 -11 -17 -7 -27 27 -76 22 -32 40 -63 40 -70 0 -7 6 -19 14 -26 7
-8 37 -48 65 -89 l52 -74 -28 -3 c-51 -5 -74 -12 -68 -22 9 -14 -59 -12 -73 2
-20 20 -13 30 10 14 34 -24 44 -19 17 8 -25 25 -109 140 -109 149 0 7 -60 97
-64 97 -2 0 -11 -10 -22 -22 -18 -21 -18 -21 0 -15 10 4 25 2 32 -4 18 -15 19
-35 2 -22 -7 6 -25 13 -39 17 -34 8 -39 -5 -39 -94 0 -38 -3 -75 -6 -84 -6
-16 -54 -22 -67 -9 -4 3 -40 7 -81 8 -101 2 -110 10 -104 97 3 37 10 73 16 80
6 8 10 77 10 174 0 89 2 166 6 172 6 11 162 15 213 6z m301 -1 c-25 -2 -52
-11 -58 -19 -7 -7 -17 -14 -23 -14 -5 0 -2 9 8 20 14 16 29 20 69 18 l51 -2
-47 -3z m809 -9 c33 -21 65 -89 62 -132 -1 -21 1 -47 5 -59 9 -28 -26 -111
-51 -120 -10 -3 -25 -12 -33 -19 -10 -8 -70 -15 -170 -21 l-155 -8 4 -73 c4
-93 -10 -112 -80 -112 -26 0 -60 5 -74 12 -19 8 -31 8 -51 -1 -45 -20 -55 -1
-55 98 0 47 -1 111 -3 141 -2 30 -5 107 -7 170 l-4 115 65 2 c36 2 103 7 150
11 150 15 372 13 397 -4z m338 -19 c11 -14 46 -54 78 -88 l58 -62 62 65 c34
36 75 73 89 83 28 18 113 24 122 9 3 -5 -32 -51 -77 -102 -147 -167 -134 -143
-139 -253 -3 -54 -10 -103 -16 -109 -8 -8 -8 -17 -1 -30 14 -26 11 -28 -47
-29 -119 -2 -165 3 -174 22 -6 10 -9 69 -8 131 l2 113 -57 75 c-32 41 -80 102
-107 134 -27 33 -47 62 -45 66 3 4 58 6 122 4 113 -3 119 -5 138 -29z m-4233
13 c16 -13 98 -150 98 -164 0 -4 29 -65 65 -135 36 -71 65 -135 65 -143 0 -10
-14 -17 -37 -21 -21 -4 -48 -10 -61 -16 -40 -16 -51 -10 -77 41 -29 57 -35 59
-157 38 -65 -11 -71 -14 -84 -43 -10 -25 -21 -34 -46 -38 -41 -6 -61 8 -48 33
15 28 12 38 -12 42 -18 2 -23 10 -24 36 -1 27 3 35 23 43 13 5 34 9 46 9 23 0
57 47 57 78 0 9 10 33 22 52 14 24 21 52 22 92 1 49 4 58 24 67 13 6 31 11 40
11 9 0 26 7 36 15 24 18 28 18 48 3z m1701 0 c16 -12 97 -143 97 -157 0 -3 32
-69 70 -146 39 -76 67 -142 62 -147 -4 -4 -28 -12 -52 -17 -25 -6 -57 -13 -72
-17 -25 -6 -29 -2 -50 42 -14 30 -31 50 -43 53 -11 2 -57 -2 -103 -9 -79 -12
-83 -13 -96 -45 -10 -24 -22 -34 -46 -38 -43 -9 -53 -1 -45 39 5 30 3 34 -15
34 -17 0 -20 6 -20 39 0 40 13 50 65 51 19 0 55 48 55 72 0 6 8 29 19 52 32
72 41 107 31 127 -8 14 -5 21 12 33 12 9 32 16 43 16 11 0 29 7 39 15 24 18
28 18 49 3z m-3021 -11 c-29 -9 -32 -13 -27 -39 8 -36 -11 -37 -20 -1 -8 32
15 54 54 52 24 -1 23 -2 -7 -12z m3499 4 c-12 -8 -51 -4 -51 5 0 2 15 4 33 4
22 0 28 -3 18 -9z m1081 -67 c2 -42 0 -78 -4 -81 -5 -2 -8 18 -8 45 0 27 -3
64 -6 81 -4 19 -2 31 4 31 6 0 12 -32 14 -76z m-1951 46 c12 -7 19 -21 19 -38
l-1 -27 -15 28 c-8 15 -22 27 -32 27 -9 0 -24 5 -32 10 -21 14 35 13 61 0z
m1004 -3 c73 -19 135 -61 135 -92 0 -15 -8 -29 -21 -36 -18 -9 -30 -6 -69 15
-37 20 -62 26 -109 26 -54 0 -62 -3 -78 -26 -21 -32 -33 -130 -25 -191 9 -58
41 -84 111 -91 38 -3 61 1 97 17 36 17 49 19 60 10 25 -21 15 -48 -28 -76 -38
-24 -54 -28 -148 -31 -114 -4 -170 10 -190 48 -6 11 -16 20 -23 20 -24 0 -59
95 -59 159 0 59 20 122 42 136 6 3 10 13 10 22 0 31 80 82 130 83 19 0 42 5
50 10 21 13 57 12 115 -3z m-1682 -23 c-14 -14 -28 -23 -31 -20 -8 8 29 46 44
46 7 0 2 -11 -13 -26z m159 -2 c-20 -15 -22 -23 -16 -60 4 -28 3 -42 -5 -42
-7 0 -11 19 -11 50 0 36 5 52 18 59 28 17 39 12 14 -7z m1224 -28 c-39 -40
-46 -38 -19 7 15 24 40 41 52 33 2 -2 -13 -20 -33 -40z m-1538 -33 l62 -66 63
68 c56 59 68 67 100 67 19 0 38 -3 40 -7 3 -5 -32 -53 -76 -108 -88 -108 -84
-97 -90 -255 l-2 -55 -87 -3 c-49 -1 -88 -1 -89 0 0 2 -3 50 -5 107 -3 75 -8
109 -19 121 -8 9 -15 20 -15 25 0 4 -18 29 -41 54 -83 94 -89 102 -84 111 3 6
45 9 93 9 l87 -1 63 -67z m786 59 c33 -12 48 -42 52 -107 3 -43 0 -57 -16 -73
l-20 -20 20 -28 c26 -35 35 -89 21 -125 -18 -46 -66 -60 -226 -64 -77 -3 -166
-7 -198 -10 -84 -7 -99 9 -97 102 1 38 -1 125 -4 191 l-5 122 47 5 c26 3 103
4 171 2 69 -2 134 1 145 5 29 12 80 12 110 0z m-1050 -16 c3 -8 2 -12 -4 -9
-6 3 -10 10 -10 16 0 14 7 11 14 -7z m-374 -22 c0 -9 -5 -24 -10 -32 -7 -11
-10 -5 -10 23 0 23 4 36 10 32 6 -3 10 -14 10 -23z m1701 16 c2 -21 -2 -43
-10 -51 -4 -4 -7 9 -8 28 -1 32 15 52 18 23z m2859 -28 c-11 -20 -50 -28 -50
-10 0 6 9 10 19 10 11 0 23 5 26 10 12 19 16 10 5 -10z m-4759 -47 c-8 -15
-10 -15 -11 -2 0 17 10 32 18 25 2 -3 -1 -13 -7 -23z m2599 9 c0 -9 -40 -35
-46 -29 -6 6 25 37 37 37 5 0 9 -3 9 -8z m316 -127 c-4 -19 -12 -37 -18 -41
-8 -5 -9 -1 -5 10 4 10 7 36 7 59 1 35 2 39 11 24 6 -10 8 -34 5 -52z m1942
38 c-15 -16 -30 -45 -33 -65 -4 -21 -12 -38 -17 -38 -19 0 3 74 30 103 14 15
30 27 36 27 5 0 -2 -12 -16 -27z m-3855 -16 c-6 -12 -15 -33 -20 -47 -9 -23
-10 -23 -15 -3 -3 12 3 34 14 52 23 35 37 34 21 -2z m3282 -82 c-23 -18 -81
-35 -115 -34 -17 1 -11 5 21 13 25 7 54 18 65 24 30 18 53 15 29 -3z m-2585
-130 c-7 -8 -19 -15 -27 -15 -10 0 -7 8 9 31 18 24 24 27 26 14 2 -9 -2 -22
-8 -30z m-1775 -5 c-4 -12 -9 -19 -12 -17 -3 3 -2 15 2 27 4 12 9 19 12 17 3
-3 2 -15 -2 -27z m820 -29 c-9 -8 -25 21 -25 44 0 16 3 14 15 -9 9 -16 13 -32
10 -35z m2085 47 c0 -17 -31 -48 -47 -48 -11 0 -8 8 9 29 24 32 38 38 38 19z
m-1655 -47 c-11 -10 -35 11 -35 30 0 21 0 21 19 -2 11 -13 18 -26 16 -28z
m1221 24 c13 -14 21 -25 18 -25 -11 0 -54 33 -54 41 0 15 12 10 36 -16z
m-1428 -7 c-3 -7 -18 -14 -34 -15 -20 -1 -22 0 -6 4 12 2 22 9 22 14 0 5 5 9
11 9 6 0 9 -6 7 -12z m3574 -45 c8 -10 6 -13 -11 -13 -18 0 -21 6 -20 38 0 34
1 35 10 13 5 -14 15 -31 21 -38z m-4097 14 c19 -4 19 -4 2 -12 -18 -7 -46 16
-47 39 0 6 6 3 13 -6 6 -9 21 -18 32 -21z m1700 1 c19 -5 19 -5 2 -13 -18 -7
-46 17 -46 40 0 6 5 3 12 -6 7 -9 21 -19 32 -21z m-1970 12 c-3 -5 -21 -9 -38
-9 l-32 2 35 7 c19 4 36 8 38 9 2 0 0 -3 -3 -9z m350 0 c-27 -12 -35 -12 -35
0 0 6 12 10 28 9 24 0 25 -1 7 -9z m1350 0 c-3 -5 -18 -9 -33 -9 l-27 1 30 8
c17 4 31 8 33 9 2 0 0 -3 -3 -9z m355 0 c-19 -13 -30 -13 -30 0 0 6 10 10 23
10 18 0 19 -2 7 -10z m-2324 -35 c-6 -22 -11 -25 -44 -24 -31 2 -32 3 -9 6 18
3 32 14 39 29 14 30 23 24 14 -11z m2839 16 c-14 -14 -73 -26 -60 -13 6 5 19
12 30 15 34 8 40 8 30 -2z m212 -21 l48 -8 -47 -1 c-56 -1 -78 6 -78 26 0 12
3 13 14 3 8 -6 36 -15 63 -20z m116 -1 c-6 -6 -18 -6 -28 -3 -18 7 -18 8 1 14
23 9 39 1 27 -11z m633 -14 c31 5 35 4 21 -5 -9 -6 -34 -10 -55 -8 -31 3 -37
7 -40 28 l-3 25 19 -23 c16 -20 24 -23 58 -17z m939 15 c16 -7 11 -9 -20 -9
-29 -1 -36 2 -25 9 17 11 19 11 45 0z m-5445 -24 c6 -8 21 -16 33 -18 19 -3
20 -4 5 -10 -12 -5 -27 1 -45 17 -16 13 -23 25 -17 25 6 0 17 -6 24 -14z m150
-76 c0 -11 -4 -20 -10 -20 -14 0 -13 -103 1 -117 21 -21 2 -43 -36 -43 -19 0
-35 5 -35 11 0 8 -5 7 -15 -1 -21 -17 -44 2 -28 22 22 26 20 128 -2 128 -8 0
-15 9 -15 19 0 18 8 20 70 20 63 0 70 -2 70 -19z m1189 -63 c17 -32 31 -62 31
-66 0 -14 -43 -21 -57 -9 -7 6 -29 12 -48 14 -26 2 -35 -1 -40 -16 -4 -12 -12
-17 -21 -13 -8 3 -13 12 -10 19 3 8 1 14 -4 14 -18 0 -10 22 9 27 22 6 43 46
35 67 -3 9 5 20 23 30 34 18 38 14 82 -67z m2146 -8 l34 -67 -25 -6 c-14 -4
-31 -3 -37 2 -7 5 -29 12 -49 16 -31 6 -38 4 -38 -9 0 -8 -7 -15 -15 -15 -8 0
-15 7 -15 15 0 8 -4 15 -10 15 -19 0 -10 21 14 30 16 6 27 20 31 40 4 18 16
41 27 52 26 26 40 14 83 -73z m-3205 51 c8 -10 20 -26 27 -36 10 -17 12 -14
12 19 1 36 2 37 37 37 l37 0 -8 -72 c-3 -40 -11 -76 -17 -79 -20 -13 -43 3
-62 42 -27 56 -34 56 -41 4 -7 -42 -9 -44 -34 -39 -35 9 -34 6 -35 71 -1 41 4
62 14 70 18 15 50 7 70 -17z m280 11 c-5 -11 -15 -21 -21 -23 -13 -4 -14 -101
-3 -120 5 -8 1 -9 -10 -5 -10 4 -29 7 -42 7 -22 0 -24 3 -24 55 0 52 -1 55
-26 55 -19 0 -25 5 -22 18 2 13 17 18 68 23 36 3 71 6 78 7 9 2 10 -3 2 -17z
m178 -3 c3 -15 -4 -18 -32 -18 -25 0 -36 -4 -36 -15 0 -10 11 -15 35 -15 24 0
35 -5 35 -15 0 -11 -11 -15 -41 -15 -55 0 -47 -24 9 -28 29 -2 42 -8 42 -18 0
-16 -25 -17 -108 -7 l-53 6 2 56 c3 92 1 90 77 88 55 -2 67 -5 70 -19z m230
10 c18 -18 14 -56 -7 -77 -17 -17 -18 -21 -5 -40 14 -19 13 -21 -4 -21 -10 0
-28 11 -40 25 -24 27 -52 24 -52 -5 0 -24 -9 -29 -43 -23 -26 5 -27 7 -27 73
0 45 4 70 13 73 26 11 153 7 165 -5z m557 -2 c47 -20 47 -40 0 -32 -53 10 -77
-7 -73 -52 l3 -37 48 1 c26 0 47 -3 47 -6 0 -35 -108 -42 -140 -10 -29 29 -27
94 5 125 28 28 60 31 110 11z m213 -8 c3 -15 -4 -18 -38 -18 -50 0 -51 -22 -1
-30 44 -7 44 -24 -1 -28 -54 -5 -52 -32 2 -32 29 0 40 -4 40 -15 0 -17 -28
-19 -104 -9 l-46 7 0 72 0 72 72 -1 c61 -1 73 -4 76 -18z m312 6 c0 -9 -9 -18
-21 -21 -19 -5 -20 -12 -17 -69 3 -63 3 -63 -22 -58 -49 11 -50 12 -50 64 0
43 -3 50 -20 50 -13 0 -20 7 -20 20 0 17 8 20 68 23 37 2 70 4 75 5 4 1 7 -5
7 -14z m155 6 c65 -15 94 -73 62 -125 -14 -24 -25 -28 -92 -33 -44 -3 -54 0
-78 24 -34 34 -36 82 -4 111 37 34 53 37 112 23z m505 -3 c0 -8 -9 -40 -20
-72 -11 -31 -18 -60 -16 -64 3 -4 -9 -8 -25 -9 -25 -2 -31 3 -51 45 l-22 47
-21 -46 c-17 -38 -25 -47 -51 -50 -24 -3 -30 0 -32 17 -1 12 -8 40 -17 64 -21
59 -20 61 20 61 27 0 35 -4 35 -17 0 -10 4 -24 9 -32 7 -11 13 -6 25 23 14 35
18 37 53 34 32 -2 39 -7 41 -28 6 -43 19 -43 36 -1 15 40 36 55 36 28z m136
-4 c27 -45 64 -115 64 -122 0 -13 -42 -22 -54 -12 -6 5 -28 11 -49 15 -32 6
-38 4 -45 -13 -8 -24 -26 -16 -36 16 -5 16 -2 25 13 32 11 6 25 28 32 48 17
55 53 71 75 36z m840 -4 c22 -18 16 -32 -11 -25 -59 15 -94 -18 -74 -71 8 -21
15 -24 47 -22 40 3 66 -7 57 -21 -3 -5 -12 -7 -20 -3 -8 3 -15 1 -15 -4 0 -17
-111 4 -126 24 -26 34 -13 100 25 131 18 14 96 9 117 -9z m816 -54 l37 -70
-25 -8 c-16 -6 -30 -5 -40 3 -22 19 -81 22 -88 4 -7 -19 -26 -18 -26 1 0 8 -4
15 -10 15 -20 0 -9 21 15 30 24 9 30 24 27 63 -1 10 2 16 7 13 5 -3 12 1 15
10 4 9 15 14 28 12 17 -2 33 -22 60 -73z m183 61 c47 -20 47 -40 0 -32 -46 9
-75 -7 -75 -42 0 -45 13 -56 59 -49 30 4 41 2 41 -8 0 -32 -95 -35 -134 -4
-30 24 -34 64 -11 109 22 43 60 51 120 26z m398 4 c19 0 24 -26 6 -32 -13 -4
-16 -42 -5 -84 l7 -32 -55 -1 c-57 0 -68 7 -41 29 17 14 21 90 5 90 -5 0 -10
10 -10 21 0 19 4 21 38 15 20 -3 45 -6 55 -6z m117 0 c5 0 17 -13 27 -30 9
-16 21 -30 25 -30 4 0 8 14 8 30 0 28 3 30 36 30 l36 0 -5 -71 c-2 -42 -9 -74
-17 -79 -15 -9 -50 -1 -50 12 0 5 -11 25 -24 45 l-24 35 -9 -42 c-4 -23 -11
-41 -15 -41 -5 1 -19 1 -32 1 -23 0 -23 2 -20 67 3 66 15 88 42 78 8 -3 18 -5
22 -5z m317 -3 c21 -15 4 -27 -38 -27 -50 0 -49 -23 1 -30 50 -8 51 -30 1 -30
-30 0 -41 -4 -41 -15 0 -11 12 -15 45 -15 33 0 45 -4 45 -15 0 -17 -24 -19
-108 -8 l-54 6 6 66 c3 36 5 69 6 72 0 11 124 7 137 -4z m-4374 -7 c9 0 17 -4
17 -10 0 -5 -16 -10 -35 -10 -28 0 -35 -4 -35 -19 0 -15 8 -21 35 -23 20 -2
35 -7 35 -13 0 -5 -15 -11 -35 -13 -30 -3 -35 -7 -35 -28 0 -18 -5 -24 -23
-24 -13 0 -28 -5 -33 -10 -7 -7 -11 9 -13 51 -1 35 -6 70 -11 79 -7 13 -2 16
28 18 20 2 39 5 41 8 3 3 15 3 26 0 11 -3 28 -6 38 -6z m1856 -14 c23 -21 38
-20 51 4 6 11 17 20 25 20 16 0 20 -16 6 -24 -17 -11 -50 -94 -44 -114 4 -18
0 -20 -34 -19 l-38 2 3 40 c3 33 -1 45 -22 64 -36 34 -34 53 5 47 17 -2 39
-12 48 -20z m299 -18 c-3 -24 -1 -55 3 -70 6 -24 4 -29 -14 -32 -41 -9 -155
-14 -163 -7 -5 3 -10 36 -12 73 l-2 67 67 4 c38 2 81 4 97 5 27 2 28 1 24 -40z
m512 22 c0 -11 4 -20 9 -20 4 0 20 9 34 20 25 20 57 27 57 12 0 -5 -14 -18
-30 -31 l-30 -22 26 -44 c24 -41 24 -45 7 -45 -10 0 -27 14 -37 31 -21 35 -40
34 -44 -4 -3 -22 -8 -27 -32 -27 -39 0 -43 11 -35 86 l7 64 34 0 c27 0 34 -4
34 -20z m511 12 c0 -4 1 -36 2 -72 l2 -65 -32 -3 c-28 -3 -32 0 -39 30 l-7 33
-14 -33 c-16 -40 -34 -41 -51 -2 -16 35 -35 31 -26 -6 6 -22 3 -24 -30 -24
l-36 0 -1 55 c-1 30 -2 61 -3 68 -1 7 14 13 34 15 33 3 38 -1 59 -39 l24 -42
18 24 c10 13 19 29 19 35 0 5 4 14 10 20 11 11 70 16 71 6z m509 -28 c0 -31 3
-35 23 -32 17 2 23 11 25 36 3 29 6 32 36 32 l34 0 1 -75 1 -75 -29 0 c-23 0
-30 5 -35 26 -5 19 -12 25 -29 22 -17 -2 -22 -10 -22 -30 1 -24 -2 -27 -25
-22 -45 10 -50 13 -50 33 0 11 -6 21 -12 24 -10 4 -10 7 0 18 6 7 12 25 12 39
0 34 7 40 42 40 25 0 28 -3 28 -36z"/>
<path d="M800 860 c30 -24 44 -25 36 -4 -3 9 -6 18 -6 20 0 2 -12 4 -27 4
l-28 0 25 -20z"/>
<path d="M310 850 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M366 851 c-8 -12 21 -34 33 -27 6 4 8 13 4 21 -6 17 -29 20 -37 6z"/>
<path d="M920 586 c0 -9 7 -16 16 -16 9 0 14 5 12 12 -6 18 -28 21 -28 4z"/>
<path d="M965 419 c-4 -6 -5 -13 -2 -16 7 -7 27 6 27 18 0 12 -17 12 -25 -2z"/>
<path d="M362 388 c3 -7 15 -14 29 -16 24 -4 24 -3 4 12 -24 19 -38 20 -33 4z"/>
<path d="M4106 883 c-14 -14 -5 -31 14 -26 11 3 20 9 20 13 0 10 -26 20 -34
13z"/>
<path d="M4590 870 c-14 -10 -22 -22 -18 -25 7 -8 57 25 58 38 0 12 -14 8 -40
-13z"/>
<path d="M4380 655 c7 -8 17 -15 22 -15 6 0 5 7 -2 15 -7 8 -17 15 -22 15 -6
0 -5 -7 2 -15z"/>
<path d="M4082 560 c-6 -11 -12 -28 -12 -37 0 -13 6 -10 20 12 11 17 20 33 20
38 0 14 -15 7 -28 -13z"/>
<path d="M4496 466 c3 -9 11 -16 16 -16 13 0 5 23 -10 28 -7 2 -10 -2 -6 -12z"/>
<path d="M4236 445 c-9 -24 5 -41 16 -20 7 11 7 20 0 27 -6 6 -12 3 -16 -7z"/>
<path d="M4540 400 c0 -5 5 -10 11 -10 5 0 7 5 4 10 -3 6 -8 10 -11 10 -2 0
-4 -4 -4 -10z"/>
<path d="M5330 891 c0 -11 26 -22 34 -14 3 3 3 10 0 14 -7 12 -34 11 -34 0z"/>
<path d="M4805 880 c-8 -13 4 -32 16 -25 12 8 12 35 0 35 -6 0 -13 -4 -16 -10z"/>
<path d="M5070 821 l-35 -6 0 -75 0 -75 40 -3 c22 -2 58 3 80 10 38 12 40 16
47 63 12 88 -16 107 -132 86z m109 -36 c3 -19 2 -19 -15 -4 -11 9 -26 19 -34
22 -8 4 -2 5 15 4 21 -1 31 -8 34 -22z"/>
<path d="M5411 694 c0 -11 3 -14 6 -6 3 7 2 16 -1 19 -3 4 -6 -2 -5 -13z"/>
<path d="M5223 674 c-10 -22 -10 -25 3 -20 9 3 18 6 20 6 2 0 4 9 4 20 0 28
-13 25 -27 -6z"/>
<path d="M5001 422 c-14 -27 -12 -35 8 -23 7 5 11 17 9 27 -4 17 -5 17 -17 -4z"/>
<path d="M5673 883 c9 -9 19 -14 23 -11 10 10 -6 28 -24 28 -15 0 -15 -1 1
-17z"/>
<path d="M5866 717 c-14 -10 -16 -16 -7 -22 15 -9 35 8 30 24 -3 8 -10 7 -23
-2z"/>
<path d="M5700 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M5700 451 c0 -23 25 -46 34 -32 4 6 -2 19 -14 31 -19 19 -20 19 -20
1z"/>
<path d="M1375 850 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M1391 687 c-5 -12 -7 -35 -6 -50 2 -15 -1 -27 -7 -27 -5 0 -6 9 -3
21 5 15 4 19 -4 15 -6 -4 -11 -18 -11 -30 0 -19 7 -25 33 -29 17 -2 42 1 55 7
l22 12 -27 52 c-29 57 -39 63 -52 29z"/>
<path d="M1240 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M1575 490 c4 -14 9 -27 11 -29 7 -7 34 9 34 20 0 7 -3 9 -7 6 -3 -4
-15 1 -26 10 -19 17 -19 17 -12 -7z"/>
<path d="M3094 688 c-4 -13 -7 -35 -6 -50 1 -16 -2 -28 -8 -28 -5 0 -6 7 -3
17 4 11 3 14 -5 9 -16 -10 -15 -49 1 -43 6 2 20 0 29 -4 10 -6 27 -5 41 2 28
13 26 30 -8 86 -24 39 -31 41 -41 11z"/>
<path d="M3270 502 c0 -19 29 -47 39 -37 6 7 1 16 -15 28 -13 10 -24 14 -24 9z"/>
<path d="M3570 812 c-13 -10 -21 -24 -19 -31 3 -7 15 0 34 19 31 33 21 41 -15
12z"/>
<path d="M3855 480 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M3585 450 c3 -5 13 -10 21 -10 8 0 12 5 9 10 -3 6 -13 10 -21 10 -8
0 -12 -4 -9 -10z"/>
<path d="M1880 820 c0 -5 7 -10 16 -10 8 0 12 5 9 10 -3 6 -10 10 -16 10 -5 0
-9 -4 -9 -10z"/>
<path d="M2042 668 c-7 -7 -12 -23 -12 -37 1 -24 2 -24 16 8 16 37 14 47 -4
29z"/>
<path d="M2015 560 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M1915 470 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M2320 795 c0 -14 5 -25 10 -25 6 0 10 11 10 25 0 14 -4 25 -10 25 -5
0 -10 -11 -10 -25z"/>
<path d="M2660 771 c0 -6 5 -13 10 -16 6 -3 10 1 10 9 0 9 -4 16 -10 16 -5 0
-10 -4 -10 -9z"/>
<path d="M2487 763 c-4 -3 -7 -23 -7 -43 0 -36 1 -38 40 -43 68 -9 116 20 102
61 -3 10 -7 10 -18 1 -11 -9 -14 -7 -14 10 0 18 -6 21 -48 21 -27 0 -52 -3
-55 -7z"/>
<path d="M2320 719 c0 -5 5 -7 10 -4 6 3 10 8 10 11 0 2 -4 4 -10 4 -5 0 -10
-5 -10 -11z"/>
<path d="M2480 550 l0 -40 66 1 c58 1 67 4 76 25 18 39 -4 54 -78 54 l-64 0 0
-40z m40 15 c-7 -8 -16 -15 -21 -15 -5 0 -6 7 -3 15 4 8 13 15 21 15 13 0 13
-3 3 -15z"/>
<path d="M2665 527 c-4 -10 -5 -21 -1 -24 10 -10 18 4 13 24 -4 17 -4 17 -12
0z"/>
<path d="M1586 205 c-9 -23 -8 -25 9 -25 17 0 19 9 6 28 -7 11 -10 10 -15 -3z"/>
<path d="M3727 200 c-3 -13 0 -20 9 -20 15 0 19 26 5 34 -5 3 -11 -3 -14 -14z"/>
<path d="M1194 229 c-3 -6 -2 -15 3 -20 13 -13 43 -1 43 17 0 16 -36 19 -46 3z"/>
<path d="M2470 224 c-18 -46 -12 -73 15 -80 37 -9 52 1 59 40 5 26 3 41 -8 51
-23 24 -55 18 -66 -11z"/>
<path d="M3120 196 c0 -9 7 -16 16 -16 17 0 14 22 -4 28 -7 2 -12 -3 -12 -12z"/>
<path d="M4750 201 c0 -12 5 -21 10 -21 6 0 10 6 10 14 0 8 -4 18 -10 21 -5 3
-10 -3 -10 -14z"/>
<path d="M3515 229 c-8 -12 14 -31 30 -26 6 2 10 10 10 18 0 17 -31 24 -40 8z"/>
<path d="M3521 161 c-7 -5 -9 -11 -4 -14 14 -9 54 4 47 14 -7 11 -25 11 -43 0z"/>
</g>
</svg>

After

Width:  |  Height:  |  Size: 18 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 10 KiB

85
assets/waybackpy_logo.svg Normal file
View File

@ -0,0 +1,85 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
id="svg8"
version="1.1"
viewBox="0 0 176.61171 41.907883"
height="41.907883mm"
width="176.61171mm">
<defs
id="defs2" />
<metadata
id="metadata5">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<g
transform="translate(-0.74835286,-98.31182)"
id="layer1">
<flowRoot
transform="scale(0.26458333)"
style="font-style:normal;font-weight:normal;font-size:40px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none"
id="flowRoot4598"
xml:space="preserve"><flowRegion
id="flowRegion4600"><rect
y="415.4129"
x="-38.183765"
height="48.08326"
width="257.38687"
id="rect4602" /></flowRegion><flowPara
id="flowPara4604"></flowPara></flowRoot> <text
transform="scale(0.86288797,1.158899)"
id="text4777"
y="110.93711"
x="0.93061"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;line-height:4.25;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke:none;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
xml:space="preserve"><tspan
style="stroke-width:7.51955223"
id="tspan4775"
y="110.93711"
x="0.93061"><tspan
id="tspan4773"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:3.56786728px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
y="110.93711"
x="0.93061">waybackpy</tspan></tspan></text>
<rect
y="98.311821"
x="1.4967092"
height="4.8643045"
width="153.78688"
id="rect4644"
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4648"
width="153.78688"
height="4.490128"
x="23.573174"
y="135.72957" />
<rect
y="135.72957"
x="0.74835336"
height="4.4901319"
width="22.82482"
id="rect4650"
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4652"
width="21.702286"
height="4.8643003"
x="155.2836"
y="98.311821" />
</g>
</svg>

After

Width:  |  Height:  |  Size: 3.6 KiB

232
index.rst
View File

@ -1,232 +0,0 @@
waybackpy
=========
|Build Status| |Downloads| |Release| |Codacy Badge| |License: MIT|
|Maintainability| |CodeFactor| |made-with-python| |pypi| |PyPI - Python
Version| |Maintenance|
.. |Build Status| image:: https://travis-ci.org/akamhy/waybackpy.svg?branch=master
:target: https://travis-ci.org/akamhy/waybackpy
.. |Downloads| image:: https://img.shields.io/pypi/dm/waybackpy.svg
:target: https://pypistats.org/packages/waybackpy
.. |Release| image:: https://img.shields.io/github/v/release/akamhy/waybackpy.svg
:target: https://github.com/akamhy/waybackpy/releases
.. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65
:target: https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
:target: https://github.com/akamhy/waybackpy/blob/master/LICENSE
.. |Maintainability| image:: https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability
:target: https://codeclimate.com/github/akamhy/waybackpy/maintainability
.. |CodeFactor| image:: https://www.codefactor.io/repository/github/akamhy/waybackpy/badge
:target: https://www.codefactor.io/repository/github/akamhy/waybackpy
.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
:target: https://www.python.org/
.. |pypi| image:: https://img.shields.io/pypi/v/waybackpy.svg
.. |PyPI - Python Version| image:: https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square
.. |Maintenance| image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
:target: https://github.com/akamhy/waybackpy/graphs/commit-activity
|Internet Archive| |Wayback Machine|
The waybackpy is a python wrapper for `Internet Archive`_\ s `Wayback
Machine`_.
.. _Internet Archive: https://en.wikipedia.org/wiki/Internet_Archive
.. _Wayback Machine: https://en.wikipedia.org/wiki/Wayback_Machine
.. |Internet Archive| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png
.. |Wayback Machine| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png
Installation
------------
Using `pip`_:
**pip install waybackpy**
.. _pip: https://en.wikipedia.org/wiki/Pip_(package_manager)
Usage
-----
Archiving aka Saving an url Using save()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.save(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
import waybackpy
# Capturing a new archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
print(archived_url)
This should print something similar to the following archived URL:
https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy
Receiving the oldest archive for an URL Using oldest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.oldest(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
import waybackpy
# retrieving the oldest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
print(oldest_archive)
This returns the oldest available archive for https://google.com.
http://web.archive.org/web/19981111184551/http://google.com:80/
Receiving the newest archive for an URL using newest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.newest(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
import waybackpy
# retrieving the newest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
print(newest_archive)
This returns the newest available archive for
https://www.microsoft.com/en-us, something just like this:
http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/
Receiving archive close to a specified year, month, day, hour, and minute using near()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)
..
url is mandotory. year,month,day,hour and minute are optional
arguments. UA is not mandotory, but higly recomended.
.. code:: python
import waybackpy
# retriving the the closest archive from a specified year.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are year,month,day,hour and minute
archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
print(archive_near_year)
returns :
http://web.archive.org/web/20100504071154/http://www.facebook.com/
``waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``
returns:
http://web.archive.org/web/20101111173430/http://www.facebook.com//
``waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``
returns:
http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html
> Please note that if you only specify the year, the current month and
day are default arguments for month and day respectively. Do not expect
just putting the year parameter would return the archive closer to
January but the current month you are using the package. If you are
using it in July 2018 and lets say you use
``waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``
then you would be returned the nearest archive to July 2011 and not
January 2011. You need to specify the month “1” for January.
Do not pad (dont use zeros in the month, year, day, minute, and hour
arguments). e.g. For January, set month = 1 and not month = 01.
Get the content of webpage using get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.get(url, encoding="UTF-8", UA=user_agent)
..
url is mandatory. UA is not, but highly recommended. encoding is
detected automatically, dont specify unless necessary.
.. code:: python
from waybackpy import get
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url, encoding and UA
webpage = get("https://example.com/", UA="User-Agent")
print(webpage)
..
This should print the source code for https://example.com/.
Count total archives for an URL using total_archives()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.total_archives(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
from waybackpy import total_archives
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url and UA
count = total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA="User-Agent")
print(count)
..
This should print an integer (int), which is the number of total
archives on archive.org
Tests
-----
- `Here`_
Dependency
----------
- None, just python standard libraries (json, urllib and datetime).
Both python 2 and 3 are supported :)
License
-------
`MIT License`_
.. _Here: https://github.com/akamhy/waybackpy/tree/master/tests
.. _MIT License: https://github.com/akamhy/waybackpy/blob/master/LICENSE

1
requirements.txt Normal file
View File

@ -0,0 +1 @@
requests>=2.24.0

View File

@ -1,3 +1,7 @@
[metadata]
description-file = README.md
license_file = LICENSE
[flake8]
max-line-length = 88
extend-ignore = E203,W503

View File

@ -1,49 +1,54 @@
import os.path
from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
with open(os.path.join(os.path.dirname(__file__), "README.md")) as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), 'waybackpy', '__version__.py')) as f:
with open(os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")) as f:
exec(f.read(), about)
setup(
name = about['__title__'],
packages = ['waybackpy'],
version = about['__version__'],
description = about['__description__'],
name=about["__title__"],
packages=["waybackpy"],
version=about["__version__"],
description=about["__description__"],
long_description=long_description,
long_description_content_type='text/markdown',
license= about['__license__'],
author = about['__author__'],
author_email = about['__author_email__'],
url = about['__url__'],
download_url = 'https://github.com/akamhy/waybackpy/archive/v1.5.tar.gz',
keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
install_requires=[],
python_requires= ">=2.7",
long_description_content_type="text/markdown",
license=about["__license__"],
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/2.3.2.tar.gz",
keywords=[
"Archive It",
"Archive Website",
"Wayback Machine",
"waybackurls",
"Internet Archive",
],
install_requires=["requests"],
python_requires=">=3.4",
classifiers=[
'Development Status :: 5 - Production/Stable',
'Intended Audience :: Developers',
'Natural Language :: English',
'Topic :: Software Development :: Build Tools',
'License :: OSI Approved :: MIT License',
'Programming Language :: Python',
'Programming Language :: Python :: 2',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.2',
'Programming Language :: Python :: 3.3',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
],
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"Natural Language :: English",
"Topic :: Software Development :: Build Tools",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: Implementation :: CPython",
],
entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},
project_urls={
'Documentation': 'https://waybackpy.readthedocs.io',
'Source': 'https://github.com/akamhy/waybackpy',
"Documentation": "https://akamhy.github.io/waybackpy/",
"Source": "https://github.com/akamhy/waybackpy",
"Tracker": "https://github.com/akamhy/waybackpy/issues",
},
)

0
tests/__init__.py Normal file
View File

View File

@ -1,98 +0,0 @@
import sys
sys.path.append("..")
import waybackpy
import pytest
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_clean_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network_security"
test_result = waybackpy.clean_url(test_url)
assert answer == test_result
def test_url_check():
InvalidUrl = "http://wwwgooglecom/"
with pytest.raises(Exception) as e_info:
waybackpy.url_check(InvalidUrl)
def test_save():
# Test for urls that exist and can be archived.
url1="https://github.com/akamhy/waybackpy"
archived_url1 = waybackpy.save(url1, UA=user_agent)
assert url1 in archived_url1
# Test for urls that are incorrect.
with pytest.raises(Exception) as e_info:
url2 = "ha ha ha ha"
waybackpy.save(url2, UA=user_agent)
# Test for urls not allowed to archive by robot.txt.
with pytest.raises(Exception) as e_info:
url3 = "http://www.archive.is/faq.html"
waybackpy.save(url3, UA=user_agent)
# Non existent urls, test
with pytest.raises(Exception) as e_info:
url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
archived_url4 = waybackpy.save(url4, UA=user_agent)
def test_near():
url = "google.com"
archive_near_year = waybackpy.near(url, year=2010, UA=user_agent)
assert "2010" in archive_near_year
archive_near_month_year = waybackpy.near(url, year=2015, month=2, UA=user_agent)
assert ("201502" in archive_near_month_year) or ("201501" in archive_near_month_year) or ("201503" in archive_near_month_year)
archive_near_day_month_year = waybackpy.near(url, year=2006, month=11, day=15, UA=user_agent)
assert ("20061114" in archive_near_day_month_year) or ("20061115" in archive_near_day_month_year) or ("2006116" in archive_near_day_month_year)
archive_near_hour_day_month_year = waybackpy.near("www.python.org", year=2008, month=5, day=9, hour=15, UA=user_agent)
assert ("2008050915" in archive_near_hour_day_month_year) or ("2008050914" in archive_near_hour_day_month_year) or ("2008050913" in archive_near_hour_day_month_year)
with pytest.raises(Exception) as e_info:
NeverArchivedUrl = "https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
waybackpy.near(NeverArchivedUrl, year=2010, UA=user_agent)
def test_oldest():
url = "github.com/akamhy/waybackpy"
archive_oldest = waybackpy.oldest(url, UA=user_agent)
assert "20200504141153" in archive_oldest
def test_newest():
url = "github.com/akamhy/waybackpy"
archive_newest = waybackpy.newest(url, UA=user_agent)
assert url in archive_newest
def test_get():
oldest_google_archive = waybackpy.oldest("google.com", UA=user_agent)
oldest_google_page_text = waybackpy.get(oldest_google_archive, UA=user_agent)
assert "Welcome to Google" in oldest_google_page_text
def test_total_archives():
count1 = waybackpy.total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA=user_agent)
assert count1 > 2000
count2 = waybackpy.total_archives("https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8", UA=user_agent)
assert count2 == 0
if __name__ == "__main__":
test_clean_url()
print(".")
test_url_check()
print(".")
test_get()
print(".")
test_near()
print(".")
test_newest()
print(".")
test_save()
print(".")
test_oldest()
print(".")
test_total_archives()
print(".")

307
tests/test_cli.py Normal file
View File

@ -0,0 +1,307 @@
# -*- coding: utf-8 -*-
import sys
import os
import pytest
import argparse
sys.path.append("..")
import waybackpy.cli as cli # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
# Namespace(day=None, get=None, hour=None, minute=None, month=None, near=False,
# newest=False, oldest=False, save=False, total=False, url=None, user_agent=None, version=False, year=None)
def test_save():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=True,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
def test_json():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=True,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "archived_snapshots" in str(reply)
def test_archive_url():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=True,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "https://web.archive.org/web/" in str(reply)
def test_oldest():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
def test_newest():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
def test_total_archives():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=True,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert isinstance(reply, int)
def test_known_urls():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://akamhy.github.io",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=True,
subdomain=True,
known_urls=True,
get=None,
)
reply = cli.args_handler(args)
assert "github" in str(reply)
def test_near():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
alive=False,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "202007" in str(reply)
def test_get():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="url",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="oldest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="newest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="save",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="BullShit",
)
reply = cli.args_handler(args)
assert "get the source code of the" in str(reply)
def test_args_handler():
args = argparse.Namespace(version=True)
reply = cli.args_handler(args)
assert ("waybackpy version %s" % (__version__)) == reply
args = argparse.Namespace(url=None, version=False)
reply = cli.args_handler(args)
assert ("waybackpy %s" % (__version__)) in str(reply)
def test_main():
# This also tests the parse_args method in cli.py
cli.main(["temp.py", "--version"])

184
tests/test_wrapper.py Normal file
View File

@ -0,0 +1,184 @@
# -*- coding: utf-8 -*-
import sys
import pytest
import random
import requests
sys.path.append("..")
import waybackpy.wrapper as waybackpy # noqa: E402
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_clean_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network_security"
target = waybackpy.Url(test_url, user_agent)
test_result = target._clean_url()
assert answer == test_result
def test_dunders():
url = "https://en.wikipedia.org/wiki/Network_security"
user_agent = "UA"
target = waybackpy.Url(url, user_agent)
assert "waybackpy.Url(url=%s, user_agent=%s)" % (url, user_agent) == repr(target)
assert "en.wikipedia.org" in str(target)
def test_archive_url_parser():
endpoint = "https://amazon.com"
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
headers = {"User-Agent": "%s" % user_agent}
response = waybackpy._get_response(endpoint, params=None, headers=headers)
header = response.headers
with pytest.raises(Exception):
waybackpy._archive_url_parser(header)
def test_url_check():
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
waybackpy.Url(broken_url, user_agent)
def test_save():
# Test for urls that exist and can be archived.
url_list = [
"en.wikipedia.org",
"www.wikidata.org",
"commons.wikimedia.org",
"www.wiktionary.org",
"www.w3schools.com",
"www.ibm.com",
]
x = random.randint(0, len(url_list) - 1)
url1 = url_list[x]
target = waybackpy.Url(
url1,
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36",
)
archived_url1 = str(target.save())
assert url1 in archived_url1
# Test for urls that are incorrect.
with pytest.raises(Exception):
url2 = "ha ha ha ha"
waybackpy.Url(url2, user_agent)
url3 = "http://www.archive.is/faq.html"
with pytest.raises(Exception):
target = waybackpy.Url(
url3,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
"AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
"Safari/533.20.27",
)
target.save()
def test_near():
url = "google.com"
target = waybackpy.Url(
url,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 "
"(KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
)
archive_near_year = target.near(year=2010)
assert "2010" in str(archive_near_year)
archive_near_month_year = str(target.near(year=2015, month=2))
assert (
("201502" in archive_near_month_year)
or ("201501" in archive_near_month_year)
or ("201503" in archive_near_month_year)
)
target = waybackpy.Url(
"www.python.org",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
)
archive_near_hour_day_month_year = str(
target.near(year=2008, month=5, day=9, hour=15)
)
assert (
("2008050915" in archive_near_hour_day_month_year)
or ("2008050914" in archive_near_hour_day_month_year)
or ("2008050913" in archive_near_hour_day_month_year)
)
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = waybackpy.Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
def test_oldest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "20200504141153" in str(target.oldest())
def test_json():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "archived_snapshots" in str(target.JSON)
def test_archive_url():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "github.com/akamhy" in str(target.archive_url)
def test_newest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert url in str(target.newest())
def test_get():
target = waybackpy.Url("google.com", user_agent)
assert "Welcome to Google" in target.get(target.oldest())
def test_wayback_timestamp():
ts = waybackpy._wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4)
assert "202001020304" in str(ts)
def test_get_response():
endpoint = "https://www.google.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = waybackpy._get_response(endpoint, params=None, headers=headers)
assert response.status_code == 200
def test_total_archives():
target = waybackpy.Url(" https://google.com ", user_agent)
assert target.total_archives() > 500000
target = waybackpy.Url(
" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
)
assert target.total_archives() == 0
def test_known_urls():
target = waybackpy.Url("akamhy.github.io", user_agent)
assert len(target.known_urls(alive=True, subdomain=True)) > 2
target = waybackpy.Url("akamhy.github.io", user_agent)
assert len(target.known_urls()) > 3

View File

@ -10,13 +10,15 @@
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
"""
A python wrapper for Internet Archive's Wayback Machine API.
Waybackpy is a Python package that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Archive pages and retrieve archived pages easily.
Usage:
>>> import waybackpy
>>> new_archive = waybackpy.save('https://www.python.org')
>>> target_url = waybackpy.Url('https://www.python.org', 'Your-apps-cool-user-agent')
>>> new_archive = target_url.save()
>>> print(new_archive)
https://web.archive.org/web/20200502170312/https://www.python.org/
@ -25,6 +27,14 @@ Full documentation @ <https://akamhy.github.io/waybackpy/>.
:license: MIT
"""
from .wrapper import save, near, oldest, newest, get, clean_url, url_check, total_archives
from .__version__ import __title__, __description__, __url__, __version__
from .__version__ import __author__, __author_email__, __license__, __copyright__
from .wrapper import Url
from .__version__ import (
__title__,
__description__,
__url__,
__version__,
__author__,
__author_email__,
__license__,
__copyright__,
)

View File

@ -1,8 +1,13 @@
# -*- coding: utf-8 -*-
__title__ = "waybackpy"
__description__ = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily."
__description__ = (
"A Python package that interfaces with the Internet Archive's Wayback Machine API. "
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "v1.5"
__version__ = "2.3.2"
__author__ = "akamhy"
__author_email__ = "akash3pro@gmail.com"
__author_email__ = "akamhy@yahoo.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020 akamhy"
__copyright__ = "Copyright 2020-2021 Akash Mahanty et al."

305
waybackpy/cli.py Normal file
View File

@ -0,0 +1,305 @@
# -*- coding: utf-8 -*-
import sys
import os
import re
import argparse
import string
import random
from waybackpy.wrapper import Url
from waybackpy.__version__ import __version__
from waybackpy.exceptions import WaybackError
def _save(obj):
try:
return obj.save()
except Exception as err:
e = str(err)
url = obj.url
m = re.search(r"Header:\n(.*)", e)
if m:
header = m.group(1)
if "No archive URL found in the API response" in e:
return (
"\n[waybackpy] Can not save/archive your link.\n[waybackpy] This\
could happen because either your waybackpy (%s) is likely out of\
date or Wayback Machine is malfunctioning.\n[waybackpy] Visit\
https://github.com/akamhy/waybackpy for the latest version of \
waybackpy.\n[waybackpy] API response Header :\n%s"
% (__version__, header)
)
return WaybackError(err)
def _archive_url(obj):
return obj.archive_url
def _json(obj):
return obj.JSON
def handle_not_archived_error(e, obj):
m = re.search(r"archive\sfor\s\'(.*?)\'\stry", str(e))
if m:
url = m.group(1)
ua = obj.user_agent
if "github.com/akamhy/waybackpy" in ua:
ua = "YOUR_USER_AGENT_HERE"
return (
"\n[Waybackpy] Can not find archive for '%s'.\n[Waybackpy] You can"
" save the URL using the following command:\n[Waybackpy] waybackpy --"
'user_agent "%s" --url "%s" --save' % (url, ua, url)
)
return WaybackError(e)
def _oldest(obj):
try:
return obj.oldest()
except Exception as e:
return handle_not_archived_error(e, obj)
def _newest(obj):
try:
return obj.newest()
except Exception as e:
return handle_not_archived_error(e, obj)
def _total_archives(obj):
return obj.total_archives()
def _near(obj, args):
_near_args = {}
if args.year:
_near_args["year"] = args.year
if args.month:
_near_args["month"] = args.month
if args.day:
_near_args["day"] = args.day
if args.hour:
_near_args["hour"] = args.hour
if args.minute:
_near_args["minute"] = args.minute
try:
return obj.near(**_near_args)
except Exception as e:
return handle_not_archived_error(e, obj)
def _save_urls_on_file(input_list, live_url_count):
m = re.search("https?://([A-Za-z_0-9.-]+).*", input_list[0])
if m:
domain = m.group(1)
else:
domain = "domain-unknown"
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
file_name = "%s-%d-urls-%s.txt" % (domain, live_url_count, uid)
file_content = "\n".join(input_list)
file_path = os.path.join(os.getcwd(), file_name)
with open(file_path, "w+") as f:
f.write(file_content)
return "%s\n\n'%s' saved in current working directory" % (file_content, file_name)
def _known_urls(obj, args):
"""Abbreviations:
sd = subdomain
al = alive
"""
sd = False
al = False
if args.subdomain:
sd = True
if args.alive:
al = True
url_list = obj.known_urls(alive=al, subdomain=sd)
total_urls = len(url_list)
if total_urls > 0:
text = _save_urls_on_file(url_list, total_urls)
else:
text = "No known URLs found. Please try a diffrent domain!"
return text
def _get(obj, args):
if args.get.lower() == "url":
output = obj.get()
elif args.get.lower() == "archive_url":
output = obj.get(obj.archive_url)
elif args.get.lower() == "oldest":
output = obj.get(obj.oldest())
elif args.get.lower() == "latest" or args.get.lower() == "newest":
output = obj.get(obj.newest())
elif args.get.lower() == "save":
output = obj.get(obj.save())
else:
output = "Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
\n3) oldest - get the source code of the oldest archive for the supplied url.\
\n4) newest - get the source code of the newest archive for the supplied url.\
\n5) save - Create a new archive and get the source code of this new archive for the supplied url."
return output
def args_handler(args):
if args.version:
return "waybackpy version %s" % __version__
if not args.url:
return (
"waybackpy %s \nSee 'waybackpy --help' for help using this tool."
% __version__
)
obj = Url(args.url)
if args.user_agent:
obj = Url(args.url, args.user_agent)
if args.save:
output = _save(obj)
elif args.archive_url:
output = _archive_url(obj)
elif args.json:
output = _json(obj)
elif args.oldest:
output = _oldest(obj)
elif args.newest:
output = _newest(obj)
elif args.known_urls:
output = _known_urls(obj, args)
elif args.total:
output = _total_archives(obj)
elif args.near:
output = _near(obj, args)
elif args.get:
output = _get(obj, args)
else:
output = (
"You only specified the URL. But you also need to specify the operation."
"\nSee 'waybackpy --help' for help using this tool."
)
return output
def parse_args(argv):
parser = argparse.ArgumentParser()
requiredArgs = parser.add_argument_group("URL argument (required)")
requiredArgs.add_argument(
"--url", "-u", help="URL on which Wayback machine operations would occur"
)
userAgentArg = parser.add_argument_group("User Agent")
help_text = 'User agent, default user_agent is "waybackpy python package - https://github.com/akamhy/waybackpy"'
userAgentArg.add_argument("--user_agent", "-ua", help=help_text)
saveArg = parser.add_argument_group("Create new archive/save URL")
saveArg.add_argument(
"--save", "-s", action="store_true", help="Save the URL on the Wayback machine"
)
auArg = parser.add_argument_group("Get the latest Archive")
auArg.add_argument(
"--archive_url",
"-au",
action="store_true",
help="Get the latest archive URL, alias for --newest",
)
jsonArg = parser.add_argument_group("Get the JSON data")
jsonArg.add_argument(
"--json",
"-j",
action="store_true",
help="JSON data of the availability API request",
)
oldestArg = parser.add_argument_group("Oldest archive")
oldestArg.add_argument(
"--oldest",
"-o",
action="store_true",
help="Oldest archive for the specified URL",
)
newestArg = parser.add_argument_group("Newest archive")
newestArg.add_argument(
"--newest",
"-n",
action="store_true",
help="Newest archive for the specified URL",
)
totalArg = parser.add_argument_group("Total number of archives")
totalArg.add_argument(
"--total",
"-t",
action="store_true",
help="Total number of archives for the specified URL",
)
getArg = parser.add_argument_group("Get source code")
getArg.add_argument(
"--get",
"-g",
help="Prints the source code of the supplied url. Use '--get help' for extended usage",
)
knownUrlArg = parser.add_argument_group(
"URLs known and archived to Waybcak Machine for the site."
)
knownUrlArg.add_argument(
"--known_urls", "-ku", action="store_true", help="URLs known for the domain."
)
help_text = "Use with '--known_urls' to include known URLs for subdomains."
knownUrlArg.add_argument("--subdomain", "-sub", action="store_true", help=help_text)
help_text = "Only include live URLs. Will not inlclude dead links."
knownUrlArg.add_argument("--alive", "-a", action="store_true", help=help_text)
nearArg = parser.add_argument_group("Archive close to time specified")
nearArg.add_argument(
"--near", "-N", action="store_true", help="Archive near specified time"
)
nearArgs = parser.add_argument_group("Arguments that are used only with --near")
nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
nearArgs.add_argument("--hour", "-H", type=int, help="Hour in intege")
nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")
parser.add_argument(
"--version", "-v", action="store_true", help="Waybackpy version"
)
return parser.parse_args(argv[1:])
def main(argv=None):
if argv is None:
argv = sys.argv
args = parse_args(argv)
output = args_handler(args)
print(output)
if __name__ == "__main__":
sys.exit(main(sys.argv))

View File

@ -1,43 +1,13 @@
# -*- coding: utf-8 -*-
class TooManyArchivingRequests(Exception):
"""Error when a single url reqeusted for archiving too many times in a short timespam.
Wayback machine doesn't supports archivng any url too many times in a short period of time.
class WaybackError(Exception):
"""
Raised when Wayback Machine API Service is unreachable/down.
"""
class ArchivingNotAllowed(Exception):
"""Files like robots.txt are set to deny robot archiving.
Wayback machine respects these file, will not archive.
class URLError(Exception):
"""
class PageNotSaved(Exception):
"""
When unable to save a webpage.
"""
class ArchiveNotFound(Exception):
"""
When a page was never archived but client asks for old archive.
"""
class UrlNotFound(Exception):
"""
Raised when 404 UrlNotFound.
"""
class BadGateWay(Exception):
"""
Raised when 502 bad gateway.
"""
class WaybackUnavailable(Exception):
"""
Raised when 503 API Service Temporarily Unavailable.
"""
class InvalidUrl(Exception):
"""
Raised when url doesn't follow the standard url format.
Raised when malformed URLs are passed as arguments.
"""

View File

@ -1,143 +1,304 @@
# -*- coding: utf-8 -*-
import json
from datetime import datetime
from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl, WaybackUnavailable
try:
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
except ImportError:
from urllib2 import Request, urlopen, HTTPError, URLError
import re
from datetime import datetime, timedelta
from waybackpy.exceptions import WaybackError, URLError
from waybackpy.__version__ import __version__
import requests
import concurrent.futures
default_UA = "waybackpy python package"
default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"
def url_check(url):
if "." not in url:
raise InvalidUrl("'%s' is not a vaild url." % url)
def clean_url(url):
return str(url).strip().replace(" ","_")
def _archive_url_parser(header):
"""Parse out the archive from header."""
# Regex1
arch = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
if arch:
return "web.archive.org" + arch.group(1)
# Regex2
arch = re.search(
r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
)
if arch:
return arch.group(1)
# Regex3
arch = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
if arch:
return arch.group(1)
raise WaybackError(
"No archive URL found in the API response. "
"This version of waybackpy (%s) is likely out of date or WayBack Machine is malfunctioning. Visit "
"https://github.com/akamhy/waybackpy for the latest version "
"of waybackpy.\nHeader:\n%s" % (__version__, str(header))
)
def wayback_timestamp(**kwargs):
return (
str(kwargs["year"])
+
str(kwargs["month"]).zfill(2)
+
str(kwargs["day"]).zfill(2)
+
str(kwargs["hour"]).zfill(2)
+
str(kwargs["minute"]).zfill(2)
)
def handle_HTTPError(e):
if e.code == 502:
raise BadGateWay(e)
elif e.code == 503:
raise WaybackUnavailable(e)
elif e.code == 429:
raise TooManyArchivingRequests(e)
elif e.code == 404:
raise UrlNotFound(e)
def _wayback_timestamp(**kwargs):
"""Return a formatted timestamp."""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
def save(url, UA=default_UA):
url_check(url)
request_url = ("https://web.archive.org/save/" + clean_url(url))
hdr = { 'User-Agent' : '%s' % UA } #nosec
req = Request(request_url, headers=hdr) #nosec
def _get_response(endpoint, params=None, headers=None):
"""Get response for the supplied request."""
try:
response = urlopen(req) #nosec
except HTTPError as e:
if handle_HTTPError(e) is None:
raise PageNotSaved(e)
except URLError:
response = requests.get(endpoint, params=params, headers=headers)
except Exception:
try:
response = urlopen(req) #nosec
except URLError as e:
raise UrlNotFound(e)
response = requests.get(endpoint, params=params, headers=headers) # nosec
except Exception as e:
exc = WaybackError("Error while retrieving %s" % endpoint)
exc.__cause__ = e
raise exc
return response
header = response.headers
if "exclusion.robots.policy" in str(header):
raise ArchivingNotAllowed("Can not archive %s. Disabled by site owner." % (url))
class Url:
"""waybackpy Url object"""
return "https://web.archive.org" + header['Content-Location']
def __init__(self, url, user_agent=default_UA):
self.url = url
self.user_agent = user_agent
self._url_check() # checks url validity on init.
self._archive_url = None # URL of archive
self.timestamp = None # timestamp for last archive
self._alive_url_list = []
def get(url, encoding=None, UA=default_UA):
url_check(url)
hdr = { 'User-Agent' : '%s' % UA }
req = Request(clean_url(url), headers=hdr) #nosec
def __repr__(self):
return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)
def __str__(self):
if not self._archive_url:
self._archive_url = self.archive_url
return "%s" % self._archive_url
def __len__(self):
td_max = timedelta(
days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)
if not self.timestamp:
self.timestamp = self._timestamp
if self.timestamp == datetime.max:
return td_max.days
diff = datetime.utcnow() - self.timestamp
return diff.days
def _url_check(self):
"""Check for common URL problems."""
if "." not in self.url:
raise URLError("'%s' is not a vaild URL." % self.url)
@property
def JSON(self):
"""
Returns JSON data from 'https://archive.org/wayback/available?url=YOUR-URL'.
"""
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": "%s" % self.user_agent}
payload = {"url": "%s" % self._clean_url()}
response = _get_response(endpoint, params=payload, headers=headers)
return response.json()
@property
def archive_url(self):
"""
Returns any random archive for the instance.
But if near, oldest, newest were used before
then it returns the same archive again.
We cache archive in self._archive_url
"""
if self._archive_url:
return self._archive_url
data = self.JSON
if not data["archived_snapshots"]:
archive_url = None
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self._archive_url = archive_url
return archive_url
@property
def _timestamp(self):
"""
Get timestamp of last fetched archive.
If used before fetching any archive, This
randomly picks archive.
"""
if self.timestamp:
return self.timestamp
data = self.JSON
if not data["archived_snapshots"]:
ts = datetime.max
else:
ts = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
self.timestamp = ts
return ts
def _clean_url(self):
"""
Remove newlines
replace " " with "_"
"""
return str(self.url).strip().replace(" ", "_")
def save(self):
"""Create a new Wayback Machine archive for this URL."""
request_url = "https://web.archive.org/save/" + self._clean_url()
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
self._archive_url = "https://" + _archive_url_parser(response.headers)
self.timestamp = datetime.utcnow()
return self
def get(self, url="", user_agent="", encoding=""):
"""Return the source code of the supplied URL.
If encoding is not supplied, it is auto-detected from the response.
"""
if not url:
url = self._clean_url()
if not user_agent:
user_agent = self.user_agent
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(url, params=None, headers=headers)
if not encoding:
try:
encoding = response.encoding
except AttributeError:
encoding = "UTF-8"
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
def near(self, year=None, month=None, day=None, hour=None, minute=None):
"""Return the closest Wayback Machine archive to the time supplied.
Supported params are year, month, day, hour and minute.
Any non-supplied parameters default to the current time.
"""
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
)
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": "%s" % self.user_agent}
payload = {"url": "%s" % self._clean_url(), "timestamp": timestamp}
response = _get_response(endpoint, params=payload, headers=headers)
data = response.json()
if not data["archived_snapshots"]:
raise WaybackError(
"Can not find archive for '%s' try later or use wayback.Url(url, user_agent).save() "
"to create a new archive." % self._clean_url()
)
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self._archive_url = archive_url
self.timestamp = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
return self
def oldest(self, year=1994):
"""Return the oldest Wayback Machine archive for this URL."""
return self.near(year=year)
def newest(self):
"""Return the newest Wayback Machine archive available for this URL.
Due to Wayback Machine database lag, this may not always be the
most recent archive.
"""
return self.near()
def total_archives(self):
"""Returns the total number of Wayback Machine archives for this URL."""
endpoint = "https://web.archive.org/cdx/search/cdx"
headers = {
"User-Agent": "%s" % self.user_agent,
"output": "json",
"fl": "statuscode",
}
payload = {"url": "%s" % self._clean_url()}
response = _get_response(endpoint, params=payload, headers=headers)
# Most efficient method to count number of archives (yet)
return response.text.count(",")
def live_urls_picker(self, url):
try:
resp=urlopen(req) #nosec
except URLError:
try:
resp=urlopen(req) #nosec
except URLError as e:
raise UrlNotFound(e)
response_code = requests.get(url).status_code
except Exception:
return # we don't care if urls are not opening
if encoding is None:
try:
encoding= resp.headers['content-type'].split('charset=')[-1]
except AttributeError:
encoding = "UTF-8"
# 200s are OK and 300s are usually redirects, if you don't want redirects replace 400 with 300
if response_code >= 400:
return
return resp.read().decode(encoding.replace("text/html", "UTF-8", 1))
self._alive_url_list.append(url)
def near(url, **kwargs):
def known_urls(self, alive=False, subdomain=False):
"""Returns list of URLs known to exist for given domain name
because these URLs were crawled by WayBack Machine bots.
Useful for pen-testers and others.
Idea by Mohammed Diaa (https://github.com/mhmdiaa) from:
https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
"""
try:
url = kwargs["url"]
except KeyError:
url = url
url_list = []
year=kwargs.get("year", datetime.utcnow().strftime('%Y'))
month=kwargs.get("month", datetime.utcnow().strftime('%m'))
day=kwargs.get("day", datetime.utcnow().strftime('%d'))
hour=kwargs.get("hour", datetime.utcnow().strftime('%H'))
minute=kwargs.get("minute", datetime.utcnow().strftime('%M'))
UA=kwargs.get("UA", default_UA)
if subdomain:
request_url = (
"https://web.archive.org/cdx/search/cdx?url=*.%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
else:
request_url = (
"http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
url_check(url)
timestamp = wayback_timestamp(year=year,month=month,day=day,hour=hour,minute=minute)
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (clean_url(url), str(timestamp))
hdr = { 'User-Agent' : '%s' % UA }
req = Request(request_url, headers=hdr) # nosec
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
data = response.json()
url_list = [y[0] for y in data if y[0] != "original"]
try:
response = urlopen(req) #nosec
except HTTPError as e:
handle_HTTPError(e)
# Remove all deadURLs from url_list if alive=True
if alive:
with concurrent.futures.ThreadPoolExecutor() as executor:
executor.map(self.live_urls_picker, url_list)
url_list = self._alive_url_list
data = json.loads(response.read().decode("UTF-8"))
if not data["archived_snapshots"]:
raise ArchiveNotFound("'%s' is not yet archived." % url)
archive_url = (data["archived_snapshots"]["closest"]["url"])
# wayback machine returns http sometimes, idk why? But they support https
archive_url = archive_url.replace("http://web.archive.org/web/","https://web.archive.org/web/",1)
return archive_url
def oldest(url, UA=default_UA, year=1994):
return near(url, year=year, UA=UA)
def newest(url, UA=default_UA):
return near(url, UA=UA)
def total_archives(url, UA=default_UA):
url_check(url)
hdr = { 'User-Agent' : '%s' % UA }
request_url = "https://web.archive.org/cdx/search/cdx?url=%s&output=json" % clean_url(url)
req = Request(request_url, headers=hdr) # nosec
try:
response = urlopen(req) #nosec
except HTTPError as e:
handle_HTTPError(e)
return (len(json.loads(response.read())))
return url_list