Compare commits


275 Commits
v1.4 ... 2.4.2

Author SHA1 Message Date
88cda94c0b v2.4.2 (#89)
* v2.4.2

* v2.4.2
2021-01-24 17:03:35 +05:30
09290f88d1 fix one more error 2021-01-24 16:58:53 +05:30
e5835091c9 import re 2021-01-24 16:56:59 +05:30
7312ed1f4f set cached_save to True if the archive is older than 3 minutes. 2021-01-24 16:53:36 +05:30
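The age check this commit describes can be sketched roughly as follows (hypothetical helper name, not the actual waybackpy code): parse the 14-digit Wayback timestamp of the returned archive and compare it with the current UTC time.

```python
from datetime import datetime, timedelta

def archive_is_cached(wayback_timestamp, max_age_minutes=3):
    # wayback_timestamp is the 14-digit string embedded in archive URLs,
    # e.g. "20210124165336" -> 2021-01-24 16:53:36 UTC.
    archived_at = datetime.strptime(wayback_timestamp, "%Y%m%d%H%M%S")
    age = datetime.utcnow() - archived_at
    # If the snapshot is older than the cutoff, the Wayback Machine most
    # likely returned an existing capture instead of making a fresh one.
    return age > timedelta(minutes=max_age_minutes)

print(archive_is_cached("20210124165336"))
```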
6ae8f843d3 add --file to --known_urls 2021-01-24 16:15:11 +05:30
36b936820b known urls now yields results and is more reliable. The file is saved in chunks with respect to the response. The --file arg can be used to create an output file; if --file is not used, no output is saved to any file. (#88) 2021-01-24 16:11:39 +05:30
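The chunked streaming mentioned here can be sketched like this (illustrative helpers, not waybackpy's actual functions): stream the response and write it to the output file chunk by chunk, and yield URLs one at a time instead of building a full list in memory.

```python
import requests

def save_known_urls(endpoint, output_path, chunk_size=8192):
    # Stream the response so a large URL list is never held fully in memory.
    with requests.get(endpoint, stream=True) as response, open(output_path, "wb") as handle:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:
                handle.write(chunk)

def iter_known_urls(output_path):
    # Yield one URL at a time; callers can stop early without paying for the rest.
    with open(output_path) as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line
```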
a3bc6aad2b too much API usage by duplicate tests was causing too many test failures 2021-01-23 21:08:21 +05:30
edc2f63d93 Output valid JSON by dumping the Python dict, making the JSON valid. 2021-01-23 20:43:52 +05:30
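The difference between printing a dict and emitting valid JSON is just json.dumps (a minimal sketch of the fix described here, with placeholder data):

```python
import json

payload = {"url": "https://example.com", "archived_snapshots": {}}

# str()/repr() of a dict uses single quotes and Python literals, which is
# not valid JSON; json.dumps produces a spec-compliant document.
print(json.dumps(payload, indent=4))
```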
ffe0810b12 flag to check whether the saved archive is older than 30 minutes 2021-01-16 12:06:08 +05:30
40233eb115 improve code quality, remove unused imports, use system randomness etc 2021-01-16 11:35:13 +05:30
d549d31421 improve save method: we now know that a 302 error indicates the Wayback Machine is archiving the URL but hasn't archived it yet. We construct an artificial archive URL with the current UTC time and check for an HTTP status code of 20* or 30*. If we can verify the archival, we return the artificial archive. The artificial archive will automatically point to the new archive, or in the best case will be the new archive after some time. 2021-01-16 10:47:43 +05:30
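The logic described above can be sketched as follows (a hypothetical helper; the real code lives in waybackpy's save method): on a 302, build a candidate archive URL from the current UTC timestamp and accept it if the Wayback Machine answers with a 2xx or 3xx status.

```python
from datetime import datetime

import requests

def guess_archive_after_302(url, user_agent="waybackpy sketch"):
    # The Wayback Machine replies 302 while it is still archiving the URL,
    # so construct an "artificial" archive URL from the current UTC time.
    timestamp = datetime.utcnow().strftime("%Y%m%d%H%M%S")
    candidate = "https://web.archive.org/web/{ts}/{url}".format(ts=timestamp, url=url)
    response = requests.get(
        candidate, headers={"User-Agent": user_agent}, allow_redirects=False
    )
    # A 20* or 30* status suggests the capture exists or will redirect to it.
    if str(response.status_code).startswith(("2", "3")):
        return candidate
    return None
```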
0725163af8 mimify the logo, remove ugly old logos 2021-01-15 18:14:48 +05:30
712471176b better error messages (str), check the latest version before asking for an upgrade, and remove alive checking 2021-01-15 16:47:26 +05:30
dcd7b03302 getting rid of C-style string formatting, now using .format 2021-01-14 19:30:07 +05:30
76205d9cf6 backoff_factor=2 for save, increases success by 25% 2021-01-13 10:13:16 +05:30
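backoff_factor is an urllib3 Retry setting; with requests it is usually wired up through an HTTPAdapter, roughly like this (a generic sketch, not waybackpy's exact code):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# backoff_factor=2 sleeps roughly backoff_factor * 2**(n-1) seconds before
# retry n, giving the Wayback Machine time to recover from transient 5xx errors.
retries = Retry(total=5, backoff_factor=2, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

response = session.get("https://web.archive.org/save/https://example.com")
print(response.status_code)
```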
ec0a0d04cc + dequeued0
dequeued0 (https://github.com/dequeued0) for reporting bugs and useful feature requests.
2021-01-12 10:52:41 +05:30
7bb01df846 v2.4.1 2021-01-12 10:18:09 +05:30
6142e0b353 get should retrieve the last fetched archive by default 2021-01-12 10:07:14 +05:30
a65990aee3 don't use pagination API if total pages <= 2 2021-01-12 09:46:07 +05:30
259a024eb1 joke? they changed their robots.txt 2021-01-11 23:17:01 +05:30
91402792e6 + Supported Features
tell what the package can do; many users probably do not read the full usage.
2021-01-11 23:01:18 +05:30
eabf4dc046 don't fetch more pages if >=2 pages are empty 2021-01-11 22:43:14 +05:30
5a7bd73565 support unix ts as an arg in near 2021-01-11 19:53:37 +05:30
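A Unix timestamp converts directly into the Wayback Machine's 14-digit format, so accepting one in near() only needs something like this hypothetical helper:

```python
from datetime import datetime, timezone

def unix_to_wayback_timestamp(unix_timestamp):
    # near() style lookups use YYYYMMDDhhmmss in UTC.
    return datetime.fromtimestamp(unix_timestamp, tz=timezone.utc).strftime("%Y%m%d%H%M%S")

print(unix_to_wayback_timestamp(1610371200))  # '20210111132000'
```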
4693dbf9c1 change str repr of cdxsnapshot to cdx line 2021-01-11 09:34:37 +05:30
f4f2e51315 V2.4.0 (#62)
* v 2.4.0

* v 2.4.0
2021-01-10 11:53:45 +05:30
d6b7df6837 no need to de-duplicate as we are collapsing the results by urlkey
Same URLs aren't received.
2021-01-10 11:36:46 +05:30
dafba5d0cb collapses=["urlkey"] for known urls 2021-01-10 11:34:06 +05:30
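On the CDX server API, collapsing on urlkey keeps one capture per unique URL, which is why a separate de-duplication pass becomes unnecessary (a generic CDX request sketch, not waybackpy internals):

```python
import requests

params = {
    "url": "akamhy.github.io",
    "matchType": "domain",
    "output": "json",
    "fl": "original",
    "collapse": "urlkey",  # one row per unique URL key, so no duplicate URLs
}
response = requests.get("https://web.archive.org/cdx/search/cdx", params=params)
rows = response.json()

# The first row is the header (["original"]); the rest are unique captured URLs.
for row in rows[1:]:
    print(row[0])
```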
6c71dfbe41 use cdx matchtype for domain and host 2021-01-10 11:10:49 +05:30
a6470b1036 not passing dict to cdxsnapshot 2021-01-10 10:40:32 +05:30
04cda4558e fix test 2021-01-10 03:18:09 +05:30
625ed63482 remove assert statements 2021-01-10 03:05:48 +05:30
a03813315f full cdx api support 2021-01-10 02:23:53 +05:30
a2550f17d7 retries support for get requests 2021-01-06 01:58:38 +05:30
15ef5816db Always cast url to string, avoid passing waybackpy objects to _get_response 2021-01-05 19:46:17 +05:30
93b52bd0fe FIX : don't use self.user_agent if user_agent passed in get() 2021-01-05 19:31:27 +05:30
28ff877081 Update README.md 2021-01-05 19:08:35 +05:30
3e3ecff9df l2 heading and lint 2021-01-05 01:59:29 +05:30
ce64135ba8 ce 2021-01-05 01:52:35 +05:30
2af6580ffb docs link 2021-01-05 01:51:53 +05:30
8a3c515176 v2.3.3 2021-01-05 01:49:26 +05:30
d98c4f32ad v2.3.3 2021-01-05 01:48:54 +05:30
e0a4b007d5 improve docs 2021-01-05 01:46:12 +05:30
6fb6b2deee Update readme + new file CONTRIBUTORS.md (#59)
* remove some badges

* remove made with python button, obvious

* - maintained badge, we already have latest commit badge

- [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)

* re arranged order of badges

* a bit more re odering

* - release badge

* - license section

* center h1

* try once more'

* removed the TOC

* move the hr

* Update README.md

* + hr

* h1 --> h2

* remove tests and packaging info from here to docs/wiki

* Update README.md

* example inspired by psf/requests

* CLI tool example gist

* Update README.md

* Update README.md

* + license

* Update README.md

* authors list

* Update CONTRIBUTORS.md

* fix code

* Update README.md

* Update README.md

* center the button
2021-01-05 00:30:07 +05:30
1882862992 now using cdx Pagination API 2021-01-04 20:46:54 +05:30
0c6107e675 increase coverage 2021-01-04 01:54:40 +05:30
bd079978bf inc coverage 2021-01-04 00:44:55 +05:30
5dec4927cd refactoring, try to reduce code complexity 2021-01-04 00:14:38 +05:30
62e5217b9e reduce code complexity: refactoring, less flow breaking structures 2021-01-03 19:38:25 +05:30
9823c809e9 Added doc strings in wrapper.py, documenting code and improving docs. 2021-01-03 17:11:32 +05:30
db5737a857 JSON is now available for near and other methods that call it 2021-01-02 18:52:46 +05:30
ca0821a466 Wiki docs (#58)
* move docs to wiki

* Update README.md

* Update setup.py
2021-01-02 12:20:43 +05:30
bb4dbc7d3c rm url = obj.url 2021-01-02 11:19:09 +05:30
7c7fd75376 No need to fetch archive_url and timestamp from availability API on init (#55)
* No need to fetch archive_url and timestamp from availability API on init. 

Not useful if all I want is to archive a page

* Update test_wrapper.py

* Update wrapper.py

* Update test_wrapper.py

* Update wrapper.py

* Update cli.py

* Update wrapper.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update setup.py

* Update README.md
2021-01-02 11:10:23 +05:30
0b71433667 v2.3.1 (#54)
* 2.3.1

* 2.3.1
2021-01-01 19:15:23 +05:30
1b499a7594 removed JSON from init; this was resulting in too much unnecessary traffic. Some users with thousands of URLs were blocked by IA (#53)
closes #52
2021-01-01 16:38:57 +05:30
da390ee8a3 improve maintainability and reduce code cognitive complexity (#49) 2020-12-15 10:24:13 +05:30
d3e68d0e70 code formatted with black (#47) 2020-12-14 01:18:04 +05:30
fde28d57aa Update CONTRIBUTING.md 2020-12-14 00:16:29 +05:30
6092e504c8 Update CONTRIBUTING.md 2020-12-14 00:15:51 +05:30
93ef60ecd2 v2.3.0 (#46)
* v2.3.0

* v2.3.0

* decrease line length
2020-12-14 00:14:54 +05:30
461b3f74c9 UPDATE header image url 2020-12-13 23:09:59 +05:30
3c53b411b0 Improve the appearance of readme (#45)
* replaced text header with image

* svg

* Update README.md

* Update README.md

* Update README.md

* level 2

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Create CONTRIBUTING.md

* Update README.md

* Add files via upload

* Update README.md

* Delete waybackpy-colored 284.png

* Delete waybackpy colored.png

* Update README.md

* Update index.rst

* Update index.rst

* Update index.rst

* Update setup.py

* Delete index.rst

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md
2020-12-13 23:08:16 +05:30
8125526061 create pyup.io config file (#44) 2020-12-13 22:31:49 +05:30
2dc81569a8 Create .pep8speaks.yml 2020-12-13 17:58:09 +05:30
fd163f3d36 Update wrapper.py 2020-12-13 17:12:32 +05:30
a0a918cf0d . 2020-12-13 17:10:28 +05:30
4943cf6873 remove print statement, update ci 2020-12-13 16:37:35 +05:30
bc3efc7d63 now using requests lib as it handles errors nicely (#42)
* now using requests lib as it handles errors nicely

* remove unused import (urllib)

* FIX : replaced full_url with endpoint (not using urlib)

* LINT :  Found in waybackpy\wrapper.py:88  Unnecessary else after return
2020-12-13 15:44:37 +05:30
f89368f16d LINT : Found in waybackpy\wrapper.py:88 Unnecessary else after return 2020-12-13 15:39:23 +05:30
c919a6a605 FIX : replaced full_url with endpoint (not using urlib) 2020-12-13 15:22:56 +05:30
0280fca189 remove unused import (urllib) 2020-12-13 15:13:51 +05:30
60ee8b95a8 now using requests lib as it handles errors nicely 2020-12-13 15:05:57 +05:30
ca51c14332 deleted .travis.yml, link with flake (#41)
close #38
2020-11-26 13:06:50 +05:30
525cf17c6f Update ci.yml 2020-11-26 12:14:15 +05:30
406e03c52f Update ci.yml 2020-11-26 12:04:45 +05:30
672b33e83a Update ci.yml 2020-11-26 10:10:10 +05:30
b19b840628 Update ci.yml 2020-11-26 10:01:55 +05:30
a6df4f899c Update ci.yml 2020-11-26 09:26:11 +05:30
7686e9c20d Update README.md (#40) 2020-11-26 09:18:26 +05:30
3c5932bc39 now using gh actions (#39) 2020-11-26 09:09:53 +05:30
f9a986f489 Create ci.yml 2020-11-26 08:55:23 +05:30
0d7458ee90 per https://docs.travis-ci.com/user/languages/python/, Python builds are not available on macOS 2020-11-26 08:08:59 +05:30
ac8b9d6a50 use osx, huge backlog on .org travis for linux builds 2020-11-26 08:03:27 +05:30
58cd9c28e7 Threading enabled checking for URLs 2020-11-26 06:15:42 +05:30
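Checking many URLs is I/O-bound, so a thread pool is the natural fit (an illustrative sketch, not the code added in this commit):

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def is_reachable(url, timeout=10):
    try:
        return requests.head(url, timeout=timeout, allow_redirects=True).ok
    except requests.RequestException:
        return False

urls = ["https://example.com", "https://archive.org", "https://example.invalid"]

# Each worker spends most of its time waiting on the network, so threads help.
with ThreadPoolExecutor(max_workers=10) as pool:
    for url, reachable in zip(urls, pool.map(is_reachable, urls)):
        print(url, "OK" if reachable else "unreachable")
```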
5088305a58 removed python2 compatibility code 2020-11-21 17:00:11 +05:30
9f847a5e55 change pepy.tech download count link, they removed the month page 2020-11-11 10:44:14 +05:30
6c04c2f3d3 + https://github.com/akamhy/waybackpy/graphs/contributors 2020-11-04 08:09:30 +05:30
925be7b17e V2.2.0 2020-10-17 17:10:46 +05:30
2b132456ac updated index.rst and minor docs updated. 2020-10-17 16:56:51 +05:30
50e3154a4e lint README.md 2020-10-17 12:01:49 +05:30
7aef50428f add link to the repo 2020-10-17 11:51:56 +05:30
d8ec0f5025 More pythonic code snippets in README (#36) 2020-10-17 11:49:27 +05:30
0a2f97c034 Update README, drop python 2 support
* Drop python 2 support

* updated docs

* added new docs
2020-10-16 22:37:32 +05:30
3e9cf23578 3.9 archive doesn't exist yet. 2020-10-16 19:43:06 +05:30
7f927ec7be added tests for json and archive_url, updated broken tests (#34)
* added tests for json and archive_url, updated broken tests

* drop 2.7 support
2020-10-16 19:25:45 +05:30
9de6393cd5 Add support for JSON and archive_url (#33)
CLI support for JSON and archive_url attributes
2020-10-16 15:16:18 +05:30
91e7f65617 Fixing len() bug (#32)
* added class functionality

* Update wrapper.py

* style edits

* fixed bug with len() of url()

* fixing len() bug

* fixing len() bug

* squashing bug

* removed test notebook
2020-10-16 10:04:13 +05:30
d465454019 Adding attributes to Url class (#28)
* added class functionality

* Update wrapper.py

* style edits
2020-10-15 22:10:32 +05:30
1a81eb97fb lint 2020-10-03 16:58:11 +05:30
6b3b2e2a7d tests for newly added known_urls feature 2020-10-03 09:33:50 +05:30
82c65454e6 2.1.9 2020-10-03 01:34:15 +05:30
19710461b6 Update setup.py 2020-10-03 01:33:46 +05:30
a3661d6b85 Update index.rst 2020-10-03 01:33:15 +05:30
58375e4ef4 fix broken links 2020-10-03 01:31:28 +05:30
ea023e98da update 2020-10-03 01:22:51 +05:30
f1065ed1c8 v2.1.8 2020-10-03 01:18:30 +05:30
315519b21f 2.1.8 2020-10-03 01:18:08 +05:30
07c98661de add usage for known urls (#26)
* Update README.md

* Update README.md

* Update README.md

* bash example for known urls

* python examples / usage for known urls :)

* Update README.md

* Update README.md

* Update README.md

* Update README.md
2020-10-03 01:16:19 +05:30
2cd991a54e lint markdown 2020-10-02 23:34:06 +05:30
ede251afb3 update tests 2020-10-02 23:10:48 +05:30
a8ce970ca0 fixed yet another issue with tests :( 2020-10-02 23:01:59 +05:30
243af26bf6 update version format in tests 2020-10-02 22:23:58 +05:30
0f1db94884 license & packaging info 2020-10-02 22:10:30 +05:30
c304f58ea2 update tests 2020-10-02 21:35:39 +05:30
23f7222cb5 tweak 2020-10-02 21:01:32 +05:30
ce7294d990 Implemented new feature, known urls for domain. 2020-10-02 20:27:28 +05:30
c9fa114d2e grammar 2020-10-01 23:50:03 +05:30
8b6bacb28e Add files via upload 2020-09-08 09:23:59 +05:30
32d8ad7780 Update README.md (#24)
- IA and Wayback machine logo, added new waybackpy logo.
+ changed pages to webpages in lead
2020-09-08 09:12:48 +05:30
cbf2f90faa Add files via upload 2020-09-08 09:06:33 +05:30
4dde3e3134 Delete a.txt 2020-09-08 09:02:36 +05:30
1551e8f1c6 Add files via upload 2020-09-08 09:02:19 +05:30
c84f09e2d2 Create a.txt 2020-09-08 08:59:28 +05:30
57a32669b5 v2.1.7 2020-08-09 11:06:29 +05:30
fe017cbcc8 v2.1.7 2020-08-09 11:06:04 +05:30
5edb03d24b update docs 2020-08-09 11:05:04 +05:30
c5de2232ba Update test_wrapper.py 2020-08-09 10:53:00 +05:30
ca9186c301 update message; it is sometimes raised due to poor Wayback Machine performance even if the URL is archived. 2020-08-09 10:43:16 +05:30
8a4b631c13 new regex to parse archive, IA changed the header again :( 2020-08-09 10:36:25 +05:30
ec9ce92f48 Update README.md (#23)
* Update README.md

* fix grammar
2020-07-26 10:30:54 +05:30
e95d35c37f re arrange the badges, moved contributions welcome to top 2020-07-26 10:24:31 +05:30
36d662b961 Update __version__.py 2020-07-24 16:24:57 +05:30
2835f8877e Update setup.py 2020-07-24 16:24:38 +05:30
18cbd2fd30 Update cli.py 2020-07-24 16:10:29 +05:30
a2812fb56f patch for cli 2020-07-24 16:09:47 +05:30
77effcf649 Update setup.py 2020-07-24 15:34:14 +05:30
7272ef45a0 Update __version__.py 2020-07-24 15:33:58 +05:30
56116551ac Coverge improvements (#22)
* Update cli.py

* improved tests

* changes for proper testing

* Type check using isinstance

* Replace elifs with if when used after return

* twitter.com --> www.ibm.com

* fix typo

* test archive url parser and dunders

* Update test_wrapper.py
2020-07-24 15:31:21 +05:30
4dcda94cb0 v2.1.4 2020-07-24 01:03:44 +05:30
09f59b0182 v2.1.4 2020-07-24 01:03:04 +05:30
ed24184b99 Remove duplicate get response method 2020-07-24 00:57:22 +05:30
56bef064b1 only test save on >3.7 2020-07-23 20:51:46 +05:30
44bb2cf5e4 some cli tests 2020-07-23 20:44:14 +05:30
e231228721 Update README.md (#21)
* Update README.md

* example bash oldest newest

* total archives bash example

* near bash example

* format the list

* ce

* get bash example

* pip git install example

* Update index.rst

* + argparse

* + argparse
2020-07-22 21:35:02 +05:30
b8b2d6dfa9 v2.1.3 2020-07-22 20:21:37 +05:30
3eca6294df v2.1.3 2020-07-22 20:20:44 +05:30
eb037a0284 Rename test_1.py to test_wrapper.py 2020-07-22 20:19:59 +05:30
a01821f20b Update .travis.yml 2020-07-22 17:33:59 +05:30
b21036f8df Update .travis.yml 2020-07-22 17:31:57 +05:30
b43bacb7ac fix error language 2020-07-22 17:25:15 +05:30
f7313b255a Update cli.py 2020-07-22 17:22:38 +05:30
7457e1c793 - print(repr(obj)) 2020-07-22 17:18:27 +05:30
f7493d823f Update cli.py 2020-07-22 17:16:53 +05:30
7fa7b59ce3 if version don't try to create object 2020-07-22 17:15:28 +05:30
78a608db50 Update cli.py 2020-07-22 17:12:44 +05:30
93f7dfdaf9 resolve args conflict 2020-07-22 17:09:32 +05:30
83c6f256c9 version arg 2020-07-22 17:03:56 +05:30
dee9105794 command_line support (#18)
* Update wrapper.py

* entry points cli

* Suppress the urllib2/3 Exception

* rm cli code, will create a new cli.py file

* Create cli.py

* update cli entry pts

* Update cli.py

* Update cli.py

* import print_function

* Update cli.py

* Update cli.py

* Delete pypi_uploader.sh

* resolve conflicts with the master

* update the test ; resolve the conflicts

* decrease code complexity

* cli method changed to main

* get is not for just local usage

* get method should be available from interface

* get is used in the interface

* Update cli.py
2020-07-22 16:40:13 +05:30
3bfc3b46d0 Delete SECURITY.md 2020-07-22 11:07:59 +05:30
553f150bee replace youtube with twitter.com
for some reason the Wayback API is returning a different youtube URL now.
2020-07-22 11:07:23 +05:30
b3a7e714a5 Update wrapper.py 2020-07-22 10:57:43 +05:30
cd9841713c Update wrapper.py 2020-07-22 10:52:43 +05:30
1ea9548d46 Raise WaybackError from URLError and include URL (#19)
* Raise WaybackError from URLError and include URL

* python2 compatibility

Co-authored-by: Akash <64683866+akamhy@users.noreply.github.com>
2020-07-22 10:51:44 +05:30
be7642c837 Code style improvements (#20)
* Add sane line length to setup.cfg

* Use Black for quick readability improvements

* Clean up exceptions, docstrings, and comments

Docstrings on dunder functions are redundant and typically ignored
Limit to reasonable line length
General grammar and style corrections
Clarify docstrings and exceptions
Format docstrings per PEP 257 -- Docstring Conventions

* Move archive_url_parser out of Url.save()

It's generally poor form to define a function in a function, as it will
be re-defined each time the function is run.

archive_url_parser does not depend on anything in Url, so it makes sense
to move it out of the class.

* move wayback_timestamp out of class, mark private functions

* DRY in _wayback_timestamp

* Url._url_check should return None

There's no point in returning True if it's never checked and won't ever
be False.
Implicitly returning None or raising an exception is more idiomatic.

* Default parameters should be type-consistant with expected values

* Specify parameters to near

* Use datetime.datetime in _wayback_timestamp

* cleanup __init__.py

* Cleanup formatting in tests

* Fix names in tests

* Revert "Use datetime.datetime in _wayback_timestamp"

This reverts commit 5b30380865.

Introduced unnecessary complexity

* Move _get_response outside of Url

Because Codacy reminded me that I missed it.

* fix imports in tests
2020-07-22 10:09:14 +05:30
a418a4e464 Update SECURITY.md 2020-07-21 10:38:41 +05:30
aec035ef1e Create SECURITY.md 2020-07-21 08:41:07 +05:30
6d37993ab9 moved to manuals 2020-07-21 08:14:40 +05:30
72b80ca44e Create pypi_uploader.sh 2020-07-21 08:14:21 +05:30
c10aa9279c Create python-publish.yml 2020-07-21 08:08:55 +05:30
68d809a7d6 Update test_1.py 2020-07-20 23:45:49 +05:30
4ad09a419b Fix bash syntax 2020-07-20 23:44:23 +05:30
ddc6620f09 Only report coverage if python 3.8 or greater 2020-07-20 23:29:54 +05:30
4066a65678 Update .travis.yml 2020-07-20 23:20:43 +05:30
8e46a9ba7a Update .travis.yml 2020-07-20 23:16:33 +05:30
a5a98b9b00 Update .travis.yml 2020-07-20 23:11:56 +05:30
a721ab7d6c Update .travis.yml 2020-07-20 23:10:06 +05:30
7db27ae5e1 Create pypi_uploader.sh 2020-07-20 22:28:35 +05:30
8fd4462025 Update wrapper.py 2020-07-20 20:17:18 +05:30
c458a15820 Update .travis.yml 2020-07-20 15:38:33 +05:30
bae3412bee Update .travis.yml 2020-07-20 15:24:26 +05:30
94cb08bb37 Update setup.py 2020-07-20 10:41:00 +05:30
af888db13e 2.1.2 2020-07-20 10:40:37 +05:30
d24f2408ee Update test_1.py 2020-07-20 10:31:47 +05:30
ddd2274015 Update test_1.py 2020-07-20 10:21:15 +05:30
99abdb7c67 Update test_1.py 2020-07-20 10:16:39 +05:30
f3bb9a8540 Update wrapper.py 2020-07-20 10:11:36 +05:30
bb94e0d1c5 Update index.rst and remove dupes 2020-07-20 10:07:31 +05:30
1a78d88be2 2.1.1 2020-07-19 23:17:01 +05:30
3ec61758b3 Update __version__.py 2020-07-19 23:16:13 +05:30
83c962166d Raise 2020-07-19 23:02:04 +05:30
e87dee3bdf Waybackpy example on replit (#15)
* Waybackpy save example on replit

* Oldest example

* Newest method replit link

* Near method example

* Get example

* Total archive method example
2020-07-19 22:28:08 +05:30
b27bfff15a v2.1.0 2020-07-19 21:08:01 +05:30
970fc1cd08 Update __version__.py 2020-07-19 21:06:54 +05:30
65391bf14b update 2020-07-19 21:04:32 +05:30
8ab116f276 API changed again; updated
* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* api changed; fix archive url parser

* Update wrapper.py

* - Trailing whitespace

* include the header in exception
2020-07-19 20:39:07 +05:30
6f82041ec9 Update README.md (#13)
* Update README.md

* Update README.md

* replit demo for waybackpy.Url.save()

* Update README.md

* Update README.md

* replit demo for oldest()

* replit demo for newest()

* Update README.md

* replit demo for total_archives

* demo at replit for get()

* demo for near

* Update README.md

* Update README.md

* Update README.md
2020-07-19 16:39:39 +05:30
11059c960e Update setup.py 2020-07-18 19:27:04 +05:30
eee1b8eba1 Update __version__.py 2020-07-18 19:26:41 +05:30
f7de8f5575 sleeps to prevent too many requests in a timeframe 2020-07-18 19:25:19 +05:30
3fa0c32064 V2.0.1 link 2020-07-18 19:09:18 +05:30
aa1e3b8825 V2.0.1 2020-07-18 19:08:39 +05:30
58d2d585c8 No timeout for final try 2020-07-18 18:29:41 +05:30
e8efed2e2f Update test_1.py 2020-07-18 17:24:54 +05:30
49089b7321 2.0.0 link 2020-07-18 17:09:07 +05:30
55d8687566 Update test_1.py 2020-07-18 16:58:23 +05:30
0fa28527af Update index.rst 2020-07-18 16:54:07 +05:30
68259fd2d9 Update index.rst 2020-07-18 16:53:27 +05:30
e7086a89d3 Update index.rst 2020-07-18 16:52:37 +05:30
e39467227c Update index.rst 2020-07-18 16:51:47 +05:30
ba840404cf Update index.rst 2020-07-18 16:50:37 +05:30
8fbd2d9e55 Update index.rst 2020-07-18 16:49:03 +05:30
eebf6043de Update index.rst 2020-07-18 16:48:29 +05:30
3d3b09d6d8 Update README.md 2020-07-18 16:46:40 +05:30
ef15b5863c Update index.rst 2020-07-18 16:44:32 +05:30
256c0cdb6b update test - save 2020-07-18 16:39:35 +05:30
12c72a8294 fix link 2020-07-18 16:30:20 +05:30
0ad27f5ecc update readme for newer oop and some test changes (#12)
* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* docstrings

* user agent ; more variants

* description update

* Update __init__.py

* # -*- coding: utf-8 -*-

* Update test_1.py

* update docs for get()

* Update README.md
2020-07-18 16:22:09 +05:30
700b60b5f8 Update README.md 2020-07-18 08:16:59 +05:30
11032596c8 Update README.md 2020-07-18 08:15:43 +05:30
9727f92168 Update README.md 2020-07-18 08:12:33 +05:30
d2893fec13 Delete CONTRIBUTING.md 2020-07-18 08:12:00 +05:30
f1353b2129 Update CONTRIBUTING.md 2020-07-18 00:58:50 +05:30
c76a95ef90 Create CONTRIBUTING.md (#11) 2020-07-18 00:57:48 +05:30
62d88359ce Update README.md 2020-07-18 00:40:21 +05:30
9942c474c9 Update README.md 2020-07-18 00:35:12 +05:30
dfb736e794 Size 2020-07-18 00:32:00 +05:30
84d1766917 Update README.md 2020-07-18 00:20:58 +05:30
9d3cdfafb3 Update README.md 2020-07-18 00:20:17 +05:30
20a16bfa45 Version 2.0.0 on its way for release (tomorrow) 2020-07-18 00:09:28 +05:30
f2112c73f6 Python 2 support 2020-07-17 21:08:32 +05:30
9860527d96 OOP (#10)
* Update wrapper.py

* Update exceptions.py

* Update __init__.py

* test adjusted for new changes

* Update wrapper.py
2020-07-17 20:50:00 +05:30
9ac1e877c8 Update README.md 2020-07-16 20:39:12 +05:30
f881705d00 detect python version with sys.version_info (#9) 2020-06-26 15:48:01 +05:30
f015c3f4f3 test on the worst case possible 2020-05-08 09:56:01 +05:30
42ac399362 Most efficient method to count (yet) 2020-05-08 09:47:13 +05:30
e9d010c793 just count the status code, consumes less memory 2020-05-08 09:28:18 +05:30
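Counting rows as they are read, instead of materializing every snapshot object, keeps memory usage flat; a rough sketch of the idea using the public CDX API (the helper name is hypothetical, field names are the standard CDX ones):

```python
import requests

def total_archives(url):
    params = {
        "url": url,
        "output": "json",
        "fl": "statuscode",  # fetch only the status-code column, nothing else
    }
    response = requests.get("https://web.archive.org/cdx/search/cdx", params=params)
    rows = response.json()
    # Skip the header row and count the remaining rows on the fly.
    return sum(1 for row in rows[1:] if row)

print(total_archives("example.com"))
```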
58a6409528 v1.6 2020-05-07 20:14:59 +05:30
7ca2029158 Update setup.py 2020-05-07 20:14:40 +05:30
80331833f2 Update setup.py 2020-05-07 20:12:32 +05:30
5e3d3a815f fix 2020-05-07 20:03:17 +05:30
6182a18cf4 fix 2020-05-07 20:02:47 +05:30
9bca750310 v1.5 2020-05-07 19:59:23 +05:30
c22749a6a3 update 2020-05-07 19:54:00 +05:30
151df94fe3 license_file = LICENSE 2020-05-07 19:38:19 +05:30
24540d0b2c update 2020-05-07 19:33:39 +05:30
bdfc72d05d Create __version__.py 2020-05-07 19:16:26 +05:30
3b104c1a28 v1.5 2020-05-07 19:03:02 +05:30
fb0d4658a7 ce 2020-05-07 19:02:12 +05:30
48833980e1 update 2020-05-07 18:58:01 +05:30
0c4f119981 Update wrapper.py 2020-05-07 17:25:34 +05:30
afded51a04 Update wrapper.py 2020-05-07 17:20:23 +05:30
b950616561 Update wrapper.py 2020-05-07 17:17:17 +05:30
444675538f fix code Complexity (#8)
* fix code Complexity

* Update wrapper.py

* codefactor badge
2020-05-07 16:51:08 +05:30
0ca6710334 Update wrapper.py 2020-05-07 16:24:33 +05:30
01a7c591ad retry 2020-05-07 15:46:39 +05:30
74d3bc154b fix issue with py2.7 2020-05-07 15:34:41 +05:30
a8e94dfb25 Update README.md 2020-05-07 15:14:55 +05:30
cc38798b32 Update README.md 2020-05-07 15:14:30 +05:30
bc3dd44f27 Update README.md 2020-05-07 15:13:58 +05:30
ba46cdafe2 Update README.md 2020-05-07 15:12:37 +05:30
538afb14e9 Update test_1.py 2020-05-07 15:06:52 +05:30
7605b614ee test for total_archives() 2020-05-07 15:00:28 +05:30
d0a4e25cf5 Update __init__.py 2020-05-07 14:53:09 +05:30
8c5c0153da + total_archives() 2020-05-07 14:52:05 +05:30
e7dac74906 Update __init__.py 2020-05-07 09:06:49 +05:30
c686708c9e more testing 2020-05-07 08:59:09 +05:30
f9ae8ada70 Update test_1.py 2020-05-07 08:39:24 +05:30
e56ece3dc9 Update README.md 2020-05-07 08:23:31 +05:30
db127a5c54 always return https 2020-05-06 20:16:25 +05:30
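Normalizing returned archive URLs to https is a simple scheme swap (tiny sketch with a hypothetical helper):

```python
def force_https(url):
    # Old captures are often reported with an http:// scheme, but
    # web.archive.org serves them over https as well.
    if url.startswith("http://"):
        return "https://" + url[len("http://"):]
    return url

print(force_https("http://web.archive.org/web/19981111184551/http://google.com:80/"))
```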
ed497bbd23 Update wrapper.py 2020-05-06 20:07:25 +05:30
45fe07ddb6 Update wrapper.py 2020-05-06 19:35:01 +05:30
0029d63d8a 503 API Service Temporarily Unavailable 2020-05-06 19:22:56 +05:30
beb5b625ec Set theme jekyll-theme-cayman 2020-05-06 12:20:43 +05:30
b40d734346 Update README.md 2020-05-06 09:18:02 +05:30
be0a30de85 Create index.rst 2020-05-05 20:22:46 +05:30
30 changed files with 2403 additions and 349 deletions

42
.github/workflows/ci.yml vendored Normal file

@@ -0,0 +1,42 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: CI
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.8']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest codecov pytest-cov
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest --cov=waybackpy tests/
- name: Upload coverage to Codecov
run: |
bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

31
.github/workflows/python-publish.yml vendored Normal file

@@ -0,0 +1,31 @@
# This workflows will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
name: Upload Python Package
on:
release:
types: [created]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools wheel twine
- name: Build and publish
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
run: |
python setup.py sdist bdist_wheel
twine upload dist/*

3
.gitignore vendored

@@ -1,3 +1,6 @@
# Files generated while testing
*-urls-*.txt
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

4
.pep8speaks.yml Normal file

@@ -0,0 +1,4 @@
# File : .pep8speaks.yml
scanner:
diff_only: True # If True, errors caused by only the patch are shown

5
.pyup.yml Normal file

@@ -0,0 +1,5 @@
# autogenerated pyup.io config file
# see https://pyup.io/docs/configuration/ for all available options
schedule: ''
update: false

.travis.yml (deleted)

@@ -1,14 +0,0 @@
language: python
python:
- "2.7"
- "3.6"
- "3.8"
os: linux
dist: xenial
cache: pip
install:
- pip install pytest
before_script:
cd tests
script:
- pytest test_1.py

58
CONTRIBUTING.md Normal file

@@ -0,0 +1,58 @@
# Contributing to waybackpy
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
- Reporting a bug
- Discussing the current state of the code
- Submitting a fix
- Proposing new features
- Becoming a maintainer
## We Develop with Github
We use github to host code, to track issues and feature requests, as well as accept pull requests.
## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
## Any contributions you make will be under the MIT Software License
In short, when you submit code changes, your submissions are understood to be under the same [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
## Report bugs using Github's [issues](https://github.com/akamhy/waybackpy/issues)
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/akamhy/waybackpy/issues/new); it's that easy!
## Write bug reports with detail, background, and sample code
**Great Bug Reports** tend to have:
- A quick summary and/or background
- Steps to reproduce
- Be specific!
- Give sample code if you can.
- What you expected would happen
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)
People *love* thorough bug reports. I'm not even kidding.
## Use a Consistent Coding Style
* You can try running `flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics` for style unification.
## License
By contributing, you agree that your contributions will be licensed under its [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE).
## References
This document is forked from [this gist](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62) by [briandk](https://github.com/briandk) which was itself adapted from the open-source contribution guidelines for [Facebook's Draft](https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md)

9
CONTRIBUTORS.md Normal file

@@ -0,0 +1,9 @@
## AUTHORS
- akamhy (<https://github.com/akamhy>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
## ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

LICENSE

@@ -1,6 +1,6 @@
MIT License
Copyright (c) 2020 akamhy
Copyright (c) 2020 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

227
README.md

@@ -1,154 +1,111 @@
# waybackpy
[![Build Status](https://travis-ci.org/akamhy/waybackpy.svg?branch=master)](https://travis-ci.org/akamhy/waybackpy)
[![Downloads](https://img.shields.io/pypi/dm/waybackpy.svg)](https://pypistats.org/packages/waybackpy)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
![pypi](https://img.shields.io/pypi/v/wayback.svg)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
<div align="center">
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<h2>Python package & CLI tool that interfaces with the Wayback Machine API</h2>
![Internet Archive](https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png)
![Wayback Machine](https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png)
</div>
The waybackpy is a python wrapper for [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine).
<p align="center">
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ACI"><img alt="Build Status" src="https://github.com/akamhy/waybackpy/workflows/CI/badge.svg"></a>
<a href="https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade"><img alt="Codacy Badge" src="https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65"></a>
<a href="https://codecov.io/gh/akamhy/waybackpy"><img alt="codecov" src="https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg"></a>
<a href="https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md"><img alt="Contributions Welcome" src="https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square"></a>
</p>
Table of contents
=================
<!--ts-->
-----------------------------------------------------------------------------------------------------------------------------------------------
* [Installation](https://github.com/akamhy/waybackpy#installation)
### Installation
* [Usage](https://github.com/akamhy/waybackpy#usage)
* [Saving an url using save()](https://github.com/akamhy/waybackpy#capturing-aka-saving-an-url-using-save)
* [Receiving the oldest archive for an URL Using oldest()](https://github.com/akamhy/waybackpy#receiving-the-oldest-archive-for-an-url-using-oldest)
* [Receiving the recent most/newest archive for an URL using newest()](https://github.com/akamhy/waybackpy#receiving-the-newest-archive-for-an-url-using-newest)
* [Receiving archive close to a specified year, month, day, hour, and minute using near()](https://github.com/akamhy/waybackpy#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage using get()](https://github.com/akamhy/waybackpy#get-the-content-of-webpage-using-get)
* [Tests](https://github.com/akamhy/waybackpy#tests)
* [Dependency](https://github.com/akamhy/waybackpy#dependency)
* [License](https://github.com/akamhy/waybackpy#license)
<!--te-->
## Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
**pip install waybackpy**
## Usage
#### Capturing aka Saving an url Using save()
```diff
+ waybackpy.save(url, UA=user_agent)
```bash
pip install waybackpy
```
> url is mandatory. UA is not, but highly recommended.
Install directly from GitHub:
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
### Supported Features
- Archive webpage
- Retrieve all archives of a webpage/domain
- Retrieve archive close to a date or timestamp
- Retrieve all archives which have a particular prefix
- Get source code of the archive easily
- CDX API support
### Usage
#### As a Python package
```python
import waybackpy
# Capturing a new archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
print(archived_url)
>>> import waybackpy
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
>>> archive.archive_url
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)
>>> oldest_archive = wayback.oldest()
>>> oldest_archive.archive_url
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> archive_close_to_2010_feb.archive_url
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> wayback.newest().archive_url
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
```
This should print something similar to the following archived URL:
> Full Python package documentation can be found at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
<https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy>
#### Receiving the oldest archive for an URL Using oldest()
```diff
+ waybackpy.oldest(url, UA=user_agent)
#### As a CLI tool
```bash
$ waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
$ waybackpy --oldest --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid
$ waybackpy --newest --url "https://en.wikipedia.org/wiki/Remote_sensing" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20201221130522/https://en.wikipedia.org/wiki/Remote_sensing
$ waybackpy --total --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent"
1904
$ waybackpy --known_urls --url akamhy.github.io --user_agent "my-unique-user-agent" --file
https://akamhy.github.io
https://akamhy.github.io/assets/js/scale.fix.js
https://akamhy.github.io/favicon.ico
https://akamhy.github.io/robots.txt
https://akamhy.github.io/waybackpy/
'akamhy.github.io-urls-iftor2.txt' saved in current working directory
```
> url is mandatory. UA is not, but highly recommended.
```python
import waybackpy
# retrieving the oldest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
print(oldest_archive)
```
This returns the oldest available archive for <https://google.com>.
<http://web.archive.org/web/19981111184551/http://google.com:80/>
#### Receiving the newest archive for an URL using newest()
```diff
+ waybackpy.newest(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
```python
import waybackpy
# retrieving the newest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
print(newest_archive)
```
This returns the newest available archive for <https://www.microsoft.com/en-us>, something just like this:
<http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/>
#### Receiving archive close to a specified year, month, day, hour, and minute using near()
```diff
+ waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)
```
> url is mandotory. year,month,day,hour and minute are optional arguments. UA is not mandotory, but higly recomended.
```python
import waybackpy
# retriving the the closest archive from a specified year.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are year,month,day,hour and minute
archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
print(archive_near_year)
```
returns : <http://web.archive.org/web/20100504071154/http://www.facebook.com/>
```waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20101111173430/http://www.facebook.com//>
```waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html>
> Please note that if you only specify the year, the current month and day are default arguments for month and day respectively. Do not expect just putting the year parameter would return the archive closer to January but the current month you are using the package. If you are using it in July 2018 and let's say you use ```waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``` then you would be returned the nearest archive to July 2011 and not January 2011. You need to specify the month "1" for January.
> Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
#### Get the content of webpage using get()
```diff
+ waybackpy.get(url, encoding="UTF-8", UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended. encoding is detected automatically, don't specify unless necessary.
```python
from waybackpy import get
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url, encoding and UA
webpage = get("https://example.com/", UA="User-Agent")
print(webpage)
```
> This should print the source code for <https://example.com/>.
## Tests
* [Here](https://github.com/akamhy/waybackpy/tree/master/tests)
## Dependency
* None, just python standard libraries (json, urllib and datetime). Both python 2 and 3 are supported :)
> Full CLI documentation can be found at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
## License
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
Released under the MIT License. See
[license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.
-----------------------------------------------------------------------------------------------------------------------------------------------

1
_config.yml Normal file

@@ -0,0 +1 @@
theme: jekyll-theme-cayman

assets/waybackpy_logo.svg

@@ -0,0 +1 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 176.612 41.908" height="158.392" width="667.51" xmlns:v="https://github.com/akamhy/waybackpy"><text transform="matrix(.862888 0 0 1.158899 -.748 -98.312)" y="110.937" x="0.931" xml:space="preserve" font-weight="bold" font-size="28.149" font-family="sans-serif" letter-spacing="0" word-spacing="0" writing-mode="lr-tb" fill="#003dff"><tspan y="110.937" x="0.931"><tspan y="110.937" x="0.931" letter-spacing="3.568" writing-mode="lr-tb">waybackpy</tspan></tspan></text><path d="M.749 0h153.787v4.864H.749zm22.076 37.418h153.787v4.49H22.825z" fill="navy"/><path d="M0 37.418h22.825v4.49H0zM154.536 0h21.702v4.864h-21.702z" fill="#f0f"/></svg>


1
requirements.txt Normal file

@@ -0,0 +1 @@
requests>=2.24.0

setup.cfg

@@ -1,2 +1,7 @@
[metadata]
description-file = README.md
license_file = LICENSE
[flake8]
max-line-length = 88
extend-ignore = E203,W503

setup.py

@@ -1,40 +1,54 @@
import os.path
from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
with open(os.path.join(os.path.dirname(__file__), "README.md")) as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")) as f:
exec(f.read(), about)
setup(
name = 'waybackpy',
packages = ['waybackpy'],
version = 'v1.4',
description = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily.",
name=about["__title__"],
packages=["waybackpy"],
version=about["__version__"],
description=about["__description__"],
long_description=long_description,
long_description_content_type='text/markdown',
license='MIT',
author = 'akamhy',
author_email = 'akash3pro@gmail.com',
url = 'https://github.com/akamhy/waybackpy',
download_url = 'https://github.com/akamhy/waybackpy/archive/v1.4.tar.gz',
keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
install_requires=[],
python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
long_description_content_type="text/markdown",
license=about["__license__"],
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/2.4.2.tar.gz",
keywords=[
"Archive It",
"Archive Website",
"Wayback Machine",
"waybackurls",
"Internet Archive",
],
install_requires=["requests"],
python_requires=">=3.4",
classifiers=[
'Development Status :: 5 - Production/Stable',
'Intended Audience :: Developers',
'Natural Language :: English',
'Topic :: Software Development :: Build Tools',
'License :: OSI Approved :: MIT License',
'Programming Language :: Python',
'Programming Language :: Python :: 2',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
'Programming Language :: Python :: Implementation :: PyPy'
],
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"Natural Language :: English",
"Topic :: Software Development :: Build Tools",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: Implementation :: CPython",
],
entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},
project_urls={
"Documentation": "https://github.com/akamhy/waybackpy/wiki",
"Source": "https://github.com/akamhy/waybackpy",
"Tracker": "https://github.com/akamhy/waybackpy/issues",
},
)

0
tests/__init__.py Normal file

tests/test_1.py (deleted)

@@ -1,58 +0,0 @@
import sys
sys.path.append("..")
import waybackpy
import pytest
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_save():
# Test for urls that exist and can be archived.
url1="https://github.com/akamhy/waybackpy"
archived_url1 = waybackpy.save(url1, UA=user_agent)
assert url1 in archived_url1
# Test for urls that are incorrect.
with pytest.raises(Exception) as e_info:
url2 = "ha ha ha ha"
archived_url2 = waybackpy.save(url2, UA=user_agent)
# Test for urls not allowed to archive by robot.txt.
with pytest.raises(Exception) as e_info:
url3 = "http://www.archive.is/faq.html"
archived_url3 = waybackpy.save(url3, UA=user_agent)
# Non existent urls, test
with pytest.raises(Exception) as e_info:
url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
archived_url4 = waybackpy.save(url4, UA=user_agent)
def test_near():
url = "google.com"
archive_near_year = waybackpy.near(url, year=2010, UA=user_agent)
assert "2010" in archive_near_year
archive_near_month_year = waybackpy.near(url, year=2015, month=2, UA=user_agent)
assert "201502" in archive_near_month_year
archive_near_day_month_year = waybackpy.near(url, year=2006, month=11, day=15, UA=user_agent)
assert "20061115" in archive_near_day_month_year
archive_near_hour_day_month_year = waybackpy.near("www.python.org", year=2008, month=5, day=9, hour=15, UA=user_agent)
assert "2008050915" in archive_near_hour_day_month_year
def test_oldest():
url = "github.com/akamhy/waybackpy"
archive_oldest = waybackpy.oldest(url, UA=user_agent)
assert "20200504141153" in archive_oldest
def test_newest():
url = "github.com/akamhy/waybackpy"
archive_newest = waybackpy.newest(url, UA=user_agent)
assert url in archive_newest
def test_get():
oldest_google_archive = waybackpy.oldest("google.com", UA=user_agent)
oldest_google_page_text = waybackpy.get(oldest_google_archive, UA=user_agent)
assert "Welcome to Google" in oldest_google_page_text

93
tests/test_cdx.py Normal file

@@ -0,0 +1,93 @@
import pytest
from waybackpy.cdx import Cdx
from waybackpy.exceptions import WaybackError
def test_all_cdx():
url = "akamhy.github.io"
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, \
like Gecko) Chrome/45.0.2454.85 Safari/537.36"
cdx = Cdx(
url=url,
user_agent=user_agent,
start_timestamp=2017,
end_timestamp=2020,
filters=[
"statuscode:200",
"mimetype:text/html",
"timestamp:20201002182319",
"original:https://akamhy.github.io/",
],
gzip=False,
collapses=["timestamp:10", "digest"],
limit=50,
match_type="prefix",
)
snapshots = cdx.snapshots()
for snapshot in snapshots:
ans = snapshot.archive_url
assert "https://web.archive.org/web/20201002182319/https://akamhy.github.io/" == ans
url = "akahfjgjkmhy.gihthub.ip"
cdx = Cdx(
url=url,
user_agent=user_agent,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=True,
collapses=[],
limit=10,
)
snapshots = cdx.snapshots()
print(snapshots)
i = 0
for _ in snapshots:
i += 1
assert i == 0
url = "https://github.com/akamhy/waybackpy/*"
cdx = Cdx(url=url, user_agent=user_agent, limit=50)
snapshots = cdx.snapshots()
for snapshot in snapshots:
print(snapshot.archive_url)
url = "https://github.com/akamhy/waybackpy"
with pytest.raises(WaybackError):
cdx = Cdx(url=url, user_agent=user_agent, limit=50, filters=["ghddhfhj"])
snapshots = cdx.snapshots()
with pytest.raises(WaybackError):
cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp", "ghdd:hfhj"])
snapshots = cdx.snapshots()
url = "https://github.com"
cdx = Cdx(url=url, user_agent=user_agent, limit=50)
snapshots = cdx.snapshots()
c = 0
for snapshot in snapshots:
c += 1
if c > 100:
break
url = "https://github.com/*"
cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp"])
snapshots = cdx.snapshots()
c = 0
for snapshot in snapshots:
c += 1
if c > 30_529: # deafult limit is 10k
break
url = "https://github.com/*"
cdx = Cdx(url=url, user_agent=user_agent)
c = 0
snapshots = cdx.snapshots()
for snapshot in snapshots:
c += 1
if c > 100_529:
break

360
tests/test_cli.py Normal file

@@ -0,0 +1,360 @@
import sys
import os
import pytest
import random
import string
import argparse
sys.path.append("..")
import waybackpy.cli as cli # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
def test_save():
args = argparse.Namespace(
user_agent=None,
url="https://hfjfjfjfyu6r6rfjvj.fjhgjhfjgvjm",
total=False,
version=False,
file=False,
oldest=False,
save=True,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "could happen because either your waybackpy" in str(reply)
def test_json():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=True,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "archived_snapshots" in str(reply)
def test_archive_url():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=True,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "https://web.archive.org/web/" in str(reply)
def test_oldest():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
file=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_newest():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_total_archives():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=True,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert isinstance(reply, int)
def test_known_urls():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://www.keybr.com",
total=False,
version=False,
file=True,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=True,
get=None,
)
reply = cli.args_handler(args)
assert "keybr" in str(reply)
def test_near():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "202007" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_get():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://github.com/akamhy",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get="url",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://github.com/akamhy/waybackpy",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get="oldest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://akamhy.github.io/waybackpy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get="newest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get="foobar",
)
reply = cli.args_handler(args)
assert "get the source code of the" in str(reply)
def test_args_handler():
args = argparse.Namespace(version=True)
reply = cli.args_handler(args)
assert ("waybackpy version %s" % (__version__)) == reply
args = argparse.Namespace(url=None, version=False)
reply = cli.args_handler(args)
assert ("waybackpy %s" % (__version__)) in str(reply)
def test_main():
# This also tests the parse_args method in cli.py
cli.main(["temp.py", "--version"])
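For readers skimming these tests, the same code paths can be exercised directly from Python. A minimal sketch, assuming waybackpy is installed; the URL and user agent below are placeholders, not values from the test suite:

from waybackpy import cli

# Build an argv vector the way test_main above does: the first element is the
# program name, the rest are flags handled by cli.parse_args and cli.args_handler.
cli.main(["waybackpy", "--url", "https://example.com", "--total",
          "--user_agent", "my-user-agent-for-illustration"])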

40
tests/test_snapshot.py Normal file

@ -0,0 +1,40 @@
import pytest
from waybackpy.snapshot import CdxSnapshot, datetime
def test_CdxSnapshot():
sample_input = "org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415"
prop_values = sample_input.split(" ")
properties = {}
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
snapshot = CdxSnapshot(properties)
assert properties["urlkey"] == snapshot.urlkey
assert properties["timestamp"] == snapshot.timestamp
assert properties["original"] == snapshot.original
assert properties["mimetype"] == snapshot.mimetype
assert properties["statuscode"] == snapshot.statuscode
assert properties["digest"] == snapshot.digest
assert properties["length"] == snapshot.length
assert (
datetime.strptime(properties["timestamp"], "%Y%m%d%H%M%S")
== snapshot.datetime_timestamp
)
archive_url = (
"https://web.archive.org/web/"
+ properties["timestamp"]
+ "/"
+ properties["original"]
)
assert archive_url == snapshot.archive_url
assert sample_input == str(snapshot)

186
tests/test_utils.py Normal file

@ -0,0 +1,186 @@
import pytest
import json
from waybackpy.utils import (
_cleaned_url,
_url_check,
_full_url,
URLError,
WaybackError,
_get_total_pages,
_archive_url_parser,
_wayback_timestamp,
_get_response,
_check_match_type,
_check_collapses,
_check_filters,
_ts,
)
def test_ts():
timestamp = True
data = {}
assert _ts(timestamp, data)
data = """
{"archived_snapshots": {"closest": {"timestamp": "20210109155628", "available": true, "status": "200", "url": "http://web.archive.org/web/20210109155628/https://www.google.com/"}}, "url": "https://www.google.com/"}
"""
data = json.loads(data)
assert data["archived_snapshots"]["closest"]["timestamp"] == "20210109155628"
def test_check_filters():
filters = []
_check_filters(filters)
filters = ["statuscode:200", "timestamp:20215678901234", "original:https://url.com"]
_check_filters(filters)
with pytest.raises(WaybackError):
_check_filters("not-list")
def test_check_collapses():
collapses = []
_check_collapses(collapses)
collapses = ["timestamp:10"]
_check_collapses(collapses)
collapses = ["urlkey"]
_check_collapses(collapses)
collapses = "urlkey" # NOT LIST
with pytest.raises(WaybackError):
_check_collapses(collapses)
collapses = ["also illegal collapse"]
with pytest.raises(WaybackError):
_check_collapses(collapses)
def test_check_match_type():
assert None == _check_match_type(None, "url")
match_type = "exact"
url = "test_url"
assert None == _check_match_type(match_type, url)
url = "has * in it"
with pytest.raises(WaybackError):
_check_match_type("domain", url)
with pytest.raises(WaybackError):
_check_match_type("not a valid type", "url")
def test_cleaned_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network%20security"
assert answer == _cleaned_url(test_url)
def test_url_check():
good_url = "https://akamhy.github.io"
assert None == _url_check(good_url)
bad_url = "https://github-com"
with pytest.raises(URLError):
_url_check(bad_url)
def test_full_url():
params = {}
endpoint = "https://web.archive.org/cdx/search/cdx"
assert endpoint == _full_url(endpoint, params)
params = {"a": "1"}
assert "https://web.archive.org/cdx/search/cdx?a=1" == _full_url(endpoint, params)
assert "https://web.archive.org/cdx/search/cdx?a=1" == _full_url(
endpoint + "?", params
)
params["b"] = 2
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2" == _full_url(
endpoint + "?", params
)
params["c"] = "foo bar"
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2&c=foo%20bar" == _full_url(
endpoint + "?", params
)
def test_get_total_pages():
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
url = "github.com*"
assert 212890 <= _get_total_pages(url, user_agent)
url = "https://zenodo.org/record/4416138"
assert 2 >= _get_total_pages(url, user_agent)
def test_archive_url_parser():
perfect_header = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:40:25 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Server': 'nginx', 'X-Archive-Orig-Date': 'Sat, 02 Jan 2021 09:40:09 GMT', 'X-Archive-Orig-Transfer-Encoding': 'chunked', 'X-Archive-Orig-Connection': 'keep-alive', 'X-Archive-Orig-Vary': 'Accept-Encoding', 'X-Archive-Orig-Last-Modified': 'Fri, 01 Jan 2021 12:19:00 GMT', 'X-Archive-Orig-Strict-Transport-Security': 'max-age=31536000, max-age=0;', 'X-Archive-Guessed-Content-Type': 'text/html', 'X-Archive-Guessed-Charset': 'utf-8', 'Memento-Datetime': 'Sat, 02 Jan 2021 09:40:09 GMT', 'Link': '<https://www.scribbr.com/citing-sources/et-al/>; rel="original", <https://web.archive.org/web/timemap/link/https://www.scribbr.com/citing-sources/et-al/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.scribbr.com/citing-sources/et-al/>; rel="timegate", <https://web.archive.org/web/20200601082911/https://www.scribbr.com/citing-sources/et-al/>; rel="first memento"; datetime="Mon, 01 Jun 2020 08:29:11 GMT", <https://web.archive.org/web/20201126185327/https://www.scribbr.com/citing-sources/et-al/>; rel="prev memento"; datetime="Thu, 26 Nov 2020 18:53:27 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="last memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'X-Archive-Src': 'spn2-20210102092956-wwwb-spn20.us.archive.org-8001.warc.gz', 'Server-Timing': 'captures_list;dur=112.646325, exclusion.robots;dur=0.172010, exclusion.robots.policy;dur=0.158205, RedisCDXSource;dur=2.205932, esindex;dur=0.014647, LoadShardBlock;dur=82.205012, PetaboxLoader3.datanode;dur=70.750239, CDXLines.iter;dur=24.306278, load_resource;dur=26.520179', 'X-App-Server': 'wwwb-app200', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/IN', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0', 'Content-Encoding': 'gzip'}
"""
archive = _archive_url_parser(
perfect_header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "web.archive.org/web/20210102094009" in archive
header = """
vhgvkjv
Content-Location: /web/20201126185327/https://www.scribbr.com/citing-sources/et-al
ghvjkbjmmcmhj
"""
archive = _archive_url_parser(
header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "20201126185327" in archive
header = """
hfjkfjfcjhmghmvjm
X-Cache-Key: https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al/US
yfu,u,gikgkikik
"""
archive = _archive_url_parser(
header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "20171128185327" in archive
# The below header should result in Exception
no_archive_header = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:42:45 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache', 'X-App-Server': 'wwwb-app52', 'X-ts': '523', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0'}
"""
with pytest.raises(WaybackError):
_archive_url_parser(
no_archive_header, "https://www.scribbr.com/citing-sources/et-al/"
)
def test_wayback_timestamp():
ts = _wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4)
assert "202001020304" in str(ts)
def test_get_response():
endpoint = "https://www.google.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = _get_response(endpoint, params=None, headers=headers)
assert response.status_code == 200
endpoint = "http/wwhfhfvhvjhmom"
with pytest.raises(WaybackError):
_get_response(endpoint, params=None, headers=headers)
endpoint = "https://akamhy.github.io"
url, response = _get_response(
endpoint, params=None, headers=headers, return_full_url=True
)
assert endpoint == url

32
tests/test_wrapper.py Normal file

@ -0,0 +1,32 @@
import sys
import pytest
import random
import requests
from datetime import datetime
from waybackpy.wrapper import Url
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_url_check():
"""No API Use"""
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
Url(broken_url, user_agent)
def test_near():
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
def test_json():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
assert "archived_snapshots" in str(target.JSON)

waybackpy/__init__.py

@ -1,6 +1,57 @@
# -*- coding: utf-8 -*-
from .wrapper import save, near, oldest, newest, get
# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
# ┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
# ┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
# ━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
# ━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
__version__ = "v1.4"
"""
Waybackpy is a Python package & command-line program that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
__all__ = ['wrapper', 'exceptions']
Archive webpages and retrieve archived URLs easily.
Usage:
>>> import waybackpy
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
>>> str(archive)
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)
>>> oldest_archive = wayback.oldest()
>>> str(oldest_archive)
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> str(archive_close_to_2010_feb)
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> str(wayback.newest())
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
Full documentation @ <https://github.com/akamhy/waybackpy/wiki>.
:copyright: (c) 2020-2021 Akash Mahanty et al.
:license: MIT
"""
from .wrapper import Url, Cdx
from .__version__ import (
__title__,
__description__,
__url__,
__version__,
__author__,
__author_email__,
__license__,
__copyright__,
)

11
waybackpy/__version__.py Normal file

@ -0,0 +1,11 @@
__title__ = "waybackpy"
__description__ = (
"A Python package that interfaces with the Internet Archive's Wayback Machine API. "
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "2.4.2"
__author__ = "akamhy"
__author_email__ = "akamhy@yahoo.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020-2021 Akash Mahanty et al."

214
waybackpy/cdx.py Normal file

@ -0,0 +1,214 @@
from .snapshot import CdxSnapshot
from .exceptions import WaybackError
from .utils import (
_get_total_pages,
_get_response,
default_user_agent,
_check_filters,
_check_collapses,
_check_match_type,
_add_payload,
)
# TODO : Threading support for pagination API. It's designed for Threading.
class Cdx:
def __init__(
self,
url,
user_agent=None,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=None,
collapses=[],
limit=None,
):
self.url = str(url).strip()
self.user_agent = str(user_agent) if user_agent else default_user_agent
self.start_timestamp = str(start_timestamp) if start_timestamp else None
self.end_timestamp = str(end_timestamp) if end_timestamp else None
self.filters = filters
_check_filters(self.filters)
self.match_type = str(match_type).strip() if match_type else None
_check_match_type(self.match_type, self.url)
self.gzip = gzip if gzip else True
self.collapses = collapses
_check_collapses(self.collapses)
self.limit = limit if limit else 5000
self.last_api_request_url = None
self.use_page = False
def cdx_api_manager(self, payload, headers, use_page=False):
"""
We have two options to get the snapshots: we use this
method to select between the pagination API and
the normal one with a resumption key, i.e. sequential querying
of CDX data. For very large queries (for example a domain query),
it may be useful to perform queries in parallel and also to estimate
the total size of the query.
Read more about the pagination API at:
https://web.archive.org/web/20201228063237/https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#pagination-api
If use_page is false it will use the normal sequential query API,
else it uses the pagination API.
Two mutually exclusive cases are possible:
1) The pagination API is selected:
a) get the total number of pages to read, using _get_total_pages()
b) then use a for loop to fetch all the pages and yield the response text
2) The normal sequential query API is selected:
a) use showResumeKey=true to ask the API to add a query resumption key
at the bottom of the response
b) check if the page has more than 3 lines; if not, return the text
c) if it has at least three lines, check the second last line for zero length
d) if the second last line has length zero then assume that the last line contains
the resumption key; set resumeKey and remove the resumption key from the text
e) if the second last line has non-zero length, return the text as there will be no resumption key
f) if we find the resumption key we set the "more" variable to True, which is otherwise reset
to False on each iteration. If more is not True the iteration stops and the function returns.
"""
endpoint = "https://web.archive.org/cdx/search/cdx"
total_pages = _get_total_pages(self.url, self.user_agent)
# If we only have two or fewer pages of archives then we favour accuracy,
# as the pagination API can sometimes lag behind.
if use_page == True and total_pages >= 2:
blank_pages = 0
for i in range(total_pages):
payload["page"] = str(i)
url, res = _get_response(
endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text
if len(text) == 0:
blank_pages += 1
if blank_pages >= 2:
break
yield text
else:
payload["showResumeKey"] = "true"
payload["limit"] = str(self.limit)
resumeKey = None
more = True
while more:
if resumeKey:
payload["resumeKey"] = resumeKey
url, res = _get_response(
endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text.strip()
lines = text.splitlines()
more = False
if len(lines) >= 3:
second_last_line = lines[-2]
if len(second_last_line) == 0:
resumeKey = lines[-1].strip()
text = text.replace(resumeKey, "", 1).strip()
more = True
yield text
def snapshots(self):
"""
This method yields the snapshots, each encapsulated
in a CdxSnapshot object for easier use.
All the GET request parameters are set if their conditions match.
If the caller supplies neither start_timestamp nor end_timestamp
and does not use any collapses,
then we use the pagination API, as it returns archives starting
from the first archive and the most recent archive will be on
the last page.
"""
payload = {}
headers = {"User-Agent": self.user_agent}
_add_payload(self, payload)
if not self.start_timestamp and not self.end_timestamp:
self.use_page = True
if self.collapses != []:
self.use_page = False
texts = self.cdx_api_manager(payload, headers, use_page=self.use_page)
for text in texts:
if text.isspace() or len(text) <= 1 or not text:
continue
snapshot_list = text.split("\n")
for snapshot in snapshot_list:
if len(snapshot) < 46: # 14 + 32 (timestamp+digest)
continue
properties = {
"urlkey": None,
"timestamp": None,
"original": None,
"mimetype": None,
"statuscode": None,
"digest": None,
"length": None,
}
prop_values = snapshot.split(" ")
# Making sure that we get the same number of
# property values as the number of properties
prop_values_len = len(prop_values)
properties_len = len(properties)
if prop_values_len != properties_len:
raise WaybackError(
"Snapshot returned by Cdx API has {prop_values_len} properties instead of expected {properties_len} properties.\nInvolved Snapshot : {snapshot}".format(
prop_values_len=prop_values_len,
properties_len=properties_len,
snapshot=snapshot,
)
)
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
yield CdxSnapshot(properties)
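A short usage sketch of the class above; the target URL, filter and limit are illustrative. snapshots() is a generator, so results stream in without holding every CDX page in memory:

from waybackpy import Cdx

# Iterate snapshots for a URL; filters and collapses follow the same syntax
# that the _check_filters and _check_collapses helpers in utils.py validate.
cdx = Cdx("akamhy.github.io", limit=100, filters=["statuscode:200"])
for snapshot in cdx.snapshots():
    print(snapshot.timestamp, snapshot.archive_url)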

331
waybackpy/cli.py Normal file

@ -0,0 +1,331 @@
import os
import re
import sys
import json
import random
import string
import argparse
from .wrapper import Url
from .exceptions import WaybackError
from .__version__ import __version__
def _save(obj):
try:
return obj.save()
except Exception as err:
e = str(err)
m = re.search(r"Header:\n(.*)", e)
if m:
header = m.group(1)
if "No archive URL found in the API response" in e:
return (
"\n[waybackpy] Can not save/archive your link.\n[waybackpy] This "
"could happen because either your waybackpy ({version}) is likely out of "
"date or Wayback Machine is malfunctioning.\n[waybackpy] Visit "
"https://github.com/akamhy/waybackpy for the latest version of "
"waybackpy.\n[waybackpy] API response Header :\n{header}".format(
version=__version__, header=header
)
)
raise WaybackError(err)
def _archive_url(obj):
return obj.archive_url
def _json(obj):
return json.dumps(obj.JSON)
def no_archive_handler(e, obj):
m = re.search(r"archive\sfor\s\'(.*?)\'\stry", str(e))
if m:
url = m.group(1)
ua = obj.user_agent
if "github.com/akamhy/waybackpy" in ua:
ua = "YOUR_USER_AGENT_HERE"
return (
"\n[Waybackpy] Can not find archive for '{url}'.\n[Waybackpy] You can"
" save the URL using the following command:\n[Waybackpy] waybackpy --"
'user_agent "{user_agent}" --url "{url}" --save'.format(
url=url, user_agent=ua
)
)
raise WaybackError(e)
def _oldest(obj):
try:
return obj.oldest()
except Exception as e:
return no_archive_handler(e, obj)
def _newest(obj):
try:
return obj.newest()
except Exception as e:
return no_archive_handler(e, obj)
def _total_archives(obj):
return obj.total_archives()
def _near(obj, args):
_near_args = {}
args_arr = [args.year, args.month, args.day, args.hour, args.minute]
keys = ["year", "month", "day", "hour", "minute"]
for key, arg in zip(keys, args_arr):
if arg:
_near_args[key] = arg
try:
return obj.near(**_near_args)
except Exception as e:
return no_archive_handler(e, obj)
def _save_urls_on_file(url_gen):
domain = None
sys_random = random.SystemRandom()
uid = "".join(
sys_random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url_count = 0
for url in url_gen:
url_count += 1
if not domain:
m = re.search("https?://([A-Za-z_0-9.-]+).*", url)
domain = "domain-unknown"
if m:
domain = m.group(1)
file_name = "{domain}-urls-{uid}.txt".format(domain=domain, uid=uid)
file_path = os.path.join(os.getcwd(), file_name)
if not os.path.isfile(file_path):
open(file_path, "w+").close()
with open(file_path, "a") as f:
f.write("{url}\n".format(url=url))
print(url)
if url_count > 0:
return "\n\n'{file_name}' saved in current working directory".format(
file_name=file_name
)
else:
return "No known URLs found. Please try a diffrent input!"
def _known_urls(obj, args):
"""
Known urls for a domain.
"""
subdomain = True if args.subdomain else False
url_gen = obj.known_urls(subdomain=subdomain)
if args.file:
return _save_urls_on_file(url_gen)
else:
for url in url_gen:
print(url)
return "\n"
def _get(obj, args):
if args.get.lower() == "url":
return obj.get()
if args.get.lower() == "archive_url":
return obj.get(obj.archive_url)
if args.get.lower() == "oldest":
return obj.get(obj.oldest())
if args.get.lower() == "latest" or args.get.lower() == "newest":
return obj.get(obj.newest())
if args.get.lower() == "save":
return obj.get(obj.save())
return "Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
\n3) oldest - get the source code of the oldest archive for the supplied url.\
\n4) newest - get the source code of the newest archive for the supplied url.\
\n5) save - Create a new archive and get the source code of this new archive for the supplied url."
def args_handler(args):
if args.version:
return "waybackpy version {version}".format(version=__version__)
if not args.url:
return "waybackpy {version} \nSee 'waybackpy --help' for help using this tool.".format(
version=__version__
)
obj = Url(args.url)
if args.user_agent:
obj = Url(args.url, args.user_agent)
if args.save:
output = _save(obj)
elif args.archive_url:
output = _archive_url(obj)
elif args.json:
output = _json(obj)
elif args.oldest:
output = _oldest(obj)
elif args.newest:
output = _newest(obj)
elif args.known_urls:
output = _known_urls(obj, args)
elif args.total:
output = _total_archives(obj)
elif args.near:
return _near(obj, args)
elif args.get:
output = _get(obj, args)
else:
output = (
"You only specified the URL. But you also need to specify the operation."
"\nSee 'waybackpy --help' for help using this tool."
)
return output
def add_requiredArgs(requiredArgs):
requiredArgs.add_argument(
"--url", "-u", help="URL on which Wayback machine operations would occur"
)
def add_userAgentArg(userAgentArg):
help_text = 'User agent, default user_agent is "waybackpy python package - https://github.com/akamhy/waybackpy"'
userAgentArg.add_argument("--user_agent", "-ua", help=help_text)
def add_saveArg(saveArg):
saveArg.add_argument(
"--save", "-s", action="store_true", help="Save the URL on the Wayback machine"
)
def add_auArg(auArg):
auArg.add_argument(
"--archive_url",
"-au",
action="store_true",
help="Get the latest archive URL, alias for --newest",
)
def add_jsonArg(jsonArg):
jsonArg.add_argument(
"--json",
"-j",
action="store_true",
help="JSON data of the availability API request",
)
def add_oldestArg(oldestArg):
oldestArg.add_argument(
"--oldest",
"-o",
action="store_true",
help="Oldest archive for the specified URL",
)
def add_newestArg(newestArg):
newestArg.add_argument(
"--newest",
"-n",
action="store_true",
help="Newest archive for the specified URL",
)
def add_totalArg(totalArg):
totalArg.add_argument(
"--total",
"-t",
action="store_true",
help="Total number of archives for the specified URL",
)
def add_getArg(getArg):
getArg.add_argument(
"--get",
"-g",
help="Prints the source code of the supplied url. Use '--get help' for extended usage",
)
def add_knownUrlArg(knownUrlArg):
knownUrlArg.add_argument(
"--known_urls", "-ku", action="store_true", help="URLs known for the domain."
)
help_text = "Use with '--known_urls' to include known URLs for subdomains."
knownUrlArg.add_argument("--subdomain", "-sub", action="store_true", help=help_text)
knownUrlArg.add_argument(
"--file",
"-f",
action="store_true",
help="Save the URLs in file at current directory.",
)
def add_nearArg(nearArg):
nearArg.add_argument(
"--near", "-N", action="store_true", help="Archive near specified time"
)
def add_nearArgs(nearArgs):
nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
nearArgs.add_argument("--hour", "-H", type=int, help="Hour in intege")
nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")
def parse_args(argv):
parser = argparse.ArgumentParser()
add_requiredArgs(parser.add_argument_group("URL argument (required)"))
add_userAgentArg(parser.add_argument_group("User Agent"))
add_saveArg(parser.add_argument_group("Create new archive/save URL"))
add_auArg(parser.add_argument_group("Get the latest Archive"))
add_jsonArg(parser.add_argument_group("Get the JSON data"))
add_oldestArg(parser.add_argument_group("Oldest archive"))
add_newestArg(parser.add_argument_group("Newest archive"))
add_totalArg(parser.add_argument_group("Total number of archives"))
add_getArg(parser.add_argument_group("Get source code"))
add_knownUrlArg(
parser.add_argument_group(
"URLs known and archived to Waybcak Machine for the site."
)
)
add_nearArg(parser.add_argument_group("Archive close to time specified"))
add_nearArgs(parser.add_argument_group("Arguments that are used only with --near"))
parser.add_argument(
"--version", "-v", action="store_true", help="Waybackpy version"
)
return parser.parse_args(argv[1:])
def main(argv=None):
argv = sys.argv if argv is None else argv
print(args_handler(parse_args(argv)))
if __name__ == "__main__":
sys.exit(main(sys.argv))

waybackpy/exceptions.py

@ -1,38 +1,19 @@
# -*- coding: utf-8 -*-
"""
waybackpy.exceptions
~~~~~~~~~~~~~~~~~~~
This module contains the set of Waybackpy's exceptions.
"""
class TooManyArchivingRequests(Exception):
"""Error when a single url reqeusted for archiving too many times in a short timespam.
Wayback machine doesn't supports archivng any url too many times in a short period of time.
class WaybackError(Exception):
"""
Raised when Waybackpy can not return what you asked for.
1) Wayback Machine API Service is unreachable/down.
2) You passed illegal arguments.
"""
class ArchivingNotAllowed(Exception):
"""Files like robots.txt are set to deny robot archiving.
Wayback machine respects these files and will not archive.
class URLError(Exception):
"""
class PageNotSaved(Exception):
"""
When unable to save a webpage.
"""
class ArchiveNotFound(Exception):
"""
When a page was never archived but client asks for old archive.
"""
class UrlNotFound(Exception):
"""
Raised when 404 UrlNotFound.
"""
class BadGateWay(Exception):
"""
Raised when 502 bad gateway.
"""
class InvalidUrl(Exception):
"""
Raised when url doesn't follow the standard url format.
Raised when malformed URLs are passed as arguments.
"""

36
waybackpy/snapshot.py Normal file

@ -0,0 +1,36 @@
from datetime import datetime
class CdxSnapshot:
"""
This class helps to use the Cdx Snapshots easily.
Raw Snapshot data looks like:
org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
properties is a dict containing all of the 7 CDX snapshot properties.
"""
def __init__(self, properties):
self.urlkey = properties["urlkey"]
self.timestamp = properties["timestamp"]
self.datetime_timestamp = datetime.strptime(self.timestamp, "%Y%m%d%H%M%S")
self.original = properties["original"]
self.mimetype = properties["mimetype"]
self.statuscode = properties["statuscode"]
self.digest = properties["digest"]
self.length = properties["length"]
self.archive_url = (
"https://web.archive.org/web/" + self.timestamp + "/" + self.original
)
def __str__(self):
return "{urlkey} {timestamp} {original} {mimetype} {statuscode} {digest} {length}".format(
urlkey=self.urlkey,
timestamp=self.timestamp,
original=self.original,
mimetype=self.mimetype,
statuscode=self.statuscode,
digest=self.digest,
length=self.length,
)
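For illustration, the raw CDX line quoted in the class docstring can be turned into a CdxSnapshot by splitting it into the seven fields the constructor expects; this mirrors test_snapshot.py earlier in this diff:

from waybackpy.snapshot import CdxSnapshot

line = "org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415"
keys = ("urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length")
snapshot = CdxSnapshot(dict(zip(keys, line.split(" "))))
print(snapshot.archive_url)  # https://web.archive.org/web/20080126045828/http://github.com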

389
waybackpy/utils.py Normal file

@ -0,0 +1,389 @@
import re
import time
import requests
from .exceptions import WaybackError, URLError
from datetime import datetime
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .__version__ import __version__
quote = requests.utils.quote
default_user_agent = "waybackpy python package - https://github.com/akamhy/waybackpy"
def _latest_version(package_name, headers):
endpoint = "https://pypi.org/pypi/" + package_name + "/json"
json = _get_response(endpoint, headers=headers).json()
return json["info"]["version"]
def _unix_ts_to_wayback_ts(unix_ts):
return datetime.utcfromtimestamp(int(unix_ts)).strftime("%Y%m%d%H%M%S")
def _add_payload(instance, payload):
if instance.start_timestamp:
payload["from"] = instance.start_timestamp
if instance.end_timestamp:
payload["to"] = instance.end_timestamp
if instance.gzip != True:
payload["gzip"] = "false"
if instance.match_type:
payload["matchType"] = instance.match_type
if instance.filters and len(instance.filters) > 0:
for i, f in enumerate(instance.filters):
payload["filter" + str(i)] = f
if instance.collapses and len(instance.collapses) > 0:
for i, f in enumerate(instance.collapses):
payload["collapse" + str(i)] = f
payload["url"] = instance.url
def _ts(timestamp, data):
"""
Get the timestamp of the last fetched archive.
If used before fetching any archive, it will
use whatever self.JSON returns.
self.timestamp being None implies that
self.JSON will return the JSON of whichever
archive the Wayback Machine provides.
"""
if timestamp:
return timestamp
if not data["archived_snapshots"]:
return datetime.max
return datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
def _check_match_type(match_type, url):
if not match_type:
return
if "*" in url:
raise WaybackError("Can not use wildcard with match_type argument")
legal_match_type = ["exact", "prefix", "host", "domain"]
if match_type not in legal_match_type:
exc_message = "{match_type} is not an allowed match type.\nUse one from 'exact', 'prefix', 'host' or 'domain'".format(
match_type=match_type
)
raise WaybackError(exc_message)
def _check_collapses(collapses):
if not isinstance(collapses, list):
raise WaybackError("collapses must be a list.")
if len(collapses) == 0:
return
for collapse in collapses:
try:
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?",
collapse,
)
field = match.group(1)
N = None
if 2 == len(match.groups()):
N = match.group(2)
if N:
if not (field + N == collapse):
raise Exception
else:
if not (field == collapse):
raise Exception
except Exception:
exc_message = "collapse argument '{collapse}' is not following the cdx collapse syntax.".format(
collapse=collapse
)
raise WaybackError(exc_message)
def _check_filters(filters):
if not isinstance(filters, list):
raise WaybackError("filters must be a list.")
# [!]field:regex
for _filter in filters:
try:
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):(.*)",
_filter,
)
key = match.group(1)
val = match.group(2)
except Exception:
exc_message = (
"Filter '{_filter}' not following the cdx filter syntax.".format(
_filter=_filter
)
)
raise WaybackError(exc_message)
def _cleaned_url(url):
return str(url).strip().replace(" ", "%20")
def _url_check(url):
"""
Check for common URL problems.
What we are checking:
1) '.' in self.url; a URL without a '.' in it is not valid.
If you know of any others, please create a PR on the GitHub repo.
"""
if "." not in url:
exc_message = "'{url}' is not a vaild URL.".format(url=url)
raise URLError(exc_message)
def _full_url(endpoint, params):
full_url = endpoint
if params:
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = (
full_url + amp + "{key}={val}".format(key=key, val=quote(str(val)))
)
return full_url
def _get_total_pages(url, user_agent):
"""
If showNumPages is passed to the CDX API, it returns
the number of archive pages, and each page has many archives.
This function returns the number of pages of archives (type int).
"""
total_pages_url = (
"https://web.archive.org/cdx/search/cdx?url={url}&showNumPages=true".format(
url=url
)
)
headers = {"User-Agent": user_agent}
return int((_get_response(total_pages_url, headers=headers).text).strip())
def _archive_url_parser(header, url, latest_version=__version__, instance=None):
"""
The Wayback Machine's save API doesn't
return a JSON response; we are required
to read the headers of the API response
and look for the archive URL.
This method has a few regexes
that search for the archive URL in the header.
It is used when you try to
save a webpage on the Wayback Machine.
Two cases are possible:
1) We find the archive URL in
the header.
2) We don't find the archive URL in
the API header.
If we find the archive URL we return it.
Return format:
web.archive.org/web/<TIMESTAMP>/<URL>
And if we couldn't find it, we raise
WaybackError with an error message.
"""
if "save redirected" in header and instance:
time.sleep(60) # makeup for archive time
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=now.tm_year,
month=now.tm_mon,
day=now.tm_mday,
hour=now.tm_hour,
minute=now.tm_min,
)
return_str = "web.archive.org/web/{timestamp}/{url}".format(
timestamp=timestamp, url=url
)
url = "https://" + return_str
headers = {"User-Agent": instance.user_agent}
res = _get_response(url, headers=headers)
if res.status_code < 400:
return "web.archive.org/web/{timestamp}/{url}".format(
timestamp=timestamp, url=url
)
# Regex1
m = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
if m:
return "web.archive.org" + m.group(1)
# Regex2
m = re.search(
r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
)
if m:
return m.group(1)
# Regex3
m = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
if m:
return m.group(1)
if instance:
newest_archive = None
try:
newest_archive = instance.newest()
except WaybackError:
pass # We don't care as this is a save request
if newest_archive:
minutes_old = (
datetime.utcnow() - newest_archive.timestamp
).total_seconds() / 60.0
if minutes_old <= 30:
archive_url = newest_archive.archive_url
m = re.search(r"web\.archive\.org/web/[0-9]{14}/.*", archive_url)
if m:
instance.cached_save = True
return m.group(0)
if __version__ == latest_version:
exc_message = (
"No archive URL found in the API response. "
"If '{url}' can be accessed via your web browser then either "
"Wayback Machine is malfunctioning or it refused to archive your URL."
"\nHeader:\n{header}".format(url=url, header=header)
)
else:
exc_message = (
"No archive URL found in the API response. "
"If '{url}' can be accessed via your web browser then either "
"this version of waybackpy ({version}) is out of date or WayBack "
"Machine is malfunctioning. Visit 'https://github.com/akamhy/waybackpy' "
"for the latest version of waybackpy.\nHeader:\n{header}".format(
url=url, version=__version__, header=header
)
)
raise WaybackError(exc_message)
def _wayback_timestamp(**kwargs):
"""
Wayback Machine archive URLs
have a timestamp in them.
The standard archive URL format is
https://web.archive.org/web/20191214041711/https://www.youtube.com
If we break it down in three parts:
1 ) The start (https://web.archive.org/web/)
2 ) timestamp (20191214041711)
3 ) https://www.youtube.com, the original URL
The near method takes year, month, day, hour and minute
as arguments; their type is int.
This method takes those integers, converts them to a
Wayback Machine timestamp and returns it.
The return format is string.
"""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
def _get_response(
endpoint,
params=None,
headers=None,
return_full_url=False,
retries=5,
backoff_factor=0.5,
no_raise_on_redirects=False,
):
"""
This function is used to make GET requests.
We use the requests package to make the
requests.
We try five times and if that fails we raise a
WaybackError exception.
You can handle WaybackError by importing:
from waybackpy.exceptions import WaybackError
try:
...
except WaybackError as e:
# handle it
"""
# From https://stackoverflow.com/a/35504626
# By https://stackoverflow.com/users/401467/datashaman
s = requests.Session()
retries = Retry(
total=retries,
backoff_factor=backoff_factor,
status_forcelist=[500, 502, 503, 504],
)
s.mount("https://", HTTPAdapter(max_retries=retries))
url = _full_url(endpoint, params)
try:
if not return_full_url:
return s.get(url, headers=headers)
return (url, s.get(url, headers=headers))
except Exception as e:
reason = str(e)
if no_raise_on_redirects:
if "Exceeded 30 redirects" in reason:
return
exc_message = "Error while retrieving {url}.\n{reason}".format(
url=url, reason=reason
)
exc = WaybackError(exc_message)
exc.__cause__ = e
raise exc
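Two of the helpers above are pure functions and can be sanity-checked without any network access; the expected outputs below come straight from test_utils.py earlier in this diff:

from waybackpy.utils import _full_url, _wayback_timestamp

# Timestamps are zero-padded and concatenated in year/month/day/hour/minute order.
print(_wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4))
# 202001020304

# Query parameters are appended with percent-encoding; filterN/collapseN keys
# are normalised back to "filter"/"collapse".
print(_full_url("https://web.archive.org/cdx/search/cdx", {"a": "1", "c": "foo bar"}))
# https://web.archive.org/cdx/search/cdx?a=1&c=foo%20bar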

waybackpy/wrapper.py

@ -1,88 +1,359 @@
# -*- coding: utf-8 -*-
import json
from datetime import datetime
from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl
try:
from urllib.request import Request, urlopen
from urllib.error import HTTPError
except ImportError:
from urllib2 import Request, urlopen, HTTPError
import re
from datetime import datetime, timedelta
from .exceptions import WaybackError
from .cdx import Cdx
from .utils import (
_archive_url_parser,
_wayback_timestamp,
_get_response,
default_user_agent,
_url_check,
_cleaned_url,
_ts,
_unix_ts_to_wayback_ts,
_latest_version,
)
default_UA = "waybackpy python package"
class Url:
def __init__(self, url, user_agent=default_user_agent):
self.url = url
self.user_agent = str(user_agent)
_url_check(self.url)
self._archive_url = None
self.timestamp = None
self._JSON = None
self.latest_version = None
self.cached_save = False
def clean_url(url):
return str(url).strip().replace(" ","_")
def __repr__(self):
return "waybackpy.Url(url={url}, user_agent={user_agent})".format(
url=self.url, user_agent=self.user_agent
)
def save(url,UA=default_UA):
base_save_url = "https://web.archive.org/save/"
request_url = (base_save_url + clean_url(url))
hdr = { 'User-Agent' : '%s' % UA } #nosec
req = Request(request_url, headers=hdr) #nosec
if "." not in url:
raise InvalidUrl("'%s' is not a valid url." % url)
try:
response = urlopen(req) #nosec
except HTTPError as e:
if e.code == 502:
raise BadGateWay(e)
elif e.code == 429:
raise TooManyArchivingRequests(e)
elif e.code == 404:
raise UrlNotFound(e)
def __str__(self):
"""
Output when print() is used on <class 'waybackpy.wrapper.Url'>
This should print an archive URL.
We check if self._archive_url is not None.
If not None, good. We return string of self._archive_url.
If self._archive_url is None, it means we haven't used any method that
sets self._archive_url; we now set self._archive_url to self.archive_url
and return it.
"""
if not self._archive_url:
self._archive_url = self.archive_url
return "{archive_url}".format(archive_url=self._archive_url)
def __len__(self):
"""
Why do we have len here?
Applying len() on <class 'waybackpy.wrapper.Url'>
will calculate the number of days between today and
the archive timestamp.
Can be applied on the return values of near and its
children (e.g. oldest), and if applied on waybackpy.Url()
without using any functions, it just grabs
self._timestamp, and def _timestamp gets it
from def JSON.
"""
td_max = timedelta(
days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)
if not self.timestamp:
self.timestamp = self._timestamp
if self.timestamp == datetime.max:
return td_max.days
return (datetime.utcnow() - self.timestamp).days
@property
def JSON(self):
"""
If the end user has used near() or its children like oldest, newest
and archive_url, then the JSON responses of these are cached in self._JSON.
If we find that self._JSON is not None we return it.
else we get the response of 'https://archive.org/wayback/available?url=YOUR-URL'
and return it.
"""
if self._JSON:
return self._JSON
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": self.user_agent}
payload = {"url": "{url}".format(url=_cleaned_url(self.url))}
response = _get_response(endpoint, params=payload, headers=headers)
return response.json()
@property
def archive_url(self):
"""
Returns any random archive for the instance.
But if near, oldest, newest were used before
then it returns the same archive again.
We cache archive in self._archive_url
"""
if self._archive_url:
return self._archive_url
data = self.JSON
if not data["archived_snapshots"]:
archive_url = None
else:
raise PageNotSaved(e)
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self._archive_url = archive_url
return archive_url
header = response.headers
if "exclusion.robots.policy" in str(header):
raise ArchivingNotAllowed("Can not archive %s. Disabled by site owner." % (url))
archive_id = header['Content-Location']
archived_url = "https://web.archive.org" + archive_id
return archived_url
@property
def _timestamp(self):
self.timestamp = _ts(self.timestamp, self.JSON)
return self.timestamp
def get(url,encoding=None,UA=default_UA):
hdr = { 'User-Agent' : '%s' % UA }
request_url = clean_url(url)
req = Request(request_url, headers=hdr) #nosec
resp=urlopen(req) #nosec
if encoding is None:
try:
encoding= resp.headers['content-type'].split('charset=')[-1]
except AttributeError:
encoding = "UTF-8"
return resp.read().decode(encoding)
def save(self):
"""
To save a webpage on the Wayback Machine we
need to send a GET request to https://web.archive.org/save/
def wayback_timestamp(year,month,day,hour,minute):
year = str(year)
month = str(month).zfill(2)
day = str(day).zfill(2)
hour = str(hour).zfill(2)
minute = str(minute).zfill(2)
return (year+month+day+hour+minute)
And to get the archive URL we are required to read the
header of the API response.
def near(
url,
year=datetime.utcnow().strftime('%Y'),
month=datetime.utcnow().strftime('%m'),
day=datetime.utcnow().strftime('%d'),
hour=datetime.utcnow().strftime('%H'),
minute=datetime.utcnow().strftime('%M'),
UA=default_UA,
_get_response() takes care of the get requests.
_archive_url_parser() parses the archive from the header.
"""
request_url = "https://web.archive.org/save/" + _cleaned_url(self.url)
headers = {"User-Agent": self.user_agent}
response = _get_response(
request_url,
params=None,
headers=headers,
backoff_factor=2,
no_raise_on_redirects=True,
)
if not self.latest_version:
self.latest_version = _latest_version("waybackpy", headers=headers)
if response:
res_headers = response.headers
else:
res_headers = "save redirected"
self._archive_url = "https://" + _archive_url_parser(
res_headers,
self.url,
latest_version=self.latest_version,
instance=self,
)
m = re.search(r"https?://web.archive.org/web/([0-9]{14})/http", self._archive_url)
str_ts = m.group(1)
ts = datetime.strptime(str_ts, "%Y%m%d%H%M%S")
now = datetime.utcnow()
total_seconds = int((now - ts).total_seconds())
if total_seconds > 60 * 3:
self.cached_save = True
self.timestamp = ts
return self
def get(self, url="", user_agent="", encoding=""):
"""
Return the source code of the last archived URL,
if no URL is passed to this method.
If encoding is not supplied, it is auto-detected
from the response itself by the requests package.
"""
if not url and self._archive_url:
url = self._archive_url
elif not url and not self._archive_url:
url = _cleaned_url(self.url)
if not user_agent:
user_agent = self.user_agent
headers = {"User-Agent": str(user_agent)}
response = _get_response(str(url), params=None, headers=headers)
if not encoding:
try:
encoding = response.encoding
except AttributeError:
encoding = "UTF-8"
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
def near(
self,
year=None,
month=None,
day=None,
hour=None,
minute=None,
unix_timestamp=None,
):
timestamp = wayback_timestamp(year,month,day,hour,minute)
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (clean_url(url), str(timestamp))
hdr = { 'User-Agent' : '%s' % UA }
req = Request(request_url, headers=hdr) # nosec
response = urlopen(req) #nosec
data = json.loads(response.read().decode("UTF-8"))
if not data["archived_snapshots"]:
raise ArchiveNotFound("'%s' is not yet archived." % url)
"""
Wayback Machine can have many archives of a webpage,
sometimes we want an archive close to a specific time.
archive_url = (data["archived_snapshots"]["closest"]["url"])
return archive_url
This method takes year, month, day, hour and minute as input.
The input type must be integer. Any non-supplied parameters
default to the current time.
def oldest(url,UA=default_UA,year=1994):
return near(url,year=year,UA=UA)
We convert the input to a wayback machine timestamp using
_wayback_timestamp(), it returns a string.
def newest(url,UA=default_UA):
return near(url,UA=UA)
We use the wayback machine's availability API
(https://archive.org/wayback/available)
to get the closest archive from the timestamp.
We set self._archive_url to the archive found, if any.
If archive found, we set self.timestamp to its timestamp.
We set self._JSON to the response of the availability API.
And finally return self.
"""
if unix_timestamp:
timestamp = _unix_ts_to_wayback_ts(unix_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
)
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": self.user_agent}
payload = {
"url": "{url}".format(url=_cleaned_url(self.url)),
"timestamp": timestamp,
}
response = _get_response(endpoint, params=payload, headers=headers)
data = response.json()
if not data["archived_snapshots"]:
raise WaybackError(
"Can not find archive for '{url}' try later or use wayback.Url(url, user_agent).save() "
"to create a new archive.\nAPI response:\n{text}".format(
url=_cleaned_url(self.url), text=response.text
)
)
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self._archive_url = archive_url
self.timestamp = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
self._JSON = data
return self
def oldest(self, year=1994):
"""
Returns the earliest/oldest Wayback Machine archive for the webpage.
The Wayback Machine started archiving the internet around 1997 and
therefore we can't have any archive older than that; we use 1994 as the
default year to look for the oldest archive.
We simply pass the year in near() and return it.
"""
return self.near(year=year)
def newest(self):
"""
Return the newest Wayback Machine archive available for this URL.
We return the output of self.near() as it defaults to the current UTC time.
Due to Wayback Machine database lag, this may not always be the
most recent archive.
"""
return self.near()
def total_archives(self, start_timestamp=None, end_timestamp=None):
"""
A webpage can have multiple archives on the Wayback Machine.
If someone wants to count the total number of archives of a
webpage on the Wayback Machine they can use this method.
Returns the total number of Wayback Machine archives for the URL.
The return type is int.
"""
cdx = Cdx(
_cleaned_url(self.url),
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
)
i = 0
for _ in cdx.snapshots():
i = i + 1
return i
def known_urls(
self,
subdomain=False,
host=False,
start_timestamp=None,
end_timestamp=None,
match_type="prefix",
):
"""
Yields URLs known to exist for the given input.
Defaults to treating the input URL as a prefix.
This method is kept for compatibility, use the Cdx class instead.
This method itself depends on Cdx.
Idea by Mohammed Diaa (https://github.com/mhmdiaa) from:
https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
"""
if subdomain:
match_type = "domain"
if host:
match_type = "host"
cdx = Cdx(
_cleaned_url(self.url),
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
match_type=match_type,
collapses=["urlkey"],
)
snapshots = cdx.snapshots()
for snapshot in snapshots:
yield (snapshot.original)
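Finally, a compact sketch of the rewritten Url API above; the URL and user agent are placeholders taken from the package docstring, and known_urls() streams originals from the CDX API, so it may take a while on large sites:

from waybackpy import Url

wayback = Url("https://en.wikipedia.org/wiki/Multivariable_calculus",
              "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0")

print(wayback.total_archives())      # int, counted via the Cdx class
oldest = wayback.oldest()            # near()/oldest()/newest() return the instance itself
print(str(oldest), len(oldest))      # archive URL and its age in days

for original_url in wayback.known_urls():
    print(original_url)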