Compare commits

...

46 Commits
2.4.0 ... 3.0.0

Author SHA1 Message Date
de5a3e1561 improve usage code 2022-01-18 21:18:17 +05:30
52e46fecc2 more usage example 2022-01-18 20:58:39 +05:30
3b6415abc7 updating examples 2022-01-18 20:44:47 +05:30
66e16d6d89 define __repr__ for the Availability API class 2022-01-18 20:34:21 +05:30
16b9bdd7f9 output the file name if known_url and file flag are passed. 2022-01-18 20:14:44 +05:30
7adc01bff2 implement known_urls for cli from the newer interface. Although use of CDX is recommended but backward-compatibility matters. 2022-01-18 20:07:12 +05:30
9bbd056268 Update README.md 2022-01-17 02:15:38 +05:30
2ab44391cf close #107, added link to SecSI/Docker image 2022-01-16 23:01:31 +05:30
cc3628ae18 define __str__ for objects of WaybackMachineAvailabilityAPI class, the check for self.JSON ensures that the API was atleast called. 2022-01-16 22:28:12 +05:30
1d751b942b invoke json, was a bad idea removing it the earlier commit as the end user should not have to call it 2022-01-16 22:15:25 +05:30
261a867a21 near() method of WaybackMachineAvailabilityAPI return self to preserve past behaviour 2022-01-16 21:53:54 +05:30
2e487e88d3 define __len__ on Url objects, if any method not used prior to len op then default to len of oldest archive. 2022-01-16 21:29:43 +05:30
c8d0ad493a defined __str__ for Url objects, print func should print the url. 2022-01-16 21:22:43 +05:30
ce869177fd Merge pull request #103 from akamhy/whitesource/configure
Configure WhiteSource Bolt for GitHub
2022-01-02 16:04:15 +05:30
58616fb986 Add .whitesource configuration file 2022-01-02 08:45:07 +00:00
4e68cd5743 Create separate module for the 3 different APIs also CDX is now CLI supported. 2022-01-02 14:14:45 +05:30
a7b805292d changes made for v2.4.4 (update download_url) (#100)
* v2.4.4 (update download_url)

* v2.4.4 (update __version__)

* +1

add jonasjancarik
2021-09-03 11:28:26 +05:30
6dc6124dc4 Raise error on a 509 response (too many sessions) (#99)
* Raise error on a 509 response (too many sessions)

When the response code is 509, raise an error with an explanation (based on the actual error message contained in the response HTML).

* Raise error on a 509 response (too many sessions) - linting
2021-09-03 08:04:36 +05:30
5a7fc7d568 Fix typo (#95) 2021-04-13 16:58:34 +05:30
5a9c861cad v2.4.3 (#94)
* 2.4.3

* 2.4.3
2021-04-02 10:41:59 +05:30
dd1917c77e added RedirectSaveError - for failed saves if the URL is a permanent … (#93)
* added RedirectSaveError - for failed saves if the URL is a permanent redirect.

* check if url is redirect before throwing exceptions, res.url is the redirect url if redirected at all

* update tests and cli errors
2021-04-02 10:38:17 +05:30
db8f902cff Add doc strings (#90)
* Added some docstrings in utils.py

* renamed some func/meth to better names and added doc strings + lint

* added more docstrings

* more docstrings

* improve docstrings

* docstrings

* added more docstrings, lint

* fix import error
2021-01-26 11:56:03 +05:30
88cda94c0b v2.4.2 (#89)
* v2.4.2

* v2.4.2
2021-01-24 17:03:35 +05:30
09290f88d1 fix one more error 2021-01-24 16:58:53 +05:30
e5835091c9 import re 2021-01-24 16:56:59 +05:30
7312ed1f4f set cached_save to True if archive older than 3 mins. 2021-01-24 16:53:36 +05:30
6ae8f843d3 add --file to --known_urls 2021-01-24 16:15:11 +05:30
36b936820b known urls now yileds, more reliable. And save the file in chucks wrt to response. --file arg can be used to create output file, if --file not used no output will be saved in any file. (#88) 2021-01-24 16:11:39 +05:30
a3bc6aad2b too much API usage by duplicate tests was causing too much tests failure 2021-01-23 21:08:21 +05:30
edc2f63d93 Output valid JSON, dumps python dict. Make JSON valid. 2021-01-23 20:43:52 +05:30
ffe0810b12 flag to check if the archive saved is 30 mins older or not 2021-01-16 12:06:08 +05:30
40233eb115 improve code quality, remove unused imports, use system randomness etc 2021-01-16 11:35:13 +05:30
d549d31421 improve save method, now we know that 302 errors indicates that wayback machine is archiving the URL and hasn't yet archived. We construct an artifical archive with the current UTC time and check for HTTP status code 20* or 30*. If we verify the archival, we return the artifical archive. The artificial archive will automatically point to the new archive or in best case will be the new archive after some time. 2021-01-16 10:47:43 +05:30
0725163af8 mimify the logo, remove ugly old logos 2021-01-15 18:14:48 +05:30
712471176b better error messages(str), check latest version before asking for an upgrade and rm alive checking 2021-01-15 16:47:26 +05:30
dcd7b03302 getting rid of c style str formatting, now using .format 2021-01-14 19:30:07 +05:30
76205d9cf6 backoff_factor=2 for save, incr success by 25% 2021-01-13 10:13:16 +05:30
ec0a0d04cc + dequeued0
dequeued0 (https://github.com/dequeued0) for reporting bugs and useful feature requests.
2021-01-12 10:52:41 +05:30
7bb01df846 v2.4.1 2021-01-12 10:18:09 +05:30
6142e0b353 get should retrive the last fetched archive by default 2021-01-12 10:07:14 +05:30
a65990aee3 don't use pagination API if total pages <= 2 2021-01-12 09:46:07 +05:30
259a024eb1 joke? they changed their robots.txt 2021-01-11 23:17:01 +05:30
91402792e6 + Supported Features
tell what the package can do, many users probably do not read the full usage.
2021-01-11 23:01:18 +05:30
eabf4dc046 don't fetch more pages if >=2 pages are empty 2021-01-11 22:43:14 +05:30
5a7bd73565 support unix ts as an arg in near 2021-01-11 19:53:37 +05:30
4693dbf9c1 change str repr of cdxsnapshot to cdx line 2021-01-11 09:34:37 +05:30
30 changed files with 1099 additions and 2493 deletions

View File

@ -1,42 +0,0 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: CI
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.8']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest codecov pytest-cov
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest --cov=waybackpy tests/
- name: Upload coverage to Codecov
run: |
bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

View File

@ -1,4 +0,0 @@
# File : .pep8speaks.yml
scanner:
diff_only: True # If True, errors caused by only the patch are shown

View File

@ -1,5 +0,0 @@
# autogenerated pyup.io config file
# see https://pyup.io/docs/configuration/ for all available options
schedule: ''
update: false

View File

@ -1,6 +1,10 @@
{
"scanSettings": {
"baseBranches": []
},
"checkRunSettings": {
"vulnerableCheckRunConclusionLevel": "failure"
"vulnerableCheckRunConclusionLevel": "failure",
"displayMode": "diff"
},
"issueSettings": {
"minSeverityLevel": "LOW"

View File

@ -1,58 +0,0 @@
# Contributing to waybackpy
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
- Reporting a bug
- Discussing the current state of the code
- Submitting a fix
- Proposing new features
- Becoming a maintainer
## We Develop with Github
We use github to host code, to track issues and feature requests, as well as accept pull requests.
## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
## Any contributions you make will be under the MIT Software License
In short, when you submit code changes, your submissions are understood to be under the same [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
## Report bugs using Github's [issues](https://github.com/akamhy/waybackpy/issues)
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/akamhy/waybackpy/issues/new); it's that easy!
## Write bug reports with detail, background, and sample code
**Great Bug Reports** tend to have:
- A quick summary and/or background
- Steps to reproduce
- Be specific!
- Give sample code if you can.
- What you expected would happen
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)
People *love* thorough bug reports. I'm not even kidding.
## Use a Consistent Coding Style
* You can try running `flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics` for style unification.
## License
By contributing, you agree that your contributions will be licensed under its [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE).
## References
This document is forked from [this gist](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62) by [briandk](https://github.com/briandk) which was itself adapted from the open-source contribution guidelines for [Facebook's Draft](https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md)

View File

@ -2,7 +2,9 @@
- akamhy (<https://github.com/akamhy>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- jonasjancarik (<https://github.com/jonasjancarik>)
## ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

124
README.md
View File

@ -2,15 +2,12 @@
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<h2>Python package & CLI tool that interfaces with the Wayback Machine API</h2>
<h3>Python package & CLI tool that interfaces with the Wayback Machine API</h3>
</div>
<p align="center">
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ACI"><img alt="Build Status" src="https://github.com/akamhy/waybackpy/workflows/CI/badge.svg"></a>
<a href="https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade"><img alt="Codacy Badge" src="https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65"></a>
<a href="https://codecov.io/gh/akamhy/waybackpy"><img alt="codecov" src="https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg"></a>
<a href="https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md"><img alt="Contributions Welcome" src="https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
@ -19,7 +16,19 @@
-----------------------------------------------------------------------------------------------------------------------------------------------
### Installation
## ⭐️ Introduction
Waybackpy is a [Python package](https://www.udacity.com/blog/2021/01/what-is-a-python-package.html) and a CLI tool that interfaces with the Wayback Machine API.
Wayback Machine has 3 client side APIs.
- Save API
- Availability API
- CDX API
All three of these can be accessed by waybackpy.
### 🏗 Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
@ -33,37 +42,72 @@ Install directly from GitHub:
pip install git+https://github.com/akamhy/waybackpy.git
```
### Docker Image
Docker Hub : <https://hub.docker.com/r/secsi/waybackpy>
Docker image is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
RAUDI is a tool by SecSI (<https://secsi.io>), an Italian cybersecurity startup.
### Usage
#### As a Python package
##### Save API aka SavePageNow
```python
>>> import waybackpy
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
>>> archive.archive_url
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)
>>> oldest_archive = wayback.oldest()
>>> oldest_archive.archive_url
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> archive_close_to_2010_feb.archive_url
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> wayback.newest().archive_url
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
```
> Full Python package documentation can be found at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
##### Availability API
```python
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>>
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>>
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
>>>
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
```
##### CDX API aka CDXServerAPI
```python
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
... print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
```
> Documentation at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
#### As a CLI tool
@ -76,26 +120,10 @@ https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Human
$ waybackpy --newest --url "https://en.wikipedia.org/wiki/Remote_sensing" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20201221130522/https://en.wikipedia.org/wiki/Remote_sensing
$ waybackpy --total --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent"
1904
$ waybackpy --known_urls --url akamhy.github.io --user_agent "my-unique-user-agent"
https://akamhy.github.io
https://akamhy.github.io/assets/js/scale.fix.js
https://akamhy.github.io/favicon.ico
https://akamhy.github.io/robots.txt
https://akamhy.github.io/waybackpy/
'akamhy.github.io-10-urls-m2a24y.txt' saved in current working directory
```
> Full CLI documentation can be found at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
> CLI documentation is at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
## License
### 🛡 License
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
Released under the MIT License. See
[license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.
-----------------------------------------------------------------------------------------------------------------------------------------------
Released under the MIT License. See [license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.

View File

@ -1,268 +0,0 @@
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="629.000000pt" height="103.000000pt" viewBox="0 0 629.000000 103.000000"
preserveAspectRatio="xMidYMid meet">
<g transform="translate(0.000000,103.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M0 515 l0 -515 3145 0 3145 0 0 515 0 515 -3145 0 -3145 0 0 -515z
m5413 439 c31 -6 36 -10 31 -26 -3 -10 0 -26 7 -34 6 -8 10 -17 7 -20 -3 -2
-17 11 -32 31 -15 19 -41 39 -59 44 -38 11 -10 14 46 5z m150 -11 c-7 -2 -21
-2 -30 0 -10 3 -4 5 12 5 17 0 24 -2 18 -5z m-4869 -23 c-6 -6 -21 -6 -39 -1
-30 9 -30 9 10 10 25 1 36 -2 29 -9z m452 -37 c-3 -26 -15 -65 -25 -88 -10
-22 -21 -64 -25 -94 -3 -29 -14 -72 -26 -95 -11 -23 -20 -51 -20 -61 0 -30
-39 -152 -53 -163 -6 -5 -45 -12 -85 -14 -72 -5 -102 4 -102 33 0 6 -9 31 -21
56 -11 25 -26 72 -33 103 -6 31 -17 64 -24 73 -8 9 -22 37 -32 64 l-18 48 -16
-39 c-9 -21 -16 -44 -16 -50 0 -6 -7 -24 -15 -40 -8 -16 -24 -63 -34 -106 -11
-43 -26 -93 -34 -112 -14 -34 -15 -35 -108 -46 -70 -9 -96 -9 -106 0 -21 17
-43 64 -43 92 0 14 -4 27 -9 31 -12 7 -50 120 -66 200 -8 35 -25 81 -40 103
-14 22 -27 52 -28 68 -2 28 0 29 48 31 28 1 82 5 120 9 54 4 73 3 82 -7 11
-15 53 -148 53 -170 0 -7 9 -32 21 -56 20 -41 39 -49 39 -17 0 8 -5 12 -10 9
-6 -3 -13 2 -16 12 -3 10 -10 26 -15 36 -14 26 7 21 29 -8 l20 -26 7 33 c7 35
41 149 56 185 7 19 16 23 56 23 27 0 80 2 120 6 80 6 88 1 97 -71 3 -20 9 -42
14 -48 5 -7 20 -43 32 -82 13 -38 24 -72 26 -74 2 -2 13 4 24 14 13 12 20 31
20 55 0 20 7 56 15 81 7 24 19 63 25 87 12 47 31 60 89 61 l34 1 -7 -47z
m3131 41 c17 -3 34 -12 37 -20 3 -7 1 -48 -4 -91 -4 -43 -7 -80 -4 -82 2 -2
11 2 20 10 9 7 24 18 34 24 9 5 55 40 101 77 79 64 87 68 136 68 28 0 54 -4
58 -10 3 -5 12 -7 20 -3 9 3 15 -1 15 -9 0 -13 -180 -158 -197 -158 -4 0 -14
-9 -20 -20 -11 -17 -7 -27 27 -76 22 -32 40 -63 40 -70 0 -7 6 -19 14 -26 7
-8 37 -48 65 -89 l52 -74 -28 -3 c-51 -5 -74 -12 -68 -22 9 -14 -59 -12 -73 2
-20 20 -13 30 10 14 34 -24 44 -19 17 8 -25 25 -109 140 -109 149 0 7 -60 97
-64 97 -2 0 -11 -10 -22 -22 -18 -21 -18 -21 0 -15 10 4 25 2 32 -4 18 -15 19
-35 2 -22 -7 6 -25 13 -39 17 -34 8 -39 -5 -39 -94 0 -38 -3 -75 -6 -84 -6
-16 -54 -22 -67 -9 -4 3 -40 7 -81 8 -101 2 -110 10 -104 97 3 37 10 73 16 80
6 8 10 77 10 174 0 89 2 166 6 172 6 11 162 15 213 6z m301 -1 c-25 -2 -52
-11 -58 -19 -7 -7 -17 -14 -23 -14 -5 0 -2 9 8 20 14 16 29 20 69 18 l51 -2
-47 -3z m809 -9 c33 -21 65 -89 62 -132 -1 -21 1 -47 5 -59 9 -28 -26 -111
-51 -120 -10 -3 -25 -12 -33 -19 -10 -8 -70 -15 -170 -21 l-155 -8 4 -73 c4
-93 -10 -112 -80 -112 -26 0 -60 5 -74 12 -19 8 -31 8 -51 -1 -45 -20 -55 -1
-55 98 0 47 -1 111 -3 141 -2 30 -5 107 -7 170 l-4 115 65 2 c36 2 103 7 150
11 150 15 372 13 397 -4z m338 -19 c11 -14 46 -54 78 -88 l58 -62 62 65 c34
36 75 73 89 83 28 18 113 24 122 9 3 -5 -32 -51 -77 -102 -147 -167 -134 -143
-139 -253 -3 -54 -10 -103 -16 -109 -8 -8 -8 -17 -1 -30 14 -26 11 -28 -47
-29 -119 -2 -165 3 -174 22 -6 10 -9 69 -8 131 l2 113 -57 75 c-32 41 -80 102
-107 134 -27 33 -47 62 -45 66 3 4 58 6 122 4 113 -3 119 -5 138 -29z m-4233
13 c16 -13 98 -150 98 -164 0 -4 29 -65 65 -135 36 -71 65 -135 65 -143 0 -10
-14 -17 -37 -21 -21 -4 -48 -10 -61 -16 -40 -16 -51 -10 -77 41 -29 57 -35 59
-157 38 -65 -11 -71 -14 -84 -43 -10 -25 -21 -34 -46 -38 -41 -6 -61 8 -48 33
15 28 12 38 -12 42 -18 2 -23 10 -24 36 -1 27 3 35 23 43 13 5 34 9 46 9 23 0
57 47 57 78 0 9 10 33 22 52 14 24 21 52 22 92 1 49 4 58 24 67 13 6 31 11 40
11 9 0 26 7 36 15 24 18 28 18 48 3z m1701 0 c16 -12 97 -143 97 -157 0 -3 32
-69 70 -146 39 -76 67 -142 62 -147 -4 -4 -28 -12 -52 -17 -25 -6 -57 -13 -72
-17 -25 -6 -29 -2 -50 42 -14 30 -31 50 -43 53 -11 2 -57 -2 -103 -9 -79 -12
-83 -13 -96 -45 -10 -24 -22 -34 -46 -38 -43 -9 -53 -1 -45 39 5 30 3 34 -15
34 -17 0 -20 6 -20 39 0 40 13 50 65 51 19 0 55 48 55 72 0 6 8 29 19 52 32
72 41 107 31 127 -8 14 -5 21 12 33 12 9 32 16 43 16 11 0 29 7 39 15 24 18
28 18 49 3z m-3021 -11 c-29 -9 -32 -13 -27 -39 8 -36 -11 -37 -20 -1 -8 32
15 54 54 52 24 -1 23 -2 -7 -12z m3499 4 c-12 -8 -51 -4 -51 5 0 2 15 4 33 4
22 0 28 -3 18 -9z m1081 -67 c2 -42 0 -78 -4 -81 -5 -2 -8 18 -8 45 0 27 -3
64 -6 81 -4 19 -2 31 4 31 6 0 12 -32 14 -76z m-1951 46 c12 -7 19 -21 19 -38
l-1 -27 -15 28 c-8 15 -22 27 -32 27 -9 0 -24 5 -32 10 -21 14 35 13 61 0z
m1004 -3 c73 -19 135 -61 135 -92 0 -15 -8 -29 -21 -36 -18 -9 -30 -6 -69 15
-37 20 -62 26 -109 26 -54 0 -62 -3 -78 -26 -21 -32 -33 -130 -25 -191 9 -58
41 -84 111 -91 38 -3 61 1 97 17 36 17 49 19 60 10 25 -21 15 -48 -28 -76 -38
-24 -54 -28 -148 -31 -114 -4 -170 10 -190 48 -6 11 -16 20 -23 20 -24 0 -59
95 -59 159 0 59 20 122 42 136 6 3 10 13 10 22 0 31 80 82 130 83 19 0 42 5
50 10 21 13 57 12 115 -3z m-1682 -23 c-14 -14 -28 -23 -31 -20 -8 8 29 46 44
46 7 0 2 -11 -13 -26z m159 -2 c-20 -15 -22 -23 -16 -60 4 -28 3 -42 -5 -42
-7 0 -11 19 -11 50 0 36 5 52 18 59 28 17 39 12 14 -7z m1224 -28 c-39 -40
-46 -38 -19 7 15 24 40 41 52 33 2 -2 -13 -20 -33 -40z m-1538 -33 l62 -66 63
68 c56 59 68 67 100 67 19 0 38 -3 40 -7 3 -5 -32 -53 -76 -108 -88 -108 -84
-97 -90 -255 l-2 -55 -87 -3 c-49 -1 -88 -1 -89 0 0 2 -3 50 -5 107 -3 75 -8
109 -19 121 -8 9 -15 20 -15 25 0 4 -18 29 -41 54 -83 94 -89 102 -84 111 3 6
45 9 93 9 l87 -1 63 -67z m786 59 c33 -12 48 -42 52 -107 3 -43 0 -57 -16 -73
l-20 -20 20 -28 c26 -35 35 -89 21 -125 -18 -46 -66 -60 -226 -64 -77 -3 -166
-7 -198 -10 -84 -7 -99 9 -97 102 1 38 -1 125 -4 191 l-5 122 47 5 c26 3 103
4 171 2 69 -2 134 1 145 5 29 12 80 12 110 0z m-1050 -16 c3 -8 2 -12 -4 -9
-6 3 -10 10 -10 16 0 14 7 11 14 -7z m-374 -22 c0 -9 -5 -24 -10 -32 -7 -11
-10 -5 -10 23 0 23 4 36 10 32 6 -3 10 -14 10 -23z m1701 16 c2 -21 -2 -43
-10 -51 -4 -4 -7 9 -8 28 -1 32 15 52 18 23z m2859 -28 c-11 -20 -50 -28 -50
-10 0 6 9 10 19 10 11 0 23 5 26 10 12 19 16 10 5 -10z m-4759 -47 c-8 -15
-10 -15 -11 -2 0 17 10 32 18 25 2 -3 -1 -13 -7 -23z m2599 9 c0 -9 -40 -35
-46 -29 -6 6 25 37 37 37 5 0 9 -3 9 -8z m316 -127 c-4 -19 -12 -37 -18 -41
-8 -5 -9 -1 -5 10 4 10 7 36 7 59 1 35 2 39 11 24 6 -10 8 -34 5 -52z m1942
38 c-15 -16 -30 -45 -33 -65 -4 -21 -12 -38 -17 -38 -19 0 3 74 30 103 14 15
30 27 36 27 5 0 -2 -12 -16 -27z m-3855 -16 c-6 -12 -15 -33 -20 -47 -9 -23
-10 -23 -15 -3 -3 12 3 34 14 52 23 35 37 34 21 -2z m3282 -82 c-23 -18 -81
-35 -115 -34 -17 1 -11 5 21 13 25 7 54 18 65 24 30 18 53 15 29 -3z m-2585
-130 c-7 -8 -19 -15 -27 -15 -10 0 -7 8 9 31 18 24 24 27 26 14 2 -9 -2 -22
-8 -30z m-1775 -5 c-4 -12 -9 -19 -12 -17 -3 3 -2 15 2 27 4 12 9 19 12 17 3
-3 2 -15 -2 -27z m820 -29 c-9 -8 -25 21 -25 44 0 16 3 14 15 -9 9 -16 13 -32
10 -35z m2085 47 c0 -17 -31 -48 -47 -48 -11 0 -8 8 9 29 24 32 38 38 38 19z
m-1655 -47 c-11 -10 -35 11 -35 30 0 21 0 21 19 -2 11 -13 18 -26 16 -28z
m1221 24 c13 -14 21 -25 18 -25 -11 0 -54 33 -54 41 0 15 12 10 36 -16z
m-1428 -7 c-3 -7 -18 -14 -34 -15 -20 -1 -22 0 -6 4 12 2 22 9 22 14 0 5 5 9
11 9 6 0 9 -6 7 -12z m3574 -45 c8 -10 6 -13 -11 -13 -18 0 -21 6 -20 38 0 34
1 35 10 13 5 -14 15 -31 21 -38z m-4097 14 c19 -4 19 -4 2 -12 -18 -7 -46 16
-47 39 0 6 6 3 13 -6 6 -9 21 -18 32 -21z m1700 1 c19 -5 19 -5 2 -13 -18 -7
-46 17 -46 40 0 6 5 3 12 -6 7 -9 21 -19 32 -21z m-1970 12 c-3 -5 -21 -9 -38
-9 l-32 2 35 7 c19 4 36 8 38 9 2 0 0 -3 -3 -9z m350 0 c-27 -12 -35 -12 -35
0 0 6 12 10 28 9 24 0 25 -1 7 -9z m1350 0 c-3 -5 -18 -9 -33 -9 l-27 1 30 8
c17 4 31 8 33 9 2 0 0 -3 -3 -9z m355 0 c-19 -13 -30 -13 -30 0 0 6 10 10 23
10 18 0 19 -2 7 -10z m-2324 -35 c-6 -22 -11 -25 -44 -24 -31 2 -32 3 -9 6 18
3 32 14 39 29 14 30 23 24 14 -11z m2839 16 c-14 -14 -73 -26 -60 -13 6 5 19
12 30 15 34 8 40 8 30 -2z m212 -21 l48 -8 -47 -1 c-56 -1 -78 6 -78 26 0 12
3 13 14 3 8 -6 36 -15 63 -20z m116 -1 c-6 -6 -18 -6 -28 -3 -18 7 -18 8 1 14
23 9 39 1 27 -11z m633 -14 c31 5 35 4 21 -5 -9 -6 -34 -10 -55 -8 -31 3 -37
7 -40 28 l-3 25 19 -23 c16 -20 24 -23 58 -17z m939 15 c16 -7 11 -9 -20 -9
-29 -1 -36 2 -25 9 17 11 19 11 45 0z m-5445 -24 c6 -8 21 -16 33 -18 19 -3
20 -4 5 -10 -12 -5 -27 1 -45 17 -16 13 -23 25 -17 25 6 0 17 -6 24 -14z m150
-76 c0 -11 -4 -20 -10 -20 -14 0 -13 -103 1 -117 21 -21 2 -43 -36 -43 -19 0
-35 5 -35 11 0 8 -5 7 -15 -1 -21 -17 -44 2 -28 22 22 26 20 128 -2 128 -8 0
-15 9 -15 19 0 18 8 20 70 20 63 0 70 -2 70 -19z m1189 -63 c17 -32 31 -62 31
-66 0 -14 -43 -21 -57 -9 -7 6 -29 12 -48 14 -26 2 -35 -1 -40 -16 -4 -12 -12
-17 -21 -13 -8 3 -13 12 -10 19 3 8 1 14 -4 14 -18 0 -10 22 9 27 22 6 43 46
35 67 -3 9 5 20 23 30 34 18 38 14 82 -67z m2146 -8 l34 -67 -25 -6 c-14 -4
-31 -3 -37 2 -7 5 -29 12 -49 16 -31 6 -38 4 -38 -9 0 -8 -7 -15 -15 -15 -8 0
-15 7 -15 15 0 8 -4 15 -10 15 -19 0 -10 21 14 30 16 6 27 20 31 40 4 18 16
41 27 52 26 26 40 14 83 -73z m-3205 51 c8 -10 20 -26 27 -36 10 -17 12 -14
12 19 1 36 2 37 37 37 l37 0 -8 -72 c-3 -40 -11 -76 -17 -79 -20 -13 -43 3
-62 42 -27 56 -34 56 -41 4 -7 -42 -9 -44 -34 -39 -35 9 -34 6 -35 71 -1 41 4
62 14 70 18 15 50 7 70 -17z m280 11 c-5 -11 -15 -21 -21 -23 -13 -4 -14 -101
-3 -120 5 -8 1 -9 -10 -5 -10 4 -29 7 -42 7 -22 0 -24 3 -24 55 0 52 -1 55
-26 55 -19 0 -25 5 -22 18 2 13 17 18 68 23 36 3 71 6 78 7 9 2 10 -3 2 -17z
m178 -3 c3 -15 -4 -18 -32 -18 -25 0 -36 -4 -36 -15 0 -10 11 -15 35 -15 24 0
35 -5 35 -15 0 -11 -11 -15 -41 -15 -55 0 -47 -24 9 -28 29 -2 42 -8 42 -18 0
-16 -25 -17 -108 -7 l-53 6 2 56 c3 92 1 90 77 88 55 -2 67 -5 70 -19z m230
10 c18 -18 14 -56 -7 -77 -17 -17 -18 -21 -5 -40 14 -19 13 -21 -4 -21 -10 0
-28 11 -40 25 -24 27 -52 24 -52 -5 0 -24 -9 -29 -43 -23 -26 5 -27 7 -27 73
0 45 4 70 13 73 26 11 153 7 165 -5z m557 -2 c47 -20 47 -40 0 -32 -53 10 -77
-7 -73 -52 l3 -37 48 1 c26 0 47 -3 47 -6 0 -35 -108 -42 -140 -10 -29 29 -27
94 5 125 28 28 60 31 110 11z m213 -8 c3 -15 -4 -18 -38 -18 -50 0 -51 -22 -1
-30 44 -7 44 -24 -1 -28 -54 -5 -52 -32 2 -32 29 0 40 -4 40 -15 0 -17 -28
-19 -104 -9 l-46 7 0 72 0 72 72 -1 c61 -1 73 -4 76 -18z m312 6 c0 -9 -9 -18
-21 -21 -19 -5 -20 -12 -17 -69 3 -63 3 -63 -22 -58 -49 11 -50 12 -50 64 0
43 -3 50 -20 50 -13 0 -20 7 -20 20 0 17 8 20 68 23 37 2 70 4 75 5 4 1 7 -5
7 -14z m155 6 c65 -15 94 -73 62 -125 -14 -24 -25 -28 -92 -33 -44 -3 -54 0
-78 24 -34 34 -36 82 -4 111 37 34 53 37 112 23z m505 -3 c0 -8 -9 -40 -20
-72 -11 -31 -18 -60 -16 -64 3 -4 -9 -8 -25 -9 -25 -2 -31 3 -51 45 l-22 47
-21 -46 c-17 -38 -25 -47 -51 -50 -24 -3 -30 0 -32 17 -1 12 -8 40 -17 64 -21
59 -20 61 20 61 27 0 35 -4 35 -17 0 -10 4 -24 9 -32 7 -11 13 -6 25 23 14 35
18 37 53 34 32 -2 39 -7 41 -28 6 -43 19 -43 36 -1 15 40 36 55 36 28z m136
-4 c27 -45 64 -115 64 -122 0 -13 -42 -22 -54 -12 -6 5 -28 11 -49 15 -32 6
-38 4 -45 -13 -8 -24 -26 -16 -36 16 -5 16 -2 25 13 32 11 6 25 28 32 48 17
55 53 71 75 36z m840 -4 c22 -18 16 -32 -11 -25 -59 15 -94 -18 -74 -71 8 -21
15 -24 47 -22 40 3 66 -7 57 -21 -3 -5 -12 -7 -20 -3 -8 3 -15 1 -15 -4 0 -17
-111 4 -126 24 -26 34 -13 100 25 131 18 14 96 9 117 -9z m816 -54 l37 -70
-25 -8 c-16 -6 -30 -5 -40 3 -22 19 -81 22 -88 4 -7 -19 -26 -18 -26 1 0 8 -4
15 -10 15 -20 0 -9 21 15 30 24 9 30 24 27 63 -1 10 2 16 7 13 5 -3 12 1 15
10 4 9 15 14 28 12 17 -2 33 -22 60 -73z m183 61 c47 -20 47 -40 0 -32 -46 9
-75 -7 -75 -42 0 -45 13 -56 59 -49 30 4 41 2 41 -8 0 -32 -95 -35 -134 -4
-30 24 -34 64 -11 109 22 43 60 51 120 26z m398 4 c19 0 24 -26 6 -32 -13 -4
-16 -42 -5 -84 l7 -32 -55 -1 c-57 0 -68 7 -41 29 17 14 21 90 5 90 -5 0 -10
10 -10 21 0 19 4 21 38 15 20 -3 45 -6 55 -6z m117 0 c5 0 17 -13 27 -30 9
-16 21 -30 25 -30 4 0 8 14 8 30 0 28 3 30 36 30 l36 0 -5 -71 c-2 -42 -9 -74
-17 -79 -15 -9 -50 -1 -50 12 0 5 -11 25 -24 45 l-24 35 -9 -42 c-4 -23 -11
-41 -15 -41 -5 1 -19 1 -32 1 -23 0 -23 2 -20 67 3 66 15 88 42 78 8 -3 18 -5
22 -5z m317 -3 c21 -15 4 -27 -38 -27 -50 0 -49 -23 1 -30 50 -8 51 -30 1 -30
-30 0 -41 -4 -41 -15 0 -11 12 -15 45 -15 33 0 45 -4 45 -15 0 -17 -24 -19
-108 -8 l-54 6 6 66 c3 36 5 69 6 72 0 11 124 7 137 -4z m-4374 -7 c9 0 17 -4
17 -10 0 -5 -16 -10 -35 -10 -28 0 -35 -4 -35 -19 0 -15 8 -21 35 -23 20 -2
35 -7 35 -13 0 -5 -15 -11 -35 -13 -30 -3 -35 -7 -35 -28 0 -18 -5 -24 -23
-24 -13 0 -28 -5 -33 -10 -7 -7 -11 9 -13 51 -1 35 -6 70 -11 79 -7 13 -2 16
28 18 20 2 39 5 41 8 3 3 15 3 26 0 11 -3 28 -6 38 -6z m1856 -14 c23 -21 38
-20 51 4 6 11 17 20 25 20 16 0 20 -16 6 -24 -17 -11 -50 -94 -44 -114 4 -18
0 -20 -34 -19 l-38 2 3 40 c3 33 -1 45 -22 64 -36 34 -34 53 5 47 17 -2 39
-12 48 -20z m299 -18 c-3 -24 -1 -55 3 -70 6 -24 4 -29 -14 -32 -41 -9 -155
-14 -163 -7 -5 3 -10 36 -12 73 l-2 67 67 4 c38 2 81 4 97 5 27 2 28 1 24 -40z
m512 22 c0 -11 4 -20 9 -20 4 0 20 9 34 20 25 20 57 27 57 12 0 -5 -14 -18
-30 -31 l-30 -22 26 -44 c24 -41 24 -45 7 -45 -10 0 -27 14 -37 31 -21 35 -40
34 -44 -4 -3 -22 -8 -27 -32 -27 -39 0 -43 11 -35 86 l7 64 34 0 c27 0 34 -4
34 -20z m511 12 c0 -4 1 -36 2 -72 l2 -65 -32 -3 c-28 -3 -32 0 -39 30 l-7 33
-14 -33 c-16 -40 -34 -41 -51 -2 -16 35 -35 31 -26 -6 6 -22 3 -24 -30 -24
l-36 0 -1 55 c-1 30 -2 61 -3 68 -1 7 14 13 34 15 33 3 38 -1 59 -39 l24 -42
18 24 c10 13 19 29 19 35 0 5 4 14 10 20 11 11 70 16 71 6z m509 -28 c0 -31 3
-35 23 -32 17 2 23 11 25 36 3 29 6 32 36 32 l34 0 1 -75 1 -75 -29 0 c-23 0
-30 5 -35 26 -5 19 -12 25 -29 22 -17 -2 -22 -10 -22 -30 1 -24 -2 -27 -25
-22 -45 10 -50 13 -50 33 0 11 -6 21 -12 24 -10 4 -10 7 0 18 6 7 12 25 12 39
0 34 7 40 42 40 25 0 28 -3 28 -36z"/>
<path d="M800 860 c30 -24 44 -25 36 -4 -3 9 -6 18 -6 20 0 2 -12 4 -27 4
l-28 0 25 -20z"/>
<path d="M310 850 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M366 851 c-8 -12 21 -34 33 -27 6 4 8 13 4 21 -6 17 -29 20 -37 6z"/>
<path d="M920 586 c0 -9 7 -16 16 -16 9 0 14 5 12 12 -6 18 -28 21 -28 4z"/>
<path d="M965 419 c-4 -6 -5 -13 -2 -16 7 -7 27 6 27 18 0 12 -17 12 -25 -2z"/>
<path d="M362 388 c3 -7 15 -14 29 -16 24 -4 24 -3 4 12 -24 19 -38 20 -33 4z"/>
<path d="M4106 883 c-14 -14 -5 -31 14 -26 11 3 20 9 20 13 0 10 -26 20 -34
13z"/>
<path d="M4590 870 c-14 -10 -22 -22 -18 -25 7 -8 57 25 58 38 0 12 -14 8 -40
-13z"/>
<path d="M4380 655 c7 -8 17 -15 22 -15 6 0 5 7 -2 15 -7 8 -17 15 -22 15 -6
0 -5 -7 2 -15z"/>
<path d="M4082 560 c-6 -11 -12 -28 -12 -37 0 -13 6 -10 20 12 11 17 20 33 20
38 0 14 -15 7 -28 -13z"/>
<path d="M4496 466 c3 -9 11 -16 16 -16 13 0 5 23 -10 28 -7 2 -10 -2 -6 -12z"/>
<path d="M4236 445 c-9 -24 5 -41 16 -20 7 11 7 20 0 27 -6 6 -12 3 -16 -7z"/>
<path d="M4540 400 c0 -5 5 -10 11 -10 5 0 7 5 4 10 -3 6 -8 10 -11 10 -2 0
-4 -4 -4 -10z"/>
<path d="M5330 891 c0 -11 26 -22 34 -14 3 3 3 10 0 14 -7 12 -34 11 -34 0z"/>
<path d="M4805 880 c-8 -13 4 -32 16 -25 12 8 12 35 0 35 -6 0 -13 -4 -16 -10z"/>
<path d="M5070 821 l-35 -6 0 -75 0 -75 40 -3 c22 -2 58 3 80 10 38 12 40 16
47 63 12 88 -16 107 -132 86z m109 -36 c3 -19 2 -19 -15 -4 -11 9 -26 19 -34
22 -8 4 -2 5 15 4 21 -1 31 -8 34 -22z"/>
<path d="M5411 694 c0 -11 3 -14 6 -6 3 7 2 16 -1 19 -3 4 -6 -2 -5 -13z"/>
<path d="M5223 674 c-10 -22 -10 -25 3 -20 9 3 18 6 20 6 2 0 4 9 4 20 0 28
-13 25 -27 -6z"/>
<path d="M5001 422 c-14 -27 -12 -35 8 -23 7 5 11 17 9 27 -4 17 -5 17 -17 -4z"/>
<path d="M5673 883 c9 -9 19 -14 23 -11 10 10 -6 28 -24 28 -15 0 -15 -1 1
-17z"/>
<path d="M5866 717 c-14 -10 -16 -16 -7 -22 15 -9 35 8 30 24 -3 8 -10 7 -23
-2z"/>
<path d="M5700 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M5700 451 c0 -23 25 -46 34 -32 4 6 -2 19 -14 31 -19 19 -20 19 -20
1z"/>
<path d="M1375 850 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M1391 687 c-5 -12 -7 -35 -6 -50 2 -15 -1 -27 -7 -27 -5 0 -6 9 -3
21 5 15 4 19 -4 15 -6 -4 -11 -18 -11 -30 0 -19 7 -25 33 -29 17 -2 42 1 55 7
l22 12 -27 52 c-29 57 -39 63 -52 29z"/>
<path d="M1240 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M1575 490 c4 -14 9 -27 11 -29 7 -7 34 9 34 20 0 7 -3 9 -7 6 -3 -4
-15 1 -26 10 -19 17 -19 17 -12 -7z"/>
<path d="M3094 688 c-4 -13 -7 -35 -6 -50 1 -16 -2 -28 -8 -28 -5 0 -6 7 -3
17 4 11 3 14 -5 9 -16 -10 -15 -49 1 -43 6 2 20 0 29 -4 10 -6 27 -5 41 2 28
13 26 30 -8 86 -24 39 -31 41 -41 11z"/>
<path d="M3270 502 c0 -19 29 -47 39 -37 6 7 1 16 -15 28 -13 10 -24 14 -24 9z"/>
<path d="M3570 812 c-13 -10 -21 -24 -19 -31 3 -7 15 0 34 19 31 33 21 41 -15
12z"/>
<path d="M3855 480 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M3585 450 c3 -5 13 -10 21 -10 8 0 12 5 9 10 -3 6 -13 10 -21 10 -8
0 -12 -4 -9 -10z"/>
<path d="M1880 820 c0 -5 7 -10 16 -10 8 0 12 5 9 10 -3 6 -10 10 -16 10 -5 0
-9 -4 -9 -10z"/>
<path d="M2042 668 c-7 -7 -12 -23 -12 -37 1 -24 2 -24 16 8 16 37 14 47 -4
29z"/>
<path d="M2015 560 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M1915 470 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M2320 795 c0 -14 5 -25 10 -25 6 0 10 11 10 25 0 14 -4 25 -10 25 -5
0 -10 -11 -10 -25z"/>
<path d="M2660 771 c0 -6 5 -13 10 -16 6 -3 10 1 10 9 0 9 -4 16 -10 16 -5 0
-10 -4 -10 -9z"/>
<path d="M2487 763 c-4 -3 -7 -23 -7 -43 0 -36 1 -38 40 -43 68 -9 116 20 102
61 -3 10 -7 10 -18 1 -11 -9 -14 -7 -14 10 0 18 -6 21 -48 21 -27 0 -52 -3
-55 -7z"/>
<path d="M2320 719 c0 -5 5 -7 10 -4 6 3 10 8 10 11 0 2 -4 4 -10 4 -5 0 -10
-5 -10 -11z"/>
<path d="M2480 550 l0 -40 66 1 c58 1 67 4 76 25 18 39 -4 54 -78 54 l-64 0 0
-40z m40 15 c-7 -8 -16 -15 -21 -15 -5 0 -6 7 -3 15 4 8 13 15 21 15 13 0 13
-3 3 -15z"/>
<path d="M2665 527 c-4 -10 -5 -21 -1 -24 10 -10 18 4 13 24 -4 17 -4 17 -12
0z"/>
<path d="M1586 205 c-9 -23 -8 -25 9 -25 17 0 19 9 6 28 -7 11 -10 10 -15 -3z"/>
<path d="M3727 200 c-3 -13 0 -20 9 -20 15 0 19 26 5 34 -5 3 -11 -3 -14 -14z"/>
<path d="M1194 229 c-3 -6 -2 -15 3 -20 13 -13 43 -1 43 17 0 16 -36 19 -46 3z"/>
<path d="M2470 224 c-18 -46 -12 -73 15 -80 37 -9 52 1 59 40 5 26 3 41 -8 51
-23 24 -55 18 -66 -11z"/>
<path d="M3120 196 c0 -9 7 -16 16 -16 17 0 14 22 -4 28 -7 2 -12 -3 -12 -12z"/>
<path d="M4750 201 c0 -12 5 -21 10 -21 6 0 10 6 10 14 0 8 -4 18 -10 21 -5 3
-10 -3 -10 -14z"/>
<path d="M3515 229 c-8 -12 14 -31 30 -26 6 2 10 10 10 18 0 17 -31 24 -40 8z"/>
<path d="M3521 161 c-7 -5 -9 -11 -4 -14 14 -9 54 4 47 14 -7 11 -25 11 -43 0z"/>
</g>
</svg>

Before

Width:  |  Height:  |  Size: 18 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 10 KiB

View File

@ -1,85 +1 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
id="svg8"
version="1.1"
viewBox="0 0 176.61171 41.907883"
height="41.907883mm"
width="176.61171mm">
<defs
id="defs2" />
<metadata
id="metadata5">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<g
transform="translate(-0.74835286,-98.31182)"
id="layer1">
<flowRoot
transform="scale(0.26458333)"
style="font-style:normal;font-weight:normal;font-size:40px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none"
id="flowRoot4598"
xml:space="preserve"><flowRegion
id="flowRegion4600"><rect
y="415.4129"
x="-38.183765"
height="48.08326"
width="257.38687"
id="rect4602" /></flowRegion><flowPara
id="flowPara4604"></flowPara></flowRoot> <text
transform="scale(0.86288797,1.158899)"
id="text4777"
y="110.93711"
x="0.93061"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;line-height:4.25;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke:none;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
xml:space="preserve"><tspan
style="stroke-width:7.51955223"
id="tspan4775"
y="110.93711"
x="0.93061"><tspan
id="tspan4773"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:3.56786728px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
y="110.93711"
x="0.93061">waybackpy</tspan></tspan></text>
<rect
y="98.311821"
x="1.4967092"
height="4.8643045"
width="153.78688"
id="rect4644"
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4648"
width="153.78688"
height="4.490128"
x="23.573174"
y="135.72957" />
<rect
y="135.72957"
x="0.74835336"
height="4.4901319"
width="22.82482"
id="rect4650"
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4652"
width="21.702286"
height="4.8643003"
x="155.2836"
y="98.311821" />
</g>
</svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 176.612 41.908" height="158.392" width="667.51" xmlns:v="https://github.com/akamhy/waybackpy"><text transform="matrix(.862888 0 0 1.158899 -.748 -98.312)" y="110.937" x="0.931" xml:space="preserve" font-weight="bold" font-size="28.149" font-family="sans-serif" letter-spacing="0" word-spacing="0" writing-mode="lr-tb" fill="#003dff"><tspan y="110.937" x="0.931"><tspan y="110.937" x="0.931" letter-spacing="3.568" writing-mode="lr-tb">waybackpy</tspan></tspan></text><path d="M.749 0h153.787v4.864H.749zm22.076 37.418h153.787v4.49H22.825z" fill="navy"/><path d="M0 37.418h22.825v4.49H0zM154.536 0h21.702v4.864h-21.702z" fill="#f0f"/></svg>

Before

Width:  |  Height:  |  Size: 3.6 KiB

After

Width:  |  Height:  |  Size: 694 B

View File

@ -1 +1,2 @@
requests>=2.24.0
click
requests

View File

@ -19,21 +19,18 @@ setup(
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/2.4.0.tar.gz",
download_url="https://github.com/akamhy/waybackpy/archive/3.0.0.tar.gz",
keywords=[
"Archive It",
"Archive Website",
"Wayback Machine",
"waybackurls",
"Internet Archive",
],
install_requires=["requests"],
install_requires=["requests", "click"],
python_requires=">=3.4",
classifiers=[
"Development Status :: 5 - Production/Stable",
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Natural Language :: English",
"Topic :: Software Development :: Build Tools",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",

View File

View File

@ -1,93 +0,0 @@
import pytest
from waybackpy.cdx import Cdx
from waybackpy.exceptions import WaybackError
def test_all_cdx():
url = "akamhy.github.io"
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, \
like Gecko) Chrome/45.0.2454.85 Safari/537.36"
cdx = Cdx(
url=url,
user_agent=user_agent,
start_timestamp=2017,
end_timestamp=2020,
filters=[
"statuscode:200",
"mimetype:text/html",
"timestamp:20201002182319",
"original:https://akamhy.github.io/",
],
gzip=False,
collapses=["timestamp:10", "digest"],
limit=50,
match_type="prefix",
)
snapshots = cdx.snapshots()
for snapshot in snapshots:
ans = snapshot.archive_url
assert "https://web.archive.org/web/20201002182319/https://akamhy.github.io/" == ans
url = "akahfjgjkmhy.gihthub.ip"
cdx = Cdx(
url=url,
user_agent=user_agent,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=True,
collapses=[],
limit=10,
)
snapshots = cdx.snapshots()
print(snapshots)
i = 0
for _ in snapshots:
i += 1
assert i == 0
url = "https://github.com/akamhy/waybackpy/*"
cdx = Cdx(url=url, user_agent=user_agent, limit=50)
snapshots = cdx.snapshots()
for snapshot in snapshots:
print(snapshot.archive_url)
url = "https://github.com/akamhy/waybackpy"
with pytest.raises(WaybackError):
cdx = Cdx(url=url, user_agent=user_agent, limit=50, filters=["ghddhfhj"])
snapshots = cdx.snapshots()
with pytest.raises(WaybackError):
cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp", "ghdd:hfhj"])
snapshots = cdx.snapshots()
url = "https://github.com"
cdx = Cdx(url=url, user_agent=user_agent, limit=50)
snapshots = cdx.snapshots()
c = 0
for snapshot in snapshots:
c += 1
if c > 100:
break
url = "https://github.com/*"
cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp"])
snapshots = cdx.snapshots()
c = 0
for snapshot in snapshots:
c += 1
if c > 30_529: # deafult limit is 10k
break
url = "https://github.com/*"
cdx = Cdx(url=url, user_agent=user_agent)
c = 0
snapshots = cdx.snapshots()
for snapshot in snapshots:
c += 1
if c > 100_529:
break

View File

@ -1,418 +0,0 @@
import sys
import os
import pytest
import random
import string
import argparse
sys.path.append("..")
import waybackpy.cli as cli # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
def test_save():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=True,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
args = argparse.Namespace(
user_agent=None,
url="https://hfjfjfjfyu6r6rfjvj.fjhgjhfjgvjm",
total=False,
version=False,
oldest=False,
save=True,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "could happen because either your waybackpy" in str(reply)
def test_json():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=True,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "archived_snapshots" in str(reply)
def test_archive_url():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=True,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "https://web.archive.org/web/" in str(reply)
def test_oldest():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_newest():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_total_archives():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=True,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert isinstance(reply, int)
def test_known_urls():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://www.keybr.com",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=True,
get=None,
)
reply = cli.args_handler(args)
assert "keybr" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://akfyfufyjcujfufu6576r76r6amhy.gitd6r67r6u6hub.yfjyfjio",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=True,
subdomain=True,
known_urls=True,
get=None,
)
reply = cli.args_handler(args)
assert "No known URLs found" in str(reply)
def test_near():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
alive=False,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "202007" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
alive=False,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_get():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://github.com/akamhy",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="url",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://github.com/akamhy/waybackpy",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="oldest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://akamhy.github.io/waybackpy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="newest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="save",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="foobar",
)
reply = cli.args_handler(args)
assert "get the source code of the" in str(reply)
def test_args_handler():
args = argparse.Namespace(version=True)
reply = cli.args_handler(args)
assert ("waybackpy version %s" % (__version__)) == reply
args = argparse.Namespace(url=None, version=False)
reply = cli.args_handler(args)
assert ("waybackpy %s" % (__version__)) in str(reply)
def test_main():
# This also tests the parse_args method in cli.py
cli.main(["temp.py", "--version"])

View File

@ -1,32 +0,0 @@
import pytest
from waybackpy.snapshot import CdxSnapshot, datetime
def test_CdxSnapshot():
sample_input = "org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415"
prop_values = sample_input.split(" ")
properties = {}
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
snapshot = CdxSnapshot(properties)
assert properties["urlkey"] == snapshot.urlkey
assert properties["timestamp"] == snapshot.timestamp
assert properties["original"] == snapshot.original
assert properties["mimetype"] == snapshot.mimetype
assert properties["statuscode"] == snapshot.statuscode
assert properties["digest"] == snapshot.digest
assert properties["length"] == snapshot.length
assert datetime.strptime(properties["timestamp"], "%Y%m%d%H%M%S") == snapshot.datetime_timestamp
archive_url = "https://web.archive.org/web/" + properties["timestamp"] + "/" + properties["original"]
assert archive_url == snapshot.archive_url
assert archive_url == str(snapshot)

View File

@ -1,186 +0,0 @@
import pytest
import json
from waybackpy.utils import (
_cleaned_url,
_url_check,
_full_url,
URLError,
WaybackError,
_get_total_pages,
_archive_url_parser,
_wayback_timestamp,
_get_response,
_check_match_type,
_check_collapses,
_check_filters,
_ts,
)
def test_ts():
timestamp = True
data = {}
assert _ts(timestamp, data)
data = """
{"archived_snapshots": {"closest": {"timestamp": "20210109155628", "available": true, "status": "200", "url": "http://web.archive.org/web/20210109155628/https://www.google.com/"}}, "url": "https://www.google.com/"}
"""
data = json.loads(data)
assert data["archived_snapshots"]["closest"]["timestamp"] == "20210109155628"
def test_check_filters():
filters = []
_check_filters(filters)
filters = ["statuscode:200", "timestamp:20215678901234", "original:https://url.com"]
_check_filters(filters)
with pytest.raises(WaybackError):
_check_filters("not-list")
def test_check_collapses():
collapses = []
_check_collapses(collapses)
collapses = ["timestamp:10"]
_check_collapses(collapses)
collapses = ["urlkey"]
_check_collapses(collapses)
collapses = "urlkey" # NOT LIST
with pytest.raises(WaybackError):
_check_collapses(collapses)
collapses = ["also illegal collapse"]
with pytest.raises(WaybackError):
_check_collapses(collapses)
def test_check_match_type():
assert None == _check_match_type(None, "url")
match_type = "exact"
url = "test_url"
assert None == _check_match_type(match_type, url)
url = "has * in it"
with pytest.raises(WaybackError):
_check_match_type("domain", url)
with pytest.raises(WaybackError):
_check_match_type("not a valid type", "url")
def test_cleaned_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network%20security"
assert answer == _cleaned_url(test_url)
def test_url_check():
good_url = "https://akamhy.github.io"
assert None == _url_check(good_url)
bad_url = "https://github-com"
with pytest.raises(URLError):
_url_check(bad_url)
def test_full_url():
params = {}
endpoint = "https://web.archive.org/cdx/search/cdx"
assert endpoint == _full_url(endpoint, params)
params = {"a": "1"}
assert "https://web.archive.org/cdx/search/cdx?a=1" == _full_url(endpoint, params)
assert "https://web.archive.org/cdx/search/cdx?a=1" == _full_url(
endpoint + "?", params
)
params["b"] = 2
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2" == _full_url(
endpoint + "?", params
)
params["c"] = "foo bar"
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2&c=foo%20bar" == _full_url(
endpoint + "?", params
)
def test_get_total_pages():
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
url = "github.com*"
assert 212890 <= _get_total_pages(url, user_agent)
url = "https://zenodo.org/record/4416138"
assert 2 >= _get_total_pages(url, user_agent)
def test_archive_url_parser():
perfect_header = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:40:25 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Server': 'nginx', 'X-Archive-Orig-Date': 'Sat, 02 Jan 2021 09:40:09 GMT', 'X-Archive-Orig-Transfer-Encoding': 'chunked', 'X-Archive-Orig-Connection': 'keep-alive', 'X-Archive-Orig-Vary': 'Accept-Encoding', 'X-Archive-Orig-Last-Modified': 'Fri, 01 Jan 2021 12:19:00 GMT', 'X-Archive-Orig-Strict-Transport-Security': 'max-age=31536000, max-age=0;', 'X-Archive-Guessed-Content-Type': 'text/html', 'X-Archive-Guessed-Charset': 'utf-8', 'Memento-Datetime': 'Sat, 02 Jan 2021 09:40:09 GMT', 'Link': '<https://www.scribbr.com/citing-sources/et-al/>; rel="original", <https://web.archive.org/web/timemap/link/https://www.scribbr.com/citing-sources/et-al/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.scribbr.com/citing-sources/et-al/>; rel="timegate", <https://web.archive.org/web/20200601082911/https://www.scribbr.com/citing-sources/et-al/>; rel="first memento"; datetime="Mon, 01 Jun 2020 08:29:11 GMT", <https://web.archive.org/web/20201126185327/https://www.scribbr.com/citing-sources/et-al/>; rel="prev memento"; datetime="Thu, 26 Nov 2020 18:53:27 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="last memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'X-Archive-Src': 'spn2-20210102092956-wwwb-spn20.us.archive.org-8001.warc.gz', 'Server-Timing': 'captures_list;dur=112.646325, exclusion.robots;dur=0.172010, exclusion.robots.policy;dur=0.158205, RedisCDXSource;dur=2.205932, esindex;dur=0.014647, LoadShardBlock;dur=82.205012, PetaboxLoader3.datanode;dur=70.750239, CDXLines.iter;dur=24.306278, load_resource;dur=26.520179', 'X-App-Server': 'wwwb-app200', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/IN', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0', 'Content-Encoding': 'gzip'}
"""
archive = _archive_url_parser(
perfect_header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "web.archive.org/web/20210102094009" in archive
header = """
vhgvkjv
Content-Location: /web/20201126185327/https://www.scribbr.com/citing-sources/et-al
ghvjkbjmmcmhj
"""
archive = _archive_url_parser(
header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "20201126185327" in archive
header = """
hfjkfjfcjhmghmvjm
X-Cache-Key: https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al/US
yfu,u,gikgkikik
"""
archive = _archive_url_parser(
header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "20171128185327" in archive
# The below header should result in Exception
no_archive_header = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:42:45 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache', 'X-App-Server': 'wwwb-app52', 'X-ts': '523', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0'}
"""
with pytest.raises(WaybackError):
_archive_url_parser(
no_archive_header, "https://www.scribbr.com/citing-sources/et-al/"
)
def test_wayback_timestamp():
ts = _wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4)
assert "202001020304" in str(ts)
def test_get_response():
endpoint = "https://www.google.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = _get_response(endpoint, params=None, headers=headers)
assert response.status_code == 200
endpoint = "http/wwhfhfvhvjhmom"
with pytest.raises(WaybackError):
_get_response(endpoint, params=None, headers=headers)
endpoint = "https://akamhy.github.io"
url, response = _get_response(
endpoint, params=None, headers=headers, return_full_url=True
)
assert endpoint == url

View File

@ -1,145 +0,0 @@
import sys
import pytest
import random
import requests
from datetime import datetime
from waybackpy.wrapper import Url, Cdx
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_url_check():
"""No API Use"""
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
Url(broken_url, user_agent)
def test_save():
# Test for urls that exist and can be archived.
url_list = [
"en.wikipedia.org",
"akamhy.github.io",
"www.wiktionary.org",
"www.w3schools.com",
"youtube.com",
]
x = random.randint(0, len(url_list) - 1)
url1 = url_list[x]
target = Url(
url1,
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36",
)
archived_url1 = str(target.save())
assert url1 in archived_url1
# Test for urls that are incorrect.
with pytest.raises(Exception):
url2 = "ha ha ha ha"
Url(url2, user_agent)
url3 = "http://www.archive.is/faq.html"
with pytest.raises(Exception):
target = Url(
url3,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
"AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
"Safari/533.20.27",
)
target.save()
def test_near():
url = "google.com"
target = Url(
url,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 "
"(KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
)
archive_near_year = target.near(year=2010)
assert "2010" in str(archive_near_year.timestamp)
archive_near_month_year = str(target.near(year=2015, month=2).timestamp)
assert (
("2015-02" in archive_near_month_year)
or ("2015-01" in archive_near_month_year)
or ("2015-03" in archive_near_month_year)
)
target = Url(
"www.python.org",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
)
archive_near_hour_day_month_year = str(
target.near(year=2008, month=5, day=9, hour=15)
)
assert (
("2008050915" in archive_near_hour_day_month_year)
or ("2008050914" in archive_near_hour_day_month_year)
or ("2008050913" in archive_near_hour_day_month_year)
)
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
def test_oldest():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
o = target.oldest()
assert "20200504141153" in str(o)
assert "2020-05-04" in str(o._timestamp)
def test_json():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
assert "archived_snapshots" in str(target.JSON)
def test_archive_url():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
assert "github.com/akamhy" in str(target.archive_url)
def test_newest():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
assert url in str(target.newest())
def test_get():
target = Url("google.com", user_agent)
assert "Welcome to Google" in target.get(target.oldest())
def test_total_archives():
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
target = Url(" https://outlook.com ", user_agent)
assert target.total_archives() > 80000
target = Url(
" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
)
assert target.total_archives() == 0
def test_known_urls():
target = Url("akamhy.github.io", user_agent)
assert len(target.known_urls(alive=True, subdomain=False)) > 2
target = Url("akamhy.github.io", user_agent)
assert len(target.known_urls()) > 3

View File

@ -1,50 +1,7 @@
# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
# ┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
# ┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
# ━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
# ━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
"""
Waybackpy is a Python package & command-line program that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Archive webpage and retrieve archived URLs easily.
Usage:
>>> import waybackpy
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
>>> str(archive)
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)
>>> oldest_archive = wayback.oldest()
>>> str(oldest_archive)
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> str(archive_close_to_2010_feb)
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> str(wayback.newest())
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
Full documentation @ <https://github.com/akamhy/waybackpy/wiki>.
:copyright: (c) 2020-2021 AKash Mahanty Et al.
:license: MIT
"""
from .wrapper import Url, Cdx
from .wrapper import Url
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .__version__ import (
__title__,
__description__,

View File

@ -1,11 +1,11 @@
__title__ = "waybackpy"
__description__ = (
"A Python package that interfaces with the Internet Archive's Wayback Machine API. "
"Python package that interfaces with the Internet Archive's Wayback Machine APIs. "
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "2.4.0"
__version__ = "3.0.0"
__author__ = "akamhy"
__author_email__ = "akamhy@yahoo.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020-2021 Akash Mahanty et al."
__copyright__ = "Copyright 2020-2022 Akash Mahanty et al."

View File

@ -0,0 +1,109 @@
import re
import time
import requests
from datetime import datetime
from .__version__ import __version__
from .utils import DEFAULT_USER_AGENT
def full_url(endpoint, params):
if not params:
return endpoint.strip()
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = (
full_url
+ amp
+ "{key}={val}".format(key=key, val=requests.utils.quote(str(val)))
)
return full_url
class WaybackMachineAvailabilityAPI:
def __init__(self, url, user_agent=DEFAULT_USER_AGENT):
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = user_agent
self.headers = {"User-Agent": self.user_agent}
self.payload = {"url": "{url}".format(url=self.url)}
self.endpoint = "https://archive.org/wayback/available"
self.JSON = None
def unix_timestamp_to_wayback_timestamp(self, unix_timestamp):
return datetime.utcfromtimestamp(int(unix_timestamp)).strftime("%Y%m%d%H%M%S")
def __repr__(self):
return str(self) # self.__str__()
def __str__(self):
if not self.JSON:
return None
return self.archive_url
def json(self):
self.request_url = full_url(self.endpoint, self.payload)
self.response = requests.get(self.request_url, self.headers)
self.JSON = self.response.json()
return self.JSON
def timestamp(self):
if not self.JSON["archived_snapshots"] or not self.JSON:
return datetime.max
return datetime.strptime(
self.JSON["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
@property
def archive_url(self):
data = self.JSON
if not data["archived_snapshots"]:
archive_url = None
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
return archive_url
def wayback_timestamp(self, **kwargs):
return "".join(
str(kwargs[key]).zfill(2)
for key in ["year", "month", "day", "hour", "minute"]
)
def oldest(self):
return self.near(year=1994)
def newest(self):
return self.near(unix_timestamp=int(time.time()))
def near(
self,
year=None,
month=None,
day=None,
hour=None,
minute=None,
unix_timestamp=None,
):
if unix_timestamp:
timestamp = self.unix_timestamp_to_wayback_timestamp(unix_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = self.wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
)
self.payload["timestamp"] = timestamp
self.json()
return self

View File

@ -1,205 +0,0 @@
from .snapshot import CdxSnapshot
from .exceptions import WaybackError
from .utils import (
_get_total_pages,
_get_response,
default_user_agent,
_check_filters,
_check_collapses,
_check_match_type,
_add_payload,
)
# TODO : Threading support for pagination API. It's designed for Threading.
class Cdx:
def __init__(
self,
url,
user_agent=default_user_agent,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=True,
collapses=[],
limit=10000,
):
self.url = str(url).strip()
self.user_agent = str(user_agent)
self.start_timestamp = str(start_timestamp) if start_timestamp else None
self.end_timestamp = str(end_timestamp) if end_timestamp else None
self.filters = filters
_check_filters(self.filters)
self.match_type = str(match_type).strip() if match_type else None
_check_match_type(self.match_type, self.url)
self.gzip = gzip
self.collapses = collapses
_check_collapses(self.collapses)
self.limit = limit
self.last_api_request_url = None
self.use_page = False
def cdx_api_manager(self, payload, headers, use_page=False):
"""
We have two options to get the snapshots, we use this
method to make a selection between pagination API and
the normal one with Resumption Key, sequential querying
of CDX data. For very large querying (for example domain query),
it may be useful to perform queries in parallel and also estimate
the total size of the query.
read more about the pagination API at:
https://web.archive.org/web/20201228063237/https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#pagination-api
if use_page is false if will use the normal sequential query API,
else use the pagination API.
two mutually exclusive cases possible:
1) pagination API is selected
a) get the total number of pages to read, using _get_total_pages()
b) then we use a for loop to get all the pages and yield the response text
2) normal sequential query API is selected.
a) get use showResumeKey=true to ask the API to add a query resumption key
at the bottom of response
b) check if the page has more than 3 lines, if not return the text
c) if it has atleast three lines, we check the second last line for zero length.
d) if the second last line has length zero than we assume that the last line contains
the resumption key, we set the resumeKey and remove the resumeKey from text
e) if the second line has non zero length we return the text as there will no resumption key
f) if we find the resumption key we set the "more" variable status to True which is always set
to False on each iteration. If more is not True the iteration stops and function returns.
"""
endpoint = "https://web.archive.org/cdx/search/cdx"
if use_page == True:
total_pages = _get_total_pages(self.url, self.user_agent)
for i in range(total_pages):
payload["page"] = str(i)
url, res = _get_response(
endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
yield res.text
else:
payload["showResumeKey"] = "true"
payload["limit"] = str(self.limit)
resumeKey = None
more = True
while more:
if resumeKey:
payload["resumeKey"] = resumeKey
url, res = _get_response(
endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text.strip()
lines = text.splitlines()
more = False
if len(lines) >= 3:
second_last_line = lines[-2]
if len(second_last_line) == 0:
resumeKey = lines[-1].strip()
text = text.replace(resumeKey, "", 1).strip()
more = True
yield text
def snapshots(self):
"""
This function yeilds snapshots encapsulated
in CdxSnapshot for more usability.
All the get request values are set if the conditions match
And we use logic that if someone's only inputs don't have any
of [start_timestamp, end_timestamp] and don't use any collapses
then we use the pagination API as it returns archives starting
from the first archive and the recent most archive will be on
the last page.
"""
payload = {}
headers = {"User-Agent": self.user_agent}
_add_payload(self, payload)
if not self.start_timestamp or self.end_timestamp:
self.use_page = True
if self.collapses != []:
self.use_page = False
texts = self.cdx_api_manager(payload, headers, use_page=self.use_page)
for text in texts:
if text.isspace() or len(text) <= 1 or not text:
continue
snapshot_list = text.split("\n")
for snapshot in snapshot_list:
if len(snapshot) < 46: # 14 + 32 (timestamp+digest)
continue
properties = {
"urlkey": None,
"timestamp": None,
"original": None,
"mimetype": None,
"statuscode": None,
"digest": None,
"length": None,
}
prop_values = snapshot.split(" ")
# Making sure that we get the same number of
# property values as the number of properties
prop_values_len = len(prop_values)
properties_len = len(properties)
if prop_values_len != properties_len:
raise WaybackError(
"Snapshot returned by Cdx API has %s properties instead of expected %s properties.\nInvolved Snapshot : %s"
% (prop_values_len, properties_len, snapshot)
)
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
yield CdxSnapshot(properties)

185
waybackpy/cdx_api.py Normal file
View File

@ -0,0 +1,185 @@
from .exceptions import WaybackError
from .cdx_snapshot import CDXSnapshot
from .cdx_utils import (
get_total_pages,
get_response,
check_filters,
check_collapses,
check_match_type,
)
from .utils import DEFAULT_USER_AGENT
class WaybackMachineCDXServerAPI:
def __init__(
self,
url,
user_agent=None,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=None,
collapses=[],
limit=None,
):
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = str(user_agent) if user_agent else DEFAULT_USER_AGENT
self.start_timestamp = str(start_timestamp) if start_timestamp else None
self.end_timestamp = str(end_timestamp) if end_timestamp else None
self.filters = filters
check_filters(self.filters)
self.match_type = str(match_type).strip() if match_type else None
check_match_type(self.match_type, self.url)
self.gzip = gzip if gzip else True
self.collapses = collapses
check_collapses(self.collapses)
self.limit = limit if limit else 5000
self.last_api_request_url = None
self.use_page = False
self.endpoint = "https://web.archive.org/cdx/search/cdx"
def cdx_api_manager(self, payload, headers, use_page=False):
total_pages = get_total_pages(self.url, self.user_agent)
# If we only have two or less pages of archives then we care for accuracy
# pagination API can be lagged sometimes
if use_page == True and total_pages >= 2:
blank_pages = 0
for i in range(total_pages):
payload["page"] = str(i)
url, res = get_response(
self.endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text
if len(text) == 0:
blank_pages += 1
if blank_pages >= 2:
break
yield text
else:
payload["showResumeKey"] = "true"
payload["limit"] = str(self.limit)
resumeKey = None
more = True
while more:
if resumeKey:
payload["resumeKey"] = resumeKey
url, res = get_response(
self.endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text.strip()
lines = text.splitlines()
more = False
if len(lines) >= 3:
second_last_line = lines[-2]
if len(second_last_line) == 0:
resumeKey = lines[-1].strip()
text = text.replace(resumeKey, "", 1).strip()
more = True
yield text
def add_payload(self, payload):
if self.start_timestamp:
payload["from"] = self.start_timestamp
if self.end_timestamp:
payload["to"] = self.end_timestamp
if self.gzip != True:
payload["gzip"] = "false"
if self.match_type:
payload["matchType"] = self.match_type
if self.filters and len(self.filters) > 0:
for i, f in enumerate(self.filters):
payload["filter" + str(i)] = f
if self.collapses and len(self.collapses) > 0:
for i, f in enumerate(self.collapses):
payload["collapse" + str(i)] = f
# Don't need to return anything as it's dictionary.
payload["url"] = self.url
def snapshots(self):
payload = {}
headers = {"User-Agent": self.user_agent}
self.add_payload(payload)
if not self.start_timestamp or self.end_timestamp:
self.use_page = True
if self.collapses != []:
self.use_page = False
texts = self.cdx_api_manager(payload, headers, use_page=self.use_page)
for text in texts:
if text.isspace() or len(text) <= 1 or not text:
continue
snapshot_list = text.split("\n")
for snapshot in snapshot_list:
if len(snapshot) < 46: # 14 + 32 (timestamp+digest)
continue
properties = {
"urlkey": None,
"timestamp": None,
"original": None,
"mimetype": None,
"statuscode": None,
"digest": None,
"length": None,
}
prop_values = snapshot.split(" ")
prop_values_len = len(prop_values)
properties_len = len(properties)
if prop_values_len != properties_len:
raise WaybackError(
"Snapshot returned by Cdx API has {prop_values_len} properties instead of expected {properties_len} properties.\nInvolved Snapshot : {snapshot}".format(
prop_values_len=prop_values_len,
properties_len=properties_len,
snapshot=snapshot,
)
)
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
yield CDXSnapshot(properties)

View File

@ -1,16 +1,7 @@
from datetime import datetime
class CdxSnapshot:
"""
This class helps to use the Cdx Snapshots easily.
Raw Snapshot data looks like:
org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
properties is a dict containg all of the 7 cdx snapshot properties.
"""
class CDXSnapshot:
def __init__(self, properties):
self.urlkey = properties["urlkey"]
self.timestamp = properties["timestamp"]
@ -25,4 +16,12 @@ class CdxSnapshot:
)
def __str__(self):
return self.archive_url
return "{urlkey} {timestamp} {original} {mimetype} {statuscode} {digest} {length}".format(
urlkey=self.urlkey,
timestamp=self.timestamp,
original=self.original,
mimetype=self.mimetype,
statuscode=self.statuscode,
digest=self.digest,
length=self.length,
)

154
waybackpy/cdx_utils.py Normal file
View File

@ -0,0 +1,154 @@
import re
import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .exceptions import WaybackError
def get_total_pages(url, user_agent):
request_url = (
"https://web.archive.org/cdx/search/cdx?url={url}&showNumPages=true".format(
url=url
)
)
headers = {"User-Agent": user_agent}
return int((requests.get(request_url, headers=headers).text).strip())
def full_url(endpoint, params):
if not params:
return endpoint
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = (
full_url
+ amp
+ "{key}={val}".format(key=key, val=requests.utils.quote(str(val)))
)
return full_url
def get_response(
endpoint,
params=None,
headers=None,
return_full_url=False,
retries=5,
backoff_factor=0.5,
no_raise_on_redirects=False,
):
s = requests.Session()
retries = Retry(
total=retries,
backoff_factor=backoff_factor,
status_forcelist=[500, 502, 503, 504],
)
s.mount("https://", HTTPAdapter(max_retries=retries))
# The URL with parameters required for the get request
url = full_url(endpoint, params)
try:
if not return_full_url:
return s.get(url, headers=headers)
return (url, s.get(url, headers=headers))
except Exception as e:
reason = str(e)
if no_raise_on_redirects:
if "Exceeded 30 redirects" in reason:
return
exc_message = "Error while retrieving {url}.\n{reason}".format(
url=url, reason=reason
)
exc = WaybackError(exc_message)
exc.__cause__ = e
raise exc
def check_filters(filters):
if not isinstance(filters, list):
raise WaybackError("filters must be a list.")
# [!]field:regex
for _filter in filters:
try:
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):(.*)",
_filter,
)
key = match.group(1)
val = match.group(2)
except Exception:
exc_message = (
"Filter '{_filter}' is not following the cdx filter syntax.".format(
_filter=_filter
)
)
raise WaybackError(exc_message)
def check_collapses(collapses):
if not isinstance(collapses, list):
raise WaybackError("collapses must be a list.")
if len(collapses) == 0:
return
for collapse in collapses:
try:
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?",
collapse,
)
field = match.group(1)
N = None
if 2 == len(match.groups()):
N = match.group(2)
if N:
if not (field + N == collapse):
raise Exception
else:
if not (field == collapse):
raise Exception
except Exception:
exc_message = "collapse argument '{collapse}' is not following the cdx collapse syntax.".format(
collapse=collapse
)
raise WaybackError(exc_message)
def check_match_type(match_type, url):
if not match_type:
return
if "*" in url:
raise WaybackError("Can not use wildcard with match_type argument")
legal_match_type = ["exact", "prefix", "host", "domain"]
if match_type not in legal_match_type:
exc_message = "{match_type} is not an allowed match type.\nUse one from 'exact', 'prefix', 'host' or 'domain'".format(
match_type=match_type
)
raise WaybackError(exc_message)

View File

@ -1,312 +1,347 @@
import os
import click
import re
import sys
import os
import json as JSON
import random
import string
import argparse
from .wrapper import Url
from .exceptions import WaybackError
from .__version__ import __version__
from .utils import DEFAULT_USER_AGENT
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .wrapper import Url
def _save(obj):
try:
return obj.save()
except Exception as err:
e = str(err)
m = re.search(r"Header:\n(.*)", e)
if m:
header = m.group(1)
if "No archive URL found in the API response" in e:
return (
"\n[waybackpy] Can not save/archive your link.\n[waybackpy] This "
"could happen because either your waybackpy (%s) is likely out of "
"date or Wayback Machine is malfunctioning.\n[waybackpy] Visit "
"https://github.com/akamhy/waybackpy for the latest version of "
"waybackpy.\n[waybackpy] API response Header :\n%s"
% (__version__, header)
@click.command()
@click.option(
"-u", "--url", help="URL on which Wayback machine operations are to be performed."
)
@click.option(
"-ua",
"--user-agent",
"--user_agent",
default=DEFAULT_USER_AGENT,
help="User agent, default user agent is '%s' " % DEFAULT_USER_AGENT,
)
@click.option(
"-v", "--version", is_flag=True, default=False, help="Print waybackpy version."
)
@click.option(
"-n",
"--newest",
"-au",
"--archive_url",
"--archive-url",
default=False,
is_flag=True,
help="Fetch the newest archive of the specified URL",
)
@click.option(
"-o",
"--oldest",
default=False,
is_flag=True,
help="Fetch the oldest archive of the specified URL",
)
@click.option(
"-j",
"--json",
default=False,
is_flag=True,
help="Spit out the JSON data for availability_api commands.",
)
@click.option(
"-N", "--near", default=False, is_flag=True, help="Archive near specified time."
)
@click.option("-Y", "--year", type=click.IntRange(1994, 9999), help="Year in integer.")
@click.option("-M", "--month", type=click.IntRange(1, 12), help="Month in integer.")
@click.option("-D", "--day", type=click.IntRange(1, 31), help="Day in integer.")
@click.option("-H", "--hour", type=click.IntRange(0, 24), help="Hour in integer.")
@click.option("-MIN", "--minute", type=click.IntRange(0, 60), help="Minute in integer.")
@click.option(
"-s",
"--save",
default=False,
is_flag=True,
help="Save the specified URL's webpage and print the archive URL.",
)
@click.option(
"-h",
"--headers",
default=False,
is_flag=True,
help="Spit out the headers data for save_api commands.",
)
@click.option(
"-ku",
"--known-urls",
"--known_urls",
default=False,
is_flag=True,
help="List known URLs. Uses CDX API.",
)
@click.option(
"-sub",
"--subdomain",
default=False,
is_flag=True,
help="Use with '--known_urls' to include known URLs for subdomains.",
)
@click.option(
"-f",
"--file",
default=False,
is_flag=True,
help="Use with '--known_urls' to save the URLs in file at current directory.",
)
@click.option(
"-c",
"--cdx",
default=False,
is_flag=True,
help="Spit out the headers data for save_api commands.",
)
@click.option(
"-st",
"--start-timestamp",
"--start_timestamp",
)
@click.option(
"-et",
"--end-timestamp",
"--end_timestamp",
)
@click.option(
"-f",
"--filters",
multiple=True,
)
@click.option(
"-mt",
"--match-type",
"--match_type",
)
@click.option(
"-gz",
"--gzip",
)
@click.option(
"-c",
"--collapses",
multiple=True,
)
@click.option(
"-l",
"--limit",
)
@click.option(
"-cp",
"--cdx-print",
"--cdx_print",
multiple=True,
)
def main(
url,
user_agent,
version,
newest,
oldest,
json,
near,
year,
month,
day,
hour,
minute,
save,
headers,
known_urls,
subdomain,
file,
cdx,
start_timestamp,
end_timestamp,
filters,
match_type,
gzip,
collapses,
limit,
cdx_print,
):
"""
┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
waybackpy : Python package & CLI tool that interfaces the Wayback Machine API
Released under the MIT License.
License @ https://github.com/akamhy/waybackpy/blob/master/LICENSE
Copyright (c) 2020 waybackpy contributors. Contributors list @
https://github.com/akamhy/waybackpy/graphs/contributors
https://github.com/akamhy/waybackpy
https://pypi.org/project/waybackpy
"""
if version:
click.echo("waybackpy version %s" % __version__)
return
if not url:
click.echo("No URL detected. Please pass an URL.")
return
def echo_availability_api(availability_api_instance):
click.echo("Archive URL:")
if not availability_api_instance.archive_url:
archive_url = (
"NO ARCHIVE FOUND - The requested URL is probably "
+ "not yet archived or if the URL was recently archived then it is "
+ "not yet available via the Wayback Machine's availability API "
+ "because of database lag and should be available after some time."
)
return WaybackError(err)
else:
archive_url = availability_api_instance.archive_url
click.echo(archive_url)
if json:
click.echo("JSON response:")
click.echo(JSON.dumps(availability_api_instance.JSON))
availability_api = WaybackMachineAvailabilityAPI(url, user_agent=user_agent)
def _archive_url(obj):
return obj.archive_url
if oldest:
availability_api.oldest()
echo_availability_api(availability_api)
return
if newest:
availability_api.newest()
echo_availability_api(availability_api)
return
def _json(obj):
return obj.JSON
if near:
near_args = {}
keys = ["year", "month", "day", "hour", "minute"]
args_arr = [year, month, day, hour, minute]
for key, arg in zip(keys, args_arr):
if arg:
near_args[key] = arg
availability_api.near(**near_args)
echo_availability_api(availability_api)
return
if save:
save_api = WaybackMachineSaveAPI(url, user_agent=user_agent)
save_api.save()
click.echo("Archive URL:")
click.echo(save_api.archive_url)
click.echo("Cached save:")
click.echo(save_api.cached_save)
if headers:
click.echo("Save API headers:")
click.echo(save_api.headers)
return
def no_archive_handler(e, obj):
m = re.search(r"archive\sfor\s\'(.*?)\'\stry", str(e))
if m:
url = m.group(1)
ua = obj.user_agent
if "github.com/akamhy/waybackpy" in ua:
ua = "YOUR_USER_AGENT_HERE"
return (
"\n[Waybackpy] Can not find archive for '%s'.\n[Waybackpy] You can"
" save the URL using the following command:\n[Waybackpy] waybackpy --"
'user_agent "%s" --url "%s" --save' % (url, ua, url)
def save_urls_on_file(url_gen):
domain = None
sys_random = random.SystemRandom()
uid = "".join(
sys_random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
return WaybackError(e)
url_count = 0
for url in url_gen:
url_count += 1
if not domain:
match = re.search("https?://([A-Za-z_0-9.-]+).*", url)
def _oldest(obj):
try:
return obj.oldest()
except Exception as e:
return no_archive_handler(e, obj)
domain = "domain-unknown"
if match:
domain = match.group(1)
def _newest(obj):
try:
return obj.newest()
except Exception as e:
return no_archive_handler(e, obj)
file_name = "{domain}-urls-{uid}.txt".format(domain=domain, uid=uid)
file_path = os.path.join(os.getcwd(), file_name)
if not os.path.isfile(file_path):
open(file_path, "w+").close()
with open(file_path, "a") as f:
f.write("{url}\n".format(url=url))
def _total_archives(obj):
return obj.total_archives()
click.echo(url)
if url_count > 0:
click.echo(
"\n\n'{file_name}' saved in current working directory".format(
file_name=file_name
)
)
else:
click.echo("No known URLs found. Please try a diffrent input!")
def _near(obj, args):
_near_args = {}
args_arr = [args.year, args.month, args.day, args.hour, args.minute]
keys = ["year", "month", "day", "hour", "minute"]
if known_urls:
wayback = Url(url, user_agent)
url_gen = wayback.known_urls(subdomain=subdomain)
for key, arg in zip(keys, args_arr):
if arg:
_near_args[key] = arg
if file:
return save_urls_on_file(url_gen)
else:
for url in url_gen:
click.echo(url)
try:
return obj.near(**_near_args)
except Exception as e:
return no_archive_handler(e, obj)
if cdx:
filters = list(filters)
collapses = list(collapses)
cdx_print = list(cdx_print)
def _save_urls_on_file(input_list, live_url_count):
m = re.search("https?://([A-Za-z_0-9.-]+).*", input_list[0])
domain = "domain-unknown"
if m:
domain = m.group(1)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
file_name = "%s-%d-urls-%s.txt" % (domain, live_url_count, uid)
file_content = "\n".join(input_list)
file_path = os.path.join(os.getcwd(), file_name)
with open(file_path, "w+") as f:
f.write(file_content)
return "%s\n\n'%s' saved in current working directory" % (file_content, file_name)
def _known_urls(obj, args):
"""
Known urls for a domain.
"""
subdomain = False
if args.subdomain:
subdomain = True
alive = False
if args.alive:
alive = True
url_list = obj.known_urls(alive=alive, subdomain=subdomain)
total_urls = len(url_list)
if total_urls > 0:
return _save_urls_on_file(url_list, total_urls)
return "No known URLs found. Please try a diffrent domain!"
def _get(obj, args):
if args.get.lower() == "url":
return obj.get()
if args.get.lower() == "archive_url":
return obj.get(obj.archive_url)
if args.get.lower() == "oldest":
return obj.get(obj.oldest())
if args.get.lower() == "latest" or args.get.lower() == "newest":
return obj.get(obj.newest())
if args.get.lower() == "save":
return obj.get(obj.save())
return "Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
\n3) oldest - get the source code of the oldest archive for the supplied url.\
\n4) newest - get the source code of the newest archive for the supplied url.\
\n5) save - Create a new archive and get the source code of this new archive for the supplied url."
def args_handler(args):
if args.version:
return "waybackpy version %s" % __version__
if not args.url:
return (
"waybackpy %s \nSee 'waybackpy --help' for help using this tool."
% __version__
cdx_api = WaybackMachineCDXServerAPI(
url,
user_agent=user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
filters=filters,
match_type=match_type,
gzip=gzip,
collapses=collapses,
limit=limit,
)
obj = Url(args.url)
if args.user_agent:
obj = Url(args.url, args.user_agent)
snapshots = cdx_api.snapshots()
if args.save:
output = _save(obj)
elif args.archive_url:
output = _archive_url(obj)
elif args.json:
output = _json(obj)
elif args.oldest:
output = _oldest(obj)
elif args.newest:
output = _newest(obj)
elif args.known_urls:
output = _known_urls(obj, args)
elif args.total:
output = _total_archives(obj)
elif args.near:
return _near(obj, args)
elif args.get:
output = _get(obj, args)
else:
output = (
"You only specified the URL. But you also need to specify the operation."
"\nSee 'waybackpy --help' for help using this tool."
)
return output
def add_requiredArgs(requiredArgs):
requiredArgs.add_argument(
"--url", "-u", help="URL on which Wayback machine operations would occur"
)
def add_userAgentArg(userAgentArg):
help_text = 'User agent, default user_agent is "waybackpy python package - https://github.com/akamhy/waybackpy"'
userAgentArg.add_argument("--user_agent", "-ua", help=help_text)
def add_saveArg(saveArg):
saveArg.add_argument(
"--save", "-s", action="store_true", help="Save the URL on the Wayback machine"
)
def add_auArg(auArg):
auArg.add_argument(
"--archive_url",
"-au",
action="store_true",
help="Get the latest archive URL, alias for --newest",
)
def add_jsonArg(jsonArg):
jsonArg.add_argument(
"--json",
"-j",
action="store_true",
help="JSON data of the availability API request",
)
def add_oldestArg(oldestArg):
oldestArg.add_argument(
"--oldest",
"-o",
action="store_true",
help="Oldest archive for the specified URL",
)
def add_newestArg(newestArg):
newestArg.add_argument(
"--newest",
"-n",
action="store_true",
help="Newest archive for the specified URL",
)
def add_totalArg(totalArg):
totalArg.add_argument(
"--total",
"-t",
action="store_true",
help="Total number of archives for the specified URL",
)
def add_getArg(getArg):
getArg.add_argument(
"--get",
"-g",
help="Prints the source code of the supplied url. Use '--get help' for extended usage",
)
def add_knownUrlArg(knownUrlArg):
knownUrlArg.add_argument(
"--known_urls", "-ku", action="store_true", help="URLs known for the domain."
)
help_text = "Use with '--known_urls' to include known URLs for subdomains."
knownUrlArg.add_argument("--subdomain", "-sub", action="store_true", help=help_text)
help_text = "Only include live URLs. Will not inlclude dead links."
knownUrlArg.add_argument("--alive", "-a", action="store_true", help=help_text)
def add_nearArg(nearArg):
nearArg.add_argument(
"--near", "-N", action="store_true", help="Archive near specified time"
)
def add_nearArgs(nearArgs):
nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
nearArgs.add_argument("--hour", "-H", type=int, help="Hour in intege")
nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")
def parse_args(argv):
parser = argparse.ArgumentParser()
add_requiredArgs(parser.add_argument_group("URL argument (required)"))
add_userAgentArg(parser.add_argument_group("User Agent"))
add_saveArg(parser.add_argument_group("Create new archive/save URL"))
add_auArg(parser.add_argument_group("Get the latest Archive"))
add_jsonArg(parser.add_argument_group("Get the JSON data"))
add_oldestArg(parser.add_argument_group("Oldest archive"))
add_newestArg(parser.add_argument_group("Newest archive"))
add_totalArg(parser.add_argument_group("Total number of archives"))
add_getArg(parser.add_argument_group("Get source code"))
add_knownUrlArg(
parser.add_argument_group(
"URLs known and archived to Waybcak Machine for the site."
)
)
add_nearArg(parser.add_argument_group("Archive close to time specified"))
add_nearArgs(parser.add_argument_group("Arguments that are used only with --near"))
parser.add_argument(
"--version", "-v", action="store_true", help="Waybackpy version"
)
return parser.parse_args(argv[1:])
def main(argv=None):
argv = sys.argv if argv is None else argv
print(args_handler(parse_args(argv)))
for snapshot in snapshots:
if len(cdx_print) == 0:
click.echo(snapshot)
else:
output_string = ""
if "urlkey" or "url-key" or "url_key" in cdx_print:
output_string = output_string + snapshot.urlkey + " "
if "timestamp" or "time-stamp" or "time_stamp" in cdx_print:
output_string = output_string + snapshot.timestamp + " "
if "original" in cdx_print:
output_string = output_string + snapshot.original + " "
if "original" in cdx_print:
output_string = output_string + snapshot.original + " "
if "mimetype" or "mime-type" or "mime_type" in cdx_print:
output_string = output_string + snapshot.mimetype + " "
if "statuscode" or "status-code" or "status_code" in cdx_print:
output_string = output_string + snapshot.statuscode + " "
if "digest" in cdx_print:
output_string = output_string + snapshot.digest + " "
if "length" in cdx_print:
output_string = output_string + snapshot.length + " "
if "archiveurl" or "archive-url" or "archive_url" in cdx_print:
output_string = output_string + snapshot.archive_url + " "
click.echo(output_string)
if __name__ == "__main__":
sys.exit(main(sys.argv))
main()

View File

@ -13,7 +13,26 @@ class WaybackError(Exception):
"""
class RedirectSaveError(WaybackError):
"""
Raised when the original URL is redirected and the
redirect URL is archived but not the original URL.
"""
class URLError(Exception):
"""
Raised when malformed URLs are passed as arguments.
"""
class MaximumRetriesExceeded(WaybackError):
"""
MaximumRetriesExceeded
"""
class MaximumSaveRetriesExceeded(MaximumRetriesExceeded):
"""
MaximumSaveRetriesExceeded
"""

131
waybackpy/save_api.py Normal file
View File

@ -0,0 +1,131 @@
import re
import time
import requests
from datetime import datetime
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .utils import DEFAULT_USER_AGENT
from .exceptions import MaximumSaveRetriesExceeded
class WaybackMachineSaveAPI:
"""
WaybackMachineSaveAPI class provides an interface for saving URLs on the
Wayback Machine.
"""
def __init__(self, url, user_agent=DEFAULT_USER_AGENT, max_tries=8):
self.url = str(url).strip().replace(" ", "%20")
self.request_url = "https://web.archive.org/save/" + self.url
self.user_agent = user_agent
self.request_headers = {"User-Agent": self.user_agent}
self.max_tries = max_tries
self.total_save_retries = 5
self.backoff_factor = 0.5
self.status_forcelist = [500, 502, 503, 504]
self._archive_url = None
self.instance_birth_time = datetime.utcnow()
@property
def archive_url(self):
if self._archive_url:
return self._archive_url
else:
return self.save()
def get_save_request_headers(self):
session = requests.Session()
retries = Retry(
total=self.total_save_retries,
backoff_factor=self.backoff_factor,
status_forcelist=self.status_forcelist,
)
session.mount("https://", HTTPAdapter(max_retries=retries))
self.response = session.get(self.request_url, headers=self.request_headers)
self.headers = self.response.headers
self.status_code = self.response.status_code
self.response_url = self.response.url
def archive_url_parser(self):
regex1 = r"Content-Location: (/web/[0-9]{14}/.*)"
match = re.search(regex1, str(self.headers))
if match:
return "https://web.archive.org" + match.group(1)
regex2 = r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>"
match = re.search(regex2, str(self.headers))
if match:
return "https://" + match.group(1)
regex3 = r"X-Cache-Key:\shttps(.*)[A-Z]{2}"
match = re.search(regex3, str(self.headers))
if match:
return "https://" + match.group(1)
if self.response_url:
self.response_url = self.response_url.strip()
if "web.archive.org/web" in self.response_url:
regex = r"web\.archive\.org/web/(?:[0-9]*?)/(?:.*)$"
match = re.search(regex, self.response_url)
if match:
return "https://" + match.group(0)
def sleep(self, tries):
sleep_seconds = 5
if tries % 3 == 0:
sleep_seconds = 10
time.sleep(sleep_seconds)
def timestamp(self):
m = re.search(
r"https?://web.archive.org/web/([0-9]{14})/http", self._archive_url
)
string_timestamp = m.group(1)
timestamp = datetime.strptime(string_timestamp, "%Y%m%d%H%M%S")
timestamp_unixtime = time.mktime(timestamp.timetuple())
instance_birth_time_unixtime = time.mktime(self.instance_birth_time.timetuple())
if timestamp_unixtime < instance_birth_time_unixtime:
self.cached_save = True
else:
self.cached_save = False
return timestamp
def save(self):
saved_archive = None
tries = 0
while True:
tries += 1
if tries >= self.max_tries:
raise MaximumSaveRetriesExceeded(
"Tried %s times but failed to save and return the archive for %s.\nResponse URL:\n%s \nResponse Header:\n%s\n"
% (str(tries), self.url, self.response_url, str(self.headers)),
)
if not saved_archive:
if tries > 1:
self.sleep(tries)
self.get_save_request_headers()
saved_archive = self.archive_url_parser()
if not saved_archive:
continue
else:
self._archive_url = saved_archive
self.timestamp()
return saved_archive

View File

@ -1,286 +1,11 @@
import re
import requests
from .exceptions import WaybackError, URLError
from datetime import datetime
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .__version__ import __version__
quote = requests.utils.quote
default_user_agent = "waybackpy python package - https://github.com/akamhy/waybackpy"
DEFAULT_USER_AGENT = "waybackpy %s - https://github.com/akamhy/waybackpy" % __version__
def _add_payload(self, payload):
if self.start_timestamp:
payload["from"] = self.start_timestamp
if self.end_timestamp:
payload["to"] = self.end_timestamp
if self.gzip != True:
payload["gzip"] = "false"
if self.match_type:
payload["matchType"] = self.match_type
if self.filters and len(self.filters) > 0:
for i, f in enumerate(self.filters):
payload["filter" + str(i)] = f
if self.collapses and len(self.collapses) > 0:
for i, f in enumerate(self.collapses):
payload["collapse" + str(i)] = f
payload["url"] = self.url
def _ts(timestamp, data):
"""
Get timestamp of last fetched archive.
If used before fetching any archive, will
use whatever self.JSON returns.
self.timestamp is None implies that
self.JSON will return any archive's JSON
that wayback machine provides it.
"""
if timestamp:
return timestamp
if not data["archived_snapshots"]:
return datetime.max
return datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
def _check_match_type(match_type, url):
if not match_type:
return
if "*" in url:
raise WaybackError("Can not use wildcard with match_type argument")
legal_match_type = ["exact", "prefix", "host", "domain"]
if match_type not in legal_match_type:
raise WaybackError(
"%s is not an allowed match type.\nUse one from 'exact', 'prefix', 'host' or 'domain'"
% match_type
)
def _check_collapses(collapses):
if not isinstance(collapses, list):
raise WaybackError("collapses must be a list.")
if len(collapses) == 0:
return
for c in collapses:
try:
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?",
c,
)
field = match.group(1)
N = None
if 2 == len(match.groups()):
N = match.group(2)
if N:
if not (field + N == c):
raise Exception
else:
if not (field == c):
raise Exception
except Exception:
e = "collapse argument '%s' is not following the cdx collapse syntax." % c
raise WaybackError(e)
def _check_filters(filters):
if not isinstance(filters, list):
raise WaybackError("filters must be a list.")
# [!]field:regex
for f in filters:
try:
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):(.*)",
f,
)
key = match.group(1)
val = match.group(2)
except Exception:
e = "Filter '%s' not following the cdx filter syntax." % f
raise WaybackError(e)
def _cleaned_url(url):
return str(url).strip().replace(" ", "%20")
def _url_check(url):
"""
Check for common URL problems.
What we are checking:
1) '.' in self.url, no url that ain't '.' in it.
If you known any others, please create a PR on the github repo.
"""
if "." not in url:
raise URLError("'%s' is not a vaild URL." % url)
def _full_url(endpoint, params):
full_url = endpoint
if params:
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = full_url + amp + "%s=%s" % (key, quote(str(val)))
return full_url
def _get_total_pages(url, user_agent):
"""
If showNumPages is passed in cdx API, it returns
'number of archive pages'and each page has many archives.
This func returns number of pages of archives (type int).
"""
total_pages_url = (
"https://web.archive.org/cdx/search/cdx?url=%s&showNumPages=true" % url
)
headers = {"User-Agent": user_agent}
return int((_get_response(total_pages_url, headers=headers).text).strip())
def _archive_url_parser(header, url):
"""
The wayback machine's save API doesn't
return JSON response, we are required
to read the header of the API response
and look for the archive URL.
This method has some regexen (or regexes)
that search for archive url in header.
This method is used when you try to
save a webpage on wayback machine.
Two cases are possible:
1) Either we find the archive url in
the header.
2) Or we didn't find the archive url in
API header.
If we found the archive URL we return it.
And if we couldn't find it, we raise
WaybackError with an error message.
"""
# Regex1
m = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
if m:
return "web.archive.org" + m.group(1)
# Regex2
m = re.search(
r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
)
if m:
return m.group(1)
# Regex3
m = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
if m:
return m.group(1)
raise WaybackError(
"No archive URL found in the API response. "
"If '%s' can be accessed via your web browser then either "
"this version of waybackpy (%s) is out of date or WayBack Machine is malfunctioning. Visit "
"'https://github.com/akamhy/waybackpy' for the latest version "
"of waybackpy.\nHeader:\n%s" % (url, __version__, str(header))
)
def _wayback_timestamp(**kwargs):
"""
Wayback Machine archive URLs
have a timestamp in them.
The standard archive URL format is
https://web.archive.org/web/20191214041711/https://www.youtube.com
If we break it down in three parts:
1 ) The start (https://web.archive.org/web/)
2 ) timestamp (20191214041711)
3 ) https://www.youtube.com, the original URL
The near method takes year, month, day, hour and minute
as Arguments, their type is int.
This method takes those integers and converts it to
wayback machine timestamp and returns it.
Return format is string.
"""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
def _get_response(
endpoint, params=None, headers=None, retries=5, return_full_url=False
):
"""
This function is used make get request.
We use the requests package to make the
requests.
We try five times and if it fails it raises
WaybackError exception.
You can handles WaybackError by importing:
from waybackpy.exceptions import WaybackError
try:
...
except WaybackError as e:
# handle it
"""
# From https://stackoverflow.com/a/35504626
# By https://stackoverflow.com/users/401467/datashaman
s = requests.Session()
retries = Retry(
total=retries, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504]
)
s.mount("https://", HTTPAdapter(max_retries=retries))
url = _full_url(endpoint, params)
try:
if not return_full_url:
return s.get(url, headers=headers)
return (url, s.get(url, headers=headers))
except Exception as e:
exc = WaybackError("Error while retrieving %s" % url)
exc.__cause__ = e
raise exc
def latest_version(package_name, headers):
request_url = "https://pypi.org/pypi/" + package_name + "/json"
response = requests.get(request_url, headers=headers)
data = response.json()
return data["info"]["version"]

View File

@ -1,317 +1,117 @@
import requests
import concurrent.futures
from datetime import datetime, timedelta
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI
from .utils import DEFAULT_USER_AGENT
from .exceptions import WaybackError
from .cdx import Cdx
from .utils import (
_archive_url_parser,
_wayback_timestamp,
_get_response,
default_user_agent,
_url_check,
_cleaned_url,
_ts,
)
from datetime import datetime, timedelta
class Url:
def __init__(self, url, user_agent=default_user_agent):
def __init__(self, url, user_agent=DEFAULT_USER_AGENT):
self.url = url
self.user_agent = str(user_agent)
_url_check(self.url)
self._archive_url = None
self.timestamp = None
self._JSON = None
self._alive_url_list = []
def __repr__(self):
return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)
self.archive_url = None
self.wayback_machine_availability_api = WaybackMachineAvailabilityAPI(
self.url, user_agent=self.user_agent
)
def __str__(self):
"""
Output when print() is used on <class 'waybackpy.wrapper.Url'>
This should print an archive URL.
We check if self._archive_url is not None.
If not None, good. We return string of self._archive_url.
If self._archive_url is None, it means we ain't used any method that
sets self._archive_url, we now set self._archive_url to self.archive_url
and return it.
"""
if not self._archive_url:
self._archive_url = self.archive_url
return "%s" % self._archive_url
if not self.archive_url:
self.newest()
return self.archive_url
def __len__(self):
"""
Why do we have len here?
Applying len() on <class 'waybackpy.wrapper.Url'>
will calculate the number of days between today and
the archive timestamp.
Can be applied on return values of near and its
childs (e.g. oldest) and if applied on waybackpy.Url()
whithout using any functions, it just grabs
self._timestamp and def _timestamp gets it
from def JSON.
"""
td_max = timedelta(
days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)
if not self.timestamp:
self.timestamp = self._timestamp
self.oldest()
if self.timestamp == datetime.max:
return td_max.days
return (datetime.utcnow() - self.timestamp).days
@property
def JSON(self):
"""
If the end user has used near() or its childs like oldest, newest
and archive_url then the JSON response of these are cached in self._JSON
If we find that self._JSON is not None we return it.
else we get the response of 'https://archive.org/wayback/available?url=YOUR-URL'
and return it.
"""
if self._JSON:
return self._JSON
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": self.user_agent}
payload = {"url": "%s" % _cleaned_url(self.url)}
response = _get_response(endpoint, params=payload, headers=headers)
return response.json()
@property
def archive_url(self):
"""
Returns any random archive for the instance.
But if near, oldest, newest were used before
then it returns the same archive again.
We cache archive in self._archive_url
"""
if self._archive_url:
return self._archive_url
data = self.JSON
if not data["archived_snapshots"]:
archive_url = None
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self._archive_url = archive_url
return archive_url
@property
def _timestamp(self):
self.timestamp = _ts(self.timestamp, self.JSON)
return self.timestamp
def save(self):
"""
To save a webpage on WayBack machine we
need to send get request to https://web.archive.org/save/
And to get the archive URL we are required to read the
header of the API response.
_get_response() takes care of the get requests. It uses requests
package.
_archive_url_parser() parses the archive from the header.
"""
request_url = "https://web.archive.org/save/" + _cleaned_url(self.url)
headers = {"User-Agent": self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
self._archive_url = "https://" + _archive_url_parser(response.headers, self.url)
self.timestamp = datetime.utcnow()
self.wayback_machine_save_api = WaybackMachineSaveAPI(
self.url, user_agent=self.user_agent
)
self.archive_url = self.wayback_machine_save_api.archive_url
self.timestamp = self.wayback_machine_save_api.timestamp()
self.headers = self.wayback_machine_save_api.headers
return self
def get(self, url="", user_agent="", encoding=""):
"""
Return the source code of the supplied URL.
If encoding is not supplied, it is auto-detected
from the response itself by requests package.
"""
def near(
self,
year=None,
month=None,
day=None,
hour=None,
minute=None,
unix_timestamp=None,
):
if not url:
url = _cleaned_url(self.url)
if not user_agent:
user_agent = self.user_agent
headers = {"User-Agent": str(user_agent)}
response = _get_response(str(url), params=None, headers=headers)
if not encoding:
try:
encoding = response.encoding
except AttributeError:
encoding = "UTF-8"
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
def near(self, year=None, month=None, day=None, hour=None, minute=None):
"""
Wayback Machine can have many archives of a webpage,
sometimes we want archive close to a specific time.
This method takes year, month, day, hour and minute as input.
The input type must be integer. Any non-supplied parameters
default to the current time.
We convert the input to a wayback machine timestamp using
_wayback_timestamp(), it returns a string.
We use the wayback machine's availability API
(https://archive.org/wayback/available)
to get the closest archive from the timestamp.
We set self._archive_url to the archive found, if any.
If archive found, we set self.timestamp to its timestamp.
We self._JSON to the response of the availability API.
And finally return self.
"""
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
self.wayback_machine_availability_api.near(
year=year,
month=month,
day=day,
hour=hour,
minute=minute,
unix_timestamp=unix_timestamp,
)
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": self.user_agent}
payload = {"url": "%s" % _cleaned_url(self.url), "timestamp": timestamp}
response = _get_response(endpoint, params=payload, headers=headers)
data = response.json()
if not data["archived_snapshots"]:
raise WaybackError(
"Can not find archive for '%s' try later or use wayback.Url(url, user_agent).save() "
"to create a new archive.\nAPI response:\n%s"
% (_cleaned_url(self.url), response.text)
)
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self._archive_url = archive_url
self.timestamp = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
self._JSON = data
self.set_availability_api_attrs()
return self
def oldest(self, year=1994):
"""
Returns the earliest/oldest Wayback Machine archive for the webpage.
Wayback machine has started archiving the internet around 1997 and
therefore we can't have any archive older than 1997, we use 1994 as the
deafult year to look for the oldest archive.
We simply pass the year in near() and return it.
"""
return self.near(year=year)
def oldest(self):
self.wayback_machine_availability_api.oldest()
self.set_availability_api_attrs()
return self
def newest(self):
"""
Return the newest Wayback Machine archive available for this URL.
self.wayback_machine_availability_api.newest()
self.set_availability_api_attrs()
return self
We return the output of self.near() as it deafults to current utc time.
Due to Wayback Machine database lag, this may not always be the
most recent archive.
"""
return self.near()
def set_availability_api_attrs(self):
self.archive_url = self.wayback_machine_availability_api.archive_url
self.JSON = self.wayback_machine_availability_api.JSON
self.timestamp = self.wayback_machine_availability_api.timestamp()
def total_archives(self, start_timestamp=None, end_timestamp=None):
"""
A webpage can have multiple archives on the wayback machine
If someone wants to count the total number of archives of a
webpage on wayback machine they can use this method.
Returns the total number of Wayback Machine archives for the URL.
Return type in integer.
"""
cdx = Cdx(
_cleaned_url(self.url),
cdx = WaybackMachineCDXServerAPI(
self.url,
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
)
i = 0
count = 0
for _ in cdx.snapshots():
i = i + 1
return i
def live_urls_finder(self, url):
"""
This method is used to check if supplied url
is >= 400.
"""
try:
response_code = requests.get(url).status_code
except Exception:
return # we don't care if Exception
# 200s are OK and 300s are usually redirects, if you don't want redirects replace 400 with 300
if response_code >= 400:
return
self._alive_url_list.append(url)
count = count + 1
return count
def known_urls(
self, alive=False, subdomain=False, start_timestamp=None, end_timestamp=None
self,
subdomain=False,
host=False,
start_timestamp=None,
end_timestamp=None,
match_type="prefix",
):
"""
Returns list of URLs known to exist for given domain name
because these URLs were crawled by WayBack Machine spider.
Useful for pen-testing.
"""
# Idea by Mohammed Diaa (https://github.com/mhmdiaa) from:
# https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
url_list = []
if subdomain:
cdx = Cdx(_cleaned_url(self.url), user_agent=self.user_agent, start_timestamp=start_timestamp, end_timestamp=end_timestamp, match_type="domain", collapses=["urlkey"])
else:
cdx = Cdx(_cleaned_url(self.url), user_agent=self.user_agent, start_timestamp=start_timestamp, end_timestamp=end_timestamp, match_type="host", collapses=["urlkey"])
match_type = "domain"
if host:
match_type = "host"
snapshots = cdx.snapshots()
cdx = WaybackMachineCDXServerAPI(
self.url,
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
match_type=match_type,
collapses=["urlkey"],
)
url_list = []
for snapshot in snapshots:
url_list.append(snapshot.original)
# Remove all deadURLs from url_list if alive=True
if alive:
with concurrent.futures.ThreadPoolExecutor() as executor:
executor.map(self.live_urls_finder, url_list)
url_list = self._alive_url_list
return url_list
for snapshot in cdx.snapshots():
yield (snapshot.original)