Compare commits

75 Commits
2.3.1 ... 3.0.0

Author SHA1 Message Date
de5a3e1561 improve usage code 2022-01-18 21:18:17 +05:30
52e46fecc2 more usage example 2022-01-18 20:58:39 +05:30
3b6415abc7 updating examples 2022-01-18 20:44:47 +05:30
66e16d6d89 define __repr__ for the Availability API class 2022-01-18 20:34:21 +05:30
16b9bdd7f9 output the file name if known_url and file flag are passed. 2022-01-18 20:14:44 +05:30
7adc01bff2 implement known_urls for cli from the newer interface. Use of CDX is recommended, but backward compatibility matters. 2022-01-18 20:07:12 +05:30
9bbd056268 Update README.md 2022-01-17 02:15:38 +05:30
2ab44391cf close #107, added link to SecSI/Docker image 2022-01-16 23:01:31 +05:30
cc3628ae18 define __str__ for objects of WaybackMachineAvailabilityAPI class, the check for self.JSON ensures that the API was at least called. 2022-01-16 22:28:12 +05:30
1d751b942b invoke json, removing it in the earlier commit was a bad idea as the end user should not have to call it 2022-01-16 22:15:25 +05:30
261a867a21 near() method of WaybackMachineAvailabilityAPI return self to preserve past behaviour 2022-01-16 21:53:54 +05:30
2e487e88d3 define __len__ on Url objects, if any method not used prior to len op then default to len of oldest archive. 2022-01-16 21:29:43 +05:30
c8d0ad493a defined __str__ for Url objects, print func should print the url. 2022-01-16 21:22:43 +05:30
ce869177fd Merge pull request #103 from akamhy/whitesource/configure
Configure WhiteSource Bolt for GitHub
2022-01-02 16:04:15 +05:30
58616fb986 Add .whitesource configuration file 2022-01-02 08:45:07 +00:00
4e68cd5743 Create separate module for the 3 different APIs also CDX is now CLI supported. 2022-01-02 14:14:45 +05:30
a7b805292d changes made for v2.4.4 (update download_url) (#100)
* v2.4.4 (update download_url)

* v2.4.4 (update __version__)

* +1

add jonasjancarik
2021-09-03 11:28:26 +05:30
6dc6124dc4 Raise error on a 509 response (too many sessions) (#99)
* Raise error on a 509 response (too many sessions)

When the response code is 509, raise an error with an explanation (based on the actual error message contained in the response HTML).

* Raise error on a 509 response (too many sessions) - linting
2021-09-03 08:04:36 +05:30
5a7fc7d568 Fix typo (#95) 2021-04-13 16:58:34 +05:30
5a9c861cad v2.4.3 (#94)
* 2.4.3

* 2.4.3
2021-04-02 10:41:59 +05:30
dd1917c77e added RedirectSaveError - for failed saves if the URL is a permanent … (#93)
* added RedirectSaveError - for failed saves if the URL is a permanent redirect.

* check if url is redirect before throwing exceptions, res.url is the redirect url if redirected at all

* update tests and cli errors
2021-04-02 10:38:17 +05:30
db8f902cff Add doc strings (#90)
* Added some docstrings in utils.py

* renamed some func/meth to better names and added doc strings + lint

* added more docstrings

* more docstrings

* improve docstrings

* docstrings

* added more docstrings, lint

* fix import error
2021-01-26 11:56:03 +05:30
88cda94c0b v2.4.2 (#89)
* v2.4.2

* v2.4.2
2021-01-24 17:03:35 +05:30
09290f88d1 fix one more error 2021-01-24 16:58:53 +05:30
e5835091c9 import re 2021-01-24 16:56:59 +05:30
7312ed1f4f set cached_save to True if archive older than 3 mins. 2021-01-24 16:53:36 +05:30
6ae8f843d3 add --file to --known_urls 2021-01-24 16:15:11 +05:30
36b936820b known urls now yields, more reliable. And save the file in chunks wrt the response. --file arg can be used to create output file, if --file not used no output will be saved in any file. (#88) 2021-01-24 16:11:39 +05:30
a3bc6aad2b too much API usage by duplicate tests was causing too much tests failure 2021-01-23 21:08:21 +05:30
edc2f63d93 Output valid JSON, dumps python dict. Make JSON valid. 2021-01-23 20:43:52 +05:30
ffe0810b12 flag to check if the archive saved is 30 mins older or not 2021-01-16 12:06:08 +05:30
40233eb115 improve code quality, remove unused imports, use system randomness etc 2021-01-16 11:35:13 +05:30
d549d31421 improve save method, now we know that a 302 error indicates that the Wayback Machine is archiving the URL and hasn't yet archived it. We construct an artificial archive with the current UTC time and check for HTTP status code 20* or 30*. If we verify the archival, we return the artificial archive. The artificial archive will automatically point to the new archive or in the best case will be the new archive after some time. 2021-01-16 10:47:43 +05:30
0725163af8 mimify the logo, remove ugly old logos 2021-01-15 18:14:48 +05:30
712471176b better error messages(str), check latest version before asking for an upgrade and rm alive checking 2021-01-15 16:47:26 +05:30
dcd7b03302 getting rid of c style str formatting, now using .format 2021-01-14 19:30:07 +05:30
76205d9cf6 backoff_factor=2 for save, incr success by 25% 2021-01-13 10:13:16 +05:30
ec0a0d04cc + dequeued0
dequeued0 (https://github.com/dequeued0) for reporting bugs and useful feature requests.
2021-01-12 10:52:41 +05:30
7bb01df846 v2.4.1 2021-01-12 10:18:09 +05:30
6142e0b353 get should retrieve the last fetched archive by default 2021-01-12 10:07:14 +05:30
a65990aee3 don't use pagination API if total pages <= 2 2021-01-12 09:46:07 +05:30
259a024eb1 joke? they changed their robots.txt 2021-01-11 23:17:01 +05:30
91402792e6 + Supported Features
tell what the package can do, many users probably do not read the full usage.
2021-01-11 23:01:18 +05:30
eabf4dc046 don't fetch more pages if >=2 pages are empty 2021-01-11 22:43:14 +05:30
5a7bd73565 support unix ts as an arg in near 2021-01-11 19:53:37 +05:30
4693dbf9c1 change str repr of cdxsnapshot to cdx line 2021-01-11 09:34:37 +05:30
f4f2e51315 V2.4.0 (#62)
* v 2.4.0

* v 2.4.0
2021-01-10 11:53:45 +05:30
d6b7df6837 no need to de-duplicate as we are collapsing the results by urlkey
Same urls aren't received
2021-01-10 11:36:46 +05:30
dafba5d0cb collapses=["urlkey"] for known urls 2021-01-10 11:34:06 +05:30
6c71dfbe41 use cdx matchtype for domain and host 2021-01-10 11:10:49 +05:30
a6470b1036 not passing dict to cdxsnapshot 2021-01-10 10:40:32 +05:30
04cda4558e fix test 2021-01-10 03:18:09 +05:30
625ed63482 remove asserts stmnts 2021-01-10 03:05:48 +05:30
a03813315f full cdx api support 2021-01-10 02:23:53 +05:30
a2550f17d7 retries support for get requests 2021-01-06 01:58:38 +05:30
15ef5816db Always cast url to string, avoid passing waybackpy objects to _get_response 2021-01-05 19:46:17 +05:30
93b52bd0fe FIX : don't use self.user_agent if user_agent passed in get() 2021-01-05 19:31:27 +05:30
28ff877081 Update README.md 2021-01-05 19:08:35 +05:30
3e3ecff9df l2 heading and lint 2021-01-05 01:59:29 +05:30
ce64135ba8 ce 2021-01-05 01:52:35 +05:30
2af6580ffb docs link 2021-01-05 01:51:53 +05:30
8a3c515176 v2.3.3 2021-01-05 01:49:26 +05:30
d98c4f32ad v2.3.3 2021-01-05 01:48:54 +05:30
e0a4b007d5 improve docs 2021-01-05 01:46:12 +05:30
6fb6b2deee Update readme + new file CONTRIBUTORS.md (#59)
* remove some badges

* remove made with python button, obvious

* - maintained badge, we already have latest commit badge

- [![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)

* re arranged order of badges

* a bit more reordering

* - release badge

* - license section

* center h1

* try once more'

* removed the TOC

* move the hr

* Update README.md

* + hr

* h1 --> h2

* remove tests and packaging info from here to docs/wiki

* Update README.md

* example inspired by psf/requests

* CLI tool example gist

* Update README.md

* Update README.md

* + license

* Update README.md

* authors list

* Update CONTRIBUTORS.md

* fix code

* Update README.md

* Update README.md

* center the button
2021-01-05 00:30:07 +05:30
1882862992 now using cdx Pagination API 2021-01-04 20:46:54 +05:30
0c6107e675 increase coverage 2021-01-04 01:54:40 +05:30
bd079978bf inc coverage 2021-01-04 00:44:55 +05:30
5dec4927cd refactoring, try to reduce code complexity 2021-01-04 00:14:38 +05:30
62e5217b9e reduce code complexity: refactoring, less flow breaking structures 2021-01-03 19:38:25 +05:30
9823c809e9 Added doc strings in wrapper.py, documenting code and improving docs. 2021-01-03 17:11:32 +05:30
db5737a857 JSON is now available for near and other other methods that call it 2021-01-02 18:52:46 +05:30
ca0821a466 Wiki docs (#58)
* move docs to wiki

* Update README.md

* Update setup.py
2021-01-02 12:20:43 +05:30
bb4dbc7d3c rm url = obj.url 2021-01-02 11:19:09 +05:30
7c7fd75376 No need to fetch archive_url and timestamp from availability API on init (#55)
* No need to fetch archive_url and timestamp from availability API on init. 

Not useful if all I want is to archive a page

* Update test_wrapper.py

* Update wrapper.py

* Update test_wrapper.py

* Update wrapper.py

* Update cli.py

* Update wrapper.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update __version__.py

* Update setup.py

* Update README.md
2021-01-02 11:10:23 +05:30
26 changed files with 1181 additions and 1900 deletions

View File

@ -1,42 +0,0 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: CI
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.8']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest codecov pytest-cov
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
pytest --cov=waybackpy tests/
- name: Upload coverage to Codecov
run: |
bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

View File

@ -1,4 +0,0 @@
# File : .pep8speaks.yml
scanner:
diff_only: True # If True, errors caused by only the patch are shown

View File

@ -1,5 +0,0 @@
# autogenerated pyup.io config file
# see https://pyup.io/docs/configuration/ for all available options
schedule: ''
update: false

View File

@ -1,6 +1,10 @@
{
"scanSettings": {
"baseBranches": []
},
"checkRunSettings": {
"vulnerableCheckRunConclusionLevel": "failure"
"vulnerableCheckRunConclusionLevel": "failure",
"displayMode": "diff"
},
"issueSettings": {
"minSeverityLevel": "LOW"

View File

@ -1,58 +0,0 @@
# Contributing to waybackpy
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
- Reporting a bug
- Discussing the current state of the code
- Submitting a fix
- Proposing new features
- Becoming a maintainer
## We Develop with Github
We use github to host code, to track issues and feature requests, as well as accept pull requests.
## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
## Any contributions you make will be under the MIT Software License
In short, when you submit code changes, your submissions are understood to be under the same [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
## Report bugs using Github's [issues](https://github.com/akamhy/waybackpy/issues)
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/akamhy/waybackpy/issues/new); it's that easy!
## Write bug reports with detail, background, and sample code
**Great Bug Reports** tend to have:
- A quick summary and/or background
- Steps to reproduce
- Be specific!
- Give sample code if you can.
- What you expected would happen
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)
People *love* thorough bug reports. I'm not even kidding.
## Use a Consistent Coding Style
* You can try running `flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics` for style unification.
## License
By contributing, you agree that your contributions will be licensed under its [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE).
## References
This document is forked from [this gist](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62) by [briandk](https://github.com/briandk) which was itself adapted from the open-source contribution guidelines for [Facebook's Draft](https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md)

10
CONTRIBUTORS.md Normal file
View File

@ -0,0 +1,10 @@
## AUTHORS
- akamhy (<https://github.com/akamhy>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- jonasjancarik (<https://github.com/jonasjancarik>)
## ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

523
README.md
View File

@ -1,64 +1,34 @@
<div align="center">
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<h3>Python package & CLI tool that interfaces with the Wayback Machine API</h3>
</div>
-----------------
<p align="center">
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md"><img alt="Contributions Welcome" src="https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub latest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square"></a>
</p>
## Python package & CLI tool that interfaces with the Wayback Machine API.
[![pypi](https://img.shields.io/pypi/v/waybackpy.svg)](https://pypi.org/project/waybackpy/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![Build Status](https://github.com/akamhy/waybackpy/workflows/CI/badge.svg)](https://github.com/akamhy/waybackpy/actions)
[![codecov](https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg)](https://codecov.io/gh/akamhy/waybackpy)
[![contributions welcome](https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square)](https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![Downloads](https://pepy.tech/badge/waybackpy/month)](https://pepy.tech/project/waybackpy)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Maintainability](https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability)](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
[![GitHub last commit](https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square)](https://github.com/akamhy/waybackpy/commits/master)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
-----------------------------------------------------------------------------------------------------------------------------------------------
## ⭐️ Introduction
Waybackpy is a [Python package](https://www.udacity.com/blog/2021/01/what-is-a-python-package.html) and a CLI tool that interfaces with the Wayback Machine API.
Wayback Machine has 3 client side APIs.
- Save API
- Availability API
- CDX API
All three of these can be accessed by waybackpy.
Table of contents
=================
<!--ts-->
* [Installation](#installation)
* [Usage](#usage)
* [As a Python package](#as-a-python-package)
* [Saving a webpage](#capturing-aka-saving-an-url-using-save)
* [Retrieving archive](#retrieving-the-archive-for-an-url-using-archive_url)
* [Retrieving the oldest archive](#retrieving-the-oldest-archive-for-an-url-using-oldest)
* [Retrieving the latest/newest archive](#retrieving-the-newest-archive-for-an-url-using-newest)
* [Retrieving the JSON response of availability API](#retrieving-the-json-response-for-the-availability-api-request)
* [Retrieving archive close to a specified year, month, day, hour, and minute](#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage](#get-the-content-of-webpage-using-get)
* [Count total archives for an URL](#count-total-archives-for-an-url-using-total_archives)
* [List of URLs that Wayback Machine knows and has archived for a domain name](#list-of-urls-that-wayback-machine-knows-and-has-archived-for-a-domain-name)
* [With the Command-line interface](#with-the-command-line-interface)
* [Saving webpage](#save)
* [Archive URL](#get-archive-url)
* [Oldest archive URL](#oldest-archive)
* [Newest archive URL](#newest-archive)
* [JSON response of API](#get-json-data-of-avaialblity-api)
* [Total archives](#total-number-of-archives)
* [Archive near specified time](#archive-near-time)
* [Get the source code](#get-the-source-code)
* [Fetch all the URLs that the Wayback Machine knows for a domain](#fetch-all-the-urls-that-the-wayback-machine-knows-for-a-domain)
* [Tests](#tests)
* [Packaging](#packaging)
* [License](#license)
<!--te-->
## Installation
### 🏗 Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
@ -66,387 +36,94 @@ Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
pip install waybackpy
```
or direct from this repository using git.
Install directly from GitHub:
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
## Usage
### Docker Image
Docker Hub : <https://hub.docker.com/r/secsi/waybackpy>
### As a Python package
The Docker image is automatically updated on every release by [Regularly and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
#### Capturing aka Saving an URL using save()
RAUDI is a tool by SecSI (<https://secsi.io>), an Italian cybersecurity startup.
### Usage
#### As a Python package
##### Save API aka SavePageNow
```python
import waybackpy
url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive = waybackpy_url_obj.save()
print(archive)
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
```
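The archive URL returned by `save()` carries its capture time as a 14-digit `YYYYMMDDhhmmss` segment, which is what `timestamp()` reports in the example above. A minimal standard-library sketch of recovering that time from the URL follows; `archive_timestamp` is a hypothetical helper written for illustration, not part of waybackpy, and how waybackpy itself derives the value is not shown in this diff.

```python
import re
from datetime import datetime

# Matches the 14-digit capture timestamp in a Wayback Machine archive URL.
_TIMESTAMP_RE = re.compile(r"/web/(\d{14})/")


def archive_timestamp(archive_url: str) -> datetime:
    """Return the capture time embedded in an archive URL (hypothetical helper)."""
    match = _TIMESTAMP_RE.search(archive_url)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {archive_url!r}")
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


print(archive_timestamp("https://web.archive.org/web/20220118125249/https://github.com/"))
# prints: 2022-01-18 12:52:49
```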
##### Availability API
```python
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>>
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>>
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
>>>
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
```
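For readers curious what a `near()` call might translate to, here is a rough sketch of building a request for the Wayback Machine availability endpoint, which takes a `url` and a `timestamp` of up to 14 digits. The helper names `near_timestamp` and `availability_query` are made up for this illustration, and filling omitted fields with the earliest value (month 1, day 1, 00:00) is an assumption; waybackpy may choose different defaults.

```python
from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def near_timestamp(year, month=1, day=1, hour=0, minute=0):
    """Zero-pad the date parts into a YYYYMMDDhhmmss string (seconds fixed at 00).

    Assumption: omitted fields default to the earliest value.
    """
    return f"{year:04d}{month:02d}{day:02d}{hour:02d}{minute:02d}00"


def availability_query(url, **when):
    """Build the availability API request URL (hypothetical helper)."""
    params = {"url": url, "timestamp": near_timestamp(**when)}
    return f"{AVAILABILITY_ENDPOINT}?{urlencode(params)}"


print(availability_query("google.com", year=2010, month=10, day=10, hour=10))
```

Callers pass plain integers, so `month=1` is padded to `01` inside the helper rather than by the user.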
##### CDX API aka CDXServerAPI
```python
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
... print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
```
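The snapshot listing above can be pictured as a query against the CDX server followed by turning each result line into an archive URL. The sketch below assumes the CDX server's default space-separated line layout (`urlkey timestamp original mimetype statuscode digest length`) and its `from`, `to`, and `collapse` query parameters; `cdx_query` and `archive_url_from_cdx_line` are hypothetical helpers, not waybackpy functions, and the digest and length in the sample line are placeholders.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query(url, start_timestamp=None, end_timestamp=None, collapses=()):
    """Build a CDX server request URL (hypothetical helper)."""
    params = [("url", url)]
    if start_timestamp is not None:
        params.append(("from", str(start_timestamp)))
    if end_timestamp is not None:
        params.append(("to", str(end_timestamp)))
    # One collapse parameter per field, e.g. collapses=["urlkey"].
    params.extend(("collapse", field) for field in collapses)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def archive_url_from_cdx_line(line):
    """Turn one default-format CDX line into its archive URL."""
    fields = line.split(" ")
    timestamp, original = fields[1], fields[2]
    return f"https://web.archive.org/web/{timestamp}/{original}"


print(cdx_query("pypi.org", 2016, 2017))
# Placeholder digest and length; only the timestamp and original URL are used.
sample = "org,pypi)/ 20160110011047 http://pypi.org/ text/html 200 PLACEHOLDERDIGEST 0"
print(archive_url_from_cdx_line(sample))
```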
> Documentation at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
#### As a CLI tool
```bash
https://web.archive.org/web/20201016171808/https://en.wikipedia.org/wiki/Multivariable_calculus
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPySaveExample></sub>
#### Retrieving the archive for an URL using archive_url
```python
import waybackpy
url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive_url = waybackpy_url_obj.archive_url
print(archive_url)
```
```bash
https://web.archive.org/web/20201016153320/https://www.google.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyArchiveUrl></sub>
#### Retrieving the oldest archive for an URL using oldest()
```python
import waybackpy
url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
oldest_archive_url = waybackpy_url_obj.oldest()
print(oldest_archive_url)
```
```bash
http://web.archive.org/web/19981111184551/http://google.com:80/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyOldestExample></sub>
#### Retrieving the newest archive for an URL using newest()
```python
import waybackpy
url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
newest_archive_url = waybackpy_url_obj.newest()
print(newest_archive_url)
```
```bash
https://web.archive.org/web/20201016150543/https://www.facebook.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNewestExample></sub>
#### Retrieving the JSON response for the availability API request
```python
import waybackpy
url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
json_dict = waybackpy_url_obj.JSON
print(json_dict)
```
```javascript
{'url': 'https://www.facebook.com/', 'archived_snapshots': {'closest': {'available': True, 'url': 'http://web.archive.org/web/20201016150543/https://www.facebook.com/', 'timestamp': '20201016150543', 'status': '200'}}}
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyJSON></sub>
#### Retrieving archive close to a specified year, month, day, hour, and minute using near()
```python
from waybackpy import Url
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
url = "https://github.com/"
waybackpy_url_obj = Url(url, user_agent)
# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
```
```python
github_archive_near_2010 = waybackpy_url_obj.near(year=2010)
print(github_archive_near_2010)
```
```bash
https://web.archive.org/web/20101018053604/http://github.com:80/
```
```python
github_archive_near_2011_may = waybackpy_url_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)
```
```bash
https://web.archive.org/web/20110518233639/https://github.com/
```
```python
github_archive_near_2015_january_26 = waybackpy_url_obj.near(year=2015, month=1, day=26)
print(github_archive_near_2015_january_26)
```
```bash
https://web.archive.org/web/20150125102636/https://github.com/
```
```python
github_archive_near_2018_4_july_9_2_am = waybackpy_url_obj.near(year=2018, month=7, day=4, hour=9, minute=2)
print(github_archive_near_2018_4_july_9_2_am)
```
```bash
https://web.archive.org/web/20180704090245/https://github.com/
```
<sub>The package doesn't support the seconds' argument yet. You are encouraged to create a PR ;)</sub>
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNearExample></sub>
#### Get the content of webpage using get()
```python
import waybackpy
google_url = "https://www.google.com/"
User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
waybackpy_url_object = waybackpy.Url(google_url, User_Agent)
# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)
# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(waybackpy_url_object.save())
print(google_newest_archive_source)
# waybackpy_url_object.oldest() type is str, it's oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(waybackpy_url_object.oldest())
print(google_oldest_archive_source)
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyGetExample#main.py></sub>
#### Count total archives for an URL using total_archives()
```python
import waybackpy
URL = "https://en.wikipedia.org/wiki/Python (programming language)"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)
archive_count = waybackpy_url_object.total_archives()
print(archive_count) # total_archives() returns an int
```
```bash
2516
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyTotalArchivesExample></sub>
#### List of URLs that Wayback Machine knows and has archived for a domain name
1) If alive=True is set, waybackpy checks every URL to identify the ones that are still alive. Don't use this with popular websites like Google, or it will take too long.
2) To include URLs from subdomains, set subdomain=True.
```python
import waybackpy
URL = "akamhy.github.io"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)
known_urls = waybackpy_url_object.known_urls(alive=True, subdomain=False) # alive and subdomain are optional.
print(known_urls) # known_urls() returns list of URLs
```
```bash
['http://akamhy.github.io',
'https://akamhy.github.io/waybackpy/',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py></sub>
### With the Command-line interface
#### Save
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
$ waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
$ waybackpy --oldest --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid
$ waybackpy --newest --url "https://en.wikipedia.org/wiki/Remote_sensing" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20201221130522/https://en.wikipedia.org/wiki/Remote_sensing
```
> CLI documentation is at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashSave></sub>
### 🛡 License
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
#### Get archive URL
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --archive_url
https://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashArchiveUrl></sub>
#### Oldest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashOldest></sub>
#### Newest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNewest></sub>
#### Get JSON data of avaialblity API
```bash
waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --json
```
```javascript
{'archived_snapshots': {'closest': {'timestamp': '20201007132458', 'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX'}}, 'url': 'https://en.wikipedia.org/wiki/SpaceX'}
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashJSON></sub>
#### Total number of archives
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
853
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashTotal></sub>
#### Archive near time
```bash
$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
https://web.archive.org/web/20120512142515/https://www.facebook.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNear></sub>
#### Get the source code
```bash
waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the URL
waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on Wayback machine then print the source code of this archive.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashGet></sub>
#### Fetch all the URLs that the Wayback Machine knows for a domain
1) Add the `--alive` flag to fetch only links that are still alive.
2) Add the `--subdomain` flag to include subdomains.
3) The `--alive` and `--subdomain` flags can be combined.
4) All links are saved to a file created in the current working directory.
```bash
pip install waybackpy
# Ignore the above installation line.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --alive
# Prints all known URLs under akamhy.github.io that are still alive (no dead links).
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io, including subdomains.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain --alive
# Prints all known URLs under akamhy.github.io, including subdomains, that are still alive.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh></sub>
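Known URLs come from the Wayback Machine's CDX server, which the commit log above also recommends over the legacy interface. The sketch below only builds a query URL and makes no network call. It assumes the public CDX endpoint and its `url`, `output`, `fl` and `collapse` parameters; `known_urls_query` is a hypothetical name, and the query waybackpy actually sends may differ:

```python
from urllib.parse import urlencode

# Public CDX server endpoint of the Wayback Machine.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def known_urls_query(domain: str, subdomain: bool = False) -> str:
    """Build a CDX query URL listing the URLs captured under a domain."""
    # "*." widens the match to subdomains; "/*" matches every path.
    pattern = ("*.%s/*" if subdomain else "%s/*") % domain
    params = {
        "url": pattern,
        "output": "json",      # JSON rows instead of plain text
        "fl": "original",      # return only the original URL field
        "collapse": "urlkey",  # merge consecutive rows that share a urlkey
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


print(known_urls_query("akamhy.github.io", subdomain=True))
```

An `--alive` style filter would be a separate step: request each returned URL and keep the ones that still respond.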
## Tests
To run tests locally:
1) Install or update the testing/coverage tools
```bash
pip install codecov pytest pytest-cov -U
```
2) Inside the repository run the following commands
```bash
pytest --cov=waybackpy tests/
```
3) To report coverage run
```bash
bash <(curl -s https://codecov.io/bash) -t SECRET_CODECOV_TOKEN
```
You can find the tests [here](https://github.com/akamhy/waybackpy/tree/master/tests).
## Packaging
1. Increment version.
2. Build package ``python setup.py sdist bdist_wheel``.
3. Sign & upload the package ``twine upload -s dist/*``.
## License
Released under the MIT License. See
[license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.
Released under the MIT License. See [license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.

@@ -1,268 +0,0 @@
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="629.000000pt" height="103.000000pt" viewBox="0 0 629.000000 103.000000"
preserveAspectRatio="xMidYMid meet">
<g transform="translate(0.000000,103.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M0 515 l0 -515 3145 0 3145 0 0 515 0 515 -3145 0 -3145 0 0 -515z
m5413 439 c31 -6 36 -10 31 -26 -3 -10 0 -26 7 -34 6 -8 10 -17 7 -20 -3 -2
-17 11 -32 31 -15 19 -41 39 -59 44 -38 11 -10 14 46 5z m150 -11 c-7 -2 -21
-2 -30 0 -10 3 -4 5 12 5 17 0 24 -2 18 -5z m-4869 -23 c-6 -6 -21 -6 -39 -1
-30 9 -30 9 10 10 25 1 36 -2 29 -9z m452 -37 c-3 -26 -15 -65 -25 -88 -10
-22 -21 -64 -25 -94 -3 -29 -14 -72 -26 -95 -11 -23 -20 -51 -20 -61 0 -30
-39 -152 -53 -163 -6 -5 -45 -12 -85 -14 -72 -5 -102 4 -102 33 0 6 -9 31 -21
56 -11 25 -26 72 -33 103 -6 31 -17 64 -24 73 -8 9 -22 37 -32 64 l-18 48 -16
-39 c-9 -21 -16 -44 -16 -50 0 -6 -7 -24 -15 -40 -8 -16 -24 -63 -34 -106 -11
-43 -26 -93 -34 -112 -14 -34 -15 -35 -108 -46 -70 -9 -96 -9 -106 0 -21 17
-43 64 -43 92 0 14 -4 27 -9 31 -12 7 -50 120 -66 200 -8 35 -25 81 -40 103
-14 22 -27 52 -28 68 -2 28 0 29 48 31 28 1 82 5 120 9 54 4 73 3 82 -7 11
-15 53 -148 53 -170 0 -7 9 -32 21 -56 20 -41 39 -49 39 -17 0 8 -5 12 -10 9
-6 -3 -13 2 -16 12 -3 10 -10 26 -15 36 -14 26 7 21 29 -8 l20 -26 7 33 c7 35
41 149 56 185 7 19 16 23 56 23 27 0 80 2 120 6 80 6 88 1 97 -71 3 -20 9 -42
14 -48 5 -7 20 -43 32 -82 13 -38 24 -72 26 -74 2 -2 13 4 24 14 13 12 20 31
20 55 0 20 7 56 15 81 7 24 19 63 25 87 12 47 31 60 89 61 l34 1 -7 -47z
m3131 41 c17 -3 34 -12 37 -20 3 -7 1 -48 -4 -91 -4 -43 -7 -80 -4 -82 2 -2
11 2 20 10 9 7 24 18 34 24 9 5 55 40 101 77 79 64 87 68 136 68 28 0 54 -4
58 -10 3 -5 12 -7 20 -3 9 3 15 -1 15 -9 0 -13 -180 -158 -197 -158 -4 0 -14
-9 -20 -20 -11 -17 -7 -27 27 -76 22 -32 40 -63 40 -70 0 -7 6 -19 14 -26 7
-8 37 -48 65 -89 l52 -74 -28 -3 c-51 -5 -74 -12 -68 -22 9 -14 -59 -12 -73 2
-20 20 -13 30 10 14 34 -24 44 -19 17 8 -25 25 -109 140 -109 149 0 7 -60 97
-64 97 -2 0 -11 -10 -22 -22 -18 -21 -18 -21 0 -15 10 4 25 2 32 -4 18 -15 19
-35 2 -22 -7 6 -25 13 -39 17 -34 8 -39 -5 -39 -94 0 -38 -3 -75 -6 -84 -6
-16 -54 -22 -67 -9 -4 3 -40 7 -81 8 -101 2 -110 10 -104 97 3 37 10 73 16 80
6 8 10 77 10 174 0 89 2 166 6 172 6 11 162 15 213 6z m301 -1 c-25 -2 -52
-11 -58 -19 -7 -7 -17 -14 -23 -14 -5 0 -2 9 8 20 14 16 29 20 69 18 l51 -2
-47 -3z m809 -9 c33 -21 65 -89 62 -132 -1 -21 1 -47 5 -59 9 -28 -26 -111
-51 -120 -10 -3 -25 -12 -33 -19 -10 -8 -70 -15 -170 -21 l-155 -8 4 -73 c4
-93 -10 -112 -80 -112 -26 0 -60 5 -74 12 -19 8 -31 8 -51 -1 -45 -20 -55 -1
-55 98 0 47 -1 111 -3 141 -2 30 -5 107 -7 170 l-4 115 65 2 c36 2 103 7 150
11 150 15 372 13 397 -4z m338 -19 c11 -14 46 -54 78 -88 l58 -62 62 65 c34
36 75 73 89 83 28 18 113 24 122 9 3 -5 -32 -51 -77 -102 -147 -167 -134 -143
-139 -253 -3 -54 -10 -103 -16 -109 -8 -8 -8 -17 -1 -30 14 -26 11 -28 -47
-29 -119 -2 -165 3 -174 22 -6 10 -9 69 -8 131 l2 113 -57 75 c-32 41 -80 102
-107 134 -27 33 -47 62 -45 66 3 4 58 6 122 4 113 -3 119 -5 138 -29z m-4233
13 c16 -13 98 -150 98 -164 0 -4 29 -65 65 -135 36 -71 65 -135 65 -143 0 -10
-14 -17 -37 -21 -21 -4 -48 -10 -61 -16 -40 -16 -51 -10 -77 41 -29 57 -35 59
-157 38 -65 -11 -71 -14 -84 -43 -10 -25 -21 -34 -46 -38 -41 -6 -61 8 -48 33
15 28 12 38 -12 42 -18 2 -23 10 -24 36 -1 27 3 35 23 43 13 5 34 9 46 9 23 0
57 47 57 78 0 9 10 33 22 52 14 24 21 52 22 92 1 49 4 58 24 67 13 6 31 11 40
11 9 0 26 7 36 15 24 18 28 18 48 3z m1701 0 c16 -12 97 -143 97 -157 0 -3 32
-69 70 -146 39 -76 67 -142 62 -147 -4 -4 -28 -12 -52 -17 -25 -6 -57 -13 -72
-17 -25 -6 -29 -2 -50 42 -14 30 -31 50 -43 53 -11 2 -57 -2 -103 -9 -79 -12
-83 -13 -96 -45 -10 -24 -22 -34 -46 -38 -43 -9 -53 -1 -45 39 5 30 3 34 -15
34 -17 0 -20 6 -20 39 0 40 13 50 65 51 19 0 55 48 55 72 0 6 8 29 19 52 32
72 41 107 31 127 -8 14 -5 21 12 33 12 9 32 16 43 16 11 0 29 7 39 15 24 18
28 18 49 3z m-3021 -11 c-29 -9 -32 -13 -27 -39 8 -36 -11 -37 -20 -1 -8 32
15 54 54 52 24 -1 23 -2 -7 -12z m3499 4 c-12 -8 -51 -4 -51 5 0 2 15 4 33 4
22 0 28 -3 18 -9z m1081 -67 c2 -42 0 -78 -4 -81 -5 -2 -8 18 -8 45 0 27 -3
64 -6 81 -4 19 -2 31 4 31 6 0 12 -32 14 -76z m-1951 46 c12 -7 19 -21 19 -38
l-1 -27 -15 28 c-8 15 -22 27 -32 27 -9 0 -24 5 -32 10 -21 14 35 13 61 0z
m1004 -3 c73 -19 135 -61 135 -92 0 -15 -8 -29 -21 -36 -18 -9 -30 -6 -69 15
-37 20 -62 26 -109 26 -54 0 -62 -3 -78 -26 -21 -32 -33 -130 -25 -191 9 -58
41 -84 111 -91 38 -3 61 1 97 17 36 17 49 19 60 10 25 -21 15 -48 -28 -76 -38
-24 -54 -28 -148 -31 -114 -4 -170 10 -190 48 -6 11 -16 20 -23 20 -24 0 -59
95 -59 159 0 59 20 122 42 136 6 3 10 13 10 22 0 31 80 82 130 83 19 0 42 5
50 10 21 13 57 12 115 -3z m-1682 -23 c-14 -14 -28 -23 -31 -20 -8 8 29 46 44
46 7 0 2 -11 -13 -26z m159 -2 c-20 -15 -22 -23 -16 -60 4 -28 3 -42 -5 -42
-7 0 -11 19 -11 50 0 36 5 52 18 59 28 17 39 12 14 -7z m1224 -28 c-39 -40
-46 -38 -19 7 15 24 40 41 52 33 2 -2 -13 -20 -33 -40z m-1538 -33 l62 -66 63
68 c56 59 68 67 100 67 19 0 38 -3 40 -7 3 -5 -32 -53 -76 -108 -88 -108 -84
-97 -90 -255 l-2 -55 -87 -3 c-49 -1 -88 -1 -89 0 0 2 -3 50 -5 107 -3 75 -8
109 -19 121 -8 9 -15 20 -15 25 0 4 -18 29 -41 54 -83 94 -89 102 -84 111 3 6
45 9 93 9 l87 -1 63 -67z m786 59 c33 -12 48 -42 52 -107 3 -43 0 -57 -16 -73
l-20 -20 20 -28 c26 -35 35 -89 21 -125 -18 -46 -66 -60 -226 -64 -77 -3 -166
-7 -198 -10 -84 -7 -99 9 -97 102 1 38 -1 125 -4 191 l-5 122 47 5 c26 3 103
4 171 2 69 -2 134 1 145 5 29 12 80 12 110 0z m-1050 -16 c3 -8 2 -12 -4 -9
-6 3 -10 10 -10 16 0 14 7 11 14 -7z m-374 -22 c0 -9 -5 -24 -10 -32 -7 -11
-10 -5 -10 23 0 23 4 36 10 32 6 -3 10 -14 10 -23z m1701 16 c2 -21 -2 -43
-10 -51 -4 -4 -7 9 -8 28 -1 32 15 52 18 23z m2859 -28 c-11 -20 -50 -28 -50
-10 0 6 9 10 19 10 11 0 23 5 26 10 12 19 16 10 5 -10z m-4759 -47 c-8 -15
-10 -15 -11 -2 0 17 10 32 18 25 2 -3 -1 -13 -7 -23z m2599 9 c0 -9 -40 -35
-46 -29 -6 6 25 37 37 37 5 0 9 -3 9 -8z m316 -127 c-4 -19 -12 -37 -18 -41
-8 -5 -9 -1 -5 10 4 10 7 36 7 59 1 35 2 39 11 24 6 -10 8 -34 5 -52z m1942
38 c-15 -16 -30 -45 -33 -65 -4 -21 -12 -38 -17 -38 -19 0 3 74 30 103 14 15
30 27 36 27 5 0 -2 -12 -16 -27z m-3855 -16 c-6 -12 -15 -33 -20 -47 -9 -23
-10 -23 -15 -3 -3 12 3 34 14 52 23 35 37 34 21 -2z m3282 -82 c-23 -18 -81
-35 -115 -34 -17 1 -11 5 21 13 25 7 54 18 65 24 30 18 53 15 29 -3z m-2585
-130 c-7 -8 -19 -15 -27 -15 -10 0 -7 8 9 31 18 24 24 27 26 14 2 -9 -2 -22
-8 -30z m-1775 -5 c-4 -12 -9 -19 -12 -17 -3 3 -2 15 2 27 4 12 9 19 12 17 3
-3 2 -15 -2 -27z m820 -29 c-9 -8 -25 21 -25 44 0 16 3 14 15 -9 9 -16 13 -32
10 -35z m2085 47 c0 -17 -31 -48 -47 -48 -11 0 -8 8 9 29 24 32 38 38 38 19z
m-1655 -47 c-11 -10 -35 11 -35 30 0 21 0 21 19 -2 11 -13 18 -26 16 -28z
m1221 24 c13 -14 21 -25 18 -25 -11 0 -54 33 -54 41 0 15 12 10 36 -16z
m-1428 -7 c-3 -7 -18 -14 -34 -15 -20 -1 -22 0 -6 4 12 2 22 9 22 14 0 5 5 9
11 9 6 0 9 -6 7 -12z m3574 -45 c8 -10 6 -13 -11 -13 -18 0 -21 6 -20 38 0 34
1 35 10 13 5 -14 15 -31 21 -38z m-4097 14 c19 -4 19 -4 2 -12 -18 -7 -46 16
-47 39 0 6 6 3 13 -6 6 -9 21 -18 32 -21z m1700 1 c19 -5 19 -5 2 -13 -18 -7
-46 17 -46 40 0 6 5 3 12 -6 7 -9 21 -19 32 -21z m-1970 12 c-3 -5 -21 -9 -38
-9 l-32 2 35 7 c19 4 36 8 38 9 2 0 0 -3 -3 -9z m350 0 c-27 -12 -35 -12 -35
0 0 6 12 10 28 9 24 0 25 -1 7 -9z m1350 0 c-3 -5 -18 -9 -33 -9 l-27 1 30 8
c17 4 31 8 33 9 2 0 0 -3 -3 -9z m355 0 c-19 -13 -30 -13 -30 0 0 6 10 10 23
10 18 0 19 -2 7 -10z m-2324 -35 c-6 -22 -11 -25 -44 -24 -31 2 -32 3 -9 6 18
3 32 14 39 29 14 30 23 24 14 -11z m2839 16 c-14 -14 -73 -26 -60 -13 6 5 19
12 30 15 34 8 40 8 30 -2z m212 -21 l48 -8 -47 -1 c-56 -1 -78 6 -78 26 0 12
3 13 14 3 8 -6 36 -15 63 -20z m116 -1 c-6 -6 -18 -6 -28 -3 -18 7 -18 8 1 14
23 9 39 1 27 -11z m633 -14 c31 5 35 4 21 -5 -9 -6 -34 -10 -55 -8 -31 3 -37
7 -40 28 l-3 25 19 -23 c16 -20 24 -23 58 -17z m939 15 c16 -7 11 -9 -20 -9
-29 -1 -36 2 -25 9 17 11 19 11 45 0z m-5445 -24 c6 -8 21 -16 33 -18 19 -3
20 -4 5 -10 -12 -5 -27 1 -45 17 -16 13 -23 25 -17 25 6 0 17 -6 24 -14z m150
-76 c0 -11 -4 -20 -10 -20 -14 0 -13 -103 1 -117 21 -21 2 -43 -36 -43 -19 0
-35 5 -35 11 0 8 -5 7 -15 -1 -21 -17 -44 2 -28 22 22 26 20 128 -2 128 -8 0
-15 9 -15 19 0 18 8 20 70 20 63 0 70 -2 70 -19z m1189 -63 c17 -32 31 -62 31
-66 0 -14 -43 -21 -57 -9 -7 6 -29 12 -48 14 -26 2 -35 -1 -40 -16 -4 -12 -12
-17 -21 -13 -8 3 -13 12 -10 19 3 8 1 14 -4 14 -18 0 -10 22 9 27 22 6 43 46
35 67 -3 9 5 20 23 30 34 18 38 14 82 -67z m2146 -8 l34 -67 -25 -6 c-14 -4
-31 -3 -37 2 -7 5 -29 12 -49 16 -31 6 -38 4 -38 -9 0 -8 -7 -15 -15 -15 -8 0
-15 7 -15 15 0 8 -4 15 -10 15 -19 0 -10 21 14 30 16 6 27 20 31 40 4 18 16
41 27 52 26 26 40 14 83 -73z m-3205 51 c8 -10 20 -26 27 -36 10 -17 12 -14
12 19 1 36 2 37 37 37 l37 0 -8 -72 c-3 -40 -11 -76 -17 -79 -20 -13 -43 3
-62 42 -27 56 -34 56 -41 4 -7 -42 -9 -44 -34 -39 -35 9 -34 6 -35 71 -1 41 4
62 14 70 18 15 50 7 70 -17z m280 11 c-5 -11 -15 -21 -21 -23 -13 -4 -14 -101
-3 -120 5 -8 1 -9 -10 -5 -10 4 -29 7 -42 7 -22 0 -24 3 -24 55 0 52 -1 55
-26 55 -19 0 -25 5 -22 18 2 13 17 18 68 23 36 3 71 6 78 7 9 2 10 -3 2 -17z
m178 -3 c3 -15 -4 -18 -32 -18 -25 0 -36 -4 -36 -15 0 -10 11 -15 35 -15 24 0
35 -5 35 -15 0 -11 -11 -15 -41 -15 -55 0 -47 -24 9 -28 29 -2 42 -8 42 -18 0
-16 -25 -17 -108 -7 l-53 6 2 56 c3 92 1 90 77 88 55 -2 67 -5 70 -19z m230
10 c18 -18 14 -56 -7 -77 -17 -17 -18 -21 -5 -40 14 -19 13 -21 -4 -21 -10 0
-28 11 -40 25 -24 27 -52 24 -52 -5 0 -24 -9 -29 -43 -23 -26 5 -27 7 -27 73
0 45 4 70 13 73 26 11 153 7 165 -5z m557 -2 c47 -20 47 -40 0 -32 -53 10 -77
-7 -73 -52 l3 -37 48 1 c26 0 47 -3 47 -6 0 -35 -108 -42 -140 -10 -29 29 -27
94 5 125 28 28 60 31 110 11z m213 -8 c3 -15 -4 -18 -38 -18 -50 0 -51 -22 -1
-30 44 -7 44 -24 -1 -28 -54 -5 -52 -32 2 -32 29 0 40 -4 40 -15 0 -17 -28
-19 -104 -9 l-46 7 0 72 0 72 72 -1 c61 -1 73 -4 76 -18z m312 6 c0 -9 -9 -18
-21 -21 -19 -5 -20 -12 -17 -69 3 -63 3 -63 -22 -58 -49 11 -50 12 -50 64 0
43 -3 50 -20 50 -13 0 -20 7 -20 20 0 17 8 20 68 23 37 2 70 4 75 5 4 1 7 -5
7 -14z m155 6 c65 -15 94 -73 62 -125 -14 -24 -25 -28 -92 -33 -44 -3 -54 0
-78 24 -34 34 -36 82 -4 111 37 34 53 37 112 23z m505 -3 c0 -8 -9 -40 -20
-72 -11 -31 -18 -60 -16 -64 3 -4 -9 -8 -25 -9 -25 -2 -31 3 -51 45 l-22 47
-21 -46 c-17 -38 -25 -47 -51 -50 -24 -3 -30 0 -32 17 -1 12 -8 40 -17 64 -21
59 -20 61 20 61 27 0 35 -4 35 -17 0 -10 4 -24 9 -32 7 -11 13 -6 25 23 14 35
18 37 53 34 32 -2 39 -7 41 -28 6 -43 19 -43 36 -1 15 40 36 55 36 28z m136
-4 c27 -45 64 -115 64 -122 0 -13 -42 -22 -54 -12 -6 5 -28 11 -49 15 -32 6
-38 4 -45 -13 -8 -24 -26 -16 -36 16 -5 16 -2 25 13 32 11 6 25 28 32 48 17
55 53 71 75 36z m840 -4 c22 -18 16 -32 -11 -25 -59 15 -94 -18 -74 -71 8 -21
15 -24 47 -22 40 3 66 -7 57 -21 -3 -5 -12 -7 -20 -3 -8 3 -15 1 -15 -4 0 -17
-111 4 -126 24 -26 34 -13 100 25 131 18 14 96 9 117 -9z m816 -54 l37 -70
-25 -8 c-16 -6 -30 -5 -40 3 -22 19 -81 22 -88 4 -7 -19 -26 -18 -26 1 0 8 -4
15 -10 15 -20 0 -9 21 15 30 24 9 30 24 27 63 -1 10 2 16 7 13 5 -3 12 1 15
10 4 9 15 14 28 12 17 -2 33 -22 60 -73z m183 61 c47 -20 47 -40 0 -32 -46 9
-75 -7 -75 -42 0 -45 13 -56 59 -49 30 4 41 2 41 -8 0 -32 -95 -35 -134 -4
-30 24 -34 64 -11 109 22 43 60 51 120 26z m398 4 c19 0 24 -26 6 -32 -13 -4
-16 -42 -5 -84 l7 -32 -55 -1 c-57 0 -68 7 -41 29 17 14 21 90 5 90 -5 0 -10
10 -10 21 0 19 4 21 38 15 20 -3 45 -6 55 -6z m117 0 c5 0 17 -13 27 -30 9
-16 21 -30 25 -30 4 0 8 14 8 30 0 28 3 30 36 30 l36 0 -5 -71 c-2 -42 -9 -74
-17 -79 -15 -9 -50 -1 -50 12 0 5 -11 25 -24 45 l-24 35 -9 -42 c-4 -23 -11
-41 -15 -41 -5 1 -19 1 -32 1 -23 0 -23 2 -20 67 3 66 15 88 42 78 8 -3 18 -5
22 -5z m317 -3 c21 -15 4 -27 -38 -27 -50 0 -49 -23 1 -30 50 -8 51 -30 1 -30
-30 0 -41 -4 -41 -15 0 -11 12 -15 45 -15 33 0 45 -4 45 -15 0 -17 -24 -19
-108 -8 l-54 6 6 66 c3 36 5 69 6 72 0 11 124 7 137 -4z m-4374 -7 c9 0 17 -4
17 -10 0 -5 -16 -10 -35 -10 -28 0 -35 -4 -35 -19 0 -15 8 -21 35 -23 20 -2
35 -7 35 -13 0 -5 -15 -11 -35 -13 -30 -3 -35 -7 -35 -28 0 -18 -5 -24 -23
-24 -13 0 -28 -5 -33 -10 -7 -7 -11 9 -13 51 -1 35 -6 70 -11 79 -7 13 -2 16
28 18 20 2 39 5 41 8 3 3 15 3 26 0 11 -3 28 -6 38 -6z m1856 -14 c23 -21 38
-20 51 4 6 11 17 20 25 20 16 0 20 -16 6 -24 -17 -11 -50 -94 -44 -114 4 -18
0 -20 -34 -19 l-38 2 3 40 c3 33 -1 45 -22 64 -36 34 -34 53 5 47 17 -2 39
-12 48 -20z m299 -18 c-3 -24 -1 -55 3 -70 6 -24 4 -29 -14 -32 -41 -9 -155
-14 -163 -7 -5 3 -10 36 -12 73 l-2 67 67 4 c38 2 81 4 97 5 27 2 28 1 24 -40z
m512 22 c0 -11 4 -20 9 -20 4 0 20 9 34 20 25 20 57 27 57 12 0 -5 -14 -18
-30 -31 l-30 -22 26 -44 c24 -41 24 -45 7 -45 -10 0 -27 14 -37 31 -21 35 -40
34 -44 -4 -3 -22 -8 -27 -32 -27 -39 0 -43 11 -35 86 l7 64 34 0 c27 0 34 -4
34 -20z m511 12 c0 -4 1 -36 2 -72 l2 -65 -32 -3 c-28 -3 -32 0 -39 30 l-7 33
-14 -33 c-16 -40 -34 -41 -51 -2 -16 35 -35 31 -26 -6 6 -22 3 -24 -30 -24
l-36 0 -1 55 c-1 30 -2 61 -3 68 -1 7 14 13 34 15 33 3 38 -1 59 -39 l24 -42
18 24 c10 13 19 29 19 35 0 5 4 14 10 20 11 11 70 16 71 6z m509 -28 c0 -31 3
-35 23 -32 17 2 23 11 25 36 3 29 6 32 36 32 l34 0 1 -75 1 -75 -29 0 c-23 0
-30 5 -35 26 -5 19 -12 25 -29 22 -17 -2 -22 -10 -22 -30 1 -24 -2 -27 -25
-22 -45 10 -50 13 -50 33 0 11 -6 21 -12 24 -10 4 -10 7 0 18 6 7 12 25 12 39
0 34 7 40 42 40 25 0 28 -3 28 -36z"/>
<path d="M800 860 c30 -24 44 -25 36 -4 -3 9 -6 18 -6 20 0 2 -12 4 -27 4
l-28 0 25 -20z"/>
<path d="M310 850 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M366 851 c-8 -12 21 -34 33 -27 6 4 8 13 4 21 -6 17 -29 20 -37 6z"/>
<path d="M920 586 c0 -9 7 -16 16 -16 9 0 14 5 12 12 -6 18 -28 21 -28 4z"/>
<path d="M965 419 c-4 -6 -5 -13 -2 -16 7 -7 27 6 27 18 0 12 -17 12 -25 -2z"/>
<path d="M362 388 c3 -7 15 -14 29 -16 24 -4 24 -3 4 12 -24 19 -38 20 -33 4z"/>
<path d="M4106 883 c-14 -14 -5 -31 14 -26 11 3 20 9 20 13 0 10 -26 20 -34
13z"/>
<path d="M4590 870 c-14 -10 -22 -22 -18 -25 7 -8 57 25 58 38 0 12 -14 8 -40
-13z"/>
<path d="M4380 655 c7 -8 17 -15 22 -15 6 0 5 7 -2 15 -7 8 -17 15 -22 15 -6
0 -5 -7 2 -15z"/>
<path d="M4082 560 c-6 -11 -12 -28 -12 -37 0 -13 6 -10 20 12 11 17 20 33 20
38 0 14 -15 7 -28 -13z"/>
<path d="M4496 466 c3 -9 11 -16 16 -16 13 0 5 23 -10 28 -7 2 -10 -2 -6 -12z"/>
<path d="M4236 445 c-9 -24 5 -41 16 -20 7 11 7 20 0 27 -6 6 -12 3 -16 -7z"/>
<path d="M4540 400 c0 -5 5 -10 11 -10 5 0 7 5 4 10 -3 6 -8 10 -11 10 -2 0
-4 -4 -4 -10z"/>
<path d="M5330 891 c0 -11 26 -22 34 -14 3 3 3 10 0 14 -7 12 -34 11 -34 0z"/>
<path d="M4805 880 c-8 -13 4 -32 16 -25 12 8 12 35 0 35 -6 0 -13 -4 -16 -10z"/>
<path d="M5070 821 l-35 -6 0 -75 0 -75 40 -3 c22 -2 58 3 80 10 38 12 40 16
47 63 12 88 -16 107 -132 86z m109 -36 c3 -19 2 -19 -15 -4 -11 9 -26 19 -34
22 -8 4 -2 5 15 4 21 -1 31 -8 34 -22z"/>
<path d="M5411 694 c0 -11 3 -14 6 -6 3 7 2 16 -1 19 -3 4 -6 -2 -5 -13z"/>
<path d="M5223 674 c-10 -22 -10 -25 3 -20 9 3 18 6 20 6 2 0 4 9 4 20 0 28
-13 25 -27 -6z"/>
<path d="M5001 422 c-14 -27 -12 -35 8 -23 7 5 11 17 9 27 -4 17 -5 17 -17 -4z"/>
<path d="M5673 883 c9 -9 19 -14 23 -11 10 10 -6 28 -24 28 -15 0 -15 -1 1
-17z"/>
<path d="M5866 717 c-14 -10 -16 -16 -7 -22 15 -9 35 8 30 24 -3 8 -10 7 -23
-2z"/>
<path d="M5700 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M5700 451 c0 -23 25 -46 34 -32 4 6 -2 19 -14 31 -19 19 -20 19 -20
1z"/>
<path d="M1375 850 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M1391 687 c-5 -12 -7 -35 -6 -50 2 -15 -1 -27 -7 -27 -5 0 -6 9 -3
21 5 15 4 19 -4 15 -6 -4 -11 -18 -11 -30 0 -19 7 -25 33 -29 17 -2 42 1 55 7
l22 12 -27 52 c-29 57 -39 63 -52 29z"/>
<path d="M1240 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M1575 490 c4 -14 9 -27 11 -29 7 -7 34 9 34 20 0 7 -3 9 -7 6 -3 -4
-15 1 -26 10 -19 17 -19 17 -12 -7z"/>
<path d="M3094 688 c-4 -13 -7 -35 -6 -50 1 -16 -2 -28 -8 -28 -5 0 -6 7 -3
17 4 11 3 14 -5 9 -16 -10 -15 -49 1 -43 6 2 20 0 29 -4 10 -6 27 -5 41 2 28
13 26 30 -8 86 -24 39 -31 41 -41 11z"/>
<path d="M3270 502 c0 -19 29 -47 39 -37 6 7 1 16 -15 28 -13 10 -24 14 -24 9z"/>
<path d="M3570 812 c-13 -10 -21 -24 -19 -31 3 -7 15 0 34 19 31 33 21 41 -15
12z"/>
<path d="M3855 480 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M3585 450 c3 -5 13 -10 21 -10 8 0 12 5 9 10 -3 6 -13 10 -21 10 -8
0 -12 -4 -9 -10z"/>
<path d="M1880 820 c0 -5 7 -10 16 -10 8 0 12 5 9 10 -3 6 -10 10 -16 10 -5 0
-9 -4 -9 -10z"/>
<path d="M2042 668 c-7 -7 -12 -23 -12 -37 1 -24 2 -24 16 8 16 37 14 47 -4
29z"/>
<path d="M2015 560 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M1915 470 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M2320 795 c0 -14 5 -25 10 -25 6 0 10 11 10 25 0 14 -4 25 -10 25 -5
0 -10 -11 -10 -25z"/>
<path d="M2660 771 c0 -6 5 -13 10 -16 6 -3 10 1 10 9 0 9 -4 16 -10 16 -5 0
-10 -4 -10 -9z"/>
<path d="M2487 763 c-4 -3 -7 -23 -7 -43 0 -36 1 -38 40 -43 68 -9 116 20 102
61 -3 10 -7 10 -18 1 -11 -9 -14 -7 -14 10 0 18 -6 21 -48 21 -27 0 -52 -3
-55 -7z"/>
<path d="M2320 719 c0 -5 5 -7 10 -4 6 3 10 8 10 11 0 2 -4 4 -10 4 -5 0 -10
-5 -10 -11z"/>
<path d="M2480 550 l0 -40 66 1 c58 1 67 4 76 25 18 39 -4 54 -78 54 l-64 0 0
-40z m40 15 c-7 -8 -16 -15 -21 -15 -5 0 -6 7 -3 15 4 8 13 15 21 15 13 0 13
-3 3 -15z"/>
<path d="M2665 527 c-4 -10 -5 -21 -1 -24 10 -10 18 4 13 24 -4 17 -4 17 -12
0z"/>
<path d="M1586 205 c-9 -23 -8 -25 9 -25 17 0 19 9 6 28 -7 11 -10 10 -15 -3z"/>
<path d="M3727 200 c-3 -13 0 -20 9 -20 15 0 19 26 5 34 -5 3 -11 -3 -14 -14z"/>
<path d="M1194 229 c-3 -6 -2 -15 3 -20 13 -13 43 -1 43 17 0 16 -36 19 -46 3z"/>
<path d="M2470 224 c-18 -46 -12 -73 15 -80 37 -9 52 1 59 40 5 26 3 41 -8 51
-23 24 -55 18 -66 -11z"/>
<path d="M3120 196 c0 -9 7 -16 16 -16 17 0 14 22 -4 28 -7 2 -12 -3 -12 -12z"/>
<path d="M4750 201 c0 -12 5 -21 10 -21 6 0 10 6 10 14 0 8 -4 18 -10 21 -5 3
-10 -3 -10 -14z"/>
<path d="M3515 229 c-8 -12 14 -31 30 -26 6 2 10 10 10 18 0 17 -31 24 -40 8z"/>
<path d="M3521 161 c-7 -5 -9 -11 -4 -14 14 -9 54 4 47 14 -7 11 -25 11 -43 0z"/>
</g>
</svg>

Before | Size: 18 KiB

Binary file not shown.

Before | Size: 10 KiB

@@ -1,85 +1 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
id="svg8"
version="1.1"
viewBox="0 0 176.61171 41.907883"
height="41.907883mm"
width="176.61171mm">
<defs
id="defs2" />
<metadata
id="metadata5">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<g
transform="translate(-0.74835286,-98.31182)"
id="layer1">
<flowRoot
transform="scale(0.26458333)"
style="font-style:normal;font-weight:normal;font-size:40px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none"
id="flowRoot4598"
xml:space="preserve"><flowRegion
id="flowRegion4600"><rect
y="415.4129"
x="-38.183765"
height="48.08326"
width="257.38687"
id="rect4602" /></flowRegion><flowPara
id="flowPara4604"></flowPara></flowRoot> <text
transform="scale(0.86288797,1.158899)"
id="text4777"
y="110.93711"
x="0.93061"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;line-height:4.25;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke:none;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
xml:space="preserve"><tspan
style="stroke-width:7.51955223"
id="tspan4775"
y="110.93711"
x="0.93061"><tspan
id="tspan4773"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:3.56786728px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
y="110.93711"
x="0.93061">waybackpy</tspan></tspan></text>
<rect
y="98.311821"
x="1.4967092"
height="4.8643045"
width="153.78688"
id="rect4644"
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4648"
width="153.78688"
height="4.490128"
x="23.573174"
y="135.72957" />
<rect
y="135.72957"
x="0.74835336"
height="4.4901319"
width="22.82482"
id="rect4650"
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4652"
width="21.702286"
height="4.8643003"
x="155.2836"
y="98.311821" />
</g>
</svg>
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 176.612 41.908" height="158.392" width="667.51" xmlns:v="https://github.com/akamhy/waybackpy"><text transform="matrix(.862888 0 0 1.158899 -.748 -98.312)" y="110.937" x="0.931" xml:space="preserve" font-weight="bold" font-size="28.149" font-family="sans-serif" letter-spacing="0" word-spacing="0" writing-mode="lr-tb" fill="#003dff"><tspan y="110.937" x="0.931"><tspan y="110.937" x="0.931" letter-spacing="3.568" writing-mode="lr-tb">waybackpy</tspan></tspan></text><path d="M.749 0h153.787v4.864H.749zm22.076 37.418h153.787v4.49H22.825z" fill="navy"/><path d="M0 37.418h22.825v4.49H0zM154.536 0h21.702v4.864h-21.702z" fill="#f0f"/></svg>

Before | Size: 3.6 KiB

After | Size: 694 B

@@ -1 +1,2 @@
requests>=2.24.0
click
requests

@@ -19,21 +19,18 @@ setup(
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/2.3.1.tar.gz",
download_url="https://github.com/akamhy/waybackpy/archive/3.0.0.tar.gz",
keywords=[
"Archive It",
"Archive Website",
"Wayback Machine",
"waybackurls",
"Internet Archive",
],
install_requires=["requests"],
install_requires=["requests", "click"],
python_requires=">=3.4",
classifiers=[
"Development Status :: 5 - Production/Stable",
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Natural Language :: English",
"Topic :: Software Development :: Build Tools",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
@@ -47,7 +44,7 @@ setup(
],
entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},
project_urls={
"Documentation": "https://akamhy.github.io/waybackpy/",
"Documentation": "https://github.com/akamhy/waybackpy/wiki",
"Source": "https://github.com/akamhy/waybackpy",
"Tracker": "https://github.com/akamhy/waybackpy/issues",
},

@@ -1,307 +0,0 @@
# -*- coding: utf-8 -*-
import sys
import os
import pytest
import argparse
sys.path.append("..")
import waybackpy.cli as cli # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
# Namespace(day=None, get=None, hour=None, minute=None, month=None, near=False,
# newest=False, oldest=False, save=False, total=False, url=None, user_agent=None, version=False, year=None)
def test_save():
    args = argparse.Namespace(
        user_agent=None,
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=True,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "pypi.org/user/akamhy" in str(reply)


def test_json():
    args = argparse.Namespace(
        user_agent=None,
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=True,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "archived_snapshots" in str(reply)


def test_archive_url():
    args = argparse.Namespace(
        user_agent=None,
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=True,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "https://web.archive.org/web/" in str(reply)


def test_oldest():
    args = argparse.Namespace(
        user_agent=None,
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=True,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "pypi.org/user/akamhy" in str(reply)


def test_newest():
    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=True,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "pypi.org/user/akamhy" in str(reply)


def test_total_archives():
    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://pypi.org/user/akamhy/",
        total=True,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
    )
    reply = cli.args_handler(args)
    assert isinstance(reply, int)


def test_known_urls():
    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://akamhy.github.io",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=True,
        subdomain=True,
        known_urls=True,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "github" in str(reply)


def test_near():
    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=True,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
        year=2020,
        month=7,
        day=15,
        hour=1,
        minute=1,
    )
    reply = cli.args_handler(args)
    assert "202007" in str(reply)


def test_get():
    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get="url",
    )
    reply = cli.args_handler(args)
    assert "waybackpy" in str(reply)

    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get="oldest",
    )
    reply = cli.args_handler(args)
    assert "waybackpy" in str(reply)

    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get="newest",
    )
    reply = cli.args_handler(args)
    assert "waybackpy" in str(reply)

    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get="save",
    )
    reply = cli.args_handler(args)
    assert "waybackpy" in str(reply)

    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://pypi.org/user/akamhy/",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get="BullShit",
    )
    reply = cli.args_handler(args)
    assert "get the source code of the" in str(reply)


def test_args_handler():
    args = argparse.Namespace(version=True)
    reply = cli.args_handler(args)
    assert ("waybackpy version %s" % (__version__)) == reply

    args = argparse.Namespace(url=None, version=False)
    reply = cli.args_handler(args)
    assert ("waybackpy %s" % (__version__)) in str(reply)


def test_main():
    # This also tests the parse_args method in cli.py
    cli.main(["temp.py", "--version"])
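
Each test in this removed file rebuilds the same fourteen-field `argparse.Namespace` and changes only one or two values. One way to cut that repetition, sketched here with the field names taken from the tests themselves; `DEFAULT_ARGS` and `make_args` are hypothetical names that do not appear in the repository:

```python
import argparse

# Defaults mirror the keyword arguments repeated in each test above.
DEFAULT_ARGS = dict(
    user_agent=None,
    url=None,
    total=False,
    version=False,
    oldest=False,
    save=False,
    json=False,
    archive_url=False,
    newest=False,
    near=False,
    alive=False,
    subdomain=False,
    known_urls=False,
    get=None,
)


def make_args(**overrides):
    """Build a Namespace from the defaults, overriding selected fields."""
    values = dict(DEFAULT_ARGS)
    values.update(overrides)  # extra keys such as year=2020 are allowed
    return argparse.Namespace(**values)


args = make_args(url="https://pypi.org/user/akamhy/", oldest=True)
print(args.oldest, args.save)  # True False
```

With such a helper, a test like `test_near` would shrink to a single `make_args(url=..., near=True, year=2020, month=7, day=15, hour=1, minute=1)` call followed by its assertion.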

@@ -1,184 +0,0 @@
# -*- coding: utf-8 -*-
import sys
import pytest
import random
import requests
sys.path.append("..")
import waybackpy.wrapper as waybackpy # noqa: E402
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_clean_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network_security"
target = waybackpy.Url(test_url, user_agent)
test_result = target._clean_url()
assert answer == test_result
def test_dunders():
url = "https://en.wikipedia.org/wiki/Network_security"
user_agent = "UA"
target = waybackpy.Url(url, user_agent)
assert "waybackpy.Url(url=%s, user_agent=%s)" % (url, user_agent) == repr(target)
assert "en.wikipedia.org" in str(target)
def test_archive_url_parser():
endpoint = "https://amazon.com"
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
headers = {"User-Agent": "%s" % user_agent}
response = waybackpy._get_response(endpoint, params=None, headers=headers)
header = response.headers
with pytest.raises(Exception):
waybackpy._archive_url_parser(header)
def test_url_check():
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
waybackpy.Url(broken_url, user_agent)
def test_save():
# Test for urls that exist and can be archived.
url_list = [
"en.wikipedia.org",
"www.wikidata.org",
"commons.wikimedia.org",
"www.wiktionary.org",
"www.w3schools.com",
"www.ibm.com",
]
x = random.randint(0, len(url_list) - 1)
url1 = url_list[x]
target = waybackpy.Url(
url1,
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36",
)
archived_url1 = str(target.save())
assert url1 in archived_url1
# Test for urls that are incorrect.
with pytest.raises(Exception):
url2 = "ha ha ha ha"
waybackpy.Url(url2, user_agent)
url3 = "http://www.archive.is/faq.html"
with pytest.raises(Exception):
target = waybackpy.Url(
url3,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
"AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
"Safari/533.20.27",
)
target.save()
def test_near():
url = "google.com"
target = waybackpy.Url(
url,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 "
"(KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
)
archive_near_year = target.near(year=2010)
assert "2010" in str(archive_near_year)
archive_near_month_year = str(target.near(year=2015, month=2))
assert (
("201502" in archive_near_month_year)
or ("201501" in archive_near_month_year)
or ("201503" in archive_near_month_year)
)
target = waybackpy.Url(
"www.python.org",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
)
archive_near_hour_day_month_year = str(
target.near(year=2008, month=5, day=9, hour=15)
)
assert (
("2008050915" in archive_near_hour_day_month_year)
or ("2008050914" in archive_near_hour_day_month_year)
or ("2008050913" in archive_near_hour_day_month_year)
)
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = waybackpy.Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
def test_oldest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "20200504141153" in str(target.oldest())
def test_json():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "archived_snapshots" in str(target.JSON)
def test_archive_url():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "github.com/akamhy" in str(target.archive_url)
def test_newest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert url in str(target.newest())
def test_get():
target = waybackpy.Url("google.com", user_agent)
assert "Welcome to Google" in target.get(target.oldest())
def test_wayback_timestamp():
ts = waybackpy._wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4)
assert "202001020304" in str(ts)
def test_get_response():
endpoint = "https://www.google.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = waybackpy._get_response(endpoint, params=None, headers=headers)
assert response.status_code == 200
def test_total_archives():
target = waybackpy.Url(" https://google.com ", user_agent)
assert target.total_archives() > 500000
target = waybackpy.Url(
" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
)
assert target.total_archives() == 0
def test_known_urls():
target = waybackpy.Url("akamhy.github.io", user_agent)
assert len(target.known_urls(alive=True, subdomain=True)) > 2
target = waybackpy.Url("akamhy.github.io", user_agent)
assert len(target.known_urls()) > 3


@ -1,33 +1,7 @@
# -*- coding: utf-8 -*-
# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
# ┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
# ┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
# ━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
# ━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
"""
Waybackpy is a Python package that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Archive pages and retrieve archived pages easily.
Usage:
>>> import waybackpy
>>> target_url = waybackpy.Url('https://www.python.org', 'Your-apps-cool-user-agent')
>>> new_archive = target_url.save()
>>> print(new_archive)
https://web.archive.org/web/20200502170312/https://www.python.org/
Full documentation @ <https://akamhy.github.io/waybackpy/>.
:copyright: (c) 2020 by akamhy.
:license: MIT
"""
from .wrapper import Url
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .__version__ import (
__title__,
__description__,


@ -1,13 +1,11 @@
# -*- coding: utf-8 -*-
__title__ = "waybackpy"
__description__ = (
"A Python package that interfaces with the Internet Archive's Wayback Machine API. "
"Python package that interfaces with the Internet Archive's Wayback Machine APIs. "
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "2.3.1"
__version__ = "3.0.0"
__author__ = "akamhy"
__author_email__ = "akash3pro@gmail.com"
__author_email__ = "akamhy@yahoo.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020 akamhy"
__copyright__ = "Copyright 2020-2022 Akash Mahanty et al."


@ -0,0 +1,109 @@
import re
import time
import requests
from datetime import datetime
from .__version__ import __version__
from .utils import DEFAULT_USER_AGENT
def full_url(endpoint, params):
if not params:
return endpoint.strip()
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = (
full_url
+ amp
+ "{key}={val}".format(key=key, val=requests.utils.quote(str(val)))
)
return full_url
class WaybackMachineAvailabilityAPI:
def __init__(self, url, user_agent=DEFAULT_USER_AGENT):
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = user_agent
self.headers = {"User-Agent": self.user_agent}
self.payload = {"url": "{url}".format(url=self.url)}
self.endpoint = "https://archive.org/wayback/available"
self.JSON = None
def unix_timestamp_to_wayback_timestamp(self, unix_timestamp):
return datetime.utcfromtimestamp(int(unix_timestamp)).strftime("%Y%m%d%H%M%S")
def __repr__(self):
return str(self) # self.__str__()
def __str__(self):
if not self.JSON:
return ""  # __str__ must return a string, even before the API was called
return self.archive_url or ""
def json(self):
self.request_url = full_url(self.endpoint, self.payload)
self.response = requests.get(self.request_url, headers=self.headers)
self.JSON = self.response.json()
return self.JSON
def timestamp(self):
if not self.JSON or not self.JSON["archived_snapshots"]:
return datetime.max
return datetime.strptime(
self.JSON["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
@property
def archive_url(self):
data = self.JSON
if not data["archived_snapshots"]:
archive_url = None
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
return archive_url
def wayback_timestamp(self, **kwargs):
return "".join(
str(kwargs[key]).zfill(2)
for key in ["year", "month", "day", "hour", "minute"]
)
def oldest(self):
return self.near(year=1994)
def newest(self):
return self.near(unix_timestamp=int(time.time()))
def near(
self,
year=None,
month=None,
day=None,
hour=None,
minute=None,
unix_timestamp=None,
):
if unix_timestamp:
timestamp = self.unix_timestamp_to_wayback_timestamp(unix_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = self.wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
)
self.payload["timestamp"] = timestamp
self.json()
return self
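The two timestamp helpers used by `near()` can be sketched on their own. This is a minimal standalone illustration, not the module's code: the function names `wayback_timestamp` and `unix_to_wayback` are assumptions chosen for the example, and `datetime.fromtimestamp(..., tz=timezone.utc)` is swapped in for `utcfromtimestamp`.

```python
from datetime import datetime, timezone


def wayback_timestamp(year, month, day, hour, minute):
    # Zero-pad every field to at least two digits and concatenate them.
    # Built from five fields, the result has 12 digits (no seconds).
    return "".join(str(value).zfill(2) for value in (year, month, day, hour, minute))


def unix_to_wayback(unix_timestamp):
    # Convert seconds since the epoch to a 14-digit YYYYMMDDhhmmss string in UTC.
    moment = datetime.fromtimestamp(int(unix_timestamp), tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


print(wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4))
print(unix_to_wayback(0))
```

Either string can then be sent as the `timestamp` parameter of an availability request.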

185
waybackpy/cdx_api.py Normal file

@ -0,0 +1,185 @@
from .exceptions import WaybackError
from .cdx_snapshot import CDXSnapshot
from .cdx_utils import (
get_total_pages,
get_response,
check_filters,
check_collapses,
check_match_type,
)
from .utils import DEFAULT_USER_AGENT
class WaybackMachineCDXServerAPI:
def __init__(
self,
url,
user_agent=None,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=None,
collapses=[],
limit=None,
):
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = str(user_agent) if user_agent else DEFAULT_USER_AGENT
self.start_timestamp = str(start_timestamp) if start_timestamp else None
self.end_timestamp = str(end_timestamp) if end_timestamp else None
self.filters = filters
check_filters(self.filters)
self.match_type = str(match_type).strip() if match_type else None
check_match_type(self.match_type, self.url)
self.gzip = True if gzip is None else gzip
self.collapses = collapses
check_collapses(self.collapses)
self.limit = limit if limit else 5000
self.last_api_request_url = None
self.use_page = False
self.endpoint = "https://web.archive.org/cdx/search/cdx"
def cdx_api_manager(self, payload, headers, use_page=False):
total_pages = get_total_pages(self.url, self.user_agent)
# Use the pagination API only when there are at least two pages of archives.
# With fewer pages accuracy matters more, and the pagination API can lag.
if use_page and total_pages >= 2:
blank_pages = 0
for i in range(total_pages):
payload["page"] = str(i)
url, res = get_response(
self.endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text
if len(text) == 0:
blank_pages += 1
if blank_pages >= 2:
break
yield text
else:
payload["showResumeKey"] = "true"
payload["limit"] = str(self.limit)
resumeKey = None
more = True
while more:
if resumeKey:
payload["resumeKey"] = resumeKey
url, res = get_response(
self.endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text.strip()
lines = text.splitlines()
more = False
if len(lines) >= 3:
second_last_line = lines[-2]
if len(second_last_line) == 0:
resumeKey = lines[-1].strip()
text = text.replace(resumeKey, "", 1).strip()
more = True
yield text
def add_payload(self, payload):
if self.start_timestamp:
payload["from"] = self.start_timestamp
if self.end_timestamp:
payload["to"] = self.end_timestamp
if self.gzip != True:
payload["gzip"] = "false"
if self.match_type:
payload["matchType"] = self.match_type
if self.filters and len(self.filters) > 0:
for i, f in enumerate(self.filters):
payload["filter" + str(i)] = f
if self.collapses and len(self.collapses) > 0:
for i, f in enumerate(self.collapses):
payload["collapse" + str(i)] = f
# Nothing is returned: the payload dictionary is modified in place.
payload["url"] = self.url
def snapshots(self):
payload = {}
headers = {"User-Agent": self.user_agent}
self.add_payload(payload)
if not self.start_timestamp or self.end_timestamp:
self.use_page = True
if self.collapses != []:
self.use_page = False
texts = self.cdx_api_manager(payload, headers, use_page=self.use_page)
for text in texts:
if text.isspace() or len(text) <= 1 or not text:
continue
snapshot_list = text.split("\n")
for snapshot in snapshot_list:
if len(snapshot) < 46: # 14 + 32 (timestamp+digest)
continue
properties = {
"urlkey": None,
"timestamp": None,
"original": None,
"mimetype": None,
"statuscode": None,
"digest": None,
"length": None,
}
prop_values = snapshot.split(" ")
prop_values_len = len(prop_values)
properties_len = len(properties)
if prop_values_len != properties_len:
raise WaybackError(
"Snapshot returned by Cdx API has {prop_values_len} properties instead of expected {properties_len} properties.\nInvolved Snapshot : {snapshot}".format(
prop_values_len=prop_values_len,
properties_len=properties_len,
snapshot=snapshot,
)
)
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
yield CDXSnapshot(properties)
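The resume-key branch of `cdx_api_manager` relies on the response ending with a blank line followed by the key. A minimal sketch of that split, assuming exactly that layout; the helper name `split_resume_key` is hypothetical and not part of the package, and the sample response is made up:

```python
def split_resume_key(text):
    """Return (body, resume_key); resume_key is None when no key is present."""
    text = text.strip()
    lines = text.splitlines()
    # A resume key is assumed to be the last line, preceded by an empty line.
    if len(lines) >= 3 and lines[-2] == "":
        body = "\n".join(lines[:-2]).strip()
        return body, lines[-1].strip()
    return text, None


body, key = split_resume_key("a 1\nb 2\n\nKEY123\n")
print(repr(body), key)
```

A caller would keep requesting pages, passing the returned key as `resumeKey`, until the key comes back as `None`.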

27
waybackpy/cdx_snapshot.py Normal file

@ -0,0 +1,27 @@
from datetime import datetime
class CDXSnapshot:
def __init__(self, properties):
self.urlkey = properties["urlkey"]
self.timestamp = properties["timestamp"]
self.datetime_timestamp = datetime.strptime(self.timestamp, "%Y%m%d%H%M%S")
self.original = properties["original"]
self.mimetype = properties["mimetype"]
self.statuscode = properties["statuscode"]
self.digest = properties["digest"]
self.length = properties["length"]
self.archive_url = (
"https://web.archive.org/web/" + self.timestamp + "/" + self.original
)
def __str__(self):
return "{urlkey} {timestamp} {original} {mimetype} {statuscode} {digest} {length}".format(
urlkey=self.urlkey,
timestamp=self.timestamp,
original=self.original,
mimetype=self.mimetype,
statuscode=self.statuscode,
digest=self.digest,
length=self.length,
)
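How one space-separated CDX line maps onto these seven fields can be sketched without the class. This is an illustration only: `parse_cdx_line` and `CDX_FIELDS` are hypothetical names, the sample line is made up, and a plain `ValueError` stands in for `WaybackError`.

```python
from datetime import datetime

CDX_FIELDS = (
    "urlkey", "timestamp", "original", "mimetype",
    "statuscode", "digest", "length",
)


def parse_cdx_line(line):
    values = line.split(" ")
    if len(values) != len(CDX_FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(CDX_FIELDS), len(values))
        )
    record = dict(zip(CDX_FIELDS, values))
    # Derived values, mirroring the attributes set in CDXSnapshot.__init__.
    record["datetime_timestamp"] = datetime.strptime(
        record["timestamp"], "%Y%m%d%H%M%S"
    )
    record["archive_url"] = (
        "https://web.archive.org/web/" + record["timestamp"] + "/" + record["original"]
    )
    return record


sample = "org,python)/ 20200502170312 https://www.python.org/ text/html 200 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 1234"
print(parse_cdx_line(sample)["archive_url"])
```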

154
waybackpy/cdx_utils.py Normal file

@ -0,0 +1,154 @@
import re
import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .exceptions import WaybackError
def get_total_pages(url, user_agent):
request_url = (
"https://web.archive.org/cdx/search/cdx?url={url}&showNumPages=true".format(
url=url
)
)
headers = {"User-Agent": user_agent}
return int((requests.get(request_url, headers=headers).text).strip())
def full_url(endpoint, params):
if not params:
return endpoint
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = (
full_url
+ amp
+ "{key}={val}".format(key=key, val=requests.utils.quote(str(val)))
)
return full_url
def get_response(
endpoint,
params=None,
headers=None,
return_full_url=False,
retries=5,
backoff_factor=0.5,
no_raise_on_redirects=False,
):
s = requests.Session()
retries = Retry(
total=retries,
backoff_factor=backoff_factor,
status_forcelist=[500, 502, 503, 504],
)
s.mount("https://", HTTPAdapter(max_retries=retries))
# The URL with parameters required for the get request
url = full_url(endpoint, params)
try:
if not return_full_url:
return s.get(url, headers=headers)
return (url, s.get(url, headers=headers))
except Exception as e:
reason = str(e)
if no_raise_on_redirects:
if "Exceeded 30 redirects" in reason:
return
exc_message = "Error while retrieving {url}.\n{reason}".format(
url=url, reason=reason
)
exc = WaybackError(exc_message)
exc.__cause__ = e
raise exc
def check_filters(filters):
if not isinstance(filters, list):
raise WaybackError("filters must be a list.")
# [!]field:regex
for _filter in filters:
try:
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):(.*)",
_filter,
)
key = match.group(1)
val = match.group(2)
except Exception:
exc_message = (
"Filter '{_filter}' is not following the cdx filter syntax.".format(
_filter=_filter
)
)
raise WaybackError(exc_message)
def check_collapses(collapses):
if not isinstance(collapses, list):
raise WaybackError("collapses must be a list.")
if len(collapses) == 0:
return
for collapse in collapses:
try:
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?",
collapse,
)
field = match.group(1)
N = None
if 2 == len(match.groups()):
N = match.group(2)
if N:
if not (field + N == collapse):
raise Exception
else:
if not (field == collapse):
raise Exception
except Exception:
exc_message = "collapse argument '{collapse}' is not following the cdx collapse syntax.".format(
collapse=collapse
)
raise WaybackError(exc_message)
def check_match_type(match_type, url):
if not match_type:
return
if "*" in url:
raise WaybackError("Can not use wildcard with match_type argument")
legal_match_type = ["exact", "prefix", "host", "domain"]
if match_type not in legal_match_type:
exc_message = "{match_type} is not an allowed match type.\nUse one from 'exact', 'prefix', 'host' or 'domain'".format(
match_type=match_type
)
raise WaybackError(exc_message)
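The query-string construction in `full_url` collapses indexed keys such as `filter0` and `filter1` into a repeated `filter` parameter. A minimal sketch of the same idea, using only the standard library; `build_url` is a hypothetical name, and `urllib.parse.quote` is substituted here for `requests.utils.quote` so the example has no third-party dependency:

```python
from urllib.parse import quote


def build_url(endpoint, params):
    if not params:
        return endpoint
    pairs = []
    for key, val in params.items():
        # "filter0", "filter1", ... are all sent under the repeated key "filter";
        # the same applies to "collapse0", "collapse1", ...
        for prefix in ("filter", "collapse"):
            if key.startswith(prefix):
                key = prefix
        pairs.append("{key}={val}".format(key=key, val=quote(str(val))))
    separator = "" if endpoint.endswith("?") else "?"
    return endpoint + separator + "&".join(pairs)


print(
    build_url(
        "https://web.archive.org/cdx/search/cdx",
        {"url": "example.com", "filter0": "statuscode:200", "filter1": "mimetype:text/html"},
    )
)
```

Indexed keys are needed in the first place because a Python dictionary cannot hold the same key twice.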


@ -1,262 +1,347 @@
# -*- coding: utf-8 -*-
import sys
import os
import click
import re
import argparse
import string
import os
import json as JSON
import random
from waybackpy.wrapper import Url
from waybackpy.__version__ import __version__
import string
from .__version__ import __version__
from .utils import DEFAULT_USER_AGENT
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .wrapper import Url
def _save(obj):
return obj.save()
def _archive_url(obj):
return obj.archive_url
def _json(obj):
return obj.JSON
def _oldest(obj):
return obj.oldest()
def _newest(obj):
return obj.newest()
def _total_archives(obj):
return obj.total_archives()
def _near(obj, args):
_near_args = {}
if args.year:
_near_args["year"] = args.year
if args.month:
_near_args["month"] = args.month
if args.day:
_near_args["day"] = args.day
if args.hour:
_near_args["hour"] = args.hour
if args.minute:
_near_args["minute"] = args.minute
return obj.near(**_near_args)
def _save_urls_on_file(input_list, live_url_count):
m = re.search("https?://([A-Za-z_0-9.-]+).*", input_list[0])
if m:
domain = m.group(1)
else:
domain = "domain-unknown"
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
file_name = "%s-%d-urls-%s.txt" % (domain, live_url_count, uid)
file_content = "\n".join(input_list)
file_path = os.path.join(os.getcwd(), file_name)
with open(file_path, "w+") as f:
f.write(file_content)
return "%s\n\n'%s' saved in current working directory" % (file_content, file_name)
def _known_urls(obj, args):
"""Abbreviations:
sd = subdomain
al = alive
@click.command()
@click.option(
"-u", "--url", help="URL on which Wayback machine operations are to be performed."
)
@click.option(
"-ua",
"--user-agent",
"--user_agent",
default=DEFAULT_USER_AGENT,
help="User agent, default user agent is '%s' " % DEFAULT_USER_AGENT,
)
@click.option(
"-v", "--version", is_flag=True, default=False, help="Print waybackpy version."
)
@click.option(
"-n",
"--newest",
"-au",
"--archive_url",
"--archive-url",
default=False,
is_flag=True,
help="Fetch the newest archive of the specified URL",
)
@click.option(
"-o",
"--oldest",
default=False,
is_flag=True,
help="Fetch the oldest archive of the specified URL",
)
@click.option(
"-j",
"--json",
default=False,
is_flag=True,
help="Spit out the JSON data for availability_api commands.",
)
@click.option(
"-N", "--near", default=False, is_flag=True, help="Archive near specified time."
)
@click.option("-Y", "--year", type=click.IntRange(1994, 9999), help="Year in integer.")
@click.option("-M", "--month", type=click.IntRange(1, 12), help="Month in integer.")
@click.option("-D", "--day", type=click.IntRange(1, 31), help="Day in integer.")
@click.option("-H", "--hour", type=click.IntRange(0, 23), help="Hour in integer.")
@click.option("-MIN", "--minute", type=click.IntRange(0, 59), help="Minute in integer.")
@click.option(
"-s",
"--save",
default=False,
is_flag=True,
help="Save the specified URL's webpage and print the archive URL.",
)
@click.option(
"-h",
"--headers",
default=False,
is_flag=True,
help="Spit out the headers data for save_api commands.",
)
@click.option(
"-ku",
"--known-urls",
"--known_urls",
default=False,
is_flag=True,
help="List known URLs. Uses CDX API.",
)
@click.option(
"-sub",
"--subdomain",
default=False,
is_flag=True,
help="Use with '--known_urls' to include known URLs for subdomains.",
)
@click.option(
"-f",
"--file",
default=False,
is_flag=True,
help="Use with '--known_urls' to save the URLs in file at current directory.",
)
@click.option(
"-c",
"--cdx",
default=False,
is_flag=True,
help="Query the CDX API and print the matching snapshots.",
)
@click.option(
"-st",
"--start-timestamp",
"--start_timestamp",
)
@click.option(
"-et",
"--end-timestamp",
"--end_timestamp",
)
@click.option(
"-f",
"--filters",
multiple=True,
)
@click.option(
"-mt",
"--match-type",
"--match_type",
)
@click.option(
"-gz",
"--gzip",
)
@click.option(
"-c",
"--collapses",
multiple=True,
)
@click.option(
"-l",
"--limit",
)
@click.option(
"-cp",
"--cdx-print",
"--cdx_print",
multiple=True,
)
def main(
url,
user_agent,
version,
newest,
oldest,
json,
near,
year,
month,
day,
hour,
minute,
save,
headers,
known_urls,
subdomain,
file,
cdx,
start_timestamp,
end_timestamp,
filters,
match_type,
gzip,
collapses,
limit,
cdx_print,
):
"""
sd = False
al = False
if args.subdomain:
sd = True
if args.alive:
al = True
url_list = obj.known_urls(alive=al, subdomain=sd)
total_urls = len(url_list)
┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
if total_urls > 0:
text = _save_urls_on_file(url_list, total_urls)
else:
text = "No known URLs found. Please try a different domain!"
waybackpy : Python package & CLI tool that interfaces the Wayback Machine API
return text
Released under the MIT License.
License @ https://github.com/akamhy/waybackpy/blob/master/LICENSE
Copyright (c) 2020 waybackpy contributors. Contributors list @
https://github.com/akamhy/waybackpy/graphs/contributors
def _get(obj, args):
if args.get.lower() == "url":
output = obj.get()
https://github.com/akamhy/waybackpy
elif args.get.lower() == "archive_url":
output = obj.get(obj.archive_url)
https://pypi.org/project/waybackpy
elif args.get.lower() == "oldest":
output = obj.get(obj.oldest())
"""
elif args.get.lower() == "latest" or args.get.lower() == "newest":
output = obj.get(obj.newest())
if version:
click.echo("waybackpy version %s" % __version__)
return
elif args.get.lower() == "save":
output = obj.get(obj.save())
if not url:
click.echo("No URL detected. Please pass a URL.")
return
else:
output = "Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
\n3) oldest - get the source code of the oldest archive for the supplied url.\
\n4) newest - get the source code of the newest archive for the supplied url.\
\n5) save - Create a new archive and get the source code of this new archive for the supplied url."
def echo_availability_api(availability_api_instance):
click.echo("Archive URL:")
if not availability_api_instance.archive_url:
archive_url = (
"NO ARCHIVE FOUND - The requested URL is probably "
+ "not yet archived or if the URL was recently archived then it is "
+ "not yet available via the Wayback Machine's availability API "
+ "because of database lag and should be available after some time."
)
else:
archive_url = availability_api_instance.archive_url
click.echo(archive_url)
if json:
click.echo("JSON response:")
click.echo(JSON.dumps(availability_api_instance.JSON))
return output
availability_api = WaybackMachineAvailabilityAPI(url, user_agent=user_agent)
if oldest:
availability_api.oldest()
echo_availability_api(availability_api)
return
def args_handler(args):
if args.version:
return "waybackpy version %s" % __version__
if newest:
availability_api.newest()
echo_availability_api(availability_api)
return
if not args.url:
return (
"waybackpy %s \nSee 'waybackpy --help' for help using this tool."
% __version__
if near:
near_args = {}
keys = ["year", "month", "day", "hour", "minute"]
args_arr = [year, month, day, hour, minute]
for key, arg in zip(keys, args_arr):
if arg:
near_args[key] = arg
availability_api.near(**near_args)
echo_availability_api(availability_api)
return
if save:
save_api = WaybackMachineSaveAPI(url, user_agent=user_agent)
save_api.save()
click.echo("Archive URL:")
click.echo(save_api.archive_url)
click.echo("Cached save:")
click.echo(save_api.cached_save)
if headers:
click.echo("Save API headers:")
click.echo(save_api.headers)
return
def save_urls_on_file(url_gen):
domain = None
sys_random = random.SystemRandom()
uid = "".join(
sys_random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url_count = 0
for url in url_gen:
url_count += 1
if not domain:
match = re.search("https?://([A-Za-z_0-9.-]+).*", url)
domain = "domain-unknown"
if match:
domain = match.group(1)
file_name = "{domain}-urls-{uid}.txt".format(domain=domain, uid=uid)
file_path = os.path.join(os.getcwd(), file_name)
if not os.path.isfile(file_path):
open(file_path, "w+").close()
with open(file_path, "a") as f:
f.write("{url}\n".format(url=url))
click.echo(url)
if url_count > 0:
click.echo(
"\n\n'{file_name}' saved in current working directory".format(
file_name=file_name
)
)
else:
click.echo("No known URLs found. Please try a different input!")
if known_urls:
wayback = Url(url, user_agent)
url_gen = wayback.known_urls(subdomain=subdomain)
if file:
return save_urls_on_file(url_gen)
else:
for url in url_gen:
click.echo(url)
if cdx:
filters = list(filters)
collapses = list(collapses)
cdx_print = list(cdx_print)
cdx_api = WaybackMachineCDXServerAPI(
url,
user_agent=user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
filters=filters,
match_type=match_type,
gzip=gzip,
collapses=collapses,
limit=limit,
)
obj = Url(args.url)
if args.user_agent:
obj = Url(args.url, args.user_agent)
snapshots = cdx_api.snapshots()
if args.save:
output = _save(obj)
elif args.archive_url:
output = _archive_url(obj)
elif args.json:
output = _json(obj)
elif args.oldest:
output = _oldest(obj)
elif args.newest:
output = _newest(obj)
elif args.known_urls:
output = _known_urls(obj, args)
elif args.total:
output = _total_archives(obj)
elif args.near:
output = _near(obj, args)
elif args.get:
output = _get(obj, args)
else:
output = (
"You only specified the URL. But you also need to specify the operation."
"\nSee 'waybackpy --help' for help using this tool."
)
return output
def parse_args(argv):
parser = argparse.ArgumentParser()
requiredArgs = parser.add_argument_group("URL argument (required)")
requiredArgs.add_argument(
"--url", "-u", help="URL on which Wayback machine operations would occur"
)
userAgentArg = parser.add_argument_group("User Agent")
help_text = 'User agent, default user_agent is "waybackpy python package - https://github.com/akamhy/waybackpy"'
userAgentArg.add_argument("--user_agent", "-ua", help=help_text)
saveArg = parser.add_argument_group("Create new archive/save URL")
saveArg.add_argument(
"--save", "-s", action="store_true", help="Save the URL on the Wayback machine"
)
auArg = parser.add_argument_group("Get the latest Archive")
auArg.add_argument(
"--archive_url",
"-au",
action="store_true",
help="Get the latest archive URL, alias for --newest",
)
jsonArg = parser.add_argument_group("Get the JSON data")
jsonArg.add_argument(
"--json",
"-j",
action="store_true",
help="JSON data of the availability API request",
)
oldestArg = parser.add_argument_group("Oldest archive")
oldestArg.add_argument(
"--oldest",
"-o",
action="store_true",
help="Oldest archive for the specified URL",
)
newestArg = parser.add_argument_group("Newest archive")
newestArg.add_argument(
"--newest",
"-n",
action="store_true",
help="Newest archive for the specified URL",
)
totalArg = parser.add_argument_group("Total number of archives")
totalArg.add_argument(
"--total",
"-t",
action="store_true",
help="Total number of archives for the specified URL",
)
getArg = parser.add_argument_group("Get source code")
getArg.add_argument(
"--get",
"-g",
help="Prints the source code of the supplied url. Use '--get help' for extended usage",
)
knownUrlArg = parser.add_argument_group(
"URLs known and archived to Wayback Machine for the site."
)
knownUrlArg.add_argument(
"--known_urls", "-ku", action="store_true", help="URLs known for the domain."
)
help_text = "Use with '--known_urls' to include known URLs for subdomains."
knownUrlArg.add_argument("--subdomain", "-sub", action="store_true", help=help_text)
help_text = "Only include live URLs. Will not include dead links."
knownUrlArg.add_argument("--alive", "-a", action="store_true", help=help_text)
nearArg = parser.add_argument_group("Archive close to time specified")
nearArg.add_argument(
"--near", "-N", action="store_true", help="Archive near specified time"
)
nearArgs = parser.add_argument_group("Arguments that are used only with --near")
nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
nearArgs.add_argument("--hour", "-H", type=int, help="Hour in integer")
nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")
parser.add_argument(
"--version", "-v", action="store_true", help="Waybackpy version"
)
return parser.parse_args(argv[1:])
def main(argv=None):
if argv is None:
argv = sys.argv
args = parse_args(argv)
output = args_handler(args)
print(output)
for snapshot in snapshots:
if len(cdx_print) == 0:
click.echo(snapshot)
else:
output_string = ""
if any(k in cdx_print for k in ("urlkey", "url-key", "url_key")):
output_string = output_string + snapshot.urlkey + " "
if any(k in cdx_print for k in ("timestamp", "time-stamp", "time_stamp")):
output_string = output_string + snapshot.timestamp + " "
if "original" in cdx_print:
output_string = output_string + snapshot.original + " "
if any(k in cdx_print for k in ("mimetype", "mime-type", "mime_type")):
output_string = output_string + snapshot.mimetype + " "
if any(k in cdx_print for k in ("statuscode", "status-code", "status_code")):
output_string = output_string + snapshot.statuscode + " "
if "digest" in cdx_print:
output_string = output_string + snapshot.digest + " "
if "length" in cdx_print:
output_string = output_string + snapshot.length + " "
if any(k in cdx_print for k in ("archiveurl", "archive-url", "archive_url")):
output_string = output_string + snapshot.archive_url + " "
click.echo(output_string)
if __name__ == "__main__":
sys.exit(main(sys.argv))
main()
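Selecting which snapshot fields to print from a list of `--cdx-print` values requires testing each alias against the list individually; an expression of the form `"x" or "y" in items` is always truthy because the non-empty string `"x"` is evaluated first. A minimal sketch of an alias-table approach follows. `FIELD_ALIASES` and `format_snapshot` are hypothetical names, and a `SimpleNamespace` with made-up values stands in for a snapshot object.

```python
from types import SimpleNamespace

# Attribute name on the snapshot -> spellings accepted on the command line.
FIELD_ALIASES = {
    "urlkey": ("urlkey", "url-key", "url_key"),
    "timestamp": ("timestamp", "time-stamp", "time_stamp"),
    "original": ("original",),
    "mimetype": ("mimetype", "mime-type", "mime_type"),
    "statuscode": ("statuscode", "status-code", "status_code"),
    "digest": ("digest",),
    "length": ("length",),
    "archive_url": ("archiveurl", "archive-url", "archive_url"),
}


def format_snapshot(snapshot, cdx_print):
    wanted = set(cdx_print)
    parts = [
        str(getattr(snapshot, attribute))
        for attribute, aliases in FIELD_ALIASES.items()
        if wanted.intersection(aliases)  # true only if some alias was requested
    ]
    return " ".join(parts)


snap = SimpleNamespace(
    urlkey="k", timestamp="20200101000000", original="o", mimetype="m",
    statuscode="200", digest="d", length="1", archive_url="a",
)
print(format_snapshot(snap, ["time-stamp", "archive_url"]))
```

Fields are emitted in the order of the table, not in the order they were requested.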


@ -1,9 +1,22 @@
# -*- coding: utf-8 -*-
"""
waybackpy.exceptions
~~~~~~~~~~~~~~~~~~~
This module contains the set of Waybackpy's exceptions.
"""
class WaybackError(Exception):
"""
Raised when Wayback Machine API Service is unreachable/down.
Raised when Waybackpy can not return what you asked for.
1) Wayback Machine API Service is unreachable/down.
2) You passed illegal arguments.
"""
class RedirectSaveError(WaybackError):
"""
Raised when the original URL is redirected and the
redirect URL is archived but not the original URL.
"""
@@ -11,3 +24,15 @@ class URLError(Exception):
"""
Raised when malformed URLs are passed as arguments.
"""
class MaximumRetriesExceeded(WaybackError):
"""
Raised when the maximum number of retries is exceeded.
"""
class MaximumSaveRetriesExceeded(MaximumRetriesExceeded):
"""
Raised when the maximum number of save retries is exceeded.
"""

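Because the new exceptions derive from `WaybackError`, a caller that already catches `WaybackError` also catches the retry-related errors. The following is a minimal sketch of that behaviour; the three classes are re-declared locally to keep the example self-contained, and `failing_save` and `describe_failure` are hypothetical stand-ins rather than waybackpy functions.

```python
class WaybackError(Exception):
    """Base error, mirroring the class shown in the diff."""


class MaximumRetriesExceeded(WaybackError):
    pass


class MaximumSaveRetriesExceeded(MaximumRetriesExceeded):
    pass


def failing_save():
    # Hypothetical stand-in for a save attempt that ran out of retries.
    raise MaximumSaveRetriesExceeded("Tried 8 times but failed to save.")


def describe_failure():
    try:
        failing_save()
    except WaybackError as error:
        # The base class also catches the most specific subclass.
        return type(error).__name__


print(describe_failure())  # MaximumSaveRetriesExceeded
```
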
131
waybackpy/save_api.py Normal file

@@ -0,0 +1,131 @@
import re
import time
import requests
from datetime import datetime
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .utils import DEFAULT_USER_AGENT
from .exceptions import MaximumSaveRetriesExceeded
class WaybackMachineSaveAPI:
"""
WaybackMachineSaveAPI class provides an interface for saving URLs on the
Wayback Machine.
"""
def __init__(self, url, user_agent=DEFAULT_USER_AGENT, max_tries=8):
self.url = str(url).strip().replace(" ", "%20")
self.request_url = "https://web.archive.org/save/" + self.url
self.user_agent = user_agent
self.request_headers = {"User-Agent": self.user_agent}
self.max_tries = max_tries
self.total_save_retries = 5
self.backoff_factor = 0.5
self.status_forcelist = [500, 502, 503, 504]
self._archive_url = None
self.instance_birth_time = datetime.utcnow()
@property
def archive_url(self):
if self._archive_url:
return self._archive_url
else:
return self.save()
def get_save_request_headers(self):
session = requests.Session()
retries = Retry(
total=self.total_save_retries,
backoff_factor=self.backoff_factor,
status_forcelist=self.status_forcelist,
)
session.mount("https://", HTTPAdapter(max_retries=retries))
self.response = session.get(self.request_url, headers=self.request_headers)
self.headers = self.response.headers
self.status_code = self.response.status_code
self.response_url = self.response.url
def archive_url_parser(self):
regex1 = r"Content-Location: (/web/[0-9]{14}/.*)"
match = re.search(regex1, str(self.headers))
if match:
return "https://web.archive.org" + match.group(1)
regex2 = r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>"
match = re.search(regex2, str(self.headers))
if match:
return "https://" + match.group(1)
regex3 = r"X-Cache-Key:\shttps(.*)[A-Z]{2}"
match = re.search(regex3, str(self.headers))
if match:
return "https://" + match.group(1)
if self.response_url:
self.response_url = self.response_url.strip()
if "web.archive.org/web" in self.response_url:
regex = r"web\.archive\.org/web/(?:[0-9]*?)/(?:.*)$"
match = re.search(regex, self.response_url)
if match:
return "https://" + match.group(0)
def sleep(self, tries):
sleep_seconds = 5
if tries % 3 == 0:
sleep_seconds = 10
time.sleep(sleep_seconds)
def timestamp(self):
m = re.search(
r"https?://web\.archive\.org/web/([0-9]{14})/http", self._archive_url
)
string_timestamp = m.group(1)
timestamp = datetime.strptime(string_timestamp, "%Y%m%d%H%M%S")
timestamp_unixtime = time.mktime(timestamp.timetuple())
instance_birth_time_unixtime = time.mktime(self.instance_birth_time.timetuple())
if timestamp_unixtime < instance_birth_time_unixtime:
self.cached_save = True
else:
self.cached_save = False
return timestamp
def save(self):
saved_archive = None
tries = 0
while True:
tries += 1
if tries >= self.max_tries:
raise MaximumSaveRetriesExceeded(
"Tried %s times but failed to save and return the archive for %s.\nResponse URL:\n%s \nResponse Header:\n%s\n"
% (str(tries), self.url, self.response_url, str(self.headers)),
)
if not saved_archive:
if tries > 1:
self.sleep(tries)
self.get_save_request_headers()
saved_archive = self.archive_url_parser()
if not saved_archive:
continue
else:
self._archive_url = saved_archive
self.timestamp()
return saved_archive
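
The cached-save check in `timestamp()` compares the 14-digit capture time embedded in the archive URL with the moment the API object was created. Below is a minimal standalone sketch of that idea. It assumes naive UTC datetimes on both sides and compares them directly rather than through `time.mktime`. The names `archive_timestamp` and `is_cached_save` are hypothetical, and the URL and times are made-up example values.

```python
import re
from datetime import datetime

# Matches the 14-digit capture timestamp in a Wayback Machine archive URL.
ARCHIVE_TIMESTAMP = re.compile(r"https?://web\.archive\.org/web/([0-9]{14})/")


def archive_timestamp(archive_url):
    """Parse the capture time out of an archive URL, or return None."""
    match = ARCHIVE_TIMESTAMP.search(archive_url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def is_cached_save(archive_url, instance_birth_time):
    """True when the capture predates the creation of the API object."""
    captured_at = archive_timestamp(archive_url)
    return captured_at is not None and captured_at < instance_birth_time


# Made-up example values.
url = "https://web.archive.org/web/20220118151817/https://example.com/"
born = datetime(2022, 1, 18, 15, 30, 0)

print(archive_timestamp(url))     # 2022-01-18 15:18:17
print(is_cached_save(url, born))  # True
```

Returning `None` for a URL without a timestamp avoids the `AttributeError` that calling `.group(1)` on a failed match would raise.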

11
waybackpy/utils.py Normal file

@@ -0,0 +1,11 @@
import requests
from .__version__ import __version__
DEFAULT_USER_AGENT = "waybackpy %s - https://github.com/akamhy/waybackpy" % __version__
def latest_version(package_name, headers):
request_url = "https://pypi.org/pypi/" + package_name + "/json"
response = requests.get(request_url, headers=headers)
data = response.json()
return data["info"]["version"]
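
`latest_version` reads the version from the `info.version` field of the JSON document returned by PyPI. The lookup can be exercised offline, as in the sketch below; `SAMPLE_BODY` is a heavily trimmed, made-up stand-in for a real response, and `extract_version` is a hypothetical helper name.

```python
import json

# Hypothetical, heavily trimmed stand-in for the JSON body PyPI returns.
SAMPLE_BODY = '{"info": {"name": "waybackpy", "version": "3.0.0"}}'


def extract_version(body):
    """Return the version string from a PyPI-style JSON document."""
    data = json.loads(body)
    return data["info"]["version"]


print(extract_version(SAMPLE_BODY))  # 3.0.0
```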


@@ -1,272 +1,117 @@
# -*- coding: utf-8 -*-
import re
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI
from .utils import DEFAULT_USER_AGENT
from .exceptions import WaybackError
from datetime import datetime, timedelta
from waybackpy.exceptions import WaybackError, URLError
from waybackpy.__version__ import __version__
import requests
import concurrent.futures
default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"
def _archive_url_parser(header):
"""Parse out the archive from header."""
# Regex1
arch = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
if arch:
return "web.archive.org" + arch.group(1)
# Regex2
arch = re.search(
r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
)
if arch:
return arch.group(1)
# Regex3
arch = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
if arch:
return arch.group(1)
raise WaybackError(
"No archive URL found in the API response. "
"This version of waybackpy (%s) is likely out of date. Visit "
"https://github.com/akamhy/waybackpy for the latest version "
"of waybackpy.\nHeader:\n%s" % (__version__, str(header))
)
def _wayback_timestamp(**kwargs):
"""Return a formatted timestamp."""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
def _get_response(endpoint, params=None, headers=None):
"""Get response for the supplied request."""
try:
response = requests.get(endpoint, params=params, headers=headers)
except Exception:
try:
response = requests.get(endpoint, params=params, headers=headers) # nosec
except Exception as e:
exc = WaybackError("Error while retrieving %s" % endpoint)
exc.__cause__ = e
raise exc
return response
class Url:
"""waybackpy Url object"""
def __init__(self, url, user_agent=default_UA):
def __init__(self, url, user_agent=DEFAULT_USER_AGENT):
self.url = url
self.user_agent = user_agent
self._url_check() # checks url validity on init.
self.archive_url = self._archive_url() # URL of archive
self.timestamp = self._archive_timestamp() # timestamp for last archive
self._alive_url_list = []
def __repr__(self):
return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)
self.user_agent = str(user_agent)
self.archive_url = None
self.wayback_machine_availability_api = WaybackMachineAvailabilityAPI(
self.url, user_agent=self.user_agent
)
def __str__(self):
return "%s" % self.archive_url
if not self.archive_url:
self.newest()
return self.archive_url
def __len__(self):
td_max = timedelta(
days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)
if not self.timestamp:
self.oldest()
if self.timestamp == datetime.max:
return td_max.days
diff = datetime.utcnow() - self.timestamp
return diff.days
def _url_check(self):
"""Check for common URL problems."""
if "." not in self.url:
raise URLError("'%s' is not a valid URL." % self.url)
@property
def JSON(self):
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": "%s" % self.user_agent}
payload = {"url": "%s" % self._clean_url()}
response = _get_response(endpoint, params=payload, headers=headers)
return response.json()
def _archive_url(self):
"""Get URL of archive."""
data = self.JSON
if not data["archived_snapshots"]:
archive_url = None
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
return archive_url
def _archive_timestamp(self):
"""Get timestamp of last archive."""
data = self.JSON
if not data["archived_snapshots"]:
time = datetime.max
else:
time = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
return time
def _clean_url(self):
"""Fix the URL, if possible."""
return str(self.url).strip().replace(" ", "_")
return (datetime.utcnow() - self.timestamp).days
def save(self):
"""Create a new Wayback Machine archive for this URL."""
request_url = "https://web.archive.org/save/" + self._clean_url()
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
self.archive_url = "https://" + _archive_url_parser(response.headers)
self.timestamp = datetime.utcnow()
self.wayback_machine_save_api = WaybackMachineSaveAPI(
self.url, user_agent=self.user_agent
)
self.archive_url = self.wayback_machine_save_api.archive_url
self.timestamp = self.wayback_machine_save_api.timestamp()
self.headers = self.wayback_machine_save_api.headers
return self
def get(self, url="", user_agent="", encoding=""):
"""Return the source code of the supplied URL.
If encoding is not supplied, it is auto-detected from the response.
"""
def near(
self,
year=None,
month=None,
day=None,
hour=None,
minute=None,
unix_timestamp=None,
):
if not url:
url = self._clean_url()
if not user_agent:
user_agent = self.user_agent
headers = {"User-Agent": "%s" % user_agent}
response = _get_response(url, params=None, headers=headers)
if not encoding:
try:
encoding = response.encoding
except AttributeError:
encoding = "UTF-8"
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
def near(self, year=None, month=None, day=None, hour=None, minute=None):
"""Return the closest Wayback Machine archive to the time supplied.
Supported params are year, month, day, hour and minute.
Any non-supplied parameters default to the current time.
"""
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
self.wayback_machine_availability_api.near(
year=year,
month=month,
day=day,
hour=hour,
minute=minute,
unix_timestamp=unix_timestamp,
)
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": "%s" % self.user_agent}
payload = {"url": "%s" % self._clean_url(), "timestamp": timestamp}
response = _get_response(endpoint, params=payload, headers=headers)
data = response.json()
if not data["archived_snapshots"]:
raise WaybackError(
"Cannot find an archive for '%s'; try later or use waybackpy.Url(url, user_agent).save() "
"to create a new archive." % self._clean_url()
)
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self.archive_url = archive_url
self.timestamp = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
self.set_availability_api_attrs()
return self
def oldest(self, year=1994):
"""Return the oldest Wayback Machine archive for this URL."""
return self.near(year=year)
def oldest(self):
self.wayback_machine_availability_api.oldest()
self.set_availability_api_attrs()
return self
def newest(self):
"""Return the newest Wayback Machine archive available for this URL.
self.wayback_machine_availability_api.newest()
self.set_availability_api_attrs()
return self
Due to Wayback Machine database lag, this may not always be the
most recent archive.
"""
return self.near()
def set_availability_api_attrs(self):
self.archive_url = self.wayback_machine_availability_api.archive_url
self.JSON = self.wayback_machine_availability_api.JSON
self.timestamp = self.wayback_machine_availability_api.timestamp()
def total_archives(self):
"""Returns the total number of Wayback Machine archives for this URL."""
def total_archives(self, start_timestamp=None, end_timestamp=None):
cdx = WaybackMachineCDXServerAPI(
self.url,
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
)
endpoint = "https://web.archive.org/cdx/search/cdx"
headers = {
"User-Agent": "%s" % self.user_agent,
"output": "json",
"fl": "statuscode",
}
payload = {"url": "%s" % self._clean_url()}
response = _get_response(endpoint, params=payload, headers=headers)
# Most efficient method to count number of archives (yet)
return response.text.count(",")
def pick_live_urls(self, url):
try:
response_code = requests.get(url).status_code
except Exception:
return # we don't care if urls are not opening
# 2xx responses are OK and 3xx are usually redirects; to exclude redirects, replace 400 with 300.
if response_code >= 400:
return
self._alive_url_list.append(url)
def known_urls(self, alive=False, subdomain=False):
"""Return the list of URLs known to exist for the given domain name
because these URLs were crawled by Wayback Machine bots.
Useful for pen-testers and others.
Idea by Mohammed Diaa (https://github.com/mhmdiaa) from:
https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
"""
url_list = []
count = 0
for _ in cdx.snapshots():
count = count + 1
return count
def known_urls(
self,
subdomain=False,
host=False,
start_timestamp=None,
end_timestamp=None,
match_type="prefix",
):
if subdomain:
request_url = (
"https://web.archive.org/cdx/search/cdx?url=*.%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
else:
request_url = (
"http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
match_type = "domain"
if host:
match_type = "host"
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
data = response.json()
url_list = [y[0] for y in data if y[0] != "original"]
cdx = WaybackMachineCDXServerAPI(
self.url,
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
match_type=match_type,
collapses=["urlkey"],
)
# Remove all deadURLs from url_list if alive=True
if alive:
with concurrent.futures.ThreadPoolExecutor() as executor:
executor.map(self.pick_live_urls, url_list)
url_list = self._alive_url_list
return url_list
for snapshot in cdx.snapshots():
yield (snapshot.original)
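
In the rewritten wrapper, `total_archives` counts snapshots by iterating over them and `known_urls` becomes a generator, so callers that previously received a list now need to iterate or wrap the result in `list()`. The sketch below shows that pattern. `FakeCDX` is a hypothetical stand-in that only provides a `snapshots()` method and is not waybackpy's real CDX class; the URLs and timestamps are made up.

```python
from collections import namedtuple

Snapshot = namedtuple("Snapshot", ["original", "timestamp"])


class FakeCDX:
    """Hypothetical stand-in for a CDX client; not waybackpy's real class."""

    def __init__(self, rows):
        self.rows = rows

    def snapshots(self):
        for original, timestamp in self.rows:
            yield Snapshot(original, timestamp)


def total_archives(cdx):
    # Count lazily without building a list of every snapshot.
    return sum(1 for _ in cdx.snapshots())


def known_urls(cdx):
    # A generator: nothing is produced until the caller iterates.
    for snapshot in cdx.snapshots():
        yield snapshot.original


cdx = FakeCDX(
    [
        ("https://example.com/", "20200101000000"),
        ("https://example.com/about", "20210101000000"),
    ]
)

print(total_archives(cdx))    # 2
print(list(known_urls(cdx)))  # ['https://example.com/', 'https://example.com/about']
```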