Compare commits

16 Commits:

- 3e3ecff9df
- ce64135ba8
- 2af6580ffb
- 8a3c515176
- d98c4f32ad
- e0a4b007d5
- 6fb6b2deee
- 1882862992
- 0c6107e675
- bd079978bf
- 5dec4927cd
- 62e5217b9e
- 9823c809e9
- db5737a857
- ca0821a466
- bb4dbc7d3c

CONTRIBUTORS.md (new file, +7)
@@ -0,0 +1,7 @@
## AUTHORS

- akamhy (<https://github.com/akamhy>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)

## ACKNOWLEDGEMENTS

- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
README.md (493 lines changed)

@@ -1,64 +1,26 @@
<div align="center">
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>

<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>

<h2>Python package & CLI tool that interfaces with the Wayback Machine API</h2>

</div>

-----------------

<p align="center">
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ACI"><img alt="Build Status" src="https://github.com/akamhy/waybackpy/workflows/CI/badge.svg"></a>
<a href="https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade"><img alt="Codacy Badge" src="https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65"></a>
<a href="https://codecov.io/gh/akamhy/waybackpy"><img alt="codecov" src="https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg"></a>
<a href="https://codeclimate.com/github/akamhy/waybackpy/maintainability"><img alt="Maintainability" src="https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability"></a>
<a href="https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md"><img alt="Contributions Welcome" src="https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square"></a>
</p>

## Python package & CLI tool that interfaces with the Wayback Machine API.
[](https://pypi.org/project/waybackpy/)
[](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[](https://github.com/akamhy/waybackpy/actions?query=workflow%3ACI)
[](https://codecov.io/gh/akamhy/waybackpy)
[](https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md)
[](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade)
[](https://pepy.tech/project/waybackpy)
[](https://github.com/akamhy/waybackpy/releases)
[](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[](https://www.python.org/)
[](https://github.com/akamhy/waybackpy/graphs/commit-activity)
[](https://github.com/akamhy/waybackpy/commits/master)

-----------------------------------------------------------------------------------------------------------------------------------------------
Table of contents
=================
<!--ts-->

* [Installation](#installation)

* [Usage](#usage)
  * [As a Python package](#as-a-python-package)
    * [Saving a webpage](#capturing-aka-saving-an-url-using-save)
    * [Retrieving archive](#retrieving-the-archive-for-an-url-using-archive_url)
    * [Retrieving the oldest archive](#retrieving-the-oldest-archive-for-an-url-using-oldest)
    * [Retrieving the latest/newest archive](#retrieving-the-newest-archive-for-an-url-using-newest)
    * [Retrieving the JSON response of availability API](#retrieving-the-json-response-for-the-availability-api-request)
    * [Retrieving archive close to a specified year, month, day, hour, and minute](#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
    * [Get the content of webpage](#get-the-content-of-webpage-using-get)
    * [Count total archives for an URL](#count-total-archives-for-an-url-using-total_archives)
    * [List of URLs that Wayback Machine knows and has archived for a domain name](#list-of-urls-that-wayback-machine-knows-and-has-archived-for-a-domain-name)

  * [With the Command-line interface](#with-the-command-line-interface)
    * [Saving webpage](#save)
    * [Archive URL](#get-archive-url)
    * [Oldest archive URL](#oldest-archive)
    * [Newest archive URL](#newest-archive)
    * [JSON response of API](#get-json-data-of-availability-api)
    * [Total archives](#total-number-of-archives)
    * [Archive near specified time](#archive-near-time)
    * [Get the source code](#get-the-source-code)
    * [Fetch all the URLs that the Wayback Machine knows for a domain](#fetch-all-the-urls-that-the-wayback-machine-knows-for-a-domain)

* [Tests](#tests)

* [Packaging](#packaging)

* [License](#license)

<!--te-->

## Installation
### Installation

Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
@@ -66,387 +28,76 @@ Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):

```bash
pip install waybackpy
```

or direct from this repository using git.
Install directly from GitHub:

```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
## Usage

### As a Python package

#### Capturing aka Saving an URL using save()
### Usage

#### As a python package

```python
import waybackpy
>>> import waybackpy

url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"

waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive = waybackpy_url_obj.save()
print(archive)
>>> wayback = waybackpy.Url(url, user_agent)

>>> archive = wayback.save()
>>> str(archive)
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'

>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)

>>> oldest_archive = wayback.oldest()
>>> str(oldest_archive)
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'

>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> str(archive_close_to_2010_feb)
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'

>>> str(wayback.newest())
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
```
> Full Python package documentation can be found at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
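If the Wayback Machine is unreachable or refuses to archive, these calls raise waybackpy's own exceptions. Below is a minimal defensive-usage sketch; the `WaybackError` and `URLError` classes are confirmed by the `waybackpy/exceptions.py` and `wrapper.py` diffs later on this page, and everything else mirrors the example above:

```python
# Minimal sketch of defensive usage (editor's illustration, not from the README).
import waybackpy
from waybackpy.exceptions import WaybackError, URLError

url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
user_agent = "my-unique-user-agent"

try:
    wayback = waybackpy.Url(url, user_agent)
    archive = wayback.save()
    print(str(archive))
except URLError:
    # Raised on malformed URLs (e.g. a URL without a '.' in it).
    print("Invalid URL: %s" % url)
except WaybackError as e:
    # Raised when the Wayback Machine API is unreachable or no archive URL is found.
    print("Archiving failed: %s" % e)
```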
#### As a CLI tool
```bash
https://web.archive.org/web/20201016171808/https://en.wikipedia.org/wiki/Multivariable_calculus
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPySaveExample></sub>
#### Retrieving the archive for an URL using archive_url

```python
import waybackpy

url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"

waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive_url = waybackpy_url_obj.archive_url
print(archive_url)
```

```bash
https://web.archive.org/web/20201016153320/https://www.google.com/
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyArchiveUrl></sub>

#### Retrieving the oldest archive for an URL using oldest()

```python
import waybackpy

url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"

waybackpy_url_obj = waybackpy.Url(url, user_agent)
oldest_archive_url = waybackpy_url_obj.oldest()
print(oldest_archive_url)
```

```bash
http://web.archive.org/web/19981111184551/http://google.com:80/
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyOldestExample></sub>

#### Retrieving the newest archive for an URL using newest()

```python
import waybackpy

url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"

waybackpy_url_obj = waybackpy.Url(url, user_agent)
newest_archive_url = waybackpy_url_obj.newest()
print(newest_archive_url)
```

```bash
https://web.archive.org/web/20201016150543/https://www.facebook.com/
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNewestExample></sub>

#### Retrieving the JSON response for the availability API request

```python
import waybackpy

url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"

waybackpy_url_obj = waybackpy.Url(url, user_agent)
json_dict = waybackpy_url_obj.JSON
print(json_dict)
```

```javascript
{'url': 'https://www.facebook.com/', 'archived_snapshots': {'closest': {'available': True, 'url': 'http://web.archive.org/web/20201016150543/https://www.facebook.com/', 'timestamp': '20201016150543', 'status': '200'}}}
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyJSON></sub>
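Continuing from the snippet above, the closest snapshot's fields are plain dictionary lookups; the key names below are exactly those shown in the printed output:

```python
# Sketch (editor's illustration): extract the closest snapshot from json_dict above.
closest = json_dict["archived_snapshots"]["closest"]
print(closest["url"])        # archive URL of the closest snapshot
print(closest["timestamp"])  # 14-digit Wayback timestamp, e.g. '20201016150543'
```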
#### Retrieving archive close to a specified year, month, day, hour, and minute using near()

```python
from waybackpy import Url

user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
url = "https://github.com/"

waybackpy_url_obj = Url(url, user_agent)

# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. for January, set month = 1 and not month = 01.
```

```python
github_archive_near_2010 = waybackpy_url_obj.near(year=2010)
print(github_archive_near_2010)
```

```bash
https://web.archive.org/web/20101018053604/http://github.com:80/
```

```python
github_archive_near_2011_may = waybackpy_url_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)
```

```bash
https://web.archive.org/web/20110518233639/https://github.com/
```

```python
github_archive_near_2015_january_26 = waybackpy_url_obj.near(year=2015, month=1, day=26)
print(github_archive_near_2015_january_26)
```

```bash
https://web.archive.org/web/20150125102636/https://github.com/
```

```python
github_archive_near_2018_4_july_9_2_am = waybackpy_url_obj.near(year=2018, month=7, day=4, hour=9, minute=2)
print(github_archive_near_2018_4_july_9_2_am)
```

```bash
https://web.archive.org/web/20180704090245/https://github.com/
```

<sub>The package doesn't support a seconds argument yet. You are encouraged to create a PR ;)</sub>

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNearExample></sub>
#### Get the content of webpage using get()

```python
import waybackpy

google_url = "https://www.google.com/"

User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"

waybackpy_url_object = waybackpy.Url(google_url, User_Agent)


# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)


# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(waybackpy_url_object.save())
print(google_newest_archive_source)


# waybackpy_url_object.oldest() type is str, it's the oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(waybackpy_url_object.oldest())
print(google_oldest_archive_source)
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyGetExample#main.py></sub>

#### Count total archives for an URL using total_archives()

```python
import waybackpy

URL = "https://en.wikipedia.org/wiki/Python (programming language)"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"

waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)

archive_count = waybackpy_url_object.total_archives()

print(archive_count) # total_archives() returns an int
```

```bash
2516
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyTotalArchivesExample></sub>

#### List of URLs that Wayback Machine knows and has archived for a domain name

1) If alive=True is set, waybackpy will check all URLs to identify the alive URLs. Don't use this on popular websites like Google, as it would take too long.
2) To include URLs from subdomains, set subdomain=True.

```python
import waybackpy

URL = "akamhy.github.io"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"

waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)
known_urls = waybackpy_url_object.known_urls(alive=True, subdomain=False) # alive and subdomain are optional.
print(known_urls) # known_urls() returns a list of URLs
```

```bash
['http://akamhy.github.io',
'https://akamhy.github.io/waybackpy/',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py></sub>
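The CLI version shown later saves these URLs to a text file in the current working directory. From Python, a short sketch achieves the same, assuming only that `known_urls()` returns a list of URL strings as shown above; the output filename here is made up:

```python
# Sketch (editor's illustration): persist known URLs one per line, like the CLI does.
import waybackpy

URL = "akamhy.github.io"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"

known_urls = waybackpy.Url(url=URL, user_agent=UA).known_urls()
with open("akamhy.github.io-known-urls.txt", "w") as f:  # hypothetical filename
    f.write("\n".join(known_urls))
```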
### With the Command-line interface

#### Save

```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
$ waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media

$ waybackpy --oldest --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid

$ waybackpy --newest --url "https://en.wikipedia.org/wiki/Remote_sensing" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20201221130522/https://en.wikipedia.org/wiki/Remote_sensing

$ waybackpy --total --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent"
1904

$ waybackpy --known_urls --url akamhy.github.io --user_agent "my-unique-user-agent"
https://akamhy.github.io
https://akamhy.github.io/assets/js/scale.fix.js
https://akamhy.github.io/favicon.ico
https://akamhy.github.io/robots.txt
https://akamhy.github.io/waybackpy/

'akamhy.github.io-10-urls-m2a24y.txt' saved in current working directory
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashSave></sub>

#### Get archive URL

```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --archive_url
https://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashArchiveUrl></sub>

#### Oldest archive

```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashOldest></sub>

#### Newest archive

```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNewest></sub>

#### Get JSON data of availability API

```bash
waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --json
```

```javascript
{'archived_snapshots': {'closest': {'timestamp': '20201007132458', 'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX'}}, 'url': 'https://en.wikipedia.org/wiki/SpaceX'}
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashJSON></sub>

#### Total number of archives

```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
853
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashTotal></sub>

#### Archive near time

```bash
$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
https://web.archive.org/web/20120512142515/https://www.facebook.com/
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNear></sub>

#### Get the source code

```bash
waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the URL
waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on the Wayback Machine, then print the source code of this archive.
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashGet></sub>

#### Fetch all the URLs that the Wayback Machine knows for a domain

1) You can add the '--alive' flag to only fetch alive links.
2) You can add the '--subdomain' flag to add subdomains.
3) '--alive' and '--subdomain' flags can be used simultaneously.
4) All links will be saved in a file, and the file will be created in the current working directory.

```bash
pip install waybackpy

# Ignore the above installation line.

waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io


waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --alive
# Prints all known URLs under akamhy.github.io which are still working and not dead links.


waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io including subdomains


waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain --alive
# Prints all known URLs under akamhy.github.io including subdomains which are not dead links and still alive.
```

<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh></sub>
## Tests

To run tests locally:

1) Install or update the testing/coverage tools

```bash
pip install codecov pytest pytest-cov -U
```

2) Inside the repository run the following command

```bash
pytest --cov=waybackpy tests/
```

3) To report coverage run

```bash
bash <(curl -s https://codecov.io/bash) -t SECRET_CODECOV_TOKEN
```

You can find the tests [here](https://github.com/akamhy/waybackpy/tree/master/tests).

## Packaging

1. Increment version.

2. Build package ``python setup.py sdist bdist_wheel``.

3. Sign & upload the package ``twine upload -s dist/*``.

> Full CLI documentation can be found at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.

## License
[](https://github.com/akamhy/waybackpy/blob/master/LICENSE)

Released under the MIT License. See
[license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.

-----------------------------------------------------------------------------------------------------------------------------------------------
setup.py (4 lines changed)

@@ -19,7 +19,7 @@ setup(
    author=about["__author__"],
    author_email=about["__author_email__"],
    url=about["__url__"],
    download_url="https://github.com/akamhy/waybackpy/archive/2.3.2.tar.gz",
    download_url="https://github.com/akamhy/waybackpy/archive/2.3.3.tar.gz",
    keywords=[
        "Archive It",
        "Archive Website",
@@ -47,7 +47,7 @@ setup(
    ],
    entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},
    project_urls={
        "Documentation": "https://akamhy.github.io/waybackpy/",
        "Documentation": "https://github.com/akamhy/waybackpy/wiki",
        "Source": "https://github.com/akamhy/waybackpy",
        "Tracker": "https://github.com/akamhy/waybackpy/issues",
    },
tests/test_cli.py

@@ -1,7 +1,8 @@
# -*- coding: utf-8 -*-
import sys
import os
import pytest
import random
import string
import argparse

sys.path.append("..")
@@ -9,9 +10,6 @@ import waybackpy.cli as cli  # noqa: E402
from waybackpy.wrapper import Url  # noqa: E402
from waybackpy.__version__ import __version__

# Namespace(day=None, get=None, hour=None, minute=None, month=None, near=False,
# newest=False, oldest=False, save=False, total=False, url=None, user_agent=None, version=False, year=None)


def test_save():
    args = argparse.Namespace(
@@ -33,6 +31,25 @@ def test_save():
    reply = cli.args_handler(args)
    assert "pypi.org/user/akamhy" in str(reply)

    args = argparse.Namespace(
        user_agent=None,
        url="https://hfjfjfjfyu6r6rfjvj.fjhgjhfjgvjm",
        total=False,
        version=False,
        oldest=False,
        save=True,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "could happen because either your waybackpy" in str(reply)


def test_json():
    args = argparse.Namespace(
@@ -96,6 +113,29 @@ def test_oldest():
    reply = cli.args_handler(args)
    assert "pypi.org/user/akamhy" in str(reply)

    uid = "".join(
        random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
    )
    url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
    args = argparse.Namespace(
        user_agent=None,
        url=url,
        total=False,
        version=False,
        oldest=True,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "Can not find archive for" in str(reply)


def test_newest():
    args = argparse.Namespace(
@@ -118,6 +158,29 @@ def test_newest():
    reply = cli.args_handler(args)
    assert "pypi.org/user/akamhy" in str(reply)

    uid = "".join(
        random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
    )
    url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
    args = argparse.Namespace(
        user_agent=None,
        url=url,
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=True,
        near=False,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "Can not find archive for" in str(reply)


def test_total_archives():
    args = argparse.Namespace(
@@ -162,6 +225,26 @@ def test_known_urls():
    reply = cli.args_handler(args)
    assert "github" in str(reply)

    args = argparse.Namespace(
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
        (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
        url="https://akfyfufyjcujfufu6576r76r6amhy.gitd6r67r6u6hub.yfjyfjio",
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=False,
        alive=True,
        subdomain=True,
        known_urls=True,
        get=None,
    )
    reply = cli.args_handler(args)
    assert "No known URLs found" in str(reply)


def test_near():
    args = argparse.Namespace(
@@ -189,6 +272,34 @@ def test_near():
    reply = cli.args_handler(args)
    assert "202007" in str(reply)

    uid = "".join(
        random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
    )
    url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
    args = argparse.Namespace(
        user_agent=None,
        url=url,
        total=False,
        version=False,
        oldest=False,
        save=False,
        json=False,
        archive_url=False,
        newest=False,
        near=True,
        alive=False,
        subdomain=False,
        known_urls=False,
        get=None,
        year=2020,
        month=7,
        day=15,
        hour=1,
        minute=1,
    )
    reply = cli.args_handler(args)
    assert "Can not find archive for" in str(reply)


def test_get():
    args = argparse.Namespace(
@@ -286,7 +397,7 @@ def test_get():
        alive=False,
        subdomain=False,
        known_urls=False,
        get="BullShit",
        get="foobar",
    )
    reply = cli.args_handler(args)
    assert "get the source code of the" in str(reply)
tests/test_wrapper.py

@@ -1,8 +1,8 @@
# -*- coding: utf-8 -*-
import sys
import pytest
import random
import requests
from datetime import datetime

sys.path.append("..")

@@ -12,15 +12,23 @@ import waybackpy.wrapper as waybackpy  # noqa: E402
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"


def test_clean_url():
def test_cleaned_url():
    """No API use"""
    test_url = " https://en.wikipedia.org/wiki/Network security "
    answer = "https://en.wikipedia.org/wiki/Network_security"
    target = waybackpy.Url(test_url, user_agent)
    test_result = target._clean_url()
    test_result = target._cleaned_url()
    assert answer == test_result


def test_ts():
    a = waybackpy.Url("https://google.com", user_agent)
    ts = a._timestamp
    assert str(datetime.utcnow().year) in str(ts)


def test_dunders():
    """No API use"""
    url = "https://en.wikipedia.org/wiki/Network_security"
    user_agent = "UA"
    target = waybackpy.Url(url, user_agent)
@@ -28,22 +36,55 @@ def test_dunders():
    assert "en.wikipedia.org" in str(target)


def test_archive_url_parser():
    endpoint = "https://amazon.com"
    user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
    headers = {"User-Agent": "%s" % user_agent}
    response = waybackpy._get_response(endpoint, params=None, headers=headers)
    header = response.headers
    with pytest.raises(Exception):
        waybackpy._archive_url_parser(header)


def test_url_check():
    """No API Use"""
    broken_url = "http://wwwgooglecom/"
    with pytest.raises(Exception):
        waybackpy.Url(broken_url, user_agent)


def test_archive_url_parser():
    """No API Use"""
    perfect_header = """
    {'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:40:25 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Server': 'nginx', 'X-Archive-Orig-Date': 'Sat, 02 Jan 2021 09:40:09 GMT', 'X-Archive-Orig-Transfer-Encoding': 'chunked', 'X-Archive-Orig-Connection': 'keep-alive', 'X-Archive-Orig-Vary': 'Accept-Encoding', 'X-Archive-Orig-Last-Modified': 'Fri, 01 Jan 2021 12:19:00 GMT', 'X-Archive-Orig-Strict-Transport-Security': 'max-age=31536000, max-age=0;', 'X-Archive-Guessed-Content-Type': 'text/html', 'X-Archive-Guessed-Charset': 'utf-8', 'Memento-Datetime': 'Sat, 02 Jan 2021 09:40:09 GMT', 'Link': '<https://www.scribbr.com/citing-sources/et-al/>; rel="original", <https://web.archive.org/web/timemap/link/https://www.scribbr.com/citing-sources/et-al/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.scribbr.com/citing-sources/et-al/>; rel="timegate", <https://web.archive.org/web/20200601082911/https://www.scribbr.com/citing-sources/et-al/>; rel="first memento"; datetime="Mon, 01 Jun 2020 08:29:11 GMT", <https://web.archive.org/web/20201126185327/https://www.scribbr.com/citing-sources/et-al/>; rel="prev memento"; datetime="Thu, 26 Nov 2020 18:53:27 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="last memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'X-Archive-Src': 'spn2-20210102092956-wwwb-spn20.us.archive.org-8001.warc.gz', 'Server-Timing': 'captures_list;dur=112.646325, exclusion.robots;dur=0.172010, exclusion.robots.policy;dur=0.158205, RedisCDXSource;dur=2.205932, esindex;dur=0.014647, LoadShardBlock;dur=82.205012, PetaboxLoader3.datanode;dur=70.750239, CDXLines.iter;dur=24.306278, load_resource;dur=26.520179', 'X-App-Server': 'wwwb-app200', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/IN', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0', 'Content-Encoding': 'gzip'}
    """

    archive = waybackpy._archive_url_parser(
        perfect_header, "https://www.scribbr.com/citing-sources/et-al/"
    )
    assert "web.archive.org/web/20210102094009" in archive

    header = """
    vhgvkjv
    Content-Location: /web/20201126185327/https://www.scribbr.com/citing-sources/et-al
    ghvjkbjmmcmhj
    """
    archive = waybackpy._archive_url_parser(
        header, "https://www.scribbr.com/citing-sources/et-al/"
    )
    assert "20201126185327" in archive

    header = """
    hfjkfjfcjhmghmvjm
    X-Cache-Key: https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al/US
    yfu,u,gikgkikik
    """
    archive = waybackpy._archive_url_parser(
        header, "https://www.scribbr.com/citing-sources/et-al/"
    )
    assert "20171128185327" in archive

    # The below header should result in Exception
    no_archive_header = """
    {'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:42:45 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache', 'X-App-Server': 'wwwb-app52', 'X-ts': '523', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0'}
    """

    with pytest.raises(Exception):
        waybackpy._archive_url_parser(
            no_archive_header, "https://www.scribbr.com/citing-sources/et-al/"
        )


def test_save():
    # Test for urls that exist and can be archived.

@@ -89,13 +130,13 @@ def test_near():
        "(KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
    )
    archive_near_year = target.near(year=2010)
    assert "2010" in str(archive_near_year)
    assert "2010" in str(archive_near_year.timestamp)

    archive_near_month_year = str(target.near(year=2015, month=2))
    archive_near_month_year = str(target.near(year=2015, month=2).timestamp)
    assert (
        ("201502" in archive_near_month_year)
        or ("201501" in archive_near_month_year)
        or ("201503" in archive_near_month_year)
        ("2015-02" in archive_near_month_year)
        or ("2015-01" in archive_near_month_year)
        or ("2015-03" in archive_near_month_year)
    )

    target = waybackpy.Url(
@@ -123,7 +164,9 @@ def test_near():
def test_oldest():
    url = "github.com/akamhy/waybackpy"
    target = waybackpy.Url(url, user_agent)
    assert "20200504141153" in str(target.oldest())
    o = target.oldest()
    assert "20200504141153" in str(o)
    assert "2020-05-04" in str(o._timestamp)


def test_json():
@@ -165,9 +208,11 @@ def test_get_response():


def test_total_archives():
    target = waybackpy.Url(" https://google.com ", user_agent)
    assert target.total_archives() > 500000
    user_agent = (
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
    )
    target = waybackpy.Url(" https://outlook.com ", user_agent)
    assert target.total_archives() > 80000

    target = waybackpy.Url(
        " https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
waybackpy/__init__.py

@@ -1,5 +1,3 @@
# -*- coding: utf-8 -*-

# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
@@ -10,24 +8,43 @@
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━

"""
Waybackpy is a Python package that interfaces with the Internet Archive's Wayback Machine API.
Waybackpy is a Python package & command-line program that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Archive pages and retrieve archived pages easily.
Archive webpages and retrieve archived URLs easily.

Usage:
>>> import waybackpy
>>> target_url = waybackpy.Url('https://www.python.org', 'Your-apps-cool-user-agent')
>>> new_archive = target_url.save()
>>> print(new_archive)
https://web.archive.org/web/20200502170312/https://www.python.org/
>>> import waybackpy

Full documentation @ <https://akamhy.github.io/waybackpy/>.
:copyright: (c) 2020 by akamhy.
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"

>>> wayback = waybackpy.Url(url, user_agent)

>>> archive = wayback.save()
>>> str(archive)
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'

>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)

>>> oldest_archive = wayback.oldest()
>>> str(oldest_archive)
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'

>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> str(archive_close_to_2010_feb)
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'

>>> str(wayback.newest())
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'

Full documentation @ <https://github.com/akamhy/waybackpy/wiki>.
:copyright: (c) 2020-2021 Akash Mahanty et al.
:license: MIT
"""

from .wrapper import Url
from .wrapper import Url, Cdx
from .__version__ import (
    __title__,
    __description__,
waybackpy/__version__.py

@@ -1,12 +1,10 @@
# -*- coding: utf-8 -*-

__title__ = "waybackpy"
__description__ = (
    "A Python package that interfaces with the Internet Archive's Wayback Machine API. "
    "Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "2.3.2"
__version__ = "2.3.3"
__author__ = "akamhy"
__author_email__ = "akamhy@yahoo.com"
__license__ = "MIT"
waybackpy/cli.py (162 lines changed)

@@ -1,13 +1,12 @@
# -*- coding: utf-8 -*-
import sys
import os
import re
import argparse
import string
import sys
import random
import string
import argparse
from waybackpy.wrapper import Url
from waybackpy.__version__ import __version__
from waybackpy.exceptions import WaybackError
from waybackpy.__version__ import __version__


def _save(obj):
@@ -15,7 +14,6 @@ def _save(obj):
        return obj.save()
    except Exception as err:
        e = str(err)
        url = obj.url
        m = re.search(r"Header:\n(.*)", e)
        if m:
            header = m.group(1)
@@ -39,7 +37,7 @@ def _json(obj):
    return obj.JSON


def handle_not_archived_error(e, obj):
def no_archive_handler(e, obj):
    m = re.search(r"archive\sfor\s\'(.*?)\'\stry", str(e))
    if m:
        url = m.group(1)
@@ -58,14 +56,14 @@ def _oldest(obj):
    try:
        return obj.oldest()
    except Exception as e:
        return handle_not_archived_error(e, obj)
        return no_archive_handler(e, obj)


def _newest(obj):
    try:
        return obj.newest()
    except Exception as e:
        return handle_not_archived_error(e, obj)
        return no_archive_handler(e, obj)


def _total_archives(obj):
@@ -74,29 +72,25 @@ def _total_archives(obj):

def _near(obj, args):
    _near_args = {}
    if args.year:
        _near_args["year"] = args.year
    if args.month:
        _near_args["month"] = args.month
    if args.day:
        _near_args["day"] = args.day
    if args.hour:
        _near_args["hour"] = args.hour
    if args.minute:
        _near_args["minute"] = args.minute
    args_arr = [args.year, args.month, args.day, args.hour, args.minute]
    keys = ["year", "month", "day", "hour", "minute"]

    for key, arg in zip(keys, args_arr):
        if arg:
            _near_args[key] = arg

    try:
        return obj.near(**_near_args)
    except Exception as e:
        return handle_not_archived_error(e, obj)
        return no_archive_handler(e, obj)


def _save_urls_on_file(input_list, live_url_count):
    m = re.search("https?://([A-Za-z_0-9.-]+).*", input_list[0])

    domain = "domain-unknown"
    if m:
        domain = m.group(1)
    else:
        domain = "domain-unknown"

    uid = "".join(
        random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
@@ -111,52 +105,45 @@ def _save_urls_on_file(input_list, live_url_count):


def _known_urls(obj, args):
    """Abbreviations:
    sd = subdomain
    al = alive
    """
    Known urls for a domain.
    """
    # sd = subdomain
    sd = False
    al = False
    if args.subdomain:
        sd = True

    # al = alive
    al = False
    if args.alive:
        al = True

    url_list = obj.known_urls(alive=al, subdomain=sd)
    total_urls = len(url_list)

    if total_urls > 0:
        text = _save_urls_on_file(url_list, total_urls)
    else:
        text = "No known URLs found. Please try a different domain!"
        return _save_urls_on_file(url_list, total_urls)

    return text
    return "No known URLs found. Please try a different domain!"


def _get(obj, args):
    if args.get.lower() == "url":
        output = obj.get()

    elif args.get.lower() == "archive_url":
        output = obj.get(obj.archive_url)

    elif args.get.lower() == "oldest":
        output = obj.get(obj.oldest())

    elif args.get.lower() == "latest" or args.get.lower() == "newest":
        output = obj.get(obj.newest())

    elif args.get.lower() == "save":
        output = obj.get(obj.save())

    else:
        output = "Use get as \"--get 'source'\", 'source' can be one of the following: \
        \n1) url - get the source code of the url specified using --url/-u.\
        \n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
        \n3) oldest - get the source code of the oldest archive for the supplied url.\
        \n4) newest - get the source code of the newest archive for the supplied url.\
        \n5) save - Create a new archive and get the source code of this new archive for the supplied url."

    return output
        return obj.get()
    if args.get.lower() == "archive_url":
        return obj.get(obj.archive_url)
    if args.get.lower() == "oldest":
        return obj.get(obj.oldest())
    if args.get.lower() == "latest" or args.get.lower() == "newest":
        return obj.get(obj.newest())
    if args.get.lower() == "save":
        return obj.get(obj.save())
    return "Use get as \"--get 'source'\", 'source' can be one of the following: \
    \n1) url - get the source code of the url specified using --url/-u.\
    \n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
    \n3) oldest - get the source code of the oldest archive for the supplied url.\
    \n4) newest - get the source code of the newest archive for the supplied url.\
    \n5) save - Create a new archive and get the source code of this new archive for the supplied url."


def args_handler(args):
@@ -188,7 +175,7 @@ def args_handler(args):
    elif args.total:
        output = _total_archives(obj)
    elif args.near:
        output = _near(obj, args)
        return _near(obj, args)
    elif args.get:
        output = _get(obj, args)
    else:
@@ -199,24 +186,24 @@ def args_handler(args):
    return output


def parse_args(argv):
    parser = argparse.ArgumentParser()

    requiredArgs = parser.add_argument_group("URL argument (required)")
def add_requiredArgs(requiredArgs):
    requiredArgs.add_argument(
        "--url", "-u", help="URL on which Wayback Machine operations would occur"
    )

    userAgentArg = parser.add_argument_group("User Agent")

def add_userAgentArg(userAgentArg):
    help_text = 'User agent, default user_agent is "waybackpy python package - https://github.com/akamhy/waybackpy"'
    userAgentArg.add_argument("--user_agent", "-ua", help=help_text)

    saveArg = parser.add_argument_group("Create new archive/save URL")

def add_saveArg(saveArg):
    saveArg.add_argument(
        "--save", "-s", action="store_true", help="Save the URL on the Wayback Machine"
    )

    auArg = parser.add_argument_group("Get the latest Archive")

def add_auArg(auArg):
    auArg.add_argument(
        "--archive_url",
        "-au",
@@ -224,7 +211,8 @@ def parse_args(argv):
        help="Get the latest archive URL, alias for --newest",
    )

    jsonArg = parser.add_argument_group("Get the JSON data")

def add_jsonArg(jsonArg):
    jsonArg.add_argument(
        "--json",
        "-j",
@@ -232,7 +220,8 @@
        help="JSON data of the availability API request",
    )

    oldestArg = parser.add_argument_group("Oldest archive")

def add_oldestArg(oldestArg):
    oldestArg.add_argument(
        "--oldest",
        "-o",
@@ -240,7 +229,8 @@
        help="Oldest archive for the specified URL",
    )

    newestArg = parser.add_argument_group("Newest archive")

def add_newestArg(newestArg):
    newestArg.add_argument(
        "--newest",
        "-n",
@@ -248,7 +238,8 @@
        help="Newest archive for the specified URL",
    )

    totalArg = parser.add_argument_group("Total number of archives")

def add_totalArg(totalArg):
    totalArg.add_argument(
        "--total",
        "-t",
@@ -256,16 +247,16 @@
        help="Total number of archives for the specified URL",
    )

    getArg = parser.add_argument_group("Get source code")

def add_getArg(getArg):
    getArg.add_argument(
        "--get",
        "-g",
        help="Prints the source code of the supplied url. Use '--get help' for extended usage",
    )

    knownUrlArg = parser.add_argument_group(
        "URLs known and archived to Wayback Machine for the site."
    )

def add_knownUrlArg(knownUrlArg):
    knownUrlArg.add_argument(
        "--known_urls", "-ku", action="store_true", help="URLs known for the domain."
    )
@@ -274,31 +265,48 @@
    help_text = "Only include live URLs. Will not include dead links."
    knownUrlArg.add_argument("--alive", "-a", action="store_true", help=help_text)

    nearArg = parser.add_argument_group("Archive close to time specified")

def add_nearArg(nearArg):
    nearArg.add_argument(
        "--near", "-N", action="store_true", help="Archive near specified time"
    )

    nearArgs = parser.add_argument_group("Arguments that are used only with --near")

def add_nearArgs(nearArgs):
    nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
    nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
    nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
    nearArgs.add_argument("--hour", "-H", type=int, help="Hour in integer")
    nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")


def parse_args(argv):
    parser = argparse.ArgumentParser()
    add_requiredArgs(parser.add_argument_group("URL argument (required)"))
    add_userAgentArg(parser.add_argument_group("User Agent"))
    add_saveArg(parser.add_argument_group("Create new archive/save URL"))
    add_auArg(parser.add_argument_group("Get the latest Archive"))
    add_jsonArg(parser.add_argument_group("Get the JSON data"))
    add_oldestArg(parser.add_argument_group("Oldest archive"))
    add_newestArg(parser.add_argument_group("Newest archive"))
    add_totalArg(parser.add_argument_group("Total number of archives"))
    add_getArg(parser.add_argument_group("Get source code"))
    add_knownUrlArg(
        parser.add_argument_group(
            "URLs known and archived to Wayback Machine for the site."
        )
    )
    add_nearArg(parser.add_argument_group("Archive close to time specified"))
    add_nearArgs(parser.add_argument_group("Arguments that are used only with --near"))
    parser.add_argument(
        "--version", "-v", action="store_true", help="Waybackpy version"
    )

    return parser.parse_args(argv[1:])


def main(argv=None):
    if argv is None:
        argv = sys.argv
    args = parse_args(argv)
    output = args_handler(args)
    print(output)
    argv = sys.argv if argv is None else argv
    print(args_handler(parse_args(argv)))


if __name__ == "__main__":
waybackpy/exceptions.py

@@ -1,6 +1,3 @@
# -*- coding: utf-8 -*-


class WaybackError(Exception):
    """
    Raised when Wayback Machine API Service is unreachable/down.
waybackpy/wrapper.py

@@ -1,82 +1,185 @@
# -*- coding: utf-8 -*-

import re
from datetime import datetime, timedelta
from waybackpy.exceptions import WaybackError, URLError
from waybackpy.__version__ import __version__
import requests
import concurrent.futures
from datetime import datetime, timedelta
from waybackpy.__version__ import __version__
from waybackpy.exceptions import WaybackError, URLError


default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"
default_user_agent = "waybackpy python package - https://github.com/akamhy/waybackpy"


def _archive_url_parser(header):
    """Parse out the archive from header."""
def _get_total_pages(url, user_agent):
    """
    If showNumPages is passed in the CDX API, it returns
    the 'number of archive pages', and each page has many archives.

    This func returns the number of pages of archives (type int).
    """
    total_pages_url = (
        "https://web.archive.org/cdx/search/cdx?url=%s&showNumPages=true" % url
    )
    headers = {"User-Agent": user_agent}
    return int((_get_response(total_pages_url, headers=headers).text).strip())
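# Illustrative note (not part of the diff): for url="github.com" the function
# above requests
# https://web.archive.org/cdx/search/cdx?url=github.com&showNumPages=true
# whose response body is a small decimal page count (e.g. "5"), which is
# stripped and converted to an int.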

def _archive_url_parser(header, url):
    """
    The Wayback Machine's save API doesn't
    return a JSON response; we are required
    to read the header of the API response
    and look for the archive URL.

    This method has some regexen (or regexes)
    that search for the archive url in the header.

    This method is used when you try to
    save a webpage on the Wayback Machine.

    Two cases are possible:
    1) Either we find the archive url in
       the header.

    2) Or we didn't find the archive url in
       the API header.

    If we found the archive URL we return it.

    And if we couldn't find it, we raise
    WaybackError with an error message.
    """

    # Regex1
    arch = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
    if arch:
        return "web.archive.org" + arch.group(1)
    m = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
    if m:
        return "web.archive.org" + m.group(1)

    # Regex2
    arch = re.search(
    m = re.search(
        r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
    )
    if arch:
        return arch.group(1)
    if m:
        return m.group(1)

    # Regex3
    arch = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
    if arch:
        return arch.group(1)
    m = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
    if m:
        return m.group(1)

    raise WaybackError(
        "No archive URL found in the API response. "
        "This version of waybackpy (%s) is likely out of date or WayBack Machine is malfunctioning. Visit "
        "https://github.com/akamhy/waybackpy for the latest version "
        "of waybackpy.\nHeader:\n%s" % (__version__, str(header))
        "If '%s' can be accessed via your web browser then either "
        "this version of waybackpy (%s) is out of date or WayBack Machine is malfunctioning. Visit "
        "'https://github.com/akamhy/waybackpy' for the latest version "
        "of waybackpy.\nHeader:\n%s" % (url, __version__, str(header))
    )


def _wayback_timestamp(**kwargs):
    """Return a formatted timestamp."""
    """
    Wayback Machine archive URLs
    have a timestamp in them.

    The standard archive URL format is
    https://web.archive.org/web/20191214041711/https://www.youtube.com

    If we break it down into three parts:
    1) The start (https://web.archive.org/web/)
    2) timestamp (20191214041711)
    3) https://www.youtube.com, the original URL

    The near method takes year, month, day, hour and minute
    as arguments, their type is int.

    This method takes those integers, converts them to a
    Wayback Machine timestamp, and returns it.

    Return format is string.
    """

    return "".join(
        str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
    )
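# Illustrative note (not part of the diff):
# _wayback_timestamp(year=2019, month=12, day=14, hour=4, minute=17)
# returns "201912140417": each value is zero-padded to at least two digits
# and concatenated.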
|
||||
|
||||
|
||||
def _get_response(endpoint, params=None, headers=None):
|
||||
"""Get response for the supplied request."""
|
||||
"""
|
||||
This function is used make get request.
|
||||
We use the requests package to make the
|
||||
requests.
|
||||
|
||||
|
||||
We try twice and if both the times is fails And
|
||||
raises exceptions we give-up and raise WaybackError.
|
||||
|
||||
You can handles WaybackError by importing:
|
||||
from waybackpy.exceptions import WaybackError
|
||||
|
||||
try:
|
||||
response = requests.get(endpoint, params=params, headers=headers)
|
||||
...
|
||||
except WaybackError as e:
|
||||
# handle it
|
||||
"""
|
||||
|
||||
try:
|
||||
return requests.get(endpoint, params=params, headers=headers)
|
||||
except Exception:
|
||||
try:
|
||||
response = requests.get(endpoint, params=params, headers=headers) # nosec
|
||||
return requests.get(endpoint, params=params, headers=headers)
|
||||
except Exception as e:
|
||||
exc = WaybackError("Error while retrieving %s" % endpoint)
|
||||
exc.__cause__ = e
|
||||
raise exc
|
||||
return response
|
||||
|
||||
|
||||
class Url:
    """
    waybackpy Url class, Type : <class 'waybackpy.wrapper.Url'>
    """

    def __init__(self, url, user_agent=default_user_agent):
        self.url = url
        self.user_agent = str(user_agent)
        self._url_check()  # checks URL validity on init
        self._archive_url = None  # URL of the archive
        self.timestamp = None  # timestamp of the last fetched archive
        self._JSON = None
        self._alive_url_list = []

    def __repr__(self):
        return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)

    def __str__(self):
        """
        Output when print() is used on <class 'waybackpy.wrapper.Url'>.
        This should print an archive URL.

        We check if self._archive_url is not None;
        if it is set, we return it as a string.

        If self._archive_url is None, no method that sets it has been
        used yet, so we set it to self.archive_url and return that.
        """

        if not self._archive_url:
            self._archive_url = self.archive_url
        return "%s" % self._archive_url

    def __len__(self):
        """
        Why do we have len here?

        Applying len() on <class 'waybackpy.wrapper.Url'>
        will calculate the number of days between today and
        the archive timestamp.

        Can be applied on the return values of near() and its
        children (e.g. oldest()); if applied on waybackpy.Url()
        without using any such method, it just grabs
        self._timestamp, and _timestamp gets it from self.JSON.
        """
        td_max = timedelta(
            days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
        )

@@ -87,22 +190,37 @@ class Url:

        if self.timestamp == datetime.max:
            return td_max.days

        return (datetime.utcnow() - self.timestamp).days

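In practice (a sketch; the exact day count depends on when the archive was captured):

```python
import waybackpy

url = waybackpy.Url("https://example.com", "Mozilla/5.0")
print(len(url.oldest()))  # days elapsed since the oldest archive was made
```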
    def _url_check(self):
        """
        Check for common URL problems.
        What we are checking:
        1) '.' in self.url; we reject any URL that has no '.' in it.

        If you know any other common problems, please create a PR
        on the GitHub repo.
        """

        if "." not in self.url:
            raise URLError("'%s' is not a valid URL." % self.url)

    @property
    def JSON(self):
        """
        Return JSON data from 'https://archive.org/wayback/available?url=YOUR-URL'.
        If the end user has used near() or one of its children (oldest,
        newest, archive_url), the JSON response of these is cached in self._JSON.

        If we find that self._JSON is not None, we return it;
        else we fetch the response of
        'https://archive.org/wayback/available?url=YOUR-URL' and return that.
        """

        if self._JSON:
            return self._JSON

        endpoint = "https://archive.org/wayback/available"
        headers = {"User-Agent": self.user_agent}
        payload = {"url": "%s" % self._cleaned_url()}
        response = _get_response(endpoint, params=payload, headers=headers)
        return response.json()

@@ -135,8 +253,12 @@ class Url:
    def _timestamp(self):
        """
        Get the timestamp of the last fetched archive.
        If used before fetching any archive, this will
        use whatever self.JSON returns.

        self.timestamp being None implies that
        self.JSON will return whichever archive's JSON
        the Wayback Machine happens to provide.
        """

        if self.timestamp:
@@ -154,34 +276,48 @@ class Url:
        self.timestamp = ts
        return ts

    def _cleaned_url(self):
        """
        Remove EOL characters and
        replace " " with "_".
        """
        return str(self.url).strip().replace(" ", "_")

    def save(self):
        """
        Create a new Wayback Machine archive for this URL.

        To save a webpage on the WayBack Machine we need to
        send a GET request to https://web.archive.org/save/

        And to get the archive URL we are required to read the
        headers of the API response.

        _get_response() takes care of the GET requests; it uses
        the requests package.

        _archive_url_parser() parses the archive URL from the headers.
        """
        request_url = "https://web.archive.org/save/" + self._cleaned_url()
        headers = {"User-Agent": self.user_agent}
        response = _get_response(request_url, params=None, headers=headers)
        self._archive_url = "https://" + _archive_url_parser(response.headers, self.url)
        self.timestamp = datetime.utcnow()
        return self

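Typical use (this makes a network call; the archive URL printed is whatever the Wayback Machine returns):

```python
import waybackpy

url = waybackpy.Url("https://example.com", "Mozilla/5.0")
archive = url.save()      # creates a fresh archive, returns self
print(archive)            # prints the new archive URL via __str__
print(archive.timestamp)  # UTC time the save was made
```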
    def get(self, url="", user_agent="", encoding=""):
        """
        Return the source code of the supplied URL.
        If encoding is not supplied, it is auto-detected
        from the response itself by the requests package.
        """

        if not url:
            url = self._cleaned_url()

        if not user_agent:
            user_agent = self.user_agent

        headers = {"User-Agent": user_agent}
        response = _get_response(url, params=None, headers=headers)

        if not encoding:
@@ -193,10 +329,26 @@ class Url:
        return response.content.decode(encoding.replace("text/html", "UTF-8", 1))

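With no arguments, get() fetches the live page rather than an archive; to fetch archived source, pass an archive URL explicitly (assuming the archive_url property referenced in the docstrings above):

```python
import waybackpy

url = waybackpy.Url("https://example.com", "Mozilla/5.0")
live_source = url.get()                              # source of the live page
archived_source = url.get(url.oldest().archive_url)  # source of the oldest archive
```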
    def near(self, year=None, month=None, day=None, hour=None, minute=None):
        """
        Return the closest Wayback Machine archive to the time supplied.

        The Wayback Machine can have many archives of a webpage;
        sometimes we want an archive close to a specific time.

        This method takes year, month, day, hour and minute as input.
        The input type must be integer. Any non-supplied parameters
        default to the current time.

        We convert the input to a Wayback Machine timestamp using
        _wayback_timestamp(), which returns a string.

        We use the Wayback Machine's availability API
        (https://archive.org/wayback/available)
        to get the archive closest to the timestamp.

        We set self._archive_url to the archive found, if any.
        If an archive is found, we set self.timestamp to its timestamp.
        We set self._JSON to the response of the availability API.

        And finally return self.
        """
        now = datetime.utcnow().timetuple()
        timestamp = _wayback_timestamp(
@@ -208,14 +360,15 @@ class Url:
        )

        endpoint = "https://archive.org/wayback/available"
        headers = {"User-Agent": self.user_agent}
        payload = {"url": "%s" % self._cleaned_url(), "timestamp": timestamp}
        response = _get_response(endpoint, params=payload, headers=headers)
        data = response.json()

        if not data["archived_snapshots"]:
            raise WaybackError(
                "Can not find archive for '%s' try later or use wayback.Url(url, user_agent).save() "
                "to create a new archive." % self._cleaned_url()
            )
        archive_url = data["archived_snapshots"]["closest"]["url"]
        archive_url = archive_url.replace(
@@ -226,42 +379,65 @@ class Url:
        self.timestamp = datetime.strptime(
            data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
        )
        self._JSON = data

        return self

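For example (a network call; the exact archive returned depends on what the Wayback Machine holds for the URL):

```python
import waybackpy

url = waybackpy.Url("https://example.com", "Mozilla/5.0")
archive = url.near(year=2015, month=6)
print(archive)            # archive closest to June 2015
print(archive.timestamp)  # parsed datetime of that archive
```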
    def oldest(self, year=1994):
        """
        Return the earliest/oldest Wayback Machine archive for the webpage.

        The Wayback Machine started archiving the internet around 1997,
        so there can't be any archive older than that; we use 1994 as the
        default year to look for the oldest archive.

        We simply pass the year to near() and return the result.
        """
        return self.near(year=year)

    def newest(self):
        """
        Return the newest Wayback Machine archive available for this URL.

        We return the output of self.near(), as it defaults to the
        current UTC time.

        Due to Wayback Machine database lag, this may not always be the
        most recent archive.
        """
        return self.near()

    def total_archives(self, start_timestamp=None, end_timestamp=None):
        """
        A webpage can have multiple archives on the Wayback Machine.
        If someone wants to count the total number of archives of a
        webpage on the Wayback Machine, they can use this method.

        Returns the total number of Wayback Machine archives for the URL.

        Return type is integer.
        """

        cdx = Cdx(
            self._cleaned_url(),
            user_agent=self.user_agent,
            start_timestamp=start_timestamp,
            end_timestamp=end_timestamp,
        )
        i = 0
        for _ in cdx.snapshots():
            i += 1
        return i

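Usage sketch; the timestamp arguments are passed straight to the CDX API's from/to parameters, which (per the CDX server's documented behavior) accept partial timestamps such as a bare year:

```python
import waybackpy

url = waybackpy.Url("https://example.com", "Mozilla/5.0")
print(url.total_archives())  # integer count of all known archives
print(url.total_archives(start_timestamp="2016", end_timestamp="2019"))
```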
    def live_urls_picker(self, url):
        """
        Check whether the supplied URL is alive; URLs that
        respond with a status code below 400 are appended
        to self._alive_url_list.
        """

        try:
            response_code = requests.get(url).status_code
        except Exception:
            return  # we don't care if the URL is not opening

        # 200s are OK and 300s are usually redirects; if you don't
        # want redirects, replace 400 with 300
        if response_code >= 400:
@@ -269,8 +445,11 @@ class Url:

        self._alive_url_list.append(url)

    def known_urls(
        self, alive=False, subdomain=False, start_timestamp=None, end_timestamp=None
    ):
        """
        Return a list of URLs known to exist for the given domain name,
        because these URLs were crawled by WayBack Machine bots.
        Useful for pen-testers and others.
        Idea by Mohammed Diaa (https://github.com/mhmdiaa) from:
@@ -280,20 +459,23 @@ class Url:
        url_list = []

        if subdomain:
            url = "*.%s/*" % self._cleaned_url()
        else:
            url = "%s/*" % self._cleaned_url()

        cdx = Cdx(
            url,
            user_agent=self.user_agent,
            start_timestamp=start_timestamp,
            end_timestamp=end_timestamp,
        )
        snapshots = cdx.snapshots()

        url_list = []
        for snapshot in snapshots:
            url_list.append(snapshot.original)

        url_list = list(set(url_list))  # remove duplicates

        # Remove all dead URLs from url_list if alive=True
        if alive:
@@ -302,3 +484,95 @@ class Url:
            url_list = self._alive_url_list

        return url_list
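For example, to collect only the URLs that still respond (alive=True filters every candidate through live_urls_picker, so this can be slow on large domains):

```python
import waybackpy

url = waybackpy.Url("example.com", "Mozilla/5.0")
urls = url.known_urls(alive=True)
print(len(urls), "known URLs are still live")
```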

class CdxSnapshot:
    """
    This class helps to handle the CDX snapshots easily.

    What the raw data looks like:
    org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
    """

    def __init__(
        self, urlkey, timestamp, original, mimetype, statuscode, digest, length
    ):
        self.urlkey = urlkey  # useless
        self.timestamp = timestamp
        self.original = original
        self.mimetype = mimetype
        self.statuscode = statuscode
        self.digest = digest
        self.length = length
        self.archive_url = "https://web.archive.org/web/%s/%s" % (
            self.timestamp,
            self.original,
        )

    def __str__(self):
        return self.archive_url

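Using the raw sample line from the docstring above:

```python
from waybackpy.wrapper import CdxSnapshot

snapshot = CdxSnapshot(
    "org,archive)/", "20080126045828", "http://github.com",
    "text/html", "200", "Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY", "1415",
)
print(snapshot)  # https://web.archive.org/web/20080126045828/http://github.com
```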

class Cdx:
    """
    waybackpy Cdx class, Type : <class 'waybackpy.wrapper.Cdx'>

    Cdx keys are:
    urlkey
    timestamp
    original
    mimetype
    statuscode
    digest
    length
    """

    def __init__(
        self,
        url,
        user_agent=default_user_agent,
        start_timestamp=None,
        end_timestamp=None,
    ):
        self.url = url
        self.user_agent = str(user_agent)
        self.start_timestamp = str(start_timestamp) if start_timestamp else None
        self.end_timestamp = str(end_timestamp) if end_timestamp else None

    def snapshots(self):
        """
        This generator yields snapshots encapsulated
        in CdxSnapshot for easier use.
        """
        payload = {}
        endpoint = "https://web.archive.org/cdx/search/cdx"
        total_pages = _get_total_pages(self.url, self.user_agent)
        headers = {"User-Agent": self.user_agent}
        if self.start_timestamp:
            payload["from"] = self.start_timestamp
        if self.end_timestamp:
            payload["to"] = self.end_timestamp
        payload["url"] = self.url

        for i in range(total_pages):
            payload["page"] = str(i)
            res = _get_response(endpoint, params=payload, headers=headers)
            text = res.text
            if not text or len(text) <= 1 or text.isspace():
                break
            snapshot_list = text.split("\n")
            for snapshot in snapshot_list:
                if len(snapshot) < 15:
                    continue
                (
                    urlkey,
                    timestamp,
                    original,
                    mimetype,
                    statuscode,
                    digest,
                    length,
                ) = snapshot.split(" ")
                yield CdxSnapshot(
                    urlkey, timestamp, original, mimetype, statuscode, digest, length
                )
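Putting it together (network calls; output volume depends on how heavily the site has been crawled):

```python
from waybackpy.wrapper import Cdx

cdx = Cdx("example.com/*", user_agent="Mozilla/5.0", start_timestamp="2019")
for snapshot in cdx.snapshots():
    print(snapshot.archive_url, snapshot.statuscode)
```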