Fork of https://github.com/akamhy/waybackpy Wayback Machine API interface & a command-line tool
Akash Mahanty 5407681c34
v3.0.6 (#170)
2022-03-15 20:33:51 +05:30

A Python package & CLI tool that interfaces with the Wayback Machine API



Introduction

Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.

The Wayback Machine has three client-side APIs:

  • SavePageNow or Save API
  • CDX Server API
  • Availability API

All three APIs can be accessed through waybackpy, either by importing it in a Python file/module or from the command-line interface.

Installation

Using pip, from PyPI (recommended):

pip install waybackpy

Using conda, from conda-forge (recommended):

See also the waybackpy feedstock; its maintainers are @rafaelrdealmeida, @labriunesp and @akamhy.

conda install -c conda-forge waybackpy

Install directly from this git repository (NOT recommended):

pip install git+https://github.com/akamhy/waybackpy.git

Docker Image

Docker Hub: hub.docker.com/r/secsi/waybackpy

The Docker image is automatically updated on every release by Regularly and Automatically Updated Docker Images (RAUDI).

RAUDI is a tool by SecSI, an Italian cybersecurity startup.

Usage

As a Python package

Save API aka SavePageNow

>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)

CDX API aka CDXServerAPI

>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://google.com"
>>> user_agent = "my new app's user agent"
>>> cdx_api = WaybackMachineCDXServerAPI(url, user_agent)
oldest
>>> cdx_api.oldest()
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest = cdx_api.oldest()
>>> oldest
com,google)/ 19981111184551 http://google.com:80/ text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381
>>> oldest.archive_url
'https://web.archive.org/web/19981111184551/http://google.com:80/'
>>> oldest.original
'http://google.com:80/'
>>> oldest.urlkey
'com,google)/'
>>> oldest.timestamp
'19981111184551'
>>> oldest.datetime_timestamp
datetime.datetime(1998, 11, 11, 18, 45, 51)
>>> oldest.statuscode
'200'
>>> oldest.mimetype
'text/html'
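
The snapshot line printed above is space-separated, and the attributes shown map onto its columns. As a rough illustration of that mapping, the standalone sketch below splits such a line into fields and rebuilds the archive URL. The `SnapshotFields` type and the `parse_cdx_line` helper are hypothetical and are not part of waybackpy; naming the last two columns `digest` and `length` is an assumption based on the CDX Server API's default field order.

```python
from typing import NamedTuple


class SnapshotFields(NamedTuple):
    """Hypothetical container; waybackpy's own snapshot class may differ."""

    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str  # assumed name for the sixth column
    length: str  # assumed name for the seventh column

    @property
    def archive_url(self) -> str:
        # Archive URLs combine the 14-digit timestamp and the original URL.
        return f"https://web.archive.org/web/{self.timestamp}/{self.original}"


def parse_cdx_line(line: str) -> SnapshotFields:
    """Split one space-separated CDX line into its seven fields."""
    parts = line.split()
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    return SnapshotFields(*parts)


line = (
    "com,google)/ 19981111184551 http://google.com:80/ "
    "text/html 200 HOQ2TGPYAEQJPNUA6M4SMZ3NGQRBXDZ3 381"
)
snapshot = parse_cdx_line(line)
print(snapshot.archive_url)
```

All fields are kept as strings here, which matches the quoted `'200'` and `'19981111184551'` values in the session above.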
newest
>>> newest = cdx_api.newest()
>>> newest
com,google)/ 20220217234427 http://@google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 563
>>> newest.archive_url
'https://web.archive.org/web/20220217234427/http://@google.com/'
>>> newest.timestamp
'20220217234427'
near
>>> near = cdx_api.near(year=2010, month=10, day=10, hour=10, minute=10)
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.timestamp
'20101010101435'
>>> near = cdx_api.near(wayback_machine_timestamp=2008080808)
>>> near.archive_url
'https://web.archive.org/web/20080808051143/http://google.com/'
>>> near = cdx_api.near(unix_timestamp=1286705410)
>>> near
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>> 
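
The three ways of calling near() above all describe a point in time, and the `unix_timestamp=1286705410` call returns the same snapshot as the `year=2010, month=10, day=10, hour=10, minute=10` call. The sketch below, using only the standard library, shows how a Unix timestamp relates to the 14-digit Wayback Machine format and how such a timestamp can be turned back into a datetime. The helper names are hypothetical, and treating the timestamps as UTC is an assumption; waybackpy's internal conversion may differ.

```python
from datetime import datetime, timezone

# 14-digit Wayback Machine timestamp layout: YYYYMMDDhhmmss
WAYBACK_FORMAT = "%Y%m%d%H%M%S"


def unix_to_wayback(unix_timestamp: int) -> str:
    """Format a Unix timestamp as a 14-digit timestamp (UTC assumed)."""
    moment = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    return moment.strftime(WAYBACK_FORMAT)


def wayback_to_datetime(timestamp: str) -> datetime:
    """Parse a 14-digit timestamp into a naive datetime."""
    return datetime.strptime(timestamp, WAYBACK_FORMAT)


print(unix_to_wayback(1286705410))            # 20101010101010
print(wayback_to_datetime("19981111184551"))  # 1998-11-11 18:45:51
```

The requested time is 10:10:10 on 10 October 2010, while the closest capture in the session above was taken at 10:14:35, which is why the returned timestamp is `20101010101435`.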
snapshots
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
...     print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
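
For orientation, the `start_timestamp` and `end_timestamp` arguments above bound the range of captures that is returned. The sketch below builds a comparable query URL by hand, without sending any request. The endpoint and the `from`/`to` parameter names are assumptions taken from the public CDX Server API rather than from this README, `build_cdx_query` is a hypothetical helper, and the request waybackpy actually sends may include further parameters.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed public CDX Server endpoint (not stated in this README).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(
    url: str,
    start_timestamp: Optional[int] = None,
    end_timestamp: Optional[int] = None,
) -> str:
    """Build a CDX query URL limited to an optional timestamp range."""
    params = {"url": url}
    if start_timestamp is not None:
        params["from"] = str(start_timestamp)
    if end_timestamp is not None:
        params["to"] = str(end_timestamp)
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


print(build_cdx_query("https://pypi.org", 2016, 2017))
```

Partial timestamps such as `2016` are passed through unchanged in this sketch.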

Availability API

Using the Availability API is not recommended because of performance issues. All the methods of its interface class, WaybackMachineAvailabilityAPI, are also implemented in the CDX Server API interface class, WaybackMachineCDXServerAPI. Note, however, that the archive returned by WaybackMachineAvailabilityAPI's newest() method can be more recent than the one returned by WaybackMachineCDXServerAPI's newest() method.

>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
oldest
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
newest
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
near
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
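
As a minimal sketch of what sits behind these calls, the function below extracts the closest archive URL from an Availability API style JSON response. The response shape (`archived_snapshots` containing a `closest` entry) is an assumption based on the public Availability API, the sample payload is illustrative and reuses the values from the near() call above, and `closest_archive_url` is a hypothetical helper that is not part of waybackpy.

```python
from typing import Any, Dict, Optional


def closest_archive_url(payload: Dict[str, Any]) -> Optional[str]:
    """Return the closest snapshot URL, or None when nothing is archived."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


# Illustrative payload only; the shape is assumed, not taken from this README.
sample_payload = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "status": "200",
            "timestamp": "20101010101708",
            "url": "https://web.archive.org/web/20101010101708/http://www.google.com/",
        }
    }
}

print(closest_archive_url(sample_payload))
print(closest_archive_url({"archived_snapshots": {}}))  # None
```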

Documentation is at https://github.com/akamhy/waybackpy/wiki/Python-package-docs.

As a CLI tool

A demo video is available on asciinema.org; you can copy text directly from the video.

CLI documentation is at https://github.com/akamhy/waybackpy/wiki/CLI-docs.