Compare commits

51 Commits
3.0.0 ... 3.0.1

Author SHA1 Message Date
9afe29a819 Merge pull request #119 from akamhy/akamhy-patch-1
v3.0.0 --> v3.0.1
2022-01-25 19:54:01 +05:30
d79b10c74c v3.0.0 --> v3.0.1 2022-01-25 19:52:10 +05:30
32314dc102 Merge branch 'build-test' #118
Add build test to CI
 see #117
2022-01-25 14:02:36 +05:30
50e176e2ba .github/workflows/build_test.yml : change python versions from '3.4', '3.8', '3.10' to '3.6', '3.10' as 3.4 not found by GitHub. 2022-01-25 13:56:49 +05:30
4007859c92 Install dependencies for build test in CI : setuptools wheel 2022-01-25 13:35:58 +05:30
d8bd6c628d Add build test to CI 2022-01-25 13:30:16 +05:30
28f6ff8df2 Merge pull request #116 from akamhy/patch-setup-py
Fix syntax for opening the README.md and __version__.py
2022-01-25 13:11:33 +05:30
7ac9353f74 Fix syntax for opening the README.md and __version__.py
For some reason updates made at https://github.com/akamhy/waybackpy/pull/114
are breaking the build using setup, caught while deploying to a cloud service
provider.

The exact error is:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-req-build-n3b9e5pj/setup.py", line 5
  os.path.join(os.path.dirname(__file__), README.md), encoding=utf-8),
                                                                                ^
SyntaxError: invalid syntax
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

See also :
https://github.com/conda-forge/staged-recipes/pull/17634
2022-01-25 13:05:01 +05:30
15c7244a22 Merge pull request #115 from akamhy/akamhy-patch-1
do not use f-strings in setup.py
2022-01-25 10:42:27 +05:30
8510210e94 do not use f-strings in setup.py
f-strings are not supported in CPython versions older than 3.6.
2022-01-25 10:34:46 +05:30
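To illustrate the commit above: f-strings were introduced in Python 3.6, so a setup.py that still has to parse on older interpreters can use str.format instead. A minimal sketch; the version value is a placeholder chosen for illustration, not read from the package.

```python
# Placeholder for illustration; the real setup.py reads the version from __version__.py.
version = "3.0.1"

# str.format parses on interpreters older than 3.6, where an f-string
# would be rejected with a SyntaxError before any code runs.
download_url = "https://github.com/akamhy/waybackpy/archive/{version}.tar.gz".format(
    version=version
)
print(download_url)
```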
552967487e Merge pull request #114 from rafaelrdealmeida/patch-1
Update setup.py

See also <https://github.com/akamhy/waybackpy/issues/111#issuecomment-1020673814>
2022-01-25 10:30:34 +05:30
86a90a3840 Update setup.py
pep8
2022-01-24 22:03:28 -03:00
759874cdc6 Update setup.py
see: https://github.com/akamhy/waybackpy/issues/111#issuecomment-1020673814
2022-01-24 21:23:31 -03:00
06095202fe BUG FIX : forgot to use the endpoint from the instance and also assign payload to param. Bug caught by the flake8 in the CI tests. 2022-01-24 23:35:48 +05:30
06fc7855bf waybackpy/cdx_api.py : default user agent is now DEFAULT_USER_AGENT, get_response now takes url and headers as arguments and the request url is generated by the full_url function. max_tries added as a parameter for the WaybackMachineCDXServerAPI class with a default value of 3. 2022-01-24 23:20:49 +05:30
c49fe971fd update the older deprecation note for the Url class, the newer date is now 2025 instead of 2024. 2022-01-24 23:15:59 +05:30
d6783d5525 added tests for cdx_utils.py 2022-01-24 23:05:47 +05:30
9262f5da21 improve functions get_total_pages, get_response and lint check_filters, check_collapses and check_match_type
get_total_pages : default user agent is now DEFAULT_USER_AGENT.
                  Instead of str formatting, the payload is now passed
                  as param to full_url to generate the request url.
                  Also, get_response makes the request instead of directly
                  using requests.get()

get_response : no longer takes params as keyword arguments; the caller
               is expected to pass the full url, which may be generated
               by the full_url function, so return_full_url=False is
               deprecated as well.
               The session is now closed via session.close()
               No need to check for 'Exceeded 30 redirects', as the save
               API uses a different method.

check_filters : The return values of the match groups are no longer
                assigned to variables because we won't be using them and
                the linter flags these unused assignments.

check_collapses : Same reason as for check_filters, but also removed a
                  foolish test that checked equality between objects
                  that are guaranteed to be the same.

check_match_type : Updated the text of the WaybackError
2022-01-24 22:57:20 +05:30
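The idea in the commit above, building the complete request URL first and handing it to the function that performs the request, can be sketched as follows. This is an assumption-laden illustration: build_full_url is a hypothetical stand-in, not the package's full_url, and the endpoint and parameters are example values only.

```python
from urllib.parse import urlencode


def build_full_url(endpoint, params):
    """Hypothetical helper: append URL-encoded params to an endpoint."""
    if not params:
        return endpoint
    # Reuse an existing query string if the endpoint already has one.
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode(params)


# Example values only; the caller would pass the resulting URL to the
# function that actually makes the HTTP request.
url = build_full_url(
    "https://web.archive.org/cdx/search/cdx",
    {"url": "example.com", "showNumPages": "true"},
)
print(url)
```

Keeping URL construction separate from the request means the request function needs only a URL and headers, which is the shape the commit message describes.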
d1a1cf2546 added tests for utils.py at tests/test_utils.py; also changed a keyword argument from headers to user_agent for latest_version in utils.py, with the usual default value. 2022-01-24 17:50:36 +05:30
cd8a32ed1f added tests for cdx_snapshot.py at tests/test_cdx_snapshot.py 2022-01-24 16:29:44 +05:30
57512c65ff change the test for the oldest method from google.com to example.com; the oldest archive of google.com is, for some unknown reason, not very stable. 2022-01-24 16:27:35 +05:30
d9ea26e11c added code style black badge 2022-01-24 13:46:31 +05:30
2bea92b348 fix bug with the third matching case of the archive_url_parser, caught while writing more tests for the save API interface. 2022-01-24 13:31:30 +05:30
d506685f68 added some tests for save_api interface 2022-01-23 18:35:54 +05:30
7844d15d99 close the session in save api interface 2022-01-23 18:34:06 +05:30
c0252edff2 updated tests for availability_api.py and also added max_tries (default value is 3) with a delay (sleep) between successive API calls. The delay actually improves the performance of the availability_api interface. 2022-01-23 15:05:10 +05:30
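A bounded retry with a pause between attempts, as described in the commit above, might look like the sketch below. The helper name call_with_retries, its delay parameter, and the flaky stand-in function are assumptions for illustration; only the max_tries default of 3 comes from the commit message.

```python
import time


def call_with_retries(fetch, max_tries=3, delay=0.0):
    """Call fetch() up to max_tries times, sleeping between failed attempts."""
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    last_error = None
    for attempt in range(1, max_tries + 1):
        try:
            return fetch()
        except Exception as error:  # a real client would catch narrower exceptions
            last_error = error
            if attempt < max_tries:
                time.sleep(delay)
    raise last_error


# Stand-in for an API call that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky, max_tries=3, delay=0.0)
print(result, calls["count"])
```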
e7488f3a3e added test badge, rename test to Tests from ubuntu and fix the Incomplete URL substring sanitization(or trying to) 2022-01-23 02:26:53 +05:30
aed75ad1db Make modules importable as part of a Python package, waybackpy, by creating an __init__.py file in tests 2022-01-23 02:14:38 +05:30
d740959c34 more dev reqs 2022-01-23 02:10:12 +05:30
2d83043ef7 + flake8 in requirements-dev.txt 2022-01-23 02:05:08 +05:30
31b1056217 fix typo in CI 2022-01-23 02:03:30 +05:30
97712b2c1e add CI unit_test.yml 2022-01-23 02:00:15 +05:30
a8acc4c4d8 Fix Incomplete URL substring sanitization in the last commit. 2022-01-23 01:42:48 +05:30
1bacd73002 created pytest.ini, added test for waybackpy/availability_api.py, new exceptions all of which inherit from the main WaybackError and created requirements-dev.txt 2022-01-23 01:29:07 +05:30
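The benefit of having every new exception inherit from the main WaybackError is that callers can catch the whole family with one except clause. A minimal sketch: the two subclass names appear in the test imports later in this comparison, while the class bodies and the describe helper are assumptions, not the package's code.

```python
class WaybackError(Exception):
    """Stand-in for the package's base error type."""


class InvalidJSONInAvailabilityAPIResponse(WaybackError):
    """Assumed body; only the name is taken from the test imports."""


class ArchiveNotInAvailabilityAPIResponse(WaybackError):
    """Assumed body; only the name is taken from the test imports."""


def describe(error):
    """Hypothetical helper: catch any package error via the base class."""
    try:
        raise error
    except WaybackError as caught:
        return type(caught).__name__


print(describe(ArchiveNotInAvailabilityAPIResponse("no archive found")))
```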
79901ba968 updated README.md 2022-01-22 03:08:26 +05:30
df64e839d7 added trove classifiers for python 3.10 2022-01-22 00:57:10 +05:30
405e9a2a79 waybackpy/save_api.py : Added doc strings and also lint with black. 2022-01-22 00:41:10 +05:30
db551abbf6 lint waybackpy/cdx_api.py and added some doc strings 2022-01-22 00:11:35 +05:30
d13dd4db1a added notice on waybackpy/wrapper.py that the Url class will cease to exist after 2024-01-01 and also removed unused imports. 2022-01-21 23:14:20 +05:30
d3bb8337a1 make setup.py smarter, now no need to update the URL again and also added more keywords. And in __version__.py updated the __author__ 2022-01-21 23:01:09 +05:30
fd5e85420c waybackpy/availability_api.py : removed unused imports, added doc strings, removed redundant function. 2022-01-21 22:47:44 +05:30
5c685ef5d7 upload logo and make p path not text
I was dumb to forget to convert the p to path.
2022-01-21 21:11:42 +05:30
6a3d96b453 Logo (#113)
* Create logo.txt

* Delete waybackpy_logo.svg

* Add files via upload

* Delete logo.txt
2022-01-21 21:02:38 +05:30
afe1b15a5f Add files via upload 2022-01-21 20:58:53 +05:30
4fd9d142e7 Merge pull request #112 from akamhy/fix
escape '.' before 'archive.org'
2022-01-21 19:52:55 +05:30
5e9fdb40ce escape '.' before 'archive.org'
escape '.' before 'archive.org' on line 88 so it does not match more hosts than expected.
2022-01-21 19:51:08 +05:30
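Why the escape matters: an unescaped '.' in a regular expression matches any character, so a host pattern can accept lookalike domains. The patterns and URLs below are hypothetical examples, not the exact expression on line 88 of the package.

```python
import re

# Hypothetical patterns for illustration only.
unescaped = re.compile(r"https?://web.archive.org/web/")
escaped = re.compile(r"https?://web\.archive\.org/web/")

lookalike = "https://webXarchive.org/web/20220101000000/https://example.com/"
genuine = "https://web.archive.org/web/20220101000000/https://example.com/"

# The bare '.' accepts the 'X', so the lookalike host slips through.
print(bool(unescaped.match(lookalike)))
# With the dots escaped, only the literal host matches.
print(bool(escaped.match(lookalike)))
print(bool(escaped.match(genuine)))
```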
fa72098270 _get_response is not used anymore
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.
2022-01-21 19:43:35 +05:30
d18f955044 date year range 2020-2022 2022-01-21 11:55:42 +05:30
9c340d6967 Create codeql-analysis.yml 2022-01-21 11:12:59 +05:30
78d0e0c126 Update README.md 2022-01-21 09:54:04 +05:30
564101e6f5 🐳 for docker image 2022-01-21 01:23:05 +05:30
25 changed files with 905 additions and 142 deletions

30
.github/workflows/build_test.yml vendored Normal file

@ -0,0 +1,30 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.6', '3.10']
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install setuptools wheel
    - name: Build test the package
      run: |
        python setup.py sdist bdist_wheel

70
.github/workflows/codeql-analysis.yml vendored Normal file

@ -0,0 +1,70 @@
# For most projects, this workflow file will not need changing; you simply need
# to commit it to your repository.
#
# You may wish to alter this file to override the set of languages analyzed,
# or to provide custom queries or build logic.
#
# ******** NOTE ********
# We have attempted to detect the languages in your repository. Please check
# the `language` matrix defined below to confirm you have the correct set of
# supported CodeQL languages.
#
name: "CodeQL"
on:
  push:
    branches: [ master ]
  pull_request:
    # The branches below must be a subset of the branches above
    branches: [ master ]
  schedule:
    - cron: '30 6 * * 1'
jobs:
  analyze:
    name: Analyze
    runs-on: ubuntu-latest
    permissions:
      actions: read
      contents: read
      security-events: write
    strategy:
      fail-fast: false
      matrix:
        language: [ 'python' ]
        # CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ]
        # Learn more about CodeQL language support at https://git.io/codeql-language-support
    steps:
    - name: Checkout repository
      uses: actions/checkout@v2
    # Initializes the CodeQL tools for scanning.
    - name: Initialize CodeQL
      uses: github/codeql-action/init@v1
      with:
        languages: ${{ matrix.language }}
        # If you wish to specify custom queries, you can do so here or in a config file.
        # By default, queries listed here will override any specified in a config file.
        # Prefix the list here with "+" to use these queries and those in the config file.
        # queries: ./path/to/local/query, your-org/your-repo/queries@main
    # Autobuild attempts to build any compiled languages (C/C++, C#, or Java).
    # If this step fails, then you should remove it and run the build manually (see below)
    - name: Autobuild
      uses: github/codeql-action/autobuild@v1
    # Command-line programs to run using the OS shell.
    # 📚 https://git.io/JvXDl
    # ✏️ If the Autobuild fails above, remove it and uncomment the following three lines
    #    and modify them (or add more) to build your code if your project
    #    uses a compiled language
    #- run: |
    #   make bootstrap
    #   make release
    - name: Perform CodeQL Analysis
      uses: github/codeql-action/analyze@v1

44
.github/workflows/unit_test.yml vendored Normal file

@ -0,0 +1,44 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Tests
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.9']
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
        if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 waybackpy/ --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        # flake8 waybackpy/ --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics --per-file-ignores="waybackpy/__init__.py:F401"
    # - name: Static type test with mypy
    #   run: |
    #     mypy
    - name: Test with pytest
      run: |
        pytest
    # - name: Upload coverage to Codecov
    #   run: |
    #     bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

View File

@ -6,5 +6,4 @@
## ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

View File

@ -1,6 +1,6 @@
MIT License
Copyright (c) 2020 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Copyright (c) 2020-2022 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

View File

@ -2,56 +2,56 @@
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<h3>Python package & CLI tool that interfaces with the Wayback Machine API</h3>
<h3>A Python package & CLI tool that interfaces with the Wayback Machine API</h3>
</div>
<p align="center">
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ATests"><img alt="Unit Tests" src="https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg"></a>
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md"><img alt="Contributions Welcome" src="https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square"></a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
</p>
-----------------------------------------------------------------------------------------------------------------------------------------------
## ⭐️ Introduction
Waybackpy is a [Python package](https://www.udacity.com/blog/2021/01/what-is-a-python-package.html) and a CLI tool that interfaces with the Wayback Machine API.
Waybackpy is a [Python package](https://www.udacity.com/blog/2021/01/what-is-a-python-package.html) and a [CLI](https://www.w3schools.com/whatis/whatis_cli.asp) tool that interfaces with the [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine) API.
Wayback Machine has 3 client side APIs.
Wayback Machine has 3 client side [API](https://www.redhat.com/en/topics/api/what-are-application-programming-interfaces)s.
- Save API
- Availability API
- CDX API
- [Save API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#save-api)
- [Availability API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#availability-api)
- [CDX API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#cdx-api)
All three of these can be accessed by waybackpy.
These three APIs can be accessed via the waybackpy either by importing it in a script or from the CLI.
### 🏗 Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)), from [PyPI](https://pypi.org/) (recommended):
```bash
pip install waybackpy
```
Install directly from GitHub:
Install directly from [this git repository](https://github.com/akamhy/waybackpy) (NOT recommended):
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
### Docker Image
### 🐳 Docker Image
Docker Hub : <https://hub.docker.com/r/secsi/waybackpy>
Docker image is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
[Docker image](https://searchitoperations.techtarget.com/definition/Docker-image) is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
RAUDI is a tool by SecSI (<https://secsi.io>), an Italian cybersecurity startup.
### Usage
### 🚀 Usage
#### As a Python package
@ -60,7 +60,7 @@ RAUDI is a tool by SecSI (<https://secsi.io>), an Italian cybersecurity startup.
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
@ -73,18 +73,18 @@ datetime.datetime(2022, 1, 18, 12, 52, 49)
##### Availability API
```python
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>>
>>>
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>>
>>>
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
>>>
>>>
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
```
@ -97,7 +97,7 @@ https://web.archive.org/web/20101010101708/http://www.google.com/
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
... print(item.archive_url)
...
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
@ -107,23 +107,48 @@ https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
```
> Documentation at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
> Documentation is at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
#### As a CLI tool
Saving a webpage:
```bash
$ waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
```
```bash
Archive URL:
https://web.archive.org/web/20220121193801/https://en.wikipedia.org/wiki/Social_media
Cached save:
False
```
$ waybackpy --oldest --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
Retriving the oldest archive and also printing the JSON response of the availability API:
```bash
waybackpy --oldest --json --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
```
```bash
Archive URL:
https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid
JSON response:
{"url": "https://en.wikipedia.org/wiki/Humanoid", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid", "timestamp": "20040415020811"}}, "timestamp": "199401212126"}
```
$ waybackpy --newest --url "https://en.wikipedia.org/wiki/Remote_sensing" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20201221130522/https://en.wikipedia.org/wiki/Remote_sensing
Archive close to a time, minute level precision is supported:
```bash
waybackpy --url google.com --user_agent "my-unique-user-agent" --near --year 2008 --month 8 --day 8
```
```bash
Archive URL:
https://web.archive.org/web/20080808014003/http://www.google.com:80/
```
> CLI documentation is at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
### 🛡 License
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
Copyright (c) 2020-2022 Akash Mahanty Et al.
Released under the MIT License. See [license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.

View File

@ -1 +1,14 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 176.612 41.908" height="158.392" width="667.51" xmlns:v="https://github.com/akamhy/waybackpy"><text transform="matrix(.862888 0 0 1.158899 -.748 -98.312)" y="110.937" x="0.931" xml:space="preserve" font-weight="bold" font-size="28.149" font-family="sans-serif" letter-spacing="0" word-spacing="0" writing-mode="lr-tb" fill="#003dff"><tspan y="110.937" x="0.931"><tspan y="110.937" x="0.931" letter-spacing="3.568" writing-mode="lr-tb">waybackpy</tspan></tspan></text><path d="M.749 0h153.787v4.864H.749zm22.076 37.418h153.787v4.49H22.825z" fill="navy"/><path d="M0 37.418h22.825v4.49H0zM154.536 0h21.702v4.864h-21.702z" fill="#f0f"/></svg>
<?xml version="1.0" encoding="utf-8"?>
<svg width="711.80188pt" height="258.30469pt" viewBox="0 0 711.80188 258.30469" version="1.1" id="svg2" xmlns="http://www.w3.org/2000/svg">
<g id="surface1" transform="translate(-40.045801,-148)">
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 224.09 309.814 L 224.09 197.997 L 204.768 197.994 L 204.768 312.635 C 204.768 312.635 205.098 312.9 204.105 313.698 C 203.113 314.497 202.408 313.849 202.408 313.849 L 200.518 313.849 L 200.518 197.991 L 181.139 197.991 L 181.139 313.849 L 179.253 313.849 C 179.253 313.849 178.544 314.497 177.551 313.698 C 176.558 312.9 176.888 312.635 176.888 312.635 L 176.888 197.994 L 157.57 197.997 L 157.57 309.814 C 157.57 309.814 156.539 316.772 162.615 321.658 C 168.691 326.546 177.551 326.049 177.551 326.049 L 204.11 326.049 C 204.11 326.049 212.965 326.546 219.041 321.658 C 225.118 316.772 224.09 309.814 224.09 309.814" id="path5"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 253.892 299.821 C 253.892 299.821 253.632 300.965 251.888 300.965 C 250.143 300.965 249.629 299.821 249.629 299.821 L 249.629 278.477 C 249.629 278.477 249.433 278.166 250.078 277.645 C 250.726 277.124 251.243 277.179 251.243 277.179 L 253.892 277.228 Z M 251.588 199.144 C 230.266 199.144 231.071 213.218 231.071 213.218 L 231.071 254.303 L 249.675 254.303 L 249.675 213.69 C 249.675 213.69 249.775 211.276 251.787 211.276 C 253.8 211.276 254 213.542 254 213.542 L 254 265.146 L 246.156 265.146 C 246.156 265.146 240.022 264.579 235.495 268.22 C 230.968 271.858 231.071 276.791 231.071 276.791 L 231.071 298.955 C 231.071 298.955 229.461 308.016 238.914 312.058 C 248.368 316.103 254.805 309.795 254.805 309.795 L 254.805 312.706 L 272.508 312.706 L 272.508 212.895 C 272.508 212.895 272.907 199.144 251.588 199.144" id="path7"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 404.682 318.261 C 404.682 318.261 404.398 319.494 402.485 319.494 C 400.568 319.494 400.001 318.261 400.001 318.261 L 400.001 295.216 C 400.001 295.216 399.786 294.879 400.496 294.315 C 401.208 293.757 401.776 293.812 401.776 293.812 L 404.682 293.868 Z M 402.152 209.568 C 378.728 209.568 379.61 224.761 379.61 224.761 L 379.61 269.117 L 400.051 269.117 L 400.051 225.273 C 400.051 225.273 400.162 222.665 402.374 222.665 C 404.582 222.665 404.805 225.109 404.805 225.109 L 404.805 280.82 L 396.187 280.82 C 396.187 280.82 389.447 280.213 384.475 284.141 C 379.499 288.072 379.61 293.396 379.61 293.396 L 379.61 317.324 C 379.61 317.324 377.843 327.104 388.232 331.469 C 398.616 335.838 405.69 329.027 405.69 329.027 L 405.69 332.169 L 425.133 332.169 L 425.133 224.413 C 425.133 224.413 425.578 209.568 402.152 209.568" id="path9"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 321.114 328.636 L 321.114 206.587 L 302.582 206.587 L 302.582 304.902 C 302.582 304.902 303.211 307.094 300.624 307.094 C 298.035 307.094 298.316 304.902 298.316 304.902 L 298.316 206.587 L 279.784 206.587 C 279.784 206.587 279.922 304.338 279.922 306.756 C 279.922 309.175 280.27 310.526 280.831 312.379 C 281.391 314.238 282.579 318.116 290.901 319.186 C 299.224 320.256 302.44 315.813 302.44 315.813 L 302.44 327.736 C 302.44 327.736 302.862 329.366 300.554 329.366 C 298.246 329.366 298.316 327.849 298.316 327.849 L 298.316 322.957 L 279.642 322.957 L 279.642 327.791 C 279.642 327.791 278.523 341.514 300.274 341.514 C 322.026 341.514 321.114 328.636 321.114 328.636" id="path11"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 352.449 209.811 L 352.449 273.495 C 352.449 277.49 347.911 277.194 347.911 277.194 L 347.911 207.592 C 347.911 207.592 346.929 207.542 349.567 207.542 C 352.817 207.542 352.449 209.811 352.449 209.811 M 352.326 310.393 C 352.326 310.393 352.143 312.366 350.425 312.366 L 348.033 312.366 L 348.033 289.478 L 349.628 289.478 C 349.628 289.478 352.326 289.428 352.326 292.092 Z M 371.341 287.505 C 371.341 284.791 370.727 282.966 368.826 280.993 C 366.925 279.02 363.367 277.441 363.367 277.441 C 363.367 277.441 365.514 276.948 368.704 274.728 C 371.893 272.509 371.525 267.921 371.525 267.921 L 371.525 212.919 C 371.525 212.919 371.801 204.509 366.925 200.587 C 362.049 196.665 352.515 196.363 352.515 196.363 L 328.711 196.363 L 328.711 324.107 L 350.609 324.107 C 360.055 324.107 364.594 322.232 368.336 318.286 C 372.077 314.34 371.341 308.321 371.341 308.321 Z M 371.341 287.505" id="path13"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 452.747 226.744 L 452.747 268.806 L 471.581 268.806 L 471.581 227.459 C 471.581 227.459 471.846 213.532 450.516 213.532 C 429.182 213.532 430.076 227.533 430.076 227.533 L 430.076 313.381 C 430.076 313.381 428.825 327.523 450.872 327.523 C 472.919 327.523 471.401 313.526 471.401 313.526 L 471.401 292.064 L 452.835 292.064 L 452.835 314.389 C 452.835 314.389 452.923 315.61 450.961 315.61 C 448.997 315.61 448.729 314.389 448.729 314.389 L 448.729 226.524 C 448.729 226.524 448.821 225.378 450.692 225.378 C 452.566 225.378 452.747 226.744 452.747 226.744" id="path15"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 520.624 281.841 C 517.672 278.98 514.317 277.904 514.317 277.904 C 514.317 277.904 517.538 277.796 520.489 274.775 C 523.442 271.753 523.173 267.924 523.173 267.924 L 523.173 208.211 L 503.185 208.211 L 503.185 276.014 C 503.185 276.014 503.185 277.361 501.172 277.361 L 498.761 277.309 L 498.761 191.655 L 478.973 191.655 L 478.973 327.905 L 498.692 327.905 L 498.692 290.039 L 501.709 290.039 C 501.709 290.039 502.112 290.039 502.648 290.523 C 503.185 291.01 503.185 291.602 503.185 291.602 L 503.185 327.905 L 523.307 327.905 L 523.307 288.636 C 523.307 288.636 523.576 284.699 520.624 281.841" id="path17"/>
<path style="fill-opacity: 1; fill-rule: nonzero; stroke: none; fill: rgb(255, 222, 87);" d="M 638.021 327.182 L 638.021 205.132 L 619.489 205.132 L 619.489 303.448 C 619.489 303.448 620.119 305.64 617.53 305.64 C 614.944 305.64 615.223 303.448 615.223 303.448 L 615.223 205.132 L 596.692 205.132 C 596.692 205.132 596.83 302.884 596.83 305.301 C 596.83 307.721 597.178 309.071 597.738 310.924 C 598.299 312.784 599.487 316.662 607.809 317.732 C 616.132 318.802 619.349 314.359 619.349 314.359 L 619.349 326.281 C 619.349 326.281 619.77 327.913 617.462 327.913 C 615.154 327.913 615.223 326.396 615.223 326.396 L 615.223 321.502 L 596.55 321.502 L 596.55 326.336 C 596.55 326.336 595.43 340.059 617.182 340.059 C 638.934 340.059 638.021 327.182 638.021 327.182" id="path-1"/>
<path d="M 592.159 233.846 C 593.222 238.576 593.75 243.873 593.745 249.735 C 593.74 255.598 593.135 261.281 591.931 266.782 C 590.726 272.285 588.901 277.144 586.453 281.361 C 584.006 285.578 580.938 288.946 577.248 291.466 C 573.559 293.985 569.226 295.246 564.25 295.246 C 561.585 295.246 559.008 294.936 556.521 294.32 C 554.033 293.703 551.813 292.854 549.859 291.774 C 547.905 290.694 546.284 289.512 544.997 288.226 C 543.71 286.94 542.934 285.578 542.668 284.138 L 542.629 328.722 L 526.369 328.722 L 526.475 207.466 L 541.003 207.466 L 542.728 216.259 C 544.507 213.38 547.197 211.065 550.797 209.317 C 554.397 207.568 558.374 206.694 562.728 206.694 C 565.66 206.694 568.637 207.157 571.657 208.083 C 574.677 209.008 577.497 210.551 580.116 212.711 C 582.735 214.871 585.11 217.698 587.239 221.196 C 589.369 224.692 591.009 228.909 592.159 233.846 Z M 558.932 280.744 C 561.597 280.744 564.019 279.972 566.197 278.429 C 568.376 276.887 570.243 274.804 571.801 272.182 C 573.358 269.559 574.582 266.423 575.474 262.772 C 576.366 259.121 576.814 255.238 576.817 251.124 C 576.821 247.113 576.424 243.307 575.628 239.708 C 574.831 236.108 573.701 232.92 572.237 230.143 C 570.774 227.366 568.999 225.155 566.912 223.51 C 564.825 221.864 562.405 221.041 559.65 221.041 C 556.985 221.041 554.54 221.813 552.318 223.356 C 550.095 224.898 548.183 226.981 546.581 229.603 C 544.98 232.226 543.755 235.311 542.908 238.86 C 542.061 242.408 541.635 246.239 541.632 250.353 C 541.628 254.466 542.002 258.349 542.754 262 C 543.506 265.651 544.637 268.865 546.145 271.642 C 547.653 274.419 549.472 276.63 551.603 278.276 C 553.734 279.922 556.177 280.744 558.932 280.744 Z" style="fill: rgb(69, 132, 182); white-space: pre;"/>
</g>
</svg>

Logo file size before: 694 B; after: 8.3 KiB

11
pytest.ini Normal file

@ -0,0 +1,11 @@
[pytest]
addopts =
    # show summary of all tests that did not pass
    -ra
    # enable all warnings
    -Wd
    # coverage and html report
    --cov=waybackpy
    --cov-report=html
testpaths =
    tests

8
requirements-dev.txt Normal file

@ -0,0 +1,8 @@
click
requests
pytest
pytest-cov
codecov
flake8
mypy
black

View File

@ -1,17 +1,25 @@
import os.path
from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), "README.md")) as f:
readme_path = os.path.join(os.path.dirname(__file__), "README.md")
with open(readme_path, encoding="utf-8") as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")) as f:
version_path = os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")
with open(version_path, encoding="utf-8") as f:
exec(f.read(), about)
version = str(about["__version__"])
download_url = "https://github.com/akamhy/waybackpy/archive/{version}.tar.gz".format(
version=version
)
setup(
name=about["__title__"],
packages=["waybackpy"],
version=about["__version__"],
version=version,
description=about["__description__"],
long_description=long_description,
long_description_content_type="text/markdown",
@ -19,11 +27,17 @@ setup(
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/3.0.0.tar.gz",
download_url=download_url,
keywords=[
"Archive Website",
"Wayback Machine",
"Internet Archive",
"Wayback Machine CLI",
"Wayback Machine Python",
"Internet Archiving",
"Availability API",
"CDX API",
"savepagenow",
],
install_requires=["requests", "click"],
python_requires=">=3.4",
@ -40,6 +54,7 @@ setup(
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: Implementation :: CPython",
],
entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},
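
The setup.py change above reads the package metadata by executing `__version__.py` into a dictionary and then builds the download URL from the version string instead of hard-coding it. A minimal, self-contained sketch of that pattern follows; the helper names `read_about` and `build_download_url` and the temporary file are illustrative assumptions, not part of the repository.

```python
import os
import tempfile


def read_about(version_file):
    """Execute a __version__.py-style file and return the names it defines.

    Hypothetical helper: setup.py in the diff does this inline.
    """
    about = {}
    with open(version_file, encoding="utf-8") as f:
        exec(f.read(), about)
    # exec() injects __builtins__ into the namespace; keep only the metadata.
    about.pop("__builtins__", None)
    return about


def build_download_url(version):
    """Build the archive URL from the version, using str.format (no f-strings)."""
    return "https://github.com/akamhy/waybackpy/archive/{version}.tar.gz".format(
        version=version
    )


# Demo with a throwaway file standing in for waybackpy/__version__.py.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "__version__.py")
    with open(path, "w", encoding="utf-8") as f:
        f.write('__title__ = "waybackpy"\n__version__ = "3.0.1"\n')
    about = read_about(path)

download_url = build_download_url(str(about["__version__"]))
print(download_url)
```

Deriving the URL from `__version__` means a release only needs the version bumped in one place.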

0
tests/__init__.py Normal file


@ -0,0 +1,100 @@
import pytest
import random
import string
from datetime import datetime, timedelta
from waybackpy.availability_api import WaybackMachineAvailabilityAPI
from waybackpy.exceptions import (
InvalidJSONInAvailabilityAPIResponse,
ArchiveNotInAvailabilityAPIResponse,
)
now = datetime.utcnow()
url = "https://example.com/"
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
rndstr = lambda n: "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_oldest():
"""
Test the oldest archive of example.com and also check its attributes.
"""
url = "https://example.com/"
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
oldest = availability_api.oldest()
oldest_archive_url = oldest.archive_url
assert "2002" in oldest_archive_url
oldest_timestamp = oldest.timestamp()
assert abs(oldest_timestamp - now) > timedelta(days=7000) # More than 19 years
assert availability_api.JSON["archived_snapshots"]["closest"]["available"] is True
assert repr(oldest).find("example.com") != -1
assert "2002" in str(oldest)
def test_newest():
"""
Assuming that the most recent YouTube archive was made no earlier than
three days ago, which is 86400 * 3 seconds.
"""
url = "https://www.youtube.com/"
user_agent = "Mozilla/5.0 (X11; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0"
availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
newest = availability_api.newest()
newest_timestamp = newest.timestamp()
# betting that the latest youtube archive is not older than 3 days;
# high-traffic sites like youtube are archived many times a day, so this seems
# very reasonable to me.
assert abs(newest_timestamp - now) < timedelta(seconds=86400 * 3)
def test_invalid_json():
"""
When the API is malfunctioning or we don't pass a URL it may return invalid JSON data.
"""
with pytest.raises(InvalidJSONInAvailabilityAPIResponse):
availability_api = WaybackMachineAvailabilityAPI(url="", user_agent=user_agent)
archive_url = availability_api.archive_url
def test_no_archive():
"""
ArchiveNotInAvailabilityAPIResponse may be raised if the Wayback Machine did not
reply with the archive despite the fact that we know the site has millions
of archives. Don't know the reason for this weird behavior.
The exception is also raised if there really are no archives for the
passed URL.
"""
with pytest.raises(ArchiveNotInAvailabilityAPIResponse):
availability_api = WaybackMachineAvailabilityAPI(
url="https://%s.cn" % rndstr(30), user_agent=user_agent
)
archive_url = availability_api.archive_url
def test_no_api_call_str_repr():
"""
Some users may want to see the string representation
before they make any API requests.
str() must not return None, so we return "".
"""
availability_api = WaybackMachineAvailabilityAPI(
url="https://%s.gov" % rndstr(30), user_agent=user_agent
)
assert "" == str(availability_api)
def test_no_call_timestamp():
"""
If no API requests were made the bound timestamp() method returns
the datetime.max as a default value.
"""
availability_api = WaybackMachineAvailabilityAPI(
url="https://%s.in" % rndstr(30), user_agent=user_agent
)
assert datetime.max == availability_api.timestamp()


@ -0,0 +1,41 @@
import pytest
from datetime import datetime
from waybackpy.cdx_snapshot import CDXSnapshot
def test_CDXSnapshot():
sample_input = "org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415"
prop_values = sample_input.split(" ")
properties = {}
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
snapshot = CDXSnapshot(properties)
assert properties["urlkey"] == snapshot.urlkey
assert properties["timestamp"] == snapshot.timestamp
assert properties["original"] == snapshot.original
assert properties["mimetype"] == snapshot.mimetype
assert properties["statuscode"] == snapshot.statuscode
assert properties["digest"] == snapshot.digest
assert properties["length"] == snapshot.length
assert (
datetime.strptime(properties["timestamp"], "%Y%m%d%H%M%S")
== snapshot.datetime_timestamp
)
archive_url = (
"https://web.archive.org/web/"
+ properties["timestamp"]
+ "/"
+ properties["original"]
)
assert archive_url == snapshot.archive_url
assert sample_input == str(snapshot)

99
tests/test_cdx_utils.py Normal file

@ -0,0 +1,99 @@
import pytest
from waybackpy.exceptions import WaybackError
from waybackpy.cdx_utils import (
get_total_pages,
full_url,
get_response,
check_filters,
check_collapses,
check_match_type,
)
def test_get_total_pages():
url = "twitter.com"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.2 Safari/605.1.15"
assert get_total_pages(url=url, user_agent=user_agent) >= 56
def test_full_url():
params = {}
endpoint = "https://web.archive.org/cdx/search/cdx"
assert endpoint == full_url(endpoint, params)
params = {"a": "1"}
assert "https://web.archive.org/cdx/search/cdx?a=1" == full_url(endpoint, params)
assert "https://web.archive.org/cdx/search/cdx?a=1" == full_url(
endpoint + "?", params
)
params["b"] = 2
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2" == full_url(
endpoint + "?", params
)
params["c"] = "foo bar"
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2&c=foo%20bar" == full_url(
endpoint + "?", params
)
def test_get_response():
url = "https://github.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = get_response(url, headers=headers)
assert response.status_code == 200
url = "http/wwhfhfvhvjhmom"
with pytest.raises(WaybackError):
get_response(url, headers=headers)
def test_check_filters():
filters = []
check_filters(filters)
filters = ["statuscode:200", "timestamp:20215678901234", "original:https://url.com"]
check_filters(filters)
with pytest.raises(WaybackError):
check_filters("not-list")
with pytest.raises(WaybackError):
check_filters(["invalid"])
def test_check_collapses():
collapses = []
check_collapses(collapses)
collapses = ["timestamp:10"]
check_collapses(collapses)
collapses = ["urlkey"]
check_collapses(collapses)
collapses = "urlkey" # NOT LIST
with pytest.raises(WaybackError):
check_collapses(collapses)
collapses = ["also illegal collapse"]
with pytest.raises(WaybackError):
check_collapses(collapses)
def test_check_match_type():
assert None == check_match_type(None, "url")
match_type = "exact"
url = "test_url"
assert None == check_match_type(match_type, url)
url = "has * in it"
with pytest.raises(WaybackError):
check_match_type("domain", url)
with pytest.raises(WaybackError):
check_match_type("not a valid type", "url")
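
The assertions in `test_full_url` pin down how the query string is assembled: an empty parameter dict returns the endpoint unchanged, a trailing `?` is not duplicated, and values are percent-encoded. A sketch that satisfies those assertions is shown below. It mirrors the `full_url` body visible elsewhere in this diff, but uses `urllib.parse.quote` from the standard library in place of `requests.utils.quote`, so treat it as an approximation rather than the library's exact code.

```python
from urllib.parse import quote


def full_url(endpoint, params):
    """Append params to endpoint as a percent-encoded query string."""
    if not params:
        return endpoint.strip()
    url = endpoint if endpoint.endswith("?") else endpoint + "?"
    for key, val in params.items():
        # Any key starting with "filter" or "collapse" is sent under the
        # bare name, so several filters can share one parameter name.
        key = "filter" if key.startswith("filter") else key
        key = "collapse" if key.startswith("collapse") else key
        amp = "" if url.endswith("?") else "&"
        url += amp + "{key}={val}".format(key=key, val=quote(str(val)))
    return url


endpoint = "https://web.archive.org/cdx/search/cdx"
print(full_url(endpoint, {"a": "1", "b": 2, "c": "foo bar"}))
```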

133
tests/test_save_api.py Normal file

@ -0,0 +1,133 @@
import pytest
import time
import random
import string
from datetime import datetime
from waybackpy.save_api import WaybackMachineSaveAPI
from waybackpy.exceptions import MaximumSaveRetriesExceeded
rndstr = lambda n: "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_save():
url = "https://github.com/akamhy/waybackpy"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.save()
archive_url = save_api.archive_url
timestamp = save_api.timestamp()
headers = save_api.headers # CaseInsensitiveDict
cached_save = save_api.cached_save
assert cached_save in [True, False]
assert archive_url.find("github.com/akamhy/waybackpy") != -1
assert str(headers).find("github.com/akamhy/waybackpy") != -1
assert type(save_api.timestamp()) == type(datetime(year=2020, month=10, day=2))
def test_max_redirect_exceeded():
with pytest.raises(MaximumSaveRetriesExceeded):
url = "https://%s.gov" % rndstr(30)
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent, max_tries=3)
save_api.save()
def test_sleep():
"""
sleeping is actually very important for SaveAPI
interface stability.
The test checks that the time taken by sleep method
is as intended.
"""
url = "https://example.com"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
s_time = int(time.time())
save_api.sleep(6) # multiple of 3 sleep for 10 seconds
e_time = int(time.time())
assert (e_time - s_time) >= 10
s_time = int(time.time())
save_api.sleep(7) # sleeps for 5 seconds
e_time = int(time.time())
assert (e_time - s_time) >= 5
def test_timestamp():
url = "https://example.com"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
now = datetime.utcnow()
save_api._archive_url = (
"https://web.archive.org/web/%s/" % now.strftime("%Y%m%d%H%M%S") + url
)
save_api.timestamp()
assert save_api.cached_save is False
save_api._archive_url = "https://web.archive.org/web/%s/" % "20100124063622" + url
save_api.timestamp()
assert save_api.cached_save is True
def test_archive_url_parser():
"""
Tests the three regexes for matches and also tests the response URL fallback.
"""
url = "https://example.com"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.headers = """
START
Content-Location: /web/20201126185327/https://www.scribbr.com/citing-sources/et-al
END
"""
assert (
save_api.archive_url_parser()
== "https://web.archive.org/web/20201126185327/https://www.scribbr.com/citing-sources/et-al"
)
save_api.headers = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:40:25 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Server': 'nginx', 'X-Archive-Orig-Date': 'Sat, 02 Jan 2021 09:40:09 GMT', 'X-Archive-Orig-Transfer-Encoding': 'chunked', 'X-Archive-Orig-Connection': 'keep-alive', 'X-Archive-Orig-Vary': 'Accept-Encoding', 'X-Archive-Orig-Last-Modified': 'Fri, 01 Jan 2021 12:19:00 GMT', 'X-Archive-Orig-Strict-Transport-Security': 'max-age=31536000, max-age=0;', 'X-Archive-Guessed-Content-Type': 'text/html', 'X-Archive-Guessed-Charset': 'utf-8', 'Memento-Datetime': 'Sat, 02 Jan 2021 09:40:09 GMT', 'Link': '<https://www.scribbr.com/citing-sources/et-al/>; rel="original", <https://web.archive.org/web/timemap/link/https://www.scribbr.com/citing-sources/et-al/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.scribbr.com/citing-sources/et-al/>; rel="timegate", <https://web.archive.org/web/20200601082911/https://www.scribbr.com/citing-sources/et-al/>; rel="first memento"; datetime="Mon, 01 Jun 2020 08:29:11 GMT", <https://web.archive.org/web/20201126185327/https://www.scribbr.com/citing-sources/et-al/>; rel="prev memento"; datetime="Thu, 26 Nov 2020 18:53:27 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="last memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'X-Archive-Src': 'spn2-20210102092956-wwwb-spn20.us.archive.org-8001.warc.gz', 'Server-Timing': 'captures_list;dur=112.646325, exclusion.robots;dur=0.172010, exclusion.robots.policy;dur=0.158205, RedisCDXSource;dur=2.205932, esindex;dur=0.014647, 
LoadShardBlock;dur=82.205012, PetaboxLoader3.datanode;dur=70.750239, CDXLines.iter;dur=24.306278, load_resource;dur=26.520179', 'X-App-Server': 'wwwb-app200', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/IN', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0', 'Content-Encoding': 'gzip'}
"""
assert (
save_api.archive_url_parser()
== "https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/"
)
save_api.headers = """
START
X-Cache-Key: https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al/US
END
"""
assert (
save_api.archive_url_parser()
== "https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al/"
)
save_api.headers = "TEST TEST TEST AND NO MATCH - TEST FOR RESPONSE URL MATCHING"
save_api.response_url = "https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al"
assert (
save_api.archive_url_parser()
== "https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al"
)
def test_archive_url():
"""
Checks the attribute archive_url's value when the save method was not
explicitly invoked by the end-user but the save method was invoked implicitly
by the archive_url method which is an attribute due to @property.
"""
url = "https://example.com"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.saved_archive = (
"https://web.archive.org/web/20220124063056/https://example.com/"
)
assert save_api.archive_url == save_api.saved_archive

13
tests/test_utils.py Normal file

@ -0,0 +1,13 @@
from waybackpy.utils import latest_version, DEFAULT_USER_AGENT
from waybackpy.__version__ import __version__
def test_default_user_agent():
assert (
DEFAULT_USER_AGENT
== "waybackpy %s - https://github.com/akamhy/waybackpy" % __version__
)
def test_latest_version():
assert __version__ == latest_version(package_name="waybackpy")


@ -4,8 +4,8 @@ __description__ = (
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "3.0.0"
__author__ = "akamhy"
__version__ = "3.0.1"
__author__ = "Akash Mahanty"
__author_email__ = "akamhy@yahoo.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020-2022 Akash Mahanty et al."


@ -1,57 +1,95 @@
import re
import time
import json
import requests
from datetime import datetime
from .__version__ import __version__
from .utils import DEFAULT_USER_AGENT
def full_url(endpoint, params):
if not params:
return endpoint.strip()
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = (
full_url
+ amp
+ "{key}={val}".format(key=key, val=requests.utils.quote(str(val)))
)
return full_url
from .exceptions import (
ArchiveNotInAvailabilityAPIResponse,
InvalidJSONInAvailabilityAPIResponse,
)
class WaybackMachineAvailabilityAPI:
def __init__(self, url, user_agent=DEFAULT_USER_AGENT):
"""
Class that interfaces the availability API of the Wayback Machine.
"""
def __init__(self, url, user_agent=DEFAULT_USER_AGENT, max_tries=3):
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = user_agent
self.headers = {"User-Agent": self.user_agent}
self.payload = {"url": "{url}".format(url=self.url)}
self.endpoint = "https://archive.org/wayback/available"
self.max_tries = max_tries
self.tries = 0
self.last_api_call_unix_time = int(time.time())
self.api_call_time_gap = 5
self.JSON = None
def unix_timestamp_to_wayback_timestamp(self, unix_timestamp):
"""
Converts Unix time to wayback Machine timestamp.
"""
return datetime.utcfromtimestamp(int(unix_timestamp)).strftime("%Y%m%d%H%M%S")
def __repr__(self):
return str(self) # self.__str__()
"""
Same as string representation, just return the archive URL as a string.
"""
return str(self)
def __str__(self):
"""
String representation of the class. If at least one API call was successfully
made then return the archive URL as a string. Else return an empty string.
"""
# String must not return anything other than a string object
# So, if some asks for string repr before making the API requests
# just return ""
if not self.JSON:
return None
return ""
return self.archive_url
def json(self):
self.request_url = full_url(self.endpoint, self.payload)
self.response = requests.get(self.request_url, self.headers)
self.JSON = self.response.json()
"""
Makes the API call to the availability API, sets the JSON response
as the JSON attribute of the instance and also returns the JSON attribute.
"""
time_diff = int(time.time()) - self.last_api_call_unix_time
sleep_time = self.api_call_time_gap - time_diff
if sleep_time > 0:
time.sleep(sleep_time)
self.response = requests.get(
self.endpoint, params=self.payload, headers=self.headers
)
self.last_api_call_unix_time = int(time.time())
self.tries += 1
try:
self.JSON = self.response.json()
except json.decoder.JSONDecodeError:
raise InvalidJSONInAvailabilityAPIResponse(
"Response data:\n{text}".format(text=self.response.text)
)
return self.JSON
def timestamp(self):
if not self.JSON["archived_snapshots"] or not self.JSON:
"""
Converts the timestamp from the JSON response to a datetime object.
If the JSON attribute of the instance is None it implies that either
the last API call failed or one was never made.
If there is no JSON, or there is JSON but no timestamp in the response, then
returns the maximum possible value for a datetime object.
If you get a URL as a response from the availability API it is guaranteed
that you can get the datetime object from the timestamp.
"""
if not self.JSON or not self.JSON["archived_snapshots"]:
return datetime.max
return datetime.strptime(
@ -60,10 +98,37 @@ class WaybackMachineAvailabilityAPI:
@property
def archive_url(self):
"""
Reads the JSON response data and tries to get the archive URL. Returns
the archive URL if found, else raises ArchiveNotInAvailabilityAPIResponse.
"""
data = self.JSON
if not data["archived_snapshots"]:
archive_url = None
# If the user didn't use oldest, newest or near but tries to access the
# archive_url attribute, we assume they are fine with any archive
# and invoke the oldest archive function.
if not data:
self.oldest()
# If data is still empty then there are probably no
# archives for the requested URL.
if not data or not data["archived_snapshots"]:
while (self.tries < self.max_tries) and (
not data or not data["archived_snapshots"]
):
self.json() # It makes a new API call
data = self.JSON # json() updated the value of JSON attribute
# If there is still no archive after we have exhausted max_tries,
# we give up and raise an exception.
if not data or not data["archived_snapshots"]:
raise ArchiveNotInAvailabilityAPIResponse(
"Archive not found in the availability "
+ "API response, the URL you requested may not have any "
+ "archives yet. You may retry after some time or archive the webpage now."
+ "\nResponse data:\n{response}".format(response=self.response.text)
)
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
@ -72,15 +137,29 @@ class WaybackMachineAvailabilityAPI:
return archive_url
def wayback_timestamp(self, **kwargs):
"""
Prepends zero before the year, month, day, hour and minute so that they
are conformable with the YYYYMMDDhhmmss wayback machine timestamp format.
"""
return "".join(
str(kwargs[key]).zfill(2)
for key in ["year", "month", "day", "hour", "minute"]
)
def oldest(self):
"""
Passing the year 1994 should return the oldest archive because
wayback machine was started in May, 1996 and there should be no archive
before the year 1994.
"""
return self.near(year=1994)
def newest(self):
"""
Passing the current UNIX time should be sufficient to get the newest
archive considering the API request-response time delay and also the
database lags on Wayback machine.
"""
return self.near(unix_timestamp=int(time.time()))
def near(
@ -92,6 +171,16 @@ class WaybackMachineAvailabilityAPI:
minute=None,
unix_timestamp=None,
):
"""
The main method for this Class, oldest and newest methods are dependent on this
method.
It generates the timestamp based on the input either by calling the
unix_timestamp_to_wayback_timestamp or wayback_timestamp method with
appropriate arguments for their respective parameters.
Adds the timestamp to the payload dictionary.
Finally, it invokes the json method to make the API call and returns the instance.
"""
if unix_timestamp:
timestamp = self.unix_timestamp_to_wayback_timestamp(unix_timestamp)
else:
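
The docstrings above describe two ways `near` builds its timestamp: converting a Unix time to the Wayback Machine's digit-string format, or zero-padding individual date parts. The following sketch reproduces both conversions as free functions. The function names are illustrative, and `datetime.fromtimestamp(..., tz=timezone.utc)` is used here as an assumption-free stand-in for `datetime.utcfromtimestamp`, which is deprecated in recent Python releases.

```python
from datetime import datetime, timezone


def unix_to_wayback(unix_timestamp):
    """Convert Unix time to a 14-digit YYYYMMDDhhmmss string in UTC."""
    moment = datetime.fromtimestamp(int(unix_timestamp), tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


def wayback_timestamp(**kwargs):
    """Zero-pad year, month, day, hour and minute and join them.

    As in the diff, seconds are not included, so the result has 12 digits.
    """
    return "".join(
        str(kwargs[key]).zfill(2)
        for key in ["year", "month", "day", "hour", "minute"]
    )


print(unix_to_wayback(86400))  # 19700102000000
print(wayback_timestamp(year=2022, month=1, day=25, hour=9, minute=5))  # 202201250905
```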


@ -6,26 +6,32 @@ from .cdx_utils import (
check_filters,
check_collapses,
check_match_type,
full_url,
)
from .utils import DEFAULT_USER_AGENT
class WaybackMachineCDXServerAPI:
"""
Class that interfaces the CDX server API of the Wayback Machine.
"""
def __init__(
self,
url,
user_agent=None,
start_timestamp=None,
end_timestamp=None,
user_agent=DEFAULT_USER_AGENT,
start_timestamp=None, # from, can not use from as it's a keyword
end_timestamp=None, # to, not using to as can not use from
filters=[],
match_type=None,
gzip=None,
collapses=[],
limit=None,
max_tries=3,
):
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = str(user_agent) if user_agent else DEFAULT_USER_AGENT
self.user_agent = user_agent
self.start_timestamp = str(start_timestamp) if start_timestamp else None
self.end_timestamp = str(end_timestamp) if end_timestamp else None
self.filters = filters
@ -36,6 +42,7 @@ class WaybackMachineCDXServerAPI:
self.collapses = collapses
check_collapses(self.collapses)
self.limit = limit if limit else 5000
self.max_tries = max_tries
self.last_api_request_url = None
self.use_page = False
self.endpoint = "https://web.archive.org/cdx/search/cdx"
@ -43,16 +50,15 @@ class WaybackMachineCDXServerAPI:
def cdx_api_manager(self, payload, headers, use_page=False):
total_pages = get_total_pages(self.url, self.user_agent)
# If we only have two or less pages of archives then we care for accuracy
# pagination API can be lagged sometimes
if use_page == True and total_pages >= 2:
# If we only have two or less pages of archives then we care for more accuracy
# pagination API is lagged sometimes
if use_page is True and total_pages >= 2:
blank_pages = 0
for i in range(total_pages):
payload["page"] = str(i)
url, res = get_response(
self.endpoint, params=payload, headers=headers, return_full_url=True
)
url = full_url(self.endpoint, params=payload)
res = get_response(url, headers=headers)
self.last_api_request_url = url
text = res.text
@ -75,9 +81,8 @@ class WaybackMachineCDXServerAPI:
if resumeKey:
payload["resumeKey"] = resumeKey
url, res = get_response(
self.endpoint, params=payload, headers=headers, return_full_url=True
)
url = full_url(self.endpoint, params=payload)
res = get_response(url, headers=headers)
self.last_api_request_url = url
@ -105,7 +110,7 @@ class WaybackMachineCDXServerAPI:
if self.end_timestamp:
payload["to"] = self.end_timestamp
if self.gzip != True:
if self.gzip is not True:
payload["gzip"] = "false"
if self.match_type:
@ -165,10 +170,14 @@ class WaybackMachineCDXServerAPI:
if prop_values_len != properties_len:
raise WaybackError(
"Snapshot returned by Cdx API has {prop_values_len} properties instead of expected {properties_len} properties.\nInvolved Snapshot : {snapshot}".format(
prop_values_len=prop_values_len,
properties_len=properties_len,
snapshot=snapshot,
"Snapshot returned by Cdx API has {prop_values_len} properties".format(
prop_values_len=prop_values_len
)
+ " instead of expected {properties_len} ".format(
properties_len=properties_len
)
+ "properties.\nProblematic Snapshot : {snapshot}".format(
snapshot=snapshot
)
)


@ -2,6 +2,14 @@ from datetime import datetime
class CDXSnapshot:
"""
Class for the CDX snapshot lines returned by the CDX API,
Each valid line of the CDX API is cast to a CDXSnapshot object
by the CDX API interface.
This provides the end-user the ease of using the data as attributes
of the CDXSnapshot.
"""
def __init__(self, properties):
self.urlkey = properties["urlkey"]
self.timestamp = properties["timestamp"]


@ -3,16 +3,16 @@ import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .exceptions import WaybackError
from .utils import DEFAULT_USER_AGENT
def get_total_pages(url, user_agent):
request_url = (
"https://web.archive.org/cdx/search/cdx?url={url}&showNumPages=true".format(
url=url
)
)
def get_total_pages(url, user_agent=DEFAULT_USER_AGENT):
endpoint = "https://web.archive.org/cdx/search/cdx?"
payload = {"showNumPages": "true", "url": str(url)}
headers = {"User-Agent": user_agent}
return int((requests.get(request_url, headers=headers).text).strip())
request_url = full_url(endpoint, params=payload)
response = get_response(request_url, headers=headers)
return int(response.text.strip())
def full_url(endpoint, params):
@ -32,47 +32,29 @@ def full_url(endpoint, params):
def get_response(
endpoint,
params=None,
url,
headers=None,
return_full_url=False,
retries=5,
backoff_factor=0.5,
no_raise_on_redirects=False,
):
s = requests.Session()
session = requests.Session()
retries = Retry(
total=retries,
backoff_factor=backoff_factor,
status_forcelist=[500, 502, 503, 504],
)
s.mount("https://", HTTPAdapter(max_retries=retries))
# The URL with parameters required for the get request
url = full_url(endpoint, params)
session.mount("https://", HTTPAdapter(max_retries=retries))
try:
if not return_full_url:
return s.get(url, headers=headers)
return (url, s.get(url, headers=headers))
response = session.get(url, headers=headers)
session.close()
return response
except Exception as e:
reason = str(e)
if no_raise_on_redirects:
if "Exceeded 30 redirects" in reason:
return
exc_message = "Error while retrieving {url}.\n{reason}".format(
url=url, reason=reason
)
exc = WaybackError(exc_message)
exc.__cause__ = e
raise exc
@ -91,8 +73,8 @@ def check_filters(filters):
_filter,
)
key = match.group(1)
val = match.group(2)
match.group(1)
match.group(2)
except Exception:
@ -118,19 +100,9 @@ def check_collapses(collapses):
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?",
collapse,
)
field = match.group(1)
N = None
match.group(1)
if 2 == len(match.groups()):
N = match.group(2)
if N:
if not (field + N == collapse):
raise Exception
else:
if not (field == collapse):
raise Exception
match.group(2)
except Exception:
exc_message = "collapse argument '{collapse}' is not following the cdx collapse syntax.".format(
collapse=collapse
@ -143,7 +115,9 @@ def check_match_type(match_type, url):
return
if "*" in url:
raise WaybackError("Can not use wildcard with match_type argument")
raise WaybackError(
"Can not use wildcard in the URL along with the match_type arguments."
)
legal_match_type = ["exact", "prefix", "host", "domain"]
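
In the `check_collapses` hunk above, the explicit comparison of the matched text against the whole collapse string was removed. One possible way to keep whole-string validation is `re.fullmatch`, sketched below with the same pattern that appears in the diff. This is only an illustration of the technique: `is_valid_collapse` is a hypothetical helper, and the library itself raises `WaybackError` rather than returning a boolean.

```python
import re

# Same pattern as in the diff: a CDX field name, optionally followed by
# an (optionally colon-prefixed) number of digits.
COLLAPSE_RE = re.compile(
    r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?"
)


def is_valid_collapse(collapse):
    """Return True only if the entire string follows the collapse syntax."""
    if not isinstance(collapse, str):
        return False
    # fullmatch rejects strings that merely contain a valid prefix.
    return COLLAPSE_RE.fullmatch(collapse) is not None


for candidate in ["urlkey", "timestamp:10", "also illegal collapse", "urlkeyXYZ"]:
    print(candidate, is_valid_collapse(candidate))
```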


@ -10,6 +10,8 @@ class WaybackError(Exception):
Raised when Waybackpy can not return what you asked for.
1) Wayback Machine API Service is unreachable/down.
2) You passed illegal arguments.
All other exceptions are inherited from this class.
"""
@ -36,3 +38,15 @@ class MaximumSaveRetriesExceeded(MaximumRetriesExceeded):
"""
MaximumSaveRetriesExceeded
"""
class ArchiveNotInAvailabilityAPIResponse(WaybackError):
"""
Could not parse the archive in the JSON response of the availability API.
"""
class InvalidJSONInAvailabilityAPIResponse(WaybackError):
"""
availability api returned invalid JSON
"""


@ -31,6 +31,11 @@ class WaybackMachineSaveAPI:
@property
def archive_url(self):
"""
Returns the archive URL if it is already cached in _archive_url,
else invokes the save method to save the archive. The save method returns
the archive URL, so we return the method's return value.
"""
if self._archive_url:
return self._archive_url
@ -38,7 +43,21 @@ class WaybackMachineSaveAPI:
return self.save()
def get_save_request_headers(self):
"""
Creates a session and tries 'retries' number of times to
retrieve the archive.
If successful in getting the response, sets the headers, status_code
and response_url attributes.
The archive is usually in the headers but it can also be the response URL
as the Wayback Machine redirects to the archive after a successful capture
of the webpage.
Wayback Machine's save API is known
to be very unreliable thus if it fails first check opening
the response URL yourself in the browser.
"""
session = requests.Session()
retries = Retry(
total=self.total_save_retries,
@ -47,11 +66,19 @@ class WaybackMachineSaveAPI:
)
session.mount("https://", HTTPAdapter(max_retries=retries))
self.response = session.get(self.request_url, headers=self.request_headers)
self.headers = self.response.headers
self.headers = (
self.response.headers
) # <class 'requests.structures.CaseInsensitiveDict'>
self.status_code = self.response.status_code
self.response_url = self.response.url
session.close()
def archive_url_parser(self):
"""
Three regexen (like oxen?) are used to search for the
archive URL in the headers and finally look in the response URL
for the archive URL.
"""
regex1 = r"Content-Location: (/web/[0-9]{14}/.*)"
match = re.search(regex1, str(self.headers))
@ -66,7 +93,7 @@ class WaybackMachineSaveAPI:
regex3 = r"X-Cache-Key:\shttps(.*)[A-Z]{2}"
match = re.search(regex3, str(self.headers))
if match:
return "https://" + match.group(1)
return "https" + match.group(1)
if self.response_url:
self.response_url = self.response_url.strip()
@ -77,6 +104,14 @@ class WaybackMachineSaveAPI:
return "https://" + match.group(0)
def sleep(self, tries):
"""
Ensure that we wait some time before successive retries so that we
don't waste the retries before the page is even captured by the Wayback
Machine crawlers. This also ensures that we are not putting too much load on
the Wayback Machine's save API.
If tries are multiple of 3 sleep 10 seconds else sleep 5 seconds.
"""
sleep_seconds = 5
if tries % 3 == 0:
@ -84,8 +119,20 @@ class WaybackMachineSaveAPI:
time.sleep(sleep_seconds)
def timestamp(self):
"""
Read the timestamp off the archive URL and convert the Wayback Machine
timestamp to datetime object.
Also compare the time on the archive URL with the instance birth
time.
If the time on the archive is older than the instance creation time, set cached_save
to True, else set it to False. The flag can be used to check whether the Wayback Machine
served a cached archive. It is quite common for the Wayback Machine to serve a
cached archive if the last archive was captured within the last 45 minutes.
"""
m = re.search(
r"https?://web.archive.org/web/([0-9]{14})/http", self._archive_url
r"https?://web\.archive\.org/web/([0-9]{14})/http", self._archive_url
)
string_timestamp = m.group(1)
timestamp = datetime.strptime(string_timestamp, "%Y%m%d%H%M%S")
@ -101,8 +148,15 @@ class WaybackMachineSaveAPI:
return timestamp
def save(self):
"""
Calls the SavePageNow API of the Wayback Machine with required parameters
and headers to save the URL.
saved_archive = None
Raises MaximumSaveRetriesExceeded if the maximum retries are exhausted and
we were still unable to retrieve the archive from the Wayback Machine.
"""
self.saved_archive = None
tries = 0
while True:
@ -111,21 +165,22 @@ class WaybackMachineSaveAPI:
if tries >= self.max_tries:
raise MaximumSaveRetriesExceeded(
"Tried %s times but failed to save and return the archive for %s.\nResponse URL:\n%s \nResponse Header:\n%s\n"
% (str(tries), self.url, self.response_url, str(self.headers)),
"Tried %s times but failed to save and retrieve the" % str(tries)
+ " archive for %s.\nResponse URL:\n%s \nResponse Header:\n%s\n"
% (self.url, self.response_url, str(self.headers)),
)
if not saved_archive:
if not self.saved_archive:
if tries > 1:
self.sleep(tries)
self.get_save_request_headers()
saved_archive = self.archive_url_parser()
self.saved_archive = self.archive_url_parser()
if not saved_archive:
if not self.saved_archive:
continue
else:
self._archive_url = saved_archive
self._archive_url = self.saved_archive
self.timestamp()
return saved_archive
return self.saved_archive
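
The `timestamp` docstring above describes reading the 14-digit capture time out of the archive URL and comparing it with the time the instance was created. A standalone sketch of that idea follows. The regex matches the one in the diff with every dot escaped; `archive_timestamp` and `looks_cached` are hypothetical names, and the plain "older than birth time" comparison is an assumption, since the comparison code itself is not shown in this diff.

```python
import re
from datetime import datetime

ARCHIVE_URL_RE = re.compile(r"https?://web\.archive\.org/web/([0-9]{14})/http")


def archive_timestamp(archive_url):
    """Extract the capture time from a Wayback Machine archive URL."""
    match = ARCHIVE_URL_RE.search(archive_url)
    if match is None:
        raise ValueError("no 14-digit timestamp found in %r" % archive_url)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


def looks_cached(archive_url, instance_birth_time):
    """Assumed rule: a capture older than the instance is a cached save."""
    return archive_timestamp(archive_url) < instance_birth_time


birth = datetime(2022, 1, 25, 0, 0, 0)
old = "https://web.archive.org/web/20100124063622/https://example.com"
fresh = "https://web.archive.org/web/20220125000001/https://example.com"
print(looks_cached(old, birth), looks_cached(fresh, birth))  # True False
```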


@ -4,8 +4,9 @@ from .__version__ import __version__
DEFAULT_USER_AGENT = "waybackpy %s - https://github.com/akamhy/waybackpy" % __version__
def latest_version(package_name, headers):
def latest_version(package_name, user_agent=DEFAULT_USER_AGENT):
request_url = "https://pypi.org/pypi/" + package_name + "/json"
headers = {"User-Agent": user_agent}
response = requests.get(request_url, headers=headers)
data = response.json()
return data["info"]["version"]


@ -2,9 +2,21 @@ from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI
from .utils import DEFAULT_USER_AGENT
from .exceptions import WaybackError
from datetime import datetime, timedelta
"""
The Url class is no longer recommended; use WaybackMachineSaveAPI,
WaybackMachineAvailabilityAPI and WaybackMachineCDXServerAPI instead.
The reason it is still in the code is backwards compatibility with 2.x.x versions.
If you were using the Url class before the update to version 3.x.x, your code should still
work fine and there is no hurry to update the interface, but it is recommended that you
do not use the Url class for new code as it will be removed after 2025. The first
3.x.x version was released in January 2022, and three years are more than enough to update
the older interface code.
"""
class Url:
def __init__(self, url, user_agent=DEFAULT_USER_AGENT):