Compare commits

...

70 Commits
2.4.3 ... 3.0.1

Author SHA1 Message Date
9afe29a819 Merge pull request #119 from akamhy/akamhy-patch-1
v3.0.0 --> v3.0.1
2022-01-25 19:54:01 +05:30
d79b10c74c v3.0.0 --> v3.0.1 2022-01-25 19:52:10 +05:30
32314dc102 Merge branch 'build-test' #118
Add build test to CI
 see #117
2022-01-25 14:02:36 +05:30
50e176e2ba .github/workflows/build_test.yml : change python versions from '3.4', '3.8', '3.10' to '3.6', '3.10' as 3.4 not found by GitHub. 2022-01-25 13:56:49 +05:30
4007859c92 Install dependencies for build test in CI : setuptools wheel 2022-01-25 13:35:58 +05:30
d8bd6c628d Add build test to CI 2022-01-25 13:30:16 +05:30
28f6ff8df2 Merge pull request #116 from akamhy/patch-setup-py
Fix syntax for opening the README.md and __version__.py
2022-01-25 13:11:33 +05:30
7ac9353f74 Fix syntax for opening the README.md and __version__.py
For some reason updates made at https://github.com/akamhy/waybackpy/pull/114
are breaking the build using setup, caught while deploying to a cloud service
provider.

The exact error is:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-req-build-n3b9e5pj/setup.py", line 5
  os.path.join(os.path.dirname(__file__), README.md), encoding=utf-8),
                                                                                ^
SyntaxError: invalid syntax
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

See also :
https://github.com/conda-forge/staged-recipes/pull/17634
2022-01-25 13:05:01 +05:30
15c7244a22 Merge pull request #115 from akamhy/akamhy-patch-1
do not use f-strings in setup.py
2022-01-25 10:42:27 +05:30
8510210e94 do not use f-strings in setup.py
These are not supported in <Python 3.6 version of the cpython.
2022-01-25 10:34:46 +05:30
552967487e Merge pull request #114 from rafaelrdealmeida/patch-1
Update setup.py

See also <https://github.com/akamhy/waybackpy/issues/111#issuecomment-1020673814>
2022-01-25 10:30:34 +05:30
86a90a3840 Update setup.py
pep8
2022-01-24 22:03:28 -03:00
759874cdc6 Update setup.py
see: https://github.com/akamhy/waybackpy/issues/111#issuecomment-1020673814
2022-01-24 21:23:31 -03:00
06095202fe BUG FIX : forgot to use the endpoint from the instance and also assign payload to param. Bug caught by the flake8 in the CI tests. 2022-01-24 23:35:48 +05:30
06fc7855bf waybackpy/cdx_api.py : deafult user agent is now DEFAULT_USER_AGENT, get_response now take url and headers as arguments and request url is generated by full_url function. max_tries added as parameter for the WaybackMachineCDXServerAPI class with default value of 3. 2022-01-24 23:20:49 +05:30
c49fe971fd update the older deprecation not for Url class, the newer date is now 2025 instead of 2024. 2022-01-24 23:15:59 +05:30
d6783d5525 added tests for cdx_utils.py 2022-01-24 23:05:47 +05:30
9262f5da21 improve functions get_total_pages, get_response and lint check_filters, check_collapses and check_match_type
get_total_pages : default user agent is now DEFAULT_USER_AGENT
                  and now instead of str formatting passing payload
                  as param to full_url to generate the request url
                  also get_response make the request instead of directly
                  using requests.get()

get_response : get_response is now not taking param as keyword arguments
               instead the invoker is supposed to pass the full url which
               may be generated by the full_url function therefore the return_full_url=False,
               is deprecated also.
               Also now closing the session via session.close()
               No need to check 'Exceeded 30 redirects' as save API uses a
               diffrent method.

check_filters : Not assigning to variables the return of match groups
                beacause we wont be using them and the linter picks these
                unused assignments.

check_collapses : Same reason as for check_filters but also removed a foolish
                  test that checks equality with objects that are guaranteed
                  to be same.

check_match_type : Updated the text that of WaybackError
2022-01-24 22:57:20 +05:30
d1a1cf2546 added tests for utils.py at tests/test_utils.py also changed a keyword argument from headers to user_agent for latest_version of utils.py with the usual default vaule. 2022-01-24 17:50:36 +05:30
cd8a32ed1f added tests for cdx_snapshot.py at tests/test_cdx_snapshot.py 2022-01-24 16:29:44 +05:30
57512c65ff change test oldest method from google.com to example.com, the oldest on google is for some unknown reason is not very stable. 2022-01-24 16:27:35 +05:30
d9ea26e11c added code style black badge 2022-01-24 13:46:31 +05:30
2bea92b348 fix bug with the third matching case of the archive_url_parser, caught while writing more tests fo the save API interface. 2022-01-24 13:31:30 +05:30
d506685f68 added some tests for save_api interface 2022-01-23 18:35:54 +05:30
7844d15d99 close the session in save api interface 2022-01-23 18:34:06 +05:30
c0252edff2 updated tests for availability_api.py and also added max_tries(default value is 3) with delay (sleep) between successive API calls. The dealy actually improves the performace of the availability_api interface. 2022-01-23 15:05:10 +05:30
e7488f3a3e added test badge, rename test to Tests from ubuntu and fix the Incomplete URL substring sanitization(or trying to) 2022-01-23 02:26:53 +05:30
aed75ad1db Make modules imprtable as part of a Python package, waybackpy by creating __init__.py file in tests 2022-01-23 02:14:38 +05:30
d740959c34 more dev reqs 2022-01-23 02:10:12 +05:30
2d83043ef7 + flake8 in requirements-dev.txt 2022-01-23 02:05:08 +05:30
31b1056217 fix typo in CI 2022-01-23 02:03:30 +05:30
97712b2c1e add CI unit_test.yml 2022-01-23 02:00:15 +05:30
a8acc4c4d8 Fix Incomplete URL substring sanitization in the last commit. 2022-01-23 01:42:48 +05:30
1bacd73002 created pytest.ini, added test for waybackpy/availability_api.py, new exceptions all of which inherit from the main WaybackError and created requirements-dev.txt 2022-01-23 01:29:07 +05:30
79901ba968 updated README.md 2022-01-22 03:08:26 +05:30
df64e839d7 added trove classifiers for python 3.10 2022-01-22 00:57:10 +05:30
405e9a2a79 waybackpy/save_api.py : Added doc strings and also lint with black. 2022-01-22 00:41:10 +05:30
db551abbf6 lint waybackpy/cdx_api.py and added some doc strings 2022-01-22 00:11:35 +05:30
d13dd4db1a added notice on waybackpy/wrapper.py that the Url class will cease to exist after 2024-01-01 and also removed unused imports. 2022-01-21 23:14:20 +05:30
d3bb8337a1 make setup.py smarter, now no need to update the URL again and also added more keywords. And in __version__.py updated the __author__ 2022-01-21 23:01:09 +05:30
fd5e85420c waybackpy/availability_api.py : removed unused imports, added doc strings, removed redundant function. 2022-01-21 22:47:44 +05:30
5c685ef5d7 upload logo and make p path not text
I was dumb to forget to convert the p to path.
2022-01-21 21:11:42 +05:30
6a3d96b453 Logo (#113)
* Create logo.txt

* Delete waybackpy_logo.svg

* Add files via upload

* Delete logo.txt
2022-01-21 21:02:38 +05:30
afe1b15a5f Add files via upload 2022-01-21 20:58:53 +05:30
4fd9d142e7 Merge pull request #112 from akamhy/fix
escape '.' before 'archive.org'
2022-01-21 19:52:55 +05:30
5e9fdb40ce escape '.' before 'archive.org'
escape '.' before 'archive.org' on line 88 so it does not match more hosts than expected.
2022-01-21 19:51:08 +05:30
fa72098270 _get_response is not used anymore
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.
2022-01-21 19:43:35 +05:30
d18f955044 date year range 2020-2022 2022-01-21 11:55:42 +05:30
9c340d6967 Create codeql-analysis.yml 2022-01-21 11:12:59 +05:30
78d0e0c126 Update README.md 2022-01-21 09:54:04 +05:30
564101e6f5 🐳 for docker image 2022-01-21 01:23:05 +05:30
de5a3e1561 improve usage code 2022-01-18 21:18:17 +05:30
52e46fecc2 more usage example 2022-01-18 20:58:39 +05:30
3b6415abc7 updating examples 2022-01-18 20:44:47 +05:30
66e16d6d89 define __repr__ for the Availability API class 2022-01-18 20:34:21 +05:30
16b9bdd7f9 output the file name if known_url and file flag are passed. 2022-01-18 20:14:44 +05:30
7adc01bff2 implement known_urls for cli from the newer interface. Although use of CDX is recommended but backward-compatibility matters. 2022-01-18 20:07:12 +05:30
9bbd056268 Update README.md 2022-01-17 02:15:38 +05:30
2ab44391cf close #107, added link to SecSI/Docker image 2022-01-16 23:01:31 +05:30
cc3628ae18 define __str__ for objects of WaybackMachineAvailabilityAPI class, the check for self.JSON ensures that the API was atleast called. 2022-01-16 22:28:12 +05:30
1d751b942b invoke json, was a bad idea removing it the earlier commit as the end user should not have to call it 2022-01-16 22:15:25 +05:30
261a867a21 near() method of WaybackMachineAvailabilityAPI return self to preserve past behaviour 2022-01-16 21:53:54 +05:30
2e487e88d3 define __len__ on Url objects, if any method not used prior to len op then default to len of oldest archive. 2022-01-16 21:29:43 +05:30
c8d0ad493a defined __str__ for Url objects, print func should print the url. 2022-01-16 21:22:43 +05:30
ce869177fd Merge pull request #103 from akamhy/whitesource/configure
Configure WhiteSource Bolt for GitHub
2022-01-02 16:04:15 +05:30
58616fb986 Add .whitesource configuration file 2022-01-02 08:45:07 +00:00
4e68cd5743 Create separate module for the 3 different APIs also CDX is now CLI supported. 2022-01-02 14:14:45 +05:30
a7b805292d changes made for v2.4.4 (update download_url) (#100)
* v2.4.4 (update download_url)

* v2.4.4 (update __version__)

* +1

add jonasjancarik
2021-09-03 11:28:26 +05:30
6dc6124dc4 Raise error on a 509 response (too many sessions) (#99)
* Raise error on a 509 response (too many sessions)

When the response code is 509, raise an error with an explanation (based on the actual error message contained in the response HTML).

* Raise error on a 509 response (too many sessions) - linting
2021-09-03 08:04:36 +05:30
5a7fc7d568 Fix typo (#95) 2021-04-13 16:58:34 +05:30
35 changed files with 1762 additions and 2421 deletions

30
.github/workflows/build_test.yml vendored Normal file
View File

@ -0,0 +1,30 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: Build
on:
push:
branches: [ master ]
pull_request:
branches: [ master ]
jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.6', '3.10']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools wheel
- name: Build test the package
run: |
python setup.py sdist bdist_wheel

70
.github/workflows/codeql-analysis.yml vendored Normal file
View File

@ -0,0 +1,70 @@
# For most projects, this workflow file will not need changing; you simply need
# to commit it to your repository.
#
# You may wish to alter this file to override the set of languages analyzed,
# or to provide custom queries or build logic.
#
# ******** NOTE ********
# We have attempted to detect the languages in your repository. Please check
# the `language` matrix defined below to confirm you have the correct set of
# supported CodeQL languages.
#
name: "CodeQL"
on:
push:
branches: [ master ]
pull_request:
# The branches below must be a subset of the branches above
branches: [ master ]
schedule:
- cron: '30 6 * * 1'
jobs:
analyze:
name: Analyze
runs-on: ubuntu-latest
permissions:
actions: read
contents: read
security-events: write
strategy:
fail-fast: false
matrix:
language: [ 'python' ]
# CodeQL supports [ 'cpp', 'csharp', 'go', 'java', 'javascript', 'python', 'ruby' ]
# Learn more about CodeQL language support at https://git.io/codeql-language-support
steps:
- name: Checkout repository
uses: actions/checkout@v2
# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@v1
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
# By default, queries listed here will override any specified in a config file.
# Prefix the list here with "+" to use these queries and those in the config file.
# queries: ./path/to/local/query, your-org/your-repo/queries@main
# Autobuild attempts to build any compiled languages (C/C++, C#, or Java).
# If this step fails, then you should remove it and run the build manually (see below)
- name: Autobuild
uses: github/codeql-action/autobuild@v1
# Command-line programs to run using the OS shell.
# 📚 https://git.io/JvXDl
# ✏️ If the Autobuild fails above, remove it and uncomment the following three lines
# and modify them (or add more) to build your code if your project
# uses a compiled language
#- run: |
# make bootstrap
# make release
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@v1

View File

@ -1,7 +1,7 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: CI
name: Tests
on:
push:
@ -15,8 +15,7 @@ jobs:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ['3.8']
python-version: ['3.9']
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
@ -26,17 +25,20 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
python -m pip install flake8 pytest codecov pytest-cov
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
if [ -f requirements-dev.txt ]; then pip install -r requirements-dev.txt; fi
- name: Lint with flake8
run: |
# stop the build if there are Python syntax errors or undefined names
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
flake8 waybackpy/ --count --select=E9,F63,F7,F82 --show-source --statistics
# exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
# flake8 waybackpy/ --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics --per-file-ignores="waybackpy/__init__.py:F401"
# - name: Static type test with mypy
# run: |
# mypy
- name: Test with pytest
run: |
pytest --cov=waybackpy tests/
- name: Upload coverage to Codecov
run: |
bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}
pytest
# - name: Upload coverage to Codecov
# run: |
# bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

View File

@ -1,4 +0,0 @@
# File : .pep8speaks.yml
scanner:
diff_only: True # If True, errors caused by only the patch are shown

View File

@ -1,5 +0,0 @@
# autogenerated pyup.io config file
# see https://pyup.io/docs/configuration/ for all available options
schedule: ''
update: false

View File

@ -1,6 +1,10 @@
{
"scanSettings": {
"baseBranches": []
},
"checkRunSettings": {
"vulnerableCheckRunConclusionLevel": "failure"
"vulnerableCheckRunConclusionLevel": "failure",
"displayMode": "diff"
},
"issueSettings": {
"minSeverityLevel": "LOW"

View File

@ -1,58 +0,0 @@
# Contributing to waybackpy
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
- Reporting a bug
- Discussing the current state of the code
- Submitting a fix
- Proposing new features
- Becoming a maintainer
## We Develop with Github
We use github to host code, to track issues and feature requests, as well as accept pull requests.
## We Use [Github Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [Github Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
## Any contributions you make will be under the MIT Software License
In short, when you submit code changes, your submissions are understood to be under the same [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
## Report bugs using Github's [issues](https://github.com/akamhy/waybackpy/issues)
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/akamhy/waybackpy/issues/new); it's that easy!
## Write bug reports with detail, background, and sample code
**Great Bug Reports** tend to have:
- A quick summary and/or background
- Steps to reproduce
- Be specific!
- Give sample code if you can.
- What you expected would happen
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)
People *love* thorough bug reports. I'm not even kidding.
## Use a Consistent Coding Style
* You can try running `flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics` for style unification.
## License
By contributing, you agree that your contributions will be licensed under its [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE).
## References
This document is forked from [this gist](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62) by [briandk](https://github.com/briandk) which was itself adapted from the open-source contribution guidelines for [Facebook's Draft](https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md)

View File

@ -2,8 +2,8 @@
- akamhy (<https://github.com/akamhy>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- jonasjancarik (<https://github.com/jonasjancarik>)
## ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- datashaman (<https://stackoverflow.com/users/401467/datashaman>) for <https://stackoverflow.com/a/35504626>. _get_response is based on this amazing answer.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

View File

@ -1,6 +1,6 @@
MIT License
Copyright (c) 2020 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Copyright (c) 2020-2022 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal

175
README.md
View File

@ -2,110 +2,153 @@
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<h2>Python package & CLI tool that interfaces with the Wayback Machine API</h2>
<h3>A Python package & CLI tool that interfaces with the Wayback Machine API</h3>
</div>
<p align="center">
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ATests"><img alt="Unit Tests" src="https://github.com/akamhy/waybackpy/workflows/Tests/badge.svg"></a>
<a href="https://pypi.org/project/waybackpy/"><img alt="pypi" src="https://img.shields.io/pypi/v/waybackpy.svg"></a>
<a href="https://github.com/akamhy/waybackpy/actions?query=workflow%3ACI"><img alt="Build Status" src="https://github.com/akamhy/waybackpy/workflows/CI/badge.svg"></a>
<a href="https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade"><img alt="Codacy Badge" src="https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65"></a>
<a href="https://codecov.io/gh/akamhy/waybackpy"><img alt="codecov" src="https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg"></a>
<a href="https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md"><img alt="Contributions Welcome" src="https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square"></a>
<a href="https://pepy.tech/project/waybackpy?versions=2*&versions=1*&versions=3*"><img alt="Downloads" src="https://pepy.tech/badge/waybackpy/month"></a>
<a href="https://github.com/akamhy/waybackpy/commits/master"><img alt="GitHub lastest commit" src="https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square"></a>
<a href="#"><img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square"></a>
<a href="https://github.com/psf/black"><img alt="Code style: black" src="https://img.shields.io/badge/code%20style-black-000000.svg"></a>
</p>
-----------------------------------------------------------------------------------------------------------------------------------------------
### Installation
## ⭐️ Introduction
Waybackpy is a [Python package](https://www.udacity.com/blog/2021/01/what-is-a-python-package.html) and a [CLI](https://www.w3schools.com/whatis/whatis_cli.asp) tool that interfaces with the [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine) API.
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
Wayback Machine has 3 client side [API](https://www.redhat.com/en/topics/api/what-are-application-programming-interfaces)s.
- [Save API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#save-api)
- [Availability API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#availability-api)
- [CDX API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#cdx-api)
These three APIs can be accessed via the waybackpy either by importing it in a script or from the CLI.
### 🏗 Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)), from [PyPI](https://pypi.org/) (recommended):
```bash
pip install waybackpy
```
Install directly from GitHub:
Install directly from [this git repository](https://github.com/akamhy/waybackpy) (NOT recommended):
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
### Supported Features
### 🐳 Docker Image
Docker Hub : <https://hub.docker.com/r/secsi/waybackpy>
- Archive webpage
- Retrieve all archives of a webpage/domain
- Retrieve archive close to a date or timestamp
- Retrieve all archives which have a particular prefix
- Get source code of the archive easily
- CDX API support
[Docker image](https://searchitoperations.techtarget.com/definition/Docker-image) is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
RAUDI is a tool by SecSI (<https://secsi.io>), an Italian cybersecurity startup.
### Usage
### 🚀 Usage
#### As a Python package
##### Save API aka SavePageNow
```python
>>> import waybackpy
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> from waybackpy import WaybackMachineSaveAPI
>>> url = "https://github.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
>>> archive.archive_url
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)
>>> oldest_archive = wayback.oldest()
>>> oldest_archive.archive_url
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> archive_close_to_2010_feb.archive_url
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> wayback.newest().archive_url
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>>
>>> save_api = WaybackMachineSaveAPI(url, user_agent)
>>> save_api.save()
https://web.archive.org/web/20220118125249/https://github.com/
>>> save_api.cached_save
False
>>> save_api.timestamp()
datetime.datetime(2022, 1, 18, 12, 52, 49)
```
> Full Python package documentation can be found at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
##### Availability API
```python
>>> from waybackpy import WaybackMachineAvailabilityAPI
>>>
>>> url = "https://google.com"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>>
>>> availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
>>>
>>> availability_api.oldest()
https://web.archive.org/web/19981111184551/http://google.com:80/
>>>
>>> availability_api.newest()
https://web.archive.org/web/20220118150444/https://www.google.com/
>>>
>>> availability_api.near(year=2010, month=10, day=10, hour=10)
https://web.archive.org/web/20101010101708/http://www.google.com/
```
##### CDX API aka CDXServerAPI
```python
>>> from waybackpy import WaybackMachineCDXServerAPI
>>> url = "https://pypi.org"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> cdx = WaybackMachineCDXServerAPI(url, user_agent, start_timestamp=2016, end_timestamp=2017)
>>> for item in cdx.snapshots():
... print(item.archive_url)
...
https://web.archive.org/web/20160110011047/http://pypi.org/
https://web.archive.org/web/20160305104847/http://pypi.org/
.
. # URLS REDACTED FOR READABILITY
.
https://web.archive.org/web/20171127171549/https://pypi.org/
https://web.archive.org/web/20171206002737/http://pypi.org:80/
```
> Documentation is at <https://github.com/akamhy/waybackpy/wiki/Python-package-docs>.
#### As a CLI tool
Saving a webpage:
```bash
$ waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
$ waybackpy --oldest --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid
$ waybackpy --newest --url "https://en.wikipedia.org/wiki/Remote_sensing" --user_agent "my-unique-user-agent"
https://web.archive.org/web/20201221130522/https://en.wikipedia.org/wiki/Remote_sensing
$ waybackpy --total --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent"
1904
$ waybackpy --known_urls --url akamhy.github.io --user_agent "my-unique-user-agent" --file
https://akamhy.github.io
https://akamhy.github.io/assets/js/scale.fix.js
https://akamhy.github.io/favicon.ico
https://akamhy.github.io/robots.txt
https://akamhy.github.io/waybackpy/
'akamhy.github.io-urls-iftor2.txt' saved in current working directory
waybackpy --save --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent"
```
```bash
Archive URL:
https://web.archive.org/web/20220121193801/https://en.wikipedia.org/wiki/Social_media
Cached save:
False
```
> Full CLI documentation can be found at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
## License
Retriving the oldest archive and also printing the JSON response of the availability API:
```bash
waybackpy --oldest --json --url "https://en.wikipedia.org/wiki/Humanoid" --user_agent "my-unique-user-agent"
```
```bash
Archive URL:
https://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid
JSON response:
{"url": "https://en.wikipedia.org/wiki/Humanoid", "archived_snapshots": {"closest": {"status": "200", "available": true, "url": "http://web.archive.org/web/20040415020811/http://en.wikipedia.org:80/wiki/Humanoid", "timestamp": "20040415020811"}}, "timestamp": "199401212126"}
```
Archive close to a time, minute level precision is supported:
```bash
waybackpy --url google.com --user_agent "my-unique-user-agent" --near --year 2008 --month 8 --day 8
```
```bash
Archive URL:
https://web.archive.org/web/20080808014003/http://www.google.com:80/
```
> CLI documentation is at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
### 🛡 License
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
Released under the MIT License. See
[license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.
Copyright (c) 2020-2022 Akash Mahanty Et al.
-----------------------------------------------------------------------------------------------------------------------------------------------
Released under the MIT License. See [license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.

View File

@ -1 +1,14 @@
<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 176.612 41.908" height="158.392" width="667.51" xmlns:v="https://github.com/akamhy/waybackpy"><text transform="matrix(.862888 0 0 1.158899 -.748 -98.312)" y="110.937" x="0.931" xml:space="preserve" font-weight="bold" font-size="28.149" font-family="sans-serif" letter-spacing="0" word-spacing="0" writing-mode="lr-tb" fill="#003dff"><tspan y="110.937" x="0.931"><tspan y="110.937" x="0.931" letter-spacing="3.568" writing-mode="lr-tb">waybackpy</tspan></tspan></text><path d="M.749 0h153.787v4.864H.749zm22.076 37.418h153.787v4.49H22.825z" fill="navy"/><path d="M0 37.418h22.825v4.49H0zM154.536 0h21.702v4.864h-21.702z" fill="#f0f"/></svg>
<?xml version="1.0" encoding="utf-8"?>
<svg width="711.80188pt" height="258.30469pt" viewBox="0 0 711.80188 258.30469" version="1.1" id="svg2" xmlns="http://www.w3.org/2000/svg">
<g id="surface1" transform="translate(-40.045801,-148)">
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 224.09 309.814 L 224.09 197.997 L 204.768 197.994 L 204.768 312.635 C 204.768 312.635 205.098 312.9 204.105 313.698 C 203.113 314.497 202.408 313.849 202.408 313.849 L 200.518 313.849 L 200.518 197.991 L 181.139 197.991 L 181.139 313.849 L 179.253 313.849 C 179.253 313.849 178.544 314.497 177.551 313.698 C 176.558 312.9 176.888 312.635 176.888 312.635 L 176.888 197.994 L 157.57 197.997 L 157.57 309.814 C 157.57 309.814 156.539 316.772 162.615 321.658 C 168.691 326.546 177.551 326.049 177.551 326.049 L 204.11 326.049 C 204.11 326.049 212.965 326.546 219.041 321.658 C 225.118 316.772 224.09 309.814 224.09 309.814" id="path5"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 253.892 299.821 C 253.892 299.821 253.632 300.965 251.888 300.965 C 250.143 300.965 249.629 299.821 249.629 299.821 L 249.629 278.477 C 249.629 278.477 249.433 278.166 250.078 277.645 C 250.726 277.124 251.243 277.179 251.243 277.179 L 253.892 277.228 Z M 251.588 199.144 C 230.266 199.144 231.071 213.218 231.071 213.218 L 231.071 254.303 L 249.675 254.303 L 249.675 213.69 C 249.675 213.69 249.775 211.276 251.787 211.276 C 253.8 211.276 254 213.542 254 213.542 L 254 265.146 L 246.156 265.146 C 246.156 265.146 240.022 264.579 235.495 268.22 C 230.968 271.858 231.071 276.791 231.071 276.791 L 231.071 298.955 C 231.071 298.955 229.461 308.016 238.914 312.058 C 248.368 316.103 254.805 309.795 254.805 309.795 L 254.805 312.706 L 272.508 312.706 L 272.508 212.895 C 272.508 212.895 272.907 199.144 251.588 199.144" id="path7"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 404.682 318.261 C 404.682 318.261 404.398 319.494 402.485 319.494 C 400.568 319.494 400.001 318.261 400.001 318.261 L 400.001 295.216 C 400.001 295.216 399.786 294.879 400.496 294.315 C 401.208 293.757 401.776 293.812 401.776 293.812 L 404.682 293.868 Z M 402.152 209.568 C 378.728 209.568 379.61 224.761 379.61 224.761 L 379.61 269.117 L 400.051 269.117 L 400.051 225.273 C 400.051 225.273 400.162 222.665 402.374 222.665 C 404.582 222.665 404.805 225.109 404.805 225.109 L 404.805 280.82 L 396.187 280.82 C 396.187 280.82 389.447 280.213 384.475 284.141 C 379.499 288.072 379.61 293.396 379.61 293.396 L 379.61 317.324 C 379.61 317.324 377.843 327.104 388.232 331.469 C 398.616 335.838 405.69 329.027 405.69 329.027 L 405.69 332.169 L 425.133 332.169 L 425.133 224.413 C 425.133 224.413 425.578 209.568 402.152 209.568" id="path9"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 321.114 328.636 L 321.114 206.587 L 302.582 206.587 L 302.582 304.902 C 302.582 304.902 303.211 307.094 300.624 307.094 C 298.035 307.094 298.316 304.902 298.316 304.902 L 298.316 206.587 L 279.784 206.587 C 279.784 206.587 279.922 304.338 279.922 306.756 C 279.922 309.175 280.27 310.526 280.831 312.379 C 281.391 314.238 282.579 318.116 290.901 319.186 C 299.224 320.256 302.44 315.813 302.44 315.813 L 302.44 327.736 C 302.44 327.736 302.862 329.366 300.554 329.366 C 298.246 329.366 298.316 327.849 298.316 327.849 L 298.316 322.957 L 279.642 322.957 L 279.642 327.791 C 279.642 327.791 278.523 341.514 300.274 341.514 C 322.026 341.514 321.114 328.636 321.114 328.636" id="path11"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 352.449 209.811 L 352.449 273.495 C 352.449 277.49 347.911 277.194 347.911 277.194 L 347.911 207.592 C 347.911 207.592 346.929 207.542 349.567 207.542 C 352.817 207.542 352.449 209.811 352.449 209.811 M 352.326 310.393 C 352.326 310.393 352.143 312.366 350.425 312.366 L 348.033 312.366 L 348.033 289.478 L 349.628 289.478 C 349.628 289.478 352.326 289.428 352.326 292.092 Z M 371.341 287.505 C 371.341 284.791 370.727 282.966 368.826 280.993 C 366.925 279.02 363.367 277.441 363.367 277.441 C 363.367 277.441 365.514 276.948 368.704 274.728 C 371.893 272.509 371.525 267.921 371.525 267.921 L 371.525 212.919 C 371.525 212.919 371.801 204.509 366.925 200.587 C 362.049 196.665 352.515 196.363 352.515 196.363 L 328.711 196.363 L 328.711 324.107 L 350.609 324.107 C 360.055 324.107 364.594 322.232 368.336 318.286 C 372.077 314.34 371.341 308.321 371.341 308.321 Z M 371.341 287.505" id="path13"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 452.747 226.744 L 452.747 268.806 L 471.581 268.806 L 471.581 227.459 C 471.581 227.459 471.846 213.532 450.516 213.532 C 429.182 213.532 430.076 227.533 430.076 227.533 L 430.076 313.381 C 430.076 313.381 428.825 327.523 450.872 327.523 C 472.919 327.523 471.401 313.526 471.401 313.526 L 471.401 292.064 L 452.835 292.064 L 452.835 314.389 C 452.835 314.389 452.923 315.61 450.961 315.61 C 448.997 315.61 448.729 314.389 448.729 314.389 L 448.729 226.524 C 448.729 226.524 448.821 225.378 450.692 225.378 C 452.566 225.378 452.747 226.744 452.747 226.744" id="path15"/>
<path style="fill: rgb(171, 46, 51); fill-opacity: 1; fill-rule: nonzero; stroke: none;" d="M 520.624 281.841 C 517.672 278.98 514.317 277.904 514.317 277.904 C 514.317 277.904 517.538 277.796 520.489 274.775 C 523.442 271.753 523.173 267.924 523.173 267.924 L 523.173 208.211 L 503.185 208.211 L 503.185 276.014 C 503.185 276.014 503.185 277.361 501.172 277.361 L 498.761 277.309 L 498.761 191.655 L 478.973 191.655 L 478.973 327.905 L 498.692 327.905 L 498.692 290.039 L 501.709 290.039 C 501.709 290.039 502.112 290.039 502.648 290.523 C 503.185 291.01 503.185 291.602 503.185 291.602 L 503.185 327.905 L 523.307 327.905 L 523.307 288.636 C 523.307 288.636 523.576 284.699 520.624 281.841" id="path17"/>
<path style="fill-opacity: 1; fill-rule: nonzero; stroke: none; fill: rgb(255, 222, 87);" d="M 638.021 327.182 L 638.021 205.132 L 619.489 205.132 L 619.489 303.448 C 619.489 303.448 620.119 305.64 617.53 305.64 C 614.944 305.64 615.223 303.448 615.223 303.448 L 615.223 205.132 L 596.692 205.132 C 596.692 205.132 596.83 302.884 596.83 305.301 C 596.83 307.721 597.178 309.071 597.738 310.924 C 598.299 312.784 599.487 316.662 607.809 317.732 C 616.132 318.802 619.349 314.359 619.349 314.359 L 619.349 326.281 C 619.349 326.281 619.77 327.913 617.462 327.913 C 615.154 327.913 615.223 326.396 615.223 326.396 L 615.223 321.502 L 596.55 321.502 L 596.55 326.336 C 596.55 326.336 595.43 340.059 617.182 340.059 C 638.934 340.059 638.021 327.182 638.021 327.182" id="path-1"/>
<path d="M 592.159 233.846 C 593.222 238.576 593.75 243.873 593.745 249.735 C 593.74 255.598 593.135 261.281 591.931 266.782 C 590.726 272.285 588.901 277.144 586.453 281.361 C 584.006 285.578 580.938 288.946 577.248 291.466 C 573.559 293.985 569.226 295.246 564.25 295.246 C 561.585 295.246 559.008 294.936 556.521 294.32 C 554.033 293.703 551.813 292.854 549.859 291.774 C 547.905 290.694 546.284 289.512 544.997 288.226 C 543.71 286.94 542.934 285.578 542.668 284.138 L 542.629 328.722 L 526.369 328.722 L 526.475 207.466 L 541.003 207.466 L 542.728 216.259 C 544.507 213.38 547.197 211.065 550.797 209.317 C 554.397 207.568 558.374 206.694 562.728 206.694 C 565.66 206.694 568.637 207.157 571.657 208.083 C 574.677 209.008 577.497 210.551 580.116 212.711 C 582.735 214.871 585.11 217.698 587.239 221.196 C 589.369 224.692 591.009 228.909 592.159 233.846 Z M 558.932 280.744 C 561.597 280.744 564.019 279.972 566.197 278.429 C 568.376 276.887 570.243 274.804 571.801 272.182 C 573.358 269.559 574.582 266.423 575.474 262.772 C 576.366 259.121 576.814 255.238 576.817 251.124 C 576.821 247.113 576.424 243.307 575.628 239.708 C 574.831 236.108 573.701 232.92 572.237 230.143 C 570.774 227.366 568.999 225.155 566.912 223.51 C 564.825 221.864 562.405 221.041 559.65 221.041 C 556.985 221.041 554.54 221.813 552.318 223.356 C 550.095 224.898 548.183 226.981 546.581 229.603 C 544.98 232.226 543.755 235.311 542.908 238.86 C 542.061 242.408 541.635 246.239 541.632 250.353 C 541.628 254.466 542.002 258.349 542.754 262 C 543.506 265.651 544.637 268.865 546.145 271.642 C 547.653 274.419 549.472 276.63 551.603 278.276 C 553.734 279.922 556.177 280.744 558.932 280.744 Z" style="fill: rgb(69, 132, 182); white-space: pre;"/>
</g>
</svg>

Before

Width:  |  Height:  |  Size: 694 B

After

Width:  |  Height:  |  Size: 8.3 KiB

11
pytest.ini Normal file
View File

@ -0,0 +1,11 @@
[pytest]
addopts =
# show summary of all tests that did not pass
-ra
# enable all warnings
-Wd
# coverage and html report
--cov=waybackpy
--cov-report=html
testpaths =
tests

8
requirements-dev.txt Normal file
View File

@ -0,0 +1,8 @@
click
requests
pytest
pytest-cov
codecov
flake8
mypy
black

View File

@ -1 +1,2 @@
requests>=2.24.0
click
requests

View File

@ -1,17 +1,25 @@
import os.path
from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), "README.md")) as f:
readme_path = os.path.join(os.path.dirname(__file__), "README.md")
with open(readme_path, encoding="utf-8") as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")) as f:
version_path = os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")
with open(version_path, encoding="utf-8") as f:
exec(f.read(), about)
version = str(about["__version__"])
download_url = "https://github.com/akamhy/waybackpy/archive/{version}.tar.gz".format(
version=version
)
setup(
name=about["__title__"],
packages=["waybackpy"],
version=about["__version__"],
version=version,
description=about["__description__"],
long_description=long_description,
long_description_content_type="text/markdown",
@ -19,21 +27,24 @@ setup(
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/2.4.3.tar.gz",
download_url=download_url,
keywords=[
"Archive It",
"Archive Website",
"Wayback Machine",
"waybackurls",
"Internet Archive",
"Wayback Machine CLI",
"Wayback Machine Python",
"Internet Archiving",
"Availability API",
"CDX API",
"savepagenow",
],
install_requires=["requests"],
install_requires=["requests", "click"],
python_requires=">=3.4",
classifiers=[
"Development Status :: 5 - Production/Stable",
"Development Status :: 4 - Beta",
"Intended Audience :: Developers",
"Natural Language :: English",
"Topic :: Software Development :: Build Tools",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
@ -43,6 +54,7 @@ setup(
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: 3.10",
"Programming Language :: Python :: Implementation :: CPython",
],
entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},

View File

@ -0,0 +1,100 @@
import pytest
import random
import string
from datetime import datetime, timedelta
from waybackpy.availability_api import WaybackMachineAvailabilityAPI
from waybackpy.exceptions import (
InvalidJSONInAvailabilityAPIResponse,
ArchiveNotInAvailabilityAPIResponse,
)
now = datetime.utcnow()
url = "https://example.com/"
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
rndstr = lambda n: "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_oldest():
"""
Test the oldest archive of Google.com and also checks the attributes.
"""
url = "https://example.com/"
user_agent = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
oldest = availability_api.oldest()
oldest_archive_url = oldest.archive_url
assert "2002" in oldest_archive_url
oldest_timestamp = oldest.timestamp()
assert abs(oldest_timestamp - now) > timedelta(days=7000) # More than 19 years
assert availability_api.JSON["archived_snapshots"]["closest"]["available"] is True
assert repr(oldest).find("example.com") != -1
assert "2002" in str(oldest)
def test_newest():
"""
Assuming that the recent most Google Archive was made no more earlier than
last one day which is 86400 seconds.
"""
url = "https://www.youtube.com/"
user_agent = "Mozilla/5.0 (X11; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0"
availability_api = WaybackMachineAvailabilityAPI(url, user_agent)
newest = availability_api.newest()
newest_timestamp = newest.timestamp()
# betting in favor that latest youtube archive was not before the last 3 days
# high tarffic sites like youtube are archived mnay times a day, so seems
# very reasonable to me.
assert abs(newest_timestamp - now) < timedelta(seconds=86400 * 3)
def test_invalid_json():
"""
When the API is malfunctioning or we don't pass a URL it may return invalid JSON data.
"""
with pytest.raises(InvalidJSONInAvailabilityAPIResponse):
availability_api = WaybackMachineAvailabilityAPI(url="", user_agent=user_agent)
archive_url = availability_api.archive_url
def test_no_archive():
"""
ArchiveNotInAvailabilityAPIResponse may be raised if Wayback Machine did not
replied with the archive despite the fact that we know the site has million
of archives. Don't know the reason for this wierd behavior.
And also if really there are no archives for the passed URL this exception
is raised.
"""
with pytest.raises(ArchiveNotInAvailabilityAPIResponse):
availability_api = WaybackMachineAvailabilityAPI(
url="https://%s.cn" % rndstr(30), user_agent=user_agent
)
archive_url = availability_api.archive_url
def test_no_api_call_str_repr():
"""
Some entitled users maybe want to see what is the string representation
if they dont make any API requests.
str() must not return None so we return ""
"""
availability_api = WaybackMachineAvailabilityAPI(
url="https://%s.gov" % rndstr(30), user_agent=user_agent
)
assert "" == str(availability_api)
def test_no_call_timestamp():
"""
If no API requests were made the bound timestamp() method returns
the datetime.max as a default value.
"""
availability_api = WaybackMachineAvailabilityAPI(
url="https://%s.in" % rndstr(30), user_agent=user_agent
)
assert datetime.max == availability_api.timestamp()

View File

@ -1,93 +0,0 @@
import pytest
from waybackpy.cdx import Cdx
from waybackpy.exceptions import WaybackError
def test_all_cdx():
url = "akamhy.github.io"
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, \
like Gecko) Chrome/45.0.2454.85 Safari/537.36"
cdx = Cdx(
url=url,
user_agent=user_agent,
start_timestamp=2017,
end_timestamp=2020,
filters=[
"statuscode:200",
"mimetype:text/html",
"timestamp:20201002182319",
"original:https://akamhy.github.io/",
],
gzip=False,
collapses=["timestamp:10", "digest"],
limit=50,
match_type="prefix",
)
snapshots = cdx.snapshots()
for snapshot in snapshots:
ans = snapshot.archive_url
assert "https://web.archive.org/web/20201002182319/https://akamhy.github.io/" == ans
url = "akahfjgjkmhy.gihthub.ip"
cdx = Cdx(
url=url,
user_agent=user_agent,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=True,
collapses=[],
limit=10,
)
snapshots = cdx.snapshots()
print(snapshots)
i = 0
for _ in snapshots:
i += 1
assert i == 0
url = "https://github.com/akamhy/waybackpy/*"
cdx = Cdx(url=url, user_agent=user_agent, limit=50)
snapshots = cdx.snapshots()
for snapshot in snapshots:
print(snapshot.archive_url)
url = "https://github.com/akamhy/waybackpy"
with pytest.raises(WaybackError):
cdx = Cdx(url=url, user_agent=user_agent, limit=50, filters=["ghddhfhj"])
snapshots = cdx.snapshots()
with pytest.raises(WaybackError):
cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp", "ghdd:hfhj"])
snapshots = cdx.snapshots()
url = "https://github.com"
cdx = Cdx(url=url, user_agent=user_agent, limit=50)
snapshots = cdx.snapshots()
c = 0
for snapshot in snapshots:
c += 1
if c > 100:
break
url = "https://github.com/*"
cdx = Cdx(url=url, user_agent=user_agent, collapses=["timestamp"])
snapshots = cdx.snapshots()
c = 0
for snapshot in snapshots:
c += 1
if c > 30529: # deafult limit is 10k
break
url = "https://github.com/*"
cdx = Cdx(url=url, user_agent=user_agent)
c = 0
snapshots = cdx.snapshots()
for snapshot in snapshots:
c += 1
if c > 100529:
break

View File

@ -1,9 +1,10 @@
import pytest
from datetime import datetime
from waybackpy.snapshot import CdxSnapshot, datetime
from waybackpy.cdx_snapshot import CDXSnapshot
def test_CdxSnapshot():
def test_CDXSnapshot():
sample_input = "org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415"
prop_values = sample_input.split(" ")
properties = {}
@ -17,7 +18,7 @@ def test_CdxSnapshot():
properties["length"],
) = prop_values
snapshot = CdxSnapshot(properties)
snapshot = CDXSnapshot(properties)
assert properties["urlkey"] == snapshot.urlkey
assert properties["timestamp"] == snapshot.timestamp

99
tests/test_cdx_utils.py Normal file
View File

@ -0,0 +1,99 @@
import pytest
from waybackpy.exceptions import WaybackError
from waybackpy.cdx_utils import (
get_total_pages,
full_url,
get_response,
check_filters,
check_collapses,
check_match_type,
)
def test_get_total_pages():
url = "twitter.com"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.2 Safari/605.1.15"
assert get_total_pages(url=url, user_agent=user_agent) >= 56
def test_full_url():
params = {}
endpoint = "https://web.archive.org/cdx/search/cdx"
assert endpoint == full_url(endpoint, params)
params = {"a": "1"}
assert "https://web.archive.org/cdx/search/cdx?a=1" == full_url(endpoint, params)
assert "https://web.archive.org/cdx/search/cdx?a=1" == full_url(
endpoint + "?", params
)
params["b"] = 2
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2" == full_url(
endpoint + "?", params
)
params["c"] = "foo bar"
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2&c=foo%20bar" == full_url(
endpoint + "?", params
)
def test_get_response():
url = "https://github.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = get_response(url, headers=headers)
assert response.status_code == 200
url = "http/wwhfhfvhvjhmom"
with pytest.raises(WaybackError):
get_response(url, headers=headers)
def test_check_filters():
filters = []
check_filters(filters)
filters = ["statuscode:200", "timestamp:20215678901234", "original:https://url.com"]
check_filters(filters)
with pytest.raises(WaybackError):
check_filters("not-list")
with pytest.raises(WaybackError):
check_filters(["invalid"])
def test_check_collapses():
collapses = []
check_collapses(collapses)
collapses = ["timestamp:10"]
check_collapses(collapses)
collapses = ["urlkey"]
check_collapses(collapses)
collapses = "urlkey" # NOT LIST
with pytest.raises(WaybackError):
check_collapses(collapses)
collapses = ["also illegal collapse"]
with pytest.raises(WaybackError):
check_collapses(collapses)
def test_check_match_type():
assert None == check_match_type(None, "url")
match_type = "exact"
url = "test_url"
assert None == check_match_type(match_type, url)
url = "has * in it"
with pytest.raises(WaybackError):
check_match_type("domain", url)
with pytest.raises(WaybackError):
check_match_type("not a valid type", "url")

View File

@ -1,359 +0,0 @@
import sys
import os
import pytest
import random
import string
import argparse
import waybackpy.cli as cli
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
def test_save():
args = argparse.Namespace(
user_agent=None,
url="https://hfjfjfjfyu6r6rfjvj.fjhgjhfjgvjm",
total=False,
version=False,
file=False,
oldest=False,
save=True,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "could happen because either your waybackpy" or "cannot be archived by wayback machine as it is a redirect" in str(reply)
def test_json():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=True,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "archived_snapshots" in str(reply)
def test_archive_url():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=True,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "https://web.archive.org/web/" in str(reply)
def test_oldest():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
file=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_newest():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_total_archives():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=True,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert isinstance(reply, int)
def test_known_urls():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://www.keybr.com",
total=False,
version=False,
file=True,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=True,
get=None,
)
reply = cli.args_handler(args)
assert "keybr" in str(reply)
def test_near():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "202007" in str(reply)
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url = "https://pypi.org/yfvjvycyc667r67ed67r" + uid
args = argparse.Namespace(
user_agent=None,
url=url,
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "Can not find archive for" in str(reply)
def test_get():
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://github.com/akamhy",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get="url",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://github.com/akamhy/waybackpy",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get="oldest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://akamhy.github.io/waybackpy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get="newest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
file=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
subdomain=False,
known_urls=False,
get="foobar",
)
reply = cli.args_handler(args)
assert "get the source code of the" in str(reply)
def test_args_handler():
args = argparse.Namespace(version=True)
reply = cli.args_handler(args)
assert ("waybackpy version %s" % (__version__)) == reply
args = argparse.Namespace(url=None, version=False)
reply = cli.args_handler(args)
assert ("waybackpy %s" % (__version__)) in str(reply)
def test_main():
# This also tests the parse_args method in cli.py
cli.main(["temp.py", "--version"])

133
tests/test_save_api.py Normal file
View File

@ -0,0 +1,133 @@
import pytest
import time
import random
import string
from datetime import datetime
from waybackpy.save_api import WaybackMachineSaveAPI
from waybackpy.exceptions import MaximumSaveRetriesExceeded
rndstr = lambda n: "".join(
random.choice(string.ascii_uppercase + string.digits) for _ in range(n)
)
def test_save():
url = "https://github.com/akamhy/waybackpy"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.save()
archive_url = save_api.archive_url
timestamp = save_api.timestamp()
headers = save_api.headers # CaseInsensitiveDict
cached_save = save_api.cached_save
assert cached_save in [True, False]
assert archive_url.find("github.com/akamhy/waybackpy") != -1
assert str(headers).find("github.com/akamhy/waybackpy") != -1
assert type(save_api.timestamp()) == type(datetime(year=2020, month=10, day=2))
def test_max_redirect_exceeded():
with pytest.raises(MaximumSaveRetriesExceeded):
url = "https://%s.gov" % rndstr
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent, max_tries=3)
save_api.save()
def test_sleep():
"""
sleeping is actually very important for SaveAPI
interface stability.
The test checks that the time taken by sleep method
is as intended.
"""
url = "https://example.com"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
s_time = int(time.time())
save_api.sleep(6) # multiple of 3 sleep for 10 seconds
e_time = int(time.time())
assert (e_time - s_time) >= 10
s_time = int(time.time())
save_api.sleep(7) # sleeps for 5 seconds
e_time = int(time.time())
assert (e_time - s_time) >= 5
def test_timestamp():
url = "https://example.com"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
now = datetime.utcnow()
save_api._archive_url = (
"https://web.archive.org/web/%s/" % now.strftime("%Y%m%d%H%M%S") + url
)
save_api.timestamp()
assert save_api.cached_save is False
save_api._archive_url = "https://web.archive.org/web/%s/" % "20100124063622" + url
save_api.timestamp()
assert save_api.cached_save is True
def test_archive_url_parser():
"""
Testing three regex for matches and also tests the response URL.
"""
url = "https://example.com"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.headers = """
START
Content-Location: /web/20201126185327/https://www.scribbr.com/citing-sources/et-al
END
"""
assert (
save_api.archive_url_parser()
== "https://web.archive.org/web/20201126185327/https://www.scribbr.com/citing-sources/et-al"
)
save_api.headers = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:40:25 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Server': 'nginx', 'X-Archive-Orig-Date': 'Sat, 02 Jan 2021 09:40:09 GMT', 'X-Archive-Orig-Transfer-Encoding': 'chunked', 'X-Archive-Orig-Connection': 'keep-alive', 'X-Archive-Orig-Vary': 'Accept-Encoding', 'X-Archive-Orig-Last-Modified': 'Fri, 01 Jan 2021 12:19:00 GMT', 'X-Archive-Orig-Strict-Transport-Security': 'max-age=31536000, max-age=0;', 'X-Archive-Guessed-Content-Type': 'text/html', 'X-Archive-Guessed-Charset': 'utf-8', 'Memento-Datetime': 'Sat, 02 Jan 2021 09:40:09 GMT', 'Link': '<https://www.scribbr.com/citing-sources/et-al/>; rel="original", <https://web.archive.org/web/timemap/link/https://www.scribbr.com/citing-sources/et-al/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.scribbr.com/citing-sources/et-al/>; rel="timegate", <https://web.archive.org/web/20200601082911/https://www.scribbr.com/citing-sources/et-al/>; rel="first memento"; datetime="Mon, 01 Jun 2020 08:29:11 GMT", <https://web.archive.org/web/20201126185327/https://www.scribbr.com/citing-sources/et-al/>; rel="prev memento"; datetime="Thu, 26 Nov 2020 18:53:27 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="last memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'X-Archive-Src': 'spn2-20210102092956-wwwb-spn20.us.archive.org-8001.warc.gz', 'Server-Timing': 'captures_list;dur=112.646325, exclusion.robots;dur=0.172010, exclusion.robots.policy;dur=0.158205, RedisCDXSource;dur=2.205932, esindex;dur=0.014647, LoadShardBlock;dur=82.205012, PetaboxLoader3.datanode;dur=70.750239, CDXLines.iter;dur=24.306278, load_resource;dur=26.520179', 'X-App-Server': 'wwwb-app200', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/IN', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0', 'Content-Encoding': 'gzip'}
"""
assert (
save_api.archive_url_parser()
== "https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/"
)
save_api.headers = """
START
X-Cache-Key: https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al/US
END
"""
assert (
save_api.archive_url_parser()
== "https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al/"
)
save_api.headers = "TEST TEST TEST AND NO MATCH - TEST FOR RESPONSE URL MATCHING"
save_api.response_url = "https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al"
assert (
save_api.archive_url_parser()
== "https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al"
)
def test_archive_url():
"""
Checks the attribute archive_url's value when the save method was not
explicitly invoked by the end-user but the save method was invoked implicitly
by the archive_url method which is an attribute due to @property.
"""
url = "https://example.com"
user_agent = "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
save_api = WaybackMachineSaveAPI(url, user_agent)
save_api.saved_archive = (
"https://web.archive.org/web/20220124063056/https://example.com/"
)
assert save_api.archive_url == save_api.saved_archive

View File

@ -1,186 +1,13 @@
import pytest
import json
from waybackpy.utils import (
_cleaned_url,
_url_check,
_full_url,
URLError,
WaybackError,
_get_total_pages,
_archive_url_parser,
_wayback_timestamp,
_get_response,
_check_match_type,
_check_collapses,
_check_filters,
_timestamp_manager,
)
from waybackpy.utils import latest_version, DEFAULT_USER_AGENT
from waybackpy.__version__ import __version__
def test_timestamp_manager():
timestamp = True
data = {}
assert _timestamp_manager(timestamp, data)
data = """
{"archived_snapshots": {"closest": {"timestamp": "20210109155628", "available": true, "status": "200", "url": "http://web.archive.org/web/20210109155628/https://www.google.com/"}}, "url": "https://www.google.com/"}
"""
data = json.loads(data)
assert data["archived_snapshots"]["closest"]["timestamp"] == "20210109155628"
def test_check_filters():
filters = []
_check_filters(filters)
filters = ["statuscode:200", "timestamp:20215678901234", "original:https://url.com"]
_check_filters(filters)
with pytest.raises(WaybackError):
_check_filters("not-list")
def test_check_collapses():
collapses = []
_check_collapses(collapses)
collapses = ["timestamp:10"]
_check_collapses(collapses)
collapses = ["urlkey"]
_check_collapses(collapses)
collapses = "urlkey" # NOT LIST
with pytest.raises(WaybackError):
_check_collapses(collapses)
collapses = ["also illegal collapse"]
with pytest.raises(WaybackError):
_check_collapses(collapses)
def test_check_match_type():
assert _check_match_type(None, "url") is None
match_type = "exact"
url = "test_url"
assert _check_match_type(match_type, url) is None
url = "has * in it"
with pytest.raises(WaybackError):
_check_match_type("domain", url)
with pytest.raises(WaybackError):
_check_match_type("not a valid type", "url")
def test_cleaned_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network%20security"
assert answer == _cleaned_url(test_url)
def test_url_check():
good_url = "https://akamhy.github.io"
assert _url_check(good_url) is None
bad_url = "https://github-com"
with pytest.raises(URLError):
_url_check(bad_url)
def test_full_url():
params = {}
endpoint = "https://web.archive.org/cdx/search/cdx"
assert endpoint == _full_url(endpoint, params)
params = {"a": "1"}
assert "https://web.archive.org/cdx/search/cdx?a=1" == _full_url(endpoint, params)
assert "https://web.archive.org/cdx/search/cdx?a=1" == _full_url(
endpoint + "?", params
)
params["b"] = 2
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2" == _full_url(
endpoint + "?", params
)
params["c"] = "foo bar"
assert "https://web.archive.org/cdx/search/cdx?a=1&b=2&c=foo%20bar" == _full_url(
endpoint + "?", params
def test_default_user_agent():
assert (
DEFAULT_USER_AGENT
== "waybackpy %s - https://github.com/akamhy/waybackpy" % __version__
)
def test_get_total_pages():
user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"
url = "github.com*"
assert 212890 <= _get_total_pages(url, user_agent)
url = "https://zenodo.org/record/4416138"
assert 2 >= _get_total_pages(url, user_agent)
def test_archive_url_parser():
perfect_header = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:40:25 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Archive-Orig-Server': 'nginx', 'X-Archive-Orig-Date': 'Sat, 02 Jan 2021 09:40:09 GMT', 'X-Archive-Orig-Transfer-Encoding': 'chunked', 'X-Archive-Orig-Connection': 'keep-alive', 'X-Archive-Orig-Vary': 'Accept-Encoding', 'X-Archive-Orig-Last-Modified': 'Fri, 01 Jan 2021 12:19:00 GMT', 'X-Archive-Orig-Strict-Transport-Security': 'max-age=31536000, max-age=0;', 'X-Archive-Guessed-Content-Type': 'text/html', 'X-Archive-Guessed-Charset': 'utf-8', 'Memento-Datetime': 'Sat, 02 Jan 2021 09:40:09 GMT', 'Link': '<https://www.scribbr.com/citing-sources/et-al/>; rel="original", <https://web.archive.org/web/timemap/link/https://www.scribbr.com/citing-sources/et-al/>; rel="timemap"; type="application/link-format", <https://web.archive.org/web/https://www.scribbr.com/citing-sources/et-al/>; rel="timegate", <https://web.archive.org/web/20200601082911/https://www.scribbr.com/citing-sources/et-al/>; rel="first memento"; datetime="Mon, 01 Jun 2020 08:29:11 GMT", <https://web.archive.org/web/20201126185327/https://www.scribbr.com/citing-sources/et-al/>; rel="prev memento"; datetime="Thu, 26 Nov 2020 18:53:27 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT", <https://web.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/>; rel="last memento"; datetime="Sat, 02 Jan 2021 09:40:09 GMT"', 'Content-Security-Policy': "default-src 'self' 'unsafe-eval' 'unsafe-inline' data: blob: archive.org web.archive.org analytics.archive.org pragma.archivelab.org", 'X-Archive-Src': 'spn2-20210102092956-wwwb-spn20.us.archive.org-8001.warc.gz', 'Server-Timing': 'captures_list;dur=112.646325, exclusion.robots;dur=0.172010, exclusion.robots.policy;dur=0.158205, RedisCDXSource;dur=2.205932, esindex;dur=0.014647, LoadShardBlock;dur=82.205012, PetaboxLoader3.datanode;dur=70.750239, CDXLines.iter;dur=24.306278, load_resource;dur=26.520179', 'X-App-Server': 'wwwb-app200', 'X-ts': '200', 'X-location': 'All', 'X-Cache-Key': 'httpsweb.archive.org/web/20210102094009/https://www.scribbr.com/citing-sources/et-al/IN', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0', 'Content-Encoding': 'gzip'}
"""
archive = _archive_url_parser(
perfect_header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "web.archive.org/web/20210102094009" in archive
header = """
vhgvkjv
Content-Location: /web/20201126185327/https://www.scribbr.com/citing-sources/et-al
ghvjkbjmmcmhj
"""
archive = _archive_url_parser(
header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "20201126185327" in archive
header = """
hfjkfjfcjhmghmvjm
X-Cache-Key: https://web.archive.org/web/20171128185327/https://www.scribbr.com/citing-sources/et-al/US
yfu,u,gikgkikik
"""
archive = _archive_url_parser(
header, "https://www.scribbr.com/citing-sources/et-al/"
)
assert "20171128185327" in archive
# The below header should result in Exception
no_archive_header = """
{'Server': 'nginx/1.15.8', 'Date': 'Sat, 02 Jan 2021 09:42:45 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Cache-Control': 'no-cache', 'X-App-Server': 'wwwb-app52', 'X-ts': '523', 'X-RL': '0', 'X-Page-Cache': 'MISS', 'X-Archive-Screenname': '0'}
"""
with pytest.raises(WaybackError):
_archive_url_parser(
no_archive_header, "https://www.scribbr.com/citing-sources/et-al/"
)
def test_wayback_timestamp():
ts = _wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4)
assert "202001020304" in str(ts)
def test_get_response():
endpoint = "https://www.google.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = _get_response(endpoint, params=None, headers=headers)
assert response.status_code == 200
endpoint = "http/wwhfhfvhvjhmom"
with pytest.raises(WaybackError):
_get_response(endpoint, params=None, headers=headers)
endpoint = "https://akamhy.github.io"
url, response = _get_response(
endpoint, params=None, headers=headers, return_full_url=True
)
assert endpoint == url
def test_latest_version():
assert __version__ == latest_version(package_name="waybackpy")

View File

@ -1,28 +0,0 @@
import pytest
from waybackpy.wrapper import Url
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_url_check():
"""No API Use"""
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
Url(broken_url, user_agent)
def test_near():
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
def test_json():
url = "github.com/akamhy/waybackpy"
target = Url(url, user_agent)
assert "archived_snapshots" in str(target.JSON)

View File

@ -1,50 +1,7 @@
# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
# ┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
# ┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
# ━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
# ━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
"""
Waybackpy is a Python package & command-line program that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Archive webpage and retrieve archived URLs easily.
Usage:
>>> import waybackpy
>>> url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
>>> user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
>>> wayback = waybackpy.Url(url, user_agent)
>>> archive = wayback.save()
>>> str(archive)
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
>>> archive.timestamp
datetime.datetime(2021, 1, 4, 17, 35, 12, 691741)
>>> oldest_archive = wayback.oldest()
>>> str(oldest_archive)
'https://web.archive.org/web/20050422130129/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> archive_close_to_2010_feb = wayback.near(year=2010, month=2)
>>> str(archive_close_to_2010_feb)
'https://web.archive.org/web/20100215001541/http://en.wikipedia.org:80/wiki/Multivariable_calculus'
>>> str(wayback.newest())
'https://web.archive.org/web/20210104173410/https://en.wikipedia.org/wiki/Multivariable_calculus'
Full documentation @ <https://github.com/akamhy/waybackpy/wiki>.
:copyright: (c) 2020-2021 AKash Mahanty Et al.
:license: MIT
"""
from .wrapper import Url, Cdx
from .wrapper import Url
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .__version__ import (
__title__,
__description__,

View File

@ -1,11 +1,11 @@
__title__ = "waybackpy"
__description__ = (
"A Python package that interfaces with the Internet Archive's Wayback Machine API. "
"Python package that interfaces with the Internet Archive's Wayback Machine APIs. "
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "2.4.3"
__author__ = "akamhy"
__version__ = "3.0.1"
__author__ = "Akash Mahanty"
__author_email__ = "akamhy@yahoo.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020-2021 Akash Mahanty et al."
__copyright__ = "Copyright 2020-2022 Akash Mahanty et al."

View File

@ -0,0 +1,198 @@
import time
import json
import requests
from datetime import datetime
from .utils import DEFAULT_USER_AGENT
from .exceptions import (
ArchiveNotInAvailabilityAPIResponse,
InvalidJSONInAvailabilityAPIResponse,
)
class WaybackMachineAvailabilityAPI:
"""
Class that interfaces the availability API of the Wayback Machine.
"""
def __init__(self, url, user_agent=DEFAULT_USER_AGENT, max_tries=3):
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = user_agent
self.headers = {"User-Agent": self.user_agent}
self.payload = {"url": "{url}".format(url=self.url)}
self.endpoint = "https://archive.org/wayback/available"
self.max_tries = max_tries
self.tries = 0
self.last_api_call_unix_time = int(time.time())
self.api_call_time_gap = 5
self.JSON = None
def unix_timestamp_to_wayback_timestamp(self, unix_timestamp):
"""
Converts Unix time to wayback Machine timestamp.
"""
return datetime.utcfromtimestamp(int(unix_timestamp)).strftime("%Y%m%d%H%M%S")
def __repr__(self):
"""
Same as string representation, just return the archive URL as a string.
"""
return str(self)
def __str__(self):
"""
String representation of the class. If atleast one API call was successfully
made then return the archive URL as a string. Else returns None.
"""
# String must not return anything other than a string object
# So, if some asks for string repr before making the API requests
# just return ""
if not self.JSON:
return ""
return self.archive_url
def json(self):
"""
Makes the API call to the availability API can set the JSON response
to the JSON attribute of the instance and also returns the JSON attribute.
"""
time_diff = int(time.time()) - self.last_api_call_unix_time
sleep_time = self.api_call_time_gap - time_diff
if sleep_time > 0:
time.sleep(sleep_time)
self.response = requests.get(
self.endpoint, params=self.payload, headers=self.headers
)
self.last_api_call_unix_time = int(time.time())
self.tries += 1
try:
self.JSON = self.response.json()
except json.decoder.JSONDecodeError:
raise InvalidJSONInAvailabilityAPIResponse(
"Response data:\n{text}".format(text=self.response.text)
)
return self.JSON
def timestamp(self):
"""
Converts the timestamp form the JSON response to datetime object.
If JSON attribute of the instance is None it implies that the either
the the last API call failed or one was never made.
If not JSON or if JSON but no timestamp in the JSON response then returns
the maximum value for datetime object that is possible.
If you get an URL as a response form the availability API it is guaranteed
that you can get the datetime object from the timestamp.
"""
if not self.JSON or not self.JSON["archived_snapshots"]:
return datetime.max
return datetime.strptime(
self.JSON["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
@property
def archive_url(self):
"""
Reads the the JSON response data and tries to get the timestamp and returns
the timestamp if found else returns None.
"""
data = self.JSON
# If the user didn't used oldest, newest or near but tries to access the
# archive_url attribute then, we assume they are fine with any archive
# and invoke the oldest archive function.
if not data:
self.oldest()
# If data is still not none then probably there are no
# archive for the requested URL.
if not data or not data["archived_snapshots"]:
while (self.tries < self.max_tries) and (
not data or not data["archived_snapshots"]
):
self.json() # It makes a new API call
data = self.JSON # json() updated the value of JSON attribute
# Even if after we exhausted teh max_tries, then we give up and
# raise exception.
if not data or not data["archived_snapshots"]:
raise ArchiveNotInAvailabilityAPIResponse(
"Archive not found in the availability "
+ "API response, the URL you requested may not have any "
+ "archives yet. You may retry after some time or archive the webpage now."
+ "\nResponse data:\n{response}".format(response=self.response.text)
)
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
return archive_url
def wayback_timestamp(self, **kwargs):
"""
Prepends zero before the year, month, day, hour and minute so that they
are conformable with the YYYYMMDDhhmmss wayback machine timestamp format.
"""
return "".join(
str(kwargs[key]).zfill(2)
for key in ["year", "month", "day", "hour", "minute"]
)
def oldest(self):
"""
Passing the year 1994 should return the oldest archive because
wayback machine was started in May, 1996 and there should be no archive
before the year 1994.
"""
return self.near(year=1994)
def newest(self):
"""
Passing the current UNIX time should be sufficient to get the newest
archive considering the API request-response time delay and also the
database lags on Wayback machine.
"""
return self.near(unix_timestamp=int(time.time()))
def near(
self,
year=None,
month=None,
day=None,
hour=None,
minute=None,
unix_timestamp=None,
):
"""
The main method for this Class, oldest and newest methods are dependent on this
method.
It generates the timestamp based on the input either by calling the
unix_timestamp_to_wayback_timestamp or wayback_timestamp method with
appropriate arguments for their respective parameters.
Adds the timestamp to the payload dictionary.
And finally invoking the json method to make the API call then returns the instance.
"""
if unix_timestamp:
timestamp = self.unix_timestamp_to_wayback_timestamp(unix_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = self.wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
)
self.payload["timestamp"] = timestamp
self.json()
return self

View File

@ -1,229 +0,0 @@
from .snapshot import CdxSnapshot
from .exceptions import WaybackError
from .utils import (
_get_total_pages,
_get_response,
default_user_agent,
_check_filters,
_check_collapses,
_check_match_type,
_add_payload,
)
# TODO : Threading support for pagination API. It's designed for Threading.
# TODO : Add get method here if type is Vaild HTML, SVG other but not - or warc. Test it.
class Cdx:
def __init__(
self,
url,
user_agent=None,
start_timestamp=None,
end_timestamp=None,
filters=[],
match_type=None,
gzip=None,
collapses=[],
limit=None,
):
self.url = str(url).strip()
self.user_agent = str(user_agent) if user_agent else default_user_agent
self.start_timestamp = str(start_timestamp) if start_timestamp else None
self.end_timestamp = str(end_timestamp) if end_timestamp else None
self.filters = filters
_check_filters(self.filters)
self.match_type = str(match_type).strip() if match_type else None
_check_match_type(self.match_type, self.url)
self.gzip = gzip if gzip else True
self.collapses = collapses
_check_collapses(self.collapses)
self.limit = limit if limit else 5000
self.last_api_request_url = None
self.use_page = False
def cdx_api_manager(self, payload, headers, use_page=False):
"""Act as button, we can choose between the normal API and pagination API.
Parameters
----------
self : waybackpy.cdx.Cdx
The instance itself
payload : dict
Get request parameters name value pairs
headers : dict
The headers for making the GET request.
use_page : bool
If True use pagination API else use normal resume key based API.
We have two options to get the snapshots, we use this
method to make a selection between pagination API and
the normal one with Resumption Key, sequential querying
of CDX data. For very large querying (for example domain query),
it may be useful to perform queries in parallel and also estimate
the total size of the query.
read more about the pagination API at:
https://web.archive.org/web/20201228063237/https://github.com/internetarchive/wayback/blob/master/wayback-cdx-server/README.md#pagination-api
if use_page is false if will use the normal sequential query API,
else use the pagination API.
two mutually exclusive cases possible:
1) pagination API is selected
a) get the total number of pages to read, using _get_total_pages()
b) then we use a for loop to get all the pages and yield the response text
2) normal sequential query API is selected.
a) get use showResumeKey=true to ask the API to add a query resumption key
at the bottom of response
b) check if the page has more than 3 lines, if not return the text
c) if it has atleast three lines, we check the second last line for zero length.
d) if the second last line has length zero than we assume that the last line contains
the resumption key, we set the resumeKey and remove the resumeKey from text
e) if the second line has non zero length we return the text as there will no resumption key
f) if we find the resumption key we set the "more" variable status to True which is always set
to False on each iteration. If more is not True the iteration stops and function returns.
"""
endpoint = "https://web.archive.org/cdx/search/cdx"
total_pages = _get_total_pages(self.url, self.user_agent)
# If we only have two or less pages of archives then we care for accuracy
# pagination API can be lagged sometimes
if use_page == True and total_pages >= 2:
blank_pages = 0
for i in range(total_pages):
payload["page"] = str(i)
url, res = _get_response(
endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text
if len(text) == 0:
blank_pages += 1
if blank_pages >= 2:
break
yield text
else:
payload["showResumeKey"] = "true"
payload["limit"] = str(self.limit)
resumeKey = None
more = True
while more:
if resumeKey:
payload["resumeKey"] = resumeKey
url, res = _get_response(
endpoint, params=payload, headers=headers, return_full_url=True
)
self.last_api_request_url = url
text = res.text.strip()
lines = text.splitlines()
more = False
if len(lines) >= 3:
second_last_line = lines[-2]
if len(second_last_line) == 0:
resumeKey = lines[-1].strip()
text = text.replace(resumeKey, "", 1).strip()
more = True
yield text
def snapshots(self):
"""
This function yeilds snapshots encapsulated
in CdxSnapshot for increased usability.
All the get request values are set if the conditions match
And we use logic that if someone's only inputs don't have any
of [start_timestamp, end_timestamp] and don't use any collapses
then we use the pagination API as it returns archives starting
from the first archive and the recent most archive will be on
the last page.
"""
payload = {}
headers = {"User-Agent": self.user_agent}
_add_payload(self, payload)
if not self.start_timestamp or self.end_timestamp:
self.use_page = True
if self.collapses != []:
self.use_page = False
texts = self.cdx_api_manager(payload, headers, use_page=self.use_page)
for text in texts:
if text.isspace() or len(text) <= 1 or not text:
continue
snapshot_list = text.split("\n")
for snapshot in snapshot_list:
if len(snapshot) < 46: # 14 + 32 (timestamp+digest)
continue
properties = {
"urlkey": None,
"timestamp": None,
"original": None,
"mimetype": None,
"statuscode": None,
"digest": None,
"length": None,
}
prop_values = snapshot.split(" ")
prop_values_len = len(prop_values)
properties_len = len(properties)
if prop_values_len != properties_len:
raise WaybackError(
"Snapshot returned by Cdx API has {prop_values_len} properties instead of expected {properties_len} properties.\nInvolved Snapshot : {snapshot}".format(
prop_values_len=prop_values_len,
properties_len=properties_len,
snapshot=snapshot,
)
)
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
yield CdxSnapshot(properties)

194
waybackpy/cdx_api.py Normal file
View File

@ -0,0 +1,194 @@
from .exceptions import WaybackError
from .cdx_snapshot import CDXSnapshot
from .cdx_utils import (
get_total_pages,
get_response,
check_filters,
check_collapses,
check_match_type,
full_url,
)
from .utils import DEFAULT_USER_AGENT
class WaybackMachineCDXServerAPI:
"""
Class that interfaces the CDX server API of the Wayback Machine.
"""
def __init__(
self,
url,
user_agent=DEFAULT_USER_AGENT,
start_timestamp=None, # from, can not use from as it's a keyword
end_timestamp=None, # to, not using to as can not use from
filters=[],
match_type=None,
gzip=None,
collapses=[],
limit=None,
max_tries=3,
):
self.url = str(url).strip().replace(" ", "%20")
self.user_agent = user_agent
self.start_timestamp = str(start_timestamp) if start_timestamp else None
self.end_timestamp = str(end_timestamp) if end_timestamp else None
self.filters = filters
check_filters(self.filters)
self.match_type = str(match_type).strip() if match_type else None
check_match_type(self.match_type, self.url)
self.gzip = gzip if gzip else True
self.collapses = collapses
check_collapses(self.collapses)
self.limit = limit if limit else 5000
self.max_tries = max_tries
self.last_api_request_url = None
self.use_page = False
self.endpoint = "https://web.archive.org/cdx/search/cdx"
def cdx_api_manager(self, payload, headers, use_page=False):
total_pages = get_total_pages(self.url, self.user_agent)
# If we only have two or less pages of archives then we care for more accuracy
# pagination API is lagged sometimes
if use_page is True and total_pages >= 2:
blank_pages = 0
for i in range(total_pages):
payload["page"] = str(i)
url = full_url(self.endpoint, params=payload)
res = get_response(url, headers=headers)
self.last_api_request_url = url
text = res.text
if len(text) == 0:
blank_pages += 1
if blank_pages >= 2:
break
yield text
else:
payload["showResumeKey"] = "true"
payload["limit"] = str(self.limit)
resumeKey = None
more = True
while more:
if resumeKey:
payload["resumeKey"] = resumeKey
url = full_url(self.endpoint, params=payload)
res = get_response(url, headers=headers)
self.last_api_request_url = url
text = res.text.strip()
lines = text.splitlines()
more = False
if len(lines) >= 3:
second_last_line = lines[-2]
if len(second_last_line) == 0:
resumeKey = lines[-1].strip()
text = text.replace(resumeKey, "", 1).strip()
more = True
yield text
def add_payload(self, payload):
if self.start_timestamp:
payload["from"] = self.start_timestamp
if self.end_timestamp:
payload["to"] = self.end_timestamp
if self.gzip is not True:
payload["gzip"] = "false"
if self.match_type:
payload["matchType"] = self.match_type
if self.filters and len(self.filters) > 0:
for i, f in enumerate(self.filters):
payload["filter" + str(i)] = f
if self.collapses and len(self.collapses) > 0:
for i, f in enumerate(self.collapses):
payload["collapse" + str(i)] = f
# Don't need to return anything as it's dictionary.
payload["url"] = self.url
def snapshots(self):
payload = {}
headers = {"User-Agent": self.user_agent}
self.add_payload(payload)
if not self.start_timestamp or self.end_timestamp:
self.use_page = True
if self.collapses != []:
self.use_page = False
texts = self.cdx_api_manager(payload, headers, use_page=self.use_page)
for text in texts:
if text.isspace() or len(text) <= 1 or not text:
continue
snapshot_list = text.split("\n")
for snapshot in snapshot_list:
if len(snapshot) < 46: # 14 + 32 (timestamp+digest)
continue
properties = {
"urlkey": None,
"timestamp": None,
"original": None,
"mimetype": None,
"statuscode": None,
"digest": None,
"length": None,
}
prop_values = snapshot.split(" ")
prop_values_len = len(prop_values)
properties_len = len(properties)
if prop_values_len != properties_len:
raise WaybackError(
"Snapshot returned by Cdx API has {prop_values_len} properties".format(
prop_values_len=prop_values_len
)
+ " instead of expected {properties_len} ".format(
properties_len=properties_len
)
+ "properties.\nProblematic Snapshot : {snapshot}".format(
snapshot=snapshot
)
)
(
properties["urlkey"],
properties["timestamp"],
properties["original"],
properties["mimetype"],
properties["statuscode"],
properties["digest"],
properties["length"],
) = prop_values
yield CDXSnapshot(properties)

View File

@ -1,26 +1,16 @@
from datetime import datetime
class CdxSnapshot:
class CDXSnapshot:
"""
This class encapsulates the snapshots for greater usability.
Raw Snapshot data looks like:
org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
Class for the CDX snapshot lines returned by the CDX API,
Each valid line of the CDX API is casted to an CDXSnapshot object
by the CDX API interface.
This provides the end-user the ease of using the data as attributes
of the CDXSnapshot.
"""
def __init__(self, properties):
"""
Parameters
----------
self : waybackpy.snapshot.CdxSnapshot
The instance itself
properties : dict
Properties is a dict containg all of the 7 cdx snapshot properties.
"""
self.urlkey = properties["urlkey"]
self.timestamp = properties["timestamp"]
self.datetime_timestamp = datetime.strptime(self.timestamp, "%Y%m%d%H%M%S")
@ -34,12 +24,6 @@ class CdxSnapshot:
)
def __str__(self):
"""Returns the Cdx snapshot line.
Output format:
org,archive)/ 20080126045828 http://github.com text/html 200 Q4YULN754FHV2U6Q5JUT6Q2P57WEWNNY 1415
"""
return "{urlkey} {timestamp} {original} {mimetype} {statuscode} {digest} {length}".format(
urlkey=self.urlkey,
timestamp=self.timestamp,

128
waybackpy/cdx_utils.py Normal file
View File

@ -0,0 +1,128 @@
import re
import requests
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .exceptions import WaybackError
from .utils import DEFAULT_USER_AGENT
def get_total_pages(url, user_agent=DEFAULT_USER_AGENT):
endpoint = "https://web.archive.org/cdx/search/cdx?"
payload = {"showNumPages": "true", "url": str(url)}
headers = {"User-Agent": user_agent}
request_url = full_url(endpoint, params=payload)
response = get_response(request_url, headers=headers)
return int(response.text.strip())
def full_url(endpoint, params):
if not params:
return endpoint
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = (
full_url
+ amp
+ "{key}={val}".format(key=key, val=requests.utils.quote(str(val)))
)
return full_url
def get_response(
url,
headers=None,
retries=5,
backoff_factor=0.5,
no_raise_on_redirects=False,
):
session = requests.Session()
retries = Retry(
total=retries,
backoff_factor=backoff_factor,
status_forcelist=[500, 502, 503, 504],
)
session.mount("https://", HTTPAdapter(max_retries=retries))
try:
response = session.get(url, headers=headers)
session.close()
return response
except Exception as e:
reason = str(e)
exc_message = "Error while retrieving {url}.\n{reason}".format(
url=url, reason=reason
)
exc = WaybackError(exc_message)
exc.__cause__ = e
raise exc
def check_filters(filters):
if not isinstance(filters, list):
raise WaybackError("filters must be a list.")
# [!]field:regex
for _filter in filters:
try:
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):(.*)",
_filter,
)
match.group(1)
match.group(2)
except Exception:
exc_message = (
"Filter '{_filter}' is not following the cdx filter syntax.".format(
_filter=_filter
)
)
raise WaybackError(exc_message)
def check_collapses(collapses):
if not isinstance(collapses, list):
raise WaybackError("collapses must be a list.")
if len(collapses) == 0:
return
for collapse in collapses:
try:
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?",
collapse,
)
match.group(1)
if 2 == len(match.groups()):
match.group(2)
except Exception:
exc_message = "collapse argument '{collapse}' is not following the cdx collapse syntax.".format(
collapse=collapse
)
raise WaybackError(exc_message)
def check_match_type(match_type, url):
if not match_type:
return
if "*" in url:
raise WaybackError(
"Can not use wildcard in the URL along with the match_type arguments."
)
legal_match_type = ["exact", "prefix", "host", "domain"]
if match_type not in legal_match_type:
exc_message = "{match_type} is not an allowed match type.\nUse one from 'exact', 'prefix', 'host' or 'domain'".format(
match_type=match_type
)
raise WaybackError(exc_message)

View File

@ -1,334 +1,347 @@
import os
import click
import re
import sys
import json
import os
import json as JSON
import random
import string
import argparse
from .wrapper import Url
from .exceptions import WaybackError
from .__version__ import __version__
from .utils import DEFAULT_USER_AGENT
from .cdx_api import WaybackMachineCDXServerAPI
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .wrapper import Url
def _save(obj):
try:
return obj.save()
except Exception as err:
e = str(err)
m = re.search(r"Header:\n(.*)", e)
if m:
header = m.group(1)
if "No archive URL found in the API response" in e:
return (
"\n[waybackpy] Can not save/archive your link.\n[waybackpy] This "
"could happen because either your waybackpy ({version}) is likely out of "
"date or Wayback Machine is malfunctioning.\n[waybackpy] Visit "
"https://github.com/akamhy/waybackpy for the latest version of "
"waybackpy.\n[waybackpy] API response Header :\n{header}".format(
version=__version__, header=header
@click.command()
@click.option(
"-u", "--url", help="URL on which Wayback machine operations are to be performed."
)
@click.option(
"-ua",
"--user-agent",
"--user_agent",
default=DEFAULT_USER_AGENT,
help="User agent, default user agent is '%s' " % DEFAULT_USER_AGENT,
)
@click.option(
"-v", "--version", is_flag=True, default=False, help="Print waybackpy version."
)
@click.option(
"-n",
"--newest",
"-au",
"--archive_url",
"--archive-url",
default=False,
is_flag=True,
help="Fetch the newest archive of the specified URL",
)
@click.option(
"-o",
"--oldest",
default=False,
is_flag=True,
help="Fetch the oldest archive of the specified URL",
)
@click.option(
"-j",
"--json",
default=False,
is_flag=True,
help="Spit out the JSON data for availability_api commands.",
)
@click.option(
"-N", "--near", default=False, is_flag=True, help="Archive near specified time."
)
@click.option("-Y", "--year", type=click.IntRange(1994, 9999), help="Year in integer.")
@click.option("-M", "--month", type=click.IntRange(1, 12), help="Month in integer.")
@click.option("-D", "--day", type=click.IntRange(1, 31), help="Day in integer.")
@click.option("-H", "--hour", type=click.IntRange(0, 24), help="Hour in integer.")
@click.option("-MIN", "--minute", type=click.IntRange(0, 60), help="Minute in integer.")
@click.option(
"-s",
"--save",
default=False,
is_flag=True,
help="Save the specified URL's webpage and print the archive URL.",
)
@click.option(
"-h",
"--headers",
default=False,
is_flag=True,
help="Spit out the headers data for save_api commands.",
)
@click.option(
"-ku",
"--known-urls",
"--known_urls",
default=False,
is_flag=True,
help="List known URLs. Uses CDX API.",
)
@click.option(
"-sub",
"--subdomain",
default=False,
is_flag=True,
help="Use with '--known_urls' to include known URLs for subdomains.",
)
@click.option(
"-f",
"--file",
default=False,
is_flag=True,
help="Use with '--known_urls' to save the URLs in file at current directory.",
)
@click.option(
"-c",
"--cdx",
default=False,
is_flag=True,
help="Spit out the headers data for save_api commands.",
)
@click.option(
"-st",
"--start-timestamp",
"--start_timestamp",
)
@click.option(
"-et",
"--end-timestamp",
"--end_timestamp",
)
@click.option(
"-f",
"--filters",
multiple=True,
)
@click.option(
"-mt",
"--match-type",
"--match_type",
)
@click.option(
"-gz",
"--gzip",
)
@click.option(
"-c",
"--collapses",
multiple=True,
)
@click.option(
"-l",
"--limit",
)
@click.option(
"-cp",
"--cdx-print",
"--cdx_print",
multiple=True,
)
def main(
url,
user_agent,
version,
newest,
oldest,
json,
near,
year,
month,
day,
hour,
minute,
save,
headers,
known_urls,
subdomain,
file,
cdx,
start_timestamp,
end_timestamp,
filters,
match_type,
gzip,
collapses,
limit,
cdx_print,
):
"""
┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
waybackpy : Python package & CLI tool that interfaces the Wayback Machine API
Released under the MIT License.
License @ https://github.com/akamhy/waybackpy/blob/master/LICENSE
Copyright (c) 2020 waybackpy contributors. Contributors list @
https://github.com/akamhy/waybackpy/graphs/contributors
https://github.com/akamhy/waybackpy
https://pypi.org/project/waybackpy
"""
if version:
click.echo("waybackpy version %s" % __version__)
return
if not url:
click.echo("No URL detected. Please pass an URL.")
return
def echo_availability_api(availability_api_instance):
click.echo("Archive URL:")
if not availability_api_instance.archive_url:
archive_url = (
"NO ARCHIVE FOUND - The requested URL is probably "
+ "not yet archived or if the URL was recently archived then it is "
+ "not yet available via the Wayback Machine's availability API "
+ "because of database lag and should be available after some time."
)
else:
archive_url = availability_api_instance.archive_url
click.echo(archive_url)
if json:
click.echo("JSON response:")
click.echo(JSON.dumps(availability_api_instance.JSON))
availability_api = WaybackMachineAvailabilityAPI(url, user_agent=user_agent)
if oldest:
availability_api.oldest()
echo_availability_api(availability_api)
return
if newest:
availability_api.newest()
echo_availability_api(availability_api)
return
if near:
near_args = {}
keys = ["year", "month", "day", "hour", "minute"]
args_arr = [year, month, day, hour, minute]
for key, arg in zip(keys, args_arr):
if arg:
near_args[key] = arg
availability_api.near(**near_args)
echo_availability_api(availability_api)
return
if save:
save_api = WaybackMachineSaveAPI(url, user_agent=user_agent)
save_api.save()
click.echo("Archive URL:")
click.echo(save_api.archive_url)
click.echo("Cached save:")
click.echo(save_api.cached_save)
if headers:
click.echo("Save API headers:")
click.echo(save_api.headers)
return
def save_urls_on_file(url_gen):
domain = None
sys_random = random.SystemRandom()
uid = "".join(
sys_random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url_count = 0
for url in url_gen:
url_count += 1
if not domain:
match = re.search("https?://([A-Za-z_0-9.-]+).*", url)
domain = "domain-unknown"
if match:
domain = match.group(1)
file_name = "{domain}-urls-{uid}.txt".format(domain=domain, uid=uid)
file_path = os.path.join(os.getcwd(), file_name)
if not os.path.isfile(file_path):
open(file_path, "w+").close()
with open(file_path, "a") as f:
f.write("{url}\n".format(url=url))
click.echo(url)
if url_count > 0:
click.echo(
"\n\n'{file_name}' saved in current working directory".format(
file_name=file_name
)
)
if "URL cannot be archived by wayback machine as it is a redirect" in e:
return ("URL cannot be archived by wayback machine as it is a redirect")
raise WaybackError(err)
else:
click.echo("No known URLs found. Please try a diffrent input!")
if known_urls:
wayback = Url(url, user_agent)
url_gen = wayback.known_urls(subdomain=subdomain)
def _archive_url(obj):
return obj.archive_url
if file:
return save_urls_on_file(url_gen)
else:
for url in url_gen:
click.echo(url)
if cdx:
filters = list(filters)
collapses = list(collapses)
cdx_print = list(cdx_print)
def _json(obj):
return json.dumps(obj.JSON)
def no_archive_handler(e, obj):
m = re.search(r"archive\sfor\s\'(.*?)\'\stry", str(e))
if m:
url = m.group(1)
ua = obj.user_agent
if "github.com/akamhy/waybackpy" in ua:
ua = "YOUR_USER_AGENT_HERE"
return (
"\n[Waybackpy] Can not find archive for '{url}'.\n[Waybackpy] You can"
" save the URL using the following command:\n[Waybackpy] waybackpy --"
'user_agent "{user_agent}" --url "{url}" --save'.format(
url=url, user_agent=ua
)
)
raise WaybackError(e)
def _oldest(obj):
try:
return obj.oldest()
except Exception as e:
return no_archive_handler(e, obj)
def _newest(obj):
try:
return obj.newest()
except Exception as e:
return no_archive_handler(e, obj)
def _total_archives(obj):
return obj.total_archives()
def _near(obj, args):
_near_args = {}
args_arr = [args.year, args.month, args.day, args.hour, args.minute]
keys = ["year", "month", "day", "hour", "minute"]
for key, arg in zip(keys, args_arr):
if arg:
_near_args[key] = arg
try:
return obj.near(**_near_args)
except Exception as e:
return no_archive_handler(e, obj)
def _save_urls_on_file(url_gen):
domain = None
sys_random = random.SystemRandom()
uid = "".join(
sys_random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
url_count = 0
for url in url_gen:
url_count += 1
if not domain:
m = re.search("https?://([A-Za-z_0-9.-]+).*", url)
domain = "domain-unknown"
if m:
domain = m.group(1)
file_name = "{domain}-urls-{uid}.txt".format(domain=domain, uid=uid)
file_path = os.path.join(os.getcwd(), file_name)
if not os.path.isfile(file_path):
open(file_path, "w+").close()
with open(file_path, "a") as f:
f.write("{url}\n".format(url=url))
print(url)
if url_count > 0:
return "\n\n'{file_name}' saved in current working directory".format(
file_name=file_name
)
else:
return "No known URLs found. Please try a diffrent input!"
def _known_urls(obj, args):
"""
Known urls for a domain.
"""
subdomain = True if args.subdomain else False
url_gen = obj.known_urls(subdomain=subdomain)
if args.file:
return _save_urls_on_file(url_gen)
else:
for url in url_gen:
print(url)
return "\n"
def _get(obj, args):
if args.get.lower() == "url":
return obj.get()
if args.get.lower() == "archive_url":
return obj.get(obj.archive_url)
if args.get.lower() == "oldest":
return obj.get(obj.oldest())
if args.get.lower() == "latest" or args.get.lower() == "newest":
return obj.get(obj.newest())
if args.get.lower() == "save":
return obj.get(obj.save())
return "Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
\n3) oldest - get the source code of the oldest archive for the supplied url.\
\n4) newest - get the source code of the newest archive for the supplied url.\
\n5) save - Create a new archive and get the source code of this new archive for the supplied url."
def args_handler(args):
if args.version:
return "waybackpy version {version}".format(version=__version__)
if not args.url:
return "waybackpy {version} \nSee 'waybackpy --help' for help using this tool.".format(
version=__version__
cdx_api = WaybackMachineCDXServerAPI(
url,
user_agent=user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
filters=filters,
match_type=match_type,
gzip=gzip,
collapses=collapses,
limit=limit,
)
obj = Url(args.url)
if args.user_agent:
obj = Url(args.url, args.user_agent)
snapshots = cdx_api.snapshots()
if args.save:
output = _save(obj)
elif args.archive_url:
output = _archive_url(obj)
elif args.json:
output = _json(obj)
elif args.oldest:
output = _oldest(obj)
elif args.newest:
output = _newest(obj)
elif args.known_urls:
output = _known_urls(obj, args)
elif args.total:
output = _total_archives(obj)
elif args.near:
return _near(obj, args)
elif args.get:
output = _get(obj, args)
else:
output = (
"You only specified the URL. But you also need to specify the operation."
"\nSee 'waybackpy --help' for help using this tool."
)
return output
def add_requiredArgs(requiredArgs):
requiredArgs.add_argument(
"--url", "-u", help="URL on which Wayback machine operations would occur"
)
def add_userAgentArg(userAgentArg):
help_text = 'User agent, default user_agent is "waybackpy python package - https://github.com/akamhy/waybackpy"'
userAgentArg.add_argument("--user_agent", "-ua", help=help_text)
def add_saveArg(saveArg):
saveArg.add_argument(
"--save", "-s", action="store_true", help="Save the URL on the Wayback machine"
)
def add_auArg(auArg):
auArg.add_argument(
"--archive_url",
"-au",
action="store_true",
help="Get the latest archive URL, alias for --newest",
)
def add_jsonArg(jsonArg):
jsonArg.add_argument(
"--json",
"-j",
action="store_true",
help="JSON data of the availability API request",
)
def add_oldestArg(oldestArg):
oldestArg.add_argument(
"--oldest",
"-o",
action="store_true",
help="Oldest archive for the specified URL",
)
def add_newestArg(newestArg):
newestArg.add_argument(
"--newest",
"-n",
action="store_true",
help="Newest archive for the specified URL",
)
def add_totalArg(totalArg):
totalArg.add_argument(
"--total",
"-t",
action="store_true",
help="Total number of archives for the specified URL",
)
def add_getArg(getArg):
getArg.add_argument(
"--get",
"-g",
help="Prints the source code of the supplied url. Use '--get help' for extended usage",
)
def add_knownUrlArg(knownUrlArg):
knownUrlArg.add_argument(
"--known_urls", "-ku", action="store_true", help="URLs known for the domain."
)
help_text = "Use with '--known_urls' to include known URLs for subdomains."
knownUrlArg.add_argument("--subdomain", "-sub", action="store_true", help=help_text)
knownUrlArg.add_argument(
"--file",
"-f",
action="store_true",
help="Save the URLs in file at current directory.",
)
def add_nearArg(nearArg):
nearArg.add_argument(
"--near", "-N", action="store_true", help="Archive near specified time"
)
def add_nearArgs(nearArgs):
nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
nearArgs.add_argument("--hour", "-H", type=int, help="Hour in intege")
nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")
def parse_args(argv):
parser = argparse.ArgumentParser()
add_requiredArgs(parser.add_argument_group("URL argument (required)"))
add_userAgentArg(parser.add_argument_group("User Agent"))
add_saveArg(parser.add_argument_group("Create new archive/save URL"))
add_auArg(parser.add_argument_group("Get the latest Archive"))
add_jsonArg(parser.add_argument_group("Get the JSON data"))
add_oldestArg(parser.add_argument_group("Oldest archive"))
add_newestArg(parser.add_argument_group("Newest archive"))
add_totalArg(parser.add_argument_group("Total number of archives"))
add_getArg(parser.add_argument_group("Get source code"))
add_knownUrlArg(
parser.add_argument_group(
"URLs known and archived to Waybcak Machine for the site."
)
)
add_nearArg(parser.add_argument_group("Archive close to time specified"))
add_nearArgs(parser.add_argument_group("Arguments that are used only with --near"))
parser.add_argument(
"--version", "-v", action="store_true", help="Waybackpy version"
)
return parser.parse_args(argv[1:])
def main(argv=None):
argv = sys.argv if argv is None else argv
print(args_handler(parse_args(argv)))
for snapshot in snapshots:
if len(cdx_print) == 0:
click.echo(snapshot)
else:
output_string = ""
if "urlkey" or "url-key" or "url_key" in cdx_print:
output_string = output_string + snapshot.urlkey + " "
if "timestamp" or "time-stamp" or "time_stamp" in cdx_print:
output_string = output_string + snapshot.timestamp + " "
if "original" in cdx_print:
output_string = output_string + snapshot.original + " "
if "original" in cdx_print:
output_string = output_string + snapshot.original + " "
if "mimetype" or "mime-type" or "mime_type" in cdx_print:
output_string = output_string + snapshot.mimetype + " "
if "statuscode" or "status-code" or "status_code" in cdx_print:
output_string = output_string + snapshot.statuscode + " "
if "digest" in cdx_print:
output_string = output_string + snapshot.digest + " "
if "length" in cdx_print:
output_string = output_string + snapshot.length + " "
if "archiveurl" or "archive-url" or "archive_url" in cdx_print:
output_string = output_string + snapshot.archive_url + " "
click.echo(output_string)
if __name__ == "__main__":
sys.exit(main(sys.argv))
main()

View File

@ -10,6 +10,8 @@ class WaybackError(Exception):
Raised when Waybackpy can not return what you asked for.
1) Wayback Machine API Service is unreachable/down.
2) You passed illegal arguments.
All other exceptions are inherited from this class.
"""
@ -24,3 +26,27 @@ class URLError(Exception):
"""
Raised when malformed URLs are passed as arguments.
"""
class MaximumRetriesExceeded(WaybackError):
"""
MaximumRetriesExceeded
"""
class MaximumSaveRetriesExceeded(MaximumRetriesExceeded):
"""
MaximumSaveRetriesExceeded
"""
class ArchiveNotInAvailabilityAPIResponse(WaybackError):
"""
Could not parse the archive in the JSON response of the availability API.
"""
class InvalidJSONInAvailabilityAPIResponse(WaybackError):
"""
availability api returned invalid JSON
"""

186
waybackpy/save_api.py Normal file
View File

@ -0,0 +1,186 @@
import re
import time
import requests
from datetime import datetime
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
from .utils import DEFAULT_USER_AGENT
from .exceptions import MaximumSaveRetriesExceeded
class WaybackMachineSaveAPI:
"""
WaybackMachineSaveAPI class provides an interface for saving URLs on the
Wayback Machine.
"""
def __init__(self, url, user_agent=DEFAULT_USER_AGENT, max_tries=8):
self.url = str(url).strip().replace(" ", "%20")
self.request_url = "https://web.archive.org/save/" + self.url
self.user_agent = user_agent
self.request_headers = {"User-Agent": self.user_agent}
self.max_tries = max_tries
self.total_save_retries = 5
self.backoff_factor = 0.5
self.status_forcelist = [500, 502, 503, 504]
self._archive_url = None
self.instance_birth_time = datetime.utcnow()
@property
def archive_url(self):
"""
Returns the archive URL is already cached by _archive_url
else invoke the save method to save the archive which returns the
archive thus we return the methods return value.
"""
if self._archive_url:
return self._archive_url
else:
return self.save()
def get_save_request_headers(self):
"""
Creates a session and tries 'retries' number of times to
retrieve the archive.
If successful in getting the response, sets the headers, status_code
and response_url attributes.
The archive is usually in the headers but it can also be the response URL
as the Wayback Machine redirects to the archive after a successful capture
of the webpage.
Wayback Machine's save API is known
to be very unreliable thus if it fails first check opening
the response URL yourself in the browser.
"""
session = requests.Session()
retries = Retry(
total=self.total_save_retries,
backoff_factor=self.backoff_factor,
status_forcelist=self.status_forcelist,
)
session.mount("https://", HTTPAdapter(max_retries=retries))
self.response = session.get(self.request_url, headers=self.request_headers)
self.headers = (
self.response.headers
) # <class 'requests.structures.CaseInsensitiveDict'>
self.status_code = self.response.status_code
self.response_url = self.response.url
session.close()
def archive_url_parser(self):
"""
Three regexen (like oxen?) are used to search for the
archive URL in the headers and finally look in the response URL
for the archive URL.
"""
regex1 = r"Content-Location: (/web/[0-9]{14}/.*)"
match = re.search(regex1, str(self.headers))
if match:
return "https://web.archive.org" + match.group(1)
regex2 = r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>"
match = re.search(regex2, str(self.headers))
if match:
return "https://" + match.group(1)
regex3 = r"X-Cache-Key:\shttps(.*)[A-Z]{2}"
match = re.search(regex3, str(self.headers))
if match:
return "https" + match.group(1)
if self.response_url:
self.response_url = self.response_url.strip()
if "web.archive.org/web" in self.response_url:
regex = r"web\.archive\.org/web/(?:[0-9]*?)/(?:.*)$"
match = re.search(regex, self.response_url)
if match:
return "https://" + match.group(0)
def sleep(self, tries):
"""
Ensure that the we wait some time before succesive retries so that we
don't waste the retries before the page is even captured by the Wayback
Machine crawlers also ensures that we are not putting too much load on
the Wayback Machine's save API.
If tries are multiple of 3 sleep 10 seconds else sleep 5 seconds.
"""
sleep_seconds = 5
if tries % 3 == 0:
sleep_seconds = 10
time.sleep(sleep_seconds)
def timestamp(self):
"""
Read the timestamp off the archive URL and convert the Wayback Machine
timestamp to datetime object.
Also check if the time on archive is URL and compare it to instance birth
time.
If time on the archive is older than the instance creation time set the cached_save
to True else set it to False. The flag can be used to check if the Wayback Machine
didn't serve a Cached URL. It is quite common for the Wayback Machine to serve
cached archive if last archive was captured before last 45 minutes.
"""
m = re.search(
r"https?://web\.archive.org/web/([0-9]{14})/http", self._archive_url
)
string_timestamp = m.group(1)
timestamp = datetime.strptime(string_timestamp, "%Y%m%d%H%M%S")
timestamp_unixtime = time.mktime(timestamp.timetuple())
instance_birth_time_unixtime = time.mktime(self.instance_birth_time.timetuple())
if timestamp_unixtime < instance_birth_time_unixtime:
self.cached_save = True
else:
self.cached_save = False
return timestamp
def save(self):
"""
Calls the SavePageNow API of the Wayback Machine with required parameters
and headers to save the URL.
Raises MaximumSaveRetriesExceeded is maximum retries are exhausted but still
we were unable to retrieve the archive from the Wayback Machine.
"""
self.saved_archive = None
tries = 0
while True:
tries += 1
if tries >= self.max_tries:
raise MaximumSaveRetriesExceeded(
"Tried %s times but failed to save and retrieve the" % str(tries)
+ " archive for %s.\nResponse URL:\n%s \nResponse Header:\n%s\n"
% (self.url, self.response_url, str(self.headers)),
)
if not self.saved_archive:
if tries > 1:
self.sleep(tries)
self.get_save_request_headers()
self.saved_archive = self.archive_url_parser()
if not self.saved_archive:
continue
else:
self._archive_url = self.saved_archive
self.timestamp()
return self.saved_archive

View File

@ -1,564 +1,12 @@
import re
import time
import requests
from datetime import datetime
from .exceptions import WaybackError, URLError, RedirectSaveError
from .__version__ import __version__
from urllib3.util.retry import Retry
from requests.adapters import HTTPAdapter
quote = requests.utils.quote
default_user_agent = "waybackpy python package - https://github.com/akamhy/waybackpy"
DEFAULT_USER_AGENT = "waybackpy %s - https://github.com/akamhy/waybackpy" % __version__
def _latest_version(package_name, headers):
"""Returns the latest version of package_name.
Parameters
----------
package_name : str
The name of the python package
headers : dict
Headers that will be used while making get requests
Return type is str
Use API <https://pypi.org/pypi/> to get the latest version of
waybackpy, but can be used to get latest version of any package
on PyPi.
"""
def latest_version(package_name, user_agent=DEFAULT_USER_AGENT):
request_url = "https://pypi.org/pypi/" + package_name + "/json"
response = _get_response(request_url, headers=headers)
headers = {"User-Agent": user_agent}
response = requests.get(request_url, headers=headers)
data = response.json()
return data["info"]["version"]
def _unix_timestamp_to_wayback_timestamp(unix_timestamp):
"""Returns unix timestamp converted to datetime.datetime
Parameters
----------
unix_timestamp : str, int or float
Unix-timestamp that needs to be converted to datetime.datetime
Converts and returns input unix_timestamp to datetime.datetime object.
Does not matter if unix_timestamp is str, float or int.
"""
return datetime.utcfromtimestamp(int(unix_timestamp)).strftime("%Y%m%d%H%M%S")
def _add_payload(instance, payload):
"""Adds payload from instance that can be used to make get requests.
Parameters
----------
instance : waybackpy.cdx.Cdx
instance of the Cdx class
payload : dict
A dict onto which we need to add keys and values based on instance.
instance is object of Cdx class and it contains the data required to fill
the payload dictionary.
"""
if instance.start_timestamp:
payload["from"] = instance.start_timestamp
if instance.end_timestamp:
payload["to"] = instance.end_timestamp
if instance.gzip != True:
payload["gzip"] = "false"
if instance.match_type:
payload["matchType"] = instance.match_type
if instance.filters and len(instance.filters) > 0:
for i, f in enumerate(instance.filters):
payload["filter" + str(i)] = f
if instance.collapses and len(instance.collapses) > 0:
for i, f in enumerate(instance.collapses):
payload["collapse" + str(i)] = f
# Don't need to return anything as it's dictionary.
payload["url"] = instance.url
def _timestamp_manager(timestamp, data):
"""Returns the timestamp.
Parameters
----------
timestamp : datetime.datetime
datetime object
data : dict
A python dictionary, which is loaded JSON os the availability API.
Return type:
datetime.datetime
If timestamp is not None then sets the value to timestamp itself.
If timestamp is None the returns the value from the last fetched API data.
If not timestamp and can not read the archived_snapshots form data return datetime.max
"""
if timestamp:
return timestamp
if not data["archived_snapshots"]:
return datetime.max
return datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
def _check_match_type(match_type, url):
"""Checks the validity of match_type parameter of the CDX GET requests.
Parameters
----------
match_type : list
list that may contain any or all from ["exact", "prefix", "host", "domain"]
See https://github.com/akamhy/waybackpy/wiki/Python-package-docs#url-match-scope
url : str
The URL used to create the waybackpy Url object.
If not vaild match_type raise Exception.
"""
if not match_type:
return
if "*" in url:
raise WaybackError("Can not use wildcard with match_type argument")
legal_match_type = ["exact", "prefix", "host", "domain"]
if match_type not in legal_match_type:
exc_message = "{match_type} is not an allowed match type.\nUse one from 'exact', 'prefix', 'host' or 'domain'".format(
match_type=match_type
)
raise WaybackError(exc_message)
def _check_collapses(collapses):
"""Checks the validity of collapse parameter of the CDX GET request.
One or more field or field:N to 'collapses=[]' where
field is one of (urlkey, timestamp, original, mimetype, statuscode,
digest and length) and N is the first N characters of field to test.
Parameters
----------
collapses : list
If not vaild collapses raise Exception.
"""
if not isinstance(collapses, list):
raise WaybackError("collapses must be a list.")
if len(collapses) == 0:
return
for collapse in collapses:
try:
match = re.search(
r"(urlkey|timestamp|original|mimetype|statuscode|digest|length)(:?[0-9]{1,99})?",
collapse,
)
field = match.group(1)
N = None
if 2 == len(match.groups()):
N = match.group(2)
if N:
if not (field + N == collapse):
raise Exception
else:
if not (field == collapse):
raise Exception
except Exception:
exc_message = "collapse argument '{collapse}' is not following the cdx collapse syntax.".format(
collapse=collapse
)
raise WaybackError(exc_message)
def _check_filters(filters):
"""Checks the validity of filter parameter of the CDX GET request.
Any number of filter params of the following form may be specified:
filters=["[!]field:regex"] may be specified..
Parameters
----------
filters : list
If not vaild filters raise Exception.
"""
if not isinstance(filters, list):
raise WaybackError("filters must be a list.")
# [!]field:regex
for _filter in filters:
try:
match = re.search(
r"(\!?(?:urlkey|timestamp|original|mimetype|statuscode|digest|length)):(.*)",
_filter,
)
key = match.group(1)
val = match.group(2)
except Exception:
exc_message = (
"Filter '{_filter}' is not following the cdx filter syntax.".format(
_filter=_filter
)
)
raise WaybackError(exc_message)
def _cleaned_url(url):
"""Sanatize the url
Remove and replace illegal whitespace characters from the URL.
"""
return str(url).strip().replace(" ", "%20")
def _url_check(url):
"""
Check for common URL problems.
What we are checking:
1) '.' in self.url, no url that ain't '.' in it.
If you known any others, please create a PR on the github repo.
"""
if "." not in url:
exc_message = "'{url}' is not a vaild URL.".format(url=url)
raise URLError(exc_message)
def _full_url(endpoint, params):
"""API endpoint + GET parameters = full_url
Parameters
----------
endpoint : str
The API endpoint
params : dict
Dictionary that has name-value pairs.
Return type is str
"""
if not params:
return endpoint
full_url = endpoint if endpoint.endswith("?") else (endpoint + "?")
for key, val in params.items():
key = "filter" if key.startswith("filter") else key
key = "collapse" if key.startswith("collapse") else key
amp = "" if full_url.endswith("?") else "&"
full_url = full_url + amp + "{key}={val}".format(key=key, val=quote(str(val)))
return full_url
def _get_total_pages(url, user_agent):
"""
If showNumPages is passed in cdx API, it returns
'number of archive pages'and each page has many archives.
This func returns number of pages of archives (type int).
"""
total_pages_url = (
"https://web.archive.org/cdx/search/cdx?url={url}&showNumPages=true".format(
url=url
)
)
headers = {"User-Agent": user_agent}
return int((_get_response(total_pages_url, headers=headers).text).strip())
def _archive_url_parser(
header, url, latest_version=__version__, instance=None, response=None
):
"""Returns the archive after parsing it from the response header.
Parameters
----------
header : str
The response header of WayBack Machine's Save API
url : str
The input url, the one used to created the Url object.
latest_version : str
The latest version of waybackpy (default is __version__)
instance : waybackpy.wrapper.Url
Instance of Url class
The wayback machine's save API doesn't
return JSON response, we are required
to read the header of the API response
and find the archive URL.
This method has some regular expressions
that are used to search for the archive url
in the response header of Save API.
Two cases are possible:
1) Either we find the archive url in
the header.
2) Or we didn't find the archive url in
API header.
If we found the archive URL we return it.
Return format:
web.archive.org/web/<TIMESTAMP>/<URL>
And if we couldn't find it, we raise
WaybackError with an error message.
"""
if "save redirected" in header and instance:
time.sleep(60) # makeup for archive time
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=now.tm_year,
month=now.tm_mon,
day=now.tm_mday,
hour=now.tm_hour,
minute=now.tm_min,
)
return_str = "web.archive.org/web/{timestamp}/{url}".format(
timestamp=timestamp, url=url
)
url = "https://" + return_str
headers = {"User-Agent": instance.user_agent}
res = _get_response(url, headers=headers)
if res.status_code < 400:
return "web.archive.org/web/{timestamp}/{url}".format(
timestamp=timestamp, url=url
)
# Regex1
m = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
if m:
return "web.archive.org" + m.group(1)
# Regex2
m = re.search(
r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
)
if m:
return m.group(1)
# Regex3
m = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
if m:
return m.group(1)
if response:
if response.url:
if "web.archive.org/web" in response.url:
m = re.search(
r"web\.archive\.org/web/(?:[0-9]*?)/(?:.*)$",
str(response.url).strip(),
)
if m:
return m.group(0)
if instance:
newest_archive = None
try:
newest_archive = instance.newest()
except WaybackError:
pass # We don't care as this is a save request
if newest_archive:
minutes_old = (
datetime.utcnow() - newest_archive.timestamp
).total_seconds() / 60.0
if minutes_old <= 30:
archive_url = newest_archive.archive_url
m = re.search(r"web\.archive\.org/web/[0-9]{14}/.*", archive_url)
if m:
instance.cached_save = True
return m.group(0)
if __version__ == latest_version:
exc_message = (
"No archive URL found in the API response. "
"If '{url}' can be accessed via your web browser then either "
"Wayback Machine is malfunctioning or it refused to archive your URL."
"\nHeader:\n{header}".format(url=url, header=header)
)
if "save redirected" == header.strip():
raise RedirectSaveError(
"URL cannot be archived by wayback machine as it is a redirect.\nHeader:\n{header}".format(
header=header
)
)
else:
exc_message = (
"No archive URL found in the API response. "
"If '{url}' can be accessed via your web browser then either "
"this version of waybackpy ({version}) is out of date or WayBack "
"Machine is malfunctioning. Visit 'https://github.com/akamhy/waybackpy' "
"for the latest version of waybackpy.\nHeader:\n{header}".format(
url=url, version=__version__, header=header
)
)
raise WaybackError(exc_message)
def _wayback_timestamp(**kwargs):
"""Returns a valid waybackpy timestamp.
The standard archive URL format is
https://web.archive.org/web/20191214041711/https://www.youtube.com
If we break it down in three parts:
1 ) The start (https://web.archive.org/web/)
2 ) timestamp (20191214041711)
3 ) https://www.youtube.com, the original URL
The near method of Url class in wrapper.py takes year, month, day, hour
and minute as arguments, their type is int.
This method takes those integers and converts it to
wayback machine timestamp and returns it.
zfill(2) adds 1 zero in front of single digit days, months hour etc.
Return type is string.
"""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
def _get_response(
endpoint,
params=None,
headers=None,
return_full_url=False,
retries=5,
backoff_factor=0.5,
no_raise_on_redirects=False,
):
"""Makes get requests.
Parameters
----------
endpoint : str
The API endpoint.
params : dict
The get request parameters. (default is None)
headers : dict
Headers for the get request. (default is None)
return_full_url : bool
Determines whether the call went full url returned along with the
response. (default is False)
retries : int
Maximum number of retries for the get request. (default is 5)
backoff_factor : float
The factor by which we determine the next retry time after wait.
https://urllib3.readthedocs.io/en/latest/reference/urllib3.util.html
(default is 0.5)
no_raise_on_redirects : bool
If maximum 30(default for requests) times redirected than instead of
exceptions return. (default is False)
To handle WaybackError:
from waybackpy.exceptions import WaybackError
try:
...
except WaybackError as e:
# handle it
"""
# From https://stackoverflow.com/a/35504626
# By https://stackoverflow.com/users/401467/datashaman
s = requests.Session()
retries = Retry(
total=retries,
backoff_factor=backoff_factor,
status_forcelist=[500, 502, 503, 504],
)
s.mount("https://", HTTPAdapter(max_retries=retries))
# The URL with parameters required for the get request
url = _full_url(endpoint, params)
try:
if not return_full_url:
return s.get(url, headers=headers)
return (url, s.get(url, headers=headers))
except Exception as e:
reason = str(e)
if no_raise_on_redirects:
if "Exceeded 30 redirects" in reason:
return
exc_message = "Error while retrieving {url}.\n{reason}".format(
url=url, reason=reason
)
exc = WaybackError(exc_message)
exc.__cause__ = e
raise exc

View File

@ -1,290 +1,59 @@
import re
from .save_api import WaybackMachineSaveAPI
from .availability_api import WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI
from .utils import DEFAULT_USER_AGENT
from datetime import datetime, timedelta
from .exceptions import WaybackError
from .cdx import Cdx
from .utils import (
_archive_url_parser,
_wayback_timestamp,
_get_response,
default_user_agent,
_url_check,
_cleaned_url,
_timestamp_manager,
_unix_timestamp_to_wayback_timestamp,
_latest_version,
)
"""
The Url class is not recommended to be used anymore, instead use the
WaybackMachineSaveAPI, WaybackMachineAvailabilityAPI and WaybackMachineCDXServerAPI.
The reason it is still in the code is backwards compatibility with 2.x.x versions.
If were are using the Url before the update to version 3.x.x, your code should still be
working fine and there is no hurry to update the interface but is recommended that you
do not use the Url class for new code as it would be removed after 2025 also the first
3.x.x versions was released in January 2022 and three years are more than enough to update
the older interface code.
"""
class Url:
"""
Attributes
----------
url : str
The input URL, wayback machine API operations are performed
on this URL after sanatizing it.
user_agent : str
The user_agent used while making the GET requests to the
Wayback machine APIs
_archive_url : str
Caches the last fetched archive.
timestamp : datetime.datetime
timestamp of the archive URL as datetime object for
greater usability
_JSON : dict
Caches the last fetched availability API data
latest_version : str
The latest version of waybackpy on PyPi
cached_save : bool
Flag to check if WayBack machine returned a cached
archive instead of creating a new archive. WayBack
machine allows only one 1 archive for an URL in
30 minutes. If the archive returned by WayBack machine
is older than 3 minutes than this flag is set to True
Methods turned properties
----------
JSON : dict
JSON response of availability API as dictionary / loaded JSON
archive_url : str
Return the archive url, returns str
_timestamp : datetime.datetime
Sets the value of self.timestamp if still not set
Methods
-------
save()
Archives the URL on WayBack machine
get(url="", user_agent="", encoding="")
Gets the source of archive url, can also be used to get source
of any URL if passed into it.
near(year=None, month=None, day=None, hour=None, minute=None, unix_timestamp=None)
Wayback Machine can have many archives for a URL/webpage, sometimes we want
archive close to a specific time.
This method takes year, month, day, hour, minute and unix_timestamp as input.
oldest(year=1994)
The oldest archive of an URL.
newest()
The newest archive of an URL
total_archives(start_timestamp=None, end_timestamp=None)
total number of archives of an URL, the timeframe can be confined by
start_timestamp and end_timestamp
known_urls(subdomain=False, host=False, start_timestamp=None, end_timestamp=None, match_type="prefix")
Known URLs for an URL, subdomain, URL as prefix etc.
"""
def __init__(self, url, user_agent=default_user_agent):
def __init__(self, url, user_agent=DEFAULT_USER_AGENT):
self.url = url
self.user_agent = str(user_agent)
_url_check(self.url)
self._archive_url = None
self.timestamp = None
self._JSON = None
self.latest_version = None
self.cached_save = False
def __repr__(self):
return "waybackpy.Url(url={url}, user_agent={user_agent})".format(
url=self.url, user_agent=self.user_agent
self.archive_url = None
self.wayback_machine_availability_api = WaybackMachineAvailabilityAPI(
self.url, user_agent=self.user_agent
)
def __str__(self):
if not self._archive_url:
self._archive_url = self.archive_url
return "{archive_url}".format(archive_url=self._archive_url)
if not self.archive_url:
self.newest()
return self.archive_url
def __len__(self):
"""Number of days between today and the date of archive based on the timestamp
len() of waybackpy.wrapper.Url should return
the number of days between today and the
archive timestamp.
Can be applied on return values of near and its
childs (e.g. oldest) and if applied on waybackpy.Url()
whithout using any functions, it just grabs
self._timestamp and def _timestamp gets it
from def JSON.
"""
td_max = timedelta(
days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)
if not self.timestamp:
self.timestamp = self._timestamp
self.oldest()
if self.timestamp == datetime.max:
return td_max.days
return (datetime.utcnow() - self.timestamp).days
@property
def JSON(self):
"""Returns JSON response of availability API as dictionary / loaded JSON
return type : dict
"""
# If user used the near method or any method that depends on near, we
# are certain that we have a loaded dictionary cached in self._JSON.
# Return the loaded JSON data.
if self._JSON:
return self._JSON
# If no cached data found, get data and return + cache it.
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": self.user_agent}
payload = {"url": "{url}".format(url=_cleaned_url(self.url))}
response = _get_response(endpoint, params=payload, headers=headers)
self._JSON = response.json()
return self._JSON
@property
def archive_url(self):
"""Return the archive url.
return type : str
"""
if self._archive_url:
return self._archive_url
data = self.JSON
if not data["archived_snapshots"]:
archive_url = None
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
self._archive_url = archive_url
return archive_url
@property
def _timestamp(self):
"""Sets the value of self.timestamp if still not set.
Return type : datetime.datetime
"""
return _timestamp_manager(self.timestamp, self.JSON)
def save(self):
"""Saves/Archive the URL.
To save a webpage on WayBack machine we
need to send get request to https://web.archive.org/save/
And to get the archive URL we are required to read the
header of the API response.
_get_response() takes care of the get requests.
_archive_url_parser() parses the archive from the header.
return type : waybackpy.wrapper.Url
"""
request_url = "https://web.archive.org/save/" + _cleaned_url(self.url)
headers = {"User-Agent": self.user_agent}
response = _get_response(
request_url,
params=None,
headers=headers,
backoff_factor=2,
no_raise_on_redirects=True,
self.wayback_machine_save_api = WaybackMachineSaveAPI(
self.url, user_agent=self.user_agent
)
if not self.latest_version:
self.latest_version = _latest_version("waybackpy", headers=headers)
if response:
res_headers = response.headers
else:
res_headers = "save redirected"
self._archive_url = "https://" + _archive_url_parser(
res_headers,
self.url,
latest_version=self.latest_version,
instance=self,
response=response,
)
m = re.search(
r"https?://web.archive.org/web/([0-9]{14})/http", self._archive_url
)
str_ts = m.group(1)
ts = datetime.strptime(str_ts, "%Y%m%d%H%M%S")
now = datetime.utcnow()
total_seconds = int((now - ts).total_seconds())
if total_seconds > 60 * 3:
self.cached_save = True
self.timestamp = ts
self.archive_url = self.wayback_machine_save_api.archive_url
self.timestamp = self.wayback_machine_save_api.timestamp()
self.headers = self.wayback_machine_save_api.headers
return self
def get(self, url="", user_agent="", encoding=""):
"""GET the source of archive or any other URL.
url : str, waybackpy.wrapper.Url
The method will return the source code of
this URL instead of last fetched archive.
user_agent : str
The user_agent for GET request to API
encoding : str
If user is using any other encoding that
can't be detected by response.encoding
Return the source code of the last fetched
archive URL if no URL is passed to this method
else it returns the source code of url passed.
If encoding is not supplied, it is auto-detected
from the response itself by requests package.
"""
if not url and self._archive_url:
url = self._archive_url
elif not url and not self._archive_url:
url = _cleaned_url(self.url)
if not user_agent:
user_agent = self.user_agent
headers = {"User-Agent": str(user_agent)}
response = _get_response(str(url), params=None, headers=headers)
if not encoding:
try:
encoding = response.encoding
except AttributeError:
encoding = "UTF-8"
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
def near(
self,
year=None,
@ -294,153 +63,45 @@ class Url:
minute=None,
unix_timestamp=None,
):
"""
Parameters
----------
year : int
Archive close to year
month : int
Archive close to month
day : int
Archive close to day
hour : int
Archive close to hour
minute : int
Archive close to minute
unix_timestamp : str, float or int
Archive close to this unix_timestamp
Wayback Machine can have many archives of a webpage,
sometimes we want archive close to a specific time.
This method takes year, month, day, hour and minute as input.
The input type must be integer. Any non-supplied parameters
default to the current time.
We convert the input to a wayback machine timestamp using
_wayback_timestamp(), it returns a string.
We use the wayback machine's availability API
(https://archive.org/wayback/available)
to get the closest archive from the timestamp.
We set self._archive_url to the archive found, if any.
If archive found, we set self.timestamp to its timestamp.
We self._JSON to the response of the availability API.
And finally return self.
"""
if unix_timestamp:
timestamp = _unix_timestamp_to_wayback_timestamp(unix_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
)
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": self.user_agent}
payload = {
"url": "{url}".format(url=_cleaned_url(self.url)),
"timestamp": timestamp,
}
response = _get_response(endpoint, params=payload, headers=headers)
data = response.json()
if not data["archived_snapshots"]:
raise WaybackError(
"Can not find archive for '{url}' try later or use wayback.Url(url, user_agent).save() "
"to create a new archive.\nAPI response:\n{text}".format(
url=_cleaned_url(self.url), text=response.text
)
)
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
self.wayback_machine_availability_api.near(
year=year,
month=month,
day=day,
hour=hour,
minute=minute,
unix_timestamp=unix_timestamp,
)
self._archive_url = archive_url
self.timestamp = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
self._JSON = data
self.set_availability_api_attrs()
return self
def oldest(self, year=1994):
"""
Returns the earliest/oldest Wayback Machine archive for the webpage.
Wayback machine has started archiving the internet around 1997 and
therefore we can't have any archive older than 1997, we use 1994 as the
deafult year to look for the oldest archive.
We simply pass the year in near() and return it.
"""
return self.near(year=year)
def oldest(self):
self.wayback_machine_availability_api.oldest()
self.set_availability_api_attrs()
return self
def newest(self):
"""Return the newest Wayback Machine archive available.
self.wayback_machine_availability_api.newest()
self.set_availability_api_attrs()
return self
We return the return value of self.near() as it deafults to current UTC time.
Due to Wayback Machine database lag, this may not always be the
most recent archive.
return type : waybackpy.wrapper.Url
"""
return self.near()
def set_availability_api_attrs(self):
self.archive_url = self.wayback_machine_availability_api.archive_url
self.JSON = self.wayback_machine_availability_api.JSON
self.timestamp = self.wayback_machine_availability_api.timestamp()
def total_archives(self, start_timestamp=None, end_timestamp=None):
"""Returns the total number of archives for an URL
Parameters
----------
start_timestamp : str
1 to 14 digit string of numbers, you are not required to
pass a full 14 digit timestamp.
end_timestamp : str
1 to 14 digit string of numbers, you are not required to
pass a full 14 digit timestamp.
return type : int
A webpage can have multiple archives on the wayback machine
If someone wants to count the total number of archives of a
webpage on wayback machine they can use this method.
Returns the total number of Wayback Machine archives for the URL.
"""
cdx = Cdx(
_cleaned_url(self.url),
cdx = WaybackMachineCDXServerAPI(
self.url,
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,
)
# cdx.snapshots() is generator not list.
i = 0
count = 0
for _ in cdx.snapshots():
i = i + 1
return i
count = count + 1
return count
def known_urls(
self,
@ -450,45 +111,13 @@ class Url:
end_timestamp=None,
match_type="prefix",
):
"""Yields known_urls URLs from the CDX API.
Parameters
----------
subdomain : bool
If True fetch subdomain URLs along with the host URLs.
host : bool
Only fetch host URLs.
start_timestamp : str
1 to 14 digit string of numbers, you are not required to
pass a full 14 digit timestamp.
end_timestamp : str
1 to 14 digit string of numbers, you are not required to
pass a full 14 digit timestamp.
match_type : str
One of (exact, prefix, host and domain)
return type : waybackpy.snapshot.CdxSnapshot
Yields list of URLs known to exist for given input.
Defaults to input URL as prefix.
Based on:
https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
By Mohammed Diaa (https://github.com/mhmdiaa)
"""
if subdomain:
match_type = "domain"
if host:
match_type = "host"
cdx = Cdx(
_cleaned_url(self.url),
cdx = WaybackMachineCDXServerAPI(
self.url,
user_agent=self.user_agent,
start_timestamp=start_timestamp,
end_timestamp=end_timestamp,