Compare commits

...

133 Commits
v1.6 ... 2.1.8

Author SHA1 Message Date
f1065ed1c8 v2.1.8 2020-10-03 01:18:30 +05:30
315519b21f 2.1.8 2020-10-03 01:18:08 +05:30
07c98661de add usage for known urls (#26)
* Update README.md

* Update README.md

* Update README.md

* bash example for known urls

* python examples / usage for known urls :)

* Update README.md

* Update README.md

* Update README.md

* Update README.md
2020-10-03 01:16:19 +05:30
2cd991a54e lint markdown 2020-10-02 23:34:06 +05:30
ede251afb3 update tests 2020-10-02 23:10:48 +05:30
a8ce970ca0 fixed yet another issue with tests :( 2020-10-02 23:01:59 +05:30
243af26bf6 update version format in tests 2020-10-02 22:23:58 +05:30
0f1db94884 license & packaging info 2020-10-02 22:10:30 +05:30
c304f58ea2 update tests 2020-10-02 21:35:39 +05:30
23f7222cb5 tweak 2020-10-02 21:01:32 +05:30
ce7294d990 Implemented new feature, known urls for domain. 2020-10-02 20:27:28 +05:30
c9fa114d2e grammar 2020-10-01 23:50:03 +05:30
8b6bacb28e Add files via upload 2020-09-08 09:23:59 +05:30
32d8ad7780 Update README.md (#24)
- IA and Wayback machine logo, added new waybackpy logo.
+ changed pages to webpages in lead
2020-09-08 09:12:48 +05:30
cbf2f90faa Add files via upload 2020-09-08 09:06:33 +05:30
4dde3e3134 Delete a.txt 2020-09-08 09:02:36 +05:30
1551e8f1c6 Add files via upload 2020-09-08 09:02:19 +05:30
c84f09e2d2 Create a.txt 2020-09-08 08:59:28 +05:30
57a32669b5 v2.1.7 2020-08-09 11:06:29 +05:30
fe017cbcc8 v2.1.7 2020-08-09 11:06:04 +05:30
5edb03d24b update docs 2020-08-09 11:05:04 +05:30
c5de2232ba Update test_wrapper.py 2020-08-09 10:53:00 +05:30
ca9186c301 update message, sometimes raised for poor performance by wayback machine even if the url is archived. 2020-08-09 10:43:16 +05:30
8a4b631c13 new regex to parse archive, IA changed the header again :( 2020-08-09 10:36:25 +05:30
ec9ce92f48 Update README.md (#23)
* Update README.md

* fix grammar
2020-07-26 10:30:54 +05:30
e95d35c37f re arrange the badges, moved contributions welcome to top 2020-07-26 10:24:31 +05:30
36d662b961 Update __version__.py 2020-07-24 16:24:57 +05:30
2835f8877e Update setup.py 2020-07-24 16:24:38 +05:30
18cbd2fd30 Update cli.py 2020-07-24 16:10:29 +05:30
a2812fb56f patch for cli 2020-07-24 16:09:47 +05:30
77effcf649 Update setup.py 2020-07-24 15:34:14 +05:30
7272ef45a0 Update __version__.py 2020-07-24 15:33:58 +05:30
56116551ac Coverge improvements (#22)
* Update cli.py

* improved tests

* chnages for proper testing

* Type check using isinstance

* Replace elifs with if when used after return

* twitter.com --> www.ibm.com

* fix typo

* test archive urll parser and dunders

* Update test_wrapper.py
2020-07-24 15:31:21 +05:30
4dcda94cb0 v2.1.4 2020-07-24 01:03:44 +05:30
09f59b0182 v2.1.4 2020-07-24 01:03:04 +05:30
ed24184b99 Remove duplicate get response method 2020-07-24 00:57:22 +05:30
56bef064b1 only test save on >3.7 2020-07-23 20:51:46 +05:30
44bb2cf5e4 some cli tests 2020-07-23 20:44:14 +05:30
e231228721 Update README.md (#21)
* Update README.md

* example bash oldest newest

* total archives bash example

* near bash example

* format the list

* ce

* get bash example

* pip git install example

* Update index.rst

* + argparse

* + argparse
2020-07-22 21:35:02 +05:30
b8b2d6dfa9 v2.1.3 2020-07-22 20:21:37 +05:30
3eca6294df v2.1.3 2020-07-22 20:20:44 +05:30
eb037a0284 Rename test_1.py to test_wrapper.py 2020-07-22 20:19:59 +05:30
a01821f20b Update .travis.yml 2020-07-22 17:33:59 +05:30
b21036f8df Update .travis.yml 2020-07-22 17:31:57 +05:30
b43bacb7ac fix error language 2020-07-22 17:25:15 +05:30
f7313b255a Update cli.py 2020-07-22 17:22:38 +05:30
7457e1c793 - print(repr(obj)) 2020-07-22 17:18:27 +05:30
f7493d823f Update cli.py 2020-07-22 17:16:53 +05:30
7fa7b59ce3 if version don't try to create object 2020-07-22 17:15:28 +05:30
78a608db50 Update cli.py 2020-07-22 17:12:44 +05:30
93f7dfdaf9 resolve args conflict 2020-07-22 17:09:32 +05:30
83c6f256c9 version arg 2020-07-22 17:03:56 +05:30
dee9105794 command_line support (#18)
* Update wrapper.py

* entry points cli

* Suppress the urllib2/3 Exception

* rm cli code, will create a new cli.py file

* Create cli.py

* update cli entry pts

* Update cli.py

* Update cli.py

* import print_function

* Update cli.py

* Update cli.py

* Delete pypi_uploader.sh

* resolve conflicts with the master

* update the test ; resolve the conflicts

* decrease code complexity

* cli method changed to main

* get is not for just local usage

* get method should be available from interface

* get is used in the interface

* Update cli.py
2020-07-22 16:40:13 +05:30
3bfc3b46d0 Delete SECURITY.md 2020-07-22 11:07:59 +05:30
553f150bee replace youtube with twitter.com
for some reason Wayback API is returing diffrent youtube URL now.
2020-07-22 11:07:23 +05:30
b3a7e714a5 Update wrapper.py 2020-07-22 10:57:43 +05:30
cd9841713c Update wrapper.py 2020-07-22 10:52:43 +05:30
1ea9548d46 Raise WaybackError from URLError and include URL (#19)
* Raise WaybackError from URLError and include URL

* python2 compatibility

Co-authored-by: Akash <64683866+akamhy@users.noreply.github.com>
2020-07-22 10:51:44 +05:30
be7642c837 Code style improvements (#20)
* Add sane line length to setup.cfg

* Use Black for quick readability improvements

* Clean up exceptions, docstrings, and comments

Docstrings on dunder functions are redundant and typically ignored
Limit to reasonable line length
General grammar and style corrections
Clarify docstrings and exceptions
Format docstrings per PEP 257 -- Docstring Conventions

* Move archive_url_parser out of Url.save()

It's generally poor form to define a function in a function, as it will
be re-defined each time the function is run.

archive_url_parser does not depend on anything in Url, so it makes sense
to move it out of the class.

* move wayback_timestamp out of class, mark private functions

* DRY in _wayback_timestamp

* Url._url_check should return None

There's no point in returning True if it's never checked and won't ever
be False.
Implicitly returning None or raising an exception is more idiomatic.

* Default parameters should be type-consistant with expected values

* Specify parameters to near

* Use datetime.datetime in _wayback_timestamp

* cleanup __init__.py

* Cleanup formatting in tests

* Fix names in tests

* Revert "Use datetime.datetime in _wayback_timestamp"

This reverts commit 5b30380865.

Introduced unnecessary complexity

* Move _get_response outside of Url

Because Codacy reminded me that I missed it.

* fix imports in tests
2020-07-22 10:09:14 +05:30
a418a4e464 Update SECURITY.md 2020-07-21 10:38:41 +05:30
aec035ef1e Create SECURITY.md 2020-07-21 08:41:07 +05:30
6d37993ab9 moved to manuals 2020-07-21 08:14:40 +05:30
72b80ca44e Create pypi_uploader.sh 2020-07-21 08:14:21 +05:30
c10aa9279c Create python-publish.yml 2020-07-21 08:08:55 +05:30
68d809a7d6 Update test_1.py 2020-07-20 23:45:49 +05:30
4ad09a419b Fix bash syntax 2020-07-20 23:44:23 +05:30
ddc6620f09 Only report coverage if python 3.8 or greater 2020-07-20 23:29:54 +05:30
4066a65678 Update .travis.yml 2020-07-20 23:20:43 +05:30
8e46a9ba7a Update .travis.yml 2020-07-20 23:16:33 +05:30
a5a98b9b00 Update .travis.yml 2020-07-20 23:11:56 +05:30
a721ab7d6c Update .travis.yml 2020-07-20 23:10:06 +05:30
7db27ae5e1 Create pypi_uploader.sh 2020-07-20 22:28:35 +05:30
8fd4462025 Update wrapper.py 2020-07-20 20:17:18 +05:30
c458a15820 Update .travis.yml 2020-07-20 15:38:33 +05:30
bae3412bee Update .travis.yml 2020-07-20 15:24:26 +05:30
94cb08bb37 Update setup.py 2020-07-20 10:41:00 +05:30
af888db13e 2.1.2 2020-07-20 10:40:37 +05:30
d24f2408ee Update test_1.py 2020-07-20 10:31:47 +05:30
ddd2274015 Update test_1.py 2020-07-20 10:21:15 +05:30
99abdb7c67 Update test_1.py 2020-07-20 10:16:39 +05:30
f3bb9a8540 Update wrapper.py 2020-07-20 10:11:36 +05:30
bb94e0d1c5 Update index.rst and remove dupes 2020-07-20 10:07:31 +05:30
1a78d88be2 2.1.1 2020-07-19 23:17:01 +05:30
3ec61758b3 Update __version__.py 2020-07-19 23:16:13 +05:30
83c962166d Raise 2020-07-19 23:02:04 +05:30
e87dee3bdf Waybackpy example on replit (#15)
* Waybackpy save example on replit

* Oldest example

* Newest method replit link

* Near method example

* Get example

* Total archive method example
2020-07-19 22:28:08 +05:30
b27bfff15a v2.1.0 2020-07-19 21:08:01 +05:30
970fc1cd08 Update __version__.py 2020-07-19 21:06:54 +05:30
65391bf14b update 2020-07-19 21:04:32 +05:30
8ab116f276 API chnaged again. updated
* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* api changed; fix archive url parser

* Update wrapper.py

* - Trailing whitespace

* include the header in exception
2020-07-19 20:39:07 +05:30
6f82041ec9 Update README.md (#13)
* Update README.md

* Update README.md

* replit demo for waybackpy.Url.save()

* Update README.md

* Update README.md

* replit demo for oldest()

* replit demo for newest()

* Update README.md

* replit demo for total_archives

* demo at replit for get()

* demo for near

* Update README.md

* Update README.md

* Update README.md
2020-07-19 16:39:39 +05:30
11059c960e Update setup.py 2020-07-18 19:27:04 +05:30
eee1b8eba1 Update __version__.py 2020-07-18 19:26:41 +05:30
f7de8f5575 sleeps to prevent too many requests in a timeframe 2020-07-18 19:25:19 +05:30
3fa0c32064 V2.0.1 link 2020-07-18 19:09:18 +05:30
aa1e3b8825 V2.0.1 2020-07-18 19:08:39 +05:30
58d2d585c8 No timeout for final try 2020-07-18 18:29:41 +05:30
e8efed2e2f Update test_1.py 2020-07-18 17:24:54 +05:30
49089b7321 2.0.0 link 2020-07-18 17:09:07 +05:30
55d8687566 Update test_1.py 2020-07-18 16:58:23 +05:30
0fa28527af Update index.rst 2020-07-18 16:54:07 +05:30
68259fd2d9 Update index.rst 2020-07-18 16:53:27 +05:30
e7086a89d3 Update index.rst 2020-07-18 16:52:37 +05:30
e39467227c Update index.rst 2020-07-18 16:51:47 +05:30
ba840404cf Update index.rst 2020-07-18 16:50:37 +05:30
8fbd2d9e55 Update index.rst 2020-07-18 16:49:03 +05:30
eebf6043de Update index.rst 2020-07-18 16:48:29 +05:30
3d3b09d6d8 Update README.md 2020-07-18 16:46:40 +05:30
ef15b5863c Update index.rst 2020-07-18 16:44:32 +05:30
256c0cdb6b update test - save 2020-07-18 16:39:35 +05:30
12c72a8294 fix link 2020-07-18 16:30:20 +05:30
0ad27f5ecc update readme for newer oop and some test changes (#12)
* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* docstrings

* user agent ; more variants

* description update

* Update __init__.py

* # -*- coding: utf-8 -*-

* Update test_1.py

* update docs for get()

* Update README.md
2020-07-18 16:22:09 +05:30
700b60b5f8 Update README.md 2020-07-18 08:16:59 +05:30
11032596c8 Update README.md 2020-07-18 08:15:43 +05:30
9727f92168 Update README.md 2020-07-18 08:12:33 +05:30
d2893fec13 Delete CONTRIBUTING.md 2020-07-18 08:12:00 +05:30
f1353b2129 Update CONTRIBUTING.md 2020-07-18 00:58:50 +05:30
c76a95ef90 Create CONTRIBUTING.md (#11) 2020-07-18 00:57:48 +05:30
62d88359ce Update README.md 2020-07-18 00:40:21 +05:30
9942c474c9 Update README.md 2020-07-18 00:35:12 +05:30
dfb736e794 Size 2020-07-18 00:32:00 +05:30
84d1766917 Update README.md 2020-07-18 00:20:58 +05:30
9d3cdfafb3 Update README.md 2020-07-18 00:20:17 +05:30
20a16bfa45 Version 2.0.0 on it's way for release (tommorow) 2020-07-18 00:09:28 +05:30
f2112c73f6 Python 2 support 2020-07-17 21:08:32 +05:30
9860527d96 OOP (#10)
* Update wrapper.py

* Update exceptions.py

* Update __init__.py

* test adjusted for new changes

* Update wrapper.py
2020-07-17 20:50:00 +05:30
9ac1e877c8 Update README.md 2020-07-16 20:39:12 +05:30
f881705d00 detecet python version whith sys.version_info (#9) 2020-06-26 15:48:01 +05:30
f015c3f4f3 test on the worst case possible 2020-05-08 09:56:01 +05:30
42ac399362 Most efficient method to count (yet) 2020-05-08 09:47:13 +05:30
e9d010c793 just count the status code, consumes less memory 2020-05-08 09:28:18 +05:30
58a6409528 v1.6 2020-05-07 20:14:59 +05:30
7ca2029158 Update setup.py 2020-05-07 20:14:40 +05:30
18 changed files with 1668 additions and 581 deletions

31
.github/workflows/python-publish.yml vendored Normal file
View File

@ -0,0 +1,31 @@
# This workflows will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
name: Upload Python Package
on:
release:
types: [created]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools wheel twine
- name: Build and publish
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
run: |
python setup.py sdist bdist_wheel
twine upload dist/*

View File

@ -1,14 +1,19 @@
language: python
python:
- "2.7"
- "3.6"
- "3.8"
os: linux
dist: xenial
cache: pip
install:
- pip install pytest
before_script:
cd tests
python:
- 2.7
- 3.6
- 3.8
before_install:
- python --version
- pip install -U pip
- pip install -U pytest
- pip install codecov
- pip install pytest pytest-cov
script:
- pytest test_1.py
- cd tests
- pytest --cov=../waybackpy
after_success:
- if [[ $TRAVIS_PYTHON_VERSION == 3.8 ]]; then python -m codecov; fi

416
README.md
View File

@ -1,176 +1,386 @@
# waybackpy
[![Build Status](https://travis-ci.org/akamhy/waybackpy.svg?branch=master)](https://travis-ci.org/akamhy/waybackpy)
[![Downloads](https://img.shields.io/pypi/dm/waybackpy.svg)](https://pypistats.org/packages/waybackpy)
![contributions welcome](https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square)
[![Build Status](https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square)](https://travis-ci.org/akamhy/waybackpy)
[![codecov](https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg)](https://codecov.io/gh/akamhy/waybackpy)
[![Downloads](https://pepy.tech/badge/waybackpy/month)](https://pepy.tech/project/waybackpy/month)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![Maintainability](https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability)](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[![CodeFactor](https://www.codefactor.io/repository/github/akamhy/waybackpy/badge)](https://www.codefactor.io/repository/github/akamhy/waybackpy)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
![pypi](https://img.shields.io/pypi/v/waybackpy.svg)
[![pypi](https://img.shields.io/pypi/v/waybackpy.svg)](https://pypi.org/project/waybackpy/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
![Repo size](https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
![Wayback Machine](https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy-colored%20284.png)
![Internet Archive](https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png)
![Wayback Machine](https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png)
The waybackpy is a python wrapper for [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine).
Waybackpy is a Python library that interfaces with [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine) API. Archive webpages and retrieve archived webpages easily.
Table of contents
=================
<!--ts-->
* [Installation](https://github.com/akamhy/waybackpy#installation)
* [Installation](#installation)
* [Usage](https://github.com/akamhy/waybackpy#usage)
* [Saving an url using save()](https://github.com/akamhy/waybackpy#capturing-aka-saving-an-url-using-save)
* [Receiving the oldest archive for an URL Using oldest()](https://github.com/akamhy/waybackpy#receiving-the-oldest-archive-for-an-url-using-oldest)
* [Receiving the recent most/newest archive for an URL using newest()](https://github.com/akamhy/waybackpy#receiving-the-newest-archive-for-an-url-using-newest)
* [Receiving archive close to a specified year, month, day, hour, and minute using near()](https://github.com/akamhy/waybackpy#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage using get()](https://github.com/akamhy/waybackpy#get-the-content-of-webpage-using-get)
* [Count total archives for an URL using total_archives()](https://github.com/akamhy/waybackpy#count-total-archives-for-an-url-using-total_archives)
* [Usage](#usage)
* [As a Python package](#as-a-python-package)
* [Saving an url](#capturing-aka-saving-an-url-using-save)
* [Retrieving the oldest archive](#retrieving-the-oldest-archive-for-an-url-using-oldest)
* [Retrieving the recent most/newest archive](#retrieving-the-newest-archive-for-an-url-using-newest)
* [Retrieving archive close to a specified year, month, day, hour, and minute](#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage](#get-the-content-of-webpage-using-get)
* [Count total archives for an URL](#count-total-archives-for-an-url-using-total_archives)
* [List of URLs that Wayback Machine knows and has archived for a domain name](#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [As a Command-line tool](#with-the-command-line-interface)
* [Save](#save)
* [Oldest archive](#oldest-archive)
* [Newest archive](#newest-archive)
* [Total archives](#total-number-of-archives)
* [Archive near a time](#archive-near-time)
* [Get the source code](#get-the-source-code)
* [Fetch all the URLs that the Wayback Machine knows for a domain](#fetch-all-the-urls-that-the-wayback-machine-knows-for-a-domain)
* [Tests](https://github.com/akamhy/waybackpy#tests)
* [Tests](#tests)
* [Dependency](https://github.com/akamhy/waybackpy#dependency)
* [Dependency](#dependency)
* [License](https://github.com/akamhy/waybackpy#license)
* [Packaging](#packaging)
* [License](#license)
<!--te-->
## Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
**pip install waybackpy**
```bash
pip install waybackpy
```
or direct from this repository using git.
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
## Usage
#### Capturing aka Saving an url Using save()
```diff
+ waybackpy.save(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
```python
import waybackpy
# Capturing a new archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
print(archived_url)
```
This should print something similar to the following archived URL:
<https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy>
#### Receiving the oldest archive for an URL Using oldest()
```diff
+ waybackpy.oldest(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
### As a Python package
#### Capturing aka Saving an url using save()
```python
import waybackpy
# retrieving the oldest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
print(oldest_archive)
new_archive_url = waybackpy.Url(
url = "https://en.wikipedia.org/wiki/Multivariable_calculus",
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
).save()
print(new_archive_url)
```
This returns the oldest available archive for <https://google.com>.
<http://web.archive.org/web/19981111184551/http://google.com:80/>
#### Receiving the newest archive for an URL using newest()
```diff
+ waybackpy.newest(url, UA=user_agent)
```bash
https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy
```
> url is mandatory. UA is not, but highly recommended.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPySaveExample></sub>
#### Retrieving the oldest archive for an URL using oldest()
```python
import waybackpy
# retrieving the newest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
print(newest_archive)
oldest_archive_url = waybackpy.Url(
"https://www.google.com/",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
).oldest()
print(oldest_archive_url)
```
This returns the newest available archive for <https://www.microsoft.com/en-us>, something just like this:
<http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/>
#### Receiving archive close to a specified year, month, day, hour, and minute using near()
```diff
+ waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)
```bash
http://web.archive.org/web/19981111184551/http://google.com:80/
```
> url is mandotory. year,month,day,hour and minute are optional arguments. UA is not mandotory, but higly recomended.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyOldestExample></sub>
#### Retrieving the newest archive for an URL using newest()
```python
import waybackpy
# retriving the the closest archive from a specified year.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are year,month,day,hour and minute
archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
print(archive_near_year)
newest_archive_url = waybackpy.Url(
"https://www.facebook.com/",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
).newest()
print(newest_archive_url)
```
returns : <http://web.archive.org/web/20100504071154/http://www.facebook.com/>
```waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20101111173430/http://www.facebook.com//>
```bash
https://web.archive.org/web/20200714013225/https://www.facebook.com/
```
```waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html>
> Please note that if you only specify the year, the current month and day are default arguments for month and day respectively. Do not expect just putting the year parameter would return the archive closer to January but the current month you are using the package. If you are using it in July 2018 and let's say you use ```waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``` then you would be returned the nearest archive to July 2011 and not January 2011. You need to specify the month "1" for January.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNewestExample></sub>
> Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
#### Retrieving archive close to a specified year, month, day, hour, and minute using near()
```python
from waybackpy import Url
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
github_url = "https://github.com/"
github_wayback_obj = Url(github_url, user_agent)
# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
```
```python
github_archive_near_2010 = github_wayback_obj.near(year=2010)
print(github_archive_near_2010)
```
```bash
https://web.archive.org/web/20100719134402/http://github.com/
```
```python
github_archive_near_2011_may = github_wayback_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)
```
```bash
https://web.archive.org/web/20110519185447/https://github.com/
```
```python
github_archive_near_2015_january_26 = github_wayback_obj.near(
year=2015, month=1, day=26
)
print(github_archive_near_2015_january_26)
```
```bash
https://web.archive.org/web/20150127031159/https://github.com
```
```python
github_archive_near_2018_4_july_9_2_am = github_wayback_obj.near(
year=2018, month=7, day=4, hour = 9, minute = 2
)
print(github_archive_near_2018_4_july_9_2_am)
```
```bash
https://web.archive.org/web/20180704090245/https://github.com/
```
<sub>The library doesn't supports seconds yet. You are encourged to create a PR ;)</sub>
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNearExample></sub>
#### Get the content of webpage using get()
```diff
+ waybackpy.get(url, encoding="UTF-8", UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended. encoding is detected automatically, don't specify unless necessary.
```python
from waybackpy import get
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url, encoding and UA
webpage = get("https://example.com/", UA="User-Agent")
print(webpage)
import waybackpy
google_url = "https://www.google.com/"
User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
waybackpy_url_object = waybackpy.Url(google_url, User_Agent)
# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)
# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(
waybackpy_url_object.save()
)
print(google_newest_archive_source)
# waybackpy_url_object.oldest() type is str, it's oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(
waybackpy_url_object.oldest()
)
print(google_oldest_archive_source)
```
> This should print the source code for <https://example.com/>.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyGetExample#main.py></sub>
#### Count total archives for an URL using total_archives()
```diff
+ waybackpy.total_archives(url, UA=user_agent)
```python
import waybackpy
URL = "https://en.wikipedia.org/wiki/Python (programming language)"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
archive_count = waybackpy.Url(
url=URL,
user_agent=UA
).total_archives()
print(archive_count) # total_archives() returns an int
```
> url is mandatory. UA is not, but highly recommended.
```bash
2440
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyTotalArchivesExample></sub>
#### List of URLs that Wayback Machine knows and has archived for a domain name
1) If alive=True is set, waybackpy will check all URLs to identify the alive URLs. Don't use with popular websites like google or it would take too long.
2) To include URLs from subdomain set sundomain=True
```python
from waybackpy import total_archives
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url and UA
count = total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA="User-Agent")
print(count)
import waybackpy
URL = "akamhy.github.io"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
known_urls = waybackpy.Url(url=URL, user_agent=UA).known_urls(alive=True, subdomain=False) # alive and subdomain are optional.
print(known_urls) # known_urls() returns list of URLs
```
> This should print an integer (int), which is the number of total archives on archive.org
```bash
['http://akamhy.github.io',
'https://akamhy.github.io/waybackpy/',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py></sub>
### With the Command-line interface
#### Save
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashSave></sub>
#### Oldest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashOldest></sub>
#### Newest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNewest></sub>
#### Total number of archives
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
853
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashTotal></sub>
#### Archive near time
```bash
$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
https://web.archive.org/web/20120512142515/https://www.facebook.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNear></sub>
#### Get the source code
```bash
waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashGet></sub>
#### Fetch all the URLs that the Wayback Machine knows for a domain
1) You can add the '--alive' flag to only fetch alive links.
2) You can add the '--subdomain' flag to add subdomains.
3) '--alive' and '--subdomain' flags can be used simultaneously.
4) All links will be saved in a file, and the file will be created in the current working directory.
```bash
pip install waybackpy
# Ignore the above installation line.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --alive
# Prints all known URLs under akamhy.github.io which are still working and not dead links.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io inclusing subdomain
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain --alive
# Prints all known URLs under akamhy.github.io including subdomain which are not dead links and still alive.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh></sub>
## Tests
* [Here](https://github.com/akamhy/waybackpy/tree/master/tests)
[Here](https://github.com/akamhy/waybackpy/tree/master/tests)
## Dependency
* None, just python standard libraries (json, urllib and datetime). Both python 2 and 3 are supported :)
None, just python standard libraries (re, json, urllib, argparse and datetime). Both python 2 and 3 are supported :)
## Packaging
1. Increment version.
2. Build package ``python setup.py sdist bdist_wheel``.
3. Sign & upload the package ``twine upload -s dist/*``.
## License
[MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
Released under the MIT License. See
[license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.

View File

@ -0,0 +1,268 @@
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="629.000000pt" height="103.000000pt" viewBox="0 0 629.000000 103.000000"
preserveAspectRatio="xMidYMid meet">
<g transform="translate(0.000000,103.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M0 515 l0 -515 3145 0 3145 0 0 515 0 515 -3145 0 -3145 0 0 -515z
m5413 439 c31 -6 36 -10 31 -26 -3 -10 0 -26 7 -34 6 -8 10 -17 7 -20 -3 -2
-17 11 -32 31 -15 19 -41 39 -59 44 -38 11 -10 14 46 5z m150 -11 c-7 -2 -21
-2 -30 0 -10 3 -4 5 12 5 17 0 24 -2 18 -5z m-4869 -23 c-6 -6 -21 -6 -39 -1
-30 9 -30 9 10 10 25 1 36 -2 29 -9z m452 -37 c-3 -26 -15 -65 -25 -88 -10
-22 -21 -64 -25 -94 -3 -29 -14 -72 -26 -95 -11 -23 -20 -51 -20 -61 0 -30
-39 -152 -53 -163 -6 -5 -45 -12 -85 -14 -72 -5 -102 4 -102 33 0 6 -9 31 -21
56 -11 25 -26 72 -33 103 -6 31 -17 64 -24 73 -8 9 -22 37 -32 64 l-18 48 -16
-39 c-9 -21 -16 -44 -16 -50 0 -6 -7 -24 -15 -40 -8 -16 -24 -63 -34 -106 -11
-43 -26 -93 -34 -112 -14 -34 -15 -35 -108 -46 -70 -9 -96 -9 -106 0 -21 17
-43 64 -43 92 0 14 -4 27 -9 31 -12 7 -50 120 -66 200 -8 35 -25 81 -40 103
-14 22 -27 52 -28 68 -2 28 0 29 48 31 28 1 82 5 120 9 54 4 73 3 82 -7 11
-15 53 -148 53 -170 0 -7 9 -32 21 -56 20 -41 39 -49 39 -17 0 8 -5 12 -10 9
-6 -3 -13 2 -16 12 -3 10 -10 26 -15 36 -14 26 7 21 29 -8 l20 -26 7 33 c7 35
41 149 56 185 7 19 16 23 56 23 27 0 80 2 120 6 80 6 88 1 97 -71 3 -20 9 -42
14 -48 5 -7 20 -43 32 -82 13 -38 24 -72 26 -74 2 -2 13 4 24 14 13 12 20 31
20 55 0 20 7 56 15 81 7 24 19 63 25 87 12 47 31 60 89 61 l34 1 -7 -47z
m3131 41 c17 -3 34 -12 37 -20 3 -7 1 -48 -4 -91 -4 -43 -7 -80 -4 -82 2 -2
11 2 20 10 9 7 24 18 34 24 9 5 55 40 101 77 79 64 87 68 136 68 28 0 54 -4
58 -10 3 -5 12 -7 20 -3 9 3 15 -1 15 -9 0 -13 -180 -158 -197 -158 -4 0 -14
-9 -20 -20 -11 -17 -7 -27 27 -76 22 -32 40 -63 40 -70 0 -7 6 -19 14 -26 7
-8 37 -48 65 -89 l52 -74 -28 -3 c-51 -5 -74 -12 -68 -22 9 -14 -59 -12 -73 2
-20 20 -13 30 10 14 34 -24 44 -19 17 8 -25 25 -109 140 -109 149 0 7 -60 97
-64 97 -2 0 -11 -10 -22 -22 -18 -21 -18 -21 0 -15 10 4 25 2 32 -4 18 -15 19
-35 2 -22 -7 6 -25 13 -39 17 -34 8 -39 -5 -39 -94 0 -38 -3 -75 -6 -84 -6
-16 -54 -22 -67 -9 -4 3 -40 7 -81 8 -101 2 -110 10 -104 97 3 37 10 73 16 80
6 8 10 77 10 174 0 89 2 166 6 172 6 11 162 15 213 6z m301 -1 c-25 -2 -52
-11 -58 -19 -7 -7 -17 -14 -23 -14 -5 0 -2 9 8 20 14 16 29 20 69 18 l51 -2
-47 -3z m809 -9 c33 -21 65 -89 62 -132 -1 -21 1 -47 5 -59 9 -28 -26 -111
-51 -120 -10 -3 -25 -12 -33 -19 -10 -8 -70 -15 -170 -21 l-155 -8 4 -73 c4
-93 -10 -112 -80 -112 -26 0 -60 5 -74 12 -19 8 -31 8 -51 -1 -45 -20 -55 -1
-55 98 0 47 -1 111 -3 141 -2 30 -5 107 -7 170 l-4 115 65 2 c36 2 103 7 150
11 150 15 372 13 397 -4z m338 -19 c11 -14 46 -54 78 -88 l58 -62 62 65 c34
36 75 73 89 83 28 18 113 24 122 9 3 -5 -32 -51 -77 -102 -147 -167 -134 -143
-139 -253 -3 -54 -10 -103 -16 -109 -8 -8 -8 -17 -1 -30 14 -26 11 -28 -47
-29 -119 -2 -165 3 -174 22 -6 10 -9 69 -8 131 l2 113 -57 75 c-32 41 -80 102
-107 134 -27 33 -47 62 -45 66 3 4 58 6 122 4 113 -3 119 -5 138 -29z m-4233
13 c16 -13 98 -150 98 -164 0 -4 29 -65 65 -135 36 -71 65 -135 65 -143 0 -10
-14 -17 -37 -21 -21 -4 -48 -10 -61 -16 -40 -16 -51 -10 -77 41 -29 57 -35 59
-157 38 -65 -11 -71 -14 -84 -43 -10 -25 -21 -34 -46 -38 -41 -6 -61 8 -48 33
15 28 12 38 -12 42 -18 2 -23 10 -24 36 -1 27 3 35 23 43 13 5 34 9 46 9 23 0
57 47 57 78 0 9 10 33 22 52 14 24 21 52 22 92 1 49 4 58 24 67 13 6 31 11 40
11 9 0 26 7 36 15 24 18 28 18 48 3z m1701 0 c16 -12 97 -143 97 -157 0 -3 32
-69 70 -146 39 -76 67 -142 62 -147 -4 -4 -28 -12 -52 -17 -25 -6 -57 -13 -72
-17 -25 -6 -29 -2 -50 42 -14 30 -31 50 -43 53 -11 2 -57 -2 -103 -9 -79 -12
-83 -13 -96 -45 -10 -24 -22 -34 -46 -38 -43 -9 -53 -1 -45 39 5 30 3 34 -15
34 -17 0 -20 6 -20 39 0 40 13 50 65 51 19 0 55 48 55 72 0 6 8 29 19 52 32
72 41 107 31 127 -8 14 -5 21 12 33 12 9 32 16 43 16 11 0 29 7 39 15 24 18
28 18 49 3z m-3021 -11 c-29 -9 -32 -13 -27 -39 8 -36 -11 -37 -20 -1 -8 32
15 54 54 52 24 -1 23 -2 -7 -12z m3499 4 c-12 -8 -51 -4 -51 5 0 2 15 4 33 4
22 0 28 -3 18 -9z m1081 -67 c2 -42 0 -78 -4 -81 -5 -2 -8 18 -8 45 0 27 -3
64 -6 81 -4 19 -2 31 4 31 6 0 12 -32 14 -76z m-1951 46 c12 -7 19 -21 19 -38
l-1 -27 -15 28 c-8 15 -22 27 -32 27 -9 0 -24 5 -32 10 -21 14 35 13 61 0z
m1004 -3 c73 -19 135 -61 135 -92 0 -15 -8 -29 -21 -36 -18 -9 -30 -6 -69 15
-37 20 -62 26 -109 26 -54 0 -62 -3 -78 -26 -21 -32 -33 -130 -25 -191 9 -58
41 -84 111 -91 38 -3 61 1 97 17 36 17 49 19 60 10 25 -21 15 -48 -28 -76 -38
-24 -54 -28 -148 -31 -114 -4 -170 10 -190 48 -6 11 -16 20 -23 20 -24 0 -59
95 -59 159 0 59 20 122 42 136 6 3 10 13 10 22 0 31 80 82 130 83 19 0 42 5
50 10 21 13 57 12 115 -3z m-1682 -23 c-14 -14 -28 -23 -31 -20 -8 8 29 46 44
46 7 0 2 -11 -13 -26z m159 -2 c-20 -15 -22 -23 -16 -60 4 -28 3 -42 -5 -42
-7 0 -11 19 -11 50 0 36 5 52 18 59 28 17 39 12 14 -7z m1224 -28 c-39 -40
-46 -38 -19 7 15 24 40 41 52 33 2 -2 -13 -20 -33 -40z m-1538 -33 l62 -66 63
68 c56 59 68 67 100 67 19 0 38 -3 40 -7 3 -5 -32 -53 -76 -108 -88 -108 -84
-97 -90 -255 l-2 -55 -87 -3 c-49 -1 -88 -1 -89 0 0 2 -3 50 -5 107 -3 75 -8
109 -19 121 -8 9 -15 20 -15 25 0 4 -18 29 -41 54 -83 94 -89 102 -84 111 3 6
45 9 93 9 l87 -1 63 -67z m786 59 c33 -12 48 -42 52 -107 3 -43 0 -57 -16 -73
l-20 -20 20 -28 c26 -35 35 -89 21 -125 -18 -46 -66 -60 -226 -64 -77 -3 -166
-7 -198 -10 -84 -7 -99 9 -97 102 1 38 -1 125 -4 191 l-5 122 47 5 c26 3 103
4 171 2 69 -2 134 1 145 5 29 12 80 12 110 0z m-1050 -16 c3 -8 2 -12 -4 -9
-6 3 -10 10 -10 16 0 14 7 11 14 -7z m-374 -22 c0 -9 -5 -24 -10 -32 -7 -11
-10 -5 -10 23 0 23 4 36 10 32 6 -3 10 -14 10 -23z m1701 16 c2 -21 -2 -43
-10 -51 -4 -4 -7 9 -8 28 -1 32 15 52 18 23z m2859 -28 c-11 -20 -50 -28 -50
-10 0 6 9 10 19 10 11 0 23 5 26 10 12 19 16 10 5 -10z m-4759 -47 c-8 -15
-10 -15 -11 -2 0 17 10 32 18 25 2 -3 -1 -13 -7 -23z m2599 9 c0 -9 -40 -35
-46 -29 -6 6 25 37 37 37 5 0 9 -3 9 -8z m316 -127 c-4 -19 -12 -37 -18 -41
-8 -5 -9 -1 -5 10 4 10 7 36 7 59 1 35 2 39 11 24 6 -10 8 -34 5 -52z m1942
38 c-15 -16 -30 -45 -33 -65 -4 -21 -12 -38 -17 -38 -19 0 3 74 30 103 14 15
30 27 36 27 5 0 -2 -12 -16 -27z m-3855 -16 c-6 -12 -15 -33 -20 -47 -9 -23
-10 -23 -15 -3 -3 12 3 34 14 52 23 35 37 34 21 -2z m3282 -82 c-23 -18 -81
-35 -115 -34 -17 1 -11 5 21 13 25 7 54 18 65 24 30 18 53 15 29 -3z m-2585
-130 c-7 -8 -19 -15 -27 -15 -10 0 -7 8 9 31 18 24 24 27 26 14 2 -9 -2 -22
-8 -30z m-1775 -5 c-4 -12 -9 -19 -12 -17 -3 3 -2 15 2 27 4 12 9 19 12 17 3
-3 2 -15 -2 -27z m820 -29 c-9 -8 -25 21 -25 44 0 16 3 14 15 -9 9 -16 13 -32
10 -35z m2085 47 c0 -17 -31 -48 -47 -48 -11 0 -8 8 9 29 24 32 38 38 38 19z
m-1655 -47 c-11 -10 -35 11 -35 30 0 21 0 21 19 -2 11 -13 18 -26 16 -28z
m1221 24 c13 -14 21 -25 18 -25 -11 0 -54 33 -54 41 0 15 12 10 36 -16z
m-1428 -7 c-3 -7 -18 -14 -34 -15 -20 -1 -22 0 -6 4 12 2 22 9 22 14 0 5 5 9
11 9 6 0 9 -6 7 -12z m3574 -45 c8 -10 6 -13 -11 -13 -18 0 -21 6 -20 38 0 34
1 35 10 13 5 -14 15 -31 21 -38z m-4097 14 c19 -4 19 -4 2 -12 -18 -7 -46 16
-47 39 0 6 6 3 13 -6 6 -9 21 -18 32 -21z m1700 1 c19 -5 19 -5 2 -13 -18 -7
-46 17 -46 40 0 6 5 3 12 -6 7 -9 21 -19 32 -21z m-1970 12 c-3 -5 -21 -9 -38
-9 l-32 2 35 7 c19 4 36 8 38 9 2 0 0 -3 -3 -9z m350 0 c-27 -12 -35 -12 -35
0 0 6 12 10 28 9 24 0 25 -1 7 -9z m1350 0 c-3 -5 -18 -9 -33 -9 l-27 1 30 8
c17 4 31 8 33 9 2 0 0 -3 -3 -9z m355 0 c-19 -13 -30 -13 -30 0 0 6 10 10 23
10 18 0 19 -2 7 -10z m-2324 -35 c-6 -22 -11 -25 -44 -24 -31 2 -32 3 -9 6 18
3 32 14 39 29 14 30 23 24 14 -11z m2839 16 c-14 -14 -73 -26 -60 -13 6 5 19
12 30 15 34 8 40 8 30 -2z m212 -21 l48 -8 -47 -1 c-56 -1 -78 6 -78 26 0 12
3 13 14 3 8 -6 36 -15 63 -20z m116 -1 c-6 -6 -18 -6 -28 -3 -18 7 -18 8 1 14
23 9 39 1 27 -11z m633 -14 c31 5 35 4 21 -5 -9 -6 -34 -10 -55 -8 -31 3 -37
7 -40 28 l-3 25 19 -23 c16 -20 24 -23 58 -17z m939 15 c16 -7 11 -9 -20 -9
-29 -1 -36 2 -25 9 17 11 19 11 45 0z m-5445 -24 c6 -8 21 -16 33 -18 19 -3
20 -4 5 -10 -12 -5 -27 1 -45 17 -16 13 -23 25 -17 25 6 0 17 -6 24 -14z m150
-76 c0 -11 -4 -20 -10 -20 -14 0 -13 -103 1 -117 21 -21 2 -43 -36 -43 -19 0
-35 5 -35 11 0 8 -5 7 -15 -1 -21 -17 -44 2 -28 22 22 26 20 128 -2 128 -8 0
-15 9 -15 19 0 18 8 20 70 20 63 0 70 -2 70 -19z m1189 -63 c17 -32 31 -62 31
-66 0 -14 -43 -21 -57 -9 -7 6 -29 12 -48 14 -26 2 -35 -1 -40 -16 -4 -12 -12
-17 -21 -13 -8 3 -13 12 -10 19 3 8 1 14 -4 14 -18 0 -10 22 9 27 22 6 43 46
35 67 -3 9 5 20 23 30 34 18 38 14 82 -67z m2146 -8 l34 -67 -25 -6 c-14 -4
-31 -3 -37 2 -7 5 -29 12 -49 16 -31 6 -38 4 -38 -9 0 -8 -7 -15 -15 -15 -8 0
-15 7 -15 15 0 8 -4 15 -10 15 -19 0 -10 21 14 30 16 6 27 20 31 40 4 18 16
41 27 52 26 26 40 14 83 -73z m-3205 51 c8 -10 20 -26 27 -36 10 -17 12 -14
12 19 1 36 2 37 37 37 l37 0 -8 -72 c-3 -40 -11 -76 -17 -79 -20 -13 -43 3
-62 42 -27 56 -34 56 -41 4 -7 -42 -9 -44 -34 -39 -35 9 -34 6 -35 71 -1 41 4
62 14 70 18 15 50 7 70 -17z m280 11 c-5 -11 -15 -21 -21 -23 -13 -4 -14 -101
-3 -120 5 -8 1 -9 -10 -5 -10 4 -29 7 -42 7 -22 0 -24 3 -24 55 0 52 -1 55
-26 55 -19 0 -25 5 -22 18 2 13 17 18 68 23 36 3 71 6 78 7 9 2 10 -3 2 -17z
m178 -3 c3 -15 -4 -18 -32 -18 -25 0 -36 -4 -36 -15 0 -10 11 -15 35 -15 24 0
35 -5 35 -15 0 -11 -11 -15 -41 -15 -55 0 -47 -24 9 -28 29 -2 42 -8 42 -18 0
-16 -25 -17 -108 -7 l-53 6 2 56 c3 92 1 90 77 88 55 -2 67 -5 70 -19z m230
10 c18 -18 14 -56 -7 -77 -17 -17 -18 -21 -5 -40 14 -19 13 -21 -4 -21 -10 0
-28 11 -40 25 -24 27 -52 24 -52 -5 0 -24 -9 -29 -43 -23 -26 5 -27 7 -27 73
0 45 4 70 13 73 26 11 153 7 165 -5z m557 -2 c47 -20 47 -40 0 -32 -53 10 -77
-7 -73 -52 l3 -37 48 1 c26 0 47 -3 47 -6 0 -35 -108 -42 -140 -10 -29 29 -27
94 5 125 28 28 60 31 110 11z m213 -8 c3 -15 -4 -18 -38 -18 -50 0 -51 -22 -1
-30 44 -7 44 -24 -1 -28 -54 -5 -52 -32 2 -32 29 0 40 -4 40 -15 0 -17 -28
-19 -104 -9 l-46 7 0 72 0 72 72 -1 c61 -1 73 -4 76 -18z m312 6 c0 -9 -9 -18
-21 -21 -19 -5 -20 -12 -17 -69 3 -63 3 -63 -22 -58 -49 11 -50 12 -50 64 0
43 -3 50 -20 50 -13 0 -20 7 -20 20 0 17 8 20 68 23 37 2 70 4 75 5 4 1 7 -5
7 -14z m155 6 c65 -15 94 -73 62 -125 -14 -24 -25 -28 -92 -33 -44 -3 -54 0
-78 24 -34 34 -36 82 -4 111 37 34 53 37 112 23z m505 -3 c0 -8 -9 -40 -20
-72 -11 -31 -18 -60 -16 -64 3 -4 -9 -8 -25 -9 -25 -2 -31 3 -51 45 l-22 47
-21 -46 c-17 -38 -25 -47 -51 -50 -24 -3 -30 0 -32 17 -1 12 -8 40 -17 64 -21
59 -20 61 20 61 27 0 35 -4 35 -17 0 -10 4 -24 9 -32 7 -11 13 -6 25 23 14 35
18 37 53 34 32 -2 39 -7 41 -28 6 -43 19 -43 36 -1 15 40 36 55 36 28z m136
-4 c27 -45 64 -115 64 -122 0 -13 -42 -22 -54 -12 -6 5 -28 11 -49 15 -32 6
-38 4 -45 -13 -8 -24 -26 -16 -36 16 -5 16 -2 25 13 32 11 6 25 28 32 48 17
55 53 71 75 36z m840 -4 c22 -18 16 -32 -11 -25 -59 15 -94 -18 -74 -71 8 -21
15 -24 47 -22 40 3 66 -7 57 -21 -3 -5 -12 -7 -20 -3 -8 3 -15 1 -15 -4 0 -17
-111 4 -126 24 -26 34 -13 100 25 131 18 14 96 9 117 -9z m816 -54 l37 -70
-25 -8 c-16 -6 -30 -5 -40 3 -22 19 -81 22 -88 4 -7 -19 -26 -18 -26 1 0 8 -4
15 -10 15 -20 0 -9 21 15 30 24 9 30 24 27 63 -1 10 2 16 7 13 5 -3 12 1 15
10 4 9 15 14 28 12 17 -2 33 -22 60 -73z m183 61 c47 -20 47 -40 0 -32 -46 9
-75 -7 -75 -42 0 -45 13 -56 59 -49 30 4 41 2 41 -8 0 -32 -95 -35 -134 -4
-30 24 -34 64 -11 109 22 43 60 51 120 26z m398 4 c19 0 24 -26 6 -32 -13 -4
-16 -42 -5 -84 l7 -32 -55 -1 c-57 0 -68 7 -41 29 17 14 21 90 5 90 -5 0 -10
10 -10 21 0 19 4 21 38 15 20 -3 45 -6 55 -6z m117 0 c5 0 17 -13 27 -30 9
-16 21 -30 25 -30 4 0 8 14 8 30 0 28 3 30 36 30 l36 0 -5 -71 c-2 -42 -9 -74
-17 -79 -15 -9 -50 -1 -50 12 0 5 -11 25 -24 45 l-24 35 -9 -42 c-4 -23 -11
-41 -15 -41 -5 1 -19 1 -32 1 -23 0 -23 2 -20 67 3 66 15 88 42 78 8 -3 18 -5
22 -5z m317 -3 c21 -15 4 -27 -38 -27 -50 0 -49 -23 1 -30 50 -8 51 -30 1 -30
-30 0 -41 -4 -41 -15 0 -11 12 -15 45 -15 33 0 45 -4 45 -15 0 -17 -24 -19
-108 -8 l-54 6 6 66 c3 36 5 69 6 72 0 11 124 7 137 -4z m-4374 -7 c9 0 17 -4
17 -10 0 -5 -16 -10 -35 -10 -28 0 -35 -4 -35 -19 0 -15 8 -21 35 -23 20 -2
35 -7 35 -13 0 -5 -15 -11 -35 -13 -30 -3 -35 -7 -35 -28 0 -18 -5 -24 -23
-24 -13 0 -28 -5 -33 -10 -7 -7 -11 9 -13 51 -1 35 -6 70 -11 79 -7 13 -2 16
28 18 20 2 39 5 41 8 3 3 15 3 26 0 11 -3 28 -6 38 -6z m1856 -14 c23 -21 38
-20 51 4 6 11 17 20 25 20 16 0 20 -16 6 -24 -17 -11 -50 -94 -44 -114 4 -18
0 -20 -34 -19 l-38 2 3 40 c3 33 -1 45 -22 64 -36 34 -34 53 5 47 17 -2 39
-12 48 -20z m299 -18 c-3 -24 -1 -55 3 -70 6 -24 4 -29 -14 -32 -41 -9 -155
-14 -163 -7 -5 3 -10 36 -12 73 l-2 67 67 4 c38 2 81 4 97 5 27 2 28 1 24 -40z
m512 22 c0 -11 4 -20 9 -20 4 0 20 9 34 20 25 20 57 27 57 12 0 -5 -14 -18
-30 -31 l-30 -22 26 -44 c24 -41 24 -45 7 -45 -10 0 -27 14 -37 31 -21 35 -40
34 -44 -4 -3 -22 -8 -27 -32 -27 -39 0 -43 11 -35 86 l7 64 34 0 c27 0 34 -4
34 -20z m511 12 c0 -4 1 -36 2 -72 l2 -65 -32 -3 c-28 -3 -32 0 -39 30 l-7 33
-14 -33 c-16 -40 -34 -41 -51 -2 -16 35 -35 31 -26 -6 6 -22 3 -24 -30 -24
l-36 0 -1 55 c-1 30 -2 61 -3 68 -1 7 14 13 34 15 33 3 38 -1 59 -39 l24 -42
18 24 c10 13 19 29 19 35 0 5 4 14 10 20 11 11 70 16 71 6z m509 -28 c0 -31 3
-35 23 -32 17 2 23 11 25 36 3 29 6 32 36 32 l34 0 1 -75 1 -75 -29 0 c-23 0
-30 5 -35 26 -5 19 -12 25 -29 22 -17 -2 -22 -10 -22 -30 1 -24 -2 -27 -25
-22 -45 10 -50 13 -50 33 0 11 -6 21 -12 24 -10 4 -10 7 0 18 6 7 12 25 12 39
0 34 7 40 42 40 25 0 28 -3 28 -36z"/>
<path d="M800 860 c30 -24 44 -25 36 -4 -3 9 -6 18 -6 20 0 2 -12 4 -27 4
l-28 0 25 -20z"/>
<path d="M310 850 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M366 851 c-8 -12 21 -34 33 -27 6 4 8 13 4 21 -6 17 -29 20 -37 6z"/>
<path d="M920 586 c0 -9 7 -16 16 -16 9 0 14 5 12 12 -6 18 -28 21 -28 4z"/>
<path d="M965 419 c-4 -6 -5 -13 -2 -16 7 -7 27 6 27 18 0 12 -17 12 -25 -2z"/>
<path d="M362 388 c3 -7 15 -14 29 -16 24 -4 24 -3 4 12 -24 19 -38 20 -33 4z"/>
<path d="M4106 883 c-14 -14 -5 -31 14 -26 11 3 20 9 20 13 0 10 -26 20 -34
13z"/>
<path d="M4590 870 c-14 -10 -22 -22 -18 -25 7 -8 57 25 58 38 0 12 -14 8 -40
-13z"/>
<path d="M4380 655 c7 -8 17 -15 22 -15 6 0 5 7 -2 15 -7 8 -17 15 -22 15 -6
0 -5 -7 2 -15z"/>
<path d="M4082 560 c-6 -11 -12 -28 -12 -37 0 -13 6 -10 20 12 11 17 20 33 20
38 0 14 -15 7 -28 -13z"/>
<path d="M4496 466 c3 -9 11 -16 16 -16 13 0 5 23 -10 28 -7 2 -10 -2 -6 -12z"/>
<path d="M4236 445 c-9 -24 5 -41 16 -20 7 11 7 20 0 27 -6 6 -12 3 -16 -7z"/>
<path d="M4540 400 c0 -5 5 -10 11 -10 5 0 7 5 4 10 -3 6 -8 10 -11 10 -2 0
-4 -4 -4 -10z"/>
<path d="M5330 891 c0 -11 26 -22 34 -14 3 3 3 10 0 14 -7 12 -34 11 -34 0z"/>
<path d="M4805 880 c-8 -13 4 -32 16 -25 12 8 12 35 0 35 -6 0 -13 -4 -16 -10z"/>
<path d="M5070 821 l-35 -6 0 -75 0 -75 40 -3 c22 -2 58 3 80 10 38 12 40 16
47 63 12 88 -16 107 -132 86z m109 -36 c3 -19 2 -19 -15 -4 -11 9 -26 19 -34
22 -8 4 -2 5 15 4 21 -1 31 -8 34 -22z"/>
<path d="M5411 694 c0 -11 3 -14 6 -6 3 7 2 16 -1 19 -3 4 -6 -2 -5 -13z"/>
<path d="M5223 674 c-10 -22 -10 -25 3 -20 9 3 18 6 20 6 2 0 4 9 4 20 0 28
-13 25 -27 -6z"/>
<path d="M5001 422 c-14 -27 -12 -35 8 -23 7 5 11 17 9 27 -4 17 -5 17 -17 -4z"/>
<path d="M5673 883 c9 -9 19 -14 23 -11 10 10 -6 28 -24 28 -15 0 -15 -1 1
-17z"/>
<path d="M5866 717 c-14 -10 -16 -16 -7 -22 15 -9 35 8 30 24 -3 8 -10 7 -23
-2z"/>
<path d="M5700 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M5700 451 c0 -23 25 -46 34 -32 4 6 -2 19 -14 31 -19 19 -20 19 -20
1z"/>
<path d="M1375 850 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M1391 687 c-5 -12 -7 -35 -6 -50 2 -15 -1 -27 -7 -27 -5 0 -6 9 -3
21 5 15 4 19 -4 15 -6 -4 -11 -18 -11 -30 0 -19 7 -25 33 -29 17 -2 42 1 55 7
l22 12 -27 52 c-29 57 -39 63 -52 29z"/>
<path d="M1240 520 c0 -5 5 -10 10 -10 6 0 10 5 10 10 0 6 -4 10 -10 10 -5 0
-10 -4 -10 -10z"/>
<path d="M1575 490 c4 -14 9 -27 11 -29 7 -7 34 9 34 20 0 7 -3 9 -7 6 -3 -4
-15 1 -26 10 -19 17 -19 17 -12 -7z"/>
<path d="M3094 688 c-4 -13 -7 -35 -6 -50 1 -16 -2 -28 -8 -28 -5 0 -6 7 -3
17 4 11 3 14 -5 9 -16 -10 -15 -49 1 -43 6 2 20 0 29 -4 10 -6 27 -5 41 2 28
13 26 30 -8 86 -24 39 -31 41 -41 11z"/>
<path d="M3270 502 c0 -19 29 -47 39 -37 6 7 1 16 -15 28 -13 10 -24 14 -24 9z"/>
<path d="M3570 812 c-13 -10 -21 -24 -19 -31 3 -7 15 0 34 19 31 33 21 41 -15
12z"/>
<path d="M3855 480 c-3 -5 -1 -10 4 -10 6 0 11 5 11 10 0 6 -2 10 -4 10 -3 0
-8 -4 -11 -10z"/>
<path d="M3585 450 c3 -5 13 -10 21 -10 8 0 12 5 9 10 -3 6 -13 10 -21 10 -8
0 -12 -4 -9 -10z"/>
<path d="M1880 820 c0 -5 7 -10 16 -10 8 0 12 5 9 10 -3 6 -10 10 -16 10 -5 0
-9 -4 -9 -10z"/>
<path d="M2042 668 c-7 -7 -12 -23 -12 -37 1 -24 2 -24 16 8 16 37 14 47 -4
29z"/>
<path d="M2015 560 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M1915 470 c4 -6 11 -8 16 -5 14 9 11 15 -7 15 -8 0 -12 -5 -9 -10z"/>
<path d="M2320 795 c0 -14 5 -25 10 -25 6 0 10 11 10 25 0 14 -4 25 -10 25 -5
0 -10 -11 -10 -25z"/>
<path d="M2660 771 c0 -6 5 -13 10 -16 6 -3 10 1 10 9 0 9 -4 16 -10 16 -5 0
-10 -4 -10 -9z"/>
<path d="M2487 763 c-4 -3 -7 -23 -7 -43 0 -36 1 -38 40 -43 68 -9 116 20 102
61 -3 10 -7 10 -18 1 -11 -9 -14 -7 -14 10 0 18 -6 21 -48 21 -27 0 -52 -3
-55 -7z"/>
<path d="M2320 719 c0 -5 5 -7 10 -4 6 3 10 8 10 11 0 2 -4 4 -10 4 -5 0 -10
-5 -10 -11z"/>
<path d="M2480 550 l0 -40 66 1 c58 1 67 4 76 25 18 39 -4 54 -78 54 l-64 0 0
-40z m40 15 c-7 -8 -16 -15 -21 -15 -5 0 -6 7 -3 15 4 8 13 15 21 15 13 0 13
-3 3 -15z"/>
<path d="M2665 527 c-4 -10 -5 -21 -1 -24 10 -10 18 4 13 24 -4 17 -4 17 -12
0z"/>
<path d="M1586 205 c-9 -23 -8 -25 9 -25 17 0 19 9 6 28 -7 11 -10 10 -15 -3z"/>
<path d="M3727 200 c-3 -13 0 -20 9 -20 15 0 19 26 5 34 -5 3 -11 -3 -14 -14z"/>
<path d="M1194 229 c-3 -6 -2 -15 3 -20 13 -13 43 -1 43 17 0 16 -36 19 -46 3z"/>
<path d="M2470 224 c-18 -46 -12 -73 15 -80 37 -9 52 1 59 40 5 26 3 41 -8 51
-23 24 -55 18 -66 -11z"/>
<path d="M3120 196 c0 -9 7 -16 16 -16 17 0 14 22 -4 28 -7 2 -12 -3 -12 -12z"/>
<path d="M4750 201 c0 -12 5 -21 10 -21 6 0 10 6 10 14 0 8 -4 18 -10 21 -5 3
-10 -3 -10 -14z"/>
<path d="M3515 229 c-8 -12 14 -31 30 -26 6 2 10 10 10 18 0 17 -31 24 -40 8z"/>
<path d="M3521 161 c-7 -5 -9 -11 -4 -14 14 -9 54 4 47 14 -7 11 -25 11 -43 0z"/>
</g>
</svg>

After

Width:  |  Height:  |  Size: 18 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 56 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 10 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 18 KiB

576
index.rst
View File

@ -1,20 +1,373 @@
waybackpy
=========
|Build Status| |Downloads| |Release| |Codacy Badge| |License: MIT|
|Maintainability| |CodeFactor| |made-with-python| |pypi| |PyPI - Python
Version| |Maintenance|
|contributions welcome| |Build Status| |codecov| |Downloads| |Release|
|Codacy Badge| |Maintainability| |CodeFactor| |made-with-python| |pypi|
|PyPI - Python Version| |Maintenance| |Repo size| |License: MIT|
.. |Build Status| image:: https://travis-ci.org/akamhy/waybackpy.svg?branch=master
|Internet Archive| |Wayback Machine|
Waybackpy is a Python library that interfaces with the `Internet
Archive <https://en.wikipedia.org/wiki/Internet_Archive>`__'s `Wayback
Machine <https://en.wikipedia.org/wiki/Wayback_Machine>`__ API. Archive
pages and retrieve archived pages easily.
Table of contents
=================
.. raw:: html
<!--ts-->
- `Installation <#installation>`__
- `Usage <#usage>`__
- `As a Python package <#as-a-python-package>`__
- `Saving an url using
save() <#capturing-aka-saving-an-url-using-save>`__
- `Receiving the oldest archive for an URL Using
oldest() <#receiving-the-oldest-archive-for-an-url-using-oldest>`__
- `Receiving the recent most/newest archive for an URL using
newest() <#receiving-the-newest-archive-for-an-url-using-newest>`__
- `Receiving archive close to a specified year, month, day, hour,
and minute using
near() <#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near>`__
- `Get the content of webpage using
get() <#get-the-content-of-webpage-using-get>`__
- `Count total archives for an URL using
total\_archives() <#count-total-archives-for-an-url-using-total_archives>`__
- `With Command-line interface <#with-the-command-line-interface>`__
- `Save <#save>`__
- `Oldest archive <#oldest-archive>`__
- `Newest archive <#newest-archive>`__
- `Total archives <#total-number-of-archives>`__
- `Archive near a time <#archive-near-time>`__
- `Get the source code <#get-the-source-code>`__
- `Tests <#tests>`__
- `Dependency <#dependency>`__
- `License <#license>`__
.. raw:: html
<!--te-->
Installation
------------
Using `pip <https://en.wikipedia.org/wiki/Pip_(package_manager)>`__:
.. code:: bash
pip install waybackpy
or direct from this repository using git.
.. code:: bash
pip install git+https://github.com/akamhy/waybackpy.git
Usage
-----
As a Python package
~~~~~~~~~~~~~~~~~~~
Capturing aka Saving an url using save()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
new_archive_url = waybackpy.Url(
url = "https://en.wikipedia.org/wiki/Multivariable_calculus",
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
).save()
print(new_archive_url)
.. code:: bash
https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy
Try this out in your browser @
https://repl.it/@akamhy/WaybackPySaveExample\
Receiving the oldest archive for an URL using oldest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
oldest_archive_url = waybackpy.Url(
"https://www.google.com/",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
).oldest()
print(oldest_archive_url)
.. code:: bash
http://web.archive.org/web/19981111184551/http://google.com:80/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyOldestExample\
Receiving the newest archive for an URL using newest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
newest_archive_url = waybackpy.Url(
"https://www.facebook.com/",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
).newest()
print(newest_archive_url)
.. code:: bash
https://web.archive.org/web/20200714013225/https://www.facebook.com/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyNewestExample\
Receiving archive close to a specified year, month, day, hour, and minute using near()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
from waybackpy import Url
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
github_url = "https://github.com/"
github_wayback_obj = Url(github_url, user_agent)
# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
.. code:: python
github_archive_near_2010 = github_wayback_obj.near(year=2010)
print(github_archive_near_2010)
.. code:: bash
https://web.archive.org/web/20100719134402/http://github.com/
.. code:: python
github_archive_near_2011_may = github_wayback_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)
.. code:: bash
https://web.archive.org/web/20110519185447/https://github.com/
.. code:: python
github_archive_near_2015_january_26 = github_wayback_obj.near(
year=2015, month=1, day=26
)
print(github_archive_near_2015_january_26)
.. code:: bash
https://web.archive.org/web/20150127031159/https://github.com
.. code:: python
github_archive_near_2018_4_july_9_2_am = github_wayback_obj.near(
year=2018, month=7, day=4, hour = 9, minute = 2
)
print(github_archive_near_2018_4_july_9_2_am)
.. code:: bash
https://web.archive.org/web/20180704090245/https://github.com/
The library doesn't supports seconds yet. You are encourged to create a
PR ;)
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyNearExample\
Get the content of webpage using get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
google_url = "https://www.google.com/"
User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
waybackpy_url_object = waybackpy.Url(google_url, User_Agent)
# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)
# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(
waybackpy_url_object.save()
)
print(google_newest_archive_source)
# waybackpy_url_object.oldest() type is str, it's oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(
waybackpy_url_object.oldest()
)
print(google_oldest_archive_source)
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyGetExample#main.py\
Count total archives for an URL using total\_archives()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
URL = "https://en.wikipedia.org/wiki/Python (programming language)"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
archive_count = waybackpy.Url(
url=URL,
user_agent=UA
).total_archives()
print(archive_count) # total_archives() returns an int
.. code:: bash
2440
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyTotalArchivesExample\
With the Command-line interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Save
^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashSave\
Oldest archive
^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashOldest\
Newest archive
^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashNewest\
Total number of archives
^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
853
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashTotal\
Archive near time
^^^^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
https://web.archive.org/web/20120512142515/https://www.facebook.com/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashNear\
Get the source code
^^^^^^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashGet\
Tests
-----
- `Here <https://github.com/akamhy/waybackpy/tree/master/tests>`__
Dependency
----------
- None, just python standard libraries (re, json, urllib, argparse and
datetime). Both python 2 and 3 are supported :)
License
-------
`MIT
License <https://github.com/akamhy/waybackpy/blob/master/LICENSE>`__
.. |contributions welcome| image:: https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square
.. |Build Status| image:: https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square
:target: https://travis-ci.org/akamhy/waybackpy
.. |Downloads| image:: https://img.shields.io/pypi/dm/waybackpy.svg
:target: https://pypistats.org/packages/waybackpy
.. |codecov| image:: https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg
:target: https://codecov.io/gh/akamhy/waybackpy
.. |Downloads| image:: https://pepy.tech/badge/waybackpy/month
:target: https://pepy.tech/project/waybackpy/month
.. |Release| image:: https://img.shields.io/github/v/release/akamhy/waybackpy.svg
:target: https://github.com/akamhy/waybackpy/releases
.. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65
:target: https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
:target: https://github.com/akamhy/waybackpy/blob/master/LICENSE
.. |Maintainability| image:: https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability
:target: https://codeclimate.com/github/akamhy/waybackpy/maintainability
.. |CodeFactor| image:: https://www.codefactor.io/repository/github/akamhy/waybackpy/badge
@ -22,211 +375,12 @@ Version| |Maintenance|
.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
:target: https://www.python.org/
.. |pypi| image:: https://img.shields.io/pypi/v/waybackpy.svg
:target: https://pypi.org/project/waybackpy/
.. |PyPI - Python Version| image:: https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square
.. |Maintenance| image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
:target: https://github.com/akamhy/waybackpy/graphs/commit-activity
|Internet Archive| |Wayback Machine|
The waybackpy is a python wrapper for `Internet Archive`_\ s `Wayback
Machine`_.
.. _Internet Archive: https://en.wikipedia.org/wiki/Internet_Archive
.. _Wayback Machine: https://en.wikipedia.org/wiki/Wayback_Machine
.. |Repo size| image:: https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
:target: https://github.com/akamhy/waybackpy/blob/master/LICENSE
.. |Internet Archive| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png
.. |Wayback Machine| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png
Installation
------------
Using `pip`_:
**pip install waybackpy**
.. _pip: https://en.wikipedia.org/wiki/Pip_(package_manager)
Usage
-----
Archiving aka Saving an url Using save()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.save(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
import waybackpy
# Capturing a new archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
print(archived_url)
This should print something similar to the following archived URL:
https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy
Receiving the oldest archive for an URL Using oldest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.oldest(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
import waybackpy
# retrieving the oldest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
print(oldest_archive)
This returns the oldest available archive for https://google.com.
http://web.archive.org/web/19981111184551/http://google.com:80/
Receiving the newest archive for an URL using newest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.newest(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
import waybackpy
# retrieving the newest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
print(newest_archive)
This returns the newest available archive for
https://www.microsoft.com/en-us, something just like this:
http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/
Receiving archive close to a specified year, month, day, hour, and minute using near()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)
..
url is mandotory. year,month,day,hour and minute are optional
arguments. UA is not mandotory, but higly recomended.
.. code:: python
import waybackpy
# retriving the the closest archive from a specified year.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are year,month,day,hour and minute
archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
print(archive_near_year)
returns :
http://web.archive.org/web/20100504071154/http://www.facebook.com/
``waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``
returns:
http://web.archive.org/web/20101111173430/http://www.facebook.com//
``waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``
returns:
http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html
> Please note that if you only specify the year, the current month and
day are default arguments for month and day respectively. Do not expect
just putting the year parameter would return the archive closer to
January but the current month you are using the package. If you are
using it in July 2018 and lets say you use
``waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``
then you would be returned the nearest archive to July 2011 and not
January 2011. You need to specify the month “1” for January.
Do not pad (dont use zeros in the month, year, day, minute, and hour
arguments). e.g. For January, set month = 1 and not month = 01.
Get the content of webpage using get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.get(url, encoding="UTF-8", UA=user_agent)
..
url is mandatory. UA is not, but highly recommended. encoding is
detected automatically, dont specify unless necessary.
.. code:: python
from waybackpy import get
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url, encoding and UA
webpage = get("https://example.com/", UA="User-Agent")
print(webpage)
..
This should print the source code for https://example.com/.
Count total archives for an URL using total_archives()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.total_archives(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
from waybackpy import total_archives
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url and UA
count = total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA="User-Agent")
print(count)
..
This should print an integer (int), which is the number of total
archives on archive.org
Tests
-----
- `Here`_
Dependency
----------
- None, just python standard libraries (json, urllib and datetime).
Both python 2 and 3 are supported :)
License
-------
`MIT License`_
.. _Here: https://github.com/akamhy/waybackpy/tree/master/tests
.. _MIT License: https://github.com/akamhy/waybackpy/blob/master/LICENSE

View File

@ -1,3 +1,7 @@
[metadata]
description-file = README.md
license_file = LICENSE
[flake8]
max-line-length = 88
extend-ignore = E203,W503

View File

@ -19,7 +19,7 @@ setup(
author = about['__author__'],
author_email = about['__author_email__'],
url = about['__url__'],
download_url = 'https://github.com/akamhy/waybackpy/archive/v1.5.tar.gz',
download_url = 'https://github.com/akamhy/waybackpy/archive/2.1.8.tar.gz',
keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
install_requires=[],
python_requires= ">=2.7",
@ -42,6 +42,11 @@ setup(
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
],
entry_points={
'console_scripts': [
'waybackpy = waybackpy.cli:main'
]
},
project_urls={
'Documentation': 'https://waybackpy.readthedocs.io',
'Source': 'https://github.com/akamhy/waybackpy',

View File

@ -1,98 +0,0 @@
import sys
sys.path.append("..")
import waybackpy
import pytest
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_clean_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network_security"
test_result = waybackpy.clean_url(test_url)
assert answer == test_result
def test_url_check():
InvalidUrl = "http://wwwgooglecom/"
with pytest.raises(Exception) as e_info:
waybackpy.url_check(InvalidUrl)
def test_save():
# Test for urls that exist and can be archived.
url1="https://github.com/akamhy/waybackpy"
archived_url1 = waybackpy.save(url1, UA=user_agent)
assert url1 in archived_url1
# Test for urls that are incorrect.
with pytest.raises(Exception) as e_info:
url2 = "ha ha ha ha"
waybackpy.save(url2, UA=user_agent)
# Test for urls not allowed to archive by robot.txt.
with pytest.raises(Exception) as e_info:
url3 = "http://www.archive.is/faq.html"
waybackpy.save(url3, UA=user_agent)
# Non existent urls, test
with pytest.raises(Exception) as e_info:
url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
archived_url4 = waybackpy.save(url4, UA=user_agent)
def test_near():
url = "google.com"
archive_near_year = waybackpy.near(url, year=2010, UA=user_agent)
assert "2010" in archive_near_year
archive_near_month_year = waybackpy.near(url, year=2015, month=2, UA=user_agent)
assert ("201502" in archive_near_month_year) or ("201501" in archive_near_month_year) or ("201503" in archive_near_month_year)
archive_near_day_month_year = waybackpy.near(url, year=2006, month=11, day=15, UA=user_agent)
assert ("20061114" in archive_near_day_month_year) or ("20061115" in archive_near_day_month_year) or ("2006116" in archive_near_day_month_year)
archive_near_hour_day_month_year = waybackpy.near("www.python.org", year=2008, month=5, day=9, hour=15, UA=user_agent)
assert ("2008050915" in archive_near_hour_day_month_year) or ("2008050914" in archive_near_hour_day_month_year) or ("2008050913" in archive_near_hour_day_month_year)
with pytest.raises(Exception) as e_info:
NeverArchivedUrl = "https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
waybackpy.near(NeverArchivedUrl, year=2010, UA=user_agent)
def test_oldest():
url = "github.com/akamhy/waybackpy"
archive_oldest = waybackpy.oldest(url, UA=user_agent)
assert "20200504141153" in archive_oldest
def test_newest():
url = "github.com/akamhy/waybackpy"
archive_newest = waybackpy.newest(url, UA=user_agent)
assert url in archive_newest
def test_get():
oldest_google_archive = waybackpy.oldest("google.com", UA=user_agent)
oldest_google_page_text = waybackpy.get(oldest_google_archive, UA=user_agent)
assert "Welcome to Google" in oldest_google_page_text
def test_total_archives():
count1 = waybackpy.total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA=user_agent)
assert count1 > 2000
count2 = waybackpy.total_archives("https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8", UA=user_agent)
assert count2 == 0
if __name__ == "__main__":
test_clean_url()
print(".")
test_url_check()
print(".")
test_get()
print(".")
test_near()
print(".")
test_newest()
print(".")
test_save()
print(".")
test_oldest()
print(".")
test_total_archives()
print(".")

97
tests/test_cli.py Normal file
View File

@ -0,0 +1,97 @@
# -*- coding: utf-8 -*-
import sys
import os
import pytest
import argparse
sys.path.append("..")
import waybackpy.cli as cli # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
codecov_python = False
if sys.version_info > (3, 7):
codecov_python = True
# Namespace(day=None, get=None, hour=None, minute=None, month=None, near=False,
# newest=False, oldest=False, save=False, total=False, url=None, user_agent=None, version=False, year=None)
if codecov_python:
def test_save():
args = argparse.Namespace(user_agent=None, url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=True, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get=None)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in reply
def test_oldest():
args = argparse.Namespace(user_agent=None, url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=True, save=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get=None)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in reply
def test_newest():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=True, near=False, alive=False, subdomain=False, known_urls=False, get=None)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in reply
def test_total_archives():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=True, version=False,
oldest=False, save=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get=None)
reply = cli.args_handler(args)
assert isinstance(reply, int)
def test_near():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=True, alive=False, subdomain=False, known_urls=False, get=None, year=2020, month=7, day=15, hour=1, minute=1)
reply = cli.args_handler(args)
assert "202007" in reply
def test_get():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="url")
reply = cli.args_handler(args)
assert "waybackpy" in reply
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="oldest")
reply = cli.args_handler(args)
assert "waybackpy" in reply
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="newest")
reply = cli.args_handler(args)
assert "waybackpy" in reply
if codecov_python:
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="save")
reply = cli.args_handler(args)
assert "waybackpy" in reply
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="BullShit")
reply = cli.args_handler(args)
assert "get the source code of the" in reply
def test_args_handler():
args = argparse.Namespace(version=True)
reply = cli.args_handler(args)
assert ("waybackpy version %s" % (__version__)) == reply
args = argparse.Namespace(url=None, version=False)
reply = cli.args_handler(args)
assert ("waybackpy %s" % (__version__)) in reply
def test_main():
# This also tests the parse_args method in cli.py
cli.main(['temp.py', '--version'])

190
tests/test_wrapper.py Normal file
View File

@ -0,0 +1,190 @@
# -*- coding: utf-8 -*-
import sys
import pytest
import random
import time
sys.path.append("..")
import waybackpy.wrapper as waybackpy # noqa: E402
if sys.version_info >= (3, 0): # If the python ver >= 3
from urllib.request import Request, urlopen
from urllib.error import URLError
else: # For python2.x
from urllib2 import Request, urlopen, URLError
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_clean_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network_security"
target = waybackpy.Url(test_url, user_agent)
test_result = target._clean_url()
assert answer == test_result
def test_dunders():
url = "https://en.wikipedia.org/wiki/Network_security"
user_agent = "UA"
target = waybackpy.Url(url, user_agent)
assert "waybackpy.Url(url=%s, user_agent=%s)" % (url, user_agent) == repr(target)
assert len(target) == len(url)
assert str(target) == url
def test_archive_url_parser():
request_url = "https://amazon.com"
hdr = {"User-Agent": user_agent} # nosec
req = Request(request_url, headers=hdr) # nosec
header = waybackpy._get_response(req).headers
with pytest.raises(Exception):
waybackpy._archive_url_parser(header)
def test_url_check():
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
waybackpy.Url(broken_url, user_agent)
def test_save():
# Test for urls that exist and can be archived.
time.sleep(10)
url_list = [
"en.wikipedia.org",
"www.wikidata.org",
"commons.wikimedia.org",
"www.wiktionary.org",
"www.w3schools.com",
"www.ibm.com",
]
x = random.randint(0, len(url_list) - 1)
url1 = url_list[x]
target = waybackpy.Url(
url1,
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36",
)
archived_url1 = target.save()
assert url1 in archived_url1
if sys.version_info > (3, 6):
# Test for urls that are incorrect.
with pytest.raises(Exception):
url2 = "ha ha ha ha"
waybackpy.Url(url2, user_agent)
time.sleep(5)
url3 = "http://www.archive.is/faq.html"
# Test for urls not allowed to archive by robot.txt. Doesn't works anymore. Find alternatives.
# with pytest.raises(Exception):
#
# target = waybackpy.Url(
# url3,
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) "
# "Gecko/20100101 Firefox/25.0",
# )
# target.save()
# time.sleep(5)
# Non existent urls, test
with pytest.raises(Exception):
target = waybackpy.Url(
url3,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
"AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
"Safari/533.20.27",
)
target.save()
else:
pass
def test_near():
time.sleep(10)
url = "google.com"
target = waybackpy.Url(
url,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 "
"(KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
)
archive_near_year = target.near(year=2010)
assert "2010" in archive_near_year
if sys.version_info > (3, 6):
time.sleep(5)
archive_near_month_year = target.near(year=2015, month=2)
assert (
("201502" in archive_near_month_year)
or ("201501" in archive_near_month_year)
or ("201503" in archive_near_month_year)
)
target = waybackpy.Url(
"www.python.org",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
)
archive_near_hour_day_month_year = target.near(
year=2008, month=5, day=9, hour=15
)
assert (
("2008050915" in archive_near_hour_day_month_year)
or ("2008050914" in archive_near_hour_day_month_year)
or ("2008050913" in archive_near_hour_day_month_year)
)
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = waybackpy.Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
else:
pass
def test_oldest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "20200504141153" in target.oldest()
def test_newest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert url in target.newest()
def test_get():
target = waybackpy.Url("google.com", user_agent)
assert "Welcome to Google" in target.get(target.oldest())
def test_wayback_timestamp():
ts = waybackpy._wayback_timestamp(
year=2020, month=1, day=2, hour=3, minute=4
)
assert "202001020304" in str(ts)
def test_get_response():
hdr = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) "
"Gecko/20100101 Firefox/78.0"
}
req = Request("https://www.google.com", headers=hdr) # nosec
response = waybackpy._get_response(req)
assert response.code == 200
def test_total_archives():
if sys.version_info > (3, 6):
target = waybackpy.Url(" https://google.com ", user_agent)
assert target.total_archives() > 500000
else:
pass
target = waybackpy.Url(
" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
)
assert target.total_archives() == 0

View File

@ -10,13 +10,15 @@
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
"""
A python wrapper for Internet Archive's Wayback Machine API.
Waybackpy is a Python library that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Archive pages and retrieve archived pages easily.
Usage:
>>> import waybackpy
>>> new_archive = waybackpy.save('https://www.python.org')
>>> target_url = waybackpy.Url('https://www.python.org', 'Your-apps-cool-user-agent')
>>> new_archive = target_url.save()
>>> print(new_archive)
https://web.archive.org/web/20200502170312/https://www.python.org/
@ -25,6 +27,14 @@ Full documentation @ <https://akamhy.github.io/waybackpy/>.
:license: MIT
"""
from .wrapper import save, near, oldest, newest, get, clean_url, url_check, total_archives
from .__version__ import __title__, __description__, __url__, __version__
from .__version__ import __author__, __author_email__, __license__, __copyright__
from .wrapper import Url
from .__version__ import (
__title__,
__description__,
__url__,
__version__,
__author__,
__author_email__,
__license__,
__copyright__,
)

View File

@ -1,7 +1,9 @@
# -*- coding: utf-8 -*-
__title__ = "waybackpy"
__description__ = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily."
__description__ = "A Python library that interfaces with the Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily."
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "v1.5"
__version__ = "2.1.8"
__author__ = "akamhy"
__author_email__ = "akash3pro@gmail.com"
__license__ = "MIT"

165
waybackpy/cli.py Normal file
View File

@ -0,0 +1,165 @@
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import os
import re
import argparse
from waybackpy.wrapper import Url
from waybackpy.__version__ import __version__
def _save(obj):
return (obj.save())
def _oldest(obj):
return (obj.oldest())
def _newest(obj):
return (obj.newest())
def _total_archives(obj):
return (obj.total_archives())
def _near(obj, args):
_near_args = {}
if args.year:
_near_args["year"] = args.year
if args.month:
_near_args["month"] = args.month
if args.day:
_near_args["day"] = args.day
if args.hour:
_near_args["hour"] = args.hour
if args.minute:
_near_args["minute"] = args.minute
return (obj.near(**_near_args))
def _known_urls(obj, args):
sd = False
al = False
if args.subdomain:
sd = True
if args.alive:
al = True
url_list = obj.known_urls(alive=al, subdomain=sd)
total_urls = len(url_list)
if total_urls > 0:
m = re.search('https?://([A-Za-z_0-9.-]+).*', url_list[0])
if m:
domain = m.group(1)
else:
domain = "waybackpy-known"
dir_path = os.path.abspath(os.getcwd())
file_name = dir_path + "/%s-%d-urls.txt" % (domain, total_urls)
text = "\n".join(url_list) + "\n"
with open(file_name, "a+") as f:
f.write(text)
text = text + "%d URLs found and saved in ./%s-%d-urls.txt" % (
total_urls, domain, total_urls
)
else:
text = "No known URLs found. Please try a diffrent domain!"
return text
def _get(obj, args):
if args.get.lower() == "url":
return (obj.get())
if args.get.lower() == "oldest":
return (obj.get(obj.oldest()))
if args.get.lower() == "latest" or args.get.lower() == "newest":
return (obj.get(obj.newest()))
if args.get.lower() == "save":
return (obj.get(obj.save()))
return ("Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) oldest - get the source code of the oldest archive for the supplied url.\
\n3) newest - get the source code of the newest archive for the supplied url.\
\n4) save - Create a new archive and get the source code of this new archive for the supplied url.")
def args_handler(args):
if args.version:
return ("waybackpy version %s" % __version__)
if not args.url:
return ("waybackpy %s \nSee 'waybackpy --help' for help using this tool." % __version__)
if args.user_agent:
obj = Url(args.url, args.user_agent)
else:
obj = Url(args.url)
if args.save:
return _save(obj)
if args.oldest:
return _oldest(obj)
if args.newest:
return _newest(obj)
if args.known_urls:
return _known_urls(obj, args)
if args.total:
return _total_archives(obj)
if args.near:
return _near(obj, args)
if args.get:
return _get(obj, args)
return ("You only specified the URL. But you also need to specify the operation.\nSee 'waybackpy --help' for help using this tool.")
def parse_args(argv):
parser = argparse.ArgumentParser()
requiredArgs = parser.add_argument_group('URL argument (required)')
requiredArgs.add_argument("--url", "-u", help="URL on which Wayback machine operations would occur")
userAgentArg = parser.add_argument_group('User Agent')
userAgentArg.add_argument("--user_agent", "-ua", help="User agent, default user_agent is \"waybackpy python package - https://github.com/akamhy/waybackpy\"")
saveArg = parser.add_argument_group("Create new archive/save URL")
saveArg.add_argument("--save", "-s", action='store_true', help="Save the URL on the Wayback machine")
oldestArg = parser.add_argument_group("Oldest archive")
oldestArg.add_argument("--oldest", "-o", action='store_true', help="Oldest archive for the specified URL")
newestArg = parser.add_argument_group("Newest archive")
newestArg.add_argument("--newest", "-n", action='store_true', help="Newest archive for the specified URL")
totalArg = parser.add_argument_group("Total number of archives")
totalArg.add_argument("--total", "-t", action='store_true', help="Total number of archives for the specified URL")
getArg = parser.add_argument_group("Get source code")
getArg.add_argument("--get", "-g", help="Prints the source code of the supplied url. Use '--get help' for extended usage")
knownUrlArg = parser.add_argument_group("URLs known and archived to Waybcak Machine for the site.")
knownUrlArg.add_argument("--known_urls", "-ku", action='store_true', help="URLs known for the domain.")
knownUrlArg.add_argument("--subdomain", "-sub", action='store_true', help="Use with '--known_urls' to include known URLs for subdomains.")
knownUrlArg.add_argument("--alive", "-a", action='store_true', help="Only include live URLs. Will not inlclude dead links.")
nearArg = parser.add_argument_group('Archive close to time specified')
nearArg.add_argument("--near", "-N", action='store_true', help="Archive near specified time")
nearArgs = parser.add_argument_group('Arguments that are used only with --near')
nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
nearArgs.add_argument("--hour", "-H", type=int, help="Hour in intege")
nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")
parser.add_argument("--version", "-v", action='store_true', help="Waybackpy version")
return parser.parse_args(argv[1:])
def main(argv=None):
if argv is None:
argv = sys.argv
args = parse_args(argv)
output = args_handler(args)
print(output)
if __name__ == "__main__":
sys.exit(main(sys.argv))

View File

@ -1,43 +1,6 @@
# -*- coding: utf-8 -*-
class TooManyArchivingRequests(Exception):
"""Error when a single url reqeusted for archiving too many times in a short timespam.
Wayback machine doesn't supports archivng any url too many times in a short period of time.
class WaybackError(Exception):
"""
class ArchivingNotAllowed(Exception):
"""Files like robots.txt are set to deny robot archiving.
Wayback machine respects these file, will not archive.
"""
class PageNotSaved(Exception):
"""
When unable to save a webpage.
"""
class ArchiveNotFound(Exception):
"""
When a page was never archived but client asks for old archive.
"""
class UrlNotFound(Exception):
"""
Raised when 404 UrlNotFound.
"""
class BadGateWay(Exception):
"""
Raised when 502 bad gateway.
"""
class WaybackUnavailable(Exception):
"""
Raised when 503 API Service Temporarily Unavailable.
"""
class InvalidUrl(Exception):
"""
Raised when url doesn't follow the standard url format.
Raised when API Service error.
"""

View File

@ -1,143 +1,224 @@
# -*- coding: utf-8 -*-
import re
import sys
import json
from datetime import datetime
from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl, WaybackUnavailable
try:
from waybackpy.exceptions import WaybackError
from waybackpy.__version__ import __version__
if sys.version_info >= (3, 0): # If the python ver >= 3
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError
except ImportError:
from urllib2 import Request, urlopen, HTTPError, URLError
from urllib.error import URLError
else: # For python2.x
from urllib2 import Request, urlopen, URLError
default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"
default_UA = "waybackpy python package"
def url_check(url):
if "." not in url:
raise InvalidUrl("'%s' is not a vaild url." % url)
def clean_url(url):
return str(url).strip().replace(" ","_")
def wayback_timestamp(**kwargs):
return (
str(kwargs["year"])
+
str(kwargs["month"]).zfill(2)
+
str(kwargs["day"]).zfill(2)
+
str(kwargs["hour"]).zfill(2)
+
str(kwargs["minute"]).zfill(2)
)
def handle_HTTPError(e):
if e.code == 502:
raise BadGateWay(e)
elif e.code == 503:
raise WaybackUnavailable(e)
elif e.code == 429:
raise TooManyArchivingRequests(e)
elif e.code == 404:
raise UrlNotFound(e)
def save(url, UA=default_UA):
url_check(url)
request_url = ("https://web.archive.org/save/" + clean_url(url))
hdr = { 'User-Agent' : '%s' % UA } #nosec
req = Request(request_url, headers=hdr) #nosec
def _archive_url_parser(header):
"""Parse out the archive from header."""
# Regex1
arch = re.search(
r"Content-Location: (/web/[0-9]{14}/.*)", str(header)
)
if arch:
return "web.archive.org" + arch.group(1)
# Regex2
arch = re.search(
r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
)
if arch:
return arch.group(1)
# Regex3
arch = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
if arch:
return arch.group(1)
raise WaybackError(
"No archive URL found in the API response. "
"This version of waybackpy (%s) is likely out of date. Visit "
"https://github.com/akamhy/waybackpy for the latest version "
"of waybackpy.\nHeader:\n%s" % (__version__, str(header))
)
def _wayback_timestamp(**kwargs):
"""Return a formatted timestamp."""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
def _get_response(req):
"""Get response for the supplied request."""
try:
response = urlopen(req) #nosec
except HTTPError as e:
if handle_HTTPError(e) is None:
raise PageNotSaved(e)
except URLError:
response = urlopen(req) # nosec
except Exception:
try:
response = urlopen(req) #nosec
except URLError as e:
raise UrlNotFound(e)
response = urlopen(req) # nosec
except Exception as e:
exc = WaybackError("Error while retrieving %s" % req.full_url)
exc.__cause__ = e
raise exc
return response
header = response.headers
class Url:
"""waybackpy Url object"""
if "exclusion.robots.policy" in str(header):
raise ArchivingNotAllowed("Can not archive %s. Disabled by site owner." % (url))
def __init__(self, url, user_agent=default_UA):
self.url = url
self.user_agent = user_agent
self._url_check() # checks url validity on init.
return "https://web.archive.org" + header['Content-Location']
def __repr__(self):
return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)
def get(url, encoding=None, UA=default_UA):
url_check(url)
hdr = { 'User-Agent' : '%s' % UA }
req = Request(clean_url(url), headers=hdr) #nosec
def __str__(self):
return "%s" % self._clean_url()
try:
resp=urlopen(req) #nosec
except URLError:
try:
resp=urlopen(req) #nosec
except URLError as e:
raise UrlNotFound(e)
def __len__(self):
return len(self._clean_url())
if encoding is None:
try:
encoding= resp.headers['content-type'].split('charset=')[-1]
except AttributeError:
encoding = "UTF-8"
def _url_check(self):
"""Check for common URL problems."""
if "." not in self.url:
raise URLError("'%s' is not a vaild URL." % self.url)
return resp.read().decode(encoding.replace("text/html", "UTF-8", 1))
def _clean_url(self):
"""Fix the URL, if possible."""
return str(self.url).strip().replace(" ", "_")
def near(url, **kwargs):
def save(self):
"""Create a new Wayback Machine archive for this URL."""
request_url = "https://web.archive.org/save/" + self._clean_url()
hdr = {"User-Agent": "%s" % self.user_agent} # nosec
req = Request(request_url, headers=hdr) # nosec
header = _get_response(req).headers
return "https://" + _archive_url_parser(header)
try:
url = kwargs["url"]
except KeyError:
url = url
def get(self, url="", user_agent="", encoding=""):
"""Return the source code of the supplied URL.
If encoding is not supplied, it is auto-detected from the response.
"""
if not url:
url = self._clean_url()
year=kwargs.get("year", datetime.utcnow().strftime('%Y'))
month=kwargs.get("month", datetime.utcnow().strftime('%m'))
day=kwargs.get("day", datetime.utcnow().strftime('%d'))
hour=kwargs.get("hour", datetime.utcnow().strftime('%H'))
minute=kwargs.get("minute", datetime.utcnow().strftime('%M'))
UA=kwargs.get("UA", default_UA)
if not user_agent:
user_agent = self.user_agent
url_check(url)
timestamp = wayback_timestamp(year=year,month=month,day=day,hour=hour,minute=minute)
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (clean_url(url), str(timestamp))
hdr = { 'User-Agent' : '%s' % UA }
req = Request(request_url, headers=hdr) # nosec
hdr = {"User-Agent": "%s" % user_agent}
req = Request(url, headers=hdr) # nosec
response = _get_response(req)
if not encoding:
try:
encoding = response.headers["content-type"].split("charset=")[-1]
except AttributeError:
encoding = "UTF-8"
return response.read().decode(encoding.replace("text/html", "UTF-8", 1))
try:
response = urlopen(req) #nosec
except HTTPError as e:
handle_HTTPError(e)
def near(self, year=None, month=None, day=None, hour=None, minute=None):
""" Return the closest Wayback Machine archive to the time supplied.
Supported params are year, month, day, hour and minute.
Any non-supplied parameters default to the current time.
data = json.loads(response.read().decode("UTF-8"))
if not data["archived_snapshots"]:
raise ArchiveNotFound("'%s' is not yet archived." % url)
"""
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
)
archive_url = (data["archived_snapshots"]["closest"]["url"])
# wayback machine returns http sometimes, idk why? But they support https
archive_url = archive_url.replace("http://web.archive.org/web/","https://web.archive.org/web/",1)
return archive_url
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (
self._clean_url(),
timestamp,
)
hdr = {"User-Agent": "%s" % self.user_agent}
req = Request(request_url, headers=hdr) # nosec
response = _get_response(req)
data = json.loads(response.read().decode("UTF-8"))
if not data["archived_snapshots"]:
raise WaybackError(
"Can not find archive for '%s' try later or use wayback.Url(url, user_agent).save() "
"to create a new archive." % self._clean_url()
)
archive_url = data["archived_snapshots"]["closest"]["url"]
# wayback machine returns http sometimes, idk why? But they support https
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
return archive_url
def oldest(url, UA=default_UA, year=1994):
return near(url, year=year, UA=UA)
def oldest(self, year=1994):
"""Return the oldest Wayback Machine archive for this URL."""
return self.near(year=year)
def newest(url, UA=default_UA):
return near(url, UA=UA)
def newest(self):
"""Return the newest Wayback Machine archive available for this URL.
def total_archives(url, UA=default_UA):
url_check(url)
Due to Wayback Machine database lag, this may not always be the
most recent archive.
"""
return self.near()
hdr = { 'User-Agent' : '%s' % UA }
request_url = "https://web.archive.org/cdx/search/cdx?url=%s&output=json" % clean_url(url)
req = Request(request_url, headers=hdr) # nosec
def total_archives(self):
"""Returns the total number of Wayback Machine archives for this URL."""
hdr = {"User-Agent": "%s" % self.user_agent}
request_url = (
"https://web.archive.org/cdx/search/cdx?url=%s&output=json&fl=statuscode"
% self._clean_url()
)
req = Request(request_url, headers=hdr) # nosec
response = _get_response(req)
# Most efficient method to count number of archives (yet)
return str(response.read()).count(",")
try:
response = urlopen(req) #nosec
except HTTPError as e:
handle_HTTPError(e)
def known_urls(self, alive=False, subdomain=False):
"""Returns list of URLs known to exist for given domain name
because these URLs were crawled by WayBack Machine bots.
return (len(json.loads(response.read())))
Useful for pen-testers and others.
Idea by Mohammed Diaa (https://github.com/mhmdiaa) from:
https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
"""
url_list = []
if subdomain:
request_url = (
"https://web.archive.org/cdx/search/cdx?url=*.%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
else:
request_url = (
"http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
hdr = {"User-Agent": "%s" % self.user_agent}
req = Request(request_url, headers=hdr) # nosec
response = _get_response(req)
data = json.loads(response.read().decode("UTF-8"))
url_list = [y[0] for y in data if y[0] != "original"]
#Remove all deadURLs from url_list if alive=True
if alive:
tmp_url_list = []
for url in url_list:
try:
urlopen(url)
except:
continue
tmp_url_list.append(url)
url_list = tmp_url_list
return url_list