Compare commits

...

144 Commits
v1.4 ... 2.1.6

Author SHA1 Message Date
36d662b961 Update __version__.py 2020-07-24 16:24:57 +05:30
2835f8877e Update setup.py 2020-07-24 16:24:38 +05:30
18cbd2fd30 Update cli.py 2020-07-24 16:10:29 +05:30
a2812fb56f patch for cli 2020-07-24 16:09:47 +05:30
77effcf649 Update setup.py 2020-07-24 15:34:14 +05:30
7272ef45a0 Update __version__.py 2020-07-24 15:33:58 +05:30
56116551ac Coverge improvements (#22)
* Update cli.py

* improved tests

* chnages for proper testing

* Type check using isinstance

* Replace elifs with if when used after return

* twitter.com --> www.ibm.com

* fix typo

* test archive urll parser and dunders

* Update test_wrapper.py
2020-07-24 15:31:21 +05:30
4dcda94cb0 v2.1.4 2020-07-24 01:03:44 +05:30
09f59b0182 v2.1.4 2020-07-24 01:03:04 +05:30
ed24184b99 Remove duplicate get response method 2020-07-24 00:57:22 +05:30
56bef064b1 only test save on >3.7 2020-07-23 20:51:46 +05:30
44bb2cf5e4 some cli tests 2020-07-23 20:44:14 +05:30
e231228721 Update README.md (#21)
* Update README.md

* example bash oldest newest

* total archives bash example

* near bash example

* format the list

* ce

* get bash example

* pip git install example

* Update index.rst

* + argparse

* + argparse
2020-07-22 21:35:02 +05:30
b8b2d6dfa9 v2.1.3 2020-07-22 20:21:37 +05:30
3eca6294df v2.1.3 2020-07-22 20:20:44 +05:30
eb037a0284 Rename test_1.py to test_wrapper.py 2020-07-22 20:19:59 +05:30
a01821f20b Update .travis.yml 2020-07-22 17:33:59 +05:30
b21036f8df Update .travis.yml 2020-07-22 17:31:57 +05:30
b43bacb7ac fix error language 2020-07-22 17:25:15 +05:30
f7313b255a Update cli.py 2020-07-22 17:22:38 +05:30
7457e1c793 - print(repr(obj)) 2020-07-22 17:18:27 +05:30
f7493d823f Update cli.py 2020-07-22 17:16:53 +05:30
7fa7b59ce3 if version don't try to create object 2020-07-22 17:15:28 +05:30
78a608db50 Update cli.py 2020-07-22 17:12:44 +05:30
93f7dfdaf9 resolve args conflict 2020-07-22 17:09:32 +05:30
83c6f256c9 version arg 2020-07-22 17:03:56 +05:30
dee9105794 command_line support (#18)
* Update wrapper.py

* entry points cli

* Suppress the urllib2/3 Exception

* rm cli code, will create a new cli.py file

* Create cli.py

* update cli entry pts

* Update cli.py

* Update cli.py

* import print_function

* Update cli.py

* Update cli.py

* Delete pypi_uploader.sh

* resolve conflicts with the master

* update the test ; resolve the conflicts

* decrease code complexity

* cli method changed to main

* get is not for just local usage

* get method should be available from interface

* get is used in the interface

* Update cli.py
2020-07-22 16:40:13 +05:30
3bfc3b46d0 Delete SECURITY.md 2020-07-22 11:07:59 +05:30
553f150bee replace youtube with twitter.com
for some reason Wayback API is returing diffrent youtube URL now.
2020-07-22 11:07:23 +05:30
b3a7e714a5 Update wrapper.py 2020-07-22 10:57:43 +05:30
cd9841713c Update wrapper.py 2020-07-22 10:52:43 +05:30
1ea9548d46 Raise WaybackError from URLError and include URL (#19)
* Raise WaybackError from URLError and include URL

* python2 compatibility

Co-authored-by: Akash <64683866+akamhy@users.noreply.github.com>
2020-07-22 10:51:44 +05:30
be7642c837 Code style improvements (#20)
* Add sane line length to setup.cfg

* Use Black for quick readability improvements

* Clean up exceptions, docstrings, and comments

Docstrings on dunder functions are redundant and typically ignored
Limit to reasonable line length
General grammar and style corrections
Clarify docstrings and exceptions
Format docstrings per PEP 257 -- Docstring Conventions

* Move archive_url_parser out of Url.save()

It's generally poor form to define a function in a function, as it will
be re-defined each time the function is run.

archive_url_parser does not depend on anything in Url, so it makes sense
to move it out of the class.

* move wayback_timestamp out of class, mark private functions

* DRY in _wayback_timestamp

* Url._url_check should return None

There's no point in returning True if it's never checked and won't ever
be False.
Implicitly returning None or raising an exception is more idiomatic.

* Default parameters should be type-consistant with expected values

* Specify parameters to near

* Use datetime.datetime in _wayback_timestamp

* cleanup __init__.py

* Cleanup formatting in tests

* Fix names in tests

* Revert "Use datetime.datetime in _wayback_timestamp"

This reverts commit 5b30380865.

Introduced unnecessary complexity

* Move _get_response outside of Url

Because Codacy reminded me that I missed it.

* fix imports in tests
2020-07-22 10:09:14 +05:30
a418a4e464 Update SECURITY.md 2020-07-21 10:38:41 +05:30
aec035ef1e Create SECURITY.md 2020-07-21 08:41:07 +05:30
6d37993ab9 moved to manuals 2020-07-21 08:14:40 +05:30
72b80ca44e Create pypi_uploader.sh 2020-07-21 08:14:21 +05:30
c10aa9279c Create python-publish.yml 2020-07-21 08:08:55 +05:30
68d809a7d6 Update test_1.py 2020-07-20 23:45:49 +05:30
4ad09a419b Fix bash syntax 2020-07-20 23:44:23 +05:30
ddc6620f09 Only report coverage if python 3.8 or greater 2020-07-20 23:29:54 +05:30
4066a65678 Update .travis.yml 2020-07-20 23:20:43 +05:30
8e46a9ba7a Update .travis.yml 2020-07-20 23:16:33 +05:30
a5a98b9b00 Update .travis.yml 2020-07-20 23:11:56 +05:30
a721ab7d6c Update .travis.yml 2020-07-20 23:10:06 +05:30
7db27ae5e1 Create pypi_uploader.sh 2020-07-20 22:28:35 +05:30
8fd4462025 Update wrapper.py 2020-07-20 20:17:18 +05:30
c458a15820 Update .travis.yml 2020-07-20 15:38:33 +05:30
bae3412bee Update .travis.yml 2020-07-20 15:24:26 +05:30
94cb08bb37 Update setup.py 2020-07-20 10:41:00 +05:30
af888db13e 2.1.2 2020-07-20 10:40:37 +05:30
d24f2408ee Update test_1.py 2020-07-20 10:31:47 +05:30
ddd2274015 Update test_1.py 2020-07-20 10:21:15 +05:30
99abdb7c67 Update test_1.py 2020-07-20 10:16:39 +05:30
f3bb9a8540 Update wrapper.py 2020-07-20 10:11:36 +05:30
bb94e0d1c5 Update index.rst and remove dupes 2020-07-20 10:07:31 +05:30
1a78d88be2 2.1.1 2020-07-19 23:17:01 +05:30
3ec61758b3 Update __version__.py 2020-07-19 23:16:13 +05:30
83c962166d Raise 2020-07-19 23:02:04 +05:30
e87dee3bdf Waybackpy example on replit (#15)
* Waybackpy save example on replit

* Oldest example

* Newest method replit link

* Near method example

* Get example

* Total archive method example
2020-07-19 22:28:08 +05:30
b27bfff15a v2.1.0 2020-07-19 21:08:01 +05:30
970fc1cd08 Update __version__.py 2020-07-19 21:06:54 +05:30
65391bf14b update 2020-07-19 21:04:32 +05:30
8ab116f276 API chnaged again. updated
* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* Update wrapper.py

* api changed; fix archive url parser

* Update wrapper.py

* - Trailing whitespace

* include the header in exception
2020-07-19 20:39:07 +05:30
6f82041ec9 Update README.md (#13)
* Update README.md

* Update README.md

* replit demo for waybackpy.Url.save()

* Update README.md

* Update README.md

* replit demo for oldest()

* replit demo for newest()

* Update README.md

* replit demo for total_archives

* demo at replit for get()

* demo for near

* Update README.md

* Update README.md

* Update README.md
2020-07-19 16:39:39 +05:30
11059c960e Update setup.py 2020-07-18 19:27:04 +05:30
eee1b8eba1 Update __version__.py 2020-07-18 19:26:41 +05:30
f7de8f5575 sleeps to prevent too many requests in a timeframe 2020-07-18 19:25:19 +05:30
3fa0c32064 V2.0.1 link 2020-07-18 19:09:18 +05:30
aa1e3b8825 V2.0.1 2020-07-18 19:08:39 +05:30
58d2d585c8 No timeout for final try 2020-07-18 18:29:41 +05:30
e8efed2e2f Update test_1.py 2020-07-18 17:24:54 +05:30
49089b7321 2.0.0 link 2020-07-18 17:09:07 +05:30
55d8687566 Update test_1.py 2020-07-18 16:58:23 +05:30
0fa28527af Update index.rst 2020-07-18 16:54:07 +05:30
68259fd2d9 Update index.rst 2020-07-18 16:53:27 +05:30
e7086a89d3 Update index.rst 2020-07-18 16:52:37 +05:30
e39467227c Update index.rst 2020-07-18 16:51:47 +05:30
ba840404cf Update index.rst 2020-07-18 16:50:37 +05:30
8fbd2d9e55 Update index.rst 2020-07-18 16:49:03 +05:30
eebf6043de Update index.rst 2020-07-18 16:48:29 +05:30
3d3b09d6d8 Update README.md 2020-07-18 16:46:40 +05:30
ef15b5863c Update index.rst 2020-07-18 16:44:32 +05:30
256c0cdb6b update test - save 2020-07-18 16:39:35 +05:30
12c72a8294 fix link 2020-07-18 16:30:20 +05:30
0ad27f5ecc update readme for newer oop and some test changes (#12)
* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* docstrings

* user agent ; more variants

* description update

* Update __init__.py

* # -*- coding: utf-8 -*-

* Update test_1.py

* update docs for get()

* Update README.md
2020-07-18 16:22:09 +05:30
700b60b5f8 Update README.md 2020-07-18 08:16:59 +05:30
11032596c8 Update README.md 2020-07-18 08:15:43 +05:30
9727f92168 Update README.md 2020-07-18 08:12:33 +05:30
d2893fec13 Delete CONTRIBUTING.md 2020-07-18 08:12:00 +05:30
f1353b2129 Update CONTRIBUTING.md 2020-07-18 00:58:50 +05:30
c76a95ef90 Create CONTRIBUTING.md (#11) 2020-07-18 00:57:48 +05:30
62d88359ce Update README.md 2020-07-18 00:40:21 +05:30
9942c474c9 Update README.md 2020-07-18 00:35:12 +05:30
dfb736e794 Size 2020-07-18 00:32:00 +05:30
84d1766917 Update README.md 2020-07-18 00:20:58 +05:30
9d3cdfafb3 Update README.md 2020-07-18 00:20:17 +05:30
20a16bfa45 Version 2.0.0 on it's way for release (tommorow) 2020-07-18 00:09:28 +05:30
f2112c73f6 Python 2 support 2020-07-17 21:08:32 +05:30
9860527d96 OOP (#10)
* Update wrapper.py

* Update exceptions.py

* Update __init__.py

* test adjusted for new changes

* Update wrapper.py
2020-07-17 20:50:00 +05:30
9ac1e877c8 Update README.md 2020-07-16 20:39:12 +05:30
f881705d00 detecet python version whith sys.version_info (#9) 2020-06-26 15:48:01 +05:30
f015c3f4f3 test on the worst case possible 2020-05-08 09:56:01 +05:30
42ac399362 Most efficient method to count (yet) 2020-05-08 09:47:13 +05:30
e9d010c793 just count the status code, consumes less memory 2020-05-08 09:28:18 +05:30
58a6409528 v1.6 2020-05-07 20:14:59 +05:30
7ca2029158 Update setup.py 2020-05-07 20:14:40 +05:30
80331833f2 Update setup.py 2020-05-07 20:12:32 +05:30
5e3d3a815f fix 2020-05-07 20:03:17 +05:30
6182a18cf4 fix 2020-05-07 20:02:47 +05:30
9bca750310 v1.5 2020-05-07 19:59:23 +05:30
c22749a6a3 update 2020-05-07 19:54:00 +05:30
151df94fe3 license_file = LICENSE 2020-05-07 19:38:19 +05:30
24540d0b2c update 2020-05-07 19:33:39 +05:30
bdfc72d05d Create __version__.py 2020-05-07 19:16:26 +05:30
3b104c1a28 v1.5 2020-05-07 19:03:02 +05:30
fb0d4658a7 ce 2020-05-07 19:02:12 +05:30
48833980e1 update 2020-05-07 18:58:01 +05:30
0c4f119981 Update wrapper.py 2020-05-07 17:25:34 +05:30
afded51a04 Update wrapper.py 2020-05-07 17:20:23 +05:30
b950616561 Update wrapper.py 2020-05-07 17:17:17 +05:30
444675538f fix code Complexity (#8)
* fix code Complexity

* Update wrapper.py

* codefactor badge
2020-05-07 16:51:08 +05:30
0ca6710334 Update wrapper.py 2020-05-07 16:24:33 +05:30
01a7c591ad retry 2020-05-07 15:46:39 +05:30
74d3bc154b fix issue with py2.7 2020-05-07 15:34:41 +05:30
a8e94dfb25 Update README.md 2020-05-07 15:14:55 +05:30
cc38798b32 Update README.md 2020-05-07 15:14:30 +05:30
bc3dd44f27 Update README.md 2020-05-07 15:13:58 +05:30
ba46cdafe2 Update README.md 2020-05-07 15:12:37 +05:30
538afb14e9 Update test_1.py 2020-05-07 15:06:52 +05:30
7605b614ee test for total_archives() 2020-05-07 15:00:28 +05:30
d0a4e25cf5 Update __init__.py 2020-05-07 14:53:09 +05:30
8c5c0153da + total_archives() 2020-05-07 14:52:05 +05:30
e7dac74906 Update __init__.py 2020-05-07 09:06:49 +05:30
c686708c9e more testing 2020-05-07 08:59:09 +05:30
f9ae8ada70 Update test_1.py 2020-05-07 08:39:24 +05:30
e56ece3dc9 Update README.md 2020-05-07 08:23:31 +05:30
db127a5c54 always return https 2020-05-06 20:16:25 +05:30
ed497bbd23 Update wrapper.py 2020-05-06 20:07:25 +05:30
45fe07ddb6 Update wrapper.py 2020-05-06 19:35:01 +05:30
0029d63d8a 503 API Service Temporarily Unavailable 2020-05-06 19:22:56 +05:30
beb5b625ec Set theme jekyll-theme-cayman 2020-05-06 12:20:43 +05:30
b40d734346 Update README.md 2020-05-06 09:18:02 +05:30
be0a30de85 Create index.rst 2020-05-05 20:22:46 +05:30
15 changed files with 1284 additions and 278 deletions

31
.github/workflows/python-publish.yml vendored Normal file
View File

@ -0,0 +1,31 @@
# This workflows will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
name: Upload Python Package
on:
release:
types: [created]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: '3.x'
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools wheel twine
- name: Build and publish
env:
TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
run: |
python setup.py sdist bdist_wheel
twine upload dist/*

View File

@ -1,14 +1,19 @@
language: python
python:
- "2.7"
- "3.6"
- "3.8"
os: linux
dist: xenial
cache: pip
install:
- pip install pytest
before_script:
cd tests
python:
- 2.7
- 3.6
- 3.8
before_install:
- python --version
- pip install -U pip
- pip install -U pytest
- pip install codecov
- pip install pytest pytest-cov
script:
- pytest test_1.py
- cd tests
- pytest --cov=../waybackpy
after_success:
- if [[ $TRAVIS_PYTHON_VERSION == 3.8 ]]; then python -m codecov; fi

314
README.md
View File

@ -1,154 +1,288 @@
# waybackpy
[![Build Status](https://travis-ci.org/akamhy/waybackpy.svg?branch=master)](https://travis-ci.org/akamhy/waybackpy)
[![Build Status](https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square)](https://travis-ci.org/akamhy/waybackpy)
[![Downloads](https://img.shields.io/pypi/dm/waybackpy.svg)](https://pypistats.org/packages/waybackpy)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![Maintainability](https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability)](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[![CodeFactor](https://www.codefactor.io/repository/github/akamhy/waybackpy/badge)](https://www.codefactor.io/repository/github/akamhy/waybackpy)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
![pypi](https://img.shields.io/pypi/v/wayback.svg)
![pypi](https://img.shields.io/pypi/v/waybackpy.svg)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
[![codecov](https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg)](https://codecov.io/gh/akamhy/waybackpy)
![](https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square)
![contributions welcome](https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square)
![Internet Archive](https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png)
![Wayback Machine](https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png)
The waybackpy is a python wrapper for [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine).
Waybackpy is a Python library that interfaces with the [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine) API. Archive pages and retrieve archived pages easily.
Table of contents
=================
<!--ts-->
* [Installation](https://github.com/akamhy/waybackpy#installation)
* [Installation](#installation)
* [Usage](https://github.com/akamhy/waybackpy#usage)
* [Saving an url using save()](https://github.com/akamhy/waybackpy#capturing-aka-saving-an-url-using-save)
* [Receiving the oldest archive for an URL Using oldest()](https://github.com/akamhy/waybackpy#receiving-the-oldest-archive-for-an-url-using-oldest)
* [Receiving the recent most/newest archive for an URL using newest()](https://github.com/akamhy/waybackpy#receiving-the-newest-archive-for-an-url-using-newest)
* [Receiving archive close to a specified year, month, day, hour, and minute using near()](https://github.com/akamhy/waybackpy#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage using get()](https://github.com/akamhy/waybackpy#get-the-content-of-webpage-using-get)
* [Usage](#usage)
* [As a python package](#as-a-python-package)
* [Saving an url using save()](#capturing-aka-saving-an-url-using-save)
* [Receiving the oldest archive for an URL Using oldest()](#receiving-the-oldest-archive-for-an-url-using-oldest)
* [Receiving the recent most/newest archive for an URL using newest()](#receiving-the-newest-archive-for-an-url-using-newest)
* [Receiving archive close to a specified year, month, day, hour, and minute using near()](#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage using get()](#get-the-content-of-webpage-using-get)
* [Count total archives for an URL using total_archives()](#count-total-archives-for-an-url-using-total_archives)
* [With CLI](#with-the-cli)
* [Save](#save)
* [Oldest archive](#oldest-archive)
* [Newest archive](#newest-archive)
* [Total archives](#total-number-of-archives)
* [Archive near a time](#archive-near-time)
* [Get the source code](#get-the-source-code)
* [Tests](https://github.com/akamhy/waybackpy#tests)
* [Tests](#tests)
* [Dependency](https://github.com/akamhy/waybackpy#dependency)
* [Dependency](#dependency)
* [License](https://github.com/akamhy/waybackpy#license)
* [License](#license)
<!--te-->
## Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
**pip install waybackpy**
```bash
pip install waybackpy
```
or direct from this repository using git.
```bash
pip install git+https://github.com/akamhy/waybackpy.git
```
## Usage
#### Capturing aka Saving an url Using save()
### As a python package
```diff
+ waybackpy.save(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
#### Capturing aka Saving an url using save()
```python
import waybackpy
# Capturing a new archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
print(archived_url)
new_archive_url = waybackpy.Url(
url = "https://en.wikipedia.org/wiki/Multivariable_calculus",
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
).save()
print(new_archive_url)
```
This should print something similar to the following archived URL:
<https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy>
#### Receiving the oldest archive for an URL Using oldest()
```diff
+ waybackpy.oldest(url, UA=user_agent)
```bash
https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy
```
> url is mandatory. UA is not, but highly recommended.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPySaveExample></sub>
#### Receiving the oldest archive for an URL using oldest()
```python
import waybackpy
# retrieving the oldest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
print(oldest_archive)
```
This returns the oldest available archive for <https://google.com>.
<http://web.archive.org/web/19981111184551/http://google.com:80/>
oldest_archive_url = waybackpy.Url(
"https://www.google.com/",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
).oldest()
print(oldest_archive_url)
```
```bash
http://web.archive.org/web/19981111184551/http://google.com:80/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyOldestExample></sub>
#### Receiving the newest archive for an URL using newest()
```diff
+ waybackpy.newest(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
```python
import waybackpy
# retrieving the newest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
print(newest_archive)
```
This returns the newest available archive for <https://www.microsoft.com/en-us>, something just like this:
<http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/>
newest_archive_url = waybackpy.Url(
"https://www.facebook.com/",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
).newest()
print(newest_archive_url)
```
```bash
https://web.archive.org/web/20200714013225/https://www.facebook.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNewestExample></sub>
#### Receiving archive close to a specified year, month, day, hour, and minute using near()
```diff
+ waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)
```
> url is mandotory. year,month,day,hour and minute are optional arguments. UA is not mandotory, but higly recomended.
```python
import waybackpy
# retriving the the closest archive from a specified year.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are year,month,day,hour and minute
archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
print(archive_near_year)
from waybackpy import Url
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
github_url = "https://github.com/"
github_wayback_obj = Url(github_url, user_agent)
# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
```
returns : <http://web.archive.org/web/20100504071154/http://www.facebook.com/>
```python
github_archive_near_2010 = github_wayback_obj.near(year=2010)
print(github_archive_near_2010)
```
```bash
https://web.archive.org/web/20100719134402/http://github.com/
```
```python
github_archive_near_2011_may = github_wayback_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)
```
```bash
https://web.archive.org/web/20110519185447/https://github.com/
```
```python
github_archive_near_2015_january_26 = github_wayback_obj.near(
year=2015, month=1, day=26
)
print(github_archive_near_2015_january_26)
```
```bash
https://web.archive.org/web/20150127031159/https://github.com
```
```python
github_archive_near_2018_4_july_9_2_am = github_wayback_obj.near(
year=2018, month=7, day=4, hour = 9, minute = 2
)
print(github_archive_near_2018_4_july_9_2_am)
```
```bash
https://web.archive.org/web/20180704090245/https://github.com/
```waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20101111173430/http://www.facebook.com//>
```
<sub>The library doesn't supports seconds yet. You are encourged to create a PR ;)</sub>
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNearExample></sub>
```waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html>
> Please note that if you only specify the year, the current month and day are default arguments for month and day respectively. Do not expect just putting the year parameter would return the archive closer to January but the current month you are using the package. If you are using it in July 2018 and let's say you use ```waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``` then you would be returned the nearest archive to July 2011 and not January 2011. You need to specify the month "1" for January.
> Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
#### Get the content of webpage using get()
```diff
+ waybackpy.get(url, encoding="UTF-8", UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended. encoding is detected automatically, don't specify unless necessary.
```python
from waybackpy import get
# retriving the webpage from any url including the archived urls. Don't need to import other libraies :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported argumnets are url, encoding and UA
webpage = get("https://example.com/", UA="User-Agent")
print(webpage)
import waybackpy
google_url = "https://www.google.com/"
User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
waybackpy_url_object = waybackpy.Url(google_url, User_Agent)
# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)
# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(
waybackpy_url_object.save()
)
print(google_newest_archive_source)
# waybackpy_url_object.oldest() type is str, it's oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(
waybackpy_url_object.oldest()
)
print(google_oldest_archive_source)
```
> This should print the source code for <https://example.com/>.
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyGetExample#main.py></sub>
#### Count total archives for an URL using total_archives()
```python
import waybackpy
URL = "https://en.wikipedia.org/wiki/Python (programming language)"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
archive_count = waybackpy.Url(
url=URL,
user_agent=UA
).total_archives()
print(archive_count) # total_archives() returns an int
```
```bash
2440
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyTotalArchivesExample></sub>
### With the CLI
#### Save
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashSave></sub>
#### Oldest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashOldest></sub>
#### Newest archive
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNewest></sub>
#### Total number of archives
```bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
853
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashTotal></sub>
#### Archive near time
```bash
$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
https://web.archive.org/web/20120512142515/https://www.facebook.com/
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNear></sub>
#### Get the source code
```bash
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashGet></sub>
## Tests
* [Here](https://github.com/akamhy/waybackpy/tree/master/tests)
## Dependency
* None, just python standard libraries (json, urllib and datetime). Both python 2 and 3 are supported :)
* None, just python standard libraries (re, json, urllib, argparse and datetime). Both python 2 and 3 are supported :)
## License
[MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE)

1
_config.yml Normal file
View File

@ -0,0 +1 @@
theme: jekyll-theme-cayman

385
index.rst Normal file
View File

@ -0,0 +1,385 @@
waybackpy
=========
|Build Status| |Downloads| |Release| |Codacy Badge| |License: MIT|
|Maintainability| |CodeFactor| |made-with-python| |pypi| |PyPI - Python
Version| |Maintenance| |codecov| |image12| |contributions welcome|
|Internet Archive| |Wayback Machine|
Waybackpy is a Python library that interfaces with the `Internet
Archive <https://en.wikipedia.org/wiki/Internet_Archive>`__'s `Wayback
Machine <https://en.wikipedia.org/wiki/Wayback_Machine>`__ API. Archive
pages and retrieve archived pages easily.
Table of contents
=================
.. raw:: html
<!--ts-->
- `Installation <#installation>`__
- `Usage <#usage>`__
- `As a python package <#as-a-python-package>`__
- `Saving an url using
save() <#capturing-aka-saving-an-url-using-save>`__
- `Receiving the oldest archive for an URL Using
oldest() <#receiving-the-oldest-archive-for-an-url-using-oldest>`__
- `Receiving the recent most/newest archive for an URL using
newest() <#receiving-the-newest-archive-for-an-url-using-newest>`__
- `Receiving archive close to a specified year, month, day, hour,
and minute using
near() <#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near>`__
- `Get the content of webpage using
get() <#get-the-content-of-webpage-using-get>`__
- `Count total archives for an URL using
total\_archives() <#count-total-archives-for-an-url-using-total_archives>`__
- `With CLI <#with-the-cli>`__
- `Save <#save>`__
- `Oldest archive <#oldest-archive>`__
- `Newest archive <#newest-archive>`__
- `Total archives <#total-number-of-archives>`__
- `Archive near a time <#archive-near-time>`__
- `Get the source code <#get-the-source-code>`__
- `Tests <#tests>`__
- `Dependency <#dependency>`__
- `License <#license>`__
.. raw:: html
<!--te-->
Installation
------------
Using `pip <https://en.wikipedia.org/wiki/Pip_(package_manager)>`__:
.. code:: bash
pip install waybackpy
or direct from this repository using git.
.. code:: bash
pip install git+https://github.com/akamhy/waybackpy.git
Usage
-----
As a python package
~~~~~~~~~~~~~~~~~~~
Capturing aka Saving an url using save()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
new_archive_url = waybackpy.Url(
url = "https://en.wikipedia.org/wiki/Multivariable_calculus",
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
).save()
print(new_archive_url)
.. code:: bash
https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy
Try this out in your browser @
https://repl.it/@akamhy/WaybackPySaveExample\
Receiving the oldest archive for an URL using oldest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
oldest_archive_url = waybackpy.Url(
"https://www.google.com/",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
).oldest()
print(oldest_archive_url)
.. code:: bash
http://web.archive.org/web/19981111184551/http://google.com:80/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyOldestExample\
Receiving the newest archive for an URL using newest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
newest_archive_url = waybackpy.Url(
"https://www.facebook.com/",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
).newest()
print(newest_archive_url)
.. code:: bash
https://web.archive.org/web/20200714013225/https://www.facebook.com/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyNewestExample\
Receiving archive close to a specified year, month, day, hour, and minute using near()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
from waybackpy import Url
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
github_url = "https://github.com/"
github_wayback_obj = Url(github_url, user_agent)
# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
.. code:: python
github_archive_near_2010 = github_wayback_obj.near(year=2010)
print(github_archive_near_2010)
.. code:: bash
https://web.archive.org/web/20100719134402/http://github.com/
.. code:: python
github_archive_near_2011_may = github_wayback_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)
.. code:: bash
https://web.archive.org/web/20110519185447/https://github.com/
.. code:: python
github_archive_near_2015_january_26 = github_wayback_obj.near(
year=2015, month=1, day=26
)
print(github_archive_near_2015_january_26)
.. code:: bash
https://web.archive.org/web/20150127031159/https://github.com
.. code:: python
github_archive_near_2018_4_july_9_2_am = github_wayback_obj.near(
year=2018, month=7, day=4, hour = 9, minute = 2
)
print(github_archive_near_2018_4_july_9_2_am)
.. code:: bash
https://web.archive.org/web/20180704090245/https://github.com/
The library doesn't supports seconds yet. You are encourged to create a
PR ;)
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyNearExample\
Get the content of webpage using get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
google_url = "https://www.google.com/"
User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
waybackpy_url_object = waybackpy.Url(google_url, User_Agent)
# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)
# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(
waybackpy_url_object.save()
)
print(google_newest_archive_source)
# waybackpy_url_object.oldest() type is str, it's oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(
waybackpy_url_object.oldest()
)
print(google_oldest_archive_source)
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyGetExample#main.py\
Count total archives for an URL using total\_archives()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
URL = "https://en.wikipedia.org/wiki/Python (programming language)"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
archive_count = waybackpy.Url(
url=URL,
user_agent=UA
).total_archives()
print(archive_count) # total_archives() returns an int
.. code:: bash
2440
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyTotalArchivesExample\
With the CLI
~~~~~~~~~~~~
Save
^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashSave\
Oldest archive
^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashOldest\
Newest archive
^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashNewest\
Total number of archives
^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
853
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashTotal\
Archive near time
^^^^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
https://web.archive.org/web/20120512142515/https://www.facebook.com/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashNear\
Get the source code
^^^^^^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashGet\
Tests
-----
- `Here <https://github.com/akamhy/waybackpy/tree/master/tests>`__
Dependency
----------
- None, just python standard libraries (re, json, urllib, argparse and datetime).
Both python 2 and 3 are supported :)
License
-------
`MIT
License <https://github.com/akamhy/waybackpy/blob/master/LICENSE>`__
.. |Build Status| image:: https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square
:target: https://travis-ci.org/akamhy/waybackpy
.. |Downloads| image:: https://img.shields.io/pypi/dm/waybackpy.svg
:target: https://pypistats.org/packages/waybackpy
.. |Release| image:: https://img.shields.io/github/v/release/akamhy/waybackpy.svg
:target: https://github.com/akamhy/waybackpy/releases
.. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65
:target: https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
:target: https://github.com/akamhy/waybackpy/blob/master/LICENSE
.. |Maintainability| image:: https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability
:target: https://codeclimate.com/github/akamhy/waybackpy/maintainability
.. |CodeFactor| image:: https://www.codefactor.io/repository/github/akamhy/waybackpy/badge
:target: https://www.codefactor.io/repository/github/akamhy/waybackpy
.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
:target: https://www.python.org/
.. |pypi| image:: https://img.shields.io/pypi/v/waybackpy.svg
.. |PyPI - Python Version| image:: https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square
.. |Maintenance| image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
:target: https://github.com/akamhy/waybackpy/graphs/commit-activity
.. |codecov| image:: https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg
:target: https://codecov.io/gh/akamhy/waybackpy
.. |image12| image:: https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square
.. |contributions welcome| image:: https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square
.. |Internet Archive| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png
.. |Wayback Machine| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png

View File

@ -1,2 +1,7 @@
[metadata]
description-file = README.md
license_file = LICENSE
[flake8]
max-line-length = 88
extend-ignore = E203,W503

View File

@ -4,21 +4,25 @@ from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), 'waybackpy', '__version__.py')) as f:
exec(f.read(), about)
setup(
name = 'waybackpy',
name = about['__title__'],
packages = ['waybackpy'],
version = 'v1.4',
description = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily.",
version = about['__version__'],
description = about['__description__'],
long_description=long_description,
long_description_content_type='text/markdown',
license='MIT',
author = 'akamhy',
author_email = 'akash3pro@gmail.com',
url = 'https://github.com/akamhy/waybackpy',
download_url = 'https://github.com/akamhy/waybackpy/archive/v1.4.tar.gz',
license= about['__license__'],
author = about['__author__'],
author_email = about['__author_email__'],
url = about['__url__'],
download_url = 'https://github.com/akamhy/waybackpy/archive/2.1.6.tar.gz',
keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
install_requires=[],
python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
python_requires= ">=2.7",
classifiers=[
'Development Status :: 5 - Production/Stable',
'Intended Audience :: Developers',
@ -28,13 +32,23 @@ setup(
'Programming Language :: Python',
'Programming Language :: Python :: 2',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.2',
'Programming Language :: Python :: 3.3',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
'Programming Language :: Python :: Implementation :: PyPy'
],
entry_points={
'console_scripts': [
'waybackpy = waybackpy.cli:main'
]
},
project_urls={
'Documentation': 'https://waybackpy.readthedocs.io',
'Source': 'https://github.com/akamhy/waybackpy',
},
)

View File

@ -1,58 +0,0 @@
import sys
sys.path.append("..")
import waybackpy
import pytest
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_save():
# Test for urls that exist and can be archived.
url1="https://github.com/akamhy/waybackpy"
archived_url1 = waybackpy.save(url1, UA=user_agent)
assert url1 in archived_url1
# Test for urls that are incorrect.
with pytest.raises(Exception) as e_info:
url2 = "ha ha ha ha"
archived_url2 = waybackpy.save(url2, UA=user_agent)
# Test for urls not allowed to archive by robot.txt.
with pytest.raises(Exception) as e_info:
url3 = "http://www.archive.is/faq.html"
archived_url3 = waybackpy.save(url3, UA=user_agent)
# Non existent urls, test
with pytest.raises(Exception) as e_info:
url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
archived_url4 = waybackpy.save(url4, UA=user_agent)
def test_near():
url = "google.com"
archive_near_year = waybackpy.near(url, year=2010, UA=user_agent)
assert "2010" in archive_near_year
archive_near_month_year = waybackpy.near(url, year=2015, month=2, UA=user_agent)
assert "201502" in archive_near_month_year
archive_near_day_month_year = waybackpy.near(url, year=2006, month=11, day=15, UA=user_agent)
assert "20061115" in archive_near_day_month_year
archive_near_hour_day_month_year = waybackpy.near("www.python.org", year=2008, month=5, day=9, hour=15, UA=user_agent)
assert "2008050915" in archive_near_hour_day_month_year
def test_oldest():
url = "github.com/akamhy/waybackpy"
archive_oldest = waybackpy.oldest(url, UA=user_agent)
assert "20200504141153" in archive_oldest
def test_newest():
url = "github.com/akamhy/waybackpy"
archive_newest = waybackpy.newest(url, UA=user_agent)
assert url in archive_newest
def test_get():
oldest_google_archive = waybackpy.oldest("google.com", UA=user_agent)
oldest_google_page_text = waybackpy.get(oldest_google_archive, UA=user_agent)
assert "Welcome to Google" in oldest_google_page_text

97
tests/test_cli.py Normal file
View File

@ -0,0 +1,97 @@
# -*- coding: utf-8 -*-
import sys
import os
import pytest
import argparse
sys.path.append("..")
import waybackpy.cli as cli # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
codecov_python = False
if sys.version_info > (3, 7):
codecov_python = True
# Namespace(day=None, get=None, hour=None, minute=None, month=None, near=False,
# newest=False, oldest=False, save=False, total=False, url=None, user_agent=None, version=False, year=None)
if codecov_python:
def test_save():
args = argparse.Namespace(user_agent=None, url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=True, newest=False, near=False, get=None)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in reply
def test_oldest():
args = argparse.Namespace(user_agent=None, url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=True, save=False, newest=False, near=False, get=None)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in reply
def test_newest():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=True, near=False, get=None)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in reply
def test_total_archives():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=True, version=False,
oldest=False, save=False, newest=False, near=False, get=None)
reply = cli.args_handler(args)
assert isinstance(reply, int)
def test_near():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=True, get=None, year=2020, month=7, day=15, hour=1, minute=1)
reply = cli.args_handler(args)
assert "202007" in reply
def test_get():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, get="url")
reply = cli.args_handler(args)
assert "waybackpy" in reply
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, get="oldest")
reply = cli.args_handler(args)
assert "waybackpy" in reply
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, get="newest")
reply = cli.args_handler(args)
assert "waybackpy" in reply
if codecov_python:
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, get="save")
reply = cli.args_handler(args)
assert "waybackpy" in reply
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, newest=False, near=False, get="BullShit")
reply = cli.args_handler(args)
assert "get the source code of the" in reply
def test_args_handler():
args = argparse.Namespace(version=True)
reply = cli.args_handler(args)
assert __version__ == reply
args = argparse.Namespace(url=None, version=False)
reply = cli.args_handler(args)
assert "Specify an URL" in reply
def test_main():
# This also tests the parse_args method in cli.py
cli.main(['temp.py', '--version'])

194
tests/test_wrapper.py Normal file
View File

@ -0,0 +1,194 @@
# -*- coding: utf-8 -*-
import sys
import pytest
import random
import time
sys.path.append("..")
import waybackpy.wrapper as waybackpy # noqa: E402
if sys.version_info >= (3, 0): # If the python ver >= 3
from urllib.request import Request, urlopen
from urllib.error import URLError
else: # For python2.x
from urllib2 import Request, urlopen, URLError
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_clean_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network_security"
target = waybackpy.Url(test_url, user_agent)
test_result = target._clean_url()
assert answer == test_result
def test_dunders():
url = "https://en.wikipedia.org/wiki/Network_security"
user_agent = "UA"
target = waybackpy.Url(url, user_agent)
assert "waybackpy.Url(url=%s, user_agent=%s)" % (url, user_agent) == repr(target)
assert len(target) == len(url)
assert str(target) == url
def test_archive_url_parser():
request_url = "https://amazon.com"
hdr = {"User-Agent": user_agent} # nosec
req = Request(request_url, headers=hdr) # nosec
header = waybackpy._get_response(req).headers
with pytest.raises(Exception):
waybackpy._archive_url_parser(header)
def test_url_check():
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
waybackpy.Url(broken_url, user_agent)
def test_save():
# Test for urls that exist and can be archived.
time.sleep(10)
url_list = [
"en.wikipedia.org",
"www.wikidata.org",
"commons.wikimedia.org",
"www.wiktionary.org",
"www.w3schools.com",
"www.ibm.com",
]
x = random.randint(0, len(url_list) - 1)
url1 = url_list[x]
target = waybackpy.Url(
url1,
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36",
)
archived_url1 = target.save()
assert url1 in archived_url1
if sys.version_info > (3, 6):
# Test for urls that are incorrect.
with pytest.raises(Exception):
url2 = "ha ha ha ha"
waybackpy.Url(url2, user_agent)
time.sleep(5)
# Test for urls not allowed to archive by robot.txt.
with pytest.raises(Exception):
url3 = "http://www.archive.is/faq.html"
target = waybackpy.Url(
url3,
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) "
"Gecko/20100101 Firefox/25.0",
)
target.save()
time.sleep(5)
# Non existent urls, test
with pytest.raises(Exception):
url4 = (
"https://githfgdhshajagjstgeths537agajaajgsagudadhuss87623"
"46887adsiugujsdgahub.us"
)
target = waybackpy.Url(
url3,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
"AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
"Safari/533.20.27",
)
target.save()
else:
pass
def test_near():
time.sleep(10)
url = "google.com"
target = waybackpy.Url(
url,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 "
"(KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
)
archive_near_year = target.near(year=2010)
assert "2010" in archive_near_year
if sys.version_info > (3, 6):
time.sleep(5)
archive_near_month_year = target.near(year=2015, month=2)
assert (
("201502" in archive_near_month_year)
or ("201501" in archive_near_month_year)
or ("201503" in archive_near_month_year)
)
target = waybackpy.Url(
"www.python.org",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
)
archive_near_hour_day_month_year = target.near(
year=2008, month=5, day=9, hour=15
)
assert (
("2008050915" in archive_near_hour_day_month_year)
or ("2008050914" in archive_near_hour_day_month_year)
or ("2008050913" in archive_near_hour_day_month_year)
)
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = waybackpy.Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
else:
pass
def test_oldest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "20200504141153" in target.oldest()
def test_newest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert url in target.newest()
def test_get():
target = waybackpy.Url("google.com", user_agent)
assert "Welcome to Google" in target.get(target.oldest())
def test_wayback_timestamp():
ts = waybackpy._wayback_timestamp(
year=2020, month=1, day=2, hour=3, minute=4
)
assert "202001020304" in str(ts)
def test_get_response():
hdr = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) "
"Gecko/20100101 Firefox/78.0"
}
req = Request("https://www.google.com", headers=hdr) # nosec
response = waybackpy._get_response(req)
assert response.code == 200
def test_total_archives():
if sys.version_info > (3, 6):
target = waybackpy.Url(" https://google.com ", user_agent)
assert target.total_archives() > 500000
else:
pass
target = waybackpy.Url(
" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
)
assert target.total_archives() == 0

View File

@ -1,6 +1,40 @@
# -*- coding: utf-8 -*-
from .wrapper import save, near, oldest, newest, get
__version__ = "v1.4"
# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
# ┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
# ┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
# ━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
# ━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
__all__ = ['wrapper', 'exceptions']
"""
Waybackpy is a Python library that interfaces with the Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Archive pages and retrieve archived pages easily.
Usage:
>>> import waybackpy
>>> target_url = waybackpy.Url('https://www.python.org', 'Your-apps-cool-user-agent')
>>> new_archive = target_url.save()
>>> print(new_archive)
https://web.archive.org/web/20200502170312/https://www.python.org/
Full documentation @ <https://akamhy.github.io/waybackpy/>.
:copyright: (c) 2020 by akamhy.
:license: MIT
"""
from .wrapper import Url
from .__version__ import (
__title__,
__description__,
__url__,
__version__,
__author__,
__author_email__,
__license__,
__copyright__,
)

10
waybackpy/__version__.py Normal file
View File

@ -0,0 +1,10 @@
# -*- coding: utf-8 -*-
__title__ = "waybackpy"
__description__ = "A Python library that interfaces with the Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily."
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "2.1.6"
__author__ = "akamhy"
__author_email__ = "akash3pro@gmail.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020 akamhy"

105
waybackpy/cli.py Normal file
View File

@ -0,0 +1,105 @@
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import argparse
from waybackpy.wrapper import Url
from waybackpy.__version__ import __version__
def _save(obj):
return (obj.save())
def _oldest(obj):
return (obj.oldest())
def _newest(obj):
return (obj.newest())
def _total_archives(obj):
return (obj.total_archives())
def _near(obj, args):
_near_args = {}
if args.year:
_near_args["year"] = args.year
if args.month:
_near_args["month"] = args.month
if args.day:
_near_args["day"] = args.day
if args.hour:
_near_args["hour"] = args.hour
if args.minute:
_near_args["minute"] = args.minute
return (obj.near(**_near_args))
def _get(obj, args):
if args.get.lower() == "url":
return (obj.get())
if args.get.lower() == "oldest":
return (obj.get(obj.oldest()))
if args.get.lower() == "latest" or args.get.lower() == "newest":
return (obj.get(obj.newest()))
if args.get.lower() == "save":
return (obj.get(obj.save()))
return ("Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) oldest - get the source code of the oldest archive for the supplied url.\
\n3) newest - get the source code of the newest archive for the supplied url.\
\n4) save - Create a new archive and get the source code of this new archive for the supplied url.")
def args_handler(args):
if args.version:
return (__version__)
if not args.url:
return ("Specify an URL. See --help for help using waybackpy.")
if args.user_agent:
obj = Url(args.url, args.user_agent)
else:
obj = Url(args.url)
if args.save:
return _save(obj)
if args.oldest:
return _oldest(obj)
if args.newest:
return _newest(obj)
if args.total:
return _total_archives(obj)
if args.near:
return _near(obj, args)
if args.get:
return _get(obj, args)
return ("Usage: waybackpy --url [URL] --user_agent [USER AGENT] [OPTIONS]. See --help for help using waybackpy.")
def parse_args(argv):
parser = argparse.ArgumentParser()
parser.add_argument("-u", "--url", help="URL on which Wayback machine operations would occur.")
parser.add_argument("-ua", "--user_agent", help="User agent, default user_agent is \"waybackpy python package - https://github.com/akamhy/waybackpy\".")
parser.add_argument("-s", "--save", action='store_true', help="Save the URL on the Wayback machine.")
parser.add_argument("-o", "--oldest", action='store_true', help="Oldest archive for the specified URL.")
parser.add_argument("-n", "--newest", action='store_true', help="Newest archive for the specified URL.")
parser.add_argument("-t", "--total", action='store_true', help="Total number of archives for the specified URL.")
parser.add_argument("-g", "--get", help="Prints the source code of the supplied url. Use '--get help' for extended usage.")
parser.add_argument("-v", "--version", action='store_true', help="Prints the waybackpy version.")
parser.add_argument("-N", "--near", action='store_true', help="Latest/Newest archive for the specified URL.")
parser.add_argument("-Y", "--year", type=int, help="Year in integer. For use with --near.")
parser.add_argument("-M", "--month", type=int, help="Month in integer. For use with --near.")
parser.add_argument("-D", "--day", type=int, help="Day in integer. For use with --near.")
parser.add_argument("-H", "--hour", type=int, help="Hour in integer. For use with --near.")
parser.add_argument("-MIN", "--minute", type=int, help="Minute in integer. For use with --near.")
return parser.parse_args(argv[1:])
def main(argv=None):
if argv is None:
argv = sys.argv
args = parse_args(argv)
output = args_handler(args)
print(output)
if __name__ == "__main__":
sys.exit(main(sys.argv))

View File

@ -1,38 +1,6 @@
# -*- coding: utf-8 -*-
class TooManyArchivingRequests(Exception):
"""Error when a single url reqeusted for archiving too many times in a short timespam.
Wayback machine doesn't supports archivng any url too many times in a short period of time.
class WaybackError(Exception):
"""
class ArchivingNotAllowed(Exception):
"""Files like robots.txt are set to deny robot archiving.
Wayback machine respects these file, will not archive.
"""
class PageNotSaved(Exception):
"""
When unable to save a webpage.
"""
class ArchiveNotFound(Exception):
"""
When a page was never archived but client asks for old archive.
"""
class UrlNotFound(Exception):
"""
Raised when 404 UrlNotFound.
"""
class BadGateWay(Exception):
"""
Raised when 502 bad gateway.
"""
class InvalidUrl(Exception):
"""
Raised when url doesn't follow the standard url format.
Raised when API Service error.
"""

View File

@ -1,88 +1,169 @@
# -*- coding: utf-8 -*-
import re
import sys
import json
from datetime import datetime
from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl
try:
from waybackpy.exceptions import WaybackError
from waybackpy.__version__ import __version__
if sys.version_info >= (3, 0): # If the python ver >= 3
from urllib.request import Request, urlopen
from urllib.error import HTTPError
except ImportError:
from urllib2 import Request, urlopen, HTTPError
from urllib.error import URLError
else: # For python2.x
from urllib2 import Request, urlopen, URLError
default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"
default_UA = "waybackpy python package"
def _archive_url_parser(header):
"""Parse out the archive from header."""
# Regex1
arch = re.search(
r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
)
if arch:
return arch.group(1)
# Regex2
arch = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
if arch:
return arch.group(1)
raise WaybackError(
"No archive URL found in the API response. "
"This version of waybackpy (%s) is likely out of date. Visit "
"https://github.com/akamhy/waybackpy for the latest version "
"of waybackpy.\nHeader:\n%s" % (__version__, str(header))
)
def clean_url(url):
return str(url).strip().replace(" ","_")
def save(url,UA=default_UA):
base_save_url = "https://web.archive.org/save/"
request_url = (base_save_url + clean_url(url))
hdr = { 'User-Agent' : '%s' % UA } #nosec
req = Request(request_url, headers=hdr) #nosec
if "." not in url:
raise InvalidUrl("'%s' is not a vaild url." % url)
def _wayback_timestamp(**kwargs):
"""Return a formatted timestamp."""
return "".join(
str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
)
def _get_response(req):
"""Get response for the supplied request."""
try:
response = urlopen(req) #nosec
except HTTPError as e:
if e.code == 502:
raise BadGateWay(e)
elif e.code == 429:
raise TooManyArchivingRequests(e)
elif e.code == 404:
raise UrlNotFound(e)
else:
raise PageNotSaved(e)
header = response.headers
if "exclusion.robots.policy" in str(header):
raise ArchivingNotAllowed("Can not archive %s. Disabled by site owner." % (url))
archive_id = header['Content-Location']
archived_url = "https://web.archive.org" + archive_id
return archived_url
def get(url,encoding=None,UA=default_UA):
hdr = { 'User-Agent' : '%s' % UA }
request_url = clean_url(url)
req = Request(request_url, headers=hdr) #nosec
resp=urlopen(req) #nosec
if encoding is None:
response = urlopen(req) # nosec
except Exception:
try:
encoding= resp.headers['content-type'].split('charset=')[-1]
except AttributeError:
encoding = "UTF-8"
return resp.read().decode(encoding)
response = urlopen(req) # nosec
except Exception as e:
exc = WaybackError("Error while retrieving %s" % req.full_url)
exc.__cause__ = e
raise exc
return response
def wayback_timestamp(year,month,day,hour,minute):
year = str(year)
month = str(month).zfill(2)
day = str(day).zfill(2)
hour = str(hour).zfill(2)
minute = str(minute).zfill(2)
return (year+month+day+hour+minute)
class Url:
"""waybackpy Url object"""
def near(
url,
year=datetime.utcnow().strftime('%Y'),
month=datetime.utcnow().strftime('%m'),
day=datetime.utcnow().strftime('%d'),
hour=datetime.utcnow().strftime('%H'),
minute=datetime.utcnow().strftime('%M'),
UA=default_UA,
):
timestamp = wayback_timestamp(year,month,day,hour,minute)
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (clean_url(url), str(timestamp))
hdr = { 'User-Agent' : '%s' % UA }
req = Request(request_url, headers=hdr) # nosec
response = urlopen(req) #nosec
data = json.loads(response.read().decode("UTF-8"))
if not data["archived_snapshots"]:
raise ArchiveNotFound("'%s' is not yet archived." % url)
def __init__(self, url, user_agent=default_UA):
self.url = url
self.user_agent = user_agent
self._url_check() # checks url validity on init.
archive_url = (data["archived_snapshots"]["closest"]["url"])
return archive_url
def __repr__(self):
return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)
def oldest(url,UA=default_UA,year=1994):
return near(url,year=year,UA=UA)
def __str__(self):
return "%s" % self._clean_url()
def newest(url,UA=default_UA):
return near(url,UA=UA)
def __len__(self):
return len(self._clean_url())
def _url_check(self):
"""Check for common URL problems."""
if "." not in self.url:
raise URLError("'%s' is not a vaild URL." % self.url)
def _clean_url(self):
"""Fix the URL, if possible."""
return str(self.url).strip().replace(" ", "_")
def save(self):
"""Create a new Wayback Machine archive for this URL."""
request_url = "https://web.archive.org/save/" + self._clean_url()
hdr = {"User-Agent": "%s" % self.user_agent} # nosec
req = Request(request_url, headers=hdr) # nosec
header = _get_response(req).headers
return "https://" + _archive_url_parser(header)
def get(self, url="", user_agent="", encoding=""):
"""Return the source code of the supplied URL.
If encoding is not supplied, it is auto-detected from the response.
"""
if not url:
url = self._clean_url()
if not user_agent:
user_agent = self.user_agent
hdr = {"User-Agent": "%s" % user_agent}
req = Request(url, headers=hdr) # nosec
response = _get_response(req)
if not encoding:
try:
encoding = response.headers["content-type"].split("charset=")[-1]
except AttributeError:
encoding = "UTF-8"
return response.read().decode(encoding.replace("text/html", "UTF-8", 1))
def near(self, year=None, month=None, day=None, hour=None, minute=None):
""" Return the closest Wayback Machine archive to the time supplied.
Supported params are year, month, day, hour and minute.
Any non-supplied parameters default to the current time.
"""
now = datetime.utcnow().timetuple()
timestamp = _wayback_timestamp(
year=year if year else now.tm_year,
month=month if month else now.tm_mon,
day=day if day else now.tm_mday,
hour=hour if hour else now.tm_hour,
minute=minute if minute else now.tm_min,
)
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (
self._clean_url(),
timestamp,
)
hdr = {"User-Agent": "%s" % self.user_agent}
req = Request(request_url, headers=hdr) # nosec
response = _get_response(req)
data = json.loads(response.read().decode("UTF-8"))
if not data["archived_snapshots"]:
raise WaybackError(
"'%s' is not yet archived. Use wayback.Url(url, user_agent).save() "
"to create a new archive." % self._clean_url()
)
archive_url = data["archived_snapshots"]["closest"]["url"]
# wayback machine returns http sometimes, idk why? But they support https
archive_url = archive_url.replace(
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
return archive_url
def oldest(self, year=1994):
"""Return the oldest Wayback Machine archive for this URL."""
return self.near(year=year)
def newest(self):
"""Return the newest Wayback Machine archive available for this URL.
Due to Wayback Machine database lag, this may not always be the
most recent archive.
"""
return self.near()
def total_archives(self):
"""Returns the total number of Wayback Machine archives for this URL."""
hdr = {"User-Agent": "%s" % self.user_agent}
request_url = (
"https://web.archive.org/cdx/search/cdx?url=%s&output=json&fl=statuscode"
% self._clean_url()
)
req = Request(request_url, headers=hdr) # nosec
response = _get_response(req)
# Most efficient method to count number of archives (yet)
return str(response.read()).count(",")