Compare commits
37 Commits
- 80331833f2
- 5e3d3a815f
- 6182a18cf4
- 9bca750310
- c22749a6a3
- 151df94fe3
- 24540d0b2c
- bdfc72d05d
- 3b104c1a28
- fb0d4658a7
- 48833980e1
- 0c4f119981
- afded51a04
- b950616561
- 444675538f
- 0ca6710334
- 01a7c591ad
- 74d3bc154b
- a8e94dfb25
- cc38798b32
- bc3dd44f27
- ba46cdafe2
- 538afb14e9
- 7605b614ee
- d0a4e25cf5
- 8c5c0153da
- e7dac74906
- c686708c9e
- f9ae8ada70
- e56ece3dc9
- db127a5c54
- ed497bbd23
- 45fe07ddb6
- 0029d63d8a
- beb5b625ec
- b40d734346
- be0a30de85
README.md (24 changes)

@@ -4,8 +4,11 @@
[](https://github.com/akamhy/waybackpy/releases)
[](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade)
[](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[](https://www.codefactor.io/repository/github/akamhy/waybackpy)
[](https://www.python.org/)
[](https://github.com/akamhy/waybackpy/graphs/commit-activity)
@@ -27,6 +30,8 @@ Table of contents
* [Receiving the recent most/newest archive for an URL using newest()](https://github.com/akamhy/waybackpy#receiving-the-newest-archive-for-an-url-using-newest)
* [Receiving archive close to a specified year, month, day, hour, and minute using near()](https://github.com/akamhy/waybackpy#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage using get()](https://github.com/akamhy/waybackpy#get-the-content-of-webpage-using-get)
* [Count total archives for an URL using total_archives()](https://github.com/akamhy/waybackpy#count-total-archives-for-an-url-using-total_archives)
* [Tests](https://github.com/akamhy/waybackpy#tests)
@@ -142,6 +147,23 @@ print(webpage)
```

> This should print the source code for <https://example.com/>.

#### Count total archives for an URL using total_archives()

```diff
+ waybackpy.total_archives(url, UA=user_agent)
```

> url is mandatory. UA is not, but highly recommended.

```python
from waybackpy import total_archives

# Counting the total number of archives for any url. No need to import other libraries :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# Supported arguments are url and UA.
count = total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA="User-Agent")
print(count)
```

> This should print an integer (int), which is the total number of archives on archive.org.

## Tests

* [Here](https://github.com/akamhy/waybackpy/tree/master/tests)
_config.yml (new file, 1 line)

@@ -0,0 +1 @@
theme: jekyll-theme-cayman
index.rst (new file, 232 lines)

@@ -0,0 +1,232 @@
waybackpy
=========

|Build Status| |Downloads| |Release| |Codacy Badge| |License: MIT|
|Maintainability| |CodeFactor| |made-with-python| |pypi| |PyPI - Python
Version| |Maintenance|

.. |Build Status| image:: https://travis-ci.org/akamhy/waybackpy.svg?branch=master
   :target: https://travis-ci.org/akamhy/waybackpy
.. |Downloads| image:: https://img.shields.io/pypi/dm/waybackpy.svg
   :target: https://pypistats.org/packages/waybackpy
.. |Release| image:: https://img.shields.io/github/v/release/akamhy/waybackpy.svg
   :target: https://github.com/akamhy/waybackpy/releases
.. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65
   :target: https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
   :target: https://github.com/akamhy/waybackpy/blob/master/LICENSE
.. |Maintainability| image:: https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability
   :target: https://codeclimate.com/github/akamhy/waybackpy/maintainability
.. |CodeFactor| image:: https://www.codefactor.io/repository/github/akamhy/waybackpy/badge
   :target: https://www.codefactor.io/repository/github/akamhy/waybackpy
.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
   :target: https://www.python.org/
.. |pypi| image:: https://img.shields.io/pypi/v/waybackpy.svg
.. |PyPI - Python Version| image:: https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square
.. |Maintenance| image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
   :target: https://github.com/akamhy/waybackpy/graphs/commit-activity

|Internet Archive| |Wayback Machine|

waybackpy is a python wrapper for `Internet Archive`_\ ’s `Wayback
Machine`_.

.. _Internet Archive: https://en.wikipedia.org/wiki/Internet_Archive
.. _Wayback Machine: https://en.wikipedia.org/wiki/Wayback_Machine

.. |Internet Archive| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png
.. |Wayback Machine| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png

Installation
------------

Using `pip`_:

**pip install waybackpy**

.. _pip: https://en.wikipedia.org/wiki/Pip_(package_manager)
Usage
-----

Archiving aka Saving an url Using save()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: diff

    + waybackpy.save(url, UA=user_agent)

..

   url is mandatory. UA is not, but highly recommended.

.. code:: python

    import waybackpy
    # Capturing a new archive on Wayback machine.
    # Default user-agent (UA) is "waybackpy python package", if not specified in the call.
    archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
    print(archived_url)

This should print something similar to the following archived URL:

https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy
Receiving the oldest archive for an URL Using oldest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: diff

    + waybackpy.oldest(url, UA=user_agent)

..

   url is mandatory. UA is not, but highly recommended.

.. code:: python

    import waybackpy
    # retrieving the oldest archive on Wayback machine.
    # Default user-agent (UA) is "waybackpy python package", if not specified in the call.
    oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
    print(oldest_archive)

This returns the oldest available archive for https://google.com:

http://web.archive.org/web/19981111184551/http://google.com:80/
Receiving the newest archive for an URL using newest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: diff

    + waybackpy.newest(url, UA=user_agent)

..

   url is mandatory. UA is not, but highly recommended.

.. code:: python

    import waybackpy
    # retrieving the newest archive on Wayback machine.
    # Default user-agent (UA) is "waybackpy python package", if not specified in the call.
    newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
    print(newest_archive)

This returns the newest available archive for
https://www.microsoft.com/en-us, something just like this:

http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/
Receiving archive close to a specified year, month, day, hour, and minute using near()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: diff

    + waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)

..

   url is mandatory. year, month, day, hour and minute are optional
   arguments. UA is not mandatory, but highly recommended.

.. code:: python

    import waybackpy
    # retrieving the closest archive from a specified year.
    # Default user-agent (UA) is "waybackpy python package", if not specified in the call.
    # supported arguments are year, month, day, hour and minute
    archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
    print(archive_near_year)

returns:
http://web.archive.org/web/20100504071154/http://www.facebook.com/

``waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``
returns:
http://web.archive.org/web/20101111173430/http://www.facebook.com//

``waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``
returns:
http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html

Please note that if you only specify the year, the current month and
day are the default arguments for month and day respectively. Do not
expect that passing only the year parameter will return the archive
closest to January; it returns the archive closest to the current month
in which you are using the package. If you are using it in July 2018 and
you call
``waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``,
you will get the archive nearest to July 2011, not January 2011. You
need to specify month=1 for January.

Do not pad the year, month, day, hour, and minute arguments with zeros,
e.g. for January, set month=1 and not month=01.
Get the content of webpage using get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: diff

    + waybackpy.get(url, encoding="UTF-8", UA=user_agent)

..

   url is mandatory. UA is not, but highly recommended. encoding is
   detected automatically, don’t specify unless necessary.

.. code:: python

    from waybackpy import get
    # retrieving the webpage from any url, including archived urls. No need to import other libraries :)
    # Default user-agent (UA) is "waybackpy python package", if not specified in the call.
    # supported arguments are url, encoding and UA
    webpage = get("https://example.com/", UA="User-Agent")
    print(webpage)

..

   This should print the source code for https://example.com/.
Count total archives for an URL using total_archives()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: diff

    + waybackpy.total_archives(url, UA=user_agent)

..

   url is mandatory. UA is not, but highly recommended.

.. code:: python

    from waybackpy import total_archives
    # counting the total number of archives for any url. No need to import other libraries :)
    # Default user-agent (UA) is "waybackpy python package", if not specified in the call.
    # supported arguments are url and UA
    count = total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA="User-Agent")
    print(count)

..

   This should print an integer (int), which is the total number of
   archives on archive.org.
Tests
-----

- `Here`_

Dependency
----------

- None, just python standard libraries (json, urllib and datetime).
  Both python 2 and 3 are supported :)

License
-------

`MIT License`_

.. _Here: https://github.com/akamhy/waybackpy/tree/master/tests
.. _MIT License: https://github.com/akamhy/waybackpy/blob/master/LICENSE
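The near() documentation above warns that omitted fields default to the current UTC month, day, hour, and minute rather than to January 1st. A small offline sketch of that defaulting behavior; `build_timestamp` is a hypothetical stand-in written for illustration, not a waybackpy function:

```python
from datetime import datetime

def build_timestamp(year, month=None, day=None, hour=None, minute=None):
    # Any omitted field falls back to the *current* UTC value, which is
    # why near(url, year=2011) in July matches July 2011, not January.
    now = datetime.utcnow()
    month = now.month if month is None else month
    day = now.day if day is None else day
    hour = now.hour if hour is None else hour
    minute = now.minute if minute is None else minute
    return "%04d%02d%02d%02d%02d" % (int(year), int(month), int(day), int(hour), int(minute))

# Explicit January 1st gives a January timestamp prefix:
print(build_timestamp(2011, month=1, day=1, hour=1, minute=1))  # 201101010101
```

Passing only the year still yields a full 12-digit timestamp, filled in with the current UTC clock.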
setup.py (31 changes)

@@ -4,21 +4,25 @@ from setuptools import setup
```diff
 with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
     long_description = f.read()

+about = {}
+with open(os.path.join(os.path.dirname(__file__), 'waybackpy', '__version__.py')) as f:
+    exec(f.read(), about)
+
 setup(
-    name = 'waybackpy',
+    name = about['__title__'],
     packages = ['waybackpy'],
-    version = 'v1.4',
-    description = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily.",
+    version = about['__version__'],
+    description = about['__description__'],
     long_description=long_description,
     long_description_content_type='text/markdown',
-    license='MIT',
-    author = 'akamhy',
-    author_email = 'akash3pro@gmail.com',
-    url = 'https://github.com/akamhy/waybackpy',
-    download_url = 'https://github.com/akamhy/waybackpy/archive/v1.4.tar.gz',
+    license= about['__license__'],
+    author = about['__author__'],
+    author_email = about['__author_email__'],
+    url = about['__url__'],
+    download_url = 'https://github.com/akamhy/waybackpy/archive/v1.5.tar.gz',
     keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
     install_requires=[],
-    python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
+    python_requires= ">=2.7",
     classifiers=[
         'Development Status :: 5 - Production/Stable',
         'Intended Audience :: Developers',
@@ -28,13 +32,18 @@ setup(
         'Programming Language :: Python',
         'Programming Language :: Python :: 2',
         'Programming Language :: Python :: 2.7',
         'Programming Language :: Python :: 3',
+        'Programming Language :: Python :: 3.2',
+        'Programming Language :: Python :: 3.3',
         'Programming Language :: Python :: 3.4',
         'Programming Language :: Python :: 3.5',
         'Programming Language :: Python :: 3.6',
         'Programming Language :: Python :: 3.7',
+        'Programming Language :: Python :: 3.8',
         'Programming Language :: Python :: Implementation :: CPython',
         'Programming Language :: Python :: Implementation :: PyPy'
     ],
+    project_urls={
+        'Documentation': 'https://waybackpy.readthedocs.io',
+        'Source': 'https://github.com/akamhy/waybackpy',
+    },
 )
```
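The setup.py change above moves the package metadata into `waybackpy/__version__.py` and loads it with `exec()`, so the version string is defined in exactly one place. A minimal offline sketch of the pattern, with the module source inlined as a string here instead of read from disk as setup.py does:

```python
# Single-source-of-truth metadata: exec() the version module's source
# into a plain dict, then read fields from that dict.
about = {}
version_module_source = '__title__ = "waybackpy"\n__version__ = "v1.5"\n'
exec(version_module_source, about)  # populates about["__title__"], etc.

print(about["__title__"], about["__version__"])  # waybackpy v1.5
```

This is why bumping `__version__.py` now updates both `setup()` and the installed package without touching setup.py.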
@@ -6,6 +6,16 @@ import pytest
```diff
 user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"

+def test_clean_url():
+    test_url = " https://en.wikipedia.org/wiki/Network security "
+    answer = "https://en.wikipedia.org/wiki/Network_security"
+    test_result = waybackpy.clean_url(test_url)
+    assert answer == test_result
+
+def test_url_check():
+    InvalidUrl = "http://wwwgooglecom/"
+    with pytest.raises(Exception) as e_info:
+        waybackpy.url_check(InvalidUrl)
+
 def test_save():
     # Test for urls that exist and can be archived.
@@ -16,31 +26,35 @@ def test_save():
     # Test for urls that are incorrect.
     with pytest.raises(Exception) as e_info:
         url2 = "ha ha ha ha"
-        archived_url2 = waybackpy.save(url2, UA=user_agent)
+        waybackpy.save(url2, UA=user_agent)

     # Test for urls not allowed to archive by robot.txt.
     with pytest.raises(Exception) as e_info:
         url3 = "http://www.archive.is/faq.html"
-        archived_url3 = waybackpy.save(url3, UA=user_agent)
+        waybackpy.save(url3, UA=user_agent)

     # Non existent urls, test
     with pytest.raises(Exception) as e_info:
         url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
         archived_url4 = waybackpy.save(url4, UA=user_agent)

 def test_near():
     url = "google.com"
     archive_near_year = waybackpy.near(url, year=2010, UA=user_agent)
     assert "2010" in archive_near_year

     archive_near_month_year = waybackpy.near(url, year=2015, month=2, UA=user_agent)
-    assert "201502" in archive_near_month_year
+    assert ("201502" in archive_near_month_year) or ("201501" in archive_near_month_year) or ("201503" in archive_near_month_year)

     archive_near_day_month_year = waybackpy.near(url, year=2006, month=11, day=15, UA=user_agent)
-    assert "20061115" in archive_near_day_month_year
+    assert ("20061114" in archive_near_day_month_year) or ("20061115" in archive_near_day_month_year) or ("2006116" in archive_near_day_month_year)

     archive_near_hour_day_month_year = waybackpy.near("www.python.org", year=2008, month=5, day=9, hour=15, UA=user_agent)
-    assert "2008050915" in archive_near_hour_day_month_year
+    assert ("2008050915" in archive_near_hour_day_month_year) or ("2008050914" in archive_near_hour_day_month_year) or ("2008050913" in archive_near_hour_day_month_year)

     with pytest.raises(Exception) as e_info:
         NeverArchivedUrl = "https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
         waybackpy.near(NeverArchivedUrl, year=2010, UA=user_agent)

 def test_oldest():
     url = "github.com/akamhy/waybackpy"
@@ -51,8 +65,34 @@ def test_newest():
     url = "github.com/akamhy/waybackpy"
     archive_newest = waybackpy.newest(url, UA=user_agent)
     assert url in archive_newest

+def test_get():
+    oldest_google_archive = waybackpy.oldest("google.com", UA=user_agent)
+    oldest_google_page_text = waybackpy.get(oldest_google_archive, UA=user_agent)
+    assert "Welcome to Google" in oldest_google_page_text
+
+def test_total_archives():
+
+    count1 = waybackpy.total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA=user_agent)
+    assert count1 > 2000
+
+    count2 = waybackpy.total_archives("https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8", UA=user_agent)
+    assert count2 == 0
+
+if __name__ == "__main__":
+    test_clean_url()
+    print(".")
+    test_url_check()
+    print(".")
+    test_get()
+    print(".")
+    test_near()
+    print(".")
+    test_newest()
+    print(".")
+    test_save()
+    print(".")
+    test_oldest()
+    print(".")
+    test_total_archives()
+    print(".")
```
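The test changes above relax exact-timestamp assertions to also accept neighbouring snapshots, since the Wayback Machine returns the *closest* capture, which may fall on an adjacent day or hour. A hypothetical helper (`near_enough` is not part of the test suite) expressing the same tolerance, runnable offline:

```python
def near_enough(archive_url, candidates):
    # True if the archive URL contains any acceptable timestamp prefix,
    # e.g. the requested day plus or minus one day.
    return any(ts in archive_url for ts in candidates)

url = "http://web.archive.org/web/20061114084234/http://google.com/"
print(near_enough(url, ["20061114", "20061115", "20061116"]))  # True
```

Folding the accepted prefixes into a list keeps the intent readable compared to chained `or` clauses.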
```diff
@@ -1,6 +1,30 @@
 # -*- coding: utf-8 -*-
-from .wrapper import save, near, oldest, newest, get
-
-__version__ = "v1.4"
+# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
+# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
+# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
+# ┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
+# ┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
+# ━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
+# ━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
+# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━

 __all__ = ['wrapper', 'exceptions']
+"""
+A python wrapper for Internet Archive's Wayback Machine API.
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Archive pages and retrieve archived pages easily.
+Usage:
+    >>> import waybackpy
+    >>> new_archive = waybackpy.save('https://www.python.org')
+    >>> print(new_archive)
+    https://web.archive.org/web/20200502170312/https://www.python.org/
+
+Full documentation @ <https://akamhy.github.io/waybackpy/>.
+:copyright: (c) 2020 by akamhy.
+:license: MIT
+"""
+
+from .wrapper import save, near, oldest, newest, get, clean_url, url_check, total_archives
+from .__version__ import __title__, __description__, __url__, __version__
+from .__version__ import __author__, __author_email__, __license__, __copyright__
```
waybackpy/__version__.py (new file, 8 lines)

@@ -0,0 +1,8 @@
__title__ = "waybackpy"
__description__ = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily."
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "v1.5"
__author__ = "akamhy"
__author_email__ = "akash3pro@gmail.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020 akamhy"
```diff
@@ -32,6 +32,11 @@ class BadGateWay(Exception):
     Raised when 502 bad gateway.
     """

+class WaybackUnavailable(Exception):
+    """
+    Raised when 503 API Service Temporarily Unavailable.
+    """
+
 class InvalidUrl(Exception):
     """
     Raised when url doesn't follow the standard url format.
```
```diff
@@ -1,88 +1,143 @@
 # -*- coding: utf-8 -*-
 import json
 from datetime import datetime
-from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl
+from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl, WaybackUnavailable
 try:
     from urllib.request import Request, urlopen
-    from urllib.error import HTTPError
+    from urllib.error import HTTPError, URLError
 except ImportError:
-    from urllib2 import Request, urlopen, HTTPError
+    from urllib2 import Request, urlopen, HTTPError, URLError

 default_UA = "waybackpy python package"

+def url_check(url):
+    if "." not in url:
+        raise InvalidUrl("'%s' is not a vaild url." % url)
+
 def clean_url(url):
     return str(url).strip().replace(" ","_")

-def save(url,UA=default_UA):
-    base_save_url = "https://web.archive.org/save/"
-    request_url = (base_save_url + clean_url(url))
+def wayback_timestamp(**kwargs):
+    return (
+        str(kwargs["year"])
+        +
+        str(kwargs["month"]).zfill(2)
+        +
+        str(kwargs["day"]).zfill(2)
+        +
+        str(kwargs["hour"]).zfill(2)
+        +
+        str(kwargs["minute"]).zfill(2)
+    )
+
+def handle_HTTPError(e):
+    if e.code == 502:
+        raise BadGateWay(e)
+    elif e.code == 503:
+        raise WaybackUnavailable(e)
+    elif e.code == 429:
+        raise TooManyArchivingRequests(e)
+    elif e.code == 404:
+        raise UrlNotFound(e)
```
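The keyword-argument rewrite of `wayback_timestamp()` above zero-pads every field except the year and concatenates them into the 12-digit timestamp the availability API expects. Reproduced here as a standalone sanity check:

```python
def wayback_timestamp(**kwargs):
    # Same logic as the helper in the diff: year followed by
    # two-digit month, day, hour and minute.
    return (
        str(kwargs["year"])
        + str(kwargs["month"]).zfill(2)
        + str(kwargs["day"]).zfill(2)
        + str(kwargs["hour"]).zfill(2)
        + str(kwargs["minute"]).zfill(2)
    )

print(wayback_timestamp(year=2019, month=1, day=5, hour=4, minute=7))
# 201901050407
```

The `zfill(2)` calls are why the docs tell users not to pad arguments themselves: `month=1` and `month=01` would both render as `01`, but the latter is a syntax error in Python 3.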
```diff
+def save(url, UA=default_UA):
+    url_check(url)
+    request_url = ("https://web.archive.org/save/" + clean_url(url))

     hdr = { 'User-Agent' : '%s' % UA } #nosec
     req = Request(request_url, headers=hdr) #nosec
-    if "." not in url:
-        raise InvalidUrl("'%s' is not a vaild url." % url)

     try:
         response = urlopen(req) #nosec
     except HTTPError as e:
-        if e.code == 502:
-            raise BadGateWay(e)
-        elif e.code == 429:
-            raise TooManyArchivingRequests(e)
-        elif e.code == 404:
-            raise UrlNotFound(e)
-        else:
-            raise PageNotSaved(e)
+        if handle_HTTPError(e) is None:
+            raise PageNotSaved(e)
+    except URLError:
+        try:
+            response = urlopen(req) #nosec
+        except URLError as e:
+            raise UrlNotFound(e)

     header = response.headers

     if "exclusion.robots.policy" in str(header):
         raise ArchivingNotAllowed("Can not archive %s. Disabled by site owner." % (url))
-    archive_id = header['Content-Location']
-    archived_url = "https://web.archive.org" + archive_id
-    return archived_url
+    return "https://web.archive.org" + header['Content-Location']

-def get(url,encoding=None,UA=default_UA):
+def get(url, encoding=None, UA=default_UA):
+    url_check(url)
     hdr = { 'User-Agent' : '%s' % UA }
-    request_url = clean_url(url)
-    req = Request(request_url, headers=hdr) #nosec
-    resp=urlopen(req) #nosec
+    req = Request(clean_url(url), headers=hdr) #nosec
+
+    try:
+        resp=urlopen(req) #nosec
+    except URLError:
+        try:
+            resp=urlopen(req) #nosec
+        except URLError as e:
+            raise UrlNotFound(e)
+
     if encoding is None:
         try:
             encoding= resp.headers['content-type'].split('charset=')[-1]
         except AttributeError:
             encoding = "UTF-8"
-    return resp.read().decode(encoding)
-
-def wayback_timestamp(year,month,day,hour,minute):
-    year = str(year)
-    month = str(month).zfill(2)
-    day = str(day).zfill(2)
-    hour = str(hour).zfill(2)
-    minute = str(minute).zfill(2)
-    return (year+month+day+hour+minute)
+    return resp.read().decode(encoding.replace("text/html", "UTF-8", 1))

-def near(
-    url,
-    year=datetime.utcnow().strftime('%Y'),
-    month=datetime.utcnow().strftime('%m'),
-    day=datetime.utcnow().strftime('%d'),
-    hour=datetime.utcnow().strftime('%H'),
-    minute=datetime.utcnow().strftime('%M'),
-    UA=default_UA,
-    ):
-    timestamp = wayback_timestamp(year,month,day,hour,minute)
+def near(url, **kwargs):
+
+    try:
+        url = kwargs["url"]
+    except KeyError:
+        url = url
+
+    year=kwargs.get("year", datetime.utcnow().strftime('%Y'))
+    month=kwargs.get("month", datetime.utcnow().strftime('%m'))
+    day=kwargs.get("day", datetime.utcnow().strftime('%d'))
+    hour=kwargs.get("hour", datetime.utcnow().strftime('%H'))
+    minute=kwargs.get("minute", datetime.utcnow().strftime('%M'))
+    UA=kwargs.get("UA", default_UA)
+
+    url_check(url)
+    timestamp = wayback_timestamp(year=year,month=month,day=day,hour=hour,minute=minute)
     request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (clean_url(url), str(timestamp))
     hdr = { 'User-Agent' : '%s' % UA }
     req = Request(request_url, headers=hdr) # nosec
-    response = urlopen(req) #nosec
+
+    try:
+        response = urlopen(req) #nosec
+    except HTTPError as e:
+        handle_HTTPError(e)

     data = json.loads(response.read().decode("UTF-8"))
     if not data["archived_snapshots"]:
         raise ArchiveNotFound("'%s' is not yet archived." % url)

     archive_url = (data["archived_snapshots"]["closest"]["url"])
     # wayback machine returns http sometimes, idk why? But they support https
     archive_url = archive_url.replace("http://web.archive.org/web/","https://web.archive.org/web/",1)
     return archive_url

-def oldest(url,UA=default_UA,year=1994):
-    return near(url,year=year,UA=UA)
+def oldest(url, UA=default_UA, year=1994):
+    return near(url, year=year, UA=UA)

-def newest(url,UA=default_UA):
-    return near(url,UA=UA)
+def newest(url, UA=default_UA):
+    return near(url, UA=UA)

+def total_archives(url, UA=default_UA):
+    url_check(url)
+
+    hdr = { 'User-Agent' : '%s' % UA }
+    request_url = "https://web.archive.org/cdx/search/cdx?url=%s&output=json" % clean_url(url)
+    req = Request(request_url, headers=hdr) # nosec
+
+    try:
+        response = urlopen(req) #nosec
+    except HTTPError as e:
+        handle_HTTPError(e)
+
+    return (len(json.loads(response.read())))
```
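The new `total_archives()` returns `len()` of the parsed CDX response. An offline sketch with a sample payload inlined as a string; note this assumes the CDX server's `output=json` format, where the first row is a column-header row, so the raw length counts it along with the snapshots:

```python
import json

# Sample CDX-style JSON payload: one header row plus one snapshot row.
sample = ('[["urlkey","timestamp","original"],'
          '["com,example)/","20200101000000","http://example.com/"]]')
rows = json.loads(sample)
print(len(rows))  # 2
```

Under that assumption a non-empty result's length is one greater than the snapshot count, while a URL with no captures yields an empty list and a count of 0, which matches the `count2 == 0` test above.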