Compare commits

37 Commits
v1.4 ... v1.6

Author SHA1 Message Date
80331833f2 Update setup.py 2020-05-07 20:12:32 +05:30
5e3d3a815f fix 2020-05-07 20:03:17 +05:30
6182a18cf4 fix 2020-05-07 20:02:47 +05:30
9bca750310 v1.5 2020-05-07 19:59:23 +05:30
c22749a6a3 update 2020-05-07 19:54:00 +05:30
151df94fe3 license_file = LICENSE 2020-05-07 19:38:19 +05:30
24540d0b2c update 2020-05-07 19:33:39 +05:30
bdfc72d05d Create __version__.py 2020-05-07 19:16:26 +05:30
3b104c1a28 v1.5 2020-05-07 19:03:02 +05:30
fb0d4658a7 ce 2020-05-07 19:02:12 +05:30
48833980e1 update 2020-05-07 18:58:01 +05:30
0c4f119981 Update wrapper.py 2020-05-07 17:25:34 +05:30
afded51a04 Update wrapper.py 2020-05-07 17:20:23 +05:30
b950616561 Update wrapper.py 2020-05-07 17:17:17 +05:30
444675538f fix code Complexity (#8): fix code Complexity; Update wrapper.py; codefactor badge 2020-05-07 16:51:08 +05:30
0ca6710334 Update wrapper.py 2020-05-07 16:24:33 +05:30
01a7c591ad retry 2020-05-07 15:46:39 +05:30
74d3bc154b fix issue with py2.7 2020-05-07 15:34:41 +05:30
a8e94dfb25 Update README.md 2020-05-07 15:14:55 +05:30
cc38798b32 Update README.md 2020-05-07 15:14:30 +05:30
bc3dd44f27 Update README.md 2020-05-07 15:13:58 +05:30
ba46cdafe2 Update README.md 2020-05-07 15:12:37 +05:30
538afb14e9 Update test_1.py 2020-05-07 15:06:52 +05:30
7605b614ee test for total_archives() 2020-05-07 15:00:28 +05:30
d0a4e25cf5 Update __init__.py 2020-05-07 14:53:09 +05:30
8c5c0153da + total_archives() 2020-05-07 14:52:05 +05:30
e7dac74906 Update __init__.py 2020-05-07 09:06:49 +05:30
c686708c9e more testing 2020-05-07 08:59:09 +05:30
f9ae8ada70 Update test_1.py 2020-05-07 08:39:24 +05:30
e56ece3dc9 Update README.md 2020-05-07 08:23:31 +05:30
db127a5c54 always return https 2020-05-06 20:16:25 +05:30
ed497bbd23 Update wrapper.py 2020-05-06 20:07:25 +05:30
45fe07ddb6 Update wrapper.py 2020-05-06 19:35:01 +05:30
0029d63d8a 503 API Service Temporarily Unavailable 2020-05-06 19:22:56 +05:30
beb5b625ec Set theme jekyll-theme-cayman 2020-05-06 12:20:43 +05:30
b40d734346 Update README.md 2020-05-06 09:18:02 +05:30
be0a30de85 Create index.rst 2020-05-05 20:22:46 +05:30
10 changed files with 465 additions and 68 deletions


@ -4,8 +4,11 @@
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![Maintainability](https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability)](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[![CodeFactor](https://www.codefactor.io/repository/github/akamhy/waybackpy/badge)](https://www.codefactor.io/repository/github/akamhy/waybackpy)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
![pypi](https://img.shields.io/pypi/v/wayback.svg)
![pypi](https://img.shields.io/pypi/v/waybackpy.svg)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
@ -27,6 +30,8 @@ Table of contents
* [Receiving the recent most/newest archive for an URL using newest()](https://github.com/akamhy/waybackpy#receiving-the-newest-archive-for-an-url-using-newest)
* [Receiving archive close to a specified year, month, day, hour, and minute using near()](https://github.com/akamhy/waybackpy#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage using get()](https://github.com/akamhy/waybackpy#get-the-content-of-webpage-using-get)
* [Count total archives for an URL using total_archives()](https://github.com/akamhy/waybackpy#count-total-archives-for-an-url-using-total_archives)
* [Tests](https://github.com/akamhy/waybackpy#tests)
@ -142,6 +147,23 @@ print(webpage)
```
> This should print the source code for <https://example.com/>.
#### Count total archives for an URL using total_archives()
```diff
+ waybackpy.total_archives(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
```python
from waybackpy import total_archives
# Counting the total number of archives for a URL.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# Supported arguments are url and UA.
count = total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA="User-Agent")
print(count)
```
> This should print an integer (int), the total number of archives of the URL on archive.org.
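For readers curious how such a count might be derived, here is a minimal offline sketch. It assumes the CDX server's `output=json` format, in which the first row is a header of field names. The `count_cdx_rows` helper and the sample payload are hypothetical and are not part of waybackpy.

```python
import json


def count_cdx_rows(payload):
    """Count capture rows in a CDX JSON payload (hypothetical helper)."""
    rows = json.loads(payload)
    if not rows:
        return 0  # an empty list means no captures
    # Assumption: the first row is the header of field names, not a capture.
    return len(rows) - 1


# Made-up sample payload shaped like a CDX JSON response.
sample = json.dumps([
    ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
    ["com,example)/", "20020120142510", "http://example.com:80/", "text/html", "200", "HASH1", "1792"],
    ["com,example)/", "20020328012821", "http://www.example.com:80/", "text/html", "200", "HASH2", "1792"],
])

print(count_cdx_rows(sample))  # prints 2
print(count_cdx_rows("[]"))    # prints 0
```

If the live response includes such a header row, a plain `len()` of the decoded list would be one higher than the number of captures, which may be worth keeping in mind when reading the count.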
## Tests
* [Here](https://github.com/akamhy/waybackpy/tree/master/tests)

1
_config.yml Normal file

@ -0,0 +1 @@
theme: jekyll-theme-cayman

232
index.rst Normal file

@ -0,0 +1,232 @@
waybackpy
=========
|Build Status| |Downloads| |Release| |Codacy Badge| |License: MIT|
|Maintainability| |CodeFactor| |made-with-python| |pypi| |PyPI - Python
Version| |Maintenance|
.. |Build Status| image:: https://travis-ci.org/akamhy/waybackpy.svg?branch=master
:target: https://travis-ci.org/akamhy/waybackpy
.. |Downloads| image:: https://img.shields.io/pypi/dm/waybackpy.svg
:target: https://pypistats.org/packages/waybackpy
.. |Release| image:: https://img.shields.io/github/v/release/akamhy/waybackpy.svg
:target: https://github.com/akamhy/waybackpy/releases
.. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65
:target: https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
:target: https://github.com/akamhy/waybackpy/blob/master/LICENSE
.. |Maintainability| image:: https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability
:target: https://codeclimate.com/github/akamhy/waybackpy/maintainability
.. |CodeFactor| image:: https://www.codefactor.io/repository/github/akamhy/waybackpy/badge
:target: https://www.codefactor.io/repository/github/akamhy/waybackpy
.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
:target: https://www.python.org/
.. |pypi| image:: https://img.shields.io/pypi/v/waybackpy.svg
.. |PyPI - Python Version| image:: https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square
.. |Maintenance| image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
:target: https://github.com/akamhy/waybackpy/graphs/commit-activity
|Internet Archive| |Wayback Machine|
waybackpy is a Python wrapper for the `Internet Archive`_\ 's `Wayback
Machine`_.
.. _Internet Archive: https://en.wikipedia.org/wiki/Internet_Archive
.. _Wayback Machine: https://en.wikipedia.org/wiki/Wayback_Machine
.. |Internet Archive| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png
.. |Wayback Machine| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png
Installation
------------
Using `pip`_:
**pip install waybackpy**
.. _pip: https://en.wikipedia.org/wiki/Pip_(package_manager)
Usage
-----
Archiving aka Saving an url Using save()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.save(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
import waybackpy
# Capturing a new archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
print(archived_url)
This should print something similar to the following archived URL:
https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy
Receiving the oldest archive for an URL Using oldest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.oldest(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
import waybackpy
# retrieving the oldest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
print(oldest_archive)
This returns the oldest available archive for https://google.com.
http://web.archive.org/web/19981111184551/http://google.com:80/
Receiving the newest archive for an URL using newest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.newest(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
import waybackpy
# retrieving the newest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
print(newest_archive)
This returns the newest available archive for
https://www.microsoft.com/en-us, something just like this:
http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/
Receiving archive close to a specified year, month, day, hour, and minute using near()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)
..
url is mandatory. year, month, day, hour and minute are optional
arguments. UA is not mandatory, but highly recommended.
.. code:: python
import waybackpy
# Retrieving the closest archive to a specified year.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# Supported arguments are year, month, day, hour and minute.
archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
print(archive_near_year)
returns:
http://web.archive.org/web/20100504071154/http://www.facebook.com/
``waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``
returns:
http://web.archive.org/web/20101111173430/http://www.facebook.com//
``waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``
returns:
http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html
Please note that if you specify only the year, month and day default to
the current month and day. Passing just the year does not return the
archive closest to January of that year; it returns the one closest to
the month in which you run the package. For example, if you call
``waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``
in July 2018, you get the archive nearest to July 2011, not January
2011. To target January, pass month=1 explicitly.
Do not zero-pad the year, month, day, hour, and minute arguments. For
example, for January set month = 1, not month = 01.
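The defaulting and padding rules above can be illustrated with a small offline sketch. It mirrors the idea of the ``wayback_timestamp`` helper in ``wrapper.py``, but ``build_timestamp`` and its ``now`` parameter are hypothetical names introduced here for illustration, not waybackpy's API.

```python
from datetime import datetime


def build_timestamp(now=None, **kwargs):
    """Build a YYYYMMDDhhmm string, filling missing parts from `now`.

    Hypothetical helper: `now` only exists to make the example
    deterministic; it defaults to the current UTC time at call time.
    """
    if now is None:
        now = datetime.utcnow()
    parts = [
        ("year", now.year, 4),
        ("month", now.month, 2),
        ("day", now.day, 2),
        ("hour", now.hour, 2),
        ("minute", now.minute, 2),
    ]
    # Unpadded integers such as month=1 are zero-filled here.
    return "".join(str(kwargs.get(name, default)).zfill(width)
                   for name, default, width in parts)


fixed_now = datetime(2018, 7, 15, 9, 30)
print(build_timestamp(now=fixed_now, year=2011))           # 201107150930
print(build_timestamp(now=fixed_now, year=2011, month=1))  # 201101150930
```

Padding is done on the string side, which is why callers pass plain integers. In Python 3 a literal such as ``01`` is a syntax error, and Python 2 reads a leading zero as octal.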
Get the content of webpage using get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.get(url, encoding="UTF-8", UA=user_agent)
..
url is mandatory. UA is not, but highly recommended. encoding is
detected automatically; do not specify it unless necessary.
.. code:: python
from waybackpy import get
# Retrieving the webpage from any URL, including archived URLs. No other libraries need to be imported :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# Supported arguments are url, encoding and UA.
webpage = get("https://example.com/", UA="User-Agent")
print(webpage)
..
This should print the source code for https://example.com/.
Count total archives for an URL using total_archives()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: diff
+ waybackpy.total_archives(url, UA=user_agent)
..
url is mandatory. UA is not, but highly recommended.
.. code:: python
from waybackpy import total_archives
# Counting the total number of archives for a URL.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# Supported arguments are url and UA.
count = total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA="User-Agent")
print(count)
..
This should print an integer (int), the total number of archives of
the URL on archive.org.
Tests
-----
- `Here`_
Dependency
----------
- None, just python standard libraries (json, urllib and datetime).
Both python 2 and 3 are supported :)
License
-------
`MIT License`_
.. _Here: https://github.com/akamhy/waybackpy/tree/master/tests
.. _MIT License: https://github.com/akamhy/waybackpy/blob/master/LICENSE


@ -1,2 +1,3 @@
[metadata]
description-file = README.md
license_file = LICENSE


@ -4,21 +4,25 @@ from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), 'waybackpy', '__version__.py')) as f:
exec(f.read(), about)
setup(
name = 'waybackpy',
name = about['__title__'],
packages = ['waybackpy'],
version = 'v1.4',
description = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily.",
version = about['__version__'],
description = about['__description__'],
long_description=long_description,
long_description_content_type='text/markdown',
license='MIT',
author = 'akamhy',
author_email = 'akash3pro@gmail.com',
url = 'https://github.com/akamhy/waybackpy',
download_url = 'https://github.com/akamhy/waybackpy/archive/v1.4.tar.gz',
license= about['__license__'],
author = about['__author__'],
author_email = about['__author_email__'],
url = about['__url__'],
download_url = 'https://github.com/akamhy/waybackpy/archive/v1.5.tar.gz',
keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
install_requires=[],
python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
python_requires= ">=2.7",
classifiers=[
'Development Status :: 5 - Production/Stable',
'Intended Audience :: Developers',
@ -28,13 +32,18 @@ setup(
'Programming Language :: Python',
'Programming Language :: Python :: 2',
'Programming Language :: Python :: 2.7',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.2',
'Programming Language :: Python :: 3.3',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
'Programming Language :: Python :: Implementation :: PyPy'
],
project_urls={
'Documentation': 'https://waybackpy.readthedocs.io',
'Source': 'https://github.com/akamhy/waybackpy',
},
)
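The `about = {}` and `exec(f.read(), about)` lines above load package metadata from `__version__.py` without importing the package itself. A self-contained sketch of that pattern follows. It uses a temporary directory and made-up metadata values (`examplepkg`, `v0.1`) that are assumptions for illustration only.

```python
import os
import tempfile

# Write a stand-in __version__.py to a temporary directory (made-up values).
tmp_dir = tempfile.mkdtemp()
version_path = os.path.join(tmp_dir, "__version__.py")
with open(version_path, "w") as f:
    f.write('__title__ = "examplepkg"\n__version__ = "v0.1"\n')

# Execute the file with `about` as its namespace; the assignments land there.
about = {}
with open(version_path) as f:
    exec(f.read(), about)

print(about["__title__"], about["__version__"])  # examplepkg v0.1
```

Keeping the metadata in one file this way means the version string only has to be changed in a single place.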


@ -6,6 +6,16 @@ import pytest
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
def test_clean_url():
test_url = " https://en.wikipedia.org/wiki/Network security "
answer = "https://en.wikipedia.org/wiki/Network_security"
test_result = waybackpy.clean_url(test_url)
assert answer == test_result
def test_url_check():
InvalidUrl = "http://wwwgooglecom/"
with pytest.raises(Exception) as e_info:
waybackpy.url_check(InvalidUrl)
def test_save():
# Test for urls that exist and can be archived.
@ -16,31 +26,35 @@ def test_save():
# Test for urls that are incorrect.
with pytest.raises(Exception) as e_info:
url2 = "ha ha ha ha"
archived_url2 = waybackpy.save(url2, UA=user_agent)
waybackpy.save(url2, UA=user_agent)
# Test for urls that robots.txt does not allow to be archived.
with pytest.raises(Exception) as e_info:
url3 = "http://www.archive.is/faq.html"
archived_url3 = waybackpy.save(url3, UA=user_agent)
waybackpy.save(url3, UA=user_agent)
# Test for non-existent urls.
with pytest.raises(Exception) as e_info:
url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
archived_url4 = waybackpy.save(url4, UA=user_agent)
def test_near():
url = "google.com"
archive_near_year = waybackpy.near(url, year=2010, UA=user_agent)
assert "2010" in archive_near_year
archive_near_month_year = waybackpy.near(url, year=2015, month=2, UA=user_agent)
assert "201502" in archive_near_month_year
assert ("201502" in archive_near_month_year) or ("201501" in archive_near_month_year) or ("201503" in archive_near_month_year)
archive_near_day_month_year = waybackpy.near(url, year=2006, month=11, day=15, UA=user_agent)
assert "20061115" in archive_near_day_month_year
assert ("20061114" in archive_near_day_month_year) or ("20061115" in archive_near_day_month_year) or ("20061116" in archive_near_day_month_year)
archive_near_hour_day_month_year = waybackpy.near("www.python.org", year=2008, month=5, day=9, hour=15, UA=user_agent)
assert "2008050915" in archive_near_hour_day_month_year
assert ("2008050915" in archive_near_hour_day_month_year) or ("2008050914" in archive_near_hour_day_month_year) or ("2008050913" in archive_near_hour_day_month_year)
with pytest.raises(Exception) as e_info:
NeverArchivedUrl = "https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
waybackpy.near(NeverArchivedUrl, year=2010, UA=user_agent)
def test_oldest():
url = "github.com/akamhy/waybackpy"
@ -51,8 +65,34 @@ def test_newest():
url = "github.com/akamhy/waybackpy"
archive_newest = waybackpy.newest(url, UA=user_agent)
assert url in archive_newest
def test_get():
oldest_google_archive = waybackpy.oldest("google.com", UA=user_agent)
oldest_google_page_text = waybackpy.get(oldest_google_archive, UA=user_agent)
assert "Welcome to Google" in oldest_google_page_text
def test_total_archives():
count1 = waybackpy.total_archives("https://en.wikipedia.org/wiki/Python (programming language)", UA=user_agent)
assert count1 > 2000
count2 = waybackpy.total_archives("https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8", UA=user_agent)
assert count2 == 0
if __name__ == "__main__":
test_clean_url()
print(".")
test_url_check()
print(".")
test_get()
print(".")
test_near()
print(".")
test_newest()
print(".")
test_save()
print(".")
test_oldest()
print(".")
test_total_archives()
print(".")


@ -1,6 +1,30 @@
# -*- coding: utf-8 -*-
from .wrapper import save, near, oldest, newest, get
__version__ = "v1.4"
# ┏┓┏┓┏┓━━━━━━━━━━┏━━┓━━━━━━━━━━┏┓━━┏━━━┓━━━━━
# ┃┃┃┃┃┃━━━━━━━━━━┃┏┓┃━━━━━━━━━━┃┃━━┃┏━┓┃━━━━━
# ┃┃┃┃┃┃┏━━┓━┏┓━┏┓┃┗┛┗┓┏━━┓━┏━━┓┃┃┏┓┃┗━┛┃┏┓━┏┓
# ┃┗┛┗┛┃┗━┓┃━┃┃━┃┃┃┏━┓┃┗━┓┃━┃┏━┛┃┗┛┛┃┏━━┛┃┃━┃┃
# ┗┓┏┓┏┛┃┗┛┗┓┃┗━┛┃┃┗━┛┃┃┗┛┗┓┃┗━┓┃┏┓┓┃┃━━━┃┗━┛┃
# ━┗┛┗┛━┗━━━┛┗━┓┏┛┗━━━┛┗━━━┛┗━━┛┗┛┗┛┗┛━━━┗━┓┏┛
# ━━━━━━━━━━━┏━┛┃━━━━━━━━━━━━━━━━━━━━━━━━┏━┛┃━
# ━━━━━━━━━━━┗━━┛━━━━━━━━━━━━━━━━━━━━━━━━┗━━┛━
__all__ = ['wrapper', 'exceptions']
"""
A python wrapper for Internet Archive's Wayback Machine API.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Archive pages and retrieve archived pages easily.
Usage:
>>> import waybackpy
>>> new_archive = waybackpy.save('https://www.python.org')
>>> print(new_archive)
https://web.archive.org/web/20200502170312/https://www.python.org/
Full documentation @ <https://akamhy.github.io/waybackpy/>.
:copyright: (c) 2020 by akamhy.
:license: MIT
"""
from .wrapper import save, near, oldest, newest, get, clean_url, url_check, total_archives
from .__version__ import __title__, __description__, __url__, __version__
from .__version__ import __author__, __author_email__, __license__, __copyright__

8
waybackpy/__version__.py Normal file

@ -0,0 +1,8 @@
__title__ = "waybackpy"
__description__ = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily."
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "v1.5"
__author__ = "akamhy"
__author_email__ = "akash3pro@gmail.com"
__license__ = "MIT"
__copyright__ = "Copyright 2020 akamhy"


@ -32,6 +32,11 @@ class BadGateWay(Exception):
Raised when 502 bad gateway.
"""
class WaybackUnavailable(Exception):
"""
Raised when 503 API Service Temporarily Unavailable.
"""
class InvalidUrl(Exception):
"""
Raised when url doesn't follow the standard url format.


@ -1,88 +1,143 @@
# -*- coding: utf-8 -*-
import json
from datetime import datetime
from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl
from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl, WaybackUnavailable
try:
from urllib.request import Request, urlopen
from urllib.error import HTTPError
from urllib.error import HTTPError, URLError
except ImportError:
from urllib2 import Request, urlopen, HTTPError
from urllib2 import Request, urlopen, HTTPError, URLError
default_UA = "waybackpy python package"
def url_check(url):
if "." not in url:
raise InvalidUrl("'%s' is not a valid url." % url)
def clean_url(url):
return str(url).strip().replace(" ","_")
def save(url,UA=default_UA):
base_save_url = "https://web.archive.org/save/"
request_url = (base_save_url + clean_url(url))
def wayback_timestamp(**kwargs):
return (
str(kwargs["year"])
+
str(kwargs["month"]).zfill(2)
+
str(kwargs["day"]).zfill(2)
+
str(kwargs["hour"]).zfill(2)
+
str(kwargs["minute"]).zfill(2)
)
def handle_HTTPError(e):
if e.code == 502:
raise BadGateWay(e)
elif e.code == 503:
raise WaybackUnavailable(e)
elif e.code == 429:
raise TooManyArchivingRequests(e)
elif e.code == 404:
raise UrlNotFound(e)
def save(url, UA=default_UA):
url_check(url)
request_url = ("https://web.archive.org/save/" + clean_url(url))
hdr = { 'User-Agent' : '%s' % UA } #nosec
req = Request(request_url, headers=hdr) #nosec
if "." not in url:
raise InvalidUrl("'%s' is not a valid url." % url)
try:
response = urlopen(req) #nosec
except HTTPError as e:
if e.code == 502:
raise BadGateWay(e)
elif e.code == 429:
raise TooManyArchivingRequests(e)
elif e.code == 404:
if handle_HTTPError(e) is None:
raise PageNotSaved(e)
except URLError:
try:
response = urlopen(req) #nosec
except URLError as e:
raise UrlNotFound(e)
else:
raise PageNotSaved(e)
header = response.headers
if "exclusion.robots.policy" in str(header):
raise ArchivingNotAllowed("Can not archive %s. Disabled by site owner." % (url))
archive_id = header['Content-Location']
archived_url = "https://web.archive.org" + archive_id
return archived_url
def get(url,encoding=None,UA=default_UA):
return "https://web.archive.org" + header['Content-Location']
def get(url, encoding=None, UA=default_UA):
url_check(url)
hdr = { 'User-Agent' : '%s' % UA }
request_url = clean_url(url)
req = Request(request_url, headers=hdr) #nosec
resp=urlopen(req) #nosec
req = Request(clean_url(url), headers=hdr) #nosec
try:
resp=urlopen(req) #nosec
except URLError:
try:
resp=urlopen(req) #nosec
except URLError as e:
raise UrlNotFound(e)
if encoding is None:
try:
encoding= resp.headers['content-type'].split('charset=')[-1]
except AttributeError:
encoding = "UTF-8"
return resp.read().decode(encoding)
def wayback_timestamp(year,month,day,hour,minute):
year = str(year)
month = str(month).zfill(2)
day = str(day).zfill(2)
hour = str(hour).zfill(2)
minute = str(minute).zfill(2)
return (year+month+day+hour+minute)
return resp.read().decode(encoding.replace("text/html", "UTF-8", 1))
def near(
url,
year=datetime.utcnow().strftime('%Y'),
month=datetime.utcnow().strftime('%m'),
day=datetime.utcnow().strftime('%d'),
hour=datetime.utcnow().strftime('%H'),
minute=datetime.utcnow().strftime('%M'),
UA=default_UA,
):
timestamp = wayback_timestamp(year,month,day,hour,minute)
def near(url, **kwargs):
try:
url = kwargs["url"]
except KeyError:
url = url
year=kwargs.get("year", datetime.utcnow().strftime('%Y'))
month=kwargs.get("month", datetime.utcnow().strftime('%m'))
day=kwargs.get("day", datetime.utcnow().strftime('%d'))
hour=kwargs.get("hour", datetime.utcnow().strftime('%H'))
minute=kwargs.get("minute", datetime.utcnow().strftime('%M'))
UA=kwargs.get("UA", default_UA)
url_check(url)
timestamp = wayback_timestamp(year=year,month=month,day=day,hour=hour,minute=minute)
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (clean_url(url), str(timestamp))
hdr = { 'User-Agent' : '%s' % UA }
req = Request(request_url, headers=hdr) # nosec
response = urlopen(req) #nosec
try:
response = urlopen(req) #nosec
except HTTPError as e:
handle_HTTPError(e)
data = json.loads(response.read().decode("UTF-8"))
if not data["archived_snapshots"]:
raise ArchiveNotFound("'%s' is not yet archived." % url)
archive_url = (data["archived_snapshots"]["closest"]["url"])
# The Wayback Machine sometimes returns http URLs; rewrite them to https, which it also supports.
archive_url = archive_url.replace("http://web.archive.org/web/","https://web.archive.org/web/",1)
return archive_url
def oldest(url,UA=default_UA,year=1994):
return near(url,year=year,UA=UA)
def oldest(url, UA=default_UA, year=1994):
return near(url, year=year, UA=UA)
def newest(url,UA=default_UA):
return near(url,UA=UA)
def newest(url, UA=default_UA):
return near(url, UA=UA)
def total_archives(url, UA=default_UA):
url_check(url)
hdr = { 'User-Agent' : '%s' % UA }
request_url = "https://web.archive.org/cdx/search/cdx?url=%s&output=json" % clean_url(url)
req = Request(request_url, headers=hdr) # nosec
try:
response = urlopen(req) #nosec
except HTTPError as e:
handle_HTTPError(e)
return (len(json.loads(response.read())))
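One observation on `handle_HTTPError` as it appears in this diff: for a status code it does not list, it returns `None`, and `near()` and `total_archives()` then continue to a `response` name that was never assigned. A minimal sketch of one possible alternative follows, using a lookup table with a fallback that always raises. The exception classes are redefined locally so the block runs on its own, and `raise_for_status` and `WaybackError` are hypothetical names, not part of waybackpy.

```python
class WaybackError(Exception):
    """Hypothetical fallback for status codes without a dedicated class."""


class BadGateWay(Exception):
    """502 Bad Gateway."""


class WaybackUnavailable(Exception):
    """503 Service Temporarily Unavailable."""


class TooManyArchivingRequests(Exception):
    """429 Too Many Requests."""


class UrlNotFound(Exception):
    """404 Not Found."""


# Status code -> exception class, same pairs as handle_HTTPError in the diff.
STATUS_TO_EXCEPTION = {
    502: BadGateWay,
    503: WaybackUnavailable,
    429: TooManyArchivingRequests,
    404: UrlNotFound,
}


def raise_for_status(code, detail=""):
    """Always raise: a mapped exception, or the fallback for unknown codes."""
    exc_class = STATUS_TO_EXCEPTION.get(code, WaybackError)
    raise exc_class("HTTP %s %s" % (code, detail))


for status in (503, 500):
    try:
        raise_for_status(status)
    except Exception as exc:
        print(status, type(exc).__name__)
```

Because the helper never returns normally, a caller cannot fall through to code that expects a successful response.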