Compare commits

41 Commits
v1.1 ... v1.4

Author SHA1 Message Date
3a65a60bd6 Update README.md 2020-05-05 19:08:26 +05:30
7b626f5ea5 Update README.md 2020-05-05 17:54:38 +05:30
73371d6c68 Update README.md 2020-05-05 17:49:23 +05:30
8904ba4d67 Update README.md 2020-05-05 17:47:55 +05:30
b4a7f7ea6f Update README.md 2020-05-05 17:47:00 +05:30
a2ead04021 Update README.md 2020-05-05 17:44:12 +05:30
3513feb075 Update __init__.py 2020-05-05 17:37:38 +05:30
d34b98373f Update setup.py 2020-05-05 17:37:16 +05:30
38f3b81742 Update .travis.yml 2020-05-05 17:27:58 +05:30
660a826aed Update .travis.yml 2020-05-05 17:21:36 +05:30
a52d035c0e Update .travis.yml 2020-05-05 17:19:24 +05:30
6737ce0e26 Create .travis.yml 2020-05-05 17:14:57 +05:30
98cc918c8f Update test_1.py 2020-05-05 17:10:33 +05:30
b103bfc6e4 Create test_1.py 2020-05-05 16:29:55 +05:30
edd05838b8 v1.3 2020-05-05 11:29:22 +05:30
031212e161 v1.3 2020-05-05 11:28:58 +05:30
d3bd5b05b5 Update setup.py 2020-05-05 10:50:09 +05:30
d6598a67b9 Update setup.py 2020-05-05 10:40:23 +05:30
e5a6057249 Update setup.py 2020-05-05 10:39:10 +05:30
2a1b3bc6ee Update setup.py 2020-05-05 10:36:05 +05:30
b4ca98eca2 Update setup.py 2020-05-05 10:32:06 +05:30
36b01754ec Update setup.py 2020-05-05 10:23:38 +05:30
3d8bf4eec6 Update setup.py 2020-05-05 10:22:54 +05:30
e7761b3709 Update README.md 2020-05-05 10:21:08 +05:30
df851dce0c Update setup.py 2020-05-05 10:16:15 +05:30
f5acbcfc95 Update exceptions.py 2020-05-05 10:07:27 +05:30
44156e5e7e Update exceptions.py 2020-05-05 10:05:47 +05:30
a6cb955669 Update wrapper.py 2020-05-05 10:04:40 +05:30
8acb14a243 Update wrapper.py 2020-05-05 10:00:29 +05:30
7d434c3f0f Update wrapper.py 2020-05-05 09:57:39 +05:30
057c61d677 Update wrapper.py 2020-05-05 09:48:39 +05:30
6705c04f38 Update wrapper.py 2020-05-05 09:43:13 +05:30
e631c0aadb Update README.md 2020-05-05 09:37:53 +05:30
423782ea75 Update README.md 2020-05-05 09:36:11 +05:30
7944f0878d Add .whitesource configuration file (#6)
Co-authored-by: whitesource-bolt-for-github[bot] <42819689+whitesource-bolt-for-github[bot]@users.noreply.github.com>
2020-05-05 09:33:50 +05:30
850b055527 Update README.md 2020-05-05 09:31:43 +05:30
32bc765113 Update README.md (#5)
* Update README.md

* Update README.md

* Update README.md
2020-05-05 09:27:02 +05:30
09b4ba2649 Version 1.2 with bug fixes and support for webpage retrieval (#4) 2020-05-05 09:03:16 +05:30
929790feca Update README.md (#1)
Add usage/ documentaion
2020-05-04 21:06:00 +05:30
09a521ae43 Create setup.cfg 2020-05-04 16:23:00 +05:30
a503be5a86 Create setup.py 2020-05-04 16:21:24 +05:30
9 changed files with 310 additions and 16 deletions

14
.travis.yml Normal file

@ -0,0 +1,14 @@
language: python
python:
- "2.7"
- "3.6"
- "3.8"
os: linux
dist: xenial
cache: pip
install:
- pip install pytest
before_script:
  cd tests
script:
- pytest test_1.py

8
.whitesource Normal file

@ -0,0 +1,8 @@
{
  "checkRunSettings": {
    "vulnerableCheckRunConclusionLevel": "failure"
  },
  "issueSettings": {
    "minSeverityLevel": "LOW"
  }
}

156
README.md

@ -1,2 +1,154 @@
# pywayback
A python wrapper for Internet Archive's Wayback Machine
# waybackpy
[![Build Status](https://travis-ci.org/akamhy/waybackpy.svg?branch=master)](https://travis-ci.org/akamhy/waybackpy)
[![Downloads](https://img.shields.io/pypi/dm/waybackpy.svg)](https://pypistats.org/packages/waybackpy)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
![pypi](https://img.shields.io/pypi/v/wayback.svg)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
![Internet Archive](https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png)
![Wayback Machine](https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png)
waybackpy is a Python wrapper for [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine).
Table of contents
=================
<!--ts-->
* [Installation](https://github.com/akamhy/waybackpy#installation)
* [Usage](https://github.com/akamhy/waybackpy#usage)
* [Saving an url using save()](https://github.com/akamhy/waybackpy#capturing-aka-saving-an-url-using-save)
* [Receiving the oldest archive for an URL Using oldest()](https://github.com/akamhy/waybackpy#receiving-the-oldest-archive-for-an-url-using-oldest)
* [Receiving the newest archive for an URL using newest()](https://github.com/akamhy/waybackpy#receiving-the-newest-archive-for-an-url-using-newest)
* [Receiving archive close to a specified year, month, day, hour, and minute using near()](https://github.com/akamhy/waybackpy#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage using get()](https://github.com/akamhy/waybackpy#get-the-content-of-webpage-using-get)
* [Tests](https://github.com/akamhy/waybackpy#tests)
* [Dependency](https://github.com/akamhy/waybackpy#dependency)
* [License](https://github.com/akamhy/waybackpy#license)
<!--te-->
## Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
**pip install waybackpy**
## Usage
#### Capturing aka Saving an url Using save()
```diff
+ waybackpy.save(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
```python
import waybackpy
# Capturing a new archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
print(archived_url)
```
This should print something similar to the following archived URL:
<https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy>
#### Receiving the oldest archive for an URL Using oldest()
```diff
+ waybackpy.oldest(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
```python
import waybackpy
# retrieving the oldest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
print(oldest_archive)
```
This returns the oldest available archive for <https://google.com>.
<http://web.archive.org/web/19981111184551/http://google.com:80/>
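The archived URLs shown here embed a 14-digit capture timestamp (`YYYYMMDDhhmmss`) right after `/web/`. As a minimal sketch, assuming that URL layout, the helper below pulls the timestamp out with the standard library. `snapshot_datetime` is a hypothetical name used for illustration and is not part of waybackpy.

```python
import re
from datetime import datetime

# Hypothetical helper, not part of waybackpy.
# Matches the 14-digit capture timestamp that follows "/web/".
_SNAPSHOT_RE = re.compile(r"web\.archive\.org/web/(\d{14})")


def snapshot_datetime(archive_url):
    """Return the capture time embedded in a Wayback Machine URL."""
    match = _SNAPSHOT_RE.search(archive_url)
    if match is None:
        raise ValueError("no 14-digit timestamp in %r" % archive_url)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


print(snapshot_datetime("http://web.archive.org/web/19981111184551/http://google.com:80/"))
# 1998-11-11 18:45:51
```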
#### Receiving the newest archive for an URL using newest()
```diff
+ waybackpy.newest(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
```python
import waybackpy
# retrieving the newest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
print(newest_archive)
```
This returns the newest available archive for <https://www.microsoft.com/en-us>, something just like this:
<http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/>
#### Receiving archive close to a specified year, month, day, hour, and minute using near()
```diff
+ waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)
```
> url is mandatory. year, month, day, hour and minute are optional arguments. UA is not mandatory, but highly recommended.
```python
import waybackpy
# retrieving the closest archive from a specified year.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported arguments are year, month, day, hour and minute
archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
print(archive_near_year)
```
returns : <http://web.archive.org/web/20100504071154/http://www.facebook.com/>
```waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20101111173430/http://www.facebook.com//>
```waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html>
> Please note that if you specify only the year, the current month and day are used as the defaults for month and day. Passing only the year does not return the archive closest to January of that year; it returns the archive closest to the current month in that year. For example, if you call ```waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``` in July 2018, you get the archive nearest to July 2011, not January 2011. Specify month=1 for January.
> Do not pad the year, month, day, hour, and minute arguments with zeros. For example, for January set month = 1 and not month = 01.
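A minimal sketch of why padding is unnecessary, assuming the 12-digit `YYYYMMDDhhmm` timestamp that `near()` sends to the `https://archive.org/wayback/available` endpoint, and the `archived_snapshots` / `closest` / `url` keys it reads from the JSON reply. The helper names `build_timestamp` and `closest_snapshot_url` and the sample reply are illustrative assumptions, not waybackpy API.

```python
import json


def build_timestamp(year, month, day, hour, minute):
    """Zero-pad each field to two digits and join them as YYYYMMDDhhmm."""
    parts = [str(year)] + [str(v).zfill(2) for v in (month, day, hour, minute)]
    return "".join(parts)


def closest_snapshot_url(response_text):
    """Pick the closest snapshot URL out of an availability JSON reply."""
    data = json.loads(response_text)
    snapshots = data.get("archived_snapshots") or {}
    if "closest" not in snapshots:
        return None  # nothing archived for this URL
    return snapshots["closest"]["url"]


# Hand-written sample reply for illustration; not captured from a real request.
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html",
            "timestamp": "20190105054437",
            "status": "200",
        }
    }
})

print(build_timestamp(2019, 1, 5, 0, 0))  # 201901050000
print(closest_snapshot_url(sample))
```

This sketch returns `None` when no snapshot is present; a wrapper may prefer to raise an exception instead.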
#### Get the content of webpage using get()
```diff
+ waybackpy.get(url, encoding="UTF-8", UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended. encoding is detected automatically, don't specify unless necessary.
```python
from waybackpy import get
# retrieving the webpage from any url, including archived urls. No need to import other libraries :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported arguments are url, encoding and UA
webpage = get("https://example.com/", UA="User-Agent")
print(webpage)
```
> This should print the source code for <https://example.com/>.
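Automatic encoding detection relies on the charset declared in the response's `Content-Type` header. The following is a minimal sketch of one way to parse that header with the standard library, falling back to UTF-8 when no charset is declared. `charset_from_content_type` is a hypothetical helper for illustration, not waybackpy's own implementation.

```python
from email.message import Message


def charset_from_content_type(content_type, default="UTF-8"):
    """Return the charset named in a Content-Type value, or the default."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the lower-cased charset parameter,
    # or None when the header declares no charset.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # UTF-8
```

Unlike a plain split on `charset=`, this returns the default for a header such as `text/html` that declares no charset.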
## Tests
* [Here](https://github.com/akamhy/waybackpy/tree/master/tests)
## Dependency
* None, just python standard libraries (json, urllib and datetime). Both python 2 and 3 are supported :)
## License
[MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE)

2
setup.cfg Normal file

@ -0,0 +1,2 @@
[metadata]
description-file = README.md

40
setup.py Normal file

@ -0,0 +1,40 @@
import os.path
from setuptools import setup

with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
    long_description = f.read()

setup(
    name = 'waybackpy',
    packages = ['waybackpy'],
    version = 'v1.4',
    description = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily.",
    long_description=long_description,
    long_description_content_type='text/markdown',
    license='MIT',
    author = 'akamhy',
    author_email = 'akash3pro@gmail.com',
    url = 'https://github.com/akamhy/waybackpy',
    download_url = 'https://github.com/akamhy/waybackpy/archive/v1.4.tar.gz',
    keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
    install_requires=[],
    python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Intended Audience :: Developers',
        'Natural Language :: English',
        'Topic :: Software Development :: Build Tools',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
        'Programming Language :: Python :: Implementation :: CPython',
        'Programming Language :: Python :: Implementation :: PyPy'
    ],
)

58
tests/test_1.py Normal file

@ -0,0 +1,58 @@
import sys
sys.path.append("..")
import waybackpy
import pytest

user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"

def test_save():
    # Test for urls that exist and can be archived.
    url1="https://github.com/akamhy/waybackpy"
    archived_url1 = waybackpy.save(url1, UA=user_agent)
    assert url1 in archived_url1

    # Test for urls that are incorrect.
    with pytest.raises(Exception) as e_info:
        url2 = "ha ha ha ha"
        archived_url2 = waybackpy.save(url2, UA=user_agent)

    # Test for urls not allowed to archive by robots.txt.
    with pytest.raises(Exception) as e_info:
        url3 = "http://www.archive.is/faq.html"
        archived_url3 = waybackpy.save(url3, UA=user_agent)

    # Non existent urls, test
    with pytest.raises(Exception) as e_info:
        url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
        archived_url4 = waybackpy.save(url4, UA=user_agent)

def test_near():
    url = "google.com"
    archive_near_year = waybackpy.near(url, year=2010, UA=user_agent)
    assert "2010" in archive_near_year

    archive_near_month_year = waybackpy.near(url, year=2015, month=2, UA=user_agent)
    assert "201502" in archive_near_month_year

    archive_near_day_month_year = waybackpy.near(url, year=2006, month=11, day=15, UA=user_agent)
    assert "20061115" in archive_near_day_month_year

    archive_near_hour_day_month_year = waybackpy.near("www.python.org", year=2008, month=5, day=9, hour=15, UA=user_agent)
    assert "2008050915" in archive_near_hour_day_month_year

def test_oldest():
    url = "github.com/akamhy/waybackpy"
    archive_oldest = waybackpy.oldest(url, UA=user_agent)
    assert "20200504141153" in archive_oldest

def test_newest():
    url = "github.com/akamhy/waybackpy"
    archive_newest = waybackpy.newest(url, UA=user_agent)
    assert url in archive_newest

def test_get():
    oldest_google_archive = waybackpy.oldest("google.com", UA=user_agent)
    oldest_google_page_text = waybackpy.get(oldest_google_archive, UA=user_agent)
    assert "Welcome to Google" in oldest_google_page_text

View File

@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
from .wrapper import save, near, oldest, newest
from .wrapper import save, near, oldest, newest, get
__version__ = "1.1"
__version__ = "v1.4"
__all__ = ['wrapper', 'exceptions']

View File

@ -1,14 +1,14 @@
# -*- coding: utf-8 -*-
class TooManyArchivingRequests(Exception):
"""
Error when a single url is requested for archiving too many times in a short timespan.
"""Error when a single url is requested for archiving too many times in a short timespan.
Wayback machine doesn't support archiving any url too many times in a short period of time.
"""
class ArchivingNotAllowed(Exception):
"""
Files like robots.txt are set to deny robot archiving.
"""Files like robots.txt are set to deny robot archiving.
Wayback machine respects these files and will not archive.
"""

View File

@ -1,6 +1,7 @@
# -*- coding: utf-8 -*-
import json
from datetime import datetime
from waybackpy.exceptions import *
from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl
try:
from urllib.request import Request, urlopen
from urllib.error import HTTPError
@ -16,8 +17,8 @@ def clean_url(url):
def save(url,UA=default_UA):
base_save_url = "https://web.archive.org/save/"
request_url = (base_save_url + clean_url(url))
hdr = { 'User-Agent' : '%s' % UA }
req = Request(request_url, headers=hdr)
hdr = { 'User-Agent' : '%s' % UA } #nosec
req = Request(request_url, headers=hdr) #nosec
if "." not in url:
raise InvalidUrl("'%s' is not a vaild url." % url)
try:
@ -39,6 +40,26 @@ def save(url,UA=default_UA):
archived_url = "https://web.archive.org" + archive_id
return archived_url
def get(url,encoding=None,UA=default_UA):
hdr = { 'User-Agent' : '%s' % UA }
request_url = clean_url(url)
req = Request(request_url, headers=hdr) #nosec
resp=urlopen(req) #nosec
if encoding is None:
try:
encoding= resp.headers['content-type'].split('charset=')[-1]
except AttributeError:
encoding = "UTF-8"
return resp.read().decode(encoding)
def wayback_timestamp(year,month,day,hour,minute):
year = str(year)
month = str(month).zfill(2)
day = str(day).zfill(2)
hour = str(hour).zfill(2)
minute = str(minute).zfill(2)
return (year+month+day+hour+minute)
def near(
url,
year=datetime.utcnow().strftime('%Y'),
@ -48,16 +69,15 @@ def near(
minute=datetime.utcnow().strftime('%M'),
UA=default_UA,
):
timestamp = str(year)+str(month)+str(day)+str(hour)+str(minute)
timestamp = wayback_timestamp(year,month,day,hour,minute)
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (clean_url(url), str(timestamp))
hdr = { 'User-Agent' : '%s' % UA }
req = Request(request_url, headers=hdr)
req = Request(request_url, headers=hdr) # nosec
response = urlopen(req) #nosec
import json
data = json.loads(response.read().decode('utf8'))
data = json.loads(response.read().decode("UTF-8"))
if not data["archived_snapshots"]:
raise ArchiveNotFound("'%s' is not yet archived." % url)
archive_url = (data["archived_snapshots"]["closest"]["url"])
return archive_url