Compare commits
41 Commits
Author | SHA1 | Date | |
---|---|---|---|
3a65a60bd6 | |||
7b626f5ea5 | |||
73371d6c68 | |||
8904ba4d67 | |||
b4a7f7ea6f | |||
a2ead04021 | |||
3513feb075 | |||
d34b98373f | |||
38f3b81742 | |||
660a826aed | |||
a52d035c0e | |||
6737ce0e26 | |||
98cc918c8f | |||
b103bfc6e4 | |||
edd05838b8 | |||
031212e161 | |||
d3bd5b05b5 | |||
d6598a67b9 | |||
e5a6057249 | |||
2a1b3bc6ee | |||
b4ca98eca2 | |||
36b01754ec | |||
3d8bf4eec6 | |||
e7761b3709 | |||
df851dce0c | |||
f5acbcfc95 | |||
44156e5e7e | |||
a6cb955669 | |||
8acb14a243 | |||
7d434c3f0f | |||
057c61d677 | |||
6705c04f38 | |||
e631c0aadb | |||
423782ea75 | |||
7944f0878d | |||
850b055527 | |||
32bc765113 | |||
09b4ba2649 | |||
929790feca | |||
09a521ae43 | |||
a503be5a86 |
14
.travis.yml
Normal file
@@ -0,0 +1,14 @@
language: python
python:
  - "2.7"
  - "3.6"
  - "3.8"
os: linux
dist: xenial
cache: pip
install:
  - pip install pytest
before_script:
  cd tests
script:
  - pytest test_1.py
8
.whitesource
Normal file
@@ -0,0 +1,8 @@
{
  "checkRunSettings": {
    "vulnerableCheckRunConclusionLevel": "failure"
  },
  "issueSettings": {
    "minSeverityLevel": "LOW"
  }
}
156
README.md
@@ -1,2 +1,154 @@
# pywayback
A python wrapper for Internet Archive's Wayback Machine
# waybackpy
[](https://travis-ci.org/akamhy/waybackpy)
[](https://pypistats.org/packages/waybackpy)
[](https://github.com/akamhy/waybackpy/releases)
[](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade)
[](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[](https://www.python.org/)
[](https://github.com/akamhy/waybackpy/graphs/commit-activity)

waybackpy is a Python wrapper for [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine).

Table of contents
=================
<!--ts-->

* [Installation](https://github.com/akamhy/waybackpy#installation)

* [Usage](https://github.com/akamhy/waybackpy#usage)
  * [Saving an url using save()](https://github.com/akamhy/waybackpy#capturing-aka-saving-an-url-using-save)
  * [Receiving the oldest archive for an URL Using oldest()](https://github.com/akamhy/waybackpy#receiving-the-oldest-archive-for-an-url-using-oldest)
  * [Receiving the recent most/newest archive for an URL using newest()](https://github.com/akamhy/waybackpy#receiving-the-newest-archive-for-an-url-using-newest)
  * [Receiving archive close to a specified year, month, day, hour, and minute using near()](https://github.com/akamhy/waybackpy#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
  * [Get the content of webpage using get()](https://github.com/akamhy/waybackpy#get-the-content-of-webpage-using-get)

* [Tests](https://github.com/akamhy/waybackpy#tests)

* [Dependency](https://github.com/akamhy/waybackpy#dependency)

* [License](https://github.com/akamhy/waybackpy#license)

<!--te-->

## Installation
Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):

**pip install waybackpy**


## Usage

#### Capturing aka Saving an url Using save()

```diff
+ waybackpy.save(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.
```python
import waybackpy
# Capturing a new archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
archived_url = waybackpy.save("https://github.com/akamhy/waybackpy", UA = "Any-User-Agent")
print(archived_url)
```
This should print something similar to the following archived URL:

<https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy>

#### Receiving the oldest archive for an URL Using oldest()

```diff
+ waybackpy.oldest(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.


```python
import waybackpy
# retrieving the oldest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
oldest_archive = waybackpy.oldest("https://www.google.com/", UA = "Any-User-Agent")
print(oldest_archive)
```
This returns the oldest available archive for <https://google.com>.

<http://web.archive.org/web/19981111184551/http://google.com:80/>

#### Receiving the newest archive for an URL using newest()

```diff
+ waybackpy.newest(url, UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended.


```python
import waybackpy
# retrieving the newest archive on Wayback machine.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
newest_archive = waybackpy.newest("https://www.microsoft.com/en-us", UA = "Any-User-Agent")
print(newest_archive)
```
This returns the newest available archive for <https://www.microsoft.com/en-us>, something like this:

<http://web.archive.org/web/20200429033402/https://www.microsoft.com/en-us/>

#### Receiving archive close to a specified year, month, day, hour, and minute using near()

```diff
+ waybackpy.near(url, year=2020, month=1, day=1, hour=1, minute=1, UA=user_agent)
```
> url is mandatory. year, month, day, hour and minute are optional arguments. UA is not mandatory, but highly recommended.


```python
import waybackpy
# retrieving the closest archive to a specified year.
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported arguments are year, month, day, hour and minute
archive_near_year = waybackpy.near("https://www.facebook.com/", year=2010, UA ="Any-User-Agent")
print(archive_near_year)
```
returns: <http://web.archive.org/web/20100504071154/http://www.facebook.com/>

```waybackpy.near("https://www.facebook.com/", year=2010, month=1, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20101111173430/http://www.facebook.com//>

```waybackpy.near("https://www.oracle.com/index.html", year=2019, month=1, day=5, UA ="Any-User-Agent")``` returns: <http://web.archive.org/web/20190105054437/https://www.oracle.com/index.html>
> Please note that if you specify only the year, month and day default to the current month and day. Passing only the year does not return the archive closest to January of that year; it returns the one closest to the month in which you run the package. For example, if you call ```waybackpy.near("https://www.facebook.com/", year=2011, UA ="Any-User-Agent")``` in July 2018, you get the archive nearest to July 2011, not January 2011. To target January, specify month=1.

> Do not zero-pad the year, month, day, hour, and minute arguments. For example, for January set month = 1, not month = 01.
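
The two notes above can be illustrated with a small sketch. `build_timestamp` is a hypothetical helper written for this example, not part of waybackpy's public API; it assumes the timestamp has the form `YYYYMMDDhhmm` and that unspecified fields fall back to the current UTC time, as the notes describe.

```python
from datetime import datetime, timezone


def build_timestamp(year=None, month=None, day=None, hour=None, minute=None, now=None):
    """Hypothetical helper: build a YYYYMMDDhhmm string for near()-style lookups."""
    # Fields that are not given fall back to the current UTC time,
    # which is why passing only a year still targets the current month.
    now = now or datetime.now(timezone.utc)
    fields = [
        (year, now.year, 4),
        (month, now.month, 2),
        (day, now.day, 2),
        (hour, now.hour, 2),
        (minute, now.minute, 2),
    ]
    # Accept unpadded integers (month=1) and pad them here.
    return "".join(
        str(default if value is None else value).zfill(width)
        for value, default, width in fields
    )


# A fixed "now" keeps the example reproducible (assumed date: 15 July 2018, 12:30 UTC).
fixed_now = datetime(2018, 7, 15, 12, 30, tzinfo=timezone.utc)
print(build_timestamp(year=2011, now=fixed_now))           # 201107151230
print(build_timestamp(year=2011, month=1, now=fixed_now))  # 201101151230
```

Passing plain integers also sidesteps a language pitfall: in Python 3 a literal such as `01` is a syntax error, so padding is best left to the code that builds the timestamp.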

#### Get the content of webpage using get()

```diff
+ waybackpy.get(url, encoding="UTF-8", UA=user_agent)
```
> url is mandatory. UA is not, but highly recommended. encoding is detected automatically, don't specify unless necessary.

```python
from waybackpy import get
# retrieving the webpage from any url, including archived urls. No need to import other libraries :)
# Default user-agent (UA) is "waybackpy python package", if not specified in the call.
# supported arguments are url, encoding and UA
webpage = get("https://example.com/", UA="User-Agent")
print(webpage)
```
> This should print the source code for <https://example.com/>.
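
As a rough illustration of what automatic encoding detection can look like, the sketch below reads the charset from a `Content-Type` header value and falls back to UTF-8 when none is declared. `charset_from_content_type` is a hypothetical helper for this example and is not waybackpy's implementation; it relies only on the standard library's `email.message.Message` parser.

```python
from email.message import Message


def charset_from_content_type(content_type, default="UTF-8"):
    """Hypothetical helper: return the charset named in a Content-Type value."""
    if not content_type:
        return default
    # Let the standard library parse the header parameters.
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # iso-8859-1
print(charset_from_content_type("text/html"))                      # UTF-8

# Decoding a response body with the detected (or fallback) encoding.
body = b"caf\xc3\xa9"
print(body.decode(charset_from_content_type("text/html")))         # café
```

Handling the "no charset declared" case explicitly matters because a header such as `text/html` contains no encoding name at all, so splitting the raw string on `charset=` would yield a value that cannot be used for decoding.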

## Tests
* [Here](https://github.com/akamhy/waybackpy/tree/master/tests)

## Dependency
* None, just python standard libraries (json, urllib and datetime). Both python 2 and 3 are supported :)


## License

[MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE)

40
setup.py
Normal file
@@ -0,0 +1,40 @@
import os.path
from setuptools import setup

with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
    long_description = f.read()

setup(
    name = 'waybackpy',
    packages = ['waybackpy'],
    version = 'v1.4',
    description = "A python wrapper for Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily.",
    long_description=long_description,
    long_description_content_type='text/markdown',
    license='MIT',
    author = 'akamhy',
    author_email = 'akash3pro@gmail.com',
    url = 'https://github.com/akamhy/waybackpy',
    download_url = 'https://github.com/akamhy/waybackpy/archive/v1.4.tar.gz',
    keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
    install_requires=[],
    python_requires=">=2.7, !=3.0.*, !=3.1.*, !=3.2.*, !=3.3.*",
    classifiers=[
        'Development Status :: 5 - Production/Stable',
        'Intended Audience :: Developers',
        'Natural Language :: English',
        'Topic :: Software Development :: Build Tools',
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python',
        'Programming Language :: Python :: 2',
        'Programming Language :: Python :: 2.7',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.4',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
        'Programming Language :: Python :: Implementation :: CPython',
        'Programming Language :: Python :: Implementation :: PyPy'
    ],
)
58
tests/test_1.py
Normal file
@@ -0,0 +1,58 @@
import sys
sys.path.append("..")
import waybackpy
import pytest


user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"


def test_save():
    # Test for urls that exist and can be archived.
    url1="https://github.com/akamhy/waybackpy"
    archived_url1 = waybackpy.save(url1, UA=user_agent)
    assert url1 in archived_url1

    # Test for urls that are incorrect.
    with pytest.raises(Exception) as e_info:
        url2 = "ha ha ha ha"
        archived_url2 = waybackpy.save(url2, UA=user_agent)

    # Test for urls not allowed to be archived by robots.txt.
    with pytest.raises(Exception) as e_info:
        url3 = "http://www.archive.is/faq.html"
        archived_url3 = waybackpy.save(url3, UA=user_agent)

    # Test for non-existent urls.
    with pytest.raises(Exception) as e_info:
        url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
        archived_url4 = waybackpy.save(url4, UA=user_agent)

def test_near():
    url = "google.com"
    archive_near_year = waybackpy.near(url, year=2010, UA=user_agent)
    assert "2010" in archive_near_year

    archive_near_month_year = waybackpy.near(url, year=2015, month=2, UA=user_agent)
    assert "201502" in archive_near_month_year

    archive_near_day_month_year = waybackpy.near(url, year=2006, month=11, day=15, UA=user_agent)
    assert "20061115" in archive_near_day_month_year

    archive_near_hour_day_month_year = waybackpy.near("www.python.org", year=2008, month=5, day=9, hour=15, UA=user_agent)
    assert "2008050915" in archive_near_hour_day_month_year

def test_oldest():
    url = "github.com/akamhy/waybackpy"
    archive_oldest = waybackpy.oldest(url, UA=user_agent)
    assert "20200504141153" in archive_oldest

def test_newest():
    url = "github.com/akamhy/waybackpy"
    archive_newest = waybackpy.newest(url, UA=user_agent)
    assert url in archive_newest

def test_get():
    oldest_google_archive = waybackpy.oldest("google.com", UA=user_agent)
    oldest_google_page_text = waybackpy.get(oldest_google_archive, UA=user_agent)
    assert "Welcome to Google" in oldest_google_page_text
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-
from .wrapper import save, near, oldest, newest
from .wrapper import save, near, oldest, newest, get

__version__ = "1.1"
__version__ = "v1.4"

__all__ = ['wrapper', 'exceptions']
@@ -1,14 +1,14 @@
# -*- coding: utf-8 -*-

class TooManyArchivingRequests(Exception):
    """
    Error when a single url is requested for archiving too many times in a short timespan.

    """Error when a single url is requested for archiving too many times in a short timespan.
    Wayback machine doesn't support archiving any url too many times in a short period of time.
    """

class ArchivingNotAllowed(Exception):
    """
    Files like robots.txt are set to deny robot archiving.

    """Files like robots.txt are set to deny robot archiving.
    Wayback machine respects these files, will not archive.
    """

@@ -1,6 +1,7 @@
# -*- coding: utf-8 -*-
import json
from datetime import datetime
from waybackpy.exceptions import *
from waybackpy.exceptions import TooManyArchivingRequests, ArchivingNotAllowed, PageNotSaved, ArchiveNotFound, UrlNotFound, BadGateWay, InvalidUrl
try:
    from urllib.request import Request, urlopen
    from urllib.error import HTTPError
@@ -16,8 +17,8 @@ def clean_url(url):
def save(url,UA=default_UA):
    base_save_url = "https://web.archive.org/save/"
    request_url = (base_save_url + clean_url(url))
    hdr = { 'User-Agent' : '%s' % UA }
    req = Request(request_url, headers=hdr)
    hdr = { 'User-Agent' : '%s' % UA } #nosec
    req = Request(request_url, headers=hdr) #nosec
    if "." not in url:
        raise InvalidUrl("'%s' is not a valid url." % url)
    try:
@@ -39,6 +40,26 @@ def save(url,UA=default_UA):
    archived_url = "https://web.archive.org" + archive_id
    return archived_url

def get(url,encoding=None,UA=default_UA):
    hdr = { 'User-Agent' : '%s' % UA }
    request_url = clean_url(url)
    req = Request(request_url, headers=hdr) #nosec
    resp=urlopen(req) #nosec
    if encoding is None:
        try:
            encoding= resp.headers['content-type'].split('charset=')[-1]
        except AttributeError:
            encoding = "UTF-8"
    return resp.read().decode(encoding)

def wayback_timestamp(year,month,day,hour,minute):
    year = str(year)
    month = str(month).zfill(2)
    day = str(day).zfill(2)
    hour = str(hour).zfill(2)
    minute = str(minute).zfill(2)
    return (year+month+day+hour+minute)

def near(
    url,
    year=datetime.utcnow().strftime('%Y'),
@@ -48,16 +69,15 @@ def near(
    minute=datetime.utcnow().strftime('%M'),
    UA=default_UA,
    ):
    timestamp = str(year)+str(month)+str(day)+str(hour)+str(minute)
    timestamp = wayback_timestamp(year,month,day,hour,minute)
    request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (clean_url(url), str(timestamp))
    hdr = { 'User-Agent' : '%s' % UA }
    req = Request(request_url, headers=hdr)
    req = Request(request_url, headers=hdr) # nosec
    response = urlopen(req) #nosec
    import json
    data = json.loads(response.read().decode('utf8'))
    data = json.loads(response.read().decode("UTF-8"))
    if not data["archived_snapshots"]:
        raise ArchiveNotFound("'%s' is not yet archived." % url)


    archive_url = (data["archived_snapshots"]["closest"]["url"])
    return archive_url
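
The request and response handling in the last hunk can be sketched offline, without calling the network. The names `availability_url` and `closest_snapshot` are hypothetical, and the sample payload is hand-written for illustration, modeled only on the keys the hunk reads (`archived_snapshots`, `closest`, `url`); a real response may carry more fields.

```python
import json
from urllib.parse import urlencode


def availability_url(url, timestamp):
    """Hypothetical helper: build the availability request URL with encoded parameters."""
    query = urlencode({"url": url, "timestamp": timestamp})
    return "https://archive.org/wayback/available?" + query


def closest_snapshot(payload):
    """Hypothetical helper: return the closest snapshot URL, or None if nothing is archived."""
    data = json.loads(payload)
    # An empty "archived_snapshots" object means no archive was found.
    closest = (data.get("archived_snapshots") or {}).get("closest")
    return closest["url"] if closest else None


# Hand-written sample payloads (assumed shape, not captured from a real response).
found = '{"archived_snapshots": {"closest": {"url": "http://web.archive.org/web/20100504071154/http://www.facebook.com/"}}}'
missing = '{"archived_snapshots": {}}'

print(availability_url("example.com", "201001010101"))
print(closest_snapshot(found))
print(closest_snapshot(missing))  # None
```

Returning `None` for the empty case is a choice made for this sketch; the code in the diff raises `ArchiveNotFound` instead.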