* Waybackpy save example on replit

* Oldest example

* Newest method replit link

* Near method example

* Get example

* Total archive method example

2020-07-19 22:28:08 +05:30

8.0 KiB

Raw Blame History

waybackpy

Waybackpy is a Python library that interfaces with the Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily.

Installation
Usage
Tests
Dependency
License

Installation

Using pip:

pip install waybackpy

Usage

Capturing aka Saving an url using save()

import waybackpy

new_archive_url = waybackpy.Url(

    url = "https://en.wikipedia.org/wiki/Multivariable_calculus",
    user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
    
).save()

print(new_archive_url)

https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy

_{Try this out in your browser @ https://repl.it/@akamhy/WaybackPySaveExample}

Receiving the oldest archive for an URL using oldest()

import waybackpy

oldest_archive_url = waybackpy.Url(

    "https://www.google.com/",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
    
).oldest()

print(oldest_archive_url)

http://web.archive.org/web/19981111184551/http://google.com:80/

_{Try this out in your browser @ https://repl.it/@akamhy/WaybackPyOldestExample}

Receiving the newest archive for an URL using newest()

import waybackpy

newest_archive_url = waybackpy.Url(

    "https://www.facebook.com/",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
    
).newest()

print(newest_archive_url)

https://web.archive.org/web/20200714013225/https://www.facebook.com/

_{Try this out in your browser @ https://repl.it/@akamhy/WaybackPyNewestExample}

Receiving archive close to a specified year, month, day, hour, and minute using near()

from waybackpy import Url

user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
github_url = "https://github.com/"


github_wayback_obj = Url(github_url, user_agent)

# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.

github_archive_near_2010 = github_wayback_obj.near(year=2010)
print(github_archive_near_2010)

https://web.archive.org/web/20100719134402/http://github.com/

github_archive_near_2011_may = github_wayback_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)

https://web.archive.org/web/20110519185447/https://github.com/

github_archive_near_2015_january_26 = github_wayback_obj.near(
    year=2015, month=1, day=26
)
print(github_archive_near_2015_january_26)

https://web.archive.org/web/20150127031159/https://github.com

github_archive_near_2018_4_july_9_2_am = github_wayback_obj.near(
    year=2018, month=7, day=4, hour = 9, minute = 2
)
print(github_archive_near_2018_4_july_9_2_am)

https://web.archive.org/web/20180704090245/https://github.com/

_{The library doesn't supports seconds yet. You are encourged to create a PR ;)}

_{Try this out in your browser @ https://repl.it/@akamhy/WaybackPyNearExample}

Get the content of webpage using get()

import waybackpy

google_url = "https://www.google.com/"

User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"

waybackpy_url_object = waybackpy.Url(google_url, User_Agent)


# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)


# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(
    waybackpy_url_object.save()
)
print(google_newest_archive_source)


# waybackpy_url_object.oldest() type is str, it's oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(
    waybackpy_url_object.oldest()
)
print(google_oldest_archive_source)

_{Try this out in your browser @ https://repl.it/@akamhy/WaybackPyGetExample#main.py}

Count total archives for an URL using total_archives()

import waybackpy

URL = "https://en.wikipedia.org/wiki/Python (programming language)"

UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"

archive_count = waybackpy.Url(
    url=URL,
    user_agent=UA
).total_archives()

print(archive_count) # total_archives() returns an int

_{Try this out in your browser @ https://repl.it/@akamhy/WaybackPyTotalArchivesExample}

Tests

Here

Dependency

None, just python standard libraries (re, json, urllib and datetime). Both python 2 and 3 are supported :)

License

MIT License

8.0 KiB Raw Blame History