Compare commits

31 Commits
2.2.0 ... 2.3.0

Author SHA1 Message Date
d3e68d0e70 code formated with black (#47) 2020-12-14 01:18:04 +05:30
fde28d57aa Update CONTRIBUTING.md 2020-12-14 00:16:29 +05:30
6092e504c8 Update CONTRIBUTING.md 2020-12-14 00:15:51 +05:30
93ef60ecd2 v2.3.0 (#46)
* v2.3.0

* v2.3.0

* decrease line length
2020-12-14 00:14:54 +05:30
461b3f74c9 UPDATE header image url 2020-12-13 23:09:59 +05:30
3c53b411b0 Improve the appearance of readme (#45)
* replaced text header wth image

* svg

* Update README.md

* Update README.md

* Update README.md

* level 2

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Create CONTRIBUTING.md

* Update README.md

* Add files via upload

* Update README.md

* Delete waybackpy-colored 284.png

* Delete waybackpy colored.png

* Update README.md

* Update index.rst

* Update index.rst

* Update index.rst

* Update setup.py

* Delete index.rst

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README.md
2020-12-13 23:08:16 +05:30
8125526061 create pyup.io config file (#44) 2020-12-13 22:31:49 +05:30
2dc81569a8 Create .pep8speaks.yml 2020-12-13 17:58:09 +05:30
fd163f3d36 Update wrapper.py 2020-12-13 17:12:32 +05:30
a0a918cf0d . 2020-12-13 17:10:28 +05:30
4943cf6873 remove print stmnt, update ci 2020-12-13 16:37:35 +05:30
bc3efc7d63 now using requests lib as it handles errors nicely (#42)
* now using requests lib as it handles errors nicely

* remove unused import (urllib)

* FIX : replaced full_url with endpoint (not using urlib)

* LINT :  Found in waybackpy\wrapper.py:88  Unnecessary else after return
2020-12-13 15:44:37 +05:30
f89368f16d LINT : Found in waybackpy\wrapper.py:88 Unnecessary else after return 2020-12-13 15:39:23 +05:30
c919a6a605 FIX : replaced full_url with endpoint (not using urlib) 2020-12-13 15:22:56 +05:30
0280fca189 remove unused import (urllib) 2020-12-13 15:13:51 +05:30
60ee8b95a8 now using requests lib as it handles errors nicely 2020-12-13 15:05:57 +05:30
ca51c14332 deleted .travis.yml, link with flake (#41)
close #38
2020-11-26 13:06:50 +05:30
525cf17c6f Update ci.yml 2020-11-26 12:14:15 +05:30
406e03c52f Update ci.yml 2020-11-26 12:04:45 +05:30
672b33e83a Update ci.yml 2020-11-26 10:10:10 +05:30
b19b840628 Update ci.yml 2020-11-26 10:01:55 +05:30
a6df4f899c Update ci.yml 2020-11-26 09:26:11 +05:30
7686e9c20d Update README.md (#40) 2020-11-26 09:18:26 +05:30
3c5932bc39 now using gh actions (#39) 2020-11-26 09:09:53 +05:30
f9a986f489 Create ci.yml 2020-11-26 08:55:23 +05:30
0d7458ee90 per https://docs.travis-ci.com/user/languages/python/, Python builds are not available on the macOS 2020-11-26 08:08:59 +05:30
ac8b9d6a50 use osx, huge backlog on .org travis for linux builds 2020-11-26 08:03:27 +05:30
58cd9c28e7 Threading enabled checking for URLs 2020-11-26 06:15:42 +05:30
5088305a58 removed python2 compatibility code 2020-11-21 17:00:11 +05:30
9f847a5e55 change pepy.tech download count link, they removed the month page 2020-11-11 10:44:14 +05:30
6c04c2f3d3 + https://github.com/akamhy/waybackpy/graphs/contributors 2020-11-04 08:09:30 +05:30
21 changed files with 801 additions and 910 deletions

42
.github/workflows/ci.yml vendored Normal file

@ -0,0 +1,42 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions
name: CI
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.8']
    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8 pytest codecov pytest-cov
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest --cov=waybackpy tests/
    - name: Upload coverage to Codecov
      run: |
        bash <(curl -s https://codecov.io/bash) -t ${{ secrets.CODECOV_TOKEN }}

4
.pep8speaks.yml Normal file

@ -0,0 +1,4 @@
# File : .pep8speaks.yml
scanner:
  diff_only: True # If True, errors caused by only the patch are shown

5
.pyup.yml Normal file

@ -0,0 +1,5 @@
# autogenerated pyup.io config file
# see https://pyup.io/docs/configuration/ for all available options
schedule: ''
update: false


@ -1,18 +0,0 @@
language: python
os: linux
dist: xenial
cache: pip
python:
- 3.6
- 3.8
before_install:
- python --version
- pip install -U pip
- pip install -U pytest
- pip install codecov
- pip install pytest pytest-cov
script:
- cd tests
- pytest --cov=../waybackpy
after_success:
- if [[ $TRAVIS_PYTHON_VERSION == 3.8 ]]; then python -m codecov; fi

58
CONTRIBUTING.md Normal file

@ -0,0 +1,58 @@
# Contributing to waybackpy
We love your input! We want to make contributing to this project as easy and transparent as possible, whether it's:
- Reporting a bug
- Discussing the current state of the code
- Submitting a fix
- Proposing new features
- Becoming a maintainer
## We Develop with GitHub
We use GitHub to host code, to track issues and feature requests, and to accept pull requests.
## We Use [GitHub Flow](https://guides.github.com/introduction/flow/index.html), So All Code Changes Happen Through Pull Requests
Pull requests are the best way to propose changes to the codebase (we use [GitHub Flow](https://guides.github.com/introduction/flow/index.html)). We actively welcome your pull requests:
1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. Issue that pull request!
## Any contributions you make will be under the MIT Software License
In short, when you submit code changes, your submissions are understood to be under the same [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE) that covers the project. Feel free to contact the maintainers if that's a concern.
## Report bugs using GitHub's [issues](https://github.com/akamhy/waybackpy/issues)
We use GitHub issues to track public bugs. Report a bug by [opening a new issue](https://github.com/akamhy/waybackpy/issues/new); it's that easy!
## Write bug reports with detail, background, and sample code
**Great Bug Reports** tend to have:
- A quick summary and/or background
- Steps to reproduce
  - Be specific!
  - Give sample code if you can.
- What you expected would happen
- What actually happens
- Notes (possibly including why you think this might be happening, or stuff you tried that didn't work)
People *love* thorough bug reports. I'm not even kidding.
## Use a Consistent Coding Style
* You can try running `flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics` for style unification.
## License
By contributing, you agree that your contributions will be licensed under its [MIT License](https://github.com/akamhy/waybackpy/blob/master/LICENSE).
## References
This document is forked from [this gist](https://gist.github.com/briandk/3d2e8b3ec8daf5a27a62) by [briandk](https://github.com/briandk) which was itself adapted from the open-source contribution guidelines for [Facebook's Draft](https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md)


@ -1,6 +1,6 @@
MIT License
Copyright (c) 2020 Akash Mahanty (https://github.com/akamhy)
Copyright (c) 2020 waybackpy contributors ( https://github.com/akamhy/waybackpy/graphs/contributors )
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal


@ -1,23 +1,25 @@
# waybackpy
<div align="center">
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
</div>
![contributions welcome](https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square)
[![Build Status](https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square)](https://travis-ci.org/akamhy/waybackpy)
[![codecov](https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg)](https://codecov.io/gh/akamhy/waybackpy)
[![Downloads](https://pepy.tech/badge/waybackpy/month)](https://pepy.tech/project/waybackpy/month)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![Maintainability](https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability)](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[![CodeFactor](https://www.codefactor.io/repository/github/akamhy/waybackpy/badge)](https://www.codefactor.io/repository/github/akamhy/waybackpy)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
-----------------
## Python package & CLI tool that interfaces with the Wayback Machine API.
[![pypi](https://img.shields.io/pypi/v/waybackpy.svg)](https://pypi.org/project/waybackpy/)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
![Repo size](https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
[![Build Status](https://github.com/akamhy/waybackpy/workflows/CI/badge.svg)](https://github.com/akamhy/waybackpy/actions)
[![codecov](https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg)](https://codecov.io/gh/akamhy/waybackpy)
[![contributions welcome](https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square)](https://github.com/akamhy/waybackpy/blob/master/CONTRIBUTING.md)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65)](https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&amp;utm_medium=referral&amp;utm_content=akamhy/waybackpy&amp;utm_campaign=Badge_Grade)
[![Downloads](https://pepy.tech/badge/waybackpy/month)](https://pepy.tech/project/waybackpy)
[![Release](https://img.shields.io/github/v/release/akamhy/waybackpy.svg)](https://github.com/akamhy/waybackpy/releases)
[![Maintainability](https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability)](https://codeclimate.com/github/akamhy/waybackpy/maintainability)
[![made-with-python](https://img.shields.io/badge/Made%20with-Python-1f425f.svg)](https://www.python.org/)
[![Maintenance](https://img.shields.io/badge/Maintained%3F-yes-green.svg)](https://github.com/akamhy/waybackpy/graphs/commit-activity)
[![GitHub last commit](https://img.shields.io/github/last-commit/akamhy/waybackpy?color=blue&style=flat-square)](https://github.com/akamhy/waybackpy/commits/master)
![PyPI - Python Version](https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square)
![Wayback Machine](https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy-colored%20284.png)
Waybackpy is a Python package that interfaces with [Internet Archive](https://en.wikipedia.org/wiki/Internet_Archive)'s [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine) API. Archive webpages and retrieve archived webpages easily.
Table of contents
=================
@ -30,8 +32,8 @@ Table of contents
* [Saving a webpage](#capturing-aka-saving-an-url-using-save)
* [Retrieving archive](#retrieving-the-archive-for-an-url-using-archive_url)
* [Retrieving the oldest archive](#retrieving-the-oldest-archive-for-an-url-using-oldest)
* [Retrieving the recent most/newest archive](#retrieving-the-newest-archive-for-an-url-using-newest)
* [Retrieving the JSON response of availability API](#retrieving-the-json-reponse-for-the-avaliblity-api-request)
* [Retrieving the latest/newest archive](#retrieving-the-newest-archive-for-an-url-using-newest)
* [Retrieving the JSON response of availability API](#retrieving-the-json-response-for-the-availability-api-request)
* [Retrieving archive close to a specified year, month, day, hour, and minute](#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
* [Get the content of webpage](#get-the-content-of-webpage-using-get)
* [Count total archives for an URL](#count-total-archives-for-an-url-using-total_archives)
@ -50,8 +52,6 @@ Table of contents
* [Tests](#tests)
* [Dependency](#dependency)
* [Packaging](#packaging)
* [License](#license)
@ -76,7 +76,7 @@ pip install git+https://github.com/akamhy/waybackpy.git
### As a Python package
#### Capturing aka Saving an url using save()
#### Capturing aka Saving an URL using save()
```python
import waybackpy
@ -152,7 +152,7 @@ https://web.archive.org/web/20201016150543/https://www.facebook.com/
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNewestExample></sub>
#### Retrieving the JSON reponse for the avaliblity API request
#### Retrieving the JSON response for the availability API request
```python
import waybackpy
@ -220,7 +220,7 @@ print(github_archive_near_2018_4_july_9_2_am)
https://web.archive.org/web/20180704090245/https://github.com/
```
<sub>The package doesn't support second argument yet. You are encourged to create a PR ;)</sub>
<sub>The package doesn't support the seconds' argument yet. You are encouraged to create a PR ;)</sub>
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyNearExample></sub>
@ -374,10 +374,10 @@ https://web.archive.org/web/20120512142515/https://www.facebook.com/
#### Get the source code
```bash
waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the URL
waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on Wayback machine then print the source code of this archive.
```
<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashGet></sub>
@ -403,7 +403,7 @@ waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --ali
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io inclusing subdomain
# Prints all known URLs under akamhy.github.io including subdomain
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain --alive
@ -415,22 +415,28 @@ waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --sub
## Tests
[Here](https://github.com/akamhy/waybackpy/tree/master/tests)
To run tests locally:
1) Install or update the testing/coverage tools
```bash
pip install -U pytest
pip install codecov
pip install pytest pytest-cov
cd tests
pytest --cov=../waybackpy
python -m codecov #For reporting coverage on Codecov
pip install codecov pytest pytest-cov -U
```
## Dependency
2) Inside the repository run the following commands
```bash
pytest --cov=waybackpy tests/
```
3) To report coverage run
```bash
bash <(curl -s https://codecov.io/bash) -t SECRET_CODECOV_TOKEN
```
You can find the tests [here](https://github.com/akamhy/waybackpy/tree/master/tests).
None, just pre-installed [python standard libraries](https://docs.python.org/3/library/).
## Packaging


@ -1 +1 @@
theme: jekyll-theme-cayman
theme: jekyll-theme-cayman

Binary file not shown (before: 56 KiB).

Binary file not shown (before: 18 KiB).

85
assets/waybackpy_logo.svg Normal file

@ -0,0 +1,85 @@
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:cc="http://creativecommons.org/ns#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:svg="http://www.w3.org/2000/svg"
xmlns="http://www.w3.org/2000/svg"
id="svg8"
version="1.1"
viewBox="0 0 176.61171 41.907883"
height="41.907883mm"
width="176.61171mm">
<defs
id="defs2" />
<metadata
id="metadata5">
<rdf:RDF>
<cc:Work
rdf:about="">
<dc:format>image/svg+xml</dc:format>
<dc:type
rdf:resource="http://purl.org/dc/dcmitype/StillImage" />
<dc:title></dc:title>
</cc:Work>
</rdf:RDF>
</metadata>
<g
transform="translate(-0.74835286,-98.31182)"
id="layer1">
<flowRoot
transform="scale(0.26458333)"
style="font-style:normal;font-weight:normal;font-size:40px;line-height:1.25;font-family:sans-serif;letter-spacing:0px;word-spacing:0px;fill:#000000;fill-opacity:1;stroke:none"
id="flowRoot4598"
xml:space="preserve"><flowRegion
id="flowRegion4600"><rect
y="415.4129"
x="-38.183765"
height="48.08326"
width="257.38687"
id="rect4602" /></flowRegion><flowPara
id="flowPara4604"></flowPara></flowRoot> <text
transform="scale(0.86288797,1.158899)"
id="text4777"
y="110.93711"
x="0.93061"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;line-height:4.25;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke:none;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
xml:space="preserve"><tspan
style="stroke-width:7.51955223"
id="tspan4775"
y="110.93711"
x="0.93061"><tspan
id="tspan4773"
style="font-style:normal;font-variant:normal;font-weight:bold;font-stretch:normal;font-size:28.14887619px;font-family:sans-serif;-inkscape-font-specification:'sans-serif, Bold';font-variant-ligatures:normal;font-variant-caps:normal;font-variant-numeric:normal;font-feature-settings:normal;text-align:start;letter-spacing:3.56786728px;writing-mode:lr-tb;text-anchor:start;fill:#003dff;fill-opacity:1;stroke-width:7.51955223;stroke-miterlimit:4;stroke-dasharray:none"
y="110.93711"
x="0.93061">waybackpy</tspan></tspan></text>
<rect
y="98.311821"
x="1.4967092"
height="4.8643045"
width="153.78688"
id="rect4644"
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#000080;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4648"
width="153.78688"
height="4.490128"
x="23.573174"
y="135.72957" />
<rect
y="135.72957"
x="0.74835336"
height="4.4901319"
width="22.82482"
id="rect4650"
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none" />
<rect
style="opacity:1;fill:#ff00ff;fill-opacity:1;stroke:#00ff00;stroke-width:0;stroke-miterlimit:4;stroke-dasharray:none"
id="rect4652"
width="21.702286"
height="4.8643003"
x="155.2836"
y="98.311821" />
</g>
</svg>

Rendered image (after): 3.6 KiB

531
index.rst

@ -1,531 +0,0 @@
waybackpy
=========
|contributions welcome| |Build Status| |codecov| |Downloads| |Release|
|Codacy Badge| |Maintainability| |CodeFactor| |made-with-python| |pypi|
|PyPI - Python Version| |Maintenance| |Repo size| |License: MIT|
.. figure:: https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy-colored%20284.png
:alt: Wayback Machine
Wayback Machine
Waybackpy is a Python package that interfaces with `Internet
Archive <https://en.wikipedia.org/wiki/Internet_Archive>`__'s `Wayback
Machine <https://en.wikipedia.org/wiki/Wayback_Machine>`__ API. Archive
webpages and retrieve archived webpages easily.
Table of contents
=================
.. raw:: html
<!--ts-->
- `Installation <#installation>`__
- `Usage <#usage>`__
- `As a Python package <#as-a-python-package>`__
- `Saving a webpage <#capturing-aka-saving-an-url-using-save>`__
- `Retrieving
archive <#retrieving-the-archive-for-an-url-using-archive_url>`__
- `Retrieving the oldest
archive <#retrieving-the-oldest-archive-for-an-url-using-oldest>`__
- `Retrieving the recent most/newest
archive <#retrieving-the-newest-archive-for-an-url-using-newest>`__
- `Retrieving the JSON response of availability
API <#retrieving-the-json-reponse-for-the-avaliblity-api-request>`__
- `Retrieving archive close to a specified year, month, day, hour,
and
minute <#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near>`__
- `Get the content of
webpage <#get-the-content-of-webpage-using-get>`__
- `Count total archives for an
URL <#count-total-archives-for-an-url-using-total_archives>`__
- `List of URLs that Wayback Machine knows and has archived for a
domain
name <#list-of-urls-that-wayback-machine-knows-and-has-archived-for-a-domain-name>`__
- `With the Command-line
interface <#with-the-command-line-interface>`__
- `Saving webpage <#save>`__
- `Archive URL <#get-archive-url>`__
- `Oldest archive URL <#oldest-archive>`__
- `Newest archive URL <#newest-archive>`__
- `JSON response of API <#get-json-data-of-avaialblity-api>`__
- `Total archives <#total-number-of-archives>`__
- `Archive near specified time <#archive-near-time>`__
- `Get the source code <#get-the-source-code>`__
- `Fetch all the URLs that the Wayback Machine knows for a
domain <#fetch-all-the-urls-that-the-wayback-machine-knows-for-a-domain>`__
- `Tests <#tests>`__
- `Dependency <#dependency>`__
- `Packaging <#packaging>`__
- `License <#license>`__
.. raw:: html
<!--te-->
Installation
------------
Using `pip <https://en.wikipedia.org/wiki/Pip_(package_manager)>`__:
.. code:: bash
pip install waybackpy
or direct from this repository using git.
.. code:: bash
pip install git+https://github.com/akamhy/waybackpy.git
Usage
-----
As a Python package
~~~~~~~~~~~~~~~~~~~
Capturing aka Saving an url using save()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
url = "https://en.wikipedia.org/wiki/Multivariable_calculus"
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive = waybackpy_url_obj.save()
print(archive)
.. code:: bash
https://web.archive.org/web/20201016171808/https://en.wikipedia.org/wiki/Multivariable_calculus
Try this out in your browser @
https://repl.it/@akamhy/WaybackPySaveExample\
Retrieving the archive for an URL using archive\_url
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
archive_url = waybackpy_url_obj.archive_url
print(archive_url)
.. code:: bash
https://web.archive.org/web/20201016153320/https://www.google.com/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyArchiveUrl\
Retrieving the oldest archive for an URL using oldest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
url = "https://www.google.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
oldest_archive_url = waybackpy_url_obj.oldest()
print(oldest_archive_url)
.. code:: bash
http://web.archive.org/web/19981111184551/http://google.com:80/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyOldestExample\
Retrieving the newest archive for an URL using newest()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
newest_archive_url = waybackpy_url_obj.newest()
print(newest_archive_url)
.. code:: bash
https://web.archive.org/web/20201016150543/https://www.facebook.com/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyNewestExample\
Retrieving the JSON reponse for the avaliblity API request
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
url = "https://www.facebook.com/"
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
waybackpy_url_obj = waybackpy.Url(url, user_agent)
json_dict = waybackpy_url_obj.JSON
print(json_dict)
.. code:: javascript
{'url': 'https://www.facebook.com/', 'archived_snapshots': {'closest': {'available': True, 'url': 'http://web.archive.org/web/20201016150543/https://www.facebook.com/', 'timestamp': '20201016150543', 'status': '200'}}}
Try this out in your browser @ https://repl.it/@akamhy/WaybackPyJSON\
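
The dictionary printed above can be unpacked with ordinary dict lookups. The following is a minimal sketch, not part of waybackpy: the helper name `closest_snapshot_url` is hypothetical, and it assumes that a URL with no archive comes back with an empty `archived_snapshots` mapping.

```python
def closest_snapshot_url(response):
    """Return the closest snapshot URL from an availability response, or None.

    `response` is expected to look like the dictionary shown above.
    This helper is illustrative only and is not part of waybackpy.
    """
    # Assumption: a URL without any archive yields an empty "archived_snapshots".
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"]


sample = {
    "url": "https://www.facebook.com/",
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20201016150543/https://www.facebook.com/",
            "timestamp": "20201016150543",
            "status": "200",
        }
    },
}

print(closest_snapshot_url(sample))
print(closest_snapshot_url({"url": "https://example.invalid/", "archived_snapshots": {}}))
```

Returning `None` instead of raising keeps the caller's check to a single `if`, which may or may not match how the package itself reports a missing archive.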
Retrieving archive close to a specified year, month, day, hour, and minute using near()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
from waybackpy import Url
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:38.0) Gecko/20100101 Firefox/38.0"
url = "https://github.com/"
waybackpy_url_obj = Url(url, user_agent)
# Do not pad (don't use zeros in the month, year, day, minute, and hour arguments). e.g. For January, set month = 1 and not month = 01.
.. code:: python
github_archive_near_2010 = waybackpy_url_obj.near(year=2010)
print(github_archive_near_2010)
.. code:: bash
https://web.archive.org/web/20101018053604/http://github.com:80/
.. code:: python
github_archive_near_2011_may = waybackpy_url_obj.near(year=2011, month=5)
print(github_archive_near_2011_may)
.. code:: bash
https://web.archive.org/web/20110518233639/https://github.com/
.. code:: python
github_archive_near_2015_january_26 = waybackpy_url_obj.near(year=2015, month=1, day=26)
print(github_archive_near_2015_january_26)
.. code:: bash
https://web.archive.org/web/20150125102636/https://github.com/
.. code:: python
github_archive_near_2018_4_july_9_2_am = waybackpy_url_obj.near(year=2018, month=7, day=4, hour=9, minute=2)
print(github_archive_near_2018_4_july_9_2_am)
.. code:: bash
https://web.archive.org/web/20180704090245/https://github.com/
The package doesn't support a seconds argument yet. You are encouraged to
create a PR ;)
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyNearExample\
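
The "do not pad" advice in the example above has a plain Python reason: an integer literal such as `01` is a syntax error in Python 3, so the arguments have to be written as `1`, `5`, `26` and so on. The padding a timestamp needs can be added afterwards. The sketch below shows one way to turn unpadded integers into a `YYYYMMDDhhmm` string; the function name `wayback_timestamp` and its default values are assumptions made for illustration and do not describe what `near()` does internally.

```python
from datetime import datetime


def wayback_timestamp(year, month=1, day=1, hour=0, minute=0):
    """Build a zero-padded YYYYMMDDhhmm string from plain integers.

    Hypothetical helper: the defaults are placeholders chosen for this
    sketch, not the values waybackpy falls back to.
    """
    # datetime validates the ranges, e.g. month=13 raises ValueError.
    moment = datetime(year, month, day, hour, minute)
    # strftime adds the zero padding that the caller must not write by hand.
    return moment.strftime("%Y%m%d%H%M")


print(wayback_timestamp(year=2018, month=7, day=4, hour=9, minute=2))  # 201807040902
```

Letting `datetime` do the validation means an impossible date fails early, before any request would be sent.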
Get the content of webpage using get()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
google_url = "https://www.google.com/"
User_Agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.85 Safari/537.36"
waybackpy_url_object = waybackpy.Url(google_url, User_Agent)
# If no argument is passed in get(), it gets the source of the Url used to create the object.
current_google_url_source = waybackpy_url_object.get()
print(current_google_url_source)
# The following chunk of code will force a new archive of google.com and get the source of the archived page.
# waybackpy_url_object.save() type is string.
google_newest_archive_source = waybackpy_url_object.get(waybackpy_url_object.save())
print(google_newest_archive_source)
# waybackpy_url_object.oldest() type is str, it's oldest archive of google.com
google_oldest_archive_source = waybackpy_url_object.get(waybackpy_url_object.oldest())
print(google_oldest_archive_source)
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyGetExample#main.py\
Count total archives for an URL using total\_archives()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: python
import waybackpy
URL = "https://en.wikipedia.org/wiki/Python (programming language)"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)
archive_count = waybackpy_url_object.total_archives()
print(archive_count) # total_archives() returns an int
.. code:: bash
2516
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyTotalArchivesExample\
List of URLs that Wayback Machine knows and has archived for a domain name
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1) If alive=True is set, waybackpy will check all URLs to identify the
alive URLs. Don't use with popular websites like google or it would
take too long.
2) To include URLs from subdomains, set subdomain=True
.. code:: python
import waybackpy
URL = "akamhy.github.io"
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
waybackpy_url_object = waybackpy.Url(url=URL, user_agent=UA)
known_urls = waybackpy_url_object.known_urls(alive=True, subdomain=False) # alive and subdomain are optional.
print(known_urls) # known_urls() returns list of URLs
.. code:: bash
['http://akamhy.github.io',
'https://akamhy.github.io/waybackpy/',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py\
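
Checking every known URL one after another is what makes `alive=True` slow on large sites, and the commit list for this release mentions that the check is now threaded. The sketch below illustrates the general idea with the standard library only. It is not the package's implementation: `filter_alive` and the `is_alive` callback are hypothetical names, and the checker used in the demo is a stand-in that makes no network request.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_alive(urls, is_alive, max_workers=8):
    """Return the URLs for which is_alive(url) is truthy, preserving order.

    is_alive is any callable taking a URL string; a real one would issue an
    HTTP request. Both names are assumptions made for this sketch.
    """
    urls = list(urls)  # allow any iterable and iterate it twice safely
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map runs the checks concurrently and yields results in input order.
        results = list(pool.map(is_alive, urls))
    return [url for url, ok in zip(urls, results) if ok]


def fake_checker(url):
    # Stand-in for a network check: treat only https URLs as "alive".
    return url.startswith("https://")


known = [
    "http://akamhy.github.io",
    "https://akamhy.github.io/waybackpy/",
]
print(filter_alive(known, fake_checker))
```

Passing the checker in as an argument keeps the threading logic testable without touching the network.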
With the Command-line interface
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Save
^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashSave\
Get archive URL
^^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --archive_url
https://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashArchiveUrl\
Oldest archive
^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashOldest\
Newest archive
^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashNewest\
Get JSON data of avaialblity API
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --json
.. code:: javascript
{'archived_snapshots': {'closest': {'timestamp': '20201007132458', 'status': '200', 'available': True, 'url': 'http://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX'}}, 'url': 'https://en.wikipedia.org/wiki/SpaceX'}
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashJSON\
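The output above is printed as a Python dictionary rather than strict JSON (note ``True`` and the single quotes). A minimal sketch of reading the closest snapshot out of that structure follows. ``closest_snapshot`` is a hypothetical helper and not part of waybackpy, and the empty ``archived_snapshots`` case is an assumption about what an unarchived URL looks like.

```python
# Hypothetical helper, not part of waybackpy: reads the dictionary shape shown above.
sample = {
    "archived_snapshots": {
        "closest": {
            "timestamp": "20201007132458",
            "status": "200",
            "available": True,
            "url": "http://web.archive.org/web/20201007132458/https://en.wikipedia.org/wiki/SpaceX",
        }
    },
    "url": "https://en.wikipedia.org/wiki/SpaceX",
}


def closest_snapshot(data):
    """Return (timestamp, url) of the closest snapshot, or None if there is none."""
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


print(closest_snapshot(sample))
# Assumed shape for a URL without any snapshot: an empty "archived_snapshots".
print(closest_snapshot({"archived_snapshots": {}, "url": "https://example.com"}))
```

Returning ``None`` instead of raising keeps the caller in charge of deciding whether a missing snapshot is an error.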
Total number of archives
^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
853
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashTotal\
Archive near time
^^^^^^^^^^^^^^^^^
.. code:: bash
$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
https://web.archive.org/web/20120512142515/https://www.facebook.com/
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashNear\
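Archive URLs embed a 14-digit timestamp (``YYYYMMDDhhmmss``), so the result of ``--near`` can be compared with the date that was asked for. The sketch below parses the timestamp from the URL shown above; ``snapshot_time`` is a hypothetical helper, not a waybackpy function.

```python
import re
from datetime import datetime


def snapshot_time(archive_url):
    """Parse the 14-digit timestamp embedded in a Wayback Machine archive URL."""
    match = re.search(r"/web/(\d{14})/", archive_url)
    if match is None:
        raise ValueError("no 14-digit timestamp found in %r" % archive_url)
    return datetime.strptime(match.group(1), "%Y%m%d%H%M%S")


found = snapshot_time(
    "https://web.archive.org/web/20120512142515/https://www.facebook.com/"
)
requested = datetime(2012, 5, 12)  # --year 2012 --month 5 --day 12
print(found)              # 2012-05-12 14:25:15
print(found - requested)  # 14:25:15
```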
Get the source code
^^^^^^^^^^^^^^^^^^^
.. code:: bash
waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
Try this out in your browser @
https://repl.it/@akamhy/WaybackPyBashGet\
Fetch all the URLs that the Wayback Machine knows for a domain
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1) You can add the '--alive' flag to only fetch alive links.
2) You can add the '--subdomain' flag to add subdomains.
3) '--alive' and '--subdomain' flags can be used simultaneously.
4) All links will be saved in a file, and the file will be created in
the current working directory.
.. code:: bash
pip install waybackpy
# Ignore the above installation line.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
# Prints all known URLs under akamhy.github.io
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --alive
# Prints all known URLs under akamhy.github.io which are still working and not dead links.
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
# Prints all known URLs under akamhy.github.io, including subdomains
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain --alive
# Prints all known URLs under akamhy.github.io, including subdomains, that are still alive (not dead links).
Try this out in your browser @
https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh\
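For locating the saved file, the sketch below mirrors the naming scheme visible in ``_save_urls_on_file`` on the 2.3.0 side of this comparison: ``<domain>-<count>-urls-<uid>.txt``. ``urls_file_name`` is a hypothetical stand-in. The CLI code draws ``uid`` at random from lowercase letters and digits; here it is passed in explicitly so the result is reproducible.

```python
import re


def urls_file_name(first_url, url_count, uid):
    """Build '<domain>-<count>-urls-<uid>.txt' following the 2.3.0 CLI code."""
    # Same pattern as the CLI: take the host part of the first URL in the list.
    match = re.search(r"https?://([A-Za-z_0-9.-]+).*", first_url)
    domain = match.group(1) if match else "domain-unknown"
    return "%s-%d-urls-%s.txt" % (domain, url_count, uid)


print(urls_file_name("https://akamhy.github.io/waybackpy/", 12, "abc123"))
# akamhy.github.io-12-urls-abc123.txt
```

Because of the random suffix, two runs against the same domain should write to separate files rather than one file growing between runs.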
Tests
-----
`Here <https://github.com/akamhy/waybackpy/tree/master/tests>`__
To run tests locally:
.. code:: bash
pip install -U pytest
pip install codecov
pip install pytest pytest-cov
cd tests
pytest --cov=../waybackpy
python -m codecov  # For reporting coverage on Codecov
Dependency
----------
None, just pre-installed `python standard
libraries <https://docs.python.org/3/library/>`__.
Packaging
---------
1. Increment version.
2. Build package ``python setup.py sdist bdist_wheel``.
3. Sign & upload the package ``twine upload -s dist/*``.
License
-------
Released under the MIT License. See
`license <https://github.com/akamhy/waybackpy/blob/master/LICENSE>`__
for details.
.. |contributions welcome| image:: https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square
.. |Build Status| image:: https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square
:target: https://travis-ci.org/akamhy/waybackpy
.. |codecov| image:: https://codecov.io/gh/akamhy/waybackpy/branch/master/graph/badge.svg
:target: https://codecov.io/gh/akamhy/waybackpy
.. |Downloads| image:: https://pepy.tech/badge/waybackpy/month
:target: https://pepy.tech/project/waybackpy/month
.. |Release| image:: https://img.shields.io/github/v/release/akamhy/waybackpy.svg
:target: https://github.com/akamhy/waybackpy/releases
.. |Codacy Badge| image:: https://api.codacy.com/project/badge/Grade/255459cede9341e39436ec8866d3fb65
:target: https://www.codacy.com/manual/akamhy/waybackpy?utm_source=github.com&utm_medium=referral&utm_content=akamhy/waybackpy&utm_campaign=Badge_Grade
.. |Maintainability| image:: https://api.codeclimate.com/v1/badges/942f13d8177a56c1c906/maintainability
:target: https://codeclimate.com/github/akamhy/waybackpy/maintainability
.. |CodeFactor| image:: https://www.codefactor.io/repository/github/akamhy/waybackpy/badge
:target: https://www.codefactor.io/repository/github/akamhy/waybackpy
.. |made-with-python| image:: https://img.shields.io/badge/Made%20with-Python-1f425f.svg
:target: https://www.python.org/
.. |pypi| image:: https://img.shields.io/pypi/v/waybackpy.svg
:target: https://pypi.org/project/waybackpy/
.. |PyPI - Python Version| image:: https://img.shields.io/pypi/pyversions/waybackpy?style=flat-square
.. |Maintenance| image:: https://img.shields.io/badge/Maintained%3F-yes-green.svg
:target: https://github.com/akamhy/waybackpy/graphs/commit-activity
.. |Repo size| image:: https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
:target: https://github.com/akamhy/waybackpy/blob/master/LICENSE

1
requirements.txt Normal file

@ -0,0 +1 @@
requests>=2.24.0


@ -1,52 +1,54 @@
import os.path
from setuptools import setup
with open(os.path.join(os.path.dirname(__file__), 'README.md')) as f:
with open(os.path.join(os.path.dirname(__file__), "README.md")) as f:
long_description = f.read()
about = {}
with open(os.path.join(os.path.dirname(__file__), 'waybackpy', '__version__.py')) as f:
with open(os.path.join(os.path.dirname(__file__), "waybackpy", "__version__.py")) as f:
exec(f.read(), about)
setup(
name = about['__title__'],
packages = ['waybackpy'],
version = about['__version__'],
description = about['__description__'],
name=about["__title__"],
packages=["waybackpy"],
version=about["__version__"],
description=about["__description__"],
long_description=long_description,
long_description_content_type='text/markdown',
license= about['__license__'],
author = about['__author__'],
author_email = about['__author_email__'],
url = about['__url__'],
download_url = 'https://github.com/akamhy/waybackpy/archive/2.2.0.tar.gz',
keywords = ['waybackpy', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
install_requires=[],
python_requires= ">=3.2",
long_description_content_type="text/markdown",
license=about["__license__"],
author=about["__author__"],
author_email=about["__author_email__"],
url=about["__url__"],
download_url="https://github.com/akamhy/waybackpy/archive/2.3.0.tar.gz",
keywords=[
"Archive It",
"Archive Website",
"Wayback Machine",
"waybackurls",
"Internet Archive",
],
install_requires=["requests"],
python_requires=">=3.4",
classifiers=[
'Development Status :: 5 - Production/Stable',
'Intended Audience :: Developers',
'Natural Language :: English',
'Topic :: Software Development :: Build Tools',
'License :: OSI Approved :: MIT License',
'Programming Language :: Python',
'Programming Language :: Python :: 3',
'Programming Language :: Python :: 3.2',
'Programming Language :: Python :: 3.3',
'Programming Language :: Python :: 3.4',
'Programming Language :: Python :: 3.5',
'Programming Language :: Python :: 3.6',
'Programming Language :: Python :: 3.7',
'Programming Language :: Python :: 3.8',
'Programming Language :: Python :: Implementation :: CPython',
],
entry_points={
'console_scripts': [
'waybackpy = waybackpy.cli:main'
]
},
"Development Status :: 5 - Production/Stable",
"Intended Audience :: Developers",
"Natural Language :: English",
"Topic :: Software Development :: Build Tools",
"License :: OSI Approved :: MIT License",
"Programming Language :: Python",
"Programming Language :: Python :: 3",
"Programming Language :: Python :: 3.4",
"Programming Language :: Python :: 3.5",
"Programming Language :: Python :: 3.6",
"Programming Language :: Python :: 3.7",
"Programming Language :: Python :: 3.8",
"Programming Language :: Python :: 3.9",
"Programming Language :: Python :: Implementation :: CPython",
],
entry_points={"console_scripts": ["waybackpy = waybackpy.cli:main"]},
project_urls={
'Documentation': 'https://waybackpy.readthedocs.io',
'Source': 'https://github.com/akamhy/waybackpy',
"Documentation": "https://akamhy.github.io/waybackpy/",
"Source": "https://github.com/akamhy/waybackpy",
"Tracker": "https://github.com/akamhy/waybackpy/issues",
},
)

0
tests/__init__.py Normal file


@ -6,101 +6,292 @@ import argparse
sys.path.append("..")
import waybackpy.cli as cli # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.wrapper import Url # noqa: E402
from waybackpy.__version__ import __version__
codecov_python = False
if sys.version_info > (3, 7):
codecov_python = True
# Namespace(day=None, get=None, hour=None, minute=None, month=None, near=False,
# newest=False, oldest=False, save=False, total=False, url=None, user_agent=None, version=False, year=None)
if codecov_python:
def test_save():
args = argparse.Namespace(user_agent=None, url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=True, json=False, archive_url=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get=None)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
def test_save():
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=True,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
def test_json():
args = argparse.Namespace(user_agent=None, url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, json=True, archive_url=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get=None)
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=True,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "archived_snapshots" in str(reply)
def test_archive_url():
args = argparse.Namespace(user_agent=None, url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, json=False, archive_url=True, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get=None)
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=True,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "https://web.archive.org/web/" in str(reply)
def test_oldest():
args = argparse.Namespace(user_agent=None, url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=True, save=False, json=False, archive_url=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get=None)
args = argparse.Namespace(
user_agent=None,
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=True,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
def test_newest():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, json=False, archive_url=False, newest=True, near=False, alive=False, subdomain=False, known_urls=False, get=None)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=True,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert "pypi.org/user/akamhy" in str(reply)
def test_total_archives():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=True, version=False,
oldest=False, save=False, json=False, archive_url=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get=None)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=True,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get=None,
)
reply = cli.args_handler(args)
assert isinstance(reply, int)
def test_known_urls():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://akamhy.github.io", total=False, version=False,
oldest=False, save=False, json=False, archive_url=False, newest=False, near=False, alive=True, subdomain=True, known_urls=True, get=None)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://akamhy.github.io",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=True,
subdomain=True,
known_urls=True,
get=None,
)
reply = cli.args_handler(args)
assert "github" in str(reply)
def test_near():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, json=False, archive_url=False, newest=False, near=True, alive=False, subdomain=False, known_urls=False, get=None, year=2020, month=7, day=15, hour=1, minute=1)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=True,
alive=False,
subdomain=False,
known_urls=False,
get=None,
year=2020,
month=7,
day=15,
hour=1,
minute=1,
)
reply = cli.args_handler(args)
assert "202007" in str(reply)
def test_get():
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, json=False, archive_url=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="url")
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="url",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, json=False, archive_url=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="oldest")
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="oldest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, json=False, archive_url=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="newest")
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="newest",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
if codecov_python:
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, json=False, archive_url=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="save")
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="save",
)
reply = cli.args_handler(args)
assert "waybackpy" in str(reply)
args = argparse.Namespace(user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9", url="https://pypi.org/user/akamhy/", total=False, version=False,
oldest=False, save=False, json=False, archive_url=False, newest=False, near=False, alive=False, subdomain=False, known_urls=False, get="BullShit")
args = argparse.Namespace(
user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 \
(KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
url="https://pypi.org/user/akamhy/",
total=False,
version=False,
oldest=False,
save=False,
json=False,
archive_url=False,
newest=False,
near=False,
alive=False,
subdomain=False,
known_urls=False,
get="BullShit",
)
reply = cli.args_handler(args)
assert "get the source code of the" in str(reply)
def test_args_handler():
args = argparse.Namespace(version=True)
reply = cli.args_handler(args)
@ -110,6 +301,7 @@ def test_args_handler():
reply = cli.args_handler(args)
assert ("waybackpy %s" % (__version__)) in str(reply)
def test_main():
# This also tests the parse_args method in cli.py
cli.main(['temp.py', '--version'])
cli.main(["temp.py", "--version"])


@ -2,16 +2,12 @@
import sys
import pytest
import random
import requests
sys.path.append("..")
import waybackpy.wrapper as waybackpy # noqa: E402
if sys.version_info >= (3, 0): # If the python ver >= 3
from urllib.request import Request, urlopen
from urllib.error import URLError
else: # For python2.x
from urllib2 import Request, urlopen, URLError
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
@ -23,6 +19,7 @@ def test_clean_url():
test_result = target._clean_url()
assert answer == test_result
def test_dunders():
url = "https://en.wikipedia.org/wiki/Network_security"
user_agent = "UA"
@ -30,14 +27,17 @@ def test_dunders():
assert "waybackpy.Url(url=%s, user_agent=%s)" % (url, user_agent) == repr(target)
assert "en.wikipedia.org" in str(target)
def test_archive_url_parser():
request_url = "https://amazon.com"
hdr = {"User-Agent": user_agent} # nosec
req = Request(request_url, headers=hdr) # nosec
header = waybackpy._get_response(req).headers
endpoint = "https://amazon.com"
user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"
headers = {"User-Agent": "%s" % user_agent}
response = waybackpy._get_response(endpoint, params=None, headers=headers)
header = response.headers
with pytest.raises(Exception):
waybackpy._archive_url_parser(header)
def test_url_check():
broken_url = "http://wwwgooglecom/"
with pytest.raises(Exception):
@ -65,34 +65,20 @@ def test_save():
archived_url1 = str(target.save())
assert url1 in archived_url1
if sys.version_info > (3, 6):
# Test for urls that are incorrect.
with pytest.raises(Exception):
url2 = "ha ha ha ha"
waybackpy.Url(url2, user_agent)
url3 = "http://www.archive.is/faq.html"
# Test for urls that are incorrect.
with pytest.raises(Exception):
url2 = "ha ha ha ha"
waybackpy.Url(url2, user_agent)
url3 = "http://www.archive.is/faq.html"
# Test for urls not allowed to archive by robot.txt. Doesn't works anymore. Find alternatives.
# with pytest.raises(Exception):
#
# target = waybackpy.Url(
# url3,
# "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) "
# "Gecko/20100101 Firefox/25.0",
# )
# target.save()
# Non existent urls, test
with pytest.raises(Exception):
target = waybackpy.Url(
url3,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
"AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
"Safari/533.20.27",
)
target.save()
else:
pass
with pytest.raises(Exception):
target = waybackpy.Url(
url3,
"Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
"AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
"Safari/533.20.27",
)
target.save()
def test_near():
@ -105,36 +91,33 @@ def test_near():
archive_near_year = target.near(year=2010)
assert "2010" in str(archive_near_year)
if sys.version_info > (3, 6):
archive_near_month_year = str(target.near(year=2015, month=2))
assert (
("201502" in archive_near_month_year)
or ("201501" in archive_near_month_year)
or ("201503" in archive_near_month_year)
)
archive_near_month_year = str(target.near(year=2015, month=2))
assert (
("201502" in archive_near_month_year)
or ("201501" in archive_near_month_year)
or ("201503" in archive_near_month_year)
)
target = waybackpy.Url(
"www.python.org",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
)
archive_near_hour_day_month_year = str(target.near(
year=2008, month=5, day=9, hour=15
))
assert (
("2008050915" in archive_near_hour_day_month_year)
or ("2008050914" in archive_near_hour_day_month_year)
or ("2008050913" in archive_near_hour_day_month_year)
)
target = waybackpy.Url(
"www.python.org",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
"(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
)
archive_near_hour_day_month_year = str(
target.near(year=2008, month=5, day=9, hour=15)
)
assert (
("2008050915" in archive_near_hour_day_month_year)
or ("2008050914" in archive_near_hour_day_month_year)
or ("2008050913" in archive_near_hour_day_month_year)
)
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = waybackpy.Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
else:
pass
with pytest.raises(Exception):
NeverArchivedUrl = (
"https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
)
target = waybackpy.Url(NeverArchivedUrl, user_agent)
target.near(year=2010)
def test_oldest():
@ -142,16 +125,19 @@ def test_oldest():
target = waybackpy.Url(url, user_agent)
assert "20200504141153" in str(target.oldest())
def test_json():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "archived_snapshots" in str(target.JSON)
def test_archive_url():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
assert "github.com/akamhy" in str(target.archive_url)
def test_newest():
url = "github.com/akamhy/waybackpy"
target = waybackpy.Url(url, user_agent)
@ -163,35 +149,32 @@ def test_get():
assert "Welcome to Google" in target.get(target.oldest())
def test_wayback_timestamp():
ts = waybackpy._wayback_timestamp(
year=2020, month=1, day=2, hour=3, minute=4
)
ts = waybackpy._wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4)
assert "202001020304" in str(ts)
def test_get_response():
hdr = {
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) "
"Gecko/20100101 Firefox/78.0"
}
req = Request("https://www.google.com", headers=hdr) # nosec
response = waybackpy._get_response(req)
assert response.code == 200
endpoint = "https://www.google.com"
user_agent = (
"Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"
)
headers = {"User-Agent": "%s" % user_agent}
response = waybackpy._get_response(endpoint, params=None, headers=headers)
assert response.status_code == 200
def test_total_archives():
if sys.version_info > (3, 6):
target = waybackpy.Url(" https://google.com ", user_agent)
assert target.total_archives() > 500000
else:
pass
target = waybackpy.Url(" https://google.com ", user_agent)
assert target.total_archives() > 500000
target = waybackpy.Url(
" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
)
assert target.total_archives() == 0
def test_known_urls():
target = waybackpy.Url("akamhy.github.io", user_agent)


@ -1,9 +1,12 @@
# -*- coding: utf-8 -*-
__title__ = "waybackpy"
__description__ = "A Python package that interfaces with the Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily."
__description__ = (
"A Python package that interfaces with the Internet Archive's Wayback Machine API. "
"Archive pages and retrieve archived pages easily."
)
__url__ = "https://akamhy.github.io/waybackpy/"
__version__ = "2.2.0"
__version__ = "2.3.0"
__author__ = "akamhy"
__author_email__ = "akash3pro@gmail.com"
__license__ = "MIT"


@ -1,29 +1,37 @@
# -*- coding: utf-8 -*-
from __future__ import print_function
import sys
import os
import re
import argparse
import string
import random
from waybackpy.wrapper import Url
from waybackpy.__version__ import __version__
def _save(obj):
return (obj.save())
return obj.save()
def _archive_url(obj):
return (obj.archive_url)
return obj.archive_url
def _json(obj):
return (obj.JSON)
return obj.JSON
def _oldest(obj):
return (obj.oldest())
return obj.oldest()
def _newest(obj):
return (obj.newest())
return obj.newest()
def _total_archives(obj):
return (obj.total_archives())
return obj.total_archives()
def _near(obj, args):
_near_args = {}
@ -37,7 +45,27 @@ def _near(obj, args):
_near_args["hour"] = args.hour
if args.minute:
_near_args["minute"] = args.minute
return (obj.near(**_near_args))
return obj.near(**_near_args)
def _save_urls_on_file(input_list, live_url_count):
m = re.search("https?://([A-Za-z_0-9.-]+).*", input_list[0])
if m:
domain = m.group(1)
else:
domain = "domain-unknown"
uid = "".join(
random.choice(string.ascii_lowercase + string.digits) for _ in range(6)
)
file_name = "%s-%d-urls-%s.txt" % (domain, live_url_count, uid)
file_content = "\n".join(input_list)
file_path = os.path.join(os.getcwd(), file_name)
with open(file_path, "w+") as f:
f.write(file_content)
return "%s\n\n'%s' saved in current working directory" % (file_content, file_name)
def _known_urls(obj, args):
"""Abbreviations:
@ -54,55 +82,46 @@ def _known_urls(obj, args):
total_urls = len(url_list)
if total_urls > 0:
m = re.search('https?://([A-Za-z_0-9.-]+).*', url_list[0])
if m:
domain = m.group(1)
else:
domain = "domain-unknown"
dir_path = os.path.abspath(os.getcwd())
file_name = dir_path + "/%s-%d-urls.txt" % (domain, total_urls)
text = "\n".join(url_list) + "\n"
with open(file_name, "a+") as f:
f.write(text)
text = text + "%d URLs found and saved in ./%s-%d-urls.txt" % (
total_urls, domain, total_urls
)
text = _save_urls_on_file(url_list, total_urls)
else:
text = "No known URLs found. Please try a diffrent domain!"
return text
def _get(obj, args):
if args.get.lower() == "url":
return (obj.get())
return obj.get()
if args.get.lower() == "archive_url":
return (obj.get(obj.archive_url))
return obj.get(obj.archive_url)
if args.get.lower() == "oldest":
return (obj.get(obj.oldest()))
return obj.get(obj.oldest())
if args.get.lower() == "latest" or args.get.lower() == "newest":
return (obj.get(obj.newest()))
return obj.get(obj.newest())
if args.get.lower() == "save":
return (obj.get(obj.save()))
return obj.get(obj.save())
return ("Use get as \"--get 'source'\", 'source' can be one of the followings: \
return "Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) archive_url - get the source code of the newest archive for the supplied url, alias of newest.\
\n3) oldest - get the source code of the oldest archive for the supplied url.\
\n4) newest - get the source code of the newest archive for the supplied url.\
\n5) save - Create a new archive and get the source code of this new archive for the supplied url.")
\n5) save - Create a new archive and get the source code of this new archive for the supplied url."
def args_handler(args):
if args.version:
return ("waybackpy version %s" % __version__)
return "waybackpy version %s" % __version__
if not args.url:
return ("waybackpy %s \nSee 'waybackpy --help' for help using this tool." % __version__)
return (
"waybackpy %s \nSee 'waybackpy --help' for help using this tool."
% __version__
)
if args.user_agent:
obj = Url(args.url, args.user_agent)
@ -127,58 +146,107 @@ def args_handler(args):
return _near(obj, args)
if args.get:
return _get(obj, args)
return ("You only specified the URL. But you also need to specify the operation.\nSee 'waybackpy --help' for help using this tool.")
message = (
"You only specified the URL. But you also need to specify the operation."
"\nSee 'waybackpy --help' for help using this tool."
)
return message
def parse_args(argv):
parser = argparse.ArgumentParser()
requiredArgs = parser.add_argument_group('URL argument (required)')
requiredArgs.add_argument("--url", "-u", help="URL on which Wayback machine operations would occur")
requiredArgs = parser.add_argument_group("URL argument (required)")
requiredArgs.add_argument(
"--url", "-u", help="URL on which Wayback machine operations would occur"
)
userAgentArg = parser.add_argument_group('User Agent')
userAgentArg.add_argument("--user_agent", "-ua", help="User agent, default user_agent is \"waybackpy python package - https://github.com/akamhy/waybackpy\"")
userAgentArg = parser.add_argument_group("User Agent")
help_text = 'User agent, default user_agent is "waybackpy python package - https://github.com/akamhy/waybackpy"'
userAgentArg.add_argument("--user_agent", "-ua", help=help_text)
saveArg = parser.add_argument_group("Create new archive/save URL")
saveArg.add_argument("--save", "-s", action='store_true', help="Save the URL on the Wayback machine")
saveArg.add_argument(
"--save", "-s", action="store_true", help="Save the URL on the Wayback machine"
)
auArg = parser.add_argument_group("Get the latest Archive")
auArg.add_argument("--archive_url", "-au", action='store_true', help="Get the latest archive URL, alias for --newest")
auArg.add_argument(
"--archive_url",
"-au",
action="store_true",
help="Get the latest archive URL, alias for --newest",
)
jsonArg = parser.add_argument_group("Get the JSON data")
jsonArg.add_argument("--json", "-j", action='store_true', help="JSON data of the availability API request")
jsonArg.add_argument(
"--json",
"-j",
action="store_true",
help="JSON data of the availability API request",
)
oldestArg = parser.add_argument_group("Oldest archive")
oldestArg.add_argument("--oldest", "-o", action='store_true', help="Oldest archive for the specified URL")
oldestArg.add_argument(
"--oldest",
"-o",
action="store_true",
help="Oldest archive for the specified URL",
)
newestArg = parser.add_argument_group("Newest archive")
newestArg.add_argument("--newest", "-n", action='store_true', help="Newest archive for the specified URL")
newestArg.add_argument(
"--newest",
"-n",
action="store_true",
help="Newest archive for the specified URL",
)
totalArg = parser.add_argument_group("Total number of archives")
totalArg.add_argument("--total", "-t", action='store_true', help="Total number of archives for the specified URL")
totalArg.add_argument(
"--total",
"-t",
action="store_true",
help="Total number of archives for the specified URL",
)
getArg = parser.add_argument_group("Get source code")
getArg.add_argument("--get", "-g", help="Prints the source code of the supplied url. Use '--get help' for extended usage")
getArg.add_argument(
"--get",
"-g",
help="Prints the source code of the supplied url. Use '--get help' for extended usage",
)
knownUrlArg = parser.add_argument_group("URLs known and archived to Wayback Machine for the site.")
knownUrlArg.add_argument("--known_urls", "-ku", action='store_true', help="URLs known for the domain.")
knownUrlArg.add_argument("--subdomain", "-sub", action='store_true', help="Use with '--known_urls' to include known URLs for subdomains.")
knownUrlArg.add_argument("--alive", "-a", action='store_true', help="Only include live URLs. Will not include dead links.")
knownUrlArg = parser.add_argument_group(
"URLs known and archived to Wayback Machine for the site."
)
knownUrlArg.add_argument(
"--known_urls", "-ku", action="store_true", help="URLs known for the domain."
)
help_text = "Use with '--known_urls' to include known URLs for subdomains."
knownUrlArg.add_argument("--subdomain", "-sub", action="store_true", help=help_text)
help_text = "Only include live URLs. Will not include dead links."
knownUrlArg.add_argument("--alive", "-a", action="store_true", help=help_text)
nearArg = parser.add_argument_group("Archive close to time specified")
nearArg.add_argument(
"--near", "-N", action="store_true", help="Archive near specified time"
)
nearArg = parser.add_argument_group('Archive close to time specified')
nearArg.add_argument("--near", "-N", action='store_true', help="Archive near specified time")
nearArgs = parser.add_argument_group('Arguments that are used only with --near')
nearArgs = parser.add_argument_group("Arguments that are used only with --near")
nearArgs.add_argument("--year", "-Y", type=int, help="Year in integer")
nearArgs.add_argument("--month", "-M", type=int, help="Month in integer")
nearArgs.add_argument("--day", "-D", type=int, help="Day in integer.")
nearArgs.add_argument("--hour", "-H", type=int, help="Hour in integer")
nearArgs.add_argument("--minute", "-MIN", type=int, help="Minute in integer")
parser.add_argument("--version", "-v", action='store_true', help="Waybackpy version")
parser.add_argument(
"--version", "-v", action="store_true", help="Waybackpy version"
)
return parser.parse_args(argv[1:])
def main(argv=None):
if argv is None:
argv = sys.argv
@ -186,5 +254,6 @@ def main(argv=None):
output = args_handler(args)
print(output)
if __name__ == "__main__":
sys.exit(main(sys.argv))
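
Review note: the parser above groups options with `add_argument_group` and parses `argv[1:]`, so the program name is skipped. A minimal sketch of that pattern follows, trimmed to four of the options from the diff; the sample command line and the shortened help strings are assumptions for illustration only.

```python
import argparse


def parse_args(argv):
    """Parse a trimmed-down subset of the waybackpy CLI options."""
    parser = argparse.ArgumentParser()

    url_group = parser.add_argument_group("URL argument (required)")
    url_group.add_argument("--url", "-u", help="URL to operate on")

    save_group = parser.add_argument_group("Create new archive/save URL")
    # store_true flags default to False and become True when present.
    save_group.add_argument("--save", "-s", action="store_true")

    near_group = parser.add_argument_group("Archive close to time specified")
    near_group.add_argument("--near", "-N", action="store_true")
    # type=int makes argparse convert (and validate) the value.
    near_group.add_argument("--year", "-Y", type=int, help="Year as an integer")

    # argv[0] is the program name, so only argv[1:] is parsed.
    return parser.parse_args(argv[1:])


# Hypothetical command line, equivalent to:
#   waybackpy --url https://example.com --near --year 2010
args = parse_args(["waybackpy", "--url", "https://example.com", "--near", "--year", "2010"])
print(args.url, args.near, args.year, args.save)
```

Options that are not supplied keep their defaults: `False` for the `store_true` flags and `None` for the others.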


@ -1,6 +1,13 @@
# -*- coding: utf-8 -*-
class WaybackError(Exception):
"""
Raised when Wayback Machine API Service is unreachable/down.
"""
class URLError(Exception):
"""
Raised when malformed URLs are passed as arguments.
"""
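
Review note: in wrapper.py, `_get_response` retries the request once and, if the second attempt also fails, raises `WaybackError` with the original exception attached as `__cause__`. The sketch below isolates that retry-and-wrap pattern. The `fetch` callable and the name `get_with_one_retry` are hypothetical stand-ins for the real `requests.get` call, so the example runs without network access; `raise ... from e` is the Python 3 way of setting `__cause__`.

```python
class WaybackError(Exception):
    """Raised when the Wayback Machine API service is unreachable/down."""


def get_with_one_retry(fetch, endpoint):
    """Call fetch(endpoint); retry once, then wrap the failure in WaybackError.

    `fetch` is a hypothetical callable standing in for the HTTP request.
    """
    try:
        return fetch(endpoint)
    except Exception:
        try:
            return fetch(endpoint)
        except Exception as e:
            # Chain the original error so the traceback shows the real cause.
            raise WaybackError("Error while retrieving %s" % endpoint) from e


def always_down(endpoint):
    """Stub that simulates an unreachable service."""
    raise ConnectionError("service unreachable")


try:
    get_with_one_retry(always_down, "https://archive.org/wayback/available")
except WaybackError as err:
    print(err, "| caused by:", repr(err.__cause__))
```

Callers can then catch `WaybackError` alone and still inspect `err.__cause__` when they need the underlying network error.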


@ -1,17 +1,12 @@
# -*- coding: utf-8 -*-
import re
import sys
import json
from datetime import datetime, timedelta
from waybackpy.exceptions import WaybackError
from waybackpy.exceptions import WaybackError, URLError
from waybackpy.__version__ import __version__
import requests
import concurrent.futures
if sys.version_info >= (3, 0): # If the python ver >= 3
from urllib.request import Request, urlopen
from urllib.error import URLError
else: # For python2.x
from urllib2 import Request, urlopen, URLError
default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"
@ -19,9 +14,7 @@ default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"
def _archive_url_parser(header):
"""Parse out the archive from header."""
# Regex1
arch = re.search(
r"Content-Location: (/web/[0-9]{14}/.*)", str(header)
)
arch = re.search(r"Content-Location: (/web/[0-9]{14}/.*)", str(header))
if arch:
return "web.archive.org" + arch.group(1)
# Regex2
@ -49,19 +42,21 @@ def _wayback_timestamp(**kwargs):
)
def _get_response(req):
def _get_response(endpoint, params=None, headers=None):
"""Get response for the supplied request."""
try:
response = urlopen(req) # nosec
response = requests.get(endpoint, params=params, headers=headers)
except Exception:
try:
response = urlopen(req) # nosec
response = requests.get(endpoint, params=params, headers=headers) # nosec
except Exception as e:
exc = WaybackError("Error while retrieving %s" % req.full_url)
exc = WaybackError("Error while retrieving %s" % endpoint)
exc.__cause__ = e
raise exc
return response
class Url:
"""waybackpy Url object"""
@ -69,9 +64,10 @@ class Url:
self.url = url
self.user_agent = user_agent
self._url_check() # checks url validity on init.
self.JSON = self._JSON() # JSON of most recent archive
self.archive_url = self._archive_url() # URL of archive
self.timestamp = self._archive_timestamp() # timestamp for last archive
self.JSON = self._JSON() # JSON of most recent archive
self.archive_url = self._archive_url() # URL of archive
self.timestamp = self._archive_timestamp() # timestamp for last archive
self._alive_url_list = []
def __repr__(self):
return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)
@ -80,16 +76,14 @@ class Url:
return "%s" % self.archive_url
def __len__(self):
td_max = timedelta(days=999999999,
hours=23,
minutes=59,
seconds=59,
microseconds=999999)
td_max = timedelta(
days=999999999, hours=23, minutes=59, seconds=59, microseconds=999999
)
if self.timestamp == datetime.max:
return td_max.days
else:
diff = datetime.utcnow() - self.timestamp
return diff.days
diff = datetime.utcnow() - self.timestamp
return diff.days
def _url_check(self):
"""Check for common URL problems."""
@ -97,17 +91,11 @@ class Url:
raise URLError("'%s' is not a valid URL." % self.url)
def _JSON(self):
request_url = "https://archive.org/wayback/available?url=%s" % (
self._clean_url(),
)
hdr = {"User-Agent": "%s" % self.user_agent}
req = Request(request_url, headers=hdr) # nosec
response = _get_response(req)
data_string = response.read().decode("UTF-8")
data = json.loads(data_string)
return data
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": "%s" % self.user_agent}
payload = {"url": "%s" % self._clean_url()}
response = _get_response(endpoint, params=payload, headers=headers)
return response.json()
def _archive_url(self):
"""Get URL of archive."""
@ -118,9 +106,7 @@ class Url:
else:
archive_url = data["archived_snapshots"]["closest"]["url"]
archive_url = archive_url.replace(
"http://web.archive.org/web/",
"https://web.archive.org/web/",
1
"http://web.archive.org/web/", "https://web.archive.org/web/", 1
)
return archive_url
@ -133,10 +119,9 @@ class Url:
time = datetime.max
else:
time = datetime.strptime(data["archived_snapshots"]
["closest"]
["timestamp"],
'%Y%m%d%H%M%S')
time = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
return time
@ -147,10 +132,9 @@ class Url:
def save(self):
"""Create a new Wayback Machine archive for this URL."""
request_url = "https://web.archive.org/save/" + self._clean_url()
hdr = {"User-Agent": "%s" % self.user_agent} # nosec
req = Request(request_url, headers=hdr) # nosec
header = _get_response(req).headers
self.archive_url = "https://" + _archive_url_parser(header)
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
self.archive_url = "https://" + _archive_url_parser(response.headers)
self.timestamp = datetime.utcnow()
return self
@ -165,20 +149,21 @@ class Url:
if not user_agent:
user_agent = self.user_agent
hdr = {"User-Agent": "%s" % user_agent}
req = Request(url, headers=hdr) # nosec
response = _get_response(req)
headers = {"User-Agent": "%s" % user_agent}
response = _get_response(url, params=None, headers=headers)
if not encoding:
try:
encoding = response.headers["content-type"].split("charset=")[-1]
encoding = response.encoding
except AttributeError:
encoding = "UTF-8"
return response.read().decode(encoding.replace("text/html", "UTF-8", 1))
return response.content.decode(encoding.replace("text/html", "UTF-8", 1))
def near(self, year=None, month=None, day=None, hour=None, minute=None):
""" Return the closest Wayback Machine archive to the time supplied.
Supported params are year, month, day, hour and minute.
Any non-supplied parameters default to the current time.
"""Return the closest Wayback Machine archive to the time supplied.
Supported params are year, month, day, hour and minute.
Any non-supplied parameters default to the current time.
"""
now = datetime.utcnow().timetuple()
@ -190,14 +175,11 @@ class Url:
minute=minute if minute else now.tm_min,
)
request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (
self._clean_url(),
timestamp,
)
hdr = {"User-Agent": "%s" % self.user_agent}
req = Request(request_url, headers=hdr) # nosec
response = _get_response(req)
data = json.loads(response.read().decode("UTF-8"))
endpoint = "https://archive.org/wayback/available"
headers = {"User-Agent": "%s" % self.user_agent}
payload = {"url": "%s" % self._clean_url(), "timestamp": timestamp}
response = _get_response(endpoint, params=payload, headers=headers)
data = response.json()
if not data["archived_snapshots"]:
raise WaybackError(
"Can not find archive for '%s' try later or use waybackpy.Url(url, user_agent).save() "
@ -209,14 +191,12 @@ class Url:
)
self.archive_url = archive_url
self.timestamp = datetime.strptime(data["archived_snapshots"]
["closest"]
["timestamp"],
'%Y%m%d%H%M%S')
self.timestamp = datetime.strptime(
data["archived_snapshots"]["closest"]["timestamp"], "%Y%m%d%H%M%S"
)
return self
def oldest(self, year=1994):
"""Return the oldest Wayback Machine archive for this URL."""
return self.near(year=year)
@ -231,22 +211,36 @@ class Url:
def total_archives(self):
"""Returns the total number of Wayback Machine archives for this URL."""
hdr = {"User-Agent": "%s" % self.user_agent}
request_url = (
"https://web.archive.org/cdx/search/cdx?url=%s&output=json&fl=statuscode"
% self._clean_url()
)
req = Request(request_url, headers=hdr) # nosec
response = _get_response(req)
endpoint = "https://web.archive.org/cdx/search/cdx"
headers = {"User-Agent": "%s" % self.user_agent}
# output and fl are CDX query parameters, so they belong in params, not in the HTTP headers
payload = {
"url": "%s" % self._clean_url(),
"output": "json",
"fl": "statuscode",
}
response = _get_response(endpoint, params=payload, headers=headers)
# Most efficient method to count number of archives (yet)
return str(response.read()).count(",")
return response.text.count(",")
def pick_live_urls(self, url):
try:
response_code = requests.get(url).status_code
except Exception:
return # we don't care if urls are not opening
# 2xx means OK and 3xx usually means a redirect; to exclude redirects, replace 400 with 300
if response_code >= 400:
return
self._alive_url_list.append(url)
def known_urls(self, alive=False, subdomain=False):
"""Returns list of URLs known to exist for given domain name
because these URLs were crawled by WayBack Machine bots.
Useful for pen-testers and others.
Idea by Mohammed Diaa (https://github.com/mhmdiaa) from:
https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050
"""
@ -255,35 +249,24 @@ class Url:
if subdomain:
request_url = (
"https://web.archive.org/cdx/search/cdx?url=*.%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
"https://web.archive.org/cdx/search/cdx?url=*.%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
else:
request_url = (
"http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
"http://web.archive.org/cdx/search/cdx?url=%s/*&output=json&fl=original&collapse=urlkey"
% self._clean_url()
)
hdr = {"User-Agent": "%s" % self.user_agent}
req = Request(request_url, headers=hdr) # nosec
response = _get_response(req)
data = json.loads(response.read().decode("UTF-8"))
headers = {"User-Agent": "%s" % self.user_agent}
response = _get_response(request_url, params=None, headers=headers)
data = response.json()
url_list = [y[0] for y in data if y[0] != "original"]
#Remove all deadURLs from url_list if alive=True
# Remove all deadURLs from url_list if alive=True
if alive:
tmp_url_list = []
for url in url_list:
try:
urlopen(url) # nosec
except:
continue
tmp_url_list.append(url)
url_list = tmp_url_list
with concurrent.futures.ThreadPoolExecutor() as executor:
executor.map(self.pick_live_urls, url_list)
url_list = self._alive_url_list
return url_list
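
Review note: the new `alive` branch checks URLs concurrently and collects the live ones in `self._alive_url_list`. That list is shared instance state, so it keeps entries from earlier calls. A possible alternative, sketched below under assumptions, builds the result from the values returned by `executor.map`, which keeps the input order and needs no shared list. The `fetch_status` callable and the name `filter_live_urls` are hypothetical; in the real code `fetch_status` would wrap `requests.get(url).status_code` and return `None` on any exception.

```python
import concurrent.futures


def filter_live_urls(urls, fetch_status):
    """Return the URLs whose status code is below 400, in input order.

    `fetch_status` is a hypothetical callable: url -> int status code,
    or None when the request failed.
    """
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # executor.map yields results in the same order as `urls`.
        codes = list(executor.map(fetch_status, urls))
    return [
        url
        for url, code in zip(urls, codes)
        if code is not None and code < 400
    ]


# Stubbed status codes so the sketch runs without network access.
fake_statuses = {
    "https://example.com/": 200,
    "https://example.com/moved": 301,
    "https://example.com/gone": 404,
    "https://example.com/timeout": None,
}
live = filter_live_urls(list(fake_statuses), fake_statuses.get)
print(live)
```

Because the function returns a fresh list each time, repeated calls cannot accumulate stale entries.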