Compare commits
49 Commits

SHA1s:

4dcda94cb0, 09f59b0182, ed24184b99, 56bef064b1, 44bb2cf5e4, e231228721,
b8b2d6dfa9, 3eca6294df, eb037a0284, a01821f20b, b21036f8df, b43bacb7ac,
f7313b255a, 7457e1c793, f7493d823f, 7fa7b59ce3, 78a608db50, 93f7dfdaf9,
83c6f256c9, dee9105794, 3bfc3b46d0, 553f150bee, b3a7e714a5, cd9841713c,
1ea9548d46, be7642c837, a418a4e464, aec035ef1e, 6d37993ab9, 72b80ca44e,
c10aa9279c, 68d809a7d6, 4ad09a419b, ddc6620f09, 4066a65678, 8e46a9ba7a,
a5a98b9b00, a721ab7d6c, 7db27ae5e1, 8fd4462025, c458a15820, bae3412bee,
94cb08bb37, af888db13e, d24f2408ee, ddd2274015, 99abdb7c67, f3bb9a8540,
bb94e0d1c5
`.github/workflows/python-publish.yml` — new file, vendored, +31 lines

```diff
@@ -0,0 +1,31 @@
+# This workflows will upload a Python Package using Twine when a release is created
+# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries
+
+name: Upload Python Package
+
+on:
+  release:
+    types: [created]
+
+jobs:
+  deploy:
+
+    runs-on: ubuntu-latest
+
+    steps:
+    - uses: actions/checkout@v2
+    - name: Set up Python
+      uses: actions/setup-python@v2
+      with:
+        python-version: '3.x'
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install setuptools wheel twine
+    - name: Build and publish
+      env:
+        TWINE_USERNAME: ${{ secrets.PYPI_USERNAME }}
+        TWINE_PASSWORD: ${{ secrets.PYPI_PASSWORD }}
+      run: |
+        python setup.py sdist bdist_wheel
+        twine upload dist/*
```
`.travis.yml` — 23 lines changed

```diff
@@ -1,14 +1,19 @@
 language: python
-python:
-- "2.7"
-- "3.6"
-- "3.8"
 os: linux
 dist: xenial
 cache: pip
-install:
-- pip install pytest
-before_script:
-cd tests
+python:
+- 2.7
+- 3.6
+- 3.8
+before_install:
+- python --version
+- pip install -U pip
+- pip install -U pytest
+- pip install codecov
+- pip install pytest pytest-cov
 script:
-- pytest test_1.py
+- cd tests
+- pytest --cov=../waybackpy
+after_success:
+- if [[ $TRAVIS_PYTHON_VERSION == 3.8 ]]; then python -m codecov; fi
```
`README.md` — 64 lines changed

````diff
@@ -28,13 +28,20 @@ Table of contents
 * [Installation](#installation)

 * [Usage](#usage)
+  * [As a python package](#as-a-python-package)
 * [Saving an url using save()](#capturing-aka-saving-an-url-using-save)
 * [Receiving the oldest archive for an URL Using oldest()](#receiving-the-oldest-archive-for-an-url-using-oldest)
 * [Receiving the recent most/newest archive for an URL using newest()](#receiving-the-newest-archive-for-an-url-using-newest)
 * [Receiving archive close to a specified year, month, day, hour, and minute using near()](#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
 * [Get the content of webpage using get()](#get-the-content-of-webpage-using-get)
 * [Count total archives for an URL using total_archives()](#count-total-archives-for-an-url-using-total_archives)
+* [With CLI](#with-the-cli)
+  * [Save](#save)
+  * [Oldest archive](#oldest-archive)
+  * [Newest archive](#newest-archive)
+  * [Total archives](#total-number-of-archives)
+  * [Archive near a time](#archive-near-time)
+  * [Get the source code](#get-the-source-code)

 * [Tests](#tests)

@@ -49,10 +56,15 @@ Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)):
 ```bash
 pip install waybackpy
 ```
+
+or direct from this repository using git.
+
+```bash
+pip install git+https://github.com/akamhy/waybackpy.git
+```

 ## Usage

+### As a python package
+
 #### Capturing aka Saving an url using save()
 ```python
 import waybackpy
@@ -218,12 +230,58 @@ print(archive_count) # total_archives() returns an int
 ```
 <sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyTotalArchivesExample></sub>

+### With the CLI
+
+#### Save
+```bash
+$ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
+https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
+```
+<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashSave></sub>
+
+#### Oldest archive
+```bash
+$ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
+https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
+```
+<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashOldest></sub>
+
+#### Newest archive
+```bash
+$ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
+https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
+```
+<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNewest></sub>
+
+#### Total number of archives
+```bash
+$ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
+853
+```
+<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashTotal></sub>
+
+#### Archive near time
+```bash
+$ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
+https://web.archive.org/web/20120512142515/https://www.facebook.com/
+```
+<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashNear></sub>
+
+#### Get the source code
+```bash
+$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
+$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
+$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
+$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
+```
+<sub>Try this out in your browser @ <https://repl.it/@akamhy/WaybackPyBashGet></sub>
+
 ## Tests
 * [Here](https://github.com/akamhy/waybackpy/tree/master/tests)


 ## Dependency
-* None, just python standard libraries (re, json, urllib and datetime). Both python 2 and 3 are supported :)
+* None, just python standard libraries (re, json, urllib, argparse and datetime). Both python 2 and 3 are supported :)


 ## License
````
`index.rst` — 109 lines changed

```diff
@@ -22,20 +22,31 @@ Table of contents
 - `Installation <#installation>`__

 - `Usage <#usage>`__
+  - `As a python package <#as-a-python-package>`__
+
 - `Saving an url using
   save() <#capturing-aka-saving-an-url-using-save>`__
 - `Receiving the oldest archive for an URL Using
   oldest() <#receiving-the-oldest-archive-for-an-url-using-oldest>`__
 - `Receiving the recent most/newest archive for an URL using
   newest() <#receiving-the-newest-archive-for-an-url-using-newest>`__
-- `Receiving archive close to a specified year, month, day, hour, and
-  minute using
+- `Receiving archive close to a specified year, month, day, hour,
+  and minute using
   near() <#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near>`__
 - `Get the content of webpage using
   get() <#get-the-content-of-webpage-using-get>`__
 - `Count total archives for an URL using
   total\_archives() <#count-total-archives-for-an-url-using-total_archives>`__

+- `With CLI <#with-the-cli>`__
+
+  - `Save <#save>`__
+  - `Oldest archive <#oldest-archive>`__
+  - `Newest archive <#newest-archive>`__
+  - `Total archives <#total-number-of-archives>`__
+  - `Archive near a time <#archive-near-time>`__
+  - `Get the source code <#get-the-source-code>`__
+
 - `Tests <#tests>`__

 - `Dependency <#dependency>`__
@@ -55,9 +66,18 @@ Using `pip <https://en.wikipedia.org/wiki/Pip_(package_manager)>`__:

     pip install waybackpy

+or direct from this repository using git.
+
+.. code:: bash
+
+    pip install git+https://github.com/akamhy/waybackpy.git
+
 Usage
 -----

+As a python package
+~~~~~~~~~~~~~~~~~~~
+
 Capturing aka Saving an url using save()
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@@ -79,7 +99,7 @@ Capturing aka Saving an url using save()
 https://web.archive.org/web/20200504141153/https://github.com/akamhy/waybackpy

 Try this out in your browser @
-https://repl.it/repls/CompassionateRemoteOrigin#main.py\
+https://repl.it/@akamhy/WaybackPySaveExample\

 Receiving the oldest archive for an URL using oldest()
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -102,7 +122,7 @@ Receiving the oldest archive for an URL using oldest()
 http://web.archive.org/web/19981111184551/http://google.com:80/

 Try this out in your browser @
-https://repl.it/repls/MixedSuperDimensions#main.py\
+https://repl.it/@akamhy/WaybackPyOldestExample\

 Receiving the newest archive for an URL using newest()
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -125,7 +145,7 @@ Receiving the newest archive for an URL using newest()
 https://web.archive.org/web/20200714013225/https://www.facebook.com/

 Try this out in your browser @
-https://repl.it/repls/OblongMiniInteger#main.py\
+https://repl.it/@akamhy/WaybackPyNewestExample\

 Receiving archive close to a specified year, month, day, hour, and minute using near()
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -186,7 +206,7 @@ The library doesn't supports seconds yet. You are encourged to create a
 PR ;)

 Try this out in your browser @
-https://repl.it/repls/SparseDeadlySearchservice#main.py\
+https://repl.it/@akamhy/WaybackPyNearExample\

 Get the content of webpage using get()
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -222,7 +242,7 @@ Get the content of webpage using get()
 print(google_oldest_archive_source)

 Try this out in your browser @
-https://repl.it/repls/PinkHoneydewNonagon#main.py\
+https://repl.it/@akamhy/WaybackPyGetExample#main.py\

 Count total archives for an URL using total\_archives()
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -247,7 +267,78 @@ Count total archives for an URL using total\_archives()
 2440

 Try this out in your browser @
-https://repl.it/repls/DigitalUnconsciousNumbers#main.py\
+https://repl.it/@akamhy/WaybackPyTotalArchivesExample\
+
+With the CLI
+~~~~~~~~~~~~
+
+Save
+^^^^
+
+.. code:: bash
+
+    $ waybackpy --url "https://en.wikipedia.org/wiki/Social_media" --user_agent "my-unique-user-agent" --save
+    https://web.archive.org/web/20200719062108/https://en.wikipedia.org/wiki/Social_media
+
+Try this out in your browser @
+https://repl.it/@akamhy/WaybackPyBashSave\
+
+Oldest archive
+^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    $ waybackpy --url "https://en.wikipedia.org/wiki/SpaceX" --user_agent "my-unique-user-agent" --oldest
+    https://web.archive.org/web/20040803000845/http://en.wikipedia.org:80/wiki/SpaceX
+
+Try this out in your browser @
+https://repl.it/@akamhy/WaybackPyBashOldest\
+
+Newest archive
+^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    $ waybackpy --url "https://en.wikipedia.org/wiki/YouTube" --user_agent "my-unique-user-agent" --newest
+    https://web.archive.org/web/20200606044708/https://en.wikipedia.org/wiki/YouTube
+
+Try this out in your browser @
+https://repl.it/@akamhy/WaybackPyBashNewest\
+
+Total number of archives
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    $ waybackpy --url "https://en.wikipedia.org/wiki/Linux_kernel" --user_agent "my-unique-user-agent" --total
+    853
+
+Try this out in your browser @
+https://repl.it/@akamhy/WaybackPyBashTotal\
+
+Archive near time
+^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    $ waybackpy --url facebook.com --user_agent "my-unique-user-agent" --near --year 2012 --month 5 --day 12
+    https://web.archive.org/web/20120512142515/https://www.facebook.com/
+
+Try this out in your browser @
+https://repl.it/@akamhy/WaybackPyBashNear\
+
+Get the source code
+^^^^^^^^^^^^^^^^^^^
+
+.. code:: bash
+
+    $ waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
+    $ waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
+    $ waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
+    $ waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
+
+Try this out in your browser @
+https://repl.it/@akamhy/WaybackPyBashGet\
+
 Tests
 -----
@@ -257,7 +348,7 @@ Tests
 Dependency
 ----------

-- None, just python standard libraries (re, json, urllib and datetime).
+- None, just python standard libraries (re, json, urllib, argparse and datetime).
   Both python 2 and 3 are supported :)

 License
```
```diff
@@ -1,3 +1,7 @@
 [metadata]
 description-file = README.md
 license_file = LICENSE
+
+[flake8]
+max-line-length = 88
+extend-ignore = E203,W503
```
`setup.py` — 7 lines changed

```diff
@@ -19,7 +19,7 @@ setup(
     author = about['__author__'],
     author_email = about['__author_email__'],
     url = about['__url__'],
-    download_url = 'https://github.com/akamhy/waybackpy/archive/2.1.1.tar.gz',
+    download_url = 'https://github.com/akamhy/waybackpy/archive/2.1.4.tar.gz',
     keywords = ['wayback', 'archive', 'archive website', 'wayback machine', 'Internet Archive'],
     install_requires=[],
     python_requires= ">=2.7",
@@ -42,6 +42,11 @@ setup(
         'Programming Language :: Python :: 3.8',
         'Programming Language :: Python :: Implementation :: CPython',
     ],
+    entry_points={
+        'console_scripts': [
+            'waybackpy = waybackpy.cli:main'
+        ]
+    },
     project_urls={
         'Documentation': 'https://waybackpy.readthedocs.io',
         'Source': 'https://github.com/akamhy/waybackpy',
```
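The `entry_points` block added above is what makes the `waybackpy` command from the README's CLI examples exist: pip generates a `waybackpy` executable that calls `waybackpy.cli:main`. As a minimal sketch of what such a console-script target looks like (the flag names mirror the README examples; the function body here is hypothetical, not the real `waybackpy/cli.py`):

```python
import argparse


def main(argv=None):
    # Hypothetical stand-in for a console_scripts target like
    # "waybackpy = waybackpy.cli:main"; pip's generated wrapper
    # simply imports and calls this function.
    parser = argparse.ArgumentParser(prog="waybackpy")
    parser.add_argument("--url")
    parser.add_argument("--user_agent", default="waybackpy-agent")
    parser.add_argument("--total", action="store_true")
    parser.add_argument("--oldest", action="store_true")
    # A real CLI would dispatch on the parsed flags; returning the
    # namespace keeps this sketch inspectable.
    return parser.parse_args(argv)


args = main(["--url", "example.com", "--total"])
print(args.url, args.total)  # example.com True
```

Passing `argv=None` makes `argparse` fall back to `sys.argv[1:]`, which is exactly what happens when the installed `waybackpy` command runs.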
`tests/test_1.py` — deleted (`@@ -1,134 +0,0 @@`). The removed file was:

```python
# -*- coding: utf-8 -*-
import sys
sys.path.append("..")
import waybackpy
import pytest
import random
import time

user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"

def test_clean_url():
    time.sleep(10)
    test_url = " https://en.wikipedia.org/wiki/Network security "
    answer = "https://en.wikipedia.org/wiki/Network_security"
    target = waybackpy.Url(test_url, user_agent)
    test_result = target.clean_url()
    assert answer == test_result

def test_url_check():
    time.sleep(10)
    broken_url = "http://wwwgooglecom/"
    with pytest.raises(Exception) as e_info:
        waybackpy.Url(broken_url, user_agent)

def test_save():
    # Test for urls that exist and can be archived.
    time.sleep(10)

    url_list = [
        "en.wikipedia.org",
        "www.wikidata.org",
        "commons.wikimedia.org",
        "www.wiktionary.org",
        "www.w3schools.com",
        "www.youtube.com"
    ]
    x = random.randint(0, len(url_list)-1)
    url1 = url_list[x]
    target = waybackpy.Url(url1, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36")
    archived_url1 = target.save()
    assert url1 in archived_url1

    if sys.version_info > (3, 6):

        # Test for urls that are incorrect.
        with pytest.raises(Exception) as e_info:
            url2 = "ha ha ha ha"
            waybackpy.Url(url2, user_agent)
        time.sleep(5)
        # Test for urls not allowed to archive by robot.txt.
        with pytest.raises(Exception) as e_info:
            url3 = "http://www.archive.is/faq.html"
            target = waybackpy.Url(url3, "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) Gecko/20100101 Firefox/25.0")
            target.save()

        time.sleep(5)
        # Non existent urls, test
        with pytest.raises(Exception) as e_info:
            url4 = "https://githfgdhshajagjstgeths537agajaajgsagudadhuss8762346887adsiugujsdgahub.us"
            target = waybackpy.Url(url3, "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 Safari/533.20.27")
            target.save()

    else:
        pass

def test_near():
    time.sleep(10)
    url = "google.com"
    target = waybackpy.Url(url, "Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4")
    archive_near_year = target.near(year=2010)
    assert "2010" in archive_near_year

    if sys.version_info > (3, 6):
        time.sleep(5)
        archive_near_month_year = target.near( year=2015, month=2)
        assert ("201502" in archive_near_month_year) or ("201501" in archive_near_month_year) or ("201503" in archive_near_month_year)

        target = waybackpy.Url("www.python.org", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246")
        archive_near_hour_day_month_year = target.near(year=2008, month=5, day=9, hour=15)
        assert ("2008050915" in archive_near_hour_day_month_year) or ("2008050914" in archive_near_hour_day_month_year) or ("2008050913" in archive_near_hour_day_month_year)

        with pytest.raises(Exception) as e_info:
            NeverArchivedUrl = "https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
            target = waybackpy.Url(NeverArchivedUrl, user_agent)
            target.near(year=2010)
    else:
        pass

def test_oldest():
    time.sleep(10)
    url = "github.com/akamhy/waybackpy"
    target = waybackpy.Url(url, user_agent)
    assert "20200504141153" in target.oldest()

def test_newest():
    time.sleep(10)
    url = "github.com/akamhy/waybackpy"
    target = waybackpy.Url(url, user_agent)
    assert url in target.newest()

def test_get():
    time.sleep(10)
    target = waybackpy.Url("google.com", user_agent)
    assert "Welcome to Google" in target.get(target.oldest())

def test_total_archives():
    time.sleep(10)
    if sys.version_info > (3, 6):
        target = waybackpy.Url(" https://google.com ", user_agent)
        assert target.total_archives() > 500000
    else:
        pass
    time.sleep(5)
    target = waybackpy.Url(" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent)
    assert target.total_archives() == 0

if __name__ == "__main__":
    test_clean_url()
    print(".") #1
    test_url_check()
    print(".") #1
    test_get()
    print(".") #3
    test_near()
    print(".") #4
    test_newest()
    print(".") #5
    test_save()
    print(".") #6
    test_oldest()
    print(".") #7
    test_total_archives()
    print(".") #8
    print("OK")
```
`tests/test_cli.py` — new file, +43 lines

```python
# -*- coding: utf-8 -*-
import sys
import os
import pytest
import argparse

sys.path.append("..")
import waybackpy.cli as cli  # noqa: E402
from waybackpy.wrapper import Url  # noqa: E402

if sys.version_info > (3, 7):
    def test_save():
        obj = Url("https://pypi.org/user/akamhy/", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9")
        cli._save(obj)
else:
    pass

def test_get():
    args = argparse.Namespace(get='oldest')
    obj = Url("https://pypi.org/user/akamhy/", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9")
    cli._get(obj, args)
    args = argparse.Namespace(get='newest')
    cli._get(obj, args)
    args = argparse.Namespace(get='url')
    cli._get(obj, args)
    if sys.version_info > (3, 7):
        args = argparse.Namespace(get='save')
        cli._get(obj, args)
    else:
        pass

def test_oldest():
    obj = Url("https://pypi.org/user/akamhy/", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9")
    cli._oldest(obj)

def test_newest():
    obj = Url("https://pypi.org/user/akamhy/", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9")
    cli._newest(obj)

def test_near():
    args = argparse.Namespace(year=2020, month=6, day=1, hour=1, minute=1)
    obj = Url("https://pypi.org/user/akamhy/", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9")
    cli._near(obj, args)
```
`tests/test_wrapper.py` — new file, +179 lines

```python
# -*- coding: utf-8 -*-
import sys
import pytest
import random
import time

sys.path.append("..")
import waybackpy.wrapper as waybackpy  # noqa: E402

if sys.version_info >= (3, 0):  # If the python ver >= 3
    from urllib.request import Request, urlopen
    from urllib.error import URLError
else:  # For python2.x
    from urllib2 import Request, urlopen, URLError

user_agent = "Mozilla/5.0 (Windows NT 6.2; rv:20.0) Gecko/20121202 Firefox/20.0"


def test_clean_url():
    test_url = " https://en.wikipedia.org/wiki/Network security "
    answer = "https://en.wikipedia.org/wiki/Network_security"
    target = waybackpy.Url(test_url, user_agent)
    test_result = target._clean_url()
    assert answer == test_result


def test_url_check():
    broken_url = "http://wwwgooglecom/"
    with pytest.raises(Exception):
        waybackpy.Url(broken_url, user_agent)


def test_save():
    # Test for urls that exist and can be archived.
    time.sleep(10)

    url_list = [
        "en.wikipedia.org",
        "www.wikidata.org",
        "commons.wikimedia.org",
        "www.wiktionary.org",
        "www.w3schools.com",
        "twitter.com",
    ]
    x = random.randint(0, len(url_list) - 1)
    url1 = url_list[x]
    target = waybackpy.Url(
        url1,
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/36.0.1944.0 Safari/537.36",
    )
    archived_url1 = target.save()
    assert url1 in archived_url1

    if sys.version_info > (3, 6):

        # Test for urls that are incorrect.
        with pytest.raises(Exception):
            url2 = "ha ha ha ha"
            waybackpy.Url(url2, user_agent)
        time.sleep(5)
        # Test for urls not allowed to archive by robot.txt.
        with pytest.raises(Exception):
            url3 = "http://www.archive.is/faq.html"
            target = waybackpy.Url(
                url3,
                "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:25.0) "
                "Gecko/20100101 Firefox/25.0",
            )
            target.save()

        time.sleep(5)
        # Non existent urls, test
        with pytest.raises(Exception):
            url4 = (
                "https://githfgdhshajagjstgeths537agajaajgsagudadhuss87623"
                "46887adsiugujsdgahub.us"
            )
            target = waybackpy.Url(
                url3,
                "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                "AppleWebKit/533.20.25 (KHTML, like Gecko) Version/5.0.4 "
                "Safari/533.20.27",
            )
            target.save()

    else:
        pass


def test_near():
    time.sleep(10)
    url = "google.com"
    target = waybackpy.Url(
        url,
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; de-DE) AppleWebKit/533.20.25 "
        "(KHTML, like Gecko) Version/5.0.3 Safari/533.19.4",
    )
    archive_near_year = target.near(year=2010)
    assert "2010" in archive_near_year

    if sys.version_info > (3, 6):
        time.sleep(5)
        archive_near_month_year = target.near(year=2015, month=2)
        assert (
            ("201502" in archive_near_month_year)
            or ("201501" in archive_near_month_year)
            or ("201503" in archive_near_month_year)
        )

        target = waybackpy.Url(
            "www.python.org",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.246",
        )
        archive_near_hour_day_month_year = target.near(
            year=2008, month=5, day=9, hour=15
        )
        assert (
            ("2008050915" in archive_near_hour_day_month_year)
            or ("2008050914" in archive_near_hour_day_month_year)
            or ("2008050913" in archive_near_hour_day_month_year)
        )

        with pytest.raises(Exception):
            NeverArchivedUrl = (
                "https://ee_3n.wrihkeipef4edia.org/rwti5r_ki/Nertr6w_rork_rse7c_urity"
            )
            target = waybackpy.Url(NeverArchivedUrl, user_agent)
            target.near(year=2010)
    else:
        pass


def test_oldest():
    url = "github.com/akamhy/waybackpy"
    target = waybackpy.Url(url, user_agent)
    assert "20200504141153" in target.oldest()


def test_newest():
    url = "github.com/akamhy/waybackpy"
    target = waybackpy.Url(url, user_agent)
    assert url in target.newest()


def test_get():
```
|
||||||
|
target = waybackpy.Url("google.com", user_agent)
|
||||||
|
assert "Welcome to Google" in target.get(target.oldest())
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def test_wayback_timestamp():
|
||||||
|
ts = waybackpy._wayback_timestamp(
|
||||||
|
year=2020, month=1, day=2, hour=3, minute=4
|
||||||
|
)
|
||||||
|
assert "202001020304" in str(ts)
|
||||||
|
|
||||||
|
|
||||||
|
def test_get_response():
|
||||||
|
hdr = {
|
||||||
|
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) "
|
||||||
|
"Gecko/20100101 Firefox/78.0"
|
||||||
|
}
|
||||||
|
req = Request("https://www.google.com", headers=hdr) # nosec
|
||||||
|
response = waybackpy._get_response(req)
|
||||||
|
assert response.code == 200
|
||||||
|
|
||||||
|
|
||||||
|
def test_total_archives():
|
||||||
|
if sys.version_info > (3, 6):
|
||||||
|
target = waybackpy.Url(" https://google.com ", user_agent)
|
||||||
|
assert target.total_archives() > 500000
|
||||||
|
else:
|
||||||
|
pass
|
||||||
|
target = waybackpy.Url(
|
||||||
|
" https://gaha.e4i3n.m5iai3kip6ied.cima/gahh2718gs/ahkst63t7gad8 ", user_agent
|
||||||
|
)
|
||||||
|
assert target.total_archives() == 0
|
||||||
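The helper exercised by `test_wayback_timestamp` above simply zero-fills each time component to at least two digits and concatenates them into a Wayback Machine URL timestamp. A standalone sketch of that formatting (a reimplementation for illustration, not the library code itself):

```python
def wayback_timestamp(**kwargs):
    # Concatenate year, month, day, hour, and minute, each zero-filled
    # to (at least) two digits, matching the YYYYMMDDhhmm format used
    # in Wayback Machine archive URLs.
    return "".join(
        str(kwargs[key]).zfill(2)
        for key in ["year", "month", "day", "hour", "minute"]
    )


print(wayback_timestamp(year=2020, month=1, day=2, hour=3, minute=4))
# → 202001020304
```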
@@ -28,5 +28,13 @@ Full documentation @ <https://akamhy.github.io/waybackpy/>.
 """

 from .wrapper import Url
-from .__version__ import __title__, __description__, __url__, __version__
-from .__version__ import __author__, __author_email__, __license__, __copyright__
+from .__version__ import (
+    __title__,
+    __description__,
+    __url__,
+    __version__,
+    __author__,
+    __author_email__,
+    __license__,
+    __copyright__,
+)
@@ -3,7 +3,7 @@
 __title__ = "waybackpy"
 __description__ = "A Python library that interfaces with the Internet Archive's Wayback Machine API. Archive pages and retrieve archived pages easily."
 __url__ = "https://akamhy.github.io/waybackpy/"
-__version__ = "2.1.1"
+__version__ = "2.1.4"
 __author__ = "akamhy"
 __author_email__ = "akash3pro@gmail.com"
 __license__ = "MIT"
104 waybackpy/cli.py Normal file
@@ -0,0 +1,104 @@
# -*- coding: utf-8 -*-
from __future__ import print_function
import argparse
from waybackpy.wrapper import Url
from waybackpy.__version__ import __version__


def _save(obj):
    print(obj.save())


def _oldest(obj):
    print(obj.oldest())


def _newest(obj):
    print(obj.newest())


def _total_archives(obj):
    print(obj.total_archives())


def _near(obj, args):
    _near_args = {}
    if args.year:
        _near_args["year"] = args.year
    if args.month:
        _near_args["month"] = args.month
    if args.day:
        _near_args["day"] = args.day
    if args.hour:
        _near_args["hour"] = args.hour
    if args.minute:
        _near_args["minute"] = args.minute
    print(obj.near(**_near_args))


def _get(obj, args):
    if args.get.lower() == "url":
        print(obj.get())

    elif args.get.lower() == "oldest":
        print(obj.get(obj.oldest()))

    elif args.get.lower() == "latest" or args.get.lower() == "newest":
        print(obj.get(obj.newest()))

    elif args.get.lower() == "save":
        print(obj.get(obj.save()))

    else:
        print("Use get as \"--get 'source'\", 'source' can be one of the followings: \
\n1) url - get the source code of the url specified using --url/-u.\
\n2) oldest - get the source code of the oldest archive for the supplied url.\
\n3) newest - get the source code of the newest archive for the supplied url.\
\n4) save - Create a new archive and get the source code of this new archive for the supplied url.")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-u", "--url", help="URL on which Wayback machine operations would occur.")
    parser.add_argument("-ua", "--user_agent", help="User agent, default user_agent is \"waybackpy python package - https://github.com/akamhy/waybackpy\".")
    parser.add_argument("-s", "--save", action='store_true', help="Save the URL on the Wayback machine.")
    parser.add_argument("-o", "--oldest", action='store_true', help="Oldest archive for the specified URL.")
    parser.add_argument("-n", "--newest", action='store_true', help="Newest archive for the specified URL.")
    parser.add_argument("-t", "--total", action='store_true', help="Total number of archives for the specified URL.")
    parser.add_argument("-g", "--get", help="Prints the source code of the supplied url. Use '--get help' for extended usage.")
    parser.add_argument("-v", "--version", action='store_true', help="Prints the waybackpy version.")

    parser.add_argument("-N", "--near", action='store_true', help="Latest/Newest archive for the specified URL.")
    parser.add_argument("-Y", "--year", type=int, help="Year in integer. For use with --near.")
    parser.add_argument("-M", "--month", type=int, help="Month in integer. For use with --near.")
    parser.add_argument("-D", "--day", type=int, help="Day in integer. For use with --near.")
    parser.add_argument("-H", "--hour", type=int, help="Hour in integer. For use with --near.")
    parser.add_argument("-MIN", "--minute", type=int, help="Minute in integer. For use with --near.")

    args = parser.parse_args()

    if args.version:
        print(__version__)
        return

    if not args.url:
        print("Specify an URL. See --help")
        return

    # create the object with or without the user_agent
    if args.user_agent:
        obj = Url(args.url, args.user_agent)
    else:
        obj = Url(args.url)

    if args.save:
        _save(obj)
    elif args.oldest:
        _oldest(obj)
    elif args.newest:
        _newest(obj)
    elif args.total:
        _total_archives(obj)
    elif args.near:
        _near(obj, args)
    elif args.get:
        _get(obj, args)
    else:
        print("Usage: waybackpy --url [URL] --user_agent [USER AGENT] [OPTIONS]. See --help")


if __name__ == "__main__":
    main()
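The `_near` handler in cli.py forwards only the flags the user actually set, so that `Url.near()` can fall back to its own defaults for the rest. The same filtering pattern can be sketched in isolation (flag names mirror the CLI above; the dict comprehension is a compact variant of the chain of `if` checks):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-Y", "--year", type=int)
parser.add_argument("-M", "--month", type=int)
parser.add_argument("-D", "--day", type=int)

# Simulate: waybackpy -u example.com --near -Y 2010 -M 2
args = parser.parse_args(["-Y", "2010", "-M", "2"])

# Keep only the parameters that were supplied on the command line;
# unset flags parse as None and are dropped.
near_args = {k: v for k, v in vars(args).items() if v is not None}
print(near_args)
# → {'year': 2010, 'month': 2}
```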
@@ -5,6 +5,7 @@ import sys
 import json
 from datetime import datetime
 from waybackpy.exceptions import WaybackError
+from waybackpy.__version__ import __version__

 if sys.version_info >= (3, 0):  # If the python ver >= 3
     from urllib.request import Request, urlopen
@@ -14,153 +15,156 @@ else:  # For python2.x

default_UA = "waybackpy python package - https://github.com/akamhy/waybackpy"


def _archive_url_parser(header):
    """Parse out the archive from header."""
    # Regex1
    arch = re.search(
        r"rel=\"memento.*?(web\.archive\.org/web/[0-9]{14}/.*?)>", str(header)
    )
    if arch:
        return arch.group(1)
    # Regex2
    arch = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", str(header))
    if arch:
        return arch.group(1)
    raise WaybackError(
        "No archive URL found in the API response. "
        "This version of waybackpy (%s) is likely out of date. Visit "
        "https://github.com/akamhy/waybackpy for the latest version "
        "of waybackpy.\nHeader:\n%s" % (__version__, str(header))
    )


def _wayback_timestamp(**kwargs):
    """Return a formatted timestamp."""
    return "".join(
        str(kwargs[key]).zfill(2) for key in ["year", "month", "day", "hour", "minute"]
    )


def _get_response(req):
    """Get response for the supplied request."""
    try:
        response = urlopen(req)  # nosec
    except Exception:
        try:
            response = urlopen(req)  # nosec
        except Exception as e:
            exc = WaybackError("Error while retrieving %s" % req.full_url)
            exc.__cause__ = e
            raise exc
    return response


class Url:
    """waybackpy Url object"""

    def __init__(self, url, user_agent=default_UA):
        self.url = url
        self.user_agent = user_agent
        self._url_check()  # checks url validity on init.

    def __repr__(self):
        return "waybackpy.Url(url=%s, user_agent=%s)" % (self.url, self.user_agent)

    def __str__(self):
        return "%s" % self._clean_url()

    def __len__(self):
        return len(self._clean_url())

    def _url_check(self):
        """Check for common URL problems."""
        if "." not in self.url:
            raise URLError("'%s' is not a valid URL." % self.url)

    def _clean_url(self):
        """Fix the URL, if possible."""
        return str(self.url).strip().replace(" ", "_")

    def save(self):
        """Create a new Wayback Machine archive for this URL."""
        request_url = "https://web.archive.org/save/" + self._clean_url()
        hdr = {"User-Agent": "%s" % self.user_agent}  # nosec
        req = Request(request_url, headers=hdr)  # nosec
        header = _get_response(req).headers
        return "https://" + _archive_url_parser(header)

    def get(self, url="", user_agent="", encoding=""):
        """Return the source code of the supplied URL.

        If encoding is not supplied, it is auto-detected from the response.
        """
        if not url:
            url = self._clean_url()
        if not user_agent:
            user_agent = self.user_agent

        hdr = {"User-Agent": "%s" % user_agent}
        req = Request(url, headers=hdr)  # nosec
        response = _get_response(req)
        if not encoding:
            try:
                encoding = response.headers["content-type"].split("charset=")[-1]
            except AttributeError:
                encoding = "UTF-8"
        return response.read().decode(encoding.replace("text/html", "UTF-8", 1))

    def near(self, year=None, month=None, day=None, hour=None, minute=None):
        """Return the closest Wayback Machine archive to the time supplied.

        Supported params are year, month, day, hour and minute.
        Any non-supplied parameters default to the current time.
        """
        now = datetime.utcnow().timetuple()
        timestamp = _wayback_timestamp(
            year=year if year else now.tm_year,
            month=month if month else now.tm_mon,
            day=day if day else now.tm_mday,
            hour=hour if hour else now.tm_hour,
            minute=minute if minute else now.tm_min,
        )
        request_url = "https://archive.org/wayback/available?url=%s&timestamp=%s" % (
            self._clean_url(),
            timestamp,
        )
        hdr = {"User-Agent": "%s" % self.user_agent}
        req = Request(request_url, headers=hdr)  # nosec
        response = _get_response(req)
        data = json.loads(response.read().decode("UTF-8"))
        if not data["archived_snapshots"]:
            raise WaybackError(
                "'%s' is not yet archived. Use waybackpy.Url(url, user_agent).save() "
                "to create a new archive." % self._clean_url()
            )
        archive_url = data["archived_snapshots"]["closest"]["url"]
        # wayback machine returns http sometimes, idk why? But they support https
        archive_url = archive_url.replace(
            "http://web.archive.org/web/", "https://web.archive.org/web/", 1
        )
        return archive_url

    def oldest(self, year=1994):
        """Return the oldest Wayback Machine archive for this URL."""
        return self.near(year=year)

    def newest(self):
        """Return the newest Wayback Machine archive available for this URL.

        Due to Wayback Machine database lag, this may not always be the
        most recent archive.
        """
        return self.near()

    def total_archives(self):
        """Returns the total number of Wayback Machine archives for this URL."""
        hdr = {"User-Agent": "%s" % self.user_agent}
        request_url = (
            "https://web.archive.org/cdx/search/cdx?url=%s&output=json&fl=statuscode"
            % self._clean_url()
        )
        req = Request(request_url, headers=hdr)  # nosec
        response = _get_response(req)
        # Most efficient method to count number of archives (yet)
        return str(response.read()).count(",")
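The fallback regex in `_archive_url_parser` pulls the archive path out of an `X-Cache-Key` header, stripping a trailing two-uppercase-letter suffix. It can be exercised offline against a synthetic header string (the header value below is made up for illustration; only its shape mirrors what the `/save/` endpoint returns):

```python
import re

# Made-up header of the shape _archive_url_parser's second regex expects:
# "X-Cache-Key: https" followed directly by the archive path, ending in
# a two-uppercase-letter suffix (here "US") that the regex strips.
header = (
    "X-Cache-Key: https"
    "web.archive.org/web/20200504141153/https://example.com/US"
)

arch = re.search(r"X-Cache-Key:\shttps(.*)[A-Z]{2}", header)

# The caller prepends the scheme, as Url.save() does.
archive_url = "https://" + arch.group(1)
print(archive_url)
# → https://web.archive.org/web/20200504141153/https://example.com/
```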