update
This commit is contained in:
parent
f1065ed1c8
commit
ea023e98da
151
index.rst
@ -5,12 +5,14 @@ waybackpy
|
|||||||
|Codacy Badge| |Maintainability| |CodeFactor| |made-with-python| |pypi|
|
|Codacy Badge| |Maintainability| |CodeFactor| |made-with-python| |pypi|
|
||||||
|PyPI - Python Version| |Maintenance| |Repo size| |License: MIT|
|
|PyPI - Python Version| |Maintenance| |Repo size| |License: MIT|
|
||||||
|
|
||||||
|Internet Archive| |Wayback Machine|
|
.. figure:: https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy-colored%20284.png
|
||||||
|
:alt: Wayback Machine
|
||||||
|
|
||||||
Waybackpy is a Python library that interfaces with the `Internet
|
Wayback Machine
|
||||||
|
Waybackpy is a Python library that interfaces with `Internet
|
||||||
Archive <https://en.wikipedia.org/wiki/Internet_Archive>`__'s `Wayback
|
Archive <https://en.wikipedia.org/wiki/Internet_Archive>`__'s `Wayback
|
||||||
Machine <https://en.wikipedia.org/wiki/Wayback_Machine>`__ API. Archive
|
Machine <https://en.wikipedia.org/wiki/Wayback_Machine>`__ API. Archive
|
||||||
pages and retrieve archived pages easily.
|
webpages and retrieve archived webpages easily.
|
||||||
|
|
||||||
Table of contents
|
Table of contents
|
||||||
=================
|
=================
|
||||||
@ -24,21 +26,23 @@ Table of contents
|
|||||||
- `Usage <#usage>`__
|
- `Usage <#usage>`__
|
||||||
- `As a Python package <#as-a-python-package>`__
|
- `As a Python package <#as-a-python-package>`__
|
||||||
|
|
||||||
- `Saving an url using
|
- `Saving an url <#capturing-aka-saving-an-url-using-save>`__
|
||||||
save() <#capturing-aka-saving-an-url-using-save>`__
|
- `Retrieving the oldest
|
||||||
- `Receiving the oldest archive for an URL Using
|
archive <#retrieving-the-oldest-archive-for-an-url-using-oldest>`__
|
||||||
oldest() <#receiving-the-oldest-archive-for-an-url-using-oldest>`__
|
- `Retrieving the most recent (newest)
|
||||||
- `Receiving the recent most/newest archive for an URL using
|
archive <#retrieving-the-newest-archive-for-an-url-using-newest>`__
|
||||||
newest() <#receiving-the-newest-archive-for-an-url-using-newest>`__
|
- `Retrieving archive close to a specified year, month, day, hour,
|
||||||
- `Receiving archive close to a specified year, month, day, hour,
|
and
|
||||||
and minute using
|
minute <#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near>`__
|
||||||
near() <#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near>`__
|
- `Get the content of
|
||||||
- `Get the content of webpage using
|
webpage <#get-the-content-of-webpage-using-get>`__
|
||||||
get() <#get-the-content-of-webpage-using-get>`__
|
- `Count total archives for an
|
||||||
- `Count total archives for an URL using
|
URL <#count-total-archives-for-an-url-using-total_archives>`__
|
||||||
total\_archives() <#count-total-archives-for-an-url-using-total_archives>`__
|
- `List of URLs that Wayback Machine knows and has archived for a
|
||||||
|
domain
|
||||||
|
name <#list-of-urls-that-wayback-machine-knows-and-has-archived-for-a-domain-name>`__
|
||||||
|
|
||||||
- `With Command-line interface <#with-the-command-line-interface>`__
|
- `As a Command-line tool <#with-the-command-line-interface>`__
|
||||||
|
|
||||||
- `Save <#save>`__
|
- `Save <#save>`__
|
||||||
- `Oldest archive <#oldest-archive>`__
|
- `Oldest archive <#oldest-archive>`__
|
||||||
@ -46,11 +50,15 @@ Table of contents
|
|||||||
- `Total archives <#total-number-of-archives>`__
|
- `Total archives <#total-number-of-archives>`__
|
||||||
- `Archive near a time <#archive-near-time>`__
|
- `Archive near a time <#archive-near-time>`__
|
||||||
- `Get the source code <#get-the-source-code>`__
|
- `Get the source code <#get-the-source-code>`__
|
||||||
|
- `Fetch all the URLs that the Wayback Machine knows for a
|
||||||
|
domain <#fetch-all-the-urls-that-the-wayback-machine-knows-for-a-domain>`__
|
||||||
|
|
||||||
- `Tests <#tests>`__
|
- `Tests <#tests>`__
|
||||||
|
|
||||||
- `Dependency <#dependency>`__
|
- `Dependency <#dependency>`__
|
||||||
|
|
||||||
|
- `Packaging <#packaging>`__
|
||||||
|
|
||||||
- `License <#license>`__
|
- `License <#license>`__
|
||||||
|
|
||||||
.. raw:: html
|
.. raw:: html
|
||||||
@ -89,7 +97,7 @@ Capturing aka Saving an url using save()
|
|||||||
|
|
||||||
url = "https://en.wikipedia.org/wiki/Multivariable_calculus",
|
url = "https://en.wikipedia.org/wiki/Multivariable_calculus",
|
||||||
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
|
user_agent = "Mozilla/5.0 (Windows NT 5.1; rv:40.0) Gecko/20100101 Firefox/40.0"
|
||||||
|
|
||||||
).save()
|
).save()
|
||||||
|
|
||||||
print(new_archive_url)
|
print(new_archive_url)
|
||||||
@ -101,8 +109,8 @@ Capturing aka Saving an url using save()
|
|||||||
Try this out in your browser @
|
Try this out in your browser @
|
||||||
https://repl.it/@akamhy/WaybackPySaveExample\
|
https://repl.it/@akamhy/WaybackPySaveExample\
|
||||||
|
|
||||||
Receiving the oldest archive for an URL using oldest()
|
Retrieving the oldest archive for an URL using oldest()
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
.. code:: python
|
.. code:: python
|
||||||
|
|
||||||
@ -112,7 +120,6 @@ Receiving the oldest archive for an URL using oldest()
|
|||||||
|
|
||||||
"https://www.google.com/",
|
"https://www.google.com/",
|
||||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
|
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:40.0) Gecko/20100101 Firefox/40.0"
|
||||||
|
|
||||||
).oldest()
|
).oldest()
|
||||||
|
|
||||||
print(oldest_archive_url)
|
print(oldest_archive_url)
|
||||||
@ -124,8 +131,8 @@ Receiving the oldest archive for an URL using oldest()
|
|||||||
Try this out in your browser @
|
Try this out in your browser @
|
||||||
https://repl.it/@akamhy/WaybackPyOldestExample\
|
https://repl.it/@akamhy/WaybackPyOldestExample\
|
||||||
|
|
||||||
Receiving the newest archive for an URL using newest()
|
Retrieving the newest archive for an URL using newest()
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
.. code:: python
|
.. code:: python
|
||||||
|
|
||||||
@ -135,7 +142,7 @@ Receiving the newest archive for an URL using newest()
|
|||||||
|
|
||||||
"https://www.facebook.com/",
|
"https://www.facebook.com/",
|
||||||
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
|
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10; rv:39.0) Gecko/20100101 Firefox/39.0"
|
||||||
|
|
||||||
).newest()
|
).newest()
|
||||||
|
|
||||||
print(newest_archive_url)
|
print(newest_archive_url)
|
||||||
@ -147,8 +154,8 @@ Receiving the newest archive for an URL using newest()
|
|||||||
Try this out in your browser @
|
Try this out in your browser @
|
||||||
https://repl.it/@akamhy/WaybackPyNewestExample\
|
https://repl.it/@akamhy/WaybackPyNewestExample\
|
||||||
|
|
||||||
Receiving archive close to a specified year, month, day, hour, and minute using near()
|
Retrieving archive close to a specified year, month, day, hour, and minute using near()
|
||||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
.. code:: python
|
.. code:: python
|
||||||
|
|
||||||
@ -269,6 +276,35 @@ Count total archives for an URL using total\_archives()
|
|||||||
Try this out in your browser @
|
Try this out in your browser @
|
||||||
https://repl.it/@akamhy/WaybackPyTotalArchivesExample\
|
https://repl.it/@akamhy/WaybackPyTotalArchivesExample\
|
||||||
|
|
||||||
|
List of URLs that Wayback Machine knows and has archived for a domain name
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
1) If alive=True is set, waybackpy will check all URLs to identify the
|
||||||
|
alive URLs. Avoid this with popular websites like Google, since it would
|
||||||
|
take too long.
|
||||||
|
2) To include URLs from subdomains, set subdomain=True.
|
||||||
|
|
||||||
|
.. code:: python
|
||||||
|
|
||||||
|
import waybackpy
|
||||||
|
|
||||||
|
URL = "akamhy.github.io"
|
||||||
|
UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
|
||||||
|
|
||||||
|
known_urls = waybackpy.Url(url=URL, user_agent=UA).known_urls(alive=True, subdomain=False) # alive and subdomain are optional.
|
||||||
|
|
||||||
|
print(known_urls) # known_urls() returns list of URLs
|
||||||
|
|
||||||
|
.. code:: bash
|
||||||
|
|
||||||
|
['http://akamhy.github.io',
|
||||||
|
'https://akamhy.github.io/waybackpy/',
|
||||||
|
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
|
||||||
|
'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']
|
||||||
|
|
||||||
|
Try this out in your browser @
|
||||||
|
https://repl.it/@akamhy/WaybackPyKnownURLsToWayBackMachineExample#main.py\
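
The sample output above contains two ``style.css`` entries that differ only in their ``?v=`` query string. The following is a minimal sketch, assuming you want one entry per page regardless of query string, of how such a list could be collapsed with the standard library. ``dedupe_by_path`` is a hypothetical helper and not part of waybackpy; the list is hard-coded from the sample output so the snippet runs without network access.

```python
from urllib.parse import urlsplit, urlunsplit


def dedupe_by_path(urls):
    """Return URLs with query strings and fragments removed, keeping first-seen order."""
    seen = set()
    unique = []
    for url in urls:
        parts = urlsplit(url)
        # Rebuild the URL from scheme, host and path only.
        stripped = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
        if stripped not in seen:
            seen.add(stripped)
            unique.append(stripped)
    return unique


# Hard-coded copy of the sample output; in practice this would be the
# list returned by known_urls().
known_urls = [
    "http://akamhy.github.io",
    "https://akamhy.github.io/waybackpy/",
    "https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11",
    "https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c",
]

unique_urls = dedupe_by_path(known_urls)
print(unique_urls)
```

Stripping the query string is only one possible notion of "same page"; keep the original URLs if the versioned assets matter to you.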
|
||||||
|
|
||||||
With the Command-line interface
|
With the Command-line interface
|
||||||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
|
||||||
|
|
||||||
@ -332,30 +368,73 @@ Get the source code
|
|||||||
|
|
||||||
.. code:: bash
|
.. code:: bash
|
||||||
|
|
||||||
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
|
waybackpy --url google.com --user_agent "my-unique-user-agent" --get url # Prints the source code of the url
|
||||||
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
|
waybackpy --url google.com --user_agent "my-unique-user-agent" --get oldest # Prints the source code of the oldest archive
|
||||||
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
|
waybackpy --url google.com --user_agent "my-unique-user-agent" --get newest # Prints the source code of the newest archive
|
||||||
$ waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
|
waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save a new archive on wayback machine then print the source code of this archive.
|
||||||
|
|
||||||
Try this out in your browser @
|
Try this out in your browser @
|
||||||
https://repl.it/@akamhy/WaybackPyBashGet\
|
https://repl.it/@akamhy/WaybackPyBashGet\
|
||||||
|
|
||||||
|
Fetch all the URLs that the Wayback Machine knows for a domain
|
||||||
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||||
|
|
||||||
|
1) You can add the '--alive' flag to only fetch alive links.
|
||||||
|
2) You can add the '--subdomain' flag to add subdomains.
|
||||||
|
3) '--alive' and '--subdomain' flags can be used simultaneously.
|
||||||
|
4) All links will be saved in a file, and the file will be created in
|
||||||
|
the current working directory.
|
||||||
|
|
||||||
|
.. code:: bash
|
||||||
|
|
||||||
|
pip install waybackpy
|
||||||
|
|
||||||
|
# Ignore the above installation line.
|
||||||
|
|
||||||
|
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
|
||||||
|
# Prints all known URLs under akamhy.github.io
|
||||||
|
|
||||||
|
|
||||||
|
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --alive
|
||||||
|
# Prints all known URLs under akamhy.github.io which are still working and not dead links.
|
||||||
|
|
||||||
|
|
||||||
|
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
|
||||||
|
# Prints all known URLs under akamhy.github.io, including subdomains
|
||||||
|
|
||||||
|
|
||||||
|
waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain --alive
|
||||||
|
# Prints all known URLs under akamhy.github.io, including subdomains, that are still alive (not dead links).
|
||||||
|
|
||||||
|
Try this out in your browser @
|
||||||
|
https://repl.it/@akamhy/WaybackpyKnownUrlsFromWaybackMachine#main.sh\
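
The flag combinations above can also be assembled programmatically. The following is a minimal sketch, assuming Python 3.8+ (for ``shlex.join``). ``build_known_urls_command`` is a hypothetical helper and not part of waybackpy; it only builds the argument list from the flags shown above and does not run anything.

```python
import shlex


def build_known_urls_command(domain, user_agent, alive=False, subdomain=False):
    """Build the argument list for a waybackpy --known_urls invocation."""
    args = ["waybackpy", "--url", domain, "--user_agent", user_agent, "--known_urls"]
    # Both flags are optional and can be combined.
    if subdomain:
        args.append("--subdomain")
    if alive:
        args.append("--alive")
    return args


command = build_known_urls_command(
    "akamhy.github.io", "my-user-agent", alive=True, subdomain=True
)
# shlex.join quotes arguments where needed, so the result can be pasted into a shell.
print(shlex.join(command))
```

With waybackpy installed, the list could be passed to ``subprocess.run(command, check=True)`` to execute it.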
|
||||||
|
|
||||||
Tests
|
Tests
|
||||||
-----
|
-----
|
||||||
|
|
||||||
- `Here <https://github.com/akamhy/waybackpy/tree/master/tests>`__
|
`Here <https://github.com/akamhy/waybackpy/tree/master/tests>`__
|
||||||
|
|
||||||
Dependency
|
Dependency
|
||||||
----------
|
----------
|
||||||
|
|
||||||
- None, just python standard libraries (re, json, urllib, argparse and
|
None, just Python standard libraries (re, json, urllib, argparse and
|
||||||
datetime). Both python 2 and 3 are supported :)
|
datetime). Both Python 2 and 3 are supported :)
|
||||||
|
|
||||||
|
Packaging
|
||||||
|
---------
|
||||||
|
|
||||||
|
1. Increment version.
|
||||||
|
|
||||||
|
2. Build package ``python setup.py sdist bdist_wheel``.
|
||||||
|
|
||||||
|
3. Sign & upload the package ``twine upload -s dist/*``.
|
||||||
|
|
||||||
License
|
License
|
||||||
-------
|
-------
|
||||||
|
|
||||||
`MIT
|
Released under the MIT License. See
|
||||||
License <https://github.com/akamhy/waybackpy/blob/master/LICENSE>`__
|
`license <https://github.com/akamhy/waybackpy/blob/master/LICENSE>`__
|
||||||
|
for details.
|
||||||
|
|
||||||
.. |contributions welcome| image:: https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square
|
.. |contributions welcome| image:: https://img.shields.io/static/v1.svg?label=Contributions&message=Welcome&color=0059b3&style=flat-square
|
||||||
.. |Build Status| image:: https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square
|
.. |Build Status| image:: https://img.shields.io/travis/akamhy/waybackpy.svg?label=Travis%20CI&logo=travis&style=flat-square
|
||||||
@ -382,5 +461,3 @@ License <https://github.com/akamhy/waybackpy/blob/master/LICENSE>`__
|
|||||||
.. |Repo size| image:: https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square
|
.. |Repo size| image:: https://img.shields.io/github/repo-size/akamhy/waybackpy.svg?label=Repo%20size&style=flat-square
|
||||||
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
|
.. |License: MIT| image:: https://img.shields.io/badge/License-MIT-yellow.svg
|
||||||
:target: https://github.com/akamhy/waybackpy/blob/master/LICENSE
|
:target: https://github.com/akamhy/waybackpy/blob/master/LICENSE
|
||||||
.. |Internet Archive| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/8/84/Internet_Archive_logo_and_wordmark.svg/84px-Internet_Archive_logo_and_wordmark.svg.png
|
|
||||||
.. |Wayback Machine| image:: https://upload.wikimedia.org/wikipedia/commons/thumb/0/01/Wayback_Machine_logo_2010.svg/284px-Wayback_Machine_logo_2010.svg.png
|
|
||||||
|