Compare commits
12 Commits
Author | SHA1 | Date | |
---|---|---|---|
3b3e78d901 | |||
0202efd39d | |||
25c0adacb0 | |||
5bd16a42e7 | |||
57f4be53d5 | |||
64a4ce88af | |||
5407681c34 | |||
cfd977135d | |||
7a5e0bfdaf | |||
48dcda8020 | |||
3ed2170a32 | |||
d6ef55020c |
25
CITATION.cff
Normal file
@@ -0,0 +1,25 @@
+cff-version: 1.2.0
+message: "If you use this software, please cite it as below."
+title: waybackpy
+abstract: "Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily."
+version: '3.0.6'
+doi: 10.5281/ZENODO.3977276
+date-released: 2022-03-15
+type: software
+authors:
+  - given-names: Akash
+    family-names: Mahanty
+    email: akamhy@yahoo.com
+    orcid: https://orcid.org/0000-0003-2482-8227
+keywords:
+  - Archive Website
+  - Wayback Machine
+  - Internet Archive
+  - Wayback Machine CLI
+  - Wayback Machine Python
+  - Internet Archiving
+  - Availability API
+  - CDX API
+  - savepagenow
+license: MIT
+repository-code: "https://github.com/akamhy/waybackpy"
54
CONTRIBUTING.md
Normal file
@@ -0,0 +1,54 @@
+# Welcome to waybackpy contributing guide
+
+
+## Getting started
+
+Read our [Code of Conduct](./CODE_OF_CONDUCT.md).
+
+## Creating an issue
+
+It's a good idea to open an issue and discuss suspected bugs and new feature ideas with the maintainers. Somebody might be working on your bug/idea and it would be best to discuss it to avoid wasting your time. It is a recommendation. You may avoid creating an issue and directly open pull requests.
+
+## Fork this repository
+
+Fork this repository. See '[Fork a repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo)' for help forking this repository on GitHub.
+
+## Make changes to the forked copy
+
+Make the required changes to your forked copy of waybackpy, please don't forget to add or update comments and docstrings.
+
+## Add tests for your changes
+
+You have made the required changes to the codebase, now go ahead and add tests for newly written methods/functions and update the tests of code that you changed.
+
+## Testing and Linting
+
+You must run the tests and linter on your changes before opening a pull request.
+
+### pytest
+
+Runs all test from tests directory. pytest is a mature full-featured Python testing tool.
+```bash
+pytest
+```
+
+### mypy
+
+Mypy is a static type checker for Python. Type checkers help ensure that you're using variables and functions in your code correctly.
+```bash
+mypy -p waybackpy -p tests
+```
+
+### black
+
+After testing with pytest and type checking with mypy run black on the code base. The codestyle used by the project is 'black'.
+
+```bash
+black .
+```
+
+## Create a pull request
+
+Read [Creating a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request).
+
+Try to make sure that all automated tests are passing, and if some of them do not pass then don't worry. Tests are meant to catch bugs and a failed test is better than introducing bugs to the master branch.
@@ -1,16 +0,0 @@
-# CONTRIBUTORS
-
-## AUTHORS
-
-- akamhy (<https://github.com/akamhy>)
-- eggplants (<https://github.com/eggplants>)
-- danvalen1 (<https://github.com/danvalen1>)
-- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
-- rafaelrdealmeida (<https://github.com/rafaelrdealmeida>)
-- jonasjancarik (<https://github.com/jonasjancarik>)
-- jfinkhaeuser (<https://github.com/jfinkhaeuser>)
-
-## ACKNOWLEDGEMENTS
-
-- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
-- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.
43
README.md
@@ -3,7 +3,7 @@

 <img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>

-<h3>A Python package & CLI tool that interfaces with the Wayback Machine API</h3>
+<h3>Python package & CLI tool that interfaces the Wayback Machine APIs</h3>

 </div>

@@ -22,22 +22,22 @@

 # <img src="https://github.githubassets.com/images/icons/emoji/unicode/2b50.png" width="30"></img> Introduction

-Waybackpy is a [Python package](https://www.udacity.com/blog/2021/01/what-is-a-python-package.html) and a [CLI](https://www.w3schools.com/whatis/whatis_cli.asp) tool that interfaces with the [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine) API.
+Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.

-Wayback Machine has 3 client side [API](https://www.redhat.com/en/topics/api/what-are-application-programming-interfaces)s.
+Internet Archive's Wayback Machine has 3 useful public APIs.

-- [Save API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#save-api)
-- [Availability API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#availability-api)
-- [CDX API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#cdx-api)
+- SavePageNow or Save API
+- CDX Server API
+- Availability API

-These three APIs can be accessed via the waybackpy either by importing it in a script or from the CLI.
+These three APIs can be accessed via the waybackpy either by importing it from a python file/module or from the command-line interface.

 ## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f3d7.png" width="20"></img> Installation

 **Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)), from [PyPI](https://pypi.org/) (recommended)**:

 ```bash
-pip install waybackpy
+pip install waybackpy -U
 ```

 **Using [conda](https://en.wikipedia.org/wiki/Conda_(package_manager)), from [conda-forge](https://anaconda.org/conda-forge/waybackpy) (recommended)**:
@@ -58,11 +58,11 @@ pip install git+https://github.com/akamhy/waybackpy.git

 ## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f433.png" width="20"></img> Docker Image

-Docker Hub : <https://hub.docker.com/r/secsi/waybackpy>
+Docker Hub: [hub.docker.com/r/secsi/waybackpy](https://hub.docker.com/r/secsi/waybackpy)

-[Docker image](https://searchitoperations.techtarget.com/definition/Docker-image) is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
+Docker image is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).

-RAUDI is a tool by SecSI (<https://secsi.io>), an Italian cybersecurity startup.
+RAUDI is a tool by [SecSI](https://secsi.io), an Italian cybersecurity startup.

 ## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f680.png" width="20"></img> Usage

@@ -165,7 +165,8 @@ https://web.archive.org/web/20171206002737/http://pypi.org:80/

 #### Availability API

-It is recommended to not use the availability API due to performance issues. All the methods of availability API interface class, `WaybackMachineAvailabilityAPI`, are also implemented in the CDX server API interface class, `WaybackMachineCDXServerAPI`.
+It is recommended to not use the availability API due to performance issues. All the methods of availability API interface class, `WaybackMachineAvailabilityAPI`, are also implemented in the CDX server API interface class, `WaybackMachineCDXServerAPI`. Also note
+that the `newest()` method of `WaybackMachineAvailabilityAPI` can be more recent than `WaybackMachineCDXServerAPI`'s same method.

 ```python
 >>> from waybackpy import WaybackMachineAvailabilityAPI
@@ -201,10 +202,20 @@ Demo video on [asciinema.org](https://asciinema.org/a/469890), you can copy the

 > CLI documentation is at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.

-## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f6e1.png" width="20"></img> License

-[](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
+## CONTRIBUTORS

-Copyright (c) 2020-2022 Akash Mahanty Et al.
+### AUTHORS

-Released under the MIT License. See [license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.
+- akamhy (<https://github.com/akamhy>)
+- eggplants (<https://github.com/eggplants>)
+- danvalen1 (<https://github.com/danvalen1>)
+- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
+- rafaelrdealmeida (<https://github.com/rafaelrdealmeida>)
+- jonasjancarik (<https://github.com/jonasjancarik>)
+- jfinkhaeuser (<https://github.com/jfinkhaeuser>)
+
+### ACKNOWLEDGEMENTS
+
+- mhmdiaa (<https://github.com/mhmdiaa>) `--known-urls` is based on [this](https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050) gist.
+- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.
@@ -32,20 +32,26 @@ classifiers =
     License :: OSI Approved :: MIT License
     Programming Language :: Python
     Programming Language :: Python :: 3
+    Programming Language :: Python :: 3.6
     Programming Language :: Python :: 3.7
     Programming Language :: Python :: 3.8
     Programming Language :: Python :: 3.9
     Programming Language :: Python :: 3.10
+    Programming Language :: Python :: 3.11
     Programming Language :: Python :: Implementation :: CPython

 [options]
 packages = find:
-python_requires = >= 3.7
+include-package-data = True
+python_requires = >= 3.6
 install_requires =
     click
     requests
     urllib3

+[options.package_data]
+waybackpy = py.typed
+
 [options.extras_require]
 dev =
     black
@@ -176,3 +176,39 @@ def test_near() -> None:
         filters=["statuscode:200"],
     )
     cdx.near(unix_timestamp=1286705410)
+
+
+def test_before() -> None:
+    user_agent = (
+        "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
+        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
+    )
+
+    cdx = WaybackMachineCDXServerAPI(
+        url="http://www.google.com/",
+        user_agent=user_agent,
+        filters=["statuscode:200"],
+    )
+    before = cdx.before(wayback_machine_timestamp=20160731235949)
+    assert "20160731233347" in before.timestamp
+    assert "google" in before.urlkey
+    assert before.original.find("google.com") != -1
+    assert before.archive_url.find("google.com") != -1
+
+
+def test_after() -> None:
+    user_agent = (
+        "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
+        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
+    )
+
+    cdx = WaybackMachineCDXServerAPI(
+        url="http://www.google.com/",
+        user_agent=user_agent,
+        filters=["statuscode:200"],
+    )
+    after = cdx.after(wayback_machine_timestamp=20160731235949)
+    assert "20160801000917" in after.timestamp, after.timestamp
+    assert "google" in after.urlkey
+    assert after.original.find("google.com") != -1
+    assert after.archive_url.find("google.com") != -1
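The tests in this hunk pass either a Unix timestamp (`unix_timestamp=1286705410`) or a 14-digit Wayback Machine timestamp (`wayback_machine_timestamp=20160731235949`). A minimal sketch of how the two relate, assuming the Wayback format is UTC `YYYYMMDDhhmmss`. The helper name `to_wayback_timestamp` is hypothetical and is not waybackpy's own `unix_timestamp_to_wayback_timestamp`, whose implementation is not shown in this diff.

```python
from datetime import datetime, timezone


def to_wayback_timestamp(unix_timestamp: int) -> str:
    """Format seconds since the epoch as a 14-digit UTC timestamp.

    Hypothetical helper for illustration only.
    """
    moment = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    return moment.strftime("%Y%m%d%H%M%S")


# The value used by test_near() corresponds to 2010-10-10 10:10:10 UTC.
print(to_wayback_timestamp(1286705410))  # 20101010101010
```

Using an explicit `timezone.utc` keeps the result independent of the machine's local time zone.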
@@ -1,6 +1,6 @@
 """Module initializer and provider of static information."""

-__version__ = "3.0.4"
+__version__ = "3.0.6"

 from .availability_api import WaybackMachineAvailabilityAPI
 from .cdx_api import WaybackMachineCDXServerAPI
@@ -191,6 +191,88 @@ class WaybackMachineCDXServerAPI:

         payload["url"] = self.url

+    def before(
+        self,
+        year: Optional[int] = None,
+        month: Optional[int] = None,
+        day: Optional[int] = None,
+        hour: Optional[int] = None,
+        minute: Optional[int] = None,
+        unix_timestamp: Optional[int] = None,
+        wayback_machine_timestamp: Optional[Union[int, str]] = None,
+    ) -> CDXSnapshot:
+        """
+        Gets the nearest archive before the given datetime.
+        """
+        if unix_timestamp:
+            timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
+        elif wayback_machine_timestamp:
+            timestamp = str(wayback_machine_timestamp)
+        else:
+            now = datetime.utcnow().timetuple()
+            timestamp = wayback_timestamp(
+                year=now.tm_year if year is None else year,
+                month=now.tm_mon if month is None else month,
+                day=now.tm_mday if day is None else day,
+                hour=now.tm_hour if hour is None else hour,
+                minute=now.tm_min if minute is None else minute,
+            )
+        self.closest = timestamp
+        self.sort = "closest"
+        self.limit = 25000
+        for snapshot in self.snapshots():
+            if snapshot.timestamp < timestamp:
+                return snapshot
+
+        # If a snapshot isn't returned, then none were found.
+        raise NoCDXRecordFound(
+            "No records were found before the given date for the query."
+            + "Either there are no archives before the given date,"
+            + " the URL may not have any archived, or the URL may have been"
+            + " recently archived and is still not available on the CDX server."
+        )
+
+    def after(
+        self,
+        year: Optional[int] = None,
+        month: Optional[int] = None,
+        day: Optional[int] = None,
+        hour: Optional[int] = None,
+        minute: Optional[int] = None,
+        unix_timestamp: Optional[int] = None,
+        wayback_machine_timestamp: Optional[Union[int, str]] = None,
+    ) -> CDXSnapshot:
+        """
+        Gets the nearest archive after the given datetime.
+        """
+        if unix_timestamp:
+            timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
+        elif wayback_machine_timestamp:
+            timestamp = str(wayback_machine_timestamp)
+        else:
+            now = datetime.utcnow().timetuple()
+            timestamp = wayback_timestamp(
+                year=now.tm_year if year is None else year,
+                month=now.tm_mon if month is None else month,
+                day=now.tm_mday if day is None else day,
+                hour=now.tm_hour if hour is None else hour,
+                minute=now.tm_min if minute is None else minute,
+            )
+        self.closest = timestamp
+        self.sort = "closest"
+        self.limit = 25000
+        for snapshot in self.snapshots():
+            if snapshot.timestamp > timestamp:
+                return snapshot
+
+        # If a snapshot isn't returned, then none were found.
+        raise NoCDXRecordFound(
+            "No records were found after the given date for the query."
+            + "Either there are no archives after the given date,"
+            + " the URL may not have any archives, or the URL may have been"
+            + " recently archived and is still not available on the CDX server."
+        )
+
     def near(
         self,
         year: Optional[int] = None,
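The `before()` and `after()` methods in this hunk request snapshots sorted by closeness to the target timestamp and return the first one that falls strictly on the requested side. The sketch below reproduces only that selection step, with an in-memory list standing in for `self.snapshots()`. The names `sort_by_closeness`, `pick_before`, `pick_after` and the sample list are hypothetical, and the sketch assumes equal-length 14-digit timestamps so that string comparison matches chronological order.

```python
from datetime import datetime
from typing import Iterable, List, Optional

FORMAT = "%Y%m%d%H%M%S"  # assumed 14-digit Wayback timestamp layout


def sort_by_closeness(timestamps: Iterable[str], target: str) -> List[str]:
    """Order timestamps by distance from target (stand-in for sort=closest)."""
    goal = datetime.strptime(target, FORMAT)
    return sorted(
        timestamps, key=lambda ts: abs(datetime.strptime(ts, FORMAT) - goal)
    )


def pick_before(ordered: Iterable[str], target: str) -> Optional[str]:
    """Return the first (closest) timestamp strictly earlier than target."""
    for ts in ordered:
        if ts < target:
            return ts
    return None  # the real method raises NoCDXRecordFound here


def pick_after(ordered: Iterable[str], target: str) -> Optional[str]:
    """Return the first (closest) timestamp strictly later than target."""
    for ts in ordered:
        if ts > target:
            return ts
    return None


# Made-up sample data; two values mirror the timestamps asserted in the tests.
sample = ["20160731120000", "20160801000917", "20160802000000", "20160731233347"]
target = "20160731235949"
ordered = sort_by_closeness(sample, target)

print(pick_before(ordered, target))  # 20160731233347
print(pick_after(ordered, target))   # 20160801000917
```

Because the list is ordered by distance first, a single linear scan is enough: the first match on either side is also the nearest one on that side.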
@@ -471,4 +471,4 @@ def main(  # pylint: disable=no-value-for-parameter


 if __name__ == "__main__":
-    main()  # pylint: disable=no-value-for-parameter
+    main()  # type: ignore  # pylint: disable=no-value-for-parameter
0
waybackpy/py.typed
Normal file