Compare commits

...

12 Commits

Author SHA1 Message Date
3b3e78d901 Before and After methods (#175)
* Added before and after functions

* add tests

* formatting
2022-11-17 07:58:46 +05:30
0202efd39d Add Python 3.11 to setup.cfg classifiers list (#179) 2022-11-17 07:56:19 +05:30
25c0adacb0 create CONTRIBUTING.md 2022-03-29 11:30:43 +05:30
5bd16a42e7 lint 2022-03-29 10:42:57 +05:30
57f4be53d5 ignore line 474 beacause 'error: <nothing> not callable'. 2022-03-29 10:24:55 +05:30
64a4ce88af Minor copyediting and also deleted CONTRIBUTORS.md moved content to README.md 2022-03-29 03:39:50 +05:30
5407681c34 v3.0.6 (#170)
* remove the license section from readme

This does not mean that I'm waving the copyrights rather just formatting the README

* remove useless external links form the README lead

and also added a line about the recentness of the newest method between the availability and CDX server API.

* incr version to 3.0.6 and change date to todays da

-te that is 15th of March, 2022.

* update secsi and DI section

* v3.0.5 --> v3.0.6
2022-03-15 20:33:51 +05:30
cfd977135d Update CITATION.cff (#169) 2022-03-04 11:48:49 +05:30
7a5e0bfdaf fix: cff format (#168) 2022-03-04 03:10:27 +05:30
48dcda8020 add: typed marker (PEP561) (#167) 2022-03-03 19:05:43 +05:30
3ed2170a32 add: CITATION.cff (#166) 2022-03-03 19:05:23 +05:30
d6ef55020c undo drop python3.6, see #162 (#163) 2022-02-18 21:54:33 +05:30
10 changed files with 234 additions and 36 deletions

25
CITATION.cff Normal file
View File

@ -0,0 +1,25 @@
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: waybackpy
abstract: "Python package that interfaces with the Internet Archive's Wayback Machine APIs. Archive pages and retrieve archived pages easily."
version: '3.0.6'
doi: 10.5281/ZENODO.3977276
date-released: 2022-03-15
type: software
authors:
- given-names: Akash
family-names: Mahanty
email: akamhy@yahoo.com
orcid: https://orcid.org/0000-0003-2482-8227
keywords:
- Archive Website
- Wayback Machine
- Internet Archive
- Wayback Machine CLI
- Wayback Machine Python
- Internet Archiving
- Availability API
- CDX API
- savepagenow
license: MIT
repository-code: "https://github.com/akamhy/waybackpy"

54
CONTRIBUTING.md Normal file
View File

@ -0,0 +1,54 @@
# Welcome to waybackpy contributing guide
## Getting started
Read our [Code of Conduct](./CODE_OF_CONDUCT.md).
## Creating an issue
It's a good idea to open an issue and discuss suspected bugs and new feature ideas with the maintainers. Somebody might be working on your bug/idea and it would be best to discuss it to avoid wasting your time. It is a recommendation. You may avoid creating an issue and directly open pull requests.
## Fork this repository
Fork this repository. See '[Fork a repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo)' for help forking this repository on GitHub.
## Make changes to the forked copy
Make the required changes to your forked copy of waybackpy, please don't forget to add or update comments and docstrings.
## Add tests for your changes
You have made the required changes to the codebase, now go ahead and add tests for newly written methods/functions and update the tests of code that you changed.
## Testing and Linting
You must run the tests and linter on your changes before opening a pull request.
### pytest
Runs all test from tests directory. pytest is a mature full-featured Python testing tool.
```bash
pytest
```
### mypy
Mypy is a static type checker for Python. Type checkers help ensure that you're using variables and functions in your code correctly.
```bash
mypy -p waybackpy -p tests
```
### black
After testing with pytest and type checking with mypy run black on the code base. The codestyle used by the project is 'black'.
```bash
black .
```
## Create a pull request
Read [Creating a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request).
Try to make sure that all automated tests are passing, and if some of them do not pass then don't worry. Tests are meant to catch bugs and a failed test is better than introducing bugs to the master branch.

View File

@ -1,16 +0,0 @@
# CONTRIBUTORS
## AUTHORS
- akamhy (<https://github.com/akamhy>)
- eggplants (<https://github.com/eggplants>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- rafaelrdealmeida (<https://github.com/rafaelrdealmeida>)
- jonasjancarik (<https://github.com/jonasjancarik>)
- jfinkhaeuser (<https://github.com/jfinkhaeuser>)
## ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

View File

@ -3,7 +3,7 @@
<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>
<h3>A Python package & CLI tool that interfaces with the Wayback Machine API</h3>
<h3>Python package & CLI tool that interfaces the Wayback Machine APIs</h3>
</div>
@ -22,22 +22,22 @@
# <img src="https://github.githubassets.com/images/icons/emoji/unicode/2b50.png" width="30"></img> Introduction
Waybackpy is a [Python package](https://www.udacity.com/blog/2021/01/what-is-a-python-package.html) and a [CLI](https://www.w3schools.com/whatis/whatis_cli.asp) tool that interfaces with the [Wayback Machine](https://en.wikipedia.org/wiki/Wayback_Machine) API.
Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.
Wayback Machine has 3 client side [API](https://www.redhat.com/en/topics/api/what-are-application-programming-interfaces)s.
Internet Archive's Wayback Machine has 3 useful public APIs.
- [Save API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#save-api)
- [Availability API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#availability-api)
- [CDX API](https://github.com/akamhy/waybackpy/wiki/Wayback-Machine-APIs#cdx-api)
- SavePageNow or Save API
- CDX Server API
- Availability API
These three APIs can be accessed via the waybackpy either by importing it in a script or from the CLI.
These three APIs can be accessed via the waybackpy either by importing it from a python file/module or from the command-line interface.
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f3d7.png" width="20"></img> Installation
**Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)), from [PyPI](https://pypi.org/) (recommended)**:
```bash
pip install waybackpy
pip install waybackpy -U
```
**Using [conda](https://en.wikipedia.org/wiki/Conda_(package_manager)), from [conda-forge](https://anaconda.org/conda-forge/waybackpy) (recommended)**:
@ -58,11 +58,11 @@ pip install git+https://github.com/akamhy/waybackpy.git
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f433.png" width="20"></img> Docker Image
Docker Hub : <https://hub.docker.com/r/secsi/waybackpy>
Docker Hub: [hub.docker.com/r/secsi/waybackpy](https://hub.docker.com/r/secsi/waybackpy)
[Docker image](https://searchitoperations.techtarget.com/definition/Docker-image) is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
Docker image is automatically updated on every release by [Regulary and Automatically Updated Docker Images](https://github.com/cybersecsi/RAUDI) (RAUDI).
RAUDI is a tool by SecSI (<https://secsi.io>), an Italian cybersecurity startup.
RAUDI is a tool by [SecSI](https://secsi.io), an Italian cybersecurity startup.
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f680.png" width="20"></img> Usage
@ -143,7 +143,7 @@ com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5W
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>>
>>>
```
##### snapshots
```python
@ -165,7 +165,8 @@ https://web.archive.org/web/20171206002737/http://pypi.org:80/
#### Availability API
It is recommended to not use the availability API due to performance issues. All the methods of availability API interface class, `WaybackMachineAvailabilityAPI`, are also implemented in the CDX server API interface class, `WaybackMachineCDXServerAPI`.
It is recommended to not use the availability API due to performance issues. All the methods of availability API interface class, `WaybackMachineAvailabilityAPI`, are also implemented in the CDX server API interface class, `WaybackMachineCDXServerAPI`. Also note
that the `newest()` method of `WaybackMachineAvailabilityAPI` can be more recent than `WaybackMachineCDXServerAPI`'s same method.
```python
>>> from waybackpy import WaybackMachineAvailabilityAPI
@ -201,10 +202,20 @@ Demo video on [asciinema.org](https://asciinema.org/a/469890), you can copy the
> CLI documentation is at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.
## <img src="https://github.githubassets.com/images/icons/emoji/unicode/1f6e1.png" width="20"></img> License
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://github.com/akamhy/waybackpy/blob/master/LICENSE)
## CONTRIBUTORS
Copyright (c) 2020-2022 Akash Mahanty Et al.
### AUTHORS
Released under the MIT License. See [license](https://github.com/akamhy/waybackpy/blob/master/LICENSE) for details.
- akamhy (<https://github.com/akamhy>)
- eggplants (<https://github.com/eggplants>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- rafaelrdealmeida (<https://github.com/rafaelrdealmeida>)
- jonasjancarik (<https://github.com/jonasjancarik>)
- jfinkhaeuser (<https://github.com/jfinkhaeuser>)
### ACKNOWLEDGEMENTS
- mhmdiaa (<https://github.com/mhmdiaa>) `--known-urls` is based on [this](https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050) gist.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

View File

@ -32,20 +32,26 @@ classifiers =
License :: OSI Approved :: MIT License
Programming Language :: Python
Programming Language :: Python :: 3
Programming Language :: Python :: 3.6
Programming Language :: Python :: 3.7
Programming Language :: Python :: 3.8
Programming Language :: Python :: 3.9
Programming Language :: Python :: 3.10
Programming Language :: Python :: 3.11
Programming Language :: Python :: Implementation :: CPython
[options]
packages = find:
python_requires = >= 3.7
include-package-data = True
python_requires = >= 3.6
install_requires =
click
requests
urllib3
[options.package_data]
waybackpy = py.typed
[options.extras_require]
dev =
black

View File

@ -176,3 +176,39 @@ def test_near() -> None:
filters=["statuscode:200"],
)
cdx.near(unix_timestamp=1286705410)
def test_before() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="http://www.google.com/",
user_agent=user_agent,
filters=["statuscode:200"],
)
before = cdx.before(wayback_machine_timestamp=20160731235949)
assert "20160731233347" in before.timestamp
assert "google" in before.urlkey
assert before.original.find("google.com") != -1
assert before.archive_url.find("google.com") != -1
def test_after() -> None:
user_agent = (
"Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
"AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
)
cdx = WaybackMachineCDXServerAPI(
url="http://www.google.com/",
user_agent=user_agent,
filters=["statuscode:200"],
)
after = cdx.after(wayback_machine_timestamp=20160731235949)
assert "20160801000917" in after.timestamp, after.timestamp
assert "google" in after.urlkey
assert after.original.find("google.com") != -1
assert after.archive_url.find("google.com") != -1

View File

@ -1,6 +1,6 @@
"""Module initializer and provider of static information."""
__version__ = "3.0.4"
__version__ = "3.0.6"
from .availability_api import WaybackMachineAvailabilityAPI
from .cdx_api import WaybackMachineCDXServerAPI

View File

@ -191,6 +191,88 @@ class WaybackMachineCDXServerAPI:
payload["url"] = self.url
def before(
self,
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
unix_timestamp: Optional[int] = None,
wayback_machine_timestamp: Optional[Union[int, str]] = None,
) -> CDXSnapshot:
"""
Gets the nearest archive before the given datetime.
"""
if unix_timestamp:
timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
elif wayback_machine_timestamp:
timestamp = str(wayback_machine_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = wayback_timestamp(
year=now.tm_year if year is None else year,
month=now.tm_mon if month is None else month,
day=now.tm_mday if day is None else day,
hour=now.tm_hour if hour is None else hour,
minute=now.tm_min if minute is None else minute,
)
self.closest = timestamp
self.sort = "closest"
self.limit = 25000
for snapshot in self.snapshots():
if snapshot.timestamp < timestamp:
return snapshot
# If a snapshot isn't returned, then none were found.
raise NoCDXRecordFound(
"No records were found before the given date for the query."
+ "Either there are no archives before the given date,"
+ " the URL may not have any archived, or the URL may have been"
+ " recently archived and is still not available on the CDX server."
)
def after(
self,
year: Optional[int] = None,
month: Optional[int] = None,
day: Optional[int] = None,
hour: Optional[int] = None,
minute: Optional[int] = None,
unix_timestamp: Optional[int] = None,
wayback_machine_timestamp: Optional[Union[int, str]] = None,
) -> CDXSnapshot:
"""
Gets the nearest archive after the given datetime.
"""
if unix_timestamp:
timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
elif wayback_machine_timestamp:
timestamp = str(wayback_machine_timestamp)
else:
now = datetime.utcnow().timetuple()
timestamp = wayback_timestamp(
year=now.tm_year if year is None else year,
month=now.tm_mon if month is None else month,
day=now.tm_mday if day is None else day,
hour=now.tm_hour if hour is None else hour,
minute=now.tm_min if minute is None else minute,
)
self.closest = timestamp
self.sort = "closest"
self.limit = 25000
for snapshot in self.snapshots():
if snapshot.timestamp > timestamp:
return snapshot
# If a snapshot isn't returned, then none were found.
raise NoCDXRecordFound(
"No records were found after the given date for the query."
+ "Either there are no archives after the given date,"
+ " the URL may not have any archives, or the URL may have been"
+ " recently archived and is still not available on the CDX server."
)
def near(
self,
year: Optional[int] = None,

View File

@ -471,4 +471,4 @@ def main( # pylint: disable=no-value-for-parameter
if __name__ == "__main__":
main() # pylint: disable=no-value-for-parameter
main() # type: ignore # pylint: disable=no-value-for-parameter

0
waybackpy/py.typed Normal file
View File