Compare commits
6 Commits
Author | SHA1 | Date | |
---|---|---|---|
3b3e78d901 | |||
0202efd39d | |||
25c0adacb0 | |||
5bd16a42e7 | |||
57f4be53d5 | |||
64a4ce88af |
54
CONTRIBUTING.md
Normal file
@ -0,0 +1,54 @@
# Welcome to the waybackpy contributing guide

## Getting started

Read our [Code of Conduct](./CODE_OF_CONDUCT.md).

## Creating an issue

It's a good idea to open an issue to discuss suspected bugs and new feature ideas with the maintainers. Somebody might already be working on your bug or idea, so discussing it first can save you time. This is only a recommendation; you may skip the issue and open a pull request directly.

## Fork this repository

Fork this repository. See '[Fork a repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo)' for help forking this repository on GitHub.

## Make changes to the forked copy

Make the required changes to your forked copy of waybackpy. Please don't forget to add or update comments and docstrings.

## Add tests for your changes

Once you have made the required changes to the codebase, add tests for newly written methods and functions, and update the tests for any code you changed.
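
As an illustration only, a test for a small pure function might look like the sketch below. `pad_component` is a hypothetical helper invented for this example and is not part of waybackpy; the point is the `test_*` function shape that pytest collects.

```python
# Hypothetical helper, invented for this example (not part of waybackpy).
def pad_component(value: int, width: int = 2) -> str:
    """Zero-pad a date component, for example 7 -> "07"."""
    return str(value).zfill(width)


# pytest collects functions whose names start with "test".
def test_pad_component() -> None:
    assert pad_component(7) == "07"
    assert pad_component(12) == "12"
    assert pad_component(2016, width=4) == "2016"
```

Keeping a test free of network access, as here, makes it fast and repeatable, although some behaviour can only be checked against the live service.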

## Testing and Linting

You must run the tests and linter on your changes before opening a pull request.

### pytest

pytest is a mature, full-featured Python testing tool. The command below runs all the tests in the `tests` directory.

```bash
pytest
```

### mypy

Mypy is a static type checker for Python. Type checkers help ensure that you're using variables and functions in your code correctly.

```bash
mypy -p waybackpy -p tests
```
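
To make the idea concrete, here is a small annotated function. It is a hypothetical example, not code from waybackpy. With these annotations, a type checker can flag a call that passes the wrong kind of argument before the code ever runs.

```python
from typing import Optional


# Hypothetical function, invented for this example (not part of waybackpy).
def parse_year(value: Optional[str]) -> int:
    """Return the year as an int, or 0 when no value is given."""
    if value is None:
        return 0
    return int(value)


# Both calls match the annotation, so mypy accepts them.
print(parse_year("2016"))
print(parse_year(None))

# A call such as parse_year(2016) runs at runtime, but mypy would report
# an incompatible argument type because an int is not Optional[str].
```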

### black

After testing with pytest and type checking with mypy, run black on the codebase. The project uses the 'black' code style.

```bash
black .
```

## Create a pull request

Read [Creating a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request).

Try to make sure that all automated tests pass, but don't worry if some of them fail. Tests are meant to catch bugs, and a failed test is better than introducing bugs to the master branch.
@ -1,16 +0,0 @@
# CONTRIBUTORS

## AUTHORS

- akamhy (<https://github.com/akamhy>)
- eggplants (<https://github.com/eggplants>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- rafaelrdealmeida (<https://github.com/rafaelrdealmeida>)
- jonasjancarik (<https://github.com/jonasjancarik>)
- jfinkhaeuser (<https://github.com/jfinkhaeuser>)

## ACKNOWLEDGEMENTS

- mhmdiaa (<https://github.com/mhmdiaa>) for <https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050>. known_urls is based on this gist.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.
25
README.md
@ -3,7 +3,7 @@

<img src="https://raw.githubusercontent.com/akamhy/waybackpy/master/assets/waybackpy_logo.svg"><br>

<h3>A Python package & CLI tool that interfaces with the Wayback Machine API</h3>
<h3>Python package & CLI tool that interfaces the Wayback Machine APIs</h3>

</div>

@ -24,7 +24,7 @@

Waybackpy is a Python package and a CLI tool that interfaces with the Wayback Machine APIs.

Wayback Machine has 3 client side APIs.
Internet Archive's Wayback Machine has 3 useful public APIs.

- SavePageNow or Save API
- CDX Server API
@ -37,7 +37,7 @@ These three APIs can be accessed via the waybackpy either by importing it from a
**Using [pip](https://en.wikipedia.org/wiki/Pip_(package_manager)), from [PyPI](https://pypi.org/) (recommended)**:

```bash
pip install waybackpy
pip install waybackpy -U
```

**Using [conda](https://en.wikipedia.org/wiki/Conda_(package_manager)), from [conda-forge](https://anaconda.org/conda-forge/waybackpy) (recommended)**:
@ -143,7 +143,7 @@ com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5W
com,google)/ 20101010101435 http://google.com/ text/html 301 Y6PVK4XWOI3BXQEXM5WLLWU5JKUVNSFZ 391
>>> near.archive_url
'https://web.archive.org/web/20101010101435/http://google.com/'
>>>
>>>
```
##### snapshots
```python
@ -165,7 +165,7 @@ https://web.archive.org/web/20171206002737/http://pypi.org:80/

#### Availability API

It is recommended to not use the availability API due to performance issues. All the methods of availability API interface class, `WaybackMachineAvailabilityAPI`, are also implemented in the CDX server API interface class, `WaybackMachineCDXServerAPI`. Also note
It is recommended to not use the availability API due to performance issues. All the methods of availability API interface class, `WaybackMachineAvailabilityAPI`, are also implemented in the CDX server API interface class, `WaybackMachineCDXServerAPI`. Also note
that the `newest()` method of `WaybackMachineAvailabilityAPI` can be more recent than `WaybackMachineCDXServerAPI`'s same method.

```python
@ -203,4 +203,19 @@ Demo video on [asciinema.org](https://asciinema.org/a/469890), you can copy the
> CLI documentation is at <https://github.com/akamhy/waybackpy/wiki/CLI-docs>.

## CONTRIBUTORS

### AUTHORS

- akamhy (<https://github.com/akamhy>)
- eggplants (<https://github.com/eggplants>)
- danvalen1 (<https://github.com/danvalen1>)
- AntiCompositeNumber (<https://github.com/AntiCompositeNumber>)
- rafaelrdealmeida (<https://github.com/rafaelrdealmeida>)
- jonasjancarik (<https://github.com/jonasjancarik>)
- jfinkhaeuser (<https://github.com/jfinkhaeuser>)

### ACKNOWLEDGEMENTS

- mhmdiaa (<https://github.com/mhmdiaa>) `--known-urls` is based on [this](https://gist.github.com/mhmdiaa/adf6bff70142e5091792841d4b372050) gist.
- dequeued0 (<https://github.com/dequeued0>) for reporting bugs and useful feature requests.

@ -37,6 +37,7 @@ classifiers =
    Programming Language :: Python :: 3.8
    Programming Language :: Python :: 3.9
    Programming Language :: Python :: 3.10
    Programming Language :: Python :: 3.11
    Programming Language :: Python :: Implementation :: CPython

[options]
@ -176,3 +176,39 @@ def test_near() -> None:
        filters=["statuscode:200"],
    )
    cdx.near(unix_timestamp=1286705410)


def test_before() -> None:
    user_agent = (
        "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
    )

    cdx = WaybackMachineCDXServerAPI(
        url="http://www.google.com/",
        user_agent=user_agent,
        filters=["statuscode:200"],
    )
    before = cdx.before(wayback_machine_timestamp=20160731235949)
    assert "20160731233347" in before.timestamp
    assert "google" in before.urlkey
    assert before.original.find("google.com") != -1
    assert before.archive_url.find("google.com") != -1


def test_after() -> None:
    user_agent = (
        "Mozilla/5.0 (MacBook Air; M1 Mac OS X 11_4) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/604.1"
    )

    cdx = WaybackMachineCDXServerAPI(
        url="http://www.google.com/",
        user_agent=user_agent,
        filters=["statuscode:200"],
    )
    after = cdx.after(wayback_machine_timestamp=20160731235949)
    assert "20160801000917" in after.timestamp, after.timestamp
    assert "google" in after.urlkey
    assert after.original.find("google.com") != -1
    assert after.archive_url.find("google.com") != -1
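
The two tests above pin `before()` and `after()` to specific snapshots for the target timestamp `20160731235949`. The sketch below is a rough, offline illustration of that selection. It makes two assumptions: the candidates arrive ordered by closeness to the target, and every timestamp is a 14-digit string, so plain string comparison agrees with chronological order. The helpers `first_before` and `first_after` are hypothetical and not part of waybackpy. The first two timestamps come from the tests above; the other two are made up.

```python
from typing import List, Optional

# Candidate timestamps (YYYYMMDDhhmmss), listed from nearest to farthest
# from the target 20160731235949. The last two values are made up.
CANDIDATES: List[str] = [
    "20160801000917",
    "20160731233347",
    "20160802120000",
    "20160730101010",
]


def first_before(timestamps: List[str], target: str) -> Optional[str]:
    """Return the first timestamp strictly earlier than target, if any."""
    for ts in timestamps:
        # Equal-length digit strings compare in chronological order.
        if ts < target:
            return ts
    return None


def first_after(timestamps: List[str], target: str) -> Optional[str]:
    """Return the first timestamp strictly later than target, if any."""
    for ts in timestamps:
        if ts > target:
            return ts
    return None


print(first_before(CANDIDATES, "20160731235949"))
print(first_after(CANDIDATES, "20160731235949"))
```

Because the list is ordered by closeness, the first match in each direction is also the nearest one. The comparison would be unreliable if a timestamp were shorter than 14 digits.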
@ -191,6 +191,88 @@ class WaybackMachineCDXServerAPI:

        payload["url"] = self.url

    def before(
        self,
        year: Optional[int] = None,
        month: Optional[int] = None,
        day: Optional[int] = None,
        hour: Optional[int] = None,
        minute: Optional[int] = None,
        unix_timestamp: Optional[int] = None,
        wayback_machine_timestamp: Optional[Union[int, str]] = None,
    ) -> CDXSnapshot:
        """
        Gets the nearest archive before the given datetime.
        """
        if unix_timestamp:
            timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
        elif wayback_machine_timestamp:
            timestamp = str(wayback_machine_timestamp)
        else:
            now = datetime.utcnow().timetuple()
            timestamp = wayback_timestamp(
                year=now.tm_year if year is None else year,
                month=now.tm_mon if month is None else month,
                day=now.tm_mday if day is None else day,
                hour=now.tm_hour if hour is None else hour,
                minute=now.tm_min if minute is None else minute,
            )
        self.closest = timestamp
        self.sort = "closest"
        self.limit = 25000
        for snapshot in self.snapshots():
            if snapshot.timestamp < timestamp:
                return snapshot

        # If a snapshot isn't returned, then none were found.
        raise NoCDXRecordFound(
            "No records were found before the given date for the query. "
            + "Either there are no archives before the given date,"
            + " the URL may not have any archives, or the URL may have been"
            + " recently archived and is still not available on the CDX server."
        )

    def after(
        self,
        year: Optional[int] = None,
        month: Optional[int] = None,
        day: Optional[int] = None,
        hour: Optional[int] = None,
        minute: Optional[int] = None,
        unix_timestamp: Optional[int] = None,
        wayback_machine_timestamp: Optional[Union[int, str]] = None,
    ) -> CDXSnapshot:
        """
        Gets the nearest archive after the given datetime.
        """
        if unix_timestamp:
            timestamp = unix_timestamp_to_wayback_timestamp(unix_timestamp)
        elif wayback_machine_timestamp:
            timestamp = str(wayback_machine_timestamp)
        else:
            now = datetime.utcnow().timetuple()
            timestamp = wayback_timestamp(
                year=now.tm_year if year is None else year,
                month=now.tm_mon if month is None else month,
                day=now.tm_mday if day is None else day,
                hour=now.tm_hour if hour is None else hour,
                minute=now.tm_min if minute is None else minute,
            )
        self.closest = timestamp
        self.sort = "closest"
        self.limit = 25000
        for snapshot in self.snapshots():
            if snapshot.timestamp > timestamp:
                return snapshot

        # If a snapshot isn't returned, then none were found.
        raise NoCDXRecordFound(
            "No records were found after the given date for the query. "
            + "Either there are no archives after the given date,"
            + " the URL may not have any archives, or the URL may have been"
            + " recently archived and is still not available on the CDX server."
        )

    def near(
        self,
        year: Optional[int] = None,
@ -471,4 +471,4 @@ def main( # pylint: disable=no-value-for-parameter


if __name__ == "__main__":
    main()  # pylint: disable=no-value-for-parameter
    main()  # type: ignore # pylint: disable=no-value-for-parameter