From 07c98661de5691e7a9eccefd8a4c53b93e0bc86f Mon Sep 17 00:00:00 2001
From: Akash Mahanty
Date: Sat, 3 Oct 2020 01:16:19 +0530
Subject: [PATCH] add usage for known urls (#26)

* Update README.md
* Update README.md
* Update README.md
* bash example for known urls
* python examples / usage for known urls :)
* Update README.md
* Update README.md
* Update README.md
* Update README.md
---
 README.md | 74 +++++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 67 insertions(+), 7 deletions(-)

diff --git a/README.md b/README.md
index f40e42a..b8b228f 100644
--- a/README.md
+++ b/README.md
@@ -27,24 +27,29 @@ Table of contents
 * [Usage](#usage)
   * [As a Python package](#as-a-python-package)
-    * [Saving an url using save()](#capturing-aka-saving-an-url-using-save)
-    * [Retrieving the oldest archive for an URL Using oldest()](#receiving-the-oldest-archive-for-an-url-using-oldest)
-    * [Retrieving the recent most/newest archive for an URL using newest()](#receiving-the-newest-archive-for-an-url-using-newest)
-    * [Retrieving archive close to a specified year, month, day, hour, and minute using near()](#receiving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
-    * [Get the content of webpage using get()](#get-the-content-of-webpage-using-get)
-    * [Count total archives for an URL using total_archives()](#count-total-archives-for-an-url-using-total_archives)
-  * [With Command-line interface](#with-the-command-line-interface)
+    * [Saving a URL](#capturing-aka-saving-an-url-using-save)
+    * [Retrieving the oldest archive](#retrieving-the-oldest-archive-for-an-url-using-oldest)
+    * [Retrieving the newest archive](#retrieving-the-newest-archive-for-an-url-using-newest)
+    * [Retrieving an archive close to a specified year, month, day, hour, and minute](#retrieving-archive-close-to-a-specified-year-month-day-hour-and-minute-using-near)
+    * [Get the content of a webpage](#get-the-content-of-webpage-using-get)
+    * [Count total archives for a URL](#count-total-archives-for-an-url-using-total_archives)
+    * [List of URLs that Wayback Machine knows and has archived for a domain name](#list-of-urls-that-wayback-machine-knows-and-has-archived-for-a-domain-name)
+  * [As a Command-line tool](#with-the-command-line-interface)
     * [Save](#save)
     * [Oldest archive](#oldest-archive)
     * [Newest archive](#newest-archive)
     * [Total archives](#total-number-of-archives)
     * [Archive near a time](#archive-near-time)
     * [Get the source code](#get-the-source-code)
+    * [Fetch all the URLs that the Wayback Machine knows for a domain](#fetch-all-the-urls-that-the-wayback-machine-knows-for-a-domain)
 * [Tests](#tests)
 * [Dependency](#dependency)
+* [Packaging](#packaging)
 * [License](#license)
@@ -245,6 +250,31 @@
 print(archive_count) # total_archives() returns an int
 ```
 
 Try this out in your browser @
 
+#### List of URLs that Wayback Machine knows and has archived for a domain name
+
+1) If `alive=True` is set, waybackpy checks every known URL and returns only those that are still alive. Avoid this for large or heavily archived domains such as google.com, as checking every URL can take a very long time.
+2) To include URLs from subdomains, set `subdomain=True` (see the sketch after the example output below).
+
+```python
+import waybackpy
+
+URL = "akamhy.github.io"
+UA = "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
+
+known_urls = waybackpy.Url(url=URL, user_agent=UA).known_urls(alive=True, subdomain=False) # alive and subdomain are optional
+
+print(known_urls) # known_urls() returns a list of URLs
+```
+
+```python
+['http://akamhy.github.io',
+'https://akamhy.github.io/waybackpy/',
+'https://akamhy.github.io/waybackpy/assets/css/style.css?v=a418a4e4641a1dbaad8f3bfbf293fad21a75ff11',
+'https://akamhy.github.io/waybackpy/assets/css/style.css?v=f881705d00bf47b5bf0c58808efe29eecba2226c']
+```
+
+Try this out in your browser @
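+
+As a rough sketch building on item 2 above (the output file name is only an example here, not something waybackpy creates for you):
+
+```python
+import waybackpy
+
+URL = "akamhy.github.io"
+UA = "my-user-agent"
+
+# subdomain=True also includes URLs archived under subdomains of the given domain.
+known_urls = waybackpy.Url(url=URL, user_agent=UA).known_urls(subdomain=True)
+
+# known_urls() returns a plain Python list, so it can be filtered or
+# written to disk like any other list of strings.
+with open("known_urls.txt", "w") as f:  # example file name only
+    f.write("\n".join(known_urls))
+```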
+
 ### With the Command-line interface
 
 #### Save
 
@@ -304,6 +334,36 @@ waybackpy --url google.com --user_agent "my-unique-user-agent" --get save # Save
 ```
 
 Try this out in your browser @
 
+#### Fetch all the URLs that the Wayback Machine knows for a domain
+
+1) You can add the `--alive` flag to fetch only links that are still alive.
+2) You can add the `--subdomain` flag to include subdomains.
+3) The `--alive` and `--subdomain` flags can be used together.
+4) All links are saved to a file created in the current working directory.
+
+```bash
+pip install waybackpy
+
+# Ignore the installation line above if waybackpy is already installed.
+
+waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls
+# Prints all known URLs under akamhy.github.io
+
+waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --alive
+# Prints all known URLs under akamhy.github.io that are still alive (not dead links)
+
+waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain
+# Prints all known URLs under akamhy.github.io, including subdomains
+
+waybackpy --url akamhy.github.io --user_agent "my-user-agent" --known_urls --subdomain --alive
+# Prints all known URLs under akamhy.github.io, including subdomains, that are still alive
+```
+
+Try this out in your browser @
+
 ## Tests
 
 [Here](https://github.com/akamhy/waybackpy/tree/master/tests)
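+
+A minimal sketch for running the tests locally, assuming they are written for pytest (an assumption, since the test runner is not named here):
+
+```bash
+pip install pytest
+pytest tests/
+```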