misc/web-scraper

Simon Weald 5f7d66912f add test files

2018-09-19 08:39:05 +01:00

rename all instances of base_url to rooturl, add more documentation

2018-09-18 18:24:15 +01:00

.gitignore

ignore generated file

2018-09-06 17:08:56 +01:00

async_crawler.py

rename all instances of base_url to rooturl, add more documentation

2018-09-18 18:24:15 +01:00

crawler.py

return from WebPage to indicate whether a link was actually crawlable and only actually crawl it if it was

2018-09-12 08:03:08 +01:00

notes.md

add talking points

2018-09-18 18:23:12 +01:00

README.md

update docs

2018-09-19 08:38:49 +01:00

requirements.txt

update requirements

2018-09-16 15:44:30 +01:00

test_helpers.py

correct tests with new arg names

2018-09-19 08:37:55 +01:00

README.md

Concurrent web scraper

Requirements

This crawler requires Python 3.5 or later, because it uses the async/await keywords (introduced in Python 3.5) together with asyncio.
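As a rough illustration of the async/await style this requirement refers to, the sketch below bounds the number of in-flight coroutines with a semaphore. It is not taken from async_crawler.py: the names `fetch`, `crawl_all` and `run` are hypothetical, and `asyncio.sleep` stands in for a real HTTP request.

```python
import asyncio


async def fetch(url, semaphore):
    # Hypothetical stand-in for an HTTP request; the crawler's real
    # fetching code is not shown in this README.
    async with semaphore:
        await asyncio.sleep(0)
        return url, 200


async def crawl_all(urls, concurrency=100):
    # The semaphore caps how many fetch() calls hold a slot at once.
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [fetch(url, semaphore) for url in urls]
    # gather() returns results in the same order as the input tasks.
    return await asyncio.gather(*tasks)


def run(urls, concurrency=100):
    # Uses an explicit event loop rather than asyncio.run(),
    # which is only available from Python 3.7.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(crawl_all(urls, concurrency))
    finally:
        loop.close()
```

Calling `run(["https://urltocrawl.com/a", "https://urltocrawl.com/b"], concurrency=1)` would process the two placeholder URLs one at a time.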

Install required modules:

pip install -r requirements.txt

Run:

python async_crawler.py -u https://urltocrawl.com [-c 100]

Flags:

-u/--url https://url.com
- The base URL to crawl. Required.
-c/--concurrency 100
- The concurrency value. Optional; defaults to 100.
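The flags above could be declared with argparse along these lines. This is a minimal sketch under assumptions, not the parser from async_crawler.py: the function name `build_parser` and the help strings are hypothetical, and only the flag names, the required URL and the default of 100 come from this README.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags documented in this README.
    parser = argparse.ArgumentParser(description="Concurrent web scraper")
    parser.add_argument(
        "-u", "--url",
        required=True,
        help="base URL to crawl",
    )
    parser.add_argument(
        "-c", "--concurrency",
        type=int,
        default=100,
        help="concurrency value (default: 100)",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-u", "https://urltocrawl.com"])
print(args.url, args.concurrency)
```

Passing `-c 10` in the argument list would override the default concurrency.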

Results

The resulting sitemap is written to the root of this directory as sitemap.html.