misc/web-scraper

Simon Weald 75d3756bbc fix errors discovered by pycyodestyle

2018-09-16 16:04:07 +01:00

report runtime of script in generated sitemap

2018-09-06 17:20:59 +01:00

more comments and progress output

2018-09-16 15:26:49 +01:00

.gitignore

ignore generated file

2018-09-06 17:08:56 +01:00

async_crawler.py

fix errors discovered by pycyodestyle

2018-09-16 16:04:07 +01:00

crawler.py

return from WebPage to indicate whether a link was actually crawlable and only actually crawl it if it was

2018-09-12 08:03:08 +01:00

notes.md

add async info

2018-09-10 21:29:46 +01:00

README.md

add flags to README

2018-09-16 15:58:17 +01:00

requirements.txt

update requirements

2018-09-16 15:44:30 +01:00

test_helpers.py

add some basic tests

2018-09-11 13:42:15 +01:00

README.md

Concurrent web scraper

Requirements

This crawler requires Python 3.5 or later, because it uses the async/await syntax (added to the language in 3.5) together with the asyncio module.
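For readers new to the syntax, here is a minimal sketch of async/await with a cap on the number of coroutines running at once. It is not taken from crawler.py: the names `fetch`, `crawl` and `main` are hypothetical, and the network request is replaced by a short sleep so the example runs offline.

```python
import asyncio


async def fetch(url, semaphore):
    # Hypothetical stand-in for a real HTTP request: it only sleeps,
    # so no network access is needed to run this sketch.
    async with semaphore:  # at most `concurrency` fetches hold the semaphore
        await asyncio.sleep(0.01)
        return url, 200  # made-up status code, for illustration only


async def crawl(urls, concurrency=100):
    # The semaphore is created inside the coroutine so it is bound
    # to the event loop that is actually running.
    semaphore = asyncio.Semaphore(concurrency)
    tasks = [fetch(url, semaphore) for url in urls]
    # gather() returns results in the same order as the input tasks.
    return await asyncio.gather(*tasks)


def main():
    # An explicit loop keeps this compatible with Python 3.5;
    # on 3.7+ asyncio.run(crawl(...)) would do the same job.
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(
            crawl(["https://url.com/a", "https://url.com/b"], concurrency=2)
        )
    finally:
        loop.close()


if __name__ == "__main__":
    print(main())
```

The semaphore is one common way to bound concurrency in asyncio; the real crawler may limit it differently.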

Install required modules:

pip install -r requirements.txt

Run:

python crawler.py -u https://urltocrawl.com [-c 100]

Flags:

-u/--url https://url.com
- Required. The base URL to crawl.
-c/--concurrency 100
- Optional. The concurrency value; defaults to 100.
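The flags above could be declared with argparse roughly as follows. This is a sketch under assumptions, not the parser from crawler.py: the function name `build_parser` and the help strings are hypothetical, and only the flag names, the required URL and the default of 100 come from this README.

```python
import argparse


def build_parser():
    # Hypothetical parser mirroring the flags documented in this README.
    parser = argparse.ArgumentParser(description="Concurrent web scraper")
    parser.add_argument(
        "-u", "--url",
        required=True,  # the base URL must be supplied
        help="base URL to crawl",
    )
    parser.add_argument(
        "-c", "--concurrency",
        type=int,
        default=100,  # documented default
        help="concurrency value (default: 100)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url, args.concurrency)
```

With this sketch, omitting -c leaves `args.concurrency` at 100, and omitting -u makes argparse exit with a usage error.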

Results

The resulting sitemap is written to the root of this directory as sitemap.html.
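As an illustration of the output step, the sketch below writes a set of crawled URLs to an HTML file. The layout of the real sitemap.html is not described here, so the markup, and the names `render_sitemap` and `write_sitemap`, are assumptions made for the example.

```python
import html
from pathlib import Path


def render_sitemap(urls):
    # Hypothetical layout: one list item per URL, sorted for stable output.
    # URLs are escaped so characters such as & cannot break the markup.
    items = "\n".join(
        '    <li><a href="{0}">{0}</a></li>'.format(html.escape(url, quote=True))
        for url in sorted(urls)
    )
    return (
        "<!DOCTYPE html>\n<html>\n<body>\n  <ul>\n"
        "{}\n"
        "  </ul>\n</body>\n</html>\n"
    ).format(items)


def write_sitemap(urls, path="sitemap.html"):
    # Writes relative to the current working directory by default.
    target = Path(path)
    target.write_text(render_sitemap(urls), encoding="utf-8")
    return target


if __name__ == "__main__":
    print(write_sitemap({"https://url.com/", "https://url.com/about"}))
```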