web-scraper/notes.md at f2c294ebdb135ec5e653fcb6dee5ce1e8e093afb

Simon Weald f2c294ebdb added new ideas to implement

2018-09-04 15:40:11 +01:00

Thoughts

~~strip hashes and everything following (as they're in-page anchors)~~
strip query args (everything after `?`)
~~use pop() on the set instead of .remove()~~
- ~~return false once the set is empty~~
~~WebPage.parse_urls() needs to compare startswith to base url~~
ignore any links which aren't to pages
better url checking to get bare domain
~~remove base url from initial urls with and without trailing slash~~
investigate using tldextract to match urls
implement parsing of robots.txt
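
A rough sketch of how several of the items above might look using only the standard library. All function names and the extension list are hypothetical and not taken from the scraper's code. Hostname comparison is a stand-in for the bare-domain check; matching on the registered domain would still need something like tldextract, which is not used here.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical list of extensions treated as "not a page".
NON_PAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".pdf", ".zip", ".css", ".js")


def normalise_url(url):
    """Drop the query string and the fragment, keeping scheme, host and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def is_same_site(url, base_url):
    """Compare hostnames instead of calling startswith() on the raw string."""
    return urlsplit(url).hostname == urlsplit(base_url).hostname


def looks_like_page(url):
    """Reject links whose path ends in a known non-page extension."""
    return not urlsplit(url).path.lower().endswith(NON_PAGE_EXTENSIONS)


def build_robots_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def next_url(pending):
    """Pop an arbitrary URL from the set, or return False once it is empty."""
    if not pending:
        return False
    return pending.pop()
```

The hostname check is there because a plain `startswith()` on the base url would also accept a host such as `example.com.evil.org`. `build_robots_parser()` takes the file content as a string so it can be tried without a network call; the resulting parser's `can_fetch(user_agent, url)` then says whether a url may be crawled.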