web-scraper/notes.md
2018-09-06 17:31:12 +01:00

Thoughts

  • strip the `#` fragment and everything following it (they're in-page anchors)
  • strip query arguments (`?key=value`)
  • use pop() on the set instead of .remove()
    • return `False` once the set is empty
  • WebPage.parse_urls() needs to check discovered URLs with startswith() against the base url
  • ignore any links which aren't to pages (e.g. images, PDFs)
  • better url checking to get bare domain
  • remove trailing slash from any discovered url
  • investigate lxml parser
  • remove base url from initial urls with and without trailing slash
  • investigate using tldextract to match urls
  • implement parsing of robots.txt
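
The URL-cleanup items above (strip fragment, strip args, drop the trailing slash) could be handled in one place; a minimal stdlib sketch, with `normalize_url` as a hypothetical helper name:

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    # Hypothetical helper: strip the fragment (in-page anchor), the query
    # arguments, and any trailing slash from a discovered URL.
    parts = urlparse(url)
    path = parts.path.rstrip("/")
    return urlunparse((parts.scheme, parts.netloc, path, "", "", ""))
```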
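
The `pop()`-instead-of-`.remove()` idea could look like this (hypothetical `next_url` helper; `set.pop()` removes and returns an arbitrary element, so there's no need to pick one first and then call `.remove()`):

```python
def next_url(pending):
    # Return False once the pending set is empty, per the note above.
    if not pending:
        return False
    return pending.pop()
```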
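
The base-url `startswith` check and the "ignore links that aren't pages" filter could be combined; a sketch with a hypothetical `is_internal_page` helper, where the set of "page" extensions is an assumption:

```python
import os.path
from urllib.parse import urlparse

PAGE_EXTENSIONS = {"", ".html", ".htm"}  # assumption: what counts as a page

def is_internal_page(url, base_url):
    # Keep only URLs under the base URL whose path looks like a page
    # rather than an asset (images, archives, PDFs, ...).
    if not url.startswith(base_url):
        return False
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return ext in PAGE_EXTENSIONS
```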
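
For the tldextract item: it splits off the public suffix, so two URLs can be matched on their registered domain even across subdomains. A sketch with a hypothetical `same_registered_domain` helper:

```python
import tldextract

def same_registered_domain(a, b):
    # tldextract.extract() returns (subdomain, domain, suffix);
    # registered_domain is domain + suffix, e.g. "example.com".
    return (tldextract.extract(a).registered_domain
            == tldextract.extract(b).registered_domain)
```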
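
robots.txt parsing doesn't need a third-party library; the stdlib's `urllib.robotparser` handles it. A sketch with a hypothetical `allowed` helper (in the scraper the lines would come from fetching `<base_url>/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_lines, url, agent="*"):
    # parse() accepts the robots.txt body as a list of lines.
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)
```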
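
On the lxml item: its XPath support makes link extraction a one-liner. A sketch of what a `WebPage.parse_urls()` built on lxml might reduce to (the function shape here is an assumption):

```python
from lxml import html

def parse_urls(page_source):
    # Return the href of every <a> element in the page.
    return html.fromstring(page_source).xpath("//a/@href")
```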