# Thoughts

- strip hashes and everything following them (they're just in-page anchors); a normalization sketch follows this list
- strip query-string arguments
- use `pop()` on the set instead of `.remove()` (crawl-loop sketch below)
  - return `False` once the set is empty
- `WebPage.parse_urls()` needs to compare each URL against the base URL with `startswith()` (see the base-URL check below)
- ignore any links which aren't to pages
- better URL checking to get the bare domain #wontfix
- remove the trailing slash from any discovered URL
- investigate the lxml parser
- remove the base URL from the initial URLs, both with and without a trailing slash
- investigate using tldextract to match URLs #wontfix
- implement parsing of robots.txt (sketch below)
- investigate gzip encoding
- implement some kind of progress display
- async fetching, e.g. asyncio/aiohttp (sketch below)
- better exception handling
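
Sketches for a few of the items above. First, the URL clean-up steps (drop the fragment, drop the query string, strip the trailing slash), using only the standard library; the function name `normalize_url` and the example URL are placeholders, not part of the scraper.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Drop the fragment and query string, then strip any trailing slash."""
    parts = urlsplit(url)
    # Rebuild the URL with empty query and fragment components.
    bare = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return bare.rstrip("/")

print(normalize_url("https://example.com/page/?ref=nav#section-2"))
# -> https://example.com/page
```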
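
The `pop()` idea: since `set.pop()` removes and returns an arbitrary member, the pending set can double as the work queue, and the loop ends on its own once the set is empty. A minimal sketch; `crawl` and the `fetch_links` callable are stand-ins, not the project's actual names.

```python
def crawl(start_url: str, fetch_links) -> set[str]:
    pending = {start_url}   # URLs still to visit
    seen = set()            # URLs already visited
    while pending:          # stops once the set is empty
        url = pending.pop()            # takes no argument, unlike .remove(url)
        seen.add(url)
        for link in fetch_links(url):  # fetch the page and parse its URLs
            if link not in seen:
                pending.add(link)
    return seen

# Trivial usage with a canned link map instead of real HTTP:
links = {"a": ["b", "c"], "b": ["a"], "c": []}
print(crawl("a", lambda u: links.get(u, [])))   # -> {'a', 'b', 'c'}
```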
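
The `startswith()` check for `WebPage.parse_urls()` could look something like this; `BASE_URL` is a placeholder, and this is the naive version (matching only the bare domain is the #wontfix item above).

```python
BASE_URL = "https://example.com"   # placeholder

def same_site(url: str) -> bool:
    """Keep only links under the base URL so the crawler stays on-site."""
    return url.startswith(BASE_URL)

print(same_site("https://example.com/about"))    # True
print(same_site("https://other.example.net/"))   # False
```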
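
For the robots.txt item, the standard library's `urllib.robotparser` would be one option; the URLs here are placeholders.

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()   # fetch and parse the file
if robots.can_fetch("*", "https://example.com/some/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```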
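
For the async item, one possible direction is asyncio with aiohttp, fetching several pages concurrently; this is just a sketch of the shape, not the scraper's design.

```python
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def fetch_all(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

pages = asyncio.run(fetch_all(["https://example.com"]))   # placeholder URL
```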