web-scraper/notes.md at b5d644a2233cc487289298011ef529e6dc35875e

Simon Weald ab0ab0a010 add more thoughts

2018-09-07 11:50:53 +01:00

Thoughts

~~strip hashes and everything following (as they're in-page anchors)~~
strip query args (the `?` and everything following)
~~use pop() on the set instead of .remove()~~
- ~~return false once the set is empty~~
~~WebPage.parse_urls() needs to compare startswith to base url~~
~~ignore any links which aren't to pages~~
better url checking to get bare domain
remove trailing slash from any discovered url
investigate lxml parser
~~remove base url from initial urls with and without trailing slash~~
investigate using tldextract to match urls
~~implement parsing of robots.txt~~
investigate gzip encoding
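
A rough sketch of the URL clean-up items above (strip hashes, strip query args, remove trailing slash, compare against the base url), using only the standard library's `urllib.parse`. The function names `normalise_url` and `is_internal` are made up for illustration and are not taken from the scraper. Comparing hostnames is one possible alternative to `startswith`; `tldextract` may still be the better fit if subdomains should count as the same site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise_url(url: str) -> str:
    """Return url without query args, fragment, or trailing slash (hypothetical helper)."""
    parts = urlsplit(url)
    # Drop any trailing slash from the path; a bare root "/" becomes "".
    path = parts.path.rstrip("/")
    # Rebuild the URL with the query and fragment components left empty.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def is_internal(url: str, base_url: str) -> bool:
    """Compare parsed hostnames instead of calling startswith on the raw string."""
    return urlsplit(url).hostname == urlsplit(base_url).hostname
```

With these, `https://example.com/about/?page=2#team` and `https://example.com/about` normalise to the same string, so the set of discovered urls should not end up holding both.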

text/html; charset=utf-8
application/xhtml+xml
'WebPage' object has no attribute 'source'
'WebPage' object has no attribute 'discovered_hrefs'
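
A possible way to tie the four lines above together, as a sketch only. It assumes the two content types are the ones that should count as pages, and that the attribute errors come from `source` and `discovered_hrefs` only being set after a successful fetch; neither assumption is confirmed by these notes. `PageSketch`, `is_html` and `load` are hypothetical names, not the real `WebPage` API.

```python
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type: str) -> bool:
    """True if a Content-Type header value names an HTML or XHTML page."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


class PageSketch:
    """Hypothetical stand-in for WebPage, not the project's class."""

    def __init__(self, url):
        self.url = url
        # Defaults are set up front so later reads never raise AttributeError,
        # even when the page was skipped or never fetched.
        self.source = None
        self.discovered_hrefs = set()

    def load(self, content_type, body):
        """Store body as the page source only if the content type is a page."""
        if not is_html(content_type):
            return False
        self.source = body
        return True
```

Callers can then test `page.source is None` rather than catching `AttributeError` on pages that were ignored.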