Thoughts
- strip hashes and everything following (as they're in-page anchors)
- strip args
- use pop() on the set instead of .remove()
- return false once the set is empty
- WebPage.parse_urls() needs to compare startswith to base url
- ignore any links which aren't to pages
- better url checking to get bare domain
- remove trailing slash from any discovered url
- investigate lxml parser
- remove base url from initial urls with and without trailing slash
- investigate using tldextract to match urls
- implement parsing of robots.txt
- investigate gzip encoding
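
A rough sketch of how the url clean-up items above might fit together, using only the standard library. The names normalise_url, is_internal and next_url are made up for illustration and are not taken from the crawler itself; the real WebPage.parse_urls() may be structured quite differently.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def normalise_url(href, base_url):
    """Resolve href against base_url, then drop args, hash and trailing slash."""
    absolute = urljoin(base_url, href)
    scheme, netloc, path, _query, _fragment = urlsplit(absolute)
    # Passing empty query and fragment strips the args and the in-page anchor.
    return urlunsplit((scheme, netloc, path.rstrip("/"), "", ""))


def is_internal(url, base_url):
    """True if url is the base url itself or sits beneath it."""
    base = base_url.rstrip("/")
    # Appending "/" stops https://example.com.evil.org from matching.
    return url == base or url.startswith(base + "/")


def next_url(pending):
    """Pop an arbitrary url from the set, or return False once it is empty."""
    if not pending:
        return False
    return pending.pop()
```

Comparing against the base url with a trailing "/" appended is one way to avoid false matches on look-alike domains; tldextract may still be the better route for matching bare domains.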
text/html; charset=utf-8
application/xhtml+xml
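
Assuming the two values above are Content-Type headers that should count as pages, a check along these lines would ignore the charset parameter. HTML_TYPES and is_html are hypothetical names, not part of the existing code.

```python
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type):
    """True if a Content-Type header value names an HTML or XHTML page."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES
```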
'WebPage' object has no attribute 'source'
'WebPage' object has no attribute 'discovered_hrefs'