Thoughts
- strip hashes and everything following (as they're in-page anchors)
- strip args
- use pop() on the set instead of .remove()
- return false once the set is empty
- WebPage.parse_urls() needs to compare startswith to base url
- ignore any links which aren't to pages
- better url checking to get bare domain
- remove base url from initial urls with and without trailing slash
- investigate using tldextract to match urls
- implement parsing of robots.txt
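
A rough sketch of how several of these ideas might fit together, using only the standard library. All names here (normalise_url, is_internal_page, filter_links, next_url, NON_PAGE_EXTENSIONS) are hypothetical and not taken from the existing code, and the extension list is an assumption about what counts as "not a page". It does not use tldextract, so bare-domain matching is only approximated by a prefix check against the base URL.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Assumed list of extensions treated as "not a page"; adjust as needed.
NON_PAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".css", ".js", ".pdf", ".zip")


def normalise_url(url):
    """Drop the query string (args) and the #fragment from a URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def is_internal_page(url, base_url):
    """True if url sits under base_url and does not look like a static asset."""
    base = base_url.rstrip("/")
    # Require base + "/" so that https://example.com.evil.org does not match.
    if url != base and not url.startswith(base + "/"):
        return False
    return not urlsplit(url).path.lower().endswith(NON_PAGE_EXTENSIONS)


def filter_links(links, base_url, robots=None):
    """Normalise links and keep internal pages other than the base URL itself."""
    base = base_url.rstrip("/")
    kept = set()
    for link in links:
        url = normalise_url(link)
        if url.rstrip("/") == base:
            continue  # the base URL, with or without a trailing slash
        if not is_internal_page(url, base_url):
            continue
        if robots is not None and not robots.can_fetch("*", url):
            continue  # disallowed by robots.txt
        kept.add(url)
    return kept


def next_url(pending):
    """Pop an arbitrary URL from the set, or return False once it is empty."""
    if not pending:
        return False
    return pending.pop()
```

A RobotFileParser can be filled either with set_url() followed by read(), or offline by passing the lines of a robots.txt file to parse(); the result is then handed to filter_links as the robots argument.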