Thoughts
- strip hashes and everything following (as they're in-page anchors)
- strip args
- use `pop()` on the set instead of `.remove()`
- return false once the set is empty
- `WebPage.parse_urls()` needs to compare `startswith` to base url
- ignore any links which aren't to pages
- better url checking to get bare domain
- remove base url from initial urls with and without trailing slash
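
A rough sketch of how the items above might fit together, using only the standard library. Every helper name here (`clean_url`, `bare_domain`, `is_internal`, `looks_like_page`, `drop_base`, `next_url`) and the `PAGE_EXTENSIONS` set are hypothetical; none of them are taken from the existing `WebPage` class, and the extension list in particular is just a guess at what counts as a "page".

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit

# Assumption: extensions treated as "pages"; "" covers paths like /about or /.
PAGE_EXTENSIONS = {"", "html", "htm", "php"}


def clean_url(url):
    """Strip the #fragment (in-page anchor) and the ?query args from a URL."""
    without_fragment, _fragment = urldefrag(url)
    parts = urlsplit(without_fragment)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def bare_domain(url):
    """Return the lowercased host without port or a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def is_internal(url, base_url):
    """True if the cleaned URL is the base URL or sits underneath it."""
    base = clean_url(base_url).rstrip("/")
    cleaned = clean_url(url)
    # Appending "/" before startswith avoids matching example.com.evil.org.
    return cleaned == base or cleaned.startswith(base + "/")


def looks_like_page(url):
    """Ignore links to images, PDFs, and other non-page resources."""
    last_segment = urlsplit(url).path.rsplit("/", 1)[-1]
    ext = last_segment.rsplit(".", 1)[-1].lower() if "." in last_segment else ""
    return ext in PAGE_EXTENSIONS


def drop_base(urls, base_url):
    """Remove the base URL from a set, with and without the trailing slash."""
    base = base_url.rstrip("/")
    return set(urls) - {base, base + "/"}


def next_url(pending):
    """Pop an arbitrary URL from the set, or return False once it is empty."""
    if not pending:
        return False
    return pending.pop()
```

Using `pop()` removes and returns an item in one step, so there is no need to pick a URL first and then call `.remove()` on it. Note that `set.pop()` returns an arbitrary element, so crawl order is not guaranteed with this approach.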