Thoughts
- strip hashes and everything following (as they're in-page anchors)
- strip args
- use `pop()` on the set instead of `.remove()`
- return false once the set is empty
- `WebPage.parse_urls()` needs to compare startswith to base url
- ignore any links which aren't to pages
- better url checking to get bare domain #wontfix
- remove trailing slash from any discovered url
- investigate lxml parser
- remove base url from initial urls with and without trailing slash
- investigate using tldextract to match urls #wontfix
- implement parsing of robots.txt
- investigate gzip encoding
- implement some kind of progress display
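
A rough sketch of how a few of the items above (stripping hashes and args, removing the trailing slash, the startswith check against the base url, and `pop()` on the set) might fit together. It uses only the standard library, and the names `normalise_url`, `is_internal` and `next_url` plus the `example.com` urls are made up for illustration; they are not taken from the actual crawler code.

```python
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalise_url(url: str) -> str:
    """Drop the fragment and query string, then any trailing slash."""
    url, _fragment = urldefrag(url)        # strip "#..." in-page anchors
    parts = urlsplit(url)
    path = parts.path.rstrip("/")          # remove trailing slash
    # Rebuild without query ("args") or fragment.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def is_internal(url: str, base_url: str) -> bool:
    """True if an already-normalised url lives under base_url."""
    base = normalise_url(base_url)
    # Compare against base + "/" so "example.com.evil.org" does not match.
    return url == base or url.startswith(base + "/")


def next_url(pending: set):
    """Take an arbitrary url from the set, or return False once it is empty."""
    if not pending:
        return False
    return pending.pop()                   # removes and returns in one step


# Hypothetical usage
print(normalise_url("https://example.com/about/?ref=nav#team"))
# -> https://example.com/about
print(is_internal("https://example.com/about", "https://example.com/"))
# -> True
```

Normalising before the startswith check means the base url matches with or without its trailing slash, which should also cover the "remove base url from initial urls" item.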