Thoughts
- strip hashes and everything following (as they're in-page anchors)
- strip args
- use `pop()` on the set instead of `.remove()` (see the sketch after this list)
- return `False` once the set is empty
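The set-based frontier described above might look like this; a minimal sketch, with `Crawler` and `next_url` as made-up names:

```python
# Minimal sketch of the frontier idea; names are hypothetical.
class Crawler:
    def __init__(self, seed_urls):
        self.pending = set(seed_urls)

    def next_url(self):
        if not self.pending:
            return False  # "return False once the set is empty"
        # pop() removes and returns an arbitrary element in one call,
        # so there's no separate .remove() step
        return self.pending.pop()
```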
- `WebPage.parse_urls()` needs to compare `startswith` to base url (see the sketches after this list)
- ignore any links which aren't to pages
- better url checking to get bare domain #wontfix
- remove trailing slash from any discovered url
- investigate lxml parser
- remove base url from initial urls with and without trailing slash
- investigate using tldextract to match urls #wontfix
- implement parsing of robots.txt (sketch below)
- investigate gzip encoding
- implement some kind of progress display
- async
- better exception handling
- randomise output filename
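A sketch of the link-filtering rules from the list above (strip fragments and query args, drop trailing slashes, compare with `startswith` against the base url); the function names here are my own:

```python
# Hypothetical helpers for the url-filtering items above.
from urllib.parse import urlsplit, urlunsplit


def normalise(url: str) -> str:
    parts = urlsplit(url)
    # drop the query string and fragment, plus any trailing slash
    return urlunsplit((parts.scheme, parts.netloc,
                       parts.path.rstrip("/"), "", ""))


def is_internal(url: str, base_url: str) -> bool:
    # only keep links under the site being crawled
    return normalise(url).startswith(base_url.rstrip("/"))
```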
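For the robots.txt item, the standard library's `urllib.robotparser` should cover the basics; a sketch assuming `https://example.com` as the base url:

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetches and parses the file synchronously

if robots.can_fetch("*", "https://example.com/some/page"):
    ...  # safe to crawl
```

Note that `read()` is synchronous; in an async crawler it may be cleaner to fetch the file with the existing HTTP client and feed the body to `robots.parse(body.splitlines())`.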
talking points
- token bucket algo to enforce n requests per second (sketch after this list)
- read up on bucket algo types
- re-structuring AsyncCrawler to be more testable (sketch below)
- use exponential backoff algo? (sketch below)
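A minimal token-bucket sketch for the first talking point, assuming asyncio; `TokenBucket` and `acquire()` are illustrative names, not the project's API:

```python
# Allows bursts up to `capacity`, sustained rate of `rate` per second.
import asyncio
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # refill in proportion to elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # sleep roughly until the next token is due
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

Each worker would `await bucket.acquire()` before issuing a request, giving a sustained rate of `rate` requests per second with bursts up to `capacity`.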
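On making `AsyncCrawler` more testable, one option is to inject the fetch coroutine rather than hard-coding the HTTP call; the constructor signature here is an assumption, not the project's actual API:

```python
# Dependency injection keeps tests off the network.
class AsyncCrawler:
    def __init__(self, base_url, fetch):
        self.base_url = base_url
        self.fetch = fetch  # coroutine: url -> html body

    async def crawl_one(self, url):
        return await self.fetch(url)


# In a test, pass a stub that returns canned HTML:
async def fake_fetch(url):
    return "<html><a href='/about'>about</a></html>"
```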
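And a hedged sketch of exponential backoff with jitter for the last point; `fetch` stands in for whatever coroutine actually performs the request:

```python
import asyncio
import random


async def fetch_with_retries(fetch, url, retries=5, base_delay=0.5):
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            # delay doubles each attempt; jitter avoids synchronised retries
            delay = base_delay * 2 ** attempt
            await asyncio.sleep(delay * random.uniform(0.5, 1.5))
```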