Thoughts
- strip hashes and everything following (as they're in-page anchors)
- strip args
- use `pop()` on the set instead of `.remove()` (see the sketch after this list)
- return `False` once the set is empty
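The set-based frontier described above might look like this; a minimal sketch, with `Crawler` and `next_url` as made-up names:

```python
# Minimal sketch of the frontier idea; names are hypothetical.
class Crawler:
    def __init__(self, seed_urls):
        self.pending = set(seed_urls)

    def next_url(self):
        if not self.pending:
            return False  # "return False once the set is empty"
        # pop() removes and returns an arbitrary element in one call,
        # so there's no separate .remove() step
        return self.pending.pop()
```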
- `WebPage.parse_urls()` needs to compare `startswith` to base url (see the sketches after this list)
- ignore any links which aren't to pages
- better url checking to get bare domain #wontfix
- remove trailing slash from any discovered url
- investigate lxml parser
- remove base url from initial urls with and without trailing slash
- investigate using tldextract to match urls #wontfix
- implement parsing of robots.txt (sketch below)
- investigate gzip encoding
- implement some kind of progress display
- async
- better exception handling
- randomise output filename
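A sketch of the link-filtering rules from the list above (strip fragments and query args, drop trailing slashes, compare with `startswith` against the base url); the function names here are my own:

```python
# Hypothetical helpers for the url-filtering items above.
from urllib.parse import urlsplit, urlunsplit


def normalise(url: str) -> str:
    parts = urlsplit(url)
    # drop the query string and fragment, plus any trailing slash
    return urlunsplit((parts.scheme, parts.netloc,
                       parts.path.rstrip("/"), "", ""))


def is_internal(url: str, base_url: str) -> bool:
    # only keep links under the site being crawled
    return normalise(url).startswith(base_url.rstrip("/"))
```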
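For the robots.txt item, the standard library's `urllib.robotparser` should cover the basics; a sketch assuming `https://example.com` as the base url:

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()  # fetches and parses the file synchronously

if robots.can_fetch("*", "https://example.com/some/page"):
    ...  # safe to crawl
```

Note that `read()` is synchronous; in an async crawler it may be cleaner to fetch the file with the existing HTTP client and feed the body to `robots.parse(body.splitlines())`.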
talking points
- token bucket algo to enforce n requests per second (sketch after this list)
- read up on bucket algo types
- re-structuring AsyncCrawler to be more testable (sketch below)
- use exponential backoff algo? (sketch below)
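A minimal token-bucket sketch for the first talking point, assuming asyncio; `TokenBucket` and `acquire()` are illustrative names, not the project's API:

```python
# Allows bursts up to `capacity`, sustained rate of `rate` per second.
import asyncio
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            # refill in proportion to elapsed time, capped at capacity
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            # sleep roughly until the next token is due
            await asyncio.sleep((1 - self.tokens) / self.rate)
```

Each worker would `await bucket.acquire()` before issuing a request, giving a sustained rate of `rate` requests per second with bursts up to `capacity`.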
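On making `AsyncCrawler` more testable, one option is to inject the fetch coroutine rather than hard-coding the HTTP call; the constructor signature here is an assumption, not the project's actual API:

```python
# Dependency injection keeps tests off the network.
class AsyncCrawler:
    def __init__(self, base_url, fetch):
        self.base_url = base_url
        self.fetch = fetch  # coroutine: url -> html body

    async def crawl_one(self, url):
        return await self.fetch(url)


# In a test, pass a stub that returns canned HTML:
async def fake_fetch(url):
    return "<html><a href='/about'>about</a></html>"
```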
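And a hedged sketch of exponential backoff with jitter for the last point; `fetch` stands in for whatever coroutine actually performs the request:

```python
import asyncio
import random


async def fetch_with_retries(fetch, url, retries=5, base_delay=0.5):
    for attempt in range(retries):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            # delay doubles each attempt; jitter avoids synchronised retries
            delay = base_delay * 2 ** attempt
            await asyncio.sleep(delay * random.uniform(0.5, 1.5))
```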