# Thoughts

- strip hashes and everything following them (they're just in-page anchors); a normalization sketch follows this list
- strip query-string arguments
- use `pop()` on the set instead of `.remove()` (crawl-loop sketch below)
  - return `False` once the set is empty
- `WebPage.parse_urls()` needs to compare each URL against the base URL with `startswith()` (see the base-URL check below)
- ignore any links which aren't to pages
- better URL checking to get the bare domain #wontfix
- remove the trailing slash from any discovered URL
- investigate the lxml parser
- remove the base URL from the initial URLs, both with and without a trailing slash
- investigate using tldextract to match URLs #wontfix
- implement parsing of robots.txt (sketch below)
- investigate gzip encoding
- implement some kind of progress display
- async fetching, e.g. asyncio/aiohttp (sketch below)
- better exception handling
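
Sketches for a few of the items above. First, the URL clean-up steps (drop the fragment, drop the query string, strip the trailing slash), using only the standard library; the function name `normalize_url` and the example URL are placeholders, not part of the scraper.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Drop the fragment and query string, then strip any trailing slash."""
    parts = urlsplit(url)
    # Rebuild the URL with empty query and fragment components.
    bare = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return bare.rstrip("/")

print(normalize_url("https://example.com/page/?ref=nav#section-2"))
# -> https://example.com/page
```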
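
The `pop()` idea: since `set.pop()` removes and returns an arbitrary member, the pending set can double as the work queue, and the loop ends on its own once the set is empty. A minimal sketch; `crawl` and the `fetch_links` callable are stand-ins, not the project's actual names.

```python
def crawl(start_url: str, fetch_links) -> set[str]:
    pending = {start_url}   # URLs still to visit
    seen = set()            # URLs already visited
    while pending:          # stops once the set is empty
        url = pending.pop()            # takes no argument, unlike .remove(url)
        seen.add(url)
        for link in fetch_links(url):  # fetch the page and parse its URLs
            if link not in seen:
                pending.add(link)
    return seen

# Trivial usage with a canned link map instead of real HTTP:
links = {"a": ["b", "c"], "b": ["a"], "c": []}
print(crawl("a", lambda u: links.get(u, [])))   # -> {'a', 'b', 'c'}
```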
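
The `startswith()` check for `WebPage.parse_urls()` could look something like this; `BASE_URL` is a placeholder, and this is the naive version (matching only the bare domain is the #wontfix item above).

```python
BASE_URL = "https://example.com"   # placeholder

def same_site(url: str) -> bool:
    """Keep only links under the base URL so the crawler stays on-site."""
    return url.startswith(BASE_URL)

print(same_site("https://example.com/about"))    # True
print(same_site("https://other.example.net/"))   # False
```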
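
For the robots.txt item, the standard library's `urllib.robotparser` would be one option; the URLs here are placeholders.

```python
from urllib.robotparser import RobotFileParser

robots = RobotFileParser("https://example.com/robots.txt")
robots.read()   # fetch and parse the file
if robots.can_fetch("*", "https://example.com/some/page"):
    print("allowed to crawl")
else:
    print("disallowed by robots.txt")
```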
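
For the async item, one possible direction is asyncio with aiohttp, fetching several pages concurrently; this is just a sketch of the shape, not the scraper's design.

```python
import asyncio
import aiohttp

async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url) as resp:
        return await resp.text()

async def fetch_all(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

pages = asyncio.run(fetch_all(["https://example.com"]))   # placeholder URL
```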