web-scraper/notes.md
2018-09-06 17:31:12 +01:00

Thoughts

  • strip the `#` fragment and everything following it (they're in-page anchors)
  • strip query arguments (`?key=value`)
  • use pop() on the set instead of .remove()
    • return `False` once the set is empty
  • WebPage.parse_urls() needs to check discovered URLs with startswith() against the base url
  • ignore any links which aren't to pages (e.g. images, PDFs)
  • better url checking to get bare domain
  • remove trailing slash from any discovered url
  • investigate lxml parser
  • remove base url from initial urls with and without trailing slash
  • investigate using tldextract to match urls
  • implement parsing of robots.txt
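
The URL-cleanup items above (strip fragment, strip args, drop the trailing slash) could be handled in one place; a minimal stdlib sketch, with `normalize_url` as a hypothetical helper name:

```python
from urllib.parse import urlparse, urlunparse

def normalize_url(url):
    # Hypothetical helper: strip the fragment (in-page anchor), the query
    # arguments, and any trailing slash from a discovered URL.
    parts = urlparse(url)
    path = parts.path.rstrip("/")
    return urlunparse((parts.scheme, parts.netloc, path, "", "", ""))
```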
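
The `pop()`-instead-of-`.remove()` idea could look like this (hypothetical `next_url` helper; `set.pop()` removes and returns an arbitrary element, so there's no need to pick one first and then call `.remove()`):

```python
def next_url(pending):
    # Return False once the pending set is empty, per the note above.
    if not pending:
        return False
    return pending.pop()
```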
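
The base-url `startswith` check and the "ignore links that aren't pages" filter could be combined; a sketch with a hypothetical `is_internal_page` helper, where the set of "page" extensions is an assumption:

```python
import os.path
from urllib.parse import urlparse

PAGE_EXTENSIONS = {"", ".html", ".htm"}  # assumption: what counts as a page

def is_internal_page(url, base_url):
    # Keep only URLs under the base URL whose path looks like a page
    # rather than an asset (images, archives, PDFs, ...).
    if not url.startswith(base_url):
        return False
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return ext in PAGE_EXTENSIONS
```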
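
For the tldextract item: it splits off the public suffix, so two URLs can be matched on their registered domain even across subdomains. A sketch with a hypothetical `same_registered_domain` helper:

```python
import tldextract

def same_registered_domain(a, b):
    # tldextract.extract() returns (subdomain, domain, suffix);
    # registered_domain is domain + suffix, e.g. "example.com".
    return (tldextract.extract(a).registered_domain
            == tldextract.extract(b).registered_domain)
```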
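
robots.txt parsing doesn't need a third-party library; the stdlib's `urllib.robotparser` handles it. A sketch with a hypothetical `allowed` helper (in the scraper the lines would come from fetching `<base_url>/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_lines, url, agent="*"):
    # parse() accepts the robots.txt body as a list of lines.
    rp = RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)
```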
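
On the lxml item: its XPath support makes link extraction a one-liner. A sketch of what a `WebPage.parse_urls()` built on lxml might reduce to (the function shape here is an assumption):

```python
from lxml import html

def parse_urls(page_source):
    # Return the href of every <a> element in the page.
    return html.fromstring(page_source).xpath("//a/@href")
```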