Commit Graph

20 Commits

SHA1 Message Date
b5d644a223 various minor improvements to exception handling 2018-09-09 10:16:03 +01:00
6508156aa4 use lxml as the parser and only find links on a page if we've got the source 2018-09-09 10:06:25 +01:00
738ab8e441 adjust robots handling to deal with 404s and enforce a user agent which allows us to initially obtain the user agent 2018-09-09 09:57:16 +01:00
fdd84a8786 manually retrieve robots.txt to ensure we can set the user-agent 2018-09-07 12:40:12 +01:00
6d9103c154 improved content-type detection 2018-09-06 17:08:12 +01:00
e57a86c60a only attempt to read html 2018-09-06 16:30:11 +01:00
a3ec9451e3 implement parsing of robots.txt 2018-09-05 18:56:20 +01:00
05e907ecec too many changes to make a sensible commit message 2018-09-04 09:21:26 +01:00
abc628106d added a docstring to the WebPage object 2018-08-31 19:18:00 +01:00
c436016e0c remove unecessary function 2018-08-31 19:16:08 +01:00
03554fde80 add docstrings 2018-08-31 19:15:35 +01:00
759f965e95 use more explicit names, use urljoin to combine urls 2018-08-31 19:12:58 +01:00
1b18aa83eb corrected some small errors and added runner function 2018-08-31 19:01:35 +01:00
915def3a5d rework url sanitiser to use urllib modules, move WebPage object to helpers 2018-08-31 18:26:25 +01:00
453331d69d simplified url qualifier 2018-08-29 22:27:26 +01:00
2b812da26a simplify UrlPoolManager to use a set instead of a dict 2018-08-29 21:49:15 +01:00
452de87f35 change name of pool management object to be more clear 2018-08-28 22:28:49 +01:00
73cb883151 add a list manager object 2018-08-28 22:28:16 +01:00
25f8c4c686 remove testing url with requests and assume that the user is correct 2018-08-28 17:22:52 +01:00
79b10798a3 initial commit of utils 2018-08-27 19:37:41 +01:00