| 1af26f50f2 | added a docstring | 2018-09-11 13:42:02 +01:00 |
| 9e125dfae0 | added comments and docstrings | 2018-09-09 22:49:55 +01:00 |
| 9e754a5584 | improve handling of gzip/deflated data detection | 2018-09-09 11:21:46 +01:00 |
| 1b005570ee | implement gzip compression requests and handling | 2018-09-09 10:53:09 +01:00 |
| b5d644a223 | various minor improvements to exception handling | 2018-09-09 10:16:03 +01:00 |
| 6508156aa4 | use lxml as the parser and only find links on a page if we've got the source | 2018-09-09 10:06:25 +01:00 |
| 738ab8e441 | adjust robots handling to deal with 404s and enforce a user agent which allows us to initially obtain the user agent | 2018-09-09 09:57:16 +01:00 |
| fdd84a8786 | manually retrieve robots.txt to ensure we can set the user-agent | 2018-09-07 12:40:12 +01:00 |
| 6d9103c154 | improved content-type detection | 2018-09-06 17:08:12 +01:00 |
| e57a86c60a | only attempt to read html | 2018-09-06 16:30:11 +01:00 |
| a3ec9451e3 | implement parsing of robots.txt | 2018-09-05 18:56:20 +01:00 |
| 05e907ecec | too many changes to make a sensible commit message | 2018-09-04 09:21:26 +01:00 |
| abc628106d | added a docstring to the WebPage object | 2018-08-31 19:18:00 +01:00 |
| c436016e0c | remove unnecessary function | 2018-08-31 19:16:08 +01:00 |
| 03554fde80 | add docstrings | 2018-08-31 19:15:35 +01:00 |
| 759f965e95 | use more explicit names, use urljoin to combine urls | 2018-08-31 19:12:58 +01:00 |
| 1b18aa83eb | corrected some small errors and added runner function | 2018-08-31 19:01:35 +01:00 |
| 915def3a5d | rework url sanitiser to use urllib modules, move WebPage object to helpers | 2018-08-31 18:26:25 +01:00 |
| 453331d69d | simplified url qualifier | 2018-08-29 22:27:26 +01:00 |
| 2b812da26a | simplify UrlPoolManager to use a set instead of a dict | 2018-08-29 21:49:15 +01:00 |
| 482d23dd4f | blank __init__.py | 2018-08-28 22:29:11 +01:00 |
| 452de87f35 | change name of pool management object to be more clear | 2018-08-28 22:28:49 +01:00 |
| 73cb883151 | add a list manager object | 2018-08-28 22:28:16 +01:00 |
| 25f8c4c686 | remove testing url with requests and assume that the user is correct | 2018-08-28 17:22:52 +01:00 |
| 79b10798a3 | initial commit of utils | 2018-08-27 19:37:41 +01:00 |