Commit Graph

32 Commits

SHA1 Message Date
1b9b207a28 attempt to remove base url with trailing slash (if discovered) 2018-09-04 13:57:52 +01:00
6abe7d68e0 updated notes 2018-09-04 12:51:59 +01:00
7d919039b6 removed unecessary modules 2018-09-04 10:14:27 +01:00
0726bcccb0 removed original file 2018-09-04 09:21:55 +01:00
05e907ecec too many changes to make a sensible commit message 2018-09-04 09:21:26 +01:00
abc628106d added a docstring to the WebPage object 2018-08-31 19:18:00 +01:00
c436016e0c remove unecessary function 2018-08-31 19:16:08 +01:00
03554fde80 add docstrings 2018-08-31 19:15:35 +01:00
759f965e95 use more explicit names, use urljoin to combine urls 2018-08-31 19:12:58 +01:00
0517e5bc56 crawler now initialises and populates crawled pool with urls it finds 2018-08-31 19:02:21 +01:00
1b18aa83eb corrected some small errors and added runner function 2018-08-31 19:01:35 +01:00
5e0d9fd568 initial commit of crawler skeleton 2018-08-31 18:26:49 +01:00
915def3a5d rework url sanitiser to use urllib modules, move WebPage object to helpers 2018-08-31 18:26:25 +01:00
453331d69d simplified url qualifier 2018-08-29 22:27:26 +01:00
2b812da26a simplify UrlPoolManager to use a set instead of a dict 2018-08-29 21:49:15 +01:00
fb096b4468 add scratchpad for notes 2018-08-28 22:34:05 +01:00
5d94991167 start making the scraper an object 2018-08-28 22:29:36 +01:00
482d23dd4f blank __init__.py 2018-08-28 22:29:11 +01:00
452de87f35 change name of pool management object to be more clear 2018-08-28 22:28:49 +01:00
73cb883151 add a list manager object 2018-08-28 22:28:16 +01:00
5c933fc5c9 initial commit of single-page scraper 2018-08-28 18:29:34 +01:00
25f8c4c686 remove testing url with requests and assume that the user is correct 2018-08-28 17:22:52 +01:00
0d0438670c adjusted title 2018-08-28 09:12:48 +01:00
8a1fd39dc4 added pycache dirs 2018-08-27 19:38:13 +01:00
79b10798a3 initial commit of utils 2018-08-27 19:37:41 +01:00
fb6b976391 initial commit of utils tests 2018-08-27 19:36:43 +01:00
a04de7f4de changed venv name 2018-08-27 14:28:20 +01:00
665ec1d7a7 add readme 2018-08-23 16:05:24 +01:00
65fc332925 ignore venv and vscode dirs 2018-08-23 16:03:46 +01:00
c6ce63838f bare script file 2018-08-23 16:02:09 +01:00
c383fb7ee9 initial requirements file 2018-08-23 15:59:18 +01:00
01a16a998c initial gitignore 2018-08-23 15:47:30 +01:00