| ab0ab0a010 | add more thoughts | 2018-09-07 11:50:53 +01:00 |
| 6a1259aa7d | update plans to add gzip encoding | 2018-09-06 17:33:10 +01:00 |
| 164239b343 | more thoughts | 2018-09-06 17:31:12 +01:00 |
| ce1f2745c9 | update thoughts | 2018-09-06 17:30:28 +01:00 |
| e70bdc9ca1 | update requirements.txt | 2018-09-06 17:25:30 +01:00 |
| d1c1e17f4f | report runtime of script in generated sitemap | 2018-09-06 17:20:59 +01:00 |
| 816a727d79 | ignore generated file | 2018-09-06 17:08:56 +01:00 |
| 84ab27a75e | render results as HTML | 2018-09-06 17:08:26 +01:00 |
| 6d9103c154 | improved content-type detection | 2018-09-06 17:08:12 +01:00 |
| e57a86c60a | only attempt to read html | 2018-09-06 16:30:11 +01:00 |
| a3ec9451e3 | implement parsing of robots.txt | 2018-09-05 18:56:20 +01:00 |
| f2c294ebdb | added new ideas to implement | 2018-09-04 15:40:11 +01:00 |
| 1b9b207a28 | attempt to remove base url with trailing slash (if discovered) | 2018-09-04 13:57:52 +01:00 |
| 6abe7d68e0 | updated notes | 2018-09-04 12:51:59 +01:00 |
| 7d919039b6 | removed unnecessary modules | 2018-09-04 10:14:27 +01:00 |
| 0726bcccb0 | removed original file | 2018-09-04 09:21:55 +01:00 |
| 05e907ecec | too many changes to make a sensible commit message | 2018-09-04 09:21:26 +01:00 |
| abc628106d | added a docstring to the WebPage object | 2018-08-31 19:18:00 +01:00 |
| c436016e0c | remove unnecessary function | 2018-08-31 19:16:08 +01:00 |
| 03554fde80 | add docstrings | 2018-08-31 19:15:35 +01:00 |
| 759f965e95 | use more explicit names, use urljoin to combine urls | 2018-08-31 19:12:58 +01:00 |
| 0517e5bc56 | crawler now initialises and populates crawled pool with urls it finds | 2018-08-31 19:02:21 +01:00 |
| 1b18aa83eb | corrected some small errors and added runner function | 2018-08-31 19:01:35 +01:00 |
| 5e0d9fd568 | initial commit of crawler skeleton | 2018-08-31 18:26:49 +01:00 |
| 915def3a5d | rework url sanitiser to use urllib modules, move WebPage object to helpers | 2018-08-31 18:26:25 +01:00 |
| 453331d69d | simplified url qualifier | 2018-08-29 22:27:26 +01:00 |
| 2b812da26a | simplify UrlPoolManager to use a set instead of a dict | 2018-08-29 21:49:15 +01:00 |
| fb096b4468 | add scratchpad for notes | 2018-08-28 22:34:05 +01:00 |
| 5d94991167 | start making the scraper an object | 2018-08-28 22:29:36 +01:00 |
| 482d23dd4f | blank __init__.py | 2018-08-28 22:29:11 +01:00 |
| 452de87f35 | change name of pool management object to be more clear | 2018-08-28 22:28:49 +01:00 |
| 73cb883151 | add a list manager object | 2018-08-28 22:28:16 +01:00 |
| 5c933fc5c9 | initial commit of single-page scraper | 2018-08-28 18:29:34 +01:00 |
| 25f8c4c686 | remove testing url with requests and assume that the user is correct | 2018-08-28 17:22:52 +01:00 |
| 0d0438670c | adjusted title | 2018-08-28 09:12:48 +01:00 |
| 8a1fd39dc4 | added pycache dirs | 2018-08-27 19:38:13 +01:00 |
| 79b10798a3 | initial commit of utils | 2018-08-27 19:37:41 +01:00 |
| fb6b976391 | initial commit of utils tests | 2018-08-27 19:36:43 +01:00 |
| a04de7f4de | changed venv name | 2018-08-27 14:28:20 +01:00 |
| 665ec1d7a7 | add readme | 2018-08-23 16:05:24 +01:00 |
| 65fc332925 | ignore venv and vscode dirs | 2018-08-23 16:03:46 +01:00 |
| c6ce63838f | bare script file | 2018-08-23 16:02:09 +01:00 |
| c383fb7ee9 | initial requirements file | 2018-08-23 15:59:18 +01:00 |
| 01a16a998c | initial gitignore | 2018-08-23 15:47:30 +01:00 |