web-scraper

Author	SHA1	Message	Date
simon	75d3756bbc	fix errors discovered by pycodestyle	2018-09-16 16:04:07 +01:00
simon	5262c23281	add flags to README	2018-09-16 15:58:17 +01:00
simon	524f6a45cd	improve documentation	2018-09-16 15:53:47 +01:00
simon	a926090bed	update requirements	2018-09-16 15:44:30 +01:00
simon	91cd988f52	more comments and progress output	2018-09-16 15:26:49 +01:00
simon	f1855f5add	re-order imports because I'm fussy	2018-09-16 09:06:30 +01:00
simon	336517e84a	more documentation and add back some required imports	2018-09-16 09:00:43 +01:00
simon	7bc9fe0679	improved documentation and remove unneeded set	2018-09-16 08:56:44 +01:00
simon	6548f55416	improve documentation	2018-09-15 21:48:50 +01:00
simon	0244435fea	remove unnecessary imports	2018-09-15 21:38:51 +01:00
simon	d6964672b6	commit of working async crawler	2018-09-15 21:30:02 +01:00
simon	3808f72f73	correct semaphore usage	2018-09-14 16:06:17 +01:00
simon	7ebe4855b8	remove unnecessary classes2	2018-09-14 16:02:20 +01:00
simon	db986b0eba	async crawler in a mostly-working state	2018-09-14 16:01:12 +01:00
simon	36e1f7693f	initial foray into asynchronous crawling	2018-09-12 22:54:12 +01:00
simon	8698c21fda	return from WebPage to indicate whether a link was actually crawlable and only actually crawl it if it was	2018-09-12 08:03:08 +01:00
simon	273cf56a3b	add some basic tests	2018-09-11 13:42:15 +01:00
simon	1af26f50f2	added a docstring	2018-09-11 13:42:02 +01:00
simon	c40c5cea50	add async info	2018-09-10 21:29:46 +01:00
simon	a6224f9b6a	updated readme	2018-09-10 20:56:12 +01:00
simon	b64711973f	add new thoughts	2018-09-10 11:58:58 +01:00
simon	9e125dfae0	added comments and docstrings	2018-09-09 22:49:55 +01:00
simon	f16f82fdfb	improved completion message	2018-09-09 22:40:42 +01:00
simon	a523154848	display count of crawled/uncrawled URLs whilst running	2018-09-09 22:35:55 +01:00
simon	9e754a5584	improve handling of gzip/deflated data detection	2018-09-09 11:21:46 +01:00
simon	1b005570ee	implement gzip compression requests and handling	2018-09-09 10:53:09 +01:00
simon	17fa9f93f9	tick off gzip encoding	2018-09-09 10:52:37 +01:00
simon	1e51e10db2	update with changes	2018-09-09 10:22:18 +01:00
simon	225fd8b3ea	update with changes	2018-09-09 10:22:03 +01:00
simon	d686ae0bc4	update with changes	2018-09-09 10:21:45 +01:00
simon	69f5788745	update notes	2018-09-09 10:16:22 +01:00
simon	b5d644a223	various minor improvements to exception handling	2018-09-09 10:16:03 +01:00
simon	6508156aa4	use lxml as the parser and only find links on a page if we've got the source	2018-09-09 10:06:25 +01:00
simon	738ab8e441	adjust robots handling to deal with 404s and enforce a user agent which allows us to initially obtain the user agent	2018-09-09 09:57:16 +01:00
simon	fdd84a8786	manually retrieve robots.txt to ensure we can set the user-agent	2018-09-07 12:40:12 +01:00
simon	ab0ab0a010	add more thoughts	2018-09-07 11:50:53 +01:00
simon	6a1259aa7d	update plans to add gzip encoding	2018-09-06 17:33:10 +01:00
simon	164239b343	more thoughts	2018-09-06 17:31:12 +01:00
simon	ce1f2745c9	update thoughts	2018-09-06 17:30:28 +01:00
simon	e70bdc9ca1	update requirements.txt	2018-09-06 17:25:30 +01:00
simon	d1c1e17f4f	report runtime of script in generated sitemap	2018-09-06 17:20:59 +01:00
simon	816a727d79	ignore generated file	2018-09-06 17:08:56 +01:00
simon	84ab27a75e	render results as HTML	2018-09-06 17:08:26 +01:00
simon	6d9103c154	improved content-type detection	2018-09-06 17:08:12 +01:00
simon	e57a86c60a	only attempt to read html	2018-09-06 16:30:11 +01:00
simon	a3ec9451e3	implement parsing of robots.txt	2018-09-05 18:56:20 +01:00
simon	f2c294ebdb	added new ideas to implement	2018-09-04 15:40:11 +01:00
simon	1b9b207a28	attempt to remove base url with trailing slash (if discovered)	2018-09-04 13:57:52 +01:00
simon	6abe7d68e0	updated notes	2018-09-04 12:51:59 +01:00
simon	7d919039b6	removed unnecessary modules	2018-09-04 10:14:27 +01:00

79 Commits