Thoughts
- strip hashes and everything following (as they're in-page anchors)
- strip args
- use pop() on the set instead of .remove()
- return False once the set is empty (see the sketch below)
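A rough sketch of those pool rules (pool is the url set from these notes; next_url is a hypothetical helper name):

    from urllib.parse import urlsplit, urlunsplit

    def next_url(pool):
        # return False once the set is empty
        if not pool:
            return False
        # pop() removes and returns an arbitrary url in one step,
        # instead of a separate lookup plus .remove()
        url = pool.pop()
        # strip args and the hash fragment (in-page anchors)
        scheme, netloc, path, _query, _fragment = urlsplit(url)
        return urlunsplit((scheme, netloc, path, "", ""))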
- WebPage.parse_urls() needs to compare startswith to base url (filtering sketch after this list)
- ignore any links which aren't to pages
- better url checking to get bare domain #wontfix
- remove trailing slash from any discovered url
- investigate lxml parser
- remove base url from initial urls with and without trailing slash
- investigate using tldextract to match urls #wontfix
- implement parsing of robots.txt
- investigate gzip encoding
- implement some kind of progress display
- async
- better exception handling
- randomise output filename
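A minimal sketch of that link filtering (keep_url is a hypothetical name, and the non-page extension list is a guess):

    def keep_url(url, base_url):
        # compare startswith to base url: only follow same-site links
        if not url.startswith(base_url):
            return None
        # remove trailing slash from any discovered url
        url = url.rstrip("/")
        # ignore links which aren't to pages
        if url.endswith((".jpg", ".png", ".gif", ".pdf", ".zip", ".css", ".js")):
            return None
        return url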
Async bits
in __main__:

    loop = asyncio.get_event_loop()
    try:
        loop.run_until_complete(main())
    finally:
        loop.close()
- initialises the loop and runs it to completion
- needs to handle errors (try/except/finally; one possible shape sketched below)
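A sketch of what that error handling might look like, assuming main() is the crawler's entry point (the except clause shown is illustrative):

    import asyncio

    if __name__ == "__main__":
        loop = asyncio.get_event_loop()
        try:
            loop.run_until_complete(main())
        except KeyboardInterrupt:
            # allow ctrl-c to stop the crawl without a traceback
            print("interrupted, shutting down")
        finally:
            # always close the loop, even after an error
            loop.close()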
async def run(args=None):
    tasks = []
    for url in pool:
        # appending the bare url was a bug: gather() needs
        # coroutines or futures, so wrap each fetch in a task
        tasks.append(asyncio.ensure_future(get_source(url)))
    # run all the fetches concurrently and wait for completion
    await asyncio.gather(*tasks)
Getting the contents of the page needs to be async too
from urllib.request import urlopen

async def get_source(url):
    # urlopen() is blocking, so it can't be awaited directly;
    # run it in the default thread pool executor instead
    loop = asyncio.get_event_loop()
    response = await loop.run_in_executor(None, urlopen, url)
    return response.read()
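Alternatively, a fetch that is async end to end using aiohttp (a third-party library not mentioned in these notes, so this is a sketch rather than the plan):

    import aiohttp

    async def get_source(url):
        # a session per call keeps the sketch self-contained;
        # a real crawler would share one session across fetches
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.text()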