Controlling the length of a crawl
The length of a crawl, in terms of the number of pages that will be parsed, can be controlled with the CLOSESPIDER_PAGECOUNT setting.
How to do it
We will use the script in 06/07_limit_length.py. The script and scraper are identical to the NASA sitemap crawler, but with the following configuration added to limit the number of pages parsed to 5:
from scrapy.crawler import CrawlerProcess

if __name__ == "__main__":
    # Spider is the NASA sitemap spider class defined earlier in the script
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'CLOSESPIDER_PAGECOUNT': 5
    })
    process.crawl(Spider)
    process.start()
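For reference, the complete script might look roughly like the sketch below. This is only an assumption of how the pieces fit together: the Spider class name, the sitemap URL, and the parse body are taken to match the earlier NASA sitemap recipe and are not shown in this recipe itself; only the CLOSESPIDER_PAGECOUNT setting is new here.

# A minimal sketch of what 06/07_limit_length.py could look like.
# The Spider class, sitemap URL, and parse() body are assumptions based
# on the earlier NASA sitemap recipe.
from scrapy.crawler import CrawlerProcess
from scrapy.spiders import SitemapSpider


class Spider(SitemapSpider):
    name = 'nasa_sitemap_spider'
    allowed_domains = ['nasa.gov']
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']  # assumed sitemap location

    def parse(self, response):
        # Report each page that was fetched
        print(response)


if __name__ == "__main__":
    process = CrawlerProcess({
        'LOG_LEVEL': 'INFO',
        'CLOSESPIDER_PAGECOUNT': 5  # stop after roughly 5 pages have been crawled
    })
    process.crawl(Spider)
    process.start()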
Running the script produces the following output (interspersed throughout the logging output):
<200 https://www.nasa.gov/exploration/systems/sls/multimedia/sls-hardware-beingmoved-on-kamag-transporter.html>
<200 https://www.nasa.gov/exploration/systems/sls/M17-057.html>
<200 https://www.nasa.gov/press-release/nasa-awards-contract-for-center-protective-services-for-glenn-research-center/>
<200 https://www.nasa.gov/centers/marshall/news/news/icymi1708025/>
<200 https://www.nasa.gov/content/oracles-completed-suit-case-flight-series-to-ascension-island/>
<200 https://www.nasa.gov/feature/goddard/2017/asteroid-sample-return-mission-successfully-adjusts-course/>
<200 https://www.nasa.gov/image-feature/jpl/pia21754/juling-crater/>
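Note that seven responses appear even though the limit was set to 5. CLOSESPIDER_PAGECOUNT acts as a soft limit: once the count is reached the spider is asked to close gracefully, so requests that are already scheduled or in flight may still be downloaded. If you prefer to keep the limit with the spider rather than passing it to CrawlerProcess, the same setting can also be declared through the spider's custom_settings attribute. A small sketch, again assuming the NASA sitemap spider from the earlier recipe:

# Equivalent approach: attach the page limit to the spider itself via
# custom_settings instead of passing it to CrawlerProcess.
# The class body mirrors the assumed NASA sitemap spider above.
from scrapy.spiders import SitemapSpider


class LimitedSpider(SitemapSpider):
    name = 'nasa_sitemap_limited'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']  # assumed sitemap location
    custom_settings = {
        'CLOSESPIDER_PAGECOUNT': 5  # close the spider after roughly 5 responses
    }

    def parse(self, response):
        print(response)

Settings declared in custom_settings override the project and process settings for that spider only, which can be convenient when different spiders in the same project need different limits.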