Crawling with a sitemap

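A sitemap is a protocol that lets a webmaster tell search engines which URLs on a site are available for crawling. Webmasters use it because they actually want search engines to crawl their content, and they want you to be able to find that content, at least through a search engine. You can use the same information to your own advantage.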

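A sitemap lists the URLs on a site and lets the webmaster specify additional information about each URL:

When it was last updated
How often the content changes
How important the URL is relative to the other URLs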

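Sitemaps are useful for websites where:

Some areas of the site are not available through the browsable interface; that is, you cannot reach those pages by following links
Ajax, Silverlight, or Flash content is used, which search engines do not normally process
The site is very large, so web crawlers may overlook new or recently updated content
A large number of pages are isolated or not well linked together
There are few external links to the site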

A sitemap file has the following structure:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
    <url>
        <loc>http://example.com/</loc>
        <lastmod>2006-11-18</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>

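Each URL on the site is represented by a <url></url> tag, and all of these are wrapped in an outer <urlset></urlset> tag. There is always a <loc></loc> tag specifying the URL. The other three tags are optional.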
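
As a rough illustration of this structure, the following sketch reads a small urlset with Python's standard xml.etree.ElementTree module and keeps only the tags that are present for each URL. The function name read_urlset and the second sample URL are assumptions made for this example; this is not the parser used in the recipe.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol (taken from the example above).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Sample data: the first <url> mirrors the example above, the second
# (hypothetical) one carries only the mandatory <loc> tag.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>http://example.com/</loc>
        <lastmod>2006-11-18</lastmod>
        <changefreq>daily</changefreq>
        <priority>0.8</priority>
    </url>
    <url>
        <loc>http://example.com/about</loc>
    </url>
</urlset>"""


def read_urlset(xml_text):
    """Return one dict per <url>, including only the tags that are present."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", NS):
        entry = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            value = url.findtext("sm:" + field, namespaces=NS)
            if value is not None:
                entry[field] = value.strip()
        entries.append(entry)
    return entries


if __name__ == "__main__":
    for entry in read_urlset(SAMPLE):
        print(entry)
```

Because the optional tags are simply skipped when missing, code that consumes these dictionaries should use entry.get("lastmod") rather than indexing directly.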

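Sitemap files can be very large, so they are often split into multiple files that are then referenced by a single sitemap index file. That file has the following format: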

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.example.com/sitemap1.xml.gz</loc>
        <lastmod>2014-10-01T18:23:17+00:00</lastmod>
    </sitemap>
</sitemapindex>
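
One way to tell an index file apart from an ordinary sitemap is to look at the root element. The sketch below does this with the standard library; the names sitemap_kind and child_sitemaps are hypothetical helpers for illustration, not part of the recipe's sitemap.py.

```python
import xml.etree.ElementTree as ET

# ElementTree writes namespaced tags as "{namespace}tag".
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <sitemap>
        <loc>http://www.example.com/sitemap1.xml.gz</loc>
        <lastmod>2014-10-01T18:23:17+00:00</lastmod>
    </sitemap>
</sitemapindex>"""


def sitemap_kind(xml_text):
    """Return 'index', 'urlset', or 'unknown' based on the root element."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    if root.tag == SITEMAP_NS + "sitemapindex":
        return "index"
    if root.tag == SITEMAP_NS + "urlset":
        return "urlset"
    return "unknown"


def child_sitemaps(xml_text):
    """Return the <loc> of every <sitemap> entry in an index file."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    locations = []
    for entry in root.findall(SITEMAP_NS + "sitemap"):
        loc = entry.findtext(SITEMAP_NS + "loc")
        if loc:
            locations.append(loc.strip())
    return locations


if __name__ == "__main__":
    print(sitemap_kind(SAMPLE_INDEX))
    print(child_sitemaps(SAMPLE_INDEX))
```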

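In most cases, the sitemap.xml file is found at the root of the domain. For nasa.gov, for example, it is https://www.nasa.gov/sitemap.xml. Note, however, that this is only a convention, not a requirement, and different sites may keep one or more sitemaps at different locations.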
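
As a small illustration of this convention, the following sketch builds the conventional sitemap URL for any page on a site using urllib.parse.urljoin. The helper name default_sitemap_url is hypothetical, and the result is only a guess that still has to be fetched and checked.

```python
from urllib.parse import urljoin


def default_sitemap_url(page_url):
    """Guess the conventional sitemap location at the root of the page's domain."""
    # A path starting with "/" replaces the whole path of the base URL.
    return urljoin(page_url, "/sitemap.xml")


if __name__ == "__main__":
    print(default_sitemap_url("https://www.nasa.gov/centers/ames/index.html"))
```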

A site's sitemaps may also be listed in its robots.txt file. For example, the robots.txt file for microsoft.com ends with the following:

Sitemap: https://www.microsoft.com/en-us/explore/msft_sitemap_index.xml
Sitemap: https://www.microsoft.com/learning/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/licensing/sitemap.xml
Sitemap: https://www.microsoft.com/en-us/legal/sitemap.xml
Sitemap: https://www.microsoft.com/filedata/sitemaps/RW5xN8
Sitemap: https://www.microsoft.com/store/collections.xml
Sitemap: https://www.microsoft.com/store/productdetailpages.index.xml

So, to get the sitemaps for microsoft.com, we would first need to read the robots.txt file and extract that information.
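
Extracting these entries can be sketched in a few lines of plain string handling. The function below takes the text of a robots.txt file and returns the URLs from its Sitemap: lines; the name sitemaps_from_robots and the sample text are assumptions made for this example, and fetching the file is left out so the sketch runs offline.

```python
def sitemaps_from_robots(robots_txt):
    """Return the URLs listed on 'Sitemap:' lines of a robots.txt file."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments, then split "Field: value" on the first colon.
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    sample = (
        "User-agent: *\n"
        "Disallow: /private/\n"
        "Sitemap: https://www.microsoft.com/learning/sitemap.xml\n"
        "Sitemap: https://www.microsoft.com/store/collections.xml\n"
    )
    for url in sitemaps_from_robots(sample):
        print(url)
```

On Python 3.8 and later, the standard library's urllib.robotparser.RobotFileParser also exposes a site_maps() method that returns the same information after the file has been read.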

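Let's now look at parsing a sitemap.

Getting ready

Everything you need is in the 05/02_sitemap.py script, along with the sitemap.py file in the same folder. The sitemap.py file implements a basic sitemap parser that we will use from the main script. For this example, we will fetch the sitemap data for nasa.gov.

How to do it

First, run the 05/02_sitemap.py file. Make sure the associated sitemap.py file is in the same directory or on your path. After a few seconds, you will get output similar to the following: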

Found 35511 urls
{'lastmod': '2017-10-11T18:23Z', 'loc': 'http://www.nasa.gov/centers/marshall/history/this-week-in-nasa-history-apollo-7-launches-oct-11-1968.html', 'tag': 'url'}
{'lastmod': '2017-10-11T18:22Z', 'loc': 'http://www.nasa.gov/feature/researchers-develop-new-tool-to-evaluate-icephobic-materials', 'tag': 'url'}
{'lastmod': '2017-10-11T17:38Z', 'loc': 'http://www.nasa.gov/centers/ames/entry-systems-vehicle-development/roster.html', 'tag': 'url'}
{'lastmod': '2017-10-11T17:38Z', 'loc': 'http://www.nasa.gov/centers/ames/entry-systems-vehicle-development/about.html', 'tag': 'url'}
{'lastmod': '2017-10-11T17:22Z', 'loc': 'http://www.nasa.gov/centers/ames/earthscience/programs/MMS/instruments', 'tag': 'url'}
{'lastmod': '2017-10-11T18:15Z', 'loc': 'http://www.nasa.gov/centers/ames/earthscience/programs/MMS/onepager', 'tag': 'url'}
{'lastmod': '2017-10-11T17:10Z', 'loc': 'http://www.nasa.gov/centers/ames/earthscience/programs/MMS', 'tag': 'url'}
{'lastmod': '2017-10-11T17:53Z', 'loc': 'http://www.nasa.gov/feature/goddard/2017/nasa-s-james-webb-space-telescope-and-the-big-bang-a-short-qa-with-nobel-laureate-dr-john', 'tag': 'url'}
{'lastmod': '2017-10-11T17:38Z', 'loc': 'http://www.nasa.gov/centers/ames/entry-systems-vehicle-development/index.html', 'tag': 'url'}
{'lastmod': '2017-10-11T15:21Z', 'loc': 'http://www.nasa.gov/feature/mark-s-geyer-acting-deputy-associate-administrator-for-technical-human-explorations-and-operations', 'tag': 'url'}

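The program found 35,511 URLs across all of the nasa.gov sitemaps! The code prints only the first 10, since the full output would be very large. Using this information to start a crawl of all of these URLs would certainly take a long time!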

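But this is also the beauty of a sitemap. Many, if not all, of these results have a lastmod tag that tells you when the content at that URL was last modified. If you are implementing a polite crawler for nasa.gov, you would keep these URLs and their timestamps in a database, and then check before crawling a URL whether its content has changed, skipping it if it has not.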
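
The bookkeeping for such a polite crawl might look like the following sketch, which uses the standard sqlite3 module. The table name seen and the helpers open_store, needs_crawl, and mark_crawled are all assumptions made for this example; the items are dictionaries with loc and lastmod keys, like the ones printed above.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small database of URLs and their last seen lastmod."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seen (loc TEXT PRIMARY KEY, lastmod TEXT)"
    )
    return conn


def needs_crawl(conn, item):
    """Return True if the URL is new or its lastmod differs from the stored one."""
    row = conn.execute(
        "SELECT lastmod FROM seen WHERE loc = ?", (item["loc"],)
    ).fetchone()
    if row is None:
        return True  # never crawled before
    lastmod = item.get("lastmod")
    if lastmod is None:
        return True  # no timestamp available, so we cannot tell; crawl again
    return row[0] != lastmod


def mark_crawled(conn, item):
    """Record the lastmod value that was current when the URL was crawled."""
    conn.execute(
        "INSERT OR REPLACE INTO seen (loc, lastmod) VALUES (?, ?)",
        (item["loc"], item.get("lastmod")),
    )
    conn.commit()


if __name__ == "__main__":
    store = open_store()
    page = {"loc": "http://www.nasa.gov/socialmedia", "lastmod": "2017-09-29T21:47Z"}
    print(needs_crawl(store, page))
    mark_crawled(store, page)
    print(needs_crawl(store, page))
```

Comparing the timestamps as plain strings is enough to detect a change; parsing them into datetime objects would only be needed to order them.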

Now let's look at how this actually works.

How it works

The recipe works as follows:

The script starts by calling get_sitemap():

map = sitemap.get_sitemap("https://www.nasa.gov/sitemap.xml")

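This function is given the URL of a sitemap.xml file (or of any other sitemap file, as long as it is not gzipped). The implementation simply fetches the content at the URL and returns it: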

import requests

def get_sitemap(url):
    # Fetch the sitemap and return its text, or None if the request fails
    get_url = requests.get(url)

    if get_url.status_code == 200:
        return get_url.text
    else:
        print('Unable to fetch sitemap: %s.' % url)
        return None

Most of the work is done by passing that content to parse_sitemap(). For nasa.gov, this sitemap contains the following, which is a sitemap index file:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="//www.nasa.gov/sitemap.xsl"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<sitemap><loc>http://www.nasa.gov/sitemap-1.xml</loc><lastmod>2017-10-11T19:30Z</lastmod></sitemap>
<sitemap><loc>http://www.nasa.gov/sitemap-2.xml</loc><lastmod>2017-10-11T19:30Z</lastmod></sitemap>
<sitemap><loc>http://www.nasa.gov/sitemap-3.xml</loc><lastmod>2017-10-11T19:30Z</lastmod></sitemap>
<sitemap><loc>http://www.nasa.gov/sitemap-4.xml</loc><lastmod>2017-10-11T19:30Z</lastmod></sitemap>
</sitemapindex>

parse_sitemap() starts with a call to process_sitemap():

def parse_sitemap(s):
    sitemap = process_sitemap(s)

process_sitemap() returns a list of Python dictionaries with loc, lastmod, changeFreq, and priority key-value pairs:

from bs4 import BeautifulSoup

def process_sitemap(s):
    # The "lxml" HTML parser lowercases tag names, so <changefreq> has to
    # be looked up in lowercase below
    soup = BeautifulSoup(s, "lxml")
    result = []

    for loc in soup.find_all('loc'):
        item = {}
        item['loc'] = loc.text
        # 'sitemap' for index entries, 'url' for page entries
        item['tag'] = loc.parent.name
        if loc.parent.lastmod is not None:
            item['lastmod'] = loc.parent.lastmod.text
        if loc.parent.changefreq is not None:
            item['changeFreq'] = loc.parent.changefreq.text
        if loc.parent.priority is not None:
            item['priority'] = loc.parent.priority.text
        result.append(item)

    return result

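This is done by parsing the sitemap with BeautifulSoup and lxml. The loc key is always set, and lastmod, changeFreq, and priority are set if the corresponding XML tag is present. The tag key simply records whether the entry came from a <sitemap> tag or a <url> tag (a <loc> tag can appear in either). parse_sitemap() then goes on to process these results one by one: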
```
while sitemap:
    candidate = sitemap.pop()

    if is_sub_sitemap(candidate):
        sub_sitemap = get_sitemap(candidate['loc'])
        for i in process_sitemap(sub_sitemap):
            sitemap.append(i)
    else:
        result.append(candidate)
```

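Each item is examined. If it comes from a sitemap index file (the URL ends in .xml and its tag is sitemap), then we need to read that .xml file and parse its content, and the results are placed in the list of items still to be processed. In this example, four sitemap files are identified, and each is read, processed, and parsed, with its URLs added to the result. To illustrate some of this content, here are the first few lines of sitemap-1.xml: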

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="//www.nasa.gov/sitemap.xsl"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.nasa.gov/</loc><changefreq>daily</changefreq><priority>1.0</priority></url>
<url><loc>http://www.nasa.gov/connect/apps.html</loc><lastmod>2017-08-14T22:15Z</lastmod><changefreq>yearly</changefreq></url>
<url><loc>http://www.nasa.gov/socialmedia</loc><lastmod>2017-09-29T21:47Z</lastmod><changefreq>monthly</changefreq></url>
<url><loc>http://www.nasa.gov/multimedia/imagegallery/iotd.html</loc><lastmod>2017-08-21T22:00Z</lastmod><changefreq>yearly</changefreq></url>
<url><loc>http://www.nasa.gov/archive/archive/about/career/index.html</loc><lastmod>2017-08-04T02:31Z</lastmod><changefreq>yearly</changefreq></url>

Overall, this one sitemap has 11,006 lines, so roughly 11,000 URLs! In total, as reported earlier, the four sitemaps contain 35,511 URLs.
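
The sub-sitemap check described earlier (the URL ends in .xml and the tag is sitemap) and the loop around it can be sketched as follows. The real is_sub_sitemap() in sitemap.py is not shown in this recipe, so this version is an assumption based on that description; expand and its fetch_items parameter are hypothetical names, introduced so that the sketch can run without network access.

```python
def is_sub_sitemap(item):
    """Guess whether an item points at another sitemap file (assumed logic)."""
    return item.get("tag") == "sitemap" and item["loc"].endswith(".xml")


def expand(items, fetch_items):
    """Flatten index entries into page entries.

    fetch_items(url) must return the list of item dicts found at that URL,
    for example by downloading the file and passing it to a parser.
    """
    pending = list(items)
    result = []
    while pending:
        candidate = pending.pop()
        if is_sub_sitemap(candidate):
            # Queue the entries of the referenced sitemap for processing.
            pending.extend(fetch_items(candidate["loc"]))
        else:
            result.append(candidate)
    return result


if __name__ == "__main__":
    # Stand-in for the network: maps a sitemap URL to its parsed entries.
    fake_site = {
        "http://www.nasa.gov/sitemap-1.xml": [
            {"loc": "http://www.nasa.gov/", "tag": "url"},
        ],
        "http://www.nasa.gov/sitemap-2.xml": [
            {"loc": "http://www.nasa.gov/socialmedia", "tag": "url"},
        ],
    }
    index = [{"loc": url, "tag": "sitemap"} for url in fake_site]
    for page in expand(index, lambda url: fake_site[url]):
        print(page["loc"])
```

Checking the tag as well as the extension matters: a page URL that happens to end in .xml but came from a <url> entry is treated as a page, not as another sitemap.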

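There's more

Sitemap files may also be compressed and end with a .gz extension. This is because they can contain many URLs, and compression saves a lot of space. Although the code we used does not handle gzipped sitemap files, this is easy to add using functions from the gzip library.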
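
A minimal sketch of that addition is shown below. It works on the raw bytes of a downloaded file (with requests, that would be the response's content attribute rather than text) and decompresses them only when they start with the gzip magic bytes. The function name decode_sitemap and the UTF-8 assumption are choices made for this example.

```python
import gzip


def decode_sitemap(raw_bytes):
    """Return the sitemap as text, decompressing it first if it is gzipped."""
    # Every gzip stream starts with the two magic bytes 0x1f 0x8b.
    if raw_bytes[:2] == b"\x1f\x8b":
        raw_bytes = gzip.decompress(raw_bytes)
    # Assumes the sitemap is UTF-8 encoded, as in the examples in this recipe.
    return raw_bytes.decode("utf-8")


if __name__ == "__main__":
    plain = b"<urlset><url><loc>http://example.com/</loc></url></urlset>"
    packed = gzip.compress(plain)
    print(decode_sitemap(packed) == decode_sitemap(plain))
```

Checking the content rather than the .gz extension keeps the function safe to call on files that have already been decompressed.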

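Scrapy also provides tools for starting a crawl from a sitemap. One of these is SitemapSpider, a specialization of the Spider class. This class parses the sitemap for you and then starts following its URLs. To demonstrate, the script 05/03_sitemap_scrapy.py starts the crawl at the nasa.gov top-level sitemap index: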

import scrapy
from scrapy.crawler import CrawlerProcess

class Spider(scrapy.spiders.SitemapSpider):
    name = 'spider'
    sitemap_urls = ['https://www.nasa.gov/sitemap.xml']

    def parse(self, response):
        print("Parsing: ", response)

if __name__ == "__main__":
    process = CrawlerProcess({
        'DOWNLOAD_DELAY': 0,
        'LOG_LEVEL': 'DEBUG'
    })
    process.crawl(Spider)
    process.start()

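When you run this, there will be a lot of output, as the spider starts crawling all of the 30,000-plus URLs. Early in the output, you will see lines such as the following: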

2017-10-11 20:34:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nasa.gov/sitemap.xml> (referer: None)
2017-10-11 20:34:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.nasa.gov/sitemap-4.xml> from <GET http://www.nasa.gov/sitemap-4.xml>
2017-10-11 20:34:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.nasa.gov/sitemap-2.xml> from <GET http://www.nasa.gov/sitemap-2.xml>
2017-10-11 20:34:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.nasa.gov/sitemap-3.xml> from <GET http://www.nasa.gov/sitemap-3.xml>
2017-10-11 20:34:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.nasa.gov/sitemap-1.xml> from <GET http://www.nasa.gov/sitemap-1.xml>
2017-10-11 20:34:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.nasa.gov/sitemap-4.xml> (referer: None)

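Scrapy has found all of the sitemaps and read their content. Soon afterwards, you will start to see a number of redirects, along with notifications that certain pages are being parsed: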

2017-10-11 20:34:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.nasa.gov/image-feature/jpl/pia21629/neptune-from-saturn/> from <GET https://www.nasa.gov/image-feature/jpl/pia21629/neptune-from-saturn>
2017-10-11 20:34:30 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://www.nasa.gov/centers/ames/earthscience/members/nasaearthexchange/Ramakrishna_Nemani/> from <GET https://www.nasa.gov/centers/ames/earthscience/members/nasaearthexchange/Ramakrishna_Nemani>
Parsing: <200 https://www.nasa.gov/exploration/systems/sls/multimedia/sls-hardware-beingmoved-on-kamag-transporter.html>
Parsing: <200 https://www.nasa.gov/exploration/systems/sls/M17-057.html>