Querying the DOM with XPath and lxml

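XPath is a query language for selecting nodes from an XML document, and it is a must-learn for anyone performing web scraping. Compared with other model-based tools, XPath offers its users a number of benefits: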

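It can easily navigate the DOM tree
It is more sophisticated and powerful than other selectors such as CSS selectors and regular expressions
It has a large library of built-in functions (more than 200 in later versions of the standard; lxml implements XPath 1.0, which has a smaller core set) and can be extended with custom functions
It is widely supported by parsing libraries and scraping platforms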
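
As a rough illustration of built-in and custom functions, the following sketch calls the built-in count() function and registers a custom function through lxml's FunctionNamespace. The function name lower and the two-row table are assumptions made up for this example; they are not part of the recipe's planets.html page.

```python
from lxml import etree, html


def lower(context, text):
    """Hypothetical extension function: lowercase a string passed in from XPath."""
    return text.lower()


# Register the function in the default (empty) function namespace.
ns = etree.FunctionNamespace(None)
ns["lower"] = lower

# A tiny stand-in table, invented for this example.
doc = html.fromstring(
    '<table>'
    '<tr name="Earth"><td>5.97</td></tr>'
    '<tr name="Mars"><td>0.642</td></tr>'
    '</table>'
)

# Built-in function: count() returns a number (a Python float).
row_count = doc.xpath("count(.//tr)")

# Custom function used inside a predicate for a case-insensitive match.
earth_rows = doc.xpath(".//tr[lower(string(@name)) = 'earth']")

print(row_count, [tr.get("name") for tr in earth_rows])
```

Registering in the default namespace keeps the expression short, at the cost of making the function visible to every XPath evaluation in the process.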

The XPath data model defines seven kinds of nodes (we have seen some of them previously):

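Root node (the top-level parent node)
Element nodes (<a>..</a>)
Attribute nodes (href="example.html")
Text nodes ("this is a text")
Comment nodes (<!-- a comment -->)
Namespace nodes
Processing instruction nodes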
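
To make a few of these node kinds concrete, here is a minimal sketch that selects element, attribute, text, and comment nodes from a small fragment. The fragment is invented for this example and simply reuses the snippets from the list above.

```python
from lxml import html

# A small fragment made up for illustration.
doc = html.fromstring(
    '<div><!-- a comment --><a href="example.html">this is a text</a></div>'
)

elements = doc.xpath(".//a")          # element nodes
hrefs = doc.xpath(".//a/@href")       # attribute nodes, returned as strings
texts = doc.xpath(".//a/text()")      # text nodes, returned as strings
comments = doc.xpath(".//comment()")  # comment nodes

print(elements[0].tag, hrefs, texts, comments[0].text)
```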

XPath expressions can return different data types:

Strings
Booleans
Numbers
Node-sets (probably the most common case)
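
The sketch below shows how each of these result types surfaces in Python when the expression is evaluated with lxml. The three-row table is an assumption standing in for the recipe's page.

```python
from lxml import html

# Stand-in table, invented for this example.
doc = html.fromstring(
    '<table>'
    '<tr class="planet" name="Mercury"><td>0.330</td></tr>'
    '<tr class="planet" name="Venus"><td>4.87</td></tr>'
    '<tr class="planet" name="Earth"><td>5.97</td></tr>'
    '</table>'
)

rows = doc.xpath(".//tr[@class='planet']")              # node-set -> list of elements
n_rows = doc.xpath("count(.//tr[@class='planet'])")     # number   -> float
has_earth = doc.xpath("boolean(.//tr[@name='Earth'])")  # boolean  -> bool
first = doc.xpath("string(.//tr[1]/@name)")             # string   -> str

print(type(rows), n_rows, has_earth, first)
```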

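An (XPath) axis defines a node-set relative to the current node. A total of 13 axes are defined in XPath to make it easy to search for different parts of the tree from the current context node or from the root node.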
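
As a brief sketch of what axes look like in practice, the following evaluates four of them relative to a single row. The table content is an assumption made up for this example.

```python
from lxml import html

# Stand-in table, invented for this example.
doc = html.fromstring(
    '<table id="planetsTable">'
    '<tr name="Venus"><td>4.87</td></tr>'
    '<tr name="Earth"><td>5.97</td></tr>'
    '<tr name="Mars"><td>0.642</td></tr>'
    '</table>'
)

# Use the Earth row as the context node for the relative queries.
earth = doc.xpath(".//tr[@name='Earth']")[0]

after = earth.xpath("following-sibling::tr/@name")   # rows after Earth
before = earth.xpath("preceding-sibling::tr/@name")  # rows before Earth
table_id = earth.xpath("ancestor::table/@id")        # enclosing table
cells = earth.xpath("child::td/text()")              # cells of this row

print(after, before, table_id, cells)
```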

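lxml is a Python wrapper on top of the libxml2 XML parsing library, which is written in C. The C implementation makes it faster than Beautiful Soup, but also harder to install on some computers. The latest installation instructions are available at http://lxml.de/installation.html.

lxml supports XPath, which makes it considerably easier to manage complex XML and HTML documents. We will examine several techniques for using lxml and XPath together, and how to use them to navigate the DOM and access data.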

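Getting ready

The code for these snippets is in 02/03_lxml_and_xpath.py in case you want to save some typing. We will begin by importing html from lxml, as well as requests, and then load the page.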

In [1]: from lxml import html
...: import requests
...: page_html = requests.get("http://localhost:8080/planets.html").text

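By this point, lxml should already be installed as a dependency of other installs. If you get errors, install it with pip install lxml.

How to do it

The first thing we do is load the HTML into an lxml "etree". This is lxml's representation of the DOM.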

In [2]: tree = html.fromstring(page_html)

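The tree variable is now an lxml representation of the DOM that models the HTML content. Let's now examine how to use it and XPath to select various elements from the document.

Our first XPath example will be to find all the <tr> elements below the <table> element.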

In [3]: [tr for tr in tree.xpath("/html/body/div/table/tr")]
Out[3]:
[<Element tr at 0x10cfd1408>,
<Element tr at 0x10cfd12c8>,
<Element tr at 0x10cfd1728>,
<Element tr at 0x10cfd16d8>,
<Element tr at 0x10cfd1458>,
<Element tr at 0x10cfd1868>,
<Element tr at 0x10cfd1318>,
<Element tr at 0x10cfd14a8>,
<Element tr at 0x10cfd10e8>,
<Element tr at 0x10cfd1778>,
<Element tr at 0x10cfd1638>]

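This XPath navigates by tag name from the root of the document down to the <tr> elements. The example looks similar to the property notation from Beautiful Soup, but ultimately it is significantly more expressive. Notice one difference in the result: all the <tr> elements were returned, not just the first. In fact, the tag at each level of this path returns multiple items if they are available. If there were multiple <div> elements just below <body>, then the search for table/tr would be executed on all of those <div> elements.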
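
The behavior can be reproduced without the local server. The sketch below uses a cut-down page that is only a hypothetical stand-in for planets.html, with two <div> elements that each hold a table.

```python
from lxml import html

# Hypothetical, much smaller stand-in for planets.html.
page = """
<html><body>
  <div id="planets"><table>
    <tr id="planetHeader"><th>Name</th></tr>
    <tr id="planet1" class="planet"><td>Mercury</td></tr>
  </table></div>
  <div id="footer"><table>
    <tr id="footerRow"><td>Footer</td></tr>
  </table></div>
</body></html>
"""

tree = html.fromstring(page)

# The div step matches both <div> elements, so rows from both tables come back.
ids = [tr.get("id") for tr in tree.xpath("/html/body/div/table/tr")]
print(ids)
```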

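Each item in the result is an lxml element object. The following uses etree.tostring() to get the HTML associated with each element (returned as bytes, because an encoding has been applied):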

In [4]: from lxml import etree
...: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div/table/tr")]
Out[4]:
[b'<tr id="planetHeader">\n <th>&#',
b'<tr id="planet1" class="planet" name="Mercury">&#1',
b'<tr id="planet2" class="planet" name="Venus">',
b'<tr id="planet3" class="planet" name="Earth">',
b'<tr id="planet4" class="planet" name="Mars">\n',
b'<tr id="planet5" class="planet" name="Jupiter">&#1',
b'<tr id="planet6" class="planet" name="Saturn">&#13',
b'<tr id="planet7" class="planet" name="Uranus">&#13',
b'<tr id="planet8" class="planet" name="Neptune">&#1',
b'<tr id="planet9" class="planet" name="Pluto">',
b'<tr id="footerRow">\n <td>']

Now let's look at how we can use XPath to select only the <tr> elements that are planets.

In [5]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div/table/tr[@class='planet']")]
Out[5]:
[b'<tr id="planet1" class="planet" name="Mercury">&#1',
b'<tr id="planet2" class="planet" name="Venus">',
b'<tr id="planet3" class="planet" name="Earth">',
b'<tr id="planet4" class="planet" name="Mars">\n',
b'<tr id="planet5" class="planet" name="Jupiter">&#1',
b'<tr id="planet6" class="planet" name="Saturn">&#13',
b'<tr id="planet7" class="planet" name="Uranus">&#13',
b'<tr id="planet8" class="planet" name="Neptune">&#1',
b'<tr id="planet9" class="planet" name="Pluto">']

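The use of [] next to a tag states that we want to select based on some criteria of the current element. The @ states that we want to examine an attribute of the tag, and in this case we want to select the tags whose class attribute is equal to "planet".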
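
Attribute predicates can test more than simple equality. The following sketch, on a made-up table, contrasts an exact match with a check for the mere presence of an attribute and with a commonly used idiom for matching one class among several. The "planet rocky" class value is an assumption added for this example.

```python
from lxml import html

# Stand-in table; the second planet row carries two classes on purpose.
doc = html.fromstring(
    '<table>'
    '<tr id="planetHeader"><th>Name</th></tr>'
    '<tr id="planet3" class="planet" name="Earth"><td>Earth</td></tr>'
    '<tr id="planet4" class="planet rocky" name="Mars"><td>Mars</td></tr>'
    '</table>'
)

# Exact comparison against the whole attribute value.
exact = doc.xpath(".//tr[@class='planet']/@name")

# Presence test: any row that has a name attribute at all.
named = doc.xpath(".//tr[@name]/@id")

# Token match: rows whose class list contains the word "planet".
token = doc.xpath(
    ".//tr[contains(concat(' ', normalize-space(@class), ' '), ' planet ')]/@name"
)

print(exact, named, token)
```

The exact comparison misses the Mars row because its class attribute is "planet rocky", which is why the token idiom is often preferred on real pages.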

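There is another point to be made about the query that returned 11 <tr> rows. As stated earlier, the XPath runs the navigation on all the nodes found at each level. There are two tables in this document, each a child of a different <div>, and both of those <div> elements are children of the <body> element. The row with id="planetHeader" came from our desired target table; the other, with id="footerRow", came from the second table.

Previously we solved this by selecting the <tr> elements with class="planet", but there are other ways worth a brief mention. The first is that we can also use [] to specify a specific element at each section of the XPath, as if they were arrays. Take the following: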

In [6]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div[1]/table/tr")]
Out[6]:
[b'<tr id="planetHeader">\n <th>&#',
b'<tr id="planet1" class="planet" name="Mercury">&#1',
b'<tr id="planet2" class="planet" name="Venus">',
b'<tr id="planet3" class="planet" name="Earth">',
b'<tr id="planet4" class="planet" name="Mars">\n',
b'<tr id="planet5" class="planet" name="Jupiter">&#1',
b'<tr id="planet6" class="planet" name="Saturn">&#13',
b'<tr id="planet7" class="planet" name="Uranus">&#13',
b'<tr id="planet8" class="planet" name="Neptune">&#1',
b'<tr id="planet9" class="planet" name="Pluto">']

Arrays in XPath start at 1 instead of 0 (a common source of error). This selected the first <div>. Changing the index to [2] selects the second <div>, and hence only the second <table>.

In [7]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div[2]/table/tr")]
Out[7]: [b'<tr id="footerRow">\n <td>']
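
The 1-based indexing is easy to check on a tiny document. This sketch uses two empty <div> elements whose ids are assumptions chosen to mirror the recipe's page, and also shows last() and what an index of 0 returns.

```python
from lxml import html

# Minimal stand-in document with two sibling <div> elements.
tree = html.fromstring(
    '<html><body><div id="planets"></div><div id="footer"></div></body></html>'
)

first = tree.xpath("/html/body/div[1]/@id")      # first <div>
second = tree.xpath("/html/body/div[2]/@id")     # second <div>
last = tree.xpath("/html/body/div[last()]/@id")  # last <div>, whatever the count
zero = tree.xpath("/html/body/div[0]/@id")       # no node has position 0

print(first, second, last, zero)
```

An index of 0 does not raise an error; it silently matches nothing, which is part of why this mistake is easy to miss.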

The first <div> in this document also has an id attribute:

<div id="planets">

This can be used to select this <div>:

In [8]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div[@id='planets']/table/tr")]
Out[8]:
[b'<tr id="planetHeader">\n <th>&#',
b'<tr id="planet1" class="planet" name="Mercury">&#1',
b'<tr id="planet2" class="planet" name="Venus">',
b'<tr id="planet3" class="planet" name="Earth">',
b'<tr id="planet4" class="planet" name="Mars">\n',
b'<tr id="planet5" class="planet" name="Jupiter">&#1',
b'<tr id="planet6" class="planet" name="Saturn">&#13',
b'<tr id="planet7" class="planet" name="Uranus">&#13',
b'<tr id="planet8" class="planet" name="Neptune">&#1',
b'<tr id="planet9" class="planet" name="Pluto">']

Earlier we selected the planet rows based upon the value of the class attribute. We can also exclude rows:

In [9]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div[@id='planets']/table/tr[@id!='planetHeader']")]
Out[9]:
[b'<tr id="planet1" class="planet" name="Mercury">&#1',
b'<tr id="planet2" class="planet" name="Venus">',
b'<tr id="planet3" class="planet" name="Earth">',
b'<tr id="planet4" class="planet" name="Mars">\n',
b'<tr id="planet5" class="planet" name="Jupiter">&#1',
b'<tr id="planet6" class="planet" name="Saturn">&#13',
b'<tr id="planet7" class="planet" name="Uranus">&#13',
b'<tr id="planet8" class="planet" name="Neptune">&#1',
b'<tr id="planet9" class="planet" name="Pluto">']

Suppose that the planet rows did not have attributes (nor the header row). We could then do this by position, skipping the first row:

In [10]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div[@id='planets']/table/tr[position() > 1]")]
Out[10]:
[b'<tr id="planet1" class="planet" name="Mercury">&#1',
b'<tr id="planet2" class="planet" name="Venus">',
b'<tr id="planet3" class="planet" name="Earth">',
b'<tr id="planet4" class="planet" name="Mars">\n',
b'<tr id="planet5" class="planet" name="Jupiter">&#1',
b'<tr id="planet6" class="planet" name="Saturn">&#13',
b'<tr id="planet7" class="planet" name="Uranus">&#13',
b'<tr id="planet8" class="planet" name="Neptune">&#1',
b'<tr id="planet9" class="planet" name="Pluto">']
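
Positional predicates can be combined. As a small sketch on an invented attribute-free table with a header and a trailing summary row, position() and last() together trim both ends.

```python
from lxml import html

# Stand-in table without any attributes; the summary row is an assumption.
doc = html.fromstring(
    '<table>'
    '<tr><th>Name</th></tr>'
    '<tr><td>Mercury</td></tr>'
    '<tr><td>Venus</td></tr>'
    '<tr><td>Total: 2</td></tr>'
    '</table>'
)

# Skip only the header row.
skip_header = doc.xpath("./tr[position() > 1]/td/text()")

# Skip the header row and the final summary row.
body_only = doc.xpath("./tr[position() > 1 and position() < last()]/td/text()")

print(skip_header, body_only)
```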

It is possible to navigate to the parent of a node using parent::*:

In [11]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div/table/tr/parent::*")]
Out[11]:
[b'<table id="planetsTable" border="1">\n ',
b'<table id="footerTable">\n <tr id="']

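This returns two parents because, remember, this XPath returns the rows from both tables, so the parents of all those rows are found. The * is a wildcard that represents a parent tag with any name. In this case both parents are tables, but in general the result can be any number of HTML element types. The following has the same result, but if the two parents were different HTML tags, it would return only the <table> elements.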

In [12]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div/table/tr/parent::table")]
Out[12]:
[b'<table id="planetsTable" border="1">\n ',
b'<table id="footerTable">\n <tr id="']

It is also possible to specify a specific parent by position or attribute. The following selects the parent with id="footerTable":

In [13]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div/table/tr/parent::table[@id='footerTable']")]
Out[13]: [b'<table id="footerTable">\n <tr id="']

A shortcut for the parent is .. (and . represents the current node):

In [14]: [etree.tostring(tr)[:50] for tr in
...: tree.xpath("/html/body/div/table/tr/..")]
Out[14]:
[b'<table id="planetsTable" border="1">\n ',
b'<table id="footerTable">\n <tr id="']
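
The . and .. shortcuts are most useful when an XPath is evaluated relative to an element that has already been selected. The sketch below assumes a one-row stand-in fragment and compares the XPath forms with lxml's getparent() method.

```python
from lxml import html

# Stand-in fragment, invented for this example.
doc = html.fromstring(
    '<div id="planets"><table id="planetsTable">'
    '<tr name="Earth"><td>Earth</td><td>5.97</td></tr>'
    '</table></div>'
)

earth = doc.xpath(".//tr[@name='Earth']")[0]

cells = earth.xpath("./td/text()")         # . is the current node (the row)
parent_id = earth.xpath("../@id")          # .. is the parent (the table)
not_a_div = earth.xpath("parent::div")     # the parent is a table, so no match
same_parent = earth.getparent().get("id")  # lxml API equivalent of ..

print(cells, parent_id, not_a_div, same_parent)
```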

The last example finds the mass of Earth:

In [15]: mass = tree.xpath("/html/body/div[1]/table/tr[@name='Earth']/td[3]/text()[1]")[0].strip()
...: mass
Out[15]: '5.97'

The trailing portion of this XPath, /td[3]/text()[1], selects the third <td> element in the row, then the text of that element (which is an array of all the text in the element), and finally the first item of that array, which is the mass.
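
To see why text() yields an array, consider a cell whose text is interrupted by a child element. The row below is a hypothetical stand-in (the <sup> footnote marker and the diameter value are assumptions), and the conversion to float is an extra step that the recipe itself does not perform.

```python
from lxml import html

# Stand-in row; the <sup> element splits the cell text into two text nodes.
doc = html.fromstring(
    '<table><tr name="Earth">'
    '<td>Earth</td><td>12756</td><td> 5.97 <sup>a</sup> estimated</td>'
    '</tr></table>'
)

# All text nodes that are direct children of the third cell.
texts = doc.xpath(".//tr[@name='Earth']/td[3]/text()")

# Only the first text node, stripped of surrounding whitespace.
mass_text = doc.xpath(".//tr[@name='Earth']/td[3]/text()[1]")[0].strip()

# Converting to a number is left to Python.
mass = float(mass_text)

print([t.strip() for t in texts], mass)
```

Indexing with [0] raises IndexError when nothing matches, so production code may want to check that the result list is not empty first.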

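How it works

XPath is a component of the XSLT (Extensible Stylesheet Language Transformations) standard and provides the ability to select nodes in an XML document. HTML is not strictly XML, but it has the same tree structure of elements, attributes, and text, and lxml parses it into the same kind of tree, so XPath can be used on HTML documents (although HTML can be improperly formed, which can disrupt XPath queries).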
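
As a sketch of what improperly formed HTML means in practice, the fragment below leaves its <p> tags unclosed. lxml's HTML parser repairs it into a usable tree, while the strict XML parser rejects it. The fragment is invented for this example.

```python
from lxml import etree, html

# Two <p> tags that are never closed.
broken = "<div><p>Mercury<p>Venus</div>"

# The HTML parser recovers: each <p> is closed when the next one opens.
doc = html.fromstring(broken)
paras = doc.xpath(".//p/text()")

# The XML parser is strict and refuses the same input.
try:
    etree.fromstring(broken)
    strict_failed = False
except etree.XMLSyntaxError:
    strict_failed = True

print(paras, strict_failed)
```

The repaired tree may not match what a browser builds, so an XPath copied from browser developer tools can need adjusting.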

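XPath itself is designed to model the structure of XML nodes, attributes, and properties. The syntax provides means of finding items in the XML that match the expression. This can include matching or logical comparison of any of the nodes, attributes, values, or text in the XML document.

XPath expressions can be combined to form very complex paths within the document. It is also possible to navigate the document based upon relative positions, which helps greatly in finding data based upon relative positions instead of absolute positions within the DOM.

Understanding XPath is essential for knowing how to parse HTML and perform web scraping. As we will see, it underlies, and provides an implementation for, many of the higher-level libraries such as lxml.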

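There's more

XPath is actually an amazing tool for working with XML and HTML documents. It is quite rich in its capabilities, and we have barely touched the surface by demonstrating a few examples that are common to scraping data in HTML documents.

To learn more, please visit the following links: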