处理 CSV 和 JSON 数据
从 HTML 页面中提取数据是使用上一章中的技术完成的,主要是通过各种工具使用 XPath,也使用 Beautiful Soup。 虽然我们将主要关注 HTML,但 HTML 是 XML(可扩展标记语言)的变体。 XML 是在 Web 上表达数据最流行的一种方式,但其他方式也变得流行起来,甚至超过了 XML 的受欢迎程度。
您将看到的两种常见格式是 JSON(JavaScript 对象表示法)和 CSV(逗号分隔值)。 CSV 很容易创建,并且是许多电子表格应用程序的通用形式,因此许多网站都提供这种格式的数据,或者您需要将抓取的数据转换为该格式以进行进一步存储或协作。 JSON 确实已成为首选格式,因为它易于使用 JavaScript(和 Python)等编程语言,并且许多数据库现在都支持它作为本机数据格式。
在本节中,我们将研究如何将抓取的数据转换为 CSV 和 JSON,以及将数据写入文件以及从远程服务器读取这些数据文件。 我们将检查的工具是 Python CSV 和 JSON 库。 我们还将研究如何使用 pandas 来实现这些技术。
这些示例还隐含了 XML 数据到 CSV 和 JSON 的转换,因此我们不会为这些示例提供专门的部分。 |
准备工作
我们将使用行星数据页面并将该数据转换为 CSV 和 JSON 文件。 首先,我们将页面中的行星数据加载到 Python 字典对象列表中。 以下代码(位于 (03/get_planet_data.py) 中提供了执行此任务的函数,该函数将在整个章节中重复使用:
from urllib.request import urlopen
from bs4 import BeautifulSoup
def get_planet_data():
html = urlopen("http://localhost:8080/planets.html")
bsobj = BeautifulSoup(html, "lxml")
planets = []
planet_rows = bsobj.html.body.div.table.findAll("tr", {"class": "planet"})
for i in planet_rows:
tds = i.findAll("td")
planet_data = dict()
planet_data['Name'] = tds[1].text.strip()
planet_data['Mass'] = tds[2].text.strip()
planet_data['Radius'] = tds[3].text.strip()
planet_data['Description'] = tds[4].text.strip()
planet_data['MoreInfo'] = tds[5].findAll("a")[0]["href"].strip()
planets.append(planet_data)
return planets
if __name__ == "__main__":
print(get_planet_data())
运行脚本会给出以下输出(简短截断):
03 $python get_planet_data.py
[{'Name': 'Mercury', 'Mass': '0.330', 'Radius': '4879', 'Description':
'Named Mercurius by the Romans because it appears to move so swiftly.',
'MoreInfo': 'https://en.wikipedia.org/wiki/Mercury_(planet)'}, {'Name':
'Venus', 'Mass': '4.87', 'Radius': '12104', 'Description': 'Roman name for
the goddess of love. This planet was considered to be the brightest and
most beautiful planet or star in the\r\n heavens. Other civilizations have
named it for their god or goddess of love/war.', 'MoreInfo':
'https://en.wikipedia.org/wiki/Venus'}, {'Name': 'Earth', 'Mass': '5.97',
'Radius': '12756', 'Description': "The name Earth comes from the IndoEuropean base 'er,'which produced the Germanic noun 'ertho,' and ultimately
German 'erde,'\r\n Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.'
Related forms include Greek 'eraze,' meaning\r\n 'on the ground,' and Welsh
'erw,' meaning 'a piece of land.'", 'MoreInfo':
'https://en.wikipedia.org/wiki/Earth'}, {'Name': 'Mars', 'Mass': '0.642',
'Radius': '6792', 'Description': 'Named by the Romans for their god of war
because of its red, bloodlike color. Other civilizations also named this
planet\r\n from this attribute; for example, the Egyptians named it "Her
Desher," meaning "the red one."', 'MoreInfo':
...
可能需要安装 csv、json 和 pandas。 您可以使用以下三个命令来完成此操作:
pip install csv
pip install json
pip install pandas
怎么做
我们首先将行星数据转换为 CSV 文件。
-
这将使用 csv 执行。 以下代码将行星数据写入 CSV 文件(代码位于 03/create_csv.py):
import csv from get_planet_data import get_planet_data planets = get_planet_data() with open('../../www/planets.csv', 'w+') as csvFile: writer = csv.writer(csvFile) writer.writerow(('Name', 'Mass', 'Radius', 'Description', 'MoreInfo')) for planet in planets: writer.writerow([planet['Name'], planet['Mass'], planet['Radius'], planet['Description'], planet['MoreInfo']])
-
输出文件被放入我们项目的 www 文件夹中。 检查它我们看到以下内容::
Name,Mass,Radius,Description,MoreInfo Mercury,0.330,4879,Named Mercurius by the Romans because it appears to move so swiftly.,https://en.wikipedia.org/wiki/Mercury_(planet) Venus,4.87,12104,Roman name for the goddess of love. This planet was considered to be the brightest and most beautiful planet or star in the heavens. Other civilizations have named it for their god or goddess of love/war.,https://en.wikipedia.org/wiki/Venus Earth,5.97,12756,"The name Earth comes from the Indo-European base 'er,'which produced the Germanic noun 'ertho,' and ultimately German 'erde,' Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.' Related forms include Greek 'eraze,' meaning 'on the ground,' and Welsh 'erw,' meaning 'a piece of land.'",https://en.wikipedia.org/wiki/Earth Mars,0.642,6792,"Named by the Romans for their god of war because of its red, bloodlike color. Other civilizations also named this planet from this attribute; for example, the Egyptians named it ""Her Desher,"" meaning ""the red one.""",https://en.wikipedia.org/wiki/Mars Jupiter,1898,142984,The largest and most massive of the planets was named Zeus by the Greeks and Jupiter by the Romans; he was the most important deity in both pantheons.,https://en.wikipedia.org/wiki/Jupiter Saturn,568,120536,"Roman name for the Greek Cronos, father ofZeus/Jupiter. Other civilizations have given different names to Saturn, which is the farthest planet from Earth that can be observed by the naked human eye. Most of its satellites were named for Titans who, according to Greek mythology, were brothers and sisters of Saturn.",https://en.wikipedia.org/wiki/Saturn Uranus,86.8,51118,"Several astronomers, including Flamsteed and Le Monnier, had observed Uranus earlier but had recorded it as a fixed star. Herschel tried unsuccessfully to name his discovery ""Georgian Sidus"" after George III; the planet was named by Johann Bode in 1781 after the ancient Greek deity of the sky Uranus, the father of Kronos (Saturn) and grandfather of Zeus (Jupiter).",https://en.wikipedia.org/wiki/Uranus Neptune,102,49528,"Neptune was ""predicted"" by John Couch Adams and Urbain Le Verrier who, independently, were able to account for the irregularities in the motion of Uranus by correctly predicting the orbital elements of a trans- Uranian body. Using the predicted parameters of Le Verrier (Adams never published his predictions), Johann Galle observed the planet in 1846. Galle wanted to name the planet for Le Verrier, but that was not acceptable to the international astronomical community. Instead, this planet is named for the Roman god of the sea.",https://en.wikipedia.org/wiki/Neptune Pluto,0.0146,2370,"Pluto was discovered at Lowell Observatory in Flagstaff, AZ during a systematic search for a trans-Neptune planet predicted by Percival Lowell and William H. Pickering. Named after the Roman god of the underworld who was able to render himself invisible.",https://en.wikipedia.org/wiki/Pluto
我们将此文件写入 www 目录,以便我们可以使用 Web 服务器下载它。
-
此数据现在可以在支持 CSV 内容的应用程序中使用,例如 Excel:
-
还可以使用 csv 库从 Web 服务器读取 CSV 数据,方法是首先使用 requests 检索内容。 以下代码位于 03/read_csv_from_web.py):
import requests import csv planets_request = requests.get("http://localhost:8080/planets.csv") csv_data = planets_request.text.split("\n") reader = csv.reader(csv_data, delimiter=',', quotechar='"') # for line in lines: print(line) for row in reader: print(row)
以下是部分输出
['Name', 'Mass', 'Radius', 'Description', 'MoreInfo'] ['Mercury', '0.330', '4879', 'Named Mercurius by the Romans because it appears to move so swiftly.', 'https://en.wikipedia.org/wiki/Mercury_(planet)'] ['Venus', '4.87', '12104', 'Roman name for the goddess of love. This planet was considered to be the brightest and most beautiful planet or star in the heavens. Other civilizations have named it for their god or goddess of love/war.', 'https://en.wikipedia.org/wiki/Venus'] ['Earth', '5.97', '12756', "The name Earth comes from the IndoEuropean base 'er,'which produced the Germanic noun 'ertho,' and ultimately German 'erde,' Dutch 'aarde,' Scandinavian 'jord,' and English 'earth.' Related forms include Greek 'eraze,' meaning 'on the ground,' and Welsh 'erw,' meaning 'a piece of land.'", 'https://en.wikipedia.org/wiki/Earth']
需要指出的一件事是,CSV 编写器留下了一个尾随空白,就像如果不处理的话会添加一个空列表项。 这是通过对行进行切片来处理的:以下语句返回除最后一行之外的所有行:
lines = [line for line in reader][:-1]
-
使用 pandas 也可以轻松完成此操作。 下面从抓取的数据构造一个 DataFrame。 代码在 03/create_df_planets.py 中:
import pandas as pd planets_df = pd.read_csv("http://localhost:8080/planets_pandas.csv",index_col='Name') print(planets_df)
运行此命令会产生以下输出:
Description Mass Radius Name Mercury Named Mercurius by the Romans because it appea... 0.330 4879 Venus Roman name for the goddess of love. This plane... 4.87 12104 Earth The name Earth comes from the Indo-European ba... 5.97 12756 Mars Named by the Romans for their god of war becau... 0.642 6792 Jupiter The largest and most massive of the planets wa... 1898 142984 Saturn Roman name for the Greek Cronos, father of Zeu... 568 120536 Uranus Several astronomers, including Flamsteed and L... 86.8 51118 Neptune Neptune was "predicted" by John Couch Adams an... 102 49528 Pluto Pluto was discovered at Lowell Observatory in ... 0.0146 2370
-
只需简单调用 .to_csv() 即可将 DataFrame 保存到 CSV 文件(代码位于 03/save_csv pandas.py 中):
import pandas as pd from get_planet_data import get_planet_data # construct a data from from the list planets = get_planet_data() planets_df = pd.DataFrame(planets).set_index('Name') planets_df.to_csv("../../data/planets_pandas.csv")
-
可以使用 pd.read_csv() 轻松地从 URL 读取 CSV 文件 - 无需其他库。 您可以使用 03/read_csv_via_pandas.py 中的代码):
import pandas as pd planets_df = pd.read_csv("http://localhost:8080/planets_pandas.csv", index_col='Name') print(planets_df)
-
将数据转换为 JSON 也非常容易。 使用 Python 操作 JSON 可以通过 Python json 库来完成。 该库可用于将 Python 对象与 JSON 相互转换。 以下将行星列表转换为 JSON 并将其打印到控制台:将行星数据打印为 JSON(03/convert_to_json.py 中的代码):
import json from get_planet_data import get_planet_data planets=get_planet_data() json.dumps(planets)
执行此脚本会产生以下输出(省略了一些输出):
[ { "Name": "Mercury", "Mass": "0.330", "Radius": "4879", "Description": "Named Mercurius by the Romans because it appears to move so swiftly.", "MoreInfo": "https://en.wikipedia.org/wiki/Mercury_(planet)" }, { "Name": "Venus", "Mass": "4.87", "Radius": "12104", "Description": "Roman name for the goddess of love. This planet was considered to be the brightest and most beautiful planet or star in the heavens. Other civilizations have named it for their god or goddess of love/war.", "MoreInfo": "https://en.wikipedia.org/wiki/Venus" },
-
这也可以用来轻松地将 JSON 保存到文件 (03/save_as_json.py):
import json from get_planet_data import get_planet_data planets=get_planet_data() with open('../../data/planets.json', 'w+') as jsonFile: json.dump(planets, jsonFile, indent=4)
-
使用 !head -n 13 ../../www/planets.json 检查输出显示:
[ { "Name": "Mercury", "Mass": "0.330", "Radius": "4879", "Description": "Named Mercurius by the Romans because it appears to move so swiftly.", "MoreInfo": "https://en.wikipedia.org/wiki/Mercury_(planet)" }, { "Name": "Venus", "Mass": "4.87", "Radius": "12104", "Description": "Roman name for the goddess of love. This planet was considered to be the brightest and most beautiful planet or star in the heavens. Other civilizations have named it for their god or goddess of love/war.",
-
JSON 可以通过请求从 Web 服务器读取并转换为 Python 对象 (03/read_http_json_requests.py):
import requests import json planets_request = requests.get("http://localhost:8080/data/planets.json") print(json.loads(planets_request.text))
-
pandas 还提供 JSON 功能来保存到 CSV (03/save_json_pandas.py):
import pandas as pd from get_planet_data import get_planet_data planets = get_planet_data() planets_df = pd.DataFrame(planets).set_index('Name') planets_df.reset_index().to_json("../../data/planets_pandas.json", orient='records')
不幸的是,目前还没有一种方法可以漂亮地打印 .to_json() 输出的 JSON。 另请注意 orient='records' 的使用和 rest_index() 的使用。 这对于将相同的 JSON 结构复制到使用 JSON 库示例编写的 JSON 是必要的。
-
JSON 可以使用 .read_json() 读取到 DataFrame 中,也可以从 HTTP 和文件中读取 (03/read_json_http_pandas.py):
import pandas as pd planets_df = pd.read_json("http://localhost:8080/data/planets_pandas.json").set_index('Name') print(planets_df)
运行原理
csv 和 json 库是 Python 的标准部分,提供了一种读取和写入两种格式数据的简单方法。
pandas 并不是某些 Python 发行版的标准配置,您可能需要安装它。 CSV 和 JSON 的 pandas 函数在操作上也更高水平,提供许多强大的数据操作,并且还支持从远程服务器访问数据。
还有更多
csv、json 或 pandas 库的选择由您决定,但我倾向于喜欢 pandas,我们将在整本书中检查它的用途,尽管我们不会太深入地了解它的用法。
要深入了解 pandas,请访问 pandas.pydata.org,或阅读我的另一本书 From Packt, Learning pandas, 2ed。
有关 csv 库的更多信息,请参阅 https://docs.python.org/3/library/csv.html
有关 json 库的更多信息,请参阅 https://docs.python.org/3/library/json.html