<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<name>John Cleese</name>
</item>
</items>
For the XML file above, getting the text of name with Python is fairly simple:
import xml.etree.ElementTree as ET
xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<name>John Cleese</name>
</item>
</items>
"""
root = ET.fromstring(xml_content)
print(root.find(".//item/name").text)
But once a namespace is added, the same path no longer matches:
<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
<item>
<name>John Cleese</name>
</item>
</items>
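As a quick check, here is a minimal sketch of what the parser does with this document. The variable name `plain_match` is only for illustration. ElementTree stores each tag with the namespace URI in braces in front of the local name, so the unqualified path finds nothing:

```python
import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
<item>
<name>John Cleese</name>
</item>
</items>
"""

root = ET.fromstring(xml_content)

# The root tag now carries the namespace URI in braces.
print(root.tag)

# The path from the first example has no namespace, so find() returns None.
plain_match = root.find(".//item/name")
print(plain_match)
```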
It can be written like this:
import xml.etree.ElementTree as ET
xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
<item>
<name>John Cleese</name>
</item>
</items>
"""
root = ET.fromstring(xml_content)
print(
    root.find(
        ".//{http://www.w3.org/1999/xhtm}item/{http://www.w3.org/1999/xhtm}name"
    ).text
)
Written this way, though, the XPath gets rather verbose. It can be changed to this:
import xml.etree.ElementTree as ET
xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
<item>
<name>John Cleese</name>
</item>
</items>
"""
ns = {"a": "http://www.w3.org/1999/xhtm"}
root = ET.fromstring(xml_content)
print(root.find(".//a:item/a:name", ns).text)
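Another option, offered here as a sketch and not as part of the approach above: on Python 3.8 or later, ElementTree paths accept `{*}` as a wildcard that matches a tag in any namespace. This assumes you do not need to tell apart elements that share a local name across different namespaces.

```python
import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
<item>
<name>John Cleese</name>
</item>
</items>
"""

root = ET.fromstring(xml_content)

# {*} matches the tag regardless of which namespace it belongs to (Python 3.8+).
name = root.find(".//{*}item/{*}name")
print(name.text)
```

The matched element keeps its fully qualified tag, so later code that inspects `name.tag` still sees the namespace.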
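Finally, when there is only a single default namespace, a quick and dirty approach is to delete the namespace declaration from the XML before processing it: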
import re
import xml.etree.ElementTree as ET
def parse_XML(src: str) -> ET.Element:
    # Remove only the first default-namespace declaration, then parse as usual.
    src = re.sub('xmlns=".+?"', "", src, count=1)
    root = ET.fromstring(src)
    return root
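A usage sketch for this approach, run against the namespaced sample from earlier. The helper name `parse_without_default_ns` and the slightly stricter pattern (a leading whitespace character and `[^"]+`) are assumptions made for this illustration, not the function above. Prefixed declarations such as `xmlns:a="..."` are left untouched because the pattern requires `=` directly after `xmlns`.

```python
import re
import xml.etree.ElementTree as ET

XML_WITH_NS = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
<item>
<name>John Cleese</name>
</item>
</items>
"""


def parse_without_default_ns(src: str) -> ET.Element:
    # Drop the first xmlns="..." attribute together with the space before it.
    cleaned = re.sub(r'\sxmlns="[^"]+"', "", src, count=1)
    return ET.fromstring(cleaned)


root = parse_without_default_ns(XML_WITH_NS)

# With the declaration gone, the plain path from the first example works again.
name_text = root.find(".//item/name").text
print(name_text)
```

Because this edits the raw text before parsing, it is probably best kept for small, trusted documents where the namespace carries no meaning you need later.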