Handling XML Namespaces When Parsing with Python – 长江边的程序员

<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <name>John Cleese</name>
    </item>
</items>

For the XML file above, parsing it with Python and reading the text of name is fairly simple:

import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

root = ET.fromstring(xml_content)
print(root.find(".//item/name").text)

But once the document declares a namespace, the plain path no longer matches:

<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>

In that case the lookup can be written like this, with every tag qualified as {namespace-URI}tag:

import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

root = ET.fromstring(xml_content)
print(
    root.find(
        ".//{http://www.w3.org/1999/xhtm}item/{http://www.w3.org/1999/xhtm}name"
    ).text
)
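
The qualified path is needed because ElementTree stores the namespace as part of every tag name. A minimal sketch, reusing the same sample document, that prints the root tag and shows the unqualified path finding nothing (the variable names root_tag and plain_match are just for illustration):

```python
import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

root = ET.fromstring(xml_content)

# ElementTree keeps each tag in "{namespace-uri}localname" form.
root_tag = root.tag
print(root_tag)  # {http://www.w3.org/1999/xhtm}items

# The unqualified path therefore matches nothing and find() returns None.
plain_match = root.find(".//item/name")
print(plain_match)  # None
```

Calling .text on that None result is what raises AttributeError when the namespace is ignored.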

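Written this way, though, the XPath expression gets rather long. Passing a prefix-to-URI mapping as the second argument to find() shortens it: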

import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

ns = {"a": "http://www.w3.org/1999/xhtm"}
root = ET.fromstring(xml_content)
print(root.find(".//a:item/a:name", ns).text)
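
If the namespace URI itself does not matter, another option is the {*} wildcard, which ElementTree supports from Python 3.8 onward. A small sketch under that version assumption, using the same sample document (name_text is an illustrative variable name):

```python
import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

root = ET.fromstring(xml_content)

# "{*}" matches the tag in any namespace (or in none); needs Python 3.8+.
name_text = root.find(".//{*}item/{*}name").text
print(name_text)  # John Cleese
```

The trade-off is that the wildcard also matches same-named tags from unrelated namespaces, so the explicit mapping is safer when a document mixes several.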

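Finally, when the document has only a single default namespace, a crude but effective option is to delete the namespace declaration from the XML before parsing it: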

import re
import xml.etree.ElementTree as ET

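def parse_XML(src: str) -> ET.Element:
    # Remove only the first double-quoted default-namespace declaration.
    # count is passed by keyword: the positional form is deprecated in Python 3.13.
    src = re.sub(r'\sxmlns="[^"]+"', "", src, count=1)
    root = ET.fromstring(src)
    return root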
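
A variation on the same idea is to parse first and strip the namespace from the tags afterwards, which avoids running a regular expression over raw XML. This is only a sketch: strip_namespaces is a hypothetical helper, not part of ElementTree, and it leaves namespaced attributes untouched.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root: ET.Element) -> ET.Element:
    """Hypothetical helper: drop the "{uri}" prefix from every tag, in place."""
    for element in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(element.tag, str) and element.tag.startswith("{"):
            element.tag = element.tag.split("}", 1)[1]
    return root


xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

root = strip_namespaces(ET.fromstring(xml_content))

# With the prefixes gone, the original unqualified path works again.
name_text = root.find(".//item/name").text
print(name_text)  # John Cleese
```

Like the regex approach, this discards the namespace information, so it only suits documents where tags from different namespaces cannot collide.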