Handling XML Namespaces When Parsing XML in Python

<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <name>John Cleese</name>
    </item>
</items>

For the XML file above, getting the text of name with Python is straightforward:

import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items>
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

# fromstring() parses a complete XML document held in a single string.
root = ET.fromstring(xml_content)
print(root.find(".//item/name").text)

Things change once a default namespace is added:

<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>

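Every element now belongs to that namespace, so each tag in the path has to be qualified with the namespace URI in curly braces: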

import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

root = ET.fromstring(xml_content)
print(
    root.find(
        ".//{http://www.w3.org/1999/xhtm}item/{http://www.w3.org/1999/xhtm}name"
    ).text
)
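To see why the unqualified path stops matching, it can help to look at what ElementTree actually stores as the tag. The following is a minimal sketch that reuses the document above; the variable names plain and qualified are only illustrative.

```python
import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

root = ET.fromstring(xml_content)

# ElementTree stores every tag in "{namespace-uri}local-name" form.
print(root.tag)

# An unqualified path therefore matches nothing, and find() returns None.
plain = root.find(".//item/name")
print(plain)

# Qualifying the tag with the URI makes the lookup succeed again.
qualified = root.find(".//{http://www.w3.org/1999/xhtm}name")
print(qualified.text)
```

Calling .text directly on the result of the unqualified find() would raise AttributeError, because the result is None.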

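Written this way, though, the XPath gets rather verbose. It can be shortened by mapping a prefix to the namespace URI and passing that mapping to find():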

import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

ns = {"a": "http://www.w3.org/1999/xhtm"}
root = ET.fromstring(xml_content)
print(root.find(".//a:item/a:name", ns).text)
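On Python 3.8 or later there are, as far as I know, two further shorthands for the same lookup: an empty-string prefix in the mapping, which applies the namespace to every unprefixed tag in the path, and the {*} wildcard, which matches a tag in any namespace. The sketch below assumes Python 3.8+; the names default_ns, name_via_default and name_via_wildcard are made up for illustration.

```python
import xml.etree.ElementTree as ET

xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

root = ET.fromstring(xml_content)

# Assumes Python 3.8+: the "" prefix moves all unprefixed tags in the
# path into the given namespace.
default_ns = {"": "http://www.w3.org/1999/xhtm"}
name_via_default = root.find(".//item/name", default_ns).text
print(name_via_default)

# Assumes Python 3.8+: "{*}" matches the tag regardless of its namespace.
name_via_wildcard = root.find(".//{*}item/{*}name").text
print(name_via_wildcard)
```

The wildcard form is convenient when the namespace URI is not known in advance, but it will also match same-named elements from unrelated namespaces.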

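Finally, when the document has only a single default namespace, a quick-and-dirty option is to delete the namespace declaration from the XML before parsing it: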

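import re
import xml.etree.ElementTree as ET


def parse_XML(src: str) -> ET.Element:
    # Remove only the first default-namespace declaration (xmlns="...");
    # prefixed declarations such as xmlns:a="..." are left untouched.
    src = re.sub(r'xmlns=".+?"', "", src, count=1)
    root = ET.fromstring(src)
    return root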
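If editing the raw text with a regular expression feels too fragile, one alternative is to parse first and then strip the namespace part from each tag. This is only a sketch of that idea, not something the approach above requires; strip_namespaces is a hypothetical helper name, and it deliberately leaves attribute names alone.

```python
import xml.etree.ElementTree as ET


def strip_namespaces(root: ET.Element) -> ET.Element:
    """Remove the "{uri}" part from every element tag, in place."""
    for elem in root.iter():
        # Guard against non-string tags (e.g. comments, if the parser keeps them).
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


xml_content = """<?xml version="1.0" encoding="utf-8"?>
<items xmlns="http://www.w3.org/1999/xhtm">
    <item>
        <name>John Cleese</name>
    </item>
</items>
"""

root = strip_namespaces(ET.fromstring(xml_content))
print(root.find(".//item/name").text)
```

Like the regular-expression version, this discards namespace information, so it is only reasonable when the document does not rely on namespaces to tell same-named elements apart.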