Regex / Parsing an XML File

Posted on 2024-11-14 22:45:25

I have an XML file with a bunch of data contained by custom tags. This is all useful for one project I have, but for another project I don't need so much info. So I'd like to trim the XML file, and get rid of all instances of certain tags and whatever is between the tags.

<GOBJ>
    <cost>4</cost>
    <duration>n/a</duration>
    <item>Stone Block</item>
    <type>Construction - Material</type>
    <misc>Use these blocks to build things. These blocks don't degrade.</misc>
</GOBJ>

I only want to keep <item>blah</item> and <type>blah</type>; everything else should be removed.

Later on, I will need to check the text of <type> and replace its contents if it matches certain words. For example, if the word metal appears anywhere within the <type> element, then replace the contents of that element with just the word metal.

I know this is a big request; I appreciate any help.

赤濁 2024-11-21 22:45:25

Another way is to use a simple XML → XML transformation (XSLT 1.0 with XPath 1.0) like the one below. It is easy to adapt to your requirements and to reuse for other documents.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>

    <xsl:template match="root">
        <root>
            <xsl:apply-templates select="GOBJ"/>
        </root>
    </xsl:template>

    <xsl:template match="GOBJ">
        <GOBJ>
            <xsl:copy-of select="item"/>
            <type>
                <xsl:choose>
                    <xsl:when test="contains(type, 'metal')">
                        <xsl:text>metal</xsl:text>
                    </xsl:when>
                    <!-- other xsl:when conditions here -->
                    <xsl:otherwise>
                        <xsl:value-of select="type"/>
                    </xsl:otherwise>
                </xsl:choose>
            </type>
        </GOBJ>
    </xsl:template>
</xsl:stylesheet>

I know it's not a regex-based solution, but IMHO it's better to use a native XML-oriented toolkit.
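
For readers who end up doing the second step in script code instead, the xsl:choose logic can be mirrored by a small function. This is only a sketch: normalizeType and its keyword list are hypothetical names, and only the 'metal' keyword comes from the question.

```javascript
// Hypothetical helper mirroring the xsl:choose above: collapse a <type>
// value to a single keyword when the text contains that keyword.
// The keyword list is an assumption; only 'metal' comes from the question.
function normalizeType(typeText, keywords) {
    for (const keyword of keywords) {
        // Like XPath contains(), String.prototype.includes is case-sensitive.
        if (typeText.includes(keyword)) {
            return keyword;
        }
    }
    // Equivalent of the xsl:otherwise branch: keep the text unchanged.
    return typeText;
}

console.log(normalizeType('Construction - metal bar', ['metal']));
console.log(normalizeType('Construction - Material', ['metal']));
```

Because the comparison is case-sensitive, a value such as "Metal" would not be collapsed by either the stylesheet or this sketch.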


无名指的心愿 2024-11-21 22:45:25

Assuming that the file is laid out exactly as your example, multiplied by as many records as required, and that you wish to preserve the original layout as much as possible, replacing

(<GOBJ>[^<]+?).+?(<item>.+?<\/type>\n).+?(<\/GOBJ>)

with

$1$2$3

globally, with the regex set to operate in 'singleline' mode, will do what you require if and only if: element <GOBJ> is uppercase, the other elements are lowercase, there is only ever one instance of each element per record, and element <item> always appears immediately before element <type> in each record.

In JavaScript, this would be:

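// The s (dotAll) flag is JavaScript's 'singleline' mode: it lets . match
// newlines. It requires ES2018 or later.
var result = src.replace(
    /(<GOBJ>[^<]+?).+?(<item>.+?<\/type>\n).+?(<\/GOBJ>)/gs,
    '$1$2$3'
);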

Note that these strict conditions are what sidestep the usual problems of parsing XML with a regular expression. If the conditions cannot be met, you would be far better served by an XML-specific tool, like XSLT.
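
As a quick check, here is a sketch that runs the pattern against the sample record from the question. It assumes an ES2018+ engine so that the s flag is available; the variable names src and trimmed are only for illustration.

```javascript
// Sample record copied from the question, one line per array entry.
const src = [
    '<GOBJ>',
    '    <cost>4</cost>',
    '    <duration>n/a</duration>',
    '    <item>Stone Block</item>',
    '    <type>Construction - Material</type>',
    "    <misc>Use these blocks to build things. These blocks don't degrade.</misc>",
    '</GOBJ>',
    ''
].join('\n');

// g replaces every record; s (dotAll) lets . match across newlines.
const trimmed = src.replace(
    /(<GOBJ>[^<]+?).+?(<item>.+?<\/type>\n).+?(<\/GOBJ>)/gs,
    '$1$2$3'
);

console.log(trimmed);
```

Only the <item> and <type> lines survive inside <GOBJ>. Because the first group is lazy, it captures just the newline after <GOBJ>, so the indentation in front of <item> is dropped along with the removed elements.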


一萌ing 2024-11-21 22:45:25

Here is a grep solution: grep -E '(|)' myfile.xml


迷爱 2024-11-21 22:45:25

I developed another way to tackle the problem: I built a jQuery script that splits up the XML code (I replaced all the left/right angle brackets with a different symbol beforehand) and outputs an array entry only if it doesn't contain a certain other symbol.

// Split the page text around the [name]...[/name] and [type]...[/type] values.
var name = $('div').text().trim().split(/\[name\](.*?)\[\/name\]/g);
var type = $('div').text().trim().split(/\[type\](.*?)\[\/type\]/g);
for (var i = 0; i < name.length; i++) {
    // Blank out type entries that still contain leftover bracket markup.
    if (type[i].match(/\[/g)) {
        type[i] = "";
    }
    if (!name[i].match(/\[/g)) {
        // Map keywords found in the type text to CSS class names.
        if (type[i].match(/construction/g)) { type[i] = "T_C"; }
        if (type[i].match(/material/g)) { type[i] = "T_M"; }
        if (type[i].match(/metalwork/g)) { type[i] = "T_W"; }
        if (type[i].match(/water/g)) { type[i] = "T_W"; }
        if (type[i].match(/oil/g)) { type[i] = "T_O"; }
        if (type[i].match(/precious/g)) { type[i] = "T_P"; }
        if (type[i].match(/magic/g)) { type[i] = "T_M"; }
        $('.Collect').append('<p>a href="../Img/XXX/' + name[i] + '.jpg" class="' + type[i] + '">' + name[i] + '/a></p>');
    } else {
        // Entries that still contain markup are discarded.
        name[i] = "";
    }
}

The output is formatted that way so that I can just copy and paste the page into a txt/html file and have it pretty much as I wanted it. I'll have to figure out some way to replace XXX with the appropriate directory name...

I only needed to do this once or twice, so pure automation wasn't imperative.
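
If the keyword list grows, the chain of if statements could be replaced by a lookup table. The sketch below is one way to do that, not the script I actually ran: classForType and TYPE_CLASSES are hypothetical names, and the first-match-wins and case-insensitive behaviour are assumptions that differ from the chained checks above.

```javascript
// Keyword-to-class pairs taken from the script above.
// Assumption: the first matching keyword wins, whereas the original
// checks run one after another on the already-modified value.
const TYPE_CLASSES = [
    ['construction', 'T_C'],
    ['material', 'T_M'],
    ['metalwork', 'T_W'],
    ['water', 'T_W'],
    ['oil', 'T_O'],
    ['precious', 'T_P'],
    ['magic', 'T_M']
];

function classForType(typeText) {
    // Assumption: match case-insensitively (the original regexes are case-sensitive).
    const lower = typeText.toLowerCase();
    for (const [keyword, cssClass] of TYPE_CLASSES) {
        if (lower.includes(keyword)) {
            return cssClass;
        }
    }
    // As in the original, text without a known keyword is left unchanged.
    return typeText;
}

console.log(classForType('Construction - Material'));
console.log(classForType('Precious stones'));
```

Keeping the pairs in one array means a new keyword is a one-line change, and the order of the array makes the precedence explicit.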
