Namespace prevents parsing an XML file in C#

Posted 2024-09-11 07:39:19

I have a 2.8GB XML file (a Polish Wikipedia dump). I need to search this file for a certain title and get the content of the matching page. I use LINQ to XML for simplicity:

var text = from el in StreamXmlDocument(filePath)
           where el.Element("title").Value.Contains(titleToSearch)
           select (string)el.Element("revision").Element("text");

and

private IEnumerable<XElement> StreamXmlDocument(string uri)
{
    // Code based on the MSDN documentation for XNode.ReadFrom:
    // http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx
    using (XmlReader reader = XmlReader.Create(uri))
    {
        reader.MoveToContent();

        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.Name == "page")
                    {
                        XElement el = XElement.ReadFrom(reader) as XElement;
                        if (el != null)
                        {
                            // Attempt to strip the xmlns declarations (see below).
                            el.DescendantsAndSelf().Attributes().Where(n => n.IsNamespaceDeclaration).Remove();
                            yield return el;
                        }
                    }
                    break;
            }
        }
    }
}

The problem is that this file has an xmlns attribute on its first element:

<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" (...) >

and when I run the code above I get a "no reference to object" error (a NullReferenceException) at this line:

where el.Element("title").Value.Contains(titleToSearch)

When I manually delete that xmlns attribute, everything works fine. I found somewhere on the Internet that this:

el.DescendantsAndSelf().Attributes().Where(n => n.IsNamespaceDeclaration).Remove();

should delete all xmlns attributes from the elements, but it doesn't.

烟花肆意 2024-09-18 07:39:19

Well, welcome to SO then ;-)

In XML, a namespace declaration is sacred. Removing it may well make the XML unusable, so I'd advise against it (and it's a huge task on a 2.8GB file!). Whenever you deal with XML, treat each name as the pair {namespace}elementname, not as the element name alone. LINQ to XML supports namespaces, and you should use them:

XNamespace wiki = "http://www.mediawiki.org/xml/export-0.4/";

var text = from el in StreamXmlDocument(filePath)
           where el.Element(wiki + "title").Value.Contains(titleToSearch)
           select (string)el.Element(wiki + "revision").Element(wiki + "text");
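
As for why stripping the xmlns attributes had no effect: in LINQ to XML the namespace is part of each element's XName, so removing the declaration attributes leaves the element names unchanged. A minimal sketch on a small in-memory fragment; the sample XML and the class name NamespaceDemo are made up for illustration and are not part of the original code:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NamespaceDemo
{
    // Hypothetical sample: a tiny stand-in for one <page> of the dump.
    public const string Sample =
        "<page xmlns=\"http://www.mediawiki.org/xml/export-0.4/\">" +
        "<title>Warszawa</title>" +
        "<revision><text>Stolica Polski</text></revision>" +
        "</page>";

    public static readonly XNamespace Wiki = "http://www.mediawiki.org/xml/export-0.4/";

    public static XElement LoadAndStrip()
    {
        XElement el = XElement.Parse(Sample);
        // Same call as in the question: it removes only the xmlns *attributes*.
        el.DescendantsAndSelf().Attributes().Where(a => a.IsNamespaceDeclaration).Remove();
        return el;
    }

    public static void Main()
    {
        XElement el = LoadAndStrip();
        // The element names still carry the namespace.
        Console.WriteLine(el.Name.NamespaceName);
        // The unqualified lookup finds nothing, hence the null reference.
        Console.WriteLine(el.Element("title") == null);
        // The qualified lookup finds the element.
        Console.WriteLine((string)el.Element(Wiki + "title"));
    }
}
```

Calling .Value on the null result of el.Element("title") is what throws in the question's where clause.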

(You may ignore this if you are already doing it.)
A note on the XML: LINQ to XML will, I believe, load the whole thing into memory, just like DOM, which will require about 4.5 times the size of the file. This may be problematic. Read this MSDN blog about streaming LINQ to XML.
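
The streaming approach mentioned above can be combined with namespace-aware matching. The following is a sketch, not the blog's implementation: the names WikiStream, StreamPages, and FindText are hypothetical, and matching on LocalName plus NamespaceURI is one possible choice. It also avoids calling Read() right after XNode.ReadFrom, since ReadFrom already leaves the reader on the node that follows the element and an extra Read() could skip a page that starts immediately after the previous one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class WikiStream
{
    public static readonly XNamespace Wiki = "http://www.mediawiki.org/xml/export-0.4/";

    // Yields one <page> element at a time, so only a single page is materialized at once.
    public static IEnumerable<XElement> StreamPages(XmlReader reader)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element
                && reader.LocalName == "page"
                && reader.NamespaceURI == Wiki.NamespaceName)
            {
                // ReadFrom consumes the element and advances past it; no extra Read() here.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }

    // Returns the text of every page whose title contains titleToSearch.
    public static List<string> FindText(string xml, string titleToSearch)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // ToList() forces enumeration before the reader is disposed.
            return (from el in StreamPages(reader)
                    let title = (string)el.Element(Wiki + "title")
                    where title != null && title.Contains(titleToSearch)
                    select (string)el.Element(Wiki + "revision")?.Element(Wiki + "text"))
                   .ToList();
        }
    }
}
```

For the real dump, the reader would be created from the file path with XmlReader.Create(filePath) instead of a StringReader; the string overload here only keeps the sketch self-contained.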

我恋#小黄人 2024-09-18 07:39:19

I believe you want:

XNamespace ns = "http://www.mediawiki.org/xml/export-0.4/";

var text = from el in StreamXmlDocument(filePath)
           where el.Element(ns+"title").Value.Contains(titleToSearch)
           select (string)el.Element(ns+"revision").Element(ns+"text");
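
A possible variation, in case hard-coding the URI is undesirable: take the namespace from the page element itself. This is only a sketch, assuming that title, revision, and text live in the same namespace as their page element; PageQuery and TextIfTitleMatches are hypothetical names. The string casts also avoid the null reference from the question when an element is missing.

```csharp
using System;
using System.Xml.Linq;

public static class PageQuery
{
    // Returns the page text if the title contains the search term, otherwise null.
    public static string TextIfTitleMatches(XElement page, string titleToSearch)
    {
        // Namespace of the <page> element itself (XNamespace.None if it has none).
        XNamespace ns = page.Name.Namespace;

        // Casting a null XElement to string yields null instead of throwing.
        string title = (string)page.Element(ns + "title");
        if (title == null || !title.Contains(titleToSearch))
        {
            return null;
        }

        return (string)page.Element(ns + "revision")?.Element(ns + "text");
    }
}
```

With the streaming method from the question, this could be used as StreamXmlDocument(filePath).Select(p => PageQuery.TextIfTitleMatches(p, titleToSearch)).Where(t => t != null).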
