Namespace prevents parsing an XML file in C#
I have a 2.8 GB XML file (a Polish Wikipedia dump). I have to search this file for a certain title and get the page content for it. I use LINQ to XML for simplicity:
var text = from el in StreamXmlDocument(filePath)
           where el.Element("title").Value.Contains(titleToSearch)
           select (string)el.Element("revision").Element("text");
and
private IEnumerable<XElement> StreamXmlDocument(string uri)
{
    // Code based on the MSDN documentation for XNode.ReadFrom:
    // http://msdn.microsoft.com/en-us/library/system.xml.linq.xnode.readfrom.aspx
    using (XmlReader reader = XmlReader.Create(uri))
    {
        reader.MoveToContent();
        while (reader.Read())
        {
            switch (reader.NodeType)
            {
                case XmlNodeType.Element:
                    if (reader.Name == "page")
                    {
                        XElement el = XElement.ReadFrom(reader) as XElement;
                        if (el != null)
                        {
                            // Attempt to strip the xmlns declarations (see below).
                            el.DescendantsAndSelf().Attributes().Where(n => n.IsNamespaceDeclaration).Remove();
                            yield return el;
                        }
                    }
                    break;
            }
        }
    }
}
The problem is that this file contains an xmlns attribute on its first element:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/" (...) >
and when I run the code above I get a null reference error (object reference not set to an instance of an object) at this line:
where el.Element("title").Value.Contains(titleToSearch)
When I manually delete that xmlns attribute, everything works fine. I found somewhere on the Internet that this:
el.DescendantsAndSelf().Attributes().Where(n => n.IsNamespaceDeclaration).Remove();
should delete all xmlns attributes from elements. But it doesn't.
Comments (2)
Well, welcome to SO then ;-)
In XML, a namespace declaration is sacred. Removing it may well make the XML unusable, so I'd advise against it (and it's a huge task on a 2.8 GB file!). Each name should be considered unique, as in
{namespace}elementname
(i.e., both parts) whenever you deal with XML. LINQ to XML accepts namespaces, and you should use them (this may be ignored if you do so already).
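As a rough illustration of the {namespace}elementname idea, here is a minimal sketch. The class and method names (NamespaceDemo, FoundWithoutNamespace, TitleOf) and the one-page sample are made up for this example; only the namespace URI comes from the question. It shows that an unqualified name finds nothing in a namespaced document, while a name built from an XNamespace does.

```csharp
using System;
using System.Xml.Linq;

public static class NamespaceDemo
{
    // Namespace URI taken from the <mediawiki> element quoted in the question.
    public static readonly XNamespace Ns = "http://www.mediawiki.org/xml/export-0.4/";

    // An unqualified name means "title in no namespace", so this lookup
    // fails for elements that live in the MediaWiki namespace.
    public static bool FoundWithoutNamespace(XElement page)
    {
        return page.Element("title") != null;
    }

    // Ns + "title" builds the expanded name {namespace}title.
    // Casting a null XElement to string yields null instead of throwing.
    public static string TitleOf(XElement page)
    {
        return (string)page.Element(Ns + "title");
    }

    public static void Main()
    {
        // Hypothetical one-page sample, not taken from the real dump.
        XElement page = XElement.Parse(
            "<page xmlns=\"http://www.mediawiki.org/xml/export-0.4/\">" +
            "<title>Warszawa</title></page>");

        Console.WriteLine(Ns + "title");                 // expanded {namespace}title form
        Console.WriteLine(FoundWithoutNamespace(page));  // unqualified lookup
        Console.WriteLine(TitleOf(page));                // qualified lookup
    }
}
```

This is also why removing the xmlns attributes after the fact does not help: each XElement keeps the namespace-qualified name it was given when it was read.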
A note on the XML: LINQ to XML will load the whole thing into memory, I believe, just like DOM, which will require about 4.5 times the size of the file. This may be problematic. Read this MSDN blog about streaming LINQ to XML.
I believe you want:
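The code that followed this line is not preserved here. As a stand-in, the following is one possible sketch of a namespace-aware version of the question's query, not the answerer's original code. The names WikiDumpSearch, StreamPages and FindText and the in-memory sample are assumptions for illustration; the namespace URI is the one from the question. The streaming loop also differs from the question's: it skips Read() after XNode.ReadFrom, because ReadFrom already leaves the reader on the following node.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class WikiDumpSearch
{
    // Namespace URI from the <mediawiki> element quoted in the question.
    private static readonly XNamespace Ns = "http://www.mediawiki.org/xml/export-0.4/";

    // Streams <page> elements one at a time (hypothetical variant of the
    // question's StreamXmlDocument helper).
    public static IEnumerable<XElement> StreamPages(XmlReader reader)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "page")
            {
                // ReadFrom consumes the whole element and leaves the reader on
                // the node after it, so Read() is not called in this branch.
                if (XNode.ReadFrom(reader) is XElement el)
                    yield return el;
            }
            else
            {
                reader.Read();
            }
        }
    }

    // Same query as in the question, but every element name is qualified.
    public static IEnumerable<string> FindText(XmlReader reader, string titleToSearch)
    {
        return from el in StreamPages(reader)
               let title = (string)el.Element(Ns + "title")
               where title != null && title.Contains(titleToSearch)
               select (string)el.Element(Ns + "revision")?.Element(Ns + "text");
    }

    public static void Main()
    {
        // Tiny made-up sample standing in for the 2.8 GB dump.
        const string sample =
            "<mediawiki xmlns=\"http://www.mediawiki.org/xml/export-0.4/\">" +
            "<page><title>Warszawa</title><revision><text>Stolica Polski</text></revision></page>" +
            "<page><title>Krakow</title><revision><text>Miasto</text></revision></page>" +
            "</mediawiki>";

        using (XmlReader reader = XmlReader.Create(new StringReader(sample)))
        {
            foreach (string text in FindText(reader, "Warszawa"))
                Console.WriteLine(text);
        }
    }
}
```

For the real file, XmlReader.Create(filePath) would replace the StringReader, and only one page element is held in memory at a time.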