OOXML SDK illegal character replacement
I am having an issue with creating an XLSX document with the Open XML SDK 2.0 from MS.
My issue is that I need to display some illegal XML characters in the Excel sheet, but if I just add them to the sheet, the document will not load.
I am using this function:
private static string ProcessString(string str)
{
    return System.Security.SecurityElement.Escape(str);
}
which gives me Tom&apos;s ball instead of Tom's ball. (Well, I haven't figured out how to get the latter, as the generated Excel file won't open.)
Anybody know how to make the illegal XML characters show using OOXML in an Excel sheet?
EDIT:
The function I am using to create a text cell is:
private static Cell CreateTextCell(string header, UInt32 index, string text)
{
    var c = new Cell { DataType = CellValues.String, CellReference = header + index };
    var cellValue = new CellValue(text);
    c.Append(cellValue);
    return c;
}
I know it has to do with illegal characters because when I didn't include a particular field in my text it worked, then when I included it, Excel would give me a parser error and a blank document.
The text that I deal with also happens to have HTML tags in it.
ps. lol, I just noticed that the markdown used parsed my HTML escape making my example look ridiculous.
edit 2:
Some examples of input:
Cancer's Complexity: Are we Looking at the Wrong Levels to Develop Effective Interventions?
Prospective study of breast cancer risk in mutation-negative women from <i>BRCA1</i> or <i>BRCA2</i> mutation-positive families in the Kathleen Cuningham Foundation Consortium for Research into Familial Breast Cancer (kConFab).
Germline <em>BRCA2</em> mutations correlate with aggressive prostate cancer and adverse outcome.
The HTML formatting is basically there so that the text displays properly on the web page. I should just strip off the basic formatting tags. But more importantly, I want the Excel file to load, and escaping the values is a sure way of doing just that.
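A minimal sketch of the tag stripping mentioned above, assuming the input only contains simple formatting tags without attributes. The class name `TagStripper`, the method name, and the list of tags are hypothetical and would need adjusting to the real data.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripper
{
    // Hypothetical list of "basic formatting" tags; extend it to match the real input.
    private static readonly Regex FormattingTag =
        new Regex(@"</?(i|em|b|strong|u|sub|sup)\s*>", RegexOptions.IgnoreCase);

    // Removes the opening and closing formatting tags and keeps the text between them.
    public static string StripFormatting(string text)
    {
        return text == null ? null : FormattingTag.Replace(text, string.Empty);
    }
}
```

For example, `TagStripper.StripFormatting("Germline <em>BRCA2</em> mutations")` returns `Germline BRCA2 mutations`. A regular expression is only reasonable here because the tags are known and simple; arbitrary HTML would call for a real parser.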
Comments (3)
Consider that everything contained in, e.g., an Excel file is ultimately persisted as XML. So if you have any invalid XML characters that you'd like to persist in the file, you need to escape them. It is not really an OOXML issue; it is an XML issue.
So if you have input text that contains angle brackets, such as the <i>BRCA1</i> tags in the question, you'd have to convert the angle brackets to the entities &lt; and &gt;.
See String escape into XML for a few ways to do this.
Also, to see how Microsoft Office does this, add the troublesome text to an Excel document and save it. Then use the Open XML SDK tool (it comes with the SDK) to reflect the file and see how it is done.
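As a rough illustration of the escaping described above, the sketch below runs one of the question's sample strings through `System.Security.SecurityElement.Escape`, the same call the question already uses. The wrapper class `XmlEscapeDemo` and its method name are made up for this example.

```csharp
using System;
using System.Security;

public static class XmlEscapeDemo
{
    // Replaces the five XML markup characters (< > & " ') with their predefined entities.
    public static string EscapeMarkup(string text)
    {
        return SecurityElement.Escape(text);
    }
}
```

`XmlEscapeDemo.EscapeMarkup("Germline <em>BRCA2</em> mutations")` returns `Germline &lt;em&gt;BRCA2&lt;/em&gt; mutations`. Keep in mind that an XML writer normally applies this escaping itself when it serializes text, so a value that is escaped twice shows the entity text literally in the cell, which would match the Tom&apos;s ball symptom in the question. Inspecting the saved worksheet part is the safest way to tell which case applies.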
Are you sure this is what is causing the problem? Can you add "normal" strings to the cells and open it?
AFAIK the apostrophe character is not an illegal XML character.
If you look in the OOXML specification in section 22.9.2.19 ST_Xstring (Escaped String) (the data type for strings in cells) you will see the following explanation:
*22.9.2.19 ST_Xstring (Escaped String)
String of characters with support for escaped invalid-XML characters.
For all characters which cannot be represented in XML as defined by the XML 1.0 specification, the characters are escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it must be escaped as _x0008_. end example]*
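A minimal sketch of that escaping rule, assuming the escape format is written with surrounding underscores (`_xHHHH_`) and that the set of legal characters is the `Char` production of XML 1.0. The class `XstringEscaper` is a hypothetical helper, not part of the Open XML SDK.

```csharp
using System;
using System.Text;

public static class XstringEscaper
{
    // True if a single UTF-16 code unit is a character that XML 1.0 allows on its own.
    private static bool IsLegalXmlChar(char c)
    {
        return c == '\t' || c == '\n' || c == '\r'
            || (c >= '\u0020' && c <= '\uD7FF')
            || (c >= '\uE000' && c <= '\uFFFD');
    }

    // Rewrites every character that XML 1.0 forbids as _xHHHH_ and leaves the rest alone.
    public static string Escape(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        var sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair encodes U+10000..U+10FFFF, which XML 1.0 allows.
                sb.Append(c);
                sb.Append(text[i + 1]);
                i++;
            }
            else if (IsLegalXmlChar(c))
            {
                sb.Append(c);
            }
            else
            {
                // Control characters, lone surrogates, U+FFFE and U+FFFF end up here.
                sb.Append("_x").Append(((int)c).ToString("X4")).Append('_');
            }
        }
        return sb.ToString();
    }
}
```

With this sketch, `XstringEscaper.Escape("a\u0008b")` returns `a_x0008_b`, while ordinary text such as `Tom's <i>ball</i>` comes back unchanged, because apostrophes and angle brackets are legal XML characters and are handled by normal entity escaping instead.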
Another thing to be careful of: XML does not have built-in support for all of the named character entities that we are accustomed to using in HTML. Only a small number are always understood in XML (i.e., &lt;, &gt;, and &amp;). Since there are no DTDs in OOXML, you can't define more named entities. Instead, you have to introduce anything else into a string using a numeric character reference (e.g., &#x20; for a space, &#xA0; for a non-breaking space, etc.) or else enter the Unicode character directly into the string.
You can use &quot; for a double quote and &apos; for an apostrophe if you need to get one of those into an attribute value that is surrounded by the same kind of quote symbol.
There are some Unicode code points that are flat-out forbidden in XML data streams. To insert those codes into the value of a string, OOXML has its own escape mechanism for those codes and those codes only. I don't think that provision is involved in this question.
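One way to follow the "enter the Unicode character directly" advice is to decode HTML named entities before the text goes into the cell. The sketch below assumes .NET 4.0 or later, where `System.Net.WebUtility.HtmlDecode` is available; the wrapper class `EntityNormalizer` is a hypothetical name.

```csharp
using System;
using System.Net;

public static class EntityNormalizer
{
    // Turns HTML entities such as &eacute; or &nbsp; into the Unicode characters they stand for.
    public static string ToUnicode(string htmlText)
    {
        return WebUtility.HtmlDecode(htmlText);
    }
}
```

For example, `EntityNormalizer.ToUnicode("caf&eacute;&nbsp;au lait")` returns the string with a literal é and a U+00A0 non-breaking space. Note that decoding also turns `&lt;` and `&gt;` back into angle brackets, so the result is plain text that still has to be XML-escaped once when it is written out.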