Rich text format (with formatting tags) in Excel to unformatted text
I have approximately 12,000 cells in Excel containing RTF (including formatting tags). I need to parse them to get the unformatted text.
Here is an example of the contents of one of the cells:
{\rtf1\ansi\deflang1060\ftnbj\uc1
{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}{\f1 \fswiss \fcharset238
Arial;}}
{\colortbl ;\red255\green255\blue255 ;\red0\green0\blue0 ;}
{\stylesheet{\fs24\cf2\cb1 Normal;}{\cs1\cf2\cb1 Default Paragraph Font;}}
\paperw11908\paperh16833\margl1800\margr1800\margt1440\margb1440\headery720\footery720
\deftab720\formshade\aendnotes\aftnnrlc\pgbrdrhead\pgbrdrfoot
\sectd\pgwsxn11908\pghsxn16833\marglsxn1800\margrsxn1800\margtsxn1440\margbsxn1440
\headery720\footery720\sbkpage\pgncont\pgndec
\plain\plain\f1\fs24\pard TPR 0160 000\par IPR 0160 000\par OB-R-02-28\par}
And all I really need is this:
TPR 0160 000
IPR 0160 000
OB-R-02-28
The problem with simply looping over the cells and removing the unnecessary formatting is that not everything in those 12,000 cells is as straightforward as this example. I would need to inspect many different variants manually and write several variations of the clean-up, and there would still be a lot of manual work left at the end.
But if I copy the contents of one cell into an empty text document, save it as RTF, and then open it with MS Word, Word instantly parses the text and I get exactly what I want. Unfortunately, doing that for 12,000 cells is extremely inconvenient.
So I was thinking about a VBA macro that moves the cell contents to Word, forces the parsing, and then copies the result back to the originating cell. Unfortunately, I'm not really sure how to do that.
Does anybody have any ideas? Or a different approach? I would be really grateful for a solution or a push in the right direction.
Thanks!
Comments (4)
If you do want to go down the route of using Word to parse the text, this function should help you out. As its comments suggest, you'll need a reference to the MS Word Object Library.
You could call it for each of your 12,000 cells using something similar to this:
The ParseRTF function takes about a second to run (on my machine at least), so for 12,000 cells this works out to about three and a half hours.
Having thought about this problem over the weekend, I was sure there was a better (quicker) solution.
I remembered the RTF capabilities of the clipboard, and realised that a class could be created that copies RTF data to the clipboard, pastes it into a Word document, and outputs the resulting plain text. The benefit of this solution is that the Word document object does not have to be opened and closed for each RTF string; it can be opened before the loop and closed after it.
Below is the code to achieve this. It is a class module named clsRTFParser.
You could call it for each of your 12,000 cells using something similar to this:
I simulated this using example RTF strings on my machine. For 12,000 cells it took two and a half minutes, a much more reasonable time frame!
You can try parsing every cell with a regular expression, leaving only the content you need.
Every RTF control word starts with "\" and ends at a space or at the next "\" or brace, with no space inside it. "{}" are used for grouping. If your text doesn't contain any braces, you can simply remove them (the same goes for ";"). That leaves you with your text plus some unnecessary words such as "Arial", "Normal", etc. You can build a dictionary to remove those as well. After some tweaking, you will be left with only the text you need.
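The regex clean-up described above can be sketched as follows. This is a minimal illustration in Python rather than VBA (the same patterns should carry over to VBScript.RegExp with small changes), and it is only as robust as the answer's description: the function name `strip_rtf_naive`, the `NOISE_WORDS` list, and the shortened `SAMPLE` string are assumptions made for the example, not part of the original answer.

```python
import re

# Words left behind by the font table and stylesheet. These entries are
# assumptions taken from the sample cell; extend the tuple for your data.
NOISE_WORDS = ("Times New Roman", "Default Paragraph Font", "Arial", "Normal")


def strip_rtf_naive(rtf: str) -> str:
    """Hypothetical helper: crude regex clean-up of one RTF string."""
    # Turn paragraph marks into line breaks first (but leave \pard alone).
    text = re.sub(r"\\par(?![a-zA-Z]) ?", "\n", rtf)
    # Drop control words: backslash, letters, optional number, optional space.
    text = re.sub(r"\\[a-zA-Z]+-?\d* ?", "", text)
    # Drop grouping braces and the semicolons that end table entries.
    text = re.sub(r"[{};]", "", text)
    # Drop the leftover font and style names.
    for word in NOISE_WORDS:
        text = text.replace(word, "")
    # Tidy whitespace: trim each line and discard empty ones.
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Shortened version of the cell shown in the question.
SAMPLE = (
    r"{\rtf1\ansi\deflang1060{\fonttbl{\f0 \froman \fcharset0 Times New Roman;}"
    r"{\f1 \fswiss \fcharset238 Arial;}}"
    r"{\stylesheet{\fs24\cf2\cb1 Normal;}}"
    r"\plain\f1\fs24\pard TPR 0160 000\par IPR 0160 000\par OB-R-02-28\par}"
)

print(strip_rtf_naive(SAMPLE))
```

The weak points are the ones the answer hints at: any ";" or noise word that occurs in the real text is deleted too, and hex escapes such as `\'e8` are not decoded, so some tweaking per data set is to be expected.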
See http://www.regular-expressions.info/ for more information and a great tool for writing regular expressions (RegexBuddy; unfortunately it isn't free, but it's worth the money, and as far as I remember there is also a trial version).
UPDATE: Of course, I'm not suggesting that you do this manually for every cell. Just iterate through the active range:
See this thread:
SO: About iterating through cells in VBA
Personally, I would give this idea a try:
And how do you use regular expressions in VBA (Excel)?
See:
Regex functions in Excel
and
Regex in VBA
Basically, you have to use the VBScript.RegExp object through COM.
Some of the solutions here require a reference to the MS Word Object Library. Playing the cards I was dealt, I found a solution that does not rely on it. It strips RTF tags, and other fluff like font tables and stylesheets, all in VBA. It might be helpful to you. I ran it across your data and, other than the whitespace, I get the same output as you expected.
Here is the code.
First, something to check whether a string is alphanumeric or not. Give it a string that's one character long. This function is used to work out where tags end.
Next up is a function to remove an entire group. I use this to remove font tables and other rubbish.
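Group removal comes down to brace matching: find the opening `{\keyword`, then count braces until the depth returns to zero. A minimal sketch of that idea, written in Python rather than the answer's VBA; `remove_group` and the shortened `SAMPLE` string are hypothetical names chosen for this illustration.

```python
def remove_group(rtf: str, keyword: str) -> str:
    """Hypothetical helper: delete every group that opens with {\\keyword."""
    opener = "{\\" + keyword
    result = []
    pos = 0
    while True:
        start = rtf.find(opener, pos)
        if start == -1:
            result.append(rtf[pos:])       # no more groups: keep the rest
            break
        end_kw = start + len(opener)
        if end_kw < len(rtf) and rtf[end_kw].isalpha():
            # A longer control word with the same prefix: keep it, move on.
            result.append(rtf[pos:end_kw])
            pos = end_kw
            continue
        result.append(rtf[pos:start])      # keep the text before the group
        depth = 0
        i = start
        while i < len(rtf):
            ch = rtf[i]
            if ch == "\\":
                i += 2                     # skip escapes such as \{ and \}
                continue
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:             # matching close brace found
                    i += 1
                    break
            i += 1
        pos = i                            # resume after the removed group
    return "".join(result)


# Shortened version of the cell shown in the question.
SAMPLE = (
    r"{\rtf1{\fonttbl{\f0 \froman Times New Roman;}{\f1 \fswiss Arial;}}"
    r"{\stylesheet{\fs24 Normal;}}\f1\pard TPR 0160 000\par}"
)

cleaned = remove_group(remove_group(SAMPLE, "fonttbl"), "stylesheet")
print(cleaned)
```

Removing whole groups first means the font and style names never reach the later steps, so no word list is needed to filter them out afterwards.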
And this function removes any tags.
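Tag removal can be sketched with a single tokenising regex. Again this is Python rather than the answer's VBA, and a sketch under stated assumptions: `strip_tags` is a hypothetical name, `\par` and `\line` are turned into line breaks, and hex escapes (`\'hh`) are decoded with a code page that defaults to cp1252 here (for the sample's `\fcharset238` font, cp1250 would probably be the better guess). Raw line breaks inside the RTF source are left untouched.

```python
import re

# One alternative per token type: hex escape, control word, control symbol.
TOKEN = re.compile(
    r"\\'([0-9a-fA-F]{2})"          # \'hh  : one byte in the document code page
    r"|\\([a-zA-Z]+)(-?\d+)? ?"     # \word : letters, optional number, optional space
    r"|\\(.)",                      # \x    : single-character control symbol
    re.DOTALL,
)


def strip_tags(rtf: str, encoding: str = "cp1252") -> str:
    """Hypothetical helper: remove control words and symbols from RTF text."""

    def replace(match: re.Match) -> str:
        hex_byte, word, _param, symbol = match.groups()
        if hex_byte is not None:
            # Decode the escaped byte; the code page is an assumption.
            return bytes.fromhex(hex_byte).decode(encoding, errors="replace")
        if word is not None:
            # Paragraph and line breaks become newlines, other words vanish.
            return "\n" if word in ("par", "line") else ""
        # Escaped literals survive; other control symbols are dropped.
        return symbol if symbol in "{}\\" else ""

    return TOKEN.sub(replace, rtf)


BODY = r"\plain\f1\fs24\pard TPR 0160 000\par IPR 0160 000\par OB-R-02-28\par"
print(strip_tags(BODY))
```

Because the control-word pattern consumes the whole run of letters, `\pard` is treated as its own word and is not mistaken for `\par` followed by a "d".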
We can remove curly braces in the obvious way:
Once you have copy-pasted the functions above into your module, you can create a function that uses them to strip away anything you don't need or want. The following works perfectly in my case.
I hope this helps. I wouldn't use it in a word processor or anything, but it might do for scraping data, if that's what you're doing.
Your post made it sound as if each RTF document is stored in a single Excel cell. If so, then "Solution using .Net Framework RichTextBox control" will convert the RTF in each cell to plain text in two lines of code (after a little system configuration to get the right .tlb file so that the .NET Framework can be referenced). Put the cell value in rtfsample and