Is there a way to read a Word document line by line?
I am trying to extract all the words in a Word document. I am able to do it all in one go as follows...
// Assumes: using Word = Microsoft.Office.Interop.Word;
// doc, wordPosition, index and charLocation are declared elsewhere.
Word.Application word = new Word.Application();
doc = word.Documents.Open(@"C:\SampleText.doc");
doc.Activate();
foreach (Word.Range docRange in doc.Words) // loads all words in document
{
    // Every suffix of the current word, in a custom sort order.
    IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
        .Select(i => docRange.Text.Substring(i))
        .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));
    // Column number of the first character of the word.
    wordPosition = (int)docRange.get_Information(
        Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);
    foreach (var substring in sortedSubstrings)
    {
        index = docRange.Text.IndexOf(substring) + wordPosition;
        charLocation[index] = substring;
    }
}
However, I would prefer to load the document one line at a time. Is it possible to do so?
I can load it by paragraph, but I am unable to iterate over a paragraph to extract all of its words.
foreach (Word.Paragraph para in doc.Paragraphs)
{
    foreach (Word.Range docRange in para) // Error: Word.Paragraph is not enumerable
    {
        IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
            .Select(i => docRange.Text.Substring(i))
            .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));
        wordPosition = (int)docRange.get_Information(
            Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);
        foreach (var substring in sortedSubstrings)
        {
            index = docRange.Text.IndexOf(substring) + wordPosition;
            charLocation[index] = substring;
        }
    }
}
This helps you get the text line by line.
Here's the genius behind the code. Follow the link for more explanation of how it works.
I would suggest following the code on the linked page.
The crux of it is that you read the document with a Word.ApplicationClass (Microsoft.Office.Interop.Word) object, although where he's getting the "Doc" object is beyond me. I would assume you create it with the ApplicationClass.
EDIT: The document is retrieved by calling this:
Sadly, the code on the page I linked is not formatted in a way that is easy to read.
EDIT2: From there you can loop through the document's paragraphs; however, as far as I can see, there is no way of looping through lines. I would suggest using some pattern matching to find line breaks.
To extract the text from a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside the paragraph. Then you must search for line break characters. I'd use string.IndexOf().
Alternatively, if by lines you mean one sentence at a time, you can simply iterate through Range.Sentences.
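The line break search described above could look roughly like the sketch below. It assumes the paragraph text comes from Word.Paragraph.Range.Text, where Word marks a manual line break (Shift+Enter) with a vertical tab ('\v') and ends the paragraph with '\r'. The class and method names (ParagraphLines, Split) are made up for illustration. Note that lines produced by automatic word wrapping are not stored in the text at all, so this only finds manual breaks.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphLines
{
    // Word represents a manual line break (Shift+Enter) as a vertical tab
    // and terminates every paragraph with a carriage return.
    private const char ManualLineBreak = '\v';
    private const char ParagraphMark = '\r';

    // Splits the text of one paragraph (e.g. para.Range.Text) into the
    // pieces separated by manual line breaks, using string.IndexOf().
    public static List<string> Split(string paragraphText)
    {
        var lines = new List<string>();
        if (string.IsNullOrEmpty(paragraphText))
        {
            return lines;
        }

        // Drop the trailing paragraph mark before searching.
        string body = paragraphText.TrimEnd(ParagraphMark);
        int start = 0;
        while (true)
        {
            int breakAt = body.IndexOf(ManualLineBreak, start);
            if (breakAt < 0)
            {
                // No further break: the rest of the text is the last line.
                lines.Add(body.Substring(start));
                break;
            }
            lines.Add(body.Substring(start, breakAt - start));
            start = breakAt + 1;
        }
        return lines;
    }
}
```

Inside the paragraph loop from the question, this would be called as ParagraphLines.Split(para.Range.Text).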
Increment the count for each line
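The code that went with this comment is not shown here, so the following is only a guess at the general idea: start a new line whenever the page or line number reported for a word changes. It assumes each word's position has already been read from Word, for example with get_Information and WdInformation.wdActiveEndPageNumber / WdInformation.wdFirstCharacterLineNumber while looping over doc.Words. The types and names (LineGrouping, WordPosition, ToLines) are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LineGrouping
{
    // One word together with the page and line numbers reported for it.
    public sealed class WordPosition
    {
        public string Text { get; }
        public int Page { get; }
        public int Line { get; }

        public WordPosition(string text, int page, int line)
        {
            Text = text;
            Page = page;
            Line = line;
        }
    }

    // Concatenates consecutive words that share the same (page, line) pair.
    // The number of lines is simply the Count of the returned list.
    public static List<string> ToLines(IEnumerable<WordPosition> words)
    {
        var lines = new List<string>();
        var current = new StringBuilder();
        WordPosition previous = null;

        foreach (var word in words)
        {
            bool startsNewLine = previous != null
                && (word.Page != previous.Page || word.Line != previous.Line);
            if (startsNewLine)
            {
                // The line changed: store the finished line and start over.
                lines.Add(current.ToString());
                current.Clear();
            }
            current.Append(word.Text);
            previous = word;
        }

        // Store the last line, if any words were seen at all.
        if (previous != null)
        {
            lines.Add(current.ToString());
        }
        return lines;
    }
}
```

The line numbers Word reports are relative to the page and depend on its current pagination, which is why the sketch keys on the page number as well.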
I recommend using the DocX library. It is lightweight and doesn't require Word to be installed on the machine. Here is the code I use to get text line by line: