Is there a way to read a Word document line by line?

Posted 2024-11-27 13:03:32

I am trying to extract all the words in a Word document. I am able to do it all in one go as follows...

Word.Application word = new Word.Application();
doc = word.Documents.Open(@"C:\SampleText.doc");
doc.Activate();

foreach (Word.Range docRange in doc.Words) // loads all words in document
{
    IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
        .Select(i => docRange.Text.Substring(i))
        .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

    wordPosition =
        (int)
        docRange.get_Information(
            Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

    foreach (var substring in sortedSubstrings)
    {
        index = docRange.Text.IndexOf(substring) + wordPosition;
        charLocation[index] = substring;
    }
}

However, I would prefer to load the document one line at a time. Is that possible?

I can load it by paragraph, but I am unable to iterate through a paragraph to extract its words.

foreach (Word.Paragraph para in doc.Paragraphs)
{
    foreach (Word.Range docRange in para) // Error: type Word.Paragraph is not enumerable
    {
        IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
            .Select(i => docRange.Text.Substring(i))
            .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

        wordPosition =
            (int)
            docRange.get_Information(
                Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

        foreach (var substring in sortedSubstrings)
        {
            index = docRange.Text.IndexOf(substring) + wordPosition;
            charLocation[index] = substring;
        }

    }
}

远昼 2024-12-04 13:03:32

This helps you read the text line by line.

    object file = Path.GetDirectoryName(Application.ExecutablePath) + @"\Answer.doc";

    Word.Application wordObject = new Word.ApplicationClass();
    wordObject.Visible = false;

    object nullobject = Missing.Value;
    Word.Document docs = wordObject.Documents.Open
        (ref file, ref nullobject, ref nullobject, ref nullobject,
        ref nullobject, ref nullobject, ref nullobject, ref nullobject,
        ref nullobject, ref nullobject, ref nullobject, ref nullobject,
        ref nullobject, ref nullobject, ref nullobject, ref nullobject);

    String strLine;
    bool bolEOF = false;

    docs.Characters[1].Select();

    int index = 0;
    do
    {
        object unit = Word.WdUnits.wdLine;
        object count = 1;
        wordObject.Selection.MoveEnd(ref unit, ref count);

        strLine = wordObject.Selection.Text;
        richTextBox1.Text += ++index + " - " + strLine + "\r\n"; //for our understanding

        object direction = Word.WdCollapseDirection.wdCollapseEnd;
        wordObject.Selection.Collapse(ref direction);

        if (wordObject.Selection.Bookmarks.Exists(@"\EndOfDoc"))
            bolEOF = true;
    } while (!bolEOF);

    docs.Close(ref nullobject, ref nullobject, ref nullobject);
    wordObject.Quit(ref nullobject, ref nullobject, ref nullobject);
    docs = null;
    wordObject = null;

Credit goes to the original author of this code. Follow the link for more explanation of how it works.

暖风昔人 2024-12-04 13:03:32

I would suggest following the code on the linked page.

The crux of it is that you read the document with a Word.ApplicationClass (Microsoft.Office.Interop.Word) object, although where the author gets the "Doc" object is beyond me. I would assume you create it with the ApplicationClass.

EDIT: Document is retrieved by calling this:

Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj,
                                      ref nullobj, ref nullobj, ref nullobj);

Sadly, the formatting of the code on the page I linked isn't all that easy to read.

EDIT2: From there you can loop through the document's paragraphs; however, as far as I can see, there is no way of looping through lines. I would suggest using some pattern matching to find line breaks.

To extract the text from a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside the paragraph. Then you must search for line-break characters. I'd use string.IndexOf().
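The search for line-break characters described above can be sketched in plain C#, without touching the interop layer. This is a minimal sketch, not the answerer's code: the class and method names (ParagraphLines, Split) are made up for illustration, and it assumes that Word reports a manual line break (Shift+Enter) as a vertical tab ('\v') and a paragraph mark as a carriage return ('\r') in Range.Text. It uses string.Split rather than repeated string.IndexOf() calls, which gives the same result with less bookkeeping.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphLines
{
    // Assumption: in Range.Text a manual line break is '\v' (char 11) and a
    // paragraph mark is '\r' (char 13); '\n' is included for text from other sources.
    private static readonly char[] BreakChars = { '\v', '\r', '\n' };

    // Splits paragraph text on explicit break characters, trims each piece,
    // and drops pieces that are empty or whitespace only.
    public static List<string> Split(string paragraphText)
    {
        var lines = new List<string>();
        if (string.IsNullOrEmpty(paragraphText))
            return lines;

        foreach (string piece in paragraphText.Split(BreakChars, StringSplitOptions.RemoveEmptyEntries))
        {
            string trimmed = piece.Trim();
            if (trimmed.Length > 0)
                lines.Add(trimmed);
        }
        return lines;
    }
}
```

Calling ParagraphLines.Split(para.Range.Text) for each paragraph would then yield its explicitly broken lines. Note that this only finds breaks that exist as characters in the text. Lines produced by automatic wrapping in the page layout have no character to search for, so they cannot be recovered this way.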

Alternatively, if by "lines" you mean extracting one sentence at a time, you can simply iterate through Range.Sentences.
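For the per-paragraph word loop from the question, the same idea applies: a Paragraph is not enumerable itself, but its Range exposes enumerable Words and Sentences collections. The following is a rough sketch under that assumption; the class and method names (ParagraphWords, PrintWords) are hypothetical, and doc is assumed to be a document that has already been opened as shown earlier. It requires Word and the interop assembly to be installed.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

public static class ParagraphWords
{
    // Prints each word of each paragraph. 'doc' is assumed to be already open.
    public static void PrintWords(Word.Document doc)
    {
        foreach (Word.Paragraph para in doc.Paragraphs)
        {
            // Enumerate para.Range.Words, not para itself.
            foreach (Word.Range wordRange in para.Range.Words)
            {
                // Word ranges usually carry trailing whitespace, so trim it.
                string text = (wordRange.Text ?? string.Empty).Trim();
                if (text.Length > 0)
                    Console.WriteLine(text);
            }
        }
    }
}
```

Replacing para.Range.Words with para.Range.Sentences in the inner loop would walk the paragraph one sentence at a time instead.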

风吹短裙飘 2024-12-04 13:03:32

        Microsoft.Office.Interop.Word.Application word = new Microsoft.Office.Interop.Word.Application();
        object miss = System.Reflection.Missing.Value;
        object path = @"D:\viewstate.docx";
        object readOnly = true;
        Microsoft.Office.Interop.Word.Document docs = word.Documents.Open(ref path, ref miss, ref readOnly, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss, ref miss);
        string totaltext = "";

        object unit = Microsoft.Office.Interop.Word.WdUnits.wdLine;
        object count = 1;
        word.Selection.MoveEnd(ref unit, ref count);
        totaltext = word.Selection.Text;

        TextBox1.Text = totaltext;
        docs.Close(ref miss, ref miss, ref miss);
        word.Quit(ref miss, ref miss, ref miss);
        docs = null;
        word = null;

This reads the first line. Increment count for each additional line you want to read.

迷途知返 2024-12-04 13:03:32

I recommend using the DocX library. It is lightweight and doesn't require Word to be installed on the machine. Here is the code I use to get the text line by line:

using(DocX doc = DocX.Load("sample.docx"))
{
     for (int i = 0; i < doc.Paragraphs.Count; i++ )
     {
          foreach (var item in doc.Paragraphs[i].Text.Split(new string[]{"\n"}
                    , StringSplitOptions.RemoveEmptyEntries))
          {
                Console.WriteLine(item);
          }
     }
}
