How do I extract bullet information from a Word document?

Posted on 2024-08-23 01:35:22

I want to extract the bullet information present in a Word document.
I want something like this.
Suppose the text below is in a Word document:

Steps to Start car :

  • Open door
  • Sit inside
  • Close the door
  • Insert key
  • etc.

Then I want my text file to look like this:

Steps to Start car :

<BULLET> Open door </BULLET>

<BULLET> Sit inside </BULLET>

<BULLET> Close the door </BULLET>

<BULLET> Insert key </BULLET>

<BULLET> etc. </BULLET>

I am using C# to do this.

I can extract paragraphs from a Word document and write them directly to a text file with some formatting information, such as whether the text is bold or italic, but I don't know how to extract this bullet information.

Can anyone please tell me how to do this?

Thanks in advance.

Comments (3)

空心空情空意 2024-08-30 01:35:22

You can do it by reading each sentence. doc.Sentences is a collection of Range objects, and every Paragraph exposes the same kind of Range object, so you can read the list information from each paragraph's Range.

        foreach (Paragraph para in oDoc.Paragraphs)
        {
            string paraNumber = para.Range.ListFormat.ListLevelNumber.ToString();
            string bulletStr = para.Range.ListFormat.ListString;
            MessageBox.Show(paraNumber + "\t" + bulletStr + "\t" + para.Range.Text);
        }

paraNumber holds the list level of the paragraph, and bulletStr holds the bullet (or list number) as a string.
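
The loop above can be extended to produce the tagged output the question asks for. The following is only a minimal sketch and not part of the original answer: `BulletTagger`, `TagParagraph`, and the `isListItem` flag are hypothetical names, and the tag name is a parameter so it can match whatever marker you prefer. It assumes the text comes from `para.Range.Text`, which ends with a paragraph mark (`\r`) and, inside table cells, an end-of-cell mark (`\a`).

```csharp
public static class BulletTagger
{
    // Wraps a list paragraph in <tag> ... </tag>; other paragraphs pass through unchanged.
    // isListItem is supplied by the caller (hypothetical flag, see the note below).
    public static string TagParagraph(string text, bool isListItem, string tag = "BULLET")
    {
        // Strip the trailing paragraph mark and, for table cells, the end-of-cell mark.
        string trimmed = (text ?? string.Empty).TrimEnd('\r', '\n', '\a');

        return isListItem
            ? "<" + tag + "> " + trimmed + " </" + tag + ">"
            : trimmed;
    }
}
```

Inside the `foreach`, one way to compute the flag is `para.Range.ListFormat.ListType != WdListType.wdListNoNumbering`, and the result of `BulletTagger.TagParagraph(para.Range.Text, isListItem)` can then be written to the text file. That comparison is also true for numbered lists, so compare against `WdListType.wdListBullet` instead if only bullets should be tagged.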

花伊自在美 2024-08-30 01:35:22

I am using the Open-Xml-PowerTools library by Eric White. It is free and available as a NuGet package; you can install it from the Visual Studio package manager.

He provides a ready-to-use code snippet. This tool has saved me many hours. Below is how I customized the snippet for my requirements.
In fact, you can use these methods as-is in your project.

 private static WordprocessingDocument _wordDocument;
 private StringBuilder textItemSB = new StringBuilder();
 private List<string> textItemList = new List<string>();


/// <summary>
/// Opens the Word document with the Open XML SDK and reads all content from the document body.
/// wordFileStream and logger are not shown in this snippet and must be defined elsewhere.
/// </summary>
private void GetDocumentBodyContents()
{
    string modifiedString = string.Empty;
    List<string> allList = new List<string>();
    List<string> allListText = new List<string>();

    try
    {
        _wordDocument = WordprocessingDocument.Open(wordFileStream, false);
        //RevisionAccepter.AcceptRevisions(_wordDocument);
        XElement root = _wordDocument.MainDocumentPart.GetXDocument().Root;
        XElement body = root.LogicalChildrenContent().First();
        OutputBlockLevelContent(_wordDocument, body);
    }
    catch (Exception ex)
    {
        logger.Error("ERROR in GetDocumentBodyContents:" + ex.Message.ToString());
    }
}


// This is a recursive method. On each iteration it tries to fetch the list item and the text item.
// Once you have these items in hand, you can manipulate them and build your own collection.
private void OutputBlockLevelContent(WordprocessingDocument wordDoc, XElement blockLevelContentContainer)
{
    try
    {
        string listItem = string.Empty, itemText = string.Empty, numberText = string.Empty;
        foreach (XElement blockLevelContentElement in
            blockLevelContentContainer.LogicalChildrenContent())
        {
            if (blockLevelContentElement.Name == W.p)
            {
                listItem = ListItemRetriever.RetrieveListItem(wordDoc, blockLevelContentElement);
                itemText = blockLevelContentElement
                    .LogicalChildrenContent(W.r)
                    .LogicalChildrenContent(W.t)
                    .Select(t => (string)t)
                    .StringConcatenate();
                if (itemText.Trim().Length > 0)
                {
                    if (null == listItem)
                    {
                        // Add html break tag 
                        textItemSB.Append( itemText + "<br/>");
                    }
                    else
                    {
                        //if listItem == "" bullet character, replace it with equivalent html encoded character                                   
                        textItemSB.Append("          " + (listItem == "" ? "•" : listItem) + "     " + itemText + "<br/>");
                    }
                }
                else if (null != listItem)
                {
                    //If bullet character is found, replace it with equivalent html encoded character  
                    textItemSB.Append(listItem == "" ? "          •" : listItem);
                }
                else
                    textItemSB.Append("<blank>");
                continue;
            }
            // If element is not a paragraph, it must be a table.

            foreach (var row in blockLevelContentElement.LogicalChildrenContent())
            {                        
                foreach (var cell in row.LogicalChildrenContent())
                {                            
                    // Cells are a block-level content container, so can call this method recursively.
                    OutputBlockLevelContent(wordDoc, cell);
                }
            }
        }
        if (textItemSB.Length > 0)
        {
            textItemList.Add(textItemSB.ToString());
            textItemSB.Clear();
        }
    }
    catch (Exception ex)
    {
        // .....
    }
}
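
For comparison, list paragraphs can also be spotted directly in the document XML without resolving the list item text. The following is a rough sketch under stated assumptions, not the approach of the answer above: `ListParagraphTagger` and `TagListParagraphs` are hypothetical names, and the body element is assumed to be obtained the same way as in `GetDocumentBodyContents`. It only looks for a `w:numPr` element directly inside the paragraph's `w:pPr`, so numbering inherited from a paragraph style is missed, tables are not descended into, and bullets are not distinguished from numbered items.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ListParagraphTagger
{
    // WordprocessingML main namespace.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns one output line per non-empty top-level paragraph of the body.
    public static List<string> TagListParagraphs(XElement body, string tag = "BULLET")
    {
        var lines = new List<string>();
        foreach (XElement p in body.Elements(W + "p"))
        {
            // Concatenate the text of all w:t elements in the paragraph.
            string text = string.Concat(p.Descendants(W + "t").Select(t => (string)t));
            if (text.Trim().Length == 0)
                continue;

            // A paragraph with direct numbering properties belongs to a list.
            bool isListItem = p.Element(W + "pPr")?.Element(W + "numPr") != null;
            lines.Add(isListItem ? $"<{tag}> {text} </{tag}>" : text);
        }
        return lines;
    }
}
```

Deciding whether an item is a bullet or a number, or picking up numbering that comes from a style, requires resolving the numbering definitions, which this sketch does not attempt.
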
浪荡不羁 2024-08-30 01:35:22

I found the answer.

At first I was converting the document paragraph by paragraph. If we instead process the doc file sentence by sentence, it is possible to determine whether a sentence contains a bullet or any kind of shape, or whether it is part of a table. Once we have this information, we can convert that sentence appropriately. If anyone needs the source code, I can share it.
