当前位置：文江博客话题详情

如何从word文档中提取项目符号信息？

发布于 2024-08-23 01:35:22 字数 450 浏览 9 评论 0原文

我想提取word文档中存在的项目符号信息。我想要这样的东西：假设下面的文本位于Word文档中：

启动汽车的步骤：

开门
坐在里面
关上门
插入钥匙
等。

然后我想要我的文本文件如下所示：

启动汽车的步骤：

打开门

<子弹>坐在里面

<子弹>关上门

<子弹>插入键

<子弹>等等

我正在使用 C# 语言来执行此操作。

我可以从Word文档中提取段落并直接将它们写入文本文件中，并带有一些格式信息，例如文本是粗体还是斜体等，但不知道如何提取此项目符号信息。

谁能告诉我该怎么做吗？

提前致谢

原文

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

空心空情空意 2024-08-30 01:35:22

你可以通过阅读每个句子来做到这一点。 doc.Sentences 是 Range 对象的数组。所以你可以从 Paragraph 中获取相同的 Range 对象。

        foreach (Paragraph para in oDoc.Paragraphs)
        {
            string paraNumber = para.Range.ListFormat.ListLevelNumber.ToString();
            string bulletStr = para.Range.ListFormat.ListString;
            MessageBox.Show(paraNumber + "\t" + bulletStr + "\t" + para.Range.Text);
        }

在 paraNumber 中，您可以获得段落级别，在 buttetStr 中，您可以获得字符串形式的项目符号。

You can do it by reading each sentence. doc.Sentences is an array of Range object. So you can get same Range object from Paragraph.

        foreach (Paragraph para in oDoc.Paragraphs)
        {
            string paraNumber = para.Range.ListFormat.ListLevelNumber.ToString();
            string bulletStr = para.Range.ListFormat.ListString;
            MessageBox.Show(paraNumber + "\t" + bulletStr + "\t" + para.Range.Text);
        }

Into paraNumber you can get paragraph level and into buttetStr you can get bullet as string.

回复收藏 0 原文

花伊自在美 2024-08-30 01:35:22

我正在使用这个 OpenXMLPower 工具，由 Eric White 开发。它是免费的，可通过 NUGet 包获取。您可以从 Visual Studio 包管理器安装它。

他提供了一个现成的代码片段。这个工具节省了我很多时间。以下是我定制的代码片段用于满足我的要求的方式。
事实上，您可以在项目中使用这些方法。

 private static WordprocessingDocument _wordDocument;
 private StringBuilder textItemSB = new StringBuilder();
 private List<string> textItemList = new List<string>();


/// Open word document using office SDK and reads all contents from body of document
/// </summary>
/// <param name="filepath">path of file to be processed</param>
/// <returns>List of paragraphs with their text contents</returns>
private void GetDocumentBodyContents()
{
    string modifiedString = string.Empty;
    List<string> allList = new List<string>();
    List<string> allListText = new List<string>();

    try
    {
_wordDocument = WordprocessingDocument.Open(wordFileStream, false);
        //RevisionAccepter.AcceptRevisions(_wordDocument);
        XElement root = _wordDocument.MainDocumentPart.GetXDocument().Root;
        XElement body = root.LogicalChildrenContent().First();
        OutputBlockLevelContent(_wordDocument, body);
    }
    catch (Exception ex)
    {
        logger.Error("ERROR in GetDocumentBodyContents:" + ex.Message.ToString());
    }
}


// This is recursive method. At each iteration it tries to fetch listitem and Text item. Once you have these items in hand 
// You can manipulate and create your own collection.
private void OutputBlockLevelContent(WordprocessingDocument wordDoc, XElement blockLevelContentContainer)
{
    try
    {
        string listItem = string.Empty, itemText = string.Empty, numberText = string.Empty;
        foreach (XElement blockLevelContentElement in
            blockLevelContentContainer.LogicalChildrenContent())
        {
            if (blockLevelContentElement.Name == W.p)
            {
                listItem = ListItemRetriever.RetrieveListItem(wordDoc, blockLevelContentElement);
                itemText = blockLevelContentElement
                    .LogicalChildrenContent(W.r)
                    .LogicalChildrenContent(W.t)
                    .Select(t => (string)t)
                    .StringConcatenate();
                if (itemText.Trim().Length > 0)
                {
                    if (null == listItem)
                    {
                        // Add html break tag 
                        textItemSB.Append( itemText + "<br/>");
                    }
                    else
                    {
                        //if listItem == "" bullet character, replace it with equivalent html encoded character                                   
                        textItemSB.Append("          " + (listItem == "" ? "•" : listItem) + "     " + itemText + "<br/>");
                    }
                }
                else if (null != listItem)
                {
                    //If bullet character is found, replace it with equivalent html encoded character  
                    textItemSB.Append(listItem == "" ? "          •" : listItem);
                }
                else
                    textItemSB.Append("<blank>");
                continue;
            }
            // If element is not a paragraph, it must be a table.

            foreach (var row in blockLevelContentElement.LogicalChildrenContent())
            {                        
                foreach (var cell in row.LogicalChildrenContent())
                {                            
                    // Cells are a block-level content container, so can call this method recursively.
                    OutputBlockLevelContent(wordDoc, cell);
                }
            }
        }
        if (textItemSB.Length > 0)
        {
            textItemList.Add(textItemSB.ToString());
            textItemSB.Clear();
        }
    }
    catch (Exception ex)
    {
        .....
    }
}

I am using this OpenXMLPower tool by Eric White. Its free and available at NUGet package. you can install it from Visual studio package manager.

He has provided a ready to use code snippet. This tool has saved me many hours. Below is the way I have customized code snippet to use for my requirement.
Infact you can use these methods as it in your project.

 private static WordprocessingDocument _wordDocument;
 private StringBuilder textItemSB = new StringBuilder();
 private List<string> textItemList = new List<string>();


/// Open word document using office SDK and reads all contents from body of document
/// </summary>
/// <param name="filepath">path of file to be processed</param>
/// <returns>List of paragraphs with their text contents</returns>
private void GetDocumentBodyContents()
{
    string modifiedString = string.Empty;
    List<string> allList = new List<string>();
    List<string> allListText = new List<string>();

    try
    {
_wordDocument = WordprocessingDocument.Open(wordFileStream, false);
        //RevisionAccepter.AcceptRevisions(_wordDocument);
        XElement root = _wordDocument.MainDocumentPart.GetXDocument().Root;
        XElement body = root.LogicalChildrenContent().First();
        OutputBlockLevelContent(_wordDocument, body);
    }
    catch (Exception ex)
    {
        logger.Error("ERROR in GetDocumentBodyContents:" + ex.Message.ToString());
    }
}


// This is recursive method. At each iteration it tries to fetch listitem and Text item. Once you have these items in hand 
// You can manipulate and create your own collection.
private void OutputBlockLevelContent(WordprocessingDocument wordDoc, XElement blockLevelContentContainer)
{
    try
    {
        string listItem = string.Empty, itemText = string.Empty, numberText = string.Empty;
        foreach (XElement blockLevelContentElement in
            blockLevelContentContainer.LogicalChildrenContent())
        {
            if (blockLevelContentElement.Name == W.p)
            {
                listItem = ListItemRetriever.RetrieveListItem(wordDoc, blockLevelContentElement);
                itemText = blockLevelContentElement
                    .LogicalChildrenContent(W.r)
                    .LogicalChildrenContent(W.t)
                    .Select(t => (string)t)
                    .StringConcatenate();
                if (itemText.Trim().Length > 0)
                {
                    if (null == listItem)
                    {
                        // Add html break tag 
                        textItemSB.Append( itemText + "<br/>");
                    }
                    else
                    {
                        //if listItem == "" bullet character, replace it with equivalent html encoded character                                   
                        textItemSB.Append("          " + (listItem == "" ? "•" : listItem) + "     " + itemText + "<br/>");
                    }
                }
                else if (null != listItem)
                {
                    //If bullet character is found, replace it with equivalent html encoded character  
                    textItemSB.Append(listItem == "" ? "          •" : listItem);
                }
                else
                    textItemSB.Append("<blank>");
                continue;
            }
            // If element is not a paragraph, it must be a table.

            foreach (var row in blockLevelContentElement.LogicalChildrenContent())
            {                        
                foreach (var cell in row.LogicalChildrenContent())
                {                            
                    // Cells are a block-level content container, so can call this method recursively.
                    OutputBlockLevelContent(wordDoc, cell);
                }
            }
        }
        if (textItemSB.Length > 0)
        {
            textItemList.Add(textItemSB.ToString());
            textItemSB.Clear();
        }
    }
    catch (Exception ex)
    {
        .....
    }
}

回复收藏 0 原文