Best way to get back to the power of lxml after having to use a regular expression to find something in an html document

Published 2024-08-24 20:32:45

I am trying to rip some text out of a large number of html documents (numbering in the hundreds of thousands). The documents are really forms, but they are prepared by a very large group of different organizations, so there is significant variation in how they create the documents. For example, the documents are divided into chapters. I might want to extract the contents of Chapter 5 from every document so I can analyze the content of that chapter. Initially I thought this would be easy, but it turns out that the authors might use a set of non-nested tables throughout the document to hold the content, so that Chapter n could be displayed using td tags inside a table. Or they might use other elements such as p tags, H tags, div tags, or any other block-level element.

After trying repeatedly to use lxml to help me identify the beginning and end of each chapter, I have determined that it is a lot cleaner to use a regular expression, because in every case, no matter what the enclosing html element is, the chapter label is always of the form

>Chapter #

It is a little more complicated in that there might be some whitespace or non-breaking space represented in different ways (&#160; or &nbsp; or just plain spaces). Nonetheless, it was trivial to write a regular expression to identify the beginning of each section. (The beginning of one section is the end of the previous section.)
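For illustration, a minimal sketch of such a pattern might look like the following in Python; the exact pattern, the entity handling, and the name CHAPTER_RE are my assumptions, not the author's actual expression.

import re

# Match ">Chapter" followed by any mix of whitespace or numeric/named
# non-breaking-space entities, then capture the chapter number.
CHAPTER_RE = re.compile(r'>Chapter(?:\s|&nbsp;|&#160;)+(\d+)', re.IGNORECASE)

sample = '<font style="...">Chapter 1.&#160;&#160;&#160;Our Beginnings.</font>'
for m in CHAPTER_RE.finditer(sample):
    print(m.group(1), m.start())   # chapter number and its offset in the raw html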

But now I want to use lxml to get the text out. My thought is that I really have no choice but to walk along my string to find the closing tag of the element that encloses the text I used to find the relevant section.

That is, here is one example where the element holding the chapter name is a div:

<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="left"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman">Chapter 1.   Our Beginnings.</font></div>

So I am imagining that I would begin at the location where I found the match for Chapter 1 and set up a regular expression to find the next closing tag:

</div|</td|</p|</h1 . . .

So at this point I have identified the type of element holding my chapter heading
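As a rough sketch of that closing-tag search (the tag list and the function name are my own assumptions, not the author's code):

import re

# First block-level closing tag after the heading match tells us which
# element type encloses the chapter heading.
CLOSE_TAG_RE = re.compile(r'</(div|td|p|h1|h2|h3|h4)\b', re.IGNORECASE)

def enclosing_element_type(html, heading_pos):
    """Return the tag name of the first closing tag found after heading_pos."""
    m = CLOSE_TAG_RE.search(html, heading_pos)
    return m.group(1).lower() if m else None

sample = '<div align="left"><font>Chapter 1.&#160;&#160;Our Beginnings.</font></div>'
print(enclosing_element_type(sample, sample.find('>Chapter')))   # -> 'div'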

I can use the same logic to find all of the text that is within that element, that is, set up a regular expression to help me mark from

>Chapter 1.&#160;&#160;&#160;Our Beginnings.<

So I have identified where my Chapter 1 begins

I can do the same for chapter 2 (which is where Chapter 1 ends)

Now I am imagining that I am going to snip the document, beginning at the opening of the element that I identified as indicating where Chapter 1 begins, and ending just before the opening of the element that I identified as indicating where Chapter 2 begins. The string that I have identified will then be fed to lxml to use its power to get the content.
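Here is a rough, self-contained sketch of that snip-then-parse idea; the helper names, the sample markup, and the assumption that the heading's element type is already known ('div' here) are all illustrative, not the author's implementation.

import re
import lxml.html

CHAPTER_RE = re.compile(r'>Chapter(?:\s|&nbsp;|&#160;)+(\d+)', re.IGNORECASE)

def chapter_slice(html, start_match, end_match, tag):
    """Cut from the opening of the element holding one chapter heading up to
    (but not including) the opening of the element holding the next one."""
    start = html.rfind('<' + tag, 0, start_match.start())
    end = html.rfind('<' + tag, 0, end_match.start())
    return html[start:end]

html = (
    '<div align="left"><font>Chapter 1.&#160;&#160;&#160;Our Beginnings.</font></div>'
    '<p>Body text for the first chapter.</p>'
    '<div align="left"><font>Chapter 2.&#160;&#160;&#160;Our Red Canary.</font></div>'
)
matches = list(CHAPTER_RE.finditer(html))
fragment = chapter_slice(html, matches[0], matches[1], 'div')
tree = lxml.html.fromstring(fragment)   # hand the snipped string back to lxml
print(tree.text_content())              # heading text plus the chapter body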

I am going to all of this trouble because I have read over and over: never use a regular expression to extract content from html documents. But I have not hit on a way to be as accurate with lxml at identifying the starting and ending locations of the text I want to extract. For example, I can never be certain that the subtitle of Chapter 1 is Our Beginnings; it could be Our Red Canary. Let me say that I spent two solid days trying with lxml to be confident that I had the beginning and ending elements, and I could only be accurate <60% of the time, but a very short regular expression has given me better than 95% success.

I have a tendency to make things more complicated than necessary, so I am wondering if anyone has seen or solved a similar problem, and whether they have an approach (not the details, mind you) that they would like to offer.

Comments (3)

七月上 2024-08-31 20:32:45

Sometimes there is not a straight path to getting the content when dealing with poorly or inconsistently written HTML.

You might want to look at using lynx or one of the other text-based browsers to dump the page content, either into a file or piped into your code, and then process it. Or you can use lxml to load and parse the page, then extract the text using text_content() and go after the chapters via regex.
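A rough sketch of that lxml-then-regex route might look like this; the sample markup, the pattern, and the chapter numbers are illustrative assumptions.

import re
import lxml.html

# Flatten the document to plain text, then carve one chapter out of the text.
html = (
    '<table><tr><td><b>Chapter 1.&#160;&#160;Our Beginnings.</b></td></tr>'
    '<tr><td>Body of the first chapter.</td></tr></table>'
    '<div><b>Chapter 2. Our Red Canary.</b></div>'
)
text = lxml.html.fromstring(html).text_content()

m = re.search(r'Chapter\s+1\b(.*?)(?=Chapter\s+2\b|$)', text, re.DOTALL)
print(m.group(1).strip() if m else None)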

Like they say, GIGO - garbage in, garbage out, and it's our job as developers to spin that garbage into gold. Doing so can get pretty messy.

注定孤独终老 2024-08-31 20:32:45

The simplest thing it sounds like you could possibly do is iterate over tree.getroot().iterdescendants() looking for a node whose node.text matches your desired regular expression. From that point, you can pass the node to a function that uses some ad-hoc heuristics to determine where the text is. (Maybe, if iterdescendants on root is too slow, you can use your regex approach and dive into etree to try to find an f(text_position) -> node function.)

For example, if you find that the target was a //tr/td, you can pass it to some table-text-finding subroutine that looks into the next td under node.getparent() to see if it has text that makes sense (approximately chapter-length, containing certain words, whatever). Likewise, you can make up some heuristics for finding the data in other tags like div and p. If you find yourself in an unknown tag like font, you can try bubbling up a limited number of levels to find something you know how to handle -- you have to be cautious not to bubble up too far, or I imagine you might accidentally retrieve text from another chapter.
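A loose sketch of that iterdescendants-plus-heuristics idea, where the heuristics and helper names are made up for illustration:

import re
import lxml.html

CHAPTER_RE = re.compile(r'Chapter\s+\d+', re.IGNORECASE)

def find_heading_nodes(tree):
    """Yield every element whose own text looks like a chapter heading."""
    for node in tree.getroot().iterdescendants():
        if node.text and CHAPTER_RE.search(node.text):
            yield node

def guess_chapter_text(node):
    """Ad-hoc heuristic: for a heading inside a td, try the next cell;
    otherwise fall back to the text of the parent element."""
    if node.tag == 'td':
        nxt = node.getnext()
        if nxt is not None:
            return nxt.text_content()
    parent = node.getparent()
    return parent.text_content() if parent is not None else node.text_content()

# usage sketch: tree = lxml.html.parse('some_form.html')  # hypothetical file
#               for heading in find_heading_nodes(tree):
#                   print(guess_chapter_text(heading))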

The crux of the problem seems to be that you're programmatically mining data that's not presented in a programmatic way -- in these cases, human interaction is usually necessary to some degree.

故事还在继续 2024-08-31 20:32:45

As I feared, there is no systematic way to use lxml to identify and extract what I need. Oh well, I appreciate everyone chiming in. Note: this is not the fault of lxml, it is the fault of the inconsistent html coding. For instance, because a chapter is a reasonable division of a document, all the content in one chapter should be wrapped in some type of element. Probably the most flexible would be a div tag, with the subsequent div being the next chapter. This would make a chapter a branch of the tree. Unfortunately, while approximately 20% of the documents might be that well structured, the others are not.

I could test for each type of element that should hold my content (div, p) and grab all of its children and all of its siblings until I get to the next element of that type that has information alerting me that we are at the end of the section (the beginning of the next section). But this seems like too much work when I am good 95% of the time or more with a regular expression.
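For what it's worth, a brief sketch of that sibling-walking approach could look like this; the function name and the stopping rule are illustrative assumptions rather than the poster's code.

import re
import lxml.html

CHAPTER_RE = re.compile(r'Chapter\s+\d+', re.IGNORECASE)

def chapter_body(heading):
    """Collect text from the heading element's following siblings until a
    sibling of the same tag announces the next chapter."""
    parts = []
    for sibling in heading.itersiblings():
        if sibling.tag == heading.tag and CHAPTER_RE.search(sibling.text_content()):
            break                      # start of the next chapter: stop here
        parts.append(sibling.text_content())
    return '\n'.join(parts)

# usage sketch, assuming the heading <div>/<p> element has already been found:
#   body = chapter_body(heading_element)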

Thanks for all of the answers and comments; as always, I learned from them.
