Best way to get back to the power of lxml after having to use a regular expression to find something in an html document

Published 2024-08-24 20:32:45

I am trying to rip some text out of a large number of html documents (numbering in the hundreds of thousands). The documents are really forms, but they are prepared by a very large group of different organizations, so there is significant variation in how they create the documents. For example, the documents are divided into chapters. I might want to extract the contents of Chapter 5 from every document so I can analyze the content of that chapter. Initially I thought this would be easy, but it turns out that the authors might use a set of non-nested tables throughout the document to hold the content, so that Chapter n could be displayed using td tags inside a table. Or they might use other elements such as p tags, H tags, div tags, or any other block-level element.

After trying repeatedly to use lxml to help me identify the beginning and end of each chapter, I have determined that it is a lot cleaner to use a regular expression, because in every case, no matter what the enclosing html element is, the chapter label is always of the form

>Chapter #

It is a little more complicated in that there might be some whitespace or non-breaking space represented in different ways (&#160; or &nbsp; or just plain spaces). Nonetheless, it was trivial to write a regular expression to identify the beginning of each section. (The beginning of one section is the end of the previous section.)
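For illustration, a minimal sketch of such a pattern might look like the following in Python; the exact pattern, the entity handling, and the name CHAPTER_RE are my assumptions, not the author's actual expression.

import re

# Match ">Chapter" followed by any mix of whitespace or numeric/named
# non-breaking-space entities, then capture the chapter number.
CHAPTER_RE = re.compile(r'>Chapter(?:\s|&nbsp;|&#160;)+(\d+)', re.IGNORECASE)

sample = '<font style="...">Chapter 1.&#160;&#160;&#160;Our Beginnings.</font>'
for m in CHAPTER_RE.finditer(sample):
    print(m.group(1), m.start())   # chapter number and its offset in the raw html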

But now I want to use lxml to get the text out. My thought is that I really have no choice but to walk along my string to find the closing tag of the element that encloses the text I used to find the relevant section.

That is, here is one example where the element holding the chapter name is a div:

<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="left"><font style="DISPLAY: inline; FONT-WEIGHT: bold; FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman">Chapter 1.   Our Beginnings.</font></div>

So I am imagining that I would begin at the location where I found the match for Chapter 1 and set up a regular expression to find the next closing tag:

</div|</td|</p|</h1 . . .

So at this point I have identified the type of element holding my chapter heading
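As a rough sketch of that closing-tag search (the tag list and the function name are my own assumptions, not the author's code):

import re

# First block-level closing tag after the heading match tells us which
# element type encloses the chapter heading.
CLOSE_TAG_RE = re.compile(r'</(div|td|p|h1|h2|h3|h4)\b', re.IGNORECASE)

def enclosing_element_type(html, heading_pos):
    """Return the tag name of the first closing tag found after heading_pos."""
    m = CLOSE_TAG_RE.search(html, heading_pos)
    return m.group(1).lower() if m else None

sample = '<div align="left"><font>Chapter 1.&#160;&#160;Our Beginnings.</font></div>'
print(enclosing_element_type(sample, sample.find('>Chapter')))   # -> 'div'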

I can use the same logic to find all of the text that is within that element, that is, set up a regular expression to help me mark from

>Chapter 1.&#160;&#160;&#160;Our Beginnings.<

So I have identified where my Chapter 1 begins

I can do the same for chapter 2 (which is where Chapter 1 ends)

Now I am imagining that I am going to snip the document, beginning at the opening of the element that I identified as indicating where Chapter 1 begins, and ending just before the opening of the element that I identified as indicating where Chapter 2 begins. The string that I have identified will then be fed to lxml to use its power to get the content.
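Here is a rough, self-contained sketch of that snip-then-parse idea; the helper names, the sample markup, and the assumption that the heading's element type is already known ('div' here) are all illustrative, not the author's implementation.

import re
import lxml.html

CHAPTER_RE = re.compile(r'>Chapter(?:\s|&nbsp;|&#160;)+(\d+)', re.IGNORECASE)

def chapter_slice(html, start_match, end_match, tag):
    """Cut from the opening of the element holding one chapter heading up to
    (but not including) the opening of the element holding the next one."""
    start = html.rfind('<' + tag, 0, start_match.start())
    end = html.rfind('<' + tag, 0, end_match.start())
    return html[start:end]

html = (
    '<div align="left"><font>Chapter 1.&#160;&#160;&#160;Our Beginnings.</font></div>'
    '<p>Body text for the first chapter.</p>'
    '<div align="left"><font>Chapter 2.&#160;&#160;&#160;Our Red Canary.</font></div>'
)
matches = list(CHAPTER_RE.finditer(html))
fragment = chapter_slice(html, matches[0], matches[1], 'div')
tree = lxml.html.fromstring(fragment)   # hand the snipped string back to lxml
print(tree.text_content())              # heading text plus the chapter body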

I am going to all of this trouble because I have read over and over: never use a regular expression to extract content from html documents. But I have not hit on a way to be as accurate with lxml at identifying the starting and ending locations of the text I want to extract. For example, I can never be certain that the subtitle of Chapter 1 is Our Beginnings; it could be Our Red Canary. Let me say that I spent two solid days trying with lxml to be confident that I had the beginning and ending elements, and I could only be accurate <60% of the time, but a very short regular expression has given me better than 95% success.

I have a tendency to make things more complicated than necessary, so I am wondering if anyone has seen or solved a similar problem, and whether they have an approach (not the details, mind you) that they would like to offer.

Comments (3)

七月上 2024-08-31 20:32:45

Sometimes there is not a straight path to getting the content when dealing with poorly or inconsistently written HTML.

You might want to look at using lynx or one of the other text-based browsers to dump the page content, either into a file or piped into your code, and then process it. Or you can use lxml to load and parse the page, then extract the text using text_content() and go after the chapters via regex.
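A rough sketch of that lxml-then-regex route might look like this; the sample markup, the pattern, and the chapter numbers are illustrative assumptions.

import re
import lxml.html

# Flatten the document to plain text, then carve one chapter out of the text.
html = (
    '<table><tr><td><b>Chapter 1.&#160;&#160;Our Beginnings.</b></td></tr>'
    '<tr><td>Body of the first chapter.</td></tr></table>'
    '<div><b>Chapter 2. Our Red Canary.</b></div>'
)
text = lxml.html.fromstring(html).text_content()

m = re.search(r'Chapter\s+1\b(.*?)(?=Chapter\s+2\b|$)', text, re.DOTALL)
print(m.group(1).strip() if m else None)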

Like they say, GIGO - garbage in, garbage out, and it's our job as developers to spin that garbage into gold. Doing so can get pretty messy.

注定孤独终老 2024-08-31 20:32:45

The simplest thing it sounds like you could possibly do is iterate over tree.getroot().iterdescendants() looking for a node whose node.text matches your desired regular expression. From that point, you can pass the node to a function that uses some ad-hoc heuristics to determine where the text is. (Maybe, if iterdescendants on root is too slow, you can use your regex approach and dive into etree to try to find an f(text_position) -> node function.)

For example, if you find that the target was a //tr/td, you can pass it to some table-text-finding subroutine that looks into the next td under node.getparent() to see if it has text that makes sense (approximately chapter-length, containing certain words, whatever). Likewise, you can make up some heuristics for finding the data in other tags like div and p. If you find yourself in an unknown tag like font, you can try bubbling up a limited number of levels to find something you know how to handle -- you have to be cautious not to bubble up too far, or I imagine you might accidentally retrieve text from another chapter.
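A loose sketch of that iterdescendants-plus-heuristics idea, where the heuristics and helper names are made up for illustration:

import re
import lxml.html

CHAPTER_RE = re.compile(r'Chapter\s+\d+', re.IGNORECASE)

def find_heading_nodes(tree):
    """Yield every element whose own text looks like a chapter heading."""
    for node in tree.getroot().iterdescendants():
        if node.text and CHAPTER_RE.search(node.text):
            yield node

def guess_chapter_text(node):
    """Ad-hoc heuristic: for a heading inside a td, try the next cell;
    otherwise fall back to the text of the parent element."""
    if node.tag == 'td':
        nxt = node.getnext()
        if nxt is not None:
            return nxt.text_content()
    parent = node.getparent()
    return parent.text_content() if parent is not None else node.text_content()

# usage sketch: tree = lxml.html.parse('some_form.html')  # hypothetical file
#               for heading in find_heading_nodes(tree):
#                   print(guess_chapter_text(heading))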

The crux of the problem seems to be that you're programmatically mining data that's not presented in a programmatic way -- in these cases, human interaction is usually necessary to some degree.

故事还在继续 2024-08-31 20:32:45

As I feared, there is no systematic way to use lxml to identify and extract what I need. Oh well, I appreciate everyone chiming in. Note: this is not the fault of lxml, it is the fault of the inconsistent html coding. For instance, because a chapter is a reasonable division of a document, all the content in one chapter should be wrapped in some type of element. Probably the most flexible would be a div tag, with the subsequent div being the next chapter. This would make a chapter a branch of the tree. Unfortunately, while approximately 20% of the documents might be that well structured, the others are not.

I could test for each type of element that should hold my content (div, p) and grab all of its children and all of its siblings until I get to the next element of that type that has information alerting me that we are at the end of the section (the beginning of the next section). But this seems like too much work when I am good 95% of the time or more with a regular expression.
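For what it's worth, a brief sketch of that sibling-walking approach could look like this; the function name and the stopping rule are illustrative assumptions rather than the poster's code.

import re
import lxml.html

CHAPTER_RE = re.compile(r'Chapter\s+\d+', re.IGNORECASE)

def chapter_body(heading):
    """Collect text from the heading element's following siblings until a
    sibling of the same tag announces the next chapter."""
    parts = []
    for sibling in heading.itersiblings():
        if sibling.tag == heading.tag and CHAPTER_RE.search(sibling.text_content()):
            break                      # start of the next chapter: stop here
        parts.append(sibling.text_content())
    return '\n'.join(parts)

# usage sketch, assuming the heading <div>/<p> element has already been found:
#   body = chapter_body(heading_element)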

Thanks for all of the answers and comments; as always, I learned from them.
