Classifying tables in lxml
I am working with a large set of HTML documents. One of my tasks is to extract all of the text from them. I have gotten pretty far, but now I am stumped because some tables are used as containers or formatting structures for information that is not numeric in nature.
My goal is to ignore (leave behind, not extract) a table if it is a table of numeric fields.
I am getting ready to implement a brute-force, rule-based approach: take a table, and if more than some percentage of its td.text_content() values can be classified as digits, decide that the table is a table of numeric values.
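A minimal sketch of that percentage rule, to make the idea concrete. The regular expression, the 0.5 threshold, and all function names here are illustrative assumptions, not something taken from lxml or from an existing solution; only `document_fromstring`, `iter`, `xpath`, `text_content()` and `drop_tree()` are lxml calls.

```python
import re

# Assumed definition of a "numeric" cell: things like 1,234  (56.7)  $12  45%  -3.2
# Tune this pattern to the documents at hand.
NUMERIC_CELL = re.compile(r"\(?[-+]?\$?\d[\d,]*(?:\.\d+)?%?\)?")


def numeric_fraction(cell_texts):
    """Return the share of non-empty cells whose text looks numeric."""
    cells = [t.strip() for t in cell_texts if t and t.strip()]
    if not cells:
        return 0.0
    hits = sum(1 for t in cells if NUMERIC_CELL.fullmatch(t))
    return hits / len(cells)


def is_numeric_table(cell_texts, threshold=0.5):
    """Apply the percentage rule; the 0.5 default is only a starting guess."""
    return numeric_fraction(cell_texts) >= threshold


def text_without_numeric_tables(html, threshold=0.5):
    """Drop tables judged numeric, then return the remaining document text."""
    import lxml.html  # imported here so the helpers above work without lxml

    root = lxml.html.document_fromstring(html)
    # Materialise the list first because the tree is modified while looping.
    for table in list(root.iter("table")):
        # Skip layout tables that wrap other tables; the inner ones are judged instead.
        if table.xpath(".//table"):
            continue
        cells = [td.text_content() for td in table.iter("td")]
        if is_numeric_table(cells, threshold):
            table.drop_tree()  # removes the table but keeps the text that follows it
    return root.text_content()
```

Calling `text_without_numeric_tables(page_source)` on one document returns its text with the numeric tables left out. Header cells (`th`) are deliberately not counted, so a table with text headers and numeric data cells still scores high.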
I am wondering if someone else can suggest a better approach.
Comments (1)
I would suggest contextual data parsing, such as that implemented by IBM's Watson; otherwise, I suppose it is impossible to classify the table. Could you post HTML examples of a numeric and a non-numeric table? I'll update my answer when they are posted.