Recognising patterns or structure in hierarchical data
My question is: what would be the best technology for detecting hierarchical or tree patterns?
I want to recognise the parts of an HTML page, for example: user login menu, navigation menu, content body, footer, etc.
I'm trying a grammar recogniser that I implemented myself in PHP, using a DOM parser (DOMDocument) to walk the HTML. I don't like classical tools such as Lex and yacc for this work, because they don't take the meaning of the HTML data into account.
I'm having trouble because of the variability in how data is represented visually in HTML. For example, a menu can be implemented with <ul><li><a href=#>Link1</a><li>Link2....</ul>, but that is only one possibility among hundreds. It also depends on CSS and on event handlers (onclick, onmouseover). There is also the problem of telling a real menu from a fake one.
I was thinking about training a neural network, but all the examples I found are for linear data, not hierarchical data. I tried training some networks, but it's obvious that they lose the relationship information between the DOM tree elements. Or maybe I just don't know how to do it better.
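To make the "lost relationship information" concrete, here is a minimal sketch of one possible way to keep some of it: describe each DOM node by a flat feature record that also encodes its context in the tree (parent tag, depth, siblings, links below it). It is written in Python with the standard-library html.parser rather than PHP and DOMDocument, and the names Node, TreeBuilder, walk and features, as well as the choice of features, are assumptions made for illustration only. The tree builder is deliberately naive and does not handle implicit closing such as an unclosed <li>.

```python
from html.parser import HTMLParser

# Elements that never have children (a small, incomplete illustrative set).
VOID = {"br", "img", "input", "hr", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_len = 0


class TreeBuilder(HTMLParser):
    """Builds a simple tree of Node objects from tolerant HTML parsing."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.text_len += len(data.strip())


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def features(node):
    """Flat record for one node that still carries some tree context."""
    depth = 0
    parent = node.parent
    while parent is not None:
        depth += 1
        parent = parent.parent
    descendants = list(walk(node))[1:]
    return {
        "tag": node.tag,
        "parent_tag": node.parent.tag if node.parent else None,
        "depth": depth,
        "n_children": len(node.children),
        "n_descendant_links": sum(1 for d in descendants if d.tag == "a"),
        "n_siblings": len(node.parent.children) - 1 if node.parent else 0,
    }
```

Records like these could be fed to an ordinary classifier, one per candidate node, so the model sees a linear input that still says something about where the node sits in the hierarchy.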
My pattern-recognition grammar gives poor results because it doesn't tolerate possible "accidents" in the HTML and doesn't smooth the recognition; it's too strict (not fuzzy).
Any ideas?
Comments (1)
One possible way would be to have an array of many (10-20) different regexps or other detection methods, see how many of them the candidate matches, weight them according to how often each is correct, and compare the result to a threshold value. Or you could take the total of correct matches and use a neural network to choose, if you like them.
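A rough sketch of the weighted-detector idea, in Python rather than PHP. The four detectors, their weights and the 0.6 threshold are all made up for illustration; in practice the weights would come from measuring how often each detector is right on labelled pages, and there would be more of them.

```python
import re

# Hypothetical hint that a fragment is navigation: a <nav> element, or a
# class/id attribute whose value contains "nav" or "menu".
NAV_HINT = re.compile(
    r"""<nav\b|(?:class|id)\s*=\s*["'][^"']*(?:nav|menu)""", re.I
)

# (name, weight, predicate). The weights are illustrative assumptions that
# stand in for "how often each detector is correct".
DETECTORS = [
    ("has_list", 0.6, lambda html: re.search(r"<(ul|ol)\b", html, re.I) is not None),
    ("many_links", 0.9, lambda html: len(re.findall(r"<a\b", html, re.I)) >= 3),
    ("nav_hint", 0.8, lambda html: NAV_HINT.search(html) is not None),
    ("little_text", 0.5, lambda html: len(re.sub(r"<[^>]+>", "", html).strip()) < 200),
]


def menu_score(html):
    """Weighted fraction of detectors that fire, between 0 and 1."""
    total = sum(weight for _, weight, _ in DETECTORS)
    hit = sum(weight for _, weight, test in DETECTORS if test(html))
    return hit / total


def looks_like_menu(html, threshold=0.6):
    """Compare the weighted score to a threshold (an assumed value)."""
    return menu_score(html) >= threshold
```

Because no single detector has to match, a fragment with a few "accidents" can still score above the threshold, which is the fuzziness a strict grammar lacks. The same list of hit/miss flags could also be used as the input vector for a neural network instead of a fixed threshold.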