Determining whether a URL is in a webpage's header or footer, given the URL, the page's DOM, the page's URL, and the other URLs on the page
Given a URL, the URL of the webpage that URL appears on, the DOM of that webpage, and a list of the other URLs on the page, how can I reliably determine whether the URL is in the page's header, in its footer, or in neither?
I'm using C#/.NET.
I know that no solution is perfect, since webpages are not marked up semantically and some sites deliberately obfuscate their pages, but I would like to build some logic that works for, say, 75% of webpages.
Also, are there other pieces of information that would be helpful to determine the location of the URL in the page?
Comments (1)
I think the creative task here is to define "header" and "footer", for example as "content less than x units from the top" or "the last 200 characters on the page". Once you have done that, you can parse the page according to those rules.
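As a rough illustration of that kind of rule, here is a minimal C# sketch that classifies a URL by how far its occurrences sit from the start or end of the page body. The names LinkRegionClassifier, PageRegion, headerChars and footerChars are hypothetical, and the 2000-character thresholds are arbitrary assumptions that would need tuning against real pages. It works on the raw HTML string rather than a parsed DOM, so treat it as a starting point only.

```csharp
using System;

public enum PageRegion { Header, Footer, Neither }

public static class LinkRegionClassifier
{
    // Classifies a URL by where it occurs in the raw HTML of the page.
    // headerChars / footerChars are hypothetical thresholds (assumptions).
    public static PageRegion Classify(string html, string url,
                                      int headerChars = 2000,
                                      int footerChars = 2000)
    {
        if (string.IsNullOrEmpty(html) || string.IsNullOrEmpty(url))
            return PageRegion.Neither;

        // Restrict the search to <body> so that <head> metadata is not
        // mistaken for the visible page header.
        int bodyStart = html.IndexOf("<body", StringComparison.OrdinalIgnoreCase);
        if (bodyStart < 0) bodyStart = 0;
        int bodyEnd = html.LastIndexOf("</body", StringComparison.OrdinalIgnoreCase);
        if (bodyEnd < bodyStart) bodyEnd = html.Length;
        string body = html.Substring(bodyStart, bodyEnd - bodyStart);

        // Plain substring match: "/home" would also match "/homepage".
        int first = body.IndexOf(url, StringComparison.Ordinal);
        if (first < 0) return PageRegion.Neither;

        // Rule 1: the first occurrence is close to the top of the body.
        if (first < headerChars) return PageRegion.Header;

        // Rule 2: the last occurrence ends close to the bottom of the body.
        int last = body.LastIndexOf(url, StringComparison.Ordinal);
        if (body.Length - (last + url.Length) < footerChars)
            return PageRegion.Footer;

        return PageRegion.Neither;
    }
}

public static class Program
{
    public static void Main()
    {
        string filler = new string('x', 5000);
        string html = "<html><body><a href=\"/home\">Home</a>" + filler
                    + "<a href=\"/article\">A</a>" + filler
                    + "<a href=\"/privacy\">Privacy</a></body></html>";

        Console.WriteLine(LinkRegionClassifier.Classify(html, "/home"));    // Header
        Console.WriteLine(LinkRegionClassifier.Classify(html, "/article")); // Neither
        Console.WriteLine(LinkRegionClassifier.Classify(html, "/privacy")); // Footer
    }
}
```

On a page shorter than the thresholds every link will come back as Header, so in practice the limits might be better expressed as a fraction of the body length. A URL that appears in both regions (a home link, for instance) is reported as Header here because that rule is checked first.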