How can I extract data from a raw HTML file?
Is there a way to extract the data I want from raw HTML that was written unsemantically, with no IDs and classes? I mean, suppose there is a saved HTML file of a web page (a profile) and I want to extract data such as 'hobbies'. Is it possible to do this using PHP?
Comments (4)
Use regex! I kid, I kid. If you know the structure of the page and the format is guaranteed to stay similar enough, you can try writing a manual parser. Alternatively, there are a lot of libraries out there that will parse HTML for you. I'm not familiar enough with PHP to recommend one, but I'm sure some Googling could take you a long way. I've had luck with John Resig's pure JavaScript HTML parser before.
At the end of the day, if you need semantic information from an HTML page that isn't constructed semantically, you're probably doomed programmatically, and your best bet may be a mechanical turk.
Sounds like you're looking for a PHP DOM Parser, such as this one. It'll probably be a bit tricky to pull out the data you need if the HTML is truly devoid of semantic structure, but a DOM parser is the place to start.
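As a rough illustration of that starting point, here is a minimal sketch using PHP's built-in DOMDocument class rather than any particular third-party parser. The markup, the "Hobbies:" label, and the variable names are made up for the example; a real saved profile would be loaded with `loadHTMLFile()` instead of an inline string.

```php
<?php
// Hypothetical profile markup: no ids or classes, only a visible label.
$html = <<<HTML
<html><body>
<table>
  <tr><td><b>Name:</b></td><td>Alice</td></tr>
  <tr><td><b>Hobbies:</b></td><td>Chess, hiking</td></tr>
</table>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them;
// real-world pages are rarely valid HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$hobbies = null;
foreach ($doc->getElementsByTagName('td') as $cell) {
    // Use the visible label as the anchor, since there is no id or class.
    if (trim($cell->textContent) === 'Hobbies:') {
        // Walk forward to the next element node (skipping whitespace text nodes).
        $next = $cell->nextSibling;
        while ($next !== null && $next->nodeType !== XML_ELEMENT_NODE) {
            $next = $next->nextSibling;
        }
        if ($next !== null) {
            $hobbies = trim($next->textContent);
        }
        break;
    }
}

echo $hobbies, PHP_EOL;
```

The idea is to treat the label text itself as the hook when the markup offers nothing else to hold on to.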
Yes, the technique is called web scraping. You could use the DOM if it's valid HTML. If the page is dynamically generated, the generator will have used some structure, and in my experience you can always isolate the elements of interest.
If the DOM does not work for you, you can just use regular expressions (that's what I always used to do when writing web spiders). In my experience, regular expressions are more effective and quicker than writing scraping logic against a DOM hierarchy. So you need to open a few of the profile pages and analyze their static structure, then write a regular expression to isolate the fields of interest.
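A sketch of what such an expression might look like, assuming (hypothetically) that every profile page prints a literal "Hobbies:" label followed, after a few tags, by the value. The markup and the pattern are invented for illustration and would need adjusting to the real pages.

```php
<?php
// Hypothetical fragment of a saved profile page.
$html = '<table><tr><td><b>Name:</b></td><td>Alice</td></tr>'
      . '<tr><td><b>Hobbies:</b></td><td> Chess, hiking </td></tr></table>';

// Find the label, skip any tags and whitespace that follow it,
// then capture the text up to the next tag.
$pattern = '~Hobbies:\s*(?:<[^>]+>\s*)*([^<]+)~i';

$hobbiesByRegex = null;
if (preg_match($pattern, $html, $m) === 1) {
    // Decode entities such as &amp; and trim surrounding whitespace.
    $hobbiesByRegex = trim(html_entity_decode($m[1]));
}

echo $hobbiesByRegex, PHP_EOL;
```

A pattern like this only holds up while the page layout stays the same, so it is worth checking it against several saved profiles before relying on it.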
There are two approaches to take with PHP. The first is to clean your document up using the tidy extension so that it's valid XHTML, and therefore well-formed XML that can be parsed using XML tools.
The second is to use the PHP release of the html5lib parser, which attempts to implement the parsing rules that the HTML5 effort derived from studying how current browsers parse markup. If it displays in a browser, html5lib can parse it.
Using either approach, you'll end up with a DOM object you can query using XPath expressions. Since your theoretical documents lack semantic structure, you'll want to look at the document parts from a "the 5th span inside the 3rd p" mindset.
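To make the positional mindset concrete, here is a minimal sketch that queries a DOM with XPath. It assumes PHP's built-in DOMDocument and DOMXPath classes in place of tidy or html5lib, and the markup is invented so that the hobbies happen to sit in the 5th span of the 3rd paragraph.

```php
<?php
// Hypothetical markup with no ids or classes.
$html = <<<HTML
<html><body>
<p>Profile</p>
<p>Member since 2009</p>
<p><span>Alice</span><span>Age:</span><span>30</span><span>Hobbies:</span><span>Chess, hiking</span></p>
</body></html>
HTML;

libxml_use_internal_errors(true); // tolerate sloppy markup quietly
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Purely positional: "the 5th span inside the 3rd p".
$byPosition = $xpath->evaluate('string(//p[3]/span[5])');

// Slightly sturdier alternative: anchor on the label text and take the span after it.
$byLabel = $xpath->evaluate(
    'string(//span[normalize-space(.)="Hobbies:"]/following-sibling::span[1])'
);

echo $byPosition, PHP_EOL;
echo $byLabel, PHP_EOL;
```

The positional query breaks as soon as a profile has one span more or less, so anchoring on a label, where one exists, tends to survive small layout differences better.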
More information here (self-link warning).