Scraping a page
What is the best practice for scraping a horrible mess of a distributor's inventory page (it uses JS to document.write a <td>, then plain-text HTML to close it)? No divs, tds, or anything else is labelled with an id or class.
Should I just straight up preg_match(_all) the thing, or is there some XPath magic I can do?
There is no API, no feeds, no XML, nothing clean at all.
Edit: What I'm basically thinking of at the moment is something like http://pastebin.com/raw.php?i=EuMfRVD5. Is that my best bet, or is there another way?
Answers (2)
Your example is not enough of an example. But since you seemingly don't need the highlighting meta info anyway, the JS-obfuscation could be undone with a bit of:
Maybe that's already good enough to pipe it through one of the DOM libraries afterwards.
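As a rough sketch of that idea (not the answerer's original snippet), assuming each obfuscated cell looks like `<script>document.write('<td>');</script>` followed by plain HTML, a regex can inline what the script would have written and the result can then go through DOM. The sample markup and the `inlineDocumentWrites` helper are hypothetical.

```php
<?php
// Hypothetical sample of the markup described in the question:
// a cell opened by document.write() and closed by plain HTML.
$html = <<<'HTML'
<table><tr>
<script>document.write('<td>');</script>Widget A</td>
<script>document.write('<td>');</script>12 in stock</td>
</tr></table>
HTML;

/**
 * Replace each <script>document.write('...');</script> with the literal it writes.
 * Assumes one quoted string argument with no escaped quotes inside it.
 */
function inlineDocumentWrites(string $html): string
{
    $pattern = '~<script[^>]*>\s*document\.write\(\s*([\'"])(.*?)\1\s*\)\s*;?\s*</script>~is';
    // preg_replace() returns null on failure; fall back to the untouched input.
    return preg_replace($pattern, '$2', $html) ?? $html;
}

$clean = inlineDocumentWrites($html);

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world pages are rarely valid HTML
$doc->loadHTML($clean);
libxml_clear_errors();

// Collect the text of every cell now that the markup is balanced.
$cells = [];
foreach ($doc->getElementsByTagName('td') as $td) {
    $cells[] = trim($td->textContent);
}
print_r($cells);
```

Scripts that build their output from variables or string concatenation would not match this pattern and would need a real JS engine instead.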
In general you should always use http://www.php.net/DOM to parse a page. Regex is horrible and usually downright impossible to use for parsing html, because that's not what it was built for.
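A minimal sketch of the DOM approach when nothing has an id or class: select by structure and content instead. The markup, the "Part"/"Qty" header text, and the two-column layout are assumptions for illustration, not the distributor's real page.

```php
<?php
// Hypothetical inventory markup with no ids or classes.
$html = <<<'HTML'
<html><body>
<table><tr><td>Site navigation</td></tr></table>
<table>
<tr><td>Part</td><td>Qty</td></tr>
<tr><td>Widget A</td><td>12</td></tr>
<tr><td>Widget B</td><td>0</td></tr>
</table>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate malformed real-world HTML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Pick the table whose first row starts with a "Part" cell,
// then take every row after the first (the header) within its parent.
$rows = $xpath->query(
    '//table[.//tr[1]/td[1][normalize-space()="Part"]]//tr[position() > 1]'
);

$inventory = [];
foreach ($rows as $row) {
    $cells = $xpath->query('td', $row);   // cells relative to this row
    $inventory[] = [
        'part' => trim($cells->item(0)->textContent),
        'qty'  => (int) trim($cells->item(1)->textContent),
    ];
}
print_r($inventory);
```

Anchoring on a known header cell tends to survive layout changes better than counting tables by position (`//table[2]`), though either works while the page stays the same.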
However... if the page uses a lot of JavaScript to output stuff, you are kind of SoL regardless. The best you can really do to get a complete picture is to grab the page, run it through a browser, and parse what is rendered. It is possible to automate this, though it's kind of a pita to set up.
But... given the issue with JS outputting a lot of it... maybe regex really would be the best route. I guess first and foremost it depends on what the actual content is and what you are trying to get from the page.
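If regex does turn out to be the route, a sketch along these lines may be enough. It assumes every value sits between a closing `</script>` and the next `</td>`, and that each record has exactly two cells; both the sample source and that layout are hypothetical.

```php
<?php
// Hypothetical raw source: cells opened via document.write(), closed in plain HTML.
$raw = <<<'HTML'
<script>document.write('<td>');</script>Widget A</td><script>document.write('<td>');</script>12</td>
<script>document.write('<td>');</script>Widget B</td><script>document.write('<td>');</script>0</td>
HTML;

// Capture the text between each closing </script> and the following </td>.
preg_match_all('~</script>\s*([^<]+?)\s*</td>~i', $raw, $matches);

// Group the flat list of values into (part, qty) pairs.
// Two columns per record is an assumption about the page layout.
$rows = array_chunk($matches[1], 2);
print_r($rows);
```

This breaks as soon as a cell contains nested tags, which is the usual point at which cleaning the markup and handing it to DOM becomes the safer choice.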