How do I parse Gmail chat logs from a web page?
What would be the best way to parse Gmail chat logs from the webpage where they're displayed? As far as I know, this is still the only way to access server-hosted Gmail chat logs (through either desktop Gmail or mobile Gmail).
When looking at the generated source where the conversation takes place, the markup looks like nested divs and spans (and the divs elsewhere on the page have randomized two-character ids and classes with no pattern). Here's an excerpt from a line that has a timestamp to the left:
<div>
  <span style="display:block;float:left;color:#888">
    2:56 PM
  </span>
  <span style="display:block;padding-left:6em">
    <span>
      <span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs
    </span>
  </span>
</div>
But not every line has a timestamp, and lines without one seem to have non-breaking spaces in its place:
<div>
  <span style="display:block;float:left;color:#888">
  </span>
  <span style="display:block;padding-left:6em">
    <span>
      and reformat that into something like an xml format
    </span>
  </span>
</div>
Should I use XPath? Is there something more efficient?
Edit:
As data only, this is what it looks like:
12:43 AM John: Something something something.
Something something something.
me: Something something something?
12:44 AM Also, something something something.
12:47 AM Something something something.
12:48 AM Something something something
with something something something.
12:49 AM John: Something.
Comments (1)
I would use Ruby with the Nokogiri library; it gives you much more flexibility than XPath/XSLT alone.