How do I parse Gmail chat logs from the web page?

Posted 2024-09-07 12:30:41 | 1302 characters | 4 views | 0 comments


What would be the best way to parse Gmail chat logs from the web page where they are displayed? As far as I know, that page is still the only way to access server-hosted Gmail chat logs (through either desktop Gmail or mobile Gmail).

When looking at the generated source where the conversation takes place, the markup looks like nested divs and spans (and the divs elsewhere on the page have randomized two-character ids and classes with no pattern). Here's an excerpt from a line that has a timestamp to the left:

<div>
<span style="display:block;float:left;color:#888">
2:56 PM&nbsp;
</span>

<span style="display:block;padding-left:6em">
<span>

<span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs

</span>
</span>
</div>

But not every line has a timestamp; lines without one seem to contain non-breaking spaces where the timestamp would be:

<div>
<span style="display:block;float:left;color:#888">
&nbsp;&nbsp;
</span>

<span style="display:block;padding-left:6em">

<span>
and reformat that into something like an xml format
</span>

</span>
</div>

Should I use XPath? Is there something more efficient?

Edit:

As data only, this is what it looks like:

12:43 AM John: Something something something.
         Something something something.
         me: Something something something?
12:44 AM Also, something something something.
12:47 AM Something something something.
12:48 AM Something something something
         with something something something.
12:49 AM John: Something.
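
If the data-only form above is all that is needed, the carry-forward rule it implies (a line with no timestamp or no speaker inherits the previous one) can be sketched in plain Ruby. This is only an illustration under stated assumptions: `parse_chat_lines`, `TIME_RE`, `SPEAKER_RE` and `SAMPLE` are hypothetical names, timestamps are assumed to look like `12:43 AM`, and a speaker is assumed to be a single token followed by a colon and a space, so a message that itself begins with something like `Note: ` would be misread as a speaker change.

```ruby
# Hypothetical sketch: turn the plain-text view of a chat log into records.
# Assumed formats: "H:MM AM/PM" timestamps and "name: " speaker prefixes.
TIME_RE    = /\A(\d{1,2}:\d{2} [AP]M)(?:\s+|\z)/
SPEAKER_RE = /\A([^\s:]+):\s+/

def parse_chat_lines(text)
  timestamp = nil
  speaker   = nil
  records   = []

  text.each_line do |raw|
    line = raw.strip
    next if line.empty?

    # A leading timestamp replaces the remembered one.
    if (m = TIME_RE.match(line))
      timestamp = m[1]
      line = m.post_match
    end

    # A leading "name: " prefix replaces the remembered speaker.
    if (m = SPEAKER_RE.match(line))
      speaker = m[1]
      line = m.post_match
    end

    records << { time: timestamp, speaker: speaker, message: line }
  end

  records
end

SAMPLE = <<~LOG
  12:43 AM John: Something something something.
           Something something something.
           me: Something something something?
  12:44 AM Also, something something something.
  12:47 AM Something something something.
  12:48 AM Something something something
           with something something something.
  12:49 AM John: Something.
LOG

parse_chat_lines(SAMPLE).each do |r|
  puts "#{r[:time]} | #{r[:speaker]} | #{r[:message]}"
end
```

Every record ends up with an explicit time and speaker, which makes a later conversion to XML or any other format a simple loop over `records`.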


柠檬色的秋千 2024-09-14 12:30:41


Should I use XPath? Is there something more efficient?

I would use Ruby with the Nokogiri library; it gives you much more flexibility than XPath/XSLT alone:

#!/usr/bin/ruby
require 'nokogiri'

# Sample markup from the question: timestamp spans alternate with message spans.
src = <<EOS
<div>
    <span style="display:block;float:left;color:#888">
        2:56 PM&nbsp;
    </span>
    <span style="display:block;padding-left:6em">
        <span>
            <span style="font-weight:bold">me</span>: i'm trying to think of a good way to parse gmail chat logs
        </span>
    </span>
    <span style="display:block;float:left;color:#888">
        &nbsp;&nbsp;
    </span>
    <span style="display:block;padding-left:6em">
        <span>
            and reformat that into something like an xml format
        </span>
    </span>
</div>
EOS

chatlog = []
last_timestamp = nil
doc = Nokogiri::HTML(src)

# Visit the spans that are direct children of a div, in document order.
doc.xpath('//div/span').each do |span|
    # Node#[] returns nil when the attribute is missing, hence to_s.
    style = span['style'].to_s

    if style.include?('color:')
        # Grey floated span: the timestamp cell (may contain only &nbsp;).
        last_timestamp = span.content.strip
    elsif style.include?('padding-left:')
        # Indented span: the message, paired with the latest timestamp cell.
        chatlog << {:timestamp => last_timestamp, :message => span.content.strip}
    end
end

# Serialize the collected lines as XML.
builder = Nokogiri::XML::Builder.new(:encoding => 'UTF-8') do |xml|
    xml.chatlog {
        chatlog.each do |line|
            xml.line {
                xml.time    line[:timestamp]
                xml.message line[:message]
            }
        end
    }
end

puts builder.to_xml

Returns:

<?xml version="1.0" encoding="UTF-8"?>
<chatlog>
  <line>
    <time>2:56 PM </time>
    <message>me: i'm trying to think of a good way to parse gmail chat logs</message>
  </line>
  <line>
    <time>  </time>
    <message>and reformat that into something like an xml format</message>
  </line>
</chatlog>
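
Note that the second `<time>` in the output holds only non-breaking spaces: `String#strip` does not remove U+00A0, so the blank timestamp cell overwrites the previous value. If the goal is for untimed lines to inherit the last real timestamp, one possible adjustment is sketched below in plain Ruby. The helpers `squish` and `fill_timestamps` are hypothetical names, not part of Nokogiri or of the answer's script, and the `cells` array stands in for the `span.content` strings the loop above would collect.

```ruby
# Hypothetical helpers illustrating timestamp carry-forward.

# Collapse every run of Unicode whitespace into a single space and trim.
# [[:space:]] also matches U+00A0 (what &nbsp; decodes to), unlike strip.
def squish(text)
  text.gsub(/[[:space:]]+/, ' ').strip
end

# cells: array of [raw_time, raw_message] pairs in document order.
# A blank time cell inherits the most recent non-blank timestamp.
def fill_timestamps(cells)
  last_time = nil
  cells.map do |raw_time, raw_message|
    time = squish(raw_time)
    last_time = time unless time.empty?
    { timestamp: last_time, message: squish(raw_message) }
  end
end

# Stand-in for the text content of the four spans in the sample markup.
cells = [
  ["\n        2:56 PM\u00A0\n    ",
   "\n            me: i'm trying to think of a good way to parse gmail chat logs\n    "],
  ["\n        \u00A0\u00A0\n    ",
   "\n            and reformat that into something like an xml format\n    "]
]

fill_timestamps(cells).each do |line|
  puts "#{line[:timestamp]} | #{line[:message]}"
end
```

With this approach both lines carry `2:56 PM`, and the resulting hashes can be passed to the same XML builder as before.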
