Why is Nokogiri truncating this element?

Posted 2024-10-31 07:22:05

I am parsing an XML file using Nokogiri and Ruby 1.9.2. Everything seems to be working fine until I read the Descriptions (below). The text is being truncated. The input text is:

<Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>

But instead I am getting:

g. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.

Notice it starts at g. which is leaving off more than half.

Here is the complete XML file:

<?xml version="1.0" encoding="utf-8"?>
<Hotel>
  <HotelID>1040900</HotelID>
  <HotelFileName>Copthorne_Hotel_Aberdeen</HotelFileName>
  <HotelName>Copthorne Hotel Aberdeen</HotelName>
  <CityID>10</CityID>
  <CityFileName>Aberdeen</CityFileName>
  <CityName>Aberdeen</CityName>
  <CountryCode>GB</CountryCode>
  <CountryFileName>United_Kingdom</CountryFileName>
  <CountryName>United Kingdom</CountryName>
  <StarRating>4</StarRating>
  <Latitude>57.146068572998</Latitude>
  <Longitude>-2.111680030823</Longitude>
  <Popularity>1</Popularity>
  <Address>122 Huntly Street</Address>
  <CurrencyCode>GBP</CurrencyCode>
  <LowRate>36.8354</LowRate>
  <Facilities>1|2|3|5|6|8|10|11|15|17|18|19|20|22|27|29|30|34|36|39|40|41|43|45|47|49|51|53|55|56|60|62|140|154|209</Facilities>
  <NumberOfReviews>239</NumberOfReviews>
  <OverallRating>3.95</OverallRating>
  <CleanlinessRating>3.98</CleanlinessRating>
  <ServiceRating>3.98</ServiceRating>
  <FacilitiesRating>3.83</FacilitiesRating>
  <LocationRating>4.06</LocationRating>
  <DiningRating>3.93</DiningRating>
  <RoomsRating>3.68</RoomsRating>
  <PropertyType>0</PropertyType>
  <ChainID>92</ChainID>
  <Checkin>14</Checkin>
  <Checkout>12</Checkout>
  <Images>
    <Image>19305754</Image>
    <Image>19305755</Image>
    <Image>19305756</Image>
    <Image>19305757</Image>
    <Image>19305758</Image>
    <Image>19305759</Image>
    <Image>19305760</Image>
    <Image>19305761</Image>
    <Image>19305762</Image>
    <Image>19305763</Image>
    <Image>19305764</Image>
    <Image>19305765</Image>
    <Image>19305766</Image>
    <Image>19305767</Image>
    <Image>37102984</Image>
  </Images>
  <Descriptions>
    <Description>
      <Name>General Description</Name>
      <Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
    </Description>
    <Description>
      <Name>LocationDescription</Name>
      <Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value>
    </Description>
  </Descriptions>
</Hotel>

And here is my Ruby program:

require 'rubygems'
require 'nokogiri'
require 'ap'
include Nokogiri

class Hotel < Nokogiri::XML::SAX::Document

  def initialize
    @h = {}
    @h["Images"] = Array.new([])
    @h["Descriptions"] = Array.new([])
    @desc = {}
  end

  def end_document
    ap @h
    puts "Finished..."
  end

  def start_element(element, attributes = [])
    @element = element

    @desc = {} if element == "Description"
  end

  def end_element(element, attributes = [])
    @h["Images"] << @characters if element == "Image"
    @desc["Name"] = @characters if element == "Name"
    if element == "Value"
      @desc["Value"] = @characters
      @h["Descriptions"] << @desc
    end

    @h[element] = @characters unless %w(Images Image Descriptions Description Hotel Name Value).include? element
  end

  def characters(string)
    @characters = string
  end
end

# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(Hotel.new)

# Feed the parser some XML
parser.parse(File.open("/Users/cbmeeks/Projects/shared/data/text/HotelDatabase_EN/00/1040900.xml", 'rb'))

Thanks

Comments (2)

上课铃就是安魂曲 2024-11-07 07:22:05

I stripped down the XML because it had a lot of unnecessary nodes for the problem. Here's a sample of how I go after text:

#!/usr/bin/env ruby
# encoding: UTF-8

xml =<<EOT
<?xml version="1.0" encoding="utf-8"?>
<Hotel>
  <Descriptions>
    <Description>
      <Name>General Description</Name>
      <Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
    </Description>
    <Description>
      <Name>LocationDescription</Name>
      <Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value>
    </Description>
  </Descriptions>
</Hotel>
EOT

require 'nokogiri'

doc = Nokogiri::XML(xml)
puts doc.search('Value').map{ |n| n.text }

With a sample of the output:

The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.

There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.

The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.
Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.

This purposely only goes after the Value nodes. It'd be simple to modify the sample to grab the image nodes too.

Now, a couple questions: Why use SAX mode? Is the incoming XML bigger than can reasonably fit into the RAM of your host? If not, use DOM as it's much easier to use.

When I ran it the first time, Ruby told me invalid multibyte char (US-ASCII), meaning there's something in the XML it didn't like. I fixed that by adding the # encoding line. I'm using Ruby 1.9.2, which makes it easier to deal with such things.

I'm using CSS accessors for the search. Nokogiri allows XPath and CSS, so you're free to indulge your XML-parsing heart's desire however you want.

忘东忘西忘不掉你 2024-11-07 07:22:05

I ran into a similar problem, and here is the actual explanation:

def characters(string)
    @characters = string
end

Should actually be something like this:

def start_element(element, attributes = [])     
  #...(other stuff)...

  # Reset/initialize @characters
  @characters = ""
end

def characters(string)
    @characters += string
end

The rationale is that the contents of the tag may in fact be split into multiple text nodes, as described here: http://nokogiri.org/Nokogiri/XML/SAX/Document.html

This method might be called multiple times given one contiguous string of characters.

Only the last segment of the text body was being captured because each time it encountered a text node (i.e. the characters method is called) it replaced the contents of @characters instead of appending to it.
