Why is Nokogiri truncating this element?
I am parsing an XML file using Nokogiri and Ruby 1.9.2. Everything seems to be working fine until I read the Descriptions (below). The text is being truncated. The input text is:
<Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.
There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.
The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
But instead I am getting:
g. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.
Notice it starts at "g.", which leaves off more than half of the text.
Here is the complete XML file:
<?xml version="1.0" encoding="utf-8"?>
<Hotel>
<HotelID>1040900</HotelID>
<HotelFileName>Copthorne_Hotel_Aberdeen</HotelFileName>
<HotelName>Copthorne Hotel Aberdeen</HotelName>
<CityID>10</CityID>
<CityFileName>Aberdeen</CityFileName>
<CityName>Aberdeen</CityName>
<CountryCode>GB</CountryCode>
<CountryFileName>United_Kingdom</CountryFileName>
<CountryName>United Kingdom</CountryName>
<StarRating>4</StarRating>
<Latitude>57.146068572998</Latitude>
<Longitude>-2.111680030823</Longitude>
<Popularity>1</Popularity>
<Address>122 Huntly Street</Address>
<CurrencyCode>GBP</CurrencyCode>
<LowRate>36.8354</LowRate>
<Facilities>1|2|3|5|6|8|10|11|15|17|18|19|20|22|27|29|30|34|36|39|40|41|43|45|47|49|51|53|55|56|60|62|140|154|209</Facilities>
<NumberOfReviews>239</NumberOfReviews>
<OverallRating>3.95</OverallRating>
<CleanlinessRating>3.98</CleanlinessRating>
<ServiceRating>3.98</ServiceRating>
<FacilitiesRating>3.83</FacilitiesRating>
<LocationRating>4.06</LocationRating>
<DiningRating>3.93</DiningRating>
<RoomsRating>3.68</RoomsRating>
<PropertyType>0</PropertyType>
<ChainID>92</ChainID>
<Checkin>14</Checkin>
<Checkout>12</Checkout>
<Images>
<Image>19305754</Image>
<Image>19305755</Image>
<Image>19305756</Image>
<Image>19305757</Image>
<Image>19305758</Image>
<Image>19305759</Image>
<Image>19305760</Image>
<Image>19305761</Image>
<Image>19305762</Image>
<Image>19305763</Image>
<Image>19305764</Image>
<Image>19305765</Image>
<Image>19305766</Image>
<Image>19305767</Image>
<Image>37102984</Image>
</Images>
<Descriptions>
<Description>
<Name>General Description</Name>
<Value>The Copthorne Aberdeen enjoys a location proximate to several bars, restaurants and other diversions. This Aberdeen hotel is located on the city’s West End, roughly a mile from the many opportunities to engage in sightseeing or simply shopping the day away. The Aberdeen International Airport is approximately 10 miles from the Copthorne Hotel in Aberdeen.
There are 89 rooms in total at the Copthorne Aberdeen Hotel. Each of the is provided with direct-dial telephone service, trouser presses, coffee and tea makers and a private bath with a bathrobe and toiletries courtesy of the hotel. The rooms are light in color.
The Hotel Copthorne Aberdeen offers its guests a restaurant where they can enjoy their meals in a somewhat formal setting. For something more laid-back, guests may have a drink and a light meal in the hotel bar. This hotel does offer business services and there are rooms for meetings located onsite. The hotel also provides a secure parking facility for those who arrive by private car.</Value>
</Description>
<Description>
<Name>LocationDescription</Name>
<Value>Aberdeen's premier four star hotel located in the city centre just off Union Street and the main business and entertainment areas. Within 10 minutes journey of Aberdeen Railway Station and only 10-20 minutes journey from International Airport.</Value>
</Description>
</Descriptions>
</Hotel>
And here is my Ruby program:
require 'rubygems'
require 'nokogiri'
require 'ap'
include Nokogiri

class Hotel < Nokogiri::XML::SAX::Document
  def initialize
    @h = {}
    @h["Images"] = Array.new([])
    @h["Descriptions"] = Array.new([])
    @desc = {}
  end

  def end_document
    ap @h
    puts "Finished..."
  end

  def start_element(element, attributes = [])
    @element = element
    @desc = {} if element == "Description"
  end

  def end_element(element, attributes = [])
    @h["Images"] << @characters if element == "Image"
    @desc["Name"] = @characters if element == "Name"
    if element == "Value"
      @desc["Value"] = @characters
      @h["Descriptions"] << @desc
    end
    @h[element] = @characters unless %w(Images Image Descriptions Description Hotel Name Value).include? element
  end

  def characters(string)
    @characters = string
  end
end

# Create a new parser
parser = Nokogiri::XML::SAX::Parser.new(Hotel.new)

# Feed the parser some XML
parser.parse(File.open("/Users/cbmeeks/Projects/shared/data/text/HotelDatabase_EN/00/1040900.xml", 'rb'))
Thanks
Comments (2)
I stripped down the XML because it had a lot of unnecessary nodes for the problem. Here's a sample of how I go after text:
With a sample of the output:
This purposely only goes after the Value nodes. It'd be simple to modify the sample to grab the image nodes too.
Now, a couple questions: Why use SAX mode? Is the incoming XML bigger than can reasonably fit into the RAM of your host? If not, use DOM as it's much easier to use.
When I ran it the first time, Ruby told me "invalid multibyte char (US-ASCII)", meaning there's something in the XML it didn't like. I fixed that by adding the # encoding line. I'm using Ruby 1.9.2, which makes it easier to deal with such things.
I'm using CSS accessors for the search. Nokogiri allows XPath and CSS, so you're free to indulge your XML-parsing heart's desire however you want.
I ran into a similar problem, and here is the actual explanation: the characters method in the question assigns each chunk to @characters, when it should actually append to it.
The rationale is that the contents of the tag may in fact be split into multiple text nodes, as described here: http://nokogiri.org/Nokogiri/XML/SAX/Document.html
Only the last segment of the text body was being captured because each time it encountered a text node (i.e. the characters method is called) it replaced the contents of @characters instead of appending to it.