Ruby/Mechanize stops when it finds an accented letter u
I want to read this form, generated by a PHP script, using Ruby/Mechanize:
<form name="editevent" method="post" action="/index.php" enctype="multipart/form-data">
<input type="text" name="veranstaltung">
<select name='ortid'>
<option value='2'>Kaminwerk</option>
<option value='3'>Pitú</option>
<option value='4'>Apollo-Center</option>
</select>
<input type="text" name="neutermin" id="neutid" />
<textarea name="beschreibung" cols="40" rows="7"></textarea><br />
<input type="submit" name="button" value="Absenden">
</form>
In Ruby I have:
form = page.forms.first
form.fields.each { |f| puts f.name }
However, Ruby finds only the form elements named "veranstaltung" and "ortid".
I found out that the problem is the accented letter "ú" in the word "Pitú". Proof: when I print the inner_html of the page, that part of the form looks like this:
<form name="editevent" method="post" action="/index.php" enctype="multipart/form-data">
<input type="text" name="veranstaltung">
<select name='ortid'>
<option value='2'>Kaminwerk</option>
<option value='3'>Pit</form>
The rest of the form has vanished! How can I use the complete form despite the "ú"?
I would be very glad if anyone could help.
Comments (1)
What version of Ruby? It smells like 1.8.7, which is not Unicode savvy. If you can, upgrade to 1.9.2.
It's also important to specify the character set of the content when parsing it. Often that information is declared in the document itself (usually in a <meta> tag rather than the DOCTYPE statement), but if it isn't, you have to give the parser a hint of what to expect.
Because those characters are emitted by PHP, they could be UTF-8, or maybe a variant of WIN-1252 or ISO-8859, in which case they'd be single-byte characters. Mechanize uses Nokogiri to parse, and Nokogiri needs to know the encoding to give you the best decoding of the values. Nokogiri records problems in the document's errors attribute when it can't parse something to its liking, so you might want to check there.
So, if I were you, I'd look at what the content declares when it is sent (the DOCTYPE and any <meta> charset tag), and also check the HTTP headers, to see if something defines the codeset.
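To make the single-byte versus UTF-8 point concrete, here is a minimal sketch that uses only Ruby's built-in String encoding methods (Ruby 2.0 or later). The helper names charset_from_content_type and to_utf8 are hypothetical, and the ISO-8859-1 fallback is an assumption about what this particular server sends, not something the question confirms.

```ruby
# Hypothetical helpers for illustration; only core String methods are used.

# Pull the charset out of a Content-Type header value, if one is declared.
def charset_from_content_type(header)
  match = header.to_s.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match && match[1]
end

# Return a valid UTF-8 copy of the raw bytes. If the bytes are not already
# valid UTF-8, assume they are in `fallback` (an assumption) and transcode.
def to_utf8(raw, fallback = 'ISO-8859-1')
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?
  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

# "Pitú" as an ISO-8859-1 server would send it: ú is the single byte 0xFA,
# which is not a valid UTF-8 sequence on its own.
latin1_bytes = "Pit\xFA".b
fixed = to_utf8(latin1_bytes)

puts fixed
puts charset_from_content_type('text/html; charset=ISO-8859-1')
```

If the header or the page itself names a charset, that value would be a better choice than the hard-coded fallback. The transcoded string (or the detected charset name) is what you would then hand to the parser.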
This is a problem I've encountered many times on the internet because HTML is so poorly written and so often fails to follow the specs.