Trying to write some code to determine whether a box is checked in an HTML page
I am working with a large collection of documents that are prepared by more than 5K different entities. One of the things I am trying to do is to determine whether or not a box has been checked. The preparer needs to indicate some information by checking one of five different boxes.
The problem is that each preparer decided on their own how to present a check box in the HTML. Some of their representations are interesting. They mostly rely on Wingdings as the font directive. Here are a few of the types of checked boxes I have found so far:
'serif">S</font>'
'wingdings">x</font>'
'ü'
'ý'
'þ'
<font style="font-family: Wingdings; font-variant: normal">þ</font>
The piece of code that I pasted above displays a checked box when the document is opened with a variant of IE, but it renders something else when the document is opened with Firefox, Safari or Chrome.
Here is another example
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">THE DATA THAT HAS THE CHECKED BOX <font style="DISPLAY: inline; FONT-FAMILY: wingdings 2, serif">R</font></font></div>
So I guess in its simplest form my question is
Is there something in python that 'knows' that
<font style="DISPLAY: inline; FONT-FAMILY: wingdings 2, serif">R</font>
this is a checked box? And then extending that further - is there something that 'knows' this for just about every way a checked box can be presented in html code?
I want to note that when I check the text of that font element I get a unicode R
I hope this is clearer.
Comments (1)
The way I see it, it appears like this.
The ascii value of 'S' is 83. If you look up 83 on wingdings, you get "droplet". The Unicode equivalent of "droplet" is 💧 (U+1F4A7).
The ascii value of 'x' is 120. Looking 120 up on wingdings, you get "clear". Unicode ⌧.
252 is wingding "checkbld", unicode ✓.
253 is wingding "boxxmarkbld", unicode ☒
254 is wingding "boxcheckbld", unicode ☑.
'R' is displayed under font-family wingdings2, ascii 82, and unicode equivalent ☑
Note: This is just a guess on which is which. Don't take my word for it.
I assumed it would be so since it seems to make sense. My source is Here (wingdings) and Here (wingdings2)
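To make the lookup idea concrete, here is a minimal Python 3 sketch using only the standard library. It walks the HTML, keeps track of the font family set by the enclosing `<font>` tags, and translates each character through a small table. The table entries are copied from the guesses above and are assumptions rather than a verified mapping, and the names `SYMBOLS`, `first_font`, `SymbolFinder` and `find_symbols` are hypothetical helpers made up for this illustration.

```python
import re
from html.parser import HTMLParser

# (font family, character) -> Unicode equivalent, copied from the guesses above.
# These pairings are unverified assumptions, not an authoritative mapping.
SYMBOLS = {
    ("wingdings", "x"): "\u2327",       # 120, "clear" (X in a rectangle box)
    ("wingdings", "\u00fc"): "\u2713",  # 252, "checkbld"
    ("wingdings", "\u00fd"): "\u2612",  # 253, "boxxmarkbld"
    ("wingdings", "\u00fe"): "\u2611",  # 254, "boxcheckbld"
    ("wingdings 2", "R"): "\u2611",     # 82 in Wingdings 2
}

FONT_FAMILY = re.compile(r"font-family\s*:\s*([^;]+)", re.IGNORECASE)


def first_font(attrs):
    """Return the first font family named by a tag's style or face attribute."""
    attrs = dict(attrs)
    match = FONT_FAMILY.search(attrs.get("style") or "")
    family = match.group(1) if match else (attrs.get("face") or "")
    first = family.split(",")[0].strip().strip("'\"").lower()
    return first or None


class SymbolFinder(HTMLParser):
    """Collect the Unicode symbols implied by symbol-font characters."""

    def __init__(self):
        super().__init__()
        self.fonts = []   # one entry per open <font> tag (None = no family set)
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self.fonts.append(first_font(attrs))

    def handle_endtag(self, tag):
        if tag == "font" and self.fonts:
            self.fonts.pop()

    def handle_data(self, data):
        # The innermost <font> that names a family decides the glyphs.
        font = next((f for f in reversed(self.fonts) if f), None)
        for ch in data.strip():
            symbol = SYMBOLS.get((font, ch))
            if symbol:
                self.found.append(symbol)


def find_symbols(html_text):
    """Return the list of mapped symbols found in an HTML fragment."""
    finder = SymbolFinder()
    finder.feed(html_text)
    finder.close()
    return finder.found


snippet = '<font style="DISPLAY: inline; FONT-FAMILY: wingdings 2, serif">R</font>'
print(find_symbols(snippet))
```

A plain 'R' outside a Wingdings font is ignored because the lookup key includes the font family. Every new representation that turns up in the documents would need its own entry in the table.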
Solution to comment: [&#8730;] (left bracket, amp, pound, 8730, semicolon, right bracket).
&#8730; is interpreted as U+221A, with the semicolon acting as an "end statement" type character. According to fileformat.info, U+221A is the square root symbol, and in Python it is u'\u221a'. This should solve your problem.
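As a small illustration of the numeric character reference described above, Python 3's standard `html.unescape` turns it into the actual character. This is a sketch in Python 3, not the Python 2.7 mentioned elsewhere in this answer; the variable names are made up for the example.

```python
import html
import unicodedata

raw = "[&#8730;]"              # bracket, ampersand, pound, 8730, semicolon, bracket
decoded = html.unescape(raw)   # numeric reference 8730 becomes U+221A

symbol = decoded[1]
print(hex(ord(symbol)))          # code point of the decoded character
print(unicodedata.name(symbol))  # official Unicode name
print(decoded == "[\u221a]")
```

Once the text is decoded this way, a marker such as `[\u221a]` can be searched for with an ordinary string comparison.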
All answers I give are a matter of pure speculation and guesswork, although character codes and equivalents are verified through the links and python2.7.1's chr() and ord().
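The character codes quoted above can be re-checked in a few lines. This sketch uses Python 3, where `chr()` returns a Unicode character for any code point; in Python 2.7, `chr(252)` gives a byte string instead. The dictionary name `codes` is just for illustration.

```python
# Character codes quoted in the answer, checked with ord() and chr().
codes = {"S": 83, "x": 120, "R": 82, "\u00fc": 252, "\u00fd": 253, "\u00fe": 254}

for ch, code in codes.items():
    # ord() gives the code of a character; chr() goes the other way.
    assert ord(ch) == code and chr(code) == ch
    print(repr(ch), code)
```

This only confirms the numeric codes. Which glyph a symbol font draws for each code still has to come from that font's own chart.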