Am I passing the string to the python library correctly?

I'm using a python library called Guess Language: http://pypi.python.org/pypi/guess-language/0.1

"justwords" is a string with unicode text. I stick it in the package, but it always returns English, even though the web page is in Japanese. Does anyone know why? Am I not encoding correctly?

§ç©ºéå
¶ä»æ¡å°±æ²æéç¨®å¾                                é¤ï¼æ以ä¾é裡ç¶ç
éäºï¼åæ­¤ç°å¢æ°£æ°¹³åèµ·ä¾åªè½ç®âå¾å¥½âé常好âåå                 ¶æ¯è¦é»é¤ï¼é¨ä¾¿é»çé»ã飲æãä¸ææ²»ç­åä¸å                                     便å®ï¼æ¯æ´è¥ç   äºï¼æ³æ³é裡以å°é»ãæ¯è§ä¾èªªä¹è©²æpremiumï¼åªæ±é¤é»å¥½å就好äºã<br /><br />é¦åç¾ï¼æ以就é»åå®æ´ç         æ­£è¦åä¸ä¸å
ä¸ç                           å¥é¤å§ï¼å



justwords = justwords.encode('utf-8')
true_lang = str(guess_language.guessLanguage(justwords))
print true_lang

Edit: Thanks, guys, for your help. Here is an update on the problem.

I am trying to "guess" the language of this: http://feeds.feedburner.com/nchild

Basically, in Python, I get the htmlSource. Then I strip the tags using BeautifulSoup. Then I pass the result to the library to get the language. If I do not do encode('utf-8'), ASCII errors come up. So, this is a must.

from BeautifulSoup import BeautifulStoneSoup  # imports added for completeness
import guess_language

soup = BeautifulStoneSoup(htmlSource)
justwords = ''.join(soup.findAll(text=True))
justwords = justwords.encode('utf-8')
true_lang = str(guess_language.guessLanguage(justwords))

篱下浅笙歌 2024-08-26 19:36:43

Looking at the main page, it says """Detects over 60 languages; Greek (el), Korean (ko), Japanese (ja), Chinese (zh) and all the languages listed in the trigrams directory. """

It doesn't use trigrams for those 4 languages; it relies on what script blocks are present in the input text. Looking at the source code:

if "Katakana" in scripts or "Hiragana" in scripts or "Katakana Phonetic Extensions" in scripts:
    return "ja"

if "CJK Unified Ideographs" in scripts or "Bopomofo" in scripts \
        or "Bopomofo Extended" in scripts or "KangXi Radicals" in scripts:
    return "zh"

For a script name like Katakana or Hiragana to appear in scripts, such characters must comprise 40% or more of the input text (after normalisation, which removes non-alphabetic characters etc). Some Japanese text may need a threshold of less than 40%. However, if that were the problem with your text, I would expect it to have more than 40% kanji (CJK Unified Ideographs), and guessLanguage should then return "zh" (Chinese).
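
To make that concrete, here is a minimal sketch of the kind of script-block tally described above -- it is not the library's actual code, and only a few block ranges are hard-coded for illustration:

# Rough sketch of a script-block tally -- NOT guess_language's real code.
# Block ranges are taken from the Unicode block charts; list abbreviated.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]

def script_percentages(utext):
    counts, total = {}, 0
    for ch in utext:
        cp = ord(ch)
        for lo, hi, name in BLOCKS:
            if lo <= cp <= hi:
                counts[name] = counts.get(name, 0) + 1
                total += 1
                break
    if not total:
        return []
    return [(100.0 * n / total, name) for name, n in counts.items()]

# A block makes it into scripts only if its share is >= 40%:
# scripts = [name for pct, name in script_percentages(text) if pct >= 40]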

Update after some experimentation, including inserting a print statement to show what script blocks were detected with what percentages:

A presumably typical news item from the Asahi newspaper website:

 49.3 Hiragana
  8.7 Katakana
 42.0 CJK Unified Ideographs
result ja

A presumably atypical ditto:

 35.9 Hiragana
 49.2 CJK Unified Ideographs
 13.3 Katakana
  1.6 Halfwidth and Fullwidth Forms
result zh

(Looks like it might be a good idea to base the test on the total (Hiragana + Katakana) content)

Result of shoving the raw front page (XML, HTML, everything) through the machinery:

  2.4 Hiragana
  6.1 CJK Unified Ideographs
  0.1 Halfwidth and Fullwidth Forms
  3.7 Katakana
 87.7 Basic Latin
result ca

The high percentage of Basic Latin is of course due to the markup. I haven't investigated what made it choose "ca" (Catalan) over any other language which uses Basic Latin, including English. However the gobbledegook that you printed doesn't show any sign of including markup.

End of update

Update 2

Here's an example (2 headlines and next 4 paragraphs from this link) where about 83% of the characters are East Asian and the rest are Basic Latin but the result is en (English).

 29.6 Hiragana
 18.5 Katakana
 34.9 CJK Unified Ideographs
 16.9 Basic Latin
result en

The Basic Latin characters are caused by the use of English names of organisations etc in the text. The Japanese rule fails because neither Katakana nor Hiragana scores 40% (together they score 48.1%). The Chinese rule fails because CJK Unified Ideographs scores less than 40%. So the 83.1% of East Asian characters is ignored, and the result is decided by the 16.9% minority. These "rotten borough" rules need some reform. In general, the fix could be expressed as:

If (total of script blocks used by only language X) >= X-specific threshold, then select language X.

As suggested above, Hiragana + Katakana >= 40% will probably do the trick for Japanese. A similar rule may well be needed for Korean.
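
A hypothetical patch along those lines (my own illustration, not tested against the library):

def pick_east_asian(pcts):
    # pcts: dict mapping script-block name -> percentage of input text.
    # Treat Hiragana and Katakana as joint evidence for Japanese.
    if pcts.get("Hiragana", 0) + pcts.get("Katakana", 0) >= 40:
        return "ja"
    if pcts.get("CJK Unified Ideographs", 0) >= 40:
        return "zh"
    return None  # fall through to the other rules / trigram detection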

Your gobbledegook did actually contain a few characters of markup (I didn't scroll far enough to the right to see it) but certainly not enough to depress all the East Asian scores below 40%. So we're still waiting to see what your actual input is and how you got it from where.

End of update 2

To aid with diagnosis of your problem, please don't print gobbledegook; use

print repr(justwords)

That way anyone who is interested in actually doing debugging has got something to work on. It would help if you gave the URL of the webpage and showed the Python code that you used to get your unicode justwords. Please edit your question to show those 3 pieces of information.
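
As a made-up illustration (Python 2), repr makes a wrong decode immediately visible -- UTF-8 bytes mis-decoded as Latin-1 produce exactly the kind of mojibake quoted in the question:

good = u'\u6771\u4eac'                        # "Tokyo", correctly decoded
bad = good.encode('utf-8').decode('latin-1')  # simulate a wrong decode
print repr(good)   # u'\u6771\u4eac'
print repr(bad)    # u'\xe6\x9d\xb1\xe4\xba\xac'
print bad          # prints mojibake like the gobbledegook above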

Update 3

Thanks for the URL. Visual inspection indicates that the language is overwhelmingly Chinese. What gave you the impression that it is Japanese?

Semithanks for supplying some of your code. To avoid your correspondents having to do your work for you, and to avoid misunderstandings due to guessing, you should always supply (without being asked) a self-contained script that will reproduce your problem. Note that you say you got "ASCII errors" (no exact error message! no traceback!) if you didn't do .encode('utf8') -- my code (see below) doesn't have this problem.

No thanks for not supplying the result of print repr(justwords) (even after being asked). Inspecting what intermediate data has been created is a very elementary and very effective debugging technique. This is something you should always do before asking a question. Armed with this knowledge you can ask a better question.

Using this code:

# coding: ascii
import sys
sys.path.append(r"C:\junk\wotlang\guess-language\guess_language")
import guess_language
URL = "http://feeds.feedburner.com/nchild"
from BeautifulSoup import BeautifulStoneSoup
from pprint import pprint as pp
import urllib2
htmlSource = urllib2.urlopen(URL).read()
soup = BeautifulStoneSoup(htmlSource)
fall = soup.findAll(text=True)
# pp(fall)
justwords = ''.join(fall)
# justwords = justwords.encode('utf-8')
result = guess_language.guessLanguage(justwords)
print "result", result

I got these results:

 29.0 CJK Unified Ideographs
  0.0 Extended Latin
  0.1 Katakana
 70.9 Basic Latin
result en

Note that the URL content is not static; about an hour later I got:

 27.9 CJK Unified Ideographs
  0.0 Extended Latin
  0.1 Katakana
 72.0 Basic Latin

The statistics were obtained by fiddling with the code around line 361 of guess_language.py, so that it reads:

for key, value in run_types.items():
    pct = (value*100.0) / totalCount # line changed so that pct is a float
    print "%5.1f %s" % (pct, key) # line inserted
    if pct >=40:
        relevant_runs.append(key)

The statistics are symptomatic of Chinese with lots of HTML/XML/Javascript stuff (see previous example); this is confirmed by looking at the output of the pretty-print obtained by un-commenting pp(fall) -- lots of stuff like:

<img style="float:left; margin:0 10px 0px 10px;cursor:pointer; cursor:hand
;" width="60px" src="http://2.bp.blogspot.com/_LBJ4udkQZag/Rm6sTn1b7NI/AAAAAAAAA
FA/bYkSJZ3i2bg/s400/hepinge169.gif" border="0" alt=""id="BLOGGER_PHOTO_ID_507518
3283203730642" alt="\u548c\u5e73\u6771\u8def\u4e00\u6bb5169\u865f" title="\u548c
\u5e73\u6771\u8def\u4e00\u6bb5169\u865f"/>\u4eca\u5929\u4e2d\u5348\u8d70\u523
0\u516c\u53f8\u5c0d\u9762\u76847-11\u8cb7\u98f2\u6599\uff0c\u7a81\u7136\u770b\u5
230\u9019\u500b7-11\u602a\u7269\uff01\u770b\u8d77\u4f86\u6bd4\u6a19\u6e96\u62db\
u724c\u6709\u4f5c\u7528\u7684\u53ea\u6709\u4e2d\u9593\u7684\u6307\u793a\u71c8\u8
00c\u5df2\uff0c\u53ef\u537b\u6709\u8d85\u7d1a\u5927\u7684footprint\uff01<br /
><br /><a href="http://4.bp.blogspot.com/_LBJ4udkQZag/Rm6wHH1b7QI/AA

You need to do something about the markup. Steps: Look at your raw "htmlSource" in an XML browser. Is the XML non-compliant? How can you avoid having untranslated < etc? What elements have text content that is "English" only by virtue of it being a URL or similar? Is there a problem in Beautiful[Stone]Soup? Should you be using some other functionality of Beautiful[Stone]Soup? Should you use lxml instead?
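
For instance, one possible cleanup (a sketch only; which elements to drop and the URL test are guesses, not a vetted recipe) is to extract() script/style elements and to skip URL-only strings before joining:

from BeautifulSoup import BeautifulStoneSoup

def visible_words(htmlSource):
    soup = BeautifulStoneSoup(htmlSource)
    # Remove elements whose text content is code or styling, not language.
    for tag in soup.findAll(['script', 'style']):
        tag.extract()
    chunks = soup.findAll(text=True)
    # Skip whitespace-only strings and strings that are just URLs.
    keep = [c for c in chunks
            if c.strip() and not c.strip().lower().startswith('http')]
    return ' '.join(keep)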

I'd suggest some research followed by a new SO question.

End of update 3

心是晴朗的。 2024-08-26 19:36:43

It looks like you should be able to pass your unicode as-is. guessLanguage decodes an input that is str as utf-8. So your .encode('utf-8') is safe but unnecessary.
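
A minimal check of that claim (an untested sketch, assuming guess_language is importable):

import guess_language

text = u'\u3053\u3093\u306b\u3061\u306f'  # Japanese sample text
# Passing unicode directly and passing UTF-8 bytes should agree,
# since str input is decoded as utf-8 internally.
assert guess_language.guessLanguage(text) == \
       guess_language.guessLanguage(text.encode('utf-8'))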

I skimmed the source code and assumed it relies exclusively on the data in its "trigrams" directory for language detection, and it would not handle Japanese because there is no "ja" subdirectory in there. That is not correct, as pointed out by John Machin. So I have to assume your input is not what you think it is (which is hard to debug since it's not showing up correctly in your question).

寂寞笑我太脆弱 2024-08-26 19:36:43

Google says your example is in Chinese. They have a (much more advanced) web service to translate text and guess the language.

They have an API and code examples for Python.
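
For reference, a sketch against the AJAX language-detection endpoint as it was documented at the time (the service has since been deprecated, so treat the URL and response shape as historical):

import json
import urllib
import urllib2

def google_detect(text):
    # Old Google AJAX Language API -- deprecated; shown for illustration.
    params = urllib.urlencode({'v': '1.0', 'q': text.encode('utf-8')})
    url = 'http://ajax.googleapis.com/ajax/services/language/detect?' + params
    data = json.load(urllib2.urlopen(url))
    return data['responseData']['language']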
