Bing/Google/Flickr API: how to find an image corresponding to each of 150,000 Japanese sentences?
I'm working on a part-of-speech and morphological analysis project for Japanese sentences. Each sentence will have its own webpage. To make the page more visual, I want to show one picture that is somehow related to the sentence. For example, for the sentence "私は学生です" ("I'm a student"), relevant pictures would show a school, a Japanese textbook, students, etc. What I have: a part-of-speech tag for every word. My current approach: take 2-3 nouns from each sentence and use the first image that the Bing Images API returns for them. Note: all the sentence processing up to this point has been done in Java.
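For concreteness, the query-building half of that approach might look roughly like the sketch below. The `Token` class and the `"NOUN"` tag are hypothetical stand-ins for whatever the tagger actually emits, and the HTTP call to an image search service is deliberately left out because it depends on the API chosen.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class ImageQueryBuilder {
    /** Hypothetical token type: surface form plus a coarse POS tag from the tagger. */
    static final class Token {
        final String surface;
        final String pos;

        Token(String surface, String pos) {
            this.surface = surface;
            this.pos = pos;
        }
    }

    /** Joins the first maxNouns tokens tagged "NOUN" (an assumed tag name) with spaces. */
    static String buildQuery(List<Token> tokens, int maxNouns) {
        List<String> nouns = new ArrayList<>();
        for (Token t : tokens) {
            if ("NOUN".equals(t.pos) && nouns.size() < maxNouns) {
                nouns.add(t.surface);
            }
        }
        return String.join(" ", nouns);
    }

    /** Percent-encodes the query as UTF-8 so it can be used as a URL parameter value. */
    static String encode(String query) {
        return URLEncoder.encode(query, StandardCharsets.UTF_8);
    }
}
```

The encoded string would then be passed as the query parameter of whichever image search API is used.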
I have a couple of questions, though:
1) Which is better for searching Japanese nouns (richer corpus, more powerful search): the Google Images API, the Bing Images API, the Flickr API, or something else?
2) How do you select the most important noun in a sentence to use as the image search query, without resorting to complicated topic modeling and the like?
Thanks!
Comments (2)
The Japanese WordNet has links to OpenClipart pictures, which could be another relevant source. The authors describe this in their paper "Enhancing the Japanese WordNet".
I'd have thought you would start by choosing any noun that comes directly before は, が or を and giving those nouns priority, probably in that order.
But that assumes your part-of-speech tagging is good enough to identify は as the subject marker properly (as I guess you know, は is not always the subject marker).
I looked at a bunch of sample sentences here with this technique in mind and found it works about as well as could be expected, except where none of those particles is used, which is fairly rare.
And for sentences like this one, which have no を or は, you'd have to consider looking for で and the noun before it. If you notice here, the word 人 (people) really doesn't tell you anything about what's being said. Without parsing the context properly, you don't even know whether the noun means "person" or "people".
But basically, couldn't you implement a priority/fallback system like this?
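A minimal sketch of such a priority/fallback picker is below. It assumes a hypothetical `Token` class with coarse `"NOUN"` and `"PARTICLE"` tags (the real tag set depends on the tagger), tries は, then が, then を, then で, and finally falls back to the first noun in the sentence. That last fallback is an addition for illustration, not something taken from the description above.

```java
import java.util.List;
import java.util.Optional;

final class TopicNounPicker {
    /** Hypothetical token type: surface form plus a coarse POS tag. */
    static final class Token {
        final String surface;
        final String pos;

        Token(String surface, String pos) {
            this.surface = surface;
            this.pos = pos;
        }
    }

    // Particles in descending priority; で is only reached when the others find nothing.
    private static final List<String> PARTICLE_PRIORITY = List.of("は", "が", "を", "で");

    /** Returns the noun directly before the highest-priority particle, if any. */
    static Optional<String> pick(List<Token> tokens) {
        for (String particle : PARTICLE_PRIORITY) {
            for (int i = 1; i < tokens.size(); i++) {
                Token prev = tokens.get(i - 1);
                Token cur = tokens.get(i);
                boolean isTargetParticle =
                        "PARTICLE".equals(cur.pos) && particle.equals(cur.surface);
                if (isTargetParticle && "NOUN".equals(prev.pos)) {
                    return Optional.of(prev.surface);
                }
            }
        }
        // Assumed last resort: the first noun anywhere in the sentence.
        for (Token t : tokens) {
            if ("NOUN".equals(t.pos)) {
                return Optional.of(t.surface);
            }
        }
        return Optional.empty();
    }
}
```

Because the outer loop runs over particles rather than token positions, a noun before が is only chosen when no noun precedes は anywhere in the sentence, which matches the "in that order" priority.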
BTW, I hope your sentences all use kanji; otherwise, when you see はし (in one of the sentences linked to), you won't know whether to show a bridge or chopsticks, and showing the wrong one would probably not be good.
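One cheap guard, sketched here as an assumption rather than a full disambiguation method, is to detect nouns written entirely in hiragana and treat them as too ambiguous to query on. The class and method names are hypothetical; `Character.UnicodeScript` is part of the standard Java library.

```java
final class ScriptCheck {
    /** True if at least one code point is a Han ideograph (kanji). */
    static boolean containsKanji(String s) {
        return s.codePoints()
                .anyMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN);
    }

    /** True if the string is non-empty and every code point is hiragana. */
    static boolean isHiraganaOnly(String s) {
        return !s.isEmpty()
                && s.codePoints()
                    .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HIRAGANA);
    }
}
```

With this, `ScriptCheck.isHiraganaOnly("はし")` is true while `ScriptCheck.containsKanji("橋")` is true, so a picker could skip the former and fall through to the next candidate noun.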