What tools do you use to analyze text?

I'm in need of some inspiration. For a hobby project I am playing with content analysis. I am basically trying to analyze input to match it to a topic map.

For example:

  • "The way on Iraq" > History, Middle East
  • "Halloumni" > Food, Middle East
  • "BMW" > Germany, Cars
  • "Obama" > USA
  • "Impala" > USA, Cars
  • "The Berlin Wall" > History, Germany
  • "Bratwurst" > Food, Germany
  • "Cheeseburger" > Food, USA
  • ...

I've been reading a lot about taxonomy, and in the end, whatever I read concludes that all people tag differently and therefore the system is bound to fail.

I thought about tokenized input and stop word lists, but they are of course a lot of work to come up with and build. Building the relevant links between words and topics seems exhausting and also never-ending, because whatever language you deal with, it is very rich, and most languages also heavily rely on context. Let alone maintaining it.
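
Here's roughly what I have in mind, as a minimal sketch; the stop-word list and the keyword-to-topic table are just placeholder stubs I'd have to grow by hand, which is exactly the maintenance burden I'm worried about:

```python
# Minimal sketch of the tokenize + stop-word + keyword->topic idea.
# STOP_WORDS and KEYWORD_TOPICS are tiny placeholders; a real system
# would need much larger, language-specific versions of both.
import re

STOP_WORDS = {"the", "on", "a", "an", "of", "way"}

KEYWORD_TOPICS = {
    "iraq": {"History", "Middle East"},
    "bmw": {"Germany", "Cars"},
    "impala": {"USA", "Cars"},
    "bratwurst": {"Food", "Germany"},
    "cheeseburger": {"Food", "USA"},
}

def guess_topics(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    topics = set()
    for token in tokens:
        if token not in STOP_WORDS:
            topics |= KEYWORD_TOPICS.get(token, set())
    return topics

print(guess_topics("The way on Iraq"))  # {'History', 'Middle East'}
print(guess_topics("BMW"))              # {'Germany', 'Cars'}
```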

I guess I need to come up with something smart and train it with topics I want it to be able to guess. Kind of like an Eliza bot.

Anyway, I don't believe there is something that does this out of the box, but does anyone have any leads or examples of technology I could use to analyze input and extract meaning?

Comments (3)

听你说爱我 2024-07-28 06:41:56

Hiya. I'd first look to OpenCalais (from the Reuters guys) for finding entities within texts or input. It's great, and I've used it plenty myself.

After that you can analyze the text further, creating associations between entities and words. I'd probably look them up in something like WordNet and try to typify them, or even auto-generate some ontology that matches the domain you're trying to map.
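
For instance, here is a small sketch of that WordNet lookup using NLTK's WordNet interface; the word and the chain shown are only illustrative, and OpenCalais itself (which needs an API key) isn't shown:

```python
# Sketch of "typifying" a word via its WordNet hypernym chain, using
# NLTK. Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.corpus import wordnet as wn

def typify(word):
    """Return the hypernym chain (most general concept first) of the
    most common noun sense of `word`, or [] if WordNet doesn't know it."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    if not synsets:
        return []
    # hypernym_paths() yields root-to-leaf chains; take the first.
    path = synsets[0].hypernym_paths()[0]
    return [s.lemmas()[0].name() for s in path]

print(typify("bratwurst"))
# e.g. ['entity', 'physical_entity', ..., 'food', ..., 'sausage', 'bratwurst']
```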

As to how to pull it all together, there are many things you can do: the above, or two- or three-pass models that try to figure out what words are and mean. Or, if you control the input, make up a format that is easier to parse, or go down the murky path of NLP (which is a lot of fun).

Or you could look to something like Jena for parsing arbitrary RDF snippets, although I don't like the RDF premise myself (I'm a Topic Mapper). I've written stuff that looks up words, phrases, or names in Wikipedia and rates their hit rate based on the semantics found in the Wikipedia pages: number of links, number of SeeAlso entries, amount of text, how big the discussion page is, and so on. (I could give you more details if requested, but isn't it more fun to work it out yourself and come up with something better than mine? :)
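
As a rough illustration of the Wikipedia idea, here's a sketch that counts a page's outgoing links via the public MediaWiki API; the other signals (SeeAlso, text size, discussion page) and any weighting are left out, so treat it as a starting point, not my actual code:

```python
# Sketch of rating a term by its Wikipedia page: here only the number
# of outgoing links is queried, via the public MediaWiki API. SeeAlso
# counts, text size, talk-page size, and any weighting are left out.
import requests

API = "https://en.wikipedia.org/w/api.php"

def wikipedia_link_count(title):
    params = {
        "action": "query",
        "titles": title,
        "prop": "links",
        "pllimit": "max",  # up to 500 links; pagination ('continue') omitted
        "format": "json",
    }
    data = requests.get(API, params=params, timeout=10).json()
    page = next(iter(data["query"]["pages"].values()))
    return len(page.get("links", []))  # 0 if the page doesn't exist

print(wikipedia_link_count("Berlin Wall"))  # a rough popularity signal
```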

I've written tons of stuff over the years (even in PHP and Perl; look to Robert Barta's Topic Maps stuff on CPAN, especially the TM modules, for some kick-ass stuff), from engines to parsers to something weird in the middle: associative arrays that break words and phrases apart, creating cumulative histograms to sort their components out, and so forth. It's all fun stuff, but as to shrink-wrapped tools, I'm not so sure; everyone's goals and needs seem to be different. It depends on how complex and sophisticated you want to become.
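
To give you a taste of the histogram bit, here's a tiny sketch that breaks texts into words and two-word phrases and ranks the components by frequency (the sample inputs are made up):

```python
# Sketch of the associative-array / histogram idea: break texts into
# words and two-word phrases, then rank the components by frequency.
from collections import Counter

def component_histogram(texts):
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(words)  # single words
        counts.update(" ".join(pair) for pair in zip(words, words[1:]))  # bigrams
    return counts

hist = component_histogram(["the berlin wall", "the wall came down"])
print(hist.most_common(3))  # e.g. [('the', 2), ('wall', 2), ('berlin', 1)]
```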

Anyway, hope this helps a little. Cheers! :)

深居我梦 2024-07-28 06:41:56

SemanticHacker does exactly what you want, out-of-the-box, and has a friendly API. It's somewhat inaccurate on short phrases, but just perfect for long texts.

  • “The way on Iraq” > Society/Issues/Warfare and Conflict/Specific Conflicts
  • “Halloumni” > N/A
  • “BMW” > Recreation/Motorcycles/Makes and Models
  • “Obama” > Society/Politics/Conservatism
  • “Impala” > Recreation/Autos/Makes and Models/Chevrolet
  • “The Berlin Wall” > Regional/Europe/Germany/States
  • “Bratwurst” > Home/Cooking/Meat
  • “Cheeseburger” > Home/Cooking/Recipe Collections; Regional/North America/United States/Maryland/Localities

洋洋洒洒 2024-07-28 06:41:56

Sounds like you're looking for a Bayesian Network implementation. You may get by using something like Solr.

Also check out CI-Bayes. Joseph Ottinger wrote an article about it on theserverside.net earlier this year.
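
CI-Bayes itself is a Java library, so the following is not its API; it's just a minimal sketch of the underlying naive Bayes technique, with made-up training pairs, to show how little code the core idea needs:

```python
# Minimal naive Bayes text classifier -- the technique behind libraries
# like CI-Bayes. Training pairs are made-up examples; add-one (Laplace)
# smoothing keeps unseen words from zeroing out a topic's score.
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # topic -> word frequencies
        self.topic_counts = Counter()            # topic -> training doc count
        self.vocab = set()

    def train(self, text, topic):
        words = text.lower().split()
        self.word_counts[topic].update(words)
        self.topic_counts[topic] += 1
        self.vocab.update(words)

    def classify(self, text):
        words = text.lower().split()
        total = sum(self.topic_counts.values())
        best, best_score = None, float("-inf")
        for topic, counts in self.word_counts.items():
            # log P(topic) + sum of log P(word | topic), smoothed
            score = math.log(self.topic_counts[topic] / total)
            denom = sum(counts.values()) + len(self.vocab)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best, best_score = topic, score
        return best

nb = NaiveBayes()
nb.train("bmw audi autobahn bratwurst", "Germany")
nb.train("impala chevrolet obama cheeseburger", "USA")
print(nb.classify("a chevrolet impala"))  # USA
```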
