Natural language processing in PHP

Posted on 2024-12-10 12:55:25

Given, say, a recipe (list of ingredients, steps, etc.) in free-text form, how could I parse it so that I can pull out the ingredients (e.g. quantity, unit of measurement, ingredient name, etc.) using PHP?

Assume that the free text is somewhat formatted.

Comments (5)

浅沫记忆 2024-12-17 12:55:25

To do it 'properly', you need to define some sort of grammar, and then maybe use an LALR parser or tools such as yacc, bison or Lex to build a parser. Assuming you don't want to do that, it's strpos() ftw!

淡水深流 2024-12-17 12:55:25

There is OpenNLP in Java for named entity extraction, which can fetch what you are looking for; see: http://opennlp.sourceforge.net/models-1.5/

Then you can use a PHP-Java connector to get the results into PHP.

等风也等你 2024-12-17 12:55:25

There's a very similar question for Java. In short, you need dictionaries (of, say, ingredients) and a regex-like language over terms (annotations). You can do it in Java and invoke it from PHP via a web service, or you can try to re-implement it in PHP (note that in the second case you may see a significant slowdown).
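
A rough sketch of what "a regex-like language over annotations" could look like in plain PHP (7.4 or later): each token gets a one-letter tag from a dictionary lookup, and an ordinary regular expression is then run over the tag string instead of the raw text. The dictionaries, the tag letters and the function names here are all made up for illustration, and tokens with trailing punctuation are not handled.

```php
<?php
// Hypothetical dictionaries; a real system would load much larger lists.
const UNIT_DICT = ['cup', 'cups', 'tbsp', 'tsp', 'teaspoon', 'tablespoon'];
const INGREDIENT_DICT = ['flour', 'sugar', 'olive', 'oil', 'salt', 'butter'];

// Annotate each token with one tag: N number, U unit, I ingredient, O other.
function annotateTokens(string $line): array
{
    $tokens = preg_split('~\s+~', strtolower(trim($line)), -1, PREG_SPLIT_NO_EMPTY);
    $tags = '';
    foreach ($tokens as $token) {
        if (preg_match('~^\d+(?:[./]\d+)?$~', $token)) {
            $tags .= 'N';
        } elseif (in_array($token, UNIT_DICT, true)) {
            $tags .= 'U';
        } elseif (in_array($token, INGREDIENT_DICT, true)) {
            $tags .= 'I';
        } else {
            $tags .= 'O';
        }
    }
    return [$tokens, $tags];
}

// Pattern over the tags: numbers, an optional unit, optional filler
// words, then one or more ingredient words. Because there is exactly one
// tag character per token, offsets in the tag string are token indexes.
function matchIngredient(string $line): ?array
{
    [$tokens, $tags] = annotateTokens($line);
    if (!preg_match('~(N+)(U?)O*(I+)~', $tags, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }
    $slice = fn(array $group): string =>
        implode(' ', array_slice($tokens, $group[1], strlen($group[0])));
    return [
        'quantity'   => $slice($m[1]),
        'unit'       => $slice($m[2]),
        'ingredient' => $slice($m[3]),
    ];
}
```

For a line like "2 1/2 cups of flour" the tag string is NNUOI, so the pattern picks out "2 1/2", "cups" and "flour"; a line with no number and no known ingredient returns null.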

心的憧憬 2024-12-17 12:55:25

If you want to do this quickly, with the least amount of resource-gathering, you can probably come up with some good heuristics and some regular expressions.

Since you say that the list is "somewhat formatted," I'll work on the assumption that there is one ingredient directive per line.

I'd start by coming up with a list of measurement names, which are a relatively closed class (as we call it in linguistics), like $measurements = ['cup', 'tablespoon', 'teaspoon', 'pinch', 'dash', 'to taste', ...]. You might even come up with a dictionary that maps several items to one normalised value (so $measurements = ['cup' => ['cup', 'c'], 'tablespoon' => ['tablespoon', 'tbsp', 'tablesp', ...], ...] or whatnot).
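
As a minimal sketch of that dictionary in actual PHP: the alias lists below are only a starting point, and normalizeUnit() is a name I made up for the lookup helper.

```php
<?php
// Hypothetical alias table: canonical unit => accepted spellings.
// The entries are illustrative, not an exhaustive list.
const UNIT_ALIASES = [
    'cup'        => ['cup', 'cups', 'c'],
    'tablespoon' => ['tablespoon', 'tablespoons', 'tbsp', 'tablesp'],
    'teaspoon'   => ['teaspoon', 'teaspoons', 'tsp'],
    'pinch'      => ['pinch'],
    'dash'       => ['dash'],
    'to taste'   => ['to taste'],
];

// Return the canonical unit for a token, or null if it is not a known unit.
// Case and a trailing full stop ("Tbsp.") are ignored.
function normalizeUnit(string $token): ?string
{
    $needle = strtolower(rtrim(trim($token), '.'));
    foreach (UNIT_ALIASES as $canonical => $aliases) {
        if (in_array($needle, $aliases, true)) {
            return $canonical;
        }
    }
    return null;
}
```

With this table, normalizeUnit('Tbsp.') gives 'tablespoon' and normalizeUnit('flour') gives null. Since the class is fairly closed, a linear scan is fine; if the table grows, flipping it into an alias => canonical map would make each lookup a single array access.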

Then on each line, you can find the unit of measurement if it is in your dictionary. Next, look for numbers (which may be formatted as decimals -- e.g. 1.5 -- or as mixed fractions -- e.g. 2 1/2 or 2-1/2), and assume that is the count of the units you need. If there are no numbers, then you can just assume that the count is one (as may be the case with "to taste" and the like).
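
The number formats mentioned here could be handled roughly like this. It is a sketch that only looks at the start of the line, and parseQuantity() is a hypothetical helper name.

```php
<?php
// Parse a leading quantity such as "1.5", "3/4", "2 1/2" or "2-1/2".
// Returns [value, rest of line]; value is null when no number is found.
function parseQuantity(string $line): array
{
    $line = trim($line);
    if (preg_match('~^(\d+)[\s-]+(\d+)/(\d+)\b~', $line, $m) && (int) $m[3] !== 0) {
        // Mixed fraction: whole part, space or hyphen, numerator/denominator.
        $value = (int) $m[1] + (int) $m[2] / (int) $m[3];
    } elseif (preg_match('~^(\d+)/(\d+)\b~', $line, $m) && (int) $m[2] !== 0) {
        // Simple fraction.
        $value = (int) $m[1] / (int) $m[2];
    } elseif (preg_match('~^\d+(?:\.\d+)?~', $line, $m)) {
        // Integer or decimal.
        $value = (float) $m[0];
    } else {
        return [null, $line];
    }
    // $m[0] is the matched number text; everything after it is the rest.
    return [(float) $value, ltrim(substr($line, strlen($m[0])))];
}
```

For example, parseQuantity('2-1/2 cups flour') returns [2.5, 'cups flour'], and a line without a leading number comes back with null so the caller can fall back to a count of one.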

Finally, you can assume anything that is remaining is the actual ingredient.
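
Putting the three steps together, a self-contained sketch might look like the following. The UNITS table and parseIngredientLine() are illustrative names, the unit list is deliberately tiny, and stripping a leading "of" is an extra convenience I added rather than part of the heuristic as described.

```php
<?php
// Hypothetical unit table (alias => canonical); extend as needed.
const UNITS = [
    'cups' => 'cup', 'cup' => 'cup', 'c' => 'cup',
    'tablespoons' => 'tablespoon', 'tablespoon' => 'tablespoon', 'tbsp' => 'tablespoon',
    'teaspoons' => 'teaspoon', 'teaspoon' => 'teaspoon', 'tsp' => 'teaspoon',
    'pinch' => 'pinch', 'dash' => 'dash', 'to taste' => 'to taste',
];

function parseIngredientLine(string $line): array
{
    $rest = trim($line);

    // 1. Find the first known unit anywhere in the line and cut it out.
    $unit = null;
    foreach (UNITS as $alias => $canonical) {
        $pattern = '~\b' . preg_quote($alias, '~') . '\b\.?~i';
        if (preg_match($pattern, $rest)) {
            $unit = $canonical;
            $rest = preg_replace($pattern, ' ', $rest, 1);
            break;
        }
    }

    // 2. Find a number: mixed fraction, simple fraction, decimal or integer.
    //    With no number at all, the count defaults to one.
    $quantity = 1.0;
    $number = '~(?:(\d+)[\s-]+)?(\d+)/([1-9]\d*)\b|\d+(?:\.\d+)?~';
    if (preg_match($number, $rest, $m)) {
        // $m[3] is only set when the fraction alternative matched.
        $quantity = isset($m[3])
            ? (float) ((int) $m[1] + (int) $m[2] / (int) $m[3])
            : (float) $m[0];
        $rest = preg_replace($number, ' ', $rest, 1);
    }

    // 3. Whatever remains is assumed to be the ingredient.
    $ingredient = trim(preg_replace('~\s+~', ' ', $rest), " \t,.-");
    $ingredient = preg_replace('~^of\s+~i', '', $ingredient);

    return ['quantity' => $quantity, 'unit' => $unit, 'ingredient' => $ingredient];
}
```

A line like "2 1/2 cups flour" comes out as quantity 2.5, unit "cup", ingredient "flour", and "salt to taste" as quantity 1.0, unit "to taste", ingredient "salt". A line like "2 oranges" simply gets a null unit.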

I imagine this heuristic would cover 75-80% of your cases. You're still going to have a lot of corner cases, like when the recipe calls for "2 oranges", or -- worse! -- "Juice of 2 oranges". In these cases, you would either want to add them (during some sort of off-line curation) as exceptions, or let yourself be "OK" with them not being treated properly.

萌化 2024-12-17 12:55:25

Without a ton of language modeling, I think the only way would be to have a huge list of ingredients and search for them in the recipe. The quantity should be the word immediately prior to the ingredient.
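
A small sketch of that idea, assuming a hand-made ingredient list (KNOWN_INGREDIENTS and findIngredients() are hypothetical names):

```php
<?php
// Hypothetical ingredient list; the approach assumes a much larger one.
const KNOWN_INGREDIENTS = ['olive oil', 'flour', 'sugar', 'eggs', 'butter'];

// Scan free text for known ingredients and take the word just before
// each one as its quantity. Only the first occurrence is recorded.
function findIngredients(string $text): array
{
    $found = [];
    foreach (KNOWN_INGREDIENTS as $ingredient) {
        // (\S+) captures the single word immediately preceding the ingredient.
        $pattern = '~(\S+)\s+' . preg_quote($ingredient, '~') . '\b~i';
        if (preg_match($pattern, $text, $m)) {
            $found[$ingredient] = $m[1];
        }
    }
    return $found;
}
```

On "Beat 3 eggs, then fold in 200g flour." this picks up "3" for eggs and "200g" for flour. Note that with "2 cups flour" the preceding word is the unit, not the number, so this only works as stated when quantity and ingredient are adjacent.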
