Just a distraction: tokenizing English without whitespace. Murakami Sheep Man
I wondered how you would go about tokenizing strings in English (or other Western languages) if the whitespace were removed.
The inspiration for the question is the Sheep Man character in the Murakami novel 'Dance Dance Dance'.
In the novel, the Sheep Man is translated as saying things like:
"likewesaid, we'lldowhatwecan. Trytoreconnectyou, towhatyouwant," said the Sheep Man. "Butwecan'tdoit-alone. Yougottaworktoo."
So, some punctuation is kept, but not all. Enough for a human to read, but somewhat arbitrary.
What would be your strategy for building a parser for this? Common combinations of letters, syllable counts, conditional grammars, look-ahead/behind regexps etc.?
Specifically, Python-wise, how would you structure a (forgiving) translation flow? I'm not asking for a completed answer, just for how you would go about breaking the problem down.
I ask this in a frivolous manner, but I think it's a question that might get some interesting (nlp/crypto/frequency/social) answers.
Thanks!
Comments (4)
I actually did something like this for work about eight months ago. I just used a dictionary of English words in a hashtable (for O(1) lookup times). I'd go letter by letter matching whole words. It works well, but there are numerous ambiguities. (asshit can be ass hit or as shit). To resolve those ambiguities would require much more sophisticated grammar analysis.
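As a rough illustration of the dictionary lookup described above (not the code used at work, which isn't shown), here is a minimal sketch that returns every way to split a string into known words. The tiny `WORDS` set and the function name `segmentations` are assumptions made for this example; a real run would load a full English word list.

```python
from functools import lru_cache

# Toy word list standing in for a real English dictionary (an assumption).
WORDS = frozenset({"a", "as", "ass", "hit", "shit", "we", "said", "like"})


def segmentations(text, words=WORDS):
    """Return every way to split `text` into dictionary words."""
    max_len = max(map(len, words))

    @lru_cache(maxsize=None)
    def split_from(start):
        # Memoised on the start index so each suffix is solved only once.
        if start == len(text):
            return [[]]  # exactly one way to split an empty remainder
        results = []
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in words:  # hash lookup, O(1) on average
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


print(segmentations("asshit"))      # both readings are returned
print(segmentations("likewesaid"))
```

Returning all splits rather than the first one makes the ambiguity explicit, so a later step (grammar or word statistics) can choose between them.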
First of all, I think you need a dictionary of English words -- you could try some methods that rely solely on some statistical analysis, but I think a dictionary has better chances of good results.
Once you have the words, you have two possible approaches:
You could categorize the words into grammar categories and use a formal grammar to parse the sentences -- obviously, you would sometimes get no match or multiple matches -- I'm not familiar with techniques that would allow you to loosen the grammar rules in case of no match, but I'm sure there must be some.
On the other hand, you could just take some large corpus of English text and compute the relative probabilities of certain words being next to each other, giving you a list of pairs and triples of words. Since that data structure would be rather big, you could use word categories (grammatical and/or based on meaning) to simplify it. Then you just build an automaton and choose the most probable transitions between the words.
I am sure there are many more possible approaches. You can even combine the two I mentioned, building some kind of grammar with weight attached to its rules. It's a rich field for experimenting.
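To make the corpus-based idea a bit more concrete, here is a small sketch that scores candidate splits with word-pair (bigram) counts and keeps the most probable one. Everything in it is an assumption for illustration: the made-up `CORPUS`, the add-one smoothing, and the names `candidates`, `log_score` and `best_split`. It also brute-forces all splits instead of building an automaton, which is only reasonable for short strings.

```python
import math
from collections import Counter

# Tiny made-up corpus standing in for a large body of English text (an assumption).
CORPUS = ("we are now here . we are now here again . "
          "there is no way . where are we").split()

UNIGRAMS = Counter(CORPUS)
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
VOCAB = set(CORPUS)


def candidates(text):
    """Yield every split of `text` into words seen in the corpus."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        if text[:end] in VOCAB:
            for rest in candidates(text[end:]):
                yield [text[:end]] + rest


def log_score(words):
    """Log-probability of a word sequence under a bigram model with add-one smoothing."""
    if not words:
        return 0.0
    score = math.log((UNIGRAMS[words[0]] + 1) / (len(CORPUS) + len(VOCAB)))
    for prev, cur in zip(words, words[1:]):
        score += math.log((BIGRAMS[prev, cur] + 1) / (UNIGRAMS[prev] + len(VOCAB)))
    return score


def best_split(text):
    """Return the highest-scoring split, or None if no split exists."""
    options = list(candidates(text))
    return max(options, key=log_score) if options else None


print(best_split("wearenowhere"))
```

Here "no where" and "now here" are both valid dictionary splits, and the pair counts are what tip the choice toward the sequence actually seen in the corpus.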
I don't know if this is of much help to you, but you might be able to make use of this spelling corrector in some way.
This is just some quick code I wrote out that I think would work fairly well to extract words from a snippet like the one you gave. It's not fully thought out, but I think something along these lines would work if you can't find a pre-packaged solution.
There are some more issues to work out. For example, if the scan never finds a match it will just keep adding characters and never recover. However, since your demo string has some spaces, you could have it recognize those too and automatically start over at each one.
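The original snippet is not preserved on this page, so the following is only a guess at the general shape of the idea, not the author's code: grow a buffer one character at a time, emit it as soon as it matches a word, and restart at every existing space. The `WORDS` set and the function name `greedy_words` are hypothetical.

```python
# Toy word list; a real run would load a full English dictionary (an assumption).
WORDS = {"like", "we", "said", "you", "gotta", "work", "too"}


def greedy_words(text, words=WORDS):
    """Grow a buffer character by character and emit it as soon as it is a known word.

    Returns (found, leftovers); leftovers holds fragments that never matched.
    """
    found, leftovers = [], []
    for chunk in text.lower().split():  # existing spaces restart the scan
        buffer = ""
        for char in chunk:
            if not char.isalpha():
                continue  # punctuation is simply skipped in this sketch
            buffer += char
            if buffer in words:
                found.append(buffer)
                buffer = ""
        if buffer:  # characters that never formed a word
            leftovers.append(buffer)
    return found, leftovers


print(greedy_words("likewesaid, Yougottaworktoo."))
```

Because it takes the first (shortest) match, this approach breaks as soon as one word is a prefix of another: with both "to" and "too" in the word list, "worktoo" comes out as "work", "to" plus a stray "o". Collecting the leftovers at least makes those failures visible.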
Also you need to account for punctuation, write conditionals like