Is there a way to clean up a text file (removing similar words) without using nested for loops?
I'm trying to work out the best way to clean up text in a file. Given an input file, I want to match words that are similar and replace them. For example, if both apple and ApPle are in the file, ApPle would be replaced by apple.
Is there any way to do this without using two for loops like so:
for $word in @file
    for $word2 in @file
        if $word matches $word2
            replace $word2 with $word
        end
    end
end
I'm always hesitant to use nested for loops, so I'm wondering if there's a more elegant solution. If you're wondering why this is pseudocode, it's because I haven't decided what language to write it in yet. (For those who don't know, @file is a list of words and $word is a string of non-whitespace characters.)
Comments (2)
Perhaps this will work:
Define a unique representation (a "hash function") for similar words. (If the only difference is case, that's easy. If it's similar pronunciation, that's more difficult.)
Read the file in one pass, maintaining a "hash table", and print a word only if it's not yet in the hash table.
If your hash function is not injective, things get slightly more complicated.
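As a rough illustration of the idea above, here is a minimal Python sketch. It assumes "similar" means "differs only in case", so the hypothetical `canonical` function simply case-folds the word. Instead of only skipping repeated words, this variant replaces every later spelling with the first one seen, which is closer to what the question asks for. The names `canonical` and `normalize_words` are made up for this example.

```python
def canonical(word: str) -> str:
    """Hypothetical 'hash function': words differing only in case share one key."""
    return word.casefold()


def normalize_words(words):
    """Single pass: replace each word with the first-seen spelling of its key."""
    first_seen = {}  # canonical key -> first spelling encountered
    result = []
    for word in words:
        key = canonical(word)
        # setdefault stores the word only if the key is new,
        # then returns whichever spelling is stored for that key.
        result.append(first_seen.setdefault(key, word))
    return result


words = "apple ApPle banana APPLE Banana".split()
print(normalize_words(words))
# ['apple', 'apple', 'banana', 'apple', 'banana']
```

Each word costs one dictionary lookup, so the work grows linearly with the number of words rather than quadratically as in the nested loops. A different notion of similarity (pronunciation, for instance) would only require swapping out `canonical`, provided similar words really do map to the same key.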
It really depends on what "similar" means to you, and when words should be replaced. Should the code determine this? Do you want to turn everything that's in uppercase into lowercase, or should the code use different criteria to do this?
In PHP, you could conceivably use (a combination of) these functions:
http://www.php.net/manual/en/function.str-ireplace.php (case-insensitive replace)
http://www.php.net/manual/en/function.strtolower.php (convert a string to lowercase)
http://www.php.net/manual/en/function.strtoupper.php (convert a string to uppercase)
http://php.net/manual/en/function.similar-text.php (see how similar string A is to string B)
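Since the question has not settled on a language, here is a rough Python sketch of the same kind of building blocks. It is not a translation of the PHP functions: `str.lower` stands in for strtolower, and the standard library's `difflib.SequenceMatcher` is used as a loose analogue of similar_text, with its own scoring. The helper names and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two words as 'similar' when their score reaches the threshold.

    The default threshold is an arbitrary value for this sketch.
    """
    return similarity(a, b) >= threshold


print(similarity("apple", "ApPle"))   # 1.0, identical once lowercased
print(is_similar("apple", "aple"))    # True
print(is_similar("apple", "banana"))  # False
```

Unlike the lowercase-only case, a fuzzy score like this does not give each word a single key, so every word has to be compared against the words already kept. That is where the choice of what "similar" means starts to affect the cost.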
If you can post more details about your intended use case, you'll probably get better answers :)