Detecting and removing noise text

Posted on 2024-09-01 13:32:46

Given a database table with a huge amount of data in it, what is the best practice for removing noise text such as:

  • fghfghfghfg
  • qsdqsdqsd
  • rtyrtyrty

The noise is stored in the "name" field.

I'm working on the data using standard Java data structures.

Comments (9)

末が日狂欢 2024-09-08 13:32:46

Removing stuff like that isn't as easy as it might seem.

For us humans, it's easy to see that "djkhfkjh" doesn't make any sense. But how would a computer detect this kind of noise? How would it know whether "Eyjafjallajökull" is just someone mashing the keyboard or the most overhyped mountain of the last couple of years?

You can't do this reliably without many false positives, so in the end you are back to separating the false positives from the true positives by hand.

淡写薰衣草的香 2024-09-08 13:32:46

Well, you can build a classifier using NLP methods and train it on examples of noise and not-noise. One ready-made example of that is the language detector from Apache Tika. If the language detector says 'beats me', that might be good enough.

貪欢 2024-09-08 13:32:46

Get a dictionary with as many names as you can find and filter your data to display the ones that are not in the dictionary. Then you have to delete them one by one to make sure you do not delete valid data.
Sorting the list by name can help you delete more rows at a time.

贱人配狗天长地久 2024-09-08 13:32:46

If the rest of the text is English, you could use a word list. If more than a given percentage (say, 50%) of the words in the text are not in the word list, it is probably noise.

You may want to set a threshold of, say, 5 words, to prevent deleting posts like 'LOL'.

On most Linux installations, you can extract a word list from the spell checker aspell like this:

aspell --lang en dump master
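
A minimal sketch of that check in Java, assuming the aspell output has been saved to a UTF-8 text file with one word per line. The class name, the method names, and the two tuning parameters (`minWords`, `maxUnknownRatio`) are made up for illustration; the 5-word minimum and 50% ratio are just the example values mentioned above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class WordListNoiseCheck {
    // Reads one word per line into a lower-cased set (file assumed to be UTF-8).
    static Set<String> loadWordList(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file)) {
            return lines.map(w -> w.trim().toLowerCase(Locale.ROOT))
                        .filter(w -> !w.isEmpty())
                        .collect(Collectors.toSet());
        }
    }

    // True when the text has at least minWords words and more than
    // maxUnknownRatio of them are missing from the word list.
    static boolean looksLikeNoise(String text, Set<String> wordList,
                                  int minWords, double maxUnknownRatio) {
        // Split on anything that is not a letter and drop empty tokens.
        List<String> words = Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+"))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.toList());
        if (words.size() < minWords) {
            return false; // too short to judge, e.g. "LOL"
        }
        long unknown = words.stream().filter(w -> !wordList.contains(w)).count();
        return (double) unknown / words.size() > maxUnknownRatio;
    }
}
```

The word list could be produced with `aspell --lang en dump master > words.txt` and loaded via `loadWordList`. Note that a 5-word minimum never flags a single-word "name" value, so for that field the minimum would have to be lowered.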

爱*していゐ 2024-09-08 13:32:46

You're going to need to start by defining "noise text" more effectively. Defining the problem is the hard part here. You can't write code that will say "get rid of strings that are sort of like _____." It looks like the pattern you've identified is "a consistent set of three characters in a row, and the set repeats at least once, but may not terminate cleanly (it could terminate on a character from the middle of the set)."

Now write a regular expression that matches that pattern, and test it.

But I bet there are other patterns that you're looking for...
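
As a rough illustration, here is one way that exact pattern might be written with `java.util.regex`. The class and method names are invented for this sketch, and the match is case-sensitive and applies to the whole string; it covers only the three-character pattern described above, nothing more.

```java
import java.util.regex.Pattern;

class RepeatedTripletCheck {
    // Three characters, the same three repeated at least once, then
    // optionally the first one or two characters of the set as a tail.
    private static final Pattern REPEATED_TRIPLET =
            Pattern.compile("(.)(.)(.)(?:\\1\\2\\3)+(?:\\1\\2?)?");

    // matches() requires the whole string to fit the pattern.
    static boolean isRepeatedTriplet(String s) {
        return REPEATED_TRIPLET.matcher(s).matches();
    }
}
```

All three samples from the question ("fghfghfghfg", "qsdqsdqsd", "rtyrtyrty") match, while "banana" and "Eyjafjallajökull" do not.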

暗恋未遂 2024-09-08 13:32:46

Inspect each word and see how much redundancy there is. If there are more than three consecutive repeated groups of letters, it is a good candidate for noise. Also, look for groups of letters that don't usually belong together and for groups of consecutive letters that are also consecutive on the keyboard. If a whole word is made of letters that are keyboard neighbors, it also claims a spot on the noise list.
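
The keyboard-neighbor part could be sketched like this. The sketch assumes a US QWERTY layout and treats two letters as neighbors only when they sit next to each other in the same row; the class name, method names, and the idea of returning a ratio are choices made for the example, not something the answer prescribes.

```java
import java.util.Locale;

class KeyboardRunCheck {
    // Assumption: US QWERTY letter rows; swap these for other layouts.
    private static final String[] ROWS = {"qwertyuiop", "asdfghjkl", "zxcvbnm"};

    // Two letters are neighbors when they are adjacent in the same row.
    static boolean areNeighbors(char a, char b) {
        for (String row : ROWS) {
            int i = row.indexOf(a);
            int j = row.indexOf(b);
            if (i >= 0 && j >= 0) {
                return Math.abs(i - j) == 1;
            }
        }
        return false;
    }

    // Share of consecutive letter pairs in the word that are keyboard neighbors.
    static double neighborRatio(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        if (w.length() < 2) {
            return 0.0;
        }
        int neighbors = 0;
        for (int k = 1; k < w.length(); k++) {
            if (areNeighbors(w.charAt(k - 1), w.charAt(k))) {
                neighbors++;
            }
        }
        return (double) neighbors / (w.length() - 1);
    }
}
```

The ratio is only one signal: "hello" scores 0.0, but the real word "were" scores 1.0. The layout matters too, since q, s and d are adjacent on an AZERTY keyboard but not on QWERTY, so "qsdqsdqsd" scores low with the rows used here.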

寂寞花火° 2024-09-08 13:32:46

Training an NLP classifier would probably be the best way to go. However, a simpler method might be to check that each word exists in a list of all known "valid" words. Most Unix systems have a file called /usr/share/dict/words that you can use for this purpose. Additionally, Ubuntu expands on this with /usr/share/dict/american-english, /usr/share/dict/american-english-huge, and /usr/share/dict/american-english-insane, each list more comprehensive than the last. These lists also include a lot of common misspellings, so you won't filter out text that's not technically a word, but clearly recognizable as a word.

If you're really ambitious, you can combine these approaches, and use these word lists to train a Bayesian or Maximum Entropy classifier.

帅冕 2024-09-08 13:32:46

There are a lot of good answers here. Which one(s) will work for you depends a lot on the specifics of your problem -- for example, is the input supposed to be English words, usernames, people's last names, etc.?

One approach: write a program to analyze what you consider "valid" input. Keep track of how frequently every possible three-letter sequence appears in legitimate text. Then when you have input to check, look at each three-letter sequence of the input and look up its expected frequency. Something like "xzt" probably has a frequency near zero. If you have too many subsequences like that, mark it as garbage.

Problems with this:

  1. You might treat bad spelling as garbage, for example if someone forgets to put a 'u' after a 'q' in a word.
  2. You won't catch input like "thethethethe".
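
A small sketch of the three-letter-sequence idea, under the assumption that "frequency near zero" can be approximated by "seen fewer than `minCount` times in the valid samples". The class name, method names, and the `minCount` parameter are invented for illustration; a real version would train on far more text and pick its threshold from data.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class TrigramNoiseCheck {
    private final Map<String, Integer> counts = new HashMap<>();

    // Counts every three-character sequence in text considered valid.
    void train(List<String> validSamples) {
        for (String sample : validSamples) {
            String s = sample.toLowerCase(Locale.ROOT);
            for (int i = 0; i + 3 <= s.length(); i++) {
                counts.merge(s.substring(i, i + 3), 1, Integer::sum);
            }
        }
    }

    // Share of the input's trigrams seen fewer than minCount times in training.
    double rareTrigramRatio(String input, int minCount) {
        String s = input.toLowerCase(Locale.ROOT);
        int total = s.length() - 2;
        if (total <= 0) {
            return 0.0; // fewer than three characters: nothing to score
        }
        int rare = 0;
        for (int i = 0; i < total; i++) {
            if (counts.getOrDefault(s.substring(i, i + 3), 0) < minCount) {
                rare++;
            }
        }
        return (double) rare / total;
    }
}
```

Trained on just "whether" and "the weather", this scores "xztxzt" at 1.0 (every trigram unseen) but "thethethe" at 0.0, which is the second problem listed above in action.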

箹锭⒈辈孓 2024-09-08 13:32:46

Examples #1 and #2 can be removed by a parser that tries to figure out how to pronounce the text. Regardless of language, they're unpronounceable and thus not words.
