Spam detection in (Objective-)C
I'm currently writing an iPhone application that gets some data from the user and uploads it to a server. The uploaded data will be displayed to other users of the same program (there's more to it than that, but to keep the idea simple...). The uploaded data is basically just three strings: a name (max. 50 characters), a title (max. 50 characters) and some text (virtually unlimited length). What I need is a function, service or algorithm that can judge how valid the input is. It would have to detect runs of repeated characters, certain 'illegal' words, abnormal whitespace, etc. So my question is: is there a C or Objective-C library (built-in or open source) for this sort of data validation, or else, how would I go about doing this kind of check?
Here are two examples of good and bad data:
GOOD:
Name: "John Aaron Smith" Title: "Why am I still here?" Text: "Can anybody please help me? I'm feeling lonely!"
BAD:
Name: "f**k you, kldsanfklds" Title: "Only $99. Buy Now. Only $99" Text: "ndsaklgnvds lakævndsaklæfhadsæhdsjka fhdskjafhdskj lafhsdkhf. €#&/ #&()(/&%& ># €%€#% €#& hidosæahviædshvidshfiodsa. adsifjDSILFJIDSH \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
I know taking precautions for so many cases will be difficult, but this algorithm/library would just have to filter the worst spam. I will also be looking through the data before the final database submission, but of course the less spam, the easier I'll have it.
Yours,
BEN.
EDIT: My most 'fluent' language is Objective-C, but I'm also doing pretty well with C, and I have some knowledge of PHP and Java. Libraries/examples in other languages might be difficult for me to understand and to 'translate' into a valid iPhone language.
EDIT-EDIT: I'm not looking for something overly sophisticated. Just a simple way for me to do the rough cut.
Comments (3)
This is a very difficult problem to solve. I would not attempt to create my own spam detection; I would use an existing solution with a good reputation, such as SpamAssassin.
Have you seen Mollom? It has a bunch of developer libraries (PHP, Ruby, Perl, etc.) that communicate with the Mollom servers to determine the spamminess of an entry. It wouldn't be hard to translate one of those to Objective-C.
I've made something similar to what you want, but it's in PHP. All the text I deal with is entered behind a captcha, so what I'm blocking is useless comment spam similar to your bad example. Here's what I've got so far, which has been blocking a good 80% of the junk. It may block some valid text from people with bad spelling habits, but I prefer that over manually editing text.
You could add to this by blocking text that contains suspicious characters, e.g. %^[]
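Since the asker wants C or Objective-C, here is a minimal sketch of that character check in plain C. The function name `has_suspicious_chars` and the `SUSPICIOUS_CHARS` constant are made up for illustration, and the blacklist holds only the four example characters from this answer; it is a rough filter, not a complete one.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical blacklist: only the example characters mentioned above. */
static const char SUSPICIOUS_CHARS[] = "%^[]";

/* Returns true if text contains at least one blacklisted character. */
bool has_suspicious_chars(const char *text)
{
    /* strpbrk returns a pointer to the first byte of text that also
       appears in SUSPICIOUS_CHARS, or NULL if there is none. */
    return strpbrk(text, SUSPICIOUS_CHARS) != NULL;
}
```

Because every blacklisted character is plain ASCII, the byte-wise comparison should not match inside multi-byte UTF-8 characters such as "€" or "æ". From Objective-C the function can be called with `[string UTF8String]`.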
Additionally, you could compile a list of character sequences that should never appear next to each other, e.g. fd, gf, kp, yt, vnd
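A rough C sketch of that sequence list might look like the following. The names `has_unlikely_sequence` and `UNLIKELY_SEQUENCES` are hypothetical, the list contains only the five examples given here, and case folding is ASCII-only, so treat it as a starting point under those assumptions.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Example sequences from the answer; a real list would be longer. */
static const char *const UNLIKELY_SEQUENCES[] = { "fd", "gf", "kp", "yt", "vnd" };

/* Returns true if text contains any listed sequence, ignoring ASCII case. */
bool has_unlikely_sequence(const char *text)
{
    size_t len = strlen(text);
    char *lower = malloc(len + 1);
    if (lower == NULL)
        return false;   /* out of memory: let the text through unchecked */

    /* Lower-case a copy of the text, including the terminating '\0'. */
    for (size_t i = 0; i <= len; i++)
        lower[i] = (char)tolower((unsigned char)text[i]);

    /* Search the lower-cased copy for each listed sequence in turn. */
    bool found = false;
    size_t count = sizeof UNLIKELY_SEQUENCES / sizeof UNLIKELY_SEQUENCES[0];
    for (size_t i = 0; i < count && !found; i++)
        found = strstr(lower, UNLIKELY_SEQUENCES[i]) != NULL;

    free(lower);
    return found;
}
```

A plain substring match like this also flags ordinary words: "anything" contains "yt", for example. That is the kind of false positive already accepted above, but it suggests counting hits and rejecting only above some threshold rather than on the first match.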
Beyond this point you need to automate by extending the algorithm. That would mean the algorithm has to understand some grammar, and the complexity of the whole process starts to multiply. Anything more than that is beyond my comprehension right now.