Searching for a database of entity names (universities, cities, people, countries...)

Posted on 2024-08-06 12:48:40 · 194 words · 12 views · 0 comments

For an enterprise application research project that another person and I are working on, we want to remove certain content from the page to keep the posted messages universal (meaning not offensive and essentially anonymous). Right now we want to take a message that a user has posted to a message board and remove any type of name, the name of a college or institution, and profanity (later, if possible, we would also like to remove business names).

Is there a database we can connect to and check our messages against, so that we can recognize these values and scrub them?


Comments (1)

初懵 2024-08-13 12:48:40

The question seems to imply an online database that would be queried while messages are processed. Operational issues (reliability of such services, lag in response time, etc.) as well as completeness issues (the need to query multiple databases, because no single one will cover 100% of the project's lexical needs) render this online/real-time approach impractical. There are, however, many databases available for download that would allow you to build your own local database of "hot words".
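As a rough illustration of the local "hot words" database idea, here is a minimal sketch using Python's built-in sqlite3 module. The table name `hot_words`, the helper names `build_hot_word_db` and `scrub`, and the sample entries are all made up for this example; a real system would load its terms from the downloaded sources discussed below and would also need to handle multi-word names.

```python
import re
import sqlite3


def build_hot_word_db(entries):
    """Create an in-memory SQLite table of (term, category) pairs.

    The schema is an assumption made for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hot_words (term TEXT PRIMARY KEY, category TEXT)")
    conn.executemany(
        "INSERT OR IGNORE INTO hot_words VALUES (?, ?)",
        [(term.lower(), category) for term, category in entries],
    )
    conn.commit()
    return conn


def scrub(message, conn, placeholder="[removed]"):
    """Replace every single-word token found in hot_words with a placeholder."""

    def replace(match):
        token = match.group(0)
        row = conn.execute(
            "SELECT 1 FROM hot_words WHERE term = ?", (token.lower(),)
        ).fetchone()
        return placeholder if row else token

    # Only single alphabetic tokens are checked; multi-word names are not handled.
    return re.sub(r"[A-Za-z']+", replace, message)


# Hypothetical sample entries, for illustration only.
conn = build_hot_word_db([("MIT", "university"), ("Celtics", "team")])
print(scrub("I saw the Celtics play near MIT.", conn))
```

Because the lookup is local, there is no dependency on the availability or response time of a remote service while messages are being processed.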

A good place to start could be WordNet, where you'd likely treat all of the "instance" words as words that typically need to be removed from messages as you anonymize/cleanse them. (Maybe you'll also want to keep the "non-instance" words in a separate table/list of words that are "more likely to be OK".) This list alone could probably support a respectable "0.9" version of your application.

You'll eventually want to extend this lexical database of "bad words", however, for example to include university acronyms (CMU, UCSD, DU, MIT, UNC and such), sports team names (Celtics, Bruins, Red Sox...) and, depending on the domain of your messages, additional names of public figures (WordNet has several, such as George Bush or Robert De Niro, but it lacks less famous people and people who came to fame more recently, e.g. Barack Obama).

To complement WordNet, two distinct types of sources come to mind:

  • traditional online databases
  • ontologies and folksonomies

An example of the former is the "Cities/State by ZIP code" lookup at the USPS. Examples of the latter are the various "lists" compiled by scholars, organizations or individuals. It is impossible to provide an exhaustive list of either of these source types, but the following should help:

  • DAML.ORG: catalog of ontologies
  • US Regions and States: example of an ontology in DAML format
  • Open Directory Project: open source directory (careful, it gets messy quickly)
  • SourceWatch.org: example of a "list of lists" (people in journalism/politics)
  • Search engine keywords: "List Of Lists", or three or four of the words you'd expect to find in the list you seek.

In simpler cases, one can merely download the lists, or even cut and paste them. The ontologies will be "encumbered" with additional attributes that you'll need to parse out. (In the future you may actually want these attributes and use the ontologies in a more traditional fashion; for now, grabbing the lexical entities is all that is needed.)
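To give an idea of what "parsing out" the lexical entities might look like, here is a small sketch using Python's standard xml.etree.ElementTree. The sample document, the `ex:` namespace and the assumption that entity names sit in `rdfs:label` elements are all invented for illustration; real DAML or RDF files use their own vocabularies, so the tag to look for has to be adapted to each source.

```python
import xml.etree.ElementTree as ET

# Assumption: the names we want are stored in rdfs:label elements.
RDFS_LABEL = "{http://www.w3.org/2000/01/rdf-schema#}label"

# Made-up sample in RDF/XML style; not taken from any real ontology.
SAMPLE = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:ex="http://example.org/regions#">
  <ex:State rdf:ID="OH">
    <rdfs:label>Ohio</rdfs:label>
    <ex:capital>Columbus</ex:capital>
  </ex:State>
  <ex:State rdf:ID="TX">
    <rdfs:label>Texas</rdfs:label>
    <ex:capital>Austin</ex:capital>
  </ex:State>
</rdf:RDF>
"""


def extract_labels(rdf_xml):
    """Return the sorted, de-duplicated label texts, ignoring all other attributes."""
    root = ET.fromstring(rdf_xml)
    return sorted({el.text.strip() for el in root.iter(RDFS_LABEL) if el.text})


print(extract_labels(SAMPLE))
```

The other attributes (here the hypothetical `ex:capital`) are simply skipped; they remain in the source file if they are wanted later.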

This lexical database compilation task may seem daunting. But the 80-20 rule suggests that 20% of the "hot words" will account for 80% of the occurrences in the messages, so with a relatively small effort you should be able to produce a system that covers 90%+ of your use cases.

Looking ahead: Beyond the "hot words" database
There are many ways of approaching this task using various techniques and concepts from Natural Language Processing (NLP). As your project gains in sophistication, you may want to learn about some of these concepts and possibly implement them. For example, a simple part-of-speech (POS) tagger comes to mind, as it may help [in part] to discriminate between the various usages of a token such as "SCREW" when your application discards offensive words ("The board of directors wants to screw the students" vs. "The board should be fastened with a minimum of 4 screws per yard").

Even before you need these formal NLP techniques, you may use a few pattern-based rules to handle common cases in the domain(s) of the messages the project targets. For example, you may consider the following:

  • (word) State University
  • Senator (Word_Starting_with_Capital_Letter)
  • Words that mix letters and numbers (these are often used to misspell names and circumvent the type of filter your project wishes to implement)
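The three rules above could be sketched with regular expressions roughly as follows. The exact patterns, the `apply_rules` helper and the placeholder text are assumptions made for this example, and the rules are deliberately crude: the last one, for instance, will also catch harmless tokens such as "4th" or "B2B".

```python
import re

PATTERNS = [
    # "(word) State University", e.g. "Ohio State University"
    re.compile(r"\b[A-Z][a-z]+ State University\b"),
    # "Senator" followed by a capitalized word
    re.compile(r"\bSenator [A-Z][a-z]+\b"),
    # Tokens containing at least one letter and at least one digit, e.g. "h4rv4rd"
    re.compile(r"\b(?=[A-Za-z0-9]*[A-Za-z])(?=[A-Za-z0-9]*[0-9])[A-Za-z0-9]+\b"),
]


def apply_rules(message, placeholder="[removed]"):
    """Apply each pattern in turn, replacing matches with the placeholder."""
    for pattern in PATTERNS:
        message = pattern.sub(placeholder, message)
    return message


print(apply_rules("Senator Smith visited Ohio State University with h4rv4rd fans."))
```

Rules like these catch names that are missing from the word lists, at the cost of some false positives, so it is worth reviewing what they remove on real messages.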

Another tool that may be useful, particularly in the beginning, is a system that collects statistics about the message corpus: word frequency, most common words, most common bigrams (two consecutive words), etc.
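Such statistics can be gathered with very little code. The following is a minimal sketch using collections.Counter; the `corpus_stats` helper, the naive tokenizer and the sample messages are made up for illustration.

```python
import re
from collections import Counter


def corpus_stats(messages, top_n=3):
    """Return the top_n most common words and bigrams across all messages."""
    words = Counter()
    bigrams = Counter()
    for message in messages:
        # Naive tokenizer: lowercase alphabetic runs only.
        tokens = re.findall(r"[a-z']+", message.lower())
        words.update(tokens)
        # Pair each token with its successor; bigrams never span two messages.
        bigrams.update(zip(tokens, tokens[1:]))
    return words.most_common(top_n), bigrams.most_common(top_n)


# Hypothetical sample corpus.
messages = ["the board met today", "the board voted", "students met the board"]
top_words, top_bigrams = corpus_stats(messages)
print(top_words)
print(top_bigrams)
```

Frequent words and bigrams that are not yet in the "hot words" database are good candidates for manual review.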
