How does Wikipedia avoid duplicate entries?

Posted on 2024-08-10 03:01:06

How do websites as big as Wikipedia sort out duplicate entries?

I need to know the exact procedure, starting from the moment a user creates the duplicate entry. If you don't know the procedure but know of a method, please share it.

----update----

Suppose there is wikipedia.com/horse and somebody later creates wikipedia.com/the_horse. This is a duplicate entry! It should be deleted, or perhaps redirected to the original page.

Comments (4)

梦屿孤独相伴 2024-08-17 03:01:06

It's a manual process

Basically, sites such as Wikipedia and Stack Overflow rely on their users/editors not to create duplicates, or to merge/remove them when they have been created by accident. Various features make this process easier and more reliable:

  • Establish good naming conventions ("the horse" is not a well-accepted name; one would naturally choose "horse") so that editors will naturally give the same name to the same subject.
  • Make it easy for editors to find similar articles.
  • Make it easy to flag articles as duplicates or delete them.
  • Put sensible restrictions in place so that vandals can't misuse these features to remove genuine content from your site.
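
As a rough illustration of the "find similar articles" point, here is a minimal Python sketch that suggests existing titles close to a proposed one. It uses the standard library's difflib for fuzzy matching. The function name, the sample titles, and the 0.6 cutoff are assumptions made for this example, not anything Wikipedia is known to use.

```python
import difflib

def find_similar_titles(new_title, existing_titles, cutoff=0.6, limit=5):
    """Return existing titles that look close to new_title (case-insensitive)."""
    # Map lowercased titles back to their original spelling.
    by_lower = {t.lower(): t for t in existing_titles}
    matches = difflib.get_close_matches(
        new_title.lower(), list(by_lower), n=limit, cutoff=cutoff
    )
    # get_close_matches returns the best match first.
    return [by_lower[m] for m in matches]

# Hypothetical data: what an editor might be shown before creating "The Horse".
existing = ["Horse", "Zebra", "Donkey"]
print(find_similar_titles("The Horse", existing))
```

The suggestions only help a human editor decide. Nothing here blocks the new article automatically, which matches the manual process described above.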

Having said this, you will still find a lot of duplicate information on Wikipedia, but the editors are cleaning it up as quickly as it is being added.

It's all about community (update)

Community sites (like Wikipedia or Stack Overflow) develop their procedures over time. Take a look at Wikipedia:about, Stackoverflow:FAQ, or meta.stackoverflow. You can spend weeks reading about all the little (but important) details of how a community builds a site together and how it deals with the problems that arise. Much of this is about rules for your contributors, but as you develop your rules, many of their details will be put into the code of your site.

As a general rule, I would strongly suggest starting a site with a simple system and a small community of contributors who agree on a common goal, are interested in reading the content of your site, like to contribute, and are willing to compromise and to correct problems manually. At this stage it is much more important for your community to have an "identity" and a habit of mutual help than to have many visitors or contributors. You will have to spend a lot of time and care dealing with problems as they arise and delegating responsibility to your members. Once the site has a basis and a commonly agreed direction, you can slowly grow your community. If you do it right, you will gain enough supporters to share the additional work amongst the new members. If you don't care enough, spammers or trolls will take over your site.

Note that Wikipedia grew slowly over many years to its current size. The secret is not "get big" but "keep growing healthily".

Having said that, Stack Overflow seems to have grown at a faster rate than Wikipedia. You may want to consider the different trade-offs that were made here: Stack Overflow is much more restrictive about letting one user change another user's contribution. Bad information is often simply pushed down to the bottom of a page (low ranking). Hence, it will not produce articles like Wikipedia's, but it is easier to keep problems out.

喵星人汪星人 2024-08-17 03:01:06

I can add one to Yaakov's list:
* Wikipedia makes sure that after merging the information, "The Horse" points to "Horse", so that the same wrong title cannot be used a second time.
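
A minimal sketch of that idea in Python, assuming a toy in-memory store: merging moves the duplicate's text into the surviving article and leaves a redirect behind, so the old title stays occupied. The class and method names are hypothetical and are not MediaWiki's API.

```python
class TitleIndex:
    """Toy title index: a title maps either to article text or to a redirect."""

    def __init__(self):
        self.articles = {}   # title -> article text
        self.redirects = {}  # old title -> canonical title

    def merge(self, duplicate, canonical):
        """Fold the duplicate's text into the canonical article, then redirect."""
        text = self.articles.pop(duplicate, "")
        if text:
            merged = self.articles.get(canonical, "")
            self.articles[canonical] = (merged + "\n" + text).strip()
        self.redirects[duplicate] = canonical

    def resolve(self, title, max_hops=5):
        """Follow redirects until a real title is reached."""
        hops = 0
        while title in self.redirects:
            if hops >= max_hops:
                raise ValueError("redirect chain too long or circular")
            title = self.redirects[title]
            hops += 1
        return title

    def can_create(self, title):
        """A title is free only if it is neither an article nor a redirect."""
        return title not in self.articles and title not in self.redirects


index = TitleIndex()
index.articles["Horse"] = "A horse is a hoofed mammal."
index.articles["The Horse"] = "Horses eat grass."
index.merge("The Horse", "Horse")
print(index.resolve("The Horse"))     # Horse
print(index.can_create("The Horse"))  # False
```

In this sketch the merge step just concatenates text. On a real wiki an editor would combine the content by hand.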

如梦初醒的夏天 2024-08-17 03:01:06

EBAGHAKI, responding to your last question in the comments above:

If you're trying to design your own system with these features, the key one is:

  • Make the namespace itself editable by the community that is identifying duplicates.

In MediaWiki's case, this is done with the special "#REDIRECT" command -- an article created with only "#REDIRECT [[new article title]]" on its first line is treated as a URL redirect.
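
If you were building your own system along these lines, detecting such a page could look roughly like the Python sketch below. The regular expression is a simplification written for this example and is not MediaWiki's actual parser. The function name is hypothetical.

```python
import re

# Matches a first line such as: #REDIRECT [[Horse]]
# Stops at "]", "|" or "#" so that "[[Horse#Anatomy]]" yields "Horse".
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def redirect_target(page_text):
    """Return the redirect target named on the first line, or None."""
    first_line = page_text.split("\n", 1)[0]
    match = REDIRECT_RE.match(first_line)
    return match.group(1).strip() if match else None

print(redirect_target("#REDIRECT [[Horse]]"))            # Horse
print(redirect_target("The horse is a hoofed mammal."))  # None
```

Because the redirect is ordinary page text, anyone who can edit a page can also create or fix a redirect, which is what makes the namespace community-editable.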

The rest of the editorial system used in MediaWiki is depressingly simple -- every page is essentially treated as a block of text, with no structure, and with a single-stream revision history that any reader can add a new revision to. Nothing automatic about any of this.
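
To make the "block of text plus single-stream history" model concrete, here is a minimal Python sketch. The class and field names are assumptions for illustration. A real wiki also stores timestamps, edit summaries, and more.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Revision:
    author: str
    text: str  # the full page text at this revision, with no internal structure

@dataclass
class Page:
    title: str
    revisions: List[Revision] = field(default_factory=list)

    def edit(self, author, text):
        """Any reader can append a new revision; nothing is overwritten."""
        self.revisions.append(Revision(author, text))

    @property
    def current_text(self):
        return self.revisions[-1].text if self.revisions else ""

    def revert_to(self, index, author):
        """A revert is just a new revision that copies an older one."""
        self.edit(author, self.revisions[index].text)


page = Page("Horse")
page.edit("alice", "A horse is a hoofed mammal.")
page.edit("vandal", "asdf")
page.revert_to(0, "bob")
print(page.current_text)    # A horse is a hoofed mammal.
print(len(page.revisions))  # 3
```

Keeping every revision is what lets editors undo vandalism or a bad merge by hand.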

When you try to create a main page, you are shown a long message encouraging you to search for the page title in various ways to see whether an existing page is already there -- many sites have similar processes. Digg is a typical example of one with an aggressive, automated search to try to convince you not to post duplicates -- you have to click through a screen listing potential duplicates and affirm that yours is different, before you are allowed to post.
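
The click-through step can be sketched in Python as follows. This is only a guess at the general shape of such a flow, not Digg's or MediaWiki's code. The word-overlap matching is deliberately crude, and all names are hypothetical.

```python
def find_candidates(title, existing_titles):
    """Crude duplicate search: any existing title sharing a word with the new one."""
    words = set(title.lower().split())
    return [t for t in existing_titles if words & set(t.lower().split())]

def submit(title, existing_titles, confirmed_distinct=False):
    """Refuse to create the entry until the user has reviewed likely duplicates."""
    candidates = find_candidates(title, existing_titles)
    if candidates and not confirmed_distinct:
        # First attempt: show the list and ask the user to confirm.
        return {"status": "needs_confirmation", "candidates": candidates}
    return {"status": "created", "title": title}


existing = ["Horse", "Zebra"]
print(submit("The Horse", existing))
print(submit("The Horse", existing, confirmed_distinct=True))
```

The final decision still rests with the user. The system only makes it harder to post a duplicate without noticing.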

习ぎ惯性依靠 2024-08-17 03:01:06

I assume they have a procedure that removes extraneous words such as 'the' to create a canonical title and, if it matches an existing page, does not allow the entry.
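
A minimal Python sketch of the procedure assumed here. The other answers describe Wikipedia's process as manual, so treat this as an illustration of the idea rather than what Wikipedia does. The stop-word list and function names are assumptions.

```python
STOP_WORDS = {"the", "a", "an"}

def canonical_title(title):
    """Lowercase the title, treat underscores as spaces, and drop stop words."""
    words = title.replace("_", " ").lower().split()
    kept = [w for w in words if w not in STOP_WORDS]
    # If the title consists only of stop words, keep it as it is.
    return " ".join(kept or words)

def try_create(title, existing):
    """existing maps canonical title -> original title; return True if created."""
    key = canonical_title(title)
    if key in existing:
        return False  # collides with an existing page
    existing[key] = title
    return True


pages = {}
print(try_create("horse", pages))      # True
print(try_create("the_horse", pages))  # False
```

A hard block like this can produce false positives: "The Who" and "Who" would collide, for example. A safer design shows the match as a warning and lets an editor decide.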
