Tool for parsing text to find possible Wikipedia links

Posted 2024-07-15 02:25:03 · 588 words · 6 views · 0 comments

Does a tool exist that can parse text and output that text, hyper-linked to Wikipedia entries for words of interest?

For example, I'd like a tool that could turn something like:

The most popular search algorithm on a
sorted list is the binary search.

Into:

The most popular search algorithm on a
sorted list is the binary search.

It would be wonderful if Wikipedia had an API that did this, since they would be best equipped to determine what the "words of interest" are.

In my example, I simply linked every word combination that leads directly to an entry, except for "The" and "most".

Comments (3)

不顾 2024-07-22 02:25:03

There is a tool that does exactly what you're asking for.
http://wikify.appointment.at/
It's not perfect, but it works.

牵你的手,一向走下去 2024-07-22 02:25:03

You have two separate problems to solve here:

  1. Deciding which words should be linked
  2. Determining if there's a suitable entry to link these words to

Now, (2) is simpler, though it's also somewhat problematic. Wikipedia seems to have an API that allows you to gather data efficiently, and they also allow "screen scraping". But there's a problem with disambiguation: sometimes you might not hit the entry you wanted. For example, python links to a disambiguation page, as it can be a programming language, a snake and a couple of other things.
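As a rough illustration of (2), here is a minimal Python sketch that asks the MediaWiki Action API whether a title exists and whether it is a disambiguation page. It assumes the English Wikipedia endpoint and the `formatversion=2` JSON layout (a `pages` list whose entries carry `missing` or a `pageprops.disambiguation` key) as I understand them; the function names and the User-Agent string are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the English Wikipedia Action API.
API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build a query URL asking for the 'disambiguation' page property."""
    params = {
        "action": "query",
        "titles": title,
        "prop": "pageprops",
        "ppprop": "disambiguation",
        "redirects": 1,
        "format": "json",
        "formatversion": 2,
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def classify_page(response):
    """Classify a decoded API response as 'missing', 'disambiguation' or 'article'."""
    pages = response.get("query", {}).get("pages", [])
    if not pages or pages[0].get("missing") or pages[0].get("invalid"):
        return "missing"
    if "disambiguation" in pages[0].get("pageprops", {}):
        return "disambiguation"
    return "article"


def lookup(title):
    """Perform the network request (hypothetical helper; needs internet access)."""
    request = urllib.request.Request(
        build_query_url(title),
        headers={"User-Agent": "wikify-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request, timeout=10) as resp:
        return classify_page(json.load(resp))
```

Calling `lookup("Python")` should then report a disambiguation page rather than an article, which is the case where you would either skip the link or need extra context to pick the right entry.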

(1) is much harder, though. You can take the "simple approach" and attempt to find links for all non-trivial nouns (or even noun/adjective pairs). Non-trivial here means omitting words like "fiend, word, computer" etc.
But this would result in a plethora of links, which isn't convenient to read. It's really up to you to decide what's interesting in the text, and this depends a lot on the text itself. In an article for professional programmers, do you really want to link to "search algorithm" every time? But for beginners, perhaps you do.
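The "simple approach" could be sketched roughly as below: a greedy longest-match over word n-grams against a set of known entry titles, skipping a small stop list. This is only an illustration, not a real tool; `known_titles` stands in for whatever title lookup you use, and the stop list, the `max_words` limit, and the URL scheme (spaces replaced by underscores) are assumptions.

```python
import re

WIKI_BASE = "https://en.wikipedia.org/wiki/"
# Assumed stop list: words never linked on their own.
STOPWORDS = {"the", "most", "a", "an", "on", "is", "of"}


def wikify(text, known_titles, max_words=3):
    """Link the longest phrases found in known_titles (lowercase titles)."""
    words = list(re.finditer(r"\w+", text))
    out = []
    pos = 0  # position in text up to which output was already emitted
    i = 0
    while i < len(words):
        match_len = 0
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_words, len(words) - i), 0, -1):
            start, end = words[i].start(), words[i + n - 1].end()
            span = text[start:end]
            if re.search(r"[^\w\s]", span):
                continue  # do not link across punctuation
            key = " ".join(span.lower().split())
            if key in known_titles and key not in STOPWORDS:
                match_len = n
                break
        if match_len:
            out.append(text[pos:start])
            href = WIKI_BASE + key.replace(" ", "_")
            out.append(f'<a href="{href}">{span}</a>')
            pos = end
            i += match_len
        else:
            i += 1
    out.append(text[pos:])
    return "".join(out)


if __name__ == "__main__":
    titles = {"the", "most", "search algorithm", "sorted list", "binary search"}
    print(wikify("The most popular search algorithm on a sorted list "
                 "is the binary search.", titles))
```

Even on this one sentence the sketch links three phrases, which hints at how quickly a longer text would fill up with links unless you filter by audience or by how often a term has already been linked.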

To conclude, I strongly doubt there's a single general-purpose tool that will do the trick for you. But you surely have all the options at hand, and something need-specific can be coded without too much effort.

命比纸薄 2024-07-22 02:25:03

Silviu Cucerzan of Microsoft Research tackled this problem. Well, not the problem of inserting the links, but the general issue of determining what entities are being mentioned in some piece of text. Fortunately for you, he used Wikipedia articles as his set of entities. His paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", is available on his website. Direct link: pdf.
