Defining the context of a word - Python

Posted on 2024-08-25 20:14:40

I think this is an interesting question, at least for me.


I have a list of words, let's say:

photo, free, search, image, css3, css, tutorials, webdesign, tutorial, google, china, censorship, politics, internet

and I have a list of contexts:

  • Programming
  • World news
  • Technology
  • Web Design

I need to try and match words with the appropriate context/contexts if possible.

Maybe discovering word relationships in some way.


Any ideas?

Help would be much appreciated!

Comments (4)

姐不稀罕 2024-09-01 20:14:40

This sounds like it's more of a categorization/ontology problem than NLP. Try WordNet for a standard ontology.

I don't see any real NLP in your stated problem, but if you do need some semantic analysis or a parser, try NLTK.

一腔孤↑勇 2024-09-01 20:14:40

Where do these words come from? Do they come from real texts? If they do, then it is a classic data mining problem. What you need to do is put your set of documents into a matrix where each row represents a document and each column represents a word that occurs in the documents.

For example if you have two documents like this:

D1: Need to find meaning.
D2: Need to separate Apples from oranges

Your matrix will look like this:

      Need  to  find  meaning  Apples  Oranges  Separate  From
D1:   1     1   1     1        0       0        0         0
D2:   1     1   0     0        1       1        1         1

This is called a term-by-document matrix.
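As a rough illustration, the matrix above can be built with nothing but the standard library. This is only a sketch: the function name `build_matrix` and the simple lowercase, alphanumeric tokenization are assumptions made for the example, not part of any particular library.

```python
import re

def build_matrix(docs):
    """Return (vocab, rows); rows[i][j] is 1 if word j occurs in document i."""
    # Assumption: lowercase everything and keep only runs of letters/digits.
    tokenized = [re.findall(r"[a-z0-9]+", text.lower()) for text in docs]

    # Vocabulary in order of first appearance across all documents.
    vocab = []
    for doc_tokens in tokenized:
        for tok in doc_tokens:
            if tok not in vocab:
                vocab.append(tok)

    # One binary row per document.
    rows = []
    for doc_tokens in tokenized:
        present = set(doc_tokens)
        rows.append([1 if word in present else 0 for word in vocab])
    return vocab, rows

docs = ["Need to find meaning.", "Need to separate Apples from oranges"]
vocab, rows = build_matrix(docs)
print(vocab)
for name, row in zip(["D1", "D2"], rows):
    print(name, row)
```

The columns come out in order of first appearance, so they are ordered slightly differently from the hand-written table, but the rows carry the same information.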

Having collected these statistics, you can use an algorithm like K-Means to group similar documents together. Since you already know how many concepts you have, your task should be somewhat easier. K-Means can be very slow on a large, sparse matrix like this, so you can try to speed it up by first reducing the dimensionality with a technique such as SVD.
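To make the clustering step concrete, here is a minimal K-Means sketch in plain Python. The toy matrix, its column meanings, and the choice of the first k rows as starting centroids are all assumptions made to keep the example small and deterministic; a real implementation would usually pick random starting points, and the SVD step is not shown.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(rows, k, iterations=10):
    """Return one cluster label per row."""
    # Assumption: use the first k rows as the initial centroids.
    centroids = [list(row) for row in rows[:k]]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: each row goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centroids[c]))
            for row in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels

# Hypothetical document rows; columns stand for: css, tutorial, china, politics
rows = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
labels = kmeans(rows, k=2)
print(labels)
```

Documents that share words end up with the same label. Mapping each cluster to a named context such as "Web Design" is still a separate step.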

独夜无伴 2024-09-01 20:14:40

I just found this a couple days ago: ConceptNet

It's a commonsense ontology, so it might not be as specific as you would like, but it has a Python API and you can download their entire database (currently around 1GB decompressed). Just keep in mind their licensing restrictions.

If you read the papers that were published by the team that developed it, you may get some ideas on how to relate your words to concepts/contexts.

甜妞爱困 2024-09-01 20:14:40

The answer to your question obviously depends on the target taxonomy you are trying to map your terms into. Once you have decided on this, you need to figure out how fine-grained the concepts should be. WordNet, as has been suggested in other responses, will give you synsets, i.e. sets of terms which are more or less synonymous, but you will have to map these to concepts like 'Web Design' or 'World News' by some other mechanism, since those concepts are not encoded in WordNet. If you're aiming at a very broad semantic categorization, you could use WordNet's higher-level concept nodes, which differentiate, e.g. (going up the hierarchy), human from animal, animates from plants, substances from solids, concrete from abstract things, etc.

Another kind-of-taxonomy which may be quite useful to you is the Wikipedia category system. This is not just a spontaneous idea I just came up with; there has been a lot of work on deriving real ontologies from Wikipedia categories. Take a look at the Java Wikipedia Library. The idea would be to find a Wikipedia article for the term in question (e.g. 'css3'), extract the categories this article belongs to, and pick the best ones with respect to some criterion (e.g. 'programming', 'technology', and 'web-development'). Depending on what you're trying to do, this last step (choosing the best of several given categories) may or may not be difficult.
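The last step, choosing the best context from a list of categories, could be sketched as follows. Everything here is an assumption for illustration: the `ARTICLE_CATEGORIES` dictionary is hand-written, hypothetical data standing in for whatever a Wikipedia lookup would return, and the word-overlap score is just one simple possible criterion.

```python
import re

CONTEXTS = ["Programming", "World news", "Technology", "Web Design"]

# Hypothetical, hand-written categories (NOT real Wikipedia data); in practice
# these would come from looking up the article for each term.
ARTICLE_CATEGORIES = {
    "css3": ["Web design", "Stylesheet languages", "Web technology"],
    "censorship": ["World news topics", "Politics"],
}

def tokens(text):
    """Lowercased set of alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rank_contexts(term, contexts=CONTEXTS, categories=ARTICLE_CATEGORIES):
    """Rank contexts by how many words they share with the term's categories."""
    scores = {}
    for context in contexts:
        context_words = tokens(context)
        score = sum(
            len(context_words & tokens(category))
            for category in categories.get(term, [])
        )
        if score:
            scores[context] = score
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))

print(rank_contexts("css3"))
print(rank_contexts("censorship"))
```

Plain word overlap is crude, since it misses synonyms and related terms, which is where the difficulty mentioned above comes in.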

See here for a list of other ontologies / knowledge bases you could use.
