Evaluating the "value" attribute
I'm attempting to use the OpenAmplify API to evaluate the content of a URI. The point is to draw out the topics that are truly relevant to the article. Unfortunately, the topical analysis I'm getting back is:
- Huge, and
- Varied
Neither quality is terribly useful for what I'm trying to do because the signal-to-noise ratio is heavily skewed towards noise. I'm analyzing web content, so there is a certain amount (perhaps a large amount) of irrelevant content (ads, etc.) involved. I get that.
Nonetheless, many of the topics being returned are either useless (utterly nonsensical, not even words), irrelevant (as in, where did that come from?) or too granular to provide any meaning or insight. I can probably filter out most of this noise using the value, um, value that is returned for each domain, subdomain, topic, etc., but I don't really know what it means.
Certainly I understand that the value is a measure of "the prominence of the word in the text," but the number itself appears entirely arbitrary, in a way that prevents me from saying something like "ignore any terms with a value less than 50" and having it carry any real meaning.
Are there any range criteria that I can use to help me understand how to use a topic's value score as a filtering threshold? Alternatively, is there another field that I should be using for this sort of filtration?
Thanks for your help.
Comments (1)
From other channels, I've learned that the `value` attribute can't be evaluated the way I was hoping. It means different things for different signals, and none of them is defined in a way that is meaningful for this kind of requirement.
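Since an absolute cutoff such as "less than 50" has no defined meaning, one possible workaround is to compare values only against other values of the same signal type within a single response. What follows is a minimal sketch of that idea, not anything documented by OpenAmplify. It assumes the response has already been parsed into a list of dicts, and the keys `signal`, `name`, and `value`, the function name `top_by_signal`, and the `keep_fraction` parameter are all hypothetical.

```python
import math
from collections import defaultdict


def top_by_signal(items, keep_fraction=0.25):
    """Keep the highest-valued items within each signal type.

    `items` is a list of dicts with hypothetical keys "signal", "name"
    and "value". Values are only ever compared inside one signal type,
    because the scale may differ from one signal to another.
    """
    # Group the items by signal type (e.g. "topic", "domain").
    groups = defaultdict(list)
    for item in items:
        groups[item["signal"]].append(item)

    kept = []
    for group in groups.values():
        # Rank by value, highest first, within this signal type only.
        ranked = sorted(group, key=lambda it: it["value"], reverse=True)
        # Keep a fixed fraction of the group, but always at least one item.
        count = max(1, math.ceil(len(ranked) * keep_fraction))
        kept.extend(ranked[:count])
    return kept


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = [
        {"signal": "topic", "name": "a", "value": 80},
        {"signal": "topic", "name": "b", "value": 5},
        {"signal": "topic", "name": "c", "value": 3},
        {"signal": "topic", "name": "d", "value": 1},
        {"signal": "domain", "name": "x", "value": 0.9},
        {"signal": "domain", "name": "y", "value": 0.1},
    ]
    for item in top_by_signal(sample):
        print(item["signal"], item["name"], item["value"])
```

A rank-based cut like this sidesteps the question of what the number means, at the cost of always keeping something from every signal type, even when everything in it is noise.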