How do I use Sentence Transformers with metadata for document retrieval?

Posted on 2025-01-17 07:16:58

I'm trying to use Sentence Transformers and Haystack for document retrieval, focusing on searching documents by metadata in addition to the document text.

I'm using a dataset of academic publication titles, and I've appended a fake publication year (which I want to use as a search term). From reading around, I've combined the columns, adding a separator between the title and publication year, and included the column names since I thought they might add context. An example input looks like:

title Sparsity-certifying Graph Decompositions [SEP] published year 1980

I have a document store and a retrieval method here, based on this:

from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store_faiss = FAISSDocumentStore(faiss_index_factory_str="Flat",
                                          return_embedding=True,
                                          similarity='cosine')

retriever_faiss = EmbeddingRetriever(document_store_faiss,
                                     embedding_model='all-mpnet-base-v2',
                                     model_format='sentence_transformers')

# df is the DataFrame holding the combined title/year strings in 'combined'
document_store_faiss.write_documents(df.rename(columns={'combined':'content'}).to_dict(orient='records'))
document_store_faiss.update_embeddings(retriever=retriever_faiss)

def get_results(query, retriever, n_docs = 25):
  # use the query argument rather than the global q
  return [item.content for item in retriever.retrieve(query, top_k = n_docs)]

q = 'published year 1999'
print('Results: ')
res = get_results(q, retriever_faiss)
for r in res:
  print(r)

I checked that some inputs actually have a publication year matching the search term, but the search results contain entries with seemingly random publication years. I was hoping that the results would at least all share the same publication year, since I want to run more complicated queries like "published year before 1980".

If anyone could tell me what I'm doing wrong, or whether I have misunderstood this process or the expected results, it would be much appreciated.

Comments (1)

﹏雨一样淡蓝的深情 2025-01-24 07:16:58

It sounds like you need metadata filtering rather than placing the year within the query itself. FAISSDocumentStore doesn't support filtering, so I'd recommend switching to PineconeDocumentStore, which Haystack introduced in the v1.3 release a few days ago. It supports the strongest filter functionality in the current set of document stores.

You will need to make sure you have the latest version of Haystack installed, and it needs an additional pinecone-client library too:

pip install -U farm-haystack pinecone-client

There's a guide here that may help; it will go something like:

from haystack.document_stores import PineconeDocumentStore
from haystack.nodes import EmbeddingRetriever

document_store = PineconeDocumentStore(
    api_key="<API_KEY>", # from https://app.pinecone.io
    environment="us-west1-gcp"
)
retriever = EmbeddingRetriever(
    document_store,
    embedding_model='all-mpnet-base-v2',
    model_format='sentence_transformers'
)

Before you write the documents you need to convert the data to include your text in content (as you have done above, but no need to pre-append the year), and then include the year as a field in a meta dictionary. So you would create a list of dictionaries that look like:

dicts = [
    {'content': 'your text here', 'meta': {'year': 1999}},
    {'content': 'another record text', 'meta': {'year': 1971}},
    ...
]

I don't know the exact format of your df but assuming it is something like:

text                     year
"your text here"         1999
"another record here"    1971

We could write the following to reformat it:

df = df.rename(columns={'text': 'content'})  # you did this already

# create a new 'meta' column that contains {'year': <year>} data
df['meta'] = df['year'].apply(lambda x: {'year': x})

# we don't need the year column anymore so we drop it
df = df.drop(['year'], axis=1)

# now convert into the list of dictionaries format as you did before
dicts = df.to_dict(orient='records')
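
To make the reshaping above concrete, here is a minimal sketch run on a tiny stand-in DataFrame. The two rows and the 'text' / 'year' column names are assumptions carried over from the example table, not your real data. It also casts each year to a plain Python int; that cast is my own addition, made on the assumption that handing built-in types to the document store is the safer choice.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; column names are assumed.
df = pd.DataFrame({
    "text": ["your text here", "another record text"],
    "year": [1999, 1971],
})

# The text column becomes 'content', which is what write_documents expects.
df = df.rename(columns={"text": "content"})

# Wrap each year in a {'year': <year>} dict; int() converts any NumPy integer
# to a built-in int (an added precaution, not part of the original answer).
df["meta"] = df["year"].apply(lambda x: {"year": int(x)})

# The standalone year column is no longer needed.
df = df.drop(columns=["year"])

# One dict per row, each with 'content' and 'meta' keys.
dicts = df.to_dict(orient="records")
print(dicts)
```

If your DataFrame still has the combined 'title ... [SEP] published year ...' strings, you would build 'content' from the title column alone instead.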

This list replaces the dictionaries you previously wrote from df, so we would continue like so:

document_store.write_documents(dicts)
document_store.update_embeddings(retriever=retriever)

Now you can query with filters. For example, to search for docs published in 1999 we use the condition "$eq" (equals):

docs = retriever.retrieve(
    "some query here",
    top_k=25,
    # filters is a plain dict mapping each meta field to its conditions
    filters={"year": {"$eq": 1999}}
)

For docs published before 1980 we can use "$lt" (less than):

docs = retriever.retrieve(
    "some query here",
    top_k=25,
    # filters is a plain dict mapping each meta field to its conditions
    filters={"year": {"$lt": 1980}}
)
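
If it helps to see what these conditions select, the following is a rough pure-Python sketch of the filter semantics applied to the example dicts. It is an illustration only, not how Haystack or Pinecone evaluate filters internally. The helper names `matches` and `apply_filters` are hypothetical. "$eq" and "$lt" come from the queries above; the other operator names are included on the assumption that they follow the same naming.

```python
import operator

# Operator names mapped to Python comparisons. Illustrative only: this is not
# the implementation used by Haystack or Pinecone.
OPS = {
    "$eq": operator.eq,
    "$lt": operator.lt,
    "$lte": operator.le,   # assumed name
    "$gt": operator.gt,    # assumed name
    "$gte": operator.ge,   # assumed name
}

def matches(meta, filters):
    """Return True if every condition on every filtered field holds."""
    for field, conditions in filters.items():
        if field not in meta:
            return False
        for op_name, target in conditions.items():
            if not OPS[op_name](meta[field], target):
                return False
    return True

def apply_filters(records, filters):
    """Keep only the records whose 'meta' dict satisfies the filters."""
    return [r for r in records if matches(r["meta"], filters)]

dicts = [
    {"content": "your text here", "meta": {"year": 1999}},
    {"content": "another record text", "meta": {"year": 1971}},
]

published_1999 = apply_filters(dicts, {"year": {"$eq": 1999}})
before_1980 = apply_filters(dicts, {"year": {"$lt": 1980}})

print([r["content"] for r in published_1999])
print([r["content"] for r in before_1980])
```

The point of the sketch is that the year is compared as a number on the metadata, separately from the embedding similarity of the query text, which is why a filter can express "before 1980" while a query string cannot be relied on to.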