How do I perform a search on a custom website and read the results?

Posted on 2025-01-22 19:21:01

I am developing a function to download protein .pdb files online, as part of code I am writing to dock proteins and ligands generated by our AIBind machine learning model. For around 60% of these proteins I am able to use gene libraries to convert their HGNC IDs to PDB IDs, which I then query through the UniProt and RCSB websites to download the PDB files. However, for the other 40% only computationally generated AlphaFold PDB models exist, and the gene libraries I have been using do not recognize these proteins as having valid PDB IDs. Thankfully, the AlphaFold website has a search function: by searching with the HGNC ID, I receive a list of entries (the top one is the protein I am looking for about 99% of the time), as shown below:

Once I have the UniProt ID (shown in this example as Q7K0E6), I can navigate to the AlphaFold entry page and access the file server to download the PDB file for that protein. I have already been able to do this successfully for proteins that have a registered UniProt ID in the databanks I have been using.

I've been using the following code to scrape the search page, with the HGNC symbol as the search term, and write all of the page's HTML to a text file.

  import urllib.request

  # Base URL of the AlphaFold search page; the HGNC symbol is appended to it.
  url = 'https://alphafold.ebi.ac.uk/search/text/'
  fname = 'alphaname.txt'
  HGNC = 'vr1'
  url = url + HGNC

  # Fetch the raw HTML of the search page and save it to a file.
  with urllib.request.urlopen(url) as response:
      html = response.read()
  with open(fname, "wb") as f:
      f.write(html)

When I search the resulting file (manually as well as through Python), I don't see any data from the entries returned as search results.

How can I use Python to retrieve the results of a search performed through a website's search function?

别再吹冷风 2025-01-29 19:21:01

The search results are loaded from an external URL via JavaScript, so they are not present in the HTML of the search page itself. You can use the requests module to call that URL directly, for example:

import json
import requests

# JSON endpoint that the search page queries in the background.
api_url = "https://alphafold.ebi.ac.uk/api/search"

# Wildcard query: text ending with or starting with "vr1".
params = {
    "q": "(text:*vr1 OR text:vr1*)",
    "type": "main",
    "start": "0",
    "rows": "20",
}

data = requests.get(api_url, params=params).json()

print(json.dumps(data, indent=4))

Prints:

{
    "numFound": 112,
    "start": 0,
    "numFoundExact": true,
    "docs": [
        {
            "entryId": "AF-O35433-F1",
            "gene": "Trpv1",
            "geneT": [
                "Trpv1",
                "Vr1",
                "Vr1l"
            ],
            "geneSynonyms": [
                "Vr1",
                "Vr1l"
            ],
            "sequenceChecksum": "DAFC80B12BDF71BF",
            "sequenceVersionDate": "1998-01-01",
            "uniprotAccession": "O35433",
            "uniprotAccessionT": "O35433",
            "uniprotId": "TRPV1_RAT",
            "uniprotDescription": "Transient receptor potential cation channel subfamily V member 1",
            "protein": [
                "Transient receptor potential cation channel subfamily V member 1",
                "Capsaicin receptor",
                "Osm-9-like TRP channel 1",
                "Vanilloid receptor 1",
                "Vanilloid receptor type 1-like",
                "OTRPC1"
            ],
            "taxId": 10116,
            "organismScientificName": "Rattus norvegicus",
            "organism": [
                "Rattus norvegicus",
                "Rat"
            ],
            "globalMetricValue": 71.55,
            "uniprotStart": 1,
            "uniprotEnd": 838,
            "uniprotSequence": "MEQRASLDSEESESPPQENSCLDPPDRDPNCKPPPVKPHIFTTRSRTRLFGKGDSEEASPLDCPYEEGGLASCPIITVSSVLTIQRPGDGPASVRPSSQDSVSAGEKPPRLYDRRSIFDAVAQSNCQELESLLPFLQRSKKRLTDSEFKDPETGKTCLLKAMLNLHNGQNDTIALLLDVARKTDSLKQFVNASYTDSYYKGQTALHIAIERRNMTLVTLLVENGADVQAAANGDFFKKTKGRPGFYFGELPLSLAACTNQLAIVKFLLQNSWQPADISARDSVGNTVLHALVEVADNTVDNTKFVTSMYNEILILGAKLHPTLKLEEITNRKGLTPLALAASSGKIGVLAYILQREIHEPECRHLSRKFTEWAYGPVHSSLYDLSCIDTCEKNSVLEVIAYSSSETPNRHDMLLVEPLNRLLQDKWDRFVKRIFYFNFFVYCLYMIIFTAAAYYRPVEGLPPYKLKNTVGDYFRVTGEILSVSGGVYFFFRGIQYFLQRRPSLKSLFVDSYSEILFFVQSLFMLVSVVLYFSQRKEYVASMVFSLAMGWTNMLYYTRGFQQMGIYAVMIEKMILRDLCRFMFVYLVFLFGFSTAVVTLIEDGKNNSLPMESTPHKCRGSACKPGNSYNSLYSTCLELFKFTIGMGDLEFTENYDFKAVFIILLLAYVILTYILLLNMLIALMGETVNKIAQESKNIWKLQRAITILDTEKSFLKCMRKAFRSGKLLQVGFTPDGKDDYRWCFRVDEVNWTTWNTNVGIINEDPGNCEGVKRTLSFSLRSGRVSGRNWKNFALVPLLRDASTRDRHATQQEEVQLKHYTGSLKPEDAEVFKDSMVPGEK",
            "modelCreatedDate": "2021-07-01",
            "organismCommonNames": [
                "Rat"
            ],
            "proteinFullNames": [
                "Capsaicin receptor",
                "Osm-9-like TRP channel 1",
                "Vanilloid receptor 1",
                "Vanilloid receptor type 1-like"
            ],
            "proteinShortNames": [
                "OTRPC1"
            ],
            "latestVersion": 2,
            "allVersions": [
                1,
                2
            ],
            "_version_": 1723016518349881344
        },
        {
            "entryId": "AF-Q7K0E6-F1",
            "gene": "AspRS",
            "geneT": [

...
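
To go from this JSON to the UniProt ID mentioned in the question, something along these lines might work. It is only a sketch: the helper names (build_params, pick_doc, model_url) are hypothetical, the matching rule (exact gene name or synonym first, otherwise the top hit) is just one possible heuristic, and the file URL pattern is an assumption about how the AlphaFold file server names its models, so check it against a real entry page before relying on it.

```python
def build_params(symbol, rows=20):
    """Build the same query parameters as above for an arbitrary symbol."""
    return {
        "q": f"(text:*{symbol} OR text:{symbol}*)",
        "type": "main",
        "start": "0",
        "rows": str(rows),
    }


def pick_doc(data, symbol):
    """Return the first hit whose gene name or synonym equals `symbol`
    (case-insensitive); fall back to the top hit, or None if there are no hits."""
    docs = data.get("docs", [])
    wanted = symbol.lower()
    for doc in docs:
        names = [doc.get("gene") or ""] + (doc.get("geneSynonyms") or [])
        if wanted in (name.lower() for name in names):
            return doc
    return docs[0] if docs else None


def model_url(doc):
    """Build a download URL for the latest model of a hit.

    ASSUMPTION: the file server uses the pattern
    /files/<entryId>-model_v<latestVersion>.pdb; this is not confirmed above.
    """
    return (
        "https://alphafold.ebi.ac.uk/files/"
        f"{doc['entryId']}-model_v{doc['latestVersion']}.pdb"
    )


# Trimmed copy of the first hit from the output above, so this runs offline.
sample = {
    "docs": [
        {
            "entryId": "AF-O35433-F1",
            "gene": "Trpv1",
            "geneSynonyms": ["Vr1", "Vr1l"],
            "uniprotAccession": "O35433",
            "latestVersion": 2,
        }
    ]
}

hit = pick_doc(sample, "vr1")
print(hit["uniprotAccession"])  # O35433
print(model_url(hit))
```

With a live response you would pass the real data instead of sample, e.g. data = requests.get(api_url, params=build_params("vr1")).json(). Since several organisms can share a gene symbol, you may also want to filter on taxId or organismScientificName before picking a hit.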
