How to perform a search on a custom website and read the results?

Posted on 2025-01-22 19:21:01

I am developing a function to download protein .pdb files as part of a body of code I am writing to dock proteins and ligands generated by our AIBind machine learning model. For around 60% of these proteins, I am able to use gene libraries to convert their HGNC IDs to PDB IDs, which I then query through the UniProt and RCSB websites to download PDB files. However, for the other 40%, only computationally generated AlphaFold PDB models exist, and the gene libraries I have been using do not recognize these proteins as having valid PDB IDs. Thankfully, the AlphaFold website has a search function: searching with the HGNC ID returns a list of entries (the top one is the protein I am looking for 99% of the time), as shown below:

[Image: list of search results on the AlphaFold website]

Once I have the UniProt ID (shown in this example as Q7K0E6), I can navigate to the AlphaFold entry page and access the file server to download the PDB file for that protein. I have already done this successfully for proteins that have a registered UniProt ID in the databanks I have been using.

I've been using the following code to scrape the search page, with the HGNC symbol supplied as the search term, and to write all of the page's HTML to a text file.

  import urllib.request

  base_url = 'https://alphafold.ebi.ac.uk/search/text/'
  fname = 'alphaname.txt'
  hgnc = 'vr1'
  # Append the HGNC symbol to the search page URL
  url = base_url + hgnc

  # Download the raw HTML of the search page and save it to a file
  with urllib.request.urlopen(url) as response:
      html = response.read()
  with open(fname, "wb") as f:
      f.write(html)

When I search the file itself (manually as well as through Python), I don't see data from any of the entries returned as search results.

How would I use Python to retrieve the data returned by a website's search function?

Comments (1)

别再吹冷风 2025-01-29 19:21:01

The data is loaded from an external URL via JavaScript, so it is not present in the page's static HTML. You can use the requests module to call that URL directly, for example:

import json
import requests


api_url = "https://alphafold.ebi.ac.uk/api/search"

params = {
    "q": "(text:*vr1 OR text:vr1*)",
    "type": "main",
    "start": "0",
    "rows": "20",
}

data = requests.get(api_url, params=params).json()

print(json.dumps(data, indent=4))

Prints:

{
    "numFound": 112,
    "start": 0,
    "numFoundExact": true,
    "docs": [
        {
            "entryId": "AF-O35433-F1",
            "gene": "Trpv1",
            "geneT": [
                "Trpv1",
                "Vr1",
                "Vr1l"
            ],
            "geneSynonyms": [
                "Vr1",
                "Vr1l"
            ],
            "sequenceChecksum": "DAFC80B12BDF71BF",
            "sequenceVersionDate": "1998-01-01",
            "uniprotAccession": "O35433",
            "uniprotAccessionT": "O35433",
            "uniprotId": "TRPV1_RAT",
            "uniprotDescription": "Transient receptor potential cation channel subfamily V member 1",
            "protein": [
                "Transient receptor potential cation channel subfamily V member 1",
                "Capsaicin receptor",
                "Osm-9-like TRP channel 1",
                "Vanilloid receptor 1",
                "Vanilloid receptor type 1-like",
                "OTRPC1"
            ],
            "taxId": 10116,
            "organismScientificName": "Rattus norvegicus",
            "organism": [
                "Rattus norvegicus",
                "Rat"
            ],
            "globalMetricValue": 71.55,
            "uniprotStart": 1,
            "uniprotEnd": 838,
            "uniprotSequence": "MEQRASLDSEESESPPQENSCLDPPDRDPNCKPPPVKPHIFTTRSRTRLFGKGDSEEASPLDCPYEEGGLASCPIITVSSVLTIQRPGDGPASVRPSSQDSVSAGEKPPRLYDRRSIFDAVAQSNCQELESLLPFLQRSKKRLTDSEFKDPETGKTCLLKAMLNLHNGQNDTIALLLDVARKTDSLKQFVNASYTDSYYKGQTALHIAIERRNMTLVTLLVENGADVQAAANGDFFKKTKGRPGFYFGELPLSLAACTNQLAIVKFLLQNSWQPADISARDSVGNTVLHALVEVADNTVDNTKFVTSMYNEILILGAKLHPTLKLEEITNRKGLTPLALAASSGKIGVLAYILQREIHEPECRHLSRKFTEWAYGPVHSSLYDLSCIDTCEKNSVLEVIAYSSSETPNRHDMLLVEPLNRLLQDKWDRFVKRIFYFNFFVYCLYMIIFTAAAYYRPVEGLPPYKLKNTVGDYFRVTGEILSVSGGVYFFFRGIQYFLQRRPSLKSLFVDSYSEILFFVQSLFMLVSVVLYFSQRKEYVASMVFSLAMGWTNMLYYTRGFQQMGIYAVMIEKMILRDLCRFMFVYLVFLFGFSTAVVTLIEDGKNNSLPMESTPHKCRGSACKPGNSYNSLYSTCLELFKFTIGMGDLEFTENYDFKAVFIILLLAYVILTYILLLNMLIALMGETVNKIAQESKNIWKLQRAITILDTEKSFLKCMRKAFRSGKLLQVGFTPDGKDDYRWCFRVDEVNWTTWNTNVGIINEDPGNCEGVKRTLSFSLRSGRVSGRNWKNFALVPLLRDASTRDRHATQQEEVQLKHYTGSLKPEDAEVFKDSMVPGEK",
            "modelCreatedDate": "2021-07-01",
            "organismCommonNames": [
                "Rat"
            ],
            "proteinFullNames": [
                "Capsaicin receptor",
                "Osm-9-like TRP channel 1",
                "Vanilloid receptor 1",
                "Vanilloid receptor type 1-like"
            ],
            "proteinShortNames": [
                "OTRPC1"
            ],
            "latestVersion": 2,
            "allVersions": [
                1,
                2
            ],
            "_version_": 1723016518349881344
        },
        {
            "entryId": "AF-Q7K0E6-F1",
            "gene": "AspRS",
            "geneT": [

...
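
To get from this JSON to the PDB file the question asks about, something like the following might work. It is only a sketch: `build_params` and `top_hit` are hypothetical helper names, the `FILE_URL` pattern is an assumption about how the AlphaFold file server names its files (check it against the download link on an entry page), and the search endpoint may change over time. The optional `tax_id` filter uses the `taxId` field from the output above, since the first result for `vr1` is the rat protein; 9606 is the NCBI taxonomy ID for human. This version uses only the standard library, but the same parameters work with `requests`.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://alphafold.ebi.ac.uk/api/search"
# Assumed file-server naming pattern; verify it on an AlphaFold entry page.
FILE_URL = "https://alphafold.ebi.ac.uk/files/{entry_id}-model_v{version}.pdb"


def build_params(symbol, rows=20):
    """Build the same query parameters as above for any gene symbol."""
    return {
        "q": f"(text:*{symbol} OR text:{symbol}*)",
        "type": "main",
        "start": "0",
        "rows": str(rows),
    }


def search(symbol):
    """Call the search endpoint and return the decoded JSON response."""
    query = urllib.parse.urlencode(build_params(symbol))
    with urllib.request.urlopen(f"{API_URL}?{query}", timeout=30) as response:
        return json.load(response)


def top_hit(data, tax_id=None):
    """Return (uniprotAccession, PDB URL) for the first matching result.

    If tax_id is given, results from other organisms are skipped.
    Returns None when nothing matches.
    """
    for doc in data.get("docs", []):
        if tax_id is None or doc.get("taxId") == tax_id:
            url = FILE_URL.format(
                entry_id=doc["entryId"], version=doc["latestVersion"]
            )
            return doc["uniprotAccession"], url
    return None
```

Calling `top_hit(search("vr1"))` would then give the accession and a candidate download URL for the first entry, and `top_hit(search("vr1"), tax_id=9606)` would restrict the match to human entries. Whether the first hit is really the protein you want still needs checking, for example by comparing the `gene` or `geneSynonyms` fields with your HGNC symbol.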