Passiflora 2017-11-13 07:02 采纳率: 0%
浏览 846
已结题

genbank网站搜索结果页面是如何翻页的

比如这样的一个网页:http://www.ncbi.nlm.nih.gov/nuccore/?term=trocholejeunea,可以看到里面有79items,4个pages,然后我通过它上面的Previous/Next切换page的时候它的链接就变成了http://www.ncbi.nlm.nih.gov/nuccore/,看源代码那里page部分没看到有href,也没看到它执行Javascript,然后抓包也没看到有什么参数传上去。在网址后面加上&page=2也不行,请教一下这个它到底是怎么实现的?该如何爬取?非常感谢!

附切换页面的Previous/Next的HTML代码:

 <a name="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager.Page" title="Next page of results" class="active page_link next" id="EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager.Page" accesskey="k" href="#" sid="3" page="2">Next &gt;</a>
  • 写回答

1条回答

  • 808a73097dda5232 2018-09-07 17:14
    关注

    大概有两种方式吧,1抓请求,2看代码逻辑。我用1抓请求的来说,以Chrome为例:

    • 打开开发者工具,输入网址,点击第二页,查看network面板,点击主网址(第一条),在Response里得出那些item是html源码返回的,不是网页再去ajax之类的请求来的

    图片说明

    • 右键copy as curl(为例),copy出来便于后续处理:
    curl 'https://www.ncbi.nlm.nih.gov/nuccore' -H 'authority: www.ncbi.nlm.nih.gov' -H 'cache-control: max-age=0' -H 'origin: https://www.ncbi.nlm.nih.gov' -H 'upgrade-insecure-requests: 1' -H 'content-type: application/x-www-form-urlencoded' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' -H 'referer: https://www.ncbi.nlm.nih.gov/nuccore/?term=trocholejeunea' -H 'accept-encoding: gzip, deflate, br' -H 'accept-language: zh-CN,zh;q=0.9' -H 'cookie: ncbi_sid=39715999B9263251_0079SID; WebEnv=1ZCmaGvRzY3zpI_IAgfXI0JqPi895YkU0mGNTLkrEK3TGUQc-jpFe6a-PFL72MblpoSCI13v40tnx_Fz87Eb-5BcmOwB4DJf6kGAM%4039715999B9263251_0079SID; _ga=GA1.2.2002142930.1536320297; _gid=GA1.2.770582215.1536320297; _gat=1; starnext=MYGwlsDWB2CmAeAXAXAJgLy2ogTrAXgGQDM60ArsMAPZ6EAs6AJtcAM7kC2hAbM6x24B2dAAtEnEIQAc6VAAZCATmawAZgENyIRIQCM89LlajqIWACtY5OBv170e/aSFL9jNrA05go/Xw0QKT0RTRBPfVkwiIV0YiUhPQBWJVSAISVUHlQAYWkDeUKiovoAMULcgDpOAH1UVEJUByxcAgBSYgBBCipaWA7OlnYuAYA5AHlRgFFGjAB3BcroYAAjMCWQTiWwUUqAc2oAN0bGFrx8AZ6aOizMbHPLymv+rqHBRpEz9q6rvoG3kZdCbTRqyL4XIQ5X54NqQgGcWE5YEzVAqPT0eiKYiGJSFEgOdGYkgYaTEegkUjRWAkRi4cjU4hJRwYrF8DGoJIkESocnEWQ8XG8lRJPR6BqY9DSaS4hgOVEy+gYaAaRBgQ7UjHoUAQGAIXT0JnivhUhgiXHSBiyDkW+gqDTAVXqtq5AAOGj2sBq4GgkGdUL1hCShnEiBdbAGztKkYWcyWq3W0E2212B0Okeh/VQUeIpUQsBwCOIABFjL4zJZrLZncRAw4pvcCABlACebDznFQlQACvXWvhKqMnn1Ko3YABHenLT0AJVgHB0bC7GjgIEqvfONSXHpw3fd1KSGFGeud8g6MySpEJiiSjAN5O9kBqys4sHQ66brfbnZ7Df7g96eAjuOk7ADOc7aIgi7LrAq7vvgm57juW7UsuoFtrQACSTDoJwGhgMs1DYC0AA0cGlLQ3CoXOiC0DkIAaGwbCjBoL7oG6ez4SqYCEcRqqIOYNTLkwNTsfmxE0ER2DiWYxGeOq0A1DQIAANw2CA1AaEwcBIGgdx9iQZBDnQjDwrw/DDMIYgSFIVqKCoTDqFoOj6IYpamOYVg2F49iOPoGBqRpTDOOgrjuOgnjeL4/joIEwShIEER6FECXUrE8SJCk6SZNkeQFMUxRlBUOTVHUDRNHpDw/EZLyDAIgKdMiszoDGcZrBsWzQDs+xHCcFXfN01WNHw4KPABNWmagny/qNzz/HVhYNZMKJgtNkIZoi8KIo1qLMkS2LoLiWIEiyxKSmSFLoCaZJGDg9IkEyV4kGyiqcsQ3K8vySh6DwJDCmKnIStK8g1vQcqpIoipkFx6oMIwAWadp+pMuSF67dejBNA0STI4qgZ8CKP1JCI15Wik14qIoPCGJTBIckIQA' --data 'term=trocholejeunea+&EntrezSystem2.PEntrez.Nuccore.Sequence_PageController.PreviousPageName=results&EntrezSystem2.PEntrez.Nuccore.Sequence_Facets.FacetsUrlFrag=filters%3D&EntrezSystem2.PEntrez.Nuccore.Sequence_Facets.FacetSubmitted=false&EntrezSystem2.PEntrez.Nuccore.Sequence_Facets.BMFacets=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.sPresentation=docsum&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.sPageSize=20&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.sSort=none&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.FFormat=docsum&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.FSort=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.CSFormat=fasta_cds_na&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.GFFormat=gene_fasta&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.Db=nuccore&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.QueryKey=1&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.CurrFilter=all&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.ResultCount=79&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.ViewerParams=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.FileFormat=docsum&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.LastPresentation=docsum&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.Presentation=docsum&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.PageSize=20&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.LastPageSize=20&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.Sort=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.LastSort=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.FileSort=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.Format=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.LastFormat=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.PrevPageSize=20&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.PrevPresentation=docsum&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.PrevSort=&CollectionStartIndex=1&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_ResultsController.ResultCount=79&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_ResultsController.RunLastQuery=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_ResultsController.AccnsFromResult=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager.cPage=1&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager.CurrPage=2&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager.cPage=1&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.sPresentation2=docsum&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.sPageSize2=20&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.sSort2=none&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.TopSendTo=genefeat&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.FFormat2=docsum&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.FSort2=&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.CSFormat2=fasta_cds_na&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_DisplayBar.GFFormat2=gene_fasta&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_MultiItemSupl.Taxport.TxView=list&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_MultiItemSupl.Taxport.TxListSize=5&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_MultiItemSupl.RelatedDataLinks.rdDatabase=rddbto&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Sequence_MultiItemSupl.RelatedDataLinks.DbName=nuccore&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Discovery_SearchDetails.SearchDetailsTerm=%22Trocholejeunea%22%5BOrganism%5D+OR+trocholejeunea%5BAll+Fields%5D&EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.HistoryDisplay.Cmd=PageChanged&EntrezSystem2.PEntrez.DbConnector.Db=nuccore&EntrezSystem2.PEntrez.DbConnector.LastDb=nuccore&EntrezSystem2.PEntrez.DbConnector.Term=trocholejeunea&EntrezSystem2.PEntrez.DbConnector.LastTabCmd=&EntrezSystem2.PEntrez.DbConnector.LastQueryKey=1&EntrezSystem2.PEntrez.DbConnector.IdsFromResult=&EntrezSystem2.PEntrez.DbConnector.LastIdsFromResult=&EntrezSystem2.PEntrez.DbConnector.LinkName=&EntrezSystem2.PEntrez.DbConnector.LinkReadableName=&EntrezSystem2.PEntrez.DbConnector.LinkSrcDb=&EntrezSystem2.PEntrez.DbConnector.Cmd=PageChanged&EntrezSystem2.PEntrez.DbConnector.TabCmd=&EntrezSystem2.PEntrez.DbConnector.QueryKey=&p%24a=EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager.Page&p%24l=EntrezSystem2&p%24st=nuccore' --compressed
    
    • 可以看到有很多请求头与参数,然后就是分析与精简,看哪些是必要参数以及猜测参数意思
    • 比如先把请求参数弄成个数组,然后一个个减少来请求,如果响应结果不是想要的,就说明当前减少的那个参数是必须的
    String[] params = { "term=trocholejeunea",
            "EntrezSystem2.PEntrez.DbConnector.Cmd=PageChanged",
            "EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager.CurrPage=2","省略"};
    for (int i = 1; i <= params.length; i++) {
        Builder bodyBuilder = new FormBody.Builder();
        System.out.println(i);
        int j = 0;
        for (; j < params.length - i; j++) {
            String[] split = StringUtils.splitPreserveAllTokens(params[j], '=');
            bodyBuilder.addEncoded(split[0], split[1]);
        }
        Request request = new Request.Builder().url("https://www.ncbi.nlm.nih.gov/nuccore")
                .header("cookie",
                        "省略")
                /* 省略其它header */
                .post(bodyBuilder.build()).build();
        Response response = new OkHttpClient().newCall(request).execute();
        if (!StringUtils.contains(response.body().string(), "AY462401")) {
        // 第二页有关键字AY462401,如果响应没有,那params[j]就是必填参数
            System.out.println(params[j]);
            break;
        }
    }
    
    • 其它请求内容如header,cookie等参照上一条处理
    • 再点第三页,比较与第二页请求内容的变化猜测诸如请求参数等内容的含义
    请求参数 含义
    term=trocholejeunea 你的查询内容?
    EntrezSystem2.PEntrez.DbConnector.Cmd=PageChanged 必填
    EntrezSystem2.PEntrez.Nuccore.Sequence_ResultsPanel.Entrez_Pager.CurrPage=2 页面,1=第1页
    EntrezSystem2.PEntrez.DbConnector.LastQueryKey=1 必填

    ps: 没研究为什么直接打开 http://www.ncbi.nlm.nih.gov/nuccore/?term=trocholejeunea 不需要DbConnector.Cmd这些参数

    • 结果:用get(query string)或者post(form data)都可以拿到结果,用程序获取会提示requires JavaScript但没关系,items是有的
      • 都需要加上.header("Cookie", "ncbi_sid=39715999B9263251_0079SID"),不知道这东西是啥,每次还不一样
      • HtmlUnit支持js,但你这里没必要
    • 最后:你自己玩吧,可以提高问题的C币么?我答的很累!!!
    • 坑:你给的地址是http,点第二页跳的是https
    评论

报告相同问题?

悬赏问题

  • ¥15 outlook无法配置成功
  • ¥15 Pwm双极模式H桥驱动控制电机
  • ¥30 这是哪个作者做的宝宝起名网站
  • ¥60 版本过低apk如何修改可以兼容新的安卓系统
  • ¥25 由IPR导致的DRIVER_POWER_STATE_FAILURE蓝屏
  • ¥50 有数据,怎么建立模型求影响全要素生产率的因素
  • ¥50 有数据,怎么用matlab求全要素生产率
  • ¥15 TI的insta-spin例程
  • ¥15 完成下列问题完成下列问题
  • ¥15 C#算法问题, 不知道怎么处理这个数据的转换