dongyan8896 2018-07-11 12:56
Views: 127
Accepted

How to scrape data from a website that uses admin-ajax.php with Scrapy

I am trying to scrape the reviews of Unibet casino from this website: https://casinoplacard.com/unibet-casino-reviews-and-bonuses/

As I did for other review sources, I used Scrapy in Python to scrape the reviews with the code below:

import scrapy
from urllib.parse import urlparse


class slotRunner_spyder(scrapy.Spider):
    # Counts how many review blocks the selector matches
    count = 0

    name = "slotRunner_spyder"
    start_urls = [
        'https://casinoplacard.com/unibet-casino-reviews-and-bonuses/'
    ]

    def parse(self, response):
        parsed_uri = urlparse(response.url)
        domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)

        # One item per user review found in the static HTML
        for review in response.css('div.rwp-users-reviews > div.rwp-u-review'):
            self.count += 1
            yield {
                'name': review.css('td a::text').extract_first(),
                'date': review.css('td small::text').extract_first(),
                'review': review.css('div.rwp-u-review__content > div.rwp-u-review__comment').extract(),
                'url': response.url
            }
        print(self.count)

But it does not work for that website. To understand why, I added a counter (self.count) and discovered that the loop runs only one iteration, which is not what I expected.

Then I spent some time studying the website in DevTools and discovered that when the page loads, an XHR POST request is sent automatically to the URL: https://casinoplacard.com/wp-admin/admin-ajax.php

Looking into that request, I found the data for the 182 reviews under:

Preview >> Data >> Reviews

So could you please help me understand how to retrieve that data?

Thank you very much!

1 answer

  • Ordinary user 2018-07-12 12:56

    I finally found a way to do this. I am sure it is not the best way, but at least it does what I wanted.

    As I said in my question, all the data I needed was in the Preview tab, so I just had to fetch it. I understood that the XHR POST request is made automatically when the URL is loaded, so I tried making Python send that request itself.

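    import requests

    url = 'https://casinoplacard.com/unibet-casino-reviews-and-bonuses/'
    ajax_URL = 'https://casinoplacard.com/wp-admin/admin-ajax.php'
    param = {}    # placeholder: form data copied from the Headers tab of DevTools
    headers = {}  # placeholder: e.g. the User-Agent copied from the same tab

    s = requests.Session()
    # Load the page in that session first
    s.get(url)
    # Imitate the POST request that the page sends automatically
    r = s.post(ajax_URL, data=param, headers=headers)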
    

    You get the parameters from the Headers tab of DevTools: the form data listed there is what you pass as parameters. For the headers, look in the same tab for User-Agent and paste its value into the headers. The AJAX URL is the one I gave in my question.
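
A minimal sketch of reading the reviews out of the AJAX response, assuming the body is JSON and that the DevTools path Preview >> Data >> Reviews corresponds to the keys `data` and `reviews`. The exact key names and casing are an assumption, and `extract_reviews` is a hypothetical helper, not part of requests or Scrapy:

```python
import json


def extract_reviews(body_text):
    """Return whatever is stored under data -> reviews, or an empty list.

    The key names are assumed from the DevTools path; adjust them to match
    what the Preview tab actually shows.
    """
    payload = json.loads(body_text)
    data = payload.get("data") or {}
    return data.get("reviews") or []
```

It would be called on the text of the response returned by `s.post(...)`, for example `extract_reviews(response.text)`.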

    Hope that will help someone.

    Selected by the asker as the best answer.
