Python crawler request fails with org.xml.sax.SAXParseException: Content is not allowed in prolog.

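I've been working on a crawler recently. The request method is POST and the request Content-Type is application/x-www-form-urlencoded, which means the data is submitted as a form.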

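Looking at the response body, the response content type is XML, and the data I want to extract is inside the new node:
[screenshot: response body]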

First, build the headers:
[screenshot: request headers]

The request parameters are in the request body:
[screenshot: request body]

The request parameters are themselves wrapped in XML. URL-decoding the __xml parameter gives the following:
[screenshot: decoded __xml value]
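The decoding step can be reproduced with the standard library. This is a minimal sketch: `captured` is a shortened copy of the `__xml` value from the script in this post (the long `fs="..."` attribute and a few others are left out purely for readability), and `unquote` comes from `urllib.parse`.

```python
from urllib.parse import unquote

# Shortened copy of the captured __xml value. Several attributes of <rpc>
# are omitted here for readability only; the real value is longer.
captured = (
    "%3Crpc%20id%3D%22datasetResult%22%20type%3D%22wrapper%22%3E"
    "%3Cps%3E%3Cp%20name%3D%22flag%22%3E1%3C/p%3E"
    "%3Cp%20name%3D%22carMark%22/%3E"
    "%3Cp%20name%3D%22RackNo%22%3E2FMDK3J95DBC93811%3C/p%3E"
    "%3C/ps%3E%3C/rpc%3E%0D%0A"
)

# Percent-decode once to get the readable XML (it ends with a CRLF).
decoded = unquote(captured)
print(decoded)
```

The decoded text starts with `<rpc`, which is what an XML parser expects to see as the first character.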

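The submitted parameters sit inside the p tags, and they are the only thing that changes from one request to the next. I found no sign of encryption,
so I built params like this:
[screenshot: params]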
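Since only the p values change between requests, the XML could also be generated rather than pasted. The sketch below is one way to do that with `xml.etree.ElementTree`. `build_rpc_xml` is a hypothetical helper, the attribute values are taken from the captured request in this post, and the field list is shortened. Whether the server accepts ElementTree's exact serialization (for example `<p name="carMark" />` with a space) is an assumption that has not been checked.

```python
import xml.etree.ElementTree as ET


def build_rpc_xml(values, fields):
    """Build the <rpc> document with one <p> element per parameter.

    Hypothetical helper for illustration; not part of dorado or requests.
    """
    rpc = ET.Element("rpc", {
        "id": "datasetResult",
        "type": "wrapper",
        "objectClazz": "",
        "pi": "1",
        "ps": "100",
        "pc": "1",
        "prc": "0",
        "fs": ",".join(fields),
    })
    ps = ET.SubElement(rpc, "ps")
    for name, value in values.items():
        p = ET.SubElement(ps, "p", {"name": name})
        if value:  # an empty value is left as a self-closing <p> element
            p.text = value
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(rpc, encoding="unicode")


xml_text = build_rpc_xml(
    {"flag": "1", "carMark": "", "RackNo": "2FMDK3J95DBC93811"},
    ["vin", "licensePlateNo", "licensePlateType"],  # shortened field list
)
print(xml_text)
```

The result is plain, unencoded XML, so it would be passed as the `__xml` value as-is and left to the HTTP library to form-encode.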

The code is as follows:

import requests


target = ".../dorado/smartweb2.RPC.d?__rpc=true"    # company intranet address, not reachable from outside
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
    "Referer": ".../pages/policynewbiz/inputapplication/pmGDVehicleQuery.jsp?VEHICLELICENSE=&VIN=LEFYECG257HN34234&LICENSETYPE=&Kind=AUTOCOMPRENHENSIVEINSURANCE2014PRODUCT&",
    "Content-Type": "application/x-www-form-urlencoded",
    "Cookie": "jsessionidp09=X2tfdlLTBZH7xzKwnhSgh2W2N5374T0HnHWYQkl2MRShjBxfpKpW!1484787398; F5cookie=1410712842.6521.0000"
}

params = {
    "__type": "loadData",
    "__viewInstanceId": "org.view.policynewbiz.inputapplication.pmGDVehicleQuery~org.view.common.viewmodel.CpicViewModel",
    "__xml": '%3Crpc%20id%3D%22datasetResult%22%20type%3D%22wrapper%22%20objectClazz%3D%22%22%20pi%3D%221%22%20ps%3D%22100%22%20pc%3D%221%22%20prc%3D%220%22%20fs%3D%22vin%2ClicensePlateNo%2ClicensePlateType%2CengineNo%2CpmVehicleType%2CpmUserNature%2CineffectualDate%2CrejectDate%2CfirstRegisterDate%2ClastCheckDate%2CtransferDate%2CwholeWeight%2CratedPassengerCapacity%2Ctonnage%2Cdisplacement%2CmadeFactory%2Cmodel%2CbrandCN%2CbrandEN%2Chaulage%2Ccolor%2CfuelType%2CvehicleStatus%2CmotorTypeCode%22%3E%3Cps%3E%3Cp%20name%3D%22flag%22%3E1%3C/p%3E%3Cp%20name%3D%22carMark%22/%3E%3Cp%20name%3D%22RackNo%22%3E2FMDK3J95DBC93811%3C/p%3E%3C/ps%3E%3C/rpc%3E%0D%0A',
    "__rpc": "true",
}

res = requests.post(url=target, headers=headers, data=params)
html = res.content.decode("utf-8")
print(html)

Running it returns an error:

D:\Users\CPIC\AppData\Local\Programs\Python\Python37\python.exe E:/Workspace/Python/SchoolInfo/test56.py
<?xml version="1.0"?>
<result succeed="false" >
<errorMessage>org.xml.sax.SAXParseException: Content is not allowed in prolog.</errorMessage>
<stackTrace><![CDATA[com.bstek.dorado.utils.xml.dom4j.Dom4jXmlBuilder.buildDocument(Dom4jXmlBuilder.java:59)
com.bstek.dorado.view.rpc.AbstractRPCHandler.init(AbstractRPCHandler.java:58)
com.bstek.dorado.view.rpc.LoadDataRPCHandler.init(LoadDataRPCHandler.java:41)
com.bstek.dorado.core.FilterHandle.doFilter(FilterHandle.java:131)
com.bstek.dorado.core.DoradoFilter.doFilter(DoradoFilter.java:70)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:43)
com.cpic.p09.auto.common.filter.CompatibleFilter.doFilter(CompatibleFilter.java:34)
weblogic.servlet.internal.FilterChainImpl.doFilter(FilterChainImpl.java:43)
com.cpic.p09.auto.common.filter.ClientCacheFilter.doFilter(ClientCacheFilter.java:71)
weblogic.servlet.internal.WebAppServletContext$ServletInvocationAction.run(WebAppServletContext.java:3242)
weblogic.security.acl.internal.AuthenticatedSubject.doAs(AuthenticatedSubject.java:321)
weblogic.servlet.internal.WebAppServletContext.execute(WebAppServletContext.java:1916)
weblogic.servlet.internal.ServletRequestImpl.run(ServletRequestImpl.java:1366)
weblogic.work.ExecuteThread.run(ExecuteThread.java:181)
]]></stackTrace>
<viewProperties></viewProperties>
</result>

Process finished with exit code 0

Has anyone run into this before? It's driving me crazy.

1 answer

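result succeed="false"  # the server reports that the call failed
org.xml.sax.SAXParseException: Content is not allowed in prolog.  # the XML parser found non-XML content before the start of the document, i.e. the XML it received does not begin with "<"
I pointed your code at a different URL and the overall structure runs fine, so I think the problem is still in the parameters.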

qq_38324018: I've suspected the parameters all along too, but they were copied verbatim from the request body, so in theory they should be fine.
3 months ago
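One thing worth checking: the `__xml` value in the script is already percent-encoded because it was copied verbatim from the request body, and `requests` form-encodes the `data` dict again when sending. The server would then decode the body once and hand the XML parser a string that starts with `%3C` rather than `<`, which would match "Content is not allowed in prolog". Below is a minimal standard-library sketch of that effect. `captured` is a shortened stand-in for the real value, and `urlencode`/`parse_qs` stand in for what the client and server do to the form body.

```python
from urllib.parse import parse_qs, unquote, urlencode

# Shortened stand-in for the captured, already percent-encoded __xml value.
captured = "%3Crpc%20id%3D%22datasetResult%22%3E%3Cps/%3E%3C/rpc%3E"

# Form-encoding a dict (what happens to data=params in requests.post)
# escapes every "%" again, so "%3C" goes out on the wire as "%253C".
double_encoded_body = urlencode({"__xml": captured})

# The server decodes the form body once and is left with "%3Crpc...".
seen_by_server = parse_qs(double_encoded_body)["__xml"][0]
print(seen_by_server[:12])

# Decoding the captured value first means it is encoded exactly once.
fixed_body = urlencode({"__xml": unquote(captured)})
seen_after_fix = parse_qs(fixed_body)["__xml"][0]
print(seen_after_fix[:12])
```

If this is the cause, the change to the script would be to import `unquote` from `urllib.parse` and set `params["__xml"] = unquote(params["__xml"])` before calling `requests.post`. This is untested against the real server, which is only reachable from the intranet.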