weixin_39810441 2020-11-29 14:18

Cookies being sent to wrong site

We've now seen a few cases where some cookies are sent over and over in requests to many different sites. For example, in a WARC from our domain crawl, we see:


WARC/1.0^M
WARC-Type: request^M
WARC-Target-URI: http://gamstop.co.uk/^M
WARC-Date: 2019-06-26T21:57:15Z^M
WARC-Concurrent-To: <c8c3061e-74aa-4836-8e95-11270677cac7>^M
WARC-Record-ID: <d51c9542-a658-4901-9445-e5a8e06d4cf3>^M
Content-Type: application/http; msgtype=request^M
Content-Length: 1750^M
^M
GET / HTTP/1.0^M
Connection: Close^M
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8^M
Host: gamstop.co.uk^M
User-Agent: bl.uk_ldfc_bot/3.4.0-20190418 (+http://www.bl.uk/aboutus/legaldeposit/websites/websites/faqswebmaster/index.html)^M
Cookie: Country=GB; EPRAT=1966168438-1561106086442; ESTN=2; PHPSESSID=p4bsl9dkml874kbnrs28isqrl2; PM_unread_Onlineshop="6669,9502,6679,5575,6055,6936,9117,5574,6670,6673,6627,70170,6814,6595,6464,6003,5582,5338,6938,9488,6811,9360,6597,9366,5586,9823,9454,9377,9453,7368,6573,6121,9800,9847,5558,9801,6267,6592,9057,5583,9843,6463,9542,9811,9814,70038,5584,9812,6672,9420,9362,9363,6596,9424,5560,6590,9816,9844,9464,6389,9275,5559,6598,9806,9805,9842,70219,6939,9815,6410,9472,6215,9807,9808,6594,6591,9830,6476,9421,9423,70013,9404,9361,5131,6584,6593,6599,6426,6354,6279,6578,9822,9833,5398,9438,70107,9435,9491,6586,6581,9827,9852,5364,9832,9846,9277,9276,70175,9819,6158,70041,70124,9448,70043,9279,5653,9442,6588,5399,9834,70121,70120,9375,6589,70106,70025,9818,6495,6214,6576,9425,70040,9456,9474,70026,70044,9813,9447,6574,6922,9851,9379,5657,6577,9356,9804,6575,70042,9383,9378,9492,6506,7700,9321,9426,6587,70123,9826,7892,9825,9439,9440,9355,9352,70045,9353,9350,70029,70032,70030,9446,9380,9381,6585,6583,9449,9441,9437,9359,70031,70027,9357,9365,9809,9418,70119,9810,70037,9457,9817,9824,70122,9376,70028,9358,9487,9322,70118,5626,9462,9455,9278,9417,9465,9463,9468,9419,9354,9415,9820,9490,9390,9406,9428,9416,9351,70039,5650,9466,9850,9055,70012,9467,9401,9803,9427,9831,9373,9405,9845,70172,9436,9489,9853,9059,9403,9471,9384,9543,70173,6270,9469,9340,9345,9364,5130,9821,9343,9342,9374,9473,9341,5656,9422,70171,9382,9144,9389,70174,6427,9344,9402,9802,9470"; PSACountry=GB^M
^M
^M
^M

These cookies appear in a large number of requests, and because the Cookie header is so large, it sometimes causes problems (in this example, http://gamstop.co.uk/ blocked the request with a 403).

We've visited the site before in this crawl, but these cookies didn't turn up there.

This question comes from the open-source project: internetarchive/heritrix3


  • weixin_39810441 2020-11-29 14:18

    From the Heritrix scripting console, these cookies can be inspected with:

    
    cookies = appCtx.getBean("cookieStorage").hostSubset("co.uk");
    rawOut.println(cookies);
    

    And cleared with:

    
    cookies = appCtx.getBean("cookieStorage").hostSubset("co.uk");
    cookies.clear();
    // Print again to confirm the subset is now empty
    rawOut.println(cookies);
    

    So I can do that manually for now.
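
    One possible explanation, not confirmed in this thread: if the stored cookies carry a domain of `co.uk` (which the `hostSubset("co.uk")` lookup above hints at), then standard RFC 6265 domain matching would attach them to requests for every host under `co.uk`. Below is a minimal Java sketch of that matching rule. The class and method names are hypothetical and this is not Heritrix code; it also skips the IP-address and public-suffix checks that a real cookie store would normally apply.

```java
import java.util.Locale;

public class CookieDomainMatch {

    /**
     * Simplified RFC 6265 section 5.1.3 domain-match (illustrative only).
     * Returns true if a cookie with the given domain attribute would be
     * sent to requestHost. No public-suffix or IP-address check is done.
     */
    static boolean domainMatches(String requestHost, String cookieDomain) {
        String host = requestHost.toLowerCase(Locale.ROOT);
        String domain = cookieDomain.toLowerCase(Locale.ROOT);
        // A leading dot in the Domain attribute is ignored.
        if (domain.startsWith(".")) {
            domain = domain.substring(1);
        }
        // Either an exact match, or the host ends with "." + domain.
        return host.equals(domain) || host.endsWith("." + domain);
    }

    public static void main(String[] args) {
        // A cookie scoped to the public suffix "co.uk" matches any *.co.uk host.
        System.out.println(domainMatches("gamstop.co.uk", "co.uk"));
        // A cookie scoped to a sibling site does not match.
        System.out.println(domainMatches("gamstop.co.uk", "example.co.uk"));
        // A cookie scoped to the site itself matches its subdomains.
        System.out.println(domainMatches("www.gamstop.co.uk", "gamstop.co.uk"));
    }
}
```

    Under that assumption, rejecting cookies whose domain is a bare public suffix such as `co.uk` would keep them out of the store in the first place, rather than clearing them by hand afterwards.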
