weixin_39844963
2020-12-02 18:50

VCL return + restart causes an assertion failure in CNT_Embark on AZ(req->vcl0)

Expected Behavior

Varnish should not panic and should handle the restarted request correctly.

Current Behavior


Panic at: Thu, 27 Jun 2019 16:15:21 GMT
Assert error in CNT_Embark(), cache/cache_req_fsm.c line 1023:
  Condition((req->vcl0) == 0) not true.
version = varnish-6.2.0 revision b14a3d38dbe918ad50d3838b11aa596f42179b54, vrt api = 9.0
ident = Linux,4.9.0-9-amd64,x86_64,-junix,-smalloc,-sdefault,-hcritbit,epoll
now = 615633.120472 (mono), 1561652121.833952 (real)
Backtrace:
  0x560e65033ec9: /usr/sbin/varnishd(+0x4eec9) [0x560e65033ec9]
  0x560e6509e9c0: /usr/sbin/varnishd(VAS_Fail+0x40) [0x560e6509e9c0]
  0x560e65037e16: /usr/sbin/varnishd(CNT_Embark+0xc6) [0x560e65037e16]
  0x560e65060dd3: /usr/sbin/varnishd(+0x7bdd3) [0x560e65060dd3]
  0x560e650563d5: /usr/sbin/varnishd(+0x713d5) [0x560e650563d5]
  0x560e65056910: /usr/sbin/varnishd(+0x71910) [0x560e65056910]
  0x7f3ce6755fa3: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7fa3) [0x7f3ce6755fa3]
  0x7f3ce66864cf: /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f3ce66864cf]
thread = (cache-worker)
pthread.attr = {
  guard = 4096,
  stack_bottom = 0x7f3c98abc000,
  stack_top = 0x7f3c98afc000,
  stack_size = 262144,
}
thr.req = 0x7f3c8c186020 {
  vxid = 688132, transport = HTTP/1 {
    state = HTTP1::Proc
  }
  step = R_STP_LOOKUP,
  req_body = R_BODY_NONE,
  restarts = 0, esi_level = 0,
  sp = 0x7f3ccad85020 {
    fd = 78, vxid = 688129,
    t_open = 1561652118.305774,
    t_idle = 1561652121.454280,
    ws = 0x7f3ccad85060 {
      id = "ses",
      {s, f, r, e} = {0x7f3ccad85098, +96, (nil), +261984},
    },
    transport = HTTP/1 {
      state = HTTP1::Proc
    }
    client = 192.168.35.49 36214 :80,
  },
  worker = 0x7f3c98afad80 {
    ws = 0x7f3c98afae28 {
      id = "wrk",
      {s, f, r, e} = {0x7f3c98afa380, +0, (nil), +2040},
    },
    VCL::method = HASH,
    VCL::return = lookup,
    VCL::methods = {},
  },
  ws = 0x7f3c8c186170 {
    id = "req",
    {s, f, r, e} = {0x7f3c8c1880b0, +664, +515912, +515912},
  },
  http_conn = 0x7f3c8c188050 {
    fd = 78 (),
    doclose = NULL,
    ws = 0x7f3c8c186170 {
      [Already dumped, see above]
    },
    {rxbuf_b, rxbuf_e} = {0x7f3c8c1880b0, 0x7f3c8c188296},
    {pipeline_b, pipeline_e} = {(nil), (nil)},
    content_length = -1,
    body_status = none,
    first_byte_timeout = 0.000000,
    between_bytes_timeout = 0.000000,
  },
  http[req] = 0x7f3c8c186218 {
    ws = 0x7f3c8c186170 {
      [Already dumped, see above]
    },
    hdrs {
      "GET",
      "/library/de_persgroep/fonts/2533/museo-sans-300.ttf",
      "HTTP/1.1",
      "Accept: */*",
      "Referer: ***** Edited *****",
      "Accept-Language: nl-BE",
      "User-Agent: Mozilla/5.0 (***** Edited *****",
      "Origin: https://actie.demorgen.be",
      "Host: files.qualifio.com",
      "Accept-Encoding: gzip",
      "X-Varnish: 14439596",
      "X-Forwarded-For: ***** Edited *****",
      "X-Original-Url: /library/de_persgroep/fonts/2533/museo-sans-300.ttf",
    },
  },
  vdc = 0x7f3c8c188028 {
    nxt = (nil),
    retval = 0,
  },
  vcl = {
    name = "vcl_files_20190627_161453",
    busy = 81,
    discard = 0,
    state = warm,
    temp = warm,
    conf = {
      syntax = "41",
      srcname = {
        "/etc/varnish/conf-enabled/default_files.vcl",
        "Builtin",
        "/etc/varnish/conf/common_purge.vcl",
        "/etc/varnish/conf/common_security_main.vcl",
        "/etc/varnish/conf/common_security_cleanup.vcl",
        "/etc/varnish/conf/common_security_clean_debug.vcl",
        "/etc/varnish/conf/common_security_filter_cfide.vcl",
        "/etc/varnish/conf/common_security_server_name.vcl",
        "/etc/varnish/conf/common_security_xss.vcl",
      },
    },
  },
  vcl0 = {
    name = "router_20190627_161453",
    busy = 43,
    discard = 0,
    state = warm,
    temp = warm,
    conf = {
      syntax = "41",
      srcname = {
        "/etc/varnish/build/router.vcl",
        "Builtin",
        "/etc/varnish/conf/common_security_sni_match.vcl",
        "/etc/varnish/conf/common_guru_meditation.vcl",
      },
    },
  },
  flags = {
  },
  privs = 0x7f3c8c186200 {
  },
  topreq = {
    req = 0x7f3c8c186020 {
      [Already dumped, see above]
    },
  },
},
thr.busyobj = (nil) {
},
vmods = {
  directors = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
  dynamic = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
  std = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
  proxy = {Varnish 6.2.0 b14a3d38dbe918ad50d3838b11aa596f42179b54, 0.0},
},
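
For context, the failing check is varnishd's AZ() zero-assertion macro; a rough sketch of its shape (per include/vas.h in the varnish-cache tree, where a failed assertion is routed through VAS_Fail(), matching the backtrace above):

/* AZ(x) demands x == 0, hence "Condition((req->vcl0) == 0) not true":
 * CNT_Embark requires req->vcl0 to be NULL when a request (re)enters
 * the client state machine. */
#define AZ(foo) do { assert((foo) == 0); } while (0)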

Possible Solution

I suspect cnt_restart should implement the same logic as ESI: move back to the original VCL stored in vcl0 and clear it from the request.
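
Roughly the shape I have in mind (a hedged sketch, not actual varnishd source; VCL_Rel() as the release helper and the exact ref-counting rules are assumptions on my part):

/* Hypothetical cnt_restart logic: before the restarted request
 * re-enters vcl_recv, undo a pending return(vcl(...)) switch the same
 * way the ESI path unwinds to the top request's VCL. */
if (req->vcl0 != NULL) {
        VCL_Rel(&req->vcl);     /* drop the labeled VCL (assumed helper) */
        req->vcl = req->vcl0;   /* restore the VCL the client first hit */
        req->vcl0 = NULL;       /* satisfies AZ(req->vcl0) in CNT_Embark */
}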

Steps to Reproduce (for bugs)

This is the initial bug report; I'll provide more detail as I read through the code, but my intuition is that a pair of VCLs like the following should cause the crash:

first_vcl


sub vcl_recv {
  # switch to the second VCL via its label (per its VclLabel comment below)
  return (vcl(vcl_varnish_crash));
}

second_vcl


# RoutingKey: req.url == "/varnish-crash"
# VclLabel: vcl_varnish_crash
# Priority: 1
vcl 4.1;

# Dummy, MUST NOT BE USED
backend default { .host = "127.0.0.1"; }

include "conf/builtin/synth.vcl";

sub vcl_recv {
  if (req.restarts==0) {
    return (restart);
  }
  return (synth(200, "OK - restarted " + req.restarts));
}

sub vcl_synth {
    call builtin_vcl_synth;
    return (deliver);
}

Your Environment

  • Version used: 6.2.0 (packagecloud)
  • Operating System and version: Debian buster
  • Source of binary packages used (if any): packagecloud .deb

This ticket will be updated as I confirm my initial investigation. Thanks a lot for your support.

This question originates from the open-source project varnishcache/varnish-cache.


17 replies

  • weixin_39844963 (5 months ago)

    First note: the current VCL uses either a direct backend or a DNS-based one via the dynamic vmod (OVH cloud storage) as a fallback.

    The only complex part is the restart used to switch backends when the original response status is over 399.

  • weixin_39844963 (5 months ago)

    The minimal VCL provided here does not cause the crash, so it's not as trivial as hoped. We only saw the crash on font files; we're working on reproducing the issue as consistently as possible.

  • weixin_39715834 (5 months ago)

    Once we have a reproducer, it would be nice to check whether #3011 accidentally fixes this by stashing vcl0 in the reqtop.
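
    For reference, a loudly hypothetical two-liner of that idea (the reqtop field and the exact hand-off are assumptions for illustration, not lines from the actual #3011 patch):

    /* Hypothetical: park the original VCL on the top request so restarts
     * and ESI sub-requests can always unwind back to it, leaving
     * req->vcl0 clear for CNT_Embark's AZ(req->vcl0) check. */
    req->top->vcl0 = req->vcl0;   /* assumed field on the top request */
    req->vcl0 = NULL;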

  • weixin_39844963 (5 months ago)

    Our first objective is to be able to go live with our deployment. Since ESI is not yet used on the new stack, we'll try putting 6.1 live; that will let us confirm vcl0 is indeed the main culprit and not a side effect.

    Here is the VCL called, and part of the router. common_purge only implements purge control with an ACL, so it should not be an issue; security_filesystem blocks a set of requests and badly percent-encoded strings, so it should not cause issues either.

    What I find strange is that the panic seems to happen during the lookup phase.

    The VCL

    The router is generated from the RoutingKey comments and only forwards matching requests, so it's not really complex. A minimal router.vcl:

    
    vcl 4.1;
    
    import std;
    
    # Dummy, MUST NOT BE USED
    backend default { .host = "127.0.0.1"; }
    
    include "conf/common_security_sni_match.vcl";
    
    sub vcl_recv {
        # Pre-Router Logic
        set req.http.X-Forwarded-Proto = "http";
        if (std.port(server.ip) == 443) {
            set req.http.X-Forwarded-Proto = "https";
        }
    
        ### clean out requests sent via curls -X mode and LWP
        if (req.url ~ "^http://") {
            set req.url = regsub(req.url, "http://[^/]*", "");
        }
    
        if (req.http.host == "Domain" || ((req.url ~ "^/+(REGEXP)($|/)" && req.url !~"\.ext") || req.url ~"\.static$")) {
          # From : ../conf-enabled/default_files.vcl # Priority : 600
          return (vcl(vcl_files));
        }
        # Return a 404, unknown Host
        return (synth(404, "Unknown Routing Rule"));
    }
    
    
    # RoutingKey: req.http.host == "Domain" || ((req.url ~ "^/+(REGEXP)($|/)" && req.url !~"\.ext") || req.url ~"\.static$")
    # VclLabel: vcl_files
    # Priority: 600
    
    vcl 4.1;
    
    import directors;
    import dynamic;
    import std;
    
    include "conf/common_purge.vcl";
    include "conf/common_security_main.vcl";
    include "conf/common_security_system_files.vcl";
    
    # probes
    probe static_probe {
        .url = "/static/ping.txt";
        .interval = 5s;
        .timeout = 3s;
        .window = 5;
        .threshold = 3;
    }
    
    # backends
    backend lighttpd {
        .host = "127.0.0.1";
        .port = "81";
        .max_connections = 500;
        .connect_timeout = 500ms;
        .first_byte_timeout = 25s;
        .between_bytes_timeout = 15s;
        .probe = static_probe;
    }
    
    sub vcl_init {
        new ovh_storage = dynamic.director(
            port = "80",
            ttl = 5m
        );
    
        new director_files = directors.fallback();
        director_files.add_backend(lighttpd);
    }
    
    sub vcl_recv {
        # Fallback to OVH Storage
        if (req.restarts == 1) {
          set req.url = regsub(req.url, "^/static/", "/customers/qualifio/static/");
          set req.url = regsub(req.url, "^/fonts/", "/customers/qualifio/fonts/");
          set req.url = "/v1/AUTH_XXXXXXXXXXXXXXXXXXXXXXXXX" + req.url;
          set req.backend_hint = ovh_storage.backend("storage.ZONE.cloud.ovh.net");
          return (hash);
        }
    
        # Protect access to lighttpd server status
        if (req.url ~ "^/XXXXXXXXXXXXXXXXXX") {
            return (synth(404));
        }
    
        set req.backend_hint=director_files.backend();
    
        # Happens before we check if we have this in cache already.
        #
        # Typically you clean up the request here, removing cookies you don't need,
        # rewriting the request, etc.
        set req.url = std.tolower(req.url);
        unset req.http.If-Modified-Since;
        unset req.http.If-None-Match;
    }
    
    sub vcl_backend_response {
        #unset beresp.http.Last-Modified;
        if (beresp.http.content-type ~ "text" || beresp.http.content-type ~ "application/java") {
            set beresp.do_gzip = true;
        }
        # Happens after we have read the response headers from the backend.
        #
        # Here you clean the response headers, removing silly Set-Cookie headers
        # and other mistakes your backend does.
        set beresp.http.Access-Control-Allow-Origin = "*";
        set beresp.grace=15s;
        if (bereq.url ~ "\.(js|css|jpg|jpeg|png|gif)(\?.*)?") {
            if (beresp.ttl < 3600s) {
                set beresp.ttl = 10800s;
                set beresp.uncacheable = false;
                /* Remove Expires from backend, it's not long enough */
                unset beresp.http.expires;
    
                /* Set the clients TTL on this object */
                set beresp.http.Cache-Control = "public, max-age=10800";
            }
            set beresp.grace=20m;
        }
        if (beresp.ttl < 60s) {
            set beresp.grace=60s;
            set beresp.ttl = 300s;
            set beresp.uncacheable = false;
            /* Remove Expires from backend, it's not long enough */
            unset beresp.http.expires;
    
            /* Set the clients TTL on this object */
            set beresp.http.Cache-Control = "public, max-age=300";
        }
        set beresp.keep=4h;
    }
    
    sub vcl_synth {
        # Restart on 911 to fallback to ovh cloud storage
        if (resp.status == 911) {
          return (restart);
        }
    }
    
    # Happens when we have all the pieces we need, and are about to send the
    # response to the client.
    #
    # You can do accounting or modifying the final object here.
    sub vcl_deliver {
        if (req.restarts == 0 && resp.status > 399) {
            # Require a restart and fallback to ovh cloud
            # We save the current error_code
            set req.http.X-Original-ErrCode = resp.status;
            set req.http.X-Original-Reason = resp.reason;
            # We restart the query on the cloud storage
            return (synth(911, "storage.zone.cloud.ovh.net"));
        }
        if (req.restarts == 1 && resp.status == 401) {
            # The request come from ovh cloud and is a 401
            # if we have a saved error code, return it
            if (req.http.X-Original-ErrCode) {
                return (synth(req.http.X-Original-ErrCode, req.http.X-Original-Reason));
            }
            return (synth(404, "Not Found"));
        }
        unset resp.http.Last-Modified;
        unset resp.http.Etag;
    }
    
    
  • weixin_39715834 (5 months ago)

    What is the output of this command?

    
    addr2line -e /usr/sbin/varnishd 0x560e65060dd3 0x560e650563d5 0x560e65056910
    
  • weixin_39844963 (5 months ago)

    Unfortunately:

    
    addr2line -e /usr/sbin/varnishd 0x560e65060dd3 0x560e650563d5 0x560e65056910
    ??:0
    ??:0
    ??:0
    

    And I can't find the debug version of the packagecloud .deb:

    
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name           Version         Architecture Description
    +++-==============-===============-============-==================================================
    ii  varnish        6.2.0-1~stretch amd64        state of the art, high-performance web accelerator
    
  • weixin_39844963 (5 months ago)

    Full stack: addr2line -e /usr/sbin/varnishd +0x4eec9 +0xB99C0 +0x52E16 +0x7bdd3 +0x713d5 +0x71910

    
    ./bin/varnishd/cache/cache_panic.c:624
    ./lib/libvarnish/vas.c:53
    ./bin/varnishd/cache/cache_req_fsm.c:1023
    ./bin/varnishd/http1/cache_http1_fsm.c:415
    ./bin/varnishd/cache/cache_wrk.c:410
    ./bin/varnishd/cache/cache_wrk.c:443
    
  • weixin_39715834 (5 months ago)
    
    varnishtest "return(vcl) then reembark"
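
    # What this test exercises, as I read it: c1 parks on s1 behind the
    # barrier, so c2's request for the same object lands on the
    # waitinglist; when c1's response finally arrives, c2 is rushed and
    # must re-embark through CNT_Embark with req->vcl0 still set from
    # return(vcl(lbl)), tripping AZ(req->vcl0).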
    
    barrier b1 cond 2
    
    server s1 {
            rxreq
            barrier b1 sync
            txresp
    } -start
    
    varnish v1 -vcl+backend ""
    varnish v1 -cliok "vcl.label lbl vcl1"
    varnish v1 -vcl {
            backend be { .host = "${bad_backend}"; }
    
            sub vcl_recv {
                    return (vcl(lbl));
            }
    } -start
    
    client c1 {
            txreq
            rxresp
            expect resp.status == 200
    } -start
    
    client c2 {
            barrier b1 sync
            txreq
            rxresp
            expect resp.status == 200
    } -start
    
    client c1 -wait
    client c2 -wait
    
  • weixin_39715834 (5 months ago)

    Confirmed panic on 6.2 and master, fixed by #3011.

  • weixin_39844963 (5 months ago)

    OK, we'll test it on our system as soon as possible (we have multiple issues that forced us to rush Varnish into production); I'll keep you posted to confirm it solves our issue.

  • weixin_39715834 (5 months ago)

    I'm fairly confident, considering how similar it is to the panic I consistently get with the test case I pasted. You can try it yourself before and after patching by running varnishtest file.vtc, where file.vtc contains my test script above.

  • weixin_39715834 (5 months ago)

    By the way, it would be nice to have some feedback on #3011 running in production :)

  • weixin_39844963 (5 months ago)

    We'll probably start by deploying 6.1 and shadowing the traffic to a test node running the patch. I'll let you know if it's good, and we'll settle on a strategy based on the VCLs deployed on the new Varnish cluster.

  • weixin_39844963 (5 months ago)

    Based on a partial load of traffic (one that was previously able to crash the server), we have a stable Varnish now, so it's probably good. I'll reopen the bug if it crashes under more volume.

  • weixin_39715834 (5 months ago)

    Reopening because we have an actionable test case to reproduce the crash, and we are trying to make progress on #3011, a candidate pull request for solving this.

  • weixin_39844963 (5 months ago)

    OK, so after 4 hours in production we're getting ~100 qps per node with no issues (we were crashing a node with ~7 qps before).

  • weixin_39844963 (5 months ago)

    We've been basically running for the last month with increasing traffic on the servers.

    All is good: no panics, more than 200 qps as of now, and 25 days of Varnish uptime.

    So obviously, it works quite well.

