Summary
After upgrading docker/ECS/the Amazon Linux AMI, our machines quickly (within a day or two) develop a continuous spike in IO (mostly reads) from the docker daemon, which drains all EBS GP2 burst credits and then leaves docker commands unresponsive. No known changes in the docker containers themselves account for the IO, and iotop and docker stats show the IO coming not from the processes inside the containers but from the docker daemon itself.
"modifying" the ebs volume to be larger (and thus resetting the burst credit limit) does give a temporary solution, but the problem persists. We're seeing this on a machines of all sizes running ~10-30 containers using a 200Gb EBS GP2 volume, none of the code inside the containers is performing any IO (beyond usual api server python startup reads to python code etc)
Description
ECS logs show issues like this:
/var/log/ecs/ecs-agent.log.2018-01-13-20:17719:2018-01-13T20:52:34Z [DEBUG] Collecting stats for container 444653aaf4c0a82529f707d1c7b2ff70843d95ec2672921c8adc146a06f02c6a
/var/log/ecs/ecs-agent.log.2018-01-13-20:17766:2018-01-13T20:52:44Z [INFO] Error retrieving stats for container 444653aaf4c0a82529f707d1c7b2ff70843d95ec2672921c8adc146a06f02c6a: inactivity time exceeded timeout
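To tie the container ID from the agent log back to an ECS task I've been using something like this (a rough sketch; the label name is the standard one the ECS agent sets on its containers):

# Map the container from the agent log back to its name and task ARN
docker inspect --format '{{.Name}} {{index .Config.Labels "com.amazonaws.ecs.task-arn"}}' \
  444653aaf4c0a82529f707d1c7b2ff70843d95ec2672921c8adc146a06f02c6a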
docker logs show a lot of issues like this:
time="2018-01-12T01:56:19.479923466Z" level=error msg="Failed to compute size of container rootfs 248ee44d1b48da09fcdf0d7508adb3284b947904900aaf3d98854ad16b25b90a: mount does not exist"
time="2018-01-12T01:57:12.701824864Z" level=error msg="Handler for GET /v1.30/containers/json returned error: write unix /var/run/docker.sock->@: write: broken pipe"
2018-01-12 01:57:12.701890 I | http: multiple response.WriteHeader calls
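Given the "Failed to compute size of container rootfs" errors, one thing I tried (sketch only; I'm not certain this is the same code path the agent exercises, and it needs curl built with unix-socket support) was to hit the container list endpoint with and without the size flag, to see whether the size computation alone is enough to trigger the heavy reads:

# Plain listing (what "docker ps" does) vs. listing with size computation ("docker ps --size")
time curl --silent --unix-socket /var/run/docker.sock http://localhost/v1.30/containers/json > /dev/null
time curl --silent --unix-socket /var/run/docker.sock 'http://localhost/v1.30/containers/json?size=1' > /dev/null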
iotop output shows the root dockerd process performing heavy and sustained read IO (4MB/s)
iostat shows the device receiving the IO is the docker storage volume, specifically "/dev/dm-1" and "/dev/dm-2", neither of which shows up in the "df -h" output
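For anyone else chasing this, the dm-1/dm-2 names can be mapped back to the devicemapper thin pool like so (device names will differ per host):

# Show which device-mapper targets dm-1 and dm-2 correspond to
ls -l /dev/mapper/          # symlinks such as docker-docker--pool -> ../dm-N
sudo dmsetup ls --tree      # thin pool plus its data/metadata devices
lsblk                       # full block device tree, including the docker data volume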
using the "fatrace" utility I see continuous reads to a subset of the files within certain container mount points e.g.:
dockerd(3485): O /var/lib/docker/devicemapper/mnt/a5507a59211c19b14f6a66924d2f3612c829678cd4898632b6e803388ea612ea-init/rootfs/usr/share/man/cs/man8
dockerd(3485): O /var/lib/docker/devicemapper/mnt/a5507a59211c19b14f6a66924d2f3612c829678cd4898632b6e803388ea612ea-init/rootfs/usr/share/man/cs/man8
dockerd(3485): O /var/lib/docker/devicemapper/mnt/a5507a59211c19b14f6a66924d2f3612c829678cd4898632b6e803388ea612ea-init/rootfs/usr/share/man/cs/man8
dockerd(3485): RC /var/lib/docker/devicemapper/mnt/a5507a59211c19b14f6a66924d2f3612c829678cd4898632b6e803388ea612ea/rootfs/usr/share/man/cs/man5
dockerd(3485): O /var/lib/docker/devicemapper/mnt/a5507a59211c19b14f6a66924d2f3612c829678cd4898632b6e803388ea612ea-init/rootfs/usr/share/man/cs/man8
dockerd(3485): O /var/lib/docker/devicemapper/mnt/a5507a59211c19b14f6a66924d2f3612c829678cd4898632b6e803388ea612ea-init/rootfs/usr/share/man/cs/man8
dockerd(3485): RC /var/lib/docker/devicemapper/mnt/a5507a59211c19b14f6a66924d2f3612c829678cd4898632b6e803388ea612ea/rootfs/usr/share/man/cs/man5
dockerd(3485): O /var/lib/docker/devicemapper/mnt/a5507a59211c19b14f6a66924d2f3612c829678cd4898632b6e803388ea612ea-init/rootfs/usr/share/man/cs/man8
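To get a feel for how widespread this is, I summarised the fatrace output roughly like this (sketch; the capture window is arbitrary):

# Capture ~60s of fatrace output, then count which container mounts dockerd is reading from
sudo timeout 60 fatrace > /tmp/fatrace.log
grep '^dockerd' /tmp/fatrace.log | grep devicemapper/mnt | \
  sed 's|.*/devicemapper/mnt/||' | cut -d/ -f1 | sort | uniq -c | sort -rn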
But the process ID performing the reads is the docker daemon itself? Is this a weird artifact of the way docker works? Are reads from the container processes routed through the daemon in certain cases, or is this related to the error above, i.e. the daemon trying to compute the size of the container rootfs (by stat-ing every file in it)?
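To double-check that the reads really are issued by dockerd and not by processes inside the containers, I sampled the daemon's own IO counters; a rough sketch:

# Sample dockerd's cumulative read_bytes twice, 10s apart, to estimate its read rate
PID=$(pidof dockerd)
sudo grep read_bytes /proc/$PID/io; sleep 10; sudo grep read_bytes /proc/$PID/io

# Optionally trace which files it opens (matches the fatrace view above;
# attaching strace to a busy daemon will slow it down, so use sparingly)
sudo strace -f -p "$PID" -e trace=open,openat -o /tmp/dockerd-opens.txt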
Happy to provide more details, as I'm not sure what to check next or how to work around this other than trying to revert to an older version of docker/ecs (and no, I don't know precisely which version introduced this).
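If it helps anyone else, the rollback I'm considering looks roughly like this (package version is a placeholder; I have not confirmed which combination is safe):

# Stop the ECS agent and docker, roll docker back, then bring everything up again
sudo stop ecs
sudo service docker stop
sudo yum downgrade -y docker-<previous-version>
sudo service docker start
sudo start ecs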
Environment Details
docker info
Containers: 83
Running: 27
Paused: 0
Stopped: 56
Images: 17
Server Version: 17.06.2-ce
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3kB
Base Device Size: 10.74GB
Backing Filesystem: ext4
Data file:
Metadata file:
Data Space Used: 15.76GB
Data Space Total: 101.8GB
Data Space Available: 86.02GB
Metadata Space Used: 10.89MB
Metadata Space Total: 109.1MB
Metadata Space Available: 98.16MB
Thin Pool Minimum Free Space: 10.18GB
Udev Sync Supported: true
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 1
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 6e23458c129b551d5c9871e5174f6b1b7f6d1170
runc version: 810190ceaa507aa2727d7ae6f4790c76ec150bd2
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.62-21.56.amzn1.x86_64
Operating System: Amazon Linux AMI 2017.09
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 14.69GiB
Name: ip-172-31-5-100
ID: FWKC:KL7D:YYOW:HHOJ:HTQX:CCTS:BGPN:EXGT:7OJV:QD7X:B6L7:SZCB
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Experimental: false
Insecure Registries:
repos.fbn.internal
repos.fbn.org
127.0.0.0/8
Live Restore Enabled: false
curl http://localhost:51678/v1/metadata
{"Cluster":"prod","ContainerInstanceArn":"arn:aws:ecs:us-west-2:699897246896:container-instance/05f2b658-da93-4eee-8924-475cb50e231d","Version":"Amazon ECS Agent - v1.15.2 (ea58fdd)"}该提问来源于开源项目:aws/amazon-ecs-agent