weixin_39644611
2021-01-02 08:38

inspectIT Ocelot Agent has a huge JVM impact.

I compared inspectIT Ocelot with other APM solutions, and the inspectIT Ocelot agent has the largest impact on the heap. While testing an embedded Tomcat application for 2 hours, Elastic APM recorded around 170 MB of used heap and 270 MB of committed memory. Pinpoint APM recorded similar results. Ocelot recorded 450 MB of used heap and 620 MB of committed memory. I use version 1.0 with Zipkin and Prometheus.
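
The used/committed values can be cross-checked directly on the running JVM, for example (where <pid> stands for the Tomcat process id):

java -jar my-app.jar &          # the monitored application, started as usual
jstat -gc <pid>                 # current capacities (committed) and usage of the heap regions
jcmd <pid> GC.heap_info         # one-shot summary of used vs. committed heap (JDK 9 and later)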

This question comes from the open source project: inspectIT/inspectit-ocelot

24 replies

  • weixin_39644611 · 4 months ago

    Does version 1.2.1 really solve the memory leak problem already? I did several test runs with the same setup yesterday and I see that the problem is still not solved.

    (screenshots: jvm-metrics, Inkedjvm-metrics-heap_LI)

  • weixin_39631017 · 4 months ago

    It should have solved it, if the issue was related to the finalizer we removed. With the Span-finalizer included we were able to reproduce your issue, and on our side it disappeared after removing it.

    Could you maybe generate a JVM heap dump in the phase where the old gen starts ramping up and provide it to us? This should help us locate the issue, as so far we are not able to reproduce it anymore.
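
    In case it helps, a dump can be taken with something along these lines (where <pid> is the process id of the monitored JVM and the target path is only an example):

    jcmd <pid> GC.heap_dump /tmp/petclinic-oldgen-rampup.hprof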

  • weixin_39644611 · 4 months ago

    I created the Heap Dump with jcmd. https://mega.nz/file/Td5nUYJY#Kai0RLw7nA3Oy38_tH9zaIb8dsvDK_OScz1CcdKXAnc

  • weixin_39644611 · 4 months ago

    Okay, that's strange now. I did another test run, but I did not reboot the system. I only restarted the application, and now I don't have the memory leak after 1 hour. I'll run the test longer and see what happens.

  • weixin_39631017 · 4 months ago

    I took a closer look at your most recent screenshot: the actual memory leak is also fixed there. In your pre-1.2.1 screenshots, the old gen stayed at a higher level (200 MB) after the ramp-up. This was caused by the finalizer, as explained by -nt.

    In your most recent screenshot, the old gen increases until a Major GC occurs. Then, however, the old gen is back to the normal level (<100 MB).

    The ramp-up of the old gen and the resulting increase in committed memory seem to be normal behaviour. Basically, with only 2 CPU cores the application is under heavy load. The G1 GC does its best to maintain the collection throughput with its default settings, but apparently can't keep up.

    When the old gen starts growing, the tipping point for the G1 GC is reached: it decides to increase the committed memory in order to improve its throughput. More committed memory means less frequent garbage collections, which results in a better throughput for the GC.

    There is a certain allocation rate the application has to reach for this tipping over to occur. In our experiments, this also happened with the other APM agents we tried out (see the last post of -nt).

    In order to find out how much inspectIT Ocelot increases the allocation rate, I built a dashboard for visualizing it. You can use this dashboard to analyze the memory allocation rate of the application. I played around with various Ocelot settings (disabling all rules regarding tracing and metrics) and the allocation rate barely changed. If this finding is correct, it means that Ocelot should have a rather low allocation-rate overhead.

    I would suggest using this dashboard and selectively enabling/disabling the features of inspectIT Ocelot while monitoring the allocation rate. For example, this configuration file disables all instrumentation normally performed by inspectIT. These instrumentations capture HTTP metrics as well as the metrics required for the service graph. In addition, you can remove all files provided with the docker-demo from your configuration server, as these enable tracing.
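
    Independent of the dashboard, GC logging is an easy way to confirm this behaviour on your side, since it shows when a Major GC happens and when G1 decides to grow the committed heap. A sketch of what the start command could look like with unified GC logging enabled, in addition to your usual -Dinspectit properties (the log file path is only an example):

    java -Xlog:gc*:file=gc.log:time,uptime \
         -javaagent:../../inspectit-ocelot-agent-1.0.jar \
         -jar spring-petclinic-2.2.0.BUILD-SNAPSHOT.jar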

  • weixin_39631017 · 4 months ago

    I also took a look at your heap dump: jcmd triggers a major GC before taking the dump. Your dump therefore only has a total size of about 50 MB, indicating that there is no memory leak. There is also no longer any content belonging to inspectIT Ocelot or OpenCensus that takes up a large amount of space.
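
    Side note: if a dump that still includes the unreachable objects is ever needed, jmap can take one without forcing a collection first, for example:

    # without the "live" option no full GC is forced and unreachable objects stay in the dump
    jmap -dump:format=b,file=heap-all.hprof <pid>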

  • weixin_39644611 · 4 months ago

    I started a test run after rebooting the system and now I don't have any memory leak. Maybe it was just a read-out error. I will start another test overnight and if there are no problems, you can close the thread.

  • weixin_39631017 · 4 months ago

    Hi, anything new here? Or can I close this issue?

  • weixin_39715652 · 4 months ago

    Hey

    First of all, please excuse us for making you wait so long. As my colleague already told you, we have been able to reproduce the error, which took us quite some time. The key to reproducing it was the number of cores the whole setup ran on.

    A quick look at the heap dump we took while the application was running made us suspect the Finalizer class of causing this memory problem: (screenshot: Top Consumers)

    As you can see, the Finalizer class took 232.6 MB of 291.1 MB (about 80%) of the heap space, so we dug deeper into that. We found out that the finalizer that eats up so much space references an object from the OpenCensus library, which we use in our agent.

    (screenshot: analysis_marked)

    Upon a closer look at the OpenCensus source code, we found that OpenCensus overrides the finalize method in the RecordEventsSpanImpl class you see in the screenshot above. This method uses a synchronized block, which causes it to need a good amount of computing time to finish. As a result, the JVM's garbage collector needs more time than it actually gets to collect these objects during a garbage collection, and therefore the collection cannot finish.

    This is why the used heap remains on a higher baseline after the peak in your screenshot. Apparently this is only a problem on systems with fewer cores, which is why we had to reduce the number of cores the project runs on to 2.

    The good news is that we have been able to build a test version of our agent in which we have fixed that very problem, so the bug will be fixed in the next release of our agent!

    Concerning the issue of the fast-rising heap usage: we have taken a look at other agents and found that their heap consumption is similar to ours. So with the new release of our agent, you should not have any problems in this regard. Thank you again for your help and all the information you have provided to us! If you have any more questions, please feel free to ask!
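
    If you want to double-check on your side whether finalizable OpenCensus spans are piling up again, the class histogram and the finalizer queue of the running JVM should show it, for example:

    jcmd <pid> GC.class_histogram | grep -i -e Finalizer -e RecordEventsSpanImpl
    jcmd <pid> GC.finalizer_info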

    Greetings Marius

  • weixin_39644611 · 4 months ago

    (screenshot: Inkedjvm-metrics-heap-details_LI)

  • weixin_39715652 · 4 months ago

    Hey,

    Thank you for your effort and the details you have provided to us! I have tried to recreate the bug with OpenJDK 11, 11.0.1 and 11.0.2, the JAR you provided us, your JMeter test, and our default agent configuration. Unfortunately, I was not able to recreate the bug, with each test for the mentioned Java versions running for more than 3 hours.

    However, we are still interested in why this behaviour occurs. Would you like to share some more details about your setup? Maybe which exact OpenJDK version you used, and whether you only used this OpenJDK version for compiling or also for running the JARs? Does this behaviour occur every single test run or just from time to time? Every hint would be helpful!

    Cheers and thanks for your efforts, Marius

  • weixin_39644611 · 4 months ago

    openjdk version "11.0.5" 2019-10-15 LTS OpenJDK Runtime Environment 18.9 (build 11.0.5+10-LTS) OpenJDK 64-Bit Server VM 18.9 (build 11.0.5+10-LTS, mixed mode, sharing)

    VMware virtualization
    CentOS Linux release 7.7.1908 (Core)
    2 vCPU, Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
    4 GB RAM

  • weixin_39644611 · 4 months ago

    I'm not sure on which system I compiled the JAR. It could be on the same machine or on a Windows 10 machine with AdoptOpenJDK 11:

    openjdk version "11.0.4" 2019-07-16
    OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.4+11)
    OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.4+11, mixed mode)

  • weixin_39644611 · 4 months ago

    This behavior happens every time after about 1 hour.

  • weixin_39644611 · 4 months ago

    I tested "Petclinic" again with "openjdk version 11.0.1 2018-10-16 OpenJDK Runtime Environment 18.9 (build 11.0.1+13) OpenJDK 64-Bit Server VM 18.9 (build 11.0.1+13, mixed mode)" And again, the heap rises after one hour.

  • weixin_39965161 · 4 months ago

    Hi

    we've reproduced your problem and are currently analysing the data. We have a suspicion about what might be causing the problem, but would like to verify it one more time.

  • weixin_39644611 · 4 months ago

    I use the config server to deliver the configuration of the demo to the agent, but I think that's not the problem, because yesterday I started a 12-hour test without any configuration and got the same result.

    (screenshot: jvm-metrics)

    I built Petclinic from source as a JAR and started it with:

    java -javaagent:"../../inspectit-ocelot-agent-1.0.jar" -Dinspectit.config.http.url=http://:8090/api/v1/agent/configuration -Dinspectit.exporters.tracing.zipkin.url=http://:9411/api/v2/spans -Dinspectit.exporters.tracing.zipkin.service-name=PetClinic -jar spring-petclinic-2.2.0.BUILD-SNAPSHOT.jar

    Here is the test file and the aggregation report: petclinic_jmeter.zip

  • weixin_39644611 · 4 months ago

    And here is my Petclinic JAR: https://mega.nz/#!TUBXVY6T!jJgozIXKp2AowHd1igLlPGEA0dXmInP2uy0HtR08KpU

  • weixin_39944638 · 4 months ago

    Thanks a lot for sharing the details!

    We will try to reproduce and analyze this behaviour. One last question: are you using just the default configuration for the agent, or do you have additional custom configuration files in the config server? If yes, could you also provide the custom configuration?

    Thanks!

  • weixin_39644611 · 4 months ago

    I have no custom configuration.

  • weixin_39644611 · 4 months ago

    I experimented a bit with Grafana and found out that the agent also captures the heap in more detail. The old gen of the heap is increasing significantly. Apparently, objects (possibly from the agent) are not being cleaned up by the garbage collector.
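
    The old gen growth can also be watched directly with jstat to cross-check the Grafana numbers, for example (sampling every 10 seconds):

    jstat -gcutil <pid> 10s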

  • weixin_39644611 · 4 months ago

    My test application is the Spring Boot Petclinic with an embedded Tomcat and an embedded database, running on OpenJDK 11 on a CentOS 7 system. I create load with JMeter from a separate system. Prometheus and Zipkin are on another system. For JMeter I use a predefined test plan from Spring Boot, and I use the predefined Ocelot config files from the Ocelot demo. The test plan starts 100 threads within 10 seconds. Before I start any tests, I reboot the system.

    First I tested with agent version 0.5. The heap remains relatively constant at under 200 MB for about 1 hour. After one hour the memory increases very strongly, with peaks up to 700 MB. (screenshot: jvm-metrics-heap)

    With agent version 1.0 this behaviour can be reproduced. Here, too, the memory remains at 200 MB at the beginning, but increases strongly after one hour, with peaks up to 700 MB. (screenshot: jvm-metrics)

    With Elastic APM or Pinpoint I have never reached these values. Of course, I do not use the APM agents at the same time.
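
    For reference, the JMeter test plan can be run headless from the load-generator machine, roughly like this (the .jmx and .jtl file names below are placeholders):

    jmeter -n -t petclinic_test_plan.jmx -l results.jtl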

  • weixin_39965161 · 4 months ago

    Hi
    thanks for your feedback.

    Just to make sure I got this right:

    • You use the Ocelot configuration files that are included in the demo, right? Are you using a configuration server to deliver the configuration to the agent?
    • As demo application you use SpringBoot's Petclinic. Are you using the (distributed) inspectIT Ocelot demo version of it, or the "normal", monolithic one provided by Spring?
    • Can you also share your JMeter test (or a link to it) so we can analyse this behavior?
    • How do you add the agent to your demo system?

    All the best, Marius

  • weixin_39944638 · 4 months ago

    Hi, thanks for your insights! We would be very interested in more details. Would you be willing to share more details on your measurements, so we can better understand what causes the huge memory overhead?

    Thanks!
