Newrelic High Response Times

Today I am tasked with troubleshooting why a set of newly provisioned physical RHEL 7.3 (Maipo) servers is showing poor performance in New Relic.

I just set up these new physical servers and installed our apps on them. The apps are Play Framework web applications written in Scala. After starting the applications and putting them through rigorous internal QA, I got sign-off that all was OK and received the green light for PROD.

So deployment day comes and I am ready to throw these suckers into the F5 PROD pools alongside the other VMs that I am replacing.

I open up New Relic and switch the health checks so that the new servers get pulled into the PROD pools. The servers show no major errors in Splunk, and everything is looking good.

I take a look at New Relic and notice that our response times to external services have gone way above normal. I am talking something like 800% slower than normal. Not cool…

Now what…

We use the New Relic Java Agent for application instrumentation. It was showing that every external service call, whether to a truly external endpoint like the salesforce.com API or to an internal IP address inside our firewall, was taking very long to respond: something on the order of 500 seconds for some requests.

It would seem these new servers were very very sick compared to the other servers.

Time to start my library of performance tests.

1.  Checking the Firewalls

The first thing I checked was the firewalls. I thought to myself “This all passed QA. If it was a firewall issue, it wouldn’t have passed.” But as anyone else would do, I had to throw out any preconceived notions of what was happening.

firewalld

Ok, firewalls are down.

selinux

Nope, it’s disabled. Not the issue.
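For anyone who wants to repeat these two checks from a script, here is a minimal Python sketch. It assumes a systemd host where `systemctl is-active firewalld` prints the service state, and it relies on the Linux convention that `/sys/fs/selinux/enforce` is absent when SELinux is disabled. The function names are made up for this example.

```python
import subprocess
from pathlib import Path


def firewalld_state():
    """Return the state printed by `systemctl is-active firewalld`."""
    try:
        result = subprocess.run(
            ["systemctl", "is-active", "firewalld"],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        # systemctl is missing (or hung), so we cannot tell.
        return "unknown"
    # A non-zero exit status only means "not active"; the state is on stdout.
    return result.stdout.strip() or "unknown"


def selinux_mode(enforce_path="/sys/fs/selinux/enforce"):
    """Return 'disabled', 'permissive', or 'enforcing'."""
    path = Path(enforce_path)
    if not path.exists():
        # selinuxfs is not mounted, which is how a disabled SELinux appears.
        return "disabled"
    return "enforcing" if path.read_text().strip() == "1" else "permissive"


if __name__ == "__main__":
    print("firewalld:", firewalld_state())
    print("selinux:  ", selinux_mode())
```

On the servers in this post, the expectation would be an inactive firewalld and a disabled SELinux.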

ulimits

The next thing I check is the server’s ulimits.

Everything looks pretty standard.
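As a rough illustration of the same check in code, the sketch below reads a few limits with Python's standard `resource` module. It assumes Linux, and it reports the limits of the Python process itself (inherited from the shell that launched it), which may differ from the limits of an already running application; those can be read from `/proc/<pid>/limits`. The choice of which limits to show is mine.

```python
import resource

# Limits that commonly matter for a busy JVM web application (my selection).
LIMITS = {
    "open files (ulimit -n)": resource.RLIMIT_NOFILE,
    "max user processes (ulimit -u)": resource.RLIMIT_NPROC,
    # Reported in bytes here, while `ulimit -s` shows 1024-byte units.
    "stack size in bytes (ulimit -s)": resource.RLIMIT_STACK,
}


def format_limit(value):
    """Render a raw rlimit value, using 'unlimited' like `ulimit` does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {label: (soft, hard)} for the current process."""
    return {label: resource.getrlimit(res) for label, res in LIMITS.items()}


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={format_limit(soft)} hard={format_limit(hard)}")
```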

curl

For curl, I use a curl-format file, a template passed to curl’s -w (--write-out) option.

The curl-format file is nifty for printing extended timing information, which is handy when you are investigating response times.
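The original curl-format file is not reproduced here, so the following is only a sketch of the idea in Python. It builds a write-out template from curl's standard timing variables, then splits the cumulative timestamps curl reports into per-phase durations. The helper names and the phase labels are my own.

```python
import subprocess

# Timing variables supported by curl's -w / --write-out option.
TIMING_VARS = [
    "time_namelookup",
    "time_connect",
    "time_appconnect",
    "time_pretransfer",
    "time_starttransfer",
    "time_total",
]

# One "name=value" line per variable, e.g. "time_total=%{time_total}\n".
WRITE_OUT = "".join(f"{name}=%{{{name}}}\\n" for name in TIMING_VARS)


def parse_timings(output):
    """Turn curl's "name=value" lines into a dict of floats (seconds)."""
    timings = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if sep and name in TIMING_VARS:
            timings[name] = float(value)
    return timings


def phase_durations(t):
    """Split curl's cumulative timestamps into per-phase durations."""
    # time_appconnect stays 0 for plain HTTP, so there is no TLS phase.
    tls = t["time_appconnect"] - t["time_connect"] if t["time_appconnect"] > 0 else 0.0
    return {
        "dns": t["time_namelookup"],
        "tcp_connect": t["time_connect"] - t["time_namelookup"],
        "tls": tls,
        "server_wait": t["time_starttransfer"] - t["time_pretransfer"],
        "transfer": t["time_total"] - t["time_starttransfer"],
    }


def time_url(url):
    """Run curl against url and return the per-phase durations."""
    result = subprocess.run(
        ["curl", "-s", "-o", "/dev/null", "-w", WRITE_OUT, url],
        capture_output=True, text=True, check=True,
    )
    return phase_durations(parse_timings(result.stdout))
```

Calling `time_url(...)` against both an internal and an external endpoint from the suspect server makes it easy to see whether the time goes to DNS, connecting, or waiting on the remote side.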

However, everything I checked seemed okay. Time for the big guns.

qperf

qperf is a tool that tests network connections between two servers. You run the qperf client on the slow server, and you run the qperf service on the target server. So I installed the qperf package and started up the qperf service on another machine in my network.

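Server: you simply run qperf with no switches.

Client: you run qperf with the server’s hostname or IP address, followed by the names of the tests you want.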

So I run the tests, in this case the tcp_bw (TCP bandwidth) and tcp_lat (TCP latency) tests.

So, the bandwidth comes back at about 1 Gb/s, which is what I would expect over the Gigabit interface.

The latency comes back at 118us, which is also what I would expect.

So, all in all, it doesn’t seem to have any issues from the server side!
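When reading bandwidth results it helps to keep bits and bytes straight: a Gigabit link carries at most 125 decimal megabytes per second. The small Python sketch below does that conversion and expresses a measured throughput as a fraction of the raw link capacity. The function names are made up, and the 117.0 MB/s figure is purely illustrative, not a measurement from these servers.

```python
def link_capacity_mb_per_s(link_gbit_per_s):
    """Raw capacity of a link in decimal megabytes per second (8 bits per byte)."""
    return link_gbit_per_s * 1000 / 8


def link_utilization(measured_mb_per_s, link_gbit_per_s):
    """Fraction of the raw link capacity used by a measured throughput."""
    return measured_mb_per_s / link_capacity_mb_per_s(link_gbit_per_s)


if __name__ == "__main__":
    print(link_capacity_mb_per_s(1))            # 125.0
    # Illustrative throughput value, not taken from the tests in this post.
    print(f"{link_utilization(117.0, 1):.0%}")  # 94%
```

A result close to the link capacity, together with a latency in the low hundreds of microseconds, points away from the network as the culprit.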

Now what…

Newrelic

So, I have pretty much come to the conclusion that New Relic is lying to me. I finally stop the servers and start looking into New Relic itself. That is when it dawns on me.

The new servers have a different version of the New Relic Java Agent running!

The old servers and the new servers all report to the same APM pool (app_name). The new servers were running version 3.41.0 of the New Relic Java Agent (the latest version as of this writing). The old servers were running version 3.33.0. Could this REALLY be the issue? Has the way the New Relic Java Agent reports metrics changed?
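A check like this is easy to automate before adding servers to a pool. The Python sketch below groups hosts by agent version and flags a pool where more than one version reports under the same app_name. How the versions are collected is left open, the host names are hypothetical, and only the two version numbers come from this post.

```python
from collections import defaultdict


def version_key(version):
    """Sort key so that '3.41.0' orders after '3.33.0' numerically."""
    return tuple(int(part) for part in version.split("."))


def group_by_agent_version(agent_versions):
    """Map each agent version to the sorted list of hosts running it."""
    groups = defaultdict(list)
    for host, version in agent_versions.items():
        groups[version].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


def check_agent_versions(agent_versions):
    """Return (consistent, groups) for hosts that share one app_name."""
    groups = group_by_agent_version(agent_versions)
    ordered = dict(sorted(groups.items(), key=lambda kv: version_key(kv[0])))
    return len(ordered) <= 1, ordered


if __name__ == "__main__":
    # Hypothetical host names; the versions are the two mentioned in the post.
    pool = {
        "old-vm-01": "3.33.0",
        "old-vm-02": "3.33.0",
        "new-phys-01": "3.41.0",
        "new-phys-02": "3.41.0",
    }
    consistent, groups = check_agent_versions(pool)
    print("consistent:", consistent)
    for version, hosts in groups.items():
        print(version, hosts)
```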

So I quickly laugh to myself as I start replacing the agent on the new servers with version 3.33.0, so that all the agents match. It couldn’t really be this, could it?

I restart the servers, place them back into the production pools, and I wait for NewRelic to catch up.

Before, and after the NewRelic Java Agent Change

Wow. What a difference.

 

So yeah. In the end, the servers were completely fine. The reporting was getting mixed up because servers running two different versions of the New Relic Java Agent (3.33.0 and 3.41.0) were reporting under the same app_name.

The End.