Understand the impact of websockets on the Azure Application Gateway

Hi,

Websockets are admittedly not the most commonly used technology, although they are very useful in almost any near "real-time" scenario. The catch is that they can have a dramatic impact on how the Azure Application Gateway appears to behave, mostly from a monitoring perspective.

While the gateway works perfectly with websockets, the associated diagnostics may look wrong at first. This is especially true when a single gateway is shared across multiple backends that do not use websockets. You might indeed end up with charts like this one:

[Chart: average gateway latency over time, with frequent spikes]

where you see latency increasing sharply, with frequent peaks. So, if you set up an alert on this latency, you might end up with false positives. When digging further, you realize that this abnormal latency is in fact caused by websockets.

The gateway comes with 3 log categories:

[Screenshot: the gateway's diagnostic log categories (Access, Performance and Firewall)]

Let's set the Firewall log aside and focus on the Access and Performance logs. The chart above is based on the Performance log, with latency averaged per minute:

// Average gateway latency per minute, taken from the Performance log
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.NETWORK" and Category == "ApplicationGatewayPerformanceLog"
| summarize avg(latency_d) by bin(TimeGenerated, 1m)
| render timechart

If you do not render it as a timechart, you get the raw figures, which show huge latency values. When you dig further and switch to the Access log, the following query is an eye-opener:

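// Access log entries for websocket upgrades (HTTP 101), longest first
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.NETWORK" and Category == "ApplicationGatewayAccessLog" and httpStatus_d == 101
| project TimeGenerated, timeTaken_d
| order by timeTaken_d desc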

It shows huge timeTaken_d values, and this is normal behavior. Indeed, if you visit a website that uses websockets while monitoring the network traffic, you can easily see this:

[Screenshot: network trace showing a websocket request with status 101 still pending]

a request answered with HTTP 101 (Switching Protocols) that stays pending. The socket remains open until the end of the user session. The recorded latency can therefore be huge (500,000 ms, for instance), which inflates the overall latency and makes latency_d an unreliable metric.
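To see why a single websocket connection can distort the chart this much, here is a small worked example with hypothetical numbers. It assumes the reported latency is a plain arithmetic mean over the requests recorded in a one-minute interval; the gateway's exact aggregation is not covered in this post. The scenario is 99 ordinary requests taking 50 ms each, plus one websocket upgrade that stays open for 500,000 ms:

```latex
\bar{t}_{\text{all}} = \frac{99 \times 50\,\text{ms} + 1 \times 500\,000\,\text{ms}}{100}
                     = \frac{504\,950\,\text{ms}}{100}
                     = 5\,049.5\,\text{ms}
\qquad\text{vs.}\qquad
\bar{t}_{\text{without 101}} = \frac{99 \times 50\,\text{ms}}{99} = 50\,\text{ms}
```

Under these assumptions, one long-lived connection out of a hundred is enough to multiply the average roughly a hundredfold, even though every ordinary request was fast.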

Long story short: for the time being, if you use websockets, I'd recommend two options. One is to use a dedicated gateway for them, where you know you will not rely on latency_d. The other is to consider an alternative solution. There is currently no way to exclude websocket-related traffic from the Performance log. As a result, the overall performance metric is affected even though there is no real issue. On the other hand, the timeTaken_d value of 101 requests in the Access log may become a functional metric, since it reflects how long users stayed on your website.
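If you keep websockets on a shared gateway, one possible workaround is to chart and alert on the Access log instead, where 101 responses can be filtered out. This is a sketch, not something the gateway offers out of the box. It reuses the AzureDiagnostics columns from the queries above (httpStatus_d, timeTaken_d). The unit of timeTaken_d may differ between gateway SKUs, so check it before comparing the result with latency_d.

```kql
// Sketch: per-minute average time taken from the Access log,
// excluding websocket upgrade requests (HTTP 101)
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.NETWORK" and Category == "ApplicationGatewayAccessLog"
| where httpStatus_d != 101
| summarize avg(timeTaken_d) by bin(TimeGenerated, 1m)
| render timechart
```

Flipping the filter to httpStatus_d == 101, ideally with a coarser bin, gives the session-duration view mentioned above instead of a latency view.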


About Stephane Eyskens

Office 365, Azure PaaS and SharePoint platform expert
This entry was posted in Azure.
