Hi,
Websockets are admittedly not the most commonly used technology, although they are very useful in any near “real-time” scenario. The thing is, they may have a dramatic impact on how the Azure Application Gateway appears to behave, mostly regarding the monitoring aspects.
While the gateway works perfectly with websockets, the associated diagnostics may seem wrong at first, especially when a single gateway is shared across multiple backends, not all of which use websockets. You might indeed end up with charts
where you see your latency increasing a lot, with frequent peaks. So, if you set up an alert on this latency, you might end up with false positives. When digging further, you realize that this abnormal latency is in fact due to websockets.
The gateway comes with 3 log categories: Access, Performance and Firewall.
Let's forget the firewall and focus on the Access & Performance logs. The above chart is based on the performance logs, with latency averaged per minute:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.NETWORK" and Category == "ApplicationGatewayPerformanceLog"
| summarize avg(latency_d) by bin(TimeGenerated, 1m)
| render timechart
If you do not render it as a timechart, you’ll get the raw figures with huge latency. When digging further and switching to the Access logs, the following query is an eye opener:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.NETWORK" and Category == "ApplicationGatewayAccessLog" and httpStatus_d == 101
| project timeTaken_d
because it shows huge time taken values, and this is normal behavior. Indeed, when you visit a website that makes use of websockets and monitor the network traffic, you can easily see this:
a 101 (Switching Protocols) request that remains in a pending state because the socket stays open until the end of the user session. Therefore, the recorded latency might be huge (500 000 ms for instance) and will have an impact on the overall latency, so latency_d is no longer a reliable metric.
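To get a feel for how much a single long-lived socket can skew an average, here is a small self-contained sketch. It does not read any real logs: the 99 ordinary requests at 50 ms are made-up figures, only the 500 000 ms value comes from the example above, and the column names httpStatus and timeTaken are just illustrative.

```kusto
// Hypothetical sample: 99 ordinary requests at 50 ms each...
range i from 1 to 99 step 1
| project httpStatus = 200, timeTaken = 50.0
// ...plus one websocket session (status 101) that stayed open for 500 000 ms
| union (print httpStatus = 101, timeTaken = 500000.0)
// Compare the average over everything with the average ignoring the 101 row
| summarize avg_all = avg(timeTaken), avg_without_101 = avgif(timeTaken, httpStatus != 101)
```

With these made-up numbers, the overall average works out to 5049.5 ms, against 50 ms once the 101 row is left out, which is the kind of gap that triggers a latency alert for no good reason.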
Long story short: for the time being, if you happen to use websockets, I’d recommend either using a dedicated gateway, where you know latency_d is not the metric you will count on, or considering an alternative solution. Indeed, there is currently no way to discard websockets-related traffic from the Performance logs; therefore, the overall performance metric is impacted although there is no issue in reality. On the other hand, the timeTaken_d metric in the Access log may become a functional metric, as it reflects the amount of time users stayed on your website.
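One possible workaround, if the Access logs are good enough for your alerting, is to chart timeTaken_d there and leave out the 101 responses. This is only a sketch: it assumes websocket upgrades are the only requests logged with status 101 on your gateway, and timeTaken_d is not the same measurement as latency_d, so the two charts will not match exactly.

```kusto
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.NETWORK" and Category == "ApplicationGatewayAccessLog"
// Assumption: status 101 identifies the websocket upgrade requests
| where httpStatus_d != 101
// Average time taken per minute, without the long-lived sockets
| summarize avg(timeTaken_d) by bin(TimeGenerated, 1m)
| render timechart
```

It reuses only the columns already shown in the queries above, so it should be easy to adapt if your alert needs a different time grain.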