Troubleshooting Message Queue (MQ) Timeout Issues Across Azure and GCP Datacenters

A real-world production incident involving Message Queue (MQ) timeout issues across Azure and GCP datacenters that partially impacted customer login services, backend APIs, and distributed microservices communication.

Sowmya Narayan

5/9/2026 · 3 min read


Introduction

Distributed systems running across multiple cloud datacenters can occasionally face transient network failures that are difficult to predict and troubleshoot.

We recently encountered a production incident involving intermittent MQ timeout issues across both Azure and GCP datacenters, which partially impacted customer login services and backend operations on our digital platform.

This issue affected:

  • customer login functionality

  • mobile applications

  • profile updates

  • transaction lookup services

  • CSR operations

In this blog, I’ll walk through:

  • how the issue was detected

  • monitoring and observability used during investigation

  • the actual root cause

  • impact analysis

  • lessons learned from the incident

Alert Detection

The issue was initially identified through monitoring alerts triggered by elevated 5XX failure rates across microservices.

We observed:

  • increased HTTP 5XX errors

  • API timeout failures

  • elevated MQ timeout errors

  • customer login failures

All of these alerts fired automatically through our monitoring systems.
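
For illustration, the core of such an alert is just a ratio check over a sliding window. The sketch below is a minimal, hypothetical version in Python; the metric counts and the 5% threshold are assumptions for the example, not our actual alerting configuration.

```python
# Minimal sketch of a 5XX error-rate check; counts, threshold, and the
# one-minute window are illustrative assumptions, not production config.

def error_rate(status_counts: dict[int, int]) -> float:
    """Fraction of requests in the window that returned a 5XX status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in status_counts.items() if 500 <= code < 600)
    return errors / total

ALERT_THRESHOLD = 0.05  # page the on-call when >5% of requests fail

window = {200: 1800, 204: 120, 502: 40, 503: 95}  # last minute of traffic
if error_rate(window) > ALERT_THRESHOLD:
    print(f"ALERT: 5XX failure rate at {error_rate(window):.1%}")
```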

Monitoring Dashboard Observations

Our Grafana MQ dashboards immediately showed abnormal channel behavior across both Azure and GCP datacenters.

The dashboards revealed:

  • MQ timeout spikes

  • channel instability

  • intermittent connectivity disruptions

  • sudden drops in message flow
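
Channel health like this can also be polled directly from the queue manager, not just observed in Grafana. Below is a hedged sketch using the pymqi client, assuming IBM MQ; the queue manager, channel, and host names are placeholders, not our real topology.

```python
# Sketch: querying live channel status over PCF with pymqi (IBM MQ).
# Connection details below are placeholders.
import pymqi

qmgr = pymqi.connect('QM.AZURE', 'ADMIN.SVRCONN',
                     'mq-azure.example.internal(1414)')
pcf = pymqi.PCFExecute(qmgr)

# Ask the queue manager for the status of every channel.
for ch in pcf.MQCMD_INQUIRE_CHANNEL_STATUS(
        {pymqi.CMQCFC.MQCACH_CHANNEL_NAME: b'*'}):
    name = ch[pymqi.CMQCFC.MQCACH_CHANNEL_NAME].strip()
    status = ch[pymqi.CMQCFC.MQIACH_CHANNEL_STATUS]
    # MQCHS_RUNNING is healthy; anything else belongs on a dashboard.
    healthy = status == pymqi.CMQCFC.MQCHS_RUNNING
    print(name, 'RUNNING' if healthy else f'status={status}')

qmgr.disconnect()
```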

Customer Impact

Between approximately 2:53 AM and 3:06 AM EST, customers experienced intermittent failures across multiple services.

Observed Impact

  • Customer login to the secure site and mobile application was partially impacted

  • Approximately 557 unique customers faced login failures

  • Users already logged in experienced intermittent issues while:

    • updating profiles

    • viewing statements

    • checking transactions

    • accessing notifications

CSR agents also observed:

  • OOPS errors

  • customer lookup failures

  • intermittent backend processing issues

Application Error Investigation

During investigation, we analyzed logs in the Elastic Stack and observed a sharp increase in authentication and timeout-related failures.

The logs confirmed:

  • timeout-related API failures

  • MQ communication delays

  • intermittent authentication processing failures
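
As a rough illustration, a triage query like the one below narrows the window to recent errors and buckets them by service. This is a hypothetical sketch with the official Python Elasticsearch client; the index pattern and field names are assumptions, not our real mappings.

```python
# Illustrative Elastic triage query; index pattern and field names
# are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch('http://elastic.example.internal:9200')

resp = es.search(
    index='app-logs-*',
    size=0,
    query={'bool': {
        'must': [{'range': {'@timestamp': {'gte': 'now-30m'}}},
                 {'match': {'log.level': 'ERROR'}}],
        'should': [{'match_phrase': {'message': 'MQ timeout'}},
                   {'match_phrase': {'message': 'authentication failure'}}],
        'minimum_should_match': 1,
    }},
    # Count errors per service to see which dependencies are loudest.
    aggs={'per_service': {'terms': {'field': 'service.name', 'size': 10}}},
)
for bucket in resp['aggregations']['per_service']['buckets']:
    print(bucket['key'], bucket['doc_count'])
```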

Failure Rate Analysis

Detailed request tracing showed a sudden spike in API failure rates during the impacted time window.

The failure rate temporarily crossed the 50% mark, and request timeouts spiked past their alerting thresholds.

This directly correlated with the MQ connectivity disruption.
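
To make the correlation concrete, bucketing traced requests per minute shows exactly when the failure rate crossed 50%. The sample records below are illustrative stand-ins for trace output, purely to show the shape of the analysis.

```python
# Sketch: per-minute failure-rate buckets from traced requests.
# The sample records are illustrative, not real trace data.
from collections import defaultdict
from datetime import datetime

traced = [
    {'ts': '2025-10-27T02:53:40', 'status': 200},
    {'ts': '2025-10-27T02:54:10', 'status': 504},
    {'ts': '2025-10-27T02:54:12', 'status': 503},
    {'ts': '2025-10-27T02:55:01', 'status': 200},
]

buckets: dict[str, list[int]] = defaultdict(list)
for r in traced:
    minute = datetime.fromisoformat(r['ts']).strftime('%H:%M')
    buckets[minute].append(r['status'])

for minute, statuses in sorted(buckets.items()):
    rate = sum(1 for s in statuses if s >= 500) / len(statuses)
    flag = '  <-- above 50%' if rate > 0.5 else ''
    print(f'{minute}  {rate:.0%} of {len(statuses)} requests failed{flag}')
```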

Root Cause Investigation

Further investigation with the Network Team identified the actual root cause.

The network tunnel between datacenters experienced a brief flap and temporary outage.

Network logs showed:

Oct 27 02:54:04:
EIGRP Neighbor Tunnel500 is down: holding time expired

Oct 27 02:56:11:
EIGRP Neighbor Tunnel500 is up: new adjacency

In EIGRP terms, “holding time expired” means the router stopped receiving hello packets from its tunnel neighbor within the hold timer, so it tore down the adjacency; roughly two minutes later the tunnel recovered and the adjacency re-formed.

This temporary tunnel instability caused:

  • MQ communication interruptions

  • timeout failures

  • transient connectivity issues between Azure/GCP MQ and TSYS MQ

Why the Issue Impacted Multiple Services

The impacted microservices relied heavily on MQ communication for:

  • authentication workflows

  • transaction processing

  • profile management

  • backend service integration

When MQ connectivity became unstable:

  • requests started timing out

  • APIs returned 5XX errors

  • downstream services failed intermittently

Since the issue affected both Azure and GCP datacenters simultaneously, the impact spread across multiple dependent services.
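
To see why a tunnel flap surfaces as HTTP 5XX, consider the request/reply pattern these flows use over MQ. The sketch below assumes IBM MQ via pymqi; the queue names, the 5-second wait, and the error mapping are illustrative, not our actual service code.

```python
# Hedged sketch of request/reply over MQ; queue names and the timeout
# are illustrative. When the tunnel flapped, the reply never arrived
# within the wait interval, and callers saw this as a 5XX response.
import pymqi

def lookup_over_mq(qmgr: pymqi.QueueManager, payload: bytes) -> bytes:
    request_q = pymqi.Queue(qmgr, 'AUTH.REQUEST')
    reply_q = pymqi.Queue(qmgr, 'AUTH.REPLY')
    try:
        request_q.put(payload)
        gmo = pymqi.GMO()
        gmo.Options = (pymqi.CMQC.MQGMO_WAIT |
                       pymqi.CMQC.MQGMO_FAIL_IF_QUIESCING)
        gmo.WaitInterval = 5000  # wait up to 5s (milliseconds) for a reply
        return reply_q.get(None, pymqi.MD(), gmo)
    except pymqi.MQMIError as e:
        if e.reason == pymqi.CMQC.MQRC_NO_MSG_AVAILABLE:
            # No reply in time -> propagated upstream as an HTTP 5XX
            raise TimeoutError('MQ reply timed out') from e
        raise
    finally:
        request_q.close()
        reply_q.close()
```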

Auto Recovery

One important observation was that the issue self-recovered after network connectivity stabilized.

Once the tunnel adjacency was restored:

  • MQ channels resumed normal operation

  • API error rates dropped

  • login services recovered automatically

No manual restart or failover activity was required.
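
That auto-recovery is a property of client behavior as much as of the network: consumers that reconnect with backoff simply resume once the path heals. A minimal sketch of that pattern, with placeholder connection details, is below.

```python
# Sketch: reconnect with exponential backoff so MQ clients recover on
# their own after transient network failures. Names are placeholders.
import time
import pymqi

def connect_with_backoff(max_attempts: int = 6) -> pymqi.QueueManager:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            return pymqi.connect('QM.GCP', 'APP.SVRCONN',
                                 'mq-gcp.example.internal(1414)')
        except pymqi.MQMIError:
            # e.g. reason 2009 (MQRC_CONNECTION_BROKEN) during a tunnel flap
            if attempt == max_attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 30.0)  # cap the backoff at 30s
    raise RuntimeError('unreachable')
```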

Operational Learnings

This incident highlighted several important operational lessons.

1. Network Instability Can Cascade Quickly

A short-lived tunnel flap created widespread application-level failures across multiple services.

2. Strong Observability is Critical

Dashboards, logs, alerts, and tracing significantly reduced troubleshooting time.

3. Cross-Cloud Dependencies Need Careful Monitoring

Applications running across:

  • Azure

  • GCP

  • external MQ systems

must include robust network observability and redundancy planning.

4. MQ Health Monitoring is Essential

Real-time monitoring of:

  • MQ channels

  • queue depth

  • timeout rates

  • connectivity status

helps identify issues before customer impact becomes severe.
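
For instance, queue depth can be polled directly and exported to dashboards; a growing depth is often the earliest sign that consumers on the far side of a link have gone quiet. A hedged pymqi sketch, with placeholder names, follows.

```python
# Sketch: polling current queue depth for dashboards/alerts.
# Queue manager and queue names are placeholders.
import pymqi

qmgr = pymqi.connect('QM.AZURE', 'ADMIN.SVRCONN',
                     'mq-azure.example.internal(1414)')
queue = pymqi.Queue(qmgr, 'AUTH.REQUEST', pymqi.CMQC.MQOO_INQUIRE)

depth = queue.inquire(pymqi.CMQC.MQIA_CURRENT_Q_DEPTH)
print('AUTH.REQUEST depth:', depth)  # alert if this keeps climbing

queue.close()
qmgr.disconnect()
```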

Preventive Actions

Following the incident:

  • the Network Team completed a formal root cause analysis (RCA)

  • additional monitoring validations were added

  • MQ timeout alerting thresholds were reviewed

  • tunnel stability monitoring was improved

Final Thoughts

Transient network failures in distributed cloud environments can create large-scale application disruptions even when infrastructure itself remains healthy.

In this incident, a temporary network tunnel flap caused MQ timeout failures, which in turn impacted customer login and backend services across Azure and GCP datacenters.

Strong observability, rapid alerting, and coordinated troubleshooting between platform, application, and network teams were critical in quickly identifying and resolving the issue.