Troubleshooting Message Queue (MQ) Timeout Issues Across Azure and GCP Datacenters

A real-world production incident involving Message Queue (MQ) timeout issues across Azure and GCP datacenters that partially impacted customer login services, backend APIs, and distributed microservices communication.

Sowmya Narayan

5/9/2026 · 3 min read


Introduction

Distributed systems running across multiple cloud datacenters can occasionally face transient network failures that are difficult to predict and troubleshoot.

We recently encountered a production incident involving intermittent MQ timeout issues across both Azure and GCP datacenters, which partially impacted customer login services and backend operations on our digital platform.

This issue affected:

  • customer login functionality

  • mobile applications

  • profile updates

  • transaction lookup services

  • CSR operations

In this blog, I’ll walk through:

  • how the issue was detected

  • monitoring and observability used during investigation

  • the actual root cause

  • impact analysis

  • lessons learned from the incident

Alert Detection

The issue was initially identified through monitoring alerts triggered by elevated 5XX failure rates across microservices.

We observed:

  • increased HTTP 5XX errors

  • API timeout failures

  • elevated MQ timeout errors

  • customer login failures

All of these alerts fired automatically through our monitoring systems.
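
For illustration, the core of such an alert is just a ratio check over a sliding window. The sketch below is a minimal, hypothetical version in Python; the metric counts and the 5% threshold are assumptions for the example, not our actual alerting configuration.

```python
# Minimal sketch of a 5XX error-rate check; counts, threshold, and the
# one-minute window are illustrative assumptions, not production config.

def error_rate(status_counts: dict[int, int]) -> float:
    """Fraction of requests in the window that returned a 5XX status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for code, n in status_counts.items() if 500 <= code < 600)
    return errors / total

ALERT_THRESHOLD = 0.05  # page the on-call when >5% of requests fail

window = {200: 1800, 204: 120, 502: 40, 503: 95}  # last minute of traffic
if error_rate(window) > ALERT_THRESHOLD:
    print(f"ALERT: 5XX failure rate at {error_rate(window):.1%}")
```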

Monitoring Dashboard Observations

Our Grafana MQ dashboards immediately showed abnormal channel behavior across both Azure and GCP datacenters.

The dashboards revealed:

  • MQ timeout spikes

  • channel instability

  • intermittent connectivity disruptions

  • sudden drops in message flow
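
Channel health like this can also be polled directly from the queue manager, not just observed in Grafana. Below is a hedged sketch using the pymqi client, assuming IBM MQ; the queue manager, channel, and host names are placeholders, not our real topology.

```python
# Sketch: querying live channel status over PCF with pymqi (IBM MQ).
# Connection details below are placeholders.
import pymqi

qmgr = pymqi.connect('QM.AZURE', 'ADMIN.SVRCONN',
                     'mq-azure.example.internal(1414)')
pcf = pymqi.PCFExecute(qmgr)

# Ask the queue manager for the status of every channel.
for ch in pcf.MQCMD_INQUIRE_CHANNEL_STATUS(
        {pymqi.CMQCFC.MQCACH_CHANNEL_NAME: b'*'}):
    name = ch[pymqi.CMQCFC.MQCACH_CHANNEL_NAME].strip()
    status = ch[pymqi.CMQCFC.MQIACH_CHANNEL_STATUS]
    # MQCHS_RUNNING is healthy; anything else belongs on a dashboard.
    healthy = status == pymqi.CMQCFC.MQCHS_RUNNING
    print(name, 'RUNNING' if healthy else f'status={status}')

qmgr.disconnect()
```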

Customer Impact

Between approximately 2:53 AM and 3:06 AM EST, customers experienced intermittent failures across multiple services.

Observed Impact

  • Customer login to the secure site and mobile application was partially impacted

  • Approximately 557 unique customers faced login failures

  • Users already logged in experienced intermittent issues while:

    • updating profiles

    • viewing statements

    • checking transactions

    • accessing notifications

CSR agents also observed:

  • OOPS errors

  • customer lookup failures

  • intermittent backend processing issues

Application Error Investigation

During investigation, we analyzed logs in the Elastic Stack and observed a sharp increase in authentication and timeout-related failures.

The logs confirmed:

  • timeout-related API failures

  • MQ communication delays

  • intermittent authentication processing failures
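
As a rough illustration, a triage query like the one below narrows the window to recent errors and buckets them by service. This is a hypothetical sketch with the official Python Elasticsearch client; the index pattern and field names are assumptions, not our real mappings.

```python
# Illustrative Elastic triage query; index pattern and field names
# are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch('http://elastic.example.internal:9200')

resp = es.search(
    index='app-logs-*',
    size=0,
    query={'bool': {
        'must': [{'range': {'@timestamp': {'gte': 'now-30m'}}},
                 {'match': {'log.level': 'ERROR'}}],
        'should': [{'match_phrase': {'message': 'MQ timeout'}},
                   {'match_phrase': {'message': 'authentication failure'}}],
        'minimum_should_match': 1,
    }},
    # Count errors per service to see which dependencies are loudest.
    aggs={'per_service': {'terms': {'field': 'service.name', 'size': 10}}},
)
for bucket in resp['aggregations']['per_service']['buckets']:
    print(bucket['key'], bucket['doc_count'])
```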

Failure Rate Analysis

Detailed request tracing showed a sudden spike in API failure rates during the impacted time window.

The failure rate temporarily crossed the 50% mark, and request timeouts spiked past their alerting thresholds.

This directly correlated with the MQ connectivity disruption.
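
To make the correlation concrete, bucketing traced requests per minute shows exactly when the failure rate crossed 50%. The sample records below are illustrative stand-ins for trace output, purely to show the shape of the analysis.

```python
# Sketch: per-minute failure-rate buckets from traced requests.
# The sample records are illustrative, not real trace data.
from collections import defaultdict
from datetime import datetime

traced = [
    {'ts': '2025-10-27T02:53:40', 'status': 200},
    {'ts': '2025-10-27T02:54:10', 'status': 504},
    {'ts': '2025-10-27T02:54:12', 'status': 503},
    {'ts': '2025-10-27T02:55:01', 'status': 200},
]

buckets: dict[str, list[int]] = defaultdict(list)
for r in traced:
    minute = datetime.fromisoformat(r['ts']).strftime('%H:%M')
    buckets[minute].append(r['status'])

for minute, statuses in sorted(buckets.items()):
    rate = sum(1 for s in statuses if s >= 500) / len(statuses)
    flag = '  <-- above 50%' if rate > 0.5 else ''
    print(f'{minute}  {rate:.0%} of {len(statuses)} requests failed{flag}')
```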

Root Cause Investigation

Further investigation with the Network Team identified the actual root cause.

The network tunnel between datacenters experienced a brief flap and temporary outage.

Network logs showed:

Oct 27 02:54:04:
EIGRP Neighbor Tunnel500 is down: holding time expired

Oct 27 02:56:11:
EIGRP Neighbor Tunnel500 is up: new adjacency

In EIGRP terms, “holding time expired” means the router stopped receiving hello packets from its tunnel neighbor within the hold timer, so it tore down the adjacency; roughly two minutes later the tunnel recovered and the adjacency re-formed.

This temporary tunnel instability caused:

  • MQ communication interruptions

  • timeout failures

  • transient connectivity issues between Azure/GCP MQ and TSYS MQ

Why the Issue Impacted Multiple Services

The impacted microservices relied heavily on MQ communication for:

  • authentication workflows

  • transaction processing

  • profile management

  • backend service integration

When MQ connectivity became unstable:

  • requests started timing out

  • APIs returned 5XX errors

  • downstream services failed intermittently

Since the issue affected both Azure and GCP datacenters simultaneously, the impact spread across multiple dependent services.
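
To see why a tunnel flap surfaces as HTTP 5XX, consider the request/reply pattern these flows use over MQ. The sketch below assumes IBM MQ via pymqi; the queue names, the 5-second wait, and the error mapping are illustrative, not our actual service code.

```python
# Hedged sketch of request/reply over MQ; queue names and the timeout
# are illustrative. When the tunnel flapped, the reply never arrived
# within the wait interval, and callers saw this as a 5XX response.
import pymqi

def lookup_over_mq(qmgr: pymqi.QueueManager, payload: bytes) -> bytes:
    request_q = pymqi.Queue(qmgr, 'AUTH.REQUEST')
    reply_q = pymqi.Queue(qmgr, 'AUTH.REPLY')
    try:
        request_q.put(payload)
        gmo = pymqi.GMO()
        gmo.Options = (pymqi.CMQC.MQGMO_WAIT |
                       pymqi.CMQC.MQGMO_FAIL_IF_QUIESCING)
        gmo.WaitInterval = 5000  # wait up to 5s (milliseconds) for a reply
        return reply_q.get(None, pymqi.MD(), gmo)
    except pymqi.MQMIError as e:
        if e.reason == pymqi.CMQC.MQRC_NO_MSG_AVAILABLE:
            # No reply in time -> propagated upstream as an HTTP 5XX
            raise TimeoutError('MQ reply timed out') from e
        raise
    finally:
        request_q.close()
        reply_q.close()
```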

Auto Recovery

One important observation was that the issue self-recovered after network connectivity stabilized.

Once the tunnel adjacency was restored:

  • MQ channels resumed normal operation

  • API error rates dropped

  • login services recovered automatically

No manual restart or failover activity was required.
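
That auto-recovery is a property of client behavior as much as of the network: consumers that reconnect with backoff simply resume once the path heals. A minimal sketch of that pattern, with placeholder connection details, is below.

```python
# Sketch: reconnect with exponential backoff so MQ clients recover on
# their own after transient network failures. Names are placeholders.
import time
import pymqi

def connect_with_backoff(max_attempts: int = 6) -> pymqi.QueueManager:
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        try:
            return pymqi.connect('QM.GCP', 'APP.SVRCONN',
                                 'mq-gcp.example.internal(1414)')
        except pymqi.MQMIError:
            # e.g. reason 2009 (MQRC_CONNECTION_BROKEN) during a tunnel flap
            if attempt == max_attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, 30.0)  # cap the backoff at 30s
    raise RuntimeError('unreachable')
```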

Operational Learnings

This incident highlighted several important operational lessons.

1. Network Instability Can Cascade Quickly

A short-lived tunnel flap created widespread application-level failures across multiple services.

2. Strong Observability is Critical

Dashboards, logs, alerts, and tracing significantly reduced troubleshooting time.

3. Cross-Cloud Dependencies Need Careful Monitoring

Applications running across:

  • Azure

  • GCP

  • external MQ systems

must include robust network observability and redundancy planning.

4. MQ Health Monitoring is Essential

Real-time monitoring of:

  • MQ channels

  • queue depth

  • timeout rates

  • connectivity status

helps identify issues before customer impact becomes severe.
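
For instance, queue depth can be polled directly and exported to dashboards; a growing depth is often the earliest sign that consumers on the far side of a link have gone quiet. A hedged pymqi sketch, with placeholder names, follows.

```python
# Sketch: polling current queue depth for dashboards/alerts.
# Queue manager and queue names are placeholders.
import pymqi

qmgr = pymqi.connect('QM.AZURE', 'ADMIN.SVRCONN',
                     'mq-azure.example.internal(1414)')
queue = pymqi.Queue(qmgr, 'AUTH.REQUEST', pymqi.CMQC.MQOO_INQUIRE)

depth = queue.inquire(pymqi.CMQC.MQIA_CURRENT_Q_DEPTH)
print('AUTH.REQUEST depth:', depth)  # alert if this keeps climbing

queue.close()
qmgr.disconnect()
```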

Preventive Actions

Following the incident:

  • the Network Team completed a formal root cause analysis (RCA)

  • additional monitoring validations were added

  • MQ timeout alerting thresholds were reviewed

  • tunnel stability monitoring was improved

Final Thoughts

Transient network failures in distributed cloud environments can create large-scale application disruptions even when infrastructure itself remains healthy.

In this incident, a temporary network tunnel flap caused MQ timeout failures, which in turn impacted customer login and backend services across Azure and GCP datacenters.

Strong observability, rapid alerting, and coordinated troubleshooting between platform, application, and network teams were critical in quickly identifying and resolving the issue.