A real-world production incident involving JDBC connectivity failures and Hikari connection pool timeouts impacting authentication, notifications, and payment-related microservices across multi-cloud Kubernetes environments.

Sowmya Narayan

5/9/2026 · 3 min read

Introduction

In large-scale cloud-native environments, even temporary database connectivity disruptions can quickly impact multiple customer-facing services.

We recently encountered a production incident where several microservices across both GCP and Azure environments experienced JDBC connectivity failures, leading to:

  • Hikari connection pool timeouts

  • authentication failures

  • notification delivery issues

  • intermittent login disruptions

In this blog, I’ll walk through:

  • how the issue was detected

  • impacted services

  • troubleshooting observations

  • investigation findings

  • lessons learned from handling database connectivity issues in distributed systems

The Incident

Multiple microservices across both GCP and Azure environments began experiencing JDBC connectivity failures, resulting in Hikari connection pool timeouts and intermittent service degradation.

Application logs started showing:

  • JDBC connection errors

  • Hikari connection pool timeout exceptions

  • intermittent database connectivity failures

Alert Detection

The issue was identified through:

  • Grafana Hikari pool alerts

  • JDBC error logs

  • application monitoring dashboards

  • Slack incident notifications

  • PagerDuty alerts

The monitoring dashboards revealed:

  • connection pool exhaustion

  • database timeout spikes

  • intermittent service instability

Impacted Services

Several important microservices were affected during the incident.

Authentication Services

Authentication systems experienced intermittent failures, impacting customer login functionality.

Notification Services

Notification systems responsible for:

  • emails

  • SMS

  • alerts

experienced intermittent delivery failures.

Customer Relationship Services

Services managing customer-to-account relationships and card mapping became unstable due to database connectivity interruptions.

Payment and Transfer Services

Payment-related services responsible for transaction routing and real-time updates also experienced temporary degradation.

Customer Impact

During the outage:

  • some customers were unable to log in to mobile and web applications

  • intermittent authentication failures were observed

  • notification delivery delays occurred

The outage caused approximately:

  • 30% of login attempts to fail for a short duration

  • 1% of overall users to be affected during the incident window

Architecture Overview

The platform architecture relied on multiple microservices communicating with PostgreSQL databases through JDBC connection pools.

When database connectivity became unstable:

  • Hikari pools exhausted available connections

  • requests started timing out

  • downstream services became unstable
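
This setup can be sketched with a minimal, hypothetical example; the class name, JDBC URL, credentials, and pool settings below are placeholders rather than the actual service configuration:

  import com.zaxxer.hikari.HikariConfig;
  import com.zaxxer.hikari.HikariDataSource;

  public class AccountsDataSource {

      // Builds the JDBC connection pool this service uses for all database access.
      public static HikariDataSource create() {
          HikariConfig config = new HikariConfig();
          // Placeholder connection details; each service points at its own managed PostgreSQL instance.
          config.setJdbcUrl("jdbc:postgresql://db.internal:5432/accounts");
          config.setUsername("accounts_svc");
          config.setPassword(System.getenv("DB_PASSWORD"));
          // A bounded pool shared by every request thread in the service.
          config.setMaximumPoolSize(20);
          // How long a caller may wait for a free connection before the request fails.
          config.setConnectionTimeout(30_000);
          return new HikariDataSource(config);
      }
  }

Because every request thread borrows a connection from this shared pool, a database-side disruption tends to surface first as pool exhaustion rather than as an obvious infrastructure failure.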

Investigation Findings

During troubleshooting, the operations team observed:

  • JDBC timeout exceptions

  • Hikari connection pool exhaustion

  • intermittent PostgreSQL connectivity failures

  • service instability across both cloud environments

The affected microservices gradually stabilized after the impacted application pods were restarted.

Hikari Connection Pool Timeouts

Application logs showed multiple Hikari connection pool timeout errors.

This indicated that:

  • database connections were either unavailable or delayed

  • connection pools could not obtain healthy connections

  • application threads started waiting for database access
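
This behavior follows from how HikariCP hands out connections: getConnection() blocks until a pooled connection is available, up to connectionTimeout, and then fails with a SQLTransientConnectionException. The simplified, hypothetical request path below (the table and query are illustrative only) shows where that timeout surfaces in application code:

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import java.sql.SQLTransientConnectionException;
  import javax.sql.DataSource;

  public class SessionLookup {

      private final DataSource dataSource;

      public SessionLookup(DataSource dataSource) {
          this.dataSource = dataSource;
      }

      public boolean isSessionActive(String sessionId) throws Exception {
          // Blocks until a pooled connection is free, up to connectionTimeout;
          // if the pool stays exhausted, HikariCP throws instead of returning.
          try (Connection conn = dataSource.getConnection();
               PreparedStatement ps = conn.prepareStatement(
                       "SELECT 1 FROM sessions WHERE id = ? AND active = true")) {
              ps.setString(1, sessionId);
              try (ResultSet rs = ps.executeQuery()) {
                  return rs.next();
              }
          } catch (SQLTransientConnectionException timedOut) {
              // During the incident, calls landed here: the thread had waited the full
              // connectionTimeout without obtaining a healthy connection.
              throw new IllegalStateException("Timed out waiting for a database connection", timedOut);
          }
      }
  }

Because each waiting caller holds an application thread, these timeouts back up request handling and reach users as the intermittent login and notification failures described earlier.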

Resolution

The operations team restarted all impacted microservices.

After the restart:

  • JDBC errors stopped

  • Hikari pools recovered

  • service stability returned

  • customer login functionality normalized

No additional customer impact was observed afterward.

Root Cause Investigation

The investigation identified that several microservices temporarily lost database connectivity during the incident.

A support case was raised with the cloud provider to investigate potential managed PostgreSQL infrastructure issues.

However, because the support case was opened after the telemetry retention period had expired, the cloud provider was unable to perform a detailed root cause analysis.

The final RCA remained inconclusive.

Why This Incident Was Challenging

This incident was difficult because:

  • the issue auto-recovered

  • logs were no longer available from the cloud provider

  • no permanent infrastructure failure was visible afterward

  • the issue affected multiple services across multiple cloud environments simultaneously

This is a common challenge in distributed cloud systems where transient infrastructure or networking issues disappear before full diagnostics can be captured.

Key Learnings

This incident highlighted several important operational lessons.

1. JDBC Connectivity Issues Can Cascade Quickly

Temporary database connectivity interruptions can rapidly impact:

  • authentication

  • notifications

  • payments

  • customer-facing APIs

2. Hikari Pool Monitoring Is Extremely Valuable

Connection pool monitoring provided early visibility into database degradation before complete service failure occurred.
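
HikariCP exposes live pool state through its HikariPoolMXBean, and if the services run on Spring Boot, its Micrometer integration publishes similar gauges (for example, hikaricp.connections.active and hikaricp.connections.pending) that Grafana alerts are commonly built on. The sketch below shows the kind of saturation check involved; the threshold is chosen purely for illustration:

  import com.zaxxer.hikari.HikariDataSource;
  import com.zaxxer.hikari.HikariPoolMXBean;

  public class PoolSaturationCheck {

      // Illustrative threshold: flag the pool once several callers are queued for a connection.
      private static final int PENDING_THREADS_ALERT_THRESHOLD = 5;

      public static boolean isPoolSaturated(HikariDataSource dataSource) {
          HikariPoolMXBean pool = dataSource.getHikariPoolMXBean();
          int active = pool.getActiveConnections();
          int total = pool.getTotalConnections();
          int waiting = pool.getThreadsAwaitingConnection();
          // A pool running at its limit with callers queued behind it is the
          // "connection pool exhaustion" pattern the dashboards surfaced.
          return active == total && waiting >= PENDING_THREADS_ALERT_THRESHOLD;
      }
  }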

3. Fast Incident Escalation Matters

Cloud telemetry retention windows are limited.

Delays in opening support cases can make root cause analysis difficult or impossible.

4. Multi-Cloud Dependencies Increase Complexity

When applications span multiple cloud providers:

  • troubleshooting becomes harder

  • observability becomes critical

  • correlation across environments is essential

5. Restarting Services May Temporarily Restore Stability

Restarting affected pods helped refresh:

  • stale JDBC connections

  • exhausted pools

  • unhealthy application states

However, restarts should not replace proper RCA investigations.

Preventive Improvements

Following the incident, the team reviewed:

  • JDBC retry handling

  • Hikari timeout configurations

  • connection pool tuning

  • monitoring improvements

  • faster vendor escalation processes
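
As an illustration of the kind of tuning involved, the sketch below shows HikariCP timeout and lifetime parameters alongside a bounded retry around transient JDBC errors; the values and helper names are hypothetical, not the production settings:

  import com.zaxxer.hikari.HikariConfig;

  import java.sql.SQLTransientException;
  import java.util.concurrent.Callable;

  public class DbResilienceSettings {

      // Hypothetical tuning values, for illustration only.
      public static void applyPoolTuning(HikariConfig config) {
          config.setConnectionTimeout(10_000);  // fail fast instead of queueing callers for the full 30s default
          config.setValidationTimeout(5_000);   // bound how long connection health checks may take
          config.setMaxLifetime(900_000);       // recycle connections before infrastructure-side limits hit them
          config.setKeepaliveTime(300_000);     // detect silently dropped connections sooner (newer HikariCP versions)
          config.setMaximumPoolSize(20);
          config.setMinimumIdle(5);
      }

      // Minimal bounded retry for transient JDBC errors; a production version would
      // add backoff with jitter and circuit breaking. Assumes maxAttempts >= 1.
      public static <T> T withRetry(Callable<T> dbCall, int maxAttempts) throws Exception {
          SQLTransientException lastError = null;
          for (int attempt = 1; attempt <= maxAttempts; attempt++) {
              try {
                  return dbCall.call();
              } catch (SQLTransientException e) {
                  lastError = e;
                  Thread.sleep(200L * attempt); // simple linear backoff between attempts
              }
          }
          throw lastError;
      }
  }

Tighter connection timeouts let services fail fast and recover sooner, while lifetime and keepalive settings help drop connections that the infrastructure has silently invalidated.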

Additional observability improvements were also planned for database connectivity monitoring.

Final Thoughts

Database connectivity failures in distributed cloud-native environments can create widespread application instability even when infrastructure appears healthy.

In this incident, temporary JDBC connectivity disruptions caused Hikari pool exhaustion, impacting authentication, notifications, and customer-facing services across multiple cloud environments.

Strong observability, rapid incident response, and proactive database monitoring remain essential for operating reliable large-scale microservices platforms.