Incident: March 16-17

Data request processing was interrupted due to a deployment-related issue



What Happened?

On Monday, March 16th, 2026, following a system release, customers experienced failures with data request processing. A flaw in the system's out-of-memory handling caused all pending data requests to be marked as failed whenever a single request exceeded memory limits. A secondary issue was then discovered: requests were being marked as queued in our database but were never actually delivered to our processing workers. Both issues were identified and resolved by the morning of Tuesday, March 17th.

We sincerely apologize for the disruption this caused to your workflows and reporting schedules.

Impact Summary

What was experienced:

  • Data requests across all connected platforms failed to process during the affected window
  • Requests that appeared as "queued" in the system were not actually being processed
  • Processing delays accumulated overnight before the issue was discovered the following morning

Timeline

Monday, March 16 (Evening)

A system update was deployed at end of day. Shortly after deployment, the data request processing system began experiencing complete processing failures. When any single request exceeded available memory, all other pending requests in that processing cycle were incorrectly marked as failed.

Monday, March 16 (Late Evening)

A fix was applied to prevent the memory handler from registering multiple times per worker cycle. However, the fix introduced a secondary issue: it caused serialization problems that prevented requests from being properly delivered to processing workers via Redis.

Tuesday, March 17 (Morning)

The secondary issue was discovered when monitoring showed requests were marked as "queued" in the database but were not being picked up by workers. The serialization issue was identified and resolved, restoring normal processing flow. All backlogs cleared and processing returned to normal.

Root Cause

This incident had two contributing factors:

The initial issue was caused by the system's out-of-memory error handler registering itself multiple times: once per worker processing cycle rather than once per worker lifetime. When the 50th request in a cycle exceeded memory limits, all 49 previously successful requests in that cycle were also marked as failed, causing a cascading failure across all data processing.
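The failure mode above can be reduced to a short sketch. This is an illustrative Python model, not NinjaCat's actual code; all class and variable names are hypothetical.

```python
class Worker:
    """Illustrative reduction of the bug: an out-of-memory handler
    registered once per processing cycle instead of once per lifetime."""

    def __init__(self, register_once):
        self.register_once = register_once  # True models the fix
        self.handlers = []                  # callbacks fired on an OOM event
        self.failed = []

    def start_cycle(self, pending):
        # Bug: without the guard, every cycle stacks another handler,
        # each one closing over that cycle's pending requests.
        if self.register_once and self.handlers:
            return
        self.handlers.append(lambda: self.failed.extend(pending))

    def oom(self):
        # One request exceeding memory fires every registered handler.
        for handler in self.handlers:
            handler()

buggy = Worker(register_once=False)
buggy.start_cycle(["r1", "r2"])  # cycle 1
buggy.start_cycle(["r3"])        # cycle 2 stacks a second handler
buggy.oom()
print(buggy.failed)              # ['r1', 'r2', 'r3'] -> cascading failure

fixed = Worker(register_once=True)
fixed.start_cycle(["r1", "r2"])
fixed.start_cycle(["r3"])
fixed.oom()
print(fixed.failed)              # ['r1', 'r2'] -> only one cycle affected
```

With re-registration, a single out-of-memory event fails requests from every cycle that ever registered a handler; the register-once guard confines the damage to one cycle.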

The fix for the first issue introduced a secondary problem: the static properties used to ensure the handler only registered once caused serialization failures when the system attempted to queue requests to Redis. The database recorded these requests as "queued," but Redis never received them, so processing workers had no work to pick up. This issue was not caught in testing because the internal test environment uses in-memory synchronous processing rather than Redis-based queuing.
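The serialization failure mode can be illustrated in Python. The incident's actual stack is not specified here, so the pickle-to-Redis assumption and all names are hypothetical.

```python
import pickle

class DataRequest:
    """Hypothetical job payload that a queue client would serialize
    before pushing to Redis."""

    def __init__(self, request_id):
        self.request_id = request_id
        # Modeling the regression: attaching non-serializable handler
        # state (here, a lambda) to the payload breaks serialization.
        self.on_oom = lambda: None

def enqueue(job):
    """Return True if the job could be serialized for the queue."""
    try:
        pickle.dumps(job)  # what a queue client would transmit to Redis
        return True
    except Exception:
        # The DB row already says "queued", but Redis never gets the job.
        return False

req = DataRequest(1)
print(enqueue(req))    # False: queuing silently fails
del req.on_oom
print(enqueue(req))    # True once the non-serializable state is removed
```

An in-memory synchronous test harness never serializes the job object, which is exactly why this class of bug only surfaces against a real Redis-backed queue.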

Our Response

Our team took immediate action to minimize customer impact:

Immediate Actions (March 16-17)

  • Memory handler fix: Added static properties to ensure the out-of-memory handler registers only once per worker lifetime, preventing cascading failures
  • Serialization fix: Resolved the Redis serialization issue to restore proper message queuing between the database and processing workers
  • Backlog processing: All queued requests were reprocessed and completed once fixes were applied

Monitoring Improvements (Completed March 18)

  • Database-side queue monitoring: Implemented new monitoring that tracks when the database records requests as queued but processing is not actually occurring
  • End-to-end queue validation: Previously, only Redis queue staleness was monitored; both sides of the queuing pipeline are now tracked
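The database-side check can be sketched as a simple set difference; all identifiers below are illustrative assumptions, not the actual monitoring code.

```python
def find_stuck_requests(db_queued_ids, redis_queue_ids, in_flight_ids):
    """Return IDs the database marks 'queued' that are neither waiting
    in the Redis queue nor currently held by a worker -- the exact gap
    this incident exposed."""
    visible = set(redis_queue_ids) | set(in_flight_ids)
    return sorted(set(db_queued_ids) - visible)

stuck = find_stuck_requests(
    db_queued_ids=[101, 102, 103],   # what the database reports
    redis_queue_ids=[101],           # what is actually waiting in Redis
    in_flight_ids=[102],             # what workers are processing now
)
print(stuck)  # [103] would trigger an alert
```

Run periodically (with a grace period for requests enqueued moments ago), this catches the "queued in the database but invisible to workers" state that previously went undetected.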

What We're Doing to Prevent This

We've identified several improvements to prevent similar incidents and improve our response capabilities:

  • Memory handler isolation: Out-of-memory handler now registers once per worker lifetime, preventing cascading request failures
  • Redis serialization fix: Resolved serialization compatibility to ensure reliable message delivery to processing workers
  • Database queue monitoring: New monitoring detects when requests are marked as queued but are not actually being processed
  • 🔄 Worker migration: Migrating API traffic from application servers to dedicated workers to enable isolated, targeted deployments that reduce the blast radius of any single release. Status: In progress (target: April 2026)

Our Commitment to You

We understand how critical timely and accurate data processing is to your business. This incident highlighted the need for additional monitoring and alerting, along with deployment practices that allow for faster detection and response. We're committed to continuing to improve our infrastructure and release processes to provide you with reliable, uninterrupted service.

We're grateful for your patience and continued trust in NinjaCat.

Questions or Concerns?

If you have any questions about this incident or continue to experience any issues with your data connections, please don't hesitate to reach out to our support team. We're here to help.

Contact Support: