Case File: 0x02

Killing the Zombie Process.

A mission-critical system plagued by "silent crashes" and manual reboots. How I re-architected a fragile legacy codebase into a self-healing fortress.

The Crisis

The "Silent Crash" Nightmare

When I took ownership, the system had a fatal flaw. The Control UI (Windows) was tightly coupled to the Real-Time Engine (Linux) via a fragile SSH Tunnel wrapped in unsafe C++ code.

If the Windows app crashed (due to memory corruption in the C++ wrapper), it didn't just close. It severed the link, leaving the Engine running as a "Zombie": still alive, but unreachable and holding the hardware locked.

Legacy Architecture: Windows UI <-> SSH Tunnel (severed) <-> Linux Engine

Result: Engineers had to physically walk to the server room to `kill -9` processes manually.

The Reconstruction

The "Immortal" Daemon

To fix this, I completely removed the SSH dependency. I re-architected the backend as a systemd-managed daemon (background service).

By decoupling the lifecycle, the Engine became independent. It doesn't care whether the UI exists. It runs as a root service managed by the OS, which starts it automatically at boot and shuts it down cleanly.
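The "clean shutdown" half of that can be sketched roughly as follows. systemd stops a service by sending it SIGTERM by default, so a daemon that wants to exit cleanly needs to catch that signal and release its resources before returning. This is only an illustration in Python, not the engine's actual code: `EngineDaemon`, `step`, and `release_resources` are hypothetical names invented for the sketch.

```python
import signal
import threading


class EngineDaemon:
    """Hypothetical stand-in for the engine's main loop (illustration only)."""

    def __init__(self):
        self._stop = threading.Event()
        self.cleaned_up = False

    def request_stop(self, signum=None, frame=None):
        # Keep the handler tiny: set a flag and let the main loop do the work.
        self._stop.set()

    def install_signal_handlers(self):
        # Must be called from the main thread.
        # SIGTERM is what systemd sends on stop by default; SIGINT covers Ctrl+C.
        signal.signal(signal.SIGTERM, self.request_stop)
        signal.signal(signal.SIGINT, self.request_stop)

    def step(self):
        # Placeholder for one iteration of real engine work.
        pass

    def release_resources(self):
        # Placeholder: the real service would reset hardware and free
        # shared memory here. The flag only records that cleanup ran.
        self.cleaned_up = True

    def run(self, tick_seconds=0.1):
        try:
            while not self._stop.is_set():
                self.step()
                # Sleep until the next tick, waking early if a stop arrives.
                self._stop.wait(tick_seconds)
        finally:
            # Runs on a normal stop and also if step() raises.
            self.release_resources()
```

A service entry point would create the object, call `install_signal_handlers()`, then `run()`. Putting the cleanup in a `finally` block means the hardware is released whether the loop ends because of a stop request or an unexpected exception.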

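# /etc/systemd/system/hil_engine.service
# The blueprint for immortality

[Unit]
Description=HiRain Real-Time Engine Daemon
# Ordering only: start after the network stack has been brought up
After=network.target

[Service]
# Type=simple treats the ExecStart process itself as the service, so this
# assumes --daemon does not fork into the background (a forking binary
# would need Type=forking instead)
Type=simple
ExecStart=/usr/bin/rt_engine_server --daemon
# Restart whenever the process exits, whatever the cause
# (an explicit systemctl stop is still honored)
Restart=always
User=root

[Install]
# Once the unit is enabled, start it at boot with the multi-user target
WantedBy=multi-user.target

The Safety Net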

The Dead Man's Switch

But what if the UI crashes? The simulation must stop safely. I implemented a Thrift RPC Heartbeat. The UI pings the Daemon every second.

If the Daemon misses 3 consecutive heartbeats (about three seconds of silence), it assumes the UI is dead and triggers an Auto-Cleanup Protocol: it resets the hardware, clears shared memory, and waits for a new connection.
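The watchdog logic behind such a dead man's switch can be sketched in a few lines. This is a minimal illustration in Python, not the production Thrift code: `HeartbeatWatchdog`, `beat`, `poll`, and `on_timeout` are hypothetical names, and "3 missed heartbeats" is interpreted here as three full ping intervals elapsing without a ping. In a real service, the RPC handler for the ping would call `beat()`, and the daemon's main loop would call `poll()` periodically.

```python
import time


class HeartbeatWatchdog:
    """Illustrative dead man's switch; all names here are hypothetical."""

    def __init__(self, interval_s=1.0, max_missed=3,
                 on_timeout=None, clock=time.monotonic):
        self.interval_s = interval_s      # expected time between UI pings
        self.max_missed = max_missed      # missed pings tolerated before cleanup
        self.on_timeout = on_timeout or (lambda: None)
        self._clock = clock               # injectable so tests can fake time
        self._last_beat = None            # None means no UI is connected

    def beat(self):
        """Record a ping from the UI (would be called by the RPC handler)."""
        self._last_beat = self._clock()

    def missed_beats(self):
        """Number of whole ping intervals elapsed since the last ping."""
        if self._last_beat is None:
            return 0
        return int((self._clock() - self._last_beat) // self.interval_s)

    def poll(self):
        """Check the deadline; returns True while a live UI is attached."""
        if self._last_beat is not None and self.missed_beats() >= self.max_missed:
            # UI presumed dead: go back to waiting for a new connection,
            # then run the cleanup callback exactly once for this outage.
            self._last_beat = None
            self.on_timeout()
        return self._last_beat is not None
```

Using a monotonic clock rather than wall-clock time keeps the deadline immune to system clock adjustments, and resetting to the "no UI connected" state before calling the callback ensures the cleanup fires once per outage rather than on every poll.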

Refactored Architecture: Windows UI <-> Thrift RPC <-> Daemon Service

The Legacy

Order from Chaos

The impact was immediate. The manual reboots stopped. The "Zombie" processes vanished. The system became robust enough to run 24/7 automated regression tests without human supervision.

0 SDK-Related Crashes
Recorded from deployment until end of tenure.