A mission-critical system plagued by "silent crashes" and manual reboots. How I re-architected a fragile legacy codebase into a self-healing fortress.
When I took ownership, the system had a fatal flaw. The Control UI (Windows) was tightly coupled to the Real-Time Engine (Linux) via a fragile SSH Tunnel wrapped in unsafe C++ code.
If the Windows app crashed (due to memory corruption in the C++ wrapper), it didn't just close. It severed the link, leaving the Engine running as a "Zombie" — unreachable and locking the hardware.
Result: Engineers had to physically walk to the server room to `kill -9` processes manually.
To fix this, I completely removed the SSH dependency. I re-architected the backend as a proper Linux daemon (background service) managed by systemd.
By decoupling the lifecycle, the Engine became independent. It doesn't care if the UI exists or not. It runs as a root service, managed by the OS, capable of auto-starting and clean shutdowns.
```ini
# /etc/systemd/system/hil_engine.service
# The blueprint for immortality
[Unit]
Description=HiRain Real-Time Engine Daemon
After=network.target

[Service]
Type=simple
ExecStart=/usr/bin/rt_engine_server --daemon
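# Self-healing: systemd relaunches the engine automatically if it ever exits or crashes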
Restart=always
User=root

[Install]
WantedBy=multi-user.target
```

But what if the UI crashes? The simulation must stop safely. I implemented a Thrift RPC Heartbeat: the UI pings the Daemon every second.
If the Daemon misses 3 consecutive heartbeats, it assumes the UI is dead and triggers an Auto-Cleanup Protocol: resetting the hardware, clearing shared memory, and waiting for a new connection.
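Here is a minimal C++ sketch of that daemon-side watchdog, written to show the idea rather than the production code: the names (`HeartbeatWatchdog`, `beat`, `autoCleanup`) are hypothetical, and the Thrift server that would call `beat()` on every UI ping is omitted.

```cpp
// heartbeat_watchdog.cpp -- hypothetical sketch, not the project's production code.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

using Clock = std::chrono::steady_clock;

class HeartbeatWatchdog {
public:
    // Called by the (omitted) Thrift handler each time the UI pings the daemon.
    void beat() {
        last_beat_.store(Clock::now().time_since_epoch().count());
    }

    // Runs in its own thread inside the daemon and never returns.
    void monitor() {
        constexpr auto kInterval  = std::chrono::seconds(1); // UI ping rate
        constexpr int  kMaxMisses = 3;                       // tolerated missed pings
        int misses = 0;
        while (true) {
            std::this_thread::sleep_for(kInterval);
            const auto last = Clock::time_point(Clock::duration(last_beat_.load()));
            // Simplified liveness check; a real build would add slack for jitter.
            if (Clock::now() - last > kInterval) {
                if (++misses >= kMaxMisses) {
                    autoCleanup();   // reset hardware, clear shared memory
                    misses = 0;      // then wait for a new UI connection
                }
            } else {
                misses = 0;          // UI is alive, reset the counter
            }
        }
    }

private:
    void autoCleanup() {
        // Placeholder for the real hardware reset / shared-memory cleanup calls.
        std::puts("UI presumed dead: resetting hardware, clearing shared memory");
    }

    std::atomic<Clock::rep> last_beat_{Clock::now().time_since_epoch().count()};
};

int main() {
    HeartbeatWatchdog wd;
    std::thread watchdog([&] { wd.monitor(); }); // daemon-side watchdog thread
    // In the real daemon, the Thrift server loop would run here and call wd.beat().
    watchdog.join();
}
```

The point of the design: liveness is judged entirely on the daemon side, so a crashed UI can never leave the hardware locked.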
The impact was immediate. The manual reboots stopped. The "Zombie" processes vanished. The system became robust enough to run 24/7 automated regression tests without human supervision.