At times my SSH connection was getting lost (probably due to network issues) with the message “Write failed: Broken pipe”, and this crashed the cluster-based code I was running. Thanks to this Arch Linux forum post, I understood the workaround.
A broken pipe results from a loss of communication between the client and the server. So, if the client sends a control message over the connection whenever n seconds have passed without any data arriving from the server, this problem should be solved. This is exactly what the ServerAliveInterval option does. The smaller the interval, the sooner the control message is sent, so an idle connection does not get dropped.
To achieve this, append the following line to the end of the /etc/ssh/ssh_config file on the client side:
sudo nano /etc/ssh/ssh_config
ServerAliveInterval 5 (append at the end of the file)
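If you would rather not edit the system-wide file, the same option can go in a per-user config file. The following is only a sketch: add_keepalive is a helper name I made up for illustration, and the demo writes to a scratch file rather than your real config. It appends the option under a Host * block only when no ServerAliveInterval line is present yet.

```shell
# Hypothetical helper (not part of OpenSSH): append the keepalive option
# to the given config file unless a ServerAliveInterval line already exists.
add_keepalive() {
    cfg="$1"        # path to an ssh client config file
    interval="$2"   # seconds of silence before a keepalive is sent
    mkdir -p "$(dirname "$cfg")"
    touch "$cfg"
    if ! grep -qi '^[[:space:]]*ServerAliveInterval' "$cfg"; then
        printf 'Host *\n    ServerAliveInterval %s\n' "$interval" >> "$cfg"
    fi
}

# Demo against a scratch file; pass "$HOME/.ssh/config" to apply it for real.
demo_cfg="$(mktemp)"
add_keepalive "$demo_cfg" 5
add_keepalive "$demo_cfg" 5   # second call changes nothing
cat "$demo_cfg"
```

For a one-off session, the option can also be given on the command line as ssh -o ServerAliveInterval=5 followed by the usual user and host.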
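That's it! The ssh client reads ssh_config every time it starts, so the setting applies to every new connection you open. Restarting the ssh daemon is not needed for this client-side change, but if you want to restart it anyway: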
sudo systemctl restart sshd.service (Fedora / OS having systemd)
sudo service ssh restart (Linux Mint / OS having Upstart)
This ensures the client sends a message checking that the server is alive whenever 5 seconds pass without any data from the server, and the cluster-based code can run smoothly!
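One thing worth knowing about a small interval: as far as I understand the ssh_config man page, the client gives up and disconnects once ServerAliveCountMax keepalive messages in a row go unanswered, and that option defaults to 3. A rough calculation for the 5 second interval used above (the variable names are just for illustration):

```shell
interval=5    # ServerAliveInterval from the config above
count_max=3   # ServerAliveCountMax; 3 is the OpenSSH default
# Approximate time an unresponsive server is tolerated before ssh disconnects
echo "ssh disconnects after about $((interval * count_max)) seconds without a reply"
```

So if the network itself drops out for longer than that, raising ServerAliveCountMax (or the interval) may help more than lowering the interval.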