it turned out that the problem below was caused by a firewall that blocked the tcp session if it had been idle for a few minutes. the problem went away when i reduced tcp_connection_lifetime from 3610 to 120 sec.
i don't know if it is possible to configure tcp_connection_lifetime on a per-connection basis. for example, a tcp connection to a UA could have tcp_connection_lifetime=3610, since the tcp session is kept active by the UA sending crlfs, whereas a tcp connection to another proxy could have a shorter tcp_connection_lifetime.
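to make the idea concrete, here is a minimal sketch of the selection logic in python. it is only an illustration: the function name pick_lifetime and the firewall_idle_timeout parameter are hypothetical, and nothing here implies that sr actually offers a per-connection setting. it also assumes that the peer's keepalives arrive more often than the firewall's idle timeout.

```python
# Hypothetical illustration only: this is not an sr API, just the
# per-connection selection logic described above written out.

LONG_LIFETIME = 3610   # seconds, the original tcp_connection_lifetime
SHORT_LIFETIME = 120   # seconds, the value that made the problem go away


def pick_lifetime(peer_sends_keepalives: bool, firewall_idle_timeout: int) -> int:
    """Return a connection lifetime in seconds for one tcp peer.

    firewall_idle_timeout is an assumed, known idle timeout (seconds) of
    the firewall between the two hosts.
    """
    if peer_sends_keepalives:
        # Assumption: the peer (e.g. a UA sending crlfs) generates traffic
        # more often than the firewall's idle timeout, so the session
        # never looks idle to the firewall.
        return LONG_LIFETIME
    # No keepalives: close the connection ourselves before the firewall
    # silently drops its state. Never return less than 1 second.
    return max(1, min(SHORT_LIFETIME, firewall_idle_timeout - 1))
```

for example, pick_lifetime(True, 300) keeps the long lifetime for a UA that sends crlfs, while pick_lifetime(False, 300) falls back to 120 sec for a proxy-to-proxy connection.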
-- juha
-------------------------------------------------------------
i did some more debugging and wireshark shows that the 3.0 sr does not even try to send anything to the 3.1 sr over the tcp connection, although netstat now reports on both hosts that the connection is established. instead, sr 3.0 replies immediately after receiving the invite from the ua:
SIP/2.0 477 Unfortunately error on sending to next hop occurred (477/TM)
there are no related messages in syslog. perhaps the tcp stack on the 3.0 host has not received acks for earlier packets and is just waiting.
The most likely candidates are:
- blacklisted destination (due to some previous error). You could check it with sercmd dst_blacklist.view or dst_blacklist.debug.
- some local firewall rules on the OUTPUT chain running out of memory (but it's strange that you don't get any log messages)