Hello,
We are experiencing an issue when having some blocking code in the dialog:end event route. If the code takes too long to execute then CDR information is not logged.
In our case we have a blocking socket trying to establish a connection and waiting 500ms before failing. When it timeouts, the accounting module will never log any CDR logs for that call. It appears that the callbacks for DLGCB_TERMINATED_CONFIRMED event are never called. That is because the function dlg_terminated_confirmed is never called (this function is registered as a callback to the TMCB_RESPONSE_OUT event from tm module).
After further investigation, the problem seems to be related to a race condition between two processes when dealing with ending a dialog:
From what we saw the dialog:end event route is treated when a BYE request is received. But if that takes too long, and that BYE request will be retransmitted (we are using SIPp to test this, and SIPp will automatically retransmit the message), it will then be treated by a different process.
When this happens, the first process has already modified the dialog state, so the second process will not execute code specific to dialog ending from the dialog module dlg_onroute function. It will not execute the event route, nor the DLGCB_TERMINATED event callbacks, and it will neither register the dlg_terminated_confirmed function to the TMCB_RESPONSE_OUT event. This means that the second process will actually finish processing the BYE faster then the first process. By the time the first process finishes with processing the dialog:end event route, the transaction corresponding to the BYE request is already created, and even though dlg_terminated_confirmed function is eventually registered for the TMCB_RESPONSE_OUT event, it is too late because this event was already treated, and all callbacks registered for it were called. This means that DLGCB_TERMINATED_CONFIRMED event will never be treated, and the registered callbacks for this event (including the accounting module functions) will not get called.
We managed to work around this behavior by decreasing the timeout of the socket so that the BYE can be processed before a retransmit occurs, but maybe there is a different solution to fix this issue. Do you have any suggestions on how to solve this problem?
Thank you, Claudiu Boriga.