Hello,
while adding support for other transports (tcp, tls, ...) to DMQ inter-node communication and trying to configure it for a few testing scenarios, I ran into some design issues/limitations. Two of them seem to prevent usual operations, therefore I want to get more opinions and see what would be the best way to go forward.
1) there is no decoupling between server_address and the server socket. The server socket is kept internally, but it is built from the server_address parameter. The actual problem appears when trying to use an FQDN as notification_address, because it ends up with duplicate peer addresses for the same node.
For example:
- server1 with IP 1.2.3.4 and domain server1.sip.com
- server2 with IP 5.6.7.8 and domain server2.sip.com
On server1:
modparam("dmq", "server_address", "sip:1.2.3.4:5060") modparam("dmq", "notification_address", "sip:server2.sip.com:5060")
On server2:
modparam("dmq", "server_address", "sip:5.6.7.8:5060") modparam("dmq", "notification_address", "sip:server1.sip.com:5060")
Then each node ends up with 4 peer nodes.
On server1:
- sip:1.2.3.4:5060 (local=1)
- sip:server1.sip.com:5060 (local=0)
- sip:5.6.7.8:5060 (local=0)
- sip:server2.sip.com:5060 (local=0)
On server2:
- sip:1.2.3.4:5060 (local=0)
- sip:server1.sip.com:5060 (local=0)
- sip:5.6.7.8:5060 (local=1)
- sip:server2.sip.com:5060 (local=0)
Practically, each server considers its local FQDN to be a remote peer.
There are KDMQ requests sent to itself, but the really problematic issue is that presence replication is broken (as I tested; it could affect other modules as well), because instead of an update a replace happens. The case was a PUBLISH with a body having state open, then after 30 seconds a PUBLISH to refresh using the same ETag and an empty body (as per the spec), but instead of just updating the expires value, it also updated the body to an empty string.
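For reference, the refresh request in this scenario looks roughly like this (a minimal sketch per RFC 3903; presentity, ETag value and expiry are illustrative, other mandatory headers omitted):

PUBLISH sip:alice@sip.com SIP/2.0
Event: presence
SIP-If-Match: a1b2c3d4
Expires: 3600
Content-Length: 0

The initial PUBLISH carried the pidf+xml body with state open and got back SIP-ETag: a1b2c3d4; the refresh has no body, so only the expiration should be extended, not the stored document replaced.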
If notification_address uses the IP address of the other server instead of its FQDN, everything works as expected, with the body being kept on refresh and only the expires value being updated.
The looping/spiralling defeats the purpose of the KDMQ replication action.
Use of an FQDN is needed for TLS transport in order to be able to validate the domain against the attributes in the certificate. Currently it does not work to have server_address with an FQDN (maybe it would work with an advertise address on the listen socket, but that would force its use in SIP headers as well, which is not wanted).
To solve it I would introduce a server_socket modparam; if it is not set, it is computed from server_address as it is now. This keeps backward compatibility.
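Something along these lines (server_socket is the proposed parameter, it does not exist yet; values reuse the example above, with the socket given in the usual proto:addr:port listen syntax):

On server1:
modparam("dmq", "server_address", "sip:server1.sip.com:5061;transport=tls")
modparam("dmq", "server_socket", "tls:1.2.3.4:5061")
modparam("dmq", "notification_address", "sip:server2.sip.com:5061;transport=tls")

That way server_address can carry the FQDN advertised to peers (so the TLS certificate can be validated), while the socket actually used for sending/receiving is pinned explicitly to the IP.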
2) The second issue is a bit related. As there was a need to remove the FQDN and change notification_address to the IP of the other node, I restarted each node, but the FQDNs stayed there, because after the restart of server1, it got the list with FQDNs from server2. Then restarting server2 ended up syncing with server1 and receiving all the addresses again. Practically, the solution was to shut down all the nodes, which is something one likely does not want to do, because the entire platform is down with no active node, and even if it is only for a short time, for cases when data is only in memory (e.g., htable items, or in-memory-only presence) everything is lost.
In other words, there is no way to remove a peer address that still points to an active node but is no longer wanted, because it persists in the other running nodes. It could also be the case of changing the domain used for the notification address.
This issue feeds back into 1), because one node then appears many times (old FQDN and new FQDN), leading to loops/spirals with unwanted/broken side effects.
I haven't thought much about it, but one solution could be an RPC command to remove unwanted addresses from the list of peers. There can still be a race if a sync happens immediately after removing via RPC, before the address is removed from the other node, but then one can check with the dmq.list_nodes RPC and re-run the removal command if that is the case.
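For example, something along these lines (dmq.remove_node is only an illustrative name for the new command; dmq.list_nodes already exists):

kamcmd dmq.list_nodes
kamcmd dmq.remove_node sip:server2.sip.com:5060
kamcmd dmq.list_nodes

If the address reappears because of a sync from another node in between, the removal can simply be run again on the affected nodes until it is gone everywhere.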
Given that Kamailio is in the testing phase for 5.5, I want to see if anyone thinks of other solutions to fix these issues without introducing a new parameter (which, again, would keep backward compatibility) or a new RPC command (which would not break existing behaviour).
Cheers, Daniel