### Description

Kamailio frequently terminates TLS connections to SIP clients if the db_mongodb database driver is used and one MongoDB cluster member becomes unavailable.
If one MongoDB secondary is shut down (or the primary goes down and a promotion takes place), Kamailio terminates the TLS connection to a SIP client (it sends a FIN/ACK) after a short period of time (anywhere from 20s to 180s). If the SIP client reconnects, registration succeeds (so MongoDB access is fine), but Kamailio terminates the TLS connection again shortly after. The same happens if only the connection between Kamailio and a secondary is blocked (the cluster members can still communicate with each other).
I assume that the mongo-c driver's server discovery might be the culprit. mongoc iteratively tries to connect to all servers in the MongoDB URI to update their status (`mongodb://mongodb-cluster2:27017,mongodb-cluster1:27017,mongodb-cluster0:27017/kamailio?authMechanism=MONGODB-X509&replicaSet=rs1&ssl=true&sslcertificateauthorityfile=mongo.ca&sslclientcertificatekeyfile=mongo.certkey`). Nonetheless, mongoc's server discovery should be transparent to the Kamailio client as long as a primary exists (and that is the case here), so there is no reason to terminate the TLS client connections.
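To illustrate that expectation outside of Kamailio, here is a minimal standalone libmongoc sketch. The hosts, database name and replica set name are copied from the URI above; the `ping` probe and the loop are only my illustration, not anything from db_mongodb. The expectation is that the commands keep succeeding while one secondary is stopped, because server discovery runs transparently inside the driver.

```c
/* Minimal libmongoc replica-set client sketch (ping probe is illustrative).
 * Expectation: commands keep succeeding while one secondary is down,
 * because server discovery/monitoring is handled inside the driver. */
#include <mongoc/mongoc.h>
#include <stdio.h>

int main(void)
{
    mongoc_init();

    mongoc_client_t *client = mongoc_client_new(
        "mongodb://mongodb-cluster2:27017,mongodb-cluster1:27017,"
        "mongodb-cluster0:27017/kamailio?replicaSet=rs1&ssl=true");
    if (!client) {
        fprintf(stderr, "failed to parse URI\n");
        return 1;
    }

    bson_t *ping = BCON_NEW("ping", BCON_INT32(1));
    bson_error_t error;

    /* Issue the command a few times; stopping one secondary in between
     * should not make these calls fail as long as a primary exists. */
    for (int i = 0; i < 5; i++) {
        bson_t reply;
        if (!mongoc_client_command_simple(client, "kamailio", ping,
                                          NULL /* default read prefs */,
                                          &reply, &error)) {
            fprintf(stderr, "ping %d failed: %s\n", i, error.message);
        } else {
            printf("ping %d ok\n", i);
        }
        bson_destroy(&reply);
    }

    bson_destroy(ping);
    mongoc_client_destroy(client);
    mongoc_cleanup();
    return 0;
}
```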
### Troubleshooting
#### Reproduction
1. Set up a MongoDB cluster with 3 members.
2. Set up Kamailio with TLS, using db_mongodb via TLS. Note: the MongoDB URI must contain all three members, e.g. `mongodb://mongodb-cluster2:27017,mongodb-cluster1:27017,mongodb-cluster0:27017/kamailio?ssl=true`.
3. Connect/register a SIP client via TLS.
4. Stop one member.
5. The client will receive a read error on its TLS connection.
#### Log Messages
The db_mongodb Kamailio logs do not reveal any errors; operations look normal. The only thing shown in the logs is a TLS read error, as seen below.
```
Jul 23 14:22:09 bda3e8fce481 /usr/local/sbin/kamailio[318]: ERROR: tls [tls_util.h:42]: tls_err_ret(): TLS read:error:140E0197:SSL routines:SSL_shutdown:shutdown while in init
Jul 23 14:22:09 bda3e8fce481 /usr/local/sbin/kamailio[318]: ERROR: <core> [core/tcp_read.c:1485]: tcp_read_req(): ERROR: tcp_read_req: error reading - c: 0x7fa9b6343210 r: 0x7fa9b6343290
```
A Wireshark trace reveals that Kamailio closes the TCP connection (FIN/ACK). The client receives a "stream truncated" error.
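For reference, the numeric part of the first log line can be decoded with OpenSSL's own error helpers. A small standalone sketch (the hard-coded value is simply copied from the log above; nothing else is taken from the Kamailio sources): it confirms what the text already says, i.e. `SSL_shutdown()` was invoked on a session that had not finished its handshake.

```c
/* Decode the OpenSSL error code from the Kamailio log line above (0x140E0197). */
#include <openssl/err.h>
#include <openssl/ssl.h>
#include <stdio.h>

int main(void)
{
    unsigned long e = 0x140E0197UL;   /* value copied from the log */
    char buf[256];

    SSL_load_error_strings();         /* needed on OpenSSL 1.0.x; implicit on 1.1+ */

    ERR_error_string_n(e, buf, sizeof(buf));
    printf("%s\n", buf);              /* "...SSL routines:SSL_shutdown:shutdown while in init" */
    printf("lib=%d reason=%d\n", ERR_GET_LIB(e), ERR_GET_REASON(e));
    return 0;
}
```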
### Additional Information
* **Kamailio Version** - output of `kamailio -v`
```
version: kamailio 5.1.4 (x86_64/linux)
flags: STATS: Off, USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MEM, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLACKLIST, HAVE_RESOLV_RES
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown
compiled on 13:39:15 Jul 23 2018 with gcc 6.3.0
```
* **mongoc-version**
```
mongoc-1.11.0 (built with cmake ../ -DENABLE_AUTOMATIC_INIT_AND_CLEANUP=OFF)
```
* **Operating System**:
```
Debian 9.5 docker image running on CentOS 7 (kernel 3.10.0-514.6.1.el7.x86_64)
```
What are the values for global parameters that start with `tcp_` in your kamailio.cfg?
Set `debug=3` inside kamailio.cfg, reproduce this scenario and attach all debug messages printed by Kamailio (there should be more than the two ERROR messages). It will help to see what was done.
> What are the values for global parameters that start with `tcp_` in your kamailio.cfg?
```
tcp_connection_lifetime=3605
tcp_rd_buf_size=16384
tcp_conn_wq_max=65536
tcp_accept_no_cl=yes
```
This log should show 5 disconnects. Access to one MongoDB cluster member (a secondary) was blocked right after the start and then unblocked again, with the log continuing for some more minutes.
[kamailio_bug.log](https://github.com/kamailio/kamailio/files/2224117/kamailio_bug.log)
At a quick glance, the logs show that the TLS context is reset by the mongo reconnect. I need to dig more into it.
Fine. Tell me if I should provide more logs, or maybe you could point me to where to look. It looks as if there is a side effect between mongoc's reconnect and the Kamailio TLS context, right?
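One way such a side effect could show up (purely a hypothesis on my side, not a confirmed diagnosis of this bug): libmongoc and Kamailio's tls module share the same OpenSSL library state inside a worker process, and OpenSSL keeps error codes on a per-thread queue. If a failing mongoc reconnect leaves an entry on that queue and does not drain it, a later `SSL_read()`/`SSL_shutdown()` on an unrelated SIP connection in the same (single-threaded) worker could pick it up and be reported as a TLS failure. A minimal sketch of that pattern, with the usual defensive `ERR_clear_error()`:

```c
/* Hypothetical illustration only: a stale entry on OpenSSL's per-thread error
 * queue, left behind by one library, being misattributed to a later TLS
 * operation in the same process. Not the confirmed root cause of this bug. */
#include <openssl/err.h>
#include <openssl/ssl.h>
#include <stdio.h>

/* Stand-in for a failing reconnect that pushes an SSL error and returns
 * without draining the error queue. */
static void simulate_reconnect_failure(void)
{
    ERR_put_error(ERR_LIB_SSL, 0, SSL_R_SHUTDOWN_WHILE_IN_INIT, __FILE__, __LINE__);
}

int main(void)
{
    simulate_reconnect_failure();

    /* Later, unrelated TLS code inspects the error queue (as the tls_err_ret()
     * in the log above appears to do) and finds the stale entry. */
    unsigned long e = ERR_get_error();
    if (e != 0) {
        char buf[256];
        ERR_error_string_n(e, buf, sizeof(buf));
        printf("stale error attributed to SIP TLS connection: %s\n", buf);
    }

    /* Defensive pattern: clear the queue before doing SSL I/O so only errors
     * from the current operation are reported. */
    ERR_clear_error();
    printf("queue after ERR_clear_error(): %lu\n", ERR_get_error());
    return 0;
}
```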
I tried to dig a bit in the code, but couldn't find an obvious reason without having a way to test and troubleshoot. Can you give me access to such a system (I only need the Kamailio server) or provide a Docker-based configuration so I can build a test bed on my computer?
Everything needed to reproduce the bug should be in here.
[demo.tar.gz](https://github.com/kamailio/kamailio/files/2382921/demo.tar.gz)
Any update on this? Anything missing I should provide?
This issue is stale because it has been open for 6 weeks with no activity. Remove the stale label or comment, or this will be closed in 2 weeks.
Closed #1599 as not planned.