in order to avoid repeatedly trying a dead cluster connection, how about the following:
- define via a cluster module param the default order in which connections with the same priority are tried
- if the first connection in the list fails, move that connection to the end of the list
the advantage would be that there is no need to periodically (by counting or timer) re-try a dead connection.
the default order could be taken from the order in which connections having the same priority are listed in the cluster param value.
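a sketch of the move-to-back idea (just an illustration; the struct and function names are made up, not the actual db_cluster code):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NCON 3

/* hypothetical connection record -- illustrative, not the real
 * db_cluster structures */
typedef struct con {
    const char *url;
    int dead; /* set when a query on this connection fails */
} con_t;

/* try connections in list order; when the head is dead, move it to
 * the end of the list so it is tried last next time */
static con_t *pick_connection(con_t **list, int n)
{
    int tried;
    con_t *failed;

    for (tried = 0; tried < n; tried++) {
        if (!list[0]->dead)
            return list[0];
        /* head is dead: shift the rest up and park it at the back */
        failed = list[0];
        memmove(&list[0], &list[1], (n - 1) * sizeof(con_t *));
        list[n - 1] = failed;
    }
    return NULL; /* every connection in the cluster is down */
}
```

after the first miss the dead connection sits at the end, so subsequent calls hit the working head directly.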
comments?
-- juha
Hello,
On 4/4/12 7:15 PM, Juha Heinanen wrote:
> in order to avoid repeatedly trying a dead cluster connection, how about the following:
> - define via a cluster module param the default order in which connections with the same priority are tried
> - if the first connection in the list fails, move that connection to the end of the list
> the advantage would be that there is no need to periodically (by counting or timer) re-try a dead connection.
> the default order could be taken from the order in which connections having the same priority are listed in the cluster param value.
> comments?
reordering does not help with round robin operations and parallel write.
The list of connections is cloned by each process in private memory (as there needs to be one db connection per process). A connection can be part of many clusters.
For each connection there is a pointer to shared memory where the state (active/inactive) is stored, with the goal that when an operation fails in one process, all the others will consider the connection inactive until it is reactivated. While the shared pointer is there, the logic itself is not implemented yet.
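A minimal sketch of that design (illustrative names, not the real module structures): each clone is private to its process, only the state word is shared, so one process marking a connection inactive is visible to all:

```c
#include <assert.h>
#include <stddef.h>

enum { DBCL_CON_ACTIVE = 0, DBCL_CON_INACTIVE = 1 };

/* in the real module this word would live in shared memory */
typedef struct shared_state {
    int state;
} shared_state_t;

/* per-process clone: the db handle is private, the state pointer
 * refers to the one shared word for this connection */
typedef struct con_clone {
    void *dbh;              /* per-process db handle (private) */
    shared_state_t *sstate; /* pointer into shared memory */
} con_clone_t;

/* called by the process that saw the operation fail */
static void mark_inactive(con_clone_t *c)
{
    c->sstate->state = DBCL_CON_INACTIVE;
}

/* called by any process before using the connection */
static int is_usable(const con_clone_t *c)
{
    return c->sstate->state == DBCL_CON_ACTIVE;
}
```

The point of the shared word is exactly this: a failure observed by one worker immediately steers every other worker away from the same connection.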
Cheers, Daniel
Daniel-Constantin Mierla writes:
> reordering does not help with round robin operations and parallel write.
i know, but it would solve the problem in the common case of a master-master replicated mysql cluster, where it does not matter which of the two mysql servers is used.
> The list of connections is cloned by each process in private memory (as there needs to be one db connection per process). A connection can be part of many clusters.
in that case each process could reorder the list on its own. when the connection at the top of the list does not respond, it would be moved to the end of the list. that means one miss per process, which would be ok with me.
> For each connection there is a pointer to shared memory where the state (active/inactive) is stored, with the goal that when an operation fails in one process, all the others will consider the connection inactive until it is reactivated. While the shared pointer is there, the logic itself is not implemented yet.
that solution would be ok too, but it would require a timer process to do the checking and reactivation.
-- juha
daniel,
the patch below upgrades a working connection if a read operation on another connection with the same priority has failed, i.e., next time the working connection will be tried before the failing one.
is the patch ok with you? if so, the same could be done for write operations.
-- juha
*** /usr/src/orig/sip-router/modules_k/db_cluster/dbcl_api.c	2012-03-28 18:26:21.000000000 +0300
--- modules_k/db_cluster/dbcl_api.c	2012-04-07 17:51:28.000000000 +0300
***************
*** 46,51 ****
--- 46,52 ----
  	int k;\
  	db1_con_t *dbh=NULL;\
  	dbcl_cls_t *cls=NULL;\
+ 	dbcl_con_t *tmp;\
  	cls = (dbcl_cls_t*)_h->tail;\
  	ret = 0;\
  	for(i=DBCL_PRIO_SIZE-1; i>0; i--)\
***************
*** 58,68 ****
  	if(cls->rlist[i].clist[j] != NULL && cls->rlist[i].clist[j]->flags!=0\
  			&& cls->rlist[i].clist[j]->dbh != NULL)\
  	{\
! 		LM_DBG("serial operation - cluster [%.*s] (%d/%d)\n",\
  			cls->name.len, cls->name.s, i, j);\
  		dbh = cls->rlist[i].clist[j]->dbh;\
  		ret = cls->rlist[i].clist[j]->dbf.command;\
  		if (ret==0) {\
  			cls->usedcon = cls->rlist[i].clist[j];\
  			return 0;\
  		}\
--- 59,76 ----
  	if(cls->rlist[i].clist[j] != NULL && cls->rlist[i].clist[j]->flags!=0\
  			&& cls->rlist[i].clist[j]->dbh != NULL)\
  	{\
! 		LM_DBG("serial operation - cluster [%.*s] (%d/%d)\n", \
  			cls->name.len, cls->name.s, i, j);\
  		dbh = cls->rlist[i].clist[j]->dbh;\
  		ret = cls->rlist[i].clist[j]->dbf.command;\
  		if (ret==0) {\
+ 			if (j > 0) {\
+ 				LM_INFO("upgrading connection - cluster [%.*s] (%d/%d)\n", \
+ 					cls->name.len, cls->name.s, i, j);\
+ 				tmp = cls->rlist[i].clist[j];\
+ 				cls->rlist[i].clist[j] = cls->rlist[i].clist[j-1];\
+ 				cls->rlist[i].clist[j-1] = tmp;\
+ 			}\
  			cls->usedcon = cls->rlist[i].clist[j];\
  			return 0;\
  		}\
Hello,
I am just thinking of making the connection inactive, via a flag in shared memory, so the other processes will ignore it as well. Then, if the connection has been inactive for more than N seconds, try to use it again, and if it fails, keep it inactive.
This will work for every kind of selection policy, read or write operation. I hope to push the patch very soon. Your patch could be ok for serial selection, but then we would have to add different ones for round robin and parallel operations.
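A sketch of the inactive-flag-with-retry-interval logic (illustrative names; RETRY_INTERVAL stands in for the N seconds, and the real flag would live in shared memory):

```c
#include <assert.h>
#include <time.h>

#define RETRY_INTERVAL 30 /* illustrative value for N seconds */

typedef struct con_state {
    int inactive;     /* set when an operation failed */
    time_t failed_at; /* when the connection was marked inactive */
} con_state_t;

/* decide whether a query may be attempted on this connection now:
 * active connections always, inactive ones only as a probe after
 * the retry interval has expired */
static int may_try(const con_state_t *c, time_t now)
{
    if (!c->inactive)
        return 1;
    return (now - c->failed_at) >= RETRY_INTERVAL;
}

/* record the outcome of the attempt */
static void on_result(con_state_t *c, time_t now, int ok)
{
    if (ok) {
        c->inactive = 0; /* reactivate */
    } else {
        c->inactive = 1;
        c->failed_at = now; /* restart the interval */
    }
}
```

This keeps the selection policy itself untouched: any policy just skips connections for which may_try() returns 0.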
Cheers, Daniel
On 4/7/12 4:59 PM, Juha Heinanen wrote:
> daniel,
>
> the patch below upgrades a working connection if a read operation on another connection with the same priority has failed, i.e., next time the working connection will be tried before the failing one.
>
> is the patch ok with you? if so, the same could be done for write operations.
>
> -- juha
Daniel-Constantin Mierla writes:
> I am just thinking of making the connection inactive, via a flag in shared memory, so the other processes will ignore it as well. Then, if the connection has been inactive for more than N seconds, try to use it again, and if it fails, keep it inactive.
as i mentioned earlier, this solution would be fine with me. i just went ahead and implemented a simple solution for serial connections in case you don't have time to work on a more general fix.
-- juha
when using db_cluster, the msilo module m_dump() function generates these errors:

Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: db_mysql [km_dbase.c:122]: driver error on query: Unknown column 'src_addr' in 'field list'
Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: <core> [db_query.c:127]: error while submitting query
Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: db_mysql [km_dbase.c:122]: driver error on query: Unknown column 'src_addr' in 'field list'
Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: <core> [db_query.c:127]: error while submitting query
Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: db_cluster [dbcl_api.c:274]: invalid mode #000 (0)
Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: msilo [msilo.c:1066]: failed to query database
this is with unmodified db_cluster module when all db connections are working ok.
is there some bug in msilo db usage or in db_cluster module? where does invalid mode #000 come from?
-- juha
Hello,
On 4/7/12 5:20 PM, Juha Heinanen wrote:
> when using db_cluster, the msilo module m_dump() function generates these errors:
>
> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: db_mysql [km_dbase.c:122]: driver error on query: Unknown column 'src_addr' in 'field list'
> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: <core> [db_query.c:127]: error while submitting query
> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: db_mysql [km_dbase.c:122]: driver error on query: Unknown column 'src_addr' in 'field list'
The error is thrown from the mysql module. db_cluster does not have any direct relation with the real db connector structure; it just calls the appropriate functions from the other db modules.
> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: <core> [db_query.c:127]: error while submitting query
> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: db_cluster [dbcl_api.c:274]: invalid mode #000 (0)
> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: msilo [msilo.c:1066]: failed to query database
>
> this is with unmodified db_cluster module when all db connections are working ok.
Have you tried with mysql directly? I expect the msilo table structure changed, but the schema was not properly updated (or the version was not increased).
> is there some bug in msilo db usage or in db_cluster module? where does invalid mode #000 come from?
This looks like no selection algorithm was provided. I will check, since it should not happen at runtime (it is supposed to be detected at startup), but it might be a side effect of the failure in the next db layer.
Cheers, Daniel
Daniel-Constantin Mierla writes:
> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: <core> [db_query.c:127]: error while submitting query
> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: db_cluster [dbcl_api.c:274]: invalid mode #000 (0)
> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: msilo [msilo.c:1066]: failed to query database
>
> this is with unmodified db_cluster module when all db connections are working ok.
daniel,
i turned on mysql logging and got this:
90 Query select id,src_addr,dst_addr,body,ctype,inc_time,extra_hdrs from location where username='jh' AND domain='test.fi' AND snd_time=0 order by id
that is, the msilo query looks otherwise ok, but the table is wrong! it should be silo, not location. i have NOT set the db_table msilo module parameter.
after the above select, there is an update on the location table. could the table name somehow be overridden by the previous or next query in the case of db_cluster?
with the same msilo table, queries work ok when db_cluster is not in use.
-- juha
Hello,
msilo was missing a few db set-table calls before doing the queries. It was fine without db_cluster, but with db_cluster there are two layers, and the table name has to be refreshed in order to propagate completely.
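A sketch of why the refresh matters (illustrative structures, not the actual db_cluster API): the cluster layer sits between the module and the real drivers, so a set-table call has to be forwarded to every underlying connection, otherwise a driver keeps whatever table the previous query used:

```c
#include <assert.h>
#include <string.h>

#define MAX_TBL 32
#define NCON 2

/* hypothetical driver-level connection: remembers the table the
 * next query will run against */
typedef struct real_con {
    char table[MAX_TBL];
} real_con_t;

/* hypothetical cluster handle wrapping the real connections */
typedef struct cluster_con {
    real_con_t *cons[NCON];
} cluster_con_t;

/* the fix in spirit: propagate the table name down to every
 * underlying connection, not just the cluster layer */
static void cluster_use_table(cluster_con_t *cc, const char *table)
{
    int i;
    for (i = 0; i < NCON; i++) {
        strncpy(cc->cons[i]->table, table, MAX_TBL - 1);
        cc->cons[i]->table[MAX_TBL - 1] = '\0';
    }
}
```

Without the forwarding step, a driver connection last used for the location table would still query location, which matches the wrong-table symptom above.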
I did a commit to msilo for now, I will check other modules as well.
Let me know if it works now.
Cheers, Daniel
On 4/9/12 1:36 PM, Juha Heinanen wrote:
> Daniel-Constantin Mierla writes:
>
>> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: <core> [db_query.c:127]: error while submitting query
>> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: db_cluster [dbcl_api.c:274]: invalid mode #000 (0)
>> Apr 7 18:18:02 sip /usr/sbin/sip-proxy[30210]: ERROR: msilo [msilo.c:1066]: failed to query database
>>
>> this is with unmodified db_cluster module when all db connections are working ok.
>
> daniel,
>
> i turned on mysql logging and got this:
>
> 90 Query select id,src_addr,dst_addr,body,ctype,inc_time,extra_hdrs from location where username='jh' AND domain='test.fi' AND snd_time=0 order by id
>
> that is, the msilo query looks otherwise ok, but the table is wrong! it should be silo, not location. i have NOT set the db_table msilo module parameter.
>
> after the above select, there is an update on the location table. could the table name somehow be overridden by the previous or next query in the case of db_cluster?
>
> with the same msilo table, queries work ok when db_cluster is not in use.
>
> -- juha
daniel,
i tried and now msilo works with db_cluster. i'll test the new db_cluster timer stuff later today. thanks for implementing it.
-- juha