Hello,
str_str function searches for a substring into a bigger string.
str_str function in lib/cds/sstr.c file has a comment: /* FIXME: reimplement using better algorithm */
Attached patch reimplements that function using Boyer-Moore-Horspool-Raita algorithm, which is the one most text editors use to search for substrings.
Boyer-Moore-Horspool-Raita algorithm is a simplification from Boyer-Moore one, with less preprocessing overhead.
If n is the string length, and m is substring length, naive brute force algorithm has a cost of O(n*m).
Raita algoritm has same cost in worst case but Ω(n/m) a better average cost. As a drawback it adds a preprocessing cost of O(m + charset_length) where charset_length is 256 for us, because we use char.
Another way to approach this substring search issue would be separate it into two functions, a preprocessing one, and a searching one. Then in module initializacion where the function is used, calculate the needed preprocess and when the function is called do the search only.
This Raita algorithm could also be useful for str_search function in ut.c. Its usefulness depends on n, and m lengths, the bigger they are the better the algoritm works.
http://en.wikipedia.org/wiki/String_searching_algorithm http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm http://www-igm.univ-mlv.fr/~lecroq/string/node22.html
Regards, Vicente.
Am Dienstag, 12. Juni 2012, 13:05:18 schrieb Vicente Hernando:
str_str function in lib/cds/sstr.c file has a comment: /* FIXME: reimplement using better algorithm */
Attached patch reimplements that function using Boyer-Moore-Horspool-Raita algorithm, which is the one most text editors use to search for substrings.
Boyer-Moore-Horspool-Raita algorithm is a simplification from Boyer-Moore one, with less preprocessing overhead.
If n is the string length, and m is substring length, naive brute force algorithm has a cost of O(n*m).
Raita algoritm has same cost in worst case but Ω(n/m) a better average cost. As a drawback it adds a preprocessing cost of O(m + charset_length) where charset_length is 256 for us, because we use char.
Another way to approach this substring search issue would be separate it into two functions, a preprocessing one, and a searching one. Then in module initializacion where the function is used, calculate the needed preprocess and when the function is called do the search only.
This Raita algorithm could also be useful for str_search function in ut.c. Its usefulness depends on n, and m lengths, the bigger they are the better the algoritm works.
http://en.wikipedia.org/wiki/String_searching_algorithm http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm http://www-igm.univ-mlv.fr/~lecroq/string/node22.html
Hallo Vincente,
thanks for your patch. One question, would it not make sense to use the strstr function of <string.h>, with a small wrapper, which is probably quite optimised instead of implementing it our self?
Best regards,
Henning
Hello Henning,
On 06/13/2012 03:10 PM, Henning Westerholt wrote:
Am Dienstag, 12. Juni 2012, 13:05:18 schrieb Vicente Hernando:
str_str function in lib/cds/sstr.c file has a comment: /* FIXME: reimplement using better algorithm */
Attached patch reimplements that function using Boyer-Moore-Horspool-Raita algorithm, which is the one most text editors use to search for substrings.
Boyer-Moore-Horspool-Raita algorithm is a simplification from Boyer-Moore one, with less preprocessing overhead.
If n is the string length, and m is substring length, naive brute force algorithm has a cost of O(n*m).
Raita algoritm has same cost in worst case but Ω(n/m) a better average cost. As a drawback it adds a preprocessing cost of O(m + charset_length) where charset_length is 256 for us, because we use char.
Another way to approach this substring search issue would be separate it into two functions, a preprocessing one, and a searching one. Then in module initializacion where the function is used, calculate the needed preprocess and when the function is called do the search only.
This Raita algorithm could also be useful for str_search function in ut.c. Its usefulness depends on n, and m lengths, the bigger they are the better the algoritm works.
http://en.wikipedia.org/wiki/String_searching_algorithm http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm http://www-igm.univ-mlv.fr/~lecroq/string/node22.html
Hallo Vincente,
thanks for your patch. One question, would it not make sense to use the strstr function of<string.h>, with a small wrapper, which is probably quite optimised instead of implementing it our self?
actually, I do not know. Kamailio provides some functions that are already available in some libraries, like isdigit, etc.
Is there any kamailio policy about what functions to use?
str_str function is used in presence_b2b SER module.
We would manage further optimization by separating substring preprocessing and searching in the string, because we know we are always going to search for "accepted", "terminated" and "pending".
Best regards, Vicente
Best regards,
Henning
Looking into strstr man page: char *strstr(const char *haystack, const char *needle);
the patch advantage would be it already knows string lengths and has not to calculate them. Otherwise I expect strstr use a similar to Raita algoritm.
Regards, Vicente.
On 06/13/2012 03:34 PM, Vicente Hernando wrote:
Hello Henning,
On 06/13/2012 03:10 PM, Henning Westerholt wrote:
Am Dienstag, 12. Juni 2012, 13:05:18 schrieb Vicente Hernando:
str_str function in lib/cds/sstr.c file has a comment: /* FIXME: reimplement using better algorithm */
Attached patch reimplements that function using Boyer-Moore-Horspool-Raita algorithm, which is the one most text editors use to search for substrings.
Boyer-Moore-Horspool-Raita algorithm is a simplification from Boyer-Moore one, with less preprocessing overhead.
If n is the string length, and m is substring length, naive brute force algorithm has a cost of O(n*m).
Raita algoritm has same cost in worst case but Ω(n/m) a better average cost. As a drawback it adds a preprocessing cost of O(m + charset_length) where charset_length is 256 for us, because we use char.
Another way to approach this substring search issue would be separate it into two functions, a preprocessing one, and a searching one. Then in module initializacion where the function is used, calculate the needed preprocess and when the function is called do the search only.
This Raita algorithm could also be useful for str_search function in ut.c. Its usefulness depends on n, and m lengths, the bigger they are the better the algoritm works.
http://en.wikipedia.org/wiki/String_searching_algorithm http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore%E2%80%93Horspool_algorithm
Hallo Vincente,
thanks for your patch. One question, would it not make sense to use the strstr function of<string.h>, with a small wrapper, which is probably quite optimised instead of implementing it our self?
actually, I do not know. Kamailio provides some functions that are already available in some libraries, like isdigit, etc.
Is there any kamailio policy about what functions to use?
str_str function is used in presence_b2b SER module.
We would manage further optimization by separating substring preprocessing and searching in the string, because we know we are always going to search for "accepted", "terminated" and "pending".
Best regards, Vicente
Best regards,
Henning
sr-dev mailing list sr-dev@lists.sip-router.org http://lists.sip-router.org/cgi-bin/mailman/listinfo/sr-dev
Am Mittwoch, 13. Juni 2012, 15:47:59 schrieb Vicente Hernando:
Looking into strstr man page: char *strstr(const char *haystack, const char *needle);
the patch advantage would be it already knows string lengths and has not to calculate them. Otherwise I expect strstr use a similar to Raita algoritm.
Hello Vincente,
in the past I replaced some internally re-implemented string functions with the ones from the glibc. So in my opinion we could do the same with this function here, maybe add a small wrapper around it to not change the signature.
Regards,
Henning
Hello Henning,
On 06/13/2012 04:53 PM, Henning Westerholt wrote:
Am Mittwoch, 13. Juni 2012, 15:47:59 schrieb Vicente Hernando:
Looking into strstr man page: char *strstr(const char *haystack, const char *needle);
the patch advantage would be it already knows string lengths and has not to calculate them. Otherwise I expect strstr use a similar to Raita algoritm.
Hello Vincente,
in the past I replaced some internally re-implemented string functions with the ones from the glibc. So in my opinion we could do the same with this function here, maybe add a small wrapper around it to not change the signature.
I see using glibc functions as a good general approach.
Only doubt is when changing str strings into zero terminated strings, quickest way is adding '\0' at the end.
Is it always guaranteed to have enough space at the end of str.s to store '\0' value?
Best regards, Vicente.
Regards,
Henning
Am Mittwoch, 13. Juni 2012, 19:17:02 schrieb Vicente Hernando:
Am Mittwoch, 13. Juni 2012, 15:47:59 schrieb Vicente Hernando:
Looking into strstr man page: char *strstr(const char *haystack, const char *needle);
the patch advantage would be it already knows string lengths and has not to calculate them. Otherwise I expect strstr use a similar to Raita algoritm.
Hello Vincente,
in the past I replaced some internally re-implemented string functions with the ones from the glibc. So in my opinion we could do the same with this function here, maybe add a small wrapper around it to not change the signature.
I see using glibc functions as a good general approach.
Only doubt is when changing str strings into zero terminated strings, quickest way is adding '\0' at the end.
Is it always guaranteed to have enough space at the end of str.s to store '\0' value?
Hi Vincente,
good point, this is not guaranted and probably in many cases not the case. So the general library function is probably here not usable.
Best regards,
Henning