catalina oancea wrote:
That seems like an interesting idea, Alex.
I still have problems understanding the size parameter. For example what exactly does size=4 mean?
So, imagine a two-dimensional array. An array of size 4 would mean a 2x2 array. Or, if you prefer, a one-dimensional array of 4 elements. But "tables" are easier to visualise as a grid.
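To put it in C terms (purely illustrative - the struct name here is made up, it's not actual htable code), you could picture it as:

struct value_data *hash_table[4];   /* 4 buckets, indexed 0 through 3 */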
Every time a value is inserted,
location = hash_function(key, table_size); hash_table[location] = value;
happens. The hash_function() is structured in such a way that it produces a deterministic result; the same key will always result in the same hash. It is also structured to never return a hash value greater than or equal to the size of the table, as that would result in an out-of-bounds array access.
The size is the size of the set of all possible values hash_function() can produce. How the hash function works exactly depends on the data and there are many strategies. Good hash functions achieve as much of an "optimal" distribution as possible; with size=4, the aim is to have about 25% of the stored values go in hash_table[0], 25% in hash_table[1], 25% in [2], and 25% in [3].
How exactly to best achieve this is a question I leave to the mathematicians; there are many hash algorithms one can use. Of course, whether the result is optimal also depends on how well the hash algorithm fits the variation in the data. A very basic one, commonly used for introductory teaching, is to sum up the ASCII values of the characters in the key string, divide that sum by the size of the table, and use the remainder as the hash value, e.g.
char key[] = "catalina"; int sum = 0;
for(int i = 0; i < sizeof(key); i ++) sum += (int) i;
return (sum % HASH_TABLE_SIZE);
Of course, this is a rather naive hash function and more sophisticated ones are available.
Anyway, the point is that if you have a table with a size of 4, the best-case scenario is that the first 4 insertions each map to a different one of those 4 buckets.
Anything inserted AFTER that will *necessarily* result in a collision - i.e. there is already another element in that bucket, and it becomes necessary to chain the elements and/or fall back on some sort of linear search. This increases the computational complexity of lookups and erodes the speed advantage that is the whole point of using a hash table in the first place:
struct value_data *hash_lookup(table_t *tbl, char *key) {
    int idx = hash_function(key, tbl->size);
    struct bucket_node *scan;
    /* walk the chain in this bucket until the key matches */
    for(scan = tbl->buckets[idx]; scan; scan = scan->next)
        if(!strcmp(scan->key, key)) return scan->value;
    return NULL;
}
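The chain only exists because of how colliding keys get inserted in the first place; here is a minimal sketch of the corresponding insert, using the same made-up table_t / bucket_node types as above (error handling omitted):

void hash_insert(table_t *tbl, char *key, struct value_data *value) {
    int idx = hash_function(key, tbl->size);
    struct bucket_node *node = malloc(sizeof(*node));
    node->key = strdup(key);
    node->value = value;
    /* push onto the front of the chain; anything already in this bucket
       (i.e. a collision) simply stays linked behind the new node */
    node->next = tbl->buckets[idx];
    tbl->buckets[idx] = node;
}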
With a larger table, you get the advantage of having more possible unique buckets for the data to go into, thus reducing or eliminating the need for a linear search. If you try to stuff 200 values into 4 buckets, even a perfectly even spread leaves about 50 entries per bucket, so every lookup degenerates into scanning a chain of roughly N/4 elements. With a table size > 200, you can get the best-case performance of O(1), which is the point of a hash table.
All associative arrays and "maps" in various high-level languages, e.g. Perl:
my %hash = ( 'key' => 'value' );
and PHP:
$arr = array( 'key' => 'value' );
and Java:
HashMap<String, String> m = new HashMap<String, String>(); m.put("Catalina", "Oancea");
... work on this principle, but dynamically resize themselves in the background in accordance with the runtime population of the structure.
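The resizing itself is conceptually straightforward: when the table gets too full, allocate a bigger bucket array and re-hash every stored node into it. A rough sketch, again using the made-up types from above:

void hash_resize(table_t *tbl, int new_size) {
    struct bucket_node **nb = calloc(new_size, sizeof(struct bucket_node *));
    int i;
    for(i = 0; i < tbl->size; i++) {
        struct bucket_node *scan = tbl->buckets[i], *next;
        while(scan) {
            next = scan->next;
            /* the bucket index depends on the table size, so every
               node has to be re-hashed under the new size */
            int idx = hash_function(scan->key, new_size);
            scan->next = nb[idx];
            nb[idx] = scan;
            scan = next;
        }
    }
    free(tbl->buckets);
    tbl->buckets = nb;
    tbl->size = new_size;
}

A runtime will typically kick something like this off once the number of stored entries crosses some load factor threshold (say, 75% of the bucket count), usually doubling the size each time.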
The difference is that in the Kamailio 'htable' module, you must specify a fixed size at initialisation.
-- Alex