Hello,
I have had a little bit of a head start and implemented a few things that I wanted to present here:
I built a C library that is supposed to implement the core functionality for reading and writing the database as well as performing the lookups.
https://git.ipfire.org/?p=people/ms/libloc.git;a=summary
So far there is an implementation of a string pool which will later hold all strings like the names of countries and ASes, etc. The pool will probably be at the end of the database and keeps all strings separated by a NUL byte. Therefore it is easy to just jump to the right place and read the string from there until you find the NUL. That makes a string lookup perform in O(1) no matter how large the database is.
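To illustrate the offset-based lookup, here is a minimal sketch in C. The struct and function names are hypothetical (not libloc's actual API), and it assumes the pool has already been read or mmap()ed into memory as one contiguous block of NUL-separated strings:

```c
#include <stddef.h>

/* Hypothetical view of a loaded string pool: one contiguous block
 * of NUL-terminated strings. */
struct string_pool {
    const char* data;   /* start of the pool in memory */
    size_t length;      /* total size of the pool in bytes */
};

/* Looks up the string that starts at the given offset - O(1), since we
 * just jump there; the string runs until the next NUL byte. Returns
 * NULL if the offset is out of range. */
static const char* pool_get(const struct string_pool* pool, size_t offset) {
    if (offset >= pool->length)
        return NULL;

    return pool->data + offset;
}
```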
When you add a string to the pool, it checks for duplicates, so adding the same string twice will make the pool store it only once and return the same address both times. This operation is O(n), but that should be fine since the database is not going to be very large (i.e. gigabytes) and we write it once and read it very often, which makes this a good optimisation.
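A sketch of what the duplicate-avoiding insertion could look like while the database is being built (hypothetical names, minimal error handling; not the actual implementation):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable pool used while writing the database. */
struct pool_builder {
    char* data;
    size_t length;
};

/* Returns the offset of s in the pool, appending it only if it is not
 * already there. The duplicate scan is O(n) in the pool size, which is
 * fine for a write-once, read-often database. Returns (size_t)-1 on
 * allocation failure. */
static size_t pool_add(struct pool_builder* pool, const char* s) {
    size_t offset = 0;

    /* Walk the existing NUL-separated strings, looking for a duplicate */
    while (offset < pool->length) {
        if (strcmp(pool->data + offset, s) == 0)
            return offset;
        offset += strlen(pool->data + offset) + 1;
    }

    /* Not found - append the string including its terminating NUL */
    size_t len = strlen(s) + 1;
    char* data = realloc(pool->data, pool->length + len);
    if (!data)
        return (size_t)-1;

    memcpy(data + pool->length, s, len);
    pool->data = data;
    pool->length += len;

    return offset;
}
```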
We could potentially compress the string pool later if we needed to, but so far I didn't see the point, and it is kind of nice to be able to open the database in a hex editor and see what is going on inside.
So what does the database look like so far?
[root@rice-oxley libloc]# hexdump -C test.db
00000000  4c 4f 43 44 42 58 58 00  00 00 00 00 00 00 00 0c  |LOCDBXX.........|
00000010  00 00 00 18 00 00 01 55  54 65 73 74 20 56 65 6e  |.......UTest Ven|
00000020  64 6f 72 00 4c 6f 72 65  6d 20 69 70 73 75 6d 20  |dor.Lorem ipsum |
00000030  64 6f 6c 6f 72 20 73 69  74 20 61 6d 65 74 2c 20  |dolor sit amet, |
00000040  63 6f 6e 73 65 63 74 65  74 75 72 20 61 64 69 70  |consectetur adip|
00000050  69 73 63 69 6e 67 20 65  6c 69 74 2e 20 50 72 6f  |iscing elit. Pro|
00000060  69 6e 20 75 6c 74 72 69  63 65 73 20 70 75 6c 76  |in ultrices pulv|
00000070  69 6e 61 72 20 64 6f 6c  6f 72 2c 20 65 74 20 73  |inar dolor, et s|
00000080  6f 6c 6c 69 63 69 74 75  64 69 6e 20 65 72 6f 73  |ollicitudin eros|
00000090  20 75 6c 74 72 69 63 69  65 73 20 76 69 74 61 65  | ultricies vitae|
000000a0  2e 20 4e 61 6d 20 69 6e  20 76 6f 6c 75 74 70 61  |. Nam in volutpa|
000000b0  74 20 6c 69 62 65 72 6f  2e 20 4e 75 6c 6c 61 20  |t libero. Nulla |
000000c0  66 61 63 69 6c 69 73 69  2e 20 50 65 6c 6c 65 6e  |facilisi. Pellen|
000000d0  74 65 73 71 75 65 20 74  65 6d 70 6f 72 20 66 65  |tesque tempor fe|
000000e0  6c 69 73 20 65 6e 69 6d  2e 20 49 6e 74 65 67 65  |lis enim. Intege|
000000f0  72 20 63 6f 6e 67 75 65  20 6e 69 73 69 20 69 6e  |r congue nisi in|
00000100  20 6d 61 78 69 6d 75 73  20 70 72 65 74 69 75 6d  | maximus pretium|
00000110  2e 20 50 65 6c 6c 65 6e  74 65 73 71 75 65 20 65  |. Pellentesque e|
00000120  74 20 74 75 72 70 69 73  20 65 6c 65 6d 65 6e 74  |t turpis element|
00000130  75 6d 2c 20 6c 75 63 74  75 73 20 6d 69 20 61 74  |um, luctus mi at|
00000140  2c 20 69 6e 74 65 72 64  75 6d 20 65 72 61 74 2e  |, interdum erat.|
00000150  20 4d 61 65 63 65 6e 61  73 20 75 74 20 76 65 6e  | Maecenas ut ven|
00000160  65 6e 61 74 69 73 20 6e  75 6e 63 2e 00           |enatis nunc..|
The database starts with a magic value which I have set to LOCDBXX at the moment; we probably need to find a better one. After that there is a version field that we bump if the format changes, so we can make changes to the database format later if we need to.
Then, there is a pointer to a string of the "vendor" of the database. That allows us to set where the database is from and there is also a "description" pointer where we can just write some text and put useful information about the database. The pointers just point to a string in the string pool.
Do we need more like those? I could imagine a license field and a URL where this database came from. But we could put this into the description, too. Thoughts?
Then there is an offset where the string pool starts in the file (i.e. at which byte) and the length of the string pool.
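Putting the fields described so far together, the header could be read as a struct along these lines. The field names and widths are my interpretation of the hexdump above, not necessarily the final format; multi-byte integers appear to be stored big-endian, so a reader would have to convert them to host byte order:

```c
#include <stdint.h>

/* Hypothetical on-disk header layout, following the description above:
 * magic, version, two string-pool offsets for vendor and description,
 * and the position and size of the string pool itself. */
struct loc_database_header {
    char     magic[7];       /* "LOCDBXX" for now */
    uint8_t  version;        /* bumped on incompatible format changes */

    uint32_t vendor;         /* offset of the vendor string in the pool */
    uint32_t description;    /* offset of the description string */

    uint32_t pool_offset;    /* where the string pool starts in the file */
    uint32_t pool_length;    /* size of the string pool in bytes */
};
```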
There is no trailer yet to make sure that the database hasn't been altered. I guess we should do that with some sort of a signature inside the database, but I haven't really thought about that too much. I want to involve everyone in the process of designing this and have as much peer review as possible, so that we really have a smart and flexible design that won't create any problems in the future.
The library will get some bindings for other programming languages later. That allows us to use it easily in Python, Perl and whatever else people use. I have good experience with writing Python bindings. We will need Perl for using this database in IPFire 2. Anything else should be contributed by third parties.
The library itself is under LGPLv2.1 or later now which allows closed source applications to use it as well.
It has some unit tests to load the database, etc.
So far I am only depending on the C standard library, and it should stay that way for as long as we can, because that makes porting the database to other OSes easier. If we want to use compression we might need other libraries, and of course the language bindings need their respective libraries.
So this is where I am now. I haven't really made a plan for the next steps. So please send me comments and suggestions and I will try to draft out something.
Best, -Michael
Hello Michael,
Hello,
I have had a little bit of a head start and implemented a few things that I wanted to present here:
I built a C library that is supposed to implement the core functionality for reading and writing the database as well as performing the lookups.
Thanks for beginning.
So far there is an implementation of a string pool which will later hold all strings like the names of countries and ASes, etc. The pool will probably be at the end of the database and keeps all strings separated by a NUL byte. Therefore it is easy to just jump to the right place and read the string from there until you find the NUL. That makes a string lookup perform in O(1) no matter how large the database is.
Sounds good.
When you add a string to the pool, it checks for duplicates, so adding the same string twice will make the pool store it only once and return the same address both times. This operation is O(n), but that should be fine since the database is not going to be very large (i.e. gigabytes) and we write it once and read it very often, which makes this a good optimisation.
We could potentially compress the string pool later if we needed to, but so far I didn't see the point, and it is kind of nice to be able to open the database in a hex editor and see what is going on inside.
True.
In my opinion, it might be a good idea to make the AS description searchable, too. Most users will search for a description rather than the ASN since memorising numbers is rather difficult.
(Correct me if I am wrong; I know little about these database types.)
So what does the database look like so far?
[root@rice-oxley libloc]# hexdump -C test.db
00000000  4c 4f 43 44 42 58 58 00  00 00 00 00 00 00 00 0c  |LOCDBXX.........|
00000010  00 00 00 18 00 00 01 55  54 65 73 74 20 56 65 6e  |.......UTest Ven|
00000020  64 6f 72 00 4c 6f 72 65  6d 20 69 70 73 75 6d 20  |dor.Lorem ipsum |
00000030  64 6f 6c 6f 72 20 73 69  74 20 61 6d 65 74 2c 20  |dolor sit amet, |
00000040  63 6f 6e 73 65 63 74 65  74 75 72 20 61 64 69 70  |consectetur adip|
00000050  69 73 63 69 6e 67 20 65  6c 69 74 2e 20 50 72 6f  |iscing elit. Pro|
00000060  69 6e 20 75 6c 74 72 69  63 65 73 20 70 75 6c 76  |in ultrices pulv|
00000070  69 6e 61 72 20 64 6f 6c  6f 72 2c 20 65 74 20 73  |inar dolor, et s|
00000080  6f 6c 6c 69 63 69 74 75  64 69 6e 20 65 72 6f 73  |ollicitudin eros|
00000090  20 75 6c 74 72 69 63 69  65 73 20 76 69 74 61 65  | ultricies vitae|
000000a0  2e 20 4e 61 6d 20 69 6e  20 76 6f 6c 75 74 70 61  |. Nam in volutpa|
000000b0  74 20 6c 69 62 65 72 6f  2e 20 4e 75 6c 6c 61 20  |t libero. Nulla |
000000c0  66 61 63 69 6c 69 73 69  2e 20 50 65 6c 6c 65 6e  |facilisi. Pellen|
000000d0  74 65 73 71 75 65 20 74  65 6d 70 6f 72 20 66 65  |tesque tempor fe|
000000e0  6c 69 73 20 65 6e 69 6d  2e 20 49 6e 74 65 67 65  |lis enim. Intege|
000000f0  72 20 63 6f 6e 67 75 65  20 6e 69 73 69 20 69 6e  |r congue nisi in|
00000100  20 6d 61 78 69 6d 75 73  20 70 72 65 74 69 75 6d  | maximus pretium|
00000110  2e 20 50 65 6c 6c 65 6e  74 65 73 71 75 65 20 65  |. Pellentesque e|
00000120  74 20 74 75 72 70 69 73  20 65 6c 65 6d 65 6e 74  |t turpis element|
00000130  75 6d 2c 20 6c 75 63 74  75 73 20 6d 69 20 61 74  |um, luctus mi at|
00000140  2c 20 69 6e 74 65 72 64  75 6d 20 65 72 61 74 2e  |, interdum erat.|
00000150  20 4d 61 65 63 65 6e 61  73 20 75 74 20 76 65 6e  | Maecenas ut ven|
00000160  65 6e 61 74 69 73 20 6e  75 6e 63 2e 00           |enatis nunc..|
The database starts with a magic value which I have set to LOCDBXX at the moment; we probably need to find a better one. After that there is a version field that we bump if the format changes, so we can make changes to the database format later if we need to.
I consider DDMMYYYY to be fine since we probably won't update this several times a day.
Then, there is a pointer to a string of the "vendor" of the database. That allows us to set where the database is from and there is also a "description" pointer where we can just write some text and put useful information about the database. The pointers just point to a string in the string pool.
Do we need more like those? I could imagine a license field and a URL where this database came from. But we could put this into the description, too. Thoughts?
Including licence and URL is a good idea in my eyes.
Then there is an offset where the string pool starts in the file (i.e. at which byte) and the length of the string pool.
There is no trailer yet to make sure that the database hasn't been altered. I guess we should do that with some sort of a signature inside the database, but I haven't really thought about that too much. I want to involve everyone in the process of designing this and have as much peer review as possible, so that we really have a smart and flexible design that won't create any problems in the future.
As far as I am concerned, a SHA2 hash is not sufficient here. We need something like a GPG signature. Can we include that into the database or is it too large?
The library will get some bindings for other programming languages later. That allows us to use it easily in Python, Perl and whatever else people use. I have good experience with writing Python bindings. We will need Perl for using this database in IPFire 2. Anything else should be contributed by third parties.
The library itself is under LGPLv2.1 or later now which allows closed source applications to use it as well.
It has some unit tests to load the database, etc.
So far I am only depending on the C standard library, and it should stay that way for as long as we can, because that makes porting the database to other OSes easier. If we want to use compression we might need other libraries, and of course the language bindings need their respective libraries.
So this is where I am now. I haven't really made a plan for the next steps. So please send me comments and suggestions and I will try to draft out something.
In my eyes, the next step would be gathering the AS/GeoIP data out of the RIPE, ARIN, ... upstream databases. I think it might be good to split that up:
(a) The AS information can be more or less easily collected from the databases. Our job is to write something that does this (I don't think we need some extra special effort here, as we do with GeoIP).
(b) Building a custom GeoIP database consists of three steps:
- scrape the WHOIS entries of all AS and network ranges
- use manually set country code for A[1-3] networks
- scrape the Tor consensus and add relay IPs to the A1 section.
At the moment, I am doing some research on A1 (Anonymous Proxies) networks - it is quite hard to find out which servers are used by VPN providers, since only a few of them use custom ASes or netranges.
If anyone has ideas about how to find these servers (no custom rDNS/netrange), please let me know.
Scraping the Tor consensus is easy since it is public. I suggest adding only relays with static IPs and certain flags (running, valid, fast, stable). Aside from detecting whether an IP address is static or not, this should be an easy task.
Currently, I have not started collecting satellite and anycast networks, but I will do that as soon as I am through with A1.
Best regards, Peter Müller
Best, -Michael
Hi,
On Fri, 2017-12-08 at 21:51 +0100, Peter Müller wrote:
Hello Michael,
Hello,
I have had a little bit of a head start and implemented a few things that I wanted to present here:
I built a C library that is supposed to implement the core functionality for reading and writing the database as well as performing the lookups.
Thanks for beginning.
So far there is an implementation of a string pool which will later hold all strings like the names of countries and ASes, etc. The pool will probably be at the end of the database and keeps all strings separated by a NUL byte. Therefore it is easy to just jump to the right place and read the string from there until you find the NUL. That makes a string lookup perform in O(1) no matter how large the database is.
Sounds good.
When you add a string to the pool, it checks for duplicates, so adding the same string twice will make the pool store it only once and return the same address both times. This operation is O(n), but that should be fine since the database is not going to be very large (i.e. gigabytes) and we write it once and read it very often, which makes this a good optimisation.
We could potentially compress the string pool later if we needed to, but so far I didn't see the point, and it is kind of nice to be able to open the database in a hex editor and see what is going on inside.
True.
In my opinion, it might be a good idea to make the AS description searchable, too. Most users will search for a description rather than the ASN since memorising numbers is rather difficult.
(Correct me if I am wrong; I know little about these database types.)
Certainly we will need some search functionality. The only question I need an answer to is what we need to search for and how fast it has to be.
For example, performing a string search for a matching AS will take some time. If that isn't good enough, we will have to add an index to the database, which will of course make it larger. So it is all a trade-off, and we have to find out what is most important in which cases and whether we can afford to spend more space on disk.
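Such a linear search over the AS records might look like this in C (hypothetical types and names, not libloc's API); without an extra index it is O(n) in the number of records:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of an AS record, with the name already
 * resolved from the string pool. */
struct as_record {
    uint32_t number;     /* the ASN */
    const char* name;
};

/* Returns the first AS whose name contains the query, or NULL.
 * Case-insensitive matching or an index would be refinements of this. */
static const struct as_record* as_search(const struct as_record* records,
                                         size_t count, const char* query) {
    for (size_t i = 0; i < count; i++) {
        if (strstr(records[i].name, query))
            return &records[i];
    }

    return NULL;
}
```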
So what does the database look like so far?
[root@rice-oxley libloc]# hexdump -C test.db
00000000  4c 4f 43 44 42 58 58 00  00 00 00 00 00 00 00 0c  |LOCDBXX.........|
00000010  00 00 00 18 00 00 01 55  54 65 73 74 20 56 65 6e  |.......UTest Ven|
00000020  64 6f 72 00 4c 6f 72 65  6d 20 69 70 73 75 6d 20  |dor.Lorem ipsum |
00000030  64 6f 6c 6f 72 20 73 69  74 20 61 6d 65 74 2c 20  |dolor sit amet, |
00000040  63 6f 6e 73 65 63 74 65  74 75 72 20 61 64 69 70  |consectetur adip|
00000050  69 73 63 69 6e 67 20 65  6c 69 74 2e 20 50 72 6f  |iscing elit. Pro|
00000060  69 6e 20 75 6c 74 72 69  63 65 73 20 70 75 6c 76  |in ultrices pulv|
00000070  69 6e 61 72 20 64 6f 6c  6f 72 2c 20 65 74 20 73  |inar dolor, et s|
00000080  6f 6c 6c 69 63 69 74 75  64 69 6e 20 65 72 6f 73  |ollicitudin eros|
00000090  20 75 6c 74 72 69 63 69  65 73 20 76 69 74 61 65  | ultricies vitae|
000000a0  2e 20 4e 61 6d 20 69 6e  20 76 6f 6c 75 74 70 61  |. Nam in volutpa|
000000b0  74 20 6c 69 62 65 72 6f  2e 20 4e 75 6c 6c 61 20  |t libero. Nulla |
000000c0  66 61 63 69 6c 69 73 69  2e 20 50 65 6c 6c 65 6e  |facilisi. Pellen|
000000d0  74 65 73 71 75 65 20 74  65 6d 70 6f 72 20 66 65  |tesque tempor fe|
000000e0  6c 69 73 20 65 6e 69 6d  2e 20 49 6e 74 65 67 65  |lis enim. Intege|
000000f0  72 20 63 6f 6e 67 75 65  20 6e 69 73 69 20 69 6e  |r congue nisi in|
00000100  20 6d 61 78 69 6d 75 73  20 70 72 65 74 69 75 6d  | maximus pretium|
00000110  2e 20 50 65 6c 6c 65 6e  74 65 73 71 75 65 20 65  |. Pellentesque e|
00000120  74 20 74 75 72 70 69 73  20 65 6c 65 6d 65 6e 74  |t turpis element|
00000130  75 6d 2c 20 6c 75 63 74  75 73 20 6d 69 20 61 74  |um, luctus mi at|
00000140  2c 20 69 6e 74 65 72 64  75 6d 20 65 72 61 74 2e  |, interdum erat.|
00000150  20 4d 61 65 63 65 6e 61  73 20 75 74 20 76 65 6e  | Maecenas ut ven|
00000160  65 6e 61 74 69 73 20 6e  75 6e 63 2e 00           |enatis nunc..|
The database starts with a magic value which I have set to LOCDBXX at the moment; we probably need to find a better one. After that there is a version field that we bump if the format changes, so we can make changes to the database format later if we need to.
I consider DDMMYYYY to be fine since we probably won't update this several times a day.
No, this is just a value that will always be the same and identifies the file format. A PDF file always starts with %PDF-1.4 or similar, and there are many other examples, too. We will also need some sort of file extension, or we just use .db.
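For illustration, a reader could validate the magic value right after opening the file, much like PDF readers check the first bytes. This is only a sketch with a made-up function name, using the LOCDBXX placeholder from above:

```c
#include <stdio.h>
#include <string.h>

/* Reads the first bytes of the file and compares them against the
 * magic value. Returns 1 if this looks like a location database,
 * 0 otherwise. */
static int check_magic(FILE* f) {
    char magic[7];

    if (fread(magic, 1, sizeof(magic), f) != sizeof(magic))
        return 0;

    return memcmp(magic, "LOCDBXX", sizeof(magic)) == 0;
}
```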
Then, there is a pointer to a string of the "vendor" of the database. That allows us to set where the database is from and there is also a "description" pointer where we can just write some text and put useful information about the database. The pointers just point to a string in the string pool.
Do we need more like those? I could imagine a license field and a URL where this database came from. But we could put this into the description, too. Thoughts?
Including licence and URL is a good idea in my eyes.
As an extra field?
Then there is an offset where the string pool starts in the file (i.e. at which byte) and the length of the string pool.
There is no trailer yet to make sure that the database hasn't been altered. I guess we should do that with some sort of a signature inside the database, but I haven't really thought about that too much. I want to involve everyone in the process of designing this and have as much peer review as possible, so that we really have a smart and flexible design that won't create any problems in the future.
As far as I am concerned, a SHA2 hash is not sufficient here. We need something like a GPG signature. Can we include that into the database or is it too large?
Yes, certainly we need more than a checksum. We need to prove that we have created this database and the best way to do that is PGP. The signature itself is not too large and it is certainly worth having it.
But I don't want to depend on an external library (yet) to do this. So maybe we should make this optional, so that the library can still be used on embedded devices where code size is a problem. Or we use an RSA signature or something where we can use OpenSSL or libgcrypt, which might already be there. But I haven't investigated what implementation mistakes we could make here.
The library will get some bindings for other programming languages later. That allows us to use it easily in Python, Perl and whatever else people use. I have good experience with writing Python bindings. We will need Perl for using this database in IPFire 2. Anything else should be contributed by third parties.
The library itself is under LGPLv2.1 or later now which allows closed source applications to use it as well.
It has some unit tests to load the database, etc.
So far I am only depending on the C standard library, and it should stay that way for as long as we can, because that makes porting the database to other OSes easier. If we want to use compression we might need other libraries, and of course the language bindings need their respective libraries.
So this is where I am now. I haven't really made a plan for the next steps. So please send me comments and suggestions and I will try to draft out something.
In my eyes, the next step would be gathering the AS/GeoIP data out of the RIPE, ARIN, ... upstream databases. I think it might be good to split that up:
(a) The AS information can be more or less easily collected from the databases. Our job is to write something that does this (I don't think we need some extra special effort here, as we do with GeoIP).
(b) Building a custom GeoIP database consists of three steps:
- scrape the WHOIS entries of all AS and network ranges
- use manually set country code for A[1-3] networks
- scrape the Tor consensus and add relay IPs to the A1 section.
At the moment, I am doing some research on A1 (Anonymous Proxies) networks - it is quite hard to find out which servers are used by VPN providers, since only a few of them use custom ASes or netranges.
That might indeed be their trick :)
If anyone has ideas about how to find these servers (no custom rDNS/netrange), please let me know.
Scraping the Tor consensus is easy since it is public. I suggest adding only relays with static IPs and certain flags (running, valid, fast, stable). Aside from detecting whether an IP address is static or not, this should be an easy task.
Currently, I have not started collecting satellite and anycast networks, but I will do that as soon as I am through with A1.
Best regards, Peter Müller
Best, -Michael
-Michael