Hello,
I have had a little bit of a head start and implemented a few things that I wanted to present here:
I built a C library that is supposed to implement the core functionality or reading and writing the database as well as performing the lookups.
https://git.ipfire.org/?p=people/ms/libloc.git;a=summary
So far there is an implementation of a string pool which will later hold all strings like the names of countries and ASes, etc. The pool will probably at the end of the database and keeps all strings separated by a NULL byte. Therefore it is easy to just jump to the right place and read the string from there until you find NULL. That makes a string lookup perform in O(1) no matter how large the database is.
When you add a string to the pool it is checking for duplicates so that adding the same string twice will make the pool only store it once and return the same address both times. This operation is O(n) but that should be fine since the database is not going to be very very large (i.e. gigabytes) and we write it once and read if very often which makes this a good optimisation.
We could potentially compress the string pool later if we needed to, but so far I didn't see the point there and it is kind of nice to being able to open the database in a hex editor and see what is going on inside.
So what does the database look like to far?
[root@rice-oxley libloc]# hexdump -C test.db 00000000 4c 4f 43 44 42 58 58 00 00 00 00 00 00 00 00 0c |LOCDBXX.........| 00000010 00 00 00 18 00 00 01 55 54 65 73 74 20 56 65 6e |.......UTest Ven| 00000020 64 6f 72 00 4c 6f 72 65 6d 20 69 70 73 75 6d 20 |dor.Lorem ipsum | 00000030 64 6f 6c 6f 72 20 73 69 74 20 61 6d 65 74 2c 20 |dolor sit amet, | 00000040 63 6f 6e 73 65 63 74 65 74 75 72 20 61 64 69 70 |consectetur adip| 00000050 69 73 63 69 6e 67 20 65 6c 69 74 2e 20 50 72 6f |iscing elit. Pro| 00000060 69 6e 20 75 6c 74 72 69 63 65 73 20 70 75 6c 76 |in ultrices pulv| 00000070 69 6e 61 72 20 64 6f 6c 6f 72 2c 20 65 74 20 73 |inar dolor, et s| 00000080 6f 6c 6c 69 63 69 74 75 64 69 6e 20 65 72 6f 73 |ollicitudin eros| 00000090 20 75 6c 74 72 69 63 69 65 73 20 76 69 74 61 65 | ultricies vitae| 000000a0 2e 20 4e 61 6d 20 69 6e 20 76 6f 6c 75 74 70 61 |. Nam in volutpa| 000000b0 74 20 6c 69 62 65 72 6f 2e 20 4e 75 6c 6c 61 20 |t libero. Nulla | 000000c0 66 61 63 69 6c 69 73 69 2e 20 50 65 6c 6c 65 6e |facilisi. Pellen| 000000d0 74 65 73 71 75 65 20 74 65 6d 70 6f 72 20 66 65 |tesque tempor fe| 000000e0 6c 69 73 20 65 6e 69 6d 2e 20 49 6e 74 65 67 65 |lis enim. Intege| 000000f0 72 20 63 6f 6e 67 75 65 20 6e 69 73 69 20 69 6e |r congue nisi in| 00000100 20 6d 61 78 69 6d 75 73 20 70 72 65 74 69 75 6d | maximus pretium| 00000110 2e 20 50 65 6c 6c 65 6e 74 65 73 71 75 65 20 65 |. Pellentesque e| 00000120 74 20 74 75 72 70 69 73 20 65 6c 65 6d 65 6e 74 |t turpis element| 00000130 75 6d 2c 20 6c 75 63 74 75 73 20 6d 69 20 61 74 |um, luctus mi at| 00000140 2c 20 69 6e 74 65 72 64 75 6d 20 65 72 61 74 2e |, interdum erat.| 00000150 20 4d 61 65 63 65 6e 61 73 20 75 74 20 76 65 6e | Maecenas ut ven| 00000160 65 6e 61 74 69 73 20 6e 75 6e 63 2e 00 |enatis nunc..|
The database starts with a magic value which I have set to LOCDBXX at the moment and we need to find a better one probably. After that there is a version field that we change if the format changes. So we can have any changes of the database format later if we need to.
Then, there is a pointer to a string of the "vendor" of the database. That allows us to set where the database is from and there is also a "description" pointer where we can just write some text and put useful information about the database. The pointers just point to a string in the string pool.
Do we need more like those? I could imaging license and URL where this database came from? But we could put this into the description, too. Thoughts?
Then there is an offset where the string pool starts in the file (i.e. at which byte) and the length of the string pool.
There is no trailer yet to make sure that the database hasn't been altered. I guess we should do that with some sort of a signature inside the database. But I haven't really thought about that too much. I want to involve everyone into the process of designing this and also peer-review as much as possible so that we really have a smart and flexible design that won't create any future problems.
The library will get some bindings for other programming languages later. That allows us to use it easily in Python, Perl and what else people use. I have good experience with writing the Python bindings. We will need perl for using this database in IPFire 2. Anything else should be contributed by third parties.
The library itself is under LGPLv2.1 or later now which allows closed source applications to use it as well.
It has some unit tests to load the database, etc.
So far I am only depending on the C standard library and it should stay as that as long as we can because that makes porting the database to other OSes easier. If we want to use compression we might need other libraries and of course the language bindings need that their respective libraries.
So this is where I am now. I haven't really made a plan for the next steps. So please send me comments and suggestions and I will try to draft out something.
Best, -Michael