Hello,
I have had a little bit of a head start and implemented a few things that I wanted to present here:
I built a C library that is supposed to implement the core functionality for reading and writing the database as well as performing the lookups.
https://git.ipfire.org/?p=people/ms/libloc.git;a=summary
So far there is an implementation of a string pool which will later hold all strings like the names of countries and ASes, etc. The pool will probably be at the end of the database and keeps all strings separated by a NUL byte. Therefore it is easy to just jump to the right place and read the string from there until you find the NUL. That makes a string lookup perform in O(1) no matter how large the database is.
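To illustrate the offset-based lookup, here is a minimal sketch in C. The struct and function names are hypothetical (not libloc's actual API), and it assumes the pool has already been read or mmap()ed into memory as one contiguous block of NUL-separated strings:

```c
#include <stddef.h>

/* Hypothetical view of a loaded string pool: one contiguous block
 * of NUL-terminated strings. */
struct string_pool {
    const char* data;   /* start of the pool in memory */
    size_t length;      /* total size of the pool in bytes */
};

/* Looks up the string that starts at the given offset - O(1), since we
 * just jump there; the string runs until the next NUL byte. Returns
 * NULL if the offset is out of range. */
static const char* pool_get(const struct string_pool* pool, size_t offset) {
    if (offset >= pool->length)
        return NULL;

    return pool->data + offset;
}
```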
When you add a string to the pool, it checks for duplicates, so adding the same string twice will make the pool store it only once and return the same address both times. This operation is O(n), but that should be fine since the database is not going to be very large (i.e. gigabytes) and we write it once and read it very often, which makes this a good optimisation.
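A sketch of what the duplicate-avoiding insertion could look like while the database is being built (hypothetical names, minimal error handling; not the actual implementation):

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable pool used while writing the database. */
struct pool_builder {
    char* data;
    size_t length;
};

/* Returns the offset of s in the pool, appending it only if it is not
 * already there. The duplicate scan is O(n) in the pool size, which is
 * fine for a write-once, read-often database. Returns (size_t)-1 on
 * allocation failure. */
static size_t pool_add(struct pool_builder* pool, const char* s) {
    size_t offset = 0;

    /* Walk the existing NUL-separated strings, looking for a duplicate */
    while (offset < pool->length) {
        if (strcmp(pool->data + offset, s) == 0)
            return offset;
        offset += strlen(pool->data + offset) + 1;
    }

    /* Not found - append the string including its terminating NUL */
    size_t len = strlen(s) + 1;
    char* data = realloc(pool->data, pool->length + len);
    if (!data)
        return (size_t)-1;

    memcpy(data + pool->length, s, len);
    pool->data = data;
    pool->length += len;

    return offset;
}
```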
We could potentially compress the string pool later if we needed to, but so far I didn't see the point, and it is kind of nice to be able to open the database in a hex editor and see what is going on inside.
So what does the database look like so far?
[root@rice-oxley libloc]# hexdump -C test.db
00000000  4c 4f 43 44 42 58 58 00  00 00 00 00 00 00 00 0c  |LOCDBXX.........|
00000010  00 00 00 18 00 00 01 55  54 65 73 74 20 56 65 6e  |.......UTest Ven|
00000020  64 6f 72 00 4c 6f 72 65  6d 20 69 70 73 75 6d 20  |dor.Lorem ipsum |
00000030  64 6f 6c 6f 72 20 73 69  74 20 61 6d 65 74 2c 20  |dolor sit amet, |
00000040  63 6f 6e 73 65 63 74 65  74 75 72 20 61 64 69 70  |consectetur adip|
00000050  69 73 63 69 6e 67 20 65  6c 69 74 2e 20 50 72 6f  |iscing elit. Pro|
00000060  69 6e 20 75 6c 74 72 69  63 65 73 20 70 75 6c 76  |in ultrices pulv|
00000070  69 6e 61 72 20 64 6f 6c  6f 72 2c 20 65 74 20 73  |inar dolor, et s|
00000080  6f 6c 6c 69 63 69 74 75  64 69 6e 20 65 72 6f 73  |ollicitudin eros|
00000090  20 75 6c 74 72 69 63 69  65 73 20 76 69 74 61 65  | ultricies vitae|
000000a0  2e 20 4e 61 6d 20 69 6e  20 76 6f 6c 75 74 70 61  |. Nam in volutpa|
000000b0  74 20 6c 69 62 65 72 6f  2e 20 4e 75 6c 6c 61 20  |t libero. Nulla |
000000c0  66 61 63 69 6c 69 73 69  2e 20 50 65 6c 6c 65 6e  |facilisi. Pellen|
000000d0  74 65 73 71 75 65 20 74  65 6d 70 6f 72 20 66 65  |tesque tempor fe|
000000e0  6c 69 73 20 65 6e 69 6d  2e 20 49 6e 74 65 67 65  |lis enim. Intege|
000000f0  72 20 63 6f 6e 67 75 65  20 6e 69 73 69 20 69 6e  |r congue nisi in|
00000100  20 6d 61 78 69 6d 75 73  20 70 72 65 74 69 75 6d  | maximus pretium|
00000110  2e 20 50 65 6c 6c 65 6e  74 65 73 71 75 65 20 65  |. Pellentesque e|
00000120  74 20 74 75 72 70 69 73  20 65 6c 65 6d 65 6e 74  |t turpis element|
00000130  75 6d 2c 20 6c 75 63 74  75 73 20 6d 69 20 61 74  |um, luctus mi at|
00000140  2c 20 69 6e 74 65 72 64  75 6d 20 65 72 61 74 2e  |, interdum erat.|
00000150  20 4d 61 65 63 65 6e 61  73 20 75 74 20 76 65 6e  | Maecenas ut ven|
00000160  65 6e 61 74 69 73 20 6e  75 6e 63 2e 00           |enatis nunc..|
The database starts with a magic value which I have set to LOCDBXX at the moment; we probably need to find a better one. After that there is a version field that we bump if the format changes, so we can make changes to the database format later if we need to.
Then, there is a pointer to a string of the "vendor" of the database. That allows us to set where the database is from and there is also a "description" pointer where we can just write some text and put useful information about the database. The pointers just point to a string in the string pool.
Do we need more like those? I could imagine a license field and a URL where this database came from. But we could put this into the description, too. Thoughts?
Then there is an offset where the string pool starts in the file (i.e. at which byte) and the length of the string pool.
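Putting the fields described so far together, the header could be read as a struct along these lines. The field names and widths are my interpretation of the hexdump above, not necessarily the final format; multi-byte integers appear to be stored big-endian, so a reader would have to convert them to host byte order:

```c
#include <stdint.h>

/* Hypothetical on-disk header layout, following the description above:
 * magic, version, two string-pool offsets for vendor and description,
 * and the position and size of the string pool itself. */
struct loc_database_header {
    char     magic[7];       /* "LOCDBXX" for now */
    uint8_t  version;        /* bumped on incompatible format changes */

    uint32_t vendor;         /* offset of the vendor string in the pool */
    uint32_t description;    /* offset of the description string */

    uint32_t pool_offset;    /* where the string pool starts in the file */
    uint32_t pool_length;    /* size of the string pool in bytes */
};
```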
There is no trailer yet to make sure that the database hasn't been altered. I guess we should do that with some sort of a signature inside the database, but I haven't really thought about that too much. I want to involve everyone in the process of designing this and have as much peer review as possible, so that we really have a smart and flexible design that won't create any problems in the future.
The library will get some bindings for other programming languages later. That allows us to use it easily in Python, Perl and whatever else people use. I have good experience with writing Python bindings. We will need Perl for using this database in IPFire 2. Anything else should be contributed by third parties.
The library itself is under LGPLv2.1 or later now which allows closed source applications to use it as well.
It has some unit tests to load the database, etc.
So far I am only depending on the C standard library, and it should stay that way for as long as we can, because that makes porting the database to other OSes easier. If we want to use compression we might need other libraries, and of course the language bindings need their respective libraries.
So this is where I am now. I haven't really made a plan for the next steps. So please send me comments and suggestions and I will try to draft out something.
Best, -Michael
Hello Michael,
Hello,
I have had a little bit of a head start and implemented a few things that I wanted to present here:
I built a C library that is supposed to implement the core functionality for reading and writing the database as well as performing the lookups.
Thanks for beginning.
So far there is an implementation of a string pool which will later hold all strings like the names of countries and ASes, etc. The pool will probably be at the end of the database and keeps all strings separated by a NUL byte. Therefore it is easy to just jump to the right place and read the string from there until you find the NUL. That makes a string lookup perform in O(1) no matter how large the database is.
Sounds good.
When you add a string to the pool, it checks for duplicates, so adding the same string twice will make the pool store it only once and return the same address both times. This operation is O(n), but that should be fine since the database is not going to be very large (i.e. gigabytes) and we write it once and read it very often, which makes this a good optimisation.
We could potentially compress the string pool later if we needed to, but so far I didn't see the point, and it is kind of nice to be able to open the database in a hex editor and see what is going on inside.
True.
In my opinion, it might be a good idea to make the AS description searchable, too. Most users will search for a description rather than the ASN since memorising numbers is rather difficult.
(Correct me if I am wrong; I know little about these database types.)
So what does the database look like so far?
[root@rice-oxley libloc]# hexdump -C test.db
00000000  4c 4f 43 44 42 58 58 00  00 00 00 00 00 00 00 0c  |LOCDBXX.........|
00000010  00 00 00 18 00 00 01 55  54 65 73 74 20 56 65 6e  |.......UTest Ven|
00000020  64 6f 72 00 4c 6f 72 65  6d 20 69 70 73 75 6d 20  |dor.Lorem ipsum |
00000030  64 6f 6c 6f 72 20 73 69  74 20 61 6d 65 74 2c 20  |dolor sit amet, |
00000040  63 6f 6e 73 65 63 74 65  74 75 72 20 61 64 69 70  |consectetur adip|
00000050  69 73 63 69 6e 67 20 65  6c 69 74 2e 20 50 72 6f  |iscing elit. Pro|
00000060  69 6e 20 75 6c 74 72 69  63 65 73 20 70 75 6c 76  |in ultrices pulv|
00000070  69 6e 61 72 20 64 6f 6c  6f 72 2c 20 65 74 20 73  |inar dolor, et s|
00000080  6f 6c 6c 69 63 69 74 75  64 69 6e 20 65 72 6f 73  |ollicitudin eros|
00000090  20 75 6c 74 72 69 63 69  65 73 20 76 69 74 61 65  | ultricies vitae|
000000a0  2e 20 4e 61 6d 20 69 6e  20 76 6f 6c 75 74 70 61  |. Nam in volutpa|
000000b0  74 20 6c 69 62 65 72 6f  2e 20 4e 75 6c 6c 61 20  |t libero. Nulla |
000000c0  66 61 63 69 6c 69 73 69  2e 20 50 65 6c 6c 65 6e  |facilisi. Pellen|
000000d0  74 65 73 71 75 65 20 74  65 6d 70 6f 72 20 66 65  |tesque tempor fe|
000000e0  6c 69 73 20 65 6e 69 6d  2e 20 49 6e 74 65 67 65  |lis enim. Intege|
000000f0  72 20 63 6f 6e 67 75 65  20 6e 69 73 69 20 69 6e  |r congue nisi in|
00000100  20 6d 61 78 69 6d 75 73  20 70 72 65 74 69 75 6d  | maximus pretium|
00000110  2e 20 50 65 6c 6c 65 6e  74 65 73 71 75 65 20 65  |. Pellentesque e|
00000120  74 20 74 75 72 70 69 73  20 65 6c 65 6d 65 6e 74  |t turpis element|
00000130  75 6d 2c 20 6c 75 63 74  75 73 20 6d 69 20 61 74  |um, luctus mi at|
00000140  2c 20 69 6e 74 65 72 64  75 6d 20 65 72 61 74 2e  |, interdum erat.|
00000150  20 4d 61 65 63 65 6e 61  73 20 75 74 20 76 65 6e  | Maecenas ut ven|
00000160  65 6e 61 74 69 73 20 6e  75 6e 63 2e 00           |enatis nunc..|
The database starts with a magic value which I have set to LOCDBXX at the moment; we probably need to find a better one. After that there is a version field that we bump if the format changes, so we can make changes to the database format later if we need to.
I consider DDMMYYYY to be fine since we probably won't update this several times a day.
Then, there is a pointer to a string of the "vendor" of the database. That allows us to set where the database is from and there is also a "description" pointer where we can just write some text and put useful information about the database. The pointers just point to a string in the string pool.
Do we need more like those? I could imagine a license field and a URL where this database came from. But we could put this into the description, too. Thoughts?
Including licence and URL is a good idea in my eyes.
Then there is an offset where the string pool starts in the file (i.e. at which byte) and the length of the string pool.
There is no trailer yet to make sure that the database hasn't been altered. I guess we should do that with some sort of a signature inside the database, but I haven't really thought about that too much. I want to involve everyone in the process of designing this and have as much peer review as possible, so that we really have a smart and flexible design that won't create any problems in the future.
As far as I am concerned, a SHA2 hash is not sufficient here. We need something like a GPG signature. Can we include that into the database or is it too large?
The library will get some bindings for other programming languages later. That allows us to use it easily in Python, Perl and whatever else people use. I have good experience with writing Python bindings. We will need Perl for using this database in IPFire 2. Anything else should be contributed by third parties.
The library itself is under LGPLv2.1 or later now which allows closed source applications to use it as well.
It has some unit tests to load the database, etc.
So far I am only depending on the C standard library, and it should stay that way for as long as we can, because that makes porting the database to other OSes easier. If we want to use compression we might need other libraries, and of course the language bindings need their respective libraries.
So this is where I am now. I haven't really made a plan for the next steps. So please send me comments and suggestions and I will try to draft out something.
In my eyes, the next step would be gathering the AS/GeoIP data out of the RIPE, ARIN, ... upstream databases. I think it might be good to split that up:
(a) The AS information can be more or less easily collected from the databases. Our job is to write something that does this (I don't think we need some extra special effort here, as we do with GeoIP).
(b) Building a custom GeoIP database consists of three steps:
- scrape the WHOIS entries of all AS and network ranges
- use manually set country code for A[1-3] networks
- scrape the Tor consensus and add relay IPs to the A1 section.
At the moment, I am doing some research on A1 (Anonymous Proxies) networks - it is quite hard to find out which servers are used by VPN providers, since only a few of them use custom ASes or netranges.
If anyone has ideas about how to find these servers (no custom rDNS/netrange), please let me know.
Scraping the Tor consensus is easy since it is public. I suggest adding only relays with static IPs and certain flags (running, valid, fast, stable). Aside from detecting whether an IP address is static or not, this should be an easy task.
Currently, I have not started collecting satellite and anycast networks, but I will do that as soon as I am through with A1.
Best regards, Peter Müller
Best, -Michael
Hi,
On Fri, 2017-12-08 at 21:51 +0100, Peter Müller wrote:
Hello Michael,
Hello,
I have had a little bit of a head start and implemented a few things that I wanted to present here:
I built a C library that is supposed to implement the core functionality for reading and writing the database as well as performing the lookups.
Thanks for beginning.
So far there is an implementation of a string pool which will later hold all strings like the names of countries and ASes, etc. The pool will probably be at the end of the database and keeps all strings separated by a NUL byte. Therefore it is easy to just jump to the right place and read the string from there until you find the NUL. That makes a string lookup perform in O(1) no matter how large the database is.
Sounds good.
When you add a string to the pool, it checks for duplicates, so adding the same string twice will make the pool store it only once and return the same address both times. This operation is O(n), but that should be fine since the database is not going to be very large (i.e. gigabytes) and we write it once and read it very often, which makes this a good optimisation.
We could potentially compress the string pool later if we needed to, but so far I didn't see the point, and it is kind of nice to be able to open the database in a hex editor and see what is going on inside.
True.
In my opinion, it might be a good idea to make the AS description searchable, too. Most users will search for a description rather than the ASN since memorising numbers is rather difficult.
(Correct me if I am wrong; I know little about these database types.)
Certainly we will need some search functionality. The only question I need an answer to is what we need to search for and how fast it has to be.
For example, performing a string search for a matching AS will take some time. If that isn't good enough, we will have to add an index to the database, which will of course make it larger. So it is all a trade-off, and we have to find out what is most important in which cases and whether we can afford to spend more space on disk.
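Such a linear search over the AS records might look like this in C (hypothetical types and names, not libloc's API); without an extra index it is O(n) in the number of records:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory view of an AS record, with the name already
 * resolved from the string pool. */
struct as_record {
    uint32_t number;     /* the ASN */
    const char* name;
};

/* Returns the first AS whose name contains the query, or NULL.
 * Case-insensitive matching or an index would be refinements of this. */
static const struct as_record* as_search(const struct as_record* records,
                                         size_t count, const char* query) {
    for (size_t i = 0; i < count; i++) {
        if (strstr(records[i].name, query))
            return &records[i];
    }

    return NULL;
}
```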
So what does the database look like so far?
[root@rice-oxley libloc]# hexdump -C test.db
00000000  4c 4f 43 44 42 58 58 00  00 00 00 00 00 00 00 0c  |LOCDBXX.........|
00000010  00 00 00 18 00 00 01 55  54 65 73 74 20 56 65 6e  |.......UTest Ven|
00000020  64 6f 72 00 4c 6f 72 65  6d 20 69 70 73 75 6d 20  |dor.Lorem ipsum |
00000030  64 6f 6c 6f 72 20 73 69  74 20 61 6d 65 74 2c 20  |dolor sit amet, |
00000040  63 6f 6e 73 65 63 74 65  74 75 72 20 61 64 69 70  |consectetur adip|
00000050  69 73 63 69 6e 67 20 65  6c 69 74 2e 20 50 72 6f  |iscing elit. Pro|
00000060  69 6e 20 75 6c 74 72 69  63 65 73 20 70 75 6c 76  |in ultrices pulv|
00000070  69 6e 61 72 20 64 6f 6c  6f 72 2c 20 65 74 20 73  |inar dolor, et s|
00000080  6f 6c 6c 69 63 69 74 75  64 69 6e 20 65 72 6f 73  |ollicitudin eros|
00000090  20 75 6c 74 72 69 63 69  65 73 20 76 69 74 61 65  | ultricies vitae|
000000a0  2e 20 4e 61 6d 20 69 6e  20 76 6f 6c 75 74 70 61  |. Nam in volutpa|
000000b0  74 20 6c 69 62 65 72 6f  2e 20 4e 75 6c 6c 61 20  |t libero. Nulla |
000000c0  66 61 63 69 6c 69 73 69  2e 20 50 65 6c 6c 65 6e  |facilisi. Pellen|
000000d0  74 65 73 71 75 65 20 74  65 6d 70 6f 72 20 66 65  |tesque tempor fe|
000000e0  6c 69 73 20 65 6e 69 6d  2e 20 49 6e 74 65 67 65  |lis enim. Intege|
000000f0  72 20 63 6f 6e 67 75 65  20 6e 69 73 69 20 69 6e  |r congue nisi in|
00000100  20 6d 61 78 69 6d 75 73  20 70 72 65 74 69 75 6d  | maximus pretium|
00000110  2e 20 50 65 6c 6c 65 6e  74 65 73 71 75 65 20 65  |. Pellentesque e|
00000120  74 20 74 75 72 70 69 73  20 65 6c 65 6d 65 6e 74  |t turpis element|
00000130  75 6d 2c 20 6c 75 63 74  75 73 20 6d 69 20 61 74  |um, luctus mi at|
00000140  2c 20 69 6e 74 65 72 64  75 6d 20 65 72 61 74 2e  |, interdum erat.|
00000150  20 4d 61 65 63 65 6e 61  73 20 75 74 20 76 65 6e  | Maecenas ut ven|
00000160  65 6e 61 74 69 73 20 6e  75 6e 63 2e 00           |enatis nunc..|
The database starts with a magic value which I have set to LOCDBXX at the moment; we probably need to find a better one. After that there is a version field that we bump if the format changes, so we can make changes to the database format later if we need to.
I consider DDMMYYYY to be fine since we probably won't update this several times a day.
No, this is just a value that will always be the same and identifies the file format. A PDF file always starts with %PDF-1.4 or similar, and there are many other examples, too. We will also need some sort of file extension, or we just use .db.
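For illustration, a reader could validate the magic value right after opening the file, much like PDF readers check the first bytes. This is only a sketch with a made-up function name, using the LOCDBXX placeholder from above:

```c
#include <stdio.h>
#include <string.h>

/* Reads the first bytes of the file and compares them against the
 * magic value. Returns 1 if this looks like a location database,
 * 0 otherwise. */
static int check_magic(FILE* f) {
    char magic[7];

    if (fread(magic, 1, sizeof(magic), f) != sizeof(magic))
        return 0;

    return memcmp(magic, "LOCDBXX", sizeof(magic)) == 0;
}
```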
Then, there is a pointer to a string of the "vendor" of the database. That allows us to set where the database is from and there is also a "description" pointer where we can just write some text and put useful information about the database. The pointers just point to a string in the string pool.
Do we need more like those? I could imagine a license field and a URL where this database came from. But we could put this into the description, too. Thoughts?
Including licence and URL is a good idea in my eyes.
As an extra field?
Then there is an offset where the string pool starts in the file (i.e. at which byte) and the length of the string pool.
There is no trailer yet to make sure that the database hasn't been altered. I guess we should do that with some sort of a signature inside the database, but I haven't really thought about that too much. I want to involve everyone in the process of designing this and have as much peer review as possible, so that we really have a smart and flexible design that won't create any problems in the future.
As far as I am concerned, a SHA2 hash is not sufficient here. We need something like a GPG signature. Can we include that into the database or is it too large?
Yes, certainly we need more than a checksum. We need to prove that we have created this database and the best way to do that is PGP. The signature itself is not too large and it is certainly worth having it.
But I don't want to depend on an external library (yet) to do this. So maybe we should make this optional, so that the library can still be used on embedded devices where code size is a problem. Or we use an RSA signature or something where we can use OpenSSL or libgcrypt, which might already be there. But I haven't investigated what implementation mistakes we could make here.
The library will get some bindings for other programming languages later. That allows us to use it easily in Python, Perl and whatever else people use. I have good experience with writing Python bindings. We will need Perl for using this database in IPFire 2. Anything else should be contributed by third parties.
The library itself is under LGPLv2.1 or later now which allows closed source applications to use it as well.
It has some unit tests to load the database, etc.
So far I am only depending on the C standard library, and it should stay that way for as long as we can, because that makes porting the database to other OSes easier. If we want to use compression we might need other libraries, and of course the language bindings need their respective libraries.
So this is where I am now. I haven't really made a plan for the next steps. So please send me comments and suggestions and I will try to draft out something.
In my eyes, the next step would be gathering the AS/GeoIP data out of the RIPE, ARIN, ... upstream databases. I think it might be good to split that up:
(a) The AS information can be more or less easily collected from the databases. Our job is to write something that does this (I don't think we need some extra special effort here, as we do with GeoIP).
(b) Building a custom GeoIP database consists of three steps:
- scrape the WHOIS entries of all AS and network ranges
- use manually set country code for A[1-3] networks
- scrape the Tor consensus and add relay IPs to the A1 section.
At the moment, I am doing some research on A1 (Anonymous Proxies) networks - it is quite hard to find out which servers are used by VPN providers, since only a few of them use custom ASes or netranges.
That might indeed be their trick :)
If anyone has ideas about how to find these servers (no custom rDNS/netrange), please let me know.
Scraping the Tor consensus is easy since it is public. I suggest adding only relays with static IPs and certain flags (running, valid, fast, stable). Aside from detecting whether an IP address is static or not, this should be an easy task.
Currently, I have not started collecting satellite and anycast networks, but I will do that as soon as I am through with A1.
Best regards, Peter Müller
Best, -Michael
-Michael