Hello development folks,
broken VoIP calls involving VoIP telephone equipment behind an IPFire machine have been an ongoing nuisance for me for years now. While I cannot pinpoint their first occurrence anymore, I recall them happening ever since we moved to Linux 4.14.x - since VoIP is the only technology requiring advanced connection tracking I have in use, there might be more related bugs.
While VoIP calls to my ISP using SIP over UDP and RTP with opportunistic SRTP support enabled worked in most (but not all) cases, using the same equipment to make a phone call via an IPsec VPN between two IPFire machines failed with a probability of 30 to 50 percent per call. The failure mode has always been the same: At least one participant could not hear the other after picking up the phone. Sometimes, both callers could not hear each other.
Initially, I blamed the netfilter ALGs we ship, as they were error-prone and tampered with traffic they should not have tampered with (Arne mentioned the SIP ALG interfered with IPsec traffic as well - for whatever reason it does). Since ALGs do not work on encrypted traffic, switching to SIP over TLS and mandatory SRTP should do the trick, I assumed.
It did not. After running Core Update 155 (where we disabled all ALGs), I recently experienced a broken call again, with SIP over TLS and SRTP in place.
Since I am able to rule out a faulty configuration of the VoIP equipment with a high level of confidence, this leaves me with the suspicion that there is a more fundamental flaw in the Linux 4.14.x connection tracking, causing the establishment of RTP streams to fail sometimes.
Worse, this is not reproducible at all - at least none of my attempts to provoke this failure succeeded. (For the sake of completeness, I should mention that all needed firewall rules are present and no dropped packets were logged. IPS is not triggering, either; at least there are no corresponding log messages in /var/log/suricata/fast.log.) Since the involved IPFire machines handle between 1k and 5k connections at any time, increasing the size of the connection tracking table by running
sysctl net.netfilter.nf_conntrack_max=655360;
seemed useful to me. It did, however, not improve the reliability of VoIP call establishment.
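To check whether the table size was ever a limiting factor, the current number of tracked connections can be compared against the configured limit. The following is only a minimal sketch, assuming the nf_conntrack module is loaded so that the two files under /proc/sys/net/netfilter/ exist; usage_pct is a hypothetical helper name, not something IPFire ships. If usage stays far below the limit, raising nf_conntrack_max would not be expected to change anything, which would match what I observed.

```shell
#!/bin/sh
# Sketch: report how full the connection tracking table is.
# usage_pct is a hypothetical helper, not part of IPFire.
usage_pct() {
    # $1 = current number of entries, $2 = table limit
    # Prints the usage as an integer percentage (rounded down).
    echo $(( $1 * 100 / $2 ))
}

count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max

if [ -r "$count_file" ] && [ -r "$max_file" ]; then
    count=$(cat "$count_file")
    max=$(cat "$max_file")
    echo "conntrack: $count of $max entries ($(usage_pct "$count" "$max")%)"
else
    # Assumption: the files are missing when nf_conntrack is not loaded.
    echo "conntrack sysctl files not readable (module not loaded?)"
fi
```

Running this repeatedly (for example from cron) during busy hours would show whether the table ever gets close to its limit around the time a call fails.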
All in all, this situation is quite unsatisfying. _Something_ in IPFire sometimes messes with RTP streams, without doing so reproducibly, logging anything or being otherwise reasonably debuggable. After Core Update 155, we can strike ALGs off the list of potential failure sources.
I have no idea where - and even how - to look further.
Hopefully Linux 5.x will improve our connection tracking reliability. I am pretty much out of ideas for Linux 4.14.x, though.
Thanks, and best regards, Peter Müller