Wednesday, April 19, 2017

Troubleshooting Netfilter

I'm developing a Linux "Diverter" to handle packets for FakeNet-NG, and I've run into some mind-bending issues. Here is a fun one.

I needed to make FakeNet-NG respond when clients use it as their gateway to talk to arbitrary IP addresses. This is done easily enough: 

iptables -t nat -I PREROUTING -j REDIRECT

At the same time, I needed to make it possible for clients asking for arbitrary ports (where no service was bound), to be redirected to a dummy service. And I needed to write pcaps, produce logging, and allow other on-the-fly decisions to be made. This I did using python-netfilterqueue and dpkt to mangle port numbers on the way in, fix them on the way out, and recalculate checksums as necessary.

These solutions each worked great. But as I learned while demonstrating this functionality, they just didn't work at the same time:

root@ubuntu:/home/mykill# echo fdsa | nc -v 45678
nc: connect to port 45678 (tcp) failed: Connection timed out

I compared pcaps from successful and unsuccessful conversations between the client system and an arbitrary IP address (say, In successful cases (where my packet mangling code was inactive), the FakeNet system responded with whatever IP the client asked to talk to, and the two systems successfully finished the TCP three-way handshake necessary to establish a connection and exchange information. But when my packet mangling code was active, the FakeNet system responded with a SYN/ACK erroneously bearing its own IP address, and the client responded with an RST.

RST is TCP-ese for "Sit down, I wasn't even talking to you."

This behavior led me to the suspicion that my packet mangling activity was preventing the system from recognizing and fixing up response packets so that their IP addresses would match the IP address of the incoming packet (say,

To investigate this, I started by looking at net/netfilter/xt_REDIRECT.c with the goal of learning whether the kernel was using things like the TCP port numbers I was mangling to try to keep track of what packets to fix up. I found that in the case of IPv4, redirect_tg4() calls nf_nat_redirect_ipv4() in nf_nat_redirect.c which unconditionally accesses conntrack information in the skb (short for socket buffer, i.e. the packet), finally calling nf_nat_setup_info() in nf_nat_core.c. The latter function manipulates the destination IP address and calculates a "tuple" and "inverse tuple" that will be used to identify corresponding packets by their endpoint (and other protocol characteristics) and fix up any fields that were mangled by the NAT logic.

I was surprised conntrack was involved because I hadn't needed to use the -m conntrack argument to implement redirection. To confirm what I was seeing, I used lsmod to peek at the dependencies among Netfilter modules. Sure enough, I found that xt_REDIRECT.ko (which implements the REDIRECT target in my iptables rule) relies on nf_nat.ko, which itself relies on nf_conntrack.ko.

I still didn't have the full picture, but it seemed more and more like I was on to something. Perhaps the system was calculating a "tuple" based on the TCP destination port of the incoming packet, my code was modifying the TCP destination port, and then the system was getting a whack at the response packet before I had a chance to fix up its TCP source port to something that would result in a match.

I wanted to figure out when the REDIRECT logic was executing versus when my own logic was executing so I could confirm that hypothesis. While I puzzled over this, I happened upon some relevant documentation that led me to believe I might be correct about the use of TCP ports (rather than, say, socket ownership) to track connections:

Yeah dummy. It's the port.

This documentation also answered my question of when the NAT tuple calculations occur:
Connection tracking hooks into high-priority NF_IP_LOCAL_OUT and NF_IP_PRE_ROUTING hooks, in order to see packets before they enter the system.
These chains were consistent with the registration structures in xt_REDIRECT.c, which further indicated that the hooks were specific to the nat table (naturally):

static struct xt_target redirect_tg_reg[] __read_mostly = {
        .name       = "REDIRECT",
        .family     = NFPROTO_IPV4,
        .revision   = 0,
        .table      = "nat",
        .target     = redirect_tg4,
        .checkentry = redirect_tg4_check,
        .targetsize = sizeof(struct nf_nat_ipv4_multi_range_compat),
        .hooks      = (1 << NF_INET_PRE_ROUTING) |
                      (1 << NF_INET_LOCAL_OUT),
        .me         = THIS_MODULE,

At this point, I really wanted a way to beat Netfilter's OUTPUT/nat hook to the punch. I needed to fix up the source port of the response packet and see if I could induce Netfilter to calculate correct inverse-tuples and fix up the source IPs in my response packets again. But the documentation says Netfilter implements its connection tracking using high-priority hooks in the NF_IP_LOCAL_OUT and NF_IP_PRE_ROUTING chains. That sounds a lot like Netfilter gets first dibs. I sat in uffish thought until I remembered this very detailed diagram (click to enlarge):

You are here. No, wait.

If this graphic was correct, I was about to get in before Netfilter -- by changing my fix-up hook to run in the OUTPUT/raw chain and table. I gave it a shot, and...

root@ubuntu:/home/mykill# echo asdf | nc -v 9999
Connection to 9999 port [tcp/*] succeeded!


It was a hard-fought battle, but I actually was able to confirm my suspicions thanks to the very readable code in the Netfilter portion of the kernel and some very helpful documentation. It's fun to be working on Linux again!

No comments:

Post a Comment