Sunday, April 23, 2017

So, de beency bouncy burger, eh?


A colleague found this decoder called CyberChef, which I wish to bookmark and share with you. You may find it useful, too. I hear it can be downloaded as a standalone web page if you wish to audit it and then use it privately for opsec reasons, offline analysis, etc.




Børk, børk, børk!

Truthful ProgrammEEng

In this article I'll share how to approach complex logic problems by borrowing from the Electrical Engineering (EE) discipline. I'll discuss how I've used techniques from digital logic circuit design to write efficient and accurate code for hypervisor startup and for packet redirection.

This approach can reduce unnecessary nested/multiple conditional statements down to a one-liner. It applies to software engineering scenarios when you can express your problem in terms of a set of strictly Boolean (true/false) conditions.

If you find yourself contemplating a tangled web of conditional (if/else) code or you find it difficult to understand what conditions contribute to the decision you want your software to make, take a step back and consider applying this technique from digital logic circuit design. If nothing else, the first three to five steps will get you on the right track to write the correct conditional logic (in if/else style) to solve your problem. If you can carry this process to its conclusion, then you can arrive at the most optimized one-liner to solve your logic problem.

It consists of the following steps:
  1. Figure out what Boolean (true/false) conditions are necessary to the decision at hand
  2. Define a truth table that specifies all the possible combinations of those conditions
  3. Mark which cases you want to yield a TRUE result
  4. Turn the TRUE cases (EEs call these minterms) into logical expressions
  5. Logical-OR those expressions together to render a decision (EEs call this the canonical sum-of-products, or CSOP, logic function)
  6. Simplify if possible

Here's a hypervisor programming example followed by a network filtering example to demonstrate the process.

Starting a VT-x Hypervisor


I needed to write startup code for a tiny hypervisor project using Intel Virtualization Technology Extensions (VT-x). I knew nothing about VT-x, but I did know that Intel documented the corresponding VMX instruction set and all its requirements in their processor manual, so I started there. Most of the publicly available code I could get my hands on seemed to check one-off control flags to determine whether it was okay to enter VMX root operation, ignoring the exact specifications that Intel provided in more modern documentation. I decided to implement my own check based on the data sheet.

The Intel manual describes some requirements for control registers CR0 and CR4 to allow the VMXON instruction to work. Here is some not-very-succinct text from the Intel manual (in case this makes your eyes bleed, I'll provide a more succinct summary next):

Clear as mud.

In summary, there is a FIXED0 and a FIXED1 model-specific register (MSR) for CR0, and another pair of these for CR4. The most important part of the text above is the last sentence starting with "Thus, each bit...". This amounts to the following possible values of FIXED0:FIXED1 along with their corresponding meaning:


F0  F1  CR must be
0   0   fixed to 0
1   1   fixed to 1
0   1   don't care

Notice that one case is missing from Intel's description: $F0:F1 = 10$. When we create a truth table including $CR$ as an input, this will amount to two missing / don't-care cases (one for $CR=0$ and another for $CR=1$).

So, expanding this out to take into account the two possible values of any bit in the control register we are checking, we have the following truth table for the possible values of $F0:F1$ and $CR$:


F0  F1  CR  Invalid
0   0   0   FALSE
0   0   1   TRUE (control register bit should be 0 but is 1)
1   1   0   TRUE (control register bit should be 1 but is 0)
1   1   1   FALSE
0   1   0   FALSE
0   1   1   FALSE

Since there are two cases in which a given control register bit can be in an invalid state, there are two logical cases, or minterms, either of which should result in a verdict of true (i.e. it is true that the control register is in an invalid state). Here's the same table as above, formatted a little more like a truth table, with the minterms called out.


F0  F1  CR  Invalid  Minterm
0   0   0   0
0   0   1   1        $\overline{F0}\cdot\overline{F1}\cdot CR$
1   1   0   1        $F0 \cdot F1 \cdot \overline{CR}$
1   1   1   0
0   1   0   0
0   1   1   0

This amounts to the following sum of products logic function:

$$Invalid(F0,F1,CR)=\overline{F0} \cdot \overline{F1} \cdot CR+F0 \cdot F1 \cdot \overline{CR}$$

Since the same Boolean logic applies independently to every bit position in a bit vector, we can evaluate all the bits of these registers at once with bitwise operators. In C, each negated variable (e.g. $\overline{F0}$) amounts to inverting all the bits of the corresponding integer, so the code looks like this:


int is_invalid(int fixed0, int fixed1, int cr)
{
    /* Nonzero if any bit of cr violates its FIXED0/FIXED1 constraint */
    return ((~fixed0 & ~fixed1 & cr) | (fixed0 & fixed1 & ~cr));
}

Boolean algebraic simplification can be applied to optimize this function to result in fewer logical operators, and fewer instruction opcodes emitted by the compiler. Optimization is left as an exercise for the student (hint: XOR is involved) ;-)
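If you want to check your answer afterwards, here is one equivalent form (just a sketch; the name is_invalid_xor is mine). Expanding the product of XORs gives back exactly the two minterms above:

int is_invalid_xor(int fixed0, int fixed1, int cr)
{
    /* (F0 XOR CR) AND (F1 XOR CR) expands to ~F0·~F1·CR + F0·F1·~CR */
    return ((fixed0 ^ cr) & (fixed1 ^ cr));
}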

Redirecting Network Traffic


More recently, I've been writing logic for the Linux Diverter in FakeNet-NG. This project is the successor to the original FakeNet tool distributed by Siko and Tank to let people simulate a network in a single virtual machine. The NG version was originally developed to make it possible to use FakeNet on newer versions of Windows, but it would be nice for it to work on Linux, too.

So, I am writing a Linux "Diverter", which is a component that manages how packets are redirected and manipulated to simulate the network. This is a really fun project in which I'm using python-netfilterqueue to catch and redirect packets so we can observe traffic sent to any port in the system, even ports where no service is currently bound.

The specification for the Linux Diverter includes redirecting traffic sent to unbound ports into a single listener (similar to the INetSim dummy listener). We want the Linux version of FakeNet-NG to be used in two modes:
  • SingleHost: the malware and the network simulation tool are running on the same machine, like the legacy version of FakeNet.
  • MultiHost: the malware and the network simulation tool are running on different machines, like INetSim.
The conditions for whether a packet must be redirected boil down to a logic function of four inputs:

  • (A) Is the source address local?
  • (B) Is the destination address local?
  • (C) Is the source port bound by a FakeNet-NG listener?
  • (D) Is the destination port bound by a FakeNet-NG listener?

The criteria for changing the destination port of a packet are as follows:
  1. When using FakeNet-NG in MultiHost mode (like INetSim), if a foreign host is trying to talk to us or any other host, ensure that unbound ports get redirected to a listener.
  2. When using FakeNet-NG in SingleHost mode (like legacy FakeNet), ensure outbound packets are redirected except when the packet is a response from a FakeNet-NG listener (in other words, when the packet originated from a bound port).
The truth table for a decision function that redirects traffic destined for local, unbound ports to a single dummy listener in both SingleHost and MultiHost scenarios is as follows, with local IPs and bound ports in bold:


Sixteen cases

Here, we translated all the different human-comprehensible conditions like "it's coming from a foreign host, don't care what port, and it's arriving at the local host to an unbound port" into a set of Boolean conditions that can be either true (1) or false (0). It's convenient to use letters like A, B, C, and D to represent the inputs (see the first row of the table above). The minterms of this function are the cases when it returns true, namely:


  1. $\overline{A} \cdot \overline{B} \cdot \overline{C} \cdot \overline{D}$
  2. $\overline{A} \cdot \overline{B} \cdot C \cdot \overline{D}$
  3. $\overline{A} \cdot B \cdot \overline{C} \cdot \overline{D}$
  4. $\overline{A} \cdot B \cdot C \cdot \overline{D}$
  5. $A \cdot \overline{B} \cdot \overline{C} \cdot \overline{D}$
  6. $A \cdot B \cdot \overline{C} \cdot \overline{D}$

We can see that a few terms can be simplified, such as (1) and (2). Since A, B, and D are false in both while C can be either true or false, these two really amount to $\overline{A} \cdot \overline{B} \cdot \overline{D}$. Doing the same for the other pairs, we arrive at a fairly simple function of three disjunctive terms:


  • $\overline{A} \cdot \overline{B} \cdot \overline{D}$
  • $\overline{A} \cdot B \cdot \overline{D}$
  • $A \cdot \overline{C} \cdot \overline{D}$
From there, you can just get rid of B and collapse the first two cases into one:

  • $\overline{A} \cdot \overline{D}$
  • $A \cdot \overline{C} \cdot \overline{D}$
And from there, you can take one step further by looking more closely at the truth table to find the optimal logic. But there are six minterms here, so it might be time to bust out a more powerful tool: the ol' Karnaugh map. A K-map allows you to visually locate all the adjacent groups (pairs, quads, etc.) of logical "yes" outputs and arrive at the simplest possible logic function to yield the desired output. If you're not familiar, you should definitely check this out.

Here's the K-map for our logic function:


Short and sweet: two disjunctive terms in three variables
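
In case the image doesn't load for you, here is the same map transcribed from the minterms listed above (a text rendering, not the original graphic): rows are AB, columns are CD, both in Gray-code order, with a 1 marking each minterm. Remember that K-map edges wrap around, so the CD=00 and CD=10 columns are adjacent.

        CD=00  CD=01  CD=11  CD=10
AB=00     1      0      0      1
AB=01     1      0      0      1
AB=11     1      0      0      0
AB=10     1      0      0      0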

The K-map here has only two groups of adjacent minterms: those that occur when $A$ and $D$ are zero, and those that occur when $C$ and $D$ are zero. This results in two disjunctive terms, each requiring only two input variables (three distinct variables in total):
  • $\overline{A} \cdot \overline{D}$
  • $\overline{C} \cdot \overline{D}$
So, the minimal sum of products logic function would be:

$$R(A, B, C, D) = \overline{A}\overline{D}+\overline{C}\overline{D}$$
Or, in Python:


        return ((not src_local and not dport_bound) or
                (not sport_bound and not dport_bound))


Done and done.
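
If you'd rather trust a brute-force check than a K-map, here's a throwaway sketch (plain C, not FakeNet-NG code) that confirms the two-term function agrees with the original six minterms on all sixteen input combinations:

#include <stdio.h>

int main(void)
{
    /* A = src addr local, B = dst addr local,
       C = src port bound,  D = dst port bound */
    for (int i = 0; i < 16; i++) {
        int A = (i >> 3) & 1, B = (i >> 2) & 1, C = (i >> 1) & 1, D = i & 1;
        int full    = (!A && !B && !C && !D) || (!A && !B &&  C && !D) ||
                      (!A &&  B && !C && !D) || (!A &&  B &&  C && !D) ||
                      ( A && !B && !C && !D) || ( A &&  B && !C && !D);
        int minimal = (!A && !D) || (!C && !D);
        printf("A=%d B=%d C=%d D=%d  full=%d  minimal=%d%s\n",
               A, B, C, D, full, minimal,
               full == minimal ? "" : "  MISMATCH");
    }
    return 0;
}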


The next time you find yourself rolling around in freaky nested conditionals and arriving at the wrong logic, try approaching it the way an electrical engineer would: express the problem in terms of a set of Boolean conditions and then figure out how you want each case to go. If you want the simplest result, you can use Boolean algebra or learn to use a K-map to handle the rest. But even if you stop at the truth table, at least this process will help you sort out logic that you might otherwise be confused about and ensure that you know all the test cases you'll need to include to properly test your code.

MathJax

I fixed MathJax for mobile on my blog. I also made it retrieve the scripts via cloudflare over https rather than in the clear from mathjax.org, in case any of my readers live in repressive regimes that punish people for reading beautifully rendered equations.

This might be useful to other blogger.commers who like LaTeX/MathJax and want to see it render on the mobile version of their blog:
http://stackoverflow.com/questions/42592013/blog-with-mathjax-seen-on-a-cellphone

Okay. Carry on.

Saturday, April 22, 2017

Chasing Heisenbugs: Troubleshooting Frustratingly Unresponsive Systems

I did some work in January 2016 on automated performance profiling and diagnosis. As @arvanaghi pointed out, this can be useful for investigating observables resulting from potentially malicious activity. So, I'm figuring out where I left off by turning my documentation into a blog post. What I wrote is pretty stuffy, but since I am sharing it in blog format, I will take some artistic license here and there. Without further ado, I present to you:

I'm just going to paste the whole thing here and draw emoticons on it

Scope

Several challenges make performance profiling and diagnosis of deployed applications a difficult task:
  • Difficulty reproducing intermittent issues
  • Investigation inhibited by user interface (UI) latency due to resource consumption
  • Unavailability of appropriate diagnostic tools on user machines
  • Inability of laypeople to execute technically detailed diagnostic tasks
  • Unavailability of live artifacts in cases of dead memory dump analysis
My studies have included the following two types of disruptive resource utilization:
  • User interface latency due to prolonged high CPU utilization
  • User interface latency due to prolonged high I/O utilization
I'll just be talking about CPU here.

Where applicable, this article will primarily discuss automated means for solving the above problems. Tools can be configured to trigger only when issues appear that are specific to the software in question and that the diagnostic tooling is meant to address. Where feasible, I will share partial software specifications, rapidly prototyped proofs of concept (PoCs), empirical results, and discussion of considerations relevant to production environments.

Prolonged High CPU Utilization

Automated diagnostics in general can be divided into two classes: (1) those that are meant to identify conditions of interest (e.g. high CPU utilization); and (2) those that are meant to collect diagnostic information relevant to that condition. Each is treated subsequently.

Identifying Conditions of Interest

For purposes of this discussion, two classes of CPU utilization will be used: single-CPU percent utilization and system-wide percent CPU utilization. Single-CPU percent utilization is defined as the percentage of time spent in both user and kernel mode (Equation 1); system-wide CPU utilization is defined as the same figure divided by the number of logical CPUs in the system (Equation 2). For example, if a process uses 100% of a single logical CPU in a four-CPU system, its system-wide CPU utilization is 25%. System-wide CPU utilization is the figure displayed by applications such as taskmgr.exe and tasklist.exe.

$$u_1=\frac{\Delta t_{user} + \Delta t_{kernel}}{\Delta t}$$
(Eq. 1)

$$u_2=\frac{u_1}{n_{cpus}}$$
(Eq. 2)

High CPU utilization can now be defined from the perspective of user experience. A single-threaded application can consume at most 100% of a single CPU (e.g. at most 50% of system-wide CPU resources on a two-CPU system). Multi-threaded applications have a much higher potential impact on whole-system CPU utilization because they can create enough CPU-intensive threads to run all logical CPUs at 100%. For purposes of this article, 85% system-wide CPU utilization or greater will constitute high CPU utilization.

As for prolonged high CPU utilization, that is a subjective matter. From a user experience perspective, this can vary depending upon the user. For purposes of this article, high CPU utilization lasting 5 seconds or greater will be considered to be prolonged high CPU utilization. In practice, engineers might also need to consider how to classify and measure spikes in CPU utilization that occur frequently but for a shorter time than might constitute "prolonged" high CPU utilization; however, these considerations are left out of the scope of this article.

I've implemented a Windows PoC (trigger.cpp) to assess the percent CPU utilization (both single-CPU and system-wide) for a given process. I don't know of any APIs that report process-wide or thread-specific CPU utilization directly, but Windows does expose the GetProcessTimes() API (and its per-thread sibling, GetThreadTimes()), which reports how much time a process or thread has spent executing in user and kernel mode over its life. I've used this to measure the change in kernel and user execution times against the progression of real time as measured using the combination of the QueryPerformanceCounter() and QueryPerformanceFrequency() functions. Figure 1 shows the PoC in operation providing processor utilization updates that closely track the output of Windows Task Manager. The legacy SysInternals CPUSTRES.EXE tool was used to exercise the PoC.

Fig. 1: CPU utilization tool
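
For the curious, the core of that measurement fits in a few API calls. Here's a minimal sketch of the idea (not trigger.cpp itself; error handling and the monitoring loop are omitted):

#include <windows.h>

static ULONGLONG filetime_to_100ns(const FILETIME *ft)
{
    ULARGE_INTEGER u;
    u.LowPart  = ft->dwLowDateTime;
    u.HighPart = ft->dwHighDateTime;
    return u.QuadPart;
}

/* Sample a process's user+kernel time across an interval and compute
 * single-CPU (Eq. 1) and system-wide (Eq. 2) utilization. */
double sample_utilization(HANDLE hProcess, DWORD interval_ms, double *system_wide)
{
    FILETIME ftCreate, ftExit, ftKernel0, ftUser0, ftKernel1, ftUser1;
    LARGE_INTEGER freq, t0, t1;
    SYSTEM_INFO si;

    QueryPerformanceFrequency(&freq);
    GetSystemInfo(&si);

    GetProcessTimes(hProcess, &ftCreate, &ftExit, &ftKernel0, &ftUser0);
    QueryPerformanceCounter(&t0);
    Sleep(interval_ms);
    GetProcessTimes(hProcess, &ftCreate, &ftExit, &ftKernel1, &ftUser1);
    QueryPerformanceCounter(&t1);

    /* CPU seconds consumed by the process during the interval (FILETIME is in 100ns units) */
    double cpu_s  = (double)((filetime_to_100ns(&ftKernel1) - filetime_to_100ns(&ftKernel0))
                           + (filetime_to_100ns(&ftUser1)   - filetime_to_100ns(&ftUser0))) / 1e7;
    /* Wall-clock seconds elapsed during the interval */
    double wall_s = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;

    double u1 = cpu_s / wall_s;                       /* Eq. 1: single-CPU utilization */
    if (system_wide)
        *system_wide = u1 / si.dwNumberOfProcessors;  /* Eq. 2: system-wide utilization */
    return u1;
}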

There's one more thing to think about. If a diagnostic utility is executed indefinitely, it would be nice to make it consolidate successive distinct CPU utilization events into a single diagnostic event.

For example, the CPU utilization graph in Figure 2 below depicts a single high-CPU event lasting from $t=50s$ through $t=170s$. Although there are two dips in utilization around $t=110s$ and $t=150s$, this would likely be considered a single high-CPU event from both an engineering and a user experience perspective. Therefore, rather than terminating and restarting monitoring to record two events, a more coherent view might be obtained by recording a single event.

Fig. 2: Single high CPU utilization event with two short dips

A dip in utilization might also represent a transition from one phase of high CPU utilization to another, in which the target process performs markedly different activities than prior to the dip. This information can be preserved within a single diagnostic event for later identification provided that sufficiently robust data are collected.

One way to prevent continual collection of instantaneous events and to coalesce temporally connected events is to define a finite state machine that implements hysteresis. Thresholds can be defined and adhered to in order to satisfy both the definition of "prolonged" high CPU utilization and the requirement that diagnostic information is not collected multiple times for a single "event". Such a state machine could facilitate a delay before diagnostics are terminated and launched again, which can in turn prevent the processing, storage, and/or transmission of excessive diagnostic reports representing a single event. Figure 3 depicts a finite state machine (FSM) for determining when to initiate and terminate diagnostic information collection.

Fig. 3: It's an FSM. Barely.

The state machine in Figure 3 above would be evaluated every time diagnostic information is sampled, and would operate as follows:
  1. The machine begins in the Normal state.
  2. Normal state is promoted to High CPU at any iteration when CPU utilization exceeds the threshold (85% for purposes of this article).
  3. When the state is High CPU, it can advance to Diagnosis or regress to Normal, as follows:
    1. If CPU utilization falls back below the threshold before the threshold number of seconds has elapsed, then this does not qualify as a "prolonged" high CPU utilization event, and state is demoted to Normal;
    2. If CPU utilization remains above the threshold utilization value for the threshold number of seconds, then diagnostics are initiated and state is promoted to Diagnosis.
  4. The Diagnosis state is used in combination with the Normal-Wait state to avoid continually repeating the diagnostic process over short periods. When the state is Diagnosis, it can either advance to Normal-Wait, or remain at Diagnosis. The condition for advancing to Normal-Wait is realized when CPU utilization for the target process falls below the threshold value.
  5. When the state is Normal-Wait, the next transition can be either a regression to Diagnosis, no change, or an advance back to the Normal state:
    1. If the CPU utilization of the target process returns to high utilization before the time threshold expires, the wait timer is reset and the state regresses to Diagnosis. In this case, diagnostic information collection continues.
    2. If CPU utilization remains low but the threshold duration has not elapsed, the state machine remains in Normal-Wait.
    3. If the CPU utilization of the target process remains below the threshold value for the threshold duration, then diagnostics are terminated, the state transitions back to Normal, and the machine can return to a state in which it may again consider escalating to further diagnostics of subsequent events.
The accuracy of this machine in identifying the correct conditions to initiate and terminate diagnostic information collection could be improved by incorporating fuzzy evaluation of application conditions, such as by using a moving average of CPU utilization or by omitting statistical outliers from evaluation. Other definitions, thresholds, and behaviors described above may be refined further depending upon the specific scenario. Such refinements are beyond the scope of this brief study.
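
To make the hysteresis concrete, here's a minimal sketch of a transition function for the machine in Figure 3. The threshold names, the cooldown value, and the assumption that the caller tracks time spent in the current state are placeholders of mine, not details from the PoC:

typedef enum { NORMAL, HIGH_CPU, DIAGNOSIS, NORMAL_WAIT } state_t;

#define UTIL_THRESHOLD  0.85   /* "high": 85% system-wide CPU utilization */
#define HIGH_SECS       5.0    /* "prolonged": high for at least 5 seconds */
#define COOLDOWN_SECS   5.0    /* hysteresis before closing out an event (assumed value) */

/* Evaluated once per sample. in_state_secs is how long we have been in the
 * current state; the caller resets it to zero on every state change.
 * Diagnostic collection starts on the transition into DIAGNOSIS and stops
 * on the transition from NORMAL_WAIT back to NORMAL. */
state_t next_state(state_t s, double util, double in_state_secs)
{
    switch (s) {
    case NORMAL:
        return (util >= UTIL_THRESHOLD) ? HIGH_CPU : NORMAL;
    case HIGH_CPU:
        if (util < UTIL_THRESHOLD)
            return NORMAL;                                  /* not "prolonged": demote */
        return (in_state_secs >= HIGH_SECS) ? DIAGNOSIS : HIGH_CPU;
    case DIAGNOSIS:
        return (util < UTIL_THRESHOLD) ? NORMAL_WAIT : DIAGNOSIS;
    case NORMAL_WAIT:
        if (util >= UTIL_THRESHOLD)
            return DIAGNOSIS;                               /* same event resumed: keep collecting */
        return (in_state_secs >= COOLDOWN_SECS) ? NORMAL : NORMAL_WAIT;
    }
    return s;
}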

Collecting Diagnostic Information

When prolonged high CPU utilization occurs, the high-level question on the user's mind is: WHAT THE HECK IS MY COMPUTER DOINGGGGGG????!!

And to answer this question, we can investigate where the application is spending its time. Which, incidentally, is available to us through exposed OS APIs.

To address where the application is spending its time, two pieces of information are relevant:
  1. What threads are consuming the greatest degree of CPU resources, and
  2. What instructions are being executed in each thread?
This information may allow engineers to identify which threads and subroutines or basic blocks are consuming the most significant CPU resources. In order to obtain this information, an automated diagnostic application must first enumerate and identify running threads. Because threads may be created and destroyed at any time, the diagnostic application must continually obtain the list of threads and then collect CPU utilization and instruction pointer values per thread. The result may be that threads appear and disappear throughout the timeline of the diagnostic data.
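
One way to grab the thread list on each sampling pass is a Toolhelp32 snapshot. Here's a minimal sketch of that approach (my assumption; the actual PoC may enumerate threads differently):

#include <windows.h>
#include <tlhelp32.h>
#include <stdio.h>

/* Enumerate the threads belonging to a target process ID. */
void list_threads(DWORD pid)
{
    HANDLE hSnap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (hSnap == INVALID_HANDLE_VALUE)
        return;

    THREADENTRY32 te;
    te.dwSize = sizeof(te);
    if (Thread32First(hSnap, &te)) {
        do {
            if (te.th32OwnerProcessID == pid)
                printf("thread %lu\n", te.th32ThreadID);
        } while (Thread32Next(hSnap, &te));
    }
    CloseHandle(hSnap);
}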

Ideally, output would be to a binary or XML file that could be loaded into a user interface for coherent display and browsing of data. In this study and the associated PoC, information will be collected over a fixed number of iterations (i.e. 100 samples) and displayed on the console before terminating.

For purposes of better understanding the origin of each thread, it can be useful to obtain module information and determine whether the thread entry point falls within the base and end address of any module. If it does, then a slightly more informative name, such as modname.dll+0xNNNN, can be displayed. Note that I said slightly more informative. Sometimes this just points to a C runtime thread start stub. But it's still worth having.

In the PoC, data is displayed by sorting the application's threads by percent utilization and displaying the most significant offenders last. Figure 4 shows threads from a legacy SysInternals CPU load simulation tool, CPUSTRES.EXE, sorted in order of CPU utilization.

Fig. 4: Threads sorted in ascending order of CPU utilization

Although this answers the high-level question of what the program is doing (i.e., executing the routine located at CPUSTRES.EXE+0x1D7B in two separate threads), it does not indicate specifically what part of each routine is executing at a given time.

To answer more specific questions about performance, two techniques are available:
  • Execution tracing
  • Instruction pointer sampling
Execution tracing can be implemented to observe detailed instruction execution either by single-stepping via the Windows debug APIs or by using processor-specific tracing facilities. Instruction pointer sampling, on the other hand, can be implemented quickly, albeit at a cost to diagnostic detail. Even so, this method offers improved performance since single-stepping is ssssllllloooowwwwwwwwwwww.

This PoC (inspect.cpp) implements instruction pointer sampling by suspending each thread with the SuspendThread() function and obtaining the control portion of the associated thread context via the GetThreadContext() function. Figure 5 depicts the PoC enumerating instruction pointers for several threads within taskmgr.exe. Notably, thread 3996 is executing a different instruction in sample 7 than it is in sample 8, whereas most threads in the output remain at the same instruction pointer across various samples, perhaps waiting on blocking Windows API calls.

Fig. 5: Instruction pointer for thread 3996
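
The sampling step itself is only a few API calls. Here's a sketch of the approach (not the actual inspect.cpp; 64-bit only, and a real tool would also need to handle WOW64 threads):

#include <windows.h>

/* Suspend one thread, read the control portion of its context, and record
 * its instruction pointer. */
BOOL sample_rip(DWORD tid, DWORD64 *rip)
{
    BOOL ok = FALSE;
    HANDLE hThread = OpenThread(THREAD_SUSPEND_RESUME | THREAD_GET_CONTEXT, FALSE, tid);
    if (!hThread)
        return FALSE;

    if (SuspendThread(hThread) != (DWORD)-1) {
        CONTEXT ctx;
        ctx.ContextFlags = CONTEXT_CONTROL;   /* only the control registers (RIP, RSP, ...) */
        if (GetThreadContext(hThread, &ctx)) {
            *rip = ctx.Rip;
            ok = TRUE;
        }
        ResumeThread(hThread);
    }

    CloseHandle(hThread);
    return ok;
}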

This information can provide limited additional context as to what the application is doing. More useful information might include a stack trace (provided frame pointers are not omitted for anti-reverse engineering or optimization reasons).

Conclusion

It's nice to be able to identify and collect information about events like this. But CPU utilization is only one part of the problem, and there are other ways of detecting UI latency than measuring per-process statistics. Also, what of the inherent instrumentation available for collecting diagnostic information? What of kernrate, which is mentioned in the Windows Internals book and is covered here? It looks as if this can be used instead of custom diagnostics, as long as there are sufficient CPU resources to launch it (either by starting a new process or by invoking the APIs that it uses to initiate logging). Would kernrate.exe (or the underlying APIs) suffer too much resource starvation to be useful under the automatically detected conditions I outlined above? In addition to this, what ETW information might give us a better glimpse into what is happening when a system becomes frustratingly unresponsive?

These are the questions I want to dig into to arrive at a fully automated system for pinpointing the reasons for slowness and producing human-readable explanations for what is happening.

Wednesday, April 19, 2017

Troubleshooting Netfilter

I'm developing a Linux "Diverter" to handle packets for FakeNet-NG, and I've run into some mind-bending issues. Here is a fun one.

I needed to make FakeNet-NG respond when clients use it as their gateway to talk to arbitrary IP addresses. This is done easily enough: 

iptables -t nat -I PREROUTING -j REDIRECT

At the same time, I needed to make it possible for clients asking for arbitrary ports (where no service was bound) to be redirected to a dummy service. And I needed to write pcaps, produce logging, and allow other on-the-fly decisions to be made. This I did using python-netfilterqueue and dpkt to mangle port numbers on the way in, fix them up on the way out, and recalculate checksums as necessary.

These solutions each worked great. But as I learned while demonstrating this functionality, they just didn't work at the same time:

root@ubuntu:/home/mykill# echo fdsa | nc -v 5.5.5.5 45678
nc: connect to 5.5.5.5 port 45678 (tcp) failed: Connection timed out

I compared pcaps from successful and unsuccessful conversations between the client system and an arbitrary IP address (say, 5.5.5.5). In successful cases (where my packet mangling code was inactive), the FakeNet system responded from whatever IP address the client had asked to talk to, and the two systems successfully finished the TCP three-way handshake necessary to establish a connection and exchange information. But when my packet mangling code was active, the FakeNet system responded with a SYN/ACK erroneously bearing its own IP address, and the client responded with an RST.

RST is TCP-ese for "Sit down, I wasn't even talking to you."

This behavior led me to the suspicion that my packet mangling activity was preventing the system from recognizing and fixing up response packets so that their IP addresses would match the IP address of the incoming packet (say, 5.5.5.5).

To investigate this, I started by looking at net/netfilter/xt_REDIRECT.c with the goal of learning whether the kernel was using things like the TCP port numbers I was mangling to try to keep track of what packets to fix up. I found that in the case of IPv4, redirect_tg4() calls nf_nat_redirect_ipv4() in nf_nat_redirect.c which unconditionally accesses conntrack information in the skb (short for socket buffer, i.e. the packet), finally calling nf_nat_setup_info() in nf_nat_core.c. The latter function manipulates the destination IP address and calculates a "tuple" and "inverse tuple" that will be used to identify corresponding packets by their endpoint (and other protocol characteristics) and fix up any fields that were mangled by the NAT logic.

I was surprised conntrack was involved because I hadn't needed to use the -m conntrack argument to implement redirection. To confirm what I was seeing, I used lsmod to peek at the dependencies among Netfilter modules. Sure enough, I found that xt_REDIRECT.ko (which implements the REDIRECT target in my iptables rule) relies on nf_nat.ko, which itself relies on nf_conntrack.ko.

I still didn't have the full picture, but it seemed more and more like I was on to something. Perhaps the system was calculating a "tuple" based on the TCP destination port of the incoming packet, my code was modifying the TCP destination port, and then the system was getting a whack at the response packet before I had a chance to fix up its TCP source port to something that would result in a match.

I wanted to figure out when the REDIRECT logic was executing versus when my own logic was executing so I could confirm that hypothesis. While I puzzled over this, I happened upon some relevant documentation that led me to believe I might be correct about the use of TCP ports (rather than, say, socket ownership) to track connections:

Yeah dummy. It's the port.

This documentation also answered my question of when the NAT tuple calculations occur:
Connection tracking hooks into high-priority NF_IP_LOCAL_OUT and NF_IP_PRE_ROUTING hooks, in order to see packets before they enter the system.
These chains were consistent with the registration structures in xt_REDIRECT.c, which further indicated that the hooks were specific to the nat table (naturally):

static struct xt_target redirect_tg_reg[] __read_mostly = {
 .
 .
 .
    {
        .name       = "REDIRECT",
        .family     = NFPROTO_IPV4,
        .revision   = 0,
        .table      = "nat",
        .target     = redirect_tg4,
        .checkentry = redirect_tg4_check,
        .targetsize = sizeof(struct nf_nat_ipv4_multi_range_compat),
        .hooks      = (1 << NF_INET_PRE_ROUTING) |
                      (1 << NF_INET_LOCAL_OUT),
        .me         = THIS_MODULE,
    },
};

At this point, I really wanted a way to beat Netfilter's OUTPUT/nat hook to the punch. I needed to fix up the source port of the response packet and see if I could induce Netfilter to calculate correct inverse-tuples and fix up the source IPs in my response packets again. But the documentation says Netfilter implements its connection tracking using high-priority hooks in the NF_IP_LOCAL_OUT and NF_IP_PRE_ROUTING chains. That sounds a lot like Netfilter gets first dibs. I sat in uffish thought until I remembered this very detailed diagram (click to enlarge):

You are here. No, wait.

If this graphic was correct, I was about to get in before Netfilter -- by changing my fix-up hook to run in the OUTPUT/raw chain and table. I gave it a shot, and...

root@ubuntu:/home/mykill# echo asdf | nc -v 5.5.5.5 9999
Connection to 5.5.5.5 9999 port [tcp/*] succeeded!
asdf

VICTORY!

It was a hard-fought battle, but I actually was able to confirm my suspicions thanks to the very readable code in the Netfilter portion of the kernel and some very helpful documentation. It's fun to be working on Linux again!