Bailey's on the Rocks

Snooping Again

2019-07-19T01:23:00.003-07:00

Back At It

As a red teamer in 2016, I found it painful to dredge up command, output, and IP address details about red team operations that had already concluded.

In 2016, I wrote about researching and developing a Microsoft Detours DLL to hook cmd.exe and its subordinate processes and log console I/O with a few system details. Cool project, but mostly experimental with several deficiencies.

But I'm writing to share that I've improved the tool, learned a few things, and am releasing the updated tool on GitHub for your red teaming pleasure.

Deficiencies, You Say?

When I first wrote this, Detours Express was only licensed for non-commercial use on 32-bit platforms. It was possible to use my DLL with the 32-bit cmd.exe, but if a 64-bit executable was invoked, it couldn't execute with the hooking DLL attached. My DLL also created a new log per process, which was inconvenient to review. Lastly, some CLI-driven tools including PowerShell use different Windows APIs to interact with the console, so their output never made it into the logs.

Two events motivated me to revisit and rectify these.

First, as of April 23, 2018, Galen Hunt and the Detours Team indicated that Detours 4.0.1 was freely available on GitHub, supporting "x86, x64 and other Windows-compatible processors (IA64 and ARM). It includes support for either 32-bit or 64-bit processes." This is great news! It allows malware analysts, red teamers, researchers, and developers to instrument and extend many Windows userspace applications almost arbitrarily without any licensing encumbrances. I'm using my old project to extol the joys of using Detours on both x86 and x64 for free.

Second, in a 2019 discussion, I overheard some red teamers wishing for a way to log time/date stamped console output, and those same red teamers also indicated it would be helpful to have IP address information available for the same reasons as I stated above. Cmder is a powerful console with logging, but I don't believe it has time/date stamps or IP address logging. Furthermore, I think its output contains ANSI color escape sequences that hinder those logs from being readily reviewed and presented.

Bringing it Up to Date

In the aforementioned blog post, I wrote in detail about using the Microsoft Detours traceapi sample to observe API usage, form hypotheses, and arrive at discoveries about how one might hook and modify behavior. I did the same thing to learn that PowerShell uses ReadConsoleInput and WriteConsoleOutput to do its work instead of simple ReadConsole/WriteConsole.

I furthermore read the documentation for DetourCreateProcessWithDlls and used SysInternals' Process Monitor to figure out how to let Detours use a rundll32 helper process to load the correct architecture of my DLL into new subprocesses.

And finally, I arrived at a scheme that uses environment variables to establish a single log file path for a given command interpreter and its subprocesses to write to. Consequently, one may look in a single log file to review the command line session.
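
Here is a minimal sketch of that idea in Python. The tool itself is a Detours DLL, so this is only an illustration of the scheme, and the variable name CMDLOG_PATH and both helper names are hypothetical, not taken from the project. The first process in a session picks a log path and publishes it in its environment; child processes inherit the variable and append to the same file.

```python
import os
import tempfile
import time

ENV_VAR = "CMDLOG_PATH"  # hypothetical variable name, for illustration only


def session_log_path(environ=os.environ):
    """Return the log path for this session, choosing one on first use.

    The first process to call this picks a path and stores it in the
    environment; child processes inherit it and reuse the same file.
    """
    path = environ.get(ENV_VAR)
    if path is None:
        name = "cmdlog_%d_%d.log" % (int(time.time()), os.getpid())
        path = os.path.join(tempfile.gettempdir(), name)
        environ[ENV_VAR] = path
    return path


def log_line(text, environ=os.environ):
    """Append one time-stamped line to the shared session log."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(session_log_path(environ), "a") as f:
        f.write("[%s] %s\n" % (stamp, text))
```

Because the path travels in the environment, nothing else has to coordinate between the interpreter and its subprocesses.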

Oh, and in response to some friendly suggestions, I now crudely prevent the IP address information from being displayed for each and every command entered, provided it has not changed within a given CLI session.
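
The "only show it when it changes" logic can be sketched roughly like this in Python. The class name, the log line prefixes, and the way addresses are passed in are all assumptions for illustration; the real tool does this inside the hooking DLL.

```python
class IpAnnotator:
    """Emit the IP address list only when it differs from the last one logged."""

    def __init__(self):
        self._last = None  # nothing logged yet in this session

    def annotate(self, command, ip_addresses):
        """Return the log lines for one command.

        ip_addresses is whatever the caller gathered for this command;
        how it is gathered is outside the scope of this sketch.
        """
        lines = []
        current = tuple(sorted(ip_addresses))
        if current != self._last:
            lines.append("IP: " + ", ".join(current))
            self._last = current
        lines.append("CMD: " + command)
        return lines
```

The first command in a session always gets the address line, and later commands get it again only if the set of addresses has changed.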

Here's some log output so you can see roughly what it looks like.

Nested logging of cmd.exe/powershell.exe

Deficiencies that Will Remain

Alas, various CLI applications mix and match line endings, resulting in stray ^M characters. Perhaps smart I/O transformation on the part of my logging apparatus could eliminate this, but I find it simple and convenient to post-process the log files. For example, Vim has the following edit mode command:

:%s/^M//g
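
If you'd rather script the cleanup than open each log in Vim, a rough Python equivalent of that substitution follows. The function name and the paths you pass to it are placeholders. Like the Vim command, it removes every carriage return, not just the ones at line ends.

```python
def strip_carriage_returns(src_path, dst_path):
    """Copy src_path to dst_path with every carriage return (^M) removed."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(data.replace(b"\r", b""))
```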

What if Cmder Just Adds These Features?

I'm certainly not competing with the fine folks who write Cmder. If they take the trouble to add and support the relevant state logic and configuration settings to achieve these same ends, that would be pretty cool and useful. My project remains a fun study into how to research and instrument programs, and a nice example of the usefulness of mixed-architecture Detours usage.

The Link, Please?

Oh yeah, so here is the link to my project:

https://github.com/strictlymike/cmdlog

Enjoy!

Vimifier!

2018-05-18T14:54:00.005-07:00

If you've seen my blog, you know I love Vim. And if you've worked with me, you know I hate when I have to use editors like Microsoft Word because I have to take my fingers off the home row so much to select text, frequently change its format, and generally get around and edit what I'm working on. I worked on a 72-page report last week where I lamented this extensively. Over the week, I started to think: what stops me from turning the keystrokes I want to type into the keystrokes I am typing?

For instance, when I want to move the cursor between words, I use Ctrl+Left and Ctrl+Right. And when I want to select several words, I hold down the shift key. I started to wonder how hard it might be to create a keyboard hook that would translate Vim-style shortcuts (like b, w, v, ...) into their Windows hotkey equivalents (Ctrl+Left, Ctrl+Right, Shift, ...).

Well, I tried it out, and yeah it's a bit of work, but it's worth it! I don't have screenshots or GIFs because it's hard to make it evident what keystrokes I'm using, but if you're a tinkerer then take a look and try it out. I think you'll be amused!
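
The translation idea is small enough to sketch in a few lines of Python. This is not the vimifier source, just an illustration using the three keys mentioned above; the class name and the tuple representation of a chord are made up for the example, and a real keyboard hook would have to inject the resulting keystrokes.

```python
# Vim motion key -> Windows hotkey chord, limited to the keys named above
MOTIONS = {
    "b": ("Ctrl", "Left"),   # back one word
    "w": ("Ctrl", "Right"),  # forward one word
}


class KeyTranslator:
    """Translate Vim-style keys into Windows hotkey chords."""

    def __init__(self):
        self.visual = False  # True while selecting text (Vim's visual mode)

    def translate(self, key):
        """Return the chord to send for key, or None if nothing is sent."""
        if key == "v":
            self.visual = not self.visual  # toggle selection mode
            return None
        chord = MOTIONS.get(key)
        if chord is None:
            return None
        # While selecting, hold Shift so the motion extends the selection
        return ("Shift",) + chord if self.visual else chord
```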

Get the source code under my GitHub profile at https://github.com/strictlymike/vimifier.

If you need a bare bones compiler to get this built, the quickest shortcut I know of is to grab the Microsoft Visual C++ Compiler for Python 2.7.

Getting Into RE

2017-12-21T03:32:00.003-08:00

This has been done before, but I find myself producing this information repeatedly, so what the hell, here's a blog article about it: how to get started in reverse engineering (RE). You'll need a VM (VirtualBox is free and works for me).

I'll first promote the resources that I used because that's what worked for me. Then I'll talk about how to get practice via a certain CTF, and share some resources that I believe have been useful to others.

PMA

That stands for Practical Malware Analysis. This book is already showing its age, but I still think it is the best all-in-one resource to learn reverse engineering fundamentals.

The approach I took that helped me really absorb the material was:

Read the book through and just absorb it;
Go back through for the labs, reviewing each chapter as necessary;
When a lab takes more than a certain time (start with 30 minutes), use the back of the book for the answer;
If you don't see the connection between what you've seen so far and the answer in the back of the book, read the extended answer to see how they got that; and,
Always read the extended answer to glean any techniques that might be more efficient than the road you took.

It takes discipline to remember that you have limited time and you need to move through the labs if you are to learn and improve. If you're banging your head into the wall, you're that much closer to giving up, which is only okay if you have determined that this is simply not interesting to you anymore.

If you don't know assembly language well and you think it is hindering your ability to move through PMA, then suspend that process and take some time with...

x86 Assembly Language

I will suggest two main roads for learning x86 (and x64) assembly language, and a couple of other references to support them. The first and most accessible main resource is the one that a lot of my colleagues have said helped them: http://opensecuritytraining.info/

A lot of reverse engineers, both aspiring and established, say that Xeno's courses are where they learned assembly language, and it went really well for them, so with it being available online for free, I have to put it out there.

As for myself, my main resource for learning x86 assembly language was Richard Blum's book, Professional Assembly Language Programming. The book teaches with GNU tools, so it uses the AT&T syntax which is largely unpopular with the RE crowd, but on the upside, the GNU tools are superlatively easy to acquire and use on most Linux distributions.

Aside from those, the first chapter (x86/x64) of Practical Reverse Engineering reinforced and clarified some essential concepts for me.

Finally, for the definitive RTFM experience, Volume 2 of Intel's processor manuals contains the instruction set reference, which you can use to look up weird instructions you come across. If you're using IDA Pro, there is also an auto-comment mode that may help remind you if you are just getting started.

Debugging

Tarik Soulami's book Inside Windows Debugging is an outstanding read about not just WinDbg but Windows internals. I can't recommend it emphatically enough.

FLARE-On Challenges

If you're done with PMA and ready for some practice, the FLARE-On Challenge binaries archived at http://flare-on.com/ pose a unique training opportunity for two reasons: first, because they deliberately mimic real malware; and second, because they are all accompanied by solution write-ups on fireeye.com:

2014: part 1 and part 2
2015: https://www.fireeye.com/blog/threat-research/2015/09/flare-on_challenges.html
2016: https://www.fireeye.com/blog/threat-research/2016/11/2016_flare-on_challe.html
2017: https://www.fireeye.com/blog/threat-research/2017/10/2017-flare-on-challenge-solutions.html

I wrote challenge 7 and the associated solution write-up; you should check it out! :-)

When used for training, I suggest approaching these incrementally: attack level 1 of each, level 2 of each, level 3 of each, in turn. I also suggest treating them like the PMA labs: if you exceed a certain duration analyzing them, peek at the solution write-up and see if that gives you a shove in the right direction.

Things I Did That You Don't Have To

The first book I read about RE-related things was actually not PMA; it was The Shellcoder's Handbook, 1st Edition (2nd Edition is here). This was not a gentle introduction. But looking back, it taught me a lot of things that I refer back to very frequently. So maybe it was more formative for me than I even remember.

I also continue to get a lot out of reading Kyle Loudon's book, Mastering Algorithms in C. The book talks about all those things I consider to be magic, especially crypto and compression.

Other Resources

@malwareunicorn has made available her materials for learning about RE, and it looks like an interesting way to get started: https://securedorg.github.io/RE101/
2017.12.27: A vulnerability researcher told me that his "Aha! book" was Reversing: Secrets of Reverse Engineering. I took a look at this while visiting a bookstore and it looks not only informative but interesting.
...

See the ellipsis? I'm going to tack things onto this article as I learn about them. If you know of one, holler. It's easy for me to remember what helped ME, but I could use a reminder of what others have found helpful.

x86 Assembly Language Brush-Up

2017-11-03T06:17:00.005-07:00

A buddy of mine is reviewing x86 assembly, so I thought I would write a brief primer on common x86 assembly language instructions. If you aren't yet familiar with the x86 register file, check out Remember the Registers first for a super-quick overview.

On a side-note, I laughed when I looked back at that other article, because it starts the same way: "oh, hey, a friend is about to learn x86 assembly, so I thought I would write this quick article!" So I guess the lesson here is: Parents, talk to your kids about assembly language... or else their friends will! }:-)

With no further ado, I'll get this thing moving with...

mov

Much of reverse engineering entails following the flow of data backward and forward as it moves through registers and memory. The mov instruction is the most commonly used instruction and the instruction you'll most often have to read to know where data is going.

lea

lea stands for load effective address. The lea instruction is supposed to give you a pointer to something rather than dereferencing the pointer and giving you the actual data. In reality, though, it just computes the sum or other expression in the square brackets and moves it to the specified location. Take, for example, the following instruction:

lea eax, [ebp-218h]

The eax register in this case will receive ebp minus 0x218, which is the address of some local variable. Compare this with:

mov eax, [ebp-218h]

Which actually dereferences ebp-0x218 to retrieve the contents of that local variable in the function stack frame and puts that value into eax.

Since the lea instruction in all reality just computes the value of the expression in the brackets, it can also be used to evaluate complex expressions involving multiplication and addition. If you see some values that can't possibly be addresses getting used with the lea instruction, you might be right. The program may be merely computing a value rather than working with memory addresses.
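
To make the difference concrete, here is a toy Python model of the two instructions. It is a sketch only: registers are a dict, memory is a dict keyed by address, and the register and memory values are invented for the example.

```python
MASK = 0xFFFFFFFF  # keep results within 32 bits


def lea(regs, expr):
    """lea: evaluate the bracketed expression; memory is never touched."""
    return expr(regs) & MASK


def mov_load(regs, memory, expr):
    """mov reg, [expr]: evaluate the expression, then read memory there."""
    return memory[expr(regs) & MASK]


regs = {"ebp": 0x0012FF00, "eax": 7}        # made-up register values
memory = {0x0012FF00 - 0x218: 0xDEADBEEF}   # pretend local variable


def local_var(r):
    """The expression inside the brackets: [ebp-218h]."""
    return r["ebp"] - 0x218


addr = lea(regs, local_var)                 # lea eax, [ebp-218h]
value = mov_load(regs, memory, local_var)   # mov eax, [ebp-218h]

# lea as plain arithmetic: lea eax, [eax+eax*4] multiplies eax by 5
times5 = lea(regs, lambda r: r["eax"] + r["eax"] * 4)
```

Here addr is the address 0x12FCE8, value is the 0xDEADBEEF stored there, and times5 is 35, which is clearly not an address.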

push

Data goes on the stack, usually for a function call.

Some compilers will also emit code to push an immediate operand (a constant value, e.g. 0) and then pop it to a register, like this:

push 4 ; Put the number 4 on the stack
pop eax ; The number 4 winds up in eax

call

The processor pushes the address of the next instruction and transfers control to a procedure of the programmer's choosing. This is kind of equivalent to lines 3-5 below:

1   push arg2               ; Push function arguments as normal
2   push arg1
3   push offset L_nextinstr ; Save the address of the next instruction on the stack
4   jmp procedure           ; Transfer control
5 L_nextinstr:
6   test eax, eax           ; Resume normal stuff like checking return value

jmp

This is another way to transfer control, usually within a procedure, but sometimes to a procedure.

retn N

When you see a return instruction followed by a number, the function is cleaning up its own stack, which means it is stdcall (the standard calling convention for Microsoft Win32 APIs).

add

I mention this instruction now because it is used in the other calling convention, cdecl, to efficiently forget about function parameters pushed on the stack:

add esp, 8

Obviously its usual use is plain arithmetic, but when it is used with the stack pointer as above, you know the preceding function call was to a cdecl function.

cmp

Compare two operands: Subtract the second operand from the first operand and set EFLAGS as if this were an arithmetic subtraction instruction.

test

Logical comparison. From Intel's manual: "Computes the bit-wise logical AND of first operand... and the second operand... and sets [EFLAGS accordingly]."
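
As a rough illustration of how cmp and test differ, here is a Python sketch that computes a few of the flags each one sets for 32-bit operands. It models only ZF, SF, and CF; the overflow flag and the rest of EFLAGS are left out for brevity, and the function names are mine.

```python
MASK = 0xFFFFFFFF  # 32-bit operands


def flags_after_cmp(a, b):
    """Flags as set by `cmp a, b`: a subtraction whose result is discarded."""
    a &= MASK
    b &= MASK
    result = (a - b) & MASK
    return {
        "ZF": result == 0,         # operands were equal
        "SF": bool(result >> 31),  # top bit of the difference
        "CF": a < b,               # unsigned borrow
    }


def flags_after_test(a, b):
    """Flags as set by `test a, b`: a bitwise AND whose result is discarded."""
    result = a & b & MASK
    return {
        "ZF": result == 0,         # no bits in common
        "SF": bool(result >> 31),
        "CF": False,               # test always clears CF
    }
```

This is why `test eax, eax` is such a common way to check a return value: ZF is set exactly when eax is zero.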

If you're unsure about what an instruction does, RTFM: http://www.intel.com/products/processor/manuals/

Intel's manuals are the definitive guide to how Intel's processors parse and execute instructions. They are organized as follows:

Volume 1: Basic Architecture

Volume 2: Instruction Set Reference

Volume 3: System Programming Guide

If you wonder about a particular instruction, you'll find it in volume 2 (Instruction Set Reference). If you want to learn about the x86 execution environment, volume 1 (Basic Architecture) is your friend. And if you're writing a bootloader, an operating system, or a hypervisor, volume 3 (System Programming Guide) is for you.

Misc

If you're interested in tabulating the most common instructions using IDAPython, here is a snippet.

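from collections import defaultdict

# Functions, Chunks, and Heads come from idautils; GetMnem comes from idc
# (it was renamed idc.print_insn_mnem in IDA 7.4 and later)
from idautils import Chunks, Functions, Heads
from idc import GetMnem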

def _for_each_instr(callback, outputs=None, parms=None):
    """Do <callback> for each instruction.

    Call callback() providing fva, chunk start va, instr addr, and outputs/
    parameters.
    """
    for fva in Functions():
        for (va_start, va_end) in Chunks(fva):
            for head in Heads(va_start, va_end):
                callback(fva, va_start, head, outputs, parms)

def enum_mnemonics():
    mnems = defaultdict(int)
    def enum_mnemonics_callback(fva, chunkva, head, unused1, unused2):
        mnems[GetMnem(head)] += 1

    _for_each_instr(enum_mnemonics_callback)

    # Sort by count, most common first (this form works in Python 2 and 3)
    mnems_sorted = sorted(mnems.items(), key=lambda kv: kv[1], reverse=True)

    return mnems_sorted

Free Shortcuts in IDA Pro

2017-10-04T13:51:00.003-07:00

I'm tired of reassigning everything to the same hotkey in IDA Pro because I don't know which hotkeys are free. Here are the keyboard shortcuts I know of that aren't in use in IDA Pro so far. Only one- and two-key shortcuts are included. I have omitted shortcuts used by BinDiff and flare-ida, because I don't want to collide with those.

` (backtick)
, (comma)
<>{}[] (left and right angle bracket, brace, and square bracket)
Most (but not all) of the top row: !@#$%^&*()+=
I
J
Alt+F4, F5, and F8
Alt+E, F, N, O, U, V, W, Z
Ctrl+0, 4, 5, 7, 8, 9
Ctrl+H
Ctrl+Y
Shift+A-C
Shift+F-O
Shift+Q
Shift+S-Z

I got these by visiting Options -> Shortcuts... in a recent version of IDA Pro. If you notice any others, or notable collisions, please comment or message and I will update.

Two Great Tastes: IDA + WinDbg

2017-09-28T10:53:00.002-07:00

Setting up remote debugging with IDA+WinDbg can be difficult when it doesn't work right off the bat, because the errors don't jog the right thought process for you to fix the setup and get it working. This causes some people to walk away from the whole thing, which is unfortunate. This setup is SOOOOO useful that it's worth slogging through the pain to get it working. The value of having graph view and IDA's annotation capabilities on-hand while debugging cannot be overstated.

Here, I'll emphasize one thing that could stand to be better emphasized in Hex-Rays' own documentation: you have to be using the same version of WinDbg on each side. And I'll indicate some ways to isolate end-to-end (E2E) issues. Note that the system with IDA Pro on it is referred to here as the analysis system (it's where you do your analysis of the code), and the system where you run malware is referred to as the target system.

Pointers

Resolve any end-to-end (E2E) issues first (firewalls, networking, etc.)
Lock IDA Pro into using the same version of WinDbg as is on your target system
Use WinDbg itself to verify that there are no E2E issues

Algorithm

This is exactly how to set up remote debugging with IDA Pro and WinDbg. Here are the steps:

Both systems: Ensure your analysis and target machines can access each other over the network

If they are VMs, you may need to adjust them to ensure they are both host-only
You might need to mess with firewall settings
If you are using FakeNet-NG, you might need to add an exception for dbgsrv.exe

Target system: Locate (install, if necessary) WinDbg on your target system.
Target -> Analysis system: If you haven't installed the same version of WinDbg to both systems, then simply copy the entire x86 directory where you located WinDbg on the target system onto your analysis system. It doesn't matter where you place this.
Analysis system: Edit ida.cfg to set DBGTOOLS to point to the x86 directory

Use double backslashes, e.g. DBGTOOLS = "C:\\Program Files (x86)\\Windows Kits\\10\\Debuggers\\x86\\";

Target system: Start the WinDbg debug server

"C:\path\to\dbgsrv.exe" -t tcp:port=9999

Analysis system: Test by trying to connect remotely with WinDbg itself - if this doesn't work, then you've got end-to-end issues to resolve before IDA will work
Analysis system: configure your IDB to use WinDbg:

Debugger -> Switch debugger... (select Windbg debugger and click OK)
Debugger -> Options...

Application: path\on\your\target\system\to\binary.exe
Input file: path\on\your\target\system\to\binary.exe
Directory: path\on\your\target\system\to
Parameters: command-line arguments you want passed to the malware (if any)
Connection string: tcp:server=TARGETSYSTEMNAME,port=9999
Click OK

Analysis system: Click on an instruction and hit F4 to "run to" that instruction, or set a breakpoint and strike F9
Disregard warnings as applicable ;-)
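
Before reaching for WinDbg in the E2E test step above, a quick TCP reachability check from the analysis system can rule out firewall and routing problems. This is a minimal sketch in Python; the function name is mine, and you would pass your own target name along with the port given to dbgsrv.exe (9999 in the steps above). It only proves the port is reachable, not that the WinDbg versions match.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or the name did not resolve
        return False
```

For example, `can_reach("TARGETSYSTEMNAME", 9999)` returning False points at networking or firewall settings rather than at IDA.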

Troubleshooting

You may want to audit your user and system PATH environment variables to ensure that they don't include the x86 directory of a conflicting version of WinDbg, or the x64 directory for that matter.

If you get "Could not initialize WinDbg engine 0x7f / The specified procedure could not be found...", you might try adding the path to that x86 directory to your system path and closing/reopening IDA. I also find that certain Python scripts seem to cause IDA Pro to emit this error, so you might also try closing/reopening IDA, initiating your debug session, and only THEN loading any ancillary IDAPython scripts you were using.

Miscellany

As of 2011, Hex-Rays indicated that this would not work with the x64 tools.

Done to Death

2017-08-19T16:01:00.002-07:00

I saw this tweet from @redteamwrangler today:

This question is a pretty easy fallback for interviewers and it might be getting a little old for us. But it set me to thinking about how tediously I could answer this without using the Internet or any reference materials on my machine. Some of these details might be a little off or just downright wrong, but I'll display my ignorance. It was a fun exercise since it isn't in the context of any actual interview :-)

What happens when you type www.google.com into a browser and press return? Supposing you're using Internet Explorer for Windows on a wired (Ethernet) connection:

8259a or emulated 8259a keyboard controller emits some scancodes
Keyboard interrupt is sent to the CPU
Interrupt service routine acknowledges interrupt, potentially moves one or more scancodes into a buffer
Or a delayed procedure call (DPC) does
ntoskrnl/win32k determine which thread corresponds to the foreground window and deliver a series of window messages of type WM_KEYDOWN/WM_KEYUP ending with one having virtual key code VK_ENTER
Window procedure for the browser URL bar (which is a window object) is called with a window message of type WM_KEYUP with virtual key code VK_ENTER
Window procedure has a switch statement / jmp table in it that accounts for this particular window message (WM_KEYUP) and maybe a sub-case for VK_ENTER
Probably takes the accumulated buffer so far (L"www.google.com") and passes it to a function
Probably uses a library like WinInet to do the real stuff

Probably calls InternetOpen() to get a handle of type HINTERNET
Probably calls InternetOpenUrlW() or HttpSomethingSomething() to get another HINTERNET handle

Probably reads the registry or uses a cached value for the HTTP User-Agent field that it provides here
Probably uses WinSock2 for TCP
Probably calls ws2_32!WSAStartup() if it hasn't been called yet
Checks proxy settings for the user and optionally establishes a connection with the corresponding hostname and implements HTTP proxy requests instead of direct HTTP requests
Parses the URL for the hostname, protocol scheme, any explicit port specification, URI, query parameters, etc.

Probably uses urlmon!InternetCrackUrlW() for this (or is that in wininet? I think it's in urlmon)
If no protocol scheme is specified, uses http:// which has a default port of 80
If https:// is specified, a default port of 443

Issues a DNS A (IPv4 address) request and/or AAAA (IPv6 address) request for the name

Probably calls ws2_32!gethostbyname() to do this

Thread consults DNS resolver cache service (if running) for name, probably via IPC

DNS resolver cache either returns the name or...
Probably uses dnsapi.dll which exports some function that...

Checks DNS configuration (probably the registry) to get primary, secondary, tertiary, etc. DNS servers
Calls ws2_32!inet_aton() to convert human-readable configuration to IPv4 or IPv6 addresses
Creates a socket object via ws2_32!socket()
Creates an in_addr object to communicate with the DNS server
Uses ws2_32!sendto() to use AF_INET/IPPROTO_UDP connectionlessly querying the server

Network layer (hmm, getting hand wavy) consults routing table to determine what interface packet should go through and whether it must visit a gateway
Network card device driver creates and fills out an object that the kernel uses to describe network datagrams (packets)
Network card device driver initiates I/O request with NIC via PCI registers or other hardware interface to provide datagram to be transmitted

NIC takes the medium and transmits Ethernet frames bearing the octets that were given to it
If another host transmits at the same time, the two hosts use the binary exponential backoff algorithm to wait until the medium is clear
A router is likely the gateway; it accepts the packets and creates new packets to send across one or more other networks until the DNS server receives them
DNS server UDP stack handles incoming packet, provides it to UDP-based DNS service e.g. bind which is bound to port 53
Bind parses the packets, potentially forwards the request if the desired names are not in its zone file, and returns the response

The DNS resolver cache service receives and parses DNS reply/replies and returns the answer to the DNS client

If the DNS resolver cache service wasn't running, ws2_32!gethostbyname() probably does most of this itself

Since you only typed "www.google.com", it's plaintext HTTP on the default port, so 80
Establishes a TCP connection with the resulting host number and port

ws2_32!socket() to get a socket object
ws2_32!connect() with AF_INET and IPPROTO_TCP to connect

tcpip.sys is probably involved here
Again the network card and Ethernet medium stuff
TCP three-way handshake, window negotiation, etc.

The client sends a SYN TCP segment
The server returns a SYN,ACK TCP segment
The client returns an ACK TCP segment

TCP data transmission

The client sends a SYN,PSH TCP segment pushing data
Something like "GET / HTTP/1.1\nHost: www.google.com\nUser-Agent: ..."

Google's web server does some thinking and returns a response
The client receives a 3xx redirect response and gets directed to go to https://www.google.com/
Uses Microsoft schannel (secure channel) library to negotiate ciphers, parse the server's security certificate, and transmit data over TLS
Starts with ClientHello message, ServerHello, etc.
Obtains HTTP response, something like "HTTP 200 OK\n..." with some HTML in the HTTP body
HTML links to images, maybe JavaScript, etc., resulting in Cross-Origin (CORS) processing and follow-on requests
Invokes the JScript scripting engine for JScript and rendering engine to display the content
Uses graphics primitives and likely renders into a buffer that it furnishes to win32k.sys through GDI calls
GDI manages framebuffer of all windows including the foreground window
Monitor displays the framebuffer
Photons fly into your eyes
Optic nerve and brain adjust for upside-down image arriving at retina
Person realizes they went to Google and says, "crap, I meant to go to Bing." Orrrrrr maybe not, haha.
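
The URL-parsing step in the list above, with its default scheme and ports, can be sketched in Python. This uses Python's urllib rather than the Windows API the list guesses at, and the function name is made up for the example.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def crack_url(text):
    """Split user input into (scheme, host, port, path), applying defaults."""
    if "://" not in text:
        text = "http://" + text  # no scheme typed: assume plaintext HTTP
    parts = urlsplit(text)
    # An explicit port wins; otherwise fall back to the scheme's default
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port, parts.path or "/"
```

Typing just "www.google.com" comes out as plaintext HTTP on port 80, which is why the first request in the list goes out unencrypted.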

If I had more time, I'd draw this out a little more, but I had to quit eventually. And the more I do this, the more I bump into all the things I don't know. A couple things I'd like to know more about:

What does "network layer" mean on Windows? I could use ETW with syscall stackwalking enabled to follow the ws2_32!connect() call into the kernel, or Windows Internals might just tell me.
What kernel object represents a packet in the Windows kernel? A packet buffer? I forgot :-(
How does the networking stack give a packet to the NIC to transmit it? My device driver fu is ageing.

So, de beency bouncy burger, eh?

2017-04-23T17:53:00.000-07:00

A colleague found this decoder called CyberChef, which I wish to bookmark and share with you. You may find it useful, too. I hear it can be downloaded as a standalone web page if you wish to audit it and then use it privately for opsec reasons, offline analysis, etc.

https://gchq.github.io/CyberChef/

Børk, børk, børk!

Truthful ProgrammEEng

2017-04-23T15:08:00.000-07:00

In this article I'll share how to approach complex logic problems by borrowing from the Electrical Engineering (EE) discipline. I'll discuss how I've used techniques from digital logic circuit design to write efficient and accurate code for hypervisor startup and for packet redirection.

This approach can reduce unnecessary nested/multiple conditional statements down to a one-liner. It applies to software engineering scenarios when you can express your problem in terms of a set of strictly Boolean (true/false) conditions.

If you find yourself contemplating a tangled web of conditional (if/else) code or you find it difficult to understand what conditions contribute to the decision you want your software to make, take a step back and consider applying this technique from digital logic circuit design. If nothing else, the first three or five steps will get you on the right track to write the correct conditional logic (in if/else style) to solve your problem. If you can carry this process to its conclusion, then you can arrive at the most optimized one-liner to solve your logic problem.

It consists of the following steps:

Figure out what Boolean (true/false) conditions are necessary to the decision at hand
Define a truth table that specifies all the possible conditions
Mark which conditions you want to yield a TRUE result
Turn the TRUE cases (EEs call these minterms) into logical expressions
Logical-OR those expressions together to render a decision (EEs call this the canonical sum-of-products, or CSOP, logic function)
Simplify if possible
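
The first five steps are mechanical enough to sketch in Python. In this sketch (the function names are mine), you supply the condition names and a function that says which rows of the truth table should be TRUE, and it builds the canonical sum-of-products as a string. Simplification is not attempted.

```python
from itertools import product


def minterms(names, wanted):
    """Return one product term per truth-table row where wanted(...) is true."""
    terms = []
    for row in product([0, 1], repeat=len(names)):
        if wanted(*row):
            # A 1 input appears as-is, a 0 input appears negated
            literals = [n if bit else "~" + n for n, bit in zip(names, row)]
            terms.append(" & ".join(literals))
    return terms


def csop(names, wanted):
    """Logical-OR the minterms together into one expression."""
    return " | ".join("(%s)" % t for t in minterms(names, wanted))


# Toy example: TRUE when exactly one of two conditions holds
expression = csop(["a", "b"], lambda a, b: a != b)
```

For the toy example, expression comes out as "(~a & b) | (a & ~b)", one term per TRUE row of the table.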

Here's a hypervisor programming example followed by a network filtering example to demonstrate the process.

Starting a VT-x Hypervisor

I needed to write startup code for a tiny hypervisor project using Intel Virtualization Technology Extensions (VT-x). I knew nothing about VT-x, but I did know that Intel documented the corresponding VMX instruction set and all its requirements in their processor manual, so I started there. Most of the publicly available code I could get my hands on seemed to test one-off control flags to decide whether it was okay to enter VMX root operation, ignoring the exact specifications that Intel provided in more modern documentation. I decided to implement my own check based on the data sheet.

The Intel manual describes some requirements for control registers CR0 and CR4 to allow the VMXON instruction to work. Here is some not-very-succinct text from the Intel manual (in case this makes your eyes bleed, I'll provide a more succinct summary next):

Clear as mud.

In summary, there is a FIXED0 and a FIXED1 model-specific register (MSR) for CR0, and another pair of these for CR4. The most important part of the text above is the last sentence starting with "Thus, each bit...". This amounts to the following possible values of FIXED0:FIXED1 along with their corresponding meaning:

F0	F1	CR must be
0	0	fixed to 0
1	1	fixed to 1
0	1	don't care

Notice that one case is missing from Intel's description: $F0:F1 = 10$. When we create a truth table including $CR$ as an input, this will amount to two missing / don't-care cases (one for $CR=0$ and another for $CR=1$).

So, expanding this out to take into account the two possible values of any bit in the control register we are checking, we have the following truth table for the possible values of $F0:F1$ and $CR$:

F0	F1	CR	Invalid
0	0	0	FALSE
0	0	1	TRUE (control register bit should be 0 but is 1)
1	1	0	TRUE (control register bit should be 1 but is 0)
1	1	1	FALSE
0	1	0	FALSE
0	1	1	FALSE

Since there are two cases where a given control register bit can be in an invalid state, that means there are two logical cases, or minterms, either of which should result in a verdict of true (i.e. it is true that the control register is in an invalid state). Here's the same table from above, but formatted a little bit more like a truth table, with minterms called out.

F0	F1	CR	Invalid	Minterms
0	0	0	0
0	0	1	1	$\overline{F0}\cdot\overline{F1}\cdot CR$
1	1	0	1	$F0 \cdot F1 \cdot \overline{CR}$
1	1	1	0
0	1	0	0
0	1	1	0

This amounts to the following sum of products logic function:

$$Invalid(F0,F1,CR)=\overline{F0} \cdot \overline{F1} \cdot CR+F0 \cdot F1 \cdot \overline{CR}$$

Boolean algebra tells us that what is true for one bit in a bit vector is true for all bits. So, in C code, each negated variable (e.g. $\overline{F0}$) amounts to inverting all the bits in the integer. So, the C code for this would look like:

int is_invalid(int fixed0, int fixed1, int cr)
{
    return ((~fixed0 & ~fixed1 & cr) | (fixed0 & fixed1 & ~cr));
}

Boolean algebraic simplification can be applied to optimize this function to result in fewer logical operators, and fewer instruction opcodes emitted by the compiler. Optimization is left as an exercise for the student (hint: XOR is involved) ;-)
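For experimenting outside of C, here is a rough Python port of the unoptimized check. Python integers are unbounded, so `~` produces a negative number and the result has to be masked back to the register width. The 64-bit width, the name `invalid_bits`, and the sample FIXED0/FIXED1 and CR values are assumptions chosen for illustration, not values read from a real processor.

```python
MASK64 = (1 << 64) - 1  # assumed register width: 64 bits


def invalid_bits(fixed0, fixed1, cr):
    """Return a mask of the CR bits that violate the FIXED0/FIXED1 rules."""
    must_be_0_but_set = ~fixed0 & ~fixed1 & cr
    must_be_1_but_clear = fixed0 & fixed1 & ~cr
    return (must_be_0_but_set | must_be_1_but_clear) & MASK64


# Illustrative values only (hypothetical, not read from hardware).
FIXED0 = 0x80000021
FIXED1 = 0xFFFFFFFF

for cr in (0x80000031, 0x00000011, 0x180000031):
    print(hex(cr), "->", hex(invalid_bits(FIXED0, FIXED1, cr)))
```

Returning the mask rather than a plain yes/no makes it easy to see which bits are the offenders; any nonzero result means the register is in an invalid state.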

Redirecting Network Traffic

More recently, I've been writing logic for the Linux Diverter in FakeNet-NG. This project is the successor to the original FakeNet tool distributed by Siko and Tank to let people simulate a network in a single virtual machine. The NG version was originally developed to make it possible to use FakeNet on newer versions of Windows, but it would be nice for it to work on Linux, too.

So, I am writing a Linux "Diverter", which is a component that manages how packets are redirected and manipulated to simulate the network. This is a really fun project in which I'm using python-netfilterqueue to catch and redirect packets so we can observe traffic sent to any port in the system, even ports where no service is currently bound.

The specification for the Linux Diverter includes redirecting traffic sent to unbound ports into a single listener (similar to the INetSim dummy listener). We want the Linux version of FakeNet-NG to be used in two modes:

SingleHost: the malware and the network simulation tool are running on the same machine, like the legacy version of FakeNet.

MultiHost: the malware and the network simulation tool are running on different machines, like INetSim.

The conditions for whether a packet must be redirected boil down to a logic function of four inputs:

(A) Is the source address local?

(B) Is the destination address local?

(C) Is the source port bound by a FakeNet-NG listener?

(D) Is the destination port bound by a FakeNet-NG listener?

The criteria for changing the destination port of a packet are as follows:

When using FakeNet-NG in MultiHost mode (like INetSim), if a foreign host is trying to talk to us or any other host, ensure that unbound ports get redirected to a listener.

When using FakeNet-NG in SingleHost mode (like legacy FakeNet), ensure outbound destination packets are redirected except when the packet is a response from a FakeNet-NG listener (in other words, the packet originated from a bound port).

The truth table for a decision function that redirects traffic destined for local, unbound ports to a single dummy listener in both SingleHost and MultiHost scenarios is as follows, with local IPs and bound ports in bold:

Sixteen cases

Here, we translated all the different human-comprehensible conditions like "it's coming from a foreign host, don't care what port, and it's arriving at the local host to an unbound port" into a set of Boolean conditions that can be either true (1) or false (0). It's convenient to use letters like A, B, C, and D to represent the inputs (see the first row of the table above). The minterms of this function are the cases when it returns true, namely:

$\overline{A} \cdot \overline{B} \cdot \overline{C} \cdot \overline{D}$
$\overline{A} \cdot \overline{B} \cdot C \cdot \overline{D}$
$\overline{A} \cdot B \cdot \overline{C} \cdot \overline{D}$
$\overline{A} \cdot B \cdot C \cdot \overline{D}$
$A \cdot \overline{B} \cdot \overline{C} \cdot \overline{D}$
$A \cdot B \cdot \overline{C} \cdot \overline{D}$

We can see that a few terms can be simplified, such as the first two. Since A, B, and D are false and C can be either true or false, these two really amount to $\overline{A} \cdot \overline{B} \cdot \overline{D}$. Pairing up the remaining terms the same way, we arrive at a fairly simple function of these three disjunctive terms:

$\overline{A} \cdot \overline{B} \cdot \overline{D}$

$\overline{A} \cdot B \cdot \overline{D}$

$A \cdot \overline{C} \cdot \overline{D}$

From there, you can just get rid of B and collapse the first two cases into one:

$\overline{A} \cdot \overline{D}$

$A \cdot \overline{C} \cdot \overline{D}$

And from there, you can take one step further by looking more closely at the truth table to find the optimal logic. But there are six minterms here, so it might be time to bust out a more powerful tool: the ol' Karnaugh map. A K-map allows you to visually locate all the adjacent groups (pairs, quads, etc.) of logical "yes" outputs and arrive at the simplest possible logic function to yield the desired output. If you're not familiar, you should definitely check this out.

Here's the K-map for our logic function:

Short and sweet: two disjunctive terms in three variables

The K-map here has only two groups of adjacent minterms: those that occur when $A$ and $D$ are zero, and those that occur when $C$ and $D$ are zero. This results in two disjunctive terms, each requiring only two input variables (three distinct variables in total):

$\overline{A} \cdot \overline{D}$

$\overline{C} \cdot \overline{D}$

So, the minimal sum of products logic function would be:

$$R(A, B, C, D) = \overline{A}\overline{D}+\overline{C}\overline{D}$$
Or, in Python:

        return ((not src_local and not dport_bound) or
                (not sport_bound and not dport_bound))

Done and done.
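One cheap way to gain confidence in a K-map result is to brute-force all sixteen input combinations and compare the minimized function against the six minterms listed earlier. The names `MINTERMS`, `redirect`, and `check` below are hypothetical, introduced only for this sketch.

```python
from itertools import product

# The six minterms listed earlier, written as (A, B, C, D) tuples.
MINTERMS = {
    (0, 0, 0, 0), (0, 0, 1, 0), (0, 1, 0, 0),
    (0, 1, 1, 0), (1, 0, 0, 0), (1, 1, 0, 0),
}


def redirect(a, b, c, d):
    """Minimized function R = A'D' + C'D'. B drops out entirely."""
    return (not a and not d) or (not c and not d)


def check():
    """Return True if redirect() agrees with MINTERMS on all 16 inputs."""
    for bits in product((0, 1), repeat=4):
        if bool(redirect(*bits)) != (bits in MINTERMS):
            return False
    return True


print(check())
```

The same loop doubles as a list of the test cases the real code needs to cover.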

The next time you find yourself rolling around in freaky nested conditionals and arriving at the wrong logic, try approaching it the way an electrical engineer would: express the problem in terms of a set of Boolean conditions and then figure out how you want each case to go. If you want the simplest result, you can use Boolean algebra or learn to use a K-map to handle the rest. But even if you stop at the truth table, at least this process will help you sort out logic that you might otherwise be confused about and ensure that you know all the test cases you'll need to include to properly test your code.

MathJax

2017-04-23T07:47:00.001-07:00

I fixed MathJax for mobile on my blog. I also made it retrieve the scripts via cloudflare over https rather than in the plain from mathjax.org, in case any of my readers live in repressive regimes that punish people for reading beautifully rendered equations.

This might be useful to other blogger.commers who like LaTeX/MathJax and want to see it render on the mobile version of their blog:
http://stackoverflow.com/questions/42592013/blog-with-mathjax-seen-on-a-cellphone

Okay. Carry on.

Chasing Heisenbugs: Troubleshooting Frustratingly Unresponsive Systems

2017-04-22T08:09:00.000-07:00

I did some work in January 2016 on automated performance profiling and diagnosis. As @arvanaghi pointed out, this can be useful for investigating observables resulting from potentially malicious activity. So, I'm figuring out where I left off by turning my documentation into a blog post. What I wrote is pretty stuffy, but since I am sharing it in blog format, I will take some artistic license here and there. Without further ado, I present to you:

I'm just going to paste the whole thing here and draw emoticons on it

Scope

Several challenges make performance profiling and diagnosis of deployed applications a difficult task:

Difficulty reproducing intermittent issues
Investigation inhibited by user interface (UI) latency due to resource consumption
Unavailability of appropriate diagnostic tools on user machines
Inability of laypeople to execute technically detailed diagnostic tasks
Unavailability of live artifacts in cases of dead memory dump analysis

My studies have included the following two types of disruptive resource utilization:

User interface latency due to prolonged high CPU utilization
User interface latency due to prolonged high I/O utilization

I'll just be talking about CPU here.

Where applicable, this article will primarily discuss automated means for solving the above problems. Tools can be configured to trigger only on issues that are specific to the target software and that the diagnostic software is meant to address. Where feasible, I will share partial software specifications, rapidly prototyped proofs of concept (PoCs), empirical results, and discussion of considerations relevant to production environments.

Prolonged High CPU Utilization

Automated diagnostics in general can be divided into two classes: (1) those that are meant to identify conditions of interest (e.g. high CPU utilization); and (2) those that are meant to collect diagnostic information relevant to that condition. Each is treated subsequently.

Identifying Conditions of Interest

For purposes of this discussion, two classes of CPU utilization will be used: single-CPU percent utilization and system-wide percent CPU utilization. Single-CPU percent utilization is defined to be the percent time spent in both user and kernel mode (Equation 1); system-wide CPU utilization is defined to be the same figure, divided by the number of CPUs in the system (Equation 2). For example, if a process uses 100% of a single logical CPU in a four-CPU system, its system-wide CPU utilization is 25%. System-wide CPU utilization is the figure that is displayed by applications such as taskmgr.exe and tasklist.exe.

$$u_1=\frac{\Delta t_{user} + \Delta t_{kernel}}{\Delta t}$$

(Eq. 1)

$$u_2=\frac{u_1}{n_{cpus}}$$

(Eq. 2)
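As a quick worked example of Equations 1 and 2, the following sketch computes both figures from raw time deltas. The function name `cpu_utilization` and the sample numbers are made up for illustration; they reproduce the four-CPU case described above.

```python
def cpu_utilization(d_user, d_kernel, d_wall, n_cpus):
    """Return (u1, u2) per Eq. 1 and Eq. 2; all time deltas in seconds."""
    u1 = (d_user + d_kernel) / d_wall  # Eq. 1: single-CPU utilization
    u2 = u1 / n_cpus                   # Eq. 2: system-wide utilization
    return u1, u2


# One second of wall time, fully spent on one logical CPU of four.
u1, u2 = cpu_utilization(d_user=0.75, d_kernel=0.25, d_wall=1.0, n_cpus=4)
print(f"single-CPU: {u1:.0%}, system-wide: {u2:.0%}")
```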

High CPU utilization can now be defined from the perspective of user experience. Single-threaded applications will only be capable of consuming up to 100% of the CPU time on a single CPU (e.g. on a two-CPU system, up to 50% of system CPU resources). Multi-threaded applications have a much higher potential impact on whole-system CPU utilization because they can create enough CPU-intensive threads to run all logical CPUs at 100%. For purposes of this article, system-wide CPU utilization of 85% or greater will constitute high CPU utilization.

As for prolonged high CPU utilization, that is a subjective matter. From a user experience perspective, this can vary depending upon the user. For purposes of this article, high CPU utilization lasting 5 seconds or greater will be considered to be prolonged high CPU utilization. In practice, engineers might also need to consider how to classify and measure spikes in CPU utilization that occur frequently but for a shorter time than might constitute "prolonged" high CPU utilization; however, these considerations are left out of the scope of this article.

I've implemented a Windows PoC (trigger.cpp) to assess the percent CPU utilization (both single-CPU and system-wide) for a given process. I don't know of any APIs that report process-wide or thread-specific CPU utilization directly, but Windows does expose the GetProcessTimes() API (and GetThreadTimes() for individual threads), which can be used to determine how much time a process or thread has spent executing in user and kernel space over its life. I've used this to measure the change in kernel and user execution times versus the progression of real time as measured using the combination of the QueryPerformanceCounter() and QueryPerformanceFrequency() functions. Figure 1 shows the PoC in operation providing processor utilization updates that closely track the output of Windows Task Manager. The legacy SysInternals CPUSTRES.EXE tool was used to exercise the PoC.

Fig. 1: CPU utilization tool
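The measurement idea (CPU time deltas divided by wall-clock deltas) can be tried out portably. This is not trigger.cpp: it is a minimal Python sketch that only measures its own process, using `os.times()` as a stand-in for GetProcessTimes() and `time.perf_counter()` as a stand-in for the QueryPerformanceCounter() pair. The function name `sample_self` is an assumption.

```python
import os
import time


def sample_self(interval=0.2):
    """Measure this process's (u1, u2) over `interval` seconds of busy work."""
    cpu0 = os.times()
    wall0 = time.perf_counter()
    while time.perf_counter() - wall0 < interval:
        pass  # burn CPU so there is something to measure
    cpu1 = os.times()
    wall1 = time.perf_counter()

    d_cpu = (cpu1.user - cpu0.user) + (cpu1.system - cpu0.system)
    u1 = d_cpu / (wall1 - wall0)          # single-CPU utilization
    u2 = u1 / (os.cpu_count() or 1)       # system-wide utilization
    return u1, u2


u1, u2 = sample_self()
print(f"single-CPU: {u1:.0%}, system-wide: {u2:.0%}")
```

On a mostly idle machine the single-CPU figure should land near 100%, though timer granularity means the exact value will wobble from run to run.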

There's one more thing to think about. If a diagnostic utility is executed indefinitely, it would be nice to make it consolidate successive distinct CPU utilization events into a single diagnostic event.

For example, the CPU utilization graph in Figure 2 below depicts a single high-CPU event lasting from $t=50s$ through $t=170s$. Although there are two dips in utilization around $t=110s$ and $t=150s$, this would likely be considered a single high-CPU event from both an engineering and a user experience perspective. Therefore, rather than terminating and restarting monitoring to record two events, a more coherent view might be obtained by recording a single event.

Fig. 2: Single high CPU utilization event with two short dips

A dip in utilization might also represent a transition from one phase of high CPU utilization to another, in which the target process performs markedly different activities than prior to the dip. This information can be preserved within a single diagnostic event for later identification provided that sufficiently robust data are collected.

One way to prevent continual collection of instantaneous events and to coalesce temporally connected events together is to define a finite state machine that implements hysteresis. Thresholds can be defined and adhered to in order to satisfy both the definition of "prolonged" high CPU utilization and the requirement that diagnostic information is not collected multiple times for a single "event". Such a state machine could facilitate a delay before diagnostics are terminated and launched again, which can in turn prevent the processing, storage, and/or transmission of excessive diagnostic reports representing a single event. Figure 3 depicts a finite state machine (FSM) for determining when to initiate and terminate diagnostic information collection.

Fig. 3: It's an FSM. Barely.

The state machine in Figure 3 above would be evaluated every time diagnostic information is sampled, and would operate as follows:

The machine begins in the Normal state.
Normal state is promoted to High CPU at any iteration when CPU utilization exceeds the threshold (85% for purposes of this article).
When the state is High CPU, it can advance to Diagnosis or regress to Normal, as follows:

If CPU utilization returns below the threshold before the threshold number of seconds have elapsed, then this does not qualify as a "prolonged" high CPU utilization event, and state is demoted to Normal;
If CPU utilization remains above the threshold utilization value for the threshold number of seconds, then diagnostics are initiated and state is promoted to Diagnosis.

The Diagnosis state is used in combination with the Normal-Wait state to avoid continually repeating the diagnostic process over short periods. When the state is Diagnosis, it can either advance to Normal-Wait, or remain at Diagnosis. The condition for advancing to Normal-Wait is realized when CPU utilization for the target process falls below the threshold value.
When the state is Normal-Wait, the next transition can be either a regression to Diagnosis, no change, or an advance back to the Normal state:

If the CPU utilization of the target process returns to high utilization before the time threshold expires, the threshold is reset and the state regresses to Diagnosis. In this case, diagnostic information collection continues.
If CPU utilization remains low but the threshold duration has not elapsed, the state machine remains in Normal-Wait.
If the CPU utilization of the target process remains below the threshold value for the threshold duration, then diagnostics are terminated, the state transitions back to Normal, and the machine can return to a state in which it may again consider escalating to further diagnostics of subsequent events.
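The transitions described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the PoC's code: the class and method names are hypothetical, the same 5-second duration is used for both the High CPU and Normal-Wait timers, and "high" means utilization at or above 85%.

```python
from enum import Enum


class State(Enum):
    NORMAL = "Normal"
    HIGH_CPU = "High CPU"
    DIAGNOSIS = "Diagnosis"
    NORMAL_WAIT = "Normal-Wait"


class Monitor:
    def __init__(self, util_threshold=0.85, hold_seconds=5.0):
        self.util_threshold = util_threshold
        self.hold_seconds = hold_seconds
        self.state = State.NORMAL
        self.since = None  # time at which the current timed state was entered

    def step(self, now, utilization):
        """Feed one sample; return the state after any transition."""
        high = utilization >= self.util_threshold
        if self.state is State.NORMAL:
            if high:
                self.state, self.since = State.HIGH_CPU, now
        elif self.state is State.HIGH_CPU:
            if not high:
                self.state = State.NORMAL  # not prolonged; demote
            elif now - self.since >= self.hold_seconds:
                self.state = State.DIAGNOSIS  # start collecting diagnostics
        elif self.state is State.DIAGNOSIS:
            if not high:
                self.state, self.since = State.NORMAL_WAIT, now
        elif self.state is State.NORMAL_WAIT:
            if high:
                self.state = State.DIAGNOSIS  # same event; keep collecting
            elif now - self.since >= self.hold_seconds:
                self.state = State.NORMAL  # stop collecting diagnostics
        return self.state


# One sample per second: a burst, a short dip, more load, then quiet.
trace = [0.1] * 2 + [0.95] * 6 + [0.3] * 2 + [0.9] * 3 + [0.2] * 7
monitor = Monitor()
states = [monitor.step(t, u).value for t, u in enumerate(trace)]
for t, (u, s) in enumerate(zip(trace, states)):
    print(t, u, s)
```

In the sample trace, the two-second dip only moves the machine to Normal-Wait and back, so the whole burst is recorded as a single diagnostic event.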

The accuracy of this machine in identifying the correct conditions to initiate and terminate diagnostic information collection could be improved by incorporating fuzzy evaluation of application conditions, such as by using a moving average of CPU utilization or by omitting statistical outliers from evaluation. Other definitions, thresholds, and behaviors described above may be refined further depending upon the specific scenario. Such refinements are beyond the scope of this brief study.
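For a sense of what the moving-average refinement might look like, here is a small sketch. The class name `MovingAverage` and the three-sample window are assumptions picked for illustration.

```python
from collections import deque


class MovingAverage:
    """Average of the most recent `window` samples."""

    def __init__(self, window=3):
        self.samples = deque(maxlen=window)

    def add(self, value):
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


# A one-sample spike to 100% in otherwise idle readings.
raw = [0.1, 0.1, 1.0, 0.1, 0.1]
avg = MovingAverage(window=3)
smoothed = [avg.add(u) for u in raw]
print([round(u, 2) for u in smoothed])
```

Feeding the smoothed value into the state machine instead of the raw sample keeps an isolated spike from ever crossing the 85% threshold, at the cost of reacting a little later to a genuine event.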

Collecting Diagnostic Information

When prolonged high CPU utilization occurs, the high-level question on the user's mind is: WHAT THE HECK IS MY COMPUTER DOINGGGGGG????!!

And to answer this question, we can investigate where the application is spending its time. Which, incidentally, is available to us through exposed OS APIs.

To address where the application is spending its time, two pieces of information are relevant:

What threads are consuming the greatest degree of CPU resources, and
What instructions are being executed in each thread?

This information may allow engineers to identify which threads and subroutines or basic blocks are consuming the most significant CPU resources. In order to obtain this information, an automated diagnostic application must first enumerate and identify running threads. Because threads may be created and destroyed at any time, the diagnostic application must continually obtain the list of threads and then collect CPU utilization and instruction pointer values per thread. The result may be that threads appear and disappear throughout the timeline of the diagnostic data.

Ideally, output would be to a binary or XML file that could be loaded into a user interface for coherent display and browsing of data. In this study and the associated PoC, information will be collected over a fixed number of iterations (i.e. 100 samples) and displayed on the console before terminating.

For purposes of better understanding the origin of each thread, it can be useful to obtain module information and determine whether the thread entry point falls within the base and end address of any module. If it does, then a slightly more informative name, such as modname.dll+0xNNNN, can be displayed. Note that I said slightly more informative. Sometimes this just points to a C runtime thread start stub. But it's still worth having.
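The address-to-module lookup itself is simple. Here is a minimal sketch, assuming the module list has already been enumerated and sorted by base address; the base and end addresses below are invented, and `describe` is a hypothetical helper name.

```python
from bisect import bisect_right

# Hypothetical module map: (base, end, name), sorted by base address.
MODULES = [
    (0x00400000, 0x0040C000, "CPUSTRES.EXE"),
    (0x77000000, 0x77180000, "ntdll.dll"),
]
BASES = [base for base, _end, _name in MODULES]


def describe(address):
    """Format an address as module+0xOFFSET when it falls inside a module."""
    i = bisect_right(BASES, address) - 1  # last module starting at or below
    if i >= 0:
        base, end, name = MODULES[i]
        if base <= address < end:
            return f"{name}+0x{address - base:X}"
    return f"0x{address:08X}"  # no containing module; show the raw address


print(describe(0x00401D7B))
print(describe(0x10000000))
```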

In the PoC, data is displayed by sorting the application's threads by percent utilization and displaying the most significant offenders last. Figure 4 shows threads from a legacy SysInternals CPU load simulation tool, CPUSTRES.EXE, sorted in order of CPU utilization.

Fig. 4: Threads sorted in ascending order of CPU utilization

Although this answers the high-level question of what the program is doing (i.e., executing the thread whose start routine is located at CPUSTRES.EXE+0x1D7B in two separate threads), it does not indicate specifically what part of each routine is executing at a given time.

To answer more specific questions about performance, two techniques are available:

Execution tracing
Instruction pointer sampling

Execution tracing can be implemented to observe detailed instruction execution by using either single-stepping via Windows debug APIs or by making use of processor-specific tracing facilities. Instruction pointer sampling on the other hand can be implemented quickly, albeit at a cost to diagnostic detail. Even so, this method offers improved performance since single-stepping is ssssllllloooowwwwwwwwwwww.

This PoC (inspect.cpp) implements instruction pointer sampling by suspending each thread with the SuspendThread() function and obtaining the control portion of the associated thread context via the GetThreadContext() function. Figure 5 depicts the PoC enumerating instruction pointers for several threads within taskmgr.exe. Notably, thread 3996 is executing a different instruction in sample 7 than it is in sample 8, whereas most threads in the output remain at the same instruction pointer across various samples, perhaps waiting on blocking Windows API calls.

Fig. 5: Instruction pointer for thread 3996
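The sampling approach can be tried without any Windows APIs. The sketch below is only an analogue of inspect.cpp: instead of suspending native threads and reading their contexts, it uses CPython's `sys._current_frames()` to see which Python function and line each thread is executing at each sample. The helper names `spin` and `sample_threads` are hypothetical.

```python
import sys
import threading
import time
from collections import Counter


def spin(stop):
    """CPU-bound worker that gives the sampler something to observe."""
    while not stop.is_set():
        sum(range(1000))


def sample_threads(n_samples=20, interval=0.01):
    """Tally (thread id, function, line) seen across periodic samples."""
    hits = Counter()
    me = threading.get_ident()
    for _ in range(n_samples):
        for tid, frame in sys._current_frames().items():
            if tid != me:  # skip the sampling thread itself
                hits[(tid, frame.f_code.co_name, frame.f_lineno)] += 1
        time.sleep(interval)
    return hits


stop = threading.Event()
worker = threading.Thread(target=spin, args=(stop,))
worker.start()
hits = sample_threads()
stop.set()
worker.join()

for (tid, func, line), count in hits.most_common(3):
    print(tid, func, line, count)
```

As with the native version, the locations that show up most often hint at where a thread spends its time, without the cost of tracing every instruction.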

This information can provide limited additional context as to what the application is doing. More useful information might include a stack trace (provided frame pointers are not omitted for anti-reverse engineering or optimization reasons).

Conclusion

It's nice to be able to identify and collect information about events like this. But CPU utilization is only one part of the problem, and there are other ways of detecting UI latency than measuring per-process statistics. Also, what of inherent instrumentation available for collecting diagnostic information? What of kernrate, which was mentioned in the Windows Internals book and is covered here? It looks as if this can be used instead of custom diagnostics, as long as there are sufficient CPU resources to launch it (either via starting a new process or by invoking the APIs that it uses to initiate logging). Would kernrate.exe (or the underlying APIs) suffer too much resource starvation to be useful in the automatically detected conditions I outlined above? In addition to this, what ETW information might give us a better glimpse into what is happening when a system becomes frustratingly unresponsive?

These are the questions I want to dig into to arrive at a fully automated system for pinpointing the reasons for slowness and arriving at human-readable explanations for what is happening.

Troubleshooting Netfilter

2017-04-19T13:51:00.002-07:00

I'm developing a Linux "Diverter" to handle packets for FakeNet-NG, and I've run into some mind-bending issues. Here is a fun one.

I needed to make FakeNet-NG respond when clients use it as their gateway to talk to arbitrary IP addresses. This is done easily enough:

iptables -t nat -I PREROUTING -j REDIRECT

At the same time, I needed to make it possible for clients asking for arbitrary ports (where no service was bound), to be redirected to a dummy service. And I needed to write pcaps, produce logging, and allow other on-the-fly decisions to be made. This I did using python-netfilterqueue and dpkt to mangle port numbers on the way in, fix them on the way out, and recalculate checksums as necessary.

These solutions each worked great. But as I learned while demonstrating this functionality, they just didn't work at the same time:

root@ubuntu:/home/mykill# echo fdsa | nc -v 5.5.5.5 45678
nc: connect to 5.5.5.5 port 45678 (tcp) failed: Connection timed out

I compared pcaps from successful and unsuccessful conversations between the client system and an arbitrary IP address (say, 5.5.5.5). In successful cases (where my packet mangling code was inactive), the FakeNet system responded with whatever IP the client asked to talk to, and the two systems successfully finished the TCP three-way handshake necessary to establish a connection and exchange information. But when my packet mangling code was active, the FakeNet system responded with a SYN/ACK erroneously bearing its own IP address, and the client responded with an RST.

RST is TCP-ese for "Sit down, I wasn't even talking to you."

This behavior led me to the suspicion that my packet mangling activity was preventing the system from recognizing and fixing up response packets so that their IP addresses would match the IP address of the incoming packet (say, 5.5.5.5).

To investigate this, I started by looking at net/netfilter/xt_REDIRECT.c with the goal of learning whether the kernel was using things like the TCP port numbers I was mangling to try to keep track of what packets to fix up. I found that in the case of IPv4, redirect_tg4() calls nf_nat_redirect_ipv4() in nf_nat_redirect.c which unconditionally accesses conntrack information in the skb (short for socket buffer, i.e. the packet), finally calling nf_nat_setup_info() in nf_nat_core.c. The latter function manipulates the destination IP address and calculates a "tuple" and "inverse tuple" that will be used to identify corresponding packets by their endpoint (and other protocol characteristics) and fix up any fields that were mangled by the NAT logic.

I was surprised conntrack was involved because I hadn't needed to use the -m conntrack argument to implement redirection. To confirm what I was seeing, I used lsmod to peek at the dependencies among Netfilter modules. Sure enough, I found that xt_REDIRECT.ko (which implements the REDIRECT target in my iptables rule) relies on nf_nat.ko, which itself relies on nf_conntrack.ko.

I still didn't have the full picture, but it seemed more and more like I was on to something. Perhaps the system was calculating a "tuple" based on the TCP destination port of the incoming packet, my code was modifying the TCP destination port, and then the system was getting a whack at the response packet before I had a chance to fix up its TCP source port to something that would result in a match.
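That hypothesis can be expressed as a toy model. The sketch below is not how the kernel stores or matches tuples; it only mimics the idea that a reply is fixed up when its addresses and ports match the inverse of what was recorded for the incoming packet. All addresses, the dummy listener port, and the helper names are hypothetical.

```python
LOCAL = "192.168.1.10"   # hypothetical FakeNet-NG address
CLIENT = "192.168.1.20"  # hypothetical client address
DUMMY_PORT = 1337        # hypothetical dummy listener port


def tuple_of(pkt):
    return (pkt["src"], pkt["sport"], pkt["dst"], pkt["dport"])


def inverse(t):
    src, sport, dst, dport = t
    return (dst, dport, src, sport)


# Client talks to 5.5.5.5:45678; REDIRECT rewrites the destination to us.
original = {"src": CLIENT, "sport": 50000, "dst": "5.5.5.5", "dport": 45678}
after_nat = dict(original, dst=LOCAL)

# Toy tracking table: expected reply tuple -> source address to restore.
table = {inverse(tuple_of(after_nat)): original["dst"]}


def fix_up(reply):
    """Restore the original address if the reply matches a tracked tuple."""
    restored = table.get(tuple_of(reply))
    return dict(reply, src=restored) if restored else reply


# Reply leaves the dummy listener with its own port still in place...
reply_unfixed = {"src": LOCAL, "sport": DUMMY_PORT,
                 "dst": CLIENT, "dport": 50000}
# ...versus a reply whose source port was restored before the lookup.
reply_fixed = dict(reply_unfixed, sport=45678)

print(fix_up(reply_unfixed)["src"])
print(fix_up(reply_fixed)["src"])
```

In this model, only the reply whose source port has already been restored matches the tracked tuple and gets its source address rewritten back to 5.5.5.5, which is the behavior the pcaps suggested.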

I wanted to figure out when the REDIRECT logic was executing versus when my own logic was executing so I could confirm that hypothesis. While I puzzled over this, I happened upon some relevant documentation that led me to believe I might be correct about the use of TCP ports (rather than, say, socket ownership) to track connections:

Yeah dummy. It's the port.

This documentation also answered my question of when the NAT tuple calculations occur:

Connection tracking hooks into high-priority NF_IP_LOCAL_OUT and NF_IP_PRE_ROUTING hooks, in order to see packets before they enter the system.

These chains were consistent with the registration structures in xt_REDIRECT.c, which further indicated that the hooks were specific to the nat table (naturally):

static struct xt_target redirect_tg_reg[] __read_mostly = {
 .
 .
 .
    {
        .name       = "REDIRECT",
        .family     = NFPROTO_IPV4,
        .revision   = 0,
        .table      = "nat",
        .target     = redirect_tg4,
        .checkentry = redirect_tg4_check,
        .targetsize = sizeof(struct nf_nat_ipv4_multi_range_compat),
        .hooks      = (1 << NF_INET_PRE_ROUTING) |
                      (1 << NF_INET_LOCAL_OUT),
        .me         = THIS_MODULE,
    },
};

At this point, I really wanted a way to beat Netfilter's OUTPUT/nat hook to the punch. I needed to fix up the source port of the response packet and see if I could induce Netfilter to calculate correct inverse-tuples and fix up the source IPs in my response packets again. But the documentation says Netfilter implements its connection tracking using high-priority hooks in the NF_IP_LOCAL_OUT and NF_IP_PRE_ROUTING chains. That sounds a lot like Netfilter gets first dibs. I sat in uffish thought until I remembered this very detailed diagram (click to enlarge):

You are here. No, wait.

If this graphic was correct, I was about to get in before Netfilter -- by changing my fix-up hook to run in the OUTPUT/raw chain and table. I gave it a shot, and...

root@ubuntu:/home/mykill# echo asdf | nc -v 5.5.5.5 9999
Connection to 5.5.5.5 9999 port [tcp/*] succeeded!
asdf

VICTORY!

It was a hard-fought battle, but I actually was able to confirm my suspicions thanks to the very readable code in the Netfilter portion of the kernel and some very helpful documentation. It's fun to be working on Linux again!

Remember the Registers (or, the ABCs of x86)

2017-02-03T21:19:00.000-08:00

I have a friend who is about to learn assembly language on x86. This means understanding how processor registers are used. So, I'm sharing some mnemonics for remembering the registers and their roles on x86. Maybe they will help you, too.

A register, by the way, is just a memory location that is directly (and quickly!) accessible by the processor, usually located in the processor circuitry as opposed to being accessible at a memory address (although this depends on the hardware).

Oh. Okay, so not one of these. Got it.

I'll completely omit floating point (and MMX/SSE) stuff, but I'll briefly mention amd64 for the sake of example.

A, B, C, and D

There are four general purpose registers whose names are basically A, B, C, and D. On old 16-bit Intel processors, they were named AX, BX, CX, and DX. From each of these, you could also access the low eight bits (e.g. AL, as in "A low") and the high eight bits (e.g. AH, as in "A high"), and you still can.

When 32-bit came along, they prefixed them all with an E (meaning "extended"), so now they are EAX, EBX, ECX, and EDX.

And when AMD devised their derivative amd64 architecture, they prefixed them all with an R, presumably meaning "really-friggin'-extended". So, on amd64, they are named RAX, RBX, RCX, and RDX.
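To make the naming concrete, here is a small sketch that slices one 64-bit value into the views each name refers to. The function name `views` and the sample value are arbitrary choices for illustration.

```python
def views(rax):
    """Return the sub-register views of a 64-bit 'A' register value."""
    return {
        "RAX": rax & 0xFFFFFFFFFFFFFFFF,  # all 64 bits
        "EAX": rax & 0xFFFFFFFF,          # low 32 bits
        "AX": rax & 0xFFFF,               # low 16 bits
        "AH": (rax >> 8) & 0xFF,          # bits 8-15 ("A high")
        "AL": rax & 0xFF,                 # bits 0-7 ("A low")
    }


for name, value in views(0x1122334455667788).items():
    print(f"{name:>3} = 0x{value:X}")
```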

The "A" register has a few special roles, but most importantly it is used to hold the return value (i.e. the result) when functions return to their callers.

The "C" register sometimes has a special role, too, that I'll describe next.

So much for A, B, C, and D.

String

Several string instructions (e.g. cmps, movs, stos, and lods) have implied operands:

ESI = Source
EDI = Destination

ECX = Counter
EAX = Value to write (sometimes applicable, sometimes not)
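To see the implied operands in action, here is a toy model of `rep movsb`, which copies ECX bytes from the address in ESI to the address in EDI. It assumes the direction flag is clear (addresses count upward) and treats a Python `bytearray` as memory; `rep_movsb` is a hypothetical helper name.

```python
def rep_movsb(memory, esi, edi, ecx):
    """Model `rep movsb` with the direction flag clear."""
    while ecx:
        memory[edi] = memory[esi]  # copy one byte from source to destination
        esi += 1                   # both pointers advance...
        edi += 1
        ecx -= 1                   # ...and the counter runs down to zero
    return esi, edi, ecx


mem = bytearray(b"hello...........")
regs = rep_movsb(mem, esi=0, edi=8, ecx=5)
print(regs, mem)
```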

Stack

The stack starts at a high address in memory and as things are added to it, it grows down to lower addresses. The processor keeps track with a stack pointer, and then functions can further use a "base" pointer to point to the boundary between the caller's stack (and two other pieces of data), and their own local variables.

ESP = Stack Pointer
EBP = Base Pointer

Instruction Pointer

So special that it's all alone under its own heading:

EIP = Instruction Pointer

This register points to the instruction that is about to be executed.

Extra Credit: Segment Registers

Remember in Windows 95 when blue screens used to report "CS:EIP = xxxxxxx"? Neither do I, let's pretend I didn't admit that. Anyway, that CS is a segment selector indicating the location of the code segment. Modern operating systems use a flat model, so they're mostly set the same.

CS = Code Segment
DS = Data Segment
ES = "Extra" Segment (meh)
SS = Stack Segment
FS & GS = Even more extra segments, using the letters that follow C, D, and E, namely F and G Segments

These registers are 16 bits long and contain integers called segment selectors; the CPU reads the Global or the Local Descriptor Table (the GDT or the LDT) to find the base address and size of each memory segment indicated by those selectors. Operating systems commonly use a "flat" model instead of a segmented model, so these may all contain the same value. Since segmentation is largely a non-issue now, the extra segment registers are sometimes used to point to interesting structures. For example: FS => Thread Environment Block (TEB) in userspace Windows 32-bit applications.
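The 16-bit value held in a segment register (a selector) packs three fields: a table index in the upper 13 bits, one bit choosing between the GDT and the LDT, and a two-bit requested privilege level. A minimal sketch of pulling them apart follows; the function name and the sample value 0x23 are chosen only for illustration.

```python
def decode_selector(selector):
    """Split a 16-bit segment selector into its three fields."""
    return {
        "index": (selector & 0xFFFF) >> 3,             # bits 3-15: table index
        "table": "LDT" if selector & 0x4 else "GDT",   # bit 2: table indicator
        "rpl": selector & 0x3,                         # bits 0-1: privilege level
    }


print(decode_selector(0x23))
```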

Summary

In review:

EAX, EBX, ECX, EDX = A, B, C, D; Note that the 'A' register holds function return values
ESI, EDI = Source, Destination (for string operations) - ECX may be the counter and EAX may be used, too.
ESP, EBP = Stack Pointer, Base Pointer
EIP = Instruction Pointer
CS, DS, SS, ES, FS, GS = Code, Data, Stack, and Extra segments, followed by F and G Segments

For the authoritative reference on all of this, see the Intel processor manuals.

Happy hackin' :)

Word.

2016-11-11T22:46:00.002-08:00

I hate the mouse, so it makes sense that I would especially hate copying and pasting from the command line. I usually just need one word of screen output to go about my business and do what I need to do, but this requires activating the control box (Alt+Space), activating the Edit menu, activating the Mark command and then dragging the cursor over the text of interest. Woe be to me if I screw up the text selection, because then I get to start all over again. And then there's when Windows doesn't respond consistently to the Alt+Space keystrokes I'm sending in order to activate the control box (yes, this happens; no, it's not human error).

I hate this so much that I wrote a pair of programs (line and word) to resolve it. Here are three examples of using them: getting all IP addresses, copying one of several paths to the clipboard, and isolating a filename in a paragraph of output.

If you invoke line or word without an argument, it will print the output with line numbers. An argument will isolate the lines or words you specify. Word accepts negative indices like Python. Line does not, because I didn't want to hog memory reading all of stdin into a buffer, though it wouldn't be hard to add, and it's not unreasonable since this is mostly for use with small buffers. Neither program accepts ranges, because I didn't have lots of time to spend.
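The memory trade-off mentioned above is easy to see in a sketch. This is not the code from the repository, just a hypothetical Python rendering of the idea: picking a line by positive index can stream, while negative indexing needs the whole input in hand. The function names and the 1-based line numbering are assumptions.

```python
from itertools import islice


def pick_line(stream, n):
    """Return the n-th line (1-based) without buffering the whole stream."""
    return next(islice(stream, n - 1, n), None)


def pick_word(text, i):
    """Return the i-th whitespace-separated word; negative i counts from the end."""
    return text.split()[i]


lines = iter(["alpha\n", "bravo\n", "charlie\n"])
print(pick_line(lines, 2))
print(pick_word("one two three", -1))
```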

You can get the code here: https://github.com/strictlymike/pipe-slice

Teachers

2016-11-05T12:24:00.000-07:00

We've moved into our new home, and I was putting away all my computing books. This time, I decided to put them in a deliberate order, starting with the books that I read as a kid and moving on through the system layers, my education, my work, and my dabbling. Each time I see these books, they retell the story of how my curiosity grew into a professional pursuit, and they remind me how happy I am that I was able to connect with such an interesting profession. They make me think of my teachers...

During the summers, I spent a lot of time at my grandmother's house. One summer, my dad brought an old Compaq portable that the army had retired, and he set it up in the play room in her basement. I recall that the monitor supported b&w, amber, and green. Naturally, I preferred green. I used this to run newps.exe (The New Print Shop - you can play it on archive.org).

Newps allowed me to create small monochrome images (probably 64x64 pixels) and store a few dozen of them to a "folder". I learned that if I scrolled through the images by holding the down arrow, they would display sequentially, which allowed me to build short animations. I used this to create my masterpiece: a hand-dithered animation of my cousin, winking. Wow.

My dad also sometimes took me to the office with him and let me play on the computer. I remember playing some sort of DOS game and accidentally winding up in the GW-BASIC interpreter. I must have broken the game, because I saw a clubs suit, little faces, Greek letters, and a mess of other inscrutable data pour out onto the screen. My dad told me these were called "characters", and that programmers probably used some of those characters to write the program. This idea blew my mind - the concept that something visual, graphical, animated - was written by someone. After that, every time I came in contact with a computer, I was preoccupied with working out what my dad must have meant when he said that software could be "written".

My dad later brought home an NEC laptop with Windows 3.11, and I stared closely at the dithering that was used to produce a realistically colored eel using 256 colors. Thinking of that thick laptop with the tiny screen reminds me that my dad was the first person to tell me to RTFM. When I wanted to understand how to play Rodent's Revenge, draw with MS Paint, or use a DOS command, he told me to read the help file.

After a while, CDs became more and more common, and the games I wanted to play were not available on floppy disks. I begged my dad for a CD-ROM drive until, one day, he walked in and threw one on my bed. When I asked him to install it, he told me to do it myself, and when I asked him how, he said "read the instructions." As a 13-year-old, I felt helpless and sorry for myself for about a split second. That was the day I learned about config.sys, emm386.exe, autoexec.bat, and mscdex.exe.

And then, RTFM I did. I somehow wound up with a copy of Dan Gookin's DOS for Dummies, which introduced me to EDIT.COM, boot disks, AUTOEXEC.BAT, and CONFIG.SYS. That summer, I became addicted to reading HELP.EXE while listening to techno all night. I unnecessarily used RAM drives to "accelerate" my file access. I used the VSAFE.COM TSR to protect myself from viruses (uh, right). I hid files from my family by placing them inside of compressed volume files (*.CVF) and applying file attributes to hide them. I had nothing worth hiding, but I just wanted to be able to hide stuff. Out of ignorance or unavailability of BIOS features, I wrote a batch file that used CHOICE.COM to password protect my computer and prevent my dad from taking up space with Internet Explorer 3.0 when I was a diehard Netscape Navigator user. And of course, I used filemgr.exe to delete Net Nanny. (Holy crap, are they still in business?!)

Then when I was in high school, my dad gave me a computer that went with me to my mom's house and had Windows 3.11 on it. My friend asked if it had a modem and I said yeah, but no AOL. He decided to come over, and the minute he saw it, he connected it to the phone line, popped up Terminal, and dialed into the DEC VAX at the University of Wisconsin - Milwaukee (UWM). From there, he ran telnet and connected to jgsdos.brooktrout.com, port 5000, and it was then that I learned about Zolstead's MUD (Multi-User Dungeon). The fact that a buddy could just walk up to a computer he had never seen before and connect over a phone line to a computer and then to another computer from there, once again blew my mind.

After that, it mushroomed. I found QBASIC.EXE and learned to write software. Awful software! But software, nonetheless. I started reading about C and my dad got me books about that. My friend introduced me to his buddy who, at the age of 14 or 15, seemed to already know everything about computers. While evading my girlfriend's parents, I hid at the public library and checked out an 800-page beginner's book on C++. I read the entire book before it was due back at the library and took copious notes. I was so intent on learning C++ that I wanted to buy a compiler, but Visual Studio was way too expensive, so I bought Borland C++ 3.1, for $30.

High tech.

After that, I started looking around for where I could get more of this. I audited classes at UWM and attended two semesters of C++ courses at MSOE. When I realized that this was what college could be like, I just barely pulled out of my academic nose dive and persuaded the MSOE admissions staff that I was worth making an exception for.

I fondly remember many dozens of teachers and moments on the journey. My grandmother, who did me the incredible favor of teaching me to type. It was one of those boring basics that made everything else easier. John Carmack, who innovated video games and made me wonder about the connection between mathematics and 3D graphics. My cousin, who patiently explained, multiple times, the difference between RAM, CPU, hard disks, and floppy disks; let me use his AOL account; introduced me to GIFs; and repeatedly reminded me how to get my computer to run Gorillas.

Then there were the teachers of my professional career. Professors, University staff, and in later years, other students. My boss at my first job. The friend who saw me reading about shellcode and Linux drivers and decided to pull me into a community of hackers. The people I worked with there. And as a consultant. And as a reverse engineer.

Now, I see almost everyone I meet or know or even read about as a teacher. It's what we all are, because we're doing what we love and we want to connect other people with something we feel is very powerful. These books remind me of all the teachers I ever had. The MS-DOS 5.0 user manual, Patterson & Hennessy's Computer Organization & Design, The Shellcoder's Handbook, Mastering Algorithms with C, Rootkits... My dad, my friends, my colleagues, and my co-workers.

To this day, I am trying to figure out what the sequence of events has to be to elicit and maintain a learner's interest in some discipline of study. The learners I am interested in are young, old, and in-between. Here are some of my thoughts on how to engage them:

Get the learner past the "boring basics" that make everything else available, possible, and achievable.
Create opportunities for "magic moments" by exposing the learner to as many facets of our world as are available to you; when you see a "magic moment" take place, figure out how to encourage and help create more of them - gently though, because the moment you find yourself leading the horse to water, you will be disappointed to find that it is no longer thirsty.
Never let yourself get in the way of a learning opportunity. When you are asked to facilitate an obsession, provide the materials and let the learner do the work.
When we are willing to invest precious resources (time, education, or equipment), we show that we take the learner's studies seriously, and the learner has an opportunity to take themselves and their studies more seriously as well, which can lead to more focus and more courage.
Always be open to explaining how things work.

Maybe if I'm lucky, I will think of more things to add to this list.

The case for reinventing the wheel

2016-10-23T11:19:00.000-07:00

I really like to use IDA Pro as my debugger, and shellcode is no exception. Initially I couldn't see why anyone would ever write their own loader for analyzing shellcode. Siko et al released shellcode_launcher.exe along with the Practical Malware Analysis labs, so why rewrite that code? shellcode_launcher.exe does the work of ReadFile / VirtualAlloc / VirtualProtect, et cetera, so I just make that my database and pull in the VirtualAlloc'd memory using IDA Pro's memory snapshot facilities. Then, I go to town.

Well, I changed my tune when I discovered that VirtualAlloc was not receptive to my suggestions for where to allocate memory. (WinDbg: bp <callsite>; g; ed esp <lpAddress>; p). Without a consistent shellcode base address, none of my annotations from the IDA memory snapshot I took were lining up with the actual shellcode in subsequent debug sessions.

Edit January 14th, 2018: At this point, we have a choose-your-own-adventure on our hands:

If you use remote debugging, and/or you like to see IDA Pro annotations superimposed over your debugger session, and your shellcode itself allocates additional memory and executes code there, then you might be better off reading my fireeye.com blog article titled Debugging Complex Malware that Executes Code on the Heap.
If you don't use remote debugging, then you might be satisfied capturing snapshots of your debugging VM at critical points in the debug session so you can iteratively debug and understand the shellcode.
Finally, if your shellcode does not execute additional code on the heap and you just want to give it a uniform memory map in which to iteratively debug it, then read on...

For simple cases, you can reinvent the wheel and write your own shellcode loader to force shellcode to live at the same virtual address each time you debug it. But no need to start from scratch; here's the path of least resistance...

Assuming you have the shellcode as a raw binary, use xxd's feature of outputting shellcode as a C include file:

xxd -i myshellcode > myshellcode.c

That gives you a hexdump in C form:

unsigned char myshellcode[] = {
    0x55, 0x8b, 0xec, 0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef, 0x01,
    0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef, 0x01, 0x23, 0x45, 0x67,
    ...
    0x01, 0x23, 0x45, 0x67, 0x89, 0xab, 0xcd, 0xef, 0x01, 0x23, 0x45, 0x67,
};
unsigned int myshellcode_len = 4242;
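
If xxd isn't handy on your analysis box, the same kind of include file can be approximated in a few lines of Python. This is a sketch with a hypothetical helper name; it mimics the general shape of xxd -i output, though the exact whitespace may differ.

```python
import re


def c_include(data, name, per_line=12):
    """Render bytes as a C array plus a length variable, similar to xxd -i."""
    ident = re.sub(r"\W", "_", name)  # make the name a valid C identifier
    rows = []
    for i in range(0, len(data), per_line):
        chunk = data[i:i + per_line]
        rows.append("  " + ", ".join("0x%02x" % b for b in chunk))
    body = ",\n".join(rows)
    return ("unsigned char %s[] = {\n%s\n};\nunsigned int %s_len = %d;\n"
            % (ident, body, ident, len(data)))


print(c_include(bytes([0x55, 0x8b, 0xec]), "myshellcode"))
```

In practice you would pass the contents of the raw shellcode file (opened in binary mode) instead of the three sample bytes.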

So now, write your loader:

#include "myshellcode.c"

typedef void (*fptr)(void);

int
main(void)
{
    fptr sc = (fptr) myshellcode;
    __asm int 3 ; Safety - so I don't execute this on my analysis box (or worse!)
    sc();
}

Then do this (in an SDK prompt):

cl.exe loader.c

If you don't have Visual Studio, just get Microsoft's compiler for Python 2.7.

After compiling and linking, you'll get this:

I was worried about execute permissions when calling into my shellcode, but happily, it Just Works, perhaps because I ran cl.exe directly without using Visual Studio to specify its usual flags. The program loads in IDA as a PE-COFF, it can be debugged using IDA's debugging plugins, and the shellcode is always at the same address (in my case, unk_40A000). Therefore, you can annotate the shellcode without using the IDA Pro memory snapshot facilities to save it from a debugger session, and (this is the important part) without worrying that VirtualAlloc will return a different address during the next debug session, rendering your annotations less useful to you. The same applies to breakpoints: they will actually work, from session to session. That makes life easier.

This one weird trick for decoding DLL malware strings

2016-10-20T06:16:00.001-07:00

TL;DR: argtracker and ctypes. It's the ctypes part that surprised me. Read on to see why.

This procedure can make light work of decoding strings in a DLL that has a horrifying string decoder or contains a metric ton of strings. The first stage leans on code that's already out there, with a bit of duct tape to get to the second stage; the second stage is to load your malware and call into it. There's just one stick-in-the-mud limitation: it has to be a file you can load into your address space using LoadLibrary, such as a DLL. Otherwise, you have to use a different kind of tool (I'll discuss this later).

First of all, gather all the strings you want to decode. Jay Smith wrote a very cool tool for this that uses Vivisect to emulate code and locate arguments. It's called argtracker. Don't duplicate it like I was starting to do with idaapi. Please, for the love of all that is lazy, just download it and get it installed.

The IDA Python script below is basically the code from the FireEye blog with a second function added to print all the encoded strings out so you can feed them to the second stage of this procedure. If your strings aren't printable prior to decoding, then you'll need to change this up a bit.

import idautils
import idc
import vivisect
import flare.argtracker as c_argtracker
import flare.jayutils as c_jayutils

# Obtain the address where each argument is referenced by the decoder along
# with the offset that was referenced
def get_first_push_arg(decoder):
    ret = []
    vw = c_jayutils.loadWorkspace(c_jayutils.getInputFilePath())
    tracker = c_argtracker.ArgTracker(vw)
    xrefs = idautils.CodeRefsTo(decoder, 1)
    for xref in xrefs:
        argslist = tracker.getPushArgs(xref, 1)
        for argdict in argslist:
            # Argument 1 is a (referencing address, pushed value) pair
            va_at, offset = argdict[1]
            ret.append((va_at, offset))
    return ret

# Now go get each string
def print_va_off_and_contents(pushed_args):
    print('refva, off, argcontents')
    for (va_at, offset) in pushed_args:
        print(hex(va_at) + ', ' + hex(offset) + ', ' + idc.GetString(offset, -1, 0))
        # https://www.hex-rays.com/products/ida/support/idadoc/283.shtml
        # 0 <= ASCSTR_C
        # 3 <= ASCSTR_UNICODE

Provide your decoder's virtual address to get_first_push_arg, and then supply the returned list to print_va_off_and_contents to get something you can massage into shape for the second stage. Yes, I know, I'm using print instead of Python's logging module. The title of this blog was actually going to have the word "lazy" in it. Maybe it still should. Anyway...

Second and final step: load the malware and call its decoder. The interesting thing I learned is that Python ctypes can call non-exported functions. What a happy surprise! First, you have to define a function prototype, then you obtain a callable by hooking that prototype to an address in your binary where the function lives. There are prototypes for stdcall (WINFUNCTYPE) and cdecl (CFUNCTYPE). We're using stdcall. Here's a convenient snippet along with the string decoding goodness.

from ctypes import *

# Modify all this
offset = 0x4321                             # Decoder offset in your mal DLL
strings = [                                 # Populate from stage 1 (above)
    [0x10001234, "ABCdef"],
    [0x10005678, "ZYX990"],
    # ...
]
dll = cdll.my_malware_dll                   # Modify to load your DLL
prototype = WINFUNCTYPE(c_char_p, c_char_p) # Stdcall, accepts & returns char*

# Leave this alone
string_decoder_addr = dll._handle + offset
decode = prototype(string_decoder_addr)

for (va, s) in strings:
    print(hex(va) + ' ' + s + ' -> ' + decode(s))

Simple, dimple. Paste the strings from IDA Pro into this script, ctypes loads and calls into the malware, and Bob's your uncle. For extra credit, you can update this script to emit another script that will create the appropriate comments or bookmarks in IDA Pro. This ctypes procedure works great for DLLs. Unfortunately, next time, it'll probably be an EXE and not a DLL. For those cases, you'll have to adapt this to a different tool, such as flare-dbg, to control malware execution and feed it the strings you want to decode. I'll talk more about tools and techniques for this another time.
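
If you want to convince yourself of the prototype-plus-address trick without loading any malware, here is a harmless sketch. It wraps a Python function in a ctypes callback, takes the callback's raw address as a plain integer, and then builds a second callable from nothing but that address, which is the same move the decoder script makes with dll._handle + offset. All names here are made up for the demo, and it uses CFUNCTYPE (cdecl) so it runs on any platform.

```python
import ctypes

# cdecl prototype: int f(int)
PROTO = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_int)


def _double(x):
    return x * 2


# A native-callable thunk for the Python function. Keep a reference to it,
# otherwise the thunk can be garbage collected while still in use.
callback = PROTO(_double)

# Pretend all we know is a raw address, as with a non-exported function.
addr = ctypes.cast(callback, ctypes.c_void_p).value

# Bind the prototype to the address to get something callable.
by_address = PROTO(addr)
print(by_address(21))  # 42
```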

Script Kitties Early Trick or Treat, Part 2

2016-10-13T21:40:00.001-07:00

I promised a treat. Well, as scripts go, this will probably be like the time you went trick-or-treating as a kid and the old couple gave you three pennies and then you walked down the street and realized the pennies seemed to smell bad, but hey, it's money you didn't have before, so what the hay. It's not quite that bad, it's just I wrote it in 2006 and I didn't do much to bring it into modern times. But, here we go...

In 2006, PowerShell was just about to be released and around the same time I was thinking, darn it, wouldn't it be easier to experiment with VBScript if they had given me a command line? So I made one.

As it turns out, some malware is written in VBScript, so this came in handy a while back for me to decode a few lazily "encoded" strings that were assembled using the VBScript Chr() function and string concatenation. It let me figure out what COM objects were being created and move on with my life, so maybe it'll be useful to you.
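
For the simplest form of that "encoding", you don't even need a script host. Here is a Python sketch (hypothetical function name) that resolves an expression built only from Chr()/ChrW() calls with literal numbers and quoted string literals joined by &. Anything fancier, such as variables or arithmetic, still calls for actually evaluating the VBScript.

```python
import re

# Matches Chr(65), ChrW(&H41), or a "quoted literal" ("" is an escaped quote).
TOKEN = re.compile(
    r'ChrW?\(\s*(&H[0-9A-F]+|\d+)\s*\)|"((?:[^"]|"")*)"',
    re.IGNORECASE,
)


def decode_chr_concat(expr):
    """Join the characters and literals found in a Chr()-style expression."""
    out = []
    for m in TOKEN.finditer(expr):
        num, lit = m.group(1), m.group(2)
        if num is not None:
            is_hex = num.upper().startswith("&H")
            out.append(chr(int(num[2:], 16) if is_hex else int(num)))
        else:
            out.append(lit.replace('""', '"'))
    return "".join(out)


print(decode_chr_concat('Chr(87) & Chr(83) & "cript.Shell"'))  # WScript.Shell
```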

I also added the ability to switch to JScript, because people also write malware in JScript, so hell, why not.

Here's a little demo:

You're wishing I just gave you the pennies now, aren't you.

Yeah, that's it. If you look at the code, you'll find out why it stinks just like those pennies. But it serves its purpose. So, enjoy!

Here's the code: https://github.com/strictlymike/eval-hta

Script Kitties Early Trick or Treat, Part 1

2016-09-14T06:57:00.002-07:00

Some of my old sysadmin tricks became useful again when I analyzed some malware targeting Windows Scripting Host (WSH). In this article I'll share a trick, and in the next, I'll share a treat.

When logic gets hairy, both developers and malware analysts open a debugger to get more information. But what can be done when the target platform is WSH? As it happens, there are debuggers for this, too, and they can be had by installing either Microsoft Office or Microsoft Visual Studio in your dynamic analysis VM. To invoke the debugger, use the /X switch of either cscript.exe or wscript.exe, e.g.:

wscript.exe /X rat3ie.vbs

Here's the Visual Studio debugger, halting on line 1 of a craptacular VBScript RAT:

This gives the ability to view local variables in the Locals tab (at bottom), set breakpoints, and step through code.

That's all for this little nugget. Next time, I'll post a tool I wrote in 2006 that came in handy for conveniently and interactively evaluating VBScript and JScript to de-obfuscate strings and experiment with malware functionality.

"Advanced" OllyDbg Scripting

2016-09-06T14:18:00.003-07:00

Alternative possibilities:

I'm daft;
OllyDbg's "Warn when breakpoint is outside the code section" option can't (always?) be truly disabled in odbg110; or,
This is not the droid (i.e. option) that I'm looking for.

In any case:

Set sh = CreateObject("WScript.Shell")

While True
    Call sh.SendKeys("%Y")
    Call WScript.Sleep(100)
Wend

And goodbye to this dialog when attempting to find the OEP by tracing into:

Next episode, we answer the question: did OllyDump ever finish? ;-)

Edit 10/14/2016: It never finished, so I ended up doing it manually by catching the unpacker in a memcpy and dumping its payload from poi(esp+4). You live, you learn.

Process Monitoring for the Curious and Paranoid

2016-08-29T15:10:00.001-07:00

It's been months since I had time for any of this, but I've been thinking for a long time about what I would discover if I were to monitor process creation with some sort of balloon notification. Between coming to bed late one night, some scraps of time here and there to document it, and a day home to polish it off while my sick daughter naps, here's a useful tool. I want to emphasize, it's hacky, but for a busy father's casual/opportunistic research, it's enough to play jazz.

Objective

I would like to expediently answer a few questions, including:

Is a new process the reason why my mouse pointer changed to the wait icon?
Was a new process responsible for my computer slowing down?
How often do new processes start, anyway?
What are some commonly executed processes that I haven't noticed yet?
Does this process run any sub-processes?
Is there any process associated with that pop-up, or is it an already-running process?

Poring over my event logs is the wrong answer because eventvwr is slow to pop up and navigate, so when I am experiencing slowness, it doesn't allow me to get up-to-the-moment answers. Also, it can be tedious and time consuming to go back and find the right event, and my boss doesn't pay me to stare at event logs. And then how do I know that this event occurred at the same time as the phenomenon I'm observing?

What I want is a way to casually take note of interesting process creation events throughout the day without really spending time on it.

Alternative Solutions

I've had a few options rolling around in my head for a while:

Instance creation event query on Win32_Process creation - Around 2005 I experimented with this and found that it cannot catch short-lived processes because they are created and destroyed between polling intervals which must last, at minimum, one second.
Win32_ProcessTrace - I started out with this, but alas, these events do not contain full image name information, so I needed to query the OS for further information, and again, short-lived processes result in information loss.
Monitoring event logs - Event ID 4688 provides image names, but advanced configuration is required to obtain full command lines. Alternatively, SysInternals' Sysmon logs this information by default. WMI or other methods could be used to notify on event creation.
The Windows kernel exports PsSetCreateProcessNotifyRoutineEx, which provides access to a convenient PS_CREATE_INFO structure containing the full image name and command line. Alas, this requires either purchasing and protecting an expensive driver signing certificate, or leaving a kernel code execution vulnerability unpatched so as to inject a driver as described in the whitepaper I published in February.
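
To illustrate why polling loses short-lived processes, here is a toy Python model. It is a simplification, not WMI: each process is a made-up (name, start, end) tuple in seconds, and a poll only "sees" a process that is alive at the polling instant.

```python
def seen_by_polling(processes, interval, duration):
    """Return the names of processes alive at one or more polling instants."""
    seen = set()
    steps = int(duration / interval)
    for step in range(steps + 1):
        t = step * interval
        for name, start, end in processes:
            if start <= t < end:
                seen.add(name)
    return seen


timeline = [
    ("explorer.exe", 0.0, 60.0),  # long-lived
    ("whoami.exe", 2.2, 2.3),     # starts and exits between two polls
    ("netsh.exe", 4.1, 5.4),      # happens to straddle the poll at t=5
]
print(sorted(seen_by_polling(timeline, 1.0, 10.0)))
# ['explorer.exe', 'netsh.exe']
```

With a one-second interval, the whoami.exe run is never observed, which is exactly the kind of recon command I would most like to catch.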

Implementation

Unfortunately, mucking with drivers is not lazy enough for me. Since short-lived processes are important (they are commonly used as part of post-exploitation / recon), a Win32_Process instance creation query won't work. For ease of use, I've created a first draft solution by Frankensteining two C# StackOverflow answers together to use systray balloon notifications with WMI's Win32_ProcessTrace. I put this on the Internet so I could compile it and use it to see what was going on with my work computer.

Here's the gist of it

It's lazy, but for casual/opportunistic research, it's enough to play jazz. It doesn't capture command-line arguments and doesn't always capture the full image name, because it just uses Win32_ProcessTrace and then the .NET System.Diagnostics classes to get process information after the fact.

Alas, it bothers me not to have full image names or command-line arguments. The best source of information I know of in userspace is event logs, but since I had trouble getting the info I needed on advanced logging configuration for my Windows 8.1 box, I just installed Sysmon. Now what?

Another gist

As it turns out, it is necessary to modify the registry and restart the Windows Management Instrumentation service (and its dependent services) to make this work. I added a Microsoft-Windows-Sysmon/Operational key to HKLM\SYSTEM\CurrentControlSet\Services\EventLog and restarted the winmgmt service, and it all came together.

This gist has a detailed console view along with the systray notification to prevent me from having to necessarily open eventvwr to see more details. Here's how it looks:

Observations

Here are a few startling events and associated discoveries sure to send a chill down your spine, all from tracking down process activity during my journey:

netsh.exe just ran. Is this some post-exploitation alteration of my firewall rules? No, a certain VPN client executes netsh.exe to get its work done. This was not the only software I caught doing that.
Windows Remote Assistance COM Server (raserver.exe) executed and terminated immediately. What interfaces does this provide? Could this be post-exploitation enabling of remote assistance for future access? No. It's a scheduled task that triggers upon group policy updates so remote assistance knows to update its configuration.
reg.exe just got run by cmd.exe. Holy schnikes, now I'm truly pwned. Is the parent process a backdoor executing persistence or other post-exploitation commands? Nope. It's just some endpoint management software that IT confirms they deploy and manage.
Added 9/7/16: Heart rate increases as I read C:\Windows\system32\rundll32.exe C:\Windows\system32\inetcpl.cpl,ClearMyTracksByProcess Flags:525568 WinX:0 WinY:0 IEFrame:0000000000000000. Then I remember I just closed an IE in-private window. Take a look at the parent process command line and see: "C:\Program Files\Internet Explorer\IEXPLORE.EXE" -private. WHEW!

Value

So, as you can see, sometimes situational awareness is not all it's cracked up to be! As most DFIR people and hunters are aware, there is plenty of noise just sitting there waiting to alarm you.

Even so, I think this tool could be useful for noticing anomalous process observables such as:

Post-exploitation commands a la RTFM, e.g. whoami, net.exe, netsh.exe, and so on
svchost.exe executing from within your user profile
Ransomware deleting shadow copies using vssadmin.exe
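
As a sketch of how the observables above could be turned into simple checks, here is some Python. The function, the rule wording, and the tool list are all hypothetical and deliberately naive; this is not part of the gists, and as the Observations section shows, rules like these will happily fire on legitimate software.

```python
import ntpath  # Windows path handling that works on any platform

RECON_TOOLS = {"whoami.exe", "net.exe", "netsh.exe"}


def flag_process(image_path, command_line=""):
    """Return a list of reasons a process creation event looks interesting."""
    reasons = []
    name = ntpath.basename(image_path).lower()
    folder = ntpath.dirname(image_path).lower() + "\\"
    cmd = command_line.lower()
    if name in RECON_TOOLS:
        reasons.append("recon-style command")
    if name == "svchost.exe" and "\\users\\" in folder:
        reasons.append("svchost.exe running from a user profile")
    if name == "vssadmin.exe" and "delete" in cmd and "shadows" in cmd:
        reasons.append("shadow copy deletion")
    return reasons


print(flag_process(r"C:\Users\bob\AppData\Local\Temp\svchost.exe"))
print(flag_process(r"C:\Windows\System32\vssadmin.exe",
                   "vssadmin.exe Delete Shadows /All /Quiet"))
```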

This tool can increase your awareness of what applications are responsible for certain behaviors, such as the Get Windows 10 user prompts that everybody loved so much. It can also raise your awareness of cases where security policies do not appear to be doing their job, such as application whitelisting. If you're a paranoid or curious power user, this may all be useful to you. In case it is, here are those gists again:

If I polish this up into something nicer, I'll try to update this article with the link.

TIL: Accessing memory in another process under Linux

2016-03-30T02:52:00.005-07:00

Today it hit home for me that I am now a "Windows guy", because I couldn't remember the name for the select or epoll syscalls, only muttering "WaitForMultipleObjects?" and scratching my head. It hit home even further because I couldn't think of anything other than ptrace for accessing another process's data. Granted, my friend says I've always been a Windows guy and I should get over it. But I really only started learning about how computers work when I began working with Linux, so this bothered me. Hence, I took a little walk down syscalls.h in Linux 3.7.1 to see what would jog my memory or what new things I would find. Indeed, I did find something interesting and relevant.

include/linux/syscalls.h:

asmlinkage long sys_process_vm_readv(pid_t pid,
                     const struct iovec __user *lvec,
                     unsigned long liovcnt,
                     const struct iovec __user *rvec,
                     unsigned long riovcnt,
                     unsigned long flags);
asmlinkage long sys_process_vm_writev(pid_t pid,
                      const struct iovec __user *lvec,
                      unsigned long liovcnt,
                      const struct iovec __user *rvec,
                      unsigned long riovcnt,
                      unsigned long flags);

And here is a bookmark to the relevant file in LXR.
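
For the curious, here is a minimal sketch of calling the userspace wrapper for the first of those from Python with ctypes. It assumes Linux with a glibc that exports process_vm_readv; read_remote and IoVec are names I made up, and the caller needs ptrace-style permission over the target process.

```python
import ctypes
import os


class IoVec(ctypes.Structure):
    """Mirror of struct iovec: a base address and a length."""
    _fields_ = [("iov_base", ctypes.c_void_p), ("iov_len", ctypes.c_size_t)]


def read_remote(pid, address, size):
    """Copy `size` bytes starting at `address` in process `pid`."""
    libc = ctypes.CDLL(None, use_errno=True)
    fn = libc.process_vm_readv
    fn.restype = ctypes.c_ssize_t
    fn.argtypes = [ctypes.c_int,
                   ctypes.POINTER(IoVec), ctypes.c_ulong,   # local iovecs
                   ctypes.POINTER(IoVec), ctypes.c_ulong,   # remote iovecs
                   ctypes.c_ulong]                          # flags (must be 0)
    buf = ctypes.create_string_buffer(size)
    local = IoVec(ctypes.addressof(buf), size)
    remote = IoVec(address, size)
    n = fn(pid, ctypes.byref(local), 1, ctypes.byref(remote), 1, 0)
    if n < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return buf.raw[:n]
```

A quick way to try it is to read a buffer out of your own process, e.g. read_remote(os.getpid(), ctypes.addressof(some_buffer), 16). Expect EPERM in locked-down environments such as some containers.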

It's been a long time since I hacked on Linux, but I wonder what other interesting things have been added since I went over to the dark side (or came back to it, depending upon how you look at it).

Beasting it

2016-03-24T07:50:00.000-07:00

Wherein, I share brute force tools based on treating strings like numbers.

Working in offensive security has opened my mind to the fact that hacks don't have to be beautiful. So in working a couple CTFs recently, brute force has readily come to mind for me (as you can see from other articles on my blog). This recently happened again, when I had an opportunity to run a brute force over the network (yes, very slow, but it was a small character set, so why not) in tandem with working out the real solution, as well as in BCTF 2016 where we were asked to calculate a string whose SHA-256 hash begins with 20 cleared bits.

Sure, what the hay

I wrote a tool in Python, and another in C++, to treat strings as numbers of radix equal to the number of characters in the set of valid characters for the problem. By incrementing each "digit" of the string, and rolling over to the next when necessary, an incrementable string class iterates through all possible values for strings of that length and character set. Both tools use this to brute force a solution using strings of increasing length until either the solution is found or the sequence terminates.
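
Here is a stripped-down Python sketch of that idea, written for this article rather than copied from the repository. The digits are stored least-significant first, so the leftmost character changes fastest; the function names are mine.

```python
def increment(digits, radix):
    """Add one to a list of digit indices; return False when it rolls over."""
    for i in range(len(digits)):
        digits[i] += 1
        if digits[i] < radix:
            return True
        digits[i] = 0  # carry into the next digit
    return False


def candidates(charset, max_len):
    """Yield every string over charset, shortest first, up to max_len chars."""
    for length in range(1, max_len + 1):
        digits = [0] * length
        while True:
            yield "".join(charset[d] for d in digits)
            if not increment(digits, len(charset)):
                break


print(list(candidates("ab", 2)))
# ['a', 'b', 'aa', 'ba', 'ab', 'bb']
```

A brute forcer then just loops over candidates() and hands each value to an evaluator until one satisfies the problem.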

It seems that these sorts of things arise semi-frequently in CTFs, so I generalized these into a single-source-file "framework", polished them up a little bit, and am sharing them below.

As an example, here is the amount of C++ code I would have needed to write using <openssl/sha.h> and linking with -lcrypto to brute force the hash in the BetaFour challenge using this framework. It includes an evaluator callback (try_a_value) that determines whether the current brute force buffer value satisfies the problem, and two supporting functions to hash the value and to determine whether the hash begins with 20 bits of zeroes (it assumes little-endian).

bool
try_a_value(unsigned char *val)
{
    unsigned char md[SHA256_DIGEST_LENGTH];

    PDEBUG("Trying %s\n", val);
    hash(val, md);
    return first20bits0(md);
}

bool
first20bits0(unsigned char *md) { return !(*((uint32_t *)md) & 0x00f0ffff); }

/* Calculate SHA-256 digest of string */
void
hash(unsigned char *startingwith, unsigned char *md) {
    SHA256_CTX ctx;
    unsigned char *data = startingwith;
    int len = strlen((const char *)startingwith);

    SHA256_Init(&ctx);
    SHA256_Update(&ctx, data, len);
    SHA256_Final(md, &ctx);
}

Only 24 lines including whitespace and comments. This will make it easier for me to work on such challenges in the future, so in the spirit of openness and nerdy hackery, I thought I would share it.
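
For comparison, the same evaluator can be sketched in Python with hashlib. This version checks the digest byte by byte, so it doesn't depend on the little-endian assumption noted above. The function names are illustrative and are not taken from the repository.

```python
import hashlib


def has_leading_zero_bits(digest, nbits):
    """True if the first nbits bits of digest (a bytes object) are all zero."""
    full, rem = divmod(nbits, 8)
    if any(digest[:full]):
        return False
    if rem == 0:
        return True
    mask = (0xFF << (8 - rem)) & 0xFF  # top `rem` bits of the next byte
    return (digest[full] & mask) == 0


def try_a_value(val, nbits=20):
    """Evaluator: does SHA-256(val) begin with nbits cleared bits?"""
    return has_leading_zero_bits(hashlib.sha256(val).digest(), nbits)


print(try_a_value(b"hello"))
```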

Brutiful C++ and Python brute force tools for Windows and Linux:

https://github.com/strictlymike/brutiful

CPUID, SMSW, and Other Delights

2016-03-18T00:52:00.000-07:00

I wrote a quick and dirty utility to collect info a la redpill, nopill (props to Danny Quist but I can't find that whitepaper anymore!), etc. Nothing really novel about it, but I thought others may find it useful for researchy scenarios. I used it to investigate a hypervisor running on an Intel microprocessor, so each output line includes an indication of whether VMX appears to be supported. My intent was to train a Bayes learner to identify systems that are lying about whether they support VMX (thus likely detecting a hypervisor), similar to a previous project of mine, except in the course of this project, that became so very unnecessary.

Here is a snippet of its output:

So hex. Much flashy.

This tool works by creating and affinitizing a thread to each logical CPU in the system, executing a few compiler intrinsics and assembly functions, and outputting the desired information for each CPU. Like most research code I post, this tool is only as complete as I needed it to be for my own purposes. Therefore, it does not support 32-bit platforms, does not collect the value of the SLDT instruction from each processor, and is not meant for AMD microprocessors. If you can tolerate all that, then the source code is here:

https://github.com/strictlymike/cpuinfo

For more information about CPUID and using microprocessor instructions, I thought the book Professional Assembly Language Programming was very helpful, and of course, the Intel 64 and IA-32 microprocessor manuals are the authoritative reference on all things Intel x86 and x64.
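
If you only want to interpret a value the tool has already collected, the two bits most relevant here are easy to decode. This Python sketch (hypothetical function name) takes the ECX result of CPUID leaf 1, where bit 5 advertises VMX and bit 31 is the one conventionally set by hypervisors to announce themselves. It doesn't execute CPUID itself; that still takes intrinsics or assembly.

```python
VMX_BIT = 5          # CPUID.01H:ECX[5], Virtual Machine Extensions
HYPERVISOR_BIT = 31  # CPUID.01H:ECX[31], conventionally set under a hypervisor


def describe_leaf1_ecx(ecx):
    """Decode the VMX and hypervisor-present flags from CPUID leaf 1 ECX."""
    return {
        "vmx": bool((ecx >> VMX_BIT) & 1),
        "hypervisor_present": bool((ecx >> HYPERVISOR_BIT) & 1),
    }


print(describe_leaf1_ecx(0x80000020))
# {'vmx': True, 'hypervisor_present': True}
```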

Extreme Rubber Ducking

2016-03-09T22:25:00.000-08:00

"And now for something... completely different."

In 2013, I started to think about how I could barely remember calculus. This made me a sad panda, so I started reviewing undergraduate math and then whipped out my electrical engineering textbooks (yes, I kept those). That gave way to a comprehensive review of my undergraduate coursework that is still in progress.

In pounding through old textbooks without any teachers or tutors, I've learned that preparing to ask for help is actually a great way to solve problems independently. It forces me to walk through my case like a lawyer and rigorously present my assertions so any contradictions are laid bare. I find it most effective when I write it down and commit to posting it on a forum or asking a friend if I don't manage to figure it out by myself. This is like a slightly more stringent form of Rubber Duck Debugging.

I like to use several ducks at once. It's much more powerful that way.

Here, I share two examples of this in a context that is unusual for my blog: circuit analysis. In the first scenario, I was wrong, and in the second, it was the book's fault. Thanks, book! Great job!

Currently Going Nowhere

I first got stuck on a problem that entailed analyzing multiple circuit nodes as a single "supernode". The book did not work an example with three nodes and a dependent source, so I worked the problem repeatedly, getting the same wrong answer each time. After looking at the math over half a dozen times, I concluded that I was misunderstanding how to apply the new concept in this special case (three nodes, dependent source). I tried reading several articles about supernodes and it seemed like I was doing this correctly. I finally gave in and prepared to phone a friend.

For depicting this circuit, I found a pretty cool tool provided by DigiKey called SchemeIt. I threw together this schematic:

In the words of my EE professor: Very simple.

Then I began mounting my case. I started with describing the currents that enter and exit the supernode (nodes 1, 2, and 3). My discussion didn't get very far:

Apply Kirchhoff's Current Law to the supernode:

First, I define $i_1$ to be the same as $i$, $i_2$ to be the current from node 2 down to the reference node (through the 4-ohm resistor), $i_3$ to be the current from node 3 down to the reference node (through the 3-ohm resistor), and $i_4$ to be the current flowing from node 1 through the 6-ohm resistor to node 3. Hence,

$i_4 = i_1 + i_2 + i_3$

Wait.

Wait, wait, wait.

This is a flawed equation. It doesn't take into account the fact that $i_4$ both leaves AND enters the supernode.

So, as far as the supernode was concerned, the current $i_4$ was going... Nowhere. It was both exiting and entering the node, so from the perspective of the supernode, $i_4$ cancelled itself out. I realized this because I took the time to carefully make my case and then the contradiction stood right out: "$i_4$ [is] the current flowing from node 1 through the 6-ohm resistor to node 3" (both part of the same supernode). Onward!
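
Spelled out as a sketch: adding the KCL equations of nodes 1, 2, and 3 gives the supernode equation, and $i_4$ shows up twice with opposite signs. Here currents leaving a node count as positive, and $i_{\text{other}}$ is a placeholder for the sum of the branch currents that really do cross the supernode boundary (their signs depend on the schematic):

$$i_{\text{other}} + \underbrace{i_4}_{\text{leaves node 1}} - \underbrace{i_4}_{\text{enters node 3}} = i_{\text{other}} = 0$$

Only currents that cross the boundary survive the sum, so $i_4$ has no place in the supernode equation.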

This Practice Problem is Not Operational

I later ran into an issue applying the equivalent circuit model of an operational amplifier (an op-amp). I ran through the same process. First, I drew my rendition of the original schematic and my equivalent model. I transformed the circuit using the "non-ideal" model wherein the op-amp's inverting and non-inverting terminals are connected through an input resistance, and the op-amp's output terminal is modeled as a voltage-controlled voltage source in series with a small output resistance.

Equivalent. See? Very simple.

The exercise was to find (a) the closed-loop gain $v_o/v_s$, and (b) the output current $i_o$ when $v_s = 1 V$. I went to work explaining myself, walking through the application of Ohm's law to define current-voltage relationships, Kirchhoff's Voltage Law, mesh analysis, etc., until I arrived at a system of equations. I punched it all into GNU Octave and got those same old familiar answers:

Matlab, eat yer heart out!

From the 3x1 matrix above, I concluded $i_3 = -6.5 \times 10^{-4} A$, or -650 uA. Substituting some more equations, I got:

$i_o = - (-650 uA) = +650 uA$

However, the book's answer was that $i_o = -362 mA$.

For the life of me, I couldn't find my error by talking it through. Before submitting the question to a friend or a forum, I thought I would work a few more examples and then revisit this one. I turned the page and worked the next example, which turned out to be a re-working of the same problem. The end of that example reads:

"This, again, is close to the value of 0.649 mA [aka 649uA] obtained in Practice Prob. 5.1 with the nonideal model."

Wait... That's... Not what the book said on the previous page! But that is what I got, every time I worked that problem! I'd been working this problem over and over, and it was a MISPRINT. ARRRRRRGH!!

Lessons Learned

I learned two things from these exercises:

You can become more self-reliant in solving problems if you discipline yourself to write up the details like you're truly about to defend your thought process to someone else; and,
You're not always wrong ;-)

The glory of this process is that often I can use it to ferret out my own stupid mistakes without ever having to share them with anyone (unless I decide to write a ducking blog article about it).

The experience I got from this also ties in closely with my professional observation that having to write about one's work forces the author to explain why the work is correct, which tends to yield ideas about how that work could have been done better. Which is to say, whether you're working out issues or you already think you're all the way there, reporting on your work will invariably improve the outcome.

You can celebrate by taking your ducky for a bubble bath!

In addition to the moral of the story, I also wanted to point out the following:

GNU Octave is super useful
SchemeIt ain't too terrible, either

So there.