Archive for March, 2009

A Quine in PDF

Tuesday, March 17th, 2009

A quine is a self-reproducing program. One whose output is the source of the program itself. The wikipedia link above goes into more depth.

I’m full of mucus and my head is floating, so instead of working on stuff that matters I spent a few hours working out a quine in PDF. Since I wasted so much time making it, I’ve decided to waste some more trying to explain it. I think to understand this, you’ll need to either already be familiar with the PDF format or you will need to grab the specification and follow along with me.

I’ll go object by object and try to explain how it works at each object.

  • Object 1 – Catalog dictionary
  • Object 2 – Pages dictionary
  • Object 3 – Page dictionary
    • Contains references to XObjects defined in objects 9 and 10
    • Font is defined in object 4
    • Content stream for this page is defined in object 5
    • The page is extra long (/MediaBox [0 0 612 1600]) to fit the whole quine in a single page
  • Object 4 – Font dictionary
  • Object 5 – Content stream for page defined by object 3
    • Draws the entire page — this is “main”
    • Overall flow:
      • \X4 Do – Call XObject to draw the raw file up to the start of object 9.
      • \X3 Do – Call XObject to draw the commands that draw the raw file up to the start of object 9.
      • Draw the bits in between the prologue of object 9’s content stream and the start of object 10.
      • \X3 Do – Call XObject to draw the commands that draw the raw file up to the start of object 9.
      • Draw the prologue of the file. This includes the end of object 10 and the xref table.
  • Object 6 – A NOP XObject stream
  • Object 7 – An XObject stream that draws the first half of the drawing commands to draw a line
    • To avoid putting the left parenthesis (0×28) in a normal string where it would have to be escaped, a hex string (<28>) is used.
  • Object 8 – An XObject stream that draws the second half of the drawing commands to draw a line
    • Again, to avoid putting the right parenthesis (0×29) in a normal string where it would have to be escaped, a hex string (<29>) is used.
  • Object 9 and 10 – XObject streams that print the first part of the PDF file and the drawing commands that print the first part of the PDF file
    • The streams for these two objects are intentionally identical. Their dictionaries define different object for the X1 and X2 objects used in the streams. To print the raw pdf lines, a NOP XObject (object 6) is used for both. To print the drawing commands used to print the raw pdf, the prefix and suffix are defined by objects 7 and 8.

That still seems horribly unclear. Hrm. The main idea is to use the Form XObject object as a macro to easily draw the pdf “escaped” or “unescaped” (to draw the commands to draw a string or just the string). The extra wrench is that the XObjects do not inherit any values from the calling stream. This means we can’t call the same XObject twice to draw the escaped and unescaped versions. Instead, we create two versions with identical streams but different Resource dictionaries (defining escaped or unescaped).

Well, I think that’s the best my medicated ass is going to do. Please leave comments with the bits that are unclear.

EDIT: For quines in many languages, check out The Quine Page.

Patch for libdasm-1.5

Monday, March 16th, 2009

While working on DynaTrex, I found a small but problematic bug in libdasm-1.5 when parsing some floating point instructions. One of the floating point opcode tables was missing 4 null entries in the middle. This resulted in some incorrect parsing for those instructions following the omission (about 32 opcode encodings). I generated a patch and sent it off to the maintainer, but in case this library isn’t maintained any longer I’m posting the patch here. For verification, try disassembling FRNDINT (0xd9 0xfc).

Fuzzing Adobe Reader 9

Thursday, March 12th, 2009

As I mentioned in a previous post, the PDF specification seems bloated. Additionally, the Adobe Reader makes a really good effort to display something even when the PDF document is ill-formed. These observations led me to implement a fuzzing framework with a PDF file format fuzzer as a guinea pig application.

Looking around, I only found one fuzzing framework in Haskell, FileH. It seems very fast, but it is targeted at “dumb” binary fuzzing. Flip some bits here and remove some stuff there… increment these 4 bytes as if they were an integer. It also assumes you have a test harness executing the program in question and returning a non-zero exit code when an interesting execution occurs. My original plan was to create a generative file fuzzer creating new PDFs using the QuickCheck module and to integrate the debugger/execution monitor with the fuzzing framework. With these goals in mind, I set out to write FuzzyWuzzy, the Haskell file fuzzing framework. (Note: I have restrained myself from getting into the gory details, but the link to bitbucket gives you the source should it interest you. Also, keep in mind this is the 5th Haskell project I’ve developed [and the largest]… I’m learning.)

My first step was writing the launcher and monitor bits that interface with the operating system. The first generation would be Windows specific with half an eye towards eventually supporting a ptrace interface. The Win32 modules provided with the GHC installation were almost all I needed. I ended up writing an foreign interface to CreateProcess to support an extra flag (DEBUG_PROCESS) not exposed by the unified System.Process module. I also implemented the foreign interface to TerminateProcess, but that was trivial. With those two extra functions available, I was able to use the functions from System.Win32.DebugApi to create the launcher and monitor to detect crashing programs. I have not yet investigated ways to detect large memory usage or CPU load, but those are on the list for the next version. Currently, the monitor will end an execution and flag it as interesting if the application would have crashed had it not been attached to a debugger.

With the OS stuff out of the way, I turned my attention to file generation. I created an abstract representation of the PDF format and implemented a serialization function to turn a PDF type into a file. After some more serious thought about generating almost valid PDFs from QuickCheck generators, I decided to take another direction.

Instead of generating a PDF from nothing, I wrote a PDF parser to turn a PDF into the abstract representation. The next step was to write mutations on the PDF abstract tree — operations like enlarging Name or String objects, adding long chains of escape sequences to Name or String objects, and deleting entries in Dictionary objects. I also wrote some mutations on the raw character stream going back to disk. These were similar to the mutations done by FileH. At this point, the fuzzer was a complete program. I let it run for a bit and watched Acrobat throw lots of nice message boxes complaining about ill-formed PDFs.

In the course of writing the higher level PDF type mutations, I realized the hierarchical PDF structure made it difficult to pick, say, a random String object (String objects are usually referenced as values in a Dictionary object and would rarely if ever be found as top level [indirect] objects themselves). It would be easier to filter if the PDF was a flat list of objects with each node able to reference the id of another object if needed. After adding this as a transformation from the hierarchical representation, I came up with another bunch of mutations that were much easier to formulate with this representation. With this modification I started finding some crash bugs! My little fuzzer actually works.

Now what? Finding the offending mutation wasn’t difficult and now I have a minimal case to play with. Of course, I’ve been coming up with new ideas for mutations each day.

Ideas for the future:

  • Implement a system for distributed fuzzing – Break up the fuzzing process to be able to easily distribute the work. In other words, have a few computers doing the generations of new files to test and a pool of tester computers to do the runs.
  • Decompose the PDF format further to fuzz the stream contents – Stream objects are usually compressed with the DEFLATE algorithm. This makes for boring fuzzing. Uncompress the Stream objects and decompose them further (Embedded Fonts have known formats, Graphics commands are not difficult to parse, movies, pictures, and music are all stored as Streams as well).
  • Notification system – Email notification of newly found crashes with unique stack/EIP backtraces. Who wouldn’t want to know *immediately*?

After doing all this fuzzing work, it’s become apparent why many people have moved towards developing hybrid fuzzers that use dynamic information to control the future inputs. That is probably where I’ll be heading next. Simple fuzzers are hard to measure (as everyone has already said many many times).

UPDATE: I’ve implemented the e-mail notification. I’ve begun the stream mutation code. I’ve also run into some weirdness with the Haskell <-> CreateProcess interface — I’ve gotten a few rare segfaults. Since I haven’t been successful getting a Haskell monitor written, I wrote a quick one in C, but I haven’t ported the PDF mutation stuff to use it. I’m thinking about writing some Python to manage a simple distributed fuzzing system. Of course, I have been spending all my time lately on other stuff completely, including DynaTrex, an open source binary rewriting tool for Windows. It is still very young.