How UAV::Pilot got Real Time Video, or: So, Would You Like to Write a Perl Media Player?


Real-time graphics isn't something people normally do in Perl, and certainly not video decoding. Video decoding is too computation-intensive to be done in pure Perl, but that doesn't stop us from interfacing to existing libraries, like ffmpeg[1].

The Parrot AR.Drone v2.0 has an h.264 video stream, which you get by connecting to TCP port 5555. Older versions of the AR.Drone had their own encoding mechanism, which the SDK docs refer to as "P.264", and which is a slight variation on h.264. I don't intend to implement the older version. It's for silly people.

Basics of h.264

Most compressed video works by taking an initial "key frame" (or I-frame), which is the complete data of the image. This is followed by several "predicted frames" (or P-frames), which hold only the differences compared to the previous frame. If you think about a movie with a simple dialog scene between two characters, you might see a character on camera not moving very much except for their mouth. This can be compressed very efficiently with a single big I-frame and lots of little P-frames. Then the camera switches to the other character, at which point a good encoder will choose to put in a new I-frame. You could technically keep going with P-frames, but there are probably too many changes to keep track of to be worth it.

Since correctly decoding a P-frame depends on getting all the frames back to the last I-frame right, it's a good idea for encoders to throw in a new I-frame on a regular basis for error correction. If you've ever seen a video stream get mangled for a while and then suddenly correct itself, it's probably because it hit a new I-frame.
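
To make the I-frame/P-frame relationship concrete, here is a toy sketch in C. It is not how h.264 actually encodes differences (real P-frames use motion-compensated prediction and a lot more machinery); it just treats a P-frame as a list of changed pixels. The names `Delta`, `apply_i_frame()` and `apply_p_frame()` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy "P-frame" entry: one changed pixel. */
typedef struct {
    size_t  index;  /* position in the frame buffer */
    uint8_t value;  /* new value at that position */
} Delta;

/* An I-frame carries the whole image, so decoding is a plain copy. */
static void apply_i_frame(uint8_t *frame, const uint8_t *key, size_t len)
{
    memcpy(frame, key, len);
}

/* A P-frame only patches the previous frame in place. If an earlier
 * frame was lost or corrupted, the damage stays in `frame` until the
 * next I-frame overwrites everything. */
static void apply_p_frame(uint8_t *frame, size_t len,
                          const Delta *deltas, size_t n_deltas)
{
    for (size_t i = 0; i < n_deltas; i++) {
        assert(deltas[i].index < len);
        frame[deltas[i].index] = deltas[i].value;
    }
}
```

Because `apply_p_frame()` never touches the pixels it has no delta for, any garbage left behind by a bad frame survives until the next `apply_i_frame()` call, which is the "suddenly corrects itself" effect described above.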

(One exception to all this is Motion JPEG, which, as the name implies, is just a series of JPEG images. These tend to have a higher bitrate than h.264, but are also cheaper to decode and avoid having errors affect subsequent frames.)

If you've done any kind of graphics programming, or even just HTML/CSS colors, then you know about the RGB color space. Each of the Red, Green, and Blue channels gets 8 bits. Throw in an Alpha (transparency) channel, and things fit nicely into a 32-bit word.

Videos are different. They use the "YCbCr" color space, a term which is sometimes used interchangeably with "YUV". The "Y" is luma (brightness), while "Cb" and "Cr" are the blue-difference and red-difference chroma channels, respectively. There are a bunch of encoding variations[2], but the most important one for our purposes is YUV 4:2:0.

The reason is that YUV can do a clever trick where it sends the Y channel for every pixel, but only sends the U and V channels for *every other pixel*, and only on every other row. So where RGB has 24 bits per pixel (or 32 for RGBA), YUV 4:2:0 averages only 12.

The h.264 format normally stores things internally in YUV 4:2:0, which corresponds to SDL::Overlay[3]'s flag of `SDL_YV12_OVERLAY`.
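
The `SDL_YV12_OVERLAY` layout is planar YUV 4:2:0: a full-resolution Y plane followed by two chroma planes at half the width and half the height. As a rough sketch of the arithmetic, here is how the buffer sizes work out. The type and function names are hypothetical, and the code assumes even dimensions and no row padding (real overlays report their own pitches).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: byte sizes of the planes. */
typedef struct {
    size_t y_size;  /* one luma byte per pixel */
    size_t c_size;  /* one chroma byte per 2x2 block, per chroma plane */
    size_t total;   /* y_size + 2 * c_size */
} Yuv420Layout;

static Yuv420Layout yuv420_layout(size_t width, size_t height)
{
    assert(width % 2 == 0 && height % 2 == 0);
    Yuv420Layout l;
    l.y_size = width * height;
    l.c_size = (width / 2) * (height / 2);
    l.total  = l.y_size + 2 * l.c_size;
    return l;
}

/* Packed RGBA for comparison: four bytes per pixel. */
static size_t rgba_size(size_t width, size_t height)
{
    return width * height * 4;
}
```

For a 640×360 frame, `yuv420_layout()` gives 345,600 bytes in total, against 921,600 bytes from `rgba_size()`.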

Getting Data From the AR.Drone

As I said before, the AR.Drone sends the video stream over TCP port 5555. Before each h.264 frame, it sends a "PaVE" header. The most important information in that header is the packet size. Some resolution data is nice, too. This is all processed in UAV::Pilot::Driver::ARDrone::Video[4].
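
For a feel of what that header processing involves, here is a minimal C sketch that pulls the payload size and resolution out of the start of a PaVE header. This is not the module's code (which does the job in Perl). The field offsets and the little-endian byte order are my assumption from reading the AR.Drone SDK's PaVE struct, so check them against the SDK docs before trusting them. `PaveInfo` and `parse_pave()` are made-up names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Subset of the PaVE header. Field order and sizes are an assumption
 * based on the AR.Drone SDK docs; verify before relying on them. */
typedef struct {
    uint16_t header_size;   /* bytes of header before the frame data */
    uint32_t payload_size;  /* bytes of h.264 data that follow */
    uint16_t width;         /* encoded stream width */
    uint16_t height;        /* encoded stream height */
} PaveInfo;

static uint16_t read_u16_le(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_u32_le(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and fills `out` on success, 0 if the buffer is too short
 * or does not start with the "PaVE" signature. */
static int parse_pave(const uint8_t *buf, size_t len, PaveInfo *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 16 || memcmp(buf, "PaVE", 4) != 0)
        return 0;
    out->header_size  = read_u16_le(buf + 6);   /* assumed offset */
    out->payload_size = read_u32_le(buf + 8);   /* assumed offset */
    out->width        = read_u16_le(buf + 12);  /* assumed offset */
    out->height       = read_u16_le(buf + 14);  /* assumed offset */
    return 1;
}
```

With the sizes in hand, a reader would skip the remaining `header_size` bytes of header and then read exactly `payload_size` bytes to get one h.264 frame.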

The Video object can take a list of objects that do the role UAV::Pilot::Video::H264Handler[5]. This role requires a single method to be implemented, `process_h264_frame()`, which is passed the frame and some width/height data.

The first object to do that role was UAV::Pilot::Video::FileDump[6], which (duh) dumps the frames to a file. The result could be played on VLC, or encoded into an AVI with mencoder. This is as far as things got for UAV::Pilot version 0.4.

(In theory, you should have been able to play the stream in real time on Unixy operating systems by piping the output to a video player that can take a stream on STDIN, but it never seemed to work right for me.)

Real Time Display

The major part of version 0.5 was to get the real time display working. This meant brushing up my rusty C skills and interfacing to ffmpeg and SDL. Now, SDL does have Perl bindings, but they aren't totally suitable for video display (more on that later). There are also two major bindings to ffmpeg on CPAN: Video::FFmpeg[7] and FFmpeg[8]. Neither was suitable for this project, because they both rely on having a local file that you're processing, rather than having frames in memory.

Fortunately, the ffmpeg library has an excellent decoding example[9]. Most of the xs code for UAV::Pilot::Video::H264Decoder[10] was copy/pasted from there.

Most of that code involves initializing ffmpeg's various C structs. Some of the most important lines are `codec = avcodec_find_decoder( CODEC_ID_H264 );`, which gets us an h.264 decoder, and `c->pix_fmt = PIX_FMT_YUV420P;`, which tells ffmpeg that we want to get data back in the YUV 4:2:0 format. Since h.264 normally stores in this format internally, this will keep things fast.

In `process_h264_frame()`, we call `avcodec_decode_video2()` to decode the h.264 frame and get us the raw YUV array. At this point, the YUV data is in C arrays, which are nothing more than a block of memory.

High-level languages like Perl don't work on blocks of memory, at least not in ways that the programmer is usually supposed to care about. They hold variables in a more sophisticated structure, which in Perl's case is called an 'SV' for scalars (or 'AV' for arrays, or 'HV' for hashes). For an introduction, see Rob Hoelz's series on Perl internals[11], or read perlguts[12] for all the gory details.

If we wanted to process that frame data in Perl, we would have to iterate through the three arrays (one for each YUV channel). As we go, we would put the content in an SV, then push that SV onto an AV. Those AVs can then be passed back from C and into Perl code. The function `get_last_frame_pixels_arrayref()` handles this conversion, if you really want to do that. Protip: you really don't want to do that.

Why? Remember that YUV sends Y for every pixel, but U and V for only every other pixel on every other row. With the 4:2:0 format we asked ffmpeg for, that averages 1.5 bytes per pixel, and therefore 1.5 SVs per pixel. If we assume a resolution of 1280×720 (720p), then there are 921,600 pixels, or 1,382,400 SVs to create and push. You would need to do this 25-30 times per second to keep up with a real time video stream, on top of the video decoding and whatever else the CPU needs to be doing while controlling a flying robot.
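
The exact count depends on how the chroma is subsampled: halving it horizontally only (4:2:2) works out to 2 values per pixel, while halving it in both directions (4:2:0, which is what `PIX_FMT_YUV420P` means) works out to 1.5. Here is the arithmetic as a small C sketch; the function names and the `h_div`/`v_div` parameters are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Values (and so SVs) needed for one planar YUV frame. h_div and v_div
 * are the chroma subsampling divisors: 2,1 for 4:2:2 and 2,2 for 4:2:0. */
static uint64_t values_per_frame(uint32_t width, uint32_t height,
                                 uint32_t h_div, uint32_t v_div)
{
    assert(h_div > 0 && v_div > 0);
    assert(width % h_div == 0 && height % v_div == 0);
    uint64_t luma   = (uint64_t)width * height;
    uint64_t chroma = (uint64_t)(width / h_div) * (height / v_div);
    return luma + 2 * chroma;  /* one U plane and one V plane */
}

/* How many values would have to be wrapped in SVs every second. */
static uint64_t values_per_second(uint32_t width, uint32_t height,
                                  uint32_t h_div, uint32_t v_div,
                                  uint32_t fps)
{
    return values_per_frame(width, height, h_div, v_div) * fps;
}
```

At 720p, `values_per_frame(1280, 720, 2, 2)` is 1,382,400 and `values_per_frame(1280, 720, 2, 1)` is 1,843,200. Either way, at 30 frames per second that is tens of millions of SVs per second.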

This would obviously be too taxing on the CPU and memory bandwidth. My humble laptop (which has an AMD Athlon II P320 dual-core CPU) runs up to about 75% CPU usage in UAV::Pilot while decoding a 360p video stream. That laptop is starting to show its age, but it's clear that the above scheme would not work even on newer and beefier machines.

Fortunately, there's a little trick that's hinted at in perlguts[13]. An SV can hold more specific types of value, such as an IV (integer). The trick is that the IV type is guaranteed to be big enough to store a pointer, which means we can store a pointer to the frame data in an SV and then pass it around in Perl code. This means that instead of well over a million SVs, we make just one for holding a pointer to the frame struct.

This trick is pretty common in xs modules. If you've ever run `Data::Dumper` on an `XML::LibXML` node, you may have noticed that it just shows a number. That number is actually a memory address that points to the `libxml2` struct for that particular DOM node. The SDL bindings also do this.
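
Stripped of the Perl headers, the trick looks something like the following sketch. Here `intptr_t` stands in for Perl's IV, and the `Frame` struct and both function names are hypothetical. In real xs code, the `PTR2IV` and `INT2PTR` macros do the two casts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the decoded frame struct that lives on the C side. */
typedef struct {
    int      width;
    int      height;
    uint8_t *pixels;
} Frame;

/* Pack the pointer into an integer. intptr_t plays the role of Perl's
 * IV here; in xs code the PTR2IV macro does this job. */
static intptr_t frame_to_handle(Frame *f)
{
    return (intptr_t)f;
}

/* Turn the integer back into a usable pointer (INT2PTR in xs). */
static Frame *handle_to_frame(intptr_t handle)
{
    assert(handle != 0);
    return (Frame *)handle;
}
```

The integer is meaningless to anything that doesn't know which struct it points at, which is why `Data::Dumper` can only show you a number.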

The tradeoff is that the data can never actually be processed by Perl, just passed around from one piece of C code to another. The method `get_last_frame_c_obj()` will give you those pointers for passing around to whatever C code you want.

This is why SDL::Overlay isn't *exactly* what we need. To pass the data into the Perl versions of the overlay `pixels()` and `pitches()` methods, we would have to do that whole conversion process. Then, since the SDL bindings are a thin wrapper around C code, it would undo the conversion all over again.

Instead, UAV::Pilot::SDL::Video[14] uses the Perl bindings to initialize everything in Perl code. Since SDL is doing that same little C pointer trick, we can grab the SDL struct for the overlay the same way. When it comes time to draw the frame to the screen, the module's xs code[15] gets the SDL_Overlay C struct and feeds in the frame data we already have. Actual copying of the data is done by the ffmpeg function `sws_scale()`, because that's the solution I found, and I freely admit to cargo-culting it.
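
When no scaling or format conversion is needed, the copy boils down to moving each plane over row by row. The sketch below is not what the module does (it calls `sws_scale()`), just a plain C illustration of why both sides carry a per-row stride. `copy_plane()` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one image plane row by row. Source and destination may pad
 * their rows differently, so each has its own stride in bytes
 * (ffmpeg calls it linesize, SDL calls it pitch). */
static void copy_plane(uint8_t *dst, size_t dst_pitch,
                       const uint8_t *src, size_t src_linesize,
                       size_t row_bytes, size_t rows)
{
    assert(row_bytes <= dst_pitch && row_bytes <= src_linesize);
    for (size_t y = 0; y < rows; y++) {
        memcpy(dst + y * dst_pitch, src + y * src_linesize, row_bytes);
    }
}
```

For a YUV420P frame going into a YV12 overlay, you would call this three times: once for the Y plane at full size, and once for each chroma plane at half the width and half the height. One thing to watch is that YV12 stores the V plane before the U plane, while ffmpeg's YUV420P puts U first, so the two chroma planes swap places.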

At this point, it all worked, I jumped for joy, and put the final touches on UAV::Pilot version 0.5.

Where to go From Here

I would like to be able to draw right on the video display, such as to display nav data like the one in this video:[16][17]

Preliminary work is done in UAV::Pilot::SDL::VideoOverlay[18] (a role for objects to draw things on top of the video) and UAV::Pilot::SDL::VideoOverlay::Reticle[19] (which implements that role and draws a reticule).

The problem I hit is that you can't just draw on the YUV overlay using standard SDL drawing commands for lines or such. They come up black and tend to flicker. Part of the reason appears to go back to YUV only storing the UV channels on every other pixel, which screws up 1-pixel wide lines 50% of the time. The other reason is that hardware accelerated YUV overlays are rather complicated[20]. Notice that the linked discussion thread goes back to 2006, and things don't appear to have gotten better until *maybe* just recently with the release of SDL2.

The video frame could be converted to RGB in software, but that would probably be too expensive in real time. The options appear to be to either work it out with SDL2, or rewrite things in OpenGL ES[21]. OpenGL would add a lot more boilerplate code, but could have side benefits for speed on top of just plain working correctly.
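
To give a sense of what the software conversion involves, here is a sketch of converting a single pixel from a planar YUV 4:2:0 frame to RGB. It assumes full-range BT.601 coefficients (the ones JPEG uses) and planes with no row padding; video streams often use a limited range that needs slightly different constants, so treat the numbers as illustrative. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round to nearest and clamp into the 0..255 range of a byte. */
static uint8_t clamp_u8(double v)
{
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

/* Convert the pixel at (x, y) of a planar YUV 4:2:0 frame to RGB.
 * Every 2x2 block of pixels shares one U and one V sample. */
static void yuv420_pixel_to_rgb(const uint8_t *y_plane,
                                const uint8_t *u_plane,
                                const uint8_t *v_plane,
                                int width, int x, int y,
                                uint8_t rgb[3])
{
    assert(x >= 0 && x < width && y >= 0);
    int chroma_w = width / 2;
    double Y = y_plane[y * width + x];
    double U = u_plane[(y / 2) * chroma_w + (x / 2)] - 128.0;
    double V = v_plane[(y / 2) * chroma_w + (x / 2)] - 128.0;
    rgb[0] = clamp_u8(Y + 1.402 * V);
    rgb[1] = clamp_u8(Y - 0.344136 * U - 0.714136 * V);
    rgb[2] = clamp_u8(Y + 1.772 * U);
}
```

That is several multiplications per pixel, repeated for every pixel of every frame. The `(x / 2, y / 2)` chroma lookup is also the reason a 1-pixel wide line can't have its own color: it shares U and V with its neighbors.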

Once you can draw on the screen, you could do some other cool things, like detecting objects and displaying boxes around them. Image::ObjectDetect[22] is a Perl wrapper around the opencv[23] object detection library, though you'll run into the same problem of copying SVs described above. Best to use the opencv library directly.