Photo of HDZero VRX connected to an LCD and displaying an image back from a nearby quad.

But first, a quick recap…

It’s hard to believe it has been 9 years since I last posted to this blog, but I feel writing this technical review of the HDZero system is exactly the motivation I need to give blogging another go. Firstly, to fill in the massive gap since my last post in Aug 2014 about porting the SimonK firmware to an Atmega-88 ESC, … basically my interest in First Person View (FPV) drones took off and I’ve become engrossed in the hobby ever since. I even occasionally post freestyle videos on my KandredFPV YouTube channel. On top of that, we had another 2 kids (taking us to 3) and my free time quickly disappeared.

The digital evolution of FPV

I’ve always had an interest in RC but only got into FPV after struggling to pick up LOS (line of sight) flying when I built my first quadcopter. After being terribly disappointed at how difficult it was to fly the darn things in LOS/Acro, I would later stumble across some FPV videos on YouTube of pilots flying with ease and having a great time. I was convinced it was the solution to all my problems, and rightly so, because I fell in love with the hobby and quickly got to work at becoming a less sh*t pilot.

Until recently, all of my experience of FPV had been using analog video, which is just another way of saying a Composite Video encoded signal (CVBS) is transmitted from the quad to the pilot via a video transmitter & receiver using VSB RF modulation (a form of narrowband AM, yes AM!) in the 5.8GHz band. The CVBS signal can be either PAL (576i@25fps) or NTSC (480i@30fps), which are both considered standard definition video, but with all the visual artifacts and defects which normally accompany a legacy analog broadcast TV system.

Enter the HDZero system – formerly known as SharkByte

About a year ago I finally decided digital had matured enough to give it a try and I bought the Race VTX and HDZero SharkByte VRX for use with my old Dominator HD2 goggles, and I was blown away by how much of an improvement this system was over analog.

How it works

Digital RF communications

At the heart of any digital FPV system is a digital RF transmitter, and invariably the modulation used is some form of the OFDM modulation scheme because of its unique ability to overcome the inter-symbol interference (ISI) and channel fading caused by multipath propagation. To be clear, ISI and multipath fading affect all RF communication systems. The effects of multipath fading are normally mitigated using pre-trained equalisers to reverse the channel distortions, and in digital RF systems, low symbol rates and guard intervals help with ISI. However, multipath fading is more problematic in mobile RF applications (i.e. where at least one endpoint is moving relative to the other) such as FPV, since the environment is constantly changing and the equalisers need to be constantly retrained. OFDM allows for this by integrating the pilot signals used for equaliser training into the modulation scheme as subcarriers, so the fading correction can be adaptive.
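
To make this concrete, here’s a toy Python/NumPy sketch of an OFDM transmitter mapping data symbols and known pilot symbols onto subcarriers with an IFFT, then prepending a cyclic prefix as the guard interval. The subcarrier count, pilot spacing and QPSK mapping here are all my illustrative assumptions, not HDZero’s actual parameters:

import numpy as np

N = 64                          # total subcarriers (assumption, WiFi-like)
pilot_idx = np.arange(0, N, 8)  # every 8th subcarrier carries a known pilot
data_idx = np.setdiff1d(np.arange(N), pilot_idx)
cp_len = 16                     # cyclic prefix = guard interval against ISI

# Random QPSK data symbols on the data subcarriers
bits = np.random.randint(0, 2, (len(data_idx), 2))
data_syms = (1 - 2 * bits[:, 0]) + 1j * (1 - 2 * bits[:, 1])

# Known pilot symbols the receiver uses to re-estimate the channel
# (and hence retrain its equaliser) on every symbol
freq = np.zeros(N, dtype=complex)
freq[pilot_idx] = 1 + 0j
freq[data_idx] = data_syms

tx_time = np.fft.ifft(freq)                           # one OFDM symbol
symbol = np.concatenate([tx_time[-cp_len:], tx_time])  # prepend cyclic prefix

print(f"{len(data_idx)} data + {len(pilot_idx)} pilot subcarriers, "
      f"{len(symbol)} samples per transmitted symbol")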

It’s worth noting that MIMO-OFDM is another variation of this technology (found in 4G+ and 802.11n+ devices) which actually leverages multipath propagation to transmit multiple distinct channels simultaneously via different paths (a.k.a. spatial multiplexing), resulting in increased channel capacity. As far as I’m aware, none of HDZero, DJI or WalkSnail uses this technology, since it’s normally used with linearly polarised diversity antennas and matching receiver and transmitter antenna counts.

The Divimath SoC

Looking through the documentation on the website of Divimath (the creators of HDZero), I found the datasheet brief for their DM5680 chip, which confirms their use of OFDM in the system.

With the exception of video compression – which Divimath states their system does not use – it actually bears many similarities to the Digital Video Broadcasting – Terrestrial (DVB-T) system which has replaced analog broadcast TV here in New Zealand (and most of the world) since back in 2008. And like DVB-T, HDZero appears to be a broadcast video system, since no point-to-point (i.e. binding) association is required between the VTX and VRX. You simply select a channel and immediately start receiving the FPV broadcast, which conveniently allows any number of spectators to tune in.

Divimath has documented the HDZero system as having a bandwidth of 27MHz, and we know the RF channel bitrate to be 100Mbps because Carl, the founder of Divimath, mentioned it as part of a technical walkthrough of the system in Chris Rosser’s YouTube interview. We don’t know the OFDM symbol rate (i.e. the symbol duration and size of the guard interval). We also don’t know what size of QAM is used with the OFDM data subcarriers (e.g. 16 or 64), or the number of subcarriers; DVB-T uses 1700+ carriers, but my understanding is that such a large IFFT size places a higher demand on the DSP, so the count is probably down in the 10s, if not 100s, of subcarriers, similar to WiFi (802.11x).
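
For a feel of what that might look like, here’s a back-of-the-envelope calculation of the subcarrier spacing and (guard-free) symbol duration for a few assumed subcarrier counts in a 27MHz channel – the subcarrier counts are my assumptions, only the bandwidth is documented:

BW = 27e6  # channel bandwidth in Hz (documented by Divimath)
for n_sc in (64, 128, 256):
    spacing = BW / n_sc        # subcarrier spacing in Hz
    t_symbol = 1 / spacing     # useful symbol duration (no guard interval)
    print(f"{n_sc:4d} subcarriers: {spacing / 1e3:6.1f} kHz spacing, "
          f"{t_symbol * 1e6:5.2f} us symbol")
# For comparison, 20 MHz 802.11a/g uses 64 subcarriers
# at 312.5 kHz spacing / 3.2 us symbols.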

There is no evidence of the bitrate (and therefore the video encoding) being adaptive based on SNR, although Carl suggested as much in that interview. This functionality would require SNR (which is collected at the receiver) to be sent as telemetry in a back-channel to the VTX, and while the DM5680 chipset does appear to include a full transceiver, Mads Tech’s DJI vs HDZERO RF Deep Dive YouTube video saw no such telemetry being sent in a spectrum analysis of HDZero. In my opinion, HDZero would need to be a point-to-point system like DJI in order for this to work, since it would not be possible to adaptively vary the bitrate for multiple VRX spectators in a broadcast system. So spectator mode would need to use separate RF links with their own independent bitrates, which we know it does not.

A screenshot of HDZero’s spectrum from Mads Tech’s YouTube video

VRX Hardware

Livyu FPV has done an excellent hardware teardown of the Shark Byte VRX which I highly recommend if you haven’t seen it already. I’ve captured most of those details in the diagram below, and filled in a gap he was uncertain of, i.e. the H.264 DVR encoder. At the heart of the VRX is a XILINX Spartan-6 FPGA which, if I had to guess, is running some kind of ARM soft-core for executing the firmware in addition to what appears to be Divimath’s own image fusion algorithm. This is because, while the omni/patch antenna pairs (on the left and right) are each set up as diversity inputs to an AD9361 chip, they each feed independent DM5680 OFDM baseband video decoders, and thus generate two separate feeds of YUV420 image data into the Spartan-6. So Divimath must be fusing the frames of these images together somehow, either at a low level by selecting uncorrupted 8×8 blocks, or at a higher level by selecting the image frame with the least corruption, then outputting this to the HDMI driver and optionally the H.264 DVR encoder.

This is good to know because there might be some advantage to using different orientations for the left and right omni antennas to promote image diversity, rather than mounting them both upright.
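
To illustrate the kind of block-level fusion I’m hypothesising (to be clear, this is my guess at the approach, not Divimath’s actual algorithm), here’s a minimal NumPy sketch that fills each 8×8 block from whichever of the two decoded feeds flags that block as uncorrupted:

import numpy as np

def fuse_frames(feed_a, feed_b, bad_a, bad_b, block=8):
    """Fuse two decoded frames block by block.

    feed_a/feed_b: HxW luma arrays from the two DM5680 decoders.
    bad_a/bad_b:   (H/block)x(W/block) boolean masks marking blocks
                   that failed to decode (hypothetical metadata).
    """
    out = feed_a.copy()  # default to feed A (also used if both blocks are bad)
    h_blocks, w_blocks = bad_a.shape
    for by in range(h_blocks):
        for bx in range(w_blocks):
            # If feed A's block is corrupted but feed B's is fine, take B's.
            if bad_a[by, bx] and not bad_b[by, bx]:
                ys, xs = by * block, bx * block
                out[ys:ys + block, xs:xs + block] = \
                    feed_b[ys:ys + block, xs:xs + block]
    return out

# Toy usage: two 720p-sized luma frames with ~1% of blocks corrupted in each
a = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
b = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
bad_a = np.random.rand(90, 160) < 0.01
bad_b = np.random.rand(90, 160) < 0.01
fused = fuse_frames(a, b, bad_a, bad_b)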

The Video Codec

What sets HDZero and DVB-T apart is Divimath’s proprietary low-latency video encoding algorithm, which is used in place of the higher latency traditional MPEG codecs. Well, to be precise, Divimath’s website says “Video Coding: No video encoder”, which I believe we should interpret as “no traditional video encoder”, since it’s quite easy to see that there is no way the system can transmit uncompressed 720p60 video in a 100Mbps channel.

Assuming 4:2:0 chroma subsampling – a common image compression format used in digital cameras whereby each pixel is encoded in an average of 12 bits (instead of 24 for uncompressed 8-bit RGB) – for 720p60 video we can calculate the uncompressed bitrate as:

=> width x height x bits per pixel x frames per sec
=> 1280 x 720 x 12 x 60 = ~664Mbps

This is over 6 times the RF channel capacity, so some form of compression must be used.
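
Or, as a small reusable Python snippet (same arithmetic, plus the compression factor needed to squeeze into the channel):

def uncompressed_mbps(width, height, bpp=12, fps=60):
    """Raw bitrate of a video stream in Mbps (bpp=12 assumes 4:2:0)."""
    return width * height * bpp * fps / 1e6

raw = uncompressed_mbps(1280, 720)      # ~664 Mbps
print(f"720p60 @ 4:2:0: {raw:.0f} Mbps, needs >= {raw / 100:.1f}x "
      f"compression to fit a 100 Mbps channel")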

Video Compression Theory

There are two aspects of video compression: spatial – where redundancies within each image frame of a video are independently removed (e.g. image compression) – and temporal – where redundancies across adjacent frames within a video are codependently removed (e.g. motion compensation). Codecs which do both forms of compression are ubiquitous, e.g. H.264, which is used in most modern FPV goggle DVRs. Video codecs which do only spatial compression are less common, but one popular example used in cheap video surveillance and older FPV DVRs is MJPEG, due to it being quite inexpensive to implement; however, it produces lower quality video with a higher bitrate (and consequently larger file sizes).

Another important detail worth considering about these two aspects of video compression when applied to FPV is the impact of irrecoverable transmission bit errors on the decoding of the video. The impact of an irrecoverable error in a spatially compressed frame really depends on the compression algorithms used, e.g. JPEGs in an MJPEG video use a combination of lossy quantised DCT compression blocks followed by lossless RLE and entropy coding (e.g. Huffman coding), and bit errors within this data are likely to lead to the corruption of the entire frame. Bit errors in temporally compressed frames are even more detrimental and would likely lead to the corruption of an entire Group of Pictures (GOP) – more on this later – and several frames would have to be dropped.

Also, when dealing with real-time video streaming applications like FPV, it is worth noting the latency introduced by these two aspects of compression. Depending on the types of algorithms used in spatial compression, the latency can vary anywhere from the time taken to encode an 8×8 DCT block of the image to (assuming entropy encoding is used, as in JPEG) the time taken to DCT encode all blocks then entropy encode the entire image frame. If temporal compression is used as well, we then need to add in the time it takes to find redundancies across two or more adjacent image frames and remove/re-encode them, so at least 2 frames of latency.

A hypothesis on HDZero’s Codec

Now this is purely speculation on my part, since as far as I know Divimath hasn’t published any details on how their codec works, but based on what we know about video compression and the observed behaviour of the HDZero system we can safely make two assumptions:

  1. HDZero does not use temporal compression in its video codec.
  2. The spatial compression in HDZero’s codec is not compressing the entire frame.

Contrary to popular opinion, the main reason for these assertions is not the low latency of the system, since “Zero latency” H.264 video codecs have been documented. Rather, it’s mainly because if either of these strategies were used, irrecoverable transmission bit errors would result in the loss/corruption of an entire image frame, which we know is inconsistent with the way HDZero handles interference. In fact, upon closer inspection HDZero video has what appears to be 8×8 pixel DCT block-level corruption:

A cropped and magnified shot of some HDZero breakup on one of my test flights, showing the corrupted 8×8 DCT block patterns.

Anyone can also visually confirm that these corrupted blocks appear to be badly decoded DCT blocks, which are normally 8×8 pixels in size:

I’ve actually counted the number of these blocks in a still frame of a 720p DVR recording, using the size of the corrupted ones as a guide, and found there to be 128×96 blocks in a frame. For what I’m about to say next, we need to keep in mind that the 720p resolution of the H.264 DVR recording happens after the video is decoded and bears little resemblance to the original transmission format. So, if each block is indeed 8×8 pixels in size, then the original source image format appears to be a meager 1024×768 pixels, which would suggest the original 720p image from the camera is downscaled and changed to a 4:3 aspect ratio before encoding, resulting in a 20% loss of horizontal resolution.
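
As a quick sanity check on those numbers (under my 8×8-block assumption):

blocks_w, blocks_h, block = 128, 96, 8      # block count from my DVR still
src_w, src_h = blocks_w * block, blocks_h * block
print(f"implied source format: {src_w}x{src_h}")               # 1024x768 (4:3)
print(f"horizontal detail lost: {(1280 - src_w) / 1280:.0%}")  # 20%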

Working backwards from these assumptions, we can do some calculations to determine the data budget available for each of these 8×8 pixel DCT blocks, based on a 100Mbps channel bitrate:

=> bitrate per frame / total DCT blocks per 1024x768 frame
=> (100e6 / 60) / ((1024 x 768) / (8 x 8)) = ~135 bits (17 bytes)

For YUV420 image data, luminance (Y) is DCT encoded at full resolution while the two chroma components (UV) are DCT encoded at 1/4 resolution, so I suspect they would probably use 4×4 DCT blocks for U & V to align with the Y block boundaries. Therefore they’d need a way to compress an 8×8 (Y) and two 4×4 (U/V) DCT blocks (i.e. 96 bytes) into 17 bytes. This task does not seem so insurmountable after the DCT quantisation is applied, i.e. a process of scaling down the DCT coefficients so that high frequency detail components are reduced to zero, since our eyes care more about low frequency details than high ones. I suspect Divimath are only sending the top 10% or so of low-frequency coefficients for the Y DCT block, and the top 20% for the U/V DCT blocks. For Y, this would be the DC and 5 lowest frequency coefficients, and for U & V this would be the DC and 2 lowest frequency coefficients, for a total of 12 coefficients, and since these values are heavily quantised they can probably be encoded in smaller word lengths than 8-bit bytes, offering further compression.
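
A quick feasibility check of that coefficient budget (the word lengths here are purely my assumptions):

budget_bits = (100e6 / 60) / ((1024 * 768) / 64)   # ~135 bits per 8x8 block
# Hypothetical allocation: 6 Y coeffs + 3 U + 3 V = 12 coefficients,
# with the DCs stored wider than the heavily quantised AC terms.
dc_bits, ac_bits = 10, 6
used = 3 * dc_bits + 9 * ac_bits                   # 3 DCs + 9 ACs = 84 bits
print(f"budget {budget_bits:.0f} bits, used {used} bits, "
      f"{budget_bits - used:.0f} bits spare for headers/FEC")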

An 8×8 DCT block before (left) and after (right) quantisation, with the DC and top 5 low frequency coefficients scanned out as -26, -3, 0, -3, -3, -6 for transmission.
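
To show what that “keep only the lowest-frequency coefficients” step looks like in practice, here’s a small NumPy demo of an 8×8 DCT, quantisation, and a zig-zag scan that keeps just the DC and the 5 lowest-frequency terms. The quantisation table is the standard JPEG luma table, purely as a stand-in for whatever Divimath actually uses:

import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix."""
    m = np.array([[np.cos(np.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
                  for u in range(n)])
    m[0] /= np.sqrt(2)
    return m * np.sqrt(2 / n)

def zigzag_indices(n=8):
    """(row, col) pairs in JPEG zig-zag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

# Standard JPEG luminance quantisation table (a stand-in only)
Q = np.array([[16,11,10,16,24,40,51,61],[12,12,14,19,26,58,60,55],
              [14,13,16,24,40,57,69,56],[14,17,22,29,51,87,80,62],
              [18,22,37,56,68,109,103,77],[24,35,55,64,81,104,113,92],
              [49,64,78,87,103,121,120,101],[72,92,95,98,112,100,103,99]])

D = dct_matrix()
block = np.random.randint(0, 256, (8, 8)).astype(float) - 128  # centre on 0
coeffs = D @ block @ D.T                 # 2-D DCT-II
quantised = np.round(coeffs / Q)         # high frequencies collapse to 0
scan = [int(quantised[r, c]) for r, c in zigzag_indices()]
kept = scan[:6]                          # DC + 5 lowest-frequency ACs
print("transmitted coefficients:", kept)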

Carl also mentioned that Forward Error Correction (FEC) wasn’t applied to all transmitted data but rather conditionally based on data importance, so I’d expect that the Y-DCT blocks would be sent with a 1/2 or 3/4 coding rate (i.e. 50% or 25% of the transmitted bits being redundancy) while the UV-DCT blocks could be sent without any error correction coding.
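
Under that assumption, the FEC overhead would eat into the per-block channel budget roughly like this (the coding rates are my guesses):

budget = 135                   # channel bits per 8x8 block (from earlier)
for rate in (1/2, 3/4, 1.0):   # guessed coding rates; 1.0 = no FEC
    print(f"rate {rate:.2f}: {budget * rate:.0f} payload bits per block")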

I don’t deny that these numbers are all contrived, and no consideration was made for other data streams such as frame synchronisation and OSD telemetry from the flight controller to be multiplexed across the channel. But as far as formulating a hypothesis is concerned, I feel the approach is sound and would allow for irrecoverable transmission bit errors in a single UV DCT (or Y DCT) block to only ever result in the corruption of that block.

Resolution is not the same as image quality

And this brings me to what I believe is the most important point I want to make concerning HDZero, one which has been discussed very little in almost all the YouTube reviews of digital FPV systems I’ve seen thus far. And that is the focus on system resolution, without much mention of image quality, i.e. how much detail is preserved in the image reproduction. Consider this chroma test pattern captured by HDZero:

A still frame of HDZero video recording a chroma test pattern.

The alternating red and blue bars should have straight edges, but you can clearly see the excessively quantised lower detail encoding of the U & V components creating jagged, squared-off edges, which is quite a contrast to the straight grayscale (Y) edges elsewhere in the still. For reference, here is the original test pattern:

It’s worth noting that I captured this still frame from a DVR recording, which as discussed before is itself an H.264 re-encoding of the original decoded HDZero transmission video format, so even more detail is lost here, but notice how crisp and sharp the OSD font and icons are, since they appear to be overlaid in the VRX and only get encoded once to H.264.

I guess the point I’m trying to make is that saying your system supports 720p doesn’t mean much if you’re downscaling and throwing away about 90% of the fine details. To be honest, I would much rather have 480p at even 50% quality. To justify this, consider the following video recordings of an EIA 1956 Resolution Chart, which was traditionally used for testing analog TV systems, and notice how the 480p GoPro video at the top has far higher image quality despite having half the resolution of the following HDZero 720p DVR recording. If you freeze the HDZero footage, you can even see the compression artifacts from the excessively quantised DCT block patterns. Lastly, for fun, I’ve added a 480i analog DVR recording from a Runcam Nano3 800TVL camera at the bottom, and you’ll notice some of the numbers are far more readable for analog than HDZero.

I would love to see YouTube reviewers doing more of this kind of testing of digital (and even analog) FPV systems going forward, maybe with a larger chart further away from the camera so the focus of the lens doesn’t affect the results, as I felt it might have in these close-up tests.

System Latency

Now we come to the matter of latency, and this is where HDZero excels when compared to other digital FPV systems. But as with our previous discussion on resolution vs image quality, we need to be clear about what’s actually being measured in the numerous system latency tests which are available online. To break this down, let’s look at all the stages in the video feed where latency can be incurred, starting from the input:

Camera latency

As I see it, there are actually two sources of latency with digital video cameras, which influence what some reviewers have described as time to first-bright-pixel (TFBP) vs time to first-bright-frame (TFBF) in FPV latency tests:

TFBP determiner: The TFBP is largely influenced by the time taken from when the image sensor is read to when a horizontal scan line of pixels is outputted across the MIPI CSI-2 interface. If our 8×8 DCT theory holds true, then the DM5680 needs at least 8 scan lines of image data from the camera before it can start encoding the first row of DCT blocks. Now while Carl suggests that HDZero uses global shutter cameras, and this may have actually been the case for the original Byte Frost system which apparently had a Caddx camera, all the newest cameras starting with the Shark Byte Runcam Nano HD up to the latest HDZero Micro/Nano V2 cams are specified as having rolling shutters, and you can even see the tell-tale artifacts such as “spatial aliasing” of the props and the “jello effect” in some DVR footage. So in the absence of a global shutter we can safely assume each frame of the image does not get buffered in the camera, but rather scan lines are transferred immediately after they are captured and processed by the CSI-2 protocol stack. HDZero has actually documented their fixed low latency as being ~3ms (Δt₁ in their diagram) for TFBP when using their new HDZero goggles, which is quite impressive and means MIPI CSI-2 latency is probably somewhere around ~2ms, if we assume it matches that of analog cameras.
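
For a sense of scale, here’s the time needed to buffer those first 8 scan lines at 720p60, ignoring blanking intervals (a simplifying assumption on my part):

lines, fps, height = 8, 60, 720
t_line = 1 / (fps * height)  # seconds per active scan line
print(f"{lines} scan lines = {lines * t_line * 1e6:.0f} us")  # ~185 us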

Now it’s worth noting that TFBP tests are normally conducted using light from an LED which, when switched on, will be captured by all pixels on the sensor simultaneously, so the measured time is not influenced by the position of the scan line at that moment, and a bright pixel is guaranteed to be captured somewhere in the frame.

TFBF determiner: The second source of latency is the camera frame rate, and is influenced by the time it takes for the camera to emit a full frame, i.e. to scan and emit all the bright/illuminated pixels in the frame during an LED latency test. For a 60fps camera this is 1/60 or 16.7ms, since this is the interval between frames, but we also need to include the TFBP latency, which gives us ~19.7ms (i.e. Δt₂ in the diagram above).
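
Combining the two sources for a 60fps camera (and, for comparison, the newer 90fps HDZero camera):

tfbp_ms = 3.0  # documented glass-to-glass TFBP (Δt₁)
for fps in (60, 90):
    frame_ms = 1000 / fps  # interval between full frames
    # Assumes TFBP stays ~3ms at 90fps, which I haven't verified.
    print(f"{fps} fps: TFBF = {frame_ms:.1f} + {tfbp_ms:.1f} "
          f"= {frame_ms + tfbp_ms:.1f} ms")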

Video encoding/decoding and RF channel latency

Again, using the HDZero goggles’ fixed low latency documentation as a guide, if the glass-to-glass TFBP latency is ~3ms and the camera latency is ~2ms, then video encoding, decoding and the RF channel must introduce only ~1ms of latency.

VRX HDMI and Goggle OLED driver latency

Thanks to Chris Rosser’s latency tests of the HDZero VRX we know that the system has a 12ms TFBP when paired with the Skyzone Sky04x goggles, so subtracting the ~3ms known glass-to-glass latency for the HDZero Goggles we see that somewhere between the HDMI transmitter and the Sky04x OLED driver we’re getting an additional ~9ms of latency for 720p@60 video. I’ve read the IT66121FN datasheet and have not found any latency numbers or mentions of buffering which could add delay, but given that Sky04x OLED panels have been shown to have ~2ms TFBP for analog video, I suspect the extra latency comes from either the HDMI transmitter in the VRX or the HDMI receiver in the analog goggles. This would explain why HD FPV systems with both VRX and goggle options have their lowest latency when integrated directly into goggles. Surely Divimath was aware of this HDMI limitation when they embarked on making their own goggles, completely eliminating the HDMI dependency.

Conclusion

Well, this review ended up being much lengthier than I was planning, but I’ll end with my final thoughts on the system. If you’re an FPV pilot who cares more about low latency than high definition image quality (e.g. racers) then I think HDZero is a worthwhile improvement over analog, especially with their new 90fps camera offering better latency than the best analog setup, but stay away from the VRX and its ~9ms of extra latency and get the HDZero Goggles instead. Also, HDZero seems to be quite popular amongst micro pilots due to the low weight of the whoop-style VTXs.

While I don’t deny that HDZero image quality is far more consistent in colour reproduction than analog, I feel the excessive compression goes too far and it ends up being only marginally better than analog due to the significant loss of image detail. Then there is the matter of the digital breakup, which I find way more abrupt and distracting than analog breakup. For me as a freestyler who occasionally does a bit of proxy, I don’t think the latency benefits are as compelling as the higher image quality offered by other HD FPV systems. So while I’m still keeping my HDZero setup on a few quads, I’ve decided to give the new WalkSnail Avatar HD VRX a try to see if it delivers on image quality without compromising too much latency. I hope to write about that experience in my next blog post. Bye for now.
