Implementing tile encoding in rav1e25 Apr 2019
What is this?
Tile encoding consists in splitting video frames into tiles that can be encoded and decoded independently in parallel (to use several CPUs), at the cost of a small loss in compression efficiency.
This speeds up encoding and increases decoding frame rate.
To prepare for tiling, some refactoring was necessary.
To illustrate it, here is a mini-plane containing 6×3 pixels. Padding is added for alignment (and other details), so its physical size is 8×4 pixels:
In memory, it is stored in a single array:
The number of array items separating one pixel to the one below is called the stride. Here, the stride is 8.
The encoder often needs to process rectangular regions. For that purpose, many functions received a slice of the plane array and the stride value:
This works fine, but the plane slice spans multiple rows.
Let’s split our planes into 4 tiles (2 columns × 2 rows):
In memory, the resulting plane regions are not contiguous:
In Rust, it is not sufficient not to read/write the same memory from several threads, it must be impossible to write (safe) code that could do it. More precisely, a mutable reference may not alias any other reference to the same memory.
As a consequence, passing a mutable slice (
&mut [u16]) spanning multiple
rows is incompatible with tiling. Instead, we need some structure, implemented
with unsafe code, providing a view of the authorized region of the underlying
As a first step, I replaced every piece of code which used a raw slice and the
stride value by the existing
structures (which first required to make planes generic after
After these changes, our function could be rewritten as follow:
So now, all the code using a raw slice and a stride value has been replaced. But
if we look at the definition of
PlaneMutSlice, we see that it still borrows
the whole plane:
So the refactoring, in itself, does not solves the problem.
What is needed now is a structure that exposes a bounded region of the plane.
For illustration purpose, let’s consider a minimal example, solving a similar problem: split a matrix into columns.
In memory, the matrix is stored in a single array:
To do so, let’s define a
ColumnMut type, and split the raw array into columns:
PhantomData is necessary to bind the lifetime (in practice,
when we store a raw pointer, we often need a
The iterator returned by
columns() yields a different column every time, so
the borrowing rules are respected.
Now, we can read from and write to a matrix via temporary column views:
Even if the columns are interlaced in memory, from a
ColumnMut instance, it is
not possible to access data belonging to another column.
rows fields must be kept private, otherwise they could
be changed from safe code in such a way that breaks boundaries and violates
A plane is split in a similar way, except that it provides plane regions instead of colums.
In practice, more structures related to the encoding state are split into tiles, provided both in const and mut versions, so there is a whole hierarchy of tiling structures:
+- FrameState → TileState | +- Frame → Tile | | +- Plane → PlaneRegion | + RestorationState → TileRestorationState | | +- RestorationPlane → TileRestorationPlane | | +- FrameRestorationUnits → TileRestorationUnits | + FrameMotionVectors → TileMotionVectors +- FrameBlocks → TileBlocks
The split is done by a separate component (see
tiler.rs), which yields a
tile context containing an instance of the hierarchy of tiling views for each
A priori, there are mainly two possibilities to express offsets during tile encoding:
- relative to the tile;
- relative to the frame (i.e. absolute).
The usage of tiling views strongly favors the first choice. For example, it would be confusing if a bounded region could not be indexed from 0:
Worse, this would not be possible at all for the second dimension:
Therefore, offsets used in tiling views are relative to the tile (contrary to libaom and AV1 specification).
Encoding a frame first involves frame-wise accesses (initialization), then tile-wise accesses (to encode tiles in parallel), then frame-wise accesses using the results of tile-encoding (deblocking, CDEF, loop restoration, …).
All the frame-level structures have been replaced by tiling views where necessary.
The tiling views exist only temporarily, during the calls to
encode_tile(). While they are alive, it is not possible to
access frame-level structures (the borrow checker statically prevents it).
Then the tiling structures vanish, and frame-level processing can continue.
This schema gives an overview:
\ +----------------+ | | | | | | | Frame-wise accesses | | > | | | - FrameState<T> | | | - Frame<T> +----------------+ | - Plane<T> / - ... || tiling views \/ \ +---+ +---+ +---+ +---+ | | | | | | | | | | Tile encoding (possibly in parallel) +---+ +---+ +---+ +---+ | | +---+ +---+ +---+ +---+ | Tile-wise accesses | | | | | | | | > +---+ +---+ +---+ +---+ | - TileStateMut<'_, T> | - TileMut<'_, T> +---+ +---+ +---+ +---+ | - PlaneRegionMut<'_, T> | | | | | | | | | +---+ +---+ +---+ +---+ | / || vanishing of tiling views \/ \ +----------------+ | | | | | | | Frame-wise accesses | | > | | | (deblocking, CDEF, ...) | | | +----------------+ | /
To enable tile encoding, parameters have been added to pass the (log2) number of
--tile-rows-log2. For example, to request 2x2
rav1e video.y4m -o video.ivf --tile-cols-log2 1 --tile-rows-log2 1
Currently, we need to pass the log2 of the number of tiles (like in libaom,
even if the
aomenc options are called
avoid any confusion. Maybe we could find a better option which is both correct,
non-confusing and user-friendly later.
Now that we can encode tiles, we must write them according to the AV1 bitstream specification, so that decoders can read the resulting file correctly.
Before tile encoding (i.e. with a single tile), rav1e produced a correct bitstream. Several changes were necessary to write multiple tiles.
In addition, when there are several tiles, it signals two more values, described below.
For entropy coding, the encoder maintain and update a CDF (Cumulative Distribution Function), representing the probabilities of symbols.
After a frame is encoded, the current CDF state is saved to be possibly used as a starting state for future frames.
But with tile encoding, each tile finishes with its own CDF state, so which one
should we associate to the reference frame? The answer is: any of them. But we
must signal the one we choose, in
context_update_tile_id; the decoder needs it
to decode the bitstream.
In practice, we keep the CDF from the biggest tile.
Size of tiles size
The size of an encoded tile, in bytes, is variable (of course). Therefore, we will need to signal the size of each tile.
To gain a few bytes, the number of bytes used to store the size itself is also
variable, and signaled by 2 bits in the frame header
Concretely, we must choose the smallest size that is sufficient to encode all the tile sizes for the frame.
tile_start_and_end_present_flag(we always disable it);
The tile size (minus 1) is written in little endian, and use the number of bytes we signaled in the frame header.
That’s all. This is sufficient to produce a correct bitstream with multiple tiles.
I tested on my laptop (8 CPUs) several encodings to compare encoding performance (this is not a good benchmark, but it gives an idea, you are encouraged to run your own tests). Here are the results:
tiles time speedup 1 7mn02,336s 1.00× 2 3mn53,578s 1.81× 4 2mn12,995s 3.05× 8* 1mn57,533s 3.59×
Speedups are quite good for 2 and 4 tiles.
Why not 2×, 4× and 8× speedup? Mainly because of Amdahl’s law.
Tile encoding parallelizes only a part of the whole process: there are still single-threaded processings at frame-level.
Suppose that a proportion p (between 0 and 1) of a given task can be
parallelized. Then its theoretical speedup is
1 / ((p/n) + (1-p)), where n
is the number of threads.
tiles speedup speedup speedup (p=0.9) (p=0.95) (p=0.98) 2 1.82× 1.90× 1.96× 4 3.07× 3.48× 3.77× 8 4.71× 5.93× 7.02×
Maybe counterintuitively, to increase the speedup brought by parallelization, non-parallelized code must be optimized (the more threads are used, the more the non-parallelized code represents a significant part).
The (not-so-reliable) benchmark results for 2 and 4 tiles suggest that tile encoding represents ~90% of the whole encoding process.
Not everything worked the first time.
The most common source of errors while implementing tile encoding was related to offsets.
When there was only one tile, all offsets were relative to the frame. With several tiles, some offsets are relative to the current tile, but some others are still relative to the whole frame. For example, during motion estimation, a motion vector can point outside tile boundaries in the reference frame, so we must take care to convert offsets accordingly.
The most obvious errors were catched by plane regions (which prevent access outside boundaries), but some others were more subtle.
Such errors could produce interesting images. For example, here is a screenshot of my first tiled video:
But the final boss bug was way more sneaky: it corrupted the bitstream (so the decoder was unable to decode), but not always, and never the first frame. When an inter-frame could be decoded, it was sometimes visually corrupted, but only for some videos and for some encoding parameters.
After more than one week of investigations, I finally found it.
Implementing this feature was an awesome journey. I learned a lot, both about Rust and video encoding (I didn’t even know what a tile was before I started).
Big thanks to the Mozilla/Xiph/Daala team, who has been very welcoming and helpful, and who does amazing work!