~rom1v/blog

Scrcpy 2.0, with audio

2023-03-12T02:30:00+01:00

I am thrilled to announce the release of scrcpy 2.0. Now, you can mirror (and record) your Android 11+ devices in real-time with audio forwarded to your computer!

This new version also includes an option to select the video and audio codecs. The device screen can now be encoded in H.265, or even AV1 if your device supports AV1 encoding (though this is unlikely).

The application is free and open source. Follow the instructions to install it and run it on your computer.

If you like scrcpy, you can support my open source work.

Audio usage

Audio forwarding is supported for devices with Android 11 or higher, and it is enabled by default:

For Android 12 or newer, it works out-of-the-box.
For Android 11, you’ll need to ensure that the device screen is unlocked when starting scrcpy. A fake popup will briefly appear to make the system think that the shell app is in the foreground. Without this, audio capture will fail.
For Android 10 or earlier, audio cannot be captured and is automatically disabled.

You can disable audio with:

scrcpy --no-audio

If audio is enabled, it is also recorded:

scrcpy --record=file.mkv

Unlike video, audio requires some buffering even in real-time. The buffer size needs to be small enough to maintain acceptable latency, but large enough to minimize buffer underrun, which causes audio glitches. The default buffer size is set to 50ms, but it can be adjusted:

scrcpy --audio-buffer=30

To improve playback smoothness, you may deliberately increase the latency:

scrcpy --audio-buffer=200

This is useful, for example, to project your personal videos on a bigger screen:

scrcpy --video-codec=h265 --display-buffer=200 --audio-buffer=200

You can also select the audio codec and bit rate (default is Opus at 128Kbps). As a side note, I’m particularly impressed by the Opus codec at very low bit rate:

scrcpy --audio-codec=opus --audio-bit-rate=16k
scrcpy --audio-codec=aac --audio-bit-rate=16k

See the audio documentation page for more details.

History

The first version of scrcpy was released 5 years ago. Since then, audio forwarding has been one of the most requested features (see issue #14).

I made a first experimentation and developed USBaudio as a solution, but it worked poorly and the feature it relied on was deprecated in Android 8.

With the introduction of a new API to capture audio from an Android app in Android 10, I made a prototype called sndcpy. However, there were several issues. Firstly, it required to be invoked from an Android app (the scrcpy server is not an Android app, but a Java executable run with shell permissions). Most importantly, this API lets apps decide whether they can be captured or not, meaning many apps simply could not be captured, causing confusion for users.

By the end of January, @yume-chan (a scrcpy user), provided a proof-of-concept to capture the device audio with shell permissions and also proposed a working workaround for Android 11.

Since then, I have been working on a proper integration into scrcpy (my evenings and weekends have been quite busy 🙂). I added encoding, recording, buffering and playback with clock drift compensation to prevent audio delay from drifting.

Below are more technical details.

Audio capture

On the device, audio is captured by an AudioRecord with REMOTE_SUBMIX as the audio source.

The API is straightforward to use, but not very low-latency friendly. It is possible to read a number of requested bytes in one of two modes:

READ_BLOCKING: the read will block until all the requested data is read (it should be called READ_ALL_BLOCKING).
READ_NON_BLOCKING: the read will return immediately after reading as much audio data as possible without blocking.

However, the most useful mode, which is a blocking read that may return less data than requested (like the read() system call), is missing.

Since the amount of data available is unknown beforehand, in READ_BLOCKING mode, scrcpy might wait for too long. Conversely, in READ_NON_BLOCKING mode, scrcpy would read in a live-loop, burning CPU while the function returns 0 most of the time.

I decided to use READ_BLOCKING with a size of 5ms (960 bytes).

Anyway, in practice, on the devices I tested on, audio blocks are produced only every 20ms, introducing a latency of 20ms. This is not a limiting factor though, since default OPUS and AAC encoders implementations on Android produce frame sizes of 960 samples (20ms) and 1024 samples (21.33ms) respectively (and they are not configurable).

In these conditions, scrcpy reads successively 4 blocks of 5 ms every 20ms. Although the number of requested bytes could be increased to 20ms (3840 bytes), in theory some devices might capture audio faster.

With the missing blocking mode (READ_BLOCKING_THE_REAL_ONE), it would be possible to request a read with a larger buffer (e.g. 500ms) in one call, and the AudioRecord would return as much data as possible whenever it is available.

Audio encoding

The captured audio samples are then encoded by MediaCodec, which offers both synchronous and asynchronous APIs.

For our purpose, we need to execute two actions in parallel:

sending input audio buffers (read by our AudioRecord);
receiving output audio buffers (the encoded packets).

Therefore, the asynchronous API is more suitable than the synchronous one.

Here is how it is documented:

MediaCodec codec = MediaCodec.createByCodecName(name);
codec.setCallback(new MediaCodec.Callback() {
    @Override
    void onInputBufferAvailable(MediaCodec mc, int inputBufferId) {
        ByteBuffer inputBuffer = codec.getInputBuffer(inputBufferId);
        // fill inputBuffer with valid data
        …
        codec.queueInputBuffer(inputBufferId, …);
    }

    @Override
    void onOutputBufferAvailable(MediaCodec mc, int outputBufferId, …) {
        ByteBuffer outputBuffer = codec.getOutputBuffer(outputBufferId);
        // outputBuffer is ready to be processed or rendered.
        …
        codec.releaseOutputBuffer(outputBufferId, …);
    }

    …
}

However, there is a catch: the callbacks (onInputBufferAvailable() and onOutputBufferAvailable()) are called from the same thread and cannot run in parallel.

Filling an input buffer requires a blocking call to read from the AudioRecord, while processing the output buffers involves a blocking call to send the data to the client over a socket.

If we were to process the buffers directly from the callbacks, the processing of an output buffer would be delayed until the blocking call to AudioRecord.read() completes (which may be up to 20ms as described in the previous section). This would result in additional latency.

To address this issue, the callback only submits tasks to input and output queues, which are processed by dedicated threads:

// simplified
codec.setCallback(new MediaCodec.Callback() {
    @Override
    void onInputBufferAvailable(MediaCodec mc, int inputBufferId) {
        inputTasks.put(new InputTask(index));
    }

    @Override
    void onOutputBufferAvailable(MediaCodec mc, int outputBufferId,
                                 MediaCodec.BufferInfo bufferInfo) {
        outputTasks.put(new OutputTask(index, bufferInfo);
    }

    …
}

Client architecture

Here is an overview of the client architecture for the video and audio streams:

                                                 V4L2 sink
                                               /
                                       decoder
                                     /         \
        VIDEO -------------> demuxer             display
                                     \
                                       recorder
                                     /
        AUDIO -------------> demuxer
                                     \
                                       decoder --- audio player

The video and audio are captured and encoded on the device, and the resulting packets are sent via separate sockets over an adb tunnel using a custom protocol. This protocol transmits the raw encoded packets with packet headers that provide early information about packet boundaries (useful to reduce video latency) and PTS (used for recording).

Video and audio streams are then demuxed into packets by a demuxer.

If recording is enabled, the recorder asynchronously muxes the elementary streams into MP4 or MKV. Thus, the packets are encoded on the device side, but muxed on the client side (it’s the division of labour!).

If a display or V4L2 is enabled, then the video packets must be decoded by a decoder into video frames to be displayed or sent to V4L2.

If audio playback is enabled (currently when a display is enabled), the audio packets are decoded into audio frames (blocks of samples) and played by the audio player.

Audio player

This is the last component I implemented (I wrote recording before playback), because it is the trickiest, especially to compensate for the following:

clock offset: the audio output might not start precisely when necessary to play the samples at the right time;
clock drift: the device clock and the client clock may not advance at precisely the same rate;
buffer underrun: when the player has no samples available when requested by the audio output.

While scrcpy displays the latest received video frame without buffering, this isn’t possible for audio. Playing the latest received audio sample would be meaningless.

As input, the player regularly receives AVFrames of decoded audio samples. As output, a callback regularly requests audio samples to be played. In between, an audio buffer stores produced samples that have yet to be consumed.

The player aims to feed the audio output with as little latency as possible while avoiding buffer underrun. To achieve this, it attempts to maintain the average buffering (the number of samples present in the buffer) around a target value. If this target buffering is too low, then buffer underrun will occur frequently. If it is too high, then latency becomes unacceptable. This target value is configured using the scrcpy option --audio-buffer.

The playback relies only on buffer filling, the PTS are not used at all by the audio player (just as they are not used for video mirroring, unless video buffering is enabled). PTS are only used for recording.

The player cannot adjust the sample input rate (it receives samples produced in real-time) or the sample output rate (it must provide samples as requested by the audio output callback). Therefore, it may only apply compensation by resampling (converting m input samples to n output samples).

The compensation itself is applied by swresample (FFmpeg). It is configured using swr_set_compensation(). An important work for the player is to estimate the compensation value regularly and apply it.

The estimated buffering level is the result of averaging the “natural” buffering (samples are produced and consumed by blocks, so it must be smoothed), and making instant adjustments resulting of its own actions (explicit compensation and silence insertion on underflow), which are not smoothed.

Buffer underflow events can occur when packets arrive too late. In that case, the player inserts silence. Once the packets finally arrive (late), one strategy could be to drop the samples that were replaced by silence, in order to keep a minimal latency. However, dropping samples in case of buffer underflow is inadvisable, as it would temporarily increase the underflow even more and cause very noticeable audio glitches.

Therefore, the player doesn’t drop any sample on underflow. The compensation mechanism will absorb the delay introduced by the inserted silence.

Conclusion

I’m delighted that scrcpy now supports audio forwarding after much effort.

While I expect that the audio player will require some fine-tuning in the future to better handle edge cases, it currently performs quite well.

I would like to extend a huge thank you to @yume-chan for his initial proof-of-concept, which made this feature possible.

Happy mirroring!

Audio forwarding on Android 10

2020-06-09T21:50:00+02:00

Audio forwarding is one of the most requested features in scrcpy (see issue #14).

Last year, I published a small experiment (USBaudio) to forward audio over USB, using the AOA2 protocol. Unfortunately, this Android feature was unreliable, and has been deprecated in Android 8.

Here is a new tool I developed to play the device audio output on the computer, using the Playback Capture API introduced in Android 10: sndcpy.

The name follows the same logic:

strcpy: string copy
scrcpy: screen copy
sndcpy: sound copy

This is a quick proof-of-concept, composed of:

an Android application, which captures and streams the device audio over a socket:
a shell script, which starts the app and runs VLC to play the audio stream.

The long-term goal is to implement this feature properly in scrcpy.

How to use sndcpy

You could either download the release or build the app.

VLC must be installed on the computer.

Plug an Android 10 device with USB debugging enabled, and execute:

./sndcpy

If several devices are connected (listed by adb devices):

./sndcpy <serial>  # replace <serial> by the device serial

(omit ./ on Windows)

It will install the app on the device, and request permission to start audio capture:

Once you clicked on START NOW, press Enter in the console to start playing on the computer. Press Ctrl+c in the terminal to stop (except on Windows, just disconnect the device or stop capture from the device notifications).

The sound continues to be played on the device. The volume can be adjusted independently on the device and on the computer.

Apps restrictions

sndcpy may only forward audio from apps which do not prevent audio capture. The rules are detailed in §capture policy:

By default, apps that target versions up to and including to Android 9.0 do not permit playback capture. To enable it, include android:allowAudioPlaybackCapture="true" in the app’s manifest.xml file.

By default, apps that target Android 10 (API level 29) or higher allow their audio to be captured. To disable playback capture, include android:allowAudioPlaybackCapture="false" in the app’s manifest.xml file.

So some apps might need to be updated to support audio capture.

Integration in scrcpy

Ideally, I would like scrcpy to support audio forwarding directly. However, this will require quite a lot of work.

In particular, scrcpy does not use an Android app (required for capturing audio), it currently only runs a Java main as shell (required to inject events and capture the screen without asking).

And it will require to implement audio playback (done by VLC in this PoC), but also audio recording (for scrcpy --record file.mkv), encoding and decoding to transmit a compressed stream, handle audio-video synchronization…

Since I develop scrcpy on my free time, this feature will probably not be integrated very soon. Therefore, I prefer to release a working proof-of-concept which does the job.

Introducing USBaudio

2019-06-20T09:40:00+02:00

Forwarding audio from Android devices

In order to support audio forwarding in scrcpy, I first implemented an experimentation on a separate branch (see issue #14). But it was too hacky and fragile to be merged (and it does not work on all platforms).

So I decided to write a separate tool: USBaudio.

It works on Linux with PulseAudio.

How to use USBaudio

First, you need to build it (follow the instructions).

Plug an Android device. If USB debugging is enabled, just execute:

usbaudio

If USB debugging is disabled (or if multiple devices are connected), you need to specify a device, either by their serial or vendor id and product_id (as printed by lsusb):

usbaudio -s 0123456789abcdef
usbaudio -d 18d1:4ee2

The audio should be played on the computer.

If it’s stuttering, try increasing the live caching value (at the cost of a higher latency):

# default is 50ms
usbaudio --live-caching=100

Note that it can also be directly captured by OBS:

How does it work?

USBaudio executes 3 steps successively:

It enables audio accessory on the device (by sending AOA requests via libusb), so that the audio is forwarded over USB. If it works, PulseAudio (or ALSA) on the computer should detect a new audio input source.
It retrieves the PulseAudio input source id associated to the Android device (via libpulse).
It execs VLC to play audio from this input source.

Note that enabling audio accessory changes the USB device product id, so it will close any adb connection (and scrcpy). Therefore, you should enable audio forwarding before running scrcpy.

Manually

To only enable audio accessory without playing:

usbaudio -n
usbaudio --no-play

The audio input sources can be listed by:

pactl list short sources

For example:

$ pactl list short sources
...
5   alsa_input.usb-LGE_Nexus_5_05f5e60a0ae518e5-01.analog-stereo     module-alsa-card.c  s16le 2ch 44100Hz   RUNNING

Use the number (here 5) to play it with VLC:

vlc -Idummy --live-caching=50 pulse://5

Alternatively, you can use ALSA directly:

cat /proc/asound/cards

For example:

$ cat /proc/asound/cards
...
 1 [N5             ]: USB-Audio - Nexus 5
                      LGE Nexus 5 at usb-0000:00:14.0-4, high speed

Use the device number (here 1) as follow:

vlc -Idummy --live-caching=50 alsa://hw:1

If it works manually but not automatically (without -n), then please open an issue.

Limitations

It does not work on all devices, it seems that audio accessory is not always well supported. But it’s better than nothing.

Android Q added a new playback capture API. Hopefully, scrcpy could use it to forward audio in the future (but only for Android Q devices).

A new core playlist for VLC 4

2019-05-21T09:25:00+02:00

The core playlist in VLC was started a long time ago. Since then, it has grown to handle too many different things, to the point it became a kind of god object.

In practice, the playlist was also controlling playback (start, stop, change volume…), configuring audio and video outputs, storing media detected by discovery…

For VLC 4, we wanted a new playlist API, containing a simple list of items (instead of a tree), acting as a media provider for a player, without unrelated responsibilities.

I wrote it several months ago (at Videolabs). Now that the old one has been removed, it’s time to give some technical details.

Objectives
Data structure
Interaction with UI
- Desynchronization
- Synchronization
  - Core to UI
  - UI to core
Random playback
- Randomizer
Sorting
Interaction with the player
Media source
Conclusion

Objectives

One major design goal is to expose what UI frameworks need. Several user interfaces, like Qt, Mac OS and Android¹, will use this API to display and interact with the main VLC playlist.

The playlist must be performant for common use cases and usable from multiple threads.

Indeed, in VLC, user interfaces are implemented as modules loaded dynamically. In general, there is exactly one user interface, but there may be none or (in theory) several. Thus, the playlist may not be bound to the event loop of some specific user interface. Moreover, the playlist may be modified from a player thread; for example, playing a zip archive will replace the item by its content automatically.

As a consequence, the playlist will use a mutex; to avoid ToCToU issues, it will also expose public functions to lock and unlock it. But as we will see later, there will be other consequences.

Data structure

User interfaces need random access to the playlist items, so a vector is the most natural structure to store the items. A vector is provided by the standard library of many languages (vector in C++, Vec in Rust, ArrayList in Java…). But here, we’re in C, so there is nothing.

In the playlist, we only need a vector of pointers, so I first proposed improvements to an existing type, vlc_array_t, which only supports void * as items. But it was considered useless (1, 2) because it is too limited and not type-safe.

Therefore, I wrote vlc_vector. It is implemented using macros so that it’s generic over its item type. For example, we can use a vector of ints as follow:

// declare and initialize a vector of int
struct VLC_VECTOR(int) vec = VLC_VECTOR_INITIALIZER;

// append 0, 10, 20, 30 and 40
for (int i = 0; i < 5; ++i) {
    if (!vlc_vector_push(&vec, 10 * i)) {
        // allocation failure...
    }
}

// remove item at index 2
vlc_vector_remove(2);

// the vector now contains [0, 10, 30, 40]

int first = vec.data[0]; // 0
int last = vec.data[vec.size - 1]; // 40

// free resources
vlc_vector_destroy(&vec);

Internally, the playlist uses a vector of playlist items:

typedef struct VLC_VECTOR(struct vlc_playlist_item *) playlist_item_vector_t;

struct vlc_playlist {
    playlist_item_vector_t items;
    // ...
};

Interaction with UI

UI frameworks typically use list models to bind items to a list view component. A list model must provide:

the total number of items,
the item at a given index.

In addition, the model must notify its view when items are inserted, removed, moved or updated, and when the model is reset (the whole content should be invalidated).

For example, Qt list views use QAbstractItemModel/QAbstractListModel and the Android recycler view uses RecyclerView.Adapter.

The playlist API exposes the functions and callbacks providing these features.

Desynchronization

However, the core playlist may not be used as a direct data source for a list model. In other words, the functions of a list model must not delegate the calls to the core playlist.

To understand why, let’s consider a typical sequence of calls executed by a view on its model, from the UI thread:

model.count();
model.get(0);
model.get(1);
model.get(2);
model.get(3);
model.get(4);

If we implemented count() and get(index) by delegating to the playlist, we would have to lock each call individually:

// in some artificial UI framework in C++

int MyModel::count() {
    // don't do this
    vlc_playlist_Lock(playlist);
    int count = vlc_playlist_Count();
    vlc_playlist_Unlock(playlist);
    return count;
}

vlc_playlist_item_t *MyModel::get(int index) {
    // don't do this
    vlc_playlist_Lock(playlist);
    vlc_playlist_item_t *item = vlc_playlist_Get(playlist, index);
    vlc_playlist_Unlock(playlist);
    return item;
}

Note that locking and unlocking from the UI thread for every playlist item is not a good idea for responsiveness, but this is a minor issue here.

The real problem is that locking is not sufficient to guarantee correctness: the list view expects its model to return consistent values. Our implementation can break this assumption, because the playlist content could change asynchronously between calls. Here is an example:

// the playlist initially contains 5 items: [A, B, C, D, E]
model.count(); // 5
model.get(0);  // A
model.get(1);  // B
                    // the first playlist item is removed from another thread:
                    //     vlc_playlist_RemoveOne(playlist, 0);
                    // the playlist now contains [B, C, D, E]
model.get(2);  // D
model.get(3);  // E
model.get(4);  // out-of-range, undefined behavior (probably segfault)

The view could not process any notification of the item removal before the end of the current execution in its event loop… that is, at least after model.get(4). To avoid this problem, the data provided by view models must always live in the UI thread.

This implies that the UI has to manage a copy of the playlist content. The UI playlist should be considered as a remote out-of-sync view of the core playlist.

Note that the copy must not be limited to the list of pointers to playlist items: the content which is displayed and susceptible to change asynchronously (media metadata, like title or duration) must also be copied. The UI needs a deep copy; otherwise, the content could change (and be exposed) before the list view was notified… which, again, would break assumptions about the model.

Synchronization

The core playlist and the UI playlist are out-of-sync. So we need to “synchronize” them:

the changes on the core playlist must be reflected to the UI views,
the user, via the UI, may request changes to the core playlist (to add, move or remove items for example).

Core to UI

The core playlist is the source of truth.

Every change to the UI playlist must occur in the UI thread, yet the core playlist notification handlers are executed from any thread. Therefore, playlist callback handlers must retrieve appropriate data from the playlist, then post an event to the UI event loop², which will be handled from the UI thread. From there, the core playlist will be out-of-sync, so it would be incorrect to access it.

The order of events forwarded to the UI thread must be preserved. That way, the indices notified by the core playlist are necessarily valid within the context of the list model in the UI thread. The core playlist events can be understood as a sequence of “patches” that the UI playlist must apply to its own copy.

This only works if only the core playlist callbacks modify the list model content.

UI to core

Since the list model can only be modified by the core playlist callbacks, it is incorrect to modify it on user actions. As a consequence, the changes must be requested to the core playlist, which will, in turn, notify the actual changes.

The synchronization is more tricky in that direction.

To understand why, suppose the user selects items 10 to 20, then drag & drop to move them to index 42. Once the user releases the mouse button to “drop” the items, we need to lock the core playlist to apply the changes.

The problem is that, before we successfully acquired the lock, another client may have modified the playlist: it may have cleared it, or shuffled it, or removed items 5 to 15… As a consequence, we cannot apply the “move” request as is, because it was created from a previous playlist state.

To solve the issue, we need to adapt the request to make it fit the current playlist state. In other words, resolve conflicts: find the items if they had been moved, ignore the items not found for removal…

For that purpose, in addition to functions modifying the content directly, the playlist exposes functions to request “desynchronized” changes, which automatically resolve conflicts and generate an appropriate sequence of events to notify the clients of the actual changes.

Let’s take an example. Initially, our playlist contains 10 items:

[A, B, C, D, E, F, G, H, I, J]

The user selects [C, D, E, F, G] and press the Del key to remove the items. To apply the change, we need to lock the core playlist.

But at that time, another thread was holding the lock to apply some other changes. It removed F and I, and shuffled the playlist:

[E, B, D, J, C, G, H, A]

Once the other thread unlocks the playlist, our lock finally succeeds. Then, we call request_remove([C, D, E, F, G]) (this is pseudo-code, the real function is vlc_playlist_RequestRemove).

Internally, it triggers several calls:

// [E, B, D, J, C, G, H, A]
remove(index = 4, count = 2)   // remove [C, G]
// [E, B, D, J, H, A]
remove(index = 2, count = 1)   // remove [D]
// [E, B, J, H, A]
remove(index = 0, count = 1)   // remove [E]
// [B, J, H, A]

Thus, every client (including the UI from which the user requested to remove the items), will receive a sequence of 3 events on_items_removed, corresponding to each removed slice.

The slices are removed in descending order for both optimization (it minimizes the number of shifts) and simplicity (the index of a removal does not depend on previous removals).

In practice, it is very likely that the request will apply exactly to the current state of the playlist. To avoid unnecessary linear searches to find the items, these functions accept an additional index_hint parameter, giving the index of the items when the request was created. It should (hopefully) almost always be the same as the index in the current playlist state.

Random playback

Contrary to shuffle, random playback does not move the items within the playlist; instead, it does not play them sequentially.

To select the next item to play, we could just pick one at random.

But this is not ideal: some items will be selected several times (possibly in a row) while some others will not be selected at all. And if loop is disabled, when should we stop? After all n items have been selected at least once or after n playbacks?

Instead, we would like some desirable properties that work both with loop enabled and disabled:

an item must never be selected twice (within a cycle, if loop is enabled),
we should be able to navigate back to the previously selected items,
we must able to force the selection of a specific item (typically when the user double-clicks on an item in the playlist),
insertions and removals must be taken into account at any time,

In addition, if loop is enabled:

the random order must be recomputed for very cycle (we don’t always want the same random order),
it should be possible to navigate back to previous items from the previous cycle,
an item must never be selected twice in a row (in particular, it may not be the last item of one cycle and the first item of the next cycle).

Randomizer

I wrote a randomizer to select items “randomly” within all these constraints.

To get an idea of the results, here is a sequence produced for a playlist containing 5 items (A, B, C, D and E), with loop enabled (so that it continues indefinitely):

E D A B C E B C A D C B E D A C E A D B A D C E B A B D E C B C A E D E D B C A
E C B D A C A E B D C D E A B E D B A C D C B A E D A B C E B D C A E D C A B E
B A E C D C E D A B C E B A D E C B D A D B A C E C E B A D B C E D A E A C B D
A D E B C D C A E B E A D C B C D B A E C E A B D C D E A B D A E C B C A D B E
A B E C D A C B E D E D A B C D E C A B C A E B D E B D C A C A E D B D B E C A

Here is how it works.

The randomizer stores a single vector containing all the items of the playlist. This vector is not shuffled at once. Instead, steps of the Fisher-Yates algorithm are executed one-by-one on demand. This has several advantages:

on insertions and removals, there is no need to reshuffle or shift the whole array;
if loop is enabled, the history of the last cycle can be kept in place.

It also maintains 3 indexes:

head indicates the end of the items already determinated for the current cycle (if loop is disabled, there is only one cycle),
next points to the item after the current one³,
history points to the first item of ordered history from the last cycle.

0              next  head          history       size
|---------------|-----|.............|-------------|
 <------------------->               <----------->
  determinated range                 history range

Let’s reuse the example I wrote in the documentation.

Here is the initial state with our 5 items:

                                         history
                next                     |
                head                     |
                |                        |
                A    B    C    D    E

The playlist calls Next() to retrieve the next random item. The randomizer picks one item (say, D), and swaps it with the current head (A). Next() returns D.

                                         history
                     next                |
                     head                |
                     |                   |
                D    B    C    A    E
              <--->
           determinated range

The playlist calls Next() one more time. The randomizer selects one item outside the determinated range (say, E). Next() returns E.

                                         history
                          next           |
                          head           |
                          |              |
                D    E    C    A    B
              <-------->
           determinated range

The playlist calls Next() one more time. The randomizer selects C (already in place). Next() returns C.

                                         history
                               next      |
                               head      |
                               |         |
                D    E    C    A    B
              <------------->
            determinated range

The playlist then calls Prev(). Since the “current” item is C, the previous one is E, so Prev() returns E, and next moves back.

                                         history
                          next           |
                          |    head      |
                          |    |         |
                D    E    C    A    B
              <------------->
            determinated range

The playlist calls Next(), which returns C, as expected.

                                         history
                               next      |
                               head      |
                               |         |
                D    E    C    A    B
              <------------->
            determinated range

The playlist calls Next(), the randomizer selects B, and returns it.

                                         history
                                    next |
                                    head |
                                    |    |
                D    E    C    B    A
              <------------------>
               determinated range

The playlist calls Next(), the randomizer selects the last item (it has no choice). next and head now point one item past the end (their value is the vector size).

                                         history
                                         next
                                         head
                                         |
                D    E    C    B    A
              <----------------------->
                 determinated range

At this point, if loop is disabled, it is not possible to call Next() anymore (HasNext() returns false). So let’s enable it by calling SetLoop(), then let’s call Next() again.

This will start a new loop cycle. Firstly, next and head are reset, and the whole vector belongs to the last cycle history.

                 history
                 next
                 head
                 |
                 D    E    C    B    A
              <------------------------>
                    history range

Secondly, to avoid selecting A twice in a row (as the last item of the previous cycle and the first item of the new one), the randomizer will immediately determine another item in the vector (say C) to be the first of the new cycle. The items that belong to the history are kept in order. head and history move forward.

                     history
                next |
                |    head
                |    |
                C    D    E    B    A
              <---><------------------>
      determinated     history range
             range

Finally, it will actually select and return the first item (C).

                     history
                     next
                     head
                     |
                C    D    E    B    A
              <---><------------------>
      determinated     history range
             range

Then, the user adds an item to the playlist (F). This item is added in front of history.

                          history
                     next |
                     head |
                     |    |
                C    F    D    E    B    A
              <--->     <------------------>
      determinated          history range
             range

The playlist calls Next(), the randomizer randomly selects E. E “disappears” from the history of the last cycle. This is a general property: each item may not appear more than once in the “history” (both from the last and the new cycle). The history order is preserved.

                               history
                          next |
                          head |
                          |    |
                C    E    F    D    B    A
              <-------->     <-------------->
             determinated     history range
                range

The playlist then calls Prev() 3 times, that yields C, then A, then B. next is decremented (modulo size) on each call.

                               history
                               |    next
                          head |    |
                          |    |    |
                C    E    F    D    B    A
              <-------->     <-------------->
             determinated     history range
                range

Hopefully, the resulting randomness will match what people expect in practice.

Sorting

The playlist can be sorted by an ordered list of criteria (a key and a order).

The implementation is complicated by the fact that items metadata can change asynchronously (for example if the player is parsing it), making the comparison function inconsistent.

To avoid the problem, a first pass builds a list of metadata for all items, then this list is sorted, and finally the resulting order is applied back to the playlist.

As a benefit, the items are locked only once to retrieved their metadata.

Interaction with the player

For VLC 4, Thomas wrote a new player API.

A player can be used without a playlist: we can set its current media and the player can request, when necessary, the next media to play from a media provider.

A playlist, on the other hand, needs a player, and registers itself as its media provider. They are tightly coupled:

the playlist controls the player on user actions;
the player events may update the playlist state.

To keep them synchronized:

on user actions, we need to lock the playlist, then the player;
on player events, we need to lock the player, then the playlist.

This poses a lock-order inversion problem: for example, if thread A locks the playlist then waits for the player lock, while thread B locks the player then waits for the playlist lock, then thread A and B are deadlocked.

To avoid the problem, the player and the playlist share the same lock. Concretely, vlc_playlist_Lock() delegates to vlc_player_Lock(). In practice, the lock should be held only for short periods of time.

Media source

A separate API (media source and media tree) was necessary to expose what is called services discovery (used to detect media from various sources like Samba or MTP), which were previously managed by the old playlist.

Thus, we could kill the old playlist.

Conclusion

The new playlist and player API should help to implement UI properly (spoiler: a new modern UI is being developed), to avoid racy bugs and to implement new features (spoiler: gapless).

Discuss on reddit and Hacker News.

Actually, the Android app will maybe continue to implement its own playlist in Java/Kotlin, to avoid additional layers (Java/JNI and LibVLC). ↩
Even in the case where a core playlist callback is executed from the UI thread, the event must be posted to the event queue, to avoid breaking the order. Concretely, in Qt, this means connecting signals to slots using Qt::QueuedConnection instead of the default Qt::AutoConnection. ↩
We use next instead of current so that all indexes are unsigned, while current could be -1. ↩

Implementing tile encoding in rav1e

2019-04-25T10:15:00+02:00

During the last few months at Videolabs, I added support for tile encoding in rav1e (a Rust AV1 Encoder).

What is this?

AV1 is an open and royalty-free video coding format, concurrent with HEVC (H.265).

Rav1e is an encoder written in Rust, developed by Mozilla/Xiph. As such, it takes an input video and encodes it to produce a valid AV1 bitstream.

Tile encoding

Tile encoding consists in splitting video frames into tiles that can be encoded and decoded independently in parallel (to use several CPUs), at the cost of a small loss in compression efficiency.

This speeds up encoding and increases decoding frame rate.

8 tiles (4 colums × 2 rows)

Preliminary work

To prepare for tiling, some refactoring was necessary.

A frame contains 3 planes (one for each YUV component, possibly subsampled). Each plane is stored in a contiguous array, rows after rows.

To illustrate it, here is a mini-plane containing 6×3 pixels. Padding is added for alignment (and other details), so its physical size is 8×4 pixels:

In memory, it is stored in a single array:

The number of array items separating one pixel to the one below is called the stride. Here, the stride is 8.

The encoder often needs to process rectangular regions. For that purpose, many functions received a slice of the plane array and the stride value:

pub fn write_forty_two(slice: &mut [u16], stride: usize) {
  for y in 0..2 {
    for x in 0..4 {
      slice[y * stride + x] = 42;
    }
  }
}

This works fine, but the plane slice spans multiple rows.

Let’s split our planes into 4 tiles (2 columns × 2 rows):

In memory, the resulting plane regions are not contiguous:

In Rust, it is not sufficient not to read/write the same memory from several threads, it must be impossible to write (safe) code that could do it. More precisely, a mutable reference may not alias any other reference to the same memory.

As a consequence, passing a mutable slice (&mut [u16]) spanning multiple rows is incompatible with tiling. Instead, we need some structure, implemented with unsafe code, providing a view of the authorized region of the underlying plane.

As a first step, I replaced every piece of code which used a raw slice and the stride value by the existing PlaneSlice and PlaneMutSlice structures (which first required to make planes generic after improving the Pixel trait).

After these changes, our function could be rewritten as follow:

pub fn write_forty_two<T: Pixel>(slice: &mut PlaneMutSlice<'_, T>) {
  for y in 0..2 {
    for x in 0..4 {
      slice[y][x] = 42;
    }
  }
}

Tiling structures

So now, all the code using a raw slice and a stride value has been replaced. But if we look at the definition of PlaneMutSlice, we see that it still borrows the whole plane:

pub struct PlaneMutSlice<'a, T: Pixel> {
  pub plane: &'a mut Plane<T>,
  pub x: isize,
  pub y: isize
}

So the refactoring, in itself, does not solves the problem.

What is needed now is a structure that exposes a bounded region of the plane.

Minimal example

For illustration purpose, let’s consider a minimal example, solving a similar problem: split a matrix into columns.

In memory, the matrix is stored in a single array:

To do so, let’s define a ColumnMut type, and split the raw array into columns:

use std::marker::PhantomData;
use std::ops::{Index, IndexMut};

pub struct ColumnMut<'a> {
    data: *mut u8,
    cols: usize,
    rows: usize,
    phantom: PhantomData<&'a mut u8>,
}

impl Index<usize> for ColumnMut<'_> {
    type Output = u8;
    fn index(&self, index: usize) -> &Self::Output {
        assert!(index < self.rows);
        unsafe { &*self.data.add(index * self.cols) }
    }
}

impl IndexMut<usize> for ColumnMut<'_> {
    fn index_mut(&mut self, index: usize) -> &mut Self::Output {
        assert!(index < self.rows);
        unsafe { &mut *self.data.add(index * self.cols) }
    }
}

pub fn columns(
    slice: &mut [u8],
    cols: usize,
) -> impl Iterator<Item = ColumnMut<'_>> {
    assert!(slice.len() % cols == 0);
    let rows = slice.len() / cols;
    (0..cols).map(move |col| ColumnMut {
        data: &mut slice[col],
        cols,
        rows,
        phantom: PhantomData,
    })
}

The PhantomData is necessary to bind the lifetime (in practice, when we store a raw pointer, we often need a PhantomData).

We implemented Index and IndexMut traits to provide operator []:

// via Index trait
let value = column[y];

// via IndexMut trait
column[y] = value;

The iterator returned by columns() yields a different column every time, so the borrowing rules are respected.

Now, we can read from and write to a matrix via temporary column views:

fn main() {
    let mut data = [1, 5, 3, 2,
                    4, 2, 1, 7,
                    0, 0, 0, 0];

    // for each column, write the sum
    columns(&mut data, 4).for_each(|mut col| col[2] = col[0] + col[1]);

    assert_eq!(data, [1, 5, 3, 2,
                      4, 2, 1, 7,
                      5, 7, 4, 9]);
}

Even if the columns are interlaced in memory, from a ColumnMut instance, it is not possible to access data belonging to another column.

Note that cols and rows fields must be kept private, otherwise they could be changed from safe code in such a way that breaks boundaries and violates borrowing rules.

In rav1e

A plane is split in a similar way, except that it provides plane regions instead of colums.

The split is recursive. For example, a Frame contains 3 Planes, so a Tile contains 3 PlaneRegions, using the same underlying memory.

In practice, more structures related to the encoding state are split into tiles, provided both in const and mut versions, so there is a whole hierarchy of tiling structures:

 +- FrameState → TileState
 |  +- Frame → Tile
 |  |  +- Plane → PlaneRegion
 |  +  RestorationState → TileRestorationState
 |  |  +- RestorationPlane → TileRestorationPlane
 |  |     +- FrameRestorationUnits → TileRestorationUnits
 |  +  FrameMotionVectors → TileMotionVectors
 +- FrameBlocks → TileBlocks

The split is done by a separate component (see tiler.rs), which yields a tile context containing an instance of the hierarchy of tiling views for each tile.

Relative offsets

A priori, there are mainly two possibilities to express offsets during tile encoding:

relative to the tile;
relative to the frame (i.e. absolute).

The usage of tiling views strongly favors the first choice. For example, it would be confusing if a bounded region could not be indexed from 0:

// region starting at (64, 64)
let row = &region[0]; // panic, out-of-bounds
let row = &region[64]; // ok :-/

Worse, this would not be possible at all for the second dimension:

// region starting at (64, 64)
let first_row = &region[64];
let first_column = row[64]; // wrong, a raw slice necessarily starts at 0

Therefore, offsets used in tiling views are relative to the tile (contrary to libaom and AV1 specification).

Tile encoding

Encoding a frame first involves frame-wise accesses (initialization), then tile-wise accesses (to encode tiles in parallel), then frame-wise accesses using the results of tile-encoding (deblocking, CDEF, loop restoration, …).

All the frame-level structures have been replaced by tiling views where necessary.

The tiling views exist only temporarily, during the calls to encode_tile(). While they are alive, it is not possible to access frame-level structures (the borrow checker statically prevents it).

Then the tiling structures vanish, and frame-level processing can continue.

This schema gives an overview:

                                \
      +----------------+         |
      |                |         |
      |                |         |  Frame-wise accesses
      |                |          >
      |                |         |   - FrameState<T>
      |                |         |   - Frame<T>
      +----------------+         |   - Plane<T>
                                /    - ...

              ||   tiling views
              \/
                                \
  +---+  +---+  +---+  +---+     |
  |   |  |   |  |   |  |   |     |  Tile encoding (possibly in parallel)
  +---+  +---+  +---+  +---+     |
                                 |
  +---+  +---+  +---+  +---+     |  Tile-wise accesses
  |   |  |   |  |   |  |   |      >
  +---+  +---+  +---+  +---+     |   - TileStateMut<'_, T>
                                 |   - TileMut<'_, T>
  +---+  +---+  +---+  +---+     |   - PlaneRegionMut<'_, T>
  |   |  |   |  |   |  |   |     |
  +---+  +---+  +---+  +---+     |
                                /

              ||   vanishing of tiling views
              \/
                                \
      +----------------+         |
      |                |         |
      |                |         |  Frame-wise accesses
      |                |          >
      |                |         |  (deblocking, CDEF, ...)
      |                |         |
      +----------------+         |
                                /

Command-line

To enable tile encoding, parameters have been added to pass the (log2) number of tiles --tile-cols-log2 and --tile-rows-log2. For example, to request 2x2 tiles:

rav1e video.y4m -o video.ivf --tile-cols-log2 1 --tile-rows-log2 1

Currently, we need to pass the log2 of the number of tiles (like in libaom, even if the aomenc options are called --tile-columns and --tile-rows), to avoid any confusion. Maybe we could find a better option which is both correct, non-confusing and user-friendly later.

Bitstream

Now that we can encode tiles, we must write them according to the AV1 bitstream specification, so that decoders can read the resulting file correctly.

Before tile encoding (i.e. with a single tile), rav1e produced a correct bitstream. Several changes were necessary to write multiple tiles.

Tile info

According to Tile info syntax, the frame header signals the number of columns and rows of tiles (it always signaled a single tile before).

In addition, when there are several tiles, it signals two more values, described below.

CDF update

For entropy coding, the encoder maintain and update a CDF (Cumulative Distribution Function), representing the probabilities of symbols.

After a frame is encoded, the current CDF state is saved to be possibly used as a starting state for future frames.

But with tile encoding, each tile finishes with its own CDF state, so which one should we associate to the reference frame? The answer is: any of them. But we must signal the one we choose, in context_update_tile_id; the decoder needs it to decode the bitstream.

In practice, we keep the CDF from the biggest tile.

Size of tiles size

The size of an encoded tile, in bytes, is variable (of course). Therefore, we will need to signal the size of each tile.

To gain a few bytes, the number of bytes used to store the size itself is also variable, and signaled by 2 bits in the frame header (tile_size_bytes_minus_1).

Concretely, we must choose the smallest size that is sufficient to encode all the tile sizes for the frame.

Tile group

According to General tile group OBU syntax, we need to signal two values when there are more than 1 tile:

tile_start_and_end_present_flag (we always disable it);
tile_size_minus_1.

The tile size (minus 1) is written in little endian, and use the number of bytes we signaled in the frame header.

That’s all. This is sufficient to produce a correct bitstream with multiple tiles.

Parallelization

Thanks to Rayon, parallelizing tile encoding is as easy as replacing iter_mut() by par_iter_mut().

I tested on my laptop (8 CPUs) several encodings to compare encoding performance (this is not a good benchmark, but it gives an idea, you are encouraged to run your own tests). Here are the results:

 tiles     time      speedup
   1    7mn02,336s    1.00×
   2    3mn53,578s    1.81×
   4    2mn12,995s    3.05×
   8*   1mn57,533s    3.59×

Speedups are quite good for 2 and 4 tiles.

*The reason why the speedup is lower than expected for 8 tiles is that my CPU has actually only 4 physical cores. See this reddit comment and this other one.

Limits

Why not 2×, 4× and 8× speedup? Mainly because of Amdahl’s law.

Tile encoding parallelizes only a part of the whole process: there are still single-threaded processings at frame-level.

Suppose that a proportion p (between 0 and 1) of a given task can be parallelized. Then its theoretical speedup is 1 / ((p/n) + (1-p)), where n is the number of threads.

 tiles   speedup   speedup   speedup
         (p=0.9)   (p=0.95)  (p=0.98)
   2      1.82×     1.90×     1.96×
   4      3.07×     3.48×     3.77×
   8      4.71×     5.93×     7.02×

Maybe counterintuitively, to increase the speedup brought by parallelization, non-parallelized code must be optimized (the more threads are used, the more the non-parallelized code represents a significant part).

The (not-so-reliable) benchmark results for 2 and 4 tiles suggest that tile encoding represents ~90% of the whole encoding process.

Fixing bugs

Not everything worked the first time.

The most common source of errors while implementing tile encoding was related to offsets.

When there was only one tile, all offsets were relative to the frame. With several tiles, some offsets are relative to the current tile, but some others are still relative to the whole frame. For example, during motion estimation, a motion vector can point outside tile boundaries in the reference frame, so we must take care to convert offsets accordingly.

The most obvious errors were catched by plane regions (which prevent access outside boundaries), but some others were more subtle.

Such errors could produce interesting images. For example, here is a screenshot of my first tiled video:

One of these offsets confusions had been quickly catched by barrbrain in intra-prediction. I then fixed a similar problem in inter-prediction.

But the final boss bug was way more sneaky: it corrupted the bitstream (so the decoder was unable to decode), but not always, and never the first frame. When an inter-frame could be decoded, it was sometimes visually corrupted, but only for some videos and for some encoding parameters.

After more than one week of investigations, I finally found it. \o/

Conclusion

Implementing this feature was an awesome journey. I learned a lot, both about Rust and video encoding (I didn’t even know what a tile was before I started).

Big thanks to the Mozilla/Xiph/Daala team, who has been very welcoming and helpful, and who does amazing work!

Discuss on r/programming, r/rust, r/AV1 and Hacker News.

Introducing scrcpy

2018-03-08T12:00:00+01:00

I developed an application to display and control Android devices connected on USB. It does not require any root access. It works on GNU/Linux, Windows and Mac OS.

It focuses on:

lightness (native, displays only the device screen)
performance (30~60fps)
quality (1920×1080 or above)
low latency (~~70~100ms~~ 35~70ms)
low startup time (~1 second to display the first image)
non-intrusiveness (nothing is left installed on the device)

Like my previous project, gnirehtet, Genymobile accepted to open source it: scrcpy.

You can build, install and run it.

How does scrcpy work?

The application executes a server on the device. The client and the server communicate via a socket over an adb tunnel.

The server streams an H.264 video of the device screen. The client decodes the video frames and displays them.

The client captures input (keyboard and mouse) events, sends them to the server, which injects them to the device.

The documentation gives more details.

Here, I will detail several technical aspects of the application likely to interest developers.

Minimize latency

No buffering

It takes time to encode, transmit and decode the video stream. To minimize latency, we must avoid any additional delay.

For example, let’s stream the screen with screenrecord and play it with VLC:

adb exec-out screenrecord --output-format=h264 - | vlc - --demux h264

Initially, it works, but quickly the latency increases and frames are broken. The reason is that VLC associates a PTS to frames, and buffers the stream to play frames at some target time.

As a consequence, it sometimes prints such errors on stderr:

ES_OUT_SET_(GROUP_)PCR  is called too late (pts_delay increased to 300 ms)

Just before I started the project, Philippe, a colleague who played with WebRTC, advised me to “manually” decode (using FFmpeg) and render frames, to avoid any additional latency. This saved me from wasting time, it was the right solution.

Decoding the video stream to retrieve individual frames with FFmpeg is rather straightforward.

Skip frames

If, for any reason, the rendering is delayed, decoded frames are dropped so that scrcpy always displays the last decoded frame.

Note that this behavior may be changed with a configuration flag:

mesonconf x -Dskip_frames=false

Run a Java main on Android

Capturing the device screen requires some privileges, which are granted to shell.

It is possible to execute Java code as shell on Android, by invoking app_process from adb shell.

Hello, world!

Here is a simple Java application:

public class HelloWorld {
    public static void main(String... args) {
        System.out.println("Hello, world!");
    }
}

Let’s compile and dex it:

javac -source 1.7 -target 1.7 HelloWorld.java
"$ANDROID_HOME"/build-tools/27.0.2/dx \
    --dex --output classes.dex HelloWorld.class

Then, we push classes.dex to an Android device:

adb push classes.dex /data/local/tmp/

And execute it:

$ adb shell CLASSPATH=/data/local/tmp/classes.dex app_process / HelloWorld
Hello, world!

Access the Android framework

The application can access the Android framework at runtime.

For example, let’s use android.os.SystemClock:

import android.os.SystemClock;

public class HelloWorld {
    public static void main(String... args) {
        System.out.print("Hello,");
        SystemClock.sleep(1000);
        System.out.println(" world!");
    }
}

We link our class against android.jar:

javac -source 1.7 -target 1.7 \
    -cp "$ANDROID_HOME"/platforms/android-27/android.jar
    HelloWorld.java

Then run it as before.

Note that scrcpy also needs to access hidden methods from the framework. In that case, linking against android.jar is not sufficient, so it uses reflection.

Like an APK

The execution also works if classes.dex is embedded in a zip/jar:

jar cvf hello.jar classes.dex
adb push hello.jar /data/local/tmp/
adb shell CLASSPATH=/data/local/tmp/hello.jar app_process / HelloWorld

You know an example of a zip containing classes.dex? An APK!

Therefore, it works for any installed APK containing a class with a main method:

$ adb install myapp.apk
…
$ adb shell pm path my.app.package
package:/data/app/my.app.package-1/base.apk
$ adb shell CLASSPATH=/data/app/my.app.package-1/base.apk \
    app_process / HelloWorld

In scrcpy

To simplify the build system, I decided to build the server as an APK using gradle, even if it’s not a real Android application: gradle provides tasks for running tests, checking style, etc.

Invoked that way, the server is authorized to capture the device screen.

Improve startup time

Quick installation

Nothing is required to be installed on the device by the user: at startup, the client is responsible for executing the server on the device.

We saw that we can execute the main method of the server from an APK either:

installed, or
pushed to /data/local/tmp.

Which one to choose?

$ time adb install server.apk
…
real    0m0,963s
…

$ time adb push server.apk /data/local/tmp/
…
real    0m0,022s
…

So I decided to push.

Note that /data/local/tmp is readable and writable by shell, but not world-writable, so a malicious application may not replace the server just before the client executes it.

Parallelization

If you executed the Hello, world! in the previous section, you may have noticed that running app_process takes some time: Hello, World! is not printed before some delay (between 0.5 and 1 second).

In the client, initializing SDL also takes some time.

Therefore, these initialization steps have been parallelized.

Clean up the device

After usage, we want to remove the server (/data/local/tmp/scrcpy-server.jar) from the device.

We could remove it on exit, but then, it would be left on device disconnection.

Instead, once the server is opened by app_process, scrcpy unlinks (rm) it. Thus, the file is present only for less than 1 second (it is removed even before the screen is displayed).

The file itself (not its name) is actually removed when the last associated open file descriptor is closed (at the latest, when app_process dies).

Handle text input

Handling input received from a keyboard is more complicated than I thought.

Events

There are 2 kinds of “keyboard” events:

key events,
text input events.

Key events provide both the scancode (the physical location of a key on the keyboard) and the keycode (which depends on the keyboard layout). Only keycodes are used by scrcpy (it doesn’t need the location of physical keys).

However, key events are not sufficient to handle text input:

Sometimes it can take multiple key presses to produce a character. Sometimes a single key press can produce multiple characters.

Even simple characters may not be handled easily with key events, since they depend on the layout. For example, on a French keyboard, typing . (dot) generates Shift+;.

Therefore, scrcpy forwards key events to the device only for a limited set of keys. The remaining are handled by text input events.

Inject text

On the Android side, we may not inject text directly (injecting a KeyEvent created by the relevant constructor does not work). Instead, we can retrieve a list of KeyEvents to generate for a char[], using getEvents(char[]).

For example:

char[] chars = {'?'};
KeyEvent[] events = charMap.getEvents(chars);

Here, events is initialized with an array of 4 events:

press KEYCODE_SHIFT_LEFT
press KEYCODE_SLASH
release KEYCODE_SLASH
release KEYCODE_SHIFT_LEFT

Injecting those events correctly generates the char '?'.

Handle accented characters

Unfortunately, the previous method only works for ASCII characters:

char[] chars = {'é'};
KeyEvent[] events = charMap.getEvents(chars);
// events is null!!!

I first thought there was no way to inject such events from there, until I discussed with Philippe (yes, the same as earlier), who knew the solution: it works when we decompose the characters using combining diacritical dead key characters.

Concretely, instead of injecting "é", we inject "\u0301e":

char[] chars = {'\u0301', 'e'};
KeyEvent[] events = charMap.getEvents(chars);
// now, there are events

Therefore, to support accented characters, scrcpy attempts to decompose the characters using KeyComposition.

Set a window icon

The application window may have an icon, used in the title bar (for some desktop environments) and/or in the desktop taskbar.

The window icon must be set from an SDL_Surface by SDL_SetWindowIcon. Creating the surface with the icon content is up to the developer. For exemple, we could decide to load the icon from a PNG file, or directly from its raw pixels in memory.

Instead, another colleague, Aurélien, suggested I use the XPM image format, which is also a valid C source code: icon.xpm.

Note that the image is not the content of the variable icon_xpm declared in icon.xpm: it’s the whole file! Thus, icon.xpm may be both directly opened in Gimp and included in C source code:

#include "icon.xpm"

As a benefit, we directly “recognize” the icon from the source code, and we can patch it easily: in debug mode, the icon color is changed.

Conclusion

Developing this project was an awesome and motivating experience. I’ve learned a lot (I never used SDL or libav/FFmpeg before).

The resulting application works better than I initially expected, and I’m happy to have been able to open source it.

Discuss on reddit and Hacker News.

Gnirehtet rewritten in Rust

2017-09-21T17:00:00+02:00

Several months ago, I introduced Gnirehtet, a reverse tethering tool for Android I wrote in Java.

Since then, I rewrote it in Rust.

And it’s also open source! Download it, plug an Android device, and execute:

./gnirehtet run

(adb must be installed)

Why Rust?
Learning Rust
Difficulties
Safety pitfalls
Stats
Conclusion

Why Rust?

At Genymobile, we wanted Gnirehtet not to require the Java Runtime Environment, so the main requirement was to compile the application to a native executable binary.

Therefore, I first considered rewriting it in C or C++. But at that time (early May), I was interested in learning Rust, after vaguely hearing what it provided, namely:

However, I had never written a line of Rust code nor heard about Rust ownership, borrowing or lifetimes.

But I am convinced that the best way to learn a programming language is to work full-time on a real project in that language.

I was motivated, so after checking that it could fit our requirements (basically, I wrote a sample using the async I/O library mio, and executed it on both Linux and Windows), I decided to rewrite Gnirehtet in Rust.

Learning Rust

During the rewriting, I devoured successively the Rust book, Rust by example and the Rustonomicon. I learned a lot, and Rust is an awesome language. I now miss many of its features when I work on a C++ project, including:

advanced type inference,
enums,
patterns,
trait bounds,
Option<T> (like std::optional<T> in C++17, but benefiting from enums and patterns),
hygienic macros,
the absence of header files,
the (so simple) build system, and of course
guaranteed memory safety.

About learning, Paul Graham wrote:

Reading and experience train your model of the world. And even if you forget the experience or what you read, its effect on your model of the world persists. Your mind is like a compiled program you’ve lost the source of. It works, but you don’t know why.

Some of Rust concepts (like lifetimes or move semantics by default) provided a significantly different new training set which definitely affected my model of the world (of programming).

I am not going to present all these features (just click on the links to the documentation if you are interested). Instead, I will try to explain where and why Rust resisted to the design I wanted to implement, and how to rethink the problems within Rust constraints.

The following part requires some basic knowledge of Rust. You may want to skip directly to the stats.

Difficulties

The design of the Java application was pretty effective, so I wanted to reproduce the global architecture in the Rust version (with adaptations to make it more Rust idiomatic if necessary).

But I struggled on the details, especially to make the borrow checker happy. The rules are simple:

First, any borrow must last for a scope no greater than that of the owner. Second, you may have one or the other of these two kinds of borrows, but not both at the same time:

one or more references (&T) to a resource,

exactly one mutable reference (&mut T).

However, it took me some time to realize how they conflict with some patterns or principles.

Here are my feedbacks. I selected 4 subjects which are general enough to be independent of this particular project:

the conflicts with encapsulation;
the observer pattern;
how to share mutable data;
a quick note about annoying compiler limitations.

Encapsulation

The borrowing rules constrain encapsulation. This was the first consequence I realized.

Here is a canonical sample:

pub struct Data {
    header: [u8; 4],
    payload: [u8; 20],
}

impl Data {
    pub fn new() -> Self {
        Self {
            header: [0; 4],
            payload: [0; 20],
        }
    }

    pub fn header(&mut self) -> &mut [u8] {
        &mut self.header
    }

    pub fn payload(&mut self) -> &mut [u8] {
        &mut self.payload
    }
}

fn main() {
    let mut data = Data::new();
    let header = data.header();
    let payload = data.payload();
}

We just create a new instance of Data, then bind mutable references to the header and payload arrays to local variables, through accessors.

However, this does not compile:

$ rustc sample.rs
error[E0499]: cannot borrow `data` as mutable more than once at a time
  --> sample.rs:21:19
   |
25 |     let header = data.header();
   |                  ---- first mutable borrow occurs here
26 |     let payload = data.payload();
   |                   ^^^^ second mutable borrow occurs here
27 | }
   | - first borrow ends here

The compiler may not assume that header() and payload() return references to disjoint data in the Data struct. Therefore, each one borrows the whole data structure. Since the borrowing rules forbid to get two mutables references to the same resource, it rejects the second call.

Sometimes, we face temporary limitations because the compiler is not smart enough (yet). This is not the case here: the implementation of header() might actually return a reference to payload, or write to the payload array, violating the borrowing rules. And the validity of a method call may not depend on the method implementation.

To fix the problem, the compiler must be able to know that the local variables header and payload reference disjoint data, for example by accessing the fields directly:

    let header = &mut data.header;
    let payload = &mut data.payload;

or by exposing a method providing both references simultaneously:

struct Data {
    fn header_and_payload(&mut self) -> (&mut [u8], &mut [u8]) {
        (&mut self.header, &mut self.payload)
    }
}

fn main() {
    let mut data = Data::new();
    let (header, payload) = data.header_and_payload();
}

Similarly, inside a struct implementation, the borrowing rules also prevent factoring code into a private method easily. Consider this (artificial) example:

pub struct Data {
    buf: [u8; 20],
    prefix_length: usize,
    sum: u32,
    port: u16,
}

impl Data {
    pub fn update_sum(&mut self) {
        let content = &self.buf[self.prefix_length..];
        self.sum = content.iter().cloned().map(u32::from).sum();
    }

    pub fn update_port(&mut self) {
        let content = &self.buf[self.prefix_length..];
        self.port = (content[2] as u16) << 8 | content[3] as u16;
    }
}

Here, the buf field is an array storing some prefix and content contiguously.

We want to factorize the way we retrieve the content slice, so that the update_*() methods are not bothered with the details. Let’s try:

 impl Data {
     pub fn update_sum(&mut self) {
-        let content = &self.buf[self.prefix_length..];
+        let content = self.content();
         self.sum = content.iter().cloned().map(u32::from).sum();
     }

     pub fn update_port(&mut self) {
-        let content = &self.buf[self.prefix_length..];
+        let content = self.content();
         self.port = (content[2] as u16) << 8 | content[3] as u16;
     }
+
+    fn content(&mut self) -> &[u8] {
+        &self.buf[self.prefix_length..]
+    }
 }

Unfortunately, this does not compile:

error[E0506]: cannot assign to `self.sum` because it is borrowed
  --> facto2.rs:11:9
   |
10 |         let content = self.content();
   |                       ---- borrow of `self.sum` occurs here
11 |         self.sum = content.iter().cloned().map(u32::from).sum();
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ assignment to borrowed `self.sum` occurs here

error[E0506]: cannot assign to `self.port` because it is borrowed
  --> facto2.rs:16:9
   |
15 |         let content = self.content();
   |                       ---- borrow of `self.port` occurs here
16 |         self.port = (content[2] as u16) << 8 & content[3] as u16;
   |

As in the previous exemple, retrieving the reference through a method borrows the whole struct (here, self).

To workaround the problem, we can explain to the compiler that the fields are disjoint:

impl Data {
    pub fn update_sum(&mut self) {
        let content = Self::content(&self.buf, self.prefix_length);
        self.sum = content.iter().cloned().map(u32::from).sum();
    }

    pub fn update_port(&mut self) {
        let content = Self::content(&self.buf, self.prefix_length);
        self.port = (content[2] as u16) << 8 | content[3] as u16;
    }

    fn content(buf: &[u8], prefix_length: usize) -> &[u8] {
        &buf[prefix_length..]
    }
}

This compiles, but totally defeats the purpose of factorization: the caller has to provide the necessary fields.

As an alternative, we can use a macro to inline the code:

macro_rules! content {
    ($self:ident) => {
        &$self.buf[$self.prefix_length..]
    }
}

impl Data {
    pub fn update_sum(&mut self) {
        let content = content!(self);
        self.sum = content.iter().cloned().map(u32::from).sum();
    }

    pub fn update_port(&mut self) {
        let content = content!(self);
        self.port = (content[2] as u16) << 8 | content[3] as u16;
    }
}

But this seems far from ideal.

I think we must just live with it: encapsulation sometimes conflicts with the borrowing rules. After all, this is not so surprising: enforcing the borrowing rules requires to follow every concrete access to resources, while encapsulation aims to abstract them away.

Observer

The observer pattern is useful for registering event listeners on an object.

In some cases, this pattern may not be straightforward to implement in Rust.

For simplicity, let’s consider that the events are u32 values. Here is a possible implementation:

pub trait EventListener {
    fn on_event(&self, event: u32);
}

pub struct Notifier {
    listeners: Vec<Box<EventListener>>,
}

impl Notifier {
    pub fn new() -> Self {
        Self { listeners: Vec::new() }
    }

    pub fn register<T: EventListener + 'static>(&mut self, listener: T) {
        self.listeners.push(Box::new(listener));
    }

    pub fn notify(&self, event: u32) {
        for listener in &self.listeners {
            listener.on_event(event);
        }
    }
}

For convenience, make closures implement our EventListener trait:

impl<F: Fn(u32)> EventListener for F {
    fn on_event(&self, event: u32) {
        self(event);
    }
}

Thus, its usage is simple:

    let mut notifier = Notifier::new();
    notifier.register(|event| println!("received [{}]", event));
    println!("notifying...");
    notifier.notify(42);

This prints:

notifying...
received [42]

So far, so good.

However, things get a bit more complicated if we want to mutate a state when an event is received. For example, let’s implement a struct storing all the events we received:

pub struct Storage {
    events: Vec<u32>,
}

impl Storage {
    pub fn new() -> Self {
        Self { events: Vec::new() }
    }

    pub fn store(&mut self, value: u32) {
        self.events.push(value);
    }

    pub fn events(&self) -> &Vec<u32> {
        &self.events
    }
}

To be able to fill this Storage on each event received, we somehow have to pass it along with the event listener, which will be stored in the Notifier. Therefore, we need a single instance of Storage to be shared between the caller code and the Notifier.

Holding two mutable references to the same object obviously violates the borrowing rules, so we need a reference-counting pointer.

However, such a pointer is read-only, so we also need a RefCell for interior mutability.

Thus, we will use an instance of Rc<RefCell<Storage>>. It may seem too verbose, but using Rc<RefCell<T>> (or Arc<Mutex<T>> for thread-safety) is very common in Rust. And there is worse.

Here is the resulting client code:

    use std::cell::RefCell;
    use std::rc::Rc;

    let mut notifier = Notifier::new();

    // first Rc to the Storage
    let rc = Rc::new(RefCell::new(Storage::new()));

    // second Rc to the Storage
    let rc2 = rc.clone();

    // register the listener saving all the received events to the Storage
    notifier.register(move |event| rc2.borrow_mut().store(event));

    notifier.notify(3);
    notifier.notify(141);
    notifier.notify(59);
    assert_eq!(&vec![3, 141, 59], rc.borrow().events());

That way, the Storage is correctly mutated from the event listener.

All is not solved, though. In this example, we had access to the Rc<RefCell<Storage>> instance. What if we only have access to the Storage, e.g. if we want Storage to register itself from one of its methods, without requiring the caller to provide the Rc<RefCell<Storage>> instance?

impl Storage {
    pub fn register_to(&self, notifier: &mut Notifier) {
        notifier.register(move |event| {
            /* how to retrieve a &mut Storage from here? */
        });
    }
}

We need to retrieve the Rc<RefCell<Storage>> from the Storage in some way.

To do so, the idea consists in making the Storage aware of its reference-counting pointer. Of course, this only makes sense if Storage is constructed inside a Rc<RefCell<Storage>>.

This is exactly what enable_shared_from_this provides in C++, so we can draw inspiration from how it works: just store a Weak<RefCell<…>>, downgraded from the Rc<RefCell<…>>, into the structure itself. That way, we can use it to get a &mut Storage reference back in the event listener:

use std::rc::{Rc, Weak};
use std::cell::RefCell;

pub struct Storage {
    self_weak: Weak<RefCell<Storage>>,
    events: Vec<u32>,
}

impl Storage {
    pub fn new() -> Rc<RefCell<Self>> {
        let rc = Rc::new(RefCell::new(Self {
            self_weak: Weak::new(), // initialize empty
            events: Vec::new(),
        }));
        // set self_weak once we get the Rc instance
        rc.borrow_mut().self_weak = Rc::downgrade(&rc);
        rc
    }

    pub fn register_to(&self, notifier: &mut Notifier) {
        let rc = self.self_weak.upgrade().unwrap();
        notifier.register(move |event| rc.borrow_mut().store(event))
    }
}

Here is how to use it:

    let mut notifier = Notifier::new();
    let rc = Storage::new();
    rc.borrow().register_to(&mut notifier);
    notifier.notify(3);
    notifier.notify(141);
    notifier.notify(59);
    assert_eq!(&vec![3, 141, 59], rc.borrow().events());

So it is possible to implement the observer pattern in Rust, but this is a bit more challenging than in Java ;-)

When possible, it might be preferable to avoid it.

Mutable references cannot be aliased.

How to share mutable data, then?

We saw that we can use Rc<RefCell<…>> (or Arc<Mutex<…>>), that enforces the borrowing rules at runtime. However, this is not always desirable:

it forces a new allocation on the heap,
each access has a runtime cost,
it always borrows the whole resource.

Alternatively, we could use raw pointers manually inside unsafe code, but then this would be unsafe.

And there is another way, which consists in exposing temporary borrowing views of an object. Let me explain.

In Gnirehtet, a packet contains a reference to the raw data (stored in some buffer elsewhere) along with the IP and TCP/UDP header fields values (parsed from the raw data). We could have used a flat structure to store everything:

pub struct Packet<'a> {
    raw: &'a mut [u8],
    ipv4_source: u32,
    ipv4_destination: u32,
    ipv4_protocol: u8,
    // + other ipv4 fields
    transport_source: u16,
    transport_destination: u16,
    // + other transport fields
}

The Packet would provide setters for all the header fields (updating both the packet fields and the raw data). For example:

impl<'a> Packet<'a> {
    pub fn set_transport_source(&mut self, transport_source: u16) {
        self.transport_source = transport_source;
        let transport = &mut self.raw[20..];
        BigEndian::write_u16(&mut transport[0..2], port);
    }
}

But this would be poor design (especially since TCP and UDP header fields are different).

Instead, we would like to extract IP and transport headers to separate structs, managing their own part of the raw data:

// violates the borrowing rules

pub struct Packet<'a> {
    raw: &'a mut [u8], // the whole packet (including headers)
    ipv4_header: Ipv4Header<'a>,
    transport_header: TransportHeader<'a>,
}

pub struct Ipv4Header<'a> {
    raw: &'a mut [u8], // slice related to ipv4 headers
    source: u32,
    destination: u32,
    protocol: u8,
    // + other ipv4 fields
}

pub struct TransportHeader<'a> {
    raw: &'a mut [u8], // slice related to transport headers
    source: u16,
    destination: u16,
    // + other transport fields
}

You immediately spotted the problem: there are several references to the same resource, the raw byte array, at the same time.

Note that splitting the array is not a possibility here, since the raw slices overlap: we need to write the whole packet at once to the network, so the raw array in Packet must include the headers.

We need a solution compatible with the borrowing rules.

Here is the one I came up with:

store the header data separately, without the raw slices,
create view structs for IP and transport headers, with lifetime bounds,
expose Packet methods returning view instances.

And here is a simplification of the actual implementation:

pub struct Packet<'a> {
    raw: &'a mut [u8],
    ipv4_header: Ipv4HeaderData,
    transport_header: TransportHeaderData,
}

pub struct Ipv4HeaderData {
    source: u32,
    destination: u32,
    protocol: u8,
    // + other ipv4 fields
}

pub struct TransportHeaderData {
    source: u16,
    destination: u16,
    // + other transport fields
}

pub struct Ipv4Header<'a> {
    raw: &'a mut [u8],
    data: &'a mut Ipv4HeaderData,
}

pub struct TransportHeader<'a> {
    raw: &'a mut [u8],
    data: &'a mut TransportHeaderData,
}

impl<'a> Packet<'a> {
    pub fn ipv4_header(&mut self) -> Ipv4Header {
        Ipv4Header {
            raw: &mut self.raw[..20],
            data: &mut self.ipv4_header,
        }
    }

    pub fn transport_header(&mut self) -> TransportHeader {
        TransportHeader {
            raw: &mut self.raw[20..40],
            data: &mut self.transport_header,
        }
    }
}

The setters are implemented on the views, where they hold a mutable reference to the raw array:

impl<'a> TransportHeader<'a> {
    pub fn set_source(&mut self, source: u16) {
        self.data.source = source;
        BigEndian::write_u16(&mut raw[0..2], source);
    }

    pub fn set_destination(&mut self, destination: u16) {
        self.data.destination = destination;
        BigEndian::write_u16(&mut raw[2..4], destination);
    }
}

That way, the borrowing rules are respected, and the API is elegant:

    let mut packet = …;
    // "transport_header" borrows "packet" during its scope
    let mut transport_header = packet.transport_header();
    transport_header.set_source(1234);
    transport_header.set_destination(1234);

Compiler limitations

Rust is a young language, and the compiler has some annoying pitfalls.

The worst, in my opinion, is related to non-lexical lifetimes, which leads to unexpected errors:

struct Container {
    vec: Vec<i32>,
}

impl Container {
    fn find(&mut self, v: i32) -> Option<&mut i32> {
        None // we don't care the implementation
    }

    fn get(&mut self, v: i32) -> &mut i32 {
        if let Some(x) = self.find(v) {
            return x;
        }
        self.vec.push(v);
        self.vec.last_mut().unwrap()
    }
}

error[E0499]: cannot borrow `self.vec` as mutable more than once at a time
  --> sample.rs:14:9
   |
11 |         if let Some(x) = self.find(v) {
   |                          ---- first mutable borrow occurs here
...
14 |         self.vec.push(v);
   |         ^^^^^^^^ second mutable borrow occurs here
15 |         self.vec.last_mut().unwrap()
16 |     }
   |     - first borrow ends here

Hopefully, it should be fixed soon.

The Impl Trait feature, allowing to return unboxed abstract types from functions, should also improve the experience (there is also an expanded proposal).

The compiler generally produces very helpful error messages. But when it does not, they can be very confusing.

Safety pitfalls

The first chapter of the Rustonomicon says:

Safe Rust is For Reals Totally Safe.

[…]

Safe Rust is the true Rust programming language. If all you do is write Safe Rust, you will never have to worry about type-safety or memory-safety. You will never endure a null or dangling pointer, or any of that Undefined Behavior nonsense.

That’s the goal. And that’s almost true.

Leakpocalypse

In the past, it was possible to write safe-Rust code accessing freed memory.

This “leakpocalypse” led to a clarification of the safety guarantees: not running a destructor is now considered safe. In other words, memory-safety may not rely on RAII anymore (in fact, it never could, but it has been noticed only belatedly).

As a consequence, std::mem::forget is now safe, and JoinGuard has been deprecated and removed from the standard library (it has been moved to a separate crate).

Other tools relying on RAII (like Vec::drain()) must take special care to prevent memory corruption.

Whew, memory-safety is (now) safe.

Undefined infinity

In C and C++, infinite loops without side-effects are undefined behavior. This makes it possible to write programs that unexpectedly disprove Fermat’s Last Theorem.

In practice, the Rust compiler relies on LLVM, which (currently) applies its optimizations assuming that infinite loops without side-effects are undefined behavior. As a consequence, such undefined behaviors also occur in Rust.

Here is a minimal sample to trigger it:

fn infinite() {
    loop {}
}

fn main() {
    infinite();
}

Running without optimizations, it behaves as “expected”:

$ rustc ub.rs && ./ub
^C                    (infinite loop, interrupt it)

Enabling optimizations makes the program panic:

$ rustc -O ub.rs && ./ub
thread 'main' panicked at 'assertion failed: c.borrow().is_none()', /checkout/src/libstd/sys_common/thread_info.rs:51
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Alternatively, we can produce unexpected results without crashing:

fn infinite(mut value: u32) {
    // infinite loop unless value initially equals 0
    while value != 0 {
        if value != 1 {
            value -= 1;
        }
    }
}

fn main() {
    infinite(42);
    println!("end");
}

$ rustc ub.rs && ./ub
^C                    (infinite loop, interrupt it)

But with optimizations:

$ rustc -O ub.rs && ./ub
end

This is a corner case, that will probably be solved in the future. In practice, Rust safety guarantees are pretty strong (at a cost of being constraining).

Segfault

This section has been added after the publication.

There are other sources of undefined behaviors (look at the issues tagged I-unsound).

For instance, casting a float value that cannot fit into the target type is undefined behavior, which can be propagated to trigger a segfault:

#[inline(never)]
pub fn f(ary: &[u8; 5]) -> &[u8] {
    let idx = 1e100f64 as usize;
    &ary[idx..]
}

fn main() {
    println!("{}", f(&[1; 5])[0xdeadbeef]);
}

rustc -O ub.rs && ./ub
Segmentation fault

Stats

That’s all for my feedbacks about the language itself.

As an appendix, let’s compare the current Java and Rust versions of the relay server.

Number of lines

$ cloc relay-{java,rust}/src
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Rust                            29            687            655           4506
Java                            37            726            701           2931
-------------------------------------------------------------------------------

(tests included)

The Rust project is significantly bigger, for several reasons:

there are many borrowing views classes;
the Rust version contains its own selector class, wrapping the lower-level Poll, while the Java version uses the standard Selector;
the error handling for command-line parsing is more verbose.

The Java version has more files because the unit tests are separate, while in Rust they are in the same file as the classes they test.

Just for information, here are the results for the Android client:

$ cloc app/src
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java                            15            198            321            875
XML                              6              7              2             76
-------------------------------------------------------------------------------
SUM:                            21            205            323            951
-------------------------------------------------------------------------------

Binary size

--------------------------------------------
Java     gnirehtet.jar                   61K
--------------------------------------------
Rust     gnirehtet                      3.0M
         after "strip -g gnirehtet"     747K
         after "strip gnirehtet"        588K
--------------------------------------------

The Java binary itself is far smaller. The comparison is not fair though, since it requires the Java Runtime Environment:

$ du -sh /usr/lib/jvm/java-1.8.0-openjdk-amd64/
156M	/usr/lib/jvm/java-1.8.0-openjdk-amd64/

Memory usage

With a single TCP connection opened, here is the memory consumption for the Java relay server:

$ sudo pmap -x $RELAY_JAVA_PID
                  Kbytes     RSS   Dirty
total kB         4364052   86148   69316

(output filtered)

And for the Rust relay server:

$ sudo pmap -x $RELAY_RUST_PID
                  Kbytes     RSS   Dirty
total kB           19272    2736     640

Look at the RSS value, which indicates the actual memory used.

As expected, the Java version consumes more memory (86Mb) than the Rust one (less than 3Mb). Moreover, its value is unstable due to the allocation of tiny objects and their garbage collection, which also generates more dirty pages. On the contrary, the Rust value is very stable: once the connection is created, there are no memory allocations at all.

CPU usage

To compare CPU usage, here is my scenario: a 500Mb file is hosted by Apache on my laptop, I start the relay server through perf stat, then I download the file from Firefox on Android. As soon as the file is downloaded, I stop the relay server (Ctrl+C).

Here are the results for the Java version:

$ perf stat -B java -jar gnirehtet.jar relay
 Performance counter stats for 'java -jar gnirehtet.jar relay':

      11805,458302      task-clock:u (msec)       #    0,088 CPUs utilized
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
            28 618      page-faults:u             #    0,002 M/sec
    17 908 360 446      cycles:u                  #    1,517 GHz
    13 944 172 792      stalled-cycles-frontend:u #   77,86% frontend cycles idle
    18 437 279 663      instructions:u            #    1,03  insn per cycle
                                                  #    0,76  stalled cycles per insn
     3 088 215 431      branches:u                #  261,592 M/sec
        70 647 760      branch-misses:u           #    2,29% of all branches

     133,975117164 seconds time elapsed

And for the Rust version:

$ perf stat -B ./gnirehtet relay
 Performance counter stats for 'target/release/gnirehtet relay':

       2707,479968      task-clock:u (msec)       #    0,020 CPUs utilized
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
             1 001      page-faults:u             #    0,370 K/sec
     1 011 527 340      cycles:u                  #    0,374 GHz
     2 033 810 378      stalled-cycles-frontend:u #  201,06% frontend cycles idle
       981 103 003      instructions:u            #    0,97  insn per cycle
                                                  #    2,07  stalled cycles per insn
        98 929 222      branches:u                #   36,539 M/sec
         3 220 527      branch-misses:u           #    3,26% of all branches

     133,766035253 seconds time elapsed

I am not an expert in analyzing the results, but as far as I understand from the task-clock:u value, the Rust version consumes 4× less CPU-time.

Conclusion

Rewriting Gnirehtet in Rust was an amazing experience, where I learnt a great language and new programming concepts. And now, we get a native application showing better performances.

Happy reverse tethering!

Discuss on reddit and Hacker News.

Gnirehtet réécrit en Rust

2017-09-21T17:00:00+02:00

Il y a quelques mois, j’ai présenté Gnirehtet, un outil de reverse tethering pour Android que j’ai écrit en Java.

Depuis, je l’ai réécrit en Rust.

Et il est également open source ! Téléchargez-le, branchez un téléphone ou une tablette Android, et exécutez :

./gnirehtet run

(adb doit être installé)

Pourquoi Rust?
Apprendre Rust
Difficultés
Sûreté et pièges
Stats
Conclusion

Pourquoi Rust?

À Genymobile, nous voulions que Gnirehtet ne nécessite pas d’environnement d’exécution Java (JRE), donc le besoin principal était de compiler l’application vers un binaire exécutable natif.

Par conséquent, j’ai d’abord pensé la réécrire en C ou C++. Mais à ce moment-là (début mai), apprendre Rust m’intéressait, après avoir vaguement entendu parler de ses fonctionnalités:

Cependant, je n’avais jamais écrit une ligne de Rust ni entendu parler de son système de possession, d’emprunt ou de durées de vie.

Mais je suis convaincu que le meilleur moyen d’apprendre un langage de programmation est de travailler à plein temps sur un projet dans ce langage.

J’étais motivé, donc après avoir vérifié que ça pouvait convenir (en gros, j’ai écrit un exemple utilisant la bibliothèque d’I/O asynchrone mio, et je l’ai exécuté à la fois sur Linux et Windows), j’ai décidé de réécrire Gnirehtet en Rust.

Apprendre Rust

Pendant la réécriture, j’ai dévoré successivement le Rust book, Rust by example et le Rustonomicon. J’ai beaucoup appris, et j’aime énormément ce langage. Beaucoup de ses fonctionnalités me manquent maintenant quand je travaille sur un projet C++ :

inférence de type avancée,
enums,
patterns,
trait bounds,
Option<T> (comme std::optional<T> en C++17, mais tirant bénéfice des enums et des patterns),
macros hygiéniques,
l’absence de fichiers d’en-têtes,
le (si simple) système de build, et bien sûr
la garantie de sûreté mémoire.

À propos de l’apprentissage, Paul Graham a écrit:

Reading and experience train your model of the world. And even if you forget the experience or what you read, its effect on your model of the world persists. Your mind is like a compiled program you’ve lost the source of. It works, but you don’t know why.

Pour les non-anglophones, ma propre traduction :

La lecture et l’expérience entraînent votre modèle du monde. Et même si vous oubliez l’expérience ou ce que vous avez lu, son effet sur votre modèle du monde persiste. Votre esprit est comme un programme compilé dont vous auriez perdu le code source. Ça fonctionne, mais vous ne savez pas pourquoi.

Certains des concepts de Rust (comme les durées de vie ou la sémantique de mouvement par défaut) m’ont fourni un jeu de données significativement différent, qui a sans aucun doute affecté mon modèle du monde (de la programmation).

Je ne vais pas présenter toutes ces fonctionnaliés (cliquez sur les liens de la documentation si ça vous intéresse). À la place, je vais essayer d’expliquer où et pourquoi Rust a resisté au design que je voulais implémenter, et comment repenser les problèmes dans le périmètre des contraintes de Rust.

La partie suivant nécessite une certaine connaissance de Rust. Vous pourriez vouloir la passer pour aller directement aux stats.

Difficultés

Je trouvais la conception de l’application Java plutôt réussie, donc je voulais reproduire l’architecture globale dans la version Rust (avec d’éventuelles adaptations pour la rustifier).

Mais j’ai lutté sur les détails, en particulier pour satisfaire le borrow checker. Les règles sont simples:

First, any borrow must last for a scope no greater than that of the owner. Second, you may have one or the other of these two kinds of borrows, but not both at the same time:

one or more references (&T) to a resource,

exactly one mutable reference (&mut T).

En français :

Premièrement, aucun emprunt ne doit avoir une portée plus grande que celle de son propriétaire. Deuxièmement, vous pouvez avoir l’un ou l’autre de ces types d’emprunts, mais pas les deux à la fois:

une ou plusieurs références (&T) vers une ressource,

exactement une référence mutable (&mut T).

Cependant, il m’a fallu un peu de temps pour réaliser comment elles entrent en conflit avec certains principes ou modèles de conception.

Voici donc mes retours. J’ai sélectionné 4 sujets qui sont suffisamment généraux pour être indépendants de ce projet particulier :

les conflits avec l’encapsulation ;
le design pattern observateur ;
comment partager des données mutables ;
un retour rapide sur les limitations ennuyeuses du compilateur.

Encapsulation

Les règles d’emprunt contraignent l’encapsulation. C’est la première conséquence que j’ai réalisée.

Voici un exemple canonique :

pub struct Data {
    header: [u8; 4],
    payload: [u8; 20],
}

impl Data {
    pub fn new() -> Self {
        Self {
            header: [0; 4],
            payload: [0; 20],
        }
    }

    pub fn header(&mut self) -> &mut [u8] {
        &mut self.header
    }

    pub fn payload(&mut self) -> &mut [u8] {
        &mut self.payload
    }
}

fn main() {
    let mut data = Data::new();
    let header = data.header();
    let payload = data.payload();
}

Nous créons juste une nouvelle instance de Data, puis associons à des variables locales des références mutables vers les tableaux header et payload, en passant par des accesseurs.

Cependant, cela ne compile pas :

$ rustc sample.rs
error[E0499]: cannot borrow `data` as mutable more than once at a time
  --> sample.rs:21:19
   |
25 |     let header = data.header();
   |                  ---- first mutable borrow occurs here
26 |     let payload = data.payload();
   |                   ^^^^ second mutable borrow occurs here
27 | }
   | - first borrow ends here

Le compilateur ne peut pas faire l’hypothèse que header() et payload() retournent des références vers des données disjointes dans la structure Data. Par conséquent, chacun emprunte la structure data entièrement. Vu que les règles d’emprunt interdisent d’obtenir deux références mutables vers la même ressource, il rejette le second appel.

Parfois, nous faisons face à des limitations temporaires parce que le compilateur n’est pas (encore) assez malin. Ce n’est pas le cas ici : l’implémentation de header() pourrait très bien retourner une référence vers payload, ou écrire dans le tableau payload, enfreignant ainsi les règles d’emprunt. Et la validité d’un appel d’une méthode ne peut pas dépendre de l’implementation de la méthode.

Pour corriger le problème, le compilateur doit être capable de savoir que les variables locales header et payload référencent des données disjointes, par exemple en accédant aux champs directement :

    let header = &mut data.header;
    let payload = &mut data.payload;

ou en exposant une méthode fournissant les deux références simultanément :

struct Data {
    fn header_and_payload(&mut self) -> (&mut [u8], &mut [u8]) {
        (&mut self.header, &mut self.payload)
    }
}

fn main() {
    let mut data = Data::new();
    let (header, payload) = data.header_and_payload();
}

De même, dans l’implémentation d’une structure, les règles d’emprunt empêchent de factoriser du code dans une méthode privée facilement. Prenons cet exemple (artificiel) :

pub struct Data {
    buf: [u8; 20],
    prefix_length: usize,
    sum: u32,
    port: u16,
}

impl Data {
    pub fn update_sum(&mut self) {
        let content = &self.buf[self.prefix_length..];
        self.sum = content.iter().cloned().map(u32::from).sum();
    }

    pub fn update_port(&mut self) {
        let content = &self.buf[self.prefix_length..];
        self.port = (content[2] as u16) << 8 | content[3] as u16;
    }
}

Ici, le champ buf est un tableau stockant un préfixe et un contenu de manière contiguë.

Nous voulons factoriser la manière dont nous récupérons la slice content, pour que les méthodes update_*() n’aient pas à se préoccuper des détails. Essayons :

 impl Data {
     pub fn update_sum(&mut self) {
-        let content = &self.buf[self.prefix_length..];
+        let content = self.content();
         self.sum = content.iter().cloned().map(u32::from).sum();
     }

     pub fn update_port(&mut self) {
-        let content = &self.buf[self.prefix_length..];
+        let content = self.content();
         self.port = (content[2] as u16) << 8 | content[3] as u16;
     }
+
+    fn content(&mut self) -> &[u8] {
+        &self.buf[self.prefix_length..]
+    }
 }

Malheureusement, cela ne compile pas :

error[E0506]: cannot assign to `self.sum` because it is borrowed
  --> facto2.rs:11:9
   |
10 |         let content = self.content();
   |                       ---- borrow of `self.sum` occurs here
11 |         self.sum = content.iter().cloned().map(u32::from).sum();
   |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ assignment to borrowed `self.sum` occurs here

error[E0506]: cannot assign to `self.port` because it is borrowed
  --> facto2.rs:16:9
   |
15 |         let content = self.content();
   |                       ---- borrow of `self.port` occurs here
16 |         self.port = (content[2] as u16) << 8 & content[3] as u16;
   |

Comme dans l’exemple précédent, récupérer une référence à travers une méthode emprunte la structure complète (ici, self).

Pour contourner le problème, nous pouvons expliquer au compilateur que les champs sont disjoints :

impl Data {
    pub fn update_sum(&mut self) {
        let content = Self::content(&self.buf, self.prefix_length);
        self.sum = content.iter().cloned().map(u32::from).sum();
    }

    pub fn update_port(&mut self) {
        let content = Self::content(&self.buf, self.prefix_length);
        self.port = (content[2] as u16) << 8 | content[3] as u16;
    }

    fn content(buf: &[u8], prefix_length: usize) -> &[u8] {
        &buf[prefix_length..]
    }
}

Ça compile, mais cela va totalement à l’encontre de la factorisation : l’appelant doit fournir les champs nécessaires.

Comme alternative, nous pouvons utiliser une macro pour inliner le code :

macro_rules! content {
    ($self:ident) => {
        &$self.buf[$self.prefix_length..]
    }
}

impl Data {
    pub fn update_sum(&mut self) {
        let content = content!(self);
        self.sum = content.iter().cloned().map(u32::from).sum();
    }

    pub fn update_port(&mut self) {
        let content = content!(self);
        self.port = (content[2] as u16) << 8 | content[3] as u16;
    }
}

Mais c’est loin d’être idéal.

Je pense que nous devons juste l’accepter : l’encapsulation entre parfois en conflit avec les règles d’emprunt. Après tout, ce n’est pas si surprenant : imposer les règles d’emprunt nécessite de suivre chaque accès concret aux ressources, alors que l’encapsulation vise à les abstraire.

Observateur

Le design pattern observateur est utile pour enregistrer des événements sur un objet.

Dans certains cas, ce pattern pose des difficultés d’implémentation en Rust.

Pour faire simple, considérons que les événements sont des valeurs u32. Voici une implémentation possible :

pub trait EventListener {
    fn on_event(&self, event: u32);
}

pub struct Notifier {
    listeners: Vec<Box<EventListener>>,
}

impl Notifier {
    pub fn new() -> Self {
        Self { listeners: Vec::new() }
    }

    pub fn register<T: EventListener + 'static>(&mut self, listener: T) {
        self.listeners.push(Box::new(listener));
    }

    pub fn notify(&self, event: u32) {
        for listener in &self.listeners {
            listener.on_event(event);
        }
    }
}

Par commodité, implémentons notre trait EventListener pour les closures :

impl<F: Fn(u32)> EventListener for F {
    fn on_event(&self, event: u32) {
        self(event);
    }
}

Ainsi, son utilisation est simple :

    let mut notifier = Notifier::new();
    notifier.register(|event| println!("received [{}]", event));
    println!("notifying...");
    notifier.notify(42);

Cela affiche :

notifying...
received [42]

Jusqu’ici, tout va bien.

Cependant, les choses se compliquent si nous voulons modifier un état sur la réception d’un événement. Par exemple, implémentons une structure pour stocker tous les événements reçus :

pub struct Storage {
    events: Vec<u32>,
}

impl Storage {
    pub fn new() -> Self {
        Self { events: Vec::new() }
    }

    pub fn store(&mut self, value: u32) {
        self.events.push(value);
    }

    pub fn events(&self) -> &Vec<u32> {
        &self.events
    }
}

Pour pouvoir remplir ce Storage sur chaque événement reçu, nous devons d’une manière ou d’une autre le passer avec l’event listener, qui sera stocké dans le Notifier. Par conséquent, nous avons besoin qu’une instance de Storage soit partagée entre le code appelant et le Notifier.

Avoir deux références mutables vers le même objet enfreint évidemment les règles d’emprunt, donc nous avons besoin d’un pointeur à compteur de références.

Cependant, un tel pointeur est en lecture seul, donc nous avons également besoin d’un RefCell pour la mutabilité intérieure.

Ainsi, nous allons utiliser une instance de Rc<RefCell<Storage>>. Cela peut sembler trop verbeux, mais utiliser Rc<RefCell<T>> (ou Arc<Mutex<T>> pour la thread-safety) est très courant en Rust. Et il y a pire.

Voici ce que donne le code client :

    use std::cell::RefCell;
    use std::rc::Rc;

    let mut notifier = Notifier::new();

    // first Rc to the Storage
    let rc = Rc::new(RefCell::new(Storage::new()));

    // second Rc to the Storage
    let rc2 = rc.clone();

    // register the listener saving all the received events to the Storage
    notifier.register(move |event| rc2.borrow_mut().store(event));

    notifier.notify(3);
    notifier.notify(141);
    notifier.notify(59);
    assert_eq!(&vec![3, 141, 59], rc.borrow().events());

De cette manière, le Storage est correctement modifié à partir de l’event listener.

Tout n’est pas résolu pour autant. Dans cet exemple, c’était facile, nous avions accès à l’instance Rc<RefCell<Storage>>. Comment faire si nous avons seulement accès au Storage, par exemple si nous voulons que le Storage s’enregistre lui-même à partir de l’une de ses méthodes, sans que l’appelant n’ait à fournir l’instance Rc<RefCell<Storage>> ?

impl Storage {
    pub fn register_to(&self, notifier: &mut Notifier) {
        notifier.register(move |event| {
            /* how to retrieve a &mut Storage from here? */
        });
    }
}

Nous devons trouver un moyen de récupérer le Rc<RefCell<Storage>> à partir du Storage.

Pour cela, l’idée consiste à rendre Storage conscient de son pointeur à compteur de références. Bien sûr, cela n’a du sens que si Storage est construit dans un Rc<RefCell<Storage>>.

C’est exactement ce que enable_shared_from_this fournit en C++, donc nous pouvons nous inspirer de son fonctionnement : juste stocker un Weak<RefCell<…>>, downgradé à partir du Rc<RefCell<…>>, dans la structure elle-même. De cette manière, nous pouvons l’utiliser pour récupérer une référence &mut Storage à partir de l’event listener :

use std::rc::{Rc, Weak};
use std::cell::RefCell;

pub struct Storage {
    self_weak: Weak<RefCell<Storage>>,
    events: Vec<u32>,
}

impl Storage {
    pub fn new() -> Rc<RefCell<Self>> {
        let rc = Rc::new(RefCell::new(Self {
            self_weak: Weak::new(), // initialize empty
            events: Vec::new(),
        }));
        // set self_weak once we get the Rc instance
        rc.borrow_mut().self_weak = Rc::downgrade(&rc);
        rc
    }

    pub fn register_to(&self, notifier: &mut Notifier) {
        let rc = self.self_weak.upgrade().unwrap();
        notifier.register(move |event| rc.borrow_mut().store(event))
    }
}

Voici comment l’utiliser :

    let mut notifier = Notifier::new();
    let rc = Storage::new();
    rc.borrow().register_to(&mut notifier);
    notifier.notify(3);
    notifier.notify(141);
    notifier.notify(59);
    assert_eq!(&vec![3, 141, 59], rc.borrow().events());

Il est donc possible d’implémenter le design pattern observateur en Rust, mais c’est un peu plus difficile qu’en Java ;-)

Lorsque c’est possible, il est probablement préférable de l’éviter.

Partage de données mutables

Mutable references cannot be aliased.

En français :

Les références mutables ne peuvent pas être aliasées.

Comment partager des données mutables, alors ?

Nous avons vu que nous pouvions utiliser Rc<RefCell<…>> (ou Arc<Mutex<…>>), qui impose les règles d’emprunt à l’exécution. Cependant, ce n’est pas toujours désirable :

cela force une nouvelle allocation sur le tas,
chaque accès a un coût à l’exécution,
l’emprunt concerne toujours la ressource entière.

Au lieu de cela, nous pourrions utiliser des pointeurs bruts manuellement dans du code non-sûr, mais alors ce serait non-sûr.

Et il y a une autre solution, qui consiste à exposer des vues temporaires d’emprunt d’un objet. Laissez-moi expliquer.

Dans Gnirehtet, un paquet contient une référence vers les données brutes (stockées dans un buffer quelque part) ainsi que les valeur des champs des en-têtes IP et TCP/UDP (parsées à partir du tableau d’octets brut). Nous aurions pu utiliser une structure à plat pour tout stocker :

pub struct Packet<'a> {
    raw: &'a mut [u8],
    ipv4_source: u32,
    ipv4_destination: u32,
    ipv4_protocol: u8,
    // + other ipv4 fields
    transport_source: u16,
    transport_destination: u16,
    // + other transport fields
}

Le Packet aurait fourni des setters pour tous les champs d’en-têtes (modifiant à la fois les champs du paquet et le tableau d’octets). Par exemple :

impl<'a> Packet<'a> {
    pub fn set_transport_source(&mut self, transport_source: u16) {
        self.transport_source = transport_source;
        let transport = &mut self.raw[20..];
        BigEndian::write_u16(&mut transport[0..2], port);
    }
}

Mais cette conception ne serait pas terrible (surtout que les champs d’en-têtes TCP et UDP sont différents).

À la place, nous voudrions extraire les en-têtes d’IP et de transport vers des structures séparées, gérant leur propre partie du tableau d’octets :

// violates the borrowing rules

pub struct Packet<'a> {
    raw: &'a mut [u8], // the whole packet (including headers)
    ipv4_header: Ipv4Header<'a>,
    transport_header: TransportHeader<'a>,
}

pub struct Ipv4Header<'a> {
    raw: &'a mut [u8], // slice related to ipv4 headers
    source: u32,
    destination: u32,
    protocol: u8,
    // + other ipv4 fields
}

pub struct TransportHeader<'a> {
    raw: &'a mut [u8], // slice related to transport headers
    source: u16,
    destination: u16,
    // + other transport fields
}

Vous avez immédiatement repéré le problème : il y a plusieurs références vers la même ressource, le tableau d’octets raw, en même temps.

Remarquez que diviser le tableau n’est pas une possibilité ici, vu que les slices de raw se chevauchent : nous avons besoin d’écrire le paquet complet en une seule fois vers la couche réseau, donc le tableau raw dans Packet doit inclure les headers.

Nous avons besoin d’une solution compatible avec les règles d’emprunt.

Voici celle à laquelle je suis parvenu :

stocker les données des en-têtes séparément, sans les slices de raw,
créer des structures de vues pour les en-têtes d’IP et de transport, liées à une durée de vie,
exposer des méthodes de Packet retournant des instances de vues.

Et voici une simplification de l’implémentation réelle :

pub struct Packet<'a> {
    raw: &'a mut [u8],
    ipv4_header: Ipv4HeaderData,
    transport_header: TransportHeaderData,
}

pub struct Ipv4HeaderData {
    source: u32,
    destination: u32,
    protocol: u8,
    // + other ipv4 fields
}

pub struct TransportHeaderData {
    source: u16,
    destination: u16,
    // + other transport fields
}

pub struct Ipv4Header<'a> {
    raw: &'a mut [u8],
    data: &'a mut Ipv4HeaderData,
}

pub struct TransportHeader<'a> {
    raw: &'a mut [u8],
    data: &'a mut TransportHeaderData,
}

impl<'a> Packet<'a> {
    pub fn ipv4_header(&mut self) -> Ipv4Header {
        Ipv4Header {
            raw: &mut self.raw[..20],
            data: &mut self.ipv4_header,
        }
    }

    pub fn transport_header(&mut self) -> TransportHeader {
        TransportHeader {
            raw: &mut self.raw[20..40],
            data: &mut self.transport_header,
        }
    }
}

Les setters sont implémentés sur les vues, où ils détiennent une référence mutable vers le tableau brut :

impl<'a> TransportHeader<'a> {
    pub fn set_source(&mut self, source: u16) {
        self.data.source = source;
        BigEndian::write_u16(&mut raw[0..2], source);
    }

    pub fn set_destination(&mut self, destination: u16) {
        self.data.destination = destination;
        BigEndian::write_u16(&mut raw[2..4], destination);
    }
}

De cette manière, les règles d’emprunt sont respectées, et l’API est élégante :

    let mut packet = …;
    // "transport_header" borrows "packet" during its scope
    let mut transport_header = packet.transport_header();
    transport_header.set_source(1234);
    transport_header.set_destination(1234);

Limitations du compilateur

Rust est un langage jeune, et le compilateur a quelques problèmes ennuyeux.

Le pire, d’après moi, est lié aux durées de vie non-lexicales, qui provoque des erreurs inattendues :

struct Container {
    vec: Vec<i32>,
}

impl Container {
    fn find(&mut self, v: i32) -> Option<&mut i32> {
        None // we don't care the implementation
    }

    fn get(&mut self, v: i32) -> &mut i32 {
        if let Some(x) = self.find(v) {
            return x;
        }
        self.vec.push(v);
        self.vec.last_mut().unwrap()
    }
}

error[E0499]: cannot borrow `self.vec` as mutable more than once at a time
  --> sample.rs:14:9
   |
11 |         if let Some(x) = self.find(v) {
   |                          ---- first mutable borrow occurs here
...
14 |         self.vec.push(v);
   |         ^^^^^^^^ second mutable borrow occurs here
15 |         self.vec.last_mut().unwrap()
16 |     }
   |     - first borrow ends here

Heureusement, cela devrait être corrigé prochainement.

La fonctionnalité d’Impl Trait, permettant aux fonctions de retourner des types abstraits non-boxés, devrait aussi améliorer l’expérience (il y a aussi une proposition étendue).

Le compilateur produit généralement des messages d’erreur très utiles. Mais quand ce n’est pas le cas, ils peuvent être très déroutants.

Sûreté et pièges

Le premier chapitre du Rustonomicon dit :

Safe Rust is For Reals Totally Safe.

[…]

Safe Rust is the true Rust programming language. If all you do is write Safe Rust, you will never have to worry about type-safety or memory-safety. You will never endure a null or dangling pointer, or any of that Undefined Behavior nonsense.

En français :

La partie Sûre de Rust est Réellement Totallement Sûre.

[…]

Le Rust Sûr est le vrai langage de programmation Rust. Si vous n’écrivez que du Rust Sûr, vous n’aurez jamais à vous inquiétez de la sûreté des types ou de la mémoire. Vous n’aurez jamais à supporter un pointeur null ou dangling, ou l’un de ces comportements indéfinis insensés.

C’est le but. Et c’est presque vrai.

Leakpocalypse

Dans le passé, il a été possible d’écrire du code Rust sûr accédant à de la mémoire libérée.

Cette “leakpocalypse” a conduit à la clarification des guaranties de sûreté : ne pas exécuter un destructeur est maintenant considéré sûr. En d’autres termes, la sûreté mémoire ne peut plus reposer sur RAII (en fait, elle n’a jamais pu, mais cela n’a été remarqué que tardivement).

En conséquence, std::mem::forget est maintenant sûr, et JoinGuard a été déprécié et supprimé de la bibliothèque standard (il a été déplacé vers un crate séparé).

Les autres outils s’appuyant sur RAII (comme Vec::drain()) doivent prendre des précautions particulières pour garantir que la mémoire ne sera pas corrompue.

Ouf, la sûreté mémoire est (maintenant) sauvée.

Infinité indéfinie

En C et C++, les boucles infinies sans effets de bords sont un cas d’undefined behavior. À cause de cela, il est possible d’écrire des programmes qui, de façon inattendue, réfutent le dernier théorème de Fermat.

En pratique, le compilateur Rust s’appuie sur LLVM, qui (actuellement) applique ses optimisations en faisant l’hypothèse que les boucles infinies sans effets de bords ont un comportement indéfini. En conséquence, de tels undefined behaviors se produisent également en Rust.

Voici un exemple minimal pour l’observer :

fn infinite() {
    loop {}
}

fn main() {
    infinite();
}

Quand on l’exécute sans optimisations, il se comporte comme “attendu” :

$ rustc ub.rs && ./ub
^C                    (infinite loop, interrupt it)

Mais activer les optimisations fait paniquer le programme :

$ rustc -O ub.rs && ./ub
thread 'main' panicked at 'assertion failed: c.borrow().is_none()', /checkout/src/libstd/sys_common/thread_info.rs:51
note: Run with `RUST_BACKTRACE=1` for a backtrace.

Nous pouvons aussi produire des résultats inattendus sans plantage :

fn infinite(mut value: u32) {
    // infinite loop unless value initially equals 0
    while value != 0 {
        if value != 1 {
            value -= 1;
        }
    }
}

fn main() {
    infinite(42);
    println!("end");
}

$ rustc ub.rs && ./ub
^C                    (infinite loop, interrupt it)

Mais avec optimisations :

$ rustc -O ub.rs && ./ub
end

C’est un cas particulier, qui sera probablement corrigé dans le futur. En pratique, les garanties de sûreté de Rust sont très fortes (au prix d’être contraignantes).

Erreur de segmentation

Cette section a été ajoutée après la publication.

Il y a d’autres sources d’undefined behaviors (voir les issues taggées I-unsound).

Par exemple, caster une valeur flottante ne pouvant pas être représentée dans le type cible est un undefined behavior, qui peut être propagé pour provoquer une erreur de segmentation :

#[inline(never)]
pub fn f(ary: &[u8; 5]) -> &[u8] {
    let idx = 1e100f64 as usize;
    &ary[idx..]
}

fn main() {
    println!("{}", f(&[1; 5])[0xdeadbeef]);
}

rustc -O ub.rs && ./ub
Erreur de segmentation

Stats

C’est tout pour mes retours sur le langage lui-même.

En supplément, comparons les versions Java et Rust du serveur relais.

Nombre de lignes

$ cloc relay-{java,rust}/src
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Rust                            29            687            655           4506
Java                            37            726            701           2931
-------------------------------------------------------------------------------

(tests included)

Le projet Rust est significativement plus gros, pour plusieurs raisons :

il y a beaucoup de classes de vues d’emprunt ;
la version Rust définit sa propre classe de selecteur d’I/O asynchrone, encapsulant Poll de plus bas niveau, alors que la version Java utilise le Selector standard ;
la gestion d’erreur pour l’analyse de la ligne de commande est plus verbeuse.

La version Java contient plus de fichiers car les tests unitaires sont séparés, alors qu’en Rust ils se trouvent dans le même fichier que les classes qu’ils testent.

Juste pour information, voici les résultats pour le client Android :

$ cloc app/src
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Java                            15            198            321            875
XML                              6              7              2             76
-------------------------------------------------------------------------------
SUM:                            21            205            323            951
-------------------------------------------------------------------------------

Taille des binaires

--------------------------------------------
Java     gnirehtet.jar                   61K
--------------------------------------------
Rust     gnirehtet                      3.0M
         after "strip -g gnirehtet"     747K
         after "strip gnirehtet"        588K
--------------------------------------------

Le binaire Java lui-même est bien plus petit. La comparaison n’est pas juste cependant, vu qu’il nécessite l’environnement d’exécution Java :

$ du -sh /usr/lib/jvm/java-1.8.0-openjdk-amd64/
156M	/usr/lib/jvm/java-1.8.0-openjdk-amd64/

Utilisation mémoire

Avec une seule connection TCP ouvert, voici la consommation mémoire pour le serveur relais en Java :

$ sudo pmap -x $RELAY_JAVA_PID
                  Kbytes     RSS   Dirty
total kB         4364052   86148   69316

(résultat filtré)

Et pour le serveur relais en Rust :

$ sudo pmap -x $RELAY_RUST_PID
                  Kbytes     RSS   Dirty
total kB           19272    2736     640

Regardez la valeur RSS, qui indique la mémoire réellement utilisée.

Comment on pouvait s’y attendre, la version Java consomme plus de mémoire (86Mo) que la version Rust (moins de 3Mo). De plus, sa valeur est instable à cause de l’allocation de petits objets et leur garbage collection, qui génère aussi davantage de dirty pages. La valeur pour Rust, quant à elle, est très stable : une fois la connection créée, il n’y a plus d’allocations mémoire du tout.

Utilisation CPU

Pour comparer l’utilisation CPU, voici mon scénario : un fichier de 500Mo est hébergé par Apache sur mon ordinateur, je démarre le serveur relais avec perf stat, puis je télécharge le fichier à partir de Firefox sur Android. Dès que le fichier est téléchargé, je stoppe le serveur relais (Ctrl+C).

Voici les résultats pour la version Java :

$ perf stat -B java -jar gnirehtet.jar relay
 Performance counter stats for 'java -jar gnirehtet.jar relay':

      11805,458302      task-clock:u (msec)       #    0,088 CPUs utilized
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
            28 618      page-faults:u             #    0,002 M/sec
    17 908 360 446      cycles:u                  #    1,517 GHz
    13 944 172 792      stalled-cycles-frontend:u #   77,86% frontend cycles idle
    18 437 279 663      instructions:u            #    1,03  insn per cycle
                                                  #    0,76  stalled cycles per insn
     3 088 215 431      branches:u                #  261,592 M/sec
        70 647 760      branch-misses:u           #    2,29% of all branches

     133,975117164 seconds time elapsed

Et pour la version Rust :

$ perf stat -B ./gnirehtet relay
 Performance counter stats for 'target/release/gnirehtet relay':

       2707,479968      task-clock:u (msec)       #    0,020 CPUs utilized
                 0      context-switches:u        #    0,000 K/sec
                 0      cpu-migrations:u          #    0,000 K/sec
             1 001      page-faults:u             #    0,370 K/sec
     1 011 527 340      cycles:u                  #    0,374 GHz
     2 033 810 378      stalled-cycles-frontend:u #  201,06% frontend cycles idle
       981 103 003      instructions:u            #    0,97  insn per cycle
                                                  #    2,07  stalled cycles per insn
        98 929 222      branches:u                #   36,539 M/sec
         3 220 527      branch-misses:u           #    3,26% of all branches

     133,766035253 seconds time elapsed

Je ne suis pas un expert pour analyser les résultats, mais de ce que je comprends de la valeur task-clock:u, la version Rust consomme 4× moins de temps CPU.

Conclusion

Réécrire Gnirehtet en Rust a été une formidable expérience, où j’ai appris un super langage et de nouveaux concepts de programmation. Et maintenant, nous avons une application native avec de meilleures performances.

Bon reverse tethering !

Discussions sur reddit et Hacker News.

Fusionner deux dépôts git

2017-07-12T20:30:00+02:00

Ce billet explique comment fusionner un dépôt git (avec son historique) dans un sous-répertoire d’un autre dépôt git.

Cas d’usage

Mon projet principal se trouve dans un dépôt main. J’ai démarré dans un autre dépôt un projet other, que je souhaite finalement intégrer dans un sous-répertoire sub/ du projet principal, en conservant son historique. Après cette fusion, je ne garderai que le dépôt principal.

Fusion

Les deux projets se trouvent dans le répertoire courant :

$ ls
main  other

Dans le dépôt main, copier la branche master de other dans une nouvelle branche tmp :

cd main
git fetch ../other master:tmp

Le dépôt main contient alors les historiques disjoints des deux projets.

Nous allons maintenant réécrire l’historique complet de la branche tmp pour déplacer tout le contenu dans un sous-répertoire sub/, grâce une commande donnée en exemple de man git filter-branch :

git checkout tmp
git filter-branch --index-filter \
    'git ls-files -s | sed "s-\t\"*-&sub/-" |
        GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
            git update-index --index-info &&
     mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"'

À ce stade, nous avons toujours deux historiques indépendants, mais le contenu lié à la branche tmp se trouve dans le sous-répertoire sub/.

A---B---C---D master

  X---Y---Z tmp

La dernière étape consiste à relier les deux historiques, soit grâce à un rebase, soit grâce à un merge.

Un rebase réécrit l’historique du sous-projet sur la branche master :

git rebase master

# A---B---C---D---X'--Y'--Z' master

Un merge relie juste les deux historiques grâce à un commit de merge :

git merge tmp --allow-unrelated-histories

# A---B---C---D---M  master
#                /
#       X---Y---Z tmp

Concrètement

J’ai débuté la réécriture du serveur relais de gnirehtet en Rust dans un dépôt séparé. Maintenant qu’il commence à fonctionner, je l’ai fusionné dans un sous-répertoire du dépôt principal tout en conservant l’historique :

git fetch ../rustrelay master:tmp
git checkout tmp
git filter-branch --index-filter \
    'git ls-files -s | sed "s-\t\"*-&rustrelay/-" |
        GIT_INDEX_FILE=$GIT_INDEX_FILE.new \
            git update-index --index-info &&
     mv "$GIT_INDEX_FILE.new" "$GIT_INDEX_FILE"'
git rebase master

Introducing gnirehtet

2017-03-30T12:00:00+02:00

I spent the last few weeks at Genymobile developing a tool providing reverse tethering for Android, so that devices may use the internet connection of the computer on which they are connected via USB, without requiring any root access (neither on the device nor on the computer). It works on GNU/Linux, Windows and Mac OS.

We decided to open source it under the name gnirehtet.

Yeah, that’s a weird name, until you realize that this is the output of this bash command:

rev <<< tethering

How to use Gnirehtet

Basically, just download the latest release, extract it, and execute the following command on the computer:

./gnirehtet rt

Once activated, a “key” logo appears in your device status bar:

Check the README file of the project for more details.

How does gnirehtet work?

Gnirehtet is composed of two parts:

an Android application (the client);
a Java desktop application (the relay server).

Since then, I rewrote it in Rust.

The client registers itself as a VPN, in order to intercept the whole device network traffic, as byte[] of raw IPv4 packets, which it transmits to the relay server over a TCP connection (established over adb).

The relay server parses the packets headers, open connections from the computer to the requested destinations, and relays the content in both directions following the UDP and TCP protocols. It creates and sends response packets back to the Android client, which writes them to the VPN interface.

In a sense, the relay server behaves like a NAT, in that it opens connections on behalf of private peers. However, it differs from standard NATs in the way it communicates with the clients (the private peers), by using a very specific (though simple) protocol over a TCP connection.

For more details, you can read the developers page.

Here are the solutions I have considered

Once the application is able to intercept the whole device network traffic, several alternative designs are possible.

TL;DR: I first considered creating a “TUN device” on the computer, but it did not suit our needs. Then I wanted to benefit from existing SOCKS servers, but some constraints prevented us to relay UDP traffic. So I implemented gnirehtet.

TUN device

During my investigations on how to implement reverse tethering, I first found projects creating a TUN device on the computer (vpn-reverse-tether and SimpleRT).

This design works very well, and has several advantages:

it operates at network level, so there is no need for translation between level 3 and level 5 of the OSI model;
all IP packets are tunneled, regardless of their transport protocol (so they are all supported, while gnirehtet “only” supports TCP and UDP).

However:

it requires root access on the computer;
it does not work on platforms other than Linux.

You could still consider using these “TUN device” applications, they may better suit your needs.

SOCKS

In order to avoid to develop a specific relay server, my first idea was to make the client talk the SOCKS protocol (according to RFC 1928). That way, it would be possible to use any existing SOCKS server, for instance the one provided by ssh -D.

You probably already used it to bypass annoying enterprise firewalls. For this purpose, just start the tunnel:

ssh my_serveur -ND1080

Then configure your browser to use the SOCKS proxy localhost:1080. Also take care to enable remote DNS resolution if you want to resolve domain names from my_server (in Firefox, enable network.proxy.socks_remote_dns in about:config).

Unfortunately, the OpenSSH implementation does not support UDP, although the SOCKS5 protocol itself does. And we do need UDP, at least for DNS requests (and also NTP).

If you read carefully the two last paragraphs, you might want to ask yourself:

How may Firefox resolve domain names remotely through the OpenSSH SOCKS proxy if it does not even support UDP?

The answer lies in the section 4 of the RFC: the requested destination address may be an IPv4, an IPv6 or a domain name. However, using this feature implies that the client (e.g. Firefox) is aware of the proxy (since it must explicitly pass the domain name instead of resolving it locally), while our reverse tethering must be transparent.

But all is not lost. OK, OpenSSH does not support UDP, but this is just a specific implementation, we could consider another one. Unfortunately, SOCKS5 requires to relay UDP over UDP, but the devices and the computer communicate over adb (thanks to adb reverse), which does not support UDP port forwarding either.

Maybe we could at least relay DNS requests by forcing them to use TCP, like tsocks does:

tsocks will normally not be able to send DNS queries through a SOCKS server since SOCKS V4 works on TCP and DNS normally uses UDP. Version 1.5 and up do however provide a method to force DNS lookups to use TCP, which then makes them proxyable.

But then, SOCKS was no longer attractive to me for implementing reverse tethering.

Gnirehtet

Therefore, I developed both the client and the relay server manually.

This blog post and several open source projects (SimpleRT, vpn-reverse-tether, LocalVPN et ToyVpn) helped me a lot to understand how to implement this solution.

Conclusion

Gnirehtet allows Android devices to use the internet connection from a computer easily, without any root access. It helps when you can’t access the network using a WiFi access point.

I hope it will be useful to some of you.

This post was initially published on medium.

Discuss on reddit and Hacker News.

~rom1v/blog

Scrcpy 2.0, with audio

Audio usage

History

Audio capture

Audio encoding

Client architecture

Audio player

Conclusion

Audio forwarding on Android 10

How to use sndcpy

Apps restrictions

Integration in scrcpy

Introducing USBaudio

Forwarding audio from Android devices

How to use USBaudio

How does it work?

Manually

Limitations

A new core playlist for VLC 4

Objectives

Data structure

Interaction with UI

Desynchronization

Synchronization

Core to UI

UI to core

Random playback

Randomizer

Sorting

Interaction with the player

Media source

Conclusion

Implementing tile encoding in rav1e

What is this?

Tile encoding

Preliminary work

Tiling structures

Minimal example

In rav1e

Relative offsets

Tile encoding

Command-line

Bitstream

Tile info

CDF update

Size of tiles size

Tile group

Parallelization

Limits

Fixing bugs

Conclusion

Introducing scrcpy

How does scrcpy work?

Minimize latency

No buffering

Skip frames

Run a Java main on Android

Hello, world!

Access the Android framework

Like an APK

In scrcpy

Improve startup time

Quick installation

Parallelization

Clean up the device

Handle text input

Events

Inject text

Handle accented characters

Set a window icon

Conclusion

Gnirehtet rewritten in Rust

Why Rust?

Learning Rust

Difficulties

Encapsulation

Observer

Mutable data sharing

Compiler limitations