Shuah Khan's Linux Kernel Blogs

A blogging framework for hackers.

Regression Testing Media Resource Sharing on Au0828

The Media Device Allocator API allows multiple drivers to share a media device. This API solves a very common use case for media devices where one physical device (a USB stick) provides both audio and video. When such a media device exposes a standard USB Audio class and a proprietary Video class, two or more independent drivers will share a single physical USB bridge. In such cases, it is necessary to coordinate access to the shared resource.

Using this API, drivers can allocate a media device with the shared struct device as the key. Once the media device is allocated by a driver, other drivers can get a reference to it. The media device is released when all the references are released.

The au0828 and ALSA (snd_usb_audio) drivers are changed to use the Media Device Allocator API to allocate or get a reference to the media device, with the parent USB struct device as the key. Using this common media device, the DVB, V4L2, and snd_usb_audio drivers can share media resources on an AU0828 media device.
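For illustration, here is a minimal sketch of how a driver's USB probe and disconnect routines might use the allocator API. The function names media_device_usb_allocate() and media_device_delete() follow the mainline Media Device Allocator interface, but the exact signatures, the error-handling convention, and the demo_state structure are assumptions for this sketch rather than the actual au0828 or snd_usb_audio code.

#include <linux/err.h>
#include <linux/module.h>
#include <linux/slab.h>
#include <linux/usb.h>
#include <media/media-device.h>
#include <media/media-dev-allocator.h>

/* Hypothetical per-interface state for this sketch. */
struct demo_state {
        struct media_device *mdev;
};

static int demo_probe(struct usb_interface *intf,
                      const struct usb_device_id *id)
{
        struct usb_device *udev = interface_to_usbdev(intf);
        struct media_device *mdev;
        struct demo_state *st;

        st = kzalloc(sizeof(*st), GFP_KERNEL);
        if (!st)
                return -ENOMEM;

        /* First caller allocates the media device keyed on the shared
         * parent USB device; later callers get a reference to it. */
        mdev = media_device_usb_allocate(udev, KBUILD_MODNAME, THIS_MODULE);
        if (IS_ERR_OR_NULL(mdev)) {
                kfree(st);
                return mdev ? PTR_ERR(mdev) : -ENOMEM;
        }

        st->mdev = mdev;
        usb_set_intfdata(intf, st);

        /* ... register entities and pads, create links ... */
        return 0;
}

static void demo_disconnect(struct usb_interface *intf)
{
        struct demo_state *st = usb_get_intfdata(intf);

        /* Drop this driver's reference; the media device is freed only
         * when the last driver holding a reference releases it. */
        media_device_delete(st->mdev, KBUILD_MODNAME, THIS_MODULE);
        kfree(st);
}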

The resource sharing varies depending on which component is in active use. When DVB is streaming, it holds the tuner in exclusive access mode. When s-video and composite are in use, the decoder will be held in exclusive access mode. Video and audio can share the tuner and decoder. au0828 acts as the master driver that decides the media topology and enforces the rules for sharing among various drivers.

  • The tuner to decoder link is in a sharable state when video or audio enables it. Audio and video share the tuner (source) and decoder (sink).
  • DVB holds tuner in exclusive mode.
  • S-Video and Composite hold the resource in exclusive mode.
  • VBI can’t share the link with Video/Audio. This is an au0828 driver bug; VBI should be able to share with video/audio. Some drivers expect that video starts streaming before VBI starts streaming. In some cases this is a hardware limitation; in other cases, it is a driver bug. Shuah will look into the au0828 driver's VBI-related oddities. Enable/disable will need updating after the VBI bug is fixed to allow VBI to share with video/audio. Currently, starting a VBI application changes the input and active video/VBI captures stop.

Media sharing is tested using audio, video, and DVB applications. The following applications are used to test the exclusion and sharing rules.

  • Kaffeine
  • vlc digital mode
  • vlc analog mode
  • v4l2-ctl -f 700
  • v4l2-ctl --stream-mmap --stream-count=100 -d /dev/video0
  • qv4l2 -V /dev/vbi0
  • qv4l2 -d /dev/video0
  • xawtv -c /dev/video0
  • tvtime -c 3 -d /dev/video0
  • arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav

The following is a summary of test cases and the results of testing done on the patch series on Linux 5.0-rc7. The patch series is featured in an LWN article and on the Linux Media kernel patches Patchwork.

DVB exclusive access test

When a DVB application is streaming while holding the tuner resource, audio/video/VBI applications should find the resource busy, the DVB stream should not be disrupted, and the resource should be released when the DVB application exits.

Application Expected Result Result
Start kaffeine and start digital TV. Streaming with tuner held. Streaming active with tuner held.
Start arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav Should exit with resource busy (EBUSY) error. Exits with EBUSY. No disruption to the DVB stream. Verified that DVB holds the tuner. Passed
Start vlc, try TV digital capture and stop vlc. Capture should fail with resource busy error. Only one DVB instance can open the device in write mode. Capture fails with resource busy. No disruption to the DVB stream. Verified that DVB holds the tuner. Passed
Start vlc, try TV analog capture and stop vlc. Capture should fail with resource busy error. Capture fails with resource busy. No disruption to the DVB stream. Verified that DVB holds the tuner. Passed
Start qv4l2 -d /dev/video0 and try capture Capture should fail with resource busy error Capture fails with resource busy. No disruption to the DVB stream. Verified that DVB holds the tuner. Passed
Start qv4l2 -V /dev/vbi0 and try capture Capture should fail with resource busy error Capture fails with resource busy. No disruption to the DVB stream. Verified that DVB holds the tuner. Passed
Start v4l2-ctl -f 700 Should exit with resource busy Exits with EBUSY. No disruption to the DVB stream. Verified that DVB holds the tuner. Passed
Start v4l2-ctl --stream-mmap --stream-count=100 -d /dev/video0 Should exit with resource busy Exits with EBUSY. No disruption to the DVB stream. Verified that DVB holds the tuner. Passed
Start xawtv -c /dev/video0 Should detect resource busy Detects resource busy. No disruption to the DVB stream. Verified that DVB holds the tuner. Passed
Start tvtime -c 3 -d /dev/video0 Should detect resource busy Detects resource busy. No disruption to the DVB stream. Verified that DVB holds the tuner. Passed
Stop kaffeine Should release tuner. Tuner is released. Passed

VBI exclusive access test

When a VBI capture is in progress while holding the input resource, DVB/audio/video applications should find the resource busy, the VBI capture should not be disrupted, and the resource should be released when the VBI application exits. When another VBI application tries to capture, it should be able to share without disruption. However, au0828’s s_input changes the input and the active capture stops. This is a separate problem to look into.

Application Expected Result Result
Start qv4l2 -V /dev/vbi0 and start capture Capture starts with resource held. Capture starts with resource held.
Start kaffeine and start digital TV. Should find resource busy. Fails with resource busy. No disruption to capture. Verified that VBI holds the resource. Passed
Start arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav Should exit with resource busy (EBUSY) error. Exits with EBUSY. No disruption to capture. Verified that VBI holds the resource. Passed
Start vlc, try TV digital capture and stop vlc. Capture should fail with resource busy error. Capture fails with resource busy. No disruption to capture. Verified that VBI holds the resource. Passed
Start vlc, try TV analog capture and stop vlc. Capture should fail with resource busy error. Capture fails with resource busy. No disruption to capture. Verified that VBI holds the resource. Passed
Start qv4l2 -d /dev/video0 and try capture Capture should fail with resource busy error Capture fails with resource busy. No disruption to capture. Verified that VBI holds the resource. Passed
Start qv4l2 -V /dev/vbi0 and start capture Should share resource and capture? Shares the resource, but capture doesn’t start and active capture stops when s_input is changed. This is an au0828 driver s_input problem to look into separately. Unrelated problem.
Start v4l2-ctl -f 700 Should exit with resource busy Exits with EBUSY. No disruption to capture. Verified that VBI holds the resource. Passed
Start v4l2-ctl --stream-mmap --stream-count=100 -d /dev/video0 Should exit with resource busy Exits with EBUSY. No disruption to capture. Verified that VBI holds the resource. Passed
Start xawtv -c /dev/video0 Should detect resource busy Detects resource busy. No disruption to capture. Verified that VBI holds the resource. Passed
Start tvtime -c 3 -d /dev/video0 Should detect resource busy Detects resource busy. No disruption to capture. Verified that VBI holds the resource. Passed
Stop qv4l2 -V /dev/vbi0 Should release resource Resource is released. Passed

Video active tests

When a video application is holding the tuner, audio and other video applications can share the tuner. The video driver allows more than one application to open the device, and the sharing logic allows more than one video application to share the tuner. DVB and VBI applications should find the resource busy while video is actively using it.

Video tests with qv4l2 -d /dev/video0 active

Application Expected Result Result
Start qv4l2 -d /dev/video0 and start capture Capture starts with resource held. Capture starts with resource held.
Start kaffeine and start digital TV. Should find resource busy. Fails with resource busy. No disruption to capture. Verified that Video holds the resource. Passed
Start arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav and stop Should be allowed to share the resource and when it exits, the resource should remain locked. Shares tuner. Video stream continues. When arecord stopped, video continues to hold the resource. Passed
Start vlc, try TV digital capture and stop vlc. Capture should fail with resource busy error. Capture fails with resource busy. No disruption to capture. Verified that Video holds the resource. Passed
Start vlc, try TV analog capture and stop vlc. Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Fails with vidioc_s_fmt_vid_cap queue busy. This check is done before it attempts to hold the resource.
Start qv4l2 -d /dev/video0 and try capture Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Fails with vidioc_s_fmt_vid_cap queue busy. This check is done before it attempts to hold the resource.
Start qv4l2 -V /dev/vbi0 and start capture Should exit with resource busy Exits with VIDIOC_STREAMON: Device or resource busy. No disruption to capture. Verified that Video holds the resource. Passed
Start v4l2-ctl -f 700 Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Shares tuner link, updates frequency, and exits. Video holds the resource. I expected the video driver to not update frequency. This behavior is unrelated to the patch series.
Start v4l2-ctl --stream-mmap --stream-count=100 -d /dev/video0 Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Exits with VIDIOC_REQBUFS: failed: Device or resource busy. This check is done before it attempts to hold the resource. No disruption to the Video capture. Verified that Video holds the resource.
Start xawtv -c /dev/video0 Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Finds the audio device and then fails with vidioc_s_fmt_vid_cap queue busy. No disruption to the Video capture. Verified that Video holds the resource. Passed
Start tvtime -c 3 -d /dev/video0 Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Detects resource busy. No disruption to capture. Verified that Video holds the resource. Passed
Stop qv4l2 -d /dev/video0 Should release resource Resource is released. Passed

Video tests with v4l2-ctl --stream-mmap --stream-count=10000 -d /dev/video0 active

Application Expected Result Result
Start v4l2-ctl --stream-mmap --stream-count=10000 -d /dev/video0 Capture starts with resource held. Capture starts with resource held.
Start kaffeine and start digital TV. Should find resource busy. Fails with resource busy. No disruption to capture. Verified that Video holds the resource. Passed
Start arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav and stop Should be allowed to share the resource and when it exits, the resource should remain locked. Shares tuner. Video stream continues. When arecord stopped, video continues to hold the resource. Passed
Start vlc, try TV digital capture and stop vlc. Capture should fail with resource busy error. Capture fails with resource busy. No disruption to capture. Verified that Video holds the resource. Passed
Start vlc, try TV analog capture and stop vlc. Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Fails with vidioc_s_fmt_vid_cap queue busy. This check is done before it attempts to hold the resource.
Start qv4l2 -d /dev/video0 and try capture Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Fails with vidioc_s_fmt_vid_cap queue busy. This check is done before it attempts to hold the resource.
Start qv4l2 -V /dev/vbi0 and start capture Should exit with resource busy Exits with VIDIOC_STREAMON: Device or resource busy. No disruption to capture. Verified that Video holds the resource. Passed
Start v4l2-ctl -f 700 Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Shares tuner link, updates frequency, and exits. Video holds the resource. I expected the video driver to not update frequency. This behavior is unrelated to the patch series.
Start v4l2-ctl --stream-mmap --stream-count=100 -d /dev/video0 Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Exits with VIDIOC_REQBUFS: failed: Device or resource busy. This check is done before it attempts to hold the resource. No disruption to the Video capture. Verified that Video holds the resource.
Start xawtv -c /dev/video0 Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Finds the audio device and then fails with vidioc_s_fmt_vid_cap queue busy. No disruption to the Video capture. Verified that Video holds the resource. Passed
Start tvtime -c 3 -d /dev/video0 Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Detects resource busy. No disruption to capture. Verified that Video holds the resource. Passed
Stop v4l2-ctl --stream-mmap --stream-count=10000 -d /dev/video0 Should release resource Resource is released. Passed

Audio active tests with arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav

When an audio application is holding the tuner, video applications can share the tuner. The video driver allows more than one application to open the device, and the sharing logic allows more than one video application to share the tuner. The audio driver, on the other hand, doesn’t. DVB and VBI applications should find the resource busy while audio is actively using it.

Application Expected Result Result
Start arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav Capture starts with resource held. Capture starts with resource held.
Start kaffeine and start digital TV. Should find resource busy. Fails with resource busy. No disruption to capture. Verified that Audio holds the resource. Passed
Start arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav Should exit with device/resource busy. Exits with device/resource busy. Note that this check happens before resource hold attempt. Verified audio holds the resource. Passed
Start vlc, try TV digital capture and stop vlc. Capture should fail with resource busy error. Capture fails with resource busy. No disruption to capture. Verified that audio holds the resource. Passed
Start vlc, try TV analog capture and stop vlc. Expect video and audio link sharing to work. Audio device open should fail at the open stage. No errors from video; audio device open fails. Verified audio holds the resource. Passed
Start qv4l2 -d /dev/video0 and try capture. Stop. Expect video/audio sharing to work. Capture works. Audio holds the resource when the video app exits. Passed
Start qv4l2 -V /dev/vbi0 and start capture Should exit with resource busy Exits with VIDIOC_STREAMON: Device or resource busy. No disruption to capture. Verified that audio holds the resource. Passed
Start v4l2-ctl -f 700 Expect audio/video sharing to work. Shares tuner link, updates frequency, and exits. Audio holds the resource. Passed
Start v4l2-ctl --stream-mmap --stream-count=100 -d /dev/video0 Expect audio/video sharing to work. Capture works. Audio holds the resource when the video app exits. Passed
Start xawtv -c /dev/video0 Expect audio/video sharing to work. Audio device open should fail at the open stage. No errors from video; audio/video sharing worked. Audio device open fails. Verified audio holds the resource. Passed
Start tvtime -c 3 -d /dev/video0 Expect sharing to work. Audio device open should fail at the open stage. No errors from video; audio/video sharing worked. Audio device open fails. Verified audio holds the resource. Passed
Stop arecord Should release resource Resource is released. Passed

Sharing audio start - video start - audio stop - video stop testing

This is a test to make sure audio/video sharing works correctly and as audio and video applications get started and stopped, the resource continues to be held as long as one application is active and gets released correctly when the last application exits.

Application Expected Result Result
Start arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav Capture starts with resource held. Capture starts with resource held.
Start v4l2-ctl -f 700 Expect audio/video sharing to work. Shares tuner link, updates frequency, and exits. Audio holds the resource. Passed
Start v4l2-ctl --stream-mmap --stream-count=1000 -d /dev/video0 Expect audio/video sharing to work. Capture works. Audio/video sharing works. Passed
Stop arecord Expect resource to stay locked with video using it. Video capture continues. Resource is held. Passed
Start arecord -M -D plughw:2,0 -c2  -f S16_LE -t wav foo.wav Expect audio/video sharing to work. Capture works. Audio/video sharing works. Passed
Stop v4l2-ctl --stream-mmap --stream-count=1000 -d /dev/video0 Expect resource to stay locked with audio using it. Audio capture continues. Resource is held. Passed
Stop arecord Should release resource Resource is released. Passed

Multiple video sessions testing

This is a test to make sure video sharing works correctly and as video applications get started and stopped, the resource continues to be held as long as one application is active and gets released correctly when the last application exits.

Application Expected Result Result
Start v4l2-ctl --stream-mmap --stream-count=1000 -d /dev/video0 Expect video sharing to work. Capture works.
Start v4l2-ctl -f 700 while streaming is active Expect video sharing to work. Shares tuner link, updates frequency, and exits. Video holds the resource when it exits. Passed
Start a second v4l2-ctl --stream-mmap --stream-count=1000 -d /dev/video0 Expect sharing to work. It is possible an earlier video driver check might cause it to bail out before looking for the resource. Exits with vidioc_s_fmt_vid_cap queue busy. Video capture continues. Resource is held. Passed
Stop v4l2-ctl --stream-mmap --stream-count=1000 -d /dev/video0 Should release resource Resource is released. Passed

Multiple vbi sessions testing

This is a test to make sure vbi sharing works correctly and as vbi applications get started and stopped, the resource continues to be held as long as one application is active and gets released correctly when the last application exits.

Application Expected Result Result
Start qv4l2 -V /dev/vbi0 and start capture Capture starts with resource held Capture starts and resource is held
Start qv4l2 -V /dev/vbi0 Expect sharing to work Shares the resource, but capture doesn’t start and active capture stops when s_input is changed. This is an au0828 driver s_input problem to look into separately. Unrelated problem.
Stop qv4l2 -V /dev/vbi0 Resource should be held. Resource is held. Passed
Stop qv4l2 -V /dev/vbi0 Should release resource Resource is released. Passed

One Small Step to Harden USB Over IP on Linux

This post was originally published on Samsung Open Source Blog site on December 14, 2017

The USB over IP kernel driver allows a server system to export its USB devices to a client system over an IP network via the USB over IP protocol. Exportable USB devices include physical devices and software entities created on the server using the USB gadget subsystem. This article will cover a major USB over IP bug in the Linux kernel that was recently uncovered; it created some significant security issues but was resolved with help from the kernel community.

The Basics of the USB Over IP Protocol

There are two USB over IP server kernel modules:

  • usbip-host (stub driver): A stub USB device driver that can be bound to physical USB devices to export them over the network.
  • usbip-vudc: A virtual USB Device Controller that exports a USB device created with the USB Gadget Subsystem.

There is one USB over IP client kernel module:

  • usbip-vhci: A virtual USB Host Controller that imports a USB device from a USB over IP capable server.

Finally, there is one USB over IP utility:

  • usbip-utils: A set of user-space tools used to handle connection and management; they are used on both the client and server sides.

The USB/IP protocol consists of a set of packets that get exchanged between the client and server to query the exportable devices, request an attachment to one, access the device, and finally detach once finished. The server responds to these requests from the client, and this exchange is a series of TCP/IP packets with a TCP/IP payload that carries the USB/IP packets over the network. I’m not going to discuss the protocol in detail in this blog; please refer to usbip_protocol.txt to learn more.

Identifying a Security Problem With USB Over IP in Linux

When a client accesses an imported USB device, usbip-vhci sends a USBIP_CMD_SUBMIT to usbip-host, which submits a USB Request Block (URB). An endpoint number, transfer_buffer_length, and the number of ISO packets are among the URB fields carried in a USBIP_CMD_SUBMIT packet. There are some potential vulnerabilities here: specifically, a malicious packet could be sent from a client with an invalid endpoint, or with a very large transfer_buffer_length and a large number of ISO packets. A bad endpoint value could force usbip-host to access memory out of bounds unless usbip-host validates the endpoint to be within the valid range of 0-15. A large transfer_buffer_length could result in the kernel allocating large amounts of memory. Either of these malicious requests could adversely impact the server system's operation.
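As an illustration of the kind of input validation involved (not the actual usbip patches), the sketch below rejects a malicious request before a URB is built. The structure, field names, and limits here are hypothetical; the real driver has its own packet layout and policies.

#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical view of the client-supplied CMD_SUBMIT fields. */
struct demo_cmd_submit {
        __u32 ep;                       /* endpoint number from the client */
        __u32 transfer_buffer_length;   /* requested transfer length       */
        __u32 number_of_packets;        /* requested number of ISO packets */
};

#define DEMO_MAX_XFER_LEN       (1 << 20)       /* hypothetical sanity limit */
#define DEMO_MAX_ISO_PACKETS    1024            /* hypothetical sanity limit */

static int demo_validate_cmd_submit(const struct demo_cmd_submit *cmd)
{
        /* Endpoint numbers are 0-15; anything larger would index past the
         * device's endpoint tables and read out-of-bounds memory. */
        if (cmd->ep > 15)
                return -EINVAL;

        /* Refuse absurd buffer lengths so a malicious client cannot force
         * the kernel into huge allocations. */
        if (cmd->transfer_buffer_length > DEMO_MAX_XFER_LEN)
                return -EINVAL;

        if (cmd->number_of_packets > DEMO_MAX_ISO_PACKETS)
                return -EINVAL;

        return 0;
}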

Jakub Jirasek from Secunia Research at Flexera reported these security vulnerabilities in the driver's handling of malicious input. In addition, he reported an instance of a socket pointer address (a kernel memory address) leaked in a world-readable sysfs USB/IP client-side file and in debug output. Unfortunately, the USB over IP driver has had these security vulnerabilities since the beginning. The good news is that they have now been found and fixed; I sent a series of 4 fixes to address all of the issues reported by Jakub Jirasek. In addition, I am continuing to look for other potential problems.

All of these problems are the result of a lack of validation checks on the input and incorrect handling of error conditions; my fixes add the missing checks and take proper action. These fixes will propagate into the stable releases within the next few weeks. One exception is the kernel address leak, which came from an intentional design decision to provide a convenient way to find the IP address from socket addresses; that convenience opened a security hole.

Where are these fixes?

The fixes are going into the 4.15 and stable releases. The fixes can be found in the following two git branches:

Secunia Research has created the following CVEs for the fixes:

  • CVE-2017-16911 usbip: prevent vhci_hcd driver from leaking a socket pointer address
  • CVE-2017-16912 usbip: fix stub_rx: get_pipe() to validate endpoint number
  • CVE-2017-16913 usbip: fix stub_rx: harden CMD_SUBMIT path to handle malicious input
  • CVE-2017-16914 usbip: fix stub_send_ret_submit() vulnerability to null transfer_buffer

Do it Right the First Time

This is a great example of how open source software can benefit from having many eyes looking over it to identify problems so the software can be improved. However, after solving these issues, my takeaway is to be proactive in detecting security problems, and better yet, to avoid introducing them entirely. Failing to validate input is an obvious coding error that can potentially allow users to send malicious packets with severe consequences. In addition, be mindful of exposing sensitive information such as kernel pointer addresses. Fortunately, we were able to work together to solve this problem, which should make USB over IP more secure in the Linux kernel.

Media Resource Sharing Through the Media Controller API

This post was originally published on Samsung Open Source Blog site on December 2, 2015

Media devices have hardware resources that are shared across several functions. However, media drivers have no knowledge of which resources are shared. At the root of this problem is a lack of a common locking mechanism that can provide access control.

I have been working on solving the sharing problem for a year or so. At the Media Summit in San Jose back in March of this year, we reviewed my Media Token API patch series. It solved the problem with minimal changes to drivers, however, it introduced a new framework, in addition to the existing ones. Every new framework adds maintenance cost. This led to a discussion that identified the existing Media Controller API as a better alternative in the interest of avoiding adding a new framework. Since then, I’ve worked on implementing the solution using the Media Controller API, and I’ve ported it over to Media Controller Next Gen API.

In this article, I will cover how the ALSA and au0828 drivers share media resources using the Media Controller Next Gen API on the Win-TV HVR 950Q hybrid USB TV stick. In addition, I will share how the media-ctl mc_nextgen_test tool can be used to generate media graphs for a media device. Let’s start by looking at the differences between Multi-function Devices (MFDs) and media devices.

How do Media Devices Differ from a Multi-function Device (MFD)?

Media devices can be very small yet complex, as they consist of a group of independent devices that share an attach point. Each device implements a unique function, and sharing isn’t limited to the attach point. Functions can be shared, and media drivers have to coordinate this sharing. In some cases, non-media drivers are leveraged, as in the case of the Win-TV HVR 950Q, where snd-usb-audio is leveraged to drive the audio chip on the device.

MFD presents itself to the kernel as a single device. The Linux Kernel MFD framework helps identify functions as discrete platform devices. Sharing is limited to the attach point with sub-devices attached to a shared bus. Each function is independent with no shared resources other than the attach point. MFD drivers don’t need to coordinate sharing resources.

Challenges with Sharing Media Resources

Drivers are unaware of shared resources on a media device, and they don’t have common data structures to ensure exclusive access to a shared resource or function. The fine-grained locks that exist today don’t ensure exclusive access and are local to function drivers; there are no global locks or data structures that span all drivers. For example, starting a digital TV application disrupts video streaming. This is because the digital driver can access/change modes while a video (analog) TV application is using the tuner resource.

Exclusive Access Use-Cases:

  1. Starting Digital application should not disrupt active Video and Audio streams.
  2. Starting Video application should not disrupt active Digital and Audio streams.
  3. Starting Audio application should not disrupt Video and Digital streams.
  4. Querying current configuration (read only access) should work even when the tuner resource is busy.

What is the Media Controller API?

The Media Controller API is a relational media graph framework. Media functions and resources can be represented as nodes on a graph. The media device sits at the top as the root node, and nodes can be connected using media links to represent data paths. These links can be marked active to indicate that they are active data paths between two or more nodes; this allows data pipelines to be started.

For example, a tuner node can be linked to a decoder with the tuner operating as the source and the decoder operating as the sink. Then, an additional video sink node can be linked to the decoder to display output. A data pipeline can be started between the video node and the tuner once the link between the tuner and decoder is marked as active. As long as this pipeline is active, no other node will be allowed to activate its link to the tuner, thereby reserving the tuner resource for exclusive use.
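To make the model concrete, here is a minimal sketch of how a bridge driver might create and enable the tuner-to-decoder link and start a pipeline. media_create_pad_link() and media_pipeline_start() are the Media Controller helpers as they existed in kernels of this era; the pad indices and entity setup are assumptions made purely for illustration.

#include <media/media-device.h>
#include <media/media-entity.h>

static struct media_pipeline demo_pipe;

/* Assume pad 0 of the tuner is its source pad and pad 0 of the decoder is
 * its sink pad; real drivers look these up from their entity definitions. */
static int demo_link_tuner_to_decoder(struct media_entity *tuner,
                                      struct media_entity *decoder)
{
        return media_create_pad_link(tuner, 0, decoder, 0,
                                     MEDIA_LNK_FL_ENABLED);
}

static int demo_start_stream(struct media_entity *video)
{
        /* Starting the pipeline walks the enabled links and reserves them;
         * while it is active, no other entity can activate its own link to
         * the tuner, so the tuner is held for exclusive use. */
        return media_pipeline_start(video, &demo_pipe);
}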

Using The Media Controller API to Share Resources on a Win-TV HVR 950Q

To illustrate this, take a look at the Win-TV HVR 950Q device/driver view:

Win-TV HVR 950Q device and driver view

As you can see in this device/driver view, the USB Device (struct usb_device) and the USB Device Parent (struct device) are the only two common data structures across all the drivers for this device. If we were to create a media device at the USB Device Parent, then the media device would be common across all drivers. The Managed Media Controller API allows the media device to be created as a USB Device Parent device resource. This media device is created by either au0828 or snd-usb-audio as their probe routines run; the first probe routine that runs wins. This managed media device serves as the root node for the media graph. A sketch of how a probe routine might obtain it follows the API list below.

Managed Media Controller API:

  • media_device_get_devres()
  • media_device_find_devres()
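Here is a minimal sketch of how either probe routine might obtain the shared, device-managed media device using the calls listed above. The signatures (a struct device * in, a struct media_device * out) are assumptions based on the API names; whichever driver probes first ends up creating the instance.

#include <media/media-device.h>

static struct media_device *demo_get_shared_mdev(struct device *usb_parent)
{
        struct media_device *mdev;

        /* Allocates the devres-managed media device on the first call and
         * returns the existing instance on later calls. */
        mdev = media_device_get_devres(usb_parent);
        if (!mdev)
                return NULL;

        /* A driver that must only use an existing instance (and never
         * create one) can use the lookup variant instead:
         *
         *      mdev = media_device_find_devres(usb_parent);
         */
        return mdev;
}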

Win-TV HVR 950Q drivers with managed media device

Drivers use probe routines to create media graph nodes for their functions and resources. When drivers create new media nodes, other drivers might need to take some action. The Media Controller Entity Notify API provides new interfaces that enable drivers to register callbacks and take action as entities get added by other drivers; a sketch follows the API list below.

Media Controller Entity Notify API:

  • media_device_register_entity_notify()
  • media_device_unregister_entity_notify()
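A minimal sketch of how the bridge driver might hook into entity creation is shown below; the media_entity_notify structure layout is assumed from the mainline API, and the callback body is purely illustrative.

#include <linux/device.h>
#include <media/media-device.h>
#include <media/media-entity.h>

static void demo_entity_notify(struct media_entity *entity, void *notify_data)
{
        struct media_device *mdev = notify_data;

        /* Inspect the new entity (e.g. one added by snd-usb-audio) and
         * create links from/to it as needed. */
        dev_dbg(mdev->dev, "new entity added: %s\n", entity->name);
}

static struct media_entity_notify demo_notify = {
        .notify = demo_entity_notify,
};

static void demo_register_notify(struct media_device *mdev)
{
        demo_notify.notify_data = mdev;
        media_device_register_entity_notify(mdev, &demo_notify);
}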

Drivers add entities for their functions, and the bridge driver (au0828) creates links between entities via the entity_notify API. Please note, the bridge driver takes ownership of creating links as it is the master driver that knows how resources are shared and the relationships between them.

HVR 950Q Media Graph

Win-TV HVR 950Q drivers with managed media device

This has given us a media graph that is common and shared by all drivers, but the next step is to provide a means to acquire and release shared resources. A new Media Controller enable source handler grants access to shared resources or returns -EBUSY if that isn't possible. Its counterpart, the disable source handler, marks the shared resource as free. Bridge drivers (e.g. au0828) implement and register enable/disable source handlers to manage access to shared resources on the device. Other drivers call the enable source handler from the media device to gain access to the resources and the disable source handler to release them.
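Below is a minimal sketch of how a non-bridge driver (the V4L2 or ALSA side) might call these handlers before and after streaming. It assumes enable_source/disable_source are optional members of struct media_device, as described above; the wrapper names are hypothetical.

#include <media/media-device.h>
#include <media/media-entity.h>

static int demo_acquire_source(struct media_device *mdev,
                               struct media_entity *entity,
                               struct media_pipeline *pipe)
{
        int ret = 0;

        if (mdev && mdev->enable_source)
                ret = mdev->enable_source(entity, pipe);

        /* -EBUSY here means another function (e.g. DVB) holds the tuner. */
        return ret;
}

static void demo_release_source(struct media_device *mdev,
                                struct media_entity *entity)
{
        if (mdev && mdev->disable_source)
                mdev->disable_source(entity);
}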

The snd-usb-audio and au0828 drivers have been changed to take advantage of the new Media Controller API. Now let's look at a few media graphs when various applications own the tuner resource.

Audio Application Controls the Tuner

When an audio application is running, an active pipeline (solid line) is established that starts at the XC5000 tuner node and passes through the decoder and audio mixer to provide access to the audio capture endpoint.

Audio Application Controls the Tuner

Audio Application Releases the Tuner

The pipeline from the XC5000 tuner node to the decoder is set as inactive. Please note: the links between the decoder, audio mixer, and audio capture node are always enabled.

Audio Application Releases the Tuner

Video Application is Running

When a video application is running, an active pipeline (solid line) is established that starts at the XC5000 tuner node and passes through the decoder to provide access to the video node. Please note, the video application opens the audio capture device while it is using the tuner. As a result, both video and audio are busy during video streaming, and there is no difference between the active pipelines shown in the video and audio graphs.

Video application is running

Video Application Releases the Tuner

Please note: when a video application is running, both audio and video are held by the video driver.

Video application releases the tuner

Digital Application is Running

The pipeline between the XC5000 tuner and DTV demodulator is set to active when a digital application is running.

Digital application is running

Digital Application Releases the Tuner

The active pipeline between XC5000 tuner and DTV Demodulator is set to inactive when the digital application releases the tuner.

Digital application releases the tuner

Video Application is Started while an Audio Application is Running

The video application will find the audio device busy and exit gracefully; the active pipeline between the tuner and the Audio capture device continues with no interruptions.

Video Application is Started while an Audio Application is Running

Digital Application is Started while an Audio Application is Running

The digital application will find the tuner is busy and exit gracefully with no interruption to the active audio pipeline.

Digital Application is Started while an Audio Application is Running

Audio Application is Started while a Digital Application is Running

The audio application will find the tuner busy and exit, leaving the active pipeline between the tuner and DVB demux active.

Audio Application is Started while a Digital Application is Running

Video Application is Started while a Digital Application is Running

The video application will find the tuner busy and exit leaving the active pipeline between the tuner and DVB Demux active.

Video Application is Started while a Digital Application is Running

Digital Application is Started while a Video Application is Running

The digital application will find the tuner is busy and exit gracefully with no interruption to the active video pipeline.

Digital Application is Started while a Video Application is Running

MC Next Gen Test Tool

Mauro Carvalho Chehab has developed a new mc_nextgen_test tool to generate the media graphs illustrated above.

Usage: mc_nextgen_test [OPTION...]
A testing tool for the MC next generation API

  -d, --device=DEVICE      media controller device (default: /dev/media0)
  -D, --dot                show in Graphviz format
  -e, --entities           show entities
  -i, --interfaces         show pads
  -I, --intf-links         show interface links
  -l, --data-links         show data links
  -?, --help               Give this help list
      --usage              Give a short usage message
  -V, --version            Print program version

Demystifying Media Devices

Media devices can be small, yet very complex. I hope this post helps you understand the complexity of these devices and how the Media Controller API helps solve the resource sharing problem across the media and non-media drivers that control the device.

Embedded Data Structures and Lifetime Management in the Linux Kernel

This post was originally published on Samsung Open Source Blog site on September 27, 2016

Embedded data structures are a common occurrence in Linux Kernel code. Use-after-free errors can easily creep in when they include multiple ref-counted objects with different lifetimes if the data structure is released prematurely. This article will explore some of the problems commonly encountered with lifetime management of embedded data structures when writing Kernel code, and it will cover some essential steps you can follow to prevent these issues from creeping into your own code.

What Makes Embedded Structure Lifetime so Complicated?

Let’s look at a few examples of embedded structures. Structure A embeds structure B, structure B embeds structure C, and structure C embeds structures D and E. There is no problem as long as all these structures have identical lifespans and the structure memory can be released all at once. If structure D has a different lifespan overall or in some scenarios, then when structure A goes away, structure D goes along with it; any subsequent accesses to structure D will result in use-after-free errors.

Now for a real example from the Linux Kernel code:

struct uvc_device {
        struct media_device mdev;
        ........
}

struct media_device {
        struct media_devnode devnode;
        .......
}

struct media_devnode {
        struct device dev;
        struct cdev cdev;
        .......
}

struct device {
       struct kobject kobj;
       ........
}

uvc device

Memory backing the struct uvc_device should not be released until the last user of struct device releases it. If uvc_device is released while struct device is still in use, any subsequent access to struct device will result in use-after-free errors.

Let’s look at some lifespan scenarios. The driver is unbound when no application is using /dev/media, and it then handles cleanup and releases the uvc_device resource and its embedded resources. There’s no problem in this case with the lifespans being identical. The same is applicable to when a device is physically removed or a driver module is removed via the rmmod command.

What happens when the driver is unbound or the device is removed when an application is actively using /dev/media and running syscalls and ioctls?

media_devnode->device is in use until the application closes the device file. media_devnode->cdev is in use until the application exits. The driver handles cleanup and releases the uvc_device resource and its embedded resources. Then, when the application closes /dev/media and exits after the driver is gone, it will run into use-after-free errors on the freed resources: media_devnode->device and media_devnode->cdev.

uvc device free scenarios

Please note, removing the driver module is not allowed because rmmod simply fails when the device (/dev/media) is in use.

How to Prevent Use-After-Free Errors in Your Code

WHAT IS THE ROOT CAUSE?

  • Resources with different lifespan requirements are embedded in the parent resource. Resources that are still in use get released prematurely.

WHAT’S THE FIX?

  • Dynamically allocate the resource(s) with different lifespans
  • Ensure the resource that embeds cdev doesn’t get released before cdev_del()
  • Dynamically allocate media_devnode to delink the media_devnode lifetime from media_device. This ensures that the media_devnode will not be freed when the driver does its cleanup and frees uvc_device along with media_device
  • Set the devnode struct device kobj as the cdev parent kobject, linking the media_devnode->device lifespan to the cdev lifespan (see the sketch after this list).
media_devnode->cdev.kobj.parent = &media_devnode->dev.kobj;
  • cdev_add() gets a reference to dev.kobj
  • cdev_del() releases the reference to dev.kobj ensuring that the devnode is not released as long as the application has the device file open.
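The sketch below illustrates the shape of the fix with hypothetical names (it is not the actual media_devnode code): the devnode is allocated on its own rather than embedded in a shorter-lived structure, and the cdev's parent kobject is pointed at the devnode's struct device so that cdev_add()/cdev_del() pin the devnode while the device file is open.

#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/fs.h>
#include <linux/slab.h>

struct demo_devnode {
        struct device dev;
        struct cdev cdev;
};

static void demo_devnode_release(struct device *dev)
{
        /* Runs only after the last reference to dev.kobj is dropped. */
        kfree(container_of(dev, struct demo_devnode, dev));
}

static int demo_devnode_register(dev_t devnum,
                                 const struct file_operations *fops,
                                 struct demo_devnode **out)
{
        struct demo_devnode *devnode;
        int ret;

        /* De-embedded: allocated on its own, not inside a parent structure
         * with a shorter lifetime. */
        devnode = kzalloc(sizeof(*devnode), GFP_KERNEL);
        if (!devnode)
                return -ENOMEM;

        device_initialize(&devnode->dev);
        devnode->dev.release = demo_devnode_release;

        cdev_init(&devnode->cdev, fops);
        /* Tie the cdev's lifetime to the devnode: cdev_add() takes a
         * reference on the parent kobject and cdev_del() drops it, so the
         * devnode cannot be freed while the device file is open. */
        devnode->cdev.kobj.parent = &devnode->dev.kobj;

        ret = cdev_add(&devnode->cdev, devnum, 1);
        if (ret) {
                put_device(&devnode->dev);      /* frees via the release callback */
                return ret;
        }

        *out = devnode;
        return 0;
}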

Dynamically allocated (de-embed) media_devnode

cdev gets a reference to dev.kobj

It’s crucial to proactively test for corner cases and write a test to reproduce these problems. Please find the test procedure to reproduce this problem and verify the fix in the Linux 4.8 release under tools/testing/selftests/media_tests.

It is always easier to avoid this design problem during development than to find and fix it later. Pay attention to resource lifespan requirements and don’t embed resources with different lifespan requirements. Finally, make sure to follow up with proactive tests of resource lifetimes.

How Can IOMMU Event Tracing Help You?

This post was originally published on Samsung Open Source Blog site on May 20, 2015

This is an article in a series on IOMMU Event Tracing:

Input/Output Memory Management Unit (IOMMU) event tracing can be extremely beneficial when debugging IOMMU hardware, BIOS, and firmware problems. In addition, IOMMU event tracing can be used to debug Linux kernel IOMMU driver, device assignment, and performance problems related to device assignment in virtualized and non-virtualized environments.

If you aren’t familiar with IOMMU Event Tracing, the first article in this series covered the fundamental concepts behind it, in addition to how tracing can be used to track information about devices as they are moved between guest and host environments. This article will focus on how to use IOMMU event tracing effectively, and will provide a few examples of IOMMU event tracing in action.

How to Enable IOMMU Event Tracing at Boot-Time

IOMMU trace events can be enabled using the Kernel boot option trace_event. The following enables all IOMMU trace events at boot-time:

trace_event=iommu

The following enables map and unmap events at boot-time:

trace_event=map,unmap

How to Enable IOMMU Event Tracing at Run-Time

A single IOMMU event or multiple events can be enabled at run-time. The following enables a single event:

cd /sys/kernel/debug/tracing/events
echo 1 > iommu/event_name_file

The following will enable all events:

for i in $(find /sys/kernel/debug/tracing/events/iommu/ -name enable); do echo 1 > $i; done 

Where are The Traces?

Traces can be found in the /sys/kernel/debug/tracing/trace file. The following shows the trace format. For more details on tracing, please refer to Documentation/trace; this directory contains the tracing documentation.


# tracer: nop
#
# entries-in-buffer/entries-written: 18/18   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |

Traces provide insight into the state of the CPU on which the task is running. The individual fields such as irqs-off, need-resched, preempt-depth, and delay help debug problems. For example, long runs of a task with need-resched set might indicate problems in code paths that could result in bad response times for other tasks running on the system. These problems could be solved by fixing the relevant code paths.

What do IOMMU Group Event Traces Look Like?

The following group event traces are from a test system with Intel VT-d support. These events show the device and group mapping.

# tracer: nop
#
# entries-in-buffer/entries-written: 18/18   #P:8
#
#                     _-----=> irqs-off
#                      / _----=> need-resched
#                     | / _---=> hardirq/softirq
#                     || / _--=> preempt-depth
#                     ||| /     delay
#    TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#       | |       |   ||||       |         |
swapper/0-1     [000] ....     1.899609: add_device_to_group: IOMMU: groupID=0 device=0000:00:00.0
swapper/0-1     [000] ....     1.899619: add_device_to_group: IOMMU: groupID=1 device=0000:00:01.0
swapper/0-1     [000] ....     1.899624: add_device_to_group: IOMMU: groupID=2 device=0000:00:02.0
swapper/0-1     [000] ....     1.899629: add_device_to_group: IOMMU: groupID=3 device=0000:00:03.0
swapper/0-1     [000] ....     1.899634: add_device_to_group: IOMMU: groupID=4 device=0000:00:14.0
swapper/0-1     [000] ....     1.899642: add_device_to_group: IOMMU: groupID=5 device=0000:00:16.0
swapper/0-1     [000] ....     1.899647: add_device_to_group: IOMMU: groupID=6 device=0000:00:1a.0
swapper/0-1     [000] ....     1.899651: add_device_to_group: IOMMU: groupID=7 device=0000:00:1b.0
swapper/0-1     [000] ....     1.899656: add_device_to_group: IOMMU: groupID=8 device=0000:00:1c.0
swapper/0-1     [000] ....     1.899661: add_device_to_group: IOMMU: groupID=9 device=0000:00:1c.2
swapper/0-1     [000] ....     1.899668: add_device_to_group: IOMMU: groupID=10 device=0000:00:1c.3
swapper/0-1     [000] ....     1.899674: add_device_to_group: IOMMU: groupID=11 device=0000:00:1d.0
swapper/0-1     [000] ....     1.899682: add_device_to_group: IOMMU: groupID=12 device=0000:00:1f.0
swapper/0-1     [000] ....     1.899687: add_device_to_group: IOMMU: groupID=12 device=0000:00:1f.2
swapper/0-1     [000] ....     1.899692: add_device_to_group: IOMMU: groupID=12 device=0000:00:1f.3
swapper/0-1     [000] ....     1.899696: add_device_to_group: IOMMU: groupID=13 device=0000:02:00.0
swapper/0-1     [000] ....     1.899701: add_device_to_group: IOMMU: groupID=14 device=0000:03:00.0
swapper/0-1     [000] ....     1.899704: add_device_to_group: IOMMU: groupID=10 device=0000:04:00.0

What Does lspci Show?

The following lspci output from the same system gives you information on each of the devices the IOMMU found and classified into groups.


00:00.0 Host bridge: Intel Corporation 4th Gen Core Processor DRAM Controller (rev 06)

00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)

00:02.0 VGA compatible controller: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller (rev 06)

00:03.0 Audio device: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor HD Audio Controller (rev 06)

00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)

00:16.0 Communication controller: Intel Corporation 8 Series/C220 Series Chipset Family MEI Controller #1 (rev 04)

00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)

00:1b.0 Audio device: Intel Corporation 8 Series/C220 Series Chipset High Definition Audio Controller (rev 05)

00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)

00:1c.2 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #3 (rev d5)

00:1c.3 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d5)

00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)

00:1f.0 ISA bridge: Intel Corporation H87 Express LPC Controller (rev 05)

00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)

00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)

02:00.0 Network controller: Intel Corporation Wireless 7260 (rev 73)

03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)

04:00.0 PCI bridge: ASMedia Technology Inc. ASM1083/1085 PCIe to PCI Bridge (rev 04)

The following images depict this lspci output:

PCI device topology diagrams

IOMMU Groups and Device Topology

While comparing the IOMMU traces with the lspci device topology and IOMMU topology, you will notice that some groups contain more than one device. For example, the ISA bridge, SATA controller, and SMBus are placed in group 12, and a PCI bridge and a PCIe root port are placed in group 10. You will also notice that the Network Controller (02:00.0) is in a separate group 13 even though it is also under the PCI bridge (00:1c.0) and PCI Root Port #1 device hierarchy, which is in group 8.

This is a good example of device isolation under a PCI bridge and PCI Root Port hierarchy. The PCI bridge and root port are placed in the same group, whereas the network controller is in its own group and can be assigned to a VM. In the case of the PCI bridge (00:1c.3), PCI root port #3, and PCI bridge (04:00.0) hierarchy, all of them are placed in group 10. Devices that have dependencies on each other are usually placed in a group together so they can be isolated as a group. The following shows the IOMMU topology derived from the trace events generated as devices get added to individual groups during boot-time.

IOMMU Topology

What do IOMMU Device Event Traces Look Like?

# tracer: nop
#
# entries-in-buffer/entries-written: 5689868/5689868   #P:8
#
#                        _-----=> irqs-off
#                         / _----=> need-resched
#                        | / _---=> hardirq/softirq
#                        || / _--=> preempt-depth
#                        ||| /     delay
#       TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#          | |       |   ||||       |         |
    qemu-kvm-28546 [003] ....  1804.692631: attach_device_to_domain: IOMMU: device=0000:00:1c.0
    qemu-kvm-28546 [003] ....  1804.692635: attach_device_to_domain: IOMMU: device=0000:00:1c.4
    qemu-kvm-28546 [003] ....  1804.692643: attach_device_to_domain: IOMMU: device=0000:05:00.0
    qemu-kvm-28546 [003] ....  1804.692666: detach_device_from_domain: IOMMU: device=0000:00:1c.0
    qemu-kvm-28546 [003] ....  1804.692671: detach_device_from_domain: IOMMU: device=0000:00:1c.4
    qemu-kvm-28546 [003] ....  1804.692676: detach_device_from_domain: IOMMU: device=0000:05:00.0

What do IOMMU map/unmap Event Traces Look Like?

# tracer: nop
#
# entries-in-buffer/entries-written: 54/54   #P:8
#
#                    _-----=> irqs-off
#                     / _----=> need-resched
#                    | / _---=> hardirq/softirq
#                    || / _--=> preempt-depth
#                    ||| /     delay
#   TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#      | |       |   ||||       |         |
qemu-kvm-28546 [002] ....  1804.480679: map: IOMMU: iova=0x00000000000a0000 paddr=0x00000000446a0000 size=4096 
qemu-kvm-28547 [006] ....  1809.032767: unmap: IOMMU: iova=0x00000000000c1000  size=4096 unmapped_size=4096

Great, We Have Traces! Using Traces to Solve Problems

Using traces, we can get insight into the IOMMU device topology to see which devices belong to which groups, as well as run-time device assignment changes as devices move from the host to guests and back to the host. In turn, this makes it much easier to debug IOMMU problems and device assignment problems, to detect and solve performance problems, and to debug BIOS and firmware problems related to the IOMMU hardware and firmware implementation.

VFIO Based Device Assignment Use-Case

Alex Williamson, a VFIO maintainer, enabled IOMMU traces for vfio-based device assignment and found the following VFIO problems:

  • A large number of unmap calls were being made on a VT-d system without IOMMU super-page support. This is because the VFIO unmap path is not optimized on a VT-d system without IOMMU super-page support; as a result, each single page is unmapped individually since the current unmap path optimization relies on IOMMU super-page support.
  • Unnecessary single page mappings for invalid and reserved memory regions, like mappings of MMIO BARs.
  • Too many instances of very long tasks being run with needs-resched set.

RESULT 1: VFIO PATCH SERIES TO FIX PROBLEMS!

The problems above were fixed, resulting in a reduction in the number of unmap calls to 2% of the original on Intel VT-d without IOMMU super-page support. Before the fix, traces showed 472,574 maps and 5,217,244 unmaps; unmaps were more than 10 times the number of maps! After the fix, traces showed 9509 maps and 9509 unmaps, an extremely significant reduction. Additionally, long task runs with needs-resched set became more sporadic.

RESULT 2: IMPROVEMENTS TO THE IOMMU TRACING FEATURE

Alex also found a few bugs and improvements that could be made to the IOMMU tracing API. I would like to acknowledge Alex for using IOMMU tracing for VFIO based device assignments and for his feedback on improving the IOMMU Event Tracing API. The following fixes and improvements to the IOMMU tracing API went into Linux 4.0:

  • trace_iommu_map() should report original iova and size the map routine is called with as opposed to the iova and size at the end of mapping.
  • trace_iommu_unmap() should report original iova, size, and unmapped size.
  • The size field was handled as an int and could overflow

Does Run-Time Tracing Add Overhead?

If you are wondering what kind of overhead the IOMMU tracing code introduces: tracepoint code is included in the run-time code path only when the tracepoint is enabled. In other words, tracepoint code is inactive unless it is enabled. When it is enabled, the code is modified to include the tracepoint code. It doesn’t add any conditional logic overhead to determine whether or not to generate a trace message; the tracepoints use jump labels, which are a code modification of a branch.

When it is disabled, the code path looks like:


[ code ]
nop
back:
[ code ]
return;
tracepoint:
[ tracepoint code ]
jmp back;

When it is enabled, the code path looks like: (notice how the tracepoint code appears in the code path below):


[ code ]
jmp tracepoint
back:
[ code ]
return;
tracepoint:
[ tracepoint code ]
jmp back;

Future Enhancements & Closing Thoughts

At the moment, IO Page Faults are the only type of autonomous errors that are traced. In the future, additional errors and faults could be traced, but this would largely depend on the support for such reports funneling into the IOMMU drivers from the IOMMU hardware and firmware layers. One area that has been considered is the addition of error reporting and tracing to ARM IOMMU drivers if the underlying IOMMU hardware and firmware supports autonomous error reporting to the kernel.

I hope this article will help Linux kernel developers and users learn about a feature that can aid during development, maintenance, and support activities on systems with IOMMU hardware support. After reading this article, Linux kernel developers and users will know how to enable IOMMU event tracing to get reports on IOMMU boot-time and run-time events and errors. Please see the following references to learn more about the use of IOMMUs in virtualized Linux environments and the Linux VFIO PCI device assignment feature.

REFERENCES

  • Utilizing IOMMUs for Virtualization in Linux and Xen, Multiple Authors
  • VFIO PCI Device assignment breaks free of KVM, Alex Williamson, Red Hat

What Is IOMMU Event Tracing?

This post was originally published on Samsung Open Source Blog site on April 29, 2015

The IOMMU event tracing feature enables reporting of IOMMU events in the Linux kernel as they happen during boot-time and run-time. IOMMU event tracing provides insight into the IOMMU device topology in the Linux kernel. This information helps you understand which IOMMU group a device belongs to, as well as run-time device assignment changes as devices are moved from hosts to guests and back by the kernel. The Linux kernel moves devices from host to guest when a user requests such a change.

In addition, IOMMU event tracing helps debug BIOS and firmware problems related to the IOMMU hardware and firmware implementation, IOMMU drivers, and device assignment. For example, tracing occurs when a device is detached from the host and assigned to a virtual machine, or when the device gets moved from the host domain to the VM domain, which allows each of these steps to be debugged. The primary purpose of IOMMU event tracing is to help detect and solve performance issues.

Enabling IOMMU event tracing will provide useful information about devices that are using the IOMMU, as well as changes that occur in device assignments. In this article, I’ll discuss the IOMMU event tracing feature and the various classes of IOMMU events. In part two of this series, I’ll discuss how to enable and use it to trace events during boot-time and run-time, and how to use the IOMMU tracing feature to get insight into what’s happening in virtualized environments as devices get assigned from hosts to virtual machines and vice versa. This feature helps debug IOMMU problems during development, maintenance, and support.

What is an IOMMU?

IOMMU is short for I/O Memory Management Unit. IOMMUs are hardware that translate device (I/O) addresses to the physical (machine) address space. An IOMMU can be viewed as an MMU for devices: the MMU maps virtual addresses into physical addresses, and similarly, the IOMMU maps device addresses into physical addresses. The following picture shows a comparative depiction of IOMMU vs. MMU.

A Comparison of IOMMU and MMU Address Mapping

In addition to basic mapping, the IOMMU provides device isolation via access permissions. Mapping requests are allowed or disallowed based on whether or not the device has proper permissions to access a certain memory region. Another key feature the IOMMU brings to the table is I/O virtualization, which provides DMA remapping hardware that adds support for the isolation of device accesses to memory, as well as translation functionality. In other words, devices present I/O addresses to the IOMMU, which translates them into machine addresses, thereby bridging the gap between device addressing capability and the system memory range.

IOMMU enables device access controls

What Does an IOMMU Do?

IOMMU hardware provides several key features that enhance I/O performance on a system.

  • On systems that support IOMMU, one single contiguous virtual memory region can be mapped to multiple non-contiguous physical memory regions. IOMMU can make a non-contiguous memory region appear contiguous to a device (scatter/gather).
  • Scatter/gather optimizes streaming DMA performance for the I/O device.
  • Memory isolation and protection allows a device to access only the memory regions that are mapped for it. As a result, faulty and/or malicious devices can’t corrupt system memory.
  • Memory isolation allows safe device assignment to virtual machines without compromising host and other guest operating systems. Similar to the faulty and/or malicious device case, devices are given access to memory regions which are mapped specifically for them. As a result, devices assigned to virtual machines will not have access to the host or another virtual machine’s memory regions.
  • IOMMU helps address discrepancies between I/O device and system memory addressing capabilities. For example, IOMMU enables 32-bit DMA-capable non-DAC devices to access memory regions above 4GB.
  • IOMMU supports hardware interrupt remapping. This feature expands limited hardware interrupts to extendable software interrupts, thereby increasing the number of interrupts that can be supported. Primary uses of interrupt remapping are interrupt isolation, and the ability to translate between interrupt domains. e.g: ioapic vs. x2apic on x86.

As we all know, there is no free lunch! IOMMU introduces latency due to translation overhead in the dynamic DMA mapping path. However, most servers support I/O Translation Lookaside Buffer (IOTLB) hardware to reduce the translation overhead.

IOMMU Groups and Device Isolation

Devices are isolated in IOMMU groups. Each group contains a single device or a group of devices, but single device isolation is not always possible for a variety of reasons. Devices behind a bridge can communicate without reaching the IOMMU, via peer-to-peer communication channels. Unless the I/O hardware/firmware provides a way to disable peer-to-peer communication, the IOMMU can’t ensure single device isolation and will be forced to place all the devices behind a bridge in a single IOMMU group for isolation.

Multi-function cards don’t always support the PCI access control services required to describe isolation between functions. In such cases, all functions on a multi-function card are placed in a single IOMMU group. Device(s) in a group can’t be separated for assignment and all devices in that group must be assigned together, even when a virtual machine only needs one of them. For example, IOMMU might be forced to group all 4-ports on a multi-port card because device isolation at port granularity isn’t possible on all hardware.

Network hardware with device isolation at port level is capable of separating ports for specific IOMMU groups.

Without port isolation, the network hardware must assign all ports to a single group.

IOMMU Domains and Protection

IOMMU domains are intended to provide protection against a virtual machine corrupting another virtual machine’s memory. Devices get moved from one domain to another as they get moved between VMs or from a host to a VM. Any device in a domain is given access to the memory regions mapped for the specific domain it belongs to. When a device is assigned to a VM, it is first detached from the host and removed from the host domain, moved to the VM domain, and attached to the VM as shown below:

Step 1: A guest OS needs access to hardware that’s currently mapped to the host. Step 2: The hardware must first be detached from the host and transferred from the host domain to the VM domain. Step 3: Once the hardware has been moved to the VM domain, it can be attached to the guest OS.

A Brief Overview of IOMMU Boot and Run-Time Operations

The IOMMU driver creates IOMMU groups and domains during initialization. Devices are placed in IOMMU groups based on their device isolation capabilities. iommu_group_add_device() is called when a device is added to a group, and iommu_group_remove_device() is called when a device is removed from a group.

All devices are initially attached to the host; when a user requests that a device be assigned to a VM, the device gets detached from the host and then attached to the VM. iommu_attach_device() is called to attach a device and iommu_detach_device() is called to detach it. The iommu_map() and iommu_unmap() interfaces create and delete mappings between the device address space and the system address space.

A series of device additions occur during boot. During run-time, after a device is attached, a series of device maps and unmaps occur until the device is detached.

IOMMU event tracing provides insight into what is occurring during all of these processes.

The ability to have visibility into device additions, deletions, attaches, detaches, maps, and unmaps is valuable in debugging IOMMU problems. As you can see below, this is exactly what IOMMU events are designed to do. In fact, the idea for this tracing work was a result of debugging several IOMMU problems without having a good insight into what’s happening. Let’s take a look at the trace events.

IOMMU Trace Event Classes

IOMMU events are classified into group, device, map and unmap, and error classes to trace activity in each of these areas. Group class events are generated whenever a device gets added to or removed from a group. Device class events are intended for tracing device attach and detach activity. Map and unmap events trace map/unmap activity. Finally, in addition to these normal path events, error class events are for tracing autonomous IOMMU faults that might occur during boot-time and/or run-time.

IOMMU Trace Event Classes

IOMMU Group Class Events

IOMMU group class events are triggered during boot. Traces are generated when devices get added to or removed from an IOMMU group. These traces provide insight into IOMMU device topology and how the devices are grouped for isolation.

  • Add device to a group – Triggered when the IOMMU driver adds a device to a group. Format: IOMMU: groupID=%d device=%s
  • Remove device from a group – Triggered when the IOMMU driver removes a device from a group. Format: IOMMU: groupID=%d device=%s

IOMMU Device Class Events

Events in this group are triggered during run-time, whenever devices are attached to and detached from domains, for example, when a device is detached from the host and attached to a guest. This information provides insight into device assignment changes during run-time.

  • Attach (add) device to a domain – Triggered when a device gets attached (added) to a domain. Format: IOMMU: device=%s
  • Detach (remove) device from a domain – Triggered when a device gets detached (removed) from a domain. Format: IOMMU: device=%s

IOMMU Map and Unmap Events

Events in this group are triggered during run-time whenever device drivers make IOMMU map and unmap requests. This information provides insight into map and unmap requests and helps debug performance and other problems.

  • IOMMU map event – Triggered when IOMMU driver services a map request. Format: IOMMU: iova=0x%016llx paddr=0x%016llx size=%zu
  • IOMMU unmap event – Triggered when IOMMU driver services an unmap request. Format: IOMMU: iova=0x%016llx size=%zu unmapped_size=%zu

IOMMU Error Class Events

Events in this group are triggered during run-time when an IOMMU fault occurs. This information provides insight into IOMMU faults and is useful for logging the fault and taking measures to restart the faulting device. The information in the flags field is especially useful in debugging BIOS and firmware problems related to IOMMU hardware and firmware implementation, as well as problems resulting from incompatibilities between the OS, BIOS, and firmware in spec compliance.

  • IO Page Fault (AMD-Vi): Triggered when an IO Page fault occurs on an AMD-Vi system. Format: IOMMU:%s %s iova=0x%016llx flags=0x%04x

Error class events are implemented in the common IOMMU driver code, as well as in the Intel and ARM IOMMU drivers.
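
Part two of this series covers enabling and using these events in detail, but as a quick preview, the following is a minimal sketch of turning on the event classes described above through the tracing interface. The paths and the per-subsystem enable file are assumptions based on the mainline IOMMU trace events and the usual debugfs tracing location; they may differ by kernel version and tracefs mount point. Run as root:

ls /sys/kernel/debug/tracing/events/iommu/             # list the available IOMMU trace events
echo 1 > /sys/kernel/debug/tracing/events/iommu/enable # enable all IOMMU event classes
cat /sys/kernel/debug/tracing/trace_pipe               # watch events as they are generated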

How Can IOMMU Event Tracing Help You?

This article is part one of a two-part series on IOMMU event tracing. This introduction will help set the knowledge foundation for the second article, which will cover how to use this feature to benefit you the most. Stay tuned to this blog to learn more about IOMMU event tracing!

REFERENCES

How to Install Ubuntu and Run a Mainline Kernel on an ODROID-XU4

This post was originally published on Samsung Open Source Blog site on June 9, 2016

I recently installed Ubuntu 15.10 on Odroid-XU4 and set out to run the upstream kernel on it. After several trials and errors and being forced to reference various forums, I now have Odroid running the Linux 4.6 Kernel. In this article, I will share how to quickly get from unboxing to running the latest kernel with a short detour to upgrade to the Ubuntu 16.04 release.

Without further ado, let’s get started. First of all, download the Ubuntu 15.10 image. You can find the release notes and self installing image here:

Prepare the microSD Card

Once you’ve downloaded the image from the 2nd link above, follow these steps to create a bootable microSD card with the image; I used a 32 GB Samsung microSD card. Insert the microSD card, in its SD card adapter, into the SD card slot on your host PC or laptop. Please note that the SD card will likely be auto-mounted. Check for the device files and unmount all partitions as needed. If this is a brand new SD card, you don’t have to worry about this step.

Uncompress the image:

$ unxz ubuntu-15.10-mate-odroid-xu3-20160114.img.xz

Prepare microSD – Insert SD Card and look for auto-mounted partitions:

$ df -h
...
/dev/mmcblk0p2 30G 8.3G 21G 29% /media/shuah/rootfs
/dev/mmcblk0p1 128M 20M 108M 16% /media/shuah/boot

Check device files

$ ls /dev/mmcblk*
/dev/mmcblk0
/dev/mmcblk0p1
/dev/mmcblk0p2

First unmount the auto-mounted partitions:

$ umount /dev/mmcblk0p1
$ umount /dev/mmcblk0p2

Erase microSD card (the following writes ~8192 4M blocks to erase the entire disk):

$ dd if=/dev/zero of=/dev/mmcblk0 count=8192 bs=4M

Copy self installing image to microSD Card:

$ dd if=ubuntu-15.10-mate-odroid-xu3-20160114.img of=/dev/mmcblk0

Get Your ODROID-XU4 Up and Running

Now you are ready to use the microSD to boot the self install image. Insert the microSD card in the microSD slot in the Odroid. Make sure Boot Select Switch is in the correct position to enable microSD as the boot device.

Hopefully you have the USB-UART Module Kit for the Odroid; if you don’t, please get one! It is a must for kernel development. Please refer to the Odroid wiki for instructions on how to connect to the Odroid serial console port. An alternative option is to simply connect the device’s HDMI to a monitor and hook up a USB keyboard and mouse.

It is time to give your Odroid boot a try. Power up the Odroid and you should see the ALIVE/Blue Status LED start out solid while in u-boot and quickly go into flashing mode when the kernel starts running. At this point, sit back and relax; the Ubuntu self-install image will boot and complete the installation. If all goes well, the ALIVE/Blue Status LED continues to flash, you will see boot messages on the console, and the Odroid login appears on the monitor. During the install, a boot partition will be created, with a rootfs placed on the remaining space. There is ample room in the boot partition for a few kernel images. You will notice that the boot partition is auto-mounted at /media/boot.

The Ubuntu 15.10 image comes complete with all the development tools except for the live-boot package; we will talk about this later. The login is odroid and the password is odroid. Start the Software Updater and pull in any updates. You will have the opportunity to upgrade to 16.04. If you do upgrade, please keep in mind that you will have to bring in a new video driver to get the GUI working on 16.04. Please refer to GUI after upgrade to Ubuntu 16.04 on XU4. The following is a distilled list of commands that worked for me.

$ sudo apt-add-repository ppa:canonical-x/x-staging
$ sudo apt-get update
$ sudo apt-get purge xserver-xorg-video-armsoc*
$ sudo apt-get install xserver-xorg-video-armsoc-exynos

Once the above step is done, power down completely (don’t choose restart). Please note that when the HDMI is connected, the HDMI return power supplies enough power for the Odroid’s PWR Status LED to stay on. Please disconnect HDMI to ensure the Odroid is completely powered down before powering it back up. Please don’t insert and/or remove the microSD card while the PWR Status LED is on. Once the PWR Status LED is off, it is safe to power on the Odroid. At this point, the GUI should be working. Please note: If you have problems with fonts not rendering properly, click “System” in the top left corner of the screen and go to Preferences >> Appearance; then, select the Fonts tab and choose “Subpixel smoothing” in the Rendering section.

Build the Mainline Kernel

Now we are ready to build, install, and boot mainline kernels. The latest linux-next is a good repo to use. I will describe what I needed to do to build, install, and run a kernel based on this repo. I would recommend doing a native build on the Odroid as it is fast enough and worked well for me. I built kernels with and without CONFIG_EXYNOS_IOMMU enabled. Contiguous Memory Allocation (CMA) is the default mode for memory allocation. When IOMMU is enabled, non-contiguous memory is allocated using the Exynos IOMMU hardware. Both flavors worked for me.

Clone the linux-next git repo (do this on the XU4 if you want to do a native kernel build on the Odroid):

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git linux_odroid

Move into the new linux_odroid directory, generate the config, prepare modules, and build:

Please note: If you would like to enable IOMMU, apply the following patch or edit arch/arm/configs/exynos_defconfig to add CONFIG_EXYNOS_IOMMU=y, and then run make exynos_defconfig:

diff --git a/arch/arm/configs/exynos_defconfig b/arch/arm/configs/exynos_defconfig
index daf9762..76127dc 100644
--- a/arch/arm/configs/exynos_defconfig
+++ b/arch/arm/configs/exynos_defconfig
@@ -218,6 +218,7 @@ CONFIG_CROS_EC_CHARDEV=y
 CONFIG_COMMON_CLK_MAX77686=y
 CONFIG_COMMON_CLK_MAX77802=y
 CONFIG_COMMON_CLK_S2MPS11=y
+CONFIG_EXYNOS_IOMMU=y
 CONFIG_EXTCON=y
 CONFIG_EXTCON_MAX14577=y
 CONFIG_EXTCON_MAX77693=y
$ cd linux_odroid
$ make exynos_defconfig
$ make prepare modules_prepare
$ make -j6 zImage modules dtbs

Copy .dtb to /media/boot – don’t overwrite the old one, if one exists:

$ sudo cp arch/arm/boot/dts/exynos5422-odroidxu4.dtb /media/boot/exynos5422-odroidxu4_next.dtb

Copy zImage to /media/boot – don’t overwrite the old one:

$ sudo cp arch/arm/boot/zImage /media/boot/zImage_next

Install modules – will be installed under /lib/modules:

$ sudo make modules_install

Install live-boot:

Without live-boot, initramfs will not have live support and the kernel won’t boot. You only have to do this once, when the first kernel is built.

$ sudo apt-get install live-boot

Install the Newly Built Kernel

Now you are ready to install the new kernel. First, backup everything in /media/boot to a back-up directory, just in case you run into problems. All of the remaining steps need to be run with super user privileges. Copy the current .config to /boot, and then run update-initramfs.

Note: you should run cat include/config/kernel.release separately and copy/paste the results where it is used in the next commands.

Update Initramfs:

$ cp .config /boot/config-`cat include/config/kernel.release`
$ update-initramfs -c -k `cat include/config/kernel.release`
$ mkimage -A arm -O linux -T ramdisk -C none -a 0 -e 0 -n uInitrd -d /boot/initrd.img-`cat include/config/kernel.release` /boot/uInitrd-`cat include/config/kernel.release`

Copy new uInitrd to /media/boot:

$ cp /boot/uInitrd-`cat include/config/kernel.release` /media/boot/

Now comes the important step to change the boot.ini to boot the new kernel. Edit /media/boot/boot.ini and comment out the following line:

#setenv bootcmd "fatload mmc 0:1 0x40008000 zImage; fatload mmc 0:1 0x42000000 uInitrd; fatload mmc 0:1 0x44000000 exynos5422-odroidxu3.dtb; bootz 0x40008000 0x42000000 0x44000000"

Copy and paste the commented out bootcmd line after the line you just commented out. Now edit the new bootcmd line to point to the new zImage, uInitrd, and dtb files. Don’t forget to uncomment the newly added bootcmd line after making changes.

setenv bootcmd "fatload mmc 0:1 0x40008000 zImage_next; fatload mmc 0:1 0x42000000 uInitrd-4.7.0-rc2-next-20160608; fatload mmc 0:1 0x44000000 exynos5422-odroidxu4_next.dtb; bootz 0x40008000 0x42000000 0x44000000"

We are now at the final step. Run sync and then power the device down and power back up. Please remember to disconnect HDMI so it doesn’t continue to supply power.

$ sync; poweroff

You should have the newly built zImage running when Odroid boots up. uname -a should show that the Linux odroid 4.7.0-rc2-next-20160608 is running.
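
For example, on my setup the check looks roughly like the following; the build number, date, and architecture fields will of course differ on your system, so treat this output as illustrative only:

$ uname -a
Linux odroid 4.7.0-rc2-next-20160608 #1 SMP PREEMPT armv7l GNU/Linux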

Please watch out for the following in the dmesg:

FAT-fs (mmcblk1p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.

Adding new kernel files (zImage, uInitrd, and exynos5422-odroidxu4.dtb) and changing boot.ini works just fine on the boot partition /dev/mmcblk1p1. If you remove files, however, please make sure to run fsck on this device. Run the following as root; if you don’t, the system won’t boot and will hang, unable to load the uInitrd.

umount /dev/mmcblk1p1
fsck /dev/mmcblk1p1
mount /dev/mmcblk1p1 /media/boot

There are a few issues I’m still working to solve. Chromium included in the Ubuntu 16.04 release doesn’t work and needs to be downgraded to rev 48.0. After the downgrade, Chromium starts, but dies right away. This is a work in progress at the moment. Otherwise, configuring and getting mainline kernel booting on Odroid-xu4 has been fun, and I hope you find this guide to be valuable.

Odroid XU4 – Light Display Manager Doesn’t Start on Linux Kernel 4.9-rc1

This post was originally published on Samsung Open Source Blog site on October 21, 2016

The Light Display Manager doesn’t start on the Odroid XU4 on recent mainline kernels with exynos_defconfig. I first noticed this problem during the Linux 4.8 rc testing, and it persists in 4.9-rc1. I want to share the root cause and a workaround in this post.

I’m running kernel 4.9.0-rc1 with exynos_defconfig on Ubuntu 16.04 with HDMI. Light Display Manager (lightdm) fails with the following errors.

Starting Light Display Manager... 
[  OK  ] Started Light Display Manager.
[   15.538629] [drm:exynos_drm_framebuffer_init] *ERROR* Non-contiguous GEM mem.
[   15.546149] [drm:exynos_drm_framebuffer_init] *ERROR* Non-contiguous GEM mem.
[  OK  ] Stopped Light Display Manager.

This block repeats a few times until systemd gives up on starting lightdm. The system is operational with a functioning serial console and networking; however, the display doesn’t work.

What Causes this problem?

The following sequence of events leads to this problem:

  1. The user space calls exynos_drm_gem_create_ioctl() with the EXYNOS_BO_NONCONTIG request to allocate GEM buffers.
  2. exynos_drm_gem_create() creates non-contiguous GEM buffers as requested.
  3. exynos_user_fb_create() comes along later and validates the GEM buffers to associate them with a frame-buffer. The validation in check_fb_gem_memory_type() detects non-contiguous buffers without IOMMU. Since non-contiguous frame buffers can only be supported when IOMMU is enabled, exynos_drm_framebuffer_init() fails.
  4. At this point, there is no recovery and lightdm fails.

After digging into the user space angle on the problem, it turns out that xf86-video-armsoc/src/drmmode_exynos/drmmode_exynos.c assumes contiguous allocations are not supported in some Exynos DRM versions. This change was introduced in the following commit:
-       if (create_gem->buf_type == ARMSOC_BO_SCANOUT)
-               create_exynos.flags = EXYNOS_BO_CONTIG;
-       else
-               create_exynos.flags = EXYNOS_BO_NONCONTIG;
+
+       /* Contiguous allocations are not supported in some exynos drm versions.
+        * When they are supported all allocations are effectively contiguous
+        * anyway, so for simplicity we always request non contiguous buffers.
+        */
+       create_exynos.flags = EXYNOS_BO_NONCONTIG;

There might have been logic in exynos_drm that forced Contiguous GEM buffers. At least, that is what this comment suggests. This assumption doesn’t appear to be a good one and I’m not sure if this change was made to fix a bug. After IOMMU support was added, this assumption is no longer true. Hence, the recent mainline kernels have a mismatch with the installed xserver-xorg-video-armsoc 1.4.0-0ubuntu2 armhf X.Org X server on my Odroid XU4.

Enabling CONFIG_EXYNOS_IOMMU solves the problem in my case. However, enabling CONFIG_EXYNOS_IOMMU in exynos_defconfig might break non-IOMMU Exynos platforms.
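
If you want to check whether your running kernel has the option enabled, a quick sketch looks like the following; it assumes the build .config was copied to /boot as described in my ODROID-XU4 install post, or that CONFIG_IKCONFIG_PROC is enabled for the /proc/config.gz variant. On a kernel built with the option you would see:

$ grep EXYNOS_IOMMU /boot/config-$(uname -r)   # config copied to /boot during kernel install
CONFIG_EXYNOS_IOMMU=y
$ zcat /proc/config.gz | grep EXYNOS_IOMMU     # alternative, if CONFIG_IKCONFIG_PROC is enabled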

What’s the solution?

There are a couple of possible solutions for this problem that worked for me.

  • Fix xf86-video-armsoc to ask for EXYNOS_BO_CONTIG for ARMSOC_BO_SCANOUT and EXYNOS_BO_NONCONTIG in all other cases, as shown in the diff below. With this change, the display manager now starts. However, it turns out xf86-video-armsoc has been obsoleted in favor of xf86-video-modesetting; the last update to the xf86-video-armsoc git repository was 3 years ago, and the project appears to be inactive.
diff --git a/src/drmmode_exynos/drmmode_exynos.c b/src/drmmode_exynos/drmmode_exynos.c
index 91723df..45b2edd 100644
--- a/src/drmmode_exynos/drmmode_exynos.c
+++ b/src/drmmode_exynos/drmmode_exynos.c
@@ -126,11 +126,10 @@ static int create_custom_gem(int fd, struct armsoc_create_gem *create_gem)
        assert((create_gem->buf_type == ARMSOC_BO_SCANOUT) ||
                        (create_gem->buf_type == ARMSOC_BO_NON_SCANOUT));

-       /* Contiguous allocations are not supported in some exynos drm versions.
-        * When they are supported all allocations are effectively contiguous
-        * anyway, so for simplicity we always request non contiguous buffers.
-        */
-       create_exynos.flags = EXYNOS_BO_NONCONTIG;
+       if (create_gem->buf_type == ARMSOC_BO_SCANOUT)
+               create_exynos.flags = EXYNOS_BO_CONTIG;
+       else
+               create_exynos.flags = EXYNOS_BO_NONCONTIG;

        ret = drmIoctl(fd, DRM_IOCTL_EXYNOS_GEM_CREATE, &create_exynos);
        if (ret)
  • I settled on using xf86-video-modesetting instead of xf86-video-armsoc as a solution. I removed xf86-video-armsoc from my system, then brought in the latest xf86-video-modesetting, compiled it, and installed it. xf86-video-modesetting uses the dumb_create interface instead of DRM_IOCTL_EXYNOS_GEM_CREATE, and hence doesn’t suffer from the CONTIG vs. NONCONTIG problem. Exposing CONTIG and NONCONTIG to userspace appears to cause problems when the exynos drm driver determines it can’t support non-contiguous GEM buffers during frame-buffer initialization, after the userspace has already allocated them. I am working on finding a solution for this problem.

The Linux Kernel Has Bugs! Really!

This post was originally published on Samsung Open Source Blog site on March 11, 2016

A Guide to Finding and Fixing Linux Kernel Bugs

“Our new Constitution is now established, and has an appearance that promises permanency; but in this world nothing can be said to be certain, except death and taxes.” said Benjamin Franklin, in a letter to Jean-Baptiste Leroy, 1789. If he were to be around today, he might have added software bugs to the list of unavoidable things.

Software and bugs go together, and the Linux Kernel is no exception. Some bugs are easy to find, but some are harder to reproduce and could require several attempts to piece together the right set of conditions to trigger them. While bugs that result in a system crash or hang are easier to spot, it is often more challenging to gather the information necessary to debug and fix them. In many cases, Kernel logs can provide insight into these bugs, but when a system crashes or freezes, Kernel logs might not get the chance to be written out to disk.

Race and timing bugs are elusive and can also be hard to debug because traditional methods like debug logging can change the timing just enough to avoid triggering the bug.

Incorrect or unexpected feature behavior problems are often easier to debug and trace back to the offending module or sub-system. There are some pesky bugs like the 4.4-rc1 VPN bugs, for which there are too many suspects, making it hard to isolate the offending module. In this bug, a VPN will connect successfully, but subsequent web/ssh accesses fail on that connection. I spent several hours chasing the obvious suspects including routers, switches, and network connections, before I remembered that the new element in my environment was the bleeding edge Kernel. The Kernel dmesg gave me a clue to search and find the patch that fixed the problem.

This article is for anyone who is interested in modifying the Linux Kernel, and it will cover a handful of strategies for addressing each of these types of bugs. Let’s get started!

Debugging Resources – What are they?

There is a lot of information that is available on a system during run-time that can aid in debugging.

Kernel Logging – dmesg, kern.log, and syslog provide useful information and aid in debugging. dmesg log levels can be customized to enable additional debug messages.

Check System State – Tools such as lspci and lsusb are useful for determining the current system state and the state of various devices. For example, “cat /proc/meminfo” gives information about memory usage and how much RAM is in use on the system.
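
As a minimal illustration, here are a few commands I commonly reach for first when checking system state:

$ lspci -v               # PCI devices and the drivers bound to them
$ lsusb -t               # USB device topology
$ head -5 /proc/meminfo  # total and available memory
$ dmesg | tail -20       # most recent kernel messages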

The following picture shows a handy list of Linux performance tools that can be used to observe the system at run-time and diagnose performance and other problems.

Linux Observability Tools

Kernel Errors for Everyone!

Oops Messages and System Hangs

Oops messages are printed to the console during system crashes. The Kernel might not get a chance to save these to the disk depending on how serious the problem is. In these cases it is helpful to redirect console messages to the serial port on another system. This method is useful for debugging early boot problems and panics.

System hangs are hard to debug because the system becomes unresponsive, making it impossible to run any tools that provide insight into the system state. There is a Magic SysRq key, or key sequence, to which the Kernel will respond in any state, unless it is completely locked up. On my laptop, the SysRq key and Print Screen key are shared. Please refer to sysrq.txt under the Documentation directory in the Linux Kernel source code for more information on how to use this feature and to learn which Kernel config options need to be enabled to turn it on. Please note that it is a good idea for Kernel developers to enable the SysRq config options in the Kernel just in case the system runs into a hang.
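
As a minimal sketch, assuming CONFIG_MAGIC_SYSRQ is enabled, you can check and exercise SysRq from a root shell like this:

cat /proc/sys/kernel/sysrq        # show the current SysRq enable mask
echo 1 > /proc/sys/kernel/sysrq   # enable all SysRq functions
echo t > /proc/sysrq-trigger      # dump current task state to the kernel log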

There are a few debug tools that can aid in debugging problems with system hangs and oops messages, including GDB, which is useful in investigating Kernel addresses in Oops messages and figuring out which Kernel module and line of code is the culprit.
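
For example, given an Oops line such as "PC is at some_driver_func+0x58/0x1c0" (the function name and offsets here are purely hypothetical), GDB can map the offset back to a source line, assuming the kernel was built with CONFIG_DEBUG_INFO:

$ gdb vmlinux
(gdb) list *(some_driver_func+0x58)   # shows the source file and line at that offset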

The tools and methods we have discussed so far are passive and non-intrusive. In some cases, it is necessary to use intrusive and pervasive approaches to gather diagnostic information to help debug problems that are hard to re-create. In such cases, KDB and Dynamic Probes can be used. Please refer to the presentation on Dynamic Event Tracing in Linux Kernel by Masami Hiramatsu for information on the Dynamic Probes feature.

Memory Bugs

kmemcheck and kmemleak are tools that can be used to detect potential memory-related bugs at runtime. These tools are in the Kernel source tree.

Kmemcheck:

  • Detects and warns about uninitialized memory
  • CONFIG_KMEMCHECK should be enabled
  • Documentation/kmemcheck.txt

Kmemleak:

  • Can be used to debug features and detect possible Kernel memory leaks in a similar way to a tracing garbage collector
  • CONFIG_DEBUG_KMEMLEAK should be enabled
  • Documentation/kmemleak.txt
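
Following up on the Kmemleak notes above, a minimal run-time usage sketch, assuming CONFIG_DEBUG_KMEMLEAK is enabled and debugfs is mounted, looks like this (run as root):

echo scan > /sys/kernel/debug/kmemleak    # trigger an immediate memory scan
cat /sys/kernel/debug/kmemleak            # list suspected leaks with backtraces
echo clear > /sys/kernel/debug/kmemleak   # clear the current list of suspects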

Kernel Address Sanitizer (KASan) is another tool that can be used to find invalid memory accesses by the Kernel. This tool uses two things: a GCC feature to instrument the Kernel memory accesses, and a shadow memory to determine when the Kernel performs an invalid access to a memory address. Please refer to kasan.txt under the Documentation directory in the Linux Kernel source code for more information on how to enable and use KASan.

Debug Interfaces

The Linux Kernel also supports several debug interfaces including debug Kernel configuration options, debug APIs, dynamic debug, and tracepoints to name a few.

There are two different ways Kernel Debug Interfaces can be triggered:

  • Per-callsite: a specific pr_debug() can be enabled using the line number in the source file. e.g: enabling pr_debug() in kernel/power/suspend.c at line 340:
$ echo 'file suspend.c line 340 +p' > /sys/kernel/debug/dynamic_debug/control
  • Per-module: passing dyndbg="plmft" to modprobe, or changing or creating a modname.conf file in /etc/modprobe.d/ – the latter persists across reboots. However, for drivers that get loaded from initramfs, change grub to pass in module.dyndbg="+plmft"

Per-Module Debugging

Several Kernel modules have debug configuration options that can be enabled at compile time with no dynamic control to enable/disable. Debug messages go to dmesg, and these include lock debugging (spinlocks, mutexes, etc…), debug lockups and hangs, read-copy update debugging, and memory debugging.

For example, the DMA-Debug API is designed for debugging driver DMA API usage errors. When CONFIG_HAVE_DMA_API_DEBUG and CONFIG_DMA_API_DEBUG options are enabled, DMA APIs are instrumented to call into a special DMA debug interface. This debug interface runs checks to diagnose errors in DMA API usage; it does this by keeping track of per-device DMA mapping information. This is used to detect unmap attempts on addresses that aren’t mapped in addition to missing mapping error checks in driver code after a DMA map attempt.

Kernel debug options require recompiling the Kernel and add overhead that could change timing. As a result, these aren’t a good choice for debugging problems that result from race conditions and timing-related problems.

Per-Callsite Dynamic Debugging

Dynamic debug, on the other hand, allows dynamic enabling/disabling of pr_debug(), dev_dbg(), print_hex_dump_debug(), and print_hex_dump_bytes() per call site. CONFIG_DYNAMIC_DEBUG controls whether the feature is enabled. Dynamic debug options can be specified using the /sys/kernel/debug/dynamic_debug/control virtual file. Passing the dynamic_debug.verbose=1 kernel boot option will increase the verbosity. Please refer to Documentation/dynamic-debug-howto.txt for more information.
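
As a small sketch of what this looks like in practice (usbcore is just an example module here), you can inspect the control file and toggle messages for a whole module at run-time, as root:

grep usbcore /sys/kernel/debug/dynamic_debug/control | head -3       # list matching debug call sites
echo 'module usbcore +p' > /sys/kernel/debug/dynamic_debug/control   # enable the module's debug messages
echo 'module usbcore -p' > /sys/kernel/debug/dynamic_debug/control   # disable them again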

Even though dynamic debug allows selectively enabling messages, it creates extra overhead for each message, even when nothing is printed. In many cases, additional checks are run to determine if any messages need printing.

Tracepoints

The final feature I’ll cover here is tracepoints, which can be used for debugging, event reporting, and performance accounting. They can be enabled to trigger at run-time; however, they differ from dynamic debugging in that they are inactive unless they are specifically enabled. When enabled, the code is modified to execute the tracepoint code, but unlike dynamic debug, this adds almost no overhead. Tracepoints use jump labels, which are a code modification of a branch: when the tracepoint is disabled, execution simply falls through, but when it is enabled, the jump label directs execution to the tracepoint code.

In addition, there are several debug modules, like test_firmware and test_bpf, to test specific functionality. The Kernel includes several developer regression tests in its Kselftest sub-system. These tests can be run to find any regressions in a newly released Kernel.

Make Your Debugging Count

As a closing thought, I would caution you to watch out for a few things when using debug options:

  • They may add performance cost
  • They might use non-stack allocations which result in extra overhead for allocating memory, increasing the memory footprint
  • They might execute non-optimal code paths.
  • They aren’t good for debugging races/timing bugs and could generate too many debug messages.

This guide should be helpful for anyone looking to debug issues with the Linux Kernel. If you have any questions, feel free to post them in the comments section. Happy Debugging!

An Introduction to Testing the Linux Kernel With Kselftest

This post was originally published on Samsung Open Source Blog site on November 29, 2016

The Linux kernel contains a set of developer unit and regression tests (Kselftests) under tools/testing/selftests; these tests exercise individual code paths in the kernel. In this blog post, I’ll explain how to build and run these tests, how to run Kselftest on the system it’s built on, and how to install and run the tests on a target test system.

Even though kselftest’s main purpose is developer regression testing, testers and users can also use it to ensure there are no regressions in a new kernel. Kselftest is run every day on several Linux kernel trees on the 0-Day and kernelci.org Linux kernel integration test rings.

How to Build Kselftest

The tests are intended to be run after building, installing, and booting a kernel.

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest.git
$ cd linux-kselftest
$ make all
$ sudo make modules_install install

Boot the new kernel, then execute the following

$ cd linux-kselftest
To build the tests:
$ make -C tools/testing/selftests

To run the tests:
$ make -C tools/testing/selftests run_tests

To build and run the tests with a single command, use:
$ make kselftest

Please note, some tests require root privileges.

You can run a subset of selftests by using the "TARGETS" make command-line variable to specify a single test or a list of tests to run.

To run only tests targeted for a single subsystem:
$ make -C tools/testing/selftests TARGETS=ptrace run_tests

You can specify multiple tests to build and run:
$ make TARGETS="size timers" kselftest

See the top-level tools/testing/selftests/Makefile for the list of all possible targets.

Install Kselftest

You can use the kselftest_install.sh tool to install selftests in the default location, which is tools/testing/selftests/kselftest, or in a user-specified location.

To install selftests in the default location:
$ cd tools/testing/selftests
$ ./kselftest_install.sh

To install selftests in a user-specified location:
$ cd tools/testing/selftests
$ ./kselftest_install.sh install_dir

Generate the Kselftest Install Package

$ cd tools/testing/selftests
$ ./gen_kselftest_tar.sh

The generated Kselftest tarball can be copied to the target test system for running the tests.
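
For example, assuming the default kselftest.tar.gz output name and a target test system reachable as testbox (both are placeholders), the steps look roughly like this:

$ scp kselftest.tar.gz testbox:/tmp/    # copy the tarball to the target test system
$ ssh testbox
$ cd /tmp && tar -xzf kselftest.tar.gz  # unpack; this creates a kselftest/ directory

From there, run the tests as described in the next section.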

Run Installed Kselftests

Kselftest install as well as the Kselftest tarball provide a script named run_kselftest.sh to run the tests. You can simply do the following to run the installed Kselftests. Please note some tests will require root privileges.

$ cd kselftest
$ ./run_kselftest.sh

Interpret Kselftest Results

When the Kselftest suite is run, each test prints Pass or Fail along with the reason for any failure. Example results for a few tests in the suite:

Running tests in membarrier
========================================
membarrier MEMBARRIER_CMD_QUERY syscall available.
membarrier: MEMBARRIER_CMD_SHARED success.
membarrier: tests done!
selftests: membarrier_test [PASS]

Running tests in memfd
========================================
memfd: CREATE
memfd: BASIC
memfd: SEAL-WRITE
memfd: SEAL-SHRINK
memfd: SEAL-GROW
memfd: SEAL-RESIZE
memfd: SHARE-DUP
memfd: SHARE-MMAP
memfd: SHARE-OPEN
memfd: SHARE-FORK
memfd: SHARE-DUP (shared file-table)
memfd: SHARE-MMAP (shared file-table)
memfd: SHARE-OPEN (shared file-table)
memfd: SHARE-FORK (shared file-table)
memfd: DONE
selftests: memfd_test [PASS]

Additional Resources

To learn more about using Kselftest, check out the following resources