9-11 September 2019
Europe/Lisbon timezone

Accepted Microconferences

LPC2019 microconferences


BPF microconference

After running a standalone BPF microconference for the first time at last year's [0] [1] [2] Linux Plumbers Conference, we were overwhelmed with thoroughly positive feedback. We received more submissions than we could accommodate in the one-day slot, and the room at the conference venue was fully packed, even though about half of the networking track's submissions covered BPF-related topics as well.

 

We would like to build on this success by organizing a BPF microconference again for 2019. The microconference aims to cover BPF-related kernel topics, mainly in the BPF core area, as well as focused discussions on specific subsystems (tracing, security, networking), each introduced with a short 1-2 slide presentation, in order to get BPF developers together in a face-to-face working meetup for tackling and hashing out unresolved issues and discussing new ideas.

 

Expected audience

Folks knowledgeable about BPF who work on the BPF core or on subsystems making use of BPF.

 

Expected topics

  • libbpf, loader unification
  • Standardized BPF ELF format
  • Multi-object semantics and linker-style logic for BPF loaders
  • Improving verifier scalability to 1 million instructions
  • Sleepable BPF programs
  • State of BPF loop support
  • Proper string support in BPF
  • Indirect calls in BPF
  • BPF timers
  • BPF type format (BTF)
  • Unprivileged BPF
  • BTF of vmlinux
  • BTF annotated raw_tracepoints
  • BPF (k)litmus support
  • bpftool
  • LLVM BPF backend
  • JITs and BPF offloading
  • More to be added based on CfP for this microconference

 

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

 

MC leads

 

[0] https://linuxplumbersconf.org/event/2/sessions/16/#20181115 
[1] https://lwn.net/Articles/773198/ 
[2] https://lwn.net/Articles/773605/ 

 

 


RISC-V microconference

The Linux Plumbers 2019 RISC-V MC will continue the trend established in 2018 [2] to address different relevant problems in RISC-V Linux land.

The overall progress in the RISC-V software ecosystem since last year has been really impressive. To continue this growth, the RISC-V track at Plumbers will focus on finding solutions and discussing ideas that require kernel changes. This should also result in a significant increase in active developer participation in code review and patch submission, which will lead to a better and more stable kernel for RISC-V.

 

Expected topics

  • RISC-V Platform Specification Progress, including some extensions such as power management - Palmer Dabbelt
  • Fixing the Linux boot process in RISC-V (RISC-V now has better support for open source boot loaders like U-Boot and coreboot compared to last year. As a result of this developers can use the same boot loaders to boot Linux on RISC-V as they do in other architectures, but there's more work to be done) - Atish Patra
  • RISC-V hypervisor emulation [5] - Alistair Francis
  • RISC-V hypervisor implementation - Anup Patel
  • NOMMU Linux for RISC-V - Damien Le Moal
  • More to be added based on CfP for this microconference

 

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

 

MC leads

  • Atish Patra <atish.patra@wdc.com> or Palmer Dabbelt <palmer@dabbelt.com>

 

 

[2] https://etherpad.openstack.org/p/RISC-V
 

 


Tracing microconference

Linux Plumbers 2019 is pleased to welcome the Tracing microconference again this year. Tracing is once again picking up in activity, and new and exciting topics are emerging.

There is a broad range of ways to perform tracing in Linux, from the original mainline Linux tracer, ftrace, to profiling tools like perf, more complex customized tracing like BPF, and out-of-tree tracers like LTTng, SystemTap and DTrace. Come and join us and not only learn but help direct the future progress of tracing inside the Linux kernel and beyond!

 

Expected topics

  • bpf tracing – Anything to do with BPF and tracing combined
  • libtrace – Making libraries from our tools
  • Packaging – Packaging these libraries
  • babeltrace – Anything that we need to do to get all tracers talking to each other
  • Those pesky tracepoints – How to get what we want from places where trace events are taboo
  • Changing tracepoints – Without breaking userspace
  • Function tracing – Modification of current implementation
  • Rewriting of the Function Graph tracer – Can kretprobes and function graph tracer merge as one
  • Histogram and synthetic tracepoints – Making a better interface that is more intuitive to use
  • More to be added based on CfP for this microconference

 

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

 

MC lead

Tracing Microconference Summary


Tracing MC Etherpad: https://etherpad.net/p/LPC2019_Tracing

Omar Sandoval started out describing Facebook's drgn utility for programmable debugging. It is a utility that reads /proc/kcore and the kernel debug information to map and debug the kernel live. He would like to connect it to BPF to allow for "breakpoints", but that would cause issues in and of itself. It has no macro support (there is too much information in the DWARF debug data, gigabytes of it). He discussed an issue with vmcores, which do not contain per-cpu or kmalloc data; one needs to walk page tables to get at this information. drgn tries to do what crash does offline, but with a running kernel. Not much progress was made to improve the current code.
Slides

Masami Hiramatsu discussed kernel boot time tracing. The kernel command line is limited in the amount of data that can be passed on it, and Masami wants to increase the amount of data that can be transferred to the kernel. A file could possibly be passed in via the boot loader. Masami first did this with the device tree but was told that it's not for config options (even though people said that ship has already sailed, and that Masami should perhaps argue that again). Masami proposed a Supplemental Kernel Commandline (SKC) that can be added, and he demonstrated the proposed format of the file. Various ways to get this file added were discussed. Perhaps it can be appended to the compressed initram disk, with its offset passed to the kernel; that way the kernel can find it very early in boot, before the initrd is decompressed.
Slides

Song Liu and David Carrillo Cisneros discussed sharing PMU counters across compatible perf events. The issue is that there are more perf events than hardware counters, so the idea is to better share counters among compatible perf events (compatible events can share a single counter). It was suggested to detect compatible perf events at perf_event_open. Implementing this in arch code was suggested, but Peter Zijlstra stated that it would be better in the perf core code, as it appears all architectures would require the same logic. Using their own cgroup (nesting) was also raised, but Peter said that traversing the cgroup hierarchy would cause performance issues. The problem space appears to be identified by those involved and discussion will continue on the mailing list.
Slides
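
For context, "compatible" here means events opened with effectively identical attributes. A minimal sketch (assuming the standard perf_event_open syscall interface) of two such openers counting the same hardware event, which today consume two counter slots but could in principle share one:

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* perf_event_open has no glibc wrapper, so it is called via syscall(). */
    static int open_cycles_counter(void)
    {
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;
        attr.disabled = 1;

        /* count for the calling process, on any CPU */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    }

    int main(void)
    {
        int a = open_cycles_counter();
        int b = open_cycles_counter();  /* identical attributes: "compatible" */

        /* Today each fd occupies its own counter slot; the proposal above
         * would let the kernel back both with a single shared counter. */
        if (a >= 0)
            close(a);
        if (b >= 0)
            close(b);
        return 0;
    }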

Tzvetomir Stoyanov discussed creating an easier user space interface for the synthetic events and histogram code. Tzvetomir demonstrated a very complex use of the histogram logic in the kernel and showed that it is a very difficult interface to use, which explains why it is not used much. It was demonstrated that the interface was used to merge two events based on fields; if we treat the events as "tables", the event fields as "columns", and each instance of the event as a "row", we could use database logic to join these events (namely SQL). Daniel Black (a database maintainer) helped out with the logic. It was discussed whether SQL was the right way to go, as many people dislike that syntax, but Brendan Gregg mentioned that sysadmins are very familiar with it and may be happier to have it. Using a BPF-like language was discussed, but one goal was to keep the language and what can be done with it 1:1 compatible; BPF is a superset, and if you go that far you might as well just use BPF. Why not just use BPF then? The answer is that this is for the embedded "busybox" folks who do not have the luxury of BPF tooling. It was decided to discuss this more later in a BoF (which was done, and a full solution came out of it).
Slides
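
Even the simple single-event case gives a flavour of the current string-based trigger interface; the multi-event synthetic-event setup Tzvetomir showed stacks several more such strings on top. A minimal sketch, assuming root and tracefs mounted at /sys/kernel/tracing:

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* Count sched_switch events per next_comm via a hist trigger. */
        const char *trigger = "hist:keys=next_comm:vals=hitcount\n";
        int fd = open("/sys/kernel/tracing/events/sched/sched_switch/trigger",
                      O_WRONLY);

        if (fd < 0)
            return 1;
        if (write(fd, trigger, strlen(trigger)) < 0) {
            close(fd);
            return 1;
        }
        close(fd);
        /* The result can then be read from .../sched/sched_switch/hist. */
        return 0;
    }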

Jérémie Galarneau discussed unifying tracing ecosystems with Babeltrace. This also flowed into Steven Rostedt's "Unified Tracing Platform" (UTP). Babeltrace strives to make any tracing format (perf, LTTng, ftrace, etc.) readable by any tool, while the UTP strives to make any tool able to use any of the tracing infrastructures. Perf already supports CTF, and trace-cmd has it on the todo list. libbabeltrace works on Windows and macOS. Babeltrace 2.0 is ready; the documentation is just being finished up and it will be released in a few weeks. Steven asked for it to be announced on the linux-trace-user/devel mailing lists.
Slides

Alastair Robertson gave a talk on bpftrace. Mathieu suggested looking into TAP for a test output format. bpftrace still has issues with raw addresses, as /proc/kallsyms only shows global variables. The plan is to get more BPF Type Format (BTF) support; there is work to convert DWARF to BTF, and there is work to convert SystemTap scripts into BPF so that they still work with BPF underneath.
Slides

Brendan Gregg gave a talk on BPF Tracing Tools: New Observability for Performance Analysis. He talked about his book project, BPF Performance Tools, where there are a lot of gaps that he is trying to fill in. Tools from the book can be found at https://github.com/brendangregg/bpf-perf-tools-book. He would still like to get trace events into the VFS layer (but Al Viro is against it, due to possible ABI issues).
Slides

 

 


Distribution kernels microconference

The upstream kernel community is where active kernel development happens, but the majority of deployed kernels do not come directly from upstream; they come from distributions. "Distribution" here can refer to a traditional Linux distribution such as Debian or Gentoo, but also to Android or a custom cloud distribution. The goal of this microconference is to discuss common problems that arise when trying to maintain a kernel.

 

Expected topics

  • Backporting kernel patches and how to make it easier
  • Consuming the stable kernel trees
  • Automated testing for distributions
  • Managing ABIs
  • Distribution packaging/infrastructure
  • Cross distribution bug reporting and tracking
  • Common distribution kconfig
  • Distribution default settings
  • Which patch sets are distributions carrying?
  • More to be added based on CfP for this microconference


"Distribution kernel" is used in a very broad manner. If you maintain a kernel tree for use by others, we welcome you to come and share your experiences.


If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

 

MC lead

 

 


Containers and Checkpoint/Restore MC

The Containers and Checkpoint/Restore MC at Linux Plumbers is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.

Last year's edition covered a range of subjects and a lot of progress has been made on all of them. There is a working prototype for an id shifting filesystem that some distributions already choose to include, proper support for running Android in containers via binderfs, seccomp-based syscall interception, and improved container migration through the userfaultfd patchsets.

Last year's success has prompted us to reprise the microconference this year. Topics we would like to cover include:

  • Android containers
  • Agree on an upstreamable approach to shiftfs
  • Securing containers by rethinking parts of ptrace access permissions, restricting or removing the ability to re-open file descriptors through procfs with higher permissions than they were originally created with, and in general how to make procfs more secure or restricted.
  • Adoption and transition of cgroup v2 in container workloads
  • Upstreaming the time namespace patchset
  • Adding a new clone syscall
  • Adoption and improvement of the new mount and pidfd APIs
  • Improving the state of userfaultfd and its adoption in container runtimes
  • Speeding up container live migration
  • Address space separation for containers
  • More to be added based on CfP for this microconference

 

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

 

MC leads

 


You, Me and IoT MC

The Internet of Things (IoT) has been growing at an incredible pace as of late.

Some IoT application frameworks expose a model-based view of endpoints, such as

  •     on-off switches
  •     dimmable switches
  •     temperature controls
  •     door and window sensors
  •     metering
  •     cameras

Other IoT application frameworks provide direct device access, by creating real and virtual device pairs that communicate over the network. In those cases, writing to the virtual /dev node on a client affects the real /dev node on the server. Examples are

  •     GPIO (/dev/gpiochipN)
  •     I2C (/dev/i2cN)
  •     SPI (/dev/spiN)
  •     UART (/dev/ttySN)
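
In the direct-access model, a framework-provided virtual /dev node is meant to behave just like the local one, so existing userspace code keeps working. For reference, this is roughly what a client does with a local I2C character device today (bus number, slave address and register are made up for the example):

    #include <fcntl.h>
    #include <linux/i2c-dev.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void)
    {
        unsigned char reg = 0x00, val = 0;
        /* The node may be a local controller or a virtual one provided by
         * an IoT framework; the code does not care. */
        int fd = open("/dev/i2c-1", O_RDWR);

        if (fd < 0)
            return 1;
        if (ioctl(fd, I2C_SLAVE, 0x48) < 0) {   /* pick a slave address */
            close(fd);
            return 1;
        }
        /* Write the register index, then read one byte back from it. */
        if (write(fd, &reg, 1) == 1 && read(fd, &val, 1) == 1) {
            /* val now holds the register contents */
        }
        close(fd);
        return 0;
    }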

Interoperability (e.g. ZigBee to Thread) has been a large focus of many vendors due to the surge in popularity of voice recognition in smart devices and the markets that they are driving. Corporate heavyweights are in full force in those economies. OpenHAB, on the other hand, has become a relatively mature, technology- and vendor-agnostic open-source front-end for interacting with multiple different IoT frameworks.

The Linux Foundation has made excellent progress bringing together the business community around the Zephyr RTOS, although there are also plenty of other open-source RTOS solutions available. The linux-wpan developers have brought 6LowPan to the community, which works over 802.15.4 and Bluetooth, and that has paved the way for Thread, LoRa, and others. However, some closed or quasi-closed standards must rely on bridging techniques mainly due to license incompatibility. For that reason, it is helpful for the kernel community to preemptively start working on application layer frameworks and bridges, both community-driven and business-driven.

For completely open-source implementations, experiments have shown promising results with Greybus, with a significant amount of code already in staging. The immediate benefits to the community in that case are clear. There are a variety of key subjects below the application layer that come into play for Greybus and other frameworks that are actively under development, such as

  1. Device Management
     • are devices abstracted through an API or is a virtual /dev node provided?
     • unique ID / management of possibly many virtual /dev nodes and connection info
  2. Network Management
     • standards are nice (e.g. 802.15.4) and help to streamline in-tree support
     • non-standard tech best to keep out of tree?
     • userspace utilities beyond command-line (e.g. NetworkManager, NetLink extensions)
  3. Network Authentication
     • re-use machinery for e.g. 802.11 / 802.15.4?
     • generic approach for other MAC layers?
  4. Encryption
     • in userspace via e.g. SSL, /dev/crypto
  5. Firmware Updates
     • generally different protocol for each IoT framework / application layer
     • Linux solutions should re-use components, e.g. SWUpdate

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

This Microconference will be a meeting ground for industry and hobbyist contributors alike and promises to shed some light on what is yet to come. There might even be a sneak peek at some new OSHW IoT developer kits.

The hope is that some of the more experienced maintainers in linux-wpan, LoRa and OpenHAB can provide feedback and suggestions for those who are actively developing open-source IoT frameworks, protocols, and hardware.

MC leads

Christopher Friedt <chris@friedt.co>, Jason Kridner <jkridner@beagleboard.org>, and Drew Fustini <drew@beagleboard.org>

You, Me and IoT Summary

You, Me, & IoT MC Etherpad: https://etherpad.net/p/LPC2019_IoT

Alexandre Baillon presented the state of Greybus for IoT since he last demonstrated the subject at ELCE 2016. First, Project Ara (the modular phone) was discussed, followed by the network topology (SVC, Module, AP connected over UniPro) and then how Greybus for IoT took the same concepts and extended them over any network. The module structure was described: a Device includes at least one Interface, which contains one or more Bundles, which expose one or more CPorts. CPorts behave much like sockets in the sense that several of them can send and receive data concurrently, and also operate on that data concurrently. Alexandre mentioned that Greybus has been in staging since 4.9, and that he was currently working to have his gb-netlink module merged upstream. More details about the application layer on top of Greybus and its advantages were discussed - i.e. the MCU software only needs to understand how to interact with the bus (i2c, gpio, spi), while the actual driver controlling the bus lives remotely on a Linux host. The device describes itself to Linux via a Manifest. Limitations were listed as well, such as how one RPC message only focuses on one particular procedure, and that performance can vary by network technology. Current issues include how to enable remote wake-up, missing security and authentication, and that it is not currently possible to pair e.g. a remote i2c device and a remote interrupt gpio (possibly extending the manifest format to include key-value details / device tree). The possibility of running gb-netlink mostly in the kernel (after authentication and encryption are set up) was also discussed as a potential improvement.
Slides


Dr. Malini Bhandaru discussed the infrastructure challenges facing companies working with IoT. She pointed out that there are currently several firmware update solutions offering much of the same services (OSTree, Balena.io, SWUpdate, Swupd, Mender.io). There are also issues about what OS is running on the device. There is a distinct need for an update and configuration framework that can be hooked into. There was a question about whether this belonged in the kernel or user space in order to take privilege and consistency into account - policy must live in userspace. Further discussion highlighted that TPM devices would typically hold keys in a secure location. The IETF SUIT (Software Updates for Internet of Things) working group was mentioned. Malini suggested a command interface to implement the API, but such a command line / hook could violate the policies that distros already have.
Slides

Andreas Färber discussed LoRa (Long Range), going into detail about its physical layer (FSK, CSS - Chirp Spread Spectrum) in U-LPWA and sub-GHz bands. He contrasted LPWAN (Low Power Wide Area Network) with LoWPAN (Low-Power Personal Area Network). The tech allows for a long battery life (up to 10 years) with long transmission distances (up to 48 km). Publicly documented APIs are used for modules (communicating over e.g. SPI / UART / USB). The LoRa effort within the Linux kernel is to ensure the hardware works with generic enterprise distributions. Some form of socket interface is possible using a different protocol family (e.g. PF_PACKET / AF_LORA). It is possible that an approach similar to the Netlink interface for 802.15.4 / 802.11 could be used for LoRa as well. An outcome of the Netdev conference was to model a LoRaWAN soft-MAC similar to that of 802.15.4. An RFCv2 submission to the staging tree is planned.
Slides

Peter Robinson gave an overview of Linux IoT from the perspective of an enterprise distribution. U-Boot progress has been great (UEFI). However, there is a large difference between enterprise / industrial IoT and e.g. the Raspberry Pi. He went on to point out that BlueZ has not had a release in 15 months, BeagleBone Black wireless firmware is missing, security fixes are not backported to pegged kernels, Raspberry Pi wireless firmware does not work outside of Raspbian, and Intel has regressed on wireless. Then there was the issue of GPIO: everything still uses the sysfs interface, but it is deprecated, and nobody is using libgpiod. There are no node.js bindings. Adafruit has switched to the new API, but found some pain points with pull-up support and submitted a PR. The new GPIO API requires some slight paradigm shifts, and it was suggested that perhaps that should be something the libgpiod developers engage in (e.g. migration strategies). Some projects are using /dev/mem for userspace drivers, and that is very bad. There needs to be a consistent API across platforms. There is a lot of work to do to maintain a state of consistent usability and functional firmware - how do we distribute the work?
Slides
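
As a concrete illustration of the API shift Peter described, here is a minimal sketch using the libgpiod 1.x C API on top of the GPIO character device instead of the deprecated sysfs interface (chip name and line offset are arbitrary examples):

    #include <gpiod.h>

    int main(void)
    {
        /* Open the chip through the character device (/dev/gpiochip0). */
        struct gpiod_chip *chip = gpiod_chip_open_by_name("gpiochip0");
        struct gpiod_line *line;

        if (!chip)
            return 1;

        /* Request line 17 as an output, initially low. */
        line = gpiod_chip_get_line(chip, 17);
        if (!line || gpiod_line_request_output(line, "lpc-demo", 0) < 0) {
            gpiod_chip_close(chip);
            return 1;
        }

        gpiod_line_set_value(line, 1);  /* drive the line high */

        gpiod_line_release(line);
        gpiod_chip_close(chip);
        return 0;
    }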


Stefan Schmidt discussed 6LowPan and the progress that has been made implementing various RFCs (e.g. header compression). There are now several network and application layers using 6LowPan or another network layer atop 802.15.4. He discussed some hardware topologies - e.g. many hardware vendors prefer to have a network coprocessor that runs in a very low-power mode while the Linux application processor is put to sleep. The Linux approach is soft-MAC (i.e. no firmware required). A notable open-hardware solution is ATUSB, which is again being made available from Osmocon. Link-layer security has been developed and tested by Fraunhofer. Some MAC-layer / userspace work is still to be done (e.g. beacon, scan, network management). Zigbee / 802.15.4 has lost some momentum, but is gaining traction in industrial IoT. Is WirelessHART open? It is possibly used by unencrypted smart metering in the US / Canada.
Slides


Jason Kridner showed off some work that has been done at BeagleBoard.org involving Greybus for IoT. Unlike something like USB, which is a discoverable bus, many IoT busses are non-discoverable (I2C, SPI, GPIO). Greybus could be the silver bullet ("grey" bullet) to solve that problem. Furthermore, the difficulty of writing intelligent drivers and interacting with larger-scale networks can be moved from the device to a Linux machine like a BeagleBone Black, PocketBeagle, or Android device. Work on Greybus has been done to focus on IPv6 (e.g. 6LowPan) and to add strong authentication and encryption, effectively making the physical medium arbitrary. The user experience will be that a sensor node is connected wirelessly, automatically detected via Avahi, and then advertises itself to the Linux host. A GSoC student has been working on Mikrobus Click support under Greybus, and has been fairly successful with GBSIM running on the BeagleBone Black and PocketBeagle. The idea is to get away from writing microcontroller firmware and to write drivers once and get them into the Linux kernel where they can be properly maintained and updated. The CC1352R LaunchXL development kit was shown to the audience since that was the platform on which most of the work was done, and a new open source hardware prototype board briefly made an appearance and was shown to be running the Zephyr real-time operating system. Zephyr has gained a lot of momentum over the last few years and is playing a central role in where companies are focusing their development efforts. The network layer and HAL are fantastic, and it already has support for 802.15.4, BLE, and 6LowPan. There are still some improvements to be made with Greybus: network latency / variance - possibly reintroduce the time-sync code from the original Greybus implementation; extend the Greybus manifest to embed device-tree data as key-value properties.
Slides


Live Patching MC

The main purpose of the Linux Plumbers 2019 Live Patching microconference is to involve all stakeholders in open discussion about remaining issues that need to be solved in order to make live patching of the Linux kernel and the Linux userspace live patching feature complete.

The intention is to mainly focus on the features that have been proposed (some even with a preliminary implementation), but not yet finished, with the ultimate goal of sorting out the remaining issues.

This proposal follows up on the history of past LPC live patching microconferences that have been very useful and pushed the development forward a lot.

Currently proposed discussion/presentation topics (we have not yet gone through an internal selection process) with tentatively confirmed attendance:

  •  5 min Intro - What happened in kernel live patching over the last year
  •  API for state changes made by callbacks [1][2]
  •  source-based livepatch creation tooling [3][4]
  •  klp-convert [5][6]
  •  livepatch developers guide
  •  userspace live patching

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Jiri Kosina <jkosina@suse.cz> and Josh Poimboeuf <jpoimboe@redhat.com>

 

Live Patching Summary

The live patching microconference covered 9 topics overall.

What happened in kernel live patching over the last year

Led by Miroslav Benes. It was quite a natural followup to where we ended at the LPC 2018 miniconf, summarizing which of the points agreed on back then have already been fully implemented, where obstacles have been encountered, etc.

The most prominent feature that has been merged during the past year was "atomic replace", which allows for easier stacking of patches. This is especially useful for distros, as it naturally aligns with the way they distribute patches. Another big step forward since the LPC 2018 miniconf was the addition of livepatching selftests, which have already helped tremendously in various cases, e.g. in tracking down quite a few issues during development of reliable stacktraces on s390. A proposal was made that all major KLP features in the future should be accompanied by selftests, which the audience agreed on.

One of last year's discussion topics / pain points was GCC optimizations that are not compatible with livepatching. GCC upstream now has the -flive-patching option, which disables all those interfering optimizations.

Rethinking late module patching

Led by Miroslav Benes again.

The problem statement is: when a livepatch is loaded for a module that is yet to be loaded, the module has to be patched before it starts executing. The current solution relies on hooks in the module loader, and the module is patched while it is being linked. It gets a bit nasty with the arch-specifics of the module loader handling all the relocations, patching of alternatives, etc. One of the issues is that all the paravirt / jump label patching has to be done after relocations are resolved, which is getting a bit fragile and not well maintainable.

Miroslav sketched out the possible solutions:

  • livepatch would immediately load all the modules for which it has a patch via dependency; half-loading modules (not promoting to final LIVE state)
  • splitting the currently one big monolithic livepatch to a per-object structure; might cause issues with consistency model
  • "blue sky" idea from Joe Lawrence: livepatch loaded modules, binary-patch .ko on disk, blacklist vulnerable version

Miroslav proposed to actually stick with the current solution and improve selftest coverage for all the considered-fragile arch-specific module linking code hooks. The discussion then mostly focused, based on proposals from several attendees (most prominently Steven Rostedt and Amit Shah), on expanding on the "blue sky" idea.

The final proposal converged to having a separate .ko for livepatches that is installed on disk along with the module. This addresses the module signature issue (as the signature does not actually change), as well as the module removal case (where a module was previously loaded while a livepatch is applied, and then later unloaded and reloaded). The slight downside is that this will require changes to the module loader to also look for livepatches when loading a module; when unloading the module, the livepatch module will also need to be unloaded. Steven approved of this approach over his previous suggestion.

Source-based livepatch creation tooling

Led by Nicolai Stange.

The primary objective of the session was the source-based creation of livepatches while avoiding the tedious (and error-prone) task of copying a lot of kernel code around (from the source tree to the livepatch). Nicolai spent part of last year writing the klp-ccp (KLP Copy and Paste) utility, which automates a big chunk of the process.
Nicolai then presented the still open issues with the tool and with the process around it, the most prominent ones being:

  • obtaining original GCC commandline that was used to build the original kernel
  • externalizability of static functions; we need to know whether GCC emitted a static function into the patched object

Miroslav proposed to extend the existing IPA dumping capability of GCC to also emit information about dead code elimination; DWARF information is guaranteed not to be reliable when it comes to IPA optimizations.

Objtool on power -- update

Led by Kamalesh Babulal.

Kamalesh reported that as a followup to last year's miniconference, the objtool support for powerpc actually came to life. It hasn't yet been posted upstream, but is currently available on github.

Kamalesh further reported that the decoder has basic functionality (stack operations + validation, branches, unreachable code, switch tables (through a GCC plugin), conditional branches, prologue sequences). It turns out that stack validation on powerpc is easier than on x86, as the ABI is much more strict there, which leaves the validation phase to mostly focus on hand-written assembly.

The next steps are basing on arm64 objtool code which already abstracted out the arch-specific bits, and further optimizations can be stacked on top of that (switch table detection, more testing, different gcc versions).

Do we need a Livepatch Developers Guide?

Led by Joe Lawrence.

Joe postulated that the current in-kernel documentation covers very well the individual features the infrastructure provides to the livepatch author, but he further suggested also including something along the lines of what they currently have for kpatch, which takes a more general look from the point of view of the livepatch developer.

Proposals that have been brought up for discussion:

  • FAQ
  • collecting already existing CVE fixes and amending them with a lot of commentary
  • creating a livepatch blog on people.kernel.org

Mark Brown asked for documenting what architectures need to implement in order to support livepatching.

Amit Shah asked if the 'kpatch' and 'kpatch-build' script/program could be renamed to 'livepatch'-friendly names so that kernel sources can also reference them in the user docs.

Both Mark's and Amit's remarks have been considered very valid and useful, and agreement was reached that they will be taken care of.

API for state changes made by callbacks

Led by Petr Mladek.

Petr described his proposed API for changing, updating and disabling patches (by callbacks). An example where this was needed: the L1TF fix, which needed to change PTE semantics (particular bits). This can't be done before all the code understands the new PTE format/semantics, therefore pre-patch and post-patch callbacks had to do the actual modifications to all the existing PTEs. What is also currently missing is tracking compatibilities / dependencies between individual livepatches. Petr's proposal (v2) is already on the mailing list. A struct klp_state is being introduced which tracks the actual states of the patch, and klp_is_patch_compatible() checks the compatibility of the current states against the states that the new livepatch is going to bring. No principal issues / objections have been raised, and it is appreciated by the patch author(s), so v3 will be submitted and pre-merge bikeshedding will start.
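
A rough, illustrative sketch of how such a patch module might declare its states, based on the names mentioned above (struct klp_state, klp_is_patch_compatible()); the exact members, the hook-up into struct klp_patch, and the function name "fix_function" are assumptions, since the interface was still under review at the time:

    #include <linux/livepatch.h>
    #include <linux/module.h>

    /* Replacement body for a hypothetical patched function "fix_function". */
    static int livepatch_fix_function(void)
    {
        return 0;
    }

    static struct klp_func funcs[] = {
        { .old_name = "fix_function", .new_func = livepatch_fix_function, },
        { }
    };

    static struct klp_object objs[] = {
        { /* NULL name means vmlinux */ .funcs = funcs, },
        { }
    };

    /* One entry per piece of modified system state, e.g. the L1TF PTE
     * semantics change mentioned above (member names are assumptions). */
    static struct klp_state states[] = {
        { .id = 1, .version = 2, },
        { }
    };

    static struct klp_patch patch = {
        .mod = THIS_MODULE,
        .objs = objs,
        .states = states,   /* assumed hook-up point in struct klp_patch */
    };

    /* Module init/exit boilerplate that registers and enables the patch is
     * omitted. On enablement the livepatch core would reject a new
     * cumulative patch whose declared states are incompatible with the
     * active ones, along the lines of klp_is_patch_compatible(&patch). */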

klp-convert and livepatch relocations

Led by Joe Lawrence.

Joe started the session with the problem statement: accessing non-exported / static symbols from inside the patch module. One possible workaround is doing it manually via kallsyms. A second workaround is klp-convert, which creates proper relocations inside the livepatch module from the symbol database during the final .ko link. Currently the module loader looks for special livepatch relocations and resolves them during runtime; kernel support for these relocations has so far been added for x86 only, while special livepatch relocations should also be supported and processed on other architectures. Special quirks/sections are not yet supported. In addition, klp-convert would still be needed even with the late module patching update, since vmlinux or modules could have ambiguous static symbols.

It turns out that the features / bugs below have to be resolved before we can claim the klp-convert support for relocations complete:

  • handle all the corner cases (jump labels, static keys, ...) properly and have good regression tests in place
  • one day we might (or might not) add support for out-of-tree modules which need klp-convert
  • BFD bug 24456 (multiple relocations to the same .text section)

Making livepatching infrastructure better

Led by Kamalesh Babulal.


The primary goal of the discussion as presented by Kamalesh was simple: how to improve our testing coverage. Currently we have sample modules + kselftests. We seem to be missing specific unit tests and tests for corner cases. What Kamalesh would also like to see is more stress-testing-oriented tests for the infrastructure. We should make sure that projects like KernelCI are running with CONFIG_LIVEPATCH=y.

Another thing Kamalesh currently sees as missing is failure test cases. It should be checked with the sosreport and supportconfig people whether those diagnostic tools provide the necessary coverage of (at least) the livepatching sysfs state. This is especially a task for distro people to figure out.

Nicolai proposed as one of the testcases identity patching, as that should reveal issues directly in the infrastructure.

Open sourcing live patching services

Led by Alice Ferrazzi.

This session followed up on a previous suggestion of having a public repository for livepatches against LTS kernels. Alice reported on improvements to elivepatch since last year, such as having moved everything to Docker.

Alice proposed sharing livepatch sources more widely; SUSE does publish theirs, but it is important to mention that livepatches are very closely tied to a particular kernel version.

 


Open Printing MC

The Open Printing (OP) organisation works on the development of new printing architectures, technologies, printing infrastructure, and interface standards for Linux and Unix-style operating systems. OP collaborates with the IEEE-ISTO Printer Working Group (PWG) on IPP projects.

We maintain cups-filters which allows CUPS to be used on any Unix-based (non-macOS) system. Open Printing also maintains the Foomatic database which is a database-driven system for integrating free software printer drivers with CUPS under Unix. It supports every free software printer driver known to us and every printer known to work with these drivers.

Today it is very hard to think about printing in UNIX based OSs without the involvement of Open Printing. Open Printing has been successful in implementing driverless printing following the IPP standards proposed by the PWG as well.

Proposed Topics:

  • Working with SANE to make IPP scanning a reality. We need to make scanning work without device drivers similar to driverless printing.
  • Common Print Dialog Backends.
  • Printer/Scanner Applications - The new format for printer and scanner drivers. A simple daemon emulating a driverless IPP printer and/or scanner.
  • The Future of Printer Setup Tools - IPP Driverless Printing and IPP System Service. Controlling tools like cups-browsed (or perhaps also the print dialog backends?) to make the user's print dialogs show only the relevant printers or to create printer clusters.
  • 3D Printing without the use of any slicer. A filter that can convert STL code to G-code.

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Till Kamppeter <till.kamppeter@gmail.com> or Aveek Basu <basu.aveek@gmail.com>

 

Open Printing Summary

Presentation Link: https://linuxplumbersconf.org/event/4/contributions/366/attachments/368/602/OpenPrinting_LinuxPlumbersConference2019_Lisbon.pdf

The OpenPrinting MC kicked off with Aveek and Till speaking on the current status of printing today.

Aveek explained driverless printing and how it has changed the way the world prints today. He explained in detail how driverless printing works in Linux today.

After talking about printing, he explained the problems with regard to scanning, saying that scanning is not as smooth as printing in Linux. There is a lot of work that needs to be done on the scanning side to make driverless scanning in Linux a reality.

Till spoke about the current activities of OpenPrinting as an organisation. 

Common Print Dialog Backends.

Rithvik talked about the Common Print Dialog Backends project. The Common Print Dialog project was created with the intention to provide a unified printing experience across Linux distributions. The proposed dialog would be independent of the print technology and the UI toolkit. 

Later this was turned into the Common Print Dialog Backends project which abstracts away different print technologies from the print dialog itself with a D-Bus interface. Every print technology (CUPS, GCP etc.) has its own backend package. The user has to install the package corresponding to the print technology to access his printer and the dialog will use D-Bus messages to talk to the print technology thus decoupling the frontend development from the backend. Also if the user changes/upgrades his print technology in the future, all he has to do is to install the backend corresponding to his new print technology. The biggest challenge for the Common Print Dialog project is the adoption from major toolkit projects like Qt and GTK. There is no official API defined yet and there was a recommendation from the audience that an API be officially defined so that it will help the integration of the backends into UI toolkit projects like Qt, GTK etc. and other applications looking to add printing support. 

IPP Printing & Scanning.

Aveek explained in detail how a driverless printer works using the IPP protocols. The mechanism he explained is roughly: once a host is connected to a network with driverless printers, an mDNS query is broadcast, to which the printers respond saying whether they support driverless operation. Based on that response, the list of printers is shown in the print dialog. Once the user selects a particular printer, the printer is queried for its job attributes. In response to this query, the printer sends the list of the attributes that it supports. Based on that, the supported features for that printer are listed in the UI, and it is then up to the user to select the features that he wants to use.
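
A hedged sketch of that capability query using the public CUPS API; the host name and resource path are placeholders, and a real dialog would also request and parse many more attributes:

    #include <cups/cups.h>
    #include <stdio.h>
    #include <sys/socket.h>

    int main(void)
    {
        /* Connect to a printer discovered via mDNS/DNS-SD (placeholder name). */
        http_t *http = httpConnect2("printer.local", 631, NULL, AF_UNSPEC,
                                    HTTP_ENCRYPTION_IF_REQUESTED, 1, 30000, NULL);
        if (!http)
            return 1;

        /* IPP Get-Printer-Attributes: ask the printer what it supports. */
        ipp_t *request = ippNewRequest(IPP_OP_GET_PRINTER_ATTRIBUTES);
        ippAddString(request, IPP_TAG_OPERATION, IPP_TAG_URI, "printer-uri",
                     NULL, "ipp://printer.local/ipp/print");

        ipp_t *response = cupsDoRequest(http, request, "/ipp/print");
        ipp_attribute_t *attr = ippFindAttribute(response,
                                                 "document-format-supported",
                                                 IPP_TAG_MIMETYPE);
        for (int i = 0; attr && i < ippGetCount(attr); i++)
            puts(ippGetString(attr, i, NULL));   /* list supported formats */

        ippDelete(response);
        httpClose(http);
        return 0;
    }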

This mechanism is available now for printing. However, the same is not the case for scanning. For scanning we still have to use the age-old method of using a driver. Contributions are required in this space to pull scanning up and make it on par with printing. The IPP standards have already been defined by the PWG. It is high time that manufacturers start producing hardware with full support for IPP driverless scanning, and the scanning community should also make the relevant changes on the software side.

Printer/Scanner Applications.

Till gave an introduction to the future of drivers for printers, scanners, and multi-function devices: the Printer/Scanner Applications. On the printing side they replace the 1980s-era hack of using PPD (PostScript Printer Description) files to describe printer capabilities, when most printers are not PostScript and the standard PDL (Page Description Language) is PDF nowadays.

They emulate a driverless IPP printer, answering requests for capability info from the client, taking jobs, filtering them to the printer's format, and sending them off to the printer. To clients they behave exactly like a driverless IPP printer, so CUPS will only have to deal with driverless IPP printers, and the PPD support, deprecated a decade ago, can finally get dropped. The Printer Applications also allow easy sandboxed packaging (Snap, Flatpak, ...; packages are OS-distribution-independent) as there are no files to be placed into the system's file system any more, only IP communication.

As the PWG has also created a standard for driverless IPP scanning, especially with printer/scanner multi-function devices in mind, we can do the same with scanning, replacing SANE in the future, but especially also for complete multi-function-device drivers and to also sandbox-package scanner drivers.

Printer/Scanner Applications can also provide configuration interfaces via PWG's configuration interface standard IPP System Service and/or a web interface. This way one could, for example, implement a Printer Application for printers which cannot be auto-discovered (for example, those that require entering the printer's IP or selecting the printer model), or one can initiate head cleaning and other maintenance tasks.

Printer/Scanner Applications are not only replacements for classic printer and scanner drivers; they can also accommodate special tasks like IPP-over-USB (the already existing ippusbxd), or cups-browsed could be turned into a Printer Application (for clustering, legacy CUPS servers, ...).

The Future of Printer Setup Tools - IPP.

In this section Till presented the situation of Printer Setup Tools. The current ones usually show a list of existing print queues, allow adding a queue for a detected or manually selected printer, and assign a driver to it. On the existing print queues, default option settings can be selected. Configuration of network printer hardware is done via the printer's web interface in a browser, not in the Printer Setup Tool.

In the future these tasks will change: thanks to IPP driverless printing, print queues are set up automatically, both for network and USB printers (IPP-over-USB, ippusbxd), so the classic add-printer task is needed less and less. Configuration of printer hardware is done via IPP System Service in the GUI of the Printer Setup Tool (replacing web admin interfaces).

If a printer needs a driver, a driver snap needs to be installed to make the printer appear as a (locally emulated) driverless IPP printer. This could also be done by hardware association mechanisms in the Snap Store.

Another new task could be to configure printer clustering with cups-browsed, but if cups-browsed gets turned into a proper Printer Application with IPP System Service it is perhaps not actually needed.

The needed GUI interfaces in a modern Printer Management Tool would then be a queue overview with access to default options and jobs, hardware configuration via IPP System Service, and driver Snap search for non-driverless printers/scanners (as long as Snap Store apps do not have hardware association by themselves).

What we really need here are contributors for the new GUI components.

What Can We Change In 3D Print.

The main concept that was discussed was to develop a filter that can convert a 3D design into G-code. If this can be made possible, then there might be a chance to do away with the slicer. There was a good discussion on how and where to provide the functionality currently offered by a slicer; a slicer has a lot of functionality, so there were questions about how all of that would fit inside a filter.

There were talks about having a common PDL (Print Description Language) or ODL (Object Description Language) for 3D printers.

PWG has already defined the 3D printing standards.

 


Toolchains MC

The goal of the Toolchains Microconference is to focus on specific topics related to the GNU Toolchain and Clang/LLVM that have a direct impact in the development of the Linux kernel.

The intention is to have a very practical MC, where toolchain and kernel hackers can engage and, together:

  •     Identify problems, needs and challenges.
  •     Propose, discuss and agree on solutions for these specific problems.
  •     Coordinate on how to implement the solutions, in terms of interfaces, patch submissions, etc., in both the kernel and the toolchain components.

Consequently, we will discourage vague and general "presentations" in favor of concreteness and to-the-point discussions, encouraging the participation of everyone present.

Examples of topics to cover:

  •     Header harmonization between kernel and glibc.
  •     Wrapping syscalls in glibc.
  •     eBPF support in toolchains.
  •     Potential impact/benefit/detriment of recently developed GCC optimizations on the kernel.
  •     Kernel hot-patching and GCC.
  •     Online debugging information: CTF and BTF

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Jose E. Marchesi <jose.marchesi@oracle.com> and Elena Zannoni <ezannoni@gmail.com>
 

 


Testing and Fuzzing MC

The Linux Plumbers 2019 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel.

Potential topics:

  • Defragmentation of testing infrastructure: how can we combine testing infrastructure to avoid duplication.
  • Better sanitizers: Tag-based KASAN, making KTSAN usable, etc.
  • Better hardware testing, hardware sanitizers.
  • Are fuzzers "solved"?
  • Improving real-time testing.
  • Using Clang for better testing coverage.
  • Unit test framework. Content will most likely depend on the state of the patch series closer to the event.
  • Future improvement for KernelCI. Bringing in functional tests? Improving the underlying infrastructure?
  • Making KMSAN/KTSAN more usable.
  • KASAN work in progress
  • Syzkaller (+ fuzzing hardware interfaces)
  • Stable tree (functional) testing
  • KernelCI (autobisect + new testing suites + functional testing)
  • Kernel selftests
  • Smatch

Our objective is to gather leading developers of the kernel and its related testing infrastructure and utilities in an attempt to advance the state of the various utilities in use (and possibly unify some of them), and the overall testing infrastructure of the kernel. We are hopeful that we can build on the experience of the participants of this MC to create solid plans for the upcoming year.

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Sasha Levin <levinsasha928@gmail.com> and Dhaval Giani <dhaval.giani@oracle.com>
 

 


Real-Time MC

Since 2004 a project has improved the real-time and low-latency features of Linux. This project has become known as PREEMPT_RT, formerly the real-time patch. Over the past decade, many parts of PREEMPT_RT became part of the official Linux code base. Examples of what came from PREEMPT_RT include: real-time mutexes, high-resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers and more. The number of patches that need integration has been reduced from previous years, and the pieces left are now mature enough to make their way into mainline Linux. This year could possibly be the year PREEMPT_RT is merged (tm)!

In the final lap of this race, the last patches are on the way to be merged, but there are still some pieces missing. When the merge occurs, PREEMPT_RT will start to follow a new pace: the Linus one. So, it is possible to raise the following discussions:

  1. The status of the merge, and how can we resolve the last issues that block the merge;
  2. How can we improve the testing of the -rt, to follow the problems raised as Linus's tree advances;
  3. What's next?

Proposed topics:

  • Real-time Containers
  • Proxy execution discussion
  • Merge - what is missing and who can help?
  • Rework of softirq - what is needed for the -rt merge
  • An in-kernel view of Latency
  • Ongoing work on RCU that impacts per-cpu threads
  • How BPF can influence the PREEMPT_RT kernel latency
  • Core-schedule and the RT schedulers
  • Stable maintainers tools discussion & improvements.
  • Improvements on full CPU isolation
  • What tools can we add into tools/ that other kernel developers can use to test and learn about PREEMPT_RT?
  • What tests can we add to tools/testing/selftests?
  • New tools for timing regression test, e.g. locking, overheads...
  • What kernel boot self-tests can be added?
  • Discuss various types of failures that can happen with PREEMPT_RT that normally would not happen in the vanilla kernel, e.g, with lockdep, preemption model.

The continuation of the discussion of topics from last year's microconference, including the development done during this (almost) year, are also welcome!

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC lead

Daniel Bristot de Oliveira <bristot@redhat.com>


Databases MC

Databases utilize and depend on a variety of kernel interfaces and are critically dependent on their specification, conformance to specification, and performance. Failure in any of these results in data loss, loss of revenue, or a degraded experience - or, if discovered early, software debt. Specific interfaces can also remove small or large parts of user space code, creating greater efficiencies.

This microconference will get a group of database developers together to talk about how their databases work, along with kernel developers currently developing a particular database-focused technology to talk about its interfaces and intended use.

Database developers are expected to cover:

  • The architecture of their database;
  • The kernel interfaces utilized, particularly those critical to performance and integrity
  • What is a general performance profile of their database with respect to kernel interfaces;
  • What kernel difficulties they have experienced;
  • What kernel interfaces are particularly useful;
  • What kernel interfaces would have been nice to use, but were discounted for a particular reason;
  • Particular pieces of their codebase that have convoluted implementations due to missing syscalls; and
  • The direction of database development and what interfaces to newer hardware, like NVDIMM, atomic write storage, would be desirable.

The aim for kernel developers attending is to:

  • Gain a relationship with database developers;
  • Understand where in development kernel code they will need additional input by database developers;
  • Gain an understanding on how to run database performance tests (or at least who to ask);
  • Gain appreciation for previous work that has been useful; and
  • Gain an understanding of what would be useful aspects to improve.

The aim for database developers attending is to:

  • Gain an understanding of who is implementing the functionality they need;
  • Gain an understanding of kernel development;
  • Learn about kernel features that exist, and how they can be incorporated into their implementation; and
  • Learn how to run a test on a new kernel feature.

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC lead

Daniel  Black <daniel@linux.ibm.com>

Database Microconference Summary

io_uring was the initial topic, both gauging its maturity and examining whether there were existing performance problems. A known problem of buffered writes triggering stalls was raised, and is already being worked on by separating writes into multiple queues/locks. Existing tests showed comparable read performance between O_DIRECT and buffered. MySQL currently shows performance that is twice as bad, however it is hoped that this is the known issue.

Write barriers are needed so that the record for reversing a partial transaction is written before the tablespace changes; in the case of power failure, the partial transaction can then be reversed to preserve the atomicity principle. The crux of the problem is that a write needs to be durable on disk (as if fsynced) before another write. The SCSI standards contain such an option, but it has never been implemented, and for the most part no hardware-level support exists. While the existing userspace implementation uses fsync, it adds considerable overhead and it syncs all pending writes on the file when only one ordering aspect is needed. The way forward seems to be to use/extend the chained write approach in io_uring.
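
A minimal sketch of the chained-write idea using liburing: IOSQE_IO_LINK makes the fsync start only after the write completes, which approximates the ordering a write barrier would provide (setup and error handling are simplified):

    #include <liburing.h>
    #include <sys/uio.h>

    /* Submit "write, then fsync" as a linked chain on an already-open fd. */
    static int ordered_write(int fd, const void *buf, size_t len, off_t off)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
        int i, res = 0;

        if (io_uring_queue_init(8, &ring, 0) < 0)
            return -1;

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_writev(sqe, fd, &iov, 1, off);
        sqe->flags |= IOSQE_IO_LINK;            /* next SQE waits for this one */

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_fsync(sqe, fd, IORING_FSYNC_DATASYNC);

        io_uring_submit(&ring);

        for (i = 0; i < 2; i++) {               /* reap both completions */
            if (io_uring_wait_cqe(&ring, &cqe) < 0) {
                res = -1;
                break;
            }
            if (cqe->res < 0)
                res = -1;
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        return res;
    }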

O_ATOMIC, the promise of writing all or nothing (with the existing block remaining intact), was presented as a requirement. We examined cases of existing hardware support from Shannon, FusionIO and Google storage; they all have different access mechanisms, are not discoverable, and are rather dependent on the filesystem implementation. The current userspace workaround is to double-write the same data to ensure a write tear doesn't occur. There may be a path forward by using an NVDIMM as a staging area attached to the side of a filesystem. XFS has a copy-on-write mechanism that is work in progress, and it is currently recommended to wait for this for database workloads (assumed to be ioctl_ficlonerange: http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html).

XFS / Ext4 behaviours were observed that exhibited greater throughput with more writes. Some theories for this were proposed, but without more data it was hard to say. There were a number of previous bugs where an increase in hardware performance resulted in bigger queues and decreased performance. A full bisection between kernel versions to identify a commit was suggested. Some correctness aspects were fixed between the 4.4 version and the current version, but these may need to be reexamined; quite possibly two effects were in play. Off-CPU analysis, in particular using an eBPF-based sampling mechanism proposed by Josef in the discussion, would result in better identification of what threads are waiting on and where. A later discussion covered differences between unix and TCP loopback implementations; the performance of sockets with 0 backlog (which gained 15% performance) also needs a similar level of probing/measurement to be actionable.

The SQLite and IO errors discussions covered a large gap in the POSIX specification of what can happen on errors. An example is an experimentally found chain of 6 syscalls that seemed to be required to reliably rename a file. A document describing what is needed to perform some basic tasks in a way that results in uniform behaviour across filesystems would alleviate much frustration, guesswork and disappointment. Attendees were invited to ask a set of specific questions on linux-fsdevel@vger.kernel.org where they could be answered and pushed into documentation enhancements.
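
For reference, one commonly cited (and still debated, filesystem-dependent) sequence for a durable rename looks roughly like the sketch below; exactly which of these steps are required, and which errors can surface where, is the documentation gap discussed above:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Rename tmp -> dst and try to make both the file data and the directory
     * entry durable; error handling is abbreviated on purpose, since which
     * errors can be reported where is part of the gap discussed above. */
    static int durable_rename(const char *tmp, const char *dst, const char *dir)
    {
        int ret, fd = open(tmp, O_WRONLY);

        if (fd < 0)
            return -1;
        ret = fsync(fd);                 /* flush the file contents first */
        close(fd);
        if (ret < 0 || rename(tmp, dst) < 0)
            return -1;

        fd = open(dir, O_RDONLY | O_DIRECTORY);
        if (fd < 0)
            return -1;
        ret = fsync(fd);                 /* flush the directory entry */
        close(fd);
        return ret;
    }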

The reimplementation of the MySQL redo log reduced the number of mutexes, however it left gaps in the synchronization mechanism between worker threads. The use of spin vs. mutex locks to synchronise between stages was lacking in some APIs. Waiman Long presented some options in his talk Efficient Userspace Optimistic Spinning Locks later that day (however it lacked saturation test cases).

Syscall overhead and CPU cache issues weren't covered in time; however, some of this was answered in Tuesday's Tracing BoF, and other LPC sessions covered these topics.

The LWN article https://lwn.net/Articles/799807/ covers the SQLite and Postgresql IO Errors topics in more detail.

All of the topics presented covered the needs of the database implementers present in the discussion. Many thanks to our userspace friends:

    Sergei Golubchik (MariaDB)
    Dimitri Kravtchuk (MySQL)
    Pawel Olchawa (MySQL)
    Richard Hipp (SQLite)
    Andres Freund (Postgresql)
    Tomas Vondra (Postgresql)
    Josef Ahmad (MongoDB)


RDMA MC

Following the success of the past 3 years at LPC, we would like to see a 4th RDMA (Remote Direct Memory Access networking) microconference this year. The meetings at the last conferences have seen significant improvements to the RDMA subsystem merged over the years: a new user API, container support, testability/syzkaller, system bootup, Soft iWarp, etc.

In Vancouver, the RDMA track hosted some core kernel discussions on get_user_pages, which is now starting to see its solution merged. We expect that RDMA will again be the natural microconference to hold these quasi-MM discussions at LPC.

This year there remain difficult open issues that need resolution:

  • RDMA and PCI peer to peer for GPU and NVMe applications, including HMM and DMABUF topics
  • RDMA and DAX (carry over from LSF/MM)
  • Final pieces to complete the container work
  • Contiguous system memory allocations for userspace (unresolved from 2017)
  • Shared protection domains and memory registrations
  • NVMe offload
  • Integration of HMM and ODP

And several new developing areas of interest:

  • Multi-vendor virtualized 'virtio' RDMA
  • Non-standard driver features and their impact on the design of the subsystem
  • Encrypted RDMA traffic
  • Rework and simplification of the driver API

Previous years:
2018, 2017: 2nd RDMA mini-summit summary, and 2016: 1st RDMA mini-summit summary
 

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Leon Romanovsky <leon@leon.nu>, Jason Gunthorpe <jgg@mellanox.com>


Scheduler MC

The Linux Plumbers 2019 Scheduler Microconference is meant to cover any scheduler topics other than real-time scheduling.

Potential topics:

  • Load Balancer Rework - prototype
  • Idle Balance optimizations
  • Flattening the group scheduling hierarchy
  • Core scheduling
  • Proxy Execution for CFS
  • Improving scheduling latency with SCHED_IDLE task
  • Scheduler tunables - Mobile vs Server
  • Remove the tick (NO_HZ and NO_HZ_FULL)
  • Linux Integrated System Analysis (LISA) for scheduler verification

We plan to continue the discussions that started at OSPM in May 2019 and get a wider audience outside of the core scheduler developers at LPC.

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Juri Lelli <juri.lelli@redhat.com>, Vincent Guittot <vincent.guittot@linaro.org>, Daniel Bristot de Oliveira <bristot@redhat.com>, Subhra Mazumdar <subhra.mazumdar@oracle.com>, and Dhaval Giani <dhaval.giani@oracle.com>

Scheduler Summary

The microconference started with the core scheduling topic, which has seen several different proposals on the mailing list. The session was led by the different people involved in the subject.

Core scheduling

Aubrey Li, Jan Schönherr, Hugo Reis and Vineeth Remanan Pillai jointly led the first topic session of the Scheduler MC, the main focus of which was to discuss different approaches and possible uses of Core Scheduling [https://lwn.net/Articles/799454/]. Apart from the primary purpose of making SMT secure against hardware vulnerabilities, a database use case for Oracle and another use case for deep-learning workloads were presented as well. For the database use case it was suggested that Core Scheduling could help reduce idleness by finding and packing onto shared siblings processes working on the same database instance; for the deep-learning one, a similar behaviour would help colocate processes issuing AVX-512 instructions on the same core. Finally, an additional use case was mentioned: coscheduling - isolating some processes while at the same time making others run on the same (set of) core(s).

Discussion then moved on and focused on how to ensure fairness across groups of processes, which might monopolize cores and starve other processes by force-idling siblings. Inter-core migrations are, admittedly, still to be looked at.

The general agreement, before closing the session, seemed to be that core scheduling is indeed an interesting mechanism that might eventually reach mainline (mainly because enough people need it), but there is still work to be done to make it sound and general enough to suit everybody’s needs.

Proxy execution

Juri Lelli started out with a very quick introduction to Proxy Execution, an enhanced priority inheritance mechanism intended to replace the current rt_mutex PI implementation and possibly merge the rt_mutex and mutex code. Proxy Execution is especially needed by the SCHED_DEADLINE scheduling policy, as what is implemented today (deadline inheritance with boosted tasks running outside runtime enforcement) is firstly not correct (DEADLINE tasks should inherit bandwidth) and secondly risky for normal users (who could mount a DoS by forging malicious inheritance chains).

The only question that Lelli intended to address during the topic slot (due to time constraints) was “What to do with potential donors and the mutex holder running on different CPUs?”. The current implementation moves the donors (chain) to the CPU on which the holder is running, but this was deemed wrong when the question was discussed at OSPM19 [https://lwn.net/Articles/793502/]. However, both the audience and Lelli (after thinking more about it) agreed that migrating donors might be the only viable solution (from an implementation point of view), and it should also be theoretically sound, at least when holder and donor run in the same exclusive cpuset (root domain), which is probably the only sane configuration for tasks sharing resources.

Making SCHED_DEADLINE safe for kernel kthreads

Paul McKenney introduced his topic by telling the audience that Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr() that can result in SCHED_DEADLINE tasks starving RCU's kthreads for extended time periods (~146 days!), since DEADLINE tasks’ period and runtime currently have no upper bound and RCU’s kthreads are SCHED_NORMAL by default; a sketch of the kind of parameters involved appears after the list below. He then proposed several approaches to possibly fix the problem.

  1. sched_setattr() could recognize parameter settings that put kthreads at risk and refuse to honor those settings.
  2. In theory, RCU could detect this situation and take the "dueling banjos" approach of increasing its priority as needed to get the CPU time that its kthreads need to operate correctly.
  3. Stress testing could be limited to non-risky regimes, such that kthreads get CPU time every 5-40 seconds, depending on configuration and experience.
  4. SCHED_DEADLINE bandwidth throttling could treat tasks in other scheduling classes as an aggregate group having a reasonable aggregate deadline and CPU budget (i.e. hierarchical SCHED_DEADLINE servers).
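
For illustration, here is a minimal sketch of the kind of sched_setattr() call at issue, assuming the Linux sched_setattr syscall; the numbers are purely illustrative, sized to stay just under the default 95% admission limit while still hogging the CPU for months, and approach 1 would have the kernel refuse such settings.

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef SCHED_DEADLINE
    #define SCHED_DEADLINE 6
    #endif

    /* glibc has no sched_setattr() wrapper; the struct layout follows the
     * sched_setattr(2) man page. */
    struct sched_attr {
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;
        uint64_t sched_deadline;
        uint64_t sched_period;
    };

    #define DAYS_NS(d) ((uint64_t)(d) * 24 * 3600 * 1000000000ULL)

    int main(void)
    {
        /* Illustrative numbers only: ~94.8% utilization passes the default
         * admission control, yet the task may then legally run for ~146 days
         * straight, starving SCHED_NORMAL kthreads such as RCU's. */
        struct sched_attr attr = {
            .size           = sizeof(attr),
            .sched_policy   = SCHED_DEADLINE,
            .sched_runtime  = DAYS_NS(146),
            .sched_deadline = DAYS_NS(154),
            .sched_period   = DAYS_NS(154),
        };

        if (syscall(SYS_sched_setattr, 0, &attr, 0))
            perror("sched_setattr");   /* approach 1 would reject settings like this */
        return 0;
    }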

Discussion led to approaches 1 and 4 (both implemented and posted on LKML by Peter Zijlstra) being selected as viable solutions. Approach 1 is actually already close to being merged upstream. Approach 4 might require some more work, but it is very interesting as, once extended to SCHED_FIFO/RR, it could be used both to replace RT throttling and to provide better RT group support.

CFS load balance rework

Vincent Guittot started the topic by giving a status update on the rework of the scheduler's load_balance(). He summarized the main changes posted on LKML and which use cases were fixed by the proposal. He also noted some performance improvements, but admitted that this was not the main goal of the rework at this point. The two main goals of the patchset are:

  • to better classify the group of CPUs when gathering metrics and before deciding to move tasks
  • to choose the best way to describe the current imbalance and what should be done to fix it.

Vincent wanted to present and discuss 3 open items that would need to be fixed in order to further improve load balancing on the system.
The 1st item was about using load_avg instead of runnable_load_avg. Runnable load was introduced to fix cases where there is a huge amount of blocked load on an idle CPU. This situation is no longer a problem with the rework, because load_avg is now used only when there is no spare capacity in a group of CPUs. The current proposal replaces runnable_load_avg with load_avg in the load-balance loop, and Vincent plans to also convert the wakeup path and the NUMA algorithm if the audience thinks that makes sense. Rik said that leaving a CPU idle is often a bad choice even for NUMA, and the rework is the right direction for NUMA too.
The 2nd item was about the detection of overloaded groups of CPUs. Vincent showed some charts of a CPU overloaded by hackbench tasks. The charts showed that utilization can be temporarily low after task migration whereas the load and the runnable load stay high, which is a good indication of large waiting times on an overloaded CPU. Using an unweighted runnable load could help to better classify the sched_groups by monitoring the waiting time. The audience mentioned that tracking the time since the last idle period could also be a good indicator.
The 3rd item was about fairness in situations that cannot be balanced. When the system can’t be balanced, as in the case of N+1 tasks running on N CPUs, nr_balance_failed drives the migrations, which makes them rather unpredictable. There is no solution so far, but the audience proposed maintaining an approximation of a global vruntime that can be used to monitor fairness across the system. It was also raised that compute capacity should be taken into account, because it is another source of unfairness on heterogeneous systems.

Flattening the hierarchy discussion

In addition to his referred talk “CPU controller on a single runqueue”, Rik van Riel wanted to discuss problems related to flattening the scheduler hierarchy. He showed a typical cgroup hierarchy with pulseaudio and systemd and described what has to be done each time a reschedule happens. The cost of walking the hierarchy is quite prohibitive when tasks are enqueued at high frequency. The basic idea of his patchset is to keep the cgroup hierarchy outside the root rq and rate-limit its update. All tasks are placed in the root rq using their hierarchical load and weight, so that as little as possible has to be done during enqueue and dequeue. But the wide variety of weights becomes a problem when a task wakes up, because it can preempt a higher-priority task. Rik asked what the right behaviour for this code should be. The audience raised that we should make sure not to break fairness between groups, and that people care about use cases more than theory; that should drive the solution.

Scheduler domains and cache bandwidth

Valentin Schneider presented the results of some tests that he ran on a ThunderX platform. The current scheduler topology, which has only NUMA and MC levels, does not describe an intermediate “socklet” made of 6 cores. This intermediate HW level is not described today, but adding it increases the performance of some use cases. Valentin asked whether it makes sense to dynamically add a scheduling level, similarly to NUMA. Architectures can already supersede the default SMT/MC/DIE levels with their own hierarchy. Adding a new level for arm64 could make sense as long as the architecture provides the topology information and it can be dynamically discovered.

TurboSched: Core capacity Computation and other challenges

Parth Shah introduced TurboSched, a proposed scheduler enhancement that aims to sustain turbo frequencies for longer by explicitly marking small tasks (“jitters”, as Parth would refer to them) and packing them on a smaller number of cores. This in turn ensures that the other cores will remain idle, thus saving power budget for CPU intensive tasks.

Discussion focused on which interface might be used to mark a task as being “jitter”, but this very question soon raised another one: what is a “jitter” task? It turned out that Parth considers “jitter” tasks to be those for which quick response is not crucial and which can usually be considered background activities (short-running background jobs). There was indeed a general agreement that it is firstly very important to agree on terms, so as to then decide what to do with the different types of tasks.
Another problem discussed was the core capacity calculation that TurboSched uses to find non-idle cores and also to limit task packing on heavily loaded cores. The current approach used by Parth is to take the CPU capacity of each thread and compute the capacity of a core by linearly scaling the single-thread capacity; this approach seemed to be incorrect because the capacity of a CPU is not static. A discussion about possible strategies to achieve TurboSched’s goal followed; unfortunately a definitive agreement wasn’t reached due to time constraints.


Task latency-nice

Subhra Mazumdar wanted to discuss the fact that currently there is no user control over how much time the scheduler should spend searching for CPUs when scheduling a task; internal heuristics based on hardcoded logic decide it. This is however suboptimal for certain types of workloads, especially workloads composed of short-running tasks that generally spend only a few microseconds running when they are activated. To tackle the problem, he proposed to provide a new latency-nice property users can set for a task (similar to the nice value) that controls the search time (and also potentially the preemption logic).
A discussion was held regarding what might be the best interface to associate this new property with tasks, in which a system-wide procfs attribute, a per-task sched_attr extension and a cgroup interface were mentioned. The following question was about what might be a sensible value range for such a property, for which a final answer wasn’t reached. It was then noted that userspace might help with assigning such a property to tasks, if it knows which tasks require special treatment (as the Android runtime normally does).
Paul Turner seemed in favour of a mechanism to restrict the scheduler search space for some tasks, mentioning background tasks as a type of task that would benefit from it.


VFIO/IOMMU/PCI MC

The PCI interconnect specification and the devices implementing it are incorporating more and more features aimed at high-performance systems (e.g. RDMA, peer-to-peer, CCIX, PCI ATS (Address Translation Service)/PRI (Page Request Interface), enabling Shared Virtual Addressing (SVA) between devices and CPUs). These features require the kernel to coordinate the PCI devices, the IOMMUs they are connected to, and the VFIO layer used to manage them (for userspace access and device passthrough), with related kernel interfaces that have to be designed in sync for all three subsystems.

The kernel code that enables these new system features requires coordination between VFIO/IOMMU/PCI subsystems, so that kernel interfaces and userspace APIs can be designed in a clean way.

Following up on the successful LPC 2017 VFIO/IOMMU/PCI microconference, the Linux Plumbers 2019 VFIO/IOMMU/PCI track will therefore focus on promoting discussions on the current kernel patches aimed at the VFIO/IOMMU/PCI subsystems, with specific sessions targeting kernel patches that enable technology (e.g. device/sub-device assignment, peer-to-peer PCI, IOMMU enhancements) requiring coordination between the three subsystems. The microconference will also cover VFIO/IOMMU/PCI subsystem-specific tracks to debate the status of patches for the respective subsystems' plumbing.

Tentative topics for discussion:

  • VFIO
    • Shared Virtual Addressing (SVA) interface
    • SRIOV/PASID integration
    • Device assignment/sub-assignment
  • IOMMU
    • IOMMU drivers SVA interface consolidation
    • IOMMUs virtualization
    • IOMMU-API enhancements for mediated devices/SVA
    • Possible IOMMU core changes (like splitting up iommu_ops, better integration with device-driver core)
    • DMA-API layer interactions and how to get towards generic dma-ops for IOMMU drivers
  • PCI
    • Resources claiming/assignment consolidation
    • Peer-to-Peer
    • PCI error management
    • PCI endpoint subsystem
    • prefetchable vs non-prefetchable BAR address mappings (cacheability)
    • Kernel NoSnoop TLP attribute handling
    • CCIX and accelerators management

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Bjorn Helgaas <bjorn@helgaas.com>, Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>, Joerg Roedel <joro@8bytes.org>, and Alex Williamson <alex.williamson@redhat.com>

VFIO/IOMMU/PCI Summary

Etherpad notes at https://linuxplumbersconf.org/event/4/sessions/66/attachments/272/459/go

User Interfaces for Per-Group Default IOMMU Domain Type (Baolu Lu)

Currently, DMA transactions going through an IOMMU are configured in a coarse-grained manner: all translated or all bypass. This causes a bad user experience when users only want to bypass the IOMMU for DMA from some trusted high-speed devices, e.g. a NIC or graphics device, for performance reasons, or alternatively only want to translate DMA transactions from some legacy devices with limited addressing capability so they can access high memory. A per-group default domain type is proposed to address this.

The first concern was about the cross-dependencies for non-PCI devices (solved through the is_added flag in struct pci_dev).

attach() and add() methods already exist and are implemented in the ARM SMMU drivers, but the driver model doesn't have the notion of separating "add device" from "attach driver".

A solution that was floated was dynamic groups and devices that change groups; drivers can be reset to deal with that.

For x86, PCI hotplug could also be a problem since the default domain type is only parsed during boot time.

Hence, splitting the default domain type from device enumeration might be a solution. As for the user interface, it is still not feasible for non-PCI devices, so adding a postfix, i.e. .pci, to the command line would make things workable. Joerg Roedel pointed out that there are possibly conflicting devices, so there must be a mechanism to detect that. People also have concerns about a command line argument - it is hard for distributions to manipulate, and distribution drivers are modules, so ramfs may be a solution.

In summary, it was agreed that the problem needs solving and the community should start to fix it, but there are still concerns about the various solutions put forward and the related user interfaces.

A consensus should be sought on mailing list discussions to avoid a command line parameter and reach a cleaner upstreamable solution.

Slides

Status of Dual Stage SMMUv3 Integration (Eric Auger)

The work under discussion allows the virtual SMMUv3 to manage the SMMU programming for host devices assigned to a guest OS, which in turn requires that the physical SMMUv3 is set up to handle guest OS mappings.

The first part of the session was dedicated to providing some background about this work which started more than one year ago and highlighting the most important technical challenges.

The Intel IOMMU exposes a so-called Caching Mode (CM). This is a register bit that is used in the virtual IOMMU. When exposed, this bit forces the guest driver to send invalidations on MAP operations.

Normally the Intel iommu driver only sends invalidation on UNMAP operations but with Caching Mode on, the VMM is able to trap all the changes to the page tables and cascade them to the physical IOMMU. On ARM however, such a Caching Mode does not exist. In the past, it was attempted to add this mode through an smmuv3 driver option (arm-smmu-v3 tlbi-on-map option RFC, July/August 2017) but the approach was Nak’ed. The main argument was that it was not specified in the SMMU architecture.

So after this first attempt, the natural workaround was to use the two translation stages potentially implemented by the HW (this is an implementation choice). The first stage would be “owned” by the guest while the second stage would be owned by the hypervisor. This session discussed the status of this latter work. As a reminder, the first RFC was sent in Aug 2018. The latest version of the series (Aug 2019) is:

[PATCH v9 00/14] SMMUv3 Nested Stage Setup (IOMMU part)
[PATCH v9 00/11] SMMUv3 Nested Stage Setup (VFIO part)

It is important to note that this patch series shares the IOMMU APIs with the Intel SVA series (fault reporting, cache invalidation).

Eric Auger highlighted some technical details:

  • The way configuration is set up for the guest IOMMU is different on ARM versus Intel (attach/detach_pasid_table != Intel's sva_bind/unbind_gpasid). That’s because on ARM the stage-2 translation table pointer is within the so-called Context Table Entry, owned by the guest, whereas on Intel it is located in the PASID entry. So on ARM the PASID table can be fully owned by the guest, while on Intel it cannot.
  • Quite a lot of pain is brought about by MSI interrupts on ARM, since they are translated by an IOMMU on ARM (on Intel they are not). So one of the most difficult tasks in upstreaming the series consists in setting up nested binding for MSIs.
  • The series then obviously changes the state machine in the SMMUv3 driver, as we must allow both stages to cooperate; at the moment only stage 1 is used. There are new transitions to handle.
  • The VFIO part is less tricky. It maps onto the new IOMMU API. Most of the series is now dedicated to propagating physical faults up to the guest. This relies on a new fault region and a new specific interrupt.

Then a discussion followed this presentation.

Most of the slot was spent on the first question raised: are there conceptual blockers for the series?

Will Deacon answered that he does not see any user of this series at the moment and he is reluctant to upstream something that is not exercised and tested.

He expects Eric Auger to clarify the use cases where this implementation has benefits over the virtio-iommu solution and to write short documentation to that effect. Eric explained that the series makes sense for use cases with dynamic mappings, for which virtio-iommu performance may be poor, as seen on x86 (native driver in the guest and Shared Virtual Memory in the guest).

Will Deacon also explained that he would rather see a full implementation supporting multiple PASIDs. In other words, he would be interested in shared virtual memory enablement for the guest, where the guest programs physical devices with PASIDs and programs their DMA with guest process virtual addresses.

Eric Auger replied that he preferred to enable nested paging first, without PASID support. This allows removing a lot of complex dependencies at the kernel level (IOASID allocation, substream ID support in the SMMUv3 driver). Eric Auger also said he has no PASID-capable device to test on.

The possibility of a cheap FPGA implementing PCI devices with PASID/PRI was floated as a solution to the lack of HW. Maybe a DesignWare controller can be configured as an endpoint implementing the required HW features (PCI ATS/PRI/PASID); it was not clear in the debate whether that is possible or not.

Will Deacon reported that he could not test the current series. Eric Auger stated that the series was testable and tested on three different platforms.

Beyond that question, the review of the IOMMU UAPIs was considered. Eric Auger asked if the patches related to the IOMMU UAPI could be maintained in a separate series to avoid confusion about who has the latest version of the API, which is shared by several series.

Maintainers agreed on this strategy.

Slides

PASID Management in Linux (Jacob Pan)

PASID life cycle management, user APIs (uAPIs), and Intel VT-d specific requirements were discussed in the context of nested shared virtual addressing (SVA).

The following upstream plan and breakdown of features was agreed for each stage.

  1. PCI device assignment that uses nested translation, a.k.a. Virtual SVA.
  2. IOMMU detected device faults and page request service
  3. Mediated device support

VT-d supports system-wide PASID allocation in that the PASID table is per-device and devices can have multiple PASIDs assigned to different guests.  Guest-host PASID translation could be possible but it was not discussed in detail.

Joerg Roedel and Ben Herrenschmidt both asked if we really need a guest PASID bind API: why can't it be merged with sva_bind_device(), which is the native SVA bind? After reviewing both APIs, it was decided to keep them separate. The reasons are:

  • Guest bind does not have mm notifier directly; guest mm release is propagated to the host in the form of PASID cache flush
  • A PASID is allocated inside sva_bind_device(), whereas guest PASID bind has the PASID allocated prior to the call
  • Metadata used for tracking the bond is different

The outcome is for Jacob Pan to send out user API patches only and get them ACKed by ARM.

For stage 1, the uAPI patches include:

  • iommu_bind_gpasid()
  • iommu_cache_invalidate()
  • IOASID allocator

Various scenarios related to PASID tear down were debated, which is more interesting than setup.

The primary source of race conditions is that the device, the IOMMU and userspace operate asynchronously with respect to faults, aborts, and work submission. The goal is to have clean life cycles for PASIDs.

Unlike a PCIe function level reset, which is unconditional, a PASID-level reset has dependencies on device conditions such as pending faults. Jacob Pan proposed an iommu_pasid_stop API to settle such conditions so that the device can do a PASID reset, but Joerg was concerned the approach is still not race-free: an IOMMU PASID stop and drain does not prevent the device from sending more requests.

The following flow for device driver initiated PASID tear down was agreed:

  • Issue a device-specific PASID reset/abort; the device must guarantee that no more transactions are issued from the device after completion. The device may wait for pending requests to be responded to, either normally or via timeout.
  • iommu_cache_invalidate() may be called to flush device and IOTLBs, the idea is to speed up the abort.
  • Unbind PASID, clear all pending faults in IOMMU driver, mark PASID not present, drain page requests, free PASID.
  • Unregister device fault handler, which will not fail.

The need for the IOMMU to support sub-device fault reporting, specifically for mdevs, was discussed.

In principle, it was agreed to support multiple data per handler which allows iommu_report_device_fault() to provide sub-device specific info.

Slides

Architecture Considerations for VFIO/IOMMU Handling (Cornelia Huck)

The session was aimed at highlighting discrepancies between what the kernel software stack expects from architectures, given that subsystems take x86 for granted as the architectural default.

Examples of features that were considered hard to adapt to other architectures were ARM MSIs, IOMMU subsystem adaptation to Power, the s390 IO model, and the memory attributes of ioremap_wc() on architectures other than x86, where the WC memory type does not even exist.

Cornelia Huck mentioned that there are basic assumptions about systems: everything is PCI, and kernel subsystems are designed with x86 in mind. PCI on the mainframe does not behave as one would expect: access is via instructions, not MMIO. And some devices are not PCI at all (s390 CCW, general IO).

IOMMU core kernel code was designed around x86, very difficult to adapt to Power.

Different arches have different software models; for instance, Power has a hypervisor that works differently from any other hypervisor.

There is an assumption about how device DMA is done. On Power, virtio-iommu (i.e. a paravirtualized IOMMU) can be useful, because Power requires emulating the IOMMU API in the guest. There have been some iterations on HW that can handle that on the host side.

Linux assumes that virtual machine migration can’t happen if there are assigned devices (support is in the works). The problem is that the kernel cannot move pages that are set up for DMA. It is feasible, but it is not actually done; there is a plan to support migration through quiescing.

The session opened the debate on arch assumptions; it continued in hallway discussions.

Slides

Optional or Reduced PCI BARs (Jon Derrick)

Some devices, e.g., NVMe, have both mandatory and optional BARs.  If address space is not assigned to the mandatory BARs, the device can’t be used at all, but the device can operate without space assigned to the optional BARs.

The PCI specs don’t provide a way to identify optional BARs, so Linux assumes all BARs are mandatory.  This means Linux may assign space for device A’s optional BARs, leaving no space for device B’s mandatory BAR, so B is not usable at all.  If Linux were smarter, it might be able to assign space for both A and B’s mandatory BARs and leave the optional BARs unassigned so both A and B were usable.

Drivers know which BARs are mandatory and which are optional, but resource assignment is done before drivers claim devices, so it’s not clear how to teach the resource assignment code about optional BARs.  There is ongoing work for suspending, reassigning resources, and resuming devices.  Maybe this would be relevant.
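
As an illustration of the driver side of this, below is a hypothetical probe fragment showing how a driver that knows BAR 0 is mandatory and BAR 4 is optional could cope with the optional BAR being left unassigned; the BAR numbers and device behaviour are assumptions for illustration, not from the session.

    #include <linux/pci.h>

    #define MY_MANDATORY_BAR 0
    #define MY_OPTIONAL_BAR  4

    /* Hypothetical probe fragment: which BARs are optional is a
     * device-specific assumption made for illustration. */
    static int my_probe(struct pci_dev *pdev, const struct pci_device_id *id)
    {
        int ret = pci_enable_device(pdev);
        if (ret)
            return ret;

        /* No space assigned to the mandatory BAR: the device cannot work. */
        if (!pci_resource_len(pdev, MY_MANDATORY_BAR))
            return -ENODEV;

        /* The optional BAR may have been left unassigned; run degraded. */
        if (pci_resource_len(pdev, MY_OPTIONAL_BAR))
            dev_info(&pdev->dev, "optional BAR %d present\n", MY_OPTIONAL_BAR);
        else
            dev_info(&pdev->dev, "running without optional BAR %d\n",
                     MY_OPTIONAL_BAR);

        return 0;
    }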

The PCI specs don’t provide a way to selectively enable MEM BARs -- all of them are enabled or disabled as a group by the Memory Space Enable bit.  Linux doesn’t have a mechanism for dealing with unassigned optional BARs.  It may be possible to program such BARs to address space outside the upstream bridge’s memory window so they are unreachable and can be enabled without conflicting with other devices.

Slides

PCI Resource Assignment Policies (Ben Herrenschmidt)

There is agreement that the kernel PCI resource allocation code should try to honour firmware resource allocations, not reassign them by default, and then manage hotplug.

In the current code there are lots of platforms and architectures that just reassign everything and arches that do things in between. Benjamin Herrenschmidt said that he found a way to solve the resource allocation for all arches.

There is a compelling need to move resource allocation out of the PCI controller drivers into the PCI core, with one single function (easier to debug and maintain) doing the resource allocation according to a policy chosen from a set.

The real problem in updating PCI resource allocation code is how to test on all existing x86 HW.
 
It is also very complicated to define when a given resource allocation is “broken” so that it has to be redone by the kernel; it can be suboptimal but still functional.

It was agreed that we need to come up with a policy definition and test simple cases gradually.

Slides: none
 

Implementing NTB Controller Using PCIe Endpoint (Kishon Vijay Abraham I)

Kishon presented the idea of adding a software defined NTB (Non-Transparent Bridge) controller using multiple instances of configurable PCIe endpoint in an SoC.

One use of an NTB is as a point-to-point bus connecting two hosts (RCs). NTBs provide three mechanisms by which the two hosts can communicate with each other:

  • Scratchpad Registers: Register space provided by NTB for each host that can be used to pass control and status information between the hosts.
  • Doorbell Registers: Registers used by one host to interrupt the other host.
  • Memory Window: Memory region to transfer data between the two hosts.

All these mechanisms are mandatory for an NTB system.

The SoC should have multiple instances of configurable PCIe endpoint for implementing NTB functionality. The hosts that have to communicate with each other should be connected to each of the endpoint instances and the endpoint should be configured so that transactions from one host are routed to the other host.

A new NTB function driver should be added which uses the endpoint framework to configure the endpoint controller instances. The NTB function driver should model the scratchpad registers, doorbell registers, and memory window and configure the endpoint so transactions from one host are routed to the other host.

A new NTB hardware driver should be added which uses the NTB core on the host side. This will allow standard NTB utilities (ntb_netdev, ntb_pingpong, ntb_tool, etc.) to be used on a host which is connected to an SoC configured with the NTB function driver.

Benjamin Herrenschmidt asked if the implementation can support multiple ports. The NTB function driver accounts for only two interfaces, which means a single endpoint function can be connected to only two hosts. However, the NTB function driver can be used with multi-function endpoints to provide a multi-port connection, i.e. a connection between more than two hosts. Using multi-function has the added advantage of being able to provide isolation between different hosts.

Benjamin Herrenschmidt asked if the modeled NTB can support DMA. The system DMA in SoC (with NTB function driver) can be used for transferring data between the two hosts. However this is not modeled in the initial version of NTB driver. Adding support for such a feature in the software has to be thought through (RDMA? NTRDMA?).

Kishon also mentioned that the initial version of the NTB function driver will only be able to use an NTB function device created using device tree. However, Benjamin Herrenschmidt said that support for creating the NTB function device using configfs should also be added.

Kishon mentioned that the initial version will be posted within weeks to get early review comments and continue the discussion on the mailing list.

Slides


Android MC

Building on the Treble and Generic System Image work, Android is
further pushing the boundaries of upgradability and modularization with
a fairly ambitious goal: Generic Kernel Image (GKI). With GKI, Android
enablement by silicon vendors would become independent of the Linux
kernel running on a device. As such, kernels could easily be upgraded
without requiring any rework of the initial hardware porting efforts.
Accomplishing this requires several important changes and some of the
major topics of this year's Android MC at LPC will cover the work
involved. The Android MC will also cover other topics that had been the
subject of ongoing conversations in past MCs such as: memory, graphics,
storage and virtualization.

Proposed topics include:

 

  • Generic Kernel Image
  • ABI Testing Tools
  • Android usage of memory pressure signals in userspace low memory killer
  • Testing: general issues, frameworks, devices, power, performance, etc.
  • DRM/KMS for Android, adoption and upstreaming; dmabuf heaps upstreaming
  • dmabuf cache management optimizations
  • kernel graphics buffer (dmabuf based)
  • SDcardfs
  • uid stats
  • vma naming
  • virtualization/virtio devices (camera/drm)
  • libcamera unification

These talks build on the continuation of the work done last year as reported on the Android MC 2018 Progress report. Specifically:

 

  • Symbol namespaces have gone ahead
  • There is continued work on using memory pressure signals for userspace low memory killing (see the sketch after this list)
  • Userfs checkpointing has gone ahead with an Android-specific solution
  • The work continues on common graphics infrastructure
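
For reference, the sketch below shows how the upstream memory pressure (PSI) trigger interface in /proc/pressure/memory can wake a userspace low-memory killer, which is the general mechanism referred to above; the thresholds are illustrative and this is not Android's actual lmkd code.

    #include <fcntl.h>
    #include <poll.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
        if (fd < 0) {
            perror("open /proc/pressure/memory");   /* needs CONFIG_PSI, v5.2+ */
            return 1;
        }

        /* Wake up when tasks are stalled on memory for >= 150 ms per second. */
        const char trig[] = "some 150000 1000000";
        if (write(fd, trig, strlen(trig) + 1) < 0) {
            perror("write trigger");
            return 1;
        }

        struct pollfd pfd = { .fd = fd, .events = POLLPRI };
        while (poll(&pfd, 1, -1) > 0) {
            if (pfd.revents & POLLERR)
                break;                               /* monitor went away */
            if (pfd.revents & POLLPRI)
                printf("memory pressure event: consider killing a victim\n");
        }
        close(fd);
        return 0;
    }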

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Karim Yaghmour <karim.yaghmour@opersys.com>, Todd Kjos <tkjos@google.com>, Sandeep Patil <sspatil@google.com>, and John Stultz <john.stultz@linaro.org>


Power Management and Thermal Control MC

The focus of this MC will be on power-management and thermal-control frameworks, task scheduling in relation to power/energy optimizations and thermal control, platform power-management mechanisms, and thermal-control methods. The goal is to facilitate cross-framework and cross-platform discussions that can help improve power and energy-awareness and thermal control in Linux.

Prospective topics:

  • CPU idle-time management improvements
  • Device power management based on platform firmware
  • DVFS in Linux
  • Energy-aware and thermal-aware scheduling
  • Consumer-producer workloads, power distribution
  • Thermal-control methods
  • Thermal-control frameworks

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Rafael J. Wysocki (rafael@kernel.org) and Eduardo Valentin (edubezval@gmail.com)

Power Management and Thermal Control Summary

The Power Management and Thermal Control MC covered 7 topics, 3 mostly related to thermal control and 4 power management ones (1 scheduled topic was not covered due to the lack of time).

A proposal to change the representation of thermal zones in the kernel into a hierarchical one was discussed first. It was argued that having a hierarchy of thermal zones would help to cover certain use cases in which alternative cooling devices could be used for the same purpose, but the actual benefit from that was initially unclear to the discussion participants. The discussion went on for some time without a clear resolution and finally it was deferred to the BoF time after the recorded session. During that time, however, the issue was clarified and mostly resolved with a request to provide a more detailed explanation of the design and the intended usage of the new structure.
Slides

The next topic was about the thermal throttling and the usage of the
SCHED_DEADLINE scheduling class. The problem is that the latter appears to guarantee a certain level of service that may not be provided if performance is reduced for thermal reasons. It was agreed that the admission control limit for SCHED_DEADLINE tasks should be based on the sustainable performance and reduced from the current 95% of the maximum capacity. Also, it would be good to signal tasks about the possibility of missed deadlines (there are some provisions for that already, but they need to be wired up properly).
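
As a point of reference, the current ~95% limit is the RT bandwidth default (sched_rt_runtime_us / sched_rt_period_us = 950000 / 1000000), which SCHED_DEADLINE admission control also uses; a small sketch reading those knobs:

    #include <stdio.h>

    static long read_long(const char *path)
    {
        long v = -1;
        FILE *f = fopen(path, "r");
        if (f) {
            if (fscanf(f, "%ld", &v) != 1)
                v = -1;
            fclose(f);
        }
        return v;
    }

    int main(void)
    {
        long runtime = read_long("/proc/sys/kernel/sched_rt_runtime_us");
        long period  = read_long("/proc/sys/kernel/sched_rt_period_us");

        if (runtime > 0 && period > 0)
            printf("current DEADLINE admission limit: %.1f%% of CPU capacity\n",
                   100.0 * runtime / period);
        return 0;
    }
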
Slides

The subsequent discussion started with the observation that in quite a few cases the application is the best place to decide how to reduce the level of compute in the face of thermal pressure, so it may be good to come up with a way to allow applications to be more adaptive. It was proposed to try the existing sysfs-based mechanism allowing notifications to be sent to user space on crossing trip points, but it was pointed out that it might be too slow. A faster way, like a special device node in /dev, may be necessary for that.
Slides

The next problem discussed was related to the per-core P-states feature present in recent Intel processors which allows performance to be scaled individually for each processor core. It is generally good for energy-efficiency, but in some rare cases it turns out to degrade performance due to dependencies between tasks running on different cores. The question was how to address that and some possible mitigations were suggested. First, if user space can identify the tasks
in question, it can use utilization clamping, but that is a static mechanism and something more dynamic may be needed. Second, the case at hand is analogous to the IOwait one, in which the scheduler passes a special flag to CPU performance scaling governors, so it might be addressed similarly. It was observed that the issue was also similar to the proxy execution problem for SCHED_DEADLINE tasks, but in that case the "depends-on" relationship was well-defined.
Slides

The topic discussed subsequently was about using platform firmware for device power management instead of attempting to control low-level resources (clocks and voltage regulators) directly from the kernel. This is mostly targeted at reducing kernel code complexity and addressing cases in which the complete information may not be available to the kernel (for example, when it runs in a VM). Some audience members were concerned that the complexity might creep up even after this had been done. Also, it was asked how to drive that from Linux and some ideas, mostly based on using PM domains, were outlined.
Slides

Next, the session went on to discuss possible collection of telemetry data
related to system-wide suspend and resume. There is a tool, called pm-graph, that is included in the mainline kernel source tree, that can be run by anyone and can collect suspend and resume telemetry information like the time it takes to complete various pieces of these operations or whether or not there was a failure and what it was etc. It is run on a regular basis at Intel on a number of machines, but it would be good to get such data from a wider set of testers. The idea is to make it possible for users to opt in for the collection of suspend/resume telemetry information, ideally on all suspend/resume cycles, and collect the data in a repository somewhere. It was mentioned that KernelCI had
a distributed way of running tests, so maybe a similar infrastructure could be used here. It was also recommended to present this topic at the FOSDEM conference and generally talk to Linux distributors about it. openSUSE ships pm-graph already, but currently users cannot upload the results anywhere. Also, there is a problem of verifying possible fixes based on public telemetry data.

The final topic discussed regarded measurements of the CPU idle state exit latency, used by the power management quality of service (PM QoS) mechanism (among other things). There is a tool, called WULT, used for that internally at Intel, but there is a plan to release it as Open Source and submit the Linux driver used by it for inclusion into the mainline kernel. It uses a PCI device that can generate delayed interrupts (currently, the I210 PCIe GbE adapter) in order to trigger a CPU wakeup at a known time (a replacement driver module for that device is used) and puts some "interposer" code (based on an ftrace hook) between the cpuidle core and the idle driver to carry out the measurement when the interrupt triggers (Linus Torvalds suggested to execute the MWAIT instruction directly from the tool to simplify the design). It would need the kernel to contain the information on the available C-state residency counters (currently in the turbostat utility) and it would be good to hook that up to perf. It was suggested that BPF might be used to implement this, but it would need to be extended to allow BPF programs to access model-specific registers (MSRs) of the CPU. Also the user space interface should not be debugfs, but that is not a problem with BPF.
Slides

 


System Boot and Security MC

The security of computer systems has been a very important topic for many years. It has been taken into account in OSes and applications for a long time. However, security of the firmware and boot process was not taken as seriously until recently. Now that is changing. Firmware is more often being designed with security in mind. Boot processes are also evolving. There are many security solutions available and even some that are now becoming common. However, they are often not complete solutions and solve problems only partially. So, it is a good time to integrate various approaches and build full top-down solutions. There is a lot happening in that area right now. New projects arise, e.g. TrenchBoot, and they meet various design, implementation, validation, etc. obstacles. The goal of this microconference is to foster a discussion of the various approaches and hammer out, if possible, the best solutions for the future. Perfect sessions should discuss various designs and/or the issues and limitations of the available security technologies and solutions that were encountered during the development process. Below is the list of topics that would be nice to cover. It is not exhaustive and can be extended if needed.

Expected topics:

 

  • TPMs
  • SRTM and DRTM
  • Intel TXT
  • AMD SKINIT
  • attestation
  • UEFI secure boot
  • IMA
  • Intel SGX
  • boot loaders
  • firmware
  • OpenBMC
  • etc.

If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.

MC leads

Daniel Kiper <dkiper@net-space.pl>, Joel Stanley <joel@jms.id.au>, and Matthew Garrett <mjg59-plumbers@srcf.ucam.org>