After running a standalone BPF microconference for the first time at last year's Linux Plumbers Conference, we were overwhelmed by consistently positive feedback. We received more submissions than we could accommodate in the one-day slot, and the room at the conference venue was fully packed even though about half of the networking track's submissions covered BPF-related topics as well.
We would like to build on this success by organizing a BPF microconference for 2019 as well. The microconference aims to cover BPF-related kernel topics, mainly in the BPF core area, and to hold focused discussions on specific subsystems (tracing, security, networking) with short presentations of 1-2 slides. The goal is to bring BPF developers together in a face-to-face working meetup to hash out unresolved issues and discuss new ideas.
Folks knowledgeable with BPF that work in core areas or in subsystems making use of BPF.
- libbpf, loader unification
- Standardized BPF ELF format
- Multi-object semantics and linker-style logic for BPF loaders
- Improving verifier scalability to 1 million instructions
- Sleepable BPF programs
- State of BPF loop support
- Proper string support in BPF
- Indirect calls in BPF
- BPF timers
- BPF type format (BTF)
- Unprivileged BPF
- BTF of vmlinux
- BTF annotated raw_tracepoints
- BPF (k)litmus support
- LLVM BPF backend
- JITs and BPF offloading
- More to be added based on CfP for this microconference
The Linux Plumbers 2019 RISC-V MC will continue the trend established in 2018 to address different relevant problems in RISC-V Linux land.
Overall progress in the RISC-V software ecosystem since last year has been really impressive. To sustain similar growth, the RISC-V track at Plumbers will focus on finding solutions and discussing ideas that require kernel changes. This should also increase active developer participation in code review and patch submission, which will lead to a better and more stable kernel for RISC-V.
- RISC-V Platform Specification Progress, including some extensions such as power management - Palmer Dabbelt
- Fixing the Linux boot process in RISC-V (RISC-V now has better support for open source boot loaders like U-Boot and coreboot than it did last year. As a result, developers can use the same boot loaders to boot Linux on RISC-V as they do on other architectures, but there is more work to be done) - Atish Patra
- RISC-V hypervisor emulation - Alistair Francis
- RISC-V hypervisor implementation - Anup Patel
- NOMMU Linux for RISC-V - Damien Le Moal
- More to be added based on CfP for this microconference
- Atish Patra <firstname.lastname@example.org> or Palmer Dabbelt <email@example.com>
RISC-V Platform Specification Progress
This talk covered the current state of the RISC-V platform specification and what needs to be done next to arrive at a recommended platform specification.
Fixing the Linux boot process in RISC-V
This talk described how the Linux RISC-V boot process was brought to a sane state matching other well-known architectures. The key to achieving this was OpenSBI and U-Boot SPL support. The talk also discussed the SBI v0.2 specification and extensions such as the Hart State Management SBI extension. This extension is required to boot CPUs sequentially, in contrast to the current mechanism, where all harts jump into Linux together and the secondary CPUs wait on a per-CPU variable before proceeding.
Maintain CPU/IO TLBs with sfence.vma for RISC-V
This was an exploratory talk about a RISC-V IOMMU and RISC-V remote TLB flush instructions. It included a broadcast-based TLB invalidation mechanism, which the powerpc maintainers advised against because it is very prone to race conditions. It is also not feasible with GPUs, as clearing the queue may take a long time (~500 ms).
Introduce an implementation of perf trace in riscv system
This talk presented a working version of processor tracing using perf on RISC-V. It followed the draft processor trace specification to implement a risk-pt device that can record all branches across CPUs, reconstruct the PC flow and context from the trace, and filter data by various parameters such as address, context, and privilege level.
Early HPC use cases for RISC-V
This talk was about how RISC-V can be useful in HPC use cases. Given the current state of RISC-V, the best approach would be to build HPC accelerators based on RISC-V (for example, a RISC-V PCIe accelerator card).
This talk was about RISC-V hypervisor support. A brief introduction to the RISC-V hypervisor v0.4 spec was provided. Two hypervisors, Xvisor and KVM, have been ported to RISC-V. The detailed state of both hypervisors and their future work were discussed.
Taking RISC-V to the Datacenter
This talk was about using RISC-V in the data center. It discussed several data center use cases, such as management controllers and NICs, for which RISC-V is suitable at the moment.
RISCV NOMMU/M-Mode Linux
This talk was about NOMMU (or M-mode) support for Linux on RISC-V. The Linux RISC-V NOMMU patches are already on the mailing list and will be very useful for RISC-V microcontrollers that have only M-mode.
Linux Plumbers 2019 is pleased to welcome the Tracing microconference again this year. Tracing is once again picking up in activity, and new and exciting topics are emerging.
There is a broad range of ways to perform tracing in Linux: from the original mainline Linux tracer, Ftrace, to profiling tools like perf, more complex customized tracing like BPF, and out-of-tree tracers like LTTng, SystemTap and DTrace. Come and join us to not only learn but also help direct the future progress of tracing inside the Linux kernel and beyond!
- bpf tracing – Anything to do with BPF and tracing combined
- libtrace – Making libraries from our tools
- Packaging – Packaging these libraries
- babeltrace – Anything that we need to do to get all tracers talking to each other
- Those pesky tracepoints – How to get what we want from places where trace events are taboo
- Changing tracepoints – Without breaking userspace
- Function tracing – Modification of current implementation
- Rewriting of the Function Graph tracer – Can kretprobes and function graph tracer merge as one
- Histogram and synthetic tracepoints – Making a better interface that is more intuitive to use
- More to be added based on CfP for this microconference
- Steven Rostedt (firstname.lastname@example.org)
Tracing MC Etherpad: https://etherpad.net/p/LPC2019_Tracing
Omar Sandoval started out by describing Facebook's drgn utility for programmable debugging. It reads /proc/kcore and the kernel debugger objects to map and debug the kernel live. He would like to connect it to BPF to allow for "breakpoints", but that would cause issues in and of itself. It has no macro support (there is too much information in the DWARF debug data, gigabytes of it). An issue with vmcores was discussed: they do not contain per-CPU or kmalloc data, so page tables must be walked to get at this information. drgn tries to do what crash does offline, but with a running kernel. Not much progress was made on improving the current code.
Masami Hiramatsu discussed kernel boot-time tracing. The kernel command line is limited in the amount of data that can be passed through it, and Masami wants to increase the amount of data that can be transferred to the kernel. A file could possibly be passed in via the boot loader. Masami first did this with the device tree but was told that it is not meant for config options (even though people said that ship has already sailed, and that Masami should perhaps argue the point again). Masami proposed a Supplemental Kernel Commandline (SKC) and demonstrated the proposed format of the file. Various ways of getting this file to the kernel were discussed. Perhaps it could be appended to the compressed initial RAM disk, with the kernel told its offset. That way the kernel could find it very early in boot, before the initrd is decompressed.
Song Liu and David Carrillo Cisneros discussed sharing PMU counters across compatible perf events. The issue is that there are more perf events than hardware counters. Compatible perf events could share a single counter, so the available counters could be used more efficiently. It was suggested to detect compatible perf events at perf_event_open. Implementing this in arch code was also suggested, but Peter Zijlstra stated that it would be better placed in the perf core code, as all architectures appear to require the same logic. Using their own cgroup (nesting) was another possibility, but Peter said that traversing the cgroup hierarchy would cause performance issues. The problem space appears to have been identified by those involved, and the discussion will continue on the mailing list.
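The idea of letting compatible events share one counter can be illustrated with a minimal Python sketch. It is only a model, not the perf core logic: it assumes that two events are "compatible" when they have the same type and config values (real perf events carry many more attributes), and the function name and event tuples are hypothetical.

```python
from collections import defaultdict

def assign_counters(events, num_counters):
    """Group compatible events so that each group needs one counter.

    events is a list of (name, type, config) tuples. Compatibility is
    ASSUMED to mean "same type and same config"; this is a simplification.
    Returns {counter_index: [event names]}.
    """
    groups = defaultdict(list)
    for name, ev_type, config in events:
        groups[(ev_type, config)].append(name)
    if len(groups) > num_counters:
        raise RuntimeError("not enough hardware counters")
    # Dicts keep insertion order, so counters are handed out in the order
    # in which each distinct (type, config) pair was first seen.
    return {index: names for index, names in enumerate(groups.values())}
```

With three events of which two are compatible, two counters are enough, whereas without sharing three would be needed.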
Tzvetomir Stoyanov discussed creating an easier user space interface for the synthetic events and histogram code. Tzvetomir demonstrated a very complex use of the histogram logic in the kernel and showed that the interface is very difficult to use, which explains why it is not used much. The demonstration used the interface to merge two events based on their fields. If we treat the events as "tables", the event fields as "columns" and each instance of an event as a "row", we could use database logic (namely SQL) to join these events. Daniel Black (a database maintainer) helped out with the logic. Whether SQL is the right way to go was discussed, as many people dislike its syntax, but Brendan Gregg mentioned that sysadmins are very familiar with it and may be happier to have it. Using a BPF-like language was discussed too, but one goal was to keep the language 1:1 compatible with what the kernel interface can do. BPF is a superset, and if you go that far, you might as well just use BPF. Why not just use BPF then? The answer is that this is for the embedded "busybox" folks, who do not have the luxury of BPF tooling. It was decided to discuss this further in a BoF (which took place, and a full solution came out of it).
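The table, column and row analogy can be made concrete with a small sketch. It is not the interface that came out of the BoF; it only shows, using Python's built-in sqlite3 module, how joining two events on a shared field yields a derived value. The event names (sched_waking, sched_switch), their fields and the function name are assumptions chosen for illustration.

```python
import sqlite3

def wakeup_latencies(wakings, switches):
    """Join two event "tables" on the pid field.

    wakings:  list of (timestamp, pid) rows for a hypothetical sched_waking event
    switches: list of (timestamp, next_pid) rows for a hypothetical sched_switch event
    Returns (pid, latency) rows, one per matching pair of events.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sched_waking (ts INTEGER, pid INTEGER)")
    db.execute("CREATE TABLE sched_switch (ts INTEGER, next_pid INTEGER)")
    db.executemany("INSERT INTO sched_waking VALUES (?, ?)", wakings)
    db.executemany("INSERT INTO sched_switch VALUES (?, ?)", switches)
    rows = db.execute(
        """
        SELECT w.pid, s.ts - w.ts AS latency
        FROM sched_waking AS w
        JOIN sched_switch AS s ON s.next_pid = w.pid
        WHERE s.ts >= w.ts
        ORDER BY w.pid
        """
    ).fetchall()
    db.close()
    return rows
```

The join condition plays the role of the matching fields, and the computed column plays the role of a field in the resulting synthetic event.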
Jérémie Galarneau discussed unifying tracing ecosystems with Babeltrace. This also tied in with Steven Rostedt's "Unified Tracing Platform" (UTP). Babeltrace strives to make any tracing format (perf, LTTng, ftrace, etc.) readable by any tool. The UTP strives to let any tool use any of the tracing infrastructures. Perf already supports CTF, and trace-cmd has it on its todo list. libbabeltrace works on Windows and Mac OS. Babeltrace 2.0 is ready apart from some remaining documentation work and will be released in a few weeks. Steven asked for it to be announced on the linux-trace-user/devel mailing lists.
Alastair Robertson gave a talk on bpftrace. Mathieu suggested looking into TAP as a test output format. bpftrace still has issues with raw addresses, as /proc/kallsyms only shows global variables. There are plans to add more BPF Type Format (BTF) support. Work is under way to convert DWARF to BTF, and to convert systemtap scripts into BPF format so that they still work with BPF underneath.
Brendan Gregg gave a talk on BPF Tracing Tools: New Observability for Performance Analysis. He talked about his book project, BPF Performance Tools, where there are a lot of gaps that he is trying to fill in. Tools from the book are available on GitHub (https://github.com/brendangregg/bpf-perf-tools-book). He would still like to get trace events into the VFS layer (but Al Viro is against it, due to possible ABI issues).
The upstream kernel community is where active kernel development happens, but the majority of deployed kernels come not directly from upstream but from distributions. "Distribution" here can refer to a traditional Linux distribution such as Debian or Gentoo, but also to Android or a custom cloud distribution. The goal of this microconference is to discuss common problems that arise when trying to maintain a kernel.
- Backporting kernel patches and how to make it easier
- Consuming the stable kernel trees
- Automated testing for distributions
- Managing ABIs
- Distribution packaging/infrastructure
- Cross distribution bug reporting and tracking
- Common distribution kconfig
- Distribution default settings
- Which patch sets are distributions carrying?
- More to be added based on CfP for this microconference
"Distribution kernel" is used in a very broad manner. If you maintain a kernel tree for use by others, we welcome you to come and share your experiences.
- Laura Abbott <email@example.com>
Upstream 1st: Tools and workflows for multi kernel version juggling of short term fixes, long term support, board enablement and features with the upstream kernel
Speaker: Bruce Ashfield, working on Yocto at Xilinx.
Yocto's kernel build recipes need to support multiple active kernel versions (3+ supported streams), multiple architectures, and many different boards. Many patches are required for hardware and other feature support including -rt and aufs.
Goals for maintenance:
- Changes w.r.t. upstream are visible as discrete patches, so rebased rather than merged
- Common feature set and configuration
- Different feature enablements
- Use as few custom tools as possible
Other distributions have similar goals but very few tools in common. So there is a lot of duplicated effort.
Supporting developers, distro builds and end users is challenging. E.g. developers complained about Yocto having separate git repos for different kernel versions, as this led to them needing more disk space.
- Config fragments, patch tracking repo, generated tree(s)
- Branched repository with all patches applied
- Custom change management tools
Using Yocto to build a distro and maintain a kernel tree
Speakers: Senthil Rajaram & Anatol Belski from Microsoft
Microsoft chose Yocto as the build tool for maintaining Linux distros for different internal customers. They wanted to use a single kernel branch for different products, but it was difficult to support all hardware this way.
Maintaining config fragments and a sensible inheritance tree is difficult (?). It might be helpful to put config fragments upstream.
Laura Abbott said that the upstream kconfig system had some support for fragments now, and asked what sort of config fragments would be useful. There seemed to be consensus on adding fragments for specific applications and use cases like "what Docker needs".
Kernel build should be decoupled from image build, to reduce unnecessary rebuilding.
Initramfs is unpacked from cpio, which doesn't support SELinux. So they build an initramfs into the kernel, and add a separate initramfs containing a squashfs image which the initramfs code will switch to.
Making it easier for distros to package kernel source
Speaker: Don Zickus, working on RHEL at Red Hat.
- Makefile includes Makefile.distro
- Other distro stuff goes under distro sub-directory (merge or copy)
- Add targets like fedora-configs, fedora-srpm
Lots of discussion about whether config can be shared upstream, but no agreement on that.
Kyle McMartin(?): Everyone does the hierarchical config layout - like generic, x86, x86-64 - can we at least put this upstream?
Monitoring and Stabilizing the In-Kernel ABI
Speaker: Matthias Männich, working on Android kernel at Google.
Why does Android need it?
- Decouple kernel vs module development
- Provide single ABI/API for vendor modules
- Reduce fragmentation (multiple kernel versions for same Android version; one kernel per device)
Project Treble made most of Android user-space independent of device. Now they want to make the kernel and in-tree modules independent too. For each kernel version and architecture there should be a single ABI. Currently they accept one ABI bump per year. Requires single kernel configuration and toolchain. (Vendors would still be allowed to change configuration so long as it didn't change ABI - presumably to enable additional drivers.)
ABI stability is scoped - i.e. they include/exclude which symbols need to be stable.
ABI is compared using libabigail, not genksyms. (Looks like they were using it for libraries already, so now using it for kernel too.)
Q: How can we ignore compatible struct extensions with libabigail?
A: (from Dodji Seketeli, main author) You can add specific "suppressions" for such additions.
KernelCI applied to distributions
Speaker: Guillaume Tucker from Collabora.
Can KernelCI be used to build distro kernels?
KernelCI currently builds arbitrary branch with in-tree defconfig or small config fragment.
- Preparation steps to apply patches, generate config
- Package result
- Track OS image version that kernel should be installed in
Some in audience questioned whether building a package was necessary.
Possible further improvements:
- Enable testing based on user-space changes
- Product-oriented features, like running installer
Should KernelCI be used to build distro kernels?
Seems like a pretty close match. Adding support for different use-cases is healthy for KernelCI project. It will help distro kernels stay close to upstream, and distro vendors will then want to contribute to KernelCI.
Someone pointed out that this is not only useful for distributions. Distro kernels are sometimes used in embedded systems, and the system builders also want to check for regressions on their specific hardware.
Q: (from Takashi Iwai) How long does testing typically take? SUSE's full automated tests take ~1 week.
A: A few hours to build, depending on system load, and up to 12 hours to complete boot tests.
Automatically testing distribution kernel packages
Speaker: Alice Ferrazzi of Gentoo.
Gentoo wants to provide safe, tested kernel packages. Currently testing gentoo-sources and derived packages. gentoo-sources combines upstream kernel source and "genpatches", which contains patches for bug fixes and target-specific features.
Testing multiple kernel configurations - allyesconfig, defconfig, other reasonable configurations. Building with different toolchains.
Tests are implemented using buildbot. Kernel is installed on top of a Gentoo image and then booted in QEMU.
Generalising for discussion:
- Jenkins vs buildbot vs other
- Beyond boot testing, like LTP and kselftest
- LAVA integration
- Supporting other configurations
- Any other Gentoo or meta-distro topic
Don Zickus talked briefly about Red Hat's experience. They eventually settled on Gitlab CI for RHEL.
Some discussion of what test suites to run, and whether they are reliable. Varying opinions on LTP.
Tim Bird talked about his experience testing with Fuego. A lot of the test definitions there aren't reusable. kselftest currently is hard to integrate because tests are supposed to follow TAP13 protocol for reporting but not all of them do!
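To illustrate what "following TAP13" means for a test's output, here is a minimal Python sketch of a conformance check. It is deliberately incomplete and is not part of Fuego or kselftest: it only looks for the version header, the plan line and ok / not ok result lines, and the function name is hypothetical.

```python
import re

_PLAN = re.compile(r"^1\.\.(\d+)$")
_RESULT = re.compile(r"^(ok|not ok)\b")

def check_tap13(output):
    """Return a list of problems found in a TAP13 report (empty if none)."""
    lines = [line for line in output.splitlines() if line.strip()]
    problems = []
    has_header = bool(lines) and lines[0] == "TAP version 13"
    if not has_header:
        problems.append("missing 'TAP version 13' header")
    body = lines[1:] if has_header else lines
    planned = None
    results = 0
    for line in body:
        plan = _PLAN.match(line)
        if plan:
            planned = int(plan.group(1))
        elif _RESULT.match(line):
            results += 1
        elif line.startswith(("#", " ", "Bail out!")):
            continue  # diagnostics, indented YAML blocks, bail-out lines
        else:
            problems.append("unrecognized line: " + line)
    if planned is None:
        problems.append("missing plan line (1..N)")
    elif planned != results:
        problems.append("plan says %d tests, saw %d results" % (planned, results))
    return problems
```

A harness could run such a check on each test's output and treat a non-empty result as "not integrable yet" rather than as a test failure.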
Distros and Syzkaller - Why bother?
Speaker: George Kennedy, working on virtualisation at Oracle.
Which distros are using syzkaller? Apparently Google uses it for Android, ChromeOS, and internal kernels.
Oracle is using syzkaller as part of CI for Oracle Linux. "syz-manager" schedules jobs on dedicated servers. There is a cron job that automatically creates bug reports based on crashes triggered by syzkaller.
Google's syzbot currently runs syzkaller on GCE. Planning to also run on QEMU with a wider range of emulated devices.
How to make syzkaller part of distro release process? Need to rebuild the distro kernel with config changes to make syzkaller work better (KASAN, KCOV, etc.) and then install kernel in test VM image.
How to correlate crashes detected on distro kernel with those known and fixed upstream?
Example of benefit: syzkaller found regression in rds_sendmsg, fixed upstream and backported into the distro, but then regressed in Oracle Linux. It turned out that patches to upgrade rds had undone the fix.
syzkaller can generate test cases that fail to build on old kernel versions due to symbols missing from UAPI headers. How to avoid this?
Q: How often does this catch bugs in the distro kernel?
A: It doesn't often catch new bugs but does catch missing fixes and regressions.
Q: Is anyone checking the syzkaller test cases against backported fixes?
A: Yes [but it wasn't clear who or when]
Google has public database of reproducers for all the crashes found by syzbot.
- Syzkaller repo tag indicating which version is suitable for a given kernel version's UAPI
- tarball of syzbot reproducers
Other possible types of fuzzing (mostly concentrated on KVM):
- They fuzz MSRs, control & debug regs with "nano-VM"
- Missing QEMU and PCI fuzzing
- Intel and AMD virtualisation work differently, and A
The Containers and Checkpoint/Restore MC at Linux Plumbers is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace.
Last year's edition covered a range of subjects, and a lot of progress has been made on all of them. There is a working prototype of an ID-shifting filesystem that some distributions already choose to include, proper support for running Android in containers via binderfs, seccomp-based syscall interception, and improved container migration through the userfaultfd patchsets.
Last year's success has prompted us to reprise the microconference this year. Topics we would like to cover include:
- Android containers
- Agree on an upstreamable approach to shiftfs
- Securing containers by rethinking parts of ptrace access permissions, restricting or removing the ability to re-open file descriptors through procfs with higher permissions than they were originally created with, and in general how to make procfs more secure or restricted.
- Adoption and transition of cgroup v2 in container workloads
- Upstreaming the time namespace patchset
- Adding a new clone syscall
- Adoption and improvement of the new mount and pidfd APIs
- Improving the state of userfaultfd and its adoption in container runtimes
- Speeding up container live migration
- Address space separation for containers
- More to be added based on CfP for this microconference
- Stéphane Graber <firstname.lastname@example.org>, Christian Brauner <email@example.com>, and Mike Rapoport <firstname.lastname@example.org>
The Containers and Checkpoint/Restore MC at Linux Plumbers is the opportunity for runtime maintainers, kernel developers and others involved with containers on Linux to talk about what they are up to and agree on the next major changes to kernel and userspace. The topics discussed this year focussed on making progress on various features upstream. Parts of what is mentioned in the summaries below are currently being implemented, and some features have even made it upstream in the meantime (e.g. seccomp syscall continuation support). Below you will find rough summaries of the individual sessions. Please refer to the website for the slides and graphics provided by the speakers.
CRIU and the PID dance
CRIU's goal is to restore any userspace process or process tree on the system. You can point CRIU to a process on the system and it will checkpoint and restore it, including all of its children. One of the requirements is that the PID must be identical at checkpoint and restore. PID recycling makes this difficult to guarantee, especially when restoring a process in an already populated PID namespace. The current way of guaranteeing the same PID is to write to /proc/sys/kernel/ns_last_pid, but this is racy: another process can be created between the write and the subsequent fork and take the PID. A prior attempt to address this problem was the eclone() patchset posted in 2010, which did not get merged. A new push is now under way after the clone3() syscall was merged for Linux 5.3. The proposed patchset allows a PID to be specified when creating a new process via clone3(). The struct clone_args of clone3() will be extended with a new set_tid member. Callers with CAP_SYS_ADMIN in the owning user namespace of the target PID namespace can set it to a value greater than 0 to request that the process be created with that specific PID. For this to work, either the target PID namespace must already be populated, i.e. have a PID 1, or the chosen PID must be 1. One outstanding problem is how to remove the CAP_SYS_ADMIN restriction. One proposed solution was to introduce a CRIU-specific capability.
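The conditions under which a specific PID may be requested can be summarized as a small decision function. This is only a Python model of the rules as described in the session, not kernel code; the function and parameter names are hypothetical.

```python
def may_set_tid(requested_tid, ns_has_init, has_cap_sys_admin):
    """Model the checks described for the proposed set_tid member.

    requested_tid:     0 means "no specific PID requested"
    ns_has_init:       True if the target PID namespace already has a PID 1
    has_cap_sys_admin: True if the caller holds CAP_SYS_ADMIN in the user
                       namespace owning the target PID namespace
    """
    if requested_tid < 0:
        raise ValueError("PID must not be negative")
    if requested_tid == 0:
        return True  # ordinary clone3(), the kernel picks the PID
    if not has_cap_sys_admin:
        return False
    # Either the namespace is already populated, or the new process
    # itself becomes PID 1 of that namespace.
    return ns_has_init or requested_tid == 1
```

The sketch leaves out everything the kernel would additionally have to check, such as whether the requested PID is already in use.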
After the session, a discussion showed that the proposed patchset was not sufficient. CRIU requires restoring a process with a specific PID in nested PID namespaces. The proposed patchset would not cover this case, as it only allows a specific PID to be selected in a single PID namespace. The proposed solution is to extend struct clone_args with a set_tid pointer that can point to an array of PIDs, plus a size argument that specifies the size of the array. The maximum size will be limited to the maximum number of PID namespaces that can be nested.
Address Space Isolation for Container Security
Address space isolation has been used to protect the kernel from userspace, and userspace programs from each other, since the invention of virtual memory. We suggest isolating parts of the kernel from each other to reduce the damage caused by a vulnerability.
Our first attempt was a "system call isolation" mechanism that forced system calls to execute in a very restricted address space. The intention was to use it to prevent ROP attacks, but it was too weak from a security point of view to justify the significant performance overhead.
Another idea we are trying to implement is the ability to create 'secret' mappings. Such mappings would be visible only in the context of the owning process, and the physical pages backing them would be removed from the direct map. This is currently WIP.
Our current focus is the creation of dedicated address spaces for Linux namespaces. In particular, most of the objects in a network namespace are private to that namespace anyway and should not be accessed from contexts running in a different namespace. We therefore suggest mapping such objects only in the processes of the same network namespace.
A discussion started about what other namespace might be assigned a restricted page table to gain security benefits. The user namespace looks promising and using a dedicated page table for a user namespace will effectively isolate tenants from each other.
The discussion got more heated when mount namespaces were mentioned; this part concluded that such address space isolation might be useful for pivot_root with MS_PRIVATE.
The other questions raised were how the address space restrictions would propagate through nested namespaces, what the user space ABI for controlling this feature should be, and whether the user should be able to control it at all beyond enabling or disabling it at boot time.
The most important question yet to be answered is what the actual security benefits are and at what cost they can be achieved.
Seccomp Syscall Interception
In Linux 5.0 we introduced the seccomp notifier. The seccomp notifier makes it possible for a task to retrieve a file descriptor for its own seccomp filter. This file descriptor can then be handed off to another, usually more privileged, task. The task receiving this file descriptor can then act as a watcher for the sending task (the watchee) by listening for its syscall events. This allows the watcher to emulate certain syscalls for the watchee and to decide whether success or failure should be reported back to it. In essence, this is a mechanism to outsource seccomp decisions to userspace rather than taking them in-kernel. One good example is the mknod() syscall. The mknod() syscall can be easily filtered based on dev_t. This allows us to intercept only a very specific subset of mknod() syscalls. Furthermore, mknod() is not possible in user namespaces at all, so accidentally intercepting and denying syscalls that are not in the whitelist is not a big deal. The watchee won't notice a difference.
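The kind of decision a watcher makes for an intercepted mknod() can be sketched in a few lines of Python. The sketch covers only the policy, not the notifier plumbing (receiving the notification from the file descriptor and sending a response is left out). The whitelist contents and the function name are assumptions for illustration.

```python
import os
import stat

# Hypothetical whitelist of character devices the watcher is willing to
# create on behalf of the watchee: /dev/null (1, 3) and /dev/zero (1, 5).
ALLOWED_CHAR_DEVICES = {(1, 3), (1, 5)}

def watcher_decision(mode, dev):
    """Decide what to do with an intercepted mknod(path, mode, dev).

    Returns "emulate" if the watcher should perform the mknod() for the
    watchee, or "deny" if it should report failure back to it.
    """
    if not stat.S_ISCHR(mode):
        return "deny"
    if (os.major(dev), os.minor(dev)) in ALLOWED_CHAR_DEVICES:
        return "emulate"
    return "deny"
```

Because mode and dev are plain integer arguments, this decision needs no access to the watchee's memory, which is what makes mknod() an easy case compared with syscalls that take pointer arguments.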
In contrast to mknod(), setxattr() and many other syscalls that we would like to intercept suffer from two major problems:
1. they are not easily filterable like mknod() because they have pointer arguments
2. some of them might actually succeed in user namespaces already (e.g. fscaps etc.)
The first problem is not specific to SECCOMP_RET_USER_NOTIF but apparently also affects future system call design. We recently merged the clone3() syscall into mainline. It moves the flags from a register argument into a dedicated, extensible struct clone_args, which lifts the flag limit of legacy clone() and allows for extensions while supporting all legacy workloads.
One of the counter arguments levelled against my design early on was that this means clone3() cannot be easily filtered by seccomp because of problem 1. Fortunately, this argument was not seen as decisive. I would argue that there is certainly value in trying to design syscalls that seccomp can handle nicely, but seccomp must not become a burden on designing extensible syscalls. The currently proposed openat2() syscall also uses a dedicated argument struct that contains flags, and the seccomp argument came up again.
In light of all this, I would argue that we should seriously look into extending seccomp to allow filtering on pointer arguments.
There is a close connection between 1. and 2. When a watcher intercepts a syscall from a watchee and starts to inspect its arguments it can - depending on the syscall rather often actually - determine whether or not the syscall would succeed or fail. If it knows that the syscall will succeed it currently still has to perform it in lieu of the watchee since there is no way to tell the kernel to "resume" or actually perform the syscall.
There was a general discussion held at the microconference but also together with Kees Cook as a dedicated Kernel Summit session. During the Kernel Summit session we discussed a concrete solution to deep argument filtering coming to the consensus that only specific syscalls need deep argument filtering and that it will not include path-based filtering due to various race conditions. A concrete proposal was made for making it possible to continue syscalls. The patchset is currently up for review on LKML and looks like it will be merged for Linux 5.5.
Update on Task Migration at Google Using CRIU
Google has been migrating containers since 2018. Their assumption is that CRIU runs as non-root and within existing namespaces. The main goal is to keep downtime within seconds rather than to minimize it to milliseconds. When migrating containers, Google has moved away from storing images on central storage to using a page server that streams the data directly and encrypted. Google has been contributing to CRIU more than before. One example is the addition of subreaper support to CRIU. The creation of cgroups is now delegated to the user before starting CRIU.
An important topic for CRIU is the migration of TCP_ESTABLISHED connections. It is also important to have clear error messages. Lacking right now is the support for O_PATH file descriptors but adding this feature is planned. Some aspects, such as migrating the state of some virtual filesystems, are unclear on how exactly they can be done correctly.
A long-standing issue has been to get rid of CAP_SYS_ADMIN restrictions in favor of a CRIU specific capability such as CAP_RESTORE.
A feature that is being discussed upstream but has not been merged yet would be quite important: the time namespace. There are workarounds but it is important to land this patchset. Other problems are write-only APIs in the kernel that are not inspectable and thus opaque to CRIU. The question was asked how we can ensure that new interfaces added to the kernel support migration.
What is also missing, is a way to query the kernel for all relevant data for a given migration. A form of this was once proposed as a netlink-based api but ultimately rejected.
During the Q&A
Re getting rid of CAP_SYS_ADMIN: It seems unclear whether that is really needed. It is safe to give a container running in pid and user namespaces CAP_SYS_ADMIN in its user namespace. Google has an internal API that does not require processes to be synced but this issue does not take care of processes currently running in the vdso.
Secure Image-less Container Migration
Currently CRIU creates its image files during checkpointing. These image files need to be transferred to the destination (often SCP). The memory pages, however, are transferred unencrypted through CRIU’s page server. The goal is to add encryption support to the page server. To reduce the number of copies of the image files (checkpoint to disk, disk to destination system disk, from disk back into memory) the image-proxy/image-cache will keep the checkpoint data in memory for image-less migration. This way all related checkpoint data will be sent over a single encrypted network connection.
Problematic in the context of image-less container migration are special resources. One example is tmpfs directories mounted in the container. To migrate tmpfs directories CRIU uses tar archives. It is, however, difficult to know the size of the streamed tar archive, which is problematic for the current implementation of the image-proxy/image-cache.
A further improvement might be to integrate the image-proxy/image-cache with the lazy-pages daemon. The lazy-pages daemon would be fork()ed directly from CRIU, instead of having to be started manually, and the same TCP socket could then be used as for the complete container migration.
Using the new mount API with containers
The new mount API splits mounting from a single syscall into multiple syscalls and makes the interface more flexible. A rough outline of how mounting works with the new mount API:
fsfd = fsopen("ext4", 0);
fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/sdb1", 0);
fsconfig(fsfd, FSCONFIG_SET_STRING, "journal", "/dev/sdb2", 0);
fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0); /* create the superblock */
mntfd = fsmount(fsfd, 0, 0);
move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
A long-standing feature request has been shiftfs, which allows on-disk id mappings to be translated dynamically, either relative to a given id range or to an id mapping in a given user namespace. One idea is to implement shiftfs not as a separate filesystem but as a feature in the core VFS that is configured through the mount API. The new fsconfig() syscall would then be used to load mapping tables as part of the configuration.
A core question is whether the user namespace implementation could be abused/changed to implement this feature. A point raised in the discussion was that tying this feature to the user namespace leaves out use cases for shiftfs without user namespaces. It needs to be possible to translate dynamically both relative to an id mapping in a user namespace and relative to an id mapping specified during mount.
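The two translation modes described above can be illustrated with a small model. This is only a sketch in Python, not kernel code: `IdMap`, `shift_id`, and the example ranges are hypothetical names and values chosen for illustration, and the extent layout (first source id, first target id, count) is an assumption modelled on the triples found in /proc/<pid>/uid_map.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class IdMap:
    """One mapping extent: ids [src, src + count) map to [dst, dst + count)."""
    src: int
    dst: int
    count: int


def shift_id(on_disk_id: int, mapping: Sequence[IdMap]) -> Optional[int]:
    """Translate an on-disk id through a mapping table.

    Returns None when no extent covers the id.
    """
    for extent in mapping:
        offset = on_disk_id - extent.src
        if 0 <= offset < extent.count:
            return extent.dst + offset
    return None


# Mode 1: table derived from a user namespace (hypothetical values).
userns_map = [IdMap(src=0, dst=100000, count=65536)]

# Mode 2: table supplied directly at mount time, no user namespace needed.
mount_map = [IdMap(src=1000, dst=0, count=1)]

print(shift_id(0, userns_map))     # id 0 on disk, shifted into the range
print(shift_id(1000, mount_map))   # a single id remapped at mount time
```

Both modes use the same lookup; they differ only in where the table comes from, which is the point raised in the discussion.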
Another proposed feature is the ability to intercept and supervise mounts. This use case is especially important for container managers that supervise unprivileged containers and want to allow certain mounts that would otherwise be rejected by the kernel. To this end, the fsmount() syscall in the container would be intercepted. The context fd for the new mount would be passed to the container manager. The container manager can then use fsinfo() to inspect the mount context and fsconfig() to alter the mount options. Finally it can reject or allow the mount.
How to implement interception is debatable: does this need an in-kernel container concept? Probably not but it would be one way of doing it. Would this feature be tied to mount namespaces and could it be implemented on top of AF_UNIX or would it even need to be a new syscall?
The new mount API would also gain the ability to set default mount options, e.g. to parameterize automounts and load default parameters as specified by the admin. This would include things such as NFS timeouts, window sizes, etc. It is an outstanding question what exactly this would look like. Would the default mount options be loaded at boot or on demand? Would they be passed through a config file or by specifying an fd to a config file, etc.?
Can we agree on what needs to happen to get shiftfs - or at least its feature set - upstream?
Multiple approaches were discussed for getting an implementation of the concept illustrated by shiftfs upstream. One implementation was outlined and discussed in the mount API talk. Further discussion outside of the microconference now prefers a separate user map for filesystem access.
Securing Container Runtimes with openat2 and libpathrs
Currently managing paths (e.g. for mounting) for containers is racy, error-prone, and CVE-friendly. For example, opening a file which is a symlink to a symlink needs to be sanitized manually right now. There are already multiple CVEs exploiting this problem. The new openat2() syscall tries to provide a generic solution for such problems by providing a set of flags that allow callers to explicitly specify what type of resolution they want to allow through a file descriptor returned from openat2(). The new libpathrs library is an implementation that tries to provide safe path resolution to callers. This is done for both openat2() aware and non-openat2() aware kernels. The idea is to remove the attack surface that exists because a lot of programs implement safe path resolution incorrectly. It is hard to get right. The discussion was mainly centered around how to make openat2() mergeable upstream.
Using kernel keyrings with containers
The goal is to implement a keyring for a container. Such a container keyring would be alterable by the container manager, and could be used by the container denizens but would not be visible to them. This is useful for actively managed filesystems such as AFS, NFS, CIFS, and ext4 encryption. How exactly to implement this is still under discussion. One approach is to implement a new namespace type for keyrings.
Another proposed feature is to change the uid/gid/mask-type access controls currently associated with a given key to a proper ACL. An entry in such an ACL would need a subject it can be applied to, and one possibility would be to allow a container to be such a subject. This would be easy if there were a kernel concept of what a container is. Since there isn't, a solution might be to use a namespace fd to identify a container. ACLs could then be used to manage permissions for container denizens.
Work is ongoing on how to handle request-key upcalls. A container can't simply exec /sbin/request-key in the init context since this runs into namespace problems. One approach is to allow the container manager to intercept requests, but the problem is how exactly they would be routed. Would they be routed through a keyring namespace or an in-kernel concept of what a container is?
Cgroup v1/v2 Abstraction Layer
Since kernel v4.5 a new cgroup filesystem is available which supersedes the old cgroup filesystem. The new cgroup filesystem is simply called cgroup2. It is fundamentally incompatible with the old filesystem, such that cgroup v1 aware workloads will not work on cgroup v2 hosts. This is a massive userspace regression which needs to be addressed. Multiple companies have customers that run cgroup v1 workloads which they would like to migrate into containers running on new kernels. Problems arise when a host uses a cgroup v2 layout but the container workload only knows about cgroup v1. These use cases are quite important and are not covered. One solution that was discussed is to stay on cgroup v1 forever, which for obvious reasons is dispreferred. Another solution would be to switch to cgroup v2, cause breakage, and fix the programs that break. This will not work for mixed workloads where a container that only supports cgroup v1 runs on a host that only supports cgroup v2. A third solution would be to backport cgroup v2 features to cgroup v1. This approach is not supported upstream. The most expensive but probably best solution would be to implement a userspace abstraction that fakes a cgroup v1 layout which could be mounted into containers. Requests to read or set cgroup limits would then be translated into corresponding cgroup v2 requests (where applicable). This could e.g. be done via FUSE. An example of how something like this could work exists in the form of LXCFS, though this is nowhere near what is needed for a full translation layer.
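To make the translation idea more concrete, here is a minimal sketch in Python of how a write to a faked cgroup v1 file might be turned into a cgroup v2 write. It is not taken from LXCFS or from the discussion: `translate_write` and `shares_to_weight` are hypothetical names, only three files are covered, and the linear cpu.shares to cpu.weight mapping is an assumption (one convention used by some container runtimes), not something agreed on at the microconference.

```python
from typing import Tuple

V2_MAX = "max"
# cgroup v1 reports "no limit" as a very large number; treat anything
# at or above this threshold as unlimited (assumption for this sketch).
V1_UNLIMITED_THRESHOLD = 2**62


def shares_to_weight(shares: int) -> int:
    """Map cpu.shares (2..262144) linearly onto cpu.weight (1..10000)."""
    shares = min(max(shares, 2), 262144)
    return 1 + ((shares - 2) * 9999) // 262142


def translate_write(v1_file: str, value: str) -> Tuple[str, str]:
    """Translate a write to a cgroup v1 file into (v2_file, v2_value)."""
    if v1_file == "cpu.shares":
        return "cpu.weight", str(shares_to_weight(int(value)))
    if v1_file == "memory.limit_in_bytes":
        limit = int(value)
        unlimited = limit < 0 or limit >= V1_UNLIMITED_THRESHOLD
        return "memory.max", V2_MAX if unlimited else str(limit)
    if v1_file == "pids.max":
        return "pids.max", value  # same file name in both versions
    # Many v1 knobs have no v2 counterpart ("where applicable").
    raise NotImplementedError(f"no v2 equivalent known for {v1_file}")


print(translate_write("cpu.shares", "1024"))
print(translate_write("memory.limit_in_bytes", "-1"))
```

A real translation layer would also have to handle reads, the differing hierarchy layouts, and controllers that exist in only one version, which is where most of the cost mentioned above would come from.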
CRIU: Reworking vDSO proxification, syscall restart
Syscalls with a timeout have their restart block in task_struct. This block is not accessible from userspace and CRIU has no way to checkpoint and restore such syscalls correctly. As a result applications that were stopped while executing such syscalls will be restored with a wrong timeout. A simple prctl() to get and set data is not enough to solve this problem.
Another problem is that vdso compatibility is required before and after checkpoint/restore. The current vdso remapping that CRIU uses is not safe. The solution to this problem which was discussed is to re-link the new VVAR page with the old vdso. An alternative suggestion was to use stack unwinding information to detect where a task was in the vdso at the time of checkpoint and then use this information during restore.
The Internet of Things (IoT) has been growing at an incredible pace as of late.
Some IoT application frameworks expose a model-based view of endpoints, such as
- on-off switches
- dimmable switches
- temperature controls
- door and window sensors
Other IoT application frameworks provide direct device access, by creating real and virtual device pairs that communicate over the network. In those cases, writing to the virtual /dev node on a client affects the real /dev node on the server. Examples are
- GPIO (/dev/gpiochipN)
- I2C (/dev/i2cN)
- SPI (/dev/spiN)
- UART (/dev/ttySN)
Interoperability (e.g. ZigBee to Thread) has been a large focus of many vendors due to the surge in popularity of voice-recognition in smart devices and the markets that they are driving. Corporate heavyweights are in full force in those economies. OpenHAB, on the other hand, has become relatively mature as a technology and vendor agnostic open-source front-end for interacting with multiple different IoT frameworks.
The Linux Foundation has made excellent progress bringing together the business community around the Zephyr RTOS, although there are also plenty of other open-source RTOS solutions available. The linux-wpan developers have brought 6LowPan to the community, which works over 802.15.4 and Bluetooth, and that has paved the way for Thread, LoRa, and others. However, some closed or quasi-closed standards must rely on bridging techniques mainly due to license incompatibility. For that reason, it is helpful for the kernel community to preemptively start working on application layer frameworks and bridges, both community-driven and business-driven.
For completely open-source implementations, experiments with Greybus have shown results, with a significant amount of code already in staging. The immediate benefits to the community in that case are clear. There are a variety of key subjects below the application layer that come into play for Greybus and other frameworks that are actively under development, such as
- Device Management
  - are devices abstracted through an API or is a virtual /dev node provided?
  - unique ID / management of possibly many virtual /dev nodes and connection info
- Network Management
  - standards are nice (e.g. 802.15.4) and help to streamline in-tree support
  - non-standard tech best to keep out of tree?
  - userspace utilities beyond command-line (e.g. NetworkManager, NetLink extensions)
- Network Authentication
  - re-use machinery for e.g. 802.11 / 802.15.4 ?
  - generic approach for other MAC layers ?
  - in userspace via e.g. SSL, /dev/crypto
- Firmware Updates
  - generally different protocol for each IoT framework / application layer
  - Linux solutions should re-use components e.g. SWUpdate
This Microconference will be a meeting ground for industry and hobbyist contributors alike and promises to shed some light on what is yet to come. There might even be a sneak peek at some new OSHW IoT developer kits.
The hope is that some of the more experienced maintainers in linux-wpan, LoRa and OpenHAB can provide feedback and suggestions for those who are actively developing open-source IoT frameworks, protocols, and hardware.
You, Me, & IoT MC Etherpad: https://etherpad.net/p/LPC2019_IoT
Alexandre Bailon presented the state of Greybus for IoT since he last demonstrated the subject at ELCE 2016. First, Project Ara (the modular phone) was discussed, followed by the network topology (SVC, Module, AP connected over UniPro) and then how Greybus for IoT took the same concepts and extended them over any network. The module structure was described: a Device includes at least one Interface, which contains one or more Bundles, which expose one or more Cports. Cports behave much like sockets in the sense that several of them can send and receive data concurrently and also operate on that data concurrently. Alexandre mentioned that Greybus has been in staging since 4.9, and that he was currently working to have his gb-netlink module merged upstream. More details about the application layer on top of Greybus and its advantages were discussed, i.e. the MCU software only needs to understand how to interact with the bus (i2c, gpio, spi), while the actual driver controlling the bus would live remotely on a Linux host. The device describes itself to Linux via a Manifest. Limitations were listed as well: one RPC message only focuses on one particular procedure, and performance can vary by network technology. Current issues are that it is unclear how to enable remote wake-up, that security and authentication are missing, and that it is not currently possible to pair e.g. a remote i2c device and a remote interrupt gpio (possibly to be solved by extending the manifest format to include key-value details / device tree). The possibility of running gb-netlink mostly in the kernel (after authentication and encryption have been set up) was also discussed as a potential improvement.
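The module structure described above (Device, Interface, Bundle, Cport) can be sketched as a small data model. This is only an illustration in Python: the class names, fields, and the `validate` and `protocols` helpers are hypothetical and do not reflect the actual Greybus manifest format or kernel structures.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CPort:
    cport_id: int
    protocol: str  # e.g. "i2c", "gpio", "spi"


@dataclass
class Bundle:
    bundle_id: int
    cports: List[CPort] = field(default_factory=list)


@dataclass
class Interface:
    bundles: List[Bundle] = field(default_factory=list)


@dataclass
class Device:
    name: str
    interfaces: List[Interface] = field(default_factory=list)

    def validate(self) -> None:
        """Enforce the 'at least one' rule at every level of the hierarchy."""
        if not self.interfaces:
            raise ValueError(f"{self.name}: needs at least one Interface")
        for interface in self.interfaces:
            if not interface.bundles:
                raise ValueError(f"{self.name}: Interface without Bundles")
            for bundle in interface.bundles:
                if not bundle.cports:
                    raise ValueError(f"{self.name}: Bundle without CPorts")

    def protocols(self) -> List[str]:
        """Sorted list of the bus protocols exposed through the CPorts."""
        return sorted({c.protocol
                       for i in self.interfaces
                       for b in i.bundles
                       for c in b.cports})


# A hypothetical sensor node exposing an I2C bus and a GPIO line.
sensor_node = Device("sensor-node", [
    Interface([Bundle(1, [CPort(1, "i2c"), CPort(2, "gpio")])]),
])
sensor_node.validate()
print(sensor_node.protocols())
```

In this picture the Linux host would attach one bus driver per CPort, while the microcontroller only forwards bus operations.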
Dr. Malini Bhandaru discussed the infrastructure challenges facing companies working with IoT. She pointed out that there are currently several firmware update solutions offering much of the same services (OSTree, Balena.io, SWUpdate, Swupd, Mender.io). There are also issues about what OS is running on the device. There is a distinct need for an update and configuration framework that can be hooked into. There was a question about whether this belonged in the kernel or user space in order to take privilege and consistency into account - policy must live in userspace. Further discussion highlighted that TPM devices would typically hold keys in a secure location. The IETF SUIT (Software Updates for Internet of Things) working group was mentioned. Malini suggested a command interface to implement the API, but such a command line / hook could violate the policies that distros already have.
Andreas Färber discussed LoRa (Long Range), going into detail about its physical layer (FSK, CSS - Chirp Spread Spectrum) in U-LPWA and Sub-GHz bands. He contrasted LPWAN (Low Power Wide Area Network) with LowPAN (Low-Power Personal Area Network). The tech allows for a long battery life (up to 10 years) with long transmission distances (up to 48km). Publicly documented APIs are used for modules (communicating over e.g. SPI / UART / USB). The LoRa effort within the Linux kernel is to ensure hardware works with generic enterprise distributions. Some form of socket interface is possible using a different protocol family (e.g. PF_PACKET / AF_LORA). It's possible that an approach similar to the Netlink interface for 802.15.4 / 802.11 could be used for LoRa as well. The outcome of the Netdev conference was to model a LoRaWAN soft-MAC similar to that of 802.15.4. An RFCv2 submission to the staging tree is planned.
Peter Robinson gave an overview of Linux IoT from the perspective of an enterprise distribution. U-Boot progress has been great (UEFI). However, there is a large difference between Enterprise / Industrial IoT and e.g. the R-Pi. He went on to point out that BlueZ has not had a release in 15 months. BBB wireless firmware is missing. Security fixes are not backported to pegged kernels. R-Pi wireless firmware does not work outside of Raspbian. Intel has regressed on wireless. Then, there was the issue about GPIO: everything still uses the sysfs interface, but it is deprecated. Nobody is using libgpiod. There are no node.js bindings. Adafruit has switched to the new API, but found some pain points with pull-up support and submitted a PR. The new GPIO API requires some slight paradigm shifts, and it was suggested that perhaps that should be something that libgpiod developers engage in (e.g. migration strategies). Some projects are using /dev/mem for userspace drivers, and that is very bad. There needs to be a consistent API across platforms. There is a lot of work to do to maintain a state of consistent usability and functional firmware - how do we distribute the work?
Stefan Schmidt discussed 6LowPan and the progress that has been made implementing various RFCs (e.g. header compression). Now there are several network and application layers using 6LowPan or another network layer atop 802.15.4. He discussed some hardware topologies - e.g. many hardware vendors prefer to have a network coprocessor that runs in a very low-power mode, while the Linux applications processor is put to sleep. The Linux approach is soft-mac (i.e. no firmware required). A notable open-hardware solution is ATUSB, which is again being made available from Osmocon. Link-layer security has been developed and tested by Fraunhofer. Some MAC-layer / userspace work is still to be done (e.g. beacon, scan, network management). Zigbee / 802.15.4 has lost some momentum, but is gaining traction in Industrial IoT. Is Wireless HART open? It is possibly used by unencrypted smart metering in the US / Canada.
Jason Kridner showed off some work that has been done at BeagleBoard.org involving Greybus for IoT. Unlike something like USB, which is a discoverable bus, many IoT busses are non-discoverable (I2C, SPI, GPIO). Greybus could be the silver bullet ("grey" bullet) to solve that problem. Furthermore, the difficulty of writing intelligent drivers and interacting with larger-scale networks can be moved from the device to a Linux machine like a BeagleBone Black, PocketBeagle, or Android device. Work on Greybus has been done to focus on IPv6 (e.g. 6LowPan) and to add strong authentication and encryption, effectively making the physical mediums arbitrary. The user experience will be that a sensor node is connected wirelessly, automatically detected via Avahi, and then advertises itself to the Linux host. A GSoC student has been working on Mikrobus Click support under Greybus, and has been fairly successful with GBSIM running on the BeagleBone Black and PocketBeagle. The idea is to get away from writing microcontroller firmware and to write drivers once and get them into the Linux kernel where they can be properly maintained and updated. The CC1352R LaunchXL development kit was shown to the audience since that was the platform that most of the work was done on, and a new open source hardware prototype board briefly made an appearance and was shown to be running the Zephyr Real-Time Operating System. Zephyr has gained so much momentum over the last few years and is playing a central role in where companies are focusing their development efforts. The network layer and HAL are fantastic, and they already have support for 802.15.4, BLE, and 6LowPan. There are still some improvements to be made with Greybus: network latency / variance - possibly reintroduce the time-sync code from the original Greybus implementation; extend the Greybus manifest to embed device-tree data as key-value properties.
The main purpose of the Linux Plumbers 2019 Live Patching microconference is to involve all stakeholders in open discussion about remaining issues that need to be solved in order to make live patching of the Linux kernel and the Linux userspace live patching feature complete.
The intention is to mainly focus on the features that have been proposed (some even with a preliminary implementation), but not yet finished, with the ultimate goal of sorting out the remaining issues.
This proposal follows up on the history of past LPC live patching microconferences that have been very useful and pushed the development forward a lot.
Currently proposed discussion/presentation topics (we've not gone through the "internal selection process" yet) with tentatively confirmed attendance:
- 5 min Intro - What happened in kernel live patching over the last year
- API for state changes made by callbacks 
- source-based livepatch creation tooling 
- klp-convert 
- livepatch developers guide
- userspace live patching
Jiri Kosina <email@example.com> and Josh Poimboeuf <firstname.lastname@example.org>
The live patching miniconference covered 9 topics overall.
What happened in kernel live patching over the last year
Led by Miroslav Benes. It was quite a natural followup to where we ended at the LPC 2018 miniconf, summarizing which of the points that have been agreed on back then have already been fully implemented, where obstacles have been encountered etc.
The most prominent feature that has been merged during the past year was "atomic replace", which allows for easier stacking of patches. This is especially useful for distros, as it naturally aligns with the way patches are being distributed by them. Another big step forward since the LPC 2018 miniconf was the addition of livepatching selftests, which have already helped tremendously in various cases, e.g. in tracking down quite a few issues during development of reliable stacktraces on s390. A proposal has been made that all major KLP features in the future should be accompanied by a selftest, which the audience agreed on.
One of last year's discussion topics / pain points was GCC optimizations that are not compatible with livepatching. GCC upstream now has the -flive-patching option, which disables all those interfering optimizations.
Rethinking late module patching
Led by Miroslav Benes again.
The problem statement is: when a patch is loaded for a module that is yet to be loaded, the module has to be patched before it starts executing. The current solution relies on hooks in the module loader, and the module is patched while it is being linked. It gets a bit nasty with the arch-specifics of the module loader handling all the relocations, patching of alternatives, etc. One of the issues is that all the paravirt / jump label patching has to be done after relocations are resolved, which is getting a bit fragile and hard to maintain.
Miroslav sketched out the possible solutions:
- livepatch would immediately load all the modules for which it has patch via dependency; half-loading modules (not promoting to final LIVE state)
- splitting the currently one big monolithic livepatch to a per-object structure; might cause issues with consistency model
- "blue sky" idea from Joe Lawrence: livepatch loaded modules, binary-patch .ko on disk, blacklist vulnerable version
Miroslav proposed to actually stick to the current solution, and improve selftests coverage for all the considered-fragile arch-specific module linking code hooks. The discussion then mostly focused, based on proposals from several attendees (most prominently Steven Rostedt and Amit Shah), on expanding on the "blue sky" idea.
The final proposal converged to having a separate .ko for livepatches that's installed on the disk along with the module. This addresses the module signature issue (as signature does not actually change), as well as module removal case (the case where a module was previously loaded while a livepatch is applied, and then later unloaded and reloaded). The slight downside is that this will require changes to the module loader to also look for livepatches when loading a module. When unloading the module, the livepatch module will also need to be unloaded. Steven approved of this approach over his previous suggestion.
Source-based livepatch creation tooling
Led by Nicolai Stange.
The primary objective of the session was the source-based creation of livepatches while avoiding the tedious (and error-prone) task of copying a lot of kernel code around (from the source tree to the livepatch). Nicolai spent part of last year writing the klp-ccp (KLP Copy and Paste) utility, which automates a big chunk of the process.
Nicolai then presented the still open issues with the tool and with the process around it, the most prominent ones being:
- obtaining original GCC commandline that was used to build the original kernel
- externalizability of static functions; we need to know whether GCC emitted static function into the patched object
Miroslav proposed extending the existing IPA dumping capability of GCC to also emit information about dead code elimination; DWARF information is guaranteed not to be reliable when it comes to IPA optimizations.
Objtool on power -- update
Led by Kamalesh Babulal.
Kamalesh reported that as a followup to last year's miniconference, the objtool support for powerpc actually came to life. It hasn't yet been posted upstream, but is currently available on github.
Kamalesh further reported that the decoder has basic functionality (stack operations + validation, branches, unreachable code, switch table (through gcc plugin), conditional branches, prologue sequences). It turns out that stack validation on powerpc is easier than on x86, as the ABI is much stricter there, which leaves the validation phase to mostly focus on hand-written assembly.
The next steps are to base the work on the arm64 objtool code, which has already abstracted out the arch-specific bits; further optimizations can be stacked on top of that (switch table detection, more testing, different gcc versions).
Do we need a Livepatch Developers Guide?
Led by Joe Lawrence.
Joe postulated that the current in-kernel documentation describes very well the individual features the infrastructure provides to the livepatch author. He further suggested also including something along the lines of what they currently have for kpatch, which takes a more general look from the point of view of a livepatch developer.
Proposals that have been brought up for discussion:
- collecting already existing CVE fixes and amending them with a lot of commentary
- creating a livepatch blog on people.kernel.org
Mark Brown asked for documenting what architectures need to implement in order to support livepatching.
Amit Shah asked if the 'kpatch' and 'kpatch-build' script/program could be renamed to 'livepatch'-friendly names so that kernel sources can also reference them for the user docs part of it.
Both Mark's and Amit's remarks have been considered very valid and useful, and agreement was reached that they will be taken care of.
API for state changes made by callbacks
Led by Petr Mladek.
Petr described his proposal for an API for changing, updating and disabling patches (by callbacks). Example where this was needed: the L1TF fix, which needed to change PTE semantics (particular bits). This can't be done before all the code understands this new PTE format/semantics. Therefore pre-patch and post-patch callbacks had to do the actual modifications to all the existing PTEs. What is also currently missing is tracking compatibilities / dependencies between individual livepatches.
Petr's proposal (v2) is already on the ML. struct klp_state is being introduced, which tracks the actual states of the patch. klp_is_patch_compatible() checks the compatibility of the current states with the states that the new livepatch is going to bring. No principal issues / objections have been raised, and it's appreciated by the patch author(s), so v3 will be submitted and pre-merge bikeshedding will start.
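The kind of check that klp_is_patch_compatible() performs can be modelled in a few lines. The following Python sketch is a simplified illustration, not the kernel implementation: it assumes that each state is identified by a numeric id and carries a version, that a new livepatch must not bring a lower version of a state that is already in use, and that a cumulative ("atomic replace") livepatch must know about every state in use. `Livepatch` and `is_patch_compatible` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Livepatch:
    name: str
    replace: bool = False  # cumulative ("atomic replace") patch
    states: Dict[int, int] = field(default_factory=dict)  # state id -> version


def is_patch_compatible(new: Livepatch, loaded: List[Livepatch]) -> bool:
    """Return True if `new` can be applied on top of the loaded patches."""
    for old in loaded:
        for state_id, old_version in old.states.items():
            new_version = new.states.get(state_id)
            if new_version is None:
                # A cumulative patch takes over from all loaded patches,
                # so it has to handle every state that is already in use.
                if new.replace:
                    return False
                continue
            if new_version < old_version:
                return False
    return True


# Hypothetical example: state 1 stands for "PTE format changed".
loaded = [Livepatch("l1tf-fix", states={1: 2})]
print(is_patch_compatible(Livepatch("newer", replace=True, states={1: 3}), loaded))
print(is_patch_compatible(Livepatch("older", replace=True, states={1: 1}), loaded))
```

A patch that does not touch state 1 at all would pass this check unless it is cumulative, in which case it would be rejected for dropping a state that the running system depends on.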
klp-convert and livepatch relocations
Led by Joe Lawrence.
Joe started the session with the problem statement: accessing non-exported / static symbols from inside the patch module. One possible workaround is doing it manually via kallsyms. The second workaround is klp-convert, which actually creates proper relocations inside the livepatch module from the symbol database during the final .ko link. Currently the module loader looks for special livepatch relocations and resolves those during runtime; kernel support for these relocations has so far been added for x86 only. Special livepatch relocations are supported and processed also on other architectures. Special quirks/sections are not yet supported. Plus klp-convert would still be needed even with the late module patching update.
vmlinux or modules could have ambiguous static symbols.
It turns out that the features / bugs below have to be resolved before we can claim the klp-convert support for relocation complete:
- handle all the corner cases (jump labels, static keys, ...) properly and have good regression tests in place
- one day we might (or might not) add support for out-of-tree modules which need klp-convert
- BFD bug 24456 (multiple relocations to the same .text section)
Making livepatching infrastructure better
Led by Kamalesh Babulal.
The primary goal of the discussion as presented by Kamalesh was simple: how to improve our testing coverage. Currently we have sample modules + kselftests. We seem to be currently missing specific unit cases and tests for corner cases. What Kamalesh would also like to see would be more stress testing oriented tests for the infrastructure. We should make sure that projects like kernelCI are running with CONFIG_LIVEPATCH=y.
Another thing Kamalesh currently sees as missing is failure test cases. It should be checked with the sosreport and supportconfig developers whether those diagnostic tools provide the necessary coverage of (at least) the livepatching sysfs state. This is especially a task for distro people to figure out.
Nicolai proposed as one of the testcases identity patching, as that should reveal issues directly in the infrastructure.
Open sourcing live patching services
Led by Alice Ferrazzi.
This session followed up on the previous suggestion of having a public repository for livepatches against LTS kernels. Alice reported on the improvement of elivepatch since last year: everything has been moved to Docker.
Alice proposed sharing livepatch sources more widely; SUSE does publish those, but it's important to mention that livepatches are very closely tied to a particular kernel version.
The Open Printing (OP) organisation works on the development of new printing architectures, technologies, printing infrastructure, and interface standards for Linux and Unix-style operating systems. OP collaborates with the IEEE-ISTO Printer Working Group (PWG) on IPP projects.
We maintain cups-filters which allows CUPS to be used on any Unix-based (non-macOS) system. Open Printing also maintains the Foomatic database which is a database-driven system for integrating free software printer drivers with CUPS under Unix. It supports every free software printer driver known to us and every printer known to work with these drivers.
Today it is very hard to think about printing in UNIX based OSs without the involvement of Open Printing. Open Printing has been successful in implementing driverless printing following the IPP standards proposed by the PWG as well.
- Working with SANE to make IPP scanning a reality. We need to make scanning work without device drivers similar to driverless printing.
- Common Print Dialog Backends.
- Printer/Scanner Applications - The new format for printer and scanner drivers. A simple daemon emulating a driverless IPP printer and/or scanner.
- The Future of Printer Setup Tools - IPP Driverless Printing and IPP System Service. Controlling tools like cups-browsed (or perhaps also the print dialog backends?) so that the user's print dialogs show only the relevant printers, or to create printer clusters.
- 3D Printing without the use of any slicer. A filter that can convert STL to G-code.
Till Kamppeter <email@example.com> or Aveek Basu <firstname.lastname@example.org>
The OpenPrinting MC kicked off with Aveek and Till speaking about the current status of printing.
Aveek explained driverless printing and how it has changed the way the world prints today. He explained in detail how driverless printing works on Linux today.
After talking about printing, he explained the problems with regards to scanning saying that scanning is not as smooth as printing in Linux. There is a lot of work that needs to be done on the scanning side to make driverless scanning in Linux a reality.
Till spoke about the current activities of OpenPrinting as an organisation.
Common Print Dialog Backends.
Rithvik talked about the Common Print Dialog Backends project. The Common Print Dialog project was created with the intention to provide a unified printing experience across Linux distributions. The proposed dialog would be independent of the print technology and the UI toolkit.
Later this was turned into the Common Print Dialog Backends project which abstracts away different print technologies from the print dialog itself with a D-Bus interface. Every print technology (CUPS, GCP etc.) has its own backend package. The user has to install the package corresponding to the print technology to access his printer and the dialog will use D-Bus messages to talk to the print technology thus decoupling the frontend development from the backend. Also if the user changes/upgrades his print technology in the future, all he has to do is to install the backend corresponding to his new print technology. The biggest challenge for the Common Print Dialog project is the adoption from major toolkit projects like Qt and GTK. There is no official API defined yet and there was a recommendation from the audience that an API be officially defined so that it will help the integration of the backends into UI toolkit projects like Qt, GTK etc. and other applications looking to add printing support.
IPP Printing & Scanning.
Aveek explained in detail how a driverless printer works using the IPP protocols. The mechanism works roughly as follows: once a host is connected to a network that has driverless printers, an mDNS query is sent out, to which the printers respond saying whether they support driverless printing. Based on that response, the list of printers is shown in the print dialog. Once the user selects a particular printer, the printer is queried for its job attributes. In response to this query, the printer sends the list of attributes that it supports. Depending on that, the supported features for that printer are listed in the UI, and it is then up to the user to select the features they want to use.
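The discovery-and-query flow described above can be sketched as a small simulation. This is only a sketch with no real network traffic: the `Printer` class, the `discover` and `build_dialog_options` functions, and the printer names are hypothetical and exist only for illustration. The attribute names (`sides-supported`, `media-supported`) follow IPP naming conventions.

```python
from dataclasses import dataclass, field


@dataclass
class Printer:
    """Hypothetical stand-in for a printer on the local network."""
    name: str
    driverless: bool
    attributes: dict = field(default_factory=dict)

    def answer_discovery(self):
        # Stands in for the printer's reply to the discovery query.
        return self.driverless

    def get_printer_attributes(self):
        # Stands in for the printer's reply to an attribute query.
        return dict(self.attributes)


def discover(network):
    """Return the printers that announce driverless support."""
    return [p for p in network if p.answer_discovery()]


def build_dialog_options(printer):
    """Turn '<feature>-supported' attributes into dialog options."""
    suffix = "-supported"
    options = {}
    for name, values in printer.get_printer_attributes().items():
        if name.endswith(suffix):
            options[name[:-len(suffix)]] = values
    return options


network = [
    Printer("office", True, {
        "sides-supported": ["one-sided", "two-sided-long-edge"],
        "media-supported": ["iso_a4_210x297mm", "na_letter_8.5x11in"],
    }),
    Printer("legacy", False),
]

for printer in discover(network):
    print(printer.name, build_dialog_options(printer))
```

Only the printer that answers the discovery query appears in the dialog, and the options offered to the user come entirely from what the printer reports, which is why no model-specific driver is needed on the client.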
This mechanism is available now for printing. However, the same is not the case for scanning, where we still have to use the age-old method of a driver. Contributions are required in this space to bring scanning on par with printing. The IPP standards have already been defined by the PWG. It is high time that manufacturers start producing hardware with full support for IPP driverless scanning, and that the scanning community makes the relevant changes on the software side.
Till gave an introduction to the future of drivers for printers, scanners, and multi-function devices: the Printer/Scanner Applications. On the printing side they replace the 1980s-era hack of using PPD (PostScript Printer Description) files to describe printer capabilities, even though most printers are not PostScript printers and the standard PDL (Page Description Language) is PDF nowadays.
They emulate a driverless IPP printer, answering requests for capability info from the client, taking jobs, filtering them to the printer's format, and sending them off to the printer. To clients they behave exactly like a driverless IPP printer, so CUPS will only have to deal with driverless IPP printers, and the PPD support, deprecated a decade ago, can finally get dropped. The Printer Applications also allow easy sandboxed packaging (Snap, Flatpak, …; the packages are OS-distribution-independent) as there are no files to be placed into the system's file system any more, only IP communication.
As the PWG has also created a standard for driverless IPP scanning, especially with printer/scanner multi-function devices in mind, we can do the same with scanning. This could replace SANE in the future, and in particular it allows complete multi-function-device drivers and sandboxed packaging of scanner drivers.
Printer/Scanner Applications can also provide configuration interfaces via PWG's configuration interface standard IPP System Service and/or web interface. This way one could for example implement a Printer Application for printers which cannot be auto-discovered (for example require entering the printer's IP or selecting the printer model) or one can initiate head cleaning and other maintenance tasks.
Printer/Scanner applications do not only need to be replacements for classic printer and scanner drivers but also can accommodate special tasks like IPP-over-USB (the already existing ippusbxd) or cups-browsed could be turned into a Printer Application (for clustering, legacy CUPS servers, …).
The Future of Printer Setup Tools - IPP.
In this section Till presented the situation for Printer Setup Tools. The current ones usually show a list of existing print queues, allow adding a queue for a detected or manually selected printer, and assign a driver to it. Default option settings can be selected on the existing print queues. Configuration of network printer hardware is done via the printer's web interface in a browser, not in the Printer Setup Tool.
In the future the tasks will change: thanks to IPP driverless printing, print queues are set up automatically, both for network and USB printers (IPP-over-USB, ippusbxd), so the classic add-printer task is needed less and less. Configuration of printer hardware is done via IPP System Service in the GUI of the Printer Setup Tool (replacing web admin interfaces).
If a printer needs a driver, a driver Snap needs to be installed to make the printer appear as a (locally emulated) driverless IPP printer. This could also be done by hardware association mechanisms in the Snap Store.
Another new task could be to configure printer clustering with cups-browsed, but if cups-browsed gets turned into a proper Printer Application with IPP System Service it is perhaps not actually needed.
The GUI interfaces needed in a modern Printer Management Tool would then be a queue overview with access to: default options, jobs, hardware config interface, hardware configuration via IPP System Service, and driver Snap search for non-driverless printers/scanners (as long as Snap Store apps do not have hardware association by themselves).
What we really need here are contributors for the new GUI components.
What Can We Change In 3D Print.
The main concept discussed was to develop a filter that can convert a 3D design into G-code. If this can be made possible, then there might be a chance to do away with the slicer. There was a good discussion on how and where to provide the functionalities of a slicer. Currently a slicer has a lot of functionalities, so there were questions such as how all of those would fit inside a filter.
There were talks about having a common PDL (Page Description Language) or ODL (Object Description Language) for 3D printers.
PWG has already defined the 3D printing standards.
The goal of the Toolchains Microconference is to focus on specific topics related to the GNU Toolchain and Clang/LLVM that have a direct impact in the development of the Linux kernel.
The intention is to have a very practical MC, where toolchain and kernel hackers can engage and, together:
- Identify problems, needs and challenges.
- Propose, discuss and agree on solutions for these specific problems.
- Coordinate on how to implement the solutions, in terms of interfaces, patch submissions, etc., in both kernel and toolchain components.
Consequently, we will discourage vague and general "presentations" in favor of concreteness and to-the-point discussions, encouraging the participation of everyone present.
Examples of topics to cover:
- Header harmonization between kernel and glibc.
- Wrapping syscalls in glibc.
- eBPF support in toolchains.
- Potential impact/benefit/detriment of recently developed GCC optimizations on the kernel.
- Kernel hot-patching and GCC.
- Online debugging information: CTF and BTF
Jose E. Marchesi <email@example.com> and Elena Zannoni <firstname.lastname@example.org>
Analyzing changes to the binary interface exposed by the Kernel to its modules
Dodji Seketeli talked about the analysis of the binary interface provided by object files, and how this is important to the Linux kernel, especially in the long term.
He remarked how the ABI checking utilities depend on good debugging formats, which must be complete enough to express the characteristics of the abstractions that make up the interface. The "compact" debugging formats used in the kernel (CTF, BTF) should therefore be able to express these characteristics if they are to be used by ABI checking utilities. A short discussion on what these minimal characteristics are followed the presentation.
Wrapping system calls in glibc
Maciej W. Rozycki and Dmitry Levin represented the glibc developers, particularly Florian Weimer, who is the person mainly working on these issues at the libc side.
Maciej started by laying out the motivations for having syscalls wrapped in glibc, which are quite obvious and agreed upon by both parties: portability, increased debuggability, etc. A short technical discussion on some particular issues followed: supporting old syscalls, multiplexing syscalls, management of ssize_t, size_t, off64_t and so on. Finally, the kernel hackers present expressed that they are happy about the recent developments on this matter, and some points about increasing the cooperation between both communities were discussed. This good perception on the kernel side is a big change compared to the situation at LPC 2018.
Security feature parity between GCC and Clang
Kees Cook started with a survey of toolchain-related security features that are relevant to the Linux kernel, and how they are implemented in both GCC and LLVM based compilers.
A list of particularly important features, and their impact on compilers and in some cases on the language specification, was discussed. The two toolchains were compared in terms of support for these features. Despite progress being made in recent times (particularly on the LLVM side), much work is still needed.
Update on the LLVM port of the Linux Kernel
Behan Webster summarized the current status of the on-going effort to build the Linux kernel using the LLVM toolchain.
He started by enumerating several reasons why such a port is beneficial for both the kernel and LLVM, and then discussed recently added support for several features needed by the kernel. Some continuous-integration and testing mechanisms used by the port effort were presented, as well as a new mailing list where the LLVM hackers working on the Linux port can be contacted.
Compact C Type Format Support in the GNU toolchain
Nick Alcock introduced the support for the CTF debugging format that is currently being worked on in the GNU toolchain.
He started by showing how the different toolchain components (compiler, linker, binary utilities) are being adapted to generate and consume CTF directly, thus eliminating the need of offline conversion utilities. He remarked how the ability of not having to generate DWARF as an intermediate step is of special interest for the kernel, where the debugging information is huge. He continued by showing how this CTF has been expanded from the classic Solaris CTF, which was severely limited in many ways, and compared it with both BTF and ORC. Several kernel areas that could benefit from using CTF were discussed with the kernel hackers: dwarves, the kernel backtracer, the kabi checker, etc.
eBPF support in the GNU Toolchain
Jose E. Marchesi talked about the recent addition of a BPF target to the GNU toolchain.
He started by explaining the current status of the project and its goals. A quick description of the characteristics of the BPF architecture, from a toolchain perspective, followed. Then several topics were raised for discussion with the kernel hackers present: the handling of the kernel helpers at the compile level, supporting an experimental BPF for testing and debugging purposes, the possibility of using the kernel verifier from userland, and simulation of BPF. Finally, it was discussed how to establish a useful coordination between LLVM, GCC and the BPF developers.
The Linux Plumbers 2019 Testing and Fuzzing track focuses on advancing the current state of testing of the Linux Kernel.
- Defragmentation of testing infrastructure: how can we combine testing infrastructure to avoid duplication.
- Better sanitizers: Tag-based KASAN, making KTSAN usable, etc.
- Better hardware testing, hardware sanitizers.
- Are fuzzers "solved"?
- Improving real-time testing.
- Using Clang for better testing coverage.
- Unit test framework. Content will most likely depend on the state of the patch series closer to the event.
- Future improvement for KernelCI. Bringing in functional tests? Improving the underlying infrastructure?
- Making KMSAN/KTSAN more usable.
- KASAN work in progress
- Syzkaller (+ fuzzing hardware interfaces)
- Stable tree (functional) testing
- KernelCI (autobisect + new testing suites + functional testing)
- Kernel selftests
Our objective is to gather leading developers of the kernel and its related testing infrastructure and utilities in an attempt to advance the state of the various utilities in use (and possibly unify some of them), and the overall testing infrastructure of the kernel. We are hopeful that we can build on the experience of the participants of this MC to create solid plans for the upcoming year.
Sasha Levin <email@example.com> and Dhaval Giani <firstname.lastname@example.org>
KernelCI: testing a broad variety of hardware
Kevin Hilman and Guillaume Tucker gave general background about the project and talked about its growth in the past few years. The KernelCI project's main purpose is to test a wide variety of hardware, to avoid rot on less-used hardware.
The goal of KernelCI is not only to act as a test platform for embedded devices, but also to be a platform for developers to integrate their tests into, being a generic test framework.
Currently KernelCI only does boot tests, but there is a very strong need to incorporate other testing beyond that, such as Kselftest and LTP. This will also enable things like KASAN, for example, to catch more bugs.
QEMU is great for catching things like early boot crashes, but there is a lot that QEMU doesn't emulate, such as drivers, so it is very valuable to test on "real" hardware. The kernel has the ability to emulate hardware, like the "vivid" driver in the media subsystem.
KernelCI is now a Linux Foundation project.
Dealing with complex test suites
Guillaume Tucker discussed making automated bisection work with complex test suites.
It's hard to bisect earlier kernel versions as they might fail for an unrelated issue (or even fail to boot).
Test cases can all start passing/failing independently, but git bisect tracks only good/bad rather than which test cases failed or passed. Using EzBench was looked into.
A new tool was written: scalpel. It aims to be better than git bisect, ready to become part of KernelCI.
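The limitation described above (one good/bad verdict per revision) can be illustrated with a sketch that bisects each test case independently while sharing suite results between them. This is not how scalpel works, which the summary does not describe; `bisect_per_test`, `run_suite`, and the revision names are hypothetical, and the sketch assumes a linear history in which every revision can be tested.

```python
def bisect_per_test(revisions, run_suite):
    """Return {test_name: first revision at which that test fails}.

    revisions: list ordered from oldest to newest.
    run_suite: callable taking a revision and returning {test_name: passed}.
    """
    cache = {}

    def results(index):
        # One suite run per revision serves every test case being bisected.
        if index not in cache:
            cache[index] = run_suite(revisions[index])
        return cache[index]

    last = len(revisions) - 1
    # Only bisect tests that pass on the oldest and fail on the newest revision.
    regressed = [
        test for test, passed in results(last).items()
        if not passed and results(0).get(test)
    ]

    first_bad = {}
    for test in regressed:
        good, bad = 0, last
        while bad - good > 1:
            mid = (good + bad) // 2
            if results(mid)[test]:
                good = mid
            else:
                bad = mid
        first_bad[test] = revisions[bad]
    return first_bad


if __name__ == "__main__":
    revs = ["r%d" % i for i in range(8)]

    def fake_suite(rev):
        # Hypothetical suite: test_a breaks at r3, test_b at r6.
        i = int(rev[1:])
        return {"test_a": i < 3, "test_b": i < 6, "test_c": True}

    print(bisect_per_test(revs, fake_suite))
```

A plain good/bad bisection over the same history would find only one of the two regressions, depending on how "bad" is defined for the suite as a whole.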
Dmitry Vyukov discussed a new method to utilize tools such as KASAN in production systems.
A bug that manifests itself in production is usually more important than a theoretical bug found on a test system.
But we can't use KASAN in production, as its overhead is too big to make that feasible.
The solution is to use a combination of electric fence and sampling: sample every n-th allocation, surround it with guard pages, and set the sampling rate low enough.
Because of the sampling, this only works across a large number of machines.
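Why sampling needs a large fleet can be seen with a little arithmetic. The sketch below assumes a simplified model that is not from the talk: each occurrence of a buggy access hits a guarded (sampled) allocation independently with probability 1/n. The function names and the numbers are hypothetical.

```python
def sampled_allocations(total, sample_every):
    """Indices (1-based) of the allocations that get guard pages."""
    return [i for i in range(1, total + 1) if i % sample_every == 0]


def detection_probability(sample_every, occurrences):
    """Chance that at least one of `occurrences` buggy accesses is caught.

    Simplified model: every occurrence independently lands on a sampled
    allocation with probability 1 / sample_every.
    """
    p = 1.0 / sample_every
    return 1.0 - (1.0 - p) ** occurrences


if __name__ == "__main__":
    print(sampled_allocations(10, 4))
    for occurrences in (1, 100, 10_000):
        print(occurrences, detection_probability(1000, occurrences))
```

Under this model a single machine that triggers the bug once is very unlikely to catch it, while many occurrences spread over many machines make detection almost certain, even at a sampling rate low enough to keep the overhead negligible.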
Fighting uninitialized memory in the kernel
Alexander Potapenko went through the history of KMSAN and the recent changes that went into it: a stable compiler interface, support for vmalloc(), and a bunch of bug fixes.
In the past two years, KMSAN has found 150 bugs, 42 of which are still open. Fixed bugs include infoleaks (21), KVM bugs (5), and network bugs (86).
To help fix this class of bugs, we're considering initializing all memory by default; this is already done in Windows, for example. Benchmarking the cost is hard and highly depends on the workload.
Dmitry Vyukov gave a short update about syzbot and syzkaller.
syzbot has reported 3 bugs per day for the past two years; of those, 2 bugs per day were fixed. syzbot now uses kmemleak to detect memory leaks. Although this is slow and produces quite a few false positives, it is still worth it, and there are workarounds for some of the issues (for example, reporting only reproducible leaks).
syzbot also now uses fault injection.
syzkaller has a new capability to fuzz the USB interface. It has found 250 bugs so far and has only just scratched the surface (there are 8400 different device IDs?).
Bisection is difficult because even with a reproducer, something different always fails on older kernel versions.
Collaboration/unification around unit testing frameworks
Knut Omang discussed recent developments around introducing a unit test framework into the kernel.
There is strong agreement that such a framework is needed, but it's still an open question what it will look like.
Major goals are:
- ease of use
- unified framework
- good support in the kernel's build system
There was a good discussion around different test frameworks and unification plans (watch the video!).
All about Kselftest
Shuah Khan presented the recent developments around Kselftest and went through its recent growth and current challenges.
There are different opinions as to which version of Kselftest should run on older (stable/LTS) kernels; the suggestion at this point is that Kselftest from the older kernel should be run on that kernel.
We want to integrate Kselftest into our various test frameworks - we want KernelCI to run Kselftest for example.
Since 2004 a project has improved the real-time and low-latency features for Linux. This project has become known as PREEMPT_RT, formerly the real-time patch. Over the past decade, many parts of PREEMPT_RT became part of the official Linux code base. Examples of what came from PREEMPT_RT include: real-time mutexes, high-resolution timers, lockdep, ftrace, RT scheduling, SCHED_DEADLINE, RCU_PREEMPT, generic interrupts, priority inheritance futexes, threaded interrupt handlers and more. The number of patches that need integration has been reduced from previous years, and the pieces left are now mature enough to make their way into mainline Linux. This year could possibly be the year PREEMPT_RT is merged (tm)!
In the final lap of this race, the last patches are on the way to be merged, but there are still some pieces missing. When the merge occurs, PREEMPT_RT will start to follow a new pace: the Linus one. So, it is possible to raise the following discussions:
- The status of the merge, and how can we resolve the last issues that block the merge;
- How can we improve the testing of the -rt, to follow the problems raised as Linus's tree advances;
- What's next?
- Real-time Containers
- Proxy execution discussion
- Merge - what is missing and who can help?
- Rework of softirq - what is needed for the -rt merge
- An in-kernel view of Latency
- Ongoing work on RCU that impacts per-cpu threads
- How BPF can influence the PREEMPT_RT kernel latency
- Core-schedule and the RT schedulers
- Stable maintainers tools discussion & improvements.
- Improvements on full CPU isolation
- What tools can we add into tools/ that other kernel developers can use to test and learn about PREEMPT_RT?
- What tests can we add to tools/testing/selftests?
- New tools for timing regression test, e.g. locking, overheads...
- What kernel boot self-tests can be added?
- Discuss various types of failures that can happen with PREEMPT_RT that normally would not happen in the vanilla kernel, e.g., with lockdep, preemption model.
The continuation of the discussion of topics from last year's microconference, including the development done during this (almost) year, is also welcome!
Daniel Bristot de Oliveira <email@example.com>
Core Scheduling for RT
Speaker: Peter Zijlstra
The L1TF/MDS vulnerabilities make it unsafe to use HT when two applications that do not trust each other share the same core. The current solution for this problem is to disable HT. Core scheduling serves to allow SMT when it is safe, for instance, if two threads trust each other. SMT/HT is not always good for determinism, as the execution speed of individual hardware threads can vary a lot. Core scheduling can be used to force-idle siblings for RT tasks while allowing non-RT tasks to use all threads of the same core (when safe). Core scheduling will work for any task (not just RT) and is currently implemented via cgroups, as Peter figured it was the most natural way. People want this for various reasons, so it will eventually be merged. Regarding real-time schedulers, SCHED_DEADLINE's admission control could allow tasks only on a single thread of a core by limiting the amount of runtime/period available in the system. But if we allow trusted threads to share the CPU time, we can double the amount available for that core. Thomas suggested making the permission to use multiple threads (SMT) part of setting the scheduling policy. Load balancing might start to be tricky since it doesn't currently take into account where bandwidth is scheduled. Luca's capacity awareness might help there. RT_RUNTIME_SHARE has been removed from RT and should have been removed from upstream. RT_THROTTLING goes away once the server is available.
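The runtime/period argument above can be made concrete with a simplified utilization-based admission test. This is only a sketch: the `admit` function and its `usable_threads` parameter are hypothetical, and the kernel's actual SCHED_DEADLINE admission control (with its global bandwidth limits) is more involved.

```python
from fractions import Fraction


def admit(tasks, new_task, usable_threads):
    """Simplified admission test for one core.

    tasks and new_task are (runtime, period) pairs in the same time unit.
    usable_threads is 1 if only one SMT thread of the core may run
    deadline tasks, or 2 if trusted tasks may share both siblings.
    """
    total = sum(Fraction(runtime, period)
                for runtime, period in list(tasks) + [new_task])
    return total <= usable_threads


if __name__ == "__main__":
    existing = [(6, 10)]   # 60% of one hardware thread
    candidate = (6, 10)    # another 60%
    print(admit(existing, candidate, usable_threads=1))
    print(admit(existing, candidate, usable_threads=2))
```

With a single usable thread the second task pushes total utilization to 120% and is rejected; allowing trusted tasks onto both siblings doubles the available bandwidth and the same task set is admitted.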
RCU configuration, operation, and upcoming changes for real-time workloads
Speaker: Paul McKenney
RCU callback self-throttling: Could we prevent somebody from queueing callbacks while servicing them? It would be valuable, but there are valid use cases (e.g., refcounting). One approach is to slowly increase the limit on how many callbacks to invoke, until it is large enough to do them all (Eric Dumazet's patch). For RCU callback offloading, if CPU0 has its thread managing the callbacks for other CPUs, but then CPU0 starts calling a bunch of callbacks, it can cause the manager thread to become too busy processing its own callbacks, and this starves the callbacks for the other threads. One change is to create another thread that handles the grace periods, so that no thread has the dual role of managing the callbacks and also invoking them.
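The idea of slowly raising the per-pass callback limit can be illustrated with a toy model. This is not Eric Dumazet's patch or the kernel's logic; the `drain` function, the doubling policy, and all numbers are assumptions made purely for illustration.

```python
def drain(backlog, initial_limit=10, growth=2, max_limit=10_000):
    """Return the number of callbacks invoked in each pass.

    The limit starts small (to bound latency) and grows while a backlog
    remains, so a large backlog is eventually drained in big batches.
    """
    limit = initial_limit
    passes = []
    while backlog > 0:
        invoked = min(backlog, limit)
        backlog -= invoked
        passes.append(invoked)
        if backlog > 0:
            limit = min(limit * growth, max_limit)
    return passes


if __name__ == "__main__":
    print(drain(100))
```

A short backlog is handled within the small initial limit, while a long one triggers progressively larger batches instead of being processed at a fixed small rate.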
Peter suggested running rcuogp0 at a slightly higher priority than rcuop0 so that it can always handle grace periods while the rcuop threads are invoking callbacks. It should be verified whether having a single rcuogp0 thread, pinned to a CPU, is not better than the current approach, which creates sqrt(max available CPUs) threads. The current impression is that on large systems, the cost of waking up all the threads that should handle callbacks would prevent the single rcuogp0 thread from tackling (observing) grace periods that could be expiring. The number of GP threads uses the total number of CPUs as the basis of the calculation, so that the maximum would be sqrt(total-CPUs). One issue is that if the admin offloads all CPUs to one CPU and then that CPU gets overloaded enough that it can't keep up, the system can OOM (as the garbage collection becomes too slow and we use up all memory).
- First choice: OOM
- Second choice: print a warning during OOM
- Third: detect offloading issues and delay call_rcu()
- Fourth: detect offloading issues and stop offloading (ignoring what the admin asked for)
If #4 is implemented, probably need a command line parameter to enable it as not everyone will want that (some prefer to OOM in this case).
Mathematizing the latency
Speaker: Daniel Bristot de Oliveira
For most people, it is not clear what the latency components are. Daniel proposes to improve the situation by breaking the latency into independent variables and applying measurements or probabilistic methods to get values for each variable. The individual variables are then summed up to determine the possible worst case.
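The decomposition idea can be sketched as follows. The component names and sample values are hypothetical and are not taken from the talk; the sketch only shows how per-variable worst cases, obtained by measurement here, combine into an overall bound when the variables are treated as independent.

```python
def worst_case_latency(components):
    """Sum the worst observed value of each independent latency variable.

    components maps a variable name to a list of measured values
    (all in the same unit, e.g. microseconds).
    """
    return sum(max(samples) for samples in components.values())


if __name__ == "__main__":
    # Hypothetical measurements in microseconds.
    measured = {
        "interrupts_disabled": [3, 12, 7],
        "preemption_disabled": [20, 35, 28],
        "scheduler_overhead": [4, 5, 4],
    }
    print(worst_case_latency(measured))
```

The sum of the per-variable maxima can exceed any latency actually observed end to end, because the individual worst cases rarely coincide in a single measurement; that is what makes it usable as a possible worst case.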
How do we define latency? As the time until the context switch takes place, or until the point at which the scheduled process effectively starts running? Daniel proposed the context switch, but Thomas said that it is when the process effectively starts running. Daniel then pointed to the return of the scheduler. Peter Zijlstra then noted that the most important thing is to define the model, with the individual pieces, and then we will be able to see things more clearly. The main problem comes with accounting for interrupts. There's no way to timestamp when the actual hard interrupt triggered in the hardware. One may be able to use something like a timestamped networking packet to infer the time, but there's currently no utility to do so. Also, interrupts are not periodic, so it is hard to define their behavior with the simple models used in the RT literature. However, the IRQ prediction used for the idle duration estimation in the mainline kernel is a good starting point for analysis.
Real Time Softirq Mainlining
Speaker: Frederic Weisbecker
The softirq design hasn't changed in about three decades and is full of hacks (Frederic showed some of them in the slides).
Softirq is now annoying for latency-sensitive tasks, as users want a higher-priority softirq to be able to interrupt other softirqs in RT. Thomas added that this applies not only to -rt people, but to vanilla as well. Mainline kernel people want to find a balance between softirq processing and handling the interrupt itself (the networking case). Currently, the RT kernel has one softirq kthread. In the past, multiple kthreads were tried, but that ran into issues.
Making softirq disabling more fine-grained (as opposed to all on/off) is a wish, and it makes sense outside of RT as well. The question is: do softirqs share data among themselves? We do not know, so we don't know what will break, which makes it a real open problem.
The following points were raised:
- Problem: if softirqs run on return from interrupt, you will never get anything done. If you push them out to other threads, it is perfectly fine for networking, but it breaks others.
- For mainline, it would be to have a ksoftirq pending mask. Thomas agrees but adds that then people would like to have a way to go back to that model in RT, where we have a thread per vector.
- RCU should go out of softirqd.
- Good use case: networking has conflicts with block/fs today, and this should solve a lot of it - long term process, good for Frederic's patchcount.
- lockdep support is a big chunk.
- Some drivers might never get converted, which is the same problem we had with BKL. We need to work with the subsystem experts and fix them.
Speaker: Frederic Weisbecker
Some users are requesting Full Dynticks for full CPU isolation. Their use case usually involves staying in userspace polling on devices like PCI or networking, not only on “bare metal,” but also in virtual machines, where people want to run the kernel-rt in the host, plus a VM running the kernel-rt while polling inside the VM. They want both host and guest to not have ticks.
Currently, there is one tick every 4 seconds for the timer watchdog, which is not exactly a tick. The tsc=reliable option on the kernel command line helps to reduce the tick, but above two sockets, it is a lie (Steve showed a graph of this happening).
For Full Dynticks to work, it requires:
- fixing /proc/stat for NOHZ.
- Appropriate RCU lifecycle for a task
- Clean up code in tick-sched
Another suggestion was to make nohz_full mutable via cpusets, but that is black magic!
PREEMPT_RT: status and Q&A
Speaker: Thomas Gleixner
CONFIG_PREEMPT_RT switch is now in the mainline but is not functional yet. It took 15 yrs on LKML and 20 yrs for Thomas.
There are still a few details to be hashed out for the final merge, but having the CONFIG_PREEMPT_RT switch in mainline helps, because people did not want to change code without knowing whether PREEMPT_RT was going mainline. There is a lot of stuff queued for v5.4, taking a significant amount out of the RT patchset, including an outstanding cleanup of printk (discussed in a BoF, including Linus). There is also a more substantial chunk to land with printk, hopefully for 5.5.
Q: What do you expect from the new mainline kernel with RT merged but without RT enabled?
A: It should be the same.
Q: Once the mainline kernel builds a functional PREEMPT_RT, what's next?
A: Fix the functionalities that are disabled with RT and that people want; the main one is eBPF.
BPF relies heavily on preempt_disable, mainly because of the spinlocks embedded in the bytecode. Nowadays, the bytecode is small and should not affect the latency too much, but there are already plans to accept more extensive code, causing unacceptable latencies for RT.
The problem with preempt_disable and friends is that they are "scope less," making it hard to define what they are protecting. So a possible solution is to create specific protections, e.g., for eBPF. A candidate is the usage of the local lock. On non-rt, it allocates zero space. On RT, it allocates a lock and then behaves semantically like preempt_disable. And then in RT, you have scope to know what is being protected. Once you have a scope, lockdep will work. (percpu variable access bug detected by the local lock).
Databases utilize and depend on a variety of kernel interfaces and are critically dependent on their specification, conformance to specification, and performance. Failure in any of these results in data loss, loss of revenue, or a degraded experience or, if discovered early, software debt. Specific interfaces can also remove small or large parts of user space code, creating greater efficiencies.
This microconference will get a group of database developers together to talk about how their databases work, along with kernel developers currently developing a particular database-focused technology to talk about its interfaces and intended use.
Database developers are expected to cover:
- The architecture of their database;
- The kernel interfaces utilized, particularly those critical to performance and integrity
- What is a general performance profile of their database with respect to kernel interfaces;
- What kernel difficulties they have experienced;
- What kernel interfaces are particularly useful;
- What kernel interfaces would have been nice to use, but were discounted for a particular reason;
- Particular pieces of their codebase that have convoluted implementations due to missing syscalls; and
- The direction of database development and what interfaces to newer hardware, like NVDIMM, atomic write storage, would be desirable.
The aim for kernel developers attending is to:
- Gain a relationship with database developers;
- Understand where in development kernel code they will need additional input by database developers;
- Gain an understanding on how to run database performance tests (or at least who to ask);
- Gain appreciation for previous work that has been useful; and
- Gain an understanding of what would be useful aspects to improve.
The aim for database developers attending is to:
- Gain an understanding of who is implementing the functionality they need;
- Gain an understanding of kernel development;
- Learn about kernel features that exist, and how they can be incorporated into their implementation; and
- Learn how to run a test on a new kernel feature.
Daniel Black <firstname.lastname@example.org>
io_uring was the initial topic, both to gauge its maturity and to examine whether there were existing performance problems. A known problem of buffered writes triggering stalls was raised; it is already being worked on by separating writes into multiple queues/locks. Existing tests showed comparable read performance for both O_DIRECT and buffered I/O. MySQL currently shows performance twice as bad; however, it is hoped that this is due to the known issue.
Write barriers are needed when writing the record that reverses a partial transaction: it must be written before the tablespace changes so that, in the case of power failure, the partial transaction can be reversed to preserve the Atomicity principle. The crux of the problem is that a write needs to be durable on disk (as if fsynced) before another write. SCSI standards contain an option for this, but it has never been implemented, and for the most part no hardware-level support exists. While the existing userspace implementation uses fsync, it has considerable overhead and it ensures that all pending writes to the file are synced when only one aspect is needed. The way forward seems to be to use/extend the chained write approach in io_uring.
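The fsync-based ordering that userspace relies on today can be sketched as follows. This is a minimal illustration and not any particular database's implementation: the `ordered_write` function, the file names, and the record contents are hypothetical, and short writes are not handled.

```python
import os


def ordered_write(undo_path, undo_record, data_path, offset, new_bytes):
    """Make the undo record durable before touching the tablespace."""
    # Step 1: append the undo record and force it to stable storage.
    fd = os.open(undo_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, undo_record)
        os.fsync(fd)  # acts as the barrier: nothing below starts before this returns
    finally:
        os.close(fd)

    # Step 2: only now modify the tablespace in place.
    fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.pwrite(fd, new_bytes, offset)
        os.fsync(fd)
    finally:
        os.close(fd)


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as directory:
        undo = os.path.join(directory, "undo.log")
        data = os.path.join(directory, "tablespace")
        ordered_write(undo, b"undo:page7", data, 0, b"new page contents")
        print(os.path.getsize(undo), os.path.getsize(data))
```

The first `fsync` is the expensive part discussed above: it flushes every pending write to the undo file and blocks the caller, even though all that is needed is an ordering guarantee between the two writes.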
O_ATOMIC, the promise of write all or nothing (with the existing block remaining intact), was presented as a requirement. We examined cases of existing hardware support by Shannon, FusionIO and Google storage; all have different access mechanisms, the support isn't discoverable, and it is rather dependent on the filesystem implementation. The current userspace workaround is to write the same data twice to ensure a torn write doesn't occur. There may be a path forward by using NVDIMM as a staging area attached to the side of a filesystem. XFS has a copy-on-write mechanism that is a work in progress, and it's currently recommended to wait for this for database workloads (assumed to be: [ioctl_ficlonerange|http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html]).
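The double-write workaround can be sketched as follows. This is a simplified illustration and not the implementation of any particular database: the page size, the header layout (CRC32 plus page number) and the single-slot staging file are all assumptions made for the example.

```python
import os
import zlib

PAGE_SIZE = 4096  # assumed page size for this sketch
HEADER = 8        # assumed header: 4-byte CRC32 followed by 4-byte page number


def build_page(page_no, payload):
    """Return a full page: CRC32 | page number | zero-padded payload."""
    if len(payload) > PAGE_SIZE - HEADER:
        raise ValueError("payload does not fit in one page")
    body = page_no.to_bytes(4, "little") + payload.ljust(PAGE_SIZE - HEADER, b"\0")
    return zlib.crc32(body).to_bytes(4, "little") + body


def is_intact(page, page_no):
    """A page is intact if it is complete, is the right page, and its CRC matches."""
    if len(page) != PAGE_SIZE:
        return False
    stored_crc = int.from_bytes(page[:4], "little")
    stored_no = int.from_bytes(page[4:8], "little")
    return stored_no == page_no and zlib.crc32(page[4:]) == stored_crc


def doublewrite_page(dw_fd, data_fd, page_no, payload):
    """Write the page to the staging file first, then to its final location."""
    page = build_page(page_no, payload)
    os.pwrite(dw_fd, page, 0)
    os.fsync(dw_fd)  # the staged copy must be durable before the in-place write
    os.pwrite(data_fd, page, page_no * PAGE_SIZE)
    os.fsync(data_fd)


def recover_page(dw_fd, data_fd, page_no):
    """After a crash, repair a torn page from the staged copy if needed."""
    page = os.pread(data_fd, PAGE_SIZE, page_no * PAGE_SIZE)
    if is_intact(page, page_no):
        return page
    staged = os.pread(dw_fd, PAGE_SIZE, 0)
    if not is_intact(staged, page_no):
        raise IOError("neither copy of the page is intact")
    os.pwrite(data_fd, staged, page_no * PAGE_SIZE)
    os.fsync(data_fd)
    return staged
```

Every page is written and synced twice, which is the overhead an atomic write guarantee from the storage stack would remove.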
XFS / Ext4 behaviours that exhibited greater throughput with more writes were discussed. Some theories for this were proposed, but without more data it was hard to say. There were a number of previous bugs where an increase in hardware performance resulted in bigger queues and decreased performance. A full bisection between kernel versions to identify a commit was suggested. There were some correctness aspects fixed between the 4.4 version and the current version, but these may need to be reexamined. Quite possibly two effects were in play. Off-CPU analysis, in particular using an eBPF-based sampling mechanism proposed by Josef in the discussion, would result in better identification of which threads are waiting and where. A later discussion covered differences between unix and tcp loopback implementations and the performance of sockets where there was 0 backlog (gained 15% performance); this also needs a similar level of probing/measurement to be actionable.
SQLite and the IO Errors discussions covered a large gap in the POSIX specification of what can happen on errors. An example is an experimentally found chain of 6 syscalls that seemed to be required to reliably rename a file. A document describing what is needed to perform some basic tasks, and that results in uniform behaviour across filesystems, would alleviate much frustration, guesswork and disappointment. Attendees were invited to ask a set of specific questions on email@example.com where they could be answered and pushed into documentation enhancements.
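The exact chain of 6 syscalls was not recorded in these notes. As a rough illustration of the kind of sequence involved, the sketch below shows a commonly used pattern for durably replacing a file: write a temporary file, fsync it, rename it over the target, then fsync the containing directory. The function name and the ".tmp" suffix are hypothetical, and whether this sequence is sufficient on every filesystem is precisely the documentation gap discussed above.

```python
import os


def replace_file_durably(path, data):
    """Replace `path` with `data` using a write, fsync, rename, fsync(dir) sequence."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temporary name

    # Write the new contents to a temporary file and make them durable.
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        view = memoryview(data)
        while len(view) > 0:  # retry on short writes
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)

    # Atomically switch the name over to the new contents.
    os.rename(tmp_path, path)

    # Make the directory entry change itself durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```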
The reimplementation of the MySQL redo log reduced the number of mutexes; however, it left gaps in the synchronization mechanism between worker threads. The use of spin vs mutex locks to synchronise between stages was lacking in some APIs. Waiman Long, in the talk Efficient Userspace Optimistic Spinning Locks, presented some options later that day (however, saturation test cases were lacking).
Syscall overhead and CPU cache issues weren't covered in time; however, some of this was addressed in Tuesday's Tracing BoF and other LPC sessions.
The LWN article https://lwn.net/Articles/799807/ covers SQLite and Postgresql IO Errors topics in more detail.
All of the topics presented cover the needs of database implementers present in the discussion. Many thanks to our userspace friends:
Sergei Golubchik (MariaDB)
Dimitri Kravtchuk (MySQL)
Pawel Olchawa (MySQL)
Richard Hipp (SQLite)
Andres Freund (Postgresql)
Tomas Vondra (Postgresql)
Josef Ahmad (MongoDB)
Following the success of the past 3 years at LPC, we would like to see a 4th RDMA (Remote Direct Memory Access networking) microconference this year. The meetings in the last conferences have seen significant improvements to the RDMA subsystem merged over the years: new user API, container support, testability/syzkaller, system bootup, Soft iWarp, etc.
In Vancouver, the RDMA track hosted some core kernel discussions on get_user_pages that is starting to see its solution merged. We expect that again RDMA will be the natural microconf to hold these quasi-mm discussions at LPC.
This year there remain difficult open issues that need resolution:
- RDMA and PCI peer to peer for GPU and NVMe applications, including HMM and DMABUF topics
- RDMA and DAX (carry over from LSF/MM)
- Final pieces to complete the container work
- Contiguous system memory allocations for userspace (unresolved from 2017)
- Shared protection domains and memory registrations
- NVMe offload
- Integration of HMM and ODP
And several new developing areas of interest:
- Multi-vendor virtualized 'virtio' RDMA
- Non-standard driver features and their impact on the design of the subsystem
- Encrypted RDMA traffic
- Rework and simplification of the driver API
Leon Romanovsky <firstname.lastname@example.org>, Jason Gunthorpe <email@example.com>
GUP and ZONE_DEVICE pages
Speakers: Jason Gunthorpe, John Hubbard and Don Dutile
- Make the interface to use p2p mechanism be via sysfs. (PCI???).
- Try to kill PTE flag for dev memory to make it easier to support on things like s390.
- s390 will have mapping issues, arm/x86/PowerPC should be fine.
- Looking to map partial BARs so they can be partitioned between different users.
- Total BAR space could exceed 1TB in some scenarios (lots of GPUs in an HPC machine with persistent memory, etc.).
- Initially use struct page element but try to remove it later.
- Unlikely to be able to remove struct page, so maybe make it less painful by doing something like forcing all zone mappings to use hugepages.
- ioctl no, sysfs yes.
- PCI SIG talking about peer-2-peer too.
- Distance might not be the best function name for the pci p2p checking function.
- Conceptually, looking for new type of page fault, DMA fault, that will make a page visible to DMA even if we don’t care if it’s visible to the CPU.
- GUP API makes really weak promise, no one could possibly think that it’s that weak, so everyone assumed it was stronger; they were wrong. It really is that weak.
- Wrappers around the GUP flags? 17+ flags currently, combinational matrix is extreme, some internal only flags can be abused by callers.
- Possible to set "opposite" GUP flags.
- Most (if not all) out of core code (drivers) get_user_pages users need same flags.
RDMA, File Systems, and DAX 
Speaker: Ira Weiny
- There was a bug in previous versions of patch set. It’s fixed.
- New file_pin object to track relationship between mmaped files and DMA mappings to the underlying pages.
- If the owner of a lease tries to do something that requires changes to the file layout: deadlock of the application (current patch set, but not settled).
- Write lease/fallocate/downgrade to read/unbreakable lease - fix race issue with fallocate and lease chicken and egg problem.
Discussion about IBNBD/IBTRS, upstreaming and action items
Speakers: Jinpu Wang, Danil Kipnis
- IBTRS is a standalone transfer engine that can be used with any ULP.
- IBTRS only uses RDMA_WRITE with IMM and so is limited to fabrics that support this.
- Server does not unmap after write from client so data can change when the server is flushing to disk.
- Need to think about transfer model as the current one appears to be vulnerable to a nefarious kernel module.
- It is worth considering merging the 4 kernel modules into 2: one responsible for transfer (server + client) and another responsible for block operations.
- Security concerns should be cleared first, before in-depth review.
- No objections to see IBTRS in kernel, but needs to be renamed to something more general, because it works on many fabrics and not only IB.
Improving RDMA performance through the use of contiguous memory and larger pages for files
Speaker: Christopher Lameter
- The main problem is that contiguous physical memory is a limited resource in real-life systems. The difference in system performance is so visible that it is worth rebooting servers every couple of days (depending on workload).
- The reason for it is the existence of unmovable pages.
- HugePages help, but pinned objects over time end up breaking up the huge pages and eventually the system slows down.
- Need movable objects: dentry and inode are the big culprits.
- A typical use case that triggers the degradation is copying both very large and very small files on the same machine.
- Attempts to allocate unmovable pages in a specific place lead to situations where the system experiences OOM despite there being enough memory.
- x86 has 4K page size, while PowerPC has 64K. The bigger page size gives better performance, but wastes more memory for small objects.
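The page-size trade-off in the last point can be made concrete with a small calculation. This sketch assumes each object occupies a whole number of pages (as a file in the page cache would); the object sizes are arbitrary examples, not figures from the session.

```python
def pages_needed(size_bytes, page_size):
    """Number of whole pages needed to hold an object (ceiling division)."""
    return -(-size_bytes // page_size)


def wasted_bytes(size_bytes, page_size):
    """Bytes allocated but unused when the object is rounded up to whole pages."""
    return pages_needed(size_bytes, page_size) * page_size - size_bytes


if __name__ == "__main__":
    for page_size in (4 * 1024, 64 * 1024):
        for size in (100, 5000, 1024 * 1024):
            print(f"page={page_size:6d} object={size:8d} "
                  f"pages={pages_needed(size, page_size):4d} "
                  f"wasted={wasted_bytes(size, page_size):6d}")
```

A 100-byte object wastes 3996 bytes with 4K pages and 65436 bytes with 64K pages, while a 1 MiB object wastes nothing with either; the larger page needs 16 pages instead of 256 to cover it.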
Shared IB objects
Speaker: Yuval Shaia
- There was lively discussion between various models of sharing objects: through file descriptor, uverbs context, or PD.
- People would like to stick to the file handle model, where you share the file handle and get everything you need, as being the simplest approach.
- Is the security model resolved? Right now, the model assumes that only trusted processes are allowed to share.
- The simple (FD) model creates a challenge to properly release HW objects after the main process exits and leaves behind HW objects which were in use by itself and not by the shared processes.
- Refcount needs to be in the API to track when the shared object is freeable
- API requires shared memory first, then import PD and import MR. This model (as opposed to sharing the fd of the in context), allows for safe cleanup on process death without interfering with other users of the shared PD/MR.
The Linux Plumbers 2019 Scheduler Microconference is meant to cover any scheduler topics other than real-time scheduling.
- Load Balancer Rework - prototype
- Idle Balance optimizations
- Flattening the group scheduling hierarchy
- Core scheduling
- Proxy Execution for CFS
- Improving scheduling latency with SCHED_IDLE task
- Scheduler tunables - Mobile vs Server
- Remove the tick (NO_HZ and NO_HZ_FULL)
- Linux Integrated System Analysis (LISA) for scheduler verification
We plan to continue the discussions that started at OSPM in May 2019 and get a wider audience outside of the core scheduler developers at LPC.
Juri Lelli <firstname.lastname@example.org>, Vincent Guittot <email@example.com>, Daniel Bristot de Oliveira <firstname.lastname@example.org>, Subhra Mazumdar <email@example.com>, and Dhaval Giani <firstname.lastname@example.org>
The microconference started with the core scheduling topic, which has seen several different proposals on the mailing list. The session was led by the different people involved in the subject.
Aubrey Li, Jan Schönherr, Hugo Reis and Vineeth Remanan Pillai jointly led the first topic session of the Scheduler MC, the main focus of which was to discuss different approaches and possible uses of Core Scheduling [https://lwn.net/Articles/799454/]. Apart from the primary purpose of making SMT secure against hardware vulnerabilities, a database use case for Oracle and another use case for deep-learning workloads were presented as well. For the database use case, it was suggested that Core Scheduling could help reduce idleness by finding processes working on the same database instance and packing them on shared siblings; for the deep-learning one, a similar behaviour would help colocate processes issuing AVX-512 instructions on the same core. Finally, an additional use case was mentioned: coscheduling, i.e. isolating some processes while at the same time making others run on the same (set of) core(s).
Discussion then moved on and focused on how to ensure fairness across groups of processes that might monopolize cores and starve other processes by force idling of siblings. Inter-core migrations are, admittedly, still to be looked at.
The general agreement, before closing the sessions, seemed to be that core scheduling is indeed an interesting mechanism that might eventually reach mainline (mainly because enough people need it), but there is still work to be done to make it sound and general enough to suit everybody’s needs.
Juri Lelli started out with a very quick introduction to Proxy Execution, an enhanced priority inheritance mechanism intended to replace the current rt_mutex PI implementation and possibly merge rt_mutex and mutex code. Proxy Execution is especially needed by the SCHED_DEADLINE scheduling policy, as what is implemented today (deadline inheritance with boosted tasks running outside runtime enforcement) is firstly not correct (as DEADLINE tasks should inherit bandwidth) and secondly risky for normal users (who could implement DoS by forging malicious inheritance chains).
The only question that Lelli intended to address during the topic slot (due to time constraints) was “What to do with potential donors and mutex holder that run on different CPUs?”. The current implementation moves donors (chain) to the CPU on which the holder is running, but this was deemed wrong when the question was discussed at OSPM19 [https://lwn.net/Articles/793502/]. However, both the audience and Lelli (after thinking more about it) agreed that migrating donors might be the only viable solution (from an implementation point of view) and it should be also theoretically sound, at least when holder and donor run on the same exclusive cpuset (root domain), which is probably the only sane configuration for tasks sharing resources.
Making SCHED_DEADLINE safe for kernel kthreads
Paul McKenney introduced his topic by telling the audience that Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr() that can result in SCHED_DEADLINE tasks starving RCU's kthreads for extended time periods (~146 days!), since the period and runtime of DEADLINE tasks currently have no bound and RCU’s kthreads are SCHED_NORMAL by default. He then proposed several approaches to possibly fix the problem.
- sched_setattr() could recognize parameter settings that put kthreads at risk and refuse to honor those settings.
- In theory, RCU could detect this situation and take the "dueling banjos" approach of increasing its priority as needed to get the CPU time that its kthreads need to operate correctly.
- Stress testing could be limited to non-risky regimes, such that kthreads get CPU time every 5-40 seconds, depending on configuration and experience.
- SCHED_DEADLINE bandwidth throttling could treat tasks in other scheduling classes as an aggregate group having a reasonable aggregate deadline and CPU budget (i.e. hierarchical SCHED_DEADLINE servers).
Discussion led to approaches 1 and 4 (both implemented and posted on LKML by Peter Zijlstra) being selected as viable solutions. Approach 1 is actually already close to being merged upstream. Approach 4 might require some more work, but it is very interesting as, once extended to SCHED_FIFO/RR, it could be used to both replace RT throttling and provide better RT group support.
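Approach 1 amounts to an extra sanity check on the parameters passed to sched_setattr(). The sketch below illustrates the idea in userspace Python; the function name is hypothetical and the two period bounds are illustrative assumptions, not values taken from the posted patches. The ordering rule (runtime <= deadline <= period, with a zero period meaning "same as deadline") follows the documented SCHED_DEADLINE semantics.

```python
NSEC_PER_USEC = 1_000

# Illustrative bounds only (assumptions for this sketch).
DL_PERIOD_MIN_US = 100        # 100 microseconds
DL_PERIOD_MAX_US = 1 << 22    # roughly 4.19 seconds


def dl_params_acceptable(runtime_ns, deadline_ns, period_ns):
    """Return True if the deadline parameters are sane and the period is bounded."""
    if period_ns == 0:
        period_ns = deadline_ns  # a zero period defaults to the deadline
    # Basic SCHED_DEADLINE ordering constraint.
    if not (0 < runtime_ns <= deadline_ns <= period_ns):
        return False
    # Additional check: refuse periods so long (or so short) that other
    # tasks, such as kthreads, could be starved for extended periods.
    period_us = period_ns // NSEC_PER_USEC
    return DL_PERIOD_MIN_US <= period_us <= DL_PERIOD_MAX_US
```

With these example bounds, a task asking for 10 ms of runtime every 30 ms is accepted, while a task whose runtime and period are both 146 days is refused.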
CFS load balance rework
Vincent Guittot started the topic by giving a status update on the rework of the scheduler's load_balance(). He summarized the main changes posted on LKML and which use cases (UCs) were fixed with the proposal. He also noticed some performance improvements but admitted that this was not the main goal of the rework at this point. The two main goals of the patchset are:
- to better classify the group of CPUs when gathering metrics and before deciding to move tasks
- to choose the best way to describe the current imbalance and what should be done to fix it.
Vincent wanted to present and discuss 3 open items that would need to get fixed in order to improve further the balance of the load on the system.
The 1st item was about using load_avg instead of runnable_load_avg. Runnable load was introduced to fix cases where there is a huge amount of blocked load on an idle CPU. This situation is no longer a problem with the rework because load_avg is now used only when there is no spare capacity on a group of CPUs. The current proposal replaces runnable_load_avg by load_avg in the load balance loop, and Vincent plans to also move the wake up path and the NUMA algorithm if the audience thinks that it makes sense. Rik said that letting a CPU idle is often a bad choice even for NUMA and that the rework is the right direction for NUMA too.
The 2nd item was about the detection of overloaded groups of CPUs. Vincent showed some charts of a CPU overloaded by hackbench tasks. The charts showed that utilization can be temporarily low after task migration whereas the load and the runnable load stayed high, which is a good indication of large waiting time on an overloaded CPU. Using an unweighted runnable load could help to better classify the sched_groups by monitoring the waiting time. The audience mentioned that tracking the time since the last idle time could also be a good indicator.
The 3rd item was the fairness of unbalanceable situations. When the system can’t be balanced, as in the case of N+1 tasks running on N CPUs, nr_balance_failed drives the migration, which makes the migration unpredictable at best. There is no solution so far, but the audience proposed maintaining an approximation of a global vruntime that can be used to monitor fairness across the system. It was also raised that compute capacity should be taken into account because it’s another level of unfairness on heterogeneous systems.
Flattening the hierarchy discussion
In addition to the referred talk “CPU controller on a single runqueue”, Rik van Riel wanted to discuss problems related to flattening the scheduler hierarchy. He showed a typical cgroup hierarchy with pulseaudio and systemd and described what has to be done each time a reschedule happens. The cost of going through the hierarchy is quite prohibitive when tasks are enqueued at high frequency. The basic idea of his patchset is to keep the cgroup hierarchy outside the root rq and rate limit its update. All tasks are put in the root rq using their hierarchical load and weight, and as little as possible is done during enqueue and dequeue. But the wide variety of weights becomes a problem when a task wakes up because it can preempt a higher priority task. Rik asked what should be the right behavior for this code. The audience raised that we should make sure not to break the fairness between groups, and that people care about use cases more than theory, which should drive the solution.
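To illustrate what a "hierarchical weight" means here, the following sketch computes the share of CPU each task would get if a small cgroup tree were flattened into a single queue. It is a simplified model and not the patchset's algorithm: the Node class, the example tree and the weights are all made up for the example, and every leaf is treated as a runnable task.

```python
from fractions import Fraction


class Node:
    """A cgroup (has children) or a task (leaf) with a scheduling weight."""

    def __init__(self, name, weight, children=()):
        self.name = name
        self.weight = weight
        self.children = list(children)


def flatten(node, share=Fraction(1)):
    """Yield (task name, fraction of CPU) for every leaf under `node`.

    At each level a child receives its weight divided by the sum of its
    siblings' weights, multiplied by the share its parent received.
    """
    if not node.children:
        yield node.name, share
        return
    total = sum(child.weight for child in node.children)
    for child in node.children:
        yield from flatten(child, share * Fraction(child.weight, total))


# Hypothetical hierarchy loosely modelled on a desktop system.
root = Node("root", 1024, [
    Node("system.slice", 1024, [Node("pulseaudio", 1024)]),
    Node("user.slice", 1024, [Node("compile", 1024), Node("browser", 3072)]),
])

shares = dict(flatten(root))
```

In this example pulseaudio ends up with 1/2 of the CPU, browser with 3/8 and compile with 1/8. The spread between such flattened weights can be large, which is the source of the wakeup preemption problem mentioned above.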
Scheduler domains and cache bandwidth
Valentin Schneider presented the results of some tests that he ran on a ThunderX platform. The current scheduler topology, which has only NUMA and MC levels, doesn’t show an intermediate “socklet” made of 6 cores. This intermediate HW level is not described, but adding it increases the performance of some use cases. Valentin asked if it makes sense to dynamically add a scheduling level, similarly to NUMA. Architectures can already supersede the default SMT/MC/DIE levels with their own hierarchy. Adding a new level for arm64 could make sense as long as the arch provides the topology information and it can be dynamically discovered.
TurboSched: Core capacity Computation and other challenges
Parth Shah introduced TurboSched, a proposed scheduler enhancement that aims to sustain turbo frequencies for longer by explicitly marking small tasks (“jitters”, as Parth would refer to them) and packing them on a smaller number of cores. This in turn ensures that the other cores will remain idle, thus saving power budget for CPU intensive tasks.
Discussion focused on which interface might be used to mark a task as being “jitter”, but this very question soon raised another one: what is a “jitter” task? It turned out that Parth considers “jitter” to be tasks for which quick response is not crucial and which can usually be considered background activities (short running background jobs). There was indeed general agreement that it is first very important to agree on terms, so as to then decide what to do with the different types of tasks.
Another problem discussed concerned the core capacity calculation, which TurboSched uses to find non-idle cores and also to limit task packing on heavily loaded cores. The current approach used by Parth is to take the CPU capacity of each thread and compute the capacity of a core by linearly scaling the single-thread capacity; this approach seemed to be incorrect because the capacity of the CPU is not static. Discussion about possible strategies to achieve TurboSched's goal followed; unfortunately, a definitive agreement wasn’t reached due to time constraints.
Subhra Mazumdar wanted to discuss the fact that there is currently no user control over how much time the scheduler should spend searching for CPUs when scheduling a task; internal heuristics based on hardcoded logic decide for it. This is however suboptimal for certain types of workloads, especially workloads composed of short-running tasks that generally spend only a few microseconds running when they are activated. To tackle the problem, he proposed to provide a new latency-nice property that users can set for a task (similar to the nice value) and that controls the search time (and also potentially the preemption logic).
A discussion was held regarding what might be the best interface to associate this new property to tasks, in which system wide procfs attribute, per-task sched_attr extension and cgroup interface were mentioned. The following question was about what might be a sensible value range for such a property, for which a final answer wasn’t reached. It was then noticed that userspace might help with assigning such a property to tasks, if it knows which tasks are requiring special treatment (like the Android runtime normally does).
Paul Turner seemed in favour of a mechanism to restrict scheduler search space for some tasks, even though he mentioned background tasks as a type of tasks that would benefit from having that.
The PCI interconnect specification and the devices implementing it are incorporating more and more features aimed at high performance systems (e.g. RDMA, peer-to-peer, CCIX, PCI ATS (Address Translation Service)/PRI (Page Request Interface), enabling Shared Virtual Addressing (SVA) between devices and CPUs) that require the kernel to coordinate the PCI devices, the IOMMUs they are connected to and the VFIO layer used to manage them (for userspace access and device passthrough), with related kernel interfaces that have to be designed in sync for all three subsystems.
The kernel code that enables these new system features requires coordination between VFIO/IOMMU/PCI subsystems, so that kernel interfaces and userspace APIs can be designed in a clean way.
Following up on the successful LPC 2017 VFIO/IOMMU/PCI microconference, the Linux Plumbers 2019 VFIO/IOMMU/PCI track will therefore focus on promoting discussions on the current kernel patches aimed at the VFIO/IOMMU/PCI subsystems, with specific sessions targeting kernel patches that enable technology (e.g. device/sub-device assignment, peer-to-peer PCI, IOMMU enhancements) requiring coordination among the three subsystems; the microconference will also cover VFIO/IOMMU/PCI subsystem-specific tracks to debate the status of patches for the respective subsystems' plumbing.
Tentative topics for discussion:
- Shared Virtual Addressing (SVA) interface
- SRIOV/PASID integration
- Device assignment/sub-assignment
- IOMMU drivers SVA interface consolidation
- IOMMUs virtualization
- IOMMU-API enhancements for mediated devices/SVA
- Possible IOMMU core changes (like splitting up iommu_ops, better integration with device-driver core)
- DMA-API layer interactions and how to get towards generic dma-ops for IOMMU drivers
- Resources claiming/assignment consolidation
- PCI error management
- PCI endpoint subsystem
- prefetchable vs non-prefetchable BAR address mappings (cacheability)
- Kernel NoSnoop TLP attribute handling
- CCIX and accelerators management
Etherpad notes at https://linuxplumbersconf.org/event/4/sessions/66/attachments/272/459/go
User Interfaces for Per-Group Default IOMMU Domain Type (Baolu Lu)
Currently DMA transactions going through an IOMMU are configured in a coarse-grained manner: all translated or all bypass. This causes bad user experience when users only want to bypass the IOMMU for DMA from some trusted high-speed devices, e.g. NIC or graphic, for performance consideration, or alternatively only translate DMA transactions from some legacy devices with limited address capability to access high memory. Per-group default domain type is proposed to address this.
The first concern was about the cross dependencies for non-PCI devices (solved through the .is_added struct pci_device flag).
attach() and add() methods already exist and are implemented in ARM SMMUs drivers, but driver model doesn't have the idea of separating "add device" from "attach driver".
A solution that was floated was dynamic groups and devices with changing groups, drivers can be reset to deal with that.
For x86, PCI hotplug could also be a problem since the default domain type is only parsed during boot time.
Hence, splitting default domain type from device enumeration might be a solution. As for the user interface, it's still not feasible for non-PCI devices, hence adding a postfix, i.e. .pci to the command line will make things workable. Joerg Roedel pointed out that there are possible conflicting devices, so there must be a mechanism to detect it. People also have concerns on command line argument - hard to manipulate by distributions, distribution drivers are modules, so ramfs may be a solution.
In summary, it was agreed that the problem needs solving and the community should start to fix it, but there are still concerns about the various solutions put forward and the related user interfaces.
A consensus should be sought on mailing list discussions to avoid a command line parameter and reach a cleaner upstreamable solution.
Status of Dual Stage SMMUv3 Integration (Eric Auger)
The work under discussion allows the virtual SMMUv3 to manage the SMMU programming for host devices assigned to a guest OS, which in turn requires that the physical SMMUv3 is set-up to handle guest OS mappings.
The first part of the session was dedicated to providing some background about this work which started more than one year ago and highlighting the most important technical challenges.
The Intel IOMMU exposes a so-called Caching Mode (CM). This is a register bit that is used in the virtual IOMMU. When exposed, this bit forces the guest driver to send invalidations on MAP operations.
Normally the Intel iommu driver only sends invalidation on UNMAP operations but with Caching Mode on, the VMM is able to trap all the changes to the page tables and cascade them to the physical IOMMU. On ARM however, such a Caching Mode does not exist. In the past, it was attempted to add this mode through an smmuv3 driver option (arm-smmu-v3 tlbi-on-map option RFC, July/August 2017) but the approach was Nak’ed. The main argument was that it was not specified in the SMMU architecture.
So after this first attempt, the natural workaround was to use the two translation stages potentially implemented by the HW (this is an implementation choice). The first stage would be “owned” by the guest while the second stage would be owned by the hypervisor. This session discussed the status of this latter work. As a reminder the first RFC was sent in Aug 2018. Latest version of the series is (Aug 2019):
It is important to notice that this patch series shares the IOMMU APIs with the Intel SVA series (fault reporting, cache invalidation).
Eric Auger highlighted some technical details:
- The way configuration is set-up for guest IOMMU configuration is different on ARM versus Intel (attach/detach_pasid_table != Intel's sva_bind/unbind_gpasid). That’s because on ARM the stage2 table translation pointer is within the so-called Context Table Entry, owned by the guest. Whereas on Intel it is located in the PASID entry. So on ARM the PASID table can be fully owned by the guest while it is not possible on Intel.
- Quite a lot of pain is brought about by MSI interrupts on ARM, since they are translated by an IOMMU on ARM (on Intel they are not). So one of the most difficult tasks in upstreaming the series consists in setting up nested binding for MSIs.
- The series then obviously changes the state machine in the SMMUv3 drivers as we must allow both stages to cooperate, at the moment only stage1 is used. We have new transitions to handle.
- The VFIO part is less tricky. It maps onto the new IOMMU API. Most of the series is now dedicated to the physical fault propagation up to the guest. This relies on a new fault region and a new specific interrupt.
Then a discussion followed this presentation.
Most of the slot was spent on the first question raised: are there conceptual blockers about the series?
Will Deacon answered that he does not see any user of this series at the moment and he is reluctant to upstream something that is not exercised and tested.
He expects Eric Auger to clarify the use cases where this implementation has benefits over the virtio-iommu solution and write a short documentation to that effect. Eric explained the series makes sense for use cases where dynamic mappings are used, for which the virtio-iommu performance may be poor as seen on x86 (native driver in the guest and Shared Virtual Memory in the guest).
Will Deacon also explained he would rather see a full implementation supporting multiple PASIDs. In other words, he would be interested in shared virtual memory enablement for the guest, where the guest programs physical devices with a PASID and programs their DMAs with its process guest VA.
Eric Auger replied he preferred to enable nested paging first without PASID support. This allows removing a lot of complex dependencies at kernel level (IOASID allocation, substreamID support in the SMMUv3 driver). Eric Auger also said he has no PASID-capable device to test on.
The possibility of using a cheap FPGA to implement PCI devices supporting PASID/PRI was floated as a solution to the lack of HW. Maybe a DesignWare controller can be configured as an endpoint implementing the required HW features (PCI ATS/PRI/PASID); it was not clear in the debate whether that’s possible or not.
Will Deacon reported that he could not test the current series. Eric Auger stated that the series was testable and tested on three different platforms.
Beyond that question, the review of the IOMMU UAPIs was considered. Eric Auger asked if the patches related to the IOMMU UAPI could be maintained in a separate series to avoid confusion about who has the latest version of the API, which is shared by several series.
Maintainers agreed on this strategy.
PASID Management in Linux (Jacob Pan)
PASID life cycle management, user APIs (uAPIs), and Intel VT-d specific requirements were discussed in the context of nested shared virtual addressing (SVA).
The following upstream plan and breakdown of features was agreed for each stage.
- PCI device assignment that uses nested translation, a.k.a. Virtual SVA.
- IOMMU detected device faults and page request service
- Mediated device support
VT-d supports system-wide PASID allocation in that the PASID table is per-device and devices can have multiple PASIDs assigned to different guests. Guest-host PASID translation could be possible but it was not discussed in detail.
Joerg Roedel and Ben Herrenschmidt both asked if we really need a guest PASID bind API: why can't it be merged with sva_bind_device(), which is the native SVA bind? After reviewing both APIs, it was decided to keep them separate. The reasons are:
- Guest bind does not have mm notifier directly; guest mm release is propagated to the host in the form of PASID cache flush
- PASID is allocated inside sva_bind_device(), whereas guest PASID bind has the PASID allocated prior to the call
- Metadata used for tracking the bond is different
The outcome is for Jacob Pan to send out user API patches only and get them ACKed by ARM.
For stage 1, the uAPI patches include:
- IOASID allocator
Various scenarios related to PASID tear down were debated, which is more interesting than setup.
The primary source of race conditions comes from device, IOMMU and userspace that operate asynchronously with respect to faults, aborts, and work submission. The goal is to have clean life cycles for PASIDs.
Unlike PCIe function level reset, which is unconditional, PASID level reset has dependencies on device conditions and pending faults. Jacob Pan proposed an iommu_pasid_stop API to mature such conditions so that the device can do a PASID reset, but Joerg is concerned the approach is still not race free. IOMMU PASID stop and drain does not prevent the device from continuing to send more requests.
The following flow for device driver initiated PASID tear down was agreed:
- Issue device-specific PASID reset/abort, device must guarantee no more transactions issued from the device after completion. Device may wait for pending requests to be responded, either normally or timed out.
- iommu_cache_invalidate() may be called to flush device and IOTLBs, the idea is to speed up the abort.
- Unbind PASID, clear all pending faults in IOMMU driver, mark PASID not present, drain page requests, free PASID.
- Unregister device fault handler, which will not fail.
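The ordering of the agreed flow can be sketched as a small simulation. This is a minimal Python model, not kernel code: the `Step` names, the `PasidTeardown` class and the `tear_down()` helper are hypothetical and only mirror the four steps listed above. The model treats cache invalidation as optional (the text says it "may be called") and rejects any other step taken out of order.

```python
from enum import IntEnum


class Step(IntEnum):
    """Hypothetical labels for the agreed teardown steps, in order."""
    ACTIVE = 0
    DEVICE_RESET = 1       # device-specific PASID reset/abort completed
    CACHE_INVALIDATED = 2  # optional flush of device TLBs and IOTLBs
    UNBOUND = 3            # faults cleared, PASID not present, requests drained, PASID freed
    HANDLER_REMOVED = 4    # device fault handler unregistered


class PasidTeardown:
    """Tracks teardown progress and rejects out-of-order steps."""

    # Steps that may be skipped: invalidation only speeds up the abort.
    OPTIONAL = {Step.CACHE_INVALIDATED}

    def __init__(self):
        self.state = Step.ACTIVE
        self.log = []

    def advance(self, step):
        if step <= self.state:
            raise RuntimeError(f"{step.name} already done or out of order")
        skipped = [s for s in Step if self.state < s < step]
        missing = [s for s in skipped if s not in self.OPTIONAL]
        if missing:
            raise RuntimeError(f"cannot do {step.name} before {missing[0].name}")
        self.state = step
        self.log.append(step.name)


def tear_down(invalidate_caches=True):
    """Run the driver-initiated flow and return the steps taken."""
    t = PasidTeardown()
    t.advance(Step.DEVICE_RESET)
    if invalidate_caches:
        t.advance(Step.CACHE_INVALIDATED)
    t.advance(Step.UNBOUND)
    t.advance(Step.HANDLER_REMOVED)
    return t.log
```

Calling `PasidTeardown().advance(Step.UNBOUND)` directly raises an error in this model, which reflects the point of the agreed flow: the device must have stopped issuing transactions before the PASID is unbound and freed.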
The need to support subdevice fault reporting by IOMMU was discussed, specifically mdevs.
In principle, it was agreed to support multiple data per handler which allows iommu_report_device_fault() to provide sub-device specific info.
Architecture Considerations for VFIO/IOMMU Handling (Cornelia Huck)
The session was aimed at highlighting discrepancies between what is expected from architectures in the kernel software stack, given that subsystems take x86 architectural behaviour for granted.
Examples of features that were considered hard to adapt to other architectures were ARM MSIs, IOMMU subsystem adaptation to Power, the s390 IO model and the memory attributes of ioremap_wc() on architectures other than x86 where the WC memory type does not exist.
Cornelia Huck mentioned that there are basic assumptions about systems: everything is PCI, and kernel subsystems are designed with x86 assumptions. PCI on the mainframe does not behave as expected: it is driven by instructions, not MMIO. Some devices are not PCI at all, e.g. s390 CCW and general IO.
The IOMMU core kernel code was designed around x86 and is very difficult to adapt to Power.
Different arches have different software models, for instance Power has hypervisor that works differently from any other hypervisor.
There is an assumption on how DMA on devices is done. On Power, virtio-iommu (i.e. a paravirtualized IOMMU) can be useful, because Power requires the IOMMU API to be emulated in the guest. There have been some iterations on HW that can handle that on the host side.
Linux assumes that virtual machine migration can’t happen if there are assigned devices (support is in the works). The problem is that the kernel cannot move pages that are set up for DMA. It is feasible but it is not actually done; there is a plan to support migration through quiescing.
The session opened the debate on arch assumptions; it continued in hallway discussions.
Optional or Reduced PCI BARs (Jon Derrick)
Some devices, e.g., NVMe, have both mandatory and optional BARs. If address space is not assigned to the mandatory BARs, the device can’t be used at all, but the device can operate without space assigned to the optional BARs.
The PCI specs don’t provide a way to identify optional BARs, so Linux assumes all BARs are mandatory. This means Linux may assign space for device A’s optional BARs, leaving no space for device B’s mandatory BAR, so B is not usable at all. If Linux were smarter, it might be able to assign space for both A and B’s mandatory BARs and leave the optional BARs unassigned so both A and B were usable.
Drivers know which BARs are mandatory and which are optional, but resource assignment is done before drivers claim devices, so it’s not clear how to teach the resource assignment code about optional BARs. There is ongoing work for suspending, reassigning resources, and resuming devices. Maybe this would be relevant.
The PCI specs don’t provide a way to selectively enable MEM BARs -- all of them are enabled or disabled as a group by the Memory Space Enable bit. Linux doesn’t have a mechanism for dealing with unassigned optional BARs. It may be possible to program such BARs to address space outside the upstream bridge’s memory window so they are unreachable and can be enabled without conflicting with other devices.
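The device A / device B scenario above can be illustrated with a toy allocator. This is a minimal Python sketch under stated assumptions, not how the Linux PCI core assigns resources: the `Bar` type, the `assign()` and `usable_devices()` helpers, the first-fit policy and the `mandatory` flag (which in practice only the driver knows) are all hypothetical, and alignment is ignored.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    device: str
    size: int         # bytes of address space requested
    mandatory: bool   # assumption: known to the allocator in this model


def assign(bars, window_size, mandatory_first):
    """Return the indices of the BARs that received address space.

    First-fit model: BARs are visited in order and assigned if they
    still fit in what is left of the bridge window.
    """
    order = list(enumerate(bars))
    if mandatory_first:
        # Stable sort: mandatory BARs first, original order otherwise.
        order.sort(key=lambda item: not item[1].mandatory)
    remaining = window_size
    assigned = set()
    for index, bar in order:
        if bar.size <= remaining:
            remaining -= bar.size
            assigned.add(index)
    return assigned


def usable_devices(bars, assigned):
    """A device is usable only if all of its mandatory BARs were assigned."""
    devices = {bar.device for bar in bars}
    broken = {bar.device for i, bar in enumerate(bars)
              if bar.mandatory and i not in assigned}
    return devices - broken


# Device A has a mandatory and an optional BAR; device B has one mandatory BAR.
bars = [Bar("A", 16, True), Bar("A", 32, False), Bar("B", 32, True)]
all_mandatory = usable_devices(bars, assign(bars, 48, mandatory_first=False))
optional_aware = usable_devices(bars, assign(bars, 48, mandatory_first=True))
```

With a 48-byte window, treating every BAR as mandatory leaves only A usable, while placing mandatory BARs first leaves A's optional BAR unassigned and makes both A and B usable.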
PCI Resource Assignment Policies (Ben Herrenschmidt)
There is agreement that the kernel PCI resource allocation code should try to honour firmware resource allocation and not reassign them by default and then manage hotplug.
In the current code there are lots of platforms and architectures that just reassign everything and arches that do things in between. Benjamin Herrenschmidt said that he found a way to solve the resource allocation for all arches.
There is a compelling need to move resource allocation out of PCI controller drivers into PCI core by having one single function (easier to debug and maintain) to do the resource allocation with a given policy chosen in a set.
The real problem in updating PCI resource allocation code is how to test on all existing x86 HW.
It is also very complicated to define when a given resource allocation is “broken” so that it has to be redone by the kernel; an allocation can be suboptimal but still functioning.
It was agreed that we need to come up with a policy definition and test simple cases gradually.
Implementing NTB Controller Using PCIe Endpoint (Kishon Vijay Abraham I)
Kishon presented the idea of adding a software defined NTB (Non-Transparent Bridge) controller using multiple instances of configurable PCIe endpoint in an SoC.
One use of an NTB is as a point-to-point bus connecting two hosts (RCs). NTBs provide three mechanisms by which the two hosts can communicate with each other:
- Scratchpad Registers: Register space provided by NTB for each host that can be used to pass control and status information between the hosts.
- Doorbell Registers: Registers used by one host to interrupt the other host.
- Memory Window: Memory region to transfer data between the two hosts.
All these mechanisms are mandatory for an NTB system.
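The three mechanisms can be pictured with a toy in-memory model. This is a minimal Python sketch, not the NTB core API or the proposed function driver: the `NtbLink` class and its method names are hypothetical, and for simplicity the scratchpads are modeled as one register bank shared by both hosts.

```python
class NtbLink:
    """Toy model of the three NTB mechanisms between host 0 and host 1."""

    def __init__(self, num_scratchpads=4, window_size=64):
        # Scratchpads: small registers for control and status information.
        self.scratchpads = [0] * num_scratchpads
        # Doorbells: pending interrupt bits, one word per host.
        self.doorbells = [0, 0]
        # Memory windows: one per host; a host writes into its peer's window.
        self.windows = [bytearray(window_size), bytearray(window_size)]

    def write_scratchpad(self, index, value):
        self.scratchpads[index] = value & 0xFFFFFFFF

    def read_scratchpad(self, index):
        return self.scratchpads[index]

    def ring_doorbell(self, from_host, bit):
        """Host `from_host` interrupts the other host by setting a bit."""
        self.doorbells[1 - from_host] |= 1 << bit

    def take_doorbells(self, host):
        """Return and clear the doorbell bits pending for `host`."""
        pending, self.doorbells[host] = self.doorbells[host], 0
        return pending

    def write_window(self, from_host, offset, data):
        """Copy `data` into the peer's memory window."""
        window = self.windows[1 - from_host]
        if offset < 0 or offset + len(data) > len(window):
            raise ValueError("write outside memory window")
        window[offset:offset + len(data)] = data


# Host 0 sends a message: data via the window, length via a scratchpad,
# then a doorbell to tell host 1 that the message is ready.
link = NtbLink()
link.write_window(0, 0, b"ping")
link.write_scratchpad(0, 4)
link.ring_doorbell(0, 1)
```

The usage at the end shows why all three are needed together: the memory window carries the payload, the scratchpad carries metadata about it, and the doorbell tells the peer when to look.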
The SoC should have multiple instances of configurable PCIe endpoint for implementing NTB functionality. The hosts that have to communicate with each other should be connected to each of the endpoint instances and the endpoint should be configured so that transactions from one host are routed to the other host.
A new NTB function driver should be added which uses the endpoint framework to configure the endpoint controller instances. The NTB function driver should model the scratchpad registers, doorbell registers, and memory window and configure the endpoint so transactions from one host are routed to the other host.
A new NTB hardware driver should be added which uses the NTB core on the host side. This will allow standard NTB utilities (ntb_netdev, ntb_pingpong, ntb_tool, etc.) to be used in a host which is connected to an SoC configured with the NTB function driver.
Benjamin Herrenschmidt asked if the implementation can support multiple ports. The NTB function driver accounts for only two interfaces, which means a single endpoint function can be connected to only two hosts. However, the NTB function driver can be used with multi-function endpoints to provide a multi-port connection, i.e. a connection between more than two hosts. Using multi-function has the added advantage of being able to provide isolation between different hosts.
Benjamin Herrenschmidt asked if the modeled NTB can support DMA. The system DMA in SoC (with NTB function driver) can be used for transferring data between the two hosts. However this is not modeled in the initial version of NTB driver. Adding support for such a feature in the software has to be thought through (RDMA? NTRDMA?).
Kishon also mentioned the initial version of NTB function driver will be able to use only the NTB function device created using device tree. However Benjamin Herrenschmidt mentioned support for NTB function device to be created using configfs should also be added.
Kishon mentioned that initial version will be posted within weeks to get early review comments and continue the discussion on mailing list.
Building on the Treble and Generic System Image work, Android is
further pushing the boundaries of upgradability and modularization with
a fairly ambitious goal: Generic Kernel Image (GKI). With GKI, Android
enablement by silicon vendors would become independent of the Linux
kernel running on a device. As such, kernels could easily be upgraded
without requiring any rework of the initial hardware porting efforts.
Accomplishing this requires several important changes and some of the
major topics of this year's Android MC at LPC will cover the work
involved. The Android MC will also cover other topics that had been the
subject of ongoing conversations in past MCs such as: memory, graphics,
storage and virtualization.
Proposed topics include:
- Generic Kernel Image
- ABI Testing Tools
- Android usage of memory pressure signals in userspace low memory killer
- Testing: general issues, frameworks, devices, power, performance, etc.
- DRM/KMS for Android, adoption and upstreaming
- dmabuf heaps upstreaming
- dmabuf cache management optimizations
- kernel graphics buffer (dmabuf based)
- uid stats
- vma naming
- virtualization/virtio devices (camera/drm)
- libcamera unification
These talks continue the work done last year, as reported in the Android MC 2018 progress report. Specifically:
- Symbol namespaces have gone ahead
- There is continued work on using memory pressure signals for userspace low memory killing
- Userfs checkpointing has gone ahead with an Android-specific solution
- The work continues on common graphics infrastructure
Karim Yaghmour <email@example.com>, Todd Kjos <firstname.lastname@example.org>, Sandeep Patil <email@example.com>, and John Stultz <firstname.lastname@example.org>
Building on the Treble and Generic System Image work, Android is
further pushing the boundaries of upgradability and modularization with
a fairly ambitious goal: Generic Kernel Image (GKI). With GKI, Android
enablement by silicon vendors would become independent of the Linux
kernel running on a device. As such, kernels could easily be upgraded
without requiring any rework of the initial hardware porting efforts.
This is a multi-year effort that requires several important changes, and some
of the major topics of this year's Android MC at LPC covered the work
involved. The Android MC also covered other topics that had been the
subject of ongoing conversations in past MCs such as: memory, graphics,
storage and virtualization.
The topics discussed were:
- Generic Kernel Image (GKI) progress
- GKI: Monitoring and Stabilizing the In-Kernel ABI
- GKI: Solving issues associated with modules and supplier-consumer dependencies
- Android Virtualization (esp. Camera, DRM)
- libcamera: Unifying camera support on all Linux systems
- Emulated storage features (eg sdcardfs)
- Eliminating WrapFS hackery in Android with ExtFUSE (eBPF/FUSE)
- How we're using ebpf in Android networking
- Linaro Kernel Functional Testing (LKFT): functional testing of android common kernels
- Handling memory pressure on Android
- DMABUF Developments
- DRM/KMS for Android, adoption and upstreaming
- scheduler: uclamp usage on Android
- ARM v8.5 Memory Tagging Extension
Some of the early feedback on these discussions is captured in this spreadsheet.
The focus of this MC will be on power-management and thermal-control frameworks, task scheduling in relation to power/energy optimizations and thermal control, platform power-management mechanisms, and thermal-control methods. The goal is to facilitate cross-framework and cross-platform discussions that can help improve power and energy-awareness and thermal control in Linux.
- CPU idle-time management improvements
- Device power management based on platform firmware
- DVFS in Linux
- Energy-aware and thermal-aware scheduling
- Consumer-producer workloads, power distribution
- Thermal-control methods
- Thermal-control frameworks
The Power Management and Thermal Control MC covered 7 topics, 3 mostly related to thermal control and 4 power management ones (1 scheduled topic was not covered due to the lack of time).
A proposal to change the representation of thermal zones in the kernel into a hierarchical one was discussed first. It was argued that having a hierarchy of thermal zones would help to cover certain use cases in which alternative cooling devices could be used for the same purpose, but the actual benefit from that was initially unclear to the discussion participants. The discussion went on for some time without a clear resolution and finally it was deferred to the BoF time after the recorded session. During that time, however, the issue was clarified and mostly resolved with a request to provide a more detailed explanation of the design and the intended usage of the new structure.
The next topic was about thermal throttling and the usage of the SCHED_DEADLINE scheduling class. The problem is that the latter appears to guarantee a certain level of service that may not be provided if performance is reduced for thermal reasons. It was agreed that the admission control limit for SCHED_DEADLINE tasks should be based on the sustainable performance and reduced from the current 95% of the maximum capacity. Also it would be good to signal tasks about the possibility of missed deadlines (there are some provisions for that already, but it needs to be wired up properly).
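The effect of lowering the admission control limit can be shown with a simplified utilization test. This is a minimal Python sketch for a single CPU, not the kernel's implementation: the `admit()` helper is hypothetical, tasks are reduced to (runtime, period) pairs, and the 80% figure used below is an arbitrary stand-in for a "sustainable performance" level.

```python
from fractions import Fraction


def admit(tasks, new_task, limit=Fraction(95, 100)):
    """Return True if `new_task` fits under the bandwidth limit.

    tasks, new_task: (runtime, period) pairs in the same time unit.
    limit: fraction of CPU capacity that deadline tasks may reserve;
           95% corresponds to the current limit mentioned above.
    """
    total = sum(Fraction(runtime, period) for runtime, period in tasks)
    total += Fraction(*new_task)
    return total <= limit


# Two admitted tasks already reserve 30% + 40% of the CPU.
running = [(30, 100), (40, 100)]
candidate = (20, 100)  # would bring the total to 90%

fits_at_max_capacity = admit(running, candidate)
fits_at_sustainable = admit(running, candidate, limit=Fraction(80, 100))
```

In this example the candidate task is admitted against the 95% limit but rejected once the limit reflects the lower, thermally sustainable capacity, which is the behaviour the discussion aimed for: reservations that cannot be honoured under throttling are not accepted in the first place.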
The subsequent discussion started with the observation that in quite a few cases the application is the best place to decide how to reduce the level of compute in the face of thermal pressure, so it may be good to come up with a way to allow applications to be more adaptive. It was proposed to try the existing sysfs-based mechanism allowing notifications to be sent to user space on crossing trip points, but it was pointed out that it might be too slow. A faster way, like a special device node in /dev, may be necessary for that.
The next problem discussed was related to the per-core P-states feature present in recent Intel processors which allows performance to be scaled individually for each processor core. It is generally good for energy-efficiency, but in some rare cases it turns out to degrade performance due to dependencies between tasks running on different cores. The question was how to address that and some possible mitigations were suggested. First, if user space can identify the tasks in question, it can use utilization clamping, but that is a static mechanism and something more dynamic may be needed. Second, the case at hand is analogous to the IOwait one, in which the scheduler passes a special flag to CPU performance scaling governors, so it might be addressed similarly. It was observed that the issue was also similar to the proxy execution problem for SCHED_DEADLINE tasks, but in that case the "depends-on" relationship was well-defined.
The topic discussed subsequently was about using platform firmware for device power management instead of attempting to control low-level resources (clocks and voltage regulators) directly from the kernel. This is mostly targeted at reducing kernel code complexity and addressing cases in which the complete information may not be available to the kernel (for example, when it runs in a VM). Some audience members were concerned that the complexity might creep up even after this had been done. Also, it was asked how to drive that from Linux and some ideas, mostly based on using PM domains, were outlined.
Next, the session went on to discuss possible collection of telemetry data related to system-wide suspend and resume. There is a tool called pm-graph, included in the mainline kernel source tree, that can be run by anyone and can collect suspend and resume telemetry information, such as the time it takes to complete various pieces of these operations, or whether there was a failure and what it was. It is run on a regular basis at Intel on a number of machines, but it would be good to get such data from a wider set of testers. The idea is to make it possible for users to opt in for the collection of suspend/resume telemetry information, ideally on all suspend/resume cycles, and collect the data in a repository somewhere. It was mentioned that KernelCI had a distributed way of running tests, so maybe a similar infrastructure could be used here. It was also recommended to present this topic at the FOSDEM conference and generally talk to Linux distributors about it. openSUSE ships pm-graph already, but currently users cannot upload the results anywhere. Also, there is a problem of verifying possible fixes based on public telemetry data.
The final topic discussed regarded measurements of the CPU idle state exit latency, used by the power management quality of service (PM QoS) mechanism (among other things). There is a tool, called WULT, used for that internally at Intel, but there is a plan to release it as Open Source and submit the Linux driver used by it for inclusion into the mainline kernel. It uses a PCI device that can generate delayed interrupts (currently, the I210 PCIe GbE adapter) in order to trigger a CPU wakeup at a known time (a replacement driver module for that device is used) and puts some "interposer" code (based on an ftrace hook) between the cpuidle core and the idle driver to carry out the measurement when the interrupt triggers (Linus Torvalds suggested executing the MWAIT instruction directly from the tool to simplify the design). It would need the kernel to contain the information on the available C-state residency counters (currently in the turbostat utility) and it would be good to hook that up to perf. It was suggested that BPF might be used to implement this, but it would need to be extended to allow BPF programs to access model-specific registers (MSRs) of the CPU. Also, the user space interface should not be debugfs, but that is not a problem with BPF.
The security of computer systems has been a very important topic for many years. It has been taken into account in OSes and applications for a long time. However, security of the firmware and boot process has not been taken so seriously until recently. Now that is changing. Firmware is more often being designed with security in mind. Boot processes are also evolving. There are many security solutions available there and even some that are now becoming common. However, they are often not complete solutions and solve problems only partially. So, it is a good time to integrate various approaches and build full top-down solutions. There is a lot happening in that area right now. New projects arise, e.g. TrenchBoot, and they meet various design, implementation, validation, etc. obstacles. The goal of this microconference is to foster a discussion of the various approaches and hammer out, if possible, the best solutions for the future. Ideal sessions would discuss various designs and/or the issues and limitations of the available security technologies and solutions that were encountered during the development process. Below is the list of topics that would be nice to cover. This is not exhaustive and can be extended if needed.
- SRTM and DRTM
- Intel TXT
- AMD SKINIT
- UEFI secure boot
- Intel SGX
- boot loaders
Daniel Kiper <email@example.com>, Joel Stanley <firstname.lastname@example.org>, and Matthew Garrett <email@example.com>