drgn (https://github.com/osandov/drgn) is a programmable debugger that makes it easy to introspect and debug state in the kernel. With drgn, it's possible to explore and analyze data structures with the full power of Python. See the LWN coverage of the presentation at LSF/MM: https://lwn.net/Articles/789641/. This presentation will demonstrate the capabilities of drgn, discuss future plans,...
The Linux kernel VxLan driver supports two ways of handling flooded traffic to multiple remote VxLan termination end points (VTEPS):
(a) Head end replication: where the VxLan driver sends a copy of the packet to each participating remote VTEP
(b) Use of multicast routing to forward to participating remote VTEPs
(b) is generally preferred for both hardware and software VTEP deployments...
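The difference between the two flooding modes can be sketched in a few lines of Python. This is only an illustrative model, not the VxLan driver's code: the function names, the packet representation, and the addresses in the usage note are assumptions made for the example.

```python
def head_end_replicate(packet, remote_vteps):
    """Mode (a): the sender emits one unicast copy per remote VTEP."""
    return [(vtep, bytes(packet)) for vtep in remote_vteps]


def multicast_forward(packet, group):
    """Mode (b): the sender emits a single copy to a multicast group;
    the underlay network is responsible for replicating it."""
    return [(group, bytes(packet))]


def copies_sent(transmissions):
    """Number of packets the local VTEP itself has to transmit."""
    return len(transmissions)
```

With three remote VTEPs, `head_end_replicate(b"frame", ["10.0.0.1", "10.0.0.2", "10.0.0.3"])` yields three transmissions, while `multicast_forward(b"frame", "239.1.1.1")` yields one, so in this model the sender's cost grows with the number of VTEPs only in mode (a).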
Running out of memory on a host is a particularly nasty scenario. In the Linux kernel, if memory is being overcommitted, it results in the kernel out-of-memory (OOM) killer kicking in. Perhaps surprisingly, the kernel does not often handle this well. oomd builds on top of recent kernel development to effectively implement OOM killing in userspace. This results in a faster, more predictable,...
I would like to give a talk about the KVA allocator in the kernel and about
the improvements I have made.
See below the presentation:
Thank you in advance!
The RISC-V UNIX-Class platform specification working group started in May and aims to have a first release by the end of the year. This talk will discuss where we are and where we're going.
Having maintained a distribution agnostic reference kernel (Yocto), an operating
system vendor kernel (Wind River) and finally a semi-conductor kernel (Xilinx),
there are a lot of obvious workflows and tools that are used to deliver kernels
and support them after release.
The less than obvious workflows (and tools) are often related to distro kernel
tree maintenance and balancing the needs of...
We'd like to spend a few minutes to provide some background around how we're using Yocto to produce kernel builds as well as bigger images that contain userspace as well, and then try to address some of the issues we're seeing with this process.
There are a few topics we'd like to discuss with the room:
- Using a single kernel branch for multiple, very different projects?
- Working with...
Tracing kernel boot is useful when we chase a bug in device and machine initialization, a boot performance issue, etc. Ftrace already supports enabling basic tracing features on the kernel cmdline. However, since the cmdline is very limited and too simple, it is hard to enable recently introduced complex features, e.g. multiple kprobe events, trigger actions, and event histograms.
RISC-V now has better support for open source boot loaders like U-Boot and coreboot compared to last year. As a result, developers can use the same boot loaders to boot Linux on RISC-V as they do on other architectures, but there's more work to be done. We will discuss the current state of the boot flow and pending issues.
Every distro has to package the kernel tree using their own unique package
files. Some parts of the process are built-in to the kernel source and are
easy: build, install, and headers. Some parts are not: configs, devel
package, userspace tools package, tests, distro versioning, changelogs,
custom patches, etc.
This discussion revolves around some of the issues and difficulties a
Hardware PMU counters are limited resources. When there are more perf events than available hardware counters, time multiplexing is necessary, and the perf events cannot run 100% of the time.
On the other hand, different perf events may measure the same metric, e.g., instructions. We call these perf events "compatible perf events". Technically, one hardware counter could serve...
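The cost of time multiplexing can be made concrete with a small sketch. When an event only occupies a counter for part of the time it is enabled, tools commonly extrapolate the raw count by the ratio of enabled time to running time. The function name and the zero-running-time convention below are assumptions for illustration, not part of the proposal.

```python
def scaled_count(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed event count to the full enabled window.

    raw_count    -- events counted while the event was on a hardware counter
    time_enabled -- total time the event was enabled
    time_running -- time the event actually occupied a hardware counter
    """
    if time_running == 0:
        # The event never got a counter, so there is nothing to extrapolate.
        return 0.0
    return raw_count * time_enabled / time_running
```

An event that counted 500 instructions while running for half of its enabled time is reported as an estimate of 1000. The estimate assumes the workload behaves the same while the event is off the counter, which is why keeping events on a counter for as long as possible matters.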
Packet capture is useful from a general debugging standpoint, and is useful in particular in debugging BPF programs that do packet processing. For general debugging, being able to initiate arbitrary packet capture from kprobes and tracepoints is highly valuable (e.g. what do the packets that reach kfree_skb() - representing error codepaths - look like?). Arbitrary packet capture is distinct...
Over the last couple of years, we have witnessed an onslaught of vulnerabilities in the design and architecture of CPUs. It is interesting and surprising to note that the vulnerabilities mainly target features designed to improve CPU performance, the most notable being hyperthreading (SMT). While some of the vulnerabilities could be mitigated in software and CPU microcode, a couple of...
The IOMMU is a very popular piece of hardware in both the embedded and server virtualization areas. In this topic we'll focus on the embedded area and shared virtual address.
Firstly, we'll talk about the value of an IOMMU for embedded systems and the benefit we could get from an IOMMU in our cost-down embedded system.
Secondly, Guo will share his experience with the IOMMU implementation, e.g. how to keep the same ASID with...
Execute only memory can protect from attacks that involve reading executable code. This feature already exists on some CPUs and is enabled for userspace.
This talk will explain how we are working on creating a virtualized “not-readable” permission bit for guest page tables for x86 and the impact to the kernel. This bit can be used to create execute-only memory for userspace programs as done...
The Kernel's API and ABI exposed to Kernel modules is not something that is usually maintained in upstream. Deliberately. In fact, the ability to break APIs and ABIs can greatly benefit the development. Good reasons for that have been stated multiple times. See e.g. Documentation/process/stable-api-nonsense.rst.
The reality for distributions might look different though. Especially - but not...
Ftrace histograms, based on triggers and synthetic events, were implemented a few years ago by Tom Zanussi. They are a very powerful instrument for analyzing kernel internals using ftrace events, but their user interface is very complex and hard to use. This proposal is to discuss possible ways to define an easier to use and more intuitive interface to this feature, using the trace-cmd application.
The RISC-V trace spec draft has defined a trace format. We'll share our implementation of Linux perf trace based on the spec: how to deal with SMP perf issues and how to verify our design in QEMU, followed by a demo of perf trace with riscv-qemu.
Lastly, let's discuss perf issues from PMU to trace, or any RISC-V perf topic.
The current main use cases of RISC-V center on embedded uses and small configurations. However, RISC-V also seems to be a useful platform for High Performance Computing and may be able to deliver custom solutions that can go well beyond what the traditional processor vendors can offer. There are already efforts underway to use ARM for that purpose but those approaches are constrained by...
While kernelci.org as a project is dedicated to testing the
upstream Linux kernel, the same KernelCI software may be reused
for alternative purposes. One typical example is distribution
kernels, which often track a stable branch but also carry some
extra patches and a specific configuration. Aside from covering
a particular downstream branch, having a separate KernelCI
instance also makes it...
The Red-Black tree and Radix tree are used in many places in the kernel to store ranges. Both of these trees have drawbacks when used for ranges. The Red-Black tree requires writing your own insertion & search code. It is also designed with the assumption that memory accesses are cheap, which is no longer true. The Radix tree performs acceptably well when ranges are aligned to a power of 2,...
Multipath TCP (MPTCP) is an increasingly popular protocol that members of the kernel community are actively working to upstream. A Linux kernel fork implementing the protocol has been developed and maintained since March 2009. While there are some large MPTCP deployments using this custom kernel, an upstream implementation will make the protocol available on Linux devices of all...
Understanding Application performance and utilization characteristics is critically important for cloud-based computing infrastructure. Minor improvements in predictability and performance of tasks can result in large savings. Google runs all workloads inside containers and as such, cgroup performance monitoring is heavily utilized for profiling. We rely on two approaches built on Linux...
Babeltrace started out as the reference implementation of a Common
Trace Format (CTF) reader. As the project evolved, many
trace manipulation use-cases (merging, trimming, filtering,
conversion, analysis, etc.) emerged and were implemented either
as part of the Babeltrace project, on top of its APIs or through
Today, as more tracers emerged, each using their own trace format,...
The RISC-V hypervisor extension is carefully designed to be compliant with both Type-1 and Type-2 hypervisors. We have ported Xvisor (Type-1) and KVM (Type-2) for RISC-V architecture. In this session, we share our experience porting these hypervisors and also discuss future work on RISC-V hypervisors.
This presentation discusses the work done to add the RISC-V Hypervisor Extension support to QEMU. This allows everyone to use QEMU as a development platform for porting Hypervisors to RISC-V. This can be seen by the recent effort to port KVM to RISC-V.
This presentation will discuss how the RISC-V Hypervisor extension works and how it is different to other common architectures Hypervisor...
I would like to discuss how to implement a series of libraries for all the tracing tools that are out there, and have a repository that at least points to them. From libftrace, libperf, libdtrace to liblttng and libbabeltrace.
Providing better kernel packages to distribution users is a really hot topic in distributions, as the kernel package is a fundamental part of the distribution.
One way to provide a better quality kernel is to implement quality control using automated tests.
Each distribution is probably using different tools and test suites.
Let's share our knowledge and which tools are...
bpftrace is a high level tracing language running on top of BPF: https://github.com/iovisor/bpftrace
We'll talk about important updates from the past year, including improved tracing providers and new language features, and we'll also discuss future plans for the project.
The printk() function has a long history of issues and has undergone many iterations to improve performance and reliability. Yet it is still not an acceptable solution to reliably allow the kernel to send detailed information to the user. And these problems are even magnified when using a real-time system. So why is printk() so complicated and why are we having such a hard time finding a good...
At Netconf 2019 we presented a BPF-based alternative to steering
packets into sockets with iptables and TPROXY extension. A mechanism
which is of interest to us because it allows (1) services to share a
port number when their IP address ranges don't overlap, and (2) reverse
proxies to listen on all available port numbers.
The solution adds a new BPF program type BPF_INET_LOOKUP, which...
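The two use cases above can be modelled in userspace to show what such a lookup has to decide. The sketch below is not the BPF program from the proposal and does not use any kernel API: the class, its methods, the service names, and the documentation-range addresses are all hypothetical, and the first-match rule order is an assumption.

```python
import ipaddress


class SocketLookup:
    """Toy model: map (destination IP, destination port) to a service."""

    def __init__(self):
        self._rules = []  # list of (network, port or None, service)

    def add_rule(self, network, port, service):
        """port=None means "any port", as a reverse proxy would want."""
        self._rules.append((ipaddress.ip_network(network), port, service))

    def lookup(self, dst_ip, dst_port):
        """Return the first service whose rule matches, else None."""
        addr = ipaddress.ip_address(dst_ip)
        for network, port, service in self._rules:
            if addr in network and (port is None or port == dst_port):
                return service
        return None


table = SocketLookup()
# (1) Two services share port 80 because their IP ranges do not overlap.
table.add_rule("192.0.2.0/25", 80, "web-a")
table.add_rule("192.0.2.128/25", 80, "web-b")
# (2) A reverse proxy receives traffic on every port of one address.
table.add_rule("198.51.100.1/32", None, "proxy")
```

Here `table.lookup("192.0.2.10", 80)` and `table.lookup("192.0.2.200", 80)` select different services on the same port, and `table.lookup("198.51.100.1", 12345)` selects the proxy regardless of port.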
Implementing safety-critical systems usually requires adhering to meticulously defined development processes that specify how code is supposed to be developed, integrated and reviewed, driven by the assumption that a disciplined approach leads to reliably high quality. While known to produce code that can satisfy the highest quality standards, Linux kernel development does not follow such...
Syzkaller is run on Upstream and Stable trees. When paired with KASAN it has proven its usefulness uncovering large numbers of Out-of-Bounds (OOB) and Use-after-free (UAF) bugs. These results are readily available on the syzbot dashboard. What do distros gain by running Syzkaller?
Distros regularly add features to their kernels, fix bugs and add third party drivers. Syzkaller testing focused...
What's it going to take to allow us to make the benefits of the RISC-V
architecture available in centralized computing systems? Are there some
things we need to be working on right now to pave the way for future
success here? How can the state of the ARM architecture help us
understand this problem?
This presentation will explore the technical decisions made in designing
a data-center scale...
Many new BPF tracing tools are about to be published, deepening our view of kernel internals on production systems. This session will summarize what has been done and what will be next with BPF tracing, discussing the challenges with taking kernel and application analysis further, and the potential kernel changes needed.
This presentation will discuss the work ongoing to implement Linux kernel
support for RISC-V hardware lacking a memory management unit (MMU). A side effect
of this work is also the ability to execute the kernel directly in M-Mode and
how this is implemented while keeping most of the architecture code unmodified.
The presentation will include examples of testing environment builds, discuss
There have been two different approaches proposed on the LKML over the past year on core scheduling. One was the coscheduling approach by Jan Schönherr, originally posted at https://lkml.org/lkml/2018/9/7/1521 and the next version posted at https://lkml.org/lkml/2018/10/19/859
Upstream chose a different route and decided to modify CFS, and only do "core-scheduling". Vineeth picked up the...
Greybus is an RPC-like protocol on top of the UniPro bus that was designed for Project ARA. The goal of that project was to develop a modular smartphone. Greybus gives the host the ability to remotely control the buses (such as i2c or spi) of the modules.
Although Project ARA has been aborted, Greybus has been merged into the Linux kernel, and it is still maintained by the community.
For many years developers have leveraged gdb or crash to look at kernel crash dumps on Linux. Although those tools have served us well, it can sometimes be difficult to navigate the crash dump to find the information you really need. In this talk, we would like to present some new tools that make it easier to debug kernel crash dumps and enhance kernel developers' ability to root causes...
DRM is merging new drivers at a brisk pace, and with lima and panfrost to support ARM Mali GPUs the last obvious gap in not yet reverse-engineered hardware is getting closed. Plus new features, more contributors, more patches - in general upstream graphics is healthier than it has ever been.
Time for some celebratory drinks, except this talk will be none of that. Now that we've achieved...
This topic will discuss 1) why do we need per-group default domain type, 2) how it solves the problems in the real IOMMU driver, and 3) the user interfaces.
TPM remote attestation (a mechanism allowing remote sites to ask a computer to prove what software it booted) was an object of fear in the open source community in the 2000s, a potential existential threat to Linux's ability to interact with the free internet. These concerns have largely not been realised, and now there's increasing interest in ways we can use remote attestation to improve...
It is well known that batching can often improve software performance. This is
mainly because it utilizes the instruction cache in a more efficient way.
From the networking perspective, the size of driver's packet processing
pipeline is larger than the sizes of instruction caches. Even though NAPI
batches packets over the full stack and driver execution, they are processed
one by one by many...
In IoT applications, be they autonomous cars, health care, smart home or factory automation, the IoT devices (sensors and actuators), gateways, and cloud/datacenter endpoints need software and/or firmware updates to fix security issues, patch bugs, and/or release new features. IoT with its numerous remote devices and gateways presents a large attack surface, making the application of...
Since August 2018 I have been working on SMMUv3 nested stage integration
at IOMMU/VFIO levels, to allow virtual SMMUv3/VFIO integration.
This shares some APIs with the Intel and ARM SVA series (cache invalidation,
fault reporting) but also introduces some specific ones to pass information
about guest stage 1 configuration and MSI bindings.
In this session I would like to discuss the...
Link Aggregation (LAG) is traditionally served by the bonding driver. The Linux bonding driver supports all LAG modes on almost any LAN driver, in software. However, modern hardware features like SR-IOV-based virtualization and stateful offloads such as RDMA are currently not well supported by this model. One possible option to solve that is to implement LAG functionality entirely in NIC's...
Linux kernel fastboot is critical for all kinds of platforms, from embedded/smartphone to desktop/cloud, and it has been hugely improved over the years. But is it all done? Not yet!
This topic will first share the optimizations done for our platform, which cut the kernel (inside a VM) boot time from 3000ms to 300ms, and then list potential future optimization points.
Here are our...
Proxy execution can be considered a generalization of the real-time priority inheritance mechanism. With proxy execution a task can run using the context of some other task that is "willing" to let the first task run, as this improves performance for both. With this topic I'd like to detail the progress that has been made after the initial RFC posting on LKML and discuss open problems...
Wayland is getting close to being ready for day-to-day generic desktop use, but there are still many small issues to tackle, see e.g.:
The purpose of this microconference is to get people together to discuss the various open issues, try to come up with solutions for some of them and possibly...
This talk will give an overview of LoRa and related wireless technologies and their role in IoT infrastructure. An initial RFC for a socket interface had been submitted last summer as proof of concept - a linux-lora.git staging tree and linux-lpwan mailing list have been in use for collaboratively iterating on patches towards a mergeable proposal. Open topics include abandoning PF_LORA in...
Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr() that can result in SCHED_DEADLINE tasks starving RCU's kthreads for extended time periods, not milliseconds, not seconds, not minutes, not even hours, but days. Given that RCU CPU stall warnings are issued whenever an RCU grace period fails to complete within a few tens of seconds, the system did not suffer silently. ...
PASID (Process Address Space ID) is a PCIe capability that enables sharing of a single device across multiple isolated address domains. It has become a hot term in I/O technology evolution, e.g. it is the foundation of SVM and SIOV. Combined with the usages of PASID and the configuration differences due to architecture differences across vendors, it brings an interesting topic on PASID...
While x86 is probably the most prominent platform for vfio/iommu development and usage, other architectures also see quite a bit of movement. These architectures are similar to x86 in some parts and quite different in others; therefore, sometimes issues come up that may be surprising to folks mostly working on more common platforms.
For example, PCI on s390 is using special instructions. QEMU...
The CFS load_balance has become more and more complex over the years and has reached the point where policy sometimes can't be explained. Furthermore, available metrics have evolved and load balance doesn't always take full advantage of them to calculate the imbalance. It's probably a good time to rework the load balance code as proposed in this...
For a long time, the kernel has contained two mechanisms with similar packet filtering functionality: tc filter (with chains) and iptables/nftables.
As eBPF is starting to take over, once again we seem to have two mechanisms with similar functionality: BPFilter and the newly suggested OVS-eBPF datapath (on top of tc).
As we move to using eBPF, I'd like to discuss the possibility of uniting...
Storage hardware with built-in “inline” encryption support is becoming increasingly common, especially on mobile SoCs running Android; it's also now part of the UFS and eMMC standards. These devices en/decrypt data between the application processor and disk without generating disk latency or cpu overhead. Inline encryption hardware can be programmed to hold multiple encryption keys...
Having focused on IoT in Fedora for Red Hat for 3 years, and on the wider Arm and embedded ecosystem for a lot longer, and having dealt with customers looking to prototype large scale IoT deployments for a range of use cases while using a distribution similar to what they use in their data centre but with IoT use cases and increased security, I have a bunch of war wounds and ideas about the...
With the advent of the flow rule and flow block API, ethtool_rx, netfilter and tc can share the same infrastructure to represent hardware offloads.
This presentation discusses the reuse of the existing infrastructure originally implemented by tc, such as the netdev_ops->ndo_setup_tc() interface and the TC_SETUP_CLSFLOWER classifier.
For the past couple of years the CKI ("cookie") project at Red Hat has been transforming the way the company tests kernels, going from staged testing to continuous integration. We've been testing patches posted to internal mailing lists, responding with our results, and last year we started testing stable queues maintained by Greg KH, posting results to the "stable" mailing list.
Now we'd like to...
Modern PCI graphics devices may contain several gigabytes of memory mapped in their BARs. This trend is continuing into storage with NVMe devices containing large Controller Memory Buffers and Persistent Memory Regions.
Some PCI hierarchies are resource constrained and cannot fit as many devices as desired. In NVMe's case, it's preferable to enumerate and attach all devices rather than use the...
There is a presentation in the refereed track on flattening the CPU controller runqueue hierarchy, but it may be useful to have a discussion on the same topic in the scheduler microconference.
This talk will put the spotlight on the linux-wpan project, which brings IEEE 802.15.4 and 6LoWPAN support to the Linux Kernel. Designed for low-power devices, these protocols are ideal for use in some IoT applications. Over the last few years IEEE 802.15.4 support has slowly found its way into the mainline kernel. The 6LoWPAN code is shared with the Bluetooth stack and the ieee802154...
This is meant to be a rather open discussion on PCI resource assignment policies. I plan to discuss a bit what the different arch/platforms do today, how I've tried to consolidate it, then we can debate the pro/cons of the different approaches and decide where to go from there.
The RDMA subsystem in Linux (drivers/infiniband) is now becoming widely used and deployed outside its traditional use case of HPC. This wider deployment is creating demand for new interactions with the rest of the kernel and many of these topics are challenging.
This talk will include a brief overview of RDMA technology followed by an examination & discussion of the main areas where the...
The Linux Kernel scheduler represents a system's topology by the means of
scheduler domains. In the common case, these domains map to the cache topology
of the system.
The Cavium ThunderX is an ARMv8-A 2-node NUMA system, each node containing
48 CPUs (no hyperthreading). Each CPU has its own L1 cache, and CPUs within
the same node share the same L2 cache.
Running some memory-intensive...
Linux has a nice SW bridge implementation which provides most of the classic
Ethernet switching features. DSA and SwitchDev frameworks allow us to
represent HW switch devices in Linux and potentially offload the SW forwarding
But the offloading facilities are not perfect, and there seems to be room for
- Limiting the flooding of L2-Multicast traffic. IGMP snooping...
Testing the upstream kernel is not an easy task. The burden is
still largely put on developers, although several projects are
now covering parts of it such as 0-day, LKFT, CKI, Coccinelle,
syzkaller and kernelci.org. While they all tend to have their
own speciality, they also face a lot of similar challenges.
This BoF is to give an opportunity to exchange ideas and bring
Turbosched is a proposed scheduler enhancement that aims to sustain turbo frequencies for a longer duration by explicitly marking small tasks that are known to be jitters and packing them on a smaller number of cores. This ensures that the other cores will remain idle, and the energy thus saved can be used by CPU intensive tasks for sustaining higher frequencies for a longer duration.
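The packing idea can be illustrated with a toy placement model. This is a sketch under strong assumptions and not the Turbosched patch set: the function name, the task tuples, the per-core capacity of 1.0, and the "lowest-numbered busy core first" policy are all invented for the example.

```python
def place_tasks(tasks, n_cores, capacity=1.0):
    """Place tasks on cores, packing tasks marked as jitter.

    tasks is a list of (name, utilization, is_jitter) tuples.
    Returns (placement, load): a name -> core mapping and per-core load.
    """
    load = [0.0] * n_cores
    placement = {}

    # Regular tasks are spread out: each goes to the least loaded core.
    for name, util, is_jitter in tasks:
        if is_jitter:
            continue
        core = min(range(n_cores), key=lambda c: load[c])
        load[core] += util
        placement[name] = core

    # Jitter tasks are packed onto cores that are already busy and still
    # have room, so that idle cores are not woken up for them.
    for name, util, is_jitter in tasks:
        if not is_jitter:
            continue
        busy = [c for c in range(n_cores)
                if load[c] > 0 and load[c] + util <= capacity]
        core = busy[0] if busy else min(range(n_cores), key=lambda c: load[c])
        load[core] += util
        placement[name] = core

    return placement, load
```

With one CPU-intensive task of utilization 0.6 and three jitter tasks of 0.1 each on four cores, everything lands on core 0 and the other three cores stay idle, which is the condition the abstract relies on for saving energy.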
Many "drivers" for IoT sensors and actuators live outside kernel space through efforts that seek to provide abstractions not sufficiently handled in the kernel today. This is resulting in great code fragmentation that can be resolved by better understanding the developer needs and communicating an achievable collaborative approach. Pushing the interface to these devices off to userspace is...
A PCI-Express non-transparent bridge (NTB) is a point-to-point PCIe bus
connecting 2 host systems. NTB functionality can be achieved in a platform
having 2 endpoint instances. Here each endpoint instance will be
connected to an independent host and the hosts can communicate with each other
using endpoint as a bridge. The endpoint framework and the "new" NTB EP
function driver should...
Currently there is no user control over how much time the scheduler should spend searching for CPUs when scheduling a task. It is hardcoded logic based on some heuristics that don't work well in many cases, e.g. for very short running tasks. The proposal is to provide a new latency-nice property that the user can set for a task (similar to the nice value) and that controls the search time and also potentially the preemption logic. Also...
The Thunderbolt vulnerabilities are now public and have a nice name: Thunderclap (https://thunderclap.io/). This topic will introduce the kinds of vulnerabilities we have identified in Linux and how we are fixing them.
Operating system distributors often face challenges that are somewhat
different from those of upstream kernel developers. For instance, some
kernel updates often need to stay at least binary compatible with
modules that might be "out of tree" for some time.
In that context, being able to automatically detect and analyze
changes to the binary interface exposed by the kernel to its module
The BPF VM in the kernel is being used in ever more scenarios where running a restricted, validated program in kernel space provides a super powerful mix of flexibility and performance, which is transforming how a kernel works.
That creates challenges for developers, sysadmins and support engineers: having tools for observing what BPF programs are doing in the system is critical.
A lot has...
kernelCI: testing a broad variety of hardware
The Linux kernel runs on an extremely wide range of hardware, but
with the rapid pace of kernel development, it's difficult to ensure
the full range of supported hardware is adequately tested.
The kernelCI project is a small, but growing project, focused on
testing the core kernel on a diverse set of architectures, boards and
There is a lot of similar and duplicated code in architecture specific
bits of memory management.
For instance, most architectures have
#define PGALLOC_GFP (GFP_KERNEL | __GFP_ZERO)
for allocating page table pages and many of them use similar, if not
identical, implementation of
But that's only the tip of the iceberg.
There are several
early_alloc() or similarly...
Today it is hard to imagine being without a mobile phone, a laptop or a tablet. With the progress of technology and all these handheld devices, we have been able to get many of our documents digitized. However, whatever advancements we see in this space of documentation, it is still very hard to find someone who has not needed to print or scan a hard copy....
It goes without saying that XDP is wanted more and more by everyone. Of course, the Linux distributions want to bring to users what they want and need. Even better if it can be delivered in a polished package with as few surprises as possible: receiving bug reports stemming from users' misunderstanding and from their wrong expectations makes a good experience neither for the users nor...
The OpenPrinting project “Common Print Dialog Backends” provides a D-Bus interface to separate the print dialog GUI from the communication with the actual printing system (CUPS, Google Cloud Print, etc.), with each printing system supported by a backend and these GUI-independent backends working with all print dialogs (GTK/GNOME, Qt/KDE, LibreOffice, etc.). This allows for easily...
The glibc project decided a while back that it wants to add wrappers for
system calls which are useful for general application usage. However,
that doesn't mean that all those missing system calls are added
System call wrappers still need documentation in the manual, which
can be difficult in areas where there is no consensus how to describe
the desired semantics (e.g., in the...
Boot testing is already hard to do well on a wide variety of
hardware. However it is only scratching the surface of the
kernel code base. To take projects such as Kernel CI to the next
level and increase coverage, functional tests are becoming the
next big thing on the list. Large test suites that run close to
the hardware are very hard to tame. Some projects such as
ezbench could become...
What I'd like to get to is to discuss that buffered IO basically sucks for databases with high throughput, and direct IO sucks for databases that aren't individually well tuned, and is not adaptive to memory pressure at all.
Buffered IO is slow, until recently only synchronous, has double buffering issues and writeback is hard to control.
Direct IO requires that the application's equivalent...
The PREEMPT_RT patchset is the longest existing large patchset living outside the Linux kernel. Over the years, the realtime developers had to maintain several stable kernel versions of the patchset. This talk will present the lessons learned from this experience, including workflow, tooling and release management that has proven over time to scale. The workflow deals with upstream changes...
In the Linux kernel, most operations affecting a process's address space are protected by mmap_sem (a per-process read-write semaphore).
This simple design is increasingly a problem for multi-threaded applications, and often causes threads that operate on separate parts of their address space to end up blocking on each other due to false sharing issues - mmap_sem only supports locking the...
Printing has progressed a lot and the world is already enjoying the benefits of driverless printing. Today it is very hard to think of a printer without a scanner. Unfortunately, driverless scanning has yet to see the light of day: at present you cannot use a scanner without a scanner driver. We want to discuss more on...
In this talk Dmitry will introduce the idea of GWP-ASan, a sampling tool that finds use-after-free and heap-buffer-overflow bugs in production environments. GWP-ASan supplements the normal slab allocator and chooses random allocations to 'sample'. These sampled allocations are placed into a special guarded pool, which is based upon the traditional 'Electric Fence Malloc Debugger' idea. Dmitry...
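The sampling idea can be modelled in a few lines. This is a toy sketch, not GWP-ASan's implementation: the real tool relies on guard pages and memory faults, whereas here a "guarded" allocation is just a handle tracked in a set. The class, method, and exception names, and the uniform `1 / sample_rate` sampling rule, are assumptions for illustration.

```python
import random


class UseAfterFreeError(Exception):
    """Raised when a freed, guarded allocation is accessed."""


class SamplingAllocator:
    """Toy model: a random subset of allocations goes to a guarded pool."""

    def __init__(self, sample_rate, rng=None):
        if sample_rate < 1:
            raise ValueError("sample_rate must be >= 1")
        self._probability = 1.0 / sample_rate
        self._rng = rng if rng is not None else random.Random()
        self._next_handle = 0
        self._guarded = set()        # live allocations in the guarded pool
        self._freed_guarded = set()  # freed allocations that stay poisoned

    def alloc(self):
        handle = self._next_handle
        self._next_handle += 1
        # Sample roughly one in sample_rate allocations.
        if self._rng.random() < self._probability:
            self._guarded.add(handle)
        return handle

    def free(self, handle):
        if handle in self._guarded:
            self._guarded.remove(handle)
            self._freed_guarded.add(handle)

    def access(self, handle):
        # Only sampled allocations are checked; bugs on unsampled
        # allocations go unnoticed, which is the price of low overhead.
        if handle in self._freed_guarded:
            raise UseAfterFreeError("handle %d used after free" % handle)
```

With `sample_rate=1` every allocation is guarded and any use after free is caught. With a large `sample_rate` most bugs are missed on any single run, and detection depends on the same bug recurring across many runs in production.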
There are many security features common to both GCC and Clang, but there is a growing set of features that are missing from GCC and present in Clang, missing from Clang and present in GCC, or missing in both. This session seeks to enumerate and discuss these areas, with the eye toward finding next steps forward (or at least elevating development priority).
Potential areas of focus:
During the last two years, KMSAN (a detector of uses of uninitialized
memory based on compiler instrumentation) has found more than a
hundred bugs in the upstream kernel.
We'll discuss the current status of the tool, some of its findings and
implementation challenges. Ideally, I'd like to get more people to
look at the code, as finding bugs in particular subsystems may require
There are two flavors of power management supported by the Linux kernel: system-wide PM based on transitions of the entire system into sleep states and working-state PM focused on controlling individual components when the system as a whole is working. PM-runtime is part of working-state PM concerned about the opportunity to put devices into low-power states when they are not in use.
Big systems are becoming more common these days. Having thousands of CPUs is
no more a dream and some applications are attempting to spread over all
these CPUs by creating threads.
This leads to contention on the mm->mmap_sem which is protecting the memory
layout shared by these threads.
There were multiple attempts to get rid of the mmap_sem's contention or the
mmap_sem itself, Speculative...
Working for a networking hardware vendor can be an extremely rewarding experience for a kernel developer. The rate at which new features are accepted in the kernel also provides lots of motivation to develop new features that showcase hardware capabilities. This could be done by adding new support for dataplane offloads via cls flower, netfilter, or switchdev (if we still think it exists!). ...
Modern server and compute-intensive systems are naturally built around several top-performance CPUs with large numbers of cores, equipped with shared memory that spans a number of NUMA domains. Compute-intensive workloads usually implement highly parallel, CPU-bound cyclic code performing mathematical calculations that reference data located in the shared memory. Performance observability and...
The upstream author of CUPS has deprecated the classic way to implement printer drivers, describing the printer's capabilities in PPD (PostScript Printer Description) files and providing filters to convert standard PDLs (Page Description Languages) into the printer's own, often proprietary data format. With the background of PostScript not being the standard PDL any more, most modern (even the...
In this talk, Dmitry will share updates on syzkaller/syzbot since last year: USB fuzzing, bisection, memory leaks. Talk about open problems: testability of kernel components; test coverage; syzbot process.
This topic will cover how the LLVM port of the Linux kernel is going, where it’s being used, and some of the pain points still plaguing those efforts. The issues the kernel port is having are almost always the same issues that other projects have porting from gcc to clang.
A lot of updates have been made to both the kernel and to llvm/clang which are making both projects better.
From the initial reactions and interest I have seen wrt. KTF
and the discussions on LKML around KUnit (https://lkml.org/lkml/2018/11/29/82),
it seems there's a general belief that some form of unit test framework
like these can be a good addition to the tools and infrastructure already available
in the kernel.
A brief introduction to CTF and its recent addition to the GNU toolchain: what is it for, what's there now, what improvements are planned, and why you might want to use this stuff rather than DWARF.
What cool things might we be able to do now that C programs can inspect their own types cheaply? What cool things might we be able to do if we extend this to other languages, so C programs could...
Very common in the daily life of computer users are printer setup tools, these GUI applications where you configure a queue for a new printer which you want to use. You select the printer from auto-detected ones and choose a driver for it, nowadays it gets rather common that the driver is selected automatically. You also set option defaults, like Letter/A4, print quality, …
With the advent of...
IPv4's success story was in carrying unicast packets
Service sites still need IPv4 addresses for everything,
since the majority of Internet client nodes don't yet
have IPv6 addresses. IPv4 addresses now cost 15 to 20
dollars apiece (times the size of your network!) and
the price is rising.
The IPv4 address space includes hundreds of millions of
addresses reserved for obscure (the...
Recent vulnerabilities like L1 Terminal Fault (L1TF) and Microarchitectural Data Sampling (MDS) have shown that the CPU hyper-threading architecture is very prone to leaking data with speculative execution attacks.
Address space separation is a proven technology to prevent side channel vulnerabilities when speculative execution attacks are used. It has, in particular, been successfully used...
Follow up on the tracing microconference
Topics to be discussed:
- Perf related events
- Histogram sql syntaxes
Kselftest started out as an effort to enable a developer-focused regression test framework in the kernel to ensure the quality of new kernel releases. Today it is an integral part of the Linux Kernel development process to qualify Linux mainline and stable release candidates.
Shuah will go over the Kselftest framework, how to write tests that work well with the framework for effective...
Currently, to print an STL model on a 3D printer, the model first needs to be sliced into G-code that the 3D printing software can understand. In Linux we do not have any filter that can convert STL to G-code. First we plan to discuss the current scenario, and then what we can do to fit this into Linux.
This proposal covers the ongoing effort about adding eBPF support to the GNU Toolchain.
Binutils support is already upstream. This includes a CGEN cpu description, assembler, disassembler and linker. A GCC backend will be submitted for inclusion upstream before September.
Both the binutils and GCC ports will be briefly described, and then a list of points will be discussed with the...
Nowadays all consumer PC/laptop devices contain TPM2.0 security chip (due to Windows hardware requirements). Also servers and embedded devices increasingly carry these TPMs. It provides several security functions to the system and the user, such as smartcard-like secure keystore and key operations, secure secret storage, bruteforce-protected access control, etc.
These capabilities can be used...
A year ago at Linux Plumbers, we talked about a generic Android kernel that boots
and runs reasonably well on any Android device. This talk shares the progress we've made so far on many fronts. A summary of those work streams, problems we discovered along the way and our plans for them. We will talk about our short term goals and long term vision to get Android device kernels as close to the...
It looks like there may well be enough critical folks present to have a good BOF about safety and linux. Topics can include safety processes and methodologies, tooling to support analysis, security update concerns, etc. Basically, if you're interested in using Linux in safety critical systems come join, and we'll see where the conversation goes.
In this talk, we will present a scalable re-implementation of the Kubernetes service abstraction with the help of eBPF. We will discuss recent changes in the kernel which made the implementation possible, and some changes in the future which would simplify the implementation.
Kubernetes is an open-source container orchestration multi-component distributed system. It provides mechanisms for...
The current design of the thermal framework forces the usage of a governor with a thermal zone thus limiting the scope of the decisions.
The question of the multiple thermal zones representation and how they are handled by a governor was put several times on the table but without a clear consensus.
In order to go forward in this area, this MC topic proposes a simple design with a hierarchical...
Memory pressure is inevitable in many environments. A decade size survey of DRAM to CPU ratio in virtual machines and physical machines for data centers implies that the pressure will be even more common and severe. As an answer to this problem, heterogeneous memory systems utilizing recently evolved memory devices such as non-volatile memory along with the DRAM are...
CRIU only restores processes with the same PID the processes used to have during checkpointing. As there is no interface to create a process with a certain PID like
fork_with_pid() CRIU does the PID dance to restore the process with the same PID as before checkpointing.
The PID dance consists of
write()ing PID-1 to...
The Kernel's API and ABI exposed to Kernel modules is not something
that is usually maintained in upstream. Deliberately. In fact, the
ability to break APIs and ABIs can greatly benefit the development.
Good reasons for that have been stated multiple times. See e.g.
The reality for distributions might look different though. Especially
- but not...
Performance capping due to thermal limitations is a common scenario, particularly in mobile systems. Today user-space has no information about what level of performance can be expected in the worst case, and SCHED_DEADLINE can admit reservations which are impossible to fulfill.
The purpose of this topic is to discuss what level of guarantees the kernel should provide. Should the kernel have a...
Containers are generally perceived as less secure than virtual
machines. Without going into a theological argument about the actual
state of the affairs, we suggest to explore the possibility of using
address space isolation inside the kernel to make containers even more
Assuming that kernel bugs and therefore vulnerabilities are inevitable
it is worth isolating parts of the kernel to...
GKI or any ARM64 Linux distro needs a single ARM64 kernel that works across all SoCs. But having a single ARM64 kernel that works across all SoCs has a lot of hurdles. One of them, is getting all the SoC specific devices to be handed off cleanly from the bootloader to the kernel even when all their drivers are loaded as modules. Getting this to work correctly involves proper ordering of events...
An update on how we plan to enable multimedia testing on our 'cuttlefish' virtual platform. Overview of missing components for graphics virtualization.
This BoF session aims to bring together Linux kernel developers who have an interest in formal methods (or formal methods experts with an interest in kernel development). Topics for discussion:
- A poll of formal methods currently used in the context of the Linux kernel: SPIN, TLA+, CBMC, herd, plain English etc.
- High level design specification vs. low level algorithm modelling. What...
With virtualization being the key to the success of cloud computing, Intel's
introduction of the Scalable IO Virtualization (SIOV) aims to further the cause by shifting the creation of assignable virtual devices from hardware to a more software assisted approach. Using SIOV, a device resource can be mapped directly to guest or other user space drivers for near native DMA (Direct Memory...
XDP (the eXpress Data Path) is a new method in Linux to process
packets at L2 and L3 with really high performance. XDP has already
been deployed, and use cases involving ingress packet filtering or
transmission back through the ingress interface are already well
supported today. However, as we expand the use cases that involve the
XDP_REDIRECT action, e.g., to send packets to other devices,...
Recently the kernel landed seccomp support for SECCOMP_RET_USER_NOTIF which enables a process (watchee) to retrieve a fd for its seccomp filter. This fd can then be handed to another (usually more privileged) process (watcher). The watcher will then be able to receive seccomp messages about the syscalls having been performed by the watchee.
We have integrated this feature into userspace and...
Tools based on low level tracing tend to generate large amounts of data, typically outputted in some kind of text or binary format. On the other hand the predefined data analysis features of those tools are often useless when it comes to solving a nontrivial or very user-specific problem. This is when the possibility to make sophisticated analysis via scripting can be extremely useful.
Thermally unsustainable compute demand is in most systems controlled by reducing performance through disabling performance states on specific CPUs or other devices in the system. It provides an efficient method to ensure the system doesn't overheat, however, it doesn't take the actual workload into account which could be better served if the performance caps were applied differently.
The libcamera project was started at the end of 2018 to unify camera support on all Linux systems (regular Linux distributions, Chrome OS and Android). In 9 months it has produced an Android Camera HAL implementing the LIMITED profile for Chrome OS, and work is in progress to implement the FULL profile. Two platforms are currently supported (Intel IPU3 and Rockchip ISP), with work on...
Over the last year we have worked on expanding the task migration using CRIU in Google. The talk will discuss how in some cases the kernel interfaces are lacking for the purpose of migration:
- Lack of support for reading rseq configuration which means that it requires userspace support to migrate users of rseq properly.
- Lack of support for reading what cgroup events the users have...
Update and discussion of emulated storage on Android
When each CPU core can independently control its performance states, there is a performance loss on some benchmarks compared to the case when there are no independent performance states. There are a couple of options to indicate to the cpufreq drivers when a producer thread wakes a consumer thread: one is sending hints, as we do for IO boost, or boosting PELT utilization. But there is a...
Container runtimes, engines and orchestrators provide a production-grade, robust, high-performing, but also relatively self-managing, self-healing infrastructure using innovative open-source technologies.
CRIU allows the running state of containerised applications to be preserved as a collection of files that can be used to create an equivalent copy of the applications at a later time, and...
The cgroups CPU controller in the Linux scheduler is implemented using hierarchical runqueues, which introduces a lot of complexity, and incurs a large overhead with frequently scheduling workloads. This presentation is about a new design for the cgroups CPU controller, which uses just one runqueue, and instead scales the vruntime by the inverse of the task priority. The goal is to make people...
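As a rough illustration of the idea of a single runqueue with scaled vruntime, the Python sketch below charges each task virtual runtime in inverse proportion to its weight and always runs the task with the smallest vruntime. It is not kernel code: the reference weight of 1024, the use of a "weight" as a stand-in for priority, and all function names are assumptions made for this example.

```python
NICE_0_WEIGHT = 1024  # assumed reference weight for a default-priority task


def scaled_vruntime_delta(delta_exec_ns: int, weight: int) -> int:
    """Virtual runtime charged for delta_exec_ns of real CPU time.

    A heavier (higher-priority) task is charged less virtual time for the
    same amount of real time, so it is picked more often.
    """
    if weight <= 0:
        raise ValueError("weight must be positive")
    return delta_exec_ns * NICE_0_WEIGHT // weight


def pick_next(tasks: dict) -> str:
    """Return the name of the task with the smallest virtual runtime."""
    return min(tasks, key=lambda name: tasks[name]["vruntime"])


def simulate(tasks: dict, slice_ns: int, steps: int) -> dict:
    """Run `steps` fixed time slices on one runqueue; return CPU time per task."""
    cpu_time = {name: 0 for name in tasks}
    for _ in range(steps):
        name = pick_next(tasks)
        task = tasks[name]
        task["vruntime"] += scaled_vruntime_delta(slice_ns, task["weight"])
        cpu_time[name] += slice_ns
    return cpu_time
```

With two tasks whose weights differ by a factor of two, the heavier task ends up with twice the CPU time, without any per-group runqueue.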
Continuing the attempts to reduce fragmentation in power management on ARM platforms, there are discussions about whether something similar to ACPI can be done, i.e. device-centric power management.
Currently, a device has power, performance, reset, and clock domains associated with it. SCMI provides an interface to deal with these domains directly. This was a simpler approach to start with the SCMI...
This work proposes to adopt Extended FUSE (ExtFUSE) framework for improving the performance of Android SDCard FUSE daemon, thereby eliminating a need for out-of-tree WrapFS hackery in the Android kernel.
ExtFUSE leverages eBPF framework for developing extensible FUSE file systems. It allows FUSE daemon in Android to register “thin” eBPF handlers that can serve metadata as well as data I/O...
What could be more fun than talking about kernel documentation? Things we
could get into:
The state of the RST transition, what remains to be done, whether it's
all just useless churn that makes the documentation worse, etc.
Things we'd like to improve in the documentation toolchain.
Overall organization of Documentation/ and moving docs when the need
arises. It seems I end...
Discussion of using Persistent Memory as first- (or second-) class memory.
Google has a successful prototype of a software-managed "Transparent" mode for 3dXPoint / AEP memory, but we're working on re-designing this into something that is more supportable and at least partially upstreamable.
We want to open a discussion of how we can represent this "swap"-like use of AEP sensibly.
Providing encryption in dynamic environments where nodes are added and removed on-the-fly and services spin-up and are then torn-down frequently, such as Kubernetes, has numerous challenges. Cilium, an open source software package for providing and transparently securing network connectivity, leverages BPF and the Linux encryption capabilities to provide L3/L7 encryption and authentication at...
The Linux kernel has recently acquired a new API for creating mounts. This allows a greater range of parameters and parameter values to be specified, including, in the future, container-relevant information such as the namespaces that a mount should use.
Future developments of this API also need to work out how to deal with upcalling from the kernel to gain parameters not directly supplied,...
A short update on eBPF in Android networking:
- how we're using ebpf in Android P on 4.9+ for statistics collection
and Q on 4.9+ for xlat464 offload, with a focus on the sorts of
problems we've run into
- where we'd like to go, ie. future plans with regard to xlat464/forwarding/nat
offload and XDP.
At LPC 2015, we introduced analyze_suspend, a new open source tool to show where the time goes during Linux suspend/resume. Now called "sleepgraph", it has evolved in a number of ways over the last four years. Most importantly, it is now the core of a framework that we use for suspend/resume endurance testing.
Endurance testing has allowed us to identify, track, report and sometimes fix...
Since Canonical is now shipping it, I think we can all agree it solves a problem and we just need to get the patches into shape for upstream submission. Can we discuss a pathway for doing that?
As part of the Android Microconference:
Linux Kernel Functional Test is a system to detect kernel regressions across the range of mainline, LTS and Android Common kernels. It is able to run a variety of operating systems from Linux to Android across an array of systems under test. You're probably thinking in terms of standard test suites like CTS, VTS, LTP, kselftest and so on and you're be...
Code review is a collaborative activity involving sentiments and emotions that can affect developers' productivity, creativity, and contribution satisfaction. Discussions in a code review environment in open source could get spirited at times as people from diverse backgrounds and interests are part of it. As a consequence, open source communities have become introspective and started to think...
Many Ethernet PHYs contain hardware to perform diagnostics of the
Ethernet cable. Breaks in the cable and shorts within a twisted pair
or to other pairs can be detected, and an estimate to the length along
the cable to the fault can be made. The talk will explain, at a high
level, how such diagnostics work, sending pulses down the cables and
looking for reflections. There is no standardization...
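The distance estimate mentioned above follows from the round-trip time of the reflected pulse: the pulse travels to the fault and back, so the distance is half of the propagation speed times the elapsed time. A minimal Python sketch of that arithmetic follows; the function name and the default velocity factor of 0.66 are assumptions for illustration, since the real propagation speed depends on the cable.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def fault_distance_m(round_trip_s: float, velocity_factor: float = 0.66) -> float:
    """Estimate the distance to a cable fault from a pulse's round-trip time.

    velocity_factor is the signal speed in the cable as a fraction of the
    speed of light; 0.66 is an assumed placeholder, not a measured value.
    """
    if round_trip_s < 0:
        raise ValueError("round-trip time must not be negative")
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity factor must be in (0, 1]")
    # The pulse covers the distance twice (out and back), hence the division by 2.
    return velocity_factor * SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0
```

Because the result scales linearly with the velocity factor, an error in that assumed value translates directly into the same relative error in the reported distance.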
Linux is complex, and formal verification has been gaining more and more attention because independent "asserts" in the code can be ambiguous and not cover all the desired points. Formal models aim to avoid such problems of natural language, but the problem is that "formal modeling and verification" sound complex. Things have been changing.
What if I say it is possible to verify Linux...
This topic will discuss how the Android framework utilizes new kernel features
to better handle memory pressure. This includes app compaction, new
kill strategies and improved process tracking using pidfds.
We at Intel developed instrumentation for measuring C-state wake latency. The instrumentation, which we call "waltr" (WAke up Latency Tracer), consists of user-space and kernel-module parts.
In principle, waltr works by scheduling delayed interrupts and measuring the wake latency close to the x86 'mwait' instruction. This requires an external device equipped with high precision clock and...
To discuss recent developments and directions with DMABUF:
- DMABUF Heaps/ION destaging
- Better DMABUF ownership state machine documentation
- DMABUF cache maintenance optimizations
- Kernel graphics buffer idea
Userspace has (for a long time) needed a mechanism to restrict path resolution. Obvious examples are those of FTP servers, Web Servers, archiving utilities, and now container runtimes. While the fundamental issue with privileged container runtimes opening paths within an untrusted rootfs was known about for many years, the recent CVEs (CVE-2018-15664 and CVE-2019-10152 being the most recent)...
There are some improvements in the CPU idle time management to be made, like switching over to using time in nanoseconds (64-bit), reducing overhead and some governor modifications (including possible deprecation of the menu governor) which need to be discussed.
A short update on the status of DRM/KMS ecosystem adoption and how Google is improving verification of the DRM display drivers in Android devices.
Android has been using an out-of-tree schedtune cgroup controller for
task performance boosting of time-sensitive processes. Introduction of
utilization clamping (uclamp) feature in the Linux kernel opens up an opportunity to adopt an upstream mechanism for achieving this goal. The talk will present our plans on adopting uclamp in Android.
The kernel contains a keyrings facility for handling tokens for filesystems and other kernel services to use. These are frequently disabled for container environments, however, because they were not made namespace aware by the authors of the user-namespace and others.
Unfortunately, this lack prevents various things from working inside containers. To get around this, keys are now being...
What is MTE and why do we need to add support for it in Linux userspace? Memory Tagging is an ARMv8.5 extension and provides architectural support for run-time detection of various classes of memory errors. It can be used to aid with software debugging to eliminate vulnerabilities before they can be exploited (i.e. bounds violations, use-after-free, use-after-return, use-out-of-scope and...
We have cgroup v1 users who want to switch to cgroup v2, but there
currently isn't an upstream migration story for them. (Previous
LPC talks have focused on the issues of migrating from v1 to v2, but
no substantial upstream solution has come to fruition.)
The goal of this talk is to discuss the cgroup v1 to v2 migration
path and gauge community interest in a cgroup v1/v2...
We have a number of unsolved time and vdso related issues in CRIU.
- Syscall restart: if checkpointing a task interrupted a syscall, on restore CRIU blindly starts the syscall again (executing the SYSCALL/SYSENTER/INT80/etc. instruction with the original regset). It works OKish, but not with time-blocking syscalls, i.e. poll(), nanosleep(), futex(), etc. For this purpose, Glibc and vDSO use...
Recently speculative execution techniques have shown that an untrusted application can steal data from another one when both share the same core. To avoid such problems users have to disable SMT, causing non-negligible performance impact. Core-scheduling tries to mitigate the performance problem by allowing trusted applications to run concurrently on siblings of a core while avoiding two...
The csky architecture was officially merged into mainline in Linux 4.20. Before that, eight architectures had just been removed from mainline. Many people, including our colleagues, ask what upstreaming csky means. Here, we will give some examples to introduce the progress of the csky architecture in the past six months and the value and significance of linux-csky. This is an...
The demand for DRAM across different platforms is increasing but the cost is not decreasing. Thus DRAM is a major factor in the total cost across all kinds of devices like mobile, desktop or servers. In this talk we will be presenting the work we are doing at Google, applicable to Android, Chrome OS and data center servers, on extracting more memory out of running applications without impacting...
- Suggestion with VFIO (Don)
- RDMA as the importer, VFIO as the exporter
get_user_pages() and friends
- Discussion on future GUP, required to support P2P
- GUP to SGL?
- Non struct page based GUP
- Integrating RDMA ODP with HMM
- 'DMA fault' for ZONE_DEVICE pages
The ABI between Linux and user software mostly sits at the user/privileged boundary, although many architectures extend this with a small amount of special-case code that sits in userspace, such as in special pages or shared libraries (vDSOs) mapped into each user process that user code can call into.
The reasons for this are a bit arbitrary: system interface libraries such as glibc and...
Quick introduction of people. Frame discussion. Will be quick I promise.
Cilium is an open source project which implements the Container Network
Interface (CNI) to provide networking and security functions in modern
application environments. The primary focus of the Cilium community recently
has been on scaling these functions to support thousands of nodes and hundreds
of thousands of containers. Such environments impose a high rate of churn as
containers and nodes...
Many devs are excited about the progress reported on this new stuff, but is it followed or considered by kernel devs? What kind of gain should we expect? Are there any potential issues or feedback to share?
For example, for write-ahead logging, one needs to guarantee that writes to the log are completed before the corresponding data pages are written. fsync() on the log file does this, but it is overkill for this purpose.
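The fsync()-based ordering described above can be sketched in Python as follows. This shows only the heavyweight approach that the abstract calls overkill, not a lighter ordering primitive; the helper names and file names are hypothetical, and the data page is simply appended to keep the example short.

```python
import os
import tempfile


def append_and_sync(path: str, payload: bytes) -> None:
    """Append payload to path and block until fsync() reports it flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        view = memoryview(payload)
        while view:  # os.write() may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)
    finally:
        os.close(fd)


def logged_update(log_path: str, data_path: str, record: bytes, page: bytes) -> None:
    """Write-ahead ordering: the log record is synced before the data page is written."""
    append_and_sync(log_path, record)  # step 1: log record flushed first
    append_and_sync(data_path, page)   # step 2: only then touch the data file


def demo() -> tuple:
    """Run one logged update in a scratch directory and return both file contents."""
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "wal.log")
        data_path = os.path.join(tmp, "table.dat")
        logged_update(log_path, data_path, b"set x=1\n", b"x=1\n")
        with open(log_path, "rb") as log, open(data_path, "rb") as data:
            return log.read(), data.read()
```

The fsync() in step 1 waits for the whole log file to be flushed, which is more than the ordering guarantee strictly requires; that gap is the point the abstract raises.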
RCU has changed a surprising amount over the past few years, what with elimination of many RCU Kconfig options in favor of kernel boot parameters, RCU flavor consolidation, ongoing work on speeding up RCU's handling of offloaded callbacks, and newly started work on providing warnings when RCU's callback handling is overloaded. These changes affect how RCU behaves, and in some cases in ways...
It seems that the patches proposed by Fusion-io devs for general
O_ATOMIC support within the Linux kernel have been on standby for six years. Are there any plans to address it? What is the main reason not to guarantee atomicity of
O_DIRECT writes on flash drives? It seems that most flash storage vendors are able to provide atomic write support at the HW level, and only the SW level (kernel/FS/etc.) is missing.
KUnit is a new lightweight unit testing and mocking framework for the Linux kernel. Unlike Autotest and kselftest, KUnit is a true unit testing framework; it does not require installing the kernel on a test machine or in a VM (however, KUnit still allows you to run tests on test machines or in VMs if you want) and does not require...
Gen-Z Linux Sub-system
Discuss design choices for a Gen-Z kernel sub-system and the challenges of supporting the Gen-Z interconnect in Linux.
Gen-Z is a fabric interconnect that connects a broad range of devices from CPUs, memory, I/O, and switches to other computers and all of their devices. It scales from two components in an enclosure to an exascale mesh. The Gen-Z consortium has over...
For some time now, special camera setups exist having features which are challenging for I2C address layouts as we know them in Linux: a) a high-speed serial link which can embed I2C communication (e.g. GMSL or FPD-Link III) and b) the ability to reprogram the client addresses of the I2C devices on the camera.
The use case for these cameras is to run multiple of them in parallel, and not just...
Application workloads are becoming increasingly diverse in terms of their network resource requirements and performance characteristics. As opposed to long running monoliths deployed in virtual machines, containerized workloads can be as short lived as few seconds. Today, container orchestrators that schedule these workloads primarily consider their CPU and memory resource requirements since...
Since newer kernels (4.14, 5.1, ...) we have been observing a 50% regression on MySQL IO-bound workloads using EXT4, compared to results on the same HW running kernel 3.x or 4.1. Unfortunately we have absolutely no explanation for this regression right now and are looking for any available FS layer instrumentation/visibility to understand what is the root problem for such a regression and how...
For almost 2 years now the use of RDMA with DAX filesystems has been disabled due to the incompatibilities of RDMA and the file system page handling.
A general consensus has emerged from many conferences and email threads on a path to support RDMA directly to persistent memory which is managed by a filesystem.
This talk will present the work done since LSFmm to support RDMA and FS...
I'd like to review whether and how we can build real-time containers. It should include, but is not limited to, these topics:
- Understanding container scheduling
- Tests and evaluations
- Possible factors related to latency issues
- Discussions such as tracing container-level metrics
We know that reducing the sections with preemption and IRQs disabled reduces latency, and that IRQs influence it, but some cases are hard to catch. For example, in the old jump label update, there was a burst of IPIs causing latency spikes. Such non-periodic behavior is hard to mathematize. As a side effect, this adds pessimism to "possible formulas" that try to define the worst-case...
Historically, XFS always showed lower performance compared to EXT4 on most IO-bound workloads used for MySQL/InnoDB benchmark testing. However, since the new kernels and XFS arrived, we observe significantly better results on XFS versus EXT4, particularly when InnoDB "double write" is enabled. On the other hand, to our big surprise, XFS was doing worse if "double write" was...
Application-specific accelerators are going to start showing up in larger numbers in the times ahead. Today there's often no suitable subsystem for them to aggregate into, and the first of them have landed under drivers/misc for the time being.
The goal of this BoF is to introduce and discuss the ground rules for a new drivers/accel subsystem, how it fits in with other subsystems, and...
We are going through the 5th iteration of upstreaming IBNBD/IBTRS; the latest effort is here: https://lwn.net/Articles/791690/.
We would like to discuss in an open round about the unique features of the driver and the library, whether and how they are beneficial for the RDMA eco-system and what should be the next steps in order to get them upstream.
A face to face discussion about action items...
Route entries in a FIB tend to be very redundant with respect to nexthop configuration with many routes using the same gateway, device and potentially encapsulations such as MPLS. The legacy API for inserting routes into the kernel requires the nexthop data to be included with each route specification leading to duplicate processing verifying the nexthop data, an effect that is magnified as...
Traditionally processes are identified globally via process identifiers (PIDs). Due to how pid allocation works the kernel is free to recycle PIDs once a process has been reaped. As such, PIDs do not allow another process to maintain a private, stable reference on a process. On systems under pressure it is thus possible that a PID is recycled without other (non-parent) processes being aware of...
Which Real Time softirq implementation do we want for mainline?
- Vector-Lock based? (depends on sleeping spinlocks machinery)
- Vector masking based?
- Other?
In this talk Dmitry will highlight some of the areas for improvement related to release quality, security, and developer experience and productivity. Then try to show that the existing processes, approaches and tools poorly cope with the current scale and rate of change and don't provide adequate quality and developer experience. Lastly Dmitry will advocate that only pervasive changes to the...
(1) SQLite is the most widely used database in the world. There are probably in excess of 300 billion active SQLite databases on Linux devices. SQLite is a significant client of the Linux filesystem - perhaps the largest single non-streaming client, especially on small devices such as phones.
(2) Unlike other relational database engines, SQLite tends to live out on the edge of the network,...
Postgres (and many other databases) have, until fairly recently, assumed that a) IO errors would be reliably signalled by fsync/fdatasync/... and b) repeating an fsync after a failure would either result in another failure, or the IO operations would succeed.
That turned out not to be true; see also https://lwn.net/Articles/752063/
While a few improvements have been made, both in Postgres and...
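A minimal sketch of one conservative response to the problem above: since a repeated fsync cannot be trusted to report the earlier failure again, the first failure is treated as fatal instead of being retried. The `durable_write` helper is hypothetical and is not taken from Postgres or any other database.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data to path and fsync it, treating any fsync error as fatal."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # os.write may write fewer bytes
            view = view[written:]
        try:
            os.fsync(fd)
        except OSError as exc:
            # Deliberately no retry loop: a second fsync may succeed even
            # though the data from the failed writeback never reached disk.
            raise RuntimeError(
                f"fsync failed on {path}; data may be lost"
            ) from exc
    finally:
        os.close(fd)
```

A caller that gets the `RuntimeError` would have to recover from a source it still trusts, for example by replaying a log, rather than assume the file contents are on disk.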
At MongoDB, we implemented an eBPF tool to collect and display a complete time-series view of information about all threads, whether they are on- or off-CPU. This allows us to inspect where the database server spends its time, both in userspace and in the kernel. Its minimal overhead allows it to be deployed in production.
This can be an effective method to collect diagnostic information in the field...
- What is needed upstream for real time support of Full Dynticks and isolation?
- Specific requests?
Consider a server with a huge amount of memory that thousands of processes are using to serve client requests.
In such a case, the HCA will have to manage thousands of MRs which will compete for caches and address translation entities.
The way to improve performance is to allow sharing of IB objects between processes. One process will create several MRs and share them.
Since MySQL 8.0 we have a newly redesigned lock-free REDO log implementation. However, this development raised several questions about overall efficiency around MT communication and synchronization. Curiously, spinning on the CPU proved to be the most efficient approach under low load. Are there any plans to implement a "generic" MT framework for more efficient execution of any MT apps?
Host Bandwidth Manager (HBM) is a BPF based framework for managing per-cgroupv2 egress and ingress bandwidths in order to provide a better experience to workloads/services coexisting within a host. In particular, HBM allows us to divide a host's egress and ingress bandwidth among workloads residing in different v2 cgroups. Note that although sample BPF programs are included in the BPF patches,...
With heterogeneous computing, a program's data (a range of virtual addresses) has to move to different physical memory during the lifetime of an application to keep it local to the compute unit (CPU, GPU, FPGA, ...). NUMA has been the model used so far, but it makes assumptions that do not hold for all the memory types we now have. This presentation will explore the various types of memory and how we...
Discussion around topics related to PCI specifications and microconference follow up
- Root complex integrated endpoints
- Native host controllers link management
- VFIO/IOMMU/PCI follow up
There is a "backlog" option used in MySQL for both IP and UNIX sockets, but it seems to have significant overhead on workloads with heavy connect/disconnect activity (e.g. most Web apps, which do "connect; SQL query; disconnect"). Is there any explanation or reason for this? Can it be improved?
MySQL allows user session connections via IP port and UNIX socket on Linux systems. However, curiously, connecting via UNIX socket delivers up to 30% higher performance compared to a local IP port (loopback). Is there any reason for this? Can the "loopback" code be improved to match the efficiency of UNIX sockets? Can the same improvements make the whole IP stack more efficient?
As memory sizes grow, so do the sizes of the data transferred between RDMA devices. Generally, the operating system needs to keep track of the state of each piece of its memory, which on Intel x86 is a 4 KB page. This is also connected to hardware providing memory management features, such as the processor page tables as well as the MMU features of the RDMA NIC.
The overhead of the...
In this talk, Thomas Gleixner will present the status of PREEMPT_RT, followed by a question-and-answer section on the upstream work and the future of the project.
All MT apps are extremely sensitive to CPU cache issues, and MySQL/InnoDB is one of them. Several times we observed significant regressions (up to 40% and more) due to CPU cache misses or simple cache synchronization caused by concurrent access to the same variable by several threads, while none of the CPU-related "perf" stats showed any difference. Are there any plans to address this with deeper CPU stats instrumentation?
Users are very worried about any kind of overhead due to kernel patches applied to address Intel CPU issues (Spectre/Meltdown/etc.). What are others observing? What kinds of workloads / test cases do you use for evaluation?
From discussions to code. Where does it go from here?
The way BPF application developers build applications is constantly improving. There are still rough edges, as well as (as of yet) fundamentally inconvenient developer workflows involved (e.g., on-the-fly compilation). The ultimate goal of BPF application development is to provide an experience as straightforward and simple as that of a typical user-land application.
We'll discuss major pain points...
ZRAM is a compressed RAM based block device implementation which has gotten a lot of use recently primarily in the Android world. ZRAM consists of the block device front-end, compressor back-end and memory allocator back-end. Compressor back-end is accessed via a common API, and therefore it is easy with ZRAM to select the particular compression algorithm that fits your special purpose. As...
The most commonly used simple locking functions provided by the pthread library are pthread_mutex and pthread_rwlock. They are sleeping locks and so suffer from unpredictable wakeup latency, which limits locking throughput.
Userspace spinning locks can potentially offer better locking throughput, but they also suffer other drawbacks like lock holder preemption which will waste valuable CPU...
This session will focus on answering questions on the internals and the usage of Linux-kernel RCU. However, questions regarding details of the RCU-related patches in the -rt patchset will be deferred to other venues, given that this topic consumed the entire time in the 2018 informal RCU BoF session.
This is not intended to be a tutorial on RCU basics, though a separate session on this topic...
The OpenBMC project has brought modern Linux to the firmware in your new server. A missing piece of this is ensuring the firmware is the image you expect it to be running.
The next generation of BMC hardware will allow a hardware root of trust to secure the boot chain. This talk will present a proposed design for trusted boot in OpenBMC.
A short summary of developments in kernel live patching over the last year. There have been many improvements since LPC in Vancouver, but there are still some outstanding issues. Not all attendees may closely follow the live-patching mailing list, and therefore the talk should be a good starting point for the microconference.
The current livepatch implementation supports late patching of modules when they are loaded (and unpatching when they are unloaded). It has caused headaches, and the LPC microconference is a good opportunity to discuss the future of the feature. There have been attempts to deny module removal. Introducing patch module dependencies could also simplify the code and the issue a lot. On the other hand, such solutions...
The UEFI forum is rolling out a new "code first" process, to be available for both UEFI and ACPI specifications, in order to speed up time between initial definition and upstream support.
The UEFI self-certification testsuite (SCT) has been open sourced.
The UEFI interface implementation in U-Boot is now sufficient for GRUB use (and more) across multiple distributions.
Debugging BPF program logic is hard these days.
Developers typically write their programs and
then check whether map values or perf_event outputs
make sense. For tricky issues, temporary
maps or bpf_trace_printk are used so developers
can get more insight into what happens. But
this requires possibly multiple rounds of
modifying sources, recompilation and redeployment, etc.
The presentation gives an overview of what has been implemented in the SGX patch set and what is still left to do. The presentation goes through the known blockers for upstreaming. In particular, access control related issues will be discussed.
At last year's Live Patching MC, an approach to automating source based live patch creation was proposed. The implementation has made good progress since then; in particular, an initial release of the "klp-ccp" utility has been published (https://github.com/SUSE/klp-ccp) recently. Its purpose is to handle the transformation of patched kernel parts into self-contained live patch source code...
At the LSF/MM eBPF track, we discussed the necessity of a common Go
library to interact with BPF. Since then, Cilium and Cloudflare have
worked out a proposal to upstream parts of github.com/newtools/ebpf
and github.com/cilium/cilium/pkg/bpf into a new common library.
Our goal is to create a native Go library instead of a CGO wrapper
of C libbpf. This provides superior performance,...
When multiple instances of workloads are consolidated on the same host, it is
good practice to partition them for best performance, e.g. by giving a NUMA
node partition to each instance. Currently the Linux kernel provides two
interfaces for hard partitioning: the sched_setaffinity system call and the
cpuset.cpus cgroup. But this doesn't allow one instance to burst out of its partition
and use available CPUs from other...
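As an illustration of the hard partitioning mentioned above, the following sketch pins the calling process to a fixed CPU set through Python's wrapper around sched_setaffinity. The `pin_to_partition` helper is hypothetical, and which CPUs make up a NUMA node is left to the caller; the sketch only shows why a pinned instance cannot burst onto idle CPUs elsewhere.

```python
import os
from typing import Iterable, Set


def pin_to_partition(cpus: Iterable[int]) -> Set[int]:
    """Restrict the calling process to the given CPUs (Linux only).

    Only CPUs the process is already allowed to use are kept, so the
    call cannot fail because a requested CPU is unavailable to us.
    """
    allowed = os.sched_getaffinity(0)  # pid 0 means the calling process
    target = set(cpus) & allowed
    if not target:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(0, target)
    # From here on the scheduler will not run this process outside
    # `target`, even if every other CPU on the host is idle.
    return os.sched_getaffinity(0)
```

Calling `pin_to_partition({0, 1})` on a host where CPUs 0 and 1 are available would leave the process runnable only on those two CPUs until the affinity mask is changed again.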
TrenchBoot is a cross-community OSS integration project for hardware-rooted, late launch integrity of open and proprietary systems. It provides a general purpose, open-source DRTM kernel for measured system launch and attestation of device integrity to trust-centric access infrastructure. TrenchBoot closes the measurement gap and reduces the need to trust system firmware. This talk will...
Currently, most BPF functionality requires CAP_SYS_ADMIN or CAP_NET_ADMIN. However, in many cases, CAP_SYS_ADMIN/CAP_NET_ADMIN gives the user far more permissions than needed. For example, tracing users need to load BPF programs and access BPF maps, so they need CAP_SYS_ADMIN. However, they don't need to modify the system, so CAP_SYS_ADMIN adds significant risk.
To better control BPF...
A quick update on the objtool port on Power: what the current state is and
what more needs to be done. Also, a discussion of how we integrate it upstream.
Over the past few years, kernel engineers have been busy implementing livepatch support features (the consistency model, atomic replace, shadow variables, etc.) to increase potential livepatch patch coverage. At the same time, more and more vendors have adopted livepatching to solve continuous uptime/update problems.
As the livepatch feature set grows and matures and demand for livepatch...
The discussion should focus on an API for handling the state of changes made by callbacks. It was already discussed as global state handling at the last LPC in Vancouver. New ideas have emerged since then. The discussion should also include patch versioning, stickiness and transition reversal.
Patches submitted upstream so...
eBPF offload is a powerful feature on modern SmartNICs used to accelerate
XDP or TC based BPF. The current kernel eBPF offload infrastructure was
introduced for the Netronome NFP based SmartNICs, which were based around a
proprietary ISA and had some specific verifier requirements.
In the near future this may be joined by SmartNICs using public ISAs such
as RISC-V and Arm which also happen...
TPM2 introduced a plain text authorization scheme with the idea that the system using the TPM should know whether the transport was secure. The presence of interposers on the bus, either as physical devices or as compromised pre-boot firmware, makes this threat a reality. A NULL seed based scheme has been proposed for...
Firmware on commodity PCs has used the TPM to store integrity measurements from security relevant components as part of the boot process for some time. Grub2 has recently merged patches that extend this integrity measurement chain through to the launching of the OS kernel. Collecting and storing these measurements in the TPM is a necessary precondition for implementing authorization policy...
Currently, the BPF verifier has to "execute" code at least once and then it can prune branches when it detects the state is the same. In this session we would like to cover a technique called Scalar Evolution (SCEV), which is used by LLVM and GCC to perform optimization passes such as identifying and promoting induction variables and doing worst-case trip count analysis over loops. At its most basic...
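To make the trip count idea concrete, here is a toy sketch under simplifying assumptions: the induction variable is an affine recurrence {start, +, step} over unbounded integers (no wraparound), and the loop has the shape `for (i = start; i < bound; i += step)`. The `AddRec` and `trip_count` names are hypothetical; this is not how LLVM, GCC or the BPF verifier implement the analysis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AddRec:
    """Affine recurrence {start, +, step}: value on iteration n."""
    start: int
    step: int

    def value_at(self, n: int) -> int:
        return self.start + n * self.step


def trip_count(rec: AddRec, bound: int) -> Optional[int]:
    """Iterations of `for (i = start; i < bound; i += step)`.

    Returns None when no finite bound can be derived in this model.
    """
    if rec.start >= bound:
        return 0      # the loop condition fails immediately
    if rec.step <= 0:
        return None   # i never reaches bound; not provably terminating
    # Smallest n with start + n * step >= bound (ceiling division).
    return (bound - rec.start + rec.step - 1) // rec.step


# If only an upper limit on `bound` is known, evaluating trip_count at
# that limit gives the worst-case number of iterations.
worst_case = trip_count(AddRec(start=0, step=3), bound=10)
```

The point of the closed form is that the iteration count comes from the recurrence itself, without walking the loop body once per iteration.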
The kernel already supports special livepatch relocation types that enable several interesting livepatch module use cases:
- Access to symbols outside of normal C scoping rules
- Deferred access to yet-to-be loaded kernel module symbols
- Support for architecture-specific special sections like altinstructions and paravirt instructions
Although the kernel supports loading livepatch modules...
The Restartable Sequences system call [1,2,3,4] introduced in Linux 4.18 has limitations which can be solved by introducing a bytecode interpreter running in inter-processor interrupt context which accesses user-space data.
This discussion is about the subset of the eBPF bytecode and context needed by this interpreter, and extensions of that bytecode to cover load-acquire and...
The main issue in using TPM2.0 in such a measured boot solution is that at the
moment of writing this abstract neither Trusted Grub nor the Linux kernel has
a TPM2.0 implementation. There are of course implementations based on UEFI
systems, where bootloaders can utilize TCG EFI protocol to handle TPM. However
other non-UEFI based solutions suffer from lack of TPM2.0 drivers in the
Existing Linux Security Modules can only be extended by modifying and rebuilding the kernel, making it difficult to react to new threats. The Kernel Runtime Security Instrumentation project (KRSI) ([prototype code]) aims to help with this by providing an LSM that allows eBPF programs to be added to security hooks.
The talk discusses the need for such an LSM (with representative use cases) and...
Currently, testing/stressing of the livepatching infrastructure is limited to the creation of livepatching modules for reported CVE/security issues. Continuous testing of the infrastructure is required; it can be achieved by randomly selecting patches posted to the kernel mailing list, in order to improve the infrastructure and fix the bugs seen in it. I would like to discuss the in-house framework used...
At the time of writing this paper the Linux kernel supported TPM 1.2
functionalities in sysfs. These functionalities include:
$ ls /sys/devices/pnp0/00:04/tpm/tpm0
active  caps  device     enabled  pcrs   ppi    subsystem         timeouts
cancel  dev   durations  owned    power  pubek  temp_deactivated  uevent
$ ls /sys/devices/pnp0/00:04/tpm/tpm0/ppi
request response ...
Discussion about current live patch services and how we can make them more open and flexible.
How can we get more open source distributions to use or build their own live patch services?
What are we still missing, and what can we share?
The bcc community has long discussed that batch
dump, lookup and delete would help its typical
use case: periodically retrieving and deleting
all samples in the kernel. Without batch APIs,
bcc typically does the following:
- iterate through all keys (get_next_key API)
- get (key, value) pairs
- iterate through all keys to delete them
Also, Brian Vazquez
has proposed a BPF_MAP_DUMP command to dump
more than one...
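The three-step pattern above can be sketched with a toy in-memory map. `ToyMap`, its `calls` counter and the `lookup_and_delete_batch` method are hypothetical stand-ins, not the bcc or kernel API; each counted call is meant to represent one round trip into the kernel.

```python
from typing import Dict, Iterator, List, Optional, Tuple


class ToyMap:
    """Stand-in for a BPF hash map that counts kernel round trips."""

    def __init__(self) -> None:
        self._data: Dict[int, int] = {}
        self.calls = 0

    def update(self, key: int, value: int) -> None:
        self._data[key] = value  # producer side, not counted

    def get_next_key(self, key: Optional[int]) -> Optional[int]:
        self.calls += 1
        keys = list(self._data)
        if key is None or key not in self._data:
            return keys[0] if keys else None  # start from the first key
        idx = keys.index(key) + 1
        return keys[idx] if idx < len(keys) else None

    def lookup(self, key: int) -> int:
        self.calls += 1
        return self._data[key]

    def delete(self, key: int) -> None:
        self.calls += 1
        del self._data[key]

    def lookup_and_delete_batch(self) -> List[Tuple[int, int]]:
        self.calls += 1  # hypothetical single batched operation
        items = list(self._data.items())
        self._data.clear()
        return items


def iter_keys(m: ToyMap) -> Iterator[int]:
    key = m.get_next_key(None)
    while key is not None:
        yield key
        key = m.get_next_key(key)


def drain_one_by_one(m: ToyMap) -> List[Tuple[int, int]]:
    keys = list(iter_keys(m))                   # iterate through all keys
    samples = [(k, m.lookup(k)) for k in keys]  # get (key, value) pairs
    for k in keys:                              # delete every key
        m.delete(k)
    return samples
```

In this model, draining n entries one by one costs 3n + 1 counted calls, while the batched variant costs one, which is the kind of saving the batch proposals are after.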
Buses will start circulating at 7:30PM.
Last return bus is at 11PM
Busses will be leaving from the Corinthia Hotel lobby from 19:30
Closing Party will be held at the Centro Cultural de Belém (CCB). Accessible by bus starting from the entrance (upstairs) behind the LPC registration desk.
Last return bus: 11PM
Linux kernel maintenance is a widely discussed topic at many conferences. Yet, it has its own complex share of problems which are unique to maintainers, subsystems and organizations.
Oracle has a very open and challenging environment, but with access to a lot of information and knowledge about our customer's products and strategies, it can be very tricky for a kernel maintainer especially the...
FPGAs are becoming more pervasive because they've gone down in price, and process improvements allow substantial designs to fit on commoditized hardware. Furthermore, processors are shipping with embedded FPGAs, making it an interesting target for scaled deployments and hobbyists alike. It's likely that in the foreseeable future, many platforms you use daily will have an FPGA embedded. As...
Today, every modern multimedia-capable SoC comprises a variety of display controller interfaces bound to LCD panels or bridges, and a GPU for providing display acceleration.
The Linux kernel handles all these display controller interfaces and their associated panels and bridges via the DRM subsystem, but it becomes a daunting task for many of the display users to make use of this DRM...
Compared to VMs, the security of container technology has always been debated. We might need to discuss how to fit current container implementations into the RISC-V architecture in this area. RISC-V has not yet had any particular hardware considerations like Intel SGX or even AMD's; however, we can go as far as we can and give some feedback to the RISC-V Foundation.