Hello.
I would like to give a talk about the KVA allocator in the kernel and about
the improvements I have made.
The presentation is available here:
ftp://vps418301.ovh.net/incoming/Reworking_of_KVA_allocator_in_Linux_kernel.pdf
Thank you in advance!
--
Vlad Rezki
The RISC-V UNIX-Class platform specification working group started in May and aims to have a first release by the end of the year. This talk will discuss where we are and where we're going.
We'd like to spend a few minutes providing some background on how we're using Yocto to produce kernel builds as well as bigger images that also contain userspace, and then try to address some of the issues we're seeing with this process.
There are a few topics we'd like to discuss with the room:
- Using a single kernel branch for multiple, very different projects?
- Working with...
Tracing kernel boot is useful when we chase a bug in device and machine initialization, a boot performance issue, etc. Ftrace already supports enabling basic tracing features on the kernel cmdline. However, since the cmdline is very limited and too simple, it is hard to enable recently introduced complex features, e.g. multiple kprobe events, trigger actions, and event histograms.
To solve...
RISC-V now has better support for open source boot loaders like U-Boot and coreboot compared to last year. As a result, developers can use the same boot loaders to boot Linux on RISC-V as they do on other architectures, but there's more work to be done. We will discuss the current state of the boot flow and pending issues.
Hardware PMU counters are limited resources. When there are more perf events than available hardware counters, it is necessary to use time multiplexing, and the perf events cannot run 100% of the time.
On the other hand, different perf events may measure the same metric, e.g., instructions. We call these perf events "compatible perf events". Technically, one hardware counter could serve...
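As a rough illustration of what multiplexing costs, the count of an event that only held a hardware counter for part of the time is usually extrapolated from how long it was enabled versus how long it actually ran. The sketch below assumes the two times are already available (for instance from the time_enabled and time_running values that perf_event_open() can report); the function name is hypothetical, and this is not the counter-sharing scheme the talk proposes.

```python
def scale_count(raw_count: int, time_enabled_ns: int, time_running_ns: int) -> float:
    """Estimate the full-period count of a time-multiplexed event.

    raw_count was collected only while the event held a hardware counter
    (time_running_ns), so it is extrapolated to the whole period during
    which the event was enabled (time_enabled_ns).
    """
    if time_running_ns == 0:
        return 0.0  # the event never got a counter, nothing to extrapolate
    return raw_count * time_enabled_ns / time_running_ns


# An event that held a counter for only half of the time it was enabled.
estimate = scale_count(raw_count=1_000_000,
                       time_enabled_ns=2_000_000,
                       time_running_ns=1_000_000)
print(estimate)
```

The estimate is only as good as the assumption that the event rate was uniform over the period, which is why avoiding multiplexing where possible is attractive.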
Over the last couple of years, we have witnessed an onslaught of vulnerabilities in the design and architecture of CPUs. It is interesting and surprising to note that the vulnerabilities mainly target the features designed to improve the performance of CPUs, the most notable being hyperthreading (SMT). While some of the vulnerabilities could be mitigated in software and CPU microcode, couple of...
The IOMMU is a very popular component in both the embedded and server virtualization areas. In this topic we'll focus on the embedded area and shared virtual addressing.
Firstly, we'll talk about the value of the IOMMU for embedded systems and what benefit we can get from it in our cost-down embedded systems.
Secondly, Guo will share experience with the IOMMU implementation, e.g. how to keep the same ASID with...
Execute only memory can protect from attacks that involve reading executable code. This feature already exists on some CPUs and is enabled for userspace.
This talk will explain how we are working on creating a virtualized “not-readable” permission bit for guest page tables for x86 and the impact to the kernel. This bit can be used to create execute-only memory for userspace programs as done...
The Kernel's API and ABI exposed to Kernel modules is not something that is usually maintained in upstream. Deliberately. In fact, the ability to break APIs and ABIs can greatly benefit the development. Good reasons for that have been stated multiple times. See e.g. Documentation/process/stable-api-nonsense.rst.
The reality for distributions might look different though. Especially - but not...
The RISC-V trace spec draft has defined a trace format, and we'll share our implementation of Linux perf trace based on the spec: how to deal with SMP perf issues, how to verify our design in QEMU, and a demo of perf trace with riscv-qemu.
Lastly, let's discuss perf issues from PMU to trace, or any RISC-V perf topic.
The current main use cases of RISC-V center on embedded uses and small configurations. However, RISC-V also seems to be a useful platform for High Performance Computing and may be able to deliver custom solutions that go well beyond what the traditional processor vendors can offer. There are already efforts underway to use ARM for that purpose but those approaches are constrained by...
While kernelci.org as a project is dedicated to testing the
upstream Linux kernel, the same KernelCI software may be reused
for alternative purposes. One typical example is distribution
kernels, which often track a stable branch but also carry some
extra patches and a specific configuration. Aside from covering
a particular downstream branch, having a separate KernelCI
instance also makes it...
The Red-Black tree and Radix tree are used in many places in the kernel to store ranges. Both of these trees have drawbacks when used for ranges. The Red-Black tree requires writing your own insertion & search code. It is also designed with the assumption that memory accesses are cheap, which is no longer true. The Radix tree performs acceptably well when ranges are aligned to a power of 2,...
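One way to see why power-of-2 alignment matters: assume, purely for illustration, a structure that can store only naturally aligned power-of-two blocks as single entries. An aligned range then needs one entry, while an unaligned one has to be split into several. The helper below is a hypothetical sketch of that splitting, not code from the kernel's radix tree.

```python
def aligned_blocks(start: int, end: int):
    """Split the half-open range [start, end) into naturally aligned
    power-of-two blocks, returned as (block_start, block_size) pairs."""
    blocks = []
    while start < end:
        remaining = end - start
        # Largest power of two that 'start' is aligned to; address 0 is
        # aligned to everything, so use a value larger than 'remaining'.
        align = (start & -start) if start else 1 << remaining.bit_length()
        # Largest power of two that still fits in the remaining length.
        fit = 1 << (remaining.bit_length() - 1)
        size = min(align, fit)
        blocks.append((start, size))
        start += size
    return blocks


print(aligned_blocks(0, 8))   # aligned range: a single block
print(aligned_blocks(1, 8))   # unaligned range: several blocks
```

Under this assumption the range [0, 8) is one entry, while [1, 8) already needs three.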
Multipath TCP (MPTCP) is an increasingly popular protocol that members of the kernel community are actively working to upstream. A Linux kernel fork implementing the protocol has been developed and maintained since March 2009. While there are some large MPTCP deployments using this custom kernel, an upstream implementation will make the protocol available on Linux devices of all...
Understanding Application performance and utilization characteristics is critically important for cloud-based computing infrastructure. Minor improvements in predictability and performance of tasks can result in large savings. Google runs all workloads inside containers and as such, cgroup performance monitoring is heavily utilized for profiling. We rely on two approaches built on Linux...
Babeltrace started out as the reference implementation of a Common
Trace Format (CTF) reader. As the project evolved, many
trace manipulation use-cases (merging, trimming, filtering,
conversion, analysis, etc.) emerged and were implemented either
as part of the Babeltrace project, on top of its APIs or through
custom tools.
Today, as more tracers emerged, each using their own trace format,...
The RISC-V hypervisor extension is carefully designed to be compliant with both Type-1 and Type-2 hypervisors. We have ported Xvisor (Type-1) and KVM (Type-2) for RISC-V architecture. In this session, we share our experience porting these hypervisors and also discuss future work on RISC-V hypervisors.
I would like to discuss how to implement a series of libraries for all the tracing tools that are out there, and have a repository that at least points to them. From libftrace, libperf, libdtrace to liblttng and libbabeltrace.
bpftrace is a high level tracing language running on top of BPF: https://github.com/iovisor/bpftrace
We'll talk about important updates from the past year, including improved tracing providers and new language features, and we'll also discuss future plans for the project.
The printk() function has a long history of issues and has undergone many iterations to improve performance and reliability. Yet it is still not an acceptable solution to reliably allow the kernel to send detailed information to the user. And these problems are even magnified when using a real-time system. So why is printk() so complicated and why are we having such a hard time finding a good...
At Netconf 2019 we presented a BPF-based alternative to steering
packets into sockets with iptables and the TPROXY extension. This mechanism
is of interest to us because it allows (1) services to share a
port number when their IP address ranges don't overlap, and (2) reverse
proxies to listen on all available port numbers.
The solution adds a new BPF program type BPF_INET_LOOKUP, which...
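The two goals above can be pictured as a lookup keyed on destination address range and port. The following is a plain userspace sketch, not BPF and not the proposed program type; the rule table, service names and function name are hypothetical, and the addresses come from the documentation ranges.

```python
import ipaddress

# Hypothetical steering table: (destination network, port, service).
# A port of None means "any port", as a reverse proxy would want.
RULES = [
    (ipaddress.ip_network("192.0.2.0/25"), 80, "service-a"),
    (ipaddress.ip_network("192.0.2.128/25"), 80, "service-b"),
    (ipaddress.ip_network("198.51.100.0/24"), None, "reverse-proxy"),
]


def pick_service(dst_ip: str, dst_port: int):
    """Return the service that should receive a packet, or None."""
    addr = ipaddress.ip_address(dst_ip)
    for network, port, service in RULES:
        if addr in network and (port is None or port == dst_port):
            return service
    return None


# Two services share port 80 because their address ranges do not overlap.
print(pick_service("192.0.2.10", 80))
print(pick_service("192.0.2.200", 80))
# The proxy receives traffic on every port of its range.
print(pick_service("198.51.100.7", 12345))
```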
Implementing safety-critical systems usually requires adhering to meticulously defined development processes that specify how code is supposed to be developed, integrated and reviewed, driven by the assumption that a disciplined approach leads to reliably high quality. While known to produce code that can satisfy the highest quality standards, Linux kernel development does not follow such...
Many new BPF tracing tools are about to be published, deepening our view of kernel internals on production systems. This session will summarize what has been done and what will be next with BPF tracing, discussing the challenges with taking kernel and application analysis further, and the potential kernel changes needed.
This presentation will discuss the work ongoing to implement Linux kernel
support for RISCV hardware lacking a memory management unit (MMU). A side effect
of this work is also the ability to execute the kernel directly in M-Mode and
how this is implemented while keeping most of the architecture code unmodified.
The presentation will include examples of testing environment builds, discuss
the...
There have been two different approaches proposed on the LKML over the past year on core scheduling. One was the coscheduling approach by Jan Schönherr, originally posted at https://lkml.org/lkml/2018/9/7/1521 and the next version posted at https://lkml.org/lkml/2018/10/19/859
Upstream chose a different route and decided to modify CFS, and only do "core-scheduling". Vineeth picked up the...
Greybus is an RPC-like protocol on top of the UniPro bus that was designed for Project ARA. The goal of that project was to develop a modular smartphone. Greybus gives the host the ability to remotely control the buses (such as i2c or spi) of the modules.
Although Project ARA has been aborted, Greybus has been merged into the Linux kernel, and it is still maintained by the community.
Greybus...
For many years developers have leveraged gdb or crash to look at kernel crash dumps on Linux. Although those tools have served us well, it can sometimes be difficult to navigate the crash dump to find the information you really need. In this talk, we would like to present some new tools that make it easier to debug kernel crash dumps and enhance kernel developers' ability to root causes...
DRM is merging new drivers at a brisk pace, and with lima and panfrost to support ARM Mali GPUs the last obvious gap in not yet reverse-engineered hardware is getting closed. Plus new features, more contributors, more patches: in general upstream graphics is healthier than it has ever been.
Time for some celebratory drinks, except this talk will be none of that. Now that we've achieved...
This topic will discuss 1) why we need a per-group default domain type, 2) how it solves problems in real IOMMU drivers, and 3) the user interfaces.
In IoT applications, be they Autonomous Cars [1] or Health Care or Smart Home or Factory Automation, the IoT devices (sensors and actuators), gateways, and cloud/datacenter endpoints need software and/or firmware updates to fix security issues, patch bugs, and/or release new features. IoT with its numerous remote devices and gateways presents a large attack surface, making the application of...
Link Aggregation (LAG) is traditionally served by the bonding driver. The Linux bonding driver supports all LAG modes on almost any LAN driver, in software. However, modern hardware features like SR-IOV-based virtualization and stateful offloads such as RDMA are currently not well supported by this model. One possible option to solve that is to implement LAG functionality entirely in NIC's...
Wayland is getting close to being ready for day-to-day generic desktop use. It is close, but there are still many small issues to tackle, see e.g.:
https://hansdegoede.livejournal.com/21944.html
https://hansdegoede.livejournal.com/22212.html
The purpose of this microconference is to get people together to discuss the various open issues, try to come up with solutions for some of them and possibly...
Dmitry Vyukov's testing work identified some (ab)uses of sched_setattr() that can result in SCHED_DEADLINE tasks starving RCU's kthreads for extended time periods: not milliseconds, not seconds, not minutes, not even hours, but days. Given that RCU CPU stall warnings are issued whenever an RCU grace period fails to complete within a few tens of seconds, the system did not suffer silently. ...
The CFS load_balance has become more and more complex over the years and has reached the point where its policy sometimes can't be explained. Furthermore, available metrics have evolved and load balance doesn't always take full advantage of them to calculate the imbalance. It's probably a good time to do a rework of the load balance code as proposed in this...
Storage hardware with built-in “inline” encryption support is becoming increasingly common, especially on mobile SoCs running Android; it's also now part of the UFS and eMMC standards. These devices en/decrypt data between the application processor and disk without generating disk latency or cpu overhead. Inline encryption hardware can be programmed to hold multiple encryption keys...
For the past couple of years the CKI ("cookie") project at Red Hat has been transforming the way the company tests kernels, going from staged testing to continuous integration. We've been testing patches posted to internal mailing lists, responding with our results, and last year we started testing stable queues maintained by Greg KH, posting results to the "stable" mailing list.
Now we'd like to...
There is a presentation in the refereed track on flattening the CPU controller runqueue hierarchy, but it may be useful to have a discussion on the same topic in the scheduler microconference.
This is meant to be a rather open discussion on PCI resource assignment policies. I plan to discuss a bit what the different arch/platforms do today, how I've tried to consolidate it, then we can debate the pro/cons of the different approaches and decide where to go from there.
The RDMA subsystem in Linux (drivers/infiniband) is now becoming widely used and deployed outside its traditional use case of HPC. This wider deployment is creating demand for new interactions with the rest of the kernel and many of these topics are challenging.
This talk will include a brief overview of RDMA technology followed by an examination & discussion of the main areas where the...
The Linux Kernel scheduler represents a system's topology by the means of
scheduler domains. In the common case, these domains map to the cache topology
of the system.
The Cavium ThunderX is an ARMv8-A 2-node NUMA system, each node containing
48 CPUs (no hyperthreading). Each CPU has its own L1 cache, and CPUs within
the same node share the same L2 cache.
Running some memory-intensive...
Testing the upstream kernel is not an easy task. The burden is
still largely put on developers, although several projects are
now covering parts of it such as 0-day, LKFT, CKI, Coccinelle,
syzkaller and kernelci.org. While they all tend to have their
own speciality, they also face a lot of similar challenges.
This BoF is to give an opportunity to exchange ideas and bring
together people...
Many "drivers" for IoT sensors and actuators live outside kernel space through efforts that seek to provide abstractions not sufficiently handled in the kernel today. This is resulting in great code fragmentation that can be resolved by better understanding the developer needs and communicating an achievable collaborative approach. Pushing the interface to these devices off to userspace is...
A PCI-Express non-transparent bridge (NTB) is a point-to-point PCIe bus
connecting 2 host systems. NTB functionality can be achieved in a platform
having 2 endpoint instances. Here each endpoint instance is
connected to an independent host and the hosts can communicate with each other
using the endpoint as a bridge. The endpoint framework and the "new" NTB EP
function driver should...
The Thunderbolt vulnerabilities are now public and have a nice name: Thunderclap (https://thunderclap.io/). This topic will introduce the kinds of vulnerabilities we have identified with Linux and how we are fixing them.
Operating system distributors often face challenges that are somewhat
different from those of upstream kernel developers. For instance, some
kernel updates often need to stay at least binary compatible with
modules that might be "out of tree" for some time.
In that context, being able to automatically detect and analyze
changes to the binary interface exposed by the kernel to its module
does...
The BPF VM in the kernel is being used in ever more scenarios where running a restricted, validated program in kernel space provides a super powerful mix of flexibility and performance which is transforming how a kernel works.
That creates challenges for developers, sysadmins and support engineers: having tools for observing what BPF programs are doing in the system is critical.
A lot has...
kernelCI: testing a broad variety of hardware
The Linux kernel runs on an extremely wide range of hardware, but
with the rapid pace of kernel development, it's difficult to ensure
the full range of supported hardware is adequately tested.
The kernelCI project is a small but growing project focused on
testing the core kernel on a diverse set of architectures, boards and
compilers using...
There is a lot of similar and duplicated code in architecture specific
bits of memory management.
For instance, most architectures have
#define PGALLOC_GFP (GFP_KERNEL | __GFP_ZERO)
for allocating page table pages and many of them use similar, if not
identical, implementations of pte_alloc_one*().
But that's only the tip of the iceberg.
There are several early_alloc() or similarly...
Today we cannot imagine being without a mobile phone, a laptop, or a tablet. With the progress of technology and all these handheld devices, we have been able to get many of our documents digitized. However, whatever advancements we see in this space of documentation, it is still very hard to find someone who has not needed to print or scan a hard copy....
It goes without saying that XDP is wanted more and more by everyone. Of course, the Linux distributions want to bring to users what they want and need. Even better if it can be delivered in a polished package with as few surprises as possible: receiving bug reports stemming from users' misunderstandings and wrong expectations does not make for a good experience, neither for the users nor...
The OpenPrinting project “Common Print Dialog Backends” provides a D-Bus interface to separate the print dialog GUI from the communication with the actual printing system (CUPS, Google Cloud Print, etc.), with each printing system supported by a backend and these GUI-independent backends working with all print dialogs (GTK/GNOME, Qt/KDE, LibreOffice, etc.). This allows for easily...
The glibc project decided a while back that it wants to add wrappers for
system calls which are useful for general application usage. However,
that doesn't mean that all those missing system calls are added
immediately.
System call wrappers still need documentation in the manual, which
can be difficult in areas where there is no consensus on how to describe
the desired semantics (e.g., in the...
Boot testing is already hard to do well on a wide variety of
hardware. However it is only scratching the surface of the
kernel code base. To take projects such as Kernel CI to the next
level and increase coverage, functional tests are becoming the
next big thing on the list. Large test suites that run close to
the hardware are very hard to tame. Some projects such as
ezbench could become...
The PREEMPT_RT patchset is the longest existing large patchset living outside the Linux kernel. Over the years, the realtime developers had to maintain several stable kernel versions of the patchset. This talk will present the lessons learned from this experience, including workflow, tooling and release management that has proven over time to scale. The workflow deals with upstream changes...
In the Linux kernel, most operations affecting a process's address space are protected by mmap_sem (a per-process read-write semaphore).
This simple design is increasingly a problem for multi-threaded applications, and often causes threads that operate on separate parts of their address space to end up blocking on each other due to false sharing issues - mmap_sem only supports locking the...
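The false sharing described here can be pictured with plain address ranges. The sketch below is only a model, with hypothetical helper names and made-up addresses: with one lock for the whole address space every pair of operations must serialize, whereas a lock that understood ranges would only serialize operations whose ranges overlap.

```python
def ranges_overlap(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """Half-open ranges [start, end) overlap when each one starts
    before the other one ends."""
    return a_start < b_end and b_start < a_end


def must_serialize(op_a, op_b, per_range: bool) -> bool:
    """Decide whether two address-space operations block each other.

    op_a and op_b are (start, end) tuples. With a single lock for the
    whole address space (per_range=False) they always conflict.
    """
    if not per_range:
        return True
    return ranges_overlap(*op_a, *op_b)


thread_a = (0x1000, 0x2000)   # works on one mapping
thread_b = (0x8000, 0x9000)   # works on an unrelated mapping

print(must_serialize(thread_a, thread_b, per_range=False))  # whole-space lock
print(must_serialize(thread_a, thread_b, per_range=True))   # range-aware lock
```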
During the last two years, KMSAN (a detector of uses of uninitialized
memory based on compiler instrumentation) has found more than a
hundred bugs in the upstream kernel.
We'll discuss the current status of the tool, some of its findings and
implementation challenges. Ideally, I'd like to get more people to
look at the code, as finding bugs in particular subsystems may require
deeper knowledge...
There are two flavors of power management supported by the Linux kernel: system-wide PM based on transitions of the entire system into sleep states and working-state PM focused on controlling individual components when the system as a whole is working. PM-runtime is the part of working-state PM concerned with the opportunity to put devices into low-power states when they are not in use.
Since...
Big systems are becoming more common these days. Having thousands of CPUs is
no longer a dream and some applications are attempting to spread over all
these CPUs by creating threads.
This leads to contention on the mm->mmap_sem which is protecting the memory
layout shared by these threads.
There have been multiple attempts to get rid of the mmap_sem's contention or the
mmap_sem itself, Speculative...
Working for a networking hardware vendor can be an extremely rewarding experience for a kernel developer. The rate at which new features are accepted in the kernel also provides lots of motivation to develop new features that showcase hardware capabilities. This could be done by adding new support for dataplane offloads via cls flower, netfilter, or switchdev (if we still think it exists!). ...
Modern server and compute-intensive systems are naturally built around several top-performance CPUs with large numbers of cores and equipped with shared memory that spans a number of NUMA domains. Compute-intensive workloads usually implement highly parallel, CPU-bound cyclic codes performing mathematical calculations that reference data located in the shared memory. Performance observability and...
In this talk, Dmitry will share updates on syzkaller/syzbot since last year: USB fuzzing, bisection, memory leaks. Talk about open problems: testability of kernel components; test coverage; syzbot process.
A brief introduction to CTF and its recent addition to the GNU toolchain: what is it for, what's there now, what improvements are planned, and why you might want to use this stuff rather than DWARF.
What cool things might we be able to do now that C programs can inspect their own types cheaply? What cool things might we be able to do if we extend this to other languages, so C programs could...
IPv4's success story was in carrying unicast packets
worldwide.
Service sites still need IPv4 addresses for everything,
since the majority of Internet client nodes don't yet
have IPv6 addresses. IPv4 addresses now cost 15 to 20
dollars apiece (times the size of your network!) and
the price is rising.
The IPv4 address space includes hundreds of millions of
addresses reserved for obscure (the...
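As a back-of-the-envelope check of the pricing above, assuming the quoted 15 to 20 dollars per address applies uniformly, the cost of a block follows directly from its prefix length. The function name is hypothetical.

```python
def address_block_cost(prefix_len: int, low_usd: int = 15, high_usd: int = 20):
    """Return the (low, high) cost in dollars of an IPv4 block with the
    given prefix length, at a flat per-address price range."""
    addresses = 2 ** (32 - prefix_len)
    return addresses * low_usd, addresses * high_usd


print(address_block_cost(24))  # a /24 holds 256 addresses
print(address_block_cost(16))  # a /16 holds 65536 addresses
```

At those prices a /24 comes to between 3840 and 5120 dollars, and a /16 to roughly one million or more.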
Recent vulnerabilities like L1 Terminal Fault (L1TF) and Microarchitectural Data Sampling (MDS) have shown that the CPU hyper-threading architecture is very prone to leaking data under speculative execution attacks.
Address space separation is a proven technology to prevent side channel vulnerabilities when speculative execution attacks are used. It has, in particular, been successfully used...
Follow up on the tracing microconference
Topics to be discussed:
- Perf related events
- Histogram SQL syntax
Kselftest started out as an effort to enable a developer-focused regression test framework in the kernel to ensure the quality of new kernel releases. Today it is an integral part of the Linux Kernel development process to qualify Linux mainline and stable release candidates.
Shuah will go over the Kselftest framework, how to write tests that work well with the framework for effective...
This proposal covers the ongoing effort to add eBPF support to the GNU Toolchain.
Binutils support is already upstream [1]. This includes a CGEN cpu description, assembler, disassembler and linker. A GCC backend will be submitted for inclusion upstream before September.
Both the binutils and GCC ports will be briefly described, and then a list of points will be discussed with the...
Nowadays all consumer PC/laptop devices contain a TPM 2.0 security chip (due to Windows hardware requirements). Servers and embedded devices increasingly carry these TPMs as well. A TPM provides several security functions to the system and the user, such as a smartcard-like secure keystore and key operations, secure secret storage, brute-force-protected access control, etc.
These capabilities can be used...
It looks like there may well be enough critical folks present to have a good BoF about safety and Linux. Topics can include safety processes and methodologies, tooling to support analysis, security update concerns, etc. Basically, if you're interested in using Linux in safety-critical systems, come join, and we'll see where the conversation goes.
In this talk, we will present a scalable re-implementation of the Kubernetes service abstraction with the help of eBPF. We will discuss recent changes in the kernel which made the implementation possible, and some changes in the future which would simplify the implementation.
Kubernetes is an open-source container orchestration multi-component distributed system. It provides mechanisms for...
Background
Memory pressure is inevitable in many environments. A decade-scale survey [1] of DRAM to CPU ratio in virtual machines and physical machines for data centers implies that the pressure will be even more common and severe. As an answer to this problem, heterogeneous memory systems utilizing recently evolved memory devices such as non-volatile memory along with the DRAM are...
CRIU only restores processes with the same PID the processes used to have during checkpointing. As there is no interface to create a process with a certain PID, like fork_with_pid(), CRIU does the PID dance to restore the process with the same PID as before checkpointing.
The PID dance consists of open()ing /proc/sys/kernel/ns_last_pid, write()ing PID-1 to...
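The open/write part of the PID dance can be sketched as follows. This is a minimal illustration, not CRIU's code: the function name is hypothetical, the path is a parameter so the sketch can be tried against an ordinary file, and the subsequent process creation is left out. Writing to the real file requires sufficient privileges, and the approach is inherently racy if anything else creates a process in between.

```python
import os

NS_LAST_PID = "/proc/sys/kernel/ns_last_pid"


def request_next_pid(target_pid: int, path: str = NS_LAST_PID) -> None:
    """Write target_pid - 1 to the given file so that the next process
    created in this PID namespace may receive target_pid."""
    if target_pid < 2:
        raise ValueError("target_pid must be at least 2")
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, str(target_pid - 1).encode())
    finally:
        os.close(fd)
```

A caller would invoke request_next_pid(pid) immediately before creating the process and then verify that the child really received the requested PID.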
Containers are generally perceived as less secure than virtual
machines. Without going into a theological argument about the actual
state of affairs, we suggest exploring the possibility of using
address space isolation inside the kernel to make containers even more
secure.
Assuming that kernel bugs and therefore vulnerabilities are inevitable,
it is worth isolating parts of the kernel to...
An update on how we plan to enable multimedia testing on our 'cuttlefish' virtual platform. Overview of missing components for graphics virtualization.
This BoF session aims to bring together Linux kernel developers who have an interest in formal methods (or formal methods experts with an interest in kernel development). Topics for discussion:
- A poll of formal methods currently used in the context of the Linux kernel: SPIN, TLA+, CBMC, herd, plain English etc.
- High level design specification vs. low level algorithm modelling. What...
With virtualization being the key to the success of cloud computing, Intel's
introduction of the Scalable IO Virtualization (SIOV) aims to further the cause by shifting the creation of assignable virtual devices from hardware to a more software assisted approach. Using SIOV, a device resource can be mapped directly to guest or other user space drivers for near native DMA (Direct Memory...
XDP (the eXpress Data Path) is a new method in Linux to process
packets at L2 and L3 with really high performance. XDP has already
been deployed, and use cases involving ingress packet filtering, or
transmission back through the ingress interface, are already well
supported today. However, as we expand the use cases that involve the
XDP_REDIRECT action, e.g., to send packets to other devices,...
Tools based on low level tracing tend to generate large amounts of data, typically outputted in some kind of text or binary format. On the other hand the predefined data analysis features of those tools are often useless when it comes to solving a nontrivial or very user-specific problem. This is when the possibility to make sophisticated analysis via scripting can be extremely useful.
Fast...
The libcamera project was started at the end of 2018 to unify camera support on all Linux systems (regular Linux distributions, Chrome OS and Android). In 9 months it has produced an Android Camera HAL implementing the LIMITED profile for Chrome OS, and work is in progress to implement the FULL profile. Two platforms are currently supported (Intel IPU3 and Rockchip ISP), with work on...
Over the last year we have worked on expanding task migration using CRIU at Google. The talk will discuss how in some cases the kernel interfaces are lacking for the purpose of migration:
- Lack of support for reading rseq configuration which means that it requires userspace support to migrate users of rseq properly.
- Lack of support for reading what cgroup events the users have...
Container runtimes, engines and orchestrators provide a production-grade, robust, high-performing, but also relatively self-managing, self-healing infrastructure using innovative open-source technologies.
CRIU allows the running state of containerised applications to be preserved as a collection of files that can be used to create an equivalent copy of the applications at a later time, and...
What could be more fun than talking about kernel documentation? Things we
could get into:
- The state of the RST transition, what remains to be done, whether it's
  all just useless churn that makes the documentation worse, etc.
- Things we'd like to improve in the documentation toolchain.
- Overall organization of Documentation/ and moving docs when the need
  arises. It seems I end...
Discussion of using Persistent Memory as first- (or second-) class memory.
Google has a successful prototype of a software-managed "Transparent" mode for 3dXPoint / AEP memory, but we're working on re-designing this into something that is more supportable and at least partially upstreamable.
We want to open a discussion of how we can represent this "swap"-like use of AEP sensibly.
Providing encryption in dynamic environments where nodes are added and removed on-the-fly and services spin-up and are then torn-down frequently, such as Kubernetes, has numerous challenges. Cilium, an open source software package for providing and transparently securing network connectivity, leverages BPF and the Linux encryption capabilities to provide L3/L7 encryption and authentication at...
The Linux kernel has recently acquired a new API for creating mounts. This allows a greater range of parameters and parameter values to be specified, including, in the future, container-relevant information such as the namespaces that a mount should use.
Future developments of this API also need to work out how to deal with upcalling from the kernel to gain parameters not directly supplied,...
At LPC 2015, we introduced analyze_suspend, a new open source tool to show where the time goes during Linux suspend/resume. Now called "sleepgraph", it has evolved in a number of ways over the last four years. Most importantly, it is now the core of a framework that we use for suspend/resume endurance testing.
Endurance testing has allowed us to identify, track, report and sometimes fix...
Since Canonical is now shipping it, I think we can all agree it solves a problem and we just need to get the patches into shape for upstream submission. Can we discuss a pathway for doing that?
Code review is a collaborative activity involving sentiments and emotions that can affect developers' productivity, creativity, and contribution satisfaction. Discussions in a code review environment in open source could get spirited at times as people from diverse backgrounds and interests are part of it. As a consequence, open source communities have become introspective and started to think...
Linux is complex, and formal verification has been gaining more and more attention because independent "asserts" in the code can be ambiguous and not cover all the desired points. Formal models aim to avoid such problems of natural language, but the problem is that "formal modeling and verification" sound complex. Things have been changing.
What if I say it is possible to verify Linux...
This topic will discuss how the Android framework utilizes new kernel features
to better handle memory pressure. This includes app compaction, new
kill strategies and improved process tracking using pidfds.
To discuss recent developments and directions with DMABUF:
- DMABUF Heaps/ION destaging
- Better DMABUF ownership state machine documentation
- DMABUF cache maintenance optimizations
- Kernel graphics buffer idea
Userspace has (for a long time) needed a mechanism to restrict path resolution. Obvious examples are those of FTP servers, Web Servers, archiving utilities, and now container runtimes. While the fundamental issue with privileged container runtimes opening paths within an untrusted rootfs was known about for many years, the recent CVEs (CVE-2018-15664 and CVE-2019-10152 being the most recent)...
There are some improvements in the CPU idle time management to be made, like switching over to using time in nanoseconds (64-bit), reducing overhead and some governor modifications (including possible deprecation of the menu governor) which need to be discussed.
A short update on the status of DRM/KMS ecosystem adoption and how Google is improving verification of the DRM display drivers in Android devices.
Android has been using an out-of-tree schedtune cgroup controller for
task performance boosting of time-sensitive processes. The introduction of
the utilization clamping (uclamp) feature in the Linux kernel opens up an opportunity to adopt an upstream mechanism for achieving this goal. The talk will present our plans for adopting uclamp in Android.
The kernel contains a keyrings facility for handling tokens for filesystems and other kernel services to use. These are frequently disabled for container environments, however, because they were not made namespace aware by the authors of the user-namespace and others.
Unfortunately, this lack prevents various things from working inside containers. To get around this, keys are now being...
We have a number of unsolved time and vdso related issues in CRIU.
- Syscall restart: if checkpointing a task interrupted a syscall, on restore CRIU blindly restarts the syscall (executing the SYSCALL/SYSENTER/INT80/etc. instruction with the original regset). This works OKish, but not with time-blocking syscalls, e.g., poll(), nanosleep(), futex(), etc. For this purpose, Glibc and vDSO use...
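Why a blind restart misbehaves for time-blocking syscalls can be shown with plain arithmetic. The following is a minimal sketch, assuming a relative timeout and ignoring the time spent between checkpoint and restore; the function names are hypothetical and this is not CRIU code.

```python
def blind_restart_total(timeout_s, elapsed_before_checkpoint_s):
    """Total time slept if the syscall is re-issued with its original relative timeout."""
    # The time already slept before the checkpoint is forgotten,
    # so the full timeout is waited for again.
    return elapsed_before_checkpoint_s + timeout_s


def adjusted_restart_total(timeout_s, elapsed_before_checkpoint_s):
    """Total time slept if the restart only waits for the remaining time."""
    remaining = max(0.0, timeout_s - elapsed_before_checkpoint_s)
    return elapsed_before_checkpoint_s + remaining
```

For a 10 second sleep interrupted after 7 seconds, the blind restart sleeps 17 seconds in total, while a restart that accounts for the elapsed time sleeps the intended 10 seconds.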
Recently speculative execution techniques have shown that an untrusted application can steal data from another one when both share the same core. To avoid such problems users have to disable SMT, causing non-negligible performance impact. Core-scheduling tries to mitigate the performance problem by allowing trusted applications to run concurrently on siblings of a core while avoiding two...
The csky architecture was officially merged into mainline in Linux 4.20. Shortly before that, eight architectures had been removed from mainline. Many people, including our colleagues, ask what the point of upstreaming csky is. Here, we will give some examples to introduce the progress of the csky architecture over the past six months and the value and significance of linux-csky. This is an...
The demand for DRAM across different platforms is increasing but the cost is not decreasing. Thus DRAM is a major factor in the total cost across all kinds of devices such as mobile, desktop or servers. In this talk we will be presenting the work we are doing at Google, applicable to Android, Chrome OS and data center servers, on extracting more memory out of running applications without impacting...
P2P
- Suggestion with VFIO (Don)
- RDMA as the importer, VFIO as the exporter
get_user_pages() and friends
- Discussion on future GUP, required to support P2P
- GUP to SGL?
- Non struct page based GUP
hmm_range_fault()
- Integrating RDMA ODP with HMM
- 'DMA fault' for ZONE_DEVICE pages
The ABI between Linux and user software mostly sits at the user/privileged boundary, although many architectures extend this with a small amount of special-case code that sits in userspace, such as in special pages or shared libraries (vDSOs) mapped into each user process [1] that user code can call into.
The reasons for this are a bit arbitrary: system interface libraries such as glibc and...
Quick introduction of people. Frame discussion. Will be quick I promise.
Many devs are excited about the progress reported on this new stuff, but is it followed / considered by kernel devs? What kind of gain can be expected? Any potential issues or feedback to share?
For example, for write-ahead logging, one needs to guarantee that writes to the log are completed before the corresponding data pages are written. fsync() on the log file does this, but it is overkill for this purpose.
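The ordering requirement can be sketched as follows. This is a minimal illustration, assuming a toy log format and file names made up for the example; it uses fsync() as the barrier, which is the heavier-than-needed mechanism the question is about.

```python
import os
import tempfile


def wal_commit(log_fd, data_fd, log_record, page, page_offset):
    """Append a log record, make it durable, and only then write the data page."""
    os.write(log_fd, log_record)
    # Ordering barrier: the log record must be on stable storage
    # before the corresponding data page is written.
    os.fsync(log_fd)
    os.pwrite(data_fd, page, page_offset)


def demo():
    """Run one commit against temporary files and return (log bytes, data bytes)."""
    with tempfile.TemporaryDirectory() as d:
        log_path = os.path.join(d, "wal.log")   # hypothetical file names
        data_path = os.path.join(d, "data.db")
        log_fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
        data_fd = os.open(data_path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            wal_commit(log_fd, data_fd, b"set page 0 = AAAA\n", b"AAAA", 0)
        finally:
            os.close(log_fd)
            os.close(data_fd)
        with open(log_path, "rb") as f:
            log = f.read()
        with open(data_path, "rb") as f:
            data = f.read()
        return log, data
```

Here fsync() blocks until the log is durable, although the application only needs the log write to be ordered before the page write.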
RCU has changed a surprising amount over the past few years, what with elimination of many RCU Kconfig options in favor of kernel boot parameters, RCU flavor consolidation, ongoing work on speeding up RCU's handling of offloaded callbacks, and newly started work on providing warnings when RCU's callback handling is overloaded. These changes affect how RCU behaves, and in some cases in ways...
KUnit is a new lightweight unit testing and mocking framework for the Linux kernel. Unlike Autotest and kselftest, KUnit is a true unit testing framework; it does not require installing the kernel on a test machine or in a VM (however, KUnit still allows you to run tests on test machines or in VMs if you want) and does not require...
Gen-Z Linux Sub-system
Discuss design choices for a Gen-Z kernel sub-system and the challenges of supporting the Gen-Z interconnect in Linux.
Gen-Z is a fabric interconnect that connects a broad range of devices from CPUs, memory, I/O, and switches to other computers and all of their devices. It scales from two components in an enclosure to an exascale mesh. The Gen-Z consortium has over...
For some time now, special camera setups exist having features which are challenging for I2C address layouts as we know them in Linux: a) a high-speed serial link which can embed I2C communication (e.g. GMSL or FPD-Link III) and b) the ability to reprogram the client addresses of the I2C devices on the camera.
The use case for these cameras is to run multiple of them in parallel, and not just...
I'd like to review whether and how we can build real-time containers. This should include, but not be limited to, these topics:
- Understanding container Scheduling
- Test and evaluations
- Possible factors related to latency issues
- discussions such as tracing container-level metrics
- tips
- etc.
We know that reducing the sections with preemption and IRQs disabled reduces latency, and that IRQs influence it, but some cases are hard to catch. For example, in the old jump label update, there was a burst of IPIs causing latency spikes. Such non-periodic behavior is hard to model mathematically. As a side effect, this adds pessimism to "possible formulas" that try to define the worst-case...
Application-specific accelerators are going to start showing up in larger numbers in the times ahead. Today there's often no suitable subsystem for them to aggregate into, and the first of them have landed under drivers/misc for the time being.
The goal of this BoF is to introduce and discuss the ground rules for a new drivers/accel subsystem, how it fits in with other subsystems, and...
We are on the fifth iteration of upstreaming IBNBD/IBTRS; the latest effort is here: https://lwn.net/Articles/791690/.
We would like to discuss in an open round about the unique features of the driver and the library, whether and how they are beneficial for the RDMA eco-system and what should be the next steps in order to get them upstream.
A face to face discussion about action items...
Which Real Time softirq implementation do we want for mainline?
- Vector-Lock based? (depends on the sleeping spinlocks machinery)
- Vector masking based?
- Other?
In this talk Dmitry will highlight some of the areas for improvement related to release quality, security, and developer experience and productivity. Then try to show that the existing processes, approaches and tools poorly cope with the current scale and rate of change and don't provide adequate quality and developer experience. Lastly Dmitry will advocate that only pervasive changes to the...
Postgres (and many other databases) have, until fairly recently, assumed that a) IO errors would be reliably signalled by fsync/fdatasync/..., and b) repeating an fsync after a failure would either result in another failure, or the IO operations would succeed.
That turned out not to be true; see also https://lwn.net/Articles/752063/
While a few improvements have been made, both in Postgres and...
At MongoDB, we implemented an eBPF tool to collect and display a complete time-series view of information about all threads, whether they are on- or off-CPU. This allows us to inspect where the database server spends its time, both in userspace and in the kernel. Its minimal overhead allows it to be deployed in production.
This can be an effective method to collect diagnostic information in the field...
- What is needed upstream for real time support of Full Dynticks and isolation?
- Specific requests?
Host Bandwidth Manager (HBM) is a BPF based framework for managing per-cgroupv2 egress and ingress bandwidths in order to provide a better experience to workloads/services coexisting within a host. In particular, HBM allows us to divide a host's egress and ingress bandwidth among workloads residing in different v2 cgroups. Note that although sample BPF programs are included in the BPF patches,...
Discussion around topics related
to PCI specifications and microconference follow up
- Root complex integrated endpoints
- Native host controllers link management
- VFIO/IOMMU/PCI follow up
There is a "backlog" option used in MySQL for both IP and UNIX sockets, but it seems to have significant overhead on workloads with heavy connect/disconnect activity (e.g. most Web apps, which do "connect; SQL query; disconnect"). Is there any explanation or reason for this? Can it be improved?
As memory sizes grow, so do the sizes of the data transferred between RDMA devices. Generally, the operating system needs to keep track of the state of each piece of memory, which on Intel x86 is a 4 KB page. This is also connected to hardware providing memory management features, such as the processor page tables as well as the MMU features of the RDMA NIC.
The overhead of the...
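To give a sense of scale for per-page tracking, the sketch below counts how many pages cover a buffer of a given size. The 4 KiB base page comes from the text above; the 2 MiB figure is an assumption added for comparison (the common x86-64 huge page size), and the helper name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(length_bytes, page_size_bytes):
    """Number of pages required to cover a buffer, rounded up."""
    return -(-length_bytes // page_size_bytes)  # ceiling division on integers


# A 1 GiB buffer tracked in 4 KiB pages versus (assumed) 2 MiB huge pages.
small_pages = pages_needed(GIB, 4 * KIB)   # 262144
huge_pages = pages_needed(GIB, 2 * MIB)    # 512
```

Each of those pages needs its own entry in whatever structure tracks or translates it, so the same 1 GiB buffer requires 512 times fewer entries with the larger page size.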
In this talk, Thomas Gleixner will present the status of PREEMPT_RT, along with a question-and-answer section regarding the upstream work and the future of the project.
Users are very worried about any kind of overhead due to kernel patches applied to solve Intel CPU issues (Spectre/Meltdown/etc.). What are others observing? What kind of workloads / test cases do you use for evaluation?
The way BPF application developers build applications is constantly improving. There are still rough corners, as well as (as of yet) fundamentally inconvenient developer workflows involved (e.g., on-the-fly compilation). The ultimate goal of BPF application development is to provide an experience as straightforward and simple as that of a typical user-land application.
We'll discuss major pain points...
ZRAM is a compressed RAM based block device implementation which has gotten a lot of use recently primarily in the Android world. ZRAM consists of the block device front-end, compressor back-end and memory allocator back-end. Compressor back-end is accessed via a common API, and therefore it is easy with ZRAM to select the particular compression algorithm that fits your special purpose. As...
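The idea of a compressor back-end behind a common API can be sketched as a small registry in which every algorithm exposes the same compress/decompress pair. This is only an illustration in user space: the registry, the function names, and the use of Python's standard-library codecs as stand-ins are all assumptions, not ZRAM's actual interface or algorithms.

```python
import bz2
import lzma
import zlib

# Hypothetical registry: name -> (compress, decompress), all sharing one call signature.
BACKENDS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def store_page(page, algorithm):
    """Compress a page with the selected back-end."""
    compress, _ = BACKENDS[algorithm]
    return compress(page)


def load_page(blob, algorithm):
    """Decompress a stored page with the same back-end."""
    _, decompress = BACKENDS[algorithm]
    return decompress(blob)
```

Because callers only go through `store_page` and `load_page`, switching algorithms is a matter of changing the name passed in, which mirrors the point about selecting the algorithm that fits a particular purpose.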
This session will focus on answering questions on the internals and the usage of Linux-kernel RCU. However, questions regarding details of the RCU-related patches in the -rt patchset will be deferred to other venues, given that this topic consumed the entire time in the 2018 informal RCU BoF session.
This is not intended to be a tutorial on RCU basics, though a separate session on this topic...
The UEFI forum is rolling out a new "code first" process, to be available for both UEFI and ACPI specifications, in order to speed up time between initial definition and upstream support.
The UEFI self-certification testsuite (SCT) has been open sourced.
The UEFI interface implementation in U-Boot is now sufficient for GRUB use (and more) across multiple distributions.
The presentation gives an overview of what has been implemented in the SGX patch set and what there is still left to do. The presentation goes through the known blockers for upstreaming. In particular, access control related issues will be discussed.
At the LSF/MM eBPF track, we discussed the necessity of a common Go
library to interact with BPF. Since then, Cilium and Cloudflare have
worked out a proposal to upstream parts of github.com/newtools/ebpf
and github.com/cilium/cilium/pkg/bpf into a new common library.
Our goal is to create a native Go library instead of a CGO wrapper
of C libbpf. This provides superior performance,...
When multiple instances of workloads are consolidated on the same host, it is
good practice to partition them for best performance, e.g. give a NUMA
node partition to each instance. Currently the Linux kernel provides two
interfaces for hard partitioning: the sched_setaffinity system call or the cpuset.cpus
cgroup. But this doesn't allow one instance to burst out of its partition
and use available CPUs from other...
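Hard partitioning with the sched_setaffinity system call can be sketched from user space as below. This is a minimal, Linux-only illustration using Python's `os.sched_setaffinity` wrapper; the one-CPU "partition" and the helper names are made up for the example.

```python
import os


def pin_to_partition(cpus):
    """Restrict the calling process to the given CPUs and return the resulting mask."""
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return os.sched_getaffinity(0)


def demo():
    """Pin to a hypothetical one-CPU partition, then restore the original mask."""
    original = os.sched_getaffinity(0)
    partition = {min(original)}  # assumption: the lowest allowed CPU is "our" partition
    try:
        return pin_to_partition(partition), partition
    finally:
        os.sched_setaffinity(0, original)
```

Once pinned, the process cannot run on CPUs outside the mask even if they are idle, which is the limitation the abstract describes.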
A quick update on the objtool port on Power, what is the current state and
what more needs to be done. Also, discuss how we integrate it upstream.
eBPF offload is a powerful feature on modern SmartNICs used to accelerate
XDP or TC based BPF. The current kernel eBPF offload infrastructure was
introduced for the Netronome NFP based SmartNICs, which were based around a
proprietary ISA and had some specific verifier requirements.
In the near future this may be joined by SmartNICs using public ISAs such
as RISC-V and Arm which also happen...
Currently, the BPF verifier has to "execute" code at least once and then it can prune branches when it detects the state is the same. In this session we would like to cover a technique called Scalar Evolution (SCEV), which is used by LLVM and GCC in optimization passes, for example to identify and promote induction variables and to do worst-case trip count analysis over loops. At its most basic...
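The kind of result such an analysis aims for can be illustrated on the simplest case: a loop whose induction variable starts at a constant and grows by a constant positive step. The sketch below is a much simplified illustration of the idea, not how LLVM, GCC, or the BPF verifier implement it; both function names are hypothetical.

```python
def trip_count(start, end, step):
    """Closed-form iteration count of `for (i = start; i < end; i += step)`, step > 0."""
    if step <= 0:
        raise ValueError("this sketch only handles a positive constant step")
    if start >= end:
        return 0
    # Ceiling of (end - start) / step, computed on integers.
    return (end - start + step - 1) // step


def simulated_trip_count(start, end, step):
    """Reference: count iterations by actually running the loop."""
    n = 0
    i = start
    while i < end:
        n += 1
        i += step
    return n
```

The closed form gives the bound without walking the loop body iteration by iteration, which is the contrast with "executing" the code that the abstract draws.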
The Restartable Sequences system call [1,2,3,4] introduced in Linux 4.18 has limitations which can be solved by introducing a bytecode interpreter running in inter-processor interrupt context which accesses user-space data.
This discussion is about the subset of the eBPF bytecode and context needed by this interpreter, and extensions of that bytecode to cover load-acquire and...
The main issue in using TPM2.0 in such a measured boot solution is that at the
moment of writing this abstract neither Trusted Grub nor the Linux kernel has
a TPM2.0 implementation. There are of course implementations based on UEFI
systems, where bootloaders can utilize TCG EFI protocol to handle TPM. However
other non-UEFI based solutions suffer from lack of TPM2.0 drivers in the
bootloaders....
At the time of writing this paper the Linux kernel supported TPM 1.2
functionalities in sysfs. These functionalities include:
```
$ ls /sys/devices/pnp0/00:04/tpm/tpm0
active caps device enabled pcrs ppi subsystem timeouts
cancel dev durations owned power pubek temp_deactivated uevent
$ ls /sys/devices/pnp0/00:04/tpm/tpm0/ppi
request response ...
```
Discussion about current live patch services and how we can make them more open and flexible.
How can we get more open source distributions to use, or build their own, live patch services?
What are we still missing, and what can we share?
Buses will start circulating at 7:30PM.
Last return bus is at 11PM