Databases utilize and depend on a variety of kernel interfaces and are critically dependent on their specification, conformance to specification, and performance. Failure in any of these results in data loss, loss in revenue, or degraded experience or if discovered early, software debt. Specific interfaces can also remove small or large parts of user space code creating greater efficiencies.
This microconference will get a group of database developers together to talk about how their databases work, along with kernel developers currently developing a particular database-focused technology to talk about its interfaces and intended use.
Database developers are expected to cover:
The architecture of their database;
The kernel interfaces utilized, particularly those critical to performance and integrity
What is a general performance profile of their database with respect to kernel interfaces;
What kernel difficulties they have experienced;
What kernel interfaces are particularly useful;
What kernel interfaces would have been nice to use, but were discounted for a particular reason;
Particular pieces of their codebase that have convoluted implementations due to missing syscalls; and
The direction of database development and what interfaces to newer hardware, like NVDIMM, atomic write storage, would be desirable.
The aim for kernel developers attending is to:
Gain a relationship with database developers;
Understand where in development kernel code they will need additional input by database developers;
Gain an understanding on how to run database performance tests (or at least who to ask);
Gain appreciation for previous work that has been useful; and
Gain an understanding of what would be useful aspects to improve.
The aim for database developers attending is to:
Gain an understanding of who is implementing the functionality they need;
Gain an understanding of kernel development;
Learn about kernel features that exist, and how they can be incorporated into their implementation; and
Learn how to run a test on a new kernel feature.
If you are interested in participating in this microconference and have topics to propose, please use the CfP process. More topics will be added based on CfP for this microconference.
Daniel Black firstname.lastname@example.org
Quick introduction of people. Frame discussion. Will be quick I promise.
many devs are excited about the progress reported on this new stuff, but is it followed / considered by kernel devs.? what kind of gain to expect? any potential issues or feedback to share?
for example, for a write-ahead logging, one needs to guarantee that writes to log are completed before the corresponding data pages are written. fsync() on the log file does this, but it is an overkill for this.
seems like the patches proposed by Fusion-io devs for general
O_ATOMIC support within Linux kernel are in stand-by since 6 years.. -- any plans to address it ?.. What is the main reason to not guarantee atomicy of
O_DIRECT writes on flash drives? -- seems like most of flash storage vendors are able to provide atomic writes support on HW level, and just SW level (kernel/FS/etc.) is missed.....
since newer kernels (4.14, 5.1, ..) we are observing 50% regression on MySQL IO-bound workloads using EXT4 comparing to the same results on the same HW, but running kernel 3.x or 4.1. Unfortunately we have absolutely no explanation for this regression right now and looking for any available FS layer instrumentation/visibility to understand what is the root problem for such a regression and how...
historically XFS was always showing lower performance comparing to EXT4 on most of IO-bound workloads used for MySQL/InnoDB benchmark testing.. However, since the new kernels & XFS arrived, we observed significantly better results on XFS now -vs- EXT4 particularly when InnoDB "double write" is enabled. From the other side, for our big surprise, XFS was doing worse if "double write" was...
(1) SQLite is the most widely used database in the world. There are probably in excess of 300 billion active SQLite databases on Linux devices. SQLite is a significant client of the Linux filesystem - perhaps the largest single non-streaming client, especially on small devices such as phones.
(2) Unlike other relational database engines, SQLite tends to live out on the edge of the network,...
Postgres (and many other databases) have, until fairly recently, assumed that IO errors would a) be reliably signalled by fsync/fdatasync/... b) repeating an fsync after a failure would either result in another failure, or the IO operations would succeed.
That turned out not to be true: See also https://lwn.net/Articles/752063/
While a few improvements have been made, both in postgres and...
At MongoDB, we implemented an eBPF tool to collect and display a complete time-series view of information about all threads whether they are on- or off-CPU. This allows us to inspect where the database server spends its time, both in userspace and in kernel. Its minimal overhead allows to deploy it in production.
This can be an effective method to collect diagnostic information in the field...
since MySQL 8.0 we have a newly redesigned lock-free REDO log implementation. However, this development involved several questions about overall efficiency around MT communications and synchronization. Curiously spinning on CPU showed to be the most efficient on low load.. -- but any plans to implement "generic" MT framework for more efficient execution of any MT apps ?
there is "backlog" option used in MySQL for both IP and UNIX sockets, but seems like it has a significant overhead on heavy connect/disconnect activity workloads (e.g. like most of Web apps which are doing "connect; SQL query; disconnect") -- any explanation/ reason for this? can it be improved?
MySQL is allowing user sessions connections via IP port and UNIX socket on Linux systems. However, curiously connecting via UNIX socket is delivering up to 30% higher performance comparing to IP local port (loopback).. -- any reason for this? and be "loopback" code improved to match the same level of efficiency as UNIX socket? can the same improvements make over all IP stack to be more efficient?
all MT apps are extremely sensible to CPU cache issues, and MySQL/InnoDB is part of them.. Several times we observed significant regressions (up to 40% and more) due CPU cache miss or simple cache sync due concurrent access to the same variable by several threads, and all "perf" CPU related stats did not show any difference.. Any plans to address it with more deep CPU stats instrumentation?
users are very worry about any kind of overhead due kernel patches applied to solve Intel CPU issues (Spectre/Meltdown/etc.) -- what others are observing? what kind of workloads / test cases do you use for evaluation?
From discussions to code. Where it goes from here?