About msync() system call
Contemplating a possible implementation of log-based transactional system and looking at the UNIX API it seems natural to employ msync() function for memory mapped log files.
NAME msync - synchronize memory with physical storage SYNOPSIS int msync(void *addr, size_t len, int flags);
Indeed msync specification states the following:
The msync() function should be used by programs that require a memory object to be in a known state; for example, in building transaction facilities.
The idea is that the log file should be mmaped to the address space of the process, the log data is written to the memory and synchronized with disk by the power of OS virtual memory mechanism. This way there is no need to allocate in-memory buffer for log data and call write() when the buffer is full. Instead just when the transaction is to be committed, exactly the portion of the mmaped log that contains the transaction data is msynced and that’s it. Concurrently the data for other transactions can be written further down the log and stay cached in memory avoiding unnecessary I/O. Additional appeal to msync() gives the existence of two modes MS_ASYNC and MS_SYNC:
When MS_ASYNC is specified, msync() shall return immediately once all the write operations are initiated or queued for servicing; when MS_SYNC is specified, msync() shall not return until all write operations are completed as defined for synchronized I/O data integrity completion.
I can’t help but think that msync() was introduced to UNIX specifically to cater DBMS people. This cannot be a coincidence. This is just what one would want developing a DBMS engine.
Okay, so far I referred to POSIX and UNIX. However currently probably most attention deserves one particular implementation of POSIX API, namely, Linux. Just checking the Linux msync() man page it seems that everything is good. It pretty much conforms the POSIX specification. Or so it says.
Once you start wondering what is situation on the ground the picture becomes more complicated. One interesting tidbit can be found in FreeBSD man page:
The msync() system call is obsolete since BSD implements a coherent file system buffer cache. However, it may be used to associate dirty VM pages with file system buffers and thus cause them to be flushed to physical media sooner rather than later.
This is confusing. The purpose of msync is to ensure data integrity. I understand that if a process crashes then its modified mmapped data still remains in the system cache and at some point it will be synchronized with physical storage. So far so good. But what if the whole system crashes? Without msync() this will result in the data loss. Or are they saying that their msync() merely causes the page flush to happen somewhat earlier but not right away? So on FreeBSD there is no big difference whether you msync() or not as it provides no integrity guarantee anyway? Well, I don’t have answers to these questions as now I don’t want to spend much time on FreeBSD research, I’m more focused on the Linux.
About msync() on Linux
So what is about msync() on Linux precisely? In the fairly recent Linux release the following comment could be found in the file
/* * MS_SYNC syncs the entire file - including mappings. * * MS_ASYNC does not start I/O (it used to, up to 2.5.67). * Nor does it marks the relevant pages dirty (it used to up to 2.6.17). * Now it doesn't do anything, since dirty pages are properly tracked. * * The application may now run fsync() to * write out the dirty pages and wait on the writeout and check the result. * Or the application may run fadvise(FADV_DONTNEED) against the fd to start * async writeout immediately. * So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to * applications. */
So let me summarize the current status of msync() on Linux:
- msync(…, MS_ASYNC) is effectively noop
- msync(…, MS_SYNC) is effectively equal to fsync()
The bottom line is msync() is completely useless on Linux. It cannot help with transaction log idea described above and for that matter it cannot help with anything else. The comment in the source code suggests to use other system calls. At the same time the Linux man page for msync() is absolutely misleading. It makes it apear that everything’s fine, that it fully implements the UNIX specifications.
Okay, should we stop here? Or is there more to learn yet? Sure, it is.