I always found it interesting that WinNT got this one right very early on (IOCP; first appeared in NT 3.5 in 1994), and thus avoided all these problems, and a series of subsequent APIs trying to correct them. The way you do efficient asynchronous I/O in Win32 today is still the same as it was 20 years ago.
For those unfamiliar with that model, here's a brief description:
My understanding is that kqueue in FreeBSD is also conceptually similar, but I never had a chance to take a close look at that.
From the man page...
> Q6 Will closing a file descriptor cause it to be removed from all epoll sets automatically?
> A6 Yes, but be aware of the following point. A file descriptor is a reference to an open file description (see open(2)). Whenever a file descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or fork(2), a new file descriptor referring to the same open file description is created. An open file description continues to exist until all file descriptors referring to it have been closed. A file descriptor is removed from an epoll set only after all the file descriptors referring to the underlying open file description have been closed (or before if the file descriptor is explicitly removed using epoll_ctl(2) EPOLL_CTL_DEL). This means that even after a file descriptor that is part of an epoll set has been closed, events may be reported for that file descriptor if other file descriptors referring to the same underlying file description remain open.
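For anyone who wants to see the behaviour described in A6, here is a minimal sketch. It assumes Linux and uses Python's select.epoll and os modules, which wrap the same system calls; the function name events_after_close is made up for illustration.

```python
import os
import select


def events_after_close():
    """Register a pipe's read end, dup it, close the original, then poll."""
    r, w = os.pipe()
    ep = select.epoll()
    ep.register(r, select.EPOLLIN)
    r_dup = os.dup(r)      # second descriptor for the same open file description
    os.close(r)            # the registered descriptor number is now closed
    try:
        os.write(w, b"x")  # make the underlying description readable
        # The registration follows the description, so an event can still
        # be reported under the old (closed) descriptor number.
        return r, ep.poll(0.1)
    finally:
        ep.close()
        os.close(r_dup)
        os.close(w)
```

Calling events_after_close() returns the closed descriptor number together with whatever epoll reported for it.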
Lots of comments here seem to think this behaviour is unexpected or a bug. Closing an FD you are still using is a bug. I think epoll does a fairly good job of letting the user know that it is watching the description and not the descriptor. Failing to read the man page for dup would also leave you with a blind spot. I have been writing code for Linux for a while now, and I did not think it was any secret that a file stays open until all of the fds pointing to it are closed. That is why you have to take care to close your duplicated fds at the right time; otherwise you will end up leaking file handles. The example code provided illustrates this perfectly.
As a side note, using dup2 to point your original FD number (the one you passed to epoll) back at the still-open description held by the duped fd should allow you to remove it.
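A rough sketch of that dup2 trick, assuming Linux, a single-threaded program, and that nothing else has reused the old descriptor number in the meantime. The function name unregister_via_dup2 is hypothetical.

```python
import os
import select


def unregister_via_dup2():
    """Recover a stale epoll registration by re-creating the old fd number."""
    r, w = os.pipe()
    ep = select.epoll()
    ep.register(r, select.EPOLLIN)
    r_dup = os.dup(r)
    os.close(r)                  # registration survives because r_dup is open
    try:
        os.write(w, b"x")
        before = ep.poll(0.1)    # still reports the old descriptor number
        os.dup2(r_dup, r)        # number r refers to the description again
        ep.unregister(r)         # EPOLL_CTL_DEL can now find the entry
        after = ep.poll(0.1)
        os.close(r)
        return before, after
    finally:
        ep.close()
        os.close(r_dup)
        os.close(w)
```

This only works while the old number is still free, which is the "assuming nothing grabbed that FD" caveat: in a multi-threaded program another open() or accept() could take the number first.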
Interesting writeup, seems like this only applies to people writing epoll abstraction layers though. Everybody else is able to avoid this bug by deregistering the fds before calling close. So I'm not sure calling it "fundamentally broken" is necessarily accurate.
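The ordering in question can be sketched as a small helper. This assumes Linux and Python's select.epoll; close_registered and demo_deregister_first are illustrative names, not part of any library.

```python
import os
import select


def close_registered(ep, fd):
    """Remove fd from the epoll set first, then close it."""
    ep.unregister(fd)   # EPOLL_CTL_DEL while the descriptor is still valid
    os.close(fd)


def demo_deregister_first():
    r, w = os.pipe()
    r_dup = os.dup(r)   # a duplicate keeps the description alive
    ep = select.epoll()
    ep.register(r, select.EPOLLIN)
    close_registered(ep, r)
    try:
        os.write(w, b"x")
        return ep.poll(0.1)   # nothing is registered any more
    finally:
        ep.close()
        os.close(r_dup)
        os.close(w)
```

Even with a duplicate descriptor still open, no events arrive, because the registration was removed before the close.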
While a best effort has been made to mimic the Linux semantics, there are some semantics that are too peculiar or ill-conceived to merit accommodation. In particular, the Linux epoll facility will -- by design -- continue to generate events for closed file descriptors where/when the underlying file description remains open. For example, if one were to fork(2) and subsequently close an actively epoll'd file descriptor in the parent, any events generated in the child on the implicitly duplicated file descriptor will continue to be delivered to the parent -- despite the fact that the parent itself no longer has any notion of the file description! This epoll facility refuses to honor these semantics; closing the EPOLL_CTL_ADD'd file descriptor will always result in no further events being generated for that event description.
"Using these toolkits is like trying to make a bookshelf out of mashed potatoes."
- Jamie Zawinski
For people reading this thread who want to better understand the distinctions between files, file descriptions, and file descriptors, I recommend Michael Kerrisk's book The Linux Programming Interface, which has great coverage of this topic.
I am making a new post because I feel like this is getting out of hand. There is a lot of misinformation that needs to be cleared up.
When you are working with system calls you use fds to tell the kernel what file you are working with. These are simply numbers, an int to be exact. When you call a system call that takes an fd, you pass the int that references the file you want to work with, and it is looked up in a table to find the kernel structure that represents the file. When you dup an fd, all that does is add an entry to the lookup table that points to the same kernel structure -- increasing the refcount at the same time. The refcount lives on the kernel structure, not with the int. The fd system is really simple: open, dup, dup2 and so on add an entry, while close removes the entry. Once the entry is removed, the fd is useless for any communication with the kernel over system calls, because the lookup table can no longer translate your fd to a kernel structure.
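One way to watch two table entries pointing at the same kernel structure is the shared file offset. A minimal sketch, assuming a POSIX system; the function name shared_offset_via_dup is made up.

```python
import os
import tempfile


def shared_offset_via_dup():
    """Show that dup'd descriptors share one file offset."""
    fd, path = tempfile.mkstemp()
    os.unlink(path)                 # the file lives on until its last fd closes
    dup_fd = os.dup(fd)             # new table entry, same kernel structure
    try:
        os.write(fd, b"hello world")
        os.lseek(fd, 6, os.SEEK_SET)                # move the offset through fd
        pos = os.lseek(dup_fd, 0, os.SEEK_CUR)      # observe it through dup_fd
        os.close(fd)                                # drops only one table entry
        fd = None
        data = os.read(dup_fd, 5)                   # the description is still open
        return pos, data
    finally:
        if fd is not None:
            os.close(fd)
        os.close(dup_fd)
```

The offset set through one descriptor is visible through the other, and closing one of them leaves the file usable through the remaining one.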
Now the question: is epoll broken, or does it have a bug? I don’t think so. You just have to keep in mind you are using a system call and not a library. This is controlled access to kernel level functions.
In the example, adding an fd to epoll, then duping it, then closing the original produces some surprising results. Understanding how userspace talks to kernel space and how system calls translate into running kernel code is key to understanding why this is not a bug but just how things work. You might also be surprised that other system calls behave the same way!
When you tell epoll to wait you use a fd to tell it which file you want to wait on. You are not telling a library, or a userspace program to do this. You are telling the kernel. So the fd is translated into a kernel object and that is what epoll is working with. Only when adding and removing is this translation done. The kernel subsystem that provides epoll will continue to work with kernel objects -- not fds.
When you call dup you are adding a new entry to your program's fd lookup table for a new fd that points to the same kernel structure -- at the same time you are also incrementing the refcount of the kernel structure. The file is not actually closed until both the original fd and the duped fd have had close called on them, or more correctly, until the refcount of the kernel structure is zero. This means that if you tell epoll to wait on a file and do so by an fd, and then close the fd -- you will only see a close event if the refcount is zero. By calling dup before closing your original fd you have incremented the refcount and thus the file remains open -- no close event.
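The same refcount rule can be seen from the other end of a pipe: the reader only gets a hang-up once every descriptor for the write end is closed. This is a sketch under the assumption of Linux pipe semantics, with a made-up function name hangup_after_last_close.

```python
import os
import select


def hangup_after_last_close():
    """Poll a pipe's read end while closing the write end's fds one by one."""
    r, w = os.pipe()
    w_dup = os.dup(w)          # refcount on the write end is now 2
    ep = select.epoll()
    ep.register(r, select.EPOLLIN)
    try:
        os.close(w)            # refcount drops to 1, the write end stays open
        first = ep.poll(0.1)
        os.close(w_dup)        # refcount drops to 0, the write end really closes
        w_dup = None
        second = ep.poll(0.1)
        return first, second
    finally:
        ep.close()
        os.close(r)
        if w_dup is not None:
            os.close(w_dup)
```

The first poll comes back empty because the write end is still open through the duplicate; only the second poll reports EPOLLHUP.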
If you then try to remove your file from epoll using the now invalid (closed) fd it will fail. This is because the fd is no longer in the fd lookup table and the system call can not translate your fd into a kernel object.
Now you might say this is not how other system calls work. Someone in another thread pointed out that “read” does not behave this way. To the contrary -- read actually works the same way. If you dup your socket fd, and use a thread to close your socket fd while in an active read -- your read will not fail. It will finish delivering the data to you. This is because the fd is not the actual object you are working with. It is just a reference to the object. Once you enter the read function you are in kernel land, and the kernel only knows about the kernel object. Because you duped the fd before closing it, the kernel object has a refcount of 2, and your close during the read leaves the refcount at 1, so there is nothing for the kernel to do other than finish its read. If you attempt to use the now closed and invalid fd for a subsequent read or close, you will get an EBADF error -- much like you do when you attempt to use it with epoll's functions to remove the file.
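The EBADF half of that is easy to check without threads. A minimal sketch, assuming Linux; errno_of and closed_fd_errors are hypothetical helper names. One caveat about the wrapper rather than the kernel: Python's epoll.unregister only surfaces EBADF from version 3.9 onward, and older versions swallow it.

```python
import os
import select


def errno_of(call, *args):
    """Run call(*args) and return the errno of any OSError, else None."""
    try:
        call(*args)
    except OSError as exc:
        return exc.errno
    return None


def closed_fd_errors():
    r, w = os.pipe()
    ep = select.epoll()
    ep.register(r, select.EPOLLIN)
    r_dup = os.dup(r)
    os.close(r)                                # r is now an invalid number
    try:
        os.write(w, b"x")
        read_err = errno_of(os.read, r, 1)     # lookup of r fails
        ctl_err = errno_of(ep.unregister, r)   # EPOLL_CTL_DEL fails the same way
        still_registered = ep.poll(0.1)        # the entry is still in the set
        data = os.read(r_dup, 1)               # the duplicate works fine
        return read_err, ctl_err, still_registered, data
    finally:
        ep.close()
        os.close(r_dup)
        os.close(w)
```

Both the read and the epoll removal fail at the same step, translating the fd number into a kernel object, while the duplicate descriptor keeps working and the registration stays in place.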
I would think you could use dup (or dup2/ioctl) to de-register (assuming nothing grabbed that FD)?
> had shedded some light