> This is not really an article about "How the D garbage collector works", rather an article about "How to avoid garbage in D".
I think the latter is of greater interest to the general programmer, however. Initial hubris seems to be the rule for new tools in programming, and in retrospect, this seems to have been true for GC technology. The original motivation was to offload the most painful part of memory management work from the coder, but it somehow morphed into, "Takes care of everything to do with memory management." This resulted in a misguided quest for a "magical" GC that had super throughput, or super low latency. (Or somehow both.) This in turn resulted in numerous GC knobs that were hard to understand and tune.
A recent trend seems to take the approach that GC allows development to get to a correctly running program faster, at which point, good profiling and language/library tools can be used to minimize memory churn. To this end, GC should be optimized by default for latency -- as this is the most likely optimization for a better development environment.
> The original motivation was to offload the most painful part of memory management work from the coder, but it somehow morphed into, "Takes care of everything to do with memory management."
I don't understand the difference. GC was invented for Lisp. Lisp was fully automatically memory managed, just as GC'd languages are today. John McCarthy never conceived of Lisp as a language with malloc and free.
> To this end, GC should be optimized by default for latency -- as this is the most likely optimization for a better development environment.
I don't see how that follows. Throughput is just as important as latency in most scenarios, development included. There's nothing about development that makes latency more desirable than throughput—if anything, I would think the opposite would be true, as running batch testing jobs usually demands good throughput.
> Throughput is just as important as latency in most scenarios, development included. There's nothing about development that makes latency more desirable than throughput
Having sluggish, unresponsive, wonky tools, and a sluggish, unresponsive, wonky prototype just sucks. All of those little uncomfortable pauses wear you down and nickel-and-dime you to death.
> I would think the opposite would be true, as running batch testing jobs usually demands good throughput.
Running batch testing is exactly the kind of scenario where profiling and optimization can score big wins -- and most often, those wins also apply to the production version. IMO, making your development environment uncomfortable for edit/test/debug just so you can have faster batch testing is a misguided reversal of priorities! Not having the latter provides you with less timely information, but is easily fixed. Not having the former is constantly fatiguing!
> Having sluggish, unresponsive, wonky tools, and a sluggish, unresponsive, wonky prototype just sucks. All of those little uncomfortable pauses wear you down and nickel-and-dime you to death.
I agree. That's why you don't usually want extreme low-latency collectors that make your program slow by sacrificing throughput for latency.
> Running batch testing is exactly the kind of scenario where profiling and optimization can score big wins -- and most often, those wins also apply to the production version.
And that's why you should have a GC that balances throughput and latency.
> That's why you don't usually want extreme low-latency collectors that make your program slow by sacrificing throughput for latency.
Of course you know there's a difference between a slow program and a sluggish/unresponsive one. A user will forgive the former, but will abhor the latter. As a programmer, you're a kind of user. However, I note that many programmers have a kind of un-self-awareness when it comes to their own creations and favored, highly customized tools. If a programmer thinks something is nifty for another reason, they will often excuse much more latency than an ordinary user ever would. This is one of the biggest UX blind spots programmers have, generally.
Also, with an environment where it's easy to fix throughput, why not mostly eliminate latency? Fixing throughput is usually more straightforward and easier than fixing latency. (Pareto principle)
In practice, due to programmer hubris and bias against perceiving flaws in our own creations, one has often ended up with GC that has better than needed throughput numbers to alleviate speed criticisms while having latency that is a little bit uncomfortable. I think it's a common programmer blindness with regards to UX.
> Fixing throughput is usually more straightforward and easier than fixing latency. (Pareto principle)
No, because improving throughput requires a generational GC, which is something you need to design from the beginning. Once you do, it tends to be a massive win.
> In practice, due to programmer hubris and bias against perceiving flaws in our own creations, one has often ended up with GC that has better than needed throughput numbers to alleviate speed criticisms while having latency that is a little bit uncomfortable. I think it's a common programmer blindness with regards to UX.
Or maybe you don't notice any throughput problems because most mature GCs have good throughput, leading to the incorrect conclusion that throughput is unimportant. I highly suspect that if all apps were using extreme low-latency GCs, you would notice throughput problems.
Take JS, for instance: you know how JS-heavy apps are criticized all the time on HN for being slow to load? That's a throughput problem.
> No, because improving throughput requires a generational GC
Not necessarily true. Often, the throughput issues can be improved by reducing memory pressure. (Though this may not cut it for situations with very high performance requirements.)
> requires a generational GC, which is something you need to design from the beginning. Once you do, it tends to be a massive win.
Depending on the environment, you can switch to a higher throughput GC in production.
> Or maybe you don't notice any throughput problems because most mature GCs have good throughput, leading to the incorrect conclusion that throughput is unimportant.
Maybe. Most of my experience working in GC environments was in VisualWorks, which had an excellent generational GC in its day. One of the points of pride with the VM was that it was common to see only around 2ms latency for GC pauses. There was quite a bit of work done to eliminate perceptible GC pauses. Even so, on first exposure, GC pauses were the first thing people would notice.
In any case the words "extreme low latency GC" are yours, not mine, and I suspect you have a very particular subset in mind when you use that term, which perhaps isn't what I have in mind. Emphasis should be on low latency, to the point where the average programmer doesn't have to think about it very often, until it's time to optimize the program in general.
I suspect you have a very good blog post about the details.
>> Fixing throughput is usually more straightforward and easier than fixing latency. (Pareto principle)
> No, because improving throughput requires a generational GC
Or more boxes. From the perspective of an end user or IT admin, if you have a throughput problem, you can (assuming the task lends itself) solve it with horizontal scaling. If you have a latency problem, you can only scale vertically. If scaling vertically doesn't do the trick, you're out of luck.
This is why when I have to make a choice between latency and throughput, if there's no obviously correct answer I default to latency. Intel will eventually solve my throughput problems, but they won't be able to do much about my latency problems.
> All of those little uncomfortable pauses
But low-latency collectors have more pauses than low-throughput collectors, as well as spending more overall time collecting than low-throughput collectors. They also have a tendency towards higher memory pressure in the long run (though it's far less spiky than low-throughput collectors).
> But low-latency collectors have more pauses than low-throughput collectors,
But if you never notice any of the pauses because you're very well optimized for latency, then no harm, no foul.
> as well as spending more overall time collecting than low-throughput collectors.
But too-low throughput is usually only a problem in production, whereas too-high latency can increase cost (in harder to quantify ways) everywhere, including in development. Too-low throughput is also usually much easier to fix.
If you need high throughput, you need high throughput, but for most programmers, the correct default is low latency. High throughput for most programmers is like 150 mph top speed on a US sports car. In most cases, people are never going to need it, but if you need it (like, if you actually go to a track) it's relatively straightforward to implement. Whereas having unresponsive programs is like having an unreliable car that has squeaky brakes -- you're going to notice that relatively quickly, it's going to be a constant bother, and it's not so straightforward to correct.
Low latency is like a car with an autonomy of 20000km (12000 miles) but a speed limit of 80km/h (50mph): pretty slow (low throughput) but no annoying pauses when you drive from NYC to SF. Balanced throughput and latency means you have a car with 800km autonomy but you can drive up to 150km/h (90mph). Of course you usually don't need that much speed and you'll end up having a lot of pauses on your SF to NYC trip, but you'll get there a lot quicker because you'll be able to drive at 70 mph all the time.
Which model is better? I don't know, but all the cars I know are the second kind. What about GCs? I don't know, but the majority of modern GCs are the second kind…
> Low latency is like a car with an autonomy of 20000km (12000 miles) but a speed limit of 80km/h (50mph): pretty slow (low throughput) but no annoying pauses when you drive from NYC to SF.
If you had a bus like this, combined with autonomous self-driving, it might revolutionize travel in the US!
> Of course you usually don't need that much speed and you'll end up having a lot of pauses on your SF to NYC trip, but you'll get there a lot quicker because you'll be able to drive at 70 mph all the time.
And when you arrive, all of the driving and stopping will leave you fatigued. The steady-state travel enabled by the robot nuclear super-bus would let people travel in sleeper compartments!
> Which model is better?
Clearly the 2nd!
> I don't know, but all the cars I know are the second kind. What about GCs? I don't know, but the majority of modern GCs are the second kind…
You seem to be leaning rather heavily on the adjective "modern." Because if you use the colloquial definition of the term, there are a number of currently maintained and used GCs that aren't the 2nd kind.
> Because if you use the colloquial definition of the term, there are a number of currently maintained and used GCs that aren't the 2nd kind.
For the majority of applications, those GCs are inferior to generational GCs that balance throughput and latency.
No, most programs require a balance of throughput and latency.
This stuff was all heavily studied in the 90s and earlier, you know. The conclusions that researchers and engineers came to are equally valid today.
GC has mostly been studied in languages where you cannot avoid the GC: Java, Smalltalk, Lisp, etc. If you can avoid the GC and pick the low-hanging fruit, the circumstances for the garbage collector might be very different from all this previous research.
D is not popular enough so far, so there will not be much research invested into its GC. When Java gets value types, it might become interesting.
Well, there's the moderate in the discussion! Many good points made on each side, but that's a great point. There's been software like the JX operating system that varied GCs on a per-component basis. There was also the Oberon work that mixed no-GC components with a GC'd OS and tools. Exploring options that aren't all-latency or all-throughput GC -- including the ability to eliminate GC effects for some components -- might lead to exciting new paths.
Outside this sub-field, another one that offers evidence of your opinion is static analysis vs dynamic checks. Full, memory safety for C programs originally required crazy overhead (300+% sometimes) with checks on all the things that could happen. Then, people started mixing static analysis with dynamic checks and clever architectures to reduce that overhead. They'd often use static analysis to determine when they could eliminate checks. I can see something similar happening with GC's where an analysis helps avoid bad operations. We already have a taste of it with Rust where we get some safety without GC.
C# has had value types for a long time. The conclusions the C# developers came to were that the tradeoffs are indeed the same as in previous languages.
For some reason people keep wanting to deny the generational hypothesis, in the hopes of not having to implement a generational GC. Frankly, this is wishful thinking. The generational hypothesis has been borne out again and again and again, in all kinds of different scenarios.
(I think you mean high-throughput, not low-throughput: broadly speaking, the options are high-throughput but high-latency, or low-latency but low-throughput, or some balance of the two.)
Yes, thanks. :)
> a very slow and conservative stop the world Mark & Sweep, which runs at every allocation
It doesn't run on every allocation. It can only run during an allocation, but that doesn't mean it will, and it's hard to think of a standard use case in which it would run on every allocation. FTA:
"The first thing to understand about D’s garbage collector is that it only runs during allocation, and only if there is no memory available to allocate."
> Due to the unfortunate C ABI, a fast copying collector is not really doable. One cannot track all the internal and external pointers. Well, one could, but has to pessimize the locals on every external pointer reference.
Note that a GC does not have to be copying in order to be fast. What makes or breaks performance for most GCs is whether they are generational. A non-generational copying GC will generally not have impressive performance, either.
A copying GC allows you to have a bump allocator, but a non-copying generational GC can still have a pool allocator. Using a pool allocator does not share all of the benefits of a bump allocator (such as good memory locality for successive allocations), but still allows for very fast allocations for small objects.
Also, mostly copying GCs are still an option even if you scan roots conservatively. See Joel Bartlett's mostly copying GC from the late 1980s.
 Not counting non-tracing approaches, such as deferred reference counting, which achieve similar goals through other means.
A fast copying collector for D would require write barriers: using memory barriers every time a pointer is written. This has a runtime cost.
> which runs at every allocation.
The article explains in great depth that this is not the case.
> This has a runtime cost.
As does using a conservative/"very slow" GC. I'm sceptical, but it may be that D's situation means trading a pervasive slowness (if the parent is true; I don't know) for no write barriers is the right choice, differing from the conventional wisdom of other languages -- but that's the argument you need to make, not just that write barriers are slow.
The case against write barriers is pretty simple: the slowness of the GC can be worked around, while memory barriers cannot. If they are needed, there is no other way to copy pointer values (which are quite pervasive in systems programming).
> The case against write barriers is pretty simple: the slowness of the GC can be worked around, while memory barriers cannot.
What workaround do you have in mind, and how does it not apply to write barriers?
You can always use less of the GC, but there is no way not to use pointers.
For example, I sell products in D which disable the GC completely, I certainly don't want write barriers.
If you can safely disable the GC, then you can safely disable write barriers.
And if I use only a bit of the GC, can I use only a bit of write barriers?
Walter Bright was right about segmented stacks, I'm tempted to think he's right about write barriers too.
If the compiler can statically tell which pointers point to GC memory (something it usually needs to know anyway), yes.
In D this is indeed known only at runtime; one of the decisions was that "there should be only one type of pointer".
immutable(T) doesn't guarantee it isn't in the GC heap from what I can gather. shared(T) could perhaps do it, but it's rather unpopular.
If you really don't know until runtime what heap a pointer is in, then write barriers will have some kind of cost, but it should be low (definitely not atomic).
However, if you don't know statically what kind of heap a pointer is in, you're going to have a poorly performing GC system no matter what.
The slow GC is only a runtime cost if it ever runs. If no garbage is created in inner loops and stack allocation is preferred, there's no reason it _has_ to run.
> A fast copying collector would require write barriers: using atomics every time a pointer is written.
This is not how garbage collectors work. At all.
Let me amend then: "A fast copying collector _in D_ would require write barriers". Of course with other languages you can have GC without that limitation.
No, unless there's some subtle semantic detail about D that I am missing. This is just not how GCs work. Write barriers, yes. Write barriers that require atomic instructions every time a pointer is written, no.
I meant "memory barriers". Atomics embed memory barriers.
It sounds like you're confusing two types of "write barriers" that happen to share the same name, but have otherwise little in common. Memory barriers for atomic instructions are essential for giving them sane semantics, but have nothing to do with GC. GC write barriers deal with the needs of generational or incremental collections, but are a totally different thing.
Yes, I thought GC write barriers were always implemented with memory-barrier instructions. Probably a trap because the same word is used. Still, I wonder how they are implemented then.
GC write barriers are implemented by expanding sequences that update pointers to perform some sort of additional action.
Their purpose is to capture some information about the updated pointer that the GC can then use to avoid a full heap scan. Card marking marks a 'region' as dirty (such as 128/256/512 bytes of memory). This buffer recording the dirty/clean areas is rescanned as part of the evacuation of the generation being collected. Any pointers to the generation then being collected are updated.
Sequential store buffers (SSBs) can be used to record the address of the object being updated or a pointer to the pointer itself. Again rescanned during collection. SSBs can be easily thread-local avoiding the need for thread synchronisation, except during a collection cycle which already requires thread synchronisation.
Write barriers tend to be optimised heavily as they're used quite frequently. The use of atomic barriers or branch instructions (barring fast-path exits) would inhibit performance.
There are a number of solutions. One of the simplest ones (popularized by the JVM) is card marking , where each memory page (usually shorter than an actual OS page) has a byte associated with it that is set to a non-zero value when a pointer in that segment is written. Because that change is an idempotent operation, it can be done without coordination between threads.
No, for two reasons:
1. Write barriers are not necessary if an object is in a nursery, which is true for most objects per the generational hypothesis. This is trivial to check based on the pointer value.
2. You can have thread local remembered sets.
I read this in http://www.infognition.com/blog/2014/the_real_problem_with_g... better ask the author of the article if you have a way for a better D GC.
> To make fast GC it needs to be generational, but generational GC means write barriers: you can't just freely mutate memory, every pointer mutation must be accompanied by additional bookkeeping.
I guess that in the context of D generations are not possible or practical. If you have any idea why it should be possible, you may write a DIP.
It's possible to do generational collection in D. I've written one before and was careful to design the semantics of D to allow for it. It just needs someone to spend the time to implement it (the core language doesn't need to change).
I'm skeptical, however, if it will offer a real improvement.
> Write barriers are not necessary if an object is in a nursery, which is true for most objects per the generational hypothesis. This is trivial to check based on the pointer value.
Yes, but what happens when it leaves the nursery?
You have to update all pointers to another location.
When it passes to the Gen2 slab all references that point to it need to be updated, and those changes need to propagate to all threads, and all cores. The simplest way to do this is a memory barrier.
Well, the real simplest way is just "never multi-thread", and this is why OCaml's GC is so fast. No write barriers, 1 thread only, final destination.
And 3rd: use forwarding pointers. Every good implementation uses #3.
I thought that they were going to replace the garbage collector in D but I haven't heard anything about it recently. Does anybody know whether that is going to happen?
Ok, we changed the title to use your suggestion. If people object, we can change it again. (But please suggest a better—i.e. more accurate and neutral—title if you do.)
The original title was better. The article not only explains how the GC works, but also goes into detail about how to preallocate or manually allocate memory so that the GC does little to no heavy lifting.
How about let's just go general enough to cover both.
This is not really an article about "How the D garbage collector works", rather an article about "How to avoid garbage in D". More tips and tricks are at: https://bitbucket.org/infognition/dstuff
Here is the better explanation of the GC: https://dlang.org/spec/garbage.html, which is basically a very slow and conservative stop the world Mark & Sweep, which runs at every allocation. Basically a simple version of BoehmGC.
Due to the unfortunate C ABI, a fast copying collector is not really doable. One cannot track all the internal and external pointers. Well, one could, but has to pessimize the locals on every external pointer reference.
Golang isn't exactly mainstream either, and D isn't doomed to never get popular. Let's not speak in such hyperbole. If you like language $X, use language $X, and as long as you can find sufficient resources for it, don't worry about who else is using it (I personally can find sufficient resources for D).
If D had language-integrated fibers like Go does, it would be just better in every aspect: native-fast code, fast compiles, powerful generics and metaprogramming. It was trying to be a better C/C++, but it should have instead pushed where Go is going: highly productive and fast. I think Rust is a much better C replacement -- much higher overhead and mental cost, but for software that can't compromise on performance, reliability, and safety.
I'm not sure that language integrated ones would be better. As D's abstraction abilities have improved, we've been moving features from the core language to the library.
The only thing that I find great about Go is the fact that you can write normal, imperative, blocking-IO-style code dealing with just about anything (network, file, any IO) and get a service that can handle 10k concurrent connections without thinking much about it. If D could do the same, no matter if through language support or a library, it would be really awesome. Note: it seems to me that currently D's fibers require a manual `yield`. The entire point is not to have to do it.
I suggest you go have a look at vibe.d, which is a networking/IO framework. It allows you to write your code as if it's only ever going to handle a single connection and have it magically work(tm) for multiple connections too.
Too bad Dlang never reached the mainstream. It's a perfectly suitable language for the backend programming and systems programming where golang is growing popular.
My company (~50 people) has been using D in production for several years.
We rely on gdc (we measured it generates the fastest code for our usage).
Nearly once per year, a blocking bug sneaks into the compiler and needs to be worked around (crashes on invalid code are easy to avoid; sometimes it's a little more tricky, but we've never been blocked for more than 1 day). The maintainers are very active, though, and it's generally quickly fixed, provided that I report the bug myself.
There's also been some friction when it came to cross-compilation (definitely doable, but not as easy as "apt install mingw-gcc").
In the team, the adoption went well; coming from C++, switching to D feels natural. The syntax is more concise and consistent than C++14 syntax. The standard library handles so much stuff (md5, sending emails, command line parsing, filesystem, process creation) that most of our projects don't depend on any other external library (nor do they depend on any OS-specific stuff). The fact that all variables are initialized by default makes it hard to have non-deterministic behaviour; and the fact that the "buffer" abstraction (aka "Span"/"Slice"/"string_view"/"ptr+len") is in the language makes it really easy to write code which is concise and secure.
C++ has been slowly catching up since 2011; however, why wait?
I do every day. It's pretty awesome, I recommend trying it.
D's CTFE and templates are really really nice to use. I can write very low-level high-performance memory sensitive code, controlling everything, but the code can still be beautiful with great compiler checking.
It can confuse my C++ brain sometimes, though, and then it's great to have someone to consult with (e.g on the D forums).
Quantum Break is a modern Xbox One & PC game and uses D in some subsystems.
Most folks will say that either C++ has improved or use Rust.
(D was a really lovely experience a few years ago, though..)
Most folks haven't tried D.
Since you use it in prod / sell products written in it, could you (briefly) share why you like it and the advantages you see in it? Thank you!
Went from Delphi to OCaml to lots of C++ to D.
Basically I'm happier with the products I wrote in D so I keep using it. It's an almost schizophrenic language that always says "yes" when you ask if something is possible. The community regularly "discovers" new idioms within the language.
Learning D for real is as involved as with any other language, but what people remark on first is the low mental overhead and how absurdly practical it is.
Advantages I see would be: fast to edit, compile, and run. Non-controversial syntax, focus on power, and the fact you kind of already know it.
How good is the tooling? IDE support, code completion and debugging especially.
There are 3 major compiler implementations (dmd, gdc, ldc).
I mostly work with gdc from GNU/Linux, and basically, everything that works with g++ works with gdc.
More precisely, I use, on a daily basis:
- vim + ctags (editor + symbol lookup)
- gdb (debugger)
- gcov (code coverage)
- valgrind (mostly callgrind (call-graph) and massif (memory profiler))
- oprofile (cpu profiler)
Unfortunately there's no ccache equivalent yet, and I'm missing it.
For the Windows people, there's VisualD, which basically is the integration in Visual Studio. At this point, you have an IDE and a debugger (I don't know about code completion).
There's also a build tool (with package-managing abilities) called "dub"; it handles fetching and building external dependencies (in the vein of npm).
I'll need to check out VisualD.
Someone should do a Language Server Protocol (http://langserver.org/) implementation for D - that's clearly the way forward. Right now this will give you VSCode, Eclipse and NeoVim, with regular Vim and Emacs on the way.
VSCode also has something similar for debugging (https://code.visualstudio.com/docs/extensionAPI/api-debuggin...), but it hasn't been standardized in a similar fashion (yet?).
I use the debugger in VisualD on Windows and LDC on macOS. I guess you can also use the one in Visual Studio Code.
There is ongoing work with the LSP actually, since the VSC integration uses something similar already.
BTW, I tried VisualD, and it seems that its code completion support is rudimentary. I could get it to complete functions sometimes, and I couldn't get it to complete members at all. Obviously, no refactoring either. It's unfortunate, but that would be a blocker for me.
Oh, what's the language feature parity between compilers? Are things as bad as in C++ land in that regard, or is everything reasonably close, without portability headaches when moving code between compilers?
LDC and DMD share basically the same frontend, so it is rare to need a `version(LDC)`. GDC is a bit behind atm, but the language and library are really stable. Usually you don't change anything at all between compilers and OSes. The most notable exceptions are the SIMD capabilities, and inline assembly is different on GDC.
The stdlib covers a lot of ground so portability across OSes is much easier than in you-know-which-language, so small tools tend to blossom.
or use Go
Doesn't Facebook use D internally?
Andrei left fb in 2015 - probably due to their lack of commitment to D & his F stocks doing well.
Andrei works full time on the D language/D foundation. To my knowledge, his decision to leave Facebook had something to do with his family and his desire to work on D full time, not any problems with his employer.
They wrote a C/C++ preprocessor in D, but this now appears to be abandoned.
That's too bad. I guess OCaml is the non-mainstream language Facebook seems to be interested in.
One thing D and Nim are lacking is adoption by a large company, and I feel that's partly what's keeping them behind Go and Rust.
Our data team at Emsi (www.economicmodeling.com) uses it in production every day. We are a labor-market data-viz company with a multitude of different sources of data, most of which are public.
I used it for a while, ended up switching to Haskell, which had a lot of similar benefits despite being so different.
If you're going to use it, I highly recommend the D Programming Language (book). It's extremely well written, and remains one of my favourite technical books.
You'll be okay.
Has anyone here done any production-level stuff with D? I'm debating using it for my next project, but I want to see if it's industry-tested.
> what are the advantages of running GC only when allocations are happening, instead of running concurrently?
That way you don't need a dedicated thread for the GC (the current one is hijacked).
> Is it burdensome for the programmer to constantly think about this while writing code?
I sell real-time software in D, and you don't constantly think about this when writing code. It's no more burden than avoiding allocations in real-time software in C++.
But avoiding allocations in real-time C++ is a burden to some (increasingly common) kinds of programmers.
The kids today are so indoctrinated in modern C++ that they don't stop to think that `vector<>::push_back()` potentially allocates behind the scenes.
Writing real-time code requires a certain mentality that is at odds with modern frameworks. I expect nice languages like D have similar cultural issues.
If you need a feel of what the community values (apart from rants), have a look at the forums. https://forum.dlang.org/
D is backed by companies like Sociomantic and Weka.io that operate in high-performance domains.
FWIW, D's @nogc makes it easy to trap code that allocates accidentally.
what are the advantages of running GC only when allocations are happening, instead of running concurrently?
Is the idea that you can explicitly control the behavior of GC by tuning your allocations? Is it burdensome for the programmer to constantly think about this while writing code?
If you are interested in current D memory management situation, you might want to look at this as well: https://dlang.org/phobos/std_experimental_allocator.html