Discovering a JDK Race Condition, and Debugging It in 30 Minutes with Fray

(aoli.al)

157 points

by: aoli-al

11 days ago

52 comments

exabrial

10 days ago

> Fray is a concurrency testing tool for Java that can help you find and debug tricky race conditions that manifest as assertion violations, run-time exceptions, or deadlocks. It performs controlled concurrency testing using state-of-the-art techniques such as probabilistic concurrency testing or partial order sampling.
> Fray also provides deterministic replay capabilities for debugging specific thread interleavings. Fray is designed to be easy to use and can be integrated into existing testing frameworks.
I wish I had this 20 years ago.
ryeats

10 hours ago

This is really cool; I want to use it for deterministic simulation testing. Question: isn't shadow locking essentially the same as continuations in virtual threads? How does this compare to replacing the scheduler in virtual threads? See https://jbaker.io/2022/05/09/project-loom-for-distributed-sy...
MaxBarraclough

10 days ago

Neat to see sleep calls artificially introduced to reliably recreate the deadlock. [0]
Looks like fixing the underlying bug is still in progress [1]. I wonder how many lines of code it will take.
[0] https://github.com/aoli-al/jdk/commit/625420ba82d2b0ebac24d9...
[1] https://bugs.openjdk.org/browse/JDK-8358601
trhway

10 days ago

<@MaxBarraclough> Without reworking the code, all these checks of the executor and queue state, and the queue manipulations, have to be under a mutex, and that is just a few lines.
brabel

10 days ago

<@MaxBarraclough> Bugs like these are pervasive in languages like Java that give no protection against even the most basic causes of race conditions. It’s nearly impossible to write reliable concurrent code. Fray only helps if you actually use it to test everything, which is not realistic. I am convinced, after my last year-long struggle to get a highly concurrent Java (actually Kotlin, but Kotlin does not add much to help) module working at work, that where concurrency is required we should only use languages that provide safe concurrency models, like Erlang/Elixir and Rust, or actor-like ones such as Dart and JavaScript.
gf000

10 days ago

<@brabel> What is a safe concurrency model? Like, actors can trivially deadlock/livelock, they are no panacea at all, and are trivial to recreate (there are a million Java implementations).
You make it sound like there is some modern development superseding what Java has, but that's absolutely not the case.
Like even Rust is just pretty much a no-overhead `synchronized` on top of an object. It is necessary there, because data races are a fundamental memory safety issue, but Java is immune to that (it has "safe" data races). Logical bugs can trivially happen in either case - as an easy example, even if all your fields are atomically mutated, the whole object may not make sense in certain states, like a date with February the 31st. Rust does nothing against such bugs, and concurrent data structures have ample grounds for realistic examples of the above.
mrkeen

10 days ago

<@gf000> > What is a safe concurrency model?
STM.
The terms 'atomic', 'thread-safe', and 'concurrent' collections are thrown around too loosely for application programmers IMO, for exactly your example above.
In other scenarios, 'atomics' refer to the ability to do one thing atomically. With STM, you can do two or more things atomically.
Likewise with 'thread-safe'. Thread-safe seems to indicate that the object won't break internally in the presence of multiple threads, which is too low of a bar to clear if your goal is to write an actually thread-safe application out of so-called 'thread-safe' parts.
STM has actual concurrent data structures, where you can write straight-line code like 'if this collection has at least 5 elements, then pop one'.
I don't think the Feb 31 example is that fair though, because if you want to construct a representation of Feb 31, who's going to stop you? And if you don't want to, plain old static types is the solution.
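The "at least 5 elements, then pop one" case can be sketched in Java. This is not STM: it is a minimal lock-based illustration of why the size check and the pop have to form one atomic step. The names `GuardedStack`, `popIfAtLeast`, and `GuardedStackDemo` are hypothetical and not from the thread.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;

// Hypothetical illustration: a stack whose "pop only if at least `min`
// elements are present" rule is enforced as a single atomic step.
final class GuardedStack<T> {
    private final Deque<T> items = new ArrayDeque<>();

    synchronized void push(T item) {
        items.push(item);
    }

    // The size check and the pop run under the same lock, so no other
    // thread can shrink the stack between the test and the removal.
    synchronized Optional<T> popIfAtLeast(int min) {
        if (items.size() >= min) {
            return Optional.of(items.pop());
        }
        return Optional.empty();
    }

    synchronized int size() {
        return items.size();
    }
}

final class GuardedStackDemo {
    // Fills the stack, lets several workers pop until the rule refuses,
    // and returns the final size.
    static int drain(int initial, int threads, int min) {
        GuardedStack<Integer> stack = new GuardedStack<>();
        for (int i = 0; i < initial; i++) {
            stack.push(i);
        }
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                while (stack.popIfAtLeast(min).isPresent()) {
                    // keep popping while the rule allows it
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return stack.size();
    }

    public static void main(String[] args) {
        System.out.println(drain(1000, 8, 5)); // prints 4
    }
}
```

If the check and the pop were two separate calls on a merely "thread-safe" collection, two workers could both see size 5 and both pop, leaving 3 elements. An STM would express the same rule without naming a lock; the sketch only shows the atomicity requirement.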
gf000

10 days ago

<@mrkeen> I couldn't give a better reply than this author:
https://joeduffyblog.com/2010/01/03/a-brief-retrospective-on...
Also, it is phenomenal writing (as are his other posts) on the whole concurrency landscape, see:
> A wondrous property of concurrent programming is the sheer number and diversity of programming models developed over the years. Actors, message-passing, data parallel, auto-vectorization, …; the titles roll off the tongue, and yet none dominates and pervades. In fact, concurrent programming is a multi-dimensional space with a vast number of worthy points along its many axes.
mrkeen

10 days ago

<@gf000> I've read a few postmortems about STM. I have to take them with a grain of salt because I usually read those reports right after doing a bunch of STM programming, and right before doing a bunch more STM programming. Reports of its death have been greatly exaggerated.
Here it is in 2006 featuring the same Tim from your article: https://www.youtube.com/watch?v=tve57vilywc
I didn't start using it in anger till 2013-2014 maybe? But I don't recall any major differences between what the video shows and how it works in 2025.
Anyway, postmortems usually boil down to two issues:
1) That's not how programmers usually do it
2) We couldn't pull it off
The most obvious explanation for 1 is 2. I, too, would be disappointed by the low-adoption rates of my new technology if I hadn't built it or released it to users.
But the article has some gems:
> Transactions unfortunately do not address one other issue, which turns out to be the most fundamental of all: sharing. Indeed, TM is insufficient – indeed, even dangerous – on its own because it makes it very easy to share data and access it from multiple threads;
I cannot read this charitably. This is the only reason for it, not a damning reason against it. It's like doing research & development on condoms, and then realising it's a hopeless failure because they might be used for dangerous activities like sex.
> I already mentioned a great virtue of transactions is their ability to nest. But I neglected to say how this works. And in fact when we began, we only recognized one form of nesting. You’re in one atomic block and then enter into another one. What happens if that inner transaction commits or rolls back, before the fate of the outer transaction is known
You nest transactional statements, not the calls to atomic. The happy-path for an atomic is that it will commit; it should be obvious a priori that something that commits cannot be in the codepath that can be rolled back.
> Then that same intern’s casual statement pointing out an Earth-shattering flaw that would threaten the kind of TM we (and most of the industry at the time) were building. ... An update in-place system will allow that transaction to freely change the state of x. Of course, it will roll back here, because isItOwned changed to true. But by then it is too late: the other thread using x outside of a transaction will see constantly changing state – torn reads even – and who knows what will happen from there. A known flaw in any weakly atomic, update in-place TM. If this example appears contrived, it’s not. It shows up in many circumstances.
I agree that it's not contrived. It's in the problem-space of application writers. It's not a problem caused by introducing STM. We want an STM system to allow safe access to isItOwned & x, because it's a PITA to try to do this with locks.
gf000

10 days ago

<@mrkeen> `atomic` is their choice of syntax for an STM transaction in their experimental C# runtime, it's not an atomic statement. Please take the time to actually read the article, because you have obviously just skimmed over it. This was not written by some nobody, he does know what he talks about.
mrkeen

9 days ago

<@gf000> Argue the point, not the person.
Look ma, no skimming: https://news.ycombinator.com/item?id=37647230
gf000

9 days ago

<@mrkeen> There is not much to argue, when your point is based on a misunderstanding.
> You nest transactional statements, not the calls to atomic. The happy-path for an atomic is that it will commit; it should be obvious a priori that something that commits cannot be in the codepath that can be rolled back.
This makes absolutely no sense with my above correction.
tialaramex

10 days ago

<@gf000> > the whole object may not make sense in certain states
"Make invalid states unrepresentable" - it's bad design that February the 31st is a thing in your data structure when that's invalid. You can't always avoid this, but it's appalling how bad most people's data structures are.
C's stdlib provides a tm structure in which day of the week is stored in a signed 32-bit integer. You know, for when it's the negative two billionth day of the week...
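For comparison, Java's `java.time.LocalDate` rejects impossible dates at construction time. The sketch below is only an illustration of that behaviour; `ValidDateDemo` and `isRepresentable` are hypothetical names, and this is a run-time check rather than a type-level guarantee.

```java
import java.time.DateTimeException;
import java.time.LocalDate;

final class ValidDateDemo {
    // Returns true if (year, month, day) names a real calendar date.
    static boolean isRepresentable(int year, int month, int day) {
        try {
            LocalDate.of(year, month, day); // throws for impossible dates
            return true;
        } catch (DateTimeException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isRepresentable(2025, 2, 28)); // prints true
        System.out.println(isRepresentable(2025, 2, 31)); // prints false
        System.out.println(isRepresentable(2024, 2, 29)); // prints true (leap year)
    }
}
```

Because a `LocalDate` is immutable and can only be obtained through validating factories, no instance ever holds February 31st.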
gf000

10 days ago

<@tialaramex> This is more of a toy example for how a set of atomic changes can still end up in an inconsistent state, e.g. setting January the 31st and February 3rd in quick succession from two or more different threads may result in Feb 31st being visible from a third thread. This is not solved by Rust and your struct will even get the Sync trait automatically, which may not be applicable, as in this case.
brabel

10 days ago

<@gf000> Given your example, I am convinced you've never written any Rust. Of course it does stop you doing shit like that. But in this example, even Java does it properly, since the constructor runs to completion before any Object is accessible to any Thread, not just the one creating it. You need to validate the state of the object in the constructor to prevent that, but TBH why are we talking about this, it's almost completely unrelated to concurrency models.
gf000

10 days ago

<@brabel> Of course if you are creating a new object and you have an atomic handle to it, it is trivial to solve. Like, having immutable objects solves a lot of these problems.
But what I'm quite obviously talking about is a Rust struct with 3 atomic fields. Just because I can safely race on any of its fields, doesn't mean that the whole struct can safely be shared, yet it will be inferred to be Sync.
tialaramex

10 days ago

<@gf000> Object mutability isn't relevant here. A Date type which is mutable can ensure that all mutations are valid, it just can't do so while retaining this clumsy "LOL I'm just a D-M-Y tuple" API.
We can see immediately that your type is broken because it allows us to directly set the date to February 31st, there's no concurrency bug needed, the type was always defective.
gf000

10 days ago

<@tialaramex>
    void setDate(int month, int day) {
        if (notValidDate(month, day)) { throw; }
        this.month = month; // atomic
        this.day = day;     // atomic
    }
Yet the whole function is not "atomic"/transactional/consistent, and two threads running simultaneously may surface the above error.
Of course it can ensure that it is consistent, C code can also just ensure that it is memory safe. This is just not an inherent property, and in general you will mess it up.
The only difference is that we can reliably solve memory safety issues (GC, Rust's ownership model), but we have absolutely no way to solve concurrency issues in any model. The only solution is... having a single thread.
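For this particular toy case, one common Java workaround is to publish month and day together as a single immutable value. The sketch below assumes hypothetical classes `SnapshotDate` and `SnapshotDateDemo`; it uses the real `java.time.MonthDay` and `AtomicReference` APIs, and it covers only this single-object invariant, not the general problem described above.

```java
import java.time.MonthDay;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: month and day travel together in one immutable
// MonthDay, so a reader cannot pair one writer's month with another's day.
final class SnapshotDate {
    private final AtomicReference<MonthDay> current =
            new AtomicReference<>(MonthDay.of(1, 1));

    void setDate(int month, int day) {
        // MonthDay.of validates the pair (it throws DateTimeException for
        // February 31st) before the single atomic publish.
        current.set(MonthDay.of(month, day));
    }

    MonthDay get() {
        return current.get();
    }
}

final class SnapshotDateDemo {
    // Two writers alternate between January 31st and February 3rd while a
    // reader checks that every observed value is one of those two dates.
    static boolean onlyWholeDatesObserved(int iterations) {
        SnapshotDate date = new SnapshotDate();
        date.setDate(1, 31);
        MonthDay jan31 = MonthDay.of(1, 31);
        MonthDay feb3 = MonthDay.of(2, 3);
        AtomicBoolean ok = new AtomicBoolean(true);
        Thread[] threads = {
            new Thread(() -> {
                for (int i = 0; i < iterations; i++) date.setDate(1, 31);
            }),
            new Thread(() -> {
                for (int i = 0; i < iterations; i++) date.setDate(2, 3);
            }),
            new Thread(() -> {
                for (int i = 0; i < iterations; i++) {
                    MonthDay seen = date.get();
                    if (!seen.equals(jan31) && !seen.equals(feb3)) ok.set(false);
                }
            })
        };
        for (Thread t : threads) t.start();
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return ok.get();
    }

    public static void main(String[] args) {
        System.out.println(onlyWholeDatesObserved(100_000)); // prints true
    }
}
```

With two separately atomic fields, the reader could combine month 2 from one writer with day 31 from the other. Collapsing the state into one reference removes that interleaving, at the cost of allocating a new value per update.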
tialaramex

9 days ago

<@gf000> But you were critiquing Rust's model, yet you've written C++ here. I agree it's perfectly easy to write the bug in C++.
In Rust this improved type doesn't have the defect, to call Rust's analogue of your setDate function you must have the exclusive mutable reference, which means there's no concurrency problem.
You have to do a whole lot of extra work to write the bug and why would you, just write what you meant and it behaves correctly.
gf000

9 days ago

<@tialaramex> It's called pseudo-code, and some extra attempt on your part to deliberately miss the point.
Give it another go at understanding what I'm saying, cheers!
nlitened

10 days ago

<@tialaramex> > “Make invalid states unrepresentable”
I think this phrase sounds good but is not applicable to systems that touch messy reality.
For example, I think it’s not even possible to apply it to the `tm` structure, as leap seconds are not known in advance.
tialaramex

10 days ago

<@nlitened> I agree that messy reality can intervene; in the medium term (for about a decade) we'll need to handle leap seconds.
But we can do a lot without challenging the messy reality. 61 second minutes are (regrettably) a thing in some time systems, but negative 1 million second minutes are not a thing, there's no need for this to be a signed integer!
kbolino

10 days ago

<@tialaramex> The struct is also used for date/time arithmetic and the standard library explicitly supports out-of-range values for this reason.
tialaramex

9 days ago

<@kbolino> I have no doubt that C "explicitly supports" this, but it's a bad idea.
The C standard library has the excuse that most of it is very old. We should do better.
kbolino

9 days ago

<@tialaramex> Better for whom? If you want a dead-simple time type, use time_t.
There are plenty of improvements needed in the C time APIs, like sub-second precision, thread safety, and timezone awareness. What benefit is there to making the struct fields unsigned beyond some arbitrary purity test? This is still C, there are still plenty of ways to make invalid values. And it is nice to be able to subtract as well as add.
Heck, there's no way to encode the full Gregorian Calendar rules in the type system of any language I've ever used, so every choice is going to be a compromise. February 29 Not-A-Leap-Year and April 31 are still invalid dates even if you can outlaw January 0 and March 32.
Making all the fields in struct tm signed ints is clearly there to allow them to be manipulated and consistently so, since other types would obviously be better for size if nothing else.
brabel

10 days ago

<@gf000> > Like, actors can trivially deadlock/livelock,
Oh my ... you've never seen a proper Actor language, have you?
Have a look at Erlang and Pony, for starters. It will open your mind.
This in particular is great: https://www.ponylang.io/discover/what-makes-pony-different/#...
> Pony doesn’t have locks nor atomic operations or anything like that. Instead, the type system ensures at compile time that your concurrent program can never have data races. So you can write highly concurrent code and never get it wrong.
This is what I am talking about.
> You make it sound like there is some modern development superseding what java has, but that's absolutely not the case.
Both Actor-model languages and Rust (through a surprisingly different path: tracking aliases and lifetimes) do something that's impossible in Java (and most languages): prevent data races due to improper locking. As mentioned above, if your language even has locks and it doesn't make them safe like Rust does, you know you're going to have a really hard time; actor languages just eliminate locks, and "manual concurrency", completely. Other kinds of races are still possible, but preventing data races goes a very, very long way to making concurrency safe and easy.
gf000

10 days ago

<@brabel> Is preventing data races (which is not particularly hard if you are willing to give up certain properties, e.g. just immutability alone solves it) that much of a win?
You just made a bunch of concurrent algorithms un-implementable that would give much better performance for the benefit of.. having all the other unsolvable issues with concurrency? Like, all the same issues are trivially reproducible at a higher level, with loops within actors' communication that only appear under certain, very dynamic conditions, or a bunch of message passing ending up in an inconsistent state, just not on an "object" level, but on a "group of object" level.
brabel

9 days ago

<@gf000> > much of a win?
It's a huge win. Absolutely game changing.
> You just made a bunch of concurrent algorithms un-implementable
Exactly! That's a good thing! You think you need those buggy algorithms, you just don't, at least in 99% of cases.
Yes, you can still end up with inconsistencies when you perform actions without the necessary checks, but those cases that remain are extremely easy to find and fix (and even make completely impossible by design), when compared to the horrors of mutable state with locks.
jbritton

10 days ago

<@brabel> Perhaps there is some confusion here between data races and race conditions. Rust and Pony prevent data races, but not race conditions.
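The distinction can be shown in a few lines of Java. In this hedged sketch (the `Balance` and `BalanceDemo` classes are hypothetical), `withdrawRacy` contains no data race because every access goes through an `AtomicInteger`, yet it still has a race condition: another thread can withdraw between the check and the update. `withdraw` fuses the two with a compare-and-set retry loop.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class Balance {
    private final AtomicInteger value;

    Balance(int initial) {
        value = new AtomicInteger(initial);
    }

    // Race condition without a data race: both accesses are atomic, but
    // the check and the update are separate steps, so the balance can
    // go negative under contention.
    boolean withdrawRacy(int amount) {
        if (value.get() >= amount) {
            value.addAndGet(-amount);
            return true;
        }
        return false;
    }

    // Check and update succeed or fail together; retry if another
    // thread changed the balance in between.
    boolean withdraw(int amount) {
        while (true) {
            int seen = value.get();
            if (seen < amount) {
                return false;
            }
            if (value.compareAndSet(seen, seen - amount)) {
                return true;
            }
        }
    }

    int get() {
        return value.get();
    }
}

final class BalanceDemo {
    // Several workers each attempt many withdrawals of 1 using the
    // compare-and-set version; returns the final balance.
    static int drain(int initial, int threads, int attemptsPerThread) {
        Balance balance = new Balance(initial);
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < attemptsPerThread; i++) {
                    balance.withdraw(1);
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return balance.get();
    }

    public static void main(String[] args) {
        System.out.println(drain(100, 8, 1000)); // prints 0
    }
}
```

A language that rules out data races would accept both methods; only the second preserves the "never negative" invariant.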
brabel

9 days ago

<@jbritton> There's no confusion.
rand_r

10 days ago

<@brabel> Race conditions are generally solved with algorithms, not the language. For example, defining a total ordering on locks and only acquiring locks in that order to prevent deadlock.
I guess there are language features like co-routines/co-operative multi-tasking that make certain algorithms possible, but nothing about Java prevents implementing sound concurrency algorithms in general.
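The lock-ordering idea can be sketched in Java as follows. This is an illustration under assumptions: `OrderedAccount` and `Transfers` are hypothetical names, and each account is assumed to carry a unique `id` that defines the global lock order.

```java
import java.util.concurrent.locks.ReentrantLock;

final class OrderedAccount {
    final long id; // assumed unique; defines the global lock order
    final ReentrantLock lock = new ReentrantLock();
    long balance;

    OrderedAccount(long id, long balance) {
        this.id = id;
        this.balance = balance;
    }
}

final class Transfers {
    // Always lock the account with the smaller id first, whatever the
    // direction of the transfer, so two opposite transfers cannot each
    // hold one lock while waiting for the other.
    static void transfer(OrderedAccount from, OrderedAccount to, long amount) {
        OrderedAccount first = from.id < to.id ? from : to;
        OrderedAccount second = first == from ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balance -= amount;
                to.balance += amount;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }

    // Runs transfers in both directions concurrently and returns the
    // combined balance, which should be unchanged.
    static long runOpposingTransfers(int rounds) {
        OrderedAccount a = new OrderedAccount(1, 1000);
        OrderedAccount b = new OrderedAccount(2, 1000);
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) transfer(a, b, 1);
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < rounds; i++) transfer(b, a, 1);
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a.balance + b.balance;
    }

    public static void main(String[] args) {
        System.out.println(runOpposingTransfers(100_000)); // prints 2000
    }
}
```

If each thread instead locked `from` before `to`, the two opposite transfers could each take one lock and wait forever for the other. The ordering rule is a convention the code has to follow everywhere; nothing in the language enforces it.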
mrkeen

10 days ago

<@rand_r> > Race conditions are generally solved with algorithms, not the language. For example, defining a total ordering on locks
You wouldn't make that claim if your language didn't have locks.
brabel

10 days ago

<@mrkeen> Exactly, this thread is full of ignorant comments. I was talking about a certain class of race conditions that can be completely prevented in some languages, like Rust (through its aliasing rules that just make it impossible to mutate things from different threads simultaneously, among other things) and languages like Pony, for example, as the language uses the Actor model for concurrency, which means it has no locks at all (it doesn't need them), though I mentioned Dart because Dart Isolates look a lot like Actors (they are single-threaded but can send messages and receive messages from other "actors", similarly to JS workers).
gf000

9 days ago

<@brabel> In Java, racing a field is safe - you can only ever observe the value as one that was explicitly set by a thread, no tearing can happen. Safe data races can happen in Java, but you sometimes do want that (e.g. efficient concurrent algorithms), and avoiding it is not particularly hard (synchronized blocks are not the state of the art, but do make it easy to solve a problem).
Pony and Rust are both very interesting languages, but it is absolutely trivial to re-introduce locks with actors, even just accidentally, and then you are back at square 1. This is what you have to understand, their fundamental model has a one-to-one mapping to "traditional" multi-threading with locks. The same way you can't avoid the Turing model's gotchas, actors and stuff won't fundamentally change the landscape either.
mrkeen

9 days ago

<@gf000> > avoiding it is not particularly hard (synchronized blocks are not the state of the art, but do make it easy to solve a problem).
Please have a read of https://joeduffyblog.com/2010/01/03/a-brief-retrospective-on... (and don't just skim it.)
(This was not written by some nobody, he does know what he talks about.)
> Contrast this elegant simplicity with the many pitfalls of locks:
> Data races. Like forgetting to hold a lock when accessing a certain piece of data. And other flavors of data races, such as holding the wrong lock when accessing a certain piece of data. Not only do these issues not exist, but the solution is not to add countless annotations associating locks with the data they protect; instead, you declare the scope of atomicity, and the rest is automatic.
> Reentrancy. Locks don’t compose. Reentrancy and true recursive acquires are blurred together. If a locked region expects reentrancy, usually due to planned recursion, life is good; if it doesn’t, life is bad. This often manifests as virtual calls that reenter the calling subsystem while invariants remain broken due to a partial state transition. At that point, you’re hosed.
> Performance. The tension between fine-grained locking (better scalability) versus coarse-grained locking (simplicity and superior performance due to fewer lock acquire/release calls) is ever-present. This tension tugs on the cords of correctness, because if a lock is not held for long enough, other threads may be able to access data while invariants are still broken. Scalability pulls you to engage in a delicate tip-toe right up to the edge of the cliff.
> Deadlocks. This one needs no explanation.
gf000

9 days ago

<@mrkeen> Nice "gotcha".
But STM doesn't solve e.g. deadlocks - there are automatisms that can detect them and choose a different retry mechanism to deal with them (see the linked article), but my general point you really want to ignore is that none of these are silver bullets.
Concurrency is hard.
rand_r

9 days ago

<@mrkeen> Not sure what you mean!? Locks, at their core, are not implemented by languages. They’re a feature of a task runtime, e.g. Postgres advisory locks or kernel locks in a POSIX OS.
anorwell

9 days ago

This actually intersects with two of my current interests. We have, in production, rarely been seeing ThreadPoolExecutor hangs (JDK17) during shutdown. After a lot of debugging, I've been suspecting more and more that it may be an actual JDK issue. But, this type of issue is extremely hard to reason about in production, and I've never successfully reproduced it locally. (It's not clear to me that it's the same issue as in the post, since it's not a scheduled executor.)
Separately, we're looking at using Fray for concurrency property testing, as a way to reliably catch concurrency issues in a distributed system by simulating it within a single JVM.
latchkey

11 days ago

Maybe it is just me, but I can't read the text in the code because the font is nearly white on white.
masklinn

11 days ago

<@latchkey> The light mode is fine, but you're right the dark mode is truly awful, the code blocks are unreadable.
edit: for some reason the author overrode the background color on code blocks via an inline style of
```
    background-color:#f0f0f0
```
from
```
    var(--code-background-color) = #f2f2f2
```
to make the background nigh imperceptibly darker, but then while the stylesheet properly switches it to #01242e in dark mode, the inline override stays and blows it to bits.
Not that it's amazing if you remove the inline style, on account of operators and method names being styled pretty dark (#666 and #4070a0).
aoli-al

11 days ago

<@masklinn> Thanks for pointing it out! Just did a quick fix using Claude :)
malcolmgreaves

11 days ago

<@aoli-al> On mobile (Safari), the lines in the code blocks have different font sizes. They also have different fonts. Some are like 3-4x the size of other lines. No idea what could be going wrong, but it does unfortunately make the code blocks difficult to follow along.
aoli-al

10 days ago

<@malcolmgreaves> should be fixed as well :)
NooneAtAll3

10 days ago

<@aoli-al> Any chance you can make the light/dark mode switch a UI button?
masklinn

10 days ago

<@NooneAtAll3> On desktop I’d suggest installing an extension that adds a toggle (they exist for Firefox and Chrome at least): adding a toggle manually is a bit of a chore, especially if the CSS system you use does not build that in.
AugustoCAS

10 days ago

[posted this in another thread, but maybe the author can clarify this]
I wonder how this works when one runs tests in parallel (something I always enable in any project). By this I mean configuring JUnit to run as many tests as cores are available to speed up the run of the whole test suite.
I took a peek at the code and I have the impression it doesn't work that well as it hooks into when a thread is started. Also, I'm not sure if this works with fibers.
aoli-al

10 days ago

<@AugustoCAS> Yes, Fray controls all application threads, so it runs one test per JVM. But you can always use multiple JVMs to run multiple tests [1].
Fray currently does not support virtual threads. We do have an open issue tracking it, but it is low priority.
[1]: https://docs.gradle.org/current/userguide/java_testing.html#...
delusional

10 days ago

You appear to be one of the authors, so forgive me asking a technical question.
In the technical paper, Section 5.4, you mention that Kotlin has non-determinism in the scheduler. Where does this non-determinism come from?
It seems unclear to me why Kotlin would inject randomness here, and I suspect that you may actually have identified a false positive in the Lincheck DSL.
aoli-al

10 days ago

<@delusional> The "randomness" comes from Kotlin coroutines and user-space scheduling. For example, Kotlin runs multiple user-space threads on the same physical thread. Fray only reschedules physical threads. So when the applications under test use coroutines or virtual threads, Fray cannot generate certain thread interleavings. Also, it cannot deterministically replay them because the thread execution is no longer controlled by Fray.
In our paper, we found that Fray suffers from false negatives because of this missing feature. Lincheck supports Kotlin coroutines so it finds one more bug than Fray in LC-Bench.
We didn't make any claims about false positives in Lincheck.
delusional

10 days ago

<@aoli-al> > We didn't make any claims about false positives in Lincheck.
To be clear, I made that claim :) I agree that the paper makes no such claim.
herrDerb

8 days ago

In the bug report you state that this bug starts with JDK 23. Could it be that it also affects earlier versions? I am asking because we see similar behavior with 17 & 21 which we can't really explain.
TYMorningCoffee

10 days ago

Impressive! Can't wait to try Fray out at work.