Reading the Tea Leaves: Expert End-Users Explaining the Unexplainable


Tasseography[1] is the divination art of tea leaf reading, in which a practitioner (a tasseographer) claims to glean information about the future by decoding the seemingly random arrangement of tea leaves at the bottom of a cup. Tasseography comes from a time when mathematical models were reserved for celestial bodies and sea travel, so tea leaves were a necessary stand-in for every other kind of prediction.

Modern-day end-users of machine learning systems find themselves in a predicament similar to the tasseographer's. We have access to a collection of new inputs (clients wishing to predict the future), and we generate the collection of corresponding model outputs (the arrangements of tea leaves). Often these outputs have intrinsic value by virtue of being accurate predictions; e.g., a magical stock trading algorithm that outperforms the market year after year will be used whether or not it can explain its predictions. However, less-than-perfect predictions – as with less-than-perfect tea leaf readings – demand an answer to "Why?" We need that answer for accountability, but also because it may contain information about how to improve the model or the data used to train it. This is the motivation for the study of explainable artificial intelligence (XAI),[2] which is as old as the approaches it seeks to explain, yet strikingly absent in many application domains. As a result, domain experts act as tasseographers, reading the tea leaves of their models' results to explain failures, analyze counterintuitive trends, and reinvent their own understanding of the problem.

This article uses the common backdrop of competitive games to explore the ways in which domain experts adapt to new technologies that lack explainability. I illustrate how interpretations vary based on user experience and model architecture, and how special care must be taken when adapting models to human-centric problems.

Games, Napoleon, and intelligence

For the same reasons that competitive games are useful model systems for AI research, they are also interesting targets for understanding how humans themselves interact with AI. Game-playing models are among the oldest attempts at producing a machine with "superhuman" performance at an intellectual human task, and among the earliest to prove successful. Obsessions with such systems are longstanding: as early as 1770, the famed Mechanical Turk appeared,[3] a chess-playing automaton that defeated the likes of Benjamin Franklin and Napoleon Bonaparte. The Turk lost to the unofficial then-world champion François-André Danican Philidor.[4] The automaton was of course a man in a box, but even the illusion of facing a capable AI caused Philidor to remark that the game was especially fatiguing.

With the advent of computers, Turing and countless subsequent programmers continued the tradition of formulating game-playing systems,[5][6] highlighted in recent history by IBM's Deep Blue,[7] followed by AlphaGo and later successes from DeepMind.[8][9][10] Indeed, the AlphaGo and AlphaZero innovations from DeepMind were newsworthy successes both as technical feats and for striking at the core of the public perception of intelligence: rapidly evolving from random play to expert play in games that are intricate by design. Even if the games themselves are not of particular interest to an individual, solving them is symbolic of changes to come in grander human challenges.

The world’s leading tasseographer

A frequently forgotten impact of these AI systems is their influence on the domain experts themselves: the top professional players of these games. These players have everything to gain from understanding the strategies, goals, diet, and love lives of the models that embarrass them at the board. Surely these experts' approaches will be symbolic, even predictive, of the ways in which experts will approach AI in other domains. Evidently, DeepMind realizes this and is keen on developing explainable AI methods for games in anticipation of new applications.[11]

When AlphaZero defeated the incumbent "hand-designed" chess engine Stockfish in a match, DeepMind invited chess Grandmaster (GM) Matthew Sadler and Woman International Master Natasha Regan to analyze its games. The visit became the topic of the book Game Changer,[12] in which the two develop a taxonomy of patterns that they claim make this new approach fundamentally different from previous engines. In subsequent years, GM Sadler has written seasonal perspectives on the Top Chess Engine Championship (TCEC), much in the style of a sports column on a championship boxing match.[13]

In my estimation, Sadler’s unique circumstances have made him the preeminent neural network tasseographer. He is a top member of a rare discipline where the verifiably best solution is given by a “black box” neural network, and he has used his talents to untangle the solution in a way understandable to us mortals. In January, I interviewed GM Sadler to understand how he and others approach AI interaction in his field and to find commonalities that may connect that model system of chess to the broader community of end-users of ML systems.

Skill dependence

One feature of tasseography is the dependence of the result on the skill of the user. An everyday example is Google search, where users understand they are interacting with some faux-intelligent system and tailor their searches to optimize the results. New or casual users are tempted to query the model with plain language, as if they're conversing with a friend: How do I change the oil in my Civic by myself? By contrast, an experienced user queries only in keyword hieroglyphics: youtube oil change civic. An especially challenging search might warrant special search operators: nike basketball -Kyrie $100..$200. Google knew this about its users and launched the Hummingbird algorithm update in 2013, favoring natural language queries.[14] Nonetheless, the rare search with no matches, like the one I recently encountered in the figure below, can still induce Google to recommend the hieroglyphic approach. In chess too, GM Sadler argues that asking the right questions of the model is deeply important.

DM: What is the state of explainability of AI systems in chess?
MS: Obviously it’s one of the holy grails of online chess, getting the engine analysis to users in a way that can actually help them. There have been some attempts for online chess to do something like that (DecodeChess for example)...

DecodeChess provides natural language interpretations of the outputs of the Stockfish chess engine.[15][16]

…The slight problem is that it's hit-or-miss. What you get with DecodeChess is a mixture of a very good interesting insight and about fifty other insights that are not at all relevant. Somehow I think, in terms of chess, engines are always capable of bringing out the key points of a position but never the actual key point itself. That’s what the stuff I do is about – I run multiple engine games in multiple scenarios; try this, try that, and with that mass of data I extract “Ah ok, here you have to exchange off queens, here you have to exchange rooks,” and deducing from that. It’s getting from engines what you would get from a human Grandmaster in human language in a given position, that’s the problem. For clear tactical blunders, it works very well, there are a lot of online tools where it analyzes a game in a few seconds, your blunders are flagged, and that’s enough for very weak players.

Therein is the crux of skill dependence – the quality of the XAI must track the aptitude of the user. As of now, there is no such XAI technique that is used by and effectively caters to the strongest human players. Top players are not prone to making obvious tactical blunders; their errors are usually those that induce subtle long-term deficiencies. Therefore, simple blunder detection tools are unhelpful and even insights from more complex tools like DecodeChess seem superficial. Instead, Sadler resorts to a more manual approach of deducing the important properties of a position by mentally collating the results from many simulations of a game.

Problem structure & inductive bias

The simulation approach that Sadler performs manually resembles the inner workings of the engines themselves. Strong engines iteratively simulate moves and evaluate how favorable the resulting positions are, for example as an estimated probability of winning. These two divisions are referred to as "search" and "evaluation," respectively. Older engines like Alan Turing's "Turochamp," IBM's Deep Blue, and Stockfish (through version 11)[17] used carefully tuned, hand-crafted functions for evaluation. These functions are linear combinations of factors that the programmers think are relevant, such as the number of pawns remaining, metrics for king safety, and piece mobility. This type of function is limited by the programmers' expertise in chess as well as their ability to codify and tune their evaluation factors. Additionally, the flexibility of these functions is restricted because different features cannot interact nonlinearly, as they could in a neural network.
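To make the hand-crafted style concrete, here is a minimal sketch of such a linear evaluation. The feature names and weights are purely illustrative, not those of any real engine:

```python
# Hypothetical sketch of a hand-crafted linear evaluation in the spirit of
# pre-NNUE engines. Feature names and weights are illustrative only.

def evaluate(features, weights):
    """Linear combination of hand-crafted positional features (in centipawns)."""
    return sum(weights[name] * value for name, value in features.items())

# Example features extracted from a position, from White's perspective.
features = {
    "pawn_material": 1,    # White has one extra pawn
    "king_safety": -2,     # White's king is slightly exposed
    "piece_mobility": 5,   # White's pieces have five more legal moves
}

# Weights hand-tuned by the programmers.
weights = {
    "pawn_material": 100,  # a pawn is conventionally worth ~100 centipawns
    "king_safety": 15,
    "piece_mobility": 4,
}

score = evaluate(features, weights)  # 100 - 30 + 20 = 90 centipawns
```

A neural evaluation replaces this fixed weighted sum with learned, nonlinear interactions between features, which is exactly the flexibility the hand-crafted form lacks.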

MS:  …older versions of Stockfish, obviously using this hand-crafted evaluation, a lot of what it sees immediately is wrong. The search comes over it and squashes all of the imperfections… If you’re going to look at where engines could explain stuff, it’s definitely evaluation, not the [search], that’s way too deep. All the factors you have to consider in order to understand why a position is good.

Discussing this pre-neural era of chess engines, Sadler suggests that the elements of the hand-crafted function are uninteresting despite being inherently interpretable, because the evaluation function itself is actually quite weak. Instead, the search function, which examines millions of positions at a time and thus cannot be easily parsed by humans, is the true source of an engine's strength.
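The search half can be sketched as a depth-limited negamax over a game tree; everything below is a toy stand-in (real engines add alpha-beta pruning, transposition tables, and move ordering, and visit millions of positions per second):

```python
# Minimal negamax search over a toy game tree. All structures here are
# illustrative stand-ins, not any real engine's representation.

def negamax(pos, depth):
    """Best achievable score from `pos` for the side to move."""
    legal = moves(pos)
    if depth == 0 or not legal:
        return evaluate(pos)
    # The opponent's best score is the negation of ours.
    return max(-negamax(apply_move(pos, m), depth - 1) for m in legal)

# Toy game: internal nodes map move names to child positions; leaves hold a
# static evaluation from the perspective of the side to move there.
def moves(pos):
    return list(pos) if isinstance(pos, dict) else []

def apply_move(pos, m):
    return pos[m]

def evaluate(pos):
    return pos  # leaves are their own scores

tree = {
    "a": {"a1": 3, "a2": -2},
    "b": {"b1": 5, "b2": 1},
}

best_score = negamax(tree, depth=2)  # the root prefers move "b"
```

The point Sadler makes survives even in this toy: the per-leaf `evaluate` is trivially interpretable, but the engine's verdict emerges from the combinatorial interaction of scores across the tree, which is "way too deep" to narrate to a human.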

New versions of Stockfish have likely surpassed AlphaZero (and its stronger open-source descendant Leela)[18][19] by incorporating a variety of neural network that deserves its own article: the Efficiently Updatable Neural Network (NNUE).[20][21] NNUE was introduced for the game Shogi in 2018 and was later adapted to other games. The idea behind NNUE is to exploit the fact that the network's inputs are sparse and largely constant from one move to the next – that is, the board state is identical to the previous one besides a handful of bits. Under these conditions, found in many board games, the network can be evaluated much faster, even on a CPU. Paired with efficient search, these evaluation improvements brought new life to formerly hand-designed engines, but Sadler's perspective is largely the same:
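The "efficiently updatable" trick can be sketched in a few lines. In this toy version (all dimensions and weights are illustrative, not NNUE's actual architecture), the first layer's pre-activations – the "accumulator" – are a sum of weight rows for the currently active sparse binary inputs, so a move that flips only a couple of inputs can patch the accumulator instead of recomputing it:

```python
# Toy sketch of NNUE-style incremental first-layer updates. Dimensions,
# weights, and feature sets are illustrative stand-ins only.

N_FEATURES, HIDDEN = 8, 4

# Illustrative first-layer weights: one row of HIDDEN values per input feature.
W = [[(f + 1) * (h + 1) for h in range(HIDDEN)] for f in range(N_FEATURES)]

def full_forward(active):
    """Recompute the accumulator from scratch over all active features."""
    acc = [0] * HIDDEN
    for f in active:
        acc = [a + w for a, w in zip(acc, W[f])]
    return acc

def incremental_update(acc, removed, added):
    """Patch the accumulator for the few features a move turned off/on."""
    for f in removed:
        acc = [a - w for a, w in zip(acc, W[f])]
    for f in added:
        acc = [a + w for a, w in zip(acc, W[f])]
    return acc

before = {0, 2, 5}  # features active before the move
after = {0, 2, 6}   # the move deactivates feature 5 and activates feature 6

acc = incremental_update(full_forward(before), removed={5}, added={6})
assert acc == full_forward(after)  # identical result, far less work
```

With thousands of input features and only a handful changed per move, the patched update touches a tiny fraction of the weights, which is what makes CPU evaluation inside a deep search affordable.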

MS: … You know [Stockfish has] got a neural network now, so the handcrafted evaluation has virtually disappeared, but it’s a very small net only in positions that are non-forcing so without tactics. Any mistakes it makes, the search comes through and squashes any problems… Leela, playing on 1 node, not looking ahead, is probably strong [International Master] strength, which is quite astonishing. That’s the sort of thing that you’d want to learn from.

He sees deeper, stronger evaluation functions as the more promising avenue for explainability. This type of approach is more consistent with the human way of analyzing a position – we do search, but our capacity for search is extremely limited. Instead, humans excel at intuition and pattern recognition for a given board state; to cultivate these strengths, we may want to analyze algorithms that are also strong in these areas.

Of course, the tree-like structure of games like chess does not carry over to every domain, so the search-and-evaluate analysis Sadler has learned to perform isn't available to experts in every field. Instead, the type of analysis performed by experts equipped with strong AI systems should depend on the underlying structure of the problem. These preconceptions about problem structure are known as inductive biases, and they are ubiquitous in modern machine learning applications. Inductive bias can itself serve as an approach to interpreting models, but it is often (as in the case of chess) insufficient on its own to explain a model's predictions.

A human problem

MS: If you’re trying to solve the problem of, like in the World Chess Championship, playing against a human, the human will be playing the moves as well. The human has limited memory capacity, limited calculation ability, etc… there are some areas where you just want to solve the problem, some areas where you want to solve a human problem. I think you’ve got to be very aware of that.
If you want to take on the world number 2 in a match, you really need someone to understand and translate the output of a 3650 [rated] engine, and to translate that into something that will be effective. That’s really the skill of the Seconds in the World Championship.

A “Second” is a professional assistant to a chess player, usually a strong Grandmaster themselves, acting as a coach to aid in computer-guided preparation. Before chess engines were unambiguously better than humans (and even for a while after they were), Seconds faced the challenge of doubting the validity of an engine prediction. Now, the engine moves are as close to optimal as anyone can tell, but humans are still limited by not having access to an engine once play begins. Most domains that employ AI still face both challenges: doubting the predictions themselves and transforming the predictions into actionable knowledge for humans. This is because, unlike chess, most problems don’t fit into the box of perfect information games and they usually lack such infallible oracles.

Sadler was especially excited about the creativity AlphaZero evoked in top players. World champion Magnus Carlsen even commented that, during games, he would mentally deliberate over what AlphaZero would do in a given situation. He and his team were early adopters, which possibly showed during his "reign of terror" in 2019, when he went unbeaten in classical chess.

MS: You know there was this one peculiar human thing. Peter Heine Nielsen, [Magnus’] Second, was explaining “We thought of AlphaZero and came up with this idea,” and it was a sort of reverse-reasoning. “AlphaZero liked that so maybe it’d also like this.” But it was something I’d never seen as a theme or AlphaZero do. This human ability to reason and make a connection where there was none, it was very impressive.


The capabilities of machine learning systems continually outpace the techniques used to explain them. In this gap, end-users resort to reading the tea leaves of their models' predictions.

Competitive games have long been used as a testbed for artificial intelligence, and expert interpreters have emerged in the wake of superhuman models reshaping their game. Domain-specific expert interpreters may surface in new fields as powerful ML models reach new disciplines. Through the lens of chess Grandmaster Matthew Sadler, I explored some of the ways in which this “neural net tasseography” might manifest itself.

We are reminded of the importance of appropriate inductive biases, both for performance and as conduits for explainability. The quality of an explanation may depend considerably on the experience of the practitioner, calling for different approaches for different levels of user expertise. Human constraints will affect the best ways to interpret results, even when those results are trustworthy.

Author Bio
Derek Metcalf is a PhD Candidate focused on machine learning in drug discovery at the Georgia Institute of Technology.

This article was inspired by two books GM Matthew Sadler wrote with WIM Natasha Regan, Game Changer and The Silicon Road to Chess Improvement, which I enthusiastically recommend.[24][12][25] Game Changer is appropriate for ML practitioners, and both are excellent for chess enthusiasts. Sadler additionally has extensive, high-quality content on his personal and book-related YouTube channels: Personal, Game Changer, Silicon Road.[26][27][28]



For attribution in academic contexts or books, please cite this work as

Derek Metcalf, "Reading the Tea Leaves: Expert end-users explaining the unexplainable", The Gradient, 2022.

BibTeX citation:

@article{metcalf2022tealeaves,
 author = {Metcalf, Derek},
 title = {Reading the Tea Leaves: Expert End-Users Explaining the Unexplainable},
 journal = {The Gradient},
 year = {2022},
 howpublished = {\url{ }},
}

If you enjoyed this piece and want to hear more, subscribe to the Gradient and follow us on Twitter.