Interpretable X-rays, Causal ML, and Long-form QA
A digital diagnostic tool using Machine Learning and cloud computing is able to read chest X-rays accurately and rapidly to help doctors identify, triage, and monitor COVID-19 patients.
Context
We are all too familiar with the impact COVID-19 has had over the past year. One of the key challenges for those working on the front lines is the identification and isolation of infected patients. While critical to controlling the virus' spread, this becomes a real challenge when testing procedures, hospital staff, and other resources are stretched thin by a wave of incoming patients.
Adding to the resource limitations found in many hospitals around the globe, suspected patients present a wide variety of symptoms with varying levels of intensity. Some show undeniable signs such as coughing, fever, fatigue, and aches and pains; others have milder cases or appear completely asymptomatic. The latter can still spread the virus unknowingly, posing a real threat to those around them.
What's new
A number of reputable hospitals, screening centers, and clinics around the world are increasingly turning to fast and accurate chest X-ray exam tools that incorporate Machine Learning algorithms to detect COVID-19.
One provider of such tooling is Lunit. Its solution, INSIGHT CXR, is trained on 3.5M high-quality X-rays that were clinically validated with additional CT scans. Approved for commercial sale in Europe, Australia, and parts of Asia, the solution supports the detection of 10 major chest diseases with 97-99% accuracy.
The technology is also interpretable, which is essential for healthcare solutions that aid doctors in their daily workflow. INSIGHT CXR localizes detected lesions in the form of heatmaps and assigns each lesion a probability indicating how abnormal it is, allowing trained radiologists to investigate uncertain cases further.
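Lunit has not published its implementation details, but heatmaps like these are commonly produced with class-activation techniques. Below is a minimal, hypothetical Grad-CAM-style sketch; a generic ResNet stands in for the actual CXR classifier and the input is a random tensor, so this illustrates the general idea rather than Lunit's method.

```python
# Hypothetical Grad-CAM-style sketch: overlay a class-activation heatmap on an
# X-ray to localize the regions driving a prediction. Not Lunit's actual method.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None)  # stand-in for a trained CXR classifier
model.eval()

activations, gradients = {}, {}

def save_activation(module, inp, out):
    activations["feat"] = out

def save_gradient(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0]

# hook the last convolutional block
model.layer4.register_forward_hook(save_activation)
model.layer4.register_full_backward_hook(save_gradient)

xray = torch.randn(1, 3, 224, 224)          # placeholder for a preprocessed X-ray
logits = model(xray)
score = logits[0, logits.argmax()]          # score of the predicted finding
model.zero_grad()
score.backward()

# weight each feature map by its average gradient, combine, and rectify
weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
heatmap = F.interpolate(cam, size=xray.shape[2:], mode="bilinear", align_corners=False)
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
abnormality = torch.sigmoid(score).item()   # probability-like abnormality score
```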

Why it matters
South Korean radiologist Dr. Kyu-mok Lee commented on the analyses made by the system. “An X-ray is a compressed two-dimensional rendering of three-dimensional human structures. Inevitably, organs and structures overlap in the images”, he said.
X-rays only appear in black and white, which means that there are cases where lesions aren’t noticeable to the human eye. Lunit’s solution has the advantage of displaying lesions in vivid color.
– Dr. Kyu-mok Lee, radiologist
He adds that “the reality for radiologists, especially in Korea, is that it’s impossible to invest a lot of time in reading each X-ray as they would have to read hundreds or thousands every day.”
The added value of Lunit's solution is that it helps medical staff make informed decisions more rapidly. Additionally, the patient's fate is not put solely in the hands of AI. Difficult cases are flagged by the system and followed up in more detail by specialized doctors.
What's next
The technology has already been adopted in South Korea, Thailand, Indonesia, Mexico, Italy, and France. It has proven particularly useful in reducing the workload in hospitals with many patients and few radiologists.
“By enabling more accurate, efficient, and timely diagnosis of chest diseases, Lunit INSIGHT CXR can help reduce the workload of medical professionals. In this way, they can bring more value to the patients in not only difficult circumstances, such as the current pandemic crisis, but in routine clinical settings as well,” says Brandon Suh, the CEO of Lunit.
The solution's significant global deployments give Lunit's developers a lot of confidence. They believe it could take on a larger role in the future as an independent image reader that increases cancer detection.
A new paper from researchers at Max Planck Institute, MILA, ETHZ, and Google Research discusses the intersection between the two fields of Machine Learning and graphical causality.
Context
Animal brains are intuitively strong at causal inference. Without being explicitly instructed to do so, animals learn from their environment by observation. For example, when we learn to play football, we understand that a player's leg movement is what causes the ball to change direction, not the other way around. By forming underlying causal representations, we are easily able to answer interventional and counterfactual questions. For instance, where does the ball go if the player tilts his foot slightly upwards during the kick? What would happen if the ball flew a bit higher and thus wasn't kicked by the player at all?
Machine Learning algorithms, on the other hand, have managed to outperform humans in very complex tasks occurring in extremely controlled environments such as chess. Using new Deep Learning techniques with huge amounts of data, these algorithms are able to transcribe audio in real-time, label thousands of images per second, and examine X-rays and MRI scans for disease indicators. However, they still struggle immensely with generalization to broader environments and simple causal inferences like the football example described above.
What's new
Published in late February, a paper by researchers from Max Planck Institute, MILA, ETH Zurich, and Google Research discusses the intersection between Machine Learning and graphical causality. The objective is to explore and find potential solutions to Machine Learning's lack of causality. Overcoming this problem could be key to solving some of the most important challenges in the field of Artificial Intelligence.
Causal models are so powerful because they allow us to perform interventions and answer counterfactual questions.
“Machine Learning often disregards information that animals use heavily: interventions in the world, domain shifts, temporal structure — by and large, we consider these factors a nuisance and try to engineer them away, [...] In accordance with this, the majority of current successes of Machine Learning boil down to large scale pattern recognition on suitably collected independent and identically distributed (i.i.d.) data.”
-- Schölkopf et al., 2021

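To make this concrete, here is a toy structural causal model of the football example. The equations and coefficients are invented purely for illustration, but they show what the do-operator (an intervention) changes compared to passive observation.

```python
# Toy structural causal model (SCM) for the football example; the equations and
# coefficients are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=10_000, do_kick_angle=None):
    """Sample from the SCM; optionally intervene (do-operator) on the kick angle."""
    kick_angle = rng.normal(20.0, 5.0, n)          # exogenous: angle of the player's foot
    if do_kick_angle is not None:
        kick_angle = np.full(n, do_kick_angle)      # intervention: fix the angle, ignoring its usual causes
    noise = rng.normal(0.0, 1.0, n)
    ball_height = 0.3 * kick_angle + noise          # ball height is caused by the kick angle
    return kick_angle, ball_height

# Observational distribution vs. interventional: "what if the foot tilts upward?"
_, h_obs = simulate()
_, h_do = simulate(do_kick_angle=35.0)
print(f"observed mean ball height:      {h_obs.mean():.2f}")
print(f"ball height under do(angle=35): {h_do.mean():.2f}")
```

Note the asymmetry: intervening on the ball's height would leave the kick angle untouched, a distinction that a purely statistical model of the joint distribution cannot express.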
A straightforward question follows: why does Machine Learning still rely on the i.i.d. assumption despite its flaws?
The answer is short and simple: scalability. Pattern recognition at scale based on observational approaches can be very powerful. This means that when you frame your problem in a controlled setting, with strong compute and sufficient data (both quantitatively and qualitatively), you are bound to achieve relatively good results. It is no coincidence that the AI revolution arrived alongside the advent of high-speed processors and widespread data availability.
As the environment grows in complexity, it becomes impossible to cover the entire distribution by adding more training examples. This is especially true in Reinforcement Learning applications. The following clip exemplifies this challenge: it shows a Tesla on Autopilot crashing into an overturned semi-truck (a scenario that probably never appeared during training).
The key strength of causal models is that they allow you to repurpose previously gained knowledge in new domains. If you are a good football player, you can reuse skills learned in football, such as running, team tactics, and passing strategy, when introduced to rugby or handball.
I can already hear Machine Learning enthusiasts clamouring: "that's what we call transfer learning!". While extremely useful, transfer learning is limited to fairly narrow use cases, most commonly an image classifier that is fine-tuned to detect more specific sub-classes of objects (a sketch of this follows below). In more complex tasks, such as learning video games, Machine Learning models need huge amounts of training (thousands of years’ worth of play) and respond poorly to minor changes in the environment (e.g., playing on a new map or with a slight rule change).
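For comparison, this is roughly what that narrow form of transfer learning looks like in practice: a pretrained backbone is frozen and only a new classification head is retrained on a handful of sub-classes. This is a hedged sketch; the model choice and class count are arbitrary.

```python
# Minimal transfer-learning sketch: reuse a pretrained backbone, retrain only
# the classification head for a few new sub-classes. Illustrative only.
import torch
import torch.nn as nn
from torchvision import models

num_new_classes = 5                                   # e.g. specific sub-classes of objects
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

for param in model.parameters():                      # freeze the pretrained features
    param.requires_grad = False

model.fc = nn.Linear(model.fc.in_features, num_new_classes)  # new, trainable head

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# one illustrative training step on a dummy batch
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, num_new_classes, (8,))
optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()
optimizer.step()
```

The reused knowledge stays confined to image features of the original domain, which is exactly the limitation the causal view tries to move beyond.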
Why it matters
Broadly speaking, causal models can address the lack of generalization capability in Machine Learning.
“Generalizing well outside the i.i.d. setting requires learning not mere statistical associations between variables, but an underlying causal model,” the researchers write.
What's next
While the advantages of causal modeling are clear, it remains an uphill battle to implement these concepts in Machine Learning algorithms.
“Until now, Machine Learning has neglected a full integration of causality, and this paper argues that it would indeed benefit from integrating causal concepts.”
-- Schölkopf et al., 2021
The researchers discuss several challenges to the application of causal models with Machine Learning: “(a) in many cases, we need to infer abstract causal variables from the available low-level input features; (b) there is no consensus on which aspects of the data reveal causal relations; (c) the usual experimental protocol of training and test set may not be sufficient for inferring and evaluating causal relations on existing data sets, and we may need to create new benchmarks, for example with access to environment information and interventions; (d) even in the limited cases we understand, we often lack scalable and numerically sound algorithms.”
The promising signal is that these challenges are discussed and laid out concretely in papers like this one, slowly paving the way for future research in this domain.
A joint paper from Google Research and Amherst demonstrates SOTA results on the KILT Long-form Question Answering benchmark, all while pointing out flaws in the evaluation system itself.
Context
As the field of Natural Language Processing (NLP) progresses, research teams are showing impressive results on tasks that seemed impossible just a couple of years ago. One of these tasks is open-domain Long-form Question Answering (LFQA). As the name suggests, the goal is to answer a question with an elaborate, paragraph-length response after retrieving relevant documents.
LFQA's younger brother, open-domain Question Answering (QA), has seen immense progress recently. Widely available datasets and well-defined benchmarks (e.g. SQuAD) are certainly responsible in part for these advances. Which brings us to the following question: are current benchmarks and evaluation metrics suitable for stimulating progress on LFQA?
What's new
Last week, researchers from Amherst and Google Research published “Hurdles to Progress in Long-form Question Answering”, a paper that will appear at NAACL 2021 (the North American Chapter of the Association for Computational Linguistics).
The paper lays out the methodology used in their submission on KILT, a benchmark for Knowledge Intensive Language Tasks. While their submission tops the leaderboard on ELI5 (the only publicly available LFQA dataset), the authors posit that there are some flaws to the evaluation framework itself.
The model presented by the authors leverages two recent advances in NLP to achieve SOTA results:
- A sparse attention model, the Routing Transformer (RT), which allows the attention mechanism to scale to long sequences.
- The RT model reduces the attention mechanism's complexity in the Transformer from O(n^2) (quadratic) to O(n^1.5), where n is the sequence length. Compared to models like Transformer-XL, it enables each token to attend to other tokens anywhere in the sequence, not only those in its immediate vicinity.
- A retriever based on REALM, trained with a contrastive loss and aptly called c-REALM. This retrieval method "facilitates retrievals of Wikipedia articles related to a given query" (see the sketch after this list).

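The authors' training setup is more involved, but the core idea of a contrastive retrieval loss can be sketched in a few lines: a query embedding should score its paired passage higher than the other passages in the batch (in-batch negatives). The encoders are omitted and the dimensions below are placeholders, so this is an illustration of the idea rather than the paper's implementation.

```python
# Sketch of an in-batch contrastive loss for dense retrieval (c-REALM-style idea,
# simplified; not the authors' exact implementation).
import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(query_emb: torch.Tensor, passage_emb: torch.Tensor,
                               temperature: float = 0.05) -> torch.Tensor:
    """query_emb, passage_emb: (batch, dim); row i of each is a matching pair."""
    query_emb = F.normalize(query_emb, dim=-1)
    passage_emb = F.normalize(passage_emb, dim=-1)
    scores = query_emb @ passage_emb.T / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(scores.size(0))             # the diagonal holds the positives
    return F.cross_entropy(scores, targets)

# toy usage with random embeddings standing in for encoder outputs
q = torch.randn(16, 128)
p = torch.randn(16, 128)
loss = contrastive_retrieval_loss(q, p)
```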
For examples of LFQA question-answer pairs, refer to Google's blog post discussing the paper.
The evaluation framework (KILT) uses two metrics: (1) precision of the retrieved documents (P-Rec) and (2) ROUGE-L for the generated answer.
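ROUGE-L scores the longest common subsequence (LCS) shared between a generated answer and a reference answer. Here is a minimal sketch of the F-measure variant, assuming simple whitespace tokenization; the official scorer is more elaborate.

```python
# Minimal ROUGE-L (LCS-based F-measure) sketch; real evaluations use the official
# scorer with proper tokenization and aggregation.
def rouge_l(candidate: str, reference: str, beta: float = 1.2) -> float:
    cand, ref = candidate.split(), reference.split()
    # dynamic-programming table for the longest common subsequence length
    dp = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i, c in enumerate(cand, 1):
        for j, r in enumerate(ref, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if c == r else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    # beta controls the precision/recall balance in the F-measure
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

print(rouge_l("the model retrieves wikipedia articles",
              "the model retrieves relevant articles"))
```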
Despite their SOTA results, the authors point out some remaining issues with the KILT evaluation framework.
- There is little evidence that models actually use the retrievals on which they are conditioned.
- Trivial baselines such as copying the input or returning a random training-set answer achieve relatively high ROUGE-L scores, even beating some models such as BART + DPR. This can be observed in a figure in Google's blog post.
- There is implicit train/validation overlap in the ELI5 dataset. In fact, some validation questions appear to be paraphrased versions of training questions, examples of which are given in the paper.


Why it matters
Achieving SOTA results in LFQA is quite impressive, and is a promising step forward for the NLP community. However, as pointed out in the paper, there remain several issues in the current benchmarking framework for the task. For concrete advances to be made, there needs to be a fruitful environment that allows researchers to compare models on solid datasets by using relevant evaluation metrics.
What's next
As stated by the authors themselves: "We hope that the community works together to solve these issues so that researchers can climb the right hills and make meaningful progress in this challenging but important task."