Protein Folding Breakthrough, TLDR in Science, & Robot Bias
AlphaFold solves the protein folding problem
DeepMind's AlphaFold scores above 90 GDT at the CASP protein structure prediction competition
Proteins are among the most important molecules for sustaining life. Practically all biological functions, from transporting oxygen through our blood to giving leaves their bold colors, are supported by proteins.
Proteins can be described using three different structural languages, each with its own degree of complexity and abstraction, as depicted below.
There are different ways of determining protein structure, and each of these methods yields information about the protein in one of the different structural languages. For instance, while mass spectrometry can yield primary structure, only Nuclear Magnetic Resonance (NMR), X-ray crystallography, and cryo-electron microscopy (which are immensely time- and resource-intensive) are able to yield tertiary structure.
Last week, DeepMind's AlphaFold competed in the biennial Critical Assessment of protein Structure Prediction (CASP). The challenge asks participants to predict the tertiary structure of a protein from its primary structure. The metric used for evaluation is called the Global Distance Test (GDT). In short, the score ranges from 0 to 100 and indicates how close the predicted structure is to the experimentally determined one (the ground truth).
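To make the GDT score more concrete, here is a minimal Python sketch of its common GDT_TS variant, which averages the fraction of residues that lie within 1, 2, 4, and 8 angstroms of their true position. It assumes the two structures are already superimposed and given as plain lists of coordinates; the real metric also searches over many superpositions to find the best one, which this sketch skips. The function name and the toy coordinates are illustrative assumptions, not CASP's evaluation code.

```python
import math

# Distance cutoffs in angstroms averaged by the GDT_TS variant of the score.
CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(predicted, reference, cutoffs=CUTOFFS):
    """Simplified GDT_TS for two already-superimposed lists of (x, y, z) points.

    Assumption: one point per residue, in the same order in both lists.
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Per-residue distance between prediction and ground truth.
    distances = [math.dist(p, r) for p, r in zip(predicted, reference)]
    # Fraction of residues within each cutoff.
    fractions = [
        sum(d <= cutoff for d in distances) / len(distances) for cutoff in cutoffs
    ]
    # Average the fractions and scale to the 0-100 range.
    return 100.0 * sum(fractions) / len(fractions)


# Toy data (made up): four residues displaced by 0.5, 1.5, 3.0 and 10.0 angstroms.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
predicted = [(0.0, 0.5, 0.0), (1.0, 1.5, 0.0), (2.0, 3.0, 0.0), (3.0, 10.0, 0.0)]

print(gdt_ts(predicted, reference))  # 56.25
print(gdt_ts(reference, reference))  # 100.0
```

A perfect prediction scores 100, and a single badly misplaced residue lowers the score without zeroing it, which is why the metric is more forgiving of local errors than a plain average distance would be.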
In the previous seven editions of CASP, the winners' scores never exceeded 75 GDT, and they stayed below 50 GDT before CASP 2018. This year, however, AlphaFold's state-of-the-art AI model achieved a median score of 92.4 GDT. This surpasses the 90 GDT threshold that is considered to be a 'solution' to the protein folding problem.
DeepMind's solution implements new deep learning techniques that treat a folded protein as a spatial graph. Combining an attention-based neural network with evolutionarily related sequences and multiple sequence alignments, the system produces strong predictions of the underlying physical structure of the protein.
Why it matters
For 50 years, researchers in biology have been looking for a method to determine a protein's tertiary structure using only the information in its primary structure. This is essential because a protein's tertiary structure is closely linked to its function. Knowing the tertiary structure therefore unlocks a greater understanding of what the protein does and how it works.
The DeepMind team states that they're "optimistic about the impact AlphaFold can have on biological research and the wider world, and excited to collaborate with others to learn more about its potential in the years ahead. Alongside working on a peer-reviewed paper, we’re exploring how best to provide broader access to the system in a scalable way."
'Too Long; Didn't Read' comes to scientific literature
A new state-of-the-art summarization model is being used to distill the information of AI research papers into a single sentence
In recent years, many different summarization models have been released. Their common goal: reduce reading time without compromising understanding. You can easily find online bots such as summarizebot, summarization APIs such as the one from DeepAI, and articles explaining the key technical concepts behind these types of models. What's the catch? These models share a common flaw: they don't generalize well. When applied to text unlike the data they were trained on, they perform significantly worse.
Researchers from the Allen Institute for AI "introduce TLDR generation for scientific papers, a new automatic summarization task with high source compression, requiring expert background knowledge and complex language understanding." This quote is itself a summary of their paper's abstract, generated with the method the paper describes.
The researchers compiled the SciTLDR dataset and trained their model with a multitask learning strategy on top of pretrained BART. By analyzing only a paper's abstract, introduction, and conclusion (for computational reasons), the method can summarize a 5,000-word article in about 20 words.
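To give a feel for that level of compression, here is a small Python sketch that assembles a model input from the three sections mentioned above and computes a words-in to words-out ratio. The function names and the placeholder texts are hypothetical and only illustrate the arithmetic; this is not the authors' preprocessing code.

```python
def build_model_input(abstract, introduction, conclusion):
    """Join the abstract, introduction, and conclusion, skipping empty sections."""
    sections = (abstract, introduction, conclusion)
    return "\n\n".join(s.strip() for s in sections if s.strip())


def compression_ratio(source_text, summary):
    """Number of words in the source divided by number of words in the summary."""
    summary_words = len(summary.split())
    if summary_words == 0:
        raise ValueError("summary must contain at least one word")
    return len(source_text.split()) / summary_words


# Placeholder texts (made up) with the word counts quoted in the digest.
full_paper = " ".join(["word"] * 5000)
tldr = " ".join(["tldr"] * 20)

print(compression_ratio(full_paper, tldr))  # 250.0
print(build_model_input("Abstract text.", "", "Conclusion text."))
```

A ratio of 250 is far higher than what a typical abstract achieves, which hints at why so much expert background knowledge is needed to write a good TLDR.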
The model has been deployed as a beta version on Semantic Scholar. Displaying the TLDR of each article directly on the search results page helps you quickly find the papers that are relevant to you. The feature is already available for nearly 10 million computer science papers, and counting!
Why it matters
Staying up to date with scientific literature is an essential part of a researcher's workflow. Yet parsing through a long list of papers from different sources by reading abstracts is extremely time-consuming.
TLDRs can help researchers make quick and informed decisions about which papers are relevant to them. TLDRs also provide paper summaries for explaining the content in other contexts, such as sharing a paper on social media platforms.
Summarizing a paper in 20 words gives you a good idea of its direction. However, in complex domains such as computer science, a couple of dozen words is not enough to distill all of the information. A possibility for the future might be dynamic N-sentence summarizers.
Making Robots less biased than humans
Researchers in Robotics have committed to actively ensuring fairness in AI-driven solutions
Almost all police robots in use today are straightforward remote-control devices. However, more sophisticated robots are being developed in labs around the world. Increasingly, they use Artificial Intelligence to integrate many more complex and diverse features.
Many researchers find this problematic. In fact, several AI algorithms for facial recognition, predicting people’s actions, or nonlethal projectile launching have led to controversy in the past few years. The reason is clear: many of these algorithms are biased against people of color and other minorities. Researchers from Google have argued that the police shouldn't use this type of software. On top of that, some private citizens are now using facial recognition against the police, as mentioned in a previous digest.
Earlier this year, hundreds of AI and robotics researchers committed to actively changing some practices in their field of work. A first open letter, from the organization Black in Computing, states that “the technologies we help create to benefit society are also disrupting Black communities through the proliferation of racial profiling.” A second statement, “No Justice, No Robots”, calls on its signers to refuse to work with or for law enforcement.
Researchers in robotics are trained to solve difficult technical problems. They are typically not trained to consider how the robots they build affect society. Nevertheless, they have committed themselves to actions whose end goal is to make the creation and use of AI in robotics more just.
Why it matters
The adoption of AI systems is growing rapidly. Today there are AI systems built into self-driving cars specifically to detect pedestrians. A study by Benjamin Wilson and his colleagues from Georgia Tech found that eight such systems were significantly worse at detecting people with darker skin tones than people with lighter ones.
As a public policy researcher from Georgia Tech, Dr. Jason Borenstein, puts it: “it is disconcerting that robot peacekeepers, including police and military robots, will, at some point, be given increased freedom to decide whether to take a human life, especially if problems related to bias have not been resolved.”
The root cause of this issue, as Dr. Odest Chadwicke Jenkins (one of the main organizers of the open letter mentioned above) from the University of Michigan states, "is representation in the room — in the research lab, in the classroom, and the development team, the executive board.”
In parallel, some technical progress is being made to mitigate the potentially unfair outcomes of AI systems. For instance, Google has developed Model Cards, a framework for building a shared understanding of AI models, as mentioned in a previous digest that discussed background features in Google Meet. In a Model Card, bias is tested across different geographies, skin tones, and genders. This method clearly identifies the metrics used and the results found, adding a lot of transparency and accountability to machine learning modeling.
Additionally, the market for synthetic datasets is growing rapidly. This methodology, which is covered in more detail in a previous digest, makes it possible to balance datasets that could otherwise produce unfair outcomes.
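The idea of testing a model separately for each group can be sketched in a few lines of Python. The example below computes a detection rate per group and the largest gap between groups. The function names, group labels, and records are made up for illustration; this is not the Model Card format or Google's tooling.

```python
from collections import defaultdict


def detection_rate_by_group(records):
    """Return the detection rate per group.

    records: iterable of (group, detected) pairs, where detected is a bool.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, detected in records:
        totals[group] += 1
        hits[group] += int(detected)
    return {group: hits[group] / totals[group] for group in totals}


def largest_gap(rates):
    """Difference between the best and worst per-group rates."""
    return max(rates.values()) - min(rates.values())


# Made-up evaluation records: (group label, was the person detected?).
records = [
    ("group_a", True), ("group_a", True), ("group_a", True), ("group_a", False),
    ("group_b", True), ("group_b", True), ("group_b", False), ("group_b", False),
]

rates = detection_rate_by_group(records)
print(rates)               # {'group_a': 0.75, 'group_b': 0.5}
print(largest_gap(rates))  # 0.25
```

Reporting the per-group numbers next to the overall score is the point: an aggregate detection rate of 62.5 percent here would hide the fact that one group is detected noticeably less often than the other.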