Convolutional Attention, Sparse Transformers, and Legal AI
Researchers at UC Berkeley introduce Bottleneck Transformers: a modified convolutional architecture with Transformer blocks for visual recognition.
Context
Widely used in state-of-the-art language models, the self-attention mechanism allows a model to draw global dependencies within its inputs. The method has proven useful for image processing tasks as well (e.g. ViT by Google Research). However, self-attention's memory requirement grows quadratically with input size, which poses a serious problem for its implementation in Computer Vision. Recently, Google introduced Long Range Arena, a benchmark for efficient transformers that sets a standard for comparing the efficiency of different transformer variants.
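To make that quadratic cost concrete, here is a minimal PyTorch sketch (our own toy illustration with arbitrary sizes, not code from any of the papers) showing that the attention score matrix alone grows with the square of the number of input tokens:

```python
import torch

def attention_scores_memory(num_tokens: int, dim: int = 64) -> int:
    """Return the number of elements in the full attention score matrix.

    In self-attention every token attends to every other token, so the
    score matrix has shape (num_tokens, num_tokens): its memory grows
    quadratically with sequence length.
    """
    q = torch.randn(num_tokens, dim)
    k = torch.randn(num_tokens, dim)
    scores = q @ k.T  # shape: (num_tokens, num_tokens)
    return scores.numel()

# A 32x32 image cut into 4x4 patches gives 64 tokens; a 256x256 image gives
# 4096 tokens: 64x more tokens, but 4096x more attention entries.
print(attention_scores_memory(64))    # 4,096 entries
print(attention_scores_memory(4096))  # 16,777,216 entries
```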
This begs the question: can Transformers really be made efficient enough to run on large image inputs?
For a visual explanation of self-attention, click here.
What's new
The short answer to the question above is perhaps, but it might not matter after all.
Researchers at UC Berkeley have recently introduced Bottleneck Transformers for Visual Recognition. Their model, whose acronym is BoTNet, presents "a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple Computer Vision tasks including image classification, object detection and instance segmentation."
The key insight of this methodology is that self-attention and convolutions have very different, but complementary, strengths. While self-attention allows models to find relationships between distant areas of an image, convolutional layers help them capture localized details. Moreover, convolutional layers shrink the spatial size of the input, while self-attention's memory requirement forces it to work on small inputs. Can joining forces offer the best of both worlds?
BoTNet proves that it can! The researchers replaced spatial convolutions with global self-attention in the final three bottleneck blocks of a classic ResNet-50 architecture. This modification "improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency".
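To give a feel for this change, here is a heavily simplified PyTorch sketch of a bottleneck block whose 3x3 spatial convolution has been replaced by global multi-head self-attention. It is our own illustration of the idea, not the authors' implementation: BoTNet's actual block also uses relative position encodings and batch normalization, both omitted here.

```python
import torch
import torch.nn as nn

class BoTBlockSketch(nn.Module):
    """Simplified bottleneck block: 1x1 conv -> self-attention -> 1x1 conv.

    A standard ResNet bottleneck uses a 3x3 convolution in the middle; here
    it is replaced by global multi-head self-attention over all spatial
    positions. Relative position encodings and normalization are omitted.
    """

    def __init__(self, channels: int = 256, bottleneck: int = 64, heads: int = 4):
        super().__init__()
        self.reduce = nn.Conv2d(channels, bottleneck, kernel_size=1)
        self.attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.expand = nn.Conv2d(bottleneck, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z = self.reduce(x)                               # (b, bottleneck, h, w)
        tokens = z.flatten(2).transpose(1, 2)            # (b, h*w, bottleneck)
        attended, _ = self.attn(tokens, tokens, tokens)  # global self-attention
        z = attended.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.expand(z)                        # residual connection

x = torch.randn(2, 256, 14, 14)   # a late-stage ResNet feature map
print(BoTBlockSketch()(x).shape)  # torch.Size([2, 256, 14, 14])
```

Because the feature maps in the last stage of a ResNet are already small (e.g. 14x14), global self-attention there stays affordable, which is why the swap is only made in the final blocks.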

The authors trained BoTNet to draw bounding boxes around objects and determine what object each pixel belongs to using COCO’s object detection and segmentation tasks.
BoTNet-50 beat a traditional ResNet-50 on both tasks. Averaged over all objects in COCO, more than half of the pixels that BoTNet associated with a given object matched the ground-truth labels 62.5 percent of the time, versus 59.6 percent for the ResNet-50. Likewise, more than half of BoTNet's predicted bounding box overlapped with the ground-truth bounding box 65.3 percent of the time, compared to 62.5 percent for the ResNet-50 model.
Why it matters
BoTNet surpasses the previously best published single-model, single-scale results of ResNets evaluated on the COCO validation set. Moreover, the authors present an adaptation of their model for image classification tasks. On this task, the model achieves a strong 84.7% top-1 accuracy on the ImageNet benchmark. What is really impressive is that it does so while being up to 2.33 times faster in compute time than the popular EfficientNet models.
The improved efficiency of Transformers on Computer Vision tasks is extremely promising for industry applications. Until now, their use has been limited by heavy compute costs and their non-applicability to large images. Companies in a wide variety of industries will now be able to improve the accuracy and robustness of their models using this state-of-the-art technique, for defect detection, object detection, and image classification tasks alike.
What's next
As stated by the authors in the abstract of their paper:
We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
Since its publication in January 2021, numerous other research teams, from Facebook AI Research and the MIT-IBM Watson AI Lab to McGill University and Microsoft AI, have explored similar techniques that apply Transformers to image processing tasks:
- Going deeper with Image Transformers, Facebook AI Research and Université Sorbonne
- CvT: Introducing Convolutions to Vision Transformers, McGill University and Microsoft AI
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification, MIT-IBM Watson AI Lab
- ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases, Facebook AI Research, École Normale Supérieure, Université Sorbonne
Two new papers from Google Research pave the way for sparse attention transformers to handle long sequences in NLP
Context
Transformers are really shaking up the AI industry. Their implementation offers versatility and robustness across a wide variety of tasks, which explains their wide-scale adoption. Their success has mainly come on sequence-based tasks, whether as a seq2seq model for translation, summarization, and generation, or as a standalone encoder for sentiment analysis, POS tagging, and machine reading comprehension.
Unfortunately, there's no such thing as a free lunch. Self-attention's quadratic cost in memory and computation is a significant handicap: in practice, it usually limits the input sequence to roughly 512 tokens and prevents the methodology from being applied to tasks that require a larger context, such as question answering, document summarization, or genome fragment classification.
What's new
Two recent papers by Google Research propose the use of sparse-attention models to solve some of these issues. More specifically, they address these questions:
- Can we achieve the empirical benefits of quadratic full Transformers using sparse models with computational and memory requirements that scale linearly with the input sequence length?
- Is it possible to show theoretically that these linear Transformers preserve the expressivity and flexibility of the quadratic full Transformers?
The first paper, “ETC: Encoding Long and Structured Inputs in Transformers”, presented at EMNLP 2020, introduces a novel method for sparse attention. In particular, the authors propose using structural information to limit the number of attention pairs that need to be computed.
The second paper, “Big Bird: Transformers for Longer Sequences”, presented at NeurIPS 2020, presents an extension to ETC called BigBird that generalizes the technique for scenarios where structural information is unavailable.
What is attention? The goal of the attention mechanism is to compute similarity scores between all pairs of tokens in the input sequence. Another way to visualize this is as a fully-connected (complete) graph, where each edge represents the similarity between the nodes it connects. The goal of these two papers is to design a sparse (partially connected) graph that reduces compute cost while maintaining high performance.

The first method proposed, called Extended Transformer Construction (ETC), leverages a global-local attention mechanism. The input is split into two parts: a small global input with unrestricted attention, and a much longer input restricted to local attention. This method scales linearly, which allows ETC to work with far longer input sequences than the traditional attention mechanism.
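One rough way to picture the global-local pattern (a toy construction of ours, not the ETC code) is as a boolean attention mask: the handful of global tokens attend everywhere, while every long-input token attends to the global tokens plus a local window around itself:

```python
import torch

def global_local_mask(num_global: int, num_long: int, radius: int) -> torch.Tensor:
    """Boolean mask over (global + long) tokens; True means 'may attend'."""
    n = num_global + num_long
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_global, :] = True   # global tokens attend to everything
    mask[:, :num_global] = True   # every token attends to the global tokens
    for i in range(num_long):     # each long token sees a local window
        lo = max(0, i - radius)
        hi = min(num_long, i + radius + 1)
        mask[num_global + i, num_global + lo:num_global + hi] = True
    return mask

mask = global_local_mask(num_global=4, num_long=64, radius=3)
# Each long token attends to at most (2*radius + 1) local tokens plus the
# globals, so the total number of attended pairs grows linearly with num_long.
print(mask.sum().item(), "attended pairs out of", mask.numel())
```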
This approach beats state-of-the-art results on five benchmark NLP datasets that are known to require long inputs: TriviaQA, Natural Questions (NQ), HotpotQA, WikiHop, and OpenKP.
The second paper introduces BigBird, a generalized extension of ETC. The key point is that BigBird does not rely on domain knowledge about the structure of the input sequence. Its attention mechanism also scales linearly and is composed of three parts (a toy sketch of the combined pattern follows this list):
- A set of global tokens attending to all parts of the input sequence
- All tokens attending to a set of local neighboring tokens
- All tokens attending to a set of random tokens
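Taken together, the three components can be pictured as a single sparse boolean mask. The sketch below is our own toy construction; the real BigBird implementation works on blocks of tokens rather than individual ones:

```python
import torch

def bigbird_style_mask(n: int, num_global: int, radius: int, num_random: int,
                       seed: int = 0) -> torch.Tensor:
    """Toy BigBird-style mask: global + sliding-window + random attention."""
    g = torch.Generator().manual_seed(seed)
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:num_global, :] = True   # global tokens attend to everything
    mask[:, :num_global] = True   # every token attends to the global tokens
    idx = torch.arange(n)
    window = (idx[:, None] - idx[None, :]).abs() <= radius
    mask |= window                # each token sees its local neighbours
    for i in range(n):            # each token also sees a few random tokens
        rand = torch.randperm(n, generator=g)[:num_random]
        mask[i, rand] = True
    return mask

mask = bigbird_style_mask(n=128, num_global=2, radius=3, num_random=3)
# Per-token connections stay roughly constant, so total work scales linearly in n.
print(mask.float().mean().item())  # fraction of the full n*n pairs actually computed
```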

Why is this technique so successful?
A crucial observation is that there is an inherent tension between how few similarity scores one computes and the flow of information between different nodes (i.e., the ability of one token to influence each other). Global tokens serve as a conduit for information flow and we prove that sparse attention mechanisms with global tokens can be as powerful as the full attention model. In particular, we show that BigBird is as expressive as the original Transformer, is computationally universal (following the work of Yun et al. and Perez et al.), and is a universal approximator of continuous functions. Furthermore, our proof suggests that the use of random graphs can further help ease the flow of information — motivating the use of the random attention component.
Aren't sparse operations notorious for being inefficient on GPUs? Yes, they are. However, as you might imagine, the authors are aware of this: they transform the sparse local and random attention into dense tensor operations, taking full advantage of modern hardware. The Google AI Blog post provides a nice illustration of the trick.
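A much-simplified way to see the trick (our own sketch, far cruder than the blocked implementation in the papers) is to gather each token's local window of keys and values into a dense tensor, so the windowed attention reduces to a padded gather plus ordinary batched matrix products, with no sparse operations at all:

```python
import torch
import torch.nn.functional as F

def windowed_attention_dense(q, k, v, radius: int):
    """Sliding-window attention computed with dense ops only.

    q, k, v: (n, d). Each query attends to the 2*radius+1 keys around it.
    Keys and values are gathered into dense (n, d, window) tensors, so the
    whole computation is GPU-friendly; no sparse kernels are needed.
    Zero-padded keys at the borders are good enough for a sketch.
    """
    n, d = q.shape
    window = 2 * radius + 1
    k_pad = F.pad(k, (0, 0, radius, radius))   # (n + 2*radius, d)
    v_pad = F.pad(v, (0, 0, radius, radius))
    k_win = k_pad.unfold(0, window, 1)         # (n, d, window)
    v_win = v_pad.unfold(0, window, 1)
    scores = torch.einsum("nd,ndw->nw", q, k_win) / d ** 0.5
    weights = scores.softmax(dim=-1)           # (n, window)
    return torch.einsum("nw,ndw->nd", weights, v_win)

q = k = v = torch.randn(1024, 64)
print(windowed_attention_dense(q, k, v, radius=3).shape)  # torch.Size([1024, 64])
```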
Why it matters
The proposed methods achieve a new state-of-the-art on challenging long-sequence tasks, including question answering, document summarization and genome fragment classification.
Recently, researchers published “Long Range Arena: A Benchmark for Efficient Transformers”, a paper that provides a benchmark of six tasks requiring longer context and compares all existing long-range Transformers on them. As stated by the authors: "The results show that the BigBird model, unlike its counterparts, clearly reduces memory consumption without sacrificing performance."
It is a very promising step forward for the NLP community. In fact, showing that sparse attention can be just as expressive and flexible as full attention is groundbreaking! The finding mitigates the major roadblock to applying Transformers to long input sequences.
While a huge step forward for the research community, this publication will also allow many industry applications to take off. In fact, document summarization and question answering models bring an immense amount of value to many organizations. There is no better time than now to start investing in the use of Transformers in NLP models.
What's next
As stated by the authors themselves:
Given the generic nature of our sparse attention, the approach should be applicable to many other tasks like program synthesis and long form open domain question answering. We have open sourced the code for both ETC (github) and BigBird (github), both of which run efficiently for long sequences on both GPUs and TPUs.
A recent article outlines seven legal questions for data scientists to survive in an increasingly regulated industry.
Context
Algorithms create immense value for organizations. In recent years, these algorithms increasingly incorporate Machine Learning methodology, meaning that they are trained on a dataset of some kind to make predictions, classify inputs, or take actions. Due to the non-deterministic nature of such algorithms, they can also get your organization into deep trouble. Prone to lawsuits when their outcomes run afoul of local or global regulations, ML systems are likely to become highly regulated in the near future.
While it is currently extremely rare for ML development teams to include legal counsel, there are certain legal questions that data scientists should absolutely consider when building models.
What's new
A recent article in O'Reilly Radar by employees of bnh.ai discussed Seven Legal Questions for Data Scientists. bnh.ai is an American boutique law firm focused on helping their clients avoid, detect, and respond to the liabilities of AI and analytics.

The list of questions goes as follows:
- Fairness: Are there outcome or accuracy differences in model decisions across protected groups? Are you documenting efforts to find and fix these differences?
- Privacy: Is your model complying with relevant privacy regulations?
- Security: Have you incorporated applicable security standards in your model? Can you detect if and when a breach occurs?
- Agency: Is your AI system making unauthorized decisions on behalf of your organization?
- Negligence: How are you ensuring your AI is safe and reliable?
- Transparency: Can you explain how your model arrives at a decision?
- Third Parties: Does your AI system depend on third-party tools, services, or personnel? Are they addressing these questions?
Why it matters
Global efforts in policy-making show a highly regulated Machine Learning future:
- The European Union's civil liability regime for artificial intelligence
- Canada's regulatory framework for AI
- Singapore's approach to AI governance
- The United Kingdom's auditing framework for Artificial Intelligence and its core components
- The United States' Algorithmic Accountability Act and industry specific guidance such as the FDA's proposed regulatory framework for AI
The regulatory aspects of implementing Machine Learning in industry warrant a great deal of attention, both now and in the future. Making sure your Machine Learning algorithms are safe and sound will soon become an obligation.
What's next
As stated by the article's authors:
In light of this government movement, and the growing public and government distrust of big tech, now is the perfect time to start minimizing AI system risk and prepare for future regulatory compliance.
For more resources on AI liabilities and its legal risks, check out the bnh.ai resources page.