> [!NOTE] **TL;DR**
> - First, I explain ANLS, a **metric used for extractive question answering from images/scans**;
> - Second, I provide a succinct Python implementation that is **10x faster than the official implementation**.
# How do you evaluate Visual Question Answering (VQA)?
We're going to focus on the extractive subset of VQA. That is, there is some text in the image that we want to extract and return. As an example, consider this album cover:
*(Image: an album cover with the text "TALKING HEADS" and "REMAIN IN LIGHT")*
If we asked: "Who is the album by?" you might receive the following answers:
```
1. TALKINGHEADS
2. TALKING HEADS
3. TVLKINGHEVDS
4. REMAIN IN LIGHT
```
Which answer you get depends on the approach you use to read the text in the image (OCR). Some of those answers are more correct than others. For instance, we might consider (1) to be the "correct" answer. If that's the case, should answers 2-4 all receive a score of 0?
If you believe that yes, they should, then you would want to use an exact match (EM) metric, referred to as "Accuracy" by the DocVQA team.
However, if you believe that they should receive different scores based on how close they are to our "correct" answer, then you'd want to use a softer metric. That's the motivation for ANLS: assigning partial credit to answers that are close but not correct, _because of OCR mistakes_.
ANLS stands for **A**veraged **N**ormalized **L**evenshtein **S**imilarity. As you might be able to guess from "Levenshtein," it's an edit-distance-based similarity measure. It allows you to be off by a few characters (i.e. a few OCR errors) and still get partial credit for your response.
## Computing ANLS
Let's consider the above example. Let's say there are two answers I would accept:
1. `TALKINGHEADS`
2. `TALKING HEADS`
Either of those should get a score of 1 (i.e. completely correct). Anything that's sort of close to one of those should get partial credit. For instance, `TVLKINGHEVDS` is quite close to (1); it's 2 character substitutions away out of a total of 12 characters. So let's say it should get a score of $1 - \frac{2}{12} \approx 0.833$.
ANLS gets there with four parts:
1. **L**evenshtein Edit Distance (the _L_ in ANLS)
2. **N**ormalization and **S**imilarity (the _N_ and _S_ in ANLS)
3. Choosing the best option
4. Thresholding
### Levenshtein Edit Distance
Covering _how_ Levenshtein edit distance is computed is outside the scope of this post, but [here's a great explainer].
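For intuition, here's the edit distance on our running example, using the `Levenshtein` package that the implementation later in this post relies on:
```python
from Levenshtein import distance

# Two substitutions (A -> V, twice) turn TALKINGHEADS into TVLKINGHEVDS
print(distance("TALKINGHEADS", "TVLKINGHEVDS"))  # 2

# Identical strings need no edits at all
print(distance("TALKINGHEADS", "TALKINGHEADS"))  # 0
```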
### Normalization and Similarity
$
normalized\_similarity(s_1, s_2) = 1 - \frac{edit\_distance(s_1, s_2)}{\max(|s_1|, |s_2|)}
$
Remember that $edit\_distance$ is an integer: the number of edits (insertions, deletions, substitutions) needed to turn one string into the other. So if the strings are the same, the edit distance is 0 and the similarity is $1 - 0 = 1$. Conversely, if you need to replace every character, the edit distance equals the length of the larger string, so the similarity is $1 - 1 = 0$.
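As a quick sanity check, here's that formula as a small sketch (the full implementation comes later in the post):
```python
from Levenshtein import distance


def normalized_similarity(s1: str, s2: str) -> float:
    # 1 minus the edit distance, scaled by the length of the longer string
    return 1.0 - distance(s1, s2) / max(len(s1), len(s2))


print(normalized_similarity("TALKINGHEADS", "TALKINGHEADS"))  # 1.0
print(normalized_similarity("TALKINGHEADS", "TVLKINGHEVDS"))  # 1 - 2/12 ≈ 0.833
```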
### Max Similarity
Notice how in our example we had _two_ valid answers. ANLS handles this by scoring the prediction against every accepted answer and keeping the _maximum_ similarity: the prediction only needs to match one of the gold answers.
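Here's a sketch of that step, using the same `normalized_similarity` helper as above:
```python
from Levenshtein import distance


def normalized_similarity(s1: str, s2: str) -> float:
    return 1.0 - distance(s1, s2) / max(len(s1), len(s2))


gold_answers = ["TALKINGHEADS", "TALKING HEADS"]
prediction = "TVLKINGHEVDS"

# Compare the prediction against every accepted answer and keep the best match
best = max(normalized_similarity(gold, prediction) for gold in gold_answers)
print(best)  # ≈ 0.833 (the match against "TALKINGHEADS" wins)
```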
### Thresholding
Why threshold? Without one, any answer that happens to share a few characters with a gold answer would pick up spurious partial credit. ANLS therefore zeroes out any similarity at or below a threshold (0.5 in the official setup), so an answer only scores if it's recognizably the same string modulo a few OCR errors.
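A sketch of the thresholding step. Note that `REMAIN IN LIGHT` (text from the same cover, but the wrong answer) shares several characters with the gold answer, yet still lands well below the 0.5 cutoff:
```python
from Levenshtein import distance


def normalized_similarity(s1: str, s2: str) -> float:
    return 1.0 - distance(s1, s2) / max(len(s1), len(s2))


def thresholded_similarity(s1: str, s2: str, threshold: float = 0.5) -> float:
    sim = normalized_similarity(s1, s2)
    return sim if sim > threshold else 0.0


# A near-miss OCR error keeps its partial credit...
print(thresholded_similarity("TALKINGHEADS", "TVLKINGHEVDS"))     # ≈ 0.833
# ...but an unrelated string that merely shares some characters is zeroed out
print(thresholded_similarity("TALKINGHEADS", "REMAIN IN LIGHT"))  # 0.0
```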
> [!WARNING] **Why not ROUGE?**
> There are other approximate similarity metrics from the NLP community, like ROUGE and BLEU (commonly used for summarization and translation, respectively). Both allow variance between the ground truth answer and the generated output. However, both typically use *tokens* as the minimal unit. ROUGE, for instance, tokenizes the text, applies normalization to the tokens (e.g. stemming), and then measures token overlap. A single OCR error turns `TALKINGHEADS` into a completely different token, so token-level metrics give it no credit at all; ANLS works at the character level, which is exactly the granularity at which OCR errors occur.
# Who Uses ANLS?
Here's a sampling of datasets that use ANLS as their evaluation metric:
- Scene Text VQA
- DocVQA
- InfographicsVQA
# A Fast ANLS Implementation
By relying on the `Levenshtein` package, we get a scorer that is 10x faster than the official ANLS implementation (which computes the Levenshtein edit distance in pure Python).
In the next section, we'll also make the implementation more general by averaging over the predictions when given a list of predictions (the _Average_ in ANLS). First, the core scorer:
```python
from Levenshtein import distance


def ratio(a: str, b: str) -> float:
    """Normalized Levenshtein similarity between two strings."""
    return 1.0 - distance(a, b) / max(len(a), len(b))


def anls_score(
    gold: list[str],
    prediction: str,
    threshold: float = 0.5,
) -> float:
    # Score the prediction against every accepted gold answer and keep the best
    nls_score = max(ratio(gold_instance, prediction) for gold_instance in gold)
    # Zero the score out if even the best match is at or below the threshold
    if nls_score <= threshold:
        nls_score = 0.0
    return nls_score
```
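On our running example (values approximate):
```python
print(anls_score(["TALKINGHEADS", "TALKING HEADS"], "TVLKINGHEVDS"))     # ≈ 0.833
print(anls_score(["TALKINGHEADS", "TALKING HEADS"], "REMAIN IN LIGHT"))  # 0.0
```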
# A More Complete Implementation
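Here's a minimal sketch of the generalization promised above, assuming we want `anls_score` to accept either a single prediction string or a list of predictions and to average the per-prediction scores (the _Average_ in ANLS). The `str | list[str]` signature and the use of `statistics.mean` are one way to do it, not the only one:
```python
from Levenshtein import distance
import statistics


def ratio(a: str, b: str) -> float:
    return 1.0 - distance(a, b) / max(len(a), len(b))


def anls_score(
    gold: list[str],
    prediction: str | list[str],
    threshold: float = 0.5,
) -> float:
    # Given several predictions, score each one and average the results
    if isinstance(prediction, list):
        return statistics.mean(anls_score(gold, p, threshold) for p in prediction)

    # Single prediction: best match against the gold answers, then threshold
    nls_score = max(ratio(gold_instance, prediction) for gold_instance in gold)
    return nls_score if nls_score > threshold else 0.0
```
A bare string takes the same path as before, so the single-prediction behavior (and the tests below) are unchanged.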
# Testing our ANLS Implementation
In order to test the implementation, we'll compare it against the official `anls` package in two ways: fuzz tests to check that the scores agree, and a benchmark to measure the speedup. But first, a word on why this matters.
## The Importance of Testing Your Evaluation Setup
If you work in machine learning, you've probably been burned by this before. Your evaluation setup is _your model of the real task you want to do_. If your evaluation is incorrect, you'll have little or no visibility into the actual efficacy of your approach.
I would argue that _the single most important thing you can do is test your evaluation metric_. I've seen bugs in evaluation setups of published papers that invalidate the results.
## Fuzzing
```python
def test_anls():
    """Fuzz test: our scores should match the official `anls` package."""
    from anls import anls_score as official_anls_score
    from tqdm import tqdm
    import random

    for _ in tqdm(range(1000)):
        str_len = random.randint(5, 25)
        original = "".join(chr(ord("a") + i % 26) for i in range(str_len))
        prediction = list(original)
        if random.choice([True, False]):
            # Permute characters
            for i in range(len(prediction)):
                j = i + int((len(prediction) - i) * random.random())
                prediction[i], prediction[j] = prediction[j], prediction[i]
        else:
            # Randomly add/delete characters
            for _ in range(random.randint(1, len(prediction) // 3)):
                if random.choice([True, False]) and len(prediction) > 1:
                    # Delete a random character
                    del prediction[random.randint(0, len(prediction) - 1)]
                else:
                    # Add a random character
                    prediction.insert(
                        random.randint(0, len(prediction)),
                        chr(ord("a") + random.randint(0, 25)),
                    )
        prediction = "".join(prediction)

        score1 = anls_score([original], prediction)
        score2 = official_anls_score(original, [prediction])
        assert abs(score1 - score2) < 1e-6, f"{score1} != {score2}"
```
## Benchmarking
```python
def benchmark_anls():
    """Benchmark our implementation against the official `anls` package."""
    from anls import anls_score as official_anls_score
    import random
    import timeit

    n_samples = 1_000_000
    samples = []

    # Generate test samples
    for _ in range(n_samples):
        str_len = random.randint(5, 25)
        original = "".join(chr(ord("a") + i % 26) for i in range(str_len))
        prediction = list(original)
        # Mutate the prediction with random insertions/deletions
        for _ in range(random.randint(1, len(prediction) // 3)):
            if random.choice([True, False]) and len(prediction) > 1:
                del prediction[random.randint(0, len(prediction) - 1)]
            else:
                prediction.insert(
                    random.randint(0, len(prediction)),
                    chr(ord("a") + random.randint(0, 25)),
                )
        prediction = "".join(prediction)
        samples.append((original, prediction))

    # Time our implementation
    start = timeit.default_timer()
    for original, prediction in samples:
        _ = anls_score([original], prediction)
    our_time = timeit.default_timer() - start

    # Time the official implementation
    start = timeit.default_timer()
    for original, prediction in samples:
        _ = official_anls_score(original, [prediction])
    official_time = timeit.default_timer() - start

    print(f"Our implementation: {our_time:.4f}s")
    print(f"Official implementation: {official_time:.4f}s")
    print(f"Speed ratio: {official_time / our_time:.2f}x")
```
## Running Official Tests