SPICE Evaluator
SPICE metric evaluation and interactive scene-graph visualizations
The SPICE & Scene-Graph Evaluation Dashboard is an end-to-end tool for assessing image captions. It combines the official SPICE metric with interactive, force-directed visualizations of object–relation–object and object–attribute tuples.

How SPICE Works
SPICE (Semantic Propositional Image Caption Evaluation) measures how well a candidate caption captures the same meaning as one or more reference captions by breaking each sentence into atomic “facts” and comparing them. It proceeds in four main stages:
1. Dependency Parsing
   - Each caption is processed by Stanford CoreNLP, which performs tokenization, part-of-speech tagging, and builds a dependency parse tree.
   - The tree makes explicit grammatical relationships (e.g., which word is the subject of a verb, which adjective modifies which noun).
2. Semantic Tuple Extraction
   - From the dependency tree, SPICE extracts two types of tuples:
     - Object–Attribute pairs, e.g. ("dog", "brown")
     - Subject–Relation–Object triples, e.g. ("dog", "running_in", "park")
   - Each tuple represents a single, discrete proposition about the described scene.
3. Tuple Alignment with WordNet
   - SPICE aligns the candidate's tuples with those from the reference(s):
     - Exact string match (e.g., "park" ↔ "park")
     - WordNet synonym match when labels differ (e.g., "dog" ↔ "canine")
   - This ensures semantically equivalent facts are paired, even when different words are used.
   - The tool also reveals the raw tuples it extracted from your captions.

4. Precision, Recall & F₁ Computation
   - Precision = matched candidate tuples ÷ total candidate tuples
   - Recall = matched reference tuples ÷ total reference tuples
   - F₁ = 2 × (Precision × Recall) ÷ (Precision + Recall)
   - These scores reflect how accurately (precision) and completely (recall) the candidate caption covers the reference's semantic content, with F₁ as the harmonic mean.
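The score computation above can be sketched directly over tuple sets. In this simplified sketch a "match" is exact tuple equality; the real metric additionally allows WordNet-synonym matches, and `spice_scores` is a hypothetical helper name:

```python
def spice_scores(candidate: set, reference: set) -> tuple:
    """Precision, recall, and F1 over matched semantic tuples.
    Matching here is exact set intersection (no synonym handling)."""
    matched = candidate & reference
    p = len(matched) / len(candidate) if candidate else 0.0
    r = len(matched) / len(reference) if reference else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

cand = {("dog", "brown"), ("dog", "running_in", "park")}
ref = {("dog", "brown"), ("dog", "running_in", "park"), ("park", "green")}
p, r, f1 = spice_scores(cand, ref)
# p = 1.0 (both candidate tuples match), r = 2/3, f1 = 0.8
```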
Interactive Scene-Graph Visualization
Once tuples are extracted, they can be viewed as an interactive scene graph.
- Nodes represent objects or attributes.
- Edges represent relations or the “has_attr” link.
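The node/edge mapping above can be sketched with NetworkX. This is a minimal sketch, not the dashboard's actual code; `build_scene_graph` is a hypothetical helper that turns both tuple types into labeled directed edges:

```python
import networkx as nx

def build_scene_graph(tuples) -> nx.DiGraph:
    """Build a directed scene graph: objects and attributes become
    nodes; relations (or the synthetic 'has_attr' link) label edges."""
    g = nx.DiGraph()
    for t in tuples:
        if len(t) == 2:            # object-attribute pair
            obj, attr = t
            g.add_edge(obj, attr, label="has_attr")
        else:                      # subject-relation-object triple
            subj, rel, obj = t
            g.add_edge(subj, obj, label=rel)
    return g

g = build_scene_graph({("dog", "brown"), ("dog", "running_in", "park")})
```

A graph built this way can be handed to PyVis (`Network().from_nx(g)`) to get the force-directed interactive layout used in the dashboard.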

Tech Stack
- Python & Streamlit: Backend orchestration and web UI.
- Java & Stanford CoreNLP: SPICE-1.0 computation and scene-graph parsing.
- PyVis & NetworkX: Building interactive force-directed graph layouts.
- NLTK WordNet: Optional synonym matching for tuple-level evaluation.
- Conda: Manages the Python 3.11.8 environment.
Installation & Setup
For detailed installation and setup instructions, please refer to the SPICE-Evaluator repository.
Use Cases
- Research & Evaluation: Quickly benchmark captioning models with SPICE and visual tuple inspection.
- Educational Demonstrations: Show how scene-graph tuples map onto captions, and how SPICE quantifies semantic overlap.
- Quality Control: Visually inspect where a caption matches or misses elements of the ground truth.
References
[1] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic Propositional Image Caption Evaluation,” arXiv preprint arXiv:1607.08822, Jul. 2016. [Online]. Available: https://arxiv.org/abs/1607.08822
For full code, examples, and advanced configuration, see the SPICE-Evaluator GitHub Repository.