SPICE Evaluator

SPICE metric evaluation and interactive scene-graph visualizations

The SPICE & Scene-Graph Evaluation Dashboard is an end-to-end tool for assessing image captions. It combines the official SPICE metric with interactive, force-directed visualizations of subject–relation–object and object–attribute tuples.

Live dashboard showing SPICE metrics, logs, tuple previews, and interactive scene graphs.

How SPICE Works

SPICE (Semantic Propositional Image Caption Evaluation) [1] measures how well a candidate caption captures the meaning of one or more reference captions by breaking each sentence into atomic “facts” and comparing them. It proceeds in four main stages:

  1. Dependency Parsing

    • Each caption is processed by Stanford CoreNLP, which performs tokenization, part-of-speech tagging, and builds a dependency parse tree.
    • The tree makes grammatical relationships explicit (e.g., which word is the subject of a verb, which adjective modifies which noun); a minimal parsing sketch of this stage appears after the list below.
  2. Semantic Tuple Extraction

    • From the dependency tree, SPICE extracts two types of tuples:

      • Object–Attribute pairs, e.g.
        ("dog", "brown")
        
      • Subject–Relation–Object triples, e.g.
        ("dog", "running_in", "park")
        
    • Each tuple represents a single, discrete proposition about the scene described.

  3. Tuple Alignment with WordNet

    • SPICE aligns the candidate’s tuples with those from the reference(s):

      1. Exact string match (e.g. “park” ↔ “park”)
      2. WordNet synonym match when labels differ (e.g. “dog” ↔ “canine”)
    • This ensures semantically equivalent facts are paired even when the captions use different words.
    • The dashboard also displays the raw tuples extracted from each caption, so you can inspect exactly what was matched.

  4. Precision, Recall & F₁ Computation

    • Precision = matched candidate tuples ÷ total candidate tuples
    • Recall = matched reference tuples ÷ total reference tuples
    • F₁ = 2 × (Precision × Recall) ÷ (Precision + Recall)
    • These scores reflect how accurately (precision) and completely (recall) the candidate caption covers the reference’s semantic content, with F₁ as their harmonic mean.
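
In the dashboard itself, stage 1 is performed by Stanford CoreNLP running under Java. Purely as an illustration of stages 1–2, the sketch below uses the stanza Python package instead (an assumption of this example, not part of the project’s stack; stanza comes from the same Stanford NLP group and produces comparable dependency parses):

    import stanza

    stanza.download("en")  # one-time model download (assumes network access)
    nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

    doc = nlp("A brown dog is running in the park.")
    sent = doc.sentences[0]

    # Stage 1: inspect the raw dependency parse
    for word in sent.words:
        head = sent.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text:8} --{word.deprel}--> {head}")

    # A toy version of stage 2: adjectival modifiers ("amod" edges)
    # become object–attribute pairs
    pairs = [
        (sent.words[w.head - 1].lemma, w.lemma)
        for w in sent.words
        if w.deprel == "amod" and w.head > 0
    ]
    print(pairs)  # expect something like [('dog', 'brown')]

The real SPICE extractor handles many more patterns (verbs, prepositions, copulas) to produce subject–relation–object triples as well.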
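
Stages 3–4 can likewise be sketched in a few lines of Python. This deliberately simplifies the official Java implementation (no lemmatization, plural handling, or tuple typing), and the example tuples are hypothetical; it only illustrates the matching rules and the formulas above:

    from nltk.corpus import wordnet as wn  # requires a one-time nltk.download("wordnet")

    def lemma_names(word):
        """Collect WordNet lemma names for a word, plus the word itself."""
        return {l.name().lower() for s in wn.synsets(word) for l in s.lemmas()} | {word.lower()}

    def tuples_match(a, b):
        """Slots match exactly or by sharing a WordNet lemma (synonym match)."""
        return len(a) == len(b) and all(
            x == y or lemma_names(x) & lemma_names(y) for x, y in zip(a, b)
        )

    def spice_scores(candidate, reference):
        """Precision, recall, and F1 over matched tuples, per the formulas above."""
        matched_cand = sum(any(tuples_match(c, r) for r in reference) for c in candidate)
        matched_ref = sum(any(tuples_match(r, c) for c in candidate) for r in reference)
        p = matched_cand / len(candidate) if candidate else 0.0
        r = matched_ref / len(reference) if reference else 0.0
        return p, r, (2 * p * r / (p + r) if p + r else 0.0)

    cand = [("car", "red"), ("car", "parked_on", "street")]
    ref = [("automobile", "red"), ("car", "parked_on", "street"), ("street", "busy")]
    print(spice_scores(cand, ref))  # "car"/"automobile" pair via WordNet: P=1.0, R≈0.67, F1=0.8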

Scene-Graph Visualization

Once tuples are extracted, they can be viewed as an interactive scene graph.

  • Nodes represent objects or attributes.
  • Edges represent relations or the “has_attr” link.

PyVis graphs for the candidate caption (left) and reference caption (right).
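
A rough sketch of how such a graph can be assembled with NetworkX and PyVis (the tuple list and output filename here are illustrative, not the dashboard’s actual code):

    import networkx as nx
    from pyvis.network import Network

    tuples = [("dog", "brown"), ("dog", "running_in", "park")]  # hypothetical extracted tuples

    g = nx.DiGraph()
    for t in tuples:
        if len(t) == 2:   # object–attribute pair
            obj, attr = t
            g.add_edge(obj, attr, label="has_attr")
        else:             # subject–relation–object triple
            subj, rel, obj = t
            g.add_edge(subj, obj, label=rel)

    # Render as an interactive, force-directed HTML page
    net = Network(height="500px", width="100%", directed=True)
    net.from_nx(g)
    net.save_graph("scene_graph.html")

Opening scene_graph.html in a browser gives the draggable, force-directed view shown in the screenshots.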

Tech Stack

  • Python & Streamlit: Backend orchestration and web UI.
  • Java & Stanford CoreNLP: SPICE-1.0 computation and scene-graph parsing.
  • PyVis & NetworkX: Building interactive force-directed graph layouts.
  • NLTK WordNet: Optional synonym matching for tuple-level evaluation.
  • Conda: Manages the Python 3.11.8 environment.

Installation & Setup

For detailed installation and setup steps, please refer to the SPICE-Evaluator repository.

Use Cases

  • Research & Evaluation: Quickly benchmark captioning models with SPICE and visual tuple inspection.
  • Educational Demonstrations: Show how scene-graph tuples map onto captions, and how SPICE quantifies semantic overlap.
  • Quality Control: Visually inspect where a caption matches or misses elements of the ground truth.

References

[1] P. Anderson, B. Fernando, M. Johnson, and S. Gould, “SPICE: Semantic Propositional Image Caption Evaluation,” arXiv preprint arXiv:1607.08822, Jul. 2016. [Online]. Available: https://arxiv.org/abs/1607.08822

For full code, examples, and advanced configuration, see the SPICE-Evaluator GitHub Repository.