DigestAI PDF Summary

Multimodal Hierarchical Attention Networks for Abstractive Document Summarization


March 4, 2026

### The Core Problem

Abstractive document summarization — generating novel, condensed text from long documents — remains difficult for three key reasons: **document structure blindness** (most models flatten documents into sequences), **multimodal content integration** (tables, figures, and equations are ignored), and **hallucination** (models fabricate facts not present in the source). These issues limit deployment in high-stakes domains like scientific publishing and journalism.

### The Proposed Architecture: MHAN

The authors propose the **Multimodal Hierarchical Attention Network (MHAN)**, built around three integrated components. The **Hierarchical Encoder** operates simultaneously at token, sentence, and paragraph levels, using specialized attention masks and a graph neural network (GRU-based) to model inter-paragraph relationships. This contrasts with prior work that models only one or two granularity levels.
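The three-level encoding described above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: all weight matrices and query vectors are random placeholders for learned parameters, and the single gated message-passing step merely stands in for the paper's GRU-based graph network over paragraphs.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attn_pool(h, q):
    """Pool an (n, d) matrix into a (d,) vector via scaled dot-product attention."""
    scores = h @ q / np.sqrt(h.shape[1])
    return softmax(scores) @ h

d = 8
# Toy document: 2 paragraphs; each sentence is a (num_tokens, d) embedding matrix.
doc = [[rng.normal(size=(5, d)), rng.normal(size=(3, d))],
       [rng.normal(size=(4, d))]]
q_tok, q_sent = rng.normal(size=d), rng.normal(size=d)

# Level 1: tokens -> sentence vectors
sent_vecs = [np.stack([attn_pool(s, q_tok) for s in para]) for para in doc]
# Level 2: sentences -> paragraph vectors
para_vecs = np.stack([attn_pool(sv, q_sent) for sv in sent_vecs])

# Level 3: one GRU-style gated update over a fully connected paragraph graph
# (stand-in for the paper's GRU-based GNN; Wz, Uz are random placeholders).
msgs = para_vecs.mean(axis=0, keepdims=True).repeat(len(para_vecs), axis=0)
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))
z = 1 / (1 + np.exp(-(msgs @ Wz + para_vecs @ Uz)))   # update gate
para_vecs = (1 - z) * para_vecs + z * np.tanh(msgs)   # gated mixing of messages

print(para_vecs.shape)  # (2, 8)
```

The point of the sketch is the data flow: each level pools the one below it with its own attention query, so token-, sentence-, and paragraph-level representations all remain available to the decoder.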

The **Variational Attention Bridge (VAB)** is the model's central innovation. It learns a **Gaussian Mixture Model (GMM)** representation of document semantics from paragraph-level hidden states, sampling latent attention weights that stochastically route information between hierarchy levels. This extends standard VAE approaches (which use a single Gaussian prior) to better capture the multi-topic nature of long documents, whose latent semantics are multi-peaked rather than unimodal.
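The VAB's sampling step can be illustrated as follows. This is a hedged sketch under stated assumptions: the linear heads `W_pi`, `W_mu`, `W_sig` are random placeholders for learned layers, the component count and dimensions are arbitrary, and routing the sample through a dot-product attention is one plausible reading of "latent attention weights", not the paper's exact mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

K, d = 3, 8                        # mixture components, latent dimension
para_h = rng.normal(size=(4, d))   # paragraph-level hidden states

# Predict GMM parameters from the pooled paragraph state
# (W_pi, W_mu, W_sig are placeholders for learned linear heads).
h_bar = para_h.mean(axis=0)
W_pi = rng.normal(size=(d, K))
W_mu = rng.normal(size=(d, K * d))
W_sig = rng.normal(size=(d, K * d))
pi = softmax(h_bar @ W_pi)                            # mixture weights
mu = (h_bar @ W_mu).reshape(K, d)                     # component means
sigma = np.exp(0.5 * (h_bar @ W_sig).reshape(K, d))   # component std-devs

# Stochastic routing: pick a component, then draw a reparameterised sample
k = rng.choice(K, p=pi)
z = mu[k] + sigma[k] * rng.normal(size=d)

# The latent draw z induces one attention weight per paragraph
attn = softmax(para_h @ z / np.sqrt(d))
```

Because `z` is sampled rather than computed deterministically, different draws route information through different paragraphs, which is what lets the mixture cover multiple topics.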

The **Multimodal Decoder** generates summaries autoregressively by attending to all three levels of encoder representations simultaneously via a multi-scale attention mechanism. The full training objective combines **negative log-likelihood (NLL)**, **KL divergence** (against a uniform GMM prior), a **semantic consistency loss**, and a **sparsity regularization** term.
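The four-term objective can be sketched numerically as below. The weighting coefficients are illustrative, not from the paper, and the exact forms of the consistency and sparsity terms are assumptions: the paper names the terms but their precise definitions are not reproduced here. The KL term uses only the component-weight divergence against a uniform prior, since a full GMM-to-GMM KL has no closed form and is typically bounded or approximated.

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

V, T, K = 50, 6, 3                     # vocab size, summary length, GMM components
logits = rng.normal(size=(T, V))       # decoder output scores
targets = rng.integers(0, V, size=T)   # reference summary token ids
probs = softmax(logits)

# 1. Negative log-likelihood of the reference tokens
nll = -np.log(probs[np.arange(T), targets]).mean()

# 2. KL divergence of posterior mixture weights from a uniform prior:
#    KL(pi || uniform) = sum_k pi_k * log(pi_k * K)
pi_post = softmax(rng.normal(size=K))
kl = np.sum(pi_post * np.log(pi_post * K))

# 3. Semantic consistency: 1 - cosine(summary embedding, document embedding)
doc_e, sum_e = rng.normal(size=8), rng.normal(size=8)
consistency = 1 - doc_e @ sum_e / (np.linalg.norm(doc_e) * np.linalg.norm(sum_e))

# 4. Sparsity regularisation on the attention weights (L1)
attn = softmax(rng.normal(size=(T, 10)))
sparsity = np.abs(attn).mean()

# Weighted sum; the coefficients below are arbitrary illustrations
loss = nll + 0.5 * kl + 0.1 * consistency + 0.01 * sparsity
```

The design intuition: NLL fits the reference text, the KL keeps the latent mixture from collapsing onto one component, consistency ties the summary to the source semantics, and sparsity encourages the decoder to attend to few, relevant units.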

### A New Evaluation Metric: SCI

The paper introduces the **Semantic Coherence Index (SCI)** to address a key limitation of ROUGE, which rewards n-gram overlap but is insensitive to factual errors. SCI is defined as a product of entity-level overlap, sentence-embedding semantic similarity, and a hallucination penalty. MHAN achieves an average **SCI of 0.847**, versus **0.743** for the ablated baseline — a **14% relative improvement** in factual consistency.
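A toy version of the product structure can be written directly. Everything inside is an assumption beyond "SCI = entity overlap × semantic similarity × hallucination penalty": real systems would use an NER model rather than the crude regex below, and `embed_sim` is passed in as a stand-in for sentence-embedding cosine similarity.

```python
import re

def sci(summary, source, embed_sim):
    """Toy Semantic Coherence Index: entity overlap * embedding similarity
    * hallucination penalty. Component definitions here are illustrative
    assumptions, not the paper's exact formulas."""
    # Crude "entities": capitalised tokens and numbers (stand-in for NER)
    ents = lambda t: set(re.findall(r"[A-Z][A-Za-z]+|\d+(?:\.\d+)?", t))
    e_sum, e_src = ents(summary), ents(source)
    overlap = len(e_sum & e_src) / max(len(e_sum), 1)
    # Hallucination penalty: fraction of summary entities grounded in the source
    penalty = 1 - len(e_sum - e_src) / max(len(e_sum), 1)
    return overlap * embed_sim * penalty

src = "MHAN improves ROUGE by 2.3 points on arXiv and PubMed."
faithful = "MHAN improves ROUGE on arXiv."
hallucinated = "MHAN wins the Nobel Prize in 2026."
print(sci(faithful, src, embed_sim=0.9) > sci(hallucinated, src, embed_sim=0.9))  # True
```

Because the three factors multiply, a summary that fails any one dimension (e.g. many ungrounded entities) scores near zero, which is what makes the metric sensitive to hallucination in a way ROUGE is not.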

### Experimental Results

MHAN is evaluated on **arXiv** (avg. 4,918 tokens), **PubMed** (3,156 tokens), and **CNN/DailyMail** (781 tokens) — datasets spanning scientific and news domains at varying document lengths. Against baselines including PG-Net, standard Transformer, and Hi-MAP, **MHAN improves ROUGE-L by approximately 2.3% on average**. Ablation studies isolate the contribution of each component: removing the hierarchical encoder causes the largest performance drop (−2.8 ROUGE-L), followed by removal of the variational bridge (−1.3) and cross-modal fusion (−0.9).

### Why It Matters

The combination of structural awareness, probabilistic latent variable modeling, and factual consistency measurement represents a meaningful step toward **reliable, structure-aware summarization** for real-world documents. The SCI metric in particular offers a reusable tool for the broader community to benchmark hallucination reduction. Future work could extend MHAN to truly multimodal inputs (images, tables) beyond the text-structural fusion demonstrated here.
