Multimodal Hierarchical Attention Networks for Abstractive Document Summarization

The Core Problem

Abstractive document summarization — generating novel, condensed text from long documents — remains difficult for three key reasons: document structure blindness (most models flatten documents into sequences), multimodal content integration (tables, figures, and equations are ignored), and hallucination (models fabricate facts not present in the source). These issues limit deployment in high-stakes domains like scientific publishing and journalism.

The Proposed Architecture: MHAN

The authors propose the Multimodal Hierarchical Attention Network (MHAN), built around three integrated components. The Hierarchical Encoder operates simultaneously at token, sentence, and paragraph levels, using specialized attention masks and a graph neural network (GRU-based) to model inter-paragraph relationships. This contrasts with prior work that models only one or two granularity levels.
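
As an illustration only (the paper's exact layer composition is not given in this summary), a three-level encoder of this shape might be wired as below: Transformer layers for the token and sentence levels, mean pooling between levels, and a single GRU-based message-passing step over a paragraph adjacency matrix. The layer names, dimensions, and the equal-sized-paragraph grouping are all assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Illustrative three-level encoder: tokens -> sentences -> paragraphs."""

    def __init__(self, vocab_size=30000, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        tok_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.token_enc = nn.TransformerEncoder(tok_layer, num_layers=2)
        sent_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.sent_enc = nn.TransformerEncoder(sent_layer, num_layers=2)
        # GRU cell used for one round of message passing between paragraph nodes.
        self.para_gru = nn.GRUCell(d_model, d_model)

    def forward(self, token_ids, para_adj):
        # token_ids: (n_sents, max_tokens); para_adj: (n_paras, n_paras) adjacency.
        tok = self.token_enc(self.embed(token_ids))           # token-level contextualization
        sent = tok.mean(dim=1)                                # pool tokens -> sentence vectors
        sent = self.sent_enc(sent.unsqueeze(0)).squeeze(0)    # sentence-level attention
        # Toy grouping: assume equal-sized paragraphs purely for illustration.
        n_paras = para_adj.size(0)
        para = sent.view(n_paras, -1, sent.size(-1)).mean(dim=1)
        # One GRU-based message-passing step: each paragraph takes the mean of its
        # neighbors' states as input and its own state as the hidden state.
        msgs = para_adj @ para / para_adj.sum(dim=1, keepdim=True).clamp(min=1)
        para = self.para_gru(msgs, para)
        return tok, sent, para

enc = HierarchicalEncoder()
token_ids = torch.randint(0, 30000, (8, 20))   # 8 sentences, 20 tokens each
adj = torch.ones(2, 2)                         # 2 fully connected paragraphs
tok, sent, para = enc(token_ids, adj)
print(tok.shape, sent.shape, para.shape)
```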

The Variational Attention Bridge (VAB) is the model's central innovation. It learns a Gaussian Mixture Model (GMM) representation of document semantics from paragraph-level hidden states, sampling latent attention weights that stochastically route information between hierarchy levels. This extends standard VAE approaches (which use simple Gaussian priors) to better capture the multimodal, multi-topic nature of long documents.
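
A minimal sketch of what such a bridge could look like, assuming the paragraph states are pooled into a document vector that parameterizes GMM mixture weights, means, and variances, with one component sampled and reparameterized into a latent that softly weights the three hierarchy levels. The parameterization, dimensions, and hard component sampling (rather than, say, a Gumbel-softmax relaxation) are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalAttentionBridge(nn.Module):
    """Illustrative GMM-based latent bridge between hierarchy levels."""

    def __init__(self, d_model=256, n_components=4, d_latent=64):
        super().__init__()
        self.n_components = n_components
        self.mix_logits = nn.Linear(d_model, n_components)
        self.means = nn.Linear(d_model, n_components * d_latent)
        self.log_vars = nn.Linear(d_model, n_components * d_latent)
        self.to_attn = nn.Linear(d_latent, 3)  # weights over token/sentence/paragraph levels

    def forward(self, para_states):
        # para_states: (n_paras, d_model); pool to a single document-level vector.
        doc = para_states.mean(dim=0)
        pi = F.softmax(self.mix_logits(doc), dim=-1)           # mixture weights
        mu = self.means(doc).view(self.n_components, -1)       # per-component means
        logvar = self.log_vars(doc).view(self.n_components, -1)
        # Sample a component, then reparameterize within it (hard sampling here;
        # training would typically need a relaxation or soft mixture).
        k = torch.multinomial(pi, 1).item()
        eps = torch.randn_like(mu[k])
        z = mu[k] + eps * torch.exp(0.5 * logvar[k])
        # The latent sample stochastically routes attention mass across the three levels.
        level_weights = F.softmax(self.to_attn(z), dim=-1)
        return z, level_weights, (pi, mu, logvar)

bridge = VariationalAttentionBridge()
z, level_weights, gmm_params = bridge(torch.randn(2, 256))
print(level_weights)  # three stochastic level weights summing to 1
```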

The Multimodal Decoder generates summaries autoregressively by attending to all three levels of encoder representations simultaneously via a multi-scale attention mechanism. The full training objective combines negative log-likelihood (NLL), KL divergence (against a uniform GMM prior), a semantic consistency loss, and a sparsity regularization term.
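
Written out, the stated objective is a weighted sum of those four terms; the weighting coefficients below are placeholders, since the summary does not give their values:

$$\mathcal{L} = \mathcal{L}_{\text{NLL}} + \beta\, D_{\mathrm{KL}}\!\big(q_\phi(z \mid x)\,\|\,p_{\text{uniform-GMM}}\big) + \lambda_{1}\,\mathcal{L}_{\text{consistency}} + \lambda_{2}\,\mathcal{L}_{\text{sparsity}}$$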

A New Evaluation Metric: SCI

The paper introduces the Semantic Coherence Index (SCI) to address limitations of ROUGE, which does not measure factual accuracy. SCI is defined as a product of entity-level overlap, sentence-embedding semantic similarity, and a hallucination penalty. MHAN achieves an average SCI of 0.847, versus 0.743 for the ablated baseline — a 14% relative improvement in factual consistency.
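
Because the summary specifies SCI only as a product of three factors, a toy composition can be sketched as below. In practice the entity lists, embedding vectors, and hallucinated-claim counts would come from an NER model, a sentence encoder, and a claim-verification step; the function names here are hypothetical.

```python
def entity_overlap(source_entities, summary_entities):
    """Fraction of summary entities that also appear in the source."""
    if not summary_entities:
        return 1.0
    return len(set(summary_entities) & set(source_entities)) / len(set(summary_entities))

def semantic_similarity(source_vec, summary_vec):
    """Cosine similarity between document- and summary-level embeddings."""
    dot = sum(a * b for a, b in zip(source_vec, summary_vec))
    norm = (sum(a * a for a in source_vec) ** 0.5) * (sum(b * b for b in summary_vec) ** 0.5)
    return dot / norm if norm else 0.0

def hallucination_penalty(n_hallucinated, n_claims):
    """Multiplicative penalty: 1.0 with no fabricated claims, shrinking toward 0."""
    if n_claims == 0:
        return 1.0
    return 1.0 - n_hallucinated / n_claims

def sci(source_entities, summary_entities, source_vec, summary_vec,
        n_hallucinated, n_claims):
    """Semantic Coherence Index as the product of the three factors."""
    return (entity_overlap(source_entities, summary_entities)
            * semantic_similarity(source_vec, summary_vec)
            * hallucination_penalty(n_hallucinated, n_claims))

score = sci(["MHAN", "arXiv", "ROUGE"], ["MHAN", "ROUGE"],
            [0.2, 0.9, 0.4], [0.25, 0.8, 0.5],
            n_hallucinated=0, n_claims=5)
print(round(score, 3))
```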

Experimental Results

MHAN is evaluated on arXiv (avg. 4,918 tokens), PubMed (3,156 tokens), and CNN/DailyMail (781 tokens) — datasets spanning scientific and news domains at varying document lengths. Against baselines including PG-Net, standard Transformer, and Hi-MAP, MHAN improves ROUGE-L by approximately 2.3% on average. Ablation studies isolate the contribution of each component: removing the hierarchical encoder causes the largest performance drop (−2.8 ROUGE-L), followed by removal of the variational bridge (−1.3) and cross-modal fusion (−0.9).

Why It Matters

The combination of structural awareness, probabilistic latent variable modeling, and factual consistency measurement represents a meaningful step toward reliable, structure-aware summarization for real-world documents. The SCI metric in particular offers a reusable tool for the broader community to benchmark hallucination reduction. Future work could extend MHAN to truly multimodal inputs (images, tables) beyond the text-structural fusion demonstrated here.
