CCPP-Bench

Evaluating and Benchmarking Classical Chinese Poetry-to-Painting for Multimodal Large Language Models

Under Review

Abstract

Multimodal large language models (MLLMs) have demonstrated impressive capabilities in generating high-quality images from modern text. However, their performance in Classical Chinese Poetry-to-Painting (CCPP) generation remains underexplored, even though classical poetry is a vital and enduring part of Chinese literature. To fill this gap, we propose a novel evaluation benchmark, CCPP-Bench, which evaluates a model's painting generation conditioned on classical Chinese poems. Specifically, we first collect 1,079 correct poetry-painting pairs, along with their metadata attributes (e.g., dynasty, theme, and explanation). We then use 4 representative MLLMs to produce paintings in two input modes: (i) poetry only and (ii) poetry enhanced with an explanation. We employ a panel of 8 human experts to assess a total of 8,632 pairs of poetry and model painting (1,079 poems × 4 models × 2 input modes) along 6 key dimensions focused on visual quality, faithfulness to poetry, and cultural precision. We comprehensively investigate 6 research questions (RQs) that reveal significant progress and persistent challenges in this task. The study not only promotes the dissemination of ancient poetry culture, but also offers a multimodal creative paradigm, "poetry-to-painting", to enhance LLM evaluations. We release CCPP-Bench on Anonymous GitHub.

Overview

The CCPP project is designed to facilitate research on "Classical Chinese Poetry-to-Painting" generation. It provides a high-quality benchmark, human-created painting samples, MLLM-generated painting outputs, and tools for data processing and evaluation. The project aims to enable reproducible research on evaluating MLLMs' understanding of classical Chinese culture (poetry, idioms, classical prose) and their cross-modal generation capabilities. We currently provide a partial dataset (300 samples) together with the human paintings. The full dataset will be released upon paper acceptance.

Statistics

The distribution of keywords

Statistics of the dataset

Fine-grained statistics of CCPP-Bench (one-level), i.e., the total number of samples in terms of different meta-data. In particular, we list meta-data attributes classified into poetry-related and painting-related.

Fine-grained statistics of CCPP-Bench (two-level). Top: type (primary) and dynasty (secondary); Middle: type
(primary) and theme (secondary); Bottom: painting techniques (primary) and painting objects (secondary).

Results and Analysis

Results. There is a significant gap between MLLM and human performance. Adding explanations improves cultural and historical accuracy, but all models struggle with deep implication and cultural precision. Automatic metrics correlate poorly with human judgments, and even the stronger models fail on the challenging dimensions.
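
One way to check how well an automatic metric tracks human judgments is a rank correlation between the two score lists. The following is a minimal sketch using Spearman's rho, implemented with the standard library only. It is an illustration, not the procedure used in the paper: the function names and the input lists are hypothetical, and the source does not state which correlation measure was used.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

A value near 1 would mean the metric orders the paintings much as the experts do, while a value near 0 would match the poor agreement described above.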

Performance comparison of 4 MLLMs. We report the human-annotated scores (on a scale of 1 to 5), averaged over instances, for each model in the two input modes. The highest and lowest scores in each column are highlighted.
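
The averaging behind such a table can be sketched as follows. The record layout (model, input mode, dimension, score) and all identifiers are assumptions made for illustration; the released annotation files may be organized differently.

```python
from collections import defaultdict


def mean_scores(ratings):
    """Average 1-5 human scores per (model, mode, dimension) cell.

    `ratings` is an iterable of (model, mode, dimension, score) tuples.
    This layout is hypothetical, not the benchmark's actual file format.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [score sum, count]
    for model, mode, dimension, score in ratings:
        if not 1 <= score <= 5:
            raise ValueError(f"score out of range 1-5: {score}")
        cell = totals[(model, mode, dimension)]
        cell[0] += score
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Placeholder names; the real model and dimension names are not assumed here.
example = [
    ("model_a", "poetry_only", "dim_1", 4),
    ("model_a", "poetry_only", "dim_1", 2),
    ("model_a", "with_explanation", "dim_1", 5),
]
table = mean_scores(example)
```

Comparing the `poetry_only` and `with_explanation` cells of the same model and dimension gives the effect of adding explanations.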

Error Analysis. Ten major error types are identified, including chaotic generation, thematic irrelevance, incorrect inscription, inappropriate era-style, common-sense error, poetry misunderstanding, aesthetic error, excessive AI-style, and emotional inconsistency.

Error ratios (%) of 4 MLLMs on the whole dataset.
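
Error ratios of this kind can be computed as in the sketch below. It assumes each generated painting carries a set of zero or more error labels and that the denominator is the number of paintings produced by the model. Both points are assumptions, since the source does not describe the annotation format; under them, the ratios for one model need not sum to 100.

```python
from collections import Counter


def error_ratios(annotations):
    """Percentage of paintings flagged with each error type.

    `annotations` holds one set of error labels per generated painting;
    an empty set means no error was marked (hypothetical layout).
    """
    total = len(annotations)
    if total == 0:
        return {}
    counts = Counter(label for labels in annotations for label in labels)
    return {label: 100.0 * count / total for label, count in counts.items()}


# Four paintings from one model, labelled with error types named above.
sample = [
    {"Chaotic generation"},
    set(),
    {"Chaotic generation", "Aesthetic error"},
    set(),
]
ratios = error_ratios(sample)
```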

Error Examples

Conclusion & Contributions

This work pioneers the systematic evaluation of classical Chinese poetry-to-painting generation. CCPP-Bench lays a solid foundation for future research in multimodal evaluation, cultural generation, and humanities-oriented AI.