A Dataset of Alt Texts from HCI Publications

Author

Sanjana Chintalapati, Jonathan Bragg, Lucy Lu Wang, ASSETS 2022

Keywords

Accessibility, Scientific Documents, Alt Text, Dataset

WHAT

  • An assessment of the semantic information conveyed by author-written alt text for graph and chart figures extracted from papers

  • A dataset of 3386 author-written alt texts from HCI publications, of which 547 have been annotated with semantic levels

WHY

  • graphs and charts are of special importance to blind and low-vision users

  • the vast majority of scientific figures lack alt text altogether

  • methods for automatically generating alt text descriptions do not apply as well to scientific images

HOW

  • Lundgard and Satyanarayan framework: four different levels of semantic content that may be conveyed by graphical data visualizations

    • Level 1: enumerating visualization construction details (e.g., type, marks, and encodings)

    • Level 2: identifying statistical concepts and relations (e.g., extremes and correlations)

    • Level 3: characterizing perceptual and cognitive phenomena (e.g., trends and patterns)

    • Level 4: articulating domain-specific insights or societal context

  • Sampling papers and extracting author-written alt text

    • to construct a dataset of author-written alt text by automatically sampling and extracting alt text from papers

    • data sources: papers from two conferences: CHI and ASSETS, 2010-2020

    • procedure

      • convert pdf to html

      • extract alt text from html

      • filter the extracted alt text

  • Annotation of alt text semantic levels

    • to assess the semantic content levels present in each sentence of each piece of alt text

    • scispaCy NLP library

    • 6 label options

      • Level 1: Figure logistics

      • Level 2: Statistical properties and comparisons

      • Level 3: Complex trends and patterns in data

      • Level 4: Domain-specific insights or societal concepts to help explain Level 3 trends

      • This alt text contains no levels of content

      • This image is not a graph or chart
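The extraction procedure listed under "Sampling papers and extracting author-written alt text" can be illustrated with a minimal sketch. It assumes the PDF has already been converted to HTML by some external tool and covers only the last two steps. The names `AltTextExtractor`, `extract_alt_texts`, and `filter_alt_texts` are hypothetical, and the filter rule (drop empty, very short, or duplicate alt texts) is an assumption for illustration, not the criteria used by the authors.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collect the alt attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are also routed here by HTMLParser.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt is not None:  # skip images with no alt attribute at all
                self.alt_texts.append(alt.strip())


def extract_alt_texts(html):
    """Return all alt attribute values found in the given HTML string."""
    parser = AltTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.alt_texts


def filter_alt_texts(alt_texts, min_words=3):
    """Hypothetical filter: drop empty, very short, and duplicate alt texts."""
    seen = set()
    kept = []
    for text in alt_texts:
        if len(text.split()) < min_words or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


if __name__ == "__main__":
    page = (
        '<img src="a.png" alt="Bar chart of task completion time by condition">'
        '<img src="b.png" alt="">'
        '<img src="c.png">'
    )
    print(filter_alt_texts(extract_alt_texts(page)))
```

The `min_words` threshold is a placeholder; a real pipeline would likely need rules tuned to the placeholder strings that authoring tools insert as default alt text.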

Results

  • Research questions

    • RQ1: What is the distribution of semantic content in author-written alt text?

    • RQ2: How does the distribution of semantic content in alt text change over time?

    • RQ3: How does length of alt text correlate with semantic levels?

  • Descriptive statistics

    • the proportion of papers with valid alt text has improved over time

  • Analysis of alt text semantic content

    • the maximum levels of content are somewhat evenly distributed between levels 1, 2, and 3, but only one or two of these levels are present in most figure alt texts

    • a much lower proportion contain level 2 and 3 information

    • there have not been significant changes to the proportion of alt text that contains level 2 and 3 information over time

    • alt text containing more levels of information tends to be longer
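The quantities behind these results (levels present, maximum level, and length per alt text) can be derived from sentence-level annotations roughly as follows. This is a sketch under stated assumptions: the three annotated alt texts are made-up toy data, `summarize` and `pearson` are hypothetical helpers, length is measured in whitespace-separated words, and Pearson correlation is used for illustration only, since these notes do not state which statistic the authors used.

```python
from math import sqrt

# Toy, made-up annotations: each alt text is a list of
# (sentence, set of semantic levels assigned to that sentence).
annotated = {
    "fig_a": [("A bar chart of response time by condition.", {1})],
    "fig_b": [
        ("A line chart of accuracy over sessions.", {1}),
        ("Accuracy peaks at 92% in session 5.", {2}),
    ],
    "fig_c": [
        ("Scatter plot of age versus error rate.", {1}),
        ("The highest error rate is 40%.", {2}),
        ("Error rate rises steadily with age.", {3}),
    ],
}


def summarize(sentences):
    """Aggregate sentence-level labels into alt-text-level statistics."""
    levels = set().union(*(lv for _, lv in sentences))
    return {
        "levels": levels,
        "max_level": max(levels, default=0),  # 0 means no level present
        "n_levels": len(levels),
        "n_words": sum(len(s.split()) for s, _ in sentences),
    }


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


summaries = {fig: summarize(sents) for fig, sents in annotated.items()}
r = pearson(
    [s["n_levels"] for s in summaries.values()],
    [s["n_words"] for s in summaries.values()],
)

if __name__ == "__main__":
    for fig, s in summaries.items():
        print(fig, s)
    print("correlation between number of levels and length:", round(r, 3))
```

Tracking both `max_level` and `n_levels` matters because the two can diverge: an alt text may reach level 3 while omitting level 2, which is the distinction RQ1 draws between the maximum level and the levels actually present.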

Data use

  • Improving author-written alt text & supporting reading interfaces

  • Training and evaluating NLP models for alt text generation

Discussion

  • The alt text and figures included in our dataset make up a biased sample

  • The analysis and annotations are limited to figures containing data visualizations

  • Did not assess whether all of levels 1–3 are necessary for an alt text to be considered complete

  • future work