Paper Reading - A Dataset of Alt Texts from HCI Publications
A Dataset of Alt Texts from HCI Publications
Author
Sanjana Chintalapati, Jonathan Bragg, Lucy Lu Wang, ASSETS 2022
Keywords
Accessibility, Scientific Documents, Alt Text, Dataset
WHAT
An assessment of the semantic information conveyed by author-written alt text of graph and chart figures extracted from papers
A dataset of 3386 author-written alt texts from HCI publications, of which 547 have been annotated with semantic levels
WHY
graphs and charts are of special importance to blind and low-vision readers
the vast majority of scientific figures lack alt text altogether
methods for automatically generating alt text descriptions do not apply as well to scientific images
HOW
Lundgard and Satyanarayan framework - four different levels of semantic content that may be conveyed by graphical data visualizations
Level 1: enumerating visualization construction details (e.g., type, marks, and encodings)
Level 2: identifying statistical concepts and relations (e.g., extremes and correlations)
Level 3: characterizing perceptual and cognitive phenomena (e.g., trends and patterns)
Level 4: articulating domain-specific insights or societal context.
Sampling papers and extracting author-written alt text
to construct a dataset of author-written alt text by automatically sampling and extracting alt text from papers
data sources: papers from two conferences: CHI and ASSETS, 2010-2020
procedure
convert pdf to html
extract alt text from html
filter the extracted alt text
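The last two steps of the procedure can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the PDF has already been converted to HTML and that alt text sits in the `alt` attribute of `<img>` tags. The filter rule (a minimum word count plus de-duplication) and the names `AltTextExtractor`, `extract_alt_texts`, `filter_alt_texts`, and `MIN_WORDS` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical threshold: the notes do not state the actual filter criteria.
MIN_WORDS = 3


class AltTextExtractor(HTMLParser):
    """Collect the alt attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are also routed here by HTMLParser.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt is not None:
                self.alt_texts.append(alt.strip())


def extract_alt_texts(html):
    """Return all alt strings found in the HTML, in document order."""
    parser = AltTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.alt_texts


def filter_alt_texts(alt_texts):
    """Drop very short strings and exact duplicates (assumed filter rule)."""
    seen = set()
    kept = []
    for alt in alt_texts:
        if len(alt.split()) < MIN_WORDS or alt in seen:
            continue
        seen.add(alt)
        kept.append(alt)
    return kept


if __name__ == "__main__":
    sample = (
        '<p><img src="a.png" alt="Bar chart of task completion time by condition">'
        '<img src="b.png" alt=""><img src="c.png"></p>'
    )
    print(filter_alt_texts(extract_alt_texts(sample)))
```

Keeping extraction and filtering as separate functions makes it easy to swap in stricter filter rules without touching the parser.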
Annotation of alt text semantic levels
to assess the semantic content levels present in each sentence of each piece of alt text
scispaCy NLP library (used to split each alt text into sentences)
6 label options
• Level 1: Figure logistics
• Level 2: Statistical properties and comparisons
• Level 3: Complex trends and patterns in data
• Level 4: Domain-specific insights or societal concepts to help explain Level 3 trends
• This alt text contains no levels of content
• This image is not a graph or chart
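One way to represent these six options and roll sentence-level annotations up to a whole alt text is sketched below. It assumes each sentence may carry more than one label; the names `Label`, `levels_present`, and `max_level` are hypothetical and are not taken from the paper or its released data.

```python
from enum import Enum


class Label(Enum):
    """The six annotation options; integer values mark the semantic levels."""
    LEVEL_1 = 1            # Figure logistics
    LEVEL_2 = 2            # Statistical properties and comparisons
    LEVEL_3 = 3            # Complex trends and patterns in data
    LEVEL_4 = 4            # Domain-specific insights or societal concepts
    NO_LEVELS = "none"     # Alt text contains no levels of content
    NOT_A_CHART = "other"  # Image is not a graph or chart


def levels_present(sentence_labels):
    """Union of semantic levels (1-4) over all sentences of one alt text.

    sentence_labels: one collection of Label values per sentence.
    """
    levels = set()
    for labels in sentence_labels:
        for label in labels:
            if isinstance(label.value, int):
                levels.add(label.value)
    return levels


def max_level(sentence_labels):
    """Highest semantic level in the alt text, or None if no level applies."""
    levels = levels_present(sentence_labels)
    return max(levels) if levels else None


if __name__ == "__main__":
    # Made-up annotation for a two-sentence alt text.
    annotation = [{Label.LEVEL_1}, {Label.LEVEL_2, Label.LEVEL_3}]
    print(levels_present(annotation), max_level(annotation))
```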
Results
Research questions
RQ1: What is the distribution of semantic content in author-written alt text?
RQ2: How does the distribution of semantic content in alt text change over time?
RQ3: How does length of alt text correlate with semantic levels?
Descriptive statistics
- the proportion of papers with valid alt text has improved over time
Analysis of alt text semantic content
the maximum levels of content are somewhat evenly distributed between levels 1, 2, and 3; only one or two of these levels are present in most figure alt texts
a much lower proportion contain level 2 and 3 information
there have not been significant changes to the proportion of alt text that contains level 2 and 3 information over time
- Alt text containing more levels of information tends to be longer
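The quantities behind RQ1 and RQ3 can be computed along these lines. This is a sketch under assumptions, not the authors' analysis code: each record is assumed to hold the alt text string and the set of levels found in it, length is measured in whitespace-separated words, and the function names and sample records are made up for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean


def max_level_distribution(records):
    """Share of alt texts whose highest semantic level is 1, 2, 3, or 4."""
    counts = Counter(max(r["levels"]) for r in records if r["levels"])
    total = sum(counts.values())
    return {level: counts[level] / total for level in sorted(counts)}


def mean_length_by_level_count(records):
    """Mean word count of alt texts, grouped by how many levels they contain."""
    lengths = defaultdict(list)
    for r in records:
        lengths[len(r["levels"])].append(len(r["text"].split()))
    return {n: mean(words) for n, words in sorted(lengths.items())}


if __name__ == "__main__":
    # Made-up records, for illustration only.
    records = [
        {"text": "A bar chart.", "levels": {1}},
        {"text": "A line chart of error rate over time.", "levels": {1}},
        {"text": "A bar chart; group A is highest at 40 percent.", "levels": {1, 2}},
    ]
    print(max_level_distribution(records))
    print(mean_length_by_level_count(records))
```

Grouping by the number of levels rather than by the maximum level matches the phrasing of the finding above, which relates length to how many levels an alt text contains.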
Data use
Improving author-written alt text & supporting reading interfaces
Training and evaluating NLP models for alt text generation
Discussion
The alt text and figures included in our dataset make up a biased sample
The analysis and annotations are limited to figures containing data visualizations
Did not assess whether all of levels 1–3 are necessary for an alt text to be considered complete
future work