Paper Reading - A Dataset of Alt Texts from HCI Publications
A Dataset of Alt Texts from HCI Publications
Author
Sanjana Chintalapati, Jonathan Bragg, Lucy Lu Wang, ASSETS 2022
Keywords
Accessibility, Scientific Documents, Alt Text, Dataset
WHAT
- An assessment of the semantic information conveyed by author-written alt text of graph and chart figures extracted from papers
- A dataset of 3386 author-written alt texts from HCI publications, of which 547 have been annotated with semantic levels
WHY
- graphs and charts are of special importance to blind and low-vision readers, who rely on alt text to access figure content
- the vast majority of scientific figures lack alt text altogether
- methods for automatically generating alt text do not apply as well to scientific images
HOW
- Lundgard and Satyanarayan framework: four different levels of semantic content that may be conveyed by graphical data visualizations
  - Level 1: enumerating visualization construction details (e.g., type, marks, and encodings)
  - Level 2: identifying statistical concepts and relations (e.g., extremes and correlations)
  - Level 3: characterizing perceptual and cognitive phenomena (e.g., trends and patterns)
  - Level 4: articulating domain-specific insights or societal context
 
- Sampling papers and extracting author-written alt text: to construct a dataset of author-written alt text by automatically sampling and extracting alt text from papers
  - data sources: papers from two conferences, CHI and ASSETS, 2010-2020
  - procedure:
    - convert PDF to HTML
    - extract alt text from HTML
    - filter the extracted alt text
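The extract-and-filter steps of the procedure can be sketched in Python as follows. This is only an illustration using the standard-library HTML parser: the notes do not record which converter or filter rules the authors used, so `AltTextExtractor`, `is_valid_alt`, the placeholder list, and the minimum word count are all hypothetical.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collect the alt attribute of every <img> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.alt_texts = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <img ... /> tags are also routed here by HTMLParser.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt is not None:
                self.alt_texts.append(alt.strip())


# Hypothetical filter rule (not taken from the paper): drop empty strings,
# generic placeholders, and very short alt text.
PLACEHOLDERS = {"", "image", "figure"}


def is_valid_alt(alt, min_words=3):
    return alt.lower() not in PLACEHOLDERS and len(alt.split()) >= min_words


def extract_alt_texts(html):
    """Return the alt texts in an HTML string that pass the filter."""
    parser = AltTextExtractor()
    parser.feed(html)
    parser.close()
    return [alt for alt in parser.alt_texts if is_valid_alt(alt)]
```

Calling `extract_alt_texts` on the HTML produced from one paper would return the candidate alt texts for that paper; the PDF-to-HTML conversion itself is outside this sketch.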
 
 
- Annotation of alt text semantic levels: to assess the semantic content levels present in each sentence of each piece of alt text
  - scispaCy NLP library
  - 6 label options:
    - Level 1: Figure logistics
    - Level 2: Statistical properties and comparisons
    - Level 3: Complex trends and patterns in data
    - Level 4: Domain-specific insights or societal concepts to help explain Level 3 trends
    - This alt text contains no levels of content
    - This image is not a graph or chart
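One way to picture the sentence-level annotation is as a small data structure that is then aggregated per alt text. The sketch below is an assumption about how such labels could be stored, not the authors' schema: the names `Label`, `SentenceAnnotation`, `levels_present`, and `max_level` are hypothetical, and it assumes a sentence may carry more than one label.

```python
from dataclasses import dataclass, field
from enum import Enum


class Label(Enum):
    """The six label options offered to annotators."""
    LEVEL_1 = 1          # figure logistics
    LEVEL_2 = 2          # statistical properties and comparisons
    LEVEL_3 = 3          # complex trends and patterns in data
    LEVEL_4 = 4          # domain-specific insights or societal concepts
    NO_LEVELS = "none"   # alt text contains no levels of content
    NOT_A_CHART = "not_chart"  # image is not a graph or chart


@dataclass
class SentenceAnnotation:
    sentence: str
    labels: set = field(default_factory=set)


def levels_present(annotations):
    """Union of the numeric levels found across all sentences of one alt text."""
    levels = set()
    for ann in annotations:
        for label in ann.labels:
            if isinstance(label.value, int):
                levels.add(label.value)
    return levels


def max_level(annotations):
    """Highest level present in the alt text, or None if no level applies."""
    levels = levels_present(annotations)
    return max(levels) if levels else None
```

Keeping labels at the sentence level and aggregating afterwards makes it possible to report both which levels appear in an alt text and the maximum level it reaches.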
 
 
Results
- Research questions
  - RQ1: What is the distribution of semantic content in author-written alt text?
  - RQ2: How does the distribution of semantic content in alt text change over time?
  - RQ3: How does length of alt text correlate with semantic levels?
 
- Descriptive statistics: the proportion of papers with valid alt text has improved over time
 
- Analysis of alt text semantic content
  - the maximum levels of content are somewhat evenly distributed between levels 1, 2, and 3, but only one or two of these levels are present in most figure alt texts
  - a much lower proportion contain level 2 and 3 information; there have not been significant changes to the proportion of alt text that contains level 2 and 3 information over time
- Alt text containing more levels of information tends to be longer
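The kind of summary behind RQ1 and RQ3 can be sketched as below. This is a minimal illustration, not the authors' analysis code: it assumes each annotated alt text is a dict with a hypothetical `"text"` string and a `"levels"` set of integers, and it measures length in whitespace-separated words, which may differ from the measure used in the paper.

```python
from collections import Counter, defaultdict
from statistics import mean


def max_level_distribution(records):
    """Share of alt texts whose highest annotated level is 1, 2, 3, or 4.

    Records with an empty "levels" set are skipped.
    """
    counts = Counter(max(r["levels"]) for r in records if r["levels"])
    total = sum(counts.values())
    return {level: counts[level] / total for level in sorted(counts)}


def mean_length_by_level_count(records):
    """Mean word count of alt texts, grouped by how many levels they contain."""
    lengths = defaultdict(list)
    for r in records:
        lengths[len(r["levels"])].append(len(r["text"].split()))
    return {n: mean(values) for n, values in sorted(lengths.items())}
```

If alt texts with more levels tend to be longer, the values returned by `mean_length_by_level_count` should increase with the key.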
Data use
- Improving author-written alt text & supporting reading interfaces 
- Training and evaluating NLP models for alt text generation 
Discussion
- The alt text and figures included in our dataset make up a biased sample 
- The analysis and annotations are limited to figures containing data visualizations
- Did not assess whether all of levels 1–3 were necessary for an alt text to be considered complete
- future work 
