So, You Want an Accessibility Score?
Oct 26, 2021
Note: This blog was originally published for Tenon, a premier integrated accessibility testing company that was acquired by Level Access in November 2021. Read more about Tenon and the Level Access acquisition.
By: Karl Groves, Chief Innovation Officer
We’re often asked if the Tenon.io platform has the ability to give a “grade”. Currently, it does not, largely because I personally have hang-ups on how to do so in a way that accurately reflects how usable a product is for users with disabilities. Unless a grade accurately reflects how a system performs for people with disabilities, it will have no real value for anything other than vanity metrics.
Creating a grade for something is extremely simple: Divide the “passed things” by the “total things”, multiply that quotient by 100, then apply the following grouping to the result:
A: 90-100
B: 80-89
C: 70-79
D: 60-69
F: 59 and lower
If you subject your web site to 20 accessibility tests and you pass 15 of them, you get a 75%, which falls under a C grade. Done.
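In code, that naive approach is only a few lines. Here is a minimal sketch in TypeScript using the A-F thresholds listed above; the test counts are purely illustrative and do not reflect any real tool’s scoring:

```typescript
// A minimal sketch of the naive pass/total grading approach described above.
// The thresholds mirror the A-F buckets listed earlier.
function letterGrade(passed: number, total: number): { score: number; grade: string } {
  const score = (passed / total) * 100;
  let grade: string;
  if (score >= 90) grade = "A";
  else if (score >= 80) grade = "B";
  else if (score >= 70) grade = "C";
  else if (score >= 60) grade = "D";
  else grade = "F";
  return { score, grade };
}

// 15 of 20 tests passed -> 75%, which lands in the C bucket.
console.log(letterGrade(15, 20)); // { score: 75, grade: "C" }
```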
Challenges
In terms of grading something for accessibility, there are a ton of things wrong with the above idea.
What is the basis for measuring pass vs. fail?
Currently, most automated testing tools are unable to give a reliable score because they track nothing but failures. Most testing tools have no concept of passing other than by virtue of not failing. In other words, a “pass” condition is created either by not failing the test or by the test being irrelevant.
While there is value in getting a score based on the extent (or lack thereof) of your accessibility errors, it lacks context.
Getting a useful score requires knowing:
- What tests were relevant?
- Of those tests that were relevant, which ones passed, and which ones failed?
While some may claim that irrelevant things are a “pass”, I find this to be spurious logic. An irrelevant thing can neither pass nor fail because it doesn’t meet the criteria to do either. To use a computer programming analogy, an irrelevant test would be `null`, as an irrelevant thing cannot be “true” (pass) or “false” (fail).
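To make that concrete, here is a rough sketch of the tri-state idea in TypeScript. The data shapes are hypothetical and not a description of any tool’s actual API; the point is simply that irrelevant tests drop out of the denominator rather than counting as passes:

```typescript
// Each test outcome is true (pass), false (fail), or null (not applicable).
type Outcome = boolean | null;

// Only applicable tests count toward the score; a page with nothing relevant
// to test gets no score at all rather than a perfect one.
function applicableScore(outcomes: Outcome[]): number | null {
  const applicable = outcomes.filter((o): o is boolean => o !== null);
  if (applicable.length === 0) return null;
  const passed = applicable.filter((o) => o).length;
  return (passed / applicable.length) * 100;
}

// Two passes, one fail, two not-applicable -> 2/3 ≈ 66.7%,
// not 4/5 = 80% (which is what "irrelevant counts as a pass" would give you).
console.log(applicableScore([true, true, false, null, null]));
```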
We built this capability into Mortise.io and are moving in this direction with Tenon.io. Each test has specific criteria that determine whether it is applicable, and specific instructions for determining whether the applicable portions have passed or failed. Without this, any “grade” supplied will be inaccurate.
What is the effect of user impact on the grade?
A raw pass-vs-fail score is fine if everything you’re testing for has the same impact, but accessibility doesn’t work that way: different issues have very different levels of impact for users.
This is very hard to gauge with automation. As I so often say when discussing overlays, it is easy to find images without text alternatives, but it is much harder to determine whether a text alternative is accurate and informative. To make things worse, in cases where the text alternative is wrong, how wrong is it? What is the negative impact of that wrong text alternative? Does it cause the user to miss important information that isn’t conveyed any other way on the page, or is its absence not really a big deal?
In addition, some issues impact multiple user types, and those impacts may also vary. How does that play into a score? Should the relative severity of the problem across user types be additive or multiplicative?
At the moment, we do not factor this into the Accessibility Grade generated in Mortise.io but rather into the Prioritization score for each issue (Mortise and Tenon use the same Prioritization scheme). In other words, our approach has been to consider any issue that impacts a user as a failure; the Priority score is simply a measure of the urgency with which you should fix each issue, so that your remediation efforts quickly have a high positive impact for users. That said, I remain open to the idea that this portion of our priority scoring should be its own metric that contributes to the Accessibility Grade, but that brings its own set of challenges that I’ll skip for now.
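To illustrate the additive-vs-multiplicative question, here is a purely hypothetical sketch in TypeScript. The user groups and severity values are made up for illustration and do not describe our actual Prioritization scheme:

```typescript
// Hypothetical per-group severities for a single issue, from 1 (minor) to 5 (blocking).
const severitiesByUserGroup = { screenReader: 5, lowVision: 2, motor: 1 };

const severities = Object.values(severitiesByUserGroup);

// Additive: each affected group contributes independently to the total.
const additive = severities.reduce((sum, s) => sum + s, 0); // 8

// Multiplicative: issues that affect several groups at once are amplified.
const multiplicative = severities.reduce((product, s) => product * s, 1); // 10

console.log({ additive, multiplicative });
```

The choice matters: an additive score grows linearly as more user groups are affected, while a multiplicative score escalates sharply for issues that hit many groups at once.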
Should we consider the volume of issues?
At its most basic, the more issues a system has, the lower its quality. In the context of accessibility, the same is true: the higher the number of accessibility problems, the lower the accessibility grade should be. However, a raw issue count isn’t useful without additional context. This is where Defect Density comes in. Quite simply, it takes into consideration the number of issues vs. the size of the page.
The logic for Density’s importance is pretty straightforward: a simple web page with a lot of issues is worse than a complex web page with the same number of issues. Imagine, for a moment, if you tested the Google.com homepage and got 100 issues and then tested MSNBC.com and got 100 issues. Based solely on issue count vs. page size, the Google.com home page performs worse.
Tenon was the first accessibility testing tool to provide Density as a metric for Web Accessibility. In traditional QA, Defect Density is based on lines of code and is measured per 1,000 lines of code (KLOC). Because Web pages may have many blank lines, we use kilobytes of source code as the comparison.
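As a rough illustration, here is one way such a calculation could look in TypeScript. The exact arithmetic is an assumption for the sake of the example (issues divided by kilobytes of page source, expressed as a percentage), not a statement of the formula Tenon uses:

```typescript
// Assumed formula for illustration: issues per kilobyte of source, as a percentage.
function defectDensity(issueCount: number, sourceBytes: number): number {
  const kilobytes = sourceBytes / 1024;
  return (issueCount / kilobytes) * 100;
}

// The same 100 issues look far worse on a small, simple page than on a large one.
console.log(defectDensity(100, 50 * 1024));  // 200%
console.log(defectDensity(100, 800 * 1024)); // 12.5%
```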
In practice, we’ve found a strong correlation between Density and usability: pages that exceed 50% Density are significantly more difficult for users to deal with. As density increases, so does the likelihood that users will be completely unable to use the content and features of that page, which raises the question of whether Density is the true metric upon which we should base a grade.
Should we consider the comparison between pages?
At this point, Tenon.io has assessed millions of pages on the Web and logged tens of millions of errors. This is more than enough data for us to calculate any data point we want with a statistically significant sample size, a confidence level of 99%, and a confidence interval of 1. Given that, we could provide users with a comparison of their performance against all other Web pages ever tested.
One way to do that is to provide a grade based on the norm, or, put another way, a grade in comparison against all of the other pages that have ever been tested. One common example of this is grading “on a curve”.
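Here is a quick sketch of what that kind of norm-based comparison could look like: rank a page’s error count against the distribution of counts from previously tested pages. The function and the sample numbers are made up for illustration:

```typescript
// What percentage of previously tested pages had MORE errors than yours?
function betterThanPercent(yourErrors: number, pastErrorCounts: number[]): number {
  const worse = pastErrorCounts.filter((count) => count > yourErrors).length;
  return (worse / pastErrorCounts.length) * 100;
}

// With a made-up sample roughly shaped like the stats below, a page with 40
// errors "beats" more than half the field despite still having 40 real problems.
const pastErrorCounts = [0, 12, 35, 83, 120, 250, 4841];
console.log(betterThanPercent(40, pastErrorCounts)); // ≈ 57.1
```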
Unfortunately, the “normal” page is pretty bad. Take a look at these error stats from Tenon.io:
- Min Errors: 0
- Max Errors: 4841
- Average Errors: 83
- Min Density: 0%
- Max Density: 460%
- Average Density: 14.7%
In addition to the average of 83 issues per page, the average density of 14.7% suggests that most pages on the Web are quite bad. When it comes to grading for accessibility, it doesn’t seem useful to base a grade on a norm when that norm is, itself, not acceptable.
How do we score a project, as a whole?
There are several layers to consider in a scoring scenario:
- The component: an individual feature of a page or application screen, such as its navigation.
- The page: the entire page or application screen and all of its components.
- The product: the entire collection of pages or screens that make up the product.
Getting a grade on a component (or, better, a series of components) is extremely useful in determining the urgency with which you need to make repairs. Getting a grade on a page is a bit less useful, in my opinion, without any specific means of identifying the “value” of the page. A per-page grade is, of course, simple, but an “A” grade on an inconsequential page is less important than getting “A” grades on pages that see the most traffic from users (including any specific features/documentation/help for users with accessibility concerns).
Identifying the relative importance of a page can be quite useful, though I’m not sure whether we’d want that as part of the grade or part of the priority. Adding the page’s importance to Priority would allow us to make smarter decisions on which errors should be fixed sooner whereas adding it to a score does not feel as useful.
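For illustration only, here is one way a product-level score could weight per-page grades by relative importance (traffic share, in this made-up example). This is an assumption about one possible approach, not a description of how Mortise.io or Tenon.io calculate anything:

```typescript
// A page's score plus its relative importance (e.g. share of user traffic).
interface PageResult {
  score: number;  // 0-100
  weight: number; // relative importance, e.g. traffic share
}

// Weighted average: heavily used pages dominate the product-level score.
function productScore(pages: PageResult[]): number {
  const totalWeight = pages.reduce((sum, p) => sum + p.weight, 0);
  const weightedSum = pages.reduce((sum, p) => sum + p.score * p.weight, 0);
  return weightedSum / totalWeight;
}

// A perfect grade on a rarely visited help page barely moves the needle;
// a poor grade on the home page drags the whole product down.
console.log(productScore([
  { score: 100, weight: 0.05 }, // help page
  { score: 60, weight: 0.70 },  // home page
  { score: 85, weight: 0.25 },  // checkout
])); // 68.25
```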
This assumes we have a complete set of relevant tests
Whether the assessment being run is automated or manual, the relevance of the grade is directly tied to the completeness and relevance of the test set. In the context of automated testing, it is already well known that automated testing tools cannot test for every possible accessibility best practice. It definitely pays to use a product that has a large number of tests. For example, Tenon.io has 189 tests in production. Using a product with fewer tests reduces your ability to generate an accurate and relevant grade.
The target grade must be an “A”
Getting a grade that you can look at and immediately understand where your system stands regarding accessibility is an attractive idea. Provided you’re using the right data in the right ways, it should be relatively straightforward to get a grade that is useful.
Accessibility is too often seen as a compliance domain which needs to be tracked. As a result, organizations race toward whatever bare-minimum grade they need to attain in order to stop being concerned about it. For instance, if an organization happens to regard a “B” as good enough, then that will be their target and they will pursue accessibility no further.
This approach to a “score” is misleading and dangerous. A score’s value should be solely in measuring your distance from a goal and that goal should be full compliance with WCAG.
Conformance to a standard means that you meet or satisfy the ‘requirements’ of the standard. In WCAG 2.0 the ‘requirements’ are the Success Criteria. To conform to WCAG 2.0, you need to satisfy the Success Criteria, that is, there is no content which violates the Success Criteria. (https://www.w3.org/WAI/WCAG21/Understanding/conformance)
The at-a-glance ability to see a score and intuitively understand how far away you are from getting a perfect grade is super valuable. Getting a score and choosing a less-than-perfect grade as “good enough” is dangerous when it comes to Accessibility.
Ultimately there’s only one true metric
There is, however, a much more important metric when it comes to measuring accessibility: Will users with disabilities *want* to use the product?
The WCAG standard itself states:
Although these guidelines cover a wide range of issues, they are not able to address the needs of people with all types, degrees, and combinations of disability. (https://www.w3.org/TR/WCAG21/)
The real measure requires interacting with real users, watching them use your product, and asking them one of three questions:
- If you are not a current user of this product, would you want to use it?
- If you are a current user of this product, would you want to continue to use it?
- If you are a former user of this product, would you come back to use it?
Automated and manual testing are extremely useful for finding potential problems in your product. Only usability testing with real users can tell you if you’ve gotten it right.
To learn more, engage with our team today.