Improving Access to Online Information via Valid HTML Mark Up

Presented at the conference The Good, The Bad and The Irrelevant, Helsinki, 3–5 September 2003.

Carmen Marincu
RINCE
Dublin City University
Dublin, Ireland
Carmen.Marincu@rince.ie
Phone: +353-1-700-7696

Dr. Barry McMullin
RINCE
Dublin City University
Dublin, Ireland
Barry.McMullin@rince.ie
Phone: +353-1-700-5432
Fax: +353-1-700-5508



Abstract

Information and Communication Technologies are playing a progressively more important part in our day-to-day life. Perhaps the most remarkable innovation in ICT is the development of the Internet, through its power of making information universally available. The Internet brings together different genders, different generations, and different cultures, disregarding boundaries such as location or time. The Internet is probably the "application" with the most diverse group of users in the history of computing.

Among these diverse users, those with disabilities have particular opportunities to benefit. Using the Internet in conjunction with dedicated assistive technologies, tasks that were very difficult if not impossible to achieve for people with various types of disability can now be made fully accessible – at least, in principle. However, in practice, many online resources and services are still poorly accessible to those with disability due to unsatisfactory web content design.

One important aspect of achieving web content accessibility is compliance with technical inter-operability standards. A common practice in current web development is designing and testing for compatibility with a small number of “popular” browsers in “standard” configurations – rather than designing for compliance with generic technical standards for inter-operability. But, almost by definition, many users with disability need to use special purpose technologies – either “minority” browsers, or mainstream browsers with unusual configurations or coupled to specialised assistive devices. Such users are, accordingly, especially reliant on compliance with generic inter-operability standards.

This paper presents results of a survey of HTML standards compliance for two samples of web sites, one drawn from Ireland, the other from the UK. It analyses the most common HTML defects encountered, and considers their potential impact on web accessibility. It also gives some recommendations to improve HTML compliance.

A particular conclusion of the study is that the general level of HTML standards compliance in both the Irish and UK samples is very poor; and that the pattern of failure is strikingly consistent in the two samples. Although considerable efforts are being made to promote web accessibility for users with disabilities in both jurisdictions, this is certainly not yet manifesting itself in improving HTML validity.

1. Introduction

The significant benefits brought to our society by the Internet are well known. It reduces barriers of distance and time, and creates a society in which – in principle – anyone can have access to products and services all over the world at any time.

The people who could, arguably, benefit most dramatically from this are those who, because of some disability, have restricted access to information and services in the physical world. Using dedicated assistive technologies, they can have access to the online version of the desired services. For example, a blind user can "read" the online version of the daily edition of her favourite newspaper or her bank statement, a user with restricted mobility can visit virtual stores from the comfort of his home, a student with cognitive disability can take her own time in understanding the taught material. [5]

An assistive technology is any device or tool (hardware or software) which adapts a conventional system for use by a person with disability. Most assistive technologies used in browsing the Web do not operate on their own, but act as an interface between a disabled person and the “mainstream” software/hardware employed by a user with no disability to perform the same operation or action.

Depending on the specific disability the Internet user has, the assistive technologies will differ. Since the preponderant content on the web is textual or graphical, it can be said that persons with hearing impairments are the category least disadvantaged on the web, whilst those with blindness or other serious visual impairment are potentially the most affected. Between these extremes fall those with a variety of other disabilities, such as motor impairment and learning and cognitive disabilities.

Currently, there is little specific adaptive technology for deaf users. Persons with mobility impairments usually have trouble using the hardware rather than understanding the information provided online. Depending on the severity of the disability, the solutions can range from slow keys and onscreen keyboards (where the user selects the needed character on an onscreen representation of a keyboard) to more advanced switches and scanning software (using a switch activated by the movement of a body part - a head nudge, for example - as a yes/no selector, or word prediction that can save the effort of "typing in" the characters). Similarly, persons with vision impairments might use a variety of assistive technologies. A person with low vision, for example, might use screen magnification software which turns the mouse pointer into a magnifying glass; or a colour-blind person might use software to override the server-suggested colours. When it comes to blind people the assistive technology is more complex, since it has to render information provided for a visual medium in an alternative medium. Assistive technologies for blind users include braille displays and screen-readers - which are dedicated software programs that read aloud onscreen text, menus, icons and the like.

The typical practice of many web content developers is to test web site functionality only against a small number of “popular” web browser platforms in “normal” configurations. But, although the content might seem to be rendered correctly on such a platform, this does not guarantee that it is designed correctly. Most of the mainstream browsers are designed to “guess” and heuristically “repair” technical HTML defects. Thus, if the content is not rendered as expected by the author, new “adaptations” of the content, to the specific and non-standard “quirks” of a particular browser, are performed until the author is satisfied with the (purely visual) appearance of the web site. This can create a dangerous circle that not only leads to increased downloading time of the desired web site content, but dramatically decreases the chances of the same content being rendered satisfactorily on a different browser or platform. This particularly affects users with disabilities since, by definition, users of specialised assistive technologies are not using the “popular” platforms – or at least, not using them in the “normal” configurations; rather, they must depend on equipment tailored to their particular needs. As a result, it frequently happens that web sites are poorly accessible, or completely inaccessible, to such users.

This situation would be quite different if web sites were designed keeping in mind “write once, read everywhere”, which can be achieved by designing web content to meet appropriate guidelines and technical standards for interoperability. Provided both the server side and the client side conform to such guidelines and standards, the client platform can be tailored to individual user needs, and still interoperate effectively with all conforming servers.

An important source for accessible web design resources is the W3C's Web Accessibility Initiative. WAI published the Web Content Accessibility Guidelines (WCAG 1.0) in May 1999 [6]. This is now a reference point in achieving web accessibility in many of the E.U.'s Member States [3,4]. In Ireland, The National Disability Authority (NDA) have adopted WCAG 1.0 into the national “Guidelines for Web Accessibility”. In the UK, WCAG 1.0 is a source for the “Guidelines for UK Governmental web sites”, published by the Cabinet Office in May 2002.

A major step in achieving web accessibility is to use mark-up for its intended purpose when designing web content, and, in particular, conforming to relevant technical standards. WCAG 1.0 Checkpoint 3.2 expresses this thus: Create documents that validate to published formal grammars.

This checkpoint is assigned priority level 2 which is defined as follows:

A Web content developer should satisfy this checkpoint. Otherwise, one or more groups will find it difficult to access information in the document. Satisfying this checkpoint will remove significant barriers to accessing Web documents.

The WCAG guidelines are under ongoing review. The most recent public working draft of version 2.0 was published on 24th June 2003 [1]. In this version, the need for inter-operability is expressed in Guideline 4:

Use Web technologies that maximize the ability of the content to work with current and future accessibility technologies and user agents.

In this draft WCAG 2.0 the checkpoints are classified as core and extended. Each checkpoint also specifies Required Success Criteria and Best Practice items. To claim even the minimum level of conformance to WCAG 2.0, [a web] resource must satisfy all required success criteria for all Core checkpoints.

4.1 [CORE] Technologies are used according to specification. [1]

The Required Success Criteria for this checkpoint are:

  1. for markup, except where the site has documented that a specification was violated for backward compatibility, the markup has:
    1. passed validity tests of the language (whether it be conforming to a schema, Document Type Definition (DTD), or other tests described in the specification)
    2. structural elements and attributes are used as defined in the specification
    3. accessibility features are used
    4. deprecated features are avoided
  2. for Application Programming Interfaces (API's), programming standards for the language are followed.
  3. accessibility features and API's are used when available.

Thus, in this draft WCAG 2.0 the requirement for valid mark-up is not only re-iterated, but increased in effective priority (from should to must).

Prompted by these provisions of WCAG 1.0 and the draft WCAG 2.0, a study specifically regarding HTML validity of a sample of Irish and UK web sites was conducted in May 2003. This paper presents the techniques used, the key results, and an analysis of the most common HTML mark-up defects encountered.

2. Methodology

In this section the web sampling methodology and the HTML compliance evaluation methodology will be discussed. The information generated during these processes was recorded in a PostgreSQL database for further analysis.

2.1 Web Sampling

Selecting a representative sample of web sites corresponding to a certain country is not a simple process. For the purpose of this survey it was considered that an open directory would provide a good basis. When implementing the technologies used in the surveying process, the goal was to use as many automated tools as possible. The Open Directory Project (ODP) offers its content for download, structured as RDF (although it is not guaranteed to be well-formatted RDF), which made it a particularly suitable source for our purposes.

The information provided by ODP is structured in a hierarchical tree of categories (similar to the Google Web Directory; indeed, the latter is derived from ODP data). Each web site is assigned to a specific category, representing the subject of the web site as closely as possible. The web sites considered for the Irish sample were taken from the category “Ireland” and its sub-categories (online version: "http://www.dmoz.org/Regional/Europe/Ireland") and the web sites considered for the UK sample were taken from the category “United Kingdom” and its subcategories (online version: "http://www.dmoz.org/Regional/Europe/United_Kingdom"). At the time when the samples were extracted (13th February 2003), the Ireland category and its sub-categories contained 5,440 web sites and the UK category and its sub-categories contained 114,044 web sites.

Considering the significant difference in the number of web sites between the two categories, it was considered that the best approach when deciding the number of web sites in each sample would be to use a fixed fraction or percentage of the category total. On the basis of the communication, processing, and storage resources available for the study, this was set at 5%, which equates to 272 sites in the Irish sample and 5,702 sites in the UK sample. These samples were then selected primarily from the following sub-categories:

Due to the fact that web sites vary significantly in size and the type of media in which resources are offered, each web site's content was also subject to sampling, on the following basis:

The mirroring process (sampling of each web site's content) used the web content mirroring robot pavuk. 3,319 individual web pages (totaling 29 MB) were retrieved for the Irish sample and 67,598 pages (totaling 552 MB) were retrieved for the UK sample. It was considered that, in order for the survey's results to be reasonably representative for a web site, each site sample should have at least 3 web pages and at least 100kB of data. There were 9 Irish web sites and 258 UK web sites for which no data was captured, possibly due to network disruptions or other server side failure. For another 91 Irish web sites and 1,669 UK web sites only the home page was captured, indicating a failure to follow any link from the home page.
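The per-site minimum described above can be expressed as a small filter. The following is only a sketch: the `Page` record, the function name, and the reading of 100kB as 100,000 bytes are assumptions made for illustration, not details taken from the study's own tooling.

```python
from dataclasses import dataclass

MIN_PAGES = 3            # minimum number of mirrored pages per site
MIN_BYTES = 100 * 1000   # 100kB, assuming 1 kB = 1000 bytes


@dataclass
class Page:
    """One mirrored web page (hypothetical record layout)."""
    url: str
    size_bytes: int


def site_is_usable(pages):
    """Return True if a site's mirrored pages meet both minimum thresholds."""
    total_bytes = sum(page.size_bytes for page in pages)
    return len(pages) >= MIN_PAGES and total_bytes >= MIN_BYTES
```

A site for which only the home page was captured fails the page-count test regardless of how large that page is.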

Thus, in the end, the validity analysis was performed on 123 Irish sites (45% of the original sample) totaling 2,288 pages, and 2,380 UK sites (41% of the original sample) totaling 45,857 web pages.
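The sample sizes quoted in this section follow directly from the fixed 5% fraction applied to the category totals. The sketch below reproduces that arithmetic and shows one way a reproducible sample could be drawn. Uniform random selection and the seed value are assumptions made purely for illustration; the selection procedure actually used in the study is not reproduced here.

```python
import random


def sample_size(total_sites, percent=5):
    """Number of sites to draw when sampling a fixed percentage of a category."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_sites * percent // 100


def draw_sample(site_urls, percent=5, seed=0):
    """Draw a reproducible uniform random sample (an illustrative assumption)."""
    urls = list(site_urls)
    rng = random.Random(seed)
    return rng.sample(urls, sample_size(len(urls), percent))


irish_sample_size = sample_size(5440)    # Ireland category total quoted above
uk_sample_size = sample_size(114044)     # UK category total quoted above
```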

2.2 The HTML Validity Analysis

An HTML page is properly built when its mark-up conforms to a standard technical specification. Each standard is specified by a Document Type Definition (DTD) document which contains descriptions of the entities, elements and attributes that can be part of an HTML document, and how they can be interrelated. Because most of the existing web browsers are able to process web pages which don't conform to a DTD, many of the failures in the HTML code can pass unnoticed by most users. But such code defects can be a real impediment to access by users with disabilities who rely on special purpose Web browsers and dedicated assistive technologies. They also complicate, and therefore inhibit, ongoing development of such niche technologies.

The technologies used in rendering web content for people with disabilities are designed to recognise mark-up elements, interpret their functionality and deliver the web content in a form that will keep the structure and the functionality of the web content as intended by the web developer. In order for this to be achieved a properly built HTML mark-up is crucial.

There are different tools that can be used in order to validate HTML code against its description in the corresponding DTD. The output is usually a list of problems encountered (diagnostics) with suggestions as to how they could be fixed. A list of such tools can be seen on the WAI's web page at http://www.w3.org/WAI/ER/existingtools.html.

The tool chosen for the HTML compliance tests in this study is onsgmls, a parser and validator of SGML files (HTML and XHTML) which is part of the OpenSP collection of SGML/XML processing tools.

onsgmls implements 438 individual diagnostics which can be triggered when an element or attribute in the HTML content is not used according to its specification in the HTML page's DTD.
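As a rough illustration of how such diagnostics might be collected automatically, the sketch below runs onsgmls on one file and splits each reported line into fields. It assumes that diagnostics are written to standard error in the form `program:file:line:column:type: text`, and it uses the `-s` (suppress normal output) and `-c` (catalog file) options. The function names and the regular expression are illustrative only and should be checked against the OpenSP version in use.

```python
import re
import subprocess

# Assumed layout of one diagnostic line:
#   onsgmls:<file>:<line>:<column>:<type letter>: <message text>
DIAGNOSTIC_RE = re.compile(
    r"^[^:\s]+:(?P<file>.+?):(?P<line>\d+):(?P<column>\d+):"
    r"(?P<type>[A-Z]): (?P<text>.*)$"
)


def parse_diagnostics(stderr_text):
    """Turn raw validator output into a list of dictionaries, one per diagnostic."""
    diagnostics = []
    for raw_line in stderr_text.splitlines():
        match = DIAGNOSTIC_RE.match(raw_line)
        if match is None:
            continue  # skip lines that do not follow the assumed layout
        record = match.groupdict()
        record["line"] = int(record["line"])
        record["column"] = int(record["column"])
        diagnostics.append(record)
    return diagnostics


def validate_file(path, catalog):
    """Run onsgmls on one file and return its parsed diagnostics."""
    result = subprocess.run(
        ["onsgmls", "-s", "-c", catalog, path],
        capture_output=True,
        text=True,
    )
    return parse_diagnostics(result.stderr)
```

Keeping the parsing separate from the subprocess call means the parser can be exercised on saved output without the validator being installed.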

Each diagnostic is assigned a “severity level” as follows:

The validation process implemented by onsgmls involves comparing the use of each component in an HTML document with its specification in the DTD of the HTML standard used in the document. Thus, in order for onsgmls to generate consistent validation results, it needs a properly specified DTD. For example, the W3C HTML 4.01 specification states that “a valid HTML document declares what version of HTML is used in the document”. This is normally done via a DOCTYPE declaration in the beginning of the HTML page.

Accordingly, if the DOCTYPE declaration is missing (or unrecognised), the web page will immediately and automatically fail the HTML validation tests. In such a case it would be possible to configure onsgmls to assume a default DTD against which the document should be validated. However, detailed results generated by a validation of such HTML pages are not considered relevant to our study (since the document might validate against some standard DTD - had the correct one been specified).
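To make the notion of a missing DOCTYPE declaration concrete, the following sketch checks whether a page begins with a public DOCTYPE declaration and, if so, extracts its public identifier. The regular expression, the function name, and the choice to inspect only the first 2,048 characters are simplifying assumptions; a real SGML parser such as onsgmls handles comments and other edge cases that this sketch ignores.

```python
import re

# Matches e.g. <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" ...>
DOCTYPE_RE = re.compile(
    r"<!DOCTYPE\s+\w+\s+PUBLIC\s+([\"'])(?P<fpi>.+?)\1",
    re.IGNORECASE,
)


def extract_public_identifier(html_text):
    """Return the public identifier from the DOCTYPE declaration, or None."""
    # Only the start of the page is inspected (an arbitrary illustrative limit).
    match = DOCTYPE_RE.search(html_text[:2048])
    return match.group("fpi") if match else None


strict_page = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">\n'
    "<html><head><title>Example</title></head><body></body></html>"
)
```

Calling `extract_public_identifier(strict_page)` yields the HTML 4.01 Strict identifier, while a page that starts directly with `<html>` yields `None` and would be counted as having no usable document type information.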

Considering this, the final analysis and report are generally based only on the diagnostics that were triggered in web pages that contained a correct DTD in their HTML content. Missing or incorrectly specified DTD information is an HTML defect which will be considered separately from the final results.

The mapping between the HTML standard used in a document to be tested and the document type definition corresponding to that standard is made through formal public identifiers (FPIs), which are components of the DOCTYPE declaration.

For an SGML processor (such as onsgmls) the formal public identifiers and the DTD they are mapped to are usually specified in a “catalog” file. Appropriate catalog files are generally made public with each HTML standard published. In the configuration used in this study, onsgmls uses case-sensitive matching of FPIs when determining the DTD against which a document should be validated. However, it was found in practice that a significant number of declarations failed to match appropriate DTDs precisely because the FPI differed only in case from one in a known catalog. It was decided to enable validation in such cases, by manually adding additional catalog entries for such case-variant FPIs, mapping them onto the appropriate standard DTDs. The FPI case-variations permitted in this way are listed in Appendix A.
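The extra catalog entries described above could be generated mechanically rather than by hand. The sketch below compares observed identifiers against a table of known ones, ignoring case, and emits a `PUBLIC` catalog line for every case-variant. The two known identifiers are the HTML 4.01 Strict and Transitional FPIs; the local DTD file names and the function name are assumptions for illustration, and the output is not the list given in Appendix A.

```python
# Known FPIs mapped to local DTD files (file names are assumed for illustration).
KNOWN_FPIS = {
    "-//W3C//DTD HTML 4.01//EN": "strict.dtd",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "loose.dtd",
}


def catalog_entries_for_case_variants(observed_fpis, known=KNOWN_FPIS):
    """Return catalog lines for observed FPIs that differ from a known FPI only in case."""
    by_folded = {fpi.casefold(): (fpi, dtd) for fpi, dtd in known.items()}
    entries = []
    for fpi in sorted(set(observed_fpis)):
        hit = by_folded.get(fpi.casefold())
        if hit is None:
            continue  # genuinely unknown identifier: no entry is generated
        canonical, dtd = hit
        if fpi != canonical:
            entries.append('PUBLIC "%s" "%s"' % (fpi, dtd))
    return entries
```

Identifiers that match no known FPI even after case folding are deliberately left out, so pages declaring them would still be reported as having unrecognised document type information.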

In general, when analyzing the results, it was considered that if one diagnostic is triggered in at least one web page of a web site sample, the web site should be counted in statistics regarding that diagnostic. However, this rule was not applied to the "no DOCTYPE declaration" diagnostic since 98.4% of the Irish sites and 99.0% of the UK sites had at least one page with missing or unrecognised document type information. Instead, the web pages that triggered this diagnostic were eliminated from the overall sample. Then each site of the resulting sample (having now only the web pages with usable DTD information) was again tested for compliance with the minimum amount of data (100kB) and the minimum number of pages (3) required, per web site, as previously outlined.
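The site-level counting rule can be stated compactly in code: a site contributes to a diagnostic's count as soon as any one of its pages triggers that diagnostic. The sketch below assumes the page-level results are available as pairs of site name and set of diagnostic identifiers; that input layout and the function name are illustrative and do not describe the database schema used in the study.

```python
from collections import defaultdict


def site_level_rates(page_results):
    """Percentage of sites in which each diagnostic is triggered at least once.

    page_results: iterable of (site, set_of_diagnostic_ids), one item per page.
    """
    all_sites = set()
    sites_with_diagnostic = defaultdict(set)
    for site, diagnostics in page_results:
        all_sites.add(site)
        for diagnostic in diagnostics:
            sites_with_diagnostic[diagnostic].add(site)
    return {
        diagnostic: 100.0 * len(sites) / len(all_sites)
        for diagnostic, sites in sites_with_diagnostic.items()
    }
```

Because sites are collected in sets, a diagnostic triggered on many pages of the same site is still counted once for that site.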

3. Key Results

Of the web sites studied, only one Irish site and one UK site had completely valid HTML mark-up.

Of the 2,288 web pages in the Irish sample, and the 45,857 pages in the UK sample, analysed for HTML validity:

Of the 45 Irish web sites (of 123 web sites considered for tests) and 843 UK web sites (of 2,380 web sites considered for tests) having at least 3 pages with usable DTD information:

As can be seen, although the Irish and UK samples differed significantly in size, the overall pattern of results is remarkably similar.

4. Representative HTML Defects

By far, the most common HTML mark-up defect triggered in the validation tests was the absence or the misconstruction of the document type information [98.4% of the Irish sites and 99.0% of the UK sites]. A correctly specified document type declaration is obviously crucial in the validation process of a web page. But more importantly than this, when the document type information is correctly specified in an HTML document, the web browser knows how the document is constructed and its content and functionality can therefore be rendered consistently and as intended. An HTML document without a usable DTD is a challenge to web browsers to behave consistently or reliably because the mark-up structure is unpredictable.

Moreover, starting with Internet Explorer 5.0 on Macintosh (released in March 2000), there is now a trend for the mainstream browsers to implement standards-compliant behaviour more strictly. This means that the way the web content is rendered by the browser depends on the precise document type which is declared. The document type has to be specified using a correctly structured DOCTYPE declaration in each web page to be rendered. For backwards compatibility a feature called doctype switching is sometimes implemented. Depending on whether the web page contains a correctly structured document type declaration, the content may be rendered either according to the HTML standard specified – standards mode – or in a backwards-compatible way, also known as quirks mode. But the behaviour in quirks mode differs, in general, from browser to browser, version to version, and platform to platform – that’s what quirks means! So the behaviour cannot be predicted when the browser and the operating system are not known. Furthermore, the doctype switching feature is not guaranteed to be kept for long, and it can be predicted that future browser implementations will drop it in favour of standards-mode-only rendering.

As mentioned before, in the absence of a usable document type declaration, the validation results are not consistent, and the web pages that triggered this diagnostic were removed from further, more detailed analysis. In the end, the most common remaining HTML mark-up defects were considered based only on the results of the validation of 454 web pages over 45 Irish web sites and 8,598 web pages over 843 UK web sites, i.e., only those pages and sites with usable DOCTYPE information.

The HTML compliance tests triggered 35 distinct diagnostics on the sites in the Irish sample and 48 distinct diagnostics on the sites in the UK sample. Although the number of distinct diagnostics triggered by the UK sample is larger than that of the Irish one, the 10 most common diagnostics are the same for both samples, and their ranking by relative frequency differs by an average of just 2 places. The 10 most common diagnostics, and the percentage of web sites in which each diagnostic is triggered at least once, are shown in Fig. 1.

Fig. 1 The 10 most common HTML diagnostics
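One way to quantify how closely two rankings agree is the mean absolute difference in position of each item. The sketch below assumes both rankings contain the same diagnostics, listed from most to least frequent; the function name and the placeholder identifiers in the usage note are hypothetical, and this is not necessarily the calculation used to obtain the figure quoted in the text.

```python
def mean_rank_shift(ranking_a, ranking_b):
    """Mean absolute difference in position between two rankings of the same items."""
    position_in_b = {name: index for index, name in enumerate(ranking_b)}
    shifts = [
        abs(index - position_in_b[name])
        for index, name in enumerate(ranking_a)
    ]
    return sum(shifts) / len(shifts)
```

For example, if two rankings of four placeholder diagnostics differ only in that the first two items are swapped, two items move by one place and two do not move, giving a mean shift of 0.5 places.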

The most common 5 distinct diagnostics, triggered at least once in the validation process of sites in both samples, were:

5. Future Research

The HTML survey presented in the current paper is part of the “Web Accessibility Reporting Project” (WARP) carried out in the eAccessibility lab at RINCE, Dublin City University. The overall project encompasses both HTML validity testing and evaluation against the wider WCAG guidelines.

The W3C WCAG guidelines are referenced in the EU Information Society policies for web accessibility so they should be considered in one way or another when web accessibility policies are developed in all EU Member States. In order to see how these policies are implemented, similar studies will be conducted in the future on web samples from other EU Member States, beyond the UK and Ireland.

6. Conclusion

Although the number of sites in each sample was very different (the UK sample being roughly 20 times larger than the Irish one), the results of the surveys were remarkably similar. This level of similarity could be explained by the fact that it is very likely that web content authors in both countries use the same range of existing web authoring tools to generate content. Unfortunately, determining the web authoring tool used to generate a specific web page is not straightforward, so the conjecture that the similar results are due mostly to the usage of similar web authoring tools cannot be directly tested.

The results of the survey showed a disappointingly low level of validity, especially since there is effort invested in promoting web accessibility in both countries sampled here.

Most of the HTML defects on the sites studied are probably not apparent to the majority of web users – because the developers have specifically tested them against some (small) selection of popular browser platforms. However, because they do not conform to technical standards for interoperability, their rendering is – at best – unpredictable. This is likely to have a disproportionate effect on users who rely on specialised, tailored, client technologies – specifically, users with disabilities. Content may thus fail to be rendered, may be garbled, or may be otherwise inaccessible to such users. Worse, precious development effort in individualising assistive technologies may have to be spent on attempting to compensate for these server side defects, rather than improving the client side functionality that the user really needs. In the worst case, this effort may have to be wasted repeatedly for each different client accessing each different (non-compliant) server. By contrast, conformance to technical interoperability standards would substantially reduce or eliminate this waste.

Web content authors should therefore become familiar with the HTML standard technical specifications and use one of the existing HTML validation tools in order to ensure that web content is valid before making it public online. These tools not only list the defects encountered but also give references as to how these defects might be repaired. One such tool is the W3C's online HTML Validator which is easy to use, doesn't require any local installations and can validate either online or uploaded web content.

As seen from the analysis of the results, most of the defects can be repaired relatively easily. Although it might seem that it would involve a considerable amount of work, many of the detected defects are inter-related – so that correcting one substantive HTML code defect could eliminate a number of reported diagnostics.

Of course, many web page authors use HTML authoring tools or content management systems, VLE/MLE etc. In such cases the HTML mark-up may be hidden from the author. If the HTML mark-up is found to be invalid, the authors can easily be discouraged from providing valid HTML code: they may have no knowledge or understanding of the HTML specification, and hence lack the means to repair the HTML code (even if the authoring tool allowed such intervention – which it commonly does not!). In this case it is strongly recommended that the authors raise the accessibility issue with the web content authoring tool developer or vendor, maybe even providing the validation results.

This survey was conducted with the purpose of emphasising the importance of valid HTML mark-up in universal access to online information. The results do not represent the exact level of web accessibility on the Irish or UK sites, but they demonstrate a widespread lack of concern with technical interoperability. While it may be argued that the results are still based only on samples of sites, the fact that samples with such different numbers of sites generated essentially the same results suggests that this situation (level of compliance with technical standards) is probably typical of the web as a whole in these countries. This is disappointing because it decreases considerably the potential that the web could offer for significant improvement in service and opportunity for users with disabilities.

Finally, although valid HTML mark-up is an important step in having an accessible web there are still many other things to be considered. Web publishers should thus certainly not settle for simply having valid HTML content. The WAI's WCAG 1.0 presents a list of other guidelines which, if accomplished, can lead to genuinely universal access to information.

Bibliography

  1. B. Caldwell, W. Chisholm, J. White, G. Vanderheiden, Web Content Accessibility Guidelines 2.0 (WCAG 2.0) W3C Working Draft, World Wide Web Consortium (W3C), http://www.w3c.org/TR/WCAG20, 24 June 2003, accessed 10-07-2003
  2. D. Raggett, A. Le Hors, I. Jacobs, HTML 4.01 Specification, World Wide Web Consortium (W3C), http://www.w3.org/TR/html401, 24 December 1999, accessed 11-02-2003
  3. Information Society, eAccessibility: EEurope Targets 2001/2002, European Commission, http://europa.eu.int/information_society/eeurope/action_plan/eaccess/member_states/targets_2001_2002/index_en.htm, accessed 6-02-2003
  4. Information Society, eAccessibility: Participation for all in the knowledge based economy, European Commission, http://europa.eu.int/information_society/eeurope/action_plan/eaccess, accessed 6-02-2003
  5. J. Brewer, How People with Disabilities Use the Web, World Wide Web Consortium (W3C), http://www.w3.org/WAI/EO/Drafts/PWD-Use-Web, 4 January 2001, accessed 11-02-2003
  6. W. Chisholm, G. Vanderheiden, I. Jacobs, Web Content Accessibility Guidelines 1.0 (WCAG 1.0), World Wide Web Consortium (W3C), http://www.w3c.org/TR/WCAG10, 5 May 1999, accessed 11-02-2003
  7. J. Clark, Building Accessible Websites, New Riders Publishing, October 2002, ISBN 0-7357-1150-X

Appendix A: Case-variations of Formal Public Identifiers
