The "lang" attribute and Classical Latin

Recently I was faced with a question on which I had no hard data. Fortunately, within the circle of international colleagues I work with who specialize in web accessibility, I was able to gather some interesting information and opinions, which I share here.

Update

After the original posting of this White Paper, it was noted that the example shown was both Latin (more specifically, Medieval Latin) and Middle English. As such, the correct markup would be: <span lang="la"> Incipit:</span> <span lang="enm">Holi writ haþ a liknesse to tre þat bereþ noote oþer appel</span>. Obviously this will further affect the decision process of the project in question, as I am most certain that there are currently no speech synthesizers for Middle English, although the other reasons given for using the "lang" attribute remain valid. As well, while Medieval Latin and Classical Latin can differ, there is currently only one ISO language code covering both, namely "la".
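
As a minimal sketch of how the corrected markup might sit in a complete page, the following assumes an English-language page whose default language is declared once on the html element, with the Latin and Middle English runs marked inline. The page title and the surrounding English sentence are hypothetical and are there for illustration only; the two span elements are the ones given above.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html lang="en">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <!-- Hypothetical title, for illustration only -->
  <title>Manuscript description (example)</title>
</head>
<body>
  <!-- The English sentence inherits lang="en" from the html element -->
  <p>The text opens as follows:
    <span lang="la">Incipit:</span>
    <span lang="enm">Holi writ haþ a liknesse to tre þat bereþ noote oþer appel</span>
  </p>
</body>
</html>
```

Declaring the default language once on the root element means that only the passages departing from English need their own "lang" attribute.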

The Problem:

A web page (or, more specifically, a series of web pages) written in English also features extensive tracts of Classical Latin text - text originating from the 12th and 13th centuries. The W3C WCAG 1.0 guidance states: "Clearly identify changes in the natural language of a document's text and any text equivalents (e.g., captions)." (Priority 1, Checkpoint 4.1)

The question, however, was whether the non-trivial task of marking up these Latin texts to meet the WCAG requirement would deliver a worthwhile return on investment. (<span lang="la"> Incipit: Holi writ haþ a liknesse to tre þat bereþ noote oþer appel</span>)

There was a possibility that doing so would still not serve a key constituency, screen reader users, in any practical way, as it was not clear that a Latin speech synthesizer even existed. As well, "Screen readers without Unicode support will read a character outside Latin-1 as a question mark, and even in the latest version of JAWS, the most popular screen reader, Unicode characters are very difficult to read." [1]

Opinions and Facts:

The facts and opinions prompted by the question centered on the following points:

Screen Readers / Screen Reading Software

The number of languages supported by JAWS (the leading screen reading software package in the marketplace) is not limited to the list at the software vendor's website [2], as local distributors, for example Freedom Scientific Benelux, can deliver a JAWS version with a speech synthesizer for Dutch. It could not be determined, however, whether a JAWS speech synthesizer for Latin currently exists. [3] There is nonetheless a sizable cottage industry of JAWS scripters who could add support for the characters even if it is not currently available. Given that this is an academic project, it is conceivable that a blind researcher may wish to tap into this scripting resource to add the capability, if documents exist that would become more accessible with the investment.

Even if there were no speech synthesis available for a language, screen readers like JAWS can announce language changes and users can associate particular voice configurations with particular languages.

Looking beyond JAWS, Classical Latin [4] is among the current MBROLA voices [5] available. It is therefore (at least theoretically) usable with at least some screen readers and text-to-speech software, e.g. NVDA [6], FreeTTS (used by FireVox) [7], and Emacspeak [8].

Typesetting / Alternate Usage

As this is an academic project, it might be more important to mark up the language correctly for reasons other than accessibility. It is possible to machine-process words or even phrases in various useful ways, e.g. for machine translation. Such processing is significantly more successful if you know for sure what language you are dealing with.

For example, if a user opened your HTML page in a word processor such as Microsoft Word, it would use the language markup, and this can be relevant when spelling checks are "on", i.e. words classified as misspelled are highlighted. Declaring Latin words as Latin prevents the program from applying English spelling rules to them. (A copy of Word tested for this seemed to be Latin-ignorant. That is, it recognized the words as being in Latin but did not flag anything as misspelled and did not even hyphenate Latin words. This is probably better than treating them as English or some other language.)

Even when the language markup is correct however, search engines (such as Google) and related tools do not necessarily use this information today. One respondent found web pages in Dutch, with correct language markup, that still showed up in search results even when he explicitly asked Google to return only pages in English.

Character support is a different issue: it should not depend on language markup, and mostly doesn't. Generally, in special software like screen readers or specialized browsers, we should expect character support to be more restricted than in common modern browsers. Even Latin-1 isn't as safe as in "normal" browsing. For example, what would a screen reader do upon encountering a special character like "¶"? Would it recognize it as having a special meaning (paragraph separator) and pause? It probably spells it out. This might mean saying "pilcrow sign", perhaps independently of the language being used (since character names aren't widely localized - most characters don't even have a name in most languages), which might be complete gibberish even to people who understand normal English.

Current and Future Technology

Style sheets, either page or user style sheets, could be used to style words in a particular language differently from others, using a selector like [lang="la"] or :lang(la). However, this does not work in all browsers: IE 6, for example, does not recognize such selectors. On some browsers, like Firefox, the user can right-click on a word and get information about its language*. Finally, some day some browsers or other software could make real use of the markup.

(* Firefox users can test this directly from this page: one of the contributors to this document works at Katholieke Universiteit Leuven - place your mouse over this name, right click and choose 'Properties')
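
The two selectors mentioned above can be sketched as a small style sheet. This is an illustrative example only: the italic and border styling is an arbitrary assumption, and the Latin sample text is made up. One difference between the two is worth noting: :lang(la) matches any element whose language is Latin, including elements that inherit it from an ancestor, whereas [lang="la"] matches only elements that carry the attribute themselves.

```html
<!-- In the document head (or in a user style sheet, without the style tags) -->
<style type="text/css">
  /* Matches every element whose language is Latin, including
     descendants that inherit it (here, both the p and the em). */
  :lang(la) { font-style: italic; }

  /* Matches only elements that carry lang="la" themselves
     (here, only the p). */
  [lang="la"] { border-left: 2px solid gray; padding-left: 0.5em; }
</style>

<!-- In the document body: hypothetical sample text -->
<p lang="la">Incipit <em>liber primus</em></p>
```

Browsers that do not recognize these selectors simply skip the rules, so the page still renders, only without the language-specific styling.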

Conclusion

While current support for foreign languages such as Latin remains minimal in 2008, there do exist at least some compelling reasons to consider marking up existing content using the "lang" attribute. Beyond strict conformance to a W3C WCAG Priority 1 requirement, future-proofing the content and enhancing its usability suggest that the return on investment can be justified. The final call, however, remains with the content owner.

A special note of thanks goes out to the following contributors, who provided much of this information, and have been quoted (often verbatim) in this white paper:


  1. http://en.wikipedia.org/wiki/Wikipedia:Accessibility
  2. http://www.freedomscientific.com/fs_products/software_jawsinfo.asp
  3. http://lists.w3.org/Archives/Public/w3c-wai-gl/2005AprJun/0097.html
  4. http://tcts.fpms.ac.be/synthesis/mbrola/demo/la1.wav - NOTE: male voice, 188K Wav file
  5. http://tcts.fpms.ac.be/synthesis/mbrola.html
  6. http://www.nvda.fr/spip.php?article14
  7. http://freetts.sourceforge.net/
  8. http://web.mit.edu/ATIC/src/emacspeak-9.0/mbrola