Localization Week 2


Published on November 27, 2007

Author: Tarzen

Source: authorstream.com


Slide1: Lecture 2 Software Localization
Dr. Gregory M. Shreve, Institute for Applied Linguistics

Slide2: Intercultural Communication

Slide3: In order to reach a large international user base it is necessary that the globalisation of a software system, web page or electronic document is done properly. The culturally and linguistically dependent parts of the software must be isolated, a process referred to as internationalisation. These parts include text manipulation and display, character-encoding methods, collation sequences, hyphenation and morphological rules, formats used for numbers and dates, as well as more subtle cultural conventions such as the use of icons, symbols and colour. The local market requirements for these items are encapsulated in the term locale.
Multicultural/Multilingual Information Delivery

Slide4: Let's say your web site features a JavaScript graphic of a cartoon character waving at the reader. You know, just...waving. Friendly gesture, right? Not in Greece. Not in Nigeria. In those nations, the palm-forward wave is a nasty gesture indeed. As is the thumbs-up signal (so common in U.S. product reviews) in Iran. Not to mention the thumb-and-index-finger "OK" sign, which is most definitely "not the OK sign in Brazil," according to Wei-Tai Kwok, president of DAE Interactive Marketing, a San Francisco-based full-service Web development company that specializes in helping U.S. companies build a global presence. Cultural assessment or cultural auditing is often necessary to understand the full requirements for the localization of any product. Such an assessment can produce the localization requirements needed for a successful project.
Understand the Culture

Slide5: Cultural Attitudes and Assumptions
The credit card is the linchpin of e-commerce in the United States. In many countries, this causes all sorts of payment trouble.
Many Germans view plastic as a crutch for weaklings who can't control their finances; they far prefer direct bank-account debit. Mark Lancaster, CEO and chair of SDL International, a globalization solutions provider based in Berkshire, England, points to currency and date formats as two obvious but easy-to-miss snags for global sites. "But the big one," Lancaster says, "is examples. Most e-commerce stuff is typically marketing oriented, and you know they really work hard to spin that stuff. But the right spin in the United States typically doesn't have the same impact elsewhere. There's a lot of baseball knowledge in the States; you might want to rethink that example in, say, the Netherlands."

Slide6: Culture And Technology
But there's culture and then there's culture intertwined with the use of technology. Hong Kong lacks any sort of ZIP code equivalent. "A guy in my office just moved here from Hong Kong," Kwok says. "The ZIP code field is a major annoyance to him. He gets all the way through an order, puts in his address, and since he doesn't put in a ZIP code, some systems ding him. Won't accept the order. He's taken to punching in 00000 and hoping for the best." Pointing out that China uses a six-digit geographic code while Great Britain and other nations use letters, Kwok advises that ZIP codes should be assigned a free-form field that accepts any input, rather than just five- or nine-digit U.S.-style codes. "The technology should be dumb enough not to be too smart," he says.

Slide7: Some Issues are Simpler
Formats / Conventions: Money, Time, Date, Postal Codes, Address Formats
Some Issues are Not Simple
Non-Language Symbologies and Allowed Usages (flags, sacred symbols), Icons, Meaning of Graphics (allowed content), Meaning of Colors
Some Issues are Very Complex Indeed
Language: Writing System, Characters, Translation (esp. humour, idiom, allusion), Document Structure

Slide8: Culture, Law And Design
In most countries, a company selling a product might feature a typical "click-here" contract page. But guess what? "In Italy and Mexico, they don't view online contracts as legally binding," says Jeff Anderson, a manager for GE TradeWeb. In those countries, potential customers are instead shown a "please contact your local office" page. Once they make contact, that local office faxes them a contract. This cuts down on efficiency and speed, of course, but the law's the law. Allison Kurlya, senior vice president and technology director at Digitas, says that "in Latin America they do a lot of business by phone calls." So a webpage that would be transaction-oriented in the United States might be a request for a fax in many European countries and a form that says, "When's the best time to call you?" in South America, Kurlya says.

Slide9: Research by Barber & Badre (2000) indicates the existence of culture-specific web design elements across web sites from various cultural origins, for which they coined the term "cultural markers". Cultural markers are design elements that are repeatedly employed in web pages, creating recognisable design patterns across culturally different web sites. For instance, Middle Eastern sites predictably tended to orientate text and graphics from right to left. This orientation makes sense to a Middle Eastern reader, whilst it would not to a Westerner. Also, across all Brazilian sites examined, in all genres, there was a noticeable preference for the use of many colours, while Lebanese sites of all genres tended to be text-oriented rather than graphics-based.
Culture Specific Design

Slide10: Colors, too, carry different meanings in different places. A few years ago, Kwok recalls, a large technology vendor was building a global website that was entirely black, a color that connotes hipness and sophistication in the United States.
When the site was unveiled, the vendor's webmaster told Kwok, "Hong Kong and China objected: Black, that means death, unlucky, morbid." DAE Interactive Marketing, like many similar agencies and services, performs a "cultural audit" on clients' sites to make sure innocent mistakes like these don't kill a company's global push.
An Example: Color

Slide11: The selection of colours on a web page carries a subtle but still significant amount of information (e.g. Badre & Barber, 2000; Gould & Marcus, 2000). Different cultures have different associations for specific colours. For instance, while in western cultures red and green represent prohibition and permission respectively, this is not true for the Chinese.

Slide12: Number formats, despite the almost general use of the Arabic symbols, vary across different regions of the world. For instance, in much of Europe a period (.) separates groups of digits in large numbers (e.g. 1.000.000) and a comma (,) separates the whole part of a figure from the decimal digits (1,5), whilst in America the reverse convention is applied.
Number
[Figure: 1.000.000 versus 1,000,000]

Slide13: Date and time formats also vary greatly across different cultures: the order in which day, month and year appear, and 12- or 24-hour time. Furthermore, some Asian countries don't use western calendars, or their calendars are based on a decade counting unit.
Date and Time

Slide14: Icons
Images are media rich in meaning, conveying messages that textual information may fail to transmit. Therefore, they must be meaningfully converted for use in another cultural context. For example, the icon commonly used by Sun Microsystems to represent an e-mail mailbox or a link for sending e-mail may be incomprehensible to users who have never seen a similar real-life mailbox. An even more common example is the widely used "home" icon. This icon mainly points to the "home" or "root" location in a web browser's taskbar or to a web site's home page.
However, some users may fail to recognise its meaning if they have not encountered "homes" like the one represented by this icon. Presumably, this will be more frequent among novice users, since constant practice on the web will result in establishing the icon's meaning.

Slide15: Image Acceptability
Furthermore, a software or web developer should consider an image or icon’s acceptability before using it, that is, whether a particular image might be offensive to another culture. More specifically, religious symbols such as stars and crosses, images of the body and body parts, images of women, and hand gestures might have a negative impact on other nations' moral perception. Characteristically, Khaslavsky (1998) reports that in Japan isolated body parts are perceived particularly negatively.

Slide16: Symbols, as well as graphics, might be incomprehensible if not properly localised. For example, whilst the cross in the Christian western world represents prohibition, in Arab countries it doesn't. Again, religious symbols or ones that are subject to misinterpretation and ambiguity should be carefully localised.
Symbols

Slide17: The way textual and graphical information is comprehensibly and meaningfully aligned on a web page varies significantly among cultures. In America a series of columns and tables will be aligned from left to right and top to bottom. In the Arab world, however, that will not make sense, because there the logical arrangement of information is from right to left. What is more, the Chinese read vertically from top to bottom and not horizontally.
Flow of Information

Slide18: Tractinsky (1997), citing Jakob Nielsen's work on international user interfaces, ascribes usability of a computerised system to five attributes: learnability, efficiency, memorability, errors and satisfaction. The latter may imply that the aesthetic impact of a web site may enhance user performance.
Indeed, Tractinsky's experiments found strong relations between aesthetics and user performance. Clearly, aesthetic perceptions are culturally dependent (Fernandes, 1994). In light of the above, one might expect that an additional adaptation may indeed be essential. A web page should comply with the culturally established aesthetic conventions of the country in question. The above general guidelines highlight the necessity to consider a wider array of design variables during the internationalisation and localisation process than the readily apparent ones.
Aesthetics

Slide19: A variety of factors must be considered when localising a web site or piece of software. Text, colours, graphics, navigation and functions need to be carefully employed to fit the target culture's mental models, patterns of thought and action, and expectations. The work of the Dutch cultural anthropologist Geert Hofstede in the 1980s provided web developers with significant results and suggestions from which to develop sets of guidelines for producing successfully localised software and web sites.
Geert Hofstede
Power-distance
Collectivism versus individualism
Femininity versus Masculinity
Uncertainty avoidance
Long- versus short-term orientation

Slide20: By far the most complex cultural issue is language. A language is a way that humans interact. In computerised form, a text in a written language can be expressed as a string of characters. The same set of characters can often be used for many written languages, and many written languages can be expressed using different scripts. Concepts like character set and encoding describe the way text is stored in computers, in files and data structures, and how applications handle such text. When you use a computer to write and file your master's thesis or your mother's Black Forest cake recipe, you produce text that you expect your computer to store, to display on your home page, or to send in e-mail.
Language, Writing and Computers

Slide21: Text consists of characters, mostly. Fancy text or rich text includes display properties like color, italics, and superscript styles, but it is still based on characters forming plain text. Sometimes the distinction between fancy text and plain text is complex, and the distinction may depend on the application. Here, we focus on plain text. So, what is a character? Typically, a letter. Also, a digit, a period, a hyphen, punctuation, and math symbols. There are also control characters (typically not visible) that define the end of a line or paragraph. There is a character for tabulation, and a few others in common use.
Language and Writing: Text
[Figure: the letter A shown as rich text and as plain text]

Slide22: The same characters are often shown with somewhat different glyphs (shapes) depending on the font used, the automatic shaping applied, or the automatic formation of ligatures. In addition, the same characters can be shown with somewhat different glyphs depending on the language being used, even within the same font or through automatic font change. Some glyphs are very different even though they represent the same abstract character, as for instance lowercase Cyrillic p.
Language and Writing: Glyph
[Figure: one character, A, shown as several glyphs]

Slide23: Characters may also take on different shapes in different contexts. So, for example, the Arabic character hah may have four different basic shapes.
Language and Writing: Glyph

Slide24: In internationalization we are concerned primarily with representing the abstract character, not its rich display formats nor its font (script) variations. In any given language-using application there is a set of characters, definable in the abstract, that can occur. This is a non-coded character set, also called an Abstract Character Repertoire.
Abstract Character Repertoire (ACR)
abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-=!@#$%^&*()_+{}[];’:”<>?,./

Slide25: Coded Character Sets (CCS)
Thus, to design a character set, you first decide how many and which characters you need. These characters are the repertoire that you will work with. Then you give each character an integer number, and you've got a character set. The result is called a Coded Character Set (CCS). Before you assign the numbers, the collection of characters is called an Abstract Character Repertoire (non-coded character set).
[Figure: Abstract Character Repertoire and Coded Character Set; A is assigned 65, its code point]

Slide26: Character Mapping
Getting from the ACR to the CCS is accomplished by a scheme that assigns the numbers to the abstract characters. This is a character mapping.
[Figure: a character mapping links the Abstract Character Repertoire to the Coded Character Set; A = 65, its code point]

Slide27: Single-Byte 8 Bit Character Set
A CCS like US-ASCII or ISO-8859-1, with 256 or fewer characters and no integer value above 255, can easily serve as a single-byte 8-bit charset where each octet of 8 bits (byte) is taken as a binary number to look up the one coded character it represents: 01000001 -> 65 -> 'A'.
[Figure: 01000001 (code unit, 8 bits = octet) -> 65 (code point) -> A (character)]

Slide28: US-ASCII maps from a set of integers to a single code unit that is 8 bits wide. The character encoding form (CEF) is a mapping from the set of integers used in a CCS to the set of sequences of code units. A code unit is an integer occupying a specified binary width in a computer architecture, such as a septet, an octet, or a 16-bit unit. An octet is a small unit of data with a numerical value between 0 and 255, inclusive. The encoding form enables character representation as actual data in a computer. There can be multiple code units of different lengths.
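The chain from abstract character to code point to code unit described in the preceding slides can be tried out directly. The following is a minimal Python sketch, not part of the original lecture; the helper name `decode_octet` is hypothetical and chosen only for illustration.

```python
# Sketch of the chain: abstract character -> code point -> code unit.
char = "A"
code_point = ord(char)                 # character mapping: 'A' -> 65
code_unit = format(code_point, "08b")  # the code point written as one octet
encoded = char.encode("ascii")         # the same octet as actual data

def decode_octet(bits: str) -> str:
    """Hypothetical helper: read an octet as a binary number, look up its character."""
    return chr(int(bits, 2))

print(code_point, code_unit, encoded, decode_octet(code_unit))
```

Running the lookup in both directions shows that, for a single-byte character set, the code point and the code unit carry the same numeric value.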
Character Encoding Form (CEF)
[Figure: the character encoding form maps an integer (code point) for a character to 1 code unit of length 8, here 01000001]

Slide29: A CEF can specify multiple code units of varying length. A character encoding form whose 1…n sequences are all of the same length is known as fixed width. A character encoding form whose 1…n sequences are not all of the same length is known as variable width.
Character Encoding Form (CEF)
The encoding form defines one of the fundamental relations that internationalized software cares about: how many code units there are for each character and what their size is. This used to be expressed in terms of how many bytes each character was represented by. With the introduction of UCS-2, UTF-16, UCS-4, and UTF-32, with wider code units for Unicode and 10646, this is generalized to two pieces of information: a specification of the width of the code unit, and the number of code units used to represent each character.

Slide30: Examples of fixed-width encoding forms other than ASCII:
7-bit: each encoded character is represented in a 7-bit quantity, as for example in ISO 646
8-bit: each encoded character is represented in an 8-bit quantity
8-bit EBCDIC: each encoded character is represented in an 8-bit quantity, with the EBCDIC conventions rather than ASCII conventions
16-bit (UCS-2): each encoded character is represented in a 16-bit quantity
32-bit (UCS-4): each encoded character is represented in a 32-bit quantity within a code space of 0..7FFFFFFF
32-bit (UTF-32): each encoded character is represented in a 32-bit quantity within a code space of 0..10FFFF
Examples of variable-width encoding forms:
UTF-8: used only with Unicode/10646; a mix of one to four 8-bit code units in Unicode and one to six code units in 10646
UTF-16: used only with Unicode/10646; a mix of one to two 16-bit code units

Slide31: Given a CCS and a CEF, one can construct a character encoding scheme (CES), a mapping of code units into serialized byte sequences. The CES provides a set of rules for mapping the code units into a stream of bytes (and back again). Most fixed-width byte-oriented encoding forms have a trivial mapping into a CES: each 7-bit or 8-bit code unit maps to a byte of the same value. A scheme based on 8 bits can represent only 256 characters. However, more complex schemes are possible. A 16-bit mapping allows more than 65,000 characters to be represented. We need more complex schemes to represent multiple languages in one character set.
Character Encoding Scheme (CES)

Slide32: Character Set Size
How big a repertoire do you need? For the English alphabet, with some digits and little more, maybe around 60 characters. The Western European Teletex standard comes with about 330 characters for its many languages. Korean has almost 12,000 syllables, and some comprehensive Chinese dictionaries list far more than 50,000 characters in their script. There are also hundreds of other characters in common use, such as math and currency symbols.

Slide33: Why Character Encoding Schemes?
Inside a computer program or data file, text is stored as a sequence of numbers, just like everything else. These sequences are integers of various sizes, values, and interpretations. Now that we know what a character is, what number is assigned to each one? A simple character such as the letter "a" may have different integer values in different programs or data files. In some instances, there may not even be a number for a certain character. The integers used for characters have different sizes, or numbers of bits.
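These three observations (different values, missing values, different sizes) can be checked with Python's built-in codecs. This is a rough sketch rather than anything from the lecture; `cp037` is one of Python's EBCDIC code pages, and the helper `has_number` is a hypothetical name used only here.

```python
# The letter "a" under several encodings: different values, different sizes.
ascii_a = "a".encode("ascii")       # one 7-bit value stored in a byte
ebcdic_a = "a".encode("cp037")      # an EBCDIC code page assigns another value
utf16_a = "a".encode("utf-16-be")   # one 16-bit code unit, big-endian
utf32_a = "a".encode("utf-32-be")   # one 32-bit code unit, big-endian

def has_number(ch: str, encoding: str) -> bool:
    """Hypothetical helper: does this encoding assign any number to ch?"""
    try:
        ch.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

for label, data in [("ASCII", ascii_a), ("EBCDIC", ebcdic_a),
                    ("UTF-16", utf16_a), ("UTF-32", utf32_a)]:
    print(f"{label:7} {data.hex()}  ({len(data) * 8} bits)")

print(has_number("\u20ac", "iso-8859-1"))  # the euro sign has no ISO-8859-1 value
```

The same abstract character therefore cannot be interpreted from its bytes alone; the reader of the data has to know which character set and encoding produced them.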
If the character is really an "ä", an "a" with dots above it, then it might be stored as two characters with two integer values: one for the "a" and one for the dots.
[Figure: A stored as 1 integer; Ä stored as 2 integers]

Slide34: Legacy Coded Character Sets
Historically, computers were pretty slow and had fairly little memory. Some of our character sets date back to that punch-card age and were designed with these cards in mind. In fact, most of the character sets that we have to this day are based on those 1960s design decisions! In the early days of computers, every computer maker invented their own machine and memory layout. At first, this wasn't a problem, because there was no Internet where everything needed to fit together -- every vendor just did what fit their customers. As a result, there was a great variety of bits per byte and bits per machine word (byte groups), and different computer architectures came with different character sets and encodings. Characters were stored with anywhere from 5 to 9 bits each.

Slide35: Legacy CCS: ASCII, EBCDIC, BAUDOT
The two character set dinosaurs that are still roaming the circuits of the networks are ASCII and EBCDIC, both from the 1960s. Where there is still a Telex (TTY) terminal, there is also the much older Baudot code. Baudot was designed for 5-bit units, ASCII for 7 bits, and EBCDIC for 8 bits. Another important legacy from those days is the fact that some of the Internet e-mail system is still only prepared to handle 7-bit bytes. Fortunately, 7-bit e-mail gateways are a dying species. Every modern computer architecture uses bytes and machine words that have at least 8 bits and whose sizes are powers of 2 (8, 16, 32, 64, and so on).

Slide36: Encodings and Byte Streams
Inside a program, it is often best to deal directly with fixed-length units according to the character set so that each unit contains a single character. When you do that, then following text forward or backward is easy -- you just always go to the next or previous unit.
When you write text into a file or send it over a network, you almost always read and write a number of bytes, and if your units are bigger than bytes, you need to transform them in a defined and reproducible way to make them fit. As we have said, this is called a character encoding scheme: the way you get characters into byte streams and, more importantly, how you interpret byte streams to get characters.

Slide37: Multiple and Variable Bytes
As we have said, the same character value can be encoded with multiple bytes, even with different bytes in different parts of the same byte stream. When the character set units fit into single bytes, the encoding is trivial and indistinguishable from the character set itself. For character sets with units that are larger than bytes, there are often several encodings to fit different needs, and one single encoding might carry characters from more than one character set to make it even more versatile. ASCII is a character set using 7-bit units, with a trivial encoding designed for 7-bit bytes. It is the most important character set out there, despite its limitation to very few characters, because its design is the foundation for most modern character sets.

Slide38: ASCII has only 95 Real Code Points
Only 95 ASCII code points are used for "real" text characters (or 94, not counting the space character). These graphic characters are mostly Latin upper- and lower-case letters, digits, and punctuation, plus some special braces, an underline, and some accent marks. It is a good base for the American market, but not for European languages with their accented letters, and it does not cover any other scripts. A code point is the number that a character code assigns to a character. The character code itself is a mapping, often presented in tabular form, which defines a one-to-one correspondence between characters in a character repertoire and a set of nonnegative integers. That is, it assigns a unique numerical code, a code point, to each character in the repertoire.
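The count of 95 can be reproduced by enumerating the 7-bit code space. The snippet below is an illustrative Python sketch under the usual definition of the ASCII graphic range (0x20 through 0x7E); the variable names are made up for this example.

```python
# Code points 0x20 (space) through 0x7E (~) are the graphic characters.
graphic = [chr(cp) for cp in range(0x20, 0x7F)]

# Everything else in the 7-bit range (0x00-0x1F and 0x7F) is a control code.
controls = [cp for cp in range(0x80) if cp < 0x20 or cp == 0x7F]

# The character code as a table: each character maps to exactly one integer.
table = {ch: ord(ch) for ch in graphic}

print(len(graphic), len(graphic) - 1, len(controls))
print("".join(graphic))
```

The printed string makes the limitation visible: there is not a single accented letter among the graphic characters.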
Slide40: The most common encodings (character encoding schemes) use a single byte per character, and they are often called single-byte character sets (SBCS). They are all limited to 256 characters. Because of this, none of them can even cover all of the accented letters for the Western European languages. Consequently, many different such encodings were created over time to fulfill the needs of different user communities. The most widely used SBCS encoding today, after ASCII, is ISO-8859-1. It is an 8-bit superset of ASCII and provides most of the characters necessary for Western Europe. A modernized version, ISO-8859-15, also has the euro symbol and some more French and Finnish letters.
Character Sets for Many Characters

Slide41: Double-byte character sets (DBCS) were developed to provide enough space for the thousands of ideographic characters in East Asian writing systems. Here, the encoding is still byte-based, but each two bytes together represent a single character. Even in East Asia, text contains letters from small alphabets like Latin or Katakana. These are represented more efficiently with single bytes. Multi-byte character sets (MBCS) provide for this by using a variable number of bytes per character, which distinguishes them from the DBCS encodings. MBCSs are often compatible with ASCII; that is, the Latin letters are represented in such encodings with the same bytes that ASCII uses. Some less often used characters may be encoded using three or even four bytes. Examples of commonly used MBCS encodings are Shift-JIS and EUC-JP (for Japanese), with up to two and three bytes per character, respectively.
Double and Multiple-Byte Sets

Slide42: The ISO 10646 Universal Character Set (UCS, Unicode) is a coded character set. Unicode is a standard, by the Unicode Consortium, which defines a character repertoire and character code intended to be fully compatible with ISO 10646, and an encoding for it.
In principle, ISO 10646 is more general in nature, and Unicode initially corresponded to the "Basic Multilingual Plane (BMP)" of ISO 10646, before characters were assigned in other "planes". In practice, people usually talk about Unicode rather than ISO 10646, partly because we prefer names to numbers.
ISO/IEC 10646

Slide43: Hundreds of encodings have been developed, each for small groups of languages and special purposes. As a result, the interpretation of text, input, sorting, display, and storage depends on knowledge of all the different types of character sets and their encodings. Programs are written either to handle one single encoding at a time and switch between them, or to convert between external and internal encodings. There is no single, authoritative source of precise definitions of many of the encodings and their names. Transferring text from one machine to another often causes some loss of information. Also, if a program has the code and the data to perform conversion between a significant subset of traditional encodings, then it carries several megabytes of data around.
Why Unicode?

Slide44: Unicode provides a single character set that covers the languages of the world, and a small number of machine-friendly encoding forms and schemes to fit the needs of existing applications and protocols. It is designed for best interoperability with both ASCII and ISO-8859-1, the most widely used character sets, to make it easier for Unicode to be used in applications and protocols. Unicode is in use today, and it is the preferred character set for the Internet, especially for HTML and XML. It is slowly being adopted for use in e-mail, too. Its most attractive property is that it covers all the characters of the world (with exceptions, which will be added in the future). Unicode makes it possible to access and manipulate characters by unique numbers -- their Unicode code points -- and use older encodings only for input and output, if at all.
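The pattern of working with Unicode code points internally and confining older encodings to input and output might look like the following Python sketch. The byte strings are made-up sample data (the word "café" in ISO-8859-1 and "日本" in Shift-JIS), and the variable names are hypothetical.

```python
# Hypothetical legacy inputs; the byte values are illustrative sample data.
latin1_bytes = b"caf\xe9"            # "café" in ISO-8859-1
sjis_bytes = b"\x93\xfa\x96\x7b"     # "日本" in Shift-JIS

# Input boundary: convert each legacy encoding to Unicode text once.
text = latin1_bytes.decode("iso-8859-1") + " " + sjis_bytes.decode("shift_jis")

# Inside the program: characters are addressed by their Unicode code points.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Output boundary: pick one encoding for serialization, here UTF-8.
utf8_bytes = text.encode("utf-8")

print(text)
print(code_points)
print(utf8_bytes)
```

Once both inputs are decoded, text from two incompatible legacy encodings sits in one string, which is the interoperability benefit the slide describes.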
Slide45: Unicode: The Last Character Set?
The Unicode standard specifies a character set and several encodings. As of early 2000, it contains almost 50,000 characters. It is an open character set, which means that it keeps growing and adding less frequently used characters. The standard assigns numbers from 0 to 0x10FFFF, which is more than a million possible numbers for characters. About 5% of this space is used. Another 5% is in preparation, about 13% is reserved for private use (anyone can place any character in there), and about 2% is reserved and not to be used for characters. The remaining 75% is open for future use but not by any means expected to be filled up. In other words, there is finally a character set with plenty of space!

Slide46: Unicode: UTF Encodings
For single characters, 32-bit integer variables are most appropriate for the value range of Unicode. For strings, however, storing 32 bits for each character takes up too much space, especially considering that the highest value, 0x10FFFF, takes up only 21 bits. 11 bits are always unused in a 32-bit word storing a Unicode code point. Therefore, you will find that software generally uses 16-bit or 8-bit units as a compromise, with a variable number of code units per Unicode code point. It is a trade-off between ease of programming and storage space. As a result, there are three common ways to store Unicode strings:
UTF-32, with 32-bit code units, each storing a single code point
UTF-16, with one or two 16-bit code units for each code point; it is extremely well designed and is the default CES for Unicode
UTF-8, with one to four 8-bit code units (bytes) for each code point

Slide47: Unicode Encoding is Rich in Information
The Unicode Standard specifies a numeric value and a name for each of its characters. In this respect, it is similar to other character encoding standards from ASCII onward.
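The pairing of a numeric value with a name can be inspected through Python's standard `unicodedata` module, which exposes data from the Unicode Character Database. This is only a small sketch; the helper `describe` is a hypothetical name introduced for the example.

```python
import unicodedata

def describe(ch: str) -> tuple[int, str]:
    """Hypothetical helper: return the code point and standard name of ch."""
    return ord(ch), unicodedata.name(ch)

for ch in "A\u00e4\u20ac":   # A, a with diaeresis, euro sign
    value, name = describe(ch)
    print(f"U+{value:04X}  {name}")
```

The lookup also works in the other direction: `unicodedata.lookup` takes a character name and returns the character.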
In addition to character codes and names, other information is crucial to ensure legible text: a character's case, directionality, and alphabetic properties must be well defined. The Unicode Standard defines this and other semantic information.

Slide48: There are Still MANY Issues: UniHan
HAN (from the Han dynasty, 206 B.C.E. to 220 C.E.): one of the set of glyphs common to Chinese (where they are called "hanzi"), Japanese (where they are called "kanji"), and Korean (where they are called "hanja"). Modern Korean, Chinese and Japanese fonts may represent a given Han character as somewhat different glyphs. However, in the formulation of Unicode, these differences were folded, in order to conserve the number of code points necessary for all of CJK. This unification is referred to as "Han Unification", with the resulting character repertoire sometimes referred to as "Unihan". It is a hot political issue and has caused problems because of the large number of ancient characters.
Examples of characters that were "unified"
