A Technical Seminar Report submitted to the Faculty of Computer Science and Engineering
Accredited by NBA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Date:
CERTIFICATE
This is to Certify that the Technical Seminar report on Unicode and Multilingual Computing is a bonafide work done by M.MEDHA (09R11A0594) in partial fulfillment of the requirement of the award for the degree of Bachelor of Technology in Computer Science and Engineering J.N.T.U.H, Hyderabad during the year 2012 - 2013.
Technical Seminar Co-Ordinator HOD-CSE (Mr. P. Srinivas) (Prof. Dr. P.V.S. Srinivas) Sr. Associate Professor
ABSTRACT
Today's global economy demands global computing solutions. Instant communication across continents and computer platforms characterizes a business world at work 24 hours a day, 7 days a week. The widespread use of the Internet and e-commerce continues to create new international challenges. More and more, users demand a computing environment that suits their own linguistic and cultural needs. They want applications and file formats they can share around the world, interfaces in their own language, and local time and date displays. Essentially, users want to write and speak at the keyboard the way they write and speak in the office. Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Before Unicode, many different encoding systems were used to assign these numbers, and even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use. These encoding systems also conflict with one another: two encodings can use the same number for two different characters, or different numbers for the same character. Any given computer (especially a server) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data runs the risk of corruption. Unicode addresses this by providing a unique number for every character.
Contents
Chapters
1. Introduction
2. The Present Context
3. Urbanization pattern in India
4. Importance of integrating urban development and land use control
5. Importance of network and modal integration
6. Urbanization Inevitable and Desirables
7. Urban Transport and City efficiency
8. How Administrative lessons relate to urban transport
11. Environmental impact of Urban Transport
12. Urban Rail Transit Architecture
13. Conclusion
14. Future Scope
15. List of Abbreviations
16. References
Unicode
In most writing systems, keyboard input is converted into character codes, stored in memory, and converted to glyphs in a particular font for display and printing. The collection of characters and character codes forms a codeset. To represent characters of different languages, different codesets are used. A character code in one codeset, however, does not necessarily represent the same character in another codeset. For example, the character code 0xB1 is the plus-minus sign (±) in Latin-1 (the ISO 8859-1 codeset), capital BE (Б) in Cyrillic (the ISO 8859-5 codeset), and does not represent anything in Arabic (the ISO 8859-6 codeset) or Traditional Chinese (CJK unified ideographs). In Unicode, every character, ideograph, and symbol has a unique character code, eliminating any confusion between character codes of different codesets. In Unicode, multiple codesets need not be defined. Unicode represents characters from most of the world's languages as well as publishing characters, mathematical and technical symbols, and punctuation characters. This universal representation for text data has been further enhanced and extended in the latest release of Unicode: The Unicode Standard, Version 3.0.
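The 0xB1 example above can be reproduced with a short Python sketch. It is only an illustration: the codec names (`iso8859_1`, `iso8859_5`, `iso8859_6`) are those of Python's standard library, and the helper `decode_byte` is a hypothetical name introduced here, not something from the Solaris tools described in this report.

```python
# Decode the single byte 0xB1 under three ISO 8859 codesets.
CODESETS = ["iso8859_1", "iso8859_5", "iso8859_6"]


def decode_byte(value, codeset):
    """Return the character a byte maps to in a codeset, or None if unassigned."""
    try:
        return bytes([value]).decode(codeset)
    except UnicodeDecodeError:
        # The codeset defines no character at this position.
        return None


results = {name: decode_byte(0xB1, name) for name in CODESETS}

for name, char in results.items():
    if char is None:
        print(f"{name}: 0xB1 is unassigned")
    else:
        # The Unicode code point is the same no matter which codeset
        # the byte came from.
        print(f"{name}: 0xB1 -> {char!r} (U+{ord(char):04X})")
```

The same byte yields two different Unicode code points and one decoding error, which is the ambiguity a single universal codeset removes.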
PROCEDURE
Sun Microsystems defines the following levels at which an application can support a customer's international needs:
Internationalization
Localization
Software internationalization is the process of designing and implementing software to transparently manage different linguistic and cultural conventions without additional modification. The same binary copy of an application should run on any localized version of the Solaris operating environment, without requiring source code changes or recompilation.
Software localization is the process of adding language translation (including text messages, icons, buttons, and so on), cultural data, and components (such as input methods and spell checkers) to a product to meet regional market requirements.
The Solaris operating environment is an example of a product that supports both internationalization and localization. The Solaris operating environment is a single internationalized binary that is localized into various languages (for example, French, Japanese, and Chinese) to support the language and cultural conventions of each language.
Unicode (Universal Codeset) is a universal character encoding scheme developed and promoted by the Unicode Consortium, a non-profit organization whose members include Sun Microsystems. The Unicode standard encompasses most alphabetic, ideographic, and symbolic characters. Using one universal codeset enables applications to support text from multiple scripts in the same document without elaborate tagging. However, applications must treat Unicode as just another codeset, applying codeset independence to Unicode as well. Unicode locales are invoked the same way and function the same way as all other locales in the Solaris operating environment. These locales provide the extra benefits that the Unicode codeset brings to the work environment, including the ability to create text in multiple scripts without having to switch locales. Sun Microsystems provides the same level of Unicode locale support for both 32-bit and 64-bit Solaris environments.
Benefits of Unicode
Support for Unicode provides many benefits to application developers, including:
Global source and binary.
Support for mixed-script computing environments.
Improved cross-platform data interoperability through a common codeset.
Space-efficient encoding scheme for data storage.
Reduced time-to-market for localized products.
Expanded market access.
Developers can use Unicode to create global applications. Users can exchange data more freely using one flat codeset, without elaborate code conversions to interpret characters. In the Solaris operating environment internationalization framework, Unicode is "just another codeset." By designing for codeset independence, applications can handle different codesets without extensive code rework to support specific languages.
In recent years, the Unicode Consortium and other related organizations have developed different formats to represent and store a Unicode codeset. To represent characters from all major languages in multibyte format, the ISO/IEC International Standard 10646-1 (commonly referred to as 10646) has defined the Universal Multiple-Octet Coded Character Set (UCS) format. Character forms contained in the 10646 specifications are:
Universal Coded Character Set-2 (UCS-2), also known as Basic Multilingual Plane (BMP)--characters are encoded in two bytes on a single plane.
Universal Coded Character Set-4 (UCS-4)--characters are encoded in four bytes on multiple planes and multiple groups.
UCS Transformation Format 16-bit form (UTF-16)--an extended variant of UCS-2 with characters encoded in 2 or 4 bytes.
UCS Transformation Format 8-bit form (UTF-8)--a transformation format with characters encoded in 1-6 bytes (the current definition of UTF-8 limits this to 1-4 bytes).
UCS-2 defines a 64K coding space, or BMP, to represent character codes in a two-octet row and cell format. The row and cell octets designate the cell location of a particular character code within a 256 by 256 (00-FF) plane. UCS-4 defines a four-octet coding space divided into four units: group, plane, row, and cell. The row and cell octets designate the cell location of a particular character code within a plane. The plane octet designates the plane number (00-FF), and the group octet the group number (00-7F) to which the plane belongs. In total, there are 128 groups of 256 planes each.
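The group, plane, row, and cell layout, and the different byte counts of the transformation formats, can be sketched in Python as follows. This is a minimal illustration under two assumptions: the helper names `ucs4_octets` and `encoded_lengths` are hypothetical, and Python's `utf-32-be` codec stands in for the four-octet UCS-4 form, which it matches only for code points up to U+10FFFF.

```python
def ucs4_octets(code_point):
    """Split a code point into its UCS-4 (group, plane, row, cell) octets."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 coding space")
    return (
        (code_point >> 24) & 0x7F,  # group: 00-7F
        (code_point >> 16) & 0xFF,  # plane: 00-FF
        (code_point >> 8) & 0xFF,   # row:   00-FF
        code_point & 0xFF,          # cell:  00-FF
    )


def encoded_lengths(char):
    """Return the number of bytes one character occupies in each format."""
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-be")),
        "utf-32": len(char.encode("utf-32-be")),
    }


# Latin A, Cyrillic BE, a CJK ideograph, and a character outside the BMP.
for char in ["A", "\u0411", "\u4e2d", "\U0001D11E"]:
    octets = ucs4_octets(ord(char))
    print(f"U+{ord(char):04X} octets={octets} lengths={encoded_lengths(char)}")
```

Characters in the BMP have group 0 and plane 0 and fit in two UTF-16 bytes; the last sample character lies in plane 1 and therefore needs four bytes in UTF-16.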
The support of Unicode, Version 3.0 in the Solaris 8 Operating Environment's Unicode locales has provided an enhanced framework for developing multiscript applications. Properly internationalized applications require no changes to support the Unicode locales. All internationalized CUI and GUI utilities and commands in the Solaris operating environment are available in Unicode locales without modification. All Unicode locales in the Solaris operating environment are based on the UTF-8 format. Each locale includes a base language in the UTF-8 codeset and regional data related to the base language and its cultural conventions (such as local formatting rules, text messages, help messages, and other related files). Each locale also supports several other scripts for input, display, code conversion, and printing.
en_US.UTF-8 is the flagship Unicode locale in the Solaris operating environment. The en_US.UTF-8 locale is an American English-based locale with multiscript processing support for characters in many different languages. New and enhanced features of all Unicode locales include support of the Unicode 3.0 character set, correct rendering of complex text layout scripts, native Asian input methods, more MIME character sets in dtmail, various new iconv code conversions, and an enhanced PostScript print filter. All Unicode locales in the Solaris operating environment support multiple scripts. Thirteen input modes are available: English/European, Cyrillic, Greek, Arabic, Hebrew, Thai, Unicode Hex, Unicode Octal, Table lookup, Japanese, Korean, Simplified Chinese, and Traditional Chinese. Users can input characters from any combination of scripts and the entire Unicode coding space.
Language              Code
Cyrillic              cc
Greek                 gg
Thai                  tt
Arabic                ar
Hebrew                hh
Unicode Hex           uh
Unicode Octal         uo
Lookup                ll
Japanese              ja
Korean                ko
Simplified Chinese    sc
Traditional Chinese   tc
English/European      Control+Space
To input text from a Lookup table, select the Lookup input mode. A lookup table with all input modes and various symbol and technical codesets appears, as shown in Figure 3-2.
The Table lookup input mode is the easiest way for non-native speakers to input characters in a foreign language: a lookup window displays characters from a selected script, as shown for the Asian input mode in Figure 3-3. The Arabic, Hebrew, and Thai input modes provide full complex text layout features, including right-to-left display and context-sensitive character rendering. The Unicode octal and hexadecimal code input modes generate Unicode characters from their octal and hexadecimal equivalents, respectively. The Japanese, Korean, Simplified Chinese, and Traditional Chinese input modes provide full native Asian input.
Figure 3-3 UTF-8 Table Lookup
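What the Unicode hexadecimal and octal input modes compute can be sketched in a few lines of Python. This illustrates the conversion only, not the Solaris input method itself; the function name `char_from_code` is hypothetical, and the number of digits the real input modes expect is not covered here.

```python
def char_from_code(digits, base):
    """Turn a string of hex (base 16) or octal (base 8) digits into a character."""
    code_point = int(digits, base)
    # chr() raises ValueError for values beyond U+10FFFF.
    return chr(code_point)


from_hex = char_from_code("0411", 16)   # hexadecimal 0411
from_octal = char_from_code("2021", 8)  # octal 2021 is the same number
print(from_hex, from_octal)
```

Both calls produce the Cyrillic capital BE, since hexadecimal 0411 and octal 2021 are the same code point.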
The Unicode locales can use the enhanced mp(1) printing filter to print text files. mp(1) prints flat text files written in UTF-8 using various Solaris system and printer-resident fonts (such as bitmap, Type1, and TrueType), depending on the script. The output is standard PostScript. For more information, refer to the mp(1) man page.
The Unicode locale supports various MIME character sets in dtmail, including various Latin, Greek, Cyrillic, Thai, and Asian character sets. Some example character sets are: ISO-8859-1 through ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, UTF-8, UTF-7, UTF-16, UTF-16BE, UTF-16LE, Shift_JIS, ISO-2022-JP, EUC-KR, ISO-2022-KR, TIS-620, Big5, GB2312, KOI8-R, KOI8-U, and ISO-2022-CN. With this support, users can send and receive email messages encoded in MIME character sets from almost any region in the world. dtmail automatically decodes e-mail by recognizing the MIME character set and content transfer encoding in the message. The sender specifies the MIME character set for the recipient mail user agent.
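The decoding step described for dtmail (read the MIME character set and content transfer encoding declared by the sender, then decode the body) can be sketched with Python's standard `email` package. This is not dtmail's implementation; the helper names `build_message` and `read_message`, the subject line, and the sample text are assumptions made for the example.

```python
from email import message_from_bytes
from email.message import EmailMessage


def build_message(text, charset):
    """Sender side: declare the MIME character set and encode the body."""
    msg = EmailMessage()
    msg["Subject"] = "charset demo"
    # base64 is chosen explicitly as the content transfer encoding.
    msg.set_content(text, charset=charset, cte="base64")
    return msg.as_bytes()


def read_message(raw):
    """Recipient side: undo the transfer encoding, then decode by charset."""
    msg = message_from_bytes(raw)
    charset = msg.get_content_charset()     # taken from the Content-Type header
    payload = msg.get_payload(decode=True)  # base64 removed, still raw bytes
    return charset, payload.decode(charset)


raw = build_message("Привет", "koi8-r")
charset, text = read_message(raw)
print(charset, text.strip())
```

The recipient never guesses the encoding: it relies entirely on the character set the sender declared in the message headers.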
Codeset Conversion
The Solaris operating environment locale supports enhanced code conversion among the major codesets of several countries. Figure 3-5 shows the codeset conversions between UTF-8 and many other codesets.
Figure 3-6 Unicode codeset conversions
Codesets can be converted using the sdtconvtool utility or the iconv(1) command. sdtconvtool detects available iconv code conversions and presents them in an easy-to-use format.
Users can also add their own code conversions and use them in iconv(3) functions, iconv(1) command line utilities, and sdtconvtool(1). For more information on user-extensible, user-defined code conversions, refer to the geniconvtbl(1) and geniconvtbl(4) man pages. Developers can use iconv(3) to access the same functionality. This includes conversions to and from UTF-8 and many ISO-standard codesets, including UCS-2, UCS-4, UTF-7, UTF-16, KOI8-R, Japanese EUC, Korean EUC, Simplified Chinese EUC, Traditional Chinese EUC, GBK, PCK (Shift JIS), BIG5, Johab, ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
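The kind of conversion iconv performs can be illustrated in Python, whose built-in codecs cover many of the codesets listed above. This is only an analogy to the `iconv -f ... -t ...` pattern, not a wrapper around iconv(3); the function name `convert` is hypothetical and the codec names are Python's, not Solaris's.

```python
def convert(data, from_codeset, to_codeset):
    """Re-encode bytes from one codeset to another via Unicode."""
    return data.decode(from_codeset).encode(to_codeset)


# KOI8-R to UTF-8 and back again is lossless.
koi8 = "Привет".encode("koi8_r")
utf8 = convert(koi8, "koi8_r", "utf_8")
back = convert(utf8, "utf_8", "koi8_r")

# UTF-8 to Shift JIS works for Japanese text.
sjis = convert("日本語".encode("utf_8"), "utf_8", "shift_jis")

# A conversion fails when the target codeset lacks the characters.
try:
    convert(utf8, "utf_8", "iso8859_1")
    lossy_ok = True
except UnicodeEncodeError:
    lossy_ok = False

print(len(koi8), len(utf8), len(sjis), lossy_ok)
```

Converting into UTF-8 always succeeds for valid input, while converting out of it can fail, which is one reason the Unicode locales use UTF-8 as the common format.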
Localization
Localization is the next important concept in understanding multilingual computing. Web localization can be defined as simply the act of making a Web site linguistically and culturally appropriate to a local audience. An "accurate" translation may not be enough -- a translated text must be "localized" for the target audience viewing the Web page. We could use Spanish as an example: A Web page might be accurately translated into Spanish, but then the question could be, Which Spanish? Peruvian Spanish, for example, is not the same as Mexican Spanish. Therefore, Peruvians reading a Mexican Spanish Web site might be able to understand almost all of it, but certain nuances or turns of phrase might be unfamiliar to them. For the most part, multilingual sites are not yet sophisticated enough or targeted enough to deal with such differentiations, but that will certainly change in the future. In other words, localization will become more and more significant as it helps direct the growth and acceptance of the Internet in ever broader and more diverse cultural settings.
Expert review
Expert review is rather self-explanatory. After a Web page has been translated, globalized for consistency and content, and localized for individual target language impact, it should undergo expert review. Does the site hang together overall? Is there a consistent message for content and tone between the languages? This examination for consistency between languages might also be viewed as the process of "internationalizing" the site. These are all issues that could fall under the category of expert review.
Page markup
After translation, localization, and expert review, we can proceed to working out the Unicode equivalents and the actual page markup. Besides the three languages -- English, Russian, and Chinese -- used for the survey question in this tutorial, we have also added random characters in Japanese, Hebrew, Hindi, and Tibetan to demonstrate the amazing variety of Unicode-based characters in a multilingual site.
Sample multilingual survey question
We begin to construct our Unicode-based multilingual Web page example with a sample survey question: "Do you want to buy a new computer? Yes___ No___" translated (except for English, of course!) into our two other target languages. The result is displayed in Figure 1.
Unicodization
Once this procedure is complete, we need to transfer these language texts into their Unicode equivalents. We could call this step "Unicodization" (I don't know if this term has been coined yet, but if not, it needs to be.) It is not necessary, of course, to translate the English characters of our example into their Unicode equivalents, since they would be displayed properly in any case because of ASCII. However, we do so anyway in order to demonstrate how the process works overall (as well as for consistency).
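One common way to carry out this step for a Web page is to write each character as an HTML numeric character reference. The sketch below assumes that this is the intended form of the "Unicode equivalents"; the report does not show the actual markup, and the function name `to_numeric_references` and the sample word are introduced only for illustration.

```python
import html


def to_numeric_references(text):
    """Replace every character with its decimal numeric character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)


russian_yes = "Да"
encoded = to_numeric_references(russian_yes)
encoded_ascii = to_numeric_references("Yes")

# html.unescape() reverses the step, much as a browser does when rendering.
decoded = html.unescape(encoded)

print(encoded, encoded_ascii, decoded)
```

The encoded text is pure ASCII, so it survives any transport or editor, at the cost of being unreadable in the page source.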
XML tags provide a basis for organization and for building complexity into multilingual documents. For the purpose of this tutorial, we simply use a single survey question as the basis of our Unicode-based multilingual document. However, as multilingual e-commerce develops, documents can easily become extremely complex. XML can provide an excellent mechanism for managing this
scripts and responses might be utilized. Defining how those scripts are used and interact -- and using XML in that process -- is at the heart of building effective multilingual Web sites using Unicode. That is a more advanced topic that builds on what we have presented here.
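A minimal sketch of how XML tags might organize the survey question by language is given below, using Python's standard `xml.etree.ElementTree`. The element names `survey` and `question` are hypothetical, and the Russian and Chinese strings are illustrative translations supplied for this example rather than the text of the original figure. Only the `xml:lang` attribute is standard XML.

```python
import xml.etree.ElementTree as ET

# Clark-notation name that ElementTree serializes as the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative translations only (an assumption, not the original figure).
QUESTIONS = {
    "en": "Do you want to buy a new computer?",
    "ru": "Вы хотите купить новый компьютер?",
    "zh": "你想买一台新电脑吗？",
}


def build_survey(questions):
    """Build one <question> element per language, tagged with xml:lang."""
    survey = ET.Element("survey")
    for lang, text in questions.items():
        question = ET.SubElement(survey, "question", {XML_LANG: lang})
        question.text = text
    return survey


# Serialize as UTF-8 bytes, then parse again and index the text by language.
xml_bytes = ET.tostring(build_survey(QUESTIONS), encoding="utf-8")
parsed = ET.fromstring(xml_bytes)
by_lang = {q.get(XML_LANG): q.text for q in parsed.iter("question")}

print(xml_bytes.decode("utf-8"))
```

Because every question carries its language tag, an application can select the right text for each visitor without inspecting the characters themselves.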
Display of HTML
Now we're ready to view our basic survey question. For those who have not yet loaded a Unicode font.
UNICODE SUPPORT IN SAS 9.1
In 9.1, SAS customers in many regions around the world will use the DBCS extensions in order to support global data (multilingual data which can only be represented in the Unicode character set). With the SAS Unicode server, it is now possible to write a SAS application which processes Japanese data, German data, Polish data, and more, all in the same session. A single server can deliver multilingual data to users around the world.
This paper will discuss the following six scenarios for using the SAS Unicode server.
1. Populating a Unicode database.
2. Using SAS/SHARE as a Unicode data server.
3. Using thin-client applications with the Unicode data server.
4. Using SAS/IntrNet as a Unicode compute server.
5. Using SAS AppDev Studio as a Unicode compute server.
6. Generating Unicode HTML and PDF output using the SAS Output Display System (ODS).
The SAS Unicode server is designed to run on ASCII-based machines. It can be run as a data or compute server or as a batch program. The following restrictions apply:
1. The SAS Display Manager is not supported and, if used, will not display data correctly.
2. Enterprise Guide cannot access a SAS Unicode server.
3. You cannot run a SAS Unicode server on MVS (OS/390) or OpenVMS.
4. Fullscreen capability for UTF8 session encoding using national characters is not supported. Therefore, products that rely on fullscreen capability are not supported. This includes SAS/EIS, SAS/Warehouse Administrator, SAS/ASSIST, SAS/LAB, and Enterprise Miner V3.0 and earlier. FSEDIT, INSIGHT, and other Frame-based products are not supported.
5. Multilingual characters are not supported with SAS/GRAPH fonts, SAS/GRAPH ActiveX, and SAS/GRAPH Java Applets and UTF8.
6. OLEDB local data providers do not fully support multilingual data.
7. SAS/Access Engines to Oracle, ODBC, and OLEDB fully support the UTF8 encoding in SAS 9.1, but other Access engines do not.
8. OLEDB local data providers and OLEDB IOM data providers do not support multilingual data.
9. The UTF8 server running on Windows does not support national characters for pathnames, such as an external file name or the directory name of a SAS dataset.
STARTING AND USING A SAS UNICODE SERVER
To start a SAS Unicode server you must do two things:
1. Install SAS 9.1 or later, with DBCS extensions.
2. Specify ENCODING UTF8 when you start SAS, such as: sas -encoding UTF8
POPULATING A UNICODE DATABASE
The first step in converting an existing database to Unicode or in setting up a new Unicode-based system will be to convert all of your data from its legacy encoding to the UTF8 encoding. Once the data is in a Unicode database, there will not be any loss of data when it is read by a Unicode server.
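The conversion step itself is not tied to any one tool. As a language-neutral illustration (in Python, not SAS code), the sketch below re-encodes a text file from a legacy encoding to UTF-8 and fails loudly if a byte cannot be decoded. The function name `transcode_to_utf8`, the file names, and the sample record are all assumptions made for the example.

```python
import os
import tempfile


def transcode_to_utf8(src_path, dst_path, legacy_encoding):
    """Copy a text file, re-encoding it from a legacy codeset to UTF-8."""
    # Strict error handling (the default) raises UnicodeDecodeError on bad
    # input instead of silently corrupting data; newline="" preserves the
    # original line endings.
    with open(src_path, "r", encoding=legacy_encoding, newline="") as src:
        with open(dst_path, "w", encoding="utf-8", newline="") as dst:
            for line in src:
                dst.write(line)


SAMPLE = "データ,Daten,dane\n"  # hypothetical record with Japanese text

with tempfile.TemporaryDirectory() as tmp:
    legacy_path = os.path.join(tmp, "legacy.txt")
    utf8_path = os.path.join(tmp, "utf8.txt")
    with open(legacy_path, "wb") as f:
        f.write(SAMPLE.encode("shift_jis"))

    transcode_to_utf8(legacy_path, utf8_path, "shift_jis")

    with open(utf8_path, "rb") as f:
        converted = f.read()

print(converted.decode("utf-8"), end="")
```

The legacy encoding has to be known in advance: decoding with the wrong codeset either raises an error or produces wrong characters, which is the corruption risk the conversion is meant to remove.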
USING SAS/SHARE AS A UNICODE DATA SERVER
SAS/SHARE is a product that enables multiple users to access data from a central server. To convert your existing SAS/SHARE server to a SAS Unicode server you must specify the ENCODING UTF8 config option.
USING JDBC WITH A UNICODE DATA SERVER
The SAS system is continuously increasing support for industry standard data access protocols such as JDBC. The JDBC interface is a data access interface for Java applications. Java supports Unicode string data, and therefore it would be very natural for the SAS Unicode server to function as the data server for Java.
USING SAS/INTRNET AS A COMPUTE SERVER
The SAS system is often used as a compute server from a non-SAS client. This is another natural fit for the SAS Unicode server.
SAS AppDev Studio enables Java programmers to run programs on a SAS server. The programs that run on the server are either SCL programs running with Jconnect or remote objects executed through SAS Integration Technologies.
GENERATING UNICODE OUTPUT USING ODS
A SAS Unicode server can be used in a batch program to produce ODS output with an encoding of UTF8. At the time of this writing, the following ODS output formats support encoding UTF8:
HTML
XML
PDF
UNICODE PROCESSING IN THE SAS SYSTEM
There are several Unicode-related features of SAS 9. These features are available for SAS sessions running legacy encodings as well as SAS sessions running with a UTF8 encoding.
Unicode ENCODING= values for FILENAME and ODS statements.
Unicode FORMATS and INFORMATS.
NL formats for displaying currency and date formats matching the user's locale.
Conclusion
Thus, using Unicode for multiple languages reduces the risk of data corruption. Moreover, converting data from legacy encodings is easy and takes little time. With Unicode, information can be passed globally in any language, and data is handled more safely because it no longer has to move between conflicting encodings. Multilingual web pages can also be developed on this basis.