UNICODE AND MULTILINGUAL COMPUTING

A Technical Seminar Report submitted to the Faculty of Computer Science and Engineering

Geethanjali College of Engineering & Technology


(Cheeryal (V), Keesara (M), R.R. Dist., Hyderabad, A.P.)

Accredited by NBA (Affiliated to J.N.T.U.H, Approved by AICTE, New Delhi)


In partial fulfillment of the requirement for the award of degree of
BACHELOR OF TECHNOLOGY IN COMPUTER SCIENCE AND ENGINEERING

Under the esteemed guidance of
Mr. P. Srinivas, M.Tech, (Ph.D)
Sr. Associate Professor

By
M.MEDHA 09R11A0594

Department of Computer Science & Engineering


Year : 2012-2013

Date:

CERTIFICATE

This is to certify that the Technical Seminar report on Unicode and Multilingual Computing is a bona fide work done by M. MEDHA (09R11A0594) in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology in Computer Science and Engineering, J.N.T.U.H, Hyderabad, during the year 2012-2013.

Technical Seminar Co-Ordinator          HOD-CSE
(Mr. P. Srinivas)                       (Prof. Dr. P.V.S. Srinivas)
Sr. Associate Professor

ABSTRACT
Today's global economy demands global computing solutions. Instant communication across continents and computer platforms characterizes a business world at work 24 hours a day, 7 days a week. The widespread use of the Internet and e-commerce continues to create new international challenges. More and more, users demand a computing environment that suits their own linguistic and cultural needs. They want applications and file formats they can share around the world, interfaces in their own language, and local time and date displays. Essentially, users want to write and speak at the keyboard the way they write and speak in the office.

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Before Unicode, many different encoding systems were used to assign these numbers, and even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use. These encoding systems also conflict with one another: two encodings can use the same number for two different characters, or different numbers for the same character. Any given computer (especially a server) needs to support many different encodings, yet whenever data is passed between different encodings or platforms, that data runs the risk of corruption. Unicode was developed to remove this problem by providing a unique number for every character.

Contents

1. Abstract
2. Unicode
3. Procedure
4. Supporting the Unicode Standard
5. Benefits of Unicode
6. Unicode Coded Representations
7. Unicode in the Solaris 8 Operating Environment
8. Unicode UTF-8 en_US.UTF-8 Locale
9. Codeset Conversion
10. Issues in Using Unicode
11. Unicode-based Multilingual Form Development
12. Multilingual Computing with the 9.1 SAS Unicode Server
13. Conclusion
14. References

Unicode

In most writing systems, keyboard input is converted into character codes, stored in memory, and converted to glyphs in a particular font for display and printing. The collection of characters and character codes forms a codeset. To represent characters of different languages, a different codeset is used. A character code in one codeset, however, does not necessarily represent the same character in another codeset. For example, the character code 0xB1 is the plus-minus sign (±) in Latin-1 (ISO 8859-1 codeset), capital BE (Б) in Cyrillic (ISO 8859-5 codeset), and does not represent anything in Arabic (ISO 8859-6 codeset) or Traditional Chinese (CJK unified ideographs).

In Unicode, every character, ideograph, and symbol has a unique character code, eliminating any confusion between character codes of different codesets. In Unicode, multiple codesets need not be defined. Unicode represents characters from most of the world's languages as well as publishing characters, mathematical and technical symbols, and punctuation characters. This universal representation for text data was further enhanced and extended in The Unicode Standard, Version 3.0.
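The 0xB1 example can be checked with a short Python sketch. It is only an illustration: the codec names are Python's aliases for the three ISO 8859 codesets, and the helper name decode_byte is made up for this example.

```python
# Decode the single byte 0xB1 under three ISO 8859 codesets.
# The codec names below are Python's standard aliases for these codesets.
CODESETS = {
    "Latin-1 (ISO 8859-1)": "iso8859_1",
    "Cyrillic (ISO 8859-5)": "iso8859_5",
    "Arabic (ISO 8859-6)": "iso8859_6",
}


def decode_byte(value, codec):
    """Return the character one byte stands for in a codeset, or None if unassigned."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


for label, codec in CODESETS.items():
    char = decode_byte(0xB1, codec)
    if char is None:
        print(f"{label}: 0xB1 is unassigned")
    else:
        print(f"{label}: 0xB1 -> {char!r} (U+{ord(char):04X})")
```

The same byte yields two different Unicode code points and one decoding error, which is the ambiguity a single universal codeset removes.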

PROCEDURE

Sun Microsystems defines the following levels at which an application can support a customer's international needs:

Internationalization
Localization

Software internationalization is the process of designing and implementing software to transparently manage different linguistic and cultural conventions without additional modification. The same binary copy of an application should run on any localized version of the Solaris operating environment, without requiring source code changes or recompilation.

Software localization is the process of adding language translation (including text messages, icons, buttons, and so on), cultural data, and components (such as input methods and spell checkers) to a product to meet regional market requirements.

The Solaris operating environment is an example of a product that supports both internationalization and localization. The Solaris operating environment is a single internationalized binary that is localized into various languages (for example, French, Japanese, and Chinese) to support the language and cultural conventions of each language.

Supporting the Unicode Standard

Unicode (Universal Codeset) is a universal character encoding scheme developed and promoted by the Unicode Consortium, a non-profit organization whose members include Sun Microsystems. The Unicode standard encompasses most alphabetic, ideographic, and symbolic characters. Using one universal codeset enables applications to support text from multiple scripts in the same document without elaborate tagging. However, applications must treat Unicode like any other codeset, applying codeset independence to Unicode as well.

Unicode locales are invoked the same way and function the same way as all other locales in the Solaris operating environment. These locales provide the extra benefits that the Unicode codeset brings to the work environment, including the ability to create text in multiple scripts without having to switch locales. Sun Microsystems provides the same level of Unicode locale support for both 32-bit and 64-bit Solaris environments.

Benefits of Unicode

Support for Unicode provides many benefits to application developers, including:

Global source and binary.
Support for mixed-script computing environments.
Improved cross-platform data interoperability through a common codeset.
Space-efficient encoding scheme for data storage.
Reduced time-to-market for localized products.
Expanded market access.

Developers can use Unicode to create global applications. Users can exchange data more freely using one flat codeset without elaborate code conversions to comprehend characters. In the Solaris operating environment internationalization framework, Unicode is "just another codeset." By designing applications for codeset independence, developers can handle different codesets without extensive code rework to support specific languages.

Unicode Coded Representations

In recent years, the Unicode Consortium and other related organizations have developed different formats to represent and store a Unicode codeset. To represent characters from all major languages in multibyte format, the ISO/IEC International Standard 10646-1 (commonly referred to as 10646) has defined the Universal Multiple-Octet Coded Character Set (UCS) format. Character forms contained in the 10646 specifications are:

Universal Coded Character Set-2 (UCS-2), which covers the Basic Multilingual Plane (BMP): characters are encoded in two bytes on a single plane.
Universal Coded Character Set-4 (UCS-4): characters are encoded in four bytes on multiple planes and multiple groups.
UCS Transformation Format 16-bit form (UTF-16): an extended variant of UCS-2 with characters encoded in two or four bytes.
UCS Transformation Format 8-bit form (UTF-8): a transformation format with characters encoded in 1-6 bytes in the original 10646 definition (UTF-8 as later restricted by RFC 3629 uses at most four bytes).

UCS-2 defines a 64K coding space, or BMP, to represent character codes in a two-octet row and cell format. The row and cell octets designate the cell location of a particular character code within a 256 by 256 (00-FF) plane.

UCS-4 defines a four-octet coding space divided into four units: group, plane, row, and cell. The row and cell octets designate the cell location of a particular character code within a plane. The plane octet designates the plane number (00-FF), and the group octet the group number (00-7F) to which the plane belongs. In total, there are 256 planes in each of 128 groups.
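To make the size differences between these forms concrete, the following Python sketch measures how many bytes a few sample characters occupy in UTF-8, UTF-16, and the four-byte form. The sample characters and the helper name encoded_sizes are choices made for this illustration; Python's utf-32 codec stands in for UCS-4.

```python
# Sample characters: ASCII, Latin-1, Cyrillic, a CJK ideograph, and one
# character outside the Basic Multilingual Plane.
SAMPLES = ["A", "\u00e9", "\u0411", "\u96fb", "\U00010348"]


def encoded_sizes(char):
    """Return the byte length of one character in three encoding forms."""
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-be")),  # big-endian, no byte-order mark
        "ucs-4": len(char.encode("utf-32-be")),   # always four bytes per character
    }


for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

ASCII text stays at one byte per character in UTF-8, while a character outside the BMP needs four bytes in every form shown.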

Figure 2-1 UCS-2 and UCS-4 coding schemes
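The group, plane, row, and cell octets of a UCS-4 value can be read off with simple bit shifts. This is a minimal sketch; the function name ucs4_octets is hypothetical and only illustrates the layout described in the text.

```python
def ucs4_octets(code_point):
    """Split a code value into its (group, plane, row, cell) octets."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 code values fit in 31 bits")
    group = (code_point >> 24) & 0x7F   # group number, 00-7F
    plane = (code_point >> 16) & 0xFF   # plane number within the group, 00-FF
    row = (code_point >> 8) & 0xFF      # row within the plane
    cell = code_point & 0xFF            # cell within the row
    return group, plane, row, cell


for value in (0x0411, 0x96FB, 0x10348):
    print(f"U+{value:04X}", ["%02X" % octet for octet in ucs4_octets(value)])
```

Every BMP character has group 00 and plane 00, so only its row and cell octets are needed, which is exactly the two-octet UCS-2 form.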

Unicode in the Solaris 8 Operating Environment

The support of Unicode, Version 3.0 in the Solaris 8 Operating Environment's Unicode locales has provided an enhanced framework for developing multiscript applications. Properly internationalized applications require no changes to support the Unicode locales. All internationalized CUI and GUI utilities and commands in the Solaris operating environment are available in Unicode locales without modification. All Unicode locales in the Solaris operating environment are based on the UTF-8 format. Each locale includes a base language in the UTF-8 codeset and regional data related to the base language and its cultural conventions (such as local formatting rules, text messages, help messages, and other related files). Each locale also supports several other scripts for input, display, code conversion, and printing.

Unicode UTF-8 en_US.UTF-8 Locale

en_US.UTF-8 is the flagship Unicode locale in the Solaris operating environment. The en_US.UTF-8 locale is an American English-based locale with multiscript processing support for characters in many different languages. New and enhanced features of all Unicode locales include support of the Unicode 3.0 character set, complex text layout scripts in correct rendition, native Asian input methods, more MIME character sets in dtmail, various new iconv code conversions, and an enhanced PostScript print filter.

All Unicode locales in the Solaris operating environment support multiple scripts. Thirteen input modes are available: English/European, Cyrillic, Greek, Arabic, Hebrew, Thai, Unicode Hex, Unicode Octal, Table lookup, Japanese, Korean, Simplified Chinese, and Traditional Chinese. Users can input characters from any combination of scripts and the entire Unicode coding space.

Language              Code
Cyrillic              cc
Greek                 gg
Thai                  tt
Arabic                ar
Hebrew                hh
Unicode Hex           uh
Unicode Octal         uo
Lookup                ll
Japanese              ja
Korean                ko
Simplified Chinese    sc
Traditional Chinese   tc
English/European      Control+Space

Table 3-1 UTF-8 Input Mode two-letter codes

Figure 3-2 UTF-8 Input Mode selection

To input text from a Lookup table, select the Lookup input mode. A lookup table with all input modes and various symbol and technical codesets appears, as shown in Figure 3-2.

The Table lookup input mode is the easiest for non-native speakers to input characters in a foreign language--a lookup window displays characters from a selected script, as shown for the Asian input mode in Figure 3-3. The Arabic, Hebrew, and Thai input modes provide full complex text layout features, including right-to-left display and context-sensitive character rendering. The Unicode octal and hexadecimal code input modes generate Unicode characters from their octal and hexadecimal equivalents, respectively. The Japanese, Korean, Simplified Chinese, and Traditional Chinese input modes provide full native Asian input.
Figure 3-3 UTF-8 Table Lookup
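What the Unicode hexadecimal and octal input modes do can be sketched in Python: a string of digits is read as a number, and the number is taken as a character code. The function name char_from_code is invented for this sketch, and the Solaris keystroke syntax itself is not reproduced here.

```python
def char_from_code(digits, base):
    """Turn a string of hexadecimal (base 16) or octal (base 8) digits into a character."""
    if base not in (8, 16):
        raise ValueError("base must be 8 or 16")
    return chr(int(digits, base))


# The same character (U+20AC, the euro sign) entered both ways.
print(char_from_code("20AC", 16))
print(char_from_code("20254", 8))
```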

Figure 3-4 Asian input mode

The Unicode locales can use the enhanced mp(1) printing filter to print text files. mp(1) prints flat text files written in UTF-8 using various Solaris system and printer resident fonts (such as bitmap, Type1, TrueType) depending on the script. The output is standard PostScript. For more information, refer to the mp(1) man page.

The Unicode locale supports various MIME character sets in dtmail, including various Latin, Greek, Cyrillic, Thai, and Asian character sets. Some of the example character sets are: ISO-8859-1 through ISO-8859-10, ISO-8859-13, ISO-8859-14, ISO-8859-15, UTF-8, UTF-7, UTF-16, UTF-16BE, UTF-16LE, Shift_JIS, ISO-2022-JP, EUC-KR, ISO-2022-KR, TIS-620, Big5, GB2312, KOI8-R, KOI8-U, and ISO-2022-CN. With this support, users can send and receive email messages encoded in MIME character sets from almost any region in the world. dtmail automatically decodes e-mail by recognizing the MIME character set and content transfer encoding in the message. The sender specifies the MIME character set for the recipient mail user agent.
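The decoding step described for dtmail, where the MIME character set and content transfer encoding are read from the message headers, can be illustrated with Python's standard email package. This is a sketch of the general MIME mechanism, not of dtmail itself; the message, its addresses, and the Russian sample text are invented for the example.

```python
import base64
from email import message_from_bytes
from email.policy import default

# A hypothetical message whose body is Russian text stored as KOI8-R bytes
# and wrapped in base64 for transport.
body = "Привет, мир".encode("koi8-r")
raw = (
    b"From: sender@example.com\r\n"
    b"To: reader@example.com\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="koi8-r"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n" + base64.b64encode(body) + b"\r\n"
)

msg = message_from_bytes(raw, policy=default)
charset = msg.get_content_charset()   # character set declared by the sender
text = msg.get_content()              # base64-decoded, then decoded from KOI8-R

print(charset)
print(text)
```

The receiving side never guesses the encoding: both pieces of information travel in the headers the sender wrote.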

Figure 3-5 Multiple character sets in dtmail

Codeset Conversion

The Solaris operating environment locale supports enhanced code conversion among the major codesets of several countries. Figure 3-6 shows the codeset conversions between UTF-8 and many other codesets.
Figure 3-6 Unicode codeset conversions

Codesets can be converted using the sdtconvtool utility or the iconv(1) command. sdtconvtool detects available iconv code conversions and presents them in an easy-to-use format.

Figure 3-7 sdtconvtool for converting between codesets

Users can also add their own code conversions and use them in iconv(3) functions, iconv(1) command line utilities, and sdtconvtool(1). For more information on user-extensible, user-defined code conversions, refer to the geniconvtbl(1) and geniconvtbl(4) man pages. Developers can use iconv(3) to access the same functionality. This includes conversions to and from UTF-8 and many ISO-standard codesets, including UCS-2, UCS-4, UTF-7, UTF-16, KOI8-R, Japanese EUC, Korean EUC, Simplified Chinese EUC, Traditional Chinese EUC, GBK, PCK (Shift JIS), BIG5, Johab, ISO-2022-JP, ISO-2022-KR, and ISO-2022-CN.
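A codeset conversion of the kind iconv performs goes through Unicode as the pivot: decode from the source codeset, then encode into the target. The Python sketch below is a rough analogue only and does not call iconv(3); the function name convert_codeset and the sample strings are assumptions made for this example.

```python
def convert_codeset(data, from_code, to_code):
    """Re-encode bytes from one codeset to another, using Unicode as the pivot."""
    return data.decode(from_code).encode(to_code)


# Cyrillic text: KOI8-R uses one byte per letter, UTF-8 uses two.
koi8 = "Юникод".encode("koi8_r")
utf8 = convert_codeset(koi8, "koi8_r", "utf_8")
print(len(koi8), len(utf8))

# Japanese text converted between two legacy codesets.
sjis = "日本語".encode("shift_jis")
eucjp = convert_codeset(sjis, "shift_jis", "euc_jp")
print(eucjp.decode("euc_jp"))
```

Decoding is strict by default, so a byte sequence that is invalid in the source codeset raises an error instead of being silently corrupted.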

ISSUES IN USING UNICODE


To properly internationalize an application, use the following guidelines:

Avoid direct access with Unicode. (This is a task of the platform's internationalization framework.)
Use the POSIX model for multibyte and wide-character interfaces.
Only call APIs that the internationalization framework provides for language-specific and culture-specific operations. All POSIX, X11, Motif, and CDE interfaces are available to Unicode locales.
Remain codeset independent.
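The multibyte versus wide-character distinction behind these guidelines can be shown in a few lines. The sketch below uses Python rather than the POSIX C interfaces the text refers to: encoded bytes play the role of multibyte data and decoded strings the role of wide characters. The function name character_count is hypothetical.

```python
def character_count(data, codeset):
    """Count characters in encoded bytes by decoding with the stated codeset first."""
    return len(data.decode(codeset))


word = "Юникод"
for codeset in ("koi8_r", "iso8859_5", "utf_8", "utf_16_le"):
    encoded = word.encode(codeset)
    # The byte length varies by codeset; the character count does not.
    print(codeset, len(encoded), character_count(encoded, codeset))
```

Code that works on the decoded form and never assumes one byte per character behaves the same in every codeset, which is what remaining codeset independent means in practice.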

Unicode-based multilingual form development

Translation


The first step in constructing a Unicode-based multilingual Web page is fairly self-evident: The material must be translated into the desired target languages by persons knowledgeable in those languages. At some point in the future, automatic translation or global translation (formerly known as "machine translation" or "MT") may be sophisticated enough to do a large part of that job, but it is not quite ready at this time. Additionally, although great strides have been made in this area in recent years, it is hard to imagine a time when human review of automatically translated text will not be necessary.

Localization
Localization is the next important concept in understanding multilingual computing. Web localization can be defined as simply the act of making a Web site linguistically and culturally appropriate to a local audience. An "accurate" translation may not be enough: a translated text must be "localized" for the target audience viewing the Web page. We could use Spanish as an example. A Web page might be accurately translated into Spanish, but then the question could be, which Spanish? Peruvian Spanish, for example, is not the same as Mexican Spanish. Therefore, Peruvians reading a Mexican Spanish Web site might be able to understand almost all of it, but certain nuances or turns of phrase might be unfamiliar to them. For the most part, multilingual sites are not yet sophisticated enough or targeted enough to deal with such differentiations, but that will certainly change in the future. In other words, localization will become more and more significant as it helps direct the growth and acceptance of the Internet in ever more broad and diverse cultural settings.

Multiple languages on the same Web page


For the purposes of this tutorial, however, localization is not an issue: The English used is clearly American (not British), the Russian text used is very straightforward, and although Chinese has many different spoken dialects, any Chinese reader could understand and respond to the survey question as presented. Our main goal here is to simply construct the initial building blocks of a multilingual Web site, where it is possible to display multiple languages on the same Web page together. Once the key concepts involved in getting a Unicode-based multilingual Web page up and running are understood, localization and more advanced aspects of Web design can be addressed by the developer.

Expert review
Expert review is rather self-explanatory. After a Web page has been translated, globalized for consistency and content, and localized for individual target language impact, it should undergo expert review. Does the site hang together overall? Is there a consistent message for content and tone between the languages? This examination for consistency between languages might also be viewed as the process of "internationalizing" the site. These are all issues that could fall under the category of expert review.

Page markup
After translation, localization, and expert review, we can proceed to working out the Unicode equivalents and the actual page markup. Besides the three languages (English, Russian, and Chinese) used for the survey question in this tutorial, we have also added random characters in Japanese, Hebrew, Hindi, and Tibetan to demonstrate the amazing variety of Unicode-based characters in a multilingual site.

Sample multilingual survey question

We begin to construct our Unicode-based multilingual Web page example with a sample survey question: "Do you want to buy a new computer? Yes___ No___" translated (except for English, of course!) into our two other target languages. The result is displayed in Figure 1.

Unicodization
Once this procedure is complete, we need to transfer these language texts into their Unicode equivalents. We could call this step "Unicodization" (I don't know if this term has been coined yet, but if not, it needs to be.) It is not necessary, of course, to translate the English characters of our example into their Unicode equivalents, since they would be displayed properly in any case because of ASCII. However, we do so anyway in order to demonstrate how the process works overall (as well as for consistency).

Transferring Unicode characters into their hexadecimal equivalents


Although Unicode does work with decimal numbers, hexadecimal numbers are the standard. The Unicode characters are transferred into their hexadecimal equivalents. The characters (with the underlying hexadecimal equivalents) are then placed by the software or by hand into whatever markup document is being prepared. The key point is that, whether the user or developer sees them or not, the hexadecimal equivalents are the foundation of the process, and can then be manipulated as needed for various other programming purposes.
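Finding the hexadecimal equivalent of each character is a one-line operation in most languages. A minimal Python sketch follows; the function name code_points is made up for this illustration.

```python
def code_points(text):
    """Return the Unicode code point of each character as a 'U+XXXX' string."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points("电脑"))   # the Chinese word for "computer"
print(code_points("Да"))     # Russian "yes"
print(code_points("Yes"))    # ASCII letters have code points too
```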

Random characters in other languages


In the previous example we used three languages. However, Unicode allows us to use a large number of languages -- at least in short sentences or segments -- on the same Web page.

Adding Unicode hexadecimal numbers to the page markup


The next step is transferring these Unicode hexadecimal numbers into markup language for the Web page which will be built around them. To do this, we add the symbols &#x to the front of the number, with a semicolon (;) placed at the end. For example, the Chinese characters for the word "computer" (电脑) are designated as follows in hexadecimal form: &#x7535; &#x8111;
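Writing the &#x...; references by hand is error-prone, so the step is easy to automate. The sketch below is one possible way to do it in Python; the function name to_hex_references and its ascii_too parameter are inventions for this example, and the standard html.unescape function is used only to check the result.

```python
import html


def to_hex_references(text, ascii_too=False):
    """Replace characters with &#x...; hexadecimal numeric character references.

    By default only non-ASCII characters are replaced; with ascii_too=True every
    character is replaced, as the tutorial does for its English text.
    Markup-significant ASCII such as & and < is not escaped separately here.
    """
    return "".join(
        f"&#x{ord(ch):X};" if ascii_too or ord(ch) > 127 else ch
        for ch in text
    )


encoded = to_hex_references("电脑")
print(encoded)
print(html.unescape(encoded))          # a browser performs this step when rendering
print(to_hex_references("No", ascii_too=True))
```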

Using XML tags


XML tags provide a basis for organization and for building complexity into multilingual documents. For the purpose of this tutorial, we simply use a single survey question as the basis of our Unicode-based multilingual document. However, as multilingual e-commerce develops, documents can easily become extremely complex. XML can provide an excellent mechanism for managing this complexity.
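One way to organize the survey question by language is to tag each version with the xml:lang attribute defined by the XML specification. The Python sketch below is an illustration under stated assumptions: the element names survey and question are invented, and the Russian and Chinese sentences are this example's own translations, not the text shown in the tutorial's figure.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute in ElementTree's notation.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative translations of the survey question (assumed, not from the source).
QUESTIONS = {
    "en": "Do you want to buy a new computer?",
    "ru": "Вы хотите купить новый компьютер?",
    "zh": "你想买一台新电脑吗？",
}


def build_survey(questions):
    """Build a <survey> element with one language-tagged <question> per entry."""
    survey = ET.Element("survey")
    for lang, text in questions.items():
        question = ET.SubElement(survey, "question")
        question.set(XML_LANG, lang)
        question.text = text
    return survey


xml_bytes = ET.tostring(build_survey(QUESTIONS), encoding="utf-8")
print(xml_bytes.decode("utf-8"))
```

Because the document is serialized as UTF-8, all three scripts sit in one file, and a processing script can select a question by its language tag.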

Multilingual standard for XML


There is, however, a problem: At present, there is not a consistent multilingual XML standard. As Yves Savourel points out in an article in the October/November 2000 edition of MultiLingual Computing and Technology magazine, "a standard markup method is needed for working with multilingual documents" ("XML Technologies and the Localization Process," #35, Volume 11, Issue 7, p. 62). Savourel's comment could eventually emerge as a major understatement! Just as Unicode itself is bringing standardization to the characters of the world's languages and symbols, a standard XML language for multilingual documents will become crucial to the smooth development of multilingual e-commerce.

Form development: More complex scripts and XML tags needed


For the purposes of our simple survey, the lack of a multilingual XML standard is not a problem. A more complex multilingual survey might be devised that would have numerous questions and would be sorted and tabulated by XML according to language, type of response, region of the world, or other factors. The user would first find his language, then work his way down through a series of questions, with answers and responses being sent back and forth to a cgi-bin script. For now, we will merely use a simple Perl CGI script. In a more complex multilingual Web page, numerous layers of scripts and responses might be utilized. Defining how those scripts are used and interact, and using XML in that process, is at the heart of building effective multilingual Web sites using Unicode. That is a more advanced topic that builds on what we have presented here.

Display of HTML
Now we're ready to view our basic survey question. For those who have not yet loaded a Unicode font.

Multilingual Computing with the 9.1 SAS Unicode Server

UNICODE SUPPORT IN SAS 9.1

In 9.1, SAS customers in many regions around the world will use the DBCS extensions in order to support global data (multilingual data which can only be represented in the Unicode character set). With the SAS Unicode server, it is now possible to write a SAS application which processes Japanese data, German data, Polish data, and more, all in the same session. A single server can deliver multilingual data to users around the world. This report discusses the following six scenarios for using the SAS Unicode server.

1. Populating a Unicode database.
2. Using SAS/SHARE as a Unicode data server.
3. Using thin-client applications with the Unicode data server.
4. Using SAS/IntrNet as a Unicode compute server.
5. Using SAS AppDev Studio as a Unicode compute server.
6. Generating Unicode HTML and PDF output using the SAS Output Delivery System (ODS).

The SAS Unicode server is designed to run on ASCII based machines. It can be run as a data or compute server or as a batch program.

RESTRICTIONS

There are a few restrictions to the SAS Unicode server.

1. The SAS Display Manager is not supported and if used will not display data correctly.
2. Enterprise Guide cannot access a SAS Unicode server.
3. You cannot run a SAS Unicode server on MVS (OS/390) or OpenVMS.
4. Fullscreen capability for UTF8 session encoding using national characters is not supported. Therefore, products that rely on fullscreen capability are not supported. This includes SAS/EIS, SAS/Warehouse Administrator, SAS/ASSIST, SAS/LAB, and Enterprise Miner V3.0 and earlier. FSEDIT, INSIGHT, and other Frame-based products are not supported.
5. Multi-lingual characters are not supported with SAS/GRAPH fonts, SAS/GRAPH ActiveX, and SAS/GRAPH Java Applets and UTF8.
6. OLEDB local data providers do not fully support multi-lingual data.
7. SAS/Access Engines to Oracle, ODBC, and OLEDB fully support the UTF8 encoding in SAS 9.1, but other Access engines do not.
8. OLEDB Local data providers and OLEDB IOM data providers do not support multilingual data.
9. The UTF8 server running on Windows does not support national characters for pathnames, such as an external file name or the directory name of a SAS dataset.

STARTING AND USING A SAS UNICODE SERVER

To start a SAS Unicode server you must do two things:

1. Install SAS 9.1 or later, with DBCS extensions.
2. Specify ENCODING UTF8 when you start SAS, such as:

   sas -encoding UTF8

POPULATING A UNICODE DATABASE

The first step in converting an existing database to Unicode or in setting up a new Unicode based system will be to convert all of your data from its legacy encoding to the UTF8 encoding. Once the data is in a Unicode database, there will not be any loss of data when it is read by a Unicode server.
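The idea of bringing data from several legacy encodings into one UTF-8 store can be sketched outside SAS. The Python example below is not the SAS procedure and uses no SAS interfaces; the record list, the place names in it, and the function name to_utf8 are assumptions chosen to mirror the Japanese, German, and Polish data mentioned earlier.

```python
# Each record remembers the legacy encoding it was stored in.
LEGACY_RECORDS = [
    ("shift_jis", "東京".encode("shift_jis")),        # Japanese
    ("iso8859_1", "München".encode("iso8859_1")),     # German
    ("iso8859_2", "Łódź".encode("iso8859_2")),        # Polish
]


def to_utf8(records):
    """Convert (encoding, bytes) records to UTF-8 bytes, failing loudly on bad input."""
    return [data.decode(encoding).encode("utf-8") for encoding, data in records]


unicode_store = to_utf8(LEGACY_RECORDS)
for item in unicode_store:
    print(item.decode("utf-8"))
```

After conversion the three values share one encoding, so a single session can read all of them without knowing where each came from.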

USING SAS/SHARE AS A UNICODE DATA SERVER

SAS/SHARE is a product that enables multiple users to access data from a central server. To convert your existing SAS/SHARE server to a SAS Unicode server you must specify the ENCODING UTF8 config option.

USING JDBC WITH A UNICODE DATA SERVER

The SAS system is continuously increasing support for industry standard data access protocols such as JDBC. The JDBC interface is a data access interface for Java applications. Java supports Unicode string data and therefore, it would be very natural for the SAS Unicode server to function as the data server for Java.

USING SAS/INTRNET AS A COMPUTE SERVER

The SAS system is often used as a compute server from a non-SAS client. This is another natural fit for the SAS Unicode server.

USING SAS APPDEV STUDIO AS A COMPUTE SERVER

SAS AppDev Studio enables Java programmers to run programs on a SAS server. The programs that run on the server are either SCL programs running with Jconnect or remote objects executed through SAS Integration Technologies.

GENERATING UNICODE OUTPUT USING ODS

A SAS Unicode server can be used in a batch program to produce ODS output with an encoding of UTF8. At the time of this writing, the following ODS output formats support encoding UTF8:

HTML
XML
PDF

UNICODE PROCESSING IN THE SAS SYSTEM

There are several Unicode-related features of SAS 9. These features are available for SAS sessions running legacy encodings as well as SAS sessions running with a UTF8 encoding.

Unicode ENCODING= values for FILENAME and ODS statements.
Unicode FORMATS and INFORMATS.
NL formats for displaying currency and date formats matching the user's locale.

Conclusion

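Thus, using Unicode for multilingual text reduces the risk of data corruption when text moves between systems. Conversion between Unicode and legacy codesets is straightforward with standard tools and takes little time, so information can be passed globally in any language. Because every character has a single unambiguous code, data is also better protected against misinterpretation. Multilingual Web pages can be developed on the same foundation.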

REFERENCES

1. Tony Graham. A guide to the Unicode standard and its use. M&T Press/IDG Books Worldwide.
2. MultiLingual Computing & Technology, published by MultiLingual Computing, Inc., 319 North First Avenue, Sandpoint, ID 83864.
3. SAS(R) 9.1 National Language Support (NLS) Reference. SAS Institute Inc., Cary, NC.
4. "Base SAS Software." SAS OnlineDoc, Version 9.1. 2003. CD-ROM. SAS Institute Inc., Cary, NC.
5. Cross-Environment Data Access (CEDA). "Base SAS Software." SAS OnlineDoc, Version 9. 2003. CD-ROM. SAS Institute Inc., Cary, NC.
6. Cross-Environment Data Access (CEDA). SAS Institute Inc., Cary, NC. Available at: http://support.sas.com/rnd/migration/planning/files/ceda.html.
7. Character Variable Padding (CVP). "Base SAS Software." SAS OnlineDoc, Version 9.1. 2003. CD-ROM. SAS Institute Inc., Cary, NC.
8. Encoding. "National Language Support (NLS) Reference." SAS OnlineDoc, Version 9.1. 2003. CD-ROM. SAS Institute Inc., Cary, NC.
9. For an excellent article and examples of some of the issues involved in using Perl, XML and Unicode, see Michel Rodriguez's "Character Encodings in XML and Perl" at XML.com (April, 2000).
