J02a-JavaCharsStrings
Version 1.2.0
© Marco Torchiano, 2023
Primitive types
Type     Size     Encoding
boolean  1 bit    -
char     16 bits  Unicode UTF-16
byte     8 bits   Signed integer, two's complement
short    16 bits  Signed integer, two's complement
int      32 bits  Signed integer, two's complement
long     64 bits  Signed integer, two's complement
float    32 bits  IEEE 754 single precision
double   64 bits  IEEE 754 double precision
void     -
Literals
§ Literal values for chars and strings
follow the C syntax
¨ Non-printable characters are introduced by
a '\' backslash
¨ Chars
'a' '%' '\n'
¨ Strings
"prova" "prova\n"
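The literal syntax above can be tried with a minimal sketch like the following. The class and constant names are illustrative only and do not come from the slides.

```java
class LiteralsDemo {
    static final char NEWLINE = '\n';      // escape sequence for a non-printable character
    static final char PERCENT = '%';       // ordinary printable character
    static final String WORD = "prova\n";  // string literal ending with a newline

    public static void main(String[] args) {
        System.out.println((int) NEWLINE);  // numeric code of the newline character
        System.out.println(WORD.length());  // the escape counts as one character
    }
}
```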
Characters and Strings
§ Characters
¨ Primitive type char
¨ Wrapper class Character
§ String
¨ No primitive representation!
¨ Class String
¨ Class StringBuffer
- and class StringBuilder
CHARACTERS
Wrapper Character
§ Encapsulates a single character
¨ Immutable (like all wrapper classes)
§ Utility methods for the char category
¨ isLetter()
¨ isDigit()
¨ isSpaceChar()
§ Utility methods for conversions
¨ toUpperCase()
¨ toLowerCase()
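As a rough illustration, the utility methods listed above are static methods of Character and can be called directly on char values. The class CharacterDemo and its classify helper are hypothetical names introduced for this sketch.

```java
class CharacterDemo {
    // Returns a coarse category for c using the Character utility methods.
    static String classify(char c) {
        if (Character.isLetter(c)) return "letter";
        if (Character.isDigit(c)) return "digit";
        if (Character.isSpaceChar(c)) return "space";
        return "other";
    }

    public static void main(String[] args) {
        System.out.println(classify('a'));
        System.out.println(classify('7'));
        System.out.println(classify(' '));
        System.out.println(Character.toUpperCase('a'));
        System.out.println(Character.toLowerCase('Q'));
    }
}
```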
Unicode
§ Standard that assigns a unique code
to every character in any language
¨ Core specification gives the general
principles
¨ Code charts show representative glyphs
for all the Unicode characters.
¨ Annexes supply detailed normative
information
¨ Character Database normative and
informative data for implementers
http://www.unicode.org/versions/latest/
Characters and Glyphs
§ Character: the abstract concept
¨ e.g. LATIN SMALL LETTER I
§ Glyph: the graphical representation of
a character
¨ e.g. the letter i as drawn in different typefaces
§ Font: a collection of glyphs
Unicode Codepoint
§ Codepoint: the numeric
representation of a character
¨ Included in the range 0 to 0x10FFFF (21 bits)
¨ Represented with U+ followed by the
hexadecimal code
¨ e.g. U+0069 for 'i'
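A small sketch of how code points can be inspected from Java, assuming the standard String.codePointAt and Character.toChars methods. The class and helper names are made up for this example.

```java
class CodePointDemo {
    // Formats the first code point of s in the U+XXXX notation.
    static String notation(String s) {
        return String.format("U+%04X", s.codePointAt(0));
    }

    // Builds a string holding the single character with the given code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        System.out.println(notation("i"));
        System.out.println(fromCodePoint(0x0069));
    }
}
```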
Unicode Encoding
Mapping between byte sequence and
code point.
§ UTF-32 fixed width, 32 bits per char
¨ At most 21 bits are used: wasted memory
§ UTF-16 variable width, represents
¨ codepoints from `U+0000` to `U+D7FF` and from
`U+E000` to `U+FFFF` on 16 bits (2 bytes)
¨ codepoints from `U+10000` to
`U+10FFFF` on 32 bits (4 bytes)
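Since a Java char is one UTF-16 code unit, the variable width shows up in the String API. The sketch below is only an illustration: the class name is hypothetical and U+1F600 is chosen as an arbitrary code point above U+FFFF.

```java
class Utf16Demo {
    // Number of 16-bit code units UTF-16 needs for one code point.
    static int codeUnits(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        String wide = new String(Character.toChars(0x1F600)); // outside the 16-bit range
        System.out.println(codeUnits(0x0069));
        System.out.println(codeUnits(0x1F600));
        System.out.println(wide.length());                        // counts code units
        System.out.println(wide.codePointCount(0, wide.length())); // counts code points
    }
}
```

Note that length() counts code units, not characters, so it can differ from the number of code points.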
Unicode Encoding
§ UTF-8 variable width,
¨ codepoints
`U+00` to `U+7f` are
mapped directly to bytes,
- i.e. ASCII transparent
¨ High bit (0x80) set marks a byte that belongs to a multi-byte sequence
¨ Most non-ideographic codepoints are
represented on 1 or 2 bytes
- e.g. `U+00E8` representing character ‘è’ is
mapped to two bytes: `0xC3` `0xA8`.
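The UTF-8 byte sequence of a string can be checked with String.getBytes. This is a minimal sketch; the Utf8Demo class and its hex helper are hypothetical, and the character è is written with the escape \u00E8 so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

class Utf8Demo {
    // Returns the UTF-8 bytes of s as space-separated hexadecimal values.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to get an unsigned value
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("i"));       // ASCII range: one byte
        System.out.println(hex("\u00E8"));  // outside ASCII: two bytes
    }
}
```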
Character set
§ Class Charset allows handling
different charsets
§ A few static methods
¨ defaultCharset()
¨ forName(..)
- Returns the corresponding charset
¨ availableCharsets()
- Returns a map of all charsets by name
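A brief sketch of the three static methods above. The class name and the isAvailable helper are assumptions for the example; the printed default charset and the number of available charsets depend on the platform.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class CharsetDemo {
    // True if a charset with this canonical name is available on the JVM.
    static boolean isAvailable(String name) {
        return Charset.availableCharsets().containsKey(name);
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();        // platform dependent
        Charset utf8 = Charset.forName("UTF-8");       // lookup by name
        SortedMap<String, Charset> all = Charset.availableCharsets();
        System.out.println("default:   " + def.name());
        System.out.println("looked up: " + utf8.name());
        System.out.println("available: " + all.size());
    }
}
```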
Predefined charsets
§ US-ASCII
¨ 7-bit ASCII, a.k.a. ISO646-US
§ ISO-8859-1
¨ 8-bit single byte ISO Latin No. 1, a.k.a. ISO-LATIN-1
§ UTF-8
¨ 8-bit multi byte UCS Transformation Format
§ UTF-16BE
¨ 16-bit UCS Transformation Fmt., big-endian
§ UTF-16LE
¨ 16-bit UCS Transformation Fmt., little-endian
§ UTF-16
¨ 16-bit UCS Transformation Fmt., w/byte-order mark
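The predefined charsets are exposed as constants of StandardCharsets. The sketch below, with a hypothetical encodedSize helper, compares how many bytes each one produces for the same one-character string; the UTF-16 variant is longer because the encoder prepends a byte-order mark.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PredefinedCharsetsDemo {
    // Number of bytes produced when encoding text with the given charset.
    static int encodedSize(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.US_ASCII, StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8, StandardCharsets.UTF_16BE,
            StandardCharsets.UTF_16LE, StandardCharsets.UTF_16
        };
        for (Charset cs : charsets) {
            System.out.println(cs.name() + ": " + encodedSize("a", cs) + " byte(s)");
        }
    }
}
```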
Encoding and Decoding
§ Convenience methods
¨ CharBuffer decode(ByteBuffer)
¨ ByteBuffer encode(CharBuffer)
§ Generation of codecs
¨ getDecoder()
¨ getEncoder()
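The convenience methods can be combined into a round trip from characters to bytes and back. This is a sketch under the assumption that the text is representable in the chosen charset; the class and method names are illustrative.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDecodeDemo {
    // Encodes text to bytes and decodes the bytes back to characters.
    static String roundTrip(String text, Charset cs) {
        ByteBuffer bytes = cs.encode(CharBuffer.wrap(text));
        CharBuffer chars = cs.decode(bytes);
        return chars.toString();
    }

    // Number of bytes the charset needs for text.
    static int encodedLength(String text, Charset cs) {
        return cs.encode(CharBuffer.wrap(text)).remaining();
    }

    public static void main(String[] args) {
        String text = "prova\u00E8";
        System.out.println(roundTrip(text, StandardCharsets.UTF_8));
        System.out.println(encodedLength(text, StandardCharsets.UTF_8));
    }
}
```

getEncoder() and getDecoder() give finer control, for example over how malformed or unmappable input is handled.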
Operator +
§ It is used to concatenate 2 strings
"This is " + "a concatenation"
¨ Remember: strings are immutable,
therefore + creates a new string object
with the concatenation
§ Works also with other types
¨ Everything is automatically converted to a
string representation and concatenated
System.out.println("pi = " + 3.14);
System.out.println("x = " + x);
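A short sketch of both points: the operands are left unchanged by +, and non-string operands are converted automatically. The ConcatDemo class and its label helper are hypothetical.

```java
class ConcatDemo {
    // Any value is converted to its string representation before concatenation.
    static String label(String name, Object value) {
        return name + " = " + value;
    }

    public static void main(String[] args) {
        String a = "This is ";
        String b = a + "a concatenation"; // new object, a is left untouched
        System.out.println(a);
        System.out.println(b);
        System.out.println(label("pi", 3.14));
        System.out.println(label("x", 5));
    }
}
```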
String methods
§ int length()
¨ returns string length
§ boolean equals(String s)
¨ compares the contents of two strings
String h = "Hello";
String w = "World";
String hw = "Hello World";
String h_w = h + " " + w;
hw.equals(h_w) // -> true
hw == h_w // -> false
String methods
§ String toUpperCase()
¨ Converts string to upper case
§ String toLowerCase()
¨ Converts string to lower case
§ String concat(String str)
¨ Creates a concatenation with the given string
§ int compareTo(String str)
¨ Compare to another string returning
- < 0 : if this string precedes the other
- == 0 : if this string equals the other
- > 0 : if this string follows the other
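The methods above could be exercised as in the following sketch. The order helper is a hypothetical name that turns the sign of compareTo into a readable message.

```java
class StringMethodsDemo {
    // Describes the lexicographic order of a and b using compareTo.
    static String order(String a, String b) {
        int c = a.compareTo(b);
        if (c < 0) return a + " precedes " + b;
        if (c > 0) return a + " follows " + b;
        return a + " equals " + b;
    }

    public static void main(String[] args) {
        System.out.println("Hello".toUpperCase());
        System.out.println("Hello".toLowerCase());
        System.out.println("Hello".concat(" World"));
        System.out.println(order("apple", "banana"));
        System.out.println(order("pear", "pear"));
    }
}
```

Only the sign of the compareTo result is meaningful for ordering, so the sketch never tests for a specific value other than zero.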
Method substring
§ String substring(int startIndex)
"Human".substring(2) → "man"
§ String substring(int start, int end)
¨ Char start included, end excluded
"Greatest".substring(0,5) → "Great"
§ int indexOf(String str)
¨ Returns the index of the first occurrence of str
§ int lastIndexOf(String str)
¨ Returns the index of the last occurrence of str
(the search starts from the end)
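The searching and slicing methods are often combined. The sketch below uses a hypothetical afterLast helper that keeps the text following the last separator; the sample path is made up.

```java
class SubstringDemo {
    // Returns the part of text after the last occurrence of sep,
    // or the whole text if sep does not occur.
    static String afterLast(String text, String sep) {
        int pos = text.lastIndexOf(sep);
        return pos < 0 ? text : text.substring(pos + sep.length());
    }

    public static void main(String[] args) {
        System.out.println("Human".substring(2));
        System.out.println("Greatest".substring(0, 5));
        System.out.println("banana".indexOf("an"));
        System.out.println("banana".lastIndexOf("an"));
        System.out.println(afterLast("a/b/c.txt", "/"));
    }
}
```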
String (static methods)
§ String valueOf(..)
¨ Converts any primitive type into a String
¨ Overloads defined for all primitive types
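A few of the valueOf overloads, as a minimal sketch (the class name is illustrative):

```java
class ValueOfDemo {
    public static void main(String[] args) {
        String fromInt = String.valueOf(42);        // int overload
        String fromBoolean = String.valueOf(true);  // boolean overload
        String fromChar = String.valueOf('c');      // char overload
        String fromDouble = String.valueOf(2.5);    // double overload
        System.out.println(fromInt + " " + fromBoolean + " " + fromChar + " " + fromDouble);
    }
}
```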
Format essentials
%[arg_index$][flags][width][.prec]conversion
§ arg_index: position of the argument, starts at 1
§ width: min width
§ .prec: max width, or number of decimal digits for floats
§ Flags
¨ - : left justified
¨ + : include sign
¨ 0 : zero padding
¨ ( : negative numbers in parentheses
§ Conversions
¨ b : boolean
¨ s : string
¨ d : integer
¨ f : decimal
¨ e : scientific
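The format specifiers can be tried with String.format. In this sketch the fmt helper is a hypothetical wrapper that fixes Locale.ROOT, an assumption made here so that the decimal separator does not depend on the system locale.

```java
import java.util.Locale;

class FormatDemo {
    // Formats with a fixed locale so the decimal separator is always a dot.
    static String fmt(String pattern, Object... args) {
        return String.format(Locale.ROOT, pattern, args);
    }

    public static void main(String[] args) {
        System.out.println(fmt("[%5d]", 42));            // min width, right justified
        System.out.println(fmt("[%-5d]", 42));           // left justified
        System.out.println(fmt("[%05d]", 42));           // zero padding
        System.out.println(fmt("[%+d]", 42));            // include sign
        System.out.println(fmt("[%(d]", -42));           // negative in parentheses
        System.out.println(fmt("[%8.3f]", 3.14159));     // min width 8, 3 decimal digits
        System.out.println(fmt("[%.3s]", "prova"));      // max width 3 for a string
        System.out.println(fmt("%2$s %1$s", "a", "b"));  // explicit argument index
        System.out.println(fmt("%b %e", true, 12345.678));
    }
}
```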
StringBuffer
§ Represents a string of characters
§ It is mutable and allows operations that
modify the content
§ Can be converted to the
corresponding String using the
method toString()
StringBuffer
§ append(String str)
¨ Inserts str at the end of string
§ insert(int offset, String str)
¨ Inserts str starting from offset position
§ delete(int start, int end)
¨ Deletes characters from start to end (excluded)
§ reverse()
¨ Reverses the sequence of characters
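The four operations can be chained on the same buffer, since each one modifies it in place. A minimal sketch with a hypothetical class name and sample text:

```java
class StringBufferDemo {
    // Applies insert, append, delete and reverse to the same buffer.
    static String edit() {
        StringBuffer sb = new StringBuffer("World");
        sb.insert(0, "Hello ");  // content is now "Hello World"
        sb.append("!");          // content is now "Hello World!"
        sb.delete(5, 11);        // removes indexes 5..10, content is now "Hello!"
        sb.reverse();            // content is now "!olleH"
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(edit());
    }
}
```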
Class StringBuilder
§ Method-level compatible with
StringBuffer
§ Non thread safe and non reentrant
§ More efficient: ~30% faster
Performance issues
§ String concatenation: 2 sec (N = 100k)
String s = "";
for (int i = 0; i < N; ++i) {
    s += i;
}
§ StringBuffer: 2.9 ms (N = 100k)
StringBuffer sb = new StringBuffer();
for (int i = 0; i < N; ++i) {
    sb.append(i);
}
§ StringBuilder: 2.2 ms (N = 100k)
StringBuilder sb = new StringBuilder();
for (int i = 0; i < N; ++i) {
    sb.append(i);
}
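A comparison along these lines can be reproduced with a rough timing sketch such as the one below. It is not a rigorous benchmark (no warm-up, single run), the measured times depend on the JVM and the hardware, and the value of n and all names are assumptions for illustration.

```java
class ConcatBenchmark {
    static String withString(int n) {
        String s = "";
        for (int i = 0; i < n; ++i) {
            s += i;            // allocates a new String at every iteration
        }
        return s;
    }

    static String withBuilder(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; ++i) {
            sb.append(i);      // appends to the same buffer
        }
        return sb.toString();
    }

    // Elapsed wall-clock time of task in milliseconds.
    static long millis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int n = 20_000; // hypothetical size, smaller than the 100k used in the slide
        System.out.println("String:        " + millis(() -> withString(n)) + " ms");
        System.out.println("StringBuilder: " + millis(() -> withBuilder(n)) + " ms");
    }
}
```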
String internalization
char[] chars = {'H', 'i'};
String s1 = new String(chars);
String s2 = new String(chars);
String i1 = s1.intern();
String i2 = s2.intern();
§ s1 and s2 refer to two distinct String objects, both containing "Hi"
§ s1.intern() adds the string to the pool
since none equal already exists
¨ i1 refers to the string in the pool
§ s2.intern() returns the already
existing equal string
¨ i2 refers to the same pooled string as i1
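The effect of intern() can be observed by comparing references with ==. A minimal sketch; the class and helper names are hypothetical.

```java
class InternDemo {
    // True if the interned versions of a and b are the very same object.
    static boolean sameAfterIntern(String a, String b) {
        return a.intern() == b.intern();
    }

    public static void main(String[] args) {
        char[] chars = {'H', 'i'};
        String s1 = new String(chars);
        String s2 = new String(chars);
        System.out.println(s1 == s2);                 // distinct objects
        System.out.println(s1.equals(s2));            // same contents
        System.out.println(sameAfterIntern(s1, s2));  // same pooled object
    }
}
```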
Internalizing literals
String ss1 = "Hi";
¨ Has the same effect as:
String ss1 = (new String(
new char[]{'H', 'i'})
).intern();
¨ On the first occurrence of a literal
- the runtime creates the string and
- adds it to the pool
¨ Upon later occurrences of the same literal
- the reference to the string already in the pool is returned, as if through intern()
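The consequence for reference comparison can be sketched as follows; the class and method names are made up for the example.

```java
class LiteralPoolDemo {
    // Two occurrences of the same literal refer to the same pooled object.
    static boolean literalsShared() {
        String a = "Hi";
        String b = "Hi";
        return a == b;
    }

    // An explicitly constructed string is a distinct object, not the pooled one.
    static boolean newIsDistinct() {
        return new String("Hi") != "Hi";
    }

    // Interning the constructed string yields the pooled literal again.
    static boolean internMatchesLiteral() {
        return new String("Hi").intern() == "Hi";
    }

    public static void main(String[] args) {
        System.out.println(literalsShared());
        System.out.println(newIsDistinct());
        System.out.println(internMatchesLiteral());
    }
}
```

This is why contents should be compared with equals(): whether == succeeds depends on how the strings were created.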
Wrap-up
§ Java characters are stored as 16-bit Unicode (UTF-16) code units
§ Conversion to/from streams of bytes is
managed by Charset objects
§ String is an immutable representation of
strings
§ StringBuffer and StringBuilder are mutable
¨ Significantly more efficient for string
manipulation
References
§ Unicode specification
¨ http://www.unicode.org/versions/latest/