The Ruby Language
This chapter is a bottom-up look at the Ruby language. Unlike the
previous tutorial, here we're concentrating on presenting facts,
rather than motivating some of the language design features. We also
ignore the built-in classes and modules where possible. These are
covered in depth starting on page 275.
If the content of this chapter looks familiar, it's because it should;
we've covered just about all of this in the earlier tutorial chapters.
Consider this chapter to be a self-contained reference to the core Ruby
language.
Ruby programs are written in 7-bit ASCII. [Ruby also has extensive
support for Kanji, using the EUC, SJIS, or UTF-8 coding system. If a
code set other than 7-bit ASCII is used, the KCODE option must be
set appropriately, as shown on page 137.]
Ruby is a line-oriented language. Ruby expressions and statements are
terminated at the end of a line unless the statement is obviously
incomplete---for example if the last token on a line is an operator or
comma.
A semicolon can be used to separate
multiple expressions on a line. You can also put a backslash at the
end of a line to continue it onto the next. Comments start
with `#' and run to the end of the
physical line. Comments are ignored during compilation.
a = 1
b = 2; c = 3
d = 4 + 5 + # no '\' needed
6 + 7
e = 8 + 9 \
+ 10 # '\' needed
Physical lines between a line starting with =begin and a line starting
with =end are ignored by the compiler and may be used for embedded
documentation (see Appendix A, which begins on page 511).
Ruby reads its program input in a single pass, so you can pipe
programs to the compiler's stdin.
echo 'print "Hello\n"' | ruby
If the compiler comes across a line anywhere in the source containing
just ``__END__'', with no leading or trailing whitespace, it
treats that line as the end of the program---any subsequent lines will not be
compiled. However, these lines can be read into the running program
using the global IO object DATA, described on page 217.
Every Ruby source file can declare blocks of code to be run as the
file is being loaded (the BEGIN blocks) and after the program
has finished executing (the END blocks).
BEGIN {
  begin code
}
END {
  end code
}
A program may include multiple BEGIN and END blocks. BEGIN blocks are
executed in the order they are encountered. END blocks are executed in
reverse order.
There are alternative forms of literal strings, arrays, regular
expressions, and shell commands that are specified using a generalized
delimited syntax.
All these literals start with a percent character,
followed by a single character that identifies the literal's type.
These
characters are summarized in Table 18.1 on page 200; the actual
literals are described in the corresponding sections later in this
chapter.
Table 18.1: General delimited input

Type       Meaning                      See Page
%q         Single-quoted string         202
%Q, %      Double-quoted string         202
%w         Array of tokens              204
%r         Regular expression pattern   205
%x         Shell command                218
Following the type character is a delimiter, which can be any
character. If the delimiter is one of the characters ``('', ``['',
``{'', or ``<'', the literal consists of the
characters up to the matching closing delimiter, taking account of
nested delimiter pairs. For all other delimiters, the literal
comprises the characters up to the next occurrence of the delimiter
character.
%q/this is a string/
%q-string-
%q(a (nested) string)
Delimited strings may continue over multiple lines.
%q{def fred(a)
  a.each { |i| puts i }
end}
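The delimiter rules above can be tried out with a short sketch. The variable names are illustrative only, and the shell-command form %x is left out so that the example has no side effects.

```ruby
# Each literal below uses general delimited syntax; the comment shows
# the ordinary form it corresponds to.
single  = %q(a (nested) string)      # like 'a (nested) string'
double  = %Q{1 + 1 = #{1 + 1}}       # like "1 + 1 = #{1 + 1}"
bare    = %<also double-quoted: #{2 * 3}>
words   = %w[fred wilma barney]      # like ["fred", "wilma", "barney"]
pattern = %r{/usr/local}             # like /\/usr\/local/

puts single, double, bare
p words
p(pattern =~ "/usr/local/bin")       # index of the match
```

Paired delimiters such as ( ) and { } nest, which is why the inner parentheses in the first literal do not end the string.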
The basic types in Ruby are numbers, strings, arrays, hashes, ranges,
symbols, and regular expressions.
Ruby integers are objects of class Fixnum or Bignum. Fixnum objects
hold integers that fit within the native machine word minus 1 bit.
Whenever a Fixnum exceeds this range, it is automatically converted
to a Bignum object, whose range is effectively limited only
by available memory. If an operation with a Bignum result
has a final value that will fit in a Fixnum, the result will
be returned as a Fixnum.
Integers are written using an optional leading sign, an optional base
indicator (0 for octal, 0x for hex, or 0b for binary), followed by a
string of digits in the appropriate base.
Underscore characters are ignored in the digit string.
You can get the integer value corresponding to an ASCII
character by preceding that character with a question mark. Control
and meta combinations of characters can also be generated using
?\C-x, ?\M-x, and ?\M-\C-x. The control version of ch is ch & 0x9f,
and the meta version is ch | 0x80. You can get the integer value of a
backslash character using the sequence ?\\.
123456                    # Fixnum
123_456                   # Fixnum (underscore ignored)
-543                      # Negative Fixnum
123_456_789_123_345_789   # Bignum
0xaabb                    # Hexadecimal
0377                      # Octal
-0b1010                   # Binary (negated)
0b001_001                 # Binary
?a                        # character code
?A                        # upper case
?\C-a                     # control a = A - 0x40
?\C-A                     # case ignored for control chars
?\M-a                     # meta sets bit 7
?\M-\C-a                  # meta and control a
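The literal forms and the control/meta arithmetic described above can be checked with a short sketch. It assumes a recent Ruby (1.9 or later), where ?a evaluates to a one-character string rather than an integer, so String#ord is used to recover the character code. The variable names are illustrative only.

```ruby
# Different spellings of integer literals.
puts 123_456 == 123456        # underscores are ignored
puts 0xaabb                   # hexadecimal
puts 0377                     # octal
puts(-0b1010)                 # binary, negated
puts 0b001_001                # binary with an underscore

# Character codes. In Ruby 1.9 and later ?a is the string "a",
# so ord is used to get the integer the text refers to.
ch      = ?a.ord
control = ch & 0x9f           # control version of ch
meta    = ch | 0x80           # meta version of ch
puts ch, control, meta

# Results too large for a machine word are promoted automatically,
# and shrink back when the value fits again.
big   = 2**64
small = big - 2**64 + 1
puts small
```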
A numeric literal with a decimal point and/or an exponent is turned
into a Float object, corresponding to the native
architecture's double data type. You must
follow the decimal point with a digit, as 1.e3
tries to invoke the method e3 in class Fixnum.
12.34      »  12.34
-.1234e2   »  -12.34
1234e-2    »  12.34
Ruby provides a number of mechanisms for creating literal strings.
Each generates objects of type String. The different
mechanisms vary in terms of how a string is delimited and how much
substitution is done on the literal's content.
Single-quoted string literals ('stuff' and %q/stuff/)
undergo the least substitution.
Both convert the sequence \\ into a single backslash, and the form with
single quotes converts \' into a single quote.
'hello'                      »  hello
'a backslash \'\\\''         »  a backslash '\'
%q/simple string/            »  simple string
%q(nesting (really) works)   »  nesting (really) works
%q no_blanks_here ;          »  no_blanks_here
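One way to see how little substitution single-quoted strings undergo is to count the characters that survive. The following is a small sketch with illustrative variable names.

```ruby
# Inside single quotes only \\ and \' are treated specially.
backslash = 'a\\b'       # three characters: a, \, b
quote     = 'it\'s'      # four characters: i, t, ', s
newline   = '\n'         # two characters: \ and n, not a newline

puts backslash.length
puts quote
puts newline.length

# %q needs no escape for a quote character.
plain = %q(it's)
puts plain.length
```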
Double-quoted strings ("stuff", %Q/stuff/, and %/stuff/)
undergo additional substitutions, shown in Table
18.2 on page 203.
Table 18.2: Substitutions in double-quoted strings

\a    Bell/alert (0x07)      \nnn      Octal nnn
\b    Backspace (0x08)       \xnn      Hex nn
\e    Escape (0x1b)          \cx       Control-x
\f    Formfeed (0x0c)        \C-x      Control-x
\n    Newline (0x0a)         \M-x      Meta-x
\r    Return (0x0d)          \M-\C-x   Meta-control-x
\s    Space (0x20)           \x        x
\t    Tab (0x09)             #{expr}   Value of expr
\v    Vertical tab (0x0b)
a = 123
"\123mile"                       »  Smile
"Say \"Hello\""                  »  Say "Hello"
%Q!"I said 'nuts'," I said!      »  "I said 'nuts'," I said
%Q{Try #{a + 1}, not #{a - 1}}   »  Try 124, not 122
%<Try #{a + 1}, not #{a - 1}>    »  Try 124, not 122
"Try #{a + 1}, not #{a - 1}"     »  Try 124, not 122
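A few of the substitutions from the table can be checked directly. This is only a sketch; the variable count and its value are made up for the illustration.

```ruby
count = 3   # hypothetical value used for interpolation

octal  = "\123mile"                  # \123 is octal for "S"
hex    = "\x41\x42"                  # \xnn gives the byte with hex value nn
tab    = "\t".ord                    # character code of a tab
escape = "\e".ord                    # character code of escape
space  = "\s"                        # \s is a space

interpolated = "count + 1 = #{count + 1}"   # expression substitution
literal      = 'count + 1 = #{count + 1}'   # none in single quotes

puts octal, hex, tab, escape
puts interpolated, literal
```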
Strings can continue across multiple input lines, in which case they
will contain newline characters. It is also possible to use here
documents to express long string literals. Whenever Ruby parses the
sequence <<identifier or <<quoted string, it
replaces it with a string literal built from successive logical input
lines.
It stops building the string when it finds a line that starts
with the identifier or the quoted string. You can put a minus
sign immediately after the << characters, in which case the
terminator can be indented from the left margin. If a quoted
string was used to specify the terminator, its quoting rules will be
applied to the here document; otherwise, double-quoting rules apply.
a = 123
print <<HERE
Double quoted \
here document.
Sum = #{a + 1}
HERE
print <<-'THERE'
    This is single quoted.
    The above used #{a + 1}
    THERE
produces:
Double quoted here document.
Sum = 124
    This is single quoted.
    The above used #{a + 1}
Adjacent single- and double-quoted strings in the input are
concatenated to form a single String object.
'Con' "cat" 'en' "ate"   »  "Concatenate"
Strings are stored as sequences of 8-bit bytes, and each byte may
contain any of the 256 8-bit values, including null and newline.
[For use in Japan, the jcode library supports a set of operations on
strings written with EUC, SJIS, or UTF-8 encoding. The underlying
string, however, is still accessed as a series of bytes.]
The substitution mechanisms in
Table 18.2 on page 203 allow nonprinting characters to be
inserted conveniently and portably.
Every time a string literal is used in an assignment or as a
parameter, a new String object is created.
for i in 1..3
  print 'hello'.id, " "
end
produces:
537767360 537767070 537767040
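The same observation can be made without comparing printed ids by hand. The sketch below assumes a recent Ruby, where the method is called object_id rather than id, and assumes frozen string literals are not enabled for the file.

```ruby
# Evaluate the same literal three times and keep the results alive.
copies = (1..3).map { 'hello' }

ids = copies.map { |s| s.object_id }
puts ids.uniq.length                   # number of distinct objects
puts copies[0] == copies[1]            # same contents
puts copies[0].equal?(copies[1])       # identity comparison
```

Here == compares contents while equal? compares identity, so the two can disagree for separately created strings.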
The documentation for class String starts on page 363.
Outside the context of a conditional expression, expr..expr and
expr...expr construct Range objects.
The two-dot form is an inclusive range;
the one with three dots is a range that excludes its last element.
See the description of class Range on page 359 for
details. Also see the description of conditional expressions
on page 222 for other uses of ranges.
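The difference between the two forms shows up as soon as a range is expanded. A minimal sketch, with illustrative variable names:

```ruby
inclusive = (1..5)     # two dots: includes the last element
exclusive = (1...5)    # three dots: excludes the last element

p inclusive.to_a
p exclusive.to_a
p inclusive.include?(5)
p exclusive.include?(5)

letters = ('a'..'e').to_a   # ranges are not limited to numbers
p letters
```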
Literals of class Array are created by placing a comma-separated
series of object references between square brackets. A trailing comma
is ignored.
arr = [ fred, 10, 3.14, "This is a string", barney("pebbles"), ]
Arrays of strings can be constructed using a shortcut notation, %w,
which extracts space-separated tokens into successive
elements of the array. A space can be escaped with a backslash.
This is a form of general delimited input,
described on pages 200--201.
arr = %w( fred wilma barney betty great\ gazoo )
arr   »  ["fred", "wilma", "barney", "betty", "great gazoo"]
A literal Ruby Hash is created by placing a list of key/value
pairs between braces, with either a comma or the sequence =>
between the key and the value. A trailing comma is ignored.
colors = { "red"   => 0xf00,
           "green" => 0x0f0,
           "blue"  => 0x00f
         }
There is no requirement for the keys and/or values in a particular
hash to have the same type.
The only restriction for a hash key is that it must respond to the
message hash with a hash value, and the hash value for a
given key must not change.
This means that certain classes (such as Array and Hash, as of this
writing) can't conveniently be used
as keys, because their hash values can change based on their contents.
If you keep an external reference to an object that is used as a key,
and use that reference to alter the object and change its hash value,
the hash lookup based on that key may not work.
Because strings are the most frequently used keys, and because string
contents are often changed, Ruby treats string keys specially. If you
use a String object as a hash key, the hash will duplicate the
string internally and will use that copy as its key. Any changes
subsequently made to the original string will not affect the hash.
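The special treatment of string keys can be observed with a small sketch. The names and values are invented for the illustration, and String.new is used only to make sure the key starts out as an ordinary mutable string.

```ruby
key = String.new("red")     # a mutable string (hypothetical example data)
colors = {}
colors[key] = 0xf00         # the hash stores its own copy of the key

key << "dish"               # change the original string afterwards

p key
p colors["red"]             # the entry is still found under the old text
p colors["reddish"]         # and not under the new text
p colors.keys.first.equal?(key)
```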
If you write your own classes and use instances of them as hash keys, you
need to make sure that either (a) the hashes of the key objects
don't change once the objects have been created or (b) you remember
to call the Hash#rehash method to reindex the hash whenever a
key hash is changed.
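Option (b) can be sketched as follows. The class Tag and its id attribute are hypothetical; the class deliberately derives its hash value from mutable state so that the effect of a stale index is visible.

```ruby
# A hypothetical key class whose hash value depends on mutable state.
class Tag
  attr_accessor :id

  def initialize(id)
    @id = id
  end

  def hash
    @id
  end

  def eql?(other)
    other.is_a?(Tag) && other.id == @id
  end
  alias == eql?
end

tag = Tag.new(1)
table = { tag => "first" }

tag.id = 2              # the key's hash value changes behind the hash's back
before = table[tag]     # looked up under the new hash value
table.rehash            # reindex using the current hash values
after = table[tag]

p before
p after
```

Keeping the hash value fixed for the lifetime of the object, option (a), avoids the need for rehash altogether.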
A Ruby symbol is the internal representation of a name. You construct
the symbol for a name by preceding the name with a colon. A particular
name will always generate the same symbol, regardless of how that name
is used within the program.
Other languages call this process ``interning,'' and call symbols
``atoms.''
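That a name always yields the same symbol can be checked by comparing object identity. A minimal sketch, with illustrative variable names:

```ruby
first  = :fred
second = :fred

p first.equal?(second)            # the very same object
p "fred".to_sym.equal?(first)     # also when built from a string
p first.to_s                      # back to the name
```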
Regular expression literals are objects of type Regexp. They can
be created by explicitly calling the Regexp.new constructor, or
by using the literal forms, /pattern/ and %r{pattern}. The %r
construct is a form of general delimited input (described
on pages 200--201).
/pattern/
/pattern/options
%r{pattern}
%r{pattern}options
Regexp.new( 'pattern' [, options ] )
A regular expression may include one or more options that modify the
way the pattern matches strings. If you're using literals to create
the Regexp object, then the options comprise one or more characters
placed immediately after the terminator. If you're using Regexp.new,
the options are constants used as the second parameter of the
constructor.
i
    Case Insensitive. The pattern match will ignore the case of
    letters in the pattern and string. Matches are also
    case-insensitive if the global variable $= is set.
o
    Substitute Once. Any #{...} substitutions in a particular regular
    expression literal will be performed just once, the first time it
    is evaluated. Otherwise, the substitutions will be performed every
    time the literal generates a Regexp object.
m
    Multiline Mode. Normally, ``.'' matches any character except a
    newline. With the /m option, ``.'' matches any character.
x
    Extended Mode. Complex regular expressions can be difficult to
    read. The `x' option allows you to insert spaces, newlines, and
    comments in the pattern to make it more readable.
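The options above can be compared side by side in a short sketch. It uses the constant Regexp::IGNORECASE for the constructor form; the strings and variable names are invented for the illustration.

```ruby
# The same option given as a literal suffix and as a constructor constant.
literal     = /ruby/i
constructed = Regexp.new('ruby', Regexp::IGNORECASE)
p literal.source == constructed.source
p constructed.casefold?
position = literal =~ "I like RUBY"    # index of the match
p position

# /m lets "." match a newline as well.
without_m = "a\nb" =~ /a.b/
with_m    = "a\nb" =~ /a.b/m
p without_m
p with_m

# /x ignores whitespace and comments inside the pattern.
date = /
  (\d+)   # month
  \/
  (\d+)   # day
/x
parts = date.match("12/25")[1, 2]
p parts
```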
regular characters
    All characters except ., |, (, ), [, \, ^, {, +, $, *, and ? match
    themselves. To match one of these characters, precede it with a
    backslash.
^
    Matches the beginning of a line.
$
    Matches the end of a line.
\A
    Matches the beginning of the string.
\z
    Matches the end of the string.
\Z
    Matches the end of the string unless the string ends with a
    ``\n'', in which case it matches just before the ``\n''.
\b, \B
    Match word boundaries and nonword boundaries respectively.
[characters]
    A character class matches any single character between the
    brackets. The characters |, (, ), [, ^, $, *, and ?, which have
    special meanings elsewhere in patterns, lose their special
    significance between brackets. The sequences \nnn, \xnn, \cx,
    \C-x, \M-x, and \M-\C-x have the meanings shown in Table 18.2 on
    page 203. The sequences \d, \D, \s, \S, \w, and \W are
    abbreviations for groups of characters, as shown in Table 5.1 on
    page 59. The sequence c1-c2 represents all the characters between
    c1 and c2, inclusive. Literal ] or - characters must appear
    immediately after the opening bracket. An uparrow (^) immediately
    following the opening bracket negates the sense of the
    match---the pattern matches any character that isn't in the
    character class.
\d, \s, \w
    Are abbreviations for character classes that match digits,
    whitespace, and word characters, respectively. \D, \S, and \W
    match characters that are not digits, whitespace, or word
    characters. These abbreviations are summarized in Table 5.1 on
    page 59.
. (period)
    Appearing outside brackets, matches any character except a
    newline. (With the /m option, it matches newline, too.)
re*
    Matches zero or more occurrences of re.
re+
    Matches one or more occurrences of re.
re{m,n}
    Matches at least ``m'' and at most ``n'' occurrences of re.
re?
    Matches zero or one occurrence of re. The *, +, and {m,n}
    modifiers are greedy by default. Append a question mark to make
    them minimal.
re1|re2
    Matches either re1 or re2. | has a low precedence.
(...)
    Parentheses are used to group regular expressions. For example,
    the pattern /abc+/ matches a string containing an ``a,'' a ``b,''
    and one or more ``c''s. /(abc)+/ matches one or more sequences
    of ``abc''. Parentheses are also used to collect the results of
    pattern matching. For each opening parenthesis, Ruby stores the
    result of the partial match between it and the corresponding
    closing parenthesis as successive groups. Within the same pattern,
    \1 refers to the match of the first group, \2 the second group,
    and so on. Outside the pattern, the special variables $1, $2, and
    so on, serve the same purpose.
#{...}
    Performs an expression substitution, as with strings. By default,
    the substitution is performed each time a regular expression
    literal is evaluated. With the /o option, it is performed just the
    first time.
\0, \1, \2, ... \9, \&, \`, \', \+
    Substitutes the value matched by the nth grouped subexpression, or
    by the entire match, pre- or postmatch, or the highest group.
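Several of the constructs listed above are easier to tell apart when they are run against the same input. The strings and variable names in this sketch are invented for the illustration.

```ruby
text = "first\nsecond\n"
line_start   = text =~ /^second/     # ^ matches at the start of any line
string_start = text =~ /\Asecond/    # \A only at the start of the string
before_nl    = text =~ /second\Z/    # \Z allows one trailing newline
very_end     = text =~ /second\z/    # \z is the very end of the string
p line_start, string_start, before_nl, very_end

tags = "<a><b>"
greedy  = tags[/<.+>/]               # takes as much as possible
minimal = tags[/<.+?>/]              # takes as little as possible
p greedy, minimal

doubled = "hello" =~ /(l)\1/         # \1 refers back to the first group
swapped = "John Smith".sub(/(\w+) (\w+)/, '\2, \1')
p doubled, swapped
```

The replacement string is single-quoted so that \1 and \2 reach sub unchanged.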
In common with Perl and Python, Ruby regular expressions offer some
extensions over traditional Unix regular expressions. All the extensions are
entered between the characters (? and ). The parentheses
that bracket these extensions are groups, but they do not generate
backreferences: they do not set the values of \1 and $1 etc.
(?# comment)
    Inserts a comment into the pattern. The content is ignored during
    pattern matching.
(?:re)
    Makes re into a group without generating backreferences. This is
    often useful when you need to group a set of constructs but don't
    want the group to set the value of $1 or whatever. In the example
    that follows, both patterns match a date with either slashes or
    colons between the month, day, and year. The first form stores the
    separator character in $2 and $4, while the second pattern doesn't
    store the separator in an external variable.

    date = "12/25/01"
    date =~ %r{(\d+)(/|:)(\d+)(/|:)(\d+)}
    [$1,$2,$3,$4,$5]   »  ["12", "/", "25", "/", "01"]
    date =~ %r{(\d+)(?:/|:)(\d+)(?:/|:)(\d+)}
    [$1,$2,$3]         »  ["12", "25", "01"]
(?=re)
    Matches re at this point, but does not consume it (also known
    charmingly as ``zero-width positive lookahead''). This lets
    you look forward for the context of a match without affecting
    $&. In this example, the scan method matches words
    followed by a comma, but the commas are not included in the result.

    str = "red, white, and blue"
    str.scan(/[a-z]+(?=,)/)   »  ["red", "white"]
(?!re)
    Matches if re does not match at this point. Does not
    consume the match (zero-width negative lookahead). For example,
    /hot(?!dog)(\w+)/ matches any word that contains the
    letters ``hot'' that aren't followed by ``dog'', returning the end
    of the word in $1.
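The negative lookahead pattern from the last entry can be tried against a string that contains both kinds of word. This is a sketch only; the input string is invented, and Regexp#match is used so that the groups can be read from the returned MatchData instead of $1.

```ruby
pattern = /hot(?!dog)(\w+)/

match = pattern.match("hotdog hothouse")
p match[0]          # the whole match
p match[1]          # rest of the word after "hot"
p match.begin(0)    # where the match starts

rejected = pattern.match("hotdog")
p rejected
```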
(?>re)
    Nests an independent regular expression within the first regular
    expression. This expression is anchored at the current match
    position. If it consumes characters, these will no longer be
    available to the higher-level regular expression. This construct
    therefore inhibits backtracking, which can be a performance
    enhancement. For example, the pattern /a.*b.*a/ takes a long time
    when matched against a string containing an ``a'' followed by a
    number of ``b''s, but with no trailing ``a,'' because the matcher
    backtracks through many combinations before giving up. However,
    this can be avoided by using a nested regular expression
    /a(?>.*b).*a/. In this form, the nested expression consumes all
    the input string up to the last possible ``b'' character. When the
    check for a trailing ``a'' then fails, there is no need to
    backtrack, and the pattern match fails promptly.

    require "benchmark"
    include Benchmark
    str = "a" + ("b" * 5000)
    bm(8) do |test|
      test.report("Normal:") { str =~ /a.*b.*a/ }
      test.report("Nested:") { str =~ /a(?>.*b).*a/ }
    end

    produces:

                  user     system      total        real
    Normal:   0.420000   0.000000   0.420000 (  0.414843)
    Nested:   0.000000   0.000000   0.000000 (  0.001205)
(?imx)
    Turns on the corresponding ``i,'' ``m,'' or ``x'' option. If used
    inside a group, the effect is limited to that group.
(?-imx)
    Turns off the ``i,'' ``m,'' or ``x'' option.
(?imx:re)
    Turns on the ``i,'' ``m,'' or ``x'' option for re.
(?-imx:re)
    Turns off the ``i,'' ``m,'' or ``x'' option for re.
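The group-scoped forms can be illustrated with the ``i'' option. The patterns and strings below are made up for the sketch.

```ruby
# (?i:re) turns on case-insensitivity for just the group.
scoped = /a(?i:b)c/
inside_group  = scoped =~ "aBc"
outside_group = scoped =~ "ABc"
p inside_group, outside_group

# (?-i:re) turns the option off inside a case-insensitive pattern.
mixed = /(?-i:a)b/i
lower_a = mixed =~ "aB"
upper_a = mixed =~ "AB"
p lower_a, upper_a
```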
Ruby names are used to refer to constants, variables, methods,
classes, and modules. The first character of a name helps Ruby to
distinguish its intended use. Certain names, listed in Table
18.3 on page 210, are reserved words and should not be used as
variable, method, class, or module names.
Table 18.3: Reserved words

__FILE__   and     def        end      in       or       self    unless
__LINE__   begin   defined?   ensure   module   redo     super   until
BEGIN      break   do         false    next     rescue   then    when
END        case    else       for      nil      retry    true    while
alias      class   elsif      if       not      return   undef   yield
In these descriptions, lowercase letter means the characters
``a'' through ``z'', as well as ``_'', the underscore.
Uppercase letter means ``A'' through ``Z,'' and
digit means ``0'' through ``9.''
Name characters means any combination of upper-
and lowercase letters and digits.
A local variable name consists of a lowercase letter followed by name
characters.
fred anObject _x three_two_one
An instance variable name starts with an ``at'' sign (``@'') followed
by an upper- or lowercase letter, optionally followed by name
characters.
A class variable name starts with two ``at'' signs (``@@'')
followed by an upper- or lowercase letter, optionally followed by name
characters.
A constant name starts with an uppercase letter followed by name characters.
Class names and module names are constants, and follow the constant
naming conventions. By convention, constant variables are normally
spelled using uppercase letters and underscores throughout.
module Math
  PI = 3.1415926
end
class BigBlob
Global variables, and some special system variables, start with a
dollar sign (``$'') followed by name characters. In addition,
there is a set of two-character variable names in which the second
character is a punctuation character. These predefined variables are
listed starting on page 213. Finally, a global variable name
can be formed using ``$-'' followed by any single character.
$params $PROGRAM $! $_ $-a $-.
Method names are described in the section beginning on page 225.
When Ruby sees a name such as ``a'' in an expression, it needs to
determine if it is a local variable reference or a call to a method
with no parameters.
To decide which is the case, Ruby uses a
heuristic. As Ruby reads a source file, it keeps track of symbols
that have been assigned to. It assumes that these symbols are variables. When it
subsequently comes across a symbol that might be either a variable or
a method call, it checks to see if it has seen a prior assignment to
that symbol. If so, it treats the symbol as a variable; otherwise it
treats it as a method call. As a somewhat pathological case of this,
consider the following code fragment, submitted by Clemens Hintze.
def a
  print "Function 'a' called\n"
  99
end
for i in 1..2
  if i == 2
    print "a=", a, "\n"
  else
    a = 1
    print "a=", a, "\n"
  end
end
produces:
a=1
Function 'a' called
a=99
During the parse, Ruby sees the use of ``a'' in the first print
statement and, as it hasn't yet seen any assignment to ``a,'' assumes
that it is a method call. By the time it gets to the second print
statement, though, it has seen an assignment, and so treats
``a'' as a variable.
Note that the assignment does not have to be executed---Ruby just has
to have seen it. This program does not raise an error.
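The point that the assignment only has to be seen, not executed, can be isolated in a smaller sketch. The method name answer is hypothetical; the same name is then used as a variable on purpose.

```ruby
def answer          # hypothetical method name
  99
end

answer = 1 if false     # parsed as an assignment, but never executed

as_variable = answer    # the parser now treats "answer" as a variable
as_method   = answer()  # parentheses force a method call

p as_variable
p as_method
```

Because the assignment never ran, the variable exists but still holds nil, while the explicit call reaches the method.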
Ruby variables and constants hold references to objects.
Variables themselves do not have an intrinsic type. Instead, the type
of a variable is defined solely by the messages to which the object
referenced by the variable responds.
[When we say that a variable
is not typed, we mean that any given variable can at different times
hold references to objects of many different types.]
A Ruby constant is also a reference to an object.
Constants are
created when they are first assigned to (normally in a class or module
definition). Ruby, unlike less flexible languages, lets you alter the value
of a constant, although this will generate a warning message.
MY_CONST = 1
MY_CONST = 2 # generates a warning
produces:
prog.rb:2: warning: already initialized constant MY_CONST
Note that although constants should not be changed, you can alter the
internal states of the objects they reference.
MY_CONST = "Tim"
MY_CONST[0] = "J"   # alter string referenced by constant
MY_CONST   »  "Jim"
Assignment potentially aliases objects,
giving the same object different names.
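Aliasing through assignment can be seen by changing an object through one name and reading it through another. In this sketch the variable names are invented, and String.new is used only to guarantee a mutable string.

```ruby
original = String.new("Tim")
nickname = original          # a second name for the same object
nickname[0] = "J"            # change the object through the alias

p original
p original.equal?(nickname)

copy = original.dup          # dup creates a separate object
copy[0] = "K"
p original
p copy
```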
Constants defined within a class or module may be accessed unadorned
anywhere within the class or module.
Outside the class or module, they
may be accessed using the scope operator, ``::'' prefixed by an
expression that returns the appropriate class or module object.
Constants defined outside any class or module may be accessed
unadorned or by using the scope operator ``::'' with no prefix.
Constants may not be defined in methods.
OUTER_CONST = 99
class Const
  def getConst
    CONST
  end
  CONST = OUTER_CONST + 1
end
Const.new.getConst   »  100
Const::CONST         »  100
::OUTER_CONST        »  99
Global variables are available throughout a program. Every reference
to a particular global name returns the same object. Referencing an
uninitialized global variable returns nil.
Class variables are available throughout a class or module body. Class
variables must be initialized before use. A class variable is shared
among all instances of a class and is available within the class
itself.
class Song
  @@count = 0
  def initialize
    @@count += 1
  end
  def Song.getCount
    @@count
  end
end
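The sharing of a class variable among instances can be observed by creating a few objects and asking the class for the total. The class Counter below is hypothetical and simply mirrors the shape of the class above so that the sketch is self-contained.

```ruby
class Counter            # hypothetical class with the same structure
  @@count = 0            # initialized before use, in the class body

  def initialize
    @@count += 1         # every new instance updates the shared variable
  end

  def Counter.count
    @@count
  end
end

first  = Counter.new
second = Counter.new
total  = Counter.count
p total
```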
Class variables belong to the innermost enclosing class or
module. Class variables used at the top level are defined in
Object, and behave like global variables. Class variables defined
within singleton methods belong to the receiver if the receiver is a
class or a module; otherwise, they belong to the class of the receiver.
class Holder
  @@var = 99
  def Holder.var=(val)
    @@var = val
  end
end
a = Holder.new
def a.var
  @@var
end
Extracted from the book "Programming Ruby -
The Pragmatic Programmer's Guide"
Copyright © 2001 by Addison Wesley Longman, Inc. This material may
be distributed only subject to the terms and conditions set forth in
the Open Publication License, v1.0 or later (the latest version is
presently available at http://www.opencontent.org/openpub/).
Distribution of substantively modified versions of this document is
prohibited without the explicit permission of the copyright holder.
Distribution of the work or derivative of the work in any standard
(paper) book form is prohibited unless prior permission is obtained
from the copyright holder.