layout	title
default	Parser

Translated by Robert GRAVINA

Chapter 10: Parser

Outline of this chapter

Parser construction

The parser is defined in the `yacc` source file `parse.y`, which `yacc` uses to
produce a working parser, `parse.c`.

Although one would expect `lex.c` to contain the scanner, this is not the case.
This file is created by `gperf`, taking the file `keywords` as input, and
defines the reserved word hashtable. This tool-generated `lex.c` is `#include`d
in (the also tool-generated) `parse.c`. The details of this process is somewhat
difficult to explain at this time, so we shall return to this later.

Figure 1 shows the parser construction process. For the benefit of those readers
using Windows who may not be aware, the `mv` (move) command creates a new copy
of a file and removes the original. `cc` is, of course, the C compiler and `cpp`
the C pre-processor.

Dissecting `parse.y`

Let’s now look at `parse.y` in a bit more detail. The following figure presents
a rough outline of the contents of `parse.y`.

▼ parse.y

%{
header
%}
%union ….
%token ….
%type ….

%%

rules

%%
user code section
parser interface
scanner (character stream processing)
syntax tree construction
semantic analysis
local variable management
ID implementation

Previously we briefly mentioned the rules and user code sections. With this
chapter we begin to study the parser in some detail and so turn our attention to
these sections.

There are a considerable number of support functions defined in the user code
section, but roughly speaking, they can be divided into the six parts
previously mentioned. The following table shows where each of parts are
explained in this book.

Category	Chapter	Section
Parser interface	This chapter	Section 3 “Scanning”
Scanner	This chapter	Section 3 “Scanning”
Syntax tree construction	Chapter 12 “Syntax tree construction”	Section 2 “Syntax tree construction”
Semantic analysos	Chapter 12 “Syntax tree construction”	Section 3 “Semantic analysis”
Local variable management	Chapter 12 “Syntax tree construction”	Section 4 “Local variables”
`ID` implementation	Chapter 3 “Names and name tables”	Section 2 “`ID` and symbols”

General remarks about grammar rules

Coding rules

The grammar of `ruby` conforms to a coding standard and is thus easy to read
once you are familiar with it.

Firstly, regarding symbol names, all non-terminal symbols are written in lower
case characters. Terminal symbols are prefixed by some lower case character and
then followed by upper case. Reserved words (keywords) are prefixed with the
character `k`. Other terminal symbols are prefixed with the character `t`.

▼ Symbol name examples

Token	Symbol name
(non-terminal symbol)	`bodystmt`
`if`	`kIF`
`def`	`kDEF`
`rescue`	`kRESCUE`
`varname`	`tIDENTIFIER`
`ConstName`	`tCONST`
1	`tINTEGER`

The only exceptions to these rules are `klBEGIN` and `klEND`. These symbol names
refer to the reserved words for “BEGIN” and “END”, respectively, and the `l`
here stands for `large`. Since the reserved words `begin` and `end` already
exist (naturally, with symbol names `kBEGIN` and `kEND`), these non-standard
symbol names were required.

Important symbols

`parse.y` contains both grammar rules and actions, however, for now I would like
to concentrate on the grammar rules alone. The script sample/exyacc.rb can be
used to extract the grammar rules from this file. Aside from this, running `yacc
-v` will create a logfile `y.output` which also contains the grammar rules,
however it is rather difficult to read. In this chapter I have used a slighty
modified version of `exyacc.rb`\footnote{modified `exyacc.rb`:`tools/exyacc2.rb`
located on the attached CD-ROM} to extract the grammar rules.

▼ `parse.y`(rules)

program : compstmt

bodystmt : compstmt
opt_rescue
opt_else
opt_ensure

compstmt : stmts opt_terms
:
:

The output is quite long – over 450 lines of grammar rules – and as such I have
only included the most important parts in this chapter.

Which symbols, then, are the most important? Symbols such as `program`, `expr`,
`stmt`,`primary`, `arg` etc. represent the more general grammatical elements of
a programming language and it is on these which we shall focus our attention.
The following table outlines these general elements and the symbol names used
to represent them.

Syntax element	Relevant symbol names
Program	`program prog file input stmts whole`
Sentence	`statement stmt`
Expression	`expression expr exp`
Smallest element	`primary prim`
Left hand side of an expression	`lhs`(left hand side)
Right hand side of an expression	`rhs`(right hand side)
Function call	`funcall function_call call function`
Method call	`method method_call call`
Argument	`argument arg`
Function definition	`defun definition function fndef`
Declarations	`declaration decl`

In general, programming languages tend to have the following symbol hierarchy.

Program element	Properties
Statement	Can not be combined with other symbols. A syntax tree trunk.
Expression	Can be combined with itself or be part of other expressions. A syntax tree internal node.
Primary	An element which can not be further decomposed. A syntax tree leaf node.

C function definitions and Java class definitions are examples of statements in
other languages. An expression can be a procedure call, an arithmetic expression
etc., while a primary usually refers to a string literal or number. Some languages
do not contain all of these symbol types, however they generally contain some
kind of hierarchy of symbols such as `program`→`stmt`→`expr`→`primary`.

It is often the case that symbol types lower in this hierarchy can be promoted
to higher levels and vice versa. For example, in C function calls are expressions
yet can be may also be statements.

Conversely, when surrounded in parentheses, expressions become primaries.

Scoping rules differ considerably between programming languages. Consider
substitution. In C, the value of expressions can be used in substitutions
whereas in Pascal substitution occurs only at the statement level. Also,
function and class definitions are typically statements however in languages
such as Lisp and Scheme, since everything is an expression, they too are
expressions. Ruby follows Lisp’s design in this regard.

Program structure

Now let’s turn our attention to the grammar rules of `ruby`. Firstly,
`yacc` will begin by examining the first rule defined in `parse.y`, and as
we can see from the following table, this is `program`. From here we can see
the ruby grammar unfold and the existence of the `program stmt expr primary`
hierarchy mentioned earlier. However there is an extra rule here for `arg`.
Let’s now take a look at this.

▼ `ruby` grammar (outline)

program : compstmt

compstmt : stmts opt_terms

stmts : none
| stmt
| stmts terms stmt

stmt : kALIAS fitem fitem
| kALIAS tGVAR tGVAR
:
:
| expr

expr : kRETURN call_args
| kBREAK call_args
:
:
| ‘!’ command_call
| arg

arg : lhs ‘=’ arg
| var_lhs tOP_ASGN arg
| primary_value ‘[’ aref_args ‘]’ tOP_ASGN arg
:
:
| arg ‘?’ arg ‘:’ arg
| primary

If you look at each of the final alternatives for each of the rules you should
be able to clearly make out a hierarchy of `program`→`stmt`→`expr`→`arg`→
`primary`.

I’d like to focus on the rule for `primary`.

primary         : literal
                    :
                    :
                | tLPAREN_ARG expr  ')'      /* here */

The name `tLPAREN_ARG` comes from `t` for terminal symbol, `L` for left and
`PAREN` for parentheses – it is the open parenthesis. Why this isn’t `‘(’`
is covered in the next section “Context-dependent scanner”. The purpose of this
rule is demote an `expr` to a `primary`, and is shown in Figure 2. This creates
a cycle which can the seen in Figure 2, and the arrow shows how this rule is
reduced during parsing.

The next rule is also particularly interesting.

primary         : literal
                    :
                    :
                | tLPAREN compstmt ')'   /* here */

A `compstmt`, which represents code for an entire program, can be demoted to
a `primary` with this rule. The next figure illustrates this rule in action.

This means that for any syntax element in Ruby, if we surround it with
parenthesis it will become a `primary` and can be passed as an argument to a
function, be used as the right hand side of an expression etc. It helps to
see an example of this to grasp what this truly means.

p((class C; end))
p((def a() end))
p((alias ali gets))
p((if true then nil else nil end))
p((1 + 1 * 1 ** 1 - 1 / 1 ^ 1))

If we invoke `ruby` with the `-c` option (syntax check), we get the following
output.

% ruby -c primprog.rb
Syntax OK

Although it might look surprising at first, yes, you really can do this in Ruby!

The details of this are covered when we look at semantic analysis (in Chaper 12
“Syntax tree construction”) however it is important to note there are exceptions
to this rule. For example passing a `return` statement as an argument to a
function will result in an error. For the most part, however, the “surrounding
anything in parenthesis means it can be passed as an argument to a function”
rule does hold.

In the next section I will cover the most important grammar rules in some detail.

`program`

▼ `program`

program : compstmt

compstmt : stmts opt_terms

stmts : none
| stmt
| stmts terms stmt

As mentioned earlier, in the grammar, `program` represents an entire program.
For all intents and purposes `compstmts` is equivalent to `program`. `stmts` is
almost equivalent to `compstmts`. In turn, `stmts` is a list of `stmt` nonterminals,
delimited by `terms`.

`terms` is (of course) an abbreviation for “terminators”, either semicolons or newlines.
`opt_terms` means “OPTional terms”. The definitions are as follows:

▼ `opt_terms`



opt_terms       :

                | terms
terms           : term

                | terms ‘;’
term            : ‘;’

                | ‘\n’

The initial `;` or `\n` of a `terms` can be followed by any number of `;` only; based on that, you might start thinking that if there are 2 or more consecutive newlines, it could cause a problem. Let’s try and see what actually happens.

1 + 1   # first newline
        # second newline
        # third newline
1 + 1

Run that with `ruby -c`.

% ruby -c optterms.rb
Syntax OK

Strange, it worked! What actually happens is this: consecutive newlines are simply discarded by the scanner, which returns only the first newline in a series.

By the way, although we said that `program` is the same as `compstmt`, if that was really true, you would question why `compstmt` exists at all. Actually, the distinction is there only for execution of semantic actions. `program` exists to execute any semantic actions which should be done once in the processing of an entire program. If it was only a question of parsing, `program` could be omitted with no problems at all.

To generalize this point, the grammar rules can be divided into 2 groups: those which are needed for parsing the program structure, and those which are needed for execution of semantic actions. The `none` rule which was mentioned earlier when talking about `stmts` is another one which exists for executing actions — it’s used to return a `NULL` pointer for an empty list of type `NODE*`.

`stmt`

Next is `stmt`. This one is rather involved, so we’ll look into it a bit at a time.

▼ `stmt`(1)



stmt            : kALIAS fitem  fitem

                | kALIAS tGVAR tGVAR

                | kALIAS tGVAR tBACK_REF

                | kALIAS tGVAR tNTH_REF

                | kUNDEF undef_list

                | stmt kIF_MOD expr_value

                | stmt kUNLESS_MOD expr_value

                | stmt kWHILE_MOD expr_value

                | stmt kUNTIL_MOD expr_value

                | stmt kRESCUE_MOD stmt

                | klBEGIN ‘{’ compstmt ‘}’

                | klEND ‘{’ compstmt ‘}’

Looking at that, somehow things start to make sense. The first few have `alias`, then `undef`, then the next few are all something followed by `_MOD` — those should be statements with postposition modifiers, as you can imagine.

`expr_value` and `primary_value` are grammar rules which exist to execute semantic actions. For example, `expr_value` represents an `expr` which has a value. Expressions which don’t have values are `return` and `break`, or `return`/`break` followed by a postposition modifier, such as an `if` clause. For a detailed definition of what it means to “have a value”, see chapter 12, “Syntax Tree Construction”. In the same way, `primary_value` is a `primary` which has a value.

Just as the names say, `klBEGIN` and `klEND` represent `BEGIN` and `END`.

▼ `stmt`(2)



                | lhs ‘=’ command_call

                | mlhs ‘=’ command_call

                | var_lhs tOP_ASGN command_call

                | primary_value ‘[’ aref_args ‘]’ tOP_ASGN command_call

                | primary_value ‘.’ tIDENTIFIER tOP_ASGN command_call

                | primary_value ‘.’ tCONSTANT tOP_ASGN command_call

                | primary_value tCOLON2 tIDENTIFIER tOP_ASGN command_call

                | backref tOP_ASGN command_call

この規則は全部くくって見るのが正しい。
The common point is that they all have `command_call` on the right-hand side. `command_call` represents a method call with the parentheses omitted. The new symbols which are introduced here are explained in the following table. I hope you’ll refer to the table as you check over each grammar rule.

`lhs`	assignment left hand side
`mlhs`	assignment multiple left hand side
`var_lhs`	assignment left hand side, which must be a variable
`tOP_ASGN`	compound assignment operator like `+=` or `*=` (OPerator ASsiGN)
`aref_args`	argument to a `[]` method call (Array REFerence)
`tIDENTIFIER`	local variable identifier
`tCONSTANT`	constant identifier (with leading uppercase letter)
`tCOLON2`	`::`
`backref`	`$1 $2 $3…`

ちなみにarefはLisp用語だ。対になるasetというのもあって、そちらは
array setの略。この略語は`ruby`のソースコードのいろいろなところで
使われている。

▼ `stmt`(3)



                | lhs ‘=’ mrhs_basic

                | mlhs ‘=’ mrhs

この二つは多重代入である。`mrhs`は`mlhs`と同じ作りで、multipleな
`rhs`(右辺)。
こう見てくると、名前の意味を知るだけでもかなりわかりやすいということが
わかる。

▼ `stmt`(4)



                | expr

最後に、`expr`へ連結する。

`expr`

▼ `expr`



expr            : kRETURN call_args

                | kBREAK call_args

                | kNEXT call_args

                | command_call

                | expr kAND expr

                | expr kOR expr

                | kNOT expr

                | ‘!’ command_call

                | arg

式。`ruby`の式は文法上はかなり小さい。なぜなら普通は`expr`に入るものがほと
んど`arg`に行ってしまっているからだ。逆に言うと、ここには`arg`に行けなかっ
たものが残っているということになる。では行けなかったものはなにかと言う
と、これまたメソッド呼び出し括弧の省略組である。`call_args`は剥き出しの
引数リストで、`command_call`は先程言ったとおり括弧省略メソッドのこと。こ
ういうものを「小さい」単位に入れると衝突しまくることになる。

ただし以下の二つだけは種類が違う。

expr kAND expr
expr kOR expr

`kAND` is `and`, and `kOR` is `or`. この二つは制御構造としての役割があ
るので、`command_call`以上に「大きい」構文単位に入れなければならない。
そして`command_call`は`expr`にある。だから最低でも`expr`にしてやらない
とうまくいかない。例えば次のような使いかたが存在するのだが……

  valid_items.include? arg  or raise ArgumentError, 'invalid arg'
# valid_items.include?(arg) or raise(ArgumentError, 'invalid arg')

もし`kAND`の規則が`expr`でなくて`arg`にあったとすると、次のように
連結されてしまうことになる。

valid_items.include?((arg or raise)) ArgumentError, 'invalid arg'

当然パースエラーだ。

`arg`

▼ `arg`



arg             : lhs ‘=’ arg

                | var_lhs tOP_ASGN arg

                | primary_value ‘[’ aref_args ‘]’ tOP_ASGN arg

                | primary_value ‘.’ tIDENTIFIER tOP_ASGN arg

                | primary_value ‘.’ tCONSTANT tOP_ASGN arg

                | primary_value tCOLON2 tIDENTIFIER tOP_ASGN arg

                | backref tOP_ASGN arg

                | arg tDOT2 arg

                | arg tDOT3 arg

                | arg ‘+’ arg

                | arg ‘-’ arg

                | arg ‘*’ arg

                | arg ‘/’ arg

                | arg ‘%’ arg

                | arg tPOW arg

                | tUPLUS arg

                | tUMINUS arg

                | arg ‘|’ arg

                | arg ‘^’ arg

                | arg ‘&’ arg

                | arg tCMP arg

                | arg ‘>’ arg

                | arg tGEQ arg

                | arg ‘<’ arg

                | arg tLEQ arg

                | arg tEQ arg

                | arg tEQQ arg

                | arg tNEQ arg

                | arg tMATCH arg

                | arg tNMATCH arg

                | ‘!’ arg

                | ‘~’ arg

                | arg tLSHFT arg

                | arg tRSHFT arg

                | arg tANDOP arg

                | arg tOROP arg

                | kDEFINED opt_nl  arg

                | arg ‘?’ arg ‘:’ arg

                | primary

Although there are many rules here, the complexity of the grammar is not proportionate to the number of rules. 単に場合分けが
多いだけの文法は`yacc`にとっては非常に扱いやすく、むしろ規則の深さである
とか再帰のほうが複雑さに影響する。

そうすると演算子のところで`arg OP arg`という形で再帰しているのが気になる
のだが、これらの演算子には全て演算子優先順位が設定されているため
実質上ただの列挙にすぎない。
そこで`arg`の規則からそういう「ただの列挙」を併合して削ってしまおう。

arg: lhs '=' arg              /* 1 */
   | primary T_opeq arg       /* 2 */
   | arg T_infix arg          /* 3 */
   | T_pre arg                /* 4 */
   | arg '?' arg ':' arg      /* 5 */
   | primary                  /* 6 */

終端記号および終端記号のリストは区別する意味がないので全部まとめて`T_`の
付く記号で表現した。`opeq`は`operator + equal`、`T_pre`は`‘!’`や`‘~’`の
ような前置型演算子、`T_infix`は`‘*’`や`‘%’`と言った二項演算子を表す。

この構造で衝突しないためには以下のようなことが重要になる
(ただしこれが全てではない)。

`T_infix`が`‘=’`を含まないこと

`arg`は`lhs`と部分的に重なるので`‘=’`があると規則1と3が区別できない。

`T_opeq`と`T_infix`が共通項を持たないこと

`arg`は`primary`を含むから共通項を持つと規則2と3が区別できない。

`T_infix`の中に`‘?’`がないこと

もし含むと規則3と5がshift/reduce conflictを起こす。

`T_pre`が`‘?’`や`‘:’`を含まない

もし含むと規則4と5がかなり複雑に衝突する。

結論としては全ての条件が成立しているので、この文法は衝突しない。
当然と言えば当然だ。

`primary`

Because `primary` has a lot of grammar rules, we’ll split them up and show them in parts.

▼ `primary`(1)



primary         : literal

                | strings

                | xstring

                | regexp

                | words

                | qwords

リテラル類。`literal`は`Symbol`リテラル(`:sym`)と数値。

▼ `primary`(2)



                | var_ref

                | backref

                | tFID

変数類。`var_ref`はローカル変数やインスタンス変数など。
`backref`は`$1 $2 $3`……。
`tFID`は`!`や`?`が付いた識別子、例えば`include? reject!`など。
`tFID`はローカル変数ではありえないので、
それ単独で出てきたとしてもパーサレベルでメソッド呼び出しになる。

▼ `primary`(3)



                | kBEGIN

                  bodystmt

                  kEND

`bodystmt`が`rescue`や`ensure`も含んでいる。
つまりこれは例外制御の`begin`である。

▼ `primary`(4)



                | tLPAREN_ARG expr  ‘)’

                | tLPAREN compstmt ‘)’

既に説明した。構文縮退。

▼ `primary`(5)



                | primary_value tCOLON2 tCONSTANT

                | tCOLON3 cname

定数の参照。`tCONSTANT`は定数名(大文字で始まる識別子)。

`tCOLON2`は`::`と`tCOLON3`は両方とも`::`なのだが、`tCOLON3`はトップレベルを
意味する`::`だけを表している。つまり`::Const`という場合の`::`である。
`Net::SMTP`のような`::`は`tCOLON2`だ。

同じトークンに違う記号が使われているのは括弧省略メソッドに対応する
ためである。例えば次の二つを見分けるためだ。

p Net::HTTP    # p(Net::HTTP)
p Net  ::HTTP  # p(Net(::HTTP))

直前にスペースがあるか、あるいは開き括弧などの境界文字がある場合は
`tCOLON3`でそれ以外では`tCOLON2`になる。

▼ `primary`(6)



                | primary_value ‘[’ aref_args ‘]’

インデックス形式の呼び出し、例えば`arr[i]`。

▼ `primary`(7)



                | tLBRACK aref_args ‘]’

                | tLBRACE assoc_list ‘}’

配列リテラルとハッシュリテラル。こちらの`tLBRACK`も`‘[’`を表して
いるのだが、`‘[’`は`‘[’`でも前に空白のない`‘[’`のこと。この区別が必要
なのもメソッド呼び出し括弧の省略の余波だ。

それにしてもこの規則の終端記号は一文字しか違わないので非常にわかりずらい。
以下の表に括弧の読みかたを書いておいたので対照しながら読んでほしい。

▼ 各種括弧の英語名

記号	英語名	日本語名(の一例)
`( )`	parentheses	丸括弧、括弧
`{ }`	braces	ヒゲ括弧、中括弧
`[ ]`	brackets	角括弧、大括弧

▼ `primary`(8)



                | kRETURN

                | kYIELD ‘(’ call_args ‘)’

                | kYIELD ‘(’ ‘)’

                | kYIELD

                | kDEFINED opt_nl ‘(’  expr ‘)’

メソッド呼び出しと形式が似ている構文。
順番に、`return`、`yield`、`defined?`。

`yield`には引数が付いているのに`return`に引数がないのはどうしてだろう。
根本的な原因は、`yield`はそれ自体に返り値があるのに対し`return`には返り値が
ないことである。ただし、ここで引数がないからといって値を渡せないわけでは、
もちろんない。`expr`に次のような規則があった。

kRETURN call_args

`call_args`は剥き出しの引数リストなので`return 1`や`return nil`には対
処できる。`return(1)`のようなものは`return (1)`として扱われる。という
ことは、次のように引数が二個以上ある`return`には括弧が付けられないはず
だ。

return(1, 2, 3)   # return  (1,2,3)と解釈されてパースエラー

このあたりは次章『状態付きスキャナ』を読んでから
もう一度見てもらうとよくわかるだろう。

▼ `primary`(9)



                | operation brace_block

                | method_call

                | method_call brace_block

メソッド呼び出し。`method_call`は引数あり(括弧もある)、`operation`は
括弧も引数もなし。`brace_block`は`{`〜`}`か`do`〜`end`で、それがついて
いるメソッドとはつまりイテレータだ。`brace`なのになぜ`do`〜`end`が入る
のか……ということにはマリアナ海溝よりも深淵な理由があるのだが、これも
やはり次章『状態付きスキャナ』を読んでもらうしかない。

▼ `primary`(10)



  | kIF expr_value then compstmt if_tail kEND         # if

  | kUNLESS expr_value then compstmt opt_else kEND    # unless

  | kWHILE expr_value do compstmt kEND                # while

  | kUNTIL expr_value do compstmt kEND                # until

  | kCASE expr_value opt_terms case_body kEND         # case

  | kCASE opt_terms case_body kEND                    # case(形式2)

  | kFOR block_var kIN expr_value do compstmt kEND    # for

基本制御構造。ちょっと意外なのは、こんなデカそうなものが`primary`という
「小さい」ものの中にあることだ。`primary`は`arg`でもあるのでこんなこと
もできるということになる。

p(if true then 'ok' end)   # "ok"と表示される

Rubyの特徴の一つに「ほとんどの構文要素が式」というのがあった。
`if`や`while`が`primary`にあることでそれが具体的に表現されている。

それにしてもどうしてこんな「大きい」要素が`primary`にあって大丈夫なのだ
ろう。それはRubyの構文が「終端記号Aで始まり終端記号Bで終わる」
という特徴があるからに他ならない。この点については次の項で改めて
考えてみる。

▼ `primary`(11)

| kCLASS cname superclass bodystmt kEND # クラス定義
| kCLASS tLSHFT expr term bodystmt kEND # 特異クラス定義
| kMODULE cname bodystmt kEND # モジュール定義
| kDEF fname f_arglist bodystmt kEND # メソッド定義
| kDEF singleton dot_or_colon fname f_arglist bodystmt kEND

特異メソッド定義

定義文。クラス文クラス文と呼んできたが、本当はクラス項と呼ぶべきで
あったか。これらも全て「終端記号Aで始まりBで終わる」パターンなので、
いくら増えたところで一向に問題はない。

▼ `primary`(12)



                | kBREAK

                | kNEXT

                | kREDO

                | kRETRY

各種ジャンプ。このへんはまあ、文法的にはどうでもいい。

衝突するリスト

先の項では`if`が`primary`なんかにあって大丈夫なのだろうか、という疑問を
提示した。厳密に証明するにはなかなか難しいのだが、感覚的にならわりと
簡単に説明できる。ここでは次のような小さい規則でシミュレーション
してみよう。

%token A B o
%%
element   : A item_list B

item_list :
          | item_list item

item      : element
          | o

`element`がこれから問題にしようとしている要素だ。例えば`if`について考える
なら`if`だ。`element`は終端記号`A`で始まり`B`で終わるリストである。
`if`で言うなら`if`で始まり`end`で終わる。内容物の`o`はメソッドや
変数参照やリテラルである。リストの要素には、その`o`か、または`element`が
ネストする。

この文法に基くパーサで次のような入力をパースしてみよう。

A  A  o  o  o  B  o  A  o  A  o  o  o  B  o  B  B

ここまでネストしまくっていると人間にはインデントなどの助けがないとちょっ
とわかりにくい。だが次のように考えればわりと簡単である。いつか必ず
`A`と`B`が`o`だけを狭んで現れるので、それを消して`o`に変える。それを繰り
返すだけでいい。結論は図4のようになる。

しかしもし終端の`B`がなかったりすると……

%token A o
%%
element   : A item_list    /* Bを消してみた */

item_list :
          | item_list item

item      : element
          | o

これを`yacc`で処理すると2 shift/reduce conflictsと出た。つまりこの文法は
曖昧である。入力は、先程のものから単純に`B`を抜くと次のようになってしまう。

A  A  o  o  o  o  A  o  A  o  o  o  o

どうにもよくわからない。しかしshift/reduce conflictはシフトしろ、とい
うルールがあったので、試しにそれに従ってシフト優先(即ち内側優先)でパー
スしてみよう(図5)。

とりあえずパースできた。しかしこれでは入力と意図が全く違うし、どうやっ
ても途中でリストを切ることができなくなってしまった。

実はRubyの括弧省略メソッドはこれと似たような状態にある。わかりにくいが、
メソッド名と第一引数が合わせて`A`だ。なぜならその二つの間にだけカンマが
ないので、新しいリストの始まりだと認識できるからである。

他に「現実的な」HTMLもこのパターンを含む。例えば``や``が省略され
たときはこうなる。そういうわけで`yacc`は普通のHTMLにはまるで通用しない。

スキャナ

パーサ概形

スキャナに移る前にパーサの概形について説明しておこう。
図6を見てほしい。

パーサの公式インターフェイスは`rb_compile_cstr()`、
`rb_compile_string()`、
`rb_compile_file()`の三つである。それぞれCの文字列、Rubyの文字列オブジェク
ト、Rubyの`IO`オブジェクトからプログラムを読み込んでコンパイルする。

これらの関数は直接間接に`yycompile()`を呼び出し、最終的に`yacc`が生成した
`yyparse()`に完全に制御を移す。パーサの中心とはこの`yyparse()`に他ならない
のだから、`yyparse()`を中心に把握してやるとよい。即ち`yyparse()`に移る前の
関数は全て前準備であり、`yyparse()`より後の関数は`yyparse()`にこき使われ
る雑用関数にすぎない。

`parse.y`にある残りの関数は`yylex()`から呼び出される補助関数群だが、こちらも
また明確に分類することができる。

まずスキャナの最も低レベルな部分には入力バッファがある。`ruby`はソースプ
ログラムをRubyの`IO`オブジェクトか文字列のどちらからでも入力できるように
なっており、入力バッファはそれを隠蔽して単一のバイトストリームに見せか
ける。

次のレベルはトークンバッファである。入力バッファから1バイト
ずつ読んだらトークンが一つ完成するまでここにまとめてとっておく。

従って`yylex`全体の構造は図7のように図示できる。

入力バッファ

まず入力バッファから見ていこう。インターフェイスは
`nextc()`、`pushback()`、`peek()`の三つだけだ。

しつこいようだがまずデータ構造を調べるのだった。
入力バッファの使う変数はこうだ。

▼ 入力バッファ



2279  static char *lex_pbeg;

2280  static char *lex_p;

2281  static char *lex_pend;
(parse.y)

バッファの先頭と現在位置、終端。どうやらこのバッファは単純な一列の
文字列バッファらしい(図8)。

`nextc()`

ではこれを使っているところを見てみる。まずは一番オーソドックスと
思われる`nextc()`から。

▼ `nextc()`



2468  static inline int

2469  nextc()

2470  {

2471      int c;

2472

2473      if (lex_p == lex_pend) {

2474          if (lex_input) {

2475              VALUE v = lex_getline();

2476

2477              if (NIL_P(v)) return -1;

2478              if (heredoc_end > 0) {

2479                  ruby_sourceline = heredoc_end;

2480                  heredoc_end = 0;

2481              }

2482              ruby_sourceline++;

2483              lex_pbeg = lex_p = RSTRING→ptr;

2484              lex_pend = lex_p + RSTRING→len;

2485              lex_lastline = v;

2486          }

2487          else {

2488              lex_lastline = 0;

2489              return -1;

2490          }

2491      }

2492      c = (unsigned char)*lex_p++;

2493      if (c == ‘\r’ && lex_p <= lex_pend && *lex_p == ‘\n’) {

2494          lex_p++;

2495          c = ‘\n’;

2496      }

2497

2498      return c;

2499  }
(parse.y)

最初の`if`で入力バッファの最後まで行ったかどうかテストしているようだ。そ
してその内側の`if`は、`else`で`-1`(`EOF`)を返していることから入力全体の
終わりをテストしているのだと想像できる。逆に言えば、入力が終わったときには
`lex_input`が0になる。

ということは入力バッファには少しずつ文字列が入ってきているのだというこ
とがわかる。その単位はと言うと、バッファを更新する関数の名前が
`lex_getline()`なので行に間違いない。

まとめるとこういうことだ。

if (バッファ終端に到達した)
    if (まだ入力がある)
        次の行を読み込む
    else
        return EOF
ポインタを進める
CRを読み飛ばす
return c

行を補給する関数`lex_getline()`も見てみよう。
この関数で使う変数も一緒に並べておく。

▼ `lex_getline()`



2276  static VALUE (lex_gets)();     / gets function /

2277  static VALUE lex_input;         / non-nil if File */
2420  static VALUE

2421  lex_getline()

2422  {

2423      VALUE line = (*lex_gets)(lex_input);

2424      if (ruby_debug_lines && !NIL_P(line)) {

2425          rb_ary_push(ruby_debug_lines, line);

2426      }

2427      return line;

2428  }
(parse.y)

最初の行以外はどうでもよい。
`lex_gets`が一行読み込み関数へのポインタ、`lex_input`が本当の入力だろう。
`lex_gets`をセットしているところを検索してみると、こう出た。

▼ `lex_gets`をセット



2430  NODE*

2431  rb_compile_string(f, s, line)

2432      const char *f;

2433      VALUE s;

2434      int line;

2435  {

2436      lex_gets = lex_get_str;

2437      lex_gets_ptr = 0;

2438      lex_input = s;
2454  NODE*

2455  rb_compile_file(f, file, start)

2456      const char *f;

2457      VALUE file;

2458      int start;

2459  {

2460      lex_gets = rb_io_gets;

2461      lex_input = file;
(parse.y)

`rb_io_gets()`はパーサ専用というわけではなく、`ruby`の汎用ライブラリだ。
`IO`オブジェクトから一行読み込む関数である。

一方の`lex_get_str()`は次のように定義されている。

▼ `lex_get_str()`



2398  static int lex_gets_ptr;
2400  static VALUE

2401  lex_get_str(s)

2402      VALUE s;

2403  {

2404      char beg, *end, *pend;

2405

2406      beg = RSTRING→ptr;

2407      if (lex_gets_ptr) {

2408          if (RSTRING→len == lex_gets_ptr) return Qnil;

2409          beg += lex_gets_ptr;

2410      }

2411      pend = RSTRING→ptr + RSTRING→len;

2412      end = beg;

2413      while (end < pend) {

2414          if (end++ == ‘\n’) break;

2415      }

2416      lex_gets_ptr = end – RSTRING→ptr;

2417      return rb_str_new(beg, end – beg);

2418  }
(parse.y)

この関数はいいだろう。`lex_gets_ptr`が既に読み込んだところを記憶している。
それを次の`\n`まで移動し、同時にそこを切り取って返す。

ここで`nextc`に戻ろう。このように同じインターフェイスの関数を二つ用意
して、パーサの初期化のときに関数ポインタを切り替えてしまうことで他の部
分を共通化しているわけだ。コードの差分をデータに変換して吸収していると
も言える。`st_table`にも似たような手法があった。

`pushback()`

バッファの物理構造と`nextc()`がわかればあとは簡単だ。
一文字書き戻す`pushback()`。Cで言えば`ungetc()`である。

▼ `pushback()`



2501  static void

2502  pushback©

2503      int c;

2504  {

2505      if (c == 1) return;

2506      lex_p-;

2507  }
(parse.y)

`peek()`

そしてポインタを進めずに次の文字をチェックする`peek()`(語意は
「覗き見る」)。

▼ `peek()`



2509  #define peek© (lex_p != lex_pend && © == *lex_p)
(parse.y)

トークンバッファ

トークンバッファは次のレベルのバッファである。
トークンが一つ切り出せるまで文字列を保管しておく。
インターフェイスは以下の五つである。

`newtok`	新しいトークンを開始する
`tokadd`	バッファに文字を足す
`tokfix`	バッファを終端する
`tok`	バッファリングしている文字列の先頭へのポインタ
`toklen`	バッファリングしている文字列長さ
`toklast`	バッファリングしている文字列の最後のバイト

ではまずデータ構造から見ていく。

▼ トークンバッファ



2271  static char *tokenbuf = NULL;

2272  static int   tokidx, toksiz = 0;
(parse.y)

`tokenbuf`がバッファで、`tokidx`がトークンの末尾(`int`なのでインデッ
クスらしい)、`toksiz`がバッファ長だろう。こちらも入力バッファと同じく
単純な構造だ。絵にすると図9のようになる。

引き続きインターフェイスに行って、新しいトークンを開始する`newtok()`を
読もう。

▼ `newtok()`



2516  static char*

2517  newtok()

2518  {

2519      tokidx = 0;

2520      if (!tokenbuf) {

2521          toksiz = 60;

2522          tokenbuf = ALLOC_N(char, 60);

2523      }

2524      if (toksiz > 4096) {

2525          toksiz = 60;

2526          REALLOC_N(tokenbuf, char, 60);

2527      }

2528      return tokenbuf;

2529  }
(parse.y)

バッファ全体の初期化インターフェイスというのはないので、バッファが初期
化されていない可能性がある。だから最初の`if`でそれをチェックし初期化す
る。`ALLOC_N()`は`ruby`が定義しているマクロで、`calloc()`とだいたい
同じだ。

割り当てる長さは初期値で60だが、大きくなりすぎていたら(`> 4096`)小さく
戻している。一つのトークンがそんなに長くなることはまずないので、このサ
イズで現実的だ。

次はトークンバッファに文字を追加する`tokadd()`を見る。

▼ `tokadd()`



2531  static void

2532  tokadd©

2533      char c;

2534  {

2535      tokenbuf[tokidx++] = c;

2536      if (tokidx >= toksiz) {

2537          toksiz *= 2;

2538          REALLOC_N(tokenbuf, char, toksiz);

2539      }

2540  }
(parse.y)

一行目で文字を追加。そのあとトークン長をチェックし、バッファ末尾を
越えそうだったら`REALLOC_N()`する。`REALLOC_N()`は引数指定方式が
`calloc()`と同じ`realloc()`だ。

残りのインターフェイスはまとめてしまう。

▼ `tokfix() tok() toklen() toklast()`



2511  #define tokfix() (tokenbuf[tokidx]=‘\0’)

2512  #define tok() tokenbuf

2513  #define toklen() tokidx

2514  #define toklast() (tokidx>0?tokenbuf[tokidx-1]:0)
(parse.y)

問題ないだろう。

`yylex()`

`yylex()`はとても長い。現在1000行以上ある。そのほとんどが巨大な
`switch`文一つで占められており、文字ごとに分岐するようになっている。
まず一部省略して全体構造を示す。

▼ `yylex`概形



3106  static int

3107  yylex()

3108  {

3109      static ID last_id = 0;

3110      register int c;

3111      int space_seen = 0;

3112      int cmd_state;

3113

3114      if (lex_strterm) {

              /* ……文字列のスキャン…… /

3131          return token;

3132      }

3133      cmd_state = command_start;

3134      command_start = Qfalse;

3135    retry:

3136      switch (c = nextc()) {

3137        case ‘\0’:                / NUL /

3138        case ‘\004’:              / ^D /

3139        case ‘\032’:              / ^Z /

3140        case -1:                  / end of script. /

3141          return 0;

3142

3143          / white spaces /

3144        case ’ ‘: case ’\t’: case ‘\f’: case ‘\r’:

3145        case ‘\13’: / ‘\v’ /

3146          space_seen++;

3147          goto retry;

3148

3149        case ‘#’:         / it’s a comment /

3150          while ((c = nextc()) != ‘\n’) {

3151              if (c == -1)

3152                  return 0;

3153          }

3154          / fall through /

3155        case ‘\n’:

              / ……省略…… */
case xxxx:
:
break;
:
/* 文字ごとにたくさん分岐 */
:
:
4103        default:

4104          if (!is_identchar© || ISDIGIT) {

4105              rb_compile_error(“Invalid char `\\%03o’ in expression”, c);

4106              goto retry;

4107          }

4108

4109          newtok();

4110          break;

4111      }
/* ……通常の識別子を扱う…… */
}
(parse.y)

`yylex()`の返り値はゼロが「入力終わり」、非ゼロなら記号である。

「`c`」などという非常に簡潔な変数が全体に渡って使われているので注意。
スペースを読んだときの`space_seen++`は後で役に立つ。

あとは文字ごとに分岐してひたすら処理すればよいのだが、単調な処理が続く
ので読むほうはとても退屈だ。そこでいくつかポイントを絞って見てみること
にする。本書では全ての文字は解説しないが、同じパターンを敷衍していけば
簡単である。

`‘!’`

まずは簡単なものから見てみよう。

▼ `yylex`-`‘!’`



3205        case ‘!’:

3206          lex_state = EXPR_BEG;

3207          if ((c = nextc()) == ‘=’) {

3208              return tNEQ;

3209          }

3210          if (c == ‘~’) {

3211              return tNMATCH;

3212          }

3213          pushback©;

3214          return ‘!’;
(parse.y)

日本語でコードの意味を書き出したので対照して読んでみてほしい。

case '!':
  EXPR_BEGに移行
  if (次の文字が'='なら) {
      トークンは「!=(tNEQ)」である
  }
  if (次の文字が'~'なら) {
      トークンは「!~(tNMATCH)」である
  }
  どちらでもないなら読んだ文字は戻しておく
  トークンは「'!'」である。

この`case`節は短かいが、スキャナの重要な法則が示されている。それは
「最長一致の原則」である。`“!=”`という二文字は「`!`と`=`」「`!=`」の
二通りに解釈できるが、この場合は「`!=`」を選ばなければならない。
プログラム言語のスキャナでは最長一致が基本である。

また`lex_state`はスキャナの状態を表す変数である。
次章『状態付きスキャナ』で嫌になるほど
やるので、とりあえず無視しておいていい。いちおう意味だけ言っておくと、
`EXPR_BEG`は「明らかに式の先頭にいる」ことを示している。`not`の`!`だろ
うと`!=`だろうと`!~`だろうと、次は式の先頭だからだ。

`‘>’`

次は`yylval`(記号の値)を使う例として`‘>’`を見てみる。

▼ `yylex`-`‘>’`



3296        case ‘>’:

3297          switch (lex_state) {

3298            case EXPR_FNAME: case EXPR_DOT:

3299              lex_state = EXPR_ARG; break;

3300            default:

3301              lex_state = EXPR_BEG; break;

3302          }

3303          if ((c = nextc()) == ‘=’) {

3304              return tGEQ;

3305          }

3306          if (c == ‘>’) {

3307              if ((c = nextc()) == ‘=’) {

3308                  yylval.id = tRSHFT;

3309                  lex_state = EXPR_BEG;

3310                  return tOP_ASGN;

3311              }

3312              pushback©;

3313              return tRSHFT;

3314          }

3315          pushback©;

3316          return ‘>’;
(parse.y)

`yylval`のところ以外は無視してよい。プログラムを読むときは一つのことに
だけ集中するのが肝心だ。

ここでは`>=`に対応する記号`tOP_ASGN`に対してその値`tRSHFT`をセットして
いる。使っている共用体メンバが`id`だから型は`ID`である。`tOP_ASGN`は自
己代入の記号で、`+=`とか`-=`とか`*=`といったものを全部まとめて表してい
るから、それを後で区別するために何の自己代入かを値として渡すわけだ。

なぜ自己代入をまとめるかというと、そのほうが規則が短くなるからだ。
スキャナでまとめられるものはできるだけたくさんまとめてしまったほうが規
則がすっきりする。ではなぜ二項演算子は全部まとめないのかと言うと、優先
順位が違うからである。

`‘:’`

スキャンがパースから完全に独立していれば話は簡単なのだが、現実はそう
簡単ではない。Rubyの文法は特に複雑で空白が前にあるとなんか違うとか、
まわりの状況でトークンの切り方が変わったりする。以下に示す`‘:’`の
コードは空白で挙動が変わる一例だ。

▼ `yylex`-`‘:’`



3761        case ‘:’:

3762          c = nextc();

3763          if (c == ‘:’) {

3764              if (lex_state == EXPR_BEG ||  lex_state == EXPR_MID ||

3765                  (IS_ARG() && space_seen)) {

3766                  lex_state = EXPR_BEG;

3767                  return tCOLON3;

3768              }

3769              lex_state = EXPR_DOT;

3770              return tCOLON2;

3771          }

3772          pushback©;

3773          if (lex_state == EXPR_END ||

                  lex_state == EXPR_ENDARG ||

                  ISSPACE) {

3774              lex_state = EXPR_BEG;

3775              return ‘:’;

3776          }

3777          lex_state = EXPR_FNAME;

3778          return tSYMBEG;
(parse.y)

今度も`lex_state`関係は無視して`space_seen`まわりだけに注目してほしい。

`space_seen`はトークンの前に空白があると真になる変数だ。それが成立
すると、つまり`‘::’`の前に空白があるとそれは`tCOLON3`になり、空白がな
ければ`tCOLON2`になるらしい。これは前節で`primary`のところで説明した通
りだ。

識別子

ここまでは記号ばかりなのでせいぜい一文字二文字だったが、今度は
もう少し長いものも見てみることにする。識別子のスキャンパターンだ。

まず`yylex`の全体像はこうなっていた。

yylex(...)
{
    switch (c = nextc()) {
      case xxxx:
        ....
      case xxxx:
        ....
      default:
    }

    識別子のスキャンコード
}

次のコードは巨大`switch`の末尾のところからの引用である。
やや長いのでコメントを入れつつ示そう。

▼ `yylex`-識別子



4081        case ‘@’:                 /* インスタンス変数かクラス変数 /

4082          c = nextc();

4083          newtok();

4084          tokadd(‘@’);

4085          if (c == ‘@’) {         / @、つまりクラス変数 */
4086              tokadd('‘);

4087              c = nextc();

4088          }

4089          if (ISDIGIT) {       /* 1など */
4090              if (tokidx == 1) {
4091    rb_compile_error("`%c’ is not a valid instance variable name", c);

4092              }

4093              else {

4094    rb_compile_error(“`@@%c’ is not a valid class variable name”, c);

4095              }

4096          }

4097          if (!is_identchar©) { /* の次に変な文字がある */
4098              pushback(c);
4099              return '‘;

4100          }

4101          break;

4102

4103        default:

4104          if (!is_identchar© || ISDIGIT) {

4105              rb_compile_error("Invalid char `\\%03o’ in expression", c);

4106              goto retry;

4107          }

4108

4109          newtok();

4110          break;

4111      }

4112

4113      while (is_identchar©) {   /* 識別子に使える文字のあいだ…… /

4114          tokadd©;

4115          if (ismbchar©) {      / マルチバイト文字の先頭バイトならば /

4116              int i, len = mbclen©-1;

4117

4118              for (i = 0; i < len; i++) {

4119                  c = nextc();

4120                  tokadd©;

4121              }

4122          }

4123          c = nextc();

4124      }

4125      if ((c == ‘!’ || c == ‘?’) &&

              is_identchar(tok()⁰) &&

              !peek(‘=’)) {      / name!またはname?の末尾一文字 */

4126          tokadd©;

4127      }

4128      else {

4129          pushback©;

4130      }

4131      tokfix();
(parse.y)

最後に`!`/`?`を追加しているところの条件に注目してほしい。
この部分は次のような解釈をするためである。

obj.m=1       # obj.m  =   1       (obj.m= ... ではない)
obj.m!=1      # obj.m  !=  1       (obj.m! ... ではない)

つまり最長一致「ではない」。
最長一致はあくまで原則であって規則ではないので、
ときには破っても構わないのだ。

予約語

識別子をスキャンした後には実はもう100行くらいコードがあって、そこでは
実際の記号を割り出している。先程のコードではインスタンス変数やクラス変
数、ローカル変数などをまとめてスキャンしてしまっていたから、それを改め
て分類するわけだ。

それはまあいいのだが、その中にちょっと変わった項目がある。それは予約語
を漉し取ることだ。予約語は文字種的にはローカル変数と変わらないので、ま
とめてスキャンしておいて後から分類するほうが効率的なのである。

では`char*`文字列`str`があったとして、予約語かどうか見分けるにはどうしたら
いいだろうか。まずもちろん`if`文と`strcmp()`で比較しまくる方法がある。しか
しそれでは全くもって賢くない。柔軟性がない。スピードも線型増加する。
普通はリストとかハッシュにデータだけ分離して、コードを短く済ますだろう。

/* コードをデータに変換する */
struct entry {char *name; int symbol;};
struct entry *table[] = {
    {"if",     kIF},
    {"unless", kUNLESS},
    {"while",  kWHILE},
    /* ……略…… */
};

{
    ....
    return lookup_symbol(table, tok());
}

それで`ruby`はどうしているかと言うと、ハッシュテーブルを使っている。そ
れも完全ハッシュだ。`st_table`の話をしたときにも言った
が、キーになりうる集合が前もってわかっていれば絶対に衝突しないハッシュ
関数を作れることがある。予約語というのは「キーになりうる集合が前もって
わかってい」るわけだから、完全ハッシュ関数が作れそうだ。

しかし「作れる」のと実際に作るのとは別の話だ。
手動で作るなんてことはやっていられない。予約語は増えたり減ったり
するのだから、そういう作業は自動化しなければいけない。

そこで`gperf`である。`gperf`はGNUプロダクトの一つで、値の集合から完全
ハッシュ関数を作ってくれる。`gperf`自体の詳しい使いかたは`man gperf`し
てもらうとして、ここでは生成した結果の使いかただけ述べよう。

`ruby`での`gperf`の入力ファイルは`keywords`で出力は`lex.c`である。
`parse.y`はこれを直接`#include`している。基本的にはCの
ファイルを`#include`す
るのはよろしくないが、関数一つのために本質的でないファイル分割をやるの
はさらによろしくない。特に`ruby`では`extern`関数はいつのまにか拡張ライブラリ
に使われてしまう恐れがあるので、互換性を保ちたくない関数はできるだけ
`static`にすべきなのだ。

そしてその`lex.c`には`rb_reserved_word()`という関数が定義されている。
予約語の`char*`をキーにしてこれを呼ぶと索ける。返り値は、見付から
なければ`NULL`、見付かれば(つまり引数が予約語なら)`struct kwtable*`が返る。
`struct kwtable`の定義は以下のとおり。

▼ `kwtable`



   1  struct kwtable {char *name; int id²; enum lex_state state;};
(keywords)

`name`が予約語の字面、`id⁰`がその記号、`id¹`が修飾版の記号
(`kIF_MOD`など)。
`lex_state`は「その予約語を読んだ後に移行すべき`lex_state`」である。
`lex_state`については次章で説明する。

実際に索いているのはこのあたりだ。

▼ `yylex()`-識別子-`rb_reserved_word()`を呼ぶ



4173                  struct kwtable kw;

4174

4175                  / See if it is a reserved word.  */

4176                  kw = rb_reserved_word(tok(), toklen());

4177                  if (kw) {
(parse.y)

文字列類

`yylex()`のダブルクォート(`"`)のところを見ると、こうなっている。

▼ `yylex`-`‘"’`



3318        case ‘"’:

3319          lex_strterm = NEW_STRTERM(str_dquote, ‘"’, 0);

3320          return tSTRING_BEG;
(parse.y)

なんと一文字目だけをスキャンして終了してしまう。そこで今度は
規則を見てみると、次の部分に`tSTRING_BEG`が見付かった。

▼ 文字列関連の規則

string1 : tSTRING_BEG string_contents tSTRING_END

string_contents :
| string_contents string_content

string_content : tSTRING_CONTENT
| tSTRING_DVAR string_dvar
| tSTRING_DBEG term_push compstmt ‘}’

string_dvar : tGVAR
| tIVAR
| tCVAR
| backref

term_push :

この規則は文字列中の式埋め込みに対応するために導入された部分である。
`tSTRING_CONTENT`がリテラル部分、`tSTRING_DBEG`が`“#{”`である。
`tSTRING_DVAR`は「後に変数が続く`#`」を表す。例えば

".....#$gvar...."

のような構文である。説明していなかったが埋め込む式が変数一個の
場合には`{`と`}`は省略できるのだ。ただしあまり推奨はされない。
ちなみに`DVAR`、`DBEG`の`D`は`dynamic`の略だと思われる。

また`backref`は`$1 $2`……とか`$& $’`といった正規表現関連の
特殊変数を表す。

`term_push`は「アクションのための規則」である。

さてここで`yylex()`に戻る。
単純にパーサに戻っても、コンテキストが文字列の「中」になる
わけだから、次の`yylex()`でいきなり変数だの`if`だのスキャンされたら困る。
そこで重要な役割を果たすのが……

      case '"':
        lex_strterm = NEW_STRTERM(str_dquote, '"', 0);
        return tSTRING_BEG;

……`lex_strterm`だ、ということになる。`yylex()`の先頭に戻ってみよう。

▼ `yylex()`先頭



3106  static int

3107  yylex()

3108  {

3109      static ID last_id = 0;

3110      register int c;

3111      int space_seen = 0;

3112      int cmd_state;

3113

3114      if (lex_strterm) {

              /* ……文字列のスキャン…… */

3131          return token;

3132      }

3133      cmd_state = command_start;

3134      command_start = Qfalse;

3135    retry:

3136      switch (c = nextc()) {
(parse.y)

`lex_strterm`が存在するときは問答無用で文字列モードに突入するようになっ
ている。ということは逆に言うと`lex_strterm`があると文字列をスキャン中と
いうことであり、文字列中の埋め込み式をパースするときには`lex_strterm`を
0にしておかなければならない。そして埋め込み式が終わったらまた戻さな
ければならない。それをやっているのが次の部分だ。

▼ `string_content`



1916  string_content  : ….

1917                  | tSTRING_DBEG term_push

1918                      {

1919                          $1 = lex_strnest;

1920                          $$ = lex_strterm;

1921                          lex_strterm = 0;

1922                          lex_state = EXPR_BEG;

1923                      }

1924                    compstmt ‘}’

1925                      {

1926                          lex_strnest = $1;

1927                          quoted_term = $2;

1928                          lex_strterm = $3;

1929                          if (($$ = $4) && nd_type($$) == NODE_NEWLINE) {

1930                              $$ = $$→nd_next;

1931                              rb_gc_force_recycle((VALUE)$4);

1932                          }

1933                          $$ = NEW_EVSTR($$);

1934                      }
(parse.y)

埋め込みアクションで`lex_strterm`を`tSTRING_DBEG`の値として保存し
(事実上のスタックプッシュ)、通常のアクションで復帰している(ポップ)。
なかなかうまい方法だ。

しかしなんでこのような七面倒くさいことをするのだろう。普通にスキャンし
ておいて、「`#{`」を見付けた時点で`yyparse()`を再帰呼び出しすればいい
のではないだろうか。そこに実は問題がある。`yyparse()`は再帰呼び出しで
きないのだ。これはよく知られた`yacc`の制限である。値の受け渡しに利用す
る`yylval`がグローバル変数なので、うかつに再帰すると値が壊れてしまう。
`bison`(GNUの`yacc`)だと`%pure_parser`というディレクティブを使うこと
で再帰可能にできるのだが、現在の`ruby`は`bison`を仮定しないことになっ
ている。現実にBSD由来のOSやWindowsなどでは`byacc`(Berkeley yacc)
を使うことが多いので`bison`が前提になるとちょっと面倒だ。

`lex_strterm`

見てきたように`lex_strterm`は真偽値として見た場合はスキャナが文字列モードか
そうでないかを表すわけだが、実は内容にも意味がある。まず型を見てみよう。

▼ `lex_strterm`



  72  static NODE *lex_strterm;
(parse.y)

まずこの定義から型は`NODE*`だとわかる。これは構文木に使うノードの型で、
第12章『構文木の構築』で詳しく説明する。とりあえずは、三要素を持つ構造体
である、`VALUE`なので`free()`する必要がない、の二点を押さえておけばよ
い。

▼ `NEW_STRTERM()`



2865  #define NEW_STRTERM(func, term, paren) \

2866          rb_node_newnode(NODE_STRTERM, (func), (term), (paren))
(parse.y)

これは`lex_strterm`に格納するノードを作るマクロだ。まず`term`は文字列の終
端文字を示す。例えば`“`文字列ならば`”`である。`‘`文字列ならば`’`である。

`paren`は`%`文字列のときに対応する括弧を格納するのに使う。例えば

%Q(..........)

なら`paren`に`‘(’`が入る。そして`term`には閉じ括弧`‘)’`が入る。
`%`文字列以外のときは`paren`は0だ。

最後に`func`だが、文字列の種類を示す。
使える種類は以下のように決められている。

▼ `func`



2775  #define STR_FUNC_ESCAPE 0×01  /* \nなどバックスラッシュ記法が有効 /

2776  #define STR_FUNC_EXPAND 0×02  / 埋め込み式が有効 /

2777  #define STR_FUNC_REGEXP 0×04  / 正規表現である /

2778  #define STR_FUNC_QWORDS 0×08  / %w(….)または%W(….) /

2779  #define STR_FUNC_INDENT 0×20  / <<-EOS */

2780

2781  enum string_type {

2782      str_squote = (0),

2783      str_dquote = (STR_FUNC_EXPAND),

2784      str_xquote = (STR_FUNC_ESCAPE|STR_FUNC_EXPAND),

2785      str_regexp = (STR_FUNC_REGEXP|STR_FUNC_ESCAPE|STR_FUNC_EXPAND),

2786      str_sword  = (STR_FUNC_QWORDS),

2787      str_dword  = (STR_FUNC_QWORDS|STR_FUNC_EXPAND),

2788  };
(parse.y)

つまり`enum string_type`のそれぞれの意味は次のようだとわかる。

`str_squote`	`’`文字列/`%q`
`str_dquote`	`"`文字列/`%Q`
`str_xquote`	コマンド文字列(本書では説明していない)
`str_regexp`	正規表現
`str_sword`	`%w`
`str_dword`	`%W`

文字列スキャン関数

あとは文字列モードのときの`yylex()`、つまり冒頭の`if`を読めばいい。

▼ `yylex`-文字列



3114      if (lex_strterm) {

3115          int token;

3116          if (nd_type(lex_strterm) == NODE_HEREDOC) {

3117              token = here_document(lex_strterm);

3118              if (token == tSTRING_END) {

3119                  lex_strterm = 0;

3120                  lex_state = EXPR_END;

3121              }

3122          }

3123          else {

3124              token = parse_string(lex_strterm);

3125              if (token == tSTRING_END || token == tREGEXP_END) {

3126                  rb_gc_force_recycle((VALUE)lex_strterm);

3127                  lex_strterm = 0;

3128                  lex_state = EXPR_END;

3129              }

3130          }

3131          return token;

3132      }
(parse.y)

大きくヒアドキュメントとそれ以外に分かれている。のだが、今回は
`parse_string()`は読まない。前述のように大量の条件付けがあるため、凄まじ
いスパゲッティコードになっている。例え説明してみたとしても「コードその
ままじゃん!」という文句が出ること必至だ。しかも苦労するわりに全然面白
くない。

しかし全く説明しないわけにもいかないので、スキャンする対象ごとに関数を
分離したものを添付CD-ROMに入れて
おく\footnote{`parse_string()`の解析:添付CD-ROM`doc/parse_string.html`}。
興味のある読者はそちらを眺めてみてほしい。

ヒアドキュメント

普通の文字列に比べるとヒアドキュメントはなかなか面白い。やはり他の
要素と違って単位が「行」であるせいだろう。しかも開始記号がプログラムの
途中に置けるところが恐ろしい。まずヒアドキュメントの開始記号をスキャン
する`yylex()`のコードから示そう。

▼ `yylex`-`‘<’`



3260        case ‘<’:

3261          c = nextc();

3262          if (c == ‘<’ &&

3263              lex_state != EXPR_END &&

3264              lex_state != EXPR_DOT &&

3265              lex_state != EXPR_ENDARG &&

3266              lex_state != EXPR_CLASS &&

3267              (!IS_ARG() || space_seen)) {

3268              int token = heredoc_identifier();

3269              if (token) return token;
(parse.y)

例によって`lex_state`の群れは無視する。すると、ここでは「`<<`」だけを
読んでおり、残りは`heredoc_identifier()`でスキャンするらしいとわかる。
そこで`heredoc_identifier()`だ。

▼ `heredoc_identifier()`



2926  static int

2927  heredoc_identifier()

2928  {

          /* ……省略……開始記号を読む /

2979      tokfix();

2980      len = lex_p – lex_pbeg;   /(A)/

2981      lex_p = lex_pend;         /(B)/

2982      lex_strterm = rb_node_newnode(NODE_HEREDOC,

2983                          rb_str_new(tok(), toklen()),  / nd_lit /

2984                          len,                          / nd_nth /

2985          /©/       lex_lastline);                / nd_orig */

2986

2987      return term == ‘`’ ? tXSTRING_BEG : tSTRING_BEG;

2988  }
(parse.y)

開始記号(`<<EOS`)を読むところはどうでもいいのでバッサリ省略した。
ここまでで入力バッファは図10のようになっているはずだ。
入力バッファは行単位だったことを思い出そう。

`heredoc_identifier()`でやっていることは以下の通り。
(A)`len`は現在行のうち読み込んだバイト数だ。
(B)そしていきなり`lex_p`を行の末尾まで飛ばす。
ということは、読み込み中の行のうち開始記号の後が読み捨てられて
しまっている。この残りの部分はいつパースするのだろう。
その謎の答えは©で`lex_lastline`(読み込み中の行)と
`len`(既に読んだ長さ)を保存しているところにヒントがある。

では`heredoc_identifier()`前後の動的コールグラフを以下に簡単に示す。

yyparse
    yylex(case '<')
        heredoc_identifier(lex_strterm = ....)
    yylex(冒頭のif)
        here_document

そしてこの`here_document()`がヒアドキュメント本体のスキャンを行っている。
以下に、異常系を省略しコメントを加えた`here_document()`を示す。
`lex_strterm`が`heredoc_identifier()`でセットしたままであることに
注意してほしい。

▼ `here_document()`(簡約版)



here_document(NODE here)

{

    VALUE line;                      / スキャン中の行 /

    VALUE str = rb_str_new("", 0);   / 結果をためる文字列 */
/* ……異常系の処理、省略…… */
if (式の埋め込み式が無効) {
do {
line = lex_lastline;     /(A)/
rb_str_cat(str, RSTRING→ptr, RSTRING→len);
lex_p = lex_pend;        /(B)/
if (nextc() == -1) {     /©/
goto error;
}
} while (読み込み中の行が終了記号と等しくない);
}
else {
/* 式の埋め込み式ができる場合……略 */
}
heredoc_restore(lex_strterm);
lex_strterm = NEW_STRTERM(-1, 0, 0);
yylval.node = NEW_STR(str);
return tSTRING_CONTENT;
}

`rb_str_cat()`はRubyの文字列の末尾に`char*`を連結する関数だ。
つまり(A)では現在読み込み中の行`lex_lastline`を`str`に連結している。
連結したらもう今の行は用済みだ。(B)で`lex_p`をいきなり行末に
飛ばす。そして©が問題で、ここでは終了チェックをしているように
見せかけつつ実は次の「行」を読み込んでいるのだ。思い出してほしいの
だが、`nextc()`は行が読み終わっていると勝手に次の行を読み込むという
仕様だった。だから(B)で強制的に行を終わらせているので
©で`lex_p`は次の行に移ることになる。

そして最後に`do`〜`while`ループを抜けて`heredoc_restore()`だ。

▼ `heredoc_restore()`



2990  static void

2991  heredoc_restore(here)

2992      NODE *here;

2993  {

2994      VALUE line = here→nd_orig;

2995      lex_lastline = line;

2996      lex_pbeg = RSTRING→ptr;

2997      lex_pend = lex_pbeg + RSTRING→len;

2998      lex_p = lex_pbeg + here→nd_nth;

2999      heredoc_end = ruby_sourceline;

3000      ruby_sourceline = nd_line(here);

3001      rb_gc_force_recycle(here→nd_lit);

3002      rb_gc_force_recycle((VALUE)here);

3003  }
(parse.y)

`here→nd_orig`には開始記号のあった行が入っている。
`here→nd_nth`には開始記号のあった行のうち既に読んだ長さが入っている。
つまり開始記号の直後から何事もなかったかのようにスキャンしつづけられる
というわけだ(図11)。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

parser.textile

parser.textile

Chapter 10: Parser

Outline of this chapter

Parser construction

Dissecting `parse.y`

General remarks about grammar rules

Coding rules

Important symbols

Program structure

`program`

`stmt`

`expr`

`arg`

`primary`

衝突するリスト

スキャナ

パーサ概形

入力バッファ

`nextc()`

`pushback()`

`peek()`

トークンバッファ

`yylex()`

`‘!’`

`‘>’`

`‘:’`

識別子

予約語

文字列類

`lex_strterm`

文字列スキャン関数

ヒアドキュメント

Files

parser.textile

Latest commit

History

parser.textile

File metadata and controls

Chapter 10: Parser

Outline of this chapter

Parser construction

Dissecting `parse.y`

General remarks about grammar rules

Coding rules

Important symbols

Program structure

`program`

`stmt`

`expr`

`arg`

`primary`

衝突するリスト

スキャナ

パーサ概形

入力バッファ

`nextc()`

`pushback()`

`peek()`

トークンバッファ

`yylex()`

`‘!’`

`‘>’`

`‘:’`

識別子

予約語

文字列類

`lex_strterm`

文字列スキャン関数

ヒアドキュメント