LEX(1,C)                   AIX Commands Reference                   LEX(1,C)

-------------------------------------------------------------------------------

lex

PURPOSE

     Generates a C language program that matches patterns for simple
     lexical analysis of an input stream.

SYNTAX

            +----------+   +--------+   +--------+
     lex ---|  +----+  |---| one of |---|        |---|
            +--| -m |--+   | +----+ |   +- file -+
             ^ | -t | |    +-| -n |-+    ^      |
             | +----+ |      | -v |      +------+
             +--------+      +----+

DESCRIPTION

     The lex command reads file or standard input, generates a C language
     program, and writes it to a file named lex.yy.c. This file can be
     compiled with the cc command. The lex command builds the program from
     the rules and actions contained in file. The compiled program can
     then receive input, break the input into the logical pieces defined
     by the rules in file, and run the program fragments contained in the
     actions in file. For a more detailed discussion of the lex command
     and its operation, see AIX Operating System Programming Tools and
     Interfaces.

     The generated program is a C language function called yylex, which
     the lex command stores in lex.yy.c. You can use the yylex function
     alone to recognize simple, one-word input, or you can use it with
     other C language programs to perform more difficult input analysis
     functions. For example, you can use the lex command to generate a
     program that simplifies an input stream before sending it to a parser
     program generated by the yacc command.

     The function yylex analyzes the input stream using a program
     structure called a "finite state machine". This structure allows the
     program to exist in only one state (or condition) at a time. The
     number of states is finite. The rules in file determine how the
     program moves from one state to another.

     If you do not specify a file, the lex command reads standard input.
     It treats multiple files as a single file.
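     The finite state machine described above can be pictured with a
     small hand-written scanner. The following is only a sketch, not the
     code that lex generates: the state names, next_state, and
     count_tokens are hypothetical and were chosen for illustration. It
     treats a run of digits as a number and a letter followed by letters
     or digits as a word, and it is in exactly one state after each
     character.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical states for a tiny scanner. The tables that lex builds
   are much larger, but the machine is still in one state at a time. */
enum state { BETWEEN, IN_NUMBER, IN_WORD };

/* Transition function: the next state depends on the current state
   and on the character just read. */
static enum state next_state(enum state st, unsigned char c)
{
    if (st == IN_WORD)                 /* a word may continue with digits */
        return isalnum(c) ? IN_WORD : BETWEEN;
    if (isdigit(c))
        return IN_NUMBER;
    if (isalpha(c))                    /* a letter always starts a word */
        return IN_WORD;
    return BETWEEN;
}

/* Count the tokens of one kind (IN_NUMBER or IN_WORD) in s.
   A token is complete when the machine leaves that token's state. */
int count_tokens(const char *s, enum state kind)
{
    enum state st = BETWEEN;
    int count = 0;

    assert(s != NULL);
    for (;;) {
        unsigned char c = (unsigned char)*s;
        enum state next = next_state(st, c);

        if (st != next && st == kind && kind != BETWEEN)
            count++;
        st = next;
        if (c == '\0')                 /* the terminator also ends a token */
            break;
        s++;
    }
    return count;
}
```

     For the input "x1 = 42 + y", count_tokens reports two words (x1 and
     y) and one number (42). A scanner produced by lex encodes the same
     kind of transitions in generated tables instead of hand-written
     tests.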

     Note:  Since the lex command uses fixed names for intermediate and
            output files, you can have only one lex command-generated
            program in a given directory.

  Input File Format (file)

     The input file can contain three sections: definitions, rules, and
     user subroutines. Each section must be separated from the others by
     a line containing only the delimiter, %%. The format is:

          definitions
          %%
          rules
          %%
          user subroutines

     The purpose and format of each section are described in the
     following sections.

PROGRAMMING UNDER THE MBCS ENVIRONMENT

     You must use the -m option in order to map potentially large numbers
     of MBCS characters into a limited working character set. The
     restriction is that, in all regular expressions, the total number of
     distinct characters plus the total number of partitions of the range
     sets must be less than 255. Range sets are expanded using collation
     weight under the compile-time locale. Neither n-to-1 nor 1-to-n
     extended collation is allowed.

     In the definitions section, the defined variables must be in ASCII.

     If you use your own main( ) rather than the one in libl.a, you must
     add setlocale (LC_ALL, "") in that routine.

     During run time, when a match is found, yytext returns an MBCS
     character string. You will have to use MBCS conversion routines to
     get the actual matching characters.

DEFINITIONS

     If you want to use variables in your rules, you must define them in
     this section. The variable names are put in the left column, and
     their definitions are put in the right column. For example, if you
     wanted to define "D" as a numerical digit, you would write:

          D     [0-9]

     You can use a defined variable in the rules section by enclosing the
     variable name in braces ("{D}").

     In the definitions section, you can set table sizes for the
     resulting finite state machine. The default sizes are large enough
     for small programs. You may want to set larger sizes for more
     complex programs.

          %p n     Number of positions is n (default 2000)
          %n n     Number of states is n (default 500)
          %e n     Number of parse tree nodes is n (default 1000)
          %a n     Number of transitions is n (default 3000)

     If extended characters appear in regular expression strings, you may
     need to reset the output array size with the %o parameter (possibly
     to array sizes in the range 10,000 to 20,000). This reset reflects
     the much larger number of extended characters relative to the number
     of ASCII characters.

RULES

     Once you have defined your terms, you can write the rules section.
     It contains the strings and expressions to be matched in the input
     to the yylex function, and the C statements to execute when a match
     is made. This section is required, and it must be preceded by the
     delimiter %%, whether or not you have a definitions section. The lex
     command does not recognize your rules without this delimiter.

     In this section, the left column contains the pattern to be
     recognized in an input file to the yylex function. The right column
     contains the C program fragment that is executed when that pattern
     is recognized. Patterns can include extended characters with one
     exception: these characters may not appear in range specifications
     within character class expressions surrounded by square brackets.

     The columns are separated by a tab. For example, if you wanted to
     search files for the keyword "KEY", you might write:

          (KEY)     printf("found KEY");

     If you include this rule in file, the lexical analyzer yylex matches
     the pattern "KEY" and runs the printf statement.

     Each pattern may have a corresponding action, a C statement to
     execute when the pattern is matched. Each statement must end with a
     semicolon. If you use more than one statement in an action, you must
     enclose all of them in braces. A second delimiter, %%, must follow
     the rules section if you have a user subroutines section.
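     As a rough picture of what the KEY rule above asks yylex to do, the
     following hand-written C sketch scans a string for the pattern,
     saves the matched text, and then runs the action. It is not lex
     output. The names scan_for_key, last_match, matched, and result are
     hypothetical: matched stands in for yytext and result stands in for
     standard output. The sketch also assumes the usual lex behavior of
     copying input that matches no rule to the output unchanged, which
     this page does not describe.

```c
#include <assert.h>
#include <string.h>

#define PATTERN       "KEY"        /* pattern from the example rule */
#define ACTION_OUTPUT "found KEY"  /* text the printf action would print */

static char matched[sizeof PATTERN];  /* stand-in for yytext */
static char result[256];              /* stand-in for standard output */

/* Return the text saved by the most recent match. */
const char *last_match(void)
{
    return matched;
}

/* Scan input for PATTERN. On each match, copy the matched text into
   matched first and then run the action. Any other character is
   copied to result unchanged. Output beyond the buffer is dropped. */
const char *scan_for_key(const char *input)
{
    size_t plen = strlen(PATTERN);
    size_t alen = strlen(ACTION_OUTPUT);
    size_t n = 0;

    assert(input != NULL);
    matched[0] = '\0';
    while (*input != '\0') {
        if (strncmp(input, PATTERN, plen) == 0) {
            memcpy(matched, input, plen);   /* save the match ... */
            matched[plen] = '\0';
            if (n + alen >= sizeof result)
                break;
            memcpy(result + n, ACTION_OUTPUT, alen);  /* ... then act */
            n += alen;
            input += plen;
        } else {
            if (n + 1 >= sizeof result)
                break;
            result[n++] = *input++;         /* unmatched: copy through */
        }
    }
    result[n] = '\0';
    return result;
}
```

     Calling scan_for_key("aKEYb") yields "afound KEYb", and last_match( )
     then returns "KEY". In a real lex program the pattern and the action
     stay in the rules section, and the matching code is generated for
     you.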

     When the yylex function matches a string in the input stream, it
     copies the matched string to an external character array, yytext,
     before it executes any statements in the rules section.

     You can use the following operators to form patterns that you want
     to match:

     x         Matches the character written. The x matches the literal
               character x.

     [ ]       Matches any one character in the enclosed range ([.-.]) or
               the enclosed list ([...]). For example, [abcx-z] matches
               a, b, c, x, y, or z.

     " "       Matches the enclosed character or string even if it is an
               operator. For example, "$" prevents lex from interpreting
               the character "$" as an operator.

     \         Acts the same as " ". For example, \$ also prevents lex
               from interpreting the character "$" as an operator.

     *         Matches zero or more occurrences of the character
               immediately preceding it. For example, x* matches zero or
               more repeated x's.

     +         Matches one or more occurrences of the character
               immediately preceding it.

     ?         Matches either zero or one occurrence of the character
               immediately preceding it.

     ^         Matches the character only at the beginning of a line.
               For example, ^x matches an x at the beginning of a line.

     [^]       Matches any character except those following the ^. For
               example, [^x] matches any character but x.

     .         Matches any character except the new-line character.

     $         Matches the end of a line.

     |         Matches either of two characters. For example, x|y matches
               either x or y.

     /         Matches one character only when followed by a second
               character. It reads only the first character into the
               yytext character array. For example, x/y matches x when it
               is followed by y, and reads x into yytext.

     ( )       Matches the pattern in the parentheses. This operator is
               used for grouping. The whole pattern in the parentheses is
               read into yytext. A group in parentheses can be used in
               place of any single character in any other pattern.
               For example, (xyz123) matches the pattern "xyz123" and
               reads the whole string into yytext.

     {}        Matches the character as you defined it in the definitions
               section. For example, if you defined "D" to be numerical
               digits, "{D}" matches all numerical digits.

     {m,n}     Matches m to n occurrences of the character. For example,
               x{2,4} matches 2, 3, or 4 occurrences of x.

     If a line begins with a blank, the lex command copies the line to
     the output file, lex.yy.c. If the line is in the definitions section
     of file, the lex command copies the line to the declarations section
     of lex.yy.c. If the line is in the rules section, the lex command
     copies the line to the program code section of lex.yy.c.

USER SUBROUTINES

     The lex library has three subroutines, defined as macros, that you
     can use in the rules.

     input( )    Reads a character from yyin.
     unput( )    Replaces a character after it has been read.
     output( )   Writes an output character to yyout.

     You can override these three macros by writing your own code for
     these routines in the user subroutines section. If you write your
     own, you must undefine these macros in the definitions section as
     follows:

          %{
          #undef input
          #undef unput
          #undef output
          %}

     There is no main( ) in the file lex.yy.c; the lex library contains a
     main( ) that calls the lexical analyzer yylex. Therefore, if you do
     not include main( ) in the user subroutines section, you must
     compile the file lex.yy.c by entering "cc -ll lex.yy.c", where -ll
     links the lex library.

     External names generated by the lex command all begin with the
     prefix yy, as in yyin, yyout, yylex, and yytext.

FLAGS

     -m   Uses a partitioning algorithm to reduce the number of character
          sets. This option is mandatory when lex works in locales other
          than C. Under the C locale, in most situations, using this
          option reduces compile time and increases run time.

     -n   Suppresses the statistics summary.
          When you set your own table sizes for the finite state machine
          (see DEFINITIONS), the lex command automatically produces this
          summary if you do not select this flag.

     -t   Writes the file lex.yy.c to standard output instead of to a
          file.

     -v   Provides a one-line summary of the generated
          finite-state-machine statistics.

FILES

     /usr/lib/libl.a     Run-time library.

RELATED INFORMATION

     See the following command: "yacc."

     See "Introduction to International Character Support" in Managing
     the AIX Operating System.

Processed July 12, 1991                                             LEX(1,C)