January 07, 2015

Flexin' your Bison

Code that writes code that writes code

The problem I'd like to solve in this post is a simple network structure generation scheme. Say we have a number of messages that we'd like to send from one machine to another. Being programmers, we'd also like a visual indication of our network transactions; for this I'd use "Wireshark". And we'd like a valid documentation representation, say, in HTML, or generated by doxygen for that matter.

To keep this example as simple as possible, I'll address only the C structure generation and the Wireshark "Lua" extension generation. This pair should be enough to display the general idea behind code generation. What I'd like to achieve is to produce a set of outputs based on a single input, helping me maintain the code base. This approach is similar to the "Strategy" pattern, if you're into patterns.

The tool we're writing is a command line tool for Linux. I'd like the following options:
-l creates Lua output, -c creates C output and -a creates all targets. The tool should take a single input file as an argument with -f and produce a single output file with -o, i.e.

# ./yantlg -c -f telegram.ntl -o telegram.h
# ./yantlg -l -f telegram.ntl -o telegram.lua
# ./yantlg -a -f telegram.ntl

The above examples are what I'd like the final program to use; in this simple version the focus will be on the Lex and Yacc parts of the project, as using GNU getopt for parsing and error reporting of program options is worthy of a separate blog post.

The Lex and Yacc tools are explained in detail in many places; here is, IMHO, a good introduction.

Network telegram language

Lex needs a file it can use to tokenize the language. But first we must define a simple network telegram structure language, or "NTL" for short. I'm sure I'm not the first implementer of something like this; there are most likely a bunch of tools out there, but here's what I consider rational for solving the problem at hand.

The first and simple version of NTL will support the data types listed below. All of these data types, except the end of message marker, need a name to identify the variable of that type, which gives us:

M <name> - message
B <name> - byte (8 bits)
W <name> - word (16 bits)
D <name> - double word (32 bits)
Q <name> - quart word (64 bits)
E - end of message

An example of a test NTL telegram could, therefore, be:

M telegram
B version
B size
W word
D dword
Q qword
E

This will be stored in a file called "telegram.ntl", and this file will serve as the input for our NTL generator (compiler/encoder/writer/whatever).

Flex'ing tokens with lex - "or lexing tokens with Flex"

Lex needs a translation file that defines the tokens; it shares these tokens with the parser. Lex input files are usually identified by the extension "l" (that's a lower case L), at least that's the extension recognized by make, so we'll use this.

%{
#include <stdio.h>
#include <stddef.h>
#include <string.h>  /* for strdup() */
#include "token.h"   /* the small token type codes (T_MESSAGE etc.) */
#include "yyltype.h"
#include "parser.h"

void yyerror(char *s, ...); /* defined in the parser file */

/* definitions for reporting errors, see a lex and yacc doc/book for details */
int yycolumn = 1;
#define YY_USER_ACTION yylloc.first_line = yylloc.last_line = yylineno; \
    yylloc.first_column = yycolumn; yylloc.last_column = yycolumn + yyleng - 1;\
    yycolumn += yyleng;

%}
%option yylineno

%%
[mM]                  { yylval.type = T_MESSAGE; return MESSAGE; }
[eE]                  { return END; }
[bB]                  { yylval.type = T_BYTE; return BYTE; }
[wW]                  { yylval.type = T_WORD; return WORD; }
[dD]                  { yylval.type = T_DWORD; return DWORD; }
[qQ]                  { yylval.type = T_QWORD; return QWORD; }
[a-zA-Z][a-zA-Z0-9_]+ { yylval.name = strdup(yytext); return NAME; }
\n                    { yycolumn = 1; /* reset the column count, otherwise ignore end of line */ }
[ \t]+                ; /* ignore whitespace */
.                     { yyerror("'%s' unknown symbol",yytext); /* Error on unknown tokens */ }
%%


The above is a *simple* lexer for our NTL generator. The only line that may seem a bit strange is in the header section, where #include "parser.h" is added. This file is generated by Yacc, and it must be included for the token names to be defined in the final compilation.

Notice every data type is assigned to a token. A token is identified by a token type and a name (except for the end token). The type stored in yylval is a small code from token.h (T_MESSAGE, T_BYTE, ...) rather than the yacc token number, because the generators later use it to index their lookup tables. Names are matched by the [a-zA-Z][a-zA-Z0-9_]+ regular expression.

This expression states that a name must begin with a letter, either upper or lower case, as defined by the first [] part: [a-zA-Z]. This must be followed by one or more characters that are either letters, digits or the underscore character, as defined by the second [] part: [a-zA-Z0-9_]+. So "crc_16" is a valid name, while "16crc" is not.

The final three expressions reset the column counter on newlines, ignore whitespace, and force error reporting for unknown token characters.
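
To make the lexer's output concrete, the telegram.ntl example above is handed to the parser as roughly the following stream of tokens (my illustration, not program output):

M telegram  ->  MESSAGE  NAME("telegram")
B version   ->  BYTE     NAME("version")
B size      ->  BYTE     NAME("size")
W word      ->  WORD     NAME("word")
D dword     ->  DWORD    NAME("dword")
Q qword     ->  QWORD    NAME("qword")
E           ->  END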

Tokenized code generation

The code generation part is handled with Yacc (Bison). This tool needs a file with the extension "y", again an extension recognized by make. This file must contain the various specific C code parts for our NTL generator program.

Like the Lex file for tokenization, this file is divided into a number of sections. The percentage curly braces section, "%{ C code %}", contains the C code part, including header file definitions etc. The %token lines contain the token definitions, and, finally, the percentage percentage section, "%% bnf %%", contains the BNF notated language syntax.
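
Schematically, the Yacc input file therefore has this layout:

%{
/* C code: includes, globals, error reporting helpers */
%}

/* declarations: %locations, %union, %token ... */

%%
/* grammar rules in BNF notation, with C actions */
%%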

%{
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>

#include "token.h"
#include "yyltype.h"

extern int yyleng;           /* Defined by the lexer; the length of the current token text */
extern int yylineno;         /* Defined by the lexer to get the current line number */
extern char *yytext;         /* Defined by the lexer to get the current token stream */
struct YYLTYPE yylloc;       /* This is the lexer location structure, it is used to produce the parser error messages */
struct Token *root,*current; /* This is the parsed token tree, it is collected by the parser and used later to produce the output code */

void yyerror(char *s, ...)
{
  va_list ap;
  va_start(ap, s);

  fprintf(stderr,"error: line: ");

  if(yylloc.first_line) {
      fprintf(stderr, "%d.%d-%d.%d ", yylloc.first_line, yylloc.first_column,yylloc.last_line, yylloc.last_column);
  }
  vfprintf(stderr, s, ap);
  fprintf(stderr, "\n");
  va_end(ap);
}

void lyyerror(YYLTYPE t, char *s, ...)
{
  va_list ap;
  va_start(ap, s);

  fprintf(stderr,"error: line: ");

  if(t.first_line) {
      fprintf(stderr, "%d.%d - %d.%d ", t.first_line, t.first_column, t.last_line, t.last_column);
  }
  vfprintf(stderr, s, ap);
  fprintf(stderr, "\n");
  va_end(ap);
}

int yywrap()
{
  return 1;
}

%}

%locations


The above C code section contains the error reporting parts; see the yacc/bison documentation or check a book on the subject for details. We will focus on the next part, where the actual parsing definitions are placed.

%union {
    int type;
    char *name;
}

%token <type> MESSAGE
%token <type> BYTE
%token <type> WORD
%token <type> DWORD
%token <type> QWORD
%token <name> NAME
%token END

%%
message: /* empty */
   |
   message data;

data:  byte | word | dword | qword | end;

message: MESSAGE NAME { root = current = create($1,$2); };
 | NAME { lyyerror(@1,"'%s' undeclared identifier",$1); /*YYABORT*/;};
 | error '\n'

byte: BYTE NAME { current = current->next = create($1,$2); };
 | NAME { lyyerror(@1,"'%s' undeclared identifier",$1); /*YYABORT*/;};
 | error '\n'

word: WORD NAME { current = current->next = create($1,$2);};
 | NAME {lyyerror(@1,"'%s' undeclared identifier",$1); /*YYABORT*/;};
 | error '\n'

dword: DWORD NAME { current = current->next = create($1,$2);};
 | NAME {lyyerror(@1,"'%s' found where d was exspected",$1); /*YYABORT*/;};
 | error '\n'

qword: QWORD NAME { current = current->next = create($1,$2);};
 | NAME {lyyerror(@1,"'%s' undeclared identifier",$1); /*YYABORT*/;};
 | error '\n'

end: END { current = NULL;};
%%


The above yacc code defines, in the %union, the semantic value of the input as it is read from yyin: either an int, in the case of a token type, or a char*, in the case of a token name. These are used to build a token table (a simple linked list) from the files token.h and token.c.

The next part, all the %token <foo> bar lines, defines the tokens. Finally, the %% message: section defines the NTL language BNF.

Basically, what the BNF defines is that a message must contain a set of data elements, which can be of the types byte, word, dword, qword or end.

Then, each of the token types is assigned some code, which in a correctly formatted scenario feeds a token generator, a simple C linked list, with the token type and name. If there is an error in the syntax, this is reported to the user and the parser will read to the next newline.
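
As an aside, here is roughly how the pieces fit together at build time (the file names lexer.l, parser.y and main.c are my assumptions; in the actual project make's built-in rules for the "l" and "y" extensions handle this for us):

# bison -d -o parser.c parser.y
# flex -o lexer.c lexer.l
# cc -o yantlg lexer.c parser.c token.c generator.c main.c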

The token linked list

Here's a listing of the token structure used for creating the simple linked list. First the header, then the implementation file.

#ifndef token_h_
#define token_h_

/* Small type codes for the tokens. The values double as indexes into
   the generators' lookup tables, so the order matters. */
enum TokenType {
    T_NONE = 0,
    T_MESSAGE, /* 1 */
    T_BYTE,    /* 2 */
    T_WORD,    /* 3 */
    T_DWORD,   /* 4 */
    T_QWORD,   /* 5 */
};

struct Token {

    int type;
    char *name;
    struct Token *next;
};

extern struct Token *root, *current; /* defined in the parser file */

struct Token *create(int type, char *name);


#endif // token_h_


The above defines the token type codes and the token structure that the code is generated from.

#include <stdlib.h>
#include <string.h>
#include "token.h"

struct Token *create(int type, char *name)
{
    struct Token *t = malloc(sizeof(struct Token));
    memset(t, 0x00, sizeof(struct Token));
    t->type = type;
    t->name = name;
    return t;
}

The above defines a function that creates and assigns the token type and name to a simple linked list. This list is used to generate the code from.
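
The project doesn't show a cleanup routine, but since create() allocates each node and the lexer strdup()s each name, a matching destroy() is a natural (hypothetical) companion:

#include <stdlib.h>
#include "token.h"

/* Free the whole token list: both the nodes from create() and the
   names the lexer allocated with strdup(). */
void destroy(struct Token *t)
{
    while (t != NULL) {
        struct Token *next = t->next;
        free(t->name);
        free(t);
        t = next;
    }
}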

The code generation for this project is also fairly simple. The principle is that a single function generates the code for each output type. This function is then used from the main program through a function pointer, providing the polymorphic ability for generating code.

The benefit of using function pointers is that to generate a new output all you have to do is write a function for the generation, and register this function. The other parts of the program stay the same, as sketched after the next listing.

#ifndef C_GENERATOR_H
#define C_GENERATOR_H

enum OUTPUT_TYPE {
    GENERATE_NONE = 0x00,
    GENERATE_TREE,
    GENERATE_C,
    GENERATE_LUA,
    GENERATE_MAX = GENERATE_LUA,
};

void c_generator(char *filename);
void tree_generator(char *filename);
void lua_generator(char *filename);
typedef void (*generator)(char *);

#endif /* C_GENERATOR_H */


The above defines a set of functions, one for each output type, a set of constants identifying them, and the generator function pointer type.
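
To illustrate the registration idea, here is a sketch of how the main program might map the constants to the functions. The generate() helper and the header name generator.h are my assumptions, not part of the original listing:

#include <stddef.h>
#include "generator.h" /* assumed name of the header listed above */

/* Table mapping each OUTPUT_TYPE constant to its generator function;
   registering a new output is just a matter of adding a line here. */
static generator generators[] = {
    [GENERATE_NONE] = NULL,
    [GENERATE_TREE] = tree_generator,
    [GENERATE_C]    = c_generator,
    [GENERATE_LUA]  = lua_generator,
};

/* Hypothetical dispatch helper, called from main() after option parsing. */
static void generate(enum OUTPUT_TYPE type, char *outfile)
{
    if (type > GENERATE_NONE && type <= GENERATE_MAX && generators[type] != NULL)
        generators[type](outfile);
}

The implementations of the generators themselves look like this: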

#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <string.h>   /* for strcat() */
#include "token.h"
#include "file.h"

void tree_generator(char *outfile) {

    struct Token *next;

    outfile = strcat(outfile,".tree"); /* outfile must have room for the suffix */
    fprintf(stdout,"Generating: %s\n",outfile);

    fprintf(stdout,"Token: %s Type: %d\n",root->name,root->type);
    for(next = root->next; next!=NULL; next = next->next) {
        fprintf(stdout,"\t |-- Type: %d - Token: %s\n",next->type,next->name);
    }
}

void c_generator(char *outfile)
{
    struct Token *next;

    outfile = strcat(outfile,".h"); /* outfile must have room for the suffix */
    fprintf(stdout,"Generating: %s\n",outfile);

    /* indexed by the token type codes from token.h */
    char *type_words[] =  {"","struct","uint8_t","uint16_t","uint32_t","uint64_t"};
    char *hton_functions[] = {"","","","htons","htonl","undefined_htoll"};
    char *ntoh_functions[] = {"","","","ntohs","ntohl","undefined_ntohll"};
    FILE *ntlout = file_open(outfile,"w","couldn't open temp for writing\n");

    fprintf(ntlout,"#ifndef %s_h\n",root->name);
    fprintf(ntlout,"#define %s_h\n\n",root->name);

    fprintf(ntlout,"#include <stdint.h>\n");
    fprintf(ntlout,"#include <arpa/inet.h>\n\n");

    fprintf(ntlout,"struct %s {\n",root->name);
    for(next = root->next;next!=NULL;next = next->next) {
        fprintf(ntlout,"\t%s %s;\n",type_words[next->type],next->name);
    }
    fprintf(ntlout,"\n");
    fprintf(ntlout,"\tvoid hton();\n");
    fprintf(ntlout,"\tvoid ntoh();\n");
    fprintf(ntlout,"};\n\n");

    fprintf(ntlout,"void %s::hton()\n{\n",root->name);
    for(next = root->next;next!=NULL;next = next->next) {
        fprintf(ntlout,"\t%s = %s(%s);\n",next->name,hton_functions[next->type],next->name);
    }
    fprintf(ntlout,"}\n\n");

    fprintf(ntlout,"void %s::ntoh()\n{\n",root->name);
    for(next = root->next;next!=NULL;next = next->next) {
        fprintf(ntlout,"\t%s = %s(%s);\n",next->name,ntoh_functions[next->type],next->name);
    }
    fprintf(ntlout,"}\n\n");

    fprintf(ntlout,"#endif // %s_h\n",root->name);

    fclose(ntlout);
}

void lua_generator(char *outfile) {
    outfile = strcat(outfile,".lua");
    fprintf(stdout,"Generating: %s\n",outfile);
}
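
For reference, running the C generator on the telegram.ntl example from earlier would produce roughly the following telegram.h. Note that the generated hton()/ntoh() members make this C++ rather than plain C, and that byte fields get an empty no-op conversion:

#ifndef telegram_h
#define telegram_h

#include <stdint.h>
#include <arpa/inet.h>

struct telegram {
    uint8_t version;
    uint8_t size;
    uint16_t word;
    uint32_t dword;
    uint64_t qword;

    void hton();
    void ntoh();
};

void telegram::hton()
{
    version = (version);
    size = (size);
    word = htons(word);
    dword = htonl(dword);
    qword = undefined_htoll(qword);
}

void telegram::ntoh()
{
    version = (version);
    size = (size);
    word = ntohs(word);
    dword = ntohl(dword);
    qword = undefined_ntohll(qword);
}

#endif // telegram_h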


Disadvantages

Let's face it, writing code that writes code has both its advantages and disadvantages.
The advantages are many, and discussed in depth in various resources like "The Pragmatic Programmer" or "The Art Of Unix Programming". The disadvantages are usually not addressed, as they're not the key selling point, and their consequences are usually ignored. Remember:

"If you write code that writes bad code, there's no end to what you can't generate!"

If you just want to know how, jump to the code section. If you want to know why, the following lists some of the pitfalls I've met during my travels in professional programmer land.

The disadvantages I personally have experienced with code generation have all been in the same areas:
  1. The outputted code was impossible to debug and use due to bad code design.
  2. The input was so advanced that it was a programming language on its own.
  3. Lack of tokenization after the lexical analysis.
  4. Choosing the wrong tool for the job.
Well, the first item leads to the kind of code you just want to, for lack of a better word, refactor. Bad code should be removed immediately, as its consequences are your project's greatest risk. The first item is also heavily related to the second.

The second leads to your project having to educate its programmers in the specific code generation language. This is expensive, as you cannot hire an expert; you'll have to, as a project or company, educate your own experts.

That will be hard, as the experts your project educates cannot use that expertise elsewhere, and this might cause a motivational issue in your project. Plus, your project will have to maintain the full infrastructure related to a programming language: usage examples, documentation and so on. Expensive.

Item three is a common mistake: a code generation step that is often overlooked in the process of producing the code generator output. This may be due to lack of experience or, even worse, due to a poor programming work environment. The latter is the worst, as the environment enforces a fast process without any consideration for the side effects. A side effect could be porting of the generated code to a new platform, or documentation generation in a different format.

The fourth item is a common mistake in many technical matters. Choosing a tool to perform a specific task is often not trivial, and in many cases some of the newer parsers etc. are misused due to lack of understanding of the problem domain. One of the things I have seen is the usage of Apache ant where make would do.

This is not a criticism of either Apache ant or make; I'm certain that they both have their specific problems to solve. Nevertheless, I believe that the purpose of Apache ant can be served by make, but that is just my personal opinion. I honestly believe in the approach taken in "The Art Of Unix Programming", where the suggested solution is to make use of the tools already available to you, creating only the smallest possible solution to perform the task needed.
