The HTML Parser
The HTML parser was written in C using Microsoft Visual
Studio. The final form of the parser is much less elegant, yet probably
more efficient, than the version I originally intended. I had planned
to generate a nice parser using the ANTLR tool. ANTLR is a parser
generator written in Java that can currently produce Java, C++, or Sather
code. For more information on ANTLR, visit www.ANTLR.org.
The plan was to use ANTLR to generate an HTML parser in C++, which could
supposedly be compiled for WindowsCE.
The first step to generating a parser with ANTLR
is to specify a grammar file which ANTLR uses to produce appropriate code.
The grammar file contains lexer rule definitions for token recognition
and parser rule definitions for syntactic construct recognition.
From these rules, ANTLR creates an LL(k) parser (k being the number of symbols
of lookahead: tokens for the parser, characters for the lexer). I therefore set
about creating a grammar file to describe HTML constructs. However,
due to my limited knowledge of HTML, I found this grammar specification
to be much more challenging than I anticipated. Luckily, I found
a nearly complete HTML grammar specification for ANTLR co-authored by one
of the authors of ANTLR itself, so I began to modify this specification
to fit the requirements of the project, which included extracting only
text and link information from the HTML code. I quickly discovered
that the grammar specification in the file followed the HTML definition
extremely strictly, much more so than Netscape Navigator or Microsoft Internet
Explorer, and the resulting parser therefore often exited on an unexpected token
before it had parsed a file fully. My modifications to the
grammar consisted of replacing most of the rules for specific HTML tags
with a single generic tag rule. This made the ANTLR-produced
parser code much more lenient in its parsing of HTML code. The final
modified grammar, which I thought produced a reasonable parser, can be seen
here. The current contents of this document can be seen below before
and after being parsed by the ANTLR-produced parser:
Before Parsing
After Parsing

The problem with the code generated by ANTLR occurred
when I attempted to port the parser to WindowsCE. Although ANTLR
refrains from using any Windows API calls, I found the ANTLR code to be
very difficult to port to WindowsCE. Apparently WindowsCE supports
fewer C++ standard library headers, and my attempts to make the CE toolkit for Visual
Studio recognize the required header files proved futile after more than
a week of effort. So I made the decision to write a parser from
scratch.
The final parser behaves like a simplistic state
machine, skipping all tags except the A HREF tags in order to preserve
the link information. This parser gives nearly the same output and
behavior as the ANTLR-generated parser, albeit much less elegantly.
The entire source code for the final parser implementation can be seen
here. Below is the output of the final
parser from the same HTML file parsed above:
In the final WinGUI application, the parser function is called in the
DoMainCommandOpen() function, which corresponds to the "open" menu item.
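To illustrate the state-machine approach described above, here is a minimal sketch in C. It is not the project's actual source: the function name extract_text_and_links, the two state names, and the choice to write each link target in square brackets are all assumptions made for the example. The sketch copies text outside tags, skips everything inside tags, and keeps only the HREF value of anchor tags.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical state names; the original source may use different ones. */
enum parse_state { IN_TEXT, IN_TAG };

/* Case-insensitive test: does s begin with prefix? */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix) {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Append c to out if room remains (one byte is reserved for the terminator). */
static size_t put_char(char *out, size_t size, size_t pos, char c)
{
    if (pos + 1 < size)
        out[pos++] = c;
    return pos;
}

/*
 * Copy the text of html into out, dropping every tag. For anchor tags the
 * HREF value is kept and written as "[target]" (an assumed output format).
 * Output is truncated if it does not fit in out_size bytes.
 */
void extract_text_and_links(const char *html, char *out, size_t out_size)
{
    enum parse_state st = IN_TEXT;
    int is_anchor = 0;
    size_t pos = 0;
    const char *p;

    assert(html != NULL && out != NULL && out_size > 0);

    for (p = html; *p; p++) {
        if (st == IN_TEXT) {
            if (*p == '<') {
                st = IN_TAG;
                /* An anchor tag is "<a" or "<A" followed by whitespace. */
                is_anchor = tolower((unsigned char)p[1]) == 'a'
                            && isspace((unsigned char)p[2]);
            } else {
                pos = put_char(out, out_size, pos, *p);
            }
        } else if (*p == '>') {
            st = IN_TEXT;
        } else if (is_anchor && starts_with_ci(p, "href=")) {
            char quote = 0;
            p += strlen("href=");
            if (*p == '"' || *p == '\'')
                quote = *p++;
            pos = put_char(out, out_size, pos, '[');
            /* Copy up to the closing quote, or to whitespace if unquoted. */
            while (*p && *p != '>'
                   && (quote ? *p != quote : !isspace((unsigned char)*p))) {
                pos = put_char(out, out_size, pos, *p);
                p++;
            }
            pos = put_char(out, out_size, pos, ']');
            is_anchor = 0;
            if (*p == '\0')
                break;          /* input ended inside the tag */
            if (*p == '>')
                st = IN_TEXT;   /* unquoted value ran up to the tag end */
        }
    }
    out[pos] = '\0';
}
```

Given the input `Hi <A HREF="page.html">link</a> bye`, this sketch leaves `Hi [page.html]link bye` in the output buffer. It deliberately ignores comments, character entities, and scripts, and it would also match an attribute name that merely ends in "href", so it should be read as an outline of the technique rather than a complete HTML parser.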