The HTML Parser

    The HTML parser was written in C using Microsoft Visual Studio.  The final form of the parser is much less elegant, yet probably more efficient, than the version I originally intended.  I had planned to generate a clean parser using the ANTLR tool.  ANTLR is a parser generator written in Java that can currently produce Java, C++, or Sather code.  For more information on ANTLR, visit www.ANTLR.org.  The plan was to use ANTLR to generate an HTML parser in C++, which could supposedly be compiled for WindowsCE.
    The first step in generating a parser with ANTLR is to write a grammar file, which ANTLR uses to produce the appropriate code.  The grammar file contains lexer rule definitions for token recognition and parser rule definitions for syntactic construct recognition.  From these rules, ANTLR creates an LL(k) parser, where k is the depth of lookahead: characters for the lexer when recognizing tokens, and tokens for the parser.  I therefore set about creating a grammar file to describe HTML constructs.  However, due to my limited knowledge of HTML, I found this grammar specification to be much more challenging than I had anticipated.
    Luckily, I found a nearly complete HTML grammar specification for ANTLR co-authored by one of the authors of ANTLR itself, so I began to modify this specification to fit the requirements of the project, which called for extracting only text and link information from the HTML code.  I quickly discovered that the grammar in the file followed the HTML definition extremely strictly, much more so than Netscape Navigator or Microsoft Internet Explorer, and the resulting parser therefore often exited on an unexpected token before it had parsed a file fully.  My modifications to the grammar consisted of eliminating most of the rules for specific HTML tags in favor of a single generic tag rule.  This made the ANTLR-produced parser code much more lenient in its parsing of HTML code.  The final modified grammar, which I thought produced a reasonable parser, can be seen here.  The current contents of this document can be seen below before and after being parsed by the ANTLR-produced parser:
Before Parsing

 
 

After Parsing

    The problem with the code generated by ANTLR arose when I attempted to port the parser to WindowsCE.  Although ANTLR refrains from using any Windows API calls, I found the ANTLR code to be very difficult to port to WindowsCE.  Apparently WindowsCE supports fewer C++ standard library headers, and my attempts to make the CE toolkit for Visual Studio recognize the required header files proved futile.  After more than a week of wasted time, I decided to write a parser from scratch.
    The final parser behaves like a simple state machine, skipping all tags except the A HREF tags in order to preserve the link information.  This parser gives nearly the same output and behavior as the ANTLR-generated parser, albeit much less elegantly.  The entire source code for the final parser implementation can be seen here.  Below is the output of the final parser from the same HTML file parsed above:

    In the final WinGUI application, the parser function is called in the DoMainCommandOpen() function, which corresponds to the "open" menu item.
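
    The state-machine behavior described above can be illustrated with a short sketch in C.  This is not the project's actual source; the function name extract_text_and_links(), the state names, and the choice to emit each link target in square brackets are all assumptions made for illustration.  The scanner copies text outside of tags, skips ordinary tags, and keeps only the HREF value it finds inside an A tag.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scanner states: plain text, an ordinary tag, or an <A ...> tag. */
enum state { IN_TEXT, IN_TAG, IN_ANCHOR };

/* Case-insensitive test: does s begin with prefix? */
static int starts_with_ci(const char *s, const char *prefix)
{
    while (*prefix != '\0') {
        if (tolower((unsigned char)*s) != tolower((unsigned char)*prefix))
            return 0;
        s++;
        prefix++;
    }
    return 1;
}

/* Append c to out if there is still room for it plus the terminator. */
static size_t put(char *out, size_t cap, size_t len, char c)
{
    if (len + 1 < cap)
        out[len++] = c;
    return len;
}

/*
 * Copy the text of html into out, dropping every tag.  For A tags, the
 * HREF value is kept and written as "[url]" (an illustrative format).
 * Returns the length of the NUL-terminated result.
 */
size_t extract_text_and_links(const char *html, char *out, size_t cap)
{
    enum state st = IN_TEXT;
    size_t len = 0;
    const char *p;

    assert(html != NULL && out != NULL && cap > 0);

    for (p = html; *p != '\0'; p++) {
        if (st == IN_TEXT) {
            if (*p != '<')
                len = put(out, cap, len, *p);       /* ordinary text */
            else if (starts_with_ci(p + 1, "a") && isspace((unsigned char)p[2]))
                st = IN_ANCHOR;                     /* "<a " or "<A " */
            else
                st = IN_TAG;                        /* any other tag */
        } else if (*p == '>') {
            st = IN_TEXT;                           /* end of tag */
        } else if (st == IN_ANCHOR && starts_with_ci(p, "href=")) {
            char quote;

            p += 5;                                 /* skip "href=" */
            quote = (*p == '"' || *p == '\'') ? *p++ : '\0';
            len = put(out, cap, len, '[');
            while (*p != '\0' && *p != '>' &&
                   (quote ? *p != quote : !isspace((unsigned char)*p)))
                len = put(out, cap, len, *p++);
            len = put(out, cap, len, ']');
            if (*p == '\0' || *p == '>')
                p--;                /* let the loop see '>' or the end */
        }
    }
    out[len] = '\0';
    return len;
}
```

    Called on a buffer holding the contents of an HTML file, the function leaves the text and bracketed link targets in out.  Like the parser described here, the sketch is deliberately lenient: it does not validate tag names or nesting, and it does not handle comments, scripts, or a '>' inside a quoted attribute value.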