ConveX - A lightweight Conversational XML parser

Version 0.02. Source: ConveX.zip

Purpose

ConveX is a lightweight XML parser written in C. It performs a reasonable amount of well-formedness checking but, in particular, does not parse the DTD section of a document. It is suited to conversational XML applications rather than mainstream database transactions. "Conversational" here means transactions that are not necessarily retained after processing, or that do not require a high level of integrity; in other words, they do not need strict XML well-formedness or validity checking. For a more sophisticated and conformant treatment of XML, use one of the widely available parsers. Expat is popular and is available as C source.

The current incarnation supports ASCII (not really UTF-8) and UTF-16 processing.

Architecture

The parser is a nested state machine built on a push-down stack. It has some properties of a hierarchical state machine (HSM), although much of the usual HSM semantics is missing. "Init", "Run", and "Done" are the messages a machine may process. The machine as a whole is better described as a collection of flat FSMs that can call other FSMs. That is, the system is a stack of states, where each state has one additional discriminating sub-state. That sub-state has none of the entry/exit semantics; at this level it is simply a C switch statement.
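As a rough illustration of the stack-of-flat-FSMs idea, here is a minimal sketch. Every name in it (Machine, Frame, machine_push, tag_fsm and so on) is hypothetical and is not taken from StateMachine.c or Scanner.c; only the three messages come from the description above. A parent FSM "calls" a child by pushing it, the child pops itself when it is finished, and each frame carries one small sub-state that is discriminated with a switch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch; none of these names come from the ConveX sources. */
typedef enum { MSG_INIT, MSG_RUN, MSG_DONE } Message;

typedef struct Machine Machine;
typedef void (StateFn)(Machine *m, Message msg, int ch);

enum { MAX_DEPTH = 16 };

typedef struct {
    StateFn *fn;   /* the flat FSM that owns this frame */
    int      sub;  /* its single discriminating sub-state */
} Frame;

struct Machine {
    Frame  stack[MAX_DEPTH];  /* push-down stack of FSMs */
    size_t depth;
    int    tags_seen;         /* result collected by this demo */
};

/* Push a child FSM and send it "Init". */
static void machine_push(Machine *m, StateFn *fn) {
    assert(m->depth < MAX_DEPTH);
    m->stack[m->depth].fn = fn;
    m->stack[m->depth].sub = 0;
    m->depth++;
    fn(m, MSG_INIT, 0);
}

/* Send "Done" to the top FSM, then return control to its caller. */
static void machine_pop(Machine *m) {
    assert(m->depth > 0);
    m->stack[m->depth - 1].fn(m, MSG_DONE, 0);
    m->depth--;
}

/* Deliver one input character to whichever FSM is on top. */
static void machine_run(Machine *m, int ch) {
    assert(m->depth > 0);
    m->stack[m->depth - 1].fn(m, MSG_RUN, ch);
}

enum { IN_NAME, IN_ATTRS };

/* Child FSM: consumes a tag up to '>' and then pops itself. */
static void tag_fsm(Machine *m, Message msg, int ch) {
    Frame *top = &m->stack[m->depth - 1];
    if (msg != MSG_RUN) return;       /* nothing to do on Init or Done here */
    switch (top->sub) {               /* the flat, switch-level state */
    case IN_NAME:
        if (ch == ' ') top->sub = IN_ATTRS;
        else if (ch == '>') { m->tags_seen++; machine_pop(m); }
        break;
    case IN_ATTRS:
        if (ch == '>') { m->tags_seen++; machine_pop(m); }
        break;
    }
}

/* Parent FSM: calls the tag FSM whenever it sees '<'. */
static void content_fsm(Machine *m, Message msg, int ch) {
    if (msg == MSG_RUN && ch == '<') machine_push(m, tag_fsm);
}

/* Drive the machine over a string and count completed tags. */
static int count_tags(const char *text) {
    Machine m = {0};
    machine_push(&m, content_fsm);
    for (; *text != '\0'; text++) machine_run(&m, (unsigned char)*text);
    while (m.depth > 0) machine_pop(&m);
    return m.tags_seen;
}
```

With this sketch, count_tags("<a><b x='1'>text</b>") counts three tags, and an unterminated "<open" counts none. It ignores quoting, comments and everything else a real scanner must handle; it only shows the calling convention between stacked FSMs.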

There are 5 C source modules:

LexBuffer.c

Fetches the next input character and collects the characters into a lexeme that can be extracted when the scanner is in an accepting state.
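The fetch-and-collect behaviour described above might look something like the following sketch. It is not the code in LexBuffer.c: the LexBuf type, the lexbuf_* functions, the fixed 64-byte lexeme buffer and the name-character rule are all assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch; names and sizes are not taken from LexBuffer.c. */
enum { LEXEME_MAX = 64 };

typedef struct {
    const char *input;              /* NUL-terminated source text */
    size_t      pos;                /* index of the next unread character */
    char        lexeme[LEXEME_MAX]; /* characters collected so far */
    size_t      len;
} LexBuf;

static void lexbuf_init(LexBuf *lb, const char *input) {
    lb->input = input;
    lb->pos = 0;
    lb->len = 0;
    lb->lexeme[0] = '\0';
}

/* Fetch the next character and append it to the lexeme; -1 at end of input. */
static int lexbuf_next(LexBuf *lb) {
    char c = lb->input[lb->pos];
    if (c == '\0') return -1;
    assert(lb->len + 1 < LEXEME_MAX);  /* fixed buffer is a simplification */
    lb->pos++;
    lb->lexeme[lb->len++] = c;
    lb->lexeme[lb->len] = '\0';
    return (unsigned char)c;
}

/* Give back the last character, e.g. the lookahead that ended a lexeme. */
static void lexbuf_unget(LexBuf *lb) {
    assert(lb->pos > 0 && lb->len > 0);
    lb->pos--;
    lb->len--;
    lb->lexeme[lb->len] = '\0';
}

/* Accepting state reached: copy the lexeme out and start a new one. */
static size_t lexbuf_accept(LexBuf *lb, char *out, size_t out_size) {
    size_t n = lb->len;
    assert(out_size > n);
    memcpy(out, lb->lexeme, n + 1);
    lb->len = 0;
    lb->lexeme[0] = '\0';
    return n;
}

/* Simplified name rule for the demo; real XML names allow more characters. */
static int is_name_char(int c) {
    return isalnum(c) || c == '-' || c == '_';
}

/* Collect one name, stopping before the first non-name character. */
static size_t scan_name(LexBuf *lb, char *out, size_t out_size) {
    int c;
    while ((c = lexbuf_next(lb)) != -1) {
        if (!is_name_char(c)) { lexbuf_unget(lb); break; }
    }
    return lexbuf_accept(lb, out, out_size);
}
```

Given the input "long-description>rest", scan_name extracts "long-description" and leaves '>' as the next character to be fetched.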

StateMachine.c

Manages a push down stack of state machines used by the scanner module.

Scanner.c

Uses the State Machine to scan and parse lexemes according to the XML production rules. There are no separate lexical analyser and parser stages; these are combined in the scanner.

XML.C

Manages the other modules, and exposes a user API to handle different XML sections.

DStr.c

Provides dynamic strings that can expand when required and, possibly in future, a managed pool for better memory usage.
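For readers unfamiliar with the idea, a string that expands on demand can be sketched as below. This is not the DStr implementation; the GrowStr type, its functions, the initial capacity of 8 and the doubling policy are all assumptions chosen for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of an expanding string; not the actual DStr type. */
typedef struct {
    char  *data;  /* always NUL-terminated */
    size_t len;   /* characters stored, excluding the terminator */
    size_t cap;   /* bytes allocated */
} GrowStr;

/* Returns 0 on success, -1 if allocation fails. */
static int growstr_init(GrowStr *s) {
    s->cap = 8;
    s->len = 0;
    s->data = malloc(s->cap);
    if (s->data == NULL) return -1;
    s->data[0] = '\0';
    return 0;
}

/* Append n bytes, doubling the capacity until the text fits. */
static int growstr_append(GrowStr *s, const char *text, size_t n) {
    assert(s->data != NULL);
    if (s->len + n + 1 > s->cap) {
        size_t new_cap = s->cap;
        char *p;
        while (new_cap < s->len + n + 1) new_cap *= 2;
        p = realloc(s->data, new_cap);
        if (p == NULL) return -1;  /* the old buffer is still valid */
        s->data = p;
        s->cap = new_cap;
    }
    memcpy(s->data + s->len, text, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 0;
}

static void growstr_free(GrowStr *s) {
    free(s->data);
    s->data = NULL;
    s->len = 0;
    s->cap = 0;
}
```

Appending "This is " and then "the start room." to an empty GrowStr grows the buffer from 8 to 32 bytes and leaves a 23-character string. Doubling keeps the number of reallocations logarithmic in the final length, which suits text that arrives in small chunks.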

Interface

The interface found in xml.h is shown below:

extern XMLError XML_Create( XML** );

extern XMLError XML_SetTextHandler( XML*, TextNotify* );

extern XMLError XML_SetStartTagHandler( XML*, StartTagNotify* );

extern XMLError XML_SetEndTagHandler( XML*, EndTagNotify* );

extern XMLError XML_SetCommentHandler( XML*, CommentNotify* );

extern XMLError XML_SetPITagHandler( XML*, PITagNotify* );

extern XMLError XML_SetParseErrorHandler( XML*, ParseErrorNotify* );

extern void XML_SetUserData( XML*, void * );

extern void* XML_GetUserData( XML* );

extern XMLError XML_Destroy( XML** );

extern XMLError XML_SetMaxTextChunk( XML *, size_t );

extern XMLError XML_ParseBlock( XML *, XMLChar*, size_t );

Handler signatures:

typedef void (TextNotify)( void *UserData, DStr * text );

typedef void (StartTagNotify)( void *UserData, DStr* name, AttribList* attribs );

typedef void (EndTagNotify)( void *UserData, DStr *name );

typedef void (CommentNotify)( void *UserData, DStr *comment );

typedef void (PITagNotify)( void *UserData, DStr *name, AttribList* attribs );

typedef void (ParseErrorNotify)( void *UserData, unsigned long line, unsigned long column, XMLError err );
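Note that the handler typedefs above name function types rather than pointer types, which is why the setters take, for example, a TextNotify*. The stand-alone sketch below imitates that pattern together with the user-data pointer. It does not use xml.h: DemoParser, DemoTagNotify, the demo_* functions and the plain const char* name argument are hypothetical stand-ins, since the accessors for DStr and AttribList are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; these are not part of the ConveX API. */

/* A function type (not a pointer type), in the style of the typedefs above. */
typedef void (DemoTagNotify)(void *UserData, const char *name);

typedef struct {
    DemoTagNotify *on_start;   /* registered handler, may be NULL */
    void          *user_data;  /* passed back to every handler call */
} DemoParser;

static void demo_set_start_handler(DemoParser *p, DemoTagNotify *fn) {
    p->on_start = fn;
}

static void demo_set_user_data(DemoParser *p, void *user_data) {
    p->user_data = user_data;
}

/* What a parser would do internally when it recognises a start tag. */
static void demo_emit_start(DemoParser *p, const char *name) {
    if (p->on_start != NULL) p->on_start(p->user_data, name);
}

/* Application state reached through the user-data pointer. */
typedef struct {
    int rooms;
    int others;
} TagCounts;

/* A function type typedef can also declare a handler before defining it. */
static DemoTagNotify count_tag;

static void count_tag(void *UserData, const char *name) {
    TagCounts *counts = UserData;
    assert(counts != NULL);
    if (strcmp(name, "room") == 0) counts->rooms++;
    else counts->others++;
}
```

Registering count_tag along with a TagCounts instance and then emitting "room", "exit", "room" leaves rooms at 2 and others at 1. The user-data pointer is what lets a handler update application state without global variables.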

Sample run

 

Input

<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE world 
[
<!ENTITY jason "jason">
]
>

<?PI
?>
<wo~rld>
<room
name = 
"start&gt;" 
>


&#xdude; &jason; The parser doesn't process the prolog, so named entities will not be recognised.



<![CDATA[
[[ CDATA TEXT HERE ]]
]]>


<attrib name1="1" name2="2" name3="" name4="4" name5="5" name6="6" name7="7" name8="8" name9="9" name10="10" />

<!--- -This is the starting room -->


<!-- --This is an illegal comment since, there is a double dash in the comment stream -->

<exit name="north" destination="newbie" />
<long-description>This is the long description for the start room.</long-description>
<script>

<7his/> <!-- is a bad tag, which the parser will pick up -->

<once-only language="mud command" fish='jason "was" here'>
tell [name] hello :-&gt;
</once-only>
This is script data to go with the script command
</script>
</room>
<room name="newbie" description="This is the newbie area. Welcome">
<!--newbie area-->
</room>
<test/>
<!-- 
&jason; Since the parser doesn't process the DTD, it can't process internal named entities
-->
&#x40; hex @ symbol
&#64; decimal @ symbol
&amp;
&quot;
&lt;
&gt;
&apos;
</wo~rld>

Output

Opening ../../../world.xml
#xml version="1.0" standalone="yes"#


#PI#
(wo) {
[PARSE ERROR near Line 10, column 4, `Unexpected spurious characters`]

(room name="start>") {



[PARSE ERROR near Line 17, column 8, `Expected integer was malformed`]

[PARSE ERROR near Line 17, column 15, `Undefined entity id`]
The parser doesn't process the prolog, so named entities will not be recognised.




[[ CDATA TEXT HERE ]]



(attrib name1="1" name2="2" name3="" name4="4" name5="5" name6="6" name7="7" name8="8" name9="9" name10="10") {} (/attrib)

/* - -This is the starting room */



[PARSE ERROR near Line 31, column 13, `The sequence "--" is not allowed in the middle of a comment`]

[PARSE ERROR near Line 31, column 17, `Unexpected spurious characters`]


(exit name="north" destination="newbie") {} (/exit)
(long-description) {This is the long description for the start room.} (/long-description)
(script) {


[PARSE ERROR near Line 37, column 6, `Expected Identifier was malformed`]

[PARSE ERROR near Line 37, column 6, `Unexpected spurious characters`]
/* is a bad tag, which the parser will pick up */

(once-only language="mud command" fish="jason "was" here") {
tell [name] hello :->
} (/once-only)
This is script data to go with the script command
} (/script)
} (/room)
(room name="newbie" description="This is the newbie area. Welcome") {
/* newbie area */
} (/room)
(test) {} (/test)
/* 
&jason; Since the parser doesn't process the DTD, it can't process internal named entities
*/
@ hex @ symbol
@ decimal @ symbol
&
"
<
>
'
} (/wo)
[PARSE ERROR near Line 59, column 5, `Unexpected spurious characters`]
Finished with exit code = 0