Session Four: XML Handler (Simple tags, Globals, Multiple Targets, Style Files) (Guy)

XML Files

All HTML / XML files are run through the lonxml handler before being served to a user. This allows us to rewrite many portions of a document and to support server-side tags. There are 2 ways to add new tags to the XML parsing engine: through LON-CAPA style files, or by writing Perl tag handlers for the desired tags.

Global Variables

*          $Apache::lonxml::debug - debugging control

*          @Apache::lonxml::pwd - path to the directory containing the file currently being processed

*          @Apache::lonxml::outputstack, $Apache::lonxml::redirection - these two are used for capturing a subset of the output for later processing. Don't touch them directly; use &startredirection and &endredirection.

*          $Apache::lonxml::import - controls whether the <import> tag actually does anything

*          @Apache::lonxml::extlinks - a list of URLs that the user is allowed to look at because of the current resource (images, and links)

*          $Apache::lonxml::metamode - when set, some output is turned off because the meta target wants only a specific subset; use <output> to guarantee that the contained data will be in the parsing output

*          $Apache::lonxml::evaluate - controls whether run::evaluate actually dereferences variable references

*          %Apache::lonxml::insertlist - data structure for edit mode, determines what tags can go into what other tags

*          @Apache::lonxml::namespace - stores the list of tag namespaces used in the insertlist.tab file that are currently active, used only in edit mode.

*          $Apache::lonxml::registered - set to 1 once the remote has been updated to know what resource we are looking at.

*          $Apache::lonxml::request - current Apache request object, or undef

*          $Apache::lonxml::curdepth - current position in the overall parse. Will be a string like: 2_3_1 (first tag in the third second-level tag in the second top-level tag). It gets set by callsub, and can be used in Perl tag implementations. It relies upon the internal globals: @Apache::lonxml::depthcounter, $Apache::lonxml::depth, $Apache::lonxml::olddepth

*          $Apache::lonxml::prevent_entity_encode - By default the XML parser will try to re-encode any 8-bit characters into HTML entity codes. If this is set to a true value, that re-encoding is prevented.
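The curdepth string described in the list above can be taken apart with a plain split. This is only a sketch: the local variable $curdepth is a stand-in for the real global, which is not available outside a running parse.

```perl
use strict;
use warnings;

# Stand-in for the value callsub would have set; a real tag handler would
# read $Apache::lonxml::curdepth instead of this local variable.
my $curdepth = '2_3_1';

# Each underscore-separated field is a 1-based position at one nesting
# level, outermost level first.
my @positions = split /_/, $curdepth;
my $nesting   = scalar @positions;

print "nesting level: $nesting\n";
for my $level (0 .. $#positions) {
    printf "level %d: tag number %d\n", $level + 1, $positions[$level];
}
```

For 2_3_1 this reports three nesting levels, with positions 2, 3 and 1 from the outside in.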

In common usage, $Apache::lonxml::prevent_entity_encode, $Apache::lonxml::evaluate, $Apache::lonxml::metamode, $Apache::lonxml::import, should never be set to a value directly, but rather incremented when you want the effect on, and decremented when you want the effect off.
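A minimal sketch of why counting works better than assignment for these globals. The variable $flag and the helper with_effect_on are hypothetical stand-ins (they are not part of lonxml); inside LON-CAPA the counter would be one of the four globals named above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one of the counter-style globals,
# e.g. $Apache::lonxml::metamode.
our $flag = 0;

# Hypothetical helper: switch the effect on for the duration of $code,
# then undo exactly one increment, even if $code dies.
sub with_effect_on {
    my ($code) = @_;
    $flag++;
    my $ok  = eval { $code->(); 1 };
    my $err = $@;
    $flag--;
    die $err unless $ok;
    return;
}

my @seen;
with_effect_on(sub {
    push @seen, $flag;                           # 1: outer request
    with_effect_on(sub { push @seen, $flag });   # 2: nested request
    push @seen, $flag;                           # 1: still on for the outer caller
});
push @seen, $flag;                               # 0: fully restored
print "@seen\n";
```

Had the nested code assigned 0 on its way out instead of decrementing, the effect would have been switched off while the outer caller still needed it.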

Notable Perl subroutines

If not specified these functions are in Apache::lonxml

*          xmlparse - see the XMLPARSE figure. It is not callable from inside a tag; if one needs to restart parsing, either add a new LCParser to the parser stack using the newparser function, or call inner_xmlparse (see the xmlparse function in scripttag.pm)

*          recurse - acts just like xmlparse, except it doesn't do the style definition check; it always calls callsub

*          callsub - checks whether a Perl subroutine is defined for the current tag and calls it. Otherwise it just returns the tag as it was read in. It will also add a default editing interface unless the tag has a defined subroutine that either returns something or requests that callsub not add the editing interface.

*          afterburn - called on the output of xmlparse, it can add highlights, anchors, and links to regular expression matches in the output.

*          register_insert - builds the %Apache::lonxml::insertlist structure of what tags can have what other tags inside.

*          whichuser - returns a list of $symb, $courseid, $domain, $name that is correct for calls to lonnet functions for this setup. Uses form.grade_ parameters, if the user is allowed to mgr in the course

*          setup_globals - initializes all lonxml globals when xmlparse is called. If you intend to create a new target you will likely need to tweak how the globals are set up at start-up.

*          init_safespace - creates holes to external functions, creates some global variables, and sets the permitted operators of the global Safe space interpreter.

Functions Tag Handlers can use

If not specified these functions are in Apache::lonxml

*          debug - a function to call to print out debugging messages. Will only print when $Apache::lonxml::debug is set to 1

*          warning - a function to use for warning messages. The message will appear at the top of a resource when it is viewed in construction space only.

*          error - a function to use for error messages. The message will appear at the top of a resource when it is viewed in construction space; otherwise it will message the resource author and course instructor, while informing the student that an error has occurred.

*          get_all_text - 2 args, the tag to look for (use /tag to look for an end tag) and an HTML::TokeParser reference. It will repeatedly get text from the TokeParser until the requested tag is found, and will return all of the document it pulled from the TokeParser. (See Apache::scripttag::start_script for an example of usage.)

*          get_param - 4 arguments, first is a scalar string naming the argument needed, second is a reference to the parser arguments stack, third is a reference to the Safe space, and fourth is an optional "context" value. This subroutine allows a tag to get a tag argument after it has been interpolated inside the Safe space. This should be used if the tag might use a Safe space variable reference for the tag argument. (See Apache::scripttag::start_script for an example.) This version only handles scalar variables.

*          get_param_var - 4 arguments, first is a scalar string naming the argument needed, second is a reference to the parser arguments stack, third is a reference to the Safe space, and fourth is an optional "context" value. This subroutine allows a tag to get a tag argument after it has been interpolated inside the Safe space. This should be used if the tag might use a Safe space variable reference for the tag argument. (See Apache::scripttag::start_script for an example.) This version can handle list or hash variables properly.

*          description - 1 argument, the token object. This will return the textual description of the current tag from the insertlist.tab file.

*          whichuser - 0 arguments. This will take a look at the current environment setting and return the current $symb, $courseid, $udom, $uname. You should always use this function if you want to determine who the current user is. (Since an instructor might be trying to view a student's version of a resource.)

*          inner_xmlparse - 6 arguments: the target, an array pointer to the current stack of tags, an array pointer to the current stack of tag arguments, an array pointer to the current stack of LCParsers, a pointer to the current Safe space, and a pointer to the hash of current style definitions

*          newparser - 3 args, first is a reference to the parser stack, second should be a reference to a string scalar containing the text the new parser should run over, third should be a scalar holding the directory path of the file the parser is parsing. (See Apache::scripttag::start_import for an example.)

*          register - should be called in a file's BEGIN block. 2 arguments, a scalar string, and a list of strings. This allows a file to register what tags it handles, and what the namespace of those tags is. Example:

sub BEGIN {
  &Apache::lonxml::register('Apache::scripttag',('script','display'));
}

This would tell xmlparse that in Apache::scripttag it can find handlers for <script> and <display>. If one registers a tag that was already registered, the previous registration is remembered and will be restored on a deregister.

*          deregister - used to remove a previously registered tag implementation. It will restore the previous registration if there was one.

*          startredirection - used when a tag wants to save a portion of the document for its end tag to use, but wants the intervening document to be normally processed. (See Apache::scripttag::start_window for an example.)

*          endredirection - used to stop xmlparse from hiding output. The return value is everything that xmlparse has processed since the corresponding startredirection. (See Apache::scripttag::end_window for an example.)

*          Apache::run::evaluate - 3 args: first a string, second a reference to the Safe space, third a string to be evaluated before the first arg. This subroutine will do variable interpolation and simple function interpolations on the first argument. (See Apache::lonxml::inner_xmlparse for an example.)

*          Apache::run::run - 2 args, first a string, second a reference to the Safe space. This handles passing the passed string into the Safe space for evaluation and then returns the result. (See Apache::scripttag::start_script for an example.)
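The "remember the previous registration and restore it on deregister" behaviour described for register and deregister above can be pictured as one stack per tag. The following is a toy model only, not the lonxml implementation: the toy_* functions, the %registry hash and the My::override package are all hypothetical names invented for illustration.

```perl
use strict;
use warnings;

# Toy registry: each tag maps to a stack of package names, newest last.
my %registry;

sub toy_register {
    my ($package, @tags) = @_;
    push @{ $registry{$_} }, $package for @tags;
    return;
}

sub toy_deregister {
    my ($package, @tags) = @_;
    for my $tag (@tags) {
        my $stack = $registry{$tag} or next;
        # Drop the most recent registration made by this package.
        for my $i (reverse 0 .. $#$stack) {
            next unless $stack->[$i] eq $package;
            splice @$stack, $i, 1;
            last;
        }
        delete $registry{$tag} unless @$stack;
    }
    return;
}

# The newest registration for a tag wins; undef if nobody handles it.
sub toy_handler_for {
    my ($tag) = @_;
    my $stack = $registry{$tag} or return;
    return $stack->[-1];
}

toy_register('Apache::scripttag', 'script', 'display');
toy_register('My::override', 'display');    # hypothetical second handler
my $during = toy_handler_for('display');
toy_deregister('My::override', 'display');
my $after = toy_handler_for('display');
print "during: $during, after: $after\n";
```

While the second package is registered it handles <display>; once it deregisters, the earlier Apache::scripttag registration is in effect again.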

Style Files

Fig. 2.4.1 - Using a style file

Style File specific tags

<definetag> - 2 arguments: name, the name of the new tag being defined (required; if preceded by a /, it defines an end tag); parms, the parameters of the new tag, whose values can be accessed as $parametername.

*          <render> - define what the new tag does for a non meta target

*          <meta> - define what the new tag does for a meta target

*          <tex> / <web> / <latexsource> - define what a new tag does for a specific non meta target. All data inside a <render> is rendered to all targets, except when it is surrounded by one of these target-specific tags.

Fig. 2.4.2 - The parser

HTML::LCParser - Alternative HTML::Parser interface

SYNOPSIS

 require HTML::LCParser;
 $p = HTML::LCParser->new("index.html") || die "Can't open: $!";
 while (my $token = $p->get_token) {
     #...
 }

DESCRIPTION

The C<HTML::LCParser> is an alternative interface to the C<HTML::Parser> class.  It is an C<HTML::PullParser> subclass.

The following methods are available:

* $p = HTML::LCParser->new( $file_or_doc );

The object constructor argument is either a file name, a file handle object, or the complete document to be parsed.

If the argument is a plain scalar, then it is taken as the name of a file to be opened and parsed.  If the file can't be opened for reading, then the constructor will return an undefined value and $! will tell you why it failed.

If the argument is a reference to a plain scalar, then this scalar is taken to be the literal document to parse.  The value of this scalar should not be changed before all tokens have been extracted.

Otherwise the argument is taken to be some object that the C<HTML::LCParser> can read() from when it needs more data.  Typically it will be a filehandle of some kind.  The stream will be read() until EOF, but not closed.

It also will turn attr_encoded on by default.

* $p->get_token

This method will return the next I<token> found in the HTML document, or C<undef> at the end of the document.  The token is returned as an array reference.  The first element of the array will be a (mostly) single character string denoting the type of this token: "S" for start tag, "E" for end tag, "T" for text, "C" for comment, "D" for declaration, and "PI" for process instructions.  The rest of the array is the same as the arguments passed to the corresponding HTML::Parser v2 compatible callbacks (see L<HTML::Parser>).  In summary, returned tokens look like this:

  ["S",  $tag, $attr, $attrseq, $text, $line]
  ["E",  $tag, $text, $line]
  ["T",  $text, $is_data, $line]
  ["C",  $text, $line]
  ["D",  $text, $line]
  ["PI", $token0, $text, $line]

where $attr is a hash reference, $attrseq is an array reference and the rest are plain scalars.
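A sketch of dispatching on the token type. To stay runnable without LON-CAPA installed, it walks a hand-built list of tokens in the shape listed above rather than calling $p->get_token; the sample tokens and the @summary array are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hand-built tokens in the documented shape, standing in for what
# $p->get_token might return for: <p class="x">Hi</p><!-- note -->
my @tokens = (
    [ 'S', 'p', { class => 'x' }, ['class'], '<p class="x">', 1 ],
    [ 'T', 'Hi', 0, 1 ],
    [ 'E', 'p', '</p>', 1 ],
    [ 'C', '<!-- note -->', 1 ],
);

my @summary;
for my $token (@tokens) {
    my ($type, @rest) = @$token;
    if ($type eq 'S') {
        my ($tag, $attr, $attrseq, $text, $line) = @rest;
        # $attrseq preserves the order the attributes appeared in.
        my $attrs = join ' ', map { "$_=$attr->{$_}" } @$attrseq;
        push @summary, "start $tag [$attrs] line $line";
    }
    elsif ($type eq 'E') {
        my ($tag, $text, $line) = @rest;
        push @summary, "end $tag line $line";
    }
    elsif ($type eq 'T') {
        my ($text, $is_data, $line) = @rest;
        push @summary, "text '$text' line $line";
    }
    else {
        # C, D and PI tokens: the line number is the last element.
        push @summary, "$type line $rest[-1]";
    }
}
print "$_\n" for @summary;
```

With a real parser, the for loop would become while (my $token = $p->get_token) { ... } and the body would stay the same.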

* $p->unget_token($token,...)

If you find out you have read too many tokens you can push them back, so that they are returned the next time $p->get_token is called.

* $p->get_tag( [$tag, ...] )

This method returns the next start or end tag (skipping any other tokens), or C<undef> if there are no more tags in the document.  If one or more arguments are given, then we skip tokens until one of the specified tag types is found.  For example:

   $p->get_tag("font", "/font");

will find the next start or end tag for a font-element.

The tag information is returned as an array reference in the same form as for $p->get_token above, but the type code (first element) is missing. A start tag will be returned like this:

  [$tag, $attr, $attrseq, $text]

The tag name of an end tag is prefixed with "/", i.e. an end tag is returned like this:

  ["/$tag", $text]

* $p->get_text( [$endtag] )

This method returns all text found at the current position. It will return a zero length string if the next token is not text.  The optional $endtag argument specifies that any text occurring before the given tag is to be returned. All entities are unmodified.

The $p->{textify} attribute is a hash that defines how certain tags can be treated as text.  If the name of a start tag matches a key in this hash then this tag is converted to text.  The hash value is used to specify which tag attribute to obtain the text from.  If this tag attribute is missing, then the upper case name of the tag enclosed in brackets is returned, e.g. "[IMG]".  The hash value can also be a subroutine reference.  In this case the routine is called with the start tag token content as its argument and the return value is treated as the text.

The default $p->{textify} value is:

  {img => "alt", applet => "alt"}

This means that <IMG> and <APPLET> tags are treated as text, and that the text to substitute can be found in the ALT attribute.
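The textify lookup rule can be sketched as a small standalone function. This is an illustration of the rule as described, not the module's own code: textify_start_tag is a hypothetical helper, the "a" entry is an invented example of a subroutine value, and the assumption that the "start tag token content" is the list ($tag, $attr, $attrseq, $text) is taken from the get_tag description.

```perl
use strict;
use warnings;

# Default table from the text, plus one hypothetical subroutine entry.
my %textify = (
    img    => 'alt',
    applet => 'alt',
    a      => sub { my ($tag, $attr) = @_; return "link to $attr->{href}" },
);

# Hypothetical helper: @content is assumed to be ($tag, $attr, $attrseq, $text).
sub textify_start_tag {
    my ($textify, @content) = @_;
    my ($tag, $attr) = @content;
    return unless exists $textify->{$tag};          # not treated as text
    my $how = $textify->{$tag};
    return $how->(@content) if ref $how eq 'CODE';  # subroutine decides
    # Named attribute if present, otherwise the bracketed upper-case tag name.
    return defined $attr->{$how} ? $attr->{$how} : '[' . uc($tag) . ']';
}

my $with_alt    = textify_start_tag(\%textify, 'img', { alt => 'A logo' }, ['alt'], '<img alt="A logo">');
my $without_alt = textify_start_tag(\%textify, 'img', { src => 'x.png' }, ['src'], '<img src="x.png">');
my $from_code   = textify_start_tag(\%textify, 'a', { href => 'a.html' }, ['href'], '<a href="a.html">');
my $not_listed  = textify_start_tag(\%textify, 'b', {}, [], '<b>');

print "$with_alt | $without_alt | $from_code | ",
      (defined $not_listed ? $not_listed : 'undef'), "\n";
```

In real use one would simply assign to $p->{textify} before calling get_text; the helper only makes the three cases (attribute present, attribute missing, subroutine value) visible.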

* $p->get_trimmed_text( [$endtag] )

Same as $p->get_text above, but will collapse any sequences of white space to a single space character.  Leading and trailing white space is removed.
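The white space handling amounts to three substitutions on the returned text. The helper trim_text below is hypothetical and operates on a plain string; it mirrors the documented behaviour rather than quoting the module.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented behaviour on a plain string.
sub trim_text {
    my ($text) = @_;
    $text =~ s/^\s+//;     # drop leading white space
    $text =~ s/\s+$//;     # drop trailing white space
    $text =~ s/\s+/ /g;    # collapse inner runs to a single space
    return $text;
}

my $trimmed = trim_text("  LON-CAPA\n   style \t files  ");
print "[$trimmed]\n";
```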

EXAMPLES

This example extracts all links from a document.  It will print one line for each link, containing the URL and the textual description between the <A>...</A> tags:

  use HTML::LCParser;
  $p = HTML::LCParser->new(shift||"index.html");
  while (my $token = $p->get_tag("a")) {
      my $url = $token->[1]{href} || "-";
      my $text = $p->get_trimmed_text("/a");
      print "$url\t$text\n";
  }

This example extracts the <TITLE> from the document:

  use HTML::LCParser;
  $p = HTML::LCParser->new(shift||"index.html");
  if ($p->get_tag("title")) {
      my $title = $p->get_trimmed_text;
      print "Title: $title\n";
  }