NAME URL::Transform - perform URL transformations in various document types SYNOPSIS my $output; my $urlt = URL::Transform->new( 'document_type' => 'text/html;charset=utf-8', 'content_encoding' => 'gzip', 'output_function' => sub { $output .= "@_" }, 'transform_function' => sub { return (join '|', @_) }, ); $urlt->parse_file($Bin.'/data/URL-Transform-01.html'); print "and this is the output: ", $output; DESCRIPTION URL::Transform is a generic module to perform an url transformation in a documents. Accepts callback function using which the url link can be changed. There are different modules to handle different document types, elements or attributes: `text/html', `text/vnd.wap.wml', `application/xhtml+xml', `application/vnd.wap.xhtml+xml' URL::Transform::using::HTML::Parser, URL::Transform::using::XML::SAX (incomplete was used only to benchmark) `text/css' URL::Transform::using::CSS::RegExp `text/html/meta-content' URL::Transform::using::HTML::Meta `application/x-javascript' URL::Transform::using::Remove By passing `parser' option to the `URL::Transform->new()' constructor you can set what library will be used to parse and execute the output and transform functions. Note that the elements inside for example `text/html' that are of a different type will be transformed via default_for($document_type) modules. `transform_function' is called with following arguments: transform_function->( 'tag_name' => 'img', 'attribute_name' => 'src', 'url' => 'http://search.cpan.org/s/img/cpan_banner.png', ); and must return (un)modified url as the return value. `output_function' is called with (already modified) document chunk for outputting. PROPERTIES content_encoding document_type parser transform_function output_function parser For HTML/XML can be HTML::Parser, XML::SAX document_type text/html - default transform_function Function that will be called to make the transformation. The function will receive one argument - url text. output_function Reference to function that will receive resulting output. The default one is to use print. content_encoding Can be set to `gzip' or `deflate'. By default it is `undef', so there is no content encoding. METHODS new Object constructor. Requires `transform_function' a CODE ref argument. The rest of the arguments are optional. Here is the list with defaults: document_type => 'text/html;charset=utf-8', output_function => sub { print @_ }, parser => 'HTML::Parser', content_encoding => undef, default_for($document_type) Returns default parser for a supplied $document_type. Can be used also as a set function with additional argument - parser name. If called as object method set the default parser for the object. If called as module function set the default parser for a whole module. parse_string($string) Submit document as a string for parsing. This some function must be implemented by helper parsing classes. parse_chunk($chunk) Submit chunk of a document for parsing. This some function should be implemented by helper parsing classes. can_parse_chunks Return true/false if the parser can parse in chunks. parse_file($file_name) Submit file for parsing. This some function should be implemented by helper parsing classes. link_tags # To simplify things, reformat the %HTML::Tagset::linkElements # hash so that it is always a hash of hashes. # Construct a hash of tag names that may have links. js_attributes # Construct a hash of all possible JavaScript attribute names decode_string($string) Will return decoded string suitable for parsing. Decoding is chosen according to the $self->content_encoding. Decoding is run automatically for every chunk/string/file. encode_string($string) Will return encoded string. Encoding is chosen according to the $self->content_encoding. NOTE if you want to have your content encoded back to the $self->content_encoding you will have to run this method in your code. Argument to the `output_function()' are always plain text. get_supported_content_encodings() Returns hash reference of supported content encodings. benchmarks Benchmark: timing 10000 iterations of HTML::Parser , XML::LibXML::SAX, XML::SAX::PurePerl... HTML::Parser : 3 wallclock secs ( 2.41 usr + 0.04 sys = 2.45 CPU) @ 4081.63/s (n=10000) XML::LibXML::SAX : 29 wallclock secs (27.22 usr + 0.11 sys = 27.33 CPU) @ 365.90/s (n=10000) XML::SAX::PurePerl: 192 wallclock secs (180.62 usr + 0.50 sys = 181.12 CPU) @ 55.21/s (n=10000) TODO There are urls in `pics' meta tag: `<meta http-equiv="pics-label" content=" ...'. See http://www.w3.org/PICS/. SEE ALSO HTML::Parser, URL::Transform::using::HTML::Parser AUTHOR Jozef Kutej `<jkutej at cpan.org>' LICENSE AND COPYRIGHT This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License. See http://dev.perl.org/licenses/ for more information.