
PHP Cross Reference of phpwcms V1.4.3 _r380 (23.11.09)


/include/inc_ext/htmlfilter/ -> htmlfilter.php (summary)

htmlfilter.inc
---------------
This set of functions allows you to filter html in order to remove any malicious tags from it. Useful in cases when you need to filter user input for any cross-site-scripting attempts.

Copyright (C) 2002-2004 by Duke University

Author: Konstantin Riabitsev <icon@linux.duke.edu>
Version: 1.1 ($Date: 2005/06/30 18:06:08 $)
File Size: 1021 lines (38 kb)
Included or required: 1 time
Referenced: 0 times
Includes or requires: 0 files

Defines 12 functions

  spew()
  tagprint()
  casenormalize()
  skipspace()
  findnxstr()
  findnxreg()
  getnxtag()
  deent()
  defang()
  unspace()
  fixatts()
  sanitize()

Functions
Functions that are not part of a class:

spew($message)   X-Ref
This is a debugging function used throughout the code. To enable
debugging you have to specify a global variable called "debug" before
calling sanitize() and set it to true.

Note: debugging calls add a small overhead even when $debug is set
to false. If you wish to get rid of all debugging calls, run the
following command:

fgrep -v 'spew("' htmlfilter.inc > htmlfilter.inc.new

htmlfilter.inc.new will contain no debugging calls.

param: $message  A string with the message to output.
return: void.
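
A minimal sketch of the pattern described above, assuming the debug helper only checks a global $debug flag and prints the message. The name spew_sketch() and the plain echo output are hypothetical and not taken from htmlfilter.php.

```php
<?php
// Hypothetical stand-in for spew(): prints only when the global $debug is true.
function spew_sketch($message)
{
    global $debug;
    if ($debug === true) {
        echo $message;
    }
}

$debug = false;
spew_sketch("not shown\n");   // silent: debugging is off

$debug = true;                // the flag has to be set before the filter runs
spew_sketch("shown\n");       // prints the message
```

Even with the flag off, every call still costs a function call and a comparison, which is the overhead the note refers to.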

tagprint($tagname, $attary, $tagtype)   X-Ref
This function builds the final tag string from the tag name, an array
of attributes, and the type of the tag. It is called internally by
sanitize().

param: $tagname  the name of the tag.
param: $attary   the array of attributes and their values
param: $tagtype  The type of the tag (see in comments).
return: a string with the final tag representation.
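
As a rough illustration of what assembling a tag from these three inputs can look like, here is a sketch. The function name tagprint_sketch(), the numeric tag types (1 = opening, 2 = closing, 3 = self-closing) and the attribute array shape (name => already quoted value) are all assumptions made for this example. The real tag types are documented only in the source comments.

```php
<?php
// Hypothetical tag builder. Tag types are assumed: 1 opening, 2 closing, 3 self-closing.
function tagprint_sketch($tagname, array $attary, $tagtype)
{
    if ($tagtype === 2) {
        return '</' . $tagname . '>';      // closing tags carry no attributes
    }
    $out = '<' . $tagname;
    foreach ($attary as $name => $value) {
        $out .= ' ' . $name . '=' . $value; // values are assumed to be quoted already
    }
    if ($tagtype === 3) {
        $out .= ' /';
    }
    return $out . '>';
}

echo tagprint_sketch('a', array('href' => '"http://example.org/"'), 1), "\n";
echo tagprint_sketch('br', array(), 3), "\n";
```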

casenormalize(&$val)   X-Ref
A small helper function to use with array_walk. Modifies a by-ref
value and makes it lowercase.

param: $val a value passed by-ref.
return: void since it modifies a by-ref value.
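
The array_walk pattern mentioned above can be sketched as follows. casenormalize_sketch() is a hypothetical name, and the sample tag names are made up for the example.

```php
<?php
// Hypothetical helper: lowercases a value in place so array_walk can apply it.
function casenormalize_sketch(&$val)
{
    $val = strtolower($val);
}

$tags = array('IMG', 'Br', 'script');
array_walk($tags, 'casenormalize_sketch'); // each element is passed by reference
print_r($tags);
```

Because the callback takes its argument by reference, array_walk changes the array in place and nothing needs to be returned.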

skipspace($body, $offset)   X-Ref
This function skips any whitespace from the current position within
a string up to the next non-whitespace character.

param: $body   the string
param: $offset the offset within the string where we should start
return: the location within $body of the next non-whitespace character.
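
One way to express this kind of whitespace skipping is with strspn(), shown in the sketch below. skipspace_sketch() is a hypothetical name, and treating only space, tab, CR and LF as whitespace is an assumption of this example.

```php
<?php
// Hypothetical whitespace skipper: returns the offset of the next
// non-whitespace character at or after $offset.
function skipspace_sketch($body, $offset)
{
    // Clamp the offset so strspn() never starts outside the string.
    $offset = max(0, min($offset, strlen($body)));
    return $offset + strspn($body, " \t\r\n", $offset);
}

$body = "<a   href";
echo skipspace_sketch($body, 2), "\n"; // offset of the "h" in "href"
```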

findnxstr($body, $offset, $needle)   X-Ref
This function looks for the next character within a string.  It's
really just a glorified "strpos", except it catches the failures
nicely.

param: $body   The string to look for needle in.
param: $offset Start looking from this position.
param: $needle The character/string to look for.
return: location of the next occurrence of the needle, or

findnxreg($body, $offset, $reg)   X-Ref
This function takes a PCRE-style regexp and tries to match it
within the string.

param: $body   The string to look for needle in.
param: $offset Start looking from here.
param: $reg    A PCRE-style regex to match.
return: false if no matches are found, or an array

getnxtag($body, $offset)   X-Ref
This function looks for the next tag.

param: $body   String where to look for the next tag.
param: $offset Start looking from here.
return: false if no more tags exist in the body, or

deent(&$attvalue, $regex, $hex=false)   X-Ref
Translates entities into literal values so they can be checked.

param: $attvalue the by-ref value to check.
param: $regex    the regular expression to check against.
param: $hex      whether the entities are hexadecimal.
return: True or False depending on whether there were matches.

defang(&$attvalue)   X-Ref
This function checks attribute values for entity-encoded values
and returns them translated into 8-bit strings so we can run
checks on them.

param: $attvalue A string to run entity check against.
return: Nothing, modifies a reference value.
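
To make the entity translation in deent() and defang() concrete, the sketch below decodes decimal and hexadecimal numeric entities into single-byte characters so that a later check can see the literal text. decode_numeric_entities_sketch() is a hypothetical name, and the handling shown (optional trailing semicolon, code points below 256 only) is an assumption of this example rather than the library's exact behaviour.

```php
<?php
// Hypothetical decoder for numeric entities such as &#106; and &#x6a;.
function decode_numeric_entities_sketch($value)
{
    // Decimal form. The semicolon is optional because browsers tolerate its absence.
    $value = preg_replace_callback('/&#(\d+);?/', function ($m) {
        $code = (int) $m[1];
        return ($code > 0 && $code < 256) ? chr($code) : $m[0];
    }, $value);

    // Hexadecimal form.
    $value = preg_replace_callback('/&#x([0-9a-f]+);?/i', function ($m) {
        $code = hexdec($m[1]);
        return ($code > 0 && $code < 256) ? chr($code) : $m[0];
    }, $value);

    return $value;
}

// An obfuscated "javascript:" scheme becomes visible after decoding.
echo decode_numeric_entities_sketch('&#106;ava&#x73;cript:alert(1)'), "\n";
```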

unspace(&$attvalue)   X-Ref
Kill any tabs, newlines, or carriage returns. Our friends the
makers of the browser with 95% market share decided that it'd
be funny to make "java[tab]script" be just as good as "javascript".

param: $attvalue The attribute value before extraneous spaces are removed.
return: Nothing, modifies a reference value.
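
A small sketch of this clean-up step, assuming it only has to drop tabs, newlines and carriage returns. unspace_sketch() is a hypothetical name. The real function may strip more than this.

```php
<?php
// Hypothetical clean-up: removes tabs, newlines and carriage returns in place.
function unspace_sketch(&$attvalue)
{
    $attvalue = str_replace(array("\t", "\r", "\n"), '', $attvalue);
}

$value = "java\tscript:alert(1)";
unspace_sketch($value);
echo $value, "\n"; // the scheme is now written without the embedded tab
```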

fixatts($tagname, $attary, $rm_attnames, $bad_attvals, $add_attr_to_tag)   X-Ref
This function runs various checks against the attributes.

param: $tagname         String with the name of the tag.
param: $attary          Array with all tag attributes.
param: $rm_attnames     See description for sanitize
param: $bad_attvals     See description for sanitize
param: $add_attr_to_tag See description for sanitize
return: Array with modified attributes.

sanitize($body, $tag_list = array(), ...)   X-Ref
This is the main function and the one you should actually be calling.
There are several variables you should be aware of, and which need
special description.

$tag_list
----------
This is a simple one-dimensional array of strings, except for the
very first member. The first member should be either false or true.
In case it's FALSE, the following list will be considered a list of
tags that should be explicitly REMOVED from the body, and all
others that did not match the list will be allowed.  If the first
member is TRUE, then the list is the list of tags that should be
explicitly ALLOWED -- any tag not matching this list will be
discarded.

Examples:
$tag_list = Array(
    false,
    "blink",
    "link",
    "object",
    "meta",
    "marquee",
    "html"
);

This will allow all tags except for blink, link, object, meta, marquee,
and html.

$tag_list = Array(
    true,
    "b",
    "a",
    "i",
    "img",
    "strong",
    "em",
    "p"
);

This will remove all tags from the body except b, a, i, img, strong, em and
p.
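
The allow/remove switch carried by the first member can be sketched like this. tag_is_allowed_sketch() is a hypothetical helper written for this example, and treating an empty list as "filter nothing" is an assumption.

```php
<?php
// Hypothetical check: decides whether a tag survives, given a $tag_list
// whose first member selects allow-list (true) or remove-list (false) mode.
function tag_is_allowed_sketch($tagname, array $tag_list)
{
    if (count($tag_list) === 0) {
        return true; // assumption: an empty list filters nothing
    }
    $allow_mode = ($tag_list[0] === true);
    $names = array_slice($tag_list, 1);
    $listed = in_array(strtolower($tagname), $names, true);
    return $allow_mode ? $listed : !$listed;
}

$remove_list = array(false, "blink", "marquee");
$allow_list  = array(true, "b", "i", "p");

var_dump(tag_is_allowed_sketch("blink", $remove_list)); // listed in remove mode
var_dump(tag_is_allowed_sketch("table", $allow_list));  // not listed in allow mode
```

Allow-list mode is the more conservative choice for untrusted input, since any tag that is not named explicitly is dropped.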

$rm_tags_with_content
---------------------
This is a simple one-dimensional array of strings, which specifies the
tags to be removed with any and all content between the beginning and
the end of the tag.
Example:
$rm_tags_with_content = Array(
    "script",
    "style",
    "applet",
    "embed"
);

This will remove the following structure:
<script>
window.alert("Isn't cross-site-scripting fun?!");
</script>
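
To show the effect of removing a tag together with its content, the sketch below uses a single regular expression. This is a simplified stand-in, not the tag-by-tag walk that the functions on this page suggest the library performs, and it should not be used as a sanitizer on its own. strip_tags_with_content_sketch() is a hypothetical name.

```php
<?php
// Hypothetical regex shortcut: drops listed tags and everything between
// their opening and closing tags.
function strip_tags_with_content_sketch($html, array $rm_tags_with_content)
{
    if (count($rm_tags_with_content) === 0) {
        return $html;
    }
    $quoted = array();
    foreach ($rm_tags_with_content as $tag) {
        $quoted[] = preg_quote($tag, '#');
    }
    $pattern = '#<(' . implode('|', $quoted) . ')\b[^>]*>.*?</\1\s*>#is';
    return preg_replace($pattern, '', $html);
}

$html = 'Hi<script>window.alert("x");</script>!';
echo strip_tags_with_content_sketch($html, array("script", "style")), "\n";
```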

$self_closing_tags
------------------
This is a simple one-dimensional array of strings, which specifies which
tags contain no content and should not be forcefully closed when
$force_tag_closing is turned on (see below).
Example:
$self_closing_tags = Array(
    "img",
    "br",
    "hr",
    "input"
);

$force_tag_closing
------------------
Set it to true to forcefully close any tags opened within the document.
This is good if you want to take care of people who like to screw up
the pages by leaving unclosed tags like <a>, <b>, <i>, etc.
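
How $self_closing_tags and $force_tag_closing interact can be pictured with a stack of open tags: every opening tag is pushed, every closing tag removes its match, self-closing tags are ignored, and whatever is left gets closed at the end. The sketch below follows that idea. close_open_tags_sketch() and its regex-based tag scan are assumptions made for illustration, not the library's implementation.

```php
<?php
// Hypothetical closer: appends closing tags for anything left open.
function close_open_tags_sketch($html, array $self_closing_tags)
{
    preg_match_all('#<(/?)([a-z][a-z0-9]*)\b[^>]*>#i', $html, $matches, PREG_SET_ORDER);
    $open = array();
    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if (in_array($name, $self_closing_tags, true)) {
            continue; // tags without content are never force-closed
        }
        if ($m[1] === '/') {
            // Remove the most recently opened tag with this name, if any.
            $idx = array_search($name, array_reverse($open, true), true);
            if ($idx !== false) {
                array_splice($open, $idx, 1);
            }
        } else {
            $open[] = $name;
        }
    }
    // Close the remaining tags, innermost first.
    foreach (array_reverse($open) as $name) {
        $html .= '</' . $name . '>';
    }
    return $html;
}

echo close_open_tags_sketch('<b>bold <i>both<br>', array("img", "br", "hr", "input")), "\n";
```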

$rm_attnames
-------------
Now we come to parameters that are more obscure. This parameter is
a nested array which is used to specify which attributes should be
removed. It goes like so:

$rm_attnames = Array(
    "PCRE regex to match tag name" =>
        Array(
            "PCRE regex to match attribute name"
        )
);

Example:
$rm_attnames = Array(
    "|.*|" =>
        Array(
            "|target|i",
            "|^on.*|i"
        )
);

This will match all tags (.*), and specify that all attributes
named "target" and starting with "on" should be removed. This will take
care of the following problem:
<em onmouseover="window.alert('muahahahaha')">
The "onmouseover" will be removed.
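
A sketch of how such a nested array can be applied to one tag's attributes. remove_attributes_sketch() is a hypothetical helper, and the attribute array shape (name => quoted value) is an assumption of this example.

```php
<?php
// Hypothetical filter: drops attributes whose names match the regexes
// registered for any tag-name regex that matches $tagname.
function remove_attributes_sketch($tagname, array $attary, array $rm_attnames)
{
    foreach ($rm_attnames as $tag_regex => $att_regexes) {
        if (!preg_match($tag_regex, $tagname)) {
            continue;
        }
        foreach ($att_regexes as $att_regex) {
            foreach (array_keys($attary) as $attname) {
                if (preg_match($att_regex, $attname)) {
                    unset($attary[$attname]);
                }
            }
        }
    }
    return $attary;
}

$rm_attnames = array("|.*|" => array("|target|i", "|^on.*|i"));
$attary = array(
    'onmouseover' => '"window.alert(1)"',
    'class'       => '"note"',
    'target'      => '"_blank"',
);
print_r(remove_attributes_sketch('em', $attary, $rm_attnames));
```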

$bad_attvals
------------
This is where it gets ugly. This is a nested array with many levels.
It goes like so:

$bad_attvals = Array(
    "pcre regex to match tag name" =>
        Array(
            "pcre regex to match attribute name" =>
                Array(
                    Array(
                        "pcre regex to match attribute value"
                    ),
                    Array(
                        "pcre regex replace a match from above with"
                    )
                )
        )
);

An extensive example:

$bad_attvals = Array(
    "|.*|" =>
        Array(
            "/^src|background|href|action/i" =>
                Array(
                    Array(
                        "/^([\'\"])\s*\S+script\s*:.*([\'\"])/si"
                    ),
                    Array(
                        "\\1http://veryfunny.com/\\2"
                    )
                ),
            "/^style/i" =>
                Array(
                    Array(
                        "/expression/si",
                        "/url\(([\'\"])\s*https*:.*([\'\"])\)/si",
                        "/url\(([\'\"])\s*\S+script:.*([\'\"])\)/si"
                    ),
                    Array(
                        "idiocy",
                        "url(\\1http://veryfunny.com/\\2)",
                        "url(\\1http://veryfunny.com/\\2)"
                    )
                )
        )
);

This will take care of nearly all known cross-site scripting exploits,
plus some (see my filter sample at
http://www.mricon.com/html/phpfilter.html for a working version).
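
The two inner arrays are parallel: the first holds patterns, the second holds the replacement for the pattern at the same position, which is the form preg_replace() accepts directly. A sketch of applying one such rule set follows. rewrite_bad_values_sketch() is a hypothetical helper, and the attribute array shape (name => quoted value) is an assumption.

```php
<?php
// Hypothetical rewriter: for matching tags and attributes, runs the
// pattern/replacement pairs over the attribute value.
function rewrite_bad_values_sketch($tagname, array $attary, array $bad_attvals)
{
    foreach ($bad_attvals as $tag_regex => $by_attribute) {
        if (!preg_match($tag_regex, $tagname)) {
            continue;
        }
        foreach ($by_attribute as $att_regex => $pair) {
            list($patterns, $replacements) = $pair;
            foreach ($attary as $attname => $attvalue) {
                if (preg_match($att_regex, $attname)) {
                    $attary[$attname] = preg_replace($patterns, $replacements, $attvalue);
                }
            }
        }
    }
    return $attary;
}

$bad_attvals = array(
    '|.*|' => array(
        '/^src|background|href|action/i' => array(
            array('/^([\'"])\s*\S+script\s*:.*([\'"])/si'),
            array('\\1http://veryfunny.com/\\2'),
        ),
    ),
);

$attary = array('href' => '"javascript:alert(1)"');
print_r(rewrite_bad_values_sketch('a', $attary, $bad_attvals));
```

Values that do not match any pattern pass through unchanged, so ordinary links are left alone.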

$add_attr_to_tag
----------------
This is a useful little feature which lets you add attributes to
certain tags. It is a nested array as well, but not at all like
the previous one. It goes like so:

$add_attr_to_tag = Array(
    "PCRE regex to match tag name" =>
        Array(
            "attribute name"=>'"attribute value"'
        )
);

Note: don't forget quotes around attribute value.

Example:

$add_attr_to_tag = Array(
    "/^a$/si" =>
        Array(
            'target'=>'"_new"'
        )
);

This will change all <a> tags and add target="_new" to them so all links
open in a new window.
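
A sketch of the merge this option implies. add_attributes_sketch() is a hypothetical helper, and letting an added attribute overwrite an existing one of the same name is an assumption of this example.

```php
<?php
// Hypothetical merge: adds the configured attributes to tags whose
// name matches the regex key.
function add_attributes_sketch($tagname, array $attary, array $add_attr_to_tag)
{
    foreach ($add_attr_to_tag as $tag_regex => $extra) {
        if (preg_match($tag_regex, $tagname)) {
            // array_merge lets the added value replace an existing one.
            $attary = array_merge($attary, $extra);
        }
    }
    return $attary;
}

$add_attr_to_tag = array('/^a$/si' => array('target' => '"_new"'));

print_r(add_attributes_sketch('a', array('href' => '"http://example.org/"'), $add_attr_to_tag));
print_r(add_attributes_sketch('abbr', array('title' => '"x"'), $add_attr_to_tag));
```

The anchors in /^a$/ matter here: without them the rule would also apply to tags such as abbr or table.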



param: $body                 the string with HTML you wish to filter
param: $tag_list             see description above
param: $rm_tags_with_content see description above
param: $self_closing_tags    see description above
param: $force_tag_closing    see description above
param: $rm_attnames          see description above
param: $bad_attvals          see description above
param: $add_attr_to_tag      see description above
return: sanitized html safe to show on your pages.
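
Putting the pieces together, a call might look like the sketch below. The argument order follows the param list above. The sample input is invented for this example, and the call is guarded with function_exists() so the snippet also runs when htmlfilter.php has not been included.

```php
<?php
// Configuration assembled from the examples in the description above.
$tag_list             = array(true, "b", "a", "i", "img", "strong", "em", "p");
$rm_tags_with_content = array("script", "style", "applet", "embed");
$self_closing_tags    = array("img", "br", "hr", "input");
$force_tag_closing    = true;
$rm_attnames          = array("|.*|" => array("|target|i", "|^on.*|i"));
$bad_attvals          = array(
    '|.*|' => array(
        '/^src|background|href|action/i' => array(
            array('/^([\'"])\s*\S+script\s*:.*([\'"])/si'),
            array('\\1http://veryfunny.com/\\2'),
        ),
    ),
);
$add_attr_to_tag      = array('/^a$/si' => array('target' => '"_new"'));

// Invented sample input, for illustration only.
$untrusted = '<p onclick="evil()">Hello <b>world</p><script>alert(1)</script>';

// Only call the filter if htmlfilter.php has been included beforehand.
if (function_exists('sanitize')) {
    echo sanitize(
        $untrusted,
        $tag_list,
        $rm_tags_with_content,
        $self_closing_tags,
        $force_tag_closing,
        $rm_attnames,
        $bad_attvals,
        $add_attr_to_tag
    );
}
```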



Generated: Wed Dec 30 05:55:15 2009 Cross-referenced by PHPXref 0.7