HTML Comments

After I got the new Blog up and running, I quickly noticed that plain text comments kinda suck. I have never been a big fan of Textile, Markdown, or any of the other simplified markup languages, so I decided to stick with plain old HTML.

Plain old HTML is unfortunately not a very safe thing to allow people to stick in your comments. Malicious JavaScript, random CSS, all these things can mess you up in a hurry.

The second problem is that there are plenty of people out there who don't know HTML and don't want to know HTML. For them, I decided a rich editor was in order.

I needed to figure out how to sanitize the HTML. Bold, italic, and underlined text, paragraphs, and hyperlinks seem to be about all you really want in the average Blog comment. I wanted a way to allow only these tags and strip out everything else.

Tom Insam's recipe using Beautiful Soup seemed to fit the bill perfectly; I only needed to modify his tag list a little.

Here's my ever so slightly modified version:

from BeautifulSoup import BeautifulSoup
import re

def sanitize(html):
    # allow these tags. Other tags are removed, but their child elements remain
    whitelist = ['em', 'i', 'strong', 'u', 'a', 'b', 'p', 'br', 'code', 'pre' ]

    # allow only these attributes on these tags. No other tags are allowed any
    # attributes.
    attr_whitelist = { 'a':['href','title','hreflang']}

    # remove these tags, complete with contents.
    blacklist = [ 'script', 'style' ]

    attributes_with_urls = [ 'href', 'src' ]

    # BeautifulSoup is catching out-of-order and unclosed tags, so markup
    # can't leak out of comments and break the rest of the page.
    soup = BeautifulSoup(html)

    # now strip HTML we don't like.
    for tag in soup.findAll():
        if tag.name.lower() in blacklist:
            # blacklisted tags are removed in their entirety
            tag.extract()
        elif tag.name.lower() in whitelist:
            # tag is allowed. Make sure all the attributes are allowed.
            # iterate over a copy, since attributes are removed inside the loop
            for attr in list(tag.attrs):
                # allowed attributes are whitelisted per-tag
                if tag.name.lower() in attr_whitelist and \
                    attr[0].lower() in attr_whitelist[ tag.name.lower() ]:
                    # some attributes contain urls..
                    if attr[0].lower() in attributes_with_urls:
                        # ..make sure they're nice urls
                        if not re.match(r'(https?|ftp)://', attr[1].lower()):
                            tag.attrs.remove( attr )
                    # ok, then
                    pass
                else:
                    # not a whitelisted attribute. Remove it.
                    tag.attrs.remove( attr )
        else:
            # not a whitelisted tag. I'd like to remove it from the tree
            # and replace it with its children. But that's hard. It's much
            # easier to just replace it with an empty span tag.
            tag.name = "span"
            tag.attrs = []

    # stringify back again
    safe_html = unicode(soup)

    # HTML comments can contain executable scripts, depending on the browser,
    # so we'll be paranoid and just get rid of all of them, e.g.
    # <!--[if lt IE 7]><script type="text/javascript">h4x0r();</script><![endif]-->
    # TODO - I rather suspect that this is the weakest part of the operation..
    # (?s) lets '.' match newlines; inside [...] a '.' is only a literal dot.
    safe_html = re.sub(r'(?s)<!--.*?-->', '', safe_html)
    return safe_html
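
One pitfall worth checking in attribute-stripping loops like the one in sanitize(): removing items from a list while a for loop is walking it makes the loop skip the element right after each removal. Below is a minimal sketch of that effect. It uses a plain list of (name, value) tuples as a stand-in for the BeautifulSoup 3 attrs list, and the function names and sample data are illustrative assumptions, not part of the recipe.

```python
# Stand-in for a BeautifulSoup 3 style attrs list: (name, value) tuples.
ALLOWED = {'href', 'title'}

def strip_in_place(attrs):
    """Buggy: removes from the same list it is iterating over."""
    for attr in attrs:
        if attr[0] not in ALLOWED:
            attrs.remove(attr)
    return attrs

def strip_over_copy(attrs):
    """Iterates over a snapshot, so no element is skipped."""
    for attr in list(attrs):
        if attr[0] not in ALLOWED:
            attrs.remove(attr)
    return attrs

sample = [
    ('onclick', 'evil()'),
    ('onmouseover', 'evil()'),
    ('href', 'http://example.com/'),
]
buggy = strip_in_place(list(sample))
fixed = strip_over_copy(list(sample))
print(buggy)  # the second event handler survives
print(fixed)  # only the whitelisted attribute is left
```

With two unwanted attributes in a row, the in-place version lets the second one through, which is exactly the kind of thing a sanitizer should not do.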

All comments are run through this sanitizer before being saved. If a tag is not allowed, but contains valid child tags, they are preserved (wrapped in a span instead of the original container).
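
The HTML comment stripping step deserves a quick test of its own. Inside a character class the dot is a literal '.', so a pattern like `<!--[.\n]*?-->` only matches comments made up of dots and newlines. The sketch below (Python 3, standard library only; the helper name `strip_comments` and the sample string are my own assumptions) compares that pattern with one where the dot also matches newlines.

```python
import re

# Inside [...] the dot is a literal '.', so this only matches comments
# that consist of nothing but dots and newlines.
NARROW = re.compile(r'<!--[.\n]*?-->')

# (?s) makes '.' match newlines too, so multi-line comments are caught.
BROAD = re.compile(r'(?s)<!--.*?-->')

def strip_comments(html, pattern=BROAD):
    """Remove everything the given pattern matches from html."""
    return pattern.sub('', html)

sample = 'before<!-- hidden\n<script>h4x0r();</script> -->after'
narrow_result = strip_comments(sample, NARROW)
broad_result = strip_comments(sample, BROAD)
print(narrow_result == sample)  # the narrow pattern leaves the comment in place
print(broad_result)
```

A regex is still a blunt tool here; removing comment nodes from the parsed tree before stringifying is probably the sturdier option.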

Now I needed a rich editor. I have used TinyMCE. It's very configurable and can be used for simple editors like mine, or all the way up to a very rich word processor.

To use it, include the main tiny_mce.js script on your page, followed by a second script that starts TinyMCE and configures it.

<script type="text/javascript" src="/static/blog/js/tiny_mce/tiny_mce.js"></script>
<script type="text/javascript" src="/static/blog/js/commenteditor.js"></script>

Here's the code from commenteditor.js:

tinyMCE.init(
  {
    //just turn one specific textarea into a tiny mce editor
    mode:"exact",  
    //the specific textarea has id="id_comment"
    elements : "id_comment", 
    //use the advanced theme so we can configure the exact appearance
    theme: "advanced",
    //the first row of buttons in the editor, 
    //these are the only functions I want
    theme_advanced_buttons1 : "bold,italic,underline,link,unlink",
    theme_advanced_buttons2 : "", //make the other 2 rows empty
    theme_advanced_buttons3 : "",
    //tell tiny_mce I am working with xhtml strict (default is transitional)
    doctype: '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">',
    //don't use inline styles, this makes sanitizing the html a lot easier
    inline_styles: false,
    //css styles for the content in the editor
    content_css : "/static/blog/css/commenteditor.css", 
    width: "528" //the width of the control
  });

The commenteditor.css file contains the CSS styles used for the content of the editor. This allows you to set its background, font size, etc. to match how you style the rendered comments, giving a real WYSIWYG experience.

And voilà, you can see the nice rich editor at the bottom of this page.

Comments

Yo. To strip blacklisted tags use:

            tag.replaceWith(tag.contents[0])

tezro 23:47 Sunday the 30th of August 2009

Here is a more reliable way to remove comments:

from BeautifulSoup import Comment

comments = soup.findAll(text=lambda text:isinstance(text, Comment))

[comment.extract() for comment in comments]

Chase Seibert 15:39 Friday the 26th of February 2010

Thanks Chase, but in my case I wanted to allow some tags, not simply strip the text in every case. I had not seen the Comment class before, though; very handy.

Sean O'Donnell 15:51 Friday the 26th of February 2010

To strip non-whitelisted tags, use the following:

In the sanitize(html) function,

# not a whitelisted tag. I'd like to remove it from the tree
# and replace it with its children. But that's hard.

This is easily solved by:

tag.hidden = True

Good luck and thanks for sharing this script with us.

lucas 14:07 Tuesday the 14th of September 2010