One of the most frequently asked questions in ASP.NET is when to use HtmlEncode and when to use UrlEncode. Before digging into answering the question of what to use when, let's step back and look at what they are for.
Why encode data?
Web applications are mostly quite boring. Users enter stuff in a form, it gets stuck in a database, and later on it is presented on a web page. All of these stages need some care from a security point of view, however. This article looks at the last stage.
Generating HTML is mundane, but there's plenty of room to get it wrong and introduce dangerous security problems. We often write things like this:
Username: <%=UserName%><br />
A user innocently entering a less than sign to make a creative username could, if it makes it into the generated HTML, make quite a mess of the layout of a page. For example, if the user enters:
I<3You
Then it is rendered as:
Username: I<3You<br />
Meaning that the letters "3You" would never be shown, and maybe you wouldn't get a linebreak either, not to mention your page won't validate anymore.
A user who fancied being more malicious could create much more fun for you, and input some JavaScript between <script> tags, so when the page is rendered you have:
Username: <script>alert("This site has security problems.");</script><br />
This is called a cross site scripting attack, abbreviated XSS (to differentiate it from CSS, and because anything with an X in its name must be cool). While popping up a message saying that your site has a security problem is probably not going to do much for your reputation, a "better" use of XSS would be to try and hijack the sessions of people visiting the site by stealing session cookies. That can be done silently, but the results might be theft of customer details without anyone visiting your site ever knowing. So, what can we do about this?
Encode Everything!
Even if you are not sure if you need to. I use the username example above because it nearly bit us here at Programmer's Heaven. We were developing something that used usernames in URLs (part of some URL re-writing that we do).
My assumption was that usernames would always be alphanumeric, that this was promised by validation in the sign-up script and thus I did not need to encode them when building URLs. In fact, while nothing quite like the above could have gone wrong since characters with special meaning in HTML were not allowed, there were characters allowed in usernames that would have had special meanings in URLs. Therefore, to be safe I needed to encode them.
Since then I've changed from "encode everything I think needs encoding" to "just encode everything". Even if there are validation rules that mean you should never need to, Just Do It. You never know what was on the mind of the person who coded the validation rules, or what they might do with them in the future.
Obvious exceptions to this rule are numbers (where the value is coming from an int or floating point datatype), where it is just not possible that anything other than a number could have been stored. But if it's string data, or the result of an object's ToString() method, encode it.
HtmlEncode and UrlEncode
The HttpUtility class, in the System.Web namespace, has a number of methods for encoding data that is being emitted into an HTML document. It's important that you pick the correct one; otherwise you could leave your web application vulnerable or emit the wrong thing.
HtmlEncode
HtmlEncode is the method you will be using most of the time. It encodes things that could affect the meaning of the HTML, replacing them with HTML entities. For example, "<" is replaced with "&lt;" and "&" is replaced with "&amp;". Therefore, in our examples above, using HtmlEncode on UserName would have led to the more innocent output of:
Username: I&lt;3You<br />
And:
Username: &lt;script&gt;alert(&quot;This site has security problems.&quot;);&lt;/script&gt;<br />
Using HtmlEncode is as simple as:
EncodedUserName = HttpUtility.HtmlEncode(UserName);
You should use HtmlEncode:
- When emitting something into a web page to be displayed
- When emitting something into an attribute, even if it is quoted (but see the note on HtmlAttributeEncode below)
- When emitting something containing an entire URL (but not when you are putting something into a URL, e.g. as part of the query string - use UrlEncode for that)
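As a minimal sketch of the first case in the list, the snippet below wraps the call in a small helper and runs it on the two usernames from earlier. RenderUserName and the class name are made up for illustration; only HttpUtility.HtmlEncode comes from the framework, and the sketch assumes a project that can reference System.Web.

```csharp
using System;
using System.Web; // HttpUtility lives in this namespace

public static class HtmlEncodeDemo
{
    // Hypothetical helper: builds the "Username:" line from the article,
    // HTML-encoding the user-supplied value before it reaches the page.
    public static string RenderUserName(string userName)
    {
        return "Username: " + HttpUtility.HtmlEncode(userName) + "<br />";
    }

    public static void Main()
    {
        // prints: Username: I&lt;3You<br />
        Console.WriteLine(RenderUserName("I<3You"));

        // The script tags come out as inert text rather than markup.
        Console.WriteLine(RenderUserName(
            "<script>alert(\"This site has security problems.\");</script>"));
    }
}
```

Note that the helper encodes only the user-supplied part; the surrounding markup is ours and is emitted as-is.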
UrlEncode
UrlEncode encodes things that could affect the meaning of URLs, replacing them with escapes of the form %dd, where dd is the two-digit hexadecimal code of the character. For example, if you are building up a URL based upon the username, and have something like this in your ASPX page:
<a href="/homepages/<%=UserName%>">Home page</a>
Then you should have UrlEncode'd UserName. I tend to stick things that are UrlEncoded into a variable with URL on the end, though that's just my own convention - having a consistent convention in your own shop matters more than what the convention is.
<a href="/homepages/<%=UserNameURL%>">Home page</a>
And then in the code behind file:
UserNameURL = HttpUtility.UrlEncode(UserName);
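To see what that call does to awkward input, here is a small sketch. HomePagePath and the sample username are invented for illustration; the /homepages/ prefix is the one used in the markup above.

```csharp
using System;
using System.Web;

public static class UrlEncodeDemo
{
    // Hypothetical helper: appends a URL-encoded username to the
    // /homepages/ prefix used in the article's example link.
    public static string HomePagePath(string userName)
    {
        return "/homepages/" + HttpUtility.UrlEncode(userName);
    }

    public static void Main()
    {
        // "&" and "#" would otherwise change how the URL is parsed.
        // prints: /homepages/tom%26jerry+%231
        Console.WriteLine(HomePagePath("tom&jerry #1"));
    }
}
```

UrlEncode turns a space into "+", which is the query string convention. If usernames containing spaces can end up in a path segment, as they would here, Uri.EscapeDataString (which emits %20 instead) may be the safer choice; treat that as a suggestion to test against your own routing rather than a rule.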
The thing that seems to endlessly confuse some people is this situation:
<a href="<%=LinkToSite%>">Click To Visit</a>
Here, "LinkToSite" holds the entire URL, not just something that is being placed into a URL. Therefore, you should use HtmlEncode and not UrlEncode. You expect "LinkToSite" to contain special URL characters because, well, it's a URL! And you did validate for that, right?

We use HtmlEncode in case the link contains characters that have special meaning in HTML.
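A short sketch of the whole-URL case, assuming a hypothetical RenderLink helper and an example.com address that are not part of the original code:

```csharp
using System;
using System.Web;

public static class LinkDemo
{
    // Hypothetical helper: emits a complete URL into an href attribute.
    // The URL is HTML-encoded, not URL-encoded, because it is already a URL.
    public static string RenderLink(string linkToSite)
    {
        return "<a href=\"" + HttpUtility.HtmlEncode(linkToSite) + "\">Click To Visit</a>";
    }

    public static void Main()
    {
        // The "&" separating query parameters has special meaning in HTML,
        // so it becomes "&amp;"; the ":", "/", "?" and "=" are left alone.
        // prints: <a href="http://example.com/search?q=cats&amp;page=2">Click To Visit</a>
        Console.WriteLine(RenderLink("http://example.com/search?q=cats&page=2"));
    }
}
```

Running UrlEncode over the same string would instead escape the ":" and "/" characters, and the result would no longer work as a link.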
HtmlAttributeEncode
If you are emitting something to be used only inside double quotes in the HTML that is produced, you may use HtmlAttributeEncode in place of HtmlEncode. This will only encode characters that will be taken to have special meaning inside double quoted strings.
You could correctly use HtmlAttributeEncode in the following example:
<a name="<%=AnchorName%>">
Using it for anything that is not double quoted may lead to a security issue. Using HtmlEncode consistently rather than HtmlAttributeEncode, on the other hand, will not. Therefore, you're free to forget HtmlAttributeEncode and just use UrlEncode and HtmlEncode.
MSDN states that HtmlAttributeEncode runs faster, which is plausible since it has fewer characters to escape. However, the fraction of a millisecond you'd likely save per request would take a long time to add up to the time you'd spend dealing with an exploit because you or someone else used its output incorrectly.
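The difference between the two methods is easiest to see side by side. The sketch below is illustrative only; the class name and sample strings are invented, and it relies on the documented behaviour that HtmlAttributeEncode converts quotation marks, ampersands and left angle brackets but leaves ">" alone.

```csharp
using System;
using System.Web;

public static class AttributeEncodeDemo
{
    public static void Main()
    {
        string quoted = "say \"hi\"";
        string comparison = "a > b";

        // Both methods escape the double quote, which is what matters
        // inside a double-quoted attribute.
        Console.WriteLine(HttpUtility.HtmlAttributeEncode(quoted)); // say &quot;hi&quot;
        Console.WriteLine(HttpUtility.HtmlEncode(quoted));          // say &quot;hi&quot;

        // Only HtmlEncode escapes the greater-than sign.
        Console.WriteLine(HttpUtility.HtmlAttributeEncode(comparison)); // a > b
        Console.WriteLine(HttpUtility.HtmlEncode(comparison));          // a &gt; b
    }
}
```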
More Subtle XSS Attacks
Encoding is not everything, it's just one important thing you need to do when building secure web applications. For example, if you are building a link directory you will likely be emitting full URLs, along the lines of:
<a href="<%=URL%>"><%=Title%></a>
You correctly encode URL using HtmlEncode, and all is well. Or is it? Trouble is that what was entered may not have been a valid URL. It may have been:
javascript:alert('Oooh! A security hole!')
This is not something that encoding will make safe - there are no characters that change the meaning of the HTML and it is perfectly valid to generate such a page, if it's what you intended to do.
If you are being paranoid (not always a bad thing), you could check to make sure that you really do have what is expected (in this case, perhaps something starting with "http://").
if (Link.StartsWith("http://"))
Or to be really sure, use a regex to validate it's a real URL. However, it is really down to the input code to be doing proper validation to make sure that anything that is not a URL doesn't make it into the database in the first place.
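One way to do that kind of check without writing a regex by hand is to let System.Uri parse the value and then look at the scheme. This is only a sketch: IsHttpUrl is a hypothetical helper, and accepting https as well as http is an assumption that goes slightly beyond the "http://" check above.

```csharp
using System;

public static class LinkValidator
{
    // Hypothetical helper: accepts only absolute URLs whose scheme is
    // http or https (allowing https is an assumption of this sketch).
    public static bool IsHttpUrl(string candidate)
    {
        Uri uri;
        return Uri.TryCreate(candidate, UriKind.Absolute, out uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    public static void Main()
    {
        Console.WriteLine(IsHttpUrl("http://example.com/"));                        // True
        Console.WriteLine(IsHttpUrl("javascript:alert('Oooh! A security hole!')")); // False
        Console.WriteLine(IsHttpUrl("not a url"));                                  // False
    }
}
```

A check like this belongs in the input code as well as (or instead of) the output code, for the reason given above: it is better that a non-URL never reaches the database.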
Summary
- Encode everything apart from integers and other numeric values, even if in theory it should be OK. Theory isn't practice.
- If the data is going to be part of a URL, but is not a full URL, use UrlEncode.
- For everything else, use HtmlEncode.
- Encoding isn't a magic answer to every security issue in web applications, just one important piece of the puzzle.