As a software provider, you’re not just concerned about the cybersecurity of your own company, but you also carry the burden of protecting your customers’ data and systems. According to IBM, the average cost of a data breach is projected to reach $4.2 million by 2023. With that number in mind, it makes sense to leave no stone unturned when it comes to the security of your SaaS app.
You’ve already got your eyes on things like data encryption, access control and backups. But what about security at the point where users enter information – specifically, where they create rich content inside your app?
This in-depth, technical article covers what you need to know about input sanitization, HTML sanitization, user input validation, HTML validation, and how they all play an important role in keeping your app safe from popular attack vectors like cross-site scripting (XSS).
Table of Contents
When should input sanitization be used?
Benefits of input sanitization
Strategies for input sanitization
Mutation cross-site scripting (mXSS)
How does HTML sanitization work?
Input sanitization and validation in your app
Input sanitization
What is input sanitization?
A standard security measure, input sanitization is the process of checking and filtering input data to ensure it’s free of characters or strings that could inject malicious code into an application or system.
Depending on the input medium and target system, attacks including cross-site scripting (XSS), SQL injection, remote file inclusion (RFI), command injection, and buffer overflow can all be prevented by input sanitization.
When should input sanitization be used?
Input sanitization should be used whenever some system, such as a web application, database, or file server, accepts user input, for example when someone is creating web page or email content.
Here are some examples of where input sanitization helps:
- Without input sanitization, the input may contain malicious strings that can be read as HTML or JavaScript code. This makes the system vulnerable to cross-site scripting (XSS) – explained later in this article – if the unchanged input were to be rendered on the web application.
- If unsanitized user input is used as part of a database query, the system is made vulnerable to SQL injection attacks if the user input contains malicious strings that resemble SQL commands. This can allow unauthorized modifications to the contents of a database by a threat actor.
- When a server allows users to upload their own files, sanitizing those files ensures that they are free of malicious executable scripts that may impact the server in unintended ways once the files are read or retrieved.
Benefits of input sanitization
Input sanitization benefits your software in two ways. First, it helps prevent attacks, and second, it helps you comply with industry standards, which are often a requirement from customers when purchasing software.
Preventing attacks
Having the ability to thwart common injection attacks means input sanitization minimizes cybersecurity risks such as unauthorized access to sensitive data, data breaches, malware infections, and other unwanted and unauthorized actions by a third party.
This helps to maintain the security and integrity of applications and systems, minimizing downtimes and promoting trust among end-users.
Compliance
The security benefits of input sanitization also extend to compliance with industry standards.
For example, HIPAA (the Health Insurance Portability and Accountability Act) requires healthcare providers to have technical security measures to protect the privacy and security of patient information.
Another example of an industry standard that recommends input sanitization is the Payment Card Industry Data Security Standard (PCI DSS), which requires organizations to protect against unauthorized disclosure of cardholder data.
Strategies for input sanitization
To sanitize user input, several strategies are commonly employed:
1. Whitelisting
This strategy allows only certain characters, patterns, or data types to be entered into a system. For example, if an input field requires a valid email address, it might whitelist only alphanumeric strings followed by an “@” and a domain name.
Whitelisting is highly effective at sanitizing input and is commonly used to prevent SQL injections, but may be less suitable for more complex input data such as HTML documents.
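As a rough illustration of the whitelisting idea, the sketch below accepts an email-like value only if every character falls inside an explicitly permitted set. The pattern and the function name isAllowedEmail are assumptions made for this example; the pattern is slightly more permissive than "alphanumeric only" (it also allows dots, underscores, and hyphens) and is not a complete email grammar.

```javascript
// Allowlist sketch: the whole string must match the permitted pattern.
// The pattern is an illustrative assumption, not a full email specification.
const EMAIL_ALLOWLIST = /^[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$/;

function isAllowedEmail(input) {
  // Anything that is not a string, or contains a character outside the
  // allowlist (quotes, spaces, angle brackets, ...), is rejected outright.
  return typeof input === 'string' && EMAIL_ALLOWLIST.test(input);
}
```

Because the pattern is anchored at both ends, a value such as an address with a trailing SQL fragment is rejected as a whole rather than partially cleaned.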
2. Blacklisting
This strategy involves detecting and preventing certain characters, patterns, or data types from being entered into a system. Many HTML sanitization libraries employ this approach.
For example, the DOMPurify library filters potentially malicious HTML patterns to prevent XSS issues, such as ensuring there is no namespace collision in values of the named properties id and name to suppress DOM clobbering attacks.
3. Encoding
The encoding strategy converts special characters from input types such as HTML and URL to their encoded equivalents. In HTML, the < and > characters have special meaning to denote the start and end of a tag. Once encoded, they would be converted into &lt; and &gt; respectively.
For example, this potentially malicious HTML
<img src="error" onerror="alert('xss')">
would be transformed into the following when encoded.
&lt;img src="error" onerror="alert('xss')"&gt;
If the encoded input is rendered on the webpage, the literal string “<img src="error" onerror="alert('xss')">” would be rendered instead of an HTML image element. While encoding is useful for sanitizing literal text, it is less so when the user input is required to be rendered as HTML, such as inside a WYSIWYG editor like TinyMCE.
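For the literal-text case, the encoding step can be sketched in a few lines of JavaScript. The helper name escapeHtml and its entity table are assumptions for illustration. This version also encodes ampersands and quotes, which is stricter than the angle-bracket example above and is a common choice when the text may end up inside an attribute value.

```javascript
// Characters with special meaning in HTML and their entity equivalents.
const HTML_ENTITIES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Replace every special character with its entity so that the browser
// displays the text literally instead of interpreting it as markup.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ENTITIES[ch]);
}
```

The output is safe to place in a text context, but it is no longer usable as HTML, which is why this approach does not suit rich text editors.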
HTML sanitization
HTML sanitization is a category of input sanitization where the input is HTML markup.
By adopting a blacklist approach to remove potentially malicious code patterns, HTML sanitization ensures that user-generated content is free of HTML and JavaScript code that can enable cross-site scripting attacks.
While encoding, as described before, can neutralize any malicious code in user input, it’s only useful when that input is not intended to be rendered later as HTML. In scenarios where the user needs to author HTML, such as within a WYSIWYG editor like TinyMCE, output encoding results in HTML rendering incorrectly, and HTML sanitization should be used here instead.
Before delving deeper, let’s look at a few vulnerabilities that can be prevented by most modern HTML sanitization techniques: cross-site scripting (XSS), mutation cross-site scripting (mXSS), and DOM clobbering.
Common security vulnerabilities
Cross-site scripting (XSS)
Cross-site scripting (XSS) is a security vulnerability that occurs when attackers inject malicious code into a web application. If unintentionally executed by the web application, the injected code can perform a range of serious actions including account impersonation, accessing sensitive data, observing user behavior, and altering application functionality and data.
Typically, the medium that attackers use to inject code is form fields that have no or an insufficient level of input sanitization.
As an example, let’s say there’s a web application that allows users to create and edit blog articles using a WYSIWYG editor, which allows the inputted content to be formatted using HTML tags. If neither the client-side WYSIWYG editor nor the server-side code sanitizes the user-inputted blog, the system can be made vulnerable to an XSS attack. For instance, a threat actor inputs the following HTML into the editor:
<script>
function doBadThings() {
// much more damaging scripts can be defined here, for
// demo purposes, we will just have an alert statement
alert('XSS');
}
doBadThings();
</script>
Without HTML sanitization, the malicious code within the <script> tags will be preserved as is when the article is saved to the server. When the article’s HTML is rendered and displayed to other users, the code will execute and potentially allow unauthorized third-party access to session cookies and other sensitive data.
Mutation cross-site scripting (mXSS)
Mutation cross-site scripting (mXSS) is a type of XSS attack where any injected code strings are not intended to be executed as is but are designed to be changed by either the browser or client-side code to be transformed into malicious code that can be executed.
mXSS attacks leverage the fact that browsers and sometimes client-side code modify inputted HTML code for performance and/or validation reasons. Any malicious HTML strings at the point of injection may not resemble known patterns in naive XSS filters and sanitizers, which often assume that the browser only performs trivial modifications to HTML.
This makes mXSS more difficult to detect and prevent than conventional XSS attacks.
Heiderich et al. (2013) describe an example in which an input containing backtick characters is directly assigned to an element’s innerHTML property on IE 8. Given the following input:
<img src="test.jpg" alt="``onload=xss()" />
when it is assigned as-is to an existing element’s innerHTML, the browser transforms it into:
<IMG alt=``onload=xss() src="test.jpg" />
Whereas the input supplies the ``onload=xss() string apparently safely in the alt attribute, the browser ends up mutating the input in a way that generates a new onload attribute with the value of a potentially malicious function.
Even though modern browsers have implemented fixes for most of the common mXSS vectors, it is still important for HTML sanitization approaches to detect any pattern with the potential of causing mXSS to prevent known mXSS vulnerabilities on older browsers and ideally unseen mXSS techniques as well.
DOM clobbering
DOM clobbering is a vulnerability that can enable XSS attacks, but is not an XSS attack in itself. DOM clobbering describes the situation where attackers inject named HTML elements whose id or name attribute matches the names of browser APIs or global variables. When elements with an id or name attribute are created in the DOM tree, browsers also create a global reference to that element on the window and document objects.
For example, when an anchor element with id="location" is rendered on a webpage, the references document.location and window.location are created pointing to this DOM element. In addition, these named element references come before lookups of built-in document or window APIs and other attributes. This means that this newly created document.location reference to this anchor element clobbers the built-in location field on document and window.
If a web application intends to use the document.location API in its code, it may unknowingly reference the anchor tag with id="location" instead. Anchor elements are frequently chosen as vectors because an anchor’s toString value is its href attribute, meaning threat actors can control the value when the element is referenced as a string.
For example, if the web application has a WYSIWYG editor without sanitization that allows users to enter and save blog articles and the application contains a script like:
<script> location.assign(document.location);</script>
a threat actor can either input
<a id="location" href="https://example.com">DOM clobbering</a>
to redirect the user to a third-party site for potentially phishing purposes, or input
<a id="location" href="javascript:doBadThings();">DOM clobbering</a>
to perform XSS, when the article is displayed to other users.
How does HTML sanitization work?
Different HTML sanitization libraries use different techniques to sanitize HTML. Let’s use DOMPurify, a popular, modern, OWASP-recommended HTML sanitization library, to explore further. DOMPurify uses a combination of whitelisting and blacklisting strategies to identify potentially XSS-inducing tags and attributes in HTML to remove or modify.
DOMPurify first parses the HTML input into a DOM tree using the DOMParser API. This parsing ensures the input is well-formed, and can be processed correctly. DOMPurify then iterates through the DOM tree using a NodeIterator, and for each node, it performs both node and attribute-level sanitization.
Node sanitization
DOMPurify has a default whitelist of allowed tags that excludes tags like <script> that may be used to inject malicious content or execute code for XSS. If the current node is not an allowed tag, it is removed. For example: <p>Lorem ipsum<script>alert('XSS');</script></p> → <p>Lorem ipsum</p>. The user can optionally configure the lists of allowed and forbidden tags.
It also performs blacklist-style checks for, and removes, nodes that have HTML-like non-node children to prevent mXSS attacks: <div><iframe src="https://example.com/"><p>Lorem ipsum</p></iframe></div> → <div></div>. Note that any HTML wrapped within <iframe> tags is not parsed into nodes.
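To make the allowed-tag check concrete, here is a minimal sketch that works on a plain-object tree rather than a real DOM. It is not DOMPurify's implementation: the node shape ({ tag, children } or { text }), the ALLOWED_TAGS set, and the function name sanitizeNodes are all assumptions for illustration, and the mXSS-specific checks described above are left out.

```javascript
// Hypothetical allowlist; a real sanitizer ships a much larger default list.
const ALLOWED_TAGS = new Set(['div', 'p', 'a', 'img', 'strong', 'em']);

// Nodes are plain objects: { tag: 'p', children: [...] } or { text: '...' }.
function sanitizeNodes(nodes) {
  const result = [];
  for (const node of nodes) {
    if (node.text !== undefined) {
      // Text nodes carry no markup, so they are kept as-is.
      result.push({ text: node.text });
    } else if (ALLOWED_TAGS.has(node.tag)) {
      // Allowed elements are kept, and their children are checked recursively.
      result.push({ tag: node.tag, children: sanitizeNodes(node.children || []) });
    }
    // Any other element (for example <script>) is dropped with its children.
  }
  return result;
}
```

Applied to a tree representing <p>Lorem ipsum<script>alert('XSS');</script></p>, the sketch keeps the paragraph and its text and drops the script element.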
Attribute sanitization
If the node has attributes, DOMPurify iterates through all of its attributes and performs blacklist-style checks to modify or remove potentially malicious attributes, including (and this list is non-exhaustive):
- If the attribute value for id or name matches a built-in API or other global variable → removed to prevent DOM clobbering: <p id="location">Lorem ipsum</p> → <p>Lorem ipsum</p>
- If the attribute is an event attribute such as onload or onerror → removed to prevent execution of arbitrary code that can cause XSS: <img src="error" onerror="alert('XSS');"> → <img src="error">
- If the attribute is src or href and its value is a javascript: URL → removed to prevent execution of arbitrary code that can cause XSS: <a href="javascript:alert('XSS');">Lorem ipsum</a> → <a>Lorem ipsum</a>
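The three checks above can be approximated on a plain attribute map as follows. This is a simplified sketch, not DOMPurify's code: the RESERVED_NAMES set is a small hypothetical sample, the function name sanitizeAttributes is an assumption, and a production sanitizer handles many more cases (for example, control characters hidden inside the URL scheme).

```javascript
// Small, hypothetical sample of names that would clobber built-in APIs.
const RESERVED_NAMES = new Set(['location', 'cookie', 'body', 'forms', 'submit']);
const URL_ATTRIBUTES = new Set(['src', 'href']);

function sanitizeAttributes(attrs) {
  const clean = {};
  for (const [name, value] of Object.entries(attrs)) {
    const key = name.toLowerCase();
    const text = String(value);
    // Event handler attributes (onload, onerror, ...) are removed.
    if (key.startsWith('on')) continue;
    // id/name values that collide with reserved names are removed.
    if ((key === 'id' || key === 'name') && RESERVED_NAMES.has(text)) continue;
    // javascript: URLs in src/href are removed.
    if (URL_ATTRIBUTES.has(key) && /^\s*javascript:/i.test(text)) continue;
    clean[key] = text;
  }
  return clean;
}
```

Each rule removes the offending attribute while leaving the element and its remaining attributes in place, which matches the before-and-after pairs listed above.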
Does TinyMCE sanitize HTML?
Yes, TinyMCE sanitizes HTML. As an enterprise-grade rich text editor trusted by millions of apps every day, it’s crucial for TinyMCE to employ HTML sanitization to prevent security vulnerabilities. Why is it crucial? Because in most cases, users input and format HTML content in TinyMCE with the intention of saving and rendering it as HTML.
In addition to our own layer of HTML validation and filtering (more on this later), TinyMCE also uses the DOMPurify library (described in the previous section) to sanitize HTML input. Learn how TinyMCE handles HTML sanitization.
User input validation
Input validation is the process of checking user-inputted data against specific rules or criteria to ensure that it meets the expected format, type, range and other requirements that are necessary for the application to function properly and securely.
This ensures that only expected input data is entering a system and that the data is suitable for its intended use. Input validation also has the potential to prevent data corruption, security vulnerabilities, functionality alterations, performance degradation, and system crashes.
Input validation, similarly to input sanitization, should be used wherever user input is required. Some examples of where you would use input validation:
- Web forms – ensure that the user input data is of a suitable format, such as ensuring that an email address field contains an input that is a valid email address.
- User authentication – ensure that passwords meet length and complexity requirements for security purposes.
- File uploads – ensure that user-uploaded files meet the type and size requirements for their intended use and do not overload the server.
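The password and file upload checks in the list above might look like the sketch below. All limits, allowed types, and function names here are hypothetical values chosen for illustration; real requirements depend on the application.

```javascript
// Hypothetical limits; adjust to the application's actual requirements.
const MIN_PASSWORD_LENGTH = 12;
const MAX_UPLOAD_BYTES = 5 * 1024 * 1024;
const ALLOWED_UPLOAD_TYPES = new Set(['image/png', 'image/jpeg']);

// Returns a list of rule violations; an empty list means the input is valid.
function validatePassword(password) {
  const errors = [];
  if (password.length < MIN_PASSWORD_LENGTH) errors.push('too short');
  if (!/[a-z]/.test(password) || !/[A-Z]/.test(password)) errors.push('needs mixed case');
  if (!/[0-9]/.test(password)) errors.push('needs a digit');
  return errors;
}

// Expects an object with a MIME type string and a size in bytes.
function validateUpload(file) {
  const errors = [];
  if (!ALLOWED_UPLOAD_TYPES.has(file.type)) errors.push('unsupported type');
  if (file.size > MAX_UPLOAD_BYTES) errors.push('too large');
  return errors;
}
```

Unlike sanitization, these functions never modify the input. They only report whether it meets the rules, and the application decides whether to reject it.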
HTML validation
HTML validation is a type of input validation where the input data is HTML code and the validation criteria are the standards set by the World Wide Web Consortium (W3C). The standards ensure HTML documents follow proper structure, are free of errors, and comply with correct syntax and semantics.
Two commonly used HTML standards are HTML5, which is the 5th and latest major version published by W3C, and HTML 4.01, its predecessor.
An example of HTML validation: This HTML, <h1>Lorem ipsum</h2>, is invalid as it has the <h1> tag and is closed with a mismatched </h2> tag. An HTML validator would output <h1>Lorem ipsum</h1> to ensure it matches the standard.
Benefits of HTML validation include:
- Improved accessibility: assistive technologies such as screen readers are designed to interpret and present content based on these standards.
- Improved website SEO: valid HTML code indicates that the website is well-structured and easy to navigate, which is favored by search engines. Also, search engine result pages are based on information excavated by their crawlers, which are designed to read standard HTML documents.
- Improved performance: valid HTML code tends to be more lightweight and is easier to interpret by web browsers, resulting in faster load times.
HTML validation and HTML sanitization are two different processes with the same goal of improving the quality and security of HTML documents. While HTML validation is often a strategy of HTML sanitization, it only checks the syntax and semantics in HTML code, and does not filter potentially malicious patterns which often exist as valid HTML.
In the context of a WYSIWYG editor like TinyMCE, both HTML validation and sanitization are important processes to ensure that the HTML output generated by the editor is valid and safe, for example when publishing it on a website.
Does TinyMCE validate HTML?
TinyMCE uses the built-in DOMParser API to correct malformed HTML to fix basic validation issues such as mismatched tags. TinyMCE then implements its own schema which it uses to more thoroughly validate the user’s HTML input against either HTML5, HTML 4.01, or by default, a combination of both (the full HTML5 specification including older HTML 4.01 elements for compatibility).
The user can configure the choice of standard, html5 (default), html5-strict, or html4 using the schema option. Based on the choice of standard, the schema processes and transforms the nodes and attributes of the input into a compliant output.
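As a minimal sketch, the schema option sits alongside the rest of the editor configuration. The selector value below is a hypothetical placeholder for whichever element hosts the editor.

```javascript
// Editor configuration selecting the strict HTML5 schema.
const editorConfig = {
  selector: 'textarea#editor', // hypothetical target element
  schema: 'html5-strict',      // one of 'html5' (default), 'html5-strict', 'html4'
};

// In a page where TinyMCE is loaded, the config would be passed to the editor:
// tinymce.init(editorConfig);
```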
If the user inputs the following:
<p>
<a href="about:blank" name="Lorem ipsum">
Lorem ipsum
</a>
</p>
into TinyMCE with schema: html5-strict, the editor’s content processing engine transforms that into an HTML5-compliant output:
<p>
<a href="about:blank">Lorem ipsum</a>
</p>
because the name attribute of the <a> element has been removed in the HTML5 standard.
As another example, if the user inputs
<p><input name="Lorem ipsum" type="text" autofocus=""></p>
into TinyMCE with schema: html4, the editor’s content processing engine transforms that into an HTML 4.01-compliant output:
<p><input name="Lorem ipsum" type="text"></p>
because the autofocus attribute was not part of the HTML 4.01 standard.
The user can further customize the validation criteria by:
- Defining their own valid elements and attributes using the valid_elements option
- Extending the existing standard using extended_valid_elements
- Defining valid child elements using valid_children
- Defining invalid elements and styles using the invalid_elements and invalid_styles options.
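A configuration combining several of these options might look like the sketch below. The option names come from the list above; the selector and the specific element, attribute, and style values are assumptions chosen for illustration and should be adapted to the content your app needs to allow.

```javascript
// Illustrative content filtering rules; the values are example choices only.
const filteringConfig = {
  selector: 'textarea#editor',                     // hypothetical target element
  valid_elements: 'p,strong/b,em/i,a[href|title]', // elements and attributes to keep
  extended_valid_elements: 'img[src|alt]',         // additions on top of the list above
  invalid_elements: 'iframe',                      // elements to always remove
  invalid_styles: 'color font-size',               // inline styles to strip
};

// In a page where TinyMCE is loaded: tinymce.init(filteringConfig);
```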
For more information on how you can customize the validation criteria, learn about Content Filtering in our docs.
Input sanitization and validation in your app
Ensuring the security of your SaaS app requires a multi-layered approach, with input sanitization and validation playing a pivotal role in protecting against attackers. By implementing robust input sanitization and validation techniques, you can mitigate the risk of popular attack vectors like cross-site scripting (XSS), protecting both you and your customers.
TinyMCE follows best practices for HTML validation and sanitization, so that you can rest easy knowing our editor won’t send malformed or risky HTML into your database. Learn more about what makes a secure rich text editor, and give TinyMCE a try in a free 14-day trial.