Convert HTML to Plain Text in C# using Markdown

While working on my customisations to Tim Geyssens' MailEngine, I was looking for an accurate way to automatically create a plain-text version of the HTML emails sent out by the site. Further reading brought Markdown to my attention. After some hunting around, with a little help from my friend Google, I managed to find a Markdown XSLT file. Using the XSLT, I could transform my HTML email to plain text with relative ease and accuracy. Of course, the transform needs a well-formed XML document as input, and as my pages were already valid XHTML I had no problems there.
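Since the whole approach depends on the HTML being well-formed XML, it can help to check that before attempting the transform. The following is only a minimal sketch: `XhtmlCheck` and `IsWellFormedXml` are hypothetical names of my own choosing, and the check relies on nothing more than `XmlDocument.LoadXml` throwing an `XmlException` for markup it cannot parse.

```csharp
using System;
using System.Xml;

public static class XhtmlCheck
{
    // Hypothetical helper: returns true when the markup parses as well-formed XML.
    // On failure, 'error' carries the parser's message (line and position included).
    public static bool IsWellFormedXml(string markup, out string error)
    {
        try
        {
            // LoadXml parses the whole string and throws XmlException on the first problem.
            new XmlDocument().LoadXml(markup);
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

With this sketch, `<p>Hello<br/>world</p>` passes, while `<p>Hello<br>world</p>` (unclosed `br`) and `<p>a&nbsp;b</p>` (an entity that is not declared without a DTD) are both rejected.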

Here is my method for doing the conversion. All it requires is that you pass it the HTML you want to convert, which must be well-formed XML:

using System.IO;
using System.Web;
using System.Xml;
using System.Xml.Xsl;

/// <summary>
/// Converts HTML to plain text.
/// </summary>
/// <param name="html">The HTML to convert; it must be well-formed XML (e.g. XHTML).</param>
/// <returns>The plain-text representation of the HTML.</returns>
private static string ConvertToText(string html)
{
    // Parse the markup. LoadXml throws an XmlException if it is not well-formed
    // or if it uses undeclared entities such as &nbsp;.
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.LoadXml(html);

    // Load and compile the Markdown stylesheet.
    // XslCompiledTransform is the supported XSLT 1.0 processor (XslTransform is obsolete).
    XslCompiledTransform xslt = new XslCompiledTransform();
    xslt.Load(HttpContext.Current.Server.MapPath("/xslt/Markdown.xslt"));

    // Run the transform and capture the output as a string.
    using (StringWriter writer = new StringWriter())
    {
        xslt.Transform(xmlDoc, null, writer);
        return writer.ToString();
    }
}
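To see the same load, compile and transform steps without the web context or the downloaded stylesheet, here is a rough, self-contained sketch. The `PlainTextDemo` class and its tiny inline stylesheet are my own illustrative stand-ins, not the Markdown XSLT itself: the stylesheet only handles `h1` and `p` elements and assumes the input has no XML namespace.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class PlainTextDemo
{
    // Hypothetical stand-in for Markdown.xslt: handles only <h1> and <p>,
    // and drops any text that sits outside those elements.
    private const string TinyStylesheet = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:template match=""h1""><xsl:text># </xsl:text><xsl:value-of select="".""/><xsl:text>&#10;&#10;</xsl:text></xsl:template>
  <xsl:template match=""p""><xsl:value-of select="".""/><xsl:text>&#10;&#10;</xsl:text></xsl:template>
  <xsl:template match=""text()""/>
</xsl:stylesheet>";

    public static string Convert(string html)
    {
        // The input must be well-formed XML, exactly as with the real stylesheet.
        XmlDocument xmlDoc = new XmlDocument();
        xmlDoc.LoadXml(html);

        // Compile the stylesheet from the in-memory string instead of a file path.
        XslCompiledTransform xslt = new XslCompiledTransform();
        using (XmlReader stylesheet = XmlReader.Create(new StringReader(TinyStylesheet)))
        {
            xslt.Load(stylesheet);
        }

        // Transform into a StringWriter and return the text output.
        using (StringWriter writer = new StringWriter())
        {
            xslt.Transform(xmlDoc, null, writer);
            return writer.ToString();
        }
    }
}
```

Calling `PlainTextDemo.Convert("<html><body><h1>Hello</h1><p>Plain text body.</p></body></html>")` should return the heading as `# Hello`, followed by a blank line and then the paragraph text. The real Markdown stylesheet covers far more elements, so treat this only as a way to experiment with the mechanics.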

Download the XSLT file I used from here:

http://www.getsymphony.com/download/xslt-utilities/view/20573/

I would love to hear from anyone who does this differently, or who can find any problems with the method I have chosen for this solution.
