andrewjwelch.com

A CSV to XML converter in XSLT 2.0

This version has now been superseded by Version 2 - this version will not work for csv files that contain quoted values. If you want to use the tranform, use Version 2.
I wrote a rudimentary csv to XML converter a while back which broke when the csv contained quoted values, eg foo, "foo, bar", bar Dealing with these quotes is surprisingly hard, especially when you take into account quotes are escaped by doubling them.
I raised it on xsl-list, and Abel Braaksma came up with a genious solution - the technique is to use both sides of analyze-string.
For example, this sample csv:
Col 1, Col 2, Col 3
foo, "foo,bar", "foo:""bar"""
...is transformed into this XML:
<root>
    
<row>
       
<elem name="Col 1">foo</elem>
       
<elem name="Col 2">foo,bar</elem>
       
<elem name="Col 3">foo:"bar"</elem>
    
</row>
</root>
Here's the transform:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:fn="fn" xmlns:xs="http://www.w3.org/2001/XMLSchema" version="2.0" exclude-result-prefixes="xs fn">

<xsl:output indent="yes" encoding="US-ASCII" />

<xsl:param name="pathToCSV" select="'file:///c:/csv.csv'" />

<xsl:function name="fn:getTokens" as="xs:string+">
    
<xsl:param name="str" as="xs:string" />
    
<xsl:analyze-string regex="("[^"]*")+" select="$str">
       
<xsl:matching-substring>
           
<xsl:sequence select="replace(., "^""|""$|("")""", "$1")" />
       
</xsl:matching-substring>
       
<xsl:non-matching-substring>
           
<xsl:for-each select="tokenize(., '\s*,\s*')">
               
<xsl:sequence select="." />
           
</xsl:for-each>
       
</xsl:non-matching-substring>
    
</xsl:analyze-string>
</xsl:function>

<xsl:template match="/" name="main">
    
<xsl:choose>
        
<xsl:when test="unparsed-text-available($pathToCSV)">
            
<xsl:variable name="csv" select="unparsed-text($pathToCSV)" />
            
<xsl:variable name="lines" select="tokenize($csv, ' ')" as="xs:string+" />
            
<xsl:variable name="elemNames" select="fn:getTokens($lines[1])" as="xs:string+" />
            
<root>
                
<xsl:for-each select="$lines[position() &gt; 1]">
                    
<row>
                        
<xsl:variable name="lineItems" select="fn:getTokens(.)" as="xs:string+" />

                        
<xsl:for-each select="$elemNames">
                            
<xsl:variable name="pos" select="position()" />
                            
<elem name="{.}">
                                
<xsl:value-of select="$lineItems[$pos]" />
                            
</elem>
                        
</xsl:for-each>
                    
</row>
                
</xsl:for-each>
            
</root>
        
</xsl:when>
        
<xsl:otherwise>
            
<xsl:text>Cannot locate : </xsl:text><xsl:value-of select="$pathToCSV" />
        
</xsl:otherwise>
    
</xsl:choose>
</xsl:template>

</xsl:stylesheet>
To keep the transform generic I've used an attribute instead of an element for the column names to cope with names that aren't valid QNames (for example ones that contain a space) - for my own use would add a function to convert names to valid QNames and then change to as it generates nicer XML.




top