andrewjwelch.com

A CSV to XML converter in XSLT 2.0 (version 2)

The previous version of the CSV to XML converter had a bug where extra tokens would be inserted either side of quoted values.
I raised it again on xsl-list, and Mike Kay suggested adding a comma to the end of the string to help make an XPath regex feasible.
That solution was pretty close, but didn't cope with nested quotes such as "foo, ""bar""". This is what I've come up with - it's a combination of Mike's and Abel's suggestions, with a modified regex to handle nested quotes:
<xsl:stylesheet
  xmlns:xsl
="http://www.w3.org/1999/XSL/Transform"
  xmlns:fn
="fn"
  xmlns:xs
="http://www.w3.org/2001/XMLSchema"
  version="2.0" exclude-result-prefixes="xs fn">

<xsl:output indent="yes" encoding="US-ASCII" />

<xsl:param name="pathToCSV" select="'file:///c:/csv.csv'" />

<xsl:function name="fn:getTokens" as="xs:string+">
    
<xsl:param name="str" as="xs:string" />
    
<xsl:analyze-string select="concat($str, ',')" regex='(("[^"]*")+|[^,]*),'>
        
<xsl:matching-substring>
        
<xsl:sequence select='replace(regex-group(1), "^""|""$|("")""", "$1")' />
        
</xsl:matching-substring>
    
</xsl:analyze-string>
</xsl:function>

<xsl:template match="/" name="main">
    
<xsl:choose>
    
<xsl:when test="unparsed-text-available($pathToCSV)">
        
<xsl:variable name="csv" select="unparsed-text($pathToCSV)" />
        
<xsl:variable name="lines" select="tokenize($csv, ' ')" as="xs:string+" />
        
<xsl:variable name="elemNames" select="fn:getTokens($lines[1])" as="xs:string+" />
        
<root>
        
<xsl:for-each select="$lines[position() &gt; 1]">
            
<row>
            
<xsl:variable name="lineItems" select="fn:getTokens(.)" as="xs:string+" />

            
<xsl:for-each select="$elemNames">
                
<xsl:variable name="pos" select="position()" />
                
<elem name="{.}">
                
<xsl:value-of select="$lineItems[$pos]" />
                
</elem>
            
</xsl:for-each>
            
</row>
        
</xsl:for-each>
        
</root>
    
</xsl:when>
    
<xsl:otherwise>
        
<xsl:text>Cannot locate : </xsl:text><xsl:value-of select="$pathToCSV" />
    
</xsl:otherwise>
    
</xsl:choose>
</xsl:template>

</xsl:stylesheet>
I've tested the transform with the following template - these few cover the edge cases and all pass.
<xsl:template match="/">
    
<xsl:variable name="tests">
    
<tests>
        
<test>
            
<input>a,b,c</input>
            
<expected><n>a</n><n>b</n><n>c</n></expected>
        
</test>
        
<test>
            
<input>,,</input>
            
<expected><n /><n /><n /></expected>
        
</test>
        
<test>
            
<input>"a","b","c"</input>
            
<expected><n>a</n><n>b</n><n>c</n></expected>
        
</test>
        
<test>
            
<input>"foo,bar",,"foo,bar"</input>
            
<expected><n>foo,bar</n><n /><n>foo,bar</n></expected>
        
</test>
        
<test>
            
<input>"foo,""bar""",,</input>
            
<expected><n>foo,"bar"</n><n /><n /></expected>
        
</test>
        
<test>
            
<input>,,"foo,""bar"""</input>
            
<expected><n /><n /><n>foo,"bar"</n></expected>
        
</test>
    
</tests>
    
</xsl:variable>

    
<xsl:for-each select="$tests/tests/test">
        
<xsl:variable name="tokens" select="fn:getTokens(input/text())" as="xs:string+" />
        
<xsl:variable name="result" select="expected/n" as="xs:string+" />
        
<xsl:variable name="tokenStr" select="string-join($tokens,',')" as="xs:string" />
        
<xsl:variable name="resultStr" select="string-join($result, ',')" as="xs:string" />
        
<xsl:sequence select="if ($tokenStr = $resultStr) then (concat(' Passed: (', $tokenStr, ')')) else concat(' Failed: (', $tokenStr, ') (', $resultStr, ')')" />
    
</xsl:for-each>
</xsl:template>
If this transform fails for a valid CSV file, let me know!





Instructions

This transform is a "standalone" transform, which means it doesn't expect input XML. Because there is no input, you must specify which template the transform should start at - instructions are below on how to do that.
You need to tell the transform where your CSV is. You can do that by supplying the parameter "pathToCSV" to the transform, or by editing the default value in the XSLT.
For help, please don't email me directly, use the xsl-list for general XSLT questions, the kernow-help for help with Kernow and the saxon-help list for help with Saxon. I'm on all of these lists so I will still get the email, but it makes it much more likely you will get response.

Instructions for running the transform using Kernow

- Download Kernow
- Download the transform
- Run Kernow, go to the Standalone tab and open the csv-to-xml.XSLT file
- Click "auto" button to discover the Initial Template - it should say "main"
- Click the Params button, click "Auto Populate", then double click the Value entry to edit the path to your CSV file
- The path must be a valid URI, which basically means for Windows use "file:/c:/" at the start and then forward slashes as the separator. For example, if your CSV file was called "test.csv" in the "dev" folder of your C: drive, you would use file:/c:/dev/test.csv
- Close the params dialog (ensuring you have pressed enter to commit the modification to the value entry).
- Run the transform!
- The results are displayed in the output window.
- Tick the "send output to file" checkbox, type the path in directly or use the file chooser then re-run the transform to write the output to a file
- That's it all done :)

Instructions for running the transform from the command line using Saxon

- Download Saxon
- Download the transform
- Use the command java -cp saxon.jar net.sf.saxon.Transform -o output.xml -it main csv-to-xml.xslt pathToCSV=file:/C:/dev/test.csv
- You will need to modify the -cp switch to correctly configure your classpath (if it isn't already done), and the other switches to use the correct paths
- If you get a NoClassDefFoundError then you classpath isn't configure correctly