Using STRING_SPLIT with more than one character separator

Okay, a quick tip today on how to use STRING_SPLIT in SQL Server with more than one character separator.

STRING_SPLIT is pretty cool.  Back before the latest SQL Server release frenzy that's leaving our heads spinning, it was necessary to write sometimes complex procedures to split strings that contained commas or other separators into their various components.

Why?  The first rule of normalisation - 1NF - is why.  It states that values should be atomic, i.e. only breakable into one component value within the context of the datum.

What does this mean?  Surely then, a date, like '2008-01-01', is not in 1NF since it can be broken down into day, month and year - and thereafter into individual numerals?  This is a flawed argument for two reasons - the first is that a full date like this is correct to context, i.e. breaking down the date would detract from the value of the whole data point.  Secondly, dates are stored internally in a different format to the one shown anyway!

I digress. 

It's not unusual for data to be presented to SQL Server like this:

'Sherman, P., 15 Wallaby Way, Sydney, Australia'

Which presents a problem when we're trying to fit that data into a schema with a Person entity and Address entity (and maybe a Country entity too).

So, we can either use horrible multi-part string-splitting mechanisms as detailed on StackOverflow here ...

Or create our own function like this...

Or we can use STRING_SPLIT like this:

DECLARE @MyGodawfulString VARCHAR(100) = 'Sherman, P., 15 Wallaby Way, Sydney, Australia'
DECLARE @SplitStrings TABLE ( Word VARCHAR(20) ) 
INSERT INTO @SplitStrings 
    SELECT * FROM STRING_SPLIT(@MyGodawfulString, ',') 
-- Do something with the resulting table

Or use it on another pre-existing table (GodawfulStringsToSplit), instead of on a @var.  This example assumes the first and second columns are populated.  (The third column could be updated afterwards using something like ROW_NUMBER() OVER ( PARTITION_BY GodawfulString ) to get a set of Word IDs representing each word extracted from each GodawfulString):

SELECT GodawfulStringID, GodawfulString, WordID, value 
FROM GodawfulStringsToSplit 

But what happens if we try to use STRING_SPLIT with a multi-character separator, like this?

SELECT * FROM STRING_SPLIT(@GodawfulString, ',,')

Msg 214, Level 16, State 11, Line 67
Procedure expects parameter 'separator' of type 'nchar(1)/nvarchar(1)'.

Ah.  That's a no, then.

Multi-character separators are useful.  Line endings/carriage returns vary between operating systems, with \r\n, \r or \n differing depending.  Typically a full CR\LF can be represented as CHAR(13) + CHAR(10) in SQL Server, which of course is multi-character.

There's other use-cases too - I ran into difficulties when trying to split a string (academic reference in Harvard format) by '.,', to separate out authors.  I couldn't use ',' since this was also a separator for first name / surname.

So what's the solution then?  Without further ado, it's really rather simple - just replace your multi-separator with another character that you KNOW will not occur in your data.  For example, a pipe:

SELECT * FROM STRING_SPLIT(REPLACE(@MyVariable, '.,', '|'), '|')

This is a great way of swapping out pesky multi-character separators. 

Beware that if you're CROSS-APPLYing against a result set, the REPLACE operation might slow you down especially with large row numbers, so this isn't the only solution - consider instead adding a calculated column REPLACing the multi-characters (or persisting it in some other way) in these cases so you can use an ordinary STRING_SPLIT with a single character.

Hope this was useful.  Feel free to leave comments below.


Popular Posts