![]() |
Excel String Manipulation
SPSS® Reference Manual: A guide for market researchers
Prepared by Paul Hartzer
Contents
Introduction
In computer terms, a string is any set of letters and numbers, like "Dog" or "754." String functions combine strings together (such as making "Dogfood" out of "Dog" and "food"), break them apart (such as making "Dog" out of "Dogfood"), or otherwise changing them (such as replacing all the "o"s in "Dogfood" with "a" to make "Dagfaad").
The creation of syntax files to load data from ASCII files into SPSS is frequently simplified by using Excel's string functions. There are two reasons for this:
For instance, here's a portion of a data layout file:
| VarID | Start | End | Punch | Question | Answer |
| snd1_1 | 16 | 19 | 0001 | What is your age? | |
| sqd2 | 20 | 20 | 1 | What is your gender? | Male |
| sqd2 | 20 | 20 | 2 | What is your gender? | Female |
| sod3_1 | 21 | 27 | 0000001 | For demographic purposes, please provide your postal code. | |
| sqa1 | 29 | 30 | 01 | When will you next buy a new vehicle? | Within 3 months |
| sqa1 | 29 | 30 | 02 | When will you next buy a new vehicle? | In 3-6 months |
| sqa1 | 29 | 30 | 03 | When will you next buy a new vehicle? | In 7-12 months |
This has all the information you need to create all three SPSS statements within Excel, and then paste that into an SPSS syntax file. Here's what the SPSS syntax file would look like, for the relative portions:
DATA LIST FILE="C:\documents\datafile.dat" / snd1_1 16 - 19 sqd2 20 sod3_1 21 - 27 (3) sqa1 29 - 30 . VARIABLE LABELS snd1_1 "What is your age?" sqd2 "What is your gender?" sod3_1 "For demographic purposes, please provide your postal code." sqa1 "When will you next buy a new vehicle?" . VALUE LABELS sqd2 1 "Male" 2 "Female" /sqa1 01 "Within 3 months" 02 "In 3-6 months" 03 "In 7-12 months" .
Most of this code was generated using Excel string functions on the table, saving quite a bit of typing. These formulas were used:
Data List: =" " & A1 & " " & B1 & IF(C1>B1," - " & C1,"")
Variable Labels: =" " & A3 & " """ & F3 & """"
Value Labels: =" " & D2 & " """ & G2 & """"
There are numerous functions in Excel, but this section describes the more useful ones for string manipulation.
Concatenation
The most useful Excel string function is concatenation, combining multiple strings together. To do this, simply join two strings with &. If A1 = "Dog" and B1 = "food", then putting this in C1:
=A1 & B1
will create "Dogfood".
Useful note: Math operations will take place before concatenation, so you can do calculations, and then use the results as strings in concatenation. If this were in C1:
=A1 & (4 + 5) / 3 & B1
then C1 would contain "Dog3food" because (4 + 5) / 3 is calculated to 3, and then "3" is used in the concatenation.
Useful note: If you want to include specific text in your string, just put it in quotes. For instance, this in C1:
=A1 & " is " & B1
will create "Dog is food".
If you want to include a double quotation mark, use "" within the quoted string for each mark:
=A1 & " "" " & B1
will create "Dog " food".
=A1 & """" & B1
will create "Dog"food".
IF
IF is important for evaluating cell values. In the example in the introduction, we wanted to only include the end column value if it was larger than the starting value (that is, if the width of the data field was greater than one).
The function is IF(Condition, True-result, False-result). If column B has the start value and column C has the end value, then the function for D1 is
IF(B1>C1, " - " & C1, "")
Useful note: It's important to remember that IF will always return a value for both true and false. If you want the function to do nothing if the case is false, include "" (as shown).
IF can be nested (that is, used inside of other IF functions). If you wanted to find out the largest of three values, B1, C1, and D1, you could do it this way:
IF(B1>C1, IF(B1>D1, B1, D1), IF(D1>C1, D1, C1))
LEN
LEN returns the length of a specified string. If A1 = "Dog" and B1 = "Food", then:
LEN(A1 & B1)
will be 7. This function will be used below.
FIND
FIND returns the location of the first occurrence of a character (or characters) within a string. This is useful, for instance, if you have a series of labels like this:
LOYALTY: Being loyal to your family FAITH: Having a strong religion conviction DUTY: Working hard for society and self
and you want to use just the portion before the colon. A specific example will be shown below.
LEFT
LEFT returns the leftmost n characters in the string, where you specify n. For instance, if you want the three leftmost characters in B1, use this:
LEFT(B1, 3)
Looking at the list in FIND, if those values are in E1:E3, use this formula in F1:
=LEFT(E1, FIND(":", E1)-1)
That is, find the first colon in E1, then take all the characters to the left of it.
RIGHT
RIGHT returns the rightmost n characters in the string. For instance, if you want the three rightmost characters in B1, use this:
RIGHT(B1, 3)
Looking at the list in FIND, if those values are in E1:E3 and you want everything after the colon, use this formula in F1:
=RIGHT(E1, LEN(E1) - FIND(":", E1))
Note that this uses LEN as well, because we need to figure out how many total characters are in E1, and then subtract the number of characters up to the colon.
TRIM
TRIM removes spaces from both ends of a string. It's very straightforward; if A1 = " Dog" and B1 = "food ", then:
=TRIM(A1) & TRIM(B1)
will result in "Dogfood".
ROW and COLUMN
ROW returns the number of the current row. This is useful if you want to number a series of variables attr1, attr2, attr3, and so on. For instance, if your attribute variables start in row 75, use this to determine the variable name:
"attr" & ROW() - 74
This will result in attr1 in row 75, attr2 in row 76, and so on.
COLUMN returns the number of the current column. You can use ROW and COLUMN together if you have a matrix of variables. For instance, let's say there are six brands (rows) and seven attributes (columns). The brands are listed in column 1 and the attributes in row 1, so your table starts in B2. This might be the formula:
"attr" & COLUMN() - 1 & "_" & ROW() - 1
B2 would hold attr1_1. Filling the grid with the formula, you would have C2=attr2_1, B3=attr1_2, and so on.
TEXT
TEXT formats a number into a given format. This is useful if you have more than 9 variables in a series and want to keep them in order. If you name 10 variables attr1 to attr10, then attr10 will be sorted between attr1 and attr2. Instead, you could use attr01 to attr10, which is what TEXT is for.
TEXT takes two values: A number, and a format. "00" as a format says to convert the number to an integer with at least two digits ("0" on the left if needed). Using the example for ROW, you could change the formula to:
"attr" & TEXT(ROW() - 74, "00")
This will result in attr01, attr02, and so on (up to attr99; the next variable would be attr100, so if you have that many variables, use "000" instead).
UPPER, LOWER, and PROPER
There are two reasons for these functions:
UPPER is the most used of these three: It converts all the letters in a string to their upper case equivalents; "Dog" becomes "DOG". Let's say you have two cells (B1 and C1) and you want to see if they have the same words in them, but they might have extra characters and different capitalizations. You could do this in D1:
=IF(TRIM(UPPER(B1))=TRIM(UPPER(C1)),"same","different")
LOWER converts all the letters to lower case, while PROPER converts the first letter of each word to upper case and all the rest to lower case.
AND and OR
These aren't string functions per se, but they can be useful in more complicated assessments.
AND returns a true value if each listed value is true. For instance, another way to find the maximum of three values B1, C1, and D1 is:
IF(AND(B1>C1, B1>D1), B1, IF(AND(D1>C1, D1>B1), D1, C1))
That is, if B1 is greater than C1 AND B1 is greater than D1, THEN B1 is the greatest of those values. Otherwise, test D1; otherwise, it's C1.
OR returns a true value if ANY listed value is true (and none of the listed values are errors). Yet another way to determine the maximum of three values is:
IF(OR(B1This is, if B1 is less than C1 or D1, then it's not the greatest; otherwise, it is. If it's not the greatest, test D1. If D1's not the greatest, C1 is.
Useful note: The temptation is to put AND (and OR) between the values you want to assess, but that's not where it goes. As the example shows, the format is AND(case1, case2, ...). On the down side, this is something special to remember about Excel, if you're used to AND going in the middle. On the up side, it means you're not limited to two things per AND. If you want to see if B1 is the greatest of four values, for instance, you could do this:
IF(AND(B1>C1, B1>D1, B1>E1), "largest", "not largest")ISERR and ISNUMBER
Sometimes, you want to make sure that a calculation within a larger calculation will return a valid number. For instance, if you use FIND(C1, ":") and C1 doesn't contain a colon, you'll get an error.
ISERR is true if a calculation results in an error. Let's say you had these values in three cells (B1:B3):
LOYALTY--Being loyal to your family FAITH: Having a strong religion conviction DUTY: Working hard for society and selfYou know that each one has EITHER a colon or a hyphen, but you don't know which. You could use this, to calculate your string in one cell:
=LEFT(B1,IF(ISERR(FIND(":",B1)),FIND("-",B1),FIND(":",B1))-1)First, check to see if B1 contains a colon. If it does, use that to cut the string. If it doesn't, then assume it contains a hyphen and use that. Notice that since both FIND functions return a number, then the IF function will likewise return a number.
ISNUMBER is an alternate route; it is true if a calculation results in a number. Another way to do the same thing as above is:
=LEFT(B1,IF(ISNUMBER(FIND(":",B1)),FIND(":",B1),FIND("-",B1))-1)This is clearer to read, but ISERR is more flexible (and shorter!). Note that the FIND functions are reversed in the two examples.
Parting Comments
These functions can be mixed-and-matched to create quite powerful, but quite long, formulas in single Excel cells. Alternately, you can create shorter formulas in multiple cells. The important thing is: Find a method that works best for you.
Parentheses are especially tricky in longer formulas, so make sure not to lost track of these. Excel does provide color-pairing to help you keep organized, but even then, it might be better to break things up into multiple cells until you get comfortable.
Return to Contents - Back to Loading and Saving Data Files