In the example below I have used three Microsoft Excel functions to identify and then correct a common mistake that occurs when first name and surname are captured in a single column. I also include a very real-world solution to a problem often encountered when working with data: “dirty data”.
In this post:
The Full Name column holds two pieces of data: a name and a surname. Whilst you could argue that the separate data holds no significance for the solution to this problem, we will separate it nonetheless. The names also contain leading and trailing spaces as well as double spaces. I have used the
LEN function to highlight the difference in length (number of characters) of the “dirty” and “clean” strings.
The string “Anna Karenina” appears in cell B1. A cursory inspection suggests that there are 13 characters (4 letters, space, 8 letters) in the string. Using the
LEN function in cell C2 we establish that there are actually 15 characters in the string. This is as a result of there being 3 spaces between the name and surname instead of only 1.
In the above example,
LEN has simply been used to highlight values that include characters other than letters or digits — in this instance spaces. Compare the values returned by the LEN function in column C to the value returned by the same function in column G. “Cleaning” data in this manner is a very real-world task.
The TRIM function (removes leading and trailing spaces as well as extra spaces in a string). The TRIM function takes one argument: a string or a reference to a cell containing a string.
The CLEAN function (removes special characters) clean up our data for us. The CLEAN function takes one argument: a string or a reference to a cell containing a string.
The FIND function finds a substring within a string and returns the position of the first instance of that substring in the main string. In the above example, we search for the space in the Full Name string with the assumption that everything before that position is the first name and everything after that position, is the surname (this is a dangerous assumption — more about this later!).
Consider the record with the ID 3: cell D4 reports that the position of the space is 1. After cleaning the data, we can see that the space that we are really interested in — the space between the name and surname — is in position 11 (cell G3)
It should now be clear to you why we took so much trouble to remove extra spaces from the strings (values) in the Full Name column; a leading space (a space at the beginning) results in the FIND function returning 1. You will understand during the next 2 steps how that would have created problems solving our problem.
The LEFT function returns a specified number of characters from a string, starting from the left-hand side. For record 2, the FIND function reports that the first space in the string “Buhle Motsepe” occurs at position 6. We now know that all the characters to the left (before) position 6 will be the first name, “Buhle” and can therefore use 6-1 which is 5 as the second argument for the LEFT function, resulting in the function:
We naturally want this function to be dynamic so we use a cell reference in place of the actual number:
We could of course combine the functions and perform the operation in one step:
=LEFT(E3, FIND(" ",E3))
The RIGHT function returns a specified number of characters from a string, starting from the right-hand side.
Because we will be counting from the right, we calculate:
- the length of the entire string with the LEN function:
=LEN(E3)which returns 13
- subtract the position of the space (which we already have in G3): 13-6 = 7
We then use RIGHT to return the surname:
Again, using the same logic as we used to make the LEFT function dynamic, we end up with:
You would have to be pretty brave to complete this is one step:
=RIGHT(E3, LEN(E3)-FIND(" ",E3))
This solution is fairly robust, however, should a name be entered where the first name consists of two names separated by a space, or a name, second name and surname are entered, then this solution will not provide results that are 100% correct.
This is just one more reason why you should always ensure that data is entered with separate columns or fields for each of the smallest meaningful values.