We are probably most comfortable seeing data in a structure such as a table; tabulated data is easy to follow as the rows and columns make the relationship between the values noticeably clear. Text file data formats are not quite as easily “human-readable” (after all, that is why we have computers), but very useful when transferring data from one system to another where those systems’ file formats are not compatible (for example, transferring data from one proprietary application to another).

1. Tabular data

It is incredibly easy to see that all the data about young Bieber is in the same row, making it easy to find his cellphone number:

empIDempNameempSurnameempCodeempCellempGender
1JustinBieberBIJU590844233359M
2SelenaGomezGOSE950834567895F
3NeilDiamondDINE360641255536M
4JackJohnsonJOJA890824533789M
5BillieEilishEIBI990846658799F

Don’t, at any stage, be fooled into thinking that any of this can be accomplished manually just because the example datasets are only a few records worth!

2. CSV data: same-same but different

The exact same data can be represented as follows:

empID,empName,empSurname,empCode,empCell,empGender
1,Justin,Bieber,BIJU59,0844233359,M
2,Selena,Gomez,GOSE95,0834567895,F
3,Neil,Diamond,DINE36,0641255536,M
4,Jack,Johnson,JOJA89,0824533789,M
5,Billie,Eilish,EIBI99,0846658799,F

The data is again organised in rows, making it easy to see the values related to empID 1 — Ms Eilish — for example. However, the fields (columns) do not exist, instead, the values in each record (row) are separated by a comma, making it a bit more challenging to match the values to their fields.

The most commonly occurring delimiter is in fact the comma and CSV text files are a common way to store and transfer data.

3. Text Qualifiers

The same data appears below, this time with a semi-colon as a delimiter and quotation marks as Text Qualifiers. In each instance where a value is a text string, the value is enclosed in quotation marks. Note that this is not necessary with values that are numbers (unless of course, you have thousand separators that match the delimiters — see 4.1 below!):

"empID";"empName";"empSurname";"empCode";"empCell";"empGender"
1;"Justin";"Bieber";"BIJU59";"0844233359";"M"
2;"Selena";"Gomez";"GOSE95";"0834567895";"F"
3;"Neil";"Diamond";"DINE36";"0641255536";"M"
4;"Jack";"Johnson";"JOJA89";"0824533789";"M"
5;"Billie";"Eilish";"EIBI99";"0846658799";"F"

The text qualifiers allow for the instance where the delimiter might appear as part of a text string data value. This seems like a highly unlikely scenario given the data we are looking at; we will look at an example in a moment where this would be necessary.

4. More about delimiters

delimiter is a sequence of one or more characters for specifying the boundary between separate, independent regions in plain textmathematical expressions or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values. Another example of a delimiter is the time gap used to separate letters and words in the transmission of Morse code.

https://en.wikipedia.org/wiki/Delimiter

4.1 Currency & thousand separators

You must inspect your data before importing it to understand any challenges that you may need to deal with. A common issue occurs when values representing currency are imported. The same issue can occur if thousand separators are present in data values, for example where 1000 is expressed as 1,000 (thousand separators are quite old-fashioned). For that reason, it is internationally accepted that the space be used as the thousand separators when one is required: 1 000, 10 000, 100 000, 1 000 000.

In the data below, an hourly rate has been added. We can see the problem clearly when a comma as a thousand separator is present: is the comma acting as a delimiter or as a thousand separator?

empID,empName,empSurname,empCode,empCell,empGender,empHourlyRate
1,Justin,Bieber,BIJU59,0844233359,M,500.00
2,Selena,Gomez,GOSE95,0834567895,F,1,000.00
3,Neil,Diamond,DINE36,0641255536,M,1,500.00
4,Jack,Johnson,JOJA89,0824533789,M,1,000.00
5,Billie,Eilish,EIBI99,0846658799,F,2,000.00

There are two workable solutions here. The first is to deal with the issue as it occurs during the import process, and the second is to use delimiters. The use of delimiters assumes you are in control of where the data is coming from and can determine its format.

If we have control over the data and the software system storing and processing the data, we could change the delimiter from a comma to a semi-colon (changing your actual data is not a good idea and is possibly not even possible):

empID;empName;empSurname;empCode;empCell;empGender;empHourlyRate
1;Justin;Bieber;BIJU59;0844233359;M;500.00
2;Selena;Gomez;GOSE95;0834567895;F;1,000.00
3;Neil;Diamond;DINE36;0641255536;M;1,500.00
4;Jack;Johnson;JOJA89;0824533789;M;1,000.00
5;Billie;Eilish;EIBI99;0846658799;F;2,000.00

Alternatively, you could also use text qualifiers. When we place a character string in quotation marks, it denotes the string as a literal string. In the data below, in line 3, the quotation marks surrounding the string 1,500.00 result in the comma being treated literally as a comma, not a delimiter. In simple terms, the string between quotes is treated as one piece of data.

"empID","empName","empSurname","empCode","empCell","empGender","empHourlyRate"
"1","Justin","Bieber","BIJU59","0844233359","M","500.00"
"2","Selena","Gomez","GOSE95","0834567895","F","1,000.00"
"3","Neil","Diamond","DINE36","0641255536","M","1,500.00"
"4","Jack","Johnson","JOJA89","0824533789","M","1,000.00"
"5","Billie","Eilish"","EIBI99","0846658799"","F"","2,000.00"

5. Example

Delimiters and thousand separators are defined in system settings, they are operating system options. This means that the computer you use at school and the computer you use at home or elsewhere may treat them differently. The following exercise will help you understand how this works.

The data in the following two files is identical. One uses commas as delimiters, the other semi-colons.

Download both files.

Open both files with Excel. You will see something similar to the below screenshots if your computer is set to the comma as the delimiter.

Screenshot of comma delimited data opened in Excel on a system with the comma set as the delimiter.Screenshot of semi-colon delimited data opened in Excel on a system with the commas set as the delimiter.

In my case, I would have a problem with semi-colon-delimited data. What would my solution be?

Use Excel’s Text to Columns command to separate the data into two columns, and then resave the data as a CSV file!

By MisterFoxOnline

Mister Fox AKA @MisterFoxOnline is an ICT, IT and CAT Teacher who has just finished training as a Young Engineers instructor. He has a passion for technology and loves to find solutions to problems using the skills he has learned in the course of his IT career.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.