Last Updated on 14 October 2019 by Roberto De Pedrini
A very common task for anyone working with DB2 for i is importing information from external data sources and storing it in DB2 tables.
There are various methods and contexts in which data must be acquired from sources outside DB2. In this article we focus on importing data from CSV text files.
A CSV (Comma-Separated Values) file is a text file in which each line represents a record and, within each line, the “fields” are separated by a special character.
The reference document that describes the CSV format is RFC 4180 which reads as follows in the introductory part:
The comma separated values format (CSV) has been used for exchanging and converting data between various spreadsheet programs for quite some time. Surprisingly, while this format is very common, it has never been formally documented.
Separator character
The character that delimits the fields can obviously be the comma “,”, as the name of the format itself suggests, but other characters can be used as well. The most common are: tab, semicolon “;”, pipe “|” and space “ ”.
The field separator is often not something you can choose: when the file is received from an external source that you cannot control, you simply have to adapt. When I can choose the separator myself, I personally prefer either the pipe or the tab, i.e. characters that very rarely appear in the content of the fields.
Main rules for a “well-formed” csv file
As RFC 4180 highlights, there is NO recognized standard for the CSV format, so there is a variety of interpretations. In any case, RFC 4180 attempts to define some rules for writing a CSV file “well”. Here is a summary of the main rules:
- each line represents a record and each line must end with CRLF characters
- optionally the first line of the file can contain the column headings
- in each line one or more fields can be present. In any case all the lines must contain the same number of fields
- optionally, each field can be enclosed in double quotes (N.B. Microsoft Excel does not use double quotes)
- if the fields are enclosed in double quotes and a field contains a double quote, that double quote must be doubled
It is not described in RFC 4180, but the first line of a CSV file can contain the parameter
sep=<delimiter>
to indicate which character is used as the field separator, although I have never actually seen a CSV file containing this parameter.
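As a quick illustration, here is a small, invented CSV file that follows the rules above: a header line, alphanumeric fields enclosed in double quotes, a doubled double quote inside the second record, semicolon as field separator and comma as decimal separator (the same layout assumed by the CPYFRMIMPF example later in the article):
CODART;DESART;PRICE
"A001";"Small widget";10,50
"A002";"Widget ""XL"" model";125,00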
Importing a csv file into DB2 for i
To import a CSV file into a DB2 table, you must first copy it to a directory of the “root” file system of the IFS and then use the CPYFRMIMPF command.
The CPYFRMIMPF command copies an “import file” (in our case the CSV file) into a physical or logical file.
In general, “import file” is a file created for the purpose of copying data between heterogeneous systems.
Before analyzing the parameters of the command, here are a few prerequisites and notes on how the import behaves.
The table must already exist in DB2 and the table fields must match the fields in the CSV file.
The data type of the DB2 table fields must be compatible with the contents of the CSV file.
In other words, if for example the 3rd field of the table is defined as numeric with length 5 and 2 decimals, the content of the 3rd field in all lines of the CSV file must either be null (empty) or contain a valid number with at most 3 integer digits and 2 decimal digits, with the decimal separator character present (either comma or dot).
For every error that DB2 finds while trying to import the contents of the various fields of the CSV file, a message is recorded in the job log and the line affected by the error is discarded. In most cases the message in the job log specifies exactly which field caused the problem.
N.B. The import DOES NOT stop at the first line that contains an error, but continues importing the following lines.
Let’s analyze in detail the most important parameters of the CPYFRMIMPF command (a complete example follows the list):
- RCDDLM: specifies the end-of-line character used in the CSV file. Normally the end-of-line sequence is *CRLF (carriage return + line feed). If you have any doubts about which end-of-line character is used, you can view the text file in hexadecimal mode (see the section below). The hexadecimal code of the CR character is 0D, that of the LF character is 0A.
- DTAFMT: *DLM specifies that you are importing a delimited text file.
- STRDLM: if in the CSV file the alphanumeric fields are enclosed in double quotes, specify *DBLQUOTE, otherwise specify *NONE.
- STRESCCHR: specifies the escape character. By default the escape character coincides with the string delimiter (STRDLM parameter). Setting this parameter to *STRDLM complies with rule no. 5 described in the previous paragraph.
- RMVBLANK: specifies whether blank characters must be removed. I recommend using the value *BOTH (leading and trailing blanks are removed).
- FLDDLM: specifies the field delimiter. The default value is the comma. To specify the tab as the field delimiter use the special value *TAB. The delimiter cannot be the space character.
- DECPNT: specifies the decimal separator. Note that the default value is *PERIOD (i.e. the dot). To use the comma as the decimal separator, specify *COMMA.
- RPLNULLVAL: if the target DB2 table does not allow null values and there are null fields in the CSV file (that is, consecutive field separators with nothing between them), you need to set this parameter to *FLDDFT so that the null value is replaced with the default value consistent with the data type of the DB2 field.
- RMVCOLNAM: if the first line of the CSV file contains the column headers, you must specify *YES so that it is excluded from the import.
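Putting it all together, a call might look like the one below. Library, table and path names are of course hypothetical, and the delimiter values must match your file (here a semicolon-delimited file with comma as decimal separator and a header line); MBROPT(*REPLACE) simply replaces the records already present in the target member:
CPYFRMIMPF FROMSTMF('/home/import/articles.csv')
           TOFILE(MYLIB/ARTICLES)
           MBROPT(*REPLACE)
           RCDDLM(*CRLF)
           DTAFMT(*DLM)
           STRDLM(*DBLQUOTE)
           RMVBLANK(*BOTH)
           FLDDLM(';')
           DECPNT(*COMMA)
           RPLNULLVAL(*FLDDFT)
           RMVCOLNAM(*YES)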
Using a file to record errors
The CPYFRMIMPF command also allows you to copy the lines of the CSV file that could not be imported into an “error” file. It is a perhaps little-known possibility, but very convenient for identifying the rejected lines more easily, especially when the CSV file contains hundreds or thousands of lines.
The parameters used to manage the error file are the following (an example follows the list):
- ERRRCDFILE: specifies the file in which the rejected records will be written. I recommend creating a source file long enough to contain an entire line of the CSV file. For example, if the CSV file can have lines up to 500 characters long, the source file must be created with a record length of at least 500 (a little margin does not hurt):
CRTSRCPF FILE(QGPL/ERRORS) RCDLEN(550) MBR(*FILE) SIZE(*NOMAX)
- ERRRCDOPT: I recommend specifying *REPLACE so that the error file is replaced at each execution of the CPYFRMIMPF command.
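For example, the two parameters can simply be added to the CPYFRMIMPF call shown earlier (QGPL/ERRORS is the hypothetical source file created above):
ERRRCDFILE(QGPL/ERRORS) ERRRCDOPT(*REPLACE)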
Additional information
- the fields of the text file that are longer than the corresponding field of the table will be truncated
- two consecutive field delimiters, two consecutive string delimiters, or a field delimiter followed by a record delimiter mean that the corresponding field of the table will be set to null
How to display the end-of-line character
If you have doubts about the end-of-line character of a csv file, you can open and check the last characters of the line by displaying them in hexadecimal mode.
Open the CSV file stored in a folder on the IFS with the DSPF command, or use option 5=Display from the WRKLNK command.
Move the view to the right until the end of the line is shown: in the column positioning field, type the position of the last column of the text file, then press F10=Hexadecimal display.
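For example, with a hypothetical path:
DSPF STMF('/home/import/articles.csv')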
Import text files with fixed columns (fixed data)
In case you need to import text files with fixed columns into DB2, you can still use the CPYFRMIMPF command.
In addition to what has been described so far, you must provide a definition of the layout of the file to be imported, i.e. an FDF (Field Definition File).
Each line of the FDF file contains the following information:
- field name (corresponding to the system field name of the physical file written in uppercase)
- initial position in the text file
- final position in the text file
- position of the null value indicator in the text file (if the text file does not have the null value indicator specify 0)
The 4 values must be separated by a space.
Other useful information relating to the FDF file:
- blank lines are ignored
- comment lines are preceded by a double dash --
- the last line can be *END; the following lines will be ignored (i.e. they will be treated as comments)
- it is not necessary to specify all the fields of the physical file in the FDF file. The fields not defined in the FDF file will be set to the default value
- the order in which the fields are written in the FDF file is not important
- instead of the field name you can specify the keyword *COL. In this case, however, the FDF file must list the fields in their exact order.
- the FDF file must be written either in a source file or in a physical file with only one field
The null value indicator in the text file must have the values Y or N.
FDF file example:
-- item number
CODART 1 15 0
-- article description
DESART 16 50 0
-- list price
PRICE 51 63 0
*END
these lines after *END will be ignored
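To run the fixed-format import, the command uses DTAFMT(*FIXED) and points at the FDF through the FLDDFNFILE parameter. A minimal sketch, with hypothetical library, file and path names (the FDF is assumed to be stored in member ARTFDF of the source file QGPL/ARTFDF):
CPYFRMIMPF FROMSTMF('/home/import/articles.txt')
           TOFILE(MYLIB/ARTICLES)
           MBROPT(*REPLACE)
           DTAFMT(*FIXED)
           FLDDFNFILE(QGPL/ARTFDF ARTFDF)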
Bibliography
- RFC 4180, “Common Format and MIME Type for Comma-Separated Values (CSV) Files”, October 2005: https://www.ietf.org/rfc/rfc4180.txt
- Ted Holt, “CPYFRMIMPF And Fixed Data”, ITJungle, vol. 13 n. 18, Sep 25, 2013: https://www.itjungle.com/2013/09/25/fhg092513-story02/
Read this article and many others on the website https://www.markonetools.it