Useful Unix Utilities to Data File Management


The Unix operating system has a number of utilities that can be very useful for pre-processing data files to be loaded with SQL*Loader. Even when the same functionality can be achieved through SQL*Loader, the utilities described

here will be much faster. Data warehousing applications, in particular, can benefit greatly from these utilities.

EXAMPLE 1
Let us assume a load with the following control file:
LOAD DATA
INFILE ‘example1.dat’
INTO TABLE emp
(empno        POSITION(01:04)    INTEGER EXTERNAL,
ename        POSITION(06:14)    CHAR,
job        POSITION(16:24) CHAR,
mgr        POSITION(26:29) INTEGER EXTERNAL,
sal        POSITION(31:37) DECIMAL EXTERNAL,
comm        POSITION(39:42) DECIMAL EXTERNAL,
deptno        POSITION(44:45)    INTEGER EXTERNAL)

Here are the contents of data file example1.dat:
7782 CLARK MANAGER 7839 2572.50 0.20 10
7839 KING PRESIDENT 5850.00 10
7654 MARTIN SALESMAN 7698 1894.00 0.15 30

EXAMPLE 2
Let us assume another load with the following control file:
LOAD DATA
INFILE ‘example2.dat’
INTO TABLE dept
FIELDS TERMINATED BY ‘,’ OPTIONALLY ENCLOSED BY ‘”‘
(deptno, dname, loc)

Below are the contents of data file example2.dat:
12,RESEARCH,”SARATOGA”
10,”ACCOUNTING”,CLEVELAND
13,”FINANCE”,”BOSTON”

PRESORTING THE DATA FILE
The performance of direct path loads can be significantly improved by presorting the input data on indexed columns. Pre-sorting minimizes the demand for temporary segments during the load.

The Unix command to be used for presorting is “sort”.
In Example 1, suppose you have added the SORTED INDEXES (empno) clause to the control file to indicate that fields in the data file are presorted on the EMPNO column. To do that presorting, you would enter at the Unix prompt:
% sort +0 -1 example1.dat > example1.srt

This will sort file example1.dat by its first field (by default fields are delimited by spaces and tabs) and send the output to file example1.srt:
7654 MARTIN SALESMAN 7698 1894.00 0.15 30
7782 CLARK MANAGER 7839 2572.50 0.20 10
7839 KING PRESIDENT 5850.00 10

In Example 2, if you wanted to sort file example2.dat by column DNAME, you would enter:
% sort -t, -d +1 -2 example2.dat > example2.srt
where “-t,” indicates that commas are delimiters, “-d” causes sort to consider only letters and digits in comparisons, and example2.srt is the output file:
10,”ACCOUNTING”,CLEVELAND
13,”FINANCE”,”BOSTON”
12,RESEARCH,”SARATOGA”

ELIMINATING FIELDS FROM THE DATA FILE
Often, it is necessary to remove one or more fields from all the records in the data file. The Unix command that does that is “cut”.
In Example 1, if you want to eliminate the COMM field altogether from the data file, enter at the Unix prompt:
% cut -c1-38,44- example1.dat > example1.cut
where the “-c” option specifies the character ranges that you want to extract from each record. The output file example1.cut contains:
7782 CLARK MANAGER 7839 2572.50 10
7839 KING PRESIDENT 5850.00 10
7654 MARTIN SALESMAN 7698 1894.00 30

In Example 2, to eliminate the LOC field from the data file, you would enter:
% cut -f1-2 -d, example2.dat > example2.cut
where “-f1-2” indicates you want to extract the first two fields of each record and “-d,” tells cut to treat comma as a delimiter. The output file example2.cut would contain:
12,RESEARCH
10,”ACCOUNTING”
13,”FINANCE”

REPLACING CHARACTERS IN THE DATA FILE
Two Unix commands can be used here: “tr” or “sed”.
For instance, if you want to replace all double quotes in the data file in Example 2 by single quotes, you may enter:
% cat example2.dat | tr \” \’ > example2.sqt
The piped “cat” is necessary because tr’s input source is the standard input. Single and double quotes are preceded by backslashes because they are special characters. The output file will be:
12,RESEARCH,’SARATOGA’
10,’ACCOUNTING’,CLEVELAND
13,’FINANCE’,’BOSTON’

Similarly, to substitute colons for commas as delimiters in Example 2, you may enter:
% sed ‘s/,/:/g’ example2.dat > example2.cln
The output would be:
12:RESEARCH:”SARATOGA”
10:”ACCOUNTING”:CLEVELAND
13:”FINANCE”:”BOSTON”

ELIMINATING CHARACTERS FROM THE DATA FILE
Just as for replacing characters, “tr” and “sed” can be used for eliminating them from the data file.
If you want to remove all double quotes from the data file in Example 2, you may type:
% cat example2.dat | tr -d \” > example2.noq
The contents of file example2.dat are piped to the tr process, in which the “-d” option stands for “delete”. The output file example2.noq would look like:
12,RESEARCH,SARATOGA
10,ACCOUNTING,CLEVELAND
13,FINANCE,BOSTON

An identical result would be obtained by using sed:
% sed ‘s/\”//g’ example2.dat > example2.noq
The string in single quotes indicates that double quotes should be replaced by an empty string globally in the input file. Another interesting usage of tr would be to squeeze multiple blanks between fields down to a single space character. That can be achieved by doing:

% cat example1.dat | tr -s ‘ ‘ ‘ ‘ > example1.sqz
The output file would look like:
7782 CLARK MANAGER 7839 2572.50 0.20 10
7839 KING PRESIDENT 5850.00 10
7654 MARTIN SALESMAN 7698 1894.00 0.15 30

INSERTING CHARACTERS INTO THE DATA FILE
A typical situation in which you may need to insert characters into the datafile would be to convert a fixed position data file into a delimited one. The data file in Example 1 is a fixed position one. To convert it into a file delimited by commas, you would enter
% cat example1.dat | tr -s ‘ ‘ ‘,’ > example1.dlm
and obtain
7782,CLARK,MANAGER,7839,2572.50,0.20,10
7839,KING,PRESIDENT,5850.00,10
7654,MARTIN,SALESMAN,7698,1894.00,0.15,30

MERGING DIFFERENT FILES INTO A SINGLE DATA FILE
Merging can be done by using “paste”. This command allows you to specify a list of files to be merged and the character(s) to be used as delimiter(s). For instance, to merge the data files in Examples 1 and 2, you may enter:
% paste -d’ ‘ example1.dat example2.dat > example.mrg
where “-d’ ‘” specifies a blank character as the delimiter between records being merged and example.mrg is the merged output file:
7782 CLARK MANAGER 7839 2572.50 0.20 10 12,RESEARCH,”SARATOGA”
7839 KING PRESIDENT 5850.00 10 10,”ACCOUNTING”,CLEVELAND
7654 MARTIN SALESMAN 7698 1894.00 0.15 30 13,”FINANCE”,”BOSTON”

ELIMINATING DUPLICATE RECORDS/FIELDS FROM THE DATA FILE
The Unix command for this is “uniq”. It eliminates or reports consecutive lines that are identical. Because only adjacent lines are compared, you may have to use the sort utility before using uniq.
In Example1, suppose you wanted to keep only the first entry with DEPTNO = 10. The Unix command would be:
% uniq +43 example1.dat > example1.unq
The “+43” indicates that the first 43 characters in each record should be ignored for the comparison. The output file example1.unq would contain:
7782 CLARK MANAGER 7839 2572.50 0.20 10
7654 MARTIN SALESMAN 7698 1894.00 0.15 30

COUNTING THE NUMBER OF LINES IN THE DATA FILE
Use “wc”. For example:
% wc -l example1.dat
3 example1.dat
which indicates that the file contains three lines.

CONVERTING BETWEEN ASCII AND EBCDIC
The “dd” Unix command can be used to convert EBCDIC data files to ASCII and vice-versa. For example:
% dd if=example1.ebc conv=ascii > example1.asc
takes example1.ebc as EBCDIC input file, converts it into ASCII, and writes the converted output to file example1.asc.

CASE CONVERSIONS
The “dd” and “tr” commands can also be used for converting between uppercase and lowercase characters. Because Oracle is currently case-sensitive, this can be useful in many situations.
For example, to convert all characters in file example1.dat from uppercase to lowercase, simply enter
% dd if=example1.dat conv=lcase > example1.low
or
% cat example1.dat | tr “[A-Z]” “[a-z]” > example1.low
The contents of output file example1.low will be:
7782 clark manager 7839 2572.50 0.20 10
7839 king president 5850.00 10
7654 martin salesman 7698 1894.00 0.15 30

To convert this file back to uppercase characters, type
% dd if=example1.low conv=ucase > example1.dat
or
% cat example1.low | tr “[a-z]” “[A-Z]” > example1.dat

2 thoughts on “Useful Unix Utilities to Data File Management

Want to give some comment to author ( Shivmohan Purohit )

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s