Useful Unix Utilities to Data File Management


The Unix operating system has a number of utilities that can be very useful for pre-processing data files to be loaded with SQL*Loader. Even when the same functionality can be achieved through SQL*Loader, the utilities described

here will be much faster. Data warehousing applications, in particular, can benefit greatly from these utilities.

EXAMPLE 1
Let us assume a load with the following control file:
LOAD DATA
INFILE ‘example1.dat’
INTO TABLE emp
(empno        POSITION(01:04)    INTEGER EXTERNAL,
ename        POSITION(06:14)    CHAR,
job        POSITION(16:24) CHAR,
mgr        POSITION(26:29) INTEGER EXTERNAL,
sal        POSITION(31:37) DECIMAL EXTERNAL,
comm        POSITION(39:42) DECIMAL EXTERNAL,
deptno        POSITION(44:45)    INTEGER EXTERNAL)

Here are the contents of data file example1.dat:
7782 CLARK MANAGER 7839 2572.50 0.20 10
7839 KING PRESIDENT 5850.00 10
7654 MARTIN SALESMAN 7698 1894.00 0.15 30

EXAMPLE 2
Let us assume another load with the following control file:
LOAD DATA
INFILE ‘example2.dat’
INTO TABLE dept
FIELDS TERMINATED BY ‘,’ OPTIONALLY ENCLOSED BY ‘”‘
(deptno, dname, loc)

Below are the contents of data file example2.dat:
12,RESEARCH,”SARATOGA”
10,”ACCOUNTING”,CLEVELAND
13,”FINANCE”,”BOSTON”

PRESORTING THE DATA FILE
The performance of direct path loads can be significantly improved by presorting the input data on indexed columns. Pre-sorting minimizes the demand for temporary segments during the load.

The Unix command to be used for presorting is “sort”.
In Example 1, suppose you have added the SORTED INDEXES (empno) clause to the control file to indicate that fields in the data file are presorted on the EMPNO column. To do that presorting, you would enter at the Unix prompt:
% sort +0 -1 example1.dat > example1.srt

This will sort file example1.dat by its first field (by default fields are delimited by spaces and tabs) and send the output to file example1.srt:
7654 MARTIN SALESMAN 7698 1894.00 0.15 30
7782 CLARK MANAGER 7839 2572.50 0.20 10
7839 KING PRESIDENT 5850.00 10

In Example 2, if you wanted to sort file example2.dat by column DNAME, you would enter:
% sort -t, -d +1 -2 example2.dat > example2.srt
where “-t,” indicates that commas are delimiters, “-d” causes sort to consider only letters and digits in comparisons, and example2.srt is the output file:
10,”ACCOUNTING”,CLEVELAND
13,”FINANCE”,”BOSTON”
12,RESEARCH,”SARATOGA”

ELIMINATING FIELDS FROM THE DATA FILE
Often, it is necessary to remove one or more fields from all the records in the data file. The Unix command that does that is “cut”.
In Example 1, if you want to eliminate the COMM field altogether from the data file, enter at the Unix prompt:
% cut -c1-38,44- example1.dat > example1.cut
where the “-c” option specifies the character ranges that you want to extract from each record. The output file example1.cut contains:
7782 CLARK MANAGER 7839 2572.50 10
7839 KING PRESIDENT 5850.00 10
7654 MARTIN SALESMAN 7698 1894.00 30

In Example 2, to eliminate the LOC field from the data file, you would enter:
% cut -f1-2 -d, example2.dat > example2.cut
where “-f1-2” indicates you want to extract the first two fields of each record and “-d,” tells cut to treat comma as a delimiter. The output file example2.cut would contain:
12,RESEARCH
10,”ACCOUNTING”
13,”FINANCE”

REPLACING CHARACTERS IN THE DATA FILE
Two Unix commands can be used here: “tr” or “sed”.
For instance, if you want to replace all double quotes in the data file in Example 2 by single quotes, you may enter:
% cat example2.dat | tr \” \’ > example2.sqt
The piped “cat” is necessary because tr’s input source is the standard input. Single and double quotes are preceded by backslashes because they are special characters. The output file will be:
12,RESEARCH,’SARATOGA’
10,’ACCOUNTING’,CLEVELAND
13,’FINANCE’,’BOSTON’

Similarly, to substitute colons for commas as delimiters in Example 2, you may enter:
% sed ‘s/,/:/g’ example2.dat > example2.cln
The output would be:
12:RESEARCH:”SARATOGA”
10:”ACCOUNTING”:CLEVELAND
13:”FINANCE”:”BOSTON”

ELIMINATING CHARACTERS FROM THE DATA FILE
Just as for replacing characters, “tr” and “sed” can be used for eliminating them from the data file.
If you want to remove all double quotes from the data file in Example 2, you may type:
% cat example2.dat | tr -d \” > example2.noq
The contents of file example2.dat are piped to the tr process, in which the “-d” option stands for “delete”. The output file example2.noq would look like:
12,RESEARCH,SARATOGA
10,ACCOUNTING,CLEVELAND
13,FINANCE,BOSTON

An identical result would be obtained by using sed:
% sed ‘s/\”//g’ example2.dat > example2.noq
The string in single quotes indicates that double quotes should be replaced by an empty string globally in the input file. Another interesting usage of tr would be to squeeze multiple blanks between fields down to a single space character. That can be achieved by doing:

% cat example1.dat | tr -s ‘ ‘ ‘ ‘ > example1.sqz
The output file would look like:
7782 CLARK MANAGER 7839 2572.50 0.20 10
7839 KING PRESIDENT 5850.00 10
7654 MARTIN SALESMAN 7698 1894.00 0.15 30

INSERTING CHARACTERS INTO THE DATA FILE
A typical situation in which you may need to insert characters into the datafile would be to convert a fixed position data file into a delimited one. The data file in Example 1 is a fixed position one. To convert it into a file delimited by commas, you would enter
% cat example1.dat | tr -s ‘ ‘ ‘,’ > example1.dlm
and obtain
7782,CLARK,MANAGER,7839,2572.50,0.20,10
7839,KING,PRESIDENT,5850.00,10
7654,MARTIN,SALESMAN,7698,1894.00,0.15,30

MERGING DIFFERENT FILES INTO A SINGLE DATA FILE
Merging can be done by using “paste”. This command allows you to specify a list of files to be merged and the character(s) to be used as delimiter(s). For instance, to merge the data files in Examples 1 and 2, you may enter:
% paste -d’ ‘ example1.dat example2.dat > example.mrg
where “-d’ ‘” specifies a blank character as the delimiter between records being merged and example.mrg is the merged output file:
7782 CLARK MANAGER 7839 2572.50 0.20 10 12,RESEARCH,”SARATOGA”
7839 KING PRESIDENT 5850.00 10 10,”ACCOUNTING”,CLEVELAND
7654 MARTIN SALESMAN 7698 1894.00 0.15 30 13,”FINANCE”,”BOSTON”

ELIMINATING DUPLICATE RECORDS/FIELDS FROM THE DATA FILE
The Unix command for this is “uniq”. It eliminates or reports consecutive lines that are identical. Because only adjacent lines are compared, you may have to use the sort utility before using uniq.
In Example1, suppose you wanted to keep only the first entry with DEPTNO = 10. The Unix command would be:
% uniq +43 example1.dat > example1.unq
The “+43” indicates that the first 43 characters in each record should be ignored for the comparison. The output file example1.unq would contain:
7782 CLARK MANAGER 7839 2572.50 0.20 10
7654 MARTIN SALESMAN 7698 1894.00 0.15 30

COUNTING THE NUMBER OF LINES IN THE DATA FILE
Use “wc”. For example:
% wc -l example1.dat
3 example1.dat
which indicates that the file contains three lines.

CONVERTING BETWEEN ASCII AND EBCDIC
The “dd” Unix command can be used to convert EBCDIC data files to ASCII and vice-versa. For example:
% dd if=example1.ebc conv=ascii > example1.asc
takes example1.ebc as EBCDIC input file, converts it into ASCII, and writes the converted output to file example1.asc.

CASE CONVERSIONS
The “dd” and “tr” commands can also be used for converting between uppercase and lowercase characters. Because Oracle is currently case-sensitive, this can be useful in many situations.
For example, to convert all characters in file example1.dat from uppercase to lowercase, simply enter
% dd if=example1.dat conv=lcase > example1.low
or
% cat example1.dat | tr “[A-Z]” “[a-z]” > example1.low
The contents of output file example1.low will be:
7782 clark manager 7839 2572.50 0.20 10
7839 king president 5850.00 10
7654 martin salesman 7698 1894.00 0.15 30

To convert this file back to uppercase characters, type
% dd if=example1.low conv=ucase > example1.dat
or
% cat example1.low | tr “[a-z]” “[A-Z]” > example1.dat

Advertisements

Basic Unix Vi editor command


Unix Vi editor command keys:

ZZ

Q

<ESC>

:<cmd>

:!<cmd>

^g

^f

^b

^d

^u

<x>G

/<x>

?<x>

n

N

]]

[[

%

^l

^r

z<CR>

z-

^e

^y

H

L

M

+

hjkl

0

$

f<x>

F<x>

Exit, saving changes

Enter ex mode

End of insert

Execute ex command

Shell command

Show filename/size

Forward one screen

Back one screen

Forward half screen

Backward half screen

Go to line <x>

Search forward for <x>

Search backward for <x>

Repeat last search

Reverse last search

Next section/function

Previous section/function

Find matching () { or }

Redraw screen

Refresh screen

Current line at top

Current line at bottom

Scroll down one line

Scroll up one line

revious context

Home window line

Last window line

Middle window line

Next line

Cursor movement:

left/down/up/right

Beginning of line

End of line

Find <x> forward

Find <x> backward

t<x>

T<x>

<x>|

w,W

b,B

e,E

^h

^w

^?

~

a

i,I

A

o

O

r

R

d

dd

c

y

C

D

s

S

J

x

X

Y

p

P

<<

>>

u

U

Up to <x> forward

Back up to <x>

Go to column <x>

Forward one word

Back one word

End of word

Erase last character

Erase last word

Interrupt

Toggle character case

Append after

Insert before

Append at end of line

Open line below

Open line above

Replace character

Replace characters

Delete

Delete line

Change

Yank lines to buffer

Change rest of line

Delete rest of line

Substitute character

Substitute lines

Join lines

Delete after

Delete before

Yank current line

Put back lines

Put before

Shift line left

Shift line right

Undo last change

Restore current line

File Options:

sh      Invoke shell

vi      Vi mode

wq      Write and quit

w <f>   Write file <f>

w! <f>  Overwrite file <f>

Ex mode commands:

q

q!

r <f>

set <x>

set no<v>

set all

Quit

Quit, discard changes

Read in file <f>

Enable option

Disable option

Show all options

      
      

UNIX / Shell Programming Basic – Beginner Questions Answers


 

 

What is $*?

Its mainly used for showing up all params. This show all parameter values passed in shell script

 

What does $# stand for?

# will return the number of parameters that are passed as the command-line arguments.

 

What does $? Return?

$? will return exit status of command .0 if command gets successfully executed ,non-zero if command failed.

 

What are Different types of shells?

sh : the oldest shell 
csh : C shell 
ksh : Korn Shell 
bash : bourne again shell
 

 

 

How do you read arguments in a shell program – $1, $2?

Shell script accepts parameters in following format… 
$1 would be the first command line argument, $2 the second, and so on 
$0 is the name of the script or function

If your script has more than 9 params then accept in following way… 
${12} : 12th param 
${18} : 18th param

 

What are the different kinds of loops available in shell script?

for, if, while, case

 

What is the difference between a shell variable that is exported and the one that is not exported?

The Shell variable which is exported would available to all the programs outside the Shell also. And the shell variable which is not exported, would available for that shell or for the shell program only, in which the variable is declared.

 

Export LANG=C 
will make the variable LANG the
global variable, put it into the global environment. All other processes can use it. 
 
LANG=C 
will change the value only in the current script. 

 

  

If you have a string “one two three”, which shell command would you use to extract the strings?

echo $string | cut -d” ” -f1 
echo $string | cut -d” ” -f2 
echo $string | cut -d” ” -f3 

 

How will you list only the empty lines in a file (using grep)?

grep “^$” filename.txt

 

How would you get the character positions 10-20 from a text file?

cat filename.txt | cut -c 10-20

or

cut -c10-20 <filename.txt>

 

How would you replace the n character in a file with some xyz?

sed ‘s/n/xyz/g’ filename > new_filename

 

We can replace n characters by using the following command:
1,$s/./xyz/g
where 1 shows that the search string will start searching patterns from first line of the file.
           ‘.’ for any character.

            g for global replacemet.

 

 

What is the difference between a ‘thread’ and a ‘process’?

A process is a collection of virtual memory space, code, data, and system resources. A thread is code that is to be serially executed within a process. A processor executes threads, not processes, so each application has at least one process, and a process always has at least one thread of execution, known as the primary thread. A process can have multiple threads in addition to the primary thread

Thread – is stream of executable code within process. They are light weight process. All thread with in a process  share process instruction,code & data segment,open file descriptor,signal handler,userID and GroupID. Thread has its own set of register including program counter,stack pointer 

 

What is this line in the shell script do ?#!/bin/ksh?

To invoke the shell indirectly this line is added as the first line in the file.This particular line invokes korn shell

 

 

What is use of “cut” command? Give some examples.

Cut – Utility used to cut/Strip out the required data/text from the source.

Cut can be used in three modes,

  Stripping by Character

     cut -c 1-3

  Striping by Byte length

     cut -b -1-72

Stripping by delimiter and fields.

     cut -d “|” -f1

Where   

-d “|” -> Delimiter used in input text to separate columns

-f1 -> Field/Column number  

While processing huge input files, Cut’s performance is far better than awk

 

  

thanks – shivmohan purohit