Thursday, February 2, 2012

Column concatenation

When processing text files on a shell, sometimes you want to concatenate columns from different files. Maybe you have the students' names on one file and their grades on the other, or something. Anyway, I found several ways to concatenate columns, and due to their potential time-saving capability, I'm posting them up here:

  • Method 1: pr (print)
  • Method 2: paste
  • Method 3: awk
The first two will allow you to simply concatenate columns horizontally, and nothing else. Most useful perhaps for numbering rows in a text file (aided by seq), or concatenating two perfectly aligned files. I've had both needs at times.

The third method extends upon the first two, and allows you to shuffle the columns onto another order, or to join columns based on their values. The first use case is neatly exemplified here:


> cat file1
one two three
one two three
one two three
one two three

> cat file2
four five six
four five six
four five six
four five six

> pr -m -t -s\  file1 file2 | gawk '{print $4,$5,$6,$1}'
four five six one
four five six one
four five six one
four five six one

See how it uses pr to concatenate the columns first, and THEN uses awk to shuffle them. I went through the manual pages, and it seems pr is geared towards formatting text for page printing, while paste is a more simple operator. I haven't used them much, but I'd suggest paste over pr in general. Both both have their uses.

Then this other webpage suggests the following command, which I have not tried:

awk '
  # store the first file, indexed by col2
  NR==FNR {f1[$2] = $0; next}
  # output only if file1 contains file2's col2
  ($2 in f1) {print f1[$2], $0}
' file1 file2

It seems complicated and I don't yet quite understand it, but if it does what it claims, it should come very much in handy.

And a supporter of the join command suggests a slightly cleaner method:


You can do that with a combination of the sort and join commands. The straightforward approach is
join -j2 <(sort -k2 file1) <(sort -k2 file2)
but that displays slightly differently than you're looking for. It just shows the common join field and then the remaining fields from each file
"1431_at" "3973" 2.52832098784342 "653" 2.14595534191867
"207201_s_at" "1826" 2.41685345240968 "1109" 2.13777517447307
If you need the format exactly as you showed, then you would need to tell join to output in that manner
join -o 1.1,1.2,1.3,2.1,2.2,2.3 -j2 <(sort -k2 file1) <(sort -k2 file2)
where -o accepts a list of FILENUM.FIELDNUM specifiers.
Note that the <() syntax I'm using isn't POSIX sh, so you should sort to a temporary file if you need POSIX sh syntax.

I also haven't sat today to understand it, but it looks promising.
I should be trying them out soon. Hope you do too.

No comments:

Post a Comment