gawk and sed
Transforming Data in Pipelines
██ gawk
2 / 34 ██ What is gawk?
gawk is the GNU implementation of awk, a text-processing language designed to work in pipelines.
It reads input one line at a time, splits each line into fields, and lets you work with them.
Think of it as a much more flexible version of cut:
• cut can only extract columns as-is
• gawk can extract, rearrange, compute, and format
3 / 34 ██ The gawk Model
For each line of input, gawk:
1. Splits the line into fields: $1, $2, $3, ..., $NF
2. Runs your program on it
3. Writes output to stdout
$ echo "Alice 42 78.5" | gawk '{print $2}'
42
$NF is always the last field (NF = Number of Fields):
$ echo "Alice 42 78.5" | gawk '{print $NF}'
78.5
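A minimal sketch of the field model, reusing the sample line from above. The first line is only a fallback shim (an assumption for machines without gawk) and is not part of the technique.

```shell
# Assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null || gawk() { awk "$@"; }

line="Alice 42 78.5"

# NF is the number of fields; $1 is the first field and $NF the last.
summary=$(echo "$line" | gawk '{print NF, $1, $NF}')
echo "$summary"

# $(NF-1) addresses the second-to-last field.
second_last=$(echo "$line" | gawk '{print $(NF-1)}')
echo "$second_last"
```

Because NF is recomputed for every line, $NF keeps pointing at the last field even when lines have different numbers of columns.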
4 / 34 ██ gawk as a Flexible cut
gawk '{print $N}' is like cut -fN, but it splits on runs of whitespace by default:
$ cut -d' ' -f2 data.txt # cut approach
$ gawk '{print $2}' data.txt # gawk approach
For CSV data, use -F, to set the field separator:
$ cut -d, -f2 data.csv # cut approach
$ gawk -F, '{print $2}' data.csv # gawk approach
The real power comes from what gawk can do beyond just extracting.
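One practical difference shows up with column-aligned data. The sketch below uses a made-up line with two spaces after the name: cut treats every single space as a delimiter, while gawk treats a run of spaces as one separator.

```shell
# Assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null || gawk() { awk "$@"; }

# Hypothetical input with two spaces between the first two columns.
line="Alice  42 78.5"

# cut sees an empty field between the two spaces, so field 2 is empty.
cut_field=$(echo "$line" | cut -d' ' -f2)

# gawk collapses the run of spaces, so field 2 is the number.
gawk_field=$(echo "$line" | gawk '{print $2}')

echo "cut:  [$cut_field]"
echo "gawk: [$gawk_field]"
```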
5 / 34 ██ Printing Multiple Fields
Print fields in any order, with any separator:
$ gawk '{print $3, $1}' data.txt # print column 3, then column 1
$ gawk -F, '{print $1, $4}' data.csv # columns 1 and 4, space-separated
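A comma inside print inserts the output separator (a space by default). To print a literal comma or tab, place it as a quoted string between the fields: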
$ gawk '{print $1 "," $3}' data.txt # output as CSV
$ gawk -F, '{print $2 "\t" $4}' data.csv # output as TSV
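An alternative to quoting the separator between every pair of fields is to set awk's OFS (output field separator) variable once. This is a sketch with a made-up four-column line; -v assigns a variable before the program runs.

```shell
# Assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null || gawk() { awk "$@"; }

# Hypothetical CSV record with four columns.
line="Alice,42,78.5,NY"

# OFS is what print writes wherever a comma separates its arguments.
csv_out=$(echo "$line" | gawk -F, -v OFS=, '{print $4, $1}')
tsv_out=$(echo "$line" | gawk -F, -v OFS='\t' '{print $1, $3}')

echo "$csv_out"
echo "$tsv_out"
```

With many output columns, setting OFS keeps the print statement shorter than concatenating "," by hand.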
6 / 34 ██ Demo - gawk as cut
7 / 34 ██ Column Arithmetic
gawk can do math on fields, which is where it leaves cut far behind.
$ gawk '{print $1, $2 * $3}' data.txt # multiply columns 2 and 3
$ gawk '{print $1, $2 + $3}' data.txt # add columns
$ gawk '{print $1, ($2 + $3) / 2}' data.txt # average of two columns
Unit conversions:
# Convert column 2 from Celsius to Fahrenheit
$ gawk '{print $1, $2 * 9/5 + 32}' temps.txt
# Convert column 2 from meters to feet
$ gawk '{print $1, $2 * 3.28084}' distances.txt
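To check the Celsius conversion by hand, here is a small worked run. The three readings are invented stand-ins for temps.txt, chosen so the results are easy to verify.

```shell
# Assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null || gawk() { awk "$@"; }

# Made-up readings: label, temperature in Celsius.
# * and / bind tighter than +, so this computes ($2 * 9 / 5) + 32.
fahrenheit=$(printf 'mon 0\ntue 25\nwed 100\n' | gawk '{print $1, $2 * 9/5 + 32}')

echo "$fahrenheit"
# mon 32
# tue 77
# wed 212
```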
8 / 34 ██ Computing Derived Columns
Add a column that doesn't exist in the original data:
# File: experiments.txt
# name distance_m time_s
# ball 15.2 3.1
# disk 22.7 4.5
$ gawk '{print $1, $2, $3, $2 / $3}' experiments.txt
ball 15.2 3.1 4.9032...
disk 22.7 4.5 5.0444...
You can even label the output:
$ gawk '{print $1, $2/$3, "m/s"}' experiments.txt
ball 4.9032... m/s
disk 5.0444... m/s
9 / 34 ██ Formatted Output with printf
printf gives you control over number formatting:
$ gawk '{printf "%s %.2f\n", $1, $2/$3}' experiments.txt
ball 4.90
disk 5.04
Format specifiers:
• %s: string
• %f: floating point (use %.2f for 2 decimal places)
• %d: integer (a fractional value is truncated)
• \n: newline (printf doesn't add one automatically)
# Clean column-aligned output
$ gawk '{printf "%-10s %8.3f\n", $1, $2}' data.txt
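The width and alignment flags are easier to see on real rows. This sketch feeds the two experiments.txt rows through a narrower variant of the format above; the widths 6 and 8 are chosen only for illustration.

```shell
# Assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null || gawk() { awk "$@"; }

# The two sample rows: name distance_m time_s
data='ball 15.2 3.1
disk 22.7 4.5'

# %-6s  : name, left-aligned and padded to 6 characters
# %8.3f : velocity, right-aligned in 8 characters with 3 decimals
table=$(echo "$data" | gawk '{printf "%-6s %8.3f\n", $1, $2/$3}')

echo "$table"
```

Every output line has the same length, so the numeric column lines up no matter how long each name is (up to the chosen width).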
10 / 34 ██ gawk in a Pipeline
gawk fits naturally into pipelines:
# Get just the velocities, then find the max
$ gawk '{print $2/$3}' experiments.txt | sort -n | tail -1
# Filter first, then transform
$ grep "ball" experiments.txt | gawk '{print $2/$3}'
# Transform, then filter results above a threshold
$ gawk '{print $1, $2/$3}' experiments.txt | gawk '$2 > 5.0'
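The two-stage filter above can also be written as a single gawk call, since a condition placed before the braces selects which lines the action runs on. A sketch using the two sample rows, with printf to keep the output stable:

```shell
# Assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null || gawk() { awk "$@"; }

# The two sample rows: name distance_m time_s
data='ball 15.2 3.1
disk 22.7 4.5'

# Largest velocity: compute, sort numerically, keep the last line.
max_velocity=$(echo "$data" | gawk '{printf "%.2f\n", $2/$3}' | sort -n | tail -1)

# Filter and transform in one step: only rows faster than 5.0 m/s are printed.
fast=$(echo "$data" | gawk '$2/$3 > 5.0 {printf "%s %.2f\n", $1, $2/$3}')

echo "$max_velocity"
echo "$fast"
```

Whether to chain two gawk calls or fold them into one is a readability choice; both give the same rows here.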
11 / 34 ██ Demo - Column Arithmetic
12 / 34 ██ sed
13 / 34 ██ What is sed?
sed is the stream editor: it applies text transformations to each line as it passes through.
The most useful thing sed does: substitution.
$ sed 's/old/new/' input.txt
• s — substitute command
• old — the pattern to find
• new — what to replace it with
By default, only the first occurrence per line is replaced.
Add g (global) to replace all occurrences:
$ sed 's/old/new/g' input.txt
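The effect of the g flag is easiest to see side by side. A minimal sketch on a made-up line:

```shell
line="a,b,c"

# Without g, only the first comma on the line is replaced.
first_only=$(echo "$line" | sed 's/,/ /')

# With g, every comma on the line is replaced.
all=$(echo "$line" | sed 's/,/ /g')

echo "$first_only"   # a b,c
echo "$all"          # a b c
```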
14 / 34 ██ sed: Converting Delimiters
A common use: change the delimiter so other tools can work with the data.
CSV → space-separated:
$ sed 's/,/ /g' data.csv
Before:
Alice,42,78.5
Bob,37,65.2
After:
Alice 42 78.5
Bob 37 65.2
Now you can pipe it to gawk using the default whitespace delimiter — no -F, needed.
15 / 34 ██ Other Delimiter Conversions
Space-separated → CSV:
$ sed 's/ /,/g' data.txt
Tab-separated → comma-separated:
$ sed 's/\t/,/g' data.tsv # \t is understood by GNU sed
Multiple spaces → single space:
$ sed 's/  */ /g' data.txt # two spaces followed by * means "one or more spaces"
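A short sketch of the last two conversions on made-up input. The space-squeezing pattern needs two spaces before the *, and the tab example passes a literal tab character so it does not depend on sed understanding \t (which is a GNU extension).

```shell
# Pattern: one space, then zero or more additional spaces (two spaces before the *).
squeezed=$(echo "Alice   42      78.5" | sed 's/  */ /g')

# Store a literal tab in a variable and use it in the pattern.
tab=$(printf '\t')
csv=$(printf 'Alice\t42\t78.5\n' | sed "s/$tab/,/g")

echo "$squeezed"   # Alice 42 78.5
echo "$csv"        # Alice,42,78.5
```

Note that a pattern with only one space before the * also matches the empty string, so it would insert a space between every pair of characters.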
16 / 34 ██ Demo - sed Format Conversion
17 / 34 ██ sed as a Sorting Hack
Sometimes your data has text mixed in with numbers that you want to sort:
sample_A: 3.72
sample_B: 1.14
sample_C: 8.45
sample_D: 0.93
Sorting alphabetically gives the wrong answer; you need to sort by the number.
Strip the label with sed, sort numerically, done:
$ sed 's/.*: //' measurements.txt | sort -n
0.93
1.14
3.72
8.45
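18 / 34 ██ Stripping Units
Data from instruments often comes with units attached:
15.2 kg
8.7 kg
22.1 kg
sort -n reads the leading number on each line, but the kg suffix stays in the output and gets in the way of any later step that expects a bare number. Strip it with sed first: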
$ sed 's/ kg//' weights.txt | sort -n
8.7
15.2
22.1
19 / 34 ██ Keeping Labels While Sorting
What if you want to sort by value but keep the label?
sample_A 3.72
sample_B 1.14
sample_C 8.45
$ sort -k2 -n measurements.txt
sample_B 1.14
sample_A 3.72
sample_C 8.45
sort -k2 -n sorts by the 2nd column numerically — no sed needed when the data is already clean.
Use sed when the data is not clean — strip what's in the way, then sort.
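Both ideas combine when the labels should survive but the data is not clean. This sketch takes the colon-separated samples from the earlier slide, uses sed to turn the ": " into a plain space, and then sorts on the second column.

```shell
# The colon-labelled measurements from the sorting-hack slide.
data='sample_A: 3.72
sample_B: 1.14
sample_C: 8.45
sample_D: 0.93'

# Replace ": " with a single space, then sort numerically by column 2.
sorted=$(echo "$data" | sed 's/: / /' | sort -k2 -n)

echo "$sorted"
# sample_D 0.93
# sample_B 1.14
# sample_A 3.72
# sample_C 8.45
```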
20 / 34 ██ Stripping a Pattern from Lines
Use grep to find the lines, then sed to extract just the value:
# Program output that looks like:
# Run 1: final energy = 1.234
# Run 2: final energy = 0.891
# Run 3: final energy = 2.107
$ grep "final energy" results.txt | sed 's/.*= //' | sort -n
0.891
1.234
2.107
The pattern .*= followed by a space is greedy: it matches everything up to and including the last "= " on the line, leaving only the value.
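The "last =" behaviour matters when a line contains more than one equals sign. A sketch with an invented log line that has two of them:

```shell
# Hypothetical output line with two "= " occurrences.
line="Run 1: dt = 0.01, final energy = 1.234"

# .* is greedy, so the match extends to the LAST "= " on the line.
value=$(echo "$line" | sed 's/.*= //')

echo "$value"   # 1.234
```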
21 / 34 ██ Demo - sed and Sorting
22 / 34 ██ Putting It All Together
23 / 34 ██ A Data Analysis Pipeline
Suppose you have a CSV file with measurements:
name,mass_kg,height_m
Alice,65.3,1.68
Bob,82.1,1.81
Carol,58.7,1.62
David,74.5,1.75
Goal: compute BMI (mass / height²) for each person, sorted highest to lowest.
$ tail -n +2 measurements.csv \
| sed 's/,/ /g' \
| gawk '{printf "%s %.1f\n", $1, $2/($3*$3)}' \
| sort -k2 -rn
Bob 25.1
David 24.3
Alice 23.1
Carol 22.4
24 / 34 ██ Breaking Down the Pipeline
$ tail -n +2 measurements.csv |                   # skip the header row
    sed 's/,/ /g' |                               # CSV → space-separated so gawk can read it
    gawk '{printf "%s %.1f\n", $1, $2/($3*$3)}' | # compute BMI
    sort -k2 -rn                                  # sort by 2nd column, reverse, numeric
Each tool does one job:
• tail: skip the header
• sed: fix the format
• gawk: do the math
• sort: order the results
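To try the pipeline end to end, the sketch below recreates the sample file in a temporary directory (the directory and variable names are only for this example) and runs it twice: once as shown, and once letting gawk split on commas with -F, so the sed step is not needed.

```shell
# Assumption: fall back to the system awk if gawk is not installed.
command -v gawk >/dev/null || gawk() { awk "$@"; }

# Recreate the sample data in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/measurements.csv" <<'EOF'
name,mass_kg,height_m
Alice,65.3,1.68
Bob,82.1,1.81
Carol,58.7,1.62
David,74.5,1.75
EOF

# The pipeline from the slide: tail, sed, gawk, sort.
with_sed=$(tail -n +2 "$workdir/measurements.csv" \
  | sed 's/,/ /g' \
  | gawk '{printf "%s %.1f\n", $1, $2/($3*$3)}' \
  | sort -k2 -rn)

# Variation: gawk reads the CSV directly with -F, and sed is skipped.
with_F=$(tail -n +2 "$workdir/measurements.csv" \
  | gawk -F, '{printf "%s %.1f\n", $1, $2/($3*$3)}' \
  | sort -k2 -rn)

echo "$with_sed"
rm -r "$workdir"
```

Both forms print the same four lines; the sed step earns its place mainly when several downstream tools all expect whitespace-separated input.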
25 / 34 ██ When to Use gawk vs cut
Use cut when you just need to extract columns from consistently formatted data:
$ cut -d, -f1,3 data.csv
Use gawk when you need to:
• Work with whitespace-delimited data (no -F needed)
• Print columns in a different order
• Compute a new value from existing columns
• Format numbers with a specific number of decimal places
• Do anything math-related with the data
26 / 34 ██ When to Use sed
sed is the right tool when you need to:
• Change one delimiter to another (s/,/ /g)
• Strip a prefix or suffix from data
• Remove text that's in the way before sorting or computing
• Do a simple find-and-replace transformation on every line
For more complex transformations, reach for gawk, which is more readable than a complex sed expression.
27 / 34 ██ Example: Travel Reimbursement
When you travel for a company or government (e.g. FHSU), the company or government will have limits on how much it will pay for lodging and meals. These limits are updated regularly...
28 / 34 ██ Example: Travel Reimbursement
1. How many different locations are included in the travel spreadsheet?
2. How many different states are included in the travel spreadsheet?
3. What is the most expensive place(s) to stay?
4. What is the most expensive place(s) to eat?
5. What is the cheapest place(s) to stay?
6. What is the cheapest place(s) to eat?
7. What is the most expensive place(s) to stay and eat?
8. What is the cheapest place(s) to stay and eat?
29 / 34 ██ Example: Ocular Transmission
1. What wavelength has the highest transmission?
2. What wavelength has the lowest transmission?
3. Is the total transmission column correct?
4. Which wavelength has the largest portion of total transmission coming to the lens?
30 / 34 ██ Summary
We learned:
• gawk '{print $N}' extracts fields and is more flexible than cut
• -F sets the field separator (-F, for CSV, -F'\t' for TSV)
• gawk can do arithmetic: $2 * $3, $2 / $3, ($2 + $3) / 2
• printf gives control over number formatting
• sed 's/old/new/g' replaces text on every line
• sed converts delimiters: sed 's/,/ /g' for CSV → space-separated
• sed strips unwanted text before sorting: pipe through sed, then sort -n
The pattern: use sed to fix the format, use gawk to do the math, use sort/grep/cut for the rest.
31 / 34 ██ Next Time
In Module 05, we'll make our scripts more professional:
• Functions
• Better argument handling ($@, $#, shift)
• Writing to stderr
• The PATH variable and ~/bin
32 / 34 ██ Last Slide
This space intentionally left blank