gawk and sed

                                                                        Transforming Data in Pipelines

              ██ gawk

                                                                                                                                                                       2 / 34

              ██ What is gawk?

              gawk is the GNU implementation of awk — a text processing language designed to work in pipelines.

              It reads input one line at a time, splits each line into fields, and lets you work with them.

              Think of it as a much more flexible version of cut:

                 •  cut can only extract columns as-is

                 •  gawk can extract, rearrange, compute, and format

                                                                                                                                                                       3 / 34

              ██ The gawk Model

              For each line of input, gawk:

                 1. Splits the line into fields: $1, $2, $3, ..., $NF

                 2. Runs your program on it

                 3. Writes output to stdout

                                                                $ echo "Alice 42 78.5" | gawk '{print $2}'

                                                                42

              $NF is always the last field (NF = Number of Fields):

                                                                $ echo "Alice 42 78.5" | gawk '{print $NF}'

                                                                78.5
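              To see the field model on lines of different lengths, here is a minimal sketch that prints the field count, the first field, and the last field. The two input lines are made up for illustration, and the first line of the block is only a portability assumption for machines without gawk.

```shell
# Assumption for portability: if gawk is not installed, fall back to plain awk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# NF is the number of fields on the current line; $NF is the last field.
fields=$(printf 'Alice 42 78.5\nBob 37\n' | gawk '{print NF, $1, $NF}')
echo "$fields"
# 3 Alice 78.5
# 2 Bob 37
```

              Because NF is recomputed for every line, $NF keeps pointing at the last column even when rows have different widths.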

                                                                                                                                                                       4 / 34

              ██ gawk as a Flexible cut

              gawk '{print $N}' is like cut, but whitespace-delimited by default:

                                                              $ cut -d' ' -f2 data.txt         # cut approach

                                                              $ gawk '{print $2}' data.txt     # gawk approach

              For CSV data, use -F, to set the field separator:

                                                              $ cut -d, -f2 data.csv           # cut approach

                                                              $ gawk -F, '{print $2}' data.csv # gawk approach

              The real power comes from what gawk can do beyond just extracting.
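              The whitespace difference is easiest to see on a line with several spaces between columns. The sketch below uses an invented sample line: cut treats every single space as a separator, while gawk treats a run of spaces as one.

```shell
# Assumption for portability: if gawk is not installed, fall back to plain awk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

line='Alice   42   78.5'   # three spaces between columns

with_cut=$(printf '%s\n' "$line" | cut -d' ' -f2)     # field 2 is the empty string between two spaces
with_gawk=$(printf '%s\n' "$line" | gawk '{print $2}')

echo "cut:  [$with_cut]"    # cut:  []
echo "gawk: [$with_gawk]"   # gawk: [42]
```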

                                                                                                                                                                       5 / 34

              ██ Printing Multiple Fields

              Print fields in any order, with any separator:

                                                 $ gawk '{print $3, $1}' data.txt        # column 3, then column 1

                                                 $ gawk -F, '{print $1, $4}' data.csv    # columns 1 and 4, space-separated

              Concatenate a string literal such as "," or "\t" to control the separator between fields:

                                                          $ gawk '{print $1 "," $3}' data.txt     # output as CSV

                                                          $ gawk -F, '{print $2 "\t" $4}' data.csv  # output as TSV
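              An alternative to concatenating the separator by hand is awk's built-in OFS (output field separator) variable, which print inserts wherever a comma appears in its argument list. OFS and the BEGIN block are not covered in these slides, so treat this as an optional sketch; the input lines are invented.

```shell
# Assumption for portability: if gawk is not installed, fall back to plain awk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# BEGIN runs once before any input is read; OFS replaces the default single space.
csv=$(printf 'ball 15.2 3.1\n' | gawk 'BEGIN {OFS=","} {print $1, $3}')
echo "$csv"   # ball,3.1

# Same idea for tab-separated output from comma-separated input.
tsv=$(printf 'a,b,c,d\n' | gawk -F, 'BEGIN {OFS="\t"} {print $2, $4}')
echo "$tsv"   # b and d separated by a tab
```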

                                                                                                                                                                       6 / 34

              ██ Demo - gawk as cut

                                                                                                                                                                       7 / 34

              ██ Column Arithmetic

              gawk can do math on fields — this is where it leaves cut far behind.

                                                    $ gawk '{print $1, $2 * $3}' data.txt      # multiply columns 2 and 3

                                                    $ gawk '{print $1, $2 + $3}' data.txt      # add columns

                                                    $ gawk '{print $1, ($2 + $3) / 2}' data.txt  # average of two columns

              Unit conversions:

                                                               # Convert column 2 from Celsius to Fahrenheit

                                                               $ gawk '{print $1, $2 * 9/5 + 32}' temps.txt

                                                               # Convert column 2 from meters to feet

                                                               $ gawk '{print $1, $2 * 3.28084}' distances.txt
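              As a quick check of the Celsius conversion, the sketch below runs it on three well-known temperatures. The labels and values stand in for temps.txt, whose real contents are not shown in the slides.

```shell
# Assumption for portability: if gawk is not installed, fall back to plain awk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# Column 1 is a label, column 2 is degrees Celsius.
temps=$(printf 'freezing 0\nbody 37\nboiling 100\n' | gawk '{print $1, $2 * 9/5 + 32}')
echo "$temps"
# freezing 32
# body 98.6
# boiling 212
```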

                                                                                                                                                                       8 / 34

              ██ Computing Derived Columns

              Add a column that doesn't exist in the original data:

                                                            # File: experiments.txt

                                                            # name  distance_m  time_s

                                                            # ball  15.2  3.1

                                                            # disk  22.7  4.5

                                                            $ gawk '{print $1, $2, $3, $2 / $3}' experiments.txt

                                                            ball 15.2 3.1 4.9032...

                                                            disk 22.7 4.5 5.0444...

              You can even label the output:

                                                              $ gawk '{print $1, $2/$3, "m/s"}' experiments.txt

                                                              ball 4.9032... m/s

                                                              disk 5.0444... m/s

                                                                                                                                                                       9 / 34

              ██ Formatted Output with printf

              printf gives you control over number formatting:

                                                          $ gawk '{printf "%s %.2f\n", $1, $2/$3}' experiments.txt

                                                          ball 4.90

                                                          disk 5.04

              Format specifiers:

                 •  %s — string

                 •  %f — floating point (use %.2f for 2 decimal places)

                 •  %d — integer

                 •  \n — newline (printf doesn't add one automatically)

                                                             # Clean column-aligned output

                                                             $ gawk '{printf "%-10s %8.3f\n", $1, $2}' data.txt
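              To make the padding from %-10s and %8.3f visible, this sketch puts a | after each column. The two rows reuse the names and distances from experiments.txt; the | markers are added only for illustration.

```shell
# Assumption for portability: if gawk is not installed, fall back to plain awk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

# %-10s : string, left-aligned, padded to 10 characters
# %8.3f : number, right-aligned, 8 characters wide, 3 decimal places
table=$(printf 'ball 15.2\ndisk 22.7\n' | gawk '{printf "%-10s|%8.3f|\n", $1, $2}')
echo "$table"
# ball      |  15.200|
# disk      |  22.700|
```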

                                                                                                                                                                      10 / 34

              ██ gawk in a Pipeline

              gawk fits naturally into pipelines:

                                                        # Get just the velocities, then find the max

                                                        $ gawk '{print $2/$3}' experiments.txt | sort -n | tail -1

                                                        # Filter first, then transform

                                                        $ grep "ball" experiments.txt | gawk '{print $2/$3}'

                                                        # Transform, then filter results above a threshold

                                                        $ gawk '{print $1, $2/$3}' experiments.txt | gawk '$2 > 5.0'
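              The last command above uses a gawk program that is only a condition: with no { } action, gawk prints every line for which the condition is true. The sketch below shows that two-step version next to a one-step version that puts the condition in front of the action. The data repeats the two rows from experiments.txt.

```shell
# Assumption for portability: if gawk is not installed, fall back to plain awk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

data='ball 15.2 3.1
disk 22.7 4.5'

# Two steps: compute the velocity, then keep lines whose 2nd field exceeds 5.0.
two_step=$(printf '%s\n' "$data" | gawk '{print $1, $2/$3}' | gawk '$2 > 5.0')

# One step: the condition before the braces decides whether the action runs.
one_step=$(printf '%s\n' "$data" | gawk '$2/$3 > 5.0 {print $1, $2/$3}')

echo "$two_step"   # disk 5.04444
echo "$one_step"   # disk 5.04444
```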

                                                                                                                                                                      11 / 34

              ██ Demo - Column Arithmetic

                                                                                                                                                                      12 / 34

              ██ sed

                                                                                                                                                                      13 / 34

              ██ What is sed?

              sed is the stream editor — it applies text transformations to each line as it passes through.

              The most useful thing sed does: substitution.

                                                                $ sed 's/old/new/' input.txt

                 •  s — substitute command

                 •  old — the pattern to find

                 •  new — what to replace it with

              By default, only the first occurrence per line is replaced.

              Add g (global) to replace all occurrences:

                                                                $ sed 's/old/new/g' input.txt
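              The difference between the two forms shows up as soon as a line contains the pattern more than once. A minimal sketch on one made-up line:

```shell
line='old old old'

first_only=$(printf '%s\n' "$line" | sed 's/old/new/')    # replaces the first match only
every_one=$(printf '%s\n' "$line" | sed 's/old/new/g')    # g flag: replaces every match

echo "$first_only"   # new old old
echo "$every_one"    # new new new
```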

                                                                                                                                                                      14 / 34

              ██ sed: Converting Delimiters

              A common use: change the delimiter so other tools can work with the data.

              CSV → space-separated:

                                                                $ sed 's/,/ /g' data.csv

              Before:

                                                                Alice,42,78.5

                                                                Bob,37,65.2

              After:

                                                                Alice 42 78.5

                                                                Bob 37 65.2

              Now you can pipe it to gawk using the default whitespace delimiter — no -F, needed.
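              The sketch below runs the sed-then-gawk route on the two sample rows and compares it with reading the CSV directly using -F,. For this data both give the same result.

```shell
# Assumption for portability: if gawk is not installed, fall back to plain awk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

csv='Alice,42,78.5
Bob,37,65.2'

# Route 1: convert commas to spaces, then use gawk's default whitespace splitting.
via_sed=$(printf '%s\n' "$csv" | sed 's/,/ /g' | gawk '{print $1, $3}')

# Route 2: tell gawk the separator is a comma.
via_F=$(printf '%s\n' "$csv" | gawk -F, '{print $1, $3}')

echo "$via_sed"
# Alice 78.5
# Bob 65.2
```

              Converting commas to spaces is only safe when no field contains a space of its own; a name like "Mary Ann" would be split into two fields.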

                                                                                                                                                                      15 / 34

              ██ Other Delimiter Conversions

              Space-separated → CSV:

                                                                $ sed 's/ /,/g' data.txt

              Tab-separated → comma-separated (GNU sed understands \t; other versions of sed may need a literal tab character):

                                                                $ sed 's/\t/,/g' data.tsv

              Multiple spaces → single space:

                                                 $ sed 's/  */ /g' data.txt    # a space followed by " *" matches one or more spaces
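              Two of these conversions can be tried on inline data. The sketch squeezes runs of spaces and then converts tabs to commas; for the tab case it stores a real tab character in a shell variable (the name tab is just illustrative), which is one way to avoid relying on \t support in sed.

```shell
# Squeeze: one space followed by zero or more spaces becomes a single space.
squeezed=$(printf 'a   b  c\n' | sed 's/  */ /g')
echo "$squeezed"   # a b c

# Tab to comma, using a literal tab held in a variable.
tab=$(printf '\t')
commas=$(printf 'a\tb\tc\n' | sed "s/$tab/,/g")
echo "$commas"     # a,b,c
```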

                                                                                                                                                                      16 / 34

              ██ Demo - sed Format Conversion

                                                                                                                                                                      17 / 34

              ██ sed as a Sorting Hack

              Sometimes your data has text mixed in with numbers that you want to sort:

                                                                sample_A: 3.72

                                                                sample_B: 1.14

                                                                sample_C: 8.45

                                                                sample_D: 0.93

              Sorting alphabetically gets the wrong answer. You need to sort by the number.

              Strip the label with sed, sort numerically, done:

                                                                $ sed 's/.*: //' measurements.txt | sort -n

                                                                0.93

                                                                1.14

                                                                3.72

                                                                8.45

                                                                                                                                                                      18 / 34

              ██ Stripping Units

              Data from instruments often comes with units attached:

                                                                15.2 kg

                                                                8.7 kg

                                                                22.1 kg

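              sort -n compares only the leading number on each line, so it can order these lines, but the output still carries the kg suffix. Strip the suffix with sed to get bare numbers: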

                                                                $ sed 's/ kg//' weights.txt | sort -n

                                                                8.7

                                                                15.2

                                                                22.1
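              A small variation on the command above, run on the same three values: anchoring the pattern with $ means only a kg at the end of the line is removed. The anchor is an addition to the slide's command, not part of it.

```shell
weights='15.2 kg
8.7 kg
22.1 kg'

# ' kg$' matches a space and "kg" only at the end of a line.
sorted=$(printf '%s\n' "$weights" | sed 's/ kg$//' | sort -n)
echo "$sorted"
# 8.7
# 15.2
# 22.1
```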

                                                                                                                                                                      19 / 34

              ██ Keeping Labels While Sorting

              What if you want to sort by value but keep the label?

                                                                sample_A 3.72

                                                                sample_B 1.14

                                                                sample_C 8.45

                                                                $ sort -k2 -n measurements.txt

                                                                sample_B 1.14

                                                                sample_A 3.72

                                                                sample_C 8.45

              sort -k2 -n sorts by the 2nd column numerically — no sed needed when the data is already clean.

              Use sed when the data is not clean — strip what's in the way, then sort.
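              For data that is not clean, one possible combination is to use sed to remove only the part that gets in the way and keep the label. The sketch below takes the "sample_A: 3.72" lines from the earlier slide, turns ": " into a single space, and sorts on the second column. It writes the key as -k2,2n, which limits the sort key to column 2 and marks it numeric.

```shell
raw='sample_A: 3.72
sample_B: 1.14
sample_C: 8.45
sample_D: 0.93'

# Drop the colon so the line becomes "label value", then sort by the value.
sorted=$(printf '%s\n' "$raw" | sed 's/: / /' | sort -k2,2n)
echo "$sorted"
# sample_D 0.93
# sample_B 1.14
# sample_A 3.72
# sample_C 8.45
```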

                                                                                                                                                                      20 / 34

              ██ Stripping a Pattern from Lines

              grep to find lines, then sed to extract just the value:

                                                        # Program output that looks like:

                                                        # Run 1: final energy = 1.234

                                                        # Run 2: final energy = 0.891

                                                        # Run 3: final energy = 2.107

                                                        $ grep "final energy" results.txt | sed 's/.*= //' | sort -n

                                                        0.891

                                                        1.234

                                                        2.107

              The pattern .*=  (with a trailing space) matches everything up to and including the last "= " on the line, because .* is greedy.
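              The "last =" behavior matters when a line contains more than one equals sign. The line below is hypothetical (it is not from results.txt) and has two "= " sequences; sed removes everything through the second one.

```shell
# Hypothetical output line with two "= " sequences.
line='Run 4: tol = 1e-6, final energy = 0.512'

# .* grabs as much as it can, so the match ends at the LAST "= ".
value=$(printf '%s\n' "$line" | sed 's/.*= //')
echo "$value"   # 0.512
```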

                                                                                                                                                                      21 / 34

              ██ Demo - sed and Sorting

                                                                                                                                                                      22 / 34

              ██ Putting It All Together

                                                                                                                                                                      23 / 34

              ██ A Data Analysis Pipeline

              Suppose you have a CSV file with measurements:

                                                                name,mass_kg,height_m

                                                                Alice,65.3,1.68

                                                                Bob,82.1,1.81

                                                                Carol,58.7,1.62

                                                                David,74.5,1.75

              Goal: compute BMI (mass / height²) for each person, sorted highest to lowest.

                                                              $ tail -n +2 measurements.csv \

                                                                | sed 's/,/ /g' \

                                                                | gawk '{printf "%s %.1f\n", $1, $2/($3*$3)}' \

                                                                | sort -k2 -rn

                                                              Bob 25.1

                                                              David 24.3

                                                              Alice 23.1

                                                              Carol 22.4

                                                                                                                                                                      24 / 34

              ██ Breaking Down the Pipeline

                                               $ tail -n +2 measurements.csv |                    # skip the header row

                                                   sed 's/,/ /g' |                                # CSV → space-separated so gawk can read it

                                                   gawk '{printf "%s %.1f\n", $1, $2/($3*$3)}' |  # compute BMI

                                                   sort -k2 -rn                                   # sort by 2nd column, reverse, numeric

              Each tool does one job:

                 •  tail — skip the header

                 •  sed — fix the format

                 •  gawk — do the math

                 •  sort — order the results
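              For comparison, here is a sketch of a shorter variant in which gawk takes over the jobs of tail and sed: -F, reads the CSV directly, and the condition NR > 1 skips the header (NR is gawk's running line number, which these slides do not cover). The data is the sample table from the previous slide, inlined so the block runs on its own.

```shell
# Assumption for portability: if gawk is not installed, fall back to plain awk.
command -v gawk >/dev/null 2>&1 || gawk() { awk "$@"; }

csv='name,mass_kg,height_m
Alice,65.3,1.68
Bob,82.1,1.81
Carol,58.7,1.62
David,74.5,1.75'

# NR > 1 : run the action on every line except the first (the header).
bmi=$(printf '%s\n' "$csv" \
  | gawk -F, 'NR > 1 {printf "%s %.1f\n", $1, $2/($3*$3)}' \
  | sort -k2,2rn)
echo "$bmi"
# Bob 25.1
# David 24.3
# Alice 23.1
# Carol 22.4
```

              The four-tool version is easier to build up and debug one stage at a time; the shorter one is handy once the steps are familiar.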

                                                                                                                                                                      25 / 34

              ██ When to Use gawk vs cut

              Use cut when you just need to extract columns from consistently-formatted data:

                                                                $ cut -d, -f1,3 data.csv

              Use gawk when you need to:

                 •  Work with whitespace-delimited data (no -F needed)

                 •  Print columns in a different order

                 •  Compute a new value from existing columns

                 •  Format numbers with a specific number of decimal places

                 •  Do anything math-related with the data

                                                                                                                                                                      26 / 34

              ██ When to Use sed

              sed is the right tool when you need to:

                 •  Change one delimiter to another (s/,/ /g)

                 •  Strip a prefix or suffix from data

                 •  Remove text that's in the way before sorting or computing

                 •  Do a simple find-and-replace transformation on every line

              For more complex transformations, reach for gawk — it's more readable than a complex sed expression.

                                                                                                                                                                      27 / 34

              ██ Example: Travel Reimbursement

              When you travel for a company or government (e.g., FHSU), the company or government will have limits on how much they will pay for lodging and meals.

              These limits are updated regularly...

                                                                                                                                                                      28 / 34

              ██ Example: Travel Reimbursement

                 1. How many different locations are included in the travel spreadsheet?

                 2. How many different states are included in the travel spreadsheet?

                 3. What is the most expensive place(s) to stay?

                 4. What is the most expensive place(s) to eat?

                 5. What is the cheapest place(s) to stay?

                 6. What is the cheapest place(s) to eat?

                 7. What is the most expensive place(s) to stay and eat?

                 8. What is the cheapest place(s) to stay and eat?

                                                                                                                                                                      29 / 34

              ██ Example: Ocular Transmission

                 1. What wavelength has the highest transmission?

                 2. What wavelength has the lowest transmission?

                 3. Is the total transmission column correct?

                 4. Which wavelength has the largest portion of total transmission coming to the lens?

                                                                                                                                                                      30 / 34

              ██ Summary

              We learned:

                 •  gawk '{print $N}' extracts fields — more flexible than cut

                 •  -F sets the field separator (-F, for CSV, -F'\t' for TSV)

                 •  gawk can do arithmetic: $2 * $3, $2 / $3, ($2 + $3) / 2

                 •  printf gives control over number formatting

                 •  sed 's/old/new/g' replaces text on every line

                 •  sed converts delimiters: sed 's/,/ /g' for CSV → space-separated

                 •  sed strips unwanted text before sorting: pipe through sed, then sort -n

              The pattern: use sed to fix the format, use gawk to do the math, use sort/grep/cut for the rest.

                                                                                                                                                                      31 / 34

              ██ Next Time

              In Module 05, we'll make our scripts more professional:

                 •  Functions

                 •  Better argument handling ($@, $#, shift)

                 •  Writing to stderr

                 •  The PATH variable and ~/bin

                                                                                                                                                                      32 / 34
