This quarter I’m taking GENOME569, which covers developing bioinformatic workflows for high-throughput sequencing. Today we briefly covered some of the basics of Unix, and I realized I’ve never learned grep and sed or written commands with them from scratch! These are some grep and sed practice problems from the class to hopefully help me get up to speed.
Practice file:
grep_sed_example1.txt
Experiment notes:
We are using the hg19 genome build
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
grep
Find the lines that mention “cell”:
$ grep "cell" grep_sed_example1.txt
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
“cell”: This is the regular expression pattern to match. It simply looks for the string “cell” within each line of the file.
Find the lines that talk about “A549”:
$ grep "A549" grep_sed_example1.txt
The cell line is A549
A549 cells have a KRAS G12C mutation
“A549”: This pattern matches the string “A549” within each line of the file.
Find the lines that talk about either A549 or MCF7:
$ grep -E "A549|MCF7" grep_sed_example1.txt
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
-E: Enables extended regular expressions, allowing the use of the | (OR) operator.
“A549|MCF7”: This pattern matches lines containing either “A549” or “MCF7”.
Find the lines that end with a cell line id (i.e., A549 or MCF7):
$ grep -E "A549$|MCF7$" grep_sed_example1.txt
The cell line is A549
Another cell line is MCF7
-E: Enables extended regular expressions, allowing the use of the unescaped | (OR) operator. The $ anchor, which matches the end of a line, works with or without -E.
“A549$|MCF7$”: This pattern matches lines that end ($) with either “A549” or “MCF7”.
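Since both alternatives share the same end-of-line anchor, the pattern can also be written with a group so that $ appears only once. Here is a small sketch of that, which recreates the practice file in a temporary directory (the workdir, notes, and result variable names are just for illustration) and uses single quotes so the shell never tries to interpret the $:

```shell
# Recreate the practice file in a scratch directory so the example runs on its own.
workdir=$(mktemp -d)
notes="$workdir/grep_sed_example1.txt"
cat > "$notes" <<'EOF'
Experiment notes:
We are using the hg19 genome build
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
EOF

# (A549|MCF7) groups the alternatives; the single $ then anchors either one
# to the end of the line.
result=$(grep -E '(A549|MCF7)$' "$notes")
printf '%s\n' "$result"

rm -rf "$workdir"
```

This should print the two lines that end in a cell line id: "The cell line is A549" and "Another cell line is MCF7".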
sed
Note that sed will not edit the original file unless you specifically tell it to with the -i option. By default it just writes the modified text to standard output.
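To convince myself of this, here is a minimal sketch that works on a scratch copy of the practice file (the workdir and notes variable names and the original.txt and edited.txt file names are made up for the example). It assumes a sed that supports -i with an attached backup suffix, which as far as I know both GNU and BSD sed do, although -i is not available in every sed:

```shell
# Work on a scratch copy so the real practice file is never touched.
workdir=$(mktemp -d)
notes="$workdir/grep_sed_example1.txt"
cat > "$notes" <<'EOF'
Experiment notes:
We are using the hg19 genome build
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
EOF
cp "$notes" "$workdir/original.txt"

# Without -i the edited text goes to standard output; here it is redirected
# into a new file and the input file is left alone.
sed 's/A549/MCF7/g' "$notes" > "$workdir/edited.txt"
cmp -s "$notes" "$workdir/original.txt" && echo "original unchanged"

# With -i.bak sed rewrites the file in place and keeps the old contents
# in a backup file with a .bak suffix.
sed -i.bak 's/A549/MCF7/g' "$notes"
cmp -s "$notes" "$workdir/edited.txt" && echo "file edited in place"
cmp -s "$notes.bak" "$workdir/original.txt" && echo "backup holds the original"
```

Redirecting to a new file is the safer habit while practicing, since nothing is overwritten. The scratch directory in $workdir can be deleted afterwards.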
Change instances of A549 to MCF7:
$ sed 's/A549/MCF7/g' grep_sed_example1.txt
Experiment notes:
We are using the hg19 genome build
The cell line is MCF7
MCF7 cells have a KRAS G12C mutation
Another cell line is MCF7
s: This is the substitute command in sed, which is used to perform substitutions.
/A549/MCF7/: This is the substitution operation: the pattern to find (“A549”) followed by the replacement (“MCF7”). On its own it replaces only the first match on each line.
g: This is the global flag, which tells sed to perform the substitution globally within each line, not just the first occurrence.
Change only the second instance of A549 to A549_LUNG:
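One possible approach, assuming that “second instance” means the second occurrence in the file, which sits on line 4 of the practice file: put a line address in front of the substitute command so that it only runs on that line. The number flag (as in s/A549/A549_LUNG/2) would not help here, because it picks the second match within a single line and no line contains A549 twice. The sketch below recreates the file in a temporary directory; the workdir, notes, and result variable names are just for illustration:

```shell
# Recreate the practice file in a scratch directory so the example runs on its own.
workdir=$(mktemp -d)
notes="$workdir/grep_sed_example1.txt"
cat > "$notes" <<'EOF'
Experiment notes:
We are using the hg19 genome build
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
EOF

# The leading 4 is a line address: the substitution is applied to line 4 only,
# so the A549 on line 3 is left as it is.
result=$(sed '4s/A549/A549_LUNG/' "$notes")
printf '%s\n' "$result"

rm -rf "$workdir"
```

With this, line 3 should still read "The cell line is A549" while line 4 becomes "A549_LUNG cells have a KRAS G12C mutation". Hardcoding the line number only works because I know where the second instance is in this particular file.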
Change instances of A549 to MCF7, but without stating “A549”:
$ sed 's/[A-Z][0-9]\{3\}/MCF7/g' grep_sed_example1.txt
Experiment notes:
We are using the hg19 genome build
The cell line is MCF7
MCF7 cells have a KRAS G12C mutation
Another cell line is MCF7
[A-Z]: Matches any uppercase letter from A to Z.
[0-9]: Matches any digit from 0 to 9.
\{3\}: Specifies that the previous element (the digit class) should occur exactly three times. The braces are escaped because sed uses basic regular expressions by default.
/MCF7/: Replaces the matched pattern with “MCF7”.
g: This flag is used for global substitution, ensuring all occurrences within each line are replaced.
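The same pattern can be written without the backslashes by switching sed to extended regular expressions with -E, just like with grep. This is a sketch that assumes a sed supporting -E (GNU and BSD sed both do, as far as I know); the workdir, notes, bre, and ere variable names are only for the example:

```shell
# Recreate the practice file in a scratch directory so the example runs on its own.
workdir=$(mktemp -d)
notes="$workdir/grep_sed_example1.txt"
cat > "$notes" <<'EOF'
Experiment notes:
We are using the hg19 genome build
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
EOF

# Basic regular expressions: the interval braces need backslashes.
bre=$(sed 's/[A-Z][0-9]\{3\}/MCF7/g' "$notes")

# Extended regular expressions (-E): the braces are written plainly.
ere=$(sed -E 's/[A-Z][0-9]{3}/MCF7/g' "$notes")

# Both forms should produce identical text.
[ "$bre" = "$ere" ] && echo "same output"
printf '%s\n' "$ere"

rm -rf "$workdir"
```

In this file the pattern only matches A549: “hg19” starts with lowercase letters, “G12C” has two digits after the letter, and “MCF7” has one. On a bigger file a pattern this loose could easily hit other identifiers, so it is worth checking with grep first.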
How can I make multiple changes in a single line?
-e: This option allows you to specify multiple sed commands in one line. Each -e flag indicates the beginning of a new sed command.
Then you can just list the expressions you want, one after another! sed applies them to each line in the order given.
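As a sketch of what that can look like, the example below changes the cell line and the genome build in one pass. The hg19 to hg38 swap is a made-up second edit purely for illustration, and the workdir, notes, with_e, and with_semicolon variable names are mine:

```shell
# Recreate the practice file in a scratch directory so the example runs on its own.
workdir=$(mktemp -d)
notes="$workdir/grep_sed_example1.txt"
cat > "$notes" <<'EOF'
Experiment notes:
We are using the hg19 genome build
The cell line is A549
A549 cells have a KRAS G12C mutation
Another cell line is MCF7
EOF

# Two -e expressions: each line first goes through the A549 substitution,
# then through the hg19 substitution.
with_e=$(sed -e 's/A549/MCF7/g' -e 's/hg19/hg38/g' "$notes")

# The same two commands can also be separated by a semicolon in one expression.
with_semicolon=$(sed 's/A549/MCF7/g; s/hg19/hg38/g' "$notes")

[ "$with_e" = "$with_semicolon" ] && echo "same output"
printf '%s\n' "$with_e"

rm -rf "$workdir"
```

Because the expressions run in order, a later one sees the output of an earlier one. Here that makes no difference, but it would if the second pattern could match text produced by the first.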