I'd like to remove any word which contains a non alpha char from a text file. e.g
"ok 0bad ba1d bad3 4bad4 5bad5bad5"
should become
"ok"
I've tried using
echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/\b[a-zA-Z]*[^a-zA-Z]\+[a-zA-Z]*\b/ /g'
I'd like to remove any word which contains a non alpha char from a text file. e.g
"ok 0bad ba1d bad3 4bad4 5bad5bad5"
should become
"ok"
I've tried using
echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/\b[a-zA-Z]*[^a-zA-Z]\+[a-zA-Z]*\b/ /g'
The following sed command does the job:
sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
It removes all words containing at least one non-alphabetic character. It is better to use POSIX character classes like [:alpha:], because for instance they won't consider the French name "François" as being faulty (i.e. containing a non-alphabetic character).
We remove all patterns starting with an arbitrary number of spaces followed by an arbitrary (possibly nil) number of alphabetic characters, followed by at least one non-space and non-alphabetic character, and then glob to the end of the word (i.e. until the next space). Please note that you may want to swap [:space:] for [:blank:], see this page for a detailed explanation of the difference between these two POSIX classes.
$ echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
ok
Using awk:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
awk '{ofs=""; for (i=1; i<=NF; i++) if ($i ~ /^[[:alpha:]]+$/)
{printf "%s%s", ofs, $i; ofs=OFS} print ""}' <<< "$s"
ok
This awk command loops through all words and if word matches the regex /^[[:alpha:]]+$/ then it writes to standard out. (i<NF)?OFS:RS is a short cut to add OFS if current field no is less than NF otherwise it writes RS.
Using grep + tr together:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
r=$(grep -o '[^ ]\+' <<< "$s"|grep '^[[:alpha:]]\+$'|tr '\n' ' ')
echo "$r"
ok
First grep -o breaks the string into individual words. 2nd grep only searches for words with alphabets only. ANd finally tr translates \n to space.
If you're not concerned about losing different numbers of spaces between each word, you could use something like this in Perl:
perl -ane 'print join(" ", grep { !/[^[:alpha:]]/ } @F), "\n"
the -a switch enables auto-split mode, which splits the text on any number of spaces and stores the fields in the array @F. grep filters out the elements of that array that contain any non-alphabetical characters. The resulting array is joined on a single space.
This might work for you (GNU sed):
sed -r 's/\b([[:alpha:]]+\b ?)|\S+\b ?/\1/g;s/ $//' file
This uses a back reference within alternation to save the required string.
st="ok 0bad ba1d bad3 4bad4 5bad5bad5"
for word in $st;
do
if [[ $word =~ ^[a-zA-Z]+$ ]];
then
echo $word;
fi;
done