8

I am trying to use sed to extract the value part of one of the many key-value pairs in a URL's query string.

This is what I am trying:

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | sed 's@^https?://(www.)?youtube.com/(watch\\?)?.*?v(=|/)([a-zA-Z0-9\-_]*)(&.*)?$@$4@'

but it always outputs the input URL as is.

What am I doing wrong?

Update 1

To clarify some issues:

  1. The regex is more complicated than it has to be because I am also trying to check the validity of the input and generate the output only if the input is valid. So a stricter match.
  2. The desired output is the value of the key 'v' in the query string.
  3. Have been unable to find the version of sed that I am using, but it's the one that comes with Mac OS X (10.7.5).
  4. I originally claimed that in my version of sed $1, $2 etc. are the matches, and that \1, \2 etc. give the error: sed: 1: "s@^https?://(www.)?yout ...": \4 not defined in the RE. That was not correct, as I found out later. Apologies for causing the confusion.

Update 2

Have updated the sed RE to make it more specific based on suggestion by @slhck below, but the issue remains as before.

Update 3

Based on the man page for this version of sed it appears that this is a BSD-flavoured version.

markvgti
  • 583

5 Answers

13

Even simpler, if you just want the abc:

 echo 'http://www.youtube.com/watch?v=abc&g=xyz' | awk -F'[=&]' '{print $2}'

If you want the xyz:

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | awk -F'[=&]' '{print $4}'

EXPLANATION:

  • awk : is a scripting language that automatically processes input files line by line, splitting each line into fields. So, when you process a file with awk, for each line, the first field is $1, the second $2 etc up to $N. By default awk uses blanks as the field separator.

  • -F'[=&]' : -F is used to change the field delimiter from whitespace to something else. In this case, I am giving it a _class_ of characters: in regular expressions, square brackets ([ ]) match any one of the characters listed inside them. So, specifically, -F'[=&]' means that awk should use both & and = as field delimiters.

  • Therefore, given the input string from your question, using & and = as delimiters, awk will read the following fields:

      http://www.youtube.com/watch?v=abc&g=xyz
      |----------- $1 -------------| --- - ---      
                                      |  |  |
                                      |  |  ----- $4
                                      |  -------- $3
                                      ----------- $2
    

    So, all you need to do is print whichever field you want, for example with {print $4}.
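
The field numbers above depend on the order of the parameters in the URL. As a rough sketch of a less position-dependent variant, the function below splits on ? and & instead and looks the key up by name. The name get_param is made up for illustration and is not part of the original answer, and the sketch assumes the URL has no #fragment and no URL-encoded characters.

```shell
# Hypothetical helper: print the value of query key $1 from the URL in $2.
get_param() {
  printf '%s\n' "$2" | awk -F'[?&]' -v key="$1" '{
    # Field 1 is everything before "?"; the rest are key=value pairs.
    for (i = 2; i <= NF; i++) {
      n = index($i, "=")                      # position of the first "="
      if (n && substr($i, 1, n - 1) == key) { # compare the key part
        print substr($i, n + 1)               # print the value part
        exit
      }
    }
  }'
}

get_param v 'http://www.youtube.com/watch?v=abc&g=xyz'   # prints abc
get_param v 'http://www.youtube.com/watch?g=xyz&v=abc'   # prints abc
```

If the key is not present, the function prints nothing.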


You said you also want to check that the string is a valid YouTube URL. A plain sed substitution won't do that: if the line does not match the regex you give it, sed simply prints the entire line unchanged (unless you run it with -n and add the p flag to the s command). You can use a tool like Perl to print only if the regex matches:

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | 
  perl -ne 's/http.*www.youtube.com\/watch\?v=(.+?)&.+/$1/ && print'

Finally, to simply print abc you can use the standard UNIX tool cut:

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | 
  cut -d '=' -f 2 | cut -d '&' -f 1
terdon
  • 54,564
2

If you really just want the video ID – so, anything between v= and the next & – just use:

sed -r 's/.*v=([[:alnum:]]*).*/\1/'

Here's what's wrong with your command:

  • The -r is needed to use extended regular expressions. If you leave that out, sed interprets the parentheses literally, so there won't be any match groups. With BSD sed, use the -E option instead.

  • You use $1 to refer to matches, but you should use \1. sed treats $1 in the replacement as literal text; $1 is how the shell refers to the first argument passed to a script (and how Perl refers to a capture group).

  • You should use a character class like [[:alnum:]] (or [a-zA-Z0-9_] depending on how the IDs are set up) to match the parameter value, since otherwise the next & will be captured as well. A greedy .* would capture abc&g=xyz, and writing .*? does not help, since lazy quantification is not supported in BRE/ERE, only in Perl regex and other "modern" flavors.
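
To illustrate the last point, here is a small sketch comparing a greedy capture with a negated character class. It assumes a sed that accepts -E (BSD sed does, and so do recent GNU sed versions); the variable name url is only for illustration.

```shell
url='http://www.youtube.com/watch?v=abc&g=xyz'

# Greedy .* runs to the end of the line, so the capture includes "&g=xyz".
printf '%s\n' "$url" | sed -E 's/.*v=(.*)/\1/'        # prints abc&g=xyz

# A negated class stops at the first "&", which is what .*? was meant to do.
printf '%s\n' "$url" | sed -E 's/.*v=([^&]*).*/\1/'   # prints abc
```

Unlike [[:alnum:]], the class [^&] also lets through characters such as - and _, so whether it is the better choice depends on how strictly the value should be checked.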

slhck
  • 235,242
2

If you need "xyz", try this (GNU sed):

echo 'http://www.youtube.com/watch?v=abc&g=xyz' | sed 's/.*=\([[:alnum:]]*\).*/\1/'
Endoro
  • 3,014
2

Experimenting with sed based off the answers given by @Endoro and @slhck led me to the final answer (the one I wanted). This is what works for me with the version of sed on Mac OS X (10.7.5):

echo 'http://www.youtube.com/watch?v=dnCkNz_xrpg' | sed -E 's@https?://(www\.)?youtube.com/(watch\?).*v=([-_a-zA-Z0-9]*).*@\3@'

Explanation:

  1. -E is to make sed use extended RE. In other versions of sed -r may be the equivalent option.
  2. The seemingly more-complicated-than-it-needs-to-be RE is to also verify that this is a valid YouTube link. Modify the beginning parts of this RE as required (e.g., https?://(www\.)?example.com/(.*\?).*key=([^&]*).*)
  3. The \3 refers to the text matched by the 3rd expression in parentheses, so the whole line is replaced by it and printed out as the answer/match (which is what I want).
  4. Using 's@@@' instead of the usual 's///' so that I don't have to escape the many forward slashes (/) in a URL.
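
One caveat: a plain s command still echoes any line that does not match. A possible variant, sketched below, combines -n with the p flag so that only successfully substituted lines are printed. The function name extract_video_id is hypothetical, and the pattern is a stricter guess at what counts as a valid link (anchored at both ends, v allowed anywhere in the query string), not an official YouTube URL grammar.

```shell
# Hypothetical helper: print the video ID for a matching URL, nothing otherwise.
extract_video_id() {
  printf '%s\n' "$1" |
    # -n suppresses the default output; the trailing p prints only lines
    # where the substitution actually happened.
    sed -n -E 's@^https?://(www\.)?youtube\.com/watch\?(.*&)?v=([-_a-zA-Z0-9]+)(&.*)?$@\3@p'
}

extract_video_id 'http://www.youtube.com/watch?v=dnCkNz_xrpg'   # prints dnCkNz_xrpg
extract_video_id 'http://www.youtube.com/watch?g=xyz&v=abc'     # prints abc
extract_video_id 'http://example.com/watch?v=abc'               # prints nothing
```

Checking for empty output is then enough to tell a valid link from an invalid one.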

Hope this helps others too as I have been helped.

markvgti
  • 583
0

It always displays the URL because sed is not matching it.

    echo 'http://www.youtube.com/watch?v=abc&g=xyz' | sed 's!^http://www.youtube.com/watch[?]\(.*=.*\)&\(.*=.*\)!\1!'

Will display v=abc