Why is there a limit in the concurrent number of downloads?

Question

I am trying to make my own simple web crawler. I want to download files with specific extensions from a URL. I have the following code written:

    private void button1_Click(object sender, RoutedEventArgs e)
    {
        if (bw.IsBusy) return;
        // Unsubscribe first so repeated clicks don't attach the handler more than once.
        bw.DoWork -= bw_DoWork;
        bw.DoWork += bw_DoWork;
        bw.RunWorkerAsync(new string[] { URL.Text, SavePath.Text, Filter.Text });
    }
    //--------------------------------------------------------------------------------------------
    void bw_DoWork(object sender, DoWorkEventArgs e)
    {
        try
        {
            ThreadPool.SetMaxThreads(4, 4);
            string[] strs = e.Argument as string[];
            Regex reg = new Regex("<a(\\s*[^>]*?){0,1}\\s*href\\s*\\=\\s*\\\"([^>]*?)\\\"\\s*[^>]*>(.*?)</a>", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
            int i = 0;
            string domainS = strs[0];
            string Extensions = strs[2];
            string OutDir = strs[1];
            var domain = new Uri(domainS);
            string[] Filters = Extensions.Split(new char[] { ';', ',', ' ' }, StringSplitOptions.RemoveEmptyEntries);
            string outPath = System.IO.Path.Combine(OutDir, string.Format("File_{0}.html", i));

            WebClient webClient = new WebClient();
            string str = webClient.DownloadString(domainS);
            str = str.Replace("\r\n", " ").Replace('\n', ' ');
            MatchCollection mc = reg.Matches(str);
            int NumOfThreads = mc.Count;

            Parallel.ForEach(mc.Cast<Match>(), new ParallelOptions { MaxDegreeOfParallelism = 2,  },
            mat =>
            {
                string val = mat.Groups[2].Value;
                var link = new Uri(domain, val);
                foreach (string ext in Filters)
                    if (val.EndsWith("." + ext))
                    {
                        Download((object)new object[] { OutDir, link });
                        break;
                    }
            });
            throw new Exception("Finished !");

        }
        catch (System.Exception ex)
        {
            ReportException(ex);
        }
        finally
        {

        }
    }
    //--------------------------------------------------------------------------------------------
    private static void Download(object o)
    {
        try
        {
            object[] objs = o as object[];
            Uri link = (Uri)objs[1];
            string outPath = System.IO.Path.Combine((string)objs[0], System.IO.Path.GetFileName(link.ToString()));
            if (!File.Exists(outPath))
            {
                //WebClient webClient = new WebClient();
                //webClient.DownloadFile(link, outPath);

                DownloadFile(link.ToString(), outPath);
            }
        }
        catch (System.Exception ex)
        {
            ReportException(ex);
        }
    }
    //--------------------------------------------------------------------------------------------
    private static bool DownloadFile(string url, string filePath)
    {
        try
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            request.UserAgent = "Web Crawler";
            request.Timeout = 40000;
            // Dispose the response and its stream so the connection is released after each download.
            using (WebResponse response = request.GetResponse())
            using (Stream stream = response.GetResponseStream())
            using (FileStream fs = new FileStream(filePath, FileMode.CreateNew))
            {
                const int siz = 1000;
                byte[] bytes = new byte[siz];
                int count;
                // Read returns 0 once the end of the stream is reached.
                while ((count = stream.Read(bytes, 0, siz)) > 0)
                {
                    fs.Write(bytes, 0, count);
                }
            }
        }
        catch (System.Exception ex)
        {
            ReportException(ex);
            return false;
        }
        finally
        {

        }
        return true;
    }

The problem is that while it works fine for 2 parallel downloads:

        new ParallelOptions { MaxDegreeOfParallelism = 2,  }

...it doesn't work for greater degrees of parallelism like:

        new ParallelOptions { MaxDegreeOfParallelism = 5,  }

...and I get connection timeout exceptions.

At first I thought it was because of WebClient:

                //WebClient webClient = new WebClient();
                //webClient.DownloadFile(link, outPath);

...but when I replaced it with the DownloadFile function, which uses HttpWebRequest, I still got the error.

I have tested it on many web pages and nothing changed. I have also confirmed with the Chrome extension "Download Master" that these web servers allow multiple parallel downloads. Does anyone have any idea why I get timeout exceptions when trying to download many files in parallel?

Just curious: Why do you throw an exception when the work is done? – Brian Rasmussen Jun 13 '12 at 15:16
http://stackoverflow.com/questions/866350/how-can-i-programmatically-remove-the-2-connection-limit-in-webclient – Răzvan Flavius Panda Jun 13 '12 at 15:17
The exception I throw at the end is a temporary piece of code. I needed something quick to see when it was all done, so I thought "why not?". – NoOne Jun 13 '12 at 15:26

score 6 · Accepted Answer · edited May 23 '17 at 12:25


You need to set ServicePointManager.DefaultConnectionLimit. By default, the number of concurrent connections to the same host is limited to 2. Also see the related SO post on using web.config connectionManagement.
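A minimal sketch of what setting the limit might look like, assuming it runs once at startup before the first request is made. The class name, the `RaiseLimit` helper, the value 10, and the example.com URL are illustrative assumptions, not part of the original code:

```csharp
using System;
using System.Net;

static class ConnectionLimitSketch
{
    // Hypothetical helper: raises the default per-host connection limit
    // and returns the value now in effect.
    public static int RaiseLimit(int limit)
    {
        ServicePointManager.DefaultConnectionLimit = limit;
        return ServicePointManager.DefaultConnectionLimit;
    }

    static void Main()
    {
        // 10 is an arbitrary example value, not a recommendation.
        int applied = RaiseLimit(10);

        // Look up the ServicePoint for a host to inspect the limit that
        // requests to that host will be subject to.
        ServicePoint sp = ServicePointManager.FindServicePoint(new Uri("http://example.com/"));

        Console.WriteLine("Default limit: " + applied);
        Console.WriteLine("Limit for example.com: " + sp.ConnectionLimit);
    }
}
```

In the crawler from the question, the assignment would presumably go before the Parallel.ForEach call, with a value at least as large as MaxDegreeOfParallelism. The same limit can also be set declaratively through the connectionManagement element under system.net in the application's config file.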

answered Jun 13 '12 at 15:14 by SliverNinja - MSFT · edited by Community

Thanks a lot! I just got it working by setting ServicePointManager.DefaultConnectionLimit! You have saved me a lot of time. – NoOne Jun 13 '12 at 15:20

score 1 · Answer 2 · answered Jun 13 '12 at 15:24

As far as I know, IIS limits the total number of connections in and out; however, that limit should be on the order of 10^3, not ~5.

Is it possible you are testing against the same URL? I know a lot of web servers limit the number of simultaneous connections from a single client. For example, are you testing by trying to download 10 copies of http://www.google.com?

If so, you might want to try testing with a list of different sites such as:
