10

Watching the DNS and SNI of my network adapter in Wireshark, all I see is domain names and sub-domain names, but nothing after the slash, like no mention of example.com/page or twitter.com/mypage

So, I'm wondering, how does an app or browser know which page to access after the forward slash?

Does the browser or app only need to know/query the IP address of the main domain or sub-domain and then it will simply add the slash after it? like 192.168.1.1/mypage in case of Twitter for example?

I assume that works, but what if the address after the slash has a different IP address? like for example, Twitter.com is located at 192.168.1.1 but Twitter.com/mypage is located at 192.168.2.1? Is it even mainstream to do this?

Lastly but the most important one, if DNS requests/responses and TLS SNI fields only contain subdomains and main domain of a website, does it mean for example my ISP won't know exactly which Twitter or Instagram pages I visit and only can see that I access Twitter.com and Instagram.com, as long as connection is HTTPS?

P.S. Please consider only usage of plain text DNS on port 53, no secure DNS like DoH or DoT at all.

Update: Reading the comments under the selected answer on this Server Fault post answered my first question.

Giacomo1968
  • 58,727

8 Answers

43

When it comes to handling http(s) requests, all that DNS does is convert the domain name to an IP address. The web browser then connects to that IP address and asks for the resource (e.g. the part after the slash); no DNS is involved in that step.

Your contention that twitter.com is on 192.168.1.1 but twitter.com/mypage is on 192.168.2.1 is wrong. From the web client's point of view, both twitter.com and twitter.com/mypage exist on the same IP address. It is possible for the server at twitter.com to act as a reverse proxy and fetch the final data from 192.168.2.1, but the request and the response will still travel through the secure connection established between the browser and 192.168.1.1.

DNS and SNI are barely related. SNI is a field the client sends during the TLS handshake so that the web server can pick the right certificate, and it cares nothing about DNS (ignoring for the time being CAA records and the like, which are related to certificates but are not SNI and are not ubiquitous). In fact, you can take a website, move it to another IP address on another server (making sure you copy the certificates as well), modify your hosts file to point to the new IP address, and your HTTPS site will work even though you have overridden the DNS.

davidgo
  • 73,366
39

To add to the other answers: here's a quick dissection of a URL:

https://www.example.com:99/some/path?a=b&c=d#1223
  • https:// - the protocol aka the "language" that the browser will use to talk to the webserver.
  • www.example.com:99 - the address, which is further split into two parts:
    • www.example.com - the hostname aka the "domain name". The browser will convert this to an IP address before connecting
    • :99 - the TCP port number that the browser will use to establish the network connection. This part is often omitted and then the browser uses the default port number for the selected protocol (80 for http; 443 for https)
  • /some/path and ?a=b&c=d - the "path to the resource" and the "query string". The browser sends all of this together to the server, after it has established a connection (in the case of HTTPS that includes all TLS negotiations, so this gets sent encrypted). The browser doesn't modify this text apart from making sure that it doesn't contain illegal characters. It can really be anything and it's only a convention that the first part is a path to a "resource" and the second part is some sort of parameters. In reality you can send in almost anything and the server is free to do with it whatever it pleases.
  • #1223 - this is called "the fragment" and the browser does NOT send this to the server at all. This is 100% for client-side use. For example, if the URL results in an HTML page, the browser will try to find an HTML element with this ID and scroll to it. It can also be accessed via JavaScript that runs in the browser (which can then do anything it wants with it). But it will never be sent anywhere.

So, as you can see, it is indeed only the domain part which gets looked up in the DNS system. And you can't use different IP addresses depending on the path.
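If it helps to see that split done mechanically, here is a minimal sketch using Python's standard urllib.parse module. The dissect helper is my own name for illustration, not something a browser exposes.

```python
from urllib.parse import urlsplit

def dissect(url):
    """Split a URL into the parts described in the list above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,      # the protocol, e.g. "https"
        "hostname": parts.hostname,  # the only part that is looked up in DNS
        "port": parts.port,          # None when the URL omits it
        "path": parts.path,          # sent to the server inside the request
        "query": parts.query,        # sent to the server inside the request
        "fragment": parts.fragment,  # stays in the browser, never sent
    }

for name, value in dissect("https://www.example.com:99/some/path?a=b&c=d#1223").items():
    print(f"{name:9} {value}")
```

Only the hostname entry ever reaches a DNS resolver; everything else is used after the address is already known.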

Vilx-
  • 4,237
19

what if the address after the slash has a different IP address?

It literally never has a different IP address. The HTTP URL syntax doesn't make that possible; it defines that only the part up to the slash is the "authority" (the server's domain name or IP address to connect to) – the same server is always responsible for all HTTP paths under its domain.

(The actual server can handle HTTP requests for different paths in whatever way it likes, e.g. it may serve some paths locally while proxying others to a different backend host, but that's all server-side logic that is invisible to clients.)
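As a rough illustration of that server-side logic, here is a toy sketch of path-based routing. The routing table, the "local" marker and the choose_backend function are all hypothetical; real servers such as nginx or Apache have their own configuration syntax for this.

```python
# Hypothetical routing table: path prefix -> backend address (both invented).
BACKENDS = {
    "/mypage": "192.168.2.1",
}
LOCAL = "local"  # marker meaning "serve this path from the front server itself"

def choose_backend(path, backends=BACKENDS):
    """Return the backend for a path, using the longest matching prefix."""
    matches = [p for p in backends if path == p or path.startswith(p + "/")]
    if not matches:
        return LOCAL
    return backends[max(matches, key=len)]

print(choose_backend("/mypage"))         # proxied to the backend
print(choose_backend("/mypage/photos"))  # proxied to the backend
print(choose_backend("/"))               # served locally
```

Whatever this function returns, the client still only ever talks to the one address it resolved for the domain.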

grawity
  • 501,077
4

There are many good answers here but they are frame challenges or explanations of the components of a URL. I'd recommend reading those before mine since mine is meant to expand upon those.

I am going to answer by accepting the premise of the question ("how could this happen?") but clarify what it actually means when it does.

It is not strictly true that "all that DNS does is convert the domain name to an IP address". It is possible for DNS to convert a domain name into multiple IP addresses. However, all of these IP addresses are meant to be equivalent to each other, and the selection of which of them to use (in all practical cases) has nothing to do with the other components of a URL.

Here is an example answer section from dig microsoft.com that I ran just now:

microsoft.com.      2838    IN  A   20.84.181.62
microsoft.com.      2838    IN  A   20.81.111.85
microsoft.com.      2838    IN  A   20.53.203.50
microsoft.com.      2838    IN  A   20.112.52.29
microsoft.com.      2838    IN  A   20.103.85.33

The parts in the middle aren't important, but for completeness, they are the TTL (2838), the record class (IN, for Internet), and the record type (A).
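For anyone who wants to pull those fields apart programmatically, here is a small sketch that parses the answer lines shown above. The Record type and parse_answer_line are names I made up for this example; it assumes the five-column layout dig printed here.

```python
from typing import NamedTuple

class Record(NamedTuple):
    name: str    # the domain name that was queried
    ttl: int     # seconds the answer may be cached
    rclass: str  # record class, "IN" for Internet
    rtype: str   # record type, "A" for an IPv4 address
    data: str    # the address itself

def parse_answer_line(line):
    """Parse one whitespace-separated line of a dig answer section."""
    name, ttl, rclass, rtype, data = line.split(None, 4)
    return Record(name, int(ttl), rclass, rtype, data)

answer = """\
microsoft.com.      2838    IN  A   20.84.181.62
microsoft.com.      2838    IN  A   20.81.111.85
microsoft.com.      2838    IN  A   20.53.203.50
microsoft.com.      2838    IN  A   20.112.52.29
microsoft.com.      2838    IN  A   20.103.85.33
"""

records = [parse_answer_line(line) for line in answer.splitlines() if line.strip()]
addresses = [r.data for r in records if r.rtype == "A"]
print(addresses)
```

Note that nothing in a record refers to a URL path; the name column is all DNS knows about.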

When you request your browser or other tool to retrieve https://microsoft.com/example it will first do a DNS lookup for microsoft.com and then it will select one of the returned addresses to use. Very often, it will simply select the first in the list. The DNS server may also shuffle the addresses in the response, so that the first in the list is not the same one every time.

There are two main reasons why a server administrator may set up their DNS server to return more than one IP address for a particular domain name:

  • Redundancy: if the HTTP server running at one of those addresses goes down, your browser may be able to handle this case by trying again with a different IP address; since they're all meant to be equivalent, you should get the same response back.
  • Load balancing: one HTTP server may not be capable of serving all the requests that are received for the domain, and so multiple servers are used; again, they're all meant to be equivalent to one another, so you should get the same response regardless of which one you choose.

There are other ways to provide redundancy and load balancing though; for example, dig google.com right now is only returning one address for me, but I'm fairly certain Google isn't running their main page less robustly than Microsoft is. DNS is just one part of the process.

So, to connect back to the original question, it's entirely possible for https://microsoft.com/ and https://microsoft.com/example to appear to resolve to two different IP addresses, but that's just because microsoft.com resolves to multiple IP addresses and a different one was picked the second time. If you kept doing this experiment a large number of times, you would see that both URLs can be resolved to any of the 5 addresses in the pool, since as stated by others it's only the domain name that matters.

kbolino
  • 215
2

So, I'm wondering, how does an app or browser know which page to access after the forward slash?

The browser sends that path and query information to the server whose address it found from the domain name. The server determines what it wishes to return for that.

When you ask your browser (or other user agent) to retrieve http://www.example.com/foo/bar?a=1&b=2#baz, it breaks down that URL into its components specified by standard URL syntax and does the following:

  1. Determine from the scheme portion, http:, that it is to use the HTTP protocol.

  2. Determine from the // that what immediately follows it will be an authority, which in this case is just a server name: www.example.com. It will then look up the server name via DNS to get an IP address for it. You should see this DNS request and response in your Wireshark trace, if your filters allow it.

  3. Since the authority had no port specification, the browser will assume the default port 80, just as if you had typed http://www.example.com:80/foo/bar.

  4. It will then connect to the server on that host and TCP port and send the path and query strings as part of the HTTP request. These will be in the request line that starts the request: GET /foo/bar?a=1&b=2 HTTP/1.0. (Note that it does not send the fragment.) You will see this if you examine the contents of the HTTP request in Wireshark.

  5. The server will interpret the request as it wishes and return some sort of result.

  6. If the result that comes back is an HTML document, the browser will then look for an element with an id="baz" attribute (i.e., matching the fragment specified above) and scroll to it.

There are actually a few more subtleties in this process; for simplicity I've deliberately left out any mention of other schemes, other parts of the HTTP request beyond the request line (such as HTTP headers), any details about the HTTP response format, and what browsers might do with fragments when the response is not an HTML document.
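Steps 1 to 4 above can be sketched in a few lines of Python. This is only an approximation of what a user agent does; plan_request and DEFAULT_PORTS are names invented for the example, and nothing here touches the network.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def plan_request(url):
    """Return (host to look up in DNS, TCP port, HTTP request line)."""
    parts = urlsplit(url)
    host = parts.hostname                             # step 2: the authority
    port = parts.port or DEFAULT_PORTS[parts.scheme]  # step 3: default port
    target = parts.path or "/"                        # step 4: path ...
    if parts.query:
        target += "?" + parts.query                   # ... plus query string
    # The fragment (parts.fragment) is deliberately left out of the request.
    return host, port, f"GET {target} HTTP/1.0"

host, port, request_line = plan_request("http://www.example.com/foo/bar?a=1&b=2#baz")
print(host)
print(port)
print(request_line)
```

The first value is what goes to DNS, the second decides which port to connect to, and the third is the first thing sent once the connection is open.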

Lastly but the most important one, if DNS requests/responses and TLS SNI fields only contain subdomains and main domain of a website, does it mean for example my ISP won't know exactly which Twitter or Instagram pages I visit and only can see that I access Twitter.com and Instagram.com, as long as connection is HTTPS?

This is correct, so long as you've not installed any non-standard certificates in your browser that would allow a proxy or transparent proxy to proxy HTTPS connections via decryption and re-encryption.

In fact, for any given HTTPS request (or what they assume is an HTTPS request, since it goes to port 443 and uses TLS) all they can see is the IP address to which you connect, which in some cases might be a system hosting many different web sites (particularly if it's the address of a CDN endpoint). That said, they will usually see your DNS requests as well, which are in cleartext, so they can certainly guess that if you looked up example.com to get 192.168.1.1 and you shortly after connect to port 443 on 192.168.1.1, you are connecting to example.com and not a different site that may also be served from that address.
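To make the visible/hidden split concrete, here is a rough sketch of what an on-path observer could learn from one HTTPS URL. It assumes plain-text DNS on port 53, an unencrypted SNI field and no interception proxy, as in the question; observer_view is a made-up helper, not a real capture tool.

```python
from urllib.parse import urlsplit

def observer_view(url):
    """Group the parts of an HTTPS URL by whether a network observer sees them."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("this sketch only models HTTPS URLs")
    hidden = parts.path or "/"
    if parts.query:
        hidden += "?" + parts.query
    return {
        "dns_query": parts.hostname,         # visible: cleartext DNS lookup
        "sni": parts.hostname,               # visible: sent before encryption starts
        "tcp_port": parts.port or 443,       # visible: part of the TCP connection
        "encrypted_request_target": hidden,  # hidden: sent inside the TLS session
    }

print(observer_view("https://twitter.com/mypage"))
```

The destination IP address is also visible, but it comes from the DNS answer rather than from the URL, so it is not modelled here.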

cjs
  • 1,140
1

DNS will only resolve the domain name twitter.com to an IP address, e.g. 192.168.1.1 (note that this is not actually Twitter's IP address but an address from an address block reserved for private networks).

The returned IP address may differ between multiple DNS requests due to e.g. DNS traffic management or simply a change in the DNS records associated with the domain.

Once your browser has resolved twitter.com to e.g. 192.168.1.1, it will send an HTTP GET request to the server behind 192.168.1.1 asking for the resource mypage on the domain twitter.com:

GET /mypage HTTP/1.1
Host: twitter.com

Note that it would be possible for the server behind 192.168.1.1 to host multiple domains. If, for example, example.com was also hosted on 192.168.1.1, an HTTP GET request for example.com/mypage would look like this:

GET /mypage HTTP/1.1
Host: example.com

In summary, your browser finds out where to send the HTTP request using DNS and specifies within the request, which resource precisely it would like to get. The server, in turn, will know exactly which resource for which domain to serve given the information in the HTTP request.
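Here is a toy sketch of both sides of that exchange: building the request text and letting a server pick the content from the Host header plus the path. The SITES table and the function names are invented for illustration; a real web server does far more parsing and validation.

```python
def build_request(host, path):
    """Build the minimal HTTP/1.1 request shown above."""
    return f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"

def parse_request(raw):
    """Return (host, path) from a raw request. No error handling in this sketch."""
    head = raw.split("\r\n\r\n", 1)[0]
    lines = head.split("\r\n")
    _method, target, _version = lines[0].split(" ")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers.get("host"), target

# Hypothetical content table for one server hosting two domains.
SITES = {
    ("twitter.com", "/mypage"): "twitter's mypage",
    ("example.com", "/mypage"): "example's mypage",
}

def serve(raw, sites=SITES):
    """Pick the response body from the Host header and the path."""
    return sites.get(parse_request(raw), "404 Not Found")

print(serve(build_request("twitter.com", "/mypage")))
print(serve(build_request("example.com", "/mypage")))
```

Both requests would arrive at the same IP address; only the Host line tells them apart.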

For your last question, yes, using HTTPS the URL will be encrypted. However, the domain name part of the URL may be sent in clear text, depending on the TLS handshaking process in use. See this question for details.

So an attacker may be able to see you visited Twitter or Instagram but won't be able to tell which pages/profiles exactly.

Tobias
  • 111
0

You already received a good explanation of how DNS works in relation to your question. I'll answer the SNI part.

Short answer: Your ISP would only be able to see the hostname. SNI only contains the hostname your browser is trying to access. That is sent in plain text and is necessary for your browser to tell the web server which SSL certificate it's requesting. The handshake is then made, and the connection secured before the full URL is sent.

Not as short answer (much more than you asked for but...)

SNI = Server Name Indication. It's part of the TLS handshake used by HTTPS. When you want to connect to twitter.com, the name is first resolved via DNS. Then your browser opens a connection to that IP address on port 443 (when using https://) and starts the TLS handshake. The first handshake message includes the SNI, if your browser supports it, which most do. The SNI only contains the host name. If you typed https://www.twitter.com/bejrjoftj then the DNS lookup would resolve www.twitter.com and the browser would send www.twitter.com as the SNI value. Note that "www." is actually a subdomain of twitter.com.

A single IP can host many domains. HTTP and HTTPS are the common case of protocols that serve different resources based on the hostname requested. This is important because even though twitter.com and geocities.com might resolve to the same IP address, a web browser will receive different resources (the web page the server serves to you) based on the hostname requested, whereas that IP address can only host, for example, one SSH server on port 22. So, when you're accessing different websites with the same IP, that IP is only running one web server, which decides which certificate to present based on the SNI hostname and which site to serve based on the hostname in the request. But that's all SNI is: the hostname.

Apache HTTP Server and nginx both support virtual hosts. The server has a "default host" that it will serve if you, for example, used the IP address directly in your browser. Often this default host simply redirects to one of the configured virtual hosts. Virtual hosts aren't just the hostname though.

A virtual host can also be data to the right of the host name. For example, twitter.com and twitter.com/something/ could be two different virtual hosts. Since DNS only resolves the domain name/host name, twitter.com would resolve to the same IP no matter what the rest of the URL is. But the webserver does receive the full requested URL after the TLS handshake is made and the connection is encrypted. To reiterate, the purpose of SNI is to make sure the web server sends the correct SSL certificate to encrypt your connection, because if you're trying to access example.com and its IP address is the same as twitter.com, the server needs to make sure it sends the right certificate to your browser so your browser can verify that the certificate it received matches the host name it's trying to connect to.

Without the SNI, the server would have no way of knowing that you want the example.com virtual host from the server, not the twitter.com virtual host. And your browser needs to receive the example.com certificate to complete the handshake without any issues. The web server at that IP needs to have a virtual host entry for the hostname before it can define URL virtual hosts. example.com/ and example.com/page/ are not necessarily the same virtual host, even if they share the same certificate, if the config file has a virtual host defined for example.com/page/*.

The reason you might end up at 192.168.2.1 for example.com/page while example.com/ is at 192.168.1.1 is that the virtual host can have a redirect defined. If it does, it will redirect your browser to a different URL, whose host may resolve to the other IP. This is a software-defined redirect, signalled by a 3xx status code such as 301 or 302. A more commonly known status code is 404, which means the file requested doesn't exist. If the virtual host config includes a custom error page for 404, the server will send you that page every time you request a file that doesn't exist.

The redirect response also includes a new URL (in the Location header), which tells your browser "Hey, you reached the example.com/page/ virtual host but sorry Mario, your princess is in another castle. You need to send that request to twitter.com/page/ instead." And then your browser says, "oh damn, okay, my bad." And then sends a request to whatever URL the server told it to go to. That's how you end up getting redirected to a malicious URL when you tried accessing a seemingly innocent URL. But that redirect comes directly from the web server and not DNS.

The closest thing to a redirect in DNS is a CNAME record (normal IP address entries for IPv4 are A records). A CNAME record is an alias record. An A or CNAME record is assigned to a host name. So, if page.example.com has a CNAME record in the example.com zone file with a value of "twitter.com", then the DNS client will be told to look up twitter.com to complete the request for page.example.com. A CNAME record is always a domain name and never an IP address. It tells your DNS client that page.example.com is just another name to use for twitter.com. This might be useful if you wanted to use page.example.com as another name for pagebook.com and you want the DNS client to follow the trail over to pagebook.com. This doesn't require you to run a web server with a virtual host set up for page.example.com. Your DNS client will then look up pagebook.com and you'll get back whatever IP is assigned to pagebook.com instead.
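The CNAME trail can be sketched with a toy lookup table. The ZONE dictionary below is entirely made up (including the 192.168.2.1 address for pagebook.com), and a real resolver queries name servers instead of a dictionary, but the alias-following logic is the same idea.

```python
# Hypothetical records: name -> (record type, value). All values are invented.
ZONE = {
    "page.example.com": ("CNAME", "pagebook.com"),
    "pagebook.com": ("A", "192.168.2.1"),
}

def resolve(name, zone=ZONE, max_hops=8):
    """Follow CNAME aliases until an A record is found, then return its address."""
    for _ in range(max_hops):
        rtype, value = zone[name]
        if rtype == "A":
            return value
        name = value  # a CNAME value is always another name, never an address
    raise RuntimeError("CNAME chain too long")

print(resolve("page.example.com"))
print(resolve("pagebook.com"))
```

Both lookups end at the same address, and again the keys are whole host names; there is no way to attach a record to a path such as example.com/page.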

-7

DNS only involves the domain name. What you are looking at is a URL. The domain name is the word immediately before the .com and cannot have a period in it. So in something.domain.com/something, simply "domain" is the domain name, which then relates to DNS in various ways. See URLs for more.

Giacomo1968
  • 58,727