DNS Distributed System

Motivations

Distributed System

A very large number of autonomously managed servers cooperate to:

independently manage small parts of the globally unique DNS database,
implement the resolution algorithm (database access).

The only centralized point in the DNS is the initial database entry point defined by the root zone. A unique root leads to a unique DNS database.

Reliable

Fault tolerance and robustness, redundancy, database replication.

Efficiency

Low traffic, preference for local traffic, high distribution of data, caching mechanisms.

Extensibility

Large set of typed information, high autonomy and independence in the management of elementary data.

Typed Data

Resource Record (RR)

DNS data is typed information. One DNS name can refer at the same time to different kinds of information (IPv4 address, mail relay, host alias...). We can also have several values of the same type for a given name (for instance, a list of IP addresses for one name).

NB: In addition to TYPE, DNS also defines a CLASS notion, but in practice only the IN (INternet) class is used. The DNS tree and types could differ for each class, but the DNS root is unique for all classes.

All DNS information is stored in a common structure called an RR (Resource Record). An RR is a quintuplet as follows:

{ DomainName TTL CLASS TYPE RDATA }

DomainName is an absolute DNS name (FQDN Fully Qualified Domain Name)
TTL (Time To Live) is the number of seconds left before the information expires (32-bit integer, so at most about 136 years; RFC 2181 further limits the value to 2^31 - 1 seconds, about 68 years)
CLASS is IN for Internet, CH for Chaos, HS for Hesiod...
TYPE defines the information type (cf. table RR Types later)
RDATA holds the information value. The format derives from the RR type (cf. table RDATA of common RR-TYPES later)

NB:
- TYPE and CLASS values are IANA Assigned Numbers. The normative reference is http://www.iana.org/assignments/dns-parameters.
- The TTL value is set statically by the source of the information (the authoritative server), but it counts down second by second in every cached copy of the information.

Example:

diamant.int-evry.fr. 172800 IN A 157.159.10.12

"diamant's INternet IPv4 address is 157.159.10.12. This information is valid for 172800 seconds (2 days)".

The DNS Database is now defined as the set of all existing RRs defined in various DNS servers.
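As an illustration, the RR quintuplet can be modelled as a small record type. This is only a sketch: the names ResourceRecord and parse_rr are hypothetical, and the parser handles just the single-line presentation used in the example above, not the full zone-file syntax.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRecord:
    """One RR: the quintuplet { DomainName TTL CLASS TYPE RDATA }."""
    name: str
    ttl: int
    rr_class: str
    rr_type: str
    rdata: str


def parse_rr(line: str) -> ResourceRecord:
    """Parse one whitespace-separated RR line (hypothetical helper)."""
    # RDATA may itself contain spaces, so split at most 4 times.
    name, ttl, rr_class, rr_type, rdata = line.split(None, 4)
    return ResourceRecord(
        name.lower(), int(ttl), rr_class.upper(), rr_type.upper(), rdata.strip()
    )


rr = parse_rr("diamant.int-evry.fr. 172800 IN A 157.159.10.12")
```

Here rr.ttl // 86400 gives the validity period in days (2 for this record).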

DNS Request

A DNS Request is a triplet as follows:

{DomainName CLASS QTYPE}

where QTYPE (Query Type) is a common TYPE value, or some PSEUDO-TYPE value used in special requests (e.g. ANY, AXFR...)

Resolution of a common request (QTYPE=TYPE) consists in finding all matching RRs in the DNS. For instance:

Question = { google.com. IN A } ?

Response =

google.com.             300     IN      A       64.233.167.99
google.com.             300     IN      A       64.233.187.99
google.com.             300     IN      A       72.14.207.99

The set of all DNS records with the same name, class and type is called an RRset. It is a set and not a list: all RRs from the same RRset are equivalent and not ordered.
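The RRset notion can be sketched as a grouping of records by (name, class, type). The function names build_rrsets and answer are hypothetical, and the tuple layout is an assumption made for this illustration; a frozenset is used to reflect that the members are unordered.

```python
from collections import defaultdict

# Records as (name, ttl, class, type, rdata) tuples, copied from the response above.
RECORDS = [
    ("google.com.", 300, "IN", "A", "64.233.167.99"),
    ("google.com.", 300, "IN", "A", "64.233.187.99"),
    ("google.com.", 300, "IN", "A", "72.14.207.99"),
]


def build_rrsets(records):
    """Group records into RRsets keyed by (name, class, type)."""
    rrsets = defaultdict(set)
    for name, _ttl, rr_class, rr_type, rdata in records:
        # DNS names compare case-insensitively, so normalise the key.
        rrsets[(name.lower(), rr_class, rr_type)].add(rdata)
    return {key: frozenset(values) for key, values in rrsets.items()}


def answer(rrsets, name, rr_class, qtype):
    """Return the RRset matching a { DomainName CLASS QTYPE } question (empty if none)."""
    return rrsets.get((name.lower(), rr_class, qtype), frozenset())


RRSETS = build_rrsets(RECORDS)
```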

That way, the DNS gives a simple way to manage load balancing between several redundant hosts or servers. By default, the load is shared equally, but with some specific RRs we can implement various load-balancing strategies. For instance, the MX record, used by SMTP, carries a priority between mail servers (backup strategy), and the SRV record carries a general strategy mixing priorities and weights.
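The MX backup strategy can be sketched as follows: a client tries the hosts with the lowest preference value first, and hosts sharing the same preference are tried in random order. The function order_mx and the host names in the usage note are hypothetical; this is an illustration of the idea, not the behaviour of any particular mail software.

```python
import random


def order_mx(mx_records, rng=random):
    """Order (preference, host) pairs: lowest preference first, ties shuffled."""
    by_preference = {}
    for preference, host in mx_records:
        by_preference.setdefault(preference, []).append(host)
    ordered = []
    for preference in sorted(by_preference):
        hosts = by_preference[preference][:]
        rng.shuffle(hosts)  # equal preference: no order is implied, so spread the load
        ordered.extend(hosts)
    return ordered
```

For instance, with the hypothetical records (10, "mail1.example."), (10, "mail2.example.") and (20, "backup.example."), the two preference-10 hosts come first in either order, and the backup host is always last.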

Another resolution example, with a CNAME alias (which works as a redirection during resolution):

Question = { www.google.com. IN A } ?

Response =

www.google.com.         604800  IN      CNAME   www.l.google.com.
www.l.google.com.       300     IN      A       209.85.129.99
www.l.google.com.       300     IN      A       209.85.129.104
www.l.google.com.       300     IN      A       209.85.129.147
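The redirection performed by a CNAME can be sketched as a loop that restarts the lookup at the alias target. The function resolve_with_cname, the dictionary layout and the max_links bound are assumptions made for this illustration.

```python
# (name, type) -> list of RDATA values, copied from the response above (TTLs omitted).
DATA = {
    ("www.google.com.", "CNAME"): ["www.l.google.com."],
    ("www.l.google.com.", "A"): ["209.85.129.99", "209.85.129.104", "209.85.129.147"],
}


def resolve_with_cname(data, name, qtype, max_links=8):
    """Follow CNAME aliases until records of type qtype are found.

    Returns (chain, values) where chain lists the (alias, target) links followed.
    """
    chain = []
    while True:
        if (name, qtype) in data:
            return chain, data[(name, qtype)]
        alias = data.get((name, "CNAME"))
        if not alias:
            return chain, []  # neither the requested type nor an alias exists
        if len(chain) >= max_links:
            # Bound the work so that a CNAME loop cannot run forever.
            raise RuntimeError("too many CNAME links")
        chain.append((name, alias[0]))
        name = alias[0]
```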

Simplified Resolution

Hypothesis

To explain the resolution algorithm, we first consider a simplified version of the DNS. The real case will be presented afterwards as an extension of this simplified case, introducing 3 notions: "zone", "caching" and "redundancy".

In our DNS approximation, we assume that each internal node (not a leaf) of the DNS tree is managed by one DNS server. That way, the DNS infrastructure derives exactly from the DNS tree, with one DNS server dedicated to each DNS domain. In the real case (later), we can have several servers for the same node, or one server for several nodes.

The server dedicated to a domain XXX is called the authoritative DNS server for XXX. This host directly manages all RRs with names such as label.XXX (leaves) and also all references to sub-domains such as .label.XXX (child delegation). A delegation reference gives the name or the address of the authoritative DNS server of the sub-domain. The DNS uses the NS (Name Server) record to manage these references. A record "xx NS yy" means that the host "yy" is the authoritative server for the domain "xx".

Remark: All information used to manage the DNS is stored in the DNS database itself.

HOW TO resolve a DNS request?

Under the simplified assumption, only the authoritative server for domain XXX knows about RRs with names such as label.XXX. So, resolving a request consists in finding the authoritative server dedicated to the domain used in the request and then asking it directly for the response. Resolving a name D.C.B.A. implies (as a final step) asking the authoritative server of domain C.B.A. for D.

Now, how do we find the authoritative server for a given domain? This information exists only as an NS record in the parent domain of the searched domain. So the only way to complete the resolution is to start from the root domain and to ask iteratively, at each tree level, for delegation information (NS records).

For instance, we would have a step-by-step resolution as follows:

Question = { D.C.B.A. IN Type } ?
step1 = Request for { A. IN NS } ? to the Root authoritative server (well-known !)
step2 = Request for { B.A. IN NS } ? to A. authoritative server (obtained from step1)
step3 = Request for { C.B.A. IN NS } ? to B.A. authoritative server (obtained previously)
step4 = Request for { D.C.B.A. IN Type } ? to C.B.A. authoritative server
step 5 = thanks all !

In fact, we always send the same initial question at each step, and each DNS server replies with the best information it knows with respect to the request: the final response if it is authoritative, a delegation if not.
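The walk from the root can be sketched with an in-memory model of the simplified hypothesis (one server per domain). Everything here is hypothetical: the server names, the address 192.0.2.1, and the functions query and resolve. The same question is sent at every step, and each server returns either the answer or its closest delegation.

```python
# Hypothetical servers for the tree of the example: one server per domain.
# Each server maps (name, type) -> value; an NS value names the child's server.
SERVERS = {
    "root-server": {("a.", "NS"): "a-server"},
    "a-server": {("b.a.", "NS"): "b-server"},
    "b-server": {("c.b.a.", "NS"): "c-server"},
    "c-server": {("d.c.b.a.", "A"): "192.0.2.1"},
}


def ancestors_top_down(name):
    """'d.c.b.a.' -> ['a.', 'b.a.', 'c.b.a.', 'd.c.b.a.']"""
    labels = name.rstrip(".").split(".")
    return [".".join(labels[i:]) + "." for i in range(len(labels) - 1, -1, -1)]


def query(server, name, qtype):
    """Answer with the final data if known, else with the closest delegation."""
    local_data = SERVERS[server]
    if (name, qtype) in local_data:
        return "answer", local_data[(name, qtype)]
    for domain in reversed(ancestors_top_down(name)):  # deepest ancestor first
        if (domain, "NS") in local_data:
            return "delegation", local_data[(domain, "NS")]
    return "no-data", None


def resolve(name, qtype, start="root-server"):
    """Iterate from the root, following delegations; returns (trace, result)."""
    server, trace = start, []
    # At most one query per tree level, plus one: bound the loop accordingly.
    for _ in range(len(ancestors_top_down(name)) + 1):
        kind, value = query(server, name, qtype)
        trace.append((server, kind))
        if kind != "delegation":
            return trace, value
        server = value
    raise RuntimeError("delegation chain longer than the name allows")
```

Calling resolve("d.c.b.a.", "A") visits the four servers in order, with three delegations followed by one answer, which mirrors steps 1 to 4 above.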

DNS servers

In practice, the term DNS server covers different and possibly independent services.

First, as described before, the authoritative DNS servers are responsible for managing a small part of the DNS database.

Another kind of DNS server is dedicated to the resolution algorithm. The initial idea was to avoid a full implementation of the DNS protocol on all terminal hosts and to deploy a local server dedicated to the full resolution. This architecture becomes fundamental for performance and efficiency in the real case later. In this design, each Internet host implements a light version of the DNS protocol, which only sends requests to its well-known DNS server (local to the site). The server then carries out the full DNS resolution on behalf of the client, and acts as a proxy with respect to the DNS protocol.

These servers are called resolution DNS servers or recursive DNS servers.

The light clients found on all hosts are called stub resolvers.

Simplified resolution Algorithm

The resolution process can be described as:

The client sends the DNS request to a recursive DNS server (local configuration, e.g. /etc/resolv.conf on UNIX). The choice of server is free, and the client can try several servers if needed.
If the server is authoritative for the requested domain, it replies directly with local information.
Otherwise, the server searches for the authoritative server of the requested domain, sends the request to this authoritative server, and then sends the response back to the client.
Searching for the authoritative server consists in:
1. Choosing the "best known ancestor" of the requested domain: for now, it can only be the root domain or the local domain managed by the server,
2. Requesting NS records from successive authoritative servers, following the branch from the best known server to the destination authoritative server.

QED.

Recursive or Iterative

The previous algorithm integrates 2 resolution modes: iterative and recursive.

The recursive mode consists in asking a server for a full resolution and waiting for the final response from that server.

The iterative mode consists in asking a server for the best information it knows about the request, and then iterating server by server to reach the authoritative one.

In fact, when sending a DNS request, you can get 3 kinds of response from a DNS server:

Positive Recursive response : The requested RRset
Negative Recursive response : No such record in the DNS
Iterative response : "I don't know but I give you a better Name-Server to ask"

Each DNS request states the desired mode, iterative or recursive, but the server can decide to support both modes or to send only iterative responses in all cases. In many cases, it's a good idea to deploy "recursive-only DNS servers" and "authoritative-only DNS servers" separately.

A typical recursive server uses the iterative process to resolve requests, but in some cases it can be useful to resolve recursive requests by sending them to another recursive server. Such servers are called forwarders.

In practice, a complete resolution starts with a recursive phase from the client (and possibly forwarders) and ends with an iterative phase.

Improvements : Zone, Cache, Redundancy

DNS Zone

It would be heavy to manage one authoritative server for each node in the DNS tree, so we now allow a set of domains to be managed as a single unit.

A DNS ZONE is defined as a connected part of the naming tree (so zone == partial sub-tree). A DNS zone contains a DNS domain, the root of the zone, and possibly any number of descendant domains (at any level). Moreover, all DNS zones are disjoint and the set of zones spans the DNS tree.

The zone name is the domain name of the root of the zone.

The simplified algorithm can easily be adapted by considering that DNS servers are authoritative for a zone (not a single domain), and that one iteration step means one level in the tree of zones, which can be more than one level in the DNS naming tree.

Caching

The most important improvement for DNS efficiency is the caching mechanism. To limit traffic and to avoid overloading authoritative servers (mostly at the top levels), any client or any resolution DNS server can memorise all DNS information previously learned and reuse this information to resolve new requests.

The caching mechanism covers all final responses, but also all name-server (NS) references learned during the iterative process.

For instance, making the full resolution for www.int-evry.fr A ?, I learn:

www.int-evry.fr.    3600  IN      A       157.159.11.8

;; and also

.                 251584  IN      NS      A.ROOT-SERVERS.NET.
fr.               172800  IN      NS      A.NIC.fr.
int-evry.fr.      172800  IN      NS      diamant.int-evry.fr.

;; and also

A.ROOT-SERVERS.NET.     604800  IN      A       198.41.0.4
A.nic.fr.               171930  IN      A       192.93.0.129
A.nic.fr.               171930  IN      AAAA    2001:660:3005:3::1:1
diamant.int-evry.fr.    172800  IN      A       157.159.10.12

;; and more ...

Next, if I have to resolve ftp.int-evry.fr A ?, I directly start (and finish) the iterative process from server "diamant", authoritative for "int-evry.fr". In the same way, if I have to resolve www.enst.fr A ?, I start the iterations from server "A.NIC.fr.", authoritative for "fr."

We can now see the main benefit of a local resolution DNS server: it acts as a shared cache for a site and greatly limits long-distance traffic.
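Choosing the starting point from cached NS references can be sketched as a search for the closest ancestor of the requested name that has a cached NS entry. The function best_known_server is hypothetical, and the cache is reduced to one server per zone for brevity.

```python
# NS references learned during the resolution above: zone name -> one server name.
CACHED_NS = {
    ".": "A.ROOT-SERVERS.NET.",
    "fr.": "A.NIC.fr.",
    "int-evry.fr.": "diamant.int-evry.fr.",
}


def best_known_server(name, cached_ns):
    """Return (zone, server) for the closest ancestor of name with a cached NS."""
    labels = name.lower().rstrip(".").split(".")
    # Try the name itself, then its parent, grandparent, ... and finally the root ".".
    for i in range(len(labels) + 1):
        candidate = ".".join(labels[i:]) + "."
        if candidate in cached_ns:
            return candidate, cached_ns[candidate]
    raise LookupError("no NS reference cached, not even for the root")
```

With this cache, ftp.int-evry.fr. starts at diamant.int-evry.fr., www.enst.fr. starts at A.NIC.fr., and any name outside fr. falls back to the root server.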

The caching mechanism is controlled by the TTL value set in each RR. This value is fixed by the source of the information (the authoritative server) and counts down every second once the information is copied into a cache. The TTL defines the validity period of each piece of DNS information and also the maximum rate at which the same question is asked again to an authoritative server. So, the choice of the TTL value is a compromise between the reactivity of the database and the load on authoritative servers (and traffic).
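A TTL-driven cache can be sketched by storing an expiry time with each entry instead of decrementing a counter every second; the remaining TTL is then computed on demand. The class TtlCache and its injectable clock parameter are assumptions made for this illustration (the clock makes the behaviour easy to test).

```python
import time


class TtlCache:
    """Cache keyed by (name, type); an entry expires once its TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # (name, type) -> (expiry_time, values)

    def put(self, name, rr_type, ttl, values):
        """Store values with the TTL chosen by the authoritative source."""
        self._entries[(name, rr_type)] = (self._clock() + ttl, list(values))

    def get(self, name, rr_type):
        """Return (remaining_ttl, values), or None if absent or expired."""
        entry = self._entries.get((name, rr_type))
        if entry is None:
            return None
        expiry, values = entry
        remaining = expiry - self._clock()
        if remaining <= 0:
            del self._entries[(name, rr_type)]  # expired: must be resolved again
            return None
        return int(remaining), values
```

An entry stored with a TTL of 3600 and read 600 seconds later is returned with a remaining TTL of 3000; after the full hour it is gone and the resolver has to ask again.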

NB: It is interesting to note that the DNS caching policy is fixed by the source of the information and not by the user of the information, as it is in many caching mechanisms.

The cache also operates for negative responses (no such name, no such record). In this case, the time to live in caches is a common parameter for all "nonexistent" records in a DNS zone. This value, the negative cache TTL, appears as one parameter in a specific RR named SOA.

It would take longer to rewrite the simplified algorithm with the caching mechanism, but the resolution logic remains unchanged, considering that at each step we try to use the "best known information", including information from the local cache. (...)

Redundancy

For reliability and also for performance, we can't rely on only one authoritative server for a DNS zone; the authoritative service must be replicated on several hosts. The DNS allows a list (no, a set!) of hosts to be defined as authoritative for the same zone. The zone database is then duplicated across the authoritative servers of the zone, and all these servers are equivalent in the resolution process.

For redundancy, the DNS protocol offers the following capabilities:

full transfer of the zone database between the authoritative servers of the zone,
consistency and update controls between the duplicated zone databases,
load balancing between the authoritative servers during the resolution process.

We don't detail the DNS redundancy mechanisms here. For information, the transfer capability is handled with the specific query types AXFR and IXFR, and consistency is controlled with parameters in the SOA record (and also the NOTIFY extension).

The load balancing during resolution derives directly from the RRset semantics of the DNS. To define multiple authoritative servers for a zone, just define multiple NS records for the zone name. That way, we obtain a set of equivalent and unordered authoritative zone servers. The simplified algorithm can be updated by replacing "the authoritative server" with "any one of the authoritative servers".

Remarks: the authoritative server holding the original source of the zone database is called the "primary" or "master" server. Other authoritative servers, holding only a copy of the database, are called "secondary" or "slave". These notions are useful for the transfer and consistency capabilities, but for resolution all the authoritative servers are equal.

Some details about the NS record

NS RRset

The NS records implement the DNS tree structure and give, for each zone name, the set of authoritative DNS servers of the zone. For instance, requesting "int-evry.fr. NS ?", we obtain an RRset with the 4 authoritative servers for zone "int-evry.fr.":

int-evry.fr. 172800 IN NS diamant.int-evry.fr.
int-evry.fr. 172800 IN NS ns4.enst.fr.
int-evry.fr. 172800 IN NS ns1.int-evry.fr.
int-evry.fr. 172800 IN NS ns2.nic.fr.

The NS records for a given zone exist in the zone database itself, because that is the standard place for any record with this name (zone management autonomy and independence), but these NS records are also required in the parent zone to make delegation work. For instance, before accessing the zone int-evry.fr. for the first time, I ask the parent zone fr. about int-evry.fr. and I expect the previous NS RRset in return.

This duplication of the NS RRsets between the parent and the child zones is an important exception to the independence principle of zone databases. The consistency of this duplication is not managed by the DNS protocol and is the responsibility of the zone administrators (!).

Glue

The NS RRset duplication is not enough to ensure a correct delegation in all cases. Looking at the previous example, the fr. server says:
-- for int-evry.fr. information, ask diamant.int-evry.fr.
- OK, but how do I send a request to diamant?
-- Create a DNS request, encapsulate the DNS message in UDP then in IP, and send it to the diamant IP address
- OK, thanks. What's the diamant IP address?
-- Simply ask the DNS (so stupid!)
- !!!! I need to speak with diamant to learn the IP address of diamant !!!

To be useful in the parent zone, an NS record sometimes requires an additional address record (A or AAAA) to exist in the parent zone. This record is called a GLUE record, and it is mandatory in the parent zone for every NS record in which the server name is a sub-name of the zone name. As for the NS RRset duplication, the consistency of the duplicated GLUE records is the administrator's responsibility.
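The glue rule above can be sketched as a sub-name test on each NS target. The helper names is_subname and servers_needing_glue are hypothetical; the server names come from the int-evry.fr. NS RRset shown earlier.

```python
def is_subname(name, zone):
    """True if name equals zone or lies below it in the DNS tree."""
    name, zone = name.lower(), zone.lower()
    # The leading dot prevents "xint-evry.fr." from matching "int-evry.fr.".
    return zone == "." or name == zone or name.endswith("." + zone)


def servers_needing_glue(zone, ns_names):
    """NS targets located inside the delegated zone: the parent must hold their addresses."""
    return sorted(ns for ns in ns_names if is_subname(ns, zone))


INT_EVRY_NS = [
    "diamant.int-evry.fr.",
    "ns4.enst.fr.",
    "ns1.int-evry.fr.",
    "ns2.nic.fr.",
]
```

For the zone int-evry.fr., only diamant.int-evry.fr. and ns1.int-evry.fr. require glue in the parent zone fr.; the two other servers live in other zones and can be resolved independently.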

Example

Zone Database example. :

;; sub-zone delegation : "sub"
sub.example.       7200 IN NS  foo.sub.example.
sub.example.       7200 IN NS  toto.example.test.
foo.sub.example.  14400 IN A 157.159.10.10  ;!!! Mandatory GLUE !!!

;; sub-domains without delegation : "other"
bar.other.example.     14400 IN A 157.159.100.1
foo.bar.other.example. 14400 IN A 157.159.100.2

Zone Database sub.example. :

sub.example.      3600 IN SOA ......; zone parameters
sub.example.      7200 IN NS  toto.other.test.
sub.example.      7200 IN NS  foo.sub.example.
foo.sub.example.  7200 IN A 157.159.10.10 ;!!!Take care, glue exists!!!
www.sub.example.  7200 IN A 157.159.10.11
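Since the consistency of the duplicated NS RRset is left to the administrators, a simple comparison of the two copies can help. The following sketch uses a hypothetical function ns_mismatch and the NS names exactly as they appear in the two listings above; applied to them, it reports that the second server name differs between the parent and the child copies.

```python
PARENT_NS = {"foo.sub.example.", "toto.example.test."}  # from zone "example."
CHILD_NS = {"foo.sub.example.", "toto.other.test."}     # from zone "sub.example."


def ns_mismatch(parent_ns, child_ns):
    """Return (only_in_parent, only_in_child); both lists are empty when the copies agree."""
    parent = {name.lower() for name in parent_ns}
    child = {name.lower() for name in child_ns}
    return sorted(parent - child), sorted(child - parent)
```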

DNS resolution Algorithm

The DNS resolution algorithm follows, as defined in the standard RFC 1034 (p. 34-35):

RFC1034 Domain Concepts and Facilities November 1987

5.3.3. Algorithm

The top level algorithm has four steps:

1. See if the answer is in local information, and if so return it to the client.
2. Find the best servers to ask.
3. Send them queries until one returns a response.
4. Analyze the response, either:
- if the response answers the question or contains a name error, cache the data as well as returning it back to the client.
- if the response contains a better delegation to other servers, cache the delegation information, and go to step 2.
- if the response shows a CNAME and that is not the answer itself, cache the CNAME, change the SNAME to the canonical name in the CNAME RR and go to step 1.
- if the response shows a servers failure or other bizarre contents, delete the server from the SLIST and go back to step 3.

Step 1 searches the cache for the desired data. If the data is in the cache, it is assumed to be good enough for normal use. Some resolvers have an option at the user interface which will force the resolver to ignore the cached data and consult with an authoritative server. This is not recommended as the default. If the resolver has direct access to a name server's zones, it should check to see if the desired data is present in authoritative form, and if so, use the authoritative data in preference to cached data.

Step 2 looks for a name server to ask for the required data. The general strategy is to look for locally-available name server RRs, starting at SNAME, then the parent domain name of SNAME, the grandparent, and so on toward the root. Thus if SNAME were Mockapetris.ISI.EDU, this step would look for NS RRs for Mockapetris.ISI.EDU, then ISI.EDU, then EDU, and then . (the root). These NS RRs list the names of hosts for a zone at or above SNAME. Copy the names into SLIST. Set up their addresses using local data. It may be the case that the addresses are not available. The resolver has many choices here; the best is to start parallel resolver processes looking for the addresses while continuing onward with the addresses which are available. Obviously, the design choices and options are complicated and a function of the local host's capabilities. The recommended priorities for the resolver designer are:

1. Bound the amount of work (packets sent, parallel processes started) so that a request can't get into an infinite loop or start off a chain reaction of requests or queries with other implementations EVEN IF SOMEONE HAS INCORRECTLY CONFIGURED SOME DATA.
2. Get back an answer if at all possible.
3. Avoid unnecessary transmissions.
4. Get the answer as quickly as possible.

If the search for NS RRs fails, then the resolver initializes SLIST from the safety belt SBELT. The basic idea is that when the resolver has no idea what servers to ask, it should use information from a configuration file that lists several servers which are expected to be helpful. Although there are special situations, the usual choice is two of the root servers and two of the servers for the host's domain. The reason for two of each is for redundancy. The root servers will provide eventual access to all of the domain space. The two local servers will allow the resolver to continue to resolve local names if the local network becomes isolated from the internet due to gateway or link failure.

In addition to the names and addresses of the servers, the SLIST data structure can be sorted to use the best servers first, and to insure that all addresses of all servers are used in a round-robin manner. The sorting can be a simple function of preferring addresses on the local network over others, or may involve statistics from past events, such as previous response times and batting averages.

Step 3 sends out queries until a response is received. The strategy is to cycle around all of the addresses for all of the servers with a timeout between each transmission. In practice it is important to use all addresses of a multihomed host, and too aggressive a retransmission policy actually slows response when used by multiple resolvers contending for the same name server and even occasionally for a single resolver. SLIST typically contains data values to control the timeouts and keep track of previous transmissions.

Step 4 involves analyzing responses. The resolver should be highly paranoid in its parsing of responses. It should also check that the response matches the query it sent using the ID field in the response.

The ideal answer is one from a server authoritative for the query which either gives the required data or a name error. The data is passed back to the user and entered in the cache for future use if its TTL is greater than zero.

If the response shows a delegation, the resolver should check to see that the delegation is "closer" to the answer than the servers in SLIST are. This can be done by comparing the match count in SLIST with that computed from SNAME and the NS RRs in the delegation. If not, the reply is bogus and should be ignored. If the delegation is valid the NS delegation RRs and any address RRs for the servers should be cached. The name servers are entered in the SLIST, and the search is restarted.

If the response contains a CNAME, the search is restarted at the CNAME unless the response has the data for the canonical name or if the CNAME is the answer itself.

Details and implementation hints can be found in [RFC-1035].
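The delegation check of step 4 (is the referral "closer" to the answer than the servers already in SLIST?) can be sketched by counting shared trailing labels. The RFC does not define the match count precisely, so the definition used here, as well as the function names match_count and is_better_delegation, are assumptions for illustration only.

```python
def _labels(name):
    """Split a domain name into labels; the root '.' yields an empty list."""
    return [label for label in name.lower().rstrip(".").split(".") if label]


def match_count(sname, zone):
    """Number of trailing labels that sname shares with zone (0 for the root)."""
    s, z = _labels(sname), _labels(zone)
    count = 0
    while count < len(s) and count < len(z) and s[-1 - count] == z[-1 - count]:
        count += 1
    return count


def is_better_delegation(sname, current_zone, delegated_zone):
    """Accept a referral only if it is an ancestor of sname and closer than current_zone."""
    delegated = match_count(sname, delegated_zone)
    is_ancestor = delegated == len(_labels(delegated_zone))
    return is_ancestor and delegated > match_count(sname, current_zone)
```

With SNAME Mockapetris.ISI.EDU., a referral from EDU. to ISI.EDU. is accepted, while a referral back to EDU., or sideways to an unrelated zone such as MIT.EDU., is rejected as bogus.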