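These notes summarize what I learned while studying the DNS protocol, along with my thoughts on some of the problems I ran into.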

The DNS protocol

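By default DNS uses UDP and sends messages to port 53 on the server. UDP does not need TCP's three-way handshake to set up a connection, so it is more efficient: the client simply sends a request and receives a reply.
There are 13 root name servers. This is because the early protocol design limited a DNS message carried over UDP to 512 bytes (not counting the headers of the lower layers), which leaves room for at most 13 root server entries. See the following articles for details: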
https://www.apnic.net/get-ip/faqs/rootservers/
https://miek.nl/2013/november/10/why-13-dns-root-servers/
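There are indeed 13 root server IP addresses, but each IP does not point to a single machine. Anycast over BGP is used: when you send to one of these IPs, routers deliver the packet to the server instance nearest to you. This is how the root DNS servers are load balanced.
For an introduction to unicast, multicast, broadcast and anycast, see: https://www.hi-linux.com/posts/26571.html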
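
The request/reply exchange over UDP described at the start of this section can be sketched in Go as follows. This is only an illustration: buildQuery and query are hypothetical helper names, the transaction ID 0x1234 is arbitrary, and 8.8.8.8 is just an example resolver. The message layout (12-byte header, length-prefixed labels, then query type and class) follows the standard DNS wire format.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net"
	"strings"
	"time"
)

// buildQuery encodes a recursive query for the A record of name.
// Layout: 12-byte header, QNAME as length-prefixed labels, QTYPE, QCLASS.
// Labels longer than 63 bytes are not rejected in this sketch.
func buildQuery(id uint16, name string) []byte {
	msg := make([]byte, 12, 64)
	binary.BigEndian.PutUint16(msg[0:2], id)     // transaction ID
	binary.BigEndian.PutUint16(msg[2:4], 0x0100) // flags: recursion desired
	binary.BigEndian.PutUint16(msg[4:6], 1)      // one question
	for _, label := range strings.Split(strings.TrimSuffix(name, "."), ".") {
		msg = append(msg, byte(len(label)))
		msg = append(msg, label...)
	}
	msg = append(msg, 0)          // root label ends the name
	msg = append(msg, 0, 1, 0, 1) // QTYPE=A, QCLASS=IN
	return msg
}

// query sends one datagram to server:53 and waits for a single reply.
// There is no handshake: one write, one read.
func query(server, name string, timeout time.Duration) ([]byte, error) {
	conn, err := net.Dial("udp", net.JoinHostPort(server, "53"))
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	if err := conn.SetDeadline(time.Now().Add(timeout)); err != nil {
		return nil, err
	}
	if _, err := conn.Write(buildQuery(0x1234, name)); err != nil {
		return nil, err
	}
	buf := make([]byte, 512) // the classic size limit for DNS over UDP
	n, err := conn.Read(buf)
	if err != nil {
		return nil, err
	}
	return buf[:n], nil
}

func main() {
	q := buildQuery(0x1234, "www.163.com")
	fmt.Printf("query is %d bytes: % x\n", len(q), q)
	// Uncomment to send it; this needs network access to a resolver.
	// reply, err := query("8.8.8.8", "www.163.com", 5*time.Second)
	// fmt.Println(len(reply), err)
}
```

The same encoding explains the payload sizes that show up in packet captures: a query for www.baid1xe21.com1 is 12 bytes of header, 20 bytes of name and 4 bytes of type and class, 36 bytes in total.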

Understanding resolv.conf

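For a detailed description of /etc/resolv.conf, see man 5 resolv.conf; the file is handled by the glibc resolver. If a program talks to the DNS server itself instead of going through glibc, the options in this file do not take effect for it. dig, for example, interacts with the DNS server directly, while ping, curl and wget use glibc.
The defaults are timeout 5 and attempts 2: the resolver waits 5 s for a reply before giving up on a server and moving to the next one, and gives up entirely after 2 rounds. One round means every configured DNS server has been tried once without success.
Suppose the configuration file is as follows: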

# cat /etc/resolv.conf
options single-request-reopen
; generated by /usr/sbin/dhclient-script
search localdomain
nameserver 114.114.114.117
nameserver 114.114.114.113

Below is the packet exchange for ping www.baid1xe21.com1. After two rounds over both nameservers fail, the resolver also applies the search setting and tries www.baid1xe21.com1.localdomain.

# tcpdump -i eth0 -eennvv udp
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:39:12.576101 00:1c:42:03:fd:99 > 00:1c:42:00:00:18, ethertype IPv4 (0x0800), length 78: (tos 0x0, ttl 64, id 46506, offset 0, flags [DF], proto UDP (17), length 64)
    10.211.55.22.52434 > 114.114.114.117.53: [bad udp cksum 0x270e -> 0x20fb!] 20589+ A? www.baid1xe21.com1. (36)
15:39:17.581591 00:1c:42:03:fd:99 > 00:1c:42:00:00:18, ethertype IPv4 (0x0800), length 78: (tos 0x0, ttl 64, id 37849, offset 0, flags [DF], proto UDP (17), length 64)
    10.211.55.22.55684 > 114.114.114.113.53: [bad udp cksum 0x270a -> 0x144d!] 20589+ A? www.baid1xe21.com1. (36)
15:39:22.588076 00:1c:42:03:fd:99 > 00:1c:42:00:00:18, ethertype IPv4 (0x0800), length 78: (tos 0x0, ttl 64, id 52914, offset 0, flags [DF], proto UDP (17), length 64)
    10.211.55.22.52434 > 114.114.114.117.53: [bad udp cksum 0x270e -> 0x20fb!] 20589+ A? www.baid1xe21.com1. (36)
15:39:27.593191 00:1c:42:03:fd:99 > 00:1c:42:00:00:18, ethertype IPv4 (0x0800), length 78: (tos 0x0, ttl 64, id 47412, offset 0, flags [DF], proto UDP (17), length 64)
    10.211.55.22.55684 > 114.114.114.113.53: [bad udp cksum 0x270a -> 0x144d!] 20589+ A? www.baid1xe21.com1. (36)
15:39:32.598796 00:1c:42:03:fd:99 > 00:1c:42:00:00:18, ethertype IPv4 (0x0800), length 90: (tos 0x0, ttl 64, id 60578, offset 0, flags [DF], proto UDP (17), length 76)
    10.211.55.22.53255 > 114.114.114.117.53: [bad udp cksum 0x271a -> 0x09f4!] 59663+ A? www.baid1xe21.com1.localdomain. (48)
15:39:37.605004 00:1c:42:03:fd:99 > 00:1c:42:00:00:18, ethertype IPv4 (0x0800), length 90: (tos 0x0, ttl 64, id 52533, offset 0, flags [DF], proto UDP (17), length 76)
    10.211.55.22.50522 > 114.114.114.113.53: [bad udp cksum 0x2716 -> 0x14a5!] 59663+ A? www.baid1xe21.com1.localdomain. (48)
15:39:42.609907 00:1c:42:03:fd:99 > 00:1c:42:00:00:18, ethertype IPv4 (0x0800), length 90: (tos 0x0, ttl 64, id 3384, offset 0, flags [DF], proto UDP (17), length 76)
    10.211.55.22.53255 > 114.114.114.117.53: [bad udp cksum 0x271a -> 0x09f4!] 59663+ A? www.baid1xe21.com1.localdomain. (48)
15:39:47.616265 00:1c:42:03:fd:99 > 00:1c:42:00:00:18, ethertype IPv4 (0x0800), length 90: (tos 0x0, ttl 64, id 52752, offset 0, flags [DF], proto UDP (17), length 76)
    10.211.55.22.50522 > 114.114.114.113.53: [bad udp cksum 0x2716 -> 0x14a5!] 59663+ A? www.baid1xe21.com1.localdomain. (48)

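The system calls observed with strace are shown below (this trace was taken from a different lookup, www.163.com with the search domain appended, against two other nameservers):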

17:38:46 close(5)                       = 0 <0.000263>
17:38:46 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 4 <0.000222>
17:38:46 connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.211.54.2")}, 16) = 0 <0.000280>
17:38:46 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}]) <0.000173>
17:38:46 sendto(4, "w\221\1\0\0\1\0\0\0\0\0\0\3www\003163\3com\vlocaldo"..., 41, MSG_NOSIGNAL, NULL, 0) = 41 <0.000147>
17:38:46 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.006352>
17:38:51 socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 5 <0.000306>
17:38:51 connect(5, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.211.54.3")}, 16) = 0 <0.000243>
17:38:51 poll([{fd=5, events=POLLOUT}], 1, 0) = 1 ([{fd=5, revents=POLLOUT}]) <0.000109>
17:38:51 sendto(5, "w\221\1\0\0\1\0\0\0\0\0\0\3www\003163\3com\vlocaldo"..., 41, MSG_NOSIGNAL, NULL, 0) = 41 <0.000290>
17:38:51 poll([{fd=5, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.006234>
17:38:56 poll([{fd=4, events=POLLOUT}], 1, 0) = 1 ([{fd=4, revents=POLLOUT}]) <0.000221>
17:38:56 sendto(4, "w\221\1\0\0\1\0\0\0\0\0\0\3www\003163\3com\vlocaldo"..., 41, MSG_NOSIGNAL, NULL, 0) = 41 <0.000371>
17:38:56 poll([{fd=4, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.006071>
17:39:01 poll([{fd=5, events=POLLOUT}], 1, 0) = 1 ([{fd=5, revents=POLLOUT}]) <0.000119>
17:39:01 sendto(5, "w\221\1\0\0\1\0\0\0\0\0\0\3www\003163\3com\vlocaldo"..., 41, MSG_NOSIGNAL, NULL, 0) = 41 <0.000617>
17:39:01 poll([{fd=5, events=POLLIN}], 1, 5000) = 0 (Timeout) <5.005939>
17:39:06 close(4)                       = 0 <0.000189>
17:39:06 close(5)                       = 0 <0.000146>
17:39:06 open("/usr/share/locale/locale.alias", O_RDONLY|O_CLOEXEC) = 4 <0.000120>

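When the IP address is obtained automatically through DHCP, the DHCP client often overwrites options the user has set in the file. A command like the following makes the settings persistent: DHCP then only updates the DNS servers and does not delete the user's options.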

echo 'RES_OPTIONS="timeout:2 attempts:3 rotate single-request-reopen"' >>/etc/sysconfig/network
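
To make the meaning of such an option string concrete, here is a minimal Go sketch that parses it into a struct. The resolverOptions type and parseOptions function are hypothetical, only the options mentioned in these notes are handled, and the defaults are the ones quoted above (timeout 5, attempts 2, ndots 1). It is not how glibc itself parses the string.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolverOptions holds the handful of options discussed in these notes.
type resolverOptions struct {
	Timeout             int // seconds to wait for each query
	Attempts            int // rounds over the nameserver list
	Ndots               int // dot threshold for trying the name as-is first
	Rotate              bool
	SingleRequestReopen bool
}

// parseOptions reads a space-separated option string.
// Unknown or malformed tokens are silently ignored.
func parseOptions(s string) resolverOptions {
	o := resolverOptions{Timeout: 5, Attempts: 2, Ndots: 1}
	for _, tok := range strings.Fields(s) {
		parts := strings.SplitN(tok, ":", 2)
		switch parts[0] {
		case "rotate":
			o.Rotate = true
		case "single-request-reopen":
			o.SingleRequestReopen = true
		case "timeout", "attempts", "ndots":
			if len(parts) != 2 {
				continue
			}
			n, err := strconv.Atoi(parts[1])
			if err != nil {
				continue
			}
			switch parts[0] {
			case "timeout":
				o.Timeout = n
			case "attempts":
				o.Attempts = n
			case "ndots":
				o.Ndots = n
			}
		}
	}
	return o
}

func main() {
	o := parseOptions("timeout:2 attempts:3 rotate single-request-reopen")
	fmt.Printf("%+v\n", o)
}
```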

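In container environments, name resolution often hits a 5 s timeout. The main cause is that glibc sends the A and AAAA queries in parallel from the same socket, and the underlying layers sometimes cannot handle the resulting race condition, so the lookup times out. The single-request-reopen option works around this. See the following article for details: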
https://tencentcloudcontainerteam.github.io/2018/10/26/DNS-5-seconds-delay/

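When a name is looked up, if the number of dots in it is less than the value of ndots, the resolver first tries it under each domain in the search list in turn, and queries the name itself only if none of those returns a result. ndots defaults to 1.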

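If the number of dots is at least ndots, the name is queried as-is first, and the search domains are tried only if that fails. Suppose the configuration is: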

nameserver 10.232.0.3
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

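The name to resolve, kubernetes.default.svc, contains only 2 dots, fewer than 5, so each search domain is appended in turn and queried first; the name itself is queried only if all of those fail.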

# host -v kubernetes.default.svc
Trying "kubernetes.default.svc.default.svc.cluster.local"
Trying "kubernetes.default.svc.svc.cluster.local"
Trying "kubernetes.default.svc.cluster.local"
...
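
The ordering rule for ndots and search can be written down as a short Go sketch. The candidates function is a hypothetical helper, not part of any library; it only produces the list of names in the order they would be tried. It also treats a name with a trailing dot as absolute, in which case the search list is skipped.

```go
package main

import (
	"fmt"
	"strings"
)

// candidates returns the fully qualified names to try, in order.
func candidates(name string, search []string, ndots int) []string {
	if strings.HasSuffix(name, ".") {
		return []string{name} // already absolute: the search list is not used
	}
	var expanded []string
	for _, domain := range search {
		expanded = append(expanded, name+"."+domain+".")
	}
	absolute := name + "."
	if strings.Count(name, ".") >= ndots {
		// Enough dots: try the name as-is first, then the search domains.
		return append([]string{absolute}, expanded...)
	}
	// Too few dots: walk the search list first, the name itself last.
	return append(expanded, absolute)
}

func main() {
	search := []string{"default.svc.cluster.local", "svc.cluster.local", "cluster.local"}
	for _, n := range candidates("kubernetes.default.svc", search, 5) {
		fmt.Println("Trying", n)
	}
}
```

With ndots:5 this lists the three search-domain names in the same order as the host -v output above, followed by kubernetes.default.svc itself. An external name such as www.163.com has only 2 dots, so under this configuration it also goes through all three search domains before being queried directly.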

Using the dig command

Query the records for www.163.com

$ dig www.163.com

Trace the iterative resolution, starting from the root servers

$ dig +trace www.163.com

Get the answer from a specific DNS server, 8.8.8.8

$ dig @8.8.8.8 www.163.com

Reverse-resolve an IP address

$ dig -x 114.114.114.114

; <<>> DiG 9.9.4-RedHat-9.9.4-74.el7_6.1 <<>> -x 114.114.114.114
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 58603
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;114.114.114.114.in-addr.arpa.	IN	PTR

;; ANSWER SECTION:
114.114.114.114.in-addr.arpa. 119 IN	PTR	public1.114dns.com.

;; Query time: 33 msec
;; SERVER: 10.211.55.1#53(10.211.55.1)
;; WHEN: Sun Oct 04 20:20:09 HKT 2020
;; MSG SIZE  rcvd: 89
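
As the QUESTION SECTION shows, a reverse lookup is an ordinary PTR query for a name under in-addr.arpa, built from the address with its octets reversed. A minimal Go sketch of that construction for IPv4 follows; reverseName is a hypothetical helper name. The address in the dig example reads the same in both directions, so the sketch uses the server address from the output instead.

```go
package main

import (
	"fmt"
	"net"
)

// reverseName builds the PTR query name for an IPv4 address.
func reverseName(addr string) (string, error) {
	ip := net.ParseIP(addr).To4()
	if ip == nil {
		return "", fmt.Errorf("not an IPv4 address: %q", addr)
	}
	// The octets are written in reverse order, then the in-addr.arpa suffix.
	return fmt.Sprintf("%d.%d.%d.%d.in-addr.arpa.", ip[3], ip[2], ip[1], ip[0]), nil
}

func main() {
	name, err := reverseName("10.211.55.1")
	fmt.Println(name, err)
}
```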

Multiple A records returned by getaddrinfo: load balancing defeated

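A domain name often maps to several addresses. In the example below, resolving www.baidu.com returns 14.215.177.39 and 14.215.177.38. Programs by default take the first address as the final result, so DNS servers shuffle the order of the returned IP list. This achieves load balancing and is called round-robin DNS.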

$ dig www.baidu.com

; <<>> DiG 9.11.20-RedHat-9.11.20-5.el8 <<>> www.baidu.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6006
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.baidu.com.                 IN      A

;; ANSWER SECTION:
www.baidu.com.          831     IN      CNAME   www.a.shifen.com.
www.a.shifen.com.       122     IN      A       14.215.177.39
www.a.shifen.com.       122     IN      A       14.215.177.38

;; Query time: 11 msec
;; SERVER: 192.168.3.1#53(192.168.3.1)
;; WHEN: Sun Feb 21 21:35:18 CST 2021
;; MSG SIZE  rcvd: 90

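In some special situations, however, the IPv4 list returned by getaddrinfo always comes back in the same order, even though the DNS server shuffles its answers. This is because getaddrinfo sorts the returned addresses according to RFC 3484. The concrete effect: when returned addresses are on the same subnet as the client's IP, the one that shares the longest prefix with the client's address is placed first. The prefix is compared on the bits of the IP addresses.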

192.168.3.20 == 11000000.10101000.00000011.00010100

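Local IP in every case below: 192.168.3.20/24

Case 1
    Returned by the DNS server:
        11000000.10101000.00001101.00011101  192.168.13.29
        11000000.10101000.00001101.00011011  192.168.13.27
        11000000.10101000.00000011.01000101  192.168.3.69
    Returned by getaddrinfo:
        192.168.3.69
        192.168.13.29
        192.168.13.27
    Note: only 192.168.3.69 is on the same subnet as the local IP, so it moves to the front. Addresses outside the subnet keep their original order.

Case 2
    Returned by the DNS server:
        11000000.10101000.00001101.00011101  192.168.13.29
        11000000.10101000.00000011.01000101  192.168.3.69
        11000000.10101000.00000011.00011011  192.168.3.27
    Returned by getaddrinfo:
        192.168.3.27
        192.168.3.69
        192.168.13.29
    Note: 3.27 and 3.69 are on the local subnet. commonprefixlen(27, 20) is greater than commonprefixlen(69, 20), so 27 comes first and 69 after it. Addresses outside the subnet keep their original order.

Case 3
    Returned by the DNS server:
        11000000.10101000.00000011.00011101  192.168.3.29
        11000000.10101000.00000011.00011010  192.168.3.26
        11000000.10101000.00000011.00011011  192.168.3.27
    Returned by getaddrinfo:
        192.168.3.29
        192.168.3.26
        192.168.3.27
    Note: all three have the same commonprefixlen, so no sorting is needed and the original order is returned.

Case 4
    Returned by the DNS server:
        11000000.10101000.00000011.00011101  192.168.3.29
        11000000.10101000.00000011.00010111  192.168.3.23
        11000000.10101000.00000011.00011011  192.168.3.27
    Returned by getaddrinfo:
        192.168.3.23
        192.168.3.29
        192.168.3.27
    Note: 23 has the largest commonprefixlen, so it comes first. The other two share an equally long prefix, so their order is unchanged.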

Here is the result of a simple test:

$ go run a.go
use pure go, keep order from dns server  [192.168.3.29 192.168.3.23 192.168.3.27]
use cgo, invoke getaddrinfo              [192.168.3.23 192.168.3.29 192.168.3.27]

Test code:

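package main

import (
	"fmt"
	"log"
	"net"
	"os"
)

// WARNING: this test overwrites /etc/hosts. Run it as root on a disposable machine only.
const hostsFile = "/etc/hosts"

func main() {

	// Three addresses for one name, in the order a DNS server might return them.
	data := `
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.3.29  www.abc.com
192.168.3.23  www.abc.com
192.168.3.27  www.abc.com
`

	if err := os.WriteFile(hostsFile, []byte(data), 0644); err != nil {
		log.Fatalf("write %s: %v", hostsFile, err)
	}

	testCases := []struct {
		name     string
		preferGo bool
	}{
		{"use pure go, keep order from dns server", true},
		{"use cgo, invoke getaddrinfo", false},
	}

	for _, tc := range testCases {
		// PreferGo selects Go's built-in resolver instead of the cgo path.
		net.DefaultResolver.PreferGo = tc.preferGo
		ips, err := net.LookupHost("www.abc.com")
		if err != nil {
			fmt.Printf("%-40s lookup failed: %v\n", tc.name, err)
			continue
		}
		fmt.Printf("%-40s %v\n", tc.name, ips)
	}

	// Put back a minimal hosts file without the test entries.
	data = `
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
`

	if err := os.WriteFile(hostsFile, []byte(data), 0644); err != nil {
		log.Fatalf("restore %s: %v", hostsFile, err)
	}

}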

To avoid getting a fixed address order in this particular scenario, you can disable IPv6. Related references on this problem:
https://access.redhat.com/solutions/22132
https://access.redhat.com/solutions/8709
https://gist.github.com/SpComb/c509bd064bc75151e6b41e8bc949d13f
https://github.com/hashicorp/consul/issues/1481
https://github.com/weaveworks/weave/issues/1245
https://tools.ietf.org/rfc/rfc3484.txt
https://www.api.rackspace.com/blog/glibc-linux-dns-round-robin-explanation/
https://github.com/golang/go/issues/18518

Released under the MIT License.