Background: an SEO-driven URL change. Rewriting http://www.hello.com/hello?xxx=xx as http://www.hello.com/hello/?xxx=xx (a trailing slash added to the path) helps improve Baidu SEO.
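Symptom:
A 301 response tells the client where to go next through the Location header of the response. This bug produced a wrong Location value, so after the 301 redirect the browser requested a domain that does not exist.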
Correct 301 response header, in the form /city/ershouche/filters?params:
Location: /cn/ershouche/pr-0-5/?fr=bd_pz&fr_word&utm_campaign=cn&utm_source=baidu_pc&utm_medium=pz&utm_term=%E5%AD%90%E9%93%BE4%2E1
Faulty 301 response header:
Location: //ershouche/pr-5-10/?fr=bd_pz&fr_word&utm_campaign=cn&utm_source=baidu_pc&utm_medium=pz&utm_term=%E5%AD%90%E9%93%BE4%2E2
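Cause:
- A logic bug in the code: when the city lookup returned an empty value, the Location value began with //
- How does HTTP process the URL in the Location header of a 301 response?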
According to RFC 2396, the regular expression for parsing a URI reference and the collected BNF are as follows:
^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
 12            3  4          5       6  7        8 9
For example, matching the above expression to
http://www.ics.uci.edu/pub/ietf/uri/#Related
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
where <undefined> indicates that the component is not present, as is
the case for the query component in the above example. Therefore, we
can determine the value of the four components and fragment as
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
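To make the component mapping concrete, the following minimal sketch applies the same regular expression to the two Location values from this incident. The helper name split_uri is an assumption for illustration, and the query strings are shortened to fr=bd_pz for readability.

```python
import re

# Regular expression quoted from RFC 2396 (see the excerpt above).
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(reference):
    """Return the five components; components that are absent are None."""
    # Every part of the pattern is optional, so the match always succeeds.
    m = URI_RE.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


good = split_uri("/cn/ershouche/pr-0-5/?fr=bd_pz")
bad = split_uri("//ershouche/pr-5-10/?fr=bd_pz")

print(good["authority"], good["path"])  # None /cn/ershouche/pr-0-5/
print(bad["authority"], bad["path"])    # ershouche /pr-5-10/
```

With the leading //, the first path segment is consumed as the authority, which is why the client ended up contacting a host named ershouche.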
A. Collected BNF for URI
URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
absoluteURI = scheme ":" ( hier_part | opaque_part )
relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]
hier_part = ( net_path | abs_path ) [ "?" query ]
opaque_part = uric_no_slash *uric
uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
"&" | "=" | "+" | "$" | ","
net_path = "//" authority [ abs_path ]
abs_path = "/" path_segments
rel_path = rel_segment [ abs_path ]
rel_segment = 1*( unreserved | escaped |
";" | "@" | "&" | "=" | "+" | "$" | "," )
scheme = alpha *( alpha | digit | "+" | "-" | "." )
authority = server | reg_name
reg_name = 1*( unreserved | escaped | "$" | "," |
";" | ":" | "@" | "&" | "=" | "+" )
server = [ [ userinfo "@" ] hostport ]
userinfo = *( unreserved | escaped |
";" | ":" | "&" | "=" | "+" | "$" | "," )
hostport = host [ ":" port ]
host = hostname | IPv4address
hostname = *( domainlabel "." ) toplabel [ "." ]
domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
toplabel = alpha | alpha *( alphanum | "-" ) alphanum
IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
port = *digit
path = [ abs_path | opaque_part ]
path_segments = segment *( "/" segment )
segment = *pchar *( ";" param )
param = *pchar
pchar = unreserved | escaped |
":" | "@" | "&" | "=" | "+" | "$" | ","
query = *uric
fragment = *uric
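Put simply: when the URL reference in an HTTP response begins with a double slash //, it is parsed as a net_path, and everything between // and the next / is taken as the authority, that is, the domain. The faulty Location //ershouche/pr-5-10/... was therefore resolved with ershouche as the host, a domain that does not exist.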
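The effect on the redirect target can be reproduced with Python's standard urllib.parse.urljoin, which resolves a reference against a base URL in roughly the way a browser resolves a Location value. This is only a sketch: the request URL is the sample one from the start of this post, and the query strings are shortened.

```python
from urllib.parse import urljoin, urlsplit

# Assumed URL of the request that received the 301 (sample value).
request_url = "http://www.hello.com/hello?xxx=xx"

good_location = "/cn/ershouche/pr-0-5/?fr=bd_pz"
bad_location = "//ershouche/pr-5-10/?fr=bd_pz"

# An absolute path keeps the scheme and host of the request URL.
good_target = urljoin(request_url, good_location)
# A reference starting with "//" keeps only the scheme and replaces the host.
bad_target = urljoin(request_url, bad_location)

print(good_target)  # http://www.hello.com/cn/ershouche/pr-0-5/?fr=bd_pz
print(bad_target)   # http://ershouche/pr-5-10/?fr=bd_pz
print(urlsplit(bad_target).hostname)  # ershouche
```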
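Lessons learned:
- When troubleshooting with Chrome's Inspect Element (DevTools) or Firebug, remember to enable the preserve-log option so that entries survive the redirect
- Inspect requests through a proxy such as Fiddler or Charles, or capture the traffic directly with Wireshark
- When reading the faulty code in a project you do not know well, first check the merge requests (MRs) to narrow down the area, then read the latest code for that logic
- Defensive code should guard against future changes as well as the present state: put strict checks at the spot that directly produces the failure. (Treat programmers as monkeys: there is no guarantee that other code will stay the same, so harden the core code paths.)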
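The last point can be sketched as a guard placed where the Location value is produced. This is not the project's actual code: build_location and check_location are hypothetical helpers, and the sketch assumes the site only issues same-host, path-only redirects.

```python
from urllib.parse import urlencode


def check_location(location):
    """Guard at the point where the header is written (hypothetical helper).

    Assumes only same-host, path-only redirects are ever issued.
    """
    if not location.startswith("/") or location.startswith("//"):
        raise ValueError(f"unsafe Location: {location!r}")
    return location


def build_location(city, filters, params):
    """Build /<city>/ershouche/<filters>/?<params> (hypothetical helper)."""
    if not city:
        # The case behind this incident: an empty city yields "//ershouche/...".
        raise ValueError("city is empty")
    location = f"/{city}/ershouche/{filters}/"
    if params:
        location += "?" + urlencode(params)
    # Checked again here, regardless of how the segments were produced.
    return check_location(location)


print(build_location("cn", "pr-0-5", {"fr": "bd_pz"}))
# /cn/ershouche/pr-0-5/?fr=bd_pz
```

Checking both the input and the final value is deliberate: even if a later change introduces another way to produce an empty segment, the // prefix is still caught before the header is sent. What the caller does on ValueError (a default city, a 404, and so on) is left open here.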