Failure Analysis: Domain Lost After a 301 Redirect

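Background: an SEO-driven URL change. Rewriting http://www.hello.com/hello?xxx=xx as http://www.hello.com/hello/?xxx=xx helps Baidu SEO.
Symptom:
A 301 redirect names the next-hop address in the Location field of the response header. This bug produced a wrong Location value, so after the 301 the browser requested a domain that does not exist.
Correct 301 response header (pattern: /city/ershouche/filters?parameters):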

Location: /cn/ershouche/pr-0-5/?fr=bd_pz&fr_word&utm_campaign=cn&utm_source=baidu_pc&utm_medium=pz&utm_term=%E5%AD%90%E9%93%BE4%2E1

Faulty 301 response header:

Location: //ershouche/pr-5-10/?fr=bd_pz&fr_word&utm_campaign=cn&utm_source=baidu_pc&utm_medium=pz&utm_term=%E5%AD%90%E9%93%BE4%2E2

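Cause:

  1. A logic bug left the city empty, so the Location value began with //.
  2. How does HTTP process the URL in the Location header of a 301 response?
    RFC 2396 gives the following regular expression and BNF for parsing a URI: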
    ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
     12            3  4          5       6  7        8 9

    For example, matching the above expression to

    http://www.ics.uci.edu/pub/ietf/uri/#Related

    results in the following subexpression matches:

    $1 = http:
    $2 = http
    $3 = //www.ics.uci.edu
    $4 = www.ics.uci.edu
    $5 = /pub/ietf/uri/
    $6 = <undefined>
    $7 = <undefined>
    $8 = #Related
    $9 = Related

    where <undefined> indicates that the component is not present, as is
    the case for the query component in the above example. Therefore, we
    can determine the value of the four components and fragment as

    scheme = $2
    authority = $4
    path = $5
    query = $7
    fragment = $9


    A. Collected BNF for URI

    URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
    absoluteURI = scheme ":" ( hier_part | opaque_part )
    relativeURI = ( net_path | abs_path | rel_path ) [ "?" query ]

    hier_part = ( net_path | abs_path ) [ "?" query ]
    opaque_part = uric_no_slash *uric

    uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
    "&" | "=" | "+" | "$" | ","

    net_path = "//" authority [ abs_path ]
    abs_path = "/" path_segments
    rel_path = rel_segment [ abs_path ]

    rel_segment = 1*( unreserved | escaped |
    ";" | "@" | "&" | "=" | "+" | "$" | "," )

    scheme = alpha *( alpha | digit | "+" | "-" | "." )

    authority = server | reg_name

    reg_name = 1*( unreserved | escaped | "$" | "," |
    ";" | ":" | "@" | "&" | "=" | "+" )

    server = [ [ userinfo "@" ] hostport ]
    userinfo = *( unreserved | escaped |
    ";" | ":" | "&" | "=" | "+" | "$" | "," )

    hostport = host [ ":" port ]
    host = hostname | IPv4address
    hostname = *( domainlabel "." ) toplabel [ "." ]
    domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
    toplabel = alpha | alpha *( alphanum | "-" ) alphanum
    IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
    port = *digit

    path = [ abs_path | opaque_part ]
    path_segments = segment *( "/" segment )
    segment = *pchar *( ";" param )
    param = *pchar
    pchar = unreserved | escaped |
    ":" | "@" | "&" | "=" | "+" | "$" | ","

    query = *uric

    fragment = *uric
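
As a rough illustration of the Appendix B expression quoted above, the following Python sketch applies the same regular expression to the RFC's own example and to the two Location values from this incident. The helper name split_uri is made up for this example and is not part of any library; the query strings are shortened.

```python
import re

# Regular expression from RFC 2396, Appendix B, copied unchanged.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(reference):
    """Split a URI reference into the five components named in the RFC.

    split_uri is a hypothetical helper written for this post.
    A component that is not present comes back as None.
    """
    m = URI_RE.match(reference)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


rfc_example = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
good = split_uri("/cn/ershouche/pr-0-5/?fr=bd_pz")
bad = split_uri("//ershouche/pr-5-10/?fr=bd_pz")

print(good["authority"], good["path"])  # None /cn/ershouche/pr-0-5/
print(bad["authority"], bad["path"])    # ershouche /pr-5-10/
```

In the faulty value the first path segment, ershouche, ends up in the authority slot, which is why the browser treats it as a host name.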

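In short: when the URL is parsed, a value that begins with a double slash // is treated as a net_path, so everything between the // and the next / is taken as the domain (authority).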
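
The effect on the redirect target can be reproduced with urljoin from Python's standard urllib.parse module, which resolves a reference against a base URL in much the same way a browser resolves a Location value. This is only a sketch: the base URL is the example address from the top of this post, and the query strings are shortened.

```python
from urllib.parse import urljoin, urlsplit

# Page that answered with the 301 (example address from this post).
base = "http://www.hello.com/hello?xxx=xx"

# Location values modelled on the correct and the faulty header.
good_location = "/cn/ershouche/pr-0-5/?fr=bd_pz"
bad_location = "//ershouche/pr-5-10/?fr=bd_pz"

good_target = urljoin(base, good_location)
bad_target = urljoin(base, bad_location)

print(good_target)  # http://www.hello.com/cn/ershouche/pr-0-5/?fr=bd_pz
print(bad_target)   # http://ershouche/pr-5-10/?fr=bd_pz

# The host of the faulty target is no longer www.hello.com.
print(urlsplit(bad_target).hostname)  # ershouche
```

The absolute path keeps the scheme and host of the base URL, while the // form keeps only the scheme and replaces the host.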

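Lessons learned:

  1. When troubleshooting with Chrome's Inspect Element or Firebug, remember to enable log preservation so that entries survive the redirect.
  2. Inspect the requests through a proxy such as Fiddler or Charles, or capture them directly with Wireshark.
  3. When reading the faulty code in an unfamiliar project, first check the merge requests (MRs) to narrow down the area, then read the latest code for that logic.
  4. Defensive code should guard against future changes as well as the current case. Put strict checks at the exact spot that directly triggers the failure. Treat programmers as monkeys: nothing guarantees that code elsewhere will stay the same, so harden the core code paths.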
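
Point 4 could look something like the sketch below. The real redirect code is not shown in this post, so build_location and its parameters are hypothetical names; only the /city/ershouche/filters/?parameters layout is taken from the header example earlier in the post.

```python
def build_location(city, filter_segment="", query=""):
    """Build the Location value for the redirect (hypothetical helper).

    Raises ValueError instead of emitting a value that starts with "//",
    which a client would read as a new host name.
    """
    # Normalise the city: None, "", "  " and "/" all count as missing.
    city = (city or "").strip().strip("/")
    if not city:
        raise ValueError("city is empty; refusing to build a redirect path")

    segments = [city, "ershouche"]
    filter_segment = (filter_segment or "").strip("/")
    if filter_segment:
        segments.append(filter_segment)

    path = "/" + "/".join(segments) + "/"

    # Second guard at the point of output, in case the logic above changes later.
    if path.startswith("//"):
        raise ValueError("redirect path must not start with //: " + path)

    return (path + "?" + query) if query else path


print(build_location("cn", "pr-0-5", "fr=bd_pz"))  # /cn/ershouche/pr-0-5/?fr=bd_pz

try:
    build_location("", "pr-5-10", "fr=bd_pz")
except ValueError as exc:
    print("rejected:", exc)
```

Failing loudly here is a design choice for the sketch; falling back to a default city or skipping the redirect would be equally valid, depending on what the site prefers.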

References:

  1. https://forums.aws.amazon.com/thread.jspa?threadID=13898
  2. http://www.ietf.org/rfc/rfc2396.txt