2017-05-11

Failure Analysis: Domain Lost After a 301 Redirect

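Motivation: an SEO-driven URL change. Rewriting http://www.hello.com/hello?xxx=xx as http://www.hello.com/hello/?xxx=xx (adding a trailing slash before the query string) is expected to improve Baidu SEO.
Symptom:
A 301 response names the next-hop address in the Location field of its response header. A bug made this Location value wrong, so after the redirect the browser requested a domain that does not exist.
Correct 301 response header, in the form /city/ershouche/filters?parameters: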

Location: /cn/ershouche/pr-0-5/?fr=bd_pz&fr_word&utm_campaign=cn&utm_source=baidu_pc&utm_medium=pz&utm_term=%E5%AD%90%E9%93%BE4%2E1

Faulty 301 response header:

Location: //ershouche/pr-5-10/?fr=bd_pz&fr_word&utm_campaign=cn&utm_source=baidu_pc&utm_medium=pz&utm_term=%E5%AD%90%E9%93%BE4%2E2

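Cause:

A bug in the code logic: when the city lookup returns an empty value, the Location value starts with //.

How does HTTP process the URL in the Location field of a 301 response?
RFC 2396 gives the following regular expression and BNF for parsing a URI: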

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9
 
   For example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example.  Therefore, we
   can determine the value of the four components and fragment as

      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9
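
The same expression can be applied to the two Location values from this incident. The following is a minimal sketch in Python; the helper name split_uri is made up for illustration, and the query strings are shortened to a single parameter.

```python
import re

# Regular expression from RFC 2396, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(ref):
    """Return (scheme, authority, path, query, fragment); None means absent."""
    m = URI_RE.match(ref)  # every group is optional, so this always matches
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


good = split_uri("/cn/ershouche/pr-0-5/?fr=bd_pz")
bad = split_uri("//ershouche/pr-5-10/?fr=bd_pz")

print(good)  # (None, None, '/cn/ershouche/pr-0-5/', 'fr=bd_pz', None)
print(bad)   # (None, 'ershouche', '/pr-5-10/', 'fr=bd_pz', None)
```

In the faulty value the authority group ($4) captures ershouche, so the first path segment is read as a host name.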
 
 
A. Collected BNF for URI

      URI-reference = [ absoluteURI | relativeURI ] [ "#" fragment ]
      absoluteURI   = scheme ":" ( hier_part | opaque_part )
      relativeURI   = ( net_path | abs_path | rel_path ) [ "?" query ]

      hier_part     = ( net_path | abs_path ) [ "?" query ]
      opaque_part   = uric_no_slash *uric

      uric_no_slash = unreserved | escaped | ";" | "?" | ":" | "@" |
                      "&" | "=" | "+" | "$" | ","

      net_path      = "//" authority [ abs_path ]
      abs_path      = "/"  path_segments
      rel_path      = rel_segment [ abs_path ]

      rel_segment   = 1*( unreserved | escaped |
                          ";" | "@" | "&" | "=" | "+" | "$" | "," )

      scheme        = alpha *( alpha | digit | "+" | "-" | "." )

      authority     = server | reg_name

      reg_name      = 1*( unreserved | escaped | "$" | "," |
                          ";" | ":" | "@" | "&" | "=" | "+" )

      server        = [ [ userinfo "@" ] hostport ]
      userinfo      = *( unreserved | escaped |
                         ";" | ":" | "&" | "=" | "+" | "$" | "," )

      hostport      = host [ ":" port ]
      host          = hostname | IPv4address
      hostname      = *( domainlabel "." ) toplabel [ "." ]
      domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum
      toplabel      = alpha | alpha *( alphanum | "-" ) alphanum
      IPv4address   = 1*digit "." 1*digit "." 1*digit "." 1*digit
      port          = *digit

      path          = [ abs_path | opaque_part ]
      path_segments = segment *( "/" segment )
      segment       = *pchar *( ";" param )
      param         = *pchar
      pchar         = unreserved | escaped |
                      ":" | "@" | "&" | "=" | "+" | "$" | ","

      query         = *uric

      fragment      = *uric

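Put simply: when a URL reference is parsed for an HTTP request and it begins with a double slash //, everything between the // and the next / is treated as the domain name (the authority in net_path). The faulty Location //ershouche/pr-5-10/... is therefore read as host ershouche with path /pr-5-10/, and the browser follows the redirect to the nonexistent domain ershouche.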
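
A client resolves the Location value against the URL it originally requested. As a rough illustration (not the browser's actual code), Python's standard urllib.parse.urljoin shows the same effect; the base URL is the example address from this note and the query strings are shortened.

```python
from urllib.parse import urljoin

# The URL the client originally requested (example address from this note).
base = "http://www.hello.com/hello?xxx=xx"

# A single leading slash is an absolute path: the host is kept.
ok = urljoin(base, "/cn/ershouche/pr-0-5/?fr=bd_pz")

# A double leading slash is a network path: the host is replaced.
broken = urljoin(base, "//ershouche/pr-5-10/?fr=bd_pz")

print(ok)      # http://www.hello.com/cn/ershouche/pr-0-5/?fr=bd_pz
print(broken)  # http://ershouche/pr-5-10/?fr=bd_pz
```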

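Lessons learned:

When troubleshooting, use Chrome's Inspect Element (DevTools) or Firebug, and remember to enable the Preserve log option so the requests stay visible across the redirect.
Inspect the requests through a proxy such as Fiddler or Charles, or capture them directly with Wireshark.
When reading the faulty code in an unfamiliar project, first check the merge requests (MRs) to narrow down the area, then read the latest code for that logic.
Defensive code should guard against future changes as well as the current case. Put strict checks at the exact spot that directly causes the failure (treat programmers as monkeys: there is no guarantee that other parts of the code will stay the same, so harden the core code paths).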
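
One way to apply the defensive-code lesson is to validate at the point where the Location value is assembled. The sketch below is hypothetical and is not the project's actual code: the function build_location, its parameters, and the rule that a city code consists of lowercase letters only are all assumptions made for illustration.

```python
import re

# Assumed rule (not from the real project): a city code is lowercase letters.
CITY_RE = re.compile(r"^[a-z]+$")


def build_location(city, filters, query=""):
    """Build the relative redirect target /city/ershouche/filters/?query."""
    # Guard the direct cause: an empty or malformed city yields "//...".
    if not city or not CITY_RE.match(city):
        raise ValueError(f"refusing to build redirect: invalid city {city!r}")
    location = f"/{city}/ershouche/{filters}/"
    if query:
        location += "?" + query
    # Second line of defence in case the code above changes later:
    # a leading "//" would be parsed as a host name by the client.
    if location.startswith("//"):
        raise ValueError(f"refusing to emit network-path Location {location!r}")
    return location


print(build_location("cn", "pr-0-5", "fr=bd_pz"))
# /cn/ershouche/pr-0-5/?fr=bd_pz
```

With an empty city the function raises ValueError instead of returning a value that starts with //, so the caller can fall back to a safe default or log the error rather than redirect to a wrong host.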

小武
