HTMLParser with Python on Google App Engine - 地瓜粥在讀書

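In Python, HTMLParser does not recover from errors.

In other words, as soon as it hits a piece of invalid HTML syntax, it throws an exception and stops parsing right there.

But this world is full of web pages that don't like to follow the rules, so when you're parsing real pages this is a really inconvenient "feature".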

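I asked the almighty Google and found roughly two ways to deal with it:

1. Modify HTMLParser in the Python standard library

2. Use another parser, e.g. html5lib

But since my goal is to run on Google's App Engine platform, neither of these looks like a good solution.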

http://cacaeggs.blogspot.com/2009/11/htmlparser-malformed-start-tag-errors.html

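The page above explains how to do the first approach: override the error function in the HTMLParser class, and then also add a return inside HTMLParser.py.

But I need to run on Google's platform, and Google only supports the Python 2.5.2 standard library.

In other words, editing my own copy of HTMLParser.py is useless, because Google's copy doesn't get changed @@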

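So I might as well go all the way: just override both error and check_for_whole_start_tag in a subclass and that's it~^^

Here's the sample code to share (the parts I changed were marked in red):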

import HTMLParser


class MyParser(HTMLParser.HTMLParser):
    def error(self, message):
        # Report the problem and carry on. The stock version does this instead:
        #     raise HTMLParseError(message, self.getpos())
        print message

    # Internal -- check to see if we have a complete starttag; return end
    # or -1 if incomplete.
    def check_for_whole_start_tag(self, i):
        rawdata = self.rawdata
        # locatestarttagend is a module-level regex in HTMLParser.py,
        # so from a subclass it needs the module prefix.
        m = HTMLParser.locatestarttagend.match(rawdata, i)
        if m:
            j = m.end()
            next = rawdata[j:j+1]
            if next == ">":
                return j + 1
            if next == "/":
                if rawdata.startswith("/>", j):
                    return j + 2
                if rawdata.startswith("/", j):
                    # buffer boundary
                    return -1
                # else bogus input
                self.updatepos(i, j + 1)
                self.error("malformed empty start tag")
            if next == "":
                # end of input
                return -1
            if next in ("abcdefghijklmnopqrstuvwxyz=/"
                        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
                # end of input in or before attribute value, or we have the
                # '/' from a '/>' ending
                return -1
            self.updatepos(i, j)
            self.error("malformed start tag")
            # error() no longer raises, so hand back j and let parsing resume
            # there instead of falling through to the AssertionError below.
            return j
        raise AssertionError("we should not get here!")
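
To show how a parser like this might actually be used, here is a rough sketch, not something from my real app. The names LinkCollector and collect_links are made up for illustration. It keeps error messages in a list instead of printing them, and it subclasses HTMLParser directly only so the snippet can run on its own; on App Engine you would presumably derive from MyParser above so that the check_for_whole_start_tag override comes along too. The import fallback is also an assumption, there only so the sketch runs on newer Pythons, where the parser is already far more forgiving and error may never get called at all.

```python
try:
    import HTMLParser as html_parser       # Python 2, e.g. the App Engine 2.5 runtime
except ImportError:
    import html.parser as html_parser      # newer Pythons (assumption: only for running the sketch)


class LinkCollector(html_parser.HTMLParser):
    """Hypothetical example: gathers href values and remembers parse complaints."""

    def __init__(self):
        # Explicit base call: the Python 2 parser is an old-style class.
        html_parser.HTMLParser.__init__(self)
        self.links = []
        self.problems = []

    def error(self, message):
        # Same idea as MyParser.error: note the problem, do not raise.
        self.problems.append((message, self.getpos()))

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


def collect_links(page):
    """Feed one page of HTML text and return (links, problems)."""
    parser = LinkCollector()
    parser.feed(page)
    parser.close()
    return parser.links, parser.problems
```

Calling collect_links(page_text) gives back the hrefs that were found plus whatever the parser complained about, so the complaints can go to a log instead of aborting the whole parse.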