PeggyでHTMLパーサー(手抜き)を書いてみた
PEGとか知らないけどPeggy - The Parser Generator for Haskellを見ながら書いてみた。
思うところは色々あるけどもう今日は終わりたいから書いておく。
{-# LANGUAGE TemplateHaskell, QuasiQuotes, FlexibleContexts #-} import Text.Peggy main = print . parseHtml =<< getContents parseHtml :: String -> Either ParseError HTML parseHtml = parseString document "HTML" data HTML = TextNode String | ElemNode { name :: String, attributes :: [(String,String)], children :: [HTML] } | Doctype String | Document [HTML] deriving (Show) [peggy| document :: HTML = nodes { Document $1 } nodes :: [HTML] = (node, [\r\n \t]*) node :: HTML = '<!DOCTYPE ' [^<>]+ '>' { Doctype $1 } / soloTag / coupleTag / [^<>]+ { TextNode $1 } soloTag :: HTML = '<' soloTagList [ \t]* (attribute, [ \t]*) [ \t]* '/'? '>' { ElemNode $1 $3 [] } soloTagList :: String = 'meta' { "meta" } / 'link' { "link" } / 'img' { "img" } / 'input' { "input" } / 'hr' { "hr" } / 'br' { "br" } coupleTag :: HTML = '<' tagName [ \t]* (attribute, [ \t]*) [ \t]* '>' [\r\n \t]* nodes [\r\n \t]* '</' tagName '>' { ElemNode $1 $3 $6 } tagName :: String = [^<>\r\n \t/!]+ attribute :: (String, String) = attributeName '=' attributeValue attributeName :: String = [^ \r\n\t&<>=]+ attributeValue :: String = '\"' [^\"\r\n]* '\"' |]
上記パーザーに下のHTMLを食わせる。
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> <html lang="ja"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf8"> <title>hoge</title> </head> <body> <div id="header" class="mark"> <div id="logo"><img src="logo.png"/></div> </div> <div id="body"> <div id="content"> <p>foobar</p> </div> </div> <div id="footer"> <div id="copyright">Copyright © hogehoge</div> </div> </body> </html>
そうするとこんな結果。
Right ( Document [ Doctype "html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"", ElemNode {name = "html", attributes = [("lang","ja")], children = [ ElemNode {name = "head", attributes = [], children = [ ElemNode {name = "meta", attributes = [("http-equiv","Content-Type"),("content","text/html; charset=utf8")], children = []}, ElemNode {name = "title", attributes = [], children = [ TextNode "hoge" ]} ]}, ElemNode {name = "body", attributes = [], children = [ ElemNode {name = "div", attributes = [("id","header"),("class","mark")], children = [ ElemNode {name = "div", attributes = [("id","logo")], children = [ ElemNode {name = "img", attributes = [("src","logo.png")], children = []} ]} ]}, ElemNode {name = "div", attributes = [("id","body")], children = [ ElemNode {name = "div", attributes = [("id","content")], children = [ ElemNode {name = "p", attributes = [], children = [ TextNode "foobar" ]} ]} ]}, ElemNode {name = "div", attributes = [("id","footer")], children = [ ElemNode {name = "div", attributes = [("id","copyright")], children = [ TextNode "Copyright © hogehoge" ]} ]} ]} ]} ] )
あ、ドキュメントタイプがXHTMLだった。
もっと綺麗に書きたいなー。