Text Manipulation with/without Parsec

October 11, 2011 Vancouver Haskell UnMeetup

Tatsuhiro Ujihisa


• Tatsuhiro Ujihisa
• @ujm
• HootSuite Media Inc.
• Osaka, Japan
• Vim: 14
• Haskell: 5

Topics
• text manipulation functions with/without parsec
• parsec library
• texts in Haskell
• attoparsec library


Haskell for work
• Something academic
• Something mathematical
• Web app
• Better shell scripting
• (Improve yourself)


Text manipulation
• The concept of text
• String is [Char]
• lazy
• Pattern matching
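
The bullets above can be illustrated with a minimal sketch. The names `capitalize` and `firstFive` are made up for this example and are not from the talk; the point is only that a String can be taken apart with list patterns and that lists are evaluated lazily.

```haskell
import Data.Char (toUpper)

-- Pattern matching on the list structure of a String (hypothetical helper).
capitalize :: String -> String
capitalize ""       = ""
capitalize (c : cs) = toUpper c : cs

-- Laziness: only the first five characters of the infinite string are built.
firstFive :: String
firstFive = take 5 (cycle "ab")
```

Because String is just [Char], every list function (take, reverse, splitAt, ...) works on text without conversion.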


Example: split
• Ruby/Python example
• 'aaa<>bb<>c<><>d'.split('<>')
['aaa', 'bb', 'c', '', 'd']
• Vim script example
• split('aaa<>bb<>c<><>d', '<>')


split in Haskell
• split :: String -> String -> [String]
• split "aaa<>bb<>c<><>d" "<>"
["aaa", "bb", "c", "", "d"]
• "aaa<>bb<>c<><>d" `split` "<>"


Design of split
• "aaa" : split "bb<>c<><>d" "<>"
• "aaa" : "bb" : split "c<><>d" "<>"
• "aaa" : "bb" : "c" : split "<>d" "<>"
• "aaa" : "bb" : "c" : "" : split "d" "<>"
• "aaa" : "bb" : "c" : "" : "d" : split "" "<>"
• "aaa" : "bb" : "c" : "" : "d" : []



Design of split
• split' "aaa<>bb<>c<><>d" "<>" ""
• split' "aa<>bb<>c<><>d" "<>" "a"
• split' "a<>bb<>c<><>d" "<>" "aa"
• split' "<>bb<>c<><>d" "<>" "aaa"
• "aaa" : split "bb<>c<><>d" "<>"


split :: String -> String -> [String]
str `split` pat = split' str pat ""

split' :: String -> String -> String -> [String]
split' "" _ memo = [reverse memo]
split' str pat memo = let (a, b) = splitAt (length pat) str in
  if a == pat
    then (reverse memo) : (b `split` pat)
    else split' (tail str) pat (head str : memo)
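
As a variation on the hand-written version, here is a sketch that uses `stripPrefix` from Data.List instead of `splitAt`, `head` and `tail`. The name `splitBy` is hypothetical, and the treatment of an empty separator (the string is returned unsplit) is an assumption of this sketch, not something the talk specifies.

```haskell
import Data.List (stripPrefix)

-- Hypothetical variant of split: string first, separator second.
splitBy :: String -> String -> [String]
splitBy str pat = go str ""
  where
    -- memo accumulates the current piece in reverse order
    go "" memo = [reverse memo]
    go s@(c : cs) memo = case stripPrefix pat s of
      -- separator found: emit the piece and start a new one
      Just rest | not (null pat) -> reverse memo : go rest ""
      -- no separator here (or an empty separator): keep the character
      _                          -> go cs (c : memo)
```

For example, `splitBy "aaa<>bb<>c<><>d" "<>"` gives the same five pieces as the Ruby call shown earlier.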


Another approach
• Text.Parsec: v3
• Text.ParserCombinators.Parsec: v2
• Real World Haskell Parsec chapter
• csv parser


Design of split
• many of
• any char except for the string "<>"
• separated by "<>" or the end of the string


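import qualified Text.Parsec as P

split :: String -> String -> [String]
str `split` pat = case P.parse (split' (P.string pat)) "split" str of
  Right x -> x
  Left err -> error (show err) -- report a parse error instead of a pattern-match failure

split' :: P.Parsec String () String -> P.Parsec String () [String]
split' pat = P.anyChar `P.manyTill` (P.eof P.<|> (P.try (P.lookAhead pat) >> return ())) `P.sepBy` pat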


The two halves of the manyTill expression in split':
• P.anyChar: any char
• P.eof P.<|> (P.try (P.lookAhead pat) >> return ()): up to the end of the string or the separator pattern (checked without consuming text)
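
The same idea can be written with explicit types. The sketch below is not the slide's code: `field` and `splitE` are hypothetical names, and returning an `Either` instead of a bare list is a design choice of this sketch so that a parse failure is visible to the caller.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- One field: any characters up to (but not including) the separator or end of input.
field :: String -> Parser String
field pat = P.manyTill P.anyChar stop
  where
    -- lookAhead checks for the separator without consuming it;
    -- try undoes the partial match when only a prefix of pat is present
    stop = P.eof P.<|> (P.try (P.lookAhead (P.string pat)) >> return ())

-- Hypothetical total variant of split: string first, separator second.
splitE :: String -> String -> Either P.ParseError [String]
splitE str pat = P.parse (field pat `P.sepBy` P.string pat) "split" str
```

Like the original, this assumes a non-empty separator.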


import qualified Text.Parsec as P

main = do
  print $ abc1 "abc" -- True
  print $ abc1 "abcd" -- False
  print $ abc2 "abc" -- True
  print $ abc2 "abcd" -- False

abc1 :: String -> Bool
abc1 str = str == "abc"

abc2 :: String -> Bool
abc2 str = case P.parse (P.string "abc" >> P.eof) "abc" str of
  Right _ -> True
  Left _ -> False


import qualified Text.Parsec as P

main = do
  print $ parenthMatch1 "(a (b c))" -- True
  print $ parenthMatch1 "(a (b c)" -- False
  print $ parenthMatch1 ")(a (b c)" -- False
  print $ parenthMatch2 "(a (b c))" -- True
  print $ parenthMatch2 "(a (b c)" -- False
  print $ parenthMatch2 ")(a (b c)" -- False

-- Without Parsec: count the currently open parentheses
parenthMatch1 :: String -> Bool
parenthMatch1 str = f str 0
  where
    f "" 0 = True
    f "" _ = False
    f ('(':xs) n = f xs (n + 1)
    f (')':xs) 0 = False
    f (')':xs) n = f xs (n - 1)
    f (_:xs) n = f xs n

-- With Parsec: f is any run of non-parenthesis chars or nested groups g
parenthMatch2 :: String -> Bool
parenthMatch2 str =
  case P.parse (f >> P.eof) "parenthMatch" str of
    Right _ -> True
    Left _ -> False
  where
    f = P.many (P.noneOf "()" P.<|> g)
    g = do
      P.char '('
      f
      P.char ')'


Parsec API
• anyChar
• char 'a'
• string "abc"
== string ['a', 'b', 'c']
≈ char 'a' >> char 'b' >> char 'c' (same input consumed, but the result is 'c' rather than "abc")
• oneOf ['a', 'b', 'c']
• noneOf "abc"
• eof
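
A small sketch of the primitives listed above. The helper `accepts` and the list `checks` are made up for this example; `accepts` only reports whether a parser succeeds on a prefix of the input.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- Hypothetical helper: True when the parser succeeds on a prefix of the input.
accepts :: Parser a -> String -> Bool
accepts p input = either (const False) (const True) (P.parse p "example" input)

-- Each entry is expected to be True.
checks :: [Bool]
checks =
  [ accepts P.anyChar "x"               -- any single character
  , accepts (P.char 'a') "abc"          -- exactly the character 'a'
  , accepts (P.string "abc") "abcd"     -- the literal string "abc"
  , accepts (P.oneOf "abc") "b"         -- one character out of the set
  , accepts (P.noneOf "abc") "z"        -- one character outside the set
  , accepts P.eof ""                    -- end of input
  , not (accepts (P.char 'a') "b")
  , not (accepts P.eof "x")
  ]
```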

Parsec API (combinator)
• >>, >>=, return, and fail
• <|>
• many p
• p1 `manyTill` p2
• p1 `sepBy` p2
• p1 `chainl1` op (chainl p op x additionally takes a default value x for zero occurrences of p)
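
To make the combinators concrete, here is a minimal sketch with three hypothetical parsers (`number`, `numberList`, `difference`); none of them appear in the talk. `chainl1` is used rather than `chainl` so that no default value is needed.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- A natural number: one or more digits.
number :: Parser Int
number = read <$> P.many1 P.digit

-- Numbers separated by commas, e.g. "1,2,3".
numberList :: Parser [Int]
numberList = number `P.sepBy` P.char ','

-- Left-associative subtraction, e.g. "10-2-3" is read as (10-2)-3.
difference :: Parser Int
difference = number `P.chainl1` (P.char '-' >> return (-))
```

Running `P.parse difference "" "10-2-3"` shows the left associativity: the result is 5, not 11.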

Parsec API (etc)
• try
• lookAhead p
• notFollowedBy p
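
These three are easiest to understand side by side. The sketch below uses invented parser names (`keyword`, `keywordOrWord`, `peekThenRead`) and the keyword "let" purely as an illustration.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- The keyword "let", but only when it is not the prefix of a longer word.
-- try rewinds the input if notFollowedBy rejects the match.
keyword :: Parser String
keyword = P.try (P.string "let" <* P.notFollowedBy P.letter)

-- Because keyword is wrapped in try, <|> can still attempt the second branch.
keywordOrWord :: Parser String
keywordOrWord = keyword P.<|> P.many1 P.letter

-- lookAhead reports what comes next without consuming it.
peekThenRead :: Parser (Char, String)
peekThenRead = do
  c    <- P.lookAhead P.anyChar
  rest <- P.many1 P.anyChar
  return (c, rest)
```

Without `try`, the input "letter" would fail in `keywordOrWord`, because `string "let"` has already consumed three characters when `notFollowedBy` rejects the match.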


texts in Haskell


three types of text
• String
• ByteString
• Text


String
• [Char]
• Char: a Unicode character (code point), not tied to any encoding
• "aaa" is String
• List is lazy and slow


ByteString
• import Data.ByteString
• Base64
• Char8
• UTF8
• Lazy (Char8, UTF8)
• Fast; the default string type in the Snap web framework

ByteString (cont'd)
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Char8 ()
import Data.ByteString (ByteString)

main = print ("hello" :: ByteString)

• OverloadedStrings with Char8
• Give the type explicitly or use with ByteString functions


ByteString (cont'd)

import Data.ByteString.UTF8 ()
import qualified Data.ByteString as B
import Codec.Binary.UTF8.String (encode)

main = B.putStrLn (B.pack $ encode " " :: B.ByteString)


Text
• import Data.Text
• import Data.Text.IO
• always Unicode (stored as UTF-8 since text 2.0, UTF-16 before)
• import Data.Text.Lazy
• Fast


Text (cont'd)
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import qualified Data.Text.IO as T

main = T.putStrLn (" " :: Text)

• UTF-8 friendly
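
Moving between the three text types is a common stumbling block, so here is a minimal sketch of the conversions. The names `toUtf8`, `fromUtf8` and `fields` are made up for this example; the library functions come from the text and bytestring packages.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- String -> Text -> UTF-8 encoded ByteString
toUtf8 :: String -> B.ByteString
toUtf8 = TE.encodeUtf8 . T.pack

-- UTF-8 encoded ByteString -> Text -> String
-- (decodeUtf8 throws an exception on invalid input)
fromUtf8 :: B.ByteString -> String
fromUtf8 = T.unpack . TE.decodeUtf8

-- Text ships its own split; note that the separator comes first.
fields :: [T.Text]
fields = T.splitOn (T.pack "<>") (T.pack "aaa<>bb<>c<><>d")
```

A ByteString counts bytes while String and Text count characters, so `B.length (toUtf8 "\233")` is 2 even though the input is a single character.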

Parsec supports
• String
• ByteString


Attoparsec supports
• ByteString
• Text


Attoparsec
• cabal install attoparsec
• attoparsec-text
• attoparsec-enumerator
• attoparsec-iteratee
• attoparsec-text-enumerator


Attoparsec pros/cons
• Pros
• fast
• text support
• enumerator/iteratee
• Cons
• no lookAhead/notFollowedBy

Parsec and Attoparsec
-- Parsec
import qualified Text.Parsec as P

main = print $ abc "abc"

abc str = case P.parse f "abc" str of
  Right _ -> True
  Left _ -> False
f = P.string "abc"

-- Attoparsec
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Attoparsec.Text as P

main = print $ abc "abc"

abc str = case P.parseOnly f str of
  Right _ -> True
  Left _ -> False
f = P.string "abc"


return ()


Practice
• args "f(x, g())"
-- ["x", "g()"]
• args "f(, aa(), bb(c))"
-- ["", "aa()", "bb(c)"]
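
One possible answer to the exercise, written with Parsec. This is only a sketch: the helper names `nested`, `argument` and `call` are invented here, and the choice to keep nested parentheses verbatim and to skip whitespace after each comma is an assumption based on the two expected outputs above.

```haskell
import qualified Text.Parsec as P
import Text.Parsec.String (Parser)

-- A parenthesised group, kept verbatim, e.g. "(c)" or "()".
nested :: Parser String
nested = do
  _     <- P.char '('
  inner <- P.many (P.many1 (P.noneOf "()") P.<|> nested)
  _     <- P.char ')'
  return ("(" ++ concat inner ++ ")")

-- One argument: text without top-level commas, possibly containing groups.
argument :: Parser String
argument = concat <$> P.many (P.many1 (P.noneOf "(),") P.<|> nested)

-- A call such as "f(x, g())"; yields the top-level arguments.
call :: Parser [String]
call = do
  _  <- P.many1 (P.noneOf "(),")                   -- function name
  _  <- P.char '('
  xs <- argument `P.sepBy` (P.char ',' >> P.spaces)
  _  <- P.char ')'
  P.eof
  return xs

args :: String -> [String]
args str = either (error . show) id (P.parse call "args" str)
```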

