Whenever I consider learning a new language -- haskell in this case -- I try to hack together a primitive grep clone to see how good the language implementation and/or its libraries are at text processing, because that's a major use case for me.
Inspired by code on the haskell wiki, I came up with the following naive attempt:
{-# LANGUAGE FlexibleContexts, ExistentialQuantification #-}
import Text.Regex.PCRE
import System.Environment
io :: ([String] -> [String]) -> IO ()
io f = interact (unlines . f . lines)
regexBool :: forall r l .
(RegexMaker Regex CompOption ExecOption r,
RegexLike Regex l) =>
r -> l -> Bool
regexBool r l = l =~ r :: Bool
grep :: forall r l .
(RegexMaker Regex CompOption ExecOption r, RegexLike Regex l) =>
r -> [l] -> [l]
grep r = filter (regexBool r)
main :: IO ()
main = do
argv <- getArgs
io $ grep $ argv !! 0
This appears to be doing what I want it to, but unfortunately, it's really slow -- about 10 times slower than a python script doing the same thing. I assume it's not the regex library that's at fault here, because it's calling into PCRE which should be plenty fast (switching to Text.Regex.Posix slows things down quite a bit further). So it must be the String implementation, which is instructive from a theoretical point of view but inefficient according to what I've read.
Is there an alternative to Strings in haskell that's both efficient and convenient (i.e. there's little or no friction when switching to using that instead of Strings) and that fully and correctly handles UTF-8-encoded Unicode, as well as other encodings without too much hassle if possible? Something that everybody uses when doing text processing in haskell but that I just don't know about because I'm a complete beginner?