Optimizing string parsing with Python -

- July 15, 2011

i have string in form 'ab(ab(ddc)c)a(baac)dab(abc)'.

each character represents element (a, b, c or d).
between parentheses, on right, there child of each element (which may absent).

in example, having 'ab(ab(ddc)c)a(baac)da', top level ab(ab(ddc)c)a(baac)da --> [a, b, a, d, a] , corresponding children [none, ab(ddc)c, baac, none, none]. children parsed recursively.

i have implemented solution here:

def parse_string(string):      = 0                                                                            parsed = []                                                                       while < len(string):                                                               if string[i] in ('a', 'b', 'c', 'd'):                                                     parsed.append([string[i], none])                                                 += 1                                                                       elif string[i] == '(':                                                               open_brakets = 1                                                                 += 1                                                                           j =                                                                            while open_brakets:                                                                  if string[j] == '(':                                                                 open_brakets += 1                                                            elif string[j] == ')':                                        open_brakets -= 1                                     j += 1             # parse children             parsed[-1][-1] = parse_string(string[i:j - 1])                    = j                                                                        else:                                                                                += 1                                                                    return parsed  print parse_string('ab(ab(ddc)c)a(baac)dab(abc)')

although think it's bit ugly , i'm sure not efficient.

i wonder if there's way make python in cleaner/faster/more elegant way? using external libraries allowed (specially if they're written in c! :-p).

update

other examples of strings should work:

abc(dab(acb)bbb(aaa)abc)dcb

in general, length of string not limited, neither number of children, nor length, nor number of nested levels.

if need recursively parse inner parentheses well:

def parse_tree(tree, string, start=0):     index = start     while index < len(string):         current = string[index]         if current == "(":             child = tree[-1][1]             child_parsed = parse_tree(child, string, index+1)             index += child_parsed + 2 # adds 2 parentheses         elif current == ")":             break         else:             tree.append((current, []))             index += 1     return index - start tree = [] print(parse_tree(tree, 'abc(abc(defg)d)de(f)gh'))

the way works can thought of state machine. state machine accepts node definitions until sees open parentheses, in pushes new context (i.e. recursive function call) parsing stack parse content of parentheses. when parsing inner context, close parentheses pops context.

another alternative, can scale better if have more complex grammars use parsing library pyparsing:

from pyparsing import oneormore, optional, oneof, alphas, word, group, forward, suppress, dict  # define grammar nodes = forward() nodename = oneof(list(alphas)) nodechildren = suppress('(') + group(nodes) + suppress( ')') node = group(nodename + optional(nodechildren)) nodes <<= oneormore(node)  print(nodes.parsestring('abc(abc(defg)d)de(f)gh'))

parsing libraries pyparsing allows define easy-to-read declarative grammar.

answer original non-recursive parsing: 1 way itertools (accumulate python 3.2 , up, itertools docs has pure python implementation of accumulate use in older versions). avoids use of indices:

from itertools import takewhile, accumulate parens_map = {'(': 1, ')': -1} def parse_tree(tree, string):     string = iter(string)     while string:         current = next(string)         if current == "(":             child = iter(string)             child = ((c, parens_map.get(c, 0)) c in child)             child = accumulate(child, lambda a,b: (b[0], a[1]+b[1]))             child = takewhile(lambda c: c[1] >= 0, child)             child = (c[0] c in child)             tree[-1][1] = "".join(child)         else:             tree.append([current, none]) print(parse_tree('abc(abc(defg)d)de(f)gh'))

i'm not quite sure whether it's faster or more elegant, think using explicit indexes easier write, understand, , modify.

Search This Blog

Shefl

Optimizing string parsing with Python -

update

Comments

Post a Comment

Popular posts from this blog

java - Custom OutputStreamAppender not run: LOGBACK: No context given for <MYAPPENDER> -

java - UML - How would you draw a try catch in a sequence diagram? -

c++ - No viable overloaded operator for references a map -