Optimizing string parsing with Python -
i have string in form 'ab(ab(ddc)c)a(baac)dab(abc)'
.
- each character represents element (
a
,b
,c
ord
). - between parentheses, on right, there child of each element (which may absent).
in example, having 'ab(ab(ddc)c)a(baac)da'
, top level ab(ab(ddc)c)a(baac)da --> [a, b, a, d, a]
, corresponding children [none, ab(ddc)c, baac, none, none]
. children parsed recursively.
i have implemented solution here:
def parse_string(string): = 0 parsed = [] while < len(string): if string[i] in ('a', 'b', 'c', 'd'): parsed.append([string[i], none]) += 1 elif string[i] == '(': open_brakets = 1 += 1 j = while open_brakets: if string[j] == '(': open_brakets += 1 elif string[j] == ')': open_brakets -= 1 j += 1 # parse children parsed[-1][-1] = parse_string(string[i:j - 1]) = j else: += 1 return parsed print parse_string('ab(ab(ddc)c)a(baac)dab(abc)')
although think it's bit ugly , i'm sure not efficient.
i wonder if there's way make python in cleaner/faster/more elegant way? using external libraries allowed (specially if they're written in c! :-p).
update
other examples of strings should work:
abc(dab(acb)bbb(aaa)abc)dcb
in general, length of string not limited, neither number of children, nor length, nor number of nested levels.
if need recursively parse inner parentheses well:
def parse_tree(tree, string, start=0): index = start while index < len(string): current = string[index] if current == "(": child = tree[-1][1] child_parsed = parse_tree(child, string, index+1) index += child_parsed + 2 # adds 2 parentheses elif current == ")": break else: tree.append((current, [])) index += 1 return index - start tree = [] print(parse_tree(tree, 'abc(abc(defg)d)de(f)gh'))
the way works can thought of state machine. state machine accepts node definitions until sees open parentheses, in pushes new context (i.e. recursive function call) parsing stack parse content of parentheses. when parsing inner context, close parentheses pops context.
another alternative, can scale better if have more complex grammars use parsing library pyparsing:
from pyparsing import oneormore, optional, oneof, alphas, word, group, forward, suppress, dict # define grammar nodes = forward() nodename = oneof(list(alphas)) nodechildren = suppress('(') + group(nodes) + suppress( ')') node = group(nodename + optional(nodechildren)) nodes <<= oneormore(node) print(nodes.parsestring('abc(abc(defg)d)de(f)gh'))
parsing libraries pyparsing allows define easy-to-read declarative grammar.
answer original non-recursive parsing: 1 way itertools (accumulate python 3.2 , up, itertools docs has pure python implementation of accumulate use in older versions). avoids use of indices:
from itertools import takewhile, accumulate parens_map = {'(': 1, ')': -1} def parse_tree(tree, string): string = iter(string) while string: current = next(string) if current == "(": child = iter(string) child = ((c, parens_map.get(c, 0)) c in child) child = accumulate(child, lambda a,b: (b[0], a[1]+b[1])) child = takewhile(lambda c: c[1] >= 0, child) child = (c[0] c in child) tree[-1][1] = "".join(child) else: tree.append([current, none]) print(parse_tree('abc(abc(defg)d)de(f)gh'))
i'm not quite sure whether it's faster or more elegant, think using explicit indexes easier write, understand, , modify.
Comments
Post a Comment