rolisz's site

Regular Expressions for Objects

For work I recently needed to do something that is very similar to regexes, but with a twist: it should operate on lists of objects, not only on strings. Luckily, Python came to the rescue with REfO, a library for doing just this.

My usecase was selecting phrases from Part-of-Speech (POS) annotated text. The text was lemmatized and tagged using SpaCy and it resulted in lists of the following form:

s = [['i', 'PRON'], ['look', 'VERB'], ['around', 'ADP'], ['me', 'PRON'],
 ['and', 'CCONJ'], ['see', 'VERB'], ['that', 'ADP'], ['everyone', 'NOUN'],
 ['be', 'VERB'], ['run', 'VERB'], ['around', 'ADV'], ['in', 'ADP'],
 ['a', 'DET'], ['hurry', 'NOUN']]

From these sentences we want to extract human action phrases and noun phrases, which are defined as follows, using regex-like notation:

human_action = ("he"|"she"|"i"|"they"|"we") ([VERB] [ADP])+
noun_phrase = [DET]? ([ADJ] [NOUN])+

Translated to English this means that human actions are defined as 1st and 3rd person, singular and plural pronouns followed by repeated groups of verbs and ad­po­si­tions (in, to, during). Noun phrases are composed of an optional determiner (a, an, the) followed by repeated groups of adjectives and nouns.

Most standard regex libraries won't help you with this, because they work only on strings. But this problem is still perfectly well described by regular grammars, so after a bit of Googling I found REfO and it's super simple to use, albeit you have to read the source code, because it doesn't really have doc­u­men­ta­tion.

REfO is a bit more verbose than normal regular ex­pres­sions, but at least it tries to stay close to usual regex notions. Lazy repetition (*) is done using the refo.Star operator, while greedy one (+) is refo.Plus . The only new operator is refo.Predicate, which takes a function which takes a parameter and matches if that function returns true when called with the element at that position. Using this we will build the functions we need:

def pos(pos):
    return refo.Predicate(lambda x: x[1] == pos)

def humanpron():
    return refo.Predicate(lambda x: x[1] == 'PRON' and x[0] in {'i', 'he', 'she', 'we', 'they'})

For matching POS, we use a helper to create a function that will match the given tag. For matching human pronouns, we also check the words, not just the POS tag.

np = refo.Question(pos('DET')) + refo.Plus(refo.Question(pos('ADJ')) + pos('NOUN'))
humanAction = humanpron() + refo.Plus(pos('VERB') + pos('ADP'))

Then we just compose our functions and con­cate­nate them and we got what we wanted. Using them is simple. You either call refo.search, which finds the first match or refo.finditer which returns an iterable over all matches.

for match in refo.finditer(humanAction, s):
    start = match.start()
    end = match.end()
    print(s[start:end])
[[u'i', u'PRON'], [u'look', u'VERB'], [u'around', u'ADP']]

So, it's always good to Google around for a solution, because my first instict to whip up a parser in Parsec would have lead to a much more com­pli­cat­ed solution. This is nice, elegant, short and efficient.