text processing in python - 388
anything when it encounters any event. Using HTMLParser.HTMLParser() is a matter of subclassing it and providing methods to handle the events you are interested in. If you need to keep track of the structural position of the current event within the document, you will need to maintain a data structure that records it. If you are certain that the document you are processing is well-formed XHTML, a stack suffices. For example:
HTMLParser_stack.py
    #!/usr/bin/env python
    import sys
    import HTMLParser

    # Tag structure restored from the paths in the output below;
    # any attributes in the original markup are not recoverable.
    html = """<html><head><title>Advice</title></head><body>
    <p>The <a>IETF admonishes:
       <i>Be strict in what you <b>send</b>.</i></a></p>
    </body></html>
    """
    tagstack = []

    class ShowStructure(HTMLParser.HTMLParser):
        def handle_starttag(self, tag, attrs):
            tagstack.append(tag)
        def handle_endtag(self, tag):
            tagstack.pop()
        def handle_data(self, data):
            # Print the path of open tags, then the start of the text
            if data.strip():
                for tag in tagstack:
                    sys.stdout.write('/'+tag)
                sys.stdout.write(' >> %s\n' % data[:40].strip())

    ShowStructure().feed(html)

Running this optimistic parser produces:

    % ./HTMLParser_stack.py
    /html/head/title >> Advice
    /html/body/p >> The
    /html/body/p/a >> IETF admonishes:
    /html/body/p/a/i >> Be strict in what you
    /html/body/p/a/i/b >> send
    /html/body/p/a/i >> .

You could, of course, use this context information however you wish when processing a particular bit of content (or when you process the tags themselves).

A more pessimistic approach is to maintain a "fuzzy" tagstack. We can define a new object that will remove the most recent starttag corresponding to an endtag and will also prevent <p> and <blockquote> tags from nesting if no corresponding endtag is found. You could do more along this line for a production application, but a class like TagStack makes a good start:

    class TagStack:
        def __init__(self, lst=None):
            # Avoid a shared mutable default argument
            if lst is None:
                lst = []
            self.lst = lst
        def __getitem__(self, pos):
            return self.lst[pos]
        def append(self, tag):
            # Remove every paragraph-level tag if this is one
            if tag.lower() in ('p','blockquote'):
                self.lst = [t for t in self.lst
                              if t not in ('p','blockquote')]
            self.lst.append(tag)
        def pop(self, tag):
            # "Pop" by tag from nearest pos, not only last item
            self.lst.reverse()
            try:
                pos = self.lst.index(tag)
            except ValueError:
                # Restore the original order before reporting the error
                self.lst.reverse()
                raise HTMLParser.HTMLParseError("Tag not on stack")
            del self.lst[pos]
            self.lst.reverse()
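To see how a fuzzy stack of this kind might be wired into a parser, here is a minimal sketch. It makes several assumptions that are not in the text above. It targets Python 3, where the parser lives in html.parser instead of the HTMLParser module. The names FuzzyTagStack, ShowFuzzyStructure, show_structure, and sloppy are hypothetical. It raises ValueError in place of HTMLParseError, which current Python 3 releases no longer provide. The stack logic follows the same two rules as TagStack: a paragraph-level start tag discards any open paragraph-level tag, and an end tag removes the nearest matching start tag.

```python
from html.parser import HTMLParser


class FuzzyTagStack:
    """Tolerant tag stack (hypothetical name), modelled on the TagStack idea."""

    BLOCK_TAGS = ('p', 'blockquote')

    def __init__(self, tags=None):
        self.tags = list(tags) if tags is not None else []

    def __iter__(self):
        return iter(self.tags)

    def push(self, tag):
        # A paragraph-level start tag implicitly closes any open one
        if tag in self.BLOCK_TAGS:
            self.tags = [t for t in self.tags if t not in self.BLOCK_TAGS]
        self.tags.append(tag)

    def pop(self, tag):
        # Remove the nearest matching start tag, wherever it sits
        for pos in range(len(self.tags) - 1, -1, -1):
            if self.tags[pos] == tag:
                del self.tags[pos]
                return
        # Assumption: ValueError stands in for HTMLParseError
        raise ValueError("Tag not on stack: %s" % tag)


class ShowFuzzyStructure(HTMLParser):
    """Collects 'path >> text' lines instead of writing to stdout."""

    def __init__(self):
        super().__init__()
        self.stack = FuzzyTagStack()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.stack.push(tag)

    def handle_endtag(self, tag):
        self.stack.pop(tag)

    def handle_data(self, data):
        if data.strip():
            path = ''.join('/' + tag for tag in self.stack)
            self.lines.append('%s >> %s' % (path, data[:40].strip()))


def show_structure(markup):
    parser = ShowFuzzyStructure()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.lines


# Deliberately sloppy input: an unclosed <p> and overlapping <i>/<b>
sloppy = ("<html><body><p>First<p>Second <i>nested <b>bold</i> text</b>"
          "</body></html>")
for line in show_structure(sloppy):
    print(line)
```

With this input the sketch prints:

    /html/body/p >> First
    /html/body/p >> Second
    /html/body/p/i >> nested
    /html/body/p/i/b >> bold
    /html/body/p/b >> text

The second <p> replaces the first instead of nesting under it, and the out-of-order </i> removes the i entry while leaving b on the stack. A plain list with append() and pop() would have misreported both cases.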
