As if I didn’t have enough to do, I started playing with Python, and I loved it: it’s simple, it’s powerful, and you barely have to learn it. I started programming seriously in Python after two hours of command-line experiments. Anyone who has already seen a programming language practically knows Python. I really felt the absence of serious functional programming support and of OCaml-like pattern matching, but that’s another story.
I often need to grab data from HTML pages, and I usually do that with Java and HTMLParser, but this time I tried Python because a friend of mine told me how beautiful the soup is.
When I have to extract a lot of data from HTML pages, I always dream of a sort of pattern matching for HTML: I want to be able to get “this kind of tag sequence” inside a page. So I created a library based on BeautifulSoup to do that for me, and it’s impressively effective and easy to use!
Theory
An HTML page is a tree of nodes, called the DOM, where each node is a tag or a string (this is how BeautifulSoup sees it). So we have the DOM of a page and the tree of a pattern, and we want to find all the occurrences of that pattern in the page.
Now the question is: how do we decide that two nodes match? For my purposes, two nodes match if they have the same tag name and if the input node (from the page) has at least all the attributes of the pattern node, with each attribute value matching as a string. OK, but what does it mean for two strings to match? I defined a simple syntax that worked well for all my targets, and you’ll see it in the following example.
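To make the "tree of tags and strings" idea concrete, here is a minimal sketch in Python 3 that uses only the standard library's html.parser instead of BeautifulSoup. The names NodeLister and list_nodes are hypothetical, made up for this illustration, and are not part of htmlmatch. The sketch also assumes well-formed markup in which void tags are written self-closed (as in `<img .../>`), otherwise the depth counter drifts.

```python
from html.parser import HTMLParser


class NodeLister(HTMLParser):
    """Collects tags and non-blank strings in document order (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.nodes = []   # entries: (depth, kind, value, attrs)
        self.depth = 0    # current nesting level in the tree

    def handle_starttag(self, tag, attrs):
        self.nodes.append((self.depth, "tag", tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only strings between tags
            self.nodes.append((self.depth, "string", text, {}))


def list_nodes(html):
    """Return the page as a flat, depth-annotated list of nodes."""
    lister = NodeLister()
    lister.feed(html)
    lister.close()
    return lister.nodes


snippet = ('<div class="video"><a href="watch?v=0001">Title first video</a>'
           '<img src="preview1.jpg"/></div>')
for depth, kind, value, attrs in list_nodes(snippet):
    # Indentation mirrors the depth of each node in the tree.
    print("  " * depth + kind, value, attrs)
```

Walking the tree in document order like this turns both the page and the pattern into plain sequences of nodes, which is what makes a node-by-node comparison possible.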
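The node-matching rule can be sketched in a few lines of Python 3. This is only an illustration under simplifying assumptions: a node is modeled as a hypothetical (tag_name, attributes) tuple rather than a BeautifulSoup object, and the string comparison is passed in as a function. The `exact` comparison used here is a stand-in that captures nothing; it is not the variable syntax the post goes on to describe.

```python
def node_match(input_node, pattern_node, str_match):
    """Return a dict of captured variables if the nodes match, else None.

    Nodes are (tag_name, attrs_dict) tuples. The input node may carry
    extra attributes; only the pattern's attributes are checked.
    """
    input_name, input_attrs = input_node
    pattern_name, pattern_attrs = pattern_node
    if input_name != pattern_name:
        return None
    captured = {}
    for name, pattern_value in pattern_attrs.items():
        if name not in input_attrs:
            return None  # the pattern requires an attribute the input lacks
        found = str_match(input_attrs[name], pattern_value)
        if found is None:
            return None  # attribute present, but its value does not match
        captured.update(found)
    return captured


def exact(input_value, pattern_value):
    """Stand-in string matcher: plain equality, no variables captured."""
    return {} if input_value == pattern_value else None


page_div = ("div", {"class": "video", "id": "v1"})
pattern_div = ("div", {"class": "video"})
print(node_match(page_div, pattern_div, exact) is not None)
```

Returning a dictionary on success and None on failure lets an empty dictionary mean "matched, nothing captured", which is the same convention the functions in the source listing follow.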
Practice
Let’s suppose you have a page with a list of videos (page.html), and you want to get all the videos:
<html>
<head><title>Example</title></head>
<body>
<div class="video"> <a href="watch?v=0001">Title first video</a><img src="preview1.jpg"/></div>
<div class="video"> <a href="watch?v=0002">Title second video</a><img src="preview2.jpg"/></div>
<div class="video"> <a href="watch?v=0003">Title third video</a><img src="preview3.jpg"/></div>
...
</body>
</html>
You just have to copy the tag structure of one video and give it to my Python script as the pattern, replacing the parts you want to capture with variables, like this (pattern.html):
<div class="video"><a href="watch?v=$code$">$title$</a><img src="$preview$"/></div>
So, just put $variable$ wherever you want to capture a value (be as restrictive as you can to avoid ambiguities). Now if you run the script you get:
claudio@laptop:~$ ./htmlmatch.py page.html pattern.html
code: 0001
title: Title first video
preview: preview1.jpg

code: 0002
title: Title second video
preview: preview2.jpg

code: 0003
title: Title third video
preview: preview3.jpg
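For readers who want to experiment with the $variable$ syntax on its own, here is a minimal Python 3 sketch that gets a similar effect by translating a pattern into a regular expression. This swaps in a different technique from the character-by-character strmatch function in the source listing, so edge cases may differ. The names compile_pattern and str_match are hypothetical, and the sketch assumes each variable name is a valid identifier used at most once per pattern.

```python
import re

# A variable is an identifier wrapped in dollar signs, e.g. $code$.
VARIABLE = re.compile(r"\$([A-Za-z_]\w*)\$")


def compile_pattern(pattern):
    """Translate a $variable$ pattern into a compiled regular expression."""
    parts = []
    pos = 0
    for found in VARIABLE.finditer(pattern):
        # Literal text between variables must match exactly.
        parts.append(re.escape(pattern[pos:found.start()]))
        # Each variable becomes a lazy named group.
        parts.append("(?P<%s>.*?)" % found.group(1))
        pos = found.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts), re.DOTALL)


def str_match(text, pattern):
    """Return {variable: value} if text matches the whole pattern, else None."""
    found = compile_pattern(pattern).fullmatch(text)
    return found.groupdict() if found else None


print(str_match("watch?v=0001", "watch?v=$code$"))
print(str_match("something else", "watch?v=$code$"))
```

Requiring a full match means the literal parts of the pattern anchor the variables, which is why being restrictive in the pattern reduces ambiguity.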
You can easily access all these fields by calling the library as a function from your Python code and iterating over the list of dictionaries it returns. For example:
import urllib2
# Assumes htmlmatch-0.1.py was saved as htmlmatch.py somewhere importable.
from htmlmatch import htmlmatch

page = urllib2.urlopen("http://www.your_video_website.com/")
pattern = open("pattern.html", "r")
matches = htmlmatch(page, pattern)
for match in matches:
    for k, v in match.iteritems():
        print k, v
    print
The Source
I suggest using BeautifulSoup version 3.0.7a, because it behaves better with real-world HTML (newer versions are based on the new Python HTML parser, which is not much of an improvement for messy pages).
The full source of htmlmatch-0.1.py:
#!/usr/bin/env python
import sys
import urllib2
from BeautifulSoup import BeautifulSoup, Tag, NavigableString
import HTMLParser


def htmlmatch(page, pattern):
    """Finds all the occurrences of the pattern tree in the given HTML page.
    Returns a list of dictionaries, one per occurrence."""
    isoup = BeautifulSoup(page)
    psoup = BeautifulSoup(pattern)

    def untiltag(gen):
        """Returns the next tag or non-blank string from the generator."""
        node = gen.next()
        while True:
            if isinstance(node, Tag):
                break
            elif len(node.lstrip()) == 0:
                # whitespace-only string: skip it
                node = gen.next()
            else:
                break
        return node

    pgen = psoup.recursiveChildGenerator()
    pnode = untiltag(pgen)
    igen = isoup.recursiveChildGenerator()
    inode = untiltag(igen)
    variables = []
    lastvars = {}
    while True:
        newvars = nodematch(inode, pnode)
        if newvars is not None:
            if len(newvars) > 0:
                lastvars.update(newvars)
            try:
                pnode = untiltag(pgen)
            except StopIteration:
                # the whole pattern matched: store the result and restart it
                pgen = psoup.recursiveChildGenerator()
                pnode = untiltag(pgen)
                if len(lastvars) > 0:
                    variables.append(lastvars)
                    lastvars = {}
        else:
            # mismatch: restart the pattern from its first node
            pgen = psoup.recursiveChildGenerator()
            pnode = untiltag(pgen)
        try:
            inode = untiltag(igen)
        except StopIteration:
            # end of the page
            return variables


def nodematch(input, pattern):
    """Matches two nodes: returns a dictionary of variables (possibly empty)
    if the nodes are of the same kind, and if the first tag has AT LEAST all
    the attributes of the second one (the pattern) and these attributes match
    as strings, as defined in the strmatch function. Returns None otherwise."""
    if input.__class__ != pattern.__class__:
        return None
    if isinstance(input, NavigableString):
        return strmatch(input, pattern)
    if isinstance(input, Tag) and input.name != pattern.name:
        return None
    variables = {}
    for attr, value in pattern._getAttrMap().iteritems():
        if input.has_key(attr):
            newvars = strmatch(input.get(attr), value)
            if newvars is not None:
                variables.update(newvars)
            else:
                return None
        else:
            return None
    return variables


def strmatch(input, pattern):
    """Matches the input string with the pattern string.

    For example:
        input:   "python and ocaml are great languages"
        pattern: "$lang1$ and $lang2$ are great languages"
    gives as output the map:
        {"lang1": "python", "lang2": "ocaml"}
    The function returns None if the strings don't match."""
    var, value = None, None
    i, j = 0, 0
    map = {}
    input_len = len(input)
    pattern_len = len(pattern)
    while i < input_len:
        if var is None:
            if j >= pattern_len:
                # input is longer than the pattern: no match
                # (guard added to avoid an IndexError on pattern[j])
                return None
            if pattern[j] == '$':
                # start of a variable
                var = ""
                value = ""
                j += 1
            elif input[i] != pattern[j]:
                return None
            else:
                i += 1
                j += 1
        else:
            # read the variable name up to the closing '$'
            while pattern[j] != '$':
                var += pattern[j]
                j += 1
            j += 1
            if j == pattern_len:
                # variable at the end of the pattern: take the rest of the input
                while i < input_len:
                    value += input[i]
                    i += 1
            else:
                # take the input up to the character that follows the variable
                while i < input_len and input[i] != pattern[j]:
                    value += input[i]
                    i += 1
                i += 1
                j += 1
            map[var] = value
            var = None
    return map


def main(argv):
    if len(argv) < 2:
        print "example: ./htmlmatch.py input.html pattern.html"
        return
    page = open(argv[0], "r")
    pattern = open(argv[1], "r")
    l = htmlmatch(page, pattern)
    for m in l:
        for k, v in m.iteritems():
            print k, v
        print


if __name__ == "__main__":
    main(sys.argv[1:])
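The main loop of htmlmatch walks the page nodes and the pattern nodes in lockstep and restarts the pattern on a mismatch. Stripped of the BeautifulSoup details, that idea can be sketched in Python 3 as follows. This is a simplified illustration, not a port of the script: nodes are plain strings, the names find_matches and toy_match are hypothetical, and unlike the listing above this sketch also discards partially captured variables when a match attempt fails.

```python
def find_matches(page_nodes, pattern_nodes, node_match):
    """Scan page_nodes for consecutive runs that match pattern_nodes.

    node_match(node, pattern_node) returns a dict of captured
    variables on success and None on failure.
    """
    results = []
    captured = {}
    p = 0  # index of the next pattern node to match
    for node in page_nodes:
        found = node_match(node, pattern_nodes[p])
        if found is None:
            # Mismatch: restart the pattern and move on to the next page node.
            p = 0
            captured = {}
            continue
        captured.update(found)
        p += 1
        if p == len(pattern_nodes):
            # The whole pattern matched; keep it only if it captured something.
            if captured:
                results.append(captured)
            captured = {}
            p = 0
    return results


def toy_match(node, pattern_node):
    """Toy matcher: "$name$" captures the whole node, anything else must be equal."""
    if len(pattern_node) > 2 and pattern_node[0] == "$" and pattern_node[-1] == "$":
        return {pattern_node[1:-1]: node}
    return {} if node == pattern_node else None


page = ["div", "a", "Title first video", "img",
        "div", "a", "Title second video", "img"]
pattern = ["div", "a", "$title$", "img"]
for match in find_matches(page, pattern, toy_match):
    print(match)
```

One limitation worth knowing about: after a mismatch, the page node that caused it is not retried against the first pattern node, so an occurrence that begins exactly at that node is missed. The posted loop appears to behave the same way, which is one more reason to keep patterns as restrictive as possible.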
