
HTML Pattern Matching in Python

As if I didn’t have enough to do, I started playing with Python, and I loved it: it’s kept simple, it’s powerful, and you don’t have to learn it. I started programming seriously in Python after two hours of experiments at the command line. Anyone who has seen a programming language before already knows Python. I really missed serious functional programming support and OCaml-style pattern matching, but that’s another story.

I often need to grab data from HTML pages, and I usually do that with Java and HTMLParser, but this time I tried Python because a friend of mine told me how beautiful the soup is.

When I have to extract a lot of data from HTML pages, I always dream of a sort of pattern matching for HTML: I want to be able to get “this kind of tag sequence” from a page. So I wrote a library based on BeautifulSoup that does this for me, and it’s impressively effective and easy to use!

Theory

An HTML page is a tree of nodes, called the DOM, where each node is a tag or a string (this is how BeautifulSoup sees it). So we have the DOM of a page and the tree of a pattern, and we want to find all the occurrences of that pattern in the page.
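
To make the idea of a tree of tags and strings concrete, here is a minimal sketch that lists the nodes of a fragment in document order. It is written for Python 3 and uses the standard library's html.parser rather than BeautifulSoup; the NodeLister class and the list_nodes function are names invented for this illustration. Whitespace-only strings are skipped, as the script further down also does.

```python
from html.parser import HTMLParser


class NodeLister(HTMLParser):
    """Collects opening tags and non-blank strings in document order."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.nodes.append(("tag", tag, dict(attrs)))

    def handle_data(self, data):
        # ignore strings made only of whitespace
        if data.strip():
            self.nodes.append(("string", data.strip()))


def list_nodes(html):
    """Hypothetical helper: parse html and return the collected nodes."""
    parser = NodeLister()
    parser.feed(html)
    parser.close()
    return parser.nodes


fragment = ('<div class="video"><a href="watch?v=0001">Title first video</a>'
            '<img src="preview1.jpg"/></div>')
nodes = list_nodes(fragment)
for node in nodes:
    print(node)
```

Both the page and the pattern can be flattened this way, which turns the search for a sub-tree into a search for a sequence of nodes.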

HTML Pattern

Now the question is: how do we decide that two nodes match? For my purposes, two nodes match if they have the same tag name and the input node (from the page) has at least all the attributes of the pattern node, with values that match as strings. And what does it mean for two strings to match? I defined a simple syntax that worked well for all my targets; you’ll see it in the following example.
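
The matching rule can be sketched in a few lines. This is not the code of the library, just an illustration under some assumptions: it is Python 3, nodes are modelled as plain (name, attributes) pairs instead of BeautifulSoup objects, and string matching is delegated to a regular expression. It also assumes that variable names are valid identifiers and that each one appears only once per pattern string. The $name$ placeholder syntax it uses is the one introduced in the Practice section.

```python
import re


def strmatch(text, pattern):
    """Return {variable: value} if text matches pattern, else None."""
    # re.split with a capturing group alternates literal, name, literal, ...
    parts = re.split(r"\$(\w+)\$", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)          # literal text
        else:
            regex += "(?P<%s>.*?)" % part     # a $variable$
    found = re.fullmatch(regex, text, re.DOTALL)
    return found.groupdict() if found else None


def nodematch(node, pattern):
    """Both arguments are (tag_name, attributes_dict) pairs."""
    name, attrs = node
    pname, pattrs = pattern
    if name != pname:
        return None
    variables = {}
    # the input node needs AT LEAST the attributes of the pattern node
    for attr, pvalue in pattrs.items():
        if attr not in attrs:
            return None
        found = strmatch(attrs[attr], pvalue)
        if found is None:
            return None
        variables.update(found)
    return variables


link = ("a", {"href": "watch?v=0001", "class": "thumb"})
print(nodematch(link, ("a", {"href": "watch?v=$code$"})))
print(nodematch(link, ("a", {"title": "$title$"})))
```

The first call succeeds even though the input node carries an extra class attribute; the second fails because the pattern asks for an attribute the input node does not have.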

Practice

Let’s suppose you have a page with a list of videos (page.html), and you want to get all the videos:

<html>
<head><title>Example</title></head>
<body>
<div class="video">
		<a href="watch?v=0001">Title first video</a><img src="preview1.jpg"/></div>
<div class="video">
		<a href="watch?v=0002">Title second video</a><img src="preview2.jpg"/></div>
<div class="video">
		<a href="watch?v=0003">Title third video</a><img src="preview3.jpg"/></div>
...
</body>
</html>

You just have to copy the tag structure of one video and give it to my Python script as a pattern, replacing the parts you want to capture with variables, like this (pattern.html):

<div class="video"><a href="watch?v=$code$">$title$</a><img src="$preview$"/></div>

So, just put $variable$ wherever you want to capture a value (be as restrictive as you can to avoid ambiguities). Now if you run the script you get:

claudio@laptop:~$ ./htmlmatch.py page.html pattern.html
code: 0001
title: Title first video
preview: preview1.jpg

code: 0002
title: Title second video
preview: preview2.jpg

code: 0003
title: Title third video
preview: preview3.jpg


You can easily access all these fields by calling the library as a function from your Python code and iterating over the list of dictionaries it returns. For example:

page = urllib2.urlopen("http://www.your_video_website.com/")
pattern = open("pattern.html", "r")
matches = htmlmatch(page, pattern)
for map in matches:
	for k, v in map.iteritems():
		print k, v
	print
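
Coming back to the advice about restrictive patterns, a small experiment shows why it matters. The sketch below (Python 3, standard library only) is a stand-in for the library's string matching, not its actual code: compile_pattern and captures are hypothetical helpers, the href values are made up, and variables are translated into regular expression groups.

```python
import re


def compile_pattern(pattern):
    """Turn a pattern with $name$ variables into a compiled regex."""
    pieces = re.split(r"\$(\w+)\$", pattern)
    regex = "".join(
        re.escape(piece) if index % 2 == 0 else "(?P<%s>.*)" % piece
        for index, piece in enumerate(pieces)
    )
    return re.compile(regex, re.DOTALL)


def captures(pattern, values):
    """Return the captured variables for every value that matches."""
    compiled = compile_pattern(pattern)
    return [m.groupdict() for m in map(compiled.fullmatch, values) if m]


# made-up href values, as they might appear on a page
hrefs = ["watch?v=0001", "about.html", "watch?v=0002"]

loose = captures("$link$", hrefs)           # matches every link
strict = captures("watch?v=$code$", hrefs)  # matches only the videos

print(len(loose), "matches for the loose pattern")
print(strict)
```

Keeping the literal watch?v= prefix in the pattern is what filters out the unrelated link.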

The Source

I suggest using BeautifulSoup version 3.0.7a, because it behaves better with real-world HTML (newer versions are based on the new Python HTML parser, which is less tolerant of malformed markup).

Here is the full source of htmlmatch-0.1.py:

#!/usr/bin/env python

import sys
import urllib2
from BeautifulSoup import BeautifulSoup, Tag, NavigableString
import HTMLParser

def htmlmatch(page, pattern):
	"""Finds all the occurrences of the pattern tree in the given HTML page"""
	isoup = BeautifulSoup(page)
	psoup = BeautifulSoup(pattern)

	def untiltag(gen):
		node = gen.next()
		while True:
			if isinstance(node, Tag):
				break
			elif len(node.lstrip()) == 0:
				node = gen.next()
			else:
				break
		return node

	pgen = psoup.recursiveChildGenerator()
	pnode = untiltag(pgen)
	igen = isoup.recursiveChildGenerator()
	inode = untiltag(igen)

	variables = []
	lastvars = {}

	while True:
		newvars = nodematch(inode, pnode)
		if newvars is not None:
			# The nodes match: keep the captured variables and advance the pattern.
			lastvars.update(newvars)
			try:
				pnode = untiltag(pgen)
			except StopIteration:
				# The whole pattern has been matched: store the result and restart.
				pgen = psoup.recursiveChildGenerator()
				pnode = untiltag(pgen)
				if len(lastvars) > 0:
					variables.append(lastvars)
					lastvars = {}
		else:
			# Mismatch: restart the pattern from its first node.
			pgen = psoup.recursiveChildGenerator()
			pnode = untiltag(pgen)
		try:
			inode = untiltag(igen)
		except StopIteration:
			# The page is exhausted: return everything collected so far.
			return variables

def nodematch(input, pattern):
	"""Matches two nodes: returns a dictionary of captured variables (possibly
	empty) if the nodes are of the same kind and the first tag has AT LEAST
	all the attributes of the second one (the pattern), with values that
	match as strings as defined in the strmatch function. Returns None
	otherwise."""
	if input.__class__ != pattern.__class__:
		return None
	if isinstance(input, NavigableString):
		return strmatch(input, pattern)
	if isinstance(input, Tag) and input.name != pattern.name:
		return None
	variables = {}
	for attr, value in pattern._getAttrMap().iteritems():
		if input.has_key(attr):
			newvars = strmatch(input.get(attr), value)
			if newvars != None:
				variables.update(newvars)
			else:
				return None
		else:
			return None
	return variables

def strmatch(input, pattern):
	"""Matches the input string with the pattern string. For example:

	input: 	 "python and ocaml are great languages"
	pattern: "$lang1$ and $lang2$ are great languages"

	gives as output the map:
	{"lang1": "python", "lang2": "ocaml"}

	The function returns None if the strings don't match."""
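	variables = {}
	input_len = len(input)
	pattern_len = len(pattern)
	i, j = 0, 0
	while j < pattern_len:
		if pattern[j] != '$':
			# Literal character: it must be present in the input.
			if i >= input_len or input[i] != pattern[j]:
				return None
			i += 1
			j += 1
			continue
		# Variable: read its name up to the closing '$'.
		end = pattern.find('$', j + 1)
		if end == -1:
			return None
		var = pattern[j + 1:end]
		j = end + 1
		if j == pattern_len:
			# The variable closes the pattern: it takes the rest of the input.
			variables[var] = input[i:]
			return variables
		# Otherwise the value runs up to the next literal character.
		stop = input.find(pattern[j], i)
		if stop == -1:
			return None
		variables[var] = input[i:stop]
		i = stop
	# The pattern is consumed: the input must be consumed as well.
	if i != input_len:
		return None
	return variables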

def main(argv):
	if len(argv) < 2:
		print "example: ./htmlmatch.py input.html pattern.html"
		return
	page = open(argv[0], "r")
	pattern = open(argv[1], "r")
	l = htmlmatch(page, pattern)
	for m in l:
		for k, v in m.iteritems():
			print k, v
		print

if __name__ == "__main__":
    main(sys.argv[1:])
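
For readers who want the gist of the main loop without BeautifulSoup, here is a simplified reading of it in Python 3. It is a sketch, not a port: find_sequences and simple_match are hypothetical names, nodes are plain strings, and unlike the listing above it discards partially captured variables when a match attempt fails. Like the listing, it does not retry the mismatching node against the start of the pattern.

```python
def find_sequences(nodes, pattern, match):
    """Scan nodes once and collect the variables of every full pattern match.

    match(node, pattern_node) must return a dict of captured variables
    (possibly empty) on success and None on failure.
    """
    results = []
    if not pattern:
        return results
    current = {}
    position = 0
    for node in nodes:
        found = match(node, pattern[position])
        if found is None:
            # mismatch: restart the pattern and forget partial captures
            current = {}
            position = 0
            continue
        current.update(found)
        position += 1
        if position == len(pattern):
            # the whole pattern matched: store the result and restart
            if current:
                results.append(current)
            current = {}
            position = 0
    return results


def simple_match(node, pattern_node):
    """Toy matcher: '$name$' captures the whole node, anything else is literal."""
    if len(pattern_node) > 2 and pattern_node[0] == "$" and pattern_node[-1] == "$":
        return {pattern_node[1:-1]: node}
    return {} if node == pattern_node else None


page_nodes = ["html", "div", "a", "First", "div", "p", "div", "a", "Second"]
found = find_sequences(page_nodes, ["div", "a", "$title$"], simple_match)
print(found)
```

The second div is followed by a p instead of an a, so that attempt is abandoned and the scan carries on with the next node.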

