HTML Pattern Matching in Python

As if I didn’t have enough to do, I started playing with Python, and I loved it: it’s simple, it’s powerful, and you barely have to learn it. I started programming seriously in Python after two hours of tests at the command line. Anyone who has already seen a programming language practically knows Python. I really felt the absence of serious functional programming support and of OCaml-like pattern matching, but that’s another story.

I often need to grab data from HTML pages, and I usually do that with Java and HTMLParser, but this time I tried Python because one of my friends told me how beautiful the soup is.

When facing the problem of getting a lot of data out of HTML pages, I always dream of a sort of pattern matching for HTML: I want to be able to get “this kind of tag sequence” from a page. So I created a library based on BeautifulSoup to do that for me, and it’s impressively effective and easy to use!

Theory

An HTML page has a tree of nodes, called the DOM, where each node is a tag or a string (this is how BeautifulSoup sees it). So we have the DOM of a page and the tree of a pattern, and we want to find all the occurrences of that pattern in the page.
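
To make the "tree of tags and strings" view concrete, here is a minimal sketch that flattens a snippet into its nodes in document order, which is roughly the sequence a recursive walk over the DOM visits. It uses the standard library parser instead of BeautifulSoup so that it runs on its own (on Python 2 or 3); the `NodeLister` class and the `list_nodes` helper are names invented for this illustration.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class NodeLister(HTMLParser):
    """Collects tags and non-blank strings in document order."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        # A tag node: its name plus a dictionary of its attributes.
        self.nodes.append(("tag", tag, dict(attrs)))

    def handle_data(self, data):
        # A string node; whitespace-only strings are skipped.
        if data.strip():
            self.nodes.append(("string", data.strip()))


def list_nodes(html):
    parser = NodeLister()
    parser.feed(html)
    parser.close()
    return parser.nodes


page = ('<div class="video"><a href="watch?v=0001">Title first video</a>'
        '<img src="preview1.jpg"/></div>')
for node in list_nodes(page):
    print(node)
```

The snippet yields four nodes: the `div`, the `a`, the string inside the link, and the `img`.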

Now the question is: how do we decide that two nodes match? Well, for my purposes I defined two nodes as similar if they have the same tag name and if the input node (from the page) matches, AS STRINGS, at least all the attributes of the pattern node. OK, so what does it mean for two strings to match? I defined a simple syntax that worked well for all my targets, and you’ll see it in the example in the Practice section.
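
The similarity rule for tags can be sketched in a few lines. This is only an illustration under simplifying assumptions: a node is modelled as a hypothetical `(tag_name, attributes)` pair rather than a BeautifulSoup object, and attribute values are compared with plain equality unless another comparison function is passed in. The name `nodes_similar` is invented here.

```python
def nodes_similar(input_node, pattern_node, strings_match=None):
    """Return True if input_node is similar to pattern_node.

    Each node is a hypothetical (tag_name, attributes_dict) pair.
    """
    if strings_match is None:
        # Placeholder comparison: plain equality between the two strings.
        strings_match = lambda value, pattern: value == pattern
    input_name, input_attrs = input_node
    pattern_name, pattern_attrs = pattern_node
    if input_name != pattern_name:
        return False
    for attr, pattern_value in pattern_attrs.items():
        # The input node may carry extra attributes, but none may be missing.
        if attr not in input_attrs:
            return False
        if not strings_match(input_attrs[attr], pattern_value):
            return False
    return True


pattern = ("div", {"class": "video"})
print(nodes_similar(("div", {"class": "video", "id": "v1"}), pattern))  # True
print(nodes_similar(("div", {"id": "v1"}), pattern))                    # False
print(nodes_similar(("span", {"class": "video"}), pattern))             # False
```

Note the asymmetry: extra attributes on the page node are ignored, while every attribute of the pattern node is required.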

Practice

Let’s suppose you have a page with a list of videos (page.html), and you want to get all the videos:

<html>
<head><title>Example</title></head>
<body>
<div class="video">
		<a href="watch?v=0001">Title first video</a><img src="preview1.jpg"/></div>
<div class="video">
		<a href="watch?v=0002">Title second video</a><img src="preview2.jpg"/></div>
<div class="video">
		<a href="watch?v=0003">Title third video</a><img src="preview3.jpg"/></div>
...
</body>
</html>

You just have to copy the tag structure of one video and give it to my Python script as the pattern, substituting the variables you want, like this (pattern.html):

<div class="video"><a href="watch?v=$code$">$title$</a><img src="$preview$"/></div>

So, just put $variable$ wherever you want (be as restrictive as you can to avoid ambiguities). Now if you run the script you get:

claudio@laptop:~$ ./htmlmatch.py page.html pattern.html
code: 0001
title: Title first video
preview: preview1.jpg

code: 0002
title: Title second video
preview: preview2.jpg

code: 0003
title: Title third video
preview: preview3.jpg
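
To get a feel for the `$variable$` syntax used in the pattern above, here is a small sketch that translates such a pattern into a regular expression with named groups. This is a swapped-in technique for illustration, not how the script does it (the script scans the strings character by character), and regex backtracking can behave differently on ambiguous patterns. It assumes variable names are valid Python identifiers and appear only once per pattern; `pattern_to_regex` and `match_string` are invented names.

```python
import re


def pattern_to_regex(pattern):
    """Turn e.g. 'watch?v=$code$' into a compiled regex with a named group."""
    parts = pattern.split("$")
    # An even number of pieces means an odd number of '$' signs.
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced '$' in pattern: %r" % pattern)
    pieces = []
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pieces.append(re.escape(part))       # literal text
        else:
            pieces.append("(?P<%s>.*?)" % part)  # $name$ placeholder
    # \Z forces the pattern to cover the whole input string.
    return re.compile("".join(pieces) + r"\Z", re.DOTALL)


def match_string(value, pattern):
    """Return a dict of captured variables, or None if there is no match."""
    found = pattern_to_regex(pattern).match(value)
    if found is None:
        return None
    return found.groupdict()


print(match_string("watch?v=0001", "watch?v=$code$"))  # {'code': '0001'}
print(match_string("preview1.jpg", "$preview$"))       # {'preview': 'preview1.jpg'}
print(match_string("about.html", "watch?v=$code$"))    # None
```

The more literal text a pattern contains around its variables, the fewer unintended strings it can match, which is why restrictive patterns are safer.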

You can easily access all the matched fields by using the library as a function in your Python code and iterating over the list (of dictionaries) it gives you back. For example:

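import urllib2
from htmlmatch import htmlmatch  # assumes the script is saved as htmlmatch.py

page = urllib2.urlopen("http://www.your_video_website.com/")
pattern = open("pattern.html", "r")
matches = htmlmatch(page, pattern)
# Each match is a dictionary mapping variable names to the captured values.
for match in matches:
	for k, v in match.iteritems():
		print k, v
	print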
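
Since each match is a plain dictionary, post-processing is ordinary Python. As a minimal sketch, assuming matches shaped like the example output and reusing the placeholder site address from the snippet, the relative links can be turned into absolute URLs. The sample data is hard-coded so the sketch runs on its own (on Python 2 or 3), and the `to_video` helper and its keys are invented for this illustration.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

# Sample data shaped like the list of dictionaries the library returns.
matches = [
    {"code": "0001", "title": "Title first video", "preview": "preview1.jpg"},
    {"code": "0002", "title": "Title second video", "preview": "preview2.jpg"},
]

base_url = "http://www.your_video_website.com/"


def to_video(match):
    """Build absolute URLs from one match dictionary."""
    return {
        "title": match["title"],
        "url": urljoin(base_url, "watch?v=" + match["code"]),
        "preview_url": urljoin(base_url, match["preview"]),
    }


videos = [to_video(m) for m in matches]
for video in videos:
    print(video["title"] + " -> " + video["url"])
```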

The Source

I suggest using BeautifulSoup version 3.0.7a, because it behaves better with real-world HTML (newer versions are based on the new Python HTML parser, which is not much of an improvement):

http://www.crummy.com/software/BeautifulSoup/download/3.x/BeautifulSoup-3.0.7a.py

The script itself is htmlmatch-0.1.py:

#!/usr/bin/env python

import sys
import urllib2
from BeautifulSoup import BeautifulSoup, Tag, NavigableString
import HTMLParser

def htmlmatch(page, pattern):
	"""Finds all the occurrences of the pattern tree in the given HTML page"""
	isoup = BeautifulSoup(page)
	psoup = BeautifulSoup(pattern)

	def untiltag(gen):
		node = gen.next()
		while True:
			if isinstance(node, Tag):
				break
			elif len(node.lstrip()) == 0:
				node = gen.next()
			else:
				break
		return node

	pgen = psoup.recursiveChildGenerator()
	pnode = untiltag(pgen)
	igen = isoup.recursiveChildGenerator()
	inode = untiltag(igen)

	variables = []
	lastvars = {}

	while True:
		newvars = nodematch(inode, pnode)
		if newvars != None:
			if len(newvars) > 0:
				lastvars.update(newvars)
			try:
				pnode = untiltag(pgen)
			except StopIteration:
				pgen = psoup.recursiveChildGenerator()
				pnode = untiltag(pgen)
				if len(lastvars) > 0:
					variables.append(lastvars)
					lastvars = {}
		else:
			pgen = psoup.recursiveChildGenerator()
			pnode = untiltag(pgen)
		try:
			inode = untiltag(igen)
		except StopIteration:
			# The page is exhausted: return the matches collected so far.
			return variables

def nodematch(input, pattern):
	"""Matches two nodes: returns a dictionary of the captured variables
	(possibly empty) if the nodes are of the same kind, the first tag has
	AT LEAST all the attributes of the second one (the pattern) and these
	attributes match as strings, as defined in the strmatch function.
	Returns None if the nodes don't match."""
	if input.__class__ != pattern.__class__:
		return None
	if isinstance(input, NavigableString):
		return strmatch(input, pattern)
	if isinstance(input, Tag) and input.name != pattern.name:
		return None
	variables = {}
	for attr, value in pattern._getAttrMap().iteritems():
		if input.has_key(attr):
			newvars = strmatch(input.get(attr), value)
			if newvars != None:
				variables.update(newvars)
			else:
				return None
		else:
			return None
	return variables

def strmatch(input, pattern):
	"""Matches the input string with the pattern string. For example:

	input: 	 "python and ocaml are great languages"
	pattern: "$lang1$ and $lang2$ are great languages"

	gives as output the map:
	{"lang1": "python", "lang2": "ocaml"}

	The function returns None if the strings don't match."""
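	variables = {}
	i, j = 0, 0
	input_len = len(input)
	pattern_len = len(pattern)
	while j < pattern_len:
		if pattern[j] != '$':
			# Literal character: the input must contain it at this position.
			if i >= input_len or input[i] != pattern[j]:
				return None
			i += 1
			j += 1
			continue
		# Read the variable name up to the closing '$'.
		j += 1
		var = ""
		while j < pattern_len and pattern[j] != '$':
			var += pattern[j]
			j += 1
		if j == pattern_len:
			return None  # unterminated $variable in the pattern
		j += 1
		# The value runs up to the first occurrence of the character that
		# follows the variable in the pattern, or to the end of the input.
		value = ""
		if j == pattern_len:
			value = input[i:]
			i = input_len
		else:
			while i < input_len and input[i] != pattern[j]:
				value += input[i]
				i += 1
			if i == input_len:
				return None  # the text after the variable never appears
		variables[var] = value
	# A match must consume the whole input.
	if i != input_len:
		return None
	return variables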

def main(argv):
	if len(argv) < 2:
		print "example: ./htmlmatch.py input.html pattern.html"
		return
	page = open(argv[0], "r")
	pattern = open(argv[1], "r")
	l = htmlmatch(page, pattern)
	for m in l:
		for k, v in m.iteritems():
			print k, v
		print

if __name__ == "__main__":
    main(sys.argv[1:])
