This message was edited by Moderator at 2006-7-27 10:27:28
: I am trying to remove HTML tags from index.html.
: I managed that, but my code does not remove comments, links, and other things that are irrelevant.
: Here is my code:
: import re
: from urllib import urlopen
:
: page = urlopen("http://www.ee.uct.ac.za").read()
: # Removing all the HTML tags from the page
: # (the lookahead skips <a ...>, </a> and anything starting with <!)
: text = re.sub(r'<(?!(?:a\s|/a|!))[^>]*>', '', page)
: print text
: # Save the stripped text (testfile.txt was opened but never written to)
: myfile = open('testfile.txt', 'w')
: myfile.write(text)
: myfile.close()
I'm not good yet with complex regex patterns like the one you have there, but clearly you need either to add some more complexity to that regex or to add at least one more regex substitution to catch the things this one misses.
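For what it's worth, the negative lookahead `(?!(?:a\s|/a|!))` in the posted pattern tells the regex to skip `<a ...>`, `</a>`, and anything beginning with `<!`, which would explain why links and comments survive. Here is a rough sketch of the "more than one substitution" idea, written for current Python rather than the version in the original post. The `SAMPLE` string and the `strip_html` name are made up for illustration, and regexes remain fragile on real-world HTML, so treat this as a starting point only.

```python
import re

# Hypothetical sample input standing in for the downloaded page.
SAMPLE = """<html><head><style>p { color: red; }</style></head>
<body><!-- a comment --><p>Hello <a href="x.html">world</a></p>
<script>var x = 1;</script></body></html>"""


def strip_html(page):
    """Remove comments, script/style blocks, and all tags in three passes."""
    # Pass 1: drop comments, including ones that span several lines.
    page = re.sub(r'<!--.*?-->', '', page, flags=re.DOTALL)
    # Pass 2: drop script and style elements together with their contents.
    page = re.sub(r'<(script|style)\b.*?</\1\s*>', '', page,
                  flags=re.DOTALL | re.IGNORECASE)
    # Pass 3: drop every remaining tag, including <a ...> and </a>.
    page = re.sub(r'<[^>]*>', '', page)
    return page


print(strip_html(SAMPLE).strip())
```

The link text ("world") is kept and only the markup around it is removed. If the link text should go too, that would need one more pass of the same shape as pass 2.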
Or you could use an XML parser and just spit out the text parts. If there's a chance that the HTML is not well-formed XML, though, there is a class available called BeautifulSoup that can parse pages that have mistakes in them: http://www.crummy.com/software/BeautifulSoup/download/
(though I am unable to access that page currently).
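As a sketch of the parser route: BeautifulSoup is a third-party package, so the example below swaps in the standard library's `html.parser` module from current Python instead, which is also tolerant of badly formed markup. The `TextExtractor` class, the `extract_text` helper, and the `BROKEN` sample are hypothetical names invented for this illustration, not part of any library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes while ignoring script and style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering <script> or <style>: stop collecting text.
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        # Leaving <script> or <style>: resume collecting text.
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Comments never arrive here; they go to handle_comment, which
        # does nothing by default, so they are dropped automatically.
        if self.skip_depth == 0:
            self.parts.append(data)


def extract_text(page):
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.parts)


# Hypothetical malformed input: <p> and <b> are never closed.
BROKEN = ('<p>Unclosed <b>bold <a href=x>link</a>'
          '<!-- note --><script>var a = "<p>";</script>')
print(extract_text(BROKEN))
```

The advantage over the regex approach is that the parser decides what is a tag, a comment, or text, so a stray `<` inside a script or an attribute value does not confuse the stripping.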
$ select * from users where clue > 0
no rows returned