First off, HTML doesn't like tabs (it collapses tabs and runs of spaces), so you need to use special tags to keep preformatted text, such as indented Python code, properly formatted. On PH, those tags are [code] and [/code].
: OK, I have a string that I got from a website (once the program is finished, it will read the page live every time...)
:
: Here is the program:
: --- START ---
:
: import urllib2
:
: # file = urllib2.urlopen("http://www.tremorseven.com/aim/deepaim.php?job=view")
:
: # print "URL Opened: " + file.geturl()
: # URLInfo = file.read()
: URLInfo = """
: <b>Deep Thoughts by Jack Handey</b>#235: When this girl at the museum asked me whom I liked better, Monet or Manet, I said, "I like mayonnaise." She just stared at me, so I said it again, louder. Then she left. I guess she went to try to find some mayonnaise for me. <a href="http://www.tremorseven.com/aim/deepaim.php?job=view" target="_self">Refresh</a> |
<a href="http://www.tremorseven.com/aim/deepaim.php?job=adding">add Deep Thoughts to your aim</a>
<a href="https://www.paypal.com/cgi-bin/webscr?cmd=_xclick&business=tinglea@chilitech.net&item_name=the%20deep%20thoughts%20stay%20online%
20fund&no_note=1&currency_code=USD&tax=0">please support this service.</a><font size=1>a service of <a href="http://www.tremorseven.com/">tremorseven.com</a></font>
: """
: print "Contents of URL: "
: print URLInfo
:
: for x in range(0, len(URLInfo)):
: if (URLInfo[x] == '#'):
: for y in range(x, x + 5):
: if (URLInfo[y] == ':'):
: NumberStr = URLInfo[x+1:y]
: print "Number of Deep Thought: " + NumberStr
: StartOfThought = y+2
: break
:
: for z in range(StartOfThoughts, len(URLInfo)):
: if (URLInfo[z] == '<':
: EndOfThought = z
:
: print "Contents of Deep Thought:"
: print URLInfo[StartOfThought: EndOfThought]
:
: ---- END ----
:
: I search for the "#" (this works), then I search for the ":" (this works) and retrieve the number of this deep thought. Then I search from the position after the ":" for a "<" (this doesn't work).
:
: And it does not work (dum dum dum). Any help would be appreciated.
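As for why the posted version fails, three things stand out, assuming the paste matches what you actually ran. The second loop reads StartOfThoughts (with an extra "s"), which was never assigned, so Python raises a NameError. The line if (URLInfo[z] == '<': is missing its closing parenthesis. And the loop has no break, so EndOfThought would end up at the last "<" in the page instead of the first one after the thought. Below is a minimal sketch of the same idea with those fixed. The variable names and the shortened sample string are mine, I swapped the first two loops for str.find() for brevity, and print is written with parentheses so it runs on both Python 2 and 3.

```python
# Shortened stand-in for the page contents, so the example is self-contained.
url_info = ('<b>Deep Thoughts by Jack Handey</b>#235: I said, "I like mayonnaise." '
            '<a href="http://www.tremorseven.com/aim/deepaim.php?job=view">Refresh</a>')

# This sketch assumes the page contains a '#' followed by a ':'.
hash_pos = url_info.find('#')               # position of the '#'
colon_pos = url_info.find(':', hash_pos)    # first ':' at or after the '#'
number_str = url_info[hash_pos + 1:colon_pos]

start_of_thought = colon_pos + 2            # skip the ':' and the space after it
end_of_thought = len(url_info)              # fallback if no '<' follows
for z in range(start_of_thought, len(url_info)):
    if url_info[z] == '<':
        end_of_thought = z
        break                               # stop at the FIRST '<', not the last

thought = url_info[start_of_thought:end_of_thought].strip()
print("Number of Deep Thought: " + number_str)
print("Contents of Deep Thought: " + thought)
```

The break is the important part: without it the loop keeps overwriting the end position with every later "<" it meets.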
Here's my first stab (note that I broke the text arbitrarily because preformatted text does not wrap and makes this page scroll far to the right if not forced to break):
import urllib2
import re
text = """<b>Deep Thoughts by Jack Handey</b><br><br>#235: When this girl at the
museum asked me whom I liked better, Monet or Manet, I said, "I like mayonnaise."
She just stared at me, so I said it again, louder. Then she left. I guess she went
to try to find some mayonnaise for me. <br><br>
<a href="http://www.tremorseven.com/aim/deepaim.php?job=view"
target="_self">Refresh</a> |
<a href="http://www.tremorseven.com/aim/deepaim.php?job=adding">add Deep Thoughts
to your aim</a><br><br><a href="https://www.paypal.com/cgi-bin/webscr?
cmd=_xclick&business=tinglea@chilitech.net&item_name=the%20deep%20thoughts%20stay%
20online%20fund&no_note=1&currency_code=USD&tax=0">please support this service.</a>
<br><br><font size=1>a service of <a href="http://www.tremorseven.com/">
tremorseven.com</a></font>"""
for match in re.finditer("#[0-9]+:", text):
    thought_number = text[match.start()+1 : match.end()-1]
    thought = ""
    try:
        thought = text[match.end()+1 : text.index("<", match.end() + 1)].strip()
    except ValueError: # '<' character not found in text
        thought = text[match.end()+1 : ].strip()
    print thought_number
    print thought
You may not be familiar with regular expressions (regex). Python has an "re" module that lets you use them, and they are perfect for searching text for patterns. The re.finditer() function takes a pattern and a string and returns an iterator of match objects, so you can step through the matches with a for loop. The pattern, "#[0-9]+:", is quite simple as far as regexen go; they can get quite complex. This one says "find a substring that starts with a hash (#), is followed by one or more (+) digits ([0-9]), and ends with a colon (:)." There are entire books written about regular expressions, and I highly recommend you at least learn the basics.
I tried to come up with a regular expression that would also pick out the "deep thought" itself, but that was beyond my ability, so I opted for the string method "index", which returns the position of a substring you specify (you can optionally specify the start and end points for the search as well) and raises ValueError if the substring isn't found.
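For what it's worth, here is one pattern that might pick out the thought in the same pass, using two capture groups. Treat it as a sketch and not a tested solution for the real page: it assumes the thought never contains a "<" character, the sample string is a shortened stand-in I made up from the text above, and print is written with parentheses so it runs on both Python 2 and 3.

```python
import re

# Shortened stand-in for the page text (an assumption, not the full page).
sample = ('<b>Deep Thoughts by Jack Handey</b><br><br>#235: I said, '
          '"I like mayonnaise." <br><br>'
          '<a href="http://www.tremorseven.com/">tremorseven.com</a>')

# Group 1: the digits between '#' and ':'.
# Group 2: everything after the colon up to (not including) the next '<',
#          or to the end of the string if no '<' follows.
pattern = re.compile(r"#([0-9]+):\s*([^<]*)")

results = []
for match in pattern.finditer(sample):
    results.append((match.group(1), match.group(2).strip()))

for number, thought in results:
    print(number)
    print(thought)
```

The negated character class [^<]* also matches newlines, so it keeps working when the text is broken across several lines.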
Try this out and let me know if you have any other questions.
infidel