I am trying to read a PDF file using pyPdf and write its text to a text file, but it's not working. The content value in the code below is just u'\n\n\n\n\n'. The PDF file has 5 pages, so that is 5 newline characters with a 'u' at the beginning. What is going wrong, and why is the page content not coming through? Any help is highly appreciated. Thanks, Sujan
[code]
#!/usr/bin/python
import pyPdf
import sys


def getPDFContent(path):
    content = ""
    # Open in binary mode and keep the file open while the pages are read
    with open(path, "rb") as p:
        pdf = pyPdf.PdfFileReader(p)
        # Append the text of each page, followed by a newline
        for i in range(pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"
    # Replace non-breaking spaces and collapse runs of whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content


def main():
    pdfl = getPDFContent("test.pdf").encode("ascii", "ignore")
    with open("test.txt", "w") as f:
        f.write(pdfl)


if __name__ == "__main__":
    main()
[/code]
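One way to narrow this down is to check, page by page, whether any text is extracted at all. The following is only a rough sketch, and it assumes Python 3 and the newer pypdf package (PdfReader, reader.pages, extract_text) rather than the legacy pyPdf module used above. The helper names normalize_whitespace, page_texts and report_empty_pages are made up for illustration.

```python
def normalize_whitespace(text):
    """Replace non-breaking spaces and collapse whitespace runs to single spaces."""
    return " ".join(text.replace("\xa0", " ").split())


def page_texts(path):
    """Return one normalized string per page (assumes the pypdf package is installed)."""
    # Imported here so the other helpers can be used without pypdf installed
    from pypdf import PdfReader

    reader = PdfReader(path)
    # Fall back to "" in case a page yields no text
    return [normalize_whitespace(page.extract_text() or "") for page in reader.pages]


def report_empty_pages(texts):
    """Return the 1-based numbers of the pages whose extracted text is empty."""
    return [number for number, text in enumerate(texts, start=1) if not text]


if __name__ == "__main__":
    # "test.pdf" is the file name from the question
    texts = page_texts("test.pdf")
    print("Pages with no extractable text:", report_empty_pages(texts))
```

If every page is reported as empty, the PDF may consist of scanned images or use fonts that the extractor cannot map to characters. In that case the problem is probably in the file rather than in the loop, and a different tool or OCR may be needed.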
Comments
Similar issue with a PDF reader using pypdf
http://stackoverflow.com/questions/15459802/pypdf-to-read-page-content/pdf-reading
I use another PDF reader instead of pypdf to read PDF files. Writing code for PDF-reading projects is too complicated for me, so I prefer manual toolkits that users can customize to their own preferences. You can search for one and give it a try. I hope you succeed. Good luck.
Best regards,
Arron