
Topic awaiting preservation: Searching the text content of PDFs with PHP? (Page 1 of 1)

 
smonkey
Paranoid (IV) Inmate

From: Northumberland, England
Insane since: Apr 2003

posted 03-16-2004 20:40

Hi,

I am looking for an ideally free method to search through the text content of PDF files using PHP. Other server-side methods would be acceptable, except for ASP, which I have no support for.

Any ideas anybody?

http://www.cryokinesis.co.uk

Cameron
Bipolar (III) Inmate

From: Brisbane
Insane since: Jan 2003

posted 03-16-2004 23:39

I'd be interested to know if this is possible too.

Can you somehow convert a PDF document into some kind of XML format?

DmS
Paranoid (IV) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 03-17-2004 10:15

I did this for the archive part of my site; it searches .doc, .xls, .ppt, .pdf and so on.
Take a look here for the discussion: http://development.gurusnetwork.com/discussion/showthread.php?s=&threadid=1941

Here is the class I used as a base for the PDF part: http://www.phpclasses.org/browse.html/package/702.html

I had serious trouble getting it to work on my Windows development box due to problems with the decompression, but on my Linux host it works perfectly.
/Dan

{cell 260}
-{ a vibration is a movement that doesn't know which way to go }-

smonkey
Paranoid (IV) Inmate

From: Northumberland, England
Insane since: Apr 2003

posted 03-17-2004 13:41

Any chance of a peek at some code, or is that a part of your 'product'? You should release a full php file search class or group of classes, sounds cool.

Thanks, still checking into everything, any other ideas are still welcome.

Cheers,

Jon


DmS
Paranoid (IV) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 03-17-2004 17:14

No prob.
The actual pdf_search class is not mine; I downloaded it from the link I gave above: http://www.phpclasses.org/browse.html/package/702.html

I did a quickie class just now that handles a very simple search in files.
If the file is a pdf it uses the class pdf_search.

As you instantiate the class, pass the following:
A valid path to the directory you want to search, including the trailing "/"
A string of words to search for, separated with spaces.
(this class searches like this: <word> OR <word> OR <word>...)
A 0 (zero) for a non-recursive directory search
I don't have time right now to handle recursive searches from the given directory.
Last, invoke the method searchInFile();
That's about it.
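For readers who cannot reach the linked files, here is a rough sketch of the kind of non-recursive OR search described above. It is not Dan's class: the function name searchDirForWords() and its parameters are made up for illustration, it is written for current PHP rather than PHP 4, and it only does a plain byte search, so it will not see text inside compressed PDF streams.

```php
<?php
// Hypothetical helper, not the class from this thread: returns the names of
// the files in $dir whose raw contents contain at least one of the
// space-separated words in $words (case-insensitive OR search).
function searchDirForWords(string $dir, string $words): array
{
    if (!is_dir($dir)) {
        return [];
    }
    $terms = preg_split('/\s+/', trim($words), -1, PREG_SPLIT_NO_EMPTY);
    $hits = [];
    foreach (scandir($dir) as $name) {
        $path = $dir . $name;          // $dir must end with "/"
        if (!is_file($path)) {
            continue;                  // non-recursive: skip ".", ".." and subdirectories
        }
        $content = file_get_contents($path);
        foreach ($terms as $term) {
            if (stripos($content, $term) !== false) {
                $hits[] = $name;
                break;                 // OR search: one matching word is enough
            }
        }
    }
    return $hits;
}
```

Called as searchDirForWords('/path/to/docs/', 'invoice report'), it would return the file names that mention either word.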

Oh, IMPORTANT!
This seems to work fine on Linux
BUT...
The pdf search part will not work on Windows, at least not with a default php 4.23 installation from phpdev. It either crashes on the "gzuncompress()" call or as it parses the regexp in the method textline()
Don't know why.

Examples are in the code.
See output: http://www.dmsproject.com/test/pdf/searchDoc.php
See code: http://www.dmsproject.com/test/pdf/searchDoc.phps

Have fun
/Dan

{cell 260}
-{ a vibration is a movement that doesn't know which way to go }-

smonkey
Paranoid (IV) Inmate

From: Northumberland, England
Insane since: Apr 2003

posted 03-19-2004 11:37

ok, well I'm having trouble getting it to work, could you be a bit more specific with how to use it, talk to me like I'm an idiot.

Also how would I go about indexing/searching .doc and .ppt/.pps files?

One final thing - the php pdf class thing, I find it hard to understand how it actually works - I know that pdfs can be just plain text with special markup, but often they are 'compressed'/'encoded' in some magic 'adobe' way that really confuses me - does the class overcome that too?

Thanks guys


DmS
Paranoid (IV) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 03-19-2004 19:49

Have another go now, I've created a simple interface (a search form) to the class that you can start from. In that code I've documented quite a lot on what I'm doing.
Some minor changes are done in the class-file

This class already searches .doc, .xls, .ppt and so on. If PHP can open the file you can search it, probably even images (although you would have trouble finding words that match the code of an image's source).

The pdf-class:
Very simplified (I don't know the inner workings of how a .pdf is actually built) explanation.

First it stores the files content in the private variable _buffer as you initialize it.
Then, as you call the method "textfound($str)", the class uses the other methods to split the document into lines. It looks for "stream" and "endstream" in the lines and stores what lies between them, then it uncompresses that stream, then it parses the stream for some chars and replaces them with other chars. The result is a line that can be matched against the search term.
If a match is found, it returns true, otherwise false.

Not sure, but I believe it returns true on the first match, so there is no way of knowing if there are more than one match in the file.
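The steps described above can be sketched roughly as follows. This is not the pdf_search class from phpclasses.org; pdfContainsText() is a made-up name, and the sketch assumes the streams are Flate-compressed and that text appears as plain literal strings in parentheses. Real PDFs often use font-specific encodings, hex strings or other filters, in which case nothing will match. Very large files may also exceed PCRE's backtrack limit, which this sketch treats as "not found".

```php
<?php
// Hypothetical helper: returns true if $needle occurs in the text of any
// zlib-compressed stream inside the raw PDF bytes in $pdf.
// Like the class described in this thread, it stops at the first match.
function pdfContainsText(string $pdf, string $needle): bool
{
    // Capture the raw bytes between each "stream" and "endstream" keyword.
    if (!preg_match_all('/stream\r?\n(.*?)\r?\nendstream/s', $pdf, $m)) {
        return false;
    }
    foreach ($m[1] as $raw) {
        $data = @gzuncompress($raw);   // false if the stream is not zlib data
        if ($data === false) {
            continue;
        }
        // Literal text operands are written in parentheses, e.g. "(Hello) Tj".
        preg_match_all('/\((.*?)\)/s', $data, $strings);
        $text = implode('', $strings[1]);
        if (stripos($text, $needle) !== false) {
            return true;
        }
    }
    return false;
}
```

Usage would be along the lines of pdfContainsText(file_get_contents('/path/to/file.pdf'), 'invoice').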

Enough talk, here are the new links to the files:
Form, see output: http://www.dmsproject.com/test/pdf/searchForm.php
Form, see code: http://www.dmsproject.com/test/pdf/searchForm.phps

Class, see code: http://www.dmsproject.com/test/pdf/searchDoc.phps


Now you must be able to get it going right?
<edit>Important: the path to the directory to be searched cannot be a URL, it has to be a filesystem path.</edit>
/Dan


{cell 260}
-{ a vibration is a movement that doesn't know which way to go }-

[This message has been edited by DmS (edited 03-19-2004).]

smonkey
Paranoid (IV) Inmate

From: Northumberland, England
Insane since: Apr 2003

posted 03-20-2004 15:36

Ok well the error I'm getting is:

quote:
Parse error: parse error in c:\phpdev\www\pdfsearch\search_docs.php(158) : regexp code on line 1

Fatal error: Failed evaluating code: chr(0²); in c:\phpdev\www\pdfsearch\search_docs.php on line 158


However I am running on windows at the moment so I'm just wondering if that is the error that happens on windows - I will be trying on my linux webhost later when I get time.

But surely that error indicates that the code is wrong in some way rather than the server?

I have renamed your 'searchDoc.php' to 'search_docs.php' just because that is my filenaming convention - I have updated the appropriate references too.

Any ideas?

***EDIT***

OK, I have tried it on my Linux web server and it works, although it never actually finds any of the search terms within the PDFs; it only finds search terms in the .doc files. I assume that is because either my server doesn't support gzuncompress() or the PDFs aren't compressed using gzip. I have found other PDF indexing utilities ( http://www.pdfbox.org/ ) and they seem to refer to many forms of decompression/decoding:

quote:
- FlateDecode
- ASCIIHexDecode
- ASCII85Decode
- LZWDecode
- PDF Documents can now be decrypted!
- RunLengthDecode
- CCITTFaxDecode
- JBIG2Decode
- DCTDecode
- Encryption


Could that also be the problem?
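One way to test that theory is to look at which /Filter names a given PDF actually declares. The sketch below is only an illustration: listPdfFilters() is a made-up name, and it scans the raw bytes with a regular expression, so it will miss anything declared inside compressed object streams.

```php
<?php
// Hypothetical helper: counts how often each /Filter name appears in the
// raw PDF bytes, e.g. ['FlateDecode' => 12, 'DCTDecode' => 3].
function listPdfFilters(string $pdf): array
{
    $counts = [];
    // Matches "/Filter /Name" as well as arrays such as "/Filter [/A /B]".
    preg_match_all('/\/Filter\s*(\[[^\]]*\]|\/[A-Za-z0-9]+)/', $pdf, $m);
    foreach ($m[1] as $spec) {
        preg_match_all('/\/([A-Za-z0-9]+)/', $spec, $names);
        foreach ($names[1] as $name) {
            $counts[$name] = ($counts[$name] ?? 0) + 1;
        }
    }
    return $counts;
}
```

If print_r(listPdfFilters(file_get_contents('/path/to/file.pdf'))) shows filters other than FlateDecode on the text streams, a gzuncompress()-based search alone will not be able to read them.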

In regards to pdfbox - it looks good but java is something I'd rather not delve into in this project.

Arg, this is painful


Thanks tho



DmS
Paranoid (IV) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 03-20-2004 21:40

The parse error, that's the exact same error I'm getting now and then on my Win-box.

The Linux thing, well, I guess it could be the .pdfs or the PHP install. I don't really know what options there are since I'm not that knowledgeable about the inner workings of a PDF.

Perhaps InI could step in here, he knows a h*ll of a lot more on these things than I do.
It was worth a try anyways
/Dan

{cell 260}
-{ a vibration is a movement that doesn't know which way to go }-

smonkey
Paranoid (IV) Inmate

From: Northumberland, England
Insane since: Apr 2003

posted 03-21-2004 12:28

Thanks Dan, what you have given works great and is something for me to build from. I appreciate your help with this, and I wonder what the deal is with PDFs.

CALLING INI, COME IN INI...

InI is my hero, he knows too much

Thanks again Dan,

Jon


smonkey
Paranoid (IV) Inmate

From: Northumberland, England
Insane since: Apr 2003

posted 03-21-2004 14:28

Ok I could have edited the last post but figured this deserved a new post since it is kinda different:
http://zeus.jtlnet.com/~conradis/pdfhi.php.txt

Basically that is a php highlighting script that generates a highlighting file for pdf or something like that - but it has the potential to parse pdf in order to do this - any ideas how to rip it apart? maybe combine it with Dan's basis

***EDIT***
OK, I have tried the highlighter script and whilst it throws no errors under Linux it also doesn't appear to do anything. I have looked through the code but can't figure it out exactly. From looking at my PDFs in a text editor too, I have an idea how it is working, but my knowledge of PHP lets me down.

From looking at my PDFs it seems that most text strings look like this:

quote:
845 0 obj
<< /Filter /FlateDecode /Length 860 0 R >>
stream
H?\?Ë?£@@÷~E-»ZuoUB OÈbLf>Àh%-tT?YäïÇxB?ÀA+?s!7]ï7û¦Lú³oËCÌ©nª>^Û[_Fs?çºIrkªº?4}??¢KƳ?ûu??}sj?ùܤ¿Æ{ס¿??eÕãk?þè«Ø×ÍÙ¼üY^Mz¸uÝg¼Äf0?Y,LOIºþVtß?K4étìm_÷ëáþ6?ù÷Äï{8Ç¥l«xí?2öEs?É< ¯??ïÆk?Ħúï~?qìx*??~zÜg?ÍyÈA+èZC3h­'Ê3håвÐâ}?÷ååBÂÅá?H wH¡ä¡% 
9
??

DmS
Paranoid (IV) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 03-21-2004 23:48

Judging from the snip you posted, from this script: http://zeus.jtlnet.com/~conradis/pdfhi.php.txt (same as you found?), and from how the script I used works, it seems that both scripts use the same basic technique to get at the contents.

The one I use parses the document looking for "stream" and "endstream" (as in your snip), then it uses gzuncompress() to uncompress the stream before it can do its work. The highlighter does the same but uses gzinflate(). If you lack support for the gz functions, it seems like neither of these will work.

/Dan


{cell 260}
-{ a vibration is a movement that doesn't know which way to go }-

smonkey
Paranoid (IV) Inmate

From: Northumberland, England
Insane since: Apr 2003

posted 03-22-2004 00:05

so ok, how do i do a proper check for gzip? is there a phpinfo() type command that will show that?

if I don't have it how can I install it?

any ideas what the difference between gzinflate and gzuncompress is? to me it sounds like gzuncompress is just a standard gz function call that lets gzip work out what method to uncompress it with, whereas the gzinflate sounds like a specific function call to use an 'inflate' method of uncompression - just my brain thoughts tho, I dunno really.


InI
Paranoid (IV) Mad Scientist

From: Somewhere over the rainbow
Insane since: Mar 2001

posted 03-22-2004 08:23

The poster has demanded we remove all his contributions, lest he take legal action.
We have done so.
Now Tyberius Prime expects him to start complaining that we removed his 'free speech' since this message will replace all of his posts, past and future.
Don't follow his example - seek real life help first.

Tyberius Prime
Paranoid (IV) Mad Scientist with Finglongers

From: Germany
Insane since: Sep 2001

posted 03-22-2004 13:37

The easiest way to check for an installed gzip lib is to see if function_exists('gzuncompress') returns true.
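A small sketch of that check, wrapped in a made-up helper called zlibSupport() so the result can be printed or reused. The output of phpinfo() should also show a "zlib" section when the extension is present.

```php
<?php
// Hypothetical helper: reports whether the zlib functions used by both
// scripts in this thread are available in the running PHP build.
function zlibSupport(): array
{
    return [
        'gzuncompress' => function_exists('gzuncompress'),
        'gzinflate'    => function_exists('gzinflate'),
        // extension_loaded() checks the extension itself rather than one function.
        'zlib'         => extension_loaded('zlib'),
    ];
}

foreach (zlibSupport() as $name => $ok) {
    echo $name, ': ', $ok ? 'available' : 'missing', "\n";
}
```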


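The difference between inflate and uncompress is that one expects a certain header, while the other just takes a compressed stream without the header: gzuncompress() expects zlib-format data (header and checksum), whereas gzinflate() takes a raw deflate stream.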
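A short sketch to make the difference concrete, assuming the zlib extension is loaded. The sample string and variable names are just for illustration; gzcompress() and gzdeflate() are the standard counterparts of the two decoding functions.

```php
<?php
$text = 'BT (Hello) Tj ET';

$zlib = gzcompress($text);  // zlib format: 2-byte header + deflate data + 4-byte checksum
$raw  = gzdeflate($text);   // raw deflate data only, no header or checksum

// Each decoder reads the format produced by its own counterpart.
$zlibRoundTrip = gzuncompress($zlib) === $text;
$rawRoundTrip  = gzinflate($raw) === $text;

// Stripping the 2-byte header and 4-byte checksum from zlib data leaves a
// raw deflate stream that gzinflate() can read.
$strippedWorks = gzinflate(substr($zlib, 2, -4)) === $text;

// gzuncompress() rejects raw deflate data because the header is missing.
$rawRejected = @gzuncompress($raw) === false;

var_dump($zlibRoundTrip, $rawRoundTrip, $strippedWorks, $rawRejected);
```

Streams marked /FlateDecode in a PDF are normally in the zlib format, which is why gzuncompress() can be applied to them directly.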
