ColdFusion 9.0 Resources
Extracting text from a PDF document

You can use the DocumentText DDX element to return an XML file that contains the text in one or more PDF documents. As with the PDF element, you specify a result attribute for the DocumentText element and enclose one or more PDF source elements within the start and end tags, as the following example shows:

    <?xml version="1.0" encoding="UTF-8"?>
    <DDX xmlns="http://ns.adobe.com/DDX/1.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
        <DocumentText result="Out1">
            <PDF source="doc1"/>
        </DocumentText>
    </DDX>

The following code shows the CFM page that calls the DDX file. Instead of writing the output to a PDF file, you specify an XML file for the output:

    <cfif IsDDX("documentText.ddx")>
        <cfset ddxfile = ExpandPath("documentText.ddx")>
        <cfset sourcefile1 = ExpandPath("book1.pdf")>
        <cfset destinationfile = ExpandPath("textDoc.xml")>
        <cffile action="read" variable="myVar" file="#ddxfile#"/>
        <cfset inputStruct=StructNew()>
        <cfset inputStruct.Doc1="#sourcefile1#">
        <cfset outputStruct=StructNew()>
        <cfset outputStruct.Out1="#destinationfile#">
        <cfpdf action="processddx" ddxfile="#myVar#" inputfiles="#inputStruct#"
            outputfiles="#outputStruct#" name="ddxVar">
        <!--- Use the cfdump tag to verify that the PDF files processed successfully. --->
        <cfdump var="#ddxVar#">
    </cfif>

The XML file conforms to a schema specified in doctext.xsd. For more information, see http://ns.adobe.com/DDX/DocText/1.0

When you specify more than one source document, ColdFusion aggregates the text of the selected pages into one XML file.
The following example shows the DDX code for combining a subset of pages from two documents into one output file:

    <?xml version="1.0" encoding="UTF-8"?>
    <DDX xmlns="http://ns.adobe.com/DDX/1.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd">
        <DocumentText result="Out1">
            <PDF source="doc1" pages="1-10"/>
            <PDF source="doc2" pages="3-5"/>
        </DocumentText>
    </DDX>
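A CFM page that calls the two-source DDX file could look like the following sketch. It follows the same pattern as the single-source page, with a second entry in the input structure. The file names combineText.ddx, book1.pdf, book2.pdf, and combinedText.xml are assumptions chosen for illustration. The final step, which reads the XML output back with cffile and XmlParse, is an optional addition for inspecting the result and is not part of the original procedure.

```cfm
<!--- Sketch only: all file names below are hypothetical. --->
<cfif IsDDX("combineText.ddx")>
    <cfset ddxfile = ExpandPath("combineText.ddx")>
    <cfset destinationfile = ExpandPath("combinedText.xml")>
    <cffile action="read" variable="myVar" file="#ddxfile#"/>

    <!--- One key per PDF source name used in the DDX (doc1, doc2). --->
    <cfset inputStruct = StructNew()>
    <cfset inputStruct.Doc1 = ExpandPath("book1.pdf")>
    <cfset inputStruct.Doc2 = ExpandPath("book2.pdf")>

    <!--- One key per result name used in the DDX (Out1). --->
    <cfset outputStruct = StructNew()>
    <cfset outputStruct.Out1 = destinationfile>

    <cfpdf action="processddx" ddxfile="#myVar#" inputfiles="#inputStruct#"
        outputfiles="#outputStruct#" name="ddxVar">
    <!--- Check the processing status reported for each result. --->
    <cfdump var="#ddxVar#">

    <!--- Optional: parse the generated XML to inspect the extracted text. --->
    <cfif FileExists(destinationfile)>
        <cffile action="read" file="#destinationfile#" variable="xmlText" charset="utf-8">
        <cfset textDoc = XmlParse(xmlText)>
        <cfdump var="#textDoc#">
    </cfif>
</cfif>
```

The page ranges in the DDX file assume that the first document has at least 10 pages and the second at least 5. Adjust the pages attributes to match your source files.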