Extracting text from a PDF document

You can use the DocumentText DDX element to return an XML file that contains the text of one or more PDF documents. As with the PDF element, you specify a result attribute for the DocumentText element and enclose one or more PDF source elements within its start and end tags, as the following example shows:

<?xml version="1.0" encoding="UTF-8"?> 
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"  
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
        xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd"> 
    <DocumentText result="Out1"> 
        <PDF source="doc1"/> 
    </DocumentText> 
</DDX>

The following code shows the CFM page that calls the DDX file. Instead of writing the output to a PDF file, you specify an XML file for the output:

<cfif IsDDX("documentText.ddx")> 
    <cfset ddxfile = ExpandPath("documentText.ddx")> 
    <cfset sourcefile1 = ExpandPath("book1.pdf")> 
    <cfset destinationfile = ExpandPath("textDoc.xml")> 
 
    <cffile action="read" variable="myVar" file="#ddxfile#"/> 
 
    <cfset inputStruct=StructNew()> 
    <cfset inputStruct.Doc1="#sourcefile1#"> 
 
    <cfset outputStruct=StructNew()> 
    <cfset outputStruct.Out1="#destinationfile#"> 
 
    <cfpdf action="processddx" ddxfile="#myVar#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar"> 
 
    <!--- Use the cfdump tag to verify that the PDF files were processed successfully. ---> 
    <cfdump var="#ddxVar#"> 
</cfif>

The output XML file conforms to the schema specified in doctext.xsd. For more information, see http://ns.adobe.com/DDX/DocText/1.0
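
To inspect the extracted text from CFML, one option is to parse the output file with the XmlParse function and dump the result. The following is a minimal sketch, assuming that the page above has already written textDoc.xml; it does not rely on any particular element names from the doctext.xsd schema, and the variable names xmlText and textDoc are illustrative:

<cfset destinationfile = ExpandPath("textDoc.xml")> 
<cfif FileExists(destinationfile)> 
    <!--- Read the generated XML file into a string. ---> 
    <cffile action="read" file="#destinationfile#" variable="xmlText" charset="utf-8"> 
 
    <!--- Parse the string into an XML document object and display it. ---> 
    <cfset textDoc = XmlParse(xmlText)> 
    <cfdump var="#textDoc#"> 
</cfif>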

When you specify more than one source document, ColdFusion aggregates the text from the specified pages into one output file. The following example shows the DDX code for combining the text from a subset of pages in two documents into one output file:

<?xml version="1.0" encoding="UTF-8"?> 
<DDX xmlns="http://ns.adobe.com/DDX/1.0/"  
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:schemaLocation="http://ns.adobe.com/DDX/1.0/ coldfusion_ddx.xsd"> 
    <DocumentText result="Out1"> 
        <PDF source="doc1" pages="1-10"/> 
        <PDF source="doc2" pages="3-5"/> 
    </DocumentText> 
</DDX>
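
The CFM page that calls this two-document DDX file might look like the following sketch. It mirrors the single-document page shown earlier. The file names documentText2.ddx, book2.pdf, and combinedText.xml are hypothetical, and the structure keys Doc1, Doc2, and Out1 are assumed to match the source and result attributes in the DDX file:

<cfif IsDDX("documentText2.ddx")> 
    <cfset ddxfile = ExpandPath("documentText2.ddx")> 
    <cffile action="read" variable="myVar" file="#ddxfile#"/> 
 
    <!--- Add one key for each PDF source attribute in the DDX file. ---> 
    <cfset inputStruct=StructNew()> 
    <cfset inputStruct.Doc1=ExpandPath("book1.pdf")> 
    <cfset inputStruct.Doc2=ExpandPath("book2.pdf")> 
 
    <!--- The key matches the result attribute of the DocumentText element. ---> 
    <cfset outputStruct=StructNew()> 
    <cfset outputStruct.Out1=ExpandPath("combinedText.xml")> 
 
    <cfpdf action="processddx" ddxfile="#myVar#" inputfiles="#inputStruct#" outputfiles="#outputStruct#" name="ddxVar"> 
    <cfdump var="#ddxVar#"> 
</cfif>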