When a file is added to Hyperion, the system determines whether the file is of a type that should be included in the content index. File Type policies configure which file types can be indexed and where the textual content is derived from.
From the List Policy window, you can create, display, modify, copy, or remove the File Type policies. Click Close to exit the wizard.
Attributes
The File Type policy contains the following attributes.
• Name
• Description
• File Type
• Full Text Retrievable
Name
This attribute uniquely identifies the file extension that is to be included in the index. The name may be up to ten characters long and may not include spaces or punctuation other than dash (-), underscore (_), and dollar sign ($). The pipe character (|) may not be used.
Name examples: PDF, TXT, HTM, and TIFF
Note: For more information about TIFF and PDF file types, see the notes on TIFF Image Files and PDF Image Files below.
Description
This attribute provides more information about the policy and its use by the library. The description may be up to 60 characters in length. Although the Description attribute may contain spaces and punctuation, the pipe character (|) cannot be used.
File Type
This 1-10 character attribute defines the type of file from which to derive the text content of the file that is to be included in the index.
Full Text Retrievable
This attribute answers the question, “Is the image file type full text retrievable?”
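The length and character limits described above can be sketched as a small validation routine. This is an illustration only, not part of Hyperion: the function name `validate_policy` is hypothetical, and the sketch assumes that a Name needs at least one character and that "no punctuation" means only ASCII letters, digits, dash, underscore, and dollar sign are accepted.

```python
import re

# Assumed reading of the Name rule: 1-10 characters drawn from ASCII letters,
# digits, dash, underscore, and dollar sign (no spaces, no other punctuation).
NAME_PATTERN = re.compile(r"[A-Za-z0-9_$-]{1,10}")


def validate_policy(name: str, description: str, file_type: str) -> list:
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if not NAME_PATTERN.fullmatch(name):
        problems.append("Name must be 1-10 characters: letters, digits, '-', '_', '$'")
    if len(description) > 60:
        problems.append("Description may be at most 60 characters")
    if "|" in description:
        problems.append("Description may not contain the pipe character")
    if not 1 <= len(file_type) <= 10:
        problems.append("File Type must be 1-10 characters")
    return problems
```

For example, `validate_policy("TIF", "TIFF image", "txt")` returns an empty list, while a Name containing a space is reported as a problem.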
Note: Regarding TIFF Image Files
When a user adds a TIF image file representing a scanned document, the user may have already run an optical character recognition (OCR) process and generated a text (TXT) file that contains the text derived from the scanned TIF page. When the user adds that TIF image to Hyperion, the associated TXT file is added with the TIF file for inclusion in the content index. The File Type attribute value determines the type of file that should be added along with the file being imported. In this example, TIF files should have a File Type of TXT. The next time the content indexing report is run (Rebuild Content Database or Add, Update Content Database), the associated TXT files are added to the content index.
Files that are not to be included in the content index do not need to be defined in this policy.
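The pairing between an image file and its text file can be pictured as a lookup keyed on the file extension. The sketch below is hypothetical and does not reflect how Hyperion stores or resolves policies: the `TEXT_SOURCE` table and the `companion_text_file` function are invented for illustration, and it assumes the text file shares the image file's base name and folder.

```python
from pathlib import Path
from typing import Optional

# Hypothetical policy table: policy Name -> File Type (where the text comes from).
# Only the TIF -> txt pairing is taken from the example in the note above.
TEXT_SOURCE = {"TIF": "txt"}


def companion_text_file(image_path: str) -> Optional[Path]:
    """Return the expected companion text file, or None if no policy applies."""
    path = Path(image_path)
    policy_name = path.suffix.lstrip(".").upper()
    text_type = TEXT_SOURCE.get(policy_name)
    if text_type is None:
        return None  # no File Type policy for this extension, so it is not indexed
    # Assumption: the companion file has the same base name as the image.
    return path.with_suffix("." + text_type)
```

With this table, `companion_text_file("scans/page001.tif")` points at `scans/page001.txt`, and a file whose extension has no policy yields `None`.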
Note: Regarding PDF Image Files
Hyperion also includes special functionality for handling the content of Adobe Acrobat PDF files. PDF files do not contain text that can be readily indexed.
Hyperion includes an extra utility that will extract the text from PDF files, generate a text file that can be indexed, and automatically associate that extracted text file with the original PDF file. If PDF files are to be included in the content index, this extraction process occurs during the building of the content index. Any new PDF file that has not already been indexed will have the text content extracted and saved along with the PDF file. Therefore, the extraction process only happens once per PDF file.
The utility for PDF text extraction works only on PDF files that are of the PDF Normal format or the “PDF Image + Hidden Text” format. See your Adobe Acrobat documentation for more information about PDF file formats.
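The "once per PDF file" behavior amounts to checking whether extracted text already exists before extracting again. The following sketch shows that pattern under stated assumptions: `ensure_extracted_text` is a hypothetical name, the `extract` argument is a stand-in for whatever extraction utility is used, and the sketch assumes the text is saved next to the PDF with a .txt extension, which may differ from how Hyperion actually stores it.

```python
from pathlib import Path
from typing import Callable


def ensure_extracted_text(pdf_path: Path, extract: Callable[[Path], str]) -> Path:
    """Return the text file for pdf_path, running extract only if none exists yet."""
    # Assumption: the extracted text is stored beside the PDF as <name>.txt.
    text_path = pdf_path.with_suffix(".txt")
    if not text_path.exists():
        # First time this PDF is seen: extract and save the text.
        text_path.write_text(extract(pdf_path), encoding="utf-8")
    return text_path
```

Calling the function a second time for the same PDF returns the saved text file without invoking `extract` again.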
The Bulkload utility and the Rebuild Content Database report also use this policy.
Image File Type policies are delivered as follows.
| Name | Description | File Type | Full Text Retrievable |
| HTM | HTML file | HTML file | Yes |
|  |  | Yes |  |
| TEXT | Text | txt | Yes |
| TIF | TIFF image | txt | Yes |
Related topics
Hyperion Configuration Wizards
© 2006-2016 Sirsi Corporation. All rights reserved.