Oracle and Microsoft Products
Oracle and Microsoft products can interchange data. The traditional expectation is that any data or information transferred to and from Microsoft Word to a database would involve the use of Access. In the past, the only way to achieve an export to Oracle would be to use Access as an intermediate database.
These days software is more sophisticated, and the Oracle database and utilities are capable of importing information from Microsoft Word into Oracle database tables. This ability means development and automation of tasks is easier, depending on the method chosen. Here, we review a few different ways that Oracle communicates with Microsoft products and can embed information from Word documents, using standard Oracle functions, third-party products and utilities of various types.
In this method, you first read whole Word documents into Oracle and then filter out the information we wish to extract. This will involve storing the Word document data as a CLOB or BLOB field in the database alongside the filename. The extraction of data then takes place by converting the binary field into a varchar field. Hopefully, this will result in useable ASCII data returned for which a substring is searchable. However, as Word documents are stored in binary, it is not always easy to extract the exact information required without a lot of fine-tuning.
This is a utility provided by Oracle to enable the reading of data from binary documents such as Word, into an Oracle database. Article Text is a text manipulation tool containing several functions and applications to query and classify documents based on text based data. It includes Word search and indexing as well as viewing functions for text.
Oracle Text is not just about manipulating text, as it can also classify documents with its documents classification application included. The text search function allows users to search for text using several search terms if so desired. Text is usually contained within a number of documents, as opposed to a single document. This makes searching capability more flexible as it can search several documents in a single instance. Oracle Text can also search through several different types of documents such as plain format, HTML, Microsoft Word or XML. This means that websites and other types of documents that are not purely in plain text are searchable.
The search functionality includes a document classification application in that it can classify documents as well as search information within those documents. The documentation classification application works based on the content type, such as sport or the latest news. This is regardless of the format the data is in, such as HTML, or XML. The idea is that certain types of categories are indexable for classifying and searching. Now, however, Oracle Text long support classifications for plain text, HTML and XML. The document classification function also works in different languages, such as French. Related words help to classify documents even if such a word does not exist in the document. For example, support could return classifications of football, tennis and Ruby. This is possible even though the word "sport" may not actually exist in these documents.
The advantage of using Oracle Text for populating an Oracle database table with the data is that the functions contained within the utility write directly to Oracle tables. However, it is not easy to use for the novice Oracle user, and study of the Oracle documentation is important.
Convert Word Document to Text
If you are not concerned about Microsoft Word formatting to be stored in the database and simply need the text unformatted, all that needs to happen is data stored in Word could be saved as a text file. The process then would be to read the data into an Oracle table in the standard fashion, by reading it directly in using oracle commands, or using the SQL Loader utility.
Object database Connectivity (ODBC) has been around for a while and is still widely used today. It allows data stored in one data format to be made immediately available in a different format. This means that normally incompatible data and applications can work with each other with ease. For instance, it is possible to have a direct link between Oracle database and Access, and Word documents and Oracle. The challenge would be how Oracle reads the contents of a Word document. Possibly, this would be achieved in a similar fashion to the Oracle package method only this time the data would be streamed directly into Oracle rather than read into a BLOB (binary field) directly.
By far, the easiest way to get data into an oracle database from a Word document would simply be to copy and paste text values from a Word document into an Oracle database field. However, this would normally be an automated process, and it would be difficult to automate without a tool to handle the contents of the clipboard.
- Image Credit: Peter Kaminski from San Francisco, California, USA, http://commons.wikimedia.org/wiki/File:Oracle_headquarters.jpg
- Oracle, White paper on Oracle text: http://www.oracle.com/technetwork/database/enterprise-edition/11goracletexttwp-133192.pdf