Three Methods for Extracting Tables from a Website

Extracting a table from a website can be deceptively hard, depending on the format the table is in and how much data you are trying to transfer. Simple HTML tables tend to be a snap, but later on we’ll get into techniques for power users and those who need to extract large amounts of data.

Copy and Paste (Source)

As simple as it may seem, the easiest way to extract a table from a website is often to select the table, copy it, and paste it where you want it. This works particularly well if all you want to do is place the table into a Microsoft Word document. If you’re moving the table into another HTML page, a neat little Firefox feature can come in handy. Highlight the table you wish to transfer, then right-click and select ‘View Selection Source’. This shows the markup for just the selected table, which is all the code you need to display it.

One issue you may find is that some websites are built with technologies that do not allow copying. For example, it is very hard to extract tables from Flash-based websites. In this case, it may be sufficient to take a screenshot of the table and crop it to an appropriate size in an image editing program.

Automated Extraction

One of the great advantages of computers is that they can automate repetitive tasks. This is particularly valuable when it comes to extracting the data stored in tables, which can often be a long and boring job when done manually. Several coders have seen the need for a better way of getting at the data stored in website tables, and released products to help out.

‘WebTable’ is probably one of the oldest applications, and extracts tables from a website directly into plain-text files in either tab-separated or comma-separated (CSV) form. These text files can then be imported directly into Excel or similar spreadsheet programs.
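To give a feel for what tab-separated and comma-separated output involves, here is a minimal PHP sketch. It is not how WebTable itself works; the sample rows and the helper name rowsToDelimited are made up for illustration, and only PHP's built-in fputcsv does the real work.

```php
<?php
// Hypothetical sample data standing in for rows pulled out of an HTML table.
$rows = [
    ['Name', 'Qty'],
    ['Widget, large', '3'],
    ['Gadget', '12'],
];

// Hypothetical helper: turn an array of rows into delimited text.
function rowsToDelimited(array $rows, string $delimiter = ','): string
{
    // Write to an in-memory stream so no file is created on disk.
    $handle = fopen('php://memory', 'w+');
    foreach ($rows as $row) {
        // fputcsv adds quoting where a field needs it, for example
        // when the field contains the delimiter itself.
        fputcsv($handle, $row, $delimiter, '"', '\\');
    }
    rewind($handle);
    $text = stream_get_contents($handle);
    fclose($handle);
    return $text;
}

echo rowsToDelimited($rows);        // comma-separated (CSV)
echo rowsToDelimited($rows, "\t");  // tab-separated
```

Letting fputcsv handle the quoting is the main point: a cell such as "Widget, large" would otherwise be split into two columns when the file is opened in a spreadsheet.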

The ‘Newbie’ software suite can be used to automate practically any repetitive task. This includes, but is certainly not limited to, extracting HTML tables. Unlike WebTable it requires some technical knowledge, rather than being point-and-click, and so it might not be ideal for people who are unhappy dealing with a learning curve.

Modern websites have a habit of embedding PDF documents into their content, and tables in PDFs can be tricky to extract. Luckily, there are websites like ‘PDF to Excel’ which allow you to quickly and simply extract tables from PDF documents. Download the file you require, upload it to the website, and the results will be emailed directly to you.

Roll Your Own Extractor

This may be a little beyond the scope of this article, but if you find you need more power and flexibility when extracting tables, it might be worth writing your own code to do so. Regular expressions are a popular way to extract data, and there are several tutorials on the net which can teach you how to use them. PHP in particular can be useful when working with HTML tables, with preg_match and PHP’s Document Object Model (DOM) classes combining to provide easy access to the underlying data structures. For example, to get a reference to a table whose id attribute is ‘bob’, you simply need to use:

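// Load the page into a DOM tree (the URL is a placeholder).
$homepage = new DOMDocument();
$homepage->loadHTMLFile('https://www.etc.com/etc.htm');

// Look up the table by its id attribute; null is returned if there is no match.
$table1 = $homepage->getElementById('bob');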

From there, the only limit is your imagination and coding skill.
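As one possible next step, the sketch below walks the rows and cells of a table whose id is ‘bob’ and collects the text into a plain PHP array. The sample markup and the helper name tableToRows are assumptions made for illustration. It parses an HTML string with loadHTML rather than fetching a live page, so it can be run as-is.

```php
<?php
// Hypothetical sample markup; a real page would be fetched or loaded from a file.
$html = <<<HTML
<html><body>
<table id="bob">
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Widget</td><td>3</td></tr>
  <tr><td>Gadget</td><td>12</td></tr>
</table>
</body></html>
HTML;

// Hypothetical helper: return the table's cell text as an array of rows.
function tableToRows(string $html, string $tableId): array
{
    $doc = new DOMDocument();
    // Real-world HTML is often malformed, so keep parser warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $table = $doc->getElementById($tableId);
    if ($table === null) {
        return [];  // no element with that id
    }

    $rows = [];
    foreach ($table->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->childNodes as $node) {
            // Keep header and data cells; skip whitespace text nodes.
            if ($node instanceof DOMElement
                && in_array($node->tagName, ['th', 'td'], true)) {
                $cells[] = trim($node->textContent);
            }
        }
        $rows[] = $cells;
    }
    return $rows;
}

print_r(tableToRows($html, 'bob'));
```

Note that getElementsByTagName('tr') also picks up rows of any table nested inside the target, and merged cells (colspan or rowspan) are not expanded, so pages with more complicated layouts would need extra handling.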

Image Credits

https://commons.wikimedia.org/wiki/File:Table-sample-collapse-border-css-01.gif