Jul 16 / Garejoor

Web-Harvest


Web-Harvest is an open-source, Java-based web extraction tool. It uses XPath, XSLT and XQuery to do the job. For example, you can fetch a page from a URL, convert its HTML to XML, and then use XPath to extract specific elements from the page. With XQuery you can apply some logic to the data, such as sorting, and then save the results to a file. Web-Harvest has some built-in features, such as zipping the extracted data or saving it to a database. Since Web-Harvest is open source, you can also customize it to do almost anything you want. You can run it as a standalone JAR with a GUI, run it from the command line (terminal), or include it in your Java project as a library.

Prerequisites:

It’s important to know XPath before reading these examples.

A Java usage example can be found here.

Example 1:

<xpath expression="//a[@shape='rect']/@href">
  <html-to-xml>
    <http url="http://www.somesite.com/"/>
  </html-to-xml>
</xpath>

The http tag gets the HTML page from the given URL. html-to-xml converts the HTML to XML, or more specifically to XHTML. xpath then evaluates its expression against the resulting XML and returns the matching data.
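The examples in this post show only the tags of interest. Below is a minimal sketch of how Example 1 might look as a complete configuration file, assuming the config root element used in Examples 6 and 7. The variable names page and links and the output path test/links.txt are made up for illustration; var-def, var and file are covered in the examples that follow.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
    <!-- Download the page once and keep the cleaned-up XHTML in a variable -->
    <var-def name="page">
        <html-to-xml>
            <http url="http://www.somesite.com/"/>
        </html-to-xml>
    </var-def>

    <!-- Run the XPath expression against the stored page -->
    <var-def name="links">
        <xpath expression="//a[@shape='rect']/@href">
            <var name="page"/>
        </xpath>
    </var-def>

    <!-- Write the extracted href values to a (hypothetical) output file -->
    <file action="write" path="test/links.txt">
        <var name="links"/>
    </file>
</config>
```

Keeping the downloaded page in a variable means further xpath expressions can be run against it without fetching the URL again.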

Example 2:

	<file action="write" path="test/numbers.txt">
		<while condition="true" index="i" maxloops="9">
			<template> ${i} </template>
		</while>
	</file>

Opened in Notepad, numbers.txt reads 123456789; in Notepad++ each number appears on its own line. The while loop runs its body 9 times (maxloops="9"), with the index i counting from 1 to 9. On each pass the body uses template to evaluate the variable i, so the loop returns a list of numbers to the file tag. The template tag evaluates any script written in the ${…} format; here the variable i is evaluated and added to the result of the while loop. If we wrap the loop in an empty tag, like so:

	<file action="write" path="test/numbers.txt">
		<empty>
			<while condition="true" index="i" maxloops="9">
				<template> ${i} </template>
			</while>
		</empty>
	</file>

Then the output would be empty, since nothing is returned to the file tag from its body.

Example 3:

	<file action="write" path="test/numbers.txt">
		<empty>
			<var-def name="digitList">
				<while condition="true" index="i" maxloops="9">
					<var-def name="digit${i}">
						<template> ${i} </template>
					</var-def>
				</while>
			</var-def>
		</empty>

		<template>
			The variable digitList value is:
			${digitList}
			${sys.lf} ${sys.cr}
			And each digit# variable have values:
			${sys.lf} ${sys.cr}
			digit3 = ${digit3} ${sys.lf} ${sys.cr}
			digit6 = ${digit6} ${sys.lf} ${sys.cr}
			digit1 = ${digit1} ${sys.lf} ${sys.cr}
		</template>

	</file>

As discussed earlier, the empty tag doesn’t return anything. In the last example it returned nothing to its parent, the file tag; here things are a bit different. Since the scripting language is procedural, the empty tag is executed first, then the template. Without the empty tag, the result of var-def would also be returned to the file tag and written out. With it, nothing from that block reaches the file; however, the variables defined inside the empty tag are still stored in the variable context. So, using the ${…} format, the template can display them: digitList, digit1, digit2 and so on. As you can tell, var-def defines a variable by name, and its body is the value of that variable. Inside the template tag there are also some predefined variables under the name sys: sys.lf is the line feed character (\n) and sys.cr is the carriage return character (\r). I’m using both because I open the output file numbers.txt in Notepad, and Notepad doesn’t recognize a bare \n.
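One detail worth checking in your own output: Windows line endings are conventionally a carriage return immediately followed by a line feed, whereas Example 3 emits them in the opposite order with a space in between. Below is a minimal sketch of the conventional order, assuming sys.cr and sys.lf behave as described above; the file name test/lines.txt is made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <file action="write" path="test/lines.txt">
        <!-- \r followed directly by \n, with no whitespace between them -->
        <template>first line${sys.cr}${sys.lf}second line</template>
    </file>
</config>
```

Because template keeps the whitespace in its body, the text here is written on a single line so that no extra spaces or newlines end up in the file.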

Example 4:

<var-def name="seachEngine">
	google
</var-def>

<var-def name="${seachEngine}Content">
	<http url="http://www.${seachEngine}.com"/>
</var-def>

<file action="write" path="test/${seachEngine}_content.html">
	<var name="${seachEngine}Content" />
</file>

This example shows variables and their usage in more detail. The var tag retrieves the value of the variable with the given name. Note why var is used here: the variable’s name is itself built from another variable (${seachEngine}Content evaluates to googleContent). If we wanted to use the ${…} format instead of var to get its value, we’d have to write:

${${seachEngine}Content}

which the parser doesn’t recognize as a valid expression. Thus we need to use the var tag here.
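For contrast, when the variable name is fixed rather than computed, the ${…} format works without any nesting. This is a minimal sketch under that assumption, with the name googleContent hard-coded:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!-- The variable name is a literal, not built from another variable -->
    <var-def name="googleContent">
        <http url="http://www.google.com"/>
    </var-def>

    <file action="write" path="test/google_content.html">
        <!-- A single, un-nested ${...} expression is enough here -->
        <template>${googleContent}</template>
    </file>
</config>
```

The var tag is only needed once the name itself has to be assembled at run time, as in Example 4.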

Example 5:

<loop item="link" index="i" filter="unique">
    <list>
        <xpath expression="//img/@src">
            <html-to-xml>
                <http url="http://www.yahoo.com"/>
            </html-to-xml>
        </xpath>
    </list>

    <body>
        <file action="write" type="binary" path="tests/${i}.gif">
            <!-- <http url="${sys.fullUrl('www.yahoo.com', link)}"/> -->
            <http url="${link}"/>
        </file>
    </body>
</loop>

In this example we fetch all the images on yahoo.com using a loop. The loop has two sections, a list and a body. The list tag supplies the items, and the body tag is executed once for each of them. It is like a foreach loop combined with fetching the collection being iterated over. The variable link holds the current item; its name is set by the loop’s item attribute, that is item="link". The index attribute names the loop counter, i, which is used here to build the file names.
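If the page uses relative image paths, ${link} alone is not a complete URL. The commented-out line in Example 5 hints at sys.fullUrl for this case. Below is a sketch of that variant, assuming sys.fullUrl takes the page URL and the link and returns an absolute URL; the variable name pageUrl is introduced here for illustration, and toString() is called on both arguments as a precaution.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <var-def name="pageUrl">http://www.yahoo.com</var-def>

    <loop item="link" index="i" filter="unique">
        <list>
            <xpath expression="//img/@src">
                <html-to-xml>
                    <http url="${pageUrl}"/>
                </html-to-xml>
            </xpath>
        </list>

        <body>
            <file action="write" type="binary" path="tests/${i}.gif">
                <!-- Resolve a possibly relative src against the page URL -->
                <http url="${sys.fullUrl(pageUrl.toString(), link.toString())}"/>
            </file>
        </body>
    </loop>
</config>
```

Note that the base URL includes the http:// scheme, which the commented-out line in Example 5 leaves out.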

Example 6:

<?xml version="1.0" encoding="UTF-8"?>

<config charset="ISO-8859-1">

    <include path="functions.xml"/>

    <!-- collects all tables for individual products -->
    <var-def name="products">
        <call name="download-multipage-list">
            <call-param name="pageUrl">http://shopping.yahoo.com/s:Digital%20Cameras:4168-Brand=Canon:browsename=Canon%20Digital%20Cameras:refspaceid=96303108;_ylt=AnHw0Qy0K6smBU.hHvYhlUO8cDMB;_ylu=X3oDMTBrcDE0a28wBF9zAzk2MzAzMTA4BHNlYwNibmF2</call-param>
            <call-param name="nextXPath">//a[starts-with(., 'Next')]/@href</call-param>
            <call-param name="itemXPath">//li[@class="hproduct" or @class="hproduct first" or @class="hproduct last"]</call-param>
            <call-param name="maxloops">30</call-param>
        </call>
    </var-def>

    <!-- iterates over all collected products and extract desired data -->
    <file action="write" path="canon/catalog.html" charset="UTF-8">
        <![CDATA[ <html><head></head><body> ]]>
        <![CDATA[ <table> ]]>

        <loop item="item" index="i">
            <list><var name="products"/></list>
            <body>
                <xquery>
                    <xq-param name="item" type="node()"><var name="item"/></xq-param>
                    <xq-expression><![CDATA[
                            declare variable $item as node() external;

                            let $name := data($item//*[@class='title'])
                            let $desc := data($item//*[@class='desc'])
                            let $price := data($item//*[@class='price'])
                            let $imgloc := $item//img
                                return
                                    <tr>
                                        <td>{$imgloc}</td>
                                        <td>{normalize-space($name)}</td>
                                        <td>{normalize-space($desc)}</td>
                                        <td>{normalize-space($price)}</td>
                                    </tr>
                    ]]></xq-expression>
                </xquery>
            </body>
        </loop>
        <![CDATA[ </table></body></html> ]]>
    </file>

</config>
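
Example 6 depends on functions.xml, which is not shown in this post. The sketch below is one way download-multipage-list could be written. It is reconstructed from memory of the sample configurations that ship with Web-Harvest, so treat every detail as an assumption rather than the actual file. It relies on the function and return tags, which do not appear in the other examples, and on the parameter names passed in by the call tag above.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<config>
    <!--
        Sketch: download a multi-page list of items.
        pageUrl   - URL of the starting page
        itemXPath - XPath expression selecting one item in the list
        nextXPath - XPath expression selecting the URL of the next page
        maxloops  - maximum number of pages to download
    -->
    <function name="download-multipage-list">
        <return>
            <!-- Keep going while there is a next page, up to maxloops pages -->
            <while condition="${pageUrl.toString().length() != 0}" maxloops="${maxloops}" index="i">
                <empty>
                    <var-def name="content">
                        <html-to-xml>
                            <http url="${pageUrl}"/>
                        </html-to-xml>
                    </var-def>

                    <var-def name="nextLinkUrl">
                        <xpath expression="${nextXPath}">
                            <var name="content"/>
                        </xpath>
                    </var-def>

                    <!-- Becomes empty when no "Next" link is found, ending the loop -->
                    <var-def name="pageUrl">
                        <template>${sys.fullUrl(pageUrl.toString(), nextLinkUrl.toString())}</template>
                    </var-def>
                </empty>

                <!-- Only the items of the current page are returned from each pass -->
                <xpath expression="${itemXPath}">
                    <var name="content"/>
                </xpath>
            </while>
        </return>
    </function>
</config>
```

The empty tag plays the same role as in Example 3: the bookkeeping variables are defined without contributing to the result, so the function returns only the matched items.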

Example 7:

<?xml version="1.0" encoding="UTF-8"?>

<config charset="ISO-8859-1">

    <!-- sends post request with needed login information -->
    <http method="post" url="http://www.nytimes.com/auth/login">
        <http-param name="is_continue">true</http-param>
        <http-param name="URI">http://</http-param>
        <http-param name="OQ"></http-param>
        <http-param name="OP"></http-param>
        <http-param name="USERID">web-harvest</http-param>
        <http-param name="PASSWORD">web-harvest</http-param>
    </http>

    <var-def name="startUrl">http://www.nytimes.com/pages/todayspaper/index.html</var-def>

    <file action="write" path="nytimes/nytimes${sys.date()}.xml" charset="UTF-8">
        <template>
            <![CDATA[ <newyourk_times date="${sys.datetime("dd.MM.yyyy")}"> ]]>
        </template>

        <loop item="articleUrl" index="i">
            <!-- collects URLs of all articles from the front page -->
            <list>
                <xpath expression="//div[@class='story clearfix ']/h5/a[1]/@href">
                    <html-to-xml>
                        <http url="${startUrl}"/>
                    </html-to-xml>
                </xpath>
                <xpath expression="//div[@class='story' or @class='story headline']/a[1]/@href">
                    <html-to-xml>
                        <http url="${startUrl}"/>
                    </html-to-xml>
                </xpath>
            </list>

            <!-- downloads each article and extract data from it -->
            <body>
                <xquery>
                    <xq-param name="doc">
                        	<html-to-xml>
                            	<http url="${articleUrl}"/>
                        	</html-to-xml>
                    </xq-param>
                    <xq-expression>
                    	<![CDATA[
                        declare variable $doc as node() external;

                        let $author := data($doc//*[@class="byline"])
                        let $title := data($doc//*[@class="articleHeadline"])
                        let $text := data($doc//div[@class="articleBody"])
                            return
                                <article>
                                    <title>{data($title)}</title>
                                    <author>{data($author)}</author>
                                    <text>{data($text)}</text>
                                </article>
                    ]]>
                    </xq-expression>
                </xquery>
            </body>
        </loop>

        <![CDATA[ </newyourk_times> ]]>
    </file>

</config>