Feeding a Query
How to retrieve news and save it into a query object
Cláudio Alexandre da Costa Dias
Introduction
A large number of sites now publish news on all kinds of subjects. This news is
usually available in HTML and, sometimes, in XML (RDF/RSS) format as well, to be
read with dedicated readers.
When the news is in RDF/RSS format (RDF is a W3C standard, and RSS 1.0 builds on
it), it's fairly simple to write ColdFusion code that handles the XML and
creates a query object for later display.
However, if the news is available only in HTML, some extra effort is needed to
extract the data and arrange it into a query object.
To fully understand this tutorial, you should know the RDF/RSS formats, be
comfortable with regular expressions, and be familiar with the ColdFusion MX
XML functions.
Creating a Query Object from RDF/RSS data
First, let's look at the simplest case: retrieving news from an RDF/RSS
document.
We will use Ben Forta's blog page in our examples:
http://www.forta.com/blog

Figure 1 : Ben Forta's Blog
As we can see, there are many news items on the right side of the screen. These
items are also available in RDF/RSS format, through
http://www.forta.com/blog/rss.cfm?mode=full.
This link gives us:

Figure 2 : RSS feed from the Blog's page
In other words, an XML document like any other. A complete description of the
RDF standard, on which RSS 1.0 is based, can be found at http://www.w3.org/RDF.
Let's build the query object from this XML document. The CFML code of all the
templates used can be found at the bottom of this tutorial.
To write CFML code that does this job, we'll follow these steps:
Retrieving XML
We use the <cfhttp> tag to retrieve the XML news document:
<cfhttp url="http://www.forta.com/blog/rss.cfm?mode=full" method="GET">
The result of this request is stored in the cfhttp.fileContent variable, which
therefore contains the news as an XML string. Outputting this variable would
give something like Figure 2.
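Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch and not part of the original templates; the error message is an illustrative assumption.

```cfm
<cfhttp url="http://www.forta.com/blog/rss.cfm?mode=full" method="GET">
<!--- cfhttp.statusCode holds the HTTP status as text, e.g. "200 OK" --->
<cfif left(cfhttp.statusCode, 3) NEQ "200">
<!--- Illustrative message only; stop here so nothing invalid gets parsed --->
<cfoutput>Could not retrieve the feed: #cfhttp.statusCode#</cfoutput>
<cfabort>
</cfif>
```

With this check in place, a failed request is reported as such instead of surfacing later as "Invalid RDF/RSS".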
Converting an XML string to an XML object
ColdFusion MX introduces a new data type: the XML object. Using the XMLparse()
function, we can convert an XML string into an XML object.
<cftry>
<cfset xDoc = XMLparse(cfhttp.fileContent)>
<cfcatch>
Invalid RDF/RSS !
<cfabort>
</cfcatch>
</cftry>
We use a try...catch block to guard against malformed XML. Dumping the xDoc
variable:

Figure 3 : xDoc XML object view
Identifying RSS version and searching for items
Once we have our XML object, xDoc, let's identify which RSS version it follows.
To do this, we use the name of the XML root element: xDoc.XmlRoot.XmlName.
To retrieve the items (the news entries themselves) from the XML object, we use
the XMLsearch() function. It takes an XPath expression, searches an XML
document with it, and returns an array of the XML object nodes that match the
search criteria.
<cfswitch expression="#xDoc.XmlRoot.XmlName#">
<cfcase value="rdf:RDF"><!--- Version 1.x --->
<cfset arrItems = XMLSearch(xDoc, '/rdf:RDF/:item')>
</cfcase>
<cfcase value="rss"><!--- Version 0.9x and 2.0 --->
<cfset arrItems = XMLSearch(xDoc, '/rss/channel/item')>
</cfcase>
<cfdefaultcase><!--- Unknown root element: leave the item list empty --->
<cfset arrItems = arrayNew(1)>
</cfdefaultcase>
</cfswitch>
Each array element contains an <item></item> XML object node, which in turn
contains the title, description and link elements. We can see the arrItems
array in the next figure:

Figure 4 : arrItems array view
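As an alternative to branching on the root element, a single namespace-agnostic XPath expression may cover both cases. This is only a sketch: it assumes the XPath engine behind XMLSearch() supports the standard XPath 1.0 local-name() function, which is worth verifying on your ColdFusion version.

```cfm
<!--- Assumption: local-name() is supported by the XPath engine in use. --->
<!--- It compares element names without their namespace prefix, so it    --->
<!--- matches <item> elements in both RSS 1.x (RDF) and <rss> documents. --->
<cfset arrItems = XMLSearch(xDoc, "//*[local-name()='item']")>
```

The trade-off is that this searches the whole document rather than a fixed path, so it would also pick up any nested element that happens to be named item.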
Creating query object
Now the news items are held in the elements of the arrItems array. First, we
create the query object, q_rss:
<cfset q_rss = queryNew("title, link, description")>
Looping over the array elements, we read, for each item, the text inside its
title, description and link elements.
<cfset n = arrayLen(arrItems)>
<!--- Loop over found items, populating query object --->
<cfloop index="i" from="1" to="#n#">
<cfset queryAddRow(q_rss)>
<cfset querySetCell(q_rss, "title", arrItems[i].title.xmlText,i)>
<cfset querySetCell(q_rss, "link", arrItems[i].link.xmlText,i)>
<cfset querySetCell(q_rss, "description", arrItems[i].description.xmlText,i)>
</cfloop>
Then, dumping q_rss:

Figure 5 : q_rss query final display
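Once q_rss is populated, it can be displayed like any other query. The markup below is a minimal sketch of one possible layout, not taken from the original templates; it assumes the description may contain HTML and therefore outputs it unescaped.

```cfm
<!--- Loop over the q_rss rows and render each news item as a linked title --->
<cfoutput query="q_rss">
<p>
<a href="#q_rss.link#">#htmlEditFormat(q_rss.title)#</a><br>
<!--- Assumption: description is trusted markup, so it is not escaped --->
#q_rss.description#
</p>
</cfoutput>
```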
Creating a Query Object from an HTML news page
As we have seen, it's fairly simple to create a query object from an RDF/RSS
document. However, what if no RDF/RSS document is available? In other words,
what if the news is available only in HTML format?
The steps we'll follow are essentially the same. But, as we don't have an XML
object, we can't use the XMLsearch() function to retrieve the items, so we have
to search for them with another tool. How about regular expressions? They are
quite helpful when searching for patterns.
Let's start working:
Retrieving HTML
We use the <cfhttp> tag to retrieve the HTML news page:
<cfhttp url="http://www.forta.com/blog" method="GET">
The result of this request is stored in the cfhttp.fileContent variable, which
therefore contains the news as an HTML string. This string is then stored in
the sDoc variable.
<cfset sDoc = cfhttp.fileContent>
Creating Regular Expression
The hardest part of our job is to build a regular expression that matches the
HTML of Ben Forta's news text. We highly recommend using a regular expression
tester, which lets you test expressions as you write them.
At the bottom of this tutorial, an HTML application, REtest.htm, is given. It
will help you when creating regular expressions.
Using it, we arrive at the following regular expression:
<cfset regExp = '<font color="336633"><b>([\s\S]*?)</b></font>[\s\S]*?
<font size="-1">([\s\S]*?)</font>[\s\S]*?
<a href="(index\.cfm\?mode=e&entry=[0-9]*?)">'>
The subexpressions (the terms inside parentheses) capture the title,
description and link of each item. Note that the tester (Figure 6) offers links
to the subexpressions as well as to the next occurrences of the search pattern.

Figure 6 : REtest.htm
Creating query object
First, we create the query object, q_rss:
<cfset q_rss = queryNew("title, link, description")>
We now use the REfindNoCase() function to search the sDoc text for the regular
expression specified before. Note that the function call is nested in a loop,
which tests the function's return value through the start variable.
As seen before, the subexpressions title, description and link appear in this
order. Therefore, they correspond to positions 2, 3 and 4 of the pos and len
arrays (position 1 holds the whole match). These arrays are keys of the
stResult structure returned by the REfindNoCase() function.
<cfset start = 1>
<!--- Keep searching until REfindNoCase() reports no further match --->
<cfloop condition="start GT 0">
<cfset stResult = REfindNoCase(regExp,sDoc,start,"Yes")>
<cfif stResult.pos[1]>
<!--- Positions 2, 3 and 4 hold title, description and link --->
<cfset queryAddRow(q_rss)>
<cfset querySetCell(q_rss,"title",mid(sDoc,stResult.pos[2],stResult.len[2]))>
<cfset querySetCell(q_rss,"link",mid(sDoc,stResult.pos[4],stResult.len[4]))>
<cfset querySetCell(q_rss,"description",
mid(sDoc,stResult.pos[3],stResult.len[3]))>
</cfif>
<!--- Without a match, pos[1] and len[1] are both 0, so the loop ends --->
<cfset start = stResult.pos[1] + stResult.len[1]>
</cfloop>
Checking results:

Figure 7 : q_rss query final display
This is the same result we obtained from the RSS feed.
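To see how the pos and len arrays line up with subexpressions, a small standalone check can be useful. The pattern and sample string below are made up for illustration and have nothing to do with the blog's actual HTML.

```cfm
<!--- Hypothetical sample text and pattern, for illustration only --->
<cfset sample = "x <b>Hello</b> y">
<cfset st = REFindNoCase("<b>(.*?)</b>", sample, 1, "Yes")>
<!--- st.pos[1]/st.len[1] describe the whole match: <b>Hello</b> --->
<!--- st.pos[2]/st.len[2] describe the first subexpression      --->
<cfoutput>#mid(sample, st.pos[2], st.len[2])#</cfoutput>
<!--- Outputs: Hello --->
```

The same indexing applies to the three subexpressions of the news pattern, which is why title, description and link are read from positions 2, 3 and 4.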
CFML coding
rss2query.cfm
<!--- Retrieve RSS data from Ben Forta's blog--->
<cfhttp url=&qu