Woofy's download algorithm
Woofy's download algorithm is pretty straightforward: it visits a comic's web pages one by one, from the current strip to the first one, and it downloads the strips from each page.
When visiting a page, it searches the page source for a link to the comic strip and a link to the next page to visit (that is, the page of the previous strip). If either of those isn't found, it assumes that it has reached the first strip, so it stops and reports that the download has finished.
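The loop described above can be sketched roughly as follows. This is only an illustration in Python, not Woofy's actual code: the function name `walk_comic`, the `fetch` callable, the `example.com` pages and the `visited` guard against link loops are all assumptions made for the example, and so is the choice to keep the strip found on the last page before stopping.

```python
import re
from urllib.parse import urljoin

# Python's closest equivalents of IgnoreCase, Multiline and IgnorePatternWhitespace.
FLAGS = re.IGNORECASE | re.MULTILINE | re.VERBOSE

def walk_comic(start_url, comic_regex, back_regex, fetch):
    """Walk from the newest page back to the first one, collecting strip URLs.

    `fetch` is a hypothetical callable that maps a page URL to its source text.
    """
    comic_re = re.compile(comic_regex, FLAGS)
    back_re = re.compile(back_regex, FLAGS)
    strips, visited, url = [], set(), start_url
    while url not in visited:          # assumption: guard against link loops
        visited.add(url)
        page = fetch(url)
        strip = comic_re.search(page)
        back = back_re.search(page)
        if strip:
            strips.append(urljoin(url, strip.group(0)))
        if not strip or not back:      # either link missing: treat as the first strip
            break
        url = urljoin(url, back.group(0))
    return strips

# Made-up pages standing in for a real comic site.
PAGES = {
    "http://example.com/3.html": '<img src="/comics/3.png"><a href="/2.html">back</a>',
    "http://example.com/2.html": '<img src="/comics/2.png"><a href="/1.html">back</a>',
    "http://example.com/1.html": '<img src="/comics/1.png">',
}
strips = walk_comic("http://example.com/3.html",
                    r"/comics/[0-9]+\.png", r"/[0-9]+\.html",
                    PAGES.__getitem__)
```

Passing `fetch` in as a parameter keeps the sketch testable without touching the network.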
Don't forget that you can use the definition debugger to better understand the way Woofy uses your custom definitions. It provides more information than when downloading a comic, plus it's way faster as it only checks the comic links and doesn't download them.
The structure of a comic definition
A comic definition consists of a single XML file describing how Woofy should visit and download the comic. It looks something like the following fragment:
<comicInfo friendlyName="My Fabulous Comic. Seriously." allowMultipleStrips="false" allowMissingStrips="false" author="Pogonitch" authorEmail="pogo@helloworld.com" >
<startUrl><![CDATA[...]]></startUrl>
<firstIssue><![CDATA[...]]></firstIssue>
<rootUrl><![CDATA[...]]></rootUrl>
<comicRegex><![CDATA[...]]></comicRegex>
<backButtonRegex><![CDATA[...]]></backButtonRegex>
<latestPageRegex><![CDATA[...]]></latestPageRegex>
</comicInfo>
Note that each xml element wraps its content in a CDATA region. This is because the content might contain angle brackets, and those would break the xml markup.
First, let's take a look at the attributes, in order to better understand how to use them:
- friendlyName - the comic's name, the one that Woofy will display in the comics list;
- allowMultipleStrips - false by default; use this in order to tell Woofy that it shouldn't stop downloading if several strips are found in the same page;
- allowMissingStrips - false by default; use this in order to tell Woofy that it shouldn't stop downloading if a page contains no strips;
- author - the definition's author; it will appear in the About dialog;
- authorEmail - the definition author's email :)
Now, let's see what each of the elements does:
- startUrl - the url at which to start looking for strips; it will usually be the comic's start page;
- firstIssue - the url of the comic's very first issue; Woofy itself ignores this element, but the automated tests use it to determine whether Woofy can visit all the pages of that comic or gets stuck somewhere; setting it is recommended;
- rootUrl - the url with which to combine the relative comic/button paths; should be specified if the startUrl can't be used for this (see the "Looking for Group" definition for an example).
- comicRegex - this is where it gets interesting; in order to be able to retrieve a strip's link from a page, Woofy needs to know what the link looks like, so we describe it, using .NET regular expressions;
- backButtonRegex - a regular expression describing what the link to the previous page looks like;
- latestPageRegex - a regular expression describing the link to the comic's latest page; this should be used when the comic doesn't have a fixed page displaying the latest strip, but instead offers a link to the latest strip in the startUrl.
Note that the comicRegex, backButtonRegex and latestPageRegex elements allow the use of the content grouping construct, in order to wrap the actual link contained in a wider expression (see the "Least I Could Do" definition for an example).
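To illustrate the rootUrl element from the list above, here is a small Python sketch of how a relative link might be combined with a base address. The `resolve` helper and the `example.com` addresses are hypothetical, and the fallback to startUrl when no rootUrl is given is an assumption drawn from the description, not from Woofy's source.

```python
from urllib.parse import urljoin

def resolve(link, start_url, root_url=None):
    """Combine a (possibly relative) link with rootUrl, or with startUrl if rootUrl is absent."""
    base = root_url if root_url else start_url   # assumed fallback order
    return urljoin(base, link)

# startUrl alone is enough when links are relative to the site root.
from_start = resolve("/comics/20070726_luggage.jpg", "http://www.uglyhill.com")

# A made-up site where the start page lives elsewhere than the strips.
from_root = resolve("comics/1.png",
                    "http://example.com/latest/index.html",
                    "http://example.com/archive/")
```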
Building a comic definition from scratch
Ok, now that we know how Woofy behaves and what a comic definition should look like, let's build our own, shall we? Let's build a definition for Ugly Hill.
For this, let's start with an empty template, like the one below:
<?xml version="1.0" encoding="utf-8" ?>
<comicInfo friendlyName="">
<startUrl><![CDATA[]]></startUrl>
<firstIssue><![CDATA[]]></firstIssue>
<comicRegex><![CDATA[]]></comicRegex>
<backButtonRegex><![CDATA[]]></backButtonRegex>
</comicInfo>
Since the comic's name is "Ugly Hill", we should let Woofy know about that.
<comicInfo friendlyName="Ugly Hill">
We should also specify the address at which Woofy should start looking for strips. Note that this is not necessarily the comic's home page, but a web page containing the most recent comic strip AND a link to the previous strip (that links to the strip before it, and so on).
<startUrl><![CDATA[http://www.uglyhill.com]]></startUrl>
Now, in order to make the developer's life easier, fill in the address of the first comic issue, so that submitted comics will be considered for automated tests. In order to do this, you'll have to visit Ugly Hill's first strip, and get its link (not the page's link, the strip's link).
<firstIssue><![CDATA[http://www.uglyhill.com/comics/20050523.jpg]]></firstIssue>
And now, let's do the interesting part - the regular expressions. For this, we will need a regular expressions tester. I personally use Expresso, so the tutorial will use its concepts, but you should be ok with any other regex tester. In case you need it, an introduction to regular expressions is available at http://www.codeproject.com/dotnet/regextutorial.asp.
Anyways, let's get the source of the start page (http://www.uglyhill.com), and paste it in Expresso's Sample Text area. Right.
Once we have done this, we need to search it for the current strip link. I prefer looking at the image's properties and getting the filename from there - in our case it's 20070726_luggage.jpg. If we search the source code for it, we find something like
<img border=0 src="/comics/20070726_luggage.jpg" >
Of all this, we are only interested in /comics/20070726_luggage.jpg. All we need to do now is convert what we found into a .NET regular expression. Easy!
We do this by analyzing the desired fragment. Notice that it can be described as
/comics/<eight digits>_<some text, maybe even some digits>.<an image extension, may be jpg, may be something else>
A regular expression describing this could be
/comics/[0-9]{8}_[^.]*\.(gif|jpg|jpeg|png)
We test it in Expresso, and see that it actually matches the text we wanted.
Don't forget to test the regular expression you come up with on several of the comic's pages, to make sure it retrieves one and only one link. Otherwise, Woofy will assume it has finished downloading the comic, and it will stop.
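If you don't have a regex tester at hand, the same check can be approximated in a few lines of Python. This is only a sketch: Python's `re` module is not the .NET engine Woofy uses (the two agree on a simple pattern like this one), and the `strip_links` helper is a hypothetical name chosen for the example.

```python
import re

# Python's closest equivalents of IgnoreCase, Multiline and IgnorePatternWhitespace.
WOOFY_FLAGS = re.IGNORECASE | re.MULTILINE | re.VERBOSE

COMIC_REGEX = r"/comics/[0-9]{8}_[^.]*\.(gif|jpg|jpeg|png)"

def strip_links(page_source):
    """Return every piece of text in the page source that the comic regex matches."""
    return [m.group(0) for m in re.finditer(COMIC_REGEX, page_source, WOOFY_FLAGS)]

sample = '<img border=0 src="/comics/20070726_luggage.jpg" >'
links = strip_links(sample)

# The page should yield one and only one strip link.
exactly_one = len(links) == 1
```

Running `strip_links` over the saved source of several different pages is a quick way to spot a page where the expression matches nothing, or matches more than once.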
Now, for the link to the previous page. We notice the back button is represented by the back_button.gif image, so we search the source code and find something like this:
<a href="/d/20070725.html" target="_self"><img border=0 src="/images/back_button.gif"></a>
This time we need to find an expression for the whole fragment, in order to avoid retrieving the forward button instead of the back button. Long story short, the regular expression is as follows:
<backButtonRegex><![CDATA[<a\shref="/d/[0-9]{8}\.html"\starget="_self"><img\sborder=0\ssrc="/images/back_button\.gif"></a>]]></backButtonRegex>
Note that even though we're matching a larger text fragment, we're only interested in a small part of it (the link itself), so we wrap that part in a content capturing group.
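The idea of matching a wide fragment but keeping only the link can be tried out with the Python sketch below. It assumes that the content capturing group is a named group called "content"; Python spells a named group `(?P<content>...)`, while .NET spells it `(?<content>...)`. Check an existing definition such as "Least I Could Do" for the exact form Woofy expects.

```python
import re

# Python's closest equivalents of IgnoreCase, Multiline and IgnorePatternWhitespace.
WOOFY_FLAGS = re.IGNORECASE | re.MULTILINE | re.VERBOSE

# Assumption: the part we want to keep is wrapped in a named group called "content".
BACK_BUTTON_REGEX = (
    r'<a\shref="(?P<content>/d/[0-9]{8}\.html)"\starget="_self">'
    r'<img\sborder=0\ssrc="/images/back_button\.gif"></a>'
)

sample = ('<a href="/d/20070725.html" target="_self">'
          '<img border=0 src="/images/back_button.gif"></a>')

match = re.search(BACK_BUTTON_REGEX, sample, WOOFY_FLAGS)
whole_fragment = match.group(0)       # the entire anchor tag
link_only = match.group("content")    # just the link to the previous page
```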
The full comic definition should look like this:
<?xml version="1.0" encoding="utf-8" ?>
<comicInfo friendlyName="Ugly Hill">
<startUrl><![CDATA[http://www.uglyhill.com]]></startUrl>
<firstIssue><![CDATA[http://www.uglyhill.com/comics/20050523.jpg]]></firstIssue>
<comicRegex><![CDATA[/comics/[0-9]{8}_[^.]*\.(gif|jpg|jpeg|png)]]></comicRegex>
<backButtonRegex><![CDATA[<a\shref="/d/[0-9]{8}\.html"\starget="_self"><img\sborder=0\ssrc="/images/back_button\.gif"></a>]]></backButtonRegex>
</comicInfo>
Now that the new comic definition is finished, you can try it out, by saving it as an xml file in the ComicInfos folder, inside Woofy's install folder. After doing this, start Woofy and try to download all the latest comics. If it reaches the first comic strip ever, then it's a success.
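Before dropping the file into the ComicInfos folder, a quick sanity check can catch malformed XML or a regular expression that doesn't compile. The Python sketch below is one hypothetical way to do that. The `check_definition` helper is not part of Woofy, the list of required elements is an assumption based on the empty template used in this tutorial, and Python's regex engine only approximates the .NET one.

```python
import re
import xml.etree.ElementTree as ET

# Python's closest equivalents of IgnoreCase, Multiline and IgnorePatternWhitespace.
WOOFY_FLAGS = re.IGNORECASE | re.MULTILINE | re.VERBOSE

DEFINITION = r"""<?xml version="1.0" encoding="utf-8" ?>
<comicInfo friendlyName="Ugly Hill">
<startUrl><![CDATA[http://www.uglyhill.com]]></startUrl>
<firstIssue><![CDATA[http://www.uglyhill.com/comics/20050523.jpg]]></firstIssue>
<comicRegex><![CDATA[/comics/[0-9]{8}_[^.]*\.(gif|jpg|jpeg|png)]]></comicRegex>
<backButtonRegex><![CDATA[<a\shref="/d/[0-9]{8}\.html"\starget="_self"><img\sborder=0\ssrc="/images/back_button\.gif"></a>]]></backButtonRegex>
</comicInfo>"""

def check_definition(xml_text):
    """Return a list of problems found in a comic definition (empty if none)."""
    root = ET.fromstring(xml_text.encode("utf-8"))   # raises ParseError on bad XML
    problems = []
    if root.tag != "comicInfo":
        problems.append("root element should be comicInfo")
    if not root.get("friendlyName"):
        problems.append("friendlyName is empty")
    # Assumption: these are the elements a definition cannot do without.
    for name in ("startUrl", "comicRegex", "backButtonRegex"):
        if not (root.findtext(name) or "").strip():
            problems.append(name + " is missing or empty")
    for name in ("comicRegex", "backButtonRegex", "latestPageRegex"):
        text = root.findtext(name)
        if text:
            try:
                re.compile(text, WOOFY_FLAGS)
            except re.error as err:
                problems.append(name + " does not compile: " + str(err))
    return problems

problems = check_definition(DEFINITION)
```

An empty list only means the file is well formed; the real test is still whether Woofy reaches the first strip.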
Enjoy.
RegEx quick reference (contributed by Citizen Drago)
Because a RegEx search can do much more than only match literal pieces of text, there are certain characters reserved for special use, which are usually called "metacharacters". If you need to match any metacharacters as a literal character in a regex you have to "escape" them by putting a backslash \ in front of them.
The metacharacters that must be escaped in order to match them literally in a RegEx, along with their names and their metacharacter functions, are as follows:
| [ | "Opening Bracket" or "(Box or Square) Bracket" | Starts a "Character Class" |
| ( | "Opening Parenthesis" or "Opening Round Bracket" | Starts a "Capture Group" |
| ) | "Closing Parenthesis" or "Closing Round Bracket" | Closes a "Capture Group" |
| \ | "Backslant" or "Backslash" | Indicates a literal character or a RegEx token with a special meaning |
| | | "Pipe" or "Vertical Bar" | Separates alternate match choices |
| ? | "Question Mark" | Preceding item is optional, it can match 0 or 1 times |
| * | "Asterisk" "Star" or "Splat" | Preceding item will match 0 or more times |
| + | "Plus" or "Plus Sign" | Preceding item will match 1 or more times |
| { | "Opening Brace" or "Opening Curly Bracket" | Starts a repetition sequence |
| } | "Closing Brace" or "Closing Curly Bracket" | Ends a repetition sequence |
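| ^ | "Caret" or "Circumflex" | Start of String Anchor (also matches at the start of each line, since Woofy enables Multiline) |
| $ | "Dollar Sign" | End of String Anchor (also matches at the end of each line, since Woofy enables Multiline) |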
| . | "Period" or "Dot" | Matches (almost) ANY character (Use sparingly, especially with repetition!!) |
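To see why escaping matters, the short Python sketch below compares an unescaped file name with an escaped one. It uses Python's `re.escape` purely as an illustration; the .NET counterpart is `Regex.Escape`, and the two do not escape exactly the same set of characters.

```python
import re

literal = "/images/back_button.gif"

# Unescaped, the dot is a metacharacter and matches (almost) any character.
unescaped_hits_wrong_text = re.fullmatch(literal, "/images/back_buttonXgif") is not None

# Escaped, the dot only matches a literal dot.
escaped = re.escape(literal)
escaped_hits_right_text = re.fullmatch(escaped, literal) is not None
escaped_hits_wrong_text = re.fullmatch(escaped, "/images/back_buttonXgif") is not None
```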
Additionally, since the IgnorePatternWhitespace RegEx option is enabled in Woofy, two more characters must be escaped to be used in a match:
| (space) | "Space" | A Space character - can be matched by "\ " or by "\s", but "\s" will also match "Tab", "Line Feed" and "Carriage Return" characters |
| # | "Number Sign" or "Pound Sign" | Comment character - # and anything following are ignored until the next non-comment character after an "End of Line" character is reached |
Also, you should remember - especially when testing - that Woofy enables the following options when matching regular expressions: IgnoreCase, Multiline and IgnorePatternWhitespace.
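The effect of these options is easy to get wrong, so here is a small Python sketch that mimics them. It assumes that Python's `re.IGNORECASE`, `re.MULTILINE` and `re.VERBOSE` flags behave closely enough to the .NET options for these simple cases; the `matches` helper is a hypothetical name.

```python
import re

# Python's closest equivalents of IgnoreCase, Multiline and IgnorePatternWhitespace.
WOOFY_LIKE = re.IGNORECASE | re.MULTILINE | re.VERBOSE

def matches(pattern, text):
    """True if the pattern is found anywhere in the text under Woofy-like options."""
    return re.search(pattern, text, WOOFY_LIKE) is not None

# An unescaped space is dropped from the pattern, so "img src" really means "imgsrc".
plain_space = matches(r"img src", "img src")
escaped_space = matches(r"img\ src", "img src")
whitespace_class = matches(r"img\ssrc", "img src")

# Matching ignores case.
upper_case = matches(r"img\ssrc", "IMG SRC")

# An unescaped # starts a comment, so "a#b" really means just "a".
plain_hash = matches(r"a#b", "a")
escaped_hash = matches(r"a\#b", "a#b")
```

When testing in an external tool, enable the same three options there, otherwise an expression may pass in the tester and still fail in Woofy.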