Porting my blog for the second time, images part 1

This is post #8 in my series about how I port this blog from BlogEngine.NET 2.5 ASPX on Windows Server 2003 to an Ubuntu Linux server with Apache2, MySQL and PHP, a so-called LAMP stack. The introduction to this project can be found in this blog post: /post/Porting-my-blog-for-the-second-time-Project-can-start.

So far in this project I have managed to load all the data from the XML files of my previous blog engine. I still need to walk through all the files, but before doing that I would like to be able to download the images. In the early days of my previous blog I had an average consumer ADSL line for my web server. Since most people mainly want to surf the Internet quickly, such a line has a faster download than upload, so everything I wanted to serve had to go out through the slow upload direction. To speed up loading of the site I placed the images on online image services. These days I have symmetric fiber, and I no longer see any benefit in serving the images from somewhere else; it only makes administering the images more complex. Also, when I host the images on my own server I am in charge of them, with no risk of policy violations or anything like that. Mine is mine. My plan is to download the images back home again and store them locally. Obviously I have the originals somewhere, but I would like to avoid having to find each original photo from my camera and match it up with old blog posts.

Here is an example of a post:

≺p≻≺a href="/media/DSC_1929 skarpare.jpg"≻
≺img style="float: left; margin: 5px;"
src="https://lh5.googleusercontent.com/-oWLQT6bbQjA/TwylXYYTBXI/AAAAAAAABZY/5lPgCBuQ_90/s640/DSC_1929%252520skarpare.jpg"
alt="" width="640" height="358" /≻
≺/a≻Here is a photo of a Blue Tit eating seeds from a birch tree at Sunnerås.

Usually the image URL points to a smaller version, and the image is enclosed in an a href link so that clicking the image opens a bigger version. I wonder if I will be able to download the images with Perl. I expect all sorts of problems, "enter captcha here" and such, so I might need to find other solutions. For now I just go on.

To find the URLs to process I created a little Perl routine that works on the content of the post. To begin with, my routine does not do much; it just prints messages to the console. Here is the first version of my parser program:

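# Content of the post, as loaded by the XML parser.
my $content = $dictStringFieldToValue{"content"};
# $1 is the text before a tag, $2 is the ≺a≻ or ≺img≻ tag itself.
while ($content =~ /(.*?)(≺a.+?≻|≺img.+?\/≻)/msi)
{
	print "-------------Before------------\n" . $1 . "\n";
	print "--------------Tag--------------\n" . $2 . "\n";
	# Drop the part already handled so the next match starts after it.
	$content = substr($content, length($1 . $2));
}
print "-----------------After-------------\n" . $content . "\n";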

I get the entire content of the post from the XML parser I wrote about in this post: /post/Porting-my-blog-for-the-second-time-load-a-post. As long as my regular expression matches, I print what was found, and when there is no match anymore I print the "After" message with whatever is left. The parser works from the front of the material, and as soon as something is found it removes the parsed part with a substr call. The regular expression contains two pairs of parentheses: the first captures the text before a tag and the second the tag itself.
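Once the tags are found, the natural next step is to pick the URL out of each one. Here is a minimal sketch of how that could look. It is not part of my program yet: the routine name extract_urls, the sample string and the example.com address are made up for illustration, and it assumes attribute values are always wrapped in double quotes like in the post above. It also uses the /g modifier instead of substr to move forward through the content.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): returns the value of
# the href attribute of each ≺a≻ tag and the src attribute of each ≺img≻ tag.
sub extract_urls {
    my ($content) = @_;
    my @urls;
    # Same kind of tag pattern as in the parser, but with /g the regex
    # engine remembers the position, so no substr call is needed.
    while ($content =~ /(≺a\b.+?≻|≺img\b.+?\/≻)/gsi) {
        my $tag = $1;
        # Assumption: the value is double quoted, as in the sample post.
        if ($tag =~ /\b(?:href|src)\s*=\s*"([^"]*)"/si) {
            push @urls, $1;
        }
    }
    return @urls;
}

# Made-up sample with the same shape as a real post.
my $sample = '≺p≻≺a href="/media/big.jpg"≻≺img style="float: left;" '
           . 'src="https://example.com/s640/small.jpg" alt="" /≻≺/a≻Some text.';
print "$_\n" for extract_urls($sample);
```

For the sample this prints the link target first and the image source second, which is the order the tags appear in the post. A single-quoted or unquoted attribute value would be skipped by this sketch.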

Next time I will try to download the images. It will be interesting to see if that works out.