Wednesday, October 15, 2008

Using Yahoo Pipes screen scraping to create an RSS Feed

The internet amazes me. Every time I use it, or write something for it, I am blown away. Take, for example, my favorite website for Husker information. (For those who don't know, by Husker information I mean Nebraska Cornhusker football. I was born in Nebraska, so I naturally follow the Cornhuskers.) Someone takes the time to consolidate all the Husker news and keep this site updated. It is the de facto standard for Husker news. Unfortunately, the site does not offer any kind of notification via an RSS feed.

It is not the site itself that amazes me, but rather how I can take a tool like Yahoo Pipes, scrape the latest news, and within a couple of hours be subscribed to that feed in Google Reader. Not only that, but I cloned Paul Arterburn's Huskerpedia Pipe to get started, and now everyone can take advantage of my new feed by subscribing to it, improving it, or cloning it for their own use.
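For anyone curious what that kind of scrape looks like outside of Pipes, here is a minimal sketch in Python using only the standard library. It is not how my Pipe is built; it just shows the same idea of pulling headline links out of a page and wrapping them in RSS 2.0. The sample HTML, the example.org URLs, and the names LinkCollector, scrape_links, and build_rss are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET


class LinkCollector(HTMLParser):
    """Collect (text, absolute href) pairs for every <a href=...> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the anchor currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the link text.
            title = " ".join("".join(self._text).split())
            if title:
                self.links.append((title, self._href))
            self._href = None


def scrape_links(html, base_url):
    """Return a list of (title, link) tuples found in the HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def build_rss(feed_title, feed_link, links):
    """Wrap the scraped links in a bare-bones RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Scraped headlines"
    for title, href in links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = href
    return ET.tostring(rss, encoding="unicode")


# Hypothetical stand-in for a news page; a real run would fetch the page first.
SAMPLE_HTML = """
<ul>
  <li><a href="/news/1.html">Huskers win opener</a></li>
  <li><a href="http://example.com/story">Recruiting   update</a></li>
  <li><a name="top">no link here</a></li>
</ul>
"""

links = scrape_links(SAMPLE_HTML, "http://example.org/")
feed_xml = build_rss("Husker news (sketch)", "http://example.org/", links)
print(feed_xml)
```

A real page would need a narrower rule than "every link", which is the part Pipes makes easy to tweak by hand.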

This is the second time I have used Yahoo Pipes to accomplish something complex very easily. Several months ago, I created a pipe called the Gestalt Shared Feed that combined all of my co-workers' shared items from their Google Reader feeds. This lets us read the best of the best among ourselves and has really helped our communication. I think we have about 10 people who share items and many more who subscribe to it.

Check out my RssHuskerPedia Pipe and view the source if you are curious. See below for the before and after.


Anonymous said...

This is a very good article.

We do provide screen scraping and data collection services.

Paul Arterburn said...


Thanks for fixing this! I am in fact the Paul Arterburn that put the original Pipes feed together. I immediately subscribed to yours, because it was always a bit of a hassle to click through the links. You might submit it to Huskerpedia too (with the meta link code, just in case he's not sure how to add it).

Thanks again. Go Big Red.

fyi - not trying to spam or anything, because I have no affiliation...but I found your site mention my name through

jlorenzen said...

Unfortunately, Paul, I soon realized a flaw in my new Pipe. The Pipes we created don't set a published date, so after you read an entry it just shows back up again, and again, and again. And since Huskerpedia doesn't provide a date for each line item, I really don't know how to get around this issue.
Ideally I could get with them and, at a minimum, get them to add an attribute on each line item that my Pipe could then parse.

Paul Arterburn said...

No need to...simply take off the "now" date that the pipes you are feeding use as the pub_date. Google Reader, at least, is smart enough to grab only what it hasn't already seen if no date is published.
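To illustrate why dropping the date can work, here is a rough Python sketch of how a reader might recognize items it has already seen when no date is published. This is an assumption for illustration only, not a description of what Google Reader actually does; stable_guid and unseen_items are hypothetical names, and the example.org links are invented.

```python
import hashlib


def stable_guid(title, link):
    """Derive an ID from the item's own content, so it is identical on every fetch."""
    return hashlib.sha1(f"{link}\n{title}".encode("utf-8")).hexdigest()


def unseen_items(items, seen_guids):
    """Return items whose ID is not yet in seen_guids, recording them as seen."""
    fresh = []
    for title, link in items:
        guid = stable_guid(title, link)
        if guid not in seen_guids:
            seen_guids.add(guid)
            fresh.append((title, link))
    return fresh


seen = set()
first_fetch = [("Huskers win opener", "http://example.org/news/1.html")]
second_fetch = first_fetch + [("Recruiting update", "http://example.org/news/2.html")]

print(unseen_items(first_fetch, seen))   # first poll: the single item is new
print(unseen_items(second_fetch, seen))  # second poll: only the added item is new
```

With a "now" date stamped on every item, each fetch looks like fresh content; with nothing changing between fetches, an unchanged item keeps the same ID and is skipped.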