Dapper, Pipes and Plagger: meta-Mashup

The launch of Yahoo! Pipes sparked a lot of discussion about "similar services / products," and apparently a lot of people have noticed the similarity between Pipes and Dapper or Plagger.

We could talk about the pros and cons of these products, because they're built on different architectures and implemented differently. But instead of just comparing what each can or can't do, I'd suggest doing a "mash-up" of these services. Let's call it a meta-mashup.

Pipes accepts RSS/Atom feeds as input, and has no ability to detect updates on sites that don't provide feeds. Dapper and Plagger are good at that. So you can use Dapper's cool UI to analyze any web page and generate an API for it, then feed the generated feeds into Pipes as input, and create a flow that generates remixed feeds.

Dapper and Pipes are both hosted on servers, and hence there's no way for them to interact with local devices or personally authenticated services: storing updated feeds on your iPod, notifying you via MacGrowl, or posting updates to Twitter using your account. Plagger is really good at these.

So you can mash up these remix sites to combine a fully GUI-controlled API creator (Dapper) and flow programming (Pipes) with your own publisher/notification engine (Plagger).

Yahoo! Pipes = Dressed up Plagger, but the dress is nice

Plagger is a Perl-based, open-source feed routing system, or a so-called "mash-up creator." A couple of interesting sites, like CDTube (a mash-up of Count Down TV and YouTube), are powered by Plagger, and it has been driving geeks to do interesting things with RSS/Atom/iCal feeds.

The most annoying things about Plagger development, for me and for end users as well, are the lack of documentation and the lack of a pretty interface. And it requires half of CPAN to run. (Well, not really: most of those CPAN modules are required by "plugins," and you don't actually need to install all the plugins. They're all optional.) These barriers were left there sort of intentionally, to keep the S/N ratio in the community high, and it looks like that has been working well, but it has caused other problems ("Plagger is HARD to install!").

But anyway, the most frequently asked question at conferences has been: "Is there a hosted version of Plagger that I can use, without installing it myself?" and I've always answered "Yes, there could be, but if you're asking me to do it, I won't. Why? I don't need it :) Plagger is licensed under Artistic/GPL and there's no way for me to stop someone from creating a hosted Plagger."

So now, Yahoo! Pipes could be the answer for these people. I'm not saying "Y! Pipes ripped off my idea!" since there are already other services like Dapper or xFruits, but having that cool IDE and the ability to share a "pipe" is really awesome. Congrats to the Pipes team.

It also makes me grin that they named the service "Pipes," and that folks like Jeremy or Tim O'Reilly are saying "RSS is the pipes for Internet!", which coincidentally matches Plagger's tagline, "the UNIX pipe programming for Web 2.0." As you can see from the slides (esp. those for YAPC::Europe and XML Developer's Day), I've been saying "Plagger allows you to mash up feeds just like Unix pipe/filters." Thanks to Yahoo! for proving that my vision was correct :)

Plagger upgrades your feeds ... but it degrades too.

Plagger tries really hard to upgrade and normalize your feed data. For example, if you have a feed without content (title or summary only), the Filter::EntryFullText module automatically fetches the linked page and extracts the full-text body from it.

Sometimes we need to degrade the content, though. One example is YouTube videos embedded in feeds. When you use Publish::Gmail to read feed updates in Gmail, Thunderbird or any other email client, embedded YouTube videos are not playable, for security reasons. (Thunderbird has an option to enable scripting in HTML emails, but then you take on a security risk when reading phishing emails.)

So tokuhirom suggested a filter that strips the embed tag and replaces it with a simple A link wrapping an IMG tag that points to the thumbnail image on the YouTube server. We named it Filter::DegradeYouTube and I have now committed it to the trunk. If you provide your YouTube dev_id, which is highly recommended, it calls the YouTube API to figure out the thumbnail image location.

This works really well. I'd like to create a more general Filter::DegradeHTML framework, so that other sites similar to YouTube can be degraded like this.
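To make the "degrade" idea concrete, here is a rough sketch in Python. It is not the actual Filter::DegradeYouTube code (that is a Perl plugin). The function name `degrade_youtube`, the assumed embed URL shape (`youtube.com/v/<id>`) and the hard-coded thumbnail URL pattern are all assumptions for illustration; the real filter can ask the YouTube API for the thumbnail location instead.

```python
import re

# Matches <embed ... src="http://www.youtube.com/v/VIDEO_ID" ...>, optionally
# followed by </embed>. The URL shape is an assumption about the embed markup.
EMBED_RE = re.compile(
    r'<embed\b[^>]*\bsrc="https?://(?:www\.)?youtube\.com/v/([\w-]+)[^"]*"[^>]*>'
    r'(?:\s*</embed>)?',
    re.IGNORECASE,
)


def degrade_youtube(html):
    """Replace YouTube <embed> tags with a plain link around a thumbnail image."""

    def to_link(match):
        video_id = match.group(1)
        watch_url = f"http://www.youtube.com/watch?v={video_id}"
        # Assumed thumbnail location, used here instead of an API lookup.
        thumb_url = f"http://img.youtube.com/vi/{video_id}/default.jpg"
        return (
            f'<a href="{watch_url}">'
            f'<img src="{thumb_url}" alt="YouTube video" /></a>'
        )

    return EMBED_RE.sub(to_link, html)
```

Any surrounding `<object>` wrapper is left alone by this sketch; a more general DegradeHTML-style framework would presumably take the pattern and the replacement as per-site rules.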


Summary support in Plagger 0.8

From my email sent to plagger-dev list:

I've been working on summary support in the hackathon-summary branch at http://plagger.org/trac/browser/branches/hackathon-summary

The main points of the summary feature are:

  • summary, title, body and author have "is_html" and "is_text" methods
    to determine whether the data they hold is HTML or plaintext.
  • Plagger::Util::strip_html($html) will render HTML into text using
    HTML::FormatText (making it pluggable is left as a TODO).
  • $entry->summary->text and $entry->body->html do what you mean.
  • The summary is automatically extracted from feed metadata, a la the RSS description or Atom:summary field. If there isn't one, the Summary::Auto plugin will auto-generate it using Plagger::Util::summarize($entry->body).
  • How the summary is actually generated from the body HTML is pluggable. There are a couple of Summary::* plugins already checked in on the branch: http://plagger.org/trac/browser/branches/hackathon-summary/plagger/lib/Plagger/Plugin/Summary
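As a rough illustration of the points above (a Python sketch, not the Perl code in the branch): a content value that knows whether it holds HTML or plaintext and converts on demand. The class name `Content`, the crude tag stripper and the truncating `summarize` are all hypothetical stand-ins; Plagger itself uses HTML::FormatText and pluggable Summary::* plugins.

```python
from dataclasses import dataclass
from html import escape
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes; a crude stand-in for a real HTML-to-text formatter."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Drop tags, decode entities and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


@dataclass
class Content:
    data: str
    is_html: bool = False

    @property
    def is_text(self):
        return not self.is_html

    @property
    def text(self):
        # Plaintext view: strip markup only if the data is HTML.
        return strip_html(self.data) if self.is_html else self.data

    @property
    def html(self):
        # HTML view: escape plaintext so it is safe to drop into a template.
        return self.data if self.is_html else escape(self.data)


def summarize(body, limit=80):
    """Hypothetical auto-summary: the first `limit` characters of the plaintext."""
    text = body.text
    if len(text) > limit:
        text = text[:limit].rstrip() + "..."
    return Content(text)
```

With this shape, a template can ask for `.text` or `.html` without caring which form the feed originally supplied.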

The final thing left undone is how to declare which field to use in each Notify/Publish plugin without updating the templates. I'd like to be able to say, "send full-content HTML mail to my Gmail account, but the plaintext summary to my mobile. Use the summary as HTML in Publish::Planet," with a single config, without having to update the templates.

The syntax would be something like:

- module: Publish::Gmail
  config:
    mailto: ****@gmail.com
  override:
    body: $args->{entry}->body->html

- module: Publish::Gmail
  config:
    mailto: ****@mobile.example.com
  override:
    body: $args->{entry}->summary->text

- module: Publish::Planet
  config:
    mailto: ****@mobile.example.com
  override:
    body: $args->{entry}->summary->html

But I'm not sure what the best syntax would be.

I'd like to apply the override/localize methodology to link(s) as well, a la:

- module: Notify::IM
  override:
    link: $args->{entry}->alternate_link('tinyurl')

to display the link as a TinyURL (using WWW::Shorten), for instance.

Any feedback on this would be welcome.

Plagger Hackathon 2 in Cybozu Labs

This weekend we had our 2nd Hackathon dedicated to Plagger, at Cybozu Labs in Akasaka, Tokyo.

The biggest event during this Hackathon was the TestAThon. Most of our committers went wild creating unit test files for plugins without tests. The number of .t files has jumped from 32 to 102, which is amazing. Kudos to our testathoner team: youpy, takesako, 33rpm, tomi-ru, hakobe & charsbar (on-site) and mizzy, hsbt & drawnbody (off-site remote testing). Thank you ALL!

Thanks to the unit testing effort, I've been able to go a long way in merging the work done in the Hackathon-MT branch, and to make lots of core changes without worrying about backward incompatibility, since whenever I break the code, those tests will notify me. Great.

So I've been working on: a JSON dumper, a Feed serializer, Notify::Audio and enclosure integration, decoding the YAML config as UTF-8 by default, summary and pluggable summarizer support, rewriting the OPML parser using LibXML SAX, and refactoring the XML::Feed parser into Plagger::FeedParser so plugins can use it. Hmm, that's a lot.

The summary and summarizer stuff is being done in the hackathon-summary branch and will be merged to the trunk soon.

Otsune has been working on documentation improvements and Nagayaman has been doing a great website redesign, which will be online soon. Yappo has been doing Senna hacking, per-plugin storage, and Search::Rast improvements.

Check the changes made during the Hackathon on Trac to be even more amazed.

SAKK Plagger Hackathon #1: MT meets Plagger!


Today we had a Plagger Hackathon at the Six Apart KK office in Akasaka. Most of the MT and TypePad engineers gathered around the couch and hacked together to build a Movable Type plugin for Plagger (MT-Plagger) and a couple of search enhancements. You may be wondering why the Vox engineers weren't there. Yeah, they're all in San Francisco right now :)

So let's see the MT-Plagger demo. The basic idea of the plugin is simple. When you create or update a post in the Movable Type CMS, the MT::Entry object is transformed into a Plagger::Feed (where MT::Blog corresponds to the Feed and MT::Entry to a Plagger::Entry), and the plugin bootstraps the Plagger context to run the phases from Filter onward.

You can see in the demo how it works with Publish::Gmail to send email notifications, and with Notify::IRC to do realtime notification to an IRC channel. Integrating it with Search::* to build a better search engine for MT would be a nice hack. Making it work with new Comments (rather than just Entries) would be cool for pluggable comment notification. And with Publish::Delicious or Publish::HatenaBookmark, you'll always be the first to bookmark your own entry :) The possibilities are endless.

The MT-Plagger code is now in the hackathon-mt branch and will be merged down to the trunk soon.

Plagger competitive services

Planet and newspipe are examples of "software" that competes with Plagger. What about services? Well, there are a couple of Web 2.0 sites that do things pretty close to what Plagger does.

Feed Rinse: "Feed Rinse is an easy to use tool that lets you automatically filter out syndicated content that you aren't interested in. It's like a spam filter for your RSS subscriptions"

Touchstone: "Subscribe to changes on your favourite website. Set rules for what’s important. Have alerts appear on your desktop while you work." They call their equivalents of subscription/notify/publish plugins Input/Output Adapters.

Dapper: "Dapper’s mission is to allow you to use any web based content in any way you can imagine. And by use, we mean going beyond just reading or viewing a webpage."

xFruits: "XFruits makes possible the Mashup RSS creation in a very simple way thanks to the Composer. You can assemble the bricks together so as to build your own feed-based service. "xFruiter" service's users are referenced."

OSCON: My Plagger talk went well

So my first talk at OSCON is over. Check out the slides on the Plagger website if you missed it. Basically I got mostly positive feedback along the lines of "Hey, that's cool!" I'm pretty happy. And now Junior (Mark Smith) from Six Apart has written a really nice writeup about the session. Thanks Junior!

I updated the slides after I woke up at 5am this morning and did some rehearsal in the room. The main problem I found was that the talk ran longer than the 45-minute slot: it actually took 60 minutes. So I decided to speak faster than usual. Another issue was the order of the slides: since it started with the killer apps before the quick guide to plugins, it got boring towards the end, and at the same time it was awkward to explain which plugins each app actually uses.

So I decided to switch the order and introduce the killer apps after the plugin tour. It worked well, but it might have been boring for some people during the first half. I don't know how to deal with that at YAPC::EU. Feedback is welcome.

BTW the highlight plugin this time was Notify::Pizza. It was Notify::Eject at YAPC::Asia, and Publish::Excel at YAPC::NA. I wonder what it's gonna be at YAPC::EU :)

Now that my talk is over, I'm relaxed. Looking forward to attending other sessions at the conference, and the party!

Plagger vs. newspipe

Meta: I've been lazy about posting here on this blog. If you need more geeky stuff, check out my use Perl blog, and if you want non-geeky stuff, go to my shiny Vox blog. Thanks.

Plagger is written in Perl, and the competition always comes from the Python world. For instance, Planet is a good piece of software written in Python that aggregates posts from lots of blogs into a single HTML page, and our friend Dave Cross wrote a very good post on how easy it is to build a Planet site using Plagger.

Today I found another competitor: newspipe, which is also written in Python. The site has a very good summary of what it does and how you use it. And the feature list makes me feel that the author has a very similar strategy for dealing with configuration and site scraping, and for what's actually needed for email notification/archiving. Very good stuff.

So, let's compare the things newspipe claims to do, with Plagger (Subscription::OPML + Publish::Gmail).

Supports RSS and Atom feeds through Mark Pilgrim's Universal Feed Parser

Yeah, Plagger uses XML::Feed and XML::Liberal to deal with various versions of feeds, even if they're somewhat broken.

feeds are listed in an OPML file:

Yes, you can write a list of feeds in an OPML file and load it using Subscription::OPML. Optionally you can take the list from an XOXO file, a database, your Bloglines account etc. with the power of plugins.
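For readers who haven't looked inside an OPML subscription list: each feed is an `outline` element carrying `xmlUrl` (the feed) and usually `htmlUrl` (the site). A minimal Python sketch of reading one follows. It only illustrates the file format; it is not how Subscription::OPML is implemented, and `feeds_from_opml` is a made-up name.

```python
import xml.etree.ElementTree as ET


def feeds_from_opml(opml_text):
    """Return (title, xmlUrl, htmlUrl) for every outline that has an xmlUrl."""
    root = ET.fromstring(opml_text)
    feeds = []
    # iter() walks nested outlines too, so grouped feeds are included.
    for outline in root.iter("outline"):
        xml_url = outline.get("xmlUrl")
        if xml_url:
            title = outline.get("title") or outline.get("text")
            feeds.append((title, xml_url, outline.get("htmlUrl")))
    return feeds
```

Outlines without an `xmlUrl` (typically folders used for grouping) are skipped, but the feeds nested inside them are still picked up.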

feed options can be set individually for each feed or for a group of feeds

There's no feed options support yet when you use Subscription::OPML. I think this is a good idea.

supports screen-scraping scripts via an internal pipe:// URI schema

Plagger supports screen-scraping: you list the usual HTML page URL (written in htmlUrl if you use OPML) and register a CustomFeed:: plugin for that URL. (Plan: I'll rename CustomFeed:: to Scraping:: or something.) The CustomFeed::Pipe plugin actually calls another process to get data from various sources, including screen scraping.

the OPML file can reside locally (in your hard disk) or remotely (in a web server).

Yes. The OPML file can be either local or remote.

sends news items via SMTP to a designated e-mail address

Yes. Plagger supports SMTP, Sendmail and SMTP-TLS. There's a patch to do SMTP-AUTH as well.

messages can be in HTML/multipart MIME format or plaintext

Yes. Emails are currently sent as HTML. There's no built-in plaintext format support yet, but there are plugins to do that.

multiple items from a feed can be grouped and sent in a digest message

Yes. This is the default behavior of Plagger Publish::Gmail.

updated news items are detected and re-sent with additions and deletions highlighted

No, we don't do this; we leave it to Gmail's diff functionality.

images linked from the feed items are downloaded and included as inline images inside the mail message (great for archiving purposes).

Not by default, but it can easily be done with:

- module: Filter::FindEnclosures
- module: Filter::FetchEnclosure
- module: Publish::Gmail
  config:
    attach_enclosures: 1

a mobile (text-only) view of a feed can be sent to a secondary e-mail address (to read on a PDA or an MMS-enabled mobile phone)

Once we've finished refactoring email notification, it should be pretty simple to enable another Notify::Email directive with plaintext mode on.

full support for HTTP optimizations like gzip compression, If-Modified-Since and If-None-Match headers. Feeds and image files will only be downloaded when they have changed.

Yes, with URI::Fetch.
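In Plagger this is URI::Fetch's job. For anyone curious what the optimization looks like on the wire, here is a small Python sketch of building such a conditional request. The function name `conditional_request` is made up, and the caller is assumed to have stored the `ETag` and `Last-Modified` values from the previous response.

```python
import urllib.request


def conditional_request(url, etag=None, last_modified=None):
    """Build a GET request that lets the server answer 304 Not Modified."""
    request = urllib.request.Request(url)
    # Ask for a compressed body; the caller must gunzip it when the response
    # says Content-Encoding: gzip (urllib does not do that automatically).
    request.add_header("Accept-Encoding", "gzip")
    if etag:
        request.add_header("If-None-Match", etag)
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    return request
```

When the feed hasn't changed, the server replies with 304 and no body, so nothing needs to be downloaded or parsed again. With `urllib.request.urlopen`, that 304 surfaces as an `urllib.error.HTTPError` whose `code` is 304.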

E-mail "threading" based on previously sent RSS items.

No, Plagger doesn't do this and leaves it to Gmail's "conversations" feature. Adding X-Item-* headers and References: (or In-Reply-To:) would be a good idea.
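Here is roughly what header-based threading would involve, sketched in Python. This is an illustration of the idea only, not an existing Plagger feature; `build_item_mail` and the `plagger.example.org` domain are made up, and the caller is assumed to remember the Message-ID it sent for each item.

```python
from email.message import EmailMessage
from email.utils import make_msgid


def build_item_mail(subject, body, previous_id=None, domain="plagger.example.org"):
    """Build a mail for a feed item, threaded under an earlier mail if given."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["Message-ID"] = make_msgid(domain=domain)
    if previous_id:
        # Mail clients group messages into threads using these two headers.
        msg["In-Reply-To"] = previous_id
        msg["References"] = previous_id
    msg.set_content(body)
    return msg
```

The first mail for an item goes out with `previous_id=None`; when the item is updated, the follow-up passes the stored Message-ID so the client can show both in one thread.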

That's it. Obviously, we are better in some places and can learn from them in others, too.

Plagger update: Planet and microformats (XOXO)

I've been a little lazy about posting updates on recent Plagger development. Never mind. It doesn't mean Plagger development has stalled; it's actually still very hot.

Looks like Planet Planet added Atom support recently. Plagger also has a nice Planet publisher plugin that Casey wrote for ETech 2006, and thanks to the nice design of the Plagger plugin system, it's damn easy to add Atom/RSS/OPML output in combination with the Planet output.

Today I added another fancy Subscription module called Subscription::XOXO. It fetches an XHTML page and extracts a subscription list from it using the XOXO microformat. This is pretty nice because someone can maintain their list of, for example, "interesting links for Perl hackers," and you can use that single XHTML page as the source of your subscription.
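To show what "extracting a subscription list from XOXO" means in practice, here is a minimal Python sketch that pulls links out of any list marked with the `xoxo` class. It is not the Subscription::XOXO code; the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class XOXOLinks(HTMLParser):
    """Collect (href, text) for links inside any list with class "xoxo"."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many ul/ol levels deep we are inside an xoxo list
        self.links = []
        self._href = None   # href of the <a> currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("ul", "ol"):
            classes = (attrs.get("class") or "").split()
            if self.depth or "xoxo" in classes:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag in ("ul", "ol") and self.depth:
            self.depth -= 1
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def xoxo_links(xhtml):
    parser = XOXOLinks()
    parser.feed(xhtml)
    parser.close()
    return parser.links
```

Links outside the `xoxo` list are ignored, which is the point of the microformat: the page author marks exactly which outline is meant to be machine-readable.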

You can create a Planet out of the subscription, or create an OPML feed so that your favorite RSS aggregator can synchronize from the XOXO links. Sweet!

Another very interesting plugin created recently is Search::Estraier (screenshot). It uses a pretty fast search engine, Hyper Estraier, and its P2P node API to index entries aggregated by Plagger. Nice examples of using this module would be: 1) create your own blog search with Subscription::PingServer + Aggregator::Xango; 2) aggregate all entries from the web containing a certain keyword using Subscription::Planet, then index them for later search as a knowledge base, like this Plagger KB.
