So, Greg Linden of Findory was kind enough to drop by yesterday and leave some thoughts regarding my post on automated tag creation. Greg, I realized this morning — with a fresh pair of eyes — that I completely missed the point of your comment (sorry); you weren’t asking so much about the implementation or the ‘need’ I was solving for… you were (I’m now guessing) asking about the value of tags in general (especially vs. search).
Let me take a quick (overly simple) shot at describing what I see as the key difference(s).
In my mind:
– Tags describe what something is, is about, etc.; whereas
– Search allows me to discover a set of “objects” that contain my query tokens
Tags are pretty obvious in a “low text” environment, e.g., photos (Flickr was nowhere close to being the first to use tags with photos). With photos, if the user doesn’t annotate, there’s precious little meta-data to use for discovery in most use cases. (EXIF headers **generally** offer little more than a time/date stamp to help discover photos; other EXIF data like camera make/model is much less useful.) So here, tags are good, as is any other lightweight scheme that would help users avoid the “shoebox” situation.
But tags are also interesting in a “high text” environment, where there is a lot of “extra information” that can lead to false positives. For example, neither this post nor my last post is about Greg Linden or Findory. Yet, if I wanted to find all of my posts that were about either of those entities using Search, both this post and the last post would be surfaced, thereby degrading the relevance of the result set.
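To make the false-positive point concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Post` class, the `search` and `by_tag` helpers, and the sample posts (including a hypothetical third post that really is about Findory). The search is a naive token match, not how any real search engine ranks results.

```python
import re
from dataclasses import dataclass

@dataclass
class Post:
    title: str
    body: str
    tags: frozenset = frozenset()

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def search(posts, query):
    """Naive full-text search: titles of posts whose body contains every query token."""
    wanted = tokenize(query)
    return [p.title for p in posts if wanted <= tokenize(p.body)]

def by_tag(posts, tag):
    """Tag lookup: titles of posts the author explicitly tagged with `tag`."""
    return [p.title for p in posts if tag.lower() in p.tags]

# Hypothetical sample data, not real posts.
posts = [
    Post("Automated tag creation",
         "Notes on generating tags automatically.",
         frozenset({"tagging", "automation"})),
    Post("Tags vs. search",
         "Greg Linden of Findory left a comment about tags.",
         frozenset({"tagging", "search"})),
    Post("A look at Findory",
         "This post reviews Findory in depth.",
         frozenset({"findory"})),
]

search_hits = search(posts, "Findory")  # matches any post that mentions the word
tag_hits = by_tag(posts, "findory")     # matches only posts tagged as being about it
```

In this toy data, the search returns both the post that merely mentions Findory and the one that is about it, while the tag lookup returns only the latter.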
Hopefully that’s a bit more helpful; definitely interested in continuing the conversation.
Smart comments coming from Greg Linden, John Dowdell and Aron Miller on this thread, and an insightful notion from Jeff Clavier on the original thread.
I have more to say, but I’m really waiting on the other shoe to drop so that I can tie a couple of different things together here. Sigh.
Hi, Tony. That’s an interesting point. The tags are metadata, capturing and emphasizing information that isn’t obvious from the content alone.
Your point is particularly compelling for images. It’s hard to find similar and related images normally, but on sites like Flickr, tags make it easy.
It’s a little less compelling for text, but you’re right that tags may include keywords not in the text and may help distinguish important keywords from irrelevant keywords.
But, to be clear, I was never doubting the value of tags. I think the success of sites like Del.icio.us and Flickr makes it pretty obvious that tags have value.
I’m more hesitant about the value of automated tag creation. An automated process doesn’t have the advantages you mentioned: increased relevance and non-obvious summarization. And whatever the automated process is, it should be possible to embed it into a general search engine to improve relevance rank on all searches.
So, I’m still left wondering about automated tag creation. Perhaps we’ll have to wait and see what clever things people do with it before we know its value.
Say Tony, have you heard of any Viagra vendors putting up photos or weblogs of their ads, and tagging them “sxsw”, “podcasting” and what-all to catch eyeballs…?
(Tagging, like HTML metadata keywords, or even HTML text, doesn’t seem a sufficient filter… looks to me like it will work in the short-term and then require re-engineering like email, Usenet, etc… but maybe I’m missing something…?)
Greg, you have cut through this clearly. Conversion of source material to a set of tags is akin to lossy compression. The process strictly reduces the amount of semantic information on hand.
That’s not to say there aren’t benefits. The obvious one might be computing horsepower. But it may have other uses.
I have only browsed Findory for a short time, but it appears that whatever intermediate data structure represents a particular user’s “tastes” is not directly exposed. It seems to me that tags, or perhaps some vector of tags and weights, would be an appropriate reductive format. This would make it possible to adjust that intermediate format via direct user input or simply through an open API of sorts.
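As a rough sketch of what such a “vector of tags and weights” might look like, here is a small Python example. This is purely an assumption for illustration, not a description of how Findory actually stores tastes; the names `score`, `adjust`, and `rank`, the weights, and the sample articles are all hypothetical.

```python
def score(profile, article_tags):
    """Sum the profile weights of the tags an article carries (unknown tags count as 0)."""
    return sum(profile.get(tag, 0.0) for tag in article_tags)

def adjust(profile, tag, delta):
    """Return a copy of the profile with one tag's weight changed by direct user input."""
    updated = dict(profile)
    updated[tag] = updated.get(tag, 0.0) + delta
    return updated

def rank(profile, articles):
    """Order article titles from best to worst match for the profile."""
    return sorted(articles, key=lambda title: score(profile, articles[title]), reverse=True)

# Hypothetical taste profile: tag -> weight.
profile = {"search": 0.5, "tagging": 0.25}

# Hypothetical articles: title -> set of tags.
articles = {
    "Tag spam": {"tagging", "spam"},
    "Ranking basics": {"search", "relevance"},
}

before = rank(profile, articles)
# The user explicitly says they care about spam; the ordering changes.
after = rank(adjust(profile, "spam", 1.0), articles)
```

Because the profile is just a small mapping of tags to numbers, a user (or an API client) can read it and nudge it directly, which is the appeal of the reductive format.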
It might be interesting to think about this in relation to Amazon’s new SIPs (Statistically Improbable Phrases), which are also produced by an automated process (and are therefore lossy).