I think the next big advance in search AI won't happen until we figure out a better way of organizing data.
I had an idea in 2000 for a new kind of GPS. The technology existed back then, but until now the data has been missing (we're gradually getting there). If every brick-and-mortar store fed an XML feed of inventory and prices to a central service (probably Google), you could tie that to location data and have a GPS that let you search by product and price rather than by store name or address.
Let's say you're looking for a charging cable for an iPod. Instead of navigating to the nearest Fry's Electronics, you would type in "iPod charger" and sort by price or distance. The results might surprise you...it turns out TJ Maxx has them for $0.99.
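The idea above can be sketched in a few lines. This is purely hypothetical: the `LISTINGS` index, the store entries, and the coordinates are made-up toy data standing in for the aggregated per-store feeds, and `search`/`distance_km` are names I've invented for illustration.

```python
import math

# Hypothetical central index: each entry is one store's listing for a
# product, as might be aggregated from per-store inventory feeds.
# Prices and coordinates are invented toy data.
LISTINGS = [
    {"store": "Fry's Electronics", "product": "iPod charger",
     "price": 14.99, "lat": 37.42, "lon": -122.01},
    {"store": "TJ Maxx", "product": "iPod charger",
     "price": 0.99, "lat": 37.40, "lon": -122.08},
    {"store": "Borders", "product": "iPod charger",
     "price": 9.99, "lat": 37.45, "lon": -122.16},
]

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def search(query, here, sort_by="price"):
    """Find matching listings, sorted by price or by distance from `here`."""
    hits = [l for l in LISTINGS if query.lower() in l["product"].lower()]
    if sort_by == "distance":
        hits.sort(key=lambda l: distance_km(here[0], here[1], l["lat"], l["lon"]))
    else:
        hits.sort(key=lambda l: l["price"])
    return hits

# Sorted by price, TJ Maxx's $0.99 cable comes first.
best = search("iPod charger", here=(37.41, -122.05))[0]
```

The hard part, as the comment notes, isn't the query logic; it's getting every store to publish that feed in the first place.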
Now, if we could just get the stores to give up that data!
I'd think it'd be the losing stores that would do this first (I'm thinking Borders), since they might not be the first store you'd go to. If and when that kicked up sales, other stores would have to follow. What a utopia that'd be, though. But a store like Borders can't even tell you for sure, from its own system, that something _is_ in the store.
aka The importance of context. (Not just in valuing ideas, but also in generating them.)
As an aside, this is the second time this morning that this concept has come up in my personal communications. (Dare I note some context in this, itself?)
In the case of Google, he actually makes a pretty strong case for the algorithm, not the data. The data (the links) were always there. It was precisely the algorithm that generated information from that data.
And a Bayesian net is itself clearly the byproduct of applying an algorithm to a data space.
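The "algorithm generates information from the data" point can be made concrete with a minimal power-iteration sketch of PageRank-style ranking. This is a toy illustration of the general idea, not Google's actual implementation; the link graph and function name are mine.

```python
# Minimal power-iteration PageRank over a toy link graph. The links are
# the "data"; the ranking the algorithm extracts is the "information".
def pagerank(links, damping=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # uniform initial rank
    for _ in range(iters):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:                  # distribute rank along out-links
                    new[q] += share
            else:                               # dangling page: spread evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
# "c" is linked to by both "a" and "b", so it ends up ranked highest.
```

The raw link structure says nothing by itself; running the iteration over it is what produces the ordering.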
The Netflix case is anecdotal, but consider the following (equally anecdotal) counterexample: the (massive) increase in the data available to humans since the advent of networks has not contributed to any significant increase in the general intelligence of the population.
Unless by AI he is referring to the highly narrow case of machine creativity given little to no input data, algorithms (obviously) do require data sources.
I think you're right that the post is slightly contradictory, but I believe his premise is correct. In all that I have studied on machine learning, in both academia and start-up land, I have observed that you consistently pick algorithms to glean the information you need from a particular dataset.
In many cases you can do what my graduate advisor recommends: "keep it simple, stupid," meaning that perhaps all that is needed is a k-NN approach and Euclidean distance. But sometimes the data is highly overlapping and complex, so you have to go with a more rigorous means of classification or whatnot.
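For reference, the "keep it simple" baseline mentioned above fits in a dozen lines. This is a generic sketch of k-NN with Euclidean distance on made-up toy data, not anything from the original post; the names `knn_predict` and `train` are mine.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points under Euclidean distance."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: two well-separated clusters, where k-NN is plenty.
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.9, 5.2), "b")]
label = knn_predict(train, (0.3, 0.3))  # nearest neighbors are all "a"
```

When the clusters overlap heavily, this majority vote starts failing, which is exactly when the more rigorous methods the comment mentions earn their keep.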
Finally, it should be noted that machine learning techniques are relatively new. Neural networks have been around for quite some time and have well-documented advantages. ML, by contrast, is still a lot of black magic (tweaking of various magic parameters and such), so the benefits of the various algorithms are somewhat subjective.
[1] http://googleresearch.blogspot.com/2009/03/unreasonable-effe...