Database data analysis

-=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- (c) WidthPadding Industries 1987 0\|189\|0 -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=-
Socoder -> Web Development -> Database data analysis
Thu, 12 Aug 2010, 08:41
HoboBen	Say you had a database table for subscriptions to music artists: subscriptions (userId INT, artistId, INT) If you wanted to do a "similar likes" feature, you could do something like this: likes (artistId1 INT, artistId2 INT, score INT) Where score is the number of users who like both artists, divided by the the number of users who like either artist. A higher score means a stronger similar taste. However, with 500,000 artists, that leaves 500,000^2 = 250,000,000,000 combinations (okay, less because artistId1, artistId2 is the same as artistId2, artistId1) So obviously, that won't work. Any ideas? -=-=- blog \| work \| code \| more code
Thu, 12 Aug 2010, 09:33
HoboBen	Ahhh, I've just thought, it's quite obvious... I should only save the highest scoring, say, 10,000 links? Perhaps even much less until I get a larger subscriber dataset. But it should be simple with a few "count" queries. Obviously there's no way of knowing the highest 5000 until they're counted (even if I set a minimum score threshold), so there might be a lot of database I/O for the initial set, but it only needs to be done occasionally. If anyone has any other thoughts though, I'm all ears. -=-=- blog \| work \| code \| more code
Thu, 12 Aug 2010, 10:26
JL235	There are existing similar likes features on other sites. I believe Amazon does a 'people who bought this also bought...' thing too which is similar. Rather then re-inventing the wheel you could find one to scrounge off and use for your site.
Thu, 12 Aug 2010, 10:28
Jayenkai	Every time I think I have it, the numbers all start to unravel, and it all gets super complicated again! Can't quite grasp the potential simplicity of it all... sorry! -=-=- ''Load, Next List!''
Thu, 12 Aug 2010, 11:07
HoboBen	@DD, I would except not all artists are covered by individual APIs/databases. Additionally you run into licensing issues as "factual/objective" data is usually licensed differently to "community" data. And on top of that, Amazon's API is the worst shit I've ever seen. @Jay, no worries! -=-=- blog \| work \| code \| more code
Thu, 12 Aug 2010, 12:11
Stealth	I would do the like system the way you have envisioned. Realistically, you'll never fill it with that many rows (why would every single artist be like every other artist?). Only the most popular artists will get linked to other ones (top 10,000), and I can't imagine most artists will have more than 10 similar. This puts you at 100,000 realistic entries. You can easily scale past 1 million entries without killing MySQL. Just make you you index the IDs. At work we have clients with tables that are nearly 10 million rows and they can still query complex SELECTs in 0.5 secs with indexing on. As I read in a good application design book: Believe it or not, the bigger problem isn't scaling, it's getting to the point where you have to scale. -=-=- Quit posting and try Google.
Thu, 12 Aug 2010, 14:17
HoboBen	Thanks for the input Stealth; that was very helpful. -=-=- blog \| work \| code \| more code
Thu, 12 Aug 2010, 19:14
CodersRule	Your use of 'less' rather than 'fewer' is bothering me.
Thu, 12 Aug 2010, 21:55
JL235	I've also mentioned many times before that I built a gigantic DB at HSBC (had over 100 millions). Even though mySQL is meant to scale really badly the front-end app was still able to pull out stats in real-time.