Latest Uploads
isometric test

realtime

dungeon ma ... litz basic

realtime

Crystal Ca ... een So Far

rychan

Crystal Ca ... Screen V1

rychan

Day 5 Graphic

rychan

Mutanty Mo ... Workfiles.

steve_ancell

Forum Home

Databases -- must be a better way

UserMessage
Posted : Friday, 26 August 2011, 14:14 | Permalink
HoboBen


WW Entries : 9
For a CMS, I will have articles, and each article can have several revisions.

I was thinking of having an article table and a revision table:

-->

As you can see, this is denormalised so that no data is duplicated (generally a good thing, I think?)

But if I want to display a list of articles, I first have to query the articles table, then for each article, query the revision to get the title. This might mean 50 queries on a page!

Surely there's a better way?

I could duplicate the revision title column into the articles table, so that the revision table doesn't need to be queried.

I could combine the article and revision tables together, and do one query (possibly using DISTINCT or similar on the articleId field to only get the latest revision). Does that sound like a good idea? In which case, how do I safely get an auto-incremented id for new articles? (I would guess by locking the table first before a MAX query?)

I haven't had much practice with databases, so I thought it best to ask in case I was doing this completely the wrong way.

Thanks!

-----
Posted : Friday, 26 August 2011, 14:41 | Permalink | Mark Here
JL235


WW Entries : 7
You need to use 'joins'. There are two ways, the common way is:
-->
I select from two tables, but the 'id = articleID' specifies I am only interested in the data that crosses over.

In practice that will pull out a row containing both the Article and Revision data, for each Article and each Revision.

So if you have 5 articles, and each has 2 revisions, then you have 10 rows returned.

Another way of thinking about it is that this will return a list of all revisions, but each revision row has it's article data included. This means you have a lot of repeated article data, but that is _FAR_ cheaper then performing multiple SQL queries.

The other way to do it is using the proper JOIN syntax, which allows you to do funky stuff like pulling out data even if the join fails (the fields will have null).

However, if you use some frameworks (like Ruby on Rails) then it can handle all of these joins for you, hiding them away. It allows you to get more of your CMS built, in less time. I use my own PHP database code (which I'm actually editing right now) to handle joins for me.

You might also want to move the creation time out into the revision. Each revision should probably have a time stamp, and the article creation time will be the same as the first revision.

-----
PlayMyCode.com - build and play in your browser, Blog, Twitter.
Homepage : http://www.StudioFortress.com
Posted : Friday, 26 August 2011, 16:14 | Permalink | Mark Here
HoboBen


WW Entries : 9
Thanks, that makes a lot of sense. I've read up on joins.

You're right that article creation time is the same as the first revision time stamp - it might make sense to remove the articles table completely.

I guess getting a list of say 50 distinct articles with the latest revision data would then look something like:

select *
from Revisions
where articleId > $offset // for pagination
group by articleId // unique article id
order by max(timestamp)
limit 0,50

Although I'm a bit dodgy on using max. I'll have to test, but I don't know if it should be in order by or part of the select.

Thanks for the help.

-----
Posted : Friday, 26 August 2011, 16:49 | Permalink | Mark Here
JL235


WW Entries : 7
I don't think you should use articleID for pagination. It would be very easy to mess up your pagination, such as deleting or hiding an article.

You'll also find in the future you will have more bits of data to store, which is specific to an article. For example storing a pretty url, a publication date, or a flag to say if the article is hidden or public.

In contrast to what I said above, it is probably a good idea to also keep the creation time in the Article table as well. So it would be stored twice. Otherwise you will have to pull out the first revision, every time you pull out an article, which I'd imagine you'll be doing every time the article is displayed.

That is known as 'de-nomalization', where you go against normalization, in order to improve performance and efficiency.

Normalization is a bit like functional programming. It's mainly from a theoretical and aesthetic point of view, to help build DB designs where it is difficult to end up with common DB issues. For example you don't store data in multiple places, to ensure that if you update the data, you can't end up with all those locations going out of sync.

Good DB design is just as difficult as programming.

-----
PlayMyCode.com - build and play in your browser, Blog, Twitter.
Homepage : http://www.StudioFortress.com
Posted : Friday, 26 August 2011, 18:07 | Permalink | Mark Here
HoboBen


WW Entries : 9
Normally I'd agree that it's not the best way to do pagination generally, but it will work in this specific case (mainly because there are no deletions (so that you can always revert)). I'd like to avoid using a limit offset as it gets slower as the table gets bigger. I also intend to use static HTML as much as possible so that you only hit the DB if you're logged in.

On de-normalization -- knowing where to use it seems like a black art. Do you just have to get a feel for it, or are there really some rules?

Lastly, is it worth splitting tables into two, e.g. one for querying by (e.g. fixed size title, timestamp, section, id) and one for variable text-heavy content (e.g. synopsis, body) and then joining them when necessary? (if nothing else, I assume this would help reduce fragmentation?)

Thanks again; this is all very helpful.

-----
Posted : Friday, 26 August 2011, 21:59 | Permalink | Mark Here
Stealth


HoboBen On de-normalization -- knowing where to use it seems like a black art. Do you just have to get a feel for it, or are there really some rules?


You probably don't need to worry about this unless you're running a high traffic database. In fact, JOINing tables is worst performance wise as there is more overhead.

HoboBen Lastly, is it worth splitting tables into two, e.g. one for querying by (e.g. fixed size title, timestamp, section, id) and one for variable text-heavy content (e.g. synopsis, body) and then joining them when necessary?


Datatypes such as longtext are stored separately on the disk from the row data. If you don't request these fields then the database isn't going to bother with them.


If you're concerned with performance then definitely look in to caching. A simple file cache can be used and you only need to update it when someone modifies the database.

-----
Andrew // stealth
"Some people see things as they are and say why? I dream things that never were and say why not?" - Robert Kennedy
Homepage : http://strubhar.me
Posted : Friday, 26 August 2011, 22:47 | Permalink | Mark Here
HoboBen


That's very helpful, thanks Stealth!
Posted : Saturday, 27 August 2011, 07:32 | Permalink | Mark Here
JL235


WW Entries : 7
However if you pull out all data, like using *, then it's much faster to use VARCHAR over Text. That is because VARCHAR values are stored within the table, and so only require going to disk once, rather then twice.

-----
PlayMyCode.com - build and play in your browser, Blog, Twitter.
Homepage : http://www.StudioFortress.com
Latest Posts
AGameAWeek Progress Report
Jayenkai Wed 16:12
Little Lever News
rychan Wed 11:02
Rooting for OUYA
9572AD Tue 19:07
Scene Demos
rockford Tue 07:46
Happy Easter
rockford Mon 11:10
Jay gets a Wii U
Jayenkai Sun 07:14
Win7 Installing
dna Sat 13:32
A New Theme!?
Jayenkai Sat 05:30
Ramble ramble/finishing stuff
therevillsgames Fri 19:41
Mint
Dabz Fri 01:02
More

Latest Items
Dev-Diary : Another Evening Of Coding Approaches!
rychan Wed 15:27
Snippet : Balls!
rychan Wed 13:49
Blog : DX9Engine
Kuron Tue 05:22
Family : :c(
Kuron Tue 04:27
Dev-Diary : Yet another one...
Jayenkai Tue 00:01
Blog : EmuMenu
rickdangerous Fri 04:16
Showcase : BambooBasic
Dabz Thu 18:46
Blog : FInally Some Game Dev Time
rychan Thu 15:35
Showcase : isometric prototype
Jayenkai Thu 04:31
Showcase : dungeon maker algorithm
Jayenkai Wed 01:42
Dev-Diary : More Screens
Kuron Sat 09:01
News : Newsletter #221
Dabz Fri 10:07
Dev-Diary : Crystal Catch Development
rychan Thu 14:38
Link : Sfxr
rychan Thu 04:16
Showcase : SpringySquirrel
Jayenkai Tue 22:58
More

Who's Online
steve_ancell
Wed, at 19:55
Dabz
Wed, at 19:29
Jayenkai
Wed, at 17:59
9572AD
Wed, at 17:31
Kuron
Wed, at 17:08
rychan
Wed, at 17:02
dna
Wed, at 16:46
blanko1324
Wed, at 16:34
shroom_monk
Wed, at 16:18
Erebel55
Wed, at 12:54
Link to this page
Site : Jayenkai 2006-Infinity | MudChat's origins, BBCode's former life, Image Scaler.