
Databases -- must be a better way

Posted : Friday, 26 August 2011, 15:14
HoboBen
For a CMS, I will have articles, and each article can have several revisions.

I was thinking of having an article table and a revision table:

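The original snippet didn't survive the forum archive; the table and column names below are guesses at the shape being described (articles in one table, their revisions in another), sketched with Python's built-in sqlite3 module:

```python
import sqlite3

# A guess at the two-table layout described above; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Articles (
    id      INTEGER PRIMARY KEY AUTOINCREMENT,
    created TEXT NOT NULL                  -- article creation time
);
CREATE TABLE Revisions (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    articleId INTEGER NOT NULL REFERENCES Articles(id),
    title     TEXT NOT NULL,
    body      TEXT NOT NULL,
    timestamp TEXT NOT NULL                -- when this revision was saved
);
""")
```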

As you can see, this is normalised so that no data is duplicated (generally a good thing, I think?).

But if I want to display a list of articles, I first have to query the articles table, then for each article, query the revisions table to get the title. This might mean 50 queries on a page!

Surely there's a better way?

I could duplicate the revision title column into the articles table, so that the revision table doesn't need to be queried.

I could combine the article and revision tables together, and do one query (possibly using DISTINCT or similar on the articleId field to only get the latest revision). Does that sound like a good idea? In which case, how do I safely get an auto-incremented id for new articles? (I would guess by locking the table first before a MAX query?)
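For reference, most databases hand the generated id straight back after an INSERT, so no table lock or MAX() query is needed; a minimal sketch with SQLite (the table name is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Articles (id INTEGER PRIMARY KEY AUTOINCREMENT, created TEXT)")

# Insert without supplying an id; the database allocates one atomically,
# so concurrent inserts can never be handed the same id.
cur = conn.execute("INSERT INTO Articles (created) VALUES (datetime('now'))")
new_id = cur.lastrowid
```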

I haven't had much practice with databases, so I thought it best to ask in case I was doing this completely the wrong way.

Thanks!

-----
Posted : Friday, 26 August 2011, 15:41
JL235
You need to use 'joins'. There are two ways; the common way is an implicit join:
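The query itself was lost from the archive; going by the description that follows, it was presumably the implicit-join form, selecting from both tables with the key condition in the WHERE clause (sketched here with sqlite3; table and column names are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Articles  (id INTEGER PRIMARY KEY);
CREATE TABLE Revisions (id INTEGER PRIMARY KEY, articleId INTEGER, title TEXT);
INSERT INTO Articles VALUES (1), (2);
INSERT INTO Revisions VALUES (1, 1, 'first draft'), (2, 1, 'edit'), (3, 2, 'other');
""")

# The implicit-join form: select from both tables, keeping only the
# rows where the keys cross over.
rows = conn.execute("""
    SELECT *
    FROM   Articles, Revisions
    WHERE  Articles.id = Revisions.articleId
""").fetchall()
# One row per matching (article, revision) pair.
```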
I select from two tables, but the 'id = articleID' condition specifies that I am only interested in the rows where the keys cross over.

In practice that will pull out a row containing both the Article and Revision data, for each matching Article and Revision.

So if you have 5 articles, and each has 2 revisions, then you have 10 rows returned.

Another way of thinking about it is that this will return a list of all revisions, but each revision row has its article data included. This means you have a lot of repeated article data, but that is _FAR_ cheaper than performing multiple SQL queries.

The other way to do it is using the explicit JOIN syntax, which allows you to do funky stuff like pulling out data even if the join fails (the missing side's fields will be NULL).
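A short sketch of that explicit form, using a LEFT JOIN so articles without any revision still come back (again with illustrative names, via sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Articles  (id INTEGER PRIMARY KEY);
CREATE TABLE Revisions (id INTEGER PRIMARY KEY, articleId INTEGER, title TEXT);
INSERT INTO Articles VALUES (1), (2);
INSERT INTO Revisions VALUES (1, 1, 'only article 1 has a revision');
""")

# LEFT JOIN keeps every article even when the join "fails";
# the revision columns come back as NULL (None in Python).
rows = conn.execute("""
    SELECT Articles.id, Revisions.title
    FROM   Articles
    LEFT JOIN Revisions ON Revisions.articleId = Articles.id
""").fetchall()
```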

However, if you use some frameworks (like Ruby on Rails) then it can handle all of these joins for you, hiding them away. It allows you to get more of your CMS built, in less time. I use my own PHP database code (which I'm actually editing right now) to handle joins for me.

You might also want to move the creation time out into the revision. Each revision should probably have a timestamp, and the article creation time will be the same as that of the first revision.

-----
PlayMyCode.com - build and play in your browser, Blog, Twitter.
Homepage : http://www.StudioFortress.com
Posted : Friday, 26 August 2011, 17:14
HoboBen
Thanks, that makes a lot of sense. I've read up on joins.

You're right that the article creation time is the same as the first revision's timestamp - it might make sense to remove the articles table completely.

I guess getting a list of say 50 distinct articles with the latest revision data would then look something like:

select *
from Revisions
where articleId > $offset  -- for pagination
group by articleId         -- unique article id
order by max(timestamp)
limit 0,50

Although I'm a bit dodgy on using max. I'll have to test, but I don't know if it should be in order by or part of the select.
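For what it's worth, a quick way to test this: SQLite (3.7.11 and later) defines that when MAX() appears in the SELECT list of a GROUP BY query, the bare columns of each group come from the row holding the maximum, so putting MAX(timestamp) in the SELECT does pick the latest revision per article there. Other engines (MySQL, for one) do not guarantee this, and a subquery or window function is needed instead. A sketch with illustrative names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Revisions (id INTEGER PRIMARY KEY, articleId INTEGER,
                        title TEXT, timestamp INTEGER);
INSERT INTO Revisions VALUES
    (1, 1, 'old title', 100),
    (2, 1, 'new title', 200),
    (3, 2, 'only rev',  150);
""")

# SQLite-specific: with MAX() in the SELECT list, each group's bare
# columns are taken from the row that holds the maximum timestamp,
# i.e. the latest revision of each article.
rows = conn.execute("""
    SELECT articleId, title, MAX(timestamp)
    FROM   Revisions
    GROUP BY articleId
    ORDER BY articleId
    LIMIT 50
""").fetchall()
```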

Thanks for the help.

-----
Posted : Friday, 26 August 2011, 17:49
JL235
I don't think you should use articleID for pagination. It would be very easy to mess up your pagination, for example by deleting or hiding an article.

You'll also find in the future you will have more bits of data to store which are specific to an article. For example a pretty URL, a publication date, or a flag to say if the article is hidden or public.

In contrast to what I said above, it is probably a good idea to keep the creation time in the Article table as well, so it would be stored twice. Otherwise you will have to pull out the first revision every time you pull out an article, which I'd imagine you'll be doing every time the article is displayed.

That is known as 'denormalization', where you go against normalization in order to improve performance and efficiency.

Normalization is a bit like functional programming. It's mainly a theoretical and aesthetic point of view, to help build DB designs that avoid common DB issues. For example, you don't store data in multiple places, ensuring that if you update the data, the various copies can't go out of sync.

Good DB design is just as difficult as programming.

-----
Posted : Friday, 26 August 2011, 19:07
HoboBen
Normally I'd agree that it's not the best way to do pagination, but it will work in this specific case, mainly because there are no deletions (so you can always revert). I'd like to avoid using a LIMIT offset, as that gets slower as the table gets bigger. I also intend to use static HTML as much as possible, so that you only hit the DB if you're logged in.

On de-normalization -- knowing where to use it seems like a black art. Do you just have to get a feel for it, or are there really some rules?

Lastly, is it worth splitting tables into two, e.g. one for querying by (e.g. fixed size title, timestamp, section, id) and one for variable text-heavy content (e.g. synopsis, body) and then joining them when necessary? (if nothing else, I assume this would help reduce fragmentation?)

Thanks again; this is all very helpful.

-----
Posted : Friday, 26 August 2011, 22:59
Stealth
HoboBen wrote: "On de-normalization -- knowing where to use it seems like a black art. Do you just have to get a feel for it, or are there really some rules?"
You probably don't need to worry about this unless you're running a high-traffic database. In fact, JOINing tables is worse performance-wise, as there is more overhead.

HoboBen wrote: "Lastly, is it worth splitting tables into two, e.g. one for querying by (e.g. fixed size title, timestamp, section, id) and one for variable text-heavy content (e.g. synopsis, body) and then joining them when necessary?"
Datatypes such as longtext are stored separately on the disk from the row data. If you don't request these fields then the database isn't going to bother with them.


If you're concerned with performance then definitely look in to caching. A simple file cache can be used and you only need to update it when someone modifies the database.
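A minimal sketch of that idea, assuming a cache file that is simply deleted whenever the database changes (the path, table, and function names are all hypothetical):

```python
import json
import os
import sqlite3

CACHE_PATH = "article_list.json"   # hypothetical cache file location


def get_article_list(conn):
    # Serve the cached copy when present; rebuild it from the DB otherwise.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH) as f:
            return json.load(f)
    rows = [list(r) for r in conn.execute("SELECT id, title FROM Articles")]
    with open(CACHE_PATH, "w") as f:
        json.dump(rows, f)
    return rows


def invalidate_cache():
    # Call this whenever someone modifies the database.
    if os.path.exists(CACHE_PATH):
        os.remove(CACHE_PATH)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO Articles VALUES (1, 'hello')")
invalidate_cache()                 # start from a clean state
first = get_article_list(conn)     # miss: queries the DB, writes the cache
second = get_article_list(conn)    # hit: read straight from the file
invalidate_cache()                 # clean up the cache file
```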

-----
Andrew // stealth
"Some people see things as they are and say why? I dream things that never were and say why not?" - Robert Kennedy
Homepage : http://strubhar.me
Posted : Friday, 26 August 2011, 23:47
HoboBen
That's very helpful, thanks Stealth!
Posted : Saturday, 27 August 2011, 08:32
JL235
However, if you pull out all the data, like using *, then it's much faster to use VARCHAR over TEXT. That is because VARCHAR values are stored within the table row, and so only require going to disk once, rather than twice.

-----