Latest Uploads
Day 1 - Ga ... eboy Racer

rychan

Pain_Claim.png

Jayenkai

Monkey Maz ... pload test

Pakz

image-26-1 ... 37-09.jpeg

Jayenkai

image-14-1 ... 50-52.jpeg

Jayenkai

Socoder_9yo.png

Jayenkai

Forum Home

Databases -- must be a better way

UserMessage
Posted : Friday, 26 August 2011, 15:14 | Permalink
HoboBen


WW Entries : 9
For a CMS, I will have articles, and each article can have several revisions.

I was thinking of having an article table and a revision table:

-->

As you can see, this is denormalised so that no data is duplicated (generally a good thing, I think?)

But if I want to display a list of articles, I first have to query the articles table, then for each article, query the revision to get the title. This might mean 50 queries on a page!

Surely there's a better way?

I could duplicate the revision title column into the articles table, so that the revision table doesn't need to be queried.

I could combine the article and revision tables together, and do one query (possibly using DISTINCT or similar on the articleId field to only get the latest revision). Does that sound like a good idea? In which case, how do I safely get an auto-incremented id for new articles? (I would guess by locking the table first before a MAX query?)

I haven't had much practice with databases, so I thought it best to ask in case I was doing this completely the wrong way.

Thanks!

-----
Posted : Friday, 26 August 2011, 15:41 | Permalink | Mark Here
JL235


WW Entries : 7
You need to use 'joins'. There are two ways, the common way is:
-->
I select from two tables, but the 'id = articleID' specifies I am only interested in the data that crosses over.

In practice that will pull out a row containing both the Article and Revision data, for each Article and each Revision.

So if you have 5 articles, and each has 2 revisions, then you have 10 rows returned.

Another way of thinking about it is that this will return a list of all revisions, but each revision row has it's article data included. This means you have a lot of repeated article data, but that is _FAR_ cheaper then performing multiple SQL queries.

The other way to do it is using the proper JOIN syntax, which allows you to do funky stuff like pulling out data even if the join fails (the fields will have null).

However, if you use some frameworks (like Ruby on Rails) then it can handle all of these joins for you, hiding them away. It allows you to get more of your CMS built, in less time. I use my own PHP database code (which I'm actually editing right now) to handle joins for me.

You might also want to move the creation time out into the revision. Each revision should probably have a time stamp, and the article creation time will be the same as the first revision.

-----
PlayMyCode.com - build and play in your browser, Blog, Twitter.
Homepage : http://www.StudioFortress.com
Posted : Friday, 26 August 2011, 17:14 | Permalink | Mark Here
HoboBen


WW Entries : 9
Thanks, that makes a lot of sense. I've read up on joins.

You're right that article creation time is the same as the first revision time stamp - it might make sense to remove the articles table completely.

I guess getting a list of say 50 distinct articles with the latest revision data would then look something like:

select *
from Revisions
where articleId > $offset // for pagination
group by articleId // unique article id
order by max(timestamp)
limit 0,50

Although I'm a bit dodgy on using max. I'll have to test, but I don't know if it should be in order by or part of the select.

Thanks for the help.

-----
Posted : Friday, 26 August 2011, 17:49 | Permalink | Mark Here
JL235


WW Entries : 7
I don't think you should use articleID for pagination. It would be very easy to mess up your pagination, such as deleting or hiding an article.

You'll also find in the future you will have more bits of data to store, which is specific to an article. For example storing a pretty url, a publication date, or a flag to say if the article is hidden or public.

In contrast to what I said above, it is probably a good idea to also keep the creation time in the Article table as well. So it would be stored twice. Otherwise you will have to pull out the first revision, every time you pull out an article, which I'd imagine you'll be doing every time the article is displayed.

That is known as 'de-nomalization', where you go against normalization, in order to improve performance and efficiency.

Normalization is a bit like functional programming. It's mainly from a theoretical and aesthetic point of view, to help build DB designs where it is difficult to end up with common DB issues. For example you don't store data in multiple places, to ensure that if you update the data, you can't end up with all those locations going out of sync.

Good DB design is just as difficult as programming.

-----
PlayMyCode.com - build and play in your browser, Blog, Twitter.
Homepage : http://www.StudioFortress.com
Posted : Friday, 26 August 2011, 19:07 | Permalink | Mark Here
HoboBen


WW Entries : 9
Normally I'd agree that it's not the best way to do pagination generally, but it will work in this specific case (mainly because there are no deletions (so that you can always revert)). I'd like to avoid using a limit offset as it gets slower as the table gets bigger. I also intend to use static HTML as much as possible so that you only hit the DB if you're logged in.

On de-normalization -- knowing where to use it seems like a black art. Do you just have to get a feel for it, or are there really some rules?

Lastly, is it worth splitting tables into two, e.g. one for querying by (e.g. fixed size title, timestamp, section, id) and one for variable text-heavy content (e.g. synopsis, body) and then joining them when necessary? (if nothing else, I assume this would help reduce fragmentation?)

Thanks again; this is all very helpful.

-----
Posted : Friday, 26 August 2011, 22:59 | Permalink | Mark Here
Stealth


HoboBen On de-normalization -- knowing where to use it seems like a black art. Do you just have to get a feel for it, or are there really some rules?


You probably don't need to worry about this unless you're running a high traffic database. In fact, JOINing tables is worst performance wise as there is more overhead.

HoboBen Lastly, is it worth splitting tables into two, e.g. one for querying by (e.g. fixed size title, timestamp, section, id) and one for variable text-heavy content (e.g. synopsis, body) and then joining them when necessary?


Datatypes such as longtext are stored separately on the disk from the row data. If you don't request these fields then the database isn't going to bother with them.


If you're concerned with performance then definitely look in to caching. A simple file cache can be used and you only need to update it when someone modifies the database.

-----
Andrew // stealth
"Some people see things as they are and say why? I dream things that never were and say why not?" - Robert Kennedy
Homepage : http://strubhar.me
Posted : Friday, 26 August 2011, 23:47 | Permalink | Mark Here
HoboBen


That's very helpful, thanks Stealth!
Posted : Saturday, 27 August 2011, 08:32 | Permalink | Mark Here
JL235


WW Entries : 7
However if you pull out all data, like using *, then it's much faster to use VARCHAR over Text. That is because VARCHAR values are stored within the table, and so only require going to disk once, rather then twice.

-----
PlayMyCode.com - build and play in your browser, Blog, Twitter.
Homepage : http://www.StudioFortress.com
-=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- (c) WidthPadding Industries 1987 -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=- -=+=-
Latest Posts
Happy Birthday, Stealth
Jayenkai Tue 03:10
AGameAWeek : 2016 - Part One
Jayenkai Mon 13:22
Tin Can Alley
steve_ancell Mon 05:02
Die Alien Slime *cough* Walkthru
GfK Sun 14:52
Anybody recommend an artist?
GfK Sun 14:44
Pebble Watch Clocks
Jayenkai Sun 10:01
Global Warming
rockford Sun 05:49
Happy Birthday, Jay's Mum
spinal Sun 01:34
NerdBlock
Jayenkai Sat 05:13
Top Gear V3
Jayenkai Fri 14:15
More

Latest Items
Showcase : Monkey Plotting
zzoom Tue 15:46
Showcase : Waifu Clicker
rychan Mon 14:54
Showcase : Space Station A6-100
Jayenkai Wed 12:20
Showcase : Novascape
rychan Fri 01:09
Showcase : Space INvasion Zero
rychan Wed 15:38
Showcase : Harpoons and Balls
Jayenkai Tue 02:46
Showcase : NeonPlat's Cosmic Adventure
Jayenkai Fri 12:56
Dev-Diary : Centipede remake
steve_ancell Wed 19:25
Showcase : Beta Collexion
rychan Mon 13:55
Dev-Diary : Global Variables!
rychan Thu 06:33
Showcase : Spike Dislike GB
rychan Wed 04:18
Blog : Back on the run
rychan Wed 01:03
Showcase : The Catcher Van
rychan Mon 04:40
Showcase : GodOfMiniWars
gameleaper Sun 20:12
Showcase : WarTactical
gameleaper Mon 17:08
More

Who's Online
Hotshot
Tue, at 07:39
rockford
Tue, at 07:23
magicman
Tue, at 07:06
spinal
Tue, at 06:59
Jayenkai
Tue, at 06:38
Dabz
Tue, at 06:14
GfK
Tue, at 05:15
steve_ancell
Tue, at 05:00
rychan
Tue, at 04:45
Pakz
Tue, at 04:23
Link to this page
Site : Jayenkai 2006-Infinity | MudChat's origins, BBCode's former life, Image Scaler.