Although Cassandra query language resembles with SQL language, their data modelling methods are totally different.
In Cassandra, a bad data model can degrade performance, especially when users try to implement the RDBMS concepts on Cassandra. It is best to keep in mind few rules detailed below.
In this tutorial, you will learn-
- Cassandra Data Model Rules
- Model Your Data in Cassandra
- Handling One to One Relationship
- Handling one to many relationships
- Handling Many to Many Relationship
In Cassandra, writes are not expensive. Cassandra does not support joins, group by, OR clause, aggregations, etc. So you have to store your data in such a way that it should be completely retrievable. So these rules must be kept in mind while modelling data in Cassandra.
- Maximize the number of writes
In Cassandra, writes are very cheap. Cassandra is optimized for high write performance. So try to maximize your writes for better read performance and data availability. There is a tradeoff between data write and data read. So, optimize you data read performance by maximizing the number of data writes.
- Maximize Data Duplication
Data denormalization and data duplication are defacto of Cassandra. Disk space is not more expensive than memory, CPU processing and IOs operation. As Cassandra is a distributed database, so data duplication provides instant data availability and no single point of failure.
Data Modeling Goals
You should have following goals while modelling data in Cassandra.
- Spread Data Evenly Around the Cluster
You want an equal amount of data on each node of Cassandra cluster. Data is spread to different nodes based on partition keys that is the first part of the primary key. So, try to choose integers as a primary key for spreading data evenly around the cluster.
- Minimize number of partitions read while querying data
Partition are a group of records with the same partition key. When the read query is issued, it collects data from different nodes from different partitions.
If there will be many partitions, then all these partitions need to be visited for collecting the query data.
It does not mean that partitions should not be created. If your data is very large, you can’t keep that huge amount of data on the single partition. The single partition will be slowed down.
So try to choose a balanced number of partitions.
Good Primary Key
Let’s take an example and find which primary key is good.
Here is the table MusicPlaylist.
Create table MusicPlaylist ( SongId int, SongName text, Year int, Singer text, Primary key(SongId, SongName) );
In above example, table MusicPlaylist,
- Songid is the partition key, and
- SongName is the clustering column
- Data will be clustered on the basis of SongName. Only one partition will be created with the SongId. There will not be any other partition in the table MusicPlaylist.
Data retrieval will be slow by this data model due to the bad primary key.
Here is another table MusicPlaylist.
Create table MusicPlaylist ( SongId int, SongName text, Year int, Singer text, Primary key((SongId, Year), SongName) );
In above example, table MusicPlaylist,
- Songid and Year are the partition key, and
- SongName is the clustering column.
- Data will be clustered on the basis of SongName. In this table, each year, a new partition will be created. All the songs of the year will be on the same node. This primary key will be very useful for the data.
Our data retrieval will be fast by this data model.
Following things should be kept in mind while modelling your queries.
- Determine what queries you want to support
First of all, determine what queries you want.
For example, do you need?
- Group by
- Filtering on which column etc.
Create table according to your queries. Create a table that will satisfy your queries. Try to create a table in such a way that a minimum number of partitions needs to be read.
One to one relationship means two tables have one to one correspondence. For example, the student can register only one course, and I want to search on a student that in which course a particular student is registered in.
So in this case, your table schema should encompass all the details of the student in corresponding to that particular course like the name of the course, roll no of the student, student name, etc.
Create table Student_Course ( Student rollno int primary key, Student_name text, Course_name text, );
One to many relationships means having one to many correspondence between two tables.
For example, a course can be studied by many students. I want to search all the students that are studying a particular course.
So by querying on course name, I will have many student names that will be studying a particular course.
Create table Student_Course ( Student_rollno int, Student_name text, Course_name text, );
I can retrieve all the students for a particular course by the following query.
Select * from Student_Course where Course_name='Course Name';
Many to many relationships means having many to many correspondence between two tables.
For example, a course can be studied by many students, and a student can also study many courses.
I want to search all the students that are studying a particular course. Also, I want to search all the course that a particular student is studying.
So in this case, I will have two tables i.e. divide the problem into two cases.
First, I will create a table by which you can find courses by a particular student.
Create table Student_Course ( Student_rollno int primary key, Student_name text, Course_name text, );
I can find all the courses by a particular student by the following query.