Apache Cassandra Notes

Russell Bateman
August 2017
last update:

It turns out that Cassandra was so named as an allusion to a curse on an oracle, the pun being aimed at the software giant Oracle and its famous RDBMS.

The actual history of the prophetess is rather murky and sordid coming as it does from myriad sources and inspirations. The synthesis I'm used to is that she was given the prophetic ability by Apollo in a hoped-for exchange of her womanly pleasures, but she backed out at the last minute whereupon the god spat in her mouth condemning her always to prophesy and never be believed. She was simply thought to be mad.

So it is that she famously warned against bringing the Achaean offering left behind into the gates of Troy. Despite her being the first-family daughter of Priam and Hecuba, her warning was ignored, leading to the well-known city's infamous downfall.

Test first, code second, that's the order...

Setting up to unit-test with Cassandra...

Here's what I'm using in pom.xml:

<properties>
  <cassandra.version>3.3.0</cassandra.version>
  <cassandra-unit.version>3.1.3.2</cassandra-unit.version>
  <slf4j.version>1.7.25</slf4j.version>
</properties>

<dependencies>
  <dependency>
    <groupId>com.datastax.cassandra</groupId>
    <artifactId>cassandra-driver-core</artifactId>
    <version>${cassandra.version}</version>
  </dependency>
  <dependency>
    <groupId>org.cassandraunit</groupId>
    <artifactId>cassandra-unit</artifactId>
    <version>${cassandra-unit.version}</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>${slf4j.version}</version>
  </dependency>
  <dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-simple</artifactId>
    <version>${slf4j.version}</version>
  </dependency>
</dependencies>

It's notoriously difficult to unit-test code that calls into a database's APIs. The cassandra-unit library provides an embedded, stand-alone Cassandra; calling it isn't like calling a real instance in that there's no local instance to set up, let alone separate cluster-node instances.

Here is a simple test to see if this embedded Cassandra will start up. It does nothing except demonstrate that Cassandra's unit-testing helper will work.

CassandraExampleTest.java:
package com.etretatlogiciels.cassandra;

import java.io.IOException;

import org.junit.BeforeClass;
import org.junit.Test;

import org.apache.cassandra.exceptions.ConfigurationException;
import org.apache.thrift.transport.TTransportException;
import org.cassandraunit.utils.EmbeddedCassandraServerHelper;

/**
 * To run this, you must add a Run/Debug Configuration in the form
 * of an Environment Variable:
 *
 * LD_LIBRARY_PATH=/home/russ/dev/cassandra/target/classes
 *
 * This is so that libsigar-amd64-linux.so can be found and loaded
 * by the Cassandra code.
 */
public class CassandraExampleTest
{
  @BeforeClass
  public static void startCassandra()
      throws TTransportException, IOException, InterruptedException, ConfigurationException
  {
    EmbeddedCassandraServerHelper.startEmbeddedCassandra( "another-cassandra.yaml", 20000 );
  }

  @Test
  public void test()
  {
    System.out.println( "This is a test!" );
  }
}

An early project...

pom.xml: same properties and dependencies as for the unit-testing set-up above.

another-cassandra.yaml:
cluster_name: 'Test Cluster'
hints_directory: target/embeddedCassandra/hints
cdc_raw_directory: target/embeddedCassandra/data/cdc_raw
hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000 # 3 hours
hinted_handoff_throttle_in_kb: 1024
max_hints_delivery_threads: 2
authenticator: AllowAllAuthenticator
authorizer: AllowAllAuthorizer
permissions_validity_in_ms: 2000
partitioner: org.apache.cassandra.dht.Murmur3Partitioner

# directories where Cassandra should store data on disk.
data_file_directories:
commitlog_directory: target/embeddedCassandra/commitlog
disk_failure_policy: stop
key_cache_size_in_mb:
key_cache_save_period: 14400
row_cache_size_in_mb: 0
row_cache_save_period: 0
saved_caches_directory: target/embeddedCassandra/saved_caches
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
commitlog_segment_size_in_mb: 32
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "127.0.0.1"
concurrent_reads: 32
concurrent_writes: 32
trickle_fsync: false
trickle_fsync_interval_in_kb: 10240
storage_port: 7010
ssl_storage_port: 7011
listen_address: 127.0.0.1
start_native_transport: true
native_transport_port: 9152
start_rpc: true
rpc_address: localhost
rpc_port: 9175
rpc_keepalive: true
rpc_server_type: sync
thrift_framed_transport_size_in_mb: 15
thrift_max_message_length_in_mb: 16
incremental_backups: false
snapshot_before_compaction: false
auto_snapshot: false
column_index_size_in_kb: 64
compaction_throughput_mb_per_sec: 16
read_request_timeout_in_ms: 5000
range_request_timeout_in_ms: 10000
write_request_timeout_in_ms: 2000
cas_contention_timeout_in_ms: 1000
truncate_request_timeout_in_ms: 60000
request_timeout_in_ms: 10000
cross_node_timeout: false
endpoint_snitch: SimpleSnitch
dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1
request_scheduler: org.apache.cassandra.scheduler.NoScheduler
index_interval: 128
encryption_options:
  internode_encryption: none
  keystore: conf/.keystore
  keystore_password: cassandra
  truststore: conf/.truststore
  truststore_password: cassandra

Basic Cassandra connection

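import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.Session;

public class CassandraConnector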
{
  private Cluster cluster;
  private Session session;

  public void connect( final String node, final int port )
  {
    cluster = Cluster.builder()
              .addContactPoint( node )
              .withPort( port )
              .build();

    Metadata metadata = cluster.getMetadata();

    System.out.println( String.format( "Connected to cluster: %s",
                          metadata.getClusterName() ) );

    for( Host host : metadata.getAllHosts() )
    {
      System.out.println( String.format( "Datacenter: %s, Host: %s, Rack: %s",
                            host.getDatacenter(),
                            host.getAddress(),
                            host.getRack() ) );
    }

    session = cluster.connect();
  }

  public Session getSession() { return session; }
  public void    close()      { cluster.close(); }
}

Cassandra data types

Nothing too surprising here...

ascii counter float list text tinyint varint
bigint date frozen map time tuple
blob decimal inet set timestamp uuid
boolean double int smallint timeuuid varchar

Assuming I'll ever need to do so, here's a Java enumeration for internal use. However, this is really code written too early and may not ultimately be of much use.

import java.util.ArrayList;
import java.util.List;

public enum CassandraType
{
  c_text,       // UTF-8 encoded string
  c_ascii,      // US_ASCII 7-bit
  c_varchar,    // UTF-8 encoded string

  c_int,        // 32-bit signed
  c_bigint,     // 64-bit signed
  c_smallint,   // 2-byte signed
  c_tinyint,    // 1-byte signed
  c_varint,     // arbitrary-precision

  c_decimal,    // variable-precision
  c_float,      // 32-bit IEEE-754
  c_double,     // 64-bit IEEE-754

  c_boolean,    // true/false
  c_counter,    // distributed, 64-bit

  c_date,       // 32-bit day since Epoch
  c_time,       // 64-bit nanoseconds since midnight
  c_timestamp,  // 8 bytes since Epoch; date and time with millisecond precision
  c_timeuuid,   // version 1 (time-based) UUID

  c_inet,       // IPv4 or IPv6
  c_tuple,      // 2-3 fields
  c_uuid,       // 128-bit globally unique identifier

  c_list,       // collection of 1+ elements (performance impact)
  c_map,        // JSON-style array of literals
  c_set,        // collection of 1+ literal elements

  c_blob,       // arbitrary bytes (no validation), in hexadecimal
  c_frozen,     // multiple types in single value, treated as blob
  ;

  /**
   * Useful to determine whether potential enum type,
   * in string form, is a Cassandra type.
   */
  public static boolean contains( String type )
  {
    try
    {
      CassandraType.valueOf( type );
      return true;
    }
    catch( IllegalArgumentException e )
    {
      return false;
    }
  }

  /**
   * Useful to convert a type name in string form, with or
   * without the "c_" prefix, to a Cassandra type; returns
   * null if the string names no Cassandra type.
   */
  public static CassandraType stringToCassandraType( String string )
  {
    if( string == null )
      return null;

    try
    {
      return CassandraType.valueOf( "c_" + string );
    }
    catch( IllegalArgumentException e )
    {
      // valueOf() never returns null, it throws: try the name as given
    }

    try
    {
      return CassandraType.valueOf( string );
    }
    catch( IllegalArgumentException e )
    {
      return null;
    }
  }

  /**
   * Useful to return a list of Cassandra types.
   */
  public static List< String > getCassandraTypes()
  {
    List< String > list = new ArrayList<>( CassandraType.values().length );

    for( CassandraType type : CassandraType.values() )
      list.add( type.name() );

    return list;
  }
}
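The two lookup helpers above lean on valueOf() throwing IllegalArgumentException. Here's a minimal sketch of an alternative that builds a name-to-constant map once and accepts the name with or without the c_ prefix. The enumeration CqlType and its method fromString() are hypothetical, trimmed-down stand-ins of my own, not the CassandraType above:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, trimmed-down stand-in for the CassandraType enumeration above.
enum CqlType
{
  c_text, c_int, c_bigint, c_boolean, c_timestamp, c_uuid;

  // Built once; maps both "c_text" and "text" to c_text.
  private static final Map< String, CqlType > BY_NAME = new HashMap<>();

  static
  {
    for( CqlType type : values() )
    {
      BY_NAME.put( type.name(), type );
      BY_NAME.put( type.name().substring( 2 ), type );   // name without "c_"
    }
  }

  /** Returns the matching type or null if the string names no known type. */
  static CqlType fromString( String string )
  {
    return ( string == null ) ? null : BY_NAME.get( string );
  }
}
```

With this, CqlType.fromString( "text" ) and CqlType.fromString( "c_text" ) yield the same constant and an unknown name yields null without any exception being thrown.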

Friday, 18 August 2017

public class CassandraConnector
{
  private Cluster cluster;
  private Session session;

  public void connect( String node, Integer port )
  {
    Builder b = Cluster.builder().addContactPoint( node );
    if( port != null )
      b.withPort( port );
    cluster = b.build();
    session = cluster.connect();
  }

  public Session getSession() { return this.session; }
  public void close() { session.close(); cluster.close(); }
}

In Cassandra, there's something called a keyspace. This is a little like the schema in a relational context. Remember, Cassandra isn't a document database like MongoDB, but a wide-column database. The keyspace is the outermost container for data in Cassandra. The main attributes to set per keyspace are the...

Another important notion in Cassandra is the column, a data structure that contains a column name, value and timestamp. The columns and the number of columns in each row may vary, in contrast with the contents of a relational database table where data are rigidly structured.

Creating a keyspace...

For the example I'm studying, the keyspace to create is "library":

public void createKeyspace( String keyspaceName, String replicationStrategy, int replicationFactor )
{
  StringBuilder sb = new StringBuilder();

  sb.append( "CREATE KEYSPACE IF NOT EXISTS " )
    .append( keyspaceName )
    .append( " WITH replication = {" )
    .append( "'class':'" )
    .append( replicationStrategy )
    .append( "','replication_factor':" )
    .append( replicationFactor )
    .append( "};" );

  String query = sb.toString();
  session.execute( query );
}
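Because createKeyspace() splices its arguments straight into the CQL text, it may be worth checking them first. Here's a minimal sketch, assuming a hypothetical helper class, KeyspaceCql, of my own invention (it is not part of the DataStax driver). It only builds the statement string; executing it is still left to session.execute(). It also assumes the short strategy name (e.g.: SimpleStrategy) rather than a fully qualified class name:

```java
import java.util.regex.Pattern;

// Hypothetical helper; builds the CQL text only and never touches the driver.
final class KeyspaceCql
{
  // Unquoted CQL identifier: a letter followed by letters, digits or underscores.
  private static final Pattern IDENTIFIER = Pattern.compile( "[A-Za-z][A-Za-z0-9_]*" );

  static String createKeyspace( String keyspaceName, String replicationStrategy, int replicationFactor )
  {
    if( !IDENTIFIER.matcher( keyspaceName ).matches() )
      throw new IllegalArgumentException( "Bad keyspace name: " + keyspaceName );
    // Assumption: short strategy names only, so dots are rejected here.
    if( !IDENTIFIER.matcher( replicationStrategy ).matches() )
      throw new IllegalArgumentException( "Bad replication strategy: " + replicationStrategy );
    if( replicationFactor < 1 )
      throw new IllegalArgumentException( "Replication factor must be at least 1" );

    return "CREATE KEYSPACE IF NOT EXISTS " + keyspaceName
         + " WITH replication = {'class':'" + replicationStrategy
         + "','replication_factor':" + replicationFactor + "};";
  }

  public static void main( String[] args )
  {
    System.out.println( createKeyspace( "library", "SimpleStrategy", 1 ) );
  }
}
```

The string returned for "library" is the same text that createKeyspace() above would send to Cassandra; the only difference is that a name such as "library; DROP KEYSPACE x" is refused before it gets that far.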

Tuesday, 22 August 2017

I reposted to the Cassandra users' forum asking for a reply so that I know my posts are even getting there. I finally got an answer back, but the suggestion was just a pile of code that merged unit testing and production together without ultimately providing a solution around the problem I'm having:

Exception (java.lang.ExceptionInInitializerError) encountered during startup: null
java.lang.ExceptionInInitializerError
	at org.apache.cassandra.transport.Server.start(Server.java:128)
	at java.util.Collections$SingletonSet.forEach(Collections.java:4767)
	at org.apache.cassandra.service.NativeTransportService.start(NativeTransportService.java:128)
	at org.apache.cassandra.service.CassandraDaemon.startNativeTransport(CassandraDaemon.java:649)
	at org.apache.cassandra.service.CassandraDaemon.start(CassandraDaemon.java:511)
	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:616)
	at org.cassandraunit.utils.EmbeddedCassandraServerHelper$1.run(EmbeddedCassandraServerHelper.java:129)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException: name
	at io.netty.util.internal.logging.AbstractInternalLogger.<init>(AbstractInternalLogger.java:39)
	at io.netty.util.internal.logging.Slf4JLogger.<init>(Slf4JLogger.java:30)
	at io.netty.util.internal.logging.Slf4JLoggerFactory.newInstance(Slf4JLoggerFactory.java:73)
	at io.netty.util.internal.logging.InternalLoggerFactory.getInstance(InternalLoggerFactory.java:84)
	at io.netty.util.internal.logging.InternalLoggerFactory.getInstance(InternalLoggerFactory.java:77)
	at io.netty.bootstrap.ServerBootstrap.<clinit>(ServerBootstrap.java:46)
	... 10 more

I've since read other attempts to explain using this helper class, but no matter how hard I've tried, I keep coming back to the error above. I worried originally that the error was saying that I had done something stupid, but I don't believe that now. It means that I don't know how to start the Cassandra unit-test helper up. The articles I've read all assert that I need only call it:

EmbeddedCassandraServerHelper.startEmbeddedCassandra();

...but this is not true. I've tried to supply a YAML file and, I think, have done so. It came from step 2 in this article. Though this file is required (and not all of the authors even allude to it), it doesn't work the magic. I got one from someplace that I'm using. I've also added log4j-embedded-cassandra.properties to no avail.

I bottled up some simple test code from Testing Cassandra repositorys using Cassandra Unit. I didn't use the Spring Boot code, but just the basic Java code. It worked; it's the early project above. This means there's some crapola going on, likely slf4j in my greater nifi-pipeline project.

This Cassandra unit-test stuff works. Sadly, the thrust of the tutorial is Spring Boot, and the useful code is overly infected by it and therefore pretty useless when there are other tutorials around.

Wednesday, 23 August 2017

Monday, 28 August 2017

Setting up Cassandra as local to my development host:


https://www.tutorialspoint.com/cassandra/cassandra_installation.htm (following along with this)

http://cassandra.apache.org/download/ (Browser download to ~/dev/cassandra)

~/dev/cassandra $ tar -zxf apache-cassandra-3.11.0-bin.tar.gz
~/dev/cassandra $ ll
total 37060
drwxr-xr-x   3 russ russ     4096 Aug 28 12:59 .
drwxrwxr-x. 96 russ russ     4096 Aug 28 12:58 ..
drwxr-xr-x  10 russ russ     4096 Aug 28 12:59 apache-cassandra-3.11.0
-rw-rw-r--   1 russ russ 37929669 Aug 28 12:58 apache-cassandra-3.11.0-bin.tar.gz

~/dev/cassandra/apache-cassandra-3.11.0/bin $ gvim cassandra.yaml
  (insert https://svn.apache.org/repos/asf/cassandra/trunk/conf/cassandra.yaml)

export CASSANDRA_HOME=~/dev/cassandra/apache-cassandra-3.11.0
export PATH=$PATH:$CASSANDRA_HOME/bin

~/dev/cassandra/apache-cassandra-3.11.0/bin $ sudo bash
# mkdir -p /var/lib/cassandra/data
# mkdir -p /var/lib/cassandra/commitlog
# mkdir -p /var/lib/cassandra/saved_caches
# mkdir -p /var/log/cassandra
# chmod 777 /var/lib/cassandra/
# chmod 777 /var/log/cassandra/

~/dev/cassandra/apache-cassandra-3.11.0 $ ./bin/cassandra -f
(lots of fun stuff...)

(output from my Cassandra connector test code...)
Connected to cluster: Test Cluster
Datacenter: datacenter1, Host: /127.0.0.1, Rack: rack1


I want to connect to Cassandra, and do some stuff like use prepared statements. Here's my Cassandra code...

CassandraConnector.java:

package com.etretatlogiciels.cassandra;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.Session;

public class CassandraConnector
{
  private Cluster cluster;
  private Session session;

  public void connect( final String node, final int port )
  {
    cluster = Cluster.builder()
               .addContactPoint( node )
               .withPort( port )
               .build();
    session = cluster.connect();
  }

  public Session  getSession()  { return session; }
  public Metadata getMetadata() { return cluster.getMetadata(); }
  public void     close()       { cluster.close(); }
}

CassandraConnectorTest.java:

package com.etretatlogiciels.cassandra;

import org.junit.After;
import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TestName;

import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.etretatlogiciels.testing.TestUtilities;

public class CassandraConnectorTest
{
  // @formatter:off
  @Rule   public TestName name = new TestName();
  @After  public void tearDown() { }
  @Before public void setUp() throws Exception { TestUtilities.setUp( name ); }

  @Test
  public void testConnector()
  {
    if( !TestUtilities.runningInsideIntelliJ() )
      return;

    // connects to Cassandra instance running on local box...
    CassandraConnector client = new CassandraConnector();
    client.connect( "127.0.0.1", 9042 );

    Metadata metadata = client.getMetadata();

    System.out.println( String.format( "Connected to cluster: %s",
                          metadata.getClusterName() ) );

    for( Host host : metadata.getAllHosts() )
    {
      System.out.println( String.format( "Datacenter: %s, Host: %s, Rack: %s",
                            host.getDatacenter(),
                            host.getAddress(),
                            host.getRack() ) );
    }
  }
}

This test also appears to work...

TryPreparedStatementsTest.java:

package com.etretatlogiciels.cassandra;

import org.junit.After;
import org.junit.Before;
import org.junit.Rule;
import org.junit.Test;
import org.junit.rules.TestName;

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.LocalDate;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class TryPreparedStatementsTest
{
  @After  public void tearDown() { client.close(); }
  @Before public void setUp() throws Exception
  {
    // connects to Cassandra instance running on local box; assign to the
    // field (not a local) so that tearDown() can close the connection...
    client = new CassandraConnector();
    client.connect( "127.0.0.1", 9042 );
    session = client.getSession();
  }

  private static final String DROP_KEYSPACE   = "drop keyspace if exists product";
  private static final String CREATE_KEYSPACE = "create keyspace product with replication = { 'class' : 'SimpleStrategy',"
                             + " 'replication_factor' : 1 };";
  private static final String USE_KEYSPACE    = "use product;";
  private static final String DROP_TABLE      = "drop table if exists product.sku_list;";
  private static final String CREATE_TABLE    = "create table "
                             + "product.sku_list( sku text, description text, when date, primary key( sku ) );";
  private static final String INSERT_SKU      = "insert into sku_list( sku, description, when ) values( ?, ?, ? );";

  private CassandraConnector client;
  private Session            session;

  @Test
  public void testPreparedStatement()
  {
    PreparedStatement statement;
    BoundStatement    bound;

    statement = session.prepare( DROP_KEYSPACE );
    bound     = statement.bind();
    session.execute( bound );

    statement = session.prepare( CREATE_KEYSPACE );
    bound     = statement.bind();
    session.execute( bound );

    statement = session.prepare( USE_KEYSPACE );
    bound     = statement.bind();
    session.execute( bound );

    statement = session.prepare( CREATE_TABLE );
    bound     = statement.bind();
    session.execute( bound );

    statement = session.prepare( INSERT_SKU );
    bound     = statement.bind();
    bound.setString( 0, "665892" );
    bound.setString( 1, "LCD screen" );
    bound.setDate( 2, LocalDate.fromMillisSinceEpoch( System.currentTimeMillis() ) );
    session.execute( bound );
  }
}

Here's evidence:

~/dev/cassandra/apache-cassandra-3.11.0 $ ./bin/cqlsh
cqlsh> show host;
Connected to Test Cluster at 127.0.0.1:9042.

cqlsh> describe keyspaces;

system_schema  system_auth  product  system  system_distributed  system_traces

cqlsh> use product;

cqlsh:product> describe tables;

sku_list

cqlsh> describe product;

CREATE KEYSPACE product WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}  AND durable_writes = true;

CREATE TABLE product.sku_list (
    sku text PRIMARY KEY,
    description text,
    when date
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

cqlsh:product> select * from sku_list;

 sku    | description | when
--------+-------------+------------
 665892 |  LCD screen | 2017-08-28

	(1 rows)

cqlsh:product> exit;

Friday, 8 September 2017

Time out for debugging Cassandra...

I worked with David for half an hour on setting up to debug on Cassandra. Decidedly, trying to do development work under Windows is sorely limiting and greatly lengthens the amount of research one must do to accomplish what are simple actions under a UNIX/Linux shell. This said, it's not going to be a piece of cake on Linux either the first time. Here are some links I looked at:

Note that if JVM_OPTS isn't defined in the environment, it can be defined for the process starting Cassandra, which means the options will be present. Also, note that there are option-order problems with http://10.10.10.6/notes/daily.html#cassandra-debug. See notes from late last year and early this year for running NiFi remotely.

Based on what I've read of the Cassandra-Lucene Index plug-in, the lib subdirectory is guaranteed to be on Cassandra's classpath.

The installation of the Cassandra-Lucene Index plug-in, which must be done by cloning and building the source, is done thus:

mvn clean package -Ppatch -Dcassandra_home=<CASSANDRA_HOME>

Stuff to figure out:

  1. Do Cassandra plug-ins follow a framework (à la Tomcat, NiFi, etc.) in the sense that they must be dropped into a specific subdirectory?
  2. Instead, are they simply code resources (.jar) that a loader knows to load on condition that they be findable in the effective ${CLASSPATH}?
  3. Are we really going to have to build Cassandra because the only way to figure out how this plug-in stuff works is to step through Cassandra itself?
  4. What is the Cassandra client? How is it used relevant to plug-ins?

Tuesday, 12 September 2017

Setting up Cassandra on Linux Mint...

  1. I created /etc/apt/sources.list.d/cassandra.list to contain:
    ### THIS FILE WAS CREATED BY HAND ###
    # from http://cassandra.apache.org/download/
    deb http://www.apache.org/dist/cassandra/debian 311x main
    
  2. Add the Cassandra repository keys:
    # curl https://www.apache.org/dist/cassandra/KEYS | apt-key add -
    
  3. Install Cassandra:
    # apt-get install cassandra
    nargothrond sources.list.d # dpkg --list | grep [c]assandra
    ii  cassandra     3.11.0   all  distributed storage system for structured data
    
  4. Start up the service:
    # service cassandra start
    ~ $ sudo service --status-all | grep [c]assandra
     [ + ]  cassandra
    
  5. I found these subdirectories. In parentheses are the things that I think I most want to know about:
    • /etc/cassandra (cassandra-env.sh, cassandra.yaml, jvm.options where I expect to get remote debugging going)
    • /var/lib/cassandra (commit log, data, hints)
    • /var/log/cassandra (debug log, system log)
    • /usr/share/cassandra (JAR, cassandra.in.sh, lib where I expect to drop my index plug-in)

Important things learned...

I was able to copy my plug-in to Cassandra, bounce it, then connect IntelliJ IDEA via remote session to my Cassandra service. Presumably, it will kick into the debugger once I figure out how to get Cassandra to call through the plug-in.

Next up, how to tell Cassandra to call my plug-in?

Wednesday, 13 September 2017

To execute a cql command file (i.e.: a text file containing a cql command) from the cql shell, do this:

cqlsh> source 'create-keyspace.cql'

Of course, a full or relative path to the command file works too (although, if Cassandra is installed as a service, what would the current working directory be?).

Comments in Cassandra cql command files can be:

In the IntelliJ IDEA editor, however, only the first one gives the warm fuzzies of grey text; the other two will not stop IntelliJ from highlighting SQL/CQL keywords found in the comment.

Tuesday, 26 September 2017

Cassandra custom index-relevant links...

Wednesday, 27 September 2017

Secondary indices in Cassandra...

If you attempt to query on a column in a table that's not part of the PRIMARY key, an error will be returned (let's do this in cqlsh). In this example, assume that first_name and not last_name is the primary key:

cqlsh:some_keyspace> SELECT * FROM some_table WHERE last_name = 'Schwartz';
InvalidRequest: code=2200 [Invalid query] message="No supported secondary index found for the non primary key columns restrictions"

As the error suggests, a secondary index must be created consisting at least of last_name. By definition, a secondary index is one created on/for a column that's not in the primary key:

cqlsh:some_keyspace> CREATE INDEX last_name_index ON some_table( last_name );

(Note: the name last_name_index is completely optional.)

...whereupon the original query begins to work:

cqlsh:some_keyspace> SELECT * FROM some_table WHERE last_name = 'Schwartz';

first_name | last_name
-----------+-----------
       Joe | Schwartz

(1 rows)

The secondary index is a different concept than the custom index that I'm working on.

Cassandra partitions data across multiple nodes in a cluster. For this reason, a secondary index, based on the data it refers to, must be kept as a copy on every relevant node. So, queries using a secondary index are significantly more expensive.
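To make the cost difference concrete, here's a toy model of my own (nothing like Cassandra's real token ring or index tables) in which rows are spread over nodes by hashing the primary key. The class ToyRing and its methods are hypothetical names for illustration only. A look-up by primary key asks one node; a look-up by any other column has no way to know where matches live and so asks them all:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: real Cassandra uses a token ring and per-node index tables.
final class ToyRing
{
  private final List< Map< String, String > > nodes = new ArrayList<>();
  int nodesContacted;   // how many nodes the last query had to ask

  ToyRing( int nodeCount )
  {
    for( int i = 0; i < nodeCount; i++ )
      nodes.add( new HashMap<>() );
  }

  // The partition key alone decides which node owns the row.
  private Map< String, String > owner( String firstName )
  {
    return nodes.get( Math.floorMod( firstName.hashCode(), nodes.size() ) );
  }

  void insert( String firstName, String lastName )
  {
    owner( firstName ).put( firstName, lastName );
  }

  // Query by partition key: exactly one node is asked.
  String lastNameOf( String firstName )
  {
    nodesContacted = 1;
    return owner( firstName ).get( firstName );
  }

  // Query by non-key column: every node is asked for its local matches.
  List< String > firstNamesWith( String lastName )
  {
    List< String > matches = new ArrayList<>();
    nodesContacted = 0;
    for( Map< String, String > node : nodes )
    {
      nodesContacted++;
      for( Map.Entry< String, String > row : node.entrySet() )
        if( row.getValue().equals( lastName ) )
          matches.add( row.getKey() );
    }
    return matches;
  }
}
```

On a ring of four toy nodes, lastNameOf( "Joe" ) leaves nodesContacted at 1 while firstNamesWith( "Schwartz" ) leaves it at 4, whatever the number of matching rows.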

Because of how secondary indices are built and maintained, there are cases in which they are not recommended:

An index is built using:

Query-first design and design notes

In Cassandra, in contrast to RDBMS practices, begin design by laying out what queries are to be used instead of what the data and data relationships are to be. Organize the data to satisfy the queries. I see this as being a little like test-driven development, so it's a good thing.

Keep related columns together in the same Cassandra table. Queries that search a single partition will yield the best performance.

Monday, 2 October 2017

The SSTable, or "sorted-strings table," in Cassandra is created when the data of a column family (in memory) is flushed to disk.

The reason a disk needs to be left with 50% space free is so that Cassandra has space to rebuild SSTables to optimize them.
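As I understand it, the free space is needed because a rebuilt SSTable is written out as a new file while the old ones still exist. Here's a toy sketch of that reasoning, assuming an "SSTable" is nothing more than a sorted map and that two tables are merged into a third. ToyCompaction and its methods are hypothetical names of mine, not Cassandra code:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model only: an "SSTable" here is just a sorted map of key to value.
final class ToyCompaction
{
  // Merges two tables into a brand-new one; on duplicate keys the newer value wins.
  static SortedMap< String, String > compact( SortedMap< String, String > older,
                                               SortedMap< String, String > newer )
  {
    SortedMap< String, String > merged = new TreeMap<>( older );
    merged.putAll( newer );
    return merged;
  }

  // Entries held at the end of the merge: both inputs still exist alongside the output.
  static int peakEntries( SortedMap< String, String > older,
                          SortedMap< String, String > newer )
  {
    return older.size() + newer.size() + compact( older, newer ).size();
  }
}
```

If the two inputs share no keys, the output is as big as both inputs together, so the peak is double what was there before the merge; that worst case is where a figure like 50% free would come from.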

Tuesday, 3 October 2017

Materialized views

Today, I'm looking into this topic.

When you move from an RDBMS to Cassandra, whether really or conceptually because you're adopting Cassandra and, like most, have a sort of solidly SQL mindset, you must denormalize data into separate tables based on the queries that will be run against your database (keyspace and tables).

Thinking about how to organize data in Cassandra requires different thoughts and approaches.

For example, the only way to query a column in a table without specifying the partition key is to use a secondary index. This method is not fit for data of high cardinality, that is, columns that contain values that are very uncommon or unique, like a GUID, e-mail address, user name, etc. This is very slow because high-cardinality, secondary-index queries can require all nodes in the ring to respond, adding considerable latency to the action.

One solution to this problem has been to make the client (the one making the query) perform denormalization as a part of his processing of queries into multiple, independent tables. This means that such code, in an application, would be running at the hands of many users on many hosts (instead of just one place).

Cassandra 3.0 introduced a new feature, materialized views, that handles automated, server-side denormalization. This feature takes the form of a statement that's sort of a combination of index creation and select query. For example, suppose this table:

CREATE TABLE scores
(
  user TEXT,
  game TEXT,
  year INT,
  month INT,
  day INT,
  score INT,
  PRIMARY KEY( user, game, year, month, day )
);

We want some way to get the all-time high scores from the data in this table:

CREATE MATERIALIZED VIEW alltimehigh AS               -- name the view
  SELECT user                                         -- must identify the columns to be contained
  FROM scores                                         -- must identify the base table
    WHERE game  IS NOT NULL                           -- filter must be specified for each primary-key column
      AND score IS NOT NULL
      AND user  IS NOT NULL
      AND year  IS NOT NULL
      AND month IS NOT NULL
      AND day   IS NOT NULL
  PRIMARY KEY( game, score, user, year, month, day )  -- must include all of the base table's primary-key columns
  WITH CLUSTERING ORDER BY( score desc );

In this example, we prime the table with some data:

INSERT INTO scores( user, game, year, month, day, score ) VALUES( 'pcmanus', 'Coup', 2015, 05, 01, 4000 );
INSERT INTO scores( user, game, year, month, day, score ) VALUES( 'jbellis', 'Coup', 2015, 05, 03, 1750 );
INSERT INTO scores( user, game, year, month, day, score ) VALUES( 'yukim', 'Coup', 2015, 05, 03, 2250 );
INSERT INTO scores( user, game, year, month, day, score ) VALUES( 'tjake', 'Coup', 2015, 05, 03, 500 );
INSERT INTO scores( user, game, year, month, day, score ) VALUES( 'jmckenzie', 'Coup', 2015, 06, 01, 2000 );
INSERT INTO scores( user, game, year, month, day, score ) VALUES( 'iamaleksey', 'Coup', 2015, 06, 01, 2500 );
INSERT INTO scores( user, game, year, month, day, score ) VALUES( 'tjake', 'Coup', 2015, 06, 02, 1000 );
INSERT INTO scores( user, game, year, month, day, score ) VALUES( 'pcmanus', 'Coup', 2015, 06, 02, 2000 );

...and here's how we search for the all-time high score:

SELECT user, score FROM alltimehigh WHERE game = 'Coup' LIMIT 1;

The result is:

user       | score
-----------+-------
   pcmanus |  4000

A lot of the magic happens at write time, i.e.: when the table is built. Consequently, there is a performance penalty at write and query time. Low-cardinality data will create hotspots around the ring. In our example, because the only game is 'Coup', only the nodes storing 'Coup' have any data stored on them. If there are tombstoned entries, the materialized view must query for and generate a tombstone for each entry. This is all overhead.
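To see why the work lands on the write path, here's a toy, in-memory model of the scores table and its view. It's only a sketch under my own assumptions: ToyScores, Row and their methods are hypothetical names, and nothing here reflects how Cassandra actually stores or replicates a view. Every insert reads the base row first so that a stale view row can be dropped before the new one is added:

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Toy model only: one base "table" plus one "view" kept in step on every write.
final class ToyScores
{
  static final class Row
  {
    final String user; final int score; final String baseKey;
    Row( String user, int score, String baseKey )
    {
      this.user = user; this.score = score; this.baseKey = baseKey;
    }
  }

  // View ordering: highest score first, base key as tie-breaker.
  private static final Comparator< Row > BY_SCORE_DESC = ( a, b ) ->
  {
    int byScore = Integer.compare( b.score, a.score );
    return ( byScore != 0 ) ? byScore : a.baseKey.compareTo( b.baseKey );
  };

  private final Map< String, Row > base = new HashMap<>();              // keyed by base primary key
  private final Map< String, TreeSet< Row > > view = new HashMap<>();  // partitioned by game

  void insert( String user, String game, int year, int month, int day, int score )
  {
    String baseKey = user + '/' + game + '/' + year + '/' + month + '/' + day;
    TreeSet< Row > partition = view.computeIfAbsent( game, g -> new TreeSet<>( BY_SCORE_DESC ) );

    // Read before write: an overwritten base row leaves a stale view row that must go.
    Row old = base.get( baseKey );
    if( old != null )
      partition.remove( old );

    Row row = new Row( user, score, baseKey );
    base.put( baseKey, row );
    partition.add( row );
  }

  // Stands in for: SELECT user, score FROM alltimehigh WHERE game = ? LIMIT 1
  Row allTimeHigh( String game )
  {
    TreeSet< Row > partition = view.get( game );
    return ( partition == null || partition.isEmpty() ) ? null : partition.first();
  }
}
```

Note that every row for 'Coup' ends up in a single partition of the toy view, which is the hotspot described above in miniature.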

Monday, 6 November 2017

Setting up a cluster...

It's possible to use something called Cassandra Cluster Manager (CCM), but for practice and deep learning about configuration aspects and details in administration, do each box manually as a separate node. This comes from Jeff Jirsa, who says that official, first-time set-up documents are pretty lacking and gives the following steps:

  1. Install the Debian package from Configuring Cassandra.
  2. Configure following Configuring Cassandra:
    1. Pick a cluster name. (You cannot change this later.)
    2. Set the listen_address (and maybe the broadcast_address).
    3. Put the IP address of the first node as the seed. Once the cluster is up you can change the seeds. The first time a node joins the ring (and for some other stuff not to worry about), this seed is used. Thereafter, as long as the cluster isn't growing, the actual seeds don't matter very much. People tend to think of seeds as being more important than they really are. They should be the same, but, if different across nodes for a while, it's not likely to hurt the cluster much.
  3. Start the node just installed and configured.
  4. Wait 2 minutes.
  5. Proceed with next node (start these instructions over).

Another good reference is How To Run a Multi-Node Cluster Database with Cassandra on Ubuntu 14.04.

Wednesday, 8 November 2017

cqlsh> CREATE ROLE cassadmn WITH PASSWORD = 'Cassadmn' AND LOGIN = true;
NoHostAvailable: ('Unable to complete the operation against any hosts', {})

"Unavailable" indicates that the number of nodes Cassandra needs for the query to succeed isn't available. Too many nodes are down. Either it's a single node that thinks it's more than one node and others are down (you added/removed nodes to/from that cluster in the past), or the replication strategy for system_auth is wrong.

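A bit of arithmetic shows why one or two down nodes can be enough to produce this. Here's a minimal sketch in plain Java. The class and method names are hypothetical, only three consistency levels are modeled, and it assumes the usual rule that QUORUM needs a majority of the replicas, that is, replication factor / 2 + 1 using integer division.

```java
// Toy arithmetic only: hypothetical names, not the driver's ConsistencyLevel.
class AvailabilitySketch
{
  enum Level { ONE, QUORUM, ALL }

  // how many replicas must answer for the given level
  static int required( Level level, int replicationFactor )
  {
    switch( level )
    {
      case ONE :    return 1;
      case QUORUM : return replicationFactor / 2 + 1;   // majority, integer division
      default :     return replicationFactor;           // ALL
    }
  }

  // true when enough replicas are alive to satisfy the level
  static boolean canSatisfy( Level level, int replicationFactor, int liveReplicas )
  {
    return liveReplicas >= required( level, replicationFactor );
  }

  public static void main( String[] args )
  {
    int replicationFactor = 3;
    int liveReplicas      = 1;
    for( Level level : Level.values() )
      System.out.println( level + ": needs " + required( level, replicationFactor )
                          + ", satisfiable: " + canSatisfy( level, replicationFactor, liveReplicas ) );
  }
}
```

With a replication factor of 3 and only one replica alive, QUORUM needs 2 and cannot be met, while ONE still succeeds.
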
Thursday, 16 November 2017

The Cassandra Coordinator...

...or, more properly, coordinator node, is what sends the client's search request (or query) to each node in the cluster. Each node then returns its result whereupon the coordinator combines these partial results, then gives the n (where n is prescribed in the query by a limit) most highly ranked. This avoids a full scan of all the data.
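
The combining step can be sketched in plain Java. This is a toy model, not Cassandra's code: the names are made up, each node's partial result is just a list of integer scores, and "most highly ranked" is assumed to mean "largest".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Toy model only: hypothetical names, no Cassandra APIs involved.
class CoordinatorMergeSketch
{
  // merge every node's partial result and keep the n largest values, best first
  static List< Integer > topN( List< List< Integer > > partialResults, int n )
  {
    PriorityQueue< Integer > best = new PriorityQueue<>();   // min-heap: smallest kept value on top
    for( List< Integer > partial : partialResults )
    {
      for( int value : partial )
      {
        best.add( value );
        if( best.size() > n )
          best.poll();                                       // drop the smallest so only n remain
      }
    }

    List< Integer > merged = new ArrayList<>( best );
    merged.sort( Collections.reverseOrder() );               // highest first
    return merged;
  }
}
```

The heap never holds more than n + 1 values, so the coordinator doesn't need to keep everything the nodes send back.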

Cassandra says that client read or write requests can go to any node in the cluster because all nodes in Cassandra are peers. When a client connects to a node and issues a read or write request, that node serves as the coordinator for that particular client operation.

The job of the coordinator is to act as a proxy between the client application and the nodes (or replicas) that own the data being requested. The coordinator determines which nodes in the ring should get the request based on the cluster's configured partitioner and replica placement strategy.
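
Here's a rough sketch, in plain Java, of how a coordinator could pick the replicas for a request. It makes loud assumptions: the class and method names are hypothetical, each node has exactly one token (no virtual nodes), the key's token is passed in directly rather than computed by a real partitioner, and placement simply walks the ring clockwise in the manner of SimpleStrategy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy model only: hypothetical names, one token per node, no real partitioner.
class TokenRingSketch
{
  private final TreeMap< Long, String > ring = new TreeMap<>();   // token -> node name

  void addNode( long token, String node )
  {
    ring.put( token, node );
  }

  // the first replica is the node owning the key's token; the rest follow clockwise
  List< String > replicasFor( long keyToken, int replicationFactor )
  {
    List< String > replicas = new ArrayList<>();
    if( ring.isEmpty() )
      return replicas;

    int  wanted = Math.min( replicationFactor, ring.size() );
    Long token  = ring.ceilingKey( keyToken );   // a node owns tokens up to and including its own
    if( token == null )
      token = ring.firstKey();                   // wrap around the ring

    while( replicas.size() < wanted )
    {
      replicas.add( ring.get( token ) );
      token = ring.higherKey( token );
      if( token == null )
        token = ring.firstKey();                 // wrap around the ring
    }
    return replicas;
  }
}
```

With nodes A, B, C and D at tokens 0, 100, 200 and 300, a key whose token is 150 and a replication factor of 2 lands on C and D. The coordinator would forward the request to those two nodes whether or not it is one of them.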

In my mind, this raises a number of questions: Will every node offer a coordinator, or only some nodes? Does the coordinator consist of universal code or code that's not installed everywhere?

My hypothesis is that every "stock" Cassandra node is a potential coordinator for mere Cassandra purposes: Any node that receives a client query is referred to as the coordinator for that client operation (query).

The coordinator node is typically chosen by an algorithm that takes network distance into account. Any node can act as the coordinator. At first, requests will be sent to the nodes the client driver knows about. (Remember, a client application initiates its connection to Cassandra by passing a list of one or more contact points, each a hostname plus port.)

It's also useful to know that each client request may be coordinated by a different node and there is no single point of failure (fundamental to Cassandra's architecture).

However, once the client connects and understands the topology of the cluster, the driver may change to a closer coordinator, i.e.: choose a different node including one that wasn't in the original list of contact points. This is because each node contains the metadata of all the other nodes, meaning that as long as one is connected, the driver can get information about all the nodes in the cluster. The driver will then use the metadata of the entire cluster, obtained from the connected node, to create the connection pool. This also means that it's not necessary to set the addresses of all the nodes in the cluster in the contact-points list. Best practice is to set the nodes (in the contact-point list) that respond the quickest to the client application when it starts up. This can be difficult if not impossible to predict at the finest level.

How is a coordinator chosen? How your application sets up its own load-balancing policy has an effect.

The load-balancing policy is configured in your client application when the Cluster is built, for example with RoundRobinPolicy:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.RoundRobinPolicy;

public class ClientApplicationStub
{
  public static void main( String[] args )
  {
    Cluster cluster = Cluster.builder()
                       .addContactPoint( "127.0.0.1" )   // host only; the port is set separately
                       .withPort( 9042 )                 // 9042 is the default native-transport port
                       .withLoadBalancingPolicy( new RoundRobinPolicy() )
                       .build();
    ...

Once the cluster is built, it's not possible to change the policy set.
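
To see what a round-robin policy means for coordinator choice, here's a toy model in plain Java. It does not use the driver: the class and its method are hypothetical stand-ins for what a policy like RoundRobinPolicy does in spirit, which is to hand out the known hosts in turn so that consecutive requests get different coordinators.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: hypothetical names, not the driver's LoadBalancingPolicy interface.
class RoundRobinSketch
{
  private final List< String > hosts;
  private final AtomicInteger  next = new AtomicInteger();

  RoundRobinSketch( List< String > hosts )
  {
    if( hosts.isEmpty() )
      throw new IllegalArgumentException( "at least one host is required" );
    this.hosts = new ArrayList<>( hosts );
  }

  // each call returns the next host in the list, wrapping around at the end
  String nextCoordinator()
  {
    int index = Math.floorMod( next.getAndIncrement(), hosts.size() );   // stays valid if the counter overflows
    return hosts.get( index );
  }

  public static void main( String[] args )
  {
    RoundRobinSketch policy = new RoundRobinSketch( Arrays.asList( "node1", "node2", "node3" ) );
    for( int request = 0; request < 4; request++ )
      System.out.println( "request " + request + " -> " + policy.nextCoordinator() );
  }
}
```

A plain round-robin like this ignores network distance and data ownership entirely, which is why the choice of policy has an effect on which node ends up coordinating.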