Caching in Spark SQL works a bit differently. Since Spark SQL knows the type of each column, it can store the data more efficiently. To make sure that we cache using the memory-efficient representation, rather than the full objects, we should use the special hiveCtx.cacheTable("tableName") method. When caching a table, Spark SQL represents the data in an in-memory columnar format. This cached table will remain in memory only for the life of our driver program, so if the driver exits we will need to recache our data. As with RDDs, we cache tables when we expect to run multiple tasks or queries against the same data.
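As a minimal sketch of this pattern in Python, the helper below caches a table, runs several queries against it, and then releases the memory. It assumes hive_ctx is an already constructed HiveContext (the hiveCtx of the text) and that the table has already been registered; the helper name run_cached_queries is hypothetical and is not part of Spark.

```python
def run_cached_queries(hive_ctx, table_name, queries):
    """Cache `table_name`, run each SQL query, and return the collected rows.

    `hive_ctx` is assumed to be an existing HiveContext; this helper is
    illustrative only and not part of the Spark API.
    """
    # Ask Spark SQL to keep the table in its in-memory columnar format.
    hive_ctx.cacheTable(table_name)
    try:
        # Once the cache has been materialized, later queries can be
        # served from memory instead of rereading the source data.
        return [hive_ctx.sql(query).collect() for query in queries]
    finally:
        # Release the cached data when we are done with the table.
        hive_ctx.uncacheTable(table_name)
```

A call might look like run_cached_queries(hiveCtx, "tableName", ["SELECT COUNT(*) FROM tableName"]). Uncaching in a finally block is a design choice for this sketch; a long-running driver that keeps querying the table would simply leave it cached.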
You can also cache tables using HiveQL/SQL statements. To cache or uncache a table, simply run CACHE TABLE tableName or UNCACHE TABLE tableName. This is most commonly used with command-line clients to the JDBC server.
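Although these statements are usually typed into a command-line client, they can also be issued from a driver program through the context's sql() method. The sketch below assumes hive_ctx is an existing HiveContext and that table_name is a trusted identifier rather than user input; both helper names are hypothetical.

```python
def cache_table_sql(hive_ctx, table_name):
    """Cache a table by sending a CACHE TABLE statement (illustrative helper)."""
    # Assumes `table_name` is a trusted identifier; it is not escaped here.
    hive_ctx.sql("CACHE TABLE " + table_name)


def uncache_table_sql(hive_ctx, table_name):
    """Drop a table from the cache with an UNCACHE TABLE statement."""
    hive_ctx.sql("UNCACHE TABLE " + table_name)
```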