Apache Accumulo 2.0.0
02 Aug 2019
WARNING: This version of Accumulo has critical bugs fixed in newer versions.
Please check our release archive for a newer version.
Apache Accumulo 2.0.0 contains significant changes from 1.9 and earlier versions. It is the first major release since adopting semver and is the culmination of more than 3 years' worth of work by more than 40 contributors from the Accumulo community. The following release notes highlight some of the changes. If anything is missing from this list, please contact the developers to have it included.
New API for creating connections to Accumulo
A fluent API for creating Accumulo clients was introduced in ACCUMULO-4784 and #634.
ZooKeeperInstance objects have been deprecated and replaced by AccumuloClient, which is created from the Accumulo entry point. The new API also deprecates ClientConfiguration and introduces its own properties file that ships with the Accumulo tarball. The new API has the following benefits over the old API:
- All connection information can be specified in a properties file to create the client. This was not possible with the old API.
- The new API does not require a ZooKeeperInstance to be created before creating a client.
- The new client is closeable and does not rely on shared static resource management.
- Clients can be created using a new Java builder,
- Clients can now be created with default settings for
- Create scanners with default authorizations. #744
See the client documentation for more information on how to use the new API.
Hadoop 3 and Java 8 & 11
Accumulo 2.x expects at least Java 8 and Hadoop 3. It is built against Java 8 and Hadoop 3 and the binary tarball is targeted to work with a Java 8 and Hadoop 3 system. See ACCUMULO-4826, #531, and ACCUMULO-4299. Running with Java 11 is also supported, but Java 11 is not required.
Simplified Accumulo scripts and configuration files
Accumulo’s scripts and configuration were refactored in ACCUMULO-4490 to make Accumulo easier to use. The number of scripts in the bin directory of the Accumulo release tarball has been reduced from 20 scripts to the four scripts below:
- accumulo - mostly left alone except for improved usage
- accumulo-service - manage Accumulo processes as services
- accumulo-cluster - manage Accumulo on cluster. Replaces
- accumulo-util - combines many utility scripts into one script.
Read this blog post for more information on this change.
New Bulk Import API
A new bulk import API was added in 2.0 that has a very different implementation. This new API supports the following new functionality:
- Bulk import to an offline table.
- Load plans that specify where files go in a table which avoids opening the files for inspection.
- Inspection of files on the client side. Inspection of all files is done before the FATE operation starts. This results in fewer namenode operations and fail-fast behavior for bad files (a fail directory is no longer needed).
- A new improved algorithm to load files into tablets. This new algorithm scans the metadata table and makes asynchronous load calls to all tablets. This queues load operations on all tablets at around the same time. The async RPC calls and beforehand inspection make the bulk load FATE operation much shorter.
The shell command for doing bulk load supports the old and new API. To use the new API from the shell simply omit the failure directory argument. For the API, use the new fluent API. See #436, #472, and #570.
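The load-plan idea above can be sketched in a few lines. The snippet below is a language-neutral illustration in Python, not Accumulo's Java API. It assumes that each tablet covers the rows after the previous split point up to and including its own split point, and that the first and last row of each file are already known. The function name and file names are hypothetical.

```python
from bisect import bisect_left

def tablets_for_file(splits, first_row, last_row):
    """Return indexes of the tablets a file overlaps, without opening the file.

    `splits` is the sorted list of split points. Tablet i is assumed to hold
    rows in (splits[i-1], splits[i]]; the last tablet has no upper bound.
    """
    if first_row > last_row:
        raise ValueError("first_row must not sort after last_row")
    start = bisect_left(splits, first_row)  # tablet holding first_row
    end = bisect_left(splits, last_row)     # tablet holding last_row
    return list(range(start, end + 1))

# Hypothetical table with split points g, m, t (four tablets) and three files.
splits = ["g", "m", "t"]
plan = {
    "f1.rf": tablets_for_file(splits, "a", "c"),
    "f2.rf": tablets_for_file(splits, "f", "n"),
    "f3.rf": tablets_for_file(splits, "u", "z"),
}
print(plan)
```

Because the destination tablets follow from the row ranges alone, a plan of this kind lets the load skip reading the files to discover where they belong.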
Summaries enable continually generating statistics about a table with user-defined functions. This feature can inform a user about what is in their table and be used by compaction strategies to make decisions. For example, using this feature it would be possible to compact all tablets where deletes are more than 25% of the data. Another example use case is optimizing filtering compactions by enabling smart selection of files with pertinent data. Examples of filtering compactions are age off and removal of non-compliant data.
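The delete-ratio example can be made concrete with a small sketch. It is written in Python as a language-neutral illustration and does not use Accumulo's summarizer interfaces. The function names, the `(key, is_delete)` entry shape, and the sample data are all assumptions made for this example.

```python
from collections import Counter

def summarize(entries):
    """Build a summary for one file from (key, is_delete) pairs."""
    summary = Counter()
    for _key, is_delete in entries:
        summary["total"] += 1
        if is_delete:
            summary["deletes"] += 1
    return summary

def merge(summaries):
    """Combine per-file summaries into one summary for a tablet."""
    combined = Counter()
    for s in summaries:
        combined.update(s)  # Counter.update adds counts together
    return combined

def should_compact(summary, threshold=0.25):
    """True when deletes make up more than `threshold` of all entries."""
    total = summary["total"]
    return total > 0 and summary["deletes"] / total > threshold

# Two hypothetical files belonging to the same tablet.
file_a = [("r1", False), ("r2", False), ("r3", True), ("r4", False)]
file_b = [("r5", True), ("r6", True), ("r7", False), ("r8", False)]
tablet_summary = merge([summarize(file_a), summarize(file_b)])
print(dict(tablet_summary), should_compact(tablet_summary))
```

The key property is that summaries are small and mergeable, so a decision about a whole tablet can be made from per-file statistics without rescanning the data.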
Scan executors support prioritizing and dedicating scan resources. Each executor has a configurable number of threads and an optional custom prioritizer. Tables can be configured in a flexible way to dispatch scans to different executors.
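A rough model of this arrangement is sketched below in Python. It is not Accumulo's SPI: the class, the `dispatch` function, the executor names, and the `ranges` field used by the prioritizer are all hypothetical. The thread count is only recorded here. A real executor would run that many scans concurrently.

```python
import heapq
import itertools

class ScanExecutor:
    """A named pool with a thread count and an optional prioritizer."""

    def __init__(self, name, threads, prioritizer=None):
        self.name = name
        self.threads = threads  # recorded only in this model
        # Without a prioritizer every scan gets the same priority (FIFO).
        self.prioritizer = prioritizer or (lambda scan: 0)
        self._queue = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, scan):
        priority = self.prioritizer(scan)
        heapq.heappush(self._queue, (priority, next(self._order), scan))

    def next_scan(self):
        """Return the queued scan with the lowest priority value."""
        return heapq.heappop(self._queue)[2]

def dispatch(scan, table_to_executor, executors, default="default"):
    """Queue a scan on the executor configured for its table."""
    name = table_to_executor.get(scan["table"], default)
    executors[name].submit(scan)
    return name

executors = {
    "default": ScanExecutor("default", threads=16),
    # Hypothetical prioritizer: scans over fewer ranges run first.
    "lookups": ScanExecutor("lookups", threads=4,
                            prioritizer=lambda s: s["ranges"]),
}
table_to_executor = {"users": "lookups"}  # per-table dispatch setting

dispatch({"table": "users", "ranges": 50}, table_to_executor, executors)
dispatch({"table": "users", "ranges": 1}, table_to_executor, executors)
dispatch({"table": "logs", "ranges": 10}, table_to_executor, executors)
```

In this model, scans on the `users` table never compete with scans on `logs` for the same queue, and within the `lookups` executor the small scan is served before the large one.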
All new pluggable components introduced in 2.0 were placed under a new SPI package. The SPI package is analyzed by Apilyzer at build time to ensure plugins only use SPI and API types. This prevents plugins from using internal Accumulo types that are inherently unstable over time. Plugins created before 2.0 do use internal types and are less stable. The new pluggable interfaces should be much more stable.
Official Accumulo docker image was created
The --upload-accumulo-site option was added to accumulo init to set properties from accumulo-site.xml in ZooKeeper during initialization.
The -o <key>=<value> option was added to the accumulo command to override configuration that could not be set in ZooKeeper.
Updated and improved Accumulo documentation
Accumulo’s documentation has been refactored with the following improvements:
- Documentation source now lives in the accumulo-website repo, so changes are now immediately viewable.
- Improved navigation using a new sidebar.
- Better linking to Javadocs, between documentation pages, and to configuration properties.
Accumulo’s documentation was also reviewed and changes were made to improve accuracy and remove out of date documentation.
Moved Accumulo Examples to its own repo
The Accumulo examples were moved out of the accumulo repo to the accumulo-examples repo, which has the following benefits:
- The Accumulo examples are no longer released with Accumulo and can be continuously improved.
- The Accumulo API version used by the examples can be updated right before Accumulo is released to test for any changes to the API that break semver.
Simplified Accumulo logging configuration
The log4j configuration of Accumulo services was improved in ACCUMULO-4588 with the following changes:
- Logging is now configured using standard log4j JVM property ‘log4j.configuration’ in accumulo-env.sh.
- Tarball ships with fewer log4j config files (3 rather than 6) which are all log4j properties files.
- Log4j XML can still be used by editing accumulo-env.sh
- Removed auditLog.xml and added audit log configuration to log4j-service properties files
- Accumulo conf/ directory no longer has an examples/ directory. Configuration files ship in conf/ and are used by default.
- Accumulo monitor by default will bind to 0.0.0.0 but will advertise hostname looked up in Java for log forwarding
- Switched to use full hostnames rather than short hostnames for logging
Removed comparison of Value with byte[] in Value.equals()
The ability to use Value.equals(byte[]) to check if the contents of a Value object were equal to a given byte array was removed in ACCUMULO-4726. To perform that check, you must now use the newly added method. This corrects the behavior of the equals method so that it conforms to the API contract documented in the javadoc inherited from its superclass. However, it will break any code that was relying on the undocumented and broken behavior to compare Value objects with byte arrays. Such comparisons will now return false instead of true, even if the contents are equal.
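The behavior change can be modeled with a small stand-in class. The sketch below is Python rather than Java, and the class and both method names are hypothetical. It is not Accumulo's Value class and does not show the name of the newly added method. It only illustrates type-strict equality paired with a separate, explicit content comparison.

```python
class Value:
    """Hypothetical stand-in for a byte-array wrapper; not Accumulo's class."""

    def __init__(self, data: bytes):
        self._data = bytes(data)

    def equals(self, other) -> bool:
        # Type-strict, like a well-behaved Java equals(): only another Value
        # can be equal, so a raw byte array never matches.
        return isinstance(other, Value) and self._data == other._data

    def has_same_contents(self, data: bytes) -> bool:
        # Explicit content comparison against raw bytes (hypothetical name).
        return self._data == bytes(data)

raw = b"abc"
value = Value(raw)
print(value.equals(raw))             # False
print(value.equals(Value(raw)))      # True
print(value.has_same_contents(raw))  # True
```

Code that previously relied on the equality check to compare against raw bytes needs to switch to the explicit content comparison, because the equality check now silently returns false for that case.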
Removed default dynamic reloading classpath directory (lib/ext)
In #1179, the default directory for dynamic class reloading (lib/ext) was removed and the default value for the deprecated property general.dynamic.classpaths was set to blank. This was done as part of a plan to phase out class loading behaviors that are tightly coupled to Accumulo, in favor of more user-pluggable class loading features that are easier to maintain separately from Accumulo’s core code.
To continue to use this feature until it is removed, you must set this property to a value. However, it is recommended to add your non-dynamic user class paths to the CLASSPATH environment in accumulo-env.sh instead, or to leverage the per-table context class paths feature, depending on your use case. For reference, the previous default value was
Other Notable Changes
- ACCUMULO-3652 - Replaced string concatenation in log statements with slf4j where applicable. Removed tserver TLevel logging class.
- ACCUMULO-4449 - Removed ‘slave’ terminology and replaced with ‘tserver’ in most cases. The former ‘slaves’ config file is now named ‘tservers’. Added checks to scripts to fail if ‘slaves’ file is present.
- ACCUMULO-4808 - Can now create table with splits and offline. Specifying splits at table creation time can be much faster than adding splits after creation.
- ACCUMULO-4463 - Caching is now pluggable.
- ACCUMULO-4177 - New built in cache implementation based on TinyLFU.
- ACCUMULO-4376 ACCUMULO-4746 - Mutation and Key Fluent APIs allow easy mixing of types. For example, a family of type String and qualifier of type byte[] is much easier to write using this new API.
- ACCUMULO-4771 - The Accumulo monitor was completely rewritten.
- ACCUMULO-4732 - Specify iterators and locality groups at table creation time.
- ACCUMULO-4612 - Use percentages for memory related configuration.
- ACCUMULO-1787 - Two tier compaction strategy. Support compacting small files with snappy and large files with gzip.
- #560 - Provide new Crypto interface & impl
- #536 - Removed mock Accumulo.
- #438 - Added support for ZStandard compression
- #404 - Added basic Grafana dashboard example.
- #1102 #1100 #1037 - Removed lock contention in different areas. These locks caused threads working on unrelated tasks to impede each other.
- #1033 - Optimized the default compaction strategy. In some cases Accumulo would rewrite data O(N^2) times over repeated compactions. With this change the amount of rewriting is always logarithmic.
- Many performance improvements mentioned in the 1.9.X release notes are also available in 2.0.
- Scanners close server side sessions on close #813 #905
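As an illustration of the percentage-based memory settings mentioned in the list above (ACCUMULO-4612), the sketch below resolves a setting to a byte count. It is a Python model, not Accumulo's implementation. It assumes that a percentage is taken relative to the maximum JVM heap size and that the K, M, and G suffixes are 1024-based. The function name is hypothetical.

```python
_UNITS = {"B": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def resolve_memory(setting, max_heap_bytes):
    """Turn a memory setting such as '512M' or '40%' into a byte count."""
    setting = setting.strip().upper()
    if not setting:
        raise ValueError("empty memory setting")
    if setting.endswith("%"):
        # Assumption: percentages are relative to the maximum heap size.
        percent = int(setting[:-1])
        if not 0 <= percent <= 100:
            raise ValueError(f"percentage out of range: {setting}")
        return max_heap_bytes * percent // 100
    if setting[-1] in _UNITS:
        return int(setting[:-1]) * _UNITS[setting[-1]]
    return int(setting)  # plain number of bytes

heap = 4 * 1024 ** 3  # hypothetical 4 GiB heap
print(resolve_memory("25%", heap))
print(resolve_memory("512M", heap))
```

The practical benefit of a percentage is that the same configuration scales when the heap size changes, whereas a fixed size has to be retuned by hand.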
View the Upgrading Accumulo documentation for guidance.