
Hive 0.13 released, adding ACID features

Published 2014-04-23 08:16 | Source: Cwiki | Author: Alan Gates

Abstract: Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides a simple SQL query capability, and converts SQL statements into MapReduce jobs. Hive 0.13 has just been released; what has changed in this version?

[Editor's note] The recently released Hive 0.13 adds a transaction mechanism with ACID semantics: atomicity, consistency, and durability are guaranteed at the partition level, and isolation is enforced by enabling a ZooKeeper- or memory-based lock mechanism. New use cases such as streaming data ingest, slowly changing dimensions, and data restatement become possible in this version, though the new release still has some limitations. What exactly does the new version of Hive change? The author, Alan Gates, brings us an excellent analysis.

The following is the original text:

What is ACID? What role does it play?

ACID stands for the four properties of database transactions: atomicity (a database operation either completes entirely or has no effect at all), consistency (once an application performs an operation, the result of that operation is visible to every subsequent operation), isolation (one user's operations have no unintended side effects on other users), and durability (once an operation completes, its effects are recorded permanently, even if the machine or system fails). These properties have long been considered an essential part of transaction support.

In the recently released Hive 0.13, atomicity, consistency, and durability are guaranteed at the partition level, and isolation is guaranteed by enabling the ZooKeeper- or memory-based lock mechanism. With the transactions added in Hive 0.13, full ACID semantics are provided at the row level, so one application can add rows while another reads from the same partition without the two interfering with each other.

A transaction mechanism with ACID semantics has been added to Hive to address the following use cases:

  1. Streaming ingest of data. Many users employ tools such as Apache Flume, Apache Storm, or Apache Kafka to write data into their Hadoop cluster at rates of hundreds of rows per second. Hive can add a partition every 15 minutes to an hour, but adding partitions too often quickly leads to an unmanageable number of them in a table. These tools could also write into existing partitions, but that would cause readers to see dirty reads (data modified while their query was running) and would leave many small files in the partition's directory, putting pressure on the NameNode. With the new streaming feature, this use case lets readers get a consistent view of the data while avoiding too many files.
  2. Slowly changing dimensions. In a typical star-schema data warehouse, dimension tables change slowly over time. For example, when a retailer opens a new store, that store needs to be added to the stores table; but an existing store might also expand its floor space or add a new service. These changes lead to inserting new records into the warehouse or changing existing ones (depending on the strategy chosen). Hive cannot support these operations yet, but once INSERT ... VALUES, UPDATE, and DELETE are supported, slowly changing dimensions become possible (a hedged sketch follows this list).
  3. Data restatement. Sometimes the collected data is incorrect and needs correction. Perhaps the first instance of the data is an approximation (reports from 90% of the servers) with the full data delivered later. Perhaps later business transactions require restating earlier ones (for example, a customer may purchase a membership and thereby become entitled to discounted prices, including on earlier transactions). Or a user may terminate their relationship and, as required by contract, have their data removed. Once INSERT ... VALUES, UPDATE, and DELETE are supported, data restatement becomes possible as well.
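
The following is a hedged sketch of what use cases 2 and 3 could look like once INSERT ... VALUES, UPDATE, and DELETE arrive (planned for the release after 0.13); the dim_store and customer_dim tables and their columns are hypothetical:

    -- Use case 2, slowly changing dimension: close out the old version of a
    -- store record and insert the new one, using the planned syntax.
    UPDATE dim_store
      SET current_flag = 'N'
      WHERE store_id = 42 AND current_flag = 'Y';
    INSERT INTO TABLE dim_store
      VALUES (42, 'Downtown Store', '2014-04-23', 'Y');

    -- Use case 3, restatement: remove a departed customer's rows per contract.
    DELETE FROM customer_dim WHERE customer_id = 99;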

Limitations:

  • Hive 0.13 does not support INSERT ... VALUES, UPDATE, or DELETE, and BEGIN, COMMIT, and ROLLBACK are not supported either; these features are planned for the next release.
  • This first release in Hive 0.13 supports only the ORC file format. Transactions are designed to work with any storage format that can determine how updates or deletes apply to base records (in practice, any format with explicit or implicit row IDs), but so far the integration has been done only for ORC.
  • The streaming interface (see below) is not integrated with Hive's existing INSERT INTO operation. If a table is being written through the streaming interface, any data added with INSERT INTO will be lost. INSERT OVERWRITE remains usable, and it overwrites data that streaming has written to the partition just as it would any other data.
  • Transactions are off by default in Hive 0.13. See the configuration section below; several key configuration items must be set manually.
  • Tables must be bucketed to make use of these features. Tables in the same system that do not use transactions and ACID do not need to be bucketed (a sample table definition follows this list).
  • Only snapshot-level isolation is supported. A given query is provided with a consistent snapshot of the data as of when it starts. Dirty read, read committed, repeatable read, and serializable are not supported. Once BEGIN is introduced, the intention is to support snapshot isolation for the duration of a transaction rather than just a single query. Other isolation levels could be added depending on user demand.
  • The existing ZooKeeper and in-memory lock managers are not compatible with transactions. There is currently no plan to address this; to see how locks are stored for transactions, refer to the basic design section below.
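
As a minimal sketch of a table definition that meets these requirements, bucketed and stored as ORC. The table name and columns are hypothetical, and the 'transactional' table property is shown as later releases formalize it, for illustration only:

    CREATE TABLE page_views (
      user_id BIGINT,
      page    STRING,
      ts      TIMESTAMP
    )
    PARTITIONED BY (dt STRING)
    CLUSTERED BY (user_id) INTO 8 BUCKETS   -- bucketing is required for ACID
    STORED AS ORC                           -- only ORC is supported so far
    TBLPROPERTIES ('transactional'='true'); -- hypothetical marker, see above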

Streaming interface

For more information on streaming data ingest, see StreamingDataIngest.

Syntax changes

Several new commands have been added to Hive's DDL to support ACID and transactions, and some existing DDL commands have been modified.

For example, there is a new SHOW TRANSACTIONS command; for details, see ShowTransactions.

SHOW COMPACTIONS is also a newly added command; for details, see ShowCompactions.

The existing SHOW LOCKS command has been modified to provide the new lock information associated with transactions. If you are using the ZooKeeper or in-memory lock managers, you will notice little change in this command's output. For details, see ShowLocks.
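
All three commands can be run directly from a Hive session, for example:

    SHOW TRANSACTIONS;  -- list open and aborted transactions
    SHOW COMPACTIONS;   -- list queued and in-progress compaction requests
    SHOW LOCKS;         -- list current locks, now with transaction information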

ALTER TABLE has gained a new option to compact a table or partition. Users do not normally need to request compactions, because the system detects the need for them and starts compaction automatically. However, if automatic compaction is turned off for a table, or a user wants to compact the table at a time the system would not choose to, ALTER TABLE lets the user start a compaction manually; for details, see AlterTable/PartitionCompact. ALTER TABLE queues the compaction request and returns; to watch the progress of the compaction, use the SHOW COMPACTIONS command, as shown below.
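
For example, a manual compaction request might look like this (the table and partition are hypothetical):

    -- Queue a minor compaction of one partition and return immediately.
    ALTER TABLE page_views PARTITION (dt='2014-04-23') COMPACT 'minor';
    -- A major compaction rewrites the base files as well.
    ALTER TABLE page_views PARTITION (dt='2014-04-23') COMPACT 'major';
    -- Watch the queued requests make progress.
    SHOW COMPACTIONS;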

Basic design

HDFS does not support in-place changes to files. Nor does it offer read consistency when a writer appends to a file that other users are reading. To provide these capabilities on top of HDFS, we follow the standard approach used in other data warehousing tools: the data for a table or partition is stored in a set of base files, while new records, updates, and deletes are stored in delta files. A new set of delta files is created for each transaction that alters a table or partition (or, for a streaming agent such as Flume or Storm, for each batch of transactions). At read time, the reader merges the base and delta files, applying the updates and deletes as it goes.

Occasionally these changes need to be merged into the base files, so a set of threads has been added to the Hive metastore. They determine when compaction is needed, perform the compaction, and afterwards clean up (delete the old files). Compactions come in two types: minor and major. A minor compaction takes the set of existing delta files and rewrites them into a single delta file per bucket. A major compaction takes one or more delta files plus the base file for a bucket and rewrites a new base file per bucket. All compaction is done in the background and does not interfere with concurrent reads and writes of the data. After a compaction, the system waits until all readers of the old files have finished, then deletes the old files.

Previously, all the files for a partition (or for a table, if the table is not partitioned) lived in a single directory. With these changes, every partition written with ACID semantics gets a directory for its base files plus a directory for each set of delta files.
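
As a hedged illustration, the directory for one partition of such a table might look like the following (the path and transaction IDs are invented; the names follow the base_/delta_ convention of the ACID file layout):

    /warehouse/page_views/dt=2014-04-23/
      base_0000010/            -- base files from the last major compaction
        bucket_00000
        bucket_00001
      delta_0000011_0000011/   -- delta files from a single transaction
        bucket_00000
        bucket_00001
      delta_0000012_0000015/   -- delta files from a batch of transactions
        bucket_00000
        bucket_00001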

A new lock manager, DbLockManager, has also been added to Hive. This lock manager stores all lock information in the metastore, and all transactions are stored in the metastore as well, which means locks and transactions remain durable even across server failures. To keep locks from being left dangling when a client crashes or goes away, lock holders and transaction initiators must send heartbeat signals to the metastore; if the server does not receive a heartbeat from a client within the configured time, the lock or transaction is aborted.

Configuration

Many new configuration keys have been added to the system to support transactions. Each is listed below with its default value, the value needed to enable transactions where applicable, and notes:

  • hive.txn.manager — Default: org.apache.hadoop.hive.ql.lockmgr.DummyTxnManager. To enable transactions: org.apache.hadoop.hive.ql.lockmgr.DbTxnManager. DummyTxnManager replicates pre-Hive-0.13 behavior and provides no transactions.
  • hive.txn.timeout — Default: 300. Time in seconds after which a transaction is aborted if the client has not sent a heartbeat.
  • hive.txn.max.open.batch — Default: 1000. Maximum number of transactions that can be fetched in one call to open_txns().
  • hive.compactor.initiator.on — Default: false. To enable transactions: true (on exactly one instance of the Thrift metastore service). Whether to run the initiator and cleaner threads on this metastore instance.
  • hive.compactor.worker.threads — Default: 0. To enable transactions: greater than 0 on at least one instance of the Thrift metastore service. How many compaction worker threads to run on this metastore instance.
  • hive.compactor.worker.timeout — Default: 86400. Time in seconds after which a compaction job is declared failed and is re-queued.
  • hive.compactor.check.interval — Default: 300. Time in seconds between checks for whether any partition needs to be compacted.
  • hive.compactor.delta.num.threshold — Default: 10. Number of delta directories in a partition that triggers a minor compaction.
  • hive.compactor.delta.pct.threshold — Default: 0.1. Fractional size of the delta files relative to the base file that triggers a major compaction (1 = 100%).
  • hive.compactor.abortedtxn.threshold — Default: 1000. Number of aborted transactions involving a given partition that triggers a major compaction.

hive.txn.max.open.batch controls how many transactions streaming agents such as Flume or Storm open at once. The streaming agent then writes that number of entries into a single file (per Flume agent or per Storm bolt). Increasing this value thus decreases the number of files created by streaming agents, but it also increases the number of open transactions that Hive has to track at any one time, which may hurt read performance.

The worker threads spawn MapReduce jobs to perform the compactions; they do not perform the compactions themselves. Once a table has been determined to need compaction, increasing the number of worker threads will decrease the time it takes to compact it. It will also increase the background load on the Hadoop cluster, as more MapReduce jobs run in the background.

Decreasing hive.compactor.check.interval reduces the delay between a table or partition needing compaction and the compaction being started. However, each check requires several NameNode calls per table or partition, so decreasing this value increases the load on the NameNode.
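
As a hedged sketch of a minimal transaction setup: these keys normally belong in hive-site.xml on the relevant services, and are shown here as session SET statements purely for illustration (the hive.compactor.* keys take effect on the metastore service, not in a client session):

    -- Illustrative only; in practice set these in hive-site.xml.
    SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    -- On exactly one metastore instance:
    SET hive.compactor.initiator.on=true;
    -- On at least one metastore instance:
    SET hive.compactor.worker.threads=2;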

Table properties

If a table's owner does not want the system to decide automatically when to compact it, the table property NO_AUTO_COMPACTION can be set, which prevents all automatic compactions of that table.
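
A hedged example, using the hypothetical table from earlier (manual compactions via ALTER TABLE ... COMPACT still work on such a table):

    -- Disable automatic compaction for this table only.
    ALTER TABLE page_views SET TBLPROPERTIES ('NO_AUTO_COMPACTION'='true');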

Original article: ACID and Transactions in Hive (translation: Mao Mengqi; editor: Wei Wei)

