Avro (pronounced roughly [ævrə]) is a Hadoop subproject led by Hadoop's creator, Doug Cutting. It is a data serialization system designed for applications that exchange data in bulk. Its main features are a binary serialization format that handles large volumes of data quickly and conveniently, and friendliness to dynamic languages: Avro provides mechanisms that let dynamic languages process Avro data easily.
In Hive, we can store data in the Avro format. This article uses avro-1.7.1.jar as the example.
To use Avro in Hive, place the following four jars in the $HIVE_HOME/lib directory: avro-1.7.1.jar, avro-tools-1.7.4.jar, jackson-core-asl-1.8.8.jar, and jackson-mapper-asl-1.8.8.jar. You can also keep these jars in another location, as long as all four are on the CLASSPATH.
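If you prefer not to modify $HIVE_HOME/lib, Hive's ADD JAR command registers jars for the current session only. A minimal sketch, assuming the jars were downloaded to /home/wyp/lib/ (the path is an assumption, adjust to wherever you keep them):

```shell
# Register the Avro and Jackson jars for this Hive session only
# (paths are assumed; use your actual download location)
hive -e "
ADD JAR /home/wyp/lib/avro-1.7.1.jar;
ADD JAR /home/wyp/lib/avro-tools-1.7.4.jar;
ADD JAR /home/wyp/lib/jackson-core-asl-1.8.8.jar;
ADD JAR /home/wyp/lib/jackson-mapper-asl-1.8.8.jar;
"
```

Jars added this way must be re-added in every new session, so for permanent use the $HIVE_HOME/lib approach above is simpler.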
To parse Avro-formatted data, we can create the Hive table with the following statement:
hive> CREATE EXTERNAL TABLE tweets
    > COMMENT "A table backed by Avro data with the
    >          Avro schema embedded in the CREATE TABLE statement"
    > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    > STORED AS
    > INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    > LOCATION '/user/wyp/examples/input/'
    > TBLPROPERTIES (
    >     'avro.schema.literal'='{
    >         "type": "record",
    >         "name": "Tweet",
    >         "namespace": "com.miguno.avro",
    >         "fields": [
    >             { "name":"username",  "type":"string"},
    >             { "name":"tweet",     "type":"string"},
    >             { "name":"timestamp", "type":"long"}
    >         ]
    >     }'
    > );
OK
Time taken: 0.076 seconds
hive> describe tweets;
OK
username        string    from deserializer
tweet           string    from deserializer
timestamp       bigint    from deserializer
Next, we compress the data we need with Snappy. Here is the data before compression:
{
    "username": "miguno",
    "tweet": "Rock: Nerf paper, scissors is fine.",
    "timestamp": 1366150681
},
{
    "username": "BlizzardCS",
    "tweet": "Works as intended. Terran is IMBA.",
    "timestamp": 1366154481
},
{
    "username": "DarkTemplar",
    "tweet": "From the shadows I come!",
    "timestamp": 1366154681
},
{
    "username": "VoidRay",
    "tweet": "Prismatic core online!",
    "timestamp": 1366160000
}
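One way to produce a Snappy-compressed Avro container file from this JSON is the avro-tools jar already on our list. Note that the fromjson subcommand expects one JSON record per line (no commas between records), and the schema file must match the table's schema; the file names below follow this article's example but are otherwise an assumption:

```shell
# Convert newline-delimited JSON records into a Snappy-compressed
# Avro data file, using the same schema the Hive table declares.
# fromjson writes the Avro container to stdout.
java -jar avro-tools-1.7.4.jar fromjson \
    --schema-file twitter.avsc \
    --codec snappy \
    twitter.json > twitter.avro
```

Because Avro files record their codec in the container header, Hive can read the Snappy-compressed file without any extra table settings.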
Suppose the compressed data is stored in the file /home/wyp/twitter.avro; we copy it to the /user/wyp/examples/input/ directory on HDFS:
hadoop fs -put /home/wyp/twitter.avro /user/wyp/examples/input/
Then we can query it in Hive:
hive> select * from tweets limit 5;
OK
miguno        Rock: Nerf paper, scissors is fine.    1366150681
BlizzardCS    Works as intended. Terran is IMBA.     1366154481
DarkTemplar   From the shadows I come!               1366154681
VoidRay       Prismatic core online!                 1366160000
Time taken: 0.495 seconds, Fetched: 4 row(s)
Of course, we can also take the schema from avro.schema.literal:
{
    "type": "record",
    "name": "Tweet",
    "namespace": "com.miguno.avro",
    "fields": [
        { "name": "username",  "type": "string" },
        { "name": "tweet",     "type": "string" },
        { "name": "timestamp", "type": "long" }
    ]
}
and store it in a separate file, for example twitter.avsc; the CREATE TABLE statement above then becomes:
CREATE EXTERNAL TABLE tweets
    COMMENT "A table backed by Avro data with the Avro schema stored in HDFS"
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
    STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
    LOCATION '/user/wyp/examples/input/'
    TBLPROPERTIES (
        'avro.schema.url'='hdfs:///user/wyp/examples/schema/twitter.avsc'
    );
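For avro.schema.url to resolve, the schema file must first exist at the HDFS path referenced in TBLPROPERTIES. A sketch, assuming the schema was saved locally as /home/wyp/twitter.avsc:

```shell
# Create the schema directory on HDFS and upload the Avro schema
# to the path that avro.schema.url points at
hadoop fs -mkdir /user/wyp/examples/schema
hadoop fs -put /home/wyp/twitter.avsc /user/wyp/examples/schema/
```

Keeping the schema in one HDFS file like this lets several tables or jobs share it, and a schema change only has to be made in one place.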
The result is the same as before.
Unless otherwise stated, all articles on this blog are original. Copyright belongs to 过往记忆大数据 (iteblog); reproduction without permission is prohibited.
Link to this article: Using Avro in Hive (https://www.iteblog.com/archives/1007.html)