Friday, July 5, 2013

Querying JSON records via Hive


/* ---[ Opacity: A brief rant ]--- */

Despite the popularity of Hadoop and its ecosystem, I've found that much of it is frustratingly underdocumented or at best opaquely documented. A case in point is the O'Reilly Programming Hive book, whose authors say they wrote it because so much of Hive was poorly documented and existed only in the heads of its developer community.

But even the Programming Hive book lacks good information on how to effectively use Hive with JSON records, so I'm cataloging my findings here.


/* ---[ JSON and Hive: What I've found ]--- */

I've only been playing with Hive for about two weeks now, but here's what I've found with respect to using complex JSON documents with it.

Hive has two built-in functions, get_json_object and json_tuple, for dealing with JSON. There are also a couple of JSON SerDes (Serializer/Deserializers) for Hive. I like this one the best: https://github.com/rcongiu/Hive-JSON-Serde

I'll document all three options here.

Let's start with a simple JSON document and then move to a complex document with nested subdocuments and arrays of subdocuments.

Here's the first document:

{
    "Foo": "ABC",
    "Bar": "20090101100000",
    "Quux": {
        "QuuxId": 1234,
        "QuuxName": "Sam"
    }
}

We are going to store this as a text file, and since Hive's text input format treats each line as a record, the whole JSON entry needs to be on a single line in the file you point the Hive table at.

Here it is on one line for easy copy and pasting:

{"Foo":"ABC","Bar":"20090101100000","Quux":{"QuuxId":1234,"QuuxName":"Sam"}}

Let's create a Hive table to reference this. I've put the above document in a file called simple.json:

CREATE TABLE json_table ( json string );

LOAD DATA LOCAL INPATH '/tmp/simple.json' INTO TABLE json_table;

Since each record is a single opaque string, there are no delimiters to specify, so we leave off the ROW FORMAT section of the table DDL.
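
As a quick sanity check that the load worked (the table is just one string column per line, so this should echo back the raw JSON):

select * from json_table;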


Built-in function #1: get_json_object

get_json_object takes two arguments: the column holding the JSON string (here json_table.json) and a JSONPath-style expression for the field to extract, where '$' represents the root of the document.

select get_json_object(json_table.json, '$') from json_table; 

Returns the full JSON document.

So do this to query all the fields:

select get_json_object(json_table.json, '$.Foo') as foo, 
       get_json_object(json_table.json, '$.Bar') as bar,
       get_json_object(json_table.json, '$.Quux.QuuxId') as qid,
       get_json_object(json_table.json, '$.Quux.QuuxName') as qname
from json_table;

You should get the output:

foo    bar              qid     qname
ABC    20090101100000   1234    Sam

(Note: to get the header row, enter set hive.cli.print.header=true; at the hive prompt or put it in your $HOME/.hiverc file.)

This works and gives you a nice JavaScript-like "dotted" notation, but notice that the document gets re-parsed for every field you pull out of it, so it is rather inefficient.

The Hive wiki recommends using json_tuple for this reason.


Built-in function #2: json_tuple

So let's see what json_tuple looks like. It has the benefit of extracting multiple fields in a single pass, but it only reaches a single level deep. You also need to use Hive's slightly odd LATERAL VIEW notation:

select v.foo, v.bar, v.quux, v.qid 
from json_table jt
     LATERAL VIEW json_tuple(jt.json, 'Foo', 'Bar', 'Quux', 'Quux.QuuxId') v
     as foo, bar, quux, qid;

This returns:

foo  bar             quux                              qid
ABC  20090101100000  {"QuuxId":1234,"QuuxName":"Sam"}  NULL

It doesn't know how to look inside the Quux subdocument. This is where json_tuple gets clunky fast: you have to create another LATERAL VIEW for each subdocument you want to descend into:

select v1.foo, v1.bar, v2.qid, v2.qname 
from json_table jt
     LATERAL VIEW json_tuple(jt.json, 'Foo', 'Bar', 'Quux') v1
     as foo, bar, quux
     LATERAL VIEW json_tuple(v1.quux, 'QuuxId', 'QuuxName') v2
     as qid, qname;

This gives us the output we want:

foo  bar             qid   qname
ABC  20090101100000  1234  Sam

With a complicated, highly nested JSON doc, json_tuple is both inefficient and clunky as hell. So let's turn to a custom SerDe to solve the problem.


The best option: rcongiu's Hive-JSON SerDe

A SerDe is a better choice than a JSON function (UDF) for at least two reasons:

  1. it only has to parse each JSON record once
  2. you can mirror the JSON structure in the Hive table schema, making it much easier to issue queries against it.

I reviewed a couple of SerDes, and by far the best one I've found is rcongiu's Hive-JSON SerDe.

To get that SerDe, clone the project from GitHub and run mvn package. It creates a json-serde-1.1.6.jar in the target directory. If you have a place you like to put your jars for runtime referencing, move it there.
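
Something like this, assuming you have git and Maven installed:

git clone https://github.com/rcongiu/Hive-JSON-Serde.git
cd Hive-JSON-Serde
mvn clean package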

Then tell Hive about it with:

ADD JAR /path/to/json-serde-1.1.6.jar;

You can do this either at the hive prompt or put it in your $HOME/.hiverc file.

Now let's define the Hive schema that this SerDe expects and load the simple.json doc:

CREATE TABLE json_serde (
  Foo string,
  Bar string,
  Quux struct<QuuxId:int, QuuxName:string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

LOAD DATA LOCAL INPATH '/tmp/simple.json' INTO TABLE json_serde;

With the openx JsonSerDe, you can define subdocuments as maps or structs. I prefer structs, as they allow the convenient dotted-path notation (e.g., Quux.QuuxId) and let you match the case of the fields. With maps, all the keys you pass in have to be lowercase, even if you defined them as upper or mixed case in your JSON.
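
For comparison, here is a minimal sketch of the map flavor of the same schema, using a hypothetical json_serde_map table (the subdocument values are treated as strings here); note the lowercased keys required in the query:

CREATE TABLE json_serde_map (
  Foo string,
  Bar string,
  Quux map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

SELECT Foo, Bar, Quux['quuxid'] as qid, Quux['quuxname'] as qname
FROM json_serde_map;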

The query to match the above examples is beautifully simple:

SELECT Foo, Bar, Quux.QuuxId, Quux.QuuxName
FROM json_serde;

Result:

foo  bar             quuxid  quuxname
ABC  20090101100000  1234    Sam



And now let's do a more complex JSON document:

{
  "DocId": "ABC",
  "User": {
    "Id": 1234,
    "Username": "sam1234",
    "Name": "Sam",
    "ShippingAddress": {
      "Address1": "123 Main St.",
      "Address2": null,
      "City": "Durham",
      "State": "NC"
    },
    "Orders": [
      {
        "ItemId": 6789,
        "OrderDate": "11/11/2012"
      },
      {
        "ItemId": 4352,
        "OrderDate": "12/12/2012"
      }
    ]
  }
}

Collapsed version:

{"DocId":"ABC","User":{"Id":1234,"Username":"sam1234","Name":"Sam","ShippingAddress":{"Address1":"123 Main St.","Address2":"","City":"Durham","State":"NC"},"Orders":[{"ItemId":6789,"OrderDate":"11/11/2012"},{"ItemId":4352,"OrderDate":"12/12/2012"}]}}

Hive Schema:

CREATE TABLE complex_json (
  DocId string,
  User struct<Id:int,
              Username:string,
              Name:string,
              ShippingAddress:struct<Address1:string,
                                     Address2:string,
                                     City:string,
                                     State:string>,
              Orders:array<struct<ItemId:int,
                                  OrderDate:string>>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

Load the data:

LOAD DATA LOCAL INPATH '/tmp/complex.json'
OVERWRITE INTO TABLE complex_json;

First let's query something from each section of the document. Since we know there are two orders in the Orders array, we can reference them both directly:

SELECT DocId, User.Id, User.ShippingAddress.City as city,
       User.Orders[0].ItemId as order0id,
       User.Orders[1].ItemId as order1id
FROM complex_json;

Result:

docid  id    city    order0id  order1id
ABC    1234  Durham  6789      4352

But what if we don't know how many orders there are, and we want a list of all of a user's order ItemIds? This will work:

SELECT DocId, User.Id, User.Orders.ItemId
FROM complex_json;

Result:

docid  id    itemid
ABC    1234  [6789,4352]

Oooh, it returns an array of ItemIds. Pretty cool. One of Hive's nice features.
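
And if you want one row per order instead of an array, Hive's explode() function combined with the LATERAL VIEW syntax we saw earlier should do it. A sketch (note that rows with no Orders at all would drop out unless you use LATERAL VIEW OUTER):

SELECT DocId, User.Id, o.ItemId, o.OrderDate
FROM complex_json
LATERAL VIEW explode(User.Orders) t AS o;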

Finally, does the openx JsonSerDe require me to define the whole schema? Or what if I have two JSON docs (say version 1 and version 2) where they differ in some fields? How constraining is this Hive schema definition?

Let's add two more JSON records to our document: the first has no Orders; the second has a new "PostalCode" field in ShippingAddress.

{
    "DocId": "ABC",
    "User": {
        "Id": 1235,
        "Username": "fred1235",
        "Name": "Fred",
        "ShippingAddress": {
            "Address1": "456 Main St.",
            "Address2": "",
            "City": "Durham",
            "State": "NC"
        }
    }
}

{
    "DocId": "ABC",
    "User": {
        "Id": 1236,
        "Username": "larry1234",
        "Name": "Larry",
        "ShippingAddress": {
            "Address1": "789 Main St.",
            "Address2": "",
            "City": "Durham",
            "State": "NC",
            "PostalCode": "27713"
        },
        "Orders": [
            {
                "ItemId": 1111,
                "OrderDate": "11/11/2012"
            },
            {
                "ItemId": 2222,
                "OrderDate": "12/12/2012"
            }
        ]
    }
}

Collapsed version:

{"DocId":"ABC","User":{"Id":1235,"Username":"fred1235","Name":"Fred","ShippingAddress":{"Address1":"456 Main St.","Address2":"","City":"Durham","State":"NC"}}}
{"DocId":"ABC","User":{"Id":1236,"Username":"larry1234","Name":"Larry","ShippingAddress":{"Address1":"789 Main St.","Address2":"","City":"Durham","State":"NC","PostalCode":"27713"},"Orders":[{"ItemId":1111,"OrderDate":"11/11/2012"},{"ItemId":2222,"OrderDate":"12/12/2012"}]}}


Add those records to complex.json and reload the data into the complex_json table (the OVERWRITE form of LOAD above replaces the previous contents, so you won't get duplicates).

Now try the query:

SELECT DocId, User.Id, User.Orders.ItemId
FROM complex_json;

It works just fine and gives the result:

docid  id    itemid
ABC    1234  [6789,4352]
ABC    1235  null
ABC    1236  [1111,2222]

Any field not present will just return null, as Hive normally does even for non-JSON formats.

Note that we cannot query for User.ShippingAddress.PostalCode because we haven't put it in our Hive schema. You would have to revise the schema and then reissue the query.
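
Assuming your Hive version permits changing a struct column's type in place (it's a metadata-only operation), a sketch of that revision would restate the struct with PostalCode added:

ALTER TABLE complex_json CHANGE COLUMN User User
struct<Id:int,
       Username:string,
       Name:string,
       ShippingAddress:struct<Address1:string,
                              Address2:string,
                              City:string,
                              State:string,
                              PostalCode:string>,
       Orders:array<struct<ItemId:int,
                           OrderDate:string>>>;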


/* ---[ A tool to automate creation of Hive JSON schemas ]--- */

One feature missing from the openx JSON SerDe is a tool to generate a schema from a JSON document. Creating a schema for a large, complex, highly nested JSON document by hand is quite tedious.

I've created a tool to automate this: https://github.com/midpeter444/hive-json-schema.
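
Usage is roughly as follows (the jar name will vary by version): it reads a sample JSON doc and prints a CREATE TABLE statement for the openx SerDe:

java -jar json-hive-schema-1.0-jar-with-dependencies.jar /tmp/complex.json complex_json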
