Twitter JSON to CSV

g3troot · May 14, 2018, 4:48pm

I am trying to convert Twitter JSON data to CSV. I can extract the top-level attributes without trouble, but I can't parse the nested ones. The nested JSON looks like this:

 "entities":{

  "hashtags":[

  ],
  "urls":[
     {
        "url":"https:\/\/t.co\/ieON9yclmy",
        "expanded_url":"http:\/\/www.dailymail.co.uk\/news\/article-4044728\/Theresa-wants-use-army-computerised-Trump-mind-readers-help-win-Election.html#ixzz5AE6Hx3VW",
        "display_url":"dailymail.co.uk\/news\/article-4\u2026",
        "indices":[
           39,
           62
        ]
     }
  ],
  "user_mentions":[
     {
        "screen_name":"neonbubble",
        "name":"Mark H",
        "id":10934622,
        "id_str":"10934622",
        "indices":[
           0,
           11
        ]
     }

Right now my Python code looks like this:

from operator import itemgetter
from StringIO import StringIO
import csv
import json
import sys
reload(sys)
sys.setdefaultencoding('utf8')

def get_leaves(item, key=None):
    if isinstance(item, dict):
        leaves = []
        for i in item.keys():
            leaves.extend(get_leaves(item[i], i))
        return leaves
    elif isinstance(item, list):
        leaves = []
        for i in item:
            leaves.extend(get_leaves(i, key))
        return leaves  # return the collected leaves, not the function itself
    else:
        return [(key,item)]


header = ['created_at', 'id', 'id_str', 'in_reply_to_status_id', 'in_reply_to_user_id', 'text', 'source', 'truncated', 'in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'entities']
required_cols = itemgetter(*header)

with open('twitter.json') as f_input, open('output.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(header)
    write_header = True

    for entry in f_input:
        if entry.strip(): 
            leaf_entries = sorted(get_leaves(entry))
            csv_output.writerow(required_cols(json.loads(leaf_entries)))

How can I do this in Python?
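
One possible approach, sketched here under a few assumptions: Python 3 rather than the Python 2 used above, and an input file with one tweet object per line, as the line-by-line loop suggests. The helper names `flatten` and `tweets_to_csv` and the dotted column names (for example `entities.urls.0.url`) are made up for illustration and do not come from any Twitter library. The idea is to flatten each tweet into a single-level dict first, then let `csv.DictWriter` fill in the columns a given tweet does not have.

```python
import csv
import io
import json


def flatten(value, prefix=""):
    """Return a flat dict mapping dotted paths to leaf values."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            flat.update(flatten(child, path))
    elif isinstance(value, list):
        # List items get their position as a path component.
        # An empty list (such as "hashtags": []) produces no column at all.
        for index, child in enumerate(value):
            path = f"{prefix}.{index}" if prefix else str(index)
            flat.update(flatten(child, path))
    else:
        flat[prefix] = value
    return flat


def tweets_to_csv(lines, out):
    """Write one CSV row per non-empty JSON line; return the column names."""
    rows = [flatten(json.loads(line)) for line in lines if line.strip()]
    # Tweets differ in which nested fields they carry, so the header is the
    # union of every path seen, kept in first-seen order.
    columns = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(out, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return columns


# Small in-memory demonstration using a cut-down tweet.
sample = ['{"id": 1, "entities": {"hashtags": [], '
          '"urls": [{"url": "https://t.co/ieON9yclmy", "indices": [39, 62]}]}}']
buffer = io.StringIO()
tweets_to_csv(sample, buffer)
print(buffer.getvalue())
```

To run this on real files, pass open file objects instead of the list and the buffer, for example the input opened with `open('twitter.json', encoding='utf-8')` and the output opened with `open('output.csv', 'w', newline='', encoding='utf-8')`. Because the header is built from every row, all tweets are held in memory at once; for a very large file a two-pass version (collect columns first, then write) would probably be a better fit.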

fxbg · May 14, 2018, 5:01pm

I have to ask, before I give it a go, why convert it to CSV?

g3troot · May 14, 2018, 5:04pm

Running a big data query requires it.

fxbg · May 14, 2018, 6:01pm

#!/usr/bin/python
import csv
import json

with open('json', 'r') as myfile:
        jsonData = myfile.read()

data = json.loads(jsonData)
wfile = csv.writer(open("json.csv", "wb+"))

for data in data:
        ndata = json.dumps(data)
        wfile.writerow([
                ndata["entities"],ndata["urls"]
                ])

It errors:

Traceback (most recent call last):
  File "conv.py", line 14, in <module>
    ndata["entities"],ndata["urls"]
TypeError: string indices must be integers, not str

But maybe you can get it going. I suck at Python. Below is the JSON file I used.

[{ "entities":{

  "hashtags":[

  ],
  "urls":[
     {
        "url":"https:\/\/t.co\/ieON9yclmy",
        "expanded_url":"http:\/\/www.dailymail.co.uk\/news\/article-4044728\/Theresa-wants-use-army-computerised-Trump-mind-readers-help-win-Election.html#ixzz5AE6Hx3VW",
        "display_url":"dailymail.co.uk\/news\/article-4\u2026",
        "indices":[
           39,
           62
        ]
     }
  ],
  "user_mentions":[
     {
        "screen_name":"neonbubble",
        "name":"Mark H",
        "id":10934622,
        "id_str":"10934622",
        "indices":[
           0,
           11
        ]
     }
]
}
}]
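
The TypeError in the traceback above most likely comes from `json.dumps(data)`: it turns each already-parsed dict back into a string, and a string cannot be indexed with `"entities"`. In addition, `"urls"` is not a top-level key; it sits inside `"entities"`. A minimal sketch of a corrected version follows, assuming Python 3 and a file shaped like the one above (a JSON array of tweet objects). The function names, the column names, and the choice to join list values with `;` are illustrative assumptions, not anything the thread settled on.

```python
import csv
import io
import json


def entity_row(tweet):
    """Build one CSV row from a tweet's nested "entities" object."""
    entities = tweet.get("entities", {})
    # "urls" and "user_mentions" are lists of dicts nested under "entities",
    # so each list is collapsed into a single ";"-separated cell.
    urls = ";".join(u["expanded_url"] for u in entities.get("urls", []))
    mentions = ";".join(m["screen_name"] for m in entities.get("user_mentions", []))
    return [urls, mentions]


def write_entities(json_text, out):
    """Parse a JSON array of tweets and write one CSV row per tweet."""
    tweets = json.loads(json_text)  # already a list of dicts; no json.dumps needed
    writer = csv.writer(out)
    writer.writerow(["expanded_urls", "mentioned_screen_names"])
    for tweet in tweets:
        writer.writerow(entity_row(tweet))


# In-memory demonstration with a shortened version of the data above.
demo = ('[{"entities": {"hashtags": [], '
        '"urls": [{"expanded_url": "http://www.dailymail.co.uk/"}], '
        '"user_mentions": [{"screen_name": "neonbubble"}]}}]')
buffer = io.StringIO()
write_entities(demo, buffer)
print(buffer.getvalue())
```

When writing to a real file in Python 3, open it in text mode with `newline=''` (for example `open('json.csv', 'w', newline='')`) rather than `'wb+'`, since the `csv` module expects text there.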

g3troot · May 14, 2018, 6:36pm

You can use the JSON file here: https://pastebin.com/RxCJr871
