October 05, 2016

HDF 2.0 Flow Processing Real-Time Tweets from Strata Hadoop with Slack, TensorFlow, Phoenix, Zeppelin

By:
Timothy Spann

 Original post in HCC

I had a few hours in the morning before the Strata + Hadoop World conference schedule kicked in, so I decided to write a little HDF 2.0 flow to grab all the tweets about the conference.

First up, I used GetTwitter to read tweets and filtered on these terms:

strata, stratahadoop, strataconf, NIFI, FutureOfData, ApacheNiFi, Hortonworks, Hadoop, ApacheHive, HBase, ApacheSpark, ApacheTez, MachineLearning, ApachePhoenix, ApacheCalcite, ApacheStorm, ApacheAtlas, ApacheKnox, Apache Ranger, HDFS, Apache Pig, Accumulo, Apache Flume, Sqoop, Apache Falcon

Input:

InvokeHttp: I used this to download the first image URL from tweets.

GetTwitter: This is our primary and most important source of data. You must have a Twitter account and a Twitter developer account, and you must create a Twitter application. Then you can access the keywords and hashtags above. So far I’ve ingested 14,211 tweets into Phoenix, even though I’ve shut the flow down many times for testing and moving things around. I’ve kept it running live while adding pieces. I do not recommend this development process, but it’s good for exploring data.
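
To make the InvokeHttp step more concrete, here is a minimal Python sketch of picking the first image URL out of a tweet's JSON. It assumes the Twitter REST API v1.1 layout, in which attached media appear under entities.media with media_url and media_url_https fields. The function name first_image_url and the sample tweet are hypothetical, and the actual flow does this with NiFi processors rather than Python.

```python
import json


def first_image_url(raw_tweet):
    """Return the first attached media URL in a tweet, or None if there is none."""
    tweet = json.loads(raw_tweet)
    # Assumed layout: {"entities": {"media": [{"media_url": "..."}]}}
    media = (tweet.get("entities") or {}).get("media") or []
    for item in media:
        url = item.get("media_url_https") or item.get("media_url")
        if url:
            return url
    return None


if __name__ == "__main__":
    # Hypothetical sample tweet, for illustration only.
    sample = json.dumps({
        "text": "Hello from #stratahadoop",
        "entities": {"media": [{"media_url": "http://example.com/a.jpg"}]},
    })
    print(first_image_url(sample))
```

Tweets without attached media return None, so they can skip the download step entirely.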

Processing:

RouteOnAttribute: To process only tweets with an actual message. Sometimes the message is damaged or missing, and there is no point wasting time on those.

ExecuteStreamCommand: To call shell scripts that call TensorFlow C++ binaries and Python scripts. There are many ways to do this, but this is the easiest.

UpdateAttribute: To change the file name for files I downloaded to HDFS.
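
The RouteOnAttribute check described above amounts to a simple predicate on the tweet's attributes. The sketch below expresses that logic in Python for illustration only. In NiFi it is configured as a routing rule on the processor, and the attribute name msg and the route names are assumptions, not taken from the flow template.

```python
def has_message(attributes):
    """Return True when the attributes carry a non-blank tweet message."""
    msg = attributes.get("msg")  # "msg" is a hypothetical attribute name
    return isinstance(msg, str) and msg.strip() != ""


def route(attributes):
    """Name the relationship a flow file would follow (names are hypothetical)."""
    return "tweet" if has_message(attributes) else "unmatched"


if __name__ == "__main__":
    print(route({"msg": "Great keynote at #stratahadoop"}))
    print(route({"msg": "   "}))
```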

For output sinks:

PutHDFS: Saved to HDFS in a few different directories (the first attached image): the raw JSON tweet; a limited set of fields such as handle, message and geolocation; and a fully processed file to which I added TensorFlow Inception v3 image recognition for images attached to Strata tweets and VADER sentiment analysis of the tweet text.

PutSQL: I upserted all tweets that HDF had enriched, by calling TensorFlow and the Python sentiment analysis, into a Phoenix table.

PutSlack: https://nifi-se.slack.com/messages/general/
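
As a rough illustration of the PutSQL step, the sketch below builds a parameterized Phoenix UPSERT statement together with the sql.args.N.type and sql.args.N.value attributes that PutSQL reads for its parameters. The table name, column names and sample values are hypothetical, and every column is treated as VARCHAR for simplicity. The real flow prepares these with NiFi processors, so treat this as a sketch of the idea rather than the flow's actual configuration.

```python
VARCHAR = 12  # value of java.sql.Types.VARCHAR


def build_upsert(table, row):
    """Build a parameterized UPSERT and matching PutSQL-style attributes.

    row is a list of (column, value) pairs so that column order is explicit.
    """
    columns = [col for col, _ in row]
    placeholders = ", ".join("?" for _ in columns)
    sql = "UPSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), placeholders)
    attrs = {}
    for i, (_, value) in enumerate(row, start=1):
        attrs["sql.args.{}.type".format(i)] = str(VARCHAR)
        attrs["sql.args.{}.value".format(i)] = str(value)
    return sql, attrs


if __name__ == "__main__":
    # Hypothetical table and columns, for illustration only.
    statement, attributes = build_upsert("tweets", [
        ("handle", "someuser"),
        ("msg", "Great keynote"),
        ("sentiment", "Positive"),
    ])
    print(statement)
    for name in sorted(attributes):
        print(name, "=", attributes[name])
```

Keeping the values out of the SQL text and passing them as parameters avoids quoting problems with tweet text that contains apostrophes.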

Visualization:

There are a ton of ways to look at this data now.

I used Apache Zeppelin since it was part of my HDP 2.5 cluster and it’s so easy to use. I added a few tables, charts and did quick SQL exploration of the data in Phoenix.

Linux Shell Scripts

# Script 1: label an image stored in HDFS with TensorFlow Inception v3.
# The image file name is passed in as the script argument.
source /usr/local/lib/bazel/bin/bazel-complete.bash
export JAVA_HOME=/opt/jdk1.8.0_101/
/bin/rm -rf /tmp/$@
hdfs dfs -get /twitter/rawimage/$@ /tmp/
# cut -c48- drops the first 47 characters of each output line.
/opt/demo/tensorflow/bazel-bin/tensorflow/examples/label_image/label_image --image="/tmp/$@" --output_layer="softmax:0" --input_layer="Mul:0"  --input_std=128 --input_mean=128 --graph=/opt/demo/tensorflow/tensorflow/examples/label_image/data/tensorflow_inception_graph.pb --labels=/opt/demo/tensorflow/tensorflow/examples/label_image/data/imagenet_comp_graph_label_strings.txt 2>&1| cut -c48-
/bin/rm -rf /tmp/$@

# Script 2: run the Python sentiment analysis on the text passed as the argument.
python /opt/demo/sentiment/sentiment2.py "$@"

Python Script

If you have Python 2.7 installed, my previous articles show how to install pip and NLTK. It is then very easy to do some simple sentiment analysis. I also have a version where I just return the polarity_scores (compound, negative, neutral and positive).

from nltk.sentiment.vader import SentimentIntensityAnalyzer
import sys

sid = SentimentIntensityAnalyzer()
ss = sid.polarity_scores(sys.argv[1])
if ss['compound'] == 0.00:
    print('Neutral')
elif ss['compound'] < 0.00:
    print('Negative')
else:
    print('Positive')
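
The variant mentioned above that just returns the polarity_scores might look something like the following sketch. It is not the author's actual script: the helper name format_scores and the key=value output format are assumptions. It relies on VADER returning a dict with the keys compound, neg, neu and pos, and it needs NLTK with the vader_lexicon data installed to run end to end.

```python
import sys


def format_scores(scores):
    """Render a VADER polarity_scores dict as one line of key=value pairs."""
    keys = ("compound", "neg", "neu", "pos")
    return " ".join("{}={}".format(k, scores[k]) for k in keys)


def main(text):
    # Imported here so format_scores can be used without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    sid = SentimentIntensityAnalyzer()
    print(format_scores(sid.polarity_scores(text)))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Printing all four scores on one line keeps the output easy to capture from ExecuteStreamCommand and parse downstream.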

NiFi 1.0.0 Flow Template

tweetnyc.xml

Resources:

http://conferences.oreilly.com/strata/hadoop-big-data-ny/public/schedule/grid/public/


http://futureofdata.io/

