information retrieval - How to apply topic modeling?

- September 15, 2011

I have 10,000 tweets covering 5 topics. Assume I know the ground truth (the actual topic of each tweet), and I group the tweets into 5 documents, where each document contains the tweets of one particular topic. I then apply LDA to the 5 documents with the number of topics set to 5. In this case I get good topic words.

Now, if I don't know the ground truth of the tweets, how do I create the input documents in such a way that LDA will still give me good topic words describing the 5 topics?

What if I create the input documents by randomly selecting a sample of tweets? What if that ends up giving the input documents similar topic mixtures? Should LDA still find good topic words, as in the case of the 1st paragraph?

If I understand correctly, your problem is about topic modeling on short texts (tweets). One approach is to combine tweets into long pseudo-documents before training LDA. Another is to assume that there is only one topic per document/tweet.
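As a rough illustration of the pooling idea, the sketch below groups tweets into pseudo-documents by hashtag. The pooling key (hashtags), the `pool_by_hashtag` helper, and the sample tweets are all assumptions made for illustration. Pooling by author or by time window would follow the same pattern.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#(\w+)")

def pool_by_hashtag(tweets):
    """Group tweets into pseudo-documents keyed by hashtag (hypothetical helper).

    A tweet with several hashtags is added to every matching pool; tweets
    without a hashtag are collected under the key None.
    """
    pools = defaultdict(list)
    for tweet in tweets:
        # Lower-case the tags so "#Football" and "#football" share a pool.
        tags = {t.lower() for t in HASHTAG.findall(tweet)} or {None}
        for tag in tags:
            pools[tag].append(tweet)
    # Each pseudo-document is the concatenation of the tweets in one pool.
    return {tag: " ".join(texts) for tag, texts in pools.items()}

# Made-up sample tweets, for illustration only.
tweets = [
    "Great goal in the final minute #football",
    "Transfer rumours heating up #Football",
    "New phone launch today #tech",
    "no hashtag here",
]
pseudo_docs = pool_by_hashtag(tweets)
```

The resulting pseudo-documents can then be tokenized and passed to any LDA implementation in place of the individual tweets.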

If you don't know the ground truth labels of the tweets, you might want to look for a one-topic-per-document topic model (i.e. mixture-of-unigrams). The model details are described in:

Jianhua Yin and Jianyong Wang. 2014. A Dirichlet Multinomial Mixture Model-based Approach for Short Text Clustering. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 233–242.
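To make the one-topic-per-document assumption concrete, here is a minimal sketch of collapsed Gibbs sampling for a Dirichlet multinomial mixture, where each document is assigned to exactly one cluster. The function name `gsdmm`, the hyperparameter defaults, and the toy corpus are assumptions for illustration. It is a simplified reading of the model, not the code of any particular implementation.

```python
import math
import random
from collections import Counter

def gsdmm(docs, K=5, alpha=0.1, beta=0.1, iters=30, seed=0):
    """Collapsed Gibbs sampling for a Dirichlet multinomial mixture.

    docs is a list of token lists; returns one cluster index per document.
    """
    rng = random.Random(seed)
    V = len({w for doc in docs for w in doc})  # vocabulary size
    counts = [Counter(doc) for doc in docs]    # per-document word counts
    m = [0] * K                                # documents per cluster
    n = [0] * K                                # tokens per cluster
    nw = [Counter() for _ in range(K)]         # word counts per cluster
    z = []                                     # cluster of each document

    def move(d, k, sign):
        # Add (sign=+1) or remove (sign=-1) document d from cluster k.
        m[k] += sign
        n[k] += sign * len(docs[d])
        for w, c in counts[d].items():
            nw[k][w] += sign * c

    # Random initial assignment.
    for d in range(len(docs)):
        k = rng.randrange(K)
        z.append(k)
        move(d, k, +1)

    for _ in range(iters):
        for d in range(len(docs)):
            move(d, z[d], -1)  # take the document out before resampling
            logp = []
            for k in range(K):
                # Cluster popularity term; the denominator is the same for
                # every k, so it is dropped.
                lp = math.log(m[k] + alpha)
                # Word-fit term, handling repeated words within a document.
                for w, c in counts[d].items():
                    for j in range(c):
                        lp += math.log(nw[k][w] + beta + j)
                for i in range(len(docs[d])):
                    lp -= math.log(n[k] + V * beta + i)
                logp.append(lp)
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]
            z[d] = rng.choices(range(K), weights=weights)[0]
            move(d, z[d], +1)
    return z

# Toy corpus, made up for illustration.
docs = [
    "goal match team win".split(),
    "team win league goal".split(),
    "phone launch chip screen".split(),
    "chip screen phone battery".split(),
]
labels = gsdmm(docs, K=2)
```

Topic words for a cluster can then be read off as its most frequent words. Because sampling is random, results vary with the seed and the number of iterations.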

You can find Java implementations of this model and of LDA at http://jldadmm.sourceforge.net/. Assuming that you know the ground truth labels, you can also use that implementation to compare these topic models on a document clustering task.
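When ground truth labels are available, one simple way to score a clustering is purity. The sketch below is only an illustration: the `purity` helper and the sample labels are assumptions, and purity is one common choice of metric rather than necessarily the one the linked implementation reports.

```python
from collections import Counter, defaultdict

def purity(predicted, truth):
    """Fraction of documents that belong to the majority true class of their cluster."""
    by_cluster = defaultdict(list)
    for p, t in zip(predicted, truth):
        by_cluster[p].append(t)
    # For each cluster, count the documents carrying its most common true label.
    majority = sum(Counter(ts).most_common(1)[0][1] for ts in by_cluster.values())
    return majority / len(truth)

# Made-up cluster assignments and labels, for illustration only.
predicted = [0, 0, 1, 1, 1, 2]
truth = ["sports", "sports", "tech", "tech", "sports", "politics"]
score = purity(predicted, truth)
```

Purity rewards having many small clusters, so it is best compared between models run with the same number of topics.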

If you'd like to evaluate topic coherence (i.e. evaluate how good the topic words are), I suggest you have a look at the Palmetto toolkit (https://github.com/aksw/palmetto), which implements topic coherence calculations.
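For intuition about what a coherence score measures, the sketch below computes a UMass-style coherence from document co-occurrence counts. It is not Palmetto's API: the `umass_coherence` helper and the tiny reference corpus are assumptions for illustration, and a real evaluation would use a large reference corpus.

```python
import math

def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic from document co-occurrence counts.

    top_words are ordered from most to least probable; docs is a list of
    token lists. Every top word is assumed to occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def df(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            # Smoothed log ratio of joint to single document frequency.
            score += math.log((df(top_words[m], top_words[l]) + 1) / df(top_words[l]))
    return score

# Tiny made-up reference corpus, for illustration only.
docs = [["goal", "team"], ["goal", "team", "win"], ["phone", "chip"]]
coherent = umass_coherence(["goal", "team"], docs)
mixed = umass_coherence(["goal", "phone"], docs)
```

Words that tend to appear in the same documents score higher, so `coherent` comes out above `mixed` here.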

information-retrieval topic-modeling
