How can we use text data to draw conclusions about the users of our website or forum?
This talk describes a solution to one particular problem using Machine Learning and Statistics. Given the posts of a forum, we will create a program that learns the structure of those posts using Natural Language Processing techniques. Once the Machine Learning models are trained, the program can estimate, as a probability, which of the forum's users wrote a particular post.
We will go through all the steps required to create Machine Learning models for text. How can Natural Language Processing and Bag-of-Words techniques be used to analyse text? How should input data be prepared for further processing by Machine Learning models? I will answer those questions. The implementation will be written in Apache Spark, so we will get to know that technology along with some of its important libraries, such as Spark MLlib and the DataFrame API. From MLlib we will use the Gaussian Mixture Model and Logistic Regression.
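To give a feel for the Bag-of-Words step before the talk, here is a minimal sketch in plain Python. It is not the talk's Spark implementation: the function names (`tokenize`, `build_vocabulary`, `bag_of_words`) and the two sample posts are assumptions made up for illustration. The idea is that each post becomes a vector of word counts over a shared vocabulary, and such vectors are the kind of numeric input a classifier like Logistic Regression can be trained on.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(posts):
    """Map every distinct token in the corpus to a column index.

    Tokens are sorted so the mapping is deterministic.
    """
    tokens = sorted({tok for post in posts for tok in tokenize(post)})
    return {tok: i for i, tok in enumerate(tokens)}


def bag_of_words(post, vocabulary):
    """Return a count vector: entry i is how often vocabulary word i occurs."""
    counts = Counter(tokenize(post))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # words unseen when building the vocabulary are ignored
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical forum posts, invented for this example.
posts = [
    "Spark is fast and Spark is fun",
    "I think the forum is slow",
]
vocabulary = build_vocabulary(posts)
vectors = [bag_of_words(p, vocabulary) for p in posts]
```

Word order is discarded on purpose: only the counts survive, which keeps the representation simple at the cost of losing sentence structure.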