Software engineer, programming mostly in Java and Scala. Fan of microservices architecture and functional programming. Every day I dedicate considerable time and effort to becoming better. Recently I have been diving into Big Data technologies such as Apache Spark and Hadoop. I am passionate about nearly everything associated with software development. I think that we should always try different solutions and approaches before settling on one. Recently I was a speaker at a few conferences in Poland, Confitura and JDD, and also at the Krakow Scala User Group. I also conducted a live coding session at the Geecon conference.
How can we use text data to draw conclusions about the users of our website or forum?
This talk describes a solution to a particular problem using Machine Learning and statistics. Starting from the content of a given forum, we will create a program that learns the structure of posts using Natural Language Processing techniques. Once suitable Machine Learning models are trained, the program can answer, with a probability, which forum user wrote a particular post.
We will go through all the steps required to create Machine Learning models for text. How do we use Natural Language Processing and Bag-of-Words techniques to analyse text? How do we prepare input data for further processing by Machine Learning models? I will answer those questions. The implementation will be written in Apache Spark, so we will get to know that technology along with some of its important libraries, such as Spark MLlib and the DataFrame API. In MLlib we will use the Gaussian Mixture Model and Logistic Regression.
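To give a feel for the Bag-of-Words step mentioned above, here is a minimal, dependency-free sketch in plain Python. It is not the implementation from the talk, which is written in Apache Spark. The function names (`tokenize`, `build_vocabulary`, `bag_of_words`) and the two sample posts are hypothetical, chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a fixed column index."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def bag_of_words(text, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical forum posts, invented for illustration only.
posts = [
    "Spark is fast and Spark is fun",
    "I prefer plain Scala collections",
]
vocabulary = build_vocabulary(posts)
vectors = [bag_of_words(post, vocabulary) for post in posts]
```

Each post becomes a fixed-length vector of word counts, which is the kind of numeric input a classifier such as Logistic Regression can be trained on. In Spark, comparable steps are typically handled by the `Tokenizer` and `CountVectorizer` feature transformers rather than hand-written functions like these.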