"I understand a fury in your words, but not your words." (- William Shakespeare, Othello, 4.2)

Michal Ptaszynski / Research


Cyberbullying Detection

This page contains some general information about my research in cyberbullying detection. The research began for me around autumn 2009. Actually, it was most probably the 3rd of September, around 20:00 or 21:00, during the banquet of PACLING 2009. Still a Ph.D. student at the time, I had the luck to talk to Prof. Fumito Masui, who had started the research in the first place. He proposed that I process his data with ML-Ask to see if there were any emotional hints in cyberbullying sentences. This is where my contribution to the research begins.

Research Description


We obtained actual cyberbullying data gathered by the Human Rights Center in Mie Prefecture, Japan.

(1) We first analyzed the data with my affect analysis system ML-Ask (Open Source Affect Analysis System for Japanese), and found that vulgar and violent words were the most distinctive feature of cyberbullying.
(2) Then we separately created a dictionary of vulgar and violent words and used it to train an SVM classifier.
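
To give a rough idea of step (2), here is a minimal sketch of how a word dictionary can be turned into numeric features that a classifier such as an SVM could be trained on. It is only an illustration, not the actual system: the dictionary contents, the `extract_features` function and the English whitespace tokenization are all assumptions made for this example (real Japanese text would first need a morphological analyzer), and the classifier training itself is left out.

```python
from collections import Counter

# Hypothetical mini-dictionary for illustration only;
# the dictionary used in the research is not reproduced here.
HARMFUL_DICTIONARY = {
    "vulgar": {"stupid", "loser"},
    "violent": {"hit", "kick"},
}


def extract_features(entry, dictionary=HARMFUL_DICTIONARY):
    """Turn one entry into a feature vector.

    The vector holds one dictionary-hit count per category
    (categories in alphabetical order), followed by the entry length.
    """
    tokens = entry.lower().split()  # assumption: whitespace tokenization
    counts = Counter(tokens)
    features = [
        sum(counts[word] for word in words)
        for _, words in sorted(dictionary.items())
    ]
    features.append(len(tokens))
    return features


harmful_vector = extract_features("You stupid loser I will kick you")
benign_vector = extract_features("nice day at school")
print(harmful_vector, benign_vector)
```

Vectors like these, paired with harmful/non-harmful labels, are the kind of input an SVM implementation would take.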

The first published results, covering part of the results from (2), are in this paper:
Tatsuaki Matsuba, Fumito Masui, Atsuo Kawai, Naoki Isu. 2010. Gakkou hikoushiki saito ni okeru yuugai jouhou kenshutsu [Detection of harmful information on informal school websites] (In Japanese). In Proceedings of The 16th Annual Meeting of The Association for Natural Language Processing (NLP2010).

A conference paper containing both (1) and (2) appeared slightly later:
Michal Ptaszynski, Pawel Dybala, Tatsuaki Matsuba, Fumito Masui, Rafal Rzepka and Kenji Araki. 2010. Machine Learning and Affect Analysis Against Cyber-Bullying. In Proceedings of The Thirty Sixth Annual Convention of the Society for the Study of Artificial Intelligence and Simulation of Behaviour (AISB’10), 29th March – 1st April 2010, De Montfort University, Leicester, UK, pp. 7-16, 2010.

Then we refined the SVM a bit and described it here:
Michal Ptaszynski, Pawel Dybala, Tatsuaki Matsuba, Fumito Masui, Rafal Rzepka, Kenji Araki, and Yoshio Momouchi. 2010. In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis. International Journal of Computational Linguistics Research, Vol. 1, Issue 3, pp. 135-154, 2010.

Eventually we found out that SVMs are in fact not ideal for the task, because they cannot deal with the many ambiguities in the language. We could improve the results a bit, but there seemed to be a ceiling that SVMs could not break through. So we turned to other methods.

(3) We tried SO-PMI-IR. But the original version of the method, which works on single words, did not achieve good results, because in Japanese words by themselves are often not informative enough - one needs to look at a wider context. So we modified Turney's method a bit so that it would work not on words but on phrases - this resolved many of the ambiguities, and the results were much better. The first paper with PMI-IR on cyberbullying is here:
Tatsuaki Matsuba, Fumito Masui, Atsuo Kawai, Naoki Isu. 2011. Gakkou hi-koushiki saito ni okeru yugai jouhou kenshutsu wo mokuteki to shita kyokusei hantei moderu ni kansuru kenkyuu [Study on the polarity classification model for the purpose of detecting harmful information on informal school sites] (in Japanese), In Proceedings of The Seventeenth Annual Meeting of The Association for Natural Language Processing (NLP2011), pp. 388-391.
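
For readers unfamiliar with SO-PMI-IR, the sketch below shows the general shape of Turney's score applied to a phrase instead of a single word. It is a toy under stated assumptions, not our implementation: Turney's original uses search-engine hit counts with a NEAR operator and the seed words "excellent" and "poor", while here the hit counts are replaced by document co-occurrence in a tiny made-up corpus, and the seeds, the corpus and the function names (`hits`, `so_pmi`) are hypothetical. The 0.01 smoothing term follows Turney's way of avoiding division by zero.

```python
import math


def hits(corpus, *terms):
    """Count documents containing every term (stand-in for search-engine hits).

    Assumption: plain substring matching on lower-case text.
    """
    return sum(1 for doc in corpus if all(term in doc for term in terms))


def so_pmi(phrase, corpus, harmful_seed, benign_seed, smoothing=0.01):
    """Orientation of a phrase: above 0 leans harmful, below 0 leans benign."""
    numerator = (hits(corpus, phrase, harmful_seed) + smoothing) * (
        hits(corpus, benign_seed) + smoothing
    )
    denominator = (hits(corpus, phrase, benign_seed) + smoothing) * (
        hits(corpus, harmful_seed) + smoothing
    )
    return math.log2(numerator / denominator)


# Made-up toy corpus, for illustration only.
toy_corpus = [
    "you are so stupid go away",
    "go away nobody likes you stupid",
    "go away you loser",
    "have a nice day friend",
    "nice work on the test",
    "see you at school friend",
]

for phrase in ("go away", "friend"):
    print(phrase, round(so_pmi(phrase, toy_corpus, "stupid", "nice"), 2))
```

Scoring the two-word phrase "go away" as a unit is the point of the phrase-level modification: neither "go" nor "away" says much on its own.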

Then we improved the PMI method quite a bit more by grouping the seed words. The results for that are here:
Taisei Nitta, Fumito Masui, Michal Ptaszynski, Yasutomo Kimura, Rafal Rzepka, Kenji Araki. 2013. Detecting Cyberbullying Entries on Informal School Websites Based on Category Relevance Maximization. In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), pp. 579-586, Nagoya, Japan, October 14-18, 2013.
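
The idea of grouping the seed words can be sketched roughly as follows: the seeds are split into categories, the relevance of a phrase is computed for each category separately, and the highest category score is kept. This is only one plausible reading, written for illustration; the category names, seed words, toy corpus and function names are assumptions, and the exact scoring is the one described in the paper above.

```python
import math


def doc_count(corpus, *terms):
    """Number of documents containing every term (substring match)."""
    return sum(1 for doc in corpus if all(term in doc for term in terms))


def pmi(phrase, seed, corpus, smoothing=0.01):
    """Smoothed pointwise mutual information between a phrase and a seed word."""
    joint = doc_count(corpus, phrase, seed) + smoothing
    phrase_count = doc_count(corpus, phrase) + smoothing
    seed_count = doc_count(corpus, seed) + smoothing
    return math.log2(joint * len(corpus) / (phrase_count * seed_count))


def max_category_relevance(phrase, seed_groups, corpus):
    """Return (best_category, score), keeping the highest per-category score."""
    scores = {
        category: sum(pmi(phrase, seed, corpus) for seed in seeds)
        for category, seeds in seed_groups.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical seed groups and corpus, for illustration only.
seed_groups = {"abusive": ["stupid", "loser"], "violent": ["hit", "kick"]}
toy_corpus = [
    "you stupid loser go away",
    "go away stupid",
    "i will hit you and kick you",
    "kick him and hit him hard",
    "nice day at school",
]

print(max_category_relevance("go away", seed_groups, toy_corpus))
print(max_category_relevance("hit you", seed_groups, toy_corpus))
```

Without the grouping, a phrase that is strongly tied to one kind of harmful seed but unrelated to the others would have its score diluted by the unrelated seeds.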

We also did a demo once and made a poster describing the whole process in a concise way:
Taisei Nitta, Fumito Masui, Michal Ptaszynski, Yasutomo Kimura, Rafal Rzepka, Kenji Araki. 2013. Cyberbullying Detection Based on Category Relevance Maximization. In Demo Session of 20th International Conference on Language Processing and Intelligent Information Systems (LP&IIS 2013), Warsaw, Poland, June 17-18, 2013.

The method based on SO-PMI-IR worked so well that we patented it (Patent No.: 2015-103210).

Now we are doing three things with the cyberbullying data:
1. Looking for a good way of preprocessing the data so that it can be released.
2. Further improving the PMI method.
3. Looking for a new method.

The first point is more political than scientific - we need to find the best way to mask personal information while keeping the data usable for research. As for point 2, two students are working on it right now - one is trying to expand the dictionary, the other is trying to further modify and optimize the PMI method. As for the last point, I am running experiments with my Language Combinatorics method to see whether there are any frequent language patterns in cyberbullying that could be extracted automatically.
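
To illustrate what "frequent language patterns" means here, the following is a rough sketch of the general idea: generate every ordered combination of sentence elements, mark gaps between non-adjacent elements with a wildcard, and keep the patterns that occur in several sentences. It is not the actual Language Combinatorics implementation; the function names, the `*` wildcard notation, the length limit and the example sentences are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def patterns(tokens, max_len=3):
    """All ordered combinations of up to max_len tokens, with '*' marking gaps."""
    found = set()
    for k in range(1, max_len + 1):
        for idx in combinations(range(len(tokens)), k):
            parts = [tokens[idx[0]]]
            for prev, cur in zip(idx, idx[1:]):
                if cur - prev > 1:  # skipped tokens become a wildcard
                    parts.append("*")
                parts.append(tokens[cur])
            found.add(" ".join(parts))
    return found


def frequent_patterns(sentences, min_count=2, max_len=3):
    """Patterns that occur in at least min_count sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(patterns(sentence.split(), max_len))
    return {p: c for p, c in counts.items() if c >= min_count}


# Made-up example sentences, for illustration only.
sentences = ["you are so stupid", "you are really stupid", "have a nice day"]
frequent = frequent_patterns(sentences)
for pattern in sorted(frequent):
    print(pattern, frequent[pattern])
```

The number of combinations grows quickly with sentence length, which is why the sketch caps pattern length with `max_len`.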

See the online presentation from AISB 2010:



Another presentation from IJCAI 2015 IP Workshop:



Parties that expressed interest in this research:
  • Parental Options. A company providing solutions that help parents protect their children from the dangers of the Internet. Parental Options was formed by a group of FBI agents, child advocates and other professionals.


    Main References


    • Michal Ptaszynski, Fumito Masui, Yasutomo Kimura, Rafal Rzepka, Kenji Araki, "Brute Force Works Best Against Bullying", IJCAI 2015 Workshop on Intelligent Personalization (IP 2015), Buenos Aires, July 25-31, 2015.
    • Taisei Nitta, Fumito Masui, Michal Ptaszynski, Yasutomo Kimura, Rafal Rzepka, Kenji Araki, "Detecting Cyberbullying Entries on Informal School Websites Based on Category Relevance Maximization", In Proceedings of the 6th International Joint Conference on Natural Language Processing (IJCNLP 2013), pp. 579-586, Nagoya, Japan, October 14-18, 2013.
    • Michal Ptaszynski, Pawel Dybala, Tatsuaki Matsuba, Fumito Masui, Rafal Rzepka, Kenji Araki, and Yoshio Momouchi, "In the Service of Online Order: Tackling Cyber-Bullying with Machine Learning and Affect Analysis", International Journal of Computational Linguistics Research, Vol. 1, Issue 3, pp. 135-154, 2010.
