Module import issue with a Japanese Tokenizer

By : user2175022
Date : October 16 2020, 08:10 PM
I have been in direct contact with the developer of JapaneseTokenizer, who has kindly given me permission to repost his answer to my query:
I'm glad that you sent me a message about the issue. I read your post on Stack Overflow. As another user suggested, the main issue is that the pyknp package does not have a juman++ module. I don't know the reason, but the author of pyknp removed the module for juman++. The straightforward way to solve this issue is to install pyknp version 3 from here into your environment. The main procedure is below.
code :

Correct Regexp for japanese sentence tokenizer- python

By : user2795134
Date : March 29 2020, 07:55 AM
This is the text I currently have, but my regex does not split it into sentences correctly. Please help me correct the regex, thank you. Try this:
code :
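A minimal sketch of one way to split Japanese text into sentences with a regular expression, assuming sentences end in 。, ！ or ？ (or the ASCII ! and ?). The function name split_sentences and the sample text are illustrative and do not come from the original answer:

```python
import re

# Sentence-final punctuation assumed for this sketch: 。 ！ ？ (full-width)
# plus the ASCII counterparts ! and ?.
# A sentence is a run of non-terminator characters followed by any
# terminators that close it (so "！？" stays attached to its sentence).
_SENTENCE = re.compile(r'[^。！？!?]+[。！？!?]*')


def split_sentences(text):
    """Return the sentences in text, keeping their closing punctuation."""
    sentences = []
    for match in _SENTENCE.finditer(text):
        sentence = match.group().strip()
        if sentence:
            sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    sample = "今日は晴れです。明日は雨ですか？はい！"
    for sentence in split_sentences(sample):
        print(sentence)
```

This sketch does not treat quoted speech inside 「」 specially, so a 。 inside a quotation also ends a sentence; whether that is acceptable depends on the text being processed.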

Options for MeCab Japanese tokenizer on iOS?

By : Sunil Kata Tech-BLR
Date : March 29 2020, 07:55 AM
There is nothing iOS-specific in this. The dictionary you are using with MeCab (probably ipadic) contains an entry for the company name 吉本興業. Although both parts of the name are also listed as separate nouns, MeCab has a strong preference for tagging the compound name as one word.
MeCab lacks a feature that lets the user choose whether or not compounds should be split into parts. Note that such a feature is generally hard to implement because not everyone agrees on which compounds can be split and which ones can't. E.g. is 容疑者 a compound made up of 容疑 and 者? From a purely morphological point of view perhaps yes, but for most practical applications probably no.
code :
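# Compile the user dictionary source ./mydic into mydic.dic
$> $MECAB/libexec/mecab/mecab-dict-index  -d /usr/lib64/mecab/dic/ipadic -u mydic.dic -f utf-8 -t utf-8 ./mydic
# Then point MeCab at the compiled dictionary (e.g. in mecabrc)
userdic = /home/myhome/mydic.dic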

solr Japanese tokenizer not working for katakana

By : Yan Lin
Date : March 29 2020, 07:55 AM
I was able to solve this by using the lucene-gosen Sen tokenizer and compiling the ipadic dictionary with custom rules and word weights.

Spacy Japanese Tokenizer

By : Mohamed Ismail
Date : March 29 2020, 07:55 AM
I am not sure why you got that particular bug, but Japanese support has been improved since you posted this question and it should work with the latest version of spaCy. For Japanese support you'll also need to install MeCab and some other dependencies yourself; see here for a detailed guide.
Actual code would look like this:
code :
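import spacy

# Create a blank Japanese pipeline (tokenizer only) and tokenize a sentence
ja = spacy.blank('ja')
doc = ja('日本語です')
print([token.text for token in doc])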

install ipadic on Ubuntu 16.04 for mecab Japanese tokenizer

By : user1450480
Date : March 29 2020, 07:55 AM
There is no reason to compile from source on Ubuntu 16.04.
Simply run:
code :
$ sudo apt-get update
$ sudo apt install mecab mecab-ipadic-utf8
$ echo "日本語です" | mecab
日本  ニッポン    ニッポン    日本  名詞-固有名詞-地名-国        
語   ゴ   ゴ   語   名詞-普通名詞-一般      
です  デス  デス  です  助動詞 助動詞-デス  終止形-一般
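
If the output above needs to be consumed from a script, it can be split into tokens. The following is a minimal sketch, assuming each line holds whitespace-separated columns with the surface form first, as in the output shown; the actual columns depend on the dictionary and output format in use, and the names parse_mecab_output, "surface" and "features" are illustrative:

```python
def parse_mecab_output(output):
    """Parse column-style mecab output into a list of token dicts.

    Assumes one token per line, columns separated by whitespace,
    and the surface form in the first column.
    """
    tokens = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the end-of-sentence marker mecab prints.
        if not line or line == "EOS":
            continue
        fields = line.split()
        tokens.append({"surface": fields[0], "features": fields[1:]})
    return tokens


if __name__ == "__main__":
    sample = (
        "日本\tニッポン\tニッポン\t日本\t名詞-固有名詞-地名-国\n"
        "語\tゴ\tゴ\t語\t名詞-普通名詞-一般\n"
        "です\tデス\tデス\tです\t助動詞\t助動詞-デス\t終止形-一般\n"
        "EOS\n"
    )
    for token in parse_mecab_output(sample):
        print(token["surface"], token["features"])
```

Splitting on arbitrary whitespace keeps the sketch short, but it would mis-parse any output format whose feature columns can themselves contain spaces.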
