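This page does all of its processing in the browser with JavaScript. The server does not capture any information, so feel free to use it.
Kad Keyword Splitter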

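This is a lazy person's tool for when you want to copy keywords from a web page and paste them into a Kad search without having to think about which characters Kad disallows.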

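For example, suppose I copy (魔片) [無碼絕症] 川島草莓牛奶 (2000-11-22).avi from a web page,
paste it into the input box, and press Enter or the split button.
The tool then outputs 魔片 無碼絕症 川島草莓牛奶 2000, ready to copy and paste into a Kad search.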

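It is essentially modeled on eMule's internal Kademlia function CSearchManager::GetWords().
Kad's word-splitting rules are actually quite simple, but applying them by hand is tedious: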

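  1. Every character between the double quotes in " ()[]{}<>,._-!?:;\/" (including the leading space) is treated as a word separator.
  2. Only words of three bytes or more are accepted, measured in UTF-8 (a Chinese character takes 3 bytes in UTF-8; ordinary alphanumerics and half-width symbols take 1 byte each).
  3. If there are two or more words and the last one is exactly 3 bytes (and, per the code below, 3 characters), the last word is discarded (so file extensions such as avi, rar, or zip are dropped).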
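
The three rules above can be sketched in portable C++ roughly as follows. This is only an illustration working on UTF-8 std::string input, not the eMule function itself: the names kSeparators, CountUtf8Chars and SplitKadWords are made up for this example, lower-casing is left out, and rule 3 is applied to the last word that was kept, which is a simplification and may differ from GetWords() in edge cases such as a trailing separator.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Separator set from rule 1 (leading space; backslash and slash at the end).
static const std::string kSeparators = " ()[]{}<>,._-!?:;\\/";

// Hypothetical helper: count characters in a UTF-8 string by skipping
// continuation bytes (bytes of the form 10xxxxxx).
static std::size_t CountUtf8Chars(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

// Hypothetical function: split a UTF-8 string into Kad keywords.
std::vector<std::string> SplitKadWords(const std::string& input) {
    std::vector<std::string> words;
    std::size_t pos = 0;
    while (pos < input.size()) {
        // Rule 1: cut at the next separator. All separators are ASCII, so a
        // byte-wise search cannot split a multi-byte UTF-8 character.
        std::size_t end = input.find_first_of(kSeparators, pos);
        if (end == std::string::npos) end = input.size();
        std::string word = input.substr(pos, end - pos);
        // Rule 2: keep only words of at least 3 UTF-8 bytes.
        if (word.size() >= 3) {
            // Drop an earlier duplicate so the word moves to the end.
            words.erase(std::remove(words.begin(), words.end(), word),
                        words.end());
            words.push_back(word);
        }
        pos = end + 1;  // step over the separator
    }
    // Rule 3: a trailing word of 3 characters and 3 bytes is most likely a
    // file extension, so drop it when at least one other word remains.
    if (words.size() > 1 && words.back().size() == 3 &&
        CountUtf8Chars(words.back()) == 3)
        words.pop_back();
    return words;
}
```

Calling SplitKadWords on the file name from the example earlier should leave four words, with 11 and 22 rejected by rule 2 and avi removed by rule 3.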

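It should output what you want while stripping out the strings that Kad does not index.
That's all there is to it. The relevant part of the code follows.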


/*
    This program is free software; you can redistribute it and/or
    modify it under the terms of the GNU General Public License
    as published by the Free Software Foundation; either
    version 2 of the License, or (at your option) any later version.
*/
LPCSTR _aszInvKadKeywordCharsA = INV_KAD_KEYWORD_CHARS;
LPCTSTR _aszInvKadKeywordChars = _T(INV_KAD_KEYWORD_CHARS);
LPCWSTR _awszInvKadKeywordChars = L" ()[]{}<>,._-!?:;\\/";

void CSearchManager::GetWords(LPCTSTR sz, WordList *plistWords)
{
	LPCTSTR szS = sz;
	size_t uChars = 0;
	size_t uBytes = 0;
	CStringW sWord;
	while (_tcslen(szS) > 0)
	{
		uChars = _tcscspn(szS, _aszInvKadKeywordChars);
		sWord = szS;
		sWord.Truncate(uChars);
		// TODO: We'd need a safe way to determine if a sequence which contains only 3 chars is a real word.
		// Currently we do this by evaluating the UTF-8 byte count. This will work well for Western locales,
		// AS LONG AS the min. byte count is 3(!). If the byte count is once changed to 2, this will not
		// work properly any longer because there are a lot of Western characters which need 2 bytes in UTF-8.
		// Maybe we need to evaluate the Unicode character values itself whether the characters are located
		// in code ranges where single characters are known to represent words.
		uBytes = KadGetKeywordBytes(sWord).GetLength();
		if (uBytes >= 3)
		{
			KadTagStrMakeLower(sWord);
			plistWords->remove(sWord);
			plistWords->push_back(sWord);
		}
		szS += uChars;
		if (uChars < _tcslen(szS))
			szS++;
	}

	// if the last word we have added, contains 3 chars (and 3 bytes), it's in almost all cases a file's extension.
	if (plistWords->size() > 1 && (uChars == 3 && uBytes == 3))
		plistWords->pop_back();
}
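
The TODO comment in the loop depends on how many bytes a character takes in UTF-8. As a rough illustration of that byte counting (this is not eMule's KadGetKeywordBytes, whose implementation is not shown on this page; Utf8BytesForCodePoint and Utf8ByteCount are names made up for this sketch):

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes needed to encode one Unicode code point in UTF-8.
std::size_t Utf8BytesForCodePoint(char32_t cp) {
    if (cp < 0x80) return 1;      // ASCII letters, digits, half-width symbols
    if (cp < 0x800) return 2;     // many accented Western characters
    if (cp < 0x10000) return 3;   // most CJK characters
    return 4;                     // code points outside the BMP
}

// Hypothetical helper: total UTF-8 byte count of a UTF-32 string.
std::size_t Utf8ByteCount(const std::u32string& text) {
    std::size_t total = 0;
    for (char32_t cp : text) total += Utf8BytesForCodePoint(cp);
    return total;
}
```

With this counting, a three-letter ASCII word such as avi and a single CJK character both come to 3 bytes, which is why the final check in GetWords() also looks at the character count before dropping the last word.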
        

page written by AndCycle