This page does all of its processing in the browser with JavaScript. The server does not capture any information, so feel free to use it.
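This is a lazy person's tool for copying keywords from a web page and pasting them into Kad without having to think about which characters Kad forbids.
For example, say I copy (魔片) [無碼絕症] 川島草莓牛奶 (2000-11-22).avi from a web page,
then paste it into the input box and press Enter or click the word-split button.
The tool then outputs 魔片 無碼絕症 川島草莓牛奶 2000, ready to copy and paste into a Kad search.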
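The processing basically follows CSearchManager::GetWords() in eMule's internal Kademlia code.
Kad's word-splitting rules are actually quite simple, but applying them by hand is tedious: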
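- Every character between the double quotes in " ()[]{}<>,._-!?:;\/" (including the space) is treated as a word delimiter.
- Only words of 3 bytes or more are accepted, measured in UTF-8 (a typical Chinese character takes 3 bytes in UTF-8; ordinary letters, digits, and half-width symbols take 1 byte each).
- If there are two or more words and the last one is exactly 3 characters and 3 bytes long, the last word is dropped (so file extensions such as avi, rar, or zip are discarded).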
The tool should output the keywords you want and strip out the tokens that Kad does not index.
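The three rules above can be sketched in portable C++ roughly as follows. This is a minimal sketch, not the code this page runs and not eMule's implementation. The names splitKadKeywords, utf8CharCount, and kKadDelimiters are hypothetical, the input is assumed to be valid UTF-8, and lowercasing is limited to ASCII letters, whereas eMule uses its own KadTagStrMakeLower.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Delimiter set taken from the rules above. All delimiters are ASCII, so
// scanning UTF-8 text byte by byte never splits a multi-byte character.
static const std::string kKadDelimiters = " ()[]{}<>,._-!?:;\\/";

// Count characters (code points) by skipping UTF-8 continuation bytes (10xxxxxx).
static std::size_t utf8CharCount(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80) ++n;
    return n;
}

// Hypothetical helper: split a UTF-8 string into Kad keywords.
std::vector<std::string> splitKadKeywords(const std::string& utf8) {
    std::vector<std::string> words;
    std::size_t lastChars = 0, lastBytes = 0;  // size of the last token scanned
    std::size_t pos = 0;
    while (true) {
        std::size_t end = utf8.find_first_of(kKadDelimiters, pos);
        if (end == std::string::npos) end = utf8.size();
        std::string word = utf8.substr(pos, end - pos);
        lastBytes = word.size();
        lastChars = utf8CharCount(word);
        if (lastBytes >= 3) {  // rule 2: at least 3 bytes in UTF-8
            // Assumption: ASCII-only lowercasing stands in for eMule's routine.
            for (char& c : word)
                if (c >= 'A' && c <= 'Z') c = static_cast<char>(c - 'A' + 'a');
            // Drop an earlier duplicate so the word moves to the end of the list.
            words.erase(std::remove(words.begin(), words.end(), word), words.end());
            words.push_back(word);
        }
        if (end == utf8.size()) break;
        pos = end + 1;  // skip the delimiter
    }
    // Rule 3: a trailing token of 3 characters and 3 bytes is most likely a file extension.
    if (words.size() > 1 && lastChars == 3 && lastBytes == 3) words.pop_back();
    return words;
}
```

With the example from the top of the page, splitKadKeywords("(魔片) [無碼絕症] 川島草莓牛奶 (2000-11-22).avi") yields 魔片, 無碼絕症, 川島草莓牛奶, and 2000. The tokens 11 and 22 are too short, and avi is dropped as an extension. A single trailing Chinese character is kept, because it is 3 bytes but only 1 character.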
That's all there is to it. The relevant part of the eMule code follows.
/*
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License
as published by the Free Software Foundation; either
version 2 of the License, or (at your option) any later version.
*/

LPCSTR _aszInvKadKeywordCharsA = INV_KAD_KEYWORD_CHARS;
LPCTSTR _aszInvKadKeywordChars = _T(INV_KAD_KEYWORD_CHARS);
LPCWSTR _awszInvKadKeywordChars = L" ()[]{}<>,._-!?:;\\/";

void CSearchManager::GetWords(LPCTSTR sz, WordList *plistWords)
{
	LPCTSTR szS = sz;
	size_t uChars = 0;
	size_t uBytes = 0;
	CStringW sWord;
	while (_tcslen(szS) > 0)
	{
		uChars = _tcscspn(szS, _aszInvKadKeywordChars);
		sWord = szS;
		sWord.Truncate(uChars);
		// TODO: We'd need a safe way to determine if a sequence which contains only 3 chars is a real word.
		// Currently we do this by evaluating the UTF-8 byte count. This will work well for Western locales,
		// AS LONG AS the min. byte count is 3(!). If the byte count is once changed to 2, this will not
		// work properly any longer because there are a lot of Western characters which need 2 bytes in UTF-8.
		// Maybe we need to evaluate the Unicode character values itself whether the characters are located
		// in code ranges where single characters are known to represent words.
		uBytes = KadGetKeywordBytes(sWord).GetLength();
		if (uBytes >= 3)
		{
			KadTagStrMakeLower(sWord);
			plistWords->remove(sWord);
			plistWords->push_back(sWord);
		}
		szS += uChars;
		if (uChars < _tcslen(szS))
			szS++;
	}

	// if the last word we have added, contains 3 chars (and 3 bytes), it's in almost all cases a file's extension.
	if (plistWords->size() > 1 && (uChars == 3 && uBytes == 3))
		plistWords->pop_back();
}
page written by AndCycle