<?php
/* -*- tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
/*
# ***** BEGIN LICENSE BLOCK *****
# This file is part of Plume Framework, a simple PHP Application Framework.
# Copyright (C) 2001-2007 Loic d'Anterroches and contributors.
#
# Plume Framework is free software; you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation; either version 2.1 of the License, or
# (at your option) any later version.
#
# Plume Framework is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
#
# ***** END LICENSE BLOCK ***** */

/**
 * Detect the language of a text.
 *
 * <code>
 * list($lang, $confid) = Pluf_Text_Lang::detect($string);
 * </code>
 */
class Pluf_Text_Lang
{
    /**
     * Given a string, returns the language.
     *
     * Algorithm by Cavnar et al. 94.
     *
     * @param string
     * @param bool Is the string clean (false)
     * @return array Language, Confidence
     */
    public static function detect($string, $is_clean=false)
    {
        if (!$is_clean) {
            $string = Pluf_Text::cleanString($string);
        }
        // FIXME: the comparison against language profiles is missing;
        // nothing is returned yet.
    }

    /**
     * Returns the sorted n-grams of a document.
     *
     * The n-grams are ordered by decreasing number of occurrences;
     * n-grams with the same count are sorted alphabetically.
     *
     * FIXME: We should detect the proportion of thai/chinese/japanese
     * characters and switch to unigram instead of n-grams if the
     * proportion is greater than 50%.
     *
     * @param string The clean document.
     * @param int Maximum size of the n-grams (3)
     * @return array N-Grams
     */
    public static function docNgrams($string, $n=3)
    {
        // do not remove the accents
        $words = Pluf_Text::tokenize($string, false);
        // Collect the n-grams of size 2 to $n of each word, together
        // with the number of occurrences of the word.
        $ngrams = array();
        for ($i=2; $i<=$n; $i++) {
            foreach ($words as $word => $occ) {
                foreach (self::makeNgrams($word, $i) as $ngram) {
                    $ngrams[] = array($ngram, $occ);
                }
            }
        }
        // Sum the occurrences of each distinct n-gram.
        $out = array();
        foreach ($ngrams as $ngram) {
            if (!isset($out[$ngram[0]])) {
                $out[$ngram[0]] = $ngram[1];
            } else {
                $out[$ngram[0]] += $ngram[1];
            }
        }
        // split the ngrams by occurrence.
        $ngrams = array();
        foreach ($out as $ngram => $occ) {
            if (isset($ngrams[$occ])) {
                $ngrams[$occ][] = $ngram;
            } else {
                $ngrams[$occ] = array($ngram);
            }
        }
        // Most frequent first.
        krsort($ngrams);
        $res = array();
        foreach ($ngrams as $occ => $list) {
            sort($list);
            foreach ($list as $ngram) {
                $res[] = $ngram;
            }
        }
        return $res;
    }

    /**
     * Returns the n-grams of rank n of the word.
     *
     * The word is padded with one '_' on each side before the
     * n-grams are built.
     *
     * @param string Word.
     * @param int Size of the n-grams (3)
     * @return array N-grams
     */
    public static function makeNgrams($word, $n=3)
    {
        // array_merge() appends the characters after the leading '_';
        // the '+' union operator would keep key 0 of the left array and
        // drop the first character of the word. This assumes that
        // stringToChars() returns a numerically indexed list.
        $chars = array_merge(array('_'), Pluf_Text::stringToChars($word));
        $chars[] = '_';
        $l = count($chars);
        $n_ngrams = $l+1-$n;
        $ngrams = array();
        for ($i=0; $i<$n_ngrams; $i++) {
            $ngrams[$i] = array();
        }
        // The character at position $i belongs to the n-grams starting
        // at positions $i-$n+1 to $i.
        for ($i=0; $i<$l; $i++) {
            for ($j=0; $j<$n; $j++) {
                if (isset($ngrams[$i-$j])) {
                    $ngrams[$i-$j][] = $chars[$i];
                }
            }
        }
        $out = array();
        foreach ($ngrams as $ngram) {
            $t = implode('', $ngram);
            // Skip the padding-only bigram of an empty word.
            if ($t != '__') {
                $out[] = $t;
            }
        }
        return $out;
    }

    /**
     * Return the distance between two document ngrams.
     *
     * Each n-gram of the shorter list adds the difference between its
     * ranks in the two lists, capped at 3. An n-gram missing from the
     * longer list, and each extra entry of the longer list, adds 3.
     *
     * @param array n-gram
     * @param array n-gram
     * @return integer distance
     */
    public static function ngramDistance($n1, $n2)
    {
        $res = 0;
        $n_n1 = count($n1);
        $n_n2 = count($n2);
        // Make $n1 the shorter list.
        if ($n_n1 > $n_n2) {
            list($n_n1, $n_n2) = array($n_n2, $n_n1);
            list($n1, $n2) = array($n2, $n1);
        }
        for ($i=0; $i<$n_n1; $i++) {
            if (false !== ($index = array_search($n1[$i], $n2))) {
                $offset = abs($index - $i);
                $res += ($offset > 3) ? 3 : $offset;
            } else {
                $res += 3;
            }
        }
        $res += ($n_n2 - $n_n1) * 3;
        return $res;
    }
}
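
/*
 * Illustrative sketch, not part of the original Plume source. detect()
 * above stops after cleaning the string, so this shows one way the
 * remaining step of the Cavnar et al. approach could look: compare the
 * ranked n-grams of the document with one ranked n-gram profile per
 * language and keep the closest one.
 *
 * Everything introduced here is an assumption made for the example: the
 * function name, the layout of $profiles (language code => ranked n-gram
 * list) and the optional $distance callable, which defaults to
 * Pluf_Text_Lang::ngramDistance().
 *
 * Usage, with two tiny hand-written profiles:
 *
 *   $profiles = array('en' => array('_t', 'th', 'he'),
 *                     'fr' => array('_l', 'le', 'e_'));
 *   list($lang, $dist) = example_closest_profile(array('_t', 'th', 'he'),
 *                                                $profiles);
 *   // $lang is 'en' and $dist is 0: the document matches the 'en'
 *   // profile rank for rank, while none of its n-grams are in 'fr'.
 *
 * A real profile would hold a few hundred n-grams built with docNgrams()
 * from a reference text of each language.
 */
function example_closest_profile($docNgrams, $profiles,
                                 $distance=array('Pluf_Text_Lang', 'ngramDistance'))
{
    $best = null;
    $bestDistance = null;
    foreach ($profiles as $lang => $profileNgrams) {
        // Smaller distance means the two rankings agree more closely.
        $d = call_user_func($distance, $docNgrams, $profileNgrams);
        if (null === $bestDistance || $d < $bestDistance) {
            $best = $lang;
            $bestDistance = $d;
        }
    }
    // array(null, null) when $profiles is empty.
    return array($best, $bestDistance);
}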