PHP - 海运的博客

PHP实现Trie结构

发布时间：December 24, 2014 // 分类：PHP // 1 Comment

  class Trie{
  public $tree = array();

  public function insert($str){
    $count = strlen($str);
    $T = &$this->tree;
    //关联数组递归各个字符，无则新建
    for($i = 0; $i < $count; $i++){
      $c = $str[$i];
      if (!isset($T[$c]))
        $T[$c] = array();  
      $T = &$T[$c];
    }
  }

  public function remove($chars){
    if($this->find($chars)){    
      $count = strlen($chars);
      $T = &$this->tree;
      for($i = 0;$i < $count;$i++){
        $c = $chars[$i];
        if(count($T[$c]) == 1){     
          unset($T[$c]);
          return true;
        }
        $T = &$T[$c];
      }
    }
  }

  public function find($chars){
    $count = strlen($chars);
    $T = &$this->tree;
    //循环遍历各个字符，如无则返回false
    for($i = 0;$i < $count;$i++){
      $c = $chars[$i];
      if(!array_key_exists($c, $T)){
        return false;
      }
      $T = &$T[$c];
    }
    //否则返回找到
    return true;
  }
}
$t = new Trie();
$t->insert('str');
var_dump($t->find('str'));

TF-IDF取文章关键词PHP

发布时间：December 21, 2014 // 分类：PHP // No Comments

计算词频，即分词后计算文章的总词数和每个词的出现次数，词数较多可取TOPk

//$tf = 词出现次数 / 总词数

计算IDF，语料可使用百度/Google结果数：

//$idf = log( 总文档数 / 包含词的文档数, 2); 
$idf = log( $total_document_count / $documents_with_term, 2);

计算TF-IDF，值越大分类能力越强：

$tfidf = $tf * $idf

贝叶斯过滤垃圾邮件PHP实现

发布时间：December 21, 2014 // 分类：PHP // No Comments

根据贝叶斯推断及其互联网应用（二）：过滤垃圾邮件实现：
首先收集垃圾邮件和正常邮件，分词后计算每个词分别出现的频率，比如计算垃圾邮件库每个词的频率：

//分词略过
foreach ($words as $word) {
  $key = base64_encode($word);
  if (isset($spamwords[$key])) {
    $spamwords[$key]++;
  } else {
    $spamwords[$key] = 1;
  }
}

单个词判断垃圾邮件概率：

//先验概率为50%
$ps = 0.5;
$ph = 0.5;
//在正常邮件中的出现频率，比如4000封正常邮件2封包含这个词。
$pwh = 0.0005;
//在垃圾邮件中的出现频率
$pws = 0.05;
//垃圾邮件概率
$psw = $pws * $ps / ($pws * $ps + $pwh * $ph);
echo $psw;

多个词计算联合概率：

//根据上面计算的多个词的概率集合
$psws = array(0.2, 0.3, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.8);
$numerator = 1; 
$denominator1 = 1;
$denominator2 = 1;
foreach ($psws as $value) {
  $numerator *= $value;
  $denominator1 *= $value;
  $denominator2 *= 1 - $value;
}
echo $numerator / ($denominator1 + $denominator2);

PHP IP匹配IP段

发布时间：December 21, 2014 // 分类：PHP // No Comments

起始IP方式匹配：

<?php 
$ips = array('192.168.1.1-192.168.1.254','192.168.0.1-192.168.0.254','192.168.3.1-192.168.3.254','192.168.4.1-192.168.4.254');
foreach ($ips as $ip) {
  list($start, $end) = explode('-', $ip);
  //去除32位下负数
  $start = sprintf("%u", ip2long($start));
  $end = sprintf("%u",ip2long($end));
  $newips[] = array($start, $end);
}

function my_sort($a, $b) {
  if ($a[0] == $b[0])
    return 0;
  return ($a[0] < $b[0]) ? -1 : 1;
}

//从大到小排序，IP段较多时可以使用2分查找或多分。
usort($newips, 'my_sort'); 
$matchip = '192.168.1.22';
$matchip = sprintf("%u",ip2long($matchip));
foreach ($newips as $value) {
  if ($matchip > $value[0] && $matchip < $value[1])
    echo "匹配IP段：". long2ip($value[0]) . '-' . long2ip($value[1]) . "\n";
}

CIDR方式匹配：

<?php
function cidr_match($ip, $range)
{
  list ($subnet, $bits) = explode('/', $range);
  $ip = ip2long($ip);
  $subnet = ip2long($subnet);
  $mask = -1 << (32 - $bits);
  $subnet &= $mask;
  return ($ip & $mask) == $subnet;
}
var_dump(cidr_match('192.168.1.22', '192.168.1.0/24'));

CIDR获取IP段起始IP地址：

<?php
function cidrToRange($cidr) {
  $range = array();
  $cidr = explode('/', $cidr);
  $range[0] = long2ip((ip2long($cidr[0])) & ((-1 << (32 - (int)$cidr[1]))));
  $range[1] = long2ip((ip2long($cidr[0])) + pow(2, (32 - (int)$cidr[1])) - 1);
  return $range;
}
var_dump(cidrToRange("192.168.1.1/24"));

参考：
http://stackoverflow.com/questions/4931721/getting-list-ips-from-cidr-notation-in-php
http://stackoverflow.com/questions/594112/matching-an-ip-to-a-cidr-mask-in-php5

PHP一致性hash

发布时间：December 11, 2014 // 分类：PHP // No Comments

方法1，hash后取模：

<?php
function get_hash($str, $num=10){
  $crc = crc32($str);
  $nu = $crc % $num;
  return $nu;
}

测试公平性：

<?php
function genstr($num) {
  return substr(str_shuffle('abcdefghijklmnupqrstuvwxyz'), 0, $num);
}

for ($i=1; $i<10000000; $i++) {
  $str = genstr(5);
  $crc = crc32($str);
  $nu = get_hash($str);;
  if (isset($count[$nu])) {
    $count[$nu]++;
  } else {
    $count[$nu]=1;
  }
}
print_r($count);

方法2，动态增添后最小影响之前的hash结果可使用consistent hashing：
https://github.com/RJ/ketama
https://github.com/pda/flexihash

海运的博客

PHP实现Trie结构

TF-IDF取文章关键词PHP

贝叶斯过滤垃圾邮件PHP实现

PHP IP匹配IP段

PHP一致性hash

分类

最新文章

最近回复