Script Collection

COMMANDOES

Posted here the old-fashioned way, against the grain of the GitHub era.
Versions used: gcc 4.7.2 and Perl 5.8.8 under MinGW on Windows 7 x64

feature_zurashi.pl

A script that shifts the selected features up or down by a number of rows.
For example, given training data like
```
1 1:0 2:1 3:1 4:0
2 1:1 2:0 3:1 4:1
2 1:1 2:0 3:1 4:0
2 1:1 2:0 3:0 4:1
1 1:1 2:1 3:0 4:0
```
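it is handy when you want to shift only feature IDs 2 and 4 down by n rows.
- As far as I know, no shell command does this. (If one exists, so much for my effort.)
Several features can be selected at once.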

Usage: feature_zurashi.pl 2,4 3 test.train (shifts feature IDs 2 and 4 down by 3 rows)

A negative shift value moves the features up instead.
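The shifting rule itself can be sketched in C as follows. This is only an illustration, not the script's implementation: `shift_column` is a hypothetical helper that works on one feature column already loaded into an array, and it assumes that rows with no source row inside the data keep their own value, which is how the script treats the first and last rows.

```c
#include <stddef.h>

/* Shift one feature column by n rows (n > 0: down, n < 0: up).
 * Rows whose source row would lie outside the data keep their own value.
 * shift_column is a hypothetical name used only for this sketch. */
void shift_column(const double *src, double *dst, size_t rows, long n)
{
    size_t r;
    for (r = 0; r < rows; r++) {
        long from = (long)r - n;      /* row the new value is taken from */
        if (from < 0 || from >= (long)rows)
            dst[r] = src[r];          /* no source row: leave unchanged */
        else
            dst[r] = src[from];
    }
}
```

With the sample data above, feature 2 has the column 1, 0, 0, 0, 1; shifting it down by one row gives 1, 1, 0, 0, 0.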

#! perl

use strict;
use warnings;

# Shift the selected features by the given number of rows
# usage: feature_zurashi.pl [ids (comma-separated)] [rows] [file]

################

my($id, $n, $file) = @ARGV;
die("usage: feature_zurashi.pl ids rows file\n") unless defined $file;

my @ids = split(/,/, $id);

my @restore_features; # one hash per row holding the selected features: { id => value }

open(FILE, '<', $file) or die("file open error: $file");
my $file_max = 0; # total number of rows in the file

while(my $line = <FILE>) {
  my @features = split(/\s+/, $line);
  my %vals;
  for(my $i=1; $i<@features; $i++) {
    my($fid, $val) = split(/:/, $features[$i]);
    for(@ids) {
      if($_ == $fid) {
        $vals{$fid} = $val;
      }
    }
  }
  push(@restore_features, \%vals); # array of hashes
  $file_max++;
}

seek(FILE, 0, 0);

my $cnt = 0; # row counter
while(my $line = <FILE>) {
  # Rows whose source row would fall outside the file are printed unchanged
  if($n > 0 && $cnt < $n) {
    print $line;
  }
  elsif($n < 0 && $cnt - $n >= $file_max) {
    print $line;
  }
  else {
    # Substitute the values taken from the source row
    my $src = $restore_features[$cnt-$n];
    for my $c (@ids) {
      next unless defined $src->{$c}; # the source row lacks this feature
      # match the whole ID so that "2:" does not hit inside "12:"
      $line =~ s/(?<!\S)\Q$c\E:\S+/${c}:$src->{$c}/;
    }
    print $line;
  }
  $cnt++;
}
close(FILE);

balanced.pl

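A script that undersamples the training data to deal with imbalanced data in machine learning (http://ibisforest.org/index.php?%E4%B8%8D%E5%9D%87%E8%A1%A1%E3%83%87%E3%83%BC%E3%82%BF).
Every class is cut down to the sample count of the smallest class.
- Example: for training data with 100 samples of +1, 200 of +2, and 50 of +3, each class is reduced to 50 samples
It works by shuffling each class and taking samples from the front, so the original order is lost.
Written as practice with Perl references.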
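The two steps, finding the smallest class and shuffling before taking the first samples, can be sketched in C like this. It is a rough illustration under assumptions, not a port of the script: `smallest_class` and `shuffle_indices` are hypothetical names, the samples are represented by index arrays, and a Fisher-Yates shuffle with `rand()` stands in for Perl's `List::Util::shuffle`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Size of the smallest class; every class is cut down to this many samples.
 * Returns 0 when there are no classes. */
size_t smallest_class(const size_t *counts, size_t nclasses)
{
    size_t i, lowest;
    if (nclasses == 0)
        return 0;
    lowest = counts[0];
    for (i = 1; i < nclasses; i++)
        if (counts[i] < lowest)
            lowest = counts[i];
    return lowest;
}

/* Fisher-Yates shuffle of the sample indices of one class.
 * rand() % k has a slight modulo bias, which is tolerable for a sketch. */
void shuffle_indices(size_t *idx, size_t n)
{
    size_t i;
    if (n < 2)
        return;
    for (i = n - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = idx[i];
        idx[i] = idx[j];
        idx[j] = tmp;
    }
}
```

After shuffling, the first `smallest_class(...)` indices of each class are the samples to keep. Seeding with `srand` would be needed to get a different selection on each run.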

Because of what looks like a prettify bug, "}}" cannot be entered, so it is written as "} }" below.

#!/usr/local/bin/perl

# Balance the classes
# by cutting each one down to the size of the smallest
# usage: balanced.pl [training data]

use strict;
use warnings;
use List::Util;

##########################

open(FILE, '<', $ARGV[0]) or die("file open error");
my(%datas, %counts);

# Read the training data
while(my $line = <FILE>) {
  next if $line =~ /^\s*$/; # skip blank lines
  my($class) = split(' ', $line);
  push(@{$datas{$class} }, $line); # append to the hash of arrays keyed by class
  $counts{$class}++; # also record the sample count
}
close(FILE);

# Find the size of the smallest class
my $lowest;
foreach (keys %counts) {
  $lowest = $counts{$_} if(! $lowest || $counts{$_} < $lowest);
}

# Shuffle the samples of each class
foreach (sort keys %datas) {
  @{$datas{$_} } = List::Util::shuffle @{$datas{$_} };

  # Take the first $lowest samples
  print @{$datas{$_} }[0..$lowest-1];
}

cv.pl

Computes the following from the prediction results generated by svm-predict:
- Accuracy
- Precision
- Recall
- F-measure
- Confusion Matrix

Binary classification only, and one of the two classes must be "+1".
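How these figures follow from the four confusion-matrix counts can be sketched in C as below. The names `Metrics`, `safe_div`, `class_metrics`, and `accuracy` are hypothetical and exist only in this sketch; returning 0 when a denominator is 0 is likewise an assumption made here to keep the example total.

```c
/* Precision, recall and F-measure of one class. */
typedef struct {
    double precision;
    double recall;
    double f_measure;
} Metrics;

/* Division that yields 0 instead of failing when the denominator is 0. */
static double safe_div(double num, double den)
{
    return den != 0.0 ? num / den : 0.0;
}

/* tp, fp, fn are counted from the point of view of the class of interest. */
Metrics class_metrics(long tp, long fp, long fn)
{
    Metrics m;
    m.precision = safe_div((double)tp, (double)(tp + fp));
    m.recall    = safe_div((double)tp, (double)(tp + fn));
    m.f_measure = safe_div(2.0 * m.precision * m.recall,
                           m.precision + m.recall);
    return m;
}

/* Fraction of all samples that were classified correctly. */
double accuracy(long tp, long fp, long tn, long fn)
{
    return safe_div((double)(tp + tn), (double)(tp + fp + tn + fn));
}
```

For class +1 the call is `class_metrics(tp, fp, fn)`; for class -1 the roles swap, giving `class_metrics(tn, fn, fp)`.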

#! perl

# usage: cv.pl [test data] [prediction data]

use strict;
use warnings;

# Get the file list from the command line
my @files = @ARGV;
die("usage: cv.pl [test data] [prediction data]\n") if @files < 2;

# Counters for the four cells of the confusion matrix
my ($tp, $fp, $tn, $fn);
$tp = $fp = $tn = $fn = 0;

open(FILE, '<', $files[0]) or die("file open error : $files[0]");
open(FILE2, '<', $files[1]) or die("file open error : $files[1]");

while(my $ans = readline(FILE)) {
  my $p = readline(FILE2);
  last unless defined $p; # the prediction file is shorter than the test data
  
  # Take the correct label from the test data
  my ($a_class, @temp) = split(/\s+/, $ans);
  
  # positive
  if(int($p) == 1) {
    # true-positive
    if(int($a_class) == 1) { $tp++; }
    # false-positive
    else { $fp++; }
  }
  # negative
  else {
    # false-negative
    if(int($a_class) == 1) { $fn++; }
    # true-negative
    else { $tn++; }
  }
}

close(FILE);
close(FILE2);

# Return $num / $den, or 0 when the denominator is 0
sub ratio {
  my ($num, $den) = @_;
  return $den ? $num / $den : 0;
}

# Compute Precision, Recall, and F-measure
my $p_precision = ratio($tp, $tp + $fp);
my $p_recall = ratio($tp, $tp + $fn);
my $p_fmeasure = ratio(2 * $p_precision * $p_recall, $p_precision + $p_recall);

my $n_precision = ratio($tn, $tn + $fn);
my $n_recall = ratio($tn, $tn + $fp);
my $n_fmeasure = ratio(2 * $n_precision * $n_recall, $n_precision + $n_recall);

# Print to stdout
printf "Accuracy = \t%.3f (%d / %d)\n\n", ratio($tp + $tn, $tp + $fp + $tn + $fn), ($tp + $tn), ($tp + $fp + $tn + $fn);

printf "class +1:\n";
printf "  Precision = \t%.3f (%d / %d)\n", $p_precision, $tp, ($tp + $fp);
printf "  Recall = \t%.3f (%d / %d)\n", $p_recall, $tp, ($tp + $fn);
printf "  F-measure = \t%.3f\n\n", $p_fmeasure;

printf "class -1:\n";
printf "  Precision = \t%.3f (%d / %d)\n", $n_precision, $tn, ($tn + $fn);
printf "  Recall = \t%.3f (%d / %d)\n", $n_recall, $tn, ($tn + $fp);
printf "  F-measure = \t%.3f\n\n", $n_fmeasure;

# This part is ugly
printf "Confusion Matrix\n";
printf "\tPredicted Class\n";
printf "\tp (+1)\t\tn (-1)\n";
printf "p (+1)\t%.3f (%d/%d)\t%.3f (%d/%d)\n", ratio($tp, $tp + $fn), $tp, ($tp + $fn), ratio($fn, $tp + $fn), $fn, ($tp + $fn);
printf "n (-1)\t%.3f (%d/%d)\t%.3f (%d/%d)\n", ratio($fp, $fp + $tn), $fp, ($fp + $tn), ratio($tn, $fp + $tn), $tn, ($fp + $tn);
printf "Labelled\nClass\n";
exit;

mixman.pl

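Overlays (mixes) two WAV files and saves the result.
16-bit only.
The RIFF header is assumed to be exactly 44 bytes.
Make sure the sampling rate and the number of channels are the same in both files.

Handy for things like adding noise to speech.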
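The mixing step can be sketched in C as follows. This is a sketch under assumptions rather than the script itself: `mix_sample` and `mix_loop` are hypothetical names, the samples are taken to be already decoded into `int16_t` arrays, and clamping the sum to the 16-bit range is a choice made here so that loud passages do not wrap around.

```c
#include <stddef.h>
#include <stdint.h>

/* Add two 16-bit samples, clamping the result to the 16-bit range. */
int16_t mix_sample(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX)
        sum = INT16_MAX;
    if (sum < INT16_MIN)
        sum = INT16_MIN;
    return (int16_t)sum;
}

/* Mix `noise` into `in` in place. When the noise is shorter than the
 * input it is repeated from the start, like a ring buffer. */
void mix_loop(int16_t *in, size_t n_in, const int16_t *noise, size_t n_noise)
{
    size_t i;
    if (n_noise == 0)
        return;                       /* nothing to mix */
    for (i = 0; i < n_in; i++)
        in[i] = mix_sample(in[i], noise[i % n_noise]);
}
```

Note that the index wraps modulo the number of noise samples, not modulo the last index.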

#! perl

# Overlay one sound on another and save the result (16bit only)
# mixman.pl inputfile mixsoundfile outputfile
# inputfile :    the base WAV file
# mixsoundfile : the WAV file to overlay on it
# outputfile :   the WAV file to write

use strict;
use warnings;

my ($input, $sound, $output) = @ARGV;

open(IN, '<', $input) or die("file open error : $input");
binmode IN; # binary mode

open(DATA, '<', $sound) or die("file open error : $sound");
binmode DATA; # binary mode

my @effect;  # raw sample data of the WAV file to overlay
my $temp;

# Skip the RIFF header
seek(DATA, 44, 0);

while((read(DATA, $temp, 2) || 0) == 2) { # read 2 bytes (one short) at a time
  my $data = unpack("s", $temp); # binary -> short
  push(@effect, $data);
}

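open(SAVE, '>', $output) or die("file write error : $output");
binmode SAVE;

# Add the overlay to the original file
die("no sample data in $sound") unless @effect;
my $in_cnt = 0;  # counter for cycling through the overlay like a ring buffer
my $eff_cnt = scalar(@effect); # number of overlay samples

# Copy the RIFF header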
for(my $i=0; $i<44; $i++) {
  read(IN, $temp, 1);
  print SAVE $temp;
}

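while((read(IN, $temp, 2) || 0) == 2) { # read 2 bytes at a time
  my $data = unpack("s", $temp);  # binary -> short
  my $mixed = $data + $effect[$in_cnt % $eff_cnt];
  # Clamp to the 16-bit range so that overflow does not wrap around
  $mixed = 32767 if $mixed > 32767;
  $mixed = -32768 if $mixed < -32768;
  print SAVE pack("s", $mixed);
  $in_cnt++;
}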

close(SAVE);
close(DATA);
close(IN);

seq2.c

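A modified (arguably for the worse) version of the Linux command "seq"
- seq is a command that generates the sequence of numbers from a start value to an end value.
- seq [START (optional)] [INCREMENT (optional)] [END], for example: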
```
$ seq 1 5
> 1 2 3 4 5
$ seq 1 2 5
> 1 3 5
```
  This is how seq behaves.
- However, consider a sequence in which START > END, such as
```
3 4 5 1 2
```
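  which seq cannot produce (as far as I know).
seq2 can produce sequences with START > END, as well as somewhat trickier ones.
Usage
- seq2 [START] [INCREMENT (optional)] [END] [MIN (optional)] [MAX (optional)]
- MIN: the value the counter is reset to once it reaches MAX
- Note: when START > END, MIN and MAX cannot be omitted
- Note: when START < END and MIN and MAX are given, seq2 first counts up to MAX, wraps around to MIN, and then counts up toward END
- Note: after wrapping around, counting stops just before END (END itself is not printed)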

Examples

$ seq2 0 5 (START and END)
> 0 1 2 3 4 5 (same as seq)
$ seq2 0 2 5 (START, INCREMENT, END)
> 0 2 4
$ seq2 5 2 0 10 (START, END, MIN, MAX)
> 5 6 7 8 9 10 0 1
$ seq2 2 5 0 15 (START, END, MIN, MAX)
> 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 1 2 3 4
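The wrap-around rule shown in the last two examples can be condensed into a small C function. This is a simplified sketch, not the program listed further down: `wrap_sequence` is a hypothetical name, it handles only integers with an increment of 1, it assumes MIN and MAX are both given, and it writes into a caller-supplied buffer instead of printing.

```c
#include <stddef.h>

/* Fill `out` with START..MAX followed by MIN..END-1 and return the number
 * of values written (at most `cap`). END itself is excluded after the
 * wrap, matching the examples above. */
size_t wrap_sequence(long start, long end, long min, long max,
                     long *out, size_t cap)
{
    size_t n = 0;
    long i;
    for (i = start; i <= max && n < cap; i++)  /* first leg: START..MAX */
        out[n++] = i;
    for (i = min; i < end && n < cap; i++)     /* second leg: MIN..END-1 */
        out[n++] = i;
    return n;
}
```

Calling `wrap_sequence(5, 2, 0, 10, buf, 20)` fills `buf` with the eight values 5 6 7 8 9 10 0 1, the same sequence as `seq2 5 2 0 10`.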

The -f, -s, and -w options of seq are not supported yet; I would like to add them when I have the time.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

typedef char BOOL;

/* function prototypes */
void chkargc(int argc, char *argv[], double *start, double *increment, double *min, double *max, double *end);
BOOL isinteger(double n);
void printf2(double number, BOOL intflag);

int main(int argc, char *argv[]) {
  double i, start, increment, min = 0.0, max, end;
  BOOL intflag;

  /* read the values from argv */
  chkargc(argc, argv, &start, &increment, &min, &max, &end);
  
  /* check whether every value is an integer (min stays 0 when not given) */
  if(isinteger(start) && isinteger(increment) && isinteger(end) && isinteger(min))
    intflag = 1;
  else
    intflag = 0;

  /* print the numbers */
  for(i=start; i<=max; i+=increment) {
    printf2(i, intflag);
    printf(" ");
    /* with MIN and MAX given: wrap once the next step would pass MAX */
    if(i + increment > max && max != end) {
      i = min - increment;   /* restart the counter at MIN */
      max = end - increment; /* count up to just before END */
      end = max;             /* make sure this branch is not entered again */
    }
  }
  printf("\n");
  return EXIT_SUCCESS;
}

/* check argc and store the argv values in the corresponding variables */
void chkargc(int argc, char *argv[], double *start, double *increment, double *min, double *max, double *end) {
  char *ec;  /* end pointer for strtod (unused) */

  switch(argc) {
  case 3:
    *start = strtod(argv[1], &ec);
    *increment = 1.0;
    *end = strtod(argv[2], &ec);
    *max = *end;
    break;

  case 4:
    *start = strtod(argv[1], &ec);
    *increment = strtod(argv[2], &ec);
    *end = strtod(argv[3], &ec);
    *max = *end;
    break;
    
  case 5:
    *start = strtod(argv[1], &ec);
    *increment = 1.0;
    *end = strtod(argv[2], &ec);
    *min = strtod(argv[3], &ec);
    *max = strtod(argv[4], &ec);
    break;

  case 6:
    *start = strtod(argv[1], &ec);
    *increment = strtod(argv[2], &ec);
    *end = strtod(argv[3], &ec);
    *min = strtod(argv[4], &ec);
    *max = strtod(argv[5], &ec);
    break;

  default: /* error */
    fprintf(stderr, "seq2 [START] [INCREMENT (can omit)] [END]\n");
    fprintf(stderr, "seq2 [START] [INCREMENT (can omit)] [END] [MIN] [MAX]\n");
    exit(EXIT_FAILURE);
  }

  /* error checks: a zero or negative INCREMENT would never reach MAX */
  if(*increment <= 0.0) {
    fprintf(stderr, "error : INCREMENT must be positive\n");
    exit(EXIT_FAILURE);
  }
  else if(argc == 5 || argc == 6) {
    if(*min > *max) {
      fprintf(stderr, "error : MIN > MAX\n");
      exit(EXIT_FAILURE);
    }
  }
  return;
}

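/* check whether the given real number is an integer */
BOOL isinteger(double n) {
  double intpart; /* unused, but modf needs somewhere to store it */
  if(modf(n, &intpart) != 0.0) /* the fractional part is not 0 */
    return 0;
  else /* the fractional part is 0, so n is an integer */
    return 1;
}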

/* printf wrapper that prints as an integer or as a decimal depending on intflag */
void printf2(double number, BOOL intflag) {
  /* print as an integral number */
  if(intflag) {
    int i = number;
    printf("%d", i);
  }
  /* print as a floating number */
  else {
    double f = number;
    printf("%f", f);
  }
  return;
}