Peter's Blog

Redefining the Impossible

Items filed under crm114


I'm interested in using CRM114 in a project written in Python. CRM114 is hard to describe, but I want to use it as an intelligent categoriser to decide whether an item of text should go into group a or group b (something like a Bayesian spam filter). I could use the Bayesian filters from SpamBayes, but CRM114 is likely to be faster, more flexible and less fixated on email spam filtering.

CRM114 has its own weird programming language to learn, so the problem is really to create something minimal that works and to wrap it in a language I know. Hence this recipe uses CRM114 simply to decide whether a lump of text is good or bad and returns the result to Python. If the result is in doubt, it seems to err towards 'good'.

  • Install crm114 on a Debian or Ubuntu system:
    sudo apt-get install crm114
    
  • create a script called 'learngood.crm' to create a 'good' database
    #!/usr/bin/crm
    
    {
        learn <osb unique microgroom> (good.css)
    }
    
  • make it executable and teach it about the good things in 'good.txt' like this:
    chmod +x learngood.crm
    ./learngood.crm < good.txt
    
  • create a script called 'learnbad.crm' to create a 'bad' database
    #!/usr/bin/crm
    
    {
        learn <osb unique microgroom> (bad.css)
    }
    
  • make it executable and teach it about the bad things in 'bad.txt' like this:
    chmod +x learnbad.crm
    ./learnbad.crm < bad.txt
    
  • create a script called 'pick.crm' to make the decision:
    #!/usr/bin/crm
    
    {
        {
             classify <osb unique microgroom> ( bad.css | good.css )
             # bad
             exit /1/
        }
        # good
        exit /0/
    }
    
  • create a python script to run it (pick.crm must be executable too, e.g. chmod +x pick.crm):
    #
    # Pick one or the other.
    #

    import sys
    import popen2

    strText = sys.stdin.read()

    oCrm = popen2.Popen3( './pick.crm', 'w')

    oCrm.tochild.write( strText)
    oCrm.tochild.close()

    nRet = oCrm.wait()

    if nRet == 0:
        print 'It was Good'
    else:
        print 'It was Bad'
    
    This script reads text from standard input and then passes it through the crm filter. The script prints whether the text is good or bad.
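
The two teaching steps in the list above can be driven from Python as well. What follows is only a sketch for Python 3.5 or later using the standard subprocess module. The function name teach and its arguments are made up for illustration and are not part of CRM114, and it assumes the learn scripts are executable.

```python
import subprocess


def teach(command, text_path):
    """Pipe a text file into a learn script, like `./learngood.crm < good.txt`.

    `command` is an argument list such as ['./learngood.crm'] (hypothetical
    parameter, chosen for this sketch). The exit status of the script is
    returned unchanged.
    """
    # Open in binary mode so the bytes reach crm exactly as they are on disk.
    with open(text_path, 'rb') as handle:
        result = subprocess.run(command, stdin=handle)
    return result.returncode
```

Calling teach(['./learngood.crm'], 'good.txt') and then teach(['./learnbad.crm'], 'bad.txt') should be equivalent to the two shell redirections.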

These scripts use the OSB (Orthogonal Sparse Bigram) classifier. CRM114 has several other classifiers to choose from if you have some objection to Orthogonal Sparse Bigrams.

The use of 'Popen3' to pipe the text to crm114 means it won't work under Windows. You have my deepest sympathy.
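
The popen2 module is also gone from Python 3, so the script above needs Python 2. A rough equivalent for Python 3.5 or later, using the standard subprocess module, might look like the sketch below. The function name classify and its command parameter are my own invention for illustration. subprocess itself exists on Windows, but running './pick.crm' directly still depends on the '#!' line, so I make no promises there.

```python
import subprocess


def classify(text, command=('./pick.crm',)):
    """Return 'good' or 'bad' depending on the exit status of pick.crm.

    `command` is a hypothetical parameter for this sketch; it defaults to
    the pick.crm script from the recipe, assumed to be executable and in
    the current directory.
    """
    # Send the text to the script on its standard input and wait for it.
    result = subprocess.run(list(command), input=text.encode('utf-8'))
    # pick.crm exits with 0 for good; anything else is treated as bad,
    # the same rule the original script applies.
    return 'good' if result.returncode == 0 else 'bad'
```

To mimic the original script, something like print('It was', classify(sys.stdin.read())) after an import sys would do.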


Filed under: crm114 python
