Package pyarabic :: Module araby

Module araby

Arabic module


Author: Taha Zerrouki

Contact: taha dot zerrouki at gmail dot com

Copyright: Arabtechies, Arabeyes, Taha Zerrouki

License: GPL

Date: 2010/03/01

Version: 0.1

Functions
    Is letter functions

isSukun(archar)
Checks for Arabic Sukun Mark.

isShadda(archar)
Checks for Arabic Shadda Mark.

isTatweel(archar)
Checks for Arabic Tatweel letter modifier.

isTanwin(archar)
Checks for Arabic Tanwin Marks (FATHATAN, DAMMATAN, KASRATAN).

isTashkeel(archar)
Checks for Arabic Tashkeel Marks (FATHA, DAMMA, KASRA, SUKUN, SHADDA, FATHATAN, DAMMATAN, KASRATAN).

isHaraka(archar)
Checks for Arabic Harakat Marks (FATHA, DAMMA, KASRA, SUKUN, TANWIN).

isShortharaka(archar)
Checks for Arabic short Harakat Marks (FATHA, DAMMA, KASRA, SUKUN).

isLigature(archar)
Checks for Arabic Ligatures like LamAlef.

isHamza(archar)
Checks for Arabic Hamza forms.

isAlef(archar)
Checks for Arabic Alef forms.

isYehlike(archar)
Checks for Arabic Yeh forms.

isWawlike(archar)
Checks for Arabic Waw-like forms.

isTeh(archar)
Checks for Arabic Teh forms.

isSmall(archar)
Checks for Arabic Small letters.

isWeak(archar)
Checks for Arabic Weak letters.

isMoon(archar)
Checks for Arabic Moon letters.

isSun(archar)
Checks for Arabic Sun letters.
    General letter functions
integer
order(archar)
Returns the order of an Arabic letter, between 1 and 29.
unicode
name(archar)
Returns the name of an Arabic letter, in Arabic.
unicode
arabicrange(self)
Returns a list of Arabic characters.
    Has letter functions

hasShadda(word)
Checks if the Arabic word contains a Shadda.
    Word and text functions

isVocalized(word)
Checks if the Arabic word is vocalized.

isVocalizedtext(text)
Checks if the Arabic text is vocalized.
Boolean
isArabicstring(text)
Checks that all characters belong to the standard Arabic Unicode block. An Arabic string can contain spaces, digits and punctuation.
Boolean
isArabicrange(text)
Checks that all characters belong to the Arabic Unicode block.
Boolean
isArabicword(word)
Checks for a valid Arabic word.
    Char functions
unicode char
firstChar(word)
Returns the first char.
unicode char
secondChar(word)
Returns the second char.
unicode char
lastChar(word)
Returns the last letter. Example: in "zerrouki", 'i' is the last.
unicode char
secondlastChar(word)
Returns the second last letter. Example: in "zerrouki", 'k' is the second last.
    Strip functions
unicode
stripHarakat(text)
Strips Harakat from an Arabic word, except Shadda.
unicode
stripTashkeel(text)
Strips vowels from a text, including Shadda.
unicode
stripTatweel(text)
Strips Tatweel from a text and returns the resulting text.
unicode
normalizeLigature(text)
Normalizes Lam Alef ligatures into two letters (LAM and ALEF) and returns the resulting text.
unicode
normalizeHamza(word)
Standardizes the Hamzat into one form of Hamza, and replaces Madda by Hamza and Alef.

separate(word)
Separates the letters from the vowels in an Arabic word. If a letter has no haraka, the "not defined" haraka is assigned to it.

joint(letters, marks)
Joins the letters with the marks. The lengths of letters and marks must be equal. Returns the word.

vocalizedlike(word1, word2)
Returns True if the two words have the same letters and the same harakat.

waznlike(word1, wazn)
Returns True if word1 matches a wazn (pattern). The letters must be equal; the wazn uses the FEH, AIN and LAM letters.

shaddalike(partial, fully)
Returns True if the two words have the same letters and the same harakat.
unicode
reduceTashkeel(text)
Reduces the Tashkeel by deleting evident cases.
Variables
  COMMA = u'،'
  SEMICOLON = u'؛'
  QUESTION = u'؟'
  HAMZA = u'ء'
  ALEF_MADDA = u'آ'
  ALEF_HAMZA_ABOVE = u'أ'
  WAW_HAMZA = u'ؤ'
  ALEF_HAMZA_BELOW = u'إ'
  YEH_HAMZA = u'ئ'
  ALEF = u'ا'
  BEH = u'ب'
  TEH_MARBUTA = u'ة'
  TEH = u'ت'
  THEH = u'ث'
  JEEM = u'ج'
  HAH = u'ح'
  KHAH = u'خ'
  DAL = u'د'
  THAL = u'ذ'
  REH = u'ر'
  ZAIN = u'ز'
  SEEN = u'س'
  SHEEN = u'ش'
  SAD = u'ص'
  DAD = u'ض'
  TAH = u'ط'
  ZAH = u'ظ'
  AIN = u'ع'
  GHAIN = u'غ'
  TATWEEL = u'ـ'
  FEH = u'ف'
  QAF = u'ق'
  KAF = u'ك'
  LAM = u'ل'
  MEEM = u'م'
  NOON = u'ن'
  HEH = u'ه'
  WAW = u'و'
  ALEF_MAKSURA = u'ى'
  YEH = u'ي'
  MADDA_ABOVE = u'ٓ'
  HAMZA_ABOVE = u'ٔ'
  HAMZA_BELOW = u'ٕ'
  ZERO = u'٠'
  ONE = u'١'
  TWO = u'٢'
  THREE = u'٣'
  FOUR = u'٤'
  FIVE = u'٥'
  SIX = u'٦'
  SEVEN = u'٧'
  EIGHT = u'٨'
  NINE = u'٩'
  PERCENT = u'٪'
  DECIMAL = u'٫'
  THOUSANDS = u'٬'
  STAR = u'٭'
  MINI_ALEF = u'ٰ'
  ALEF_WASLA = u'ٱ'
  FULL_STOP = u'۔'
  BYTE_ORDER_MARK = u'\ufeff'
  FATHATAN = u'ً'
  DAMMATAN = u'ٌ'
  KASRATAN = u'ٍ'
  FATHA = u'َ'
  DAMMA = u'ُ'
  KASRA = u'ِ'
  SHADDA = u'ّ'
  SUKUN = u'ْ'
  SMALL_ALEF = u'ٰ'
  SMALL_WAW = u'ۥ'
  SMALL_YEH = u'ۦ'
  LAM_ALEF = u'\ufefb'
  LAM_ALEF_HAMZA_ABOVE = u'\ufef7'
  LAM_ALEF_HAMZA_BELOW = u'\ufef9'
  LAM_ALEF_MADDA_ABOVE = u'\ufef5'
  simple_LAM_ALEF = u'لا'
  simple_LAM_ALEF_HAMZA_ABOVE = u'لأ'
  simple_LAM_ALEF_HAMZA_BELOW = u'لإ'
  simple_LAM_ALEF_MADDA_ABOVE = u'لآ'
  LETTERS = u'ابتةثجحخدذرزسشصضطظعغفقكلمنهويءآأؤإئ'
  TASHKEEL = (u'ً', u'ٌ', u'ٍ', u'َ', u'ُ', u'ِ', u'ْ', u'ّ')
  HARAKAT = (u'ً', u'ٌ', u'ٍ', u'َ', u'ُ', u'ِ', u'ْ')
  SHORTHARAKAT = (u'َ', u'ُ', u'ِ', u'ْ')
  TANWIN = (u'ً', u'ٌ', u'ٍ')
  LIGUATURES = (u'\ufefb', u'\ufef7', u'\ufef9', u'\ufef5')
  HAMZAT = (u'ء', u'ؤ', u'ئ', u'ٔ', u'ٕ', u'إ', u'أ')
  ALEFAT = (u'ا', u'آ', u'أ', u'إ', u'ٱ', u'ى', u'ٰ')
  WEAK = (u'ا', u'و', u'ي', u'ى')
  YEHLIKE = (u'ي', u'ئ', u'ى', u'ۦ')
  WAWLIKE = (u'و', u'ؤ', u'ۥ')
  TEHLIKE = (u'ت', u'ة')
  SMALL = (u'ٰ', u'ۥ', u'ۦ')
  MOON = (u'ء', u'آ', u'أ', u'إ', u'ا', u'ب', u'ج', u'ح', u'خ', ...
  SUN = (u'ت', u'ث', u'د', u'ذ', u'ر', u'ز', u'س', u'ش', u'ص', u...
  AlphabeticOrder = {u'ء': 29, u'آ': 29, u'أ': 29, u'ؤ': 29, u'إ...
  NAMES = {u'ء': u'همزة', u'آ': u'ألف ممدودة', u'أ': u'همزة على ...
  HARAKAT_pattern = re.compile(r'[\u064b\u064c\u064d\u064e\u064f...
  TASHKEEL_pattern = re.compile(r'[\u064b\u064c\u064d\u064e\u064...
  HAMZAT_pattern = re.compile(r'[\u0621\u0624\u0626\u0654\u0655\...
  ALEFAT_pattern = re.compile(r'[\u0627\u0622\u0623\u0625\u0671\...
  LIGUATURES_pattern = re.compile(r'[\ufefb\ufef7\ufef9\ufef5]')
  __package__ = 'pyarabic'
Function Details

isSukun(archar)

Checks for Arabic Sukun Mark.

Parameters:
  • archar (unicode) - Arabic Unicode character

isShadda(archar)

Checks for Arabic Shadda Mark.

Parameters:
  • archar (unicode) - Arabic Unicode character

isTatweel(archar)

Checks for Arabic Tatweel letter modifier.

Parameters:
  • archar (unicode) - Arabic Unicode character

isTanwin(archar)

Checks for Arabic Tanwin Marks (FATHATAN, DAMMATAN, KASRATAN).

Parameters:
  • archar (unicode) - Arabic Unicode character

isTashkeel(archar)

Checks for Arabic Tashkeel Marks (FATHA, DAMMA, KASRA, SUKUN, SHADDA, FATHATAN, DAMMATAN, KASRATAN).

Parameters:
  • archar (unicode) - Arabic Unicode character

isHaraka(archar)

Checks for Arabic Harakat Marks (FATHA, DAMMA, KASRA, SUKUN, TANWIN).

Parameters:
  • archar (unicode) - Arabic Unicode character

isShortharaka(archar)

Checks for Arabic short Harakat Marks (FATHA, DAMMA, KASRA, SUKUN).

Parameters:
  • archar (unicode) - Arabic Unicode character
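
The mark predicates above read like membership tests against the TANWIN, SHORTHARAKAT, HARAKAT and TASHKEEL tuples listed under Variables. The following is a minimal sketch under that assumption, not the module's actual source. The `*_sketch` names are hypothetical, and the code points are the ones that appear in HARAKAT_pattern and TASHKEEL_pattern.

```python
# Code points as they appear in HARAKAT_pattern / TASHKEEL_pattern.
FATHATAN, DAMMATAN, KASRATAN = "\u064b", "\u064c", "\u064d"
FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
SHADDA, SUKUN = "\u0651", "\u0652"

TANWIN = (FATHATAN, DAMMATAN, KASRATAN)
SHORTHARAKAT = (FATHA, DAMMA, KASRA, SUKUN)
HARAKAT = TANWIN + SHORTHARAKAT      # every mark except SHADDA
TASHKEEL = HARAKAT + (SHADDA,)       # HARAKAT plus SHADDA


def is_tanwin_sketch(archar):
    """Hypothetical stand-in for isTanwin: plain membership test."""
    return archar in TANWIN


def is_shortharaka_sketch(archar):
    """Hypothetical stand-in for isShortharaka."""
    return archar in SHORTHARAKAT


def is_haraka_sketch(archar):
    """Hypothetical stand-in for isHaraka (SHADDA is excluded)."""
    return archar in HARAKAT


def is_tashkeel_sketch(archar):
    """Hypothetical stand-in for isTashkeel (SHADDA is included)."""
    return archar in TASHKEEL
```

The only difference between the HARAKAT and TASHKEEL checks in this sketch is SHADDA, which matches the two documented patterns.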

isLigature(archar)

Checks for Arabic Ligatures like LamAlef: LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE.

Parameters:
  • archar (unicode) - Arabic Unicode character

isHamza(archar)

Checks for Arabic Hamza forms. The Hamza forms (HAMZAT) are: HAMZA, WAW_HAMZA, YEH_HAMZA, HAMZA_ABOVE, HAMZA_BELOW, ALEF_HAMZA_BELOW, ALEF_HAMZA_ABOVE.

Parameters:
  • archar (unicode) - Arabic Unicode character

isAlef(archar)

Checks for Arabic Alef forms. The Alef forms (ALEFAT) are: ALEF, ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW, ALEF_WASLA, ALEF_MAKSURA. The ALEFAT variable listed on this page also contains MINI_ALEF.

Parameters:
  • archar (unicode) - Arabic Unicode character

isYehlike(archar)

Checks for Arabic Yeh forms. The Yeh forms are: YEH, YEH_HAMZA, SMALL_YEH, ALEF_MAKSURA.

Parameters:
  • archar (unicode) - Arabic Unicode character

isWawlike(archar)

Checks for Arabic Waw-like forms. The Waw forms are: WAW, WAW_HAMZA, SMALL_WAW.

Parameters:
  • archar (unicode) - Arabic Unicode character

isTeh(archar)

Checks for Arabic Teh forms. The Teh forms are: TEH, TEH_MARBUTA.

Parameters:
  • archar (unicode) - Arabic Unicode character

isSmall(archar)

Checks for Arabic Small letters. The Small letters are: SMALL_ALEF, SMALL_WAW, SMALL_YEH.

Parameters:
  • archar (unicode) - Arabic Unicode character

isWeak(archar)

Checks for Arabic Weak letters. The Weak letters are: ALEF, WAW, YEH, ALEF_MAKSURA.

Parameters:
  • archar (unicode) - Arabic Unicode character

isMoon(archar)

Checks for Arabic Moon letters. The Moon letters are listed in the MOON variable.

Parameters:
  • archar (unicode) - Arabic Unicode character

isSun(archar)

Checks for Arabic Sun letters. The Sun letters are listed in the SUN variable.

Parameters:
  • archar (unicode) - Arabic Unicode character

order(archar)

Returns the order of an Arabic letter, between 1 and 29. Alef is 1, Yeh is 28 and Hamza is 29. Teh Marbuta has the same order as Teh, 3.

Parameters:
  • archar (unicode) - Arabic Unicode character
Returns: integer - the Arabic order.
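
The lookup described for order() can be pictured as a dictionary access on AlphabeticOrder. The sketch below is hypothetical: it contains only the entries stated on this page (the full table is truncated here), and the fallback value for characters without an entry is an assumption.

```python
# Only the entries documented on this page; the real table is longer.
ORDER_SAMPLE = {
    "\u0627": 1,   # ALEF
    "\u0628": 2,   # BEH
    "\u062a": 3,   # TEH
    "\u0629": 3,   # TEH_MARBUTA shares the order of TEH
    "\u064a": 28,  # YEH
    "\u0621": 29,  # HAMZA
}


def order_sketch(archar):
    """Hypothetical stand-in for order(): dictionary lookup.

    Assumption: characters without an entry map to 0.
    """
    return ORDER_SAMPLE.get(archar, 0)
```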

name(archar)

Returns the name of an Arabic letter, in Arabic.

Parameters:
  • archar (unicode) - Arabic Unicode character
Returns: unicode - the Arabic name.

arabicrange(self)

Returns a list of Arabic characters, from \u060c to \u0652.

Returns: unicode - list of Arabic characters.
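
A range like the one described for arabicrange() can be built directly from the two code points. This is a minimal sketch, assuming both ends are inclusive. The function name is hypothetical, and it is written for Python 3, where `chr` covers the full Unicode range.

```python
def arabic_range_sketch():
    """Hypothetical stand-in for arabicrange().

    Assumption: U+060C and U+0652 are both included.
    """
    return [chr(cp) for cp in range(0x060C, 0x0652 + 1)]


chars = arabic_range_sketch()
```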

hasShadda(word)

Checks if the Arabic word contains a Shadda.

Parameters:
  • word (unicode) - Arabic word

isVocalized(word)

Checks if the Arabic word is vocalized. The word must not contain any spaces or punctuation.

Parameters:
  • word (unicode) - Arabic word
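
Both checks above can be sketched with the SHADDA constant and the HARAKAT_pattern shown under Variables. This sketch assumes that a word counts as vocalized as soon as it carries at least one haraka; the real function may apply further conditions. The function names are hypothetical.

```python
import re

SHADDA = "\u0651"
# Same character class as HARAKAT_pattern on this page.
HARAKAT_pattern = re.compile("[\u064b\u064c\u064d\u064e\u064f\u0650\u0652]")


def has_shadda_sketch(word):
    """Hypothetical stand-in for hasShadda()."""
    return SHADDA in word


def is_vocalized_sketch(word):
    """Hypothetical stand-in for isVocalized().

    Assumption: one haraka anywhere in the word is enough.
    """
    return HARAKAT_pattern.search(word) is not None
```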

isVocalizedtext(text)

Checks if the Arabic text is vocalized. The text can contain many words and spaces.

Parameters:
  • text (unicode) - Arabic text

isArabicstring(text)

Checks that all characters belong to the standard Arabic Unicode block. An Arabic string can contain spaces, digits and punctuation, but only standard Arabic characters, not extended Arabic.

Parameters:
  • text (unicode) - input text
Returns: Boolean - True if all characters are in the Arabic block.

isArabicrange(text)

Checks that all characters belong to the Arabic Unicode block.

Parameters:
  • text (unicode) - input text
Returns: Boolean - True if all characters are in the Arabic block.
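
A block check of this kind can be sketched as a per-character range comparison. The sketch assumes that "Arabic Unicode block" means U+0600 to U+06FF. The page does not state the exact range, so the real function may accept more (or fewer) characters. The function name is hypothetical.

```python
def is_arabic_range_sketch(text):
    """Hypothetical stand-in for isArabicrange().

    Assumption: the Arabic block is U+0600..U+06FF. An empty string
    yields True because there is no offending character.
    """
    return all("\u0600" <= ch <= "\u06ff" for ch in text)
```

Note that this sketch rejects spaces and Latin digits, which isArabicstring() is documented to allow.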

isArabicword(word)

Checks for a valid Arabic word. An Arabic word does not contain spaces, digits or punctuation. To catch some spelling errors, TEH_MARBUTA must be at the end of the word.

Parameters:
  • word (unicode) - input word
Returns: Boolean - True if all characters are in the Arabic block.

firstChar(word)

Returns the first char.

Parameters:
  • word (unicode) - given word
Returns: unicode char - the first char.

secondChar(word)

Returns the second char.

Parameters:
  • word (unicode) - given word
Returns: unicode char - the second char.

lastChar(word)

Returns the last letter. Example: in "zerrouki", 'i' is the last.

Parameters:
  • word (unicode) - given word
Returns: unicode char - the last letter.

secondlastChar(word)

Returns the second last letter. Example: in "zerrouki", 'k' is the second last.

Parameters:
  • word (unicode) - given word
Returns: unicode char - the second last letter.
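
The four char functions correspond to plain string indexing. The sketch below uses hypothetical names and reproduces the "zerrouki" example from the descriptions. How the real functions treat words that are too short is not documented here, so the sketch does not handle that case.

```python
def first_char_sketch(word):
    return word[0]


def second_char_sketch(word):
    return word[1]


def last_char_sketch(word):
    return word[-1]


def second_last_char_sketch(word):
    return word[-2]


word = "zerrouki"
last = last_char_sketch(word)
second_last = second_last_char_sketch(word)
```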

stripHarakat(text)

Strips Harakat from an Arabic word, except Shadda. The stripped marks are:

  • FATHA, DAMMA, KASRA
  • SUKUN
  • FATHATAN, DAMMATAN, KASRATAN

Example:

>>> text=u"الْعَرَبِيّةُ"
>>> stripHarakat(text)
العربيّة

Parameters:
  • text (unicode) - Arabic text.
Returns: unicode - the stripped text.

stripTashkeel(text)

Strips vowels from a text, including Shadda. The stripped marks are:

  • FATHA, DAMMA, KASRA
  • SUKUN
  • SHADDA
  • FATHATAN, DAMMATAN, KASRATAN

Example:

>>> text=u"الْعَرَبِيّةُ"
>>> stripTashkeel(text)
العربية

Parameters:
  • text (unicode) - Arabic text.
Returns: unicode - the stripped text.
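
The difference between stripHarakat() and stripTashkeel() comes down to whether SHADDA (\u0651) is in the character class. A minimal sketch using the two patterns shown under Variables; the function names are hypothetical, and substituting with the compiled pattern is an assumption about how the stripping is done.

```python
import re

# Character classes copied from HARAKAT_pattern and TASHKEEL_pattern.
HARAKAT_pattern = re.compile("[\u064b\u064c\u064d\u064e\u064f\u0650\u0652]")
TASHKEEL_pattern = re.compile("[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]")


def strip_harakat_sketch(text):
    """Remove every haraka but keep SHADDA."""
    return HARAKAT_pattern.sub("", text)


def strip_tashkeel_sketch(text):
    """Remove every haraka and SHADDA."""
    return TASHKEEL_pattern.sub("", text)


# The example word from this page, written with escapes so that the
# combining marks are unambiguous.
word = ("\u0627\u0644\u0652\u0639\u064e\u0631\u064e"
        "\u0628\u0650\u064a\u0651\u0629\u064f")
without_harakat = strip_harakat_sketch(word)
without_tashkeel = strip_tashkeel_sketch(word)
```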

stripTatweel(text)

Strips Tatweel from a text and returns the resulting text.

Example:

>>> text=u"العـــــربية"
>>> stripTatweel(text)
العربية

Parameters:
  • text (unicode) - Arabic text.
Returns: unicode - the stripped text.
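
Since Tatweel is a single character (TATWEEL, \u0640), stripping it can be sketched with a plain string replacement. The function name is hypothetical, and the real function may use a different mechanism.

```python
TATWEEL = "\u0640"


def strip_tatweel_sketch(text):
    """Hypothetical stand-in for stripTatweel()."""
    return text.replace(TATWEEL, "")


# The example from this page: five Tatweel characters inside the word.
stretched = "\u0627\u0644\u0639" + TATWEEL * 5 + "\u0631\u0628\u064a\u0629"
plain = strip_tatweel_sketch(stretched)
```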

normalizeLigature(text)

Normalizes Lam Alef ligatures into two letters (LAM and ALEF) and returns the resulting text. Some systems represent the LamAlef ligature as a single letter; this function converts it into two letters. The ligatures converted into LAM and ALEF are:

  • LAM_ALEF, LAM_ALEF_HAMZA_ABOVE, LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE

Example:

>>> text=u"لانها لالء الاسلام"
>>> normalizeLigature(text)
لانها لالئ الاسلام

Parameters:
  • text (unicode) - Arabic text.
Returns: unicode - the converted text.
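
Read literally, the description says that all four ligatures become the pair LAM + ALEF. The sketch below follows that reading and uses the character class of LIGUATURES_pattern from the Variables list. The function name is hypothetical, and whether the real function keeps the hamza or madda of the ligature is not stated on this page.

```python
import re

# Same character class as LIGUATURES_pattern on this page.
LIGUATURES_pattern = re.compile("[\ufefb\ufef7\ufef9\ufef5]")
LAM, ALEF = "\u0644", "\u0627"


def normalize_ligature_sketch(text):
    """Replace each LamAlef ligature with the two letters LAM, ALEF.

    Assumption: hamza and madda variants also become plain LAM + ALEF,
    as the description states.
    """
    return LIGUATURES_pattern.sub(LAM + ALEF, text)


# One presentation-form ligature followed by three ordinary letters.
converted = normalize_ligature_sketch("\ufefb\u0646\u0647\u0627")
```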

normalizeHamza(word)

Standardizes the Hamzat into one form of Hamza, and replaces Madda by Hamza and Alef. Replaces the LamAlefs by simplified letters.

Example:

>>> text=u"سئل أحد الأئمة"
>>> normalizeHamza(text)
سءل ءحد الءءمة

Parameters:
  • word (unicode) - Arabic text.
Returns: unicode - the converted text.
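
The Hamza standardization can be sketched in two steps: expand ALEF_MADDA into HAMZA + ALEF, then map every character matched by HAMZAT_pattern to the bare HAMZA. This is an assumption about the order of operations. The function name is hypothetical, and the LamAlef replacement mentioned in the description is left out.

```python
import re

# Same character class as HAMZAT_pattern on this page.
HAMZAT_pattern = re.compile("[\u0621\u0624\u0626\u0654\u0655\u0625\u0623]")
HAMZA, ALEF, ALEF_MADDA = "\u0621", "\u0627", "\u0622"


def normalize_hamza_sketch(word):
    """Hypothetical stand-in for normalizeHamza() without LamAlef handling."""
    word = word.replace(ALEF_MADDA, HAMZA + ALEF)  # Madda -> Hamza + Alef
    return HAMZAT_pattern.sub(HAMZA, word)         # every Hamza form -> HAMZA


# The three-word example from this page, written with escapes.
text = "\u0633\u0626\u0644 \u0623\u062d\u062f \u0627\u0644\u0623\u0626\u0645\u0629"
normalized = normalize_hamza_sketch(text)
```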

separate(word)

Separates the letters from the vowels in an Arabic word. If a letter has no haraka, the "not defined" haraka is assigned to it. Returns (letters, vowels).
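
The separation can be pictured as a single pass that sends marks to one string and everything else to another, keeping the two the same length. This is a simplified sketch with hypothetical names. The placeholder used for "no haraka" is an assumption (Tatweel is used here purely for illustration), and only the last mark of each letter is kept, so a Shadda followed by a haraka loses the Shadda.

```python
TASHKEEL = "\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652"
NOT_DEF_HARAKA = "\u0640"  # assumption: placeholder for "no haraka"


def separate_sketch(word):
    """Split a word into (letters, marks) of equal length."""
    letters, marks = [], []
    for ch in word:
        if ch in TASHKEEL:
            if marks:
                marks[-1] = ch  # attach the mark to the preceding letter
        else:
            letters.append(ch)
            marks.append(NOT_DEF_HARAKA)  # default until a mark follows
    return "".join(letters), "".join(marks)


# KAF+FATHA, TEH+FATHA, BEH with no mark.
letters, marks = separate_sketch("\u0643\u064e\u062a\u064e\u0628")
```

Because both strings have the same length, the pair has the shape that joint(letters, marks) is documented to require.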

vocalizedlike(word1, word2)

Returns True if the two words have the same letters and the same harakat. The two words can be fully or partially vocalized.
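
One way to read this comparison: the letters must match exactly, and the marks on a letter must match whenever both words carry marks at that position. The sketch below implements that reading. Treating a missing mark as compatible with any mark is an assumption about how partial vocalization is handled, and all names are hypothetical.

```python
TASHKEEL = "\u064b\u064c\u064d\u064e\u064f\u0650\u0651\u0652"


def _letters_with_marks(word):
    """Return a list of [letter, marks] pairs for the word."""
    pairs = []
    for ch in word:
        if ch in TASHKEEL:
            if pairs:
                pairs[-1][1] += ch  # attach the mark to the previous letter
        else:
            pairs.append([ch, ""])
    return pairs


def vocalized_like_sketch(word1, word2):
    """True if letters match and marks never contradict each other."""
    pairs1 = _letters_with_marks(word1)
    pairs2 = _letters_with_marks(word2)
    if len(pairs1) != len(pairs2):
        return False
    for (letter1, marks1), (letter2, marks2) in zip(pairs1, pairs2):
        if letter1 != letter2:
            return False
        # Assumption: an unmarked letter is compatible with any marks.
        if marks1 and marks2 and marks1 != marks2:
            return False
    return True
```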

waznlike(word1, wazn)

Returns True if word1 matches a wazn (pattern). The letters must be equal; the FEH, AIN and LAM letters of the wazn act as generic letters. The two words can be fully or partially vocalized.

shaddalike(partial, fully)

Returns True if the two words have the same letters and the same harakat. The first word is partially vocalized and the second is fully vocalized. If the partially vocalized word contains a Shadda, it must be at the same place in the fully vocalized word.

reduceTashkeel(text)

Reduces the Tashkeel by deleting evident cases.

Parameters:
  • text (unicode) - the fully vocalized input text.
Returns: unicode - partially vocalized text.

Variables Details

MOON

Value:
(u'ء',
 u'آ',
 u'أ',
 u'إ',
 u'ا',
 u'ب',
 u'ج',
 u'ح',
...

SUN

Value:
(u'ت',
 u'ث',
 u'د',
 u'ذ',
 u'ر',
 u'ز',
 u'س',
 u'ش',
...

AlphabeticOrder

Value:
{u'ء': 29,
 u'آ': 29,
 u'أ': 29,
 u'ؤ': 29,
 u'إ': 29,
 u'ئ': 29,
 u'ا': 1,
 u'ب': 2,
...

NAMES

Value:
{u'ء': u'همزة',
 u'آ': u'ألف ممدودة',
 u'أ': u'همزة على الألف',
 u'ؤ': u'همزة على الواو',
 u'إ': u'همزة تحت الألف',
 u'ئ': u'همزة على الياء',
 u'ا': u'ألف',
 u'ب': u'باء',
...

HARAKAT_pattern

Value:
re.compile(r'[\u064b\u064c\u064d\u064e\u064f\u0650\u0652]')

TASHKEEL_pattern

Value:
re.compile(r'[\u064b\u064c\u064d\u064e\u064f\u0650\u0652\u0651]')

HAMZAT_pattern

Value:
re.compile(r'[\u0621\u0624\u0626\u0654\u0655\u0625\u0623]')

ALEFAT_pattern

Value:
re.compile(r'[\u0627\u0622\u0623\u0625\u0671\u0649\u0670]')