About the Turkish-I Problem

  • Context

We’ve got some computers located in Turkey. My colleagues started to execute some PowerShell code in the end-user context/session/environment.
They had the following code and noticed that the regular expression on line 5 doesn’t match when the culture is set to tr-TR

(Get-Content -Path $FilePath -ReadCount 1 -Encoding UTF8) | 
ForEach-Object {
 $line = $_
 
 if ($line -match "^([A-Z0-9_]+)=(.*)$") {
  $k = $Matches[1]
  $v = $Matches[2]
   			
  if ($Keys -contains $k) {
   $myData.$k = $v.ToString().Trim()
  } elseif ($WarnIfUnknown) {
   Write-Warning "Unknown ini key '$k' in file '$FilePath'"
  }
 } else {
  Write-Warning "Wrong line '$line' in file '$FilePath'"
 }
}
  • Problem

What happens is actually well documented on this page:

By default, when the regular expression engine performs case-insensitive comparisons, it uses the casing conventions of the current culture to determine equivalent uppercase and lowercase characters.

However, this behavior is undesirable for some types of comparisons, particularly when comparing user input to the names of system resources, such as passwords, files, or URLs. The following example illustrates such as scenario. The code is intended to block access to any resource whose URL is prefaced with FILE://. The regular expression attempts a case-insensitive match with the string by using the regular expression $FILE://. However, when the current system culture is tr-TR (Turkish-Turkey), “I” is not the uppercase equivalent of “i”. As a result, the call to the Regex.IsMatch method returns false, and access to the file is allowed.

Source: https://docs.microsoft.com/en-us/dotnet/standard/base-types/regular-expression-options

Here’s the same example in PowerShell to showcase what happens:

NB: It uses a function named Using-Culture that you can find on this page.

  • Solution

The solution is provided on the same page that documents the above problem:

Instead of using the case-insensitive comparisons of the current culture, you can specify the RegexOptions.CultureInvariant option to ignore cultural differences in language and to use the conventions of the invariant culture.

Using-Culture -Culture 'tr-TR' -ScriptBlock {
 [regex]::IsMatch(
  'file',
   'FILE',
   [System.Text.RegularExpressions.RegexOptions]::IgnoreCase+
   [System.Text.RegularExpressions.RegexOptions]::CultureInvariant
 )
}

As you can see, it returns now true instead of false:

How could my colleagues solve their issue?
Well, they should use the RegexOptions.CultureInvariant option to ignore cultural differences when they perform their regular expression match.
They cannot use the -match operator anymore and have it populate the $Matches automatic variable when the input is scalar.
Instead of the -match operator, they should use the .Net [regex] object to be able to specify the RegexOptions.CultureInvariant option:

(Get-Content -Path $FilePath -ReadCount 1 -Encoding UTF8) | 
ForEach-Object {
 $line = $_
 
 if ([regex]::Matches($line,'^([A-Z0-9_]+)=(.*)$',513)) {

  $k,$v = [regex]::Matches($line,'^([A-Z0-9_]+)=(.*)$',513) | 
  Select-Object -ExpandProperty Groups | 
  Select -Last 2 -ExpandProperty Value
   			
  if ($Keys -contains $k) {
   $myData.$k = $v.ToString().Trim()
  } elseif ($WarnIfUnknown) {
   Write-Warning "Unknown ini key '$k' in file '$FilePath'"
  }
 } else {
  Write-Warning "Wrong line '$line' in file '$FilePath'"
 }
}
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.