Używanie Objective C / Cocoa do unescape znaków unicode, czyli u1234

Question

Używanie Objective C / Cocoa do unescape znaków unicode, czyli u1234

Niektóre strony, z których pobieram dane, zwracają ciągi UTF-8, z unikniętymi znakami UTF-8, czyli: \u5404\u500b\u90fd

Czy jest wbudowana funkcja cocoa, która może w tym pomóc, czy będę musiał napisać własny algorytm dekodowania.

34

objective-c cocoa unicode

Author: corydoras, 2010-01-20

Source

4 answers

~~to prawda, że Cocoa nie oferuje rozwiązania~~ , jednak Core Foundation robi: CFStringTransform.

CFStringTransform mieszka w zakurzonym, odległym zakątku Mac OS (i iOS), a więc jest to trochę know gem. Jest to przedni koniec silnika transformacji ciągów zgodnego z ICU firmy Apple. Może wykonywać prawdziwą magię, jak transliteracje między grecką i łacińską (lub o dowolnych znanych skryptach), ale może być również używany do wykonywania przyziemnych zadań, takich jak unescaping strings from a crappy Serwer:

NSString *input = @"\\u5404\\u500b\\u90fd";
NSString *convertedString = [input mutableCopy];

CFStringRef transform = CFSTR("Any-Hex/Java");
CFStringTransform((__bridge CFMutableStringRef)convertedString, NULL, transform, YES);

NSLog(@"convertedString: %@", convertedString);

// prints: 各個都, tada!

Jak już mówiłem, CFStringTransform jest naprawdę potężny. Obsługuje szereg predefiniowanych przekształceń, takich jak mapowanie wielkości liter, normalizacja lub Konwersja nazw znaków unicode. Możesz nawet zaprojektować własne przekształcenia.

~~nie mam pojęcia, dlaczego Apple nie udostępnia go z kakao.~~

Edytuj 2015:

OS X 10.11 i iOS 9 dodają następującą metodę do Fundacji:

- (nullable NSString *)stringByApplyingTransform:(NSString *)transform reverse:(BOOL)reverse;

Więc przykład z góry staje się...

NSString *input = @"\\u5404\\u500b\\u90fd";
NSString *convertedString = [input stringByApplyingTransform:@"Any-Hex/Java"
                                                     reverse:YES];

NSLog(@"convertedString: %@", convertedString);

Dzięki @nschmidt za Ostrzeżenie.

90

Author: Nikolai Ruhe,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2017-05-23 10:30:31

Oto, co napisałem. Mam nadzieję, że to pomoże niektórym ludziom.

+ (NSString*) unescapeUnicodeString:(NSString*)string
{
// unescape quotes and backwards slash
NSString* unescapedString = [string stringByReplacingOccurrencesOfString:@"\\\"" withString:@"\""];
unescapedString = [unescapedString stringByReplacingOccurrencesOfString:@"\\\\" withString:@"\\"];

// tokenize based on unicode escape char
NSMutableString* tokenizedString = [NSMutableString string];
NSScanner* scanner = [NSScanner scannerWithString:unescapedString];
while ([scanner isAtEnd] == NO)
{
    // read up to the first unicode marker
    // if a string has been scanned, it's a token
    // and should be appended to the tokenized string
    NSString* token = @"";
    [scanner scanUpToString:@"\\u" intoString:&token];
    if (token != nil && token.length > 0)
    {
        [tokenizedString appendString:token];
        continue;
    }

    // skip two characters to get past the marker
    // check if the range of unicode characters is
    // beyond the end of the string (could be malformed)
    // and if it is, move the scanner to the end
    // and skip this token
    NSUInteger location = [scanner scanLocation];
    NSInteger extra = scanner.string.length - location - 4 - 2;
    if (extra < 0)
    {
        NSRange range = {location, -extra};
        [tokenizedString appendString:[scanner.string substringWithRange:range]];
        [scanner setScanLocation:location - extra];
        continue;
    }

    // move the location pas the unicode marker
    // then read in the next 4 characters
    location += 2;
    NSRange range = {location, 4};
    token = [scanner.string substringWithRange:range];
    unichar codeValue = (unichar) strtol([token UTF8String], NULL, 16);
    [tokenizedString appendString:[NSString stringWithFormat:@"%C", codeValue]];

    // move the scanner past the 4 characters
    // then keep scanning
    location += 4;
    [scanner setScanLocation:location];
}

// done
return tokenizedString;
}

+ (NSString*) escapeUnicodeString:(NSString*)string
{
// lastly escaped quotes and back slash
// note that the backslash has to be escaped before the quote
// otherwise it will end up with an extra backslash
NSString* escapedString = [string stringByReplacingOccurrencesOfString:@"\\" withString:@"\\\\"];
escapedString = [escapedString stringByReplacingOccurrencesOfString:@"\"" withString:@"\\\""];

// convert to encoded unicode
// do this by getting the data for the string
// in UTF16 little endian (for network byte order)
NSData* data = [escapedString dataUsingEncoding:NSUTF16LittleEndianStringEncoding allowLossyConversion:YES];
size_t bytesRead = 0;
const char* bytes = data.bytes;
NSMutableString* encodedString = [NSMutableString string];

// loop through the byte array
// read two bytes at a time, if the bytes
// are above a certain value they are unicode
// otherwise the bytes are ASCII characters
// the %C format will write the character value of bytes
while (bytesRead < data.length)
{
    uint16_t code = *((uint16_t*) &bytes[bytesRead]);
    if (code > 0x007E)
    {
        [encodedString appendFormat:@"\\u%04X", code];
    }
    else
    {
        [encodedString appendFormat:@"%C", code];
    }
    bytesRead += sizeof(uint16_t);
}

// done
return encodedString;
}

11

Author: Christoph,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2011-10-28 22:44:03

Kod prosty:

const char *cString = [unicodeStr cStringUsingEncoding:NSUTF8StringEncoding];
NSString *resultStr = [NSString stringWithCString:cString encoding:NSNonLossyASCIIStringEncoding];

From: https://stackoverflow.com/a/7861345

2

Author: likid1412,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2017-05-23 12:34:54

score 23 · Accepted Answer

Nie ma wbudowanej funkcji do wykonywania C unescapingu.

Możesz trochę oszukać z NSPropertyListSerialization ponieważ plista "starego stylu tekstu" obsługuje C poprzez \Uxxxx:

NSString* input = @"ab\"cA\"BC\\u2345\\u0123";

// will cause trouble if you have "abc\\\\uvw"
NSString* esc1 = [input stringByReplacingOccurrencesOfString:@"\\u" withString:@"\\U"];
NSString* esc2 = [esc1 stringByReplacingOccurrencesOfString:@"\"" withString:@"\\\""];
NSString* quoted = [[@"\"" stringByAppendingString:esc2] stringByAppendingString:@"\""];
NSData* data = [quoted dataUsingEncoding:NSUTF8StringEncoding];
NSString* unesc = [NSPropertyListSerialization propertyListFromData:data
                   mutabilityOption:NSPropertyListImmutable format:NULL
                   errorDescription:NULL];
assert([unesc isKindOfClass:[NSString class]]);
NSLog(@"Output = %@", unesc);

Ale pamiętaj, że to nie jest zbyt wydajne. O wiele lepiej, jeśli napiszesz swój własny parser. (BTW dekodujesz ciągi JSON? Jeśli tak, możesz użyć istniejących parserów JSON.)