Release Notes

Below are the release notes for the Private AI container. To update, please grab a new version of the image.

3.8.2 (2024/05/23)

What’s new in 3.8.2?

attention

On May 24th, we uncovered a bug in our PDF redaction module that could allow PII to leak through in the invisible text layer re-inserted into the new document. The PII is only accessible by searching the redacted document or by copying the invisible text layer. Our testing revealed that the issue occurs on only a few percent of PDFs, predominantly around slanted text and tables. As such, we believe the likelihood of a leak is small in ordinary scenarios. However, as we take data privacy very seriously, we strongly suggest that users of our PDF processing update to 3.8.2 immediately.

  • Model Improvements
    • Improved PII detection in Japanese call transcripts
    • Improved detection of PII within structured tables in PDFs (MONEY mentions, in particular)
    • Improvements to CVV detection in Russian text
  • General Improvements
    • PDF invisible text layer issue addressed. Please see the top of the 3.8.2 notes for details
    • Increased coverage of supported PowerPoint elements, particularly around text in separate containers on pages. Please see the PowerPoint page for further details
    • Security updates

3.8.1 (2024/05/07)

What’s new in 3.8.1?

  • Model Improvements
    • Note: The following model improvements for this point release are only included in the high and high_multilingual accuracy modes
    • Improved detection of PII in Japanese ASR call transcripts, especially for EMAIL_ADDRESS, LOCATION_ADDRESS (and related entity types), and NAME
    • Improved detection of MONEY entities in structured data and PDF tables
    • Better performance on PCI entity types (e.g., ROUTING_NUMBER) and other numerical classes (e.g., SSN, NUMERICAL_PII) in English text
    • Improved performance on PII detection in English clinical notes and other medical data types
    • Better detection of partial CREDIT_CARD numbers (e.g., the last four digits only) in English, German, French, Portuguese, and Japanese
    • Improvements to the DRIVER_LICENSE and VEHICLE_ID classes in Spanish
  • General Improvements
    • Security updates
    • Fixed issue with copyable Japanese text in de-identified PDFs
    • The gpu-text container no longer has a strict 4GB shared memory requirement; this requirement is only for the gpu container

3.8.0 (2024/03/28)

What’s new in 3.8.0?

attention

3.8.0 includes a container log warning that strongly recommends 64 GB of RAM for anyone utilizing the file support endpoints.

  • Model Improvements
    • Improved detection of PII in structured data
    • Improved detection of PII in French and Portuguese text, particularly with respect to numerical entity types such as SSN and HEALTHCARE_NUMBER
  • Translated Redaction Labels
    • Redaction markers are now supported in core languages. Please see the languages page to see which are supported.
  • Websocket (Beta)
    • The websocket endpoint now retains context. This can be enabled or disabled by setting the PAI_WS_LINK_BATCH environment variable to true or false. The default is true.
    • The context window can also be adjusted via the PAI_WS_CONTEXT_SIZE environment variable. The default size is 50.
  • Other Improvements
    • Image processing now supports redaction with black boxes.
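
The two websocket variables above can be collected in a Docker env file. The following is a minimal sketch: the variable names and defaults come from these notes, while the helper name `websocket_env_lines` is hypothetical and the accepted value formats should be checked against the Environment Variables page.

```python
# Sketch: render the 3.8.0 websocket settings as lines for `docker run --env-file`.
# Variable names and defaults are taken from the notes; the helper itself is hypothetical.

def websocket_env_lines(link_batch: bool = True, context_size: int = 50) -> list[str]:
    """Return KEY=VALUE lines for the websocket context settings."""
    return [
        f"PAI_WS_LINK_BATCH={'true' if link_batch else 'false'}",
        f"PAI_WS_CONTEXT_SIZE={context_size}",
    ]


# Example: keep context linking on but widen the context window.
print("\n".join(websocket_env_lines(link_batch=True, context_size=100)))
```

Writing these lines to a file and passing it with `--env-file` keeps the websocket configuration separate from the rest of the `docker run` command.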

3.7.3 (2024/03/14)

What’s new in 3.7.3?

  • New Entity Type
    • We've added support for a new entity type, LOCATION_ADDRESS_STREET, which is a subclass of our existing LOCATION_ADDRESS. Whereas LOCATION_ADDRESS captures a full address, LOCATION_ADDRESS_STREET captures only the street name and number of an address, plus unit numbers, if relevant. Please see our Supported Entities page for examples of both categories.
  • Model Improvements
    • Improved detection of numerical entity types in French, Spanish, and English, especially SSN and CREDIT_CARD
    • Better detection of BANK_ACCOUNT, MEDICAL_PROCESS, and TIME in Dutch
    • Improved detection of numerical entity types written in words (as in ASR transcripts) for Japanese, French, Spanish, and Portuguese text
    • Note: Model improvements for this point release are only included in the high and high_multilingual accuracy modes
  • Other General Improvements
    • Docx document processing has improved and can now handle embedded hyperlinks, text boxes and shapes with text data

3.7.2 (2024/03/01)

What’s new in 3.7.2?

  • Model Improvements
    • Improvements to detection of partial credit card numbers (i.e., "the last four digits are ...") and social security numbers
    • Better detection of numerical entity types (e.g., SSN, CREDIT_CARD) written in words (e.g., "one, two, three"), a common format used by ASR tools, especially in multilingual text (Spanish, Dutch, Korean, German, Italian)
    • Improved detection of MONEY entities in English and all numerical entity types in French
    • General improvements to PII detection in Mandarin (simplified script), especially for NAME, LOCATION, MONEY, DRUG, and DATE
    • Improved PII detection in Spanish, especially with respect to regional equivalents of SSN, CVV, ACCOUNT_NUMBER, PASSPORT_NUMBER, and DOB
    • Note: Model improvements for this point release are only included in the high and high_multilingual accuracy modes
  • Other General Improvements
    • Improved support for Japanese fonts in PDF file processing
    • Security updates

3.7.1 (2024/02/09)

What’s new in 3.7.1?

  • Model Improvements
    • Improved PII detection in tabular data with abbreviated field / column header names
  • New Language Support
    • We now provide extended support for Georgian
  • Other General Improvements
    • Security updates

3.7.0 (2024/02/02)

What’s new in 3.7.0?

attention

3.7.0 introduces a breaking change in model behaviour. Medical codes previously redacted as NUMERICAL_PII will no longer be detected, unless the new Beta entity type MEDICAL_CODE is explicitly enabled in your POST request (details on this entity type below).
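
As an illustration, a request body that enables MEDICAL_CODE might look like the sketch below. It assumes Beta entity types are switched on through the same `ENABLE` selector shown in the 3.0 payload examples later in these notes; the Supported Entities page remains the authority on how Beta types are enabled. The helper name is hypothetical, and listing NUMERICAL_PII alongside MEDICAL_CODE is an assumption made so that other numerical identifiers stay in scope.

```python
import json


def enable_entities_payload(text: str, entity_types: list[str]) -> dict:
    """Build a /v3/process/text body that enables the listed entity types.

    Assumption: Beta types use the ENABLE selector from the 3.0 examples.
    """
    return {
        "text": [text],
        "entity_detection": {
            "entity_types": [{"type": "ENABLE", "value": list(entity_types)}],
        },
    }


payload = enable_entities_payload(
    "Forearm sprain SNOMED-CT 70704007",
    ["MEDICAL_CODE", "NUMERICAL_PII"],
)
print(json.dumps(payload, indent=2))
```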

  • New Beta Entity Type
    • We've added Beta support for a MEDICAL_CODE entity type to our English models, covering medical classification systems such as ICD-10, NDC, SNOMED, etc. Please see our Supported Entities page for more information on how to enable Beta entity types.
      Ex.: 1981-03-11T04:11:32-03:00 Forearm sprain SNOMED-CT 70704007
  • Model Improvements
    • English:
      • Improved detection of BANK_ACCOUNT and ROUTING_NUMBER, including regional variants such as UK sort code and Australian BSB
      • Enhanced detection of SSN in ASR call transcripts
      • Improved support for PII detection in tabular data and unstructured text containing mathematical formulas
      • Better PHI detection in ASR-transcribed clinical note dictations
    • Japanese:
      • Improved average recall of PII entity types
    • Spanish & Portuguese:
      • Improved detection of NAME_MEDICAL_PROFESSIONAL and ORGANIZATION_MEDICAL_FACILITY
  • Allow Filter Logic (Beta)

We've introduced an AllowTextFilter parameter under entity_types/filter that applies a regex filter on the text payload as a whole and not just the entities detected (which is how the AllowFilter parameter currently functions). This filter functionality is flagged as beta in the 3.7.0 release and is not recommended for production use.

More information can be found on the API spec.

  • New Audio Options

For audio file processing, we've introduced two new parameters to adjust the audio redaction bleep frequency and gain. These parameters can be adjusted under AudioOptions in the process_file routes using bleep_frequency and bleep_gain.

More information can be found on the API spec.

  • Other General Improvements
    • Improvements to ASR engine to provide better overall audio redaction of detected entities
    • Processed Docx files with entities detected in footers could previously cause issues; this has been fixed
    • Processed Docx formatting issues including tables, checkboxes and spacing have been addressed

3.6.3 (2024/01/12)

What’s new in 3.6.3?

  • Improved performance of Standard ASR
  • GPU container can now be run as a non-root user

3.6.2 (2024/01/03)

What’s new in 3.6.2?

  • Bug fixes for .ppt and .pptx files:
    • Issue with delimiters in text being improperly redacted
    • Issue with redaction in slide notes
    • Better error handling for unsupported images embedded within the files
  • Image resizing support within .ppt and .pptx files
  • Introduction of PAI_MAX_IMAGE_PIXELS environment variable to configure max allowed pixels in images processed

3.6.1 (2023/12/21)

What’s new in 3.6.1?

  • Security patch for the transformer library

3.6.0 (2023/12/20)

attention

3.6.0 introduces a breaking change: The automatic English/multilingual accuracy mode selection introduced in 3.5.0 is now used by default. To retain previous behaviour, please set accuracy under the entity_detection payload configuration to high.
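
A minimal sketch of pinning the previous behaviour on an existing request body follows. The `entity_detection` and `accuracy` keys are the ones named above; the helper `pin_accuracy` is hypothetical.

```python
import copy


def pin_accuracy(payload: dict, accuracy: str = "high") -> dict:
    """Return a copy of a process-text payload with the accuracy mode set explicitly."""
    pinned = copy.deepcopy(payload)  # leave the caller's payload untouched
    pinned.setdefault("entity_detection", {})["accuracy"] = accuracy
    return pinned


original = {"text": ["Bonjour, je m'appelle Marie"]}
print(pin_accuracy(original))
```

Setting the mode in one place like this avoids relying on the container default, which changed in this release.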

What’s new in 3.6.0?

  • Websocket Endpoint (Beta)
    • A websocket endpoint has been introduced in this version: /ws
    • More information can be found here
  • General Improvements
    • The high_automatic accuracy mode introduced in 3.5.0 is now the default model when processing data with the container. This means that if the standard_high_multilingual or high_multilingual models are available in your container instance and the container detects a non-English language, it will automatically use the *-multilingual model to process the data. To retain previous behaviour, please set accuracy to high
    • Various improvements to PDF and other document types, specifically:
      • Visual distortions / blackouts in post-processed PDFs no longer occur
      • DICOM files support 16-bit images
      • Office document processing speed improvements
      • Office documents have improved entity numbering
    • The process/file/base64 endpoint now supports the filetype as well as the mimetype. E.g. pdf and application/pdf can be used for a base64-encoded pdf file.
    • The project_id character limit has been increased from 32 characters to 60
    • Better reference tracking for entities referred to with different names, e.g. "Gary", "Gary's" and "G-A-R-Y" can all be linked with the same marker (NAME_1)
    • The CPU container RAM requirement when audio file support is enabled has been raised to 16GB.
    • Support for containerised Azure OCR has been added
    • Support for Audio Distortion has been added. More information can be found here.
  • Model Improvements
    • Enhanced detection of “spelled-out” entities, commonly occurring in call transcripts (e.g., “g as in golf, a as in apple, r for red, y for yellow”, “G-A-R-Y”)
    • Improvements to DURATION detection in English, developing multilingual support (with a focus on Spanish, German, and Dutch)
    • Improved PII / PHI detection in healthcare data, in particular: single word responses in patient forms and DICOM attributes
    • Support added for Irish eircode (postal code) detection in English text
    • Improvements to PHI detection in Dutch, English, Italian, Ukrainian (focused on: CONDITION, BLOOD_TYPE, INJURY)
    • Improved detection of German, Korean, and Italian LOCATIONs and addresses

3.5.0 (2023/11/14)

What’s new in 3.5.0?

  • Company Confidential Information Preview
    • This new feature allows users to detect and redact company confidential information. It is enabled through entity configuration in this release. Please reach out to support@private-ai.com for more information!
  • General Improvements
    • Improved OCR performance (again!)
    • Office file support improvement (Doc / DocX, PPT / PPTX etc.)
    • The container OpenAPI spec now generates a v2 compatible schema for a seamless integration with API tools
  • New Beta Entities
    • We've introduced beta entities which capture Confidential Company Information (CCI). Please see our Supported Entities page for more information on how to enable these classes and what they cover.

3.4.3 (2023/10/25)

What’s new in 3.4.3?

  • General Improvements
    • Improved OCR performance
    • Improved Japanese OCR image / file redaction
    • The multilingual model is now auto-selected if a non-English language is detected and the English model is not explicitly selected
    • General performance improvements
  • Model Improvements
    • Tagalog: Improvements in accuracy for PHONE_NUMBER detection
    • English: Improvements in accuracy for PCI classes (in particular: CREDIT_CARD_EXPIRATION, CVV, ROUTING_NUMBER) and other numerical classes (ACCOUNT_NUMBER, NUMERICAL_PII, PHONE_NUMBER, VEHICLE_ID)
    • Added support for DUNS number detection (classified as NUMERICAL_PII)

3.4.2 (2023/10/11)

What’s new in 3.4.2?

  • General Improvements
    • Doc / Docx files now process contents within tables
    • Additional configuration with best label matching is now available in the Process text endpoint. Find more details on Enable Non-Max Suppression in the process text documentation.
  • Model Improvements
    • Improved detection of ACCOUNT_NUMBER entity, particularly in contexts where it may be ambiguous with other numerical classes such as BANK_ACCOUNT and CREDIT_CARD

3.4.1.1 (2023/10/04)

What’s new in 3.4.1.1?

  • Model Improvements
    • Improved detection of NUMERICAL_PII and MONEY entities related to cryptocurrency wallet IDs, transaction hashes, and cryptocurrency names / amounts

2.14.6 (2023/10/02)

What’s new in 2.14.6?

  • General Information
    • Please note that this release is for legacy users only and is NOT for users already on V3 of Private AI
  • Model Improvements
    • Improved PCI detection (in particular, CREDIT_CARDs) in French

3.4.1 (2023/09/22)

What’s new in 3.4.1?

  • New Language Support
    • We now provide extended support for Cantonese
  • Model Improvements
    • Improvements to PII detection in Dutch, with particular attention to SSN (Burgerservicenummer / Citizen Service Number and the Belgian NISS) and NUMERICAL_PII such as organization numbers (e.g., Ondernemingsnummer, Identificatienummer) and VAT numbers (e.g., BTW Identificatienummer, BTW Nummer)

3.4.0 (2023/09/15)

What’s new in 3.4.0?

  • New Language Support
    • We now provide Core Support for Dutch and Japanese
    • Extended Support has also been added for Afrikaans
  • General Improvements
    • DICOM file support is now available
    • PNG file support is now available
    • BMP file support is now available
    • XML file support has been improved
    • Audio support has been improved and can now be deployed in a single container
  • Model Improvements
    • Improvements to multilingual PII detection, with a particular focus on PCI entity types, in: French, German, Spanish, and Portuguese
    • Fine-tuning of recently-added classes: NAME_MEDICAL_PROFESSIONAL and ORGANIZATION_MEDICAL_FACILITY

3.3.4 (2023/09/02)

What’s new in 3.3.4?

  • General Improvements
    • Improved OCR support and general performance improvements with PDFs
    • General Office document support improvements
    • webm format support for audio files

3.3.3 (2023/08/15)

What’s new in 3.3.3?

  • General Improvements
    • General performance improvement and reduced memory footprint
    • Various library updates based on security recommendations
    • File processing now supports disabling entities being returned in response
  • New Entity Types
    • NAME_MEDICAL_PROFESSIONAL: detects the names and professional titles of medical professionals such as doctors and nurses (e.g., Dr. Kay Martinez, MD)
    • ORGANIZATION_MEDICAL_FACILITY: detects the names of medical facilities such as hospitals and clinics (e.g., Victoria General Hospital, Union Family Health Clinic)
  • Model Improvements
    • Improved detection of PII in medical records and in .xml processed as plain text
    • Improved detection of ACCOUNT_NUMBER , particularly in French
    • Improved detection of HEALTHCARE_NUMBER in English

3.3.2 (2023/07/12)

What’s new in 3.3.2?

  • General Improvements
    • Significant performance improvement with OCR related tasks
    • Image blurring has improved significantly

3.3.1 (2023/07/12)

What’s new in 3.3.1?

  • General Improvements
    • Various library updates based on security recommendations

3.3.0 (2023/07/12)

What’s new in 3.3.0?

  • General Improvements
    • File redaction for PDFs responds with numbered entities for the entire document rather than per page.
    • PDF and image processing have speed improvements on the GPU container
    • Doc / DocX file processing now returns redacted main file contents in response
    • General updates to libraries based on security recommendations
  • Model Improvements
    • General improvements to PII detection in: English, French, Japanese, Korean, Portuguese, Russian, Tagalog, Ukrainian
    • Improved detection of numerical classes in: English, Korean, Spanish, Russian
    • Improved detection of PHI classes in English
    • Improvements to the ACCOUNT_NUMBER entity in English and Spanish

3.2.1 (2023/06/03)

What’s new in 3.2.1?

  • General Improvements
    • The Re-identification route has been improved to handle additional use cases.
  • New Language Support
    • Extended support has been added for Bambara

3.2.0 (2023/05/25)

What’s new in 3.2.0?

  • New Features
    • Re-identification endpoint now available. This endpoint allows a user to pass previously de-identified text to be re-identified. Further details on how to use this new endpoint can be found on the API Reference
    • You can now configure our solution to redact only entities protected by Japan's Act on the Protection of Personal Information (APPI) or APPI's sensitive personal data designation. See our documentation for details on how to implement and our supported entities list for the entities covered by APPI and APPI_SENSITIVE
  • Model Improvements
    • Improved detection of numerical entity classes in English (e.g., BANK_ACCOUNT, ACCOUNT_NUMBER, CREDIT_CARD, CREDIT_CARD_EXPIRATION)
    • Improved precision in detecting PHI classes in English (e.g., CONDITION, DOSE, DRUG, and MEDICAL_PROCESS)
    • Improved PII & PCI detection in Japanese, Polish, Portuguese, Russian, Spanish, Ukrainian
  • Better Image and PDF Processing (Again!)

    PDF and image processing has once again been improved performance-wise.

  • New File Formats

    The following file formats are now supported in the /process/file/uri and process/file/base64 endpoints:

    • .eml
    • .txt
    • .xls / .xlsx
    • .ppt

3.1.1 (2023/04/18)

What’s new in 3.1.1?

  • New Entity Types
    • ACCOUNT_NUMBER captures the number associated with a client’s account (e.g., Policy No. 10042992, Member ID: HZ-5235-001)
    • DURATION captures mentions of periods of time, specified as a number and a unit of time (e.g., 8 months, 2 years)
  • New Language Support
    • Added Core Support for Mandarin (simplified script)
  • Model Improvements
    • Improved detection of PCI classes in English, including optimization for South African English, Italian, Spanish (in particular: BANK_ACCOUNT, CREDIT_CARD)
    • Improved detection of PHI classes in English
    • Improved detection of PII in English clickstream data sets
    • Improved detection of PII in Mandarin (simplified), Tagalog, French

2.14.5 (2023/04/18)

What’s new in 2.14.5?

  • Model Improvements
    • Improved detection of PCI classes in English, including optimization for South African English, Italian, Spanish (in particular: BANK_ACCOUNT, CREDIT_CARD)
    • Improved detection of PHI classes in English
    • Improved detection of PII in English clickstream data sets
    • Improved detection of PII in Mandarin (simplified), Tagalog, French

3.1.0 (2023/04/03)

What’s new in 3.1.0?

  • New File Formats

    The following file formats are now supported in the /process/file/uri and process/file/base64 endpoints:

    • .doc
    • .docx
    • .xml
    • .json
  • Language Detection

    The /process/text endpoint returns a language_detected attribute which specifies ISO 639-1 language labels in the response. For more information, please have a look at the process text documentation

  • Better Image and PDF Processing

    PDF and image processing has been greatly improved in both accuracy and throughput performance.

  • Model Improvements
    • Improved detection of PCI and other numerical classes in English (in particular: CREDIT_CARD, CREDIT_CARD_EXPIRATION, CVV, HEALTHCARE_NUMBER, VEHICLE_ID)
    • Improved detection of PCI classes in French and Spanish (in particular: BANK_ACCOUNT, CREDIT_CARD, CREDIT_CARD_EXPIRATION, CVV)

3.0.0 (2023/03/12)

We are proud to announce the 3rd major version of Private AI's solution. Note that 3.0 does not maintain backwards compatibility. Instead, Private AI will continue to do 2.X releases with updated models and potential security fixes until 3 months after this release.

What’s new in 3.0?

Starting with 3.0, we will be distributing our container exclusively through the Azure Container Registry. Login credentials and sample commands to download the container image can be found in the customer portal and will look like:

docker login -u INSERT_UNIQUE_CLIENT_ID -p INSERT_UNIQUE_CLIENT_PW crprivateaiprod.azurecr.io
  • Licensing Change

    We have changed our licensing system from an API Key to a license file. In order to run the container with the license file, run the following:

    docker run --rm -v "full path to license.json":/app/license/license.json \
    -p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>

    Once you have the container up and running with the new license file, you can send the container a request like this:

    curl --request POST --url http://localhost:8080/v3/process/text --header 'Content-Type: application/json' \
    --data '{"text": ["Hello John"]}'
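
The same request can be prepared from Python with only the standard library. This is a sketch, not an official client: the URL, header and body mirror the curl command above, and the function name `build_process_text_request` is hypothetical.

```python
import json
import urllib.request


def build_process_text_request(texts, base_url="http://localhost:8080"):
    """Prepare a POST to /v3/process/text; `text` must be a list, even for one string."""
    body = json.dumps({"text": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v3/process/text",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_process_text_request(["Hello John"])

# To send it against a running container, uncomment:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```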
  • New API Interface

    3.0 introduces many changes to the API; please see the new API Reference for details. Key changes:

    • deidentify_text is now called /v3/process/text
    • Endpoints in general now follow the standard of process/type/subtype
    • text field is required to be a list by default, even with a single string
    • key field has been removed from the body and is now in the request header: X-API-KEY. It is only required when using our cloud API
    • accuracy_mode is now called accuracy and can be found one layer down in the entity_detection dictionary settings
    • return_entities parameter allows you to configure whether to include identified entities in the response
    • unique_pii_markers has been removed. Instead, please set pattern inside the marker parameters to BEST_ENTITY_TYPE
    • Entity is established in nomenclature to recognize PII, PHI, PCI

    Example conversions from V2 request payload to 3.0:

    ###
    Example with enabled_classes
    ###
    2.0:
    {"text": "Hello there John!", 
    "key":<My_api_key>, 
    "accuracy_mode":"high", 
    "enabled_classes":["NAME"]
    }
    
    3.0:
    {"text": ["Hello there John! I live in Newark"],
    "entity_detection":
     {"accuracy": "high",
      "entity_types": [{"type": "ENABLE", "value":["NAME"]}]
    }
    }
    
    -----------------------------------------------------------------------------------------
    ###
    Example with inclusion of all entity types in entity marker
    ###
    2.x:
    {"text": "Hello there Pieter!", 
    "key":<My_api_key>,
    "accuracy_mode":"standard",
    "marker_format": "[ALL_CLASS_NAMES]"
    }
    
    3.0:
    {"text": ["Hello there Pieter!"], 
    "entity_detection": {"accuracy": "standard"},
    "processed_text": {"type": "MARKER", "pattern": "[ALL_ENTITY_TYPES]"}
    }
    
    -----------------------------------------------------------------------------------------
    ###
    Example with disabling unique_pii_markers through MARKER definition 
    ###
    2.0:
    {"text": "Hello there Paul!", 
    "key":<My_api_key>,
    "accuracy_mode":"high_multilingual",
    "unique_pii_markers": false
    }
    
    3.0:
    {"text": ["Hello there Paul!"], 
    "entity_detection": {"accuracy": "high_multilingual"},
    "processed_text": {"type": "MARKER", "pattern": "[BEST_ENTITY_TYPE]"}
    }
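
The conversions above can be expressed as a small migration helper. The sketch below covers only the fields shown in these three examples; `convert_v2_payload` is a hypothetical name, and any other 2.x field or `marker_format` value would need to be mapped by hand using the API Reference.

```python
def convert_v2_payload(v2: dict) -> tuple[dict, dict]:
    """Convert a 2.x deidentify_text body into (3.0 body, 3.0 request headers)."""
    text = v2["text"]
    # 3.0 requires a list, even for a single string.
    v3 = {"text": [text] if isinstance(text, str) else list(text)}

    entity_detection = {}
    if "accuracy_mode" in v2:
        entity_detection["accuracy"] = v2["accuracy_mode"]
    if "enabled_classes" in v2:
        entity_detection["entity_types"] = [
            {"type": "ENABLE", "value": list(v2["enabled_classes"])}
        ]
    if entity_detection:
        v3["entity_detection"] = entity_detection

    # Only the two marker cases from the examples are handled here.
    pattern = None
    if v2.get("marker_format") == "[ALL_CLASS_NAMES]":
        pattern = "[ALL_ENTITY_TYPES]"
    elif v2.get("unique_pii_markers") is False:
        pattern = "[BEST_ENTITY_TYPE]"
    if pattern:
        v3["processed_text"] = {"type": "MARKER", "pattern": pattern}

    # The API key moves from the body to the X-API-KEY header.
    headers = {"X-API-KEY": v2["key"]} if "key" in v2 else {}
    return v3, headers


body, headers = convert_v2_payload(
    {"text": "Hello there Paul!", "key": "my-api-key",
     "accuracy_mode": "high_multilingual", "unique_pii_markers": False}
)
print(body)
print(headers)
```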
  • File Support for Audio / PDFs / Images

    3.0 supports file redaction using a unified endpoint, which works either with URIs or base64-encoded files: /v3/process/files/uri and /v3/process/files/base64. Please see the Quickstart Guide for details.

  • Application version endpoint

    Sending a GET request to the container root endpoint http://container-address:8080 will return a response providing information about the application version:

    {"app_version": "3.0.0"}
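
One use for this endpoint is checking that a deployment is at or above a given release, for example 3.8.2, which contains the PDF invisible text layer fix. The sketch below parses the response body shown above; it assumes versions are purely numeric and dot-separated, and the function names are hypothetical.

```python
import json


def parse_app_version(response_body: str) -> tuple[int, ...]:
    """Turn '{"app_version": "3.0.0"}' into a comparable tuple such as (3, 0, 0)."""
    version = json.loads(response_body)["app_version"]
    return tuple(int(part) for part in version.split("."))


def is_at_least(response_body: str, minimum: str) -> bool:
    """Check whether the reported version is the minimum release or newer."""
    required = tuple(int(part) for part in minimum.split("."))
    return parse_app_version(response_body) >= required


print(is_at_least('{"app_version": "3.0.0"}', "3.8.2"))
```

Comparing tuples of integers rather than strings keeps versions such as 3.10.0 ordered after 3.8.2.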
  • Synthetic Entity Generation

    Synthetic entity generation is now supported across each language Private AI supports.

    Quality of generated entities has been improved, particularly around matching the formatting and length of the original entity.

  • Environment Variables

    All previous environment variables are now prefixed with “PAI” to better differentiate PAI specific variables. You can find the full list of environment variables in Environment Variables.

  • PII Metrics

    In 3.0, non-airgapped users can enable PII metrics gathering for reporting purposes. In order to do this, add PAI_ENABLE_PII_COUNT_METERING=True as an environment variable. You'll be able to see the number of PII captured by your license usage and we will be further improving this feature to provide you with a granular view on entity types captured and other reporting features.

    Please note that this feature is OFF by default and requires explicit configuration to gather this data. Any usage prior to enabling this feature is NOT captured and cannot be reported on retroactively.
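
    As a sketch, the environment variable can be added to the `docker run` command from the Licensing Change section. The helper below only builds the argument list; the license path and image tag in the usage line are placeholders, the `-it` flag from the original command is left out so the command also works without a terminal, and the function name is hypothetical.

```python
import shlex


def docker_run_command(license_path: str, version: str,
                       enable_pii_metering: bool = False) -> list[str]:
    """Build the docker run argument list, optionally enabling PII metering."""
    command = [
        "docker", "run", "--rm",
        "-v", f"{license_path}:/app/license/license.json",
        "-p", "8080:8080",
    ]
    if enable_pii_metering:
        # Metering is off by default and must be enabled explicitly.
        command += ["-e", "PAI_ENABLE_PII_COUNT_METERING=True"]
    command.append(f"crprivateaiprod.azurecr.io/deid:{version}")
    return command


# Placeholder path and tag; substitute the values from the customer portal.
print(shlex.join(docker_run_command("/path/to/license.json", "3.0.0", enable_pii_metering=True)))
```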

2.14.3 (2023/03/07)

  • Improvements to numerical entity detection and classification, specifically: NUMERICAL_PII, BANK_ACCOUNT, PHONE_NUMBER, CREDIT_CARD, CREDIT_CARD_EXPIRATION and CVV.
  • Improvements to PII detection within ASR transcripts, including variable casing (lower/upper/sentence case) for named entities.
  • Improvements to ORGANIZATION detection.
  • Better recognition of emergency phone numbers.
  • GPU container image size has been reduced.

2.14.2 (2023/01/17)

  • Improvements to PHONE_NUMBER detection, particularly in ASR transcripts in which entities may have unusual formatting.
  • Improvements to CREDIT_CARD detection in ASR transcripts, which may contain spelling and formatting anomalies.
  • Optimizations for detecting PII entities in HR documents, such as CVs and resumes.
  • General improvements to PII detection in Spanish text.
  • Resolved an issue where redaction markers in previously redacted data were sometimes captured as PII.
  • The trailing period in company names such as ACME Co. is now included in the entity.

2.14.1 (2022/11/30)

  • Improved PCI detection in French and Spanish
  • / , \ and $ characters are no longer stripped from entities. For example, Visit us at facebook.com/user123/ is now redacted as Visit us at [URL_1] instead of Visit us at [URL_1]/ .
  • Tuned RAM check thresholds for machines with 8GB RAM.
  • Language Support: Added Extended Support for Japanese.

2.14.0 (2022/11/11)

What’s new in 2.14.0?

  • New Language Support

    The following languages have been added to Extended support:

    • Luxembourgish
    • Swahili
  • Entity Types
    • NAME_GIVEN, which encompasses name(s) given to an individual, usually at birth, often first/middle names in Western cultures, middle/last names in Eastern cultures.
    • NAME_FAMILY, which encompasses names indicating a person’s family or community, often a last name in Western cultures, first name in Eastern cultures.
    • MEDICAL_MISC entity type has been deleted.

Improvements

  • Improved Models
    • Improved detection of names spelled out in all caps by ASR systems.
    • NAME: Improved name subclass detection / classification in English.
    • EMAIL_ADDRESS: More robustness around partial / unformatted emails in English.
    • CREDIT_CARD: Improvement around mentions of the last 4 digits only in English.
    • Enhanced detection of NAMEs and other entities when spelled-out in a transcript (e.g., “c as in charlie …”)
    • Improvements to detection of PASSWORD, including verification answers
    • Improved handling of eponymous medical conditions in English.
    • Improvements to PHI detection in English.
    • Improvements to PHI detection in Spanish.
    • Improvements to all personal number classes such as PASSPORT, CREDIT_CARD and SSN, including international variants in French, German, Italian, Tagalog and Ukrainian.
    • Improved PII detection in text containing facerolls and typos.
    • Improvements to PII detection in Tagalog data containing profanities / toxic material.
    • Improved detection of ambiguous LOCATION / ORGANIZATION mentions, as well as ambiguous NAMEs
    • Improved PII detection in text containing control characters
    • General improvements to:
      • Russian
      • Spanish
  • Miscellaneous
    • Container startup memory check is now performed on container start, instead of after loading models
    • Fixed handling of null strings

2.13.1 (2022/09/26)

  • Emoji Improvements

    Processing of non-English text containing emojis has been improved

2.13.0 (2022/09/08)

What’s new in 2.13.0?

  • Second Generation Synthetic PII

    This release features the debut of our second generation synthetic PII system. The system has been rebuilt from the ground up and leverages a new approach developed by Private AI. The new system features the following improvements:

    • Increased PII realism, including greater variety of generated terms and less generation of common terms such as "John" or "Paul".
    • Better generation of numerical PII, particularly around the correct number of digits.

    Note that the CPU containers are now approximately 700MB larger due to this change and that the new synthetic PII system is slower than the first generation. Private AI will be releasing optimizations for both container size and processing time in subsequent releases, along with GPU support.

  • New Language Support

    The following languages have been added to Extended support:

    • Belarusian
    • Icelandic
    • Indonesian
    • Khmer
    • Thai

    We have also added Beta support for Japanese.

  • New Entity Types
    • NAME_GIVEN, which encompasses name(s) given to an individual, usually at birth, often first/middle names in Western cultures, middle/last names in Eastern cultures.
    • NAME_FAMILY, which encompasses names indicating a person’s family or community, often a last name in Western cultures, first name in Eastern cultures.
  • Disable GPU
    • PAI_DISABLE_GPU_CHECK allows users to disable the startup check for GPU on the container and run the GPU container using CPU only.

Improvements

  • Best Label Calculation

    The best label calculation has been updated to prefer the most granular entity type. For example, Hello John will become Hello [NAME_GIVEN] instead of Hello [NAME]. Similarly, I live in Toronto will be I live in [LOCATION_CITY] instead of I live in [LOCATION]. When an entity spans multiple words that have additional, nested labels, the existing behaviour is retained: namely, the most general entity type, covering the entire span, is used. For example, Hello John Doe will be Hello [NAME] and I live in Toronto, Canada will be I live in [LOCATION].
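
    The rule described above can be sketched as: keep only the labels that cover the whole entity span, then prefer the most granular of those. The code below is an illustration under stated assumptions, not the container's implementation. The `Label` structure, the character offsets, and the use of name prefixes (NAME is treated as the parent of NAME_GIVEN) to measure granularity are all hypothetical.

```python
from typing import NamedTuple


class Label(NamedTuple):
    name: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def granularity(name: str, candidates: list[str]) -> int:
    """Count how many candidates are parents of `name` (e.g. NAME for NAME_GIVEN)."""
    return sum(1 for other in candidates if name.startswith(other + "_"))


def best_label(span: tuple[int, int], labels: list[Label]) -> str:
    """Pick the most granular label among those covering the entire span."""
    start, end = span
    covering = [label.name for label in labels
                if label.start <= start and label.end >= end]
    if not covering:
        raise ValueError("no label covers the whole span")
    return max(covering, key=lambda name: granularity(name, covering))


# "Hello John": both labels cover "John", so the granular one is chosen.
print(best_label((6, 10), [Label("NAME", 6, 10), Label("NAME_GIVEN", 6, 10)]))

# "Hello John Doe": only NAME covers the full span, so the general label is kept.
print(best_label((6, 14), [Label("NAME", 6, 14),
                           Label("NAME_GIVEN", 6, 10),
                           Label("NAME_FAMILY", 11, 14)]))
```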

  • Improved Models

    This release features a number of PII detection improvements:

    • Further improvements to the character-level recognition that was introduced in 2.12.
    • False positive reduction for CONDITION, DRUG, MEDICAL_PROCESS in English.
    • CREDIT_CARD, PHONE_NUMBER, EMAIL_ADDRESS, BANK_ACCOUNT, PASSPORT_NUMBER, SSN improvements in Spanish.
    • CONDITION , DRUG , MEDICAL_PROCESS improvements in Spanish.
    • NAME , LOCATION , ORGANIZATION , POLITICAL_AFFILIATION improvements in German, French, Italian and Polish.
    • Improved performance across all entity types in Tagalog.
  • Miscellaneous

    Improved log messages on container startup.

2.12.0 (2022/07/27)

What’s new in 2.12.0?

  • New Inference Pipeline

    This release features the debut of our new inference pipeline. The main feature of the new pipeline is its ability to operate on text that is not whitespace-separated. This brings a number of benefits, including better performance around punctuation and control characters, and enables new languages such as Mandarin (simplified).

  • Prometheus Endpoint

    A Prometheus metrics endpoint is now available at /metrics. See the API reference for details.
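As a rough illustration of consuming such an endpoint, the sketch below parses the Prometheus text exposition format into a dictionary. The metric names in the sample are made up for the example and are not the container's actual metrics. The parser also assumes that samples carry no trailing timestamp.

```python
def parse_metrics(text):
    """Parse Prometheus text exposition lines into {sample_name: value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamp, so the value is the last space-separated field.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical payload, shaped like a response body from /metrics.
SAMPLE = """\
# HELP example_requests_total Hypothetical request counter.
# TYPE example_requests_total counter
example_requests_total{route="/deidentify_text"} 42
example_latency_seconds 0.25
"""

metrics = parse_metrics(SAMPLE)
print(metrics)
```

In practice a Prometheus server would scrape the endpoint directly. A hand-rolled parser like this is mainly useful for quick checks or tests.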

  • New Language Support

    The following languages have been added to Core support:

    • Ukrainian
    • Hindi

    In addition to this, we have added Extended support for the following 5 languages:

    • Estonian
    • Malay
    • Punjabi
    • Tamil
    • Vietnamese

    We have also added Beta support for Mandarin (simplified)

Improvements

  • Improved Models

    This release features a number of PII detection improvements:

    • German NUMERICAL_PII detection has been improved.
    • Improved performance on medical questionnaires and customer onboarding forms.
    • Multilingual chat performance has been improved, particularly in Spanish.
    • Postal address detection performance has been improved for addresses in the United Kingdom, Australia and New Zealand.
    • PASSWORD and CVV detection performance has been improved.
    • PHI Attributes / symptoms detection has been improved.
    • General improvements for EHRs and ASR transcripts.
  • Security Patch

    Several container image dependencies and Python libraries have been updated to address security recommendations.

2.11.1 (2022/06/01)

Improvements

  • Security Patch

    Several libraries received patch updates to address security recommendations and have been included in this release.

  • Improved Models

    Improvements have been made to the detection of medical entities such as CONDITION, INJURY and MEDICAL_MISC.

    Improvements have been made to NUMERICAL_PII, particularly in multilingual models.

  • Container Options

    Allow startup resource check to be disabled.

2.11.0 (2022/05/10)

What’s new in 2.11.0?

  • New language support

    Tagalog has been moved from extended to core support. For the full list, please see the supported languages page.

  • New Entity Types

    VEHICLE_ID has been added in this release. This entity type covers vehicle identification numbers such as license plate numbers, vehicle serial and vehicle identification numbers.

Improvements

  • Model Improvements

    PII detection error has been reduced by approximately 10%, particularly around CREDIT_CARD, CREDIT_CARD_EXPIRATION and CVV. Australian and New Zealand address recognition have also been improved.

    Performance on disfluent ASR transcripts (particularly around passwords), chat logs and medical patient records has been improved.

    CPU model processing speed has increased by approximately 8%, whilst GPU processing speed has been improved by up to 35%, depending on the chosen accuracy mode.

  • Service health monitoring

    The /healthz endpoint is more robust for detecting the overall health of the API service.

  • Improved error messages

    Error messages when either the key or text fields are missing are now more specific.

  • Security updates

    Libraries have been updated based on security recommendations from our regular vulnerability scans.

  • Documentation revamp

    Our public documentation has been updated to include new guides, updated install instructions and sample configurations.

  • Other

    ENABLED_CLASSES can now be set via an environment variable, similar to LOG_LEVEL.

2.10.0 (2022/03/14)

What’s new in 2.10.0?

  • 33 Supported Languages

    Our system can now detect PII in 33 different languages, with more coming soon. For the full list, please see the supported languages page.

  • New Entities

    2.10 includes the following new entities:

    • GENDER_SEXUALITY : Terms indicating gender identity or sexual orientation, including slang terms. E.g.: “female”, “bisexual”, “trans”
    • MARITAL_STATUS : Terms indicating marital status. E.g.: “single”, “common-law”, “ex-wife”, “married”
    • LOCATION_COORDINATE : A subclass of LOCATION. A geographic position referred to using latitude, longitude, and/or elevation coordinates. E.g.: “We’re at: [40.748440 and -73.984559] ”

    The NUMERICAL_PII class now includes MAC addresses and cookie IDs.

    These entities will be listed in the docs shortly.

  • Complete HIPAA Support

    With this release we have complete support for all the entities listed under the HIPAA Safe Harbor rule. Health plan beneficiary numbers and medical record numbers have been added to HEALTHCARE_NUMBER, whilst medical device serial numbers have been added to NUMERICAL_PII.

  • Entity Sets

    enabled_classes now supports entity sets. This way, you can simply include the name of the regulation that you want to comply with, and we will enable the entities that are listed in that regulation for you. The regulations that are implemented in this release are:

    • GDPR
    • CPRA
    • HIPAA
    • Quebec Privacy Act
    • PCI

    Example command:

    curl -X POST localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "Hi Anwar", "key": "<customer key>", "enabled_classes": ["GDPR"]}'

    The docs will be updated to include this functionality and the entities included in each entity set shortly.
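For callers that do not use curl, the same request can be assembled in Python. This is a minimal sketch using only the standard library: the field names `text`, `key` and `enabled_classes` and the `GDPR` set name come from the example command, while `build_request` and `post` are hypothetical helper names.

```python
import json
from urllib import request


def build_request(text, key, enabled_classes):
    """Return the JSON body for a deidentify_text call."""
    return json.dumps(
        {"text": text, "key": key, "enabled_classes": enabled_classes}
    )


def post(url, body):
    """Send the body as JSON and decode the JSON reply (not called below)."""
    req = request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"content-type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())


body = build_request("Hi Anwar", "<customer key>", ["GDPR"])
print(body)
```

With a running container, `post("http://localhost:8080/deidentify_text", body)` would send the same request as the curl command.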

  • Docker image version logging

    We are now logging the version of the docker image in our logs. This allows us to provide better customer support based on the version that appears in the logs.

Improvements

  • Better models

    2.10 features improved PII detection models, particularly around credit card numbers, verification codes, social security numbers, US postal addresses, email addresses in emails and resumes.

  • TIME entity adjustment

    We have adjusted the TIME entity to no longer include ASR transcript timestamps.

  • Better API error messages

    In order to improve error handling and make debugging easier, we have reworked our API error messages to be more detailed and understandable. Error messages (but not potentially sensitive payloads) are now also logged to console.

  • Redaction marker label calculation

    We have improved how the redaction marker that is used in the redacted text is calculated.

  • Resource validation system

    In 2.9 we introduced checks that validate that the container has been provided with enough resources. In this release, we have further expanded and improved these checks to be able to detect memory and GPU resources more accurately.

  • Health check system

    We have improved the health check endpoint in the GPU build to return the health of the GPU inference engine as well.

    We have improved the process monitoring inside the GPU build to eliminate the possibility of having dead containers that are still running.

    We have updated the health check route in the CPU build to be completely asynchronous.

  • RAM usage

    The container printed the RAM usage on every API call. This has now been moved to ‘debug’ log level.

  • Docs

    We have added a new page in our documentation titled “Deployment Considerations”, which aims to help users deploy the docker image in production environments.

    Other notable changes are:

    • Added a new page that lists supported languages
    • Updated the list of supported entities
  • Web Demo

    We have made a small improvement to the UI of the web demo by changing the model options from a drop down list to radio buttons.

    Web demo now has unique PII markers disabled by default. This change will be reflected in the upcoming API refactor.

2.9.1 (2022/02/24)

  • Logging Improvements

    RAM usage is now logged on debug level instead of info

  • Container Health

    healthz route latency improved

    Docker container health check has been implemented, for improved AWS ECS use

2.9.0 (2022/01/18)

What’s new in 2.9.0?

  • New PII Classes

    Passport numbers are now recognized as a separate entity type, PASSPORT_NUMBER instead of NUMERICAL_PII.

    POLITICAL_AFFILIATION has been added and covers terms referring to a political party, movement, or ideology (e.g., Republican, liberal)

    We now support deidentification of IPv6 addresses in addition to IPv4 addresses. Any IPv6 address that is found in the text will be labelled as IP_ADDRESS.

  • Container Startup Resource Validation

    Based on our user feedback, we have implemented a hardware resource validation that runs on container startup. This implementation validates that the container has access to an NVIDIA GPU and/or enough RAM on startup. If the implementation fails to validate these requirements, it prints a helpful and detailed error message (rather than the default “Killed” message printed by Docker) which guides the user on how to solve these resource related issues.

  • Docker Hub Repository

    Starting with release 2.9.0, the container can be pulled from a private Docker Hub repository. Please contact us if you would like to receive the container via this repository, instead of the existing encrypted Docker image export.

Improvements

  • Model Improvements

    This release includes improved models. Improvements include:

    • Better performance on ASR system transcripts, particularly around disfluencies
    • Improved Driver License detection
    • Better performance on SMS message style conversations
  • Improved Documentation

    We have spent some time improving our documentation as well. The noteworthy improvements are:

    • The table of contents is now clearer and easier to navigate.
    • A new detailed introduction page.
    • Detailed installation instructions.
    • Updated API reference.
    • Updated Web Demo to showcase Multilingual PII Redaction and Synthetic Personal Data Generation in addition to English PII Redaction.
  • Fixes

    We fixed an issue where the built-in labels that use regex patterns would override the custom labels defined in block_list.

    We tuned the models to fix an issue where some non-PII words that are following PII words would be labelled as part of the PII word.

    We have removed a warning message that would show up on container startup due to an internal library incorrectly assessing the ML dependencies.

    Synthetic PII generation now works when the custom block_list feature is used.

2.8.0 (2021/12/20)

What’s new in 2.8.0?

  • New Entity Types

    Added DRIVER_LICENSE entity type. Driver's licenses will now be picked up in this class instead of NUMERICAL_PII.

Improvements

  • Improved backup authentication mechanism fail-over logic.
  • Updated API server. This was a dependency and security upgrade.
  • GPU inference server errors now return 500 instead of 503.

Deprecation Notice: We’ve rearranged the plumbing on our authentication system. Releases prior to 2.3.0 will no longer authenticate as of 31st December 2021.

2.7.1 (2021/10/28)

  • Linked Batch Processing

    This release adds the link_batch option. When enabled, batch inputs will be joined together internally in the Private AI inference engine, to share context between the different inputs. This is useful when processing a sequence of short inputs, such as an SMS chat log. Please visit the API reference for implementation details.
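The idea of joining inputs to share context can be illustrated with a small sketch. This is not the engine's internal logic: the newline separator and the helper names `join_batch` and `to_local` are assumptions made for the example. It shows how character positions found in the joined text could be mapped back to the individual inputs.

```python
from bisect import bisect_right


def join_batch(inputs, sep="\n"):
    """Join inputs into one string and record where each input starts."""
    offsets, pos = [], 0
    for text in inputs:
        offsets.append(pos)
        pos += len(text) + len(sep)
    return sep.join(inputs), offsets


def to_local(global_idx, offsets):
    """Map an index in the joined text to (input number, index within that input)."""
    i = bisect_right(offsets, global_idx) - 1
    return i, global_idx - offsets[i]


chat = ["Hi, my name is", "Anwar"]
joined, offsets = join_batch(chat)
start = joined.index("Anwar")
print(to_local(start, offsets))
```

Here the name in the second message sits next to "my name is" in the joined text, which is the kind of context a short SMS-style input would lack on its own.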

2.7.0 (2021/10/29)

What’s new in 2.7.0?

Breaking change: The default accuracy mode has been changed from standard to high.

  • Added the LOG_LEVEL environment variable, which controls logging verbosity. The environment variable can be set to info , warning or error . Default is info .

Improvements

  • Model Improvements

    This release features improved PII detection models:

    • Numerical PII detection has been further refined, particularly around SSNs and credit card numbers
    • Further improvements for chat transcripts
    • Further improvements for OCR documents, particularly receipts
    • Further improvements for JSON files
  • Authentication

    The backup authentication mechanism has been moved to a completely new system, improving redundancy

  • Usage Reporting

    The get_usage route now returns the current month's usage, instead of current week.

2.6.1 (2021/09/27)

  • Improved Models

    This release fixes phone numbers and credit card numbers occasionally being detected as SSNs. Additionally, performance around ASR transcripts and the various ways they transcribe numbers was improved

2.6.0 (2021/09/21)

Improvements

  • Improved Models

    This release features improved PII detection models, particularly surrounding English and Portuguese.

    Optimizations for a number of popular ASR systems have been added in this release. In particular, the optimizations cover how the systems transcribe numbers.

2.5.0 (2021/08/20)

What’s new in 2.5.0?

  • New Entity Types

    The DATE class has been split into DATE and DATE_INTERVAL. DATE_INTERVAL covers broader references such as 'last summer', whilst DATE remains targeted at specific references like '21/8/2019'.

  • Batch Processing

    Support for batch processing has been added. To use batch processing, simply submit a list of text strings:

    curl -X POST http://localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": ["My password is: 4XDX63F8O1", "My password is: 33LMVLLDHNasdfsda"], "key": "<key>"}'

Improvements

  • Multilingual Improvements

    This release features improved PII detection models, particularly surrounding English, Italian and Korean.

  • Image Size

    Container image size has been further reduced.

2.4.0 (2021/07/21)

What’s new in 2.4.0?

  • Custom Redaction Markers

    Added support for custom redaction markers.

  • Allow Lists

    Added support for allow lists - any entities matching entries in the allow list will be discarded.
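The effect of an allow list can be sketched as a filter over detected entities. The matching rule used here (exact, case-insensitive comparison on the entity text) is an assumption made for illustration, as are the `apply_allow_list` helper and the shape of the entity dictionaries.

```python
def apply_allow_list(entities, allow_list):
    """Drop detected entities whose text matches an allow list entry."""
    # Assumption: matching is exact and case-insensitive.
    allowed = {entry.casefold() for entry in allow_list}
    return [e for e in entities if e["text"].casefold() not in allowed]


# Hypothetical detections for "Anwar works at Private AI in Toronto".
detected = [
    {"text": "Anwar", "best_label": "NAME"},
    {"text": "Private AI", "best_label": "ORGANIZATION"},
    {"text": "Toronto", "best_label": "LOCATION"},
]

kept = apply_allow_list(detected, ["private ai"])
print([e["text"] for e in kept])
```

Entities that survive the filter would still be redacted, while the allow-listed organization name would be left in the text.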

  • New Entity Types

    Added new location classes:

    • LOCATION_ADDRESS : A street address, e.g. '48 Bristol Ave, 6157, Perth, Australia'
    • LOCATION_CITY : A city, e.g. 'Perth' or 'Toronto'
    • LOCATION_COUNTRY : A country, e.g. 'Spain'
    • LOCATION_ZIP : A zip or postal code, e.g. '10405'
    • LOCATION_STATE : A reference to a state within a country, e.g. 'California'

    NOTE: These entities are subclasses of LOCATION - the LOCATION label remains unchanged and will appear along with the above entities

Improvements

  • Model Improvements

    This release features improved PII detection and synthetic PII generation models, particularly surrounding Spanish, Italian and Korean.

  • Phone Number Improvements

    Improved phone number post-processing, particularly around bracket handling and '+' in international dialling codes

  • Processing Thread Calculation

    Improved automatic calculation of the number of processing threads to use whilst executing the ML models.

2.3.1 (2021/07/16)

  • CPU Performance Improvement

    Patch release to address CPU utilisation

2.3.0 (2021/06/25)

What’s new in 2.3.0?

  • New Languages

    Added support for Korean

  • New Entity Types

    Added ROUTING_NUMBER, which is a number associated with a bank or financial institution (e.g., 012345678).

    Added BANK_ACCOUNT, which is a bank account or bank card number (e.g., 012345-67).

Improvements

  • Improved Models

    This release features improved PII detection models, trained on ~50% more data than 2.2.0.

    We have improved PHI detection performance. More to come in the next release.

  • Authentication

    This release now authenticates with our revamped authentication system. No changes on the user side are required.

2.2.2 (2021/06/03)

  • New Accuracy Mode

    Added a new accuracy mode that is approximately 4x faster than standard. In order to use this model, please set accuracy_mode to fast.

2.2.1 (2021/05)

  • Improved Models

    Improved SSN detection in ASR transcripts

    Improved PHI detection

2.2.0 (2021/04/29)

What’s new in 2.2.0?

  • Multilingual Support

    This release adds support for Spanish, French, Italian, German and Portuguese. To enable it, please see the API Reference for details

  • Synthetic PII Generation

    Beta release of synthetic PII generation. In addition to identifying and redacting PII, Private AI can now also generate synthetic PII. To try it out, please set fake_entity_accuracy_mode to standard:

    $ curl -X POST http://localhost:8080/deidentify_text -H 'content-type: application/json' -d '{"text": "so, it expires the 21st; and the 3 digits on the back", "fake_entity_accuracy_mode": "standard", "key": "<key>"}'
    {
      "result": "so, it expires the [CREDIT_CARD_EXPIRATION_1]; and the 3 digits on the back",
      "result_fake": "so, it expires the 20th; and the 3 digits on the back",
      "pii": [
        {
          "marker": "CREDIT_CARD_EXPIRATION_1",
          "text": "21st",
          "best_label": "CREDIT_CARD_EXPIRATION",
          "stt_idx": 19,
          "end_idx": 23,
          "labels": {"CREDIT_CARD_EXPIRATION": 0.8895},
          "fake_text": "20th",
          "fake_stt_idx": 19,
          "fake_end_idx": 23
        }
      ],
      "api_calls_used": 1,
      "output_checks_passed": true
    }
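To show how the index fields in this response relate to the input, the sketch below rebuilds the redacted string from the `stt_idx`, `end_idx` and `marker` fields. It assumes the indices are character offsets into the submitted text, with `end_idx` exclusive, and that the input contained the `21st` reported in the `pii` entry. The `redact` helper is hypothetical.

```python
def redact(text, pii):
    """Replace each detected span with its bracketed marker."""
    out, cursor = [], 0
    # Walk the spans left to right, copying the text between them unchanged.
    for ent in sorted(pii, key=lambda e: e["stt_idx"]):
        out.append(text[cursor:ent["stt_idx"]])
        out.append("[" + ent["marker"] + "]")
        cursor = ent["end_idx"]
    out.append(text[cursor:])
    return "".join(out)


text = "so, it expires the 21st; and the 3 digits on the back"
pii = [{"marker": "CREDIT_CARD_EXPIRATION_1", "text": "21st", "stt_idx": 19, "end_idx": 23}]

# The offsets should select exactly the reported entity text.
span = text[pii[0]["stt_idx"]:pii[0]["end_idx"]]
print(span)
print(redact(text, pii))
```

A check like the `span` comparison is a cheap way to confirm that offsets and text agree before using them to highlight or replace entities downstream.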

Improvements

  • Customizable API Port

    API port can now be customized. See the Environment Variables page for details.

    The health check is now served on port 8080, the same port as the main deidentify_text route

  • Revamped API Serving

    The API serving infrastructure has been completely rebuilt

    Shortened authentication request timeout

2.1.3 (2021/04/13)

  • Improved Models

    Improved credit card handling in ASR transcripts

2.1.2 (2021/03/02)

  • Added ZODIAC_SIGN , which covers Zodiac Signs such as "Aries" or "Taurus".
  • This release features improved PII detection, particularly surrounding SSN , DOB and NUMERICAL_PII .
  • Added passport numbers, vehicle license plate numbers and vehicle serial numbers to NUMERICAL_PII .

2.1.1 (2021/02/26)

  • Improved Models

    Further PII detection improvements targeted at numerical entity detection.

2.1.0 (2021/02/18)

Improvements

  • Improved Models

    This release improves PII detection accuracy, via model updates and improved training data.

    Additionally an improvement was made in an edge case where model output is highly ambiguous.

2.0.1 (2021/01/25)

  • Improved Models

    Improved PII detection models.

  • Reduced Image Size

    Further reduced Docker image size.

2.0.0 (2021/01/14)

What’s new in 2.0.0?

  • Revamped API

    The 2.0.0 release features a revamped API interface, based on recent customer feedback

  • New Entity Types

    New entity types:

    • FILENAME : Name of a computer file, e.g., bradtaxreturns.txt, koalabear.jpg
    • ORIGIN : Origin encompasses nationalities, ethnicities, and races. E.g., Canadian, american, caucasian

    Added PHI entity types:

    • BLOOD_TYPE : Blood type, e.g., O-
    • CONDITION : A medical condition. Includes diseases, syndromes, deficits, disorders. E.g., chronic fatigue syndrome, arrhythmia, depression.
    • DRUG : Medical drug, including vitamins and minerals. E.g., Advil, Acetaminophen, Panadol
    • INJURY : Human injury, e.g., I broke my arm, I have a sprained wrist. Includes mutations, miscarriages and dislocations.
    • MEDICAL_PROCESS : Medical process, including treatments, procedures and tests. E.g., ‘heart surgery’, ‘CT scan’.
    • PHYSICAL_ATTRIBUTE : A body attribute, e.g. I’m 190cm tall.
    • STATISTICS : How many people in a specific country have the disease or what percentage of people were cured of a disease, for example. E.g., 20 percent of people have arrhythmia

Improvements

  • New Inference Engine

    New inference engine, which is significantly faster than previous releases

  • Reduced Container Image Size

    Docker image size has been drastically reduced

1.5.1 (2020/12/08)

  • Improved Models

    Improved credit card number and SSN detection in chat logs.

1.5.0 (2020/11/19)

What’s new in 1.5.0?

  • New Accuracy Mode

    The previous standard accuracy model is now fast. In its place, we have introduced a new model that is ~2x slower but has far better performance.

Improvements

  • Improved Models

    Improved model accuracy via additional training data.

  • Runtime Performance Improvements

    Reduced latency by ~15% on fast mode, from 60ms to 52ms on our single core GCP N2 Cascade Lake test instance.

    Dramatically reduced RAM usage for all models.

    Reduced Docker image size.

1.4.2 (2020/11/6)

  • Phone Number Improvements

    Improved support for 7 digit phone numbers

1.4.1 (2020/11/4)

  • SSN Improvements

    Improved SSN detection

1.4.0 (2020/10/23)

What’s new in 1.4.0?

  • New Entity Types

    Added DOB entity type, which covers Date of Birth (e.g., Date of Birth: March 7, 1961)

    Added CVV, which covers credit card verification codes (e.g., CVV: 080)

    Added CREDIT_CARD_EXPIRATION, which is the expiration date of a credit card (e.g., Expires: 2/28)

    Added PASSWORD entity type, which covers account passwords, pins, access keys, or verification answers (e.g., 27%alfalfa, 1234)

Improvements

  • Improved Models

    Adjusted entity types to give better per class accuracy.

    Improved SSN and credit card detection.

  • Health Route

    Added last_auth_call_successful into healthz response.

1.3.2 (2020/10/12)

  • Authentication

    Added backup authentication mechanism.

1.3.1 (2020/10/05)

  • Large Input Handling

    Improved handling of ultra large inputs (>100K words).

1.3.0 (2020/09/25)

What’s new in 1.3.0?

  • New Entity Types

    Added USERNAME entity type, which covers a user name or handle (e.g., privateairocks, @_PrivateAI).

    Added RELIGION entity type, which covers terms indicating religious affiliation (e.g., Hindu).

1.2.0 (2020/08/14)

What’s new in 1.2.0?

  • New Entity Types

    Added AGE entity type, which is a number or phrase associated with an age (e.g., 27)

  • New Accuracy Mode

    Added best accuracy mode. To use it, please set accuracy_mode to best.

1.1.0 (2020/07/05)

What’s new in 1.1.0?

  • Credit Card Number Support

    Added support for credit card numbers

1.0.0 (2020/06/15)

Initial container release

For release notes older than 1.0.0, please contact us.

© Copyright 2024 Private AI.