
        The Unicode cookbook for linguists

        Managing writing systems using orthography profiles

        Author(s)
        Moran, Steven
        Cysouw, Michael
        Collection
        Knowledge Unlatched (KU); Language Science Press 2018-2020
        Number
        103595
        Language
        English
        Abstract
        This text is a practical guide for linguists and programmers who work with data in multilingual computational environments. We introduce the basic concepts needed to understand how writing systems and character encodings function, and how they work together at the intersection between the Unicode Standard and the International Phonetic Alphabet. Although these standards are often met with frustration by users, they nevertheless provide language researchers and programmers with a consistent computational architecture needed to process, publish and analyze lexical data from the world's languages. Thus we bring to light common, but not always transparent, pitfalls which researchers face when working with Unicode and IPA. Having identified and overcome the pitfalls involved in making writing systems and character encodings syntactically and semantically interoperable (to the extent that they can be), we created a suite of open-source Python and R tools to work with languages using orthography profiles that describe author- or document-specific orthographic conventions. In this cookbook we describe a formal specification of orthography profiles and provide recipes using open-source tools to show how users can segment text, analyze it, identify errors, and transform it into different written forms for comparative linguistics research.
        URI
        http://library.oapen.org/handle/20.500.12657/28277
        Keywords
        Linguistics
        DOI
        10.5281/zenodo.1296780
        ISBN
        9783961100903
        OCN
        1076699025
        Publisher
        Language Science Press
        Publisher website
        https://langsci-press.org/
        Publication date and place
        Berlin, 2018-07-11
        Grantor
        • Knowledge Unlatched - 103595 - Language Science Press 2018 - 2020
        Series
        Translation and Multilingual Natural Language Processing
        Rights
        https://creativecommons.org/licenses/by/4.0/legalcode
        • Imported or submitted locally

