Tacotron Analysis - Data Preprocessing

Lazyer 2018. 12. 19. 09:03

Tacotron : https://carpedm20.github.io/tacotron/

python3 -m datasets.son.download

datasets/son/video : video data downloaded in .ts format

datasets/son/assets : data downloaded in text format

python3 -m audio.silence --audio_pattern "./datasets/son/audio/*.wav" --method=pydub

    from pydub import silence

    audio = read_audio(audio_path)
    not_silence_ranges = silence.detect_nonsilent(
        audio, min_silence_len=silence_chunk_len,
        silence_thresh=silence_thresh)

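A quiet stretch is treated as silence only when it is longer than the minimum interval (min_silence_len=400, in milliseconds).

A 100 ms margin (keep_silence=100) of blank audio (0) is added before and after each split audio segment.

The segments are saved as *.wav files in datasets/son/audio.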
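The splitting rule above can be sketched without pydub. This is a minimal illustration, not the repository's code: `merge_and_pad` is a hypothetical helper that assumes its input is a sorted list of non-silent ranges in milliseconds (the shape `silence.detect_nonsilent` returns). It merges ranges separated by a gap shorter than `min_silence_len`, then widens every segment by `keep_silence` on both sides.

```python
def merge_and_pad(nonsilent_ranges, min_silence_len=400, keep_silence=100):
    """Merge non-silent ranges split by short gaps, then pad each segment.

    nonsilent_ranges: sorted list of [start_ms, end_ms] pairs (assumed input).
    """
    if not nonsilent_ranges:
        return []

    edges = [list(nonsilent_ranges[0])]
    for start, end in nonsilent_ranges[1:]:
        gap = start - edges[-1][1]
        if gap < min_silence_len:
            # Gap too short to count as silence: extend the current segment.
            edges[-1][1] = end
        else:
            # Real silence: start a new segment.
            edges.append([start, end])

    # Keep a little context around each segment, never going below 0 ms.
    return [[max(0, s - keep_silence), e + keep_silence] for s, e in edges]


segments = merge_and_pad([[0, 900], [1000, 2000], [2600, 3000]])
print(segments)  # [[0, 2100], [2500, 3100]]
```

The 100 ms gap is absorbed into the first segment, while the 600 ms gap produces a cut.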

python3 -m recognition.google --audio_pattern "./datasets/son/audio/*.*.wav"

    from google.cloud import speech
    from google.cloud.speech import enums, types  # available in pre-2.0 google-cloud-speech

    client = speech.SpeechClient()
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=config.sample_rate,
        language_code='ko-KR')
    response = client.recognize(config, audio)

The split audio files are transcribed to text.

The results are saved as datasets/son/audio/*.txt.

recognition.json is generated.

    {
        "./datasets/son/audio/NB11600312.0001.wav": "뉴스룸 앵커브리핑 시작하겠습니다",
        "./datasets/son/audio/NB11600312.0002.wav": "해외 출장 여성 인턴 0명",
        "./datasets/son/audio/NB11600312.0003.wav": "전임 대통령의 첫 해외 순방 중에 벌어진 참으로 망신 스러웠던 그 사건 이후에",
        "./datasets/son/audio/NB11600312.0004.wav": "청와대가 내놓은 대책은 그로 했음",
        "./datasets/son/audio/NB11600312.0005.wav": "바로 뒤에 진행되는 총리 순방에 동행한 인턴은 전원",
        "./datasets/son/audio/NB11600312.0006.wav": "남성으로 채워졌고",
        "./datasets/son/audio/NB11600312.0007.wav": "종민",
        ...
    }

python3 -m recognition.alignment --recognition_path "./datasets/son/recognition.json" --score_threshold=0.5

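The text produced by the API is compared with the original script text.

Read the script txt file downloaded from JTBC with readline -> scores = SequenceMatcher(None, readline, recognition).ratio() -> if the score is above 0.5, replace the recognized text with the text from the script file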
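The comparison step can be sketched with `difflib` from the standard library. This is a minimal illustration under assumptions, not the repository's implementation: `align_with_script` is a hypothetical helper that scores the recognized text against every script line and returns the best line only when its score is above the threshold.

```python
from difflib import SequenceMatcher


def align_with_script(recognized, script_lines, score_threshold=0.5):
    """Return the best-matching script line, or None if nothing scores above the threshold."""
    best_line, best_score = None, 0.0
    for line in script_lines:
        # ratio() is a similarity in [0, 1]; 1.0 means the strings are identical.
        score = SequenceMatcher(None, line, recognized).ratio()
        if score > best_score:
            best_line, best_score = line, score
    return best_line if best_score > score_threshold else None


script_lines = [
    "뉴스룸의 앵커브리핑을 시작하겠습니다.",
    "전임 대통령의 첫 해외 순방 중에 벌어진 참으로 망신스러웠던 그 사건 이후 청와대가 내놓은 대책은 그러했습니다.",
]

print(align_with_script("뉴스룸 앵커브리핑 시작하겠습니다", script_lines))
print(align_with_script("종민", script_lines))  # no script line is similar enough
```

The first call returns the script version of the sentence, with particles and punctuation restored. The second returns `None`, so the recognized text would be kept.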

alignment.json is generated.

    {
        "./datasets/son/audio/NB11600312.0001.wav": "뉴스룸의 앵커브리핑을 시작하겠습니다.",
        "./datasets/son/audio/NB11600312.0002.wav": [
            "해외 출장 여성 인턴 0명"
        ],
        "./datasets/son/audio/NB11600312.0003.wav": "전임 대통령의 첫 해외 순방 중에 벌어진 참으로 망신스러웠던 그 사건 이후 청와대가 내놓은 대책은 그러했습니다.",
        "./datasets/son/audio/NB11600312.0004.wav": [
            "청와대가 내놓은 대책은 그로 했음"
        ],
        "./datasets/son/audio/NB11600312.0005.wav": [
            "바로 뒤에 진행되는 총리 순방에 동행한 인턴은 전원"
        ],
        "./datasets/son/audio/NB11600312.0006.wav": [
            "남성으로 채워졌고"
        ],
        "./datasets/son/audio/NB11600312.0007.wav": [
            "종민"
        ]
        ...
    }
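In the sample above, entries replaced by script text appear as plain strings, while entries that kept the recognized text appear as one-element lists. Assuming that pattern holds for the whole file (an inference from this sample only), a short sketch can separate the two groups for inspection. `split_by_alignment` is a hypothetical helper, and the JSON below is a shortened copy of the sample.

```python
import json


def split_by_alignment(alignment):
    """Separate entries by value type: str (assumed replaced) vs list (assumed kept)."""
    replaced = {path: text for path, text in alignment.items() if isinstance(text, str)}
    kept = {path: text[0] for path, text in alignment.items()
            if isinstance(text, list) and text}
    return replaced, kept


sample = json.loads("""
{
    "./datasets/son/audio/NB11600312.0001.wav": "뉴스룸의 앵커브리핑을 시작하겠습니다.",
    "./datasets/son/audio/NB11600312.0002.wav": ["해외 출장 여성 인턴 0명"],
    "./datasets/son/audio/NB11600312.0007.wav": ["종민"]
}
""")

replaced, kept = split_by_alignment(sample)
print(len(replaced), len(kept))  # 1 2
```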

Errors like the one above occur because the script is read with readline, compared with the STT output, and substituted whenever the score reaches the score threshold (0.5).

NB11600312.0003.wav ->"전임 대통령의 첫 해외 순방 중에 벌어진 참으로 망신스러웠던 그 사건 이후"

recognition.json ->"전임 대통령의 첫 해외 순방 중에 벌어진 참으로 망신스러웠던 그 사건 이후"

readline ->"전임 대통령의 첫 해외 순방 중에 벌어진 참으로 망신스러웠던 그 사건 이후 청와대가 내놓은 대책은 그러했습니다."

score > threshold(0.5) -> even though the Google Speech transcription was correct, the score causes it to be replaced with the script text.
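The failure can be reproduced with `difflib` directly. `SequenceMatcher.ratio()` returns 2 * M / T, where M is the number of matched characters and T is the combined length of both strings. When the recognized segment is an exact prefix of a script line, every recognized character matches, so a segment of length n against a line of length m scores 2n / (n + m), which exceeds 0.5 as soon as n > m / 3. The sketch below uses the two strings quoted above.

```python
from difflib import SequenceMatcher

recognized = "전임 대통령의 첫 해외 순방 중에 벌어진 참으로 망신스러웠던 그 사건 이후"
script_line = recognized + " 청와대가 내놓은 대책은 그러했습니다."

score = SequenceMatcher(None, script_line, recognized).ratio()

# Every character of the recognized prefix matches, so M = len(recognized).
expected = 2 * len(recognized) / (len(script_line) + len(recognized))

print(abs(score - expected) < 1e-12)  # True
print(score > 0.5)                    # True: the correct transcript gets overwritten
```

The clip only contains the first part of the sentence, but the label becomes the whole script line, so audio and text no longer match.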

python3 -m datasets.generate_data ./datasets/son/alignment.json

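Text files are converted as follows: text -> split into consonants and vowels with jamo -> converted to English -> embedding.

Audio files are loaded, transformed into the following representations, and then saved.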
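The consonant/vowel split can be illustrated with plain Unicode arithmetic. This sketch stands in for the jamo package rather than calling it: precomposed Hangul syllables start at U+AC00 and encode (lead, vowel, tail) as `(lead * 21 + vowel) * 28 + tail`. The integer id table at the end is a hypothetical stand-in for whatever symbol table feeds the embedding layer.

```python
HANGUL_BASE = 0xAC00
N_LEADS, N_VOWELS, N_TAILS = 19, 21, 28


def decompose(text):
    """Split Hangul syllables into lead consonant, vowel, and optional tail consonant."""
    out = []
    for ch in text:
        code = ord(ch) - HANGUL_BASE
        if 0 <= code < N_LEADS * N_VOWELS * N_TAILS:
            lead, rest = divmod(code, N_VOWELS * N_TAILS)
            vowel, tail = divmod(rest, N_TAILS)
            out.append(chr(0x1100 + lead))
            out.append(chr(0x1161 + vowel))
            if tail:  # tail index 0 means "no final consonant"
                out.append(chr(0x11A7 + tail))
        else:
            out.append(ch)  # spaces, digits, punctuation pass through unchanged
    return out


def build_symbol_table(sequences):
    """Hypothetical id table: 0 is reserved for padding (an assumption)."""
    symbols = sorted({s for seq in sequences for s in seq})
    return {s: i + 1 for i, s in enumerate(symbols)}


tokens = decompose("해외 출장")
table = build_symbol_table([tokens])
ids = [table[t] for t in tokens]
print(tokens)
print(ids)
```

Each id would then index a row of the embedding matrix.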

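    import librosa
    import numpy as np
    from scipy import signal
    from hparams import hparams  # the project's hyperparameter object (import path assumed)

    # _amp_to_db, _linear_to_mel, _normalize and _stft_parameters are defined elsewhere in the module.

    def spectrogram(y):
        D = _stft(_preemphasis(y))
        S = _amp_to_db(np.abs(D)) - hparams.ref_level_db
        return _normalize(S)

    def melspectrogram(y):
        D = _stft(_preemphasis(y))
        S = _amp_to_db(_linear_to_mel(np.abs(D)))
        return _normalize(S)

    def _preemphasis(x):
        return signal.lfilter([1, -hparams.preemphasis], [1], x)

    def _stft(y):
        n_fft, hop_length, win_length = _stft_parameters()
        # n_fft=2048, hop_length=300, win_length=1200
        return librosa.stft(y=y, n_fft=n_fft, hop_length=hop_length,
                            win_length=win_length)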

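Digital filter

A filter is a system that modifies an input signal and emits the result as an output signal. The term usually refers to something that passes only certain components in the frequency domain or removes certain components. Typical applications are removing noise from a signal and isolating particular frequency components.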

lfilter

signal.lfilter(b, a, data)

b : numerator coefficients, a : denominator coefficients
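With numerator `[1, -coef]` and denominator `[1]`, as in `_preemphasis` above, the filter reduces to the difference equation y[n] = x[n] - coef * x[n-1]. The sketch below writes that out with NumPy only, to show what the `lfilter` call computes with zero initial state. The value 0.97 is an assumed example; the post does not state `hparams.preemphasis`.

```python
import numpy as np


def preemphasis(x, coef=0.97):
    """y[n] = x[n] - coef * x[n-1], i.e. lfilter([1, -coef], [1], x) with zero initial state."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= coef * x[:-1]
    return y


slow = preemphasis([1.0, 1.0, 1.0, 1.0])    # constant signal (lowest frequency)
fast = preemphasis([1.0, -1.0, 1.0, -1.0])  # alternating signal (highest frequency)
print(slow)  # after the first sample, values shrink to 0.03
print(fast)  # after the first sample, magnitudes grow to 1.97
```

Low-frequency content is attenuated and high-frequency content is boosted, which is the point of pre-emphasis before the STFT.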

STFT(Short Time Fourier Transform)

D = librosa.stft(y=y, n_fft=2048, hop_length=300, win_length=1200)

hop_length : number of audio samples between adjacent STFT columns

win_length : length of the window applied to each frame; the windowed frame is zero-padded to n_fft

n_fft : FFT window size (how many samples go into each FFT)
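How the three parameters interact can be seen in a small NumPy-only STFT. This is a simplified sketch, not librosa's implementation: it assumes centered frames with zero padding at the edges (librosa's padding mode may differ) and a periodic Hann window. The 24000 Hz sample rate in the example is also an assumption.

```python
import numpy as np


def stft_sketch(y, n_fft=2048, hop_length=300, win_length=1200):
    """Return a complex array of shape (1 + n_fft // 2, n_frames)."""
    # Center the frames by padding n_fft // 2 samples on each side (zeros: an assumption).
    y = np.pad(np.asarray(y, dtype=float), n_fft // 2, mode="constant")

    # Periodic Hann window of win_length samples, zero-padded symmetrically to n_fft.
    window = np.hanning(win_length + 1)[:-1]
    left = (n_fft - win_length) // 2
    window = np.pad(window, (left, n_fft - win_length - left))

    # One frame every hop_length samples.
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack(
        [y[i * hop_length:i * hop_length + n_fft] * window for i in range(n_frames)],
        axis=1,
    )
    # rfft keeps only the non-negative frequencies: n_fft // 2 + 1 bins.
    return np.fft.rfft(frames, axis=0)


sr = 24000                                   # assumed sample rate
t = np.arange(106500) / sr                   # about 4.4 seconds of audio
D = stft_sketch(np.sin(2 * np.pi * 1000 * t))
print(D.shape)  # (1025, 356)
```

The row count depends only on n_fft, and the column count only on the signal length and hop_length.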

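(After the STFT, a log transform is applied, based on the property that hearing is sensitive to low frequencies and less sensitive to high frequencies.)

(If lfilter is not applied, many dark regions appear in the spectrogram.)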

output shape : (1025, 356), where 1025 = n_fft / 2 + 1 frequency bins for n_fft=2048

output shape : (80, 356), since hparams.num_mels=80

MFCC

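Multiplying by mel_filter applies a different filter to each frequency region. The filters get wider toward higher frequencies. Each filter sets every value outside its own band of the spectrum to 0. Humans do not perceive sound on a linear scale; the purpose of applying the mel scale is to describe that perception on a linear scale.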
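A triangular mel filterbank can be sketched in a few lines of NumPy. This is illustrative only: it uses the HTK-style formula mel = 2595 * log10(1 + f / 700), which may not match the variant the library uses by default, and the 24000 Hz sample rate is an assumption. Filter edges are spaced evenly on the mel scale, so in Hz the bands widen as frequency rises.

```python
import numpy as np


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr=24000, n_fft=2048, n_mels=80):
    """Return (bank, edges): bank has shape (n_mels, 1 + n_fft // 2)."""
    fft_freqs = np.linspace(0.0, sr / 2, 1 + n_fft // 2)
    # n_mels + 2 edge frequencies, equally spaced in mel, converted back to Hz.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))

    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        low, center, high = edges[i], edges[i + 1], edges[i + 2]
        rising = (fft_freqs - low) / (center - low)
        falling = (high - fft_freqs) / (high - center)
        # Triangle inside [low, high]; exactly 0 everywhere outside the band.
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank, edges


bank, edges = mel_filterbank()
magnitudes = np.ones((1025, 356))   # placeholder for np.abs(D)
mel = bank @ magnitudes
print(bank.shape, mel.shape)        # (80, 1025) (80, 356)
```

Multiplying the (80, 1025) bank by a (1025, 356) magnitude spectrogram gives the (80, 356) mel spectrogram shape reported earlier.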